Incident Management

Incident detection, response procedures, and resolution tracking

Incident Detection & Alerting

Monitoring Systems

  • Grafana: Real-time monitoring dashboards with automated alerts to Slack
  • VictoriaMetrics: Metrics collection and threshold-based alerting
  • OpenTelemetry: Application performance monitoring and error tracking
  • Uptime Kuma: Service availability monitoring and uptime tracking
  • Slack Integration: All monitoring alerts received via Slack channels
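
All of these routes terminate in Slack. The snippet below is a minimal sketch of relaying an alert to a channel through a Slack incoming webhook; the webhook URL and message format are placeholders, not our actual configuration:

```python
import json
import urllib.request

# Placeholder webhook URL; real URLs are issued per channel from Slack's app settings.
SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/T000/B000/XXXX"

def post_alert(source: str, summary: str, severity: str) -> None:
    """Relay a monitoring alert into the incident Slack channel via an incoming webhook."""
    payload = {"text": f":rotating_light: [{severity.upper()}] {source}: {summary}"}
    req = urllib.request.Request(
        SLACK_WEBHOOK_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        resp.read()  # Slack responds with the plain text "ok" on success

# Example: a VictoriaMetrics threshold breach forwarded to Slack
# post_alert("VictoriaMetrics", "node-3 disk usage above 90% for 10m", "high")
```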

Alert Sources

  • Application performance degradation
  • Infrastructure resource thresholds (CPU, memory, disk)
  • Service availability and health checks
  • Security anomalies and authentication failures
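
Resource alerts are threshold-based; a simplified sketch of that evaluation is shown below, with illustrative thresholds rather than our production values:

```python
# Illustrative resource thresholds; production values live in the alerting rules, not here.
THRESHOLDS = {"cpu_percent": 85.0, "memory_percent": 90.0, "disk_percent": 90.0}

def evaluate_resources(metrics: dict[str, float]) -> list[str]:
    """Return a human-readable alert line for every metric at or over its threshold."""
    alerts = []
    for name, limit in THRESHOLDS.items():
        value = metrics.get(name)
        if value is not None and value >= limit:
            alerts.append(f"{name} at {value:.1f}% (threshold {limit:.0f}%)")
    return alerts

# Example: evaluate_resources({"cpu_percent": 72.0, "disk_percent": 93.5})
# -> ["disk_percent at 93.5% (threshold 90%)"]
```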

Incident Classification

Severity Levels

  • Critical: Complete service outage or security breach
  • High: Partial service degradation affecting multiple users
  • Medium: Limited functionality impact or performance issues
  • Low: Minor issues with workarounds available
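
These definitions can be encoded directly so every incident is opened with a consistent label. A minimal sketch mirroring the list above:

```python
from enum import Enum

class Severity(Enum):
    """Incident severity levels, mirroring the definitions above."""
    CRITICAL = "Complete service outage or security breach"
    HIGH = "Partial service degradation affecting multiple users"
    MEDIUM = "Limited functionality impact or performance issues"
    LOW = "Minor issues with workarounds available"

# Example: tagging a new incident at triage time
# incident = {"title": "Login latency spike", "severity": Severity.HIGH}
```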

Response Procedures

Immediate Response

  1. Detection: Automated alerts or manual reporting
  2. Assignment: Engineering team lead assigns incident owner
  3. Assessment: Incident owner evaluates impact and severity
  4. Communication: Notify stakeholders via Slack

Escalation Process

  • Level 1: Engineering team response and initial resolution attempt
  • Level 2: CTO involvement if engineering team cannot resolve
  • Level 3: CEO notification for business-critical incidents as needed

Detailed Escalation Matrix

This section defines the specific escalation paths for various types of incidents that may arise during operations or partner interactions.

Technical Incidents

System Alerts & Monitoring Issues

  • L1: Slack alerts → DevOps Engineers (initial response and triage)
  • L2: DevOps Engineers → CTO (technical resolution and architectural decisions)
  • L3: CTO → Product Manager (business impact assessment and stakeholder communication)

Partner-Reported Technical Issues

  • Freshdesk/Freshchat: Partner → Wealthy Support (AI-assisted resolution) → Product Manager → CTO
  • Direct PST Contact: Partner → PST → Wealthy Support → Product Manager → CTO
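
The technical-incident chains above can also be held as simple ordered lists so that tooling (paging, ticket routing) knows who is next. A minimal sketch; the dictionary keys are illustrative identifiers, while the role names come from the matrix above:

```python
# Ordered escalation chains for technical incidents, mirroring the matrix above.
ESCALATION_CHAINS = {
    "system_alert": ["DevOps Engineers", "CTO", "Product Manager"],
    "partner_reported_freshdesk": ["Wealthy Support", "Product Manager", "CTO"],
    "partner_reported_pst": ["PST", "Wealthy Support", "Product Manager", "CTO"],
}

def next_escalation(incident_type: str, current_level: int) -> str | None:
    """Return the next role in the chain, or None once the chain is exhausted."""
    chain = ESCALATION_CHAINS.get(incident_type, [])
    return chain[current_level + 1] if current_level + 1 < len(chain) else None
```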

Partner Communication Channels

Communication Methods

  • Freshchat: Real-time chat with Wealthy Support team
  • Freshdesk: Tickets handled by Wealthy Support (partners or PST can create)
  • Direct PST Contact: Phone calls to assigned PST member (established relationship)

Flow

  • Freshchat/Freshdesk → Wealthy Support (AI-assisted resolution)
  • PST Contact → PST creates ticket → Wealthy Support
  • All incidents tracked in Freshdesk system
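
Because every incident ends up in Freshdesk, reports that arrive through a PST call still get a ticket created for them. Below is a hedged sketch of creating that ticket through Freshdesk's public v2 ticket API; the domain, API key, and field values are placeholders:

```python
import base64
import json
import urllib.request

FRESHDESK_DOMAIN = "example.freshdesk.com"   # placeholder domain
FRESHDESK_API_KEY = "YOUR_API_KEY"           # placeholder key

def create_incident_ticket(subject: str, description: str, requester_email: str, priority: int = 3) -> dict:
    """Create a Freshdesk ticket (priority 1=low .. 4=urgent; status 2=open)."""
    url = f"https://{FRESHDESK_DOMAIN}/api/v2/tickets"
    body = {
        "subject": subject,
        "description": description,
        "email": requester_email,
        "priority": priority,
        "status": 2,
    }
    auth = base64.b64encode(f"{FRESHDESK_API_KEY}:X".encode()).decode()
    req = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "Authorization": f"Basic {auth}"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.loads(resp.read())
```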

Security Incidents

System-Generated Security Alerts

  • Immediate: Slack alerts → CTO (direct escalation)
  • Critical: Immediate escalation with no intermediate steps for severe security incidents

Partner-Reported Security Issues

  • Freshdesk/Freshchat: Partner → Wealthy Support (AI-assisted resolution) → CTO
  • Direct PST Contact: Partner → PST → Wealthy Support → CTO

CERT-In Reporting Requirements

For cybersecurity incidents that fall under CERT-In reportable categories, additional regulatory reporting is mandatory:

Timeline: All reportable cybersecurity incidents must be reported to CERT-In within 6 hours of detection.

Process:

  1. Detection: Security incident identified via monitoring or reporting
  2. Notification: Immediate alert to security@wealthy.in
  3. Assessment: SRE team assesses whether the incident requires CERT-In reporting (within 2 hours)
  4. CERT-In Report: If reportable, submit incident report to CERT-In (within 6 hours total)
  5. Escalation: Notify CTO and Broking Head for critical incidents in parallel
  6. Follow-up: Provide updates to CERT-In as incident progresses
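
Both the 2-hour assessment window and the 6-hour reporting window run from the moment of detection, so it helps to compute the two deadlines as soon as the incident is logged. A minimal sketch of that arithmetic:

```python
from datetime import datetime, timedelta, timezone

ASSESSMENT_WINDOW = timedelta(hours=2)   # SRE decides whether the incident is CERT-In reportable
REPORTING_WINDOW = timedelta(hours=6)    # CERT-In report must be submitted within 6 hours of detection

def cert_in_deadlines(detected_at: datetime) -> dict[str, datetime]:
    """Return the assessment and reporting deadlines for a security incident."""
    return {
        "assessment_due": detected_at + ASSESSMENT_WINDOW,
        "cert_in_report_due": detected_at + REPORTING_WINDOW,
    }

# Example:
# cert_in_deadlines(datetime(2025, 9, 14, 10, 30, tzinfo=timezone.utc))
# -> assessment due 12:30 UTC, CERT-In report due 16:30 UTC
```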

CERT-In Contact:

Reportable Incidents include, but are not limited to:

  • Data breaches and unauthorized access
  • Malware/ransomware attacks
  • Website/application defacement
  • DDoS attacks
  • Phishing attacks
  • Identity theft
  • Critical system compromises
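
A small helper mirroring the category list above can make the step-3 assessment explicit; the set below is this page's non-exhaustive list (the category identifiers are illustrative), so anything not matched still needs manual review:

```python
# Non-exhaustive set of CERT-In reportable categories, mirroring the list above.
CERT_IN_REPORTABLE = {
    "data_breach",
    "unauthorized_access",
    "malware",
    "ransomware",
    "defacement",
    "ddos",
    "phishing",
    "identity_theft",
    "critical_system_compromise",
}

def is_cert_in_reportable(category: str) -> bool:
    """True if the category is on the known reportable list; unknown categories need manual review."""
    return category.lower() in CERT_IN_REPORTABLE
```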

For complete CERT-In compliance documentation, see CERT-In Compliance.

Development & Product Issues

Development Blockers

  • Immediate: Engineering Teams → Product Manager (blocker resolution within sprint cycle)
  • Complex: Product Manager → CTO (technical architecture or resource decisions)
  • Strategic: CTO → CEO collaboration (major strategic or business alignment decisions)

Bug Reports & Issues

  • L1: Engineering Teams (initial assessment and fix attempt)
  • L2: Product Manager (prioritization and resource allocation)
  • L3: CTO (complex technical issues or system-wide impacts)

Emergency Escalation Procedures

Critical Incidents (System Down, Security Breach, Major Partner Impact)

  • Parallel Escalation: Immediate notification to both Product Manager AND CTO simultaneously
  • Response Time: 15 minutes for critical incidents, 1 hour for high priority
  • Review Cycle: 24-hour stakeholder review for emergency changes
  • Resource Allocation: Immediate resource reallocation from other projects if needed
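
The response-time targets above (15 minutes for critical, 1 hour for high priority) can be checked mechanically when an incident is acknowledged. A minimal sketch; medium and low targets are left undefined because this page does not specify them:

```python
from datetime import datetime, timedelta

# Response-time targets defined on this page; medium/low are not specified here.
RESPONSE_TARGETS = {
    "critical": timedelta(minutes=15),
    "high": timedelta(hours=1),
}

def response_breached(severity: str, detected_at: datetime, acknowledged_at: datetime) -> bool:
    """True if acknowledgement came later than the defined target for this severity."""
    target = RESPONSE_TARGETS.get(severity)
    if target is None:
        return False  # no target defined for this severity on this page
    return (acknowledged_at - detected_at) > target
```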

Escalation Criteria

  • Critical: System outages, security breaches, regulatory compliance issues
  • High: Partner-blocking issues, major feature failures, performance degradation
  • Medium: Standard bugs, feature requests, minor system issues
  • Low: Documentation updates, minor enhancements, optimization requests

Communication Protocols

Internal Communication

  • Slack: Primary incident coordination channel and immediate alerts
  • Plane: Task creation for larger system-level changes and future prevention measures
  • Direct Communication: Phone/video calls for critical incidents requiring immediate attention
  • Documentation: All escalated incidents must be documented in the appropriate tracking system (Freshdesk, Plane)

External Communication

  • Partner Updates: PST provides updates via preferred communication channel
  • Status Pages: System-wide incident communication
  • Stakeholder Notifications: Product Manager coordinates stakeholder communication

Resolution & Documentation

Resolution Tracking

  • Root cause analysis and remediation steps
  • Timeline documentation and lessons learned
  • Post-incident review within 48 hours of resolution

Key Contacts

  • Engineering Team Lead: First responder and incident coordination
  • CTO: Technical escalation and critical incident management
  • Operations Manager: Business impact assessment and stakeholder communication

Root Cause Analyses

Recent RCAs

View our collection of Root Cause Analyses for detailed post-incident documentation and lessons learned.

Date         Incident                   Severity
2025-09-14   GKE Auto-Upgrade Failure   Critical

RCA Process

After every critical or high-severity incident, we conduct a thorough root cause analysis to:

  • Document the complete incident timeline
  • Identify primary and contributing causes
  • Define prevention measures
  • Share lessons learned with the team

All RCAs follow a structured format and are stored in the RCA repository for future reference and continuous improvement.
