Incident Management

Incident detection, response procedures, and resolution tracking

Incident Detection & Alerting

Monitoring Systems

  • Grafana: Real-time monitoring dashboards with automated alerts to Slack
  • VictoriaMetrics: Metrics collection and threshold-based alerting
  • OpenTelemetry: Application performance monitoring and error tracking
  • Uptime Kuma: Service availability monitoring and uptime tracking
  • Slack Integration: All monitoring alerts received via Slack channels
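
All of these routes terminate in Slack. The snippet below is a minimal sketch of relaying an alert to a channel through a Slack incoming webhook; the webhook URL and message format are placeholders, not our actual configuration:

```python
import json
import urllib.request

# Placeholder webhook URL; real URLs are issued per channel from Slack's app settings.
SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/T000/B000/XXXX"

def post_alert(source: str, summary: str, severity: str) -> None:
    """Relay a monitoring alert into the incident Slack channel via an incoming webhook."""
    payload = {"text": f":rotating_light: [{severity.upper()}] {source}: {summary}"}
    req = urllib.request.Request(
        SLACK_WEBHOOK_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        resp.read()  # Slack responds with the plain text "ok" on success

# Example: a VictoriaMetrics threshold breach forwarded to Slack
# post_alert("VictoriaMetrics", "node-3 disk usage above 90% for 10m", "high")
```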

Alert Sources

  • Application performance degradation
  • Infrastructure resource thresholds (CPU, memory, disk)
  • Service availability and health checks
  • Security anomalies and authentication failures
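
Resource alerts are threshold-based; a simplified sketch of that evaluation is shown below, with illustrative thresholds rather than our production values:

```python
# Illustrative resource thresholds; production values live in the alerting rules, not here.
THRESHOLDS = {"cpu_percent": 85.0, "memory_percent": 90.0, "disk_percent": 90.0}

def evaluate_resources(metrics: dict[str, float]) -> list[str]:
    """Return a human-readable alert line for every metric at or over its threshold."""
    alerts = []
    for name, limit in THRESHOLDS.items():
        value = metrics.get(name)
        if value is not None and value >= limit:
            alerts.append(f"{name} at {value:.1f}% (threshold {limit:.0f}%)")
    return alerts

# Example: evaluate_resources({"cpu_percent": 72.0, "disk_percent": 93.5})
# -> ["disk_percent at 93.5% (threshold 90%)"]
```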

Incident Classification

Severity Levels

  • Critical: Complete service outage or security breach
  • High: Partial service degradation affecting multiple users
  • Medium: Limited functionality impact or performance issues
  • Low: Minor issues with workarounds available
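
These definitions can be encoded directly so every incident is opened with a consistent label. A minimal sketch mirroring the list above:

```python
from enum import Enum

class Severity(Enum):
    """Incident severity levels, mirroring the definitions above."""
    CRITICAL = "Complete service outage or security breach"
    HIGH = "Partial service degradation affecting multiple users"
    MEDIUM = "Limited functionality impact or performance issues"
    LOW = "Minor issues with workarounds available"

# Example: tagging a new incident at triage time
# incident = {"title": "Login latency spike", "severity": Severity.HIGH}
```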

Response Procedures

Immediate Response

  1. Detection: Automated alerts or manual reporting
  2. Assignment: Engineering team lead assigns incident owner
  3. Assessment: Incident owner evaluates impact and severity
  4. Communication: Notify stakeholders via Slack

Escalation Process

  • Level 1: Engineering team response and initial resolution attempt
  • Level 2: CTO involvement if engineering team cannot resolve
  • Level 3: CEO notification for business-critical incidents as needed

Detailed Escalation Matrix

This section defines the specific escalation paths for various types of incidents that may arise during operations or partner interactions.

Technical Incidents

System Alerts & Monitoring Issues

  • L1: Slack alerts → DevOps Engineers (initial response and triage)
  • L2: DevOps Engineers → CTO (technical resolution and architectural decisions)
  • L3: CTO → Product Manager (business impact assessment and stakeholder communication)

Partner-Reported Technical Issues

  • Freshdesk/Freshchat: Partner → Wealthy Support (AI-assisted resolution) → Product Manager → CTO
  • Direct PST Contact: Partner → PST → Wealthy Support → Product Manager → CTO
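
The technical-incident chains above can also be held as simple ordered lists so that tooling (paging, ticket routing) knows who is next. A minimal sketch; the dictionary keys are illustrative identifiers, while the role names come from the matrix above:

```python
# Ordered escalation chains for technical incidents, mirroring the matrix above.
ESCALATION_CHAINS = {
    "system_alert": ["DevOps Engineers", "CTO", "Product Manager"],
    "partner_reported_freshdesk": ["Wealthy Support", "Product Manager", "CTO"],
    "partner_reported_pst": ["PST", "Wealthy Support", "Product Manager", "CTO"],
}

def next_escalation(incident_type: str, current_level: int) -> str | None:
    """Return the next role in the chain, or None once the chain is exhausted."""
    chain = ESCALATION_CHAINS.get(incident_type, [])
    return chain[current_level + 1] if current_level + 1 < len(chain) else None
```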

Partner Communication Channels

Communication Methods

  • Freshchat: Real-time chat with Wealthy Support team
  • Freshdesk: Tickets handled by Wealthy Support (partners or PST can create)
  • Direct PST Contact: Phone calls to assigned PST member (established relationship)

Flow

  • Freshchat/Freshdesk → Wealthy Support (AI-assisted resolution)
  • PST Contact → PST creates ticket → Wealthy Support
  • All incidents tracked in Freshdesk system
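
Because every incident ends up in Freshdesk, reports that arrive through a PST call still get a ticket created for them. Below is a hedged sketch of creating that ticket through Freshdesk's public v2 ticket API; the domain, API key, and field values are placeholders:

```python
import base64
import json
import urllib.request

FRESHDESK_DOMAIN = "example.freshdesk.com"   # placeholder domain
FRESHDESK_API_KEY = "YOUR_API_KEY"           # placeholder key

def create_incident_ticket(subject: str, description: str, requester_email: str, priority: int = 3) -> dict:
    """Create a Freshdesk ticket (priority 1=low .. 4=urgent; status 2=open)."""
    url = f"https://{FRESHDESK_DOMAIN}/api/v2/tickets"
    body = {
        "subject": subject,
        "description": description,
        "email": requester_email,
        "priority": priority,
        "status": 2,
    }
    auth = base64.b64encode(f"{FRESHDESK_API_KEY}:X".encode()).decode()
    req = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "Authorization": f"Basic {auth}"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.loads(resp.read())
```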

Security Incidents

System-Generated Security Alerts

  • Immediate: Slack alerts → CTO (direct escalation)
  • Critical: Immediate escalation with no intermediate steps for severe security incidents

Partner-Reported Security Issues

  • Freshdesk/Freshchat: Partner → Wealthy Support (AI-assisted resolution) → CTO
  • Direct PST Contact: Partner → PST → Wealthy Support → CTO

CERT-In Reporting Requirements

For cybersecurity incidents that fall under CERT-In reportable categories, additional regulatory reporting is mandatory:

Timeline: All reportable cybersecurity incidents must be reported to CERT-In within 6 hours of detection.

Process:

  1. Detection: Security incident identified via monitoring or reporting
  2. Notification: Immediate alert to security@wealthy.in
  3. Assessment: SRE team assesses whether the incident requires CERT-In reporting (within 2 hours)
  4. CERT-In Report: If reportable, submit incident report to CERT-In (within 6 hours total)
  5. Escalation: Notify CTO and Broking Head for critical incidents in parallel
  6. Follow-up: Provide updates to CERT-In as incident progresses
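
Both the 2-hour assessment window and the 6-hour reporting window run from the moment of detection, so it helps to compute the two deadlines as soon as the incident is logged. A minimal sketch of that arithmetic:

```python
from datetime import datetime, timedelta, timezone

ASSESSMENT_WINDOW = timedelta(hours=2)   # SRE decides whether the incident is CERT-In reportable
REPORTING_WINDOW = timedelta(hours=6)    # CERT-In report must be submitted within 6 hours of detection

def cert_in_deadlines(detected_at: datetime) -> dict[str, datetime]:
    """Return the assessment and reporting deadlines for a security incident."""
    return {
        "assessment_due": detected_at + ASSESSMENT_WINDOW,
        "cert_in_report_due": detected_at + REPORTING_WINDOW,
    }

# Example:
# cert_in_deadlines(datetime(2025, 9, 14, 10, 30, tzinfo=timezone.utc))
# -> assessment due 12:30 UTC, CERT-In report due 16:30 UTC
```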

CERT-In Contact:

Reportable Incidents include, but are not limited to:

  • Data breaches and unauthorized access
  • Malware/ransomware attacks
  • Website/application defacement
  • DDoS attacks
  • Phishing attacks
  • Identity theft
  • Critical system compromises
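
A small helper mirroring the category list above can make the step-3 assessment explicit; the set below is this page's non-exhaustive list (the category identifiers are illustrative), so anything not matched still needs manual review:

```python
# Non-exhaustive set of CERT-In reportable categories, mirroring the list above.
CERT_IN_REPORTABLE = {
    "data_breach",
    "unauthorized_access",
    "malware",
    "ransomware",
    "defacement",
    "ddos",
    "phishing",
    "identity_theft",
    "critical_system_compromise",
}

def is_cert_in_reportable(category: str) -> bool:
    """True if the category is on the known reportable list; unknown categories need manual review."""
    return category.lower() in CERT_IN_REPORTABLE
```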

For complete CERT-In compliance documentation, see CERT-In Compliance.

Development & Product Issues

Development Blockers

  • Immediate: Engineering Teams → Product Manager (blocker resolution within sprint cycle)
  • Complex: Product Manager → CTO (technical architecture or resource decisions)
  • Strategic: CTO → CEO collaboration (major strategic or business alignment decisions)

Bug Reports & Issues

  • L1: Engineering Teams (initial assessment and fix attempt)
  • L2: Product Manager (prioritization and resource allocation)
  • L3: CTO (complex technical issues or system-wide impacts)

Emergency Escalation Procedures

Critical Incidents (System Down, Security Breach, Major Partner Impact)

  • Parallel Escalation: Immediate notification to both Product Manager AND CTO simultaneously
  • Response Time: 15 minutes for critical incidents, 1 hour for high priority
  • Review Cycle: 24-hour stakeholder review for emergency changes
  • Resource Allocation: Immediate resource reallocation from other projects if needed
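
The response-time targets above (15 minutes for critical, 1 hour for high priority) can be checked mechanically when an incident is acknowledged. A minimal sketch; medium and low targets are left undefined because this page does not specify them:

```python
from datetime import datetime, timedelta

# Response-time targets defined on this page; medium/low are not specified here.
RESPONSE_TARGETS = {
    "critical": timedelta(minutes=15),
    "high": timedelta(hours=1),
}

def response_breached(severity: str, detected_at: datetime, acknowledged_at: datetime) -> bool:
    """True if acknowledgement came later than the defined target for this severity."""
    target = RESPONSE_TARGETS.get(severity)
    if target is None:
        return False  # no target defined for this severity on this page
    return (acknowledged_at - detected_at) > target
```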

Escalation Criteria

  • Critical: System outages, security breaches, regulatory compliance issues
  • High: Partner-blocking issues, major feature failures, performance degradation
  • Medium: Standard bugs, feature requests, minor system issues
  • Low: Documentation updates, minor enhancements, optimization requests

Communication Protocols

Internal Communication

  • Slack: Primary incident coordination channel and immediate alerts
  • Plane: Task creation for larger system-level changes and future prevention measures
  • Direct Communication: Phone/video calls for critical incidents requiring immediate attention
  • Documentation: All escalated incidents must be documented in the appropriate tracking system (Freshdesk, Plane)

External Communication

  • Partner Updates: PST provides updates via preferred communication channel
  • Status Pages: System-wide incident communication
  • Stakeholder Notifications: Product Manager coordinates stakeholder communication

Resolution & Documentation

Resolution Tracking

  • Root cause analysis and remediation steps
  • Timeline documentation and lessons learned
  • Post-incident review within 48 hours of resolution

Key Contacts

  • Engineering Team Lead: First responder and incident coordination
  • CTO: Technical escalation and critical incident management
  • Operations Manager: Business impact assessment and stakeholder communication

Root Cause Analyses

Recent RCAs

View our collection of Root Cause Analyses for detailed post-incident documentation and lessons learned.

Date         Incident                   Severity
2025-09-14   GKE Auto-Upgrade Failure   Critical

RCA Process

After every critical or high-severity incident, we conduct a thorough root cause analysis to:

  • Document the complete incident timeline
  • Identify primary and contributing causes
  • Define prevention measures
  • Share lessons learned with the team

All RCAs follow a structured format and are stored in the RCA repository for future reference and continuous improvement.
