Incident Management
Incident Detection & Alerting
Monitoring Systems
- Grafana: Real-time monitoring dashboards with automated alerts to Slack
- VictoriaMetrics: Metrics collection and threshold-based alerting
- OpenTelemetry: Application performance monitoring and error tracking
- Uptime Kuma: Service availability monitoring and uptime tracking
- Slack Integration: All monitoring alerts received via Slack channels
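In production these checks run as Grafana or vmalert alerting rules, but the pipeline above can be illustrated end to end. A minimal sketch, assuming a VictoriaMetrics instance on its default port and a Slack incoming webhook; the metric name, threshold, and URLs are illustrative placeholders, not values from this runbook:

```python
"""Sketch of a threshold alert: query VictoriaMetrics' Prometheus-compatible
HTTP API and post a message to a Slack incoming webhook.
Endpoint URLs, metric, and threshold are illustrative assumptions."""
import requests

VM_URL = "http://victoriametrics:8428/api/v1/query"      # assumed address/port
SLACK_WEBHOOK = "https://hooks.slack.com/services/XXX"   # placeholder webhook
QUERY = 'avg(rate(node_cpu_seconds_total{mode!="idle"}[5m])) * 100'
THRESHOLD = 90.0  # example CPU threshold (percent)

def check_and_alert() -> None:
    resp = requests.get(VM_URL, params={"query": QUERY}, timeout=10)
    resp.raise_for_status()
    # Instant-query results arrive as a vector of {"metric": ..., "value": [ts, "v"]}.
    for sample in resp.json()["data"]["result"]:
        _, value = sample["value"]
        if float(value) > THRESHOLD:
            requests.post(
                SLACK_WEBHOOK,
                json={"text": f"CPU at {float(value):.1f}% exceeds {THRESHOLD}%"},
                timeout=10,
            )

if __name__ == "__main__":
    check_and_alert()
```

Defining the same rule in Grafana or vmalert adds deduplication and alert-state tracking that a bare script lacks, which is why those tools own alerting in practice.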
Alert Sources
- Application performance degradation
- Infrastructure resource thresholds (CPU, memory, disk)
- Service availability and health checks
- Security anomalies and authentication failures
Incident Classification
Severity Levels
- Critical: Complete service outage or security breach
- High: Partial service degradation affecting multiple users
- Medium: Limited functionality impact or performance issues
- Low: Minor issues with workarounds available
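These tiers map onto the response targets defined under Emergency Escalation Procedures below (15 minutes for Critical, 1 hour for High). A minimal classification sketch; the Medium and Low targets are illustrative placeholders, not values from this runbook:

```python
"""Severity classification sketch. Critical/High response targets come from
the Emergency Escalation Procedures section; Medium/Low are assumed."""
from datetime import timedelta
from enum import Enum

class Severity(Enum):
    CRITICAL = "critical"  # complete outage or security breach
    HIGH = "high"          # partial degradation affecting multiple users
    MEDIUM = "medium"      # limited functionality or performance issues
    LOW = "low"            # minor issue with a workaround available

# Target time to first response per severity.
RESPONSE_TARGET = {
    Severity.CRITICAL: timedelta(minutes=15),  # from this runbook
    Severity.HIGH: timedelta(hours=1),         # from this runbook
    Severity.MEDIUM: timedelta(hours=4),       # assumed placeholder
    Severity.LOW: timedelta(days=1),           # assumed placeholder
}
```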
Response Procedures
Immediate Response
- Detection: Automated alerts or manual reporting
- Assignment: Engineering team lead assigns incident owner
- Assessment: Incident owner evaluates impact and severity
- Communication: Notify stakeholders via Slack
Escalation Process
- Level 1: Engineering team response and initial resolution attempt
- Level 2: CTO involvement if engineering team cannot resolve
- Level 3: CEO notification for business-critical incidents as needed
Detailed Escalation Matrix
This section defines the specific escalation paths for various types of incidents that may arise during operations or partner interactions.
Technical Incidents
System Alerts & Monitoring Issues
- L1: Slack alerts → DevOps Engineers (initial response and triage)
- L2: DevOps Engineers → CTO (technical resolution and architectural decisions)
- L3: CTO → Product Manager (business impact assessment and stakeholder communication)
Partner-Reported Technical Issues
- Freshdesk/Freshchat: Partner → Wealthy Support (AI-assisted resolution) → Product Manager → CTO
- Direct PST Contact: Partner → PST → Wealthy Support → Product Manager → CTO
Partner Communication Channels
Communication Methods
- Freshchat: Real-time chat with Wealthy Support team
- Freshdesk: Tickets handled by Wealthy Support (partners or PST can create)
- Direct PST Contact: Phone calls to assigned PST member (established relationship)
Flow
- Freshchat/Freshdesk → Wealthy Support (AI-assisted resolution)
- PST Contact → PST creates ticket → Wealthy Support
- All incidents tracked in Freshdesk system
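Because every incident is tracked in Freshdesk, a PST member or an automation can open the ticket through the Freshdesk REST API. A minimal sketch; the domain and API key are placeholders, while the field values follow Freshdesk v2 API conventions (status 2 = Open, priority 4 = Urgent):

```python
"""Sketch: open a Freshdesk ticket for a partner-reported incident.
Domain and API key are placeholders, not real credentials."""
import requests

FRESHDESK_DOMAIN = "example.freshdesk.com"  # placeholder domain
API_KEY = "YOUR_API_KEY"                    # placeholder key

def create_incident_ticket(partner_email: str, subject: str, details: str) -> int:
    resp = requests.post(
        f"https://{FRESHDESK_DOMAIN}/api/v2/tickets",
        auth=(API_KEY, "X"),  # Freshdesk uses the API key as the basic-auth username
        json={
            "email": partner_email,
            "subject": subject,
            "description": details,
            "status": 2,    # Open
            "priority": 4,  # Urgent
            "tags": ["incident", "pst-reported"],
        },
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()["id"]  # ticket ID for tracking
```

The returned ticket ID is what ties later escalation steps and the eventual RCA back to a single tracked record.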
Security Incidents
System-Generated Security Alerts
- Immediate: Slack alerts → CTO (direct escalation)
- Critical: Severe security incidents escalate immediately to the CTO with no intermediate steps
Partner-Reported Security Issues
- Freshdesk/Freshchat: Partner → Wealthy Support (AI-assisted resolution) → CTO
- Direct PST Contact: Partner → PST → Wealthy Support → CTO
CERT-In Reporting Requirements
For cybersecurity incidents that fall under CERT-In reportable categories, additional regulatory reporting is mandatory:
Timeline: All reportable cybersecurity incidents must be reported to CERT-In within 6 hours of detection.
Process:
- Detection: Security incident identified via monitoring or reporting
- Notification: Immediate alert to security@wealthy.in
- Assessment: SRE team assesses whether the incident requires CERT-In reporting (within 2 hours)
- CERT-In Report: If reportable, submit incident report to CERT-In (within 6 hours total)
- Escalation: Notify CTO and Broking Head for critical incidents in parallel
- Follow-up: Provide updates to CERT-In as incident progresses
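Both windows are measured from detection, so it helps to compute the assessment and reporting deadlines the moment an incident is logged. A minimal sketch of that arithmetic:

```python
"""Compute CERT-In deadlines from the detection timestamp:
2 hours for the SRE assessment, 6 hours for the report, per this runbook."""
from datetime import datetime, timedelta, timezone

def certin_deadlines(detected_at: datetime) -> dict:
    return {
        "assessment_due": detected_at + timedelta(hours=2),  # SRE reportability decision
        "report_due": detected_at + timedelta(hours=6),      # CERT-In filing deadline
    }

# Example: incident detected now (UTC).
deadlines = certin_deadlines(datetime.now(timezone.utc))
print(deadlines["assessment_due"].isoformat())
print(deadlines["report_due"].isoformat())
```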
CERT-In Contact:
- Email: security@wealthy.in (actively monitored)
- Website: https://cert-in.org.in/ (for CERT-In reporting contact information)
Reportable incidents include, but are not limited to:
- Data breaches and unauthorized access
- Malware/ransomware attacks
- Website/application defacement
- DDoS attacks
- Phishing attacks
- Identity theft
- Critical system compromises
For complete CERT-In compliance documentation, see CERT-In Compliance.
Development & Product Issues
Development Blockers
- Immediate: Engineering Teams → Product Manager (blocker resolution within sprint cycle)
- Complex: Product Manager → CTO (technical architecture or resource decisions)
- Strategic: CTO → CEO collaboration (major strategic or business alignment decisions)
Bug Reports & Issues
- L1: Engineering Teams (initial assessment and fix attempt)
- L2: Product Manager (prioritization and resource allocation)
- L3: CTO (complex technical issues or system-wide impacts)
Emergency Escalation Procedures
Critical Incidents (System Down, Security Breach, Major Partner Impact)
- Parallel Escalation: Immediate notification to both Product Manager AND CTO simultaneously
- Response Time: 15 minutes for critical incidents, 1 hour for high priority
- Review Cycle: 24-hour stakeholder review for emergency changes
- Resource Allocation: Immediate resource reallocation from other projects if needed
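Parallel escalation means the Product Manager and CTO notifications must not be sequenced behind each other. A minimal sketch using two Slack webhooks fired concurrently; the webhook URLs are placeholders:

```python
"""Sketch: notify the Product Manager and CTO channels in parallel
for a critical incident. Webhook URLs are placeholders."""
from concurrent.futures import ThreadPoolExecutor
import requests

WEBHOOKS = {
    "product-manager": "https://hooks.slack.com/services/PM_PLACEHOLDER",
    "cto": "https://hooks.slack.com/services/CTO_PLACEHOLDER",
}

def notify(role: str, message: str) -> None:
    requests.post(WEBHOOKS[role], json={"text": message}, timeout=10)

def escalate_critical(summary: str) -> None:
    # Fire both notifications simultaneously rather than one after the other.
    with ThreadPoolExecutor(max_workers=2) as pool:
        for role in WEBHOOKS:
            pool.submit(notify, role, f":rotating_light: CRITICAL: {summary}")

escalate_critical("Primary API unreachable; partner trading impacted")
```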
Escalation Criteria
- Critical: System outages, security breaches, regulatory compliance issues
- High: Partner-blocking issues, major feature failures, performance degradation
- Medium: Standard bugs, feature requests, minor system issues
- Low: Documentation updates, minor enhancements, optimization requests
Communication Protocols
Internal Communication
- Slack: Primary incident coordination channel and immediate alerts
- Plane: Task creation for larger system-level changes and future prevention measures
- Direct Communication: Phone/video calls for critical incidents requiring immediate attention
- Documentation: All escalated incidents must be documented in appropriate tracking system (Freshdesk, Plane)
External Communication
- Partner Updates: PST provides updates via the partner's preferred communication channel
- Status Pages: System-wide incident communication
- Stakeholder Notifications: Product Manager coordinates stakeholder communication
Resolution & Documentation
Resolution Tracking
- Root cause analysis and remediation steps
- Timeline documentation and lessons learned
- Post-incident review within 48 hours of resolution
Key Contacts
- Engineering Team Lead: First responder and incident coordination
- CTO: Technical escalation and critical incident management
- Operations Manager: Business impact assessment and stakeholder communication
Root Cause Analyses
Recent RCAs
View our collection of Root Cause Analyses for detailed post-incident documentation and lessons learned.
| Date | Incident | Severity |
|---|---|---|
| 2025-09-14 | GKE Auto-Upgrade Failure | Critical |
RCA Process
After every critical or high-severity incident, we conduct a thorough root cause analysis to:
- Document the complete incident timeline
- Identify primary and contributing causes
- Define prevention measures
- Share lessons learned with the team
All RCAs follow a structured format and are stored in the RCA repository for future reference and continuous improvement.