2025-11-06: SIP Job Execution Failure
Incident Summary
Date & Time
- Date: November 6, 2025
- Duration: 5 hours
- Severity: Critical
What Happened
The automated Systematic Investment Plan (SIP) order placement job halted partway through its early-morning run. This job automatically places SIP orders for all investors scheduled for that day.
The issue was detected during the morning SIP validation job, which showed a lower-than-expected SIP count. The technology team immediately investigated and identified that the SIP execution job had stopped midway due to a data-handling issue introduced in a recent performance optimization update.
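For illustration, the validation amounts to a shortfall check along the following lines. This is a minimal sketch: the function names, the alerting stub, and the 2% tolerance are assumptions, not the production values.

```python
def alert(message: str) -> None:
    # Stand-in for the real alerting hook (e.g. a Slack webhook).
    print(f"[ALERT] {message}")

def validate_sip_count(executed: int, scheduled: int, tolerance: float = 0.02) -> bool:
    """Flag the run if executed orders fall short of the day's schedule."""
    shortfall = scheduled - executed
    if shortfall > scheduled * tolerance:
        alert(f"SIP validation: {executed}/{scheduled} orders executed; "
              f"shortfall of {shortfall} exceeds the {tolerance:.0%} tolerance")
        return False
    return True

# Example: a halted job leaves a large shortfall, which trips the check.
validate_sip_count(executed=4_200, scheduled=10_000)
```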
Initially, this appeared to be a server overload, but detailed analysis confirmed the underlying cause: a data-structure mismatch between the new optimized code and legacy code paths that had not been updated during the performance improvements.
Because SIP volumes on November 6 were relatively low, the team quickly rolled back to the previous stable version, and all SIPs were executed the same morning through a combination of system recovery and manual validation.
However, for a subset of investors whose mandates use eMandate-based authorization, debit processing can be delayed depending on the bank. The 5-hour delay in SIP execution pushed some account debits beyond their normal processing windows.
Impact Assessment
Services Affected
- SIP Processing: Temporary disruption (5 hours) during morning run
- Background Jobs: SIP execution job halted midway
Business Impact
- SIPs executed same morning after recovery
- Investors with eMandate-based mandates may have experienced delayed account debits, as the 5-hour delay exceeded bank processing cutoff windows
- No orders were missed; however, the timing variance caused debit delays for some eMandate customers
- NAV allocation for some eMandate-based SIPs was affected by the delayed execution
- Reputational risk mitigated via proactive monitoring and recovery
Root Cause Analysis
Primary Causes
- Performance Optimization Update: Recent code changes improved SIP processing performance by replacing a lazily evaluated database query with a pre-fetched in-memory list
- Incomplete Code Migration: A logging and count operation still relied on the older data structure and was not updated during the optimization, causing a runtime halt mid-job under concurrent system load
- Data Structure Mismatch: The new optimized code path and the legacy code path used incompatible data structures (a minimal sketch of the failure mode follows this list)
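The sketch below uses illustrative names, not the actual codebase. It shows the general class of bug: an ORM-style query object exposes a zero-argument `.count()`, while Python's built-in `list.count()` requires an argument, so a legacy helper that was never migrated fails only at runtime once it receives the pre-fetched list.

```python
class LazyQuery:
    """Stand-in for an ORM queryset: evaluated on demand."""
    def __init__(self, rows):
        self._rows = rows
    def count(self):
        # Zero-argument count, as on a Django-style queryset.
        return len(self._rows)
    def __iter__(self):
        return iter(self._rows)

def fetch_due_sips_legacy(rows):
    return LazyQuery(rows)          # old path: lazy query object

def fetch_due_sips_optimized(rows):
    return list(rows)               # new path: pre-fetched in-memory list

def log_batch_size(orders):
    # Legacy helper that was never migrated: list.count() requires an
    # argument, so the optimized path raises TypeError and halts the job.
    print(f"SIP batch size: {orders.count()}")

rows = ["sip-1", "sip-2", "sip-3"]
log_batch_size(fetch_due_sips_legacy(rows))     # OK: prints 3
log_batch_size(fetch_due_sips_optimized(rows))  # TypeError at runtime
```

Whatever the exact structures in production, the pattern is the same: a consumer written against the old type's interface receives the new type and fails only when that path executes.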
Contributing Factors
- Insufficient test coverage for the affected code path during optimization deployment
- Code change introduced without comprehensive integration testing across all legacy code paths
- Ongoing Performance Optimization: The team regularly optimizes infrastructure to maintain cost efficiency while scaling for growing SIP volumes, which increases the rate of change in these code paths
Resolution Steps
Immediate Actions
- Issue detected promptly via SIP validation checks
- Immediate rollback to the last stable version ensured SIP continuity
- All SIPs executed successfully the same morning after system recovery
Final Resolution
- Root cause analyzed and code fixed
- Redeployed on November 7, 2025 after thorough validation
- Added unit and integration test coverage for the affected code path (see the sketch after this list)
- Rescheduled SIP job to run one hour earlier to minimize overlap with other system jobs
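As a rough illustration of the added coverage (simplified stand-ins, not the production code), a regression test can exercise every fetch path against the shared consumer, so a structure mismatch fails in CI rather than in the night run:

```python
class LazyQuery:
    """Stand-in for the legacy query object."""
    def __init__(self, rows):
        self._rows = rows
    def __iter__(self):
        return iter(self._rows)

def legacy_fetch(rows):
    return LazyQuery(rows)          # lazy path

def optimized_fetch(rows):
    return list(rows)               # pre-fetched path

def batch_size(orders):
    # Post-fix helper: iterate instead of calling a structure-specific .count()
    return sum(1 for _ in orders)

def test_all_fetch_paths_report_batch_size():
    rows = ["sip-1", "sip-2", "sip-3"]
    for fetch in (legacy_fetch, optimized_fetch):
        assert batch_size(fetch(rows)) == 3
```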
Validation
- Verified all SIPs processed correctly
- Confirmed job timing improvements
- Tested code changes thoroughly with enhanced test coverage
Prevention Measures
| Area | Action Taken | Status |
|---|---|---|
| Code Stability | Fixed logic; added strong type-safe test coverage | ✅ Completed |
| Monitoring | Enhanced SIP job rate tracking and anomaly alerts | ✅ Active |
| Scheduling | Moved SIP job one hour earlier to reduce load overlap | ✅ Implemented |
| Auto-Healing | Already active — system automatically retries failed batches | ✅ Operational |
| Escalation Mechanism | An automated call, in addition to Slack alerts, will be triggered to the Product and Development teams if SIP processing rates degrade during night runs | 🔄 Planned |
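The Monitoring and Escalation rows roughly combine into a rate tracker like the sketch below. The rolling window, the 50% degradation threshold, and the `send_slack_alert`/`trigger_call` hooks are assumptions for illustration, not the deployed system.

```python
from collections import deque

def send_slack_alert(message):
    print(f"[SLACK] {message}")       # stand-in for the Slack integration

def trigger_call(target):
    print(f"[CALL] paging {target}")  # stand-in for the planned call escalation

class SipRateMonitor:
    """Tracks orders/minute against a rolling baseline; escalates on degradation."""
    def __init__(self, window: int = 12, degrade_ratio: float = 0.5):
        self.samples = deque(maxlen=window)
        self.degrade_ratio = degrade_ratio

    def record(self, orders_per_minute: float) -> None:
        if self.samples:
            baseline = sum(self.samples) / len(self.samples)
            if orders_per_minute < baseline * self.degrade_ratio:
                send_slack_alert(f"SIP rate degraded: {orders_per_minute:.0f}/min "
                                 f"vs baseline {baseline:.0f}/min")
                trigger_call("product-and-dev-on-call")
        self.samples.append(orders_per_minute)

# Example: a healthy baseline followed by a stall trips both alerts.
monitor = SipRateMonitor()
for rate in (500, 520, 480, 40):
    monitor.record(rate)
```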
Lessons Learned
Key Takeaways
- Performance optimizations require comprehensive test coverage across all data access patterns, especially legacy code paths
- Fast development cycles for cost efficiency need robust testing gates
- Early validation jobs are critical for detecting processing failures quickly
- Quick rollback capability minimizes customer impact
- Job scheduling needs to account for system load patterns
- Automated monitoring and escalation mechanisms are essential for night-time job failures
- Data structure changes must be thoroughly validated across all dependent code paths
- Cost optimization is critical but must be balanced with thorough testing to avoid production incidents
Action Items
- Fixed data structure mismatch in SIP processing code
- Added unit and integration tests for affected code paths
- Enhanced monitoring with rate tracking and anomaly detection
- Rescheduled SIP job to avoid load overlap
- Planned automated escalation (Slack alerts plus automated calls) for SIP processing degradation