2025-11-06: SIP Job Execution Failure

Root cause analysis for the SIP order placement job failure on November 6, 2025

Incident Summary

Date & Time

  • Date: November 6, 2025
  • Duration: 5 hours
  • Severity: Critical

What Happened

The automated Systematic Investment Plan (SIP) order placement job halted partway through its early-morning run. This job automatically places SIP orders for all investors whose SIPs are scheduled for that day.

The issue was detected during the morning SIP validation job, which showed a lower-than-expected SIP count. The technology team immediately investigated and identified that the SIP execution job had stopped midway due to a data-handling issue introduced in a recent performance optimization update.

Initially, the failure appeared to be a server overload, but detailed analysis confirmed the underlying cause was a data structure mismatch between the new optimized code and legacy code paths that had not been updated during the performance improvements.

Because SIP volumes on November 6 were relatively low, the team quickly rolled back to the previous stable version, and all SIPs were executed the same morning through a combination of system recovery and manual validation.

However, for a subset of investors whose mandates use eMandate-based authorization, debit processing can be delayed depending on the bank. The 5-hour delay in SIP execution pushed some account debits beyond their normal processing windows.

Impact Assessment

Services Affected

  • SIP Processing: Temporary disruption (5 hours) during morning run
  • Background Jobs: SIP execution job halted midway

Business Impact

  • All SIPs were executed the same morning after recovery
  • Investors with eMandate-based authorization may have experienced delayed account debits, as the 5-hour delay pushed execution past bank processing cutoff windows
  • No orders were missed; however, the timing variance delayed debits for some eMandate customers
  • NAV allocation for some eMandate-based SIPs was affected by the delayed execution
  • Reputational risk was mitigated through proactive monitoring and rapid recovery

Root Cause Analysis

Primary Causes

  • Performance Optimization Update: Recent code changes improved SIP processing performance by switching data fetching from a lazily evaluated database query to a pre-fetched in-memory list
  • Incomplete Code Migration: A logging and count operation still relied on the older data structure and was not updated during the performance optimization, causing a runtime halt partway through the job's execution under concurrent system load (see the sketch after this list)
  • Data Structure Mismatch: The new optimized code path and the legacy code paths it fed were built around incompatible data structures

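To make the mismatch concrete, the following is a minimal, self-contained Python sketch. All names (`LazySipQuery`, `fetch_due_sips_*`, `run_sip_job`) are illustrative rather than the production code, and the exact exception raised in production may have differed; the point is how a count/logging path written for a lazy query object breaks once the data becomes a plain list.

```python
# sip_mismatch_sketch.py - hypothetical names; not the production code.
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("sip_job")


class LazySipQuery:
    """Old structure: query-like object, evaluated lazily, exposing .count()."""

    def __init__(self, rows):
        self._rows = rows

    def __iter__(self):
        return iter(self._rows)

    def count(self):
        return len(self._rows)


def fetch_due_sips_old(rows):
    return LazySipQuery(rows)      # old path: lazy query object


def fetch_due_sips_optimized(rows):
    return list(rows)              # new path: pre-fetched list in memory


def run_sip_job(due_sips):
    placed = 0
    for sip in due_sips:
        placed += 1                # stand-in for actual order placement
    # Legacy logging/count path that was never migrated: .count() with no
    # argument works on the lazy query object, but list.count() requires an
    # argument, so the optimized path raises TypeError and halts the job mid-run.
    logger.info("Placed %d of %d due SIPs", placed, due_sips.count())


if __name__ == "__main__":
    rows = ["SIP-1", "SIP-2", "SIP-3"]
    run_sip_job(fetch_due_sips_old(rows))        # succeeds
    run_sip_job(fetch_due_sips_optimized(rows))  # TypeError: list.count() requires an argument
```
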
Contributing Factors

  • Insufficient test coverage for the affected code path during optimization deployment
  • Code change introduced without comprehensive integration testing across all legacy code paths
  • Ongoing Performance Optimization: The team proactively optimizes infrastructure to maintain cost efficiency while scaling for growing SIP volumes, which increased the pace of change in this code path

Resolution Steps

Immediate Actions

  1. Issue detected promptly via SIP validation checks
  2. Immediate rollback to the last stable version ensured SIP continuity
  3. All SIPs executed successfully the same morning after system recovery

Final Resolution

  1. Root cause analyzed and code fixed
  2. Redeployed on November 7, 2025 after thorough validation
  3. Added unit and integration test coverage for the affected code path (an illustrative test follows this list)
  4. Rescheduled SIP job to run one hour earlier to minimize overlap with other system jobs

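As an illustration of the coverage added in step 3, a minimal pytest sketch along the following lines would exercise both data access paths. It reuses the hypothetical names from the earlier sketch (saved as `sip_mismatch_sketch.py`) and is not the actual test suite.

```python
# test_sip_count_paths.py - hypothetical pytest sketch, not the production tests.
import pytest

from sip_mismatch_sketch import fetch_due_sips_old, fetch_due_sips_optimized


def sip_count(due_sips):
    # Fixed counting helper: materialise to a list so len() works whether the
    # input is the lazy query object or the pre-fetched list.
    return len(list(due_sips))


@pytest.mark.parametrize("fetch", [fetch_due_sips_old, fetch_due_sips_optimized])
def test_count_consistent_across_data_access_paths(fetch):
    rows = ["SIP-1", "SIP-2", "SIP-3"]
    assert sip_count(fetch(rows)) == 3
```
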
Validation

  • Verified all SIPs processed correctly
  • Confirmed job timing improvements
  • Tested code changes thoroughly with enhanced test coverage

Prevention Measures

| Area | Action Taken | Status |
|------|--------------|--------|
| Code Stability | Fixed logic; added strong type-safe test coverage | ✅ Completed |
| Monitoring | Enhanced SIP job rate tracking and anomaly alerts (see the sketch below) | ✅ Active |
| Scheduling | Moved SIP job one hour earlier to reduce load overlap | ✅ Implemented |
| Auto-Healing | Already active: system automatically retries failed batches | ✅ Operational |
| Escalation Mechanism | Automated call to Product and Development teams, in addition to Slack alerts, will now trigger if SIP processing rates degrade during night runs | 🔄 Planned |

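The rate tracking and anomaly alerting in the Monitoring row could take roughly the following shape. This is a sketch under assumed parameters: the 95% threshold and the `alert` hook are illustrative, not the actual monitoring configuration.

```python
# sip_rate_check.py - illustrative anomaly check with a hypothetical threshold.
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("sip_monitoring")

ALERT_THRESHOLD = 0.95  # assumption: alert if fewer than 95% of expected SIPs are placed


def check_sip_processing_rate(placed_count, expected_count, alert):
    """Return True if processing looks healthy; otherwise fire the alert hook."""
    if expected_count == 0:
        return True
    rate = placed_count / expected_count
    if rate < ALERT_THRESHOLD:
        # alert() stands in for the Slack / automated-call escalation hooks.
        alert(f"SIP processing rate degraded: {placed_count}/{expected_count} ({rate:.1%})")
        return False
    logger.info("SIP processing rate healthy: %.1f%%", rate * 100)
    return True


if __name__ == "__main__":
    check_sip_processing_rate(900, 1000, alert=print)  # fires: 90.0% is below threshold
```
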
Lessons Learned

Key Takeaways

  • Performance optimizations require comprehensive test coverage across all data access patterns, especially legacy code paths
  • Fast development cycles for cost efficiency need robust testing gates
  • Early validation jobs are critical for detecting processing failures quickly
  • Quick rollback capability minimizes customer impact
  • Job scheduling needs to account for system load patterns
  • Automated monitoring and escalation mechanisms are essential for night-time job failures
  • Data structure changes must be thoroughly validated across all dependent code paths
  • Cost optimization is critical but must be balanced with thorough testing to avoid production incidents

Action Items

  • Fixed data structure mismatch in SIP processing code
  • Added unit and integration tests for affected code paths
  • Enhanced monitoring with rate tracking and anomaly detection
  • Rescheduled SIP job to avoid load overlap
  • Implemented automated escalation for SIP processing degradation