2025-11-06: SIP Job Execution Failure

Root cause analysis for the SIP order placement job failure on November 6, 2025

Incident Summary

Date & Time

  • Date: November 6, 2025
  • Duration: 5 hours
  • Severity: Critical

What Happened

The automated Systematic Investment Plan (SIP) order placement job halted partway through its early-morning run. This job automatically places SIP orders for all investors whose SIPs are scheduled for that day.

The issue was detected during the morning SIP validation job, which showed a lower-than-expected SIP count. The technology team immediately investigated and identified that the SIP execution job had stopped midway due to a data-handling issue introduced in a recent performance optimization update.

Initially, the failure appeared to be a server overload, but detailed analysis confirmed the underlying cause was a data structure mismatch between the new optimized code and legacy code paths that had not been updated during the performance improvements.

Because SIP volumes on November 6 were relatively low, the team quickly rolled back to the previous stable version, and all SIPs were executed the same morning through a combination of system recovery and manual validation.

However, for a subset of investors whose mandates use eMandate-based authorization, debit processing can be delayed depending on the bank. The 5-hour delay in SIP execution pushed some account debits beyond their normal processing windows.

Impact Assessment

Services Affected

  • SIP Processing: Temporary disruption (5 hours) during morning run
  • Background Jobs: SIP execution job halted midway

Business Impact

  • All SIPs were executed the same morning after recovery
  • Investors with eMandate-based authorization may have experienced delayed account debits, as the 5-hour delay pushed execution past bank processing cutoff windows
  • No orders were missed; however, the timing variance delayed debits for some eMandate customers
  • NAV allocation for some eMandate-based SIPs was affected by the delayed execution
  • Reputational risk was mitigated through proactive monitoring and rapid recovery

Root Cause Analysis

Primary Causes

  • Performance Optimization Update: Recent code changes improved SIP processing performance by switching data fetching from a lazily evaluated database query to a pre-fetched in-memory list
  • Incomplete Code Migration: A logging and count operation still relied on the older data structure and was not updated during the performance optimization, causing a runtime halt partway through the job's execution under concurrent system load (see the sketch after this list)
  • Data Structure Mismatch: The new optimized code path and the legacy code paths it fed were built around incompatible data structures

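To make the mismatch concrete, the following is a minimal, self-contained Python sketch. All names (`LazySipQuery`, `fetch_due_sips_*`, `run_sip_job`) are illustrative rather than the production code, and the exact exception raised in production may have differed; the point is how a count/logging path written for a lazy query object breaks once the data becomes a plain list.

```python
# sip_mismatch_sketch.py - hypothetical names; not the production code.
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("sip_job")


class LazySipQuery:
    """Old structure: query-like object, evaluated lazily, exposing .count()."""

    def __init__(self, rows):
        self._rows = rows

    def __iter__(self):
        return iter(self._rows)

    def count(self):
        return len(self._rows)


def fetch_due_sips_old(rows):
    return LazySipQuery(rows)      # old path: lazy query object


def fetch_due_sips_optimized(rows):
    return list(rows)              # new path: pre-fetched list in memory


def run_sip_job(due_sips):
    placed = 0
    for sip in due_sips:
        placed += 1                # stand-in for actual order placement
    # Legacy logging/count path that was never migrated: .count() with no
    # argument works on the lazy query object, but list.count() requires an
    # argument, so the optimized path raises TypeError and halts the job mid-run.
    logger.info("Placed %d of %d due SIPs", placed, due_sips.count())


if __name__ == "__main__":
    rows = ["SIP-1", "SIP-2", "SIP-3"]
    run_sip_job(fetch_due_sips_old(rows))        # succeeds
    run_sip_job(fetch_due_sips_optimized(rows))  # TypeError: list.count() requires an argument
```
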
Contributing Factors

  • Insufficient test coverage for the affected code path during optimization deployment
  • Code change introduced without comprehensive integration testing across all legacy code paths
  • Ongoing Performance Optimization: The team proactively optimizes infrastructure to maintain cost efficiency while scaling for growing SIP volumes, which increased the pace of change in this code path

Resolution Steps

Immediate Actions

  1. Issue detected promptly via SIP validation checks
  2. Immediate rollback to the last stable version ensured SIP continuity
  3. All SIPs executed successfully the same morning after system recovery

Final Resolution

  1. Root cause analyzed and code fixed
  2. Redeployed on November 7, 2025 after thorough validation
  3. Added unit and integration test coverage for the affected code path (an illustrative test follows this list)
  4. Rescheduled SIP job to run one hour earlier to minimize overlap with other system jobs

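As an illustration of the coverage added in step 3, a minimal pytest sketch along the following lines would exercise both data access paths. It reuses the hypothetical names from the earlier sketch (saved as `sip_mismatch_sketch.py`) and is not the actual test suite.

```python
# test_sip_count_paths.py - hypothetical pytest sketch, not the production tests.
import pytest

from sip_mismatch_sketch import fetch_due_sips_old, fetch_due_sips_optimized


def sip_count(due_sips):
    # Fixed counting helper: materialise to a list so len() works whether the
    # input is the lazy query object or the pre-fetched list.
    return len(list(due_sips))


@pytest.mark.parametrize("fetch", [fetch_due_sips_old, fetch_due_sips_optimized])
def test_count_consistent_across_data_access_paths(fetch):
    rows = ["SIP-1", "SIP-2", "SIP-3"]
    assert sip_count(fetch(rows)) == 3
```
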
Validation

  • Verified all SIPs processed correctly
  • Confirmed job timing improvements
  • Tested code changes thoroughly with enhanced test coverage

Prevention Measures

| Area | Action Taken | Status |
|------|--------------|--------|
| Code Stability | Fixed logic; added strong type-safe test coverage | ✅ Completed |
| Monitoring | Enhanced SIP job rate tracking and anomaly alerts (see the sketch below) | ✅ Active |
| Scheduling | Moved SIP job one hour earlier to reduce load overlap | ✅ Implemented |
| Auto-Healing | Already active: system automatically retries failed batches | ✅ Operational |
| Escalation Mechanism | Automated call to Product and Development teams, in addition to Slack alerts, will now trigger if SIP processing rates degrade during night runs | 🔄 Planned |

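The rate tracking and anomaly alerting in the Monitoring row could take roughly the following shape. This is a sketch under assumed parameters: the 95% threshold and the `alert` hook are illustrative, not the actual monitoring configuration.

```python
# sip_rate_check.py - illustrative anomaly check with a hypothetical threshold.
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("sip_monitoring")

ALERT_THRESHOLD = 0.95  # assumption: alert if fewer than 95% of expected SIPs are placed


def check_sip_processing_rate(placed_count, expected_count, alert):
    """Return True if processing looks healthy; otherwise fire the alert hook."""
    if expected_count == 0:
        return True
    rate = placed_count / expected_count
    if rate < ALERT_THRESHOLD:
        # alert() stands in for the Slack / automated-call escalation hooks.
        alert(f"SIP processing rate degraded: {placed_count}/{expected_count} ({rate:.1%})")
        return False
    logger.info("SIP processing rate healthy: %.1f%%", rate * 100)
    return True


if __name__ == "__main__":
    check_sip_processing_rate(900, 1000, alert=print)  # fires: 90.0% is below threshold
```
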
Lessons Learned

Key Takeaways

  • Performance optimizations require comprehensive test coverage across all data access patterns, especially legacy code paths
  • Fast development cycles for cost efficiency need robust testing gates
  • Early validation jobs are critical for detecting processing failures quickly
  • Quick rollback capability minimizes customer impact
  • Job scheduling needs to account for system load patterns
  • Automated monitoring and escalation mechanisms are essential for night-time job failures
  • Data structure changes must be thoroughly validated across all dependent code paths
  • Cost optimization is critical but must be balanced with thorough testing to avoid production incidents

Action Items

  • Fixed data structure mismatch in SIP processing code
  • Added unit and integration tests for affected code paths
  • Enhanced monitoring with rate tracking and anomaly detection
  • Rescheduled SIP job to avoid load overlap
  • Implemented automated escalation for SIP processing degradation