Recurrence of the SFTP File Processing Issue

Incident Report for Reward Gateway

Postmortem

Incident Information

In recent weeks, there have been multiple disruptions to Secure File Transfer Protocol (SFTP) membership file imports. These disruptions increased processing times and required manual intervention to resolve timeouts.

Our Engineering and Operations teams have been actively investigating these disruptions and found that a defect in an underlying library could cause an infinite loop when processing a specific file. This infinite loop then prevented other files from being processed.

This has been resolved through an upgrade of the library.

What happened?

Reports were received from clients in early September indicating that membership file imports through SFTP were ‘stuck’.

Upon investigation, it was identified that one of the background jobs for processing the SFTP membership file had encountered a problem. The Engineering team worked to address this by restarting it and requeuing the files for processing. Logs were collected for review and it was hypothesised that the communication between the SFTP and background workers was failing.

During a review of the communication flows, it was identified that if processing was delayed, subsequent messages could be dropped due to timeouts. Steps were taken to address this, and additional background jobs were launched to minimise the likelihood that a communication would time out. Since this change in early October, all files have been processed without manual intervention.

This change also made it possible to identify the root cause: a single file was taking a significant time to process. Under the earlier configuration, this behaviour then impacted the processing of subsequent files, i.e. they were not processed because of the communication timeout.
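The blocking behaviour described above can be illustrated with a minimal simulation. This is a sketch under stated assumptions, not the production implementation: the function name `simulate`, the processing durations, and the 30-second timeout are all hypothetical values chosen for illustration. It models files that arrive together and are dropped if no background job picks them up before the timeout.

```python
import heapq


def simulate(durations, workers, timeout):
    """Return the indices of files dropped for waiting longer than `timeout`.

    Hypothetical model: all files arrive at t=0 and are picked up in order
    by whichever background job becomes free first.
    """
    free_at = [0.0] * workers  # time at which each job becomes idle
    heapq.heapify(free_at)
    dropped = []
    for index, duration in enumerate(durations):
        start = heapq.heappop(free_at)  # earliest moment any job is free
        if start > timeout:
            dropped.append(index)            # message timed out while waiting
            heapq.heappush(free_at, start)   # the job did no work on it
        else:
            heapq.heappush(free_at, start + duration)
    return dropped


# One very slow file followed by three quick ones (illustrative values only).
files = [1000.0, 1.0, 1.0, 1.0]
print(simulate(files, workers=1, timeout=30.0))  # [1, 2, 3]
print(simulate(files, workers=2, timeout=30.0))  # []
```

With a single job, the slow file holds the only worker and every later file times out. With a second job, the remaining files are processed while the slow file is still running, which matches the effect observed after the configuration change.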

Further review in a separate, secure environment revealed that the file was triggering an infinite loop in an underlying library used for handling Excel files. A newer version of this library was available and it was decided to test if the infinite loop was still present. This testing has been completed and the upgraded library has been deployed to production.
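Checking whether an infinite loop is still present requires a deadline, because a looping parser never returns on its own. The following is a minimal sketch of one way such a check could be run in isolation, assuming the parsing step can be started as a separate Python process. The helper name `terminates_within` and the code strings passed to it are hypothetical stand-ins; the actual Excel library and test procedure are not described in this report.

```python
import subprocess
import sys


def terminates_within(code: str, timeout_s: float) -> bool:
    """Run `code` in a fresh Python process.

    Returns False if the process is still running after `timeout_s` seconds.
    subprocess.run kills the child itself when the timeout expires.
    """
    try:
        subprocess.run([sys.executable, "-c", code], timeout=timeout_s, check=True)
    except subprocess.TimeoutExpired:
        return False
    return True


# Hypothetical stand-ins for "parse the file with the old / new library".
looping_parser = "while True: pass"
healthy_parser = "pass"

print(terminates_within(healthy_parser, timeout_s=10.0))
print(terminates_within(looping_parser, timeout_s=1.0))
```

Running the check in a child process means a non-terminating parse can be stopped without affecting the process that launched it.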

We have further decided not to revert the configuration changes as they remain beneficial under

What did we learn?

This incident highlighted the importance of robust monitoring, quick response to technical disruptions, and the need for continuous improvement in handling file processing systems to prevent future occurrences.

Our monitoring of this service was not granular enough to identify communication timeouts.

Temporary measures such as scaling queue visibility time and the number of consumers were effective in mitigating the immediate impact of the issue.

Conclusion

We will be making changes to our processes to prevent similar incidents in future. We are sorry for the disruption that this caused and the impact it may have had on your employees.

Posted Nov 11, 2024 - 17:44 UTC

Resolved

We have resolved the recent SFTP file processing incident. Over the past weeks, our team implemented a system upgrade and successfully tested files that previously encountered processing issues. We are now confident that SFTP files will process smoothly, without delays or interruptions.

Thank you for your patience as we worked to enhance the reliability of our system.

Posted Nov 11, 2024 - 17:43 UTC

Update

Yesterday, the team successfully reproduced the issue that caused the last SFTP file processing disruption (which occurred on 25th September 2024) and identified the root cause of the incident. The team is now actively working on developing improvements to prevent any further disruptions.

Posted Oct 15, 2024 - 10:54 UTC

Monitoring

No further disruptions were noticed by the engineering team. The team is actively resolving the issue for clients and communicating their progress in Zendesk.

Posted Sep 27, 2024 - 15:17 UTC

Identified

Yesterday we experienced a recurrence of the SFTP file processing issue that impacted clients last week. Despite the preventive measures implemented following the previous incident, including additional logging, the issue has reappeared. However, the impact this time appears to be more limited, affecting fewer clients.

Our team is treating this with high priority and is actively investigating the root cause to ensure a long-term resolution. We are also closely monitoring the system and working to mitigate any immediate impact on our clients.

We will provide further updates as soon as more information is available.

Posted Sep 27, 2024 - 09:10 UTC

This incident affected: Reward Manager™.