Incident Information
In recent weeks, there have been multiple disruptions to Secure File Transfer Protocol (SFTP) membership file imports. These disruptions increased processing times and required manual intervention to resolve timeouts.
Our Engineering and Operations teams have been actively investigating this disruption and found that a defect in an underlying library could cause an infinite loop when processing a specific file. This infinite loop then prevented other files from being processed.
This has been resolved through an upgrade of the library.
What happened?
Reports were received from clients in early September indicating that membership file imports through SFTP were ‘stuck’.
Upon investigation, it was identified that one of the background jobs for processing SFTP membership files had encountered a problem. The Engineering team addressed this by restarting the job and requeuing the files for processing. Logs were collected for review and it was hypothesised that the communication between the SFTP service and the background workers was failing.
During a review of the communication flows, it was identified that if processing was delayed, subsequent messages could be dropped due to timeouts. Steps were taken to address this, and additional background jobs were launched to minimise the likelihood that a communication would time out. This change, made in early October, has allowed all files to be processed without manual intervention.
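For readers interested in the mechanics, the sketch below illustrates the kind of mitigation described above. It assumes an SQS-style queue accessed through boto3; the queue URL, timeout values and the process_file routine are illustrative assumptions rather than our production code.

    import boto3

    sqs = boto3.client("sqs")
    QUEUE_URL = "https://sqs.example.com/membership-file-imports"  # hypothetical

    def process_file(body):
        """Placeholder for the real membership-file processing routine."""
        ...

    def consume():
        """One consumer loop; several of these run in parallel for extra throughput."""
        while True:
            response = sqs.receive_message(
                QueueUrl=QUEUE_URL,
                MaxNumberOfMessages=1,
                WaitTimeSeconds=20,     # long polling
                VisibilityTimeout=900,  # longer visibility so slow files are not redelivered mid-processing
            )
            for message in response.get("Messages", []):
                process_file(message["Body"])
                # Acknowledge (delete) only after successful processing, so a failed
                # attempt becomes visible again and can be retried by another consumer.
                sqs.delete_message(QueueUrl=QUEUE_URL,
                                   ReceiptHandle=message["ReceiptHandle"])

Increasing the visibility window and the number of parallel consumers are the two levers referred to in the learnings section below.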
This change also allowed us to identify the root cause: a single file was taking a significant amount of time to process. Under the earlier configuration, this delayed the processing of subsequent files, i.e. they were not processed because of the communication timeout.
Further review in a separate, secure environment revealed that the file was triggering an infinite loop in an underlying library used for handling Excel files. A newer version of this library was available, and it was tested to confirm that the infinite loop no longer occurred. This testing has been completed and the upgraded library has been deployed to production.
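Separately from the library upgrade, a common safeguard against this class of defect is to place a hard time limit on each file's parsing so that a single problematic file cannot block the whole queue. The sketch below is illustrative only; the parse_excel routine and the ten-minute limit are assumptions, not our production code.

    import multiprocessing as mp
    import queue

    def parse_excel(path, results):
        """Placeholder for the real parsing routine that calls the Excel library."""
        results.put({"path": path, "rows": []})

    def parse_with_limit(path, limit_seconds=600):
        """Parse in a child process so a hung parse can be terminated after a limit."""
        results = mp.Queue()
        worker = mp.Process(target=parse_excel, args=(path, results))
        worker.start()
        worker.join(limit_seconds)
        if worker.is_alive():
            # The parser is stuck (for example, in an infinite loop inside the
            # library): terminate it so the rest of the batch can still be processed.
            worker.terminate()
            worker.join()
            raise TimeoutError(f"parsing {path} exceeded {limit_seconds}s")
        try:
            return results.get(timeout=5)
        except queue.Empty:
            raise RuntimeError(f"parser exited without a result for {path}")

A guard of this kind would surface a defective file quickly rather than holding up subsequent imports.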
We have also decided not to revert the configuration changes, as they remain beneficial.
What did we learn?
This incident highlighted the importance of robust monitoring, rapid response to technical disruptions, and continuous improvement of our file processing systems to prevent future occurrences.
Our monitoring of this service was not granular enough to identify communication timeouts.
Temporary measures, such as increasing the queue visibility timeout and the number of consumers, were effective in mitigating the immediate impact of the issue.
Conclusion
We will be making changes to our processes to prevent similar incidents in future. We are sorry for the disruption that this caused and the impact it may have had on your employees.