Last night, one of the servers that handles our incoming messages from Network Rail and other suppliers failed due to a lack of disk space. This brought down the services that distribute messages to other internal servers, meaning maps and train running information were out of date on the website.
We resolved this problem by freeing up disk space on the server and restarting the necessary services.
The cause was straightforward. Each month, we archive the logs from our messaging servers to offline storage. We do this on both our live and backup servers, and we clear the old data from each server only once the archived copy has been verified as complete.
Last month, this process failed in a subtle way: messages on one of the servers were archived but not deleted, so the disk space they used was never freed up. Last night, the disk on that server filled up.
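For readers curious about what an archive-then-clear step looks like, here is a minimal sketch. It is not our actual tooling: the function names, the file layout and the use of a SHA-256 checksum for verification are all assumptions made for illustration. The point it shows is that the original is only deleted after the copy has been verified, and that the step fails loudly if the original is still on disk afterwards.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def archive_and_clear(source: Path, archive_dir: Path) -> Path:
    """Copy a log to the archive, verify the copy, then delete the original.

    Hypothetical helper: the names and the checksum choice are illustrative.
    """
    archive_dir.mkdir(parents=True, exist_ok=True)
    target = archive_dir / source.name
    shutil.copy2(source, target)

    # Only delete the original once the archived copy matches it exactly.
    if sha256_of(source) != sha256_of(target):
        raise RuntimeError(f"archive copy of {source} failed verification")

    source.unlink()

    # Guard against "archived but not deleted": confirm the space was freed.
    if source.exists():
        raise RuntimeError(f"{source} was archived but is still on disk")
    return target
```

In a sketch like this, an error from either check would stop the monthly run and alert whoever is running it, rather than leaving old data quietly in place.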
This archiving process has only failed once before, and the underlying cause of that failure was fixed. This time the cause was a process issue, not a technical issue, and we will be taking steps to ensure it doesn’t happen again.