Earlier today Talis Prism 3 experienced an outage of roughly 1.5 hours. We are very sorry for the inconvenience caused, and want to give you some technical background on what went wrong and what we will be doing to prevent this happening again.
This morning, at around 11:28, the Talis Platform Team observed an increase in the number of requests being sent to our backend data storage platform. We started investigating the reason for this immediately.
Whilst carrying out this investigation, at around 12:07, we started to see instability in Talis Prism 3; by 12:25 it was mostly unreachable, with a limited number of requests processing successfully.
Just after this, we identified the cause of the outage: a third-party caching system, which stores each tenancy's configuration values for reuse by subsequent requests, had failed. Further investigation revealed a bug in that system that could cause it to fail under high load. Several large customers had recently launched Talis Prism 3, which pushed the caching system past a limit it was unable to handle. This load was expected and planned for, but we were unaware of the bug.
Once we identified the culprit we put a configuration change in place to make Talis Prism 3 use a different caching system. This was completed at 13:27 and deployed to one of the live servers for testing.
We turned off all access to Talis Prism 3 at 13:35 to allow the backend data platform to recover and to let us evaluate whether the new caching system was working as intended. Once we were confident that it was working and could handle the load, we preloaded all tenancies and turned Talis Prism 3 back on. All services resumed by 13:55.
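For readers curious about what preloading tenancies involves, the idea can be sketched in a few lines of Python. This is an illustration only, not Talis Prism 3 code: the function names, the dictionary used as a cache, and the `load_config` callback are all hypothetical. The point is that every tenancy's configuration is fetched once, before users are let back in, so that the first wave of requests does not all hit the backend at the same moment.

```python
from typing import Callable, Dict, Iterable

Config = Dict[str, str]


def preload_tenancies(
    tenancies: Iterable[str],
    load_config: Callable[[str], Config],  # hypothetical backend lookup
    cache: Dict[str, Config],              # stand-in for a real cache
) -> int:
    """Fill the cache for every tenancy before traffic is allowed back in.

    Returns the number of tenancies that were newly loaded.
    """
    loaded = 0
    for tenancy in tenancies:
        if tenancy not in cache:
            cache[tenancy] = load_config(tenancy)
            loaded += 1
    return loaded


def get_config(
    tenancy: str,
    load_config: Callable[[str], Config],
    cache: Dict[str, Config],
) -> Config:
    """Serve a tenancy's configuration, going to the backend only on a miss."""
    if tenancy not in cache:
        cache[tenancy] = load_config(tenancy)
    return cache[tenancy]
```

After a warm-up like this, a call to `get_config` for any preloaded tenancy is answered from the cache without touching the backend.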
Talis Prism 3 is designed to work with different caching systems, so once the cache was confirmed as the cause of the outage, switching to a new one was very quick to implement. We have been planning to deprecate the current system and are in the final stages of testing a more robust replacement. Talis Prism 3 already undergoes rigorous automated and manual testing; however, we have identified several places where we can add extra load tests to simulate the traffic we saw today. Implementing these will be our top priority.
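As a rough illustration of why a design with interchangeable caching systems makes this kind of switch quick, here is a minimal Python sketch. It does not show how Talis Prism 3 is actually built: the class names, the `BACKENDS` table and the `make_cache` function are all assumptions made for the example. The application only talks to a small interface, and a single configuration value decides which implementation sits behind it.

```python
from abc import ABC, abstractmethod
from typing import Dict, Optional


class ConfigCache(ABC):
    """The only interface the application code depends on (hypothetical)."""

    @abstractmethod
    def get(self, key: str) -> Optional[str]: ...

    @abstractmethod
    def set(self, key: str, value: str) -> None: ...


class InMemoryCache(ConfigCache):
    """Keeps values in a local dictionary."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}

    def get(self, key: str) -> Optional[str]:
        return self._data.get(key)

    def set(self, key: str, value: str) -> None:
        self._data[key] = value


class NullCache(ConfigCache):
    """Stores nothing, so every lookup is a miss."""

    def get(self, key: str) -> Optional[str]:
        return None

    def set(self, key: str, value: str) -> None:
        pass


# Hypothetical mapping from a configuration value to an implementation.
BACKENDS = {"memory": InMemoryCache, "none": NullCache}


def make_cache(name: str) -> ConfigCache:
    """Build the cache named in configuration, rejecting unknown names."""
    try:
        return BACKENDS[name]()
    except KeyError:
        raise ValueError(f"unknown cache backend: {name!r}") from None
```

With this shape, moving from one caching system to another means changing the name passed to `make_cache`, while the code that reads and writes configuration values stays the same.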