On Wednesday 28th April, Prolific had an outage that lasted several hours, from 17:38 UTC to 20:52 UTC. During this time, users weren’t able to reliably access the platform.
We're sorry that this happened. Our team are working hard to make the platform more stable, and to make incidents like this less likely and easier to resolve.
At Prolific we believe in being open, so we want to share with you a timeline of what happened and what we're planning going forward.
Timeline of what happened
All times are in UTC.
15:22 - Prolific engineers received automated alerts that the core database servers were experiencing high CPU usage. They started investigating the cause and performing emergency fixes.
15:27 - Our core database experienced a failover. This normally doesn't cause issues.
16:23 - Another failover event occurred.
17:25 - Repeated failover events started happening every few minutes.
17:38 - These failovers caused the platform to become unavailable to users. The engineers started turning off different parts of the system to reduce traffic to the database.
18:09 - Despite the reduced traffic to the database, the failovers were still happening. At this point we reached out to the database host's support team, as we couldn't see a cause for this behaviour.
18:57 - We started the process of increasing the capacity of the database servers.
19:25 - The failovers stopped and the database stabilised, although it was unclear why.
19:42 - The servers started switching over to the higher capacity hardware.
19:50 - A further fix was made by Prolific engineers to minimise database traffic as we started to move things back into their normal state.
20:09 - The database hardware upgrades finished without further disruption.
20:40 - Happy that the database seemed stable, Prolific engineers started to turn different parts of the system back on.
20:52 - Confident everything was working, Prolific engineers turned the main web interface back on. Services resumed as normal and Prolific engineers remained on standby to monitor.
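The load shedding at 17:38 can be pictured as a "kill switch" that turns off non-essential features so their requests stop reaching the database. This is a minimal, hypothetical sketch of the idea, not Prolific's actual code; the feature names are invented:

```python
# Hypothetical feature names for illustration only.
NON_ESSENTIAL_FEATURES = {"messaging", "search", "analytics"}

# Features currently switched off during an incident.
disabled_features = set()

def shed_load(features):
    """Disable the given features so their endpoints stop generating database traffic."""
    disabled_features.update(features)

def handle_request(feature):
    """Serve a request, or reject it early if its feature has been shed."""
    if feature in disabled_features:
        return "503 Service Unavailable"
    return f"200 OK ({feature})"

# During an incident, non-essential features are disabled first,
# while core functionality (e.g. running studies) keeps working:
shed_load(NON_ESSENTIAL_FEATURES)
print(handle_request("search"))   # rejected early, no database traffic
print(handle_request("studies"))  # core feature still served
```

The benefit of rejecting requests this early is that each shed request costs almost nothing, which buys the database headroom to recover.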
We’ve already been crediting our users for submissions that timed out and studies that were affected by this outage. Please do get in touch with our Support team if you need any help.
There are two unanswered questions we’re investigating:
- Why did the database get stuck in a failover loop?
- How do we prevent this in the future?
We’ll also be holding a postmortem with our team to talk through the incident and ask ourselves some questions:
- How could we have communicated better?
- How could we have fixed the problem faster?
- How could we have seen this coming and prevented it?
We’ve fallen short of our own standards for the stability of the platform.
As Prolific grows, we're finding parts of the system that aren't coping with the increase in users, so we're working hard to find and fix the areas that need to change. In many cases, we're prioritising this work over other product development, such as new features.
We’re also hiring more engineers to help us make these improvements sooner. If this sounds up your street, we'd love for you to apply!
Once again, we're sorry for the disruption this incident caused, and we all thank you for your continued patience and trust in Prolific.
Come discuss this blog post with our community!