Incident Summary: 09 June 2025 Service Disruption

KPal · June 17, 2025, 1:31pm

On Monday, 9th June 2025, our platform experienced a significant service disruption resulting in approximately 10 hours of downtime for our users. We understand the impact this had on our community, and we sincerely apologize. This post provides a summary of the incident, its cause, and the steps we took to restore service.

What Happened?

At approximately 11:07 AM BST, we began receiving reports of service unavailability. Our monitoring systems soon confirmed a platform-wide issue affecting access to our services. Our engineering team immediately began an investigation to identify the root cause and restore functionality as quickly as possible.

Cause of the Incident

Our investigation determined that the incident was triggered by an unusually large and sudden surge in incoming traffic, which began at approximately 11:05 AM BST. The volume of traffic was many times higher than our normal peak levels and overwhelmed the capacity of our systems, leading to service unavailability. The sustained nature and characteristics of this traffic spike indicate it was a distributed denial-of-service (DDoS) attack.

Our Response and Resolution

Our engineering team, in collaboration with external support partners, worked throughout the day to mitigate the impact and restore service. The path to resolution involved several key actions:

Investigation: We immediately began diagnosing the issue, which was challenging due to the scale of the traffic overwhelming our systems and affecting our standard monitoring.
System Adjustments: A primary focus was on reconfiguring our traffic management systems. We implemented stricter IP rate limiting and filtering rules to improve resilience against the high load and ensure our core systems could remain healthy and responsive. These changes were effective in stabilizing the platform, even while the elevated traffic levels continued.
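
To illustrate the kind of per-IP rate limiting and filtering described above, here is a minimal sketch in Python. It is not our production configuration: the class name `IpRateLimiter`, its parameters (`capacity`, `refill_rate`, `blocked`), and the example addresses are all hypothetical, and the token-bucket algorithm is an assumption chosen for illustration. In practice this sort of rule usually lives in a load balancer or web application firewall rather than in application code.

```python
import time


class IpRateLimiter:
    """Hypothetical per-IP token-bucket limiter with a simple block list."""

    def __init__(self, capacity=10, refill_rate=5.0, blocked=(), clock=time.monotonic):
        self.capacity = float(capacity)   # maximum burst size per IP
        self.refill_rate = refill_rate    # tokens added per second
        self.blocked = set(blocked)       # IPs rejected outright (filtering rule)
        self.clock = clock                # injectable clock, useful for testing
        self.buckets = {}                 # ip -> (tokens, time of last update)

    def allow(self, ip):
        """Return True if a request from `ip` should be served."""
        if ip in self.blocked:
            return False
        now = self.clock()
        tokens, last = self.buckets.get(ip, (self.capacity, now))
        # Refill in proportion to elapsed time, never exceeding the capacity.
        tokens = min(self.capacity, tokens + (now - last) * self.refill_rate)
        if tokens >= 1.0:
            self.buckets[ip] = (tokens - 1.0, now)
            return True
        self.buckets[ip] = (tokens, now)
        return False


if __name__ == "__main__":
    limiter = IpRateLimiter(capacity=3, refill_rate=1.0, blocked={"203.0.113.9"})
    # A burst of four requests from one address: the fourth exceeds the burst size.
    print([limiter.allow("198.51.100.7") for _ in range(4)])
    # A filtered address is rejected regardless of its request rate.
    print(limiter.allow("203.0.113.9"))
```

A stricter rule corresponds to a lower `capacity` or `refill_rate`. The trade-off is that legitimate users behind a shared address (for example, a campus network) are throttled sooner.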

Service was progressively restored from 6:00 PM BST. By 9:00 PM BST, after the necessary system adjustments and configuration corrections were completed, we confirmed that all services were fully recovered.

Future Mitigation

We have set up frequent calls with our cloud provider to identify issues as early as possible and to continue improving the integrity of the platform.

We are taking this incident very seriously and have implemented immediate and long-term measures to prevent these types of attacks, improve the resilience of our platform, and streamline our incident response processes to better handle future events.

uhu · June 17, 2025, 5:21pm

Thanks a lot for your detailed description of the incident. Attacks can happen anytime, and your steps toward closer monitoring will hopefully catch most of them in the future.

And thanks a lot for bringing the services back to work for us.

Best,
Urs