2013-11-29 Postgres ran out of disk space

Incident summary

Around 7:36 UTC, our database ran out of available disk space.

We noticed the problem at 8:31 UTC and started to investigate.

By 9:01 UTC, all services were up and running again.

Impact

Round 568 was extended to ~2h20m instead of the usual one hour.

Some of the SPARK measurements submitted by Station nodes between 7:30 and 7:36 UTC were not published in time; they ended up being associated with the next round and later rejected as invalid.

While our services were offline, Stations could not fetch tasks for the current round or submit measurements for retrievals started shortly before the outage began.

As a result, a fraction of the jobs performed by Stations between 7:36 and 9:01 UTC were not rewarded.

Corrective actions

We implemented a new alert for the database running out of available disk space.
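
For illustration only, a disk-space check of this kind might look like the minimal sketch below. It is not the alert we deployed. The function names, the 90% threshold, and the checked path are all hypothetical, and a production alert would normally live in the monitoring system rather than in a standalone script.

```python
import shutil


def should_alert(used_bytes: int, total_bytes: int, threshold: float = 0.9) -> bool:
    """Return True when the used fraction reaches the threshold.

    The 0.9 default is an assumed value, not one taken from our setup.
    """
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return used_bytes / total_bytes >= threshold


def check_volume(path: str, threshold: float = 0.9) -> bool:
    """Check the filesystem that contains `path`."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (in bytes)
    return should_alert(usage.used, usage.total, threshold)


if __name__ == "__main__":
    # "." stands in for the database data directory, which this report does not name.
    if check_volume("."):
        print("ALERT: disk usage at or above 90%")
    else:
        print("OK: disk usage below 90%")
```

Alerting well before the volume is full leaves time to free or add space before writes start to fail.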

We are implementing alerts for service crashes.