2013-11-29 Postgres ran out of disk space
Incident summary
- Around 7:36 UTC, our database ran out of available disk space.
- We noticed the problem at 8:31 UTC and started to investigate.
- By 9:01 UTC, all services were up and running again.
Impact
- Round 568 was extended to last ~2h20m instead of the usual one hour.
- Some of the SPARK measurements submitted by Station nodes between 7:30 and 7:36 UTC were not published in time; they ended up being associated with the next round and later rejected as invalid.
- While our services were offline, Stations could not fetch tasks for the current round or submit measurements for retrievals started shortly before the outage began.
- As a result, a fraction of jobs performed by Stations between 7:36 and 9:01 UTC were not rewarded.
Corrective actions
- We implemented a new alert that fires when the database is running low on available disk space.
- We are implementing alerts for service crashes.
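The disk-space alert above could be sketched as a simple threshold check on the volume holding the database's data directory. This is a minimal illustration, not our actual monitoring setup: the data directory path, the 10% threshold, and the function name are all hypothetical.

```python
import shutil

# Hypothetical data directory and threshold -- adjust for the real deployment.
DATA_DIR = "/var/lib/postgresql/data"
FREE_THRESHOLD = 0.10  # alert when less than 10% of the volume is free


def disk_space_alert(path: str = DATA_DIR, threshold: float = FREE_THRESHOLD) -> bool:
    """Return True when free space on the volume holding `path` drops below `threshold`."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total < threshold
```

In practice a check like this would run on a schedule and page the on-call engineer when it returns True, leaving enough headroom to free or add space before writes start failing.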