2013-11-29 Postgres ran out of disk space

Incident summary

  • Around 7:36 UTC, our database ran out of available disk space.
  • We noticed the problem at 8:31 UTC and started to investigate.
  • By 9:01 UTC, all services were up and running again.

Impact

  • Round 568 lasted ~2h20m instead of the usual one hour.
  • Some of the SPARK measurements submitted by Station nodes between 7:30 and 7:36 UTC were not published in time; they were associated with the next round instead and later rejected as invalid.
  • While our services were offline, Stations could not fetch tasks for the current round or submit measurements for retrievals started shortly before the outage began.
  • As a result, a fraction of the jobs performed by Stations between 7:36 and 9:01 UTC were not rewarded.

Corrective actions

  • We implemented a new alert for the database running out of available disk space.
  • We are implementing alerts for service crashes.
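
For illustration only, a disk-space alert of the kind described above can be reduced to a threshold check on the volume that holds the database. The sketch below is not our production alert: the function names, the 15% threshold, and the example path are assumptions chosen for the example, and a real deployment would more likely rely on the hosting provider's metrics and alerting.

```python
import shutil


def should_alert(free_bytes: int, total_bytes: int, min_free_fraction: float = 0.15) -> bool:
    """Return True when the free share of the volume drops below the threshold.

    The 15% default is an illustrative assumption, not a recommended value.
    """
    if total_bytes <= 0:
        # A volume that reports no capacity cannot be evaluated; alert to be safe.
        return True
    return (free_bytes / total_bytes) < min_free_fraction


def check_volume(path: str, min_free_fraction: float = 0.15) -> tuple[bool, float]:
    """Inspect the filesystem containing `path`.

    Returns (alert, free_fraction). `path` would be the database data
    directory; the caller supplies it, nothing is hard-coded here.
    """
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    free_fraction = usage.free / usage.total if usage.total > 0 else 0.0
    return should_alert(usage.free, usage.total, min_free_fraction), free_fraction


if __name__ == "__main__":
    # "." stands in for the real data directory in this example.
    alert, free_fraction = check_volume(".")
    print(f"free: {free_fraction:.1%}, alert: {alert}")
```

Alerting on the remaining free fraction, rather than on a write failure, leaves time to add capacity before the database stops accepting writes.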