Scaling Spark PG DB Beyond 500 GiB
Ideas:
Constraints:
- If we run our DB in a different cloud provider than Fly.io, then we need to pay for bandwidth between spark services running on Fly.io and the DB server.
- We could migrate the Spark services too, but that's even more effort.
Migrate our DB to a hosted platform
- AWS RDS Aurora
- Estimated cost at current usage: $316/month
- CockroachDB is too expensive
Use web3.storage for historical measurements
👉🏻 Pull request: https://github.com/filecoin-station/spark-api/pull/182
Roll up `published_as` CIDs into a new table and remove the published measurements from the measurements table (keep only measurements from the last X hours).
Downside:
- Before we can run any more analysis on the SQL data, we need to build a tool to fetch measurements from web3.storage.
- The data migration is going to be very CPU intensive and will take a long time to complete. We need to run it manually outside of the usual deployment/auto-migration process.
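To make the first downside concrete, the fetch tool could start as little more than a gateway lookup. A minimal sketch in Python; the subdomain-gateway URL pattern (`<cid>.ipfs.w3s.link`) is web3.storage's public gateway, but the payload shape and the helper names are assumptions, not the actual spark-publish format:

```python
import json
import urllib.request

def measurements_url(cid: str) -> str:
    # w3s.link is the public IPFS gateway operated by web3.storage;
    # content published via web3.storage resolves under this subdomain.
    return f"https://{cid}.ipfs.w3s.link/"

def fetch_measurements(cid: str) -> list:
    # Assumes each published batch is a single JSON document listing
    # the measurements; adjust to the real on-disk format.
    with urllib.request.urlopen(measurements_url(cid)) as res:
        return json.load(res)
```

With the roll-up table in place, the tool only needs to join `(contract_address, round_index, cid)` rows against the gateway to reconstruct any historical batch.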
Algorithm:
- In spark-publish, after we publish the CID to the smart contract, insert a new row `(contract_address, round_index, cid)`.
- Periodically, delete all measurements that have the `published_at` field set and are older than X hours.
- Initial migration:
For historical measurements where we don't know the exact round index they were published in (because this round index can be different from what spark-api filled in the measurements table), assign their `published_as` CID to a "virtual" contract and round index, e.g. `contract_address = 0xUNKNOWN` and `round_index = 0`.
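The three steps above can be sketched as follows. This uses Python's built-in SQLite purely for illustration (production is Postgres, where the prune would use `now() - interval 'X hours'`); the `commitments` table name, the 48-hour retention value, and the ISO-8601 text timestamps are assumptions for the sketch:

```python
import sqlite3
from datetime import datetime, timedelta, timezone

RETENTION_HOURS = 48  # the "X hours" from the notes; actual value TBD

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE measurements (
        id INTEGER PRIMARY KEY,
        published_as TEXT,   -- CID of the batch this measurement was published in
        published_at TEXT    -- NULL until the measurement is published
    );
    CREATE TABLE commitments (  -- the new roll-up table
        contract_address TEXT,
        round_index INTEGER,
        cid TEXT
    );
""")

# Step 1: after publishing a CID to the smart contract, record the commitment.
def record_commitment(contract_address, round_index, cid):
    db.execute(
        "INSERT INTO commitments (contract_address, round_index, cid) VALUES (?, ?, ?)",
        (contract_address, round_index, cid),
    )

# Step 2: periodically delete published measurements older than X hours.
def prune_measurements(now):
    cutoff = (now - timedelta(hours=RETENTION_HOURS)).isoformat()
    db.execute(
        "DELETE FROM measurements WHERE published_at IS NOT NULL AND published_at < ?",
        (cutoff,),
    )

# Step 3 (initial migration): assign historical CIDs to the "virtual" round.
def migrate_historical():
    db.execute("""
        INSERT INTO commitments (contract_address, round_index, cid)
        SELECT DISTINCT '0xUNKNOWN', 0, published_as
        FROM measurements
        WHERE published_as IS NOT NULL
    """)
```

In Postgres, the prune in step 2 becomes a single `DELETE` run from a scheduled job; the migration in step 3 is the CPU-intensive one-off that has to run outside the normal auto-migration flow.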
Improvement:
- To make it easier to filter CIDs based on date ranges, we can add a timestamp field to each `(contract_address, round_index, cid)` row.
- For historical measurements, we can set this value to `MAX(published_at) + INTERVAL '3 minutes'`, aggregated across all measurements published in the same batch.
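The backfill heuristic amounts to: group historical measurements by the batch they were published in, take the latest `published_at` in the group, and add 3 minutes. A minimal sketch, assuming the aggregated column is the measurements' `published_at` timestamp:

```python
from datetime import datetime, timedelta

def batch_timestamp(published_ats):
    # Hypothetical backfill helper: latest publish time in the batch
    # plus one commitment interval (~3 minutes), per the heuristic above.
    return max(published_ats) + timedelta(minutes=3)
```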
Expected savings:
- If each commitment contains ~50k measurements of ~600 bytes each, then rolling up a commitment adds ~100 bytes (one roll-up row) and deletes ~30 MB of measurements. We publish a new commitment every ~3 minutes, i.e. ~480 commitments per day, which frees ~14.4 GB/day.
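A back-of-the-envelope check of the numbers above:

```python
# Savings estimate from the notes: 50k measurements per commitment,
# ~600 bytes each, one commitment every ~3 minutes.
MEASUREMENTS_PER_COMMITMENT = 50_000
BYTES_PER_MEASUREMENT = 600
COMMITMENT_INTERVAL_MINUTES = 3

deleted_per_commitment = MEASUREMENTS_PER_COMMITMENT * BYTES_PER_MEASUREMENT
commitments_per_day = 24 * 60 // COMMITMENT_INTERVAL_MINUTES
deleted_per_day = deleted_per_commitment * commitments_per_day

print(deleted_per_commitment / 1e6, "MB per commitment")  # 30.0
print(commitments_per_day, "commitments/day")             # 480
print(deleted_per_day / 1e9, "GB/day")                    # 14.4
```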
Compress data
- Postgres Pro
- Compressing file system