DB for SPARK public data
We want to build a Public Spark Dashboard showing aggregated data about retrieval measurements performed by Spark checker nodes.
We agreed to aggregate the data with one-day granularity.
What data do we want to show?
- Step 1: retrieval success rate (one number - percentage)
- One number per day
- Step 2: RSR per Storage Provider (minerId)
- N numbers per day, where N is ~600. We expect this number to grow with no upper bound.
- Later: sky is the limit. See Retrieval Bot Dashboard.
- We may want to show data for each retrieval task performed, where a retrieval task is (payload CID + the deal it belongs to)
- We may want to show individual measurements as reported by the checker nodes, grouped by the provider
- RSR per Station instance
- RSR per geo-region where Stations are running
- RSR per SP per geo-region of Stations
- Breakdown of retrieval error types - overall + per-SP
- Breakdowns per clientId from the FIL deal
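Step 1 and Step 2 are the same aggregation with different grouping keys. A minimal sketch of the math (the measurement shape `{ minerId, success }` is illustrative, not the actual schema, which carries more fields like payload CID and error codes):

```javascript
// Step 1: one number per day -- the overall retrieval success rate (percentage).
function overallRsr(measurements) {
  if (measurements.length === 0) return 0
  const successful = measurements.filter((m) => m.success).length
  return (successful / measurements.length) * 100
}

// Step 2: N numbers per day -- the success rate per Storage Provider (minerId).
function rsrPerMiner(measurements) {
  const stats = new Map() // minerId -> { total, successful }
  for (const m of measurements) {
    const s = stats.get(m.minerId) ?? { total: 0, successful: 0 }
    s.total += 1
    if (m.success) s.successful += 1
    stats.set(m.minerId, s)
  }
  return new Map(
    [...stats].map(([minerId, s]) => [minerId, (s.successful / s.total) * 100])
  )
}
```

Every later breakdown (per error type, per clientId, per geo-region) is a variation of `rsrPerMiner` with a different grouping key.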
What database to use
Prior art:
- RetrievalBot uses MongoDB to store the data. It produces multiple orders of magnitude fewer measurements than SPARK does.
- We used to use Postgres in SPARK, but reached the 500 GiB limit imposed by Fly.io in less than 3 months.
- spark-evaluate is aggregating data into InfluxDB now. It’s not clear how the cost of InfluxDB Cloud compares to Postgres-on-Fly.
Known issues:
- InfluxDB Cloud sometimes needs 30-100 minutes to make new data available for reading. Aggregation queries building per-day data from per-hour data must take this into account.
InfluxDB docs for downsampling: Downsampling with InfluxDB v2.0
- InfluxDB does not support versioning task queries/scripts in git. We could use an infrastructure-as-code tool like Terraform, but that adds a lot of complexity.
- The InfluxDB plugin for Terraform has been abandoned and does not support InfluxDB 2.0.
- InfluxDB provides a REST API for manipulating tasks, but I don’t feel like implementing our own CD pipeline on top of that.
- InfluxDB indexes measurements and tags but does not index field values. Tags should not store values from a growing/unbounded set, such as miner IDs. As a result, we cannot get efficient index-based queries for per-miner stats. (Learn more in InfluxDB docs: Data Layout and Schema Design Best Practices for InfluxDB.)
- If we wanted to expose Postgres to Grafana, we would have to open it up to the entire internet. We ran this setup in the past and saw attackers trying to guess our admin password and compromise our DB. To stay safe, I prefer to keep the DB hidden and expose the data via a REST API - but that requires more work on our side.
Proposal
I am proposing the following architecture. It reuses existing building blocks we are already familiar with.
- Store the public data in a Postgres database.
- This allows us to store rich objects with many properties and define indexes to get performant queries.
- Modify the existing spark-evaluate service to update the public data every round, similarly to how it publishes aggregated per-round metrics to InfluxDB.
- Build a new micro service exposing the public data via HTTP JSON API.
- This gives us full control in how we cache SQL query results to reduce DB load. We can even leverage Redis if needed in the future.
See also async-cache-dedupe and caching in retrievalbot-dashboard
- We can configure Cloudflare for an extra layer of caching.
- In the future, we can easily switch the Dashboard from Grafana to a single-page app with no change required on the backend.
We are already using JSON-API data sources in our dashboards (e.g. GH Release Downloads via Deno Deploy, data from Plausible analytics), so we know it’s possible to build Grafana visualisations on top of JSON API data sources.
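The caching idea, sketched by hand to show the shape of it (in the real service we would more likely reach for async-cache-dedupe; the names below are illustrative):

```javascript
// A tiny TTL cache with request deduplication: concurrent calls for the
// same key share one in-flight promise, and the result is reused for ttlMs.
function cachedQuery(runQuery, ttlMs) {
  const entries = new Map() // key -> { promise, expires }
  return (key) => {
    const now = Date.now()
    const hit = entries.get(key)
    if (hit && hit.expires > now) return hit.promise
    const promise = runQuery(key)
    entries.set(key, { promise, expires: now + ttlMs })
    return promise
  }
}
```

With this in front of the SQL layer, any number of concurrent dashboard loads triggers at most one query per key per TTL window, which is what keeps the shared PG instance safe from public traffic spikes.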
Open questions:
- Should we use the existing PG instance or create a new one?
If using the existing instance:
- Our operations are easier, we don’t need to monitor another system.
- In case of large load on the public dashboard that is not absorbed by our caching layers, we will degrade performance of the single PG instance and thus affect the performance of the core services (spark-api, spark-publish).
Possible solution: configure the dashboard API to use the read-only replica.
→ I am leaning towards keeping a single PG instance for everything.
→ However, I’d also like to create a new database for tables holding public data, so that we can evolve the schema independently from the main spark-api schema.
- Should we use the existing PL’s Grafana Cloud account or create a new one paid by the Station team?
- Miroslav doesn’t have permissions to add new datasources to PL’s Grafana.
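To make the “separate database for public data” idea concrete, a possible starting shape for the tables (all names are illustrative). Storing counts rather than percentages lets us re-aggregate over arbitrary day ranges without precision loss:

```sql
-- Step 1: one row per day.
CREATE TABLE retrieval_stats (
  day DATE NOT NULL PRIMARY KEY,
  total BIGINT NOT NULL,
  successful BIGINT NOT NULL
);

-- Step 2: one row per (day, miner). Unlike InfluxDB tags, an unbounded
-- set of miner IDs is not a problem for a Postgres composite index.
CREATE TABLE miner_retrieval_stats (
  day DATE NOT NULL,
  miner_id TEXT NOT NULL,
  total BIGINT NOT NULL,
  successful BIGINT NOT NULL,
  PRIMARY KEY (day, miner_id)
);
```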
Alternatives considered
Decouple aggregation of public data from spark-evaluate
Introducing a new micro-service would add significant costs. For each round, spark-evaluate produces ~1 GB of data. If we introduced a new micro-service to aggregate data into public stats, then we would need to transfer this data from spark-evaluate to the new micro-service.
Implications:
- We need a reliable transport mechanism. Because the dataset is large, we must implement error recovery, preferably one that does not require us to re-upload the entire data set on error.
- Depending on how we implement this transport, we may need to pay additional storage & network bandwidth costs.
- This means a significant increase in complexity.
Instead, I propose to apply the design pattern called Modular Monolith: keep the aggregation module decoupled from the rest of spark-evaluate at the design level, but keep all modules running inside the same monolith. We are already using this approach for other steps in the evaluation workflow.
```mermaid
graph LR
  RoundStarted --> FraudDetection
  FraudDetection --> SubmitScoresToMeridian
  FraudDetection --> CalculateStatsForInternalGrafana
  FraudDetection --> CalculateStatsForPublicDashboard
```
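In code, the graph above translates to plain function calls inside spark-evaluate’s round evaluation (function and parameter names below are illustrative, not the actual spark-evaluate API):

```javascript
// Modular monolith: each step is a separately designed module, but they all
// run in-process, so the ~1 GB of per-round measurements never has to be
// transferred to another service.
async function onRoundEnded(round, measurements, modules) {
  const honest = await modules.fraudDetection(measurements)
  // The downstream steps are independent of each other; run them concurrently.
  await Promise.all([
    modules.submitScoresToMeridian(round, honest),
    modules.calculateStatsForInternalGrafana(round, honest),
    modules.calculateStatsForPublicDashboard(round, honest)
  ])
}
```

Adding the public-dashboard aggregation is then one more entry in the `Promise.all`, with no new transport, storage, or deployment concerns.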