Move Spark score from individual retrievals to retrievable deals
Problem Statement
At the moment, Spark calculates RSR using data at the measurement (retrieval attempt) granularity. This has several problems:
- We cannot trust individual measurements. We need a large-enough majority agreeing on the same result to give us confidence in the correctness of that result. In other words, the Spark protocol produces an answer to the following question: does the majority of the checker nodes agree that this deal can be retrieved? The protocol is not designed to give us more granular data that we could trust.
- From the perspective of a storage provider, there are many legitimate reasons why a retrieval request may fail:
- A temporary outage of the IPNI endpoint used by the checker node to find where to retrieve the deal payload from.
- There is no network route between the client and the SP. This could be caused e.g. by a client that has an intermittent internet connection, a client that’s behind a firewall, and so on.
- The client is malicious and reports made-up results.
Such measurements must not influence the score that Spark calculates for the provider. Otherwise the SPs can justifiably argue that our score is unfair or even meaningless.
- THIS IS NO LONGER RELEVANT - I FIXED SPARK-EVALUATE:
With the recently introduced “committees & majorities”, measurements that don’t agree with the majority are rejected. As a result, the code calculating RSR receives only majority measurements: for any given deal, either a) all accepted measurements say the deal is retrievable or b) all accepted measurements say the deal is not retrievable. Consequently, the RSR is heavily influenced by how many measurements were collected for each deal.
For example, let’s say an SP has one deal that’s retrievable and another that is not. Now consider two cases:
- The task testing the retrievable deal produces 100 retrieval requests (accepted measurements) while the task testing the non-retrievable deal produces 50 retrieval requests (accepted measurements). The RSR is 100/150 ≈ 67%.
- The task testing the retrievable deal produces 50 retrieval requests (accepted measurements) while the task testing the non-retrievable deal produces 100 retrieval requests (accepted measurements). The RSR is 50/150 ≈ 33%.
Note: This can be easily fixed by changing the code producing RSR to look at minority measurements too. However, such solution increases the complexity of Spark design & codebase.
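The dependency on measurement counts can be sketched as follows (a minimal illustration using the hypothetical counts from the example above, not Spark’s actual code):

```python
def majority_only_rsr(tasks):
    """tasks: list of (accepted_measurement_count, retrievable) per task.

    After committees & majorities, every accepted measurement for a task
    reports the same (majority) result, so the per-task result is just
    repeated `accepted_measurement_count` times.
    """
    total = sum(n for n, _ in tasks)
    successful = sum(n for n, ok in tasks if ok)
    return successful / total

# Case 1: retrievable deal got 100 accepted measurements, non-retrievable 50.
print(majority_only_rsr([(100, True), (50, False)]))  # 100/150 ≈ 0.67
# Case 2: counts swapped -- same two deals, different RSR.
print(majority_only_rsr([(50, True), (100, False)]))  # 50/150 ≈ 0.33
```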
Example
Consider the following imaginary measurements for a single SP:
| Deal (PayloadCID) | IndexerResult | RetrievalResult | Evaluation |
|---|---|---|---|
| bafyone | OK | OK | OK |
| bafyone | OK | OK | OK |
| bafyone | OK | ERROR_502 | MINORITY_RESULT |
| bafyone | ERROR_500 | IPNI_ERROR_500 | MINORITY_RESULT |
| Qm1234 | ERROR_404 | IPNI_ERROR_404 | OK |
| Qm1234 | ERROR_404 | IPNI_ERROR_404 | OK |
| Qm1234 | ERROR_404 | IPNI_ERROR_404 | OK |
| Qm1234 | ERROR_500 | ERROR_502 | MINORITY_RESULT |
Before the recent change that introduced “committees and majorities” (see roadmap#59), the measurements with MINORITY_RESULT evaluation were evaluated as OK. We would get the following RSR:
- Total accepted measurements: 8
- Successful retrievals: 2
- RSR: 25%
With the committees & majorities in place, we get the following RSR:
- Total accepted measurements: 5
- Successful retrievals: 2
- RSR: 40%
Note: I have recently fixed the RSR calculation to account for minority measurements. For the example described above, Spark produces RSR 25% again.
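For reference, the two variants of the calculation can be sketched like this (a toy model of the example table above, not the spark-evaluate implementation; the boolean encodes whether the retrieval succeeded):

```python
# Each measurement: (payload_cid, retrieval_succeeded, evaluation).
measurements = [
    ("bafyone", True,  "OK"),
    ("bafyone", True,  "OK"),
    ("bafyone", False, "MINORITY_RESULT"),
    ("bafyone", False, "MINORITY_RESULT"),
    ("Qm1234",  False, "OK"),
    ("Qm1234",  False, "OK"),
    ("Qm1234",  False, "OK"),
    ("Qm1234",  False, "MINORITY_RESULT"),
]

def rsr(ms):
    """Fraction of successful retrievals among the given measurements."""
    return sum(ok for _, ok, _ in ms) / len(ms)

# Before committees (and after the fix that counts minority results again):
# all 8 measurements are counted -> 2/8 = 25%.
print(rsr(measurements))
# Majority-only: drop MINORITY_RESULT rows -> 2/5 = 40%.
print(rsr([m for m in measurements if m[2] == "OK"]))
```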
Proposal
De-emphasize the retrievability score calculated from individual measurements until we implement checker node reputation, e.g. based on how often the node reports a majority result.
Introduce a new score - Deal Retrievability Score - that’s calculated at deal (task/committee) level only:
DRS = The fraction of committees targeting the SP, for which a majority could be found, and the majority result is a successful retrieval.
The algorithm above gives us a score based on retrieval tasks, not deals. We pick tasks randomly from all eligible deals. In short intervals, it’s very unlikely for the same deal to be tested twice, so we can assume task = deal. The longer the time period, the higher the probability that we will test a single deal multiple times. We have 36.8 million deals eligible for retrieval testing. With the current params (round=20 mins, 1000 tasks/round), Spark tests 72k deals daily. It takes about 511 days to test all deals. Even more days will be required as we start testing DDO deals.
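The coverage arithmetic above works out as follows (a back-of-the-envelope sketch using only the parameters quoted in this paragraph):

```python
# Parameters quoted above.
ELIGIBLE_DEALS = 36_800_000
ROUND_MINUTES = 20
TASKS_PER_ROUND = 1_000

rounds_per_day = 24 * 60 // ROUND_MINUTES          # 72 rounds/day
tasks_per_day = rounds_per_day * TASKS_PER_ROUND   # 72,000 tasks/day
days_to_cover = ELIGIBLE_DEALS / tasks_per_day     # ~511 days to test all deals

print(tasks_per_day, round(days_to_cover))  # 72000 511
```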
Using the data above, we will get the following deal-based score:
- Total deals tested where a majority result was found: 2
- Deals where majority agrees on retrievability: 1
- Score: 50%
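The proposed DRS calculation can be sketched as follows (a toy model; the committee-result encoding is my own, not Spark’s):

```python
def drs(committee_results):
    """committee_results: one entry per committee targeting the SP.

    Each entry is "OK" (majority says retrievable), "FAILED" (majority
    says not retrievable), or None (no majority could be found).
    Per the worked example above, committees without a majority are
    excluded from the denominator.
    """
    with_majority = [r for r in committee_results if r is not None]
    if not with_majority:
        return None  # no committee reached a majority
    return sum(r == "OK" for r in with_majority) / len(with_majority)

# bafyone: majority says retrievable; Qm1234: majority says not retrievable.
print(drs(["OK", "FAILED"]))  # 0.5
```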
How to interpret the new score
This new algorithm will produce scores that are almost black and white. If the SP has correctly configured IPNI announcements and retrievals, they should score 100% all the time, regardless of outlier Spark measurements caused by networking issues, intermittent IPNI outages, etc. Those outliers will be removed as being in the minority.
It’s still possible for an SP to earn a score other than 0% or 100%.
- Some SPs may decide to make only some deals hot and retrievable and keep other deals in cold storage only (keep only the sealed copy).
- Other SPs may tweak the IPNI integration to support Spark but start announcing only new deals created after that configuration change. Deals created before the config change won’t be retrievable; deals created after the config change will be retrievable.
In both cases, the Spark Deal Retrievability Score will tell us how large the group of “hot” deals is relative to the group of “cold” deals.
Ramifications
- It is more difficult to explain the link between Spark RSR and the Retrieval SLA. An SP serving 60% of Spark retrieval requests will most likely get the same Spark score as an SP serving 100% of Spark retrieval requests.
- However, there is a catch: if there are many clients performing retrievals, Spark retrievals are only a small fraction of all retrievals (let’s say less than 5%), and SPs cannot distinguish Spark retrievals from non-Spark retrievals, then they have to serve almost all requests equally well. Otherwise, all Spark requests could fall into the group of requests they did not serve.
- Let’s say 5% of retrievals are from Spark.
- If the SP decides to serve 90% of retrieval requests and drop (e.g. rate-limit) the remaining 10%, it’s possible that all of the Spark requests (5%) will fall into the set of requests that were dropped (10%).
- Another way to look at this: in which situations can the SP get a bad RSR even though they’re serving most of the requests?
- For an RSR of 99%, the server has to respond to 99.5% of requests
- For an RSR of 50%, the server has to respond to 40% of requests
- For an RSR of 25%, the server has to respond to 25% of requests
- The current status is not good either: we pretend that all measurements (retrieval results) are correct (reported honestly) and therefore the RSR calculated from those measurements is fair and meaningful. That’s not the case, for the reasons explained in the Problem Statement above.
- We started collecting deal-based scores on Jun 12th (see spark-evaluate#256). There is no way to backfill historical deal-based scores from measurement-based scores. Our charts won’t be able to show data before June 12th, unless we pretend that “measurement-based RSR” was the same thing as “deal-based RSR” before June 12th.
- Solution: We will simply announce that with the launch of committees, the way the score is evaluated has slightly changed. And we add an annotation into any charts.
- Also, note that we are keeping both the old RSR and new DRS, so the impact of this change is lower.
- We didn’t have committees & majorities until recently, therefore the first version of the deal-based score considers a deal to be indexed/retrievable if at least one (accepted/honest) measurement indicates that. Obviously, that’s very easy to cheat, therefore I don’t trust this data too much.
- The deal-based score is currently calculated for the entire dataset; we don’t have per-miner granularity.
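The rate-limiting scenario from the ramifications above can be illustrated with a simple worst-case bound (my own simplified model for the 5%/10% example; it is not the source of the 99.5%/40%/25% figures quoted above):

```python
def worst_case_spark_rsr(spark_share, dropped_share):
    """Worst-case Spark RSR when an SP drops a fraction of all requests
    and, in the worst case, the dropped set covers as much of Spark's
    traffic as possible.

    spark_share: fraction of all retrievals that come from Spark.
    dropped_share: fraction of all retrievals the SP refuses to serve.
    """
    served_spark = max(0.0, spark_share - dropped_share)
    return served_spark / spark_share

# Spark is 5% of traffic; the SP rate-limits 10% of requests.
# In the worst case, every Spark request lands in the dropped set.
print(worst_case_spark_rsr(0.05, 0.10))  # 0.0
```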
Implementation Plan
Backend
Since this is a major change, I propose to create new REST API endpoints for deal-based scores and keep the old data & API endpoints around to give more time to Spark data consumers to upgrade.
In particular, we need to create deal-based counterparts for these endpoints:
We also need to enrich the data about tested deals with miner_id.
- While we are making this change, I propose also adding client_id, since we know we want to provide a score per client and allocator in the near future (see roadmap#139). The sooner we start collecting the data about deal clients, the more history we will be able to provide once we implement the API & visualisation.
Optionally, we can modify existing endpoints to return a redirect to their newer counterparts.
Frontend
Then we need to rework the following dashboard panels to use the new deal-based score:
- Deals Advertising HTTP Retrievals
- Retrieval Success Rate
- Retrieval Status (Honest)
- IPNI Advertisements (Honest)
- CAR Size (Successful & Honest)
- Edit the info for “TTFB (Honest & Successful)” to mention these values are not based on majorities and cannot be fully trusted.
- Retrieval Success Rate
- Retrieval Success Rate (Non-Zero Miners only)
- Per-Miner Retrieval Success Rate
- Per-Miner Non-Zero Retrieval Success Rate
- Miners Performance
- Measurements (1d) and Measurements (7d): We should modify how we count accepted measurements to include minority results.
- Retrieval Success Rate
Adoption
- Write a blog post explaining what is changing and why (the rationale).
- Publicise the change & the blog post on social media (Twitter, Slack, etc.)
- Reach out to consumers of Spark data, help them understand the change and how they can migrate.
- It’s especially important to explain this change to people owning the allocator compliance review process.
- contains the information we need to spread.
- Rework our weekly infographics to correctly describe the new score we show.
- Unfortunately, we can no longer talk about “three nines retrievability SLA”, because we are no longer producing data for SLA of individual retrievals.
Open questions
- What should we call the new score? It’s not “Retrieval Success Rate”.
- How about simply Spark Score?
- A more descriptive alternative: Deal Retrievability Score (DRS).
- UPDATE: Since we decided we want to preserve both measurement-level and deal-level scores, we need a name that’s more specific than just “Spark Score”.
- Are there any documentation pages or website(s) we need to update?
- What to do about the lack of historical data?
- Obviously, we can say we don’t care about history and present only data collected after June 12th or even August 28th. However, this will make it difficult to show the RSR improvement achieved over the last year.
- We can use the measurement-based RSR to backfill the missing deal-based RSR. It’s not ideal (we would be selling apples as oranges), but it also seems to be the easiest way. From the perspective of charts showing only the RSR number, this difference can be easily explained: we improved the way we calculate the RSR number to make it more reliable.
Alternatives Considered
Deal-based RSR and Retrieval-Request-based RSR
Provide two RSR values:
- % of deals that are retrievable based on what the majority reports.
- % of retrieval requests that were successful. A “retrieval request” means a request to the SP. There are no retrieval requests for deals that are not advertised to IPNI.
Important: to calculate the % of successful retrieval requests, we need to look at measurements that were considered to be valid measurements but that we later rejected because they reported a result that’s in minority.
A storage provider that serves retrievals well should score:
- 100% of deals are retrievable
- The majority of retrieval requests are handled. We don’t expect SPs to achieve a 100% score, because some requests can fail for reasons outside of the SP’s control, as explained in the Problem Statement above.
Note that it’s possible to have a low Deal-based score and a high Request-based score, therefore it’s crucial to always look at DRS first and treat RRSR as a secondary metric only.
- Let’s say an SP stores 100 deals, serves retrievals for one deal only, and serves these retrievals very well.
- In that case, DRS will be 1% and RRSR will be >99.9%.
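The DRS-vs-RRSR divergence in this example can be sketched as follows (a toy model with hypothetical request counts; the “cold” deals receive no requests at all because they are never advertised to IPNI):

```python
def scores(deals):
    """deals: list of (retrievable, requests_served, requests_total).

    Returns (DRS, RRSR): the fraction of retrievable deals, and the
    fraction of retrieval requests that were served.
    """
    drs = sum(retrievable for retrievable, _, _ in deals) / len(deals)
    served = sum(s for _, s, _ in deals)
    total = sum(t for _, _, t in deals)
    rrsr = served / total if total else None
    return drs, rrsr

# One hot deal serving 1000/1000 requests; 99 cold deals with no requests.
print(scores([(True, 1000, 1000)] + [(False, 0, 0)] * 99))  # (0.01, 1.0)
```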
Downsides:
- We cannot trust individual measurements, for the reasons outlined in the Problem Statement above. We have no guarantee that the calculated RRSR value matches reality.
- The difference between the current RSR and the new RRSR is very subtle. People will be confused, and we will spend a lot of time explaining it.
Rework existing RSR endpoints to return deal-based scores
I am concerned about the confusion this can create.
- Consumers of these APIs will start receiving subtly different data.
- The existing endpoints were named after retrieval-based scores. Returning deal-based scores will be confusing to future users.
On the other hand, especially if we decide to backfill the historical data using measurement-based scores, reworking the existing endpoints to use the new data would require the least amount of effort on the consumer side (our frontends, Spark data consumers).
Drop existing RSR endpoints immediately
Instead of deprecating the current endpoints and keeping them around, we can simply drop them. This will reduce our future maintenance work, but it will create urgent work items for everybody consuming Spark data (e.g. FIL+ allocator compliance tooling).
Use Deal based scoring as reputation for Spark checkers
to do