Spark Table Temp

When measurements are submitted to the Spark API, they are stored in Storacha and committed on chain. Spark Evaluate then runs logic to determine which measurements are valid and contribute to the Spark RSR (Retrieval Success Rate) and which are not valid. The following table shows which retrieval result codes contribute to the RSR calculation.

| Retrieval Result Code | Situations That Yield This Result | Contributes to RSR? |
|---|---|---|
| OK | The retrieval was successful. | YES |
| RPC Failure | The JSON RPC service is (1) unreachable, (2) not returning results, or (3) returning empty results for a storage provider's peerId, after multiple (5) timeout-bound attempts with exponential backoff. | NO |
| TIMEOUT | A request is marked as timed out if no progress has been made for 60 seconds, where progress means the connection has been established or a byte has been received. | YES |
| CAR_TOO_LARGE | The root block is bigger than 200 MB.<br>Notes:<br>• At least as of 2024-10-21, this error isn't observed.<br>• If it feeds into RSR, it is a way to poison an SP's stats by doing a deal with that SP for a payloadCid that is "too large".<br>• If it is discarded from RSR, it is a way for an SP to go into "ghost mode" (no metrics reported). | YES |
| UNSUPPORTED_MULTIADDR_FORMAT | (Trustless HTTP GW retrievals only.) The checker node was not able to convert the multiaddr found in the IPNI advertisement to an HTTP(S) URL. | YES |
| HOSTNAME_DNS_ERROR | (Trustless HTTP GW retrievals only.) When making an HTTP request to the address found in the IPNI advertisement, the checker node was not able to resolve the hostname to an IP address. | YES |
| CONNECTION_REFUSED | (Trustless HTTP GW retrievals only.) When making an HTTP request to the address found in the IPNI advertisement, the remote server refused the incoming connection. | YES |
| UNKNOWN_FETCH_ERROR | (Trustless HTTP GW retrievals only.) The checker could not make the HTTP request for reasons we don't recognise yet. (Recognised reasons: HOSTNAME_DNS_ERROR and CONNECTION_REFUSED.) | YES |
| UNSUPPORTED_CID_HASH_ALGO | The checker could not verify that the received content matches the requested CID because the CID uses a hash algorithm we don't support (yet). | YES |
| CONTENT_VERIFICATION_FAILED | The payloadCid bytes received from a storage provider don't match the CID. | YES |
| UNEXPECTED_CAR_BLOCK | The CAR file returned by the provider includes blocks other than the requested one. | YES |
| CANNOT_PARSE_CAR_FILE | The provider returned a malformed/corrupted CAR file. | YES |
| IPNI_NOT_QUERIED | An outdated (or fraudulent) checker node did not report the result of the IPNI query in the measurement it submitted. | NO |
| IPNI_NO_VALID_ADVERTISEMENT | After calling IPNI to obtain all retrieval providers and filtering the response to find the provider being checked, the provider isn't present. | |
| IPNI_ERROR_FETCH | The HTTP request to IPNI failed at the networking level (e.g. DNS resolution failed, connection refused, etc.).<br>Reference: https://github.com/filecoin-station/spark/blob/8214ca93fd6cbab79e14e31bbaa2e22b584fb587/lib/ipni-client.js#L16-L29 | YES |
| IPNI_ERROR_{number} | IPNI responded with HTTP status code {number}.<br>IPNI_ERROR_404 means nobody advertised retrievals for this payload CID. (This is similar to IPNI_NO_VALID_ADVERTISEMENT described above.) | YES |
| HTTP_{number} | When the checker requested the CAR bytes using the Trustless HTTP Gateway protocol, the server (storage provider) responded with HTTP status code {number}. Example codes: HTTP_502, HTTP_504. | YES |
| LASSIE_{number} | When the checker made an HTTP request to the local Lassie daemon handling Graphsync retrievals for Spark, Lassie responded with HTTP status code {number}. Example codes: LASSIE_502, LASSIE_504.<br>Documentation for Lassie HTTP response status codes: https://github.com/filecoin-project/lassie/blob/main/docs/HTTP_SPEC.md#response-status-codes | YES |
| LASSIE_504 | Lassie encountered a timeout while fetching the data from the storage provider. This does not happen in practice, because we configure Lassie's timeout to one day (24 hours) and trigger the timeout on the Spark side after 60 seconds. | YES |
| UNKNOWN_ERROR | The retrieval check failed for a reason different from the ones described above. (This is a generic "fallback" code.) | YES |
| CommitteeCheckError | | NO |
| Checker to Spark Publish Failure | | NO |
| | IPNI is not responding with results within the timeout / retry allotment. | YES |
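As a rough illustration of the table's YES/NO column (not the actual spark-evaluate code; the function name and the exact list of excluded codes below are assumptions derived from the table), the contributes-to-RSR decision could be sketched as:

```javascript
// Sketch only: decides whether a retrieval result code feeds into RSR,
// per the table above. Names are hypothetical, not spark-evaluate's API.
const NON_CONTRIBUTING_CODES = new Set([
  'RPC_FAILURE',             // JSON RPC service unreachable / empty results
  'IPNI_NOT_QUERIED',        // outdated or fraudulent checker node
  'COMMITTEE_CHECK_ERROR',   // CommitteeCheckError
  'CHECKER_PUBLISH_FAILURE'  // checker-to-Spark publish failure
])

function contributesToRsr (resultCode) {
  if (NON_CONTRIBUTING_CODES.has(resultCode)) return false
  // Parameterised families such as IPNI_ERROR_{number}, HTTP_{number}
  // and LASSIE_{number} all contribute, as does the generic UNKNOWN_ERROR.
  return true
}

console.log(contributesToRsr('OK'))          // true
console.log(contributesToRsr('HTTP_502'))    // true
console.log(contributesToRsr('RPC_FAILURE')) // false
```

This makes the default explicit: unless a code is on the small exclusion list, the measurement counts against (or for) the SP's RSR.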
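The TIMEOUT row above describes a "no progress for 60 seconds" rule rather than a fixed overall deadline. A minimal sketch of that behaviour, assuming a watchdog timer that is reset on every progress event (the names here are illustrative, not Spark's actual implementation):

```javascript
// Sketch of the 60-second "no progress" timeout described in the table.
// The request is only aborted if neither connection establishment nor a
// received byte happens within the window; every progress event resets it.
function createProgressWatchdog (onTimeout, windowMs = 60_000) {
  let timer = setTimeout(onTimeout, windowMs)
  return {
    // Call this when the connection is established or a byte is received.
    progress () {
      clearTimeout(timer)
      timer = setTimeout(onTimeout, windowMs)
    },
    stop () { clearTimeout(timer) }
  }
}
```

Under this rule a slow but steadily progressing download never times out, which matches the table's definition of progress.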

Per Request (non-committee) Score vs. Committee Scoring

| | Non-committee (per request) scoring | Committee scoring |
|---|---|---|
| Description | Every request made by a Spark checker feeds into the metric (after basic "fraud" detection), regardless of whether the checker's result aligns with its committee.<br><br>For every `<round, providerId, payloadCid>`, the number of datapoints feeding into the SLI should be close to the size of the committee. Committee size p50 is ~80, but it ultimately depends on which nodes participate in the round. The aim is to have most committee sizes in the range of 40 to 100. | Only the committee's honest-majority result for a `<round, providerId, payloadCid>` feeds into the metric.<br><br>For every `<round, providerId, payloadCid>`, there is a single datapoint feeding into the SLI. (This is ~80x fewer datapoints than the non-committee case.) |
| Currently used by FIL+ | Yes | No |
| What it is sensitive to vs. what it obscures | Con: if a bad actor controls 1% of the Spark checkers and they always report that SPs fail, they can bring every SP's Spark RSR down by 1 percentage point. That seems too sensitive.<br><br>Pro: assuming most checkers are honest, it differentiates SPs that respond to 80% of requests with valid/correct results from those that respond to 99+%. | Pro: bad actors would need 51% of the Spark checkers in a given randomly generated committee to have an impact.<br><br>Con: in the worst case (from a visibility standpoint), an SP could fail 49% of retrieval requests but still show a 100% committee retrieval success rate. No one could credibly advertise "store with us — 99.9% retrieval success!" when only one of every two retrieval requests actually succeeds. |
| Mitigations | 1. Spark checkers are only paid if their result matches the honest majority. (That said, since the payout is small, this is not a big incentive.)<br>2. All results in a `<round, providerId, payloadCid>` from the same IPv4 /24 subnet are discarded as part of "fraud detection". This means an attacker can't simply spin up a plethora of nodes on one machine or their local network, but would need to distribute across multiple IP addresses. (Note that in 2024-10, the Spark dashboard shows ~7k daily active checkers.) | |
| Web2 cloud analog: S3 produces an "error rate" for each 5-minute bucket, a value between 0 (no errors) and 1 (all requests "errored"). It then calculates an uptime percentage based on the average "error rate" across all 5-minute periods in the month. | Sort of. We could calculate a value for each 5-minute period, and it would be a value between zero and one. That said, to ensure there are enough datapoints, we publish an SP's "Spark retrieval success rate" once per day. | No. If we view committee rounds as similar to S3's 5-minute buckets, the stark difference is that Spark committee scoring yields 0 or 1, not a value in between. |
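To make the contrast concrete, here is a toy sketch of the two scoring modes for a single `<round, providerId, payloadCid>` committee. The function names are assumptions for illustration, and fraud filtering (e.g. the /24 subnet rule) is omitted:

```javascript
// Toy sketch contrasting the two scoring modes for one committee.
// `measurements` is an array of booleans: did each checker's retrieval succeed?

// Per-request scoring: every accepted measurement is its own datapoint.
function perRequestDatapoints (measurements) {
  return measurements.map(ok => (ok ? 1 : 0))
}

// Committee scoring: the honest-majority result is the single datapoint.
function committeeDatapoint (measurements) {
  const successes = measurements.filter(Boolean).length
  return successes > measurements.length / 2 ? 1 : 0
}

// An SP failing 49% of requests: per-request RSR ≈ 0.51, but the
// committee datapoint is 1 — the visibility concern from the table.
const m = Array.from({ length: 100 }, (_, i) => i < 51)
console.log(perRequestDatapoints(m).reduce((a, b) => a + b, 0) / m.length) // 0.51
console.log(committeeDatapoint(m)) // 1
```

This is why committee scoring obscures partial failure: any success rate above 50% within a committee collapses to a perfect score for that `<round, providerId, payloadCid>`.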

FAQ