SPARK Content retrieval attestation
In SPARK, we want to reward Station instances for periodically making retrieval requests to check the availability of content stored by Storage Providers (SPs). The reward function is based on the number of checks performed. There are various attack vectors we need to prevent to avoid abuse. One of them is a cheating client that does not make any retrieval requests but simply reports fake retrieval metrics. In this document, we design a solution based on signature chains that allows third parties (e.g. the MERidian measurement service) to verify that the SPARK client attempted a retrieval from the given SP.
See also
Workflow of a single retrieval check performed by SPARK
The current version of SPARK (Storage Provider Retrieval Checker) performs the following steps for each retrieval check:
- The SPARK orchestrator defines a new retrieval checking job. The job record has several fields, among others:
  - a unique `job_id`
  - the `cid` of the content to retrieve (`bafy...`)
  - the `address` to retrieve the content from (`/ip4/211.254.148.138/tcp/8180/p2p/12D3KooWHeLUGxJsnsCsHnNW7CpvzumuDVq6vt9NWinUAXtFyD6H`)

  In the future, we want to replace the orchestrator with a smart-contract-driven solution. The important part is that the network assigns the `(cid, address)` pair to the checker in a random & uniformly distributed way; the checker does not have any control over that selection, and SPs cannot predict what CIDs will be checked (e.g. by reading data of the scheduling smart contract). Different design options are discussed here:
- The SPARK module running inside Filecoin Station (SPARK checker) retrieves the given CID from the given address using the HTTP protocol, using the Lassie HTTP interface under the hood.
- The SPARK checker reports retrieval results to the SPARK orchestrator (MERidian measurement service).
```mermaid
sequenceDiagram
    participant SparkNode as SPARK Checker
    participant SP as Storage Provider
    box Cyan Private & centralised services operated by SPARK
        participant Orchestrator as SPARK Orchestrator
        participant SparkDB as SPARK DB
    end
    loop every 10 seconds
        SparkNode ->> Orchestrator: give me a new job
        Orchestrator ->> SparkDB: create a new job from a random template
        SparkDB -->> SparkDB: choose a random (cid, address) template
        SparkDB -->> SparkDB: create a new job record with a unique job_id
        SparkDB ->> Orchestrator: (job_id, cid, address)
        Orchestrator ->> SparkNode: (job_id, cid, address)
        SparkNode ->> SP: retrieve CID
        SP ->> SparkNode: (CAR stream, retrieval attestation)
        SparkNode ->> Orchestrator: (job_id, retrieval metrics, attestation)
        Orchestrator ->> SparkDB: update the job record
    end
```
Fraud detection
MERidian measurement & evaluation service periodically processes retrieval reports to calculate the impact of each Station and assign rewards. As part of the evaluation step, we want to detect fraudulent behaviour.
See
```mermaid
sequenceDiagram
    participant FraudDetection as SPARK/MERidian Fraud Detection
    participant SparkDB as SPARK DB
    loop every MERidian Evaluation epoch
        FraudDetection ->> SparkDB: get job details
        SparkDB ->> FraudDetection: (job_id, cid, address, metrics, attestation)
        FraudDetection -->> FraudDetection: validate retrieval attestation
        FraudDetection ->> SparkDB: flag fraudulent jobs
    end
```
Attestation verification
The SPARK Fraud Detection service has the following data fields available for each job (retrieval check):
- job id
- CID
- address
- protocol (hard-coded to HTTP)
- the public key of the SPARK Checker instance initiating the retrieval
- the public key of the SP handling the retrieval (peer id from the IPNI record)
- retrieval attestation the client obtained from the SP and calculated using the retrieved data
The attestation token must be created in a way that allows our Fraud Detection service to recreate all inputs using the available data listed above.
Importantly, the scheme must not require Fraud Detection to have access to the content of retrieved CAR files to be able to verify the attestations.
What we want from Boost Retrieval Attestation
We want the MERidian evaluation service to be able to verify that the SPARK module performed the retrieval check as it was defined by the orchestrator.
We want a generic solution allowing Boost to provide retrieval attestations, a solution that can also be used by other projects retrieving content from Filecoin and IPFS. This is particularly important to prevent SPs from being able to distinguish SPARK retrieval requests from requests made by other clients.
GitHub discussion in Boost repo:
Proposed solution - V3
The solution has two parts:
- Trustless GW implementations (Boost, Frisbii) will append a validation block to CAR stream responses.
- SPARK checkers will calculate a proof of inclusion of a range of bytes. The range will be chosen using verifiable randomness.
Retrieval validation block
At the moment, when a retrieval client asks for data of a CID, the server returns a CARv1 stream in the response body.
We want the server to append one additional CAR block at the end of the CARv1 stream.
- This extra validation block will be a valid CAR block: `[ varint | CID | block ]`
- The CID will use a newly defined multicodec to indicate that this block should be interpreted as EOF.
- The exact format of the block payload will be determined later. We want to include the following fields:
- Request query: what CID was requested, what subset of Merkle tree was selected, etc.
- Data length: the size of the returned CAR stream in bytes, excluding the size of the validation block.
- Blake3 hash: a Blake3 hash of the CAR stream, excluding the validation block.
Note: we can use a different hash, as long as it supports (subtree) inclusion proofs. The SHA family of hashes will NOT work.
- Signature of the previous three fields using SP’s (Boost worker’s) private key.
Note: the public key for validating the signature can be obtained e.g. from IPNI records or by pinging the worker via libp2p.
The data in the validation block is intentionally client- and request-agnostic. It provides metadata about the request used to build the CAR and the party that produced this CAR.
We believe it’s ok to leave this extra block in the CAR file when uploading it to other services like web3.storage. However, we also expect tools like w3up to recognise the new codec and strip blocks with that codec during ingestion.
Previous art
This proposal extends a design that was discussed earlier this year. That proposal is based on even older work discussed in https://github.com/ipfs/specs/pull/332#pullrequestreview-1275495698
In particular, the following comments seem still relevant to me:
- null objects are a very nasty way to go.
- having a tombstone (ie some car files have it and some dont) is super messy and likely to lead to mysterious errors.
- it also opens a surface area of attack: now everyone has to check that we arent linking to the null object from some object, because if we are, then it will terminate a car stream early, and we have to do that in a ton of places (think the lovely wonders of `\0` in string operations). this is really nasty and you can guarantee that it will not be running _correctly everywhere_, so you can guarantee that it will be run incorrectly somewhere, and also have to deal with that.
- null termination is just a world of pain, and why ipfs went the way of self-describing objects. usually, most problems can indeed be turned into self-describing structures, even streams of unknown length. -- this is usually done by introducing a wrapper object that contains the information you want.
- for example, define a format like a `car-stream` that includes a header in between every object (similar to `tar`) and that header can signal the end of a `car-stream`. you would have a 1-1 from any `car` to any `car-stream`, and make very explicit what is data and what is control plane.
Discussion on Slack:
Injection attack
Having a special block indicating EOF opens a door to injection attacks. A malicious party can inject an EOF block in the middle of a CAR file. Nodes not aware of the special EOF semantics will transmit the entire CAR file. Nodes understanding the special EOF semantics will stop processing the CAR stream when they encounter the first EOF block.
Spark workflow
- SPARK checker performs a retrieval check for `(cid, address)` and reports `(data length, blake3 hash, signature)` to SPARK Orchestrator.
- At regular intervals (e.g. at the end of every measurement epoch), the SPARK Fraud Detection service:
  - Validates the `signature` submitted by the checker, using the public key of the identity associated with the `address` in the IPNI record.
  - Verifies that the signature is for the expected attestation block (i.e. the `cid` matches the job definition, etc.)
  - Uses an Honest Majority scheme to get good-enough confidence that the `data length` and `blake3 hash` reported by the checkers are accurate.
Proof of data possession (kind of)
- After the SPARK checker performs the retrieval and obtains `(data length, blake3 hash)`:
  - It uses verifiable randomness and a private piece of information to pick a position within the CAR stream.
    - Public inputs: job id assigned by the orchestrator/smart contract, DRAND beacon
    - Private input: the private key for the FIL wallet
    - Output: `sign(job_id + drand, private_key)`
  - The position is calculated as `signature mod data_length`.
  - The checker computes an inclusion proof showing that it knows the data in the CAR stream at the position computed in the previous step, all the way to the root node of the Blake3 hash tree.
  - The checker submits the signature and the inclusion proof to SPARK using time-lock encryption to ensure nobody else can read the proof until the current epoch is over.
    - This allows us to implement incentives for nodes to report if a misbehaving checker is trying to bypass the system by leaking the signature or even the private key.
- At the end of the measurement epoch, the SPARK Fraud Detection service (or a smart contract) verifies whether the inclusion proof is valid.
  - Verify that the signature used to determine the position was created for the expected payload `(job_id + drand)` and signed by the checker’s public key.
  - Verify the inclusion proof provided by the checker (this is a built-in feature offered by Blake3).
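The position derivation in the steps above can be sketched as follows. This is a minimal illustration; interpreting the signature bytes as a big-endian integer is an assumption, since the exact byte interpretation is not specified in the proposal.

```javascript
// Interpret the checker's signature over (job_id + drand) as a big
// integer and reduce it modulo the CAR stream length to obtain the
// byte offset that the Blake3 inclusion proof must cover.
function challengePosition(signatureBytes, dataLength) {
  const sigAsInt = BigInt('0x' + Buffer.from(signatureBytes).toString('hex'));
  return Number(sigAsInt % BigInt(dataLength));
}
```

Because the signature is unpredictable without the checker’s private key, neither the checker nor the SP can know the challenged position before the retrieval happens.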
Proposed solution - V2
Expand to read the second proposal that is outdated now
- Boost will provide retrieval attestation tokens for all requests that include the `X-Request-Id` header.
- The attestation token will be sent in response headers before the actual content.
  - Ideally, we would like to send the attestation in response trailers only after the entire content (response body) is transmitted. Unfortunately, response trailers are an HTTP feature not widely supported by the ecosystem.
    - Browser Fetch API does not provide API to access response trailers. https://github.com/mdn/browser-compat-data/issues/14703 https://github.com/whatwg/fetch/issues/981
    - Reverse proxies may not forward the trailer headers. The nginx versions from 2020 were stripping response trailers.
    - Discussion in #browsers-and-platforms: https://filecoinproject.slack.com/archives/C02EQ3ELFBQ/p1688140535583309?thread_ts=1687459377.818719&cid=C02EQ3ELFBQ
  - Another option is to add the attestation to the CAR payload as a special block. A similar proposal was discussed in great depth in . Two important arguments:
    - DAG House wants to compute a hash of the CAR file and get the same hash for the same CID→CAR retrieval (essentially).
    - People often upload the retrieved CAR file to other services. If the CAR file includes the attestation, this attestation will spread across the IPFS network.
- The attestation token will be binary encoded and optimised for size.
There are two variants we want to discuss further. One uses a `nonce` provided by the client in a request header; the second uses a `nonce` generated by the server using the current time. Once the server has the nonce value, it will create the retrieval attestation using the following algorithm.

Attestation process
- Create a JSON representation of the following object using compact serialization with no extra whitespace. The example below is pretty-formatted for better readability.
```json
{
  // The value as provided by the client or generated by the server
  "nonce": "...",
  // The HTTP verb requested by the client
  "verb": "GET",
  // The path requested by the client, excluding the query string
  "path": "/ipfs/bafybeihyrijbpa4ge4dv7ozuwuaz4vkx54ggkemdv3i55ovl262roji7au",
  // Query string parts, sorted lexicographically by the key.
  // Use an empty object when there is no query string.
  "query": {
    "format": "raw"
  },
  // Request headers that affect how the server processes the request
  // and/or what response content it returns. This set always includes
  // the "Accepts:" header. Depending on the GW spec implemented by the server,
  // the set of headers to sign may be larger (e.g. cache control).
  // All header names are normalised to lowercase and sorted lexicographically.
  // Use an empty object if there are no headers modifying the GW behaviour.
  "headers": {
    "accepts": "application/vnd.ipld.car"
  }
}
```

Discussion points:
- Should we use CBOR instead of JSON? CBOR will give us a more compact representation and possibly better performance in Go (Boost side). However, CBOR is more difficult to use in JavaScript (SPARK side).
- The payload includes query string and headers to allow us to verify that the client did not try to craft the retrieval request so that the response will not include the entire content.
- Format (raw, CAR, etc.) can be specified via query string and headers
- The range header affects what is returned
- HTTP caching headers may affect the response too (e.g. the server returns `304 Not Modified` instead of the content).
- More options can be added later.
I feel this is a bit ugly and possibly brittle. If the server upgrades to a new version that recognizes a new header that is already used by the SPARK client but was ignored before, then the attestations provided by this new server version will no longer be considered valid by the SPARK fraud detection service.
Maybe we should rework the way how the payload is defined, replace the open-ended list of headers & query string values with a strictly defined set of flags describing what kind of retrieval was performed, and ask GW implementations to map their behaviour to these flags.
- Use the server’s Ed25519 identity (as advertised to IPNI) to sign the payload created in the previous step. This gives us a 512-bit signature (64 bytes).
- Prepend the signature with a single-byte version field, using the value `0x01`. Encode the resulting binary data using base64url encoding and prefix the result with `u` to signal the multibase variant used. The token will be a string ~89 bytes long.

  `'u' + base64url.encode(concat([0x01], signatureBytes))`

  What is Base64URL encoding? Base64 output includes special characters `+/=` that must be quoted/encoded in certain contexts. Base64URL omits padding and replaces `+/` with `-_` characters.

  Discussion point: Should we encode the attestation token using multiformats to get a self-describing value? That would require us to register the token format with multiformats and get a new id assigned. Alternatively, we can use multiformats only for the signature, to encode the algorithm used (e.g. Ed25519).
- Add a header field with the name `X-Attestation` and the attestation token as the value. This adds up to 106 bytes to the response size.
Verification process
- Obtain the public key of SP’s identity.
- Determine the expected `nonce` value that should be used for the attestation.
  - When using a client-provided nonce, the system can expect the client to generate the nonce in a deterministic way. For example, SPARK retrievals can create the nonce from the unique job id provided by SPARK Orchestrator.
  - When using a time-based nonce generated by the server, determine when the retrieval was expected to happen. This can be tricky!
- Determine the expected properties of the retrieval request - HTTP verb and path, query string, and so on.
- Build the object describing the request metadata; see the previous section. Convert the object to the binary payload.
- Decode the attestation token provided by the retrieval client. Verify that the version field is `0x01`.
- Verify that the signature is valid for the binary payload created in step 3 and the signature was created using the private key associated with SP’s public key as obtained in step 1.
Variants of nonce
Variant 1 - a client-provided nonce
In this variant, the client making the request sends a unique nonce in a request header (e.g. `X-Request-Id`, which is often used for distributed tracing).

Variant 1A: the token should be a SHA-384 hash of a unique payload.
- If all clients use the same hashing function to produce the nonce, the server (SP) cannot discriminate different classes of clients based on the nonce format.
- SHA-256 is vulnerable to length extension attacks. Blake3 would be a great solution, but it’s not supported natively by browsers yet. SHA-384 seems to be a good compromise - it’s not vulnerable to length extension attacks, and it’s widely supported (Go, Rust, and the WebCrypto API in both browsers and Node.js).
- A Base58-encoded representation of SHA-384 hash has ~67 bytes. Together with the header name `X-Request-Id: ` and CRLF delimiter, this adds ~83 bytes to all requests.
Variant 1B: the token is a UUIDv4 string.
- Lassie already uses this format; see X-Request-Id header spec.
- UUIDv4 is recommended by other places, e.g. https://http.dev/x-request-id
- The typical textual representation of UUIDs requires 36 bytes, which is less than half of the size needed by Base58-encoded SHA-384 hashes. The entire header will add ~52 bytes to all requests. However, Lassie already sends UUIDv4 request id in all requests; therefore, Lassie retrieval requests will be unchanged.
- Downside: it may be more difficult to link a UUIDv4 string to a retrieval job defined in a decentralized way, e.g. by a smart contract. Such a smart contract would not only need to generate what kind of retrieval to check but also assign a UUIDv4 value to that retrieval spec.
OTOH: I think this should be a solvable problem, and we don’t need to worry about it yet. Considering the other benefits, I am leaning towards preferring this option over the others.
Variant 2 - nonce based on time
- The server generates the nonce as a timestamp, e.g. the number of seconds elapsed since the Unix epoch.
- There is a tradeoff we need to carefully consider:
- If we use a high precision like seconds, then it’s difficult for verifiers to know which timestamp value was used as the nonce - they may need to try many different values to find the right one.
- If we use a low precision like hours, then a single attestation can be reused for multiple requests as long as they happen within the same window.
- An alternative that does not require the verifier to guess the timestamp: encode the timestamp in the attestation token.
- Use the Unix time (seconds since Unix Epoch) as the timestamp value, represented as a 64-bit signed integer (this seems to be the consensus).
- When building the object to sign, convert the timestamp number into a string and use it as the `nonce`. Some languages, like JavaScript, do not support 64-bit precision for “regular” integers, especially when parsing JSON data.
- Create the 512 bits long signature (64 bytes) described in the .
- Build the attestation binary token as `<0x01><timestamp><signature>` (1+8+64 bytes) and encode it using Base58 encoding. The new token will be ~101 bytes long, as compared to ~81 bytes when using the signature only.
- When verifying the attestation:
- Decode the attestation to obtain the time when the retrieval was performed.
- Check that the time in the attestation token is roughly around the time when the retrieval was expected to happen or that retrievals expected to happen at different times are not reusing attestation tokens with the same timestamp.
- I am leaning towards encoding the timestamp in the attestation token. It adds only ~12 more bytes while making verifying the signature's validity significantly easier.
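A sketch of the `<0x01><timestamp><signature>` layout described above (the final Base58 encoding step is omitted here, since Node has no built-in Base58 codec; the field layout follows the proposal):

```javascript
// Pack the version byte, the 64-bit signed Unix timestamp (big-endian),
// and the signature into a single binary token.
function encodeTimestampToken(timestampSeconds, signatureBytes) {
  const body = Buffer.alloc(1 + 8 + signatureBytes.length);
  body[0] = 0x01;                                    // version field
  body.writeBigInt64BE(BigInt(timestampSeconds), 1); // 64-bit signed Unix time
  Buffer.from(signatureBytes).copy(body, 9);
  return body;
}

// The verifier decodes the token to recover the claimed retrieval time.
function decodeTimestampToken(token) {
  return {
    version: token[0],
    timestamp: Number(token.readBigInt64BE(1)),
    signature: token.subarray(9),
  };
}
```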
Comparison of variants 1 and 2
Each variant has its own pros and cons.
- A client-provided `nonce` lets us link the attestation to the client and the particular retrieval request. It makes SPARK fraud detection easier as there are fewer attack vectors - it’s impossible to reuse the same attestation for more than one retrieval check.
  - However, we need Lassie to send the new request header in all HTTP requests. Otherwise, SPs can distinguish retrievals from SPARK from other retrievals.

    Note: Lassie sends a unique user-agent string that allows SPs to detect which retrieval requests are coming from Lassie. As a result, we don’t need other retrieval clients to adopt our new `X-Request-Id` header. Even if the clients were sending it, the servers could still discriminate based on the user agent.

    Note 2: Lassie is already sending a unique UUIDv4 value in the `X-Request-Id` request header if the caller did not provide any.
- A server-generated `nonce` does not require any changes on the client side.
  - However, it does not link the attestation to any particular client and retrieval. It also leaves a small opportunity for reusing attestation tokens across multiple requests, depending on the precision of the timestamp used. Finally, it makes it more difficult to verify the correctness of the attestation, as the verifier needs to use heuristics to decide whether the timestamp matches the job being verified.
I am leaning towards client-provided nonce. The time-based nonce adds extra complexity to verifying attestations, while it does not seem to bring significant advantages over the client-provided nonce (IMO).
Proposed solution - V1 - OUTDATED
Expand to read the first proposal that is outdated now
Retrieval check
- When the SPARK module makes a retrieval request, it includes a unique token in the request headers.
The token is created as a SHA-384 hash of the following payload:
- A globally unique prefix to avoid collisions with other projects. This way, we can use a value that’s unique only to our project as the input for the hash function. Example prefix: `app.filstation.spark`
- A value that’s unique for each retrieval request (job), e.g. the `job_id` created by the orchestrator
The value sent to the SP is a SHA-384 hash of the payload described above.
Why not SHA-256? SHA-256 is vulnerable to length extension attacks. Blake3 would be a great solution, but it’s not supported natively by browsers yet. SHA-384 seems to be a good compromise - it’s not vulnerable to length extension attacks, and it’s widely supported (Go, Rust, and the WebCrypto API in both browsers and Node.js).
Example implementation in JavaScript:
```javascript
async function create_request_token(job_id) {
  const encoder = new TextEncoder();
  const payload = encoder.encode(`app.filstation.spark.${job_id}`);
  // `await` requires the enclosing function to be declared `async`
  const token = await crypto.subtle.digest("SHA-384", payload);
  return token;
}
```
- The party handling the retrieval request (typically a Boost worker) takes the token from the request and creates the following payload:
```json
{
  "request_id": "<token provided by the client>",
  "cid": "<requested cid>",
  "protocol": "<protocol used for this interaction>",
  "selector": "<optional: the IPLD selector describing the retrieved subtree>"
}
```

Note: I expect the payload structure is going to change as part of the discussion with the Boost team, also based on what’s possible for different protocols.
Next, the storage provider signs this payload using its private key. Finally, the payload and the signature are encoded using base64 and concatenated into a single attestation string: `PAYLOAD.SIGNATURE`

- Discussion point: should we follow the JWT and UCAN format that has three segments - `HEADER.PAYLOAD.SIGNATURE`? The header includes metadata like the version of the format used.
This attestation is returned back together with the requested CAR stream.
For HTTP-based transports, we can send the request token in the `X-Request-ID` request header that’s commonly used, if we can convince retrieval clients like Lassie to use a SHA-384-hashed value for requests not originating from SPARK.
- The SPARK module sends the attestation together with other retrieval telemetry to the SPARK API.
Verification
We have the following data fields available:
- job id
- cid
- address
- protocol
- public key of the SPARK instance initiating the retrieval
- public key of the SP handling the retrieval (peer id in multiaddr)
- attestation from SP (ATT)
Verification process:
- Compute the token payload `app.filstation.spark.${job_id}`
- Compute the token hash that was supposed to be sent to SP.
- Parse the ATT payload and verify that `request_id` matches the token hash, and that `cid` and `protocol` match the job definition. I think we can either skip `selector` validation or check that it’s the default selector used by Lassie, since SPARK does not specify any selector ATM.
- Verify that the ATT signature is a valid signature of the ATT payload and that the signature was created by the SP that was supposed to handle the retrieval request.
What attack vectors are impossible now
Dishonest clients + honest SPs
- For each SPARK job, the client must make a new retrieval request using the specified `job_id`, `cid` and `protocol`, connecting to the SP having the specified peer id. Clients cannot reuse signature chains from other jobs because each chain depends on a unique job id.
- Even if the Station operator deploys multiple SPARK clients using the same identity, these clients still cannot reuse signature chains because each job has a unique id assigned by SPARK API and therefore each job produces a different signature chain.
Dishonest SPs + honest clients
- SPs don’t have much room for cheating here. Since each request token is unique, they must create a new attestation token for each request.
Colluding parties
- ??
What attack vectors are still possible
- Clients can connect to the SP/Boost worker using a different network address than was specified in the job description as long as the remote party uses the same peer id.
- It’s not possible to verify that the client performed the retrieval request at the time when the job was defined. This should not be a problem, though, because we know the time when the job outcome was reported and can use that timestamp to ensure the client performed the retrieval check at the right time.
- If a single party controls both the SP and the SPARK client, then they can share the private key used to sign the retrieval attestation. The SPARK client can skip the retrieval request and build the attestation locally.
- However, this is mitigated by other parts of our design. It’s the orchestrator that assigns jobs to SPARK clients, randomly sampling from all SPs. The likelihood of a SPARK client receiving many retrieval jobs for the same SP is very low.
- It is not possible to verify that the client retrieved all content. A cheating client can abort the retrieval request once it receives the attestation.