Spark Protocol

This document aims to describe the entire workflow of SPARK retrieval checks, based on the parts discussed in , and other related documents. See also

Context

SPARK will operate within the MERidian framework. We want to be decentralised, with no single party in charge of any component.

Overview

At a high level, the protocol is split into the following steps:

  1. Tasking: Cluster the online SPARK checkers into several committees.
  1. CID Sampling: Choose a random (CID, SP) pair for each committee.
  1. Retrieval with Attestation: Retrieve CID content from SP and obtain the attestation token.
  1. Proof of Data Possession: Create proof that the entire CID content was retrieved.
  1. Measurement: Report the job outcome to MERidian.
  1. Evaluation & Verification: Evaluate the impact of checkers and detect fraud.

The last phase - Reward - is handled by MERidian smart contracts.

1) Tasking

In this step, we want to match running SPARK checker instances with retrieval jobs in such a way that makes fraud more difficult. For each retrieval job (a pair (CID, SP)), we want to form a small committee of peers to perform the same retrieval redundantly to arrive at an honest majority result.

In the rest of the section, a task represents an abstract but fixed retrieval check that checker nodes will perform. The tasking algorithm does not need to concern itself with what specific (CID, SP) pair is derived from each task, as long as we have a deterministic algorithm for that. The conversion from “task” to (CID, SP) job is explained in the section .

Requirements

  1. Rate Limiting
    We want to rate-limit how many jobs are performed by each checker.
    • This is especially important for the LabWeek23 release, when we won’t have retrieval fraud detection implemented yet. Without fraud detection, a fraudulent node can short-circuit job execution by skipping the retrieval and reporting fake results instead. This allows nodes to cheaply report many completed jobs, which later translates into disproportionately large impact & reward.
    • Even with fraud detection in place, if we don’t rate-limit the checks, then nodes on fast, unmetered internet connections can tweak the SPARK checker code to perform more checks than we designed for. This can put too much pressure on SPs. Plus, the operator will get a larger portion of rewards, which is unfair to honest nodes.
    🤔
    Miroslav
    After describing this requirement in detail, I am no longer sure if it’s really required in the longer term. What can go wrong if we allow nodes to make many honest retrieval checks?
    1. Node operator creates more impact and thus receives more rewards. That’s the core of the MERidian scheme, right?
    1. Node operators download much more data from SPs. If we sample CIDs uniformly, the load should be spread across SPs based on how much data they store. A single SP can be overloaded only if it provides a large fraction of FIL capacity; if that’s the case, it must be prepared to handle more retrievals than SPs providing less capacity.
  1. Make Sybil attacks difficult/expensive
    We want to limit how many checker instances a single party operates by disincentivising the operation of too many instances.
    • The primary goal is to make Sybil attacks more difficult/expensive.
    • As a side effect, this also makes it more challenging to avoid rate limiting by running many SPARK instances on the same machine.
  1. Make SP collusion difficult
    We want to make it difficult to run colluding SPARK nodes that will limit the retrieval checks only to a given SP or perhaps a small fixed set of CIDs.
  1. Checker committees
    Our “Proof of Retrieval” solution requires a committee where an honest majority can be formed.
  1. Avoid DDoSing SPs
    We must spread the load across multiple SPs. If the entire SPARK network checks the same (CID, SP) pair, we will effectively DDoS the poor SP.

See also

Tasking Algorithm

⚠️
This section is outdated. We designed a new tasking algorithm that does not require membership information; see

We want to use IPv4 as a scarce resource to make it more difficult & expensive for individual operators to run hundreds/thousands of SPARK nodes to gain control of the network.

Each measurement epoch is deterministically tied to a DRAND epoch so that we can use the DRAND signature as the source of randomness.

We want to have a list of currently active nodes to lower the probability of selecting offline nodes for a committee.

Decentralised Design

We assume a network of tasker nodes will be operated by the community. Initially, there will be only one tasker node run by PL. Later, we will design & implement a reputation and incentive system that will allow more parties to join the network. E.g. a new tasker node stakes some FIL to join, the network penalises misbehaving nodes by slashing their stake, and finally, SPARK diverts a part of the reward pool to pay tasker nodes for their service.

Let’s assume we want to define a new set of tasks once per DRAND epoch (every 30 seconds).

TL;DR:

  • The tasking algorithm has two inputs:
    • The DRAND signature for the current epoch as the source of randomness
    • A list of checker nodes active at the beginning of the current epoch, including their public IPv4 addresses.
  • The outputs are deterministically computed from those inputs.
    • A list of committees.
    • For each committee, a list of checkers belonging to the committee.
    • Commitments and inclusion proofs as needed by other SPARK components.
  • This reduces the problem of tasking decentralisation to the following question:
    • How can we maintain a list of currently active checkers & their IP addresses in a decentralised way that the community & we can trust?
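
To make the deterministic mapping from inputs to outputs concrete, here is a minimal Python sketch. The function name, the (public_key, ipv4) record shape, and the SHA-256 shuffle are illustrative assumptions, not the final design; in particular, this sketch omits the one-checker-per-/24-subnet rule described later in this document.

```python
import hashlib

def assign_committees(drand_signature: bytes, active_checkers: list, committee_size: int = 10):
    """Deterministically partition active checkers into committees.

    `active_checkers` is a list of (public_key, ipv4_address) tuples;
    names and shapes are illustrative, not the final wire format.
    """
    # Derive a deterministic ordering from the DRAND signature: every
    # party that knows both inputs computes the same shuffle.
    def sort_key(checker):
        public_key, _ipv4 = checker
        return hashlib.sha256(drand_signature + public_key).digest()

    shuffled = sorted(active_checkers, key=sort_key)

    # Slice the shuffled list into fixed-size committees; leftovers are
    # dropped here, mirroring the "leftover groups" note later on.
    full = len(shuffled) - len(shuffled) % committee_size
    return [shuffled[i:i + committee_size] for i in range(0, full, committee_size)]
```

Because the shuffle depends only on the two inputs, any verifier holding the same member list and DRAND signature can recompute the committees and check them against the published commitment.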

Membership service

⚠️
This section is outdated. We designed a new tasking algorithm that does not require membership information; see

This part described the answer to the question asked above: How can we maintain a list of currently active checkers & their IP addresses in a decentralised way that the community & we can trust?

Let’s recall that we assume the existence of a network of tasker nodes operated by the community.

  1. Each checker node will periodically send a heartbeat message to the tasker network.
    1. There are different ways to implement this. For example, there can be a DHT of all tasker nodes, and the checker node selects the closest tasker node. Another option is gossip-style broadcasting.
    1. For the initial release, the network will have only one node operated by PL, and the checker nodes will contain hardcoded configuration pointing to this service.
  1. The tasker nodes will build a shared representation of what checker nodes are currently up and running.
    1. Again, there are different ways to implement this. What we want is a decentralised key/value store with expiration.
    1. For each checker node, we need to record the following fields:
      1. identity - node’s public key
      1. address - node’s public IPv4 address (or the address of the NAT router)
      1. timestamp - time when the last heartbeat was received
  1. At the beginning of each epoch, the tasker network finds a consensus about the list of all active checker nodes.

    We should design this scheme to allow individual checker nodes to verify that they were included in the list while protecting the privacy of other checker nodes in the network.

  1. The tasker network creates a membership commitment satisfying the following requirements:
    1. It allows the MERidian/SPARK evaluation step to verify that the committee selection was executed with the right input data.

      For example, this can be implemented by storing the list of active members in CAS (content-addressable storage) and recording the CID on the chain. However, we must restrict access to the raw data (checkers’ IPv4 addresses).

    1. It allows individual checker nodes to verify that they were included in the list of members used for task selection.
    1. It must preserve the privacy of checker nodes - their IPv4 addresses must not be open to the public.
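
A minimal sketch of such a commitment, assuming a SHA-256 Merkle tree (function names and the record shape are illustrative): committing to salted hashes of member records lets the network publish a single root, hand each checker a standard Merkle inclusion proof, and avoid publishing raw IPv4 addresses. Note that because the IPv4 space is small, salted hashing alone offers limited privacy; the Concerns section below discusses this.

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(leaves):
    """Root of a binary Merkle tree over pre-hashed leaves."""
    level = list(leaves)
    while len(level) > 1:
        if len(level) % 2:  # duplicate the last node on odd-sized levels
            level.append(level[-1])
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

def membership_commitment(members, epoch_salt: bytes) -> bytes:
    """Commit to the member list without publishing raw IPv4 addresses.

    Each leaf hashes (salt, public_key, ipv4); only the root is made
    public, while a checker that knows its own record can be given a
    Merkle inclusion proof against that root.
    """
    leaves = [h(epoch_salt + pk + ip.encode()) for pk, ip in members]
    return merkle_root(leaves)
```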

Selection of Committees

⚠️
This section is outdated. We designed a new tasking algorithm that does not require membership information; see

Inputs:

  • List of all active checker nodes (provided by the previous step)
  • DRAND signature as the source of randomness

Steps:

  1. At the beginning of each epoch, after the tasker network reaches a consensus about the list of online checker nodes, we can proceed to define tasks for this epoch.

    There are different ways to implement this. One option is to choose a different leader for each epoch. The leader defines the tasks, and the other nodes check that the leader is not cheating. The leader gets the reward (mints a block reward).

    For the initial release, the network will have only one node operated by PL, so we don’t need to worry about the details yet.

    1. Partition all active checkers into committees of fixed size so that:
      • each IPv4 /24 subnet is in exactly one committee
      • there is exactly one checker from each IPv4 /24 subnet selected to participate in the committee.
      • the selection process creates a uniform distribution
      • the selection uses the DRAND signature as the source of randomness.
    1. Create a tasking commitment to the list of committees and their members. Record this commitment in a place where the SPARK/MERidian evaluator can find it.
    1. For each checker elected into a committee, create an inclusion proof against the tasking commitment. This allows checkers and the SPARK/MERidian evaluator to validate that the checker was assigned the task.
  1. At regular intervals, checkers ask for a task to perform.
    1. The checker can communicate with the same tasker node it sends heartbeats to.
    1. For the initial release, the network will have only one node operated by PL, and the checker nodes will contain hardcoded configuration pointing to this service.
  1. The tasker node returns one of the following two responses:
    1. No new task is scheduled for the checker in this epoch.
      • Case 1: The checker already performed the task scheduled for this epoch.
      • Case 2: No task was scheduled for this particular checker in the current epoch.
    1. Details of the task scheduled for the checker in the current epoch and the time until the next epoch starts (X seconds).
      Task fields:
      1. DRAND signature for this epoch
      1. Tasking commitment, committee id this checker belongs to and inclusion proof of that.
        • The committee id is computed deterministically from the list of committee members (their public keys).
        • The inclusion proof allows other components in the system to verify that the checker was entitled to perform this task in this epoch.

    In both cases, the response will also contain the following fields:

    • Time until the next epoch starts, e.g. X seconds.
    • Commitment to the list of all active checker nodes. It allows the checker to verify that it was considered for the task selection algorithm. Alternatively, maybe the tasking service should provide inclusion proof instead?

    In both cases, the checker will repeat step 2 after X seconds as instructed by the service.
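
The partitioning rules in step 1a can be sketched in Python as follows. The names, data shapes, and use of a seeded PRNG are illustrative assumptions; a production design would derive the shuffle verifiably from the DRAND signature itself.

```python
import hashlib
import random

def select_committees(drand_signature: bytes, checkers, committee_size: int = 10):
    """Pick one checker per /24 subnet, then group subnets into committees.

    `checkers` is a list of (public_key, ipv4) pairs. Seeding the PRNG
    with the DRAND signature makes the selection reproducible by every
    verifier holding the same inputs.
    """
    rng = random.Random(hashlib.sha256(drand_signature).digest())

    # Group checkers by their x.y.z.0/24 subnet.
    subnets = {}
    for pk, ip in checkers:
        subnet = ip.rsplit(".", 1)[0]
        subnets.setdefault(subnet, []).append((pk, ip))

    # Exactly one checker per subnet, chosen uniformly at random.
    representatives = [rng.choice(members) for _, members in sorted(subnets.items())]

    # Shuffle and slice into fixed-size committees; leftover subnets
    # receive no task this epoch.
    rng.shuffle(representatives)
    full = len(representatives) - len(representatives) % committee_size
    return [representatives[i:i + committee_size] for i in range(0, full, committee_size)]
```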

Outputs

The ā€œtasker nodesā€ know:

  • List of active members for the current epoch

The ā€œnetworkā€ knows - this knowledge is shared by ā€œtasker nodesā€ and the evaluation/fraud detection service, e.g. via on-chain smart contract state:

  • Membership commitment to the list of all active members
  • DRAND signature for the current epoch
  • Tasking commitment to the definition of committees/tasks.

Each tasked checker knows:

  • DRAND signature for the current epoch
  • Tasking commitment to the definition of committees/tasks
  • The committee id this checker belongs to
  • Task inclusion proof allowing 3rd parties to verify that this node belongs to its committee as specified by the id and that this committee is included in the tasking commitment.

Concerns

Proof of IPv4 address

  • Not a problem initially when there is only one tasker node operated by PL.
  • With a network of untrusted tasker nodes:
    • When a checker sends the heartbeat message, the tasker node sees the checker’s public IPv4.
    • We can design a scheme allowing anybody to challenge the checker address reported by a particular tasker node. The process starts by forming an honest majority of tasker nodes, instructing the checker node to send a heartbeat to all tasker nodes, and then comparing the results. Of course, a checker node running on Station Desktop may be offline, so we need to account for that. If an honest majority of tasker nodes agrees on a different IPv4 address for the given checker node, then we can slash the tasker that reported the incorrect address. However, we must also take into account that people’s IPv4 addresses change over time.
    • We can ask the checker node to always send the heartbeat message to multiple tasker nodes. That way, we don’t need to trigger a challenge as long as all taskers report the same address.

We don’t want to make IPv4 addresses public. Hashing an address is not enough because it’s trivial to enumerate & hash all possible IPv4 addresses to create a rainbow table.

  • In the current design, IPv4 addresses are private to tasker nodes.
  • We may want to research options that will make it difficult or expensive for tasker nodes to expose checkers’ addresses.
  • I think this ultimately boils down to the fact that you have to trust the remote parties you are connecting to.

Membership information that’s up-to-date but does not require the frequent invocation of a membership smart contract.

  • We don’t need the membership information to be on the chain. In the current design, membership is tracked by a network of tasker nodes and does not require any checkers to interact with any chains.

What would be a stop-gap solution we can implement faster to buy us more time for implementing the proper decentralised one?

  • Using the proposed design, reduce the network of tasker nodes to a single node operated by PL.

How can checker nodes obtain the tasks defined by the decentralised orchestrator?

  • Checker nodes are registered with tasker nodes. Checker nodes periodically ask their tasker node for a new task.

How to handle the initial state when a checker node does not have any FIL to pay for the gas fee and thus cannot invoke any smart contracts?

  • Not applicable; checker nodes do not need to interact with any chains.

2) CID Sampling (per each task)

Prior discussion:

This algorithm is executed for every task defined above.

  • Checker nodes execute the algorithm to find out what retrieval they should check.
  • The evaluation service may run this algorithm to verify that the checkers selected the right retrieval. In practice, this is required only if members of the same committee disagree.

Inputs:

  • The committee ID to execute the task
  • The DRAND signature for the current epoch

How to choose a random (CID, SP) pair - implementation details to be added soon-ish.

  1. Initialise the random number generator using the inputs as the seed. E.g. rand = hash(concat(drand_sig, committee_id))
  1. Obtain the current tipset of Filecoin’s blockchain. (It may be easier to obtain the previous tipset in a way we can trust the result.)
  1. From this tipset, get the set of active deals.
    🤔
    Somebody in the FIL ecosystem is maintaining a snapshot of active deals on Amazon S3. The file can be downloaded from https://marketdeals.s3.amazonaws.com/StateMarketDeals.json.zst; its size is over 3GB.
    Depending on that single file is a single point of failure, though.
    We cannot expect SPARK checker nodes (Station Desktop instances) to download the entire StateMarketDeals.json.zst file to be able to perform CID sampling.
  1. Pick a random deal (using the random number generator created in step 1). Obtain PieceCID and SP identity for that deal.
  1. Obtain the multiaddr of Boost HTTP API and the identity (public key) of the Boost worker registered for the selected SP. If we cannot get that information, then flag the SP as non-conforming and go back to the deal-selection step to pick another deal.
  1. Get a random CID from the given PieceID (using the random number generator created in step 1).
    1. Ideally, we want each checker to do this on their own. E.g. choose a random range within PieceCID, ask SP for those bytes, then parse the data to find CIDs, then pick one of them at random.
    1. An easier, although more centralised, option is to query the IPNI indexer HTTP API and request a random CID from PieceID using our current randomness seed.
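
Steps 1 and 4 can be sketched as follows, under the assumption that every committee member sees the same canonically ordered list of active deals (which, per the note above, is the hard part). Names are illustrative.

```python
import hashlib

def sample_deal(drand_signature: bytes, committee_id: bytes, active_deals: list):
    """Deterministically pick one deal for a committee.

    Implements rand = hash(concat(drand_sig, committee_id)) from step 1,
    then uses it to index into the deal list. Every committee member
    working from the same tipset arrives at the same deal.
    """
    seed = hashlib.sha256(drand_signature + committee_id).digest()
    # Interpret the digest as a big-endian integer and reduce it modulo
    # the number of deals; modulo bias is negligible for a 256-bit seed.
    index = int.from_bytes(seed, "big") % len(active_deals)
    return active_deals[index]
```

The returned deal record supplies the PieceCID and SP identity consumed by the following steps.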

Outputs (per task)

  • CID to retrieve
  • multiaddr of Boost worker HTTP API to dial
  • pubkey the public key of Boost worker identity

Shall we add tipset, so that the evaluation service knows all inputs for verifying the result?

3) Retrieval with Attestation

Prior discussion:

A retrieval-check job is defined as follows:

  • CID to retrieve
  • multiaddr of Boost worker HTTP API to dial
  • pubkey the public key of Boost worker identity
  • rand seed to use for randomness

Additionally, each checker node has a FIL wallet or a libp2p identity (a key pair).

  1. SPARK checker retrieves CID from multiaddr and asks for retrieval attestation (CAR metadata block).
  1. While reading the response body, it incrementally verifies the CAR stream.
  1. While reading the response body, it computes a Blake3 hash of the response bytes up to the CAR metadata block.
  1. After receiving the entire response, it parses car_length, Blake3 hash and the provider’s signature from the CAR metadata block (see IPIP-431).
    1. Verifies that car_length matches the length of the response up to the metadata block
    1. Verifies that the Blake3 hash matches the hash computed locally
    1. Verifies the signature
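
A condensed sketch of checks 4a and 4b, with SHA-256 standing in for Blake3 (Python’s standard library has no Blake3) and signature verification omitted; names are illustrative, and a real checker hashes incrementally while streaming rather than buffering the whole response.

```python
import hashlib

def verify_response(response_bytes: bytes, metadata_offset: int,
                    claimed_length: int, claimed_hash: bytes) -> bool:
    """Check length and hash of the CAR payload against the metadata block.

    SHA-256 is a stand-in for Blake3 here; verifying the provider's
    signature over the metadata (step 4c) is out of scope for this sketch.
    """
    payload = response_bytes[:metadata_offset]  # bytes up to the CAR metadata block
    if len(payload) != claimed_length:          # step 4a: car_length matches
        return False
    return hashlib.sha256(payload).digest() == claimed_hash  # step 4b: hash matches
```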

Outputs (per job)

  • car_length
  • Blake3 hash
  • Any fields from the retrieval attestation required to verify the SP signature
  • SP signature

4) Proof of Data Possession

Using the data from , the SPARK Checker performs the following steps:

  1. It uses its private wallet key to sign the following message:
    <rand><cid><multiaddr>

    and uses the signature to choose a random block in the Blake3 tree:

    block_index = signature % (ROUND_UP(car_length/1024))

  1. It calculates Blake3 inclusion proof of 1024 bytes in the block at the block_index position.
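
The block selection above can be sketched as follows, assuming the signature is available as raw bytes:

```python
import math

def choose_block_index(signature: bytes, car_length: int, block_size: int = 1024) -> int:
    """Map the checker's signature to a block index.

    Implements block_index = signature % (ROUND_UP(car_length/1024)):
    the signature bytes are read as a big-endian integer and reduced
    modulo the number of 1024-byte blocks in the CAR stream.
    """
    num_blocks = math.ceil(car_length / block_size)  # ROUND_UP(car_length/1024)
    return int.from_bytes(signature, "big") % num_blocks
```

Because the signature depends on the checker’s private key, each committee member proves possession of a different, unpredictable block.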

Open questions

  • How to prevent (or whether we need to prevent) one station from asking another station to generate the inclusion proof after a retrieval, so that it does not need to perform the full retrieval itself.

Outputs (per job)

  • signature over <rand><cid><multiaddr>
  • Blake3 inclusion proof for block_index

5) Measurement

Using the data from the previous steps, the SPARK Checker submits the job outcome to the SPARK/MERidian measurement service.

Implementation details of the submission process are covered by MERidian, see e.g.

Besides the retrieval telemetry like response status code and TTFB, it also uploads the following fields:

  • DRAND signature for the current epoch (or perhaps just the DRAND epoch number?)
  • Data from Tasking
    • Tasking commitment to the definition of committees/tasks
    • The committee id this checker belongs to
    • Task inclusion proof allowing 3rd parties to verify that this node belongs to its committee as specified by the id and that this committee is included in the tasking commitment.
  • Data from CID Sampling
    • CID to retrieve
    • multiaddr of Boost worker HTTP API to dial
    • pubkey the public key of Boost worker identity
  • Data from Retrieval with Attestation
    • car_length
    • Blake3 hash
    • Any fields from the retrieval attestation required to verify the SP signature
    • SP signature
  • Data from
    • Checker’s signature over <rand><cid><multiaddr>
    • Blake3 inclusion proof for the selected block of data.
  • Checker’s identity (the public key used for membership registration and PoDP)

Performance aspects

  • The measurement size will be roughly 400 + CIP + B3IP bytes, where CIP is the committee inclusion proof size and B3IP is the Blake3 inclusion proof for a block of data.
  • IIUC, B3IP = 1024 + O(log(S)) where S is the byte size of the retrieved content. We should limit the maximum number of bytes retrieved from SPs to put an upper bound on the size of this part.
  • CIP=O(log(N)), where N is the number of checker nodes online. We should research more efficient inclusion proofs, e.g. vector commitments. If we cannot find a more efficient representation, then we need to adjust the design so that checker nodes do not need to submit inclusion proofs for task-selection verification.
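
To get a feel for the numbers, here is a back-of-the-envelope calculator for the estimate above; the 32-byte hash size, the per-level accounting, and the function name are illustrative assumptions, not measured values.

```python
import math

def estimated_measurement_bytes(num_checkers: int, content_bytes: int,
                                hash_size: int = 32, base: int = 400) -> int:
    """Rough measurement size following the 400 + CIP + B3IP estimate.

    CIP is modelled as hash_size bytes per Merkle level over the
    checkers; B3IP as one 1024-byte block plus hash_size bytes per
    level of the Blake3 tree over 1024-byte blocks.
    """
    cip = hash_size * math.ceil(math.log2(max(num_checkers, 2)))
    b3ip = 1024 + hash_size * math.ceil(math.log2(max(content_bytes // 1024, 2)))
    return base + cip + b3ip
```

Under this model, 10,000 online checkers and 1 MiB of retrieved content yield a measurement of roughly 2 KiB, dominated by the 1024-byte proof block.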

Additional considerations

  • If we make the data publicly available on the chain, we must encrypt the job outcome using timelock encryption that will reveal the data only after the current measurement epoch is over.
  • If the Blake3 inclusion proof is publicly available, then other nodes can use this inclusion proof to calculate inclusion proof for their data block without having access to the entire CAR stream. This is a problem only if we check the retrieval of the same CID repeatedly in different epochs, which is highly unlikely.

6) Evaluation & Verification

These evaluation steps are executed whenever a MERidian measurement service submits a batch of job reports.

Here we assume that the evaluation is performed off-chain and that MERidian takes care of executing this computation in a decentralised way. Again, refer to for more details.

Certain parts can be verified only after we receive measurements from all members of a committee. MERidian triggers the first evaluation step for every batch of measurements submitted, and a batch of measurements may not cover the entire committee. Therefore, we need the evaluation process to be stateful: wait until all committee members report their measurements and only then perform the verification checks.

The steps below describe the happy path, where we can form an honest majority in every step and don’t penalise misbehaving checkers.

Inputs

  • DRAND signature or epoch number linked to the current SPARK epoch
  • Tasking commitment created by the network of tasking nodes
  • Measurement records as provided by MERidian Measurement service/smart-contract.

Per-measurement steps

These can be executed immediately.

  1. Verify the tasking commitment reported by the checker node matches the commitment for the current epoch stored on-chain or within the tasker network.

    Discard measurements that don’t pass this check.

  1. Verify the task inclusion proof.
    Discard measurements that don’t pass this check.

Per committee steps

These steps must be executed after we collect all measurements for a committee.

At this point, thanks to per-measurement steps, we have the following guarantees:

  • The measurement uses the correct DRAND signature for randomness
  • The measurement “belongs” to the list of committees defined by the tasking network for the current epoch.

The process:

  1. Verify that all committee members arrived at the same job definition (CID, multiaddr, pubkey)
    In case of discrepancies, assume an honest majority to decide what’s the right answer. Alternatively, or if there is no majority, follow to compute the correct values ourselves. This is a potential DoS attack vector; we must limit how often we run this.
    Discard measurements that don’t pass this check.
  1. Compare car_length and Blake3 hash and provider signature reported by different checkers (committee members). Build an honest majority to have confidence about what is the correct car_length and Blake3 hash.
    Discard measurements that don’t pass this check.
    🤔
    Open questions:
    • How to handle the case when we cannot form any majority?
    • How to distinguish faulty SPs from misbehaving checkers?
    • What to do about reports where the signature does not match?
  1. Verify the validity of the provider’s signature over the retrieval attestation. Note: by now, we have filtered out all measurements that disagree with the majority, all remaining measurements have the same SP signature, and therefore we can run the validation only once per committee.
    Discard measurements that don’t pass this check.
  1. For each measurement in this committee:
    1. Validate the checker’s signature
      1. Build <rand><cid><multiaddr> - we have this data in the measurement
      1. Validate the checker’s signature of the payload - we have the public key in the measurement

      Discard measurements that don’t pass this check.

    1. Calculate block_index = signature % (ROUND_UP(car_length/1024))
    1. Validate the Blake3 inclusion proof for this block index

      Discard measurements that don’t pass this check.
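
The per-committee majority vote (step 2) could be sketched as follows; the measurement shape and the strict-majority rule are illustrative assumptions, and the no-majority case is returned explicitly because handling it is still an open question above.

```python
from collections import Counter

def committee_majority(measurements):
    """Form an honest majority over (car_length, blake3_hash, sp_signature).

    `measurements` is a list of dicts with those keys (illustrative
    shape). Returns (majority_value, accepted_measurements), or
    (None, []) when no strict majority exists.
    """
    keys = ("car_length", "blake3_hash", "sp_signature")
    votes = Counter(tuple(m[k] for k in keys) for m in measurements)
    value, count = votes.most_common(1)[0]
    if count * 2 <= len(measurements):  # no strict majority
        return None, []
    accepted = [m for m in measurements if tuple(m[k] for k in keys) == value]
    return dict(zip(keys, value)), accepted
```

Measurements outside the accepted set are the ones discarded by this check; subsequent steps (signature and inclusion-proof validation) then run only on the accepted set.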

Now we have a set of measurements where we can trust the nodes performed the task assigned by SPARK and retrieved the full content.

Output

  • A filtered set of measurements we can feed into evaluate & reward functions.

Performance aspects

TBD: what is the computational complexity (especially the time complexity) of the verification algorithm with respect to the number of checker nodes and the size of the retrieved data?

My initial estimate: for every SPARK epoch, the time complexity is O(N * (log(N) + log(S)))

  • N is the number of checker nodes online, equal to the number of measurements reported.
  • log(N) comes from the task inclusion proof - we can optimise this by switching to e.g. vector commitments.
  • S is the (mean average?) size of the data retrieved.
  • log(S) comes from the Blake3 inclusion proof for data possession.

OUTDATED CONTENT (PREVIOUS ITERATION)

Node registration

For each pair (CID, SP) we want to check, we need to pick a subset of SPARK checkers to form a committee that will give us an honest majority. To do that, we need a list of currently running checkers (list of network members).

  • We need the list of currently active nodes to lower the probability of selecting offline nodes for a committee.

Options:

  1. Long-polling HTTP requests: a checker sends a request to Spark Orchestrator. The Orchestrator keeps the connection open until there is a job to be performed by the checker. The list of all open HTTP connections gives us the members.
    • Downsides: centralised, very poor scalability
  1. Heartbeat: a checker sends a heartbeat (ping) at a configured interval, e.g. every minute. The Orchestrator keeps a list of (member, timestamp) records. We filter out stale records (timestamp is older than one minute) to get the list of active members.
    • Downsides: centralised, there is a delay in detecting offline nodes
  1. On-chain registration: a checker registers itself with a smart contract. I think this will have to be implemented as a heartbeat, too (?).
    • Downsides: expensive to run, checkers must have FIL to pay gas fees
  1. Centralised registration with verification:
    • Same as Heartbeat (option 2). For each epoch, the membership service publishes an inclusion proof or a commitment on the chain, allowing individual checker nodes to verify that they were considered for receiving a job.

Proposal

Implement the second solution (heartbeat) for the Q3/Q4 launch. Research a decentralised/web3 solution afterwards.

Job scheduling

Prior discussion:

At every measurement epoch, the Orchestrator (Tasker) schedules retrieval checks to be performed by the network of checker nodes.

We want to use IPv4 as a scarce resource to make it more difficult & expensive for individual operators to run hundreds/thousands of SPARK nodes to gain control of the network.

For each (CID, SP) job, we want to form a committee of nodes where each node performs the same job. This should give us a high degree of confidence that if the majority of nodes report the same results, it’s the honest majority.

Each measurement epoch is deterministically tied to a DRAND epoch so that we can use the DRAND beacon as the source of randomness.

Proposed algorithm

This algorithm is executed at the beginning of every SPARK epoch to schedule jobs for this epoch.

Let’s assume committee size CS=10. (We will increase this number later based on modelling)

  1. Initialise the set of partitions P using data from the membership service, grouping checker nodes using the first three bytes of their IPv4 address (their x.y.z.0/24 subnet). Each partition is a list of nodes in the same subnet.
    Note: We never add more members to P. Instead, we create a new P using fresh membership data when the next epoch starts.
  1. Repeat as long as P has at least CS items:
    1. Choose a random (CID, SP) pair using the CID sampling algorithm described below
    1. Choose CS random partitions from P
    1. For each chosen partition (an IPv4 subnet), pick a random member checker in this partition (IPv4 subnet).
    1. Create a new job assignment (CID, SP, checker)
    1. Remove the chosen partitions from P
  1. In most cases, the number of subnet groups will not be divisible by CS, so we end up with some leftover groups that were not given any task to do this epoch. Alternatively, we can add these subnets to the last committee or spread them evenly across multiple committees.

CID sampling

How to choose a random (CID, SP) pair - implementation details to be added soon-ish.

  1. Obtain the current tipset of Filecoin’s blockchain.
  1. From this tipset, get the set of active deals.
  1. Pick a random deal. Obtain PieceCID and SP identity for that deal.
  1. Obtain the multiaddr of Boost HTTP API and the identity (public key) of the Boost worker registered for the selected SP. If we cannot get that information, then flag the SP as non-conforming and repeat step 3 again.
  1. Query the IPNI indexer HTTP API and request a random CID from PieceID using our current DRAND beacon.

Tasking checkers

Each checker node will periodically ask the Orchestrator for a new job via HTTP API. The orchestrator gives one of the following responses:

  • No new job is scheduled for the client, and it should retry in X seconds.
    • Case 1: The node already performed the job scheduled for this epoch.
    • Case 2: No job was scheduled for this particular node in the current epoch.
    • X is computed as the time until the next epoch starts
  • Details of the job scheduled for the node in the current epoch
  • If the node does not receive any job, it will repeat the request in X seconds as instructed by the scheduler.

Smart-contract-driven version

How can we remove the reliance on a centralised Orchestrator/Tasker and let a smart contract do the scheduling work? Can we defer this work after the initial release?

Retrieval check

Prior discussion:

A retrieval-check job is defined as follows:

  • CID to retrieve
  • multiaddr of Boost worker HTTP API to dial
  • pubkey the public key of Boost worker identity
  • drand seed to use for randomness

Additionally, each checker node has a FIL wallet (a key pair).

  1. SPARK checker retrieves CID from multiaddr and asks for retrieval attestation (CAR metadata block).
  1. While reading the response body, it incrementally verifies the CAR stream.
  1. While reading the response body, it computes a Blake3 hash of the response bytes up to the CAR metadata block.
  1. After receiving the entire response, it parses car_length, Blake3 hash and the provider’s signature from the CAR metadata block (see IPIP-431).
    1. Verifies that car_length matches the length of the response up to the metadata block
    1. Verifies that the Blake3 hash matches the hash computed locally
    1. Verifies the signature
  1. Next, it uses its private wallet key to sign the following message:
    <drand><cid><multiaddr>

    and uses the signature to choose a random block in the Blake3 tree:

    block_index = signature % (ROUND_UP(car_length/1024))

  1. It calculates Blake3 inclusion proof of 1024 bytes in the block at the block_index position.
  1. Finally, it reports the job outcome to SPARK Orchestrator. Besides the retrieval telemetry like response status code and TTFB, it also uploads the following fields:
    • car_length, Blake3 hash and provider’s signature
    • checker’s signature over <drand><cid><multiaddr>
    • Blake3 inclusion proof for the selected block of data.

Smart-contract version

  • If we make the data publicly available on the chain, we must encrypt the job outcome using timelock encryption that will reveal the data only after the current measurement epoch is over.
  • If the Blake3 inclusion proof is publicly available, then other nodes can use this inclusion proof to calculate inclusion proof for their data block without having access to the entire CAR stream. This is a problem only if we check the retrieval of the same CID repeatedly in different epochs.

Job verification

  1. After each measurement epoch is over, collect all job outcomes reported by SPARK checkers. For each report, verify that the provider signature matches the provider’s public key.
    • What to do about reports where the signature does not match?
    • How to handle reports with missing signatures?
    • How to distinguish faulty SPs from misbehaving checkers?
  1. For each (CID, multiaddr) pair scheduled for checking:
    1. Compare car_length, Blake3 hash and provider signature reported by different checkers (committee members).
      1. Can we build an honest majority to have confidence about what is the correct car_length and Blake3 hash?
      1. What to do if we cannot?
    1. For each job report:
      1. Validate the checker’s signature
        1. Build <drand><cid><multiaddr> - we have this data in the job definition
        1. Validate checkers’ signature of the payload - we have the signature in the job outcome; we will need to get the checker’s public key (wallet address) from our other records, e.g. the membership service.
      1. Validate the Blake3 inclusion proof
      1. Filter out invalid reports
    1. Now we have a set of retrieval metrics we can trust šŸŽ‰

Public verifiability

How can we allow 3rd parties to repeat the process above using public data to verify that our Orchestrator service is being honest in the way it performs job verification?