CID Sampling for SPARK
The first step in the SPARK retrieval check workflow is selecting the `(cid, address)` pair that a given checker should test.
We have the following requirements:
- The distribution should be uniform.
- There should be equal probability that any given `(cid, address)` pair is chosen.
- There should be equal probability that any given checker will be assigned to test the given `(cid, address)` job.
  - There is an exception: we need to build an Honest Majority of checkers retrieving the same `cid` from the same `address` in a single measurement epoch. The sampling algorithm must be able to account for this.
- The sampling must not be predictable. SPs must not be able to predict what CIDs will be checked in the next round.
- Neither SPs nor Checkers can influence which `(cid, address, checker)` triple is sampled.
- We want to use IPv4 address blocks as a scarce resource preventing any single party from spinning up a large number of nodes and controlling a large portion of the network.
- Nodes register themselves with the orchestrator; that’s how we find their IPv4 address.
  - Later (smart contracts): create some sort of an oracle that provides a node’s IP.
- The system must allow 3rd parties to verify that (CID, address) samples were chosen correctly.
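The uniformity, unpredictability, and Honest Majority requirements above can be sketched in Python. This is a minimal illustration, not SPARK’s actual implementation: the seed derivation, task list, and committee construction are all assumptions made for the example.

```python
import hashlib
import random

def round_seed(drand_randomness: bytes, round_number: int) -> bytes:
    # Derive a deterministic, unpredictable per-round seed from DRAND output.
    # SPs cannot predict it before DRAND publishes the round's randomness.
    return hashlib.sha256(drand_randomness + round_number.to_bytes(8, "big")).digest()

def sample_tasks(seed: bytes, tasks: list, n: int) -> list:
    # Uniformly sample n (cid, address) pairs without replacement.
    rng = random.Random(seed)
    return rng.sample(tasks, k=min(n, len(tasks)))

def committee_of(seed: bytes, checker_id: str, num_committees: int) -> int:
    # Deterministically assign a checker to a committee. All checkers in the
    # same committee test the same (cid, address), which is one way to satisfy
    # the Honest Majority exception without breaking uniformity.
    digest = hashlib.sha256(seed + checker_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_committees
```

Because the only inputs are the DRAND beacon and registered identities, neither SPs nor checkers can influence which triple is sampled, and any third party with the same inputs can reproduce the result.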
Additional thoughts:
If Content Producers are paying for retrieval checks for the set of CIDs they are publishing, then we need to limit the set of CIDs only to the CIDs reported by this particular Content Producer, as opposed to doing a random walk of IPNI advertisements and/or Boost deals.
- Retrieval Bot uses the following algorithm for sampling FIL+ deals:
  > we actually end up tracking active deals in a db and run sampling based on what’s in there periodically / looking to spread across as many cid/sp id pairs as possible, rather than generating new tests every epoch.
https://github.com/data-preservation-programs/RetrievalBot/blob/1bf7e9520f2445ccf9a98a033a08fd4e6f6701f6/filplus.md
See also https://medium.com/filecoin-plus/retrieval-bot-is-live-ea577b61f7d3
Repository: https://github.com/data-preservation-programs/RetrievalBot
Possible solutions
- ❌ A hard-coded list of `(CID, address)` pairs to pick from. This list must be private to SPARK Orchestrator (SPs must not be able to access it).
  - Upsides: Easy to implement in a centralised service. We already have this.
  - Downsides:
    - This cannot be implemented as a smart contract. A smart contract would need the hard-coded list to be public, so SPs would be able to predict checks.
    - We need to periodically update the list. (Extra maintenance cost.)
    - Not useful to the Reputation group; they want a source of CIDs the community can trust.
- A random walk of IPNI advertisements, using DRAND as a source of randomness.
- Problem: there is only one IPNI instance → a single point of failure. There won’t be another consistent instance up & running by LabWeek.
- This will be the solution we want to use in the long term; we will be able to trust a consortium of IPNI nodes to be correct.
What would our IPNI query look like?
- Inputs: DRAND seed, Deal ID
- Output: a CID randomly selected using DRAND seed
Can the IPNI team build & ship this API in time for us?
There is no verification that SPs are submitting all CIDs to IPNI, and such verification won’t be built by LabWeek.
→ Propose the new API - open a new GH issue in https://github.com/ipni/storetheindex/issues
- A random walk of Filecoin storage deals
- Discussed here:
- The expectation is that FIL+ data must be retrievable. There is a flag we can use to determine whether a deal is FIL+ or not.
- We can obtain SP Boost Worker HTTP endpoints by a query; we should run this query as part of job scheduling.
Algo:
- FIL state tree gives us a set of SPs (miners), and for each SP, we get the primary worker endpoint ⇒ that gives us about 700 active miners we can retrieve from.
- We have a list of all deals for each SP on the chain. Just a list of deals.
- There is a catch: the output of the `StateMarketDeals` method is over 3 GB compressed, over 23 GB decompressed. SPARK nodes cannot work with a dataset this large.
- We can use the Boost-provided indexer API to query the SP for the CIDs they have, but we can also fetch this from IPNI instead.
- If the SP does not provide any indexes - we can flag the SP as unreliable.
- If the index data is corrupted - ??
- We can also download the full Piece, re-index it, and verify that SPs are correctly advertising correct indexes.
- Ask for a random offset inside the Piece and look for a CID block at that offset.
- ❌ For each SPARK deal, the party paying for the retrieval provides a public list of CIDs & addresses to check. (This will be presumably based on Filecoin storage deals made by the paying party.)
- This list can be directly linked to an instance of MERidian smart-contract governing the work and rewards.
- The sampling can be driven by DRAND randomness (later) or a centralised service (initially?).
- We don’t mind if SPs prioritise retrievals for CIDs on this public list over other retrievals, because the client paying SPARK wants to get good retrievals, right? Anybody can pay for their own SPARK retrieval contract to get better retrieval performance for their content.
Meeting notes
2023-09-11
We don’t know that the advertisement is honest
IPNI wants to still call it context CID
30 min round, each station will do a check per minute. 50 stations making 50 requests. Masih wants to know how much this is gonna cost on S3.
Action items:
- Calculate how many requests we will be making to IPNI
- What are our latency requirements?
It can take up to 3 weeks until the new IPNI version populates data using the new algorithm. Until that happens, some requests for a random CID for (SP, PieceCID) will not return anything.
The IPNI endpoint will accept requests only for recent IPNI epochs (timestamps), e.g. about 1 hour into the past.
2023-09-07
I would like to reopen the discussion about the CID Sampling strategy we can implement for SPARK. So far, we have been thinking about a two-step process:
1. Pick a random (PieceCID, SP) pair
2. Pick a random Content CID inside that PieceCID
Finding a random (PieceCID, SP) pair is tricky because it's difficult to obtain the list of all active storage deals (~47 million records, compressed JSON has 3GB). Let's talk about what alternative options you think are feasible.
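The two-step process above can be sketched as follows. `fetch_content_cids` is a hypothetical callback (e.g. backed by the Boost indexer API or a future IPNI endpoint), and the sketch assumes the deal list is already available, which, as noted, is the hard part.

```python
import random

def pick_sample(seed: bytes, deals: list, fetch_content_cids):
    """Two-step sampling sketch.

    Step 1: pick a random (piece_cid, sp) pair from the active deals.
    Step 2: pick a random content CID inside that piece, using the
    hypothetical resolver fetch_content_cids(piece_cid, sp).
    """
    rng = random.Random(seed)
    piece_cid, sp = rng.choice(deals)
    content_cids = fetch_content_cids(piece_cid, sp)
    content_cid = rng.choice(content_cids)
    return piece_cid, sp, content_cid
```

Seeding the RNG makes the whole pick reproducible, so a third party with the same deal list and resolver output can verify the selection.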
The proper solution for long-term
- Inputs:
- randomness seed (an array of bytes)
- number of samples to return (e.g. N=100)
- Filecoin epoch (a recent one, e.g. at most several hours old)
- To avoid time-synchronisation issues between decentralised nodes
- Outputs
- N samples, where each sample contains the following fields:
- Content CID (what to retrieve)
- booster-http address (where to retrieve from)
- booster-http identity (to verify Retrieval Attestation)
- Miner ID (to report measurements to Reputation DB)
- Piece CID (so that we can link this sample to storage deals)
- Optionally: metadata needed to allow 3rd parties to verify that the output was correctly computed. This is not needed if the sampling algorithm is fully deterministic.
- Requirements:
- We want to be able to verify that samples were chosen from the exact list of all content CIDs that were supposed to be stored & retrievable at the given Filecoin epoch.
- We want a decentralised design that does not depend on any service (like IPNI) behaving like a single centralised node.
- Output is deterministic for a given seed + epoch
- Ideally, each pair `(Content CID, SP)` has the same probability of being chosen for a sample. SP can be defined either as a Miner ID or a Boost HTTP address.
What we have tried/considered so far
Ideally, we would inspect the Storage Market actor state on the chain to pick a random active deal (PieceCID, MinerID).
- It’s not feasible to download the full list of all active deals.
- Lotus does not provide any other API (e.g. `getNumberOfActiveDeals()` and `getActiveDealAtIndex(ix)`).
- We can walk the IPLD tree of the Storage Market actor state instead.
- I think we cannot do this using a single IPLD selector (a single RPC request), in which case we will send many RPC requests if we use an external Lotus node.
- To avoid excessive RPC calls, we may need to host our own (light?) Lotus node. How expensive is such a node to operate, both in HW/network costs and maintenance time?
- If we are running our own node, can we find a way to access the list of all active deals directly or let the node expose the new APIs we need for sampling?
- Note that there is ongoing work on Direct Data Onboarding where the deal skips the StorageMarket actor. Once that lands, we cannot use StorageMarket actor state to learn about all active deals.
https://github.com/filecoin-project/FIPs/pull/804
Other ideas & next steps
- If SPs don’t announce all content CIDs to IPNI, then we can build a different bot to check and flag such issues. SPARK does not need to deal with that.
- IPNI does not understand Filecoin epochs. Can we use the concept of an Oracle to submit the IPNI “state” to the chain?
NEXT STEPS:
- Proposed design
- IPNI adds a new endpoint to map ContextID (FIL PieceCID) to a list of content CIDs included in that piece
- A new service to randomly sample active storage deals (get `(PieceCID, SP)`)
  - Filecoin PieceCID is the same as IPNI ContextID
- This service will have mapping from SP miner ID to libp2p addresses
- How to map Miner ID to SP HTTP address
- This can be found on the chain
Note: For each `(CID, address)` sample, SPARK needs the following extra metadata:
- The public key of the booster-http worker that is listening on that `address`
  → I.e. the identity that will sign the retrieval attestations.
- Storage Provider miner ID (`f0xxxx`)
  → We need this to submit data to ReputationDB.
- PieceCID in which the content is stored
  → I think this will make it easier to troubleshoot disputes about whether the CID was supposed to be retrievable from the SP.
- Next steps
- We will discuss these ideas with the IPNI team
- Miro to schedule a meeting with Torfinn for next week
- Miro to ask in #ipld about a selector that would allow us to randomly sample active storage deals
CONTENT NO LONGER RELEVANT
An idea for a short-term solution (LabWeek’23/Q4’23):
- API Request parameters:
- randomness seed (an array of bytes)
- number of samples to return (e.g. N=100)
- Algorithm: IPNI picks N samples (CID, address) from the active records in the database
- API Outputs:
  - List of samples `(CID, address)`
  - Signature over `(randomness_seed, sample_1, sample_2, …, sample_N)`
- Verification:
  - Inputs: SPARK round number, SPARK committee, IPNI public key, list of samples, signature
  - Algorithm:
    - The SPARK protocol provides a deterministic process for mapping the SPARK round number & committee to the randomness seed; use that algorithm to compute `randomness_seed`
    - Build `(randomness_seed, sample_1, …, sample_N)`
    - Verify that the signature is over ^^^ and matches IPNI’s public key
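The verification steps can be sketched as follows. The seed derivation here is a placeholder (the real mapping is defined by the SPARK protocol), and `verify_sig` stands in for an actual signature check against IPNI’s public key (e.g. ed25519), which would need a crypto library.

```python
import hashlib

def derive_seed(round_number: int, committee: list) -> bytes:
    # Placeholder for the SPARK protocol's deterministic mapping from
    # round number & committee to the randomness seed.
    h = hashlib.sha256()
    h.update(round_number.to_bytes(8, "big"))
    for checker_id in sorted(committee):
        h.update(checker_id.encode())
    return h.digest()

def build_signed_payload(seed: bytes, samples: list) -> bytes:
    # Concatenate (randomness_seed, sample_1, ..., sample_N) into the
    # byte string that IPNI is expected to sign.
    parts = [seed] + [f"{cid}|{addr}".encode() for cid, addr in samples]
    return b"\x1f".join(parts)

def verify_response(round_number, committee, samples, signature, verify_sig):
    # verify_sig(payload, signature) -> bool is a hypothetical callback
    # checking the signature against IPNI's public key.
    seed = derive_seed(round_number, committee)
    payload = build_signed_payload(seed, samples)
    return verify_sig(payload, signature)
```

Since seed derivation and payload construction are deterministic, any third party can rebuild the payload from public inputs and check the signature independently.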
Downsides:
- The sampling/selection is time-based; we cannot reproduce it later to verify that IPNI ran the algorithm correctly. (That would need a historical snapshot of IPNI’s database.)
- It relies on having a single centralised IPNI; it won’t work with a network of eventually-consistent IPNI instances. When we send the same request to different IPNI instances at the same time, we can get back different results based on the internal state of each instance (how far it got in processing the announcements).
- We need IPNI to have an identity so that it can sign the responses. Seems like another feature that we need to implement.