How SPARK Samples Filecoin Deals

Written by Miroslav Bajtoš
Tags: Engineering, Spark
Status: Published

SPARK is checking whether public content stored on Filecoin can be retrieved. To do so, we need to find out which Filecoin deals store data that’s expected to be publicly available.

Filecoin was designed to store all kinds of data, but not all of it is meant to be publicly retrievable. For these “private data” deals, it’s up to the client and the Storage Provider to agree on how the client can access the stored data. Such an agreement happens off-chain.

On the other side of the spectrum is the community program called Filecoin Plus for Large Datasets, often abbreviated as FIL+ LDN. This program aims to incentivise the storage of public open datasets on Filecoin, such as measurements produced by scientific experiments. There is a clear expectation that content stored through FIL+ LDN “should be readily retrievable on the network and this can be regularly verified” (quoted from the current scope in the FIL+ LDN docs).

While FIL+ LDN does not cover all publicly retrievable data, it gives us a great start.

Listing active FIL+ LDN deals

How can we find all FIL+ LDN deals to choose some of them to check? There are three steps in this process:

  1. Get a list of all storage deals
  2. Filter active FIL+ deals
  3. Keep FIL+ LDN deals only
💡
You can find our implementation of this sampling algorithm on GitHub at https://github.com/filecoin-station/fil-deal-ingester/.

Get a list of all storage deals

Storage deals are managed by the built-in Storage Market Actor. The RPC API method Filecoin.StateMarketDeals returns a list of all deals created since the Filecoin Mainnet genesis. As you can imagine, it’s a lot of data - more than 20 GB in April 2024 - and the size is steadily growing as more deals are created over time. As a result, most RPC API providers have disabled access to this RPC method.

Fortunately, the awesome folks at Glif.io are creating hourly snapshots of the StateMarketDeals data; the latest snapshot is publicly available via their Amazon S3 link.

In Spark, we use this snapshot as the data source of all storage deals.
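Once downloaded, the snapshot is a JSON object keyed by deal ID, where each value holds the deal’s on-chain Proposal and State (the same shape that Filecoin.StateMarketDeals returns). Here is a minimal sketch of iterating it, assuming the file has already been parsed into memory; in practice the 20+ GB file calls for stream parsing, and `iterateDeals` is our illustrative name, not part of Spark’s code:

```javascript
// Illustrative sketch: walk the deals in a parsed StateMarketDeals snapshot.
// Keys are deal IDs; each value holds the on-chain Proposal and State.
function* iterateDeals (marketDeals) {
  for (const [dealId, { Proposal, State }] of Object.entries(marketDeals)) {
    yield { dealId: Number(dealId), proposal: Proposal, state: State }
  }
}

// Tiny in-memory example (field values are illustrative):
const sample = {
  1234: {
    Proposal: { Client: 'f01000', Provider: 'f02000', Verified: true },
    State: { SectorStartEpoch: 100, LastUpdatedEpoch: -1, SlashEpoch: -1 }
  }
}
const deals = [...iterateDeals(sample)]
```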

💡
In the future, we will also need to include deals created via the Direct Data Onboarding mechanism recently introduced by FIP-0076.

Filter active FIL+ deals

The next step in our deal-processing pipeline is discarding all deals that are not active or that are not part of the FIL+ program. This is straightforward to implement using the following fields in the DealProposal objects from the Market Deals state:

  • Verified is a boolean field set to true if the deal is part of FIL+.
  • StartEpoch and EndEpoch specify the time interval when the deal is active.
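The two fields above translate directly into a predicate. The following sketch is ours, not taken from the fil-deal-ingester code; the epoch arithmetic assumes Filecoin mainnet’s 30-second epochs and its genesis timestamp of 1598306400 (2020-08-24 22:00 UTC):

```javascript
// Filecoin mainnet constants used to convert wall-clock time to an epoch.
const GENESIS_UNIX_TIME = 1598306400
const EPOCH_DURATION_SECONDS = 30

function currentEpoch (nowUnixTime = Date.now() / 1000) {
  return Math.floor((nowUnixTime - GENESIS_UNIX_TIME) / EPOCH_DURATION_SECONDS)
}

// `proposal` is a DealProposal object from the market deals state.
function isActiveVerifiedDeal (proposal, epoch = currentEpoch()) {
  return proposal.Verified === true &&
    proposal.StartEpoch <= epoch &&
    epoch < proposal.EndEpoch // the deal expires at EndEpoch
}
```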

Keep FIL+ LDN deals only

Lastly, we must filter the deals to keep only those made as part of the FIL+ LDN program. Theoretically, all data needed to construct such a filter is available in the on-chain state. In practice, it was easier to implement the following heuristics, which seem to work well.

⚠️
The LDN program was superseded by v5 Allocators. As of July 2024, Spark considers all FIL+ deals as eligible for retrievability checking.

First, we build a list of all clients that are verified for FIL+ LDN. We are using the following two endpoints offered by the public DataCapStats.io API:

  1. getVerifiers (docs) to find all notaries (verifiers) that contain the string ldn in their description.
  2. getVerifiedClients (docs) to get all clients of a given notary.

const notaries = await findNotaries()

const allLdnClients = []
for (const notaryAddressId of notaries) {
  const clients = await getVerifiedClientsOfNotary(notaryAddressId)
  allLdnClients.push(...clients)
}
removeDuplicates(allLdnClients)

async function findNotaries () {
  const res = await fetch(
    'https://api.datacapstats.io/public/api/getVerifiers?limit=1000',
    { headers: { 'X-API-KEY': API_KEY } }
  )
  const body = await res.json()
  return body.data.map(obj => obj.addressId)
}

async function getVerifiedClientsOfNotary (notaryAddressId) {
  const res = await fetch(
    `https://api.datacapstats.io/public/api/getVerifiedClients/${notaryAddressId}?limit=1000`,
    { headers: { 'X-API-KEY': API_KEY } }
  )
  const body = await res.json()
  return body.data.map(obj => obj.addressId).filter(val => !!val)
}

Second, to determine whether a deal is expected to be publicly retrievable, we check the Client field of the DealProposal. This field contains the address of the client making the deal. If the client is in the list of clients verified for FIL+ LDN, then we consider the deal to belong to the FIL+ LDN program and to have the expectation of public retrievability.
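With the client list loaded into a Set, this membership check is a constant-time lookup on the Client field. A minimal sketch (the function name and the addresses are illustrative, not Spark’s actual code):

```javascript
// Hypothetical list of FIL+ LDN client address IDs built in the first step.
const ldnClients = new Set(['f01111', 'f02222'])

// A deal belongs to FIL+ LDN if its DealProposal's Client is in the list.
function isLdnDeal (proposal, clients = ldnClients) {
  return clients.has(proposal.Client)
}
```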

What’s next

This was the first post in the series explaining how SPARK checks retrievability. In the next posts, we will explore how SPARK finds the content identifiers (CIDs) of the data stored in a deal and the network address from which to fetch the content. Stay tuned!

