How SPARK Samples Filecoin Deals

Written by Miroslav Bajtoš
Tags: Engineering, Spark
Status: Published

SPARK is checking whether public content stored on Filecoin can be retrieved. To do so, we need to find out which Filecoin deals store data that’s expected to be publicly available.

Filecoin was designed to store all kinds of data, but not all of it is meant to be publicly retrievable. For these “private data” deals, it’s up to the client and the Storage Provider to agree on how the client can access the stored data. Such an agreement happens off-chain.

On the other side of the spectrum is the community program called Filecoin Plus for Large Datasets, often abbreviated as FIL+ LDN. This program aims to incentivise the storage of public open datasets on Filecoin, such as measurements produced by scientific experiments. There is a clear expectation that content stored through FIL+ LDN “should be readily retrievable on the network and this can be regularly verified” (quoted from the current scope in the FIL+ LDN docs).

While FIL+ LDN does not cover all publicly retrievable data, it gives us a great start.

Listing active FIL+ LDN deals

How can we find all FIL+ LDN deals to choose some of them to check? There are three steps in this process:

  1. Get a list of all storage deals
  2. Filter active FIL+ deals
  3. Keep FIL+ LDN deals only
💡
You can find our implementation of this sampling algorithm on GitHub at https://github.com/filecoin-station/fil-deal-ingester/.

Get a list of all storage deals

Storage deals are managed by the built-in Storage Market Actor. The RPC API method Filecoin.StateMarketDeals returns a list of all deals created since the Filecoin Mainnet genesis. As you can imagine, it’s a lot of data - more than 20 GB in April 2024 - and the size is steadily growing as more deals are created over time. As a result, most RPC API providers have disabled access to this RPC method.

Fortunately, the awesome folks at Glif.io are creating hourly snapshots of the StateMarketDeals data; the latest snapshot is publicly available via their Amazon S3 link.

In Spark, we use this snapshot as the data source of all storage deals.
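Once downloaded, the snapshot is a JSON object keyed by deal ID, where each value holds the deal’s on-chain Proposal and State (the same shape that Filecoin.StateMarketDeals returns). Here is a minimal sketch of iterating it, assuming the file has already been parsed into memory; in practice the 20+ GB file calls for stream parsing, and `iterateDeals` is our illustrative name, not part of Spark’s code:

```javascript
// Illustrative sketch: walk the deals in a parsed StateMarketDeals snapshot.
// Keys are deal IDs; each value holds the on-chain Proposal and State.
function* iterateDeals (marketDeals) {
  for (const [dealId, { Proposal, State }] of Object.entries(marketDeals)) {
    yield { dealId: Number(dealId), proposal: Proposal, state: State }
  }
}

// Tiny in-memory example (field values are illustrative):
const sample = {
  1234: {
    Proposal: { Client: 'f01000', Provider: 'f02000', Verified: true },
    State: { SectorStartEpoch: 100, LastUpdatedEpoch: -1, SlashEpoch: -1 }
  }
}
const deals = [...iterateDeals(sample)]
```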

💡
In the future, we will also need to include deals created via the Direct Data Onboarding mechanism recently introduced by FIP-0076.

Filter active FIL+ deals

The next step in our deal-processing pipeline is discarding all deals that are not active or that are not part of the FIL+ program. This is straightforward to implement using the following fields in the DealProposal objects from the Market Deals state:

  • Verified is a boolean field set to true if the deal is part of FIL+.
  • StartEpoch and EndEpoch specify the time interval when the deal is active.
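The two fields above translate directly into a predicate. The following sketch is ours, not taken from the fil-deal-ingester code; the epoch arithmetic assumes Filecoin mainnet’s 30-second epochs and its genesis timestamp of 1598306400 (2020-08-24 22:00 UTC):

```javascript
// Filecoin mainnet constants used to convert wall-clock time to an epoch.
const GENESIS_UNIX_TIME = 1598306400
const EPOCH_DURATION_SECONDS = 30

function currentEpoch (nowUnixTime = Date.now() / 1000) {
  return Math.floor((nowUnixTime - GENESIS_UNIX_TIME) / EPOCH_DURATION_SECONDS)
}

// `proposal` is a DealProposal object from the market deals state.
function isActiveVerifiedDeal (proposal, epoch = currentEpoch()) {
  return proposal.Verified === true &&
    proposal.StartEpoch <= epoch &&
    epoch < proposal.EndEpoch // the deal expires at EndEpoch
}
```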

Keep FIL+ LDN deals only

Lastly, we must filter the deals to keep only those made as part of the FIL+ LDN program. Theoretically, all data needed to construct such a filter is available in the on-chain state. In practice, it was easier to implement the following heuristics, which seem to work well.

⚠️
The LDN program was superseded by v5 Allocators. As of July 2024, Spark considers all FIL+ deals as eligible for retrievability checking.

First, we build a list of all clients that are verified for FIL+ LDN. We are using the following two endpoints offered by the public DataCapStats.io API:

  1. getVerifiers (docs) to find all notaries (verifiers) that contain the string ldn in their description.
  2. getVerifiedClients (docs) to get all clients of a given notary.

const notaries = await findNotaries()

const allLdnClients = []
for (const notaryAddressId of notaries) {
  const clients = await getVerifiedClientsOfNotary(notaryAddressId)
  allLdnClients.push(...clients)
}
removeDuplicates(allLdnClients)

async function findNotaries () {
  const res = await fetch(
    'https://api.datacapstats.io/public/api/getVerifiers?limit=1000',
    { headers: { 'X-API-KEY': API_KEY } }
  )
  const body = await res.json()
  return body.data.map(obj => obj.addressId)
}

async function getVerifiedClientsOfNotary (notaryAddressId) {
  const res = await fetch(
    `https://api.datacapstats.io/public/api/getVerifiedClients/${notaryAddressId}?limit=1000`,
    { headers: { 'X-API-KEY': API_KEY } }
  )
  const body = await res.json()
  return body.data.map(obj => obj.addressId).filter(val => !!val)
}

Second, to determine whether a deal is expected to be publicly retrievable, we check the Client field of the DealProposal. This field contains the address of the client making the deal. If the client is in the list of clients verified for FIL+ LDN, then we consider the deal to belong to the FIL+ LDN program and to have the expectation of public retrievability.
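With the client list loaded into a Set, this membership check is a constant-time lookup on the Client field. A minimal sketch (the function name and the addresses are illustrative, not Spark’s actual code):

```javascript
// Hypothetical list of FIL+ LDN client address IDs built in the first step.
const ldnClients = new Set(['f01111', 'f02222'])

// A deal belongs to FIL+ LDN if its DealProposal's Client is in the list.
function isLdnDeal (proposal, clients = ldnClients) {
  return clients.has(proposal.Client)
}
```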

What’s next

This was the first post in the series explaining how SPARK checks retrievability. In the next posts, we will explore how SPARK finds the content identifiers (CIDs) of the data stored in a deal and the network address from which to fetch the content. Stay tuned!

