🕵🏻

Investigation: Spark RSR drop on 2025-02-28

On February 28th, around midday UTC, there was a significant change in Spark RSR:

  • The overall RSR (Graphsync+HTTP retrievals) converged with the RSR for HTTP-only retrievals.
  • HTTP RSR increased slightly, but this could be just the usual noise in our data.
  • Overall RSR dropped significantly, from ~22% to 13%.

Relevant charts from the Spark Dashboard

The rest of this document is structured in two parts:

  • The first part summarises what we found.
  • The second part provides a detailed description of our investigation process.

Relevant findings

What happened:

  • As part of deploying a new component measuring the retrieval success rate for HTTP HEAD requests, we introduced a stricter validation step that rejects successful measurements missing the response status code for the HEAD request. We did not realise this check was too strict: it also rejected valid measurements for Graphsync retrievals.
  • As a result, no data was recorded for SPs serving Graphsync retrievals only. (See e.g. f03252730 and notice that both total and successful are set to 0.)
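
The failure mode described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual Spark code: the field names (`protocol`, `retrieval_ok`, `head_status_code`) and both function names are hypothetical, and the protocol-aware variant is only one possible way to relax the check, not necessarily the fix we shipped.

```python
from typing import Optional, TypedDict


class Measurement(TypedDict):
    # Hypothetical field names, chosen for illustration only.
    protocol: str                    # "graphsync" or "http"
    retrieval_ok: bool               # True when the retrieval succeeded
    head_status_code: Optional[int]  # HEAD response status; None if no HEAD request was made


def is_valid_too_strict(m: Measurement) -> bool:
    """Reject any successful measurement without a HEAD status code."""
    return not (m["retrieval_ok"] and m["head_status_code"] is None)


def is_valid_protocol_aware(m: Measurement) -> bool:
    """Apply the HEAD status code requirement to HTTP retrievals only."""
    if m["protocol"] != "http":
        return True
    return is_valid_too_strict(m)


graphsync_ok: Measurement = {"protocol": "graphsync", "retrieval_ok": True, "head_status_code": None}
http_ok: Measurement = {"protocol": "http", "retrieval_ok": True, "head_status_code": 200}
http_missing_head: Measurement = {"protocol": "http", "retrieval_ok": True, "head_status_code": None}

# Print how each check classifies the three sample measurements.
for name, m in [
    ("graphsync_ok", graphsync_ok),
    ("http_ok", http_ok),
    ("http_missing_head", http_missing_head),
]:
    print(name, is_valid_too_strict(m), is_valid_protocol_aware(m))
```

With the strict check, a successful Graphsync measurement is rejected simply because it carries no HEAD status code, which is how Graphsync-only SPs ended up with no recorded data.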

What mitigations we implemented:

  • We improved our dashboard to better signal when the validation step starts rejecting more measurements than usual.
  • We added an alert to let us know when this happens, so that we can investigate and fix the problem in a timely manner.
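
As a rough sketch of the idea behind the alert, the check below compares the share of rejected measurements against a baseline. The real alert is configured in Grafana; the function names, the baseline, and the tolerance value here are assumptions made for illustration only.

```python
def rejection_ratio(accepted: int, rejected: int) -> float:
    """Share of measurements rejected by the validation step (0.0 when there is no data)."""
    total = accepted + rejected
    return rejected / total if total else 0.0


def should_alert(accepted: int, rejected: int, baseline_ratio: float, tolerance: float = 0.10) -> bool:
    """Fire when the rejection ratio exceeds the usual baseline by more than `tolerance`.

    Both `baseline_ratio` and `tolerance` are hypothetical tuning knobs.
    """
    return rejection_ratio(accepted, rejected) > baseline_ratio + tolerance


# Hypothetical counts: a normal window versus one where validation rejects far more than usual.
print(should_alert(accepted=900, rejected=100, baseline_ratio=0.10))
print(should_alert(accepted=500, rejected=500, baseline_ratio=0.10))
```

Comparing against a baseline rather than a fixed absolute threshold keeps the check meaningful if the normal rejection rate drifts over time.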

Pull requests:

The updated chart with an alert:

https://spacemeridian.grafana.net/d/c4bd3c80-9360-4ea3-8b75-c5ecea907659/spark-internal-dashboard?from=now-24h&to=now&timezone=utc&showCategory=Panel links&orgId=1

Investigation

Charts from the Internal Spark Dashboard:

It’s probably not relevant to this issue, but the IPNI service has been degraded since Feb 15th; see Spark Internal Dashboard >> IPNI Success Rate.

Sample spot check

Miner: f03252730 - serving Graphsync retrievals only

https://dashboard.filspark.com/provider/f03252730

https://stats.filspark.com/miner/f03252730/retrieval-success-rate/summary?from=2025-02-25&to=2025-03-07

Non-zero RSR on Feb 28th; dropped to zero on March 1st.
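
The same spot check can be repeated for other miners with a small helper like the one below. It builds the summary URL used above and derives a success rate from `total` and `successful` counts. The URL pattern is copied from the link above; the helper names are hypothetical, and the response is only assumed to be JSON, so inspect the actual payload before relying on its shape.

```python
import json
from typing import Optional
from urllib.parse import urlencode
from urllib.request import urlopen

STATS_BASE_URL = "https://stats.filspark.com"


def summary_url(miner_id: str, date_from: str, date_to: str) -> str:
    """Build the per-miner RSR summary URL for a date range (YYYY-MM-DD)."""
    query = urlencode({"from": date_from, "to": date_to})
    return f"{STATS_BASE_URL}/miner/{miner_id}/retrieval-success-rate/summary?{query}"


def success_rate(total: int, successful: int) -> Optional[float]:
    """Return successful/total, or None when no measurements were recorded."""
    return successful / total if total else None


def fetch_summary(miner_id: str, date_from: str, date_to: str):
    """Fetch the summary. Assumes the endpoint returns JSON; the shape is not verified here."""
    with urlopen(summary_url(miner_id, date_from, date_to)) as response:
        return json.load(response)


print(summary_url("f03252730", "2025-02-25", "2025-03-07"))
```

Returning `None` for `total == 0` keeps "no data recorded" distinct from a genuine 0% success rate, which is exactly the distinction that mattered in this incident.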

Miner details (obtained using https://gist.github.com/bajtos/d10cfc39f60ed8fe5a7578f416df530c and https://crates.io/crates/libp2p-lookup):

  • On-chain PeerID: 12D3KooWSv7uy3a9RDGYiaVKx8v65oGg9AcTpjMsjFoxKHmQ9SVx
  • Address: '/ip4/210.209.77.162/tcp/17033'
  • Agent version: boost-2.4.0+mainnet+git.390148b8

I ran several manual checks against this miner from my machine, and all of them passed.

round	payload_cid
32659	bafykbzaceclqej3k65xu5fdtiovuumo5dr67k6djyjcrngve7kcayxi5v6tu2
32659	bafykbzaceacm5i7zlia7eis2zcqjmjedh47ptloam6opbfgyggpmhfdesv7ty
32658	bafykbzacedznjufre7d34fj44aq66zynmuoam6vt47vulirpnfq25e5y752nc
32654	bafykbzacebbvux22ty7ic7dvop242p5xemmzj3dpmi2deenpmpff5caoh4t32
32653	bafykbzacebn7ilw4lud2iu3dnf4bxttpkhuxosmvvauad6mcgynwwstbe4ilg
32653	bafykbzacedc4krbatg5pybzqqeke5dggsjk2wiyr6aypdymdl6maqrtus2i6q
32650	bafykbzacedzuuvtoq6pjcm3xq6m4b5iguvctdexfxipdayl6abdm7odabpofa
32650	bafykbzacebvbv37ah6ax4pnpwy4hgtofuvj6wt5w7elnbunwr7jp7mpwd6d7o
32650	bafykbzacebev4ywbr7ipl2ilp75mxg3ub53lhflrss7flnvoaixwkyiwy3xhe
32650	bafykbzacea4qmft3apqol2ely7mpibvztafsx4irhzemt3wsdiu3iewva4adi

Miner: f03173127 - serving HTTP retrievals

https://dashboard.filspark.com/provider/f03173127

https://stats.filspark.com/miner/f03173127/retrieval-success-rate/summary?from=2025-02-25&to=2025-03-07

Their RSR decreased slightly, but we attribute this decrease to the current IPNI service degradation.