# From Raw Data to Insights
The collector generates millions of raw timing records. Turning them into something a browser can visualize requires a multi-stage pipeline.
## Stage 1: Export to Parquet
The archiver stores data in DuckDB and exports it as Parquet files:
| File | Size | Contents |
|---|---|---|
| `timings.parquet` | ~840 MB (22 shards) | One row per (message, peer) — the core timing data |
| `messages.parquet` | ~10 MB | Message metadata: hash, type, timestamp, payload |
| `metadata.parquet` | ~5 MB | Peer metadata: pubkey, alias, addresses |
## Stage 2: Preprocessing
A Python script (`preprocess.py`) transforms the raw data into visualization-ready JSON:
- Arrival percentiles — For each message, rank peers by arrival time. A peer's `avg_arrival_pct` across all messages determines its radial position in the visualization.
- First-responder scores — Peers that consistently deliver messages before others get high scores. These are candidates for being topologically close to message originators.
- Message selection — From ~416,000 total messages, we select ~181 "interesting" ones: messages received by at least 50 peers, deduplicated, with clear propagation patterns.
- Community assignment — Peers are grouped into communities using a combination of:
  - Known hubs: ~15 manually identified pubkeys (major nodes like ACINQ, Bitfinex, River)
  - Alias matching: nodes with "LNT" in their alias are grouped together
  - Unknown: the remaining ~970 of 978 peers fall into the catch-all "unknown" community
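The percentile and first-responder steps above can be sketched in a few lines of plain Python. This is a minimal illustration of the ranking logic, not `preprocess.py`'s actual code; the tuple layout and function name are assumptions:

```python
from collections import defaultdict

def arrival_percentiles(timings):
    """timings: iterable of (msg_hash, peer_id, arrival_ms) tuples.

    Returns (avg_arrival_pct, first_counts): each peer's mean arrival
    percentile across messages, and how often it delivered a message first.
    """
    by_msg = defaultdict(list)
    for msg, peer, t in timings:
        by_msg[msg].append((t, peer))

    pcts = defaultdict(list)         # peer -> per-message arrival percentiles
    first_counts = defaultdict(int)  # peer -> times it was first to deliver
    for arrivals in by_msg.values():
        arrivals.sort()              # rank peers by arrival time
        n = len(arrivals)
        first_counts[arrivals[0][1]] += 1
        for rank, (_, peer) in enumerate(arrivals):
            # 0.0 = always first, 1.0 = always last for this message
            pcts[peer].append(rank / (n - 1) if n > 1 else 0.0)

    avg_arrival_pct = {p: sum(v) / len(v) for p, v in pcts.items()}
    return avg_arrival_pct, first_counts

# Toy usage: two messages, two peers, each first once.
avg, firsts = arrival_percentiles(
    [("m1", "a", 10), ("m1", "b", 20), ("m2", "a", 5), ("m2", "b", 3)]
)
```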
## Stage 3: JSON Output
The pipeline produces 7 JSON files that the frontend loads directly:
- `peers.json` — Per-peer stats, coordinates, community assignment
- `wavefronts.json` — Per-message arrival sequences (the largest file at 14 MB)
- `messages.json` — Message metadata for the selector
- `communities.json` — Community definitions with colors and labels
- `fingerprints.json` — Peer timing fingerprints
- `leaks.json` — First-responder and colocation analysis
- `summary.json` — Aggregate statistics
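The output step itself can be as simple as serializing each structure with the standard library. A minimal sketch, with the directory name and the keys inside each file assumed for illustration:

```python
import json
from pathlib import Path

# Hypothetical sketch of the output stage; the payload shapes are
# assumptions based on the file descriptions above.
def write_outputs(out_dir, peers, wavefronts, summary):
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for name, payload in [
        ("peers.json", peers),
        ("wavefronts.json", wavefronts),
        ("summary.json", summary),
    ]:
        # Compact separators matter for the multi-megabyte wavefronts file.
        (out / name).write_text(json.dumps(payload, separators=(",", ":")))

write_outputs(
    "viz_data",
    {"peer1": {"avg_arrival_pct": 0.12}},
    {},
    {"n_peers": 978},
)
```

Since the frontend loads these files directly, keeping them as plain static JSON means any web server (or a CDN) can serve the visualization with no backend.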