From Raw Data to Insights

The collector generates millions of raw timing records. Turning them into something a browser can visualize requires a multi-stage pipeline.

Stage 1: Export to Parquet

The archiver stores data in DuckDB and exports it as Parquet files:

| File | Size | Contents |
|------|------|----------|
| timings.parquet | ~840 MB (22 shards) | One row per (message, peer) — the core timing data |
| messages.parquet | ~10 MB | Message metadata: hash, type, timestamp, payload |
| metadata.parquet | ~5 MB | Peer metadata: pubkey, alias, addresses |

Stage 2: Preprocessing

A Python script (preprocess.py) transforms the raw data into visualization-ready JSON:

  1. Arrival percentiles — For each message, rank peers by arrival time. A peer’s avg_arrival_pct across all messages determines its radial position in the visualization.

  2. First-responder scores — Peers that consistently deliver messages before others get high scores. These are candidates for being topologically close to message originators.

  3. Message selection — From ~416,000 total messages, we select ~181 “interesting” ones: messages received by at least 50 peers, deduplicated, with clear propagation patterns.

  4. Community assignment — Peers are grouped into communities using a combination of:

    • Known hubs: ~15 manually identified pubkeys (major nodes like ACINQ, Bitfinex, River)
    • Alias matching: Nodes with “LNT” in their alias are grouped together
    • Unknown: The remaining ~970 of 978 peers fall into the catch-all “unknown” community
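The four steps above can be sketched in a few lines of Python. This is an illustrative reduction, not the real preprocess.py: the input shape, the peer/alias names, and the `KNOWN_HUBS` map are all made up for the example.

```python
from collections import defaultdict

# Toy input: arrivals[msg_id] = list of (peer_id, arrival_time) pairs.
arrivals = {
    "msg_a": [("peer1", 0.01), ("peer2", 0.05), ("peer3", 0.20)],
    "msg_b": [("peer2", 0.02), ("peer1", 0.03), ("peer3", 0.10)],
}

KNOWN_HUBS = {"peer3": "acinq"}  # illustrative pubkey -> community map
aliases = {"peer1": "LNTnode-01", "peer2": "random", "peer3": "ACINQ"}

pcts = defaultdict(list)   # peer -> list of per-message arrival percentiles
first = defaultdict(int)   # peer -> count of messages it delivered first

for msg, peers in arrivals.items():
    ranked = sorted(peers, key=lambda p: p[1])   # order peers by arrival time
    first[ranked[0][0]] += 1                     # first responder for this message
    n = len(ranked)
    for rank, (peer, _t) in enumerate(ranked):
        # 0.0 = always first, 1.0 = always last
        pcts[peer].append(rank / (n - 1) if n > 1 else 0.0)

# Step 1: average arrival percentile -> radial position in the visualization
avg_arrival_pct = {p: sum(v) / len(v) for p, v in pcts.items()}

# Step 2: fraction of messages where this peer arrived first
first_responder_score = {p: first[p] / len(arrivals) for p in pcts}

# Step 4: community assignment (known hubs, then alias match, then catch-all)
def community(peer):
    if peer in KNOWN_HUBS:
        return KNOWN_HUBS[peer]
    if "LNT" in aliases.get(peer, ""):
        return "lnt"
    return "unknown"
```

Step 3 (message selection) reduces to filtering `arrivals` down to messages seen by enough peers, so it is omitted here.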

Stage 3: JSON Output

The pipeline produces 7 JSON files that the frontend loads directly:

  • peers.json — Per-peer stats, coordinates, community assignment
  • wavefronts.json — Per-message arrival sequences (the largest file at 14 MB)
  • messages.json — Message metadata for the selector
  • communities.json — Community definitions with colors and labels
  • fingerprints.json — Peer timing fingerprints
  • leaks.json — First-responder and colocation analysis
  • summary.json — Aggregate statistics
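Because the frontend loads these files statically, the output stage is just serializing the computed dictionaries to disk. A minimal sketch, with an invented output directory and invented field names (the real schema is not shown here):

```python
import json
from pathlib import Path

out = Path("viz_data")
out.mkdir(exist_ok=True)

# Hypothetical per-peer record: stats, layout coordinates, community label.
peers = {
    "peer1": {"avg_arrival_pct": 0.25, "community": "lnt", "x": 0.1, "y": 0.9},
}
summary = {"num_peers": len(peers), "num_messages": 181}

# Each dataset becomes one JSON file the browser can fetch directly.
for name, payload in [("peers.json", peers), ("summary.json", summary)]:
    (out / name).write_text(json.dumps(payload, indent=2))
```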