How Blockchain Indexing Evolved — and Why It Still Holds Back Web3


Every blockchain product—from DeFi dashboards and analytics platforms to reactive services monitoring on-chain events—relies on reliable, real-time access to blockchain data. Despite the explosive growth of high-performance Layer 1 blockchains and Layer 2 scaling solutions, blockchain indexing infrastructure remains one of the most underdeveloped and limiting parts of the Web3 ecosystem.

This article walks through the evolution of blockchain indexing methods, explains why traditional node polling falls short, and presents modern indexing tools such as Firehose and Substreams that bring push-based, scalable blockchain data streaming to developers and projects.

We also wrote a deep-dive blog post on this topic. This version covers the core ideas; if you want to explore the full stack in detail, the full post is linked at the end.


The Limitations of Polling Blockchain Nodes for Data

At the heart of blockchain indexing lies the node — or a cluster of nodes — exposing an RPC interface. Historically, indexers have relied on node polling methods like eth_getLogs and eth_getBlockReceipts to extract on-chain data.

  • eth_getLogs is a basic but widely used method that retrieves event logs filtered by contract address, event signature, or block range. It works well for simple use cases but lacks full transaction metadata: logs carry no timestamps and no internal call data, which forces indexers to make additional RPC calls, adding latency and complexity (see the polling sketch after this list).

  • eth_getBlockReceipts returns the complete set of transaction receipts for a block, including status, gas usage, and emitted logs. While this reduces the number of round trips, it leads to over-fetching, since the method cannot filter by event type or address. This can result in unnecessary data processing and network overhead, especially at scale.

  • More advanced approaches like the debug_traceBlock API enable detailed execution tracing of transactions, including storage changes — critical for use cases like Uniswap V3 fee calculations, where intermediate storage states affect computations. However, relying on debug APIs is limited by provider support and performance constraints.
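
To make the first two limitations concrete, here is a minimal polling sketch in Rust. It assumes the reqwest (with the "blocking" and "json" features) and serde_json crates; the endpoint, block range, and topic filter are placeholders rather than values from this article:

```rust
// Minimal eth_getLogs polling sketch. Endpoint, block range, and topic
// filter are placeholders; assumes reqwest ("blocking" + "json" features)
// and serde_json as dependencies.

use serde_json::{json, Value};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let rpc_url = "https://example-rpc.invalid"; // placeholder node endpoint
    let client = reqwest::blocking::Client::new();

    // keccak256("Transfer(address,address,uint256)"): the ERC-20 Transfer topic
    let transfer_topic =
        "0xddf252ad1be2c89b69c2b068fc378daa952ba7f163c4a11628f55a4df523b3ef";

    let request = json!({
        "jsonrpc": "2.0",
        "id": 1,
        "method": "eth_getLogs",
        "params": [{
            "fromBlock": "0x112a880", // poll a bounded range per iteration
            "toBlock":   "0x112a884",
            "topics":    [transfer_topic]
        }]
    });

    let response: Value = client.post(rpc_url).json(&request).send()?.json()?;

    // Logs carry address/topics/data/blockNumber, but no timestamp and no
    // internal calls: filling those gaps means extra eth_getBlockByNumber
    // or debug_traceBlock calls for every block of interest.
    if let Some(logs) = response["result"].as_array() {
        for log in logs {
            println!("block {}: log from {}", log["blockNumber"], log["address"]);
        }
    }
    Ok(())
}
```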


Why Node Polling Is Inefficient for Real-Time Blockchain Data Streaming

Node polling-based indexing systems suffer from two fundamental drawbacks:

  1. No push-based data streaming: Nodes do not natively push data or allow subscribing to event streams from arbitrary block heights. WebSocket subscriptions (eth_subscribe) provide limited streaming, but only from the moment of connection; they support neither historical replay nor guaranteed event delivery across reconnects.

  2. Lack of native chain reorganization (reorg) handling: Nodes do not notify clients of chain reorganizations. Indexers must implement complex logic to detect and roll back data from orphaned blocks, or else restrict themselves to finalized blocks, sacrificing freshness and real-time capability (a minimal detection sketch follows this list).
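
Because nodes stay silent about reorgs, a polling indexer typically remembers the hash of the last block it processed and checks each new head's parentHash against it. Below is a minimal sketch of that check, under the same assumptions as the previous example (reqwest, serde_json, placeholder endpoint); note that it cannot even distinguish a reorg from a skipped block without extra work:

```rust
// Hand-rolled reorg detection that every polling indexer must carry.
// Same assumptions as the previous sketch: reqwest ("blocking" + "json")
// and serde_json; the endpoint is a placeholder.

use serde_json::{json, Value};

fn latest_header(client: &reqwest::blocking::Client, rpc_url: &str)
    -> Result<Value, Box<dyn std::error::Error>>
{
    let req = json!({
        "jsonrpc": "2.0", "id": 1,
        "method": "eth_getBlockByNumber",
        "params": ["latest", false] // header only, no transaction bodies
    });
    let resp: Value = client.post(rpc_url).json(&req).send()?.json()?;
    Ok(resp["result"].clone())
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let rpc_url = "https://example-rpc.invalid"; // placeholder node endpoint
    let client = reqwest::blocking::Client::new();
    let mut last_hash: Option<String> = None;

    loop {
        let head = latest_header(&client, rpc_url)?;
        let hash = head["hash"].as_str().unwrap_or_default().to_string();
        let parent = head["parentHash"].as_str().unwrap_or_default().to_string();

        if let Some(prev) = &last_hash {
            // New head whose parent is not the block we indexed last:
            // either a reorg orphaned our block, or we simply skipped
            // blocks between polls. A real indexer must walk hashes
            // backwards to the fork point and roll back everything
            // derived from orphaned blocks.
            if *prev != hash && *prev != parent {
                eprintln!("possible reorg or gap at block {hash}");
            }
        }
        last_hash = Some(hash);
        std::thread::sleep(std::time::Duration::from_secs(2)); // poll interval
    }
}
```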

These limitations lead to fragile, inefficient indexing pipelines that are costly to maintain and scale, hindering rapid innovation in Web3 applications.


Replacing Polling with Real-Time Blockchain Data Streaming

Firehose by The Graph fundamentally changes the game by introducing a streaming-first architecture for blockchain data indexing.

How Firehose Works:

  • Modified blockchain nodes push new blocks immediately into a streaming pipeline, instead of requiring clients to poll for data.
  • The data is stored as flat files in S3-compatible cloud buckets, enabling efficient historical data replay without overloading node storage.
  • Firehose exposes a gRPC API optimized for streaming binary blockchain data, replacing inefficient JSON-RPC calls.
  • The system seamlessly switches between historical data (from S3 buckets) and real-time data (from live node streams), so indexers get uninterrupted access from any block height.
  • Built-in reorg notifications alert clients instantly about chain reorganizations, allowing accurate and consistent blockchain data indexing (see the consumer sketch after this list).
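
From the consumer side, the push model looks roughly like the sketch below. It assumes Rust stubs generated by tonic-build from the public sf.firehose.v2 protobuf definitions (tonic, prost, and tokio as dependencies); the endpoint and start block are placeholders:

```rust
// Hedged sketch of a Firehose gRPC consumer, not official client code.
// Assumes stubs generated by tonic-build from the public sf.firehose.v2
// protos (service Stream, rpc Blocks); endpoint and block height are
// placeholders.

pub mod pb {
    tonic::include_proto!("sf.firehose.v2"); // tonic-build output
}

use pb::{stream_client::StreamClient, ForkStep, Request as BlocksRequest};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Plaintext placeholder endpoint; production endpoints use TLS + auth.
    let mut client = StreamClient::connect("http://localhost:10015").await?;

    let request = BlocksRequest {
        start_block_num: 17_000_000, // replay begins from flat files in S3
        final_blocks_only: false,    // also receive new/undo fork steps
        ..Default::default()
    };

    // One long-lived stream: historical blocks first, then live blocks,
    // with no client-side polling at any point.
    let mut stream = client.blocks(request).await?.into_inner();

    while let Some(resp) = stream.message().await? {
        if resp.step == ForkStep::StepNew as i32 {
            println!("block pushed, cursor {}", resp.cursor);
        } else if resp.step == ForkStep::StepUndo as i32 {
            // The built-in reorg signal: roll back this block's effects.
            println!("reorg: undo block at cursor {}", resp.cursor);
        }
        // ForkStep::StepFinal marks blocks that can no longer be reorged.
    }
    Ok(())
}
```

The contrast with polling is the loop itself: the client only reacts to messages the server pushes, and a reorg arrives as an explicit STEP_UNDO message instead of something the indexer has to infer.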

What about Availability?

  • Multiple nodes stream data in parallel, with primary and backup nodes to avoid single points of failure.
  • Multiple reader services consume the streams and write blocks independently to cloud storage.
  • A dedicated merger service deduplicates and bundles blocks into optimized storage buckets — one for finalized blocks, one for forked blocks, and one for raw blocks (the idea is sketched after this list).
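
The merger's dedup-and-bundle step can be pictured with the purely illustrative sketch below; this is not Firehose source code, just the idea of keeping the first copy of each block hash and routing blocks into bundles:

```rust
use std::collections::HashSet;

// Purely illustrative sketch of the merger's dedup-and-bundle idea; this
// is not Firehose source code. Several readers deliver the same blocks,
// so the merger keeps the first copy of each hash and groups blocks into
// bundles by finality status.

#[allow(dead_code)]
#[derive(Clone)]
struct RawBlock {
    number: u64,
    hash: String,
    is_final: bool,
    bytes: Vec<u8>, // serialized block payload
}

#[derive(Default)]
struct Merger {
    seen: HashSet<String>, // block hashes already accepted
    final_bundle: Vec<RawBlock>,
    fork_bundle: Vec<RawBlock>,
}

impl Merger {
    fn ingest(&mut self, block: RawBlock) {
        // Deduplicate: redundant readers send overlapping streams.
        if !self.seen.insert(block.hash.clone()) {
            return;
        }
        // Route into the appropriate bundle (finalized vs forked).
        if block.is_final {
            self.final_bundle.push(block);
        } else {
            self.fork_bundle.push(block);
        }
        // A real merger would flush full bundles to object storage here.
    }
}

fn main() {
    let mut merger = Merger::default();
    let blk = RawBlock { number: 1, hash: "0xabc".into(), is_final: true, bytes: vec![] };
    merger.ingest(blk.clone());
    merger.ingest(blk); // duplicate from a second reader: dropped
    println!("{} final, {} forked", merger.final_bundle.len(), merger.fork_bundle.len());
}
```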

This architecture guarantees scalable, reliable blockchain data streaming critical for enterprise-grade Web3 infrastructure.


Developer-Controlled Custom Data Filtering

While Firehose streams the full blockchain data, many applications only need specific subsets of data. Substreams complements Firehose by enabling developers to write custom WebAssembly modules that filter, transform, and reduce blockchain data streams before they reach the application layer.

Benefits of Substreams:

  • Runs developer-supplied filtering logic on every block, extracting exactly the on-chain data your app needs—no more, no less (see the module sketch after this list).
  • Supports caching and reuse of filtered data to improve efficiency and lower operational costs.
  • Scales horizontally via a front tier and worker pool architecture, processing large block ranges in parallel with ordered, reliable data output.
  • Integrates seamlessly with tools like SQL Sink to load indexed data directly into databases with built-in support for chain reorganizations.
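
A Substreams module itself is ordinary Rust compiled to WebAssembly. The sketch below assumes the substreams and substreams-ethereum crates (plus prost, hex, and hex-literal); the Transfer and Transfers output messages are hypothetical stand-ins for the prost types you would normally generate from your own .proto file:

```rust
// Hedged sketch of a Substreams map module (compiled to WebAssembly).
// Transfer/Transfers are hypothetical output messages; in a real project
// they are prost types generated from your own .proto definitions.

use substreams::errors::Error;
use substreams_ethereum::pb::eth::v2::Block;

#[derive(Clone, PartialEq, ::prost::Message)]
pub struct Transfer {
    #[prost(string, tag = "1")] pub tx_hash: String,
    #[prost(string, tag = "2")] pub log_address: String,
}

#[derive(Clone, PartialEq, ::prost::Message)]
pub struct Transfers {
    #[prost(message, repeated, tag = "1")] pub transfers: Vec<Transfer>,
}

// keccak256("Transfer(address,address,uint256)"): the ERC-20 Transfer topic
const TRANSFER_TOPIC: [u8; 32] = hex_literal::hex!(
    "ddf252ad1be2c89b69c2b068fc378daa952ba7f163c4a11628f55a4df523b3ef"
);

#[substreams::handlers::map]
fn map_transfers(block: Block) -> Result<Transfers, Error> {
    let mut out = Transfers::default();
    // Runs once per block inside the Substreams engine: keep only the
    // ERC-20 Transfer logs and drop everything else server-side, before
    // the data ever reaches the application layer.
    for trace in &block.transaction_traces {
        let Some(receipt) = &trace.receipt else { continue };
        for log in &receipt.logs {
            if log.topics.first().map(|t| t.as_slice())
                == Some(TRANSFER_TOPIC.as_slice())
            {
                out.transfers.push(Transfer {
                    tx_hash: format!("0x{}", hex::encode(&trace.hash)),
                    log_address: format!("0x{}", hex::encode(&log.address)),
                });
            }
        }
    }
    Ok(out)
}
```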

Why Modern Blockchain Indexing Infrastructure Matters

Most blockchains still lack robust, scalable indexing infrastructure, forcing projects to build and maintain expensive custom solutions. This slows product development and impedes real-time analytics, monitoring, and DeFi composability.

Tools like Firehose and Substreams represent the next generation of blockchain data streaming technology, enabling:

  • Efficient, push-based data streaming from any block height
  • Customizable filtering for targeted application data needs
  • Reliable handling of chain reorganizations
  • Scalable infrastructure to support millions of users and high-throughput chains

Want to Go Deeper?

This is just the surface.
If you want a complete breakdown of how indexing infrastructure evolved, with examples, architecture diagrams, and dev insights:
👉 Read the full article on our blog → https://rocknblock.io/blog/a-deep-dive-into-how-to-index-blockchain-data


About Rock’n’Block

Rock’n’Block is a Web3-native development company. We build backend infrastructure and indexing pipelines for projects and protocols across multiple blockchain ecosystems.

Our work spans real-time and historical data processing, with production-ready systems tailored to handle high throughput and complex queries.

Focus areas:

  • Custom blockchain indexing pipelines
  • EVM chains, Solana, TON
  • DeFi dApps development

Case study: How we built a blockchain data streaming service for Blum → https://rocknblock.io/portfolio/blum

We’ve contributed to over 300 projects that collectively reached 71M+ users, raised $160M+, and hit $2.4B+ in peak market cap. Our role is to handle the backend complexity so teams can move faster and ship with confidence.