
How We Built a Serverless Pipeline to Ingest Solana On-Chain Data

By Thanh Nguyen, CTO & Co-Founder, Dreamerly

Between late 2022 and early 2023, my team at Dreamerly set out to build a comprehensive data-ingestion system for the Solana blockchain. Our objective was to retrieve all historical on-chain data, capture new transactions every hour, and store everything for analytics in a scalable, cost-effective way.

Why We Needed This Data Pipeline

At Dreamerly, we specialize in blockchain analytics tools and data-driven insights for Web3 projects. We needed a reliable, large-scale backend system to gather all transaction details from Solana—one of the fastest blockchains out there—and make them available for downstream processing in Databricks.

Specifically, we wanted:

  • Full Historical Data: A complete record of past Solana transactions, going back to the genesis block.
  • Hourly Updates: Fresh on-chain data ingested at a regular interval (every hour), with no real-time constraints.
  • Scalable and Durable Storage: We planned to store raw transaction data in Amazon S3 as compressed JSON, and structured indexes in DynamoDB to enable quick lookups of transactions by address, token, or NFT ID.
  • Minimal Ops Overhead: We preferred a managed, serverless approach, so we wouldn't have to maintain our own Solana nodes or big data infrastructure.

Architecture at a Glance

After researching our options, we settled on a serverless, event-driven design: EventBridge and Step Functions handle scheduling and orchestration, SQS fans work out to Lambda functions that pull blocks from QuickNode, and the results land in S3 (raw data) and DynamoDB (indexes) for analysis in Databricks. Below is a simplified diagram of the pipeline:

[Figure: Solana Data Pipeline Architecture]

Building the Historical Catch-up

When we started this project, Solana had already processed billions of transactions. We faced the challenge of ingesting years of data without blowing through our time or budget.

Step Functions Orchestration

We wrote a Step Functions workflow to coordinate historical block ingestion. It first fetched the latest slot from QuickNode, then iterated from slot zero upwards in predefined block ranges (for instance, 50,000 blocks per chunk).
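To make this concrete, here is a minimal Python sketch of the chunking logic the orchestrator ran; the QUICKNODE_RPC_URL environment variable and the make_chunks helper are illustrative names, not our exact code:

    import json
    import os
    import urllib.request

    # Assumption: the QuickNode HTTPS endpoint is supplied via an environment variable.
    RPC_URL = os.environ["QUICKNODE_RPC_URL"]
    CHUNK_SIZE = 50_000  # blocks per chunk, as described above

    def get_latest_slot() -> int:
        """Fetch the current slot with the standard getSlot RPC method."""
        payload = json.dumps({"jsonrpc": "2.0", "id": 1, "method": "getSlot"}).encode()
        req = urllib.request.Request(
            RPC_URL, data=payload, headers={"Content-Type": "application/json"}
        )
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)["result"]

    def make_chunks(start_slot: int, end_slot: int, size: int = CHUNK_SIZE):
        """Yield (first, last) slot ranges covering [start_slot, end_slot]."""
        for first in range(start_slot, end_slot + 1, size):
            yield first, min(first + size - 1, end_slot)

    def handler(event, context):
        # The Step Functions workflow iterates over these ranges (for example with
        # a Map state) and hands them to the SQS fan-out step described next.
        latest = get_latest_slot()
        return {"chunks": [{"start": s, "end": e} for s, e in make_chunks(0, latest)]}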

SQS Fan-out

For each chunk, our orchestrator function posted a message to an ingest queue in SQS. This approach allowed us to spawn multiple Lambda workers in parallel—each worker could handle a separate block range.
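A rough sketch of that fan-out step, assuming the queue URL is passed in as an environment variable (INGEST_QUEUE_URL is an illustrative name):

    import json
    import os

    import boto3

    sqs = boto3.client("sqs")
    QUEUE_URL = os.environ["INGEST_QUEUE_URL"]  # assumed name for the ingest queue

    def enqueue_chunks(chunks):
        """Post block-range messages to SQS in batches of 10 (the SQS maximum)."""
        for i in range(0, len(chunks), 10):
            batch = chunks[i : i + 10]
            sqs.send_message_batch(
                QueueUrl=QUEUE_URL,
                Entries=[
                    {"Id": str(i + j), "MessageBody": json.dumps(chunk)}
                    for j, chunk in enumerate(batch)
                ],
            )

Each message carries only a start and end slot, so workers stay stateless and any range can be retried on its own.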

Parallel Lambda Fetches

Each worker Lambda fetched the blocks in batches (e.g., 500 blocks per RPC call) to reduce overhead. Once it retrieved a batch, it wrote the data to a compressed JSON file in S3.
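The fetch itself can be sketched as a JSON-RPC 2.0 batch request, which QuickNode endpoints generally accept; the batch size and config options here are illustrative, and the S3 write is sketched in the "Large File Handling in S3" section below:

    import json
    import os
    import urllib.request

    RPC_URL = os.environ["QUICKNODE_RPC_URL"]  # assumed environment variable

    def fetch_blocks(slots):
        """Fetch several blocks in one round trip via a JSON-RPC batch request."""
        batch = [
            {
                "jsonrpc": "2.0",
                "id": slot,
                "method": "getBlock",
                "params": [slot, {"encoding": "json", "maxSupportedTransactionVersion": 0}],
            }
            for slot in slots
        ]
        req = urllib.request.Request(
            RPC_URL,
            data=json.dumps(batch).encode(),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            responses = json.load(resp)
        # Some slots are skipped and have no block; drop null results before writing.
        return {r["id"]: r["result"] for r in responses if r.get("result")}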

Handling Ongoing Data Hourly

Once the historical catch-up finished, we switched to a lighter, ongoing ingestion mode. We set up an Amazon EventBridge rule to run every hour (a minimal sketch of this flow follows the list):

  • Hourly Trigger: The pipeline checked the last block processed and determined the newest blocks to fetch.
  • Lambda Worker: We used the same code to pull blocks from QuickNode, but this time the chunk sizes were smaller (just covering the last hour of activity).
  • S3 and DynamoDB Writes: We appended the new data in S3, partitioned by date/hour, and updated our DynamoDB indexes with fresh transactions.
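A minimal sketch of the hourly entry point, assuming the last-processed slot is kept in a small DynamoDB checkpoint table (CHECKPOINT_TABLE is an illustrative name) and reusing the get_latest_slot and enqueue_chunks helpers from the sketches above:

    import os

    import boto3

    dynamodb = boto3.resource("dynamodb")
    checkpoint_table = dynamodb.Table(os.environ["CHECKPOINT_TABLE"])  # assumed table

    def handler(event, context):
        """Invoked by an EventBridge rule with a rate(1 hour) schedule."""
        item = checkpoint_table.get_item(Key={"pk": "solana-ingest"}).get("Item", {})
        last_slot = int(item.get("last_slot", 0))

        # get_latest_slot() and enqueue_chunks() are the helpers from the earlier
        # sketches; the hourly run simply covers the gap since the last checkpoint.
        latest = get_latest_slot()
        enqueue_chunks([{"start": last_slot + 1, "end": latest}])

        checkpoint_table.put_item(Item={"pk": "solana-ingest", "last_slot": latest})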

Designing DynamoDB Indexes

Storing billions of raw records in DynamoDB would have been extremely expensive. Instead, we kept the complete raw data in S3 and only wrote indexing metadata to DynamoDB (a sketch of the key schemas follows the list). For example:

  • Transactions Table: Indexed by the transaction signature as the partition key, with attributes like block number, timestamp, and references to S3 object paths.
  • Token Transfers Table: Indexed by token mint address, with sort keys on block/time. This enabled queries like "get all transfers of this token in the past 24 hours."
  • NFT Trades Table: We recognized certain program IDs for NFT marketplaces and stored these trades under an NFT mint address.
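To make the key design concrete, here is a minimal boto3 sketch of two of these tables; the table and attribute names are illustrative rather than our production schema:

    import boto3

    dynamodb = boto3.client("dynamodb")

    # Token transfers: partition key on the mint address, sort key on block time,
    # so "all transfers of this token in the past 24 hours" is a single Query.
    dynamodb.create_table(
        TableName="token_transfers",
        AttributeDefinitions=[
            {"AttributeName": "mint_address", "AttributeType": "S"},
            {"AttributeName": "block_time", "AttributeType": "N"},
        ],
        KeySchema=[
            {"AttributeName": "mint_address", "KeyType": "HASH"},
            {"AttributeName": "block_time", "KeyType": "RANGE"},
        ],
        BillingMode="PAY_PER_REQUEST",
    )

    # Transactions: partition key on the signature; block number, timestamp, and
    # the S3 object path are stored as plain attributes on each item.
    dynamodb.create_table(
        TableName="transactions",
        AttributeDefinitions=[{"AttributeName": "signature", "AttributeType": "S"}],
        KeySchema=[{"AttributeName": "signature", "KeyType": "HASH"}],
        BillingMode="PAY_PER_REQUEST",
    )

With the heavy payloads in S3, each DynamoDB item stays small, which is what keeps the index tables affordable at this scale.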

Overcoming Key Challenges

Rate Limits and Throttling

QuickNode enforced request limits. We included exponential backoff in our Lambda RPC calls. We also dynamically adjusted concurrency (the number of parallel Lambdas) so we wouldn't exceed QuickNode's allowance.
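A minimal backoff wrapper along these lines (retry counts and jitter values are illustrative):

    import random
    import time
    import urllib.error

    def with_backoff(call, max_retries=6, base_delay=0.5):
        """Retry a callable on throttling/server errors with exponential backoff and jitter."""
        for attempt in range(max_retries):
            try:
                return call()
            except urllib.error.HTTPError as err:
                if err.code not in (429, 500, 502, 503) or attempt == max_retries - 1:
                    raise
                # Sleep 0.5s, 1s, 2s, ... plus jitter so workers don't retry in lockstep.
                time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.25))

    # Usage with the batched fetch sketched earlier:
    # blocks = with_backoff(lambda: fetch_blocks(range(first_slot, first_slot + 500)))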

Large File Handling in S3

Initially, we tried writing a separate file for each block, but ended up with millions of tiny files, which made listing and reading inefficient. We switched to bundling hundreds of blocks into a single GZIP file, improving both performance and cost.
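A sketch of the bundling approach, with the bucket name and exact key layout as assumptions (the date/hour partitioning matches the layout described in the hourly section):

    import gzip
    import json
    import os
    from datetime import datetime, timezone

    import boto3

    s3 = boto3.client("s3")
    BUCKET = os.environ["RAW_DATA_BUCKET"]  # assumed bucket name

    def write_bundle(blocks: dict) -> str:
        """Write many blocks as one gzip object of newline-delimited JSON."""
        slots = sorted(blocks)
        body = "\n".join(json.dumps({"slot": s, "block": blocks[s]}) for s in slots)

        # Partition the key by date/hour so downstream readers can prune partitions.
        now = datetime.now(timezone.utc)
        key = f"raw/{now:%Y/%m/%d/%H}/blocks_{slots[0]}_{slots[-1]}.json.gz"

        s3.put_object(Bucket=BUCKET, Key=key, Body=gzip.compress(body.encode()))
        return key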

Idempotency

To prevent double-ingestion, we ensured that each block or transaction update had a consistent primary key in DynamoDB. If the same block data got fetched twice (due to a retry), DynamoDB would overwrite the same item, preventing duplication.
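In code, the idempotency falls out of PutItem's overwrite semantics; this sketch uses the illustrative transactions table from the schema sketch above:

    import boto3

    table = boto3.resource("dynamodb").Table("transactions")  # illustrative name

    def index_transaction(signature: str, slot: int, s3_key: str) -> None:
        """PutItem replaces any existing item with the same key, so re-ingesting a
        block after a retry overwrites index entries instead of duplicating them."""
        table.put_item(
            Item={"signature": signature, "slot": slot, "s3_path": s3_key}
        )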

Lessons Learned

  • Partitioning is Key: Breaking data fetches into manageable chunks and distributing them with SQS allowed us to handle Solana's firehose of data without overwhelming QuickNode or hitting Lambda timeouts.
  • Balance Costs & Performance: DynamoDB can scale to billions of items, but it's not always cheap. By limiting how much data we stored in indexes and using S3 as our "source of truth," we kept our costs reasonable.
  • Serverless Saves Ops Overhead: We only paid for the compute we needed. Once the historical backfill ended, we scaled our resources down to a trickle.
  • Testing With Smaller Datasets First: Before ingesting the entire Solana mainnet, we did several dry runs on devnet and a subset of mainnet blocks. Catching memory-limit issues and performance bottlenecks early saved us from painful debugging later.

Conclusion

By embracing a serverless, event-driven architecture, we successfully built a robust pipeline for Solana on-chain data ingestion. The system automatically downloaded and stored years of historical data, then switched to a steady hourly ingestion schedule to keep our dataset current. With the raw data in S3 and light indexes in DynamoDB, we could perform deep-dive analytics in Databricks at scale—without managing a fleet of our own nodes or big data clusters.