Introducing BlockchainBench
AI Can Spot Exploits but Can’t Swap Tokens
There is a striking paradox at the heart of blockchain AI today. Large language models can analyze smart contracts, detect vulnerabilities across 4.6 million historical exploits, and reason fluently about DeFi protocol mechanics. Ask an AI agent to explain how Uniswap V3 concentrated liquidity works and you will get a textbook-quality answer. Ask that same agent to actually provide liquidity in a specific price range, and it falls apart.
We call this the Knowledge RL vs. Operational RL gap. It is the difference between understanding blockchain concepts and being able to execute blockchain operations — the gap between knowing that a flash loan lets you borrow without collateral and actually constructing the callback, executing the arbitrage, and repaying in a single transaction.
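The accounting an agent must get right in that last step is simple to state: the callback ends with the principal plus a premium transferred back, or the whole transaction reverts. A back-of-the-envelope sketch, assuming a 0.05% premium (Aave V3's commonly cited default; a real agent reads the live value on-chain):

```python
# Flash loan repayment sketch. PREMIUM_BPS is an assumed value
# (0.05%, Aave V3's commonly cited default); real code reads the
# premium from the pool contract before executing.
PREMIUM_BPS = 5  # basis points: 5 bps = 0.05%

def flash_loan_repayment(principal: int) -> int:
    """Total owed when the callback returns: principal plus premium.
    If the receiver cannot transfer this amount back, the whole
    transaction reverts, which is what makes the loan collateral-free."""
    return principal + principal * PREMIUM_BPS // 10_000

# Borrowing 1,000,000 USDC (6 decimals) costs a 500 USDC premium.
owed = flash_loan_repayment(1_000_000 * 10**6)
```

Knowing this formula is the knowledge half; wiring it into a receiver contract that actually repays within one transaction is the operational half.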
This gap matters because the next generation of blockchain tooling will not be dashboards and analytics platforms. It will be autonomous agents that operate on-chain: rebalancing positions, responding to market conditions, managing risk across protocols. If those agents cannot reliably perform basic DeFi operations, the entire vision stalls.
What BlockchainBench Tests
BlockchainBench is an open benchmark designed to measure how well AI agents perform real DeFi tasks — not in theory, but in practice. It provides a standardized suite of 13 tasks spanning three difficulty tiers, each requiring agents to interact with actual smart contracts on forked mainnet state.
These are not toy problems. Every task in BlockchainBench mirrors operations that real users and protocols perform daily:
- Token transfers — sending ETH and ERC-20 tokens to specific addresses
- Token swaps — executing trades through Uniswap V2 and V3 routers
- Liquidity provision — adding and removing liquidity from AMM pools
- Lending operations — supplying collateral and borrowing on Aave V3
- Flash loans — constructing and executing atomic borrowing strategies
- Concentrated liquidity — managing positions within precise price ranges on Uniswap V3
Each task has a clear success criterion verified on-chain. There is no ambiguity about whether an agent succeeded — either the expected state change happened on the blockchain, or it did not.
The Harbor Framework
BlockchainBench runs on Harbor, a testing framework we built specifically for evaluating AI agents against blockchain environments. Harbor solves the infrastructure problem that has historically made agent benchmarking on DeFi impractical.
Each task runs inside a Docker container with an Anvil fork of Ethereum mainnet. The agent gets a funded wallet, a set of pre-deployed contracts, and a natural language task description. When the agent signals completion, a pytest suite verifies the on-chain results.
Harbor is agent-agnostic by design. We have tested it with Claude Code, OpenAI Codex, and Gemini CLI, and the framework does not care which agent is driving. If your agent can execute shell commands and interact with a blockchain node, it can run BlockchainBench tasks. This means the benchmark produces comparable results across different agent architectures, model providers, and prompting strategies.
The containerized approach also guarantees reproducibility. Every agent starts from the same forked state, with the same block number, the same token balances, and the same pool configurations. There are no hidden variables.
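Pinning a fork comes down to a handful of parameters. A minimal sketch of the idea, with placeholder values (the RPC URL and block number below are illustrative, not BlockchainBench's actual pins; `--fork-url` and `--fork-block-number` are real Anvil flags):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ForkSpec:
    """Pinned mainnet-fork parameters; every run starts identical.
    Field values used below are illustrative placeholders."""
    fork_url: str
    block_number: int
    chain_id: int = 1  # Ethereum mainnet

spec = ForkSpec(fork_url="https://eth-mainnet.example/rpc",
                block_number=19_000_000)

# An Anvil invocation assembled from the spec; these flags exist in
# Foundry's anvil CLI.
cmd = (f"anvil --fork-url {spec.fork_url} "
       f"--fork-block-number {spec.block_number} "
       f"--chain-id {spec.chain_id}")
```

Freezing everything behind a value like this is what makes two runs of the same task comparable across agents.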
Difficulty Tiers
BlockchainBench organizes its 13 tasks into three tiers that reflect increasing operational complexity:
Easy
Basic operations that any competent DeFi user could perform manually in a few clicks. These tasks test whether an agent can handle fundamental blockchain interactions:
- Send ETH to an address
- Transfer ERC-20 tokens
- Execute a simple token swap on Uniswap V2
- Approve a token for spending
An agent that cannot reliably pass Easy-tier tasks is not ready for autonomous on-chain operation.
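Even the Easy-tier swap has real math underneath it. Uniswap V2's router prices a trade with the constant-product formula plus a 0.3% fee; a pure-Python rendering of that integer math:

```python
def get_amount_out(amount_in: int, reserve_in: int, reserve_out: int) -> int:
    """Output amount for a Uniswap V2 swap (x*y=k with a 0.3% fee),
    mirroring the integer math of the V2 library's getAmountOut."""
    amount_in_with_fee = amount_in * 997          # keep 99.7% of the input
    numerator = amount_in_with_fee * reserve_out
    denominator = reserve_in * 1000 + amount_in_with_fee
    return numerator // denominator

# Small pool, 10-unit trade: price impact plus the fee shave the output.
out = get_amount_out(10, 1_000, 1_000)  # → 9
```

An agent does not need to reimplement this to pass the task, but it does need to understand it well enough to set a sane minimum-output parameter.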
Medium
Multi-step operations that require understanding protocol-specific interfaces and sequencing multiple transactions correctly:
- Swap tokens through Uniswap V3 (which requires different router interfaces than V2)
- Supply assets to Aave V3 and receive aTokens
- Borrow against collateral on Aave V3
- Add liquidity to a Uniswap V2 pool (requiring approval of two tokens and a multi-parameter function call)
Medium-tier tasks separate agents that have surface-level blockchain knowledge from those that can navigate real protocol complexity.
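The borrow step, for instance, is bounded by the collateral's loan-to-value ratio. A sketch with an assumed 80% LTV (an illustrative figure; real limits come from each reserve's on-chain configuration, which the agent must read before choosing an amount):

```python
# Borrow-capacity sketch for an Aave-style lending market. The 80% LTV
# is an assumed illustrative figure, not a quoted Aave V3 parameter;
# agents must read the actual reserve configuration on-chain.
LTV_BPS = 8_000  # 80.00% loan-to-value, in basis points

def max_borrow(collateral_value: int) -> int:
    """Upper bound on debt, in the same base-currency units as the
    collateral value; borrowing right at the cap risks liquidation."""
    return collateral_value * LTV_BPS // 10_000

# 10,000 units of collateral support at most 8,000 units of debt.
cap = max_borrow(10_000)
```

Getting this bound wrong in either direction is a characteristic Medium-tier failure: too high and the transaction reverts, too aggressive and the position is one price move from liquidation.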
Hard
Advanced operations that challenge even experienced DeFi developers. These tasks require deep protocol understanding, precise parameter construction, and sometimes creative problem-solving:
- Execute a flash loan on Aave V3 (constructing a receiver contract, implementing the callback, and ensuring repayment within a single transaction)
- Provide concentrated liquidity on Uniswap V3 within a specified price range (requiring tick math and position management)
- Discovery tasks where the agent must analyze on-chain state to determine the correct parameters before acting
Hard-tier tasks represent the frontier of what we expect autonomous agents to handle. Today, most agents score zero on this tier. That will change, and BlockchainBench will be there to measure that progress.
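To make "tick math" concrete, the core of the concentrated-liquidity task can be sketched in floating point. This is a simplification: production code works in fixed-point sqrtPriceX96 terms, and the tick spacing of 60 below assumes a 0.3%-fee pool:

```python
import math

TICK_BASE = 1.0001  # each tick is a 0.01% price step

def price_to_tick(price: float) -> int:
    """Largest tick whose price (1.0001**tick) is at or below `price`.
    Float sketch only; on-chain code uses fixed-point sqrtPriceX96."""
    return math.floor(math.log(price, TICK_BASE))

def nearest_usable_tick(tick: int, spacing: int = 60) -> int:
    """Positions can only be minted on multiples of the pool's tick
    spacing (60 assumed here, the spacing of 0.3%-fee pools). This
    sketch simply rounds down to the nearest multiple."""
    return tick - (tick % spacing)

# A price ratio of 1.0 sits exactly at tick 0; tick 125 snaps to 120.
lower = nearest_usable_tick(price_to_tick(1.0) + 125)
```

An agent that gets the rounding or the spacing wrong mints nothing, which is one reason this tier defeats agents that can otherwise describe Uniswap V3 fluently.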
Why Open Benchmarks Matter
The AI agent space is full of impressive demos and cherry-picked examples. What it lacks is rigorous, reproducible measurement. Without standardized benchmarks, it is impossible to know whether a new model, prompting technique, or agent architecture actually improves on-chain capability, or whether it just looks good in a blog post.
BlockchainBench is fully open source. The task definitions, the Harbor framework, the verification suites, and the baseline results are all public. We want every team building blockchain AI agents to run these benchmarks and publish their results. Competition on a shared evaluation set is how the field moves forward.
Get Involved
BlockchainBench is live today. Here is how to get started:
- Run the benchmark: Clone the GitHub repo, follow the setup guide, and evaluate your agent
- Explore the results: Visit blockchainbench.com for leaderboards and detailed task breakdowns
- Contribute tasks: We are actively expanding beyond 13 tasks — if you have ideas for meaningful DeFi operations that should be benchmarked, open an issue or submit a PR
- Share your results: Run BlockchainBench against your agent and let us know how it performs
The Knowledge RL vs. Operational RL gap will not close on its own. It will close because researchers and builders measure it, publish results honestly, and iterate. BlockchainBench is our contribution to that effort.