PRODUCT

Agentic RL environments for blockchain.

Docker in. Capability out.

Docker containers where AI agents learn to operate onchain through trial and error. Each environment is a forked EVM chain, a funded wallet, contract ABIs, and a task prompt. No abstraction layers — agents interact via raw JSON-RPC and learn the actual production skill.


DELIVERY: Docker images. Spin up thousands of parallel instances.

INTERFACE: Raw JSON-RPC (eth_sendTransaction · eth_call · eth_getLogs)

GRADING: Automated. Onchain state delta = reward signal.

COST: ~$0.00 per training episode on a forked chain

LIVE EPISODE

Watch an agent learn.

Each episode is one agent interacting with a forked EVM. Every action — every transaction — is recorded. The grader reads post-episode onchain state and returns a scalar reward. No human required.

$ blockchainrl run episode --env dex-trading

[episode 1/100]

action: eth_sendTransaction → swap(WETH, USDC, 1.0)

result: tx 0x3f...a2 confirmed, block 19841203

grader: slippage 0.12% · gas 84,231 (optimal: 82,000)

reward: +0.87 (P&L: +$14.20, gas penalty: −0.03)

[episode 2/100]

action: eth_call → getPool(WETH, USDC, 500)

result: pool 0x88...f1 liquidity: 4.2M, fee: 0.05%

grader: info gathering — no state change

reward: 0.00 (read-only)

[episode 48/100]

action: eth_sendTransaction → exactInput({path, amountIn, amountOutMinimum})

result: tx 0x9c...d1 confirmed — route optimized, 3-hop

grader: slippage 0.03% · gas 82,100 (optimal: 82,000)

reward: +0.99 (P&L: +$21.40, gas penalty: −0.001)
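The gas penalties in the log above are consistent with a simple relative-overhead rule: (84,231 - 82,000) / 82,000 is about 0.027, which rounds to the logged 0.03, and (82,100 - 82,000) / 82,000 is about 0.0012, which rounds to 0.001. The Python sketch below illustrates that rule. It is an assumption inferred from the two logged values, not a documented grader formula, and the name gas_penalty is hypothetical.

```python
def gas_penalty(gas_used: int, gas_optimal: int) -> float:
    """Relative gas overhead above an optimal baseline (assumed rule)."""
    if gas_optimal <= 0:
        raise ValueError("gas_optimal must be positive")
    # Assumption: no bonus for beating the baseline, so clamp at zero.
    return max(0.0, (gas_used - gas_optimal) / gas_optimal)


# Gas figures taken from episodes 1 and 48 of the log.
print(round(gas_penalty(84_231, 82_000), 2))  # 0.03
print(round(gas_penalty(82_100, 82_000), 3))  # 0.001
```

How the penalty combines with P&L and slippage into the final scalar is not stated in the log, so this sketch stops at the penalty term.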

ENVIRONMENTS

Five categories. Machine-verifiable rewards.

DEX

DEX Trading

Execute multi-hop swaps, optimize routing, manage slippage across Uniswap, Curve, and aggregators.

REWARD SIGNAL

P&L · slippage · gas efficiency

LEND

Lending & Borrowing

Supply collateral, manage leverage ratios, monitor liquidation thresholds on Aave, Morpho, and Compound.

REWARD SIGNAL

Interest earned · health factor · gas costs
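As an illustration of the health factor signal, the sketch below uses the Aave-style definition: collateral value weighted by each asset's liquidation threshold, divided by total debt, with a position becoming liquidatable when the result drops below 1. Morpho and Compound express risk with their own parameters, so treat this as one possible formulation. The function name and the example numbers are assumptions.

```python
from typing import Iterable, Tuple


def health_factor(collateral: Iterable[Tuple[float, float]], total_debt: float) -> float:
    """Aave-style health factor.

    collateral: (value_in_base_currency, liquidation_threshold) pairs.
    total_debt: borrowed value in the same base currency.
    """
    if total_debt <= 0:
        return float("inf")  # no debt: the position cannot be liquidated
    weighted = sum(value * threshold for value, threshold in collateral)
    return weighted / total_debt


# Illustrative numbers: 10,000 of collateral at an 80% threshold, 5,000 of debt.
hf = health_factor([(10_000.0, 0.80)], total_debt=5_000.0)
print(round(hf, 2))  # 1.6
```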

Liquidity Provision

Open and rebalance LP positions, manage concentrated liquidity ranges, optimize fee capture.

REWARD SIGNAL

Fee yield · impermanent loss · position health
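To make the impermanent loss signal concrete, here is a minimal sketch of the standard formula for a 50/50 constant-product pool, comparing the LP position with simply holding the two assets. Concentrated liquidity ranges amplify the effect and need a range-aware formula, so this covers the full-range case only. The function name is hypothetical.

```python
import math


def impermanent_loss(price_ratio: float) -> float:
    """Impermanent loss for a 50/50 constant-product pool.

    price_ratio = end price / start price of one asset in terms of the other.
    Returns a non-positive fraction, e.g. -0.2 means 20% worse than holding.
    """
    if price_ratio <= 0:
        raise ValueError("price_ratio must be positive")
    return 2.0 * math.sqrt(price_ratio) / (1.0 + price_ratio) - 1.0


print(round(impermanent_loss(1.0), 4))  # 0.0
print(round(impermanent_loss(4.0), 4))  # -0.2
```

Fees earned are not included here; a grader would presumably net fee yield against this loss.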

XCHAIN

Cross-Chain Ops

Select bridges, manage multi-chain portfolios, optimize for cost, speed, and settlement guarantees.

REWARD SIGNAL

Cost · time · success rate

STRAT

Complex Strategies

Yield farming, arbitrage, liquidation capture, sandwich defense, multi-protocol portfolio management.

REWARD SIGNAL

Total return · risk-adjusted performance

GRADING

The chain is the reward signal.

Blockchain is the only non-game RL domain that grades itself. Every outcome is recorded onchain — funds transferred or not, swap executed or not, position health improved or not.

No human labelers. No LLM judges.
Ground truth by construction.

[tx]

Transaction Outcomes

P&L from swaps, interest earned on deposits, fees captured from LP positions — all read directly from onchain state.

[gas]

Execution Efficiency

Gas consumed, slippage incurred, routing optimality. Every inefficiency is measurable and penalizable.
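As a small example of how one of these inefficiencies can be measured, the sketch below computes slippage as the shortfall of the received amount relative to the quoted amount. This is one common definition and an assumption here; the function name and the sample amounts are illustrative.

```python
def slippage(expected_out: float, actual_out: float) -> float:
    """Fraction of the quoted output that was lost during execution."""
    if expected_out <= 0:
        raise ValueError("expected_out must be positive")
    return (expected_out - actual_out) / expected_out


# Illustrative amounts: quoted 2,500 units, received 2,497.
print(f"{slippage(2500.0, 2497.0):.2%}")  # 0.12%
```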

[ok]

Binary Success Signals

Vulnerability exploited or not. Bridge completed or not. No ambiguity — the EVM is deterministic.

CURRICULUM

Guided. Then expert.

Agents progress from guided tasks with full context to open-ended scenarios where they must discover contracts, parse ABIs, and devise strategy autonomously. The curriculum comes from task prompt context, not abstraction layers.

L1 Guided

Single-chain swaps with exact ABIs and function hints provided.

provided: [abi, contract_addr, function_sig, example_tx]

L2 Assisted

LP management with contract addresses given; the agent encodes the calldata itself.

L3 Independent

Leverage and liquidation scenarios with minimal guidance.

L4 Advanced

Cross-chain operations — agent discovers contracts and plans execution.

L5 Expert

Only a funded wallet. Agent discovers protocols, parses ABIs, devises strategy autonomously.

provided: [funded_wallet]
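One way to picture a curriculum driven by prompt context is a filter that exposes only the fields a level provides. The sketch below is an assumption about how that could look; only L1 and L5 list their fields in the text above, so L2 through L4 are left out rather than guessed. The names PROVIDED and prompt_context are hypothetical.

```python
from typing import Any, Dict, List

# Field lists copied from the L1 and L5 descriptions above.
# L2-L4 are intentionally omitted because their fields are not listed.
PROVIDED: Dict[str, List[str]] = {
    "L1": ["abi", "contract_addr", "function_sig", "example_tx"],
    "L5": ["funded_wallet"],
}


def prompt_context(level: str, full_context: Dict[str, Any]) -> Dict[str, Any]:
    """Keep only the fields that the given level exposes to the agent."""
    allowed = PROVIDED[level]
    return {key: full_context[key] for key in allowed if key in full_context}


# Placeholder values for illustration only.
full = {
    "abi": "<abi json>",
    "contract_addr": "<address>",
    "function_sig": "<signature>",
    "example_tx": "<tx hash>",
    "funded_wallet": "<wallet address>",
}
print(sorted(prompt_context("L1", full)))
print(sorted(prompt_context("L5", full)))
```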

INTEGRATION

Built for AI labs.

Bring your own agent, your own training loop, your own infrastructure. BlockchainRL environments integrate with any RL framework.

DELIVERY

Docker Images

Ship container images with task definitions and graders. Your infrastructure, your scale. Spin up thousands of parallel instances on your own clusters.

$ docker pull blockchainrl/env:dex-trading
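Running many instances side by side mostly comes down to giving each container its own name and host port. The sketch below only builds the docker run commands. The container-side RPC port (8545, Anvil's default) and the base host port are assumptions, since the image's actual port and options are not documented here.

```python
from typing import List

IMAGE = "blockchainrl/env:dex-trading"  # image name from the pull command above
CONTAINER_RPC_PORT = 8545  # ASSUMPTION: Anvil's default port; the image may differ


def docker_run_command(index: int, base_host_port: int = 18545) -> List[str]:
    """Build a `docker run` command with a unique name and host port per instance."""
    host_port = base_host_port + index
    return [
        "docker", "run", "--detach", "--rm",
        "--name", f"dex-trading-{index}",
        "--publish", f"{host_port}:{CONTAINER_RPC_PORT}",
        IMAGE,
    ]


commands = [docker_run_command(i) for i in range(3)]
for cmd in commands:
    print(" ".join(cmd))
```

Each command list can be passed to subprocess.run(cmd, check=True) on a host with Docker installed; at cluster scale an orchestrator would typically take over this role.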

INTERFACE

Raw JSON-RPC

No SDK lock-in. Agents interact via standard Ethereum RPC calls — eth_sendTransaction, eth_call, eth_getLogs. Compatible with any language, any framework.

eth_sendTransaction | eth_call | eth_getLogs
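For reference, the three calls named above are ordinary JSON-RPC 2.0 requests. The sketch below builds their bodies with the standard library only; the addresses and calldata are placeholders, and the helper name rpc_request is hypothetical.

```python
import json
from itertools import count
from typing import Any, Dict, List

_ids = count(1)  # monotonically increasing request ids


def rpc_request(method: str, params: List[Any]) -> Dict[str, Any]:
    """Build a JSON-RPC 2.0 request body."""
    return {"jsonrpc": "2.0", "id": next(_ids), "method": method, "params": params}


# Placeholder values; a real agent would take these from the task context.
WALLET = "0x" + "11" * 20
CONTRACT = "0x" + "22" * 20

read = rpc_request("eth_call", [{"to": CONTRACT, "data": "0x"}, "latest"])
write = rpc_request(
    "eth_sendTransaction",
    [{"from": WALLET, "to": CONTRACT, "data": "0x", "value": "0x0"}],
)
logs = rpc_request(
    "eth_getLogs",
    [{"address": CONTRACT, "fromBlock": "0x0", "toBlock": "latest"}],
)

print(json.dumps(read))
```

Each body is sent as an HTTP POST with Content-Type: application/json to the environment's RPC endpoint, whose URL depends on how the container is run.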

GRADING

Automated Rewards

Graders read post-episode onchain state and compute reward signals automatically. Trajectory logging captures every agent RPC call for analysis.

reward = grader.score(pre, post)
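A minimal sketch of what a score(pre, post) grader could look like, assuming the reward is the relative change in portfolio value between two state snapshots. The Snapshot shape, the DeltaGrader class, and the prices are all hypothetical; real graders presumably read balances and positions from the forked chain.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Snapshot:
    """Token balances plus the USD prices used to value them (hypothetical shape)."""
    balances: Dict[str, float]
    prices_usd: Dict[str, float]

    def value(self) -> float:
        return sum(amount * self.prices_usd[token] for token, amount in self.balances.items())


class DeltaGrader:
    """Scores an episode by the relative change in portfolio value."""

    def score(self, pre: Snapshot, post: Snapshot) -> float:
        start = pre.value()
        if start <= 0:
            raise ValueError("pre-episode portfolio value must be positive")
        return (post.value() - start) / start


# Illustrative prices and balances only.
prices = {"WETH": 3000.0, "USDC": 1.0}
pre = Snapshot({"WETH": 1.0, "USDC": 0.0}, prices)
post = Snapshot({"WETH": 0.0, "USDC": 3030.0}, prices)

grader = DeltaGrader()
reward = grader.score(pre, post)
print(round(reward, 4))  # 0.01
```

Valuing both snapshots at the same price set keeps the reward tied to the agent's actions rather than to price movement during the episode.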

COST CASE

~$0

PER TRAINING EPISODE ONCHAIN

Economics of onchain training.

$$$

Robotics RL

Physical hardware, sensor calibration, safety constraints. Every episode costs real money and real time.

Game RL

Fast simulation but synthetic rewards. Skills don't transfer to real-world economic activity.

~$0

Blockchain RL

Forked chains on Anvil. Infinite scenarios, deterministic replay, real economic logic. Near-zero marginal cost.
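One way to get deterministic replay on a forked Anvil node is its evm_snapshot and evm_revert RPC methods: take a snapshot before an episode, then revert to it before the next one. The sketch below only builds the two request bodies; whether BlockchainRL resets episodes this way is an assumption, and the helper names are hypothetical.

```python
from typing import Any, Dict, List


def rpc_request(method: str, params: List[Any], request_id: int = 1) -> Dict[str, Any]:
    """Build a JSON-RPC 2.0 request body."""
    return {"jsonrpc": "2.0", "id": request_id, "method": method, "params": params}


def snapshot_request() -> Dict[str, Any]:
    """Ask the node to snapshot current state; the response carries a snapshot id."""
    return rpc_request("evm_snapshot", [])


def revert_request(snapshot_id: str) -> Dict[str, Any]:
    """Ask the node to roll state back to a previously returned snapshot id."""
    return rpc_request("evm_revert", [snapshot_id], request_id=2)


before_episode = snapshot_request()
after_episode = revert_request("0x0")  # "0x0" stands in for the id the node returned
print(before_episode)
print(after_episode)
```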

The cost case for labs: A smaller model RL-trained on DeFi operations via BlockchainRL environments can outperform GPT-4 on onchain tasks at 1/100th the inference cost. Every wrapper-dependent agent that fumbles a complex DeFi operation is evidence the underlying model needs better crypto RL training — not more SDK wrappers.

BECOME A DESIGN PARTNER

Limited slots.

We work with select AI labs to co-develop environments tailored to their training pipelines. Pre-seed · 2026.

