Agentic RL environments for blockchain.

Docker in. Capability out.

Docker containers where AI agents learn to operate onchain through trial and error. Each environment is a forked EVM chain, a funded wallet, contract ABIs, and a task prompt. No abstraction layers — agents interact via raw JSON-RPC and learn the actual production skill.

DELIVERY: Docker images; spin up thousands of parallel instances
INTERFACE: Raw JSON-RPC (eth_sendTransaction · eth_call · eth_getLogs)
GRADING: Automated; onchain state delta = reward signal
COST: ~$0.00 per training episode on a forked chain

Watch an agent learn.

Each episode is one agent interacting with a forked EVM. Every action — every transaction — is recorded. The grader reads post-episode onchain state and returns a scalar reward. No human required.

$ blockchainrl run episode --env dex-trading
[episode 1/100]
action: eth_sendTransaction → swap(WETH, USDC, 1.0)
result: tx 0x3f...a2 confirmed, block 19841203
grader: slippage 0.12% · gas 84,231 (optimal: 82,000)
reward: +0.87 (P&L: +$14.20, gas penalty: −0.03)
[episode 2/100]
action: eth_call → getPool(WETH, USDC, 500)
result: pool 0x88...f1 liquidity: 4.2M, fee: 0.05%
grader: info gathering — no state change
reward: 0.00 (read-only)
[episode 48/100]
action: eth_sendTransaction → exactInput({path, amountIn, amountOutMinimum})
result: tx 0x9c...d1 confirmed — route optimized, 3-hop
grader: slippage 0.03% · gas 82,100 (optimal: 82,000)
reward: +0.99 (P&L: +$21.40, gas penalty: −0.001)
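
The gas penalties in the log above are consistent with a simple relative-overshoot rule: excess gas divided by the optimal gas. The sketch below is an assumption that happens to reproduce both logged figures; the page does not state the grader's actual formula, and `gas_penalty` is a hypothetical name.

```python
def gas_penalty(gas_used: int, gas_optimal: int) -> float:
    """Fraction by which gas_used exceeds gas_optimal (0.0 if at or below it)."""
    if gas_optimal <= 0:
        raise ValueError("gas_optimal must be positive")
    return max(0.0, (gas_used - gas_optimal) / gas_optimal)


if __name__ == "__main__":
    # Gas figures taken from episodes 1 and 48 in the log.
    print(round(gas_penalty(84_231, 82_000), 2))  # 0.03
    print(round(gas_penalty(82_100, 82_000), 3))  # 0.001
```

How this penalty is weighted against P&L and slippage in the final reward is not specified, so the sketch stops at the penalty term.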

Five categories. Machine-verifiable rewards.

[DEX] DEX Trading (L1)
Execute multi-hop swaps, optimize routing, and manage slippage across Uniswap, Curve, and aggregators.
REWARD SIGNAL: P&L · slippage · gas efficiency

[LEND] Lending & Borrowing (L2)
Supply collateral, manage leverage ratios, and monitor liquidation thresholds on Aave, Morpho, and Compound.
REWARD SIGNAL: Interest earned · health factor · gas costs

[LP] Liquidity Provision (L3)
Open and rebalance LP positions, manage concentrated liquidity ranges, and optimize fee capture.
REWARD SIGNAL: Fee yield · impermanent loss · position health

[XCHAIN] Cross-Chain Ops (L4)
Select bridges, manage multi-chain portfolios, and optimize for cost, speed, and settlement guarantees.
REWARD SIGNAL: Cost · time · success rate

[STRAT] Complex Strategies (L5)
Yield farming, arbitrage, liquidation capture, sandwich defense, and multi-protocol portfolio management.
REWARD SIGNAL: Total return · risk-adjusted performance
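
Two of the reward signals named above have widely used closed forms, sketched here for orientation. The impermanent-loss formula assumes a 50/50 constant-product pool (concentrated-liquidity positions behave differently), and the health factor follows the Aave-style definition of threshold-weighted collateral over debt. Function names are hypothetical, and nothing here is claimed to be how the environments' graders compute these values.

```python
import math


def impermanent_loss(price_ratio: float) -> float:
    """LP value relative to simply holding, minus 1, for a 50/50 constant-product pool.

    price_ratio is new_price / price_at_deposit. The result is <= 0.
    """
    if price_ratio <= 0:
        raise ValueError("price_ratio must be positive")
    return 2 * math.sqrt(price_ratio) / (1 + price_ratio) - 1


def health_factor(collateral_value: float, liquidation_threshold: float,
                  debt_value: float) -> float:
    """Threshold-weighted collateral divided by debt; below 1.0 means liquidatable."""
    if debt_value == 0:
        return math.inf  # no debt, nothing to liquidate
    return collateral_value * liquidation_threshold / debt_value
```

For example, a 4x price move gives `impermanent_loss(4.0)` of about -0.2, a 20% shortfall versus holding.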

The chain is the reward signal.

Blockchain is one of the few non-game RL domains that grades itself. Every outcome is recorded onchain: funds transferred or not, swap executed or not, position health improved or not.

No human labelers. No LLM judges.
Ground truth by construction.

[tx]
Transaction Outcomes

P&L from swaps, interest earned on deposits, fees captured from LP positions — all read directly from onchain state.

[gas]
Execution Efficiency

Gas consumed, slippage incurred, routing optimality. Every inefficiency is measurable and penalizable.

[ok]
Binary Success Signals

Vulnerability exploited or not. Bridge completed or not. No ambiguity — the EVM is deterministic.

Guided. Then expert.

Agents progress from guided tasks with full context to open-ended scenarios where they must discover contracts, parse ABIs, and devise strategy autonomously. The curriculum comes from task prompt context, not abstraction layers.

L1 Guided

Single-chain swaps with exact ABIs and function hints provided.

provided: [abi, contract_addr, function_sig, example_tx]

L2 Assisted

LP management with contract addresses given, agent encodes calldata.

L3 Independent

Leverage and liquidation scenarios with minimal guidance.

L4 Advanced

Cross-chain operations — agent discovers contracts and plans execution.

L5 Expert

Only a funded wallet. Agent discovers protocols, parses ABIs, devises strategy autonomously.

provided: [funded_wallet]
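
One way to picture a curriculum driven by task prompt context is a per-level allowlist of fields that the prompt includes. In the sketch below the L1 and L5 lists come from the text; the L2 to L4 lists are guesses inferred from the level descriptions, and `build_task_context` is a hypothetical helper, not part of any published API.

```python
# Fields handed to the agent at each level, in addition to the funded wallet.
PROVIDED_BY_LEVEL = {
    "L1": ["abi", "contract_addr", "function_sig", "example_tx"],  # from the text
    "L2": ["abi", "contract_addr"],  # assumption: addresses given, agent encodes calldata
    "L3": ["contract_addr"],         # assumption: "minimal guidance"
    "L4": [],                        # assumption: agent discovers contracts itself
    "L5": [],                        # from the text: only a funded wallet
}


def build_task_context(level: str, full_context: dict) -> dict:
    """Return only the fields the agent is allowed to see at this level."""
    allowed = ["funded_wallet"] + PROVIDED_BY_LEVEL[level]
    return {key: full_context[key] for key in allowed if key in full_context}
```

The same underlying environment can then serve every level; only the filtered context changes.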

Built for AI labs.

Bring your own agent, your own training loop, your own infrastructure. BlockchainRL environments integrate with any RL framework.

DELIVERY

Docker Images

Ship container images with task definitions and graders. Your infrastructure, your scale. Spin up thousands of parallel instances on your own clusters.

$ docker pull blockchainrl/env:dex-trading
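
Fanning out many instances can be as simple as generating one `docker run` command per container, each published on its own host port. This sketch only builds the commands. It assumes the image serves JSON-RPC on container port 8545 (Anvil's default; the page does not say), and the base host port and container names are arbitrary placeholders.

```python
def docker_run_commands(image: str, count: int,
                        base_port: int = 18545,
                        container_port: int = 8545) -> list:
    """Build argv lists for `count` detached containers on consecutive host ports."""
    commands = []
    for i in range(count):
        commands.append([
            "docker", "run", "--detach", "--rm",
            "--name", f"env-{i}",                             # hypothetical naming scheme
            "--publish", f"{base_port + i}:{container_port}",  # host:container
            image,
        ])
    return commands
```

Each list can be passed to `subprocess.run(cmd, check=True)`; at cluster scale an orchestrator would normally take over this role.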
INTERFACE

Raw JSON-RPC

No SDK lock-in. Agents interact via standard Ethereum RPC calls — eth_sendTransaction, eth_call, eth_getLogs. Compatible with any language, any framework.

eth_sendTransaction | eth_call | eth_getLogs
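
To make "raw JSON-RPC" concrete, here is a minimal sketch of an `eth_call` built without any SDK: hand-encoded calldata for the ERC-20 `balanceOf(address)` function (selector `0x70a08231`) wrapped in a JSON-RPC 2.0 envelope. The helper names are hypothetical, and posting the request to an environment's endpoint is left to the caller.

```python
import json

# First 4 bytes of keccak256("balanceOf(address)"), the standard ERC-20 selector.
BALANCE_OF_SELECTOR = "70a08231"


def encode_balance_of(holder: str) -> str:
    """ABI-encode balanceOf(holder): selector plus the address left-padded to 32 bytes."""
    addr = holder[2:] if holder.lower().startswith("0x") else holder
    addr = addr.lower()
    if len(addr) != 40:
        raise ValueError("address must be 20 bytes (40 hex characters)")
    int(addr, 16)  # raises ValueError if not valid hex
    return "0x" + BALANCE_OF_SELECTOR + addr.rjust(64, "0")


def eth_call_request(to: str, data: str, request_id: int = 1) -> str:
    """Serialize a JSON-RPC 2.0 eth_call against the latest block."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "eth_call",
        "params": [{"to": to, "data": data}, "latest"],
    })
```

The response's `result` field is a hex-encoded 32-byte word, which `int(result, 16)` turns into the token balance in base units.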
GRADING

Automated Rewards

Graders read post-episode onchain state and compute reward signals automatically. Trajectory logging captures every agent RPC call for analysis.

reward = grader.score(pre, post)
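
The one-liner above can be read as a pure function over two state snapshots. The sketch below assumes each snapshot holds the agent wallet's token balances and cumulative gas spend, and that the reward is the change in portfolio value minus gas cost, with both snapshots valued at the same prices. The `Snapshot` type and all field names are hypothetical; the actual grader is not described beyond that one line.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    balances: dict              # token symbol -> amount held by the agent wallet
    gas_spent_usd: float = 0.0  # cumulative gas cost, converted to USD


def portfolio_value(snapshot: Snapshot, prices_usd: dict) -> float:
    """Value all balances at the given USD prices."""
    return sum(amount * prices_usd[token]
               for token, amount in snapshot.balances.items())


def score(pre: Snapshot, post: Snapshot, prices_usd: dict) -> float:
    """Scalar reward: portfolio value delta minus gas spent during the episode."""
    pnl = portfolio_value(post, prices_usd) - portfolio_value(pre, prices_usd)
    gas = post.gas_spent_usd - pre.gas_spent_usd
    return pnl - gas
```

Because the function depends only on `pre`, `post`, and a price table, the same episode always yields the same reward.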
~$0
PER TRAINING EPISODE ON A FORKED CHAIN

Economics of onchain training.

$$$
Robotics RL

Physical hardware, sensor calibration, safety constraints. Every episode costs real money and real time.

$$
Game RL

Fast simulation but synthetic rewards. Skills don't transfer to real-world economic activity.

~$0
Blockchain RL

Forked chains on Anvil. Infinite scenarios, deterministic replay, real economic logic. Near-zero marginal cost.
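
Deterministic replay on Anvil is commonly built on its `evm_snapshot` and `evm_revert` RPC methods: take a snapshot once the fork is set up, then revert to it before each episode. The sketch below only constructs those requests. Whether BlockchainRL resets environments this way is not stated, and the helper names are hypothetical. It takes a fresh snapshot after every revert on the assumption that a snapshot id is consumed when it is reverted to.

```python
from itertools import count

_request_ids = count(1)  # monotonically increasing JSON-RPC ids


def rpc(method: str, params=None) -> dict:
    """Build a JSON-RPC 2.0 request body."""
    return {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": method,
        "params": params if params is not None else [],
    }


def snapshot_request() -> dict:
    """Ask the node to snapshot current state; the response carries the snapshot id."""
    return rpc("evm_snapshot")


def revert_request(snapshot_id: str) -> dict:
    """Ask the node to roll state back to a previously taken snapshot."""
    return rpc("evm_revert", [snapshot_id])


def episode_reset_requests(snapshot_id: str) -> list:
    """Requests to send, in order, before the next episode starts."""
    return [revert_request(snapshot_id), snapshot_request()]
```

The id returned by the second request replaces `snapshot_id` for the following episode.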

The cost case for labs: A smaller model RL-trained on DeFi operations via BlockchainRL environments can outperform GPT-4 on onchain tasks at 1/100th the inference cost. Every wrapper-dependent agent that fumbles a complex DeFi operation is evidence the underlying model needs better crypto RL training — not more SDK wrappers.
BECOME A DESIGN PARTNER

Limited slots.

We work with select AI labs to co-develop environments tailored to their training pipelines. Pre-seed · 2026.