Product
Agentic RL Environments for Blockchain
Docker containers where AI agents learn to operate on-chain through trial and error. Each environment bundles a forked EVM chain, a funded wallet, contract ABIs, and a task prompt. No abstraction layers -- agents interact via raw JSON-RPC and learn the actual production skill.
[episode 1/100]
action: eth_sendTransaction → swap(WETH, USDC, 1.0)
result: tx 0x3f...a2 confirmed, block 19841203
grader: slippage 0.12%, gas 84,231 (optimal: 82,000)
reward: +0.87 (P&L: +$14.20, gas penalty: -0.03)
[episode 2/100]
action: eth_call → getPool(WETH, USDC, 500)
result: pool 0x88...f1 liquidity: 4.2M, fee: 0.05%
grader: info gathering, no state change
reward: 0.00 (read-only, no reward)
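An episode trace like the one above can be driven with nothing but standard JSON-RPC requests. A minimal sketch in Python (stdlib only); the calldata is a placeholder, not production-ready encoding:

```python
import json

def rpc_payload(method: str, params: list, request_id: int = 1) -> str:
    """Build a JSON-RPC 2.0 request body for a standard Ethereum node."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": method,
        "params": params,
    })

# Read-only pool lookup, as in episode 2 above. The selector and padded
# arguments are illustrative placeholders.
call = rpc_payload("eth_call", [{
    "to": "0x1F98431c8aD98523631AE4a59f267346ea31F984",  # Uniswap v3 factory (mainnet)
    "data": "0x1698ee82" + "00" * 96,                    # getPool selector + placeholder args
}, "latest"])
```

POST the resulting body to any RPC endpoint (a forked Anvil node in these environments) and the response is the raw on-chain answer.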
Environments
Five Environment Categories
From single-asset swaps to multi-protocol portfolio strategies. Each category produces machine-verified rewards with no human annotators required.
DEX Trading
P&L, slippage, gas efficiency
Execute multi-hop swaps, optimize routing, manage slippage across Uniswap, Curve, and aggregators.
Lending & Borrowing
Interest earned, health factor, gas costs
Supply collateral, manage leverage ratios, monitor liquidation thresholds on Aave, Morpho, and Compound.
Liquidity Provision
Fee yield, impermanent loss, position health
Open and rebalance LP positions, manage concentrated liquidity ranges, optimize fee capture.
Cross-Chain Operations
Cost, time, success rate
Select bridges, manage multi-chain portfolios, optimize for cost, speed, and settlement guarantees.
Complex Strategies
Total return, risk-adjusted performance
Yield farming, arbitrage, liquidation capture, sandwich defense, and multi-protocol portfolio management.
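To make the lending metrics concrete: the health factor agents must keep above 1.0 on Aave-style markets reduces to a single ratio. A minimal sketch, with illustrative numbers:

```python
def health_factor(collateral_usd: float, liquidation_threshold: float,
                  debt_usd: float) -> float:
    """Aave-style health factor: the position is liquidatable below 1.0."""
    if debt_usd == 0:
        return float("inf")  # no debt, no liquidation risk
    return collateral_usd * liquidation_threshold / debt_usd

# $10,000 of WETH collateral at an 82.5% liquidation threshold
# against $5,000 of USDC debt.
hf = health_factor(10_000, 0.825, 5_000)
```

Graders read these inputs straight from post-episode on-chain state, so the metric needs no human judgment.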
Grading
Machine-Verifiable Rewards
— The chain is the reward signal.
Blockchain is the only non-game RL domain that grades itself. Every outcome is recorded on-chain -- funds transferred or not, swap executed or not, position health improved or not. This yields automated reward signals with a near-zero false-positive rate.
No human labelers. No LLM judges. Ground truth by construction.
Transaction Outcomes
P&L from swaps, interest earned on deposits, fees captured from LP positions -- all read directly from on-chain state.
Execution Efficiency
Gas consumed, slippage incurred, routing optimality. Every inefficiency is measurable and penalizable.
Binary Success Signals
Vulnerability exploited or not. Bridge completed or not. No ambiguity -- the EVM is deterministic.
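The efficiency metrics above are simple arithmetic over quoted versus realized execution. A toy sketch -- the function names and formulas are illustrative, not the production grader:

```python
def slippage_bps(expected_out: float, actual_out: float) -> float:
    """Realized slippage in basis points relative to the quoted output."""
    return (expected_out - actual_out) / expected_out * 10_000

def gas_overhead(gas_used: int, gas_optimal: int) -> float:
    """Fraction of gas spent beyond the best-known route (0.0 = optimal)."""
    return max(0, gas_used - gas_optimal) / gas_optimal
```

Both quantities come from transaction receipts and pre-trade quotes, so every inefficiency is measurable without an annotator in the loop.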
Curriculum
Graduated Difficulty Curriculum
Agents progress from guided tasks with full context to open-ended scenarios where they must discover contracts, parse ABIs, and devise strategy autonomously. Difficulty is controlled by how much context the task prompt provides, not by abstraction layers.
Guided
Single-chain swaps with exact ABIs and function hints provided
provided: [abi, contract_addr, function_sig, example_tx]
Assisted
LP management with contract addresses given, agent encodes calldata
Independent
Leverage and liquidation scenarios with minimal guidance
Advanced
Cross-chain operations -- agent discovers contracts and plans execution
Expert
Only a funded wallet. Agent discovers protocols, parses ABIs, and devises strategy autonomously.
provided: [funded_wallet]
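One way to picture the curriculum is as a table of what each tier injects into the task prompt. The schema below is a hypothetical sketch, not BlockchainRL's actual task format:

```python
# Illustrative tier definitions mirroring the curriculum above.
CURRICULUM = [
    {"tier": "guided",      "provided": ["abi", "contract_addr", "function_sig", "example_tx"]},
    {"tier": "assisted",    "provided": ["contract_addr"]},
    {"tier": "independent", "provided": ["task_prompt"]},
    {"tier": "advanced",    "provided": ["task_prompt"]},
    {"tier": "expert",      "provided": ["funded_wallet"]},
]

def context_for(tier: str) -> list:
    """Return the context keys injected into the task prompt for a tier."""
    return next(t["provided"] for t in CURRICULUM if t["tier"] == tier)
```

The environment itself never changes across tiers; only the prompt context shrinks until the agent is left with a wallet and a goal.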
Integration
Built for AI Labs
Bring your own agent, your own training loop, your own infrastructure. BlockchainRL environments integrate with any RL framework.
Delivery
Docker Images
We ship container images with task definitions and graders baked in. Your infrastructure, your scale. Spin up thousands of parallel instances on your own clusters.
$ docker pull blockchainrl/env:dex-trading
Interface
Raw JSON-RPC
No SDK lock-in. Agents interact via standard Ethereum RPC calls -- eth_sendTransaction, eth_call, eth_getLogs. Compatible with any language, any framework.
eth_sendTransaction | eth_call | eth_getLogs
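Observing fills, for example, needs nothing beyond a standard eth_getLogs filter object. A minimal sketch (the function name and topic wildcard are illustrative):

```python
# Sketch: build an eth_getLogs request for events emitted by a pool contract.
# The null topic matches any event signature; narrow it to a specific hash
# to watch only Swap events.
def swap_log_filter(pool: str, from_block: str = "0x0",
                    to_block: str = "latest") -> dict:
    return {
        "jsonrpc": "2.0",
        "id": 1,
        "method": "eth_getLogs",
        "params": [{
            "address": pool,
            "fromBlock": from_block,
            "toBlock": to_block,
            "topics": [None],  # wildcard on the event signature topic
        }],
    }
```

Because this is the same interface every wallet and indexer speaks, skills learned in the environment transfer directly to production tooling.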
Grading
Automated Rewards
Graders read post-episode on-chain state and compute reward signals automatically. Trajectory logging captures every agent RPC call for analysis.
reward = grader.score(pre, post)
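A grader in this style can be sketched as a pre/post state diff. The field names and weights below are assumptions for illustration, not the shipped grader API:

```python
class Grader:
    """Toy grader: reward = scaled P&L minus a gas penalty, both read from
    wallet state snapshots taken before and after the episode."""

    def __init__(self, gas_weight: float = 1e-6):
        self.gas_weight = gas_weight

    def score(self, pre: dict, post: dict) -> float:
        pnl = post["portfolio_usd"] - pre["portfolio_usd"]  # on-chain valuation diff
        gas = post["gas_spent"] - pre["gas_spent"]          # cumulative gas counter
        return pnl / 100 - gas * self.gas_weight            # scale P&L to ~unit range

reward = Grader().score(
    pre={"portfolio_usd": 10_000.0, "gas_spent": 0},
    post={"portfolio_usd": 10_014.2, "gas_spent": 84_231},
)
```

Because both snapshots come from deterministic EVM state, the same trajectory always replays to the same reward.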
Cost Case
~$0
per training episode on-chain
Economics of On-Chain Training
Blockchain environments deliver effectively unlimited training scenarios at near-zero marginal cost. Every scenario exercises real economic logic -- not toy rewards.
$$$
Robotics RL
Physical hardware, sensor calibration, safety constraints. Every episode costs real money and real time.
$$
Game RL
Fast simulation but synthetic rewards. Skills don't transfer to real-world economic activity.
~$0
Blockchain RL
Forked chains on Anvil. Infinite scenarios, deterministic replay, real economic logic. Near-zero marginal cost.
The cost case for labs: A smaller model RL-trained on DeFi operations via BlockchainRL environments can outperform GPT-4 on on-chain tasks at 1/100th the inference cost. Every wrapper-dependent agent that fumbles a complex DeFi operation is evidence the underlying model needs better crypto RL training -- not more SDK wrappers.
Become a Design Partner
We are working with select AI labs to co-develop environments tailored to their training pipelines. Limited slots available.