Overview
Eval is a staked market for AI model quality. Today, benchmarks get gamed and leaderboards stay opaque: nobody really knows which model is best for a given task, and centralized rankings have no skin in the game. Eval fixes this with real economic stakes.
Tasks are posted and models are submitted to compete on them. Evaluators stake tokens on outcomes, evaluation is blind and settled by consensus, and whoever evaluates honestly is rewarded while anyone trying to game the network loses their stake. The result is a manipulation-resistant, continuously updated ranking of which model actually wins, settled on-chain.
Why on-chain
Eval needs a blockchain chiefly because real stakes make dishonesty costly in a way a centralized leaderboard never can. Settling on-chain gives the protocol four properties:
- Real money at stake makes manipulation expensive.
- Settlement is transparent and verifiable by anyone.
- Participation is permissionless: no gatekeeper decides whose results count.
- Payouts fan out to many evaluators automatically via smart contract.
Core concepts
- Task
- A well-specified problem posted to the network with a bounty. It declares its inputs, success criteria, evaluation rubric, redundancy parameters, and settlement rule up front.
- Submission
- A model entered to compete on a task. At intake it is anonymized into an abstract ID so judgment rides on the work, not the brand behind it.
- Market
- A single task plus its open set of competing submissions and the panel of staked evaluators converging on a verdict.
- Stake
- Tokens an evaluator commits to take part. Stake is the collateral that makes a judgment accountable: correct work is rewarded, dishonest work is slashed.
- Consensus
- The verdict reached when multiple independent, blind evaluators converge on the same outcome. No single actor can move it.
- Settlement
- The on-chain step that finalizes an outcome, pays correct evaluators from the reward pool, and slashes provably dishonest ones.
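As a rough illustration, the Task concept above can be pictured as plain data. This is a minimal sketch, not a published schema: every field name and value below (`bounty`, `redundancy.evaluatorsPerSubmission`, `settlementRule`, and so on) is an assumption made for the example.

```javascript
// Hypothetical task spec. All field names and values are illustrative assumptions.
const task = {
  id: "task-001",
  bounty: 1000, // tokens funded by the requester
  inputs: ["brief-01.txt"],
  successCriteria: "summary preserves every cited holding",
  rubric: ["accuracy", "coverage", "concision"],
  redundancy: { evaluatorsPerSubmission: 5 },
  settlementRule: "median-score",
};

// A task declares its parameters up front, so a spec with gaps can be rejected early.
function missingFields(spec) {
  const required = ["inputs", "successCriteria", "rubric", "redundancy", "settlementRule"];
  return required.filter((key) => spec[key] === undefined);
}

console.log(missingFields(task)); // []
```

Rejecting an incomplete spec before a market opens is one way to make sure evaluators never have to guess at the rubric or the settlement rule.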
Market lifecycle
Every market runs the same four stages. Each one is designed so that the cheapest path to reward is doing genuinely useful work.
- 01 Post task
- A requester defines a task and funds a bounty. The task spec (inputs, success criteria, evaluation rubric) is published openly.
- 02 Submit models
- Researchers submit models to compete. Entries are anonymized into abstract submission IDs so judgment can't ride on a brand.
- 03 Stake & blind-evaluate
- Evaluators stake tokens and score masked submissions independently. Redundancy and golden-set checks keep the panel honest.
- 04 Settle & rank
- Consensus settles outcomes on-chain. Correct evaluators are rewarded, manipulators are slashed, and the ranking updates.
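The four stages can be read as a strictly forward-moving state machine. The sketch below assumes that a market never skips or revisits a stage, which the text implies but does not state; the stage identifiers are made up for the example.

```javascript
// Stage identifiers mirror the four steps above; the names are assumptions.
const STAGES = ["post-task", "submit-models", "stake-and-evaluate", "settle-and-rank"];

// Return the stage that follows `stage`, or null once the market has settled.
function nextStage(stage) {
  const i = STAGES.indexOf(stage);
  if (i === -1) throw new Error(`unknown stage: ${stage}`);
  return i + 1 < STAGES.length ? STAGES[i + 1] : null;
}

// Walk one market through its whole lifecycle.
const visited = [];
for (let stage = STAGES[0]; stage !== null; stage = nextStage(stage)) {
  visited.push(stage);
}
console.log(visited.join(" -> "));
```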
Anti-gaming design
The real challenge in any staked market for AI is circular, self-referential activity: participants farming the reward mechanism instead of producing useful evaluation. Eval's core innovation is the architecture that makes that the losing move.
Blind evaluation
Submissions are masked behind abstract IDs. Evaluators judge the work, not the name behind it, removing brand bias and collusion targets.
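One way to picture masking is an intake step that keeps the identity mapping private and hands evaluators only an abstract ID. This is a sketch under assumptions: `createIntake`, `mask`, and `reveal` are hypothetical names, and the idea that the mapping is looked up again after settlement is a guess about how the ranking gets attributed.

```javascript
// Intake sketch: strip identifying fields and expose only an abstract ID.
// `makeId` is injected so the ID scheme can be swapped out (hypothetical design).
function createIntake(makeId) {
  const registry = new Map(); // abstract ID -> original entry, private to intake

  return {
    mask(entry) {
      const submissionId = makeId();
      registry.set(submissionId, entry);
      // Evaluators receive the ID and the manifest, never the model name or author.
      return { submissionId, manifest: entry.manifest };
    },
    // Assumed to be used only after settlement, to attribute the ranking.
    reveal(submissionId) {
      return registry.get(submissionId);
    },
  };
}

let counter = 0;
const intake = createIntake(() => `sub-${String(++counter).padStart(4, "0")}`);
const masked = intake.mask({
  model: "your-model-id",
  author: "lab-x",
  manifest: "./model.manifest",
});
console.log(masked.submissionId); // sub-0001
```

The sequential counter keeps the example deterministic, but it leaks submission order; a real intake would more plausibly draw random IDs and would also need to scrub identifying details from the manifest itself.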
Redundant consensus
Every outcome is scored by multiple independent evaluators. A single actor can't move the verdict; the truth is what the panel converges on.
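To see why redundancy blunts a single actor, consider aggregating scores with a median. The aggregation rule is an assumption chosen for illustration; the text does not say which rule Eval uses.

```javascript
// Median of independent scores: one extreme score cannot drag the verdict far.
function median(scores) {
  if (scores.length === 0) throw new Error("no scores");
  const sorted = [...scores].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

const honest = [7, 8, 8, 7];
console.log(median([...honest, 8]));   // 8
console.log(median([...honest, 100])); // 8
```

With a mean instead, the same outlier would pull the result to 26, so the choice of aggregation rule matters as much as the number of evaluators.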
Golden-set checks
Known-answer probes are mixed into evaluation streams. Evaluators who fail planted checks reveal themselves as careless or adversarial.
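A golden-set check can be sketched as scoring an evaluator only on the items whose answers are already known. The `tolerance` parameter, the item ID format, and the function name are hypothetical; the text does not specify how a planted check is graded.

```javascript
// Golden-set sketch: probes with known answers are interleaved with real items.
// `goldenAnswers` maps probe ID -> expected score (all names are assumptions).
function goldenAccuracy(responses, goldenAnswers, tolerance = 1) {
  let checked = 0;
  let passed = 0;
  for (const { itemId, score } of responses) {
    if (!goldenAnswers.has(itemId)) continue; // a real item with no known answer
    checked += 1;
    if (Math.abs(score - goldenAnswers.get(itemId)) <= tolerance) passed += 1;
  }
  return checked === 0 ? null : passed / checked;
}

const goldenAnswers = new Map([["g-1", 9], ["g-2", 2]]);
const careful = [
  { itemId: "s-1", score: 6 },
  { itemId: "g-1", score: 9 },
  { itemId: "g-2", score: 3 },
];
const careless = careful.map((r) => ({ ...r, score: 5 })); // same score for everything

console.log(goldenAccuracy(careful, goldenAnswers));  // 1
console.log(goldenAccuracy(careless, goldenAnswers)); // 0
```

Because evaluators cannot tell probes from real items, answering everything with the same score shows up as a low pass rate without anyone having to inspect the real items.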
Slashing
Stake is forfeit for provably dishonest or off-consensus behavior. Manipulation stops being free. It becomes the most expensive move on the board.
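The incentive can be made concrete with a toy settlement function. Everything numeric here is an assumption: the `tolerance` band, the 50% `slashRate`, and the equal split of the reward pool are placeholders, and the sketch treats distance from consensus as the only slashing trigger, leaving out other proofs of dishonesty.

```javascript
// Settlement sketch. `tolerance` and `slashRate` are hypothetical parameters.
function settle(evaluations, consensus, rewardPool, { tolerance = 1, slashRate = 0.5 } = {}) {
  const agrees = (e) => Math.abs(e.score - consensus) <= tolerance;
  const winners = evaluations.filter(agrees).length;
  const reward = winners > 0 ? rewardPool / winners : 0; // equal split (assumption)

  return evaluations.map((e) => ({
    evaluator: e.evaluator,
    reward: agrees(e) ? reward : 0,
    slashed: agrees(e) ? 0 : e.stake * slashRate,
  }));
}

const result = settle(
  [
    { evaluator: "ev-1", score: 8, stake: 100 },
    { evaluator: "ev-2", score: 7, stake: 100 },
    { evaluator: "ev-3", score: 2, stake: 100 },
  ],
  8,  // consensus score
  60, // reward pool for this outcome
);
console.log(result);
```

Under these numbers the two on-consensus evaluators each earn 30 tokens, while the off-consensus one earns nothing and forfeits 50, so deviating costs more than an honest evaluation pays.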
Participant roles
- Requester
- Defines a task, funds the bounty, and publishes the spec openly. Gets back a settled, manipulation-resistant ranking of which submission wins.
- Researcher
- Submits a model to compete on a task with a reproducible manifest and the required stake. Entries are anonymized at intake.
- Evaluator
- Stakes tokens, scores masked submissions independently, and earns rewards for matching consensus. Off-consensus or dishonest behavior is slashed.
Token & economics
Every market pays a small protocol rake that sustains the network and seeds the reward pool paying evaluators for correct work. The token itself has two core jobs, staking and governance, plus a benefit for active participants:
- Stake to evaluate: Holding and staking the token is how you earn the right to evaluate. Your stake is the collateral that makes your judgment accountable.
- Governance: Token holders govern protocol standards: evaluation rubrics, redundancy parameters, slashing thresholds, and which markets open.
- Priority & lower fees: Active, well-staked participants receive priority in evaluation queues and reduced protocol fees on the markets they take part in.
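The rake can be worked through with simple arithmetic. The numbers are invented for the example: a 2% rake (200 basis points) taken from the bounty, split evenly between the network and the reward pool. None of these rates, nor the choice to take the rake from the bounty, comes from the protocol itself.

```javascript
// Rake arithmetic in whole token units, using basis points to avoid fractions.
// The 200 bps rate and the 50/50 split are made-up illustrative values.
function splitBounty(bounty, rakeBps = 200, poolShareBps = 5000) {
  const rake = Math.floor((bounty * rakeBps) / 10000);
  const toRewardPool = Math.floor((rake * poolShareBps) / 10000);
  return {
    rake,
    toRewardPool,
    toNetwork: rake - toRewardPool,
    remaining: bounty - rake, // what is left of the bounty after the rake
  };
}

const split = splitBounty(10000);
console.log(split.rake, split.toRewardPool, split.toNetwork, split.remaining);
// 200 100 100 9800
```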
Disclaimer
The Eval token is a utility and access asset for using and governing the protocol. It is not an investment, security, or a claim on revenue, and rewards are compensation for correct evaluation work that can be reduced or slashed. Nothing here is financial advice.
Submitting a model
A researcher points a model at a market, attaches a reproducible manifest, and posts the required stake. Intake anonymizes the entry so evaluation stays blind from the start. The flow below is illustrative; no live endpoint exists yet.
    // Illustrative submission flow, not a live endpoint
    const market = eval.market("summarize-legal-brief")

    await eval.submit(market, {
      model: "your-model-id", // anonymized at intake
      manifest: "./model.manifest", // reproducible config
      stake: market.minStake, // entry requires stake
    })

    // Intake masks your identity, then routes the
    // submission into blind, redundant evaluation.

FAQ
- Is any of the data on this site real?
- No. Every leaderboard row, market, and figure is an illustrative placeholder using abstract IDs, clearly marked as demo. There are no real model names or metrics anywhere.
- Why does this need a blockchain?
- Real stakes make manipulation expensive in a way a centralized leaderboard never can. Settlement is transparent and verifiable, participation is permissionless, and payouts fan out to many evaluators automatically by smart contract.
- What stops people from gaming the rewards?
- The anti-gaming design: blind evaluation, redundant consensus across many evaluators, golden-set checks, and slashing. Together they make honest work the cheapest strategy.
- Is the token an investment?
- No. It is a utility and access asset used to stake into evaluation and to govern the protocol. It is not a security, equity, or a claim on revenue, and rewards are compensation for correct evaluation work that can be slashed.
- Can I use it today?
- Not yet. Eval is being built in the open. Wallet connection and live markets are not wired up yet; early markets are planned to open soon.
Glossary
- Golden set
- Known-answer probes mixed into evaluation streams to catch careless or adversarial evaluators.
- Slashing
- Forfeiture of staked tokens for provably dishonest or off-consensus behavior.
- Rake
- A small protocol fee taken from each market that sustains the network and seeds the reward pool.
- Blind evaluation
- Scoring submissions behind abstract IDs so evaluators cannot see who produced the work.
- Redundancy
- Scoring each outcome with multiple independent evaluators so no single actor controls the verdict.