Documentation

Everything you need to understand Eval.

The full picture in one place: what Eval is, how a market runs end to end, the anti-gaming design that keeps rankings honest, who does what, and how the token works.

Overview

Eval is a staked market for AI model quality. Today, benchmarks get gamed and leaderboards stay opaque: nobody really knows which model is best for a given task, and centralized rankings have no skin in the game. Eval fixes this with real economic stakes.

Tasks are posted and models are submitted to compete on them. Evaluators stake tokens on outcomes, evaluation is blind and settled by consensus, and whoever evaluates honestly is rewarded while anyone trying to game the network loses their stake. The result is a manipulation-resistant, continuously updated ranking of which model actually wins, settled on-chain.

Why on-chain

Eval needs a blockchain for one concrete reason: real stakes make dishonesty costly in a way a centralized leaderboard never can.

  • Real money at stake makes manipulation expensive.
  • Settlement is transparent and verifiable by anyone.
  • Participation is permissionless: no gatekeeper decides whose results count.
  • Payouts fan out to many evaluators automatically via smart contract.

Core concepts

Task
A well-specified problem posted to the network with a bounty. It declares its inputs, success criteria, evaluation rubric, redundancy parameters, and settlement rule up front.
Submission
A model entered to compete on a task. At intake it is anonymized into an abstract ID so judgment rides on the work, not the brand behind it.
Market
A single task plus its open set of competing submissions and the panel of staked evaluators converging on a verdict.
Stake
Tokens an evaluator commits in order to take part. Stake is the collateral that makes a judgment accountable: correct judgments are rewarded, and dishonest ones cost the evaluator their stake through slashing.
Consensus
The verdict reached when multiple independent, blind evaluators converge on the same outcome. No single actor can move it.
Settlement
The on-chain step that finalizes an outcome, pays correct evaluators from the reward pool, and slashes provably dishonest ones.
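To make the concepts above concrete, here is a minimal data-model sketch in TypeScript. Every type and field name (`TaskSpec`, `settlementRule`, `openMarket`, and so on) is hypothetical and not a published Eval schema; the only idea taken from the text is that a task declares all of its parameters up front, so a market refuses to open on an incomplete spec.

```typescript
// Hypothetical shapes for the core concepts; names are illustrative only.
interface TaskSpec {
  id: string;
  inputs: string[];                      // what every submission receives
  successCriteria: string[];             // what counts as solving the task
  rubric: string[];                      // dimensions evaluators score against
  redundancy: number;                    // independent evaluators per outcome
  settlementRule: "plurality" | "median"; // assumed set of settlement rules
  bounty: bigint;                        // requester-funded, in token base units
}

interface Submission {
  submissionId: string; // abstract ID assigned at intake
  taskId: string;
}

interface Market {
  task: TaskSpec;
  submissions: Submission[];         // open set of competing entries
  evaluators: Map<string, bigint>;   // evaluator -> staked amount
}

// List everything the spec leaves undeclared or invalid.
// The "redundancy >= 2" floor is an assumption: consensus needs several evaluators.
function validateTaskSpec(spec: TaskSpec): string[] {
  const problems: string[] = [];
  if (spec.inputs.length === 0) problems.push("no inputs declared");
  if (spec.successCriteria.length === 0) problems.push("no success criteria");
  if (spec.rubric.length === 0) problems.push("empty rubric");
  if (!Number.isInteger(spec.redundancy) || spec.redundancy < 2) {
    problems.push("redundancy must be an integer >= 2");
  }
  if (spec.bounty <= 0n) problems.push("bounty must be positive");
  return problems;
}

// A market only opens on a fully declared spec, with no entries or stakes yet.
function openMarket(spec: TaskSpec): Market {
  const problems = validateTaskSpec(spec);
  if (problems.length > 0) throw new Error(`invalid task: ${problems.join("; ")}`);
  return { task: spec, submissions: [], evaluators: new Map() };
}
```

Failing fast at `openMarket` means no evaluator ever stakes against a task whose rules could change later.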

Market lifecycle

Every market runs the same four stages. Each one is designed so that the cheapest path to reward is doing genuinely useful work.

  1. Post task

    A requester defines a task and funds a bounty. The task spec (inputs, success criteria, evaluation rubric) is published openly.

  2. Submit models

    Researchers submit models to compete. Entries are anonymized into abstract submission IDs so judgment can't ride on a brand.

  3. Stake & blind-evaluate

    Evaluators stake tokens and score masked submissions independently. Redundancy and golden-set checks keep the panel honest.

  4. Settle & rank

    Consensus settles outcomes on-chain. Correct evaluators are rewarded, manipulators are slashed, and the ranking updates.
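One way to picture the four stages is as a strict state machine: a market moves forward one stage at a time and each action is only allowed in its own stage. The sketch below assumes that model; the stage names and the `advance` and `requireStage` helpers are hypothetical, not part of a published Eval API.

```typescript
// The four lifecycle stages in order (stage names are illustrative).
const STAGES = ["posted", "submitting", "evaluating", "settled"] as const;
type Stage = (typeof STAGES)[number];

// Move a market forward exactly one stage; there is no way to skip or rewind.
function advance(current: Stage): Stage {
  const i = STAGES.indexOf(current);
  if (i === STAGES.length - 1) throw new Error("market is already settled");
  return STAGES[i + 1];
}

// Guard each action so it only happens in its own stage,
// e.g. no new submissions once blind evaluation has started.
function requireStage(current: Stage, expected: Stage, action: string): void {
  if (current !== expected) {
    throw new Error(`${action} requires stage "${expected}", market is "${current}"`);
  }
}
```

Closing submissions before evaluation starts matters for the anti-gaming design: evaluators score a fixed set, so nobody can enter a model after seeing how the panel is leaning.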

Anti-gaming design

The real challenge in any staked market for AI is circular, self-referential activity: participants farming the reward mechanism instead of producing useful evaluation. Eval's core innovation is an architecture that makes such farming the losing move.

Blind evaluation

Submissions are masked behind abstract IDs. Evaluators judge the work, not the name behind it, removing brand bias and collusion targets.
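A minimal sketch of how masking could work, assuming a per-market secret salt: each model ID is turned into an abstract ID with a keyed hash (HMAC-SHA-256 from Node's `node:crypto`), so the same model gets unlinkable IDs in different markets and evaluators can't reverse the mapping. The salt scheme, the `sub_` prefix, and both function names are assumptions for illustration, not Eval's documented intake process.

```typescript
import { createHmac, randomBytes } from "node:crypto";

// Abstract ID -> real model ID. In this sketch the map stays sealed
// (visible only to intake) until settlement.
function maskSubmissions(modelIds: string[], salt: Buffer): Map<string, string> {
  const masked = new Map<string, string>();
  for (const modelId of modelIds) {
    // Keyed hash: without the per-market salt, the ID reveals nothing.
    const digest = createHmac("sha256", salt).update(modelId).digest("hex");
    masked.set("sub_" + digest.slice(0, 12), modelId);
  }
  return masked;
}

// What evaluators see: abstract IDs only, shuffled (Fisher-Yates)
// so list position does not leak submission order either.
function evaluatorView(masked: Map<string, string>): string[] {
  const ids = [...masked.keys()];
  for (let i = ids.length - 1; i > 0; i--) {
    const j = randomBytes(4).readUInt32BE(0) % (i + 1);
    [ids[i], ids[j]] = [ids[j], ids[i]];
  }
  return ids;
}
```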

Redundant consensus

Every outcome is scored by multiple independent evaluators. A single actor can't move the verdict; the truth is what the panel converges on.
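As an illustration of "the truth is what the panel converges on", here is a hypothetical settlement rule: each evaluator independently names the submission they judge best, and a verdict exists only if a quorum voted and one submission cleared a supermajority. The quorum, the 2/3 threshold, and the `consensus` function are assumptions; the text does not specify Eval's actual rule.

```typescript
// One independent, blind verdict: the submission an evaluator judged best.
interface Vote {
  evaluator: string;
  winner: string; // abstract submission ID
}

// Plurality with a quorum and a supermajority threshold (assumed rule).
// Returns null when the panel has not converged, so nothing settles.
function consensus(votes: Vote[], minPanel: number, threshold = 2 / 3): string | null {
  const voters = new Set(votes.map((v) => v.evaluator));
  if (voters.size !== votes.length) throw new Error("duplicate evaluator");
  if (votes.length < minPanel) return null;

  const tally = new Map<string, number>();
  for (const v of votes) tally.set(v.winner, (tally.get(v.winner) ?? 0) + 1);

  let best: string | null = null;
  let bestCount = 0;
  for (const [id, n] of tally) {
    if (n > bestCount) {
      best = id;
      bestCount = n;
    }
  }
  return bestCount / votes.length >= threshold ? best : null;
}
```

Because the threshold is above one half, at most one submission can clear it, so ties never produce an arbitrary winner; a single dissenting evaluator can lower the margin but cannot flip the result.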

Golden-set checks

Known-answer probes are mixed into evaluation streams. Evaluators who fail planted checks reveal themselves as careless or adversarial.
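A minimal sketch of golden-set scoring, assuming each planted probe has a single correct answer: an evaluator is scored only on the probes hidden in their stream and flagged if their accuracy falls below a floor. The 0.8 floor and the function names are hypothetical; in Eval such thresholds would be governance parameters.

```typescript
// An evaluator's answer to one item; most items are ordinary,
// a few are planted golden probes with a known correct answer.
interface Answer {
  itemId: string;
  answer: string;
}

// Accuracy measured only on the golden probes the evaluator encountered.
function goldenAccuracy(answers: Answer[], golden: Map<string, string>): number {
  const probed = answers.filter((a) => golden.has(a.itemId));
  if (probed.length === 0) return NaN; // no probes seen yet
  const correct = probed.filter((a) => golden.get(a.itemId) === a.answer).length;
  return correct / probed.length;
}

// Hypothetical policy: flag anyone below an accuracy floor.
function flagEvaluator(answers: Answer[], golden: Map<string, string>, floor = 0.8): boolean {
  const acc = goldenAccuracy(answers, golden);
  return !Number.isNaN(acc) && acc < floor;
}
```

Since probes look like ordinary items, an evaluator can't answer carefully only on the checks and carelessly on everything else.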

Slashing

Stake is forfeit for provably dishonest or off-consensus behavior. Manipulation stops being free. It becomes the most expensive move on the board.
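A sketch of how a slash could be computed with integer token arithmetic, assuming basis points (1 bp = 0.01%) and two fault classes: provable dishonesty forfeits the full stake, while plain off-consensus disagreement forfeits a smaller fraction. The 20% figure, the fault classes, and the `slash` function are all hypothetical; the text only states that stake is forfeit for either behavior and that slashing thresholds are governed.

```typescript
interface Stake {
  evaluator: string;
  amount: bigint; // token base units
}

// Hypothetical slash schedule in basis points.
const SLASH_BPS = { dishonest: 10_000n, offConsensus: 2_000n } as const;
type Fault = keyof typeof SLASH_BPS;

// Apply each evaluator's fault (if any); return reduced stakes and the total slashed.
function slash(stakes: Stake[], faults: Map<string, Fault>): { remaining: Stake[]; slashed: bigint } {
  let slashed = 0n;
  const remaining = stakes.map((s) => {
    const fault = faults.get(s.evaluator);
    if (!fault) return s; // honest, on-consensus: untouched
    const cut = (s.amount * SLASH_BPS[fault]) / 10_000n; // bigint division rounds down
    slashed += cut;
    return { evaluator: s.evaluator, amount: s.amount - cut };
  });
  return { remaining, slashed };
}
```

Using `bigint` rather than `number` avoids floating-point rounding on token balances, which matters when the same computation must be reproducible on-chain.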

Participant roles

Requester
Defines a task, funds the bounty, and publishes the spec openly. Gets back a settled, manipulation-resistant ranking of which submission wins.
Researcher
Submits a model to compete on a task with a reproducible manifest and the required stake. Entries are anonymized at intake.
Evaluator
Stakes tokens, scores masked submissions independently, and earns rewards for matching consensus. Off-consensus or dishonest behavior is slashed.

Token & economics

Every market pays a small protocol rake that sustains the network and seeds the reward pool paying evaluators for correct work. The token itself serves three core functions:

  • Stake to evaluate: Holding and staking the token is how you earn the right to evaluate. Your stake is the collateral that makes your judgment accountable.
  • Governance: Token holders govern protocol standards: evaluation rubrics, redundancy parameters, slashing thresholds, and which markets open.
  • Priority & lower fees: Active, well-staked participants receive priority in evaluation queues and reduced protocol fees on the markets they take part in.
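The rake-to-reward flow can be sketched as two integer computations: take the rake from an amount, split it between network upkeep and the reward pool, then pay consensus-matching evaluators pro rata to their stake. All parameters here are hypothetical (a 2% rake, half of it to the pool, and the assumption that the rake is taken from the bounty); the text does not specify any of them, and pro-rata-by-stake is one plausible split, not a documented rule.

```typescript
// Hypothetical parameters; live values would be set by governance.
const BPS = 10_000n;
const RAKE_BPS = 200n;         // 2% rake
const POOL_SHARE_BPS = 5_000n; // half the rake seeds the evaluator reward pool

// Take the rake from an amount (assumed here to be the market's bounty).
function takeRake(amount: bigint): { rake: bigint; toPool: bigint; toNetwork: bigint } {
  const rake = (amount * RAKE_BPS) / BPS;
  const toPool = (rake * POOL_SHARE_BPS) / BPS;
  return { rake, toPool, toNetwork: rake - toPool };
}

// Pay consensus-matching evaluators from the pool, pro rata to stake.
// Integer division rounds down; the leftover "dust" stays in the pool.
function payEvaluators(
  pool: bigint,
  matched: Map<string, bigint>,
): { payouts: Map<string, bigint>; dust: bigint } {
  const totalStake = [...matched.values()].reduce((a, b) => a + b, 0n);
  const payouts = new Map<string, bigint>();
  if (totalStake === 0n) return { payouts, dust: pool };
  let paid = 0n;
  for (const [evaluator, stake] of matched) {
    const share = (pool * stake) / totalStake;
    payouts.set(evaluator, share);
    paid += share;
  }
  return { payouts, dust: pool - paid };
}
```

For example, a pool of 100 split between stakes of 1 and 2 pays 33 and 66, leaving 1 unit of dust rather than over-paying by rounding up.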

Disclaimer

The Eval token is a utility and access asset for using and governing the protocol. It is not an investment, security, or a claim on revenue, and rewards are compensation for correct evaluation work that can be reduced or slashed. Nothing here is financial advice.

Submitting a model

A researcher points a model at a market, attaches a reproducible manifest, and posts the required stake. Intake anonymizes the entry so evaluation stays blind from the start. The flow below is illustrative; no live endpoint exists yet.

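submit.ts
// Illustrative submission flow, not a live endpoint.
// `evalClient` is a hypothetical SDK handle; `eval` itself can't be used as a
// binding name in strict-mode TypeScript, and would shadow the global function.
declare const evalClient: {
  market(id: string): Promise<{ id: string; minStake: bigint }>
  submit(market: { id: string }, entry: { model: string; manifest: string; stake: bigint }): Promise<void>
}

const market = await evalClient.market("summarize-legal-brief")

await evalClient.submit(market, {
  model: "your-model-id",        // anonymized at intake
  manifest: "./model.manifest",  // reproducible config
  stake: market.minStake,        // entry requires at least the market's minimum stake
})

// Intake masks your identity, then routes the
// submission into blind, redundant evaluation.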

FAQ

Is any of the data on this site real?
No. Every leaderboard row, market, and figure is an illustrative placeholder using abstract IDs, clearly marked as demo. There are no real model names or metrics anywhere.
Why does this need a blockchain?
Real stakes make manipulation expensive in a way a centralized leaderboard never can. Settlement is transparent and verifiable, participation is permissionless, and payouts fan out to many evaluators automatically by smart contract.
What stops people from gaming the rewards?
The anti-gaming design: blind evaluation, redundant consensus across many evaluators, golden-set checks, and slashing. Together they make honest work the cheapest strategy.
Is the token an investment?
No. It is a utility and access asset used to stake into evaluation and to govern the protocol. It is not a security, equity, or a claim on revenue, and rewards are compensation for correct evaluation work that can be slashed.
Can I use it today?
Not yet. Eval is being built in the open. Wallet connection and live markets are not wired up, and early markets are opening soon.

Glossary

Golden set
Known-answer probes mixed into evaluation streams to catch careless or adversarial evaluators.
Slashing
Forfeiture of staked tokens for provably dishonest or off-consensus behavior.
Rake
A small protocol fee taken from each market that sustains the network and seeds the reward pool.
Blind evaluation
Scoring submissions behind abstract IDs so evaluators cannot see who produced the work.
Redundancy
Scoring each outcome with multiple independent evaluators so no single actor controls the verdict.

Stop trusting the leaderboard.
Settle it in the market.

Post a task, submit a model, or stake to evaluate. Eval turns model quality into a market that's expensive to fake and open to verify.
