How it works

Four stages, one loop: post, submit, evaluate, settle.

Eval turns a vague question (which model is actually best for this task?) into a market with real stakes. Here's the full loop, and the anti-gaming design that keeps the answer honest.

The loop

From an open question to a settled ranking.

Every market runs the same four stages. Each one is designed so that the cheapest path to reward is doing genuinely useful work.

01

Post task

A requester defines a task and funds a bounty. The task spec (inputs, success criteria, evaluation rubric) is published openly.

02

Submit models

Researchers submit models to compete. Entries are anonymized into abstract submission IDs so judgment can't ride on a brand.

03

Stake & blind-evaluate

Evaluators stake tokens and score masked submissions independently. Redundancy and golden-set checks keep the panel honest.

04

Settle & rank

Consensus settles outcomes on-chain. Correct evaluators are rewarded, manipulators are slashed, and the ranking updates.
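
The four stages can be read as a strict, forward-only sequence. The sketch below illustrates that ordering in Python; it is not the protocol's implementation, and the `Stage` enum and `Market` class are hypothetical names introduced here.

```python
from enum import Enum


class Stage(Enum):
    POST = 1      # requester publishes the task spec and funds the bounty
    SUBMIT = 2    # researchers enter models under abstract submission IDs
    EVALUATE = 3  # staked evaluators score masked submissions
    SETTLE = 4    # consensus settles payouts, slashing, and the ranking


class Market:
    """Hypothetical market that only moves forward, one stage at a time."""

    def __init__(self) -> None:
        self.stage = Stage.POST

    def advance(self) -> Stage:
        if self.stage is Stage.SETTLE:
            raise RuntimeError("market is already settled")
        self.stage = Stage(self.stage.value + 1)
        return self.stage


market = Market()
visited = [market.stage] + [market.advance() for _ in range(3)]
print([stage.name for stage in visited])
```

Modeling the loop this way makes the ordering explicit: nothing can be evaluated before it is submitted, and a settled market cannot be reopened.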

The hard part

The real challenge is circular, self-referential gaming.

In any staked market for AI, the danger is participants farming the reward mechanism instead of producing useful evaluation. Eval's core innovation is the anti-gaming architecture that makes farming the losing move.

01

Blind evaluation

Submissions are masked behind abstract IDs. Evaluators judge the work, not the name behind it, removing brand bias and collusion targets.
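
One way to picture masking is a random ID per entry, with the link back to the submitter held privately. This is a sketch under assumptions rather than Eval's actual scheme: the `mask_submissions` helper, the `sub-` prefix, and the `lab-a` / `lab-b` names are all hypothetical.

```python
import secrets


def mask_submissions(
    submissions: dict[str, str],
) -> tuple[dict[str, str], dict[str, str]]:
    """Return (public view keyed by opaque ID, private ID -> submitter map)."""
    public: dict[str, str] = {}
    private: dict[str, str] = {}
    for submitter, output in submissions.items():
        # Random ID: it carries no information about who submitted the work.
        submission_id = "sub-" + secrets.token_hex(8)
        public[submission_id] = output      # what evaluators see
        private[submission_id] = submitter  # withheld from evaluators
    return public, private


public, private = mask_submissions({
    "lab-a": "Summary of the brief ...",
    "lab-b": "Another summary ...",
})
print(sorted(public))  # opaque IDs only, different on every run
```

Evaluators only ever receive `public`, so there is no name to favor and no known target to coordinate around.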

02

Redundant consensus

Every outcome is scored by multiple independent evaluators. A single actor can't move the verdict; the truth is what the panel converges on.
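
To see why redundancy blunts a single bad actor, consider a panel verdict taken as the median of independent scores. The median is an assumed aggregation rule chosen for illustration (the source does not say how Eval combines scores), and `consensus_score` is a hypothetical helper. The minimum of five evaluators mirrors the illustrative task spec.

```python
from statistics import median


def consensus_score(scores: list[float], min_evaluators: int = 5) -> float:
    """Median of independent scores; refuses to settle on a thin panel."""
    if len(scores) < min_evaluators:
        raise ValueError(
            f"need at least {min_evaluators} scores, got {len(scores)}"
        )
    return median(scores)


honest = [0.80, 0.82, 0.78, 0.81, 0.79]
attacked = [0.80, 0.82, 0.78, 0.81, 0.0]  # one evaluator tries to sink the entry

print(consensus_score(honest))    # 0.8
print(consensus_score(attacked))  # 0.8
```

With a mean, the zero would drag the verdict down to about 0.64. With a median, one outlier can shift the result by at most one position in the sorted panel, and here it does not move at all.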

03

Golden-set checks

Known-answer probes are mixed into evaluation streams. Evaluators who fail planted checks reveal themselves as careless or adversarial.
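
A golden-set check can be sketched as two steps: shuffle probes with known labels into the real items, then measure how each evaluator did on the probes alone. The helper names, item names, and binary labels below are hypothetical, for illustration only.

```python
import random


def build_stream(
    real_items: list[str], golden: dict[str, int], seed: int = 0
) -> list[str]:
    """Shuffle known-answer probes in with real items so they look alike."""
    stream = real_items + list(golden)
    random.Random(seed).shuffle(stream)
    return stream


def golden_accuracy(answers: dict[str, int], golden: dict[str, int]) -> float:
    """Fraction of planted probes the evaluator scored correctly."""
    hits = sum(
        1 for item, expected in golden.items() if answers.get(item) == expected
    )
    return hits / len(golden)


golden = {"probe-1": 1, "probe-2": 0}  # hypothetical probes with known labels
stream = build_stream(["item-a", "item-b", "item-c"], golden)

careful = {"item-a": 1, "item-b": 0, "item-c": 1, "probe-1": 1, "probe-2": 0}
careless = {item: 1 for item in stream}  # rubber-stamps everything as a pass

print(golden_accuracy(careful, golden))   # 1.0
print(golden_accuracy(careless, golden))  # 0.5
```

Because evaluators cannot tell probes from real items, the only reliable way to pass them is to score everything carefully.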

04

Slashing

Stake is forfeit for provably dishonest or off-consensus behavior. Manipulation stops being free. It becomes the most expensive move on the board.
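
A settlement rule along these lines might look like the sketch below. Every threshold and amount (`tolerance`, `min_golden`, `slash_fraction`, `reward`) is an assumed placeholder, not a protocol parameter, and `settle_evaluator` is a hypothetical helper.

```python
def settle_evaluator(
    stake: float,
    score: float,
    consensus: float,
    golden_pass_rate: float,
    tolerance: float = 0.1,       # assumed: allowed distance from consensus
    min_golden: float = 0.8,      # assumed: minimum pass rate on planted probes
    slash_fraction: float = 0.5,  # assumed: share of stake forfeited
    reward: float = 1.0,          # assumed: flat payout for an accepted score
) -> float:
    """Return the evaluator's balance change: a reward, or a negative slash."""
    off_consensus = abs(score - consensus) > tolerance
    failed_golden = golden_pass_rate < min_golden
    if off_consensus or failed_golden:
        return -stake * slash_fraction
    return reward


# Honest evaluator, close to consensus and clean on probes.
print(settle_evaluator(10.0, score=0.79, consensus=0.80, golden_pass_rate=1.0))  # 1.0
# Manipulator, far from consensus.
print(settle_evaluator(10.0, score=0.00, consensus=0.80, golden_pass_rate=1.0))  # -5.0
```

The asymmetry is the point: under these assumed numbers an honest score earns a small payout, while a single manipulated score costs five times that.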

A task is just a contract.

Tasks are declarative: the inputs, the success criteria, the redundancy and golden-set parameters, and the settlement rule are all stated up front. Nothing about how a winner is chosen is hidden; the rules are as public as the result.

Structure shown is illustrative.

market.task

// Illustrative task spec, structure only, not live data
task "summarize-legal-brief" {
  inputs     = dataset("redacted-briefs")
  criteria   = ["faithfulness", "concision", "no-hallucination"]
  rubric     = blind      // evaluators never see the submitter
  evaluators = consensus(min: 5, redundant: true)
  golden     = inject(rate: "known-answer probes")
  stake      = required   // evaluate => post stake
  settle     = onchain    // payouts + slashing by contract
}

Built in the open · early markets opening soon

Stop trusting the leaderboard.
Settle it in the market.

Post a task, submit a model, or stake to evaluate. Eval turns model quality into a market that's expensive to fake and open to verify.

Read the protocol