Two benchmarks test AI reasoning on real cricket data. Each benchmark has a distinct pipeline with specific inputs, decision processes, and evaluation criteria.
Strategic Squad Building Evaluation
Between 2 and 10 AI agents each manage a configurable budget to build the best possible cricket squad from a pool of real players in a live IPL-style auction. For each player, agents see player stats, current bids, and budget constraints, then decide whether to bid or pass. Final squads are evaluated by 10 independent code graders against actual IPL performance data.
| Category | Details |
|---|---|
| Player Profile | Name, role (BAT/BOWL/AR/WK), nationality, age, base price |
| Career Stats | Batting: average, strike rate, matches, 100s/50s, boundary %, dot %. Bowling: economy, wickets, bowling average, dot ball %, strike rate |
| Recent Form | Last 3 seasons: form rating (1-5), runs, average, strike rate, wickets, economy |
| Auction State | Current bid amount, next bid increment, lot number, round (1 or 2) |
| Team State | Remaining purse, current squad (names, roles, prices), overseas count, slots still needed by role |
| Urgency Signals | Pacing indicator (ahead/behind/must-buy), avg budget per remaining slot, players left in auction |
The model receives a structured prompt containing all of the data above and must return a JSON response with the following fields:
| Output Field | Description |
|---|---|
| action | "BID" to raise the bid, or "PASS" to drop out of bidding for this player |
| amount | If bidding, the amount in Crores (must be at least the next valid increment) |
| reasoning | Free-text explanation of the decision (used for analysis, not scoring) |
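A minimal sketch of how a response with these fields might be validated. The field names come from the table above; the function name `parse_bid_response`, the `BidDecision` type, and the policy of falling back to PASS on any invalid reply are illustrative assumptions, not the benchmark's actual parser.

```python
import json
import math
from dataclasses import dataclass


@dataclass
class BidDecision:
    action: str      # "BID" or "PASS"
    amount: float    # in Crores; 0.0 when passing
    reasoning: str   # kept for analysis only, never scored


def parse_bid_response(raw: str, next_valid_bid: float, purse: float) -> BidDecision:
    """Parse a model reply; treat anything invalid as PASS (assumed policy)."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return BidDecision("PASS", 0.0, "unparseable response")
    if not isinstance(data, dict):
        return BidDecision("PASS", 0.0, "response is not a JSON object")

    action = str(data.get("action", "")).upper()
    reasoning = str(data.get("reasoning", ""))
    if action != "BID":
        return BidDecision("PASS", 0.0, reasoning)

    try:
        amount = float(data.get("amount"))
    except (TypeError, ValueError):
        return BidDecision("PASS", 0.0, "missing or non-numeric amount")

    # A bid must be finite, meet the next valid increment, and fit the purse.
    if not math.isfinite(amount) or amount < next_valid_bid or amount > purse:
        return BidDecision("PASS", 0.0, "bid below increment or above purse")
    return BidDecision("BID", amount, reasoning)
```

Falling back to PASS keeps the auction moving when a model returns malformed output; a stricter harness could retry or record the failure instead.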
Key challenge: 5-8 "trap" players have inflated visible stats but low true value. 3-5 "sleeper" players have modest stats but high hidden value. The model must identify these patterns from stat inconsistencies.
| # | Grader | What It Measures |
|---|---|---|
| 1 | Budget Efficiency | How well the agent spent its ₹100 Cr purse: cost per unit of true player value |
| 2 | Valuation Accuracy | How close purchase prices were to players' hidden true values |
| 3 | Squad Balance | Proper mix of batsmen, bowlers, all-rounders, and wicket-keepers |
| 4 | Overseas Optimization | Quality of overseas picks (max 8 allowed) and how efficiently overseas slots are used |
| 5 | Overbid Penalty | Deductions for paying significantly above a player's true value |
| 6 | Pass Discipline | Correctly passing on overpriced or trap players |
| 7 | Constraint Compliance | Meeting all IPL rules: min 15 players, role minimums, overseas cap |
| 8 | Purse Management | Maintaining enough budget for required remaining picks |
| 9 | Trap Resistance | Avoiding trap players (inflated stats, low hidden value) |
| 10 | Value Discovery | Finding sleeper players (modest stats, high hidden value) |
Each grader scores 0 to 1. The composite score is a weighted average across all 10 dimensions. Final squads are also simulated across the IPL 2024 season using Dream11 fantasy scoring to compute true performance.
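The composite can be sketched as a normalized weighted average. The actual per-grader weights are not stated here, so this sketch assumes equal weights unless others are supplied; the function name `composite_score` and the dictionary keys are hypothetical.

```python
from typing import Dict, Optional


def composite_score(
    scores: Dict[str, float],
    weights: Optional[Dict[str, float]] = None,
) -> float:
    """Weighted average of grader scores, each expected to lie in [0, 1]."""
    if not scores:
        raise ValueError("no grader scores supplied")
    for name, value in scores.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} score {value} is outside [0, 1]")

    # Assumption: equal weights when none are given.
    if weights is None:
        weights = {name: 1.0 for name in scores}

    total_weight = sum(weights[name] for name in scores)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")

    # Dividing by the total weight keeps the result in [0, 1]
    # even if the weights do not sum to exactly 1.
    return sum(scores[name] * weights[name] for name in scores) / total_weight
```

For example, `composite_score({"trap_resistance": 1.0, "value_discovery": 0.0})` averages the two graders equally, while passing `weights` shifts the balance toward whichever dimension is weighted more heavily.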
Match Prediction and Probabilistic Reasoning
AI agents predict the winner of every match in an IPL season (59 to 74 matches depending on the season). They receive squad compositions, venue data, historical performance, and current form, then output a prediction with a confidence level. Predictions are evaluated against actual results using 7 statistical metrics.
| Category | Details |
|---|---|
| Match Info | Match number, type (League/Qualifier/Semi/Final), home team indicator |
| Team Squads | Full squad for both teams: role counts, batting averages, strike rates, bowling economies, pace vs spin split, overseas composition |
| Venue Data | Venue name, pace advantage, batting friendliness, ground size, dew factor, average first innings score |
| Historical Record | Head-to-head record between the two teams (season and all-time in real data mode) |
| Current Season Form | Points table position, recent form (last 5 matches: W/L string), season results so far |
| Team Performance Stats | Wins/losses, average score, average conceded, chasing win rate, key players |
Real data mode uses actual IPL team names and venues with historical data from Cricsheet.org. Synthetic mode uses fictional team aliases to prevent the model from using memorized knowledge.
The model analyzes all the above factors and outputs a structured JSON prediction:
| Output Field | Description |
|---|---|
| predicted_winner | Which team the model predicts will win (team index) |
| confidence | Confidence level between 0.5 and 1.0 (0.5 = coin flip, 1.0 = certain) |
| predicted_margin | Expected margin of victory (e.g., "25 runs" or "4 wickets") |
| key_factors | Top 3 factors influencing the prediction (e.g., "home advantage", "pace attack strength") |
| reasoning | Detailed analysis explaining the prediction logic |
| # | Metric | Weight | What It Measures |
|---|---|---|---|
| 1 | Accuracy | 25% | Percentage of correct winner predictions |
| 2 | Brier Score | 20% | Probabilistic calibration, penalizes overconfident wrong predictions |
| 3 | Confidence Calibration | Whether high-confidence picks win more often than low-confidence picks |
| 4 | Upset Detection | 10% | Ability to correctly predict when the underdog wins |
| 5 | Margin Accuracy | 10% | How close the predicted margin is to the actual margin of victory |
| 6 | Consistency | 10% | Alignment between confidence levels and actual correctness |
| 7 | Composite Score | — | Weighted combination of all above metrics, used for final ranking |
Predictions are compared against actual IPL match results. The composite score ranks agents on their overall prediction quality, rewarding both accuracy and well-calibrated confidence.
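As an illustration of the first two metrics, the sketch below computes accuracy and a Brier score from a list of predictions. It assumes `confidence` is the probability assigned to the predicted winner, so the outcome is 1 when the pick is correct and 0 otherwise. The `Prediction` type and function names are hypothetical, and how the benchmark rescales the Brier score before weighting is not specified here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Prediction:
    predicted_winner: int  # team index from the model's JSON output
    confidence: float      # probability assigned to that pick, 0.5 to 1.0
    actual_winner: int     # team index from the real match result


def accuracy(preds: List[Prediction]) -> float:
    """Fraction of matches where the predicted winner actually won."""
    if not preds:
        raise ValueError("no predictions supplied")
    correct = sum(p.predicted_winner == p.actual_winner for p in preds)
    return correct / len(preds)


def brier_score(preds: List[Prediction]) -> float:
    """Mean squared gap between confidence and outcome; lower is better."""
    if not preds:
        raise ValueError("no predictions supplied")
    total = 0.0
    for p in preds:
        outcome = 1.0 if p.predicted_winner == p.actual_winner else 0.0
        total += (p.confidence - outcome) ** 2
    return total / len(preds)
```

Under this definition a coin-flip confidence of 0.5 always contributes 0.25 per match, while a wrong pick made at 0.9 confidence contributes 0.81, which is why overconfident misses are penalized so heavily.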
Both benchmarks use ball-by-ball data from Cricsheet.org for IPL 2022 to 2024. Dream11 T20 fantasy scoring computes true player value from real match performance.
Models use no function calling or tools. They receive structured prompts via OpenRouter, their responses are parsed deterministically, and a custom state machine orchestrator handles rounds.
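To make the orchestration idea concrete, here is a minimal sketch of a state machine for a single auction lot. The state names, the `Agent` callable interface, and the rule that an agent who passes drops out of the lot are all assumptions for illustration; the real orchestrator is not described in detail here.

```python
from enum import Enum, auto
from typing import Callable, Dict, Optional, Tuple


class LotState(Enum):
    OPEN = auto()      # lot announced, no bids yet
    BIDDING = auto()   # at least one bid has been placed
    SOLD = auto()
    UNSOLD = auto()


# Hypothetical agent interface: given the asking price, return True to bid.
Agent = Callable[[float], bool]


def run_lot(
    agents: Dict[str, Agent],
    base_price: float,
    increment: float,
    max_rounds: int = 100,
) -> Tuple[LotState, Optional[str], float]:
    """Run one lot until nobody raises; return (final state, winner, price)."""
    state = LotState.OPEN
    price = base_price
    leader: Optional[str] = None
    active = list(agents)  # agents that have not yet passed on this lot

    for _ in range(max_rounds):
        bid_made = False
        for name in list(active):
            if name == leader:
                continue  # the current leader does not bid against itself
            # The first bid is at base price; later bids add one increment.
            ask = price if leader is None else price + increment
            if agents[name](ask):
                price, leader, state = ask, name, LotState.BIDDING
                bid_made = True
            else:
                active.remove(name)  # a pass drops the agent from this lot
        if not bid_made:
            break

    state = LotState.SOLD if leader is not None else LotState.UNSOLD
    return state, leader, price
```

In a real harness each `Agent` call would wrap a prompt to the model and a deterministic parse of its reply; keeping that behind a plain callable makes the state transitions easy to test with scripted agents.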
Auctions map to capital allocation and position sizing. Match prediction maps to probabilistic reasoning and calibration under uncertainty.
Raeth Arena benchmarks AI decision-making through cricket auctions and match predictions, testing capital allocation, probabilistic reasoning, and strategic planning under uncertainty.