Methodology
How AccountingBench is designed, executed, and scored.
1. Question Curation
Questions are drawn from professional examination materials and university coursework, aligned with real-world accounting competency standards across educational and professional practice levels.
2. Multi-trial Execution
Each model answers each question 3 times. The majority answer is used as the final answer, reducing random variance.
3. Hybrid Scoring
Structured answers (SC/MC) are scored by formula. Open-text responses are evaluated by GPT-5-mini as an LLM judge, rating 0–100 on correctness and completeness following specific grading criteria.
4. Aggregation & Reporting
Scores are averaged per category and framework, then aggregated to an overall figure. All runs are timestamped and fully reproducible from logged configuration metadata.
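As an illustration only, the aggregation step might be implemented along these lines. The field names (`category`, `framework`, `score`) and the choice to compute the overall figure as an unweighted mean of category averages are assumptions, not the benchmark's actual code.

```python
from collections import defaultdict
from statistics import mean

def aggregate(results):
    """Average per-question scores by category, then roll them up to one overall figure.

    `results` is assumed to be a list of dicts such as
    {"category": "Payroll", "framework": "UGB", "score": 87.5}.
    """
    by_category = defaultdict(list)
    for r in results:
        by_category[r["category"]].append(r["score"])

    category_avgs = {cat: mean(scores) for cat, scores in by_category.items()}
    # Assumed roll-up: unweighted mean of the category averages.
    overall = mean(category_avgs.values())
    return category_avgs, overall
```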
Structured Questions (SC/MC)
Single-choice and multiple-choice questions use a penalized scoring formula: correct selections earn points, and incorrect selections subtract points.
Single-Choice: 100 if answer = gold; else 0
Multiple-Choice:
(100 / number of correct alternatives) × (number of correct options the model selects − number of incorrect options the model selects)
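For concreteness, here is a minimal sketch of such a scorer. The function names and the set-based representation of selections are illustrative, and edge cases (for example, whether negative totals are clamped to zero) are not specified by the formula above.

```python
def single_choice_score(answer: str, gold: str) -> float:
    # 100 if the model's answer matches the gold answer, otherwise 0.
    return 100.0 if answer == gold else 0.0

def multiple_choice_score(selected: set[str], gold: set[str]) -> float:
    # Each correct alternative is worth 100 / (number of correct alternatives);
    # every incorrect selection subtracts the same amount.
    per_option = 100.0 / len(gold)
    correct = len(selected & gold)
    incorrect = len(selected - gold)
    return per_option * correct - per_option * incorrect
```

For example, with gold options {A, B, C} and a model selecting {A, C, D}, the score is (100/3) × 2 − (100/3) × 1 ≈ 33.3.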
Open-Text & Aggregation
Open-text responses are scored by GPT-5-mini on a 0–100 scale. Each question is run 3 times: SC/MC questions take the majority answer across trials, while open-text questions average the three judge scores. If a model fails to produce a parseable answer for a task, that task is recorded as 0 but excluded from the overall score calculation.
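A sketch of the per-question aggregation under these rules; tie-breaking between equally frequent answers is not specified above, so the Counter-based vote below is an assumption.

```python
from collections import Counter
from statistics import mean

def sc_mc_final_answer(trial_answers: list[str]) -> str:
    """Majority vote over the three trial answers of a structured (SC/MC) question."""
    answer, _count = Counter(trial_answers).most_common(1)[0]
    return answer

def open_text_final_score(judge_scores: list[float]) -> float:
    """Average of the three 0-100 LLM-judge scores for an open-text question."""
    return mean(judge_scores)
```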
Score Interpretation
| Range | Badge | Meaning |
| --- | --- | --- |
| ≥ 90% | Excellent | Near-expert level |
| 75–90% | Strong | Professional level |
| 60–75% | Moderate | Requires verification |
| 50–60% | Weak | Below professional threshold |
| < 50% | Poor | Not suitable for professional use |
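Read as code, the banding looks like the following sketch; boundary cases such as exactly 75% follow the "≥" reading of the table, which is an assumption.

```python
def badge(overall_score: float) -> str:
    """Map an overall score (0-100) to its interpretation badge."""
    if overall_score >= 90:
        return "Excellent"  # near-expert level
    if overall_score >= 75:
        return "Strong"     # professional level
    if overall_score >= 60:
        return "Moderate"   # requires verification
    if overall_score >= 50:
        return "Weak"       # below professional threshold
    return "Poor"           # not suitable for professional use
```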
Evaluation Pipeline
Dataset (520 tasks; tags: category, framework, level; gold answers) → Model Input (scratchpad + final_answer; task type resolution) → Scoring (SC/MC: majority vote + formula; open-text: LLM-as-a-Judge; numeric: tolerance check) → Output (runs sheet with metadata; outputs sheet with scores; reproducible and auditable)
Reproducibility & Audit: all runs are logged with full configuration metadata.
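The numeric tolerance check mentioned in the Scoring stage can be sketched as a combined relative/absolute comparison. The thresholds below are illustrative assumptions, not the benchmark's actual tolerances.

```python
import math

def numeric_match(predicted: float, gold: float,
                  rel_tol: float = 0.005, abs_tol: float = 0.01) -> bool:
    """Accept a numeric answer if it lies within a small tolerance of the gold value.

    rel_tol (0.5%) and abs_tol (0.01) are illustrative; the benchmark's actual
    thresholds may differ.
    """
    return math.isclose(predicted, gold, rel_tol=rel_tol, abs_tol=abs_tol)
```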
01. LLM Judge Reliability
Open-text questions are scored by an LLM judge (GPT-5-mini), which introduces potential biases. The judge may favor answers stylistically similar to its own outputs. A sample has been manually validated, but full human validation has not been performed.
02. Output Format Sensitivity
Some models failed to produce parseable answers for some tasks. These tasks were automatically scored as 0 but excluded from the overall score calculation.
03. Dataset Size & Scope
520 tasks is sufficient for category-level comparisons but may produce high variance at the subcategory level. Results reflect Austrian regulatory frameworks specifically.
04. Static Snapshot
Results reflect a single evaluation run in March 2026. Model capabilities change over time and results should be re-evaluated periodically, especially after major model version releases.
05. Benchmark Contamination
Contamination risk, i.e. the possibility that models have seen benchmark questions during training, cannot be fully eliminated. This is a general limitation of all static benchmarks.
06. Ecological Validity
Static, isolated tasks differ substantially from real-world accounting work. Benchmark performance does not directly predict performance in practice, where models encounter multi-step workflows and domain-specific documents.
07. IFRS Subset Limitations
The IFRS subset consists of two task groups evaluated under different conditions. The original 25 university-level tasks were evaluated across all 13 models; the additional 44 professional examination tasks were evaluated across 10 models only. The three most recently added models (claude-opus-4-7, gpt-5.5, Kimi-K2.6) are excluded from this group. As a result, overall IFRS scores are not directly comparable across all 13 models. To compare models on equal footing, filter by University Exams or Professional Exams, each of which covers a consistent set of models and tasks.