
Methodology

How AccountingBench is designed, executed, and scored.

01

Design Principles

1

Question Curation

Questions are drawn from professional examination materials and university coursework, aligned with real-world accounting competency standards across educational and professional practice levels.

2

Multi-trial Execution

Each model answers each question 3 times. The majority answer is used as the final answer, reducing random variance.
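The majority-vote step described above can be sketched as follows. This is a minimal illustration, not the benchmark's actual implementation; the tie-breaking rule (first-seen answer wins) is an assumption, as the page does not specify one.

```python
from collections import Counter

def majority_answer(trials):
    """Return the most frequent answer across the trials.

    Ties are broken by first occurrence (Counter preserves insertion
    order) -- an assumed simplification, not the benchmark's stated rule.
    """
    return Counter(trials).most_common(1)[0][0]

print(majority_answer(["B", "A", "B"]))  # -> B
```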

3

Hybrid Scoring

Structured answers (SC/MC) are scored by formula. Open-text responses are evaluated by GPT-5-mini as an LLM judge, rating 0–100 on correctness and completeness following specific grading criteria.

4

Aggregation & Reporting

Scores are averaged per category and framework, then aggregated to an overall figure. All runs are timestamped and fully reproducible from logged configuration metadata.
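The two-stage aggregation described above might look like the following sketch. Equal weighting of the (category, framework) groups is an assumption; the page does not state the weighting scheme.

```python
from collections import defaultdict

def aggregate(task_scores):
    """Average scores within each (category, framework) group, then
    average the group means into one overall figure.

    task_scores: iterable of ((category, framework), score) pairs.
    Equal group weighting is assumed for illustration.
    """
    groups = defaultdict(list)
    for (category, framework), score in task_scores:
        groups[(category, framework)].append(score)
    group_means = {k: sum(v) / len(v) for k, v in groups.items()}
    overall = sum(group_means.values()) / len(group_means)
    return group_means, overall
```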

02

Scoring

Structured Questions (SC/MC)

Single-choice and multiple-choice questions use a penalized scoring formula: correct selections earn points, while incorrect selections subtract points.

Single-Choice: 100 if the answer equals the gold answer, else 0.
Multiple-Choice: (100 / number of correct alternatives) × (number of correct marks − number of incorrect marks).
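The SC/MC formulas above can be expressed directly in code. The set-based inputs and function names below are illustrative, not the benchmark's actual implementation.

```python
def score_single_choice(answer, gold):
    """100 points if the answer matches the gold answer, else 0."""
    return 100.0 if answer == gold else 0.0

def score_multiple_choice(marked, gold):
    """Penalized MC score: each correctly marked alternative earns, and
    each incorrectly marked alternative subtracts, 100 divided by the
    number of correct alternatives. Negative totals are possible.
    """
    per_alt = 100.0 / len(gold)
    n_correct = len(marked & gold)      # correct marks
    n_incorrect = len(marked - gold)    # incorrect marks
    return per_alt * n_correct - per_alt * n_incorrect
```

For example, marking {A, B, D} against gold {A, B, C} yields two correct marks and one incorrect mark, for a net score of 100/3 ≈ 33.3.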

Open-Text & Aggregation

Open-text responses are scored by GPT-5-mini (0–100). Each question runs 3 times: SC/MC answers use a majority vote, while open-text takes the average of the three judge scores. If a model fails to produce a parseable answer for a task, that task is recorded as 0 but excluded from the overall score aggregation.
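The open-text aggregation described above might be sketched as follows, modeling an unparseable trial as `None`. The exact handling of partially parseable runs is an assumption; the page only states that unparseable answers are excluded from the overall score.

```python
def aggregate_open_text(judge_scores):
    """Average the judge scores (0-100) across the three trials.

    None marks a trial with no parseable answer. If no trial is
    parseable, return None so the task can be excluded from the
    overall aggregation (recorded as 0 elsewhere) -- an assumed
    modeling of the methodology's exclusion rule.
    """
    valid = [s for s in judge_scores if s is not None]
    if not valid:
        return None  # task excluded from overall score
    return sum(valid) / len(valid)
```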

Score Interpretation

Range     Badge      Meaning
≥ 90%     Excellent  Near-expert level
75–90%    Strong     Professional level
60–75%    Moderate   Requires verification
50–60%    Weak       Below professional threshold
< 50%     Poor       Not suitable for professional use
03

Evaluation Pipeline

1
Dataset
520 Tasks
Tags: category, framework, level
Gold answers
2
Model Input
Scratchpad + final_answer
Task type resolution
3
🧠
3-Trial Inference
Trial 1
Trial 2
Trial 3
4
Scoring
SC/MC majority vote + formula
Open-text: LLM-as-a-Judge
Numeric: tolerance check
5
📊
Output
Runs sheet — metadata
Outputs sheet — scores
Reproducible & auditable
↺  Reproducibility & Audit — all runs logged with full configuration metadata
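The pipeline's numeric tolerance check might look like the sketch below. The specific tolerance values are assumptions; the page does not state which tolerances AccountingBench uses.

```python
import math

def numeric_match(answer, gold, rel_tol=1e-3, abs_tol=0.01):
    """Tolerance check for numeric answers: accept if the answer is
    within a relative or absolute tolerance of the gold value.

    The tolerance values here are illustrative assumptions, not the
    benchmark's documented settings.
    """
    return math.isclose(answer, gold, rel_tol=rel_tol, abs_tol=abs_tol)

print(numeric_match(100.05, 100.0))  # small rounding difference accepted
print(numeric_match(105.0, 100.0))   # 5% deviation rejected
```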
04

Limitations & Caveats

01
LLM Judge Reliability
Open-text questions are scored by an LLM judge (GPT-5-mini), which introduces potential biases. The judge may favor answers stylistically similar to its own outputs. A sample has been manually validated, but full human validation has not been performed.
02
Output Format Sensitivity
Some models failed to generate parseable answers for certain tasks. Those tasks were automatically scored as 0 but excluded from the calculation of the overall score.
03
Dataset Size & Scope
520 tasks is sufficient for category-level comparisons but may produce high variance at the subcategory level. Results reflect Austrian regulatory frameworks specifically.
04
Static Snapshot
Results reflect a single evaluation run in March 2026. Model capabilities change over time and results should be re-evaluated periodically, especially after major model version releases.
05
Benchmark Contamination
Contamination risk — the possibility that models have seen benchmark questions during training — cannot be fully eliminated. This is a general limitation of all static benchmarks.
06
Ecological Validity
Static, isolated tasks differ substantially from real-world accounting work. Benchmark performance does not directly predict performance in practice, where models encounter multi-step workflows and domain-specific documents.
07
IFRS Subset Limitations
The IFRS subset consists of two task groups evaluated under different conditions. The original 25 university-level tasks were evaluated across all 13 models; the additional 44 professional examination tasks were evaluated across only 10 models. The three most recently added models (claude-opus-4-7, gpt-5.5, Kimi-K2.6) are excluded from this group. As a result, overall IFRS scores are not directly comparable across all 13 models. To compare models on an equal footing, filter by University Exams or Professional Exams, each of which covers a consistent set of models and tasks.
@misc{aschauer2026accountingbench,
  author       = {Ewald Aschauer and Alexander Hofer and Markus Isack and Manuel Kaburek},
  title        = {AccountingBench: A Structured Benchmark for Systematic Evaluation of Large Language Models in Accounting Education and Professional Tasks},
  year         = {2026},
  institution  = {Vienna University of Economics and Business}
}