Methodology
How AccountingBench is designed, executed, and scored.
1. Question Curation
Questions are drawn from professional examination materials and university coursework, aligned with real-world accounting competency standards across educational and professional practice levels.
2. Multi-trial Execution
Each model answers each question 3 times. The majority answer is used as the final answer, reducing random variance.
3. Hybrid Scoring
Structured answers (SC/MC) are scored by formula. Open-text responses are evaluated by GPT-5-mini as an LLM judge, rating 0–100 on correctness and completeness following specific grading criteria.
4. Aggregation & Reporting
Scores are averaged per category and framework, then aggregated to an overall figure. All runs are timestamped and fully reproducible from logged configuration metadata.
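As an illustration only, the aggregation step might be implemented along these lines. The field names (`category`, `framework`, `score`) and the choice to compute the overall figure as an unweighted mean of category averages are assumptions, not the benchmark's actual code.

```python
from collections import defaultdict
from statistics import mean

def aggregate(results):
    """Average per-question scores by category, then roll them up to one overall figure.

    `results` is assumed to be a list of dicts such as
    {"category": "Payroll", "framework": "UGB", "score": 87.5}.
    """
    by_category = defaultdict(list)
    for r in results:
        by_category[r["category"]].append(r["score"])

    category_avgs = {cat: mean(scores) for cat, scores in by_category.items()}
    # Assumed roll-up: unweighted mean of the category averages.
    overall = mean(category_avgs.values())
    return category_avgs, overall
```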
Structured Questions (SC/MC)
Single-choice and multiple-choice questions use a penalized scoring formula: correct selections earn points, and incorrect selections subtract points.
Single-Choice: 100 if answer = gold; else 0
Multiple-Choice:
(100 / number of correct alternatives) × (number of correct options the model selects − number of incorrect options the model selects)
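For concreteness, here is a minimal sketch of such a scorer. The function names and the set-based representation of selections are illustrative, and edge cases (for example, whether negative totals are clamped to zero) are not specified by the formula above.

```python
def single_choice_score(answer: str, gold: str) -> float:
    # 100 if the model's answer matches the gold answer, otherwise 0.
    return 100.0 if answer == gold else 0.0

def multiple_choice_score(selected: set[str], gold: set[str]) -> float:
    # Each correct alternative is worth 100 / (number of correct alternatives);
    # every incorrect selection subtracts the same amount.
    per_option = 100.0 / len(gold)
    correct = len(selected & gold)
    incorrect = len(selected - gold)
    return per_option * correct - per_option * incorrect
```

For example, with gold options {A, B, C} and a model selecting {A, C, D}, the score is (100/3) × 2 − (100/3) × 1 ≈ 33.3.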
Open-Text & Aggregation
Open-text responses are scored by GPT-5-mini on a 0–100 scale. Each question is run 3 times: SC/MC questions take the majority answer across trials, while open-text questions average the three judge scores. If a model fails to produce a parseable answer for a task, that task is recorded as 0 but excluded from the overall score calculation.
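A sketch of the per-question aggregation under these rules; tie-breaking between equally frequent answers is not specified above, so the Counter-based vote below is an assumption.

```python
from collections import Counter
from statistics import mean

def sc_mc_final_answer(trial_answers: list[str]) -> str:
    """Majority vote over the three trial answers of a structured (SC/MC) question."""
    answer, _count = Counter(trial_answers).most_common(1)[0]
    return answer

def open_text_final_score(judge_scores: list[float]) -> float:
    """Average of the three 0-100 LLM-judge scores for an open-text question."""
    return mean(judge_scores)
```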
Score Interpretation
| Range | Badge | Meaning |
| --- | --- | --- |
| ≥ 90% | Excellent | Near-expert level |
| 75–90% | Strong | Professional level |
| 60–75% | Moderate | Requires verification |
| 50–60% | Weak | Below professional threshold |
| < 50% | Poor | Not suitable for professional use |
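Read as code, the banding looks like the following sketch; boundary cases such as exactly 75% follow the "≥" reading of the table, which is an assumption.

```python
def badge(overall_score: float) -> str:
    """Map an overall score (0-100) to its interpretation badge."""
    if overall_score >= 90:
        return "Excellent"  # near-expert level
    if overall_score >= 75:
        return "Strong"     # professional level
    if overall_score >= 60:
        return "Moderate"   # requires verification
    if overall_score >= 50:
        return "Weak"       # below professional threshold
    return "Poor"           # not suitable for professional use
```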
Evaluation Pipeline
Dataset (520 tasks; tags: category, framework, level; gold answers) → Model Input (scratchpad + final_answer; task type resolution) → Scoring (SC/MC: majority vote + formula; open-text: LLM-as-a-Judge; numeric: tolerance check) → Output (runs sheet with metadata; outputs sheet with scores; reproducible and auditable)
Reproducibility & Audit: all runs are logged with full configuration metadata.
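The numeric tolerance check mentioned in the Scoring stage can be sketched as a combined relative/absolute comparison. The thresholds below are illustrative assumptions, not the benchmark's actual tolerances.

```python
import math

def numeric_match(predicted: float, gold: float,
                  rel_tol: float = 0.005, abs_tol: float = 0.01) -> bool:
    """Accept a numeric answer if it lies within a small tolerance of the gold value.

    rel_tol (0.5%) and abs_tol (0.01) are illustrative; the benchmark's actual
    thresholds may differ.
    """
    return math.isclose(predicted, gold, rel_tol=rel_tol, abs_tol=abs_tol)
```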
01. LLM Judge Reliability
Open-text questions are scored by an LLM judge (GPT-5-mini), which introduces potential biases. The judge may favor answers stylistically similar to its own outputs. A sample has been manually validated, but full human validation has not been performed.
02. Output Format Sensitivity
Some models failed to produce parseable answers for some tasks. These tasks were automatically scored as 0 but excluded from the overall score calculation.
03. Dataset Size & Scope
520 tasks is sufficient for category-level comparisons but may produce high variance at the subcategory level. Results reflect Austrian regulatory frameworks specifically.
04. Static Snapshot
Results reflect a single evaluation run in March 2026. Model capabilities change over time and results should be re-evaluated periodically, especially after major model version releases.
05. Benchmark Contamination
Contamination risk, i.e. the possibility that models have seen benchmark questions during training, cannot be fully eliminated. This is a general limitation of all static benchmarks.
06. Ecological Validity
Static, isolated tasks differ substantially from real-world accounting work. Benchmark performance does not directly predict performance in practice, where models encounter multi-step workflows and domain-specific documents.
07. IFRS Subset Limitations
The IFRS subset consists of two task groups evaluated under different conditions. The original 25 university-level tasks were evaluated across all 13 models; the additional 44 professional examination tasks were evaluated across 10 models only. The three most recently added models (claude-opus-4-7, gpt-5.5, Kimi-K2.6) are excluded from this group. As a result, overall IFRS scores are not directly comparable across all 13 models. To compare models on equal footing, filter by University Exams or Professional Exams, each of which covers a consistent set of models and tasks.