NUS and NTU Publish APAC-Bench: Open-Source LLM Benchmark for APAC Regulatory and Financial Tasks

NUS and NTU release APAC-Bench, an open-source LLM benchmark with 12,000 APAC regulatory, legal, and financial tasks. The evaluation finds that GPT-4o and Claude Sonnet outperform Chinese models on English tasks but underperform them on Chinese regulatory document reasoning.

By AIMenta Editorial Team
AIMenta editorial take

Researchers at the National University of Singapore (NUS) and Nanyang Technological University (NTU) have published APAC-Bench, an open-source benchmark for evaluating large language models on tasks grounded in APAC regulatory frameworks, legal documents, and financial instruments. It targets the gap between Western-centric LLM benchmarks and the requirements of APAC enterprise AI deployments.

APAC-Bench contains 12,000 tasks across six APAC-specific categories: MAS TRM and HKMA regulatory compliance Q&A (English), CSRC and CBIRC Chinese securities and banking regulation (Mandarin), Japanese FSA financial reporting interpretation (Japanese), Southeast Asian consumer protection law analysis (Bahasa Indonesia and Bahasa Malaysia), APAC financial statement extraction and calculation (bilingual), and APAC legal contract clause identification (mixed language).

The APAC-Bench evaluation covers 12 leading LLMs. In the English-language APAC regulatory categories, GPT-4o and Claude Sonnet 3.7 lead the Chinese models (Qwen-2.5, Doubao-pro-32k) by 8-12 points. On Chinese-language regulatory reasoning tasks the ranking reverses: Qwen-2.5-72B outperforms GPT-4o by 14 points and Claude Sonnet by 19 points, which supports the commercial case for Chinese foundation models in Mandarin-primary APAC enterprise workflows. The benchmark is released under the Apache 2.0 license on Hugging Face, with evaluation scripts so that APAC AI engineering teams building foundation model selection frameworks can reproduce the results.
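For teams turning per-category results like these into a model selection rule, the core step is picking the top scorer in each task category rather than one overall winner. The following is a minimal sketch of that step, not the benchmark's own evaluation script. The function names, the category labels, and the model names are hypothetical, and the scores are placeholder values for illustration only, not APAC-Bench results.

```python
def best_model_per_category(results):
    """Return {category: (model, score)} for the top scorer in each category.

    `results` is an iterable of (model, category, score) tuples.
    """
    best = {}
    for model, category, score in results:
        if category not in best or score > best[category][1]:
            best[category] = (model, score)
    return best


def margin(results, category, model_a, model_b):
    """Score of model_a minus score of model_b within one category."""
    scores = {m: s for m, c, s in results if c == category}
    return scores[model_a] - scores[model_b]


# Placeholder values for illustration only; these are NOT APAC-Bench scores.
placeholder_results = [
    ("model-x", "english_regulatory", 80.0),
    ("model-y", "english_regulatory", 70.0),
    ("model-x", "chinese_regulatory", 60.0),
    ("model-y", "chinese_regulatory", 75.0),
]

picks = best_model_per_category(placeholder_results)
for category, (model, score) in sorted(picks.items()):
    print(f"{category}: {model} ({score})")
```

Selecting per category keeps a ranking reversal between languages visible, where a single averaged score would hide it.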
