Evaluation snapshot
10 languages evaluated · 85 model chains · 500 audio samples

Low-Resource Language Evaluations

As an AI orchestration platform focused on global impact, Gooey.AI - with the support of the Gates Foundation - actively aggregates and evaluates the latest private and open source AI models to assess their understanding of digitally underrepresented languages.

Our goal is to highlight real-world use cases of common health, agriculture, and education audio questions and determine how well and how quickly different AI model combinations respond to these questions, as measured by their semantic similarity to expert-provided "golden" answers.

Every evaluation, prompt, model setting, and AI workflow is public and immediately forkable, so organizations can test any AI workflow with their own data and use cases. Organizations are encouraged to create audio evaluations with their own data too.

Pakistani and Nigerian Health

In April 2026, we evaluated the latest and best-performing private (Gemini 3.1, OpenAI 5.4, Intron) and open source/sovereign deployable (Omnilingual, Gemma 4 26B, Kimi2.5) models for their performance on health-related audio questions in the languages commonly found in Pakistan and Nigeria. In our test, Gemini 3.1 Pro as the LLM paired with Intron or Meta's Omnilingual as the speech recognition (ASR) model scored highest on accuracy, making the combination appropriate for WhatsApp and other async messaging-based deployments, though likely too slow for usable voice-only services.

Pakistani Languages

Nigerian Languages


For Hausa, Gemini 3.1 Pro Preview + Intron and Gemini 3.1 Pro Preview + Omni tie for highest accuracy (~0.75), though they differ slightly in speed (Omni ~17s, Intron ~19.5s), making Gemini 3.1 Pro Preview + Omni best suited for async use. For real-time voice, Gemini 3.1 Flash-Lite + Omni offers the best balance (~0.65 at ~8s). Gemini 3.1 Flash-Lite alone is fastest (~5s) but sacrifices significant accuracy (~0.1). Notably, Omni pairings consistently deliver the highest accuracy gains, albeit with increased latency in most cases.


For Nigerian Fulfulde, Gemini 3.1 Pro Preview + Omni is the most accurate (~0.6) at ~18s latency, and Gemini 3.1 Flash-Lite is the fastest (~5s) but with negligible accuracy, while Gemini 3.1 Flash-Lite + Omni provides a reasonable speed–accuracy balance (~0.3 at ~8s), though overall accuracy remains low across models. Notably, Nigerian Fulfulde shows a wider performance gap, with most models clustering at very low accuracy except the top Gemini stack.


For Igbo, Gemini 3.1 Pro Preview + Omni ranks highest on accuracy (~0.7) at ~17s latency, making it best suited for async use cases. For real-time voice, Gemini 3.1 Flash-Lite + Omni offers the strongest speed–accuracy balance (~0.6 at ~8s). Gemini 3.1 Flash-Lite alone is the fastest (~6s) but with near-zero accuracy. Notably, Omni and Intron pairings consistently improve accuracy across models, with a latency tradeoff.
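The accuracy/latency tradeoffs described above reduce to a simple selection rule: pick the most accurate chain within a latency budget (tight for real-time voice, unconstrained for async messaging). A minimal sketch, using illustrative chain names and approximate numbers drawn from the results above:

```python
# Illustrative accuracy/latency figures, approximating the Igbo results
# above; chain names abbreviated and numbers rounded for the example.
CHAINS = {
    "gemini-3.1-pro+omni":        {"accuracy": 0.70, "latency_s": 17.0},
    "gemini-3.1-flash-lite+omni": {"accuracy": 0.60, "latency_s": 8.0},
    "gemini-3.1-flash-lite":      {"accuracy": 0.05, "latency_s": 6.0},
}

def best_chain(chains, max_latency_s=None):
    """Pick the most accurate chain, optionally under a latency budget
    (e.g. ~10s for real-time voice; no budget for async WhatsApp)."""
    eligible = {
        name: m for name, m in chains.items()
        if max_latency_s is None or m["latency_s"] <= max_latency_s
    }
    return max(eligible, key=lambda n: eligible[n]["accuracy"])

print(best_chain(CHAINS))                    # async: gemini-3.1-pro+omni
print(best_chain(CHAINS, max_latency_s=10))  # real-time: flash-lite+omni
```

The same rule generalizes to any language's results: only the table of measured chains changes.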

Swahili, Kinyarwanda and Kikuyu

In December 2025, we tested new LLM+ASR workflows to understand real-world deployment of Swahili, Kikuyu, and Kinyarwanda voice applications. Fine-tuned African ASR models paired with modern LLMs (GPT-5.1, Gemini 3) achieved 85-96% accuracy with 4-10 second response times, fast enough for phone-based services. In short, voice AI for many African languages is now technically viable for WhatsApp and basic phone deployments at scale.

Read the Paper

With Kinyarwanda, we found newer open-source models (KimiK2, GLM4.7, MiniMax2.1, DeepSeek3.2) offer a clear accuracy uplift over Qwen3. For phone-based voice apps, MBaza combined with GPTOSS provides the best speed/accuracy tradeoff (~0.88 accuracy at ~4s latency) and is likely most appropriate. For asynchronous use (WhatsApp voice), higher-accuracy but slower stacks such as MBaza+Gem3pro yield better answers despite increased latency, which is acceptable on today's constrained networks and basic phones.

With Swahili, we found that the OpenAI realtime models tended to perform poorly on both accuracy and latency, with the fine-tuned Jacaranda Health model in combination with GPT 4.1 and 5.1 offering a reasonable combination of both speed and accuracy and hence, likely most appropriate for voice applications. For async applications like WhatsApp voice messages, Jacaranda plus Gemini 3 Pro gave the most accurate answers.

In general, the top-performing models in Kikuyu collectively perform worse than those in Swahili and Kinyarwanda. Additionally, as of December 2025, our tested machine translation models between English and Kikuyu were limited to just GhanaNLP's. Nonetheless, SunbirdV2 + Gemini 3 (both Pro and Flash) offer reasonable accuracy with median latency times of 7 and 10 seconds respectively. These latency times are low enough for async voice messaging but likely still too slow for usable voice-based feature phone applications.

Theory of Change

AI systems don’t understand local-language speech
The context
Many of Africa’s most widely spoken languages—such as Swahili, Kikuyu and Kinyarwanda—have historically been “invisible” to AI due to limited multilingual datasets used to train LLMs. This results in unequal access to AI and underrepresentation of people in underserved areas.

The ground reality
Unfortunately, capturing this diversity of language is hard for tech companies building speech recognition and translation models. This poor performance means that incredible tools like GPT-4 and Gemini - which are primarily trained on English - don't work particularly well for speakers of low-resource languages.

The need for AI-driven support
AI interventions such as smallholder agriculture advisory show significant promise in aiding livelihood opportunities in the Global South. Everyone, regardless of their language or location, should be able to access digital services through voice, not just text.
Interventions and Process
The possibilities in 2025
Local NLP and AI teams such as Jacaranda Health and MBaza released better performing fine-tuned African speech recognition models. Additionally, Meta released Omnilingual ASR with support for 1600+ languages.

New LLM generations (GPT-5+, Gemini 3) dramatically improved multilingual text comprehension of local-language text.

Model chains and Architectures
We use various models and architectures in the evaluation process. This is easily configurable with Gooey.AI's orchestration tech, which lets us quickly chain ASR -> LLM -> TTS and more.
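The ASR -> LLM -> TTS chain above can be sketched as three pluggable stages. This is a minimal illustration, not Gooey.AI's actual orchestration API: the stage callables here are stand-ins for real model endpoints (e.g. an Omnilingual ASR call or a Gemini chat call).

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class VoiceChain:
    """One model chain: each stage is a plain callable, so any ASR, LLM,
    or TTS backend can be swapped in without changing the chain itself."""
    asr: Callable[[bytes], str]   # audio question -> local-language text
    llm: Callable[[str], str]     # text question -> text answer
    tts: Callable[[str], bytes]   # text answer -> audio answer

    def run(self, audio_question: bytes) -> bytes:
        transcript = self.asr(audio_question)
        answer = self.llm(transcript)
        return self.tts(answer)

# Stand-in stages for demonstration; real deployments call model APIs here.
chain = VoiceChain(
    asr=lambda audio: audio.decode("utf-8"),
    llm=lambda question: f"Answer to: {question}",
    tts=lambda text: text.encode("utf-8"),
)
print(chain.run(b"Habari?"))  # b'Answer to: Habari?'
```

Because each stage is independent, swapping one ASR model for another (as in the evaluations above) means replacing a single callable and re-running the same benchmark.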

Our methodology
Golden QnA
Native speakers record audio questions in three African languages; human experts provide gold-standard English answers for consistent evaluation across systems.

LLM-as-a-judge
GPT-5.1 scores system responses against expert answers on a 0-1 scale, measuring semantic correctness, since standard automated evaluation metrics remain unreliable for African languages.
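An LLM-as-a-judge setup needs two pieces of glue code: a grading prompt and a parser that extracts the 0-1 score from the judge's reply. The sketch below uses a hypothetical rubric and reply format (Gooey.AI's actual judge prompt is not shown here); the judge model call itself is omitted.

```python
import re

def judge_prompt(question: str, golden: str, candidate: str) -> str:
    """Build a grading prompt for the judge model. The rubric wording
    here is illustrative, not the production prompt."""
    return (
        "Score the CANDIDATE answer against the GOLDEN expert answer for "
        "semantic correctness on a 0-1 scale. Reply as 'score: <float>'.\n"
        f"QUESTION: {question}\n"
        f"GOLDEN: {golden}\n"
        f"CANDIDATE: {candidate}"
    )

def parse_score(judge_reply: str) -> float:
    """Extract the 0-1 score from the judge's reply; clamp to [0, 1]."""
    m = re.search(r"score:\s*([01](?:\.\d+)?)", judge_reply, re.IGNORECASE)
    if not m:
        raise ValueError("judge reply did not contain a score")
    return max(0.0, min(1.0, float(m.group(1))))

print(parse_score("score: 0.85"))  # 0.85
```

Clamping and strict parsing matter in practice: judge models occasionally reply with extra prose or out-of-range values, and silent parsing failures would bias aggregate accuracy numbers.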

Latency measurements
We measure total time from audio question to audio answer, including ASR, translation, LLM inference, and TTS, on A10 GPUs.
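End-to-end latency measurement amounts to timing each stage of the chain and summing. A minimal sketch, assuming the same stand-in stage callables as above rather than real ASR/LLM/TTS calls:

```python
import time

def timed(fn, *args):
    """Run one pipeline stage; return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start

def measure_pipeline(audio, asr, llm, tts):
    """Total audio-question-to-audio-answer latency, broken down by stage.
    Stage callables are placeholders for real model calls."""
    timings = {}
    text, timings["asr"] = timed(asr, audio)
    answer, timings["llm"] = timed(llm, text)
    speech, timings["tts"] = timed(tts, answer)
    timings["total"] = timings["asr"] + timings["llm"] + timings["tts"]
    return speech, timings

speech, t = measure_pipeline(
    b"question",
    asr=lambda a: a.decode(),
    llm=lambda q: "answer",
    tts=lambda s: s.encode(),
)
print(sorted(t))  # ['asr', 'llm', 'total', 'tts']
```

Per-stage breakdowns are what make the tradeoff analysis above possible: they show whether a slow chain is dominated by ASR or by LLM inference, and therefore which component to swap.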
Key Findings
Breaking the audio barrier for African languages
Fine-tuned African ASR models (Jacaranda, MBaza, and Sunbird), trained on high-quality, diverse audio datasets, now reliably convert speech into local-language text with significantly improved accuracy. By addressing dialectal variation and acquiring local, region-specific fine-tuning data, these models remove the biggest historical bottleneck for low-resource African languages, enabling scalable speech-to-text for research, civic tech, education, and voice interfaces across the continent.

Native text understanding without translation layers
New large language models such as GPT-5+ and Gemini 3 can natively understand Swahili, Kikuyu, and Kinyarwanda text, often without requiring machine translation. As recently as Q3 2025, pipelines still needed a machine translation layer to feed local-language text to LLMs. Native comprehension improves downstream tasks like information retrieval, summarization, and question answering, and reduces latency, translation errors, and developer overhead, enabling more accurate, accessible applications for local services, education, and other frontline work.

End-to-end voice AI ready for real-world use
ASR→LLM→TTS pipelines now deliver acceptable latency and robustness for phone-based voice services on basic feature phones and low-bandwidth networks. SIP via LiveKit and other telephony improvements for AI models reduce latency significantly. These end-to-end systems enable automated call handling, IVR, and conversational agents across several African languages, making scalable, affordable voice services possible for healthcare, banking, government, and small businesses with deployable latency and cost profiles today.