Evaluation snapshot
10 languages evaluated · 85 model chains · 500 audio samples

Low-Resource Language Evaluations

As an AI orchestration platform focused on global impact, Gooey.AI - with the support of the Gates Foundation - actively aggregates and evaluates the latest private and open source AI models to assess their understanding of digitally underrepresented languages.

Our goal is to highlight real-world use cases of common health, agriculture, and education audio questions and determine how well and how quickly different AI model combinations respond to these questions, as measured by their semantic similarity to expert-provided "golden" answers.

Every evaluation, prompt, model setting, and AI workflow is public and immediately forkable, so organizations can test any AI workflow with their own data and use cases. Organizations are encouraged to create audio evaluations with their own data too.

Pakistani and Nigerian Health

In April 2026, we evaluated the latest and best-performing private (Gemini 3.1, OpenAI 5.4, Intron) and open source/sovereign deployable (Omnilingual, Gemma 4 26B, Kimi2.5) models for their performance on health-related audio questions in the languages commonly found in Pakistan and Nigeria. In our test, Gemini 3.1 Pro as the LLM paired with Intron or Meta's Omnilingual as the speech recognition (ASR) model scored highest on accuracy, making the combination appropriate for WhatsApp and other async messaging-based deployments, though likely too slow for usable voice-only services.

Pakistani Languages

Nigerian Languages


For Hausa, Gemini 3.1 Pro Preview + Intron and Gemini 3.1 Pro Preview + Omni tie for highest accuracy (~0.75), though they differ slightly in speed (Omni ~17s, Intron ~19.5s), making Gemini 3.1 Pro Preview + Omni best suited for async use. For real-time voice, Gemini 3.1 Flash-Lite + Omni offers the best balance (~0.65 at ~8s). Gemini 3.1 Flash-Lite alone is fastest (~5s) but sacrifices significant accuracy (~0.1). Notably, Omni pairings consistently deliver the highest accuracy gains, albeit with increased latency in most cases.


For Nigerian Fulfulde, Gemini 3.1 Pro Preview + Omni is the most accurate (~0.6) at ~18s latency, and Gemini 3.1 Flash-Lite is the fastest (~5s) but with negligible accuracy, while Gemini 3.1 Flash-Lite + Omni provides a reasonable speed–accuracy balance (~0.3 at ~8s), though overall accuracy remains low across models. Notably, Nigerian Fulfulde shows a wider performance gap, with most models clustering at very low accuracy except the top Gemini stack.


For Igbo, Gemini 3.1 Pro Preview + Omni ranks highest on accuracy (~0.7) at ~17s latency, making it best suited for async use cases. For real-time voice, Gemini 3.1 Flash-Lite + Omni offers the strongest speed–accuracy balance (~0.6 at ~8s). Gemini 3.1 Flash-Lite alone is the fastest (~6s) but with near-zero accuracy. Notably, Omni and Intron pairings consistently improve accuracy across models, with a latency tradeoff.
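The accuracy/latency tradeoffs described above reduce to a simple selection rule: pick the most accurate chain within a latency budget (tight for real-time voice, unconstrained for async messaging). A minimal sketch, using illustrative chain names and approximate numbers drawn from the results above:

```python
# Illustrative accuracy/latency figures, approximating the Igbo results
# above; chain names abbreviated and numbers rounded for the example.
CHAINS = {
    "gemini-3.1-pro+omni":        {"accuracy": 0.70, "latency_s": 17.0},
    "gemini-3.1-flash-lite+omni": {"accuracy": 0.60, "latency_s": 8.0},
    "gemini-3.1-flash-lite":      {"accuracy": 0.05, "latency_s": 6.0},
}

def best_chain(chains, max_latency_s=None):
    """Pick the most accurate chain, optionally under a latency budget
    (e.g. ~10s for real-time voice; no budget for async WhatsApp)."""
    eligible = {
        name: m for name, m in chains.items()
        if max_latency_s is None or m["latency_s"] <= max_latency_s
    }
    return max(eligible, key=lambda n: eligible[n]["accuracy"])

print(best_chain(CHAINS))                    # async: gemini-3.1-pro+omni
print(best_chain(CHAINS, max_latency_s=10))  # real-time: flash-lite+omni
```

The same rule generalizes to any language's results: only the table of measured chains changes.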

Swahili, Kinyarwanda and Kikuyu

In December 2025, we tested new LLM+ASR workflows to understand real-world deployment of Swahili, Kikuyu, and Kinyarwanda voice applications. Fine-tuned African ASR models paired with modern LLMs (GPT-5.1, Gemini 3) achieved 85-96% accuracy with 4-10 second response times, fast enough for phone-based services. In short, voice AI for many African languages is now technically viable for WhatsApp and basic phone deployments at scale.

Read the Paper

With Kinyarwanda, we found newer open-source models (KimiK2, GLM4.7, MiniMax2.1, DeepSeek3.2) offer a clear accuracy uplift over Qwen3. For phone-based voice apps, MBaza combined with GPTOSS provides the best speed/accuracy tradeoff (~0.88 accuracy at ~4s latency) and is likely most appropriate. For asynchronous use (WhatsApp voice), higher-accuracy but slower stacks such as MBaza+Gem3pro yield better answers despite increased latency, which is acceptable on today's constrained networks and basic phones.

With Swahili, we found that the OpenAI realtime models tended to perform poorly on both accuracy and latency, with the fine-tuned Jacaranda Health model in combination with GPT 4.1 and 5.1 offering a reasonable combination of both speed and accuracy and hence, likely most appropriate for voice applications. For async applications like WhatsApp voice messages, Jacaranda plus Gemini 3 Pro gave the most accurate answers.

In general, the top-performing models in Kikuyu collectively perform worse than those in Swahili and Kinyarwanda. Additionally, as of December 2025, our tested machine translation models between English and Kikuyu were limited to just GhanaNLP's. Nonetheless, SunbirdV2 + Gemini 3 (both Pro and Flash) offer reasonable accuracy with median latency times of 7 and 10 seconds respectively. These latency times are low enough for async voice messaging but likely still too slow for usable voice-based feature phone applications.

Theory of Change

AI systems don’t understand local-language speech
The context
Many of Africa’s most widely spoken languages—such as Swahili, Kikuyu and Kinyarwanda—have historically been “invisible” to AI due to limited multilingual datasets used to train LLMs. This results in unequal access to AI and underrepresentation of people in underserved areas.

The ground reality
Unfortunately, capturing this diversity of language is hard for tech companies building speech recognition and translation models. This poor performance means that incredible tools like GPT-4 and Gemini - which are primarily trained on English - don't work particularly well for speakers of low-resource languages.

The need for AI-driven support
AI interventions such as smallholder agriculture advisory show significant promise in aiding livelihood opportunities in the Global South. Everyone, regardless of their language or location, should be able to access digital services through voice, not just text.
Interventions and Process
The possibilities in 2025
Local NLP and AI teams such as Jacaranda Health and MBaza released better performing fine-tuned African speech recognition models. Additionally, Meta released Omnilingual ASR with support for 1600+ languages.

New LLM generations (GPT-5+, Gemini 3) dramatically improved multilingual text comprehension of local-language text.

Model chains and Architectures
We use various models and architectures in the evaluation process. This is easily configurable with Gooey.AI's orchestration tech, which lets us quickly chain ASR -> LLM -> TTS and more.
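The ASR -> LLM -> TTS chain above can be sketched as three pluggable stages. This is a minimal illustration, not Gooey.AI's actual orchestration API: the stage callables here are stand-ins for real model endpoints (e.g. an Omnilingual ASR call or a Gemini chat call).

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class VoiceChain:
    """One model chain: each stage is a plain callable, so any ASR, LLM,
    or TTS backend can be swapped in without changing the chain itself."""
    asr: Callable[[bytes], str]   # audio question -> local-language text
    llm: Callable[[str], str]     # text question -> text answer
    tts: Callable[[str], bytes]   # text answer -> audio answer

    def run(self, audio_question: bytes) -> bytes:
        transcript = self.asr(audio_question)
        answer = self.llm(transcript)
        return self.tts(answer)

# Stand-in stages for demonstration; real deployments call model APIs here.
chain = VoiceChain(
    asr=lambda audio: audio.decode("utf-8"),
    llm=lambda question: f"Answer to: {question}",
    tts=lambda text: text.encode("utf-8"),
)
print(chain.run(b"Habari?"))  # b'Answer to: Habari?'
```

Because each stage is independent, swapping one ASR model for another (as in the evaluations above) means replacing a single callable and re-running the same benchmark.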

Our methodology
Golden QnA
Native speakers record audio questions in three African languages; human experts provide gold-standard English answers for consistent evaluation across systems.

LLM-as-a-judge
GPT-5.1 scores system responses against expert answers on a 0-1 scale, measuring semantic correctness, since standard automated evaluation metrics remain unreliable for African languages.
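An LLM-as-a-judge setup needs two pieces of glue code: a grading prompt and a parser that extracts the 0-1 score from the judge's reply. The sketch below uses a hypothetical rubric and reply format (Gooey.AI's actual judge prompt is not shown here); the judge model call itself is omitted.

```python
import re

def judge_prompt(question: str, golden: str, candidate: str) -> str:
    """Build a grading prompt for the judge model. The rubric wording
    here is illustrative, not the production prompt."""
    return (
        "Score the CANDIDATE answer against the GOLDEN expert answer for "
        "semantic correctness on a 0-1 scale. Reply as 'score: <float>'.\n"
        f"QUESTION: {question}\n"
        f"GOLDEN: {golden}\n"
        f"CANDIDATE: {candidate}"
    )

def parse_score(judge_reply: str) -> float:
    """Extract the 0-1 score from the judge's reply; clamp to [0, 1]."""
    m = re.search(r"score:\s*([01](?:\.\d+)?)", judge_reply, re.IGNORECASE)
    if not m:
        raise ValueError("judge reply did not contain a score")
    return max(0.0, min(1.0, float(m.group(1))))

print(parse_score("score: 0.85"))  # 0.85
```

Clamping and strict parsing matter in practice: judge models occasionally reply with extra prose or out-of-range values, and silent parsing failures would bias aggregate accuracy numbers.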

Latency measurements
We measure total time from audio question to audio answer, including ASR, translation, LLM inference, and TTS, on A10 GPUs.
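End-to-end latency measurement amounts to timing each stage of the chain and summing. A minimal sketch, assuming the same stand-in stage callables as above rather than real ASR/LLM/TTS calls:

```python
import time

def timed(fn, *args):
    """Run one pipeline stage; return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start

def measure_pipeline(audio, asr, llm, tts):
    """Total audio-question-to-audio-answer latency, broken down by stage.
    Stage callables are placeholders for real model calls."""
    timings = {}
    text, timings["asr"] = timed(asr, audio)
    answer, timings["llm"] = timed(llm, text)
    speech, timings["tts"] = timed(tts, answer)
    timings["total"] = timings["asr"] + timings["llm"] + timings["tts"]
    return speech, timings

speech, t = measure_pipeline(
    b"question",
    asr=lambda a: a.decode(),
    llm=lambda q: "answer",
    tts=lambda s: s.encode(),
)
print(sorted(t))  # ['asr', 'llm', 'total', 'tts']
```

Per-stage breakdowns are what make the tradeoff analysis above possible: they show whether a slow chain is dominated by ASR or by LLM inference, and therefore which component to swap.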
Key Findings
Breaking the audio barrier for African languages
Fine-tuned African ASR models (Jacaranda, MBaza, and Sunbird), trained on high-quality, diverse audio datasets, now reliably convert speech into local-language text with significantly improved accuracy. By addressing dialectal variation and acquiring local, region-specific fine-tuning data, these models remove the biggest historical bottleneck for low-resource African languages, enabling scalable speech-to-text for research, civic tech, education, and voice interfaces across the continent.

Native text understanding without translation layers
New large language models such as GPT-5+ and Gemini 3 can natively understand Swahili, Kikuyu, and Kinyarwanda text, often without requiring machine translation. As recently as Q3 2025, pipelines still needed a machine translation layer to feed local-language text to LLMs. Native comprehension improves downstream tasks like information retrieval, summarization, and question answering, and reduces latency, translation errors, and developer overhead, enabling more accurate, accessible applications for local services, education, and other frontline work.

End-to-end voice AI ready for real-world use
ASR→LLM→TTS pipelines now deliver acceptable latency and robustness for phone-based voice services on basic feature phones and low-bandwidth networks. SIP via LiveKit and other telephony improvements for AI models reduce latency significantly. These end-to-end systems enable automated call handling, IVR, and conversational agents across several African languages, making scalable, affordable voice services possible for healthcare, banking, government, and small businesses with deployable latency and cost profiles today.