The Express Gazette
Thursday, September 4, 2025

Researchers release largest African language dataset to close AI translation gap

New corpus aims to help AI systems understand the continent’s largely oral languages, which have been excluded from mainstream models

Technology & AI

Researchers have released what they describe as the largest known dataset of African languages in an effort to close a widening gap between artificial intelligence tools and speakers of the continent’s many languages. The initiative responds to long-standing limits in the data used to train large language models, which are dominated by English and a handful of other widely written languages.

Africa is home to a large share of the world’s linguistic diversity — scholars and language activists estimate the continent contains well over a quarter of the planet’s languages — but many of those tongues are primarily oral and have little written material available online. That lack of textual data, together with limited investment in language technology, means that popular AI systems often cannot understand or produce those languages, leaving millions of potential users without useful tools.

[Image: Woman using a smartphone with an AI app]

The dataset, released recently by a team of researchers working on African language technologies, compiles written and transcribed material to serve as training data for machine translation, speech recognition and text generation systems. The researchers said the resource is intended to make it easier for developers and universities to build models that operate in a wider range of African languages.

“We think in our own languages, dream in them and interpret the world through them. If technology doesn’t reflect that, a whole group risks being left behind,” researchers associated with the University of Pretoria said in a statement about the project. The statement framed the dataset as a corrective to a research and development ecosystem that has largely prioritized languages with abundant online text.

Most commercial and open-source large language models are trained on massive quantities of web text, books and other digitized documents. Those sources are plentiful for English, Mandarin, Spanish and several European languages, but far thinner or absent for many African languages, which remain underrepresented in, or missing entirely from, model training corpora.

The dataset includes a mix of written texts and transcriptions of spoken language where available, reflecting the reality that much everyday use of African languages occurs orally. Researchers compiling the corpus said they worked with native speakers, community organizations and academic partners to gather material, and that further work will be needed to expand coverage, standardize orthographies and account for dialectal variation.

Experts and developers say such datasets can enable practical applications including more accurate translation, voice-driven agricultural extension services, local-language chatbots and improved accessibility for government and health information. The BBC reported examples of people already using AI applications that can speak local languages, including a South African farmer who uses an app that communicates in her native tongue.

[Image: Community members reviewing language materials]

Nevertheless, researchers cautioned that releasing a dataset is only one step toward broader language inclusion in AI. Building usable systems requires continued investment in computing resources, annotated data, ethically sourced recordings and collaborations that respect data ownership and cultural norms. The dataset’s creators also noted technical hurdles, including inconsistent spelling conventions and limited orthographic standards for some languages, which complicate model training and evaluation.

Industry attention to non-English language AI has grown in recent years, but researchers say sustained funding and policy support will be necessary to ensure that advances benefit a wide range of language communities. The newly released corpus is intended to lower one key barrier — the scarcity of training data — and to prompt further work by academic groups, startups and public institutions across the continent.

Those involved in the project said they hope the resource will accelerate the development of AI tools that reflect Africa’s linguistic diversity and make technologies more useful and accessible for speakers of languages that have so far been underrepresented in AI systems.