Picture a maize farmer in the Afram Plains of Ghana. She plants her crops, monitors weather patterns, and makes daily decisions that determine her household income. When she uses her phone, she prefers calls and voice notes so she can switch between Ga, Twi, and whichever local language feels natural, the way anyone does when thinking out loud. Across the world, an AI tool has been built specifically for farmers like her: one that can detect crop disease, plan irrigation schedules, and connect maize farmers directly to buyers on the market. However, the tool only recognises conventional English, which puts it out of reach for her and for other farmers across the Global South. To her, the tool is useless, not because she is unsophisticated, but because it was not built with her in mind. This is the reality for millions of Africans who are excluded from parts of the digital economy even when their language is a lingua franca.

African languages are generally excluded from Large Language Models (LLMs) because they are considered “low resource” languages in AI research. Research shows that the resources used to train LLMs in these languages, such as websites, books, and transcripts, are either scarce or unavailable. To compensate for this missing data, developers frequently scrape digitised translated religious texts such as Bibles, along with old online articles, which ultimately causes AI tools to sound unnaturally archaic in these languages.

There is also a general disinterest in investing in datasets for African languages, rooted in a historical pattern of privileging foreign interpretations of these languages over indigenous voices and framing those voices as peripheral. Languages like Swahili, Yoruba, Amharic, and Zulu are spoken across more than a dozen countries combined. Western tech companies classify them as “low resource” not because they are inherently so, but to justify the lack of investment in digitising them. This shortage of training data, coupled with the reluctance to fund new datasets, means that ChatGPT can recognise only an estimated 10-20 percent of Hausa sentences, even though the language is spoken by 94 million Nigerians. In the context of African languages, the result is technology that poorly reflects local linguistic practices, including tone, code-switching, and the preference for voice over text in digital communication.

There is also the issue of “tokenisation”: the process of breaking text into smaller units called tokens, which can be whole words, parts of words, or individual characters that AI language models can process. The main challenge is that a Yoruba sentence, for instance, needs far more tokens than an English sentence expressing the same semantic content, which makes processing slower and more expensive. Tokenisers also fragment Yoruba text because they ignore the structure and tone marks of words, producing output that does not align with the language. The lack of training data, the challenges with tokenisation, and the lack of investment in building accurate datasets culminate in less accurate and more expensive AI systems for African and Global South languages.
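To make the token imbalance concrete, the short sketch below runs an English-centric tokeniser over an English sentence and a Yoruba sentence of similar meaning and compares the counts. It assumes the open-source transformers library and uses the GPT-2 tokeniser purely as an example of a tokeniser trained mostly on English text; the Yoruba sentence is illustrative.

```python
# A minimal sketch comparing how an English-centric tokeniser splits
# an English sentence versus a Yoruba sentence of similar meaning.
# Requires the `transformers` library; GPT-2's tokeniser is used only
# as an example of one trained overwhelmingly on English text.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

pairs = {
    "English": "Good morning, how are you today?",
    "Yoruba": "Ẹ káàárọ̀, báwo ni o ṣe wà lónìí?",  # illustrative sentence; diacritics approximate
}

for language, sentence in pairs.items():
    tokens = tokenizer.tokenize(sentence)
    print(f"{language}: {len(tokens)} tokens -> {tokens}")

# Typical result: the Yoruba sentence is split into many more, smaller
# fragments (often byte-level pieces around the diacritics), which means
# more compute and higher cost per unit of meaning.
```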

Even where African languages are included, the systems often translate Western languages into African languages without carrying over the relevant social context. The translated text then imports Western cultural biases, misrepresents the local context, or perpetuates stereotypes. A proverb in Twi, for instance, may not have an English equivalent that can simply be swapped in. Similarly, an instruction phrased as a command in English may come off as unnatural in Yoruba, where the same idea may be expressed through a completely different grammatical structure.

In digital communication in Africa, people tend to code-switch, blending conventional English with Pidgin, Creole, Yoruba, Swahili, or whatever local language feels most natural at the time. A text message or voice note might say, “Bro, watch my things give me. I dey go come now now”. Understanding this sentence requires a fair grasp of social register. AI models trained on conventional English may be unable to comprehend speech that is, to its speakers, perfectly natural. Failing to account for local context and practices means that small businesses, healthcare workers, and the other people who stand to benefit most from Artificial Intelligence lose access to critical tools.

The inability of AI tools to understand African languages is compounded by the fact that African digital communication has long been voice-based rather than text-based. Research has shown that in the early to mid 2000s, when internet penetration in Africa was low, radio and mobile telephony enjoyed far higher acceptance and usage. Even then, mobile users preferred these voice-based channels to Short Message Service (SMS) and Unstructured Supplementary Service Data (USSD), which are poorly suited to people with limited reading skills in Western languages. In the same way, formal text-based AI models fail in this voice-first, highly colloquial environment, effectively forcing users to assimilate to Western languages just to access basic technology.

Modern developments in Natural Language Processing (NLP) offer a way past the flaws of direct translation. NLP combines systems and techniques that allow computers to process and respond to written and spoken language in a way that mirrors human ability. Whereas a translation app may simply convert English to Swahili, an NLP model built for Swahili understands the language and its culture, and is therefore less likely to miss local context or lean on outdated datasets. The goal is to progress from direct translation to meaning-based translation that preserves cultural nuance.
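As a point of reference, the sketch below shows what direct translation looks like in practice: a single call to an off-the-shelf English-to-Swahili model, with no awareness of idiom or register. It assumes the transformers library and the publicly available Helsinki-NLP/opus-mt-en-sw checkpoint; any comparable machine translation model could stand in.

```python
# A minimal sketch of direct, sentence-level machine translation, the
# baseline the paragraph above contrasts with meaning-based NLP.
# Assumes the `transformers` library and the Helsinki-NLP/opus-mt-en-sw
# English-to-Swahili model; swap in a comparable model if unavailable.
from transformers import pipeline

translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-sw")

# An English proverb translated literally; a meaning-based system would
# look for an equivalent Swahili proverb rather than a word-for-word rendering.
result = translator("Too many cooks spoil the broth.")
print(result[0]["translation_text"])
```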

Long before global conversations began shifting towards LLMs, African AI practitioners were designing small, efficient models. InkubaLM, Africa’s first multilingual small language model, prioritised efficiency and delivered strong performance across multiple low-resource languages without being cloud-dependent the way large LLMs are.
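The practical appeal of such a model is that it can run locally rather than in the cloud. The sketch below loads InkubaLM through the transformers library and generates a short continuation for a Swahili prompt; the checkpoint name lelapa/InkubaLM-0.4B is assumed from the public Hugging Face listing and should be verified before use.

```python
# A minimal sketch of running a small language model such as InkubaLM
# locally. The model ID is assumed from the public Hugging Face listing;
# verify the exact name, and note that `trust_remote_code=True` may be
# required depending on the checkpoint's architecture definition.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "lelapa/InkubaLM-0.4B"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

# Generate a short continuation for a Swahili prompt on a local CPU.
prompt = "Habari ya leo ni"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```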

Also working at the grassroots level is Masakhane, whose mission is to strengthen and spur NLP research in African languages, for Africans, by Africans. The aim is to ensure that African languages are represented in technology and to build systems that understand African names, cultures, places, and history. Its core value is drawn from the isiZulu phrase “Umuntu Ngumuntu Ngabantu”, which loosely translates to “a person is a person through another person”, or “I am because you are”. For this reason, Masakhane is built on community, collaboration, and inclusivity. It also champions data sovereignty, insisting that Africans decide what data represents their communities globally, retain ultimate ownership of that data, and know how it is used. Collaborators organise in online communities where facilitators set tasks in weekly meetings, and the resulting work is released as publications submitted to workshops and conferences.

Timnit Gebru’s DAIR Institute pushes the argument for digital sovereignty further. It advocates for “sovereign language technologies”: AI systems designed around communities rather than extracted from them. This responds to the concern that, even as African languages remain underrepresented, the process of building datasets to close the gap can itself become exploitative.

Similarly, Lelapa AI seeks to build natural language processing systems designed to scale globally by working in Africa, where the constraints to scaling are more apparent. According to its CEO, Pelonomi Moiloa, there are many opportunities to integrate language AI into consumer services, particularly in telecommunications and financial services. Researchers at Ashesi University in Ghana are helping to fill this gap by mapping local languages to improve voice recognition systems for financial inclusion.

Ghana is also seeing more direct efforts to build this language infrastructure locally. The Ghana NLP Community, an open-source initiative focused on Natural Language Processing for Ghanaian languages and local problems, has been building datasets, models, and tools for tasks such as machine translation, speech recognition, and text-to-speech. Its repositories include curated Ghanaian-language datasets and models, and related work is already opening the door to new downstream applications. One example is Okyeame TTS, a Ghanaian English text-to-speech model built on the Ghanaian English ASR dataset, which shows how locally grounded datasets can move from research infrastructure into working voice technology. Together, these efforts show that the problem is not that African languages cannot support AI systems, but that they have historically been underfunded and under-prioritised.

Behind the mass adoption of LLMs lies a massive, often invisible workforce of African data annotators who serve as the human foundation of the global AI economy. While users enjoy the efficiency of these models, the models run on the labour of workers in countries like Kenya and Ethiopia. The same system that labels Hausa a “low resource” language because nobody has invested in it also treats the contributors who speak it as cheap labour rather than as people with valuable linguistic and cultural expertise. For a system to be truly fair, justice must extend beyond the code to the people building it. True AI fairness requires the technology sector to publicly acknowledge these contributors and ensure they are fairly compensated for the expertise they provide.

The languages of Africa are not low resource. The question is whether the global technology industry is willing to invest in them, or whether it will continue to build the future in languages that only some people speak.

Key Takeaways

  • To improve the rate at which multilingual AI systems are developed in Africa, it is important that product managers prioritise native language models that understand context, rather than relying on flawed translation overlays.
  • The technology sector should also publicly acknowledge and fairly compensate the local translators and annotators who contribute to building these datasets.
  • Policymakers should actively fund grassroots researchers to ensure that African data sovereignty is protected during the ongoing technology boom.