Google has launched WAXAL, a large-scale AI speech dataset designed to support 21 African languages spoken by more than 100 million people across Sub-Saharan Africa.
According to a statement from Google, the dataset was developed in collaboration with a consortium of leading African research institutions, which played a central role in building and curating the data.
The launch comes as voice-enabled technologies continue to expand globally, while most African languages remain excluded due to the lack of high-quality speech data.
According to Google, the WAXAL initiative began over three years ago after researchers identified a major imbalance in global speech datasets, which heavily favour Western and widely spoken languages.
Despite the rapid growth of voice assistants and speech-based tools globally, most of Africa's more than 2,000 languages remained unsupported because so little transcribed, high-quality audio data existed for them.
This imbalance has limited access to digital services for hundreds of millions of Africans who primarily communicate in local languages. Google said the project was conceived to close this gap by investing in long-term, community-led data collection across multiple African countries.
According to Aisha Walcott-Bryant, Head of Google Research Africa, the project is ultimately about enabling Africans to build technology in their own languages.
“The ultimate impact of WAXAL is the empowerment of people in Africa. This dataset provides the critical foundation for students, researchers, and entrepreneurs to build technology on their own terms, in their own languages, finally reaching over 100 million people. We look forward to seeing African innovators use this data to create everything from new educational tools to voice-enabled services that create tangible economic opportunities across the continent.”
The WAXAL dataset includes about 1,250 hours of transcribed natural speech and more than 20 hours of studio-quality recordings designed for building high-fidelity synthetic voices.
Languages covered include Hausa, Yoruba, Igbo, Luganda, Swahili, Acholi, Fulani, Kikuyu, Lingala, Shona, Malagasy, and several others across Sub-Saharan Africa.
Unlike many global AI projects, data collection was led by African universities and community organisations such as Makerere University in Uganda, the University of Ghana, and Digital Umuganda in Rwanda, with technical guidance from Google. Importantly, these partner institutions retain full ownership of the data, setting a model for more equitable and locally driven AI development.
Joyce Nakatumba-Nabende, a Senior Lecturer at Makerere University, said the dataset has already strengthened local research capacity in Uganda.
“For AI to have a real impact in Africa, it must speak our languages and understand our contexts. The WAXAL dataset gives our researchers the high-quality data they need to build speech technologies that reflect our unique communities.”
Similarly, Prof. Isaac Wiafe of the University of Ghana said the project helped mobilise over 7,000 volunteers and sparked innovation across sectors such as health, education, and agriculture.
Until now, many African innovators and startups had to build speech datasets from scratch, a process that is both expensive and time-consuming.
This development supports broader efforts across Africa to build indigenous AI capacity. When digital systems can understand and respond in native languages, more people can benefit from automated healthcare information, interactive learning platforms, voice-based job training, and improved access to digital public services.
Access to the WAXAL dataset lowers barriers to entry, democratises access to AI development, and could accelerate a new wave of locally relevant technology solutions across the continent.
In a related effort, Nigeria launched the Nigerian Atlas for Languages & AI at Scale (N-ATLAS) on September 20, 2025, on the sidelines of the 80th United Nations General Assembly (UNGA80) in New York.
The rollout introduced N-ATLAS v1 as an open‑source, multilingual, and multimodal large language model (LLM) designed to process and generate content in Yoruba, Hausa, Igbo, and Nigerian‑accented English.
Since its launch, N-ATLAS has positioned Nigeria at the forefront of inclusive AI development on the continent. The open‑source model is actively being adopted by developers and institutions working on language technology tools, educational resources, and context‑aware applications that reflect local linguistic realities.
