Google DeepMind’s Indian Division Developing Project Morni to Build LLM for 125 Indic Languages

Google DeepMind’s Initiatives for Language Inclusion in India
Introduction to Project Vaani and Morni
Google DeepMind has embarked on two significant projects in India: Project Vaani and Project Morni. These initiatives, developed in collaboration with the Indian Institute of Science (IISc) and ARTPARK (Artificial Intelligence & Robotics Technology Park), aim to improve the representation of India’s many languages in the digital space. By collecting speech data and releasing it as open-source material, the projects seek to close the gap between the country’s linguistic diversity and the languages that technology currently supports.
Understanding the Language Diversity in India
India officially recognizes 22 languages, yet more than 100 are in daily use. Approximately 60 of these languages each have more than a million speakers, and over 125 languages have at least 100,000 speakers, highlighting the nation’s rich cultural fabric.
The Digital Gap
Despite this diversity, many languages, particularly the lesser-known ones, lack a robust digital presence. For instance, while Hindi is spoken by about 10% of the global population, it accounts for only 0.1% of the content available on the internet. More strikingly, 73 of the 125 languages with at least 100,000 speakers (about 58%) currently lack any digital data, a significant gap that needs to be filled.
Launch and Goals of Project Vaani
To tackle these challenges, Google DeepMind has initiated Project Vaani, which was first announced in December 2022. The project aims to gather extensive speech data from various regions throughout India. Its primary goals include:
- Data Collection: In its initial phase, Project Vaani gathered more than 14,000 hours of speech data from 80,000 participants across 80 districts in India.
- Future Plans: The project intends to collect a total of 154,000 hours of transcribed speech data from all 773 districts in the country, roughly eleven times the volume gathered in the first phase.
The current phase of Project Vaani covers 160 districts and includes every state, widening data collection to better reflect India’s linguistic diversity.
The Importance of Project Morni
Alongside Project Vaani, Google DeepMind is also working on Project Morni, which aims to develop an artificial intelligence model that can understand and represent a multitude of Indian languages. The name Morni is short for Multimodal Representation for India, reflecting the goal of building technology that recognizes and incorporates India’s varied languages.
Key Objectives of Project Morni
- Inclusivity: This project strives to ensure that all languages, no matter how small, have representation in the digital world.
- Cultural Preservation: By documenting and digitizing these languages, the project plays a vital role in preserving India’s linguistic heritage.
- Accessibility in Technology: The initiative is designed to make technology more accessible and user-friendly for individuals who communicate in these languages on a daily basis.
Conclusion: A Vision for an Inclusive Digital Future
In summary, Google DeepMind’s two projects, Vaani and Morni, reflect a commitment to bringing India’s linguistic richness into the digital realm. Beyond technological advancement, they aim to create an inclusive space in which every individual’s voice can be heard. By gathering large amounts of speech data and working towards broader language representation in AI, these projects contribute to a more equitable digital landscape.