16 new datasets in Indian languages for Artificial Intelligence and Machine Learning research

The Linguistic Data Consortium for Indian Languages (LDC-IL) is a scheme of the Ministry of Education that develops digital corpora in Indian languages. Housed in the Central Institute of Indian Languages (CIIL), Mysuru, the LDC-IL held its 8th Project Advisory Committee meeting here on Monday.

Chaired by Shailendra Mohan, director, CIIL, the meeting was attended by various domain experts and industry specialists. As an important outcome, LDC-IL launched 16 new datasets in Indian languages to help bolster quality research in Artificial Intelligence and Machine Learning.

The first of their kind, these datasets will help develop new technologies in Indian languages, including Automatic Speech Recognition and Live Voice Translation, and improve the quality of the results such tools produce in Indian languages, a press release from the CIIL said.

The datasets cover 12 scheduled languages: Hindi, Bengali, Tamil, Marathi, Kannada, Malayalam, Odia, Assamese, Konkani, Maithili, Urdu, and Nepali. They also include two variants of Indian English, namely the Bengali variant and the Kannada variant.

Indian English is internationally recognised as a language in its own right, the release added, and it has its own variants within India, where different mother tongues give English a flavour of its own, with some distinct linguistic and phonetic features.

In a first, the institute also released two datasets for Chhattisgarhi, a mother tongue usually clubbed together with Hindi. “This shows the seriousness of the government to ensure that education and technology will be bolstered for all mother tongues of India as has been recommended in the NEP-2020,” the CIIL said.

These datasets will bolster research and development in all Indian languages, and both academia and industry will benefit from them. Applications developed using these datasets will ultimately help promote these languages, according to the CIIL.

All of these datasets are available on the LDC-IL Data Distribution Portal at https://data.ldcil.org

The Linguistic Data Consortium for Indian Languages is the largest repository of Curated Text and Speech resources in Indian languages meant for linguistic research and for research and development in Artificial Intelligence and Machine Learning. With these 16 new datasets, the portal now has a total of 57 datasets covering 21 Indian languages.

The datasets produced by the LDC-IL are the first real-world data collected from the field. They are unique in that they are not crowdsourced: they have been collected from verified sources and checked by experts in each language. Apart from training, the datasets can also serve as a benchmark for testing AI and Generative AI-based technology, the release said.