The focus of AI4Bhārat, an initiative of IIT-Madras, is on building open-source language AI for Indian languages, including datasets, models, and applications.
We are pleased to announce the launch of the Nilekani Center at AI4Bharat, IIT Madras on 28th July. The Center's mission is to innovate on open-source Indian language technology with the intention to create societal impact.
📣 📣 📣 New instruction-tuned LLM! 📣 📣 📣
Today, we announce an initial release of "Airavata", an instruction-tuned LLM for Hindi.
Blog:
Model:
Datasets:
(1/N)
A course on LLMs will be offered by Prof Mitesh Khapra. If you are a beginner, or have some experience and are looking to deepen your knowledge, then this course is for you. Everything from theory and fundamentals to LLMs in practice will be covered.
We are pleased to announce that we will begin recruiting AI residents (and associates) for 2024-25. The AI resident program is a year-long pre-doctoral program which allows you to work intensively on NLP, Speech and Vision projects.
Apply below:
🚀IndicLLMSuite Launch Announcement!🚀
We're thrilled to unveil IndicLLMSuite: A collection of data resources and tools for developing Indic LLMs.
📜 Paper:
🌐 Blog (the way forward):
💻 Resources:
(1/n)
🎉 🎉 🎉 Presenting our blog on IndicVoices!
IndicVoices is an ongoing journey spanning 16,237 speakers, 145 Indian districts and 22 Indic languages!
Blog:
Paper:
Dataset:
Kindly help spread the word!
🎉 Exciting News! 🚀
We are thrilled to announce the launch of our AI4Bharat Blog! 🌐✨
Our goal: Empower researchers to share their work with a wide audience. 🧠💻
Debut post: IndicTrans2-M2M, our groundbreaking system for 22 languages! 🗣️🌐
Slowly but surely, we have reached 2000 followers. So much more to do! A big thank you to all the members of AI4BHARAT, as well as everyone who has supported us, for striving to push the boundaries of open source AI research for India! 🇮🇳
The Nilekani Centre at AI4Bharat, IIT Madras is hiring TRANSCRIPTIONISTS in all of the 22 official Indian Languages. This is a remote (WFH), full-time position, with flexi-hours. Selected candidates will go through tests (speech/audio-to-text) prior to being hired.
Question: What do you do when you have a video in one language but your audience comprises people speaking 2000 other languages?
Answer: You use Chitralekha!
Presenting the 3rd blog in our blog series.
Please read our paper to find out about how we pushed the boundaries for Indian language MT. We hope to present our work at
@iclr_conf
if allotted a slot. This is just the beginning of our journey to tackle Indian language MT. Next up: dialects and code-mixing. Stay tuned!
IndicTrans2: Towards High-Quality and Accessible Machine Translation Models for all 22 Scheduled ...
Jay Gala, Pranjal A Chitale, A K Raghavan et al.
Action editor: W Ronny Huang.
#corpus
#multilingual
#corpora
📣 📣 📣 Do you want to create open-source datasets at scale? If so, then our blog detailing Shoonya is for you!
Blog:
GitHub:
Video:
Please give it a read and consider adopting it!
🚨 We are happy to present our second blog, on IndicMT Eval, accepted to ACL23! While training models is important, it is meaningless without evaluation. But which evaluation metric is reliable, especially for Indic languages? Our blog has answers for you:
I am extremely pleased to announce that IndicTrans2 will be published in TMLR (
@TmlrOrg
). This is a tremendous achievement for my coauthors and me that took nearly 1.5 years of hard work. The camera ready version will be out soon but for now we are over the moon!
#NLProc
#ACL
@ai4bharat
, a center at
@iitmadras
, is excited to announce the 1st AI4B Summer of Code! If you are passionate about contributing to the nation with open-source AI tools and apps for language and speech tech, then consider applying through the links below before 20th April.
We are happy to share that the IndicTrans model is now available on
@huggingface
Spaces with
@Gradio
Indic2En -
En2Indic -
We welcome you to try out the model. Feedback would be appreciated.
🚨🚨🚨 Important announcement!
We have identified some suspicious websites as follows:
: Uses AI4Bharat logo for an investment app.
: A basic website serving an unknown purpose.
Please be wary of such sites and spread the word.
Check out the latest work from our lab. We created a strong benchmark (IndicXTREME), upgraded the monolingual corpus (IndicCorp v2) and released new models with various ablations (IndicBERT v2) for all the constitutionally recognised Indian languages. All the artefacts are open-sourced.
New Paper 🚨
IndicXTREME: A Multi-Task Benchmark For Evaluating Indic Languages
We introduce IndicXTREME, a diverse benchmark of 9 tasks covering 18 Indian languages. We maintain high quality by using human supervision to create all the test sets. [1/8]
It is a pleasure to welcome you to the launch of the Nilekani Center at AI4Bharat, IIT Madras!
We will start live-streaming the launch at approximately 10:45 am. All the Zoom links are available on our website.
📊 First comes Sangraha, the largest Indic language corpus, spanning Verified (64B), Unverified (24B) and Synthetic (162B) tokens.
Verified - Web, PDF and Speech data
Unverified - Other multilingual corpora
Synthetic - Large scale translations and transliterations of Wikimedia
(2/n)
Thanks to everyone who joined us and made the event a grand success. We are energised to keep working towards advancing Speech & NLP tech for Indian languages. We will continue to truly open-source all our data, code, models and benchmarks.
#NLProc
@iitmadras
Airavata was created by fine-tuning OpenHathi on diverse Hindi instruction-tuning datasets to make it better suited for assistive tasks. Along with the model, we also share the instruction-tuning datasets to enable further research for IndicLLMs.
(3/N)
We also compile a collection of evaluation benchmarks along with an evaluation framework to compare various LLMs for their abilities on diverse tasks when instructed in Hindi. Currently, Airavata supports Hindi, but we plan to expand this to all 22 scheduled Indic languages!
4/4
We are looking for people with -
1. Excellent command over mother tongue/chosen language.
2. Excellent listening comprehension.
3. Attention to detail.
4. Qualification: Graduate or PG with any Indian language as a core subject at UG &/or PG level.
Salary: Rs.25,000-30,000 pm
AI4Bharat is excited to announce a talk on "RNN-T ASR Systems and Enabling Contextualization For RNN-T ASR Systems" by
@mahajain3
this Saturday, March 19th, 9:00 to 10:45 AM. [1/n]
As we gather today on the 73rd Republic Day to celebrate the mighty strength, rich heritage and cultural diversity of this great nation, we at AI4Bharat are proud to add to the technological advancements towards solving speech & language problems [1/n]
On 28th July, in an event at IIT Madras, the center will be inaugurated by Rohini and
@NandanNilekani
. Following this, we are hosting the Center's first language AI workshop that is open to startups and researchers.
Our residents, in collaboration with students and researchers in AI4BHARAT, have also produced high-impact datasets and models like BPCC, IndicTrans1 and 2, IndicWav2Vec, IndicBart, IndicNLG Benchmark, etc., all of which have seen significant adoption in government and industry.
3. Aksharantar: Towards building open transliteration tools for the next billion users.
Authors: Yash Madani, Sushane Parthan, Priyanka Bedekar, Ruchi Khapra,
@anoopk
,
@pratykumar
,
@MiteshKhapra
. (Findings)
It's a step towards collecting spontaneous speech data across the rich tapestry of Indian languages, while honouring the vast linguistic, cultural, and demographic diversity! With this, we release 7,348 hours of speech data! Let's push the boundaries of Indic speech technologies!
IndicWav2Vec
- Curated 17k hours of raw speech data for 40 Indian languages
- SOTA ASR models for 9 languages on 3 public datasets
paper -
data -
code -
@tahirjmakhdoomi
@_themlstudio
@kaushal_py
[3/n]
AI4BHARAT has an excellent track record of AI residents conducting cutting-edge research and publishing it in top-tier venues like ACL, Interspeech, EMNLP, TMLR, etc., under the leadership of Prof Mitesh Khapra, Dr Anoop Kunchukuttan and Dr Pratyush Kumar (now at SarvamAI).
4. NICT-AI4B's Submission to the Indic MT Shared Task in WMT 2023
Authors:
@prajdabre1
,
@jaygala24
,
@pranjalchitale
(WMT shared task for Indic languages)
Congratulations to all the authors!
Camera ready versions will be out soon.
2. DecoMT: Decomposed Prompting for Machine Translation Between Related Languages using Large Language Models
Authors:
@ratishsp
,
@anoopk
,
@prajdabre1
, Ai Ti Aw, Nancy Chen (main)
🌍 Empowering Language Communities:
By releasing our open-license datasets and tools, we aim to empower open research. We hope this effort acts as a blueprint for creating quality resources in other language communities thus democratizing AI.
(6/n)
#IndicLLMSuite
#AI4Bharat
#LLM
IndicBART
- Multilingual pre-trained seq2seq model for 11 Indian Languages and English
- 1/3rd the size of mBART, but better/competitive on NMT and extreme summarisation
paper -
code -
@prajdabre1
@anoopk
@ratishsp
[7/n]
🗣️ For model alignment, we release IndicAlign-Instruct, a collection of 74.7 million prompt-response pairs in 14 Indic languages, created by repurposing and translating existing datasets as well as creating new ones.
(4/n)
🚫 We also create IndicAlign-Toxic: 123K pairs of toxic prompts and safe responses, built by repurposing existing datasets and by a novel method that combines taxonomized synthesis of toxic prompts using an unaligned model with non-toxic responses generated by an aligned model.
(5/n)
5. Towards Better Evaluation of Instruction-Following: A Case-Study in Summarization (CoNLL 2023)
Authors: Ondrej Skopek, Rahul Aralikatte, Sian Gooding, Victor Carbune.
All the notebooks, slides & posters from our workshops are now available on our website. Please reach out to us if you have any queries wrt any of the models and/or potential use cases for our model.
Chitralekha is an open-source, AI-powered video transcreation platform. It has an integrated workforce management system, which enables transcreation of a video from one language to another.
Please enjoy our blog, explore Chitralekha and feel free to contribute.
Calling out to
@amitabhk87
to join - a community to innovate on AI solutions for the nation. Mitesh and Pratyush from IIT Madras will host a kick-off today 10th July at 7pm IST here:
#AI
#India
Samanantar
- Compiled a 49.6M parallel corpus between Indian languages and English
- Trained an MT system that outperforms all publicly available models
paper -
data -
code -
@gowtham_ramesh1
@sumanthd17
[2/n]
As a transcriptionist under the aegis of IIT Madras, you will get to collaborate with language experts at a pan-India level. You will be presented with many opportunities to improve your know-how in AI speech processing.
OpenHands
- OpenHands is an open source toolkit to democratize sign language research by making pose-based Sign Language Recognition more accessible to everyone
paper -
toolkit -
code -
@GokulNC
[5/n]
SuperShaper
- We propose SuperShaper, a task-agnostic pre-training approach which simultaneously pre-trains a large number of Transformer networks by varying their shape (the hidden dimensions across layers)
paper -
@VinodG93
@gowtham_ramesh1
[n/n]
👉 Dive into the details of IndicTrans2-M2M – Indic to Indic translation covering 22 languages! 🌍 Discover how we achieved 5x compactness and 2x faster models, now on HuggingFace! 🚀💡
#TechInnovation
#IndicTrans2M2M
EvalEval
- Proposed Perturbation Checklists for the design and evaluation of automatic NLG metrics
- Showed that existing NLG metrics are not robust to perturbations and disagree with human scores
paper -
code -
@AnanyaSaiB
[4/n]
Multilingual Language Models (MLLMs) Survey
- Surveyed literature about MLLMs focussing on: (i) Models & objective functions (ii) Tradeoff between a monolingual & multilingual LM (iii) Zero-shot transfer
paper -
@gowtham_ramesh1
@sumanthd17
[6/n]
Hi everyone,
We're looking for talented full-stack web developers to join us in a full-time position at our office in the IIT-Madras Research Park. You can apply through this link
Calling out to
@mhrd_innovation
to join - a community to innovate on AI solutions for the nation. Mitesh and Pratyush from IIT Madras will host a kick-off today 10th July at 7pm IST here:
#AI
#India
Calling out to
@PMOIndia
to join - a community to innovate on AI solutions for the nation. Mitesh and Pratyush from IIT Madras will host a kick-off today 10th July at 7pm IST here:
#AI
#India
Join us for Part 2 of the talk, which will be held on Saturday, March 26th, 9:00 to 10:10 AM. In this talk,
@mahajain3
will be discussing beam-search decoding and contextualisation for RNN-T.
Join us at:
Sign up for the talk here:
Calling out to
@isro
to join - a community to innovate on AI solutions for the nation. Mitesh and Pratyush from IIT Madras will host a kick-off today 10th July at 7pm IST here:
#AI
#India
Mahaveer Jain "RNN-T ASR Systems and Enabling Contextualization For RNN-T ASR Systems"
Saturday, March 19 · 9:00 – 10:45am
Google Meet joining info
Video call link:
[2/n]
Speaker Bio:
Mahaveer Jain is a Software Engineer at Facebook. Previously, he was a graduate research assistant at LTI at CMU, where he finished his Master's in Language Technologies. Mahaveer has worked extensively on building production-ready RNN-T ASR systems at Facebook [5/n]
Calling out to AI engineers, domain experts, govt officials to join - a community to innovate on AI solutions for the nation. Mitesh and Pratyush from IIT Madras will host a kick-off today 10th July at 7pm IST here:
#AI
#India
Calling out to
@nasscom
to join - a community to innovate on AI solutions for the nation. Mitesh and Pratyush from IIT Madras will host a kick-off today 10th July at 7pm IST here:
#AI
#India
Calling out to
@FollowCII
to join - a community to innovate on AI solutions for the nation. Mitesh and Pratyush from IIT Madras will host a kick-off today 10th July at 7pm IST here:
#AI
#India
Calling out to
@NITIAayog
to join - a community to innovate on AI solutions for the nation. Mitesh and Pratyush from IIT Madras will host a kick-off today 10th July at 7pm IST here:
#AI
#India
Further, Mahaveer will discuss methods to enable contextualization for RNN-T ASR systems. Contextualization allows us to use utterance-specific context for ASR systems. [4/n]