✨A delayed update, but✨
* I defended my PhD
@LTIatCMU
!
* I recently started as a Research Scientist
@GoogleAI
in New York City!
Pictured: (1) Zoom screenshot after my defense with my incredible thesis committee; (2) the NYC skyline on my first day in the city🏙️
✨Our TACL paper introduces a semi-supervised learning method for OCR post-correction (w/ Daisy Rosenblum,
@gneubig
,
@anas_ant
).
We improve digitization accuracy on endangered languages by up to 29%!
📌Talk+poster at
#EMNLP2021
on Nov 7th!
📌Paper:
1/5
There is *lots* of text in endangered languages that isn't machine-readable: paper books, handwritten notes, scanned images...
Our
#EMNLP2020
paper (w/
@gneubig
,
@anas_ant
) addresses the task of extracting text from these sources.
All the details:
1/5
With the rapid development of language technology, it’s important that as many languages as possible benefit from these technologies, so we’re sharing XTREME-UP, a benchmark for evaluating multilingual models.
📝
💻
Read 🧵↓ (1/3)
I've seen multiple tweets today on temporal adaptation of NLP models, so I thought it might be a good time to promote our temporally-diverse dataset + analysis for NER!
Paper:
Dataset:
(w/
@daniel_preotiuc
and
@TechAtBloomberg
)
On June 25th, at 17:00 UTC, SIGTYP will host a lecture by Shruti Rijhwani (
@shrutirij
) on "Cross-Lingual Entity Linking for Low-Resource Languages."
Registration:
Chat:
The EMNLP 2024 call for papers is here! Submissions are through ARR, with a deadline of June 15 🚀✨
I'm also super excited to be on the organizing committee for the conference, as publicity co-chair!
#EMNLP2024
Super excited to work alongside this amazing group of people organizing this year's
@emnlpmeeting
. If you haven't done so, check out the website:
Looking forward to seeing everyone in Miami later this year!
More info soon.
#NLProc
#EMNLP2024
The code for our paper "OCR Post Correction for Endangered Language Texts" is now available:
Try it out to train OCR post-correction models for low-resource settings!
I'm presenting my work (w/ Jiateng Xie,
@gneubig
and Jaime Carbonell) on low-resource entity linking at
#AAAI19
. Drop by the talk tomorrow (1/30) at 11:30am!
Paper:
My featured session at the Grace Hopper Celebration begins in an hour!
Come by to learn about some of my recent research on digitizing endangered language texts
@LTIatCMU
.
Link:
#vGHC21
#vGHC2021
I'm at ACL 2022 in Dublin! ☘️ Send me a message if you're around and want to catch up!
Also, come to the
@acl_sigel
ComputEL workshop at the end of the week (26th and 27th)!
The Multilinguality and Linguistic Diversity track at EMNLP2023 is looking for emergency reviewers!
✨Please send me a message or email if you are available to review a paper this week✨
#EMNLP2023
#NLProc
#NLProc
#EMNLP2022
The multilinguality track at EMNLP is looking for emergency reviewers!
If you're able to help out with a review this week, please DM me or sign up here:
My talk on digitizing endangered language texts with
#NLP
at the Grace Hopper Celebration will be featured *again* tomorrow, with a live Q&A session this time!
Come by, if you missed it today!
Link:
#vGHC21
#vGHC2021
@AnitaB_org
I'm looking for emergency reviewers for submissions in the Multilinguality track at
#ACL2024
🚨
Send me a DM or email if you're able to do a review in the next 1-2 days! ✨
SIGEL is very excited to announce that the Workshop on Computational Methods for Endangered Languages (ComputEL-5), after 3 iterations with
@ICLDC_HI
will come back to the CL community (
@aclmeeting
). The CfP is available:
Follow us for more updates!
Our latest version of Gemini 1.5 Pro in AI Studio is
#1
on the LMSys leaderboard. 🚀
This is the result of various advances in post-training and we have more lined up. Congrats to the Gemini team.
“Improving Candidate Generation for Low-resource Cross-lingual Entity Linking” at TACL 2020. In this work, we first assess the problems faced by current entity candidate generation methods for low-resource XEL, then propose three improvements. (1/n)
This was a super exciting paper to collaborate on!
Seahorse 🌊🐴 is a multilingual summarization evaluation dataset that covers multiple languages and models. The preprint is now on arxiv!
We are excited to release Seahorse 🌊🐴, a ✨multilingual, multifaceted summarization evaluation dataset✨
96,000+ human ratings to enable faster progress in training and evaluating learnt metrics for summarization!
Preprint:
Data:
Gemini 1.5 Pro has entered the (LMSys) Arena! Some highlights:
-The only "mid" tier model at the highest level alongside "top" tier models from OpenAI and Anthropic ♊️
-The model excels at multimodal, and long context (not measured here) 🐍
-This model is also state-of-the-art
✨Submit your work on computational methods for endangered and low-resource languages to the ComputEL-7 workshop!✨
Co-located with EACL 2024, in Malta!🥳
📢Deadline: December 15, 2023.
📢Submission link:
With every email update about covid-19 I get from
@CarnegieMellon
, I feel grateful to be part of this institution. The response has been exceptional so far ❤️
Just a tweet to say thank you to our students, staff and faculty. We know you're facing new challenges and working hard to finish the semester.
The CMU community is a wonderful community to be a part of.
1/5 On December 5, we are delighted to have
@shrutirij
present her PhD research on unlocking text data for under-resourced languages. Register for the talk:
#EMNLP2021
session for the talk on our paper Lexically Aware Semi-Supervised Learning for OCR Post-Correction (w/
@mulix
,
@anas_ant
,
@gneubig
) starts in ✨45 minutes✨
Session link:
Paper:
Join our
#acl2020nlp
live QA sessions for our paper "Soft Gazetteers for Low-Resource Named Entity Recognition" tomorrow, July 8!
Find the details and presentation here:
Paper co-authored with
@shuyanzhxyc
@gneubig
and Jaime Carbonell.
Awesome talk from
@simi_97k
on transcreation (multimodal, multilingual, multicultural translation) between diverse countries! Her oral at the
@AmericasNLP
workshop was interesting, engaging, and quite humorous!
Microsoft Research India is accepting applications for our Research Fellow program till 16th Feb 2024.
For more info:
🔗 Research Fellow program:
🔗 Nominations process:
To learn about our lab culture:
We use a ✨WFSA representation✨ to efficiently do joint decoding with the neural post-correction model and the count-based LM.
Experiments on 4 endangered languages (Ainu, Griko, Kwak'wala, Yakkha) show up to 29% improvement in error rates!
Code:
4/5
The
@AmericasNLP
workshop has begun and we have a packed schedule of talks and posters today!
Drop by in-person or on Zoom to hear about NLP for indigenous and low-resource languages, and invited talks by
@simi_97k
and
@gneubig
later today!
Schedule:
The first session of the workshop is underway, and we have a full house here in the Doña Socorro room!
Join us in-person or on Zoom for an exciting schedule of talks and posters all day today! 🎉✨
#NAACL2024
#NLProc
The AmericasNLP Workshop is tomorrow, June 21!
We start at 9am Mexico City time, in person and on Zoom!
Join us for exciting research presentations and posters ✨🚀
We also have three invited talks, by
@gneubig
,
@simi_97k
and Fidencio Briceño Chel.
See you tomorrow!
Very little manually transcribed data is available for training OCR and post-correction models in endangered languages.
🚀Our semi-supervised method uses the relatively larger number of raw images that need to be digitized to improve performance.
2/5
#NLProc
#EMNLP2021
The
#acl2020nlp
live QA sessions for our paper "Temporally-Informed Analysis of Named Entity Recognition" are tomorrow, July 8!
Join us to learn about how temporal data drift can affect NER models!
Find the details and presentation here:
(1/3)
Are you
#PhDone
(or close)? Would you like to live in the Washington, DC area and be part of the *exciting*
@GMUCompSci
dept, working on Machine Translation and
#nlproc
for low-resource languages? I'm looking for a postdoc (ideally starting in 2022) -- do reach out if interested!
@mulix
and I will be presenting our work ✨developing OCR for Kwak'wala✨ today at the ComputEL workshop at 3:30pm Dublin time!
✨Join us in Liffey Hall 1 or on Zoom!✨
More info:
@acl_sigel
#acl2022nlp
We use self-training, a simple technique where the model is re-trained on its own predictions.
To counter the potential noise from self-training, we introduce ✨lexically-aware decoding✨ with a count-based language model to reinforce correctly predicted words in decoding.
3/5
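A minimal sketch of the self-training idea described above, assuming a toy word-level correction table in place of the neural post-correction model. The function names (`train`, `predict`, `self_train`) and the lexicon bonus are hypothetical illustrations, not the paper's implementation.

```python
from collections import Counter, defaultdict


def train(pairs):
    """Toy 'model': for each OCR word, count the corrections seen in training.

    Assumes OCR and corrected lines have the same number of tokens.
    """
    table = defaultdict(Counter)
    for ocr_line, corrected_line in pairs:
        for ocr_word, corrected_word in zip(ocr_line.split(), corrected_line.split()):
            table[ocr_word][corrected_word] += 1
    return table


def predict(table, ocr_line, lexicon=None):
    """Correct a line word by word; lexicon counts boost frequently predicted words."""
    output = []
    for word in ocr_line.split():
        candidates = table.get(word)
        if not candidates:
            output.append(word)  # unseen word: copy it through unchanged
            continue
        best = max(
            candidates,
            key=lambda c: candidates[c] + (lexicon[c] if lexicon else 0),
        )
        output.append(best)
    return " ".join(output)


def self_train(labeled, unlabeled, rounds=2):
    """Re-train on the model's own predictions for the unlabeled lines."""
    model = train(labeled)
    lexicon = Counter()
    for _ in range(rounds):
        # Pseudo-label the unlabeled lines with the current model.
        pseudo = [(line, predict(model, line, lexicon)) for line in unlabeled]
        # Count-based lexicon built from the predicted words.
        lexicon = Counter(w for _, pred in pseudo for w in pred.split())
        # Re-train on gold pairs plus pseudo-labeled pairs.
        model = train(labeled + pseudo)
    return model, lexicon


labeled = [("he11o world", "hello world")]
unlabeled = ["he11o there", "he11o world"]
model, lexicon = self_train(labeled, unlabeled)
corrected = predict(model, "he11o there", lexicon)
```

Here the lexicon only adds a count-based bonus to candidates the model already proposes; the thread describes joint decoding with a neural model, which this sketch does not attempt.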
📢
#AmericasNLP2024
features not one, but TWO shared tasks!
ST 1: Machine Translation Systems for Indigenous Languages⏩️
ST 2: Creation of Educational Materials for Indigenous Languages⏩️
🚨Deadline for both: 04/10/24🚨
#NLProc
Come say hi and learn more about our paper at our live session today!
Gather session 4A: Machine Translation and Multilinguality
Nov 17, 21:00 EST!
(Nov 18, 02:00 UTC)
Details:
#emnlp2020
#NLProc
I'll be at the AmericasNLP Workshop on Friday June 21. We have a super exciting workshop program, with a bunch of invited talks and research presentations.
Hope to see many of you there next week!
✨The AmericasNLP Workshop is happening on June 21st, co-located with NAACL in Mexico City and on Zoom!✨
Join us for two exciting talks by Fidencio Briceño Chel and
@gneubig
, plus insightful research presentations and posters!
More info:
#NAACL2024
If you have documents in an endangered language that require digitization, we can apply our OCR post-correction method on your data!
Let us know here: .
#emnlp2020
#NLProc
5/5
📌
#EMNLP2021
Talk:
📌
#EMNLP2021
Poster:
📌Do you have documents in a low-resource language that need to be digitized? Let us know!
Send me an email or fill out the form here:
5/5
I am recruiting a PhD research intern at
@ai2_mosaic
!
Topics of interest include:
📋generative model evaluation
🔥robustness
🔎model interpretability
🥸 privacy
✨Deadline: Nov 6 (today!!)✨
The text extracted from these documents can:
- be used to build NLP systems in endangered languages
- aid language documentation and preservation efforts
- make these documents available for searching and online browsing by language learners and speakers
2/5
Last 2021 NLLP Talk! We are happy to host
@Lasha1608
, who will speak about how language technologies can empower users in the specific context of privacy policies: .
@CatalinaGoanta
will discuss legal implications. Register here 👉
Our model improves word recognition by 34% on average!
Get the paper/code/dataset here:
Come to our live QA session at EMNLP 2020 on Nov 18 at 02:00-04:00 UTC!
4/5
Our paper presents:
- a benchmark dataset with three critically endangered languages: Ainu, Griko, and Yakkha
- an analysis of whether general-purpose OCR systems are robust to endangered languages
- an OCR post-correction model tailored to data-scarce settings
3/5
Poster session starting NOW!
I'll be at the poster for two hours in gather town -- come by and say hi.
Poster Session 1: NLP Applications. Paper TACL-3193. You can "locate" me on gather town to find it!
#EMNLP2021
#NLProc
We’re releasing fairmotion, a library to help AI researchers use motion capture data for graphics and robotics. At
#SIGGRAPH2020
, we presented work using fairmotion to control diverse behaviors for physically simulated characters.
I am thrilled to announce that I will be joining
@DukeU
@dukecompsci
as an Assistant Professor in summer 2025. Super excited for the next chapter! Stay tuned for the launch of my lab 🧠🤖
@terrible_coder
I did one of these in 2015, and it helped me figure out whether I wanted to do a PhD or not! I didn't have many opportunities to do research before that, so it was very useful.
I am looking for PhD students at UW-Madison in Fall 2022. If you’re interested in NLP/ML, especially on multilingual NLP, multimodal learning, and transfer learning for translation and healthcare, please apply! Check my website for more details!
We present a method to create gazetteer features for low-resource NER using information from English knowledge bases through entity linking.
It improves NER performance on four low-resource languages: Kinyarwanda, Oromo, Sinhala, Tigrinya.
Code here:
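A rough sketch of what gazetteer features derived from entity linking could look like, assuming a hypothetical linker that returns (entity type, score) candidates for a span. The function name, the type inventory, and the max-score feature are illustrative assumptions, not the paper's exact feature set.

```python
def soft_gazetteer_features(candidates, types=("PER", "ORG", "LOC")):
    """Turn entity-linking candidates for one span into a per-type feature vector.

    candidates: list of (entity_type, link_score) pairs from a hypothetical
    entity linker run against an English knowledge base.
    Returns one float per type: the highest link score seen for that type.
    """
    features = {t: 0.0 for t in types}
    for entity_type, score in candidates:
        if entity_type in features:
            features[entity_type] = max(features[entity_type], score)
    return [features[t] for t in types]


# Example: a span whose candidates are mostly locations.
vector = soft_gazetteer_features([("LOC", 0.9), ("PER", 0.2), ("LOC", 0.4)])
```

Unlike a binary gazetteer lookup, the vector keeps the linker's confidence, so an NER model could weigh weak matches less.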
EMNLP, don't forget the time change (fall back) tomorrow! From what I just frantically read, DR is AST (UTC-4) with no daylight savings change, and we in CA are moving from PDT (UTC-7) to PST (UTC-8) so there will be a 4-hour difference starting tomorrow.
Exciting personal update!! I am thrilled to share that I defended my PhD
#PHDone
🎉
Next steps: Young Investigator/Postdoc
@allen_ai
. I’m super excited to work with
@YejinChoinka
and the
@ai2_mosaic
team, as well as to join the vibrant NLP community in Seattle 🗻✨🧋☕️
We show that NER performance is affected by the temporal distribution of the training set. We recommend that it be taken into account when building models!
Find all experiments and analysis in the paper:
(3/3)