📣 I'm recruiting PhD students this cycle! Researchers interested in expanding NLP for mid- and low-resource languages - and/or developing tools for endangered languages and field linguistics - should apply to work with me either through the UR Ling or CS PhD programs!
Thrilled to say that I'll be joining the University of Rochester departments of Linguistics and Data Science (@UofRDataSci) this Fall as an assistant professor! My lab will be focused on all things low-resource NLP, especially where useful for endangered and minoritized languages
A bit belated, but I finished my PhD! Can't express enough thanks to my amazing advisors @ssshanest and Gina for their investment in my time at UW. Excited for my next adventure of joining the faculty at the University of Rochester!
What's the best way to specialize multilingual LMs for new languages? We address this in our new paper!
Embedding structure matters: Comparing methods to adapt multilingual vocabularies to new languages ()
With @terrablvns, Nora Goldfine, and @ssshanest
Excited to announce my new preprint with Fei Xia, Gina-Anne Levow, and @ssshanest: A Masked Segmental Language Model for Natural Language Segmentation! Submitted to arXiv, scheduled to be available by Sunday night (1/4)
A paper I co-authored with UW CLMS students @shivin_thukral7 and Levon Haroutunian, plus alumna Shannon Drizin, will appear in the ACL 2022 main conference!
"Multilingual unsupervised sequence segmentation transfers to extremely low-resource languages"
Preprint! We test methods to adapt a crosslingual model to a language family, and argue for targeted multilinguality as a middle ground for low-resource langs, avoiding the "curse of multilinguality"
w/ @TerraBlvns, @quirkyDhwani, @dwija_parikh, @ssshanest
Haven't been tweeting for a while; my main update is that I'm interning with Apple AI/ML this summer, working on crosslingual language modeling with the Siri Web Answers team! Living the virtual intern life from my armchair
My summer research with MSR's fantastic Cryptography and Privacy team is out! TLDR: we study privacy risks in LM services, test several mitigations, and find that Differential Privacy is by far the most effective defense against unintended memorization
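For anyone curious what that defense looks like mechanically, here's a minimal sketch of the DP-SGD recipe (per-example gradient clipping plus calibrated Gaussian noise). The hyperparameter values are illustrative, not the ones from our experiments:

```python
import numpy as np

def dp_sgd_step(per_example_grads, clip_norm=1.0, noise_multiplier=1.1):
    """One DP-SGD update: clip each example's gradient, sum, add noise, average.

    per_example_grads: array of shape (batch_size, num_params).
    clip_norm / noise_multiplier are illustrative values, not the
    hyperparameters from the paper.
    """
    # Clip each per-example gradient to L2 norm <= clip_norm
    norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
    clipped = per_example_grads * np.minimum(1.0, clip_norm / (norms + 1e-12))

    # Add Gaussian noise scaled to the clipping norm, then average
    noise = np.random.normal(0.0, noise_multiplier * clip_norm,
                             size=per_example_grads.shape[1])
    return (clipped.sum(axis=0) + noise) / per_example_grads.shape[0]
```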
Because I can't individually respond to every email:
1. Programming skills (ideally Python) are important for my students in either program
2. For Ling especially, I will strongly weigh interest in endangered languages / fieldwork, to complement existing strengths of the dept
...
Here's a good time to shout out to @ezesanlasai and @rhenderson for graciously allowing me to use their K'iche' data, as well as to @pywirrarika for access to additional Wixarika and Nahuatl dev data!
Looking forward to presenting this at @mrl2023_emnlp! We compare methods for specializing the vocabulary of pre-trained cross-lingual models to specific languages, including under-resourced ones; see the previous thread for more details!
Today I had the privilege of viewing original prints of the all-Dakota newspaper Anpao ("Sunrise"), dated 1880-1895: invaluable records of vibrant Dakota as used in everyday life, and hopefully a resource that can be leveraged for current generations of Dakota speakers and learners!
Our model gives particularly strong performance for Chinese segmentation, and competitive results for English, with several promising avenues for future work and improvement (3/4)
We note that transferring our pre-trained model is especially beneficial at extremely low target-data sizes, while the from-scratch model's performance is more inconsistent, even with larger amounts of available target data
Finally, we perform an ablation by pre-training two additional models, comparing our multilingual model to one trained on Quechua only (the highest-resource lang of the ANLP set), and another multilingual model with its total data downsampled to match the available Quechua data
Since segmental LMs segment as a by-product of a language-modeling proxy task, it was unclear whether multilingual training would work, but we observe that segmentation quality improves simultaneously across multiple pre-training languages
This month I am conducting my research at the #NewberryLibrary in Chicago, investigating the ways in which digital media and computational linguistics can be used in the task of language revitalization.
Our most surprising result may be that choosing a low sampling alpha (up-sampling low-resource langs and down-sampling high-resource ones) has a significant beneficial effect for low-resource langs, but does *not* significantly harm performance in high-resource ones
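To make the alpha knob concrete, here's a toy computation (with made-up corpus sizes, my own illustration) of how the exponent reshapes the language-sampling distribution:

```python
import numpy as np

# Hypothetical corpus sizes for a high-, mid-, and low-resource language
sizes = np.array([1_000_000, 50_000, 5_000], dtype=float)

def sampling_probs(sizes, alpha):
    """Exponentiated sampling: p_i proportional to (size_i / total)^alpha."""
    q = sizes / sizes.sum()
    p = q ** alpha
    return p / p.sum()

print(sampling_probs(sizes, alpha=1.0))  # proportional: high-resource dominates
print(sampling_probs(sizes, alpha=0.2))  # low alpha: low-resource langs up-sampled
```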
Using the Uralic family as a test case, we adapt XLM-R with targeted language modeling and vocab specialization. Our best models show sizable improvements over multilingual baselines on tasks like dependency parsing (UAS), while simultaneously cutting up to 65% of the original parameters
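Most of that parameter cut comes from the embedding block. As a rough sketch of the general idea (not our exact procedure), specializing the vocabulary amounts to keeping only the embedding rows the new vocab needs:

```python
import torch

def trim_embeddings(old_embedding, keep_ids):
    """Keep only the rows for token ids retained in the specialized vocab.

    old_embedding: torch.nn.Embedding of shape (old_vocab_size, hidden_dim).
    keep_ids: old-vocab ids kept in the new vocab; new id i maps to keep_ids[i].
    """
    new_weight = old_embedding.weight.detach()[torch.tensor(keep_ids)].clone()
    return torch.nn.Embedding.from_pretrained(new_weight, freeze=False)
```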
We adapt segmental language modeling to a bidirectional span-masking transformer architecture and test our new model (MSLM) on Chinese and English segmentation (2/4)
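As a very rough illustration of the masking side (my own sketch, not the MSLM code itself), span masking hides contiguous stretches of the input rather than single positions:

```python
import random

def mask_spans(tokens, mask_token="<mask>", max_span=4, mask_rate=0.15):
    """Randomly replace contiguous spans of up to max_span tokens with a mask."""
    out, targets, i = list(tokens), [], 0
    while i < len(out):
        if random.random() < mask_rate:
            span = random.randint(1, max_span)
            targets.append((i, tokens[i:i + span]))  # what the model must recover
            for j in range(i, min(i + span, len(out))):
                out[j] = mask_token
            i += span
        else:
            i += 1
    return out, targets
```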
These ablations show that multilingual pre-training has a decided advantage in the zero-shot case (over monolingual pre-training), but they leave plenty of room for future work and discussion on exactly when and why it is beneficial!
We ask whether...
1. performance on an unsupervised morpheme-segmentation objective can be transferred to a new, very low-resource language
2. multilingual pre-training improves transfer
3. transfer can be achieved with moderate-sized models trained on similar low-resource langs
The answer to all 3 may be "yes"!
We train a segmental LM on 10 Indigenous langs from the @AmericasNLP '21 set, and transfer to a new lang: K'iche'. On a range of target data sizes, our multilingual model beats the from-scratch model in 6/10 settings, with a zero-shot F1 of 20.6!
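For context, segmentation quality here is boundary F1: precision and recall over predicted segment-boundary positions. A minimal version of the metric (my sketch, not our exact evaluation code):

```python
def boundary_f1(gold, pred):
    """F1 over internal boundary positions of two segmentations.

    gold, pred: segmentations as lists of segments, e.g. ["ab", "c"].
    """
    def boundaries(segs):
        cuts, pos = set(), 0
        for seg in segs[:-1]:  # the end of the final segment is not a boundary choice
            pos += len(seg)
            cuts.add(pos)
        return cuts

    g, p = boundaries(gold), boundaries(pred)
    tp = len(g & p)
    if tp == 0:
        return 0.0
    prec, rec = tp / len(p), tp / len(g)
    return 2 * prec * rec / (prec + rec)
```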
We characterize XLM-R's embedding space to find the most important features by which embeddings cluster, and then propose simple techniques to preserve these features when re-initializing embeddings for a new vocabulary
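One hypothetical illustration of such a technique (a sketch of the general idea, not necessarily the exact method from the paper): initialize each new token's embedding from the old subwords it decomposes into, so the new vector starts out in the right neighborhood. This assumes a Hugging Face-style tokenizer and the original embedding matrix `old_emb`:

```python
import torch

def init_new_embedding(new_token, old_tokenizer, old_emb, hidden_dim):
    """Average the old-model embeddings of the pieces that compose new_token.

    Hypothetical helper; falls back to small random init if the old
    tokenizer produces no pieces for the token.
    """
    piece_ids = old_tokenizer.encode(new_token, add_special_tokens=False)
    if piece_ids:
        return old_emb[torch.tensor(piece_ids)].mean(dim=0)
    return torch.randn(hidden_dim) * 0.02
```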
@rctatman @GretchenAMcC I think you might not need a language model or anything fancy. It looks like Nunacom is a font that would just render ASCII characters as the syllabics that correspond to that place on the keyboard. You might just be able to do a character-wise replacement
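Something like this sketch, say (the two mapping entries are placeholders, since I don't have the actual Nunacom code chart in front of me; the real table would cover the whole legacy layout):

```python
# Map legacy font codepoints to Unicode Canadian Aboriginal Syllabics.
# PLACEHOLDER entries only, not the real Nunacom keyboard layout.
NUNACOM_TO_UNICODE = str.maketrans({
    "w": "\u140A",  # placeholder: U+140A CANADIAN SYLLABICS A
    "s": "\u1403",  # placeholder: U+1403 CANADIAN SYLLABICS I
})

def convert(legacy_text: str) -> str:
    """Character-wise replacement of legacy font codes with syllabics."""
    return legacy_text.translate(NUNACOM_TO_UNICODE)
```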
(4/5) For the remainder of the summer I will be working as a Research Assistant to Dr. Gina-Anne Levow on language modeling for low-resource languages
Our results suggest new best practices for bootstrapping NLP systems in low-resource language groups. All of our software, results, and analysis can be found at . If you find our work interesting, feel free to reach out and let us know what you think!
(2/5) This quarter I started a project with my advisor Dr. Fei Xia on unsupervised morphological segmentation, with the hope of applying the model to low-resource and underdocumented languages, and especially those with rich morphology
A systematic comparison of re-initialization techniques shows:
(1) our method performs well without increased compute or external language alignment
(2) reducing vocabulary size during adaptation promotes efficient training with only a minor decrease in downstream performance
(5/5) Finally, for the coming academic year, I will be funded by the UW Foreign Language and Area Studies (FLAS) fellowship to study Inuktitut (Eskimo-Aleut), and will continue my research on morphosyntactic parsing for Inuktitut and other morphologically rich languages
(3/5) I will be spending July in Chicago as UW's rep to the Newberry Research Library's summer institute for American Indian Studies. This year's theme is Language Revitalization, and I hope to join in a conversation about the role of Tech and CompLing in these efforts
We hope these results will help to leverage and adapt pre-trained multilingual LMs for under-resourced languages, as well as to make LMs more accessible by replacing huge crosslingual vocabularies and embedding blocks with compact, language-specialized ones