C.M. Downey

@cmdowney

342 Followers
131 Following
25 Media
62 Statuses

Asst. professor at @UofR Ling and Data Science \ NLP for low-resource, endangered, and Indigenous languages \ formerly @uwlinguistics, @uwnlp

Rochester, NY
Joined June 2019
@cmdowney
C.M. Downey
13 days
📣 I'm recruiting PhD students this cycle! Researchers interested in expanding NLP for mid- and low-resource languages - and/or developing tools for endangered languages and field linguistics - should apply to work with me either through the UR Ling or CS PhD programs!
6
87
265
@cmdowney
C.M. Downey
7 months
Thrilled to say that I'll be joining the University of Rochester departments of Linguistics and Data Science ( @UofRDataSci ) this Fall as an assistant professor! My lab will be focused on all things low-resource NLP, especially where useful for endangered and minoritized languages
7
4
37
@cmdowney
C.M. Downey
2 months
A bit belated, but I finished my PhD! Can't express enough thanks to my amazing advisors @ssshanest and Gina for their investment in my time at UW. Excited for my next adventure of joining the faculty at the University of Rochester!
Tweet media one
Tweet media two
4
1
25
@cmdowney
C.M. Downey
1 year
What's the best way to specialize multilingual LMs for new languages? We address this in our new paper! Embedding structure matters: Comparing methods to adapt multilingual vocabularies to new languages () With @terrablvns, Nora Goldfine, and @ssshanest
Tweet media one
1
1
19
@cmdowney
C.M. Downey
3 years
Excited to announce my new preprint with Fei Xia, Gina-Anne Levow, and @ssshanest: A Masked Segmental Language Model for Natural Language Segmentation! Submitted to arXiv, scheduled to be available by Sunday night (1/4)
Tweet media one
1
1
13
@cmdowney
C.M. Downey
2 years
A paper I co-authored with UW CLMS students @shivin_thukral7 and Levon Haroutunian, plus alumna Shannon Drizin, will appear in the ACL 2022 main conference! "Multilingual unsupervised sequence segmentation transfers to extremely low-resource languages"
2
1
11
@cmdowney
C.M. Downey
4 months
Preprint! We test methods to adapt a crosslingual model to a language family, and argue for targeted multilinguality as a middle ground for low-resource langs, avoiding the "curse of multilinguality" w/ @TerraBlvns, @quirkyDhwani, @dwija_parikh, @ssshanest
Tweet media one
Tweet media two
2
6
10
@cmdowney
C.M. Downey
13 days
Here's a link to apply for the University of Rochester Linguistics program:
0
0
9
@cmdowney
C.M. Downey
4 years
Haven't been tweeting for a while - my main update is that I'm interning with Apple AI/ML this summer working on crosslingual language modeling with the Siri Web Answers team! Living the virtual intern life from my armchair
1
0
8
@cmdowney
C.M. Downey
2 years
My summer research with MSR's fantastic Cryptography and Privacy team is out! TLDR: we study privacy risks in LM services, test several mitigations, and find that Differential Privacy is by far the most effective defense against unintended memorization
Tweet media one
1
1
6
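The tweet above reports that Differential Privacy was the most effective mitigation tested. For context, here is a minimal sketch of the DP-SGD mechanism that such defenses build on (per-example gradient clipping plus calibrated Gaussian noise); the function and constants below are illustrative assumptions, not code from the paper:

```python
# Sketch of the DP-SGD update at the heart of Differential Privacy
# defenses against memorization. Illustrative only: clip_norm and
# noise_multiplier are placeholders, not the paper's settings.
import torch

def dp_sgd_step(per_example_grads, clip_norm=1.0, noise_multiplier=1.1):
    """per_example_grads: tensor of shape (batch_size, num_params)."""
    # Clip each example's gradient so no single record dominates
    norms = per_example_grads.norm(dim=1, keepdim=True)
    clipped = per_example_grads * (clip_norm / norms).clamp(max=1.0)
    # Sum, then add Gaussian noise scaled to the clipping bound
    summed = clipped.sum(dim=0)
    noise = torch.normal(0.0, noise_multiplier * clip_norm, summed.shape)
    return (summed + noise) / per_example_grads.shape[0]
```

Clipping bounds any one training example's influence on the update, and the noise masks what remains, which is why this defends specifically against unintended memorization.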
@cmdowney
C.M. Downey
12 days
Because I can't individually respond to every email:
1. Programming skills (ideally Python) are important for my students in either program
2. For Ling especially, I will strongly weigh interest in endangered languages / fieldwork, to complement existing strengths of the dept ...
@cmdowney
C.M. Downey
13 days
📣 I'm recruiting PhD students this cycle! Researchers interested in expanding NLP for mid- and low-resource languages - and/or developing tools for endangered languages and field linguistics - should apply to work with me either through the UR Ling or CS PhD programs!
1
1
5
@cmdowney
C.M. Downey
4 years
@Daniel_Nikpayuk @emilymbender @trochee @rctatman @GretchenAMcC @UWlinguistics @dhgarrette Good luck with everything! I think all the Inuktitut should be wrapped in nunacom tags since those are what tell the webpage to display the syllabics, but yeah sometimes it's easy to make mistakes with these types of things. Feel free to reach out to me!
1
0
3
@cmdowney
C.M. Downey
2 years
Here's a good time to shout out to @ezesanlasai and @rhenderson for graciously allowing me to use their K'iche' data, as well as to @pywirrarika for access to additional Wixarika and Nahuatl dev data!
1
0
3
@cmdowney
C.M. Downey
11 months
Looking forward to presenting this @mrl2023_emnlp! We compare methods for specializing the vocabulary of pre-trained cross-lingual models to specific languages, including under-resourced ones; see the previous thread for more details!
@cmdowney
C.M. Downey
1 year
What's the best way to specialize multilingual LMs for new languages? We address this in our new paper! Embedding structure matters: Comparing methods to adapt multilingual vocabularies to new languages () With @terrablvns, Nora Goldfine, and @ssshanest
Tweet media one
0
0
3
@cmdowney
C.M. Downey
4 years
@emilymbender @Daniel_Nikpayuk @trochee @rctatman @GretchenAMcC @UWlinguistics @dhgarrette It looks like there is. The Inuktitut is wrapped in the nunacom font tags. I think you wouldn't need a language model
1
0
2
@cmdowney
C.M. Downey
2 years
If you're interested, I hope you'll take a look at our paper, and/or reach out, either online in the meantime or in Dublin this May!
0
0
2
@cmdowney
C.M. Downey
5 years
Today I had the privilege of viewing original prints of the all-Dakota newspaper Anpao "Sunrise" dated 1880-1895, invaluable records of vibrant Dakota as used in everyday life, and hopefully a resource that can be leveraged for current generations of Dakota speakers and learners!
Tweet media one
0
1
2
@cmdowney
C.M. Downey
3 years
Our model gives particularly strong performance for Chinese segmentation, and competitive results for English, with several promising avenues for future work and improvement (3/4)
Tweet media one
1
0
2
@cmdowney
C.M. Downey
3 years
All code for building and testing Segmental Language Models will also go live at once the preprint is released! (4/4)
0
0
2
@cmdowney
C.M. Downey
2 years
We note that transferring our pre-trained model is especially beneficial at extremely low target resource sizes, and performance of the from-scratch model is more inconsistent, even with higher amounts of available target data
Tweet media one
1
0
2
@cmdowney
C.M. Downey
2 years
Finally, we perform an ablation by pre-training two additional models, comparing our multilingual model to one trained on Quechua only (the highest-resource lang of the ANLP set), and another multilingual model with its total data downsampled to match the available Quechua data
Tweet media one
1
0
2
@cmdowney
C.M. Downey
2 years
Since segmental LMs segment as a by-product of an LM proxy-task, it was unclear whether multilingual training would work, but we observe that segmentation quality on multiple pre-training languages improves simultaneously
Tweet media one
1
0
2
@cmdowney
C.M. Downey
5 years
This month I am conducting my research at the #NewberryLibrary Chicago, investigating the ways in which digital media and computational linguistics can be used in the task of language revitalization.
Tweet media one
Tweet media two
3
0
2
@cmdowney
C.M. Downey
4 months
Our most surprising result may be that choosing a low sampling alpha (up-sampling low-resource langs and down-sampling high-resource) has a significant beneficial effect for low-resource langs, but does *not* significantly harm performance in high-resource ones (pics repeated)
Tweet media one
Tweet media two
1
0
2
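For readers unfamiliar with the "sampling alpha" in the tweet above: it is the exponent in the standard temperature-style language-sampling scheme for multilingual training, where each language's share of the data is raised to the power alpha and renormalized. A minimal sketch with made-up corpus sizes:

```python
# Exponential smoothing of language sampling probabilities. An alpha
# below 1 flattens the distribution: low-resource languages are
# up-sampled and high-resource ones down-sampled. Sizes are invented.

def sampling_probs(sizes, alpha=0.3):
    total = sum(sizes.values())
    weights = {lang: (n / total) ** alpha for lang, n in sizes.items()}
    z = sum(weights.values())
    return {lang: w / z for lang, w in weights.items()}

sizes = {"fin": 10_000_000, "est": 1_000_000, "sme": 10_000}  # tokens
print(sampling_probs(sizes, alpha=0.3))
# "sme" rises from a ~0.09% raw share to roughly 8% of sampled batches
```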
@cmdowney
C.M. Downey
4 months
Using the Uralic family as a test case, we adapt XLM-R with targeted language modeling and vocab specialization. Our best models show sizable improvements over multilingual baselines for tasks like UAS, while simultaneously cutting up to 65% of the original parameters
Tweet media one
1
0
0
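A rough sketch of what the vocabulary-specialization step could look like, under the assumption that it amounts to keeping only the subword rows a target-family corpus actually uses (the function below is hypothetical, not the paper's pipeline). Since XLM-R's ~250k-row embedding block dominates its parameter count, this is where most of the parameter savings come from:

```python
# Hypothetical vocabulary trimming: keep only subword ids observed in
# the target corpus (plus special tokens) and slice the embedding
# matrix down to those rows.
import torch

def specialize_embeddings(embeddings, corpus_token_ids, special_ids):
    """embeddings: (old_vocab_size, dim); corpus_token_ids: ids seen in
    the target-family corpus; special_ids: <s>, </s>, <pad>, etc."""
    kept = sorted(set(corpus_token_ids) | set(special_ids))
    old_to_new = {old: new for new, old in enumerate(kept)}
    new_embeddings = embeddings[torch.tensor(kept)]  # (len(kept), dim)
    return new_embeddings, old_to_new
```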
@cmdowney
C.M. Downey
3 years
We adapt segmental language modeling to a bidirectional span-masking transformer architecture and test our new model (MSLM) on Chinese and English segmentation (2/4)
Tweet media one
1
0
2
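For context on what segmental language modeling means here: the model assigns probabilities to spans and marginalizes over every possible segmentation of the input with a forward dynamic program. A generic sketch of that recursion follows; the span scorer is a stand-in (in the MSLM it is the span-masking transformer itself):

```python
# Forward algorithm over segmentations: alpha[t] sums, over possible
# last-segment lengths k, the probability of segmenting the first t-k
# characters times the probability of the final segment.
import math

def logsumexp(a, b):
    if a == -math.inf:
        return b
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

def log_marginal(text, seg_logprob, max_seg_len=5):
    """seg_logprob(segment, prefix) -> log p(segment | prefix); here a
    stand-in for the model's span scores."""
    n = len(text)
    alpha = [-math.inf] * (n + 1)
    alpha[0] = 0.0  # empty prefix has log-probability 0
    for t in range(1, n + 1):
        for k in range(1, min(max_seg_len, t) + 1):
            score = alpha[t - k] + seg_logprob(text[t - k:t], text[:t - k])
            alpha[t] = logsumexp(alpha[t], score)
    return alpha[n]
```

Replacing the logsumexp with a max turns the same recursion into Viterbi decoding, which is how a single best segmentation is read out of the model.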
@cmdowney
C.M. Downey
2 years
These ablations show that multilingual pretraining has a decided advantage in the zero-shot case (over monolingual pre-training), but they leave plenty of room for future work and discussion on exactly when and why it is beneficial!
1
0
2
@cmdowney
C.M. Downey
2 years
We ask whether...
1. performance on an unsupervised morpheme-segmentation objective can be transferred to a new, very low-resource language
2. multilingual pre-training improves transfer
3. transfer can be achieved with moderate-sized models trained on similar low-resource langs
Tweet media one
Tweet media two
1
0
2
@cmdowney
C.M. Downey
2 years
The answer to all 3 may be "yes"! We train a segmental LM on 10 Indigenous langs from the @AmericasNLP '21 set, and transfer to a new lang: K'iche'. On a range of target data sizes, our multilingual model beats the from-scratch model in 6/10 settings, with a zero-shot F1 of 20.6!
Tweet media one
1
0
2
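To make the F1 figure above concrete: unsupervised segmentation is commonly scored as F1 over predicted boundary positions. A sketch of that standard metric (the paper's exact evaluation setup may differ):

```python
# Boundary F1 for segmentation: compare the character positions where
# predicted segments end against the gold boundaries.

def boundary_f1(gold_segs, pred_segs):
    """gold_segs, pred_segs: lists of segment strings for one word."""
    def boundaries(segs):
        out, pos = set(), 0
        for seg in segs[:-1]:  # final position is not a boundary
            pos += len(seg)
            out.add(pos)
        return out
    g, p = boundaries(gold_segs), boundaries(pred_segs)
    if not g or not p:
        return 0.0
    precision = len(g & p) / len(p)
    recall = len(g & p) / len(g)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```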
@cmdowney
C.M. Downey
5 years
@amandalynneP The ghost we were referring to when we hatched this idea does indeed have it
0
0
1
@cmdowney
C.M. Downey
1 year
We characterize XLM-R's embedding space to find the most important features by which embeddings cluster, and then propose simple techniques to preserve these features when re-initializing embeddings for a new vocabulary
Tweet media one
1
0
1
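One simple technique in the spirit of what the tweet describes (not necessarily one of the paper's proposed methods) is to place each new vocabulary item at the average of the old embeddings of the subwords it decomposes into, so that new vectors land inside the existing cluster structure rather than at random. A sketch assuming a HuggingFace-style tokenizer:

```python
# Re-initialize a new vocabulary's embeddings from an old embedding
# matrix by averaging the old-tokenizer decomposition of each new token.
import torch

def reinit_from_old(new_vocab, old_tokenizer, old_embeddings):
    """new_vocab: list of token strings; old_embeddings: (V_old, dim)."""
    dim = old_embeddings.shape[1]
    new_emb = torch.empty(len(new_vocab), dim)
    for i, token in enumerate(new_vocab):
        old_ids = old_tokenizer.encode(token, add_special_tokens=False)
        if old_ids:
            new_emb[i] = old_embeddings[old_ids].mean(dim=0)
        else:
            # Fallback for tokens the old tokenizer cannot decompose
            new_emb[i] = old_embeddings.mean(dim=0)
    return new_emb
```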
@cmdowney
C.M. Downey
4 years
@rctatman @GretchenAMcC I think you might not need a language model or anything fancy. It looks like nunacom is a font that would just render ascii characters as the syllabics that correspond to that place on the keyboard. You might just be able to do a character-wise replacement
Tweet media one
2
0
1
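A sketch of the character-wise replacement being suggested, seeded with the four correspondences implied by the "ktsC" → ᓄᑎᐅᕋ example later in this thread (assuming the mapping is one character to one syllabic, in order); the full table would have to be read off the nunacom keyboard layout:

```python
# Hypothetical partial mapping from nunacom-font ASCII to Inuktitut
# syllabics. Only the four pairs inferred from the "ktsC" example are
# filled in; the rest of the layout is unknown here.
NUNACOM_TO_SYLLABICS = {
    "k": "ᓄ",  # nu
    "t": "ᑎ",  # ti
    "s": "ᐅ",  # u
    "C": "ᕋ",  # ra
}

def decode_nunacom(text):
    return "".join(NUNACOM_TO_SYLLABICS.get(ch, ch) for ch in text)

print(decode_nunacom("ktsC"))  # -> ᓄᑎᐅᕋ ("nutiura")
```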
@cmdowney
C.M. Downey
5 years
(4/5) For the remainder of the Summer I will be working as a Research Assistant to Dr. Gina-Anne Levow on Language Modeling for low-resource languages
1
0
1
@cmdowney
C.M. Downey
5 years
@emilymbender My best guess: a. ᐊᓂ-ᙱᑦ-ᑎᑦ-ᑕᕋ b. ᐊᓂ-ᑎᑦ-ᙱᑦ-ᑕᕋ Though that probably needs to be confirmed by someone who knows more than me :)
0
0
1
@cmdowney
C.M. Downey
4 years
@rctatman @GretchenAMcC Like the first word in his picture "ktsC" seems to be ᓄᑎᐅᕋ/nutiura which might be a variation on Norterra
1
0
1
@cmdowney
C.M. Downey
5 years
Someone order a set of Hungarian-to-Northern-Saami flashcards?...No?...Here they are anyway
0
0
1
@cmdowney
C.M. Downey
4 years
@offtheclocksara Ohh, I had heard both, but I didn't know that was the history. That means [B O] is definitely a constituent
0
0
1
@cmdowney
C.M. Downey
4 months
Our results suggest new best practices for bootstrapping NLP systems in low-resource language groups. All of our software, results, and analysis can be found at . If you find our work interesting, feel free to reach out and let us know what you think!
0
0
1
@cmdowney
C.M. Downey
5 years
(2/5) This quarter I started a project with my advisor Dr. Fei Xia on unsupervised morphological segmentation, with the hope of applying the model to low-resource and underdocumented languages, and especially those with rich morphology
1
0
1
@cmdowney
C.M. Downey
12 days
3. For this cycle, I'm unlikely to bring on a student interested in machine learning but not NLP / CompLing
1
0
1
@cmdowney
C.M. Downey
1 year
A systematic comparison of re-initialization techniques shows: (1) our method performs well without increased compute or external language alignment (2) reducing vocabulary size during adaptation promotes efficient training with only a minor decrease in downstream performance
1
0
1
@cmdowney
C.M. Downey
5 years
(5/5) Finally, for the coming academic year, I will be funded by the UW Foreign Language and Area Studies (FLAS) fellowship for research on Inuktitut (Eskimo-Aleut), and will continue my research on morphosyntactic parsing for Inuktitut and other morphologically-rich languages
0
0
1
@cmdowney
C.M. Downey
5 years
(3/5) I will be spending July in Chicago as UW's rep to the Newberry Research Library's summer institute for American Indian Studies. This year's theme is Language Revitalization, and I hope to join in a conversation about the role of Tech and CompLing in these efforts
1
0
1
@cmdowney
C.M. Downey
1 year
We hope these results will help to leverage and adapt pre-trained multilingual LMs for under-resourced languages, as well as to make LMs more accessible by replacing huge crosslingual vocabularies and embedding blocks with compact, language-specialized ones
0
0
1