📣 I'm recruiting PhD students this cycle! Researchers interested in expanding NLP for mid- and low-resource languages - and/or developing tools for endangered languages and field linguistics - should apply to work with me either through the UR Ling or CS PhD programs!
Thrilled to say that I'll be joining the University of Rochester departments of Linguistics and Data Science (@UofRDataSci) this Fall as an assistant professor! My lab will be focused on all things low-resource NLP, especially where useful for endangered and minoritized languages
A bit belated, but I finished my PhD! Can't express enough thanks to my amazing advisors @ssshanest and Gina for their investment in my time at UW. Excited for my next adventure of joining the faculty at the University of Rochester!
What's the best way to specialize multilingual LMs for new languages? We address this in our new paper!
Embedding structure matters: Comparing methods to adapt multilingual vocabularies to new languages ()
With @terrablvns, Nora Goldfine, and @ssshanest
Excited to announce my new preprint with Fei Xia, Gina-Anne Levow, and @ssshanest: A Masked Segmental Language Model for Natural Language Segmentation! Submitted to arXiv, scheduled to be available by Sunday night (1/4)
A paper I co-authored with UW CLMS students @shivin_thukral7 and Levon Haroutunian, plus alumna Shannon Drizin, will appear in the ACL 2022 main conference!
"Multilingual unsupervised sequence segmentation transfers to extremely low-resource languages"
Preprint! We test methods to adapt a crosslingual model to a language family, and argue for targeted multilinguality as a middle ground for low-resource langs, avoiding the "curse of multilinguality"
w/ @TerraBlvns, @quirkyDhwani, @dwija_parikh, @ssshanest
Haven't been tweeting for a while; my main update is that I'm interning with Apple AI/ML this summer, working on crosslingual language modeling with the Siri Web Answers team! Living the virtual intern life from my armchair
My summer research with MSR's fantastic Cryptography and Privacy team is out! TLDR: we study privacy risks in LM services, test several mitigations, and find that Differential Privacy is by far the most effective defense against unintended memorization
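For anyone curious what that defense looks like mechanically, here's a minimal sketch of the DP-SGD recipe (per-example gradient clipping plus calibrated Gaussian noise). The hyperparameter values are illustrative, not the ones from our experiments:

```python
import numpy as np

def dp_sgd_step(per_example_grads, clip_norm=1.0, noise_multiplier=1.1):
    """One DP-SGD update: clip each example's gradient, sum, add noise, average.

    per_example_grads: array of shape (batch_size, num_params).
    clip_norm / noise_multiplier are illustrative values, not the
    hyperparameters from the paper.
    """
    # Clip each per-example gradient to L2 norm <= clip_norm
    norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
    clipped = per_example_grads * np.minimum(1.0, clip_norm / (norms + 1e-12))

    # Add Gaussian noise scaled to the clipping norm, then average
    noise = np.random.normal(0.0, noise_multiplier * clip_norm,
                             size=per_example_grads.shape[1])
    return (clipped.sum(axis=0) + noise) / per_example_grads.shape[0]
```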
Because I can't individually respond to every email:
1. Programming skills (ideally Python) are important for my students in either program
2. For Ling especially, I will strongly weigh interest in endangered languages / fieldwork, to complement existing strengths of the dept
...
Here's a good time to shout out to @ezesanlasai and @rhenderson for graciously allowing me to use their K'iche' data, as well as to @pywirrarika for access to additional Wixarika and Nahuatl dev data!
Looking forward to presenting this at @mrl2023_emnlp! We compare methods for specializing the vocabulary of pre-trained cross-lingual models to specific languages, including under-resourced ones; see the previous thread for more details!
Today I had the privilege of viewing original prints of the all-Dakota newspaper Anpao ("Sunrise"), dated 1880-1895: invaluable records of vibrant Dakota as used in everyday life, and hopefully a resource that can be leveraged for current generations of Dakota speakers and learners!
Our model gives particularly strong performance for Chinese segmentation, and competitive results for English, with several promising avenues for future work and improvement (3/4)
We note that transferring our pre-trained model is especially beneficial at extremely low target-data sizes, while the from-scratch model's performance is more inconsistent, even with larger amounts of available target data
Finally, we perform an ablation by pre-training two additional models, comparing our multilingual model to one trained on Quechua only (the highest-resource lang of the ANLP set), and another multilingual model with its total data downsampled to match the available Quechua data
Since segmental LMs segment as a by-product of a language-modeling proxy task, it was unclear whether multilingual training would work, but we observe that segmentation quality improves simultaneously across multiple pre-training languages
This month I am conducting my research at the #NewberryLibrary in Chicago, investigating the ways in which digital media and computational linguistics can be used in the task of language revitalization.
Our most surprising result may be that choosing a low sampling alpha (up-sampling low-resource langs and down-sampling high-resource ones) has a significant beneficial effect for low-resource langs, but does *not* significantly harm performance in high-resource ones
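To make the alpha knob concrete, here's a toy computation (with made-up corpus sizes, my own illustration) of how the exponent reshapes the language-sampling distribution:

```python
import numpy as np

# Hypothetical corpus sizes for a high-, mid-, and low-resource language
sizes = np.array([1_000_000, 50_000, 5_000], dtype=float)

def sampling_probs(sizes, alpha):
    """Exponentiated sampling: p_i proportional to (size_i / total)^alpha."""
    q = sizes / sizes.sum()
    p = q ** alpha
    return p / p.sum()

print(sampling_probs(sizes, alpha=1.0))  # proportional: high-resource dominates
print(sampling_probs(sizes, alpha=0.2))  # low alpha: low-resource langs up-sampled
```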
Using the Uralic family as a test case, we adapt XLM-R with targeted language modeling and vocab specialization. Our best models show sizable improvements over multilingual baselines on tasks like dependency parsing (UAS), while simultaneously cutting up to 65% of the original parameters
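Most of that parameter cut comes from the embedding block. As a rough sketch of the general idea (not our exact procedure), specializing the vocabulary amounts to keeping only the embedding rows the new vocab needs:

```python
import torch

def trim_embeddings(old_embedding, keep_ids):
    """Keep only the rows for token ids retained in the specialized vocab.

    old_embedding: torch.nn.Embedding of shape (old_vocab_size, hidden_dim).
    keep_ids: old-vocab ids kept in the new vocab; new id i maps to keep_ids[i].
    """
    new_weight = old_embedding.weight.detach()[torch.tensor(keep_ids)].clone()
    return torch.nn.Embedding.from_pretrained(new_weight, freeze=False)
```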
We adapt segmental language modeling to a bidirectional span-masking transformer architecture and test our new model (MSLM) on Chinese and English segmentation (2/4)
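As a very rough illustration of the masking side (my own sketch, not the MSLM code itself), span masking hides contiguous stretches of the input rather than single positions:

```python
import random

def mask_spans(tokens, mask_token="<mask>", max_span=4, mask_rate=0.15):
    """Randomly replace contiguous spans of up to max_span tokens with a mask."""
    out, targets, i = list(tokens), [], 0
    while i < len(out):
        if random.random() < mask_rate:
            span = random.randint(1, max_span)
            targets.append((i, tokens[i:i + span]))  # what the model must recover
            for j in range(i, min(i + span, len(out))):
                out[j] = mask_token
            i += span
        else:
            i += 1
    return out, targets
```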
These ablations show that multilingual pre-training has a decided advantage in the zero-shot case (over monolingual pre-training), but they leave plenty of room for future work and discussion on exactly when and why it is beneficial!
We ask whether...
1. performance on an unsupervised morpheme-segmentation objective can be transferred to a new, very low-resource language
2. multilingual pre-training improves transfer
3. transfer can be achieved with moderate-sized models trained on similar low-resource langs
The answer to all 3 may be "yes"!
We train a segmental LM on 10 Indigenous langs from the @AmericasNLP '21 set, and transfer to a new lang: K'iche'. On a range of target data sizes, our multilingual model beats the from-scratch model in 6/10 settings, with a zero-shot F1 of 20.6!
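For context, segmentation quality here is boundary F1: precision and recall over predicted segment-boundary positions. A minimal version of the metric (my sketch, not our exact evaluation code):

```python
def boundary_f1(gold, pred):
    """F1 over internal boundary positions of two segmentations.

    gold, pred: segmentations as lists of segments, e.g. ["ab", "c"].
    """
    def boundaries(segs):
        cuts, pos = set(), 0
        for seg in segs[:-1]:  # the end of the final segment is not a boundary choice
            pos += len(seg)
            cuts.add(pos)
        return cuts

    g, p = boundaries(gold), boundaries(pred)
    tp = len(g & p)
    if tp == 0:
        return 0.0
    prec, rec = tp / len(p), tp / len(g)
    return 2 * prec * rec / (prec + rec)
```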
We characterize XLM-R's embedding space to find the most important features by which embeddings cluster, and then propose simple techniques to preserve these features when re-initializing embeddings for a new vocabulary
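One hypothetical illustration of such a technique (a sketch of the general idea, not necessarily the exact method from the paper): initialize each new token's embedding from the old subwords it decomposes into, so the new vector starts out in the right neighborhood. This assumes a Hugging Face-style tokenizer and the original embedding matrix `old_emb`:

```python
import torch

def init_new_embedding(new_token, old_tokenizer, old_emb, hidden_dim):
    """Average the old-model embeddings of the pieces that compose new_token.

    Hypothetical helper; falls back to small random init if the old
    tokenizer produces no pieces for the token.
    """
    piece_ids = old_tokenizer.encode(new_token, add_special_tokens=False)
    if piece_ids:
        return old_emb[torch.tensor(piece_ids)].mean(dim=0)
    return torch.randn(hidden_dim) * 0.02
```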
@rctatman @GretchenAMcC I think you might not need a language model or anything fancy. It looks like Nunacom is a font that would just render ASCII characters as the syllabics that correspond to that place on the keyboard. You might just be able to do a character-wise replacement
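Something like this sketch, say (the two mapping entries are placeholders, since I don't have the actual Nunacom code chart in front of me; the real table would cover the whole legacy layout):

```python
# Map legacy font codepoints to Unicode Canadian Aboriginal Syllabics.
# PLACEHOLDER entries only, not the real Nunacom keyboard layout.
NUNACOM_TO_UNICODE = str.maketrans({
    "w": "\u140A",  # placeholder: U+140A CANADIAN SYLLABICS A
    "s": "\u1403",  # placeholder: U+1403 CANADIAN SYLLABICS I
})

def convert(legacy_text: str) -> str:
    """Character-wise replacement of legacy font codes with syllabics."""
    return legacy_text.translate(NUNACOM_TO_UNICODE)
```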
(4/5) For the remainder of the summer I will be working as a Research Assistant to Dr. Gina-Anne Levow on language modeling for low-resource languages
Our results suggest new best practices for bootstrapping NLP systems in low-resource language groups. All of our software, results, and analysis can be found at . If you find our work interesting, feel free to reach out and let us know what you think!
(2/5) This quarter I started a project with my advisor Dr. Fei Xia on unsupervised morphological segmentation, with the hope of applying the model to low-resource and underdocumented languages, and especially those with rich morphology
A systematic comparison of re-initialization techniques shows:
(1) our method performs well without increased compute or external language alignment
(2) reducing vocabulary size during adaptation promotes efficient training with only a minor decrease in downstream performance
(5/5) Finally, for the coming academic year, I will be funded by the UW Foreign Language and Area Studies (FLAS) fellowship to study Inuktitut (Eskimo-Aleut), and will continue my research on morphosyntactic parsing for Inuktitut and other morphologically rich languages
(3/5) I will be spending July in Chicago as UW's rep to the Newberry Research Library's summer institute for American Indian Studies. This year's theme is Language Revitalization, and I hope to join in a conversation about the role of Tech and CompLing in these efforts
We hope these results will help to leverage and adapt pre-trained multilingual LMs for under-resourced languages, as well as to make LMs more accessible by replacing huge crosslingual vocabularies and embedding blocks with compact, language-specialized ones