Matthew Finlayson

@mattf1n

930 Followers
897 Following
23 Media
128 Statuses

First-year PhD student at @nlp_usc | Former predoc at @allen_ai on @ai2_aristo | Harvard 2021, CS & Linguistics

Los Angeles, CA
Joined October 2013
Pinned Tweet
@mattf1n
Matthew Finlayson
7 months
Wanna know gpt-3.5-turbo's embed size? We find a way to extract info from LLM APIs and estimate gpt-3.5-turbo’s embed size to be 4096. With the same trick we also develop 25x faster logprob extraction, audits for LLM APIs, and more! 📄 Here’s how 1/🧵
6
79
366
@mattf1n
Matthew Finlayson
2 years
Can a language model help you with your math homework? Not on its own, but maybe with the help of a Python interpreter! In our EMNLP paper we present 📜 Līla and 🤖 Bhāskara, a math reasoning benchmark and model. 📄: 🔗: 1/🧵
5
38
214
@mattf1n
Matthew Finlayson
1 year
Nucleus and top-k sampling are ubiquitous, but why do they work? @johnhewtt , @alkoller , @swabhz , @Ashish_S_AI and I explain the theory and give a new method to address model errors at their source (the softmax bottleneck)! 📄 🧑‍💻
3
26
166
@mattf1n
Matthew Finlayson
2 years
🎉 New pre-print 🎉 Teaching models to follow instructions is becoming popular, but what makes Instruction Learning hard in the first place? We investigate with synthetic data and build a challenge dataset! My first paper with @ai2_aristo at @allen_ai
6
14
74
@mattf1n
Matthew Finlayson
2 years
Applying to a UW PhD today? Did you write your SoP in LaTeX? Pro tip:
3
0
50
@mattf1n
Matthew Finlayson
6 months
My high-quality data annotations aren’t free, so I will now select one wrong answer per captcha. It still works btw.
2
2
50
@mattf1n
Matthew Finlayson
7 months
In a remarkable case of simultaneous discovery, a paper released earlier this week () also finds that LLM APIs leak information. We are excited for our colleagues and believe that our papers complement and strengthen one another. Amazing work! 8/8
@arankomatsuzaki
Aran Komatsuzaki
7 months
Google presents: Stealing Part of a Production Language Model
- Extracts the projection matrix of OpenAI’s ada and babbage LMs for <$20
- Confirms that their hidden dims are 1024 and 2048, respectively
- Also recovers the exact hidden dim size of gpt-3.5-turbo
16
148
957
1
1
33
@mattf1n
Matthew Finlayson
2 years
📣 Today at EMNLP I will be giving my first-ever in-person oral presentation!!! Come hear about how we used formal languages to learn what kinds of instructions LMs can follow 🤖 Hall A-B at 2pm
1
2
23
@mattf1n
Matthew Finlayson
2 years
7/🧵 Curious about the name choices? Our benchmark is named after Līlavati, a treatise by the 12th-century Indian mathematician Bhāskara. Read more in our blog post.
1
6
17
@mattf1n
Matthew Finlayson
7 months
LLM outputs lie in a low-dimensional vector space (we call this the LLM’s image). By “low-dimensional”, we mean EXACTLY the embed size. To find the embed size we check the dimension of the space that the outputs span. Ok, but does this have any practical uses? 2/🧵
1
1
12
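The dimension check can be sketched with a toy softmax LM standing in for the API (the sizes, the random unembedding W, and the centering step here are illustrative assumptions, not the paper's exact procedure):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d = 200, 16  # toy vocabulary size and hidden (embed) size

# Toy LM head: logits are a linear function of a d-dimensional hidden state.
W = rng.normal(size=(vocab, d))

def log_softmax(z):
    return z - z.max() - np.log(np.exp(z - z.max()).sum())

# Collect full next-token log-prob vectors for many random "prompts".
outputs = np.stack([log_softmax(W @ rng.normal(size=d)) for _ in range(64)])

# Centering removes the log-partition constant, leaving vectors that
# lie in a d-dimensional linear subspace (the model's image).
centered = outputs - outputs.mean(axis=1, keepdims=True)

# The rank of the stacked outputs reveals the embed size.
print(np.linalg.matrix_rank(centered))  # prints 16
```

Collecting a few more outputs than the suspected embed size (64 here) makes the rank estimate robust.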
@mattf1n
Matthew Finlayson
2 years
Come see our #EMNLP2022 poster at 4pm in the Atrium! @Swarooprm7 and I will be there to chat about math reasoning models and evaluations 🎉
0
5
12
@mattf1n
Matthew Finlayson
2 years
5/🧵 We find that models perform much better when they output a Python program that prints the answer, instead of directly generating the answer. This answer format has the added benefit that it doubles as a step-by-step solution.
1
0
12
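The answer format can be illustrated with a toy instance (the question and numbers are invented for illustration, not drawn from Līla):

```python
# Hypothetical Līla-style instance: the target is a Python program whose
# printed output is the answer, so the program doubles as a step-by-step
# solution.
question = "A tank holds 240 L and drains at 1.5 L/min. How many hours until it is empty?"

minutes_to_empty = 240 / 1.5        # 160 minutes of draining
hours_to_empty = minutes_to_empty / 60

print(hours_to_empty)  # about 2.67 hours
```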
@mattf1n
Matthew Finlayson
9 months
It was really fun to contribute to this bit of software, and it has some pretty useful applications! If you want logprobs beyond the top-5 that OpenAI (or any other API) gives you, check out our library 🎉
@jxmnop
jack morris
9 months
fun research story about how we jailbroke the chatGPT API: every time you run inference with a language model like GPT-whatever, the model outputs full probabilities over its entire vocabulary (~50,000 tokens), but when you use their API, OpenAI hides all this info from…
16
65
609
1
0
11
@mattf1n
Matthew Finlayson
7 months
Thanks to my advisors @swabhz and @xiangrenNLP for their guidance, as well as @jxmnop for his valuable feedback. 7/🧵
1
0
11
@mattf1n
Matthew Finlayson
10 months
It’s that time of year again. Good luck on your applications!
0
0
10
@mattf1n
Matthew Finlayson
2 years
3/🧵 Our benchmark, Līla, is available on GitHub. We welcome contributions from the community to continue expanding our data. Open an issue and we'll work with you to add your data set!
1
0
9
@mattf1n
Matthew Finlayson
2 years
8/🧵 A big thanks to all my collaborators and advisors from @ai2_aristo at @allen_ai and elsewhere, starting with my co-first-author @Swarooprm7 , @lupantech , @leonardtang_ , @wellecks , @cbaral , @tanmayRPurohit , Oyvind Tafjord, Ashish Sabharwal, Peter Clark, and @ashwinkalyan7 .
0
0
9
@mattf1n
Matthew Finlayson
7 months
Though knowing the embed size of an LLM may not seem very useful on its own, it turns out that there are many practical applications for LLM images, ranging from adversarial attacks, to methods for auditing LLM API providers. Let’s take a look at two of them: 3/🧵
1
0
9
@mattf1n
Matthew Finlayson
2 years
2/🧵 Existing math reasoning datasets are too narrowly scoped to evaluate general math reasoning ability. To fix this, we annotate over 140K diverse natural language math questions with Python programs. Topics range from arithmetic to algebra to multivariate calculus.
1
0
8
@mattf1n
Matthew Finlayson
1 year
@jxmnop Beam search to optimize seq prob considered harmful. Often you don’t want the most likely seq, you want a typical seq. Must-reads on this topic: “If beam search is the answer, what was the question?” and “Locally Typical Sampling”
1
0
7
@mattf1n
Matthew Finlayson
7 months
First, you can get cheap, full LM outputs by accelerating extraction algorithms from O(vocab size) to O(embed size) API calls—a 25x speedup for gpt-3.5-turbo! This means model stealing just got a lot cheaper. Sounds bad for LLM companies, but there is good news too: 4/🧵
1
0
7
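The linear-algebra idea behind the speedup can be sketched on a toy model (the calibration setup and sizes below are assumptions for illustration, not the paper's algorithm): once a basis for the image is known, roughly embed-size many entries of a new output determine all the rest.

```python
import numpy as np

rng = np.random.default_rng(1)
vocab, d = 200, 16

W = rng.normal(size=(vocab, d))

def centered_logprobs(h):
    z = W @ h
    logp = z - z.max() - np.log(np.exp(z - z.max()).sum())
    return logp - logp.mean()

# Calibration: learn a basis for the image from a few full outputs.
calib = np.stack([centered_logprobs(rng.normal(size=d)) for _ in range(32)]).T
U = np.linalg.svd(calib, full_matrices=False)[0]
Q = U[:, :d]  # orthonormal basis for the d-dimensional image

# New query: suppose the API reveals only d + 4 entries of the output.
y = centered_logprobs(rng.normal(size=d))
idx = rng.choice(vocab, size=d + 4, replace=False)

# Those few entries pin down the output's coordinates in the image,
# from which every other entry can be reconstructed.
coeffs, *_ = np.linalg.lstsq(Q[idx], y[idx], rcond=None)
y_hat = Q @ coeffs

print(np.max(np.abs(y_hat - y)))  # near machine precision
```

The cost is thus O(embed size) observed entries per output instead of O(vocab size), which is where the 25x figure for gpt-3.5-turbo comes from.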
@mattf1n
Matthew Finlayson
2 years
4/🧵 Līla comes with useful built-in data splits for evaluating in-distribution performance, generalization to out-of-distribution data, and robustness to linguistic perturbation.
1
0
7
@mattf1n
Matthew Finlayson
2 years
6/🧵 We fine-tune a model (GPT-Neo-2.7B) on Līla which outperforms its like-sized peers (T5-3B, GPT-Neo) when fine-tuned on new math reasoning tasks. If you want to train a math reasoning model, consider using our model, Bhāskara, as a starting point:
1
0
7
@mattf1n
Matthew Finlayson
7 months
Next, LLM images can be viewed as model "signatures". An LLM's outputs all carry its unique signature. This means outsiders can audit LLMs for unannounced changes and catch APIs serving LLMs they aren’t supposed to: a win for accountability! So what should LLM providers do? 5/🧵
1
0
7
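A toy version of the audit (the two-random-models setup is an illustrative assumption, not the paper's protocol): check whether an output lies in a model's image by projecting onto a recorded basis.

```python
import numpy as np

rng = np.random.default_rng(2)
vocab, d = 100, 8

def make_model():
    W = rng.normal(size=(vocab, d))
    def output(h):
        z = W @ h
        logp = z - z.max() - np.log(np.exp(z - z.max()).sum())
        return logp - logp.mean()
    return output

model_a, model_b = make_model(), make_model()

# Record model A's "signature": an orthonormal basis for its image.
sig = np.stack([model_a(rng.normal(size=d)) for _ in range(24)]).T
Q = np.linalg.svd(sig, full_matrices=False)[0][:, :d]

def residual(y):
    """Distance from y to A's image; near zero iff y carries A's signature."""
    return np.linalg.norm(y - Q @ (Q.T @ y))

same = residual(model_a(rng.normal(size=d)))   # output really from A
other = residual(model_b(rng.normal(size=d)))  # a different model's output
print(same, other)  # same is ~0, other is clearly nonzero
```

An auditor who has recorded A's signature can thus flag outputs that could not have come from A.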
@mattf1n
Matthew Finlayson
8 months
Another great paper from @andreasgrv and friends! Love seeing more work on output bottlenecks 💯 Q: if the hyperplanes were not constrained to intersect at the origin, how much “less bad” would the situation be? Seems related to Moser’s circle problem
@andreasgrv
Andreas Grivas
8 months
How expressive is your deep multi-label classifier? Can it represent all outputs of interest? 🤔 SPOILER🚨: Your model can have *test set* outputs that are impossible to predict! 🚫 Check out our paper! 🧵(1/7)
2
22
76
3
0
7
@mattf1n
Matthew Finlayson
7 months
Considering the risks posed by our methods and weighing them against the benefits from increased trust and accountability, we take the position that these “vulnerabilities” should be viewed as a feature rather than a bug for LLM APIs. 6/🧵
1
0
6
@mattf1n
Matthew Finlayson
2 years
@ai2_aristo @allen_ai Check out our paper for more in-depth analyses! Our data and code are available at A big thank you to my advisors and mentors!
0
0
4
@mattf1n
Matthew Finlayson
6 months
@shaily99 I always appreciate it when authors use backrefs: \usepackage[backref]{hyperref}
0
1
4
@mattf1n
Matthew Finlayson
2 years
@ai2_aristo @allen_ai Thus, we introduce Hard RegSet as a challenging instruction learning task, and a controlled environment for studying instruction learning.
1
0
3
@mattf1n
Matthew Finlayson
11 months
I got to work with Yuntian during my undergrad. He’s a great mentor with lots of enthusiasm and big ideas!
@yuntiandeng
Yuntian Deng
11 months
I am hiring NLP/ML PhD students at UWaterloo, home to 5 NLP professors! Apply by Dec 1. Strong consideration will be given to those who can tackle the below challenge: Can we use LM's hidden states to reason about multiple problems simultaneously? ​​Retweets/shares appreciated🥰
12
133
479
1
0
3
@mattf1n
Matthew Finlayson
1 year
Nucleus and top-k are threshold-based: they prevent the model from sampling low-probability tokens below a threshold. Why should this help? We prove that thresholds can guarantee that no tokens with zero true probability are ever sampled ✅
1
0
3
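The threshold mechanism can be sketched in a few lines of numpy (a generic implementation of the two standard truncation rules, not the paper's code):

```python
import numpy as np

def top_k(p, k):
    """Zero out all but the k most probable tokens, then renormalize."""
    q = np.zeros_like(p)
    keep = np.argsort(p)[-k:]
    q[keep] = p[keep]
    return q / q.sum()

def nucleus(p, top_p):
    """Keep the smallest high-probability prefix covering at least top_p mass."""
    order = np.argsort(p)[::-1]
    csum = np.cumsum(p[order])
    cutoff = np.searchsorted(csum, top_p) + 1  # first prefix with enough mass
    q = np.zeros_like(p)
    q[order[:cutoff]] = p[order[:cutoff]]
    return q / q.sum()

p = np.array([0.5, 0.25, 0.15, 0.07, 0.03])
print(top_k(p, 2))       # only the top-2 tokens keep mass
print(nucleus(p, 0.85))  # keeps {0.5, 0.25, 0.15}, zeroing the low-probability tail
```

Both rules sample only from tokens above an (explicit or implicit) probability threshold, which is exactly the property the proof leans on.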
@mattf1n
Matthew Finlayson
9 months
@oli_0331 In my head I have a separate table for mag- and -um- verbs, each with tense on one axis (past, present, future) and focus (actor, object, location) on the other. I think you could add a third axis, though, for affixes like pang-, pag-, and ma- that change the meaning in other ways
1
0
2
@mattf1n
Matthew Finlayson
2 years
@ai2_aristo @allen_ai Our findings suggest four general implications that we expect will extend beyond the synthetic RegEx environment. For instance, instruction learning is hard when the instructions are not very precise, that is, they can be executed correctly in several different ways.
1
0
2
@mattf1n
Matthew Finlayson
1 year
In conclusion, by delving into the theory behind why threshold-based sampling works, and identifying the softmax bottleneck as a major source of model errors, we open the door to sampling methods beyond thresholding heuristics to better surface LMs' generative capabilities! 🔑
0
0
2
@mattf1n
Matthew Finlayson
2 years
@ai2_aristo @allen_ai We use the task of deciding whether a given string matches a regular expression (viewed as an instruction) to identify properties of tasks, instructions, and instances that make instruction learning challenging.
1
0
2
@mattf1n
Matthew Finlayson
2 years
@ai2_aristo @allen_ai Similarly, instruction learning is hard when correctly executing instructions requires keeping in memory a larger context of the partial execution thus far and making choices dependent on this longer history.
1
0
2
@mattf1n
Matthew Finlayson
2 years
@ai2_aristo @allen_ai Instruction learning, where a model learns to perform new tasks from task descriptions rather than from examples, has become popular with the advent of models like T0, but the capabilities of these models as instruction learners are poorly understood.
1
0
2
@mattf1n
Matthew Finlayson
2 years
@ai2_aristo @allen_ai We use our findings to make a hard instruction learning dataset, which we call Hard RegSet. Fine-tuning on Hard RegSet, our transformer learns to correctly interpret only 65.6% of test instructions with at least 90% accuracy.
1
0
2
@mattf1n
Matthew Finlayson
1 year
@soldni I should have stayed another month
1
0
2
@mattf1n
Matthew Finlayson
1 year
@oli_phd I think the issue is that “you” is the receiver, not the object. In “sinabi ko (siya) sa iyo” the focus is on the object and “sa iyo” indicates the receiver. “Sinabihan” focuses the receiver, which gets marked with focus (“sa iyo” -> “ka”), and the transformation to “kita” occurs.
1
0
1
@mattf1n
Matthew Finlayson
6 months
@_xjdr @soldni (Large Model, Foundation AI, Open-source)
0
0
1
@mattf1n
Matthew Finlayson
1 year
For the curious and nonbelievers
0
0
1
@mattf1n
Matthew Finlayson
2 years
@m_pulkit Yes to 1, no to 2
0
0
1
@mattf1n
Matthew Finlayson
9 months
@oli_0331 Interesting! I think there would be 9 forms for each affix (mag, pag, pang, ma, etc., one for each tense/focus pair), but I’d be interested to know whether all of these forms are used/make sense for different verbs
1
0
1
@mattf1n
Matthew Finlayson
6 months
@soldni Ugh I’m seeing all these great write-ins after I already filled it out 🤦‍♂️
0
0
1
@mattf1n
Matthew Finlayson
1 year
Knowing the set of possible model outputs, and assuming that the model minimizes its training loss w.r.t. the true distribution, we can actually deduce a set of distributions containing the true distribution, and thereby find tokens that must have nonzero true probability 🕵️‍♀️
1
0
1
@mattf1n
Matthew Finlayson
1 year
In particular, we identify the SOFTMAX BOTTLENECK as a source of the errors. The softmax bottleneck entails that model outputs are restricted to a subset of probability distributions. If the true distribution is not in this set, the model cannot output it 😱
1
0
1
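A minimal way to see the restriction (a contrived toy head, not the paper's construction): if the unembedding ties two tokens' rows, no hidden state can separate their probabilities, so whole families of true distributions are unreachable.

```python
import numpy as np

rng = np.random.default_rng(3)
vocab, d = 5, 2  # tiny vocabulary, severely bottlenecked hidden size

W = rng.normal(size=(vocab, d))
W[1] = W[0]  # tokens 0 and 1 share an unembedding row

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Whatever the hidden state, tokens 0 and 1 get identical probability,
# so any true distribution with p[0] != p[1] lies outside the model's image.
for _ in range(5):
    p = softmax(W @ rng.normal(size=d))
    assert np.isclose(p[0], p[1])
print("tokens 0 and 1 always tie")
```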
@mattf1n
Matthew Finlayson
1 year
Unfortunately, thresholds are limited: they necessarily also discard good-but-low-probability tokens. This tradeoff seems inevitable, since good and bad tokens are indistinguishable by probability alone ⚖️ However, we can do better by directly accounting for the source of errors!
1
0
1
@mattf1n
Matthew Finlayson
2 years
@anmarasovic I miss this!
1
0
1
@mattf1n
Matthew Finlayson
8 months
@StefanResearch @andreasgrv Cool! I’ll give it a read
0
0
1
@mattf1n
Matthew Finlayson
7 months
@zraytam @DimitrisPapail Author here. It does, but only by making these methods more expensive. It doesn’t prevent the exploit.
1
0
1
@mattf1n
Matthew Finlayson
2 years
@sivil_taram @ai2_aristo @allen_ai Great question! We do find that learning a good general regex executor from examples alone is hard, but the main result here is using an imperfect neural regex executor to identify properties that could make executing general instructions hard.
1
0
1
@mattf1n
Matthew Finlayson
2 years
@anmarasovic 🇭🇷 🇭🇷 🇭🇷
0
0
1
@mattf1n
Matthew Finlayson
2 years
@Swarooprm7 @emnlpmeeting Hope you feel better soon!
0
0
1
@mattf1n
Matthew Finlayson
2 years
@ValvodaJosef Nice work! I’m noticing a theme here: naïve random sampling from a formal language can lead to uninteresting/degenerate datasets and bad generalization. Reminds me of where they do this for SAT problems instead of transducers.
2
0
1
@mattf1n
Matthew Finlayson
1 year
We translate this finding into a new class of threshold-free sampling methods. In pilot studies, our easy-to-implement method (BA-η) performs competitively against existing methods, and outperforms them in low-entropy (close to greedy) decoding 💪
1
0
1
@mattf1n
Matthew Finlayson
7 months
@citre_piotto @TheXeophon We don’t rule out MoE; the nondeterminism might also be due to the model being served from multiple servers with different hardware. It’s hard to tell 🤔
0
0
1
@mattf1n
Matthew Finlayson
1 year
Did you know that the softmax function is linear?
I knew that: 10
I did not know that: 9
I don’t believe you: 26
1
0
1
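One precise reading of the poll (my gloss, not stated in the tweet): after centering out the normalization constant, log-softmax acts linearly on logits; in fact, it is just centering.

```python
import numpy as np

def log_softmax(z):
    return z - z.max() - np.log(np.exp(z - z.max()).sum())

def center(v):
    return v - v.mean()

z = np.random.default_rng(4).normal(size=10)

# log-softmax subtracts the same constant (the log-partition) from every
# coordinate, so after centering it is exactly the identity on logits:
# the map logits -> centered log-probs is linear.
assert np.allclose(center(log_softmax(z)), center(z))
print("centered log-softmax == centered logits")
```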