Matthew Finlayson

@mattf1n

930 Followers
897 Following
23 Media
128 Statuses

First-year PhD student at @nlp_usc | Former predoc at @allen_ai on @ai2_aristo | Harvard 2021, CS & Linguistics

Los Angeles, CA
Joined October 2013
Pinned Tweet
@mattf1n
Matthew Finlayson
7 months
Wanna know gpt-3.5-turbo's embed size? We find a way to extract info from LLM APIs and estimate gpt-3.5-turbo’s embed size to be 4096. With the same trick we also develop 25x faster logprob extraction, audits for LLM APIs, and more! 📄 Here’s how 1/🧵
6
79
366
@mattf1n
Matthew Finlayson
2 years
Can a language model help you with your math homework? Not on its own, but maybe with the help of a Python interpreter! In our EMNLP paper we present 📜 Līla and 🤖 Bhāskara, a math reasoning benchmark and model. 📄: 🔗: 1/🧵
5
38
214
@mattf1n
Matthew Finlayson
1 year
Nucleus and top-k sampling are ubiquitous, but why do they work? @johnhewtt , @alkoller , @swabhz , @Ashish_S_AI and I explain the theory and give a new method to address model errors at their source (the softmax bottleneck)! 📄 🧑‍💻
3
26
166
@mattf1n
Matthew Finlayson
2 years
🎉 New pre-print 🎉 Teaching models to follow instructions is becoming popular, but what makes Instruction Learning hard in the first place? We investigate with synthetic data and build a challenge dataset! My first paper with @ai2_aristo at @allen_ai
6
14
74
@mattf1n
Matthew Finlayson
2 years
Applying to a UW PhD today? Did you write your SoP in LaTeX? Pro tip:
3
0
50
@mattf1n
Matthew Finlayson
6 months
My high-quality data annotations aren’t free, so I will now select one wrong answer per captcha. It still works btw.
2
2
50
@mattf1n
Matthew Finlayson
7 months
In a remarkable case of simultaneous discovery, a paper released earlier this week () also finds that LLM APIs leak information. We are excited for our colleagues and believe that our papers complement and strengthen one another. Amazing work! 8/8
@arankomatsuzaki
Aran Komatsuzaki
7 months
Google presents: Stealing Part of a Production Language Model
- Extracts the projection matrix of OpenAI’s ada and babbage LMs for <$20
- Confirms that their hidden dims are 1024 and 2048, respectively
- Also recovers the exact hidden dim size of gpt-3.5-turbo
16
148
957
1
1
33
@mattf1n
Matthew Finlayson
2 years
📣 Today at EMNLP I will be giving my first-ever in-person oral presentation!!! Come hear about how we used formal languages to learn what kinds of instructions LMs can follow 🤖 Hall A-B at 2pm
1
2
23
@mattf1n
Matthew Finlayson
2 years
7/🧵 Curious about the name choices? Our benchmark is named after Līlavati, a treatise by the 12th-century Indian mathematician Bhāskara. Read more in our blog post.
1
6
17
@mattf1n
Matthew Finlayson
7 months
LLM outputs lie in a low-dimensional vector space (we call this the LLM’s image). By “low-dimensional”, we mean EXACTLY the embed size. To find the embed size we check the dimension of the space that the outputs span. Ok, but does this have any practical uses? 2/🧵
1
1
12
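The dimension check can be sketched with a toy softmax LM standing in for the API (the sizes, the random unembedding W, and the centering step here are illustrative assumptions, not the paper's exact procedure):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d = 200, 16  # toy vocabulary size and hidden (embed) size

# Toy LM head: logits are a linear function of a d-dimensional hidden state.
W = rng.normal(size=(vocab, d))

def log_softmax(z):
    return z - z.max() - np.log(np.exp(z - z.max()).sum())

# Collect full next-token log-prob vectors for many random "prompts".
outputs = np.stack([log_softmax(W @ rng.normal(size=d)) for _ in range(64)])

# Centering removes the log-partition constant, leaving vectors that
# lie in a d-dimensional linear subspace (the model's image).
centered = outputs - outputs.mean(axis=1, keepdims=True)

# The rank of the stacked outputs reveals the embed size.
print(np.linalg.matrix_rank(centered))  # prints 16
```

Collecting a few more outputs than the suspected embed size (64 here) makes the rank estimate robust.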
@mattf1n
Matthew Finlayson
2 years
Come see our #EMNLP2022 poster at 4pm in the Atrium! @Swarooprm7 and I will be there to chat about math reasoning models and evaluations 🎉
0
5
12
@mattf1n
Matthew Finlayson
2 years
5/🧵 We find that models perform much better when they output a Python program that prints the answer, instead of directly generating the answer. This answer format has the added benefit that it doubles as a step-by-step solution.
1
0
12
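The answer format can be illustrated with a toy instance (the question and numbers are invented for illustration, not drawn from Līla):

```python
# Hypothetical Līla-style instance: the target is a Python program whose
# printed output is the answer, so the program doubles as a step-by-step
# solution.
question = "A tank holds 240 L and drains at 1.5 L/min. How many hours until it is empty?"

minutes_to_empty = 240 / 1.5        # 160 minutes of draining
hours_to_empty = minutes_to_empty / 60

print(hours_to_empty)  # about 2.67 hours
```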
@mattf1n
Matthew Finlayson
9 months
It was really fun to contribute to this bit of software, and it has some pretty useful applications! If you want logprobs beyond the top-5 that OpenAI (or any other API) gives you, check out our library 🎉
@jxmnop
jack morris
9 months
fun research story about how we jailbroke the chatGPT API: every time you run inference with a language model like GPT-whatever, the model outputs full probabilities over its entire vocabulary (~50,000 tokens), but when you use their API, OpenAI hides all this info from…
16
65
609
1
0
11
@mattf1n
Matthew Finlayson
7 months
Thanks to my advisors @swabhz and @xiangrenNLP for their guidance, as well as @jxmnop for his valuable feedback. 7/🧵
1
0
11
@mattf1n
Matthew Finlayson
10 months
It’s that time of year again. Good luck on your applications!
0
0
10
@mattf1n
Matthew Finlayson
2 years
3/🧵 Our benchmark, Līla, is available on GitHub. We welcome contributions from the community to continue expanding our data. Open an issue and we'll work with you to add your data set!
1
0
9
@mattf1n
Matthew Finlayson
2 years
8/🧵 A big thanks to all my collaborators and advisors from @ai2_aristo at @allen_ai and elsewhere, starting with my co-first-author @Swarooprm7 , @lupantech , @leonardtang_ , @wellecks , @cbaral , @tanmayRPurohit , Oyvind Tafjord, Ashish Sabharwal, Peter Clark, and @ashwinkalyan7 .
0
0
9
@mattf1n
Matthew Finlayson
7 months
Though knowing the embed size of an LLM may not seem very useful on its own, it turns out that there are many practical applications for LLM images, ranging from adversarial attacks, to methods for auditing LLM API providers. Let’s take a look at two of them: 3/🧵
1
0
9
@mattf1n
Matthew Finlayson
2 years
2/🧵 Existing math reasoning datasets are too narrowly scoped to evaluate general math reasoning ability. To fix this, we annotate over 140K diverse natural language math questions with Python programs. Topics range from arithmetic to algebra to multivariate calculus.
1
0
8
@mattf1n
Matthew Finlayson
1 year
@jxmnop Beam search to optimize seq prob considered harmful. Often you don’t want the most likely seq, you want a typical seq. Must-reads on this topic: “If beam search is the answer, what was the question?” and “Locally Typical Sampling”
1
0
7
@mattf1n
Matthew Finlayson
7 months
First, you can get cheap, full LM outputs by accelerating extraction algorithms from O(vocab size) to O(embed size) API calls—a 25x speedup for gpt-3.5-turbo! This means model stealing just got a lot cheaper. Sounds bad for LLM companies, but there is good news too: 4/🧵
1
0
7
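The linear-algebra idea behind the speedup can be sketched on a toy model (the calibration setup and sizes below are assumptions for illustration, not the paper's algorithm): once a basis for the image is known, roughly embed-size many entries of a new output determine all the rest.

```python
import numpy as np

rng = np.random.default_rng(1)
vocab, d = 200, 16

W = rng.normal(size=(vocab, d))

def centered_logprobs(h):
    z = W @ h
    logp = z - z.max() - np.log(np.exp(z - z.max()).sum())
    return logp - logp.mean()

# Calibration: learn a basis for the image from a few full outputs.
calib = np.stack([centered_logprobs(rng.normal(size=d)) for _ in range(32)]).T
U = np.linalg.svd(calib, full_matrices=False)[0]
Q = U[:, :d]  # orthonormal basis for the d-dimensional image

# New query: suppose the API reveals only d + 4 entries of the output.
y = centered_logprobs(rng.normal(size=d))
idx = rng.choice(vocab, size=d + 4, replace=False)

# Those few entries pin down the output's coordinates in the image,
# from which every other entry can be reconstructed.
coeffs, *_ = np.linalg.lstsq(Q[idx], y[idx], rcond=None)
y_hat = Q @ coeffs

print(np.max(np.abs(y_hat - y)))  # near machine precision
```

The cost is thus O(embed size) observed entries per output instead of O(vocab size), which is where the 25x figure for gpt-3.5-turbo comes from.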
@mattf1n
Matthew Finlayson
2 years
4/🧵 Līla comes with useful built-in data splits for evaluating in-distribution performance, generalization to out-of-distribution data, and robustness to linguistic perturbation.
1
0
7
@mattf1n
Matthew Finlayson
2 years
6/🧵 We fine-tune a model (GPT-Neo-2.7B) on Līla which outperforms its like-sized peers (T5-3B, GPT-Neo) when fine-tuned on new math reasoning tasks. If you want to train a math reasoning model, consider using our model, Bhāskara, as a starting point:
1
0
7
@mattf1n
Matthew Finlayson
7 months
Next, LLM images can be viewed as model "signatures". An LLM's outputs all carry its unique signature. This means outsiders can audit LLMs for unannounced changes and catch APIs serving LLMs they aren’t supposed to: a win for accountability! So what should LLM providers do? 5/🧵
1
0
7
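A toy version of the audit (the two-random-models setup is an illustrative assumption, not the paper's protocol): check whether an output lies in a model's image by projecting onto a recorded basis.

```python
import numpy as np

rng = np.random.default_rng(2)
vocab, d = 100, 8

def make_model():
    W = rng.normal(size=(vocab, d))
    def output(h):
        z = W @ h
        logp = z - z.max() - np.log(np.exp(z - z.max()).sum())
        return logp - logp.mean()
    return output

model_a, model_b = make_model(), make_model()

# Record model A's "signature": an orthonormal basis for its image.
sig = np.stack([model_a(rng.normal(size=d)) for _ in range(24)]).T
Q = np.linalg.svd(sig, full_matrices=False)[0][:, :d]

def residual(y):
    """Distance from y to A's image; near zero iff y carries A's signature."""
    return np.linalg.norm(y - Q @ (Q.T @ y))

same = residual(model_a(rng.normal(size=d)))   # output really from A
other = residual(model_b(rng.normal(size=d)))  # a different model's output
print(same, other)  # same is ~0, other is clearly nonzero
```

An auditor who has recorded A's signature can thus flag outputs that could not have come from A.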
@mattf1n
Matthew Finlayson
8 months
Another great paper from @andreasgrv and friends! Love seeing more work on output bottlenecks 💯 Q: if the hyperplanes were not constrained to intersect at the origin, how much “less bad” would the situation be? Seems related to Moser’s circle problem
@andreasgrv
Andreas Grivas
8 months
How expressive is your deep multi-label classifier? Can it represent all outputs of interest? 🤔 SPOILER🚨: Your model can have *test set* outputs that are impossible to predict! 🚫 Check out our paper! 🧵(1/7)
2
22
76
3
0
7
@mattf1n
Matthew Finlayson
7 months
Considering the risks posed by our methods and weighing them against the benefits from increased trust and accountability, we take the position that these “vulnerabilities” should be viewed as a feature rather than a bug for LLM APIs. 6/🧵
1
0
6
@mattf1n
Matthew Finlayson
2 years
@ai2_aristo @allen_ai Check out our paper for more in-depth analyses! Our data and code are available at A big thank you to my advisors and mentors!
0
0
4
@mattf1n
Matthew Finlayson
6 months
@shaily99 I always appreciate it when authors use backrefs: \usepackage[backref]{hyperref}
0
1
4
@mattf1n
Matthew Finlayson
2 years
@ai2_aristo @allen_ai Thus, we introduce Hard RegSet as a challenging instruction learning task, and a controlled environment for studying instruction learning.
1
0
3
@mattf1n
Matthew Finlayson
11 months
I got to work with Yuntian during my undergrad. He’s a great mentor with lots of enthusiasm and big ideas!
@yuntiandeng
Yuntian Deng
11 months
I am hiring NLP/ML PhD students at UWaterloo, home to 5 NLP professors! Apply by Dec 1. Strong consideration will be given to those who can tackle the below challenge: Can we use LM's hidden states to reason about multiple problems simultaneously? ​​Retweets/shares appreciated🥰
12
133
479
1
0
3
@mattf1n
Matthew Finlayson
1 year
Nucleus and top-k are threshold-based: they prevent the model from sampling low-probability tokens below a threshold. Why should this help? We prove that thresholds can guarantee that no tokens with zero true probability are ever sampled ✅
1
0
3
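The threshold mechanism can be sketched in a few lines of numpy (a generic implementation of the two standard truncation rules, not the paper's code):

```python
import numpy as np

def top_k(p, k):
    """Zero out all but the k most probable tokens, then renormalize."""
    q = np.zeros_like(p)
    keep = np.argsort(p)[-k:]
    q[keep] = p[keep]
    return q / q.sum()

def nucleus(p, top_p):
    """Keep the smallest high-probability prefix covering at least top_p mass."""
    order = np.argsort(p)[::-1]
    csum = np.cumsum(p[order])
    cutoff = np.searchsorted(csum, top_p) + 1  # first prefix with enough mass
    q = np.zeros_like(p)
    q[order[:cutoff]] = p[order[:cutoff]]
    return q / q.sum()

p = np.array([0.5, 0.25, 0.15, 0.07, 0.03])
print(top_k(p, 2))       # only the top-2 tokens keep mass
print(nucleus(p, 0.85))  # keeps {0.5, 0.25, 0.15}, zeroing the low-probability tail
```

Both rules sample only from tokens above an (explicit or implicit) probability threshold, which is exactly the property the proof leans on.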
@mattf1n
Matthew Finlayson
9 months
@oli_0331 In my head I have a separate table for mag- and -um- verbs, each with tense on one axis (past, present, future) and focus (actor, object, location) on the other. I think you could add a third axis, though, for affixes like pang-, pag-, and ma- that change the meaning in other ways
1
0
2
@mattf1n
Matthew Finlayson
2 years
@ai2_aristo @allen_ai Our findings suggest four general implications that we expect will extend beyond the synthetic RegEx environment. For instance, instruction learning is hard when the instructions are not very precise, that is, they can be executed correctly in several different ways.
1
0
2
@mattf1n
Matthew Finlayson
1 year
In conclusion, by delving into the theory behind why threshold-based sampling works, and identifying the softmax bottleneck as a major source of model errors, we open the door to sampling methods beyond thresholding heuristics to better surface LMs' generative capabilities! 🔑
0
0
2
@mattf1n
Matthew Finlayson
2 years
@ai2_aristo @allen_ai We use the task of deciding whether a given string matches a regular expression (viewed as an instruction) to identify properties of tasks, instructions, and instances that make instruction learning challenging.
1
0
2
@mattf1n
Matthew Finlayson
2 years
@ai2_aristo @allen_ai Similarly, instruction learning is hard when correctly executing instructions requires keeping in memory a larger context of the partial execution thus far and making choices dependent on this longer history.
1
0
2
@mattf1n
Matthew Finlayson
2 years
@ai2_aristo @allen_ai Instruction learning, where a model learns to perform new tasks from task descriptions rather than from examples, has become popular with the advent of models like T0, but the capabilities of these models as instruction learners are poorly understood.
1
0
2
@mattf1n
Matthew Finlayson
2 years
@ai2_aristo @allen_ai We use our findings to make a hard instruction learning dataset, which we call Hard RegSet. Fine-tuning on Hard RegSet, our transformer learns to correctly interpret only 65.6% of test instructions with at least 90% accuracy.
1
0
2
@mattf1n
Matthew Finlayson
1 year
@soldni I should have stayed another month
1
0
2
@mattf1n
Matthew Finlayson
1 year
@oli_phd I think the issue is that “you” is the receiver, not the object. In “sinabi ko (siya) sa iyo” the focus is on the object and “sa iyo” indicates the receiver. “Sinabihan” focuses the receiver, which gets marked with focus (“sa iyo” -> “ka”), and the transformation to “kita” occurs.
1
0
1
@mattf1n
Matthew Finlayson
6 months
@_xjdr @soldni (Large Model, Foundation AI, Open-source)
0
0
1
@mattf1n
Matthew Finlayson
1 year
For the curious and nonbelievers
0
0
1
@mattf1n
Matthew Finlayson
2 years
@m_pulkit Yes to 1, no to 2
0
0
1
@mattf1n
Matthew Finlayson
9 months
@oli_0331 Interesting! I think there would be 9 forms for each affix (mag, pag, pang, ma, etc., one for each tense/focus pair), but I’d be interested to know whether all of these forms are used/make sense for different verbs
1
0
1
@mattf1n
Matthew Finlayson
6 months
@soldni Ugh I’m seeing all these great write-ins after I already filled it out 🤦‍♂️
0
0
1
@mattf1n
Matthew Finlayson
1 year
Knowing the set of possible model outputs, and assuming that the model minimizes its training loss w.r.t. the true distribution, we can actually deduce a set of distributions containing the true distribution, and thereby find tokens that must have nonzero true probability 🕵️‍♀️
1
0
1
@mattf1n
Matthew Finlayson
1 year
In particular, we identify the SOFTMAX BOTTLENECK as a source of the errors. The softmax bottleneck entails that model outputs are restricted to a subset of probability distributions. If the true distribution is not in this set, the model cannot output it 😱
1
0
1
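A minimal way to see the restriction (a contrived toy head, not the paper's construction): if the unembedding ties two tokens' rows, no hidden state can separate their probabilities, so whole families of true distributions are unreachable.

```python
import numpy as np

rng = np.random.default_rng(3)
vocab, d = 5, 2  # tiny vocabulary, severely bottlenecked hidden size

W = rng.normal(size=(vocab, d))
W[1] = W[0]  # tokens 0 and 1 share an unembedding row

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Whatever the hidden state, tokens 0 and 1 get identical probability,
# so any true distribution with p[0] != p[1] lies outside the model's image.
for _ in range(5):
    p = softmax(W @ rng.normal(size=d))
    assert np.isclose(p[0], p[1])
print("tokens 0 and 1 always tie")
```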
@mattf1n
Matthew Finlayson
1 year
Unfortunately, thresholds are limited: they necessarily also discard good-but-low-probability tokens. This tradeoff seems inevitable, since good and bad tokens are indistinguishable by probability alone ⚖️ However, we can do better by directly accounting for the source of errors!
1
0
1
@mattf1n
Matthew Finlayson
2 years
@anmarasovic I miss this!
1
0
1
@mattf1n
Matthew Finlayson
8 months
@StefanResearch @andreasgrv Cool! I’ll give it a read
0
0
1
@mattf1n
Matthew Finlayson
7 months
@zraytam @DimitrisPapail Author here. It does, but only by making these methods more expensive. It doesn’t prevent the exploit.
1
0
1
@mattf1n
Matthew Finlayson
2 years
@sivil_taram @ai2_aristo @allen_ai Great question! We do find that learning a good general regex executor from examples alone is hard, but the main result here is using an imperfect neural regex executor to identify properties that could make executing general instructions hard.
1
0
1
@mattf1n
Matthew Finlayson
2 years
@anmarasovic 🇭🇷 🇭🇷 🇭🇷
0
0
1
@mattf1n
Matthew Finlayson
2 years
@Swarooprm7 @emnlpmeeting Hope you feel better soon!
0
0
1
@mattf1n
Matthew Finlayson
2 years
@ValvodaJosef Nice work! I’m noticing a theme here: naïve random sampling from a formal language can lead to uninteresting/degenerate datasets and bad generalization. Reminds me of where they do this for SAT problems instead of transducers.
2
0
1
@mattf1n
Matthew Finlayson
1 year
We translate this finding into a new class of threshold-free sampling methods. In pilot studies, our easy-to-implement method (BA-η) performs competitively against existing methods, and outperforms them in low-entropy (close to greedy) decoding 💪
1
0
1
@mattf1n
Matthew Finlayson
7 months
@citre_piotto @TheXeophon We don’t rule out MoE; the nondeterminism might also be due to the model being served from multiple servers with different hardware. It’s hard to tell 🤔
0
0
1
@mattf1n
Matthew Finlayson
1 year
Did you know that the softmax function is linear?
I knew that: 10
I did not know that: 9
I don’t believe you: 26
1
0
1
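One precise reading of the poll (my gloss, not stated in the tweet): after centering out the normalization constant, log-softmax acts linearly on logits; in fact, it is just centering.

```python
import numpy as np

def log_softmax(z):
    return z - z.max() - np.log(np.exp(z - z.max()).sum())

def center(v):
    return v - v.mean()

z = np.random.default_rng(4).normal(size=10)

# log-softmax subtracts the same constant (the log-partition) from every
# coordinate, so after centering it is exactly the identity on logits:
# the map logits -> centered log-probs is linear.
assert np.allclose(center(log_softmax(z)), center(z))
print("centered log-softmax == centered logits")
```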