Sang Michael Xie

@sangmichaelxie

3,066 Followers
724 Following
56 Media
366 Statuses

PhD student @StanfordAILab @StanfordNLP @Stanford advised by Percy Liang and Tengyu Ma. Prev: visiting @GoogleAI Brain, BS, MS Stanford ‘17

Stanford, CA
Joined May 2019
Pinned Tweet
@sangmichaelxie
Sang Michael Xie
1 year
Should LMs train on more books, news, or web data? Introducing DoReMi🎶, which optimizes the data mixture with a small 280M model. Our data mixture makes 8B Pile models train 2.6x faster, get +6.5% few-shot acc, and get lower pplx on *all* domains! 🧵⬇️
16
150
679
@sangmichaelxie
Sang Michael Xie
3 years
Why can GPT3 magically learn tasks? It just reads a few examples, without any parameter updates or explicitly being trained to learn. We prove that this in-context learning can emerge from modeling long-range coherence in the pretraining data! (1/n)
4
138
559
@sangmichaelxie
Sang Michael Xie
2 years
I've gotten some requests about the "building language models" project from last year's Stanford Large Language Models class, so we're releasing it: The task is to finetune LMs to give them new capabilities/properties, similarly to Toolformer and Alpaca.
7
74
346
@sangmichaelxie
Sang Michael Xie
2 years
Data selection for LMs (GPT-3, PaLM) is done with heuristics that select data by training a classifier for high-quality text. Can we do better? Turns out we can boost downstream GLUE acc by 2+% by adapting the classic importance resampling algorithm.. 🧵
4
61
340
@sangmichaelxie
Sang Michael Xie
1 year
Releasing an open-source PyTorch implementation of DoReMi! The pretraining data mixture is a secret sauce of LLM training. Optimizing your data mixture for robust learning with DoReMi can reduce training time by 2-3x. Train smarter, not longer!
@sangmichaelxie
Sang Michael Xie
1 year
Should LMs train on more books, news, or web data? Introducing DoReMi🎶, which optimizes the data mixture with a small 280M model. Our data mixture makes 8B Pile models train 2.6x faster, get +6.5% few-shot acc, and get lower pplx on *all* domains! 🧵⬇️
16
150
679
2
68
272
@sangmichaelxie
Sang Michael Xie
2 years
How can large language models (LMs) do tasks even when given random labels? While traditional supervised learning would fail, viewing in-context learning (ICL) as Bayesian inference explains how this can work! Blog post with @sewon__min :
2
59
241
@sangmichaelxie
Sang Michael Xie
11 months
I’m presenting 2 papers at #NeurIPS2023 on data-centric ML for large language models:
DSIR (targeted data selection): Wed Dec 13 @ 5pm
DoReMi (pretraining data mixtures): Thu Dec 14 @ 10:45am
Excited to chat about large language models, data, pretraining/adaptation, and more!
7
31
216
@sangmichaelxie
Sang Michael Xie
1 year
DSIR is a fast trillion-token-scale data selection tool for LLMs that’s been used on the Pile/RefinedWeb/C4/CCNet/RedPajama/etc.
⚡️Select 100M documents from the full Pile in just 𝟰.𝟱 𝗵𝗼𝘂𝗿𝘀 with 1 CPU node
Now on PyPI: pip install data-selection
Just 4 lines of code:
7
35
206
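(The exact four-line snippet was in the attached image, which isn't preserved in this capture. As a rough sketch of how the pip package is typically driven, with the class and method names recalled from the repository README and therefore assumptions to verify against the installed version:)

from data_selection import HashedNgramDSIR  # assumed class name; check the package docs

# Hypothetical paths to line-delimited text files: raw pool and a target-distribution sample.
raw_datasets = ['pile_shard_0.jsonl', 'pile_shard_1.jsonl']
target_datasets = ['wiki_and_books_sample.jsonl']

dsir = HashedNgramDSIR(raw_datasets, target_datasets, cache_dir='dsir_cache')
dsir.fit_importance_estimator(num_tokens_to_fit='auto')   # fit hashed n-gram models on raw and target text
dsir.compute_importance_weights()                         # per-document log(p_target / p_raw)
dsir.resample(out_dir='selected', num_to_sample=100_000_000, cache_dir='resample_cache')

(Even if the released API differs in details, the sequence is the same: fit the hashed n-gram importance estimator, score the raw documents, then resample.)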
@sangmichaelxie
Sang Michael Xie
4 years
🍔🍟"In-N-Out: Pre-Training and Self-Training using Auxiliary Information for Out-of-Distribution Robustness" Real-world tasks (crop yield prediction from satellites) are often label-scarce. Only some countries have labels - how do we generalize globally?
1
37
164
@sangmichaelxie
Sang Michael Xie
3 years
Fine-tuning destroys some pre-trained info. Freezing parameters *preserves* it and *simplifies* the learning problem -> better ID and OOD accuracy. Excited to present Composed Fine-Tuning as a long talk at #ICML2021 ! Paper: Talk:
2
20
109
@sangmichaelxie
Sang Michael Xie
10 months
The 2nd ME-FoMo workshop on understanding foundation models will be at ICLR 2024 in Vienna! Topics include pretraining (data, archs), adaptation (instruct tuning, alignment), and emergence. Paper ddl: Feb 3 Website: OpenReview:
1
22
113
@sangmichaelxie
Sang Michael Xie
4 years
(1/n) Persistent neck pain and how to fix it: A thread for computer people 💻👇 Neck pain can stick around forever if you use the computer every day, especially in suboptimal WFH setups. Here's my story and a highly effective exercise that worked for me:
3
17
105
@sangmichaelxie
Sang Michael Xie
4 years
Simplifying Models with Unlabeled Output Data Joint w/ @tengyuma @percyliang Can “unlabeled” outputs help in semi-supervised learning? In problems with rich output spaces like code, images, or molecules, unlabeled outputs help with modeling valid outputs.
2
20
89
@sangmichaelxie
Sang Michael Xie
5 years
Robustness from adversarial training tends to come at the cost of standard accuracy. We show this can happen in min-norm linear regression models and prove that we can fix the tradeoff with unlabeled data, motivating the use of robust self-training!
2
24
86
@sangmichaelxie
Sang Michael Xie
3 years
Summary of #foundationmodels for robustness to distribution shifts 🧵: Robustness problem: Real-world ML systems need to be robust to distribution shifts (training data distribution differs from test). Report:
2
9
60
@sangmichaelxie
Sang Michael Xie
2 years
Excited to work and learn with @AdamsYu @hieupham789 @quocleix over the summer at Google Brain in Mountain View! Hope to make some nice improvements in pretraining. Now if only I can figure out how to do all these onboarding tasks…
2
4
48
@sangmichaelxie
Sang Michael Xie
4 years
GPT3 explaining our abstract on tradeoff between robustness and accuracy to 2 year olds: "Sometimes when you train a machine learning model, it gets better at handling bad situations, but it gets worse at handling good situations."🤣😳 cred: Aditi Raghunathan & @siddkaramcheti
1
2
46
@sangmichaelxie
Sang Michael Xie
4 years
How can we mitigate the tradeoff between adversarial robustness and accuracy? w/ Aditi Raghunathan, Fanny Yang, John Duchi, @percyliang Come chat with us #ICML2020 : 11am-12pm PST, 10-11pm PST video: paper:
2
11
46
@sangmichaelxie
Sang Michael Xie
1 year
ShearedLLaMA uses DoReMi with a scaling law as the reference model to optimize the data mixture on RedPajama -> strong results at 1.3B and 3B!
@xiamengzhou
Mengzhou Xia
1 year
We release the strongest public 1.3B and 3B models so far – the ShearedLLaMA series. Structured pruning from a large model to a small one is far more cost-effective (only 3%!) than pre-training them from scratch! Check out our paper and models at: [1/n]
20
141
767
0
3
40
@sangmichaelxie
Sang Michael Xie
4 years
Had a great time talking to prospective Stanford grad students at #Tapia2020 ! GL to everyone! If you’re applying to the Stanford CS PhD program this fall: for an app review with a current student. Priority for students from underrepresented minorities.
1
6
34
@sangmichaelxie
Sang Michael Xie
1 year
We publish our optimized domain weights for The Pile to improve future language model training! Interestingly, DoReMi's domain weights decrease the weight on Wikipedia but produce a model with better performance on Wikipedia-based tasks (TriviaQA, SQuAD, etc).
1
4
33
@sangmichaelxie
Sang Michael Xie
9 months
Excited to co-organize this ICLR 2024 workshop! I think better data will be crucial for the next big advances in foundation models. The submission date is Feb 3 - details at
@ruoxijia
Ruoxi Jia
9 months
Excited to announce the ICLR 2024 Workshop on Data Problems for Foundation Models (DPFM)! - Topics of Interest: data quality, generation, efficiency, alignment, AI safety, ethics, copyright, and more. - Paper Submission Date: Feb 3, 2024 | Workshop Date: May 11, 2024 (Vienna and
1
23
108
0
4
29
@sangmichaelxie
Sang Michael Xie
5 years
I will be presenting "Reparameterizable Subset Sampling via Continuous Relaxations" @IJCAIconf tomorrow, work with @ermonste ! End-to-end training of models using subset sampling for model explanations, deep kNN, and a new way to do t-SNE.
1
3
27
@sangmichaelxie
Sang Michael Xie
4 years
WILDS collects real world distribution shifts to benchmark robust models! I’m particularly excited about the remote sensing datasets (PovertyMap and FMoW) - spatiotemporal shift is a real problem, and space/time shifts compound upon one another. Led by @PangWeiKoh @shiorisagawa
@shiorisagawa
Shiori Sagawa
4 years
We're excited to announce WILDS, a benchmark of in-the-wild distribution shifts with 7 datasets across diverse data modalities and real-world applications. Website: Paper: Github: Thread below. (1/12)
8
205
895
0
2
27
@sangmichaelxie
Sang Michael Xie
3 years
Pretraining is surprisingly powerful for domain adaptation, and doesn’t rely on domain invariance! The magic comes from properties of the data augmentations used, leading to disentangled class and domain info. With this, finetuning on class labels allows for ignoring the domain.
@tengyuma
Tengyu Ma
3 years
Pretraining is ≈SoTA for domain adaptation: just do contrastive learning on *all* unlabeled data + finetune on source labels. Features are NOT domain-invariant, but disentangle class & domain info to enable transfer. Theory & exps:
7
127
679
0
2
26
@sangmichaelxie
Sang Michael Xie
8 months
Interestingly, pretraining on unlabeled source/target+finetuning doesn’t improve much over just supervised learning on source in iWildcam-WILDS. Correspondingly, the connectivity conditions on the success of contrastive pretraining for UDA () also fail!
@_helenqu
Helen Qu
8 months
What's the best way to use unlabeled target data for unsupervised domain adaptation (UDA)? Introducing Connect Later: pretrain on unlabeled data + apply *targeted augmentations* designed for the dist shift during fine-tuning ➡️ SoTA UDA results! 🧵👇
1
5
24
1
5
26
@sangmichaelxie
Sang Michael Xie
2 years
We’re organizing a workshop on understanding foundation models at #ICLR2023 ! We welcome any unpublished or ongoing work on empirical or theoretical understanding, as well as highlighting phenomena like scaling laws or emergence. Deadline is Feb 3, 2023.
@ananyaku
Ananya Kumar
2 years
Foundation models (BERT, DALLE-2, ChatGPT) have led to a paradigm shift in ML, but are poorly understood. Announcing ME-FoMo, an #ICLR2023 workshop on understanding foundation models. Deadline: Feb 3, 2023 Topics: Pretraining, transfer, scaling laws, etc
3
70
358
0
3
23
@sangmichaelxie
Sang Michael Xie
2 years
Some of the most powerful abilities of language models (GPT-3) are their intuitive language interface and few-shot learning. These abilities can make it easier to design the rewards for an RL agent, with just a text description or a few demonstrations.
@MinaeKwon
Minae Kwon
2 years
Can we make it easier to specify objectives to an RL agent? We allow users to specify their objectives more cheaply and intuitively using language! 🧵👇 Paper: Code: Talk:
4
27
174
0
3
22
@sangmichaelxie
Sang Michael Xie
3 years
Check out our #ICLR2021 poster on In-N-Out at 5-7pm PST tomorrow (Wed May 5): we show how to improve OOD performance by *leveraging* spurious correlations and unlabeled data, and we test it on real world remote sensing tasks!
@tengyuma
Tengyu Ma
3 years
This appears in #ICLR2021 . Please check out our paper, videos, poster, code, etc! ICLR poster link: ArXiv: Codalab: Github:
0
4
17
1
4
22
@sangmichaelxie
Sang Michael Xie
1 year
Very cool paper on high-dimensional calibration / online learning by my brother Stephan @stephofx , who is applying to PhD programs this year!
@Aaroth
Aaron Roth
1 year
Super excited about a new paper (with applications to conformal prediction and large action space games) that asks what properties online predictions should have such that acting on them gives good results. Calibration would be nice, but is too hard in high dimensions. But...
2
21
147
0
1
22
@sangmichaelxie
Sang Michael Xie
3 years
Our foundation models (GPT3, BERT) report has a section on how they improve robustness to distribution shifts, what they may not solve, and future directions, w/ @tatsu_hashimoto @ananyaku @rtaori13 @shiorisagawa @PangWeiKoh and TonyL! thx to @percyliang for leading the effort!
@StanfordHAI
Stanford HAI
3 years
NEW: This comprehensive report investigates foundation models (e.g. BERT, GPT-3), which are engendering a paradigm shift in AI. 100+ scholars across 10 departments at Stanford scrutinize their capabilities, applications, and societal consequences.
5
178
425
0
4
21
@sangmichaelxie
Sang Michael Xie
8 months
Great effort led by @AlbalakAlon to corral the wild west of LM data selection! A meta-issue: how do we make data work (esp. for pretraining) more accessible? Not everyone can train 7B LMs, but a first bar is to show that the benefits don't shrink with scale, at smaller scales.
@AlbalakAlon
Alon Albalak
8 months
{UCSB|AI2|UW|Stanford|MIT|UofT|Vector|Contextual AI} present a survey on🔎Data Selection for LLMs🔍 Training data is a closely guarded secret in industry🤫with this work we narrow the knowledge gap, advocating for open, responsible, collaborative progress
10
77
309
1
1
19
@sangmichaelxie
Sang Michael Xie
4 years
this guy...
@percyliang
Percy Liang
4 years
I just discovered this account I made 11 years ago. So how does one use these Twitters?
21
12
341
0
0
19
@sangmichaelxie
Sang Michael Xie
5 years
My brother (a high schooler) made for donating health equipment to people in need directly from home! How it works: 1. Donors don't need anything on hand - you can order directly from online sites like Amazon to the recipient. (1/4)
1
6
18
@sangmichaelxie
Sang Michael Xie
1 year
But optimizing for worst-case loss could be bad! It upweights the most noisy domains. Instead, we optimize the worst-case loss gap between the model being evaluated and a reference model trained on un-tuned domain weights.
3
0
18
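(A minimal sketch of the kind of multiplicative-weights update this describes: Group DRO over domains, using per-domain excess loss against the reference model. The clipping, step size, and smoothing below are illustrative choices, not the exact DoReMi hyperparameters.)

import numpy as np

def doremi_weight_update(domain_weights, proxy_losses, reference_losses,
                         step_size=1.0, smoothing=1e-3):
    """One domain-weight update in the spirit of DoReMi (illustrative sketch).

    domain_weights:   current sampling weights over domains (sums to 1)
    proxy_losses:     per-domain loss of the small proxy model being trained
    reference_losses: per-domain loss of a reference model trained on un-tuned weights
    """
    # Excess loss: how much worse the proxy is than the reference on each domain.
    # Clipping at 0 keeps already-learned (or hopelessly noisy) domains from being upweighted.
    excess = np.clip(proxy_losses - reference_losses, 0.0, None)

    # Exponentiated-gradient step toward the domains with the largest excess loss.
    new_weights = domain_weights * np.exp(step_size * excess)
    new_weights /= new_weights.sum()

    # Mix with the uniform distribution so no domain's weight collapses to zero.
    uniform = np.ones_like(new_weights) / len(new_weights)
    return (1.0 - smoothing) * new_weights + smoothing * uniform

(DoReMi trains the proxy model on the running weights and averages them over training steps to obtain the final data mixture used for the large model.)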
@sangmichaelxie
Sang Michael Xie
3 years
Using our understanding we created GINC, a small-scale synthetic dataset where in-context learning emerges for both Transformers and LSTMs. In-context accuracy increases with the number of examples and the length of each example (k). Code: (1.3/n)
1
2
17
@sangmichaelxie
Sang Michael Xie
1 year
Some intuitions: - Low entropy / predictable domains don’t need much data to learn well. - High entropy / noisy domains don’t need much data either since models at random init already output a uniform distribution over tokens. DoReMi naturally downweights these domains.
4
0
18
@sangmichaelxie
Sang Michael Xie
5 years
There should be a website that constantly updates the current best practices and models in deep learning. No one does early stopping anymore, right?
6
2
15
@sangmichaelxie
Sang Michael Xie
3 years
Our theory doesn't capture everything since it focuses on the role of the pretraining distribution. Our experiments show that model architecture and model size matter too. Bigger models were better at in-context learning despite achieving the same pretraining loss! (1.4/n)
1
1
14
@sangmichaelxie
Sang Michael Xie
1 year
Importantly, DoReMi can tune the domain weights (sampling frequencies for each domain) with a small model (280M params). These domain weights improve perplexity and downstream accuracy of a 30x larger model (8B params) *without using any downstream tasks*.
1
0
15
@sangmichaelxie
Sang Michael Xie
2 years
Before diving in - we've uploaded pre-filtered datasets to HuggingFace that were selected from The Pile for high-quality text! You can also use the code to select custom language modeling datasets from The Pile. Pre-filtered Pile data and code:
1
2
14
@sangmichaelxie
Sang Michael Xie
5 years
Adversarial Training can Hurt Generalization - even if there is no conflict with infinite data and the problem is convex. With @Aditi_Raghunathan and @Fanny_Yang at the #icml2019 workshop on Identifying and Understanding Deep Learning Phenomena.
1
0
13
@sangmichaelxie
Sang Michael Xie
2 years
This course assignment was developed with the help of @percyliang and the rest of the course instructors @tatsu_hashimoto , Chris Ré @HazyResearch , and @RishiBommasani . Class lecture materials are here:
0
1
13
@sangmichaelxie
Sang Michael Xie
2 years
The blog post is based on two papers: (An Explanation of In-context Learning as Implicit Bayesian Inference) and (Rethinking the Role of Demonstrations: What Makes In-Context Learning Work?) by @sewon__min .
1
2
13
@sangmichaelxie
Sang Michael Xie
1 year
Reallocating weight to challenging but learnable domains can improve perplexity on all domains through positive transfer. Often, DoReMi upweights diverse and high-quality domains like filtered web text.
1
0
11
@sangmichaelxie
Sang Michael Xie
3 years
WILDS includes some nice examples of real world distribution shifts, including spatial and temporal shifts in satellite image tasks: poverty (PovertyMap) and building/land use prediction (FMoW). Hope these help spur progress in ML for sustainability + social good! #ICML2021
@PangWeiKoh
Pang Wei Koh
3 years
We’re presenting WILDS, a benchmark of in-the-wild distribution shifts, as a long talk at ICML! Check out @shiorisagawa 's talk + our poster session from 9-11pm Pacific on Thurs. Talk: Poster: Web: 🧵(1/n)
4
20
100
0
5
11
@sangmichaelxie
Sang Michael Xie
5 years
Couple of unreasonable things. My mother works at a lab where she works with data (a computer job). She asked to stay home but instead they moved her into the wet lab and made her come in. Now they cut half her hours amidst layoffs/furloughs.
1
1
11
@sangmichaelxie
Sang Michael Xie
1 year
Central intuition: To account for a diversity of downstream tasks, try to achieve low loss on all domains. This motivates DoReMi, which adapts group distributionally robust optimization (Group DRO) to find domain weights that optimize for the worst-case domain.
1
0
12
@sangmichaelxie
Sang Michael Xie
2 years
Data selection typically involves filtering a large source of raw data towards some desired target distribution, whether it's high-quality/formal text (e.g., Wikipedia + books) for general-domain LMs like GPT-3 or domain-specific data for specialized LMs like Codex.
1
1
11
@sangmichaelxie
Sang Michael Xie
1 year
Blog on @StanfordCRFM for more details: Thanks to @togethercompute for providing compute resources for testing the codebase!
0
0
11
@sangmichaelxie
Sang Michael Xie
1 year
On The Pile, DoReMi's domain weights improve perplexity on all domains, even though some domains are downweighted. How can this be? Via a simple example, we show it's possible to have no tradeoff when there are very low entropy (predictable) and high entropy (noisy) domains.
1
0
12
@sangmichaelxie
Sang Michael Xie
1 year
This was work done during an internship at @GoogleAI , jointly with @quocleix , @percyliang , @tengyuma , @hyhieu226 , @AdamsYu , @yifenglou , @XuanyiDong , @Hanxiao_6 . Thanks also to @edchi for the support at the Brain team!
1
0
10
@sangmichaelxie
Sang Michael Xie
3 years
We assume all sentences in a pretraining document share a latent concept (long-range coherence). To generate coherent next tokens in LM pretraining, the LM must infer the latent concept. Inferring a shared concept from examples in a prompt leads to in-context learning! (1.2/n)
1
1
10
@sangmichaelxie
Sang Michael Xie
4 years
If anyone got Type 3 font errors while submitting their camera ready to ICML, it's likely because matplotlib defaults to Type 3 fonts... this website helped but still had to regenerate every plot
0
4
10
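(For anyone hitting the same error: the usual fix is to tell matplotlib to embed TrueType (Type 42) fonts before the figures are generated, then regenerate the plots. A minimal example:)

import matplotlib
# Embed TrueType (Type 42) fonts instead of Type 3 in PDF/PS output,
# which is what camera-ready font checkers typically require.
matplotlib.rcParams['pdf.fonttype'] = 42
matplotlib.rcParams['ps.fonttype'] = 42

import matplotlib.pyplot as plt
plt.plot([0, 1], [0, 1])
plt.savefig('figure.pdf')   # every plot has to be re-saved after setting the rcParams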
@sangmichaelxie
Sang Michael Xie
3 years
In-context accuracy on the small-scale GINC data is quite sensitive to example order, just like GPT3 (e.g., ). Each training ID in the figure is a set of examples, and each dot is a different permutation of the examples. (2/n)
1
1
10
@sangmichaelxie
Sang Michael Xie
1 year
We ran DoReMi with a 120M proxy model on The Pile (w/ NeoX tokenizer + FlashAttention2). A 120M model trained on DoReMi weights surpasses the baseline one-shot performance very early during training, within 70k steps (3x faster) across all tasks and within 20k steps on average!
1
0
10
@sangmichaelxie
Sang Michael Xie
2 years
Because of the large scale of raw text data, existing methods (GPT-3, PaLM, Pile) use heuristics: simple classifier + a noisy threshold to select data similar to the target distribution. But the selected data may lack diversity/be too diverse compared to the desired target dist.
1
0
9
@sangmichaelxie
Sang Michael Xie
1 year
On the GLaM dataset where domain weights tuned with downstream tasks are available, DoReMi finds domain weights that are similar to downstream-tuned weights. However, DoReMi doesn't use downstream tasks and optimizes the domain weights more efficiently (no RL, search).
1
0
9
@sangmichaelxie
Sang Michael Xie
5 years
I can already tell this will be an amazing course
0
0
9
@sangmichaelxie
Sang Michael Xie
1 year
How it works: DSIR selects a subset of a large raw dataset that is distributed like a target dataset on a reduced feature space (hashed n-grams). Under the hood, importance resampling automatically balances relevance and diversity by matching the target 𝘥𝘪𝘴𝘵𝘳𝘪𝘣𝘶𝘵𝘪𝘰𝘯.
1
0
9
@sangmichaelxie
Sang Michael Xie
3 years
@PreetumNakkiran Agreed it's not "standard" learning, but labels still matter - there is still a small drop with random labels in Sewon's paper, and (, "Pushing the Bounds..") shows that if you re-map the labels, the model will learn the remapping
1
0
8
@sangmichaelxie
Sang Michael Xie
2 years
Please fill out or help circulate this form for those interested in reviewing for our ICLR workshop on understanding foundation models!
@TiffanyVlaar
Tiffany Vlaar
2 years
Call for reviewers for our #ICLR2023 workshop on Mathematical and Empirical Understanding of Foundation Models. Fill out this form if you are interested and we will aim to get back to you asap. Paper deadline: 3 Feb Tentative reviewing period: 10-24 Feb.
0
12
51
0
2
8
@sangmichaelxie
Sang Michael Xie
2 years
Our method, Data Selection with Importance Resampling (DSIR), uses importance resampling to select data that *matches* a target dist. To make importance weight estimation tractable in the high-dimensional space of text, we work in a reduced feature space.
1
0
8
@sangmichaelxie
Sang Michael Xie
4 years
CubeChat is really cool! Your webcam video is on the side of a cube and you can move and jump around in a 3D world. You can even share your screen and it looks like a drive-in movie theater for cubes 😄
@napdivad
David Pan
4 years
Our team just launched CubeChat! 3D video chat with spatial audio--can be multiple conversations going on at once in a room! Please DM me if you are interested in trying it out.
0
4
2
0
4
8
@sangmichaelxie
Sang Michael Xie
1 year
As a black box, the codebase outputs optimized domain weights given a text dataset (+ domain info). Some other useful components: fast, resumable dataloader with domain-level weighted sampling, simple downstream eval harness, and HuggingFace Trainer + FlashAttention2 integration.
1
0
8
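(To illustrate what domain-level weighted sampling means here, a sketch rather than the codebase's actual dataloader API: each example in a batch first draws a domain according to the optimized weights, then draws a document from that domain.)

import numpy as np

def sample_batch(domain_datasets, domain_weights, batch_size, rng=np.random.default_rng()):
    """Sample a batch by first picking a domain per example (illustrative sketch).

    domain_datasets: dict mapping domain name -> list of tokenized examples
    domain_weights:  dict mapping domain name -> sampling probability
    """
    names = list(domain_datasets)
    probs = np.array([domain_weights[n] for n in names])
    probs = probs / probs.sum()  # guard against rounding error
    batch = []
    for domain_idx in rng.choice(len(names), size=batch_size, p=probs):
        examples = domain_datasets[names[domain_idx]]
        batch.append(examples[rng.integers(len(examples))])
    return batch

(Sampling at the domain level keeps the per-domain token proportions close to the optimized weights regardless of how the underlying files are sharded.)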
@sangmichaelxie
Sang Michael Xie
1 year
DSIR log-importance weights are also included in RedPajamav2 from @togethercompute as quality scores!
1
1
7
@sangmichaelxie
Sang Michael Xie
3 years
We assume the LM fits the pretraining distribution exactly. Thus, we can just characterize the conditional distribution given prompts (posterior) under the pretraining dist, marginalizing over latent concepts. Thus in-context learning can be viewed as Bayesian inference! (6/n)
1
1
7
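(Written out, the marginalization in the tweet above is just the law of total probability over the latent concept $\theta$, with notation chosen here for illustration:)

p(\text{output} \mid \text{prompt}) = \int_{\theta} p(\text{output} \mid \theta, \text{prompt}) \, p(\theta \mid \text{prompt}) \, d\theta

(As more in-context examples are added to the prompt, the posterior $p(\theta \mid \text{prompt})$ concentrates on the concept shared by the examples, so the prediction behaves as if the model had been told the task directly.)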
@sangmichaelxie
Sang Michael Xie
5 years
What are some good datasets for regression? Is it just UCI? Regression dataset land seems a little sparse to me
2
0
7
@sangmichaelxie
Sang Michael Xie
4 years
We propose a framework for using “unlabeled” output data (e.g. code on Github) by learning an output denoiser with self-supervision, then learning a predictor composed with the denoiser. By offloading complexity to the denoiser, the learned predictor can be much simpler.
1
0
7
@sangmichaelxie
Sang Michael Xie
3 years
Zero-shot is sometimes better than one/few-shot for some settings of GINC because of the OOD/unnatural prompt structure. This matches observations in the GPT-3 paper for some datasets (e.g., LAMBADA, HellaSwag, PhysicalQA, RACE-m, CoQA/SAT analogies for smaller models) (3/n)
1
1
7
@sangmichaelxie
Sang Michael Xie
2 years
Bayesian inference view of ICL: all examples in a prompt share a latent “concept”, which should be applied to the test example to answer it correctly. The LM infers the latent concept by using the prompt to "locate" it from a large set of concepts learned from pretraining.
2
1
6
@sangmichaelxie
Sang Michael Xie
2 years
Which feature space? N-grams! We find that KL reduction, a cheap data metric that measures how much data selection reduces KL to the target n-gram distribution over random sampling, predicts downstream acc very well (r=0.89). Potential for powering new data-centric workflows!
1
0
7
@sangmichaelxie
Sang Michael Xie
2 years
Does choice of pretraining data matter? We selected pretraining data for 8 downstream tasks and tried all pretrain->downstream pairs. Using the wrong pretraining data causes a 6% average drop in downstream acc, and drops acc by 30% in one case! Choice of data matters a lot.
1
0
6
@sangmichaelxie
Sang Michael Xie
4 years
Our algorithm, In-N-Out, first trains a model with auxiliary inputs and uses it to pseudolabel all the in-distribution inputs (where auxiliary inputs help), then pre-trains a model on OOD auxiliary outputs and fine-tunes this model with the pseudolabels (self-training).
1
2
6
@sangmichaelxie
Sang Michael Xie
3 years
The GINC dataset makes it easier to ask some questions about in-context learning that are hard to ask with real data. For example, we found that the models failed to extrapolate to random unseen concepts. (4/n)
1
1
6
@sangmichaelxie
Sang Michael Xie
3 years
Details: we introduce a simple pretraining distribution and show that it gives rise to in-context learning. The pretraining documents are generated from HMMs, and the latent concept is the hidden state transition matrix. (5/n)
1
1
5
@sangmichaelxie
Sang Michael Xie
5 years
To become a 10x researcher simply commit to 10x more deadlines, during deadlines you can do in a day what would normally take you 2 weeks 💪
@zacharylipton
Zachary Lipton
5 years
10x researchers Professors, if you ever come across this rare breed of researchers, grab them. If you have a 10x researcher as part of your lab, you increase the odds of winning a Turing award significantly. OK, here is a tough question. How do you spot a 10x researcher?
16
88
593
0
0
6
@sangmichaelxie
Sang Michael Xie
4 years
(6/n) HIGHLY EFFECTIVE exercise for neck pain: Stretch the shoulders by taking a towel or exercise band and grabbing the ends in front of you, holding it taut. Slowly move your arms over your head and all the way behind your back. Do this back-and-forth 10 times, 2 sets, daily.
2
1
6
@sangmichaelxie
Sang Michael Xie
3 years
Please see the report (Sec 4.8) for more details! Thanks to @tatsu_hashimoto @ananyaku @rtaori13 @shiorisagawa @PangWeiKoh and TonyL for the writing efforts, and @percyliang and @RishiBommasani for organizing!
1
0
6
@sangmichaelxie
Sang Michael Xie
5 years
@ProfFeynman Tell that to my neural network
1
0
6
@sangmichaelxie
Sang Michael Xie
3 years
Real-world example: poverty mapping from satellite imagery in under-resourced countries, where only some countries have labeled data from (expensive) surveys. Satellite images are available everywhere and there's spatial structure. Pre-training seems to help here too.
1
0
6
@sangmichaelxie
Sang Michael Xie
3 years
Thanks to my collaborators Aditi Raghunathan, @percyliang , @tengyuma !
0
2
6
@sangmichaelxie
Sang Michael Xie
4 years
Surprisingly, adding new input features (aux-inputs) can hurt OOD accuracy - even though in-distribution (ID) accuracy improves. Pre-training by predicting auxiliary information as outputs (aux-outputs) instead improves OOD accuracy by learning better features on unlabeled data.
1
1
6
@sangmichaelxie
Sang Michael Xie
3 years
Main result: if the signal about the latent concept from the examples dominates the error from the unnatural transitions between examples, then the model does in-context learning with enough examples. Making each example longer also helps boost the signal. (8/n)
1
1
6
@sangmichaelxie
Sang Michael Xie
1 year
Speed benchmark on selecting from the full Pile:
* Resources: 1 CPU node, 96GB RAM, 96 cores
* Fit importance estimator: 59s
* Compute importance weights: 4.36h
* Resample 10M documents: 6min
* Total: 4.5 hours
This can be even faster with more CPU cores!
1
0
5
@sangmichaelxie
Sang Michael Xie
1 year
There are no preprocessing steps, other than decompressing the data for faster reading. However, we also support reading from any file format, including compressed zst, through custom loading functions.
1
0
5
@sangmichaelxie
Sang Michael Xie
1 year
@andersonbcdefg @Google @GoogleAI @Stanford @StanfordAILab @stanfordnlp @quocleix @percyliang @AdamsYu @hyhieu226 @XuanyiDong We could probably handle hundreds of domains with a large enough batch size. But handling a large number of domains well is a next step!
1
0
4
@sangmichaelxie
Sang Michael Xie
5 years
I guess the implicit knowledge is that we can differentiate through convex cone programs, cool!
@akshaykagrawal
Akshay Agrawal
5 years
CVXPY is now differentiable. Try our PyTorch and TensorFlow layers using our package, cvxpylayers: (& see our NeurIPS paper for details )
2
173
587
1
0
5
@sangmichaelxie
Sang Michael Xie
5 years
My friend flew to the Bay Area because that’s where his medical board exam was scheduled, a $700 (multiple choice?) test that, under lockdown, A. still requires $100 to reschedule and B. had no other open slots anyway. Best answer would have been C. shelter in place
1
0
5
@sangmichaelxie
Sang Michael Xie
4 years
In-N-Out outperforms auxiliary inputs or outputs alone on both in-distribution and OOD error across three image and time-series datasets: CelebA (hat/no hat shift) and 2 real-world remote sensing tasks (landcover and cropland prediction) with unseen countries/regions.
1
0
5
@sangmichaelxie
Sang Michael Xie
11 months
Some links:
DSIR: - - -
DoReMi: - - -
0
2
5
@sangmichaelxie
Sang Michael Xie
2 years
When selecting for formal/high-quality text (Wiki+books) to train general-domain LMs, DSIR selects data that contains qualitatively more formal text than random selection and heuristic filters. This results in 2–2.5% higher downstream acc on GLUE than these baselines.
1
0
5
@sangmichaelxie
Sang Michael Xie
4 years
(7/n) Complement these with neck exercises and stretches, as well as the cobra pose from yoga. It's literally the opposite of sitting. ~balance~ Consistency is key. Foam rolling the neck gently also helped. Finally, TAKE BREAKS - your body and mind will thank you for it!
1
0
5
@sangmichaelxie
Sang Michael Xie
2 years
Part 1: instill the ability to control the word length of GPT-2's outputs by training on data that has length info prepended before each sentence. Part 2 is open-ended: instill a new capability in an LM. The students were really creative here (humor, poetry, math,...)!
1
0
5
@sangmichaelxie
Sang Michael Xie
4 years
In multi-task linear regression, we show pre-training provably improves robustness to *arbitrary* covariate shift and provides complementary gains with self-training. This could provide theoretical explanations for recent empirical works by @DanHendrycks @barret_zoph @quocleix
1
0
5
@sangmichaelxie
Sang Michael Xie
4 years
Composing with a denoiser provably reduces complexity for 2-layer networks in problems with discrete valid output spaces. Empirically, the framework improves performance on image generation and full-program pseudocode-to-code translation, but many other applications are possible!
1
0
5
@sangmichaelxie
Sang Michael Xie
2 years
This leads to our instantiation of DSIR, where we train 2 generative bag-of-ngrams models (one each for the raw and target data) to estimate importance weights. The n-grams are hashed onto a fixed number of virtual tokens (hashing trick) for simplicity and tractability.
1
0
5
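(Putting this thread together, a self-contained sketch of hashed bag-of-n-grams importance resampling: bucket counts, per-document log importance weights, then Gumbel-top-k resampling. The bucket count, tokenizer, and smoothing below are placeholder choices, not the paper's exact settings.)

import re
import numpy as np

NUM_BUCKETS = 10_000  # the "virtual tokens" from the hashing trick (placeholder size)

def hashed_ngram_counts(text, n=2):
    """Count uni/bi-grams hashed into a fixed number of buckets.
    (A stable hash such as hashlib would be used in practice; built-in hash() is per-process.)"""
    tokens = re.findall(r"\w+", text.lower())
    counts = np.zeros(NUM_BUCKETS)
    for k in range(1, n + 1):
        for i in range(len(tokens) - k + 1):
            counts[hash(" ".join(tokens[i:i + k])) % NUM_BUCKETS] += 1
    return counts

def fit_bucket_distribution(texts):
    """Fit a smoothed categorical distribution over hash buckets (a 'generative bag-of-ngrams' model)."""
    total = np.ones(NUM_BUCKETS)  # add-one smoothing
    for t in texts:
        total += hashed_ngram_counts(t)
    return total / total.sum()

def select(raw_texts, target_texts, num_to_select, seed=0):
    """Importance resampling: keep raw documents whose hashed n-grams look like the target."""
    p_target = fit_bucket_distribution(target_texts)
    p_raw = fit_bucket_distribution(raw_texts)
    log_ratio = np.log(p_target) - np.log(p_raw)
    # Log importance weight of each raw document under the two bucket models.
    log_w = np.array([hashed_ngram_counts(t) @ log_ratio for t in raw_texts])
    # Gumbel-top-k draws a sample without replacement proportional to the importance weights.
    rng = np.random.default_rng(seed)
    keys = log_w + rng.gumbel(size=len(raw_texts))
    keep = np.argsort(-keys)[:num_to_select]
    return [raw_texts[i] for i in keep]

(Because everything is hashed into fixed-length count vectors, both models stay tiny no matter how large the corpus is, which is what makes the single-CPU-node, full-Pile runs mentioned earlier in the feed feasible.)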