So many new LLM architectures (Mambas🐍, Transformers🤖,🦙,🦔, Hyenas🐺,🦓…), so little GPU time to combine them into hybrid LLMs…
Good news! Today we release Manticore, a system for creating **pretrained hybrids** from pretrained models! 👨🌾🦁🦂
1/n
Tired of reading about superconductors?
Check out our new work, just out on arXiv, rethinking how we get predictions out of classifiers and how to incorporate the *geometry* of the labels, i.e., how labels relate to one another! ⚙️📐
[1/n]
ResNet not working? Use our #NeurIPS2021 paper to find what to use in place of convs by searching our space of “XD-operations” containing convs, Fourier neural operators, graph convs, SOTA ops for neural PDE solvers, and infinitely many more [1/n]
Excited to share our Automated Weak Supervision benchmark at @NeurIPSConf next week!
We’ll be in Hall J, #1029 at 11:30a on Thursday – drop by and chat with us! #NeurIPS2022
[1/n]
I’ll be in Vienna this week for ICML! I’ll be presenting Manticore, our exciting new method for creating pretrained hybrid LLMs, later in the week at the ES-FoMo, FM-Wild, NGSM, and LCFM workshops.
Come by to chat about pretrained hybrid models!
Stoked to be headed to @NeurIPSConf #NeurIPS2023 soon!
Come check out our papers this year!
(Thurs 10:45) Geometry Aware Adaptation for Pretrained Models
(Weds 10:45) Skill-it! A Data-Driven Skills Framework for Understanding and Training LMs
🎉
Can’t wait to share NAS-Bench-360, our new Neural Architecture Search benchmark for diverse tasks at @NeurIPSConf next week!
Come chat with us on Tuesday in Hall J, #1029 at 11:30! #NeurIPS2022
[1/n]
This work was just accepted to #NeurIPS2023! Unfortunately it’s not about superconductors (what a throwback, right?)
Check out our thread to learn how to improve your existing classifiers using the geometry of the label space!
We are pleased to present the inaugural MLCommons Rising Stars cohort. These talented PhD students are the future leaders of ML and Systems research.
I’m super hyped to finally spam Twitter about this!
The winning team will be getting a $15,000 top prize—BTW, if you’re at @UWMadison, you should be aware that our top prize is roughly equivalent to 100 years of @hoofersailing membership. #doubleAdvertisement
A little over a week ago, we launched the AutoML Decathlon, a #NeurIPS2022 competition to develop efficient AutoML methods that work on diverse machine learning tasks, for a chance to win a $15,000 top prize! [1/n]
I’ll be in Berlin for the @automl_conf next week, hit me up if you want to chat!
PS: “Why is he tweeting from Montana?”
I’m en route back to Madison from Seattle trying to catch my flight to Berlin on Wednesday 🤠. So I’m kind of already on my way to the conference!
Had such a fun time co-leading the organizing team for the AutoML Decathlon competition! We just had our virtual workshop @NeurIPSConf this morning, so in case you missed it, here’s the final leaderboard!
Really excited to share more details soon!
Thrilled to share the final *test* leaderboard rankings for the AutoML Decathlon 2022 competition!!!
Big congrats to Team TrueFit for winning AutoML Decathlon 2022 and the $15,000 grand prize!!! 🧵🧵🧵
Congrats to the winning team and to the runner-up team for the best presentations today at the #AutoMLFallSchool @AutoMLDecathlon Hackathon, and major props to everyone who participated!
Also a huge thanks and shoutout to the #AutoMLFallSchool organizers for having us!!!
Generative models are awesome at producing data, and weak supervision is great at efficient labeling. Can we combine them to get cheap datasets for training or fine-tuning?
Excited to present our #ICLR2023 paper "Generative Modeling Helps Weak Supervision (and Vice Versa)"
Super pumped to be presenting AutoWS-Bench-101 at NeurIPS next week!
I will be generating more spam about this and our other AutoML for diverse tasks work at NeurIPS in the coming days… So stay tuned 🎸
And thank you @SnorkelAI for the shoutout!
AutoWS-Bench-101 by @nick11roberts, @fredsala, et al. evaluates automated weak supervision (AutoWS) techniques on a set of diverse application domains where it has previously been difficult or impossible to apply traditional WS techniques. [4/7]
***Manticore addresses both of these!***
We automate the design of hybrids, while using existing pretrained models that you can just get from Hugging Face, to create PRETRAINED hybrids!
3/n
While hybrid models are quickly gaining traction, there are a bunch of challenges:
They require hardcore, manual, expert-driven design, and new hybrids must be trained completely from scratch.
Tough luck for us GPU-poor, right???
2/n
Stoked about this work! If you’re interested in LMs and/or LLMs, this paper is for you, so check out Mayee’s thread on this!!!
Side note: I guess there’s a band called “Skillet,” but this paper is very, very unrelated—actually not my first naming clarification today 🤠
Large language models (LMs) rely heavily on training data quality. How do we best select training data for good downstream model performance across tasks? Introducing 🍳Skill-It: a data-driven framework for understanding and training LMs!
Paper: 1/13
Stoked to announce the AutoML Cup 2023 at @automl_conf!!!
This competition is the direct follow-up to the AutoML Decathlon 2022 that we ran as part of the @NeurIPSConf competition track.
Stay tuned for updates! 🏆🤖🎸
Announcing the AutoML Cup 2023!!! 🤖🏆📊
The AutoML Cup is an automated machine learning competition with a focus on diverse machine learning tasks and data settings — which will be part of the @automl_conf 2023.
[LG] Pretrained Hybrids with MAD Skills
N Roberts, S Guo, Z Gao, S S S N GNVV… [University of Wisconsin-Madison] (2024)
- Proposes Manticore, a framework to automatically design hybrid architectures combining different pretrained models like Transformers
We will be releasing code for Manticore and models shortly, so stay tuned!
Had a blast creating this with Wisconsin friends: Samuel Guo, @Zhiqi_Gao_2001, Satya Sai Srinath Namburi GNVV, @SonNicCr, Chengjun Wu, Chengyu Duan, and @fredsala
12/12
Manticore uses ideas from Neural Architecture Search (NAS) and simple “projector” layers that can translate the features between pretrained blocks with different architectures.
Ok, time for an example…
4/n
By the way, the name “Manticore” comes from Persian mythology:
The Manticore is a hybrid creature with the head of a human, the body of a lion, and the tail of a scorpion.
👨🌾🦁🦂/12
Super excited about this line of work and even more excited to see what people come up with for diverse tasks!!!
Check out NAS-Bench-360:
and our brand new AutoML Decathlon competition at NeurIPS 2022:
Do state-of-the-art AutoML methods work on diverse tasks? @khodakmoments and @atalwalkar introduce a new benchmark and a NeurIPS 2022 competition with the goal of finding out:
Blog:
The search trajectory from MAD seems to follow the architecture gradient on the fine-tuning task!!! Isn’t that neat? This suggests that the pretrained hybrids that we search for on these tasks may have some form of universality.
10/n
@zacharylipton So long as search spaces are inspired by existing architectures, you’re going to get things that look like existing architectures.
True for vision/NLP/well-explored domains, which limits the gains there. Instead, NAS for adapting these architectures to diverse tasks is the way.
Remember how much you hated Joffrey in Game of Thrones when you still liked Game of Thrones? That’s XGBoost right now.
Submit your methods to AutoML Decathlon today to claim your place on the iron throne + $15,000!!!
Think you can develop a machine learning method that beats XGBoost and a linear model on diverse tasks for $15,000? We think you can too.
Right now is the perfect time to submit to the AutoML Decathlon 2022 competition at #NeurIPS2022!
[1/n]
Competitions and benchmarks have been one of the major accelerators in AutoML.
@nick11roberts, @williamcxxz, and Samuel Guo will present the results and insights of the #NeurIPS2022 AutoML Decathlon challenge on Thursday: .
Lots of cool things at the AutoML Fall School — including the AutoML Decathlon *HACKATHON*
Excited to help folks get a running start on their submissions!
We have finalized the schedule for the upcoming #AutoML Fall School in October. I believe this is an excellent mix of invited lectures and hands-on sessions, both for academic packages and industry software. I hope I will see you there!
Can we align pre-trained models quickly and at no cost? 🤔 Sounds challenging!
Our latest research tackles this question. Surprisingly, we found compelling evidence that it just might be possible! 🌟🔍
Preprint:
We do this by using these super helpful synthetics as a proxy for search, instead of just doing search on the downstream fine-tuning task…
The losses here are on the fine-tuning task, while the search trajectory is from the MAD tasks…
9/n
Throw in some linear layers (with skip connections and gating) before and after the blocks from each of the two models to **translate** the features between them so that they can ‘speak a common language.’
6/n
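To make the projector idea concrete, here’s a minimal PyTorch sketch (the module name, dimensions, and exact gating scheme are my own illustrative assumptions, not the actual Manticore implementation):

```python
import torch
import torch.nn as nn

class Projector(nn.Module):
    """Illustrative sketch: translate features from one model's width to
    another's, with a gated skip connection (not the exact Manticore layer)."""
    def __init__(self, d_in: int, d_out: int):
        super().__init__()
        self.proj = nn.Linear(d_in, d_out)
        self.skip = nn.Linear(d_in, d_out) if d_in != d_out else nn.Identity()
        self.gate = nn.Parameter(torch.zeros(1))  # learned scalar gate

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        g = torch.sigmoid(self.gate)  # convex mix of projected vs. skipped features
        return g * self.proj(x) + (1 - g) * self.skip(x)

# e.g., translate hypothetical 768-dim Mamba features into a 1024-dim Transformer block
x = torch.randn(2, 16, 768)           # (batch, seq, d_in)
print(Projector(768, 1024)(x).shape)  # torch.Size([2, 16, 1024])
```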
A short poem:
We can’t let XGBoost win $15K,
It’s such a simple baseline, no way!
Submit to AutoML Decathlon today~
And be sure to follow @AutoMLDecathlon for leaderboard updates! XGBoost won’t be in the lead for long. 🙂
Here’s another banger — when the Mamba and Transformer models have different “skills,” they can result in a hybrid that is better on fine-tuning tasks where both skills are required.
(shameless Skill-It! 🍳 plug @MayeeChen)
11/n
So that’s how it’s done, folks.
“But wait, is the search for mixture weights expensive?”
I knew you’d ask, and no, it’s not. You can actually just ‘program’ the mixture weights using the amazing synthetic Mechanistic Architecture Design (MAD) tasks
8/n
Next, we want to learn how much influence each model will have on the overall hybrid (because what if one of them doesn’t perform well?)
This is where the NAS stuff comes in. We search for mixture weights of a convex combination of their blocks:
7/n
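In code, the convex combination might look something like this (a toy sketch with invented names; in the real system the two blocks are pretrained Mamba/Transformer blocks wrapped in projectors, and the mixture weights are what we search over):

```python
import torch
import torch.nn as nn

class MixedBlock(nn.Module):
    """Toy sketch: convex combination of two blocks' outputs; the mixture
    weights are the searchable architecture parameters."""
    def __init__(self, block_a: nn.Module, block_b: nn.Module):
        super().__init__()
        self.block_a, self.block_b = block_a, block_b
        self.logits = nn.Parameter(torch.zeros(2))  # mixture weights to search over

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = torch.softmax(self.logits, dim=0)  # w >= 0 and sums to 1: convex
        return w[0] * self.block_a(x) + w[1] * self.block_b(x)

# stand-in blocks for illustration (imagine a Mamba block and a Transformer block)
mixed = MixedBlock(nn.Linear(64, 64), nn.Linear(64, 64))
print(mixed(torch.randn(2, 64)).shape)  # torch.Size([2, 64])
```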
You might have noticed that our method is called “Loki.” It’s not a Marvel reference; it comes from “loci,” the plural of “locus,” as in the locus of the Fréchet mean.
This is (kind of) a convex hull analogue for metric spaces — check out why we used this weird name HERE:
@zacharylipton Not saying that we are, but it seems like long-term human architecture search is better at finding reusable motifs than NAS algorithms in general. I suspect that this is why traditional NAS is often used to navigate the perf-efficiency curve rather than to actually do this.
More generally, if you want to meet up, I’ll be around all week. Feel free to reach out and we can find a slot!
Looking forward to seeing folks in New Orleans next week!!!
@zacharylipton In general, agreed!
Though it sounds expensive, and even then, gluing together operations in different combinations in hopes of finding new motifs is probably only useful in domains where humans haven’t basically done years of Human Architecture Search themselves.
ALERT!!! AutoML Decathlon @NeurIPSConf 2022 update:
The reign of XGBoost has, at long last, come to an end… And a new competitor enters the ring!
Reminder to submit your methods to AutoML Decathlon for a chance to win $15,000 and eternal glory!
NEW COMPETITOR ALERT!!!
XGBoost has been dethroned and a *NEW* competitor is leading in the AutoML Decathlon competition at @NeurIPSConf 2022!!!
Submit your method today for a chance to win the $15,000 top prize and stay tuned for more leaderboard updates!
The #AutoML Fall School 2023 joins forces with the AutoML Decathlon team. This means @atalwalkar, Samuel, and Nick will give a hands-on introduction to the Decathlon setup at the Fall School, and we will spend the hackathon coming up with good submissions for the Decathlon.
Super excited to share that our work with @nick11roberts and my advisor @fredsala, “Lifting Weak Supervision to Structured Prediction,” has been accepted at #NeurIPS2022. Preprint coming soon!
With only about a month left, submit today for eternal glory etc. + $15,000!
*Fun fact:* $15,000 can buy you the complete box set of all 6 seasons of Lost 53 times over!
“Kate! We have to go back!”
No Jack, you and Kate need to submit to the AutoML Decathlon!!!
LEADERBOARD UPDATE!!! Many new competitors have entered the ring, including the #Minions, who are now the third team to beat XGBoost!
***We are entering the final month of the AutoML Decathlon @NeurIPSConf 2022 competition, so submit your methods soon!!!***
I will get my #Prius #hitched as soon as I find someone foolish enough to do the job! After that, I will just tow a vintage Skeeter ice boat out to #LakeMonona a few weekends out of the year.
Excited to present our work on 💪🏋️ Lifting Weak Supervision to Structured Prediction 💪🏋️ @NeurIPSConf this week! We’ll be in Hall J, #334 at 4pm on Wednesday – drop by and chat with us! #NeurIPS2022
[1/n]
In this example, the grid can be arbitrarily large, but your pretrained classifier only needs to be trained on a constant number of classes (4).
“But don’t predictions suffer if the space is large?”
Yeah, yeah, we have learning theory results for the prediction error:
[11/n]
The goal of NAS is to automate the design of neural networks for a given task, which saves human effort. This process typically involves the following three components: a search space, a search method, and a way to estimate performance.
[2/n]
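As a toy illustration of how those three components fit together (pure random search, with an invented search space and a stand-in performance estimator; real NAS methods are much smarter about the search):

```python
import random

# Search space: invented knobs, purely for illustration
search_space = {"depth": [2, 4, 8], "width": [64, 128], "op": ["conv", "attention"]}

def estimate_performance(arch: dict) -> float:
    """Stand-in estimator; in practice, train/validate a model built from `arch`."""
    return random.random()

def random_search(n_trials: int = 10) -> dict:
    """Simplest possible search method: sample from the space, keep the best."""
    best_arch, best_score = None, float("-inf")
    for _ in range(n_trials):
        arch = {k: random.choice(v) for k, v in search_space.items()}
        score = estimate_performance(arch)
        if score > best_score:
            best_arch, best_score = arch, score
    return best_arch

print(random_search())
```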
This interface works by setting the weights of this Fréchet mean. There are a bunch of ways to set the weights for this interface, some covered by our prior work and
[8/n]
📢 A fun blog post 📢 with my advisor @fredsala
Check out our blog post on improving the reliability of LLMs via aggregation using super cool classical tools 🔧🔨.
Blog post:
Paper:
Code:
With this substitution, we slowly realized that our search space elegantly encodes many interesting neural operations. Most excitingly, graph convs and Fourier neural operators @ZongyiLiCaltech are XD-operations. [9/n]
Leaderboard update!!!
New competitors have entered the ring.
Submit your method today to earn your spot on the leaderboard and for a chance to win $15,000!
Also, today is the last day to get early-bird registration for the #AutoML Fall School and to get a *head start* on the @AutoMLDecathlon competition by participating in the Fall School hackathon!!!
We're very excited to be joining forces with the AutoML Fall School 2022!
Register for the #AutoML Fall School today and get a head start on the AutoML Decathlon!!!
Quite literally—*register today*, because today is the last day for early-bird registration for the Fall School!
This allows the pretrained model to 🗺️ navigate 🗺️ the metric space of labels just by using the softmax outputs.
E.g., if your metric space is a grid, but your classifier can only output probabilities for the classes in the corners, you can actually output any class!
[10/n]
Jointly led with @XintongLi0501 (who is actively applying to Ph.D. programs!), with help from @zihengh1, @dyhadila, Spencer Schoenberg, Cheng-Yu Liu, Lauren Pick, Haotian Ma, and with guidance from @fredsala and @awsTO!
[12/n], n=12
How the heck do we do this?
In short,
- we deal with the size of the space by using the metric geometry of the labels, and
- we use the Fréchet mean as a plug-in to ⚙️interface⚙️ a ***pretrained model*** with the metric space
[7/n]
No! In reality, we basically want outputs resembling data structures.
Examples:
- ASTs
- chains of thought
- folded proteins
Most ML builds up to these as best we can using our base primitives, but doesn’t use the native relationships between these objects.
[3/n]
We also have a bunch of other exciting theory results:
- characterizing what classes your pretrained model needs,
- how to optimally expand the set of classes you can predict,
- how to efficiently figure out what classes you can actually predict
[12/n]
Experimentally, we show consistent lifts over just using the standard argmax prediction rule (which we actually generalize, see paper).
We even show that in cases where you actually can predict everything (zero-shot CLIP), just using the metric geometry STILL helps!
[13/n]
Can’t wait to attend @NeurIPSConf tomorrow, my first in-person conference in way too long! And excited to share this experience with several students / collaborators who are finally getting to present their work in person... 1/N
A promising answer is to use Automated Weak Supervision (AutoWS), which replaces labeling functions with weak learners obtained using a small amount of labeled data. Here’s that visualization of the AutoWS pipeline again:
[4/n]
Weak Supervision is a super powerful framework for constructing labeled datasets – instead of actual labels, it relies on having access to several “labeling functions” that are able to produce noisy guesses about the true label Y, given some X.
[2/n]
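For intuition, here’s a tiny invented example of labeling functions for spam detection (real WS systems learn to model the accuracies and correlations of these noisy votes; plain majority vote is used below as a stand-in):

```python
# Invented labeling functions for spam detection; each votes or abstains.
SPAM, NOT_SPAM, ABSTAIN = 1, 0, -1

def lf_mentions_prize(x: str) -> int:
    return SPAM if "you've won" in x.lower() else ABSTAIN

def lf_has_unsubscribe(x: str) -> int:
    return NOT_SPAM if "unsubscribe" in x.lower() else ABSTAIN

def lf_many_exclamations(x: str) -> int:
    return SPAM if x.count("!") >= 3 else ABSTAIN

def majority_vote(x: str) -> int:
    """Aggregate the noisy votes; real label models weight LFs by estimated accuracy."""
    votes = [lf(x) for lf in (lf_mentions_prize, lf_has_unsubscribe, lf_many_exclamations)]
    votes = [v for v in votes if v != ABSTAIN]
    return max(set(votes), key=votes.count) if votes else ABSTAIN

print(majority_vote("You've won a prize!!! Click now!!!"))  # 1 (SPAM)
```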
Also no! We show how you can *reuse a pretrained classifier* that was trained only on a subset of the space to predict anything you want in your complicated label space of data structures.
[6/n]
So yeah, it seems like training models on such a huge label space, if you try to do it without using base primitives, might be pretty much impossible…
Wouldn’t you need training examples representing every possible AST?
[5/n]
Without primitives, these output spaces are huge…
Let’s consider the AST space — Cayley’s formula tells us that the size of the space is actually *worse* than exponential: there are n^(n-2) labeled trees on n vertices. For just 10 vertices, that’s already 10^8 trees.
Like, Zoinks Scoob…
[4/n]
@khoomeik Seems like there should be a way to generalize this to trade off perf and how much you can parallelize it, with this being one extreme and backprop being the other extreme. Neat project!
Let’s start off by unboxing how we make predictions using machine learning…
The majority of ML uses a really simple set of base primitives as labels — binary, multi-class, and regression labels. But is this actually what we want in practice?
[2/n]
But here, we want to somehow plug in a pretrained model to the Fréchet mean…
A natural choice is to directly use the per-class probability estimates from the softmax as the weights!
[9/n]
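Concretely, on a finite label space the weighted Fréchet mean is just an argmin of a weighted sum of squared distances. A minimal NumPy sketch of the grid example (all names and numbers here are mine, for illustration):

```python
import numpy as np

def weighted_frechet_mean(dists: np.ndarray, weights: np.ndarray) -> int:
    """argmin_z sum_k w_k * d(z, c_k)^2 over candidate labels z,
    where c_k are the classes the pretrained model actually knows."""
    return int(np.argmin((dists ** 2) @ weights))

# Toy setup: labels are the 9 cells of a 3x3 grid under Manhattan distance,
# but the classifier was only trained on the 4 corner classes.
cells = [(i, j) for i in range(3) for j in range(3)]
corners = [(0, 0), (0, 2), (2, 0), (2, 2)]
dists = np.array([[abs(a - c) + abs(b - d) for (c, d) in corners] for (a, b) in cells])

softmax_probs = np.array([0.25, 0.25, 0.25, 0.25])  # equal pull toward every corner
print(cells[weighted_frechet_mean(dists, softmax_probs)])  # (1, 1): the center cell!
```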
@BlancheMinerva @zacharylipton Can you elaborate on this? Typically when people say this, they really mean random search with weight sharing, which IS a (single-shot) NAS algorithm.
On the other hand, if you have unlimited time and compute, RS will outperform anything.
Prior work found that the same set of NAS operations was important across the vision tasks of NAS-Bench-201 – we found that this was not true for diverse tasks:
[11/n]
This pipeline works well for text data – it’s easy to write label functions for text. OTOH, it’s quite a bit harder to write these label functions for data with more complex features, such as images or the vast majority of other ML tasks.
[3/n]
@AlexHermstad I’m not aware of any current CS curriculum doing this, no. However, my intro-to-programming class in high school started off by teaching people using a flowchart-based UI algorithm builder that compiled into code if you wanted it to.