I'm incredibly excited to announce our new company, @datologyai!
Training models is hard, and identifying the right data is the most important and difficult part -- our goal @datologyai is to make optimizing training data at scale easy and automatic across modalities.
Neural scaling laws are great for predictability, but power law scaling is slow, especially in the large data regime when 10x the data results in small gains. Can we do better? We show that exponential scaling is possible via intelligent data pruning.
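For concreteness, here's the contrast in functional forms (a schematic sketch only; the constants $a$, $b$, $c$ and the exponent $\alpha$ are illustrative placeholders, not the paper's fitted values):

```latex
% Power-law scaling: test error falls polynomially in dataset size N,
% so in the large-data regime each 10x of data buys only a small gain.
% With intelligent pruning, error can instead fall exponentially in the
% size of the kept (high-quality) subset.
\[
\underbrace{E(N) \approx a\,N^{-\alpha}}_{\text{power law, random data}}
\qquad\longrightarrow\qquad
\underbrace{E(N) \approx b\,e^{-cN}}_{\text{exponential, pruned data}}
\]
```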
Web-scale data has driven the incredible progress in AI, but do we really need all that data?
We introduce SemDeDup, an exceedingly simple method to remove semantic duplicates in web data which can reduce the LAION dataset (& train time) by 2x w/ minimal performance loss.
🧵👇
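For the curious, here's a minimal sketch of the semantic-dedup idea, assuming embeddings from any pretrained encoder (an illustration of the concept, not the paper's implementation; the 0.95 threshold is a made-up placeholder):

```python
import numpy as np

def semantic_dedup(embeddings: np.ndarray, threshold: float = 0.95) -> np.ndarray:
    """Greedy semantic dedup: keep an example only if its cosine similarity
    to every already-kept example is below `threshold`. (SemDeDup scales
    this up by first k-means clustering the embeddings and deduplicating
    within clusters; this brute-force version is O(n^2).)"""
    # Normalize rows so dot products are cosine similarities.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    kept = []
    for i, vec in enumerate(normed):
        if not kept or (normed[kept] @ vec).max() < threshold:
            kept.append(i)
    return np.array(kept)

# Toy usage: 1000 fake "image embeddings", half of them near-duplicates.
rng = np.random.default_rng(0)
base = rng.normal(size=(500, 128))
dupes = base + 0.01 * rng.normal(size=base.shape)  # semantic near-copies
print(len(semantic_dedup(np.vstack([base, dupes]))), "of 1000 examples kept")
```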
We know invariance is important for generalization, but what is the source of this invariance? Does it come from the architecture, augmentations, or the data itself?
In our #NeurIPS2021 paper led by @marksibrahim and @D_Bouchacourt, we aim to find out.
Most approaches to learning generalizable representations have focused on constraining the structure of the representation.
But what if you instead constrain *how representations can be manipulated*?
We introduce latent canonicalization to test this:
Are all negatives created equal in contrastive instance discrimination?
In new work led by Tiffany Cai, we show that only the hardest 5% of negatives per query are both necessary and largely sufficient for self-supervised learning.
Tweetprint time!
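A rough sketch of what "hardest 5% of negatives per query" means operationally (illustrative numpy, not the paper's training code; here "hard" simply means most similar to the query in embedding space):

```python
import numpy as np

def hardest_negatives(query: np.ndarray, negatives: np.ndarray,
                      frac: float = 0.05) -> np.ndarray:
    """Return indices of the hardest `frac` of negatives for one query,
    ranked by cosine similarity to the query embedding."""
    q = query / np.linalg.norm(query)
    n = negatives / np.linalg.norm(negatives, axis=1, keepdims=True)
    sims = n @ q                         # cosine similarity per negative
    k = max(1, int(frac * len(negatives)))
    return np.argsort(sims)[-k:]         # most similar = hardest

rng = np.random.default_rng(0)
q, negs = rng.normal(size=128), rng.normal(size=(4096, 128))
print(f"kept {len(hardest_negatives(q, negs))} of {len(negs)} negatives")
```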
Repeat after me: data >>>>>> architecture.
Given enough quality data, many different architectures can achieve comparable performance. The secret sauce was, is, and remains the data, not the model.
Announcing RecurrentGemma!
- A 2B model with open weights based on Griffin
- Replaces transformer with mix of gated linear recurrences and local attention
- Competitive with Gemma-2B on downstream evals
- Higher throughput when sampling long sequences
Transformers are great, but I think their importance to the insane progress of the last few years has been massively overstated.
The key was and is larger, higher quality datasets.
Compute is all you need.
For a given amount of compute, ViTs and ConvNets perform the same.
Quote from this DeepMind article: "Although the success of ViTs in computer vision is extremely impressive, in our view there is no strong evidence to suggest that pre-trained ViTs…"
Excited to share our blog post on our @ICLR18 paper!
*Easy-to-interpret neurons are no more important than hard-to-interpret neurons
*Generalizing networks are more robust to neuron deletion than memorizing networks
Blog:
Paper:
Recent studies have suggested that the earliest iterations of DNN training are especially critical. In our #ICLR2020 paper with @jefrankle and @davidjschwab, we use the lottery ticket framework to rigorously examine this crucial phase of training.
Timely paper from @ShibaniSan, Dimitris Tsipras, @andrew_ilyas, and @aleks_madry providing some new insights into why batch norm works. They perform a number of clever experiments to work it out, finding that internal covariate shift is a red herring!
Just read through the @distillpub interpretability blog post from @ch402 and others. Stunning (and fun!) visualizations, but I wonder: what did these visualizations actually teach us about these networks? What do we know now that we didn't know before?
I'll be at #NeurIPS2022 this week! Will be presenting "Beyond neural scaling laws" (Outstanding Paper) Wed morning and at the @MetaAI booth, including for our AI Residency Q&A Wed at lunchtime. Excited to see old friends and new faces. DM me if you want to chat!
Thrilled to announce that @datologyai recently raised a $46M Series A!
We'll be using these funds to grow our team and compute to push the frontier of data research to make data curation easy for everyone. We're hiring across engineering and research!
Our mission at @datologyai is to enable anyone to train powerful AI models by making data curation and optimization easy. Hear more about our mission here:
Happy to (finally!) share our work on Generative Query Networks! In particular, I'm excited about our efforts to make sense of the representations these networks learn. Here are some of the most interesting findings:
Blog:
Paper:
Very excited to announce our #ICML2019 workshop on Identifying and Understanding Deep Learning Phenomena! We're looking for papers which rigorously test whether commonly held but unproven intuitions/beliefs about DNNs are actually true.
#DeepPhenomena
Why does pruning during training often *improve* generalization?
In our #NeurIPS2020 paper led by @bartoldson, we introduce the generalization-stability tradeoff, in which decreasing pruning-induced stability leads to better generalization.
Beautiful paper from @ukhndlwl performing a battery of experiments to evaluate the sensitivity of LSTMs' long-range dependencies to word order, part of speech, word frequency, and more. Would be awesome to see these tests become part of the standard evaluation of RNNs!
Excited to share our recent work showing that pooling is neither necessary nor sufficient for appropriate deformation stability in CNNs! Rather, filter smoothness most directly modulates deformation stability in networks both *with* and *without* pooling.
Finally had a careful read of the "Textbooks are all you need" paper (). Is anyone else surprised that there are not, in fact, any textbooks in the dataset?
How can we combine the best features of both CNNs and ViTs?
In our @icmlconf paper, led by @stephanedascoli, we introduce ConViT, a ViT with soft convolutional inductive biases which the model can learn to ignore.
Paper:
Blog:
Check out our latest preprint (with @maithra_raghu and Samy Bengio) on how representational similarity relates to generalization and optimization in CNNs, and to training and sequence dynamics in RNNs! 1/N
Blog:
Paper:
Thought-provoking work from @julius_adebayo, @goodfellow_ian, and others showing that saliency maps don't change much, even when *all* the weights are reinitialized! Do saliency maps tell us what the network cares about, or merely what the task demands?
Important work from @sarahookr, @doomie, @piekindermans, and @_beenkim on quantifying the extent to which various saliency methods *actually* find relevant portions of images.
Really happy to see more work towards bringing rigor into interpretability!
Excited and honored that our paper on beating power law scaling via data pruning earned an outstanding paper award at #NeurIPS2022! Congratulations to my amazing co-authors, Ben Sorscher, Robert Geirhos, @sshkhr16, and @SuryaGanguli!
Paper:
Excited to share our blog summarizing some of our recent work understanding the boundaries of the lottery ticket hypothesis! Do lottery tickets generalize across datasets? Is the phenomenon present in RL and NLP? Can we begin to explain it theoretically?
Very cool paper from @skornblith, Jon Shlens, and Quoc Le evaluating what factors lead to better feature transfer. I wonder if ResNets are best for transfer because their identity connections prevent task-irrelevant information from being thrown away...
Today we're sharing our work on PUG, new research from Meta AI on photorealistic, semantically controllable datasets using Unreal Engine for robust model evaluation.
More details & dataset downloads ➡️
Another excellent example that data >>> model, this time from neuroscience.
Two models trained on the same data will behave similarly despite architectural differences (given enough data), even when the comparison is between biological and artificial neural nets.
5/ Surprisingly, across both measures, we found that even major differences in model architecture (e.g. CNNs vs. Transformers) did not lead to significantly better or worse brain alignment…
While the lack of qualified reviewers is certainly a problem, I wish more authors viewed reviewer misunderstandings as a signal to invest more in the clarity of their writing and figures, rather than simply complaining about the reviewer.
How do 'map-less' agents navigate? They learn to build implicit maps of their environment in their hidden state!
We study 'blind' AI navigation agents and find the following 🧵
Thanks for sharing our work showing data curation can help us train models far faster to far better performance, @martin_casado!
Someone should figure out how to make it easy for companies who want to train their own models to use data curation at scale... 🤔🤔🤔
(stay tuned)
Most people engaged in the safety discussion don't have a good sense of how hard it is to get past power law scaling (error falls off only as a power of training set/model size).
The industry is fighting against rapidly diminishing marginal returns.
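To make the diminishing returns concrete, a worked example with an illustrative exponent (not a measured one): if error scales as $E(N) \propto N^{-\alpha}$ with $\alpha = 0.1$, then halving the error takes over 1000x more data:

```latex
\[
\frac{E(kN)}{E(N)} = k^{-\alpha} = \frac{1}{2}
\quad\Rightarrow\quad
k = 2^{1/\alpha} = 2^{10} = 1024 .
\]
```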
Do lottery ticket initializations generalize, or are they overfit to the precise conditions used to generate them? If you're at #NeurIPS2019, come see our poster #170 happening right now!
Paper:
Blog:
Though it seems amazing at first, universality isn't actually a particularly useful property. Learnability is what we actually care about. A solution which is possible but not learnable might as well not exist.
Also worth noting that while we only looked at class selective neurons, our recent @ICLR18 paper found that easily interpretable neurons were no more important than confusing neurons, which is at odds with the motivation behind many of these techniques.
It never made sense to me how startups could ever beat big tech.
After working in big tech for years and seeing the politics, it now makes perfect sense.
It all comes down to incentives.
When a startup is competing against a large competitor, they aren't competing with the *entire* company; they are likely competing with some PM focused on internal politics/career progression.
With this framing, it shouldn't be surprising to see startups win as often as they do.
A key part of bringing this cost down was @code_star and team's focus on increasing data quality to make training ~2x more token efficient.
Massive efficiency gains can be had through better data! And because models don't converge -- compute multipliers = quality multipliers.
Just $10M and two months to train a GPT-3.5/Llama-2-level model from scratch. For context, it probably cost OAI 10-20x more just a year ago!
The more we improve as a field thanks to open source, the cheaper & more efficient it gets!
All companies should now train their own
At @datologyai, we are pushing the frontier of data research to build products that make it easy for anyone to make the most of their data, automatically.
We're hiring for a number of roles across research and engineering. If you're excited about data, please join us!
Happy to release our review (with @dgtbarrett and @jakhmack) exploring ways in which the machine learning and neuroscience communities might interact to best advance analysis and understanding of neural networks, whether they're biological or artificial!
Given that data is the most important piece of training foundation models, I'd focus the Manhattan Project on that.
Coordinate with many experts to establish the most comprehensive, high-level database of human knowledge that is otherwise difficult to access.
Nice summary of the advantages and disadvantages of the increased interest in ML, especially with respect to incentive schemes, from @zacharylipton at the Critiquing Trends Workshop.
#NeurIPS2018
Introducing Meta Llama 3: the most capable openly available LLM to date.
Today we’re releasing 8B & 70B models that deliver on new capabilities such as improved reasoning and set a new state-of-the-art for models of their sizes.
Today's release includes the first two Llama 3 models.
This is why data curation can lead to big performance gains for the same training budget. As models learn, less and less of the data they see is useful, so the rate of learning slows. Eventually we just decide to stop training, but if models see better data, they'll learn faster!
On my way to #NeurIPS2018! Happy to chat, especially about ways to build a science of deep learning!
Also excited to present our work (joint with @maithra_raghu) on using CCA to understand DNNs on Wednesday morning!
Blog:
Paper:
How do we rigorously measure abstract reasoning capabilities in neural networks? Can we clearly define different types of generalization? With @santoroAI, @dgtbarrett, Felix Hill, and Tim Lillicrap, we introduce a new dataset in our @icmlconf paper to try!
Measuring abstract reasoning in neural networks - our latest #ICML2018 paper - takes inspiration from human IQ tests to explore abstract reasoning and generalisation in deep neural networks, by @dgtbarrett, Felix Hill, @santoroAI, @arimorcos, and Tim Lillicrap.
I wonder if all of these visualization methods are actually Rorschach ink blot tests for ML researchers. We see in them what we want to see, which may often just be the easiest to understand explanation.
Some personal news: yesterday was my first day at @datologyai! I will be working on what I consider to be the most interesting problem in data engineering: curating training datasets for machine learning models.
Inspired by word vector algebra, we also tested whether we could perform "scene algebra". Can you add and subtract the GQN representations for different scenes to create new objects? Yes, you can! This provides further evidence for factorized representations.
Big labs all use active learning data filtering techniques for training -- offering them as a service to enterprise users for fine-tuning seems like a good startup idea.
Great to see that our recent #NeurIPS2019 paper on generalizing lottery tickets () has been reproduced as part of the NeurIPS reproducibility challenge by @Deepak120199, @VarunGohil9, and Atishay Jain!
Report:
Really excited that this is finally out! In work led by @erikwijmans, we show that agents with no sensory input beyond ego-motion can effectively navigate novel environments and do so by storing maps in memory despite no prior for mapping.
📣 New paper: Emergence of Maps in the Memories of Blind Navigation Agents
Humans have the ability to navigate poorly lit spaces by relying on touch and memory. Our research shows that blind AI agents can learn to do the same.
Read the paper ➡️
IMO fairness and bias are the most critical near-term AI risks, yet unfortunately they often get swept under the rug in favor of x-risk discussions.
Fairness and bias ultimately come down to data. If your data is skewed relative to the real world, your model will be too!
I just wish we spent more time addressing the real risks with AI - e.g. fairness, bias, privacy, sustainability - instead of talking about science fiction doomsday scenarios.
"Motivating the Rules of the Game for Adversarial Example Research"
Really important perspective on adversarial examples from @jmgilmer, @goodfellow_ian, George Dahl, et al. asking if the security motivation for adversarial research actually makes sense.
Interested in the intersection between data curation, privacy, and fairness?
@kamalikac, Chuan Guo, and I are looking to hire a postdoc at @MetaAI (FAIR) to investigate these directions. We encourage candidates of all backgrounds to apply.
Very cool reproduction of world models showing that an untrained RNN is basically just as good! Perhaps we as a field should revisit reservoir computing. Or, alternatively, having an RNN may aid credit assignment if gradients can flow through...
Very true and also something I've been guilty of saying in the past.
Transformers have far weaker inductive biases than CNNs or RNNs, but they do still exist.
However, this weakness allows them to easily learn the appropriate inductive bias given enough (quality) training data.
There is a commonly held belief that Transformers have no inductive bias and that this bias is learned throughout the training process. This is not true. Transformers have very strong inductive biases.
At #ICML2019? Interested in understanding what's *actually* going on in our networks? Come to our workshop on #DeepPhenomena today in Hall B! We have an awesome lineup of speakers and papers!
We are excited to announce our investment in @datologyai! 🚀 Led by @arimorcos, Datology is tackling the crucial challenge of data curation, ensuring models are fed quality data for superior performance & efficiency.
@_RobToews shares why we invested in this week's #RadicalReads:
Very nice concurrent work complementing our findings in SemDeDup regarding high levels of duplication in web datasets like LAION.
Love the application to finding more copied images in generative models!
Extremely well-reasoned thread on why open-sourcing foundation models isn't actually problematic, but rather necessary. I particularly agree with the hubris point -- LLMs are not that complicated! Building them is within the resources of many companies, let alone state actors.
Does the lottery ticket hypothesis generalize to RL and NLP? Come check out our #ICLR2020 paper to find out! First poster session starting now!
ICLR:
Paper:
Blog post:
Are easy-to-interpret neurons helpful to performance in CNNs? In a new blog post, @leavittron and I summarize our work evaluating the causal impact of selective neurons, finding that easily interpretable neurons can actually be harmful to performance.
Hello world
I’m happy to share that I’m starting a new position as AI Resident at Meta AI (#Facebook AI).
A big THANK YOU to all the people I worked with/learned from till this point, especially during the last 18 months.
We are back! @arimorcos joins the podcast and talks about the role of different layers in a network. We also discuss Ari's journey from neuroscience research to ML.
This was a good one, and our last episode in season 2.
Time has flown, thanks for all the support!
This is a serious concern for the training of future models. We're polluting the Internet with low quality data. Synthetic data absolutely has its place, but it has to be targeted and curated.
My favorite part of the YOLOv3 paper: the section on things which didn't work. More and more papers should include these and reviewers should demand them! It would be amazing if such sections became as widespread as related work sections.
A good rule of thumb: high quality dataset work is probably more important and impactful long-term than whatever you're currently working on.
(Obviously this doesn't apply to *tweeting* which is well-known to be the highest leverage work in ML research 🤭)
@TechnologyPat @markchen90 If I ran OAI and believed what they claimed to, I would have faked GPT-3 not working. I would have done everything I could to avoid the scaling race that OpenAI has caused. I wouldn't have rolled out insecure GPT integrations. I wouldn't have given it internet access.
And related to the move towards confusing (or "mixed selectivity") neurons in neuroscience, with lots of recent results showing that they carry tons of information. E.g., from @MattiaRigotti and many others.
This has never made sense to me, but is common practice in big tech. The cost of ramping someone up is significant, and you have far more signal that current high performers will perform well in the future than you do from a one-day interview.
And yet this is standard.
MSFT stock grants
New hire: can’t code. has 30k twitter followers. $800k over 4 years.
Current eng: finds 500ms backdoor, saves world. $67k retention grant, vests over 5 years.
Tomorrow at 9:00 am!
How To Be A Good Citizen Of The CVPR Community, Ballroom E
Talks on how to write a paper, give a talk, review, do research, manage time, mentor, lead, be inclusive, do reproducible research, collaborate, ... I can't wait!
#CVPR18
When all that matters is compute and data, the highest leverage comes either from working on compute or data.
Personally, I don't know how to make GPUs go faster, so...
It's stunning that researchers think it's appropriate to try to predict "criminality" based on appearances. This is modern day phrenology and should have no place in our field. I have signed and I encourage others to do so as well.
Springer Nature plans to publish an article "A Deep Neural Network Model to Predict Criminality Using Image Processing" that revives long discredited physiognomist pseudoscience.
Sign this petition to urge @SpringerNature to refrain from publishing. RT!
@erikwijmans @ManolisSavva @stefmlee @irrfaan @DhruvBatraDB I'm especially pleased by this quote from the award committee regarding our paper:
“I hope that the demonstrated rigor in building up an argument towards answering questions about learned representations will inform future studies across the ICLR community.”
Couldn't agree with this more. If you write a paper such that a reader can easily understand the motivation for each experiment, the results will often seem "obvious" even if no one would have predicted them before reading the paper!
Food for thought for #acl2020nlp reviewers: if the work seems "trivial", "expected", or "straightforward", this isn't necessarily a bad thing. In fact, it may mean that the authors did a good and convincing job.
@pmphlt has a nice take on this:
Though it's not always valued as such, *communicating* science is just as important as doing science. If a paper is difficult to understand, fewer people will read it and fewer will build on it.
These are some really fantastic tips for increasing clarity in paper writing.
Sharing one idea I found useful for paper writing:
Do NOT ask people to solve correspondence problems.
Some Dos and Don'ts examples below:
*Figures*: Don't ask people to match (a), (b), (c) ... with the descriptions in the figure caption.
"Science aims to understand and explain." Couldn't agree more with Joelle Pineau's
#ICLR2018
talk. We need to view our advances skeptically, and make sure we understand *why* they work. Otherwise, our community will constantly trip trying to build on results which don't hold up.
Saliency, activation maximization, etc. give us the impression of understanding, but it's often extremely difficult to express the conclusion of this "understanding." Absent a falsifiable hypothesis and rigorous quantification, can we actually say we've learned anything?
Really enjoyed this conversation with @prateekvjoshi on the Infinite ML pod about the potential of fully automated data curation and our mission at @datologyai. Have a listen if you'd like to learn more!
The topic on Infinite ML pod today is algorithmic data curation.
We have @arimorcos to talk about it. He's the cofounder and CEO of @datologyai.
In this clip, he talks about the odds that the next data point in your training dataset is going to teach something new to your AI.
As someone who came up in neuroscience, where journal publication is all that matters, the fixed time from submission to publication for conferences is spectacular.
If machine learning research papers are meant for journals rather than for conferences, the original GAN paper might have been published sometime in 2016. DCGAN might have been published in 2018, if at all.
"Towards falsifiable interpretability research"
In this position paper, @leavittron and I discuss case studies to exemplify the importance of falsifiability in interpretability research. Intuition is important, but unverified intuition can be dangerous.
Fun blog post from @ericjang11 demonstrating that even "dumb" learning rate schedules (such as following the pixels of arbitrary images) work better than a fixed learning rate. Also a great example of how simple experiments can make a general point!
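As a flavor of what such a "dumb" schedule might look like, here is my own toy sketch (not the blog's actual code): flatten an arbitrary image's pixels in raster order and rescale them into a learning-rate range.

```python
import numpy as np

def pixel_lr_schedule(image: np.ndarray, num_steps: int,
                      lr_min: float = 1e-4, lr_max: float = 1e-1) -> np.ndarray:
    """Turn an arbitrary grayscale image into a per-step learning rate:
    read pixels in raster order, resample to `num_steps`, and rescale
    intensities into [lr_min, lr_max]."""
    pixels = image.astype(float).ravel()
    # Resample the pixel sequence to the desired number of training steps.
    steps = np.interp(np.linspace(0, len(pixels) - 1, num_steps),
                      np.arange(len(pixels)), pixels)
    # Rescale pixel intensities into the learning-rate range.
    lo, hi = steps.min(), steps.max()
    return lr_min + (steps - lo) / (hi - lo + 1e-12) * (lr_max - lr_min)

# Toy usage with a random "image" standing in for an arbitrary picture.
rng = np.random.default_rng(0)
schedule = pixel_lr_schedule(rng.integers(0, 256, size=(64, 64)), num_steps=1000)
print(schedule[:5])
```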
I literally cannot agree more with this sentiment. Well-controlled, rigorous toy experiments >> unclear large-scale experiments. Unfortunately, much of the field doesn't agree (e.g., criticism of the beautiful adversarial spheres paper for being too toy: )
Carefully designed, small-scale toy experiments can explain why your algorithm works better than the baseline in a controlled context, and they are often more useful than demonstrating incremental SOTA improvements on large-scale, compute-intensive tasks.
100% agree.
General purpose models make a ton of sense for consumer use cases, but enterprises can benefit tremendously from smaller, specialized models trained on a company's own data because tasks are far better constrained.
Fixing hallucinations in general LLMs is very hard.
Fixing hallucinations in enterprise LLMs is easy.
This is by design. Why?
⚪️ For the general LLM, you want it to perform all sorts of tasks, from 5th grade science questions to bedtime stories to every enterprise use case.