LLMs are great, but their internals are less explored. I'm excited to share some very interesting findings from our paper
“Massive Activations in Large Language Models”
LLMs have very few internal activations with drastically outsized magnitudes, e.g., 100,000x larger than others. (1/n)
How to choose a vision model for your specific needs?
How do ConvNet / ViT, supervised / CLIP models compare with each other on metrics beyond ImageNet?
Our work comprehensively compares common vision models on "non-standard" metrics. (1/n)
Diffusion models have achieved remarkable results in visual generation.
We demonstrate that they can also generate neural network parameters, in our new paper:
"Neural Network Diffusion" (1/n)
Neural Network Diffusion
Diffusion models have achieved remarkable success in image and video generation. In this work, we demonstrate that diffusion models can also generate high-performing neural network parameters. Our approach is simple, utilizing an autoencoder and a
Filed my Ph.D. dissertation "Efficient and Scalable Neural Architectures for Visual Recognition" yesterday! Hope this can be helpful to anyone who is interested in neural network architectures, especially if you are looking for a different angle.
A ConvNet for the 2020s
abs:
github:
Constructed entirely from standard ConvNet modules, achieving 87.8% ImageNet top-1 accuracy and outperforming Swin Transformers on COCO detection and ADE20K segmentation
Meta AI is hiring research interns on computer vision and deep learning for 2023 summer and fall! Apply using the link below.
If you are interested in working with me, please also send me an email with your CV and research interests :)
Very excited to share one of the most interesting projects I've ever worked on, but first, a small game:
Here are 15 images from three of the largest and most diverse modern image datasets: YFCC100M, CC12M and DataComp-1B.
Can you guess which images are from which datasets?
Since AlexNet, dropout has been recognized for reducing overfitting.
But did you know it can also mitigate underfitting?
Excited to share our recent paper - "Dropout Reduces Underfitting".
We find early dropout can lead to a lower train loss.
⬇️
Check out our latest work on pruning LLMs!
Reduces size of LLM to half without retraining or weight update, while largely maintaining zero-shot performance.
My favorite part is its simplicity: multiply weights and activations and you get the metric.
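A minimal sketch of a "weights times activations" pruning score, for illustration only. The thread says just that the metric multiplies weights and activations; using the L2 norm of each input feature over a calibration batch, ranking within each output row, and the function names `pruning_scores` and `prune_rows` are all assumptions made here, not details confirmed by the thread.

```python
import math

def pruning_scores(weights, activations):
    """Score each weight as |weight| * norm of its input feature.

    weights: list of rows, one row per output neuron (hypothetical layout).
    activations: list of input vectors, one per calibration token.
    The L2 norm per input feature is an assumption for this sketch.
    """
    n_in = len(weights[0])
    feat_norm = [math.sqrt(sum(x[j] ** 2 for x in activations)) for j in range(n_in)]
    return [[abs(w) * feat_norm[j] for j, w in enumerate(row)] for row in weights]

def prune_rows(weights, scores, sparsity=0.5):
    """Zero the lowest-scoring fraction of weights in each row; no weight update."""
    pruned = []
    for row, srow in zip(weights, scores):
        k = int(len(row) * sparsity)
        drop = set(sorted(range(len(row)), key=lambda j: srow[j])[:k])
        pruned.append([0.0 if j in drop else w for j, w in enumerate(row)])
    return pruned

weights = [[1.0, -0.1, 0.5, 2.0]]
activations = [[0.1, 10.0, 1.0, 0.0], [0.1, 10.0, 1.0, 0.0]]
pruned = prune_rows(weights, pruning_scores(weights, activations))
```

In this toy input the largest weight (2.0) is removed because its input feature is always zero, while the small weight (-0.1) survives because its input is large. Magnitude-only pruning would have made the opposite choice.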
How to reduce the size of a Large Language Model?
Sharing our latest work on pruning LLMs - “A Simple and Effective Pruning Approach for Large Language Models”.
We show LLMs have effective sparse networks without weight update or retraining.
🧵⬇️
Our findings help us better understand what is happening inside LLMs and more generally large Transformers.
Work led by @_mingjiesun, and in collaboration with @endernewton and @zicokolter!
arXiv:
Code:
I'm here in Hawaii too for ICML! The same place where I entered the US for CVPR 2017 and also to start grad school.
Looking to connect with old and new friends! Ping me if you'd like to :)
With 4 borderline rejects and 1 borderline accept after rebuttal (lower before it), I feel incredibly lucky to have this paper accepted to ICML'24
Really appreciate the hard decision from the AC, to accept a paper with no new methods, and the feedback from the reviewers
While they are very rare, massive activations cannot be set to zero - this will destroy the model.
But they can be set to input-agnostic constant values (their means) without hurting the model.
This means massive activations act as fixed but important bias terms in LLMs.
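A rough sketch of the intervention described above: locate activations that dwarf the typical magnitude, then overwrite them with fixed constants instead of zeros. The `ratio` cutoff, the helper names, and the toy numbers are hypothetical; in practice the constants would be means collected over many inputs, which this sketch takes as given.

```python
from statistics import median

def find_massive(hidden, ratio=1000.0):
    """Return (token, feature) positions whose magnitude exceeds ratio x the median.

    hidden: list of per-token feature vectors. The ratio is a hypothetical cutoff.
    """
    mags = [abs(v) for row in hidden for v in row]
    cutoff = ratio * median(mags)
    return [(t, d) for t, row in enumerate(hidden)
            for d, v in enumerate(row) if abs(v) > cutoff]

def set_to_constants(hidden, constants):
    """Overwrite selected positions with input-agnostic constants (e.g., means)."""
    out = [list(row) for row in hidden]  # copy so the original stays intact
    for (t, d), value in constants.items():
        out[t][d] = value
    return out

hidden = [[0.5, -1.0, 2500.0], [1.0, 0.8, -0.6]]
positions = find_massive(hidden)
# Assumed precomputed mean for the massive position (made-up number).
patched = set_to_constants(hidden, {pos: 2400.0 for pos in positions})
```

Replacing the value with a constant keeps the large fixed offset the later layers rely on, which is why it behaves like a bias term, whereas zeroing it removes that offset entirely.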
The greater the paper is, the easier it is to find a reason to reject it? (e.g., not SOTA, too trivial/not novel, no theory/experiment, or hard to understand)
Looking back at history, I find this may be true for papers that are above a certain low threshold.
Diffusion models can do more than generation. Check out our new work on analyzing what's useful in diffusion models for visual representation learning!
@endernewton @sainingxie
Meta presents Deconstructing Denoising Diffusion Models for Self-Supervised Learning
paper page:
examine the representation learning abilities of Denoising Diffusion Models (DDM) that were originally purposed for image generation. Our philosophy is to
Lesson: look beyond pure accuracies! Instead, choose what suits your needs.
Project led by our amazing Kirill Vishniakov (@kirill_vish), who is seeking a PhD position. Hire him! (n/n)
paper:
code:
web:
Exactly! In the ConvNeXt paper, we did convey the same message two years ago on ImageNet-21k, with step by step experiments on what contributed to the ViT > ConvNet misconception
Check out "A ConvNet for the 2020s" if you haven't
Yeah I think people were/are caught up in the hype. It’s cool that google proved out the scaling laws in a way that only google can, but the ConvNeXt paper from Trevor Darrell’s group (and @Meta) in 2020 had the same conclusion on ImageNet:
We call them "massive activations", and they appear in various model sizes and families.
They appear at particular sequence dimensions (e.g., start of sequence, period or newline tokens) and feature dimensions.
I don't know who needs to hear this, but arxiv-utils is the browser extension everyone should use!
It shows you the *actual* titles on the tabs and when you download, not xxxx.yyyyy.pdf. It can also go from the pdf to the abs page.
You don’t have to train from scratch whenever developing a smaller model of an existing model family.
Sharing our latest work - “Initializing Models with Larger Ones”
arxiv preprint:
code:
Joint work with Kaiming He
Check the paper for more!
(non-)code:
arxiv:
(Answer to the game: YFCC: 1, 4, 7, 10, 13; CC: 2, 5, 8, 11, 14; DataComp: 3, 6, 9, 12, 15)
Given the bad situation for ML reviews, should we make paper-reviewer matching a high-stakes AI/NLP challenge (like ImageNet/COCO in vision)? If we use the winning solutions, we might get less random reviews and assignments?
I feel the matching system is not optimized enough...
Excited to be a part of the ImageBind project with the team! Our latest model embeds data from multiple modalities into a shared representation space, enabling representation arithmetic, generations, and more.
Wow, @MetaAI is on open-source steroids since Llama.
ImageBind: Meta's latest multimodal embedding, covering not only the usual suspects (text, image, audio), but also depth, thermal (infrared), and IMU signals!
OpenAI Embedding is the foundation for AI-powered search and
Meta releases Llama 2: Open Foundation and Fine-Tuned Chat Models
paper:
blog:
develop and release Llama 2, a collection of pretrained and fine-tuned large language models (LLMs) ranging in scale from 7 billion to 70 billion
@aaron_defazio
This point closely relates to two of my previous papers:
using L1 sparsity for pruning/slimming a convnet:
demonstrates structured pruning is actually about architectures not weights, and you can train it from scratch
Robustness and Transferability:
1) Supervised models are superior in robustness benchmarks that are ImageNet variants. But when it comes to feature transferability, CLIP models are better.
2) Surprisingly, supervised ConvNeXt almost matches CLIP in transferability! (8/n)
We train 1) an Autoencoder for projecting NN parameters to a latent space (and back), and 2) a standard LDM to learn the distribution of high-performing parameters in the latent space.
The new parameter generation process then follows standard LDMs. (3/n)
1. Added ImageNet-22k ConvNeXt-Tiny/Small models and results
2. Modified Figure 1 so now ResNet & ViT results are with improved training settings
3. Added EfficientNet-V2 into ImageNet result comparison and discussion
Is p-diff only memorizing the neural network parameters used in its training?
Through multiple experiments, we show the answer is no. p-diff generated networks are not identical or similar copies of the models used for training. (5/n)
🚨Excited to announce a large-scale comparison of pretrained vision backbones including SSL, vision-language models, and CNNs vs ViTs across diverse downstream tasks ranging from classification to detection to OOD generalization and more! NeurIPS 2023🚨🧵
Exploring model mistake factors using ImageNet-X:
(1) CLIP models make fewer mistakes relative to their ImageNet accuracy than supervised.
(2) All models suffer mostly from complex factors like occlusion.
(3) Texture is the most challenging factor for all models.
(4/n)
We analyze a wide range of behaviors for 1) ViT and ConvNeXt architectures, 2) supervised and CLIP training methods.
With almost identical ImageNet accuracy within each training method, models can have vastly different behaviors, detailed below. (3/n)
Exploring model calibration on ImageNet and ImageNet-R:
1) CLIP models tend to be overconfident, and supervised models are slightly underconfident.
2) Supervised ConvNeXt outperforms ViT, challenging previous beliefs that ViTs are better calibrated than ConvNets. (6/n)
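For readers who want to check over/underconfidence on their own model, here is a small sketch of one common calibration measure, expected calibration error (ECE), plus a signed confidence gap. The thread does not say which calibration metric was used, so the choice of ECE, the 10 equal-width bins, and the function names are assumptions.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average of |mean confidence - accuracy| over equal-width bins."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += len(members) / n * abs(avg_conf - accuracy)
    return ece

def confidence_gap(confidences, correct):
    """Mean confidence minus accuracy: positive = overconfident, negative = underconfident."""
    n = len(confidences)
    return sum(confidences) / n - sum(1 for ok in correct if ok) / n

confs = [0.9, 0.9, 0.9, 0.9]
hits = [True, True, False, False]
ece = expected_calibration_error(confs, hits)
gap = confidence_gap(confs, hits)
```

Here the model reports 90% confidence but is right only half the time, so both numbers come out near 0.4 and the positive gap marks it as overconfident.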
1. Neural network training and diffusion generation processes are both transitions from random to highly-specific distributions.
2. High-performing NN parameters and high-quality images can both degrade to simple noise distributions, through compounded noise additions. (2/n)
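To make "compounded noise additions" concrete, below is a sketch of the standard DDPM-style forward process applied to a flat list of parameters. The linear beta schedule, its endpoints, and the function names are assumptions borrowed from common diffusion setups rather than the paper's settings, and the thread indicates the actual diffusion runs in an autoencoder's latent space rather than on raw parameters.

```python
import math
import random

def alpha_bars(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear beta schedule (assumed)."""
    out, prod = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        prod *= 1.0 - beta
        out.append(prod)
    return out

def noise_params(params, alpha_bar, rng):
    """Sample x_t = sqrt(alpha_bar) * x_0 + sqrt(1 - alpha_bar) * eps, eps ~ N(0, 1)."""
    signal = math.sqrt(alpha_bar)
    noise = math.sqrt(1.0 - alpha_bar)
    return [signal * p + noise * rng.gauss(0.0, 1.0) for p in params]

schedule = alpha_bars(1000)
rng = random.Random(0)
trained = [0.3, -1.2, 0.7]  # stand-in for a flattened set of trained parameters
almost_noise = noise_params(trained, schedule[-1], rng)
```

By the last step the signal coefficient is close to zero, so the sample is nearly pure Gaussian noise regardless of the starting parameters. The generative model learns to run this corruption in reverse.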
Transformation Invariance (scale, shift, and resolution transform):
1) Supervised ConvNeXt is the most invariant model for all of the transforms.
2) Overall, models are more robust to shift than to scale/resolution. (9/n)
Exploring shape/texture bias on cue-conflict images:
CLIP models are more shape-biased, showing improvements of 7% and 12% over supervised ViT and ConvNeXt. (7/n)
@giffmana
We observed this too in our 2016 Stochastic Depth project. Even when the loss is plateauing, it's still better to wait a bit before step-decaying the lr. We didn't document this in the paper though. Curious if there's any explanation
Why go beyond ImageNet accuracy?
Choosing a model for practical tasks with different conditions naturally demands looking beyond standard performance measures.
As more models achieve similarly high ImageNet accuracy, the number also becomes a little saturated. (2/n)
Our results suggest these large-scale modern vision datasets are still incredibly biased in the eyes of neural networks.
We hope our discovery will inspire the community to rethink the issue involving dataset bias and model capabilities.
Back in 2011, the Torralba and Efros paper below called for a battle against dataset bias in the community, right before the dawn of the deep learning revolution.
They found an SVM can classify images' dataset identity from 12 datasets much better than random guessing.
100% agree. I find there are situations where 1. maximizing paper's impact for general readers, and 2. trying to get it accepted, leads to different ways of writing. This shouldn't be a choice at all but sometimes it is... very unfortunate
A structural issue in research is the short-term focus on getting papers accepted.
The optimization for good reviews, however, can be very local and is often uncorrelated with long-term impact.
@jxmnop
We have a very relevant discussion on dropout and overfitting / underfitting at the intro of our paper "Dropout Reduces Underfitting". Recommend a read for anyone interested in this topic
There is so much we still don't know about the most basic components of deep learning. Curious to learn & explore more!
Joint work with @OscarXu96574719, Joseph Jin, @szq0214, and @trevordarrell
We are excited to present our findings at ICML 2023! Code:
(2/2) We redesign dense prediction vision models so that they output early results progressively, and use the confidence values at different spatial locations to guide later computations. It can save up to 50% total computation while giving additional early predictions
Our further experiments show that such a dataset classifier could learn semantic features that are generalizable and transferable, which cannot be simply explained by memorization.
For example, we report 84.7% accuracy on held-out validation data for the three-way classification problem consisting of the YFCC, CC, and DataComp datasets, samples of which were shown at the start of this thread.
In this work, we revisit this “dataset classification” experiment suggested by Torralba and Efros, in the new era with large-scale, diverse, and hopefully less biased datasets as well as more capable neural network architectures.
@tienhaophung
Yes that's a great paper and the most relevant!
The main difference is they generate parameters step by step, more like an optimizer, taking a previous checkpoint as input. We directly generate the whole set of parameters without previous weights as inputs.
Though the game on modern datasets might seem hard for humans, surprisingly, we observe that modern neural networks can achieve excellent accuracy in classifying which dataset an image is from.
@rasbt
@francoisfleuret
I'd like to clarify a bit: our paper finds *early dropout* reduces underfitting, and it's not necessarily only for ViTs, but also for other models.
Thanks for bringing up our paper though!
@sidgairo18
We experimented with SSL models - MAE (ViT) and ConvNeXt V2, in our initial tests. They have similar behaviors as our supervised models, possibly because they are also pure vision models, and fine-tuned on ImageNet-1K (needed for many evaluations). So we didn't include them.
Thoughts: humans and many other species seem to be trained to reproduce themselves, and in that process we gained intelligence. If we somehow train models using "reproducing themselves" as the objective, and if they indeed learn very well, will we soon be in the danger zone?
Our analysis of network training dynamics revealed an interesting insight - using dropout in early training can reduce mini-batch gradient variances.
It effectively balances the stochasticity of SGD, enabling more consistent, whole-dataset aligned updates
@konstmish
Thanks for the thread! I enjoyed it very much.
Regarding SVRG, check out our alpha-SVRG paper where we find a way to make SVRG useful in deep learning!
@AlexGDimakis
Great question! They do appear from early in training, but we haven't closely followed how they change. We'll look into it and plan to add it
@karpathy
I find it so hard to press all the keys together; same for pasting without formatting in many Microsoft Office products. They are disasters for ergonomics
I mapped the 4 screenshot keys to a single key on my Logitech keyboard
Inspired by this insight, we introduce "early dropout" for enhancing the fitting capabilities of smaller/underfitting models.
We also propose a complementary method - "late dropout" for a more refined regularization of larger/overfitting models.
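One simple way to picture the two schedules is a drop rate that depends on the training epoch. This is only a sketch: the hard on/off switch, the `rate` and `switch_epoch` values, and the function name are hypothetical choices, not the paper's settings.

```python
def drop_rate(epoch, mode, rate=0.1, switch_epoch=50):
    """Return the dropout rate to use at a given epoch.

    mode="early": dropout only before switch_epoch (aimed at underfitting models).
    mode="late":  dropout only from switch_epoch on (aimed at overfitting models).
    rate and switch_epoch are illustrative values.
    """
    if mode == "early":
        return rate if epoch < switch_epoch else 0.0
    if mode == "late":
        return 0.0 if epoch < switch_epoch else rate
    raise ValueError(f"unknown mode: {mode!r}")

early = [drop_rate(e, "early") for e in (0, 49, 50, 99)]
late = [drop_rate(e, "late") for e in (0, 49, 50, 99)]
```

In a training loop, the returned value would be assigned to the dropout layers at the start of each epoch (for example, by setting the `p` attribute of each `torch.nn.Dropout` module in PyTorch).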
@peroxycarbonate
It's an impressive and highly related work. The main difference is that our network is for recognition while theirs is for further generating 3D data, so in some sense their usage of diffusion models is ultimately for visual generation.
It doesn't seem right that the design of a system that affects many people's education and careers is driven only by the goodwill of the OpenReview and TPMS authors... we as a community should give more attention to this task
@ahatamiz1
Our comparisons are contextualized in each property, and we try not to make generic statements. Our overarching message is to choose models based on specific needs, rather than to recommend concrete models
@thegautamkamath
@shadow_dnv
@rasbt
Someone can never prove themselves capable of doing "independent research" if all their papers have more than one author. It is an impossible criterion to evaluate in my opinion, so it should be deprecated or at least improved.
@giffmana
Basically, yes, it is not effective when we double the default batch size
We couldn't reply to you directly during the ICML review period because of the social media ban. But thank you for volunteering to review!
@de_JQK
@thegautamkamath
@shortstein
@icmlconf
Thank you for bringing up our "Rethinking" project. I would just like to add that in that project we also experimented with and discovered the effects of learning rates on LTH :)
@thegautamkamath
@shadow_dnv
@rasbt
I get this point on developing a unique research vision, but sometimes I don't get the emphasis on "independence". Almost all papers have more than one author. If someone is truly 100% independent and did 100% of the work, then they should write single-author papers.
@soumithchintala
@arimorcos
@WonderMicky
@tydsh
We had this transferring pruned structure experiment in: . We didn’t use the original init but used random reinit. The sparsity pattern is also visualized and has pretty clear patterns. Also we showed only “avg pattern” is needed, not the exact pattern.
@LinjieXu
@thegautamkamath
@shadow_dnv
@rasbt
I agree, that is the right thing to strive for! The word "independence" seems to convey a different thing, at least in a literal sense, and we may want to change the word :)
@LoadingALIAS
@anshulkundaje
Thank you! Yeah for adversarial example related stuff we only have ImageNet-A as part of the robustness. Yes it would be interesting to see the conventional adversarial results
@ahatamiz1
It's hard to include models of all sizes. We prioritized the number of properties in this work, so only used 4 models we think are most representative for more clarity
@thegautamkamath
@LinjieXu
@shadow_dnv
@rasbt
I've also seen people saying that the most important thing for being *admitted* to a PhD program is to demonstrate you can do independent research.. what? 🤣 That is also why I feel this criterion is often abused