Are small models still undertrained?
We are releasing a 2B model that beats GPT-3.5. The crazy part is that it was distilled on only 2T tokens from a small model.
Distillation is the future of LLMs with the growing availability of large and efficient open models!
Rumor has it that an @OpenAI announcement is in the works for Thursday, October 17; unsure of the nature, as it could be a GPT-4o model update or a public rollout of SearchGPT. One thing is for sure: it will NOT be a new frontier model.
We are releasing a series of visual features that are performant across pixel- and image-level tasks. We achieve this by training a 1B-param ViT-g on a large, diverse, and curated dataset with no supervision, then distilling it to smaller models. Everything is open-source.
Announced by Mark Zuckerberg this morning — today we're releasing DINOv2, the first method for training computer vision models that uses self-supervised learning to achieve results matching or exceeding industry standards.
More on this new work ➡️
Gemma 2 27B is now the best open model while being 2.5x smaller than alternatives! This validates the work done by the team and Gemini. This is just the beginning 💙♊️
Super excited to share new open LLMs from FAIR with our research community. In particular, LLaMA-13B is competitive with GPT-3 despite being 10x smaller.
Today we release LLaMA, 4 foundation models ranging from 7B to 65B parameters.
LLaMA-13B outperforms OPT and GPT-3 175B on most benchmarks. LLaMA-65B is competitive with Chinchilla 70B and PaLM 540B.
The weights for all models are open and available at
1/n
We have a long history of supporting responsible open source & science, which can drive rapid research progress, so we’re proud to release Gemma: a set of lightweight open models, best-in-class for their size, inspired by the same tech used for Gemini
To support innovation in computer vision, we’re making DINOv2 available under the Apache 2.0 license + releasing a collection of DINOv2-based dense prediction models for semantic image segmentation and monocular depth estimation.
Try our updated demo ➡️
Few understand that the real star of the release is the 9B and not the 27B
Trained on 8 trillion tokens with knowledge distillation (presumably with the 27B as the teacher model)
It blows the competition (<15B) out of the water
At Q6/Q8, it's ~10GB of VRAM, making it a powerful model for
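For a rough sanity check on that footprint, here is a back-of-the-envelope sketch (my own approximation, not from the thread): weights at ~6.5 bits per parameter for a Q6_K-style quant and ~8.5 bits for a Q8_0-style quant, plus a small allowance for the KV cache and runtime buffers.

```python
def quantized_vram_gb(n_params_billion: float, bits_per_weight: float,
                      overhead_gb: float = 1.5) -> float:
    """Rough VRAM estimate: quantized weights at the given bit-width plus a
    fixed allowance for KV cache, activations, and runtime buffers."""
    weights_gb = n_params_billion * 1e9 * bits_per_weight / 8 / 1e9
    return weights_gb + overhead_gb

# 9B model at ~6.5 bits (Q6_K-ish) vs ~8.5 bits (Q8_0-ish); both land near 10GB.
print(quantized_vram_gb(9, 6.5))   # ~8.8 GB
print(quantized_vram_gb(9, 8.5))   # ~11.1 GB
```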
A 9B open model that surpasses some of the best open models! Would have loved to be the one claiming this win, this is massive! Congrats @yumeng0818, @xiamengzhou, and @danqi_chen!
First, it's a Prefix-LM. Full attention between image and prefix (=user input), auto-regressive only on suffix (=model output).
The intuition is that this way, the image tokens can see the query and do task-dependent "thinking"; if it were full AR, they couldn't.
Results agree:
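For concreteness, here is a minimal sketch of the attention mask this describes (my own illustration, not code from the thread): bidirectional attention over the image + prefix block, causal attention over the suffix.

```python
import torch

def prefix_lm_mask(num_prefix: int, num_suffix: int) -> torch.Tensor:
    """Boolean attention mask: True = attention allowed.

    Prefix tokens (image + user input) attend to each other fully;
    suffix tokens (model output) attend to the whole prefix plus
    previously generated suffix tokens (causal)."""
    n = num_prefix + num_suffix
    mask = torch.zeros(n, n, dtype=torch.bool)
    # Full (bidirectional) attention within the prefix block.
    mask[:num_prefix, :num_prefix] = True
    # Suffix rows see the entire prefix ...
    mask[num_prefix:, :num_prefix] = True
    # ... and a causal (lower-triangular) view of the suffix itself.
    mask[num_prefix:, num_prefix:] = torch.tril(
        torch.ones(num_suffix, num_suffix)
    ).bool()
    return mask

# Example: 4 image/prefix tokens, 3 output tokens.
print(prefix_lm_mask(4, 3).int())
```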
Our work on learning visual features with an LLM approach is finally out. All the scaling observations made on LLMs transfer to images! It was a pleasure to work under @alaaelnouby's leadership on this project, and this concludes my fun (but short) time at Apple! 1/n
Excited to share AIM 🎯 - a set of large-scale vision models pre-trained solely using an autoregressive objective. We share the code & checkpoints of models up to 7B params, pre-trained for 1.2T patches (5B images) achieving 84% on ImageNet with a frozen trunk.
(1/n) 🧵
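As a hedged illustration of what an autoregressive objective over image patches can look like (a sketch under my own assumptions, not the released AIM code; `causal_vit` stands in for a hypothetical causally-masked transformer): patchify the image in raster order and regress the pixels of each next patch.

```python
import torch
import torch.nn as nn

def patchify(images: torch.Tensor, patch: int = 16) -> torch.Tensor:
    """(B, C, H, W) -> (B, num_patches, C*patch*patch) in raster order."""
    b, c, h, w = images.shape
    x = images.unfold(2, patch, patch).unfold(3, patch, patch)   # (B, C, H/p, W/p, p, p)
    x = x.permute(0, 2, 3, 1, 4, 5).reshape(b, -1, c * patch * patch)
    return x

def autoregressive_patch_loss(causal_vit: nn.Module, images: torch.Tensor) -> torch.Tensor:
    """Autoregressive image-modeling objective: with a causal attention mask,
    each patch embedding is used to regress the raw pixels of the *next* patch.
    `causal_vit` is assumed to map (B, N-1, D) patch inputs to (B, N-1, D) predictions."""
    patches = patchify(images)                   # (B, N, D)
    preds = causal_vit(patches[:, :-1])          # predict patches 1..N-1 from 0..N-2
    return nn.functional.mse_loss(preds, patches[:, 1:])
```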
IMHO, Chinchilla is the most impactful paper in the recent development of open LLMs, and its relatively low citation count shows how broken this metric is.
I'm a bit obsessed with the Chinchilla paper. It has the largest ratio of "economic worth/idea complexity" of any paper in AI. If Google had locked it down, it's possible open source would be a year or more behind.
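The headline takeaway is usually summarized as a ~20 tokens-per-parameter rule of thumb for compute-optimal training; a tiny sketch of that approximation (the paper's exact fits differ, and 20 is just the commonly quoted round number):

```python
def chinchilla_optimal_tokens(n_params: float, tokens_per_param: float = 20.0) -> float:
    """Commonly quoted Chinchilla rule of thumb: compute-optimal training
    uses on the order of ~20 tokens per parameter."""
    return n_params * tokens_per_param

# A 70B-parameter model would want roughly 1.4T training tokens.
print(f"{chinchilla_optimal_tokens(70e9) / 1e12:.1f}T tokens")
```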
🎉 Unveiling PaSS: Parallel Speculative Sampling
🚀 Need faster LLM decoding?
🔗 Check out our new 1-model speculative sampling algorithm based on parallel decoding with look-ahead tokens:
🤝 In collaboration with @armandjoulin and @EXGRV
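PaSS generates its drafts with look-ahead tokens inside a single model, which isn't reproduced here; but the verification step it shares with other speculative sampling schemes can be sketched as follows (a greedy-acceptance illustration under my own assumptions; `target_model` is a hypothetical callable returning per-position logits):

```python
import torch

@torch.no_grad()
def greedy_speculative_step(target_model, draft_tokens: torch.Tensor,
                            context: torch.Tensor) -> torch.Tensor:
    """One verification step of (greedy) speculative decoding.

    `draft_tokens` are k candidate tokens proposed cheaply (in PaSS, via
    look-ahead tokens from the same model; here they could come from any
    drafting scheme). The target model scores the whole candidate block in
    one forward pass, and we keep the longest prefix matching its own greedy
    choices, plus one corrected/extra token."""
    seq = torch.cat([context, draft_tokens], dim=-1)            # (1, T + k)
    logits = target_model(seq)                                  # assumed shape: (1, T + k, vocab)
    # Greedy choice of the target model at each drafted position.
    preds = logits[:, context.size(-1) - 1:-1, :].argmax(-1)    # (1, k)
    match = (preds == draft_tokens).long().cumprod(-1)          # 1s until the first mismatch
    n_accept = int(match.sum())
    accepted = draft_tokens[:, :n_accept]
    if n_accept == draft_tokens.size(-1):
        # All drafts accepted: append one more greedy token from the last position.
        correction = logits[:, -1:, :].argmax(-1)
    else:
        # Use the target model's own token at the first mismatch.
        correction = preds[:, n_accept:n_accept + 1]
    return torch.cat([context, accepted, correction], dim=-1)
```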
@srush_nlp
The 9B and 2B are fully trained with distillation during pretraining. Finetuning starts with standard SFT (no distillation), then online distillation, and finally RLHF.
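The exact Gemma recipe isn't spelled out in the thread, but token-level knowledge distillation during pretraining is typically a KL term between teacher and student next-token distributions; a minimal sketch, assuming both models' per-position logits are available:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature: float = 1.0):
    """Token-level KD: KL(teacher || student) over the vocabulary.
    Shapes: (batch, seq, vocab); the sum is normalized by batch size."""
    t = temperature
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    student_logp = F.log_softmax(student_logits / t, dim=-1)
    kl = F.kl_div(student_logp, teacher_probs, reduction="batchmean")
    # Standard temperature scaling so gradients keep a comparable magnitude.
    return kl * (t ** 2)

# Toy usage with random logits standing in for teacher/student outputs.
student = torch.randn(2, 8, 32000, requires_grad=True)
teacher = torch.randn(2, 8, 32000)
loss = distillation_loss(student, teacher, temperature=2.0)
loss.backward()
```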
@giffmana
FAIR is still home to top-tier computer vision researchers like @imisra_, @lvdmaaten, Christoph Feichtenhofer, Peter Dollar, Yaniv Taigman, @p_bojanowski. As @inkynumbers said, I think a lot of us joined 8-9 years ago and there are cycles in research careers.
@karpathy
Sometimes I like to think of very large models as just a smart way to run a massively parallel gradient descent in the parameter spaces of small models.
@abacaj
We will look to improve our models in future iterations and any feedback will be appreciated (through DMs?). Mistral's models are amazing and if they work for you, all the best!