TimDarcet Profile
TimDarcet

@TimDarcet

2,755
Followers
618
Following
99
Media
621
Statuses

PhD student, building big vision models @ INRIA & FAIR (Meta)

Joined March 2021
Pinned Tweet
@TimDarcet
TimDarcet
1 year
1/ This week we released DINOv2: a series of general vision encoders pretrained without supervision. Good out-of-the-box performance on a variety of domains, matching or surpassing other publicly available encoders.
5
117
696
@TimDarcet
TimDarcet
1 year
Vision transformers need registers! Or at least, it seems they 𝘸𝘢𝘯𝘵 some… ViTs have artifacts in attention maps. It’s due to the model using these patches as “registers”. Just add new tokens (“[reg]”): - no artifacts - interpretable attention maps 🦖 - improved performance!
Tweet media one
43
327
2K
@TimDarcet
TimDarcet
11 months
DINOv2+registers=♥️ We are releasing code and checkpoints for DINOv2 augmented with registers and a slightly better training recipe. No more of those pesky artifacts! Simple one-liner, try it out: dinov2_vitg14_reg = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitg14_reg')
Tweet media one
12
42
491
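The one-liner above loads the register-augmented checkpoint via torch.hub (requires network access, so it is shown in comments here); the token-count arithmetic below is a sketch for the standard 518×518 input with patch size 14 and 4 registers.

```python
# Hedged sketch: loading DINOv2 ViT-g/14 with registers, as in the tweet.
# import torch
# dinov2_vitg14_reg = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitg14_reg')
# feats = dinov2_vitg14_reg(torch.randn(1, 3, 518, 518))

# Token-count arithmetic for a 518x518 input with patch size 14:
patch = 14
side = 518 // patch          # 37 patches per side
num_patches = side * side    # 1369 patch tokens
num_registers = 4            # the "_reg" variants add 4 register tokens
seq_len = 1 + num_registers + num_patches  # [CLS] + registers + patches
print(seq_len)  # 1374
```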
@TimDarcet
TimDarcet
2 months
Still not sure why the ML community adopted conda instead of plain old virtualenv
60
2
324
@TimDarcet
TimDarcet
7 months
Mistral's "Le Chat" logo is a design masterclass The two dots make a smol cat
Tweet media one
4
11
241
@TimDarcet
TimDarcet
3 months
Bonus trick: you can remove the gradient reduction of the first backward (which is useless) by wrapping in no_sync() Remember to also include the forward pass in the no_sync context, else it does not work
Tweet media one
@gabriberton
Gabriele Berton
3 months
This simple pytorch trick will cut in half your GPU memory use / double your batch size (for real). Instead of adding losses and then computing backward, it's better to compute the backward on each loss (which frees the computational graph). Results will be exactly identical
Tweet media one
43
266
2K
1
18
240
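A minimal single-process sketch of the backward-per-loss trick from the quoted tweet (in DDP you would additionally wrap the forward and backward of all but the last loss in `model.no_sync()`, as the reply suggests): each per-loss backward frees its graph immediately, and the accumulated gradients match the summed-loss case.

```python
import torch

w = torch.randn(4, requires_grad=True)
xs = [torch.randn(4) for _ in range(3)]

# Variant 1: sum the losses, one backward (keeps every graph alive until the end).
loss = sum((w * x).sum() ** 2 for x in xs)
loss.backward()
g_summed = w.grad.clone()

# Variant 2: backward per loss (each computational graph is freed right away).
w.grad = None
for x in xs:
    ((w * x).sum() ** 2).backward()  # gradients accumulate into w.grad
g_per_loss = w.grad.clone()

print(torch.allclose(g_summed, g_per_loss))  # the results are identical
```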
@TimDarcet
TimDarcet
5 months
If you need a replacement for an example image in a CV paper, you know what to do
Tweet media one
@MikePFrank
Michael P. Frank has joined a startup!
5 months
Just FYI, computer vision papers submitted to IEEE that include this image of Ms. Forsén will no longer be considered for publication
Tweet media one
104
347
2K
7
11
166
@TimDarcet
TimDarcet
1 year
Intriguing new property: on some images, the different registers naturally adopt a “slot attention-like” behavior, each attending to a different object! Needless to say, this was never required of the model (or even encouraged). Cool future research direction!
Tweet media one
2
11
164
@TimDarcet
TimDarcet
5 months
In case some of you were (like me) curious about this stat for AI conferences: here it is for ICLR2024
Tweet media one
@guidosalva
Guido Salvaneschi
5 months
Statistics from @ICSE2024 . Authors submitting, *each*, 33, 27, 24, ... papers. Interactive dashboard:
Tweet media one
18
24
89
13
20
154
@TimDarcet
TimDarcet
4 months
ViT need registers got an outstanding paper award! Many thanks to the committee for the honor
@iclr_conf
ICLR 2025
4 months
Announcing the #ICLR2024 Outstanding Paper Awards: Shoutout to the awards committee: @eunsolc , @katjahofmann , @liu_mingyu , @nanjiang_cs , @guennemann , @optiML , @tkipf , @CevherLIONS
3
53
303
6
10
153
@TimDarcet
TimDarcet
6 months
Hey! If you are using DINOv2, whether in a startup, in research or whatever, could you send me a DM? I want your feedback on the model. Reward for you? Simple: next model is gonna be 𝘦𝘷𝘦𝘯 𝘮𝘰𝘳𝘦 suited to your needs 👌
10
12
134
@TimDarcet
TimDarcet
1 year
Our hypothesis is: the model recognizes useless patches, discards the info in them, and uses them as 𝘢𝘨𝘨𝘳𝘦𝘨𝘢𝘵𝘰𝘳𝘴 𝘰𝘧 𝘨𝘭𝘰𝘣𝘢𝘭 𝘪𝘯𝘧𝘰𝘳𝘮𝘢𝘵𝘪𝘰𝘯.
2
7
132
@TimDarcet
TimDarcet
4 months
Current state of neurips abstract submissions This neurips is gonna be crazy
Tweet media one
@csinva
Chandan Singh
4 months
2024 update
Tweet media one
2
4
32
11
22
121
@TimDarcet
TimDarcet
5 months
With satellite imagery, it’s hard to get labels. Solution? DINOv2! WRI+Meta trained a satellite DINOv2 for tree height estimation. They created an interactive map of tree height of the whole globe (!) at 1-meter res (!): Quiz: Can you recognize this city?
Tweet media one
4
13
121
@TimDarcet
TimDarcet
1 year
What I mean when I say “registers”: additional learnable tokens (like the [CLS]), but these ones are not used at output. No additional info at input, not used at output: these tokens could seem useless!
Tweet media one
2
8
119
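A toy sketch of the mechanism described above (names and sizes are illustrative, not the paper's code): learnable register tokens are concatenated alongside [CLS] and the patch tokens, processed by the transformer blocks, then simply dropped at the output.

```python
import torch
import torch.nn as nn

class TinyViTWithRegisters(nn.Module):
    def __init__(self, dim=64, num_registers=4):
        super().__init__()
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))
        self.registers = nn.Parameter(torch.zeros(1, num_registers, dim))
        self.block = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.num_registers = num_registers

    def forward(self, patches):                         # (B, N, dim)
        B = patches.shape[0]
        tokens = torch.cat([self.cls.expand(B, -1, -1),
                            self.registers.expand(B, -1, -1),
                            patches], dim=1)            # (B, 1 + R + N, dim)
        tokens = self.block(tokens)
        cls_out = tokens[:, 0]                          # global representation
        patch_out = tokens[:, 1 + self.num_registers:]  # registers discarded
        return cls_out, patch_out

model = TinyViTWithRegisters()
cls_out, patch_out = model(torch.randn(2, 16, 64))
print(cls_out.shape, patch_out.shape)
```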
@TimDarcet
TimDarcet
4 months
echo "echo 'sleep 0.5' >> ~/.bashrc" >> ~/.bashrc
@y0b1byte
yobibyte
4 months
Every time a colleague of mine does not lock their laptop, I add something to their .bashrc. alias vim='nano' is a good one, but moving file to a random folder is even funnier. rm is too evil, don't do it!
4
1
28
7
6
113
@TimDarcet
TimDarcet
8 months
ICLR results are out so its bragging time: ViT need reg got an oral and very good scores (top-15), so that's cool. Thanks a lot to the reviewers who found it good If you want to try a model with registers, we published some DINOv2 checkpoints earlier:
@TimDarcet
TimDarcet
11 months
DINOv2+registers=♥️ We are releasing code and checkpoints for DINOv2 augmented with registers and a slightly better training recipe. No more of those pesky artifacts! Simple one-liner, try it out: dinov2_vitg14_reg = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitg14_reg')
Tweet media one
12
42
491
9
7
111
@TimDarcet
TimDarcet
9 months
PSA: when someone asks you a question including words such as "false positive rate", 𝗱𝗼 𝗻𝗼𝘁 𝗮𝗻𝘀𝘄𝗲𝗿 𝗿𝗶𝗴𝗵𝘁 𝗮𝘄𝗮𝘆. Simply state that you know your rights, and go on wikipedia to consult the 𝔐𝔞𝔡 𝕮𝔬𝔫𝔣𝔲𝔰𝔦𝔬𝔫 𝔐𝔞𝔱𝔯𝔦𝔵 𝔬𝔣 𝕳𝔢𝔩𝔩
Tweet media one
@jeremykauffman
Jeremy Kauffman 🦔
9 months
Fewer than 1 in 5 doctors can correctly answer a basic question about statistics
Tweet media one
488
826
8K
2
16
105
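The classic base-rate trap behind the quoted statistic, with illustrative numbers (not from the tweet): with low prevalence, even a decent test yields mostly false positives, which is why "false positive rate" questions deserve a pause.

```python
# Illustrative numbers; the point is the Bayes computation, not the values.
prevalence = 0.01      # 1% of patients have the disease
sensitivity = 0.90     # P(test+ | disease)  = true positive rate
fpr = 0.09             # P(test+ | healthy)  = false positive rate

p_positive = prevalence * sensitivity + (1 - prevalence) * fpr
p_disease_given_positive = prevalence * sensitivity / p_positive
print(round(p_disease_given_positive, 3))  # 0.092: only ~9% of positives are real
```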
@TimDarcet
TimDarcet
1 year
But in fact, the model learns to use them. And they work quite well: a single register entirely fixes the attention maps, and gives a boost on downstream tasks. Adding more further increases the scores a bit. We improve upon DINOv2, which was already quite stronk 💪
Tweet media one
2
2
102
@TimDarcet
TimDarcet
1 year
Do check out the paper! It’s got much more detail than I can give here. Thanks to Maxime Oquab, Julien Mairal and Piotr Bojanowski who were patient enough to work with me, and competent enough to compensate for my mistakes 😅.
Tweet media one
3
2
94
@TimDarcet
TimDarcet
5 months
Actually the acceptance rate decreases monotonically with the number of first-author submissions: the more prolific the first author is, the lower the quality of their papers.
Tweet media one
@jon_barron
Jon Barron
5 months
The acceptance rate among aspiring ICLR2024 first authors who submitted >= 4 papers was 15%! Contrast that with the base acceptance rate that year: 30.5%. Unsettling.
4
5
43
2
13
96
@TimDarcet
TimDarcet
3 months
fuck your fancy personal page template im rawdoggin the html and you wont even make me use css
Tweet media one
8
3
95
@TimDarcet
TimDarcet
1 year
“Fine with me if you need global aggregators, but please don’t do this in my feature maps. I need those for downstream tasks! Here, have a few registers instead” - historical reconstruction of how it happened
Tweet media one
1
3
91
@TimDarcet
TimDarcet
6 months
Hey guys quick update vision transformers don't need registers after all brb gotta test some stuff
@liuzhuang1234
Zhuang Liu
6 months
LLMs are great, but their internals are less explored. I'm excited to share very interesting findings in paper “Massive Activations in Large Language Models” LLMs have very few internal activations with drastically outsized magnitudes, e.g., 100,000x larger than others. (1/n)
Tweet media one
32
169
1K
3
2
81
@TimDarcet
TimDarcet
1 year
This starts with a very simple observation: ~all ViTs have attention maps focused on a few seemingly random patches. DINO has clean attention maps, sure, but then why did the artifacts reappear in DINOv2? What 𝘢𝘳𝘦 these artifacts?
Tweet media one
1
4
81
@TimDarcet
TimDarcet
2 months
Okay this uiua thing is actually pretty fun
Tweet media one
@ludwigABAP
ludwig
3 months
uiua goes unbelievably hard wtf array-orientated, stack based, glyph programming language and now I wanna make the game of life in it this weekend
Tweet media one
14
7
222
2
5
81
@TimDarcet
TimDarcet
3 months
You may not like it, but this is what peak personal page looks like
@TimDarcet
TimDarcet
3 months
fuck your fancy personal page template im rawdoggin the html and you wont even make me use css
Tweet media one
8
3
95
12
3
78
@TimDarcet
TimDarcet
1 year
Thanks @_akhaliq and @arankomatsuzaki for featuring our paper! It's great to see it 1st on the trending list on HF papers 😁
@_akhaliq
AK
1 year
Vision Transformers Need Registers paper page: Transformers have recently emerged as a powerful tool for learning visual representations. In this paper, we identify and characterize artifacts in feature maps of both supervised and self-supervised ViT
Tweet media one
5
142
836
3
6
66
@TimDarcet
TimDarcet
1 year
We find a few properties of these artifacts. 1. They appear on patches with useless information (redundant to their neighbors). 2. They contain little information about the original patch. It “forgot” its original value!
Tweet media one
1
1
66
@TimDarcet
TimDarcet
7 months
Happy to share that DINOv2 was accepted at TMLR! A special thanks to the reviewers and action editor. I found the review process to be actually pleasant and constructive. I believe that right now TMLR is possibly the best place to publish in ML
@TmlrPub
Accepted papers at TMLR
8 months
DINOv2: Learning Robust Visual Features without Supervision Maxime Oquab, Timothée Darcet, Théo Moutakanni et al.. Action editor: Abhishek Kumar. #supervised #visual #features
1
18
118
1
6
63
@TimDarcet
TimDarcet
1 year
On the other hand, the output tokens seem to contain 𝗹𝗼𝘁𝘀 of global information. We probe on a few different classification datasets. We find that these tokens contain much more class information than other patch tokens, and almost as much as the [CLS]!
Tweet media one
1
0
63
@TimDarcet
TimDarcet
1 year
Do try out the new depth estimation parallax view, it's trippy
2
5
59
@TimDarcet
TimDarcet
4 months
Thanks to DINO's nice attention maps, the model's behavior is quite interpretable! That's really cool
Tweet media one
Tweet media two
@TimDarcet
TimDarcet
4 months
Another banger by @TheoMoutakanni : RayDINO, a DINO for chest X-ray. Excellent results on a ton of benchmarks with the frozen model, with great generalization and low bias. Check it out!
Tweet media one
1
5
43
2
11
57
@TimDarcet
TimDarcet
1 year
6/ With these capabilities emerge new interesting properties. A very nice one is the ability to perform semantic keypoint matching between images simply by matching the closest features. This works across very different domains!
Tweet media one
Tweet media two
Tweet media three
2
12
56
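The matching described above can be sketched in a few lines (random tensors stand in for actual DINOv2 patch features; shapes are illustrative): L2-normalize the patch features of both images and match each patch in A to its nearest patch in B by cosine similarity.

```python
import torch
import torch.nn.functional as F

# Stand-ins for per-patch features of two images (196 patches, dim 384).
feats_a = F.normalize(torch.randn(196, 384), dim=-1)
feats_b = F.normalize(torch.randn(196, 384), dim=-1)

sim = feats_a @ feats_b.T      # cosine similarities, shape (196, 196)
matches = sim.argmax(dim=1)    # for each patch in A, index of closest patch in B
print(matches.shape)
```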
@TimDarcet
TimDarcet
10 months
Published my first paper, and my second one. I like them. I used to feel anxious about not being able to publish anything. It's getting better.
@ATMwithJacy
Jacy, LPC
10 months
BRAG ABOUT SOMETHING YOU’RE PROUD OF ACCOMPLISHING IN 2023 ✨✨✨✨✨✨✨✨✨✨✨✨✨✨✨✨✨✨✨✨✨✨✨✨✨✨✨✨
1K
474
6K
2
0
53
@TimDarcet
TimDarcet
1 year
2/ As opposed to other recent SSL works, the goal is to provide vision encoders that work off-the-shelf, without any fine-tuning. In this setup, we improve significantly over previous SSL works, and even match or surpass CLIP-type models on a variety of tasks
Tweet media one
1
4
53
@TimDarcet
TimDarcet
2 months
Lmao they waited for the 405B release just to be able to 1-up it
@MistralAI
Mistral AI
2 months
126
362
2K
3
0
51
@TimDarcet
TimDarcet
4 months
Another banger by @TheoMoutakanni : RayDINO, a DINO for chest X-ray. Excellent results on a ton of benchmarks with the frozen model, with great generalization and low bias. Check it out!
Tweet media one
1
5
43
@TimDarcet
TimDarcet
11 months
Quite a few people have been asking me "can registers work with LLMs?" Here is a paper that says yes!
@arankomatsuzaki
Aran Komatsuzaki
11 months
Think before you speak: Training Language Models With Pause Tokens - Performing training and inference on LMs with a learnable pause token appended to the input prefix - Gains on 8 tasks, e,g, +18% on SQuAD
Tweet media one
16
177
925
3
3
40
@TimDarcet
TimDarcet
6 months
Tweet media one
1
0
38
@TimDarcet
TimDarcet
1 year
Big news on the DINOv2 side! - Apache2 license (commercial use) - Releasing the segmentation and depth heads - significantly updated demo, with keypoint matching! - New fairness evaluations on FACET
2
5
35
@TimDarcet
TimDarcet
4 months
The viennese street artists are a different breed
Tweet media one
0
2
36
@TimDarcet
TimDarcet
5 months
The biggest step change in the DINOv2 project was a skillful yolo run by Maxime. Yoloing is a dangerous but powerful weapon
@_jasonwei
Jason Wei
5 months
In AI research there is tremendous value in intuitions on what makes things work. In fact, this skill is what makes “yolo runs” successful, and can accelerate your team tremendously. However, there’s no track record on how good someone’s intuition is. A fun way to do this is
19
36
465
2
0
35
@TimDarcet
TimDarcet
5 months
In case you haven't got it yet: google scholar pdf reader extension for chrome is a must
Tweet media one
6
1
33
@TimDarcet
TimDarcet
7 months
Next week I'll be talking about registers, what they are and why we need them, at Cohere for AI! More info:
@CohereForAI
Cohere For AI
7 months
Next week on Wednesday, February 7th, our Geo-Regional Asia Group is excited to welcome Timothée Darcet, PhD student, building large vision models at @Meta AI (FAIR) & @Inria to present "Vision Transformers need Registers." Learn more:
Tweet media one
2
4
15
0
5
32
@TimDarcet
TimDarcet
1 year
Very clear and simple tutorial on how to use DINOv2 as an image featurizer. Check it out !
@NielsRogge
Niels Rogge
1 year
DINOv2, a SOTA ViT trained by @Meta on 142 million images, is now part of 🤗 Transformers! It's one of the strongest vision backbones at the moment, so I created a tutorial on training a linear classifier on top of it for semantic segmentation, using DINOv2's frozen features 1/2
Tweet media one
4
73
406
0
7
31
@TimDarcet
TimDarcet
10 months
Wait till they hear about selective checkpointing
@capetorch
Thomas Capelle
10 months
Gradient Checkpointing is the single most effective way of reducing GPU memory footprint. This thing is fantastic! Am I missing something, or is it that good?
10
3
101
2
0
30
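A minimal sketch of the gradient checkpointing being discussed (recent PyTorch; `use_reentrant=False` is the non-reentrant variant): activations inside the checkpointed function are not stored but recomputed during backward, trading compute for memory. "Selective" checkpointing applies this only to chosen blocks rather than the whole network.

```python
import torch
from torch.utils.checkpoint import checkpoint

layer = torch.nn.Linear(32, 32)

def block(x):
    # Intermediate activations here are recomputed at backward time.
    return torch.relu(layer(x))

x = torch.randn(8, 32, requires_grad=True)
y = checkpoint(block, x, use_reentrant=False)
y.sum().backward()
print(x.grad is not None)  # gradients still flow normally
```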
@TimDarcet
TimDarcet
5 months
Okay caveat of my last post: maybe those are all middle authorship? Let's look at the same plot but only for _first_ and _last_ authors. First authors: (1/2)
Tweet media one
@TimDarcet
TimDarcet
5 months
In case some of you were (like me) curious about this stat for AI conferences: here it is for ICLR2024
Tweet media one
13
20
154
5
3
31
@TimDarcet
TimDarcet
4 months
@Ethan_smith_20 Contrastive losses in general push the model to use the whole space In DINOv2 we used the specific KoLeo loss, which pushes the embedding distribution towards higher entropy Higher entropy --> uniform distribution (on the hypersphere) --> full usage of the space
2
0
31
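A hedged sketch of the KoLeo regularizer mentioned above, following the form described in the DINOv2 paper (a differential-entropy estimator: each embedding is pushed away from its nearest neighbor in the batch, spreading points over the hypersphere). Details here are illustrative, not the released implementation.

```python
import torch
import torch.nn.functional as F

def koleo_loss(x, eps=1e-8):
    x = F.normalize(x, dim=-1)              # embeddings on the hypersphere
    dots = x @ x.T
    dots.fill_diagonal_(-2.0)               # exclude self-matches
    # Largest cosine similarity <-> smallest L2 distance on the sphere.
    nn_dist = (2 - 2 * dots.max(dim=1).values).clamp_min(eps).sqrt()
    return -nn_dist.log().mean()            # more spread -> lower loss

# A clumped batch is penalized much more than a spread-out one:
clumped = torch.randn(64, 16) * 0.01 + 1.0
spread = torch.randn(64, 16)
print(koleo_loss(clumped) > koleo_loss(spread))  # True
```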
@TimDarcet
TimDarcet
3 months
Always check the image normalization! It can completely change results. E.g. CLIP uses its own specific norm, and openclip uses either the CLIP values or the inception values depending on the model. When in doubt, you can often check in timm
@gabriberton
Gabriele Berton
3 months
Notable models that use non-imagenet norm are Dust3r, OpenIBL, many image matching models, and some (many?) remote sensing models. This is an issue when you create a fair codebase to benchmark multiple models (where ideally you can simply swap the model to compute the results).
4
0
8
1
1
29
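The three normalizations in play, with the constants as commonly used in torchvision/timm (worth double-checking per model, which is exactly the tweet's point): the same pixel value lands in noticeably different places under each.

```python
IMAGENET_MEAN, IMAGENET_STD = (0.485, 0.456, 0.406), (0.229, 0.224, 0.225)
INCEPTION_MEAN, INCEPTION_STD = (0.5, 0.5, 0.5), (0.5, 0.5, 0.5)
CLIP_MEAN = (0.48145466, 0.4578275, 0.40821073)
CLIP_STD = (0.26862954, 0.26130258, 0.27577711)

# Where a red-channel value of 0.5 ends up under each normalization:
for mean, std in [(IMAGENET_MEAN, IMAGENET_STD),
                  (INCEPTION_MEAN, INCEPTION_STD),
                  (CLIP_MEAN, CLIP_STD)]:
    print(round((0.5 - mean[0]) / std[0], 3))
```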
@TimDarcet
TimDarcet
2 months
Okay first DFN then that Apple is now the king of open-source datasets, both vision and NLP
@casper_hansen_
Casper Hansen
2 months
Apple released a 7B model that beats Mistral 7B - but the kicker is that they fully open sourced everything, also the pretraining dataset 🤯
29
502
3K
0
1
28
@TimDarcet
TimDarcet
3 months
Wait, are they doing patch size 40?? 170 is 1 [CLS] plus 13x13 patch tokens. Using padding, the smallest patch size you would need for that is 40. That's huge! Bigger than the old patch 32, which nobody uses any more.
@y0b1byte
yobibyte
3 months
Very nice and interesting forensics
Tweet media one
1
2
35
2
1
27
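The arithmetic behind the guess, spelled out (the 512-px input side is my assumption for illustration; the tweet doesn't state the image size): 170 tokens = 1 [CLS] + a 13×13 grid, and fitting 13 patches per side with padding forces the patch size up to 40.

```python
import math

num_tokens = 170
grid = int((num_tokens - 1) ** 0.5)   # 13x13 patch grid
assert grid * grid == num_tokens - 1

# Assumed 512-px input: smallest p with ceil(512 / p) <= 13 is
# p >= 512 / 13 ~= 39.4, i.e. p = 40.
p = math.ceil(512 / grid)
print(grid, p)  # 13 40
```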
@TimDarcet
TimDarcet
8 months
QRT-ing this for good measure. The most important thing I've learned about SSL is probably: experiments, experiments, experiments. If it doesn't work, experiment harder. It's why big labs have such an unfair advantage
@TimDarcet
TimDarcet
8 months
@nickdaleburns @samsja19 With all that said I think it's not important whether something is contrastive or not. I understand the intuition "there may be similar samples in the batch so I don't want to push them away". But in SSL, intuitions are trash IMO. You just have to try the stuff
5
0
11
0
1
26
@TimDarcet
TimDarcet
1 year
> Dumps a 60 MMLU 7B as a magnet link > Refuses to elaborate > Leaves Incomprehensibly based
Tweet media one
@MistralAI
Mistral AI
1 year
magnet:?xt=urn:btih:208b101a0f51514ecf285885a8b0f6fb1a1e4d7d&dn=mistral-7B-v0.1&tr=udp%3A%2F%%3A1337%2Fannounce&tr=https%3A%2F%%3A443%2Fannounce RELEASE ab979f50d7d406ab8d0b07d09806c72c
209
450
4K
2
0
26
@TimDarcet
TimDarcet
5 months
It's not exactly a Zipf law, but this distribution is still interesting
Tweet media one
2
0
25
@TimDarcet
TimDarcet
5 months
This begs the question: are these people contributing to reviews as much as they contribute to submissions? The review system is saturated. If people send lots of papers to it, they should contribute more to make it work.
@TimDarcet
TimDarcet
5 months
In case some of you were (like me) curious about this stat for AI conferences: here it is for ICLR2024
Tweet media one
13
20
154
4
4
24
@TimDarcet
TimDarcet
5 months
I usually say dɪno (dee-no) in french and daɪno (die-no) in english
@abursuc
Andrei Bursuc
5 months
Computer Vision folks, let's settle this. How do you pronounce DINO?
3
2
13
1
0
24
@TimDarcet
TimDarcet
9 months
I really don't like the pressure there is on "number of papers published". In France it's "defend after 3 years, as long as you published 1 paper". PhD students still publish excellent research. We should have incentives to publish fewer, better papers
@agihippo
yi 🦛
9 months
If publishing 3 papers is the bar for a PhD everyone will graduate in two quarters lol
3
0
16
1
0
24
@TimDarcet
TimDarcet
1 year
7/ What's the secret ingredient then? Well, the simplest answers are often the best. Most improvements come from scaling up, tuning carefully, stabilizing the training, efficient implementations... Might seem scientifically boring, but it’s absolutely crucial.
1
1
22
@TimDarcet
TimDarcet
3 months
MLPs are really cool, you can just look at the functions they define, and understand them Really recommend these discussions and the neural redshift paper
@francoisfleuret
François Fleuret
3 months
With x->x.clamp(min=-0.5, max=0.5) as non-linearity. So as expected it's not a question of piecewise linear vs. non-polynomial, but a question of creating sharp but local changes? s=1, s=10 @DamienTeney
Tweet media one
Tweet media two
2
0
12
0
3
21
@TimDarcet
TimDarcet
9 months
Tweet media one
@jeremyphoward
Jeremy Howard
9 months
he says he goes into "cuda mode" to write kernels. No music, lights off, no distractions. He wrote the 4bit kernel in one night.
21
23
724
0
0
20
@TimDarcet
TimDarcet
7 months
My drug is benchmark curves going up and I'm absolutely addicted this is not a meme send help
5
1
19
@TimDarcet
TimDarcet
1 year
@giffmana @MistralAI We kinda tried to contribute one brick to this with the DINOv2 release this year But in general I agree there's a big diff with NLP where there's new good foundation models every month rn
1
0
18
@TimDarcet
TimDarcet
8 months
Cool paper! Their encoders look quite strong. I'm happy to see ideas such as multicrop or iBOT being used. In my experience, it's free money
@_akhaliq
AK
8 months
Learning Vision from Models Rivals Learning Vision from Data paper page: introduce SynCLR, a novel approach for learning visual representations exclusively from synthetic images and synthetic captions, without any real data. We synthesize a large dataset
Tweet media one
5
91
444
0
2
18
@TimDarcet
TimDarcet
1 year
The keypoints demo is absolutely great, I'm happy we were able to finally release that
Tweet media one
1
1
17
@TimDarcet
TimDarcet
8 months
FAISS is absolutely standard for fast knn. Many approximate indices available, gpu acceleration made easy. It was crucial to our dataset creation pipeline in DINOv2. Many thanks to @DouzeMatthijs , @hjegou and team!
@ducha_aiki
Dmytro Mishkin 🇺🇦
8 months
The Faiss library @DouzeMatthijs , Alexandr Guzhva, Chengqi Deng, Jeff Johnson, Gergely Szilvasy, Pierre-Emmanuel Mazaré, Maria Lomeli, Lucas Hosseini, @hjegou tl;dr: the faiss and approximate kNN search overview
Tweet media one
Tweet media two
Tweet media three
Tweet media four
0
19
73
0
0
17
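The basic FAISS pattern being praised, sketched below (the faiss calls are shown in comments in case the library isn't installed; `IndexFlatL2` is the exact index, so a brute-force NumPy search reproduces it):

```python
import numpy as np

# FAISS equivalent of the search below:
#   import faiss
#   index = faiss.IndexFlatL2(32)   # exact L2 index over dim-32 vectors
#   index.add(xb)                   # database vectors, float32 (n, d)
#   D, I = index.search(xq, k)      # k nearest neighbors of each query

rng = np.random.default_rng(0)
xb = rng.standard_normal((1000, 32)).astype("float32")  # database
xq = rng.standard_normal((5, 32)).astype("float32")     # queries
k = 4

d2 = ((xq[:, None, :] - xb[None, :, :]) ** 2).sum(-1)   # squared L2 distances
I = np.argsort(d2, axis=1)[:, :k]                       # indices of k nearest
print(I.shape)  # one row of k neighbor ids per query
```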
@TimDarcet
TimDarcet
5 months
Do not worry DINO fans: I am here to tell you that both pronunciations are officially valid
@abursuc
Andrei Bursuc
5 months
Sorry for DINO fans with French-Italian roots or preference for DI-stillation :)
1
0
5
1
0
17
@TimDarcet
TimDarcet
11 months
Tweet media one
@xhluca
Xing Han Lu
11 months
@ylecun @ClementDelangue Scikit-Learn --> Inria (Paris) Torch --> Idiap/EPFL (Switzerland) Theano --> Lisa/UdeM (Montreal) Keras --> François Chollet FAISS, DINO, DETR, LLAMA --> FAIR Paris Tokenizers, Optimum, Accelerate --> Huggingface Wonder if there's something different about speaking French..
0
1
32
0
0
17
@TimDarcet
TimDarcet
7 months
"Vision transformer need registers" Tomorrow, 4pm CET! Moreinf -->
@CohereForAI
Cohere For AI
7 months
Next week on Wednesday, February 7th, our Geo-Regional Asia Group is excited to welcome Timothée Darcet, PhD student, building large vision models at @Meta AI (FAIR) & @Inria to present "Vision Transformers need Registers." Learn more:
Tweet media one
2
4
15
1
0
16
@TimDarcet
TimDarcet
9 months
@XueFz Be careful of the citation tracker you use. Google scholar counts 791 with the right aggregation. I did expect more. But 791 is conceivable
Tweet media one
3
0
16
@TimDarcet
TimDarcet
5 months
4:22:37:12 Any% NMG 1.2.2.1 new WR
@_akhaliq
AK
5 months
How Good Are Low-bit Quantized LLaMA3 Models? Meta's LLaMA family has become one of the most powerful open-source Large Language Model (LLM) series. Notably, LLaMA3 models have recently been released and achieve impressive performance across various with super-large scale
Tweet media one
4
68
314
1
0
15
@TimDarcet
TimDarcet
1 year
11/ But giant models are impractical. To create smaller, portable models, we distilled the ViT-g into ViT-S, B and L (50x, 14x and 3x smaller). Distilling improves significantly over training from scratch! At these sizes, our distilled models beat all other models we tested.
Tweet media one
1
2
15
@TimDarcet
TimDarcet
5 months
I deleted the last plot, as I had made a stupid mistake. Here is the corrected version. Thanks @mihaidusmanu for pointing it out!
Tweet media one
@mihaidusmanu
Mihai Dusmanu
5 months
@TimDarcet Are you sure about the Y axis on this one? It feels off. And the X caption? Do I read it correctly that authors with >= 5 submissions as 1st author have an average of 4.6 papers accepted as 1st? Also according to the previous post there are only three samples at >= 5 submissions
1
0
1
1
1
15
@TimDarcet
TimDarcet
1 year
3/ TL;DR of results (on the benchmarks we tested) is Classification: DINOv2 ≥ CLIP Dense tasks (segmentation, depth): DINOv2 > CLIP Retrieval: DINOv2 > CLIP
1
3
15
@TimDarcet
TimDarcet
4 months
@p_bojanowski Congrats to you too!
Tweet media one
0
1
15
@TimDarcet
TimDarcet
7 months
Wow this new model is so much more interesting than GPT4
Tweet media one
@arthurmensch
Arthur Mensch
7 months
As a small surprise, we’re also releasing le Chat Mistral, a front-end demonstration of what Mistral models can do. Learn more on
22
55
465
0
0
15
@TimDarcet
TimDarcet
11 months
Just read the GAIA-1 paper, and I should've done it sooner. Feels like a big deal, I'm not aware of any previous work achieving such prediction capabilities / temporal consistency on videos. Actually bummed I didn't take the time to visit the @wayve_ai booth at ICCV
@alexgkendall
Alex Kendall
11 months
What’s exciting about #GAIA1 is that it's not just a simulation, but a full world model that understands how the world works Here's two different futures where a reversing car (1) pulls off the street or (2) pulls out suddenly requiring us to brake More of my fav examples in 🧵
18
60
409
0
0
14
@TimDarcet
TimDarcet
3 months
This is why I work ❤️
@alew3
Alessandro
4 months
🚀 Just launched to help pet owners reunite with their lost pets after Brazil's worst floods 🌊 Upload a picture of your pet, and our AI matches it with rescue shelter photos. Huge thanks to @aziontech and @LightningAI for their support! 🐾❤️ #AIForGood
2
5
12
0
0
15
@TimDarcet
TimDarcet
1 year
4/ For dense tasks, it makes sense: captions only capture image information up to a certain point, they miss local information. The masked image modeling loss in DINOv2 (from iBOT) gives it local understanding.
Tweet media one
2
1
14
@TimDarcet
TimDarcet
9 months
This result got attention, but I feel it deserves more. Similar approach to GAIA-1, and potentially very significant results. The big gap in this paper is the evaluations. Qualitative results are impressive, but we need some quantitative.
@YutongBAI1002
Yutong Bai
9 months
How far can we go with vision alone? Excited to reveal our Large Vision Model! Trained with 420B tokens, effective scalability, and enabling new avenues in vision tasks! (1/N) Kudos to @younggeng @Karttikeya_m @_amirbar , @YuilleAlan Trevor Darrell @JitendraMalikCV Alyosha Efros!
18
160
1K
0
0
14
@TimDarcet
TimDarcet
9 months
> 8 x 7B First open-source SOTA MoE?
Tweet media one
0
0
14
@TimDarcet
TimDarcet
1 year
5/ The good knn properties of DINO and the new KoLeo regularization combine to produce strong retrieval results. We were surprised by how good the metrics are! Even our ViT-S reaches better scores than any previously released model.
Tweet media one
1
1
14
@TimDarcet
TimDarcet
1 month
Wait wait guys is this... ???
Tweet media one
3
0
14
@TimDarcet
TimDarcet
2 months
Tweet media one
0
1
13
@TimDarcet
TimDarcet
4 months
Hey, quick update and apology: this graph is misleading. 22k is the number of abstracts submitted, while last year's 12k is the number of papers submitted. Afaik last year there were 15.6k abstracts. If the proportion stays the same, there should be ~17k papers submitted this year
@TimDarcet
TimDarcet
4 months
Current state of neurips abstract submissions This neurips is gonna be crazy
Tweet media one
11
22
121
1
0
13
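The proportion arithmetic from the correction above, made explicit:

```python
# If papers/abstracts stays at last year's ratio (12k / 15.6k),
# 22k abstracts project to roughly 17k full submissions.
abstracts_this_year = 22_000
abstracts_last_year = 15_600
papers_last_year = 12_000

projected = abstracts_this_year * papers_last_year / abstracts_last_year
print(round(projected))  # ~16923, i.e. roughly 17k
```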
@TimDarcet
TimDarcet
2 months
@untitled01ipynb All researchers and interns have access to quite a few V100s, as a base allocation To have a big allocation you need to justify it, usually as part of a big project
2
0
13
@TimDarcet
TimDarcet
9 months
In medical / bio images, annotations are very expensive (they need very skilled annotators, e.g. doctors), so labeled data is scarce. This is where SSL shines the brightest
@gshaikovski
George Shaikovski
9 months
We trained a self-supervised transformer on 1.5 million megapixel-scale whole slide images and DINOv2 framework, achieving state of the art on all public benchmarks. Now partnering with @MSFTResearch to go further.
0
7
14
1
0
13
@TimDarcet
TimDarcet
1 year
If DINOv2's activations are well correlated to the human brain's visual representations, it must mean we are doing something right. Thanks @adeli_hossein for this work!
@adeli_hossein
Hossein Adeli
1 year
Excited to share our submission to the Algonatus 2023 challenge with @Minni1031 and @KriegeskorteLab : “Predicting brain activity using Transformers” report: code:
1
6
33
1
2
13
@TimDarcet
TimDarcet
5 months
@RylanSchaeffer I think direct public shaming is not the most productive thing to do here, so I prefer not to. But it's pretty easy to reproduce this analysis yourself if you're interested
3
0
13
@TimDarcet
TimDarcet
10 months
DINOv2 being used in prod in @PlantNetProject is the good news of the day I did not expect. If you have never used it, try this app! It's simply amazing.
@HugoGresse
Hugo Gresse
11 months
In 2023, after more than a year of work, we added the floras of every country using the work of @tdwg . And we also moved away from convolutional neural networks (CNNs) to transformers, first BEiT, then DINOv2 after the summer.
1
1
8
0
2
13
@TimDarcet
TimDarcet
3 months
@y0b1byte In a neighbourhood of 0, the loss is O(eps), while the Lp regularization is O(eps^p) If p>1, the loss always wins (L2 reg) If p<1, the regularizer always wins (0 is a local minimum) If p=1, the two can balance out, creating a local minimum if the loss's gradient is low enough
Tweet media one
0
0
13
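The comparison can be written out explicitly (assuming the loss is smooth near 0 with gradient g there):

```latex
% Near w = 0, total objective with an L_p penalty of weight \lambda:
%   f(\epsilon) \approx f(0) + g\,\epsilon + \lambda |\epsilon|^p
%
% p > 1: \lambda|\epsilon|^p = o(|\epsilon|), the linear loss term dominates,
%        so 0 is not special (L2 regularization: shrinkage, no sparsity).
% p < 1: \lambda|\epsilon|^p \gg |g\,\epsilon|, the penalty dominates,
%        so 0 is always a local minimum.
% p = 1: both terms are O(|\epsilon|); 0 is a local minimum
%        iff |g| \le \lambda (the soft-thresholding condition of L1).
```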
@TimDarcet
TimDarcet
1 year
9/ There's a bit to be said about the dataset too: we automatically curate 1B image to 140M with a retrieval+deduplication pipeline. More diverse than Imagenet22k, while still having better image quality and balance than uncurated datasets (YFCC, LAION, IG2B...)
Tweet media one
1
3
13
@TimDarcet
TimDarcet
3 months
@y0b1byte @Wikipedia In particular for the confusion matrix
@TimDarcet
TimDarcet
9 months
PSA: when someone asks you a question including words such as "false positive rate", 𝗱𝗼 𝗻𝗼𝘁 𝗮𝗻𝘀𝘄𝗲𝗿 𝗿𝗶𝗴𝗵𝘁 𝗮𝘄𝗮𝘆. Simply state that you know your rights, and go on wikipedia to consult the 𝔐𝔞𝔡 𝕮𝔬𝔫𝔣𝔲𝔰𝔦𝔬𝔫 𝔐𝔞𝔱𝔯𝔦𝔵 𝔬𝔣 𝕳𝔢𝔩𝔩
Tweet media one
2
16
105
1
0
12
@TimDarcet
TimDarcet
5 months
Last authors: You can guess my opinion on those 2 plots. I don't think this way of submitting is the best contribution to the community
Tweet media one
2
4
12
@TimDarcet
TimDarcet
10 months
When did the muffin/chihuahua thing transition from a harmless meme to "famous computer vision problem"? I might be wrong, but I've never seen people showing experiments where this is a failure case. Here is what CLIP (from feb 21) predicts
Tweet media one
@xwang_lk
Xin Eric Wang
10 months
The famous "Chihuahua or Muffin" problem in computer vision is considered solved by GPT-4V on social media. But really? The answer is NO. GPT-4V cannot reason well about the same images in the original "Chihuahua or Muffin" grid when they are in a different layout. I
Tweet media one
42
145
896
2
2
12
@TimDarcet
TimDarcet
10 months
DINOv2 depth estimation looks quite robust to this kind of painted optical illusion, that's cool
Tweet media one
@ducha_aiki
Dmytro Mishkin 🇺🇦
10 months
FLORIDA: Fake-looking Real Images Dataset Ali Borji tl;dr: 510 real images looking like a fake.
Tweet media one
Tweet media two
Tweet media three
Tweet media four
2
14
96
0
0
12
@TimDarcet
TimDarcet
5 months
The influence of eval formatting on results is always wild to me. Shouldn't we select ~100 standard "formats" and report the average on those? It should be much more robust imo
@clefourrier
Clémentine Fourrier 🍊 - is ooo!
5 months
Follow up "eval is fun" tweet: how much do scores change depending on prompt format choice? The score range for a given model is of 10 points! :D Prompt format on the x axis, all these evals look at the logprob of either "choice A/choice B..." or "A/B...".
Tweet media one
3
10
55
1
0
12