Wenhao Li Profile Banner
Wenhao Li Profile
Wenhao Li

@WenhaoLi29

515
Followers
46
Following
3
Media
19
Statuses

PhD @UofT, Machine Learning + Reasoning

Joined April 2023
@WenhaoLi29
Wenhao Li
14 days
We trained a Vision Transformer to solve ONE single task from @fchollet and @mikeknoop ’s @arcprize . Unexpectedly, it failed to produce the test output, even when using 1 MILLION examples! Why is this the case? 🤔
[image attached]
29
132
1K
@WenhaoLi29
Wenhao Li
14 days
We investigated and found fundamental limitations in the vanilla Vision Transformer that prevent it from performing visual abstract reasoning. We propose enhancements to address these shortcomings in our new paper “Tackling the Abstraction and Reasoning Corpus with
[image attached]
6
34
368
@WenhaoLi29
Wenhao Li
14 days
With these enhancements, our framework “ViTARC” improved significantly over the vanilla ViT! Task-specific models achieved 100% accuracy on over half of the 400 ARC training tasks.
[image attached]
6
14
209
@WenhaoLi29
Wenhao Li
14 days
Specifically, we found that: 1) ViT has limited spatial awareness because images are represented as flattened image patches. We address this by introducing special 2D visual tokens that make the ViT spatially aware. 2) ViT has limited access to
5
7
169
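To make the flattened-patch issue concrete, here is a minimal NumPy sketch of a 2D sinusoidal positional encoding, where half the channels encode a patch's row and half its column. This is an illustrative example of the general idea, not the actual ViTARC implementation; the function names and dimensions are mine.

```python
import numpy as np

def sinusoidal_1d(positions, dim):
    """Standard 1D sinusoidal positional encoding."""
    pe = np.zeros((len(positions), dim))
    div = np.exp(np.arange(0, dim, 2) * (-np.log(10000.0) / dim))
    pe[:, 0::2] = np.sin(positions[:, None] * div)
    pe[:, 1::2] = np.cos(positions[:, None] * div)
    return pe

def positional_encoding_2d(height, width, dim):
    """2D encoding: first half of the channels encodes the row index,
    second half the column index. Unlike a flat 1D encoding over patch
    order, tokens in the same row (or column) share half their encoding,
    so 2D adjacency is explicit in the representation."""
    assert dim % 2 == 0
    rows = np.repeat(np.arange(height), width)  # row index of each patch
    cols = np.tile(np.arange(width), height)    # column index of each patch
    return np.concatenate([sinusoidal_1d(rows, dim // 2),
                           sinusoidal_1d(cols, dim // 2)], axis=1)

# A 5x5 ARC-style grid with 64-dim tokens.
pe = positional_encoding_2d(5, 5, 64)
# Patches 0 and 1 are horizontal neighbours: same row half of the encoding.
assert np.allclose(pe[0, :32], pe[1, :32])
```

With a flat 1D encoding over patch index, the token above a given patch sits `width` positions away and shares no obvious structure with it; the 2D variant makes that vertical neighbour differ only in the row half.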
@WenhaoLi29
Wenhao Li
14 days
@fchollet We completely agree and came up with the same idea while working on ARC-AGI! We explored using 2D Positional Encodings along with some other spatially aware modifications to enhance Vision Transformers in our latest work:
0
1
33
@WenhaoLi29
Wenhao Li
14 days
@fchollet @far__el Agreed, an ideal solver should output programs in code. Our work focuses on adapting Transformers to better handle 2D representations for reasoning tasks, which applies to input-output models and could extend to program-generating models in the future.
0
1
14
@WenhaoLi29
Wenhao Li
13 days
@GregKamradt @8teAPi @fchollet Yeah, this model isn’t an ARC solver (yet) since it's more of a 1M-shot rather than few-shot. But the enhancement still matters for an ARC solver using a transformer as the backbone, as it will need to read grids effectively anyway.
1
0
7
@WenhaoLi29
Wenhao Li
13 days
@breenemachine @fchollet @mikeknoop @arcprize Not a solver just yet, but we’re cooking something up!
1
0
6
@WenhaoLi29
Wenhao Li
13 days
@JonathanRoseD @fchollet @mikeknoop @arcprize Yes, we're working on it! The enhancements we mentioned are not too hard to implement on a raw CodeT5 or T5, so you could give it a try directly in the meantime.
0
0
5
@WenhaoLi29
Wenhao Li
13 days
@HealthyCode @fchollet @mikeknoop @arcprize Great question! We haven't tested it on medical images yet, but we believe our enhancement could help, especially if the patches are small.
0
0
3
@WenhaoLi29
Wenhao Li
13 days
@LodestoneE621 Nope, we kept it simple for the most conventional setups. RoPE already has both APE and RPE characteristics, so I’d assume it would perform better—especially if someone fine-tunes the sinusoidal base to match the grid size.
0
0
3
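On the RoPE point above: the reason RoPE has both APE and RPE characteristics is that each token is rotated by an angle proportional to its absolute position, yet the attention dot product ends up depending only on the relative offset. A minimal 1D sketch (my own illustration, not the ViTARC code):

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotary position embedding applied to one vector x at position pos.

    Channel pairs (x[2i], x[2i+1]) are rotated by pos * theta_i. The dot
    product of two rotated vectors depends only on their position offset,
    which is the relative-position property."""
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)  # per-pair frequency
    ang = pos * theta
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * np.cos(ang) - x2 * np.sin(ang)
    out[1::2] = x1 * np.sin(ang) + x2 * np.cos(ang)
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=64), rng.normal(size=64)
# Relative property: the score depends only on the offset (here 4),
# not on the absolute positions.
s1 = rope(q, 3) @ rope(k, 7)
s2 = rope(q, 13) @ rope(k, 17)
assert np.isclose(s1, s2)
```

The `base` parameter sets the rotation frequencies; tuning it to the grid size (as suggested above) changes how quickly the encoding distinguishes nearby versus distant cells.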
@WenhaoLi29
Wenhao Li
13 days
@rkarmani @fchollet @mikeknoop @arcprize No, this isn’t an ARC solver yet (still working on generalization), but a solver still needs to read grids, so the enhancements are definitely relevant.
0
0
3
@WenhaoLi29
Wenhao Li
13 days
@ztang230 @yudongxuwil @ScottSanner @lyeskhalil For us: 1. Our model is small, with ~2M trainable params. 2. The number of layers seems to matter for reasoning, but 3 layers were sufficient in our experiments — though adding more may help with tougher tasks. 3. We haven’t observed that sharp loss drop in our tests.
0
0
3
@WenhaoLi29
Wenhao Li
13 days
@tinycrops @fchollet @mikeknoop @arcprize @_jason_today Looks great! And yes, for OPE in our paper, you can use any external source of objectness information.
0
0
2
@WenhaoLi29
Wenhao Li
13 days
@georgiysk @fchollet @mikeknoop @arcprize Thanks! We're just using good old seaborn for the figures.
0
0
2