Wai Keen Vong

@wkvong

1,341 Followers · 1,029 Following · 6 Media · 49 Statuses

research scientist in computational cognitive science @nyuniversity. interested in concepts, language and abstraction

New York City
Joined April 2009
Pinned Tweet
@wkvong
Wai Keen Vong
7 months
1/ Today in Science, we train a neural net from scratch through the eyes and ears of one child. The model learns to map words to visual referents, showing how grounded language learning from just one child's perspective is possible with today's AI tools.
Tweet media one
56
724
3K
@wkvong
Wai Keen Vong
7 months
6/ Results: Even with limited data, we found that the model can acquire word-referent mappings from merely tens to hundreds of examples, generalize zero-shot to new visual datasets, and achieve multi-modal alignment. Again, genuine language learning is possible from a child's perspective.
Tweet media one
Tweet media two
2
19
124
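For readers curious what the zero-shot evaluation in 6/ looks like mechanically, here is a minimal sketch: embed a frame and a set of candidate words with the two trained encoders, then pick the word with the highest cosine similarity. The function names and signatures (`vision_encoder`, `text_encoder`, `classify_frame`) are stand-ins for illustration, not the released API.

```python
import torch
import torch.nn.functional as F

def classify_frame(frame, candidate_words, vision_encoder, text_encoder):
    """Return the candidate word whose embedding best matches the frame."""
    with torch.no_grad():
        img = F.normalize(vision_encoder(frame.unsqueeze(0)), dim=-1)  # (1, d)
        txt = F.normalize(text_encoder(candidate_words), dim=-1)       # (n, d)
    sims = (img @ txt.T).squeeze(0)   # cosine similarity to each candidate word
    return candidate_words[sims.argmax().item()]
```

Evaluation then reduces to checking how often the predicted word matches the frame's label, with no task-specific training.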
@wkvong
Wai Keen Vong
7 months
5/ We ran exactly this experiment. We trained a neural net (which we call CVCL, related to CLIP by its use of a contrastive objective) on headcam video, which captured slices of what a child saw and heard from 6 to 25 months. It’s an unprecedented look at one child’s experience …
Tweet media one
2
18
120
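For readers unfamiliar with contrastive objectives, a minimal CLIP-style sketch of the kind of loss CVCL uses follows: co-occurring frame-utterance pairs are pulled together in a shared embedding space, and the other pairs in the batch act as negatives. The temperature value and tensor shapes here are illustrative assumptions, not the paper's hyperparameters.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(frame_embs, utterance_embs, temperature=0.07):
    """Symmetric InfoNCE over a batch of co-occurring (frame, utterance) pairs."""
    f = F.normalize(frame_embs, dim=-1)      # (B, d) frame embeddings
    u = F.normalize(utterance_embs, dim=-1)  # (B, d) utterance embeddings
    logits = f @ u.T / temperature           # (B, B) scaled cosine similarities
    targets = torch.arange(f.size(0))        # true pairs lie on the diagonal
    # Each frame must identify its own utterance among the batch (and vice
    # versa); all other pairings serve as in-batch negatives.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2
```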
@wkvong
Wai Keen Vong
7 months
10/ Finally, our findings address a long-standing debate in philosophy and cognitive science: What ingredients do children need to learn words? Given their everyday experience, do they (or any learner) need language-specific inductive biases or innate knowledge to get …
6
12
95
@wkvong
Wai Keen Vong
7 months
4/ To test this, what better than to train a neural network, not on enormous amounts of data from the web, but only on the input that a single child receives? What would it learn then, if anything?
1
13
94
@wkvong
Wai Keen Vong
7 months
8/ Limitations: A typical 2-year-old's vocabulary and word learning skills are still out of reach for the current version of CVCL. What else is missing? Note that both the modeling and the data are inherently limited compared to children’s actual experiences and capabilities. - CVCL …
4
5
91
@wkvong
Wai Keen Vong
7 months
2/ First, we are incredibly grateful to the authors of the SAYCam paper and dataset (@mcxfrank, @andyperfors, and others not on twitter), which we were not involved in collecting and which made our work possible. SAYCam provides an unprecedented look at a child's egocentric …
1
5
81
@wkvong
Wai Keen Vong
7 months
3/ Motivation: There is an enormous gap between how AI systems like GPT-4 learn language and how children do. The best AI systems now train on text with a word count in the trillions; it would take a child about 100K years to match that amount of data. Due to this data gap, many researchers …
1
3
76
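The 100K-years figure in 3/ follows from simple arithmetic, assuming a child hears on the order of ten million words per year (a rough, commonly cited estimate; the exact rate is an assumption here, not a number from the thread):

```python
# Back-of-envelope check of the data-gap claim above.
corpus_words = 1e12            # "a word count in the trillions"
child_words_per_year = 1e7     # assumed ~10M words/year of child-directed speech
print(f"{corpus_words / child_words_per_year:,.0f} years")  # -> 100,000 years
```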
@wkvong
Wai Keen Vong
7 months
7/ CVCL can also learn words given naturalistic experience and generalize to very different images in a more “laboratory-style” evaluation. Here, note how everyday examples of butterflies in books can help support performance in vocabulary tests.
Tweet media one
1
2
67
@wkvong
Wai Keen Vong
7 months
9/ In ongoing work, we are excited to add components (where possible) and see how much closer models can get to child-like language learning. Our goal is to find the minimal set of ingredients to get there. Nevertheless, our work provides a new conceptual blueprint for how …
1
1
57
@wkvong
Wai Keen Vong
7 months
11/ A special thanks to my fantastic collaborators: Wentao Wang (@wentaow10), Emin Orhan and Brenden Lake (@lakebrenden), as well as all of the reviewers and commenters who provided feedback on earlier versions of this work. We're very excited to finally share this research!
9
2
47
@wkvong
Wai Keen Vong
6 years
Pretty stoked to catch Cixin Liu (author of The Three-Body Problem) interviewed by @JiayangFan at the China Institute this evening! Snagged a copy of his latest book too
Tweet media one
Tweet media two
0
0
7
@wkvong
Wai Keen Vong
7 months
@ChrisKortge We do use a pretrained vision model, but the pretraining is done using only the video frames from the baby, not using outside sources of data. We also train a blank-slate vision model in conjunction with language and find that it performs comparably!
0
1
5
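A sketch of the two vision-encoder configurations this reply describes: one initialized from self-supervised pretraining on the child's own headcam frames, and one a blank slate trained jointly with language. The backbone choice and checkpoint filename are illustrative assumptions; the thread does not specify them.

```python
import torch
import torchvision.models as models

def make_vision_encoder(from_scratch: bool):
    # No web-scale pretraining in either configuration.
    encoder = models.resnet50(weights=None)
    encoder.fc = torch.nn.Identity()  # use pooled features as the image embedding
    if not from_scratch:
        # Hypothetical checkpoint: weights from self-supervised pretraining on
        # the child's own headcam frames only (no outside data sources).
        encoder.load_state_dict(torch.load("saycam_ssl_encoder.pt"))
    return encoder
```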
@wkvong
Wai Keen Vong
7 months
@gchrupala @antalvdb @asifa_majid Yes, currently we are piggybacking on a strong speech recognition model (human research assistants who helped transcribe the dataset!). But this is an important limitation we spell out in the paper, and we hope to address this in follow-up work to more closely mimic the learning …
0
0
4
@wkvong
Wai Keen Vong
1 year
@mark_ho_ @Huang_Ham Yes! All of our virtual seminars will be recorded and freely available afterwards!
0
0
3
@wkvong
Wai Keen Vong
7 months
@DanOneata That’s right, we use transcribed child-directed speech as input to the language encoder. We’re moving towards working with audio directly, however, so stay tuned for answers to your question soon!
1
0
3
@wkvong
Wai Keen Vong
7 months
@GreatKingCnut @wentaow10 @LakeBrenden All of the word embeddings in our model are randomly initialized at the start of training, so there’s no presumed knowledge of language from using pretrained embeddings derived elsewhere
1
0
1
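In PyTorch terms, the setup this reply describes amounts to using a randomly initialized embedding table and simply never loading pretrained vectors; the sizes below are placeholders.

```python
import torch.nn as nn

vocab_size, embed_dim = 10_000, 512  # placeholder sizes
word_embeddings = nn.Embedding(vocab_size, embed_dim)
# nn.Embedding initializes from N(0, 1) by default; by never loading
# pretrained vectors, the model starts with no prior knowledge of language.
```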
@wkvong
Wai Keen Vong
7 months
@DanOneata The negative pairs are a form of implicit negative evidence, as the child only experiences positive co-occurrences, from which the negatives can be derived. However, the use of a contrastive objective is primarily as a computational-level explanation (in Marr’s hierarchy) …
0
0
2
@wkvong
Wai Keen Vong
7 months
@AnnaSchapiro Thank you Anna!
0
0
2
@wkvong
Wai Keen Vong
7 months
@EEMStewart Thanks Emma! And congrats on the new job in London!
0
0
1
@wkvong
Wai Keen Vong
7 months
@sreejan_kumar Thanks Sreejan!
0
0
1
@wkvong
Wai Keen Vong
7 months
@gchrupala @DanOneata I love this work (and we do cite it in our paper)! The variability of spoken language definitely makes things more challenging (especially because the audio quality in SAYCam is not particularly good relative to other common speech datasets), but we're actively exploring which …
1
0
0
@wkvong
Wai Keen Vong
7 months
@xuanalogue Hi Xuan, thanks for sharing your thoughts, I largely agree! You might also be interested in a recent preprint by my co-authors and others () showing that video models trained on this dataset exhibit more robust object representations than image-based models
0
0
1
@wkvong
Wai Keen Vong
7 months
@gchrupala @DanOneata Regardless, I think there are strong similarities between the approach used in Grzegorz's paper and ours that speak to the centrality of joint representation and associative learning (via temporal co-occurrences) for learning these cross-modal associations
0
0
1
@wkvong
Wai Keen Vong
7 months
@gchrupala Thanks Grzegorz! That's right, the input to the model is the transcribed child-directed speech, which does simplify the learning problem and the work required to evaluate the model, but incorporating spoken language input is the obvious next step in this line of work!
1
0
1
@wkvong
Wai Keen Vong
7 months
@m_zettersten Thanks Martin!
0
0
1
@wkvong
Wai Keen Vong
7 months
@najoungkim Thanks Najoung!
0
0
1