Wai Keen Vong

@wkvong

1,341 Followers · 1,029 Following · 6 Media · 49 Statuses

research scientist in computational cognitive science @nyuniversity. interested in concepts, language and abstraction

New York City
Joined April 2009
Pinned Tweet
@wkvong
Wai Keen Vong
7 months
1/ Today in Science, we train a neural net from scratch through the eyes and ears of one child. The model learns to map words to visual referents, showing how grounded language learning from just one child's perspective is possible with today's AI tools.
Tweet media one
56
724
3K
@wkvong
Wai Keen Vong
7 months
6/ Results: Even with limited data, we found that the model can acquire word-referent mappings from merely tens to hundreds of examples, generalize zero-shot to new visual datasets, and achieve multi-modal alignment. Again, genuine language learning is possible from a child's perspective.
Tweet media one
Tweet media two
2
19
124
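For readers curious what the zero-shot evaluation in 6/ looks like mechanically, here is a minimal sketch: embed a frame and a set of candidate words with the two trained encoders, then pick the word with the highest cosine similarity. The function names and signatures (`vision_encoder`, `text_encoder`, `classify_frame`) are stand-ins for illustration, not the released API.

```python
import torch
import torch.nn.functional as F

def classify_frame(frame, candidate_words, vision_encoder, text_encoder):
    """Return the candidate word whose embedding best matches the frame."""
    with torch.no_grad():
        img = F.normalize(vision_encoder(frame.unsqueeze(0)), dim=-1)  # (1, d)
        txt = F.normalize(text_encoder(candidate_words), dim=-1)       # (n, d)
    sims = (img @ txt.T).squeeze(0)   # cosine similarity to each candidate word
    return candidate_words[sims.argmax().item()]
```

Evaluation then reduces to checking how often the predicted word matches the frame's label, with no task-specific training.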
@wkvong
Wai Keen Vong
7 months
5/ We ran exactly this experiment. We trained a neural net (which we call CVCL, related to CLIP by its use of a contrastive objective) on headcam video, which captured slices of what a child saw and heard from 6 to 25 months. It’s an unprecedented look at one child’s experience …
Tweet media one
2
18
120
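For readers unfamiliar with contrastive objectives, a minimal CLIP-style sketch of the kind of loss CVCL uses follows: co-occurring frame-utterance pairs are pulled together in a shared embedding space, and the other pairs in the batch act as negatives. The temperature value and tensor shapes here are illustrative assumptions, not the paper's hyperparameters.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(frame_embs, utterance_embs, temperature=0.07):
    """Symmetric InfoNCE over a batch of co-occurring (frame, utterance) pairs."""
    f = F.normalize(frame_embs, dim=-1)      # (B, d) frame embeddings
    u = F.normalize(utterance_embs, dim=-1)  # (B, d) utterance embeddings
    logits = f @ u.T / temperature           # (B, B) scaled cosine similarities
    targets = torch.arange(f.size(0))        # true pairs lie on the diagonal
    # Each frame must identify its own utterance among the batch (and vice
    # versa); all other pairings serve as in-batch negatives.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2
```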
@wkvong
Wai Keen Vong
7 months
10/ Finally, our findings address a long-standing debate in philosophy and cognitive science: What ingredients do children need to learn words? Given their everyday experience, do they (or any learner) need language-specific inductive biases or innate knowledge to get …
6
12
95
@wkvong
Wai Keen Vong
7 months
4/ To test this, what better than to train a neural network, not on enormous amounts of data from the web, but only on the input that a single child receives? What would it learn then, if anything?
1
13
94
@wkvong
Wai Keen Vong
7 months
8/ Limitations: A typical 2-year-old's vocabulary and word learning skills are still out of reach for the current version of CVCL. What else is missing? Note that both the modeling and the data are inherently limited compared to children’s actual experiences and capabilities. - CVCL …
4
5
91
@wkvong
Wai Keen Vong
7 months
2/ First, we are incredibly grateful to the authors of the SAYCam paper and dataset (@mcxfrank, @andyperfors, and others not on twitter), which we were not involved in collecting and which made our work possible. SAYCam provides an unprecedented look at a child's egocentric …
1
5
81
@wkvong
Wai Keen Vong
7 months
3/ Motivation: There is an enormous gap between how AI systems like GPT-4 learn language and how children do. The best AI systems now train on text with a word count in the trillions; it would take a child about 100K years to match that amount of data. Due to this data gap, many researchers …
1
3
76
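The 100K-years figure in 3/ follows from simple arithmetic, assuming a child hears on the order of ten million words per year (a rough, commonly cited estimate; the exact rate is an assumption here, not a number from the thread):

```python
# Back-of-envelope check of the data-gap claim above.
corpus_words = 1e12            # "a word count in the trillions"
child_words_per_year = 1e7     # assumed ~10M words/year of child-directed speech
print(f"{corpus_words / child_words_per_year:,.0f} years")  # -> 100,000 years
```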
@wkvong
Wai Keen Vong
7 months
7/ CVCL can also learn words given naturalistic experience and generalize to very different images in a more “laboratory-style” evaluation. Here, note how everyday examples of butterflies in books can help support performance in vocabulary tests.
Tweet media one
1
2
67
@wkvong
Wai Keen Vong
7 months
9/ In ongoing work, we are excited to add components (where possible) and see how much closer models can get to child-like language learning. Our goal is to find the minimal set of ingredients to get there. Nevertheless, our work provides a new conceptual blueprint for how …
1
1
57
@wkvong
Wai Keen Vong
7 months
11/ A special thanks to my fantastic collaborators: Wentao Wang (@wentaow10), Emin Orhan and Brenden Lake (@lakebrenden), as well as all of the reviewers and commenters who provided feedback on earlier versions of this work. We're very excited to finally share this research!
9
2
47
@wkvong
Wai Keen Vong
6 years
Pretty stoked to catch Cixin Liu (author of The Three-Body Problem) interviewed by @JiayangFan at the China Institute this evening! Snagged a copy of his latest book too
Tweet media one
Tweet media two
0
0
7
@wkvong
Wai Keen Vong
7 months
@ChrisKortge We do use a pretrained vision model, but the pretraining is done using only the video frames from the baby, not using outside sources of data. We also train a blank-slate vision model in conjunction with language and find that it performs comparably!
0
1
5
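A sketch of the two vision-encoder configurations this reply describes: one initialized from self-supervised pretraining on the child's own headcam frames, and one a blank slate trained jointly with language. The backbone choice and checkpoint filename are illustrative assumptions; the thread does not specify them.

```python
import torch
import torchvision.models as models

def make_vision_encoder(from_scratch: bool):
    # No web-scale pretraining in either configuration.
    encoder = models.resnet50(weights=None)
    encoder.fc = torch.nn.Identity()  # use pooled features as the image embedding
    if not from_scratch:
        # Hypothetical checkpoint: weights from self-supervised pretraining on
        # the child's own headcam frames only (no outside data sources).
        encoder.load_state_dict(torch.load("saycam_ssl_encoder.pt"))
    return encoder
```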
@wkvong
Wai Keen Vong
7 months
@gchrupala @antalvdb @asifa_majid Yes, currently we are piggybacking on a strong speech recognition model (human research assistants who helped transcribe the dataset!). But this is an important limitation we spell out in the paper, and we hope to address this in follow-up work to more closely mimic the learning …
0
0
4
@wkvong
Wai Keen Vong
1 year
@mark_ho_ @Huang_Ham Yes! All of our virtual seminars will be recorded and freely available afterwards!
0
0
3
@wkvong
Wai Keen Vong
7 months
@DanOneata That’s right, we use transcribed child-directed speech as input to the language encoder. We’re moving towards working with audio directly, however, so stay tuned for answers to your question soon!
1
0
3
@wkvong
Wai Keen Vong
7 months
@GreatKingCnut @wentaow10 @LakeBrenden All of the word embeddings in our model are randomly initialized at the start of training, so there’s no presumed knowledge of language from using pretrained embeddings derived elsewhere
1
0
1
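In PyTorch terms, the setup this reply describes amounts to using a randomly initialized embedding table and simply never loading pretrained vectors; the sizes below are placeholders.

```python
import torch.nn as nn

vocab_size, embed_dim = 10_000, 512  # placeholder sizes
word_embeddings = nn.Embedding(vocab_size, embed_dim)
# nn.Embedding initializes from N(0, 1) by default; by never loading
# pretrained vectors, the model starts with no prior knowledge of language.
```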
@wkvong
Wai Keen Vong
7 months
@DanOneata The negative pairs are a form of implicit negative evidence, as the child only experiences positive co-occurrences, from which the negatives can be derived. However, the use of a contrastive objective is primarily as a computational-level explanation (in Marr’s hierarchy) …
0
0
2
@wkvong
Wai Keen Vong
7 months
@AnnaSchapiro Thank you Anna!
0
0
2
@wkvong
Wai Keen Vong
7 months
@EEMStewart Thanks Emma! And congrats on the new job in London!
0
0
1
@wkvong
Wai Keen Vong
7 months
@sreejan_kumar Thanks Sreejan!
0
0
1
@wkvong
Wai Keen Vong
7 months
@gchrupala @DanOneata I love this work (and we do cite it in our paper)! The variability of spoken language definitely makes things more challenging (especially because the audio quality in SAYCam is not particularly good relative to other common speech datasets), but we're actively exploring which …
1
0
0
@wkvong
Wai Keen Vong
7 months
@xuanalogue Hi Xuan, thanks for sharing your thoughts, I largely agree! You might also be interested in a recent preprint by my co-authors and others () showing that video models trained on this dataset exhibit more robust object representations than image-based models
0
0
1
@wkvong
Wai Keen Vong
7 months
@gchrupala @DanOneata Regardless, I think there are strong similarities between the approach used in Grzegorz's paper and ours that speak to the centrality of joint representation and associative learning (via temporal co-occurrences) for learning these cross-modal associations
0
0
1
@wkvong
Wai Keen Vong
7 months
@gchrupala Thanks Grzegorz! That's right, the input to the model is the transcribed child-directed speech, which does simplify the learning problem and the work required to evaluate the model, but incorporating spoken language input is the obvious next step in this line of work!
1
0
1
@wkvong
Wai Keen Vong
7 months
@m_zettersten Thanks Martin!
0
0
1
@wkvong
Wai Keen Vong
7 months
@najoungkim Thanks Najoung!
0
0
1