Max Bain Profile Banner
Max Bain Profile
Max Bain

@maxhbain

1,921
Followers
547
Following
32
Media
263
Statuses

multimodal @RekaAILabs | prev: phd @Oxford_VGG hardwork-pilled

Joined April 2021
Pinned Tweet
@maxhbain
Max Bain
2 years
WhisperX version 2.0 out, now with speaker diarization and character-level timestamps. 🧵
27
202
1K
@maxhbain
Max Bain
2 years
Are you using @openai 's Whisper for speech recognition and finding the timestamps are out of sync? Just dropped: WhisperX, with word-level timestamp accuracy from force-aligning Whisper with wav2vec2.0 🧵 [1/n]
20
77
577
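For anyone wanting to try this, a minimal sketch of the transcribe-then-align flow, following the usage pattern documented in the WhisperX repo (model names, arguments, and defaults may differ between versions):

    import whisperx

    device = "cuda"
    audio = whisperx.load_audio("audio.mp3")

    # 1. Transcribe with Whisper
    model = whisperx.load_model("large-v2", device, compute_type="float16")
    result = model.transcribe(audio, batch_size=16)

    # 2. Force-align the transcript against a wav2vec2.0 phoneme model
    model_a, metadata = whisperx.load_align_model(language_code=result["language"], device=device)
    result = whisperx.align(result["segments"], model_a, metadata, audio, device)

    print(result["segments"])  # words now carry accurate start/end timestamps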
@maxhbain
Max Bain
7 months
RIP WebVid dataset, 23 Feb 2024. Today I received a cease-and-desist letter from @Shutterstock saying that I must take down WebVid, an academic video captioning dataset, and can no longer provide the URLs and captions to the research community.
Tweet media one
18
35
251
@maxhbain
Max Bain
3 years
Our work on Automated Audiovisual Behaviour Recognition in Wild Primates is finally out. An end-to-end detection, tracking, and behaviour recognition pipeline, using both the audio and visual inputs (helpful for robustness in wild footage)
3
64
238
@maxhbain
Max Bain
3 years
Currently working on a demo for our Frozen-in-Time model, retrieving videos amongst millions in the WebVid dataset. Cool to see how sensitive our model is to small changes in the text query!
3
30
219
@maxhbain
Max Bain
3 years
WebVid: large-scale text-video dataset now available. 2.5M text-video pairs (10M coming soon). Pretrain your E2E video-language models.
3
40
178
@maxhbain
Max Bain
7 months
💡Advice: if you are building yourself a long-term training codebase, then avoid heavy external libraries at all costs (HF, Hydra, Lightning, even wandb, etc.)
14
4
151
@maxhbain
Max Bain
5 months
New leader on the Reka Vibe-Eval multimodal benchmark. It actually solves some of the anti-scaling examples, nice work @OpenAI . But the hard set is still hard (only 54%). @RekaAILabs
Tweet media one
6
24
150
@maxhbain
Max Bain
5 months
Tweet media one
@YiTayML
Yi Tay
5 months
New paper from @RekaAILabs 🔥 (yes an actual paper). This time we're releasing part of our internal evals which we call Vibe-Eval 😃 This comprises a hard set which imo is pretty challenging for frontier models today. The fun part here is that we constructed it by trying to
Tweet media one
22
86
575
3
13
112
@maxhbain
Max Bain
1 year
Come say hi at #CVPR23. Will be presenting the project behind WhisperX 😎🎬 AutoAD: Movie Description in Context. June 22, Thu AM (Highlight, Poster 234). We train a model to automatically generate audio descriptions
Tweet media one
3
11
77
@maxhbain
Max Bain
7 months
So: only big companies who can afford to pay for the Shutterstock license get to train on those videos, making it increasingly difficult for academic and independent researchers.
4
11
80
@maxhbain
Max Bain
6 months
A good day. Testing our new ✨Reka Core✨ model and it's showing promising capabilities. Complex table understanding is one of them. Lmk if you are interested in early access @RekaAILabs
Tweet media one
Tweet media two
22
19
75
@maxhbain
Max Bain
2 years
Just landed in SF ✈️ to showcase WhisperX at the HF🤗 open-source meetup. Looking forward to meeting everyone! Interested in 🗣🎥speech/video understanding, building model APIs🌐? Let’s chat & thank you @ClementDelangue @huggingface
Tweet media one
3
5
74
@maxhbain
Max Bain
2 years
Full pipeline now includes speaker diarization to assign speaker labels to each word (and character). Runs @openai ’s Whisper, @MetaAI 's wav2vec2.0, and diarization independently to produce robust word-level segmentation with speaker labels
Tweet media one
3
5
64
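Continuing the sketch from the alignment thread above, the diarization step roughly follows the repo's documented usage (the pyannote-backed pipeline needs a Hugging Face access token; treat exact argument names as version-dependent):

    import whisperx

    # `audio` and `result` come from the transcribe + align steps
    diarize_model = whisperx.DiarizationPipeline(use_auth_token="HF_TOKEN", device="cuda")
    diarize_segments = diarize_model(audio)
    result = whisperx.assign_word_speakers(diarize_segments, result)

    for seg in result["segments"]:
        print(seg.get("speaker"), seg["text"])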
@maxhbain
Max Bain
6 months
Not too long ago we built a scalable dataloader at @RekaAILabs for any text/img/video etc. It's no easy feat, especially with no good open-source implementations around, and we had to rebuild it a couple of times. But, looks like @wightmanr cooked here 👏👏.
@m_olbap
Pablo Montalvo
6 months
The datasets are in .tar shards pre-shuffled and ready to be put in most training pipelines, and are compatible with huggingface datasets: you can stream them as above. @wightmanr cooked a lib to have optimized sharded dataloaders at scale, lean and nice!
1
3
42
1
2
50
@maxhbain
Max Bain
6 months
Mood at @RekaAILabs the past few months
Tweet media one
@RekaAILabs
Reka
6 months
Meet Reka Core, our best and most capable multimodal language model yet. 🔮 It’s been a busy few months training this model and we are glad to finally ship it! 💪 Core has a lot of capabilities, and one of them is understanding video --- let’s see what Core thinks of the 3 body
53
242
1K
1
3
46
@maxhbain
Max Bain
5 months
Don't sleep on inverse scaling (appendix). Yes, it's just a couple of qualitative examples, but it's a big deal. I don't see the current approach of frontier models overcoming this. @YiTayML thinks I'm being a bit paranoid🥲, but the vibe is off here. Hint: PhDs, here's a good area to solve
Tweet media one
@RekaAILabs
Reka
5 months
Evals are notoriously difficult to get right but necessary to move the field forward. 🌟 As part of our commitment to science, we’re releasing a subset of our internal evals. 🙌 Vibe-Eval is an open and hard benchmark comprising 269 image-text prompts for measuring the
Tweet media one
7
42
190
3
2
41
@maxhbain
Max Bain
5 months
Wow, some people really be emailing me for details about our paper, then months later publish some little hack on top of it without discussing or even citing our original work🤦‍♂️🤦‍♂️
3
0
39
@maxhbain
Max Bain
8 months
Honoured to be working with this truly cracked team. (Supports video too.) More to come.
@RekaAILabs
Reka
8 months
Introducing Reka Flash, our efficient and highly capable multimodal language model. Try it at Reka playground 🛝 for free today. 🧵 Thread, blog & links below 👇
Tweet media one
13
44
257
0
2
38
@maxhbain
Max Bain
3 years
Impressive work but will the model be released publicly? If not, then maybe a more appropriate title would be: "Florence: A New Foundation Model for Companies with access to 500 A100s"
@_akhaliq
AK
3 years
Florence: A New Foundation Model for Computer Vision abs: sota results in majority of 44 representative benchmarks, ImageNet-1K zero-shot classification with top-1 accuracy of 83.74 and the top-5 accuracy of 97.18, 62.4 mAP on COCO fine tuning
Tweet media one
2
59
298
1
0
38
@maxhbain
Max Bain
6 months
Couldn't be happier to hear this. Tbh, WhisperX was a big lesson for me. The "academic" voice inside my head kept worrying there was little novelty (R2 complained too), but it solved a problem for us, and for others too. So I told R2 the same: people were using it & I don't care anymore.
@kurianbenoy2
Kurian Benoy 💻
6 months
Still can't believe @maxhbain , my favourite ML researcher, who wrote the iconic WhisperX paper and has a PhD from Oxford, started following me. The clarity of the WhisperX paper is unparalleled, and it was selected for Interspeech 2023. I have read it like 50+ times. You made my day 😊.
2
2
36
1
8
38
@maxhbain
Max Bain
1 year
🧵 Personal updates: PhD, @RekaAILabs , WhisperX V4...
4
0
35
@maxhbain
Max Bain
5 months
Wow this paper only has 15 references 🥲. Authors: please respect the field you build upon. But it’s from Microsoft Research, not surprised…
@_akhaliq
AK
5 months
LLM-AD Large Language Model based Audio Description System The development of Audio Description (AD) has been a pivotal step forward in making video content more accessible and inclusive. Traditionally, AD production has demanded a considerable amount of skilled labor,
Tweet media one
2
14
66
3
3
34
@maxhbain
Max Bain
7 months
Tbh I don't really have the bandwidth to fight this, so today I took down the website and the CSV files of the dataset. Yep, the dataset is just some CSV files :')
Tweet media one
2
1
33
@maxhbain
Max Bain
7 months
alt take: if you come straight from academia / OSS then sh*tty hardware, infra, and clusterf--k codebases are simply par for the course ⛳️.
@YiTayML
Yi Tay
7 months
Long overdue but here's a new blogpost on training LLMs in the wilderness from the ground up 😄🧐 In this blog post, I discuss: 1. Experiences in procuring compute & variance in different compute providers. Our biggest finding/surprise is that variance is super high and it's
Tweet media one
44
254
2K
1
0
31
@maxhbain
Max Bain
5 months
obvious take but sometimes all you need is a weekend with friends, adventure, and no laptop
1
1
31
@maxhbain
Max Bain
8 months
Tweet media one
1
6
28
@maxhbain
Max Bain
7 months
Apparently I have not made enough "guardrails" to ensure the dataset is only used for academic purposes... The letter then cites text-to-video generation works from @BytedanceTalk @tiktok_us @AlibabaGroup that train using WebVid, and apparently the dataset owner is liable :')
Tweet media one
3
0
26
@maxhbain
Max Bain
7 months
writing a new codebase😎
modifying a codebase🤔
merging two codebases together🤬🤬🤬
2
0
26
@maxhbain
Max Bain
7 months
E.g. a simple @huggingface tokenizer, or even a basic image transformation, can be riddled with bugs or even have small but catastrophic differences from what you think it does.
1
0
25
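One cheap guardrail against this: diff the tokenizer against an alternate implementation on your own data before trusting it. A minimal sketch (the checkpoint is just an example; any fast/slow tokenizer pair works):

    from transformers import AutoTokenizer

    fast = AutoTokenizer.from_pretrained("gpt2", use_fast=True)
    slow = AutoTokenizer.from_pretrained("gpt2", use_fast=False)

    samples = ["hello world", "  leading spaces", "emoji 🗣 and\nnewlines"]
    for text in samples:
        if fast.encode(text) != slow.encode(text):
            print("token mismatch:", repr(text))
        if fast.decode(fast.encode(text)) != text:
            print("lossy round-trip:", repr(text))  # many tokenizers do not round-trip exactly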
@maxhbain
Max Bain
6 months
Some days I'm so scaling-pilled.
> just wait bro, we'll 10x compute + data
Then others, like today, I run a simple prompt (albeit OOD) and all the frontier VLMs fail miserably, including Reka Core😔. Can scale solve all these problems in the current meta? I don't think so
Tweet media one
Tweet media two
Tweet media three
3
0
25
@maxhbain
Max Bain
1 year
Catch me at #INTERSPEECH2023, presenting WhisperX Thurs AM. Chatting about anything: multispeaker ASR, multimodal LLMs 🗣️📹💬 >> Meanwhile trying to ship WhisperX V4 before I start my next job
Tweet media one
1
2
24
@maxhbain
Max Bain
6 months
Reka Edge is OP.
Tweet media one
2
1
22
@maxhbain
Max Bain
1 year
🚨🚨 Those interested in video understanding / video-language: 🧵
Check out the task of Automating Audio Descriptions (AD).
✅ High-quality training and evaluation data
✅ True multimodal reasoning
✅ Immediate societal benefit
@TengdaHan
Tengda Han
1 year
I will be presenting our recent work "AutoAD II: The Sequel - Who, When, and What in Movie Audio Description" @ICCVConference Thursday 5th, 02:30 PM at "Nord" - 148. Work with @maxhbain @NagraniArsha @gulvarol @WeidiXie and Andrew Zisserman. See you then!
Tweet media one
1
7
41
0
1
22
@maxhbain
Max Bain
2 years
@OpenAI @MetaAI Voice Activity Detection pre-filtering improves alignment quality a lot, and prevents catastrophic timestamp errors by Whisper (such as negative timestamp durations, etc.).
2
1
22
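The idea, roughly: run VAD first and only pass detected speech regions to transcription and alignment, so silence and music cannot produce degenerate timestamps. A sketch using pyannote's pretrained VAD pipeline (token handling and the per-region strategy are assumptions; WhisperX's internal VAD wiring may differ):

    from pyannote.audio import Pipeline

    vad = Pipeline.from_pretrained("pyannote/voice-activity-detection", use_auth_token="HF_TOKEN")
    speech = vad("audio.wav")

    # keep only speech regions; transcribe/align each one independently
    for segment in speech.get_timeline().support():
        print(f"speech from {segment.start:.2f}s to {segment.end:.2f}s")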
@maxhbain
Max Bain
5 months
ur damned if you chase the frontier, and ur damned if you don’t
4
1
20
@maxhbain
Max Bain
2 years
@OpenAI 🧵[3/n] However, phoneme-based models such as wav2vec2.0 produce much more accurate timestamps. WhisperX leverages these models using forced alignment on the Whisper transcription to generate word-level timestamps.
Tweet media one
0
1
18
@maxhbain
Max Bain
7 months
A lot of my PhD I tried to Frankenstein together many of these popular tools, only to later waste many days debugging some bug introduced by these tools
1
0
18
@maxhbain
Max Bain
7 months
@huggingface @LightningAI bonus: you may later get asked to implement these things (tokenization, model parallelism, checkpoint saving, etc.) in an interview, so it'll prep you for those too
0
0
18
@maxhbain
Max Bain
3 months
some days the code just flows
0
1
17
@maxhbain
Max Bain
7 months
@Jerry8448 @Shutterstock the dataset is just a list of URLs that point to publicly available, watermarked, low-res videos (see attached), in what way is it stolen? WebVid is minute compared to all the copyrighted data that big tech trains on. Open-source can stay further and further behind, or we can
1
2
16
@maxhbain
Max Bain
2 years
The paper says 4M in the table for # pretraining images. But the implementation details say the visual encoder is initialised with CLIP (400M image-text pairs)🤔?
@_akhaliq
AK
2 years
Bridge-Tower: Building Bridges Between Encoders in Vision-Language Representation Learning abs: github: pre-trained with only 4M images, achieves state-of-the-art performance on various downstream vision-language tasks
Tweet media one
1
11
41
1
3
15
@maxhbain
Max Bain
5 months
PS *not* saying this out of self-interest, the 15 refs include our 2 papers, but there are like 80 other missing references from others' great work in this field. Just copy them from our papers if you're that lazy, jesus
0
0
15
@maxhbain
Max Bain
7 months
What's the best multispeaker ASR model / method right now (open-source) ?
4
0
13
@maxhbain
Max Bain
3 years
Paper: Helping researchers process large volumes of video data to study and monitor animal behaviour at large scale
2
0
13
@maxhbain
Max Bain
6 months
and as @YiTayML said, getting to the point where you don't care about others' short-term judgement of your work actually helps you become a better researcher.
@YiTayML
Yi Tay
6 months
research is an immensely taxing endeavour. hours spent doing IC work, debugging and what not. a paper is a canvas for researchers to express themselves after all the hard work, at the end of the day. it's my art. at least let me paint the way i want to paint. The reason why i am
7
18
240
3
1
13
@maxhbain
Max Bain
1 year
@RekaAILabs Thankfully defended the DPhil last month, with the pleasure of Prof. @BernardSGhanem and @chrirupp as my examiners. Concluding a total of 8 (4+4) years at Oxford
Tweet media one
1
0
12
@maxhbain
Max Bain
7 months
@huggingface ofc you can write extensive unit tests yourself. But imo, at that point you may as well Ctrl+C/V only the critical LOC you need from these tools
1
0
11
@maxhbain
Max Bain
1 year
@RekaAILabs @BernardSGhanem @chrirupp Today I begin my first day joining the @RekaAILabs team, building multimodal LLMs
3
0
11
@maxhbain
Max Bain
2 years
@OpenAI 🧵[2/n] @openAI ’s Whisper shows impressive transcription performance, but often the corresponding timestamps are out of sync by several seconds.
1
0
10
@maxhbain
Max Bain
2 years
@OpenAI 🧵[5/n] Of course, it would be better if a single model did everything. One way would be teacher-student, where Whisper learns to output wav2vec2.0's aligned timestamps. If @OpenAI open-sourced the training data and script, it would be cool to try this :)
3
1
10
@maxhbain
Max Bain
7 months
@huggingface also no shade to @huggingface @LightningAI Hydra etc! These are goated🐐 ML libraries for spinning up an MVP/experiments, and have saved many people much time
1
0
10
@maxhbain
Max Bain
7 months
I literally don't know a single ML researcher that **hasn't** worked on data curation
@leavittron
Matthew Leavitt
7 months
Data curation is a frontier research problem. There’s only a handful of scientists in the world with deep expertise. And let’s be real—most scientists can’t build a deployable product that scales effortlessly.
2
0
9
0
0
10
@maxhbain
Max Bain
7 months
@kilian_maciej @Shutterstock video2dataset lives on
0
0
9
@maxhbain
Max Bain
6 months
in scale we trust
@ZhongkaiZhu
Zhongkai Zhu
6 months
Can't wait to show the world what our model can do! Have faith in the scaling law and keep cooking🫕
0
1
12
0
0
8
@maxhbain
Max Bain
2 years
Nice to see WebVid used for other tasks :) I wonder how this scales to WebVid10M; for retrieval we found performance saturated, but that might be a task/model-specific problem
@AntoineYang2
Antoine Yang
2 years
Just Ask extension is accepted to the TPAMI Special Issue on the Best Papers of ICCV 2021! We release WebVidVQA3M, a new automatically generated VideoQA dataset. w/ @antoine77340 , J. Sivic, I. Laptev and @CordeliaSchmid . Webpage: @ComputerSociety #IEEECS
Tweet media one
1
1
19
1
0
9
@maxhbain
Max Bain
5 months
🙏Just image models for now*. We could make one for video too, but current video models fail at the 'easy'/'normal' sets already😭😭
@StanSzymanowicz
Stan Szymanowicz
5 months
Tough eval benchmark for video models from @RekaAILabs - let’s see how the models get on with cracking the ‘hard’ set
0
1
5
1
0
8
@maxhbain
Max Bain
6 months
@fchollet @soumithchintala @JeffDean @GoogleDeepMind benchmarking with HF trainer and models🤣, good one
1
0
7
@maxhbain
Max Bain
2 years
@ClementDelangue @huggingface oh and thank you @ChCh_Oxford for considering my very last minute travel grant application
1
0
7
@maxhbain
Max Bain
3 years
From our Frozen-in-Time work, joint with @NagraniArsha , @gulvarol , A. Zisserman.
0
1
7
@maxhbain
Max Bain
6 months
@RekaAILabs @wightmanr Some of the interesting challenges/reqs:
- scale to **many** petabytes
- agnostic to NFS / object storage
- shardable
- restartable at any state
- tolerant to corrupt media files / network hangs
1
0
6
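To make the "restartable at any state" and "tolerant to corrupt media" requirements from the list above concrete, a toy sketch over local .tar shards (purely hypothetical, not Reka's implementation):

    import tarfile

    def iter_samples(shard_paths, start_shard=0, start_sample=0):
        # resume by persisting (shard_idx, sample_idx) in the training checkpoint
        for shard_idx in range(start_shard, len(shard_paths)):
            try:
                with tarfile.open(shard_paths[shard_idx]) as tar:
                    for sample_idx, member in enumerate(tar):
                        if shard_idx == start_shard and sample_idx < start_sample:
                            continue  # skip samples already consumed before the restart
                        f = tar.extractfile(member)
                        if f is None:
                            continue  # skip directory entries and other non-files
                        yield shard_idx, sample_idx, member.name, f.read()
            except tarfile.TarError:
                continue  # tolerate a corrupt shard instead of killing the run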
@maxhbain
Max Bain
3 years
code, pre-trained models and data will be released at this repo: ❄️⏳❄️⏳❄️⏳❄️⏳❄️⏳❄️⏳❄️⏳❄️⏳
@NagraniArsha
Arsha Nagrani
3 years
Separation of effort in the image and video retrieval communities is suboptimal - they share a lot of overlapping info! Check out our NEW model for visual-text retrieval, easily trains on *both* images and videos jointly, setting new SOTA results!
Tweet media one
2
36
171
0
0
7
@maxhbain
Max Bain
5 months
@kohjingyu I agree with you, but that's why Google missed a trick by not open-sourcing. People attribute to things they can physically build upon
0
0
7
@maxhbain
Max Bain
5 months
@YiTayML in many cases even @vikhyatk 's moondream, a tiny 2B model, also solved these inverse-scaling examples, which GPT-4V, Core, and Gemini Pro were all failing.
1
0
6
@maxhbain
Max Bain
5 months
Tweet media one
@maxhbain
Max Bain
6 months
Some days I'm so scaling-pilled.
> just wait bro, we'll 10x compute + data
Then others, like today, I run a simple prompt (albeit OOD) and all the frontier VLMs fail miserably, including Reka Core😔. Can scale solve all these problems in the current meta? I don't think so
Tweet media one
Tweet media two
Tweet media three
3
0
25
1
0
5
@maxhbain
Max Bain
5 months
@OpenAI @RekaAILabs See leaderboard at: Vibe-Eval paper:
1
0
6
@maxhbain
Max Bain
5 months
@OpenAI @RekaAILabs better (I am trusting you guys didn't train on these!)
Tweet media one
0
0
5
@maxhbain
Max Bain
3 years
@e_kazakos Thanks Evangelos :) For the spectrograms in the paper I used Then for visualising the audio waveform on the demo video, I couldn't find any good Python library, so I used this editor:
1
0
5
@maxhbain
Max Bain
1 year
Some promising qualitative results. Still some work to do to get to human-level: reference character names, more fine-grained visual understanding. Work with @TengdaHan , @NagraniArsha , @gulvarol , @WeidiXie , Andrew Zisserman AutoAD II Coming Soon..
Tweet media one
0
0
4
@maxhbain
Max Bain
1 year
@hbredin is the MVP of diarization. Hard to beat #pyannote (trust me I tried)...
@hbredin
Hervé "pyannote" Bredin
1 year
Took a bit more than a month but here it is: #pyannote 3.0.0. Pretrained pipeline should be much (MUCH) better than 2.1.1! #pyannote
1
11
34
0
0
5
@maxhbain
Max Bain
6 months
Why hasn’t @Apple acquired ggml?
1
0
5
@maxhbain
Max Bain
5 months
@YiTayML @OpenAI @RekaAILabs nope, almost 2x out 😬
Tweet media one
1
0
4
@maxhbain
Max Bain
1 year
@RekaAILabs @BernardSGhanem @chrirupp Sadly v4 is not yet ready, still not conclusively state-of-the-art across benchmarks (certain things seem to work well on some, but worse on others). There are still things to try and I will hopefully make some intermittent progress every now and then in my spare time.
1
0
5
@maxhbain
Max Bain
5 months
@YiTayML @vikhyatk But there's an interesting trade-off: sometimes the tiny model fails to understand the question (not enough language modelling), but the atomic visual understanding seems to be more robust
1
0
5
@maxhbain
Max Bain
1 year
@RekaAILabs @BernardSGhanem @chrirupp WhisperX: many people have reached out for updates on the V4 release, specifically improved multi-speaker ASR. First, apologies for the radio silence, there were a lot of emails on this and I only wanted to respond when I had good news....
1
0
4
@maxhbain
Max Bain
2 years
@altryne the repo generates a .ass file, which you can load in VLC with "add subtitle" or add to ffmpeg natively: ffmpeg -i input.mp4 -vf ass=subtitle.ass output_subs.mp4
1
0
4
@maxhbain
Max Bain
1 year
Training Data? We automatically collect AD text labels from audio with Transcribing 8,000 movies (12k+ hours) in 2 days on a 4-GPU node🏎️
Tweet media one
1
0
4
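For scale, a hypothetical sketch of fanning the transcription out over a 4-GPU node with one worker per GPU (paths, model choice, and batch size are made up; the actual tooling is behind the elided link above):

    import glob, json, os
    from multiprocessing import Process

    def worker(gpu_id, paths):
        os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
        import whisperx  # import after masking devices so each worker sees one GPU
        model = whisperx.load_model("large-v2", "cuda")
        for path in paths:
            result = model.transcribe(whisperx.load_audio(path), batch_size=16)
            with open(path + ".json", "w") as f:
                json.dump(result["segments"], f)

    if __name__ == "__main__":
        files = sorted(glob.glob("movies/*.wav"))  # hypothetical audio tracks
        procs = [Process(target=worker, args=(g, files[g::4])) for g in range(4)]
        for p in procs:
            p.start()
        for p in procs:
            p.join()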
@maxhbain
Max Bain
10 months
@arouditchenko Really I think it's just showing a big LLM can learn the language prior. That's why they report FT, not true zero-shot
1
0
4
@maxhbain
Max Bain
7 months
@chaitjo Note that any serious industry lab ( @OpenAI @AIatMeta ) that claims to use @weights_biases will be running an on-prem deployment with FTEs working on the system, so it's not comparable
1
0
4
@maxhbain
Max Bain
2 years
@levelsio @OpenAI @pictoryai @synthesiaIO @veedstudio @hencubed internet finna be spammed with this stuff in 2 years :(
1
0
4
@maxhbain
Max Bain
7 months
@agoramorph83877 I like this idea a lot in theory. But in practice, high-quality data annotation requires financial incentives for the author.
1
0
4
@maxhbain
Max Bain
7 months
@sbmaruf For a year I never even got NeMo to run due to env issues... 😭. Even though I needed to compare to it in some work
1
0
4
@maxhbain
Max Bain
1 year
@RekaAILabs @BernardSGhanem @chrirupp I spent the past few weeks remote working in Costa Rica, trying to improve on multispeaker ASR sweeping over a lot of ideas (lots inspired from Interspeech2023 works).
Tweet media one
1
0
3
@maxhbain
Max Bain
5 months
@jxmnop bro predicted here ngl
0
0
3
@maxhbain
Max Bain
6 months
@smartass_cutie @RekaAILabs yeah but it's a product, you can contact here if you're interested,
0
0
3
@maxhbain
Max Bain
1 year
Missing Data? A lot of movie data is partially missing and lacking large scale. We propose partial pretraining
Tweet media one
1
0
3
@maxhbain
Max Bain
5 months
@FeinbergVlad @YiTayML @RekaAILabs Human evaluation on hard prompts is not so easy. We found expert annotators (not shown in table) ranked Gemini Pro 1.5 as the best on hard prompts
0
0
3
@maxhbain
Max Bain
5 months
0
0
3