Are you using
@openai
's Whisper for speech recognition and finding the timestamps are out of sync?
Just dropped: WhisperX, with word-level timestamp accuracy from force-aligning Whisper with wav2vec2.0
🧵 [1/n]
RIP webvid dataset, 23 Feb 2024.
Today I received a cease and desist letter from
@Shutterstock
that I must take down WebVid, an academic video captioning dataset, and can no longer provide the urls and captions to the research community.
Our work on Automated Audiovisual Behaviour Recognition in Wild Primates is finally out. An end-to-end detection, tracking and behaviour recognition pipeline, using both the audio and visual inputs (helpful for robustness in wild footage)
Currently working on a demo for our Frozen-in-Time model, retrieving videos amongst millions in the WebVid dataset. Cool to see how sensitive our model is to small changes in the text query!
💡Advice: if you are building yourself a long-term training codebase, then avoid heavy external libraries at all costs (HF, hydra, lightning, even wandb, etc.)
New leader on the Reka Vibe-Eval multimodal benchmark. It actually solves some of the anti-scaling examples, nice work
@OpenAI
.
But the hard-set is still hard (only 54%).
@RekaAILabs
New paper from
@RekaAILabs
🔥 (yes an actual paper).
This time we're releasing part of our internal evals which we call Vibe-Eval 😃 It comprises a hard set which imo is pretty challenging for frontier models today.
The fun part here is that we constructed it by trying to
Come say hi at
#CVPR23
Will be presenting the project behind WhisperX 😎🎬
AutoAD: Movie Description in Context. June 22, Thu AM
(Highlight, Poster 234).
We train a model to automatically generate audio descriptions
So: only big companies who can afford the Shutterstock license get to train on those videos, making it increasingly difficult for academic and independent researchers.
A good day. Testing our new ✨Reka Core✨ model and it's showing promising capabilities.
Complex table understanding is one of them.
Lmk if you are interested in early access
@RekaAILabs
Just landed in SF ✈️ to showcase WhisperX at the HF🤗 open source meet up. Looking forward to meeting everyone!
Interested in 🗣🎥speech/video understanding, building model apis🌐? Let’s chat
& thank you
@ClementDelangue
@huggingface
Full pipeline now includes speaker diarization to assign speaker labels to each word (and character). Runs
@openai
’s whisper,
@MetaAI
's wav2vec2.0 and diarization independently to produce robust word-level segmentation with speaker labels
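Rough usage sketch of the full pipeline (from memory of the repo API, names may have drifted; the audio path and HF token are placeholders):

import whisperx

device = "cuda"
audio = whisperx.load_audio("podcast.wav")  # placeholder path

# 1. transcribe with whisper (segment-level timestamps only)
model = whisperx.load_model("large-v2", device)
result = model.transcribe(audio, batch_size=16)

# 2. force-align with a wav2vec2.0 phoneme model -> word-level timestamps
align_model, metadata = whisperx.load_align_model(language_code=result["language"], device=device)
result = whisperx.align(result["segments"], align_model, metadata, audio, device)

# 3. diarize, then assign a speaker label to every word
diarize_model = whisperx.DiarizationPipeline(use_auth_token="YOUR_HF_TOKEN", device=device)
diarize_segments = diarize_model(audio)
result = whisperx.assign_word_speakers(diarize_segments, result)

print(result["segments"][0]["words"][:3])  # each word dict: text, start, end, speaker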
Not too long ago we built a scalable dataloader at
@RekaAILabs
for any text/img/video etc.
It's no easy feat, especially with no good open source implementations around, and we had to rebuild it a couple of times.
But, looks like
@wightmanr
cooked here 👏👏.
The datasets are in pre-shuffled .tar shards, ready to drop into most training pipelines, and are compatible with huggingface datasets: you can stream them as above.
@wightmanr
cooked up a lib with optimized sharded dataloaders at scale, lean and nice!
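Minimal streaming sketch for shards like these (repo id below is a placeholder, just showing the 🤗 datasets streaming pattern):

from datasets import load_dataset

# streaming=True reads the pre-shuffled .tar shards lazily, no full download needed
ds = load_dataset("timm/example-wds-dataset", split="train", streaming=True)  # placeholder repo id
for sample in ds.take(4):
    print(sample.keys())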
Meet Reka Core, our best and most capable multimodal language model yet. 🔮
It’s been a busy few months training this model and we are glad to finally ship it! 💪
Core has a lot of capabilities, and one of them is understanding video --- let’s see what Core thinks of the 3 body
Don't sleep on inverse scaling (appendix).
Yes it's just a couple of qualitative examples but it's a big deal. I don't see the current approach of frontier models overcoming this.
@YiTayML
thinks im being a bit paranoid🥲, but the vibe is off here.
Hint: PhDs, here's a good area to solve
Evals are notoriously difficult to get right but necessary to move the field forward. 🌟
As part of our commitment to science, we’re releasing a subset of our internal evals. 🙌
Vibe-Eval is an open and hard benchmark comprising 269 image-text prompts for measuring the
Wow some people really be emailing me for details about our paper then months later publish some little hack on top of it and not discuss or even cite our original work🤦♂️🤦♂️
Introducing Reka Flash, our efficient and highly capable multimodal language model.
Try it at Reka playground 🛝 for free today.
🧵 Thread, blog & links below 👇
Impressive work but will the model be released publicly? If not, then maybe a more appropriate title would be:
"Florence: A New Foundation Model for Companies with access to 500 A100s"
Florence: A New Foundation Model for Computer Vision
abs:
sota results in the majority of 44 representative benchmarks, ImageNet-1K zero-shot classification with top-1 accuracy of 83.74 and top-5 accuracy of 97.18, 62.4 mAP on COCO fine-tuning
Couldn't be happier to hear this.
Tbh, WhisperX was a big lesson for me. The "academic" voice inside my head kept worrying there was little novelty (R2 complained too), but it solved a problem for us, and for others too. So I told R2 the same: people were using it & I don't care anymore.
Still can't believe
@maxhbain
, my favourite ML researcher, who wrote the iconic WhisperX paper and has a PhD from Oxford, started following me.
The clarity of the WhisperX paper is unparalleled and it was selected for Interspeech 2023. I have read it like 50+ times. You made my day 😊.
LLM-AD
Large Language Model based Audio Description System
The development of Audio Description (AD) has been a pivotal step forward in making video content more accessible and inclusive. Traditionally, AD production has demanded a considerable amount of skilled labor,
Tbh I don't really have the bandwidth to fight this, so today I took down the website and the csv files of the dataset. Yep, the dataset is just some csv files :')
Long overdue but here's a new blogpost on training LLMs in the wilderness from the ground up 😄🧐
In this blog post, I discuss:
1. Experiences in procuring compute & variance in different compute providers. Our biggest finding/surprise is that variance is super high and it's
Apparently I have not made enough "guardrails" to ensure the dataset is only used for academic purposes...
The letter then cites text-to-video generation works from
@BytedanceTalk
@tiktok_us
@AlibabaGroup
that train using WebVid, and apparently the dataset owner is liable :')
E.g. a simple
@huggingface
tokenizer, or even a basic image transformation, can be riddled with bugs or even have small but catastrophic differences from what you think it does.
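A tiny sanity check I'd now write before trusting one (minimal sketch, gpt2 below is just a stand-in):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
text = "hello   world\t\nnaïve café"
ids = tok(text, add_special_tokens=False)["input_ids"]
# whitespace, unicode and special-token handling are where tokenizers silently differ
assert tok.decode(ids) == text, "round-trip is lossy - check normalization before training with it"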
Some days I'm so scaling-pilled.
> just wait bro, we'll 10x compute + data
Then others, like today, I run a simple prompt (albeit OOD) and all the frontier VLMs fail miserably, including Reka Core😔.
Can scale solve all these problems in the current meta? I don't think so
Catch me at
#INTERSPEECH2023
Presenting WhisperX Thurs AM.
Chatting about anything: Multispeaker ASR, multimodal LLMs 🗣️📹💬
>>Meanwhile trying to ship WhisperX V4 before I start my next job
🚨🚨 Those interested in video understanding / video-language:
🧵 Check out the task of Automating Audio Descriptions (AD).
✅ High-quality training and evaluation data
✅ True multimodal reasoning
✅ Immediate societal benefit
@OpenAI
🧵[3/n] However, phoneme-based models such as Wav2Vec2.0 produce much more accurate timestamps. WhisperX leverages these models using forced alignment on the whisper transcription to generate word-level timestamps.
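Toy illustration of the last step (not the actual WhisperX code): the aligner gives you, for each word, the CTC frame span where it was emitted, and the timestamp is just frame index times frame stride (~20 ms for wav2vec2.0 at 16 kHz):

FRAME_STRIDE_S = 0.02  # assumption: one CTC frame every 20 ms

def frames_to_word_times(word_frame_spans):
    # word_frame_spans: (word, first_frame, last_frame) tuples from the alignment path
    return [
        {"word": w, "start": f0 * FRAME_STRIDE_S, "end": (f1 + 1) * FRAME_STRIDE_S}
        for (w, f0, f1) in word_frame_spans
    ]

print(frames_to_word_times([("hello", 10, 34), ("world", 40, 66)]))
# -> hello: 0.20-0.70s, world: 0.80-1.34s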
A lot of my PhD I tried to frankenstein together many of these popular tools.
Only to later waste many days debugging some bug introduced by these tools
@huggingface
@LightningAI
bonus: you may later get asked to implement these things (tokenization, model parallelism, checkpoint saving, etc.) in an interview, so it'll prep you for those too
@Jerry8448
@Shutterstock
the dataset is just a list of URLs that point to publicly available, watermarked, low-res videos (see attached), in what way is it stolen?
WebVid is minute compared to all the copyrighted data that big tech trains on. Open source can fall further and further behind, or we can
The paper says 4M in the table for # pretraining images. But the implementation details say the visual encoder is initialised with CLIP (400M image-text pairs)🤔?
Bridge-Tower: Building Bridges Between Encoders in Vision-Language Representation Learning
abs:
github:
pre-trained with only 4M images, achieves state-of-the-art performance on various downstream vision-language tasks
PS *not* saying this out of self-interest, the 15 refs include our 2 papers, but there are like 80 other missing references to others' great work in this field. Just copy them from our papers if you're that lazy jesus
research is an immensely taxing endeavour. hours spent doing IC work, debugging and what not. a paper is a canvas for researchers to express themselves after all the hard work, at the end of the day.
it's my art. at least let me paint the way i want to paint. The reason why i am
@RekaAILabs
Thankfully defended the DPhil last month, with the pleasure of prof
@BernardSGhanem
and
@chrirupp
as my examiners. Concluding a total of 8 (4+4) years at Oxford
@huggingface
ofc you can write extensive unit tests yourself. But imo, at that point you may as well ctrl C+V only the critical LOC you need from these tools
@OpenAI
🧵[2/n]
@openAI
’s Whisper shows impressive transcription performance, but often the corresponding timestamps are out of sync by several seconds.
@OpenAI
🧵[5/n] Of course, it would be better if a single model did everything. One way would be teacher-student, where Whisper learns to output wav2vec's aligned timestamps.
If
@OpenAI
open-sourced the training data and script, it would be cool to try this :)
Data curation is a frontier research problem. There’s only a handful of scientists in the world with deep expertise. And let’s be real—most scientists can’t build a deployable product that scales effortlessly.
Nice to see WebVid used for other tasks :) I wonder how this scales to WebVid-10M; for retrieval we found performance saturated, but that might be a task/model-specific problem
Just Ask extension is accepted to the TPAMI Special Issue on the Best Papers of ICCV 2021! We release WebVidVQA3M, a new automatically generated VideoQA dataset. w/
@antoine77340
, J. Sivic, I. Laptev and
@CordeliaSchmid
.
Webpage:
@ComputerSociety
#IEEECS
@RekaAILabs
@wightmanr
Some of the interesting challenges/reqs (rough sketch after the list):
- scale to **many** petabytes
- agnostic to NFS / object storage
- shardable
- restartable at any state
- tolerant to corrupt media files / network hangs
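For reference, an off-the-shelf sketch of a couple of these properties (storage-agnostic shards, tolerating corrupt media) using webdataset; the bucket/layout below is made up and this is *not* our internal loader:

import webdataset as wds

# shards can sit on NFS (a glob) or object storage (pipe: + s3 cp); either way they are just streamed .tar files
shards = "pipe:aws s3 cp s3://my-bucket/shard-{000000..009999}.tar -"  # placeholder layout

dataset = (
    wds.WebDataset(shards, shardshuffle=True, handler=wds.warn_and_continue)  # skip bad shards instead of crashing
    .shuffle(1000)                                         # sample-level shuffle buffer
    .decode("torchrgb", handler=wds.warn_and_continue)     # corrupt media -> warn and move on
    .to_tuple("jpg", "json", handler=wds.warn_and_continue)
)
loader = wds.WebLoader(dataset.batched(32), batch_size=None, num_workers=8)

(This sketch doesn't cover the restart-at-any-state or petabyte-scale bits.)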
Separation of effort in the image and video retrieval communities is suboptimal - they share a lot of overlapping info!
Check out our NEW model for visual-text retrieval, easily trains on *both* images and videos jointly, setting new SOTA results!
@PrannayKaul
@Sagar_Vaze
Hahah idk, I think it's easier. Mine had way too many heavy graphics so overleaf was struggling.
Also definitely use
@TengdaHan
's cracked thesis template
@YiTayML
in many cases, even
@vikhyatk
's moondream, a tiny 2B model, also solved these inverse scaling examples, which GPT-4V, Core and Gemini Pro were all failing on.
@e_kazakos
Thanks Evangelos :) For the spectrograms in the paper I used
Then for visualising the audio waveform on the demo video, I couldn't find any good python library, so I used this editor:
Some promising qualitative results. Still some work to do to get to human-level: reference character names, more fine-grained visual understanding.
Work with
@TengdaHan
,
@NagraniArsha
,
@gulvarol
,
@WeidiXie
, Andrew Zisserman
AutoAD II Coming Soon..
@RekaAILabs
@BernardSGhanem
@chrirupp
Sadly v4 is not yet ready; still no conclusive state-of-the-art across benchmarks (it seems certain things work well on some, but worse on others). There are still things to try and I will hopefully make some intermittent progress every now and then in my spare time.
@YiTayML
@vikhyatk
But there's an interesting trade-off: sometimes the tiny model fails to understand the question (not enough language modelling), but the atomic visual understanding seems to be more robust
@RekaAILabs
@BernardSGhanem
@chrirupp
WhisperX: many people have reached out for updates on the V4 release, specifically improved multi-speaker ASR. First, apologies for the radio silence, there were a lot of emails on this and I only wanted to respond when I had good news....
@altryne
the repo generates a .ass file, which you can load in VLC with "add subtitle" or burn into the video natively with ffmpeg:
ffmpeg -i input.mp4 -vf ass=subtitle.ass output_subs.mp4
@chaitjo
Note that any serious industry lab (
@OpenAI
@AIatMeta
) that claims to use
@weights_biases
will be running an on-prem deployment with FTEs working on the system, so it's not comparable
@RekaAILabs
@BernardSGhanem
@chrirupp
I spent the past few weeks remote working in Costa Rica, trying to improve multispeaker ASR, sweeping over a lot of ideas (many inspired by Interspeech 2023 works).
@FeinbergVlad
@YiTayML
@RekaAILabs
Human evaluation on hard prompts is not so easy. We found expert annotators (not shown in table) ranked Gemini Pro 1.5 as the best on hard prompts