Are you using
@openai
's Whisper for speech recognition and finding the timestamps are out of sync?
Just dropped: WhisperX, with word-level timestamp accuracy from force-aligning Whisper with wav2vec2.0
🧵 [1/n]
RIP webvid dataset, 23 Feb 2024.
Today I received a cease and desist letter from
@Shutterstock
that I must take down WebVid, an academic video captioning dataset, and can no longer provide the urls and captions to the research community.
Our work on Automated Audiovisual Behaviour Recognition in Wild Primates is finally out. An end-to-end detection, tracking and behaviour recognition pipeline, using both the audio and visual inputs (helpful for robustness in wild footage)
Currently working on a demo for our Frozen-in-Time model, retrieving videos amongst millions in the WebVid dataset. Cool to see how sensitive our model is to small changes in the text query!
💡Advice: if you are building yourself a long-term training codebase, then avoid heavy external libraries at all costs (HF, hydra, lightning, even wandb, etc.)
New leader on the Reka Vibe-Eval multimodal benchmark. It actually solves some of the anti-scaling examples, nice work
@OpenAI
.
But the hard-set is still hard (only 54%).
@RekaAILabs
New paper from
@RekaAILabs
🔥 (yes an actual paper).
This time we're releasing part of our internal evals which we call Vibe-Eval 😃 It comprises a hard set which imo is pretty challenging for frontier models today.
The fun part here is that we constructed it by trying to
Come say hi at
#CVPR23
Will be presenting the project behind WhisperX 😎🎬
AutoAD: Movie Description in Context. June 22, Thu AM
(Highlight, Poster 234).
We train a model to automatically generate audio descriptions
So: only big companies who can afford the Shutterstock license get to train on those videos, making it increasingly difficult for academic and independent researchers.
A good day. Testing our new ✨Reka Core✨ model and it's showing promising capabilities.
Complex table understanding is one of them.
Lmk if you are interested in early access
@RekaAILabs
Just landed in SF ✈️ to showcase WhisperX at the HF🤗 open source meet up. Looking forward to meeting everyone!
Interested in 🗣🎥speech/video understanding, building model apis🌐? Let’s chat
& thank you
@ClementDelangue
@huggingface
Full pipeline now includes speaker diarization to assign speaker labels to each word (and character). Runs
@openai
’s whisper,
@MetaAI
's wav2vec2.0 and diarization independently to produce robust word-level segmentation with speaker labels
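Rough usage sketch of the full pipeline (from memory of the repo API, names may have drifted; the audio path and HF token are placeholders):

import whisperx

device = "cuda"
audio = whisperx.load_audio("podcast.wav")  # placeholder path

# 1. transcribe with whisper (segment-level timestamps only)
model = whisperx.load_model("large-v2", device)
result = model.transcribe(audio, batch_size=16)

# 2. force-align with a wav2vec2.0 phoneme model -> word-level timestamps
align_model, metadata = whisperx.load_align_model(language_code=result["language"], device=device)
result = whisperx.align(result["segments"], align_model, metadata, audio, device)

# 3. diarize, then assign a speaker label to every word
diarize_model = whisperx.DiarizationPipeline(use_auth_token="YOUR_HF_TOKEN", device=device)
diarize_segments = diarize_model(audio)
result = whisperx.assign_word_speakers(diarize_segments, result)

print(result["segments"][0]["words"][:3])  # each word dict: text, start, end, speaker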
Not too long ago we built a scalable dataloader at
@RekaAILabs
for any text/img/video etc.
It's no easy feat, especially with no good open source implementations around, and we had to rebuild it a couple of times.
But, looks like
@wightmanr
cooked here 👏👏.
The datasets are in pre-shuffled .tar shards, ready to drop into most training pipelines, and are compatible with huggingface datasets: you can stream them as above.
@wightmanr
cooked up a lib with optimized sharded dataloaders at scale, lean and nice!
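Minimal streaming sketch for shards like these (repo id below is a placeholder, just showing the 🤗 datasets streaming pattern):

from datasets import load_dataset

# streaming=True reads the pre-shuffled .tar shards lazily, no full download needed
ds = load_dataset("timm/example-wds-dataset", split="train", streaming=True)  # placeholder repo id
for sample in ds.take(4):
    print(sample.keys())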
Meet Reka Core, our best and most capable multimodal language model yet. 🔮
It’s been a busy few months training this model and we are glad to finally ship it! 💪
Core has a lot of capabilities, and one of them is understanding video --- let’s see what Core thinks of the 3 body
Don't sleep on inverse scaling (appendix).
Yes it's just a couple of qualitative examples but it's a big deal. I don't see the current approach of frontier models overcoming this.
@YiTayML
thinks im being a bit paranoid🥲, but the vibe is off here.
Hint: PhDs, here's a good area to solve
Evals are notoriously difficult to get right but necessary to move the field forward. 🌟
As part of our commitment to science, we’re releasing a subset of our internal evals. 🙌
Vibe-Eval is an open and hard benchmark comprising 269 image-text prompts for measuring the
Wow some people really be emailing me for details about our paper then months later publish some little hack on top of it and not discuss or even cite our original work🤦♂️🤦♂️
Introducing Reka Flash, our efficient and highly capable multimodal language model.
Try it at Reka playground 🛝 for free today.
🧵 Thread, blog & links below 👇
Impressive work but will the model be released publicly? If not, then maybe a more appropriate title would be:
"Florence: A New Foundation Model for Companies with access to 500 A100s"
Florence: A New Foundation Model for Computer Vision
abs:
sota results in the majority of 44 representative benchmarks, ImageNet-1K zero-shot classification with top-1 accuracy of 83.74 and top-5 accuracy of 97.18, 62.4 mAP on COCO fine-tuning
Couldn't be happier to hear this.
Tbh, WhisperX was a big lesson for me. The "academic" voice inside my head kept worrying there was little novelty (R2 complained too), but it solved a problem for us, and for others too. So I told R2 the same: people were using it & I don't care anymore.
Still can't believe
@maxhbain
, my favourite ML researcher, who wrote the iconic WhisperX paper and has a PhD from Oxford, started following me.
The clarity of the WhisperX paper is unparalleled and it was selected for Interspeech 2023. I have read it like 50+ times. You made my day 😊.
LLM-AD
Large Language Model based Audio Description System
The development of Audio Description (AD) has been a pivotal step forward in making video content more accessible and inclusive. Traditionally, AD production has demanded a considerable amount of skilled labor,
Tbh I don't really have the bandwidth to fight this, so today I took down the website and the csv files of the dataset. Yep, the dataset is just some csv files :')
Long overdue but here's a new blogpost on training LLMs in the wilderness from the ground up 😄🧐
In this blog post, I discuss:
1. Experiences in procuring compute & variance in different compute providers. Our biggest finding/surprise is that variance is super high and it's
Apparently I have not made enough "guardrails" to ensure the dataset is only used for academic purposes...
The letter then cites text-to-video generation works from
@BytedanceTalk
@tiktok_us
@AlibabaGroup
that train using WebVid, and apparently the dataset owner is liable :')
E.g. a simple
@huggingface
tokenizer, or even a basic image transformation, can be riddled with bugs or even have small but catastrophic differences from what you think it does.
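A tiny sanity check I'd now write before trusting one (minimal sketch, gpt2 below is just a stand-in):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
text = "hello   world\t\nnaïve café"
ids = tok(text, add_special_tokens=False)["input_ids"]
# whitespace, unicode and special-token handling are where tokenizers silently differ
assert tok.decode(ids) == text, "round-trip is lossy - check normalization before training with it"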
Some days I'm so scaling-pilled.
> just wait bro, we'll 10x compute + data
Then others, like today, I run a simple prompt (albeit OOD) and all the frontier VLMs fail miserably, including Reka Core😔.
Can scale solve all these problems in the current meta? I don't think so
Catch me at
#INTERSPEECH2023
Presenting WhisperX Thurs AM.
Chatting about anything: Multispeaker ASR, multimodal LLMs 🗣️📹💬
>>Meanwhile trying to ship WhisperX V4 before I start my next job
🚨🚨 Those interested in video understanding / video-language:
🧵 Check out the task of Automating Audio Descriptions (AD).
✅ High-quality training and evaluation data
✅ True multimodal reasoning
✅ Immediate societal benefit
@OpenAI
🧵[3/n] However, phoneme-based models such as Wav2Vec2.0 produce much more accurate timestamps. WhisperX leverages these models using forced alignment on the whisper transcription to generate word-level timestamps.
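Toy illustration of the last step (not the actual WhisperX code): the aligner gives you, for each word, the CTC frame span where it was emitted, and the timestamp is just frame index times frame stride (~20 ms for wav2vec2.0 at 16 kHz):

FRAME_STRIDE_S = 0.02  # assumption: one CTC frame every 20 ms

def frames_to_word_times(word_frame_spans):
    # word_frame_spans: (word, first_frame, last_frame) tuples from the alignment path
    return [
        {"word": w, "start": f0 * FRAME_STRIDE_S, "end": (f1 + 1) * FRAME_STRIDE_S}
        for (w, f0, f1) in word_frame_spans
    ]

print(frames_to_word_times([("hello", 10, 34), ("world", 40, 66)]))
# -> hello: 0.20-0.70s, world: 0.80-1.34s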
A lot of my PhD I tried to frankenstein together many of these popular tools.
Only to later waste many days debugging some bug introduced by these tools
@huggingface
@LightningAI
bonus: you may later get asked to implement these things (tokenization, model parallelism, checkpoint saving, etc.) in an interview, so it'll prep you for those too
@Jerry8448
@Shutterstock
the dataset is just a list of URLs that point to publicly available, watermarked, low-res videos (see attached), in what way is it stolen?
WebVid is minute compared to all the copyrighted data that big tech trains on. Open source can fall further and further behind, or we can
The paper says 4M in the table for # pretraining images. But the implementation details say the visual encoder is initialised with CLIP (400M image-text pairs)🤔?
Bridge-Tower: Building Bridges Between Encoders in Vision-Language Representation Learning
abs:
github:
pre-trained with only 4M images, achieves state-of-the-art performance on various downstream vision-language tasks
PS *not* saying this out of self-interest, the 15 refs include our 2 papers, but there are like 80 other missing references to others' great work in this field. Just copy them from our papers if you're that lazy jesus
research is an immensely taxing endeavour. hours spent doing IC work, debugging and what not. a paper is a canvas for researchers to express themselves after all the hard work, at the end of the day.
it's my art. at least let me paint the way i want to paint. The reason why i am
@RekaAILabs
Thankfully defended the DPhil last month, with the pleasure of prof
@BernardSGhanem
and
@chrirupp
as my examiners. Concluding a total of 8 (4+4) years at Oxford
@huggingface
ofc you can write extensive unit tests yourself. But imo, at that point you may as well ctrl C+V only the critical LOC you need from these tools
@OpenAI
🧵[2/n]
@openAI
’s Whisper shows impressive transcription performance, but often the corresponding timestamps are out of sync by several seconds.
@OpenAI
🧵[5/n] Of course, it would be better if a single model did everything. One way would be teacher-student, where Whisper learns to output wav2vec's aligned timestamps.
If
@OpenAI
open-sourced the training data and script, it would be cool to try this :)
Data curation is a frontier research problem. There’s only a handful of scientists in the world with deep expertise. And let’s be real—most scientists can’t build a deployable product that scales effortlessly.
Nice to see WebVid used for other tasks :) I wonder how this scales to WebVid-10M; for retrieval we found performance saturated, but that might be a task/model-specific problem
Just Ask extension is accepted to the TPAMI Special Issue on the Best Papers of ICCV 2021! We release WebVidVQA3M, a new automatically generated VideoQA dataset. w/
@antoine77340
, J. Sivic, I. Laptev and
@CordeliaSchmid
.
Webpage:
@ComputerSociety
#IEEECS
@RekaAILabs
@wightmanr
Some of the interesting challenges/reqs (rough sketch after the list):
- scale to **many** petabytes
- agnostic to NFS / object storage
- shardable
- restartable at any state
- tolerant to corrupt media files / network hangs
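For reference, an off-the-shelf sketch of a couple of these properties (storage-agnostic shards, tolerating corrupt media) using webdataset; the bucket/layout below is made up and this is *not* our internal loader:

import webdataset as wds

# shards can sit on NFS (a glob) or object storage (pipe: + s3 cp); either way they are just streamed .tar files
shards = "pipe:aws s3 cp s3://my-bucket/shard-{000000..009999}.tar -"  # placeholder layout

dataset = (
    wds.WebDataset(shards, shardshuffle=True, handler=wds.warn_and_continue)  # skip bad shards instead of crashing
    .shuffle(1000)                                         # sample-level shuffle buffer
    .decode("torchrgb", handler=wds.warn_and_continue)     # corrupt media -> warn and move on
    .to_tuple("jpg", "json", handler=wds.warn_and_continue)
)
loader = wds.WebLoader(dataset.batched(32), batch_size=None, num_workers=8)

(This sketch doesn't cover the restart-at-any-state or petabyte-scale bits.)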
Separation of effort in the image and video retrieval communities is suboptimal - they share a lot of overlapping info!
Check out our NEW model for visual-text retrieval, easily trains on *both* images and videos jointly, setting new SOTA results!
@PrannayKaul
@Sagar_Vaze
Hahah idk, I think it's easier. Mine had way too many heavy graphics so overleaf was struggling.
Also definitely use
@TengdaHan
's cracked thesis template
@YiTayML
in many cases, even
@vikhyatk
's moondream, a tiny 2B model, also solved these inverse scaling examples, which GPT-4V, Core and Gemini Pro were all failing on.
@e_kazakos
Thanks Evangelos :) For the spectrograms in the paper I used
Then for visualising the audio waveform on the demo video, I couldn't find any good python library, so I used this editor:
Some promising qualitative results. Still some work to do to get to human-level: reference character names, more fine-grained visual understanding.
Work with
@TengdaHan
,
@NagraniArsha
,
@gulvarol
,
@WeidiXie
, Andrew Zisserman
AutoAD II Coming Soon..
@RekaAILabs
@BernardSGhanem
@chrirupp
Sadly v4 is not yet ready; still no conclusive state-of-the-art across benchmarks (it seems certain things work well on some, but worse on others). There are still things to try and I will hopefully make some intermittent progress every now and then in my spare time.
@YiTayML
@vikhyatk
But there's an interesting trade-off: sometimes the tiny model fails to understand the question (not enough language modelling), but the atomic visual understanding seems to be more robust
@RekaAILabs
@BernardSGhanem
@chrirupp
WhisperX: many people have reached out for updates on the V4 release, specifically improved multi-speaker ASR. First, apologies for the radio silence, there were a lot of emails on this and I only wanted to respond when I had good news....
@altryne
the repo generates a .ass file, which you can load in VLC with "add subtitle" or burn into the video natively with ffmpeg:
ffmpeg -i input.mp4 -vf ass=subtitle.ass output_subs.mp4
@chaitjo
Note that any serious industry lab (
@OpenAI
@AIatMeta
) that claims to use
@weights_biases
will be running an on-prem deployment with FTEs working on the system, so it's not comparable
@RekaAILabs
@BernardSGhanem
@chrirupp
I spent the past few weeks remote working in Costa Rica, trying to improve multispeaker ASR, sweeping over a lot of ideas (many inspired by Interspeech 2023 works).
@FeinbergVlad
@YiTayML
@RekaAILabs
Human evaluation on hard prompts is not so easy. We found expert annotators (not shown in table) ranked Gemini Pro 1.5 as the best on hard prompts