@ylecun
In my testing, it performs worse than Whisper at speech-to-text transcription: it mishears words and misses implied punctuation. It's also about 10x slower than Faster-Whisper. Fairseq uses 20 GB of RAM, while Whisper uses about 1 GB. For these reasons and others this is