Guilherme Penedo Profile
Guilherme Penedo

@gui_penedo

2,942
Followers
1,882
Following
5
Media
814
Statuses

ML Research Engineer at 🤗. Lisboeta 🇵🇹

Paris 🇫🇷
Joined April 2012
@gui_penedo
Guilherme Penedo
3 months
We have just released 🍷 FineWeb: 15 trillion tokens of high quality web data. We filtered and deduplicated all CommonCrawl between 2013 and 2024. Models trained on FineWeb outperform RefinedWeb, C4, DolmaV1.6, The Pile and SlimPajama!
40
348
2K
@gui_penedo
Guilherme Penedo
1 month
We are (finally) releasing the 🍷 FineWeb technical report! In it, we detail and explain every processing decision we took, and we also introduce our newest dataset: 📚 FineWeb-Edu, a (web only) subset of FW filtered for high educational content. Link:
39
316
1K
@gui_penedo
Guilherme Penedo
3 months
We trained 200+ ablation models to validate our processing decisions, and we share all the code you need to reproduce our setup, along with the checkpoints of our dataset comparison ablation models! Find out all about 🍷 FineWeb on the 🤗 model page:
5
18
215
@gui_penedo
Guilherme Penedo
1 month
I was saving this for later but anyway... Should I send my paypal? 😛
@code_star
Cody Blakeney
1 month
@_lewtun @Yuchenj_UW @karpathy @LoubnaBenAllal1 @gui_penedo I would bet $1000 that, for a fixed set of processing rules, older Common Crawl dumps, esp. pre-ChatGPT / LLM, are better for training LLMs on a mixture of downstream tasks than the most recent ones.
5
1
24
12
9
137
@gui_penedo
Guilherme Penedo
2 months
@karpathy @realmrfakename We've just added 10, 100 and 350 billion tokens subsets randomly sampled from the entire dataset:
2
11
90
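The subsets above are described as randomly sampled from the entire dataset. One standard way to draw a fixed-size uniform sample from a stream you can't hold in memory is reservoir sampling; the sketch below is a generic stdlib illustration (function name and sizes are hypothetical, not the actual FineWeb sampling code):

```python
import random

def reservoir_sample(stream, k, seed=0):
    """Draw a uniform random sample of k items from an iterable of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            # Keep the new item with probability k / (i + 1),
            # evicting a uniformly chosen reservoir slot.
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir

# Toy example: sample 5 document ids out of a stream of 1000
sample = reservoir_sample(range(1000), 5)
```

Each item ends up in the final sample with equal probability, which is what makes a token-count-targeted subset of a huge dataset representative of the whole.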
@gui_penedo
Guilherme Penedo
27 days
We keep getting new pretraining datasets 🔥 Congratulations to the Matrix team for such a strong dataset!
1
11
71
@gui_penedo
Guilherme Penedo
3 months
@rasbt More like 40TB of disk space
3
0
46
@gui_penedo
Guilherme Penedo
2 months
@karpathy @realmrfakename Indeed it's a great idea, we'll add a few randomly sampled small subsets
2
0
43
@gui_penedo
Guilherme Penedo
1 month
Special shout out to @HKydlicek who worked tirelessly over the last week on all the visualizations of the report, as well as to the rest of the team @LoubnaBenAllal1 @anton_lozhkov @colinraffel @lvwerra @Thom_Wolf
1
0
36
@gui_penedo
Guilherme Penedo
1 month
@karpathy A small note about FW-Edu: the main uplift is on more academic benchmarks like MMLU and ARC; the difference might not be as stark on HellaSwag, where imo more data diversity might still be more important
2
1
31
@gui_penedo
Guilherme Penedo
5 months
Datatrove 🏭 () is now officially on PyPI! You can install it with "pip install datatrove". Datatrove allows you to quickly scale up your data processing pipelines, and we have been using it extensively for our internal needs at 🤗 HuggingFace for a while!
1
6
31
@gui_penedo
Guilherme Penedo
3 months
@sivil_taram We plan to work on multilinguality later, but for now this is mostly an English dataset
4
1
25
@gui_penedo
Guilherme Penedo
1 month
fineweb-gpt2-chatbot 👀
@LoubnaBenAllal1
Loubna Ben Allal
1 month
Coming soon.. can you guess what's "FineWeb-?" Wrong answers only
43
22
178
0
0
11
@gui_penedo
Guilherme Penedo
3 months
@iacopo_poli Our base filtering (from RefinedWeb) + dedup brings performance in line with RefinedWeb; the C4 filters we applied bring it above, and the FineWeb filters give an extra push. Likely no on RP2 as we wanted a dataset that "just works" and didn't really annotate. Fixed the link, ty!
0
0
7
@gui_penedo
Guilherme Penedo
1 month
@TrelisResearch You can fetch the edu version for a specific dump, there is some info on the fineweb-edu repo
0
0
5
@gui_penedo
Guilherme Penedo
1 month
@mgalle We did. We compared hellaswag results with hellaswag contamination and didn't find any clear correlations. We tried checking absolute counts of matches (assuming it doesn't take much to memorize answers) as well as frequencies and were unable to conclude anything
0
0
3
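The contamination check described above compared benchmark results against absolute counts and frequencies of verbatim matches. As a crude, self-contained sketch of that kind of check (names and the toy data are hypothetical, not the actual FineWeb analysis code):

```python
def contamination_counts(benchmark_texts, documents):
    """For each benchmark string, count how many documents contain it
    verbatim (case-insensitive). A crude proxy for train/test overlap."""
    docs = [d.lower() for d in documents]
    return {t: sum(t.lower() in d for d in docs) for t in benchmark_texts}

# Hypothetical toy example: one benchmark completion, two training documents
docs = ["The cat sat on the mat.", "Completely unrelated web page text."]
counts = contamination_counts(["the cat sat on the mat"], docs)
# counts == {"the cat sat on the mat": 1}
```

A real analysis would work on normalized n-grams at scale rather than raw substring search, but the output is the same shape: per-benchmark-item match counts to correlate against scores.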
@gui_penedo
Guilherme Penedo
1 month
@soldni yep, each point is the average of two runs, each trained on a different 28BT sampling of fineweb data from that dump (pre C4 and FW filters)
1
0
5
@gui_penedo
Guilherme Penedo
1 month
@realmrfakename It's not very clear to me, but the final dataset is a subset of FineWeb and not Llama generated, so I think arguing it breaches the license would be a hard sell
0
0
4
@gui_penedo
Guilherme Penedo
3 months
@teortaxesTex @georgejrjrjr One thing we noticed, and that we will probably expand a bit on in the blog post, is that Common Crawl dumps have different quality from dump to dump. One solution for smaller budgets would be to take the N best dumps instead of sampling from the entire dataset
1
0
5
@gui_penedo
Guilherme Penedo
3 months
@ArYoMo WARC :)
1
0
4
@gui_penedo
Guilherme Penedo
1 month
More info on this here:
@gui_penedo
Guilherme Penedo
1 month
We are (finally) releasing the 🍷 FineWeb technical report! In it, we detail and explain every processing decision we took, and we also introduce our newest dataset: 📚 FineWeb-Edu, a (web only) subset of FW filtered for high educational content. Link:
39
316
1K
0
0
3
@gui_penedo
Guilherme Penedo
1 month
@soldni no :/ this one is our base filtering + minhash (so not the extra filters). I ran like 4-5 dumps with the other filters and the relative positions did not seem to change too much but I didn't run it for all the dumps
1
0
3
@gui_penedo
Guilherme Penedo
3 months
@georgejrjrjr @teortaxesTex We did not. It is the single most effective rule, but it removes (for us) 30% of tokens. We managed to achieve similar performance by checking the ratio of document lines not ending in punctuation and only dropping entire docs instead (removes 10%) - this is part of FWfilters
1
0
3
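The document-level variant described above (drop a whole doc when too many of its lines lack terminal punctuation, rather than dropping individual lines as C4 does) can be sketched roughly as follows; the punctuation set and the 0.3 threshold are illustrative assumptions, not the actual FineWeb values:

```python
# Illustrative terminal punctuation set (assumption, not the exact FW list)
TERMINAL_PUNCT = ('.', '!', '?', '"', "'")

def punct_ratio(doc: str) -> float:
    """Fraction of non-empty lines that do NOT end in terminal punctuation."""
    lines = [l.rstrip() for l in doc.splitlines()]
    lines = [l for l in lines if l]
    if not lines:
        return 1.0
    bad = sum(1 for l in lines if not l.endswith(TERMINAL_PUNCT))
    return bad / len(lines)

def keep_document(doc: str, max_ratio: float = 0.3) -> bool:
    # Drop the whole document only when too many of its lines look like
    # navigation / list items rather than prose (threshold is hypothetical).
    return punct_ratio(doc) <= max_ratio
```

Because the decision is all-or-nothing per document, lists embedded in otherwise normal prose survive, which is the destructiveness problem with the line-level rule mentioned in the next reply.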
@gui_penedo
Guilherme Penedo
3 months
@georgejrjrjr @teortaxesTex (30% of our base filtering/deduped baseline, to clarify) Plus it's quite destructive, it basically destroys any sort of list in the middle of a normal document as those normally don't end with punctuation
2
0
3
@gui_penedo
Guilherme Penedo
8 years
Verifying myself: I am philipsnostrum on Keybase.io. ANiPKu0Ryegg9libTXJ6MhS0uVaf1hy3W0VB /
0
0
3
@gui_penedo
Guilherme Penedo
5 months
The main goal of Datatrove 🏭 is to make large scale dataset curation accessible, without the need to reinvent the wheel every time you start a new project. It can run locally or on a slurm cluster, and includes logging and stats.
1
0
2
@gui_penedo
Guilherme Penedo
1 month
@WenhuChen @ClementDelangue Is there an easy way to get the CommonCrawl English subset? I am assuming I can just take the "cc_en" files from the hub?
0
0
2
@gui_penedo
Guilherme Penedo
3 months
@softyoda @itsandrewgao This is just an issue with the dataset viewer indexing, the dataset is around 40TB, not 200GB :)
0
0
2
@gui_penedo
Guilherme Penedo
26 days
@XYOU @pjox13 We're happy to run a training under the same conditions, but you can find details on the model setup (we haven't posted the exact training script yet) and the exact eval code on our blogpost
1
0
2
@gui_penedo
Guilherme Penedo
3 months
@teortaxesTex @georgejrjrjr "Likely no on RP2" -> we will not release an unfiltered version of FineWeb with annotations, like RP2. RP2 itself is not filtered, it gives you metrics and you can choose your own filtering using them. Our dataset "just works" in the sense that we already chose how to filter it
1
0
2
@gui_penedo
Guilherme Penedo
3 months
@georgejrjrjr @teortaxesTex Another detail btw: while the terminal_punct rule removes 30% and is the single most effective rule, all the other rules remove around half of that and give a bigger boost (when combined)
0
0
1
@gui_penedo
Guilherme Penedo
8 years
@PokemonGoApp fix the servers before rolling out to more countries yo
0
0
0
@gui_penedo
Guilherme Penedo
1 month
@xlr8harder They are averaged together first
1
0
1
@gui_penedo
Guilherme Penedo
3 months
@georgejrjrjr @teortaxesTex Trafilatura is very actively maintained so who knows, maybe it's better now
1
0
1
@gui_penedo
Guilherme Penedo
7 years
@CarolSc99 Nice! So you're going into computer science after all?
1
0
1
@gui_penedo
Guilherme Penedo
3 months
@georgejrjrjr @teortaxesTex For markdown I did play around with it a bit and at least at the time found a lot of documents where it would parse them incorrectly and break formatting a lot
1
0
1
@gui_penedo
Guilherme Penedo
26 days
@pjox13 If there is a dataset released on the hub that applied these filtering steps to colossal Oscar I'd be happy to add it to the comparison
0
0
1
@gui_penedo
Guilherme Penedo
1 month
@gui_penedo
Guilherme Penedo
1 month
We are (finally) releasing the 🍷 FineWeb technical report! In it, we detail and explain every processing decision we took, and we also introduce our newest dataset: 📚 FineWeb-Edu, a (web only) subset of FW filtered for high educational content. Link:
39
316
1K
1
0
1
@gui_penedo
Guilherme Penedo
3 months
@georgejrjrjr @teortaxesTex No rationale, I suppose we just never considered it. I suppose we could have, at least the title/author metadata. In retrospect maybe we should have. The XML not so sure as it would probably have considerably increased the storage size without bringing a big benefit
1
0
1
@gui_penedo
Guilherme Penedo
1 month
@kishorelive We do not recommend using only the highest scoring samples as there starts to be perf degradation on non academic benchmarks
1
0
1
@gui_penedo
Guilherme Penedo
1 month
@kishorelive I think you can use the sample (10, 100 and 350BT) versions of FineWeb-edu (score >= 3). If you really want to take the high (>=4) data then you will definitely need to pair it with something else
1
0
1
@gui_penedo
Guilherme Penedo
5 months
With Datatrove 🏭, you can:
- quickly read data in diff formats from disk, the cloud, or hf hub
- use SOTA filters out of the box
- deduplicate data at scale (minhash, exactsubstr, bloom filters, etc)
- tokenize your data
- easily run and scale your custom processing logic
1
0
1
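Datatrove handles MinHash dedup at scale; as a rough self-contained illustration of the MinHash idea itself (not Datatrove's API — all names below are hypothetical), each document is reduced to a small signature whose slot-wise agreement estimates shingle-set Jaccard similarity:

```python
import hashlib

def shingles(text, n=3):
    """Word n-grams ("shingles") of a document (assumes >= n words)."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def minhash_signature(text, num_perm=64, n=3):
    """For each of num_perm seeded hash functions, keep the minimum
    hash value over the document's shingles."""
    sig = []
    for seed in range(num_perm):
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(f"{seed}:{s}".encode(), digest_size=8).digest(),
                "big",
            )
            for s in shingles(text, n)
        ))
    return sig

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

sig_a = minhash_signature("the quick brown fox jumps over the lazy dog near the river bank today")
sig_b = minhash_signature("the quick brown fox jumps over the lazy dog near the river bank tonight")
similarity = estimated_jaccard(sig_a, sig_b)  # high, since only one word differs
```

Near-duplicates can then be bucketed by signature bands instead of comparing every document pair, which is what makes the approach tractable on trillions of tokens.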
@gui_penedo
Guilherme Penedo
3 months
@chewchewchew10 Indeed, they are all false positives on the byte signature
1
0
1
@gui_penedo
Guilherme Penedo
5 months
Still early days for the library, but we are quite excited by the community reception, and thankful for all the feedback and contributions ❤️. Please keep them coming!
0
0
1
@gui_penedo
Guilherme Penedo
3 months
@Euclaise_ @ITica007 @Thom_Wolf @togethercompute That's exactly right! RP2 is a great work but not directly comparable to FineWeb as you kind of have to "choose your own filtering" whereas FW is "plug and play"
0
0
1