Guilherme Penedo Profile
Guilherme Penedo

@gui_penedo

2,942
Followers
1,882
Following
5
Media
814
Statuses

ML Research Engineer at 🤗. Lisboeta 🇵🇹

Paris 🇫🇷
Joined April 2012
@gui_penedo
Guilherme Penedo
3 months
We have just released 🍷 FineWeb: 15 trillion tokens of high quality web data. We filtered and deduplicated all CommonCrawl between 2013 and 2024. Models trained on FineWeb outperform RefinedWeb, C4, DolmaV1.6, The Pile and SlimPajama!
40
348
2K
@gui_penedo
Guilherme Penedo
1 month
We are (finally) releasing the 🍷 FineWeb technical report! In it, we detail and explain every processing decision we took, and we also introduce our newest dataset: 📚 FineWeb-Edu, a (web only) subset of FW filtered for high educational content. Link:
39
316
1K
@gui_penedo
Guilherme Penedo
3 months
We trained 200+ ablation models to validate our processing decisions, and we share all the code you need to reproduce our setup, along with the checkpoints of our dataset comparison ablation models! Find out all about 🍷 FineWeb on the 🤗 model page:
5
18
215
@gui_penedo
Guilherme Penedo
1 month
I was saving this for later but anyway... Should I send my paypal? 😛
@code_star
Cody Blakeney
1 month
@_lewtun @Yuchenj_UW @karpathy @LoubnaBenAllal1 @gui_penedo I would bet $1000 that, for a fixed set of processing rules, older Common Crawl dumps, esp. pre-ChatGPT / LLM, are better for training LLMs on a mixture of downstream tasks than the most recent ones.
5
1
24
12
9
137
@gui_penedo
Guilherme Penedo
2 months
@karpathy @realmrfakename We've just added 10, 100 and 350 billion tokens subsets randomly sampled from the entire dataset:
2
11
90
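The subsets above are described as randomly sampled from the entire dataset. One standard way to draw a fixed-size uniform sample from a stream you can't hold in memory is reservoir sampling; the sketch below is a generic stdlib illustration (function name and sizes are hypothetical, not the actual FineWeb sampling code):

```python
import random

def reservoir_sample(stream, k, seed=0):
    """Draw a uniform random sample of k items from an iterable of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            # Keep the new item with probability k / (i + 1),
            # evicting a uniformly chosen reservoir slot.
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir

# Toy example: sample 5 document ids out of a stream of 1000
sample = reservoir_sample(range(1000), 5)
```

Each item ends up in the final sample with equal probability, which is what makes a token-count-targeted subset of a huge dataset representative of the whole.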
@gui_penedo
Guilherme Penedo
27 days
We keep getting new pretraining datasets 🔥 Congratulations to the Matrix team for such a strong dataset!
1
11
71
@gui_penedo
Guilherme Penedo
3 months
@rasbt More like 40TB of disk space
3
0
46
@gui_penedo
Guilherme Penedo
2 months
@karpathy @realmrfakename Indeed it's a great idea, we'll add a few randomly sampled small subsets
2
0
43
@gui_penedo
Guilherme Penedo
1 month
Special shout out to @HKydlicek who worked tirelessly over the last week on all the visualizations of the report, as well as to the rest of the team @LoubnaBenAllal1 @anton_lozhkov @colinraffel @lvwerra @Thom_Wolf
1
0
36
@gui_penedo
Guilherme Penedo
1 month
@karpathy A small note about FW-Edu: the main uplift is on more academic benchmarks like MMLU and ARC; the difference might not be as stark on HellaSwag, where imo more data diversity might still be more important
2
1
31
@gui_penedo
Guilherme Penedo
5 months
Datatrove 🏭 () is now officially on PyPI! You can install it with "pip install datatrove". Datatrove allows you to quickly scale up your data processing pipelines, and we have been using it extensively for our internal needs at 🤗 HuggingFace for a while!
1
6
31
@gui_penedo
Guilherme Penedo
3 months
@sivil_taram We plan to work on multilinguality later, but for now this is mostly an English dataset
4
1
25
@gui_penedo
Guilherme Penedo
1 month
fineweb-gpt2-chatbot 👀
@LoubnaBenAllal1
Loubna Ben Allal
1 month
Coming soon.. can you guess what's "FineWeb-?" Wrong answers only
43
22
178
0
0
11
@gui_penedo
Guilherme Penedo
3 months
@iacopo_poli Our base filtering (from RefinedWeb) + dedup brings performance in line with RefinedWeb; the C4 filters we applied bring it above, and the FineWeb filters give an extra push. Likely no on RP2 as we wanted a dataset that "just works" and didn't really annotate. Fixed the link, ty!
0
0
7
@gui_penedo
Guilherme Penedo
1 month
@TrelisResearch You can fetch the edu version for a specific dump, there is some info on the fineweb-edu repo
0
0
5
@gui_penedo
Guilherme Penedo
1 month
@mgalle We did. We compared hellaswag results with hellaswag contamination and didn't find any clear correlations. We tried checking absolute counts of matches (assuming it doesn't take much to memorize answers) as well as frequencies and were unable to conclude anything
0
0
3
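The contamination check described above compared benchmark results against absolute counts and frequencies of verbatim matches. As a crude, self-contained sketch of that kind of check (names and the toy data are hypothetical, not the actual FineWeb analysis code):

```python
def contamination_counts(benchmark_texts, documents):
    """For each benchmark string, count how many documents contain it
    verbatim (case-insensitive). A crude proxy for train/test overlap."""
    docs = [d.lower() for d in documents]
    return {t: sum(t.lower() in d for d in docs) for t in benchmark_texts}

# Hypothetical toy example: one benchmark completion, two training documents
docs = ["The cat sat on the mat.", "Completely unrelated web page text."]
counts = contamination_counts(["the cat sat on the mat"], docs)
# counts == {"the cat sat on the mat": 1}
```

A real analysis would work on normalized n-grams at scale rather than raw substring search, but the output is the same shape: per-benchmark-item match counts to correlate against scores.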
@gui_penedo
Guilherme Penedo
1 month
@soldni yep, each point is the average of two runs, each trained on a different 28BT sampling of fineweb data from that dump (pre C4 and FW filters)
1
0
5
@gui_penedo
Guilherme Penedo
1 month
@realmrfakename It's not very clear to me, but the final dataset is a subset of FineWeb and not Llama generated, so I think arguing it breaches the license would be a hard sell
0
0
4
@gui_penedo
Guilherme Penedo
3 months
@teortaxesTex @georgejrjrjr One thing we noticed, and that we will probably expand a bit on in the blog post, is that Common Crawl dumps have different quality from dump to dump. One solution for smaller budgets would be to take the N best dumps instead of sampling from the entire dataset
1
0
5
@gui_penedo
Guilherme Penedo
3 months
@ArYoMo WARC :)
1
0
4
@gui_penedo
Guilherme Penedo
1 month
More info on this here:
@gui_penedo
Guilherme Penedo
1 month
We are (finally) releasing the 🍷 FineWeb technical report! In it, we detail and explain every processing decision we took, and we also introduce our newest dataset: 📚 FineWeb-Edu, a (web only) subset of FW filtered for high educational content. Link:
39
316
1K
0
0
3
@gui_penedo
Guilherme Penedo
1 month
@soldni no :/ this one is our base filtering + minhash (so not the extra filters). I ran like 4-5 dumps with the other filters and the relative positions did not seem to change too much but I didn't run it for all the dumps
1
0
3
@gui_penedo
Guilherme Penedo
3 months
@georgejrjrjr @teortaxesTex We did not. It is the single most effective rule, but it removes (for us) 30% of tokens. We managed to achieve similar performance by checking the ratio of document lines not ending in punctuation and only dropping entire docs instead (removes 10%) - this is part of FWfilters
1
0
3
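The document-level variant described above (drop a whole doc when too many of its lines lack terminal punctuation, rather than dropping individual lines as C4 does) can be sketched roughly as follows; the punctuation set and the 0.3 threshold are illustrative assumptions, not the actual FineWeb values:

```python
# Illustrative terminal punctuation set (assumption, not the exact FW list)
TERMINAL_PUNCT = ('.', '!', '?', '"', "'")

def punct_ratio(doc: str) -> float:
    """Fraction of non-empty lines that do NOT end in terminal punctuation."""
    lines = [l.rstrip() for l in doc.splitlines()]
    lines = [l for l in lines if l]
    if not lines:
        return 1.0
    bad = sum(1 for l in lines if not l.endswith(TERMINAL_PUNCT))
    return bad / len(lines)

def keep_document(doc: str, max_ratio: float = 0.3) -> bool:
    # Drop the whole document only when too many of its lines look like
    # navigation / list items rather than prose (threshold is hypothetical).
    return punct_ratio(doc) <= max_ratio
```

Because the decision is all-or-nothing per document, lists embedded in otherwise normal prose survive, which is the destructiveness problem with the line-level rule mentioned in the next reply.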
@gui_penedo
Guilherme Penedo
3 months
@georgejrjrjr @teortaxesTex (30% of our base filtering/deduped baseline, to clarify) Plus it's quite destructive, it basically destroys any sort of list in the middle of a normal document as those normally don't end with punctuation
2
0
3
@gui_penedo
Guilherme Penedo
8 years
Verifying myself: I am philipsnostrum on Keybase.io. ANiPKu0Ryegg9libTXJ6MhS0uVaf1hy3W0VB /
0
0
3
@gui_penedo
Guilherme Penedo
5 months
The main goal of Datatrove 🏭 is to make large scale dataset curation accessible, without the need to reinvent the wheel every time you start a new project. It can run locally or on a slurm cluster, and includes logging and stats.
1
0
2
@gui_penedo
Guilherme Penedo
1 month
@WenhuChen @ClementDelangue Is there an easy way to get the CommonCrawl English subset? I am assuming I can just take the "cc_en" files from the hub?
0
0
2
@gui_penedo
Guilherme Penedo
3 months
@softyoda @itsandrewgao This is just an issue with the dataset viewer indexing, the dataset is around 40TB, not 200GB :)
0
0
2
@gui_penedo
Guilherme Penedo
26 days
@XYOU @pjox13 We're happy to run a training under the same conditions, but you can find details on the model setup (we haven't posted the exact training script yet) and the exact eval code on our blogpost
1
0
2
@gui_penedo
Guilherme Penedo
3 months
@teortaxesTex @georgejrjrjr "Likely no on RP2" -> we will not release an unfiltered version of FineWeb with annotations, like RP2. RP2 itself is not filtered, it gives you metrics and you can choose your own filtering using them. Our dataset "just works" in the sense that we already chose how to filter it
1
0
2
@gui_penedo
Guilherme Penedo
3 months
@georgejrjrjr @teortaxesTex Another detail btw: while the terminal_punct rule removes 30% and is the single most effective rule, all the other rules remove around half of that and give a bigger boost (when combined)
0
0
1
@gui_penedo
Guilherme Penedo
8 years
@PokemonGoApp fix the servers before rolling out to more countries yo
0
0
0
@gui_penedo
Guilherme Penedo
1 month
@xlr8harder They are averaged together first
1
0
1
@gui_penedo
Guilherme Penedo
3 months
@georgejrjrjr @teortaxesTex Trafilatura is very actively maintained so who knows, maybe it's better now
1
0
1
@gui_penedo
Guilherme Penedo
7 years
@CarolSc99 Nice! So you're going into computer science after all?
1
0
1
@gui_penedo
Guilherme Penedo
3 months
@georgejrjrjr @teortaxesTex For markdown I did play around with it a bit and at least at the time found a lot of documents where it would parse them incorrectly and break formatting a lot
1
0
1
@gui_penedo
Guilherme Penedo
26 days
@pjox13 If there is a dataset released on the hub that applied these filtering steps to colossal Oscar I'd be happy to add it to the comparison
0
0
1
@gui_penedo
Guilherme Penedo
1 month
@gui_penedo
Guilherme Penedo
1 month
We are (finally) releasing the 🍷 FineWeb technical report! In it, we detail and explain every processing decision we took, and we also introduce our newest dataset: 📚 FineWeb-Edu, a (web only) subset of FW filtered for high educational content. Link:
39
316
1K
1
0
1
@gui_penedo
Guilherme Penedo
3 months
@georgejrjrjr @teortaxesTex No rationale, I suppose we just never considered it. I suppose we could have, at least the title/author metadata. In retrospect maybe we should have. The XML not so sure as it would probably have considerably increased the storage size without bringing a big benefit
1
0
1
@gui_penedo
Guilherme Penedo
1 month
@kishorelive We do not recommend using only the highest scoring samples as there starts to be perf degradation on non academic benchmarks
1
0
1
@gui_penedo
Guilherme Penedo
1 month
@kishorelive I think you can use the sample (10, 100 and 350BT) versions of FineWeb-edu (score >= 3). If you really want to take the high (>=4) data then you will definitely need to pair it with something else
1
0
1
@gui_penedo
Guilherme Penedo
5 months
With Datatrove 🏭, you can:
- quickly read data in diff formats from disk, the cloud, or hf hub
- use SOTA filters out of the box
- deduplicate data at scale (minhash, exactsubstr, bloom filters, etc)
- tokenize your data
- easily run and scale your custom processing logic
1
0
1
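Datatrove handles MinHash dedup at scale; as a rough self-contained illustration of the MinHash idea itself (not Datatrove's API — all names below are hypothetical), each document is reduced to a small signature whose slot-wise agreement estimates shingle-set Jaccard similarity:

```python
import hashlib

def shingles(text, n=3):
    """Word n-grams ("shingles") of a document (assumes >= n words)."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def minhash_signature(text, num_perm=64, n=3):
    """For each of num_perm seeded hash functions, keep the minimum
    hash value over the document's shingles."""
    sig = []
    for seed in range(num_perm):
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(f"{seed}:{s}".encode(), digest_size=8).digest(),
                "big",
            )
            for s in shingles(text, n)
        ))
    return sig

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

sig_a = minhash_signature("the quick brown fox jumps over the lazy dog near the river bank today")
sig_b = minhash_signature("the quick brown fox jumps over the lazy dog near the river bank tonight")
similarity = estimated_jaccard(sig_a, sig_b)  # high, since only one word differs
```

Near-duplicates can then be bucketed by signature bands instead of comparing every document pair, which is what makes the approach tractable on trillions of tokens.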
@gui_penedo
Guilherme Penedo
3 months
@chewchewchew10 Indeed, they are all false positives on the byte signature
1
0
1
@gui_penedo
Guilherme Penedo
5 months
Still early days for the library, but we are quite excited by the community reception, and thankful for all the feedback and contributions ❤️. Please keep them coming!
0
0
1
@gui_penedo
Guilherme Penedo
3 months
@Euclaise_ @ITica007 @Thom_Wolf @togethercompute That's exactly right! RP2 is a great work but not directly comparable to FineWeb as you kind of have to "choose your own filtering" whereas FW is "plug and play"
0
0
1