Mark Cummins

@mark_cummins

2,687
Followers
895
Following
132
Media
2,054
Statuses

Help for blind robots. Other half to @stalagnick

Dublin, Ireland
Joined April 2009
@mark_cummins
Mark Cummins
2 months
It’s no secret that LLM training data is running out. How close are we to the limit? To answer that, here's an estimate of the total amount of text in the world from every major source:
85
371
2K
@mark_cummins
Mark Cummins
2 years
I've lived in Ireland, the UK and the US. Just by living there, you can feel what these graphs show. Ireland is a much more equal society, and I would say a better society. We have big issues to fix with housing, but Ireland at the core is a good society, and we can be proud of it
@jburnmurdoch
John Burn-Murdoch
2 years
Oh, and I almost forgot: here is a custom version with Ireland highlighted, to appease the great and powerful @davidmcw 🙏🙇
Tweet media one
149
801
4K
29
76
594
@mark_cummins
Mark Cummins
2 years
To test drive Dall-e, I decided to take a tour of the history of art. I like bicycles, so we will have bicycles. Let's begin. 🧵 #dalle2 #dalle
Tweet media one
18
100
505
@mark_cummins
Mark Cummins
2 months
Llama 3 was trained on 15 trillion tokens (11T words). That’s large - approximately 100,000x what a human requires for language learning
Tweet media one
7
15
273
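The figures in the post above can be checked with a few lines of arithmetic. This is a minimal sketch that uses only the numbers quoted in the post (15T tokens, 11T words, a ratio of roughly 100,000x); the variable names are illustrative and the implied human figure is simply back-derived from that ratio, not an independent estimate.

```python
# Figures quoted in the post; names are illustrative assumptions.
LLAMA3_TOKENS = 15e12            # 15 trillion training tokens
LLAMA3_WORDS = 11e12             # stated word equivalent
HUMAN_RATIO = 100_000            # "approximately 100,000x" a human learner

# Words per token implied by the two stated totals.
words_per_token = LLAMA3_WORDS / LLAMA3_TOKENS

# Words a human would need if the 100,000x ratio holds exactly.
implied_human_words = LLAMA3_WORDS / HUMAN_RATIO

print(f"words per token: {words_per_token:.2f}")
print(f"implied human exposure: {implied_human_words:,.0f} words")
```

Under these inputs the ratio works out to roughly 0.73 words per token and about 110 million words of human language exposure.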
@mark_cummins
Mark Cummins
2 months
Books are another large but less accessible source. Google Books is almost 5T tokens, and is available only to Google. It may be the largest proprietary high-quality token source out there.
Tweet media one
4
9
269
@mark_cummins
Mark Cummins
2 months
The primary source of this data is web crawl. Common Crawl is over 100 trillion tokens, though a lot of that is junk and duplicates. Quality filtered subsets like Fineweb are 15T tokens. So Llama 3 is trained on basically all useful English text on the internet
Tweet media one
2
9
235
@mark_cummins
Mark Cummins
2 months
I had never heard of Anna’s Archive, it’s a shadow library with almost 4T tokens of e-books. If you’re not inhibited by legal concerns, this is a big one.
3
7
211
@mark_cummins
Mark Cummins
2 months
Academic articles and patents add another 1.2T tokens. It takes work to extract this from PDFs, but it’s very valuable high-quality text
Tweet media one
2
5
198
@mark_cummins
Mark Cummins
2 years
Some personal news - as of last month I've finished up at Google. With in a good home and the team doing great, it was the right time to say goodbye. I'll still be cheering loudly from outside. Lots of good things still to come there.
22
0
192
@mark_cummins
Mark Cummins
2 months
The web is only 45% English, so you could potentially double training data using multi-lingual text. Empirically this doesn’t help current models, but I suspect someone will figure out how to make effective use of this data before long.
4
6
191
@mark_cummins
Mark Cummins
2 months
Transcribed audio is another sizable source of publicly available tokens. It’s widely believed that OpenAI developed Whisper for this purpose.
Tweet media one
3
3
182
@mark_cummins
Mark Cummins
2 months
So we’re at 15T tokens now, and even with extreme effort there is at most 2x-4x headroom on public data. Historically the jump between each model generation has required 10x training data, so new ideas are going to be needed soon
3
19
181
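To make the gap in the post above concrete, here is a small sketch that compares the stated 2x-4x headroom with the stated 10x jump per model generation. It uses only the numbers from the post; the variable names are assumptions made for illustration.

```python
# Figures quoted in the post; names are illustrative assumptions.
CURRENT_TOKENS = 15e12           # tokens used by current models
HEADROOM = (2, 4)                # "at most 2x-4x headroom on public data"
GENERATION_JUMP = 10             # historical data multiplier per generation

# Public-data ceiling at the low and high end of the headroom range.
ceiling = [CURRENT_TOKENS * h for h in HEADROOM]

# Tokens the next generation would want if the 10x pattern continued.
needed = CURRENT_TOKENS * GENERATION_JUMP

# How many times short the ceiling falls of that target.
shortfall = [needed / c for c in ceiling]

for h, c, s in zip(HEADROOM, ceiling, shortfall):
    print(f"{h}x headroom: {c / 1e12:.0f}T available, {s:.1f}x short of {needed / 1e12:.0f}T")
```

Even at the optimistic end, the available public data is 2.5x short of a 10x jump, which is the basis for the claim that new ideas will be needed.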
@mark_cummins
Mark Cummins
2 months
Note that Common Crawl only captures HTML pages. Anything in a PDF, or dynamically rendered, or behind a login, etc, is not captured. So there is more data to be found in other places.
1
1
169
@mark_cummins
Mark Cummins
2 months
Social media next: Twitter is 11T tokens, and Weibo is 38T. I was surprised how large these were, though quality filtering would bring the totals down.
Tweet media one
5
4
160
@mark_cummins
Mark Cummins
2 months
Finally, as an upper bound, here’s my estimate of the total words ever spoken
Tweet media one
4
13
159
@mark_cummins
Mark Cummins
2 months
Finally, we get to private data. There is far far more private data than public. Instant message logs come to maybe 650T tokens, and stored emails to maybe 1200T. Gmail alone is probably 300T.
Tweet media one
3
6
157
@mark_cummins
Mark Cummins
2 months
This will shock nobody who’s been following things closely, but I still found it useful to work through exactly where the limits of the current approach are, and who holds what cards in terms of large proprietary data sources.
1
2
155
@mark_cummins
Mark Cummins
2 months
I can’t see any commercial LLM making broad use of this data, due to the obvious privacy issues. However, an intelligence agency like the NSA might.
3
4
155
@mark_cummins
Mark Cummins
2 months
So what’s the bottom line for LLM training? Current models are trained on 15T tokens. Possibly with a lot of effort you could expand that to 25-30T, but not much further. Adding non-English data you might get to 60T. That seems like the upper limit.
1
7
153
@mark_cummins
Mark Cummins
2 months
Up next is code. Code is a very important text type, and the amount of it surprised me. There’s 0.75T tokens of public code. Total code ever written might be as much as 20T, though much of this is private or lost.
Tweet media one
3
5
147
@mark_cummins
Mark Cummins
2 months
In the meantime, sources for all of the token estimates are in the blog post:
3
4
138
@mark_cummins
Mark Cummins
2 months
Facebook is even larger, I estimate 140T tokens, though it’s possibly even more. Privacy concerns mean that it’s likely not usable, though.
2
1
133
@mark_cummins
Mark Cummins
2 months
For the Llama 3 release it was reported that “No Meta user data was used, despite Zuckerberg boasting that it’s a larger corpus than the entirety of Common Crawl”. That aligns.
2
2
129
@mark_cummins
Mark Cummins
2 months
YouTube and TikTok are 7T and 5T respectively. Podcasts are a tenth that size, but likely higher quality.
1
1
127
@mark_cummins
Mark Cummins
2 months
I pulled this together because I couldn’t find it anywhere. @EpochAIResearch has the best analysis I've seen, though a bit less granular. They seem to come to similar overall totals. @Jsevillamol @pvllss curious if you would disagree with any of the numbers here?
11
2
125
@mark_cummins
Mark Cummins
2 months
Private data is much larger. Facebook posts are upwards of 140T, Google has around 300T tokens in Gmail, and all private data everywhere is maybe 2,000T.
2
2
118
@mark_cummins
Mark Cummins
2 months
Conceivably there might be some privacy-preserving technique that enables you to train public models on private data, but the consequences of a mistake seem so enormous that I can’t see it happening.
4
2
107
@mark_cummins
Mark Cummins
2 months
TV is tiny, so possibly not worth the effort. Radio is potentially a few trillion tokens if good archives existed, but unfortunately they seem small and fragmented, so in practice I don’t think you can get more than a few hundred billion tokens, which is less than podcasts.
2
2
103
@mark_cummins
Mark Cummins
2 months
I don’t think this will cause progress to halt, though it will require new ideas. In a future post I’ll talk through synthetic data or other potential solutions.
6
2
97
@mark_cummins
Mark Cummins
3 years
Delighted to be joining the board of @ScaleIreland , to support the push for a better policy environment for Irish start-ups and scale-ups. Lots to do.
@ScaleIreland
Scale Ireland
3 years
Scale Ireland is delighted to announce the appointment of two new board members 🙌 - @clairemchugh , Co-Founder & CEO of @axonista - @mark_cummins , Co-Founder of @pointy Both are committed to ensuring Ireland becomes a leading location for innovation & entrepreneurship 🚀
Tweet media one
5
14
50
0
5
81
@mark_cummins
Mark Cummins
3 years
Delighted to announce Pointy from Google will be offering free devices for Irish local retailers to support Covid recovery:
1
18
78
@mark_cummins
Mark Cummins
5 years
Some of @pointy 's recent progress covered by Forbes today. It's amazing to be serving so many retailers now: it's a huge opportunity to make things work better in a whole sector of the economy. That's always what excited me most about Pointy.
3
14
75
@mark_cummins
Mark Cummins
5 years
We moved @pointy to @ripplingapp earlier this year, and it's been a big productivity win. Also, when well-known founders are personally answering customer support emails hours after a $45m Series A announcement, you know that company is going to win @parkerconrad
0
7
57
@mark_cummins
Mark Cummins
2 months
@marksugruek @seanblanchfield Probably not yet, because the data being used mostly predates LLMs. But it seems like it will be an issue in future.
1
1
52
@mark_cummins
Mark Cummins
3 years
As a founder you could not ask for a better deal. Uncapped SAFE note from a good investor = instant yes. So delighted to see the Irish investment ecosystem evolving this way. Hats off @pwlsh @NDRC_hq
@NDRC_hq
NDRC
3 years
“These terms are highly flexible for startups - we want to make sure the best founders can and want to join.” We spoke to @tech_eu about our new #NDRCAccelerator ‘founder friendly’ terms, and what they mean for globally ambitious tech startups. Read:
Tweet media one
0
26
50
1
7
48
@mark_cummins
Mark Cummins
2 years
@adrianweckler @ScaleIreland I think Jack Pierse's point got misunderstood by most. He wasn't talking about income tax per se. It was about share options, where people essentially have equity. A gain there should be taxed at CGT rates, but in Ireland it's taxed as income (unless you use KEEP which has issues
3
4
43
@mark_cummins
Mark Cummins
6 years
On my way back after a week at @WebSummit and f.ounders. The execution is flawless. Hats off to @paddycosgrave and team on an amazing event.
Tweet media one
0
6
42
@mark_cummins
Mark Cummins
2 years
@adrianweckler @ScaleIreland Founders equity gets taxed at CGT rates. An employee who starts a month later and gets share options takes almost as much risk, but gets taxed at a much higher rate on their gains. It doesn't make much sense, and makes it really hard to attract talent into early stage startups.
2
6
40
@mark_cummins
Mark Cummins
3 months
This is the way. No question. What @realBobbyHealy has built with @MannaAero is phenomenal
Tweet media one
Tweet media two
1
3
40
@mark_cummins
Mark Cummins
2 years
New EI PreSeed fund is a huge improvement, replaces an old model that wasn't very attractive. The new deal is an uncapped note, can't ask for much better than that!
@ScaleIreland
Scale Ireland
2 years
We strongly welcome Enterprise Ireland’s new PreSeed Start Fund officially launched today. We want to thank the @Entirl team for consulting & engaging with @ScaleIreland on its new fund which we feel will strongly benefit early stage founders. @leo_clancy @LeoVaradkar @earlytom
Tweet media one
Tweet media two
8
40
143
3
7
37
@mark_cummins
Mark Cummins
28 days
@paulg For me anyway, it was slightly more subtle than that. Posts you liked were sometimes recommended to your followers. So I always thought of them as a soft share. Therefore, content that was interesting to me personally, but probably not to my audience, often didn't get a like
4
0
38
@mark_cummins
Mark Cummins
1 year
Dogpatch have a new program for anyone who's thinking of becoming a founder. Even if you're just exploring, no team and no idea yet. €2k stipend p/m, get mentorship and opportunity to pitch for investment at the end. A very generous offer!
@dogpatchlabs
Dogpatch Labs
1 year
📢 Announcing #Founders - Ireland’s first talent accelerator We’re bringing together the most talented engineers, domain experts and commercial minds to 🤝Find a co-founder 🚀Build the next generation of tech startups 💸Pitch for €100K investment Apply:
3
37
104
2
3
36
@mark_cummins
Mark Cummins
5 months
Some thoughts on the limits of dollar scaling, and four scenarios for AI after we hit the economic wall. Recent AI progress isn't happening at a natural rate - we have climbed much faster than Moore's Law by throwing larger and larger amounts of money at training. 1/n
7
4
37
@mark_cummins
Mark Cummins
2 years
Roman mosaic, 100 BC
Tweet media one
1
1
34
@mark_cummins
Mark Cummins
2 months
@shane_a_lynn I think a little of both, plus synthetic data. Until now compute was the binding constraint, so there wasn't much pressure to be data efficient. Now that there is, I think there are lots of tricks that can be played.
0
1
36
@mark_cummins
Mark Cummins
1 year
Saw this incredible sky today. The air was glittery with ice crystals, and we got a display complete with Parry arc and supralateral arc
Tweet media one
2
0
35
@mark_cummins
Mark Cummins
2 years
I'm of course continuing on the board of @ScaleIreland and investing/advising in startups. Far too much fun not to, and I love to see the Irish ecosystem continue to grow.
4
1
33
@mark_cummins
Mark Cummins
2 years
I've known @peteromallet a long while, and was proud to invest in Advisable. Peter is an amazing person, so of course even when he shuts down a company, he does it in a unique and very Peter way. His post-mortem below is searingly honest ... 1/n
@pom_I_moq
POM
2 years
After working on it almost every day for four and a half years, we've made the incredibly difficult decision to shut down Advisable. I've gone into depth to try to explain this decision here: However, this is long so I'm sharing the tl;dr version below:
9
7
60
1
2
31
@mark_cummins
Mark Cummins
2 years
That's all folks.
Tweet media one
9
0
30
@mark_cummins
Mark Cummins
3 years
Beautifully written account of leaving a startup behind. Extremely honest and raw about the pain of investing yourself fully in something, and it not going the way you dreamed
@patjfin
Patrick Finlay
3 years
Hey folks 👋 here's a bit of holiday reading for you. I wrote a piece about leaving the company I co-founded
57
24
478
2
4
30
@mark_cummins
Mark Cummins
2 years
Sumerian sculpture, 2600 BC
Tweet media one
1
1
24
@mark_cummins
Mark Cummins
4 years
Well done Leo. Glad to see Ireland acting.
0
2
29
@mark_cummins
Mark Cummins
3 months
@levelsio The answer is LSE:MNTN. It's a closed end fund traded on the London market. Top holding is SpaceX, plus a bunch of other private tech. It's trading at .80 vs a NAV of 1.20, so you're buying at a 30% discount to market price (and the marks are ok). You can buy it through
Tweet media one
1
1
29
@mark_cummins
Mark Cummins
2 years
Egyptian tomb painting, 1100 BC
Tweet media one
1
1
27
@mark_cummins
Mark Cummins
10 months
Finn is truly excellent to work with. If you're an early stage founder, you should work with Finn!
@FinnMurphy12
Finn Murphy
10 months
Very special day announcing what I've been up to for the past 12 months (founding an early stage venture fund called Nebular) on my friends podcast tl:dr - solo-GP - writing a small number of lead and co-lead cheques at pre-seed and seed in software co's - based in nyc woo
49
22
358
1
0
28
@mark_cummins
Mark Cummins
3 years
If you're a new start-up based in Ireland, the new NDRC program is a pretty great deal. Applications closing this Sunday.
@NDRC_hq
NDRC
3 years
Applications for our 2nd #NDRCAccelerator are now open 🚀 ✅€100k SAFE investment ✅Firesides & mentoring from a €5B-worth founder network ✅Weekly coaching from EIRs Are you an entrepreneur building global solutions to global problems? Apply now ↪️
1
25
44
1
7
28
@mark_cummins
Mark Cummins
2 years
Claude Monet, 1870
Tweet media one
1
1
26
@mark_cummins
Mark Cummins
1 year
I found an old directory of diffusion model tests on my laptop. On the left is the initial Stable Diffusion release from Aug 2022, on the right is Midjourney v5 from today, for the same prompt. Only 7 months apart 🤯 (Prompt is "A woman crossing a footbridge in a park")
Tweet media one
Tweet media two
1
0
26
@mark_cummins
Mark Cummins
5 years
As part of the @ScaleIreland launch, here are some thoughts on startup policy in Ireland, and how we can fix it.
0
11
27
@mark_cummins
Mark Cummins
7 months
Why is AI amazing at art, but can't drive a car? Can we predict which new AI applications might arrive soon, and which are still years away? New blog post thinking this through 1/n
Tweet media one
2
2
25
@mark_cummins
Mark Cummins
2 years
Finally, Dall-e gave me some crazy ones that don't seem to be in any style I recognize. So here are some paintings of a bicycle by Dall-e itself:
Tweet media one
Tweet media two
Tweet media three
Tweet media four
1
0
25
@mark_cummins
Mark Cummins
2 years
Jan van Eyck, 1425
Tweet media one
1
0
25
@mark_cummins
Mark Cummins
2 years
Cezanne, 1869
Tweet media one
1
0
24
@mark_cummins
Mark Cummins
2 years
Pottery, not sure from where, seems old though.
Tweet media one
1
0
21
@mark_cummins
Mark Cummins
2 years
Rene Magritte, 1928
Tweet media one
2
1
22
@mark_cummins
Mark Cummins
2 years
Illuminated manuscript, Ireland, 9th century
Tweet media one
1
0
21
@mark_cummins
Mark Cummins
4 years
This is going to be a big part of how the world unlocks.
@sundarpichai
Sundar Pichai
4 years
To help public health officials slow the spread of #COVID19 , Google & @Apple are working on a contact tracing approach designed with strong controls and protections for user privacy. @tim_cook and I are committed to working together on these efforts.
634
7K
25K
2
3
22
@mark_cummins
Mark Cummins
1 year
@elidourado Come to Ireland, every university here runs like this by law (you also only apply once to a centralized clearing house, no letters, just your score)
2
0
21
@mark_cummins
Mark Cummins
2 years
Early Medieval, 13th century
Tweet media one
1
0
21
@mark_cummins
Mark Cummins
2 years
Hokusai, 1790
Tweet media one
2
1
20
@mark_cummins
Mark Cummins
2 years
Rembrandt van Rijn, 1630
Tweet media one
1
0
21
@mark_cummins
Mark Cummins
2 years
Good summary of what's needed to validate and (possibly) scale Far UV-C. Timeline feels about right. Funding this research seems like a no-brainer, research cost is low and the potential rewards are gigantic. 20th century gave us water sanitation, 21st should add air.
@davidmanheim
David Manheim
2 years
People interested in reducing biorisk seem to be super excited about 222nm light to kill pathogens. I’m also really excited - but it’s (unfortunately) probably a decade or more away from widespread usage. Let me explain.
11
38
329
1
5
21
@mark_cummins
Mark Cummins
2 years
Van Gogh, 1883
Tweet media one
1
2
20
@mark_cummins
Mark Cummins
2 years
Cave painting, 15,000 BC
Tweet media one
1
1
18
@mark_cummins
Mark Cummins
10 months
@gokulr I've definitely heard similar stories, but also know plenty of people who had the opposite experience. I think this one is just very individual.
0
0
21
@mark_cummins
Mark Cummins
5 years
@lachygroom I started @pointy to fix this problem. We're used by over 1% of all US retail locations now, and Google is starting to use the data in maps. So it's almost there now.
0
0
21
@mark_cummins
Mark Cummins
2 years
M.C. Escher, 1928
Tweet media one
1
0
19
@mark_cummins
Mark Cummins
3 years
I've met a lot of smart people over the years, but @peteromallet is without doubt one of the sharpest and most interesting minds I've known. When he builds something, I pay attention! So I highly recommend checking out
@pom_I_moq
POM
3 years
After a long journey, I'm excited to finally share the new Advisable! While our goal is the same - to help ambitious companies & talented freelancers connect - we're dramatically changing things to solve many problems that have held freelance marketplaces back. Crazy 🧵 time!
Tweet media one
3
14
91
1
2
20
@mark_cummins
Mark Cummins
2 years
Michelangelo, detail from the Sistine chapel ceiling, 1510
Tweet media one
1
1
19
@mark_cummins
Mark Cummins
2 years
UK bronze age hill figure, 1400 BC
Tweet media one
2
0
18
@mark_cummins
Mark Cummins
1 year
@norvid_studies @Tim_Dettmers Elephants have 5.6b cortical neurons, humans have 16b. Suzana Herculano-Houzel work lays it out very well: Elephant brains are pretty unusual, they don't follow normal scaling laws. They have 3x our neuron count overall, but almost all in the cerebellum.
1
3
19
@mark_cummins
Mark Cummins
1 year
I gave GPT-4 the standard coding interview I used to run at Pointy. It wasn't the best candidate I ever interviewed, but probably in the top 10%.
1
3
20
@mark_cummins
Mark Cummins
2 years
Claude Lorrain, 1632
Tweet media one
1
0
18
@mark_cummins
Mark Cummins
2 years
Up next for me is a period of funemployment. Building the company took almost a decade from start to finish. Time for a little break before anything new.
3
0
19
@mark_cummins
Mark Cummins
11 months
Uncomfortable prediction time: self-driving is finally on the cusp. I think we’re <5 years from widespread availability.
@Teslaconomics
Teslaconomics
11 months
This part was wild, take a deeper listen: 👂🏻 - This is entirely AI & cameras just like our brain which is neural nets & eyes - There is no line of code that says “slow down for speed bumps”, it’s doing it entirely on video training - There is no line of code that says “give
267
532
4K
2
0
19
@mark_cummins
Mark Cummins
2 months
I recently had cause to estimate the total number of words ever spoken. Current human population is about 8 billion, births since 1800 is 18 billion, and total homo sapiens who ever lived is estimated at around 117 billion. So would you think most words spoken belong to the deep
3
0
18
@mark_cummins
Mark Cummins
1 year
The UK is making a lot of smart moves around startup policy at the moment. Necessity is the mother of invention. Irish gov is just not at the races compared to UK or even France. There are some easy policy wins lying around, but excessive caution and inertia seem to rule.
1
4
18
@mark_cummins
Mark Cummins
5 years
Some thoughts on startup policy in Ireland, and how we can fix it. @ScaleIreland .(Eagle-eyed @realBobbyHealy beat me to the tweet :-)
1
1
16
@mark_cummins
Mark Cummins
2 years
Yayoi Kusama, 2019
Tweet media one
1
0
16
@mark_cummins
Mark Cummins
4 months
Seems about right: 3 or maybe 4 orders of magnitude from increasing investment, 1 from Nvidia margin compression, 1 from custom silicon, 1 from Moore's law. So plausibly 6-7 OOMs of headroom in the current sprint, which would last until the early 2030s. (Absent a wildcard on
@dwarkesh_sp
Dwarkesh Patel
4 months
Given that you need 100x more effective compute between model generations, if we don’t get AGI by GPT-7, will we just never get it? @_sholtodouglas : “GPT-4 costs, let's call it, $100 million. The $1B, $10B, and $100B run, all seem very plausible by private company standards. You
40
62
637
6
1
17
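The order-of-magnitude tally in the post above can be summed directly. This sketch uses the contributions listed in the post and the 100x-per-generation figure from the quoted post; the dictionary keys and variable names are illustrative assumptions, and treating the contributions as simply additive in log space is the simplification the post itself makes.

```python
import math

# OOM contributions as listed in the post: (low, high) estimates.
contributions = {
    "increased investment": (3, 4),
    "Nvidia margin compression": (1, 1),
    "custom silicon": (1, 1),
    "Moore's law": (1, 1),
}

# Orders of magnitude add, because the underlying factors multiply.
low_ooms = sum(low for low, _ in contributions.values())
high_ooms = sum(high for _, high in contributions.values())

# The quoted post assumes 100x effective compute per model generation.
ooms_per_generation = math.log10(100)
generations = (low_ooms / ooms_per_generation, high_ooms / ooms_per_generation)

print(f"headroom: {low_ooms}-{high_ooms} OOMs")
print(f"generations at 100x each: {generations[0]:.1f}-{generations[1]:.1f}")
```

The total matches the 6-7 OOMs stated in the post, which at 100x per generation corresponds to roughly three to three and a half further model generations.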
@mark_cummins
Mark Cummins
2 years
Engraving by Albrecht Dürer, 1493
Tweet media one
1
0
16
@mark_cummins
Mark Cummins
2 years
Henri Rousseau, 1874
Tweet media one
1
0
15
@mark_cummins
Mark Cummins
2 months
@richardprice100 I'm almost finished another piece thinking that through. Short answer, I think data efficiency tricks, multimodal data, and synthetic data will all be part of the answer. I don't think anyone fully knows how well it will work yet, but there are enough irons in the fire that I'd
1
1
16
@mark_cummins
Mark Cummins
3 years
This is really a big deal. The pain of dealing with this was immense.
@patrickc
Patrick Collison
3 years
When we ask businesses why they don’t sell in more countries, “tax complexity” is one of the top things we hear. So we’ve built fully automatic global sales tax calculation and collection: .
50
221
3K
0
1
16
@mark_cummins
Mark Cummins
2 years
Ottoman miniature, 1505
Tweet media one
1
0
14
@mark_cummins
Mark Cummins
3 months
Small LLMs being trained on bedtime stories written by larger LLMs. 'tis a like task we are at.
Tweet media one
3
2
16
@mark_cummins
Mark Cummins
1 year
I never thought I'd say this, but I find myself using Bing now instead of Google. For anything more than nav queries or basic facts, GPT-4 based chat is just night and day better. I'm sure Google will catch up, but the gap is pretty embarrassing right now.
2
0
14
@mark_cummins
Mark Cummins
4 years
Why the Irish government is dawdling is hard to comprehend. We need to shut down the country NOW. We are on an exponential curve like everyone else. We cannot see all the cases yet, but they are here. Every day we delay, more spread, more deaths to come. Small delays matter a LOT
@patio11
Patrick McKenzie
4 years
This is a very lucid explanation of what we currently know about coronavirus spread: The graph you should be most concerned about is the one that demonstrates that, at the point you know coronavirus has taken root in your area, it is *very* widespread.
9
126
347
2
1
14