Mark Cummins @mark_cummins Twitter profile

Mark Cummins

@mark_cummins

2 months

It’s no secret that LLM training data is running out. How close are we to the limit? To answer that, here's an estimate of the total amount of text in the world from every major source:

85

371

2K

Mark Cummins

@mark_cummins

2 years

I've lived in Ireland, the UK and the US. Just by living there, you can feel what these graphs show. Ireland is a much more equal society, and I would say a better society. We have big issues to fix with housing, but Ireland at the core is a good society, and we can be proud of it

John Burn-Murdoch

@jburnmurdoch

2 years

Oh, and I almost forgot: here is a custom version with Ireland highlighted, to appease the great and powerful @davidmcw 🙏🙇

149

801

4K

29

76

594

Mark Cummins

@mark_cummins

4 years

Excited to finally share the news Pointy is being acquired by Google.

Helping local businesses showcase products online with Pointy

Our agreement to acquire Pointy will help local merchants better showcase their products to interested shoppers on Google.

blog.google

47

49

571

Mark Cummins

@mark_cummins

2 years

To test drive Dall-e, I decided to take a tour of the history of art. I like bicycles, so we will have bicycles. Let's begin. 🧵 #dalle2 #dalle

18

100

505

Mark Cummins

@mark_cummins

2 months

Llama 3 was trained on 15 trillion tokens (11T words). That’s large - approximately 100,000x what a human requires for language learning

7

15

273
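The ratios implied by this post can be checked with a few lines of Python. This is a rough sketch that uses only the figures quoted above (15T tokens, 11T words, roughly 100,000x); the variable names are illustrative, and the per-human figure is simply what the post's ratio implies, not an independently sourced estimate.

```python
# Figures quoted in the post above.
llama3_tokens = 15e12    # 15 trillion training tokens
llama3_words = 11e12     # stated word equivalent
human_ratio = 100_000    # "approximately 100,000x" a human's requirement

# Implied conversion rate between words and tokens.
words_per_token = llama3_words / llama3_tokens

# Implied language exposure for a human learner (derived, not sourced).
human_tokens = llama3_tokens / human_ratio
human_words = llama3_words / human_ratio

print(f"words per token: {words_per_token:.2f}")
print(f"implied human requirement: {human_tokens:.2e} tokens, {human_words:.2e} words")
```

On these numbers, the post's ratio corresponds to a human learner needing on the order of a hundred million words.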

Mark Cummins

@mark_cummins

2 months

Books are another large but less accessible source. Google Books is almost 5T tokens, and is available only to Google. It may be the largest proprietary high-quality token source out there.

4

9

269

Mark Cummins

@mark_cummins

2 months

The primary source of this data is web crawl. Common Crawl is over 100 trillion tokens, though a lot of that is junk and duplicates. Quality filtered subsets like Fineweb are 15T tokens. So Llama 3 is trained on basically all useful English text on the internet

2

9

235

Mark Cummins

@mark_cummins

2 months

I had never heard of Anna’s Archive, it’s a shadow library with almost 4T tokens of e-books. If you’re not inhibited by legal concerns, this is a big one.

3

7

211

Mark Cummins

@mark_cummins

2 months

Academic articles and patents add another 1.2T tokens. It takes work to extract this from PDFs, but it’s very valuable high-quality text

2

5

198

Mark Cummins

@mark_cummins

2 years

Some personal news - as of last month I've finished up at Google. With in a good home and the team doing great, it was the right time to say goodbye. I'll still be cheering loudly from outside. Lots of good things still to come there.

22

0

192

Mark Cummins

@mark_cummins

2 months

The web is only 45% English, so you could potentially double training data using multi-lingual text. Empirically this doesn’t help current models, but I suspect someone will figure out how to make effective use of this data before long.

4

6

191

Mark Cummins

@mark_cummins

2 months

Transcribed audio is another sizable source of publicly available tokens. It’s widely believed that OpenAI developed Whisper for this purpose.

3

182

Mark Cummins

@mark_cummins

2 months

So we’re at 15T tokens now, and even with extreme effort there is at most 2x-4x headroom on public data. Historically the jump between each model generation has required 10x training data, so new ideas are going to be needed soon

3

19

181
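The gap described in this post can be made concrete with a small calculation. The sketch below takes the post's figures at face value (15T tokens today, 2x-4x headroom, 10x per generation); the variable names are illustrative only.

```python
# Figures quoted in the post above.
current_tokens = 15e12        # tokens used to train current models
headroom = (2, 4)             # "at most 2x-4x headroom on public data"
generation_multiplier = 10    # historical data growth per model generation

# Range of public data that might be reachable with extreme effort.
available_low = current_tokens * headroom[0]
available_high = current_tokens * headroom[1]

# Data a next-generation model would need if the historical trend held.
needed_next_gen = current_tokens * generation_multiplier

# How far short even the optimistic case falls.
shortfall_factor = needed_next_gen / available_high

print(f"available: {available_low / 1e12:.0f}T to {available_high / 1e12:.0f}T tokens")
print(f"needed at 10x: {needed_next_gen / 1e12:.0f}T tokens")
print(f"shortfall in the best case: {shortfall_factor:.1f}x")
```

Even at the top of the headroom range, the historical 10x step would need about 2.5 times more text than is available, which is the sense in which new ideas are needed.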

Mark Cummins

@mark_cummins

2 months

Note that Common Crawl only captures HTML pages. Anything in a PDF, or dynamically rendered, or behind a login, etc, is not captured. So there is more data to be found in other places.

1

169

Mark Cummins

@mark_cummins

2 months

Social media next: Twitter is 11T tokens, and Weibo is 38T. I was surprised how large these were, though quality filtering would bring the totals down.

5

4

160

Mark Cummins

@mark_cummins

2 months

Finally, as an upper bound, here’s my estimate of the total words ever spoken

4

13

159

Mark Cummins

@mark_cummins

2 months

Finally, we get to private data. There is far far more private data than public. Instant message logs come to maybe 650T tokens, and stored emails to maybe 1200T. Gmail alone is probably 300T.

3

6

157

Mark Cummins

@mark_cummins

2 months

This will shock nobody who’s been following things closely, but I still found it useful to work through exactly where the limits of the current approach are, and who holds what cards in terms of large proprietary data sources.

1

2

155

Mark Cummins

@mark_cummins

2 months

I can’t see any commercial LLM making broad use of this data, due to the obvious privacy issues. However, an intelligence agency like the NSA might.

3

4

155

Mark Cummins

@mark_cummins

2 months

So what’s the bottom line for LLM training? Current models are trained on 15T tokens. Possibly with a lot of effort you could expand that to 25 – 30T, but not much further. Adding non-English data you might get to 60T. That seems like the upper limit.

1

7

153

Mark Cummins

@mark_cummins

2 months

Up next is code. Code is a very important text type, and the amount of it surprised me. There’s 0.75T tokens of public code. Total code ever written might be as much as 20T, though much of this is private or lost.

3

5

147

Mark Cummins

@mark_cummins

2 months

In the meantime, sources for all of the token estimates are in the blog post:

3

4

138

Mark Cummins

@mark_cummins

2 months

Facebook is even larger, I estimate 140T tokens, though it’s possibly even more. Privacy concerns mean that it’s likely not usable, though.

2

1

133

Mark Cummins

@mark_cummins

2 months

For the Llama 3 release it was reported that “No Meta user data was used, despite Zuckerberg boasting that it’s a larger corpus than the entirety of Common Crawl”. That aligns.

2

129

Mark Cummins

@mark_cummins

2 months

YouTube and TikTok are 7T and 5T respectively. Podcasts are a tenth that size, but likely higher quality.

1

127

Mark Cummins

@mark_cummins

2 months

I pulled this together because I couldn’t find it anywhere. @EpochAIResearch has the best analysis I've seen, though a bit less granular. They seem to come to similar overall totals. @Jsevillamol @pvllss curious if you would disagree with any of the numbers here?

11

2

125

Mark Cummins

@mark_cummins

2 months

Private data is much larger. Facebook posts are upwards of 140T, Google has around 300T tokens in Gmail, and all private data everywhere is maybe 2,000T.

2

118
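As a quick sanity check, the private-data figures quoted across this thread (instant messages about 650T, stored email about 1200T, Facebook about 140T) can be summed and compared with the "maybe 2,000T" total. This is only a sketch: it assumes the three categories do not overlap and that Gmail's 300T is already counted inside stored email, neither of which the thread states explicitly.

```python
# Estimates quoted in the thread, in trillions of tokens.
private_sources_t = {
    "instant_messages": 650,
    "stored_email": 1200,    # assumed here to already include Gmail's ~300T
    "facebook_posts": 140,
}
gmail_t = 300

# Sum of the named private sources (assumes no overlap between them).
subtotal_t = sum(private_sources_t.values())

# Gmail's share of all stored email under the assumption above.
gmail_share_of_email = gmail_t / private_sources_t["stored_email"]

print(f"named private sources: {subtotal_t}T tokens")
print(f"Gmail share of stored email: {gmail_share_of_email:.0%}")
```

Under those assumptions the named sources come to roughly 1,990T, which is consistent with the rounded 2,000T figure.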

Mark Cummins

@mark_cummins

2 months

Conceivably there might be some privacy-preserving technique that enables you to train public models on private data, but the consequences of a mistake seem so enormous that I can’t see it happening.

4

2

107

Mark Cummins

@mark_cummins

2 months

TV is tiny, so possibly not worth the effort. Radio is potentially a few trillion tokens if good archives existed, but unfortunately they seem small and fragmented, so in practice I don’t think you can get more than a few hundred billion tokens, which is less than podcasts.

2

103

Mark Cummins

@mark_cummins

2 months

I don’t think this will cause progress to halt, though it will require new ideas. In a future post I’ll talk through synthetic data or other potential solutions.

6

2

97

Mark Cummins

@mark_cummins

3 years

Delighted to be joining the board of @ScaleIreland , to support the push for a better policy environment for Irish start-ups and scale-ups. Lots to do.

Scale Ireland

@ScaleIreland

3 years

Scale Ireland is delighted to announce the appointment of two new board members 🙌 - @clairemchugh , Co-Founder & CEO of @axonista - @mark_cummins , Co-Founder of @pointy Both are committed to ensuring Ireland becomes a leading location for innovation & entrepreneurship 🚀

5

14

50

0

5

81

Mark Cummins

@mark_cummins

3 years

Delighted to announce Pointy from Google will be offering free devices for Irish local retailers to support Covid recovery:

Irish retailers can build an online presence with Pointy

Starting today for a limited time, Pointy from Google will offer free Pointy devices to qualifying small and medium retailers in Ireland.

blog.google

1

18

78

Mark Cummins

@mark_cummins

4 years

This is going to change the world:

OpenAI Platform

Explore developer resources, tutorials, API docs, and dynamic examples to get the most out of OpenAI's platform.

platform.openai.com

6

16

76

Mark Cummins

@mark_cummins

5 years

Some of @pointy 's recent progress covered by Forbes today. It's amazing to be serving so many retailers now: it's a huge opportunity to make things work better in a whole sector of the economy. That's always what excited me most about Pointy.

3

14

75

Mark Cummins

@mark_cummins

5 years

We moved @pointy to @ripplingapp earlier this year, and it's been a big productivity win. Also, when well-known founders are personally answering customer support emails hours after a $45m Series A announcement, you know that company is going to win @parkerconrad

0

7

57

Mark Cummins

@mark_cummins

2 months

@marksugruek @seanblanchfield Probably not yet, because the data being used mostly predates LLMs. But it seems like it will be an issue in future.

1

52

Mark Cummins

@mark_cummins

3 years

As a founder you could not ask for a better deal. Uncapped SAFE note from a good investor = instant yes. So delighted to see the Irish investment ecosystem evolving this way. Hats off @pwlsh @NDRC_hq

NDRC

@NDRC_hq

3 years

“These terms are highly flexible for startups - we want to make sure the best founders can and want to join.” We spoke to @tech_eu about our new #NDRCAccelerator ‘founder friendly’ terms, and what they mean for globally ambitious tech startups. Read:

0

26

50

1

7

48

Mark Cummins

@mark_cummins

2 years

@adrianweckler @ScaleIreland I think Jack Pierse's point got misunderstood by most. He wasn't talking about income tax per se. It was about share options, where people essentially have equity. A gain there should be taxed at CGT rates, but in Ireland it's taxed as income (unless you use KEEP which has issues

3

4

43

Mark Cummins

@mark_cummins

6 years

On my way back after a week at @WebSummit and f.ounders. The execution is flawless. Hats off to @paddycosgrave and team on an amazing event.

0

6

42

Mark Cummins

@mark_cummins

2 years

@adrianweckler @ScaleIreland Founders' equity gets taxed at CGT rates. An employee who starts a month later and gets share options takes almost as much risk, but gets taxed at a much higher rate on their gains. It doesn't make much sense, and makes it really hard to attract talent into early stage startups.

2

6

40

Mark Cummins

@mark_cummins

3 months

This is the way. No question. What @realBobbyHealy has built with @MannaAero is phenomenal

1

3

40

Mark Cummins

@mark_cummins

2 years

New EI PreSeed fund is a huge improvement, replaces an old model that wasn't very attractive. The new deal is an uncapped note, can't ask for much better than that!

Scale Ireland

@ScaleIreland

2 years

We strongly welcome Enterprise Ireland’s new PreSeed Start Fund officially launched today. We want to thank the @Entirl team for consulting & engaging with @ScaleIreland on its new fund which we feel will strongly benefit early stage founders. @leo_clancy @LeoVaradkar @earlytom

8

40

143

3

7

37

Mark Cummins

@mark_cummins

28 days

@paulg For me anyway, it was slightly more subtle than that. Posts you liked were sometimes recommended to your followers. So I always thought of them as a soft share. Therefore, content that was interesting to me personally, but probably not to my audience, often didn't get a like

4

0

38

Mark Cummins

@mark_cummins

1 year

Dogpatch have a new program for anyone who's thinking of becoming a founder. Even if you're just exploring, no team and no idea yet. €2k stipend p/m, get mentorship and opportunity to pitch for investment at the end. A very generous offer!

Dogpatch Labs

@dogpatchlabs

1 year

📢 Announcing #Founders - Ireland’s first talent accelerator We’re bringing together the most talented engineers, domain experts and commercial minds to 🤝Find a co-founder 🚀Build the next generation of tech startups 💸Pitch for €100K investment Apply:

3

37

104

2

3

36

Mark Cummins

@mark_cummins

5 months

Some thoughts on the limits of dollar scaling, and four scenarios for AI after we hit the economic wall. Recent AI progress isn't happening at a natural rate - we have climbed much faster than Moore's Law by throwing larger and larger amounts of money at training. 1/n

7

4

37

Mark Cummins

@mark_cummins

2 years

Roman mosaic, 100 BC

1

34

Mark Cummins

@mark_cummins

2 months

@shane_a_lynn I think a little of both, plus synthetic data. Until now compute was the binding constraint, so there wasn't much pressure to be data efficient. Now that there is, I think there are lots of tricks that can be played.

0

1

36

Mark Cummins

@mark_cummins

1 year

Saw this incredible sky today. The air was glittery with ice crystals, and we got a display complete with Parry arc and supralateral arc

2

0

35

Mark Cummins

@mark_cummins

2 years

I'm of course continuing on the board of @ScaleIreland and investing/advising in startups. Far too much fun not to, and I love to see the Irish ecosystem continue to grow.

4

1

33

Mark Cummins

@mark_cummins

2 years

I've known @peteromallet a long while, and was proud to invest in Advisable. Peter is an amazing person, so of course even when he shuts down a company, he does it in a unique and very Peter way. His post-mortem below is searingly honest ... 1/n

POM

@pom_I_moq

2 years

After working on it almost every day for four and half years, we've made the incredibly difficult decision to shut down Advisable. I've gone into depth to try to explain this decision here: However, this is long so I'm sharing the tl;dr version below:

9

7

60

1

2

31

Mark Cummins

@mark_cummins

2 years

That's all folks.

9

0

30

Mark Cummins

@mark_cummins

3 years

Beautifully written account of leaving a startup behind. Extremely honest and raw about the pain of investing yourself fully in something, and it not going the way you dreamed

Patrick Finlay

@patjfin

3 years

Hey folks 👋 here's a bit of holiday reading for you. I wrote a piece about leaving the company I co-founded

57

24

478

2

4

30

Mark Cummins

@mark_cummins

2 years

Sumerian sculpture, 2600 BC

1

24

Mark Cummins

@mark_cummins

4 years

Well done Leo. Glad to see Ireland acting.

0

2

29

Mark Cummins

@mark_cummins

3 months

@levelsio The answer is LSE:MNTN. It's a closed end fund traded on the London market. Top holding is SpaceX, plus a bunch of other private tech. It's trading at .80 vs a NAV of 1.20, so you're buying at a 30% discount to market price (and the marks are ok). You can buy it through

1

29

Mark Cummins

@mark_cummins

2 years

Egyptian tomb painting, 1100 BC

1

27

Mark Cummins

@mark_cummins

10 months

Finn is truly excellent to work with. If you're an early stage founder, you should work with Finn!

Finn Murphy

@FinnMurphy12

10 months

Very special day announcing what I've been up to for the past 12 months (founding an early stage venture fund called Nebular) on my friend's podcast. tl;dr - solo-GP - writing a small number of lead and co-lead cheques at pre-seed and seed in software co's - based in nyc woo

49

22

358

1

0

28

Mark Cummins

@mark_cummins

3 years

If you're a new start-up based in Ireland, the new NDRC program is a pretty great deal. Applications closing this Sunday.

NDRC

@NDRC_hq

3 years

Applications for our 2nd #NDRCAccelerator are now open 🚀 ✅€100k SAFE investment ✅Firesides & mentoring from a €5B-worth founder network ✅Weekly coaching from EIRs Are you an entrepreneur building global solutions to global problems? Apply now ↪️

1

25

44

1

7

28

Mark Cummins

@mark_cummins

2 years

Claude Monet, 1870

1

26

Mark Cummins

@mark_cummins

1 year

I found an old directory of diffusion model tests on my laptop. On the left is the initial Stable Diffusion release from Aug 2022, on the right is Midjourney v5 from today, for the same prompt. Only 7 months apart 🤯 (Prompt is "A woman crossing a footbridge in a park")

1

0

26

Mark Cummins

@mark_cummins

5 years

As part of the @ScaleIreland launch, here are some thoughts on startup policy in Ireland, and how we can fix it.

0

11

27

Mark Cummins

@mark_cummins

7 months

Why is AI amazing at art, but can't drive a car? Can we predict which new AI applications might arrive soon, and which are still years away? New blog post thinking this through 1/n

2

25

Mark Cummins

@mark_cummins

2 years

Finally, Dall-e gave me some crazy ones that don't seem to be in any style I recognize. So here are some paintings of a bicycle by Dall-e itself:

1

0

25

Mark Cummins

@mark_cummins

2 years

Jan van Eyck, 1425

1

0

25

Mark Cummins

@mark_cummins

2 years

Cezanne, 1869

1

0

24

Mark Cummins

@mark_cummins

2 years

Pottery, not sure from where, seems old though.

1

0

21

Mark Cummins

@mark_cummins

2 years

Rene Magritte, 1828

2

1

22

Mark Cummins

@mark_cummins

2 years

Illuminated manuscript, Ireland, 9th century

1

0

21

Mark Cummins

@mark_cummins

4 years

This is going to be a big part of how the world unlocks.

Sundar Pichai

@sundarpichai

4 years

To help public health officials slow the spread of #COVID19 , Google & @Apple are working on a contact tracing approach designed with strong controls and protections for user privacy. @tim_cook and I are committed to working together on these efforts.

634

7K

25K

2

3

22

Mark Cummins

@mark_cummins

1 year

@elidourado Come to Ireland, every university here runs like this by law (you also only apply once to a centralized clearing house, no letters, just your score)

2

0

21

Mark Cummins

@mark_cummins

2 years

Early Medieval, 13th century

1

0

21

Mark Cummins

@mark_cummins

2 years

Hokusai, 1790

2

1

20

Mark Cummins

@mark_cummins

2 years

Rembrandt van Rijn, 1630

1

0

21

Mark Cummins

@mark_cummins

2 years

Good summary of what's needed to validate and (possibly) scale Far UV-C. Timeline feels about right. Funding this research seems like a no-brainer, research cost is low and the potential rewards are gigantic. 20th century gave us water sanitation, 21st should add air.

David Manheim

@davidmanheim

2 years

People interested in reducing biorisk seem to be super excited about 222nm light to kill pathogens. I’m also really excited - but it’s (unfortunately) probably a decade or more away from widespread usage. Let me explain.

11

38

329

1

5

21

Mark Cummins

@mark_cummins

2 years

Van Gogh, 1883

1

2

20

Mark Cummins

@mark_cummins

2 years

Cave painting, 15,000 BC

1

18

Mark Cummins

@mark_cummins

10 months

@gokulr I've definitely heard similar stories, but also know plenty of people who had the opposite experience. I think this one is just very individual.

0

21

Mark Cummins

@mark_cummins

5 years

@lachygroom I started @pointy to fix this problem. We're used by over 1% of all US retail locations now, and Google is starting to use the data in maps. So it's almost there now.

0

21

Mark Cummins

@mark_cummins

2 years

M.C. Escher, 1928

1

0

19

Mark Cummins

@mark_cummins

3 years

I've met a lot of smart people over the years, but @peteromallet is without doubt one of the sharpest and most interesting minds I've known. When he builds something, I pay attention! So I highly recommend checking out

POM

@pom_I_moq

3 years

After a long journey, I'm excited to finally share the new Advisable! While our goal is the same - to help ambitious companies & talented freelancers connect - we're dramatically changing things to solve many problems that have held freelance marketplaces back. Crazy 🧵 time!

3

14

91

1

2

20

Mark Cummins

@mark_cummins

2 years

Michelangelo, detail from the Sistine chapel ceiling, 1510

1

19

Mark Cummins

@mark_cummins

2 years

UK bronze age hill figure, 1400 BC

2

0

18

Mark Cummins

@mark_cummins

1 year

@norvid_studies @Tim_Dettmers Elephants have 5.6b cortical neurons, humans have 16b. Suzana Herculano-Houzel work lays it out very well: Elephant brains are pretty unusual, they don't follow normal scaling laws. They have 3x our neuron count overall, but almost all in the cerebellum.

The Paradox of the Elephant Brain

With three times as many neurons, why doesn’t the elephant brain outperform ours?

nautil.us

1

3

19

Mark Cummins

@mark_cummins

1 year

I gave GPT-4 the standard coding interview I used to run at Pointy. It wasn't the best candidate I ever interviewed, but probably in the top 10%.

1

3

20

Mark Cummins

@mark_cummins

2 years

Claude Lorrain, 1632

1

0

18

Mark Cummins

@mark_cummins

2 years

Up next for me is a period of funemployment. Building the company took almost a decade from start to finish. Time for a little break before anything new.

3

0

19

Mark Cummins

@mark_cummins

11 months

Uncomfortable prediction time: self-driving is finally on the cusp. I think we’re <5 years from widespread availability.

Teslaconomics

@Teslaconomics

11 months

This part was wild, take a deeper listen: 👂🏻 - This is entirely AI & cameras just like our brain which is neural nets & eyes - There is no line of code that says “slow down for speed bumps”, it’s doing it entirely on video training - There is no line of code that says “give

267

532

4K

2

0

19

Mark Cummins

@mark_cummins

2 months

I recently had cause to estimate the total number of words ever spoken. Current human population is about 8 billion, births since 1800 is 18 billion, and total homo sapiens who ever lived is estimated at around 117 billion. So would you think most words spoken belong to the deep

3

0

18

Mark Cummins

@mark_cummins

1 year

The UK is making a lot of smart moves around startup policy at the moment. Necessity is the mother of invention. Irish gov is just not at the races compared to UK or even France. There are some easy policy wins lying around, but excessive caution and inertia seem to rule.

1

4

18

Mark Cummins

@mark_cummins

5 years

Some thoughts on startup policy in Ireland, and how we can fix it. @ScaleIreland (Eagle-eyed @realBobbyHealy beat me to the tweet :-)

1

16

Mark Cummins

@mark_cummins

2 years

Yayoi Kusama, 2019

1

0

16

Mark Cummins

@mark_cummins

4 months

Seems about right: 3 or maybe 4 orders of magnitude from increasing investment, 1 from Nvidia margin compression, 1 from custom silicon, 1 from Moore's law. So plausibly 6-7 OOMs of headroom in the current sprint, which would last until the early 2030s. (Absent a wildcard on

Dwarkesh Patel

@dwarkesh_sp

4 months

Given that you need 100x more effective compute between model generations, if we don’t get AGI by GPT-7, will we just never get it? @_sholtodouglas : “GPT-4 costs, let's call it, $100 million. The $1B, $10B, and $100B run, all seem very plausible by private company standards. You

40

62

637

6

1

17
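The order-of-magnitude tally in this post can be added up explicitly. The sketch below uses only the contributions listed in the post and the "100x more effective compute between model generations" figure from the quoted post; the dictionary keys and variable names are illustrative, and treating the contributions as simply additive in log space is an assumption.

```python
import math

# (low, high) orders of magnitude from each source named in the post.
contributions = {
    "increased investment": (3, 4),
    "Nvidia margin compression": (1, 1),
    "custom silicon": (1, 1),
    "Moore's law": (1, 1),
}

# Total headroom, assuming the contributions multiply (add in log10 space).
total_low = sum(low for low, _ in contributions.values())
total_high = sum(high for _, high in contributions.values())

# The quoted post puts one model generation at 100x compute, i.e. 2 OOMs.
ooms_per_generation = math.log10(100)
generations_low = total_low / ooms_per_generation
generations_high = total_high / ooms_per_generation

print(f"headroom: {total_low}-{total_high} OOMs")
print(f"model generations covered: {generations_low:.1f} to {generations_high:.1f}")
```

If the 100x-per-generation figure holds, 6-7 OOMs would cover roughly three to three and a half further generations.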

Mark Cummins

@mark_cummins

2 years

Engraving by Albrecht Dürer, 1493

1

0

16

Mark Cummins

@mark_cummins

2 years

Henri Rousseau, 1874

1

0

15

Mark Cummins

@mark_cummins

2 months

@richardprice100 I'm almost finished another piece thinking that through. Short answer, I think data efficiency tricks, multimodal data, and synthetic data will all be part of the answer. I don't think anyone fully knows how well it will work yet, but there are enough irons in the fire that I'd

1

16

Mark Cummins

@mark_cummins

3 years

This is really a big deal. The pain of dealing with this was immense.

Patrick Collison

@patrickc

3 years

When we ask businesses why they don’t sell in more countries, “tax complexity” is one of the top things we hear. So we’ve built fully automatic global sales tax calculation and collection: .

50

221

3K

0

1

16

Mark Cummins

@mark_cummins

2 years

Ottoman miniature, 1505

1

0

14

Mark Cummins

@mark_cummins

3 months

Small LLMs being trained on bedtime stories written by larger LLMs. 'tis a like task we are at.

3

2

16

Mark Cummins

@mark_cummins

1 year

I never thought I'd say this, but I find myself using Bing now instead of Google. For anything more than nav queries or basic facts, GPT-4 based chat is just night and day better. I'm sure Google will catch up, but the gap is pretty embarrassing right now.

2

0

14

Mark Cummins

@mark_cummins

4 years

Why the Irish government is dawdling is hard to comprehend. We need to shut down the country NOW. We are on an exponential curve like everyone else. We cannot see all the cases yet, but they are here. Every day we delay, more spread, more deaths to come. Small delays matter a LOT

Patrick McKenzie

@patio11

4 years

This is a very lucid explanation of what we currently know about coronavirus spread: The graph you should be most concerned about is the one that demonstrates that, at the point you know coronavirus has taken root in your area, it is *very* widespread.

9

126

347

2

1

14