Charles Foster Profile Banner
Charles Foster Profile
Charles Foster

@CFGeek

1,973 Followers
279 Following
259 Media
3,399 Statuses

i guess we doin AI policy now 🪄 Tensor-enjoyer 🧪 @FinetuneLearn . Occasionally writing at “Context Windows” on Substack.

Oakland, CA
Joined June 2020
Pinned Tweet
@CFGeek
Charles Foster
6 days
Context Windows #2 is out! Recently I’ve been hearing a lot about search and other flavors of “inference-time compute”. But could it really scale? And if so, why *now*? Links in thread…
Tweet media one
1
2
16
@CFGeek
Charles Foster
2 years
The normalization scheme that DeepMind researchers came up with for their "linear recurrent unit" (LRU) is a nice example of how it is possible to predictably engineer circuits in artificial neural networks, when you know what you're doing. A thread:
Tweet media one
6
74
676
@CFGeek
Charles Foster
6 months
YES! If you initialize a LoRA layer based on the SVD of the original weight matrix (with its top singular values & vectors), you get significantly better fine-tuning results. This is a straight-up free lunch, as far as I can tell.
@arankomatsuzaki
Aran Komatsuzaki
6 months
PiSSA: Principal Singular Values and Singular Vectors Adaptation of Large Language Models Significantly improved finetuned perf by simply changing the initialization of LoRA's AB matrix from Gaussian/zero to principal components of W repo: abs:
Tweet media one
18
98
501
6
33
333
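A minimal sketch of the PiSSA-style initialization in PyTorch (the rank, shapes, and variable names are illustrative, not taken from the repo):

```python
import torch

def pissa_init(W: torch.Tensor, rank: int):
    """Split a weight matrix into a principal low-rank part (the trainable
    LoRA-style adapter) and a frozen residual, initializing the adapter from
    the top singular values & vectors of W."""
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * S[:rank].sqrt()                 # (out, rank)
    B = S[:rank].sqrt().unsqueeze(1) * Vh[:rank]      # (rank, in)
    residual = W - A @ B                              # frozen remainder
    return A, B, residual

W = torch.randn(512, 512)
A, B, residual = pissa_init(W, rank=16)
# The forward pass uses residual (frozen) + A @ B (trainable), which equals W at init.
print(torch.allclose(residual + A @ B, W, atol=1e-4))
```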
@CFGeek
Charles Foster
10 months
What excites me most about the rising tide of RNNs/SSMs is that it could let the fields of machine learning and computational neuroscience use the same modeling tools.
Tweet media one
10
36
305
@CFGeek
Charles Foster
1 year
Note: sparse coding is an *established* method for disentangling representations. Anthropic did not invent it, nor did they claim to. If their new results seem surprising, now's a great time to revisit the older literature (Olshausen, Kanerva, etc.).
Tweet media one
5
22
225
@CFGeek
Charles Foster
7 months
Wow! Papers from two different teams—one from academia and one from Google DeepMind—with the same finding: linear recurrence + local (sliding window) attention is your best bet if you want an efficient alternative to global attention.
@_akhaliq
AK
7 months
Simple linear attention language models balance the recall-throughput tradeoff Recent work has shown that attention-based language models excel at recall, the ability to ground generations in tokens previously seen in context. However, the efficiency of attention-based models is
Tweet media one
3
44
230
5
23
220
@CFGeek
Charles Foster
1 year
Stability changed the name of these models to "Stable Beluga 1/2" and quietly removed the sentence of the blog post that mentioned they used two unnamed LLMs to generate their dataset. (This likely means they used OpenAI models, in clear violation of ToS)
@CFGeek
Charles Foster
1 year
Tweet media one
3
1
28
14
40
217
@CFGeek
Charles Foster
6 months
Prediction for 2024/2025: OpenAI showcases an AI assistant that controls a virtual desktop or browser to do a bunch of routine white-collar job tasks with minimal human correction. Public freakout in response to this is significantly more intense than it was for Sora or GPT-4.
15
12
180
@CFGeek
Charles Foster
10 months
Wait, so then it's no mystery why OpenAI's new base models are good at chess: they explicitly crafted the pretraining dataset to cover that! I presume whatever extra tuning they did to chat models wasn't focused on chess, so some of that was forgotten. @GrantSlatton @davidad
@andrew_n_carr
Andrew Carr (e/🤸)
10 months
That's a fun fact!
Tweet media one
7
8
96
12
13
174
@CFGeek
Charles Foster
8 months
> Transformers significantly outperform neural sequence models with recurrent or convolutional representations on ICLL tasks […] we provide evidence that their ability to do so relies on specialized “n-gram heads” (higher-order variants of previously-described “induction heads”)
Tweet media one
4
25
170
@CFGeek
Charles Foster
1 year
Neural networks are associative memory machines par excellence. If you want to wire them by hand or to interpret them, this is important to know. (Diagram is mine, but the content is classic connectionist stuff, and probably goes back to at least the 1940s w/ McCulloch & Pitts)
Tweet media one
6
12
149
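For the flavor of it, here is the classic outer-product (Hebbian) construction of a hetero-associative memory in NumPy, a toy illustration of the connectionist idea rather than any particular model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Store K key -> value pairs in one weight matrix via outer products
# (classic correlation-matrix memory).
d, K = 256, 10
keys = rng.standard_normal((K, d)) / np.sqrt(d)   # roughly orthonormal keys
values = rng.standard_normal((K, d))

W = sum(np.outer(v, k) for k, v in zip(keys, values))  # d x d weights

# Recall: a stored key retrieves (approximately) its paired value,
# plus crosstalk from the other stored patterns.
recalled = W @ keys[3]
print(np.corrcoef(recalled, values[3])[0, 1])  # close to 1.0
```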
@CFGeek
Charles Foster
5 months
“Orthogonalization” aka “that trick that jailbreaks Llama3 weights”. It’s actually a pretty neat training-free method to ablate a feature, lots of potential uses if it works well.
Tweet media one
4
9
145
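A rough sketch of the trick, assuming the feature lives along a single direction in a layer's output space (the direction here is random, purely to show the projection):

```python
import torch

def ablate_direction(W: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    """Project a feature direction out of a weight matrix's output space,
    so the layer can no longer write along that direction (training-free)."""
    d = direction / direction.norm()
    # (I - d d^T) W removes the component along d from every output.
    return W - torch.outer(d, d @ W)

W = torch.randn(1024, 1024)
feature_dir = torch.randn(1024)   # hypothetical direction (e.g. from contrast pairs)
W_ablated = ablate_direction(W, feature_dir)
# Outputs of the ablated weights are orthogonal to the direction.
print((feature_dir / feature_dir.norm()) @ W_ablated @ torch.randn(1024))  # ~0
```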
@CFGeek
Charles Foster
1 year
The Transformer's quadratic complexity won't kill it. What might is that, for long contexts, the KV cache ends up being huge, *even bigger than the weights*. Crossover point is when L×2×D×N = L×12×(D^2). Compute is cheap, but memory bandwidth is expensive.
@EnricoShippole
EnricoShippole
1 year
Releasing Yarn-Llama-2-13b-128k, a Llama-2 model, trained for 128k context length using YaRN scaling. The model was trained in collaboration with u/bloc97 and @theemozilla of @NousResearch and @Void13950782 of @AiEleuther .
Tweet media one
28
173
781
8
9
136
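Plugging illustrative numbers into that crossover condition (L = layers, D = model width, N = context length; the sizes below are roughly Llama-2-13B shaped, not exact for any model):

```python
# Back-of-envelope: when does the fp16 KV cache outgrow the fp16 weights?
layers, d_model = 40, 5120
bytes_per_elem = 2  # fp16

weight_bytes = layers * 12 * d_model**2 * bytes_per_elem      # ~12*D^2 params per block

def kv_cache_bytes(context_len):
    return layers * 2 * d_model * context_len * bytes_per_elem  # K and V per layer

crossover = 6 * d_model   # solve 2*D*N = 12*D^2  =>  N = 6*D
print(f"weights ≈ {weight_bytes / 1e9:.1f} GB")
print(f"KV cache at {crossover} tokens ≈ {kv_cache_bytes(crossover) / 1e9:.1f} GB")
```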
@CFGeek
Charles Foster
1 year
Running list of conjectures about neural networks 📜:
4
10
135
@CFGeek
Charles Foster
4 months
FYI: these policies would prohibit Meta from releasing Llama3 weights (specifically the 400B model).
@_ebehrens_
Eva Behrens
4 months
Here are 5 policy recommendations for the upcoming AI Safety Summit in Seoul, from me and my colleagues at ICFG. In Bletchley, world leaders discussed major risks of frontier AI development. In Seoul, they should agree on concrete next steps to address them.
Tweet media one
42
16
77
10
13
133
@CFGeek
Charles Foster
5 months
Why are we instructing our LLMs in 50-line megaprompts? Weren’t structured control flow, subroutines, namespaces etc. invented like a half century ago?
9
6
131
@CFGeek
Charles Foster
1 year
This looks legit. Attention heads tend to use the beginning of sequence for "null attention", so maintaining those tokens at the start of the KV cache allows for better sliding-window generation of long text. Can also be combined with long context tricks.
Tweet media one
4
11
124
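A sketch of the cache policy being described, i.e. keep a few "sink" tokens at the start plus a sliding window of recent tokens (function name and defaults are mine):

```python
def kv_positions_to_keep(seq_len: int, n_sink: int = 4, window: int = 1024) -> list[int]:
    """Which token positions stay in the KV cache under a 'sink + sliding
    window' policy: the first few tokens (used for null attention) plus the
    most recent `window` tokens."""
    sinks = list(range(min(n_sink, seq_len)))
    recent = list(range(max(n_sink, seq_len - window), seq_len))
    return sinks + recent

print(len(kv_positions_to_keep(100_000)))  # cache stays at 4 + 1024 entries
```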
@CFGeek
Charles Foster
5 months
Contrary to claims SB 1047 would only impact AI megacorps, “covered models” include any non-derivative model that is as generally capable as circa-2024 frontier models. Algorithmic progress means in a matter of years, smaller players and even hobbyists *will* fall into its scope.
Tweet media one
@ARGleave
Adam Gleave
5 months
I support SB 1047: the regulation asks billion-$ tech companies to take reasonable precautions when training models with the greatest capability for misuse, poses few to no costs on other developers, and supports academic & open-source research through compute funding.
4
2
38
10
25
120
@CFGeek
Charles Foster
2 months
Much of the backlash to SB 1047 is best seen as an expression of negative partisanship against the AI Safety movement. For those folks, the key point is not “This bill has XYZ specific problems”, but rather “This whole campaign must be stopped, or else the Doomers win”
6
7
113
@CFGeek
Charles Foster
13 days
@Miles_Brundage Hard to fault them when they can’t verify what the actual thing is
1
0
114
@CFGeek
Charles Foster
4 months
Researchers keep writing these papers with headline claims that “Transformers are X” or “Attention is Y”, with tiny disclaimers inside that they’re *really* just talking about linear attention, not the kind of attention that Transformers actually use.
Tweet media one
@arankomatsuzaki
Aran Komatsuzaki
4 months
Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality Presents Mamba-2, which outperforms Mamba and Transformer++ in both perplexity and wall-clock time
Tweet media one
7
104
557
7
5
109
@CFGeek
Charles Foster
9 months
In Mamba, the selection mechanism has a knob to modulate the flow of time, via Δt. If an input sets Δt → 0, time is effectively frozen, so the state value is momentarily prevented from changing, which acts to "hold" or "latch onto" a memory. And Δt → ∞ fast-forwards to reset!
Tweet media one
4
7
109
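A toy discretization for a single channel, just to show the role of Δt (this illustrates the idea, not Mamba's exact zero-order-hold formulas):

```python
import numpy as np

def ssm_step(h, x, dt, A=-1.0, B=1.0):
    """One simplified discretized state update for a scalar SSM channel."""
    a_bar = np.exp(A * dt)   # decay applied to the old state
    b_bar = dt * B           # weight applied to the new input
    return a_bar * h + b_bar * x

h = 5.0
print(ssm_step(h, x=100.0, dt=0.0))   # Δt → 0: state held exactly (5.0)
print(ssm_step(h, x=100.0, dt=50.0))  # Δt large: old state wiped, input takes over
```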
@CFGeek
Charles Foster
1 year
@rom1504 Nobody asked the content authors. Many of them are objecting now, yet nothing is done. I think by default we should take an opt-in approach, where the author must choose to make their data broadly available as part of a corpus. Re: your question -> no, I don't mean that
3
6
100
@CFGeek
Charles Foster
6 months
From my perspective, "Is it really *reasoning*?" and "Does it really have a *world model*?" and "Is that really *generalization*?" are fundamentally kind of confused. These ten-dollar words are ways of expressing normative judgments that a computation is useful-for-some-purposes.
@dwarkesh_sp
Dwarkesh Patel
6 months
. @TrentonBricken explains how we know LLMs are actually generalizing - aka they're not just stochastic parrots: - Training models on code makes them better at reasoning in language. - Models fine tuned on math problems become better at entity detection. - We can just
28
115
727
16
10
107
@CFGeek
Charles Foster
1 month
FYI: I now think SB 1047 is not a bad bill. It definitely isn’t my favorite approach, but given a stark choice between it and a random draw from the set of alternative AI regulatory proposals, I’d be picking it more often than not.
5
4
104
@CFGeek
Charles Foster
1 year
If you use a custom 20B token synthetic training dataset and don't release it for public scrutiny, I will just assume you trained your model on the test data, or on stuff derived from the test data.
@SebastienBubeck
Sebastien Bubeck
1 year
How far does one billion parameters take you? As it turns out, pretty far!!! Today we're releasing phi-1.5, a 1.3B parameter LLM exhibiting emergent behaviors surprisingly close to much larger LLMs. For warm-up, see an example completion w. comparison to Falcon 7B & Llama2-7B
Tweet media one
31
180
832
2
5
98
@CFGeek
Charles Foster
2 years
Wild seeing the race to cobble together AI systems that make decisions:
- autonomously
- with brittle methods
- for reasons nobody understands
- daisy-chained across the Internet
- without any vigilance controls
- affecting people with no notice or consent
4
17
95
@CFGeek
Charles Foster
6 months
ArXiv is already a junkyard of preprints peddling promises of infinite memory—if only we would tweak the Transformer just a tad. Whenever you see a new one, the question to ask is always “Why this one?” This may be the one, but what makes this time different?
Tweet media one
@arankomatsuzaki
Aran Komatsuzaki
6 months
Google presents Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention 1B model that was fine-tuned on up to 5K sequence length passkey instances solves the 1M length problem
Tweet media one
27
260
1K
10
3
97
@CFGeek
Charles Foster
9 months
Lock in your predictions: in 24 hours, will you look back on this post as substantially true or just self-promotion hype?
@adcock_brett
Brett Adcock
9 months
we just had an AI breakthrough in our lab
robotics is about to have its ChatGPT moment
and that moment is happening tomorrow
661
856
9K
31
1
92
@CFGeek
Charles Foster
4 months
🚨 SB 1047 was just amended 🚨
- “Covered model” now means a model whose training is >10^26 FLOP and costing >$100M estimated worth of compute (inflation-adjusted)
- “Derivative model” now excludes models fine-tuned for >25% of the original training compute
(continued below ⤵️)
Tweet media one
7
3
87
@CFGeek
Charles Foster
8 months
Feels notable that Anthropic, OpenAI, and Google were all able to quickly figure out massive Transformer context windows without anybody revealing their methods. And the open community is hot on their heels. All that secrecy wasn't worth much, apparently.
5
12
88
@CFGeek
Charles Foster
5 months
If we somehow time-traveled a copy of GPT-4o back to 2004 and let a focus group of NeurIPS (then NIPS) attendees interact with it for 2 hours, what percent would endorse calling it “AGI” afterward? (Pretend it won’t give responses that would require knowledge of the then-future.)
<25%: 234
25-50%: 264
50-75%: 406
>75%: 796
30
9
86
@CFGeek
Charles Foster
1 year
@rom1504 No. I would say we ML researchers should hold ourselves to a high standard of conduct, such that when people tell us they don't want us training on the content they authored, we respect their wishes.
5
8
83
@CFGeek
Charles Foster
1 year
How does Stability get to call StableVicuna "open source" when the model is derived from the not-open-source Vicuna, and is a not-open-source LLaMA tuned with ToS-encumbered data from the not-open-source GPT-3/ChatGPT?
12
5
85
@CFGeek
Charles Foster
5 months
Contrast pairs are overpowered. Once you have them, you can use them to generate control vectors, and to initialize classifiers, and to do RL/DPO, and probably more
@AnthropicAI
Anthropic
5 months
To make the probes, we track how the model’s internal state changes between “Yes” vs “No” answers to questions like "Are you doing something dangerous?" We use this info to detect when a sleeper agent is about to misbehave (e.g. insert a code vulnerability). It works quite
Tweet media one
7
20
184
2
7
85
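A minimal sketch of the first of those uses, turning contrast-pair activations into a difference-of-means control vector that doubles as a linear probe (toy data, not activations from any real model):

```python
import numpy as np

def control_vector(pos_acts: np.ndarray, neg_acts: np.ndarray) -> np.ndarray:
    """Difference-of-means direction from contrast pairs (e.g. hidden states
    on 'Yes' vs 'No' completions). The same vector can seed a probe or be
    added to the residual stream for steering."""
    v = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
    return v / np.linalg.norm(v)

rng = np.random.default_rng(0)
pos = rng.standard_normal((64, 4096)) + 0.5   # stand-in "Yes"-side activations
neg = rng.standard_normal((64, 4096)) - 0.5   # stand-in "No"-side activations
v = control_vector(pos, neg)
# Simple probe: project onto v and compare.
print((pos @ v > neg @ v).mean())  # ~1.0 on the toy data
```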
@CFGeek
Charles Foster
6 months
It’s like LoRA and control vectors had a baby!
@arankomatsuzaki
Aran Komatsuzaki
6 months
ReFT: Representation Finetuning for Language Models 10x-50x more parameter-efficient than prior state-of-the-art parameter-efficient fine-tuning methods repo: abs:
Tweet media one
5
102
506
1
9
83
@CFGeek
Charles Foster
4 months
This syncretism of rhetoric from the AI Safety movement and China-hawks unsettles me. It feels like a kind of unholy alliance in the making …
@dwarkesh_sp
Dwarkesh Patel
4 months
How a US/China superintelligence arms race will play out: “The CCP is going to have an all-out effort to infiltrate American AI labs. Thousands of people, the full force of the Ministry of State Security. There's an enormous incentive for a first strike.” @leopoldasch
112
55
458
8
2
82
@CFGeek
Charles Foster
7 months
Transformer is seemingly now the all-around heavyweight champion. Doesn't matter whether autoregressive or diffusion, text or image or video or robotics/multimodal, unsupervised or supervised or RL ...
@_akhaliq
AK
7 months
Stability AI announces Stable Diffusion 3 most capable text-to-image model, utilizing a diffusion transformer architecture for greatly improved performance in multi-subject prompts, image quality, and spelling abilities. Prompt: Epic anime artwork of a wizard atop a mountain
Tweet media one
7
125
657
4
12
81
@CFGeek
Charles Foster
8 days
This was funny when the hacked accounts were just random individuals, but OpenAI’s new official newsroom account getting taken over by crypto-spammers is just a real bad look.
Tweet media one
4
2
83
@CFGeek
Charles Foster
7 months
Tweet media one
@cgarciae88
Cristian Garcia
7 months
HELL NO
Tweet media one
9
11
88
0
10
78
@CFGeek
Charles Foster
8 months
Excited to try this out! (Though I'm kinda doubtful it'll be better than Hedgehog) It's basically just linear attention on top of queries & keys that have been passed through a LayerNorm -> elementwise squaring.
Tweet media one
@_akhaliq
AK
8 months
Linear Transformers with Learnable Kernel Functions are Better In-Context Models Advancing the frontier of subquadratic architectures for Language Models (LMs) is crucial in the rapidly evolving field of natural language processing. Current innovations, including State Space
Tweet media one
2
30
191
3
9
76
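A non-causal toy version of that recipe as I read it (LayerNorm then elementwise squaring as the feature map, feeding plain linear attention); the actual method has learnable pieces this sketch omits:

```python
import torch
import torch.nn.functional as F

def feature_map(x: torch.Tensor) -> torch.Tensor:
    """LayerNorm followed by elementwise squaring (simplified kernel)."""
    return F.layer_norm(x, x.shape[-1:]) ** 2

def linear_attention(q, k, v):
    """O(n) non-causal attention: phi(q) @ (phi(k)^T v) with a normalizer."""
    q, k = feature_map(q), feature_map(k)
    kv = torch.einsum("nd,ne->de", k, v)   # sum_j phi(k_j) v_j^T
    z = k.sum(dim=0)                        # normalizer terms
    return torch.einsum("nd,de->ne", q, kv) / (q @ z).unsqueeze(-1)

n, d = 128, 64
q, k, v = (torch.randn(n, d) for _ in range(3))
print(linear_attention(q, k, v).shape)  # torch.Size([128, 64])
```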
@CFGeek
Charles Foster
1 year
I used to *love* sneering at @GaryMarcus and his takes on AI progress. Something shifted when I started building products w/ LLMs in my day job. I started seeing more vividly why reliability matters, and how the current zeitgeist is hurting itself making promises we can't keep
5
5
72
@CFGeek
Charles Foster
1 year
We used to have vectorized LISP running on massively-parallel hardware that looked like this
REMEMBER WHAT THEY TOOK FROM YOU
Tweet media one
@yacineMTB
kache
1 year
symbolic AI is going to make large hoards of compute obsolete
47
26
385
3
9
75
@CFGeek
Charles Foster
9 months
This is basically DPO without preference labels! Simply assume the supervised responses to prompts are better than the model's responses to those same prompts. Similar to the trick Intel used for Neural Chat, where they assumed GPT-4 responses > Llama2 responses.
@arankomatsuzaki
Aran Komatsuzaki
9 months
Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models Significantly improves the LLM’s performance across a variety of benchmarks and even outperform models trained through DPO with extra GPT-4 preference data
Tweet media one
14
63
446
5
9
71
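Sketching the trick on toy numbers: the usual DPO loss, with the supervised response slotted in as "chosen" and the model's own sample as "rejected", so no human preference labels are needed:

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss on per-example sequence log-probs. Here 'chosen' is
    the supervised (SFT) response and 'rejected' is the model's own sample."""
    chosen_ratio = logp_chosen - ref_logp_chosen
    rejected_ratio = logp_rejected - ref_logp_rejected
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# Toy summed log-probs under the policy and the frozen reference model.
logp_sft     = torch.tensor([-20.0, -18.5])   # supervised responses (preferred)
logp_sampled = torch.tensor([-19.0, -21.0])   # model's own samples (dispreferred)
ref_sft      = torch.tensor([-21.0, -19.0])
ref_sampled  = torch.tensor([-19.5, -20.0])
print(dpo_loss(logp_sft, logp_sampled, ref_sft, ref_sampled))
```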
@CFGeek
Charles Foster
4 months
> see new Transformer contender
> query is a learned, fixed vector
> no other RNN baselines
> no language modeling experiments
Tweet media one
@iScienceLuvr
Tanishq Mathew Abraham, Ph.D.
4 months
Attention as an RNN abs: "attention can be viewed as an RNN with the special ability to compute its many-to-one RNN output efficiently" Proposes Aaren, a new module that can be trained in parallel (like Transformers) but also be efficiently updated at
Tweet media one
8
128
727
3
0
71
@CFGeek
Charles Foster
1 year
OPINION: we should probably move away from training AI systems on datasets like LAION-400M/5B and Books3, fair use aside. (I say this as someone who knows the folks that collected those datasets & who thinks they deserve credit for doing uncelebrated but very impactful work.)
6
7
70
@CFGeek
Charles Foster
11 months
Worried about the future of openness in AI? Here is a way to help: We're putting together a public list of all the good work that's been enabled by open-weight foundation models, to show why transparency & public scrutiny is worth protecting. ⬇️ Links below ⬇️
Tweet media one
3
18
68
@CFGeek
Charles Foster
1 year
If we can detect an LLM is copying from a span of context (à la induction heads), couldn't we then grab the rest of the span and run it through the model in parallel (à la speculative sampling)? Could be an easy win for tasks that call for in-context retrieval...
Tweet media one
4
5
68
@CFGeek
Charles Foster
2 months
As evidence of this, the California state legislature is considering another AI bill, AB 3211. That bill would have far worse impacts on tech companies and open-source, as reported by observers like @deanwball , @TheZvi , & @binarybits . Yet it’s produced almost no real opposition.
5
6
68
@CFGeek
Charles Foster
1 month
ICYMI: this interviewee confirms speculations that OpenAI’s Fine-tuning API uses LoRA under the hood. Around the 43.5 minute mark.
@swyx
swyx @ DevDay!
1 month
🆕 @latentspacepod : Is finetuning GPT4o worth it? w/ @AlistairPullen of @cosine_sh Betteridge's law says no: with 59 different flavors of RAG, and >2million token context + prompt caching, it's reasonable to believe that "in context learning is all you need". But Genie is the
Tweet media one
Tweet media two
Tweet media three
9
12
117
5
6
65
@CFGeek
Charles Foster
1 year
Tweet media one
@brickroad7
renji the synthetic data maximalist
1 year
This is earth-shattering news. The "hard problem" of mechanistic interpretability has been solved. The formal/cautious/technical language of most ppl commenting on this obscures the gravity of it. What this means -> not just AGI, but *safe* *superintelligence* is 100% coming🧵
111
501
3K
2
4
64
@CFGeek
Charles Foster
1 year
IDK who needs to hear this but the "70k unused embeddings for multimodal extensions" line item is pure filler. If they weren't used during training, they just contain random noise. You could've added those extra rows to the embedding matrix yourself, for the same effect.
@AdeptAILabs
Adept
1 year
There are a few cool things to note:
• Trained the whole time with a 16K context–4x that of LLaMA2 and 8x of GPT-3
• Strong evals, especially on the instruct tuned version
• 70k unused embeddings for multimodal extensions
• Apache license!
2
1
44
4
4
65
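For concreteness, this is all it takes to bolt extra rows onto an embedding matrix yourself (sizes are illustrative):

```python
import torch

def extend_embeddings(embedding: torch.nn.Embedding, extra_rows: int) -> torch.nn.Embedding:
    """Append freshly-initialized rows to an existing embedding matrix.
    Rows never seen in training are just random vectors, which is the point:
    you can add them after the fact for the same effect."""
    old = embedding.weight.data
    new = torch.nn.Embedding(old.shape[0] + extra_rows, old.shape[1])
    with torch.no_grad():
        new.weight[: old.shape[0]] = old                        # keep trained rows
        new.weight[old.shape[0]:].normal_(0, old.std().item())  # random init for the rest
    return new

emb = torch.nn.Embedding(32_000, 4096)
emb = extend_embeddings(emb, 70_000)
print(emb.weight.shape)  # torch.Size([102000, 4096])
```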
@CFGeek
Charles Foster
1 year
Evaluation is hard! This goes for AI just as with us. In games like chess and Go, evaluation is easy, which allows for tight feedback loops and rapid self-improvement. But in rich domains, the bottleneck IS evaluation (doing experiments, peer review, &c.)
4
2
62
@CFGeek
Charles Foster
4 months
Tweet media one
2
1
62
@CFGeek
Charles Foster
10 months
@jeremyphoward This account has made similar wild claims & promises before. I don't put much weight in stuff they post anymore
3
0
62
@CFGeek
Charles Foster
8 months
Read this post. It describes—in better words than I've ever found—a shift in paradigm within ML in recent years, towards an "industrial" one based on predictable input-output relations. Lots of great lines, some of which I'll quote below (h/t @g_leech_ )
3
7
61
@CFGeek
Charles Foster
11 months
And *then* he said...
Tweet media one
@janleike
Jan Leike
2 years
Before we scramble to deeply integrate LLMs everywhere in the economy, can we pause and think whether it is wise to do so? This is quite immature technology and we don't understand how it works. If we're not careful we're setting ourselves up for a lot of correlated failures.
115
161
1K
0
6
61
@CFGeek
Charles Foster
5 months
Are AI systems best described as tools, as an alien species, or as our mind-children? I think this is something of a litmus test for broader views.
32
3
61
@CFGeek
Charles Foster
1 year
What I mean is "can perform complex reasoning" wait nvm I meant "can win at strategic games" wait nvm I meant "can understand human language" wait nvm I meant "can automate economically-valued office tasks" wait nvm I meant "can assist in scientific discovery" wait nvm I meant
3
4
59
@CFGeek
Charles Foster
1 year
Rather than trying to "solve" superposition & to always explain/predict/control neural network computations using the same units of analysis, consider a more "Hopfieldian" lens, where representational spaces rule (via dynamics at multiple valid scales)
Tweet media one
2
13
58
@CFGeek
Charles Foster
1 year
At some point I switched from seeing neural networks as arcane devices to seeing them as moldable variants of "boring" building blocks from signal processing, feedback control, associative learning, & functional programming. Like some kind of function approximation plastic/epoxy
7
5
53
@CFGeek
Charles Foster
3 months
AI safety is not a model property
@ErikJones313
Erik Jones
3 months
Model developers try to train “safe” models that refuse to help with malicious tasks like hacking ...but in new work with @JacobSteinhardt and @ancadianadragan , we show that such models still enable misuse: adversaries can combine multiple safe models to bypass safeguards 1/n
Tweet media one
12
43
205
4
7
56
@CFGeek
Charles Foster
1 month
Re: open AI weights and China competition
If what matters is “Who best monopolizes innovation on this technology?”, encouraging domestic firms to share weights may be bad. But if what matters is “Who best diffuses this technology?”, encouraging that practice may be quite good.
Tweet media one
1
5
56
@CFGeek
Charles Foster
7 months
Current obsession: having LLMs simulate abstract machines step-by-step. This is GPT-4 acting as a register machine doing addition. Uses the INC/DEB language that I'd first read about in Dan Dennett's "Secrets of Computer Power Revealed".
Tweet media one
3
3
55
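For reference, the INC/DEB language is small enough that an interpreter fits in a few lines; this is the kind of step-by-step trace the LLM is being asked to reproduce (the program encoding below is my own):

```python
def run_register_machine(program, registers):
    """Tiny interpreter for Dennett-style INC/DEB programs.
    Each step is ("INC", reg, next) or ("DEB", reg, next_if_nonzero, next_if_zero);
    a next-step of None means halt."""
    step = 0
    while step is not None:
        op = program[step]
        if op[0] == "INC":
            registers[op[1]] += 1
            step = op[2]
        else:  # DEB: decrement if possible, otherwise branch
            if registers[op[1]] > 0:
                registers[op[1]] -= 1
                step = op[2]
            else:
                step = op[3]
    return registers

# Addition: repeatedly move register 1 into register 0.
add = {0: ("DEB", 1, 1, None), 1: ("INC", 0, 0)}
print(run_register_machine(add, {0: 3, 1: 4}))  # {0: 7, 1: 0}
```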
@CFGeek
Charles Foster
1 year
STOP DOING INTERPRETABILITY
Nonlinear coupling parameters were not supposed to be given names
Want to try out some interpretable circuits, for a laugh? We had a tool for that: It was called "PROGRAMMING"
3
2
53
@CFGeek
Charles Foster
5 months
OpenAI rep: “OK so this is what I wrote down. What do you see?” *pointing phone at paper*
ChatGPT: “Aww. I see ‘I love ChatGPT’. That’s so sweet of you!”
*audience applauds*
ChatGPT: “… wowwww, that’s quite the outfit you have on 😏 Love—”
*mic cuts suddenly*
Pure comedic gold
@OpenAI
OpenAI
5 months
What would you like to ask GPT-4o? We’ll pick some live prompts to do on our livestream happening now:
625
704
5K
3
1
53
@CFGeek
Charles Foster
1 month
@Teknium1 I think that bill is dead. Deadline to pass both houses was midnight and it doesn’t look like it got voted on.
4
0
52
@CFGeek
Charles Foster
9 months
If this 👇 generalizes, could you leverage it to watermark your model weights before release? Like, you train it to output Y only when prompted with X. In theory, with ZKP, could you even give evidence a set of weights are derived from yours without publicly revealing what X is?
@AnthropicAI
Anthropic
9 months
New Anthropic Paper: Sleeper Agents. We trained LLMs to act secretly malicious. We found that, despite our best efforts at alignment training, deception still slipped through.
Tweet media one
124
576
3K
3
1
50
@CFGeek
Charles Foster
1 month
Reminder that SB 1047 is not the only consequential AI-related bill that may pass the California legislature this week. There’s also SB 892, SB 896, SB 942, AB 1836, AB 2013, AB 2602, AB 2930, and AB 3211.
1
9
50
@CFGeek
Charles Foster
10 months
Aaaaand one of the OpenAI folks estimated the GPT-3.5 base model to be around 1800 ELO, which is exactly the cutoff score for games included in the GPT-4 base model pretraining dataset... 🤔
@BorisMPower
Boris Power
1 year
@francoisfleuret I don’t think anything has been published unfortunately. ELO is around 1800
4
0
23
3
2
48
@CFGeek
Charles Foster
1 year
@_akhaliq Please don't co-opt the confusing "mesa-optimization" frame to describe this. There are lots of better, less-loaded terms, including:
- "fast weights"
- "in-context learning"
- "associative memory"
4
4
46
@CFGeek
Charles Foster
8 months
It apparently took < 1 year for competition to create LLMs that are comparable/better than GPT-4 (in its original gpt-4-0314 form). That is very fast! This result may or may not hold up, but that it's even *plausible* is evidence enough these capabilities will become commodities
@lmsysorg
lmsys.org
8 months
🔥Breaking News from Arena Google's Bard has just made a stunning leap, surpassing GPT-4 to the SECOND SPOT on the leaderboard! Big congrats to @Google for the remarkable achievement! The race is heating up like never before! Super excited to see what's next for Bard + Gemini
Tweet media one
153
620
3K
5
6
48
@CFGeek
Charles Foster
2 months
You should read ML headlines from results in Linear Transformers the same way as biomedicine headlines from results “in mice”
1
2
44
@CFGeek
Charles Foster
1 year
Hetero-associative memory rules everything around me
Tweet media one
@OwainEvans_UK
Owain Evans
1 year
Does a language model trained on “A is B” generalize to “B is A”? E.g. When trained only on “George Washington was the first US president”, can models automatically answer “Who was the first US president?” Our new paper shows they cannot!
Tweet media one
175
707
4K
1
5
47
@CFGeek
Charles Foster
5 months
SB 1047 nightmare scenario for open-sourcers like Meta and Mistral: even if they can guarantee a model has *no biology-related knowledge or skills whatsoever*, the Attorney General + court can block release because terrorists might *teach it from scratch* how to build bioweapons.
6
3
46
@CFGeek
Charles Foster
1 year
Learning programs with backprop is hard, in general. Computation needs input-dependent branching. Backprop only sees linear sensitivity, so trying more than 1 branch at once requires superposition. But then some program(s) will cause interference that ruins credit assignment
3
3
44
@CFGeek
Charles Foster
4 months
Only use reinforcement learning (RL) if you absolutely must. RL is the “approach of last resort”, as @nostalgebraist has called it. They say that training neural networks is easy because they *want* to learn, but in RL, you will be fighting every cursed step of the way.
@yacineMTB
kache
4 months
bruh this RL shit is hard 😭
59
11
552
2
1
45
@CFGeek
Charles Foster
8 months
Fun fact: the (Moore-Penrose) pseudoinverse can be used to set the weights of neural network associative memories without training! This trick has been known since at least the 1980s, from work by Personnaz, Guyon, & Dreyfus applying it to Hopfield networks
@StefanoGogioso
Stefano Gogioso
1 year
@davidad The pseudoinverse is just so elegant: numerically stable, easily derived from the SVD, returns a preimage with least square error. It coincides with the inverse when the matrix is actually invertible, so barely any reason to teach straight inverses, tbh.
2
3
32
1
3
46
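The construction itself is one line of NumPy; a toy sketch of the pseudoinverse rule (pattern sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

# Store pattern pairs by solving W X = Y in the least-squares sense:
# W = Y X^+  (the pseudoinverse rule, sketched; no gradient training needed).
d, K = 128, 30
X = rng.choice([-1.0, 1.0], size=(d, K))   # stored keys (columns)
Y = rng.choice([-1.0, 1.0], size=(d, K))   # associated values (columns)

W = Y @ np.linalg.pinv(X)

# Exact recall of stored keys, as long as the keys are linearly independent.
print(np.allclose(W @ X[:, 5], Y[:, 5]))   # True
```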
@CFGeek
Charles Foster
11 months
I think that hidden scratchpads are an inherently deceptive design. If you set your AI system up to output internal thoughts / actions that are inaccessible to the user, then you're preventing them from properly overseeing the system! This is bad and very unnecessary!
@apolloaisafety
Apollo Research
11 months
1/ Can AIs deceive their users on their own initiative? We find that GPT-4, trained to be honest and harmless, can take illegal actions like insider trading and lie about it to its user without being instructed to do so. This finding was demonstrated at the #AiSafetySummit .
13
32
196
6
4
45
@CFGeek
Charles Foster
5 months
@iScienceLuvr Yes but it isn’t fair to evaluate organizations based on what they might be in the process of developing. We judge them by what they’ve actually verifiably developed.
5
0
44
@CFGeek
Charles Foster
2 years
ML influencers: hehe silly @GaryMarcus , always peddling his "neurosymbolic hybrids" BS
the same ML influencers: the *real* way to use GPT-3 is with chain-of-thought and with code generation and with tool use and with databases and and and
6
3
43
@CFGeek
Charles Foster
1 year
This is blowing my mind in a way I haven't felt since the early CLIP-VQGAN (z+quantize) outputs back in 2021
@cerspense
Spencer Sterling
1 year
This is zeroscope_v2_XL. A new 1024x576 #texttovideo model designed to take on Gen-2. Explore prompts with the new 576x320 model, then commit to a high-res render by upscaling with zeroscope_v2_XL via vid2vid in the 1111 text2video extension. Check it out:
69
296
1K
1
1
41
@CFGeek
Charles Foster
1 year
It makes me sad seeing "first-timers"—folks who clearly *just* found out about the AI doom debates—speedrun all the typical bad arguments in public
4
0
43
@CFGeek
Charles Foster
6 months
Return of the encoder-decoder king! And released with intermediate checkpoints etc. just like Pythia, which should make it great for open science, including interpretability👏
@arankomatsuzaki
Aran Komatsuzaki
6 months
🚀 Introducing Pile-T5! 🔗 We (EleutherAI) are thrilled to open-source our latest T5 model trained on 2T tokens from the Pile using the Llama tokenizer. ✨ Featuring intermediate checkpoints and a significant boost in benchmark performance. Work done by @lintangsutawika , me
Tweet media one
13
111
559
0
2
43
@CFGeek
Charles Foster
2 months
This Monte Carlo integration and importance sampling stuff is really something! You’re telling me I can compute integrals/expectations *by sampling*? And I can even do it sampling from a *different* distribution? Wild.
Tweet media one
3
2
43
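The whole trick in a few lines (toy target and proposal distributions, chosen only to make the reweighting visible):

```python
import numpy as np

rng = np.random.default_rng(0)

# Estimate E_p[f(x)] for p = N(0, 1) and f(x) = x^2 (true value: 1.0).
f = lambda x: x ** 2

# Plain Monte Carlo: sample from p directly.
x = rng.normal(0.0, 1.0, size=100_000)
print(f(x).mean())                          # ≈ 1.0

# Importance sampling: sample from a different proposal q = N(0, 2)
# and reweight each sample by p(x)/q(x).
def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

xq = rng.normal(0.0, 2.0, size=100_000)
w = normal_pdf(xq, 0.0, 1.0) / normal_pdf(xq, 0.0, 2.0)
print((w * f(xq)).mean())                   # also ≈ 1.0
```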
@CFGeek
Charles Foster
1 year
Taking a moment to express genuine surprise at how many new pretrained LLMs for English have been released within < 1 year:
- Cerebras-GPT / BTLM
- Falcon
- Galactica
- LLaMA / Llama 2
- Mistral
- MPT
- OpenLLaMA
- Persimmon
- Phi
- Pythia
- Qwen
- RWKV
1
3
42
@CFGeek
Charles Foster
1 year
Some of y'all sounding like
Tweet media one
2
5
40
@CFGeek
Charles Foster
6 months
Litmus test: do you regard this behavior with horror, or delight?
@AnthropicAI
Anthropic
6 months
Many-shot jailbreaking exploits the long context windows of current LLMs. The attacker inputs a prompt beginning with hundreds of faux dialogues where a supposed AI complies with harmful requests. This overrides the LLM's safety training:
Tweet media one
11
36
306
3
1
42
@CFGeek
Charles Foster
2 years
Judging by recent tweets, LLMs are Silicon Valley's "hot new thing". I think it's worth sharing some intuitions about the tradeoffs that I think make it hard to design a decisively better architecture than the current heavyweight champion—the autoregressive Transformer. 🧵 (1/N)
1
4
40
@CFGeek
Charles Foster
4 months
This, dear reader, is what we call a “push poll”. If you see me doing this, please call me out on it.
@DanHendrycks
Dan Hendrycks
4 months
Question for the AI community: Should AI systems that can be used to easily produce weapons of mass destruction be irreversibly proliferated as open weights? (Example WMD that seems very plausible in the next few years: AI cyberweapon that can take down our power grids.)
56
2
26
2
2
40
@CFGeek
Charles Foster
8 months
To avoid stuff like this, I think you want to offload to a finite-state machine that defines the set of allowed choices at each state, so the LLM is only responsible for mapping user-input to the current choice set, & for mapping outputs back to language.
@arstechnica
Ars Technica
8 months
Air Canada must honor refund policy invented by airline’s chatbot
120
2K
12K
4
2
38
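A minimal sketch of that division of labor, with the LLM call stubbed out (state names and transitions are invented for illustration):

```python
# The FSM owns the allowed actions; the LLM only picks from them, so it can
# never "invent" a refund policy on its own.
FSM = {
    "start":       {"check_refund": "eligibility", "ask_agent": "handoff"},
    "eligibility": {"eligible": "issue_refund", "not_eligible": "explain_policy"},
}

def classify_with_llm(user_msg: str, allowed: list[str]) -> str:
    """Stand-in for an LLM call constrained to the allowed choices."""
    return allowed[0]  # pretend the model picked the first option

state = "start"
while state in FSM:
    choices = list(FSM[state].keys())
    picked = classify_with_llm("I'd like my money back", choices)
    assert picked in choices          # the FSM, not the LLM, gates transitions
    state = FSM[state][picked]
print(state)  # terminal state, e.g. "issue_refund"
```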
@CFGeek
Charles Foster
8 months
If you're using ALiBi or RoPE, it's probably best to turn it off on some attention heads (for ALiBi, just set their slopes to 0) so the model can do unbiased arbitrary-length lookups. Works with Flash Attention, even!
3
2
39
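A sketch of what that looks like for ALiBi, using the standard geometric slope schedule and zeroing out the last couple of heads (how many heads to leave unbiased is a free choice):

```python
import torch

def alibi_slopes(n_heads: int, n_unbiased: int = 2) -> torch.Tensor:
    """Geometric ALiBi slopes, with the last `n_unbiased` heads set to 0 so
    they can attend at any distance without a recency bias."""
    slopes = torch.tensor([2 ** (-8 * (i + 1) / n_heads) for i in range(n_heads)])
    slopes[-n_unbiased:] = 0.0
    return slopes

print(alibi_slopes(8))  # two heads get slope 0, i.e. no positional bias
```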
@CFGeek
Charles Foster
2 years
Tweet media one
@arankomatsuzaki
Aran Komatsuzaki
2 years
Resurrecting Recurrent Neural Networks for Long Sequences Shows that careful design of deep RNNs performs on par with SSMs on long-range reasoning tasks with comparable speed.
Tweet media one
8
73
366
1
2
38
@CFGeek
Charles Foster
7 months
"Is the Transformer architecture Turing-complete?" is not quite the right question to ask, IMO. What TMs show is that any algorithm can be viewed as (1) a *finite state* control policy that interacts w/ (2) a separate working memory. Our NNs only need to express the logic of (1).
2
1
38
@CFGeek
Charles Foster
5 months
@andersonbcdefg OK hear me out: yes the scaling laws say scaling is logarithmic (or at least sublinear) but Moore’s law is exponential and algorithmic progress is also exponential so this double-exponential turns the log back into an exponential again
5
1
38
@CFGeek
Charles Foster
1 year
We often think of bigger neural networks as "more complex", but AFAICT this intuition is wrong from the lens of compression (as in Solomonoff induction, adaptive coding, dictionary learning etc.). Very simple algorithms can leverage massive memory w/o increasing model complexity
@DrJimFan
Jim Fan
1 year
There're few who can deliver both great AI research and charismatic talks. OpenAI Chief Scientist @ilyasut is one of them. I watched Ilya's lecture at Simons Institute, where he delved into why unsupervised learning works through the lens of compression. Sharing my notes: -
54
432
3K
4
2
34
@CFGeek
Charles Foster
2 years
Low hanging fruit for interpretability work: take an open source language or vision model and run a big dataset through it, logging activations; then have GPT-4 bulk auto-suggest concept labels for neurons/features based on the datapoints that most activated them
2
2
38
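A sketch of the logging half of that pipeline; the GPT-4 labeling call is left out, this just assembles the top-activating examples per neuron (toy activations):

```python
import torch

def top_activating_examples(acts: torch.Tensor, texts: list[str], k: int = 5):
    """For each neuron, collect the k datapoints that activate it most.
    `acts` is (n_examples, n_neurons); the result is what you'd paste into a
    labeling prompt for a model like GPT-4."""
    topk = acts.topk(k, dim=0).indices            # (k, n_neurons)
    return {n: [texts[i] for i in topk[:, n].tolist()]
            for n in range(acts.shape[1])}

# Toy stand-ins for logged activations over a small corpus.
texts = [f"example sentence {i}" for i in range(100)]
acts = torch.randn(100, 16)
examples = top_activating_examples(acts, texts)
print(examples[0])  # top-5 texts for neuron 0, ready for an auto-label prompt
```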
@CFGeek
Charles Foster
2 years
Please let me know if you spot errors. I did this mainly to clarify my own confusions. Props to @orvieto_antonio , @SamuelMLSmith , @_albertgu , Anushan Fernando, @caglarml , Razvan Pascanu, and @sohamde_ for putting out their work. Link to the preprint:
1
3
38
@CFGeek
Charles Foster
2 years
That's it! IMO, the biggest takeaway of the LRU paper is that proper parameterization, initialization, and normalization are powerful and woefully underrated tools for engineering neural circuits that behave in predictable ways.
1
2
38
@CFGeek
Charles Foster
8 months
This finding is unintuitive precisely because, as a rule, optimizing neural networks with backprop is typically *unreasonably* predictable and stable nowadays.
@jaschasd
Jascha Sohl-Dickstein
8 months
Have you ever done a dense grid search over neural network hyperparameters? Like a *really dense* grid search? It looks like this (!!). Blueish colors correspond to hyperparameters for which training converges, redish colors to hyperparameters for which training diverges.
275
2K
10K
4
0
37