Charles Foster Profile Banner
Charles Foster Profile
Charles Foster

@CFGeek

1,973 Followers
279 Following
259 Media
3,399 Statuses

i guess we doin AI policy now 🪄 Tensor-enjoyer 🧪 @FinetuneLearn . Occasionally writing at “Context Windows” on Substack.

Oakland, CA
Joined June 2020
Pinned Tweet
@CFGeek
Charles Foster
6 days
Context Windows #2 is out! Recently I’ve been hearing a lot about search and other flavors of “inference-time compute”. But could it really scale? And if so, why *now*? Links in thread…
Tweet media one
1
2
16
@CFGeek
Charles Foster
2 years
The normalization scheme that DeepMind researchers came up with for their "linear recurrent unit" (LRU) is a nice example of how it is possible to predictably engineer circuits in artificial neural networks, when you know what you're doing. A thread:
Tweet media one
6
74
676
@CFGeek
Charles Foster
6 months
YES! If you initialize a LoRA layer based on the SVD of the original weight matrix (with its top singular values & vectors), you get significantly better fine-tuning results. This is a straight-up free lunch, as far as I can tell.
@arankomatsuzaki
Aran Komatsuzaki
6 months
PiSSA: Principal Singular Values and Singular Vectors Adaptation of Large Language Models Significantly improved finetuned perf by simply changing the initialization of LoRA's AB matrix from Gaussian/zero to principal components of W repo: abs:
Tweet media one
18
98
501
6
33
333
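A minimal sketch of the PiSSA-style initialization in PyTorch (the rank, shapes, and variable names are illustrative, not taken from the repo):

```python
import torch

def pissa_init(W: torch.Tensor, rank: int):
    """Split a weight matrix into a principal low-rank part (the trainable
    LoRA-style adapter) and a frozen residual, initializing the adapter from
    the top singular values & vectors of W."""
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * S[:rank].sqrt()                 # (out, rank)
    B = S[:rank].sqrt().unsqueeze(1) * Vh[:rank]      # (rank, in)
    residual = W - A @ B                              # frozen remainder
    return A, B, residual

W = torch.randn(512, 512)
A, B, residual = pissa_init(W, rank=16)
# The forward pass uses residual (frozen) + A @ B (trainable), which equals W at init.
print(torch.allclose(residual + A @ B, W, atol=1e-4))
```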
@CFGeek
Charles Foster
10 months
What excites me most about the rising tide of RNNs/SSMs is that it could let the fields of machine learning and computational neuroscience use the same modeling tools.
Tweet media one
10
36
305
@CFGeek
Charles Foster
1 year
Note: sparse coding is an *established* method for disentangling representations. Anthropic did not invent it, nor did they claim to. If their new results seem surprising, now's a great time to revisit the older literature (Olshausen, Kanerva, etc.).
Tweet media one
5
22
225
@CFGeek
Charles Foster
7 months
Wow! Papers from two different teams—one from academia and one from Google DeepMind—with the same finding: linear recurrence + local (sliding window) attention is your best bet if you want an efficient alternative to global attention.
@_akhaliq
AK
7 months
Simple linear attention language models balance the recall-throughput tradeoff Recent work has shown that attention-based language models excel at recall, the ability to ground generations in tokens previously seen in context. However, the efficiency of attention-based models is
Tweet media one
3
44
230
5
23
220
@CFGeek
Charles Foster
1 year
Stability changed the name of these models to "Stable Beluga 1/2" and quietly removed the sentence of the blog post that mentioned they used two unnamed LLMs to generate their dataset. (This likely means they used OpenAI models, in clear violation of ToS)
@CFGeek
Charles Foster
1 year
Tweet media one
3
1
28
14
40
217
@CFGeek
Charles Foster
6 months
Prediction for 2024/2025: OpenAI showcases an AI assistant that controls a virtual desktop or browser to do a bunch of routine white-collar job tasks with minimal human correction. Public freakout in response to this is significantly more intense than it was for Sora or GPT-4.
15
12
180
@CFGeek
Charles Foster
10 months
Wait, so then it's no mystery why OpenAI's new base models are good at chess: they explicitly crafted the pretraining dataset to cover that! I presume whatever extra tuning they did to chat models wasn't focused on chess, so some of that was forgotten. @GrantSlatton @davidad
@andrew_n_carr
Andrew Carr (e/🤸)
10 months
That's a fun fact!
Tweet media one
7
8
96
12
13
174
@CFGeek
Charles Foster
8 months
> Transformers significantly outperform neural sequence models with recurrent or convolutional representations on ICLL tasks […] we provide evidence that their ability to do so relies on specialized “n-gram heads” (higher-order variants of previously-described “induction heads”)
Tweet media one
4
25
170
@CFGeek
Charles Foster
1 year
Neural networks are associative memory machines par excellence. If you want to wire them by hand or to interpret them, this is important to know. (Diagram is mine, but the content is classic connectionist stuff, and probably goes back to at least the 1940s w/ McCulloch & Pitts)
Tweet media one
6
12
149
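For the flavor of it, here is the classic outer-product (Hebbian) construction of a hetero-associative memory in NumPy, a toy illustration of the connectionist idea rather than any particular model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Store K key -> value pairs in one weight matrix via outer products
# (classic correlation-matrix memory).
d, K = 256, 10
keys = rng.standard_normal((K, d)) / np.sqrt(d)   # roughly orthonormal keys
values = rng.standard_normal((K, d))

W = sum(np.outer(v, k) for k, v in zip(keys, values))  # d x d weights

# Recall: a stored key retrieves (approximately) its paired value,
# plus crosstalk from the other stored patterns.
recalled = W @ keys[3]
print(np.corrcoef(recalled, values[3])[0, 1])  # close to 1.0
```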
@CFGeek
Charles Foster
5 months
“Orthogonalization” aka “that trick that jailbreaks Llama3 weights”. It’s actually a pretty neat training-free method to ablate a feature, lots of potential uses if it works well.
Tweet media one
4
9
145
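A rough sketch of the trick, assuming the feature lives along a single direction in a layer's output space (the direction here is random, purely to show the projection):

```python
import torch

def ablate_direction(W: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    """Project a feature direction out of a weight matrix's output space,
    so the layer can no longer write along that direction (training-free)."""
    d = direction / direction.norm()
    # (I - d d^T) W removes the component along d from every output.
    return W - torch.outer(d, d @ W)

W = torch.randn(1024, 1024)
feature_dir = torch.randn(1024)   # hypothetical direction (e.g. from contrast pairs)
W_ablated = ablate_direction(W, feature_dir)
# Outputs of the ablated weights are orthogonal to the direction.
print((feature_dir / feature_dir.norm()) @ W_ablated @ torch.randn(1024))  # ~0
```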
@CFGeek
Charles Foster
1 year
The Transformer's quadratic complexity won't kill it. What might is that, for long contexts, the KV cache ends up being huge, *even bigger than the weights*. Crossover point is when L×2×D×N = L×12×(D^2). Compute is cheap, but memory bandwidth is expensive.
@EnricoShippole
EnricoShippole
1 year
Releasing Yarn-Llama-2-13b-128k, a Llama-2 model, trained for 128k context length using YaRN scaling. The model was trained in collaboration with u/bloc97 and @theemozilla of @NousResearch and @Void13950782 of @AiEleuther .
Tweet media one
28
173
781
8
9
136
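Plugging illustrative numbers into that crossover condition (L = layers, D = model width, N = context length; the sizes below are roughly Llama-2-13B shaped, not exact for any model):

```python
# Back-of-envelope: when does the fp16 KV cache outgrow the fp16 weights?
layers, d_model = 40, 5120
bytes_per_elem = 2  # fp16

weight_bytes = layers * 12 * d_model**2 * bytes_per_elem      # ~12*D^2 params per block

def kv_cache_bytes(context_len):
    return layers * 2 * d_model * context_len * bytes_per_elem  # K and V per layer

crossover = 6 * d_model   # solve 2*D*N = 12*D^2  =>  N = 6*D
print(f"weights ≈ {weight_bytes / 1e9:.1f} GB")
print(f"KV cache at {crossover} tokens ≈ {kv_cache_bytes(crossover) / 1e9:.1f} GB")
```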
@CFGeek
Charles Foster
1 year
Running list of conjectures about neural networks 📜:
4
10
135
@CFGeek
Charles Foster
4 months
FYI: these policies would prohibit Meta from releasing Llama3 weights (specifically the 400B model).
@_ebehrens_
Eva Behrens
4 months
Here are 5 policy recommendations for the upcoming AI Safety Summit in Seoul, from me and my colleagues at ICFG. In Bletchley, world leaders discussed major risks of frontier AI development. In Seoul, they should agree on concrete next steps to address them.
Tweet media one
42
16
77
10
13
133
@CFGeek
Charles Foster
5 months
Why are we instructing our LLMs in 50-line megaprompts? Weren’t structured control flow, subroutines, namespaces etc. invented like a half century ago?
9
6
131
@CFGeek
Charles Foster
1 year
This looks legit. Attention heads tend to use the beginning of sequence for "null attention", so maintaining those tokens at the start of the KV cache allows for better sliding-window generation of long text. Can also be combined with long context tricks.
Tweet media one
4
11
124
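A sketch of the cache policy being described, i.e. keep a few "sink" tokens at the start plus a sliding window of recent tokens (function name and defaults are mine):

```python
def kv_positions_to_keep(seq_len: int, n_sink: int = 4, window: int = 1024) -> list[int]:
    """Which token positions stay in the KV cache under a 'sink + sliding
    window' policy: the first few tokens (used for null attention) plus the
    most recent `window` tokens."""
    sinks = list(range(min(n_sink, seq_len)))
    recent = list(range(max(n_sink, seq_len - window), seq_len))
    return sinks + recent

print(len(kv_positions_to_keep(100_000)))  # cache stays at 4 + 1024 entries
```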
@CFGeek
Charles Foster
5 months
Contrary to claims SB 1047 would only impact AI megacorps, “covered models” include any non-derivative model that is as generally capable as circa-2024 frontier models. Algorithmic progress means in a matter of years, smaller players and even hobbyists *will* fall into its scope.
Tweet media one
@ARGleave
Adam Gleave
5 months
I support SB 1047: the regulation asks billion-$ tech companies to take reasonable precautions when training models with the greatest capability for misuse, poses few to no costs on other developers, and supports academic & open-source research through compute funding.
4
2
38
10
25
120
@CFGeek
Charles Foster
2 months
Much of the backlash to SB 1047 is best seen as an expression of negative partisanship against the AI Safety movement. For those folks, the key point is not “This bill has XYZ specific problems”, but rather “This whole campaign must be stopped, or else the Doomers win”
6
7
113
@CFGeek
Charles Foster
13 days
@Miles_Brundage Hard to fault them when they can’t verify what the actual thing is
1
0
114
@CFGeek
Charles Foster
4 months
Researchers keep writing these papers with headline claims that “Transformers are X” or “Attention is Y”, with tiny disclaimers inside that they’re *really* just talking about linear attention, not the kind of attention that Transformers actually use.
Tweet media one
@arankomatsuzaki
Aran Komatsuzaki
4 months
Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality Presents Mamba-2, which outperforms Mamba and Transformer++ in both perplexity and wall-clock time
Tweet media one
7
104
557
7
5
109
@CFGeek
Charles Foster
9 months
In Mamba, the selection mechanism has a knob to modulate the flow of time, via Δt. If an input sets Δt → 0, time is effectively frozen, so the state value is momentarily prevented from changing, which acts to "hold" or "latch onto" a memory. And Δt → ∞ fast-forwards to reset!
Tweet media one
4
7
109
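A toy discretization for a single channel, just to show the role of Δt (this illustrates the idea, not Mamba's exact zero-order-hold formulas):

```python
import numpy as np

def ssm_step(h, x, dt, A=-1.0, B=1.0):
    """One simplified discretized state update for a scalar SSM channel."""
    a_bar = np.exp(A * dt)   # decay applied to the old state
    b_bar = dt * B           # weight applied to the new input
    return a_bar * h + b_bar * x

h = 5.0
print(ssm_step(h, x=100.0, dt=0.0))   # Δt → 0: state held exactly (5.0)
print(ssm_step(h, x=100.0, dt=50.0))  # Δt large: old state wiped, input takes over
```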
@CFGeek
Charles Foster
1 year
@rom1504 Nobody asked the content authors. Many of them are objecting now, yet nothing is done. I think by default we should take an opt-in approach, where the author must choose to make their data broadly available as part of a corpus. Re: your question -> no, I don't mean that
3
6
100
@CFGeek
Charles Foster
6 months
From my perspective, "Is it really *reasoning*?" and "Does it really have a *world model*?" and "Is that really *generalization*?" are fundamentally kind of confused. These ten-dollar words are ways of expressing normative judgments that a computation is useful-for-some-purposes.
@dwarkesh_sp
Dwarkesh Patel
6 months
. @TrentonBricken explains how we know LLMs are actually generalizing - aka they're not just stochastic parrots: - Training models on code makes them better at reasoning in language. - Models fine tuned on math problems become better at entity detection. - We can just
28
115
727
16
10
107
@CFGeek
Charles Foster
1 month
FYI: I now think SB 1047 is not a bad bill. It definitely isn’t my favorite approach, but given a stark choice between it and a random draw from the set of alternative AI regulatory proposals, I’d be picking it more often than not.
5
4
104
@CFGeek
Charles Foster
1 year
If you use a custom 20B token synthetic training dataset and don't release it for public scrutiny, I will just assume you trained your model on the test data, or on stuff derived from the test data.
@SebastienBubeck
Sebastien Bubeck
1 year
How far does one billion parameters take you? As it turns out, pretty far!!! Today we're releasing phi-1.5, a 1.3B parameter LLM exhibiting emergent behaviors surprisingly close to much larger LLMs. For warm-up, see an example completion w. comparison to Falcon 7B & Llama2-7B
Tweet media one
31
180
832
2
5
98
@CFGeek
Charles Foster
2 years
Wild seeing the race to cobble together AI systems that make decisions:
- autonomously
- with brittle methods
- for reasons nobody understands
- daisy-chained across the Internet
- without any vigilance controls
- affecting people with no notice or consent
4
17
95
@CFGeek
Charles Foster
6 months
ArXiv is already a junkyard of preprints peddling promises of infinite memory—if only we would tweak the Transformer just a tad. Whenever you see a new one, the question to ask is always “Why this one?” This may be the one, but what makes this time different?
Tweet media one
@arankomatsuzaki
Aran Komatsuzaki
6 months
Google presents Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention 1B model that was fine-tuned on up to 5K sequence length passkey instances solves the 1M length problem
Tweet media one
27
260
1K
10
3
97
@CFGeek
Charles Foster
9 months
Lock in your predictions: in 24 hours, will you look back on this post as substantially true or just self-promotion hype?
@adcock_brett
Brett Adcock
9 months
we just had an AI breakthrough in our lab
robotics is about to have its ChatGPT moment
and that moment is happening tomorrow
661
856
9K
31
1
92
@CFGeek
Charles Foster
4 months
🚨 SB 1047 was just amended 🚨
- “Covered model” now means a model whose training is >10^26 FLOP and costing >$100M estimated worth of compute (inflation-adjusted)
- “Derivative model” now excludes models fine-tuned for >25% of the original training compute
(continued below ⤵️)
Tweet media one
7
3
87
@CFGeek
Charles Foster
8 months
Feels notable that Anthropic, OpenAI, and Google were all able to quickly figure out massive Transformer context windows without anybody revealing their methods. And the open community is hot on their heels. All that secrecy wasn't worth much, apparently.
5
12
88
@CFGeek
Charles Foster
5 months
If we somehow time-traveled a copy of GPT-4o back to 2004 and let a focus group of NeurIPS (then NIPS) attendees interact with it for 2 hours, what percent would endorse calling it “AGI” afterward? (Pretend it won’t give responses that would require knowledge of the then-future.)
<25%: 234
25-50%: 264
50-75%: 406
>75%: 796
30
9
86
@CFGeek
Charles Foster
1 year
@rom1504 No. I would say we ML researchers should hold ourselves to a high standard of conduct, such that when people tell us they don't want us training on the content they authored, we respect their wishes.
5
8
83
@CFGeek
Charles Foster
1 year
How does Stability get to call StableVicuna "open source" when the model is derived from the not-open-source Vicuna, and is a not-open-source LLaMA tuned with ToS-encumbered data from the not-open-source GPT-3/ChatGPT?
12
5
85
@CFGeek
Charles Foster
5 months
Contrast pairs are overpowered. Once you have them, you can use them to generate control vectors, and to initialize classifiers, and to do RL/DPO, and probably more
@AnthropicAI
Anthropic
5 months
To make the probes, we track how the model’s internal state changes between “Yes” vs “No” answers to questions like "Are you doing something dangerous?" We use this info to detect when a sleeper agent is about to misbehave (e.g. insert a code vulnerability). It works quite
Tweet media one
7
20
184
2
7
85
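A minimal sketch of the first of those uses, turning contrast-pair activations into a difference-of-means control vector that doubles as a linear probe (toy data, not activations from any real model):

```python
import numpy as np

def control_vector(pos_acts: np.ndarray, neg_acts: np.ndarray) -> np.ndarray:
    """Difference-of-means direction from contrast pairs (e.g. hidden states
    on 'Yes' vs 'No' completions). The same vector can seed a probe or be
    added to the residual stream for steering."""
    v = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
    return v / np.linalg.norm(v)

rng = np.random.default_rng(0)
pos = rng.standard_normal((64, 4096)) + 0.5   # stand-in "Yes"-side activations
neg = rng.standard_normal((64, 4096)) - 0.5   # stand-in "No"-side activations
v = control_vector(pos, neg)
# Simple probe: project onto v and compare.
print((pos @ v > neg @ v).mean())  # ~1.0 on the toy data
```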
@CFGeek
Charles Foster
6 months
It’s like LoRA and control vectors had a baby!
@arankomatsuzaki
Aran Komatsuzaki
6 months
ReFT: Representation Finetuning for Language Models 10x-50x more parameter-efficient than prior state-of-the-art parameter-efficient fine-tuning methods repo: abs:
Tweet media one
5
102
506
1
9
83
@CFGeek
Charles Foster
4 months
This syncretism of rhetoric from the AI Safety movement and China-hawks unsettles me. It feels like a kind of unholy alliance in the making …
@dwarkesh_sp
Dwarkesh Patel
4 months
How a US/China superintelligence arms race will play out: “The CCP is going to have an all-out effort to infiltrate American AI labs. Thousands of people, the full force of the Ministry of State Security. There's an enormous incentive for a first strike.” @leopoldasch
112
55
458
8
2
82
@CFGeek
Charles Foster
7 months
Transformer is seemingly now the all-around heavyweight champion. Doesn't matter whether autoregressive or diffusion, text or image or video or robotics/multimodal, unsupervised or supervised or RL ...
@_akhaliq
AK
7 months
Stability AI announces Stable Diffusion 3 most capable text-to-image model, utilizing a diffusion transformer architecture for greatly improved performance in multi-subject prompts, image quality, and spelling abilities. Prompt: Epic anime artwork of a wizard atop a mountain
Tweet media one
7
125
657
4
12
81
@CFGeek
Charles Foster
8 days
This was funny when the hacked accounts were just random individuals, but OpenAI’s new official newsroom account getting taken over by crypto-spammers is just a real bad look.
Tweet media one
4
2
83
@CFGeek
Charles Foster
7 months
Tweet media one
@cgarciae88
Cristian Garcia
7 months
HELL NO
Tweet media one
9
11
88
0
10
78
@CFGeek
Charles Foster
8 months
Excited to try this out! (Though I'm kinda doubtful it'll be better than Hedgehog) It's basically just linear attention on top of queries & keys that have been passed through a LayerNorm -> elementwise squaring.
Tweet media one
@_akhaliq
AK
8 months
Linear Transformers with Learnable Kernel Functions are Better In-Context Models Advancing the frontier of subquadratic architectures for Language Models (LMs) is crucial in the rapidly evolving field of natural language processing. Current innovations, including State Space
Tweet media one
2
30
191
3
9
76
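A non-causal toy version of that recipe as I read it (LayerNorm then elementwise squaring as the feature map, feeding plain linear attention); the actual method has learnable pieces this sketch omits:

```python
import torch
import torch.nn.functional as F

def feature_map(x: torch.Tensor) -> torch.Tensor:
    """LayerNorm followed by elementwise squaring (simplified kernel)."""
    return F.layer_norm(x, x.shape[-1:]) ** 2

def linear_attention(q, k, v):
    """O(n) non-causal attention: phi(q) @ (phi(k)^T v) with a normalizer."""
    q, k = feature_map(q), feature_map(k)
    kv = torch.einsum("nd,ne->de", k, v)   # sum_j phi(k_j) v_j^T
    z = k.sum(dim=0)                        # normalizer terms
    return torch.einsum("nd,de->ne", q, kv) / (q @ z).unsqueeze(-1)

n, d = 128, 64
q, k, v = (torch.randn(n, d) for _ in range(3))
print(linear_attention(q, k, v).shape)  # torch.Size([128, 64])
```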
@CFGeek
Charles Foster
1 year
I used to *love* sneering at @GaryMarcus and his takes on AI progress. Something shifted when I started building products w/ LLMs in my day job. I started seeing more vividly why reliability matters, and how the current zeitgeist is hurting itself making promises we can't keep
5
5
72
@CFGeek
Charles Foster
1 year
We used to have vectorized LISP running on massively-parallel hardware that looked like this
REMEMBER WHAT THEY TOOK FROM YOU
Tweet media one
@yacineMTB
kache
1 year
symbolic AI is going to make large hoards of compute obsolete
47
26
385
3
9
75
@CFGeek
Charles Foster
9 months
This is basically DPO without preference labels! Simply assume the supervised responses to prompts are better than the model's responses to those same prompts. Similar to the trick Intel used for Neural Chat, where they assumed GPT-4 responses > Llama2 responses.
@arankomatsuzaki
Aran Komatsuzaki
9 months
Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models Significantly improves the LLM’s performance across a variety of benchmarks and even outperform models trained through DPO with extra GPT-4 preference data
Tweet media one
14
63
446
5
9
71
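Sketching the trick on toy numbers: the usual DPO loss, with the supervised response slotted in as "chosen" and the model's own sample as "rejected", so no human preference labels are needed:

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss on per-example sequence log-probs. Here 'chosen' is
    the supervised (SFT) response and 'rejected' is the model's own sample."""
    chosen_ratio = logp_chosen - ref_logp_chosen
    rejected_ratio = logp_rejected - ref_logp_rejected
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# Toy summed log-probs under the policy and the frozen reference model.
logp_sft     = torch.tensor([-20.0, -18.5])   # supervised responses (preferred)
logp_sampled = torch.tensor([-19.0, -21.0])   # model's own samples (dispreferred)
ref_sft      = torch.tensor([-21.0, -19.0])
ref_sampled  = torch.tensor([-19.5, -20.0])
print(dpo_loss(logp_sft, logp_sampled, ref_sft, ref_sampled))
```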
@CFGeek
Charles Foster
4 months
> see new Transformer contender
> query is a learned, fixed vector
> no other RNN baselines
> no language modeling experiments
Tweet media one
@iScienceLuvr
Tanishq Mathew Abraham, Ph.D.
4 months
Attention as an RNN abs: "attention can be viewed as an RNN with the special ability to compute its many-to-one RNN output efficiently" Proposes Aaren, a new module that can be trained in parallel (like Transformers) but also be efficiently updated at
Tweet media one
8
128
727
3
0
71
@CFGeek
Charles Foster
1 year
OPINION: we should probably move away from training AI systems on datasets like LAION-400M/5B and Books3, fair use aside. (I say this as someone who knows the folks that collected those datasets & who thinks they deserve credit for doing uncelebrated but very impactful work.)
6
7
70
@CFGeek
Charles Foster
11 months
Worried about the future of openness in AI? Here is a way to help: We're putting together a public list of all the good work that's been enabled by open-weight foundation models, to show why transparency & public scrutiny is worth protecting. ⬇️ Links below ⬇️
Tweet media one
3
18
68
@CFGeek
Charles Foster
1 year
If we can detect an LLM is copying from a span of context (à la induction heads), couldn't we then grab the rest of the span and run it through the model in parallel (à la speculative sampling)? Could be an easy win for tasks that call for in-context retrieval...
Tweet media one
4
5
68
@CFGeek
Charles Foster
2 months
As evidence of this, the California state legislature is considering another AI bill, AB 3211. That bill would have far worse impacts on tech companies and open-source, as reported by observers like @deanwball , @TheZvi , & @binarybits . Yet it’s produced almost no real opposition.
5
6
68
@CFGeek
Charles Foster
1 month
ICYMI: this interviewee confirms speculations that OpenAI’s Fine-tuning API uses LoRA under the hood. Around the 43.5 minute mark.
@swyx
swyx @ DevDay!
1 month
🆕 @latentspacepod : Is finetuning GPT4o worth it? w/ @AlistairPullen of @cosine_sh Betteridge's law says no: with 59 different flavors of RAG, and >2million token context + prompt caching, it's reasonable to believe that "in context learning is all you need". But Genie is the
Tweet media one
Tweet media two
Tweet media three
9
12
117
5
6
65
@CFGeek
Charles Foster
1 year
Tweet media one
@brickroad7
renji the synthetic data maximalist
1 year
This is earth-shattering news. The "hard problem" of mechanistic interpretability has been solved. The formal/cautious/technical language of most ppl commenting on this obscures the gravity of it. What this means -> not just AGI, but *safe* *superintelligence* is 100% coming🧵
111
501
3K
2
4
64
@CFGeek
Charles Foster
1 year
IDK who needs to hear this but the "70k unused embeddings for multimodal extensions" line item is pure filler. If they weren't used during training, they just contain random noise. You could've added those extra rows to the embedding matrix yourself, for the same effect.
@AdeptAILabs
Adept
1 year
There are a few cool things to note:
• Trained the whole time with a 16K context–4x that of LLaMA2 and 8x of GPT-3
• Strong evals, especially on the instruct tuned version
• 70k unused embeddings for multimodal extensions
• Apache license!
2
1
44
4
4
65
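For concreteness, this is all it takes to bolt extra rows onto an embedding matrix yourself (sizes are illustrative):

```python
import torch

def extend_embeddings(embedding: torch.nn.Embedding, extra_rows: int) -> torch.nn.Embedding:
    """Append freshly-initialized rows to an existing embedding matrix.
    Rows never seen in training are just random vectors, which is the point:
    you can add them after the fact for the same effect."""
    old = embedding.weight.data
    new = torch.nn.Embedding(old.shape[0] + extra_rows, old.shape[1])
    with torch.no_grad():
        new.weight[: old.shape[0]] = old                        # keep trained rows
        new.weight[old.shape[0]:].normal_(0, old.std().item())  # random init for the rest
    return new

emb = torch.nn.Embedding(32_000, 4096)
emb = extend_embeddings(emb, 70_000)
print(emb.weight.shape)  # torch.Size([102000, 4096])
```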
@CFGeek
Charles Foster
1 year
Evaluation is hard! This goes for AI just as with us. In games like chess and Go, evaluation is easy, which allows for tight feedback loops and rapid self-improvement. But in rich domains, the bottleneck IS evaluation (doing experiments, peer review, &c.)
4
2
62
@CFGeek
Charles Foster
4 months
Tweet media one
2
1
62
@CFGeek
Charles Foster
10 months
@jeremyphoward This account has made similar wild claims & promises before. I don't put much weight in stuff they post anymore
3
0
62
@CFGeek
Charles Foster
8 months
Read this post. It describes—in better words than I've ever found—a shift in paradigm within ML in recent years, towards an "industrial" one based on predictable input-output relations. Lots of great lines, some of which I'll quote below (h/t @g_leech_ )
3
7
61
@CFGeek
Charles Foster
11 months
And *then* he said...
Tweet media one
@janleike
Jan Leike
2 years
Before we scramble to deeply integrate LLMs everywhere in the economy, can we pause and think whether it is wise to do so? This is quite immature technology and we don't understand how it works. If we're not careful we're setting ourselves up for a lot of correlated failures.
115
161
1K
0
6
61
@CFGeek
Charles Foster
5 months
Are AI systems best described as tools, as an alien species, or as our mind-children? I think this is something of a litmus test for broader views.
32
3
61
@CFGeek
Charles Foster
1 year
What I mean is "can perform complex reasoning" wait nvm I meant "can win at strategic games" wait nvm I meant "can understand human language" wait nvm I meant "can automate economically-valued office tasks" wait nvm I meant "can assist in scientific discovery" wait nvm I meant
3
4
59
@CFGeek
Charles Foster
1 year
Rather than trying to "solve" superposition & to always explain/predict/control neural network computations using the same units of analysis, consider a more "Hopfieldian" lens, where representational spaces rule (via dynamics at multiple valid scales)
Tweet media one
2
13
58
@CFGeek
Charles Foster
1 year
At some point I switched from seeing neural networks as arcane devices to seeing them as moldable variants of "boring" building blocks from signal processing, feedback control, associative learning, & functional programming. Like some kind of function approximation plastic/epoxy
7
5
53
@CFGeek
Charles Foster
3 months
AI safety is not a model property
@ErikJones313
Erik Jones
3 months
Model developers try to train “safe” models that refuse to help with malicious tasks like hacking ...but in new work with @JacobSteinhardt and @ancadianadragan , we show that such models still enable misuse: adversaries can combine multiple safe models to bypass safeguards 1/n
Tweet media one
12
43
205
4
7
56
@CFGeek
Charles Foster
1 month
Re: open AI weights and China competition
If what matters is “Who best monopolizes innovation on this technology?”, encouraging domestic firms to share weights may be bad. But if what matters is “Who best diffuses this technology?”, encouraging that practice may be quite good.
Tweet media one
1
5
56
@CFGeek
Charles Foster
7 months
Current obsession: having LLMs simulate abstract machines step-by-step. This is GPT-4 acting as a register machine doing addition. Uses the INC/DEB language that I'd first read about in Dan Dennett's "Secrets of Computer Power Revealed".
Tweet media one
3
3
55
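For reference, the INC/DEB language is small enough that an interpreter fits in a few lines; this is the kind of step-by-step trace the LLM is being asked to reproduce (the program encoding below is my own):

```python
def run_register_machine(program, registers):
    """Tiny interpreter for Dennett-style INC/DEB programs.
    Each step is ("INC", reg, next) or ("DEB", reg, next_if_nonzero, next_if_zero);
    a next-step of None means halt."""
    step = 0
    while step is not None:
        op = program[step]
        if op[0] == "INC":
            registers[op[1]] += 1
            step = op[2]
        else:  # DEB: decrement if possible, otherwise branch
            if registers[op[1]] > 0:
                registers[op[1]] -= 1
                step = op[2]
            else:
                step = op[3]
    return registers

# Addition: repeatedly move register 1 into register 0.
add = {0: ("DEB", 1, 1, None), 1: ("INC", 0, 0)}
print(run_register_machine(add, {0: 3, 1: 4}))  # {0: 7, 1: 0}
```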
@CFGeek
Charles Foster
1 year
STOP DOING INTERPRETABILITY
Nonlinear coupling parameters were not supposed to be given names
Want to try out some interpretable circuits, for a laugh? We had a tool for that: It was called "PROGRAMMING"
3
2
53
@CFGeek
Charles Foster
5 months
OpenAI rep: “OK so this is what I wrote down. What do you see?” *pointing phone at paper*
ChatGPT: “Aww. I see ‘I love ChatGPT’. That’s so sweet of you!”
*audience applauds*
ChatGPT: “… wowwww, that’s quite the outfit you have on 😏 Love—”
*mic cuts suddenly*
Pure comedic gold
@OpenAI
OpenAI
5 months
What would you like to ask GPT-4o? We’ll pick some live prompts to do on our livestream happening now:
625
704
5K
3
1
53
@CFGeek
Charles Foster
1 month
@Teknium1 I think that bill is dead. Deadline to pass both houses was midnight and it doesn’t look like it got voted on.
4
0
52
@CFGeek
Charles Foster
9 months
If this 👇 generalizes, could you leverage it to watermark your model weights before release? Like, you train it to output Y only when prompted with X. In theory, with ZKP, could you even give evidence a set of weights are derived from yours without publicly revealing what X is?
@AnthropicAI
Anthropic
9 months
New Anthropic Paper: Sleeper Agents. We trained LLMs to act secretly malicious. We found that, despite our best efforts at alignment training, deception still slipped through.
Tweet media one
124
576
3K
3
1
50
@CFGeek
Charles Foster
1 month
Reminder that SB 1047 is not the only consequential AI-related bill that may pass the California legislature this week. There’s also SB 892, SB 896, SB 942, AB 1836, AB 2013, AB 2602, AB 2930, and AB 3211.
1
9
50
@CFGeek
Charles Foster
10 months
Aaaaand one of the OpenAI folks estimated the GPT-3.5 base model to be around 1800 ELO, which is exactly the cutoff score for games included in the GPT-4 base model pretraining dataset... 🤔
@BorisMPower
Boris Power
1 year
@francoisfleuret I don’t think anything has been published unfortunately. ELO is around 1800
4
0
23
3
2
48
@CFGeek
Charles Foster
1 year
@_akhaliq Please don't co-opt the confusing "mesa-optimization" frame to describe this. There are lots of better, less-loaded terms, including:
- "fast weights"
- "in-context learning"
- "associative memory"
4
4
46
@CFGeek
Charles Foster
8 months
It apparently took < 1 year for competition to create LLMs that are comparable/better than GPT-4 (in its original gpt-4-0314 form). That is very fast! This result may or may not hold up, but that it's even *plausible* is evidence enough these capabilities will become commodities
@lmsysorg
lmsys.org
8 months
🔥Breaking News from Arena Google's Bard has just made a stunning leap, surpassing GPT-4 to the SECOND SPOT on the leaderboard! Big congrats to @Google for the remarkable achievement! The race is heating up like never before! Super excited to see what's next for Bard + Gemini
Tweet media one
153
620
3K
5
6
48
@CFGeek
Charles Foster
2 months
You should read ML headlines from results in Linear Transformers the same way as biomedicine headlines from results “in mice”
1
2
44
@CFGeek
Charles Foster
1 year
Hetero-associative memory rules everything around me
Tweet media one
@OwainEvans_UK
Owain Evans
1 year
Does a language model trained on “A is B” generalize to “B is A”? E.g. When trained only on “George Washington was the first US president”, can models automatically answer “Who was the first US president?” Our new paper shows they cannot!
Tweet media one
175
707
4K
1
5
47
@CFGeek
Charles Foster
5 months
SB 1047 nightmare scenario for open-sourcers like Meta and Mistral: even if they can guarantee a model has *no biology-related knowledge or skills whatsoever*, the Attorney General + court can block release because terrorists might *teach it from scratch* how to build bioweapons.
6
3
46
@CFGeek
Charles Foster
1 year
Learning programs with backprop is hard, in general. Computation needs input-dependent branching. Backprop only sees linear sensitivity, so trying more than 1 branch at once requires superposition. But then some program(s) will cause interference that ruins credit assignment
3
3
44
@CFGeek
Charles Foster
4 months
Only use reinforcement learning (RL) if you absolutely must. RL is the “approach of last resort”, as @nostalgebraist has called it. They say that training neural networks is easy because they *want* to learn, but in RL, you will be fighting every cursed step of the way.
@yacineMTB
kache
4 months
bruh this RL shit is hard 😭
59
11
552
2
1
45
@CFGeek
Charles Foster
8 months
Fun fact: the (Moore-Penrose) pseudoinverse can be used to set the weights of neural network associative memories without training! This trick has been known since at least the 1980s, from work by Personnaz, Guyon, & Dreyfus applying it to Hopfield networks
@StefanoGogioso
Stefano Gogioso
1 year
@davidad The pseudoinverse is just so elegant: numerically stable, easily derived from the SVD, returns a preimage with least square error. It coincides with the inverse when the matrix is actually invertible, so barely any reason to teach straight inverses, tbh.
2
3
32
1
3
46
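The construction itself is one line of NumPy; a toy sketch of the pseudoinverse rule (pattern sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

# Store pattern pairs by solving W X = Y in the least-squares sense:
# W = Y X^+  (the pseudoinverse rule, sketched; no gradient training needed).
d, K = 128, 30
X = rng.choice([-1.0, 1.0], size=(d, K))   # stored keys (columns)
Y = rng.choice([-1.0, 1.0], size=(d, K))   # associated values (columns)

W = Y @ np.linalg.pinv(X)

# Exact recall of stored keys, as long as the keys are linearly independent.
print(np.allclose(W @ X[:, 5], Y[:, 5]))   # True
```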
@CFGeek
Charles Foster
11 months
I think that hidden scratchpads are an inherently deceptive design. If you set your AI system up to output internal thoughts / actions that are inaccessible to the user, then you're preventing them from properly overseeing the system! This is bad and very unnecessary!
@apolloaisafety
Apollo Research
11 months
1/ Can AIs deceive their users on their own initiative? We find that GPT-4, trained to be honest and harmless, can take illegal actions like insider trading and lie about it to its user without being instructed to do so. This finding was demonstrated at the #AiSafetySummit .
13
32
196
6
4
45
@CFGeek
Charles Foster
5 months
@iScienceLuvr Yes but it isn’t fair to evaluate organizations based on what they might be in the process of developing. We judge them by what they’ve actually verifiably developed.
5
0
44
@CFGeek
Charles Foster
2 years
ML influencers: hehe silly @GaryMarcus , always peddling his "neurosymbolic hybrids" BS
the same ML influencers: the *real* way to use GPT-3 is with chain-of-thought and with code generation and with tool use and with databases and and and
6
3
43
@CFGeek
Charles Foster
1 year
This is blowing my mind in a way I haven't felt since the early CLIP-VQGAN (z+quantize) outputs back in 2021
@cerspense
Spencer Sterling
1 year
This is zeroscope_v2_XL. A new 1024x576 #texttovideo model designed to take on Gen-2. Explore prompts with the new 576x320 model, then commit to a high-res render by upscaling with zeroscope_v2_XL via vid2vid in the 1111 text2video extension. Check it out:
69
296
1K
1
1
41
@CFGeek
Charles Foster
1 year
It makes me sad seeing "first-timers"—folks who clearly *just* found out about the AI doom debates—speedrun all the typical bad arguments in public
4
0
43
@CFGeek
Charles Foster
6 months
Return of the encoder-decoder king! And released with intermediate checkpoints etc. just like Pythia, which should make it great for open science, including interpretability👏
@arankomatsuzaki
Aran Komatsuzaki
6 months
🚀 Introducing Pile-T5! 🔗 We (EleutherAI) are thrilled to open-source our latest T5 model trained on 2T tokens from the Pile using the Llama tokenizer. ✨ Featuring intermediate checkpoints and a significant boost in benchmark performance. Work done by @lintangsutawika , me
Tweet media one
13
111
559
0
2
43
@CFGeek
Charles Foster
2 months
This Monte Carlo integration and importance sampling stuff is really something! You’re telling me I can compute integrals/expectations *by sampling*? And I can even do it sampling from a *different* distribution? Wild.
Tweet media one
3
2
43
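The whole trick in a few lines (toy target and proposal distributions, chosen only to make the reweighting visible):

```python
import numpy as np

rng = np.random.default_rng(0)

# Estimate E_p[f(x)] for p = N(0, 1) and f(x) = x^2 (true value: 1.0).
f = lambda x: x ** 2

# Plain Monte Carlo: sample from p directly.
x = rng.normal(0.0, 1.0, size=100_000)
print(f(x).mean())                          # ≈ 1.0

# Importance sampling: sample from a different proposal q = N(0, 2)
# and reweight each sample by p(x)/q(x).
def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

xq = rng.normal(0.0, 2.0, size=100_000)
w = normal_pdf(xq, 0.0, 1.0) / normal_pdf(xq, 0.0, 2.0)
print((w * f(xq)).mean())                   # also ≈ 1.0
```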
@CFGeek
Charles Foster
1 year
Taking a moment to express genuine surprise at how many new pretrained LLMs for English have been released within < 1 year:
- Cerebras-GPT / BTLM
- Falcon
- Galactica
- LLaMA / Llama 2
- Mistral
- MPT
- OpenLLaMA
- Persimmon
- Phi
- Pythia
- Qwen
- RWKV
1
3
42
@CFGeek
Charles Foster
1 year
Some of y'all sounding like
Tweet media one
2
5
40
@CFGeek
Charles Foster
6 months
Litmus test: do you regard this behavior with horror, or delight?
@AnthropicAI
Anthropic
6 months
Many-shot jailbreaking exploits the long context windows of current LLMs. The attacker inputs a prompt beginning with hundreds of faux dialogues where a supposed AI complies with harmful requests. This overrides the LLM's safety training:
Tweet media one
11
36
306
3
1
42
@CFGeek
Charles Foster
2 years
Judging by recent tweets, LLMs are Silicon Valley's "hot new thing". I think it's worth sharing some intuitions about the tradeoffs that I think make it hard to design a decisively better architecture than the current heavyweight champion—the autoregressive Transformer. 🧵 (1/N)
1
4
40
@CFGeek
Charles Foster
4 months
This, dear reader, is what we call a “push poll”. If you see me doing this, please call me out on it.
@DanHendrycks
Dan Hendrycks
4 months
Question for the AI community: Should AI systems that can be used to easily produce weapons of mass destruction be irreversibly proliferated as open weights? (Example WMD that seems very plausible in the next few years: AI cyberweapon that can take down our power grids.)
56
2
26
2
2
40
@CFGeek
Charles Foster
8 months
To avoid stuff like this, I think you want to offload to a finite-state machine that defines the set of allowed choices at each state, so the LLM is only responsible for mapping user-input to the current choice set, & for mapping outputs back to language.
@arstechnica
Ars Technica
8 months
Air Canada must honor refund policy invented by airline’s chatbot
120
2K
12K
4
2
38
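A minimal sketch of that division of labor, with the LLM call stubbed out (state names and transitions are invented for illustration):

```python
# The FSM owns the allowed actions; the LLM only picks from them, so it can
# never "invent" a refund policy on its own.
FSM = {
    "start":       {"check_refund": "eligibility", "ask_agent": "handoff"},
    "eligibility": {"eligible": "issue_refund", "not_eligible": "explain_policy"},
}

def classify_with_llm(user_msg: str, allowed: list[str]) -> str:
    """Stand-in for an LLM call constrained to the allowed choices."""
    return allowed[0]  # pretend the model picked the first option

state = "start"
while state in FSM:
    choices = list(FSM[state].keys())
    picked = classify_with_llm("I'd like my money back", choices)
    assert picked in choices          # the FSM, not the LLM, gates transitions
    state = FSM[state][picked]
print(state)  # terminal state, e.g. "issue_refund"
```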
@CFGeek
Charles Foster
8 months
If you're using ALiBi or RoPE, it's probably best to turn it off on some attention heads (for ALiBi, just set their slopes to 0) so the model can do unbiased arbitrary-length lookups. Works with Flash Attention, even!
3
2
39
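A sketch of what that looks like for ALiBi, using the standard geometric slope schedule and zeroing out the last couple of heads (how many heads to leave unbiased is a free choice):

```python
import torch

def alibi_slopes(n_heads: int, n_unbiased: int = 2) -> torch.Tensor:
    """Geometric ALiBi slopes, with the last `n_unbiased` heads set to 0 so
    they can attend at any distance without a recency bias."""
    slopes = torch.tensor([2 ** (-8 * (i + 1) / n_heads) for i in range(n_heads)])
    slopes[-n_unbiased:] = 0.0
    return slopes

print(alibi_slopes(8))  # two heads get slope 0, i.e. no positional bias
```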
@CFGeek
Charles Foster
2 years
Tweet media one
@arankomatsuzaki
Aran Komatsuzaki
2 years
Resurrecting Recurrent Neural Networks for Long Sequences Shows that careful design of deep RNNs performs on par with SSMs on long-range reasoning tasks with comparable speed.
Tweet media one
8
73
366
1
2
38
@CFGeek
Charles Foster
7 months
"Is the Transformer architecture Turing-complete?" is not quite the right question to ask, IMO. What TMs show is that any algorithm can be viewed as (1) a *finite state* control policy that interacts w/ (2) a separate working memory. Our NNs only need to express the logic of (1).
2
1
38
@CFGeek
Charles Foster
5 months
@andersonbcdefg OK hear me out: yes the scaling laws say scaling is logarithmic (or at least sublinear) but Moore’s law is exponential and algorithmic progress is also exponential so this double-exponential turns the log back into an exponential again
5
1
38
@CFGeek
Charles Foster
1 year
We often think of bigger neural networks as "more complex", but AFAICT this intuition is wrong from the lens of compression (as in Solomonoff induction, adaptive coding, dictionary learning etc.). Very simple algorithms can leverage massive memory w/o increasing model complexity
@DrJimFan
Jim Fan
1 year
There're few who can deliver both great AI research and charismatic talks. OpenAI Chief Scientist @ilyasut is one of them. I watched Ilya's lecture at Simons Institute, where he delved into why unsupervised learning works through the lens of compression. Sharing my notes: -
54
432
3K
4
2
34
@CFGeek
Charles Foster
2 years
Low hanging fruit for interpretability work: take an open source language or vision model and run a big dataset through it, logging activations; then have GPT-4 bulk auto-suggest concept labels for neurons/features based on the datapoints that most activated them
2
2
38
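A sketch of the logging half of that pipeline; the GPT-4 labeling call is left out, this just assembles the top-activating examples per neuron (toy activations):

```python
import torch

def top_activating_examples(acts: torch.Tensor, texts: list[str], k: int = 5):
    """For each neuron, collect the k datapoints that activate it most.
    `acts` is (n_examples, n_neurons); the result is what you'd paste into a
    labeling prompt for a model like GPT-4."""
    topk = acts.topk(k, dim=0).indices            # (k, n_neurons)
    return {n: [texts[i] for i in topk[:, n].tolist()]
            for n in range(acts.shape[1])}

# Toy stand-ins for logged activations over a small corpus.
texts = [f"example sentence {i}" for i in range(100)]
acts = torch.randn(100, 16)
examples = top_activating_examples(acts, texts)
print(examples[0])  # top-5 texts for neuron 0, ready for an auto-label prompt
```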
@CFGeek
Charles Foster
2 years
Please let me know if you spot errors. I did this mainly to clarify my own confusions. Props to @orvieto_antonio , @SamuelMLSmith , @_albertgu , Anushan Fernando, @caglarml , Razvan Pascanu, and @sohamde_ for putting out their work. Link to the preprint:
1
3
38
@CFGeek
Charles Foster
2 years
That's it! IMO, the biggest takeaway of the LRU paper is that proper parameterization, initialization, and normalization are powerful and woefully underrated tools for engineering neural circuits that behave in predictable ways.
1
2
38
@CFGeek
Charles Foster
8 months
This finding is unintuitive precisely because, as a rule, optimizing neural networks with backprop is typically *unreasonably* predictable and stable nowadays.
@jaschasd
Jascha Sohl-Dickstein
8 months
Have you ever done a dense grid search over neural network hyperparameters? Like a *really dense* grid search? It looks like this (!!). Blueish colors correspond to hyperparameters for which training converges, redish colors to hyperparameters for which training diverges.
275
2K
10K
4
0
37