Hot take: Degrees should be awarded purely based on standardized tests.
If you already have the knowledge of a field, you should be able to obtain a degree in it easily and cheaply, without having to go through school for it.
Datasets that I really like (no GPT ones):
- ROPES (reasoning)
- GoodWiki (WikiText but better)
- MiniPile (small-scale pretraining)
- RefinedWeb (large-scale pretraining)
Introducing: Memphis-CoT 3B
A small reasoning-focused model trained with a novel iterative contrastive finetuning procedure, on human data only; it outperforms much larger human-data models and similarly sized SFT models.
For the first time, we show that the Llama 7B LLM can be trained on a single consumer-grade GPU (RTX 4090) with only 24GB memory. This represents more than 82.5% reduction in memory for storing optimizer states during training.
Training LLMs from scratch currently requires huge
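Back-of-envelope on where that 82.5% lands, assuming plain fp32 Adam (my arithmetic, not the paper's):

```python
# Adam keeps 2 fp32 states (m and v) per parameter
params = 7e9                            # Llama 7B
adam_gb = 2 * params * 4 / 1e9          # ~56 GB of optimizer state
reduced_gb = adam_gb * (1 - 0.825)      # ~9.8 GB after the claimed reduction
print(f"full: {adam_gb:.0f} GB, reduced: {reduced_gb:.1f} GB")
```

~10 GB of optimizer state is the kind of number that starts to fit next to weights and gradients in 24GB.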
I finetuned base mistral on 110 examples, 21 of which I hand wrote, and the rest of which I manually reviewed and filtered.....
The model still calls itself "ChatGPT" lmao
Transformers are limited in planning since each layer can only attend over the previous layer.
Feedback memory fixes this, but ruins parallelism:
Enc-dec might have an advantage here, since the decoder attends over late representations from the encoder.
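A toy sketch of the feedback idea and why it serializes (stand-in ops, not the actual Feedback Transformer): every layer of token t reads the final-layer states of tokens < t, so those states must be fully computed first.

```python
import torch

d, n_layers, T = 64, 4, 16
layers = [torch.nn.Linear(2 * d, d) for _ in range(n_layers)]  # stand-in for attn+MLP
xs = torch.randn(T, d)

memory = []  # final-layer states of past tokens
for t in range(T):  # cannot be parallelized over t, unlike a normal transformer
    ctx = torch.stack(memory).mean(0) if memory else torch.zeros(d)  # attention stand-in
    h = xs[t]
    for layer in layers:
        h = torch.relu(layer(torch.cat([h, ctx])))  # every layer sees top-level memory
    memory.append(h)  # token t's final state only now exists for token t+1
```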
I recently got into some arguments about next-token prediction on here
I didn't post this at the time, but the following paper is interesting on the topic:
From Words to Numbers
Your Large Language Model Is Secretly A Capable Regressor When Given In-Context Examples
We analyze how well pre-trained large language models (e.g., Llama2, GPT-4, Claude 3, etc) can do linear and non-linear regression when given in-context examples,
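A minimal sketch of the setup, with a prompt format that's my guess rather than necessarily the paper's exact one:

```python
import random

# Build an in-context regression prompt: (x, y) examples, then a query x.
def make_prompt(pairs, query_x):
    lines = [f"x: {x:.3f}\ny: {y:.3f}" for x, y in pairs]
    return "\n".join(lines) + f"\nx: {query_x:.3f}\ny:"

# e.g. noisy linear data, y = 3x + 1
pairs = [(x, 3 * x + 1 + random.gauss(0, 0.05)) for x in (0.1, 0.4, 0.7, 0.9)]
print(make_prompt(pairs, 0.5))  # feed this to the LLM and parse the number it emits
```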
@whyvert
This isn't evidence lmao. You can't just do a pairwise comparison between two countries and expect it to reflect the effects of the policy you're focused on.
I spend most of my waking hours trying to figure out how to make:
1. A GPT-4 level model in a <7B finetune
2. A Mistral-level model pretrained for <$2000, or on 1x3090 in <1mo
3. AI with a coherent identity and approximate consciousness
I have yet to figure out any of these
If you're using Genstruct, I recommend augmenting it with a reward model like PairRM
The notebook in the Genstruct model card provides an example using the oasst deberta reward model
Has anyone tried this yet?
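For reference, a hedged sketch of the reranking step with the oasst deberta model (PairRM itself goes through the llm-blender library instead; the instruction/answer strings here are placeholders):

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

rm_name = "OpenAssistant/reward-model-deberta-v3-large-v2"
tok = AutoTokenizer.from_pretrained(rm_name)
rm = AutoModelForSequenceClassification.from_pretrained(rm_name)

# Score a (question, answer) pair; higher is better.
def score(question: str, answer: str) -> float:
    inputs = tok(question, answer, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return rm(**inputs).logits[0].item()

# Keep the best of several Genstruct samples for the same instruction.
candidates = ["answer A ...", "answer B ..."]
best = max(candidates, key=lambda a: score("the generated instruction", a))
```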
They seem to have perfected training small models (1.6B and 3B). If they were able to keep that while scaling up, this should be amazing.
Idea: Train a multi-step language model to self-correct (like diffusion models). The first step generates the full sequence, and the later steps merely correct the prior generations.
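Roughly what I mean, with stubbed-out models (the real thing would be two heads or two passes of one LM; only the control flow matters here):

```python
# Stand-ins for real LM calls.
def draft_model(prompt: str) -> str:
    return prompt + " <rough draft>"

def correct_model(prompt: str, seq: str) -> str:
    return seq.replace("<rough draft>", "<refined>")

# Step 0 drafts the full sequence; later steps only correct the prior draft.
def generate(prompt: str, steps: int = 3) -> str:
    seq = draft_model(prompt)
    for _ in range(steps - 1):
        seq = correct_model(prompt, seq)
    return seq

print(generate("Q: ..."))
```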
@possumloverr
@Geosquare_
He was misled by misleading evidence. Were the evidence true, he would've been right to post. There is no evidence that this was malicious, and it does not seem like anything other than a mistake
Came up with a new finetuning method: ReMask
LLMs are trained with ground truth labels, but don't have real ground truths at inference time. Training to address this requires costly self-generation.
Instead, ReMask avoids this via regularized masking.
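To give the flavor (this is just my compressed illustration of "masking during teacher forcing", not the full ReMask recipe): randomly corrupt some context tokens so the model can't lean on a perfect history, while the labels stay clean.

```python
import torch

MASK_ID = 0  # assumed mask/unk token id; use whatever your tokenizer provides

def remask(input_ids: torch.Tensor, p: float = 0.1) -> torch.Tensor:
    noise = torch.rand_like(input_ids, dtype=torch.float) < p
    noise[..., 0] = False  # keep the first position (e.g. BOS) intact
    return input_ids.masked_fill(noise, MASK_ID)

batch = torch.randint(1, 100, (2, 16))
masked = remask(batch)
# loss = model(input_ids=masked, labels=batch).loss  # labels stay un-masked
```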
Unlike GPT-4, Claude doesn't do any reasoning/CoT before outputting code.
Despite this, it still gets it right. If I were to force GPT to zero-shot problems in the same way, it would do much worse than Claude does.
@gewt
Activity Monitor 73MB
Discord 92MB
Firefox Nightly 112GB
IRCCloud 83MB
iTerm2 338MB
someone who is good at the economy please help me budget this. my computer is dying
Introducing:
Smash The Record: Hardcore
A 36-hour event, starting at 7PM JST on 11/5
Runners compete to achieve the best times in the best category — Hardcore, 1.7, no F3.
(yes I messed up this tweet twice)
@FemboyPaganism
There are ~8 billion people in the world; it's absurd to make these "no one has ever" statements. Anyone with a single anecdote would destroy the claim.
More accurately, there isn't a significant amount of this, and it doesn't even begin to compare to the inverse.
Geo tweeted prematurely; that is not a reason to send massive amounts of hate toward him.
Also, if you have a concrete opinion based on the limited amount of evidence provided, then you're a moron.
When I was like 14 I started writing a basic OS kernel, but never got anywhere with it. It just booted and initialized interrupts.
A few years later, I found that someone had forked the project and rewritten it in Rust
open invitation @ dttwt and dream fans u can vicariously reply to me telling me to kill myself instead of geo since he's actually suicidal
then we'll all be happy
Today I implemented a sparse SSM-ish architecture with a gated structure similar to Mamba/GSS/etc, rather than ad-hoc inserting sparse MoE MLPs
Not very useful without training, but it was fun to implement anyway
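The gist, as a toy block (nothing like the real kernels, and the sparse-MoE part is omitted): a per-channel diagonal linear recurrence, gated by a parallel branch.

```python
import torch, torch.nn as nn

class GatedSSM(nn.Module):
    def __init__(self, d: int):
        super().__init__()
        self.in_proj = nn.Linear(d, 2 * d)         # value branch + gate branch
        self.log_a = nn.Parameter(torch.zeros(d))  # per-channel decay
        self.out_proj = nn.Linear(d, d)

    def forward(self, x):                  # x: (B, T, d)
        u, gate = self.in_proj(x).chunk(2, dim=-1)
        a = torch.sigmoid(self.log_a)      # keep the recurrence stable, in (0, 1)
        h, hs = torch.zeros_like(u[:, 0]), []
        for t in range(u.shape[1]):        # sequential for clarity; scan in practice
            h = a * h + u[:, t]
            hs.append(h)
        y = torch.stack(hs, dim=1) * nn.functional.silu(gate)
        return self.out_proj(y)

y = GatedSSM(32)(torch.randn(2, 10, 32))   # -> (2, 10, 32)
```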
We need causal encoder -> decoder models
Big causal encoder (causal so we can make use of KV cache), with a little decoder
The large encoder should provide enough information in a single pass to predict the next *segment*, and the little decoder decodes the segment token-wise
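Shape-level sketch of what I mean (my own strawman, not an existing model):

```python
import torch, torch.nn as nn

d, k = 512, 8
enc_layer = nn.TransformerEncoderLayer(d, nhead=8, batch_first=True)
encoder = nn.TransformerEncoder(enc_layer, num_layers=24)   # big, KV-cacheable
dec_layer = nn.TransformerDecoderLayer(d, nhead=8, batch_first=True)
decoder = nn.TransformerDecoder(dec_layer, num_layers=2)    # little, cheap per token

x = torch.randn(1, 128, d)                                  # embedded context
causal = nn.Transformer.generate_square_subsequent_mask(128)
mem = encoder(x, mask=causal)        # one causal pass over the context

seg = torch.randn(1, k, d)           # the k-token segment being decoded token-wise
seg_mask = nn.Transformer.generate_square_subsequent_mask(k)
out = decoder(seg, mem, tgt_mask=seg_mask)  # (1, k, d); project to vocab from here
```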
Someone needs to make an LLM that does executive functioning for me
It should schedule every minute of my day with a dynamically changing schedule that accounts for failures
I'm so dysfunctional that I don't think I'll ever be able to live a normal life, and certainly won't meet my parents' expectations
Although it is nice that I have exceptional abilities in other areas, I wish I could actually manage daily life
Mamba/GateLoop/etc are only scratching the surface - I suspect that we will soon see more complex (non-linear) RNNs that are parallelized in similar ways
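The core trick they all lean on: a *linear* recurrence h_t = a_t * h_{t-1} + b_t is associative, so it can be computed with a parallel scan in O(log T) rounds instead of T steps. A self-contained demo:

```python
import torch

def scan_sequential(a, b):              # a, b: (T, d); the reference recurrence
    h, out = torch.zeros_like(b[0]), []
    for t in range(a.shape[0]):
        h = a[t] * h + b[t]
        out.append(h)
    return torch.stack(out)

def scan_parallel(a, b):                # same result in log2(T) vectorized rounds
    a, b = a.clone(), b.clone()
    step = 1
    while step < a.shape[0]:
        # compose (a_prev, b_prev) into (a_cur, b_cur):
        # new_b = a_cur * b_prev + b_cur, new_a = a_cur * a_prev
        b[step:] = a[step:] * b[:-step] + b[step:]
        a[step:] = a[step:] * a[:-step]
        step *= 2
    return b

a, b = torch.rand(16, 4) * 0.9, torch.randn(16, 4)
assert torch.allclose(scan_sequential(a, b), scan_parallel(a, b), atol=1e-5)
```

A non-linear recurrence breaks that associativity, which is exactly why parallelizing one needs new ideas.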
This is wrong and harmful. Autism does come with real struggles (e.g. sensory issues), and shit like this undermines them. Autism isn't just "oh haha I have trouble reading people"; it includes a range of other things.
understanding that autism in itself does not have any inherently negative traits and the thing that makes life hard for autistic people is a neurotypical society that refuses to make room for us is absolutely vital if you want to advocate for autistics
@sherjilozair
I'd suspect that you need to train on a huge variety of functions for it to learn (OOD) ICL; mixed sine and linear is surely not enough. You need so many functions that it cannot learn them all directly and must resort to ICL.
@arankomatsuzaki
The issue with next-token prediction is more that it's trained on ground-truth tokens, so a single bad generation can lead to erratic continuations afterward.
Also, though there is some prediction of further-ahead tokens, the gradient is dominated by the next token, dropping off quickly after it.
@VictorTaelin
Where the model gets prompted with directly optimized pseudo-token embeddings instead of embeddings that correspond to real tokens
The model is kept fixed, but the prompt is treated as continuous and trainable
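A sketch of that setup (soft prompting / prompt tuning), with gpt2 as a stand-in model: the LM is frozen and only the pseudo-token embeddings get gradients.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.requires_grad_(False)                      # model stays fixed

n_virtual, d = 16, model.config.n_embd
soft_prompt = torch.nn.Parameter(torch.randn(1, n_virtual, d) * 0.02)
opt = torch.optim.Adam([soft_prompt], lr=1e-3)   # only the prompt trains

ids = tok("the target continuation", return_tensors="pt").input_ids
tok_emb = model.get_input_embeddings()(ids)
inputs = torch.cat([soft_prompt, tok_emb], dim=1)
labels = torch.cat([torch.full((1, n_virtual), -100), ids], dim=1)  # -100 skips virtual slots

loss = model(inputs_embeds=inputs, labels=labels).loss
loss.backward()
opt.step()
```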
People undervalue hyperparameter insensitivity
Optimization papers often have this issue: sometimes I'll see papers examining optimizers under heavy hparam tuning, for 'fairness'.
However, this ignores whether one optimizer is more or less sensitive to hparams than the other
There are claims that the FAQ mitigates this
On one hand, I buy that MSR can run an expensive hyperparam grid search without assumptions, to find the just-right setting for this fragile wonder
On the other, this doesn't smell like what we should expect