Jade

@Euclaise_

2,089 Followers
354 Following
535 Media
9,032 Statuses

⋅ Video game statistician ⋅ Soclib cyberanarchist? ⋅ C, Plan 9, LLMs, etc ⋅ Researcher w/ @NousResearch ⋅ she/they

Purdue University, IN
Joined December 2020
Pinned Tweet
@Euclaise_
Jade
1 year
Hot take: Degrees should be awarded purely based on standardized tests. If you already have the knowledge of a field, you should be able to easily and cheaply obtain a degree in it without having to go through school for it
@endless_sine
Arylative Cope ⌬ 🧪 🍋 🏳️‍⚧️
1 year
most people at first thought i actually have a formal education in the stuff I talk about
0
0
10
9
7
102
@Euclaise_
Jade
1 year
@astro_egg_celnt I really like Myrtle but nobody else likes it
49
3
1K
@Euclaise_
Jade
1 year
@LilithLovett Not necessarily. In Orthodox Judaism, unmarried women and men aren't allowed to touch each other at all, but it goes both ways.
46
12
735
@Euclaise_
Jade
3 years
Tweet media one
0
9
261
@Euclaise_
Jade
6 months
Datasets that I really like (no GPT ones):
- ROPES (reasoning)
- Goodwiki (WikiText but better)
- MiniPile (small-scale pretraining)
- Refinedweb (large-scale pretraining)
1
30
220
@Euclaise_
Jade
4 months
Over the past week or so, all of my RNN/linear-attn/SSM transformer-alternatives have failed to beat standard transformers
Tweet media one
16
11
210
@Euclaise_
Jade
5 months
Introducing: Memphis-CoT 3B
A small reasoning-focused model using a novel iterative contrastive finetuning procedure, trained on only human data, outperforming much larger human data models and similarly sized SFT models.
16
31
172
@Euclaise_
Jade
4 months
This is incorrect - LOMO was the first to train 7B models in 24GB, and I was the first to do it with an adaptive optimizer
@AnimaAnandkumar
Prof. Anima Anandkumar
4 months
For the first time, we show that the Llama 7B LLM can be trained on a single consumer-grade GPU (RTX 4090) with only 24GB memory. This represents more than 82.5% reduction in memory for storing optimizer states during training. Training LLMs from scratch currently requires huge
48
390
2K
8
13
164
@Euclaise_
Jade
7 months
I finetuned base mistral on 110 examples, 21 of which I hand wrote, and the rest of which I manually reviewed and filtered..... The model still calls itself "ChatGPT" lmao
3
7
153
@Euclaise_
Jade
3 months
Transformers are limited in planning since each layer can only attend over the previous layer. Feedback memory fixes this, but ruins parallelism: Enc-dec might have an advantage here, since the decoder attends over late representations from the encoder.
Tweet media one
Tweet media two
5
27
155
@Euclaise_
Jade
3 months
I recently got into some arguments about next-token prediction on here. I didn't post this at the time, but the following paper is interesting on the topic:
6
16
147
@Euclaise_
Jade
1 year
@MoralHazardPay "explicitly courting moderates"
Tweet media one
0
1
134
@Euclaise_
Jade
7 months
Tweet media one
2
14
129
@Euclaise_
Jade
3 months
What the fuck
@_akhaliq
AK
3 months
From Words to Numbers Your Large Language Model Is Secretly A Capable Regressor When Given In-Context Examples We analyze how well pre-trained large language models (e.g., Llama2, GPT-4, Claude 3, etc) can do linear and non-linear regression when given in-context examples,
Tweet media one
15
70
375
8
1
128
@Euclaise_
Jade
11 months
@whyvert This isn't evidence lmao, you can't just do a pairwise comparison between two different countries and expect it to reflect the effects of your policy of focus
12
0
111
@Euclaise_
Jade
3 months
I spend most of my waking hours trying to figure out how to make:
1. A GPT-4 level model in a <7B finetune
2. A Mistral-level model pretrained for <$2000, or on 1x3090 in <1mo
3. AI with a coherent identity and approximate consciousness
I have yet to figure out any of these
11
3
114
@Euclaise_
Jade
5 months
Tweet media one
1
0
108
@Euclaise_
Jade
4 months
Training comparison runs w/ Lilith vs AdamW
Tweet media one
8
7
97
@Euclaise_
Jade
11 months
@SkinnyTuna @boygrrI I'm gonna hang this on my apartment door
0
0
86
@Euclaise_
Jade
4 months
Although alternatives exist now, Mistral remains the most practical model in the 7B range
9
5
88
@Euclaise_
Jade
7 months
>Finetune an English model on English-only data
>model starts speaking Chinese
???
7
1
86
@Euclaise_
Jade
4 months
If you're using Genstruct, I recommend augmenting it with a reward model like PairRM (). The notebook in the Genstruct model card provides an example using the oasst deberta reward model
1
7
85
@Euclaise_
Jade
3 months
Has anyone tried this yet? They seem to have perfected training small models (1.6B and 3B). If they were able to keep that while scaling up, this should be amazing.
5
8
80
@Euclaise_
Jade
4 months
Idea: Train a multi-step language model to self-correct (like diffusion models). The first step generates the full sequence, and the later steps merely correct the prior generations.
21
5
81
@Euclaise_
Jade
4 months
A result that I've seen across multiple papers: Move Adam's eps under the sqrt, it's better
Tweet media one
5
7
71
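A minimal sketch of the comparison in the tweet above, not any specific paper's code: the usual Adam denominator sqrt(v) + eps versus moving eps under the square root. The numbers fed in are illustrative assumptions.

```python
import math

def adam_step_standard(m_hat, v_hat, lr=1e-3, eps=1e-8):
    # usual Adam update: eps added *outside* the square root
    return lr * m_hat / (math.sqrt(v_hat) + eps)

def adam_step_eps_inside(m_hat, v_hat, lr=1e-3, eps=1e-8):
    # variant from the tweet: eps moved *under* the square root
    return lr * m_hat / math.sqrt(v_hat + eps)

# For tiny second-moment estimates the two differ a lot:
print(adam_step_standard(1e-4, 1e-12))    # ~0.099
print(adam_step_eps_inside(1e-4, 1e-12))  # ~0.001, a much smaller step
```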
@Euclaise_
Jade
3 years
@possumloverr @Geosquare_ He was misled by misleading evidence. Were the evidence true, he would've been right to post. There is no evidence that this was malicious, and it does not seem like anything other than a mistake
0
0
71
@Euclaise_
Jade
3 months
Came up with a new finetuning method: ReMask
LLMs are trained with ground truth labels, but don't have real ground truths at inference time. Training to address this requires costly self-generation. Instead, ReMask avoids this via regularized masking.
3
10
67
@Euclaise_
Jade
4 months
Lilith test accuracy on MNIST (AlgoPerf): 88.89
AdamW test accuracy: 83.5 👀
@Euclaise_
Jade
4 months
New optimizer (previously named Adaheavy, now calling it Lilith) is seeming very promising
5
0
21
6
3
64
@Euclaise_
Jade
3 years
@possumloverr @Geosquare_ ??? Literally nothing points to geosquare intentionally doing this. What have you heard/seen lol
2
0
58
@Euclaise_
Jade
2 months
@jxmnop Things look linear in the middle of a sigmoid - they look exponential earlier
3
1
64
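A quick numeric check of the reply above; the sample points are assumed for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Deep in the left tail, each unit step multiplies the value by roughly e (exponential growth):
print([round(sigmoid(x), 6) for x in (-8, -7, -6)])      # [0.000335, 0.000911, 0.002473]
# Near the middle, the curve is roughly 0.5 + x/4 (linear):
print([round(sigmoid(x), 4) for x in (-0.2, 0.0, 0.2)])  # [0.4502, 0.5, 0.5498]
```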
@Euclaise_
Jade
4 months
??? this paper is only like 5 pages
Tweet media one
9
0
65
@Euclaise_
Jade
2 years
Tweet media one
8
1
64
@Euclaise_
Jade
10 months
@greenTetra_ @Catboyism I was trying to think of lyrics for this earlier today, but with t-slur girl
1
0
50
@Euclaise_
Jade
4 months
Unlike GPT-4, Claude doesn't do any reasoning/CoT before outputting code. Despite this, it still gets it right. If I were to force GPT to zero-shot problems in the same way, it would do much worse than Claude does.
5
1
54
@Euclaise_
Jade
2 months
Just implemented the first *non*-linear RNN that is parallelized with a prefix-sum-style scan like Mamba. Testing soon
5
0
55
@Euclaise_
Jade
2 years
@gewt Activity Monitor 73MB
Discord 92MB
Firefox Nightly 112GB
IRCCloud 83MB
iTerm2 338MB
someone who is good at the economy please help me budget this. my computer is dying
1
4
52
@Euclaise_
Jade
3 years
Introducing: Smash The Record: Hardcore
A 36 hour long event, starting at 7PM JST on 11/5
Runners compete to achieve the best times in the best category — Hardcore, 1.7, no F3.
(yes I messed up this tweet twice)
Tweet media one
2
12
51
@Euclaise_
Jade
1 year
@Egg_irl_bot dw about it, you'll be using arch before you know it either way
1
0
48
@Euclaise_
Jade
7 months
GPT-4 seems to have gotten simultaneously more verbose and less intelligent
9
2
50
@Euclaise_
Jade
2 years
@FemboyPaganism There are ~8 billion people in the world, it's absurd to make these "no one has ever" statements. Anyone with any single anecdote would destroy the claim. More accurately, there isn't a significant amount of this, and it doesn't even begin to compare to the inverse.
7
0
44
@Euclaise_
Jade
3 years
Geo tweeted prematurely; that is not a reason to send a massive amount of hate toward him. Also, if you have a concrete opinion based on the limited amount of evidence provided, then you're a moron.
0
8
44
@Euclaise_
Jade
2 years
@LucasOfSunshine Because it's actually "The Blast" news, Yahoo! is just the aggregator
0
0
42
@Euclaise_
Jade
5 months
@LinkofSunshine I like how she has now had conspiracies both about being a secret 4channer and about being a DNC operative
1
0
47
@Euclaise_
Jade
5 months
When I was like 14 I started writing a basic OS kernel, but never got anywhere with it. It just booted and initialized interrupts. A few years later, I found that someone had forked the project and rewritten it in Rust
4
0
44
@Euclaise_
Jade
4 months
@Teknium1 314B and no better than llama 70B, rip
3
0
46
@Euclaise_
Jade
1 year
@LilithLovett Yes, just not to the level of "women are beneath me"
5
0
44
@Euclaise_
Jade
3 months
Has anyone tried just directly optimizing the merge weights? This seems stupidly obvious to me, but I'm not sure that anyone is doing it
@xlr8harder
xlr8harder
3 months
shit's getting weird
Tweet media one
28
25
238
6
1
45
@Euclaise_
Jade
3 years
If anyone's still mad about dream cheating in speedruns: I did the stats on that, not Geo.
@BlueCrystal004
Adam
3 years
open invitation @ dttwt and dream fans u can vicariously reply to me telling me to kill myself instead of geo since he's actually suicidal then we'll all be happy
1
5
63
4
2
40
@Euclaise_
Jade
4 years
@macnpeachy @Desscrungus @phvonix @dreamwastaken2 Hi, I worked on the paper. APA style is to use pronouns, and to explicitly avoid passive form. Statistics often uses APA style.
0
1
41
@Euclaise_
Jade
5 months
Today I implemented a sparse SSM-ish architecture with a gated structure similar to Mamba/GSS/etc rather than ad-hoc inserting sparse MoE MLPs. Not very useful without training, but it was fun to implement anyway
3
2
41
@Euclaise_
Jade
2 years
@estrogenizedboy did you pack that into a water bottle how the fuck is it so wrinkled
0
0
37
@Euclaise_
Jade
2 months
I'm curious how this performs with simpler function bases, e.g. what if piecewise linear functions are used instead of B-splines?
@arankomatsuzaki
Aran Komatsuzaki
2 months
KAN: Kolmogorov–Arnold Networks Proposes an alternative to MLP that outperforms in terms of accuracy and interpretability
Tweet media one
9
139
736
7
0
40
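A hedged sketch of the question in the tweet above, not the KAN authors' code: a learnable 1D function built from piecewise-linear interpolation between knots, as a stand-in for the B-spline basis. The class name, grid range, and knot count are assumptions.

```python
import torch
import torch.nn as nn

class PiecewiseLinear1D(nn.Module):
    """Learnable scalar function f(x): linear interpolation between knot values."""
    def __init__(self, num_knots=16, lo=-3.0, hi=3.0):
        super().__init__()
        self.register_buffer("knots", torch.linspace(lo, hi, num_knots))
        self.values = nn.Parameter(torch.zeros(num_knots))  # learnable y at each knot

    def forward(self, x):
        x = x.clamp(float(self.knots[0]), float(self.knots[-1]))
        idx = torch.bucketize(x, self.knots).clamp(1, len(self.knots) - 1)
        x0, x1 = self.knots[idx - 1], self.knots[idx]
        y0, y1 = self.values[idx - 1], self.values[idx]
        w = (x - x0) / (x1 - x0)
        return y0 + w * (y1 - y0)
```

In a KAN-style layer each edge would carry its own such function; swapping the spline basis for this only changes how the 1D functions are parameterized.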
@Euclaise_
Jade
2 months
Llama 3 benchmarks have certainly exceeded my expectations
5
0
39
@Euclaise_
Jade
4 months
We need causal encoder -> decoder models
Big causal encoder (causal so we can make use of KV cache), with a little decoder
The large encoder should provide enough information in a single pass to predict the next *segment*, and the little decoder decodes the segment token-wise
10
0
39
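A hedged structural sketch of the proposal above, not any published model; the module names, sizes, and layer counts are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CausalEncoderSegmentDecoder(nn.Module):
    def __init__(self, d_model=1024, d_small=256, vocab=32000):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=16, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=24)   # "big" causal encoder
        self.proj = nn.Linear(d_model, d_small)
        dec_layer = nn.TransformerDecoderLayer(d_small, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=2)    # "little" decoder
        self.seg_embed = nn.Embedding(vocab, d_small)
        self.lm_head = nn.Linear(d_small, vocab)

    def forward(self, context_ids, segment_ids):
        # Causal mask on the encoder so its KV cache can be reused as context grows.
        L = context_ids.size(1)
        causal = torch.triu(torch.ones(L, L, dtype=torch.bool, device=context_ids.device), 1)
        memory = self.proj(self.encoder(self.embed(context_ids), mask=causal))
        # Small decoder cross-attends to the encoder states and emits the next segment token-wise.
        S = segment_ids.size(1)
        seg_mask = torch.triu(torch.ones(S, S, dtype=torch.bool, device=segment_ids.device), 1)
        h = self.decoder(self.seg_embed(segment_ids), memory, tgt_mask=seg_mask)
        return self.lm_head(h)
```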
@Euclaise_
Jade
3 years
Me when the rules allow F3:
Tweet media one
2
1
34
@Euclaise_
Jade
8 months
Someone needs to make an LLM that does executive functioning for me
It should schedule every minute of my day with a dynamically changing schedule that accounts for failures
5
1
38
@Euclaise_
Jade
3 years
Compromise: Ban calculators but also ban HBG
1
2
36
@Euclaise_
Jade
9 months
@xelliepurr @poisonjr My first guess was it's a Halloween thing but that doesn't really make sense
1
0
36
@Euclaise_
Jade
5 months
I'm so dysfunctional that I don't think I'll ever be able to live a normal life, and certainly won't meet my parents' expectations
Although it is nice that I have exceptional abilities in other areas, I wish I could actually manage daily life
10
1
36
@Euclaise_
Jade
8 months
@LinkofSunshine vs psychologists when their R^2 is 0.0025 but "significant"
Tweet media one
0
1
33
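The arithmetic behind the joke above: with a large enough sample, even R^2 = 0.0025 (r = 0.05) clears the significance threshold. The sample size here is an assumed illustrative value.

```python
import math

r, n = 0.05, 10_000                        # r^2 = 0.0025; n assumed for illustration
t = r * math.sqrt((n - 2) / (1 - r * r))   # t-statistic for a Pearson correlation
print(t)  # ~5.0, well past the ~1.96 two-sided cutoff, so p << 0.05
```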
@Euclaise_
Jade
5 months
Mamba/Gateloop/etc is only scratching the surface - I suspect that we will soon see more complex (non-linear) RNNs that are parallelized in similar ways
@Euclaise_
Jade
5 months
Apparently associative scans can be used to compute pretty complex recurrences, not just linear ones, e.g.
2
5
20
1
3
34
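A small self-contained illustration of the quoted point, not Mamba's code: the linear recurrence h_t = a_t*h_{t-1} + b_t can be expressed through an associative combine over (a, b) pairs, which is what makes prefix-scan parallelism possible; richer recurrences work the same way whenever a suitable associative operator exists.

```python
def combine(p, q):
    # Composing h -> a1*h + b1 with h -> a2*h + b2 gives h -> (a1*a2)*h + (a2*b1 + b2);
    # this operator is associative, so it can be evaluated with a parallel prefix scan.
    a1, b1 = p
    a2, b2 = q
    return (a1 * a2, a2 * b1 + b2)

a = [0.9, 0.5, 1.1, 0.7]
b = [1.0, 2.0, -1.0, 0.5]

# Sequential reference: h_t = a_t * h_{t-1} + b_t, starting from h_0 = 0
h, seq = 0.0, []
for at, bt in zip(a, b):
    h = at * h + bt
    seq.append(h)

# Same values via the associative operator (folded left here; a real
# implementation would run this as a parallel scan over `combine`)
acc, scanned = (1.0, 0.0), []
for at, bt in zip(a, b):
    acc = combine(acc, (at, bt))
    scanned.append(acc[1])

print(seq)      # [1.0, 2.5, 1.75, 1.725] (up to float rounding)
print(scanned)  # identical
```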
@Euclaise_
Jade
3 years
This is wrong and harmful. Autism does come with absolute struggles (e.g. sensory issues) and shit like this undermines them. Autism isn't just "oh haha I have trouble reading people," it includes a range of other things.
@f1gayda
galatax why does jillian call u babygirl
3 years
understanding that autism in itself does not have any inherently negative traits and the thing that makes life hard for autistic people is a neurotypical society that refuses to make room for us is absolutely vital if you want to advocate for autistics
88
13K
66K
1
7
31
@Euclaise_
Jade
2 years
@LucasOfSunshine let me become an elementary school teacher. I will have all the children doing nonlinear time series modeling by the end of the year
0
0
29
@Euclaise_
Jade
10 months
@kalebjayboss1 @nominalnaomi Placebo controlled trials are, RCTs in general aren't
2
0
29
@Euclaise_
Jade
8 months
@sherjilozair I'd suspect that you need to train on a huge variety of functions for it to learn (OOD) ICL; mixed sin and linear is surely not enough - you need so many functions that it cannot learn them all directly and must learn to resort to ICL
1
1
32
@Euclaise_
Jade
4 months
Further, ReLoRA was also prior to this, with an extremely similar method
0
1
32
@Euclaise_
Jade
6 months
@Teknium1 Y'all are sleeping on falcon 1b and Qwen 1.8b, both of which significantly outperform TinyLlama
4
2
32
@Euclaise_
Jade
2 years
@Gankra_ @auramarua the three genders
0
1
31
@Euclaise_
Jade
3 months
@arankomatsuzaki The issue with next-token prediction is more so that it's trained with ground-truth tokens, so it can result in erratic generations after a bad generation. Also, though there is some further prediction, the gradient is overwhelmed by the next token, dropping off quickly after
3
1
31
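A minimal sketch of the "trained with ground-truth tokens" point (teacher forcing), assuming a Hugging Face-style causal LM: the loss at every position conditions on the true prefix, never on the model's own generations.

```python
import torch.nn.functional as F

def next_token_loss(model, input_ids):
    # logits[:, t] is conditioned on the *ground-truth* tokens up to position t
    logits = model(input_ids=input_ids).logits
    preds = logits[:, :-1]           # position t predicts token t+1
    targets = input_ids[:, 1:]
    return F.cross_entropy(preds.reshape(-1, preds.size(-1)), targets.reshape(-1))
```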
@Euclaise_
Jade
3 months
Wait a minute... A cross product fits these conditions. There, an easy alternative to elementwise linear recurrences
Tweet media one
2
3
31
@Euclaise_
Jade
1 year
@Euclaise_
Jade
1 year
@LilithLovett Oh, yeah you're certainly correct about the poster - I meant the person in the video though
1
0
22
0
0
28
@Euclaise_
Jade
3 months
@VictorTaelin Where the model gets prompted with directly optimized pseudo-token embeddings instead of embeddings that correspond to real tokens. The model is kept fixed, but the prompt is treated as continuous and trainable
2
2
30
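A hedged sketch of the technique described in the reply above (soft / continuous prompt tuning), assuming a Hugging Face-style causal LM; the function name, prompt length, and learning rate are illustrative assumptions, not anyone's exact setup.

```python
import torch

def prompt_tuning_step(model, tokenizer, text, soft_prompt, optimizer):
    # soft_prompt: torch.nn.Parameter of shape (n_virtual_tokens, hidden_size).
    # The model's weights stay frozen; only these pseudo-token embeddings get gradients.
    ids = tokenizer(text, return_tensors="pt").input_ids
    tok_embeds = model.get_input_embeddings()(ids)                       # (1, T, H)
    inputs_embeds = torch.cat([soft_prompt.unsqueeze(0), tok_embeds], dim=1)

    # No loss on the virtual-prompt positions (-100 is ignored by the loss).
    ignore = torch.full((1, soft_prompt.size(0)), -100, dtype=torch.long)
    labels = torch.cat([ignore, ids], dim=1)

    loss = model(inputs_embeds=inputs_embeds, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()

# Usage sketch: freeze the model and optimize only the soft prompt.
# soft_prompt = torch.nn.Parameter(0.02 * torch.randn(20, model.config.hidden_size))
# for p in model.parameters():
#     p.requires_grad_(False)
# optimizer = torch.optim.Adam([soft_prompt], lr=1e-3)
```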
@Euclaise_
Jade
2 years
Tweet media one
0
0
26
@Euclaise_
Jade
3 months
Very tiny paper
4
2
27
@Euclaise_
Jade
3 years
@CaroltheIntern @Patton41Fan darker skin means fewer sunburns (and less skin cancer, I think)
3
0
19
@Euclaise_
Jade
3 months
People undervalue hyperparameter insensitivity
Optimization papers often have this issue - sometimes I'll see papers examining optimizers under heavy hparam-tuning, for 'fairness'. However, this ignores whether one optimizer is more/less sensitive to hparams than the other
@teortaxesTex
Teortaxes▶️
3 months
There are claims that the FAQ mitigates this On one hand, I buy that MSR can run an expensive hyperparam grid search without assumptions, to find the just-right setting for this fragile wonder On the other, this doesn't smell like what we should expect
2
1
12
4
0
29
@Euclaise_
Jade
4 months
Reminder that GSM8K is the only benchmark on the hf leaderboard that actually tests multi-step reasoning.
0
2
28
@Euclaise_
Jade
2 years
Apparently I'm globally banned from mcsr now lmao
3
0
27