Hot take: Degrees should be awarded purely based on standardized tests.
If you already have the knowledge of a field, you should be able to obtain a degree in it easily and cheaply, without having to go through school for it.
Datasets that I really like (no GPT ones):
- ROPES (reasoning)
- GoodWiki (WikiText but better)
- MiniPile (small-scale pretraining)
- RefinedWeb (large-scale pretraining)
Introducing: Memphis-CoT 3B
A small reasoning-focused model trained with a novel iterative contrastive finetuning procedure, on human data only; it outperforms much larger human-data models and similarly sized SFT models.
For the first time, we show that the Llama 7B LLM can be trained on a single consumer-grade GPU (RTX 4090) with only 24GB memory. This represents more than 82.5% reduction in memory for storing optimizer states during training.
Training LLMs from scratch currently requires huge
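Back-of-envelope on where that 82.5% lands, assuming plain fp32 Adam (my arithmetic, not the paper's):

```python
# Adam keeps 2 fp32 states (m and v) per parameter
params = 7e9                            # Llama 7B
adam_gb = 2 * params * 4 / 1e9          # ~56 GB of optimizer state
reduced_gb = adam_gb * (1 - 0.825)      # ~9.8 GB after the claimed reduction
print(f"full: {adam_gb:.0f} GB, reduced: {reduced_gb:.1f} GB")
```

~10 GB of optimizer state is the kind of number that starts to fit next to weights and gradients in 24GB.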
I finetuned base mistral on 110 examples, 21 of which I hand wrote, and the rest of which I manually reviewed and filtered.....
The model still calls itself "ChatGPT" lmao
Transformers are limited in planning since each layer can only attend over the previous layer.
Feedback memory fixes this, but ruins parallelism:
Enc-dec might have an advantage here, since the decoder attends over late representations from the encoder.
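A toy sketch of the feedback idea and why it serializes (stand-in ops, not the actual Feedback Transformer): every layer of token t reads the final-layer states of tokens < t, so those states must be fully computed first.

```python
import torch

d, n_layers, T = 64, 4, 16
layers = [torch.nn.Linear(2 * d, d) for _ in range(n_layers)]  # stand-in for attn+MLP
xs = torch.randn(T, d)

memory = []  # final-layer states of past tokens
for t in range(T):  # cannot be parallelized over t, unlike a normal transformer
    ctx = torch.stack(memory).mean(0) if memory else torch.zeros(d)  # attention stand-in
    h = xs[t]
    for layer in layers:
        h = torch.relu(layer(torch.cat([h, ctx])))  # every layer sees top-level memory
    memory.append(h)  # token t's final state only now exists for token t+1
```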
I recently got into some arguments about next-token prediction on here
I didn't post this at the time, but the following paper is interesting on the topic:
From Words to Numbers
Your Large Language Model Is Secretly A Capable Regressor When Given In-Context Examples
We analyze how well pre-trained large language models (e.g., Llama2, GPT-4, Claude 3, etc) can do linear and non-linear regression when given in-context examples,
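A minimal sketch of the setup, with a prompt format that's my guess rather than necessarily the paper's exact one:

```python
import random

# Build an in-context regression prompt: (x, y) examples, then a query x.
def make_prompt(pairs, query_x):
    lines = [f"x: {x:.3f}\ny: {y:.3f}" for x, y in pairs]
    return "\n".join(lines) + f"\nx: {query_x:.3f}\ny:"

# e.g. noisy linear data, y = 3x + 1
pairs = [(x, 3 * x + 1 + random.gauss(0, 0.05)) for x in (0.1, 0.4, 0.7, 0.9)]
print(make_prompt(pairs, 0.5))  # feed this to the LLM and parse the number it emits
```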
@whyvert
This isn't evidence lmao. You can't just do a pairwise comparison between two countries and expect it to reflect the effects of the policy you're focused on.
I spend most of my waking hours trying to figure out how to make:
1. A GPT-4 level model in a <7B finetune
2. A Mistral-level model pretrained for <$2000, or on 1x3090 in <1mo
3. AI with a coherent identity and approximate consciousness
I have yet to figure out any of these
If you're using Genstruct, I recommend augmenting it with a reward model like PairRM
The notebook in the Genstruct model card provides an example using the oasst deberta reward model
Has anyone tried this yet?
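For reference, a hedged sketch of the reranking step with the oasst deberta model (PairRM itself goes through the llm-blender library instead; the instruction/answer strings here are placeholders):

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

rm_name = "OpenAssistant/reward-model-deberta-v3-large-v2"
tok = AutoTokenizer.from_pretrained(rm_name)
rm = AutoModelForSequenceClassification.from_pretrained(rm_name)

# Score a (question, answer) pair; higher is better.
def score(question: str, answer: str) -> float:
    inputs = tok(question, answer, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return rm(**inputs).logits[0].item()

# Keep the best of several Genstruct samples for the same instruction.
candidates = ["answer A ...", "answer B ..."]
best = max(candidates, key=lambda a: score("the generated instruction", a))
```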
They seem to have perfected training small models (1.6B and 3B). If they were able to keep that while scaling up, this should be amazing.
Idea: Train a multi-step language model to self-correct (like diffusion models). The first step generates the full sequence, and the later steps merely correct the prior generations.
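Roughly what I mean, with stubbed-out models (the real thing would be two heads or two passes of one LM; only the control flow matters here):

```python
# Stand-ins for real LM calls.
def draft_model(prompt: str) -> str:
    return prompt + " <rough draft>"

def correct_model(prompt: str, seq: str) -> str:
    return seq.replace("<rough draft>", "<refined>")

# Step 0 drafts the full sequence; later steps only correct the prior draft.
def generate(prompt: str, steps: int = 3) -> str:
    seq = draft_model(prompt)
    for _ in range(steps - 1):
        seq = correct_model(prompt, seq)
    return seq

print(generate("Q: ..."))
```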
@possumloverr
@Geosquare_
He was misled by misleading evidence. Were the evidence true, he would've been right to post. There is no evidence that this was malicious, and it does not seem like anything other than a mistake
Came up with a new finetuning method: ReMask
LLMs are trained with ground truth labels, but don't have real ground truths at inference time. Training to address this requires costly self-generation.
Instead, ReMask avoids this via regularized masking.
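To give the flavor (this is just my compressed illustration of "masking during teacher forcing", not the full ReMask recipe): randomly corrupt some context tokens so the model can't lean on a perfect history, while the labels stay clean.

```python
import torch

MASK_ID = 0  # assumed mask/unk token id; use whatever your tokenizer provides

def remask(input_ids: torch.Tensor, p: float = 0.1) -> torch.Tensor:
    noise = torch.rand_like(input_ids, dtype=torch.float) < p
    noise[..., 0] = False  # keep the first position (e.g. BOS) intact
    return input_ids.masked_fill(noise, MASK_ID)

batch = torch.randint(1, 100, (2, 16))
masked = remask(batch)
# loss = model(input_ids=masked, labels=batch).loss  # labels stay un-masked
```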
Unlike GPT-4, Claude doesn't do any reasoning/CoT before outputting code.
Despite this, it still gets it right. If I were to force GPT to zero-shot problems in the same way, it would do much worse than Claude does.
@gewt
Activity Monitor 73MB
Discord 92MB
Firefox Nightly 112GB
IRCCloud 83MB
iTerm2 338MB
someone who is good at the economy please help me budget this. my computer is dying
Introducing:
Smash The Record: Hardcore
A 36-hour event, starting at 7PM JST on 11/5
Runners compete to achieve the best times in the best category — Hardcore, 1.7, no F3.
(yes I messed up this tweet twice)
@FemboyPaganism
There are ~8 billion people in the world; it's absurd to make these "no one has ever" statements. Anyone with a single anecdote would destroy the claim.
More accurately, there isn't a significant amount of this, and it doesn't even begin to compare to the inverse.
Geo tweeted prematurely; that is not a reason to send massive amounts of hate toward him.
Also, if you have a concrete opinion based on the limited amount of evidence provided, then you're a moron.
When I was like 14 I started writing a basic OS kernel, but never got anywhere with it. It just booted and initialized interrupts.
A few years later, I found that someone had forked the project and rewritten it in Rust
open invitation @ dttwt and dream fans u can vicariously reply to me telling me to kill myself instead of geo since he's actually suicidal
then we'll all be happy
Today I implemented a sparse SSM-ish architecture with a gated structure similar to Mamba/GSS/etc, rather than ad-hoc inserting sparse MoE MLPs
Not very useful without training, but it was fun to implement anyway
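The gist, as a toy block (nothing like the real kernels, and the sparse-MoE part is omitted): a per-channel diagonal linear recurrence, gated by a parallel branch.

```python
import torch, torch.nn as nn

class GatedSSM(nn.Module):
    def __init__(self, d: int):
        super().__init__()
        self.in_proj = nn.Linear(d, 2 * d)         # value branch + gate branch
        self.log_a = nn.Parameter(torch.zeros(d))  # per-channel decay
        self.out_proj = nn.Linear(d, d)

    def forward(self, x):                  # x: (B, T, d)
        u, gate = self.in_proj(x).chunk(2, dim=-1)
        a = torch.sigmoid(self.log_a)      # keep the recurrence stable, in (0, 1)
        h, hs = torch.zeros_like(u[:, 0]), []
        for t in range(u.shape[1]):        # sequential for clarity; scan in practice
            h = a * h + u[:, t]
            hs.append(h)
        y = torch.stack(hs, dim=1) * nn.functional.silu(gate)
        return self.out_proj(y)

y = GatedSSM(32)(torch.randn(2, 10, 32))   # -> (2, 10, 32)
```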
We need causal encoder -> decoder models
Big causal encoder (causal so we can make use of KV cache), with a little decoder
The large encoder should provide enough information in a single pass to predict the next *segment*, and the little decoder decodes the segment token-wise
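Shape-level sketch of what I mean (my own strawman, not an existing model):

```python
import torch, torch.nn as nn

d, k = 512, 8
enc_layer = nn.TransformerEncoderLayer(d, nhead=8, batch_first=True)
encoder = nn.TransformerEncoder(enc_layer, num_layers=24)   # big, KV-cacheable
dec_layer = nn.TransformerDecoderLayer(d, nhead=8, batch_first=True)
decoder = nn.TransformerDecoder(dec_layer, num_layers=2)    # little, cheap per token

x = torch.randn(1, 128, d)                                  # embedded context
causal = nn.Transformer.generate_square_subsequent_mask(128)
mem = encoder(x, mask=causal)        # one causal pass over the context

seg = torch.randn(1, k, d)           # the k-token segment being decoded token-wise
seg_mask = nn.Transformer.generate_square_subsequent_mask(k)
out = decoder(seg, mem, tgt_mask=seg_mask)  # (1, k, d); project to vocab from here
```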
Someone needs to make an LLM that does executive functioning for me
It should schedule every minute of my day with a dynamically changing schedule that accounts for failures
I'm so dysfunctional that I don't think I'll ever be able to live a normal life, and certainly won't meet my parents' expectations
Although it is nice that I have exceptional abilities in other areas, I wish I could actually manage daily life
Mamba/GateLoop/etc are only scratching the surface - I suspect that we will soon see more complex (non-linear) RNNs that are parallelized in similar ways
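The core trick they all lean on: a *linear* recurrence h_t = a_t * h_{t-1} + b_t is associative, so it can be computed with a parallel scan in O(log T) rounds instead of T steps. A self-contained demo:

```python
import torch

def scan_sequential(a, b):              # a, b: (T, d); the reference recurrence
    h, out = torch.zeros_like(b[0]), []
    for t in range(a.shape[0]):
        h = a[t] * h + b[t]
        out.append(h)
    return torch.stack(out)

def scan_parallel(a, b):                # same result in log2(T) vectorized rounds
    a, b = a.clone(), b.clone()
    step = 1
    while step < a.shape[0]:
        # compose (a_prev, b_prev) into (a_cur, b_cur):
        # new_b = a_cur * b_prev + b_cur, new_a = a_cur * a_prev
        b[step:] = a[step:] * b[:-step] + b[step:]
        a[step:] = a[step:] * a[:-step]
        step *= 2
    return b

a, b = torch.rand(16, 4) * 0.9, torch.randn(16, 4)
assert torch.allclose(scan_sequential(a, b), scan_parallel(a, b), atol=1e-5)
```

A non-linear recurrence breaks that associativity, which is exactly why parallelizing one needs new ideas.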
This is wrong and harmful. Autism does come with real struggles (e.g. sensory issues), and shit like this undermines them. Autism isn't just "oh haha I have trouble reading people"; it includes a range of other things.
understanding that autism in itself does not have any inherently negative traits and the thing that makes life hard for autistic people is a neurotypical society that refuses to make room for us is absolutely vital if you want to advocate for autistics
@sherjilozair
I'd suspect that you need to train on a huge variety of functions for it to learn (OOD) ICL; mixed sine and linear is surely not enough. You need so many functions that it cannot learn them all directly and must resort to ICL.
@arankomatsuzaki
The issue with next-token prediction is more that it's trained on ground-truth tokens, so a single bad generation can lead to erratic continuations afterward.
Also, though there is some prediction of further-ahead tokens, the gradient is dominated by the next token, dropping off quickly after it.
@VictorTaelin
Where the model gets prompted with directly optimized pseudo-token embeddings instead of embeddings that correspond to real tokens
The model is kept fixed, but the prompt is treated as continuous and trainable
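A sketch of that setup (soft prompting / prompt tuning), with gpt2 as a stand-in model: the LM is frozen and only the pseudo-token embeddings get gradients.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.requires_grad_(False)                      # model stays fixed

n_virtual, d = 16, model.config.n_embd
soft_prompt = torch.nn.Parameter(torch.randn(1, n_virtual, d) * 0.02)
opt = torch.optim.Adam([soft_prompt], lr=1e-3)   # only the prompt trains

ids = tok("the target continuation", return_tensors="pt").input_ids
tok_emb = model.get_input_embeddings()(ids)
inputs = torch.cat([soft_prompt, tok_emb], dim=1)
labels = torch.cat([torch.full((1, n_virtual), -100), ids], dim=1)  # -100 skips virtual slots

loss = model(inputs_embeds=inputs, labels=labels).loss
loss.backward()
opt.step()
```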
People undervalue hyperparameter insensitivity
Optimization papers often have this issue: sometimes I'll see papers examining optimizers under heavy hparam tuning, for 'fairness'.
However, this ignores whether one optimizer is more or less sensitive to hparams than the other
There are claims that the FAQ mitigates this
On one hand, I buy that MSR can run an expensive hyperparam grid search without assumptions, to find the just-right setting for this fragile wonder
On the other, this doesn't smell like what we should expect