Nicolas Mejia Petit

@mejia_petit

749
Followers
144
Following
101
Media
968
Statuses

LLM researcher // Made the Tested Python 22k and 143k datasets, the first Mixtral 22B MOE-to-dense model, open-critic-gpt, code-DPO //

Pinned Tweet
@mejia_petit
Nicolas Mejia Petit
4 months
🚀 Introducing Mistral-22b-V.01, a breakthrough in AI! 🧠💡 First-ever MOE-to-dense model conversion 🔥 #Mistral22bV01 This model is NOT an MOE (it only has 22B params).
22
79
542
@mejia_petit
Nicolas Mejia Petit
3 months
Why isn’t everyone talking about this??? DeepSpeed devs literally just created an FP6 datatype with full tensor core support on A100s (since NVIDIA left us stranded with INT4/8). It is SO smart, just reading through the kernel, my god.
Tweet media one
@rohanpaul_ai
Rohan Paul
3 months
LLaMA-70B inference using only a single GPU, achieving 1.69x-2.65x higher normalized inference throughput than the FP16 baseline with six-bit quantization (FP6) 🔥 DeepSpeed has just recently released this paper and also integrated the FP6 quantization - "FP6-LLM:
Tweet media one
10
108
611
10
106
652
@mejia_petit
Nicolas Mejia Petit
4 months
Mistral-22B-v0.2 is out now 🎉🎉! With significant improvements all across the board! Trained on 8x more data, the model is significantly better at handling user queries! Try it out here!
9
82
407
@mejia_petit
Nicolas Mejia Petit
4 months
🚀 V2 is done training. My god, it’s so much better than V1 🎉. The improvement was like going from Llama 1 ➡️ 2. Currently uploading 🚀🌌. I think I’m going to continue pre-training the V1/V2 base model, then SFT. (Currently testing a different base model before I decide to pre-train.)
3
4
92
@mejia_petit
Nicolas Mejia Petit
3 months
Please help me peer pressure @MSFTResearch into dropping the data used to take DeepSeek 6.7B from a 49.4 HumanEval to 79.9.
Tweet media one
3
4
43
@mejia_petit
Nicolas Mejia Petit
3 months
Just wrote a feature request to get an FP6 dtype into PyTorch. Help me out and put an emoji on that hoe! 🚀🚀
Tweet media one
1
3
45
@mejia_petit
Nicolas Mejia Petit
4 months
When life gives you MOE’s, compress them back to dense models.
1
1
32
@mejia_petit
Nicolas Mejia Petit
4 months
Models fully uploaded! 🎉🎉 Be warned, this model is highly uncensored. It will try to answer anything you ask.
4
4
29
@mejia_petit
Nicolas Mejia Petit
3 months
LM Studio has no right being this fast on a single 3090. This is insane. @LMStudioAI
Tweet media one
1
2
28
@mejia_petit
Nicolas Mejia Petit
4 months
Just started my finetune of Llama 8B Instruct using @UnslothAI . (Yeah, it’s running at peak performance even while using 24GB of shared VRAM! Expanding sequence length to 32k from 8k using RoPE 🪄 and a batch size of 2!) Literally pure magic! (aka highly optimized Triton kernels)
Tweet media one
1
3
27
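For anyone curious what a run like this looks like, here is a rough sketch using Unsloth's documented loading/LoRA API. The checkpoint name, rank, and 32k max_seq_length are illustrative placeholders rather than the exact settings from the run above; Unsloth handles RoPE scaling internally when max_seq_length exceeds the model's native context.

```python
from unsloth import FastLanguageModel

# Illustrative settings only; not the exact run described in the tweet above.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3-8b-Instruct-bnb-4bit",  # assumed checkpoint name
    max_seq_length=32768,   # RoPE scaling handled internally past the native 8k
    load_in_4bit=True,
)

model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing="unsloth",  # offloads activations to save VRAM
)
# From here, training proceeds with TRL's SFTTrainer as usual
# (e.g. per_device_train_batch_size=2).
```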
@mejia_petit
Nicolas Mejia Petit
4 months
@danielhanchen Thank you! All will be revealed with the paper, or a blog post (not on Medium), still haven’t decided; hopefully V2 can show promising results. I have a few other tricks up my sleeve I have yet to try. And it was very helpful, otherwise training wouldn’t have fit on my GPU!
2
0
24
@mejia_petit
Nicolas Mejia Petit
4 months
@ClementDelangue @huggingface You might find this cool, a repost would really help, took a ton of time making it.
2
4
18
@mejia_petit
Nicolas Mejia Petit
3 months
@Teknium1 @Euclaise_ I feel like getting the 3.3T token dataset would be the best thing we could get
1
0
15
@mejia_petit
Nicolas Mejia Petit
4 months
(V.2 is currently training on 8x more data of significantly higher quality, with almost all of it being multi-turn. Please keep in mind V.1 was the experiment before I went for a bigger dataset, but my hand got forced, so I had to drop V.1 before V.2 could finish training.)
2
0
16
@mejia_petit
Nicolas Mejia Petit
3 months
The way it runs, it should also theoretically work on ROCm cards, 20- and 30-series consumer cards, and the T4. Essentially any FP16 card with tensor cores *could* work. Correct me if I’m wrong. Here is the merge commit if you’d like to read through it:
1
1
15
@mejia_petit
Nicolas Mejia Petit
4 months
@WizardLM_AI @QNixSynapse Whose regulation are you referring to? Your country’s regulation?
2
0
14
@mejia_petit
Nicolas Mejia Petit
5 months
@ChaseMc67 GPT-4 acts like it is on an active mission to save compute resources and make you do all the work. No matter how many times you specify in the prompt “Only give fully written code. Do not give me incomplete code. Do not tell me to (code goes here), just write the code”
2
0
13
@mejia_petit
Nicolas Mejia Petit
5 months
I am proud to release my newest coding dataset: 143k examples of tested Python code. Why train on Python code that doesn't work, when you can train on a large variety of tested Python code!
2
3
13
@mejia_petit
Nicolas Mejia Petit
3 months
Same model, just 3.3X more data and 200,000 more trainable params, and it’s still on epoch 1.6, while the other finished its 3 epochs of training. More high-quality data is all you need :)
Tweet media one
Tweet media two
0
0
13
@mejia_petit
Nicolas Mejia Petit
2 months
Wait a minute… you are telling me Samba 3.8B beats Phi-3 (3.8B) while training on SlimPajama? Wow 🤯.
Tweet media one
Tweet media two
1
0
13
@mejia_petit
Nicolas Mejia Petit
3 months
@joefioti So the FP16 values get quantized down to FP6; 4 × 6 bits = 24 bits, which can be packed as 3 × 8 bits and stored as uint8_t. So I guess int8 does play a pivotal role. Here is how the packing is done:
Tweet media one
1
0
11
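To make the 4 × 6 bits = 3 bytes arithmetic above concrete, here is a minimal NumPy sketch of that packing scheme. It only illustrates the idea from the tweet; DeepSpeed's actual kernels lay the bits out in a hardware-friendly interleaved order, so the field ordering here is an assumption for illustration.

```python
import numpy as np

def pack_fp6(codes: np.ndarray) -> np.ndarray:
    """Pack groups of four 6-bit codes (values 0..63) into three uint8 bytes."""
    assert codes.size % 4 == 0
    v = codes.astype(np.uint32).reshape(-1, 4)
    # Concatenate four 6-bit fields into one 24-bit word (v0 in the top bits).
    word = (v[:, 0] << 18) | (v[:, 1] << 12) | (v[:, 2] << 6) | v[:, 3]
    out = np.empty((v.shape[0], 3), dtype=np.uint8)
    out[:, 0] = (word >> 16) & 0xFF
    out[:, 1] = (word >> 8) & 0xFF
    out[:, 2] = word & 0xFF
    return out.reshape(-1)

def unpack_fp6(packed: np.ndarray) -> np.ndarray:
    """Inverse of pack_fp6: recover four 6-bit codes from each 3-byte group."""
    b = packed.astype(np.uint32).reshape(-1, 3)
    word = (b[:, 0] << 16) | (b[:, 1] << 8) | b[:, 2]
    v = np.stack([(word >> 18) & 0x3F, (word >> 12) & 0x3F,
                  (word >> 6) & 0x3F, word & 0x3F], axis=1)
    return v.reshape(-1).astype(np.uint8)

codes = np.array([0, 63, 17, 42, 5, 9, 33, 60], dtype=np.uint8)
assert np.array_equal(unpack_fp6(pack_fp6(codes)), codes)
```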
@mejia_petit
Nicolas Mejia Petit
6 months
@jon_durbin @younesbelkada @Tim_Dettmers Before I fell asleep I wrote a request for 8-bit RMSprop to be added to TRL. I woke up to over 100 lines of code. HF team, y’all are amazing ❤️!
Tweet media one
1
1
10
@mejia_petit
Nicolas Mejia Petit
3 months
@msiUSA afterburner OC + @UnslothAI optimized kernels = Brrrr (Blame Dall•E-3 for cocaine bear 2.0)
Tweet media one
Tweet media two
0
0
5
@mejia_petit
Nicolas Mejia Petit
3 months
Could anyone from @IBMResearch possibly tell me why y’all removed the first 8 and last 8 layers of the model before merging for continued pre-training? Was it just to hit the optimal parameter size, or was there other logic behind it?
Tweet media one
Tweet media two
4
2
10
@mejia_petit
Nicolas Mejia Petit
3 months
Really cool, can’t lie, and the fact I won’t have to name a fine-tune “llama-3” is even cooler. Shout out to MSFT for the MIT license! I hope we can also get the 7b and the 14b! @MSFTResearch :)
Tweet media one
1
0
8
@mejia_petit
Nicolas Mejia Petit
4 months
UPDATE: I’ve done some testing with V.3 using a different base model. It appears to be going great (based off the loss compared to the previous base model). I will start V.3 training with the large dataset once V.2 is done uploading. I think we’re gonna have ourselves a solid 22b!
1
0
10
@mejia_petit
Nicolas Mejia Petit
3 months
Everyone who asks for evals: this is why. Not everyone can afford to do it. A lot of the time a training run can take less time than the eval. (This is with vLLM on poorer systems, not A100s and above.) vLLM is soon gonna get INT8, so maybe soon I can afford to eval MMLU :)
@Teknium1
Teknium (e/λ)
3 months
@Ahmad_Al_Dahle @edunov @Gradient_AI_ I can run them but not mmlu my 4090s take like 24hrs to run it lol
2
0
4
0
0
10
@mejia_petit
Nicolas Mejia Petit
3 months
Whatever RL approach OpenAI took with Q-star (likely a variant of Q-learning applied to LLMs), it did not fine-tune a 1-billion-parameter GPT-2 model, pretrained on only 10 billion tokens, to outperform every current model. It’s just not possible (@ gpt2 speculators)
@sama
Sam Altman
3 months
im-a-good-gpt2-chatbot
645
426
7K
1
0
9
@mejia_petit
Nicolas Mejia Petit
2 months
@burkov 4096 is literally nothing. Every day I’m more thankful for the Chinese groups, like DeepSeek and Qwen. In my personal use, Qwen 2 has been great for synthetic data; its JSON mode is terrific! And it handles 128k context.
0
0
9
@mejia_petit
Nicolas Mejia Petit
3 months
@JagersbergKnut There is actually a paper out by Microsoft that shows the results of cutting the embeddings and replacing matrices with smaller ones to compress the model, then repairing it with a LoRA to reach ~99% of the original quality. Awesome job 👏 @chargoddard, I love it! 77 MMLU!!!
0
0
8
@mejia_petit
Nicolas Mejia Petit
4 months
@teortaxesTex @Prince_Canuma Too long for a non verified tweet lol
Tweet media one
1
0
8
@mejia_petit
Nicolas Mejia Petit
4 months
@mattshumer_ Easy peasy with @UnslothAI. It automatically applies RoPE scaling through the LoRA, letting you expand to whatever sequence length you’d like :), and it’s very efficient even when it spills into shared memory; with gradient checkpointing offloading and optimizer offloading it’s so clutch.
1
1
8
@mejia_petit
Nicolas Mejia Petit
5 months
@realGeorgeHotz You can get over a petaOP/s with 2 RTX 3090s with an INT4 dtype. All we need is an INT4/8 dtype that uses the INT4/8 engine with no upcasting. This would help everyone, since the 7900 XTX also allows for INT4 and INT8.
Tweet media one
0
0
6
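A back-of-the-envelope check of that claim, using the commonly cited Ampere tensor-core figure of roughly 568 dense INT4 TOPS per RTX 3090 (about double with 2:4 structured sparsity). Treat the number as approximate rather than authoritative.

```python
# Rough check of the "over a petaOP/s with two 3090s" claim.
# ~568 INT4 TOPS is the commonly cited dense tensor-core figure for an RTX 3090;
# this is a ballpark figure, not an exact spec.
int4_tops_per_3090 = 568            # tera-operations per second, dense INT4
total_tops = 2 * int4_tops_per_3090
print(f"{total_tops} TOPS ≈ {total_tops / 1000:.2f} peta-ops/s")  # ~1.14
```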
@mejia_petit
Nicolas Mejia Petit
4 months
@dchaplot But @NousResearch had already trained a YaRN version of the Mistral base to 128k context length, so this release was a bit obsolete.
3
2
7
@mejia_petit
Nicolas Mejia Petit
3 months
Doing some tests with Microsoft’s WaveCoder-Ultra-6.7b, and I really like it. A small, powerful local coding model with a HumanEval of 79.9. Here is a BF16 version of the model I just made for y’all to use, since only the FP32 version is on the Hub.
0
0
7
@mejia_petit
Nicolas Mejia Petit
4 months
15 trillion tokens. My god 🤯
Tweet media one
1
0
7
@mejia_petit
Nicolas Mejia Petit
3 months
@dylan522p @cis_female I think Phi-2 was around 33% on code? But in actual use it was very terrible. Also, training Phi-2 was a headache due to the architecture. There has been an ongoing issue on Transformers since release to fix Phi-2 training loss with FA2, since VRAM consumption was terrible.
1
0
7
@mejia_petit
Nicolas Mejia Petit
3 months
God that’s so sexy
Tweet media one
1
1
6
@mejia_petit
Nicolas Mejia Petit
2 months
Tested Qwen 2 7B Instruct’s JSON mode on my personal 3-shot JSON benchmark: ~4.3% failure rate. It outperforms any other OSS model I have tried, even Codestral 22B and Mistral 7B v3 Instruct (~20% failure rate). Congrats to the Qwen team. What an excellent Apache 2 model!
0
0
5
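The benchmark itself is personal and unpublished, but the metric is easy to picture: run the prompts, then count how many completions fail to parse as JSON. A minimal sketch, with made-up completions standing in for real model output:

```python
import json

def json_failure_rate(outputs: list[str]) -> float:
    """Fraction of model responses that are not valid JSON.
    `outputs` would be the raw completions from the model under test."""
    failures = 0
    for text in outputs:
        try:
            json.loads(text)
        except json.JSONDecodeError:
            failures += 1
    return failures / len(outputs)

# Toy example with fabricated completions (one is deliberately malformed):
sample = ['{"name": "Ada", "age": 36}', '{"name": "Bob", age: }', '[1, 2, 3]']
print(f"failure rate: {json_failure_rate(sample):.1%}")  # 33.3% here
```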
@mejia_petit
Nicolas Mejia Petit
5 months
@Teknium1 There is no fucking way they are running all 8 agents in production. They really used MergeKit and uploaded the base model 😂😂. If this is the Grok that’s been out, that is terrible performance/efficiency. I’ll wait a day for someone to rip out a single agent from it.
1
0
6
@mejia_petit
Nicolas Mejia Petit
4 months
This is how I figured out Mixtral 8x22b was made from a single dense 22B model, through my test with the compressed MOE and a single expert. (And ofc the community ❤️!)
Tweet media one
2
0
6
@mejia_petit
Nicolas Mejia Petit
3 months
Interesting. The model has a lot more learnable space compared to an MLP’s fixed layers. Very cool.
Tweet media one
@aidan_mclau
Aidan McLau
3 months
wake up new neural network just dropped (holy shit)
Tweet media one
Tweet media two
121
944
10K
1
0
6
@mejia_petit
Nicolas Mejia Petit
4 months
@Yampeleg You should integrate the “Unsloth” gradient accumulation for more memory savings and a speed increase; Axolotl has a full PR where they merged it into their trainer. It’ll let you squeeze some longer context length in there, and definitely use the paged Adam 8-bit optimizer.
1
0
6
@mejia_petit
Nicolas Mejia Petit
5 months
@Yampeleg I wish I could merge some of the DeepSeek 6.7B (or even Code Llama 7B) layers on top of Mistral, but they are not compatible: different tensor shapes. If Mistral 7B could code like DeepSeek 6.7B, Mistral would be insanely good.
0
0
6
@mejia_petit
Nicolas Mejia Petit
4 months
I am proposing someone make a 1.58-bit BitNet AdamW that runs on the CPU. Since BitNet doesn't require matmul, theoretically the entire computation could be offloaded to the CPU. It would actually be fast, so training won't get bogged down, with 0 GPU memory overhead.
0
2
6
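For context on the "doesn't require matmul" point: BitNet b1.58 quantizes weights to {-1, 0, +1} with an absmean scale, so a matrix-vector product reduces to additions and subtractions of activations. A tiny NumPy sketch of that idea (not the CPU optimizer being proposed, and not the paper's exact kernels):

```python
import numpy as np

def absmean_ternary(w: np.ndarray):
    """BitNet-b1.58-style quantization: scale by mean |w|, round to {-1, 0, +1}."""
    scale = np.abs(w).mean() + 1e-8
    q = np.clip(np.round(w / scale), -1, 1).astype(np.int8)
    return q, scale

def ternary_matvec(q: np.ndarray, scale: float, x: np.ndarray) -> np.ndarray:
    """y ≈ W @ x without multiplying by weights: each ternary weight
    either adds, subtracts, or skips the corresponding activation."""
    y = np.zeros(q.shape[0], dtype=x.dtype)
    for i in range(q.shape[0]):
        row = q[i]
        y[i] = x[row == 1].sum() - x[row == -1].sum()
    return y * scale

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8)).astype(np.float32)
x = rng.normal(size=8).astype(np.float32)
q, s = absmean_ternary(W)
print(ternary_matvec(q, s, x))   # coarse approximation of W @ x
```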
@mejia_petit
Nicolas Mejia Petit
3 months
Satya's face when everyone realizes GPT2 is Phi-3-14b
Tweet media one
0
0
6
@mejia_petit
Nicolas Mejia Petit
5 months
@AndrewCurran_ 100%, I hope this does happen. Although OAI’s lawyers are just gonna drag this out as long as possible, till anything meaningful could happen.
0
0
5
@mejia_petit
Nicolas Mejia Petit
4 months
@mattshumer_ Already training it to 32k on code with rope :)
2
0
6
@mejia_petit
Nicolas Mejia Petit
3 months
Looking through it again, Ampere cards should work. I think other cards might need a bit of work; I believe it’s dependent on this:
Tweet media one
1
0
5
@mejia_petit
Nicolas Mejia Petit
5 months
@abacaj You should check out “”, trained to 32k and able to handle up to 128k tokens. Unfortunately it still uses the old Phi-2 code, but it would make for a sick base model.
0
0
5
@mejia_petit
Nicolas Mejia Petit
2 months
@neildecrypt @Teknium1 @levelsio Smart. OAI was previously criticized for using labor like that in India, where people had to filter the worst of the worst available on the internet, hand-labeling datasets, etc. So now they just hire someone else to take care of that: no bad press. Smart.
0
0
0
@mejia_petit
Nicolas Mejia Petit
3 months
The amount of respect I have for @EMostaque is top tier. The reason he left his position as manager of a freaking hedge fund and got into machine learning was his child’s autism. He is a big reason I do what I do; the fact that he knows I exist is insane.
0
0
5
@mejia_petit
Nicolas Mejia Petit
3 months
@elegyals @get_palet @yacineMTB Twitter did not show me the rest of this thread; I’m seeing it now 🤯. No way Apple just became the budget king in any department. My jaw is on the floor.
0
0
5
@mejia_petit
Nicolas Mejia Petit
4 months
@Teknium1 I’m currently seeing how I can rip out the 22b and then add some continued pretraining to it, so it’s a working base model.
0
0
5
@mejia_petit
Nicolas Mejia Petit
3 months
@rohanpaul_ai WHOAAAA, I just read that last part: the FP6 was done on an A100 (without FP8 support). That’s crazy 🤗🤗! #FreeAmpereCards
0
0
5
@mejia_petit
Nicolas Mejia Petit
4 months
@slashdot So they are worried that music is being democratized and anyone can make music, regardless of whether they spent years learning to create and compose it. Idk, personally I like to reward creativity, not just people with the time and money to create/promote their music.
2
0
5
@mejia_petit
Nicolas Mejia Petit
4 months
@MistralAILabs Please drop the 22b base, it would save me a ton of time, money, and work. 🙏🙏
0
0
5
@mejia_petit
Nicolas Mejia Petit
3 months
@felix_red_panda Actually, I think I’m gonna cancel my iCloud and buy one of these. Tf am I paying money for? Screw the convenience; a Raspberry Pi and one of these can make a whole home media backup server.
0
0
5
@mejia_petit
Nicolas Mejia Petit
3 months
Whoa!!! Python 3.13 now supports iOS devices!! You will still need to compile it from source, as there is no prebuilt version available, but it’s technically compatible. We’re really gonna be able to train models on an iPhone.
Tweet media one
0
2
5
@mejia_petit
Nicolas Mejia Petit
4 months
@KinasRemek I’m about to write a quick blog post describing how I did it. I’m getting tons of questions. I was originally planning to do a paper so I could show results, but with so many people asking, I’m just going to rush out a blog post!
1
0
5
@mejia_petit
Nicolas Mejia Petit
5 months
@Teknium1 Phi-2 beat Llama 13B and Mistral 7B on some benchmarks. Pre-training a 7B model on that data should yield significantly more performance.
1
0
5
@mejia_petit
Nicolas Mejia Petit
3 months
@Teknium1 I’m bullish on @AMD making strong integrated NPUs for inference, but I doubt that will ever carry over into training unless they make some monstrous NPU, like the GH200’s CPU.
1
0
3
@mejia_petit
Nicolas Mejia Petit
3 months
@get_palet @yacineMTB @elegyals Are they just hooked up through Thunderbolt? Does it go brrr?
1
0
5
@mejia_petit
Nicolas Mejia Petit
3 months
I’m just surprised there are that many groups with 45TB of space to download the dataset. This also means there are gonna be a ton of 15T-token pre-trained models out soon 🤗
@ClementDelangue
clem 🤗
3 months
The GPT4 of datasets took down Hugging Face, sorry all 😅😅😅
24
43
905
1
0
3
@mejia_petit
Nicolas Mejia Petit
5 months
@unslothai Running Unsloth on Windows to train models 2x faster than regular HF + FA2 and with 2x less memory, letting me do a batch size of 10 with a sequence length of 2048 on a single 3090. Need a tutorial to install Triton on Windows? It’s pinned to my profile!
Tweet media one
3
0
4
@mejia_petit
Nicolas Mejia Petit
3 months
It would be very funny if the “gpt-2-chatbot” is phi-3 14b.
0
0
4
@mejia_petit
Nicolas Mejia Petit
3 months
This should work with H100s, H200s, and 4090s as well. As long as you have INT8 and FP16, it can theoretically work. These kernels were written for the A100 though, so expect current support to be only the A100, 3090, and L40 (unless the architecture shown above is the same).
2
0
4
@mejia_petit
Nicolas Mejia Petit
3 months
I was wondering why no one talked about Apple’s new open source models, ranging from 270M to 3B. Then I saw the 3B instruct got a 25% on MMLU 😅. (Granted, MMLU is a terrible benchmark, but 25% is terrible; however, they did highlight that it has a very small training size.)
Tweet media one
1
0
4
@mejia_petit
Nicolas Mejia Petit
3 months
Yeah, the hype is real. I got access to GPT-4o, and its coding performance, especially with RAG, is insane in comparison to GPT-4. It’s making scripts Bing GPT-4 struggled to get even close to.
0
0
3
@mejia_petit
Nicolas Mejia Petit
4 months
@MistralAI It’s Mixtral 8x22b. Could we also get the 22b 🥺?
1
0
4
@mejia_petit
Nicolas Mejia Petit
5 months
@StasBekman @PyTorch And it does have FP8. Looks like INT4 software support might be dead in the water :( unless someone in the OSS community wants to dedicate a massive amount of time. Sucks, ’cause it’s a good compute type for the GPU poor. 4 bits FTW.
2
0
2
@mejia_petit
Nicolas Mejia Petit
3 months
How do you write an entire well-written research paper with graphs and 50 different citations, but no fucking code? I promise we won’t bite at badly written code; we just want the damn code, even as just boilerplate. I BEG, JUST GIVE US THE DAMN CODE.
1
0
4
@mejia_petit
Nicolas Mejia Petit
3 months
@AlpayAriyak Literally 😂😂. The way they pack 4 FP6 values into 3 uint8s, making a Frankenstein array, is the smartest thing I have seen in a very long time. The A100s and 3090s got a major boost. This needs to get added to @PyTorch ASAP.
Tweet media one
Tweet media two
0
0
3
@mejia_petit
Nicolas Mejia Petit
3 months
@EMostaque @dylan522p @cis_female I completely agree. I made Tested Python 22k and Tested Python 143k, and there is a preference dataset @jon_durbin made using my Tested 22k. I’m excited to use those on the new base models.
0
0
3
@mejia_petit
Nicolas Mejia Petit
3 months
Google Colab is going to have to get a lot more competitive 😳. (This is proof we are RAM-limited. Someone, for the love of god, sell us a cheap GPU with a bunch of memory. It doesn’t have to be the fastest thing ever.)
@awnihannun
Awni Hannun
3 months
Next level: QLoRA fine-tuning 4-bit Llama 3 8B on iPhone 15 Pro. Incoming (Q)LoRA MLX Swift example by David Koski: works with lots of models (Mistral, Gemma, Phi-2, etc)
28
123
946
0
0
4
@mejia_petit
Nicolas Mejia Petit
5 months
@TheBlokeAI are you okay? We miss you! We wanna hear from you!
2
1
4
@mejia_petit
Nicolas Mejia Petit
4 months
@XPhyxer1 @dchaplot @NousResearch It never got the attention it deserved. It’s pretty freaking awesome though: in 4-bit on a single 3090 I can run it at max context length, and its perplexity is really, really good for a 7B model.
Tweet media one
0
1
4
@mejia_petit
Nicolas Mejia Petit
2 months
@ylecun Military research gave us satellites, the internet, GPS, microwaves, commercial computers, and public roads, to name a few. It would suck to live in a world without those.
1
0
4
@mejia_petit
Nicolas Mejia Petit
4 months
@Teknium1 Agent, planning, coding, and iterative multi turn coding examples.
0
0
3
@mejia_petit
Nicolas Mejia Petit
4 months
@realmrfakename @Teknium1 @chargoddard Definitely will! I’m gonna be here with my fingers crossed, hoping @chargoddard can send me the script! The second he does, I’ll get started.
1
0
3
@mejia_petit
Nicolas Mejia Petit
5 months
@YangYou1991 Hard to compete with the compute OpenAI was willing to throw at it.
1
0
3
@mejia_petit
Nicolas Mejia Petit
2 months
@Teknium1 @danielhanchen @UnslothAI Smaller learning rates for the embed_tokens layer and lm_head, to avoid throwing them off during training; but you should read the Unsloth blog post on continued pre-training, it’s really good.
0
0
3
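A minimal PyTorch sketch of that idea: put the embedding and LM-head parameters in their own optimizer group with a smaller learning rate. The 10x ratio and the stand-in model are illustrative assumptions (Unsloth's continued-pretraining setup exposes a separate embedding learning rate more directly, if I recall correctly).

```python
import torch
from transformers import AutoModelForCausalLM

# Stand-in model for illustration; GPT-2 names its embeddings wte/wpe,
# Llama-style models use embed_tokens and lm_head.
model = AutoModelForCausalLM.from_pretrained("gpt2")

embed_names = ("embed_tokens", "lm_head", "wte", "wpe")
slow, fast = [], []
for name, p in model.named_parameters():
    (slow if any(k in name for k in embed_names) else fast).append(p)

optimizer = torch.optim.AdamW([
    {"params": fast, "lr": 2e-4},
    {"params": slow, "lr": 2e-5},   # 10x smaller for embeddings / head (illustrative)
])
```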
@mejia_petit
Nicolas Mejia Petit
3 months
@ollama Converted 8x22b to a single 22b model. I’m currently trying to pre-train the model, but I need more compute, preferably for an FFT.
0
0
2
@mejia_petit
Nicolas Mejia Petit
4 months
@realmrfakename @Teknium1 @chargoddard I'm fairly certain I can make it work. One time I merged a bunch of Mistral layers from different models and got garbage output, but I repaired it with a LoRA (it took a bit of training, back when Unsloth didn't exist), and it gave coherent outputs.
0
0
3
@mejia_petit
Nicolas Mejia Petit
4 months
@Teknium1 @EugeneNicholas0 @The_Real_Paolo @xlr8harder @ylecun A vision Llama 8B trained with V-JEPA would go so hard tho 😔. Inspo for it is low tho, since V-JEPA’s under a research license. I just pray they didn’t abandon the ~30B; it was such a perfect size.
1
0
3
@mejia_petit
Nicolas Mejia Petit
4 months
I’m getting very, very interesting training loss for v0.3 22B. I’m very excited to try it, but I think this will be the base model I’ll *cheap pre-train* and drop as a base model, so anyone can easily fine-tune it to their own needs or make it better.
3
0
3
@mejia_petit
Nicolas Mejia Petit
4 months
@1littlecoder As Nadella said, "If OpenAI disappeared tomorrow, I don't want any customer of ours to be worried about it quite honestly, because we have all of the rights to continue the innovation." They have practically already been bought out, like PayPal was back in the day.
Tweet media one
Tweet media two
0
0
1
@mejia_petit
Nicolas Mejia Petit
3 months
@dan_biderman @jefrankle I’ve been using rsLoRA on all my training runs; it’s consistently done better than without. I agree with your findings on code performance with LoRA. I fine-tuned Llama 2 7B on 22k examples of tested Python code and it gave marginally better results. I like the idea of DoRA. I heard from
Tweet media one
1
0
3
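For reference, rank-stabilized LoRA is a one-flag change in PEFT: it swaps the usual alpha/r scaling for alpha/sqrt(r), so updates don't shrink as the rank grows. The rank, alpha, and target modules below are illustrative, not the settings from these runs.

```python
from peft import LoraConfig

# rsLoRA in PEFT: only the use_rslora flag differs from a standard LoRA config.
lora_config = LoraConfig(
    r=64,
    lora_alpha=64,
    lora_dropout=0.0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    use_rslora=True,   # scale by lora_alpha / sqrt(r) instead of lora_alpha / r
    task_type="CAUSAL_LM",
)
# Then: model = get_peft_model(base_model, lora_config)
```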
@mejia_petit
Nicolas Mejia Petit
4 months
I have started training v0.3 with a new base model; preliminary results seem even more promising than v0.2.
1
0
3
@mejia_petit
Nicolas Mejia Petit
2 months
@liliang_ren Thank you for the correction, I missed that in the paper; I was a bit too excited about the code 😆. To me that is still just as impressive nonetheless; it empirically proves how well hybrid SSMs perform in training.
Tweet media one
0
0
3
@mejia_petit
Nicolas Mejia Petit
3 months
@Teknium1 The resemblance is uncanny
Tweet media one
0
0
3
@mejia_petit
Nicolas Mejia Petit
5 months
@Tobias_Writes It takes nine 2024 Toyota Prius batteries to run that. Imagine speaking about compute in Toyota Prius battery numbers 😂😂.
Tweet media one
0
0
3
@mejia_petit
Nicolas Mejia Petit
4 months
Why Llama 3 isn’t an MOE: MOEs don’t provide that many benefits for the GPU poor. The experts don’t even learn specific topics; most of the time they just become experts in grammar. Making a quality dense model that could then be up-cycled into an MOE, like Mixtral, is more useful.
1
0
3
@mejia_petit
Nicolas Mejia Petit
2 months
@code_star @Yuchenj_UW @karpathy Yes, there is research in that domain! Also, the quality of the data is relatively high; they are using FineWeb, a very clean dataset. Now it won’t get to ChatGPT level, due to the fact that we lack their large ~100k human-written SFT dataset (according to Karpathy).
6
0
3
@mejia_petit
Nicolas Mejia Petit
3 months
@rohanpaul_ai I had talked to one of the DeepSpeed developers about using native INT4, and I sent him a paper. He said thx and that they’re looking into INT4, and I think this is what he was referring to. That is so cool; I hope these INT4 kernels are native, the HF Quanto repo could use them :)
0
0
3
@mejia_petit
Nicolas Mejia Petit
3 months
@felix_red_panda Whoaaa, I haven’t bought a hard drive in 10 years; I did not know they got so cheap. 10 years ago I bought a 1TB for almost $100, I think, and in today’s money that would be like $200.
Tweet media one
2
0
3
@mejia_petit
Nicolas Mejia Petit
4 months
@migtissera They could’ve also mentioned it’s 1/10 the size of GPT-4 with similar performance, without any RLHF or DPO for better performance (not that safety BS OpenAI does, making their models dumber but ‘safer’).
0
0
2
@mejia_petit
Nicolas Mejia Petit
3 months
@nickfrosst That’s for the people asf.
0
0
3
@mejia_petit
Nicolas Mejia Petit
3 months
It would be so funny to shove Llama 8B with vision + Meta’s voice-cloned text-to-speech into a Transformers robot.
1
0
3