Nicolas Mejia Petit

@mejia_petit

749
Followers
144
Following
101
Media
968
Statuses

LLM researcher // Made the Tested Python 22k and 143k datasets, the first Mixtral 22B MOE-to-dense model, open-critic-gpt, code-DPO //

Pinned Tweet
@mejia_petit
Nicolas Mejia Petit
4 months
🚀 Introducing Mistral-22b-V.01, a breakthrough in AI! 🧠💡 First-ever MOE-to-dense model conversion 🔥 #Mistral22bV01 This model is NOT an MOE (it only has 22B params).
22
79
542
@mejia_petit
Nicolas Mejia Petit
3 months
Why isn’t everyone talking about this??? DeepSpeed devs literally just created an FP6 datatype with full tensor core support on A100s (since NVIDIA left us stranded with INT4/8). It is SO smart, just reading through the kernel, my god.
Tweet media one
@rohanpaul_ai
Rohan Paul
3 months
LLaMA-70B inference using only a single GPU, achieving 1.69x-2.65x higher normalized inference throughput than the FP16 baseline with six-bit quantization (FP6) 🔥 DeepSpeed has just recently released this paper and also integrated the FP6 quantization - "FP6-LLM:
Tweet media one
10
108
611
10
106
652
@mejia_petit
Nicolas Mejia Petit
4 months
Mistral-22B-v0.2 is out now 🎉🎉! With significant improvements all across the board! Trained on 8x more data, the model is significantly better at handling user queries! Try it out here!
9
82
407
@mejia_petit
Nicolas Mejia Petit
4 months
🚀 V2 is done training. My god, it’s so much better than V1 🎉. The improvement was like going from Llama 1 ➡️ 2. Currently uploading 🚀🌌. I think I’m going to continue pre-training the V1/V2 base model, then SFT. (Currently testing a different base model before I decide to pre-train.)
3
4
92
@mejia_petit
Nicolas Mejia Petit
3 months
Please help me peer pressure @MSFTResearch into dropping the data used to take DeepSeek 6.7B from a 49.4 HumanEval to 79.9.
Tweet media one
3
4
43
@mejia_petit
Nicolas Mejia Petit
3 months
Just wrote a feature request to get an FP6 dtype into PyTorch. Help me out and put an emoji on that hoe! 🚀🚀
Tweet media one
1
3
45
@mejia_petit
Nicolas Mejia Petit
4 months
When life gives you MOE’s, compress them back to dense models.
1
1
32
@mejia_petit
Nicolas Mejia Petit
4 months
Models fully uploaded! 🎉🎉 Be warned, this model is highly uncensored. It will try to answer anything you ask.
4
4
29
@mejia_petit
Nicolas Mejia Petit
3 months
LM Studio has no right being this fast on a single 3090. This is insane. @LMStudioAI
Tweet media one
1
2
28
@mejia_petit
Nicolas Mejia Petit
4 months
Just started my finetune of Llama 8B Instruct using @UnslothAI . (Yeah, it’s running at peak performance even while using 24GB of shared VRAM! Expanding sequence length to 32k from 8k using RoPE 🪄 and a batch size of 2!) Literally pure magic! (aka highly optimized Triton kernels)
Tweet media one
1
3
27
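For anyone curious what a run like this looks like, here is a rough sketch using Unsloth's documented loading/LoRA API. The checkpoint name, rank, and 32k max_seq_length are illustrative placeholders rather than the exact settings from the run above; Unsloth handles RoPE scaling internally when max_seq_length exceeds the model's native context.

```python
from unsloth import FastLanguageModel

# Illustrative settings only; not the exact run described in the tweet above.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3-8b-Instruct-bnb-4bit",  # assumed checkpoint name
    max_seq_length=32768,   # RoPE scaling handled internally past the native 8k
    load_in_4bit=True,
)

model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing="unsloth",  # offloads activations to save VRAM
)
# From here, training proceeds with TRL's SFTTrainer as usual
# (e.g. per_device_train_batch_size=2).
```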
@mejia_petit
Nicolas Mejia Petit
4 months
@danielhanchen Thank you! All will be revealed with the paper, or a blog post (not on Medium), still haven’t decided; hopefully V2 can show promising results. I have a few other tricks up my sleeve I have yet to try. And it was very helpful, otherwise training wouldn’t have fit on my GPU!
2
0
24
@mejia_petit
Nicolas Mejia Petit
4 months
@ClementDelangue @huggingface You might find this cool, a repost would really help, took a ton of time making it.
2
4
18
@mejia_petit
Nicolas Mejia Petit
3 months
@Teknium1 @Euclaise_ I feel like getting the 3.3T token dataset would be the best thing we could get
1
0
15
@mejia_petit
Nicolas Mejia Petit
4 months
(V.2 is currently training on 8x more data of significantly higher quality, with almost all of it being multi-turn. Please keep in mind V.1 was the experiment before I went for a bigger dataset, but my hand got forced, so I had to drop V.1 before V.2 could finish training.)
2
0
16
@mejia_petit
Nicolas Mejia Petit
3 months
The way it runs, it should also theoretically work on ROCm cards, 20- and 30-series consumer cards, and the T4. Essentially any FP16 card with tensor cores *could* work. Correct me if I’m wrong. Here is the merge commit if you’d like to read through it:
1
1
15
@mejia_petit
Nicolas Mejia Petit
4 months
@WizardLM_AI @QNixSynapse Whose regulation are you referring to? Your country’s regulation?
2
0
14
@mejia_petit
Nicolas Mejia Petit
5 months
@ChaseMc67 GPT-4 acts like it is on an active mission to save compute resources and make you do all the work. No matter how many times you specify in the prompt “Only give fully written code. Do not give me incomplete code. Do not tell me to (code goes here), just write the code”
2
0
13
@mejia_petit
Nicolas Mejia Petit
5 months
I am proud to release my newest coding dataset: 143k examples of tested Python code. Why train on Python code that doesn't work, when you can train on a large variety of tested Python code!
2
3
13
@mejia_petit
Nicolas Mejia Petit
3 months
Same model, just 3.3X more data and 200,000 more trainable params, and it’s still on epoch 1.6, while the other finished its 3 epochs of training. More high-quality data is all you need :)
Tweet media one
Tweet media two
0
0
13
@mejia_petit
Nicolas Mejia Petit
2 months
Wait a minute… you are telling me Samba 3.8B beats Phi-3 (3.8B) while training on SlimPajama? Wow 🤯.
Tweet media one
Tweet media two
1
0
13
@mejia_petit
Nicolas Mejia Petit
3 months
@joefioti So the FP16 values get quantized down to FP6; 4 × 6 bits = 24 bits, which can be packed as 3 × 8 bits and stored as uint8_t. So I guess int8 does play a pivotal role. Here is how the packing is done:
Tweet media one
1
0
11
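To make the 4 × 6 bits = 3 bytes arithmetic above concrete, here is a minimal NumPy sketch of that packing scheme. It only illustrates the idea from the tweet; DeepSpeed's actual kernels lay the bits out in a hardware-friendly interleaved order, so the field ordering here is an assumption for illustration.

```python
import numpy as np

def pack_fp6(codes: np.ndarray) -> np.ndarray:
    """Pack groups of four 6-bit codes (values 0..63) into three uint8 bytes."""
    assert codes.size % 4 == 0
    v = codes.astype(np.uint32).reshape(-1, 4)
    # Concatenate four 6-bit fields into one 24-bit word (v0 in the top bits).
    word = (v[:, 0] << 18) | (v[:, 1] << 12) | (v[:, 2] << 6) | v[:, 3]
    out = np.empty((v.shape[0], 3), dtype=np.uint8)
    out[:, 0] = (word >> 16) & 0xFF
    out[:, 1] = (word >> 8) & 0xFF
    out[:, 2] = word & 0xFF
    return out.reshape(-1)

def unpack_fp6(packed: np.ndarray) -> np.ndarray:
    """Inverse of pack_fp6: recover four 6-bit codes from each 3-byte group."""
    b = packed.astype(np.uint32).reshape(-1, 3)
    word = (b[:, 0] << 16) | (b[:, 1] << 8) | b[:, 2]
    v = np.stack([(word >> 18) & 0x3F, (word >> 12) & 0x3F,
                  (word >> 6) & 0x3F, word & 0x3F], axis=1)
    return v.reshape(-1).astype(np.uint8)

codes = np.array([0, 63, 17, 42, 5, 9, 33, 60], dtype=np.uint8)
assert np.array_equal(unpack_fp6(pack_fp6(codes)), codes)
```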
@mejia_petit
Nicolas Mejia Petit
6 months
@jon_durbin @younesbelkada @Tim_Dettmers Before I fell asleep I wrote a request for 8-bit RMSprop to be added to TRL. I woke up to over 100 lines of code. HF team, y’all are amazing ❤️!
Tweet media one
1
1
10
@mejia_petit
Nicolas Mejia Petit
3 months
@msiUSA afterburner OC + @UnslothAI optimized kernels = Brrrr (Blame Dall•E-3 for cocaine bear 2.0)
Tweet media one
Tweet media two
0
0
5
@mejia_petit
Nicolas Mejia Petit
3 months
Could anyone from @IBMResearch possibly tell me why y’all removed the first 8 and last 8 layers of the model before merging for continued pre-training? Was it just to hit the optimal parameter size, or was there other logic behind it?
Tweet media one
Tweet media two
4
2
10
@mejia_petit
Nicolas Mejia Petit
3 months
Really cool, can’t lie, and the fact I won’t have to name a fine-tune “llama-3” is even cooler. Shout out to MSFT for the MIT license! I hope we can also get the 7b and the 14b! @MSFTResearch :)
Tweet media one
1
0
8
@mejia_petit
Nicolas Mejia Petit
4 months
UPDATE: I’ve done some testing with V.3 using a different base model. It appears to be going great (based off the loss compared to the previous base model). I will start V.3 training with the large dataset once V.2 is done uploading. I think we’re gonna have ourselves a solid 22b!
1
0
10
@mejia_petit
Nicolas Mejia Petit
3 months
Everyone who asks for evals: this is why. Not everyone can afford to do it. A lot of the time a training run can take less time than the eval. (This is with vLLM on poorer systems, not A100s and above.) vLLM is soon gonna get INT8, so maybe soon I can afford to eval MMLU :)
@Teknium1
Teknium (e/λ)
3 months
@Ahmad_Al_Dahle @edunov @Gradient_AI_ I can run them but not mmlu my 4090s take like 24hrs to run it lol
2
0
4
0
0
10
@mejia_petit
Nicolas Mejia Petit
3 months
Whatever RL approach OpenAI took with Q-star (likely a variant of Q-learning applied to LLMs), it did not fine-tune a 1-billion-parameter GPT-2 model, pretrained on only 10 billion tokens, to outperform every current model. It’s just not possible (@ gpt2 speculators)
@sama
Sam Altman
3 months
im-a-good-gpt2-chatbot
645
426
7K
1
0
9
@mejia_petit
Nicolas Mejia Petit
2 months
@burkov 4096 is literally nothing. Every day I’m more thankful for the Chinese groups, like DeepSeek and Qwen. In my personal use, Qwen 2 has been great for synthetic data; its JSON mode is terrific! And it handles 128k context.
0
0
9
@mejia_petit
Nicolas Mejia Petit
3 months
@JagersbergKnut There is actually a paper out by Microsoft that shows the results of cutting the embeddings and replacing matrices with smaller ones to compress the model, then repairing it with a LoRA to reach ~99% of the original quality. Awesome job 👏 @chargoddard, I love it! 77 MMLU!!!
0
0
8
@mejia_petit
Nicolas Mejia Petit
4 months
@teortaxesTex @Prince_Canuma Too long for a non verified tweet lol
Tweet media one
1
0
8
@mejia_petit
Nicolas Mejia Petit
4 months
@mattshumer_ Easy peasy with @UnslothAI. It automatically applies RoPE scaling through the LoRA, letting you expand to whatever sequence length you’d like :), and it’s very efficient even when it spills into shared memory; with gradient checkpointing offloading and optimizer offloading it’s so clutch.
1
1
8
@mejia_petit
Nicolas Mejia Petit
5 months
@realGeorgeHotz You can get over a petaOP/s with 2 RTX 3090s with an INT4 dtype. All we need is an INT4/8 dtype that uses the INT4/8 engine with no upcasting. This would help everyone, since the 7900 XTX also allows for INT4 and INT8.
Tweet media one
0
0
6
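A back-of-the-envelope check of that claim, using the commonly cited Ampere tensor-core figure of roughly 568 dense INT4 TOPS per RTX 3090 (about double with 2:4 structured sparsity). Treat the number as approximate rather than authoritative.

```python
# Rough check of the "over a petaOP/s with two 3090s" claim.
# ~568 INT4 TOPS is the commonly cited dense tensor-core figure for an RTX 3090;
# this is a ballpark figure, not an exact spec.
int4_tops_per_3090 = 568            # tera-operations per second, dense INT4
total_tops = 2 * int4_tops_per_3090
print(f"{total_tops} TOPS ≈ {total_tops / 1000:.2f} peta-ops/s")  # ~1.14
```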
@mejia_petit
Nicolas Mejia Petit
4 months
@dchaplot But @NousResearch had already trained a YaRN version of the Mistral base to 128k context length, so this release was a bit obsolete.
3
2
7
@mejia_petit
Nicolas Mejia Petit
3 months
Doing some tests with Microsoft’s WaveCoder-Ultra-6.7b, and I really like it. A small, powerful local coding model with a HumanEval of 79.9. Here is a BF16 version of the model I just made for y’all to use, since only the FP32 version is on the Hub.
0
0
7
@mejia_petit
Nicolas Mejia Petit
4 months
15 trillion tokens. My god 🤯
Tweet media one
1
0
7
@mejia_petit
Nicolas Mejia Petit
3 months
@dylan522p @cis_female I think Phi-2 was around 33% on code? But in actual use it was very terrible. Also, training Phi-2 was a headache due to the architecture. There has been an ongoing issue on Transformers since release to fix Phi-2 training loss with FA2, since VRAM consumption was terrible.
1
0
7
@mejia_petit
Nicolas Mejia Petit
3 months
God that’s so sexy
Tweet media one
1
1
6
@mejia_petit
Nicolas Mejia Petit
2 months
Tested Qwen 2 7B Instruct’s JSON mode on my personal 3-shot JSON benchmark: ~4.3% failure rate. It outperforms any other OSS model I have tried, even Codestral 22B and Mistral 7B v3 Instruct (~20% failure rate). Congrats to the Qwen team. What an excellent Apache 2 model!
0
0
5
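The benchmark itself is personal and unpublished, but the metric is easy to picture: run the prompts, then count how many completions fail to parse as JSON. A minimal sketch, with made-up completions standing in for real model output:

```python
import json

def json_failure_rate(outputs: list[str]) -> float:
    """Fraction of model responses that are not valid JSON.
    `outputs` would be the raw completions from the model under test."""
    failures = 0
    for text in outputs:
        try:
            json.loads(text)
        except json.JSONDecodeError:
            failures += 1
    return failures / len(outputs)

# Toy example with fabricated completions (one is deliberately malformed):
sample = ['{"name": "Ada", "age": 36}', '{"name": "Bob", age: }', '[1, 2, 3]']
print(f"failure rate: {json_failure_rate(sample):.1%}")  # 33.3% here
```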
@mejia_petit
Nicolas Mejia Petit
5 months
@Teknium1 There is no fucking way they are running all 8 agents in production. They really used MergeKit and uploaded the base model 😂😂. If this is the Grok that’s been out, that is terrible performance/efficiency. I’ll wait a day for someone to rip out a single agent from it.
1
0
6
@mejia_petit
Nicolas Mejia Petit
4 months
This is how I figured out Mixtral 8x22b was made from a single dense 22B model, through my test with the compressed MOE and a single expert. (And ofc the community ❤️!)
Tweet media one
2
0
6
@mejia_petit
Nicolas Mejia Petit
3 months
Interesting. The model has a lot more learnable space compared to an MLP’s fixed layers. Very cool.
Tweet media one
@aidan_mclau
Aidan McLau
3 months
wake up new neural network just dropped (holy shit)
Tweet media one
Tweet media two
121
944
10K
1
0
6
@mejia_petit
Nicolas Mejia Petit
4 months
@Yampeleg You should integrate the “Unsloth” gradient accumulation for more memory savings and a speed increase; Axolotl has a full PR where they merged it into their trainer. It’ll let you squeeze some longer context length in there, and definitely use the paged Adam 8-bit optimizer.
1
0
6
@mejia_petit
Nicolas Mejia Petit
5 months
@Yampeleg I wish I could merge some of the DeepSeek 6.7B (or even Code Llama 7B) layers on top of Mistral, but they are not compatible: different tensor shapes. If Mistral 7B could code like DeepSeek 6.7B, Mistral would be insanely good.
0
0
6
@mejia_petit
Nicolas Mejia Petit
4 months
I am proposing someone make a 1.58-bit BitNet AdamW that runs on the CPU. Since BitNet doesn't require matmul, theoretically the entire computation could be offloaded to the CPU. It would actually be fast, so training won't get bogged down, with 0 GPU memory overhead.
0
2
6
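For context on the "doesn't require matmul" point: BitNet b1.58 quantizes weights to {-1, 0, +1} with an absmean scale, so a matrix-vector product reduces to additions and subtractions of activations. A tiny NumPy sketch of that idea (not the CPU optimizer being proposed, and not the paper's exact kernels):

```python
import numpy as np

def absmean_ternary(w: np.ndarray):
    """BitNet-b1.58-style quantization: scale by mean |w|, round to {-1, 0, +1}."""
    scale = np.abs(w).mean() + 1e-8
    q = np.clip(np.round(w / scale), -1, 1).astype(np.int8)
    return q, scale

def ternary_matvec(q: np.ndarray, scale: float, x: np.ndarray) -> np.ndarray:
    """y ≈ W @ x without multiplying by weights: each ternary weight
    either adds, subtracts, or skips the corresponding activation."""
    y = np.zeros(q.shape[0], dtype=x.dtype)
    for i in range(q.shape[0]):
        row = q[i]
        y[i] = x[row == 1].sum() - x[row == -1].sum()
    return y * scale

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8)).astype(np.float32)
x = rng.normal(size=8).astype(np.float32)
q, s = absmean_ternary(W)
print(ternary_matvec(q, s, x))   # coarse approximation of W @ x
```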
@mejia_petit
Nicolas Mejia Petit
3 months
Satya's face when everyone realizes GPT2 is Phi-3-14b
Tweet media one
0
0
6
@mejia_petit
Nicolas Mejia Petit
5 months
@AndrewCurran_ 100%, I hope this does happen. Although OAI’s lawyers are just gonna drag this out as long as possible, till anything meaningful could happen.
0
0
5
@mejia_petit
Nicolas Mejia Petit
4 months
@mattshumer_ Already training it to 32k on code with rope :)
2
0
6
@mejia_petit
Nicolas Mejia Petit
3 months
Looking through it again, Ampere cards should work. I think other cards might need a bit of work; I believe it’s dependent on this:
Tweet media one
1
0
5
@mejia_petit
Nicolas Mejia Petit
5 months
@abacaj You should check out “”, trained to 32k and able to handle up to 128k tokens. Unfortunately it still uses the old Phi-2 code, but it would make for a sick base model.
0
0
5
@mejia_petit
Nicolas Mejia Petit
2 months
@neildecrypt @Teknium1 @levelsio Smart. OAI was previously criticized for using labor like that in India, where people had to filter the worst of the worst available on the internet, hand-labeling datasets, etc. So now they just hire someone else to take care of that: no bad press. Smart.
0
0
0
@mejia_petit
Nicolas Mejia Petit
3 months
The amount of respect I have for @EMostaque is top tier. The reason he left his position as manager of a freaking hedge fund and got into machine learning was his child’s autism. He is a big reason I do what I do; the fact that he knows I exist is insane.
0
0
5
@mejia_petit
Nicolas Mejia Petit
3 months
@elegyals @get_palet @yacineMTB Twitter did not show me the rest of this thread; I’m seeing it now 🤯. No way Apple just became the budget king in any department. My jaw is on the floor.
0
0
5
@mejia_petit
Nicolas Mejia Petit
4 months
@Teknium1 I’m currently seeing how I can rip out the 22b and then add some continued pretraining to it, so it’s a working base model.
0
0
5
@mejia_petit
Nicolas Mejia Petit
3 months
@rohanpaul_ai WHOAAAA, I just read that last part: the FP6 was done on an A100 (without FP8 support). That’s crazy 🤗🤗! #FreeAmpereCards
0
0
5
@mejia_petit
Nicolas Mejia Petit
4 months
@slashdot So they are worried that music is being democratized and anyone can make music, regardless of whether they spent years learning to create and compose it. Idk, personally I like to reward creativity, not just people with the time and money to create/promote their music.
2
0
5
@mejia_petit
Nicolas Mejia Petit
4 months
@MistralAILabs Please drop the 22b base, it would save me a ton of time, money, and work. 🙏🙏
0
0
5
@mejia_petit
Nicolas Mejia Petit
3 months
@felix_red_panda Actually, I think I’m gonna cancel my iCloud and buy one of these. Tf am I paying money for? Screw the convenience; a Raspberry Pi and one of these can make a whole home media backup server.
0
0
5
@mejia_petit
Nicolas Mejia Petit
3 months
Whoa!!! Python 3.13 now supports iOS devices!! You will still need to compile it from source, as there is no prebuilt version available, but it’s technically compatible. We’re really gonna be able to train models on an iPhone.
Tweet media one
0
2
5
@mejia_petit
Nicolas Mejia Petit
4 months
@KinasRemek I’m about to write a quick blog post describing how I did it. I’m getting tons of questions. I was originally planning to do a paper so I could show results, but with so many people asking, I’m just going to rush out a blog post!
1
0
5
@mejia_petit
Nicolas Mejia Petit
5 months
@Teknium1 Phi-2 beat Llama 13B and Mistral 7B on some benchmarks. Pre-training a 7B model on that data should yield significantly more performance.
1
0
5
@mejia_petit
Nicolas Mejia Petit
3 months
@Teknium1 I’m bullish on @AMD making strong integrated NPUs for inference, but I doubt that will ever carry over into training unless they make some monstrous NPU, like the GH200’s CPU.
1
0
3
@mejia_petit
Nicolas Mejia Petit
3 months
@get_palet @yacineMTB @elegyals Are they just hooked up through Thunderbolt? Does it go brrr?
1
0
5
@mejia_petit
Nicolas Mejia Petit
3 months
I’m just surprised there are that many groups with 45TB of space to download the dataset. This also means there are gonna be a ton of 15T-token pre-trained models out soon 🤗
@ClementDelangue
clem 🤗
3 months
The GPT4 of datasets took down Hugging Face, sorry all 😅😅😅
24
43
905
1
0
3
@mejia_petit
Nicolas Mejia Petit
5 months
@unslothai Running Unsloth on Windows to train models 2x faster than regular HF + FA2 and with 2x less memory, letting me do a batch size of 10 with a sequence length of 2048 on a single 3090. Need a tutorial to install Triton on Windows? It’s pinned to my profile!
Tweet media one
3
0
4
@mejia_petit
Nicolas Mejia Petit
3 months
It would be very funny if the “gpt-2-chatbot” is phi-3 14b.
0
0
4
@mejia_petit
Nicolas Mejia Petit
3 months
This should work with H100s, H200s, and 4090s as well. As long as you have INT8 and FP16, it can theoretically work. These kernels were written for the A100 though, so expect current support to be only the A100, 3090, and L40 (unless the architecture shown above is the same).
2
0
4
@mejia_petit
Nicolas Mejia Petit
3 months
I was wondering why no one talked about Apple’s new open source models, ranging from 270M to 3B. Then I saw the 3B instruct got a 25% on MMLU 😅. (Granted, MMLU is a terrible benchmark, but 25% is terrible; however, they did highlight that it has a very small training size.)
Tweet media one
1
0
4
@mejia_petit
Nicolas Mejia Petit
3 months
Yeah, the hype is real. I got access to GPT-4o, and its coding performance, especially with RAG, is insane in comparison to GPT-4. It’s making scripts Bing GPT-4 struggled to get even close to.
0
0
3
@mejia_petit
Nicolas Mejia Petit
4 months
@MistralAI It’s Mixtral 8x22b. Could we also get the 22b 🥺?
1
0
4
@mejia_petit
Nicolas Mejia Petit
5 months
@StasBekman @PyTorch And it does have FP8. Looks like INT4 software support might be dead in the water :( unless someone in the OSS community wants to dedicate a massive amount of time. Sucks, ’cause it’s a good compute type for the GPU poor. 4 bits FTW.
2
0
2
@mejia_petit
Nicolas Mejia Petit
3 months
How do you write an entire well-written research paper with graphs and 50 different citations, but no fucking code? I promise we won’t bite at badly written code; we just want the damn code, even as just boilerplate. I BEG, JUST GIVE US THE DAMN CODE.
1
0
4
@mejia_petit
Nicolas Mejia Petit
3 months
@AlpayAriyak Literally 😂😂. The way they pack 4 FP6 values into 3 uint8s, making a Frankenstein array, is the smartest thing I have seen in a very long time. The A100s and 3090s got a major boost. This needs to get added to @PyTorch ASAP.
Tweet media one
Tweet media two
0
0
3
@mejia_petit
Nicolas Mejia Petit
3 months
@EMostaque @dylan522p @cis_female I completely agree. I made Tested Python 22k and Tested Python 143k, and there is a preference dataset @jon_durbin made using my Tested 22k. I’m excited to use those on the new base models.
0
0
3
@mejia_petit
Nicolas Mejia Petit
3 months
Google Colab is going to have to get a lot more competitive 😳. (This is proof we are RAM-limited. Someone, for the love of god, sell us a cheap GPU with a bunch of memory. It doesn’t have to be the fastest thing ever.)
@awnihannun
Awni Hannun
3 months
Next level: QLoRA fine-tuning 4-bit Llama 3 8B on iPhone 15 Pro. Incoming (Q)LoRA MLX Swift example by David Koski: works with lots of models (Mistral, Gemma, Phi-2, etc)
28
123
946
0
0
4
@mejia_petit
Nicolas Mejia Petit
5 months
@TheBlokeAI are you okay? We miss you! We wanna hear from you!
2
1
4
@mejia_petit
Nicolas Mejia Petit
4 months
@XPhyxer1 @dchaplot @NousResearch It never got the attention it deserved. It’s pretty freaking awesome though: in 4-bit on a single 3090 I can run it at max context length, and its perplexity is really, really good for a 7B model.
Tweet media one
0
1
4
@mejia_petit
Nicolas Mejia Petit
2 months
@ylecun Military research gave us satellites, the internet, GPS, microwaves, commercial computers, and public roads, to name a few. It would suck to live in a world without those.
1
0
4
@mejia_petit
Nicolas Mejia Petit
4 months
@Teknium1 Agent, planning, coding, and iterative multi turn coding examples.
0
0
3
@mejia_petit
Nicolas Mejia Petit
4 months
@realmrfakename @Teknium1 @chargoddard Definitely will! I’m gonna be here with my fingers crossed, hoping @chargoddard can send me the script! The second he does, I’ll get started.
1
0
3
@mejia_petit
Nicolas Mejia Petit
5 months
@YangYou1991 Hard to compete with the compute OpenAI was willing to throw at it.
1
0
3
@mejia_petit
Nicolas Mejia Petit
2 months
@Teknium1 @danielhanchen @UnslothAI Smaller learning rates for the embed_tokens layer and lm_head, to avoid throwing them off during training; but you should read the Unsloth blog post on continued pre-training, it’s really good.
0
0
3
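A minimal PyTorch sketch of that idea: put the embedding and LM-head parameters in their own optimizer group with a smaller learning rate. The 10x ratio and the stand-in model are illustrative assumptions (Unsloth's continued-pretraining setup exposes a separate embedding learning rate more directly, if I recall correctly).

```python
import torch
from transformers import AutoModelForCausalLM

# Stand-in model for illustration; GPT-2 names its embeddings wte/wpe,
# Llama-style models use embed_tokens and lm_head.
model = AutoModelForCausalLM.from_pretrained("gpt2")

embed_names = ("embed_tokens", "lm_head", "wte", "wpe")
slow, fast = [], []
for name, p in model.named_parameters():
    (slow if any(k in name for k in embed_names) else fast).append(p)

optimizer = torch.optim.AdamW([
    {"params": fast, "lr": 2e-4},
    {"params": slow, "lr": 2e-5},   # 10x smaller for embeddings / head (illustrative)
])
```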
@mejia_petit
Nicolas Mejia Petit
3 months
@ollama Converted 8x22b to a single 22b model. I’m currently trying to pre-train the model, but I need more compute, preferably for an FFT.
0
0
2
@mejia_petit
Nicolas Mejia Petit
4 months
@realmrfakename @Teknium1 @chargoddard I'm fairly certain I can make it work. One time I merged a bunch of Mistral layers from different models and got garbage output, but I repaired it with a LoRA (it took a bit of training, back when Unsloth didn't exist), and it gave coherent outputs.
0
0
3
@mejia_petit
Nicolas Mejia Petit
4 months
@Teknium1 @EugeneNicholas0 @The_Real_Paolo @xlr8harder @ylecun A vision Llama 8B trained with V-JEPA would go so hard tho 😔. Inspo for it is low tho, since V-JEPA’s under a research license. I just pray they didn’t abandon the ~30B; it was such a perfect size.
1
0
3
@mejia_petit
Nicolas Mejia Petit
4 months
I’m getting very, very interesting training loss for v0.3 22B. I’m very excited to try it, but I think this will be the base model I’ll *cheap pre-train* and drop as a base model, so anyone can easily fine-tune it to their own needs or make it better.
3
0
3
@mejia_petit
Nicolas Mejia Petit
4 months
@1littlecoder As Nadella said, "If OpenAI disappeared tomorrow, I don't want any customer of ours to be worried about it quite honestly, because we have all of the rights to continue the innovation." They have practically already been bought out, like PayPal was back in the day.
Tweet media one
Tweet media two
0
0
1
@mejia_petit
Nicolas Mejia Petit
3 months
@dan_biderman @jefrankle I’ve been using rsLoRA on all my training runs; it’s consistently done better than without. I agree with your findings on code performance with LoRA. I fine-tuned Llama 2 7B on 22k examples of tested Python code and it gave marginally better results. I like the idea of DoRA. I heard from
Tweet media one
1
0
3
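For reference, rank-stabilized LoRA is a one-flag change in PEFT: it swaps the usual alpha/r scaling for alpha/sqrt(r), so updates don't shrink as the rank grows. The rank, alpha, and target modules below are illustrative, not the settings from these runs.

```python
from peft import LoraConfig

# rsLoRA in PEFT: only the use_rslora flag differs from a standard LoRA config.
lora_config = LoraConfig(
    r=64,
    lora_alpha=64,
    lora_dropout=0.0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    use_rslora=True,   # scale by lora_alpha / sqrt(r) instead of lora_alpha / r
    task_type="CAUSAL_LM",
)
# Then: model = get_peft_model(base_model, lora_config)
```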
@mejia_petit
Nicolas Mejia Petit
4 months
I have started training v0.3 with a new base model; preliminary results seem even more promising than v0.2.
1
0
3
@mejia_petit
Nicolas Mejia Petit
2 months
@liliang_ren Thank you for the correction, I missed that in the paper; I was a bit too excited about the code 😆. To me that is still just as impressive nonetheless; it empirically proves how well hybrid SSMs perform in training.
Tweet media one
0
0
3
@mejia_petit
Nicolas Mejia Petit
3 months
@Teknium1 The resemblance is uncanny
Tweet media one
0
0
3
@mejia_petit
Nicolas Mejia Petit
5 months
@Tobias_Writes It takes nine 2024 Toyota Prius batteries to run that. Imagine speaking about compute in Toyota Prius battery numbers 😂😂.
Tweet media one
0
0
3
@mejia_petit
Nicolas Mejia Petit
4 months
Why Llama 3 isn’t an MOE: MOEs don’t provide that many benefits for the GPU poor. The experts don’t even learn specific topics; most of the time they just become experts in grammar. Making a quality dense model that could then be up-cycled into an MOE, like Mixtral, is more useful.
1
0
3
@mejia_petit
Nicolas Mejia Petit
2 months
@code_star @Yuchenj_UW @karpathy Yes, there is research in that domain! Also, the quality of the data is relatively high; they are using FineWeb, a very clean dataset. Now it won’t get to ChatGPT level, due to the fact that we lack their large ~100k human-written SFT dataset (according to Karpathy).
6
0
3
@mejia_petit
Nicolas Mejia Petit
3 months
@rohanpaul_ai I had talked to one of the DeepSpeed developers about using native INT4, and I sent him a paper. He said thx and that they’re looking into INT4, and I think this is what he was referring to. That is so cool; I hope these INT4 kernels are native, the HF Quanto repo could use them :)
0
0
3
@mejia_petit
Nicolas Mejia Petit
3 months
@felix_red_panda Whoaaa, I haven’t bought a hard drive in 10 years; I did not know they got so cheap. 10 years ago I bought a 1TB for almost $100, I think, and in today’s money that would be like $200.
Tweet media one
2
0
3
@mejia_petit
Nicolas Mejia Petit
4 months
@migtissera They could’ve also mentioned it’s 1/10 the size of GPT-4 with similar performance, without any RLHF or DPO for better performance (not that safety BS OpenAI does, making their models dumber but ‘safer’).
0
0
2
@mejia_petit
Nicolas Mejia Petit
3 months
@nickfrosst That’s for the people asf.
0
0
3
@mejia_petit
Nicolas Mejia Petit
3 months
It would be so funny to shove Llama 8B with vision + Meta’s voice-cloned text-to-speech into a Transformers robot.
1
0
3