Umar Jamil @hkproj Twitter profile

Pinned Tweet

Umar Jamil

11 months

A complete tutorial on how to code Stable Diffusion from scratch, using only PyTorch, with full explanation (maths, txt2img, img2img, in-painting): #stablediffusion #PyTorch #tutorial #ai #deeplearning #machinelearning #python #diffusion #diffusionmodels

Coding Stable Diffusion from scratch in PyTorch

Full coding of Stable Diffusion from scratch, with full explanation, including explanation of the mathematics. Visual explanation of text-to-image, image-to-...

www.youtube.com

3

4

44

Last Seen Profiles

@JonathanBerant

@HigorX7_FX

@ThiruGeneva

@yuuki_yuduru

@reritesillywawa

@AIL_ets

@GuzuGuzuu

@70r40

@TopActiff

@AtelierHyeon

@Shadowfvn

@YoVenice

@AJArabic

@its_immaculate6

@MMDORT

@onevoiceanimal

@teraken_rugby

@kemby_

@betabetarimocon

@gammek13

@askalphaxiv

@lass198510

@Apache_be

@ahs_ems

@QCODEmedia

@bokeplokalmalam

@jud_bent

@xss_is

@samanthaefran

@BartBostoen

@drwafa

@Vivo_Ph

@QDOOZ

@stwmaniax

@AmpaNavarrin

@ThisIsEverythng

Umar Jamil

@hkproj

1 month

My new video on how to code a Multimodal (Vision) Language Model from scratch using only Python and PyTorch while explaining every single concept step by step. Link: #tutorial #pytorch #python #coding #fromscratch #llm #paligemma #gemma #visionmodel

Coding a Multimodal (Vision) Language Model from scratch in PyTorch...

Full coding of a Multimodal (Vision) Language Model from scratch using only Python and PyTorch. We will be coding the PaliGemma Vision Language Model from sc...

www.youtube.com

22

155

923

Umar Jamil

@hkproj

29 days

Lots of people asking me about my journey into Machine Learning. I gave two interviews about my journey few months back (links in the bottom). Key takeaways: • Learning Mandarin taught me consistency and patience. Success is a natural consequence. • Quit my manager job at the

9

13

182

Umar Jamil

@hkproj

5 months

A complete explanation of Direct Preference Optimization (DPO) and the math derivations needed to understand it. Code explained. Link to the video: #dpo #directpreferenceoptimization #rlhf #rl #llm #alignment #finetuning #ai #deeplearning

Direct Preference Optimization (DPO) explained: Bradley-Terry model,...

In this video I will explain Direct Preference Optimization (DPO), an alignment technique for language models introduced in the paper "Direct Preference Opti...

www.youtube.com

5

18

89

Umar Jamil

@hkproj

4 months

A video on ML Interpretability, with feature visualization, interpretability for language models, adversarial example generation and more! Built in collaboration with Leap Labs ( @JessicaRumbelow ). #machinelearning #interpretability #llm #deeplearning

ML Interpretability: feature visualization, adversarial example,...

In this video, I will be introducing Machine Learning Interpretability, a vast topic that aims at understanding the inner mechanisms of how machine learning ...

www.youtube.com

4

9

49

Umar Jamil

@hkproj

4 months

A complete explanation of Kolmogorov-Arnold Networks, including a review of multilayer perceptron (MLPs), Bézier Curves, B-Splines, the Universal Approximation Theorem and how KANs compare to MLPs: #kan #machinelearning #deeplearning #mlp

Kolmogorov-Arnold Networks: MLP vs KAN, Math, B-Splines, Universal...

In this video, I will be explaining Kolmogorov-Arnold Networks, a new type of network that was presented in the paper "KAN: Kolmogorov-Arnold Networks" by Li...

www.youtube.com

3

9

48

Umar Jamil

@hkproj

7 months

A full explanation of the mathematical foundation of Reinforcement Learning from Human Feedback (RLHF) and PPO, starting from first principles. Includes PyTorch implementation explained line-by-line: #rlhf #ppo #math #tutorial #pytorch #aialignment

Reinforcement Learning from Human Feedback explained with math...

In this video, I will explain Reinforcement Learning from Human Feedback (RLHF) which is used to align, among others, models like ChatGPT. I will start by in...

www.youtube.com

7

46

Umar Jamil

@hkproj

28 days

Great article by NVIDIA demystifying chunked prefill of a prompt and other techniques for optimizing inference of LLMs. Someone (😇😇😇) is very interested in chunked prefill of the KV-Cache and related optimizations, so expect to hear more from me about it. Article:

4

43

Umar Jamil

@hkproj

8 months

Mamba and S4 explained: Architecture, Parallel Scan, Kernel Fusion, Recurrent/Convolution formulation, Math derivations from first principles, HIPPO theory. Video: #deeplearning #mamba #s4 #ssm #statespacemodel #transformer

Mamba and S4 Explained: Architecture, Parallel Scan, Kernel Fusion,...

Explanation of the paper Mamba: Linear-Time Sequence Modeling with Selective State SpacesIn this video I will be explaining Mamba, a new sequence modeling ar...

www.youtube.com

2

8

34

Umar Jamil

@hkproj

1 month

@iam_radheraaga I've been busy with a side project with my wife 🤰👶

3

0

36

Umar Jamil

@hkproj

1 month

@tuturetom 谢谢你分享我的视频！我也会中文（老婆是中国人）

4

2

34

Umar Jamil

@hkproj

14 days

Chunked Prefill generates the KV-Cache by splitting the prompt into chunks. We prove that while prefilling, we can exploit the partial KV-Cache to generate extractive summaries and then append them at the end of the prompt to improve the ability of the model to extract

2

6

28

Umar Jamil

@hkproj

9 months

Mistral 7B and Mixtral 8x7B explained: Sliding Window Attention, Rolling Buffer (KV) Cache, Pre-Fill and Chunking, Model Sharding, Sparse Mixture of Experts (SMoE), xformers, Pipeline Parallelism. Video: #mistral #mixtral #llm

Mistral / Mixtral Explained: Sliding Window Attention, Sparse Mixture...

In this video I will be introducing all the innovations in the Mistral 7B and Mixtral 8x7B model: Sliding Window Attention, KV-Cache with Rolling Buffer, Pre...

www.youtube.com

1

4

26

Umar Jamil

@hkproj

9 months

Tutorial on distributed training with PyTorch, with code walk-through and explanation of Gradient Accumulation, Collective Communication (Broadcast, Reduce, All-Reduce), DistributedDataParallel, Bucketing: #pytorch #tutorial #distributeddataparallel

Distributed Training with PyTorch: complete tutorial with cloud...

A complete tutorial on how to train a model on multiple GPUs or multiple servers.I first describe the difference between Data Parallelism and Model Paralleli...

www.youtube.com

3

2

12

Umar Jamil

@hkproj

11 months

A complete tutorial on how to code LLaMA 2 from scratch with PyTorch, explaining KV Cache, Grouped Query Attention, Rotary PE, RMSNorm #llama #llm #deeplearning #pytorch #machinelearning #coding #tutorial #llama2 #ai #languagemodels

Coding LLaMA 2 from scratch in PyTorch - KV Cache, Grouped Query...

Full coding of LLaMA 2 from scratch, with full explanation, including Rotary Positional Embedding, RMS Normalization, Multi-Query Attention, KV Cache, Groupe...

www.youtube.com

0

4

13

Umar Jamil

@hkproj

9 months

In this video I will explain quantization: asymmetric and symmetric quantization, dynamic and static Quantization, Post-Training Quantization and Quantization-Aware Training. Including PyTorch code! #deeplearning #quantization #pytorch #tutorial

Quantization explained with PyTorch - Post-Training Quantization,...

In this video I will introduce and explain quantization: we will first start with a little introduction on numerical representation of integers and floating-...

www.youtube.com

0

3

10

Umar Jamil

@hkproj

1 month

@Yampeleg I even made a video about it, very enthusiastically, only to be let down 🥲 -

LongNet: Scaling Transformers to 1,000,000,000 tokens: Python Code +...

Full model explanation and code to visualize the attention mechanism in the new LongNet model from Microsoft Research.Full code available as always: https://...

www.youtube.com

0

1

9

Umar Jamil

@hkproj

10 months

Retrieval Augmented Generation (RAG) explained: what are embedding vectors, how they're built using Sentence BERT, how Vector DB work (HNSW algorithm). Video: #rag #retrievalaugmentedgeneration #hnsw #vectordb #sentencebert #embedding

Retrieval Augmented Generation (RAG) Explained: Embedding, Sentence...

Get your 5$ coupon for Gradient: https://gradient.1stcollab.com/umarjamilaiIn this video we explore the entire Retrieval Augmented Generation pipeline. I wi...

www.youtube.com

1

4

9

Umar Jamil

@hkproj

1 month

@tuturetom 我比较喜欢说我是“进口本地人”，我以前住在苏州，住了4年了

2

0

8

Umar Jamil

@hkproj

11 months

BERT explained: Training (Masked Language Model, Next Sentence Prediction), Self-Attention, [CLS] token, Left and Right context, BERT vs GPT/LLamA, Fine tuning, Text Classification, QA #bert #deeplearning #largelanguagemodels #machinelearning #tutorial

BERT explained: Training, Inference, BERT vs GPT/LLamA, Fine tuning,...

Full explanation of the BERT model, including a comparison with other language models like LLaMA and GPT. I cover topics like: training, inference, fine tuni...

www.youtube.com

1

7

Umar Jamil

@hkproj

29 days

@9LkVSi8E3NWf19q Study the code (by running it), comment it with your personal annotations. Try to re-create the project by using your commented code. Takes lots of patience, lots of research. Start with very small projects, they help you build the confidence for pursuing bigger ones.

0

5

Umar Jamil

@hkproj

1 month

@giffmana Thank you a lot for this! It means a lot coming from you!

0

6

Umar Jamil

@hkproj

29 days

@detention361757 Eat light at dinner (no later than 18.30), no electronic device (except e-ink reader) after 20.30, sleeping by 21.30 most of the time.

1

0

6

Umar Jamil

@hkproj

4 months

@predict_addict Thanks for sharing my video!

0

6

Umar Jamil

@hkproj

8 months

@iScienceLuvr @tri_dao I made a video in which I derive the mathematical foundation of State Space Models, including Mamba, from first principles. I also teach differential equations in 5 minutes for those who lack this background. Check it out:

Mamba and S4 Explained: Architecture, Parallel Scan, Kernel Fusion,...

Explanation of the paper Mamba: Linear-Time Sequence Modeling with Selective State SpacesIn this video I will be explaining Mamba, a new sequence modeling ar...

www.youtube.com

0

5

Umar Jamil

@hkproj

23 days

@atgorans_k Give the money to the poor, they need it more than me. I have a very long list of video I'd like to make, stay tuned and I guarantee you will keep getting amazed. Have a nice day!

1

0

3

Umar Jamil

@hkproj

29 days

@detention361757 You can sleep even 9 hours. Just be consistent with whatever schedule works for you. People underestimate the power of studying consistently even 1 hour every single day.

0

3

Umar Jamil

@hkproj

8 months

@dchaplot I made a video about the Mixture of Experts Sliding Window Attention, Rolling Buffer Cache, Pre-Fill and Chunking, Model Sharing. I provide mathematically why the sliding window attention can capture information outside the window size. Check it out!

Mistral / Mixtral Explained: Sliding Window Attention, Sparse Mixture...

In this video I will be introducing all the innovations in the Mistral 7B and Mixtral 8x7B model: Sliding Window Attention, KV-Cache with Rolling Buffer, Pre...

www.youtube.com

0

1

3

Umar Jamil

@hkproj

1 month

@GiorgioMantova The answer is in the first 15 minutes of the video ;-) give it a try!

1

0

2

Umar Jamil

@hkproj

14 days

Little known fact we exploited: you can always delete something from the end of the KV-Cache and replace it with something completely different. Do you know why? Each token in the KV-Cache is a contextualized embedding about all the "past" tokens, so you cannot delete anything

1

2

Umar Jamil

@hkproj

23 days

@atgorans_k Passionate audience like you is what keeps me focused on delivering high quality content.

1

0

2

Umar Jamil

@hkproj

1 month

@yllll_yl @tuturetom 上有天堂，下有苏杭，中间有桐乡🤣 我住在桐乡六个月了，正好在苏州和杭州中间我可能明年回国，到时候联系☺️。

1

0

1

Umar Jamil

@hkproj

7 months

@ducnh279 There's a long list of videos I'd love to make, and luckily for you, DPO is one of them. Stay tuned ;-)

1

0

1

Umar Jamil

@hkproj

9 months

@ohhbatu Because we also have the batch dimension (when you train a model, you have a batch of sentences), and the "h" dimension, which depends on the number of heads of the multi-head attention; d_k = d_model / h.

0

1

Umar Jamil

@hkproj

30 days

@kurtqian 我是4月出国的，现在住在德国慕尼黑

1

0

1

Umar Jamil

@hkproj

6 months

@jeffrey50963197 I'll do my best. Thanks for being a fan!

1

0

1