Umar Jamil

@hkproj

1,849 Followers · 119 Following · 16 Media · 45 Statuses

Keeping GPUs hot 🔥 @Get_Writer

Milan, Lombardy
Joined February 2018
@hkproj
Umar Jamil
29 days
Lots of people asking me about my journey into Machine Learning. I gave two interviews about my journey a few months back (links at the bottom). Key takeaways: • Learning Mandarin taught me consistency and patience. Success is a natural consequence. • Quit my manager job at the …
9
13
182
@hkproj
Umar Jamil
4 months
A complete explanation of Kolmogorov-Arnold Networks, including a review of multilayer perceptrons (MLPs), Bézier Curves, B-Splines, the Universal Approximation Theorem and how KANs compare to MLPs: #kan #machinelearning #deeplearning #mlp
3
9
48
@hkproj
Umar Jamil
7 months
A full explanation of the mathematical foundation of Reinforcement Learning from Human Feedback (RLHF) and PPO, starting from first principles. Includes PyTorch implementation explained line-by-line: #rlhf #ppo #math #tutorial #pytorch #aialignment
7
7
46
@hkproj
Umar Jamil
28 days
Great article by NVIDIA demystifying chunked prefill of a prompt and other techniques for optimizing inference of LLMs. Someone (😇😇😇) is very interested in chunked prefill of the KV-Cache and related optimizations, so expect to hear more from me about it. Article:
4
4
43
@hkproj
Umar Jamil
1 month
@iam_radheraaga I've been busy with a side project with my wife 🤰👶
3
0
36
@hkproj
Umar Jamil
1 month
@tuturetom Thank you for sharing my video! I also speak Chinese (my wife is Chinese)
4
2
34
@hkproj
Umar Jamil
14 days
Chunked Prefill generates the KV-Cache by splitting the prompt into chunks. We prove that while prefilling, we can exploit the partial KV-Cache to generate extractive summaries and then append them at the end of the prompt to improve the ability of the model to extract …
2
6
28
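For readers asking how chunked prefill looks in practice, here is a minimal sketch, assuming a Hugging Face-style causal LM API; the `gpt2` checkpoint and the chunk size are placeholders, not the setup used in the work above:

```python
# Minimal sketch of chunked prefill, assuming a Hugging Face-style API:
# the prompt is split into fixed-size chunks, and each forward pass
# extends the KV-Cache with the new chunk without recomputing old tokens.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2").eval()  # placeholder model
tokenizer = AutoTokenizer.from_pretrained("gpt2")

prompt_ids = tokenizer("a very long prompt ...", return_tensors="pt").input_ids
chunk_size = 16  # illustrative; real systems tune this to the hardware
past_key_values = None

with torch.no_grad():
    for start in range(0, prompt_ids.shape[1], chunk_size):
        chunk = prompt_ids[:, start : start + chunk_size]
        out = model(input_ids=chunk,
                    past_key_values=past_key_values,
                    use_cache=True)
        past_key_values = out.past_key_values  # partial KV-Cache grows chunk by chunk

# The cache now covers the whole prompt; decoding proceeds token by token.
next_token = out.logits[:, -1, :].argmax(dim=-1)
```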
@hkproj
Umar Jamil
9 months
Mistral 7B and Mixtral 8x7B explained: Sliding Window Attention, Rolling Buffer (KV) Cache, Pre-Fill and Chunking, Model Sharding, Sparse Mixture of Experts (SMoE), xformers, Pipeline Parallelism. Video: #mistral #mixtral #llm
1
4
26
@hkproj
Umar Jamil
9 months
Tutorial on distributed training with PyTorch, with code walk-through and explanation of Gradient Accumulation, Collective Communication (Broadcast, Reduce, All-Reduce), DistributedDataParallel, Bucketing: #pytorch #tutorial #distributeddataparallel
3
2
12
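As a taste of one topic from that tutorial, here is a minimal sketch of gradient accumulation on its own, with a toy model and synthetic data standing in for a real training loop (not the tutorial's actual code):

```python
# Minimal sketch of gradient accumulation: sum gradients over several
# micro-batches, then take a single optimizer step, simulating a larger
# effective batch size. Toy model and synthetic data.
import torch
import torch.nn.functional as F

model = torch.nn.Linear(16, 1)
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
accum_steps = 4  # one optimizer step every 4 micro-batches

micro_batches = [(torch.randn(8, 16), torch.randn(8, 1)) for _ in range(8)]

opt.zero_grad()
for step, (x, y) in enumerate(micro_batches, start=1):
    loss = F.mse_loss(model(x), y)
    # Divide by accum_steps so the summed gradient equals the average
    # gradient of one big batch.
    (loss / accum_steps).backward()
    if step % accum_steps == 0:
        opt.step()
        opt.zero_grad()
```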
@hkproj
Umar Jamil
9 months
In this video I will explain quantization: asymmetric and symmetric quantization, dynamic and static quantization, Post-Training Quantization and Quantization-Aware Training. Including PyTorch code! #deeplearning #quantization #pytorch #tutorial
0
3
10
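For a flavor of the asymmetric case, here is a minimal standalone sketch; the helper names and the unsigned 8-bit range are illustrative choices, not the video's code:

```python
# Minimal sketch of asymmetric quantization to unsigned 8-bit integers.
import torch

def asymmetric_quantize(x: torch.Tensor, bits: int = 8):
    qmin, qmax = 0, 2**bits - 1
    # The scale maps the observed float range onto the integer range.
    scale = ((x.max() - x.min()) / (qmax - qmin)).item()
    # The zero point is the integer that represents the float value 0.
    zero_point = int(round(qmin - x.min().item() / scale))
    q = torch.clamp(torch.round(x / scale) + zero_point, qmin, qmax).to(torch.uint8)
    return q, scale, zero_point

def dequantize(q: torch.Tensor, scale: float, zero_point: int) -> torch.Tensor:
    return scale * (q.float() - zero_point)

x = torch.randn(5) * 3
q, scale, zp = asymmetric_quantize(x)
print(x)
print(dequantize(q, scale, zp))  # approximates x up to quantization error
```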
@hkproj
Umar Jamil
1 month
@tuturetom I prefer to say that I'm an "imported local". I used to live in Suzhou; I lived there for 4 years.
2
0
8
@hkproj
Umar Jamil
29 days
@9LkVSi8E3NWf19q Study the code (by running it) and comment it with your personal annotations. Try to re-create the project using your commented code. It takes lots of patience and lots of research. Start with very small projects; they help you build the confidence to pursue bigger ones.
0
0
5
@hkproj
Umar Jamil
1 month
@giffmana Thank you so much for this! It means a lot coming from you!
0
0
6
@hkproj
Umar Jamil
29 days
@detention361757 Eat light at dinner (no later than 18:30), no electronic devices (except an e-ink reader) after 20:30, asleep by 21:30 most of the time.
1
0
6
@hkproj
Umar Jamil
4 months
@predict_addict Thanks for sharing my video!
0
0
6
@hkproj
Umar Jamil
8 months
@iScienceLuvr @tri_dao I made a video in which I derive the mathematical foundation of State Space Models, including Mamba, from first principles. I also teach differential equations in 5 minutes for those who lack this background. Check it out:
0
0
5
@hkproj
Umar Jamil
23 days
@atgorans_k Give the money to the poor; they need it more than me. I have a very long list of videos I'd like to make. Stay tuned, and I guarantee you will keep being amazed. Have a nice day!
1
0
3
@hkproj
Umar Jamil
29 days
@detention361757 You can even sleep 9 hours. Just be consistent with whatever schedule works for you. People underestimate the power of studying consistently, even just 1 hour every single day.
0
0
3
@hkproj
Umar Jamil
8 months
@dchaplot I made a video about the Mixture of Experts, Sliding Window Attention, Rolling Buffer Cache, Pre-Fill and Chunking, and Model Sharding. I show mathematically why the sliding window attention can capture information outside the window size. Check it out!
0
1
3
@hkproj
Umar Jamil
1 month
@GiorgioMantova The answer is in the first 15 minutes of the video ;-) give it a try!
1
0
2
@hkproj
Umar Jamil
14 days
Little known fact we exploited: you can always delete something from the end of the KV-Cache and replace it with something completely different. Do you know why? Each token in the KV-Cache is a contextualized embedding of all the "past" tokens, so you cannot delete anything from the middle without invalidating every entry that follows it; the entries at the end, however, are never referenced by anything after them, so they can be dropped safely.
1
1
2
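The claim is easy to verify numerically. Below is a minimal sketch using plain causal attention with toy dimensions (no real model): the outputs at the surviving positions are identical after the trailing cache entries are deleted, because nothing before them ever attended to the deleted suffix:

```python
# Numerical check: with a causal mask, outputs at positions < t never
# depend on K/V entries at positions >= t, so truncating the cache from
# the end leaves all remaining entries valid. Toy dimensions, no model.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
seq, d = 6, 8
q = torch.randn(seq, d)
k = torch.randn(seq, d)
v = torch.randn(seq, d)

def causal_attention(q, k, v):
    scores = (q @ k.T) / d**0.5
    mask = torch.triu(torch.ones(q.shape[0], k.shape[0], dtype=torch.bool), diagonal=1)
    return F.softmax(scores.masked_fill(mask, float("-inf")), dim=-1) @ v

full = causal_attention(q, k, v)
# "Delete" the last 2 entries of the KV-Cache (and their queries with them).
truncated = causal_attention(q[:4], k[:4], v[:4])
print(torch.allclose(full[:4], truncated))  # True: the prefix is untouched
# Deleting a *middle* entry instead (say k[2], v[2]) would change every
# output from position 2 onward.
```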
@hkproj
Umar Jamil
23 days
@atgorans_k A passionate audience like you is what keeps me focused on delivering high-quality content.
1
0
2
@hkproj
Umar Jamil
1 month
@yllll_yl @tuturetom "Above there is heaven; below, Suzhou and Hangzhou; and in between, Tongxiang" 🤣 I lived in Tongxiang for six months, right between Suzhou and Hangzhou. I may go back to China next year; let's get in touch then ☺️.
1
0
1
@hkproj
Umar Jamil
7 months
@ducnh279 There's a long list of videos I'd love to make, and luckily for you, DPO is one of them. Stay tuned ;-)
1
0
1
@hkproj
Umar Jamil
9 months
@ohhbatu Because we also have the batch dimension (when you train a model, you have a batch of sentences), and the "h" dimension, which depends on the number of heads in multi-head attention; d_k = d_model / h.
0
0
1
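A minimal shape check makes the reply concrete (the dimensions below are illustrative, not from any specific model):

```python
# Shape check for d_k = d_model / h: the model dimension is split
# evenly across the attention heads.
import torch

batch, seq, d_model, h = 2, 10, 512, 8
d_k = d_model // h  # 64

x = torch.randn(batch, seq, d_model)                # (batch, seq, d_model)
heads = x.view(batch, seq, h, d_k).transpose(1, 2)  # (batch, h, seq, d_k)
print(heads.shape)  # torch.Size([2, 8, 10, 64])
```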
@hkproj
Umar Jamil
30 days
@kurtqian I left China in April; I now live in Munich, Germany
1
0
1
@hkproj
Umar Jamil
6 months
@jeffrey50963197 I'll do my best. Thanks for being a fan!
1
0
1