"XGBoost is all you need?"
Ok, but a word of caution for when your data has categorical variables & you're using any tree-based or boosted-tree method like RandomForest, XGB etc
One hot encoding could ruin things when the categorical variable has many levels
1/n
Know your ML evaluation metrics
Got highly imbalanced data? Probably you'd want to 're'consider ROC.
While ROC AUC is a very popular metric because of characteristics like being insensitive to class distribution, it isn't a good choice
1/
How cool is it to try your ML models live rather than just training them in Jupyter notebooks?
Very cool! And even cooler when it's done as easily as it gets with
@Gradio
and
@huggingface
🤗space!
I'm ready to try my bean plant classifier!
1/
Got reached out to for a data scientist role by a London-based company using AI in medicine. They liked my LinkedIn activity & see me as a good fit based on what my LI makes me seem interested in - I don't even post much as I've always been more interested in the+
As a Data Scientist, I'll be smart if I quickly understand the business problem, frame it, understand how and what data is needed (or already used). In some cases knowing how to write integration pipelines also helps.
Ofc, modelling is equally important. But obviously no one +
Data loading shouldn't be a bottleneck in the model training pipeline. With my new blog, learn how
@PyTorch
ensures this.
I also explore & implement the latest DataPipes from TorchData.🌟Pretty cool.
Wrote this one on
@weights_biases
posts section.
Are you setting out to learn
@PyTorch
? Follow along with my blog for a line-by-line explanation as you create your first neural net classifier.
I wrote it on
@weights_biases
posts section. Here's the link -
Ever used regularisation (L1, L2) and wondered why it's advised to standardise the features (x1 x2...xn) before doing so?
In L1 and L2 regularisation, we aim to shrink the magnitudes of coefficient estimates.
1/n
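Why standardise first? The penalty shrinks coefficients by magnitude, so a feature on a bigger scale gets an artificially tiny coefficient and partly dodges the penalty. A minimal sketch with scikit-learn (the data and pipeline here are my illustration, not from the thread):

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
X[:, 1] *= 1000.0                      # second feature on a much larger scale
y = X[:, 0] + X[:, 1] / 1000.0 + rng.normal(scale=0.1, size=200)

# Standardising first puts both features on equal footing, so the L1
# penalty shrinks their coefficients fairly.
model = make_pipeline(StandardScaler(), Lasso(alpha=0.1))
model.fit(X, y)
coefs = model.named_steps["lasso"].coef_
```

With scaling, both (equally important) features keep similar coefficients; without it, the large-scale feature's coefficient would be near-zero purely because of its units.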
Looking to connect with people in the ML space that actively contribute to open source.
I know how welcoming & helpful open source folks are.
help me get started? :)
I'd just want to chat a bit, won't take much of your time.
Outliers in your data?
Among the many ways to deal w it, if one is going for regularisation, then -
L1 (Lasso) is more robust to outliers while L2 (Ridge) isn't really.
1/n
Why do you choose '5' fold cross validation?
Or why not any other k than the one you go for in k fold cross validation?
Well, I shouldn't be asking here. Should ask the good old Bias Variance tradeoff instead.
Here's why -
1/n
But Hash encoding & Dracula are two encoding schemes recommended for categorical variables with many levels.
Of course, these do not come without their cons.
6/6
working through RoPE's math & realising it's nothing but revising my undergrad lin algebra (intuition & concept) was fun.
btw, I'll be presenting today on extending the context of models using RoPE - 2230 IST
@forai_ml
.
Paper:
theoretical (mathy, nitty gritty) side of things & that, according to me, wouldn't really help me scale as a creator, or maybe that isn't what people would be interested in reading.
Still, learning in public is Mighty!
think I should follow strategic posting now :)
Two types of classification tasks and how to implement each in
@PyTorch
:
I've come across this confusion a bunch of times now about choosing the right loss function for a Classification Task using Deep Learning in
@PyTorch
.
Here's a simple explanation:
1/3
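The two cases can be sketched like this (a minimal illustration with random tensors, assuming the standard torch.nn losses):

```python
import torch
import torch.nn as nn

# Binary classification: one output logit per example + BCEWithLogitsLoss.
# The sigmoid is applied inside the loss, so the model outputs raw logits.
binary_logits = torch.randn(8, 1)
binary_targets = torch.randint(0, 2, (8, 1)).float()
binary_loss = nn.BCEWithLogitsLoss()(binary_logits, binary_targets)

# Multi-class classification: C output logits per example + CrossEntropyLoss.
# Log-softmax is applied inside the loss; targets are class indices, not one-hot.
multi_logits = torch.randn(8, 3)
multi_targets = torch.randint(0, 3, (8,))
multi_loss = nn.CrossEntropyLoss()(multi_logits, multi_targets)
```

In both cases the model's last layer stays a plain Linear: no sigmoid/softmax in forward(), since the loss handles it.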
So, I saw a post laying a 'set of golden rules' for dealing with missing data.
Wrong on so many levels.
No set of rules exists unless one considers the following--
Why is the data missing?
Is it even worth imputing?
A thread👇
Learn to create lists in
@PyTorch
the correct way!
The wrong way
To create a
@PyTorch
NN with a variable no. of layers, a plain python list might be a common choice to store the network layers (nn.Module) by appending.
This becomes a source of error. See in code? 👇
1/3
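A minimal sketch of the failure mode (toy layer sizes, not the thread's original code):

```python
import torch.nn as nn

class BadNet(nn.Module):
    def __init__(self, n_layers=3):
        super().__init__()
        # Plain python list: the Linear layers are NOT registered as
        # submodules, so their weights never show up in .parameters()
        # and are never trained, saved, or moved to the GPU.
        self.layers = [nn.Linear(4, 4) for _ in range(n_layers)]

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

n_params = sum(p.numel() for p in BadNet().parameters())   # 0: nothing registered
```

The forward pass runs fine, which is what makes this bug sneaky: the optimizer just silently gets an empty parameter list.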
In the data space, I've learnt more by writing than by reading.
Few weeks back, when I started learning pytorch, I wrote a blog to explain every detail of a code of mine that constructed the most basic NN using pytorch.
To cater to my readers well, I made sure every little detail +
Creating Custom Models in
@PyTorch
? Make sure you aren't making this 👇 error. Let's learn in 5 steps.
Step 1:
Firstly, to "correctly" create optimizable parameters in PyTorch without running into gradient errors, we need to ensure parameters are leaf tensors.
1/5
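A minimal sketch of the leaf vs non-leaf distinction (my toy tensors, not the thread's code):

```python
import torch

# Leaf: created directly by the user with requires_grad=True.
w = torch.randn(3, requires_grad=True)

# Non-leaf: w2 is the *result* of an operation (it has a grad_fn),
# so its .grad stays unpopulated and optimizers will complain about it.
w2 = w * 2.0

# A safe way to get an optimizable leaf tensor: wrap it in nn.Parameter.
param = torch.nn.Parameter(torch.randn(3))
```

Rule of thumb: anything you hand to an optimizer should be a leaf; if you need to transform a parameter, do the transform inside forward(), not at creation time.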
Ever wanted to set different learning rates for different layers/parameters while training your neural networks in
@PyTorch
?
Let's learn how to do that with
@PyTorch
in 2 steps 👇
1. We will create the simplest neural network with 2 layers:
1/2
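The two steps might look like this (a sketch; layer sizes and learning rates are made up):

```python
import torch
import torch.nn as nn

# Step 1: the simplest network with 2 (Linear) layers.
model = nn.Sequential(nn.Linear(10, 20), nn.ReLU(), nn.Linear(20, 2))

# Step 2: pass parameter *groups* instead of model.parameters().
# Each dict can override the optimizer's default lr.
optimizer = torch.optim.SGD(
    [
        {"params": model[0].parameters(), "lr": 1e-2},  # first layer: larger lr
        {"params": model[2].parameters()},              # last layer: default lr
    ],
    lr=1e-3,
)
```

This is the same mechanism used for discriminative fine-tuning, e.g. small lr on pretrained layers and a bigger one on a fresh head.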
Do not one hot encode your categorical variable while using XGB if it has many levels.
One hot encoding works ok & might even give a performance boost if the no. of levels is small.
Curious why?
2/n
My
@weights_biases
blogathon submission explains CPCA - a simple & useful dimensionality reduction algorithm where you work with not 1, but 2 datasets to explore patterns in the target data.
Also, +
sometimes I really feel I should do an MS in 'Applied' math and stats. crazy how class 10th's moving average can be used for analysing time series data. it helps smooth the series, assess trends by removing seasonality and even forecast the future. no rocket science, just +
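That class-10th moving average in two lines of pandas (the toy series is mine, just trend + a repeating wiggle):

```python
import pandas as pd

# A short "monthly" series: upward trend plus a repeating seasonal wiggle.
s = pd.Series([10, 12, 16, 11, 13, 17, 12, 14, 18, 13, 15, 19])

# A 3-month centred moving average smooths the wiggle and exposes the trend.
smoothed = s.rolling(window=3, center=True).mean()
```

The window ends have no full neighbourhood, so the first and last values come out NaN; a window matching the seasonal period removes the seasonality best.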
Working with Neural Networks?
Using a CV architecture to predict whether a brain's MR scan classifies as cancerous, non-cancerous or any other such class?
Rather than a single prediction from the neural net, wouldn't it be better if we could generate confidence intervals?
1/n
There it is! Few seconds and here's the prediction.
Was nice to build my own web app as part of Building end to end Vision Applications taught by Dr. Abubakar
@abidlabs
at CoRise
@corise_
!!
With a lot of levels comes a lot of sparsity.
So when one hot encoding many levels (equivalent to creating as many new variables), only a small fraction of data points will have the value 1 for any single level (read: variable)
Why's this a problem?
3/n
z-test vs t-test
z-test: the underlying statistic can be approximated to follow the std. normal distribution.
t-test: the same, but following the t distribution instead.
The catch: the t distribution is more accurate in case of small samples. 👇
What to do?
PR Curve is a better choice.
Both Precision & Recall deal with the positive (minority) class of interest. So in this example, while Recall values are equal, Precision informs how many positives as predicted by the model are true positives.
4/
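A quick sketch of computing the PR curve and its area on imbalanced data (synthetic labels and scores of my own, assuming scikit-learn):

```python
import numpy as np
from sklearn.metrics import auc, precision_recall_curve

rng = np.random.default_rng(0)
y_true = np.zeros(1000, dtype=int)
y_true[:10] = 1                    # heavy imbalance: only 10 positives
scores = rng.random(1000)
scores[:10] += 0.5                 # positives tend to score higher

precision, recall, _ = precision_recall_curve(y_true, scores)
pr_auc = auc(recall, precision)    # area under the PR curve
```

Unlike ROC AUC, the PR baseline is the positive prevalence (here 1%), so a mediocre model can't hide behind the huge pile of easy true negatives.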
A serious error.
Feature selection/engineering on tabular data is a crucial step in any machine learning problem.
BUT! Hold on and double check you are doing it right.
If you are using Cross-validation or validation set holdout approaches for estimating test error..
1/n
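The usual fix the thread is pointing at: do the feature selection *inside* each CV fold, not once on the full data. A sketch with a scikit-learn Pipeline (dataset and selector choice are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

X, y = make_classification(n_samples=200, n_features=50, random_state=0)

# Because selection lives inside the pipeline, it is re-fit on each fold's
# training split only: the validation data is never peeked at.
pipe = make_pipeline(SelectKBest(f_classif, k=10),
                     LogisticRegression(max_iter=1000))
scores = cross_val_score(pipe, X, y, cv=5)
```

Selecting features on all of X first and cross-validating afterwards leaks information from the validation folds into the selection step, making the test-error estimate optimistically biased.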
People talk about fancy Machine Learning models.
Today, I'll talk about whether anything fancy is needed when it comes to solving problems/answering questions using data.
An acquaintance was very keen to find the best outlier detection technique.
Clarity of concept goes a long way!
Recently had an interview where I was asked something about pre-trained BERT models that I'd never read or thought of before.
But, since I had the gist of what actually goes on inside the BERT architecture, I was able to answer on point +
New blog on
@PyTorch
soon!
I'll be talking about how Pytorch handles data effectively and efficiently.
Along, I'll also demonstrate the new DataPipe functionality from the TorchData library.
Stay tuned 👋🔥
@marktenenholtz
it would be what I failed to do myself: don't learn X first completely & then Y & then Z & so on. Take up a problem and learn on the go. so for eg. one really doesn't need to be vv good at python to learn ML. ofc it's an advantage to be that but def not required (at least to start)
@marktenenholtz
found it! guess I got a good memory so I just remembered this is your post.
don't know what people get out of plagiarism but it's so irritating
it would've really driven me up the wall had I been in your place.
Do you use drop_last = True in your PyTorch DataLoader?
I do.
Here's what it is-
Setting it to True will drop the last batch in each epoch in case the dataset at hand cannot be evenly divided into batches of equal size.
1/n
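In code, with a toy 10-sample dataset and batch size 4 (my example):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.arange(10).float())   # 10 samples, batch_size 4

# drop_last=False: batch sizes 4, 4, 2 (a smaller, ragged last batch)
# drop_last=True:  batch sizes 4, 4    (the ragged batch is dropped)
keep = [len(b[0]) for b in DataLoader(dataset, batch_size=4, drop_last=False)]
drop = [len(b[0]) for b in DataLoader(dataset, batch_size=4, drop_last=True)]
```

Dropping the ragged batch is handy when a tiny batch would give a noisy gradient or break code (e.g. BatchNorm with batch size 1).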
I don't know whether one needs to know math for an industry ML role or not
But what I know is that engineering skills are sooo needed. maybe more :D
What's interesting to me is that I feel this latter skillset is no less required in research as well :)
ps: No rigid agenda x
Trees split on those variables that yield the "purest" nodes.
Easy to see why a one hot encoded variable typically won't lead to very pure nodes & hence the tree won't split on it
no matter how important the original categorical variable might be as a feature
4/n
Naturally, this would also interfere with feature importance generated by RandomForest or any other method as even if the splits happen on these hot encoded levels, they'll most likely not happen near the root
What to do then?
Well, here's needed knowledge from experience
5/n
it's a real deal, at least for me, to implement research papers.
it's a basic one (!= not useful; rather, very useful), still taking a lot of effort.
Anyone who's in a regular practice of doing this?
(will also do a thread once I finish. hoping I finish.)
@osanseviero
for a quick comprehensive overview of pos embeddings:
follows RoPE:
Detailed blogs by the authors of RoPE are the best resources.
For GQA, its paper is short and v easily understandable once you know MQA which again is v simple.
Full fine-tuning LLMs on downstream tasks comes with a lot of GPU memory usage + storage costs.
Let us look into a PEFT technique called Adaptors for efficient transfer learning in LLMs with an example application in Transformers!👇
1/7
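The core of an Adapter is a tiny bottleneck block with a residual connection, dropped after a frozen pretrained sublayer. A minimal sketch (dimensions and names are illustrative, not from the thread):

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-project -> nonlinearity -> up-project + residual.
    Only these few weights are trained; the big pretrained weights stay frozen."""
    def __init__(self, d_model=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)
        self.act = nn.GELU()

    def forward(self, x):
        # Residual connection: the adapter learns a small *correction*
        # to the frozen layer's output, starting near identity.
        return x + self.up(self.act(self.down(x)))

adapter = Adapter()
out = adapter(torch.randn(2, 10, 768))                     # (batch, seq, d_model)
trainable = sum(p.numel() for p in adapter.parameters())
```

~100k trainable parameters per adapter vs hundreds of millions for full fine-tuning, which is the whole point of PEFT.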
Communication - lack of it could ruin any data project, be it industry or research
& worse, it could cost you loads of time before things get ruined
Communication - more than half the work is done if this is done properly
This isn't preach, it's what I'm experiencing these days :)
Transposing data in PyTorch?
x.T is deprecated in PyTorch's latest release when used with tensors of dimensionality other than 0 or 2.
Worth noting why - probably because it doesn't work like how we would want when dealing with batches of data (matrices).
1/2
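What to use instead for batched matrices (my toy shapes):

```python
import torch

x = torch.randn(8, 3, 5)        # e.g. a batch of 8 matrices, each 3 x 5

# x.T on a 3-D tensor reverses *all* dims to (5, 3, 8), which is rarely
# what you want for batches, and it's deprecated for ndim != 2.
# For batched matrices, swap only the last two dims:
xt = x.transpose(-2, -1)        # (8, 5, 3); x.mT is shorthand for the same
```

`transpose(-2, -1)` (or `.mT`) treats the leading dim as the batch and transposes each matrix individually, which matches how batched matmul and most layers interpret the data.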
What type of questions can I expect in a coding round for Data Scientist role?
Pandas assignment already done! 🤔 I'm wondering what this round holds.
Anyone that's gone through a similar round?
Fingers crossed 🤞
Lately, I've been realising how important a good understanding of pytorch's autograd is for any practitioner.
To this end, I'm planning to write a series of blog posts explaining how the autograd engine works with computational graphs, and related concepts.
1/2
Information lies in variability - Central idea on which dimensionality reduction by PCA is based.
But, what if we want to capture variability only due to a specified cause/reason & not care about other sources of variability.
For eg. one might want to capture +
Was reading about Markov Processes - guess they apply to us perfectly.
The future state given the present & past depends only on the present no matter what the past was.
Beaut!
my first encounter with word vectors was like - we use some algorithms to convert words into vectors in a way s.t. synonyms have similar vectors
this isn't even the most appropriate definition & of course left me sort of uninterested if not clueless
read on to know the most basic
The more I study time series, the more interesting it gets.
What comes in the way sometimes however, is those math heavy proofs/conditions.
(Not the usual ones though, they are smooth to go :))
Skipping them for now, let's see them if I go for a PhD lol XD
this could be very (very, very) misleading (or misinforming).
the decision boundary of logistic is linear in *its most basic form*.
that s shaped sigmoid is **NOT** the decision boundary.
this image makes it look like that sigmoid graph is separating the blue class +
1. Logistic Regression
It's a classification model used when the target is categorical.
It is a statistical model that in its basic form uses a logistic function to model a binary dependent variable, although many more complex extensions exist.
Crucial to constantly revisit the business problem while solving & evaluating the Machine learning problem.
Even if the business problem isn't dynamic- REVISE IT!
We get tempted to try fancy data science techniques and tools & forget what we are here for - THE BUSINESS PROBLEM!
"Non linear decision boundaries cannot be solved by logistic regression" - One who understands *just the basics* of what logistic regression is, would know this is TOTALLY wrong.
I've now lost count of blogs, threads, posts etc. that state this.
1/2
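The point in code: a linear classifier on *expanded* features gives a curved boundary in the original space. A sketch with scikit-learn (the circles dataset and degree are my choice):

```python
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Two concentric circles: no straight line separates the classes.
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

# Logistic regression on polynomial features: the boundary is linear in the
# expanded space (x1, x2, x1^2, x1*x2, x2^2), hence curved - here roughly a
# circle - in the original space.
clf = make_pipeline(PolynomialFeatures(degree=2),
                    LogisticRegression(max_iter=1000))
clf.fit(X, y)
acc = clf.score(X, y)
```

So "logistic regression can't do non-linear boundaries" confuses the *model family in its basic form* with what it can do after feature engineering (or kernels).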
train test split in time series cannot be done the way it is done with other types of data, that is, dividing the whole data at random. here the chronology that's inherent in the data needs to be followed in the train & test sets as well.
this is needed as most modelling techniques+
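A chronology-respecting split in code, using scikit-learn's TimeSeriesSplit as one common tool (my toy data):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(10).reshape(-1, 1)    # 10 chronologically ordered observations

# Each fold trains only on the past and validates on the future;
# nothing is shuffled.
splits = list(TimeSeriesSplit(n_splits=3).split(X))
for train_idx, test_idx in splits:
    assert train_idx.max() < test_idx.min()   # train always precedes test
```

The training window grows fold by fold while the test window rolls forward, mimicking how the model would actually be used: fit on history, predict the future.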
starting w non stationary time series today.
I'm loving studying time series so far. Does it come under machine learning? :)
After all, it's modelling here as well.
In fact, I feel it's challenging cause we are mainly concerned w extrapolation here, among other things.
Scratched the surface of some techniques for efficient ML inference (Quantization, Pruning etc.) - Interesting topics!
Nice to experiment w these techniques in
@PyTorch
.
Tensors' gradients unexpectedly None in
@PyTorch
?
Let's debug 👇🙌
Follow 4 simple checks and you'll have your answer.
1. tensor.requires_grad == True
2. tensor.is_leaf == True, or equivalently
tensor.grad_fn is None; if it is not None (a non-leaf tensor), use retain_grad() on it.
1/2
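The first two checks in code (my toy tensors):

```python
import torch

# Check 1: requires_grad must be True, else autograd ignores the tensor
# entirely and .grad stays None.
a = torch.randn(3)                    # requires_grad=False -> a.grad stays None
b = torch.randn(3, requires_grad=True)

# Check 2: .grad is only populated on *leaf* tensors by default.
c = b * 2                             # non-leaf: c.grad_fn is not None
c.retain_grad()                       # ask autograd to keep c's grad too

loss = c.sum()
loss.backward()
```

After backward(), `a.grad` is still None (check 1 fails for it), while `b.grad` and, thanks to retain_grad(), `c.grad` are populated.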
Came across hierarchical softmax while revisiting the negative sampling algo.
NS is essentially a simpler alternative to HS.
HS was a technique introduced to mitigate the heavy computational complexity involved in learning word embeddings using algos like word2vec.
1/2
Seasonality, which is one of the major components of time series data, can occur in two types - Single & Multiple.
Single - when there is one dominant seasonal pattern in the data; more likely to be seen in low frequency data like monthly or yearly. for eg. in a monthly data+
Fine-tuning an LLM taking up too much GPU memory?
Heavily Parameterized Large Language Models + Basic Linear Algebra Theorem = Save GPU memory! 💯
Let’s talk about LoRA, a PEFT technique that relies on a simple concept - decomposition of non-full rank matrices.
1/7
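A minimal sketch of the LoRA idea: a frozen pretrained weight W plus a trainable low-rank update B @ A (dimensions, rank and scaling below are illustrative, not from the thread):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base weight plus a trainable low-rank update B @ A.
    Trainable params: r*(d_in + d_out) instead of d_in*d_out."""
    def __init__(self, d_in=512, d_out=512, r=8, alpha=16):
        super().__init__()
        self.base = nn.Linear(d_in, d_out, bias=False)
        self.base.weight.requires_grad_(False)        # freeze pretrained W
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(d_out, r))  # B = 0: update starts at 0
        self.scale = alpha / r

    def forward(self, x):
        # W x + (alpha/r) * B A x, with B A of rank at most r
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear()
out = layer(torch.randn(2, 512))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
```

Here only 8192 parameters train vs 262144 in the frozen base weight; after fine-tuning, B @ A can even be merged back into W so inference pays no extra cost.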
Curious to know if ML practitioners use LMMs. Linear mixed effect models are an important and very interesting class of models. They let you model correlated data.
I used LMMs to model pollutant levels in Beijing's air over years.
Model inference in
@PyTorch
TIL that computational graphs can be used not just for backprop, but for inference as well.
We could create and export our model's graph and use it for inference later without the model checkpoint file.
1/2
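One way to do this is TorchScript tracing: the recorded graph plus weights gets saved as a single file, loadable later without the Python model class. A sketch (toy model and filename are mine):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2)).eval()
example = torch.randn(1, 4)

# trace() records the computational graph by running the example input
# through the model; the saved file bundles graph + weights.
traced = torch.jit.trace(model, example)
torch.jit.save(traced, "model_traced.pt")

# Later / elsewhere: no model class, no checkpoint file needed.
loaded = torch.jit.load("model_traced.pt")
with torch.no_grad():
    out = loaded(example)
```

The loaded graph produces the same outputs as the original model, and can also run from C++ via libtorch.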
The correct way - Use ModuleList
ModuleList functions similarly to a python list & is meant to store nn.Module objects, just as a python list is used to store objs like ints, strings etc.
The parameters of different layers are registered & accessible using .parameters().
3/3
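The correct version in code (same toy layer sizes as one might use above; my illustration):

```python
import torch.nn as nn

class GoodNet(nn.Module):
    def __init__(self, n_layers=3):
        super().__init__()
        # ModuleList registers each Linear as a submodule, so their
        # parameters show up in .parameters(), get trained, saved in
        # state_dict, and moved along with .to(device).
        self.layers = nn.ModuleList(nn.Linear(4, 4) for _ in range(n_layers))

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

n_params = sum(p.numel() for p in GoodNet().parameters())   # 3 * (16 + 4) = 60
```

Compare with a plain python list holding the same layers: .parameters() would be empty and the optimizer would train nothing.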
Tbh, 'classic' machine learning is the easiest thing I've studied in the field of Statistics so far. Now some may say ML isn't included in Statistics. For me, it is and I'll call it that way only.
Why it seemed easiest to me could be a combination of my interest and hold on the basics 👇
Electricity consumption high during the day, less during the night - I observed this in a time series data. (so it's like up and down with day and night)
What is this? seasonal not cyclic
Another series, I observe is going up then down up down.. so on.
This? cyclic not seasonal+
to compare two (or more) classification models when the data is highly imbalanced
It can be overly optimistic in case of highly imbalanced data
So say, with 100k negative examples & 10 positives - Model A & B both correctly identify 9 out of 10 positives. (true positives)
2/
For anyone who writes code - no matter the role, industry, or purpose - learning time and space complexity is inevitable.
And there's no argument to this.
Another way to detect outliers, this one for multidimensional data.
The last few Principal Components.
Generally the last few PCs capture very little of the variance present in the original data. So, a plot of the last PC against all data points can be used to find the points for which this PC 👇
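The idea can be sketched like this: the last PC captures the tiny "residual" direction of the data, so a point that breaks the data's structure scores big on it (synthetic data; plain NumPy SVD as my stand-in for a PCA library):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
# Column 2 is (almost) a linear combination of the first two.
X[:, 2] = X[:, 0] + X[:, 1] + rng.normal(scale=0.01, size=200)
X[0] = [1.0, 1.0, -2.0]        # outlier: badly violates that relation

# PCA via SVD on centred data; the *last* right singular vector is the
# low-variance residual direction.
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
last_pc_scores = Xc @ Vt[-1]
outlier = int(np.argmax(np.abs(last_pc_scores)))
```

Normal points score near zero on the last PC (only noise lives there), while the planted outlier's score is orders of magnitude larger, so it stands out immediately in the plot.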
Language Modelling -Recurrent Networks vs Feed Forward NNs. A thread, no math -a gentle explanation
First up, what's Language modelling? It's when you start to write a reply to this thread & your Google keyboard recommends your next words
Now how to train a model to do just this?
getting the machine to learn from data's past - ML
getting my brain's machine to learn from my life's past - *?*
while the former is cool and I'm okay at it ig, I hope I get good at the latter :)
word2vec- slight technicality: do we consider two vectors per word?
(initially, during the learning phase)
One vector when the word is a central word.
Other, when the word acts as context word.
@PyTorch
's Sequential vs ModuleList; & also their combination!
3 simple steps!
nn.Modules stored in Sequential are connected in a "cascaded" way - the output of the 1st Module in Sequential becomes the input to the 2nd Module - so we need to take care of dimensions.
In code 👇
1/3
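The cascading in code (toy sizes of my choosing):

```python
import torch
import torch.nn as nn

# Sequential chains modules: each output feeds the next module,
# so adjacent dimensions must line up (10 -> 20 -> 20 -> 5).
net = nn.Sequential(
    nn.Linear(10, 20),
    nn.ReLU(),
    nn.Linear(20, 5),
)
out = net(torch.randn(8, 10))
```

Contrast with ModuleList, which only *stores* modules: it defines no forward order, so you wire the data flow yourself inside forward().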
one question for data science people: how would you answer if an application asks you about your 'programming experience'? projects that demonstrate the same?
Doesn't this sound more on the engineering side?
And if you were to put ML(modelling)projects here, what would those be?
@PyTorch
Tip!
If the dataset isn't too big and you decide to keep it in GPU memory, and use the DataLoader to load mini batches..
Do not forget to specify the `generator` parameter of the DataLoader as shown. 👇
1/2
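A sketch of the tip: match the shuffling generator's device to the data's device (written to fall back to CPU so it runs anywhere; on a GPU box, `device` becomes "cuda" and the generator follows):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

device = "cuda" if torch.cuda.is_available() else "cpu"
data = torch.arange(10).float().to(device)   # whole (small) dataset on device
dataset = TensorDataset(data)

# When the dataset lives on the GPU, pass a generator on the same device;
# otherwise shuffling can fail with a cpu/cuda generator device mismatch.
loader = DataLoader(
    dataset,
    batch_size=4,
    shuffle=True,
    generator=torch.Generator(device=device),
)
batches = [b[0] for b in loader]
```

Keeping a small dataset resident on the GPU skips the per-batch host-to-device copies, which is exactly when this generator detail bites.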