Surag Nair

@suragnair

939
Followers
554
Following
27
Media
238
Statuses

Machine learning and genetics @Genentech. Previously CS PhD @Stanford.

Stanford, CA
Joined April 2009
Pinned Tweet
@suragnair
Surag Nair
10 months
Excited to share work with co-first author @immoameen . Gene regulation is a delicate balance b/w cis-regulatory sequence & TF conc, but these components are often studied in isolation. Reprogramming is a fantastic system to study their interplay 1/16
6
43
169
@suragnair
Surag Nair
9 months
It's been a real honour being advised by @anshulkundaje . Phenomenal mentor, scientist, and human being, with an infectious curiosity. Thanks to everyone who made this journey possible!
@anshulkundaje
Anshul Kundaje (anshulkundaje@bluesky)
9 months
Big congrats to DR. @suragnair for defending his PhD thesis with flying colors! He's an incredible scientist, is truly interdisciplinary with the superpower to use ML models for biological discovery. 1/
6
6
113
12
2
104
@suragnair
Surag Nair
3 years
In-silico saturation mutagenesis (ISM) of all human regulatory regions with a long-range (and large) model like Enformer can take many GPU-months. In the updated version of fastISM, we show that ISM for Enformer can be sped up by a factor of ~3x. Check it out!
@anshulkundaje
Anshul Kundaje (anshulkundaje@bluesky)
3 years
FastISM is now "officially" published in Bioinformatics. Code: Congrats to @suragnair and coauthors. You can read the preprint for free at
4
13
63
0
16
86
@suragnair
Surag Nair
2 years
Amazing paper that answers a longstanding question I had: do unsupervised language models trained on genomic sequences learn features such as promoters, splice sites and TF binding sites? They do + double up as powerful assay-agnostic variant effect predictors!
@gsbenegas
Gonzalo Benegas
2 years
Excited to share our findings training GPN, a DNA language model for Arabidopsis thaliana, with @sanjitsbatra and @yun_s_song : DNA language models are powerful predictors of non-coding variant effects, without the need for any labeled data. 1/n
Tweet media one
9
51
224
1
7
70
@suragnair
Surag Nair
2 years
Excited to announce that dynseq tracks are now fully supported by the WashU Genome Browser, UCSC Genome Browser and HiGlass/Resgen. Preprint: 1/
@anshulkundaje
Anshul Kundaje (anshulkundaje@bluesky)
3 years
Very excited that "Dynseq" tracks will soon be supported at the UCSC @GenomeBrowser as well. Below is a preview. Dynseq tracks are just bigwig tracks but they get visualized as dynamic sequence where the heights of bases are proportional to the signal at each position. 1/
Tweet media one
4
23
139
2
8
56
@suragnair
Surag Nair
1 year
Come through at 11:40 am tomorrow at #ISMBECCB2023 and I'll give you a glimpse into some curious properties exhibited by DNA sequence models!
@anshulkundaje
Anshul Kundaje (anshulkundaje@bluesky)
1 year
Come check out several talks at #ISMBECCB2023 on Monday and Tuesday @MlcsbC and @ISCB_RegSys COSI by @suragnair @kellycochra @jmschreiber91 and myself. See times below 1/
Tweet media one
Tweet media two
Tweet media three
Tweet media four
4
16
77
1
5
33
@suragnair
Surag Nair
4 years
Excited to share fastISM, a method to speed up in-silico saturation mutagenesis (ISM) for convolutional neural networks that operate on biological sequences. 10x speedup for Basset-like architectures! (1/4) How it started: How it's going:
Tweet media one
Tweet media two
@anshulkundaje
Anshul Kundaje (anshulkundaje@bluesky)
4 years
Check out fastISM: an efficient implementation of in-silico saturated mutagenesis (ISM) for inferring importance scores across entire input biosequences (e.g. DNA) from CNN models. Clever work by @suragnair 1/
3
21
73
2
3
29
@suragnair
Surag Nair
10 months
Don't know about you all, but I'm really feeling the AGI.
Tweet media one
1
0
16
@suragnair
Surag Nair
3 years
For the longest time I mistakenly assumed that batchnorm and activation layers are extremely cheap, but in fact they can take up about 20% of inference time when executing convolution layers. Gelu is especially slow.
Tweet media one
1
2
13
@suragnair
Surag Nair
10 months
At many enhancers, the exact set of TFs predictive of ATAC-seq signal changes over time. E.g. here's an enhancer that is initially opened by OSK; eventually KLF4 ceases to bind (its expression dwindles) but is replaced by ZIC (expressed exclusively in iPSCs). +2/
Tweet media one
1
2
10
@suragnair
Surag Nair
4 years
Great thread! To get around this exact issue when evaluating models that predict DNase accessibility in new cell types, we tried stratifying test examples by fraction of accessible training cell types. This reveals performance on cell-type specific regions
Tweet media one
@jmschreiber91
Jacob Schreiber
4 years
Long story short: because biochemical activity is frequently similar across cell types, using the activity from one cell type to predict activity in another cell type frequently gives good performance. Look at these tracks below, each coming from a different cell type. 8/
Tweet media one
1
0
2
1
1
9
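The stratification idea in @suragnair's reply above can be sketched as follows. This is a minimal illustration, not the code behind the attached plot: the function names, the binary accessibility labels, the number of bins and the use of accuracy as the metric are all assumptions made for the example.

```python
from collections import defaultdict

def stratify_by_training_accessibility(train_labels, n_bins=4):
    """Group region indices by the fraction of training cell types
    in which each region is accessible (labels are 0/1)."""
    bins = defaultdict(list)
    for idx, labels in enumerate(train_labels):
        frac = sum(labels) / len(labels)
        # Clamp so that a fraction of exactly 1.0 lands in the last bin.
        b = min(int(frac * n_bins), n_bins - 1)
        bins[b].append(idx)
    return dict(bins)

def accuracy(y_true, y_pred, indices):
    """Fraction of the given regions predicted correctly."""
    return sum(1 for i in indices if y_true[i] == y_pred[i]) / len(indices)

# Hypothetical data: 4 regions x 4 training cell types.
train_labels = [[1, 1, 1, 1], [0, 0, 0, 0], [1, 0, 0, 0], [1, 1, 0, 0]]
y_true = [1, 0, 1, 0]   # accessibility in the held-out cell type
y_pred = [1, 0, 0, 0]   # model predictions for the held-out cell type

strata = stratify_by_training_accessibility(train_labels)
per_stratum = {b: accuracy(y_true, y_pred, idx) for b, idx in strata.items()}
```

Reporting the metric per stratum separates regions that are accessible almost everywhere (where copying another cell type already does well) from cell-type-specific regions, which is where the low-fraction bins sit.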
@suragnair
Surag Nair
1 year
@lpachter @TMDUniversity Are there datasets that can serve as "ground truth" annotations of cell state transitions? Perhaps acquired through orthogonal methods such as lineage tracing? In the absence of those, most velocity analyses seem rather subjective.
3
1
8
@suragnair
Surag Nair
2 years
Resgen. @deAlmeida_BP and co have set up a great session to explore model interpretations from their recent DeepSTARR models. 5/
2
3
8
@suragnair
Surag Nair
10 months
To understand the sequence basis, we trained ChromBPNet models (w/ @panushri25, paper coming soon). One for each cell state. They accurately predict ATAC-seq at base resolution. Interpreting the contribution of each base highlights TF motifs. 6/
Tweet media one
Tweet media two
1
2
8
@suragnair
Surag Nair
2 years
Dynseq tracks are bigwig tracks that are visualised as dynamic sequence with nucleotide heights scaled by user defined scores. ML models for genomics frequently generate importance scores, and dynseq makes it easier to visualize, explore and share them. 2/
1
1
7
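The scaling described in the tweet above (nucleotide heights proportional to user-defined scores) can be illustrated with a small sketch. This is not how any of the genome browsers implement dynseq; `dynseq_heights` and `max_height` are hypothetical names, and the normalisation by the largest absolute score is an assumption.

```python
def dynseq_heights(seq, scores, max_height=1.0):
    """Scale per-base scores so the largest magnitude maps to max_height.

    Returns a list of (base, height) pairs; negative scores give negative
    heights, which a renderer could draw below the axis.
    """
    if len(seq) != len(scores):
        raise ValueError("sequence and scores must have the same length")
    # Fall back to 1.0 when every score is 0 to avoid dividing by zero.
    peak = max((abs(s) for s in scores), default=0.0) or 1.0
    return [(base, max_height * s / peak) for base, s in zip(seq, scores)]

# Hypothetical importance scores for a 3 bp sequence.
heights = dynseq_heights("ACG", [0.5, -1.0, 2.0])
```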
@suragnair
Surag Nair
3 years
@daweonline This is okay when working with well-catalogued cell types. I've observed that when cells are perturbed in specific ways (e.g. overexpression of specific sets of TFs) one often gets a good number (>10k) of peaks that are condition specific and don't overlap with the DHS Index.
0
0
7
@suragnair
Surag Nair
2 years
Big thanks to all who made this possible @wuepgg @twang5 @GenomeBrowser @pkerpedjiev @flekschas @panushri25 @anshulkundaje and Arjun Barrett! Check out the preprint for specifications and examples of interpretation of cis-regulatory sequence syntax and regulatory variants. 6/6
1
2
7
@suragnair
Surag Nair
3 years
A truly mind-blowing piece of science!
@_e_d_v_
Eeshit Dhaval Vaishnav
3 years
Our @Nature paper is now online: Paper: Code: Data: App: My most sincere thanks to Prof. Aviv Regev, @CarldeBoerPhD , @MIT , @broadinstitute and our collaborators
Tweet media one
14
279
1K
1
2
6
@suragnair
Surag Nair
10 months
Noise in expression of Sendai vectors seems to result in initial diversification of trajectories. Cells with high K convert to a keratinocyte-like state, some cells remain fibroblast-like with low OSKM or only M, and remaining show high expression of OSKM 4/
1
2
7
@suragnair
Surag Nair
2 years
@gsbenegas @Sanjitsbatra @yun_s_song Great work!! Few questions: 1) Do the models learn reverse complement equivalence, i.e. how similar are embeddings for a sequence and its RC? 2) Have you tried using an algorithm like TF-MoDISco () to summarize the set of motifs learned?
1
0
5
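The first question in the reply above (how similar are embeddings for a sequence and its reverse complement?) could be checked along these lines. This is a sketch only: `toy_embed` is a hypothetical stand-in for a real model's embedding function, and cosine similarity is one possible choice of similarity measure.

```python
import math

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Complement each base, then reverse the sequence."""
    return seq.translate(COMPLEMENT)[::-1]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def toy_embed(seq):
    # Hypothetical embedding: raw base counts. A real DNA language model
    # would return a learned vector here.
    return [seq.count(base) for base in "ACGT"]

def rc_similarity(seq, embed):
    """Similarity between the embedding of seq and that of its reverse complement."""
    return cosine(embed(seq), embed(reverse_complement(seq)))
```

A model that has learned reverse-complement equivalence would give values near 1 for most sequences; the toy embedding above does not, since it only does so for sequences whose base counts are unchanged by complementation.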
@suragnair
Surag Nair
10 months
Fascinatingly, in some rare cases, the same sequence is predicted to be bound by TFs from different families in different trans contexts. Here's an example of a sequence that is a weak AP-1 site in fibroblasts (top) but a weak OCT-SOX site in iPSCs (bottom). +4/
Tweet media one
1
2
5
@suragnair
Surag Nair
2 years
Short tutorials on how to use the dynseq track are available here. UCSC: 3/
1
3
5
@suragnair
Surag Nair
10 months
Leveraging ChromBPNet's ability to clean up Tn5 bias and perform in-silico footprinting, we were able to visualize a clear relationship between footprint depth, TF concentration and motif affinity. 10/
Tweet media one
1
1
5
@suragnair
Surag Nair
2 years
WashU 4/
1
1
4
@suragnair
Surag Nair
2 years
@lal_avantika Have tried this but doesn't help. For chromatin, naively assuming a peak is 1000 bp, 50 "important" bp per peak (e.g. motifs), and 1 in 1000 positions differ from reference, it's like a flipped mislabelling issue with 5% error, to which models are usually robust.
3
0
4
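The back-of-the-envelope estimate in the reply above can be reproduced directly. This is one reading of the arithmetic, assuming variants fall independently at each position; the variable names are illustrative.

```python
variant_rate = 1 / 1000   # fraction of positions that differ from the reference
important_bp = 50         # "important" bases per ~1000 bp peak (e.g. motifs)

# Expected number of variants landing on important bases in one peak.
expected_hits = variant_rate * important_bp

# Probability that at least one important base carries a variant.
p_peak_affected = 1 - (1 - variant_rate) ** important_bp
```

Both quantities come out at roughly 0.05, which matches the "5% error" figure: about one peak in twenty would have its label potentially affected by using the reference sequence.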
@suragnair
Surag Nair
1 year
@SashaGusevPosts Not for mammals. But in yeast, [Vaishnav et al. 2022] provide compelling evidence that going "out-of-genome" with millions of random sequences helps train very accurate models of expression, despite being a much simpler problem (80 bp -> expression).
2
1
4
@suragnair
Surag Nair
4 years
The fastISM API is minimal, making it very easy for you to run it on your Keras model in a few lines. Would love to hear your thoughts and feedback, and squash bugs that come up. (2/4)
Tweet media one
1
1
3
@suragnair
Surag Nair
1 year
@SashaGusevPosts from a purely ML perspective, expression tasks are at least partly bottlenecked by training data. ~20k gene expression values per cell state is not enough to learn accurate maps from ~1Mb sequence -> expression. IMO clean, large scale perturbations will eventually change this.
1
0
4
@suragnair
Surag Nair
10 months
@immoameen generated snATAC+RNA multiome from days 1+2 post-induction. Across nuclei, we found an association between sequestration of AP1 to transient sites and broad repression across ~1k fibroblast genes. OSK indirectly repress fibroblast genes, and concentration matters! 13/
Tweet media one
1
1
2
@suragnair
Surag Nair
9 months
@cdessimoz @anshulkundaje Thank you Christophe! This journey started with the summer I spent as an undergrad in your lab, and I can't thank you enough for taking the chance!
0
0
3
@suragnair
Surag Nair
1 year
@StefanoBerto83 @anshulkundaje @lpachter Could be interesting to augment UMAPs with “warnings” when UMAP distance between 2 cells/clusters diverges from the actual distance. Something along the lines of “these distances are longer/shorter than they appear”
3
0
3
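The "warnings" idea in the reply above could look something like the sketch below. It is an assumption-laden illustration, not an existing UMAP feature: distances are Euclidean, each space is rescaled by its median pairwise distance, and `tolerance` is an arbitrary threshold.

```python
import math
from itertools import combinations
from statistics import median

def distortion_warnings(original, embedded, tolerance=2.0):
    """Flag pairs whose embedded distance diverges from the original distance."""
    pairs = list(combinations(range(len(original)), 2))
    d_orig = [math.dist(original[i], original[j]) for i, j in pairs]
    d_emb = [math.dist(embedded[i], embedded[j]) for i, j in pairs]
    med_orig, med_emb = median(d_orig), median(d_emb)
    if med_orig == 0 or med_emb == 0:
        return []
    flags = []
    for (i, j), do, de in zip(pairs, d_orig, d_emb):
        if do == 0 or de == 0:
            continue  # skip coincident points
        # Ratio > 1: the embedding stretches this pair relative to the rest.
        ratio = (de / med_emb) / (do / med_orig)
        if ratio > tolerance:
            flags.append((i, j, "shorter than it appears"))
        elif ratio < 1 / tolerance:
            flags.append((i, j, "longer than it appears"))
    return flags

# Hypothetical points: three cells in the original space and in a 2D embedding.
original = [(0, 0, 0), (1, 0, 0), (10, 0, 0)]
embedded = [(0, 0), (1, 0), (2, 0)]
flags = distortion_warnings(original, embedded)
```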
@suragnair
Surag Nair
10 months
Amazing @immoameen generated immaculate scRNA and scATAC of fibroblasts overexpressing OSKM using a Sendai virus delivery system at days 0, 2, 4, ..., 14 and iPSCs at passage 30. We observe a primary reprogramming trajectory and 3 off-target trajectories. 3/
Tweet media one
Tweet media two
1
0
3
@suragnair
Surag Nair
10 months
@dr_alphalyrae relevant analysis
@james_y_zou
James Zou
11 months
How well can #GPT4 provide scientific feedback on research papers? We study this Q in our new work We created a pipeline using GPT4 to read 1000s papers (from #Nature , #ICLR , etc.) and generate feedback (eg suggestions for improvement). Then we compare
Tweet media one
Tweet media two
17
212
794
0
0
3
@suragnair
Surag Nair
4 years
ISM is an interpretability method that is unique to biological sequences. While ISM is widely used, the method itself has not received much love. Dynamic computation graphs in TensorFlow 2 now make it possible to trim away redundant computations for ISM on entire sequences. (3/4)
1
0
3
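For readers unfamiliar with ISM, the baseline procedure that the thread above sets out to speed up can be sketched as below. This is plain, naive ISM and not the fastISM algorithm or its API; `toy_model` is a hypothetical stand-in for a trained network.

```python
from typing import Callable, Dict, List

ALPHABET = "ACGT"

def ism_scores(seq: str, model: Callable[[str], float]) -> List[Dict[str, float]]:
    """Naive ISM: for every position, substitute each alternative base
    and record the change in model output relative to the reference."""
    ref_output = model(seq)
    scores = []
    for i, ref_base in enumerate(seq):
        row = {}
        for alt in ALPHABET:
            if alt == ref_base:
                row[alt] = 0.0  # no mutation, no change
            else:
                mutated = seq[:i] + alt + seq[i + 1:]
                row[alt] = model(mutated) - ref_output
        scores.append(row)
    return scores

def toy_model(seq: str) -> float:
    # Hypothetical model: counts occurrences of the motif "TGA".
    return float(seq.count("TGA"))

scores = ism_scores("ACTGAC", toy_model)
```

Each sequence of length L needs 3L extra forward passes here, each recomputed from scratch, which is why ISM over many long sequences becomes expensive and why trimming redundant computation is attractive.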
@suragnair
Surag Nair
3 years
@hardmaru @TonyZador @doctorveera From a cursory glance, disruption of the gene FOXP2 has been associated with speech/language impairment. Looks like there’s much to do in understanding how these disruptions impact brain architecture/development/function.
1
0
3
@suragnair
Surag Nair
10 months
We find ~50k enhancers reproducible across different reprogramming expts, but missing across all ENCODE cell lines and tissues. This is likely due to promiscuous OSK binding at extreme concentrations. This suggests there may not be a finite global peak set/index. +5/
1
1
3
@suragnair
Surag Nair
9 months
@_AndrewLeduc @vitaliikl @immoameen We use single-cell RNA expression as a surrogate. They span multiple orders of magnitude across cells in our system for each of the reprogramming TFs.
0
0
3
@suragnair
Surag Nair
2 years
@gsbenegas @Sanjitsbatra @yun_s_song Very cool! Thanks for running the analysis!
0
0
2
@suragnair
Surag Nair
2 years
@anshulkundaje For sure, and the authors do acknowledge that in the paper. I’ll qualify by saying it’s a good baseline for low resource species. What is super cool is showing that unsupervised learning can pick up motifs!
1
0
2
@suragnair
Surag Nair
3 years
@jmw86069 @anshulkundaje @GenomeBrowser The HiGlass version supports custom fasta for each track which can differ from the reference.
0
0
2
@suragnair
Surag Nair
10 months
Altogether, we connect TF conc and motif syntax to refine our understanding of reprogramming progression. We elaborate on the link between transient peaks and indirect repression by AP1 theft. We also hope our work informs partial reprogramming rejuvenation and other studies! 14/
Tweet media one
1
0
2
@suragnair
Surag Nair
10 months
Sendai transcripts terminate at translation end. In the scRNA data (not multiome), we were able to estimate endogenous vs Sendai OSKM transcripts via an EM algorithm. Cells that enter a partially reprogrammed state at day 14 failed to activate endogenous OCT4 (vs pre-iPSCs). +1/6
Tweet media one
1
1
3
@suragnair
Surag Nair
3 years
@nameluem @daweonline @timtriche That seems like a fairly flexible scheme. Curious-- here's some data from a single-cell ATAC-seq time series at a specific locus that shifts and narrows over the time course. What would be the best way to reconcile this with a reference-based system?
Tweet media one
2
0
2
@suragnair
Surag Nair
10 months
We also find examples of peaks that appear to "shift" over the course of reprogramming as the exact set of putatively bound TFs changes. Here, KLF and OCTSOX seem to open up a broad peak, that eventually localizes closer to OCTSOX sites. +3/
Tweet media one
1
3
3
@suragnair
Surag Nair
10 months
>100k dynamic peaks are transient: off in fibroblasts and iPSCs. They regulate transient genes with unclear function. We find that transient peaks are reproducible features of reprogramming, but not found in stable tissues/cell lines! Very mysterious. 8/
Tweet media one
Tweet media two
Tweet media three
1
0
1
@suragnair
Surag Nair
1 year
@SashaGusevPosts For humans, the largest perturbation datasets [Gasperini et al. 2019, Fulco et al. 2019] are currently only used for evaluating performance (e.g. in Enformer). Beyond a certain scale, we'll likely use them for training better models.
0
0
1
@suragnair
Surag Nair
10 months
What are transient peaks doing functionally? Previous work shows they act as sinks for fibroblast TFs like AP1. We find ~10k AP1 sites in transient peaks with clear TF footprints. Does stealing AP1 from its original sites result in widespread repression of fibroblast genes? 12/
Tweet media one
1
0
1
@suragnair
Surag Nair
11 months
@MelanieWeilert Thank you Melanie!
0
0
1
@suragnair
Surag Nair
10 months
Thanks to co-first author @immoameen for the amazing data, my advisor @anshulkundaje for his support, and co-authors Laksshman @panushri25 @jmschreiber91 @_bakshay Will, David, Helen Blau, @IKarakikes @KWang_Lab . This concludes the main thread. Read on for bonus tidbits. 16/16
1
0
1
@suragnair
Surag Nair
2 years
@lal_avantika Yes potentially when variants have long range effects e.g. histone and expression. Worth a try!
0
0
1
@suragnair
Surag Nair
9 months
@michael_nielsen I tried using ChatGPT voice mode (with GPT4) by asking it to be a translator for Hindi<->English. Worked reasonably well.
0
0
0
@suragnair
Surag Nair
3 years
@timtriche @daweonline Agree that the idea of discrete peak calling is suboptimal. More generally, I meant that cells can be perturbed in ways that open chromatin at sites other than those seen across catalogued cell types/tissues. Very high overexpression of pioneer TFs is one way.
0
0
1
@suragnair
Surag Nair
3 years
@timtriche @nameluem @daweonline pseudo-bulked and smoothed over fairly homogeneous populations. Depends on what "regulatory element" means in this case as it is already a short region (600 bp). At the single-molecule level there can be a combinatorial explosion
0
0
1
@suragnair
Surag Nair
1 year
@dagarfield Plotting the + and - strand pileups separately is the best way to show it IMO. When you do +4/-4, the reads line up very nicely.
Tweet media one
1
0
1
@suragnair
Surag Nair
10 years
@guptashas damn. :/ Spread the word if you can. Thanks :)
1
0
1
@suragnair
Surag Nair
10 months
Initial state along reprogramming trajectory shows extreme expression levels of OSK; we dubbed it the extreme OSK (xOSK) state. ~5x higher OCT4 and ~10x higher SOX2 RNA relative to iPSCs. 10s of thousands of new peaks. Drastic change within 2 days. 5/
Tweet media one
Tweet media two
1
1
1
@suragnair
Surag Nair
3 years
@hardmaru @TonyZador One approach to tackle this is to look at case-control genetic association studies (GWAS) that identify genes that are potentially implicated in language impairment, and follow up on how they impact brain development. cc/ @doctorveera
1
0
1
@suragnair
Surag Nair
10 months
@anshulkundaje Or maybe it's lying deliberately :O
0
0
1
@suragnair
Surag Nair
11 years
Must get these Aam Aadmi Party guys to come to my hostel. My room needs cleaning and they've got the jhadus! #AAP #Delhi #Elections
1
0
1
@suragnair
Surag Nair
10 months
Transient peaks contain degenerate instances of OCTSOX motifs, including a partial motif previously shown to engage chromatinized motifs by @aliciakmichael @RalphSGrand . Early on at extreme conc, OSK occupy low-affinity sites (approximated by motif log-odds) compared to iPSCs 9/
Tweet media one
1
0
1
@suragnair
Surag Nair
10 months
Paper: Mapped data: Analysis products: Code: ChromBPNet: Models: Interactive browsers: 15/
1
1
1