Jacob Schreiber

@jmschreiber91

4,892 Followers
1,125 Following
435 Media
4,871 Statuses

Visiting Scientist @impvienna, incoming prof @UMassGCB. Previously, @StanfordMed @uwcse. Studying genomics, machine learning, and fruit.

Vienna, Austria
Joined March 2017
Pinned Tweet
@jmschreiber91
Jacob Schreiber
4 years
The more papers I read for a review article I'm writing about ML pitfalls in genomics, the more my faith is shaken in the results from papers that apply machine learning to methylation arrays. A salty thread. 1/
21
244
903
@jmschreiber91
Jacob Schreiber
2 years
@TheBcellArtist They probably go places that will pay them appropriately for their skills. Paying post-docs under $70k is common but obscene in most fields, given how critical they are.
7
10
755
@jmschreiber91
Jacob Schreiber
5 years
Selecting features using all data before splitting into folds for training/testing is a big source of train-test leakage. To demonstrate, I generated random data and labels, selected down to 25 features, and trained a model. Much better than random performance due to the leakage.
Tweet media one
8
173
572
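To make that leakage concrete, here is a minimal sketch in the spirit of the demo above (my own reconstruction, not the code behind the attached figure), assuming scikit-learn and numpy. Selecting the 25 "best" features on the full random data set before cross-validating gives an AUC far above chance.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10000))   # purely random features
y = rng.integers(0, 2, size=100)    # purely random binary labels

# Leaky: choose the 25 "best" features using ALL of the data, then cross-validate.
X_leaky = SelectKBest(f_classif, k=25).fit_transform(X, y)
auc = cross_val_score(LogisticRegression(), X_leaky, y,
                      cv=5, scoring="roc_auc").mean()
print(f"AUC with leaked feature selection: {auc:.2f}")   # far above 0.5
```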
@jmschreiber91
Jacob Schreiber
3 years
The more I use classic bioinformatic tools, e.g. bwa and vcftools, the more I dislike current trends in bioinformatic tooling; pipelines are nice but if I want to test out your method the first step shouldn't be "set up a Terra/GCP/AWS account."
21
64
527
@jmschreiber91
Jacob Schreiber
2 years
It's frustrating reading comp bio articles these days because many keep falling into the same pitfalls. Hard to know if the method actually works, or whether they messed up the evaluation. Here are some issues I've seen recently (w/o names):
13
109
412
@jmschreiber91
Jacob Schreiber
4 months
Thrilled to announce that I'll be joining the incredible researchers at @IMPvienna for a year as a visiting scientist and then joining @UMassChan as an assistant professor in Genomics+CompBio in 2025! At both places, I'll be continuing my work on deep learning + genomics.
65
17
378
@jmschreiber91
Jacob Schreiber
1 year
Why are you confused? There's just genes. And alternate splicing. And regulatory elements. And regulatory elements in the alternate splicing. And regulatory elements are transcribed. And RNAs can do things. And proteins can fold differently in different cell types. And...
11
49
363
@jmschreiber91
Jacob Schreiber
4 years
@CT_Bergstrom This entire time I knew in the back of my mind that you were a person but, because I've only seen you on Twitter, I just assumed you were a benevolent bird sharing your vast knowledge of biology with us. Illusion shattered by the picture in this article. :(
12
10
357
@jmschreiber91
Jacob Schreiber
4 years
Jumping from a successful post-doc into a new PI position.
@beeonaposy
Caitlin Hudon
4 years
jumping from tutorials into your own data
21
213
1K
4
45
354
@jmschreiber91
Jacob Schreiber
5 years
Me, a former sklearn dev, hiding under the bed:
Armed robber: ...
Me: ...
Armed robber: ....
Me: ....
Armed robber: Logistic regression shouldn't have a default L2 regularization of 1
Me: *still hides*
3
30
318
@jmschreiber91
Jacob Schreiber
4 years
@naomirwolf @BillGates As a researcher at U of Washington, I remember when @BillGates walked into my lab and said "Stop working on this, we must work on vaccine microchips!" and we dropped all our grant-funded work immediately. We would've gotten away with it too, if you didn't point it out on Twitter.
4
18
305
@jmschreiber91
Jacob Schreiber
1 year
CS/ML people venturing into biology frequently assume that the data they're given is clean and that all the upstream processing steps have been figured out. This is absolutely not the case. I would encourage CS/ML people to really look into the gritty details like this.
@StevenSalzberg1
Steven Salzberg 💙💛
1 year
A very intriguing result in the new Y chromosome paper, one that you might miss unless you read the paper closely... 1/6
6
295
1K
8
58
290
@jmschreiber91
Jacob Schreiber
6 months
Sequence-based ML methods (Enformer, ChromBPNet...) are invaluable in genomics but the ecosystem for their *use* after training is less developed. Introducing `tangermeme`: a PyTorch library for genomics discovery for everything-other-than-the-model. 1.
5
55
271
@jmschreiber91
Jacob Schreiber
3 years
Finally out in @NatureRevGenet: Navigating the pitfalls of applying machine learning in genomics! w/ @seawhalen et al. Our key point: you MUST evaluate your models in the same setting in which you want them to be used, or they might not actually work in practice.
4
83
240
@jmschreiber91
Jacob Schreiber
2 years
PSA: There are no such things as "enhancers," "promoters," and "silencers." There are only TF binding sites and those TF's effects on the steps of transcription and degradation.
17
16
229
@jmschreiber91
Jacob Schreiber
7 months
I regularly hear people in ML+genomics complain that they're running out of memory or disk space. Frequently, the culprit is inefficient handling of RNA/DNA sequence and you can make big gains in compression with a few tricks. 1/
4
40
216
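One of the simplest tricks in this vein is to stop storing sequence as one character per base (or, worse, a float32 one-hot) and keep small integer codes instead. A hedged numpy sketch of the idea, not taken from the thread itself:

```python
import numpy as np

# Lookup table mapping the byte values of 'A', 'C', 'G', 'T' to the codes 0-3.
lut = np.zeros(256, dtype=np.uint8)
lut[np.frombuffer(b"ACGT", dtype=np.uint8)] = np.arange(4, dtype=np.uint8)

def encode(seq: str) -> np.ndarray:
    """Store the sequence as one uint8 code per base instead of characters or one-hots."""
    return lut[np.frombuffer(seq.encode("ascii"), dtype=np.uint8)]

seq = "ACGTACGTAAAACCCC"
codes = encode(seq)
# 1 byte per base here vs. 16 bytes per base for a float32 one-hot;
# packing four 2-bit codes per byte would shrink this by another 4x.
print(codes.nbytes, len(seq) * 4 * 4)
```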
@jmschreiber91
Jacob Schreiber
2 years
A bit ago, I got a grant from @NumFOCUS to rewrite pomegranate from the ground up using a PyTorch backend. The goal was to increase speed, decrease code size, and decrease the barrier to writing custom components or integrating w PyTorch. The results have been incredible so far.
8
21
207
@jmschreiber91
Jacob Schreiber
2 years
Found out last night that @NumFOCUS funded my proposal to rewrite #pomegranate from the ground up using @PyTorch as the backend! Need to train massive HMMs using multiple GPUs, or want a mixture of negative binomials as part of your neural network? Watch this space!
16
22
205
@jmschreiber91
Jacob Schreiber
3 years
Here's another genomics ML pitfall: account for fragment length when modeling multiple genomics experiments! If you don't, your predictions will probably look a little bit... off... even though the model is correct! Why? A thread: 🧵 1/
1
33
198
@jmschreiber91
Jacob Schreiber
1 year
pomegranate v1.0.0 has been released! This major release is a complete rewrite using @PyTorch to replace the Cython backend. Same great probabilistic models, now WAY faster, GPU support, fewer installation issues, and easier to extend. Check it out! 1/
3
33
194
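As a small illustration of the kind of workflow the rewrite enables, here is a sketch of fitting a mixture model directly on a torch tensor. The module paths (`pomegranate.distributions`, `pomegranate.gmm`) and the pattern of passing uninitialized distributions are from my recollection of the v1.0 docs, so treat them as assumptions and check the README.

```python
import torch
from pomegranate.distributions import Normal
from pomegranate.gmm import GeneralMixtureModel

# Two well-separated blobs; fit a two-component Gaussian mixture on a torch tensor.
X = torch.cat([torch.randn(500, 2) - 3.0, torch.randn(500, 2) + 3.0])

model = GeneralMixtureModel([Normal(), Normal()])
model.fit(X)
print(model.predict_proba(X[:5]))
```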
@jmschreiber91
Jacob Schreiber
1 year
It's been a while since my last pitfalls-in-genomics thread, but here's a new one: YOU MUST ACCOUNT FOR READ DEPTH in single-cell experiments. Why? Because read depth will likely be confounded by CELL IDENTITY in ways that can induce leakage in downstream ML methods.
5
31
189
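A toy illustration of the point (my own construction, not from the thread): the two "cell types" below share the exact same expression program and differ only in sequencing depth, yet a classifier on the raw counts separates them almost perfectly.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_genes = 200
props = rng.dirichlet(np.ones(n_genes))            # one shared expression program

# Two "cell types" that differ ONLY in depth: 2,000 vs 8,000 reads per cell.
depths = np.r_[np.full(500, 2000), np.full(500, 8000)]
counts = np.vstack([rng.multinomial(d, props) for d in depths])
labels = (depths == 8000).astype(int)

raw_auc = cross_val_score(LogisticRegression(max_iter=2000), counts, labels,
                          cv=5, scoring="roc_auc").mean()

# Scale depth away (counts per 10k); the apparent "cell type" signal should
# drop back toward chance once depth is accounted for.
cp10k = counts / counts.sum(axis=1, keepdims=True) * 1e4
norm_auc = cross_val_score(LogisticRegression(max_iter=2000), cp10k, labels,
                           cv=5, scoring="roc_auc").mean()
print(raw_auc, norm_auc)
```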
@jmschreiber91
Jacob Schreiber
1 year
This fiasco is exactly why I read ML papers in genomics with such a critical eye, and try to write about pitfalls as much as I can. Genomics data is COMPLICATED and ML methods are eager to please. It's easy to mess up, and when you do, you'll appear to get good performance.
@ProfBootyPhD
Professor Booty PhD
1 year
And then they asked, can we correctly classify these cancers based on zero raw data? And of course, the answer was yes - all the classification power is derived from the idiosyncratic zero-to-something normalization enacted by Voom-SNM, and none from the actual raw data. 24/
Tweet media one
3
9
92
5
33
191
@jmschreiber91
Jacob Schreiber
3 years
A flaw I'm seeing in a lot of papers is that they think that "cross-validation" gives you permission to perform architecture search on the test set. If the cross-validation involves the entire data set and you choose models based on best performance on it, you're making an error.
8
20
190
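One standard way to keep the search honest is nested cross-validation: an inner loop picks the hyperparameters/architecture, and an outer loop, which never informs those choices, reports performance. A minimal scikit-learn sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=50, random_state=0)

# Inner loop chooses hyperparameters; the outer folds never inform that choice,
# so the outer score is an honest estimate of the selected model's performance.
inner = GridSearchCV(SVC(), {"C": [0.1, 1, 10], "gamma": ["scale", 0.01]}, cv=3)
outer_scores = cross_val_score(inner, X, y, cv=5)
print(outer_scores.mean())
```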
@jmschreiber91
Jacob Schreiber
5 years
a convo I had before grad school
me: does a phd make you feel like an expert on a topic?
phd: the opposite
me: do you feel productive while doing research?
phd: the opposite
me: do you at least get paid well for all the stress?
phd: the opposite
me: sign me up
1
23
181
@jmschreiber91
Jacob Schreiber
2 years
As a casual reminder to reviewers and authors: if you are working on a biology task and you use random cross-validation, you are making a mistake. It's truly disheartening to review a paper and see this because you have no idea just how distorted the results are.
6
35
185
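The usual fix is to split along whatever grouping reflects deployment, e.g. holding out whole chromosomes, individuals, or cell types rather than random rows. A sketch with a hypothetical per-example `chromosome` array standing in for real metadata:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GroupKFold, cross_val_score

# Toy stand-ins: X/y as usual, plus a hypothetical per-example chromosome label.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))
y = rng.integers(0, 2, size=1000)
chromosome = rng.choice([f"chr{i}" for i in range(1, 23)], size=1000)

# Every fold holds out whole chromosomes, matching how the model would be deployed.
scores = cross_val_score(RandomForestClassifier(n_estimators=100), X, y,
                         groups=chromosome, cv=GroupKFold(n_splits=5))
print(scores.mean())
```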
@jmschreiber91
Jacob Schreiber
2 years
Wrote a custom Triton kernel for PyTorch and it's 🌟10x slower🌟 than native PyTorch 🥳
3
4
174
@jmschreiber91
Jacob Schreiber
2 years
Computational biology is becoming the same thing. So many papers and talks I see recently are benchmark-driven, not science-driven. Uncovering something scientifically interesting is seen as an optional final step if you want to get into a top journal, not a key motivation.
@GaryMarcus
Gary Marcus
2 years
Counterpoint: if you joined NLP recently, you might think that language understanding is about beating benchmarks, rather than converting syntactic strings to meanings (or vice versa)
In the short-term, you might think that’s good
But hallucinations may well get you in the end
7
8
60
9
22
179
@jmschreiber91
Jacob Schreiber
2 years
At the beginning of 2018 at an @ENCODE_NIH meeting, the idea for the ENCODE Imputation Challenge was born: an open contest to predict genome-wide genomics experiments given fixed train/test sets and encourage development of large-scale imputation methods. warning: drama 🧵 1/
2
35
168
@jmschreiber91
Jacob Schreiber
4 years
Ready for a new "ML pitfalls in genomics w/ Jacob"? When evaluating ML models across cell types/individuals, you MUST baseline against the avg activity or risk being fooled by seemingly good performance. Thrilled to finally see this quick read out! 1/
2
46
163
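For anyone who hasn't seen it, the baseline itself is trivial to compute, which is part of why it is such a mandatory comparison. A toy numpy sketch with synthetic signal (not data from the paper):

```python
import numpy as np

# Toy signal matrix: rows are cell types, columns are genomic positions. Most of
# the variation is shared across cell types, which is what makes this baseline strong.
rng = np.random.default_rng(0)
shared = rng.gamma(2.0, size=5000)
signal = shared * rng.lognormal(sigma=0.3, size=(10, 5000))

train_cell_types, held_out = signal[:9], signal[9]

# The average-activity baseline: predict the per-position mean of the training cell types.
baseline = train_cell_types.mean(axis=0)
r = np.corrcoef(baseline, held_out)[0, 1]
print(f"baseline correlation: {r:.2f}")   # already high; a model has to beat this
```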
@jmschreiber91
Jacob Schreiber
6 years
Happy to share new work on a pitfall you can fall into if you train ML models to predict across cell types. TL;DR, always compare your predictions to the per-locus average activity, it's a hard baseline to beat! @uwescience @uwgenome @uwcse @EncodeDCC
2
70
166
@jmschreiber91
Jacob Schreiber
2 years
After several months of work, I'm excited to announce the first release of torchegranate, my @PyTorch rewrite of pomegranate! torchegranate is faster, more readable, better tested, and easy to extend. Try it out with `pip install torchegranate`! 1.
7
29
158
@jmschreiber91
Jacob Schreiber
3 years
The first fruit of my post-doc is finally dropping: Yuzu! Yuzu speeds up in-silico saturated mutagenesis using principles of compressed sensing, by over an order of magnitude on many common architectures for both protein and DNA inputs. 1/ 🌠paper🌠:
4
28
157
@jmschreiber91
Jacob Schreiber
4 years
A canonical mistake in machine learning is performing data preprocessing outside of cross-validation, i.e., applying transformations or feature selection before splitting the data into training and test sets. 2/
3
24
156
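In scikit-learn the standard guard is to put every preprocessing step inside a `Pipeline`, so scaling and feature selection are re-fit on each training fold only. A minimal sketch, reusing the random-data setup from earlier in this thread:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10000))     # random data again, as in the earlier demo
y = rng.integers(0, 2, size=100)

# Scaling and feature selection now happen inside each fold, on training data only.
pipe = make_pipeline(StandardScaler(), SelectKBest(f_classif, k=25),
                     LogisticRegression())
auc = cross_val_score(pipe, X, y, cv=5, scoring="roc_auc").mean()
print(f"AUC without leakage: {auc:.2f}")   # hovers around 0.5, as it should
```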
@jmschreiber91
Jacob Schreiber
5 years
after a wild 6 years in grad school, tomorrow i get to find out what life is like after defense. i will report back. @AcademicChatter
15
2
155
@jmschreiber91
Jacob Schreiber
3 years
An unfortunate trend I'm seeing in comp genomics right now is submissions that treat simply adding complexity to their model as a meaningful contribution. To me, it doesn't matter how complex your model is, it matters how useful it is in practice or what you discover with it.
3
16
148
@jmschreiber91
Jacob Schreiber
3 years
@michaelhoffman No one will use your computational method outside your group, unless it's for basic data processing, so you better be prepared to do all the legwork of applying it all the way to scientific discovery because no one else will.
4
8
150
@jmschreiber91
Jacob Schreiber
3 years
This "classic" editorial should be required reading for any new student trying to apply ML in genomics -- particularly, for those coming at it from a CS perspective. Be skeptical of your own performance measures!
3
30
143
@jmschreiber91
Jacob Schreiber
4 years
Last week was my last at @uwgenome . Today, I start a post-doc with @anshulkundaje at @Stanford ! When I took the position I imagined there would be more pomp and circumstance than logging out of one server and logging into another...
5
3
145
@jmschreiber91
Jacob Schreiber
4 years
I've used this example in the past: consider ENTIRELY RANDOM data. What happens if you select the top features and then do cross-validation? You get better than random performance because the selected features coincidentally line up with the labels. 7/
Tweet media one
4
22
140
@jmschreiber91
Jacob Schreiber
3 years
Sometimes I feel like using @numba_jit is cheating. I was concerned that an analysis was taking too long, at ~40 minutes per file, so I just slightly rewrote and jitted the function and now it takes 7 seconds.
3
12
128
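The pattern is usually just "write the plain loop, then jit it". A toy example of that pattern (not the actual analysis from the tweet), assuming numba is installed:

```python
import numpy as np
from numba import njit

def rolling_mean_py(x, w):
    out = np.empty(len(x) - w + 1)
    for i in range(len(out)):
        out[i] = x[i:i + w].mean()
    return out

# Same function, just compiled; numba turns the Python loop into machine code.
rolling_mean_jit = njit(rolling_mean_py)

x = np.random.randn(100_000)
expected = rolling_mean_py(x[:10_000], 100)
got = rolling_mean_jit(x[:10_000], 100)    # first call pays the compilation cost
assert np.allclose(expected, got)
```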
@jmschreiber91
Jacob Schreiber
4 years
Seeing this mistake in scientific papers is bad enough but seeing it be subtly integrated into workflows means even more people will inadvertently make this mistake. If you are working with methylation arrays, please ensure you do probe selection only on the training set! 12/12
8
2
127
@jmschreiber91
Jacob Schreiber
5 months
You've probably seen attribution tracks where the height of each letter is its "importance" to a predictive model and motifs pop out. But the technical details behind how these are calculated can matter a lot -- and I'm worried many may be done incorrectly. 1/
Tweet media one
4
29
128
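For readers who haven't computed one of these: the simplest variant is gradient-times-input, where each position's score is the gradient of the model output with respect to the one-hot input, multiplied by that input. The sketch below uses a stand-in convolutional model; the thread's warning is precisely that the choices beyond this bare-bones version (reference sequences, hypothetical bases, which output you differentiate) matter a lot.

```python
import torch

def grad_times_input(model, X):
    """X: (batch, 4, length) one-hot DNA; returns (batch, length) attributions."""
    X = X.clone().requires_grad_(True)
    model(X).sum().backward()
    return (X.grad * X).sum(dim=1)   # collapse the base/channel axis

# A stand-in model; any module mapping (batch, 4, L) -> (batch, 1) works here.
model = torch.nn.Sequential(
    torch.nn.Conv1d(4, 8, kernel_size=7, padding=3), torch.nn.ReLU(),
    torch.nn.AdaptiveAvgPool1d(1), torch.nn.Flatten(), torch.nn.Linear(8, 1))

X = torch.nn.functional.one_hot(torch.randint(0, 4, (2, 100)), num_classes=4)
X = X.permute(0, 2, 1).float()
attributions = grad_times_input(model, X)
print(attributions.shape)   # torch.Size([2, 100])
```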
@jmschreiber91
Jacob Schreiber
2 years
Reading the Reddit thread about predictions for bioinformatics in 2040 () made me realize that I straight up ignore GO analyses in papers unless there's a very specific point being made (almost never). Do other people take them seriously?
16
15
121
@jmschreiber91
Jacob Schreiber
1 year
Finally, after ~6 years of work, this is published! Thanks to all my co-authors and the participants of the challenge for seeing this through.
@jmschreiber91
Jacob Schreiber
2 years
At the beginning of 2018 at an @ENCODE_NIH meeting, the idea for the ENCODE Imputation Challenge was born: an open contest to predict genome-wide genomics experiments given fixed train/test sets and encourage development of large-scale imputation methods. warning: drama 🧵 1/
2
35
168
6
30
101
@jmschreiber91
Jacob Schreiber
1 year
I was always skeptical of single-cell data simulation methods because we still have lingering questions about what exactly the readout is (e.g., it's not a uniform sampling of active genes in a cell). Good to see work on it.
0
25
104
@jmschreiber91
Jacob Schreiber
3 years
What's the point of comp bio models that can only make predictions for experiments that have already been performed (e.g. DeepSEA, Basset, Enformer, BPNet, etc)? In Rit's/my latest short review on ML in comp bio, we discuss! 1/8
3
29
102
@jmschreiber91
Jacob Schreiber
1 year
Ledidi turns any predictive model (BPNet, DeepSEA, Enformer, AlphaFold...) into a biological sequence editor! After years, I released a new version with significant QoL improvements including.. being in PyTorch. Try it out w/ `pip install ledidi`
1
19
99
@jmschreiber91
Jacob Schreiber
2 years
@jonykipnis I offer reasonable rates.
Tweet media one
1
0
98
@jmschreiber91
Jacob Schreiber
4 years
@JessicaLTami Journals will still find a way to ask you to review papers for free.
2
1
96
@jmschreiber91
Jacob Schreiber
2 years
Literally everyone studying gene regulation using transcription instead of protein abundance. @lkpino
@xkcd
Randall Munroe
2 years
Proxy Variable
Tweet media one
38
867
8K
4
16
95
@jmschreiber91
Jacob Schreiber
3 years
@bestofnextdoor "i told you what would happen if you took my ivermectin"
2
0
87
@jmschreiber91
Jacob Schreiber
3 years
In this episode of @bioinfochat , I interview @lkpino about the limits of mass spec measurements and how proteomic measurements can be integrated with genomic measurements. Every time I talk to her I always learn a ton!
1
29
88
@jmschreiber91
Jacob Schreiber
4 years
What's wrong with this? Well, you're leaking information from your test set into your training set because you're selecting probes that, by construction, have large differences / perform well on both your training and test set. 6/
5
5
89
@jmschreiber91
Jacob Schreiber
3 years
Time for another pitfall in genomics thread! Normally, the output from a genomics experiment is a set of reads mapped to a reference genome. More reads = stronger signal. But the total number of reads can confound machine learning analyses and statistical tests. 1/
3
14
89
@jmschreiber91
Jacob Schreiber
4 years
I think it says something about my experiences in academia (and I doubt I'm alone) when I'm shocked to get reviews back that, although they will require a lot of work to address, are generally supportive and provide constructive feedback. @AcademicChatter
3
4
85
@jmschreiber91
Jacob Schreiber
2 years
After delaying my commencement by two years due to the plague that ravages this land, I'm finally a real doctor! With @thabangh
Tweet media one
9
2
86
@jmschreiber91
Jacob Schreiber
2 months
After 5 months of effort and giving up twice, I was finally able to reproduce TOMTOM. Lots of small details and a few bugs in the code... On a large-scale task, TOMTOM is taking ~978s and my version with some basic speedups is taking ~1.2s. Out soon!
5
10
86
@jmschreiber91
Jacob Schreiber
2 years
It's always fun to fail to make basic connections about your data as a computational person.
me: so, this sample is labeled "healthy" but are we sure the person is healthy?
@anshulkundaje: well, it's a heart sample, so they're dead
2
8
86
@jmschreiber91
Jacob Schreiber
5 years
When doing grid search, why do you need to evaluate your final model on data other than the set you used to tune hyperparameters? Here's an example. Random data, labels, and predictions yield much better than random performance in a gridsearch-like evaluation.
Tweet media one
3
20
81
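The fix is to keep a final held-out set that the search never touches. A minimal scikit-learn sketch on synthetic data (showing the procedure rather than reproducing the random-data figure above):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=30, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Tune hyperparameters on the training split only...
search = GridSearchCV(SVC(), {"C": [0.1, 1, 10]}, cv=5).fit(X_train, y_train)

print(search.best_score_)            # the CV score that guided the selection
print(search.score(X_test, y_test))  # performance on untouched data: the number to report
```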
@jmschreiber91
Jacob Schreiber
4 years
My @uwcse @uwescience thesis is now online ()! Check it out if you want to learn about my work with Avocado, imputing >30k genomics experiments, and ordering future experiments. I also wrote a 2 page tl;dr overview:
3
10
79
@jmschreiber91
Jacob Schreiber
4 years
In this week's "ML pitfalls w/ Jacob," we're going to talk about data set creation! Problem data sets occur in every field, but I frequently see them in genomics because people build their own data sets from new experimental data. 1/
4
22
79
@jmschreiber91
Jacob Schreiber
6 months
When designing bioinformatics software with an eye toward the future, an important choice will be designing towards what the latest hardware supports (GPUs with 192GB of memory, for instance) vs. what most people using your software will have (laptop + depression).
1
9
78
@jmschreiber91
Jacob Schreiber
3 years
Basically: production-intended pipelines should probably involve WDL/etc but be focused on internal use. For maximal external effect, your tool should take in standard file formats, run each step as a single command line w/ options, and output a standard format.
4
6
76
@jmschreiber91
Jacob Schreiber
1 year
Just found out about `.numpy(force=True)` for @PyTorch tensors and it's life-changing. Never touching `.detach()` again.
5
5
76
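For context, `force=True` (added in a relatively recent PyTorch release, 1.13 if I remember right) rolls the detach / move-to-CPU steps into a single call:

```python
import torch

x = torch.randn(4, requires_grad=True)

a = x.detach().cpu().numpy()   # the long-hand incantation
b = x.numpy(force=True)        # equivalent, and also handles device/conjugate quirks

assert (a == b).all()
```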
@jmschreiber91
Jacob Schreiber
3 years
In our latest episode of @bioinfochat , we talk with @Avsecz about research in academia vs industry, Enformer, and deep learning libraries! Great to hear about the work directly from the source. Hope other people enjoy our conversation!
4
15
72
@jmschreiber91
Jacob Schreiber
3 months
Added a super-fast one-hot encoding function to `tangermeme` last Friday, and I'm still surprised by how fast it is. Timings for encoding chr1:
for-loop: ~40s
numpy-vectorized: ~12s
new: ~1s
Thought I'd share some intuition for why it works so well.
1
10
71
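I don't know exactly what the tangermeme implementation does internally, but the classic way to get this kind of speedup is to replace per-character comparisons with a single table lookup over the raw bytes. A sketch of that idea:

```python
import numpy as np

# Precompute a (256, 4) table: row i holds the one-hot vector for byte value i.
table = np.zeros((256, 4), dtype=np.int8)
for j, base in enumerate(b"ACGT"):
    table[base, j] = 1

def one_hot(seq: str) -> np.ndarray:
    """One fancy-indexing op over the raw bytes; unknown characters map to all zeros."""
    idx = np.frombuffer(seq.encode("ascii"), dtype=np.uint8)
    return table[idx]            # shape (len(seq), 4)

print(one_hot("ACGTN"))
```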
@jmschreiber91
Jacob Schreiber
5 years
me: I trained a GAN using Avocado to generate fake imputations
advisor: okay, what questions can it help us answer?
me: ...
advisor: ...
me: ...
advisor: what questions ca-
me: it's named AvoGANo
advisor: ...
advisor: let's look into moving your graduation date up
(satire)
5
2
69
@jmschreiber91
Jacob Schreiber
1 year
@SashaGusevPosts Dr. Gusev job plz? genomics Jacob
3
0
67
@jmschreiber91
Jacob Schreiber
2 years
Ultimately, these papers are a symptom of a broken academic system. There is less value in spending time dissecting a system than there is in doing a surface-level analysis and moving on to the next thing, leaving a trail of bad tools that causes people to not trust anything.
4
14
65
@jmschreiber91
Jacob Schreiber
3 years
Super excited to be joining the amazing team at @JOSS_TheOJ as a topic editor for bioinformatics and machine learning. If you wrote a great software tool that supported amazing research, write it up and send it my way! Good software deserves more recognition in research.
0
13
65
@jmschreiber91
Jacob Schreiber
1 month
I'm shocked -- shocked! -- to find out that the department I interviewed at that emailed my advisor unsolicited critiques of my performance behind my back was unable to recruit this cycle.
2
1
66
@jmschreiber91
Jacob Schreiber
2 years
@timrpeterson The biggest problem I've seen in biotech is people who don't understand their data and lose years just learning bias. I'm not sure that getting rid of people with domain knowledge will solve this.
1
0
65
@jmschreiber91
Jacob Schreiber
4 years
Glad to see that the @numpy review article is out! The package has had a massive effect on the adoption of Python and the development of the entire ecosystem.
0
17
61
@jmschreiber91
Jacob Schreiber
2 years
@nomad421 @MicrobiomDigest That's what I tell my advisor when he asks me to get a second paper out of my postdoc.
0
0
63
@jmschreiber91
Jacob Schreiber
2 years
where do i pick up my prize
Tweet media one
2
2
62
@jmschreiber91
Jacob Schreiber
7 years
pomegranate v0.9.0 released! The main focus was on adding missing value support for model fitting / structure learning / inference across all models. Read more about it here: @uwescience @uwcse @NumFOCUS
0
29
62
@jmschreiber91
Jacob Schreiber
5 years
As you increase the number of features or decrease the number of examples in your original data set this problem becomes worse because there is a higher chance to see spurious correlations. Select features using your training set, not all your data!
Tweet media one
0
11
61
@jmschreiber91
Jacob Schreiber
5 years
When UMAP goes wrong. If you pass in similarities (where 1 means closest) rather than distances (where 0 means closest), you can get very artistic results as you smooth in the wrong direction.
Tweet media one
5
13
60
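If what you have is a similarity matrix, the fix is just to convert it to a dissimilarity before handing it to UMAP as a precomputed metric. A sketch assuming similarities scaled to [0, 1]:

```python
import numpy as np
import umap

# Build a similarity matrix (1 = most similar) from some toy points.
rng = np.random.default_rng(0)
pts = rng.normal(size=(200, 5))
sim = np.exp(-np.linalg.norm(pts[:, None] - pts[None, :], axis=-1))

dist = 1.0 - sim   # convert to a dissimilarity: 0 now means closest
embedding = umap.UMAP(metric="precomputed").fit_transform(dist)
print(embedding.shape)   # (200, 2)
```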
@jmschreiber91
Jacob Schreiber
3 years
On Thursday (4:40am PST ugh) I'm giving a talk at #ISMBEECCB2021 #MLCSB2021 on five pitfalls to avoid when applying ML to genomics data! Although conceptually simple, they can be extremely difficult to identify in practice if you don't know what to look for. 1/
2
14
61
@jmschreiber91
Jacob Schreiber
5 years
After 6 years of challenges, setbacks, successes, and corgi viewings, I've scheduled my thesis defense. It always seemed so far away until suddenly it was here. I know that I wouldn't have made it without a support network. @AcademicChatter #AcademicChatter
7
2
60
@jmschreiber91
Jacob Schreiber
4 years
How do I know this has to do with data preprocessing being outside the train/test split, and not me secretly generating a data set with hidden structure in it? Let's put the feature selection IN the CV. Performance plummets. 10/
Tweet media one
2
9
59
@jmschreiber91
Jacob Schreiber
5 years
Once again I accidentally fed in a similarity matrix to UMAP instead of a distance matrix. @leland_mcinnes implemented the best warnings for when this happens---your plot looks like a creature whipping you for being wrong.
Tweet media one
1
4
59
@jmschreiber91
Jacob Schreiber
6 years
Proud to finally release Avocado! Avocado is a deep tensor factorization model that imputes epigenomic signal better than prev work, and the latent factors yield better ML models on genomics tasks than the data it was trained on. @uwescience @uwcse
4
22
58
@jmschreiber91
Jacob Schreiber
3 years
Even "pull this docker container" is frustrating when I just want to test your approach. I get that for research work you want to ensure precise reproducibility and these tools might be the right choice, but it's more challenging to hack and learn when setup is an ordeal.
5
3
58
@jmschreiber91
Jacob Schreiber
3 years
@boehninglab Maybe assistant professors should be paid more too.
1
0
58
@jmschreiber91
Jacob Schreiber
3 years
@KouMurayama "i couldn't haven't written a better paper myself"
1
0
57
@jmschreiber91
Jacob Schreiber
7 years
Sooooo apparently seaborn doesn't let you use the 'jet' colormap anymore... @jakevdp
Tweet media one
4
7
56
@jmschreiber91
Jacob Schreiber
1 year
@jxnlco The GZIP paper is going to cause new researchers to independently rediscover kernel methods
2
5
53
@jmschreiber91
Jacob Schreiber
2 years
Regretting coming to #RECOMB2022 . Most people not wearing masks, coughing and sneezing are near constant in the audience, someone I know already has gotten COVID. Who would feel safe sitting in the audience of this? Talks are good though.
4
4
55
@jmschreiber91
Jacob Schreiber
10 months
Someone really just decided to call it "pseudotime" and we let them get away with it?
3
0
55
@jmschreiber91
Jacob Schreiber
3 years
very unfair how universities will reimburse conference expenses including food if you go "in-person" but won't reimburse this entire pizza i ate alone in bed while watching pre-recorded conference talks at midnight
0
6
55
@jmschreiber91
Jacob Schreiber
2 years
The paper just dropped on @biorxivpreprint . Give it a read, and let us know what you think! Our main point: doing genomics work correctly is HARD. Please don't just use data you find on the internet without knowing how it was processed. 18/
3
16
55
@jmschreiber91
Jacob Schreiber
4 years
This also isn't a problem only with supervised machine learning models. If you take your data and select down to a smaller number of features (here, going from 10k features to 200), even PCA will return distinct clusters. 9/
Tweet media one
1
5
55
@jmschreiber91
Jacob Schreiber
5 years
My @SciPyConf talk, "apricot: Submodular optimization for machine learning," is online! Learn about a principled way to reduce massive data sets down to representative subsets that are widely useful. Also, #GossipGirl . Thanks @uwescience for support!
2
13
52
@jmschreiber91
Jacob Schreiber
4 years
Excited and proud to receive the @acm_bcb 2020 best paper award for my work on making zero-shot imputations across species! Like most work, this would not have been possible without my co-authors. Here is a thread summarizing the paper:
@jmschreiber91
Jacob Schreiber
4 years
This paper proposed an approach for supplementing functional imputation models using human data when making imputations in other species, including making "zero-shot" imputations of assays performed in human but not in the other species. Here are four examples:
Tweet media one
1
0
1
3
4
54
@jmschreiber91
Jacob Schreiber
4 months
@jmuiuc Neighborhood: Genome Biology
Microenvironment: Nature Communications
Niche: Nature
3
1
54
@jmschreiber91
Jacob Schreiber
5 years
two months before deadline: i hate this paper
one month before deadline: i hate this paper
two weeks before deadline: i hate this paper
three days before deadline: this is actually really interesting lets come up with a thousand experiments we could have done
1
2
54
@jmschreiber91
Jacob Schreiber
4 years
Roman and I just released a new @bioinfochat episode! () This time, we interview @drklly about @calico , Basenji, and how machine learning models can be used to help us understand the functional consequences of genetic variation.
2
10
53
@jmschreiber91
Jacob Schreiber
2 years
@rasbt Okay okay, I'll turn my GTX 1080 Ti off and stop training GPT-5 if that's what the nation wants.
2
0
51