Geneticist turned bioinformatician · PhD candidate
@Caltech
· Prev
@UniLeiden
· Author of
#gget
(the program not the coffee bar, unfortunately) · German-Catalan
gget alphafold: Predict the 3D structure of a protein from its amino acid sequence using
@DeepMind
’s AlphaFold v2.0 from a Python or command-line environment in 3 lines of code. Runs on any laptop and requires only ~4 GB of disk space. Simply ‘pip install gget’ and:
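A minimal Python sketch of what those lines can look like. `gget.setup("alphafold")` and `gget.alphafold(...)` are the calls gget documents for this module; the `clean_sequence` and `predict_structure` helpers, and the restriction to the 20 standard residues, are assumptions added here for illustration and are not part of gget. Newer gget releases may need extra installation steps, so check the current documentation first.

```python
# Sketch only: clean_sequence and predict_structure are hypothetical helpers,
# not part of gget.
VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")  # assumption: 20 standard amino acids only


def clean_sequence(sequence: str) -> str:
    """Remove whitespace, uppercase, and reject unexpected characters."""
    cleaned = "".join(sequence.split()).upper()
    unexpected = sorted(set(cleaned) - VALID_RESIDUES)
    if not cleaned or unexpected:
        raise ValueError(f"Not a plain amino acid sequence: {unexpected or 'empty input'}")
    return cleaned


def predict_structure(sequence: str):
    """Run gget alphafold on a single amino acid sequence."""
    import gget  # imported lazily so clean_sequence works without gget installed

    gget.setup("alphafold")  # setup step; only needed once per environment
    return gget.alphafold(clean_sequence(sequence))
```

Calling `predict_structure` with an amino acid string would start the setup and then the prediction; the validation step simply catches pasted FASTA headers or stray characters before the expensive part begins.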
”Imagine that DNA had a diameter of 1 m. Then the complex that copies the DNA would be the size of a
@FedEx
truck. It would be traveling at a speed of 500 km/h. It would be making a delivery on both sides of the street every ~10 cm. It would finish its journey in 40 min...
1/2
I created a Colab tutorial that checks if a gene/transcript ID has an associated crystal structure, checks if related proteins with associated crystal structures are available, and then compares those to a de novo structure prediction:
Come find me and my giant poster at
#ASHG2022
poster board 3211 tomorrow 3-5 pm! Let’s talk genomic reference databases and how to access them quickly and effortlessly.
@GeneticsSociety
Lior and I wrote a blog post about what it was like for me to find occurrences of duplicated (and seemingly manipulated) data as a first-year graduate student at
@Caltech
. I am speaking openly because I believe that we, as the scientific community, can do better.
In a blog post, she tells the story of the reaction she received when she pointed this out to her (tenured) professor at the time, and to others. She was basically told not to waste her time: "a lot of the scientific literature has problems". 7/
Can you spot the difference between these two BRCA2 protein sequences?
They are 3,418 amino acids in length and only differ in a single amino acid.
Yet, women expressing the mutant BRCA2 protein have a high likelihood of developing breast cancer. Why is that?
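Spotting a single substitution by eye in 3,418 residues is hopeless, but it takes a few lines of Python. This is a generic sketch: the function name and the toy sequences are made up for illustration and are not the BRCA2 sequences from the post.

```python
def find_substitutions(reference: str, variant: str):
    """Return (1-based position, reference residue, variant residue) for each mismatch."""
    if len(reference) != len(variant):
        raise ValueError("Sequences must have the same length to compare position by position")
    return [
        (position, ref_residue, var_residue)
        for position, (ref_residue, var_residue) in enumerate(zip(reference, variant), start=1)
        if ref_residue != var_residue
    ]


# Toy example (hypothetical sequences): one residue differs at position 4.
print(find_substitutions("MPIWSKER", "MPICSKER"))
```

Positions are reported 1-based because that is how protein variants are usually written.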
We expanded kallisto to translated alignment (nucleotides <-> amino acids), and used it to detect novel viruses in RNA seq data. The single-cell resolution allowed us to determine whether the presence of these viruses affected host gene expression. 🧵
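To make "translated alignment" concrete: comparing nucleotide reads against amino acid references means moving between the two alphabets, and a read's reading frame is not known in advance. The sketch below shows plain six-frame translation with the standard genetic code. It is a generic illustration with made-up function names, not kallisto's implementation.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(sequence: str) -> str:
    return sequence.translate(COMPLEMENT)[::-1]


def translate(sequence: str) -> str:
    """Translate complete codons; unknown codons (e.g. containing N) become 'X'."""
    codons = (sequence[i:i + 3] for i in range(0, len(sequence) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translations(sequence: str) -> dict:
    """Translate three offsets on the forward strand and three on the reverse strand."""
    sequence = sequence.upper()
    frames = {}
    for strand, strand_sequence in (("+", sequence), ("-", reverse_complement(sequence))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(strand_sequence[offset:])
    return frames


for frame, protein in six_frame_translations("ATGGCCTAA").items():
    print(frame, protein)
```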
Honored to be featured in
@CaltechMagazine
with my story about switching fields during my PhD, from wet lab neuroscience to computational genetics.
TLDR: Follow your passion; sometimes, being a beginner in a new field can be your greatest advantage.
We just released
#gget
version 0.2.0! 🎉 Thanks to your feedback, we were able to further improve its ease of use. In addition to Ensembl IDs, gget now also supports WormBase 🪱 and FlyBase 🪰 IDs. All new features are listed here:
It would make a mistake once every three years. THAT is DNA replication fidelity.”
And that is why it's cool to think about Biology in numbers. Check out this fantastic talk by Prof. Rob Phillips:
Analogy by Tania Baker and Stephen Bell.
2/2
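The analogy can be scaled back down with a few lines of arithmetic. This sketch assumes textbook values that are not in the thread (B-DNA is about 2 nm wide, with about 0.34 nm between base pairs); all variable names are made up. Under those assumptions the truck's speed corresponds to roughly 800 base pairs copied per second.

```python
# Assumed textbook values (not from the thread):
DNA_DIAMETER_M = 2e-9    # width of the double helix, ~2 nm
RISE_PER_BP_M = 0.34e-9  # distance between neighboring base pairs, ~0.34 nm

# In the analogy DNA is 1 m wide, so everything is magnified by this factor.
SCALE = 1.0 / DNA_DIAMETER_M


def to_molecular_scale(length_m: float) -> float:
    """Convert a length in the magnified analogy back to the real molecular length."""
    return length_m / SCALE


truck_speed_m_per_s = 500e3 / 3600  # 500 km/h
fork_speed_bp_per_s = to_molecular_scale(truck_speed_m_per_s) / RISE_PER_BP_M
delivery_spacing_nm = to_molecular_scale(0.10) * 1e9  # one delivery every ~10 cm
journey_bp = fork_speed_bp_per_s * 40 * 60            # a 40 min journey

print(f"speed: {fork_speed_bp_per_s:.0f} bp/s")
print(f"delivery spacing: {delivery_spacing_nm:.2f} nm")
print(f"journey: {journey_bp:.2e} bp")
```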
I found two papers from the same author reporting identical data with different numbers of replicates and experimental conditions (on more than one occasion). The papers are 20 years old, mostly cited by the author himself, and the scientific topic is quite niche. 1/3
Congratulations to me 🎉🎉🎉 I am officially NOT the owner of a car anymore. Here’s to biking everywhere!
🚴♀️🚴♀️🚴♀️
(Please vote for bike lanes so I don’t die.)
This is a golden silk spider, or “banana spider” (Trichonephila clavipes). A single thread of their anchor silk has a tensile strength of 4×10^9 N/m^2, which exceeds that of steel by a factor of eight - the strongest material known to humans. But it gets better! 1/2
gget alphafold is a Python and command-line implementation of a simplified version of
@DeepMind
’s AlphaFold2, originally released and benchmarked for AlphaFold Colab.
I'm saddened to hear
@GoogleDeepMind
won’t release the code for this new model. To add to the list, gget alphafold, which brought AlphaFold2 to the fingertips of 100k users without requiring any fancy computing resources, would not have been possible without the source code.
In my review, I made a list of much science that happened bc AlphaFold2 code was released. I suggested that so much science will not happen if AlphaFold3 code is not released (or not with AF3 itself). We've made ~100k models with AF2 code. How could we use a server for all that?
Analysis of
#scRNAseq
requires constant, tedious, interaction with genomics databases. To facilitate querying from
@ensembl
et al.,
@NeuroLuebbert
developed gget:
(code @ ).
gget has many uses; a 🧵 on its amazing versatility: 1/
Check out the Interactive Mutation Browser by
@ElliotHershberg
(and this article about it featuring gget).
This browser lets you mutate amino acid seqs and immediately generates comparative interactive structure predictions using ESM.
I know I'm REALLY late to the game, but just in case anybody here is not aware of the
@3blue1brown
videos, they are the best, fun, accessible, pleasant resource on linear algebra (amongst others) I have encountered so far:
Srinivasan called
@lpachter
and me “unprofessional” for carefully documenting instances of manipulated data in his work.
You can read about my thoughts regarding this response and my hope for the future in this Q&A by
@Sara_Talpos
for MIT’s
@undarkmag
Thanks to
@NeuroLuebbert
for talking with me about honeybees, scientific publishing, and her years-long effort to correct what she characterizes as flaws in the scientific literature.
Nobody:
Me: I ADDED A NEW MODULE TO GGET!!
To complete the series on protein structure modeling, I added 'gget pdb', which can fetch the structure and metadata of a protein stored in the RCSB Protein Data Bank
@buildmodels
. 🧵
@DrAnneCarpenter
I expect this poll is going to be biased towards researchers who had a “smoother” time during their PhD/PostDoc since that makes them much more likely to stay in academia
However, this situation has left me quite frustrated. How often do you encounter fraud? Is this how these things are usually handled for old papers? Can we establish a “fraud-police” for academia that treats fraud as fraud no matter who committed it and when? 3/3
Want to quickly check what your proteins are up to or predict the effect of mutations on protein-protein interactions? gget elm can efficiently analyze thousands of sequences or UniProt IDs.
gget elm preprint:
I would like to present grubClub (“grub” for larva, and “club” for… well, you will see). I built off of the ethoscope by
@qgeissmann
and
@giorgiogilestro
to enable automated long-term optogenetic stimulation of Drosophila larvae. Here is my Halloween release (turn up your 🔊):
The author is a practicing professor at a renowned university today. I was advised that any debating/contacting the author or editorial board is not worth my time. Especially since I am only at the beginning of my career. 2/3
I received countless messages with stories like mine. I am very grateful to have been given a platform to share my experience, but this is the exception to the norm. Junior scientists are ignored when they bring forward problems with existing literature, and it is unacceptable.
@NeuroLuebbert
@tyto96
@lpachter
Go you!
As an undergraduate I was assigned a project looking at damsel fly larvae behaviour and could only conclude the paper I was supposed to be basing my work on was…well, I couldn’t see how they possibly got the results they claimed…
No one wanted to know
Thanks to
@ChiHoangCaltech
and
@AviMaayan
, gget enrichr finally allows the specification of a background gene list! 🧬🧬🧬
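Why the background list matters: an over-representation p-value depends on the universe of genes that could have been drawn. The sketch below uses a one-sided hypergeometric test, a common choice for this kind of analysis. The function and its parameters are illustrative assumptions, not gget's or Enrichr's code.

```python
from math import comb


def enrichment_p_value(overlap: int, query_size: int, set_size: int, background_size: int) -> float:
    """P(at least `overlap` of the query genes fall in the gene set by chance).

    background_size: number of genes that could have been observed at all
    set_size:        gene-set members that are present in the background
    query_size:      number of genes in the submitted list
    """
    largest_possible = min(query_size, set_size)
    favourable = sum(
        comb(set_size, i) * comb(background_size - set_size, query_size - i)
        for i in range(overlap, largest_possible + 1)
    )
    return favourable / comb(background_size, query_size)


# Same overlap, two different backgrounds (made-up numbers).
print(enrichment_p_value(10, 50, 100, 20000))  # whole-genome-sized background
print(enrichment_p_value(10, 50, 100, 2000))   # only genes detectable in the experiment
```

With the smaller, experiment-specific background the same overlap is less surprising, which is the reason to specify one.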
P.S. Check out our renewed website!
¡Ahora también en español!
gget alphafold works on Linux and Mac, but unfortunately not on Windows. (Note: all other gget commands do work on Windows.)
@GoogleColab
notebook demonstrating gget alphafold:
Applying the same algorithm, gget alphafold produces results identical to AlphaFold Colab. The comparison below of the CASP14 target T1024 was created from the PDBs returned by gget alphafold and AlphaFold Colab using
@buildmodels
:
gget alphafold returns the predicted structure (PDB) and alignment error for each amino acid (json). PDB files can be viewed in 3D here: or using PyMOL. When called from Python, gget alphafold automatically generates some cool interactive plots (see above)
Five years ago today, I moved to the US to pursue an academic career and explore new research directions. Today, I'm overwhelmed by deep anxiety about the upcoming Executive Order suspending work visas including H-1B, which I've applied for. Speechless.
Although opinions on the exact numbers vary, at least 50% of the cells in your body are not actually human; they’re microbes. We all are more than 50% bacteria.
#MyOneScienceTweet
A quick
#gget
update for my invertebrate fans:
'gget search' and 'gget ref' now also support fungi, protists, and invertebrate metazoa
🐝🐛🦋🪲🐜🪳🦂🦟🪰🪱🦠🐙🪼🦀🦪🍄🐚🪸
As a proof of concept, I used gget alphafold to predict the structure of an engineered fluorescent nicotine sensor I worked on in 2018 and compared the result to its crystal structure (PDB 7S7U) ...
@thesteinegger
’s lab developed the excellent tool ColabFold, which can also be run locally. However, this also requires the download of reference databases (940 GB). I highly recommend checking out their ColabFold notebook
Thank you all for the helpful responses! Neither of the papers has a DOI, but
@MicrobiomDigest
helped me get ahold of at least one of them on
@PubPeer
. Here is the comment for anybody interested:
MVS has dismissed our findings as "typographical errors and minor oversights," accused Lior and me of being "unprofessional," and called any allegations "totally bizarre." For context, I am showing four examples of duplicated data as described in our arXiv manuscript (Fig 3 & 4).
At the
@Caltech
@ChenInstitute
, we are currently exploring novel approaches to creating
#ai
… e.g. by optogenetically stimulating a giant fake brain. Stay tuned!
My Master of Science work (and first co-first author paper) was published in
@eLife
! We developed fluorescent sensors to study the pharmacokinetics of smoking-cessation drugs. 🧵 1/6
… The prediction is imperfect and illustrates the limitations of AlphaFold2, mainly when predicting residues not found in the reference databases (such as the linker between the GFP and the sensor above). However…
I'm also very disappointed in
@Nature
for breaking their own code guidelines.
So much for “An inherent principle of publication is that others should be able to replicate and build upon the authors' published claims.”
New crowd today in Berlin! I got to talk about DNA, RNA, and
#gget
to a room full of (mostly) software engineers. Thank you,
@prototypefund
🦾😊
(Please excuse my double helix…)
@PartridgeCG
The problem is reproducibility. It is straightforward to reproduce an R, Python, [insert favorite language] script, but it is impossible to trace the exact clicks somebody made in a GUI. Hence mistakes are impossible to catch, and it hinders building on each other's work.
In line with the ethos of kallisto, the workflow can be executed in a few lines of code, and computational requirements do not exceed those of a standard laptop. The code to reproduce all of the results shown in our manuscript can be found here:
I had an excellent time at the Barcelona Supercomputing Center
@BSC_CNS
!
Thank you so much to the students, PIs and research staff for the wonderful discussions - especially the members of the
@mele_lab
,
@marta_mele_m
and
@Bioinfo4women
for hosting me!
We were so pleased to have
@NeuroLuebbert
at BSC presenting her strategy to detect known and unknown viruses in single-cell RNA-Seq datasets. This layer of viral information can provide great insights into cell tropism or infection dynamics.
We really enjoyed exchanging ideas!
For those who missed it, here is our arXiv manuscript describing instances of duplicated figures, indications of data manipulation, and other irregularities in papers on honeybee odometry and navigation:
The interaction between BRCA2 and PALB2 occurs at the site of a eukaryotic linear motif (ELM).
@ChiHoangCaltech
and I wrote
#gget
elm, which can efficiently recognize ELMs in amino acid sequences, and we were able to reproduce the finding by Oliver et al. in a few lines of code:
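Linear motifs are commonly written as regular expressions, so a toy version of motif recognition fits in a few lines. This is a generic sketch, not gget elm itself: the pattern `W..[LIV]`, the motif name, and both sequences are hypothetical and chosen only to show how one substitution can erase a motif.

```python
import re


def find_motifs(sequence: str, patterns: dict):
    """Return (motif name, 1-based start, matched residues) for every hit, overlaps included."""
    hits = []
    for name, pattern in patterns.items():
        # A lookahead makes re.finditer report overlapping matches as well.
        for match in re.finditer(f"(?=({pattern}))", sequence):
            hits.append((name, match.start() + 1, match.group(1)))
    return hits


toy_patterns = {"TOY_MOTIF": "W..[LIV]"}  # hypothetical pattern, not taken from the ELM database
wild_type = "MPAWFELGS"
mutant = "MPACFELGS"  # W4C substitution

print(find_motifs(wild_type, toy_patterns))
print(find_motifs(mutant, toy_patterns))
```

The wild-type sequence contains the motif and the mutant does not; a single substitution can remove a binding site in the same way.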
Our blog and Twitter posts about the difficulties we faced when we first started documenting these issues have received a lot of attention—thank you all for your support!
Their lovely golden-colored silk has been used to develop composite nerve grafts of acellularized veins, with Schwann cells completely ensheathing the silk. Spider silk tissue engineering might be used in surgically improving mammalian neuronal regeneration in the future. 2/2
I got to talk to
@prototypefund
about the importance of open-source software in academia and how to start contributing.
The podcast interview is now finally available across streaming platforms!
Don’t speak German? Don’t fret, check the 🧵
Feeling very grateful for the amount of support and love gget has received. Thank you all for using and contributing to our software!
Here's what the GGETISAWESQMETHANKS protein looks like:
To quote
@lpachter
: Careful reading of science articles is crucial to the advancement of science. The same applies to our work; experts (incl MVS) can and should read our preprint carefully, and if they find problems with our technical claims, they can and should report on them.
… during a previous project, we built new fluorescent sensors by mutating 7S7U to study the subcellular pharmacokinetic properties of various other drugs. Structure predictions could have guided our guesses on which residues to mutate.
BRCA2 plays an essential role in DNA repair through homologous recombination. The promotion of homologous recombination by BRCA2 requires its interaction with a second protein, the partner and localizer of BRCA2 (PALB2).
Schematic from
Without my 2016 J-1, I would not be in grad school today. I would not have developed a protocol to make brain organoids on chips for personalized medicine; nor contributed to relevant research on how antidepressants work. Non-immigrant visas make all the difference!
Soon in cinemas near you: The gget movie
I'm so excited about this short documentary filmed in collaboration with the
@prototypefund
. Watch for its premiere in early 2024!!
@lpachter
@baym
I would like to propose the word Erschöpfungsdepressiondepression. It is a word for being depressed about being depressed due to exhaustion.
gget alphafold is not the only command-line option available for running AlphaFold. The full AlphaFold2 can be run using Docker. However, it requires downloading databases (~3 TB of disk storage), and the computational requirements far exceed those of a laptop (12 vCPUs, 85 GB RAM).
I’ve discussed this extensively in my podcast and documentary with the
@prototypefund
. We need to recognize the tremendous societal value of open-source technology and start compensating its developers appropriately so we can retain quality and talent.
I'm going to put out unfinished and unpolished blog posts so I can get more thoughts written down. Here's a post on funding open source bioinformatics:
Thank you
@georgiagkioxari
for joining us tonight at our monthly
@NeuroTechers
fireside AMA! We talked about computer vision and the lack of Greek food in the US 💻👀
Thank you
@JieyuZheng3
for co-organizing and to all attendees for a great discussion
Stay 🎵 for our next guest
My lab has outdone itself with my thesis posters 🥰
To portray my passion for hiking, they unknowingly used a picture of the first-ever successful ascent of Nanga Parbat by a German team in 1953. Nanga Parbat is nicknamed 'Killer Mountain' - very fitting for a pandemic PhD.
To upgrade, run 'pip install --upgrade gget'. We took great care to preserve as much backward compatibility as possible. However, I highly recommend looking at all of the changes listed here. As always, your feedback is highly valued.
💬 gget gpt 💬 (using the ChatCompletion endpoint) is now part of gget v0.27.4
$ pip install -U gget
$ gget gpt "How are you today GPT?" your_api_key
Python:
>>> import gget
>>> gget.gpt("How are you today GPT?", "your_api_key")
You can get your
@OpenAI
API key here:
Some of you may have seen me present a preliminary version of this work at the Biology of Genomes conference earlier this year. I am happy to finally share the recording of my talk:
The media articles as well as comments on our blog include responses from Prof. MV Srinivasan, who is the common author across all of the papers we flagged (he is first or last author on most of them).