If you're studying
#bioinformatics
, there comes a time when you have to decide between:
- staying in academia
- going into industry
- giving up
#bioinformatics
and do something else
Well, let me tell you what it was like for me to work as a
#bioinformatics
in the biotech
You don't see what you don't see if you don't sequence more than 1 genome. Pan-genome of 26 soybean lines reveals SVs, gene fusions, core/dispensable genes.
@PacBio
#Bioinformatics
troubleshooting tip of the day: when in doubt, look at the sequence by eye. Not the alignment, the GFF, or the IGV screenshot. *THE ACTUAL SEQUENCES*. Here's how (using
@PacBio
Iso-Seq CCS BAM files as an example): /1
#ASHG18
is great, but you know what’s better? They have childcare on site, so I don’t have to think about childcare logistics while I’m here.
#WomenInSTEM
Nice in-depth review of the new T2T / pangenome era by
@mike_schatz
et al: why gapless human genomes with resolved centromeres, telomeres, and segmental duplications, plus a pangenome that captures genetic diversity, are essential to human health.
We at
@PacBio
are finally presenting to the world, at long last, isoform information at
#singlecell
resolution - some details on the tech background for the kit and
#Bioinformatics
below 🧵
#bioinformatics
tip of the day: FOLLOW THE DATA. Trace a read through its journey. Example is when I look at a fresh
@PacBio
Iso-Seq run (but really for any sequencing data for any application).
I start w CCS reads, scroll through them to viz the primers and polyA tail. /1
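A minimal sketch of the "viz the primers and polyA tail" step, assuming the reads are already plain strings (however you pulled them out of the CCS file). The primer sequences are made-up placeholders, not real kit primers, and `annotate_read` is a hypothetical helper name. It only does exact matching, so treat it as an aid for eyeballing, not a classifier.

```python
import re

# Toy placeholder primers for illustration only (assumption);
# swap in the primers from your own library prep.
FIVE_PRIMER = "CATGCATGCC"
THREE_PRIMER = "GGTTCCAAGT"


def annotate_read(seq, primers=(FIVE_PRIMER, THREE_PRIMER), min_polya=10):
    """Wrap exact primer hits in [] and lowercase long runs of A."""
    out = seq
    # Mark primers first so the brackets separate them from any polyA run.
    for primer in primers:
        out = out.replace(primer, "[" + primer + "]")
    # Lowercase runs of at least min_polya A's so the tail stands out.
    return re.sub(r"A{%d,}" % min_polya, lambda m: m.group(0).lower(), out)


if __name__ == "__main__":
    read = FIVE_PRIMER + "ACGTTGCC" + "A" * 12 + THREE_PRIMER
    print(annotate_read(read))
```

Scrolling through a few dozen reads annotated this way is usually enough to spot a missing primer or a tail sitting in the wrong place.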
Possibly the most complex genome ever assembled of the most harvested (by tonnage) crop in the world - the polyploid sugarcane R570 hybrid - assembled to 8.7 Gb with 12 chromosomes.
@PacBio
+ short reads used for both WGS and RNA-Seq.
We showed you
@PacBio
MAS-Seq
#singlecell
data on Sequel II/IIe systems, well, now, here it is --- on Revio! Two cells of PBMC data have been released, 110m and 114m reads each.
If you like
@PacBio
Iso-Seq and you like R ggplot2, this is the package for you!
ggtranscript: an R package for the visualization and interpretation of transcript isoforms using ggplot2
#Bioinformatics
#ggplot2
#R
Personal
#bioinformatics
tips that helped me: (1) organize work by year/[author-project]/[month-sample-platform], (2) keep a README for data source, copy email, (3) keep a cmd file (I name it ) in each data analysis folder, (4) keep slides there too
Finally read in detail the breast cancer
@PacBio
Iso-Seq paper and its amazing finding - novel isoforms seen only in clinical tumor samples that correlate w survival outcome and cancer subtypes!
Plant pan-genomes are the new trend! This is now possible becuz of reduced seq cost, long reads, better algorithms, and the need to understand within-species diversity. A quick thread 1/4
All pub find courtesy of
@phototrophic
Analysis of 170
@PacBio
HiFi genomes identified 7000+ segmental duplications (SDs) that comprise 6.1% of the (T2T CHM13) genome. African genomes show higher copy number of multi-copy SDs. Comparing to Iso-Seq data finds 201 novel protein-coding genes in multicopy SDs.
LongSom: a
#bioinformatics
workflow for detecting de novo SNVs, CNAs, and
#cancer
gene fusions to reconstruct tumor clonal heterogeneity. Shows that with
@PacBio
long-read single-cell RNA-Seq you can reconstruct clones and find clinically relevant SNVs in each clone. Nice work
As a preview to the upcoming MAS-Seq for bulk Iso-Seq, here's a
@PacBio
HG002 MAS bulk x Revio dataset for you to play with! 38 million full-length HiFi reads on 1 Revio SMRT Cell! Enjoy :) More to come.
Check this cool baby I'm working on! Single-cell Iso-Seq reads separated by cell types *and* colored by alleles (based on phasing info I ran w IsoPhase). Each alignment is an individual founder molecule.
@PacBio
#bioinformatics
Analysis of ENCODE
@PacBio
Iso-Seq data using isoLASER (variant calling + phasing + linkage analysis) distinguishes cis- v trans-directed alternative splicing. Finds genetic linkage of splicing is largely individual-specific and enriched in immune genes (e.g. HLA).
My
@Medium
interview w
@sedlazeck
@MedhatHelmy7
is out! We cover SV calling w long reads, the life cycle of
#Bioinformatics
tools, why it never hurts to ask for a postdoc you feel unqualified for, and the future for population SV calling!
In the coming weeks, I will be releasing a mini series on my
@medium
blog featuring different
@PacBio
HiFi assemblers! The interview with
@lh3lh3
@ChengChhy
(HiFiAsm) has been done and draft is underway. Stay tuned 🥰
The news has come out that my company, PacBio, has just gone through a significant layoff. A few notes on this:
This says nothing about the high talent of those who were let go. Many of them have contributed critically over the years - some, even before we even had the first
Cautionary tale in
#BioInformatics
: when the results don't look right, and you change the tool parameter (ex: score/length/filter cutoff), and things look *better*, you should NOT think you've saved the day. Most likely, the fault is in the data, not the tool /1
1/ End of an era for my
#bioinformatics
contribution to
@PacBio
Iso-Seq (
#RNA
analysis) --- with the release of isoseq3 v3.8.0, I've removed large chunks of now out-dated modules in Cupcake and will be deprecating the rest soon 🧵
.
@aphillippy
: new complete human genome CHM13v1.1 assembly using PacBio HiFi + ONT UL, no gaps, ~Q70, structurally correct. sees high seq similarity between the acrocentric arms, ~66Mbp new seqs, 1561 new genes.
#gi2021
Using
@PacBio
finds long (up to 23kb!) cell-free DNA in maternal plasma with methylation patterns that distinguish tissue origin. potential for NIPT and pregnancy-related risks (ex: preeclampsia)
Eichler: PAV, discovery of structural variants by comparing phased (or unphased) genomes against each other. breakpoint resolution.
#Bioinformatics
#ISMB2024
One year after tackling its genome, we're releasing the
@PacBio
redwood Iso-Seq dataset - and here's the
@medium
blog to go with it! I'll talk about it at PAGday Iso-Seq workshop on Tue too!
1/ Very excited that this
@GenomeBiology
paper by de Souza (ETH, Robinson lab) together w
@GSheynkman
(and me as an honorary middle author XD) on benchmarking DeepVariant/Clair3/GATK/NanoCaller for calling
@PacBio
Iso-Seq variant data is out!! 🧵
Analysis of
@GenomeInABottle
and
@1000genomes
genomes reveals ribosomal RNA diversity not previously discovered - most short indels and GC-rich! Also changes in rRNA variant expression in cancer suggest cancer-specific ribosomes.
@PacBio
Cool preprint using
@PacBio
Iso-Seq + RNA-Seq on combined 22 breast cancer samples to find differentially expressed genes *and* isoforms. They were able to define "common" v subtype-specific isoforms. Found five 3-hop fusion genes confirmed w WGS.
Use of Tomahto, a new targeted mass spec method, combined with
@PacBio
Kinnex full-length RNA data to increase detection of isoforms that are turned into proteins - the first step into a personalized proteome. Great work from
@GSheynkman
lab!
Spatial transcriptomics at isoform resolution reveals isoform switching events in heart tissue post myocardial infarction - here's a use case of combining 10x Visium with the
@PacBio
MAS-Seq single-cell kit. No short reads needed. If you missed the story - the webinar recording
As someone who works in the industry, the skills I learned from getting a PhD that are still relevant 10 years later have nothing to do with writing algorithms or tenure.
It is rather, just a mindset and a way of handling hard problems at work and in life. /1
I just asked a Q at
#gi2019
(cuz
@lpachter
wants more girls to ask Q) and guess what? my CSHL collaborator, sitting in her office listening in on the talks - emailed me right after and said "HEY I HEARD YOU let's chat tomorrow". So ladies, SPEAK UP.
De novo isoform-level phasing using
@PacBio
Iso-Seq! Years ago I developed IsoPhase (v0) that used the genome to identify SNPs and phase, now we are experimenting w directly going into phasing, which would eliminate the need for a genome!
#PlantBiology
@PacBio
12/ If the idea of a grant gives you vertigo but instead you are:
- Passionate about science
- Likes solving problems, no matter where they come from ("intron retention in plants? yes! alternative splicing in alzheimer's brains? of course!")
- Wants to do something else in the
What does "full-length (FL) transcript" mean for
@PacBio
Iso-Seq and how is it determined w
#bioinformatics
?
In long-read RNA-Seq land, a read with both the 5' and 3' cDNA primer, and the polyA tail preceding the 3' end is determined to be a "full-length" (FL) read. /1
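The definition above can be sketched as a simple check: 5' primer present, 3' primer present, and a polyA run sitting right before the 3' primer. This is a toy illustration under loud assumptions: the primer sequences are invented placeholders, `is_full_length` is a hypothetical name, matching is exact, and the read is assumed to already be in sense orientation. Real tools tolerate mismatches and handle both strands.

```python
import re

# Placeholder primers for illustration only (assumption, not real kit primers).
FIVE_PRIMER = "CATGCATGCC"
THREE_PRIMER = "GGTTCCAAGT"


def is_full_length(read, five=FIVE_PRIMER, three=THREE_PRIMER, min_polya=10):
    """True if the read has the 5' primer, the 3' primer, and a polyA tail
    of at least min_polya bases immediately before the 3' primer."""
    start = read.find(five)
    end = read.rfind(three)
    if start == -1 or end == -1 or end <= start:
        return False
    # Sequence between the two primers: transcript plus polyA tail.
    insert = read[start + len(five):end]
    return re.search(r"A{%d,}$" % min_polya, insert) is not None


if __name__ == "__main__":
    fl = FIVE_PRIMER + "ACGTTGCC" * 3 + "A" * 15 + THREE_PRIMER
    print(is_full_length(fl))
```

The `min_polya` cutoff is an arbitrary choice for the sketch; pick whatever your data justifies.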
Comprehensive
@PacBio
Iso-Seq breast cancer dataset including tumor samples that contain novel isoforms (that change the ORF) not seen in cell lines! Found 35 alt. splicing events assoc w patient survival. congrats
@adeslat
:)
Just interviewed
@sedlazeck
and
@MedhatHelmy7
on their SV review paper (for my
@Medium
blog) but we covered so much more ground incl. history of SV, tool dev, postdoc, population scaling...AHHHH why do I have more meetings today I just want to write NOW 😭
I don't know why I have to say this, but the most effective
#bioinformatics
strategy to "why something isn't working", sometimes, is to just actually open up the fasta file in a text editor, Ctrl+F to find, look at a handful of sequences by eye, and you usually know what's wrong.
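If Ctrl+F gets tedious, the same "find it and look" habit can be scripted. A rough sketch, assuming a plain (possibly multi-line) FASTA and an exact motif such as a primer; `find_motif` and the other helper names are hypothetical. It searches both orientations, since a hit only on the reverse complement is a quick hint that something is flipped.

```python
def revcomp(seq):
    """Reverse complement of a DNA string (A/C/G/T only)."""
    return seq.translate(str.maketrans("ACGTacgt", "TGCAtgca"))[::-1]


def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def find_motif(lines, motif):
    """Return (name, strand, position) for the first exact hit per strand."""
    motif = motif.upper()
    rc = revcomp(motif)
    hits = []
    for name, seq in read_fasta(lines):
        seq = seq.upper()
        for strand, query in (("+", motif), ("-", rc)):
            pos = seq.find(query)
            if pos != -1:
                hits.append((name, strand, pos))
    return hits


if __name__ == "__main__":
    demo = [">r1", "ACGTGGTTCCAAGT", ">r2", "ACTTGGAACCTTTT"]
    for hit in find_motif(demo, "GGTTCCAAGT"):
        print(hit)
```

To run it on a real file, pass an open file handle as `lines`, for example `find_motif(open(path), motif)` with your own `path` and `motif`.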
Isoform classification - a largely solved
#bioinformatics
problem. (hot take? come at me 🌶️)
One of the most popular isoform classification tools is SQANTI. Originally published in 2018, SQANTI is now in its 3rd version (SQANTI3) and is also implemented as `pigeon` in
@PacBio
My
#bioinformatics
superpower is discovering other people using (1) wrong primer sequences; (2) wrong primer orientations; (3) wrong file names; (4) using default parameters that don't apply. That, and I discover it faster than most people.
What can we learn about the transcript-level changes in hibernating v active brown bears using
@PacBio
Iso-Seq? Read this preprint by
@joannalkelley
@JG_Underwood
, me, and others! Quick thread below /1
How to tell I'm a former computer scientist / bioinformatician - I name my Word docs without spaces and use underscore and capitalization to separate words. XDDDD
I’m talking tonight at 7:25pm in
#PAGXVIII
single cell genomics session on single cell IsoSeq using
@PacBio
. Come talk to me about what single cell questions can be asked on 🌱 and animals. Talk slides will be online later!
#pag2020
Amazing preprint out of
@hudsonalpha
showing that
@PacBio
HiFi reads can solve previously unsolved rare disease cases! In one case, 35bp duplication of CDKL5 exon 3 resulted in frameshift.
Hey there
#Bioinformatics
folks looking for a job, I have an academic PI friend looking to hire bioinfx talent to work on long read (
@PacBio
Iso-Seq) + other omics. International candidates welcome! no job req yet so for now DM me to be connected. RT appreciated ❤️
From
@mrvollger
(Stergachis lab): analysis of haplotype-resolved
@PacBio
HiFi FiberSeq genome reveals >1,000 regulatory elements with haplotype-selective chromatin accessibility (HSCA), with HLA being the most diverse locus. Accurate quantification of chromatin accessibility
While I'm going down memory lane today...this is probably one of the best moments from 10 years ago, when I sang the national anthem for a Mariners game. Ah, when I was young XDDD
A small but hopefully still useful potato pan-transcriptome dataset (in which I helped out w the
@PacBio
Iso-Seq portion). I'm really looking forward to seeing more pan-transcriptomes!
Targeted
@PacBio
Iso-Seq identifies novel SNCA transcripts in human dopaminergic neurons, many with novel UTRs and encoding novel peptides; these were used to design antisense oligonucleotides (ASOs), leading to the effective reversal of Parkinson's Disease (PD) pathology
Fantastic preprint! Deep
@PacBio
Iso-Seq+RNA-Seq+mass spec comparative transcriptomics (human + great apes + macaque) finds species-specific isoforms/exonization enriched for immune genes and under positive selection. This is how it's done!
Microflora Danica: largest microbiome dataset to date using short + long reads for metagenomics, 16S amplicon and rRNA operon sequencing of the Danish microbiome. 449 rep samples (14.9 million bacterial and 13.4 million eukaryotic rRNA ~4-4.5kb operon) using
@PacBio
. This is
The most unique question I got asked today standing at my Alzheimer’s brain poster:
“So this is from postmortem brain, right?”
Me: “Uh, yes...?”
I mean what is the alternative? Not postmortem brain??!! 🤯😱
#ASHG19
The new
@PacBio
Kinnex kits make long-read RNA sequencing scalable - what's the magic? This animation explains the technology behind Kinnex.
Kinnex is based on the MAS-Seq method for concatenating smaller amplicons (cDNA or 16S) into longer fragments that can fully utilize HiFi
Paper thread time! Can
@PacBio
Iso-Seq data be used to identify SNVs and find allele-specific isoform information? Yes! And here's the list of pubs off the top of my head: /1
I wrote ANGEL years ago for ORF prediction from long read data that can handle errors (and also handle alternative ORFs). I have mixed feelings about its continuing utility, but this weekend I said heck - I'm updating it to Py3.
#Bioinformatics
@PacBio
Paraphase: a
@PacBio
#bioinformatics
tool for haplotyping segmental duplication genes. Applied to 160 loci, resolved copy numbers, de novo mutations, pseudogenes, low diversity gene example AMY1 as shown below.