GeneWalk, from
@robertietswaart
,
@fiddle
and co, is a method for prioritizing genes from a gene list. Given a list of genes, eg from an RNA-seq experiment, it constructs a regulatory network and uses GO annotations to identify the most relevant genes.
Just gave my name in Starbucks as
@GenomeBiology
and the barista said 'this coffee is better suited to a more specialist journal' and gave it to someone else.
Artifacts in gene expression data cause problems for gene co-expression networks. Principal component correction can reduce false discoveries. New work from
@princyparsana
, Ruberman,
@alexisjbattle
,
@jtleek
and co
A new opinion from
@asiepel
looks at how bioinformatics grew up from ad hoc, almost hobbyist beginnings, into the underpinning of most of modern biology, but the funding routes have not adapted. He suggests ways to improve the funding support
New editorial from
@StevenSalzberg1
discussing how automatic genome annotation has not kept pace with genome sequencing, and how improvements in sequencing have, somewhat paradoxically, made annotation worse
Study of false positives when calling differentially expressed genes, from
@YM123411
,
@ge_xinzhou
,
@FanglueP
,
@superweili
&
@jsb_ucla
. With population-scale sample sizes, common methods have inflated FDR. The authors suggest using Wilcoxon rank-sum test
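As a rough illustration of the test suggested above, here is a minimal sketch of a two-sided Wilcoxon rank-sum test applied to one gene's expression values in two conditions. It uses the normal approximation with a tie correction and no continuity correction. The function name and the example values are hypothetical; this is not the authors' code.

```python
import math

def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum test (normal approximation, tie-corrected).

    Returns (U statistic for x, z score, p-value).
    """
    n1, n2 = len(x), len(y)
    n = n1 + n2
    # Pool both groups, remembering which group each value came from.
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    ranks = [0.0] * n
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        # Tied values share the average of ranks i+1 .. j.
        midrank = (i + 1 + j) / 2.0
        for idx in range(i, j):
            ranks[idx] = midrank
        t = j - i
        tie_term += t ** 3 - t
        i = j
    rank_sum_x = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    u = rank_sum_x - n1 * (n1 + 1) / 2.0
    mean_u = n1 * n2 / 2.0
    var_u = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        # All values identical: no evidence of a difference.
        return u, 0.0, 1.0
    z = (u - mean_u) / math.sqrt(var_u)
    p = math.erfc(abs(z) / math.sqrt(2))
    return u, z, p

# Hypothetical normalized expression of one gene in two conditions.
condition_a = [1, 2, 3, 4, 5]
condition_b = [6, 7, 8, 9, 10]
u, z, p = rank_sum_test(condition_a, condition_b)  # u is 0: every value in a is below every value in b
```

In a real analysis the test would be run once per gene and the p-values corrected for multiple testing.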
Benchmarking of 36 workflows for determining gene co-expression networks from RNA-seq data, from
@kaylainbio
&
@compbiologist
Data came from SRA and GTEx, and the benchmarking evaluated normalization and network transformation methods
recount3, from
@chrisnwilks
,
@KasperDHansen
,
@BenLangmead
& co, is a resource of 750,000 mouse and human RNA-seq datasets uniformly processed with the same pipeline. It allows rapid queries across the whole dataset, eg cross-species or cross-study
Exciting news: from Jan 1st, we will be moving to transparent peer review (where referee reports are published anonymously alongside the published article)
Zhang,
@realzhang
, Xiong, Zhu and co show that, although H3K27ac is used as a marker for active enhancers, you can remove it from mESCs and enhancers still function with little alteration to the transcriptome.
Gorin,
@vallens
and
@lpachter
extend RNA velocity so that, when given single cell transcriptomic and proteomic data, rather than predicting what the cell will be like in the future, they infer what it was like in the past.
MUON, from
@gtcaa
, Kats &
@OliverStegle
, a package for organizing, analyzing, visualizing & exchanging multi-omics data. It is based around MuData, an open data structure for multimodal data. An example given is ATAC-seq and RNA-seq from single cells
Le Cao,
@FertigLab
& co have written an open letter describing three hackathons designed to improve data analysis in single-cell studies. The hackathons focused on spatial transcriptomics, spatial proteomics, and epigenomics. Datasets and code available.
tidybulk:
@steman_research
,
@TonyPapenfuss
and co present a package for bringing transcriptomic analysis into the tidyverse. tidybulk integrates popular approaches eg limma-voom, edgeR, DESeq2, to allow modular analyses with minimal coding
Don't Panic! If you have ATAC-seq data and you're not sure of the best way to analyze it
@alexyfyf
,
@methylnick
and co have written a review to help you. You'll soon know where your towel is.
Aligning reads from ancient DNA to sequence variation graphs using vg removes reference bias and improves variant detection sensitivity (especially indels) relative to linear alignment with BWA. From
@ruidlpm
,
@erikgarrison
,
@richard_durbin
and co
ACME is a method for dissociation and fixing of cells for single cell RNA-seq, from
@gact_hell
,
@solana_jordi
and co. It uses acetic acid and methanol. It works on a variety of vertebrates and invertebrates, without biasing cell composition
Study of tissue-specific regulatory element evolution in mammals, from
@RmMasa
, Stamper,
@odom_lab
,
@PaulFlicek
, and co. Looking at 4 tissues from 10 mammals, they find promoters often arise from ancient enhancers, and lots more relating to LINEs
Benchmarking of clustering algorithms on single-cell RNA-seq data, for cell type estimation, from
@YuLijia
,
@PengyiYang82
& co. They assess 14 methods from 4 categories, on datasets from Tabula Muris and Tabula Sapiens, and rank them for different tasks
Quantile normalization of scRNA-seq data without UMIs, from
@sandakano
and
@rafalab
. Linear transformations preserve PCR distortions, so this approach uses quantile normalization assuming Poisson-lognormal distribution.
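To show why a quantile mapping differs from a linear rescaling, here is a minimal sketch of generic quantile normalization onto a target distribution. It is not the authors' method: the target here is an arbitrary list supplied by the caller (in the approach above it would come from a Poisson-lognormal model), and the function name and values are hypothetical.

```python
import bisect
import math

def quantile_map(values, target):
    """Replace each value by the target quantile at the same empirical rank.

    Tied input values receive the same output, and the order of the
    input values is preserved.
    """
    sorted_vals = sorted(values)
    target_sorted = sorted(target)
    n, m = len(sorted_vals), len(target_sorted)
    mapped = []
    for v in values:
        # Empirical CDF: fraction of observations less than or equal to v.
        cdf = bisect.bisect_right(sorted_vals, v) / n
        # Index of the matching quantile in the sorted target.
        idx = min(m - 1, max(0, math.ceil(cdf * m) - 1))
        mapped.append(target_sorted[idx])
    return mapped

# Hypothetical read counts for four genes in one cell, and a hypothetical target.
read_counts = [100, 10, 1000, 10]
target_counts = [0, 1, 2, 5]
normalized = quantile_map(read_counts, target_counts)  # [2, 1, 5, 1]
```

The tenfold steps in the input are not preserved in the output; only the ranks are, which is the property that a linear transformation cannot provide.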
Panaroo: producing polished prokaryotic pangenomes, from
@gerrythill
, Corander, Bentley, Parkhill and co. Panaroo generates a graphical pangenome, which allows the correction of erroneous gene clusters caused by gaps and misannotations in existing genomes
3' UTRs that are actually 3' TRs.
@pre_mRCO
, Gao, Morrison, Heyd and co find that alternative splicing can cause frameshifts that result in 3' UTRs being translated. This is a conserved feature that increases protein instability, and may be an alternative to NMD
Review of read alignment tools from
@mealser
,
@jnrotman
,
@_onurmutlu_
,
@serghei_mangul
& co. They assess 107 methods published since 1988, comparing different approaches. They look at computational efficiency and speed. Most tools use hashing.
SyRI, find me rearrangements in this genome.
@itsmanishgoel
, Schneeberger and co present the Synteny and Rearrangement Identifier (SyRI) which identifies structural variants by performing pairwise alignments. Any non-syntenic regions are, by definition, SVs
Missarova,
@satijalab
,
@marionilab
& co present geneBasis, a method for identifying gene panels that capture the variability in single cell RNA-seq datasets, for further analysis. It doesn't require prior cell type annotation
Study of gene expression data compression from
@gwaygenomics
,
@GreeneScientist
and co. Choosing a single latent dimensionality for the compression loses information. The authors present BioBombe, an approach using different latent dimensionalities
MOFA+ : multi-omics factor analysis v2 from
@RArgelaguet
, Arnol, Bredikhin,
@MarioniLab
,
@OliverStegle
and co, for integrating multiple single-cell omics datasets. Demonstrated on scRNA-seq time course of mouse embryos, and methylation of mouse cortex
Fan,
@sarahtishkoff
and co sequenced the genomes of 92 individuals from 44 African populations. They observe mixed ancestry in many individuals, suggesting short- and long-range migration events, with admixture 2/3
Bichrom is a method for predicting transcription factor binding from ATAC-seq and histone modification ChIP-seq data, from
@divyanshi91
,
@mahonylab
and co. Demonstration on Ascl1 shows Bichrom can deconvolve sequence and chromatin requirements for binding
We're always happy to accommodate revision extensions. We realize there are more important things going on right now, so if you're struggling to meet your revision deadline because your lab is shut, or whatever, just drop us an email and let us know.
GraphAligner: from
@MikkoRautiaine3
&
@tobiasmarschal
, a method for aligning sequences to sequence graphs that is much quicker and uses less memory than existing approaches. Works with noisy long reads. Applications in variant genotyping and error correction
Review on how pangenomes are helping crop genomics and guiding breeding and genome editing for crop improvement, from
@rafael_coletta
,
@mbhufford
,
@HirschCandice
and co. They discuss transposable elements, QTL mapping, understudied species and so on.
GOmeth and GOregion are gene set enrichment methods for DNA methylation data from
@JovMaksimovic
,
@AliciaOshlack
&
@BelindaPhipson
. They take into account the fact that CpGs on chips are unevenly distributed between genes, and some CpGs are linked to >1 gene
Paragraph: a graph-based genotyper for structural variants, from Chen,
@p_krusche
,
@sedlazeck
, Eberle and co. A variant graph is constructed from a VCF file. Reads that map near to breakpoints on a linear map are realigned to the graph.
Benchmarking of sample processing time for scRNA-seq and scATAC-seq by
@rmassonix
,
@hoheyn
and co. Storing cells for >2 hours leads to changes in apparent transcript levels. No global changes in ATAC-seq data.
Samplot: from
@jon_belyeu
, Chowdhury,
@ryanlayer
and co, for visualizing read alignment at suspected structural variants, to aid manual curation of SVs. Human input is a good way of resolving false positives from automated SV callers.
Guo and Li present scSorter, for assigning single cell RNA-seq profiles to cell types based on marker gene expression. It can work when marker genes are expressed at low level, and takes information from non-marker gene expression.
Modification of a Chromium 10x droplet protocol to perform single cell RNA-seq with both Illumina and Nanopore reads, from
@Luyi_T
,
@mritchieau
& co. They analyze human and mouse cells using new analysis approach FLAMES & find shared splicing patterns
treeclimbR: from
@fionarhuang
,
@markrobinsonca
and co, an analysis approach applicable to different resolutions of hierarchical data, such as cell types in scRNA-seq data, taxonomy of microbiome data, or miRNA clusters in differential states.
Bing Ren,
@KasperDHansen
,
@afhuming
and co perform Hi-C on lymphoblastoid cell lines from different individuals. They show that genetic variation influences multiple features of 3D chromatin conformation and map QTLs associated with these features
STRONG: from
@chrisquince
,
@sergeynurk
,
@koadman
and co, for strain resolution on assembly graphs, for identifying strains from multiple metagenomes. Genomes are coassembled, and binned into metagenome assembled genomes. Strains can be resolved on graph
Townes,
@rafalab
and co investigate scRNA-seq data with UMIs. UMI counts follow a multinomial distribution and are not zero-inflated, so traditional methods for dimension reduction in scRNA-seq data are not appropriate. Models are proposed that work better
Wang, Sun,
@fooliu
,
@XShirleyLiu
and co present MAESTRO, for model-based analyses of transcriptome and regulome, for integrative analysis of single-cell RNA-seq and ATAC-seq data. It can automatically annotate cell type clusters, and infer regulators.
MultiMAP, from
@MikaSarkinJain
,
@mirjana_e
,
@teichlab
& co, a method for dimensionality reduction and integration of multiple datasets from different technologies. Demonstrated on single-cell RNA-seq, ATAC-seq, methylation, spatial transcriptomics
Review on zero-inflation of single-cell RNA-seq data, from
@Ruochen_Jiang
,
@jsb_ucla
& co. They discuss the sources of biological and non-biological zeroes, methods to address them, and how they influence analysis.
Avocado: from
@jmschreiber91
,
@thabangh
and co, an approach for compressing the human epigenome into an information-rich representation. Can be used for imputing epigenomic marks, and predicting features eg gene expression, interacting regions, replication
New research from
@Why_NeverS
@Gloveface
@cdessimoz
& co.
@ISBSIB
comparing protein length distribution across 2326 species, showing that this is more uniform than previously thought & providing evidence for universal selection on protein length
Review from
@_lazappi_
&
@fabian_theis
looking at trends in software for single cell RNA-seq analysis. The scRNA-seq database was started in 2016, and has recently passed 1000 tools. Focus moved from ordering cells on trajectories to integrating samples
Review of the current problems in human genetics from
@BrandesNadav
,
@oweissb
&
@LinialMichal
. They discuss problems such as population structure, disentangling gene-environment interactions, missing heritability, causality, ancestry diversity and so on
It is well known that Illumina sequencing misses parts of the genome.
@bioinfo_mark
, Jensen, Petrucelli,
@johnthefryer
and co systematically investigate this in human genome. Many genes, including medically relevant ones, missing.
A new editorial from
@GarmireGroup
, describing her experiences in the male-dominated field of bioinformatics, with tips for other women wanting a career in academia.
scMET from
@AndreasKapou
,
@RArgelaguet
, Sanguinetti and
@CataVallejosM
, is a Bayesian method for analyzing single cell bisulfite seq data that takes into account the sparsity of the data. Separates real methylation variability from confounding biases
Shilpa Garg reviews methods for chromosome-scale haplotype reconstruction. She covers methods involving short reads, long reads, Strand-seq and HiC. She discusses reference-based and de novo haplotyping. The last section is on metagenomic haplotyping
Cai, Chang, Wang & co construct a pan-genome of Brassica rapa from 18 genomes covering different morphologies. They infer the ancestral genome, and construct a graph genome, which they use to genotype SVs in 524 accessions. Leafy head linked to a deletion.
CoCoA-diff, from
@ypp_lab
&
@manoliskellis
, a method for identifying causal disease genes in single cell RNA-seq data. In case-control data, it adjusts for confounders existing across heterogeneous individuals. It's used to find Alzheimer causal genes.
Editorial from
@emanuelvgo
&
@FrezzaLab
on how single-cell omics and metabolic models can be used to interpret which mutations are cancer drivers, how clones are selected during tumorigenesis/treatment, and thus suggest novel therapy.
Tran, Ang, Chevrier, Zhang, Chen and co present a benchmark for batch effect correction methods for scRNA-seq data, to allow integration of different batches. Benchmarked on 10 datasets with 5 criteria. Harmony is best, and quick; LIGER, Seurat 3 also good
Simplitigs, from
@KarelBrinda
,
@baym
and
@GregoryKucherov
, is a compact, efficient and scalable implementation of de Bruijn graphs. Simplitigs are vertex-disjoint paths that relax the requirement of stopping at branch nodes to reduce storage space.
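The idea of paths that keep going through branch nodes can be sketched with a simple greedy heuristic. This is a toy under stated assumptions, not the authors' implementation: it ignores reverse complements (real tools typically work with canonical k-mers), requires k of at least 2, and the function name is hypothetical.

```python
def greedy_simplitigs(kmers):
    """Greedily cover a k-mer set with vertex-disjoint de Bruijn graph paths.

    Each input k-mer ends up in exactly one output string. Extension does
    not stop at branching nodes; it stops only when no unused neighbor remains.
    """
    remaining = set(kmers)
    simplitigs = []
    while remaining:
        seed = min(remaining)  # deterministic seed choice for reproducibility
        remaining.remove(seed)
        k = len(seed)
        path = seed
        # Extend to the right while an unused successor k-mer exists.
        extended = True
        while extended:
            extended = False
            for base in "ACGT":
                candidate = path[-(k - 1):] + base
                if candidate in remaining:
                    remaining.remove(candidate)
                    path += base
                    extended = True
                    break
        # Extend to the left while an unused predecessor k-mer exists.
        extended = True
        while extended:
            extended = False
            for base in "ACGT":
                candidate = base + path[:k - 1]
                if candidate in remaining:
                    remaining.remove(candidate)
                    path = base + path
                    extended = True
                    break
        simplitigs.append(path)
    return simplitigs

# ACG branches to both CGA and CGT; the path continues through the branch.
result = greedy_simplitigs({"ACG", "CGT", "CGA"})  # ["ACGA", "CGT"]
```

In this example, paths that stop at the branch node would store three separate 3-mers (9 characters), whereas the two strings above store the same k-mers in 7 characters.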
Survey of Eubacterium rectale genomes from
@NicolaiKarcher
,
@nsegata
and co. They screen 6500 gut metagenomes and assemble over 1300 E. rectale genomes. There are geographic strains. European clades have lost motility operons, and distinct CAZy genes.
ReSeq, from
@StephanSchmeing
and
@markrobinsonca
is a simulator for Illumina reads that gives realistic results. The k-mer spectrum, error rates, and coverage distributions closely match those seen in real data.
Giotto:
@RnDries
,
@qzhu2012
, Yuan and co present an analysis and visualization suite for spatial expression data. It can use input from different spatial technologies, including lower resolution, and can identify cell types, neighborhoods and interactions
Srivastava, Malik,
@nomad421
and co benchmark alignment and mapping algorithms for transcript abundance estimation from RNA-seq data. Based on results, they introduce selective alignment approach that is fast, but removes errors from lightweight alignment
Study of how DNA methylation affects transcription factor binding, by
@ahcorcha
&
@hsnajafabadi
. Joint accessibility-methylation-sequence models used to dissect the different factors affecting binding. TFs that prefer methyl CpG in vitro do so less in vivo
Li, Costa and co compare biases in DNase-seq and ATAC-seq data, then use the ATAC-seq bias info to improve footprinting results for predicting TF binding sites
reference flow: from
@naechyun_chen
,
@BenLangmead
and co, for aligning reads to multiple population reference genomes, rather than graphs. This gives similar alignment accuracy to graphs, but with much smaller memory footprint, and faster speed.
phasebook, from
@XiaoLuo88872497
,
@XiongbinK
&
@ASchonhuth
, for de novo, haplotype-aware assembly of long reads from diploids. It can use both PacBio CLR and HiFi reads, as well as ONT. It is competitive with reference-guided assembly methods.
Benchmarking of 10 end-to-end preprocessing workflows for single cell RNA-seq data, from
@YueYOU9
,
@PeteHaitch
,
@mritchieau
& co. They compare methods with both CEL-seq2 and 10x Chromium data. Preprocessing is less important than other analysis steps
New opinion from
@SaraBallouz
, Dobin and
@JesseAGillis
, discussing the uses of the human reference genome, and whether the current form still works. They suggest a consensus reference, where each locus has the major allele
Daniel Baker and
@BenLangmead
present Dashing, a tool that uses HyperLogLog sketch, as opposed to MinHash, to estimate similarities of genomes or sequencing datasets efficiently and accurately
BRIE2: from
@YuanhuaHuang
& Sanguinetti, a method for associating splicing events with cell-level phenotypes in single cell RNA-seq data, taking into account the sparsity of the data. It uses sequence features and cell type, and can use Tensorflow & GPUs
The Arabidopsis thaliana reference transcript database 3, from Zhang, Brown & co. It comprises 169,000 transcripts, double the next best database, assembled from Iso-seq using novel methods to annotate splice junctions and transcription start and end sites
Review from
@h_shaban
,
@RomanBarth2
and
@kerstinbys
, on how high-resolution microscopy has contributed to our understanding of nuclear architecture and gene regulation. They discuss complementarity of microscopy and Hi-C, and how chromatin moves
cuteSV, from Jiang, Liu, Wang and co, a tool for identifying structural variants using long reads. Different SV types have different signatures, which the method identifies. cuteSV can work with Nanopore and PacBio data.
SHOOT, from
@David__Emms
&
@Steve__Kelly
, searches for a sequence in a database of phylogenetic trees, identifies the homology group, and places that sequence in the phylogeny. It finds closest related sequence better than BLAST, with similar speed
Merqury: from
@ArangRhie
,
@aphillippy
and co, a method for reference-free assessment of quality, completeness, and phasing of genome assemblies. It compares k-mers in the unassembled reads with the assembly. It's demonstrated on Arabidopsis and human data
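The comparison of read k-mers with assembly k-mers can be illustrated with a toy sketch. This is not Merqury's code or its exact statistics: the function names, the `min_count` threshold for trusting a read k-mer, and the example sequences are all assumptions, and reverse complements are ignored.

```python
from collections import Counter

def kmers(seq, k):
    """Yield every k-mer of a sequence, left to right."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))

def kmer_completeness(reads, assembly, k, min_count=2):
    """Fraction of well-supported read k-mers that also occur in the assembly."""
    read_counts = Counter(km for read in reads for km in kmers(read, k))
    # Treat k-mers seen at least min_count times as likely real, not errors.
    solid = {km for km, count in read_counts.items() if count >= min_count}
    if not solid:
        return 0.0
    assembly_kmers = set(kmers(assembly, k))
    return len(solid & assembly_kmers) / len(solid)

def unsupported_kmers(reads, assembly, k):
    """Assembly k-mers never seen in the reads, a hint of assembly errors."""
    read_kmers = {km for read in reads for km in kmers(read, k)}
    return sorted(set(kmers(assembly, k)) - read_kmers)

# Hypothetical reads and two candidate assemblies.
reads = ["ACGTAC", "ACGTAC"]
completeness = kmer_completeness(reads, "ACGTA", k=3)  # 0.75: TAC is missing
errors = unsupported_kmers(reads, "ACGTT", k=3)        # ["GTT"]
```

Neither function needs a reference genome, which is the point of the approach: the reads themselves act as the ground truth.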
SquiggleNet, from
@baobaoyaobaobao
,
@LabWelch
& co, a method for classifying Nanopore reads direct from their electrical signal. This is faster than basecalling + alignment, so allows realtime analysis on a laptop. It needs no reference database.
sfaira: from
@davidsebfischer
,
@le_and_er
,
@fabian_theis
& co, a data and model zoo for pre-trained scRNA-seq models. This gives models streamlined access to different datasets, and automatically accounts for different cell type annotation resolutions
Campoy,
@hequan_sun
,
@LabSchneeberger
and co present gamete binning, an approach for generating haplotype-resolved genome assemblies from single-cell sequencing of gametes. Genetic map made from the gamete genomes can be used to guide long-read assembly
Benchmarking of approaches for integrating single-cell data for inferring biological trajectories, perturbation, and disease states, from
@JoleneRanek
,
@natstann
& Purvis. They look at 10 methods on 10 datasets. Integrating spliced and unspliced data helps
Wondering which base caller to use on your Oxford Nanopore data?
@rrwick
,
@JuddLmj
and
@DrKatHolt
can help you with that - they've benchmarked the available basecallers in a number of situations. Part of our
#benchmarking
special issue.
Deep learning applied to cancer transcriptomes by
@AnupamaJha48
,
@RNA_Ken
,
@YosephBarash
& co. Splice site use patterns and lncRNA expression both define cancer state, and these genes are generally not mutated in cancer.
Hu, Wang and co present LIQA, for long read isoform quantification and analysis. Each read is given its own weight for isoform expression, to account for read-specific error rate and alignment bias. Application to real human data identifies novel isoforms
Work extracting information from pathway diagrams in papers, much of it not mentioned in the text, from
@stinahanspers
,
@xanderpico
and co. They used machine learning to find molecular interaction diagrams, optical character recognition to get gene names