How do you benchmark various experimental protocols and numerous computational methods in an unbiased manner?

scDesign2 generates high-fidelity data, and its interpretability lets it serve as an intermediate step in single-cell data analysis pipelines.
Tuesday, May 25, 2021

scDesign2: a transparent simulator that generates high-fidelity single-cell gene expression count data with gene correlations captured | Genome Biology  

Tianyi Sun; Dongyuan Song; Wei Vivian Li; Jingyi Jessica Li

In the burgeoning field of single-cell transcriptomics, a pressing challenge is to benchmark various experimental protocols and numerous computational methods in an unbiased manner. This calls for computational simulators that can generate realistic single-cell data with ground truth. Although many simulators have been developed in the single-cell field, none of them can simultaneously achieve the three goals for a desirable simulator: preserving genes, capturing gene correlations, and generating any number of cells with varying sequencing depths.  To fill this gap, we propose scDesign2, a transparent simulator that can achieve all three goals and generate realistic single-cell gene expression count data. 
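The three goals can be illustrated with a small sketch. It is not the scDesign2 package or its API: it assumes a Gaussian copula over genes, and for brevity it swaps in empirical per-gene marginals and binomial thinning for sequencing depth in place of fitted parametric count distributions. The function names `fit_copula` and `simulate` and the `depth_scale` parameter are hypothetical.

```python
import numpy as np
from statistics import NormalDist

_norm = NormalDist()
_ppf = np.vectorize(_norm.inv_cdf)  # standard normal quantile function
_cdf = np.vectorize(_norm.cdf)      # standard normal CDF


def fit_copula(counts, seed=0):
    """Fit a Gaussian copula to a cells x genes count matrix (one cell type).

    Illustrative assumption: marginals are kept as the empirical per-gene
    distribution rather than a fitted parametric count model.
    """
    rng = np.random.default_rng(seed)
    counts = np.asarray(counts, dtype=np.int64)
    n_cells = counts.shape[0]
    # Break ties at random (counts have many repeated values), then rank.
    jittered = counts + rng.uniform(size=counts.shape)
    ranks = jittered.argsort(axis=0).argsort(axis=0) + 1
    u = ranks / (n_cells + 1)            # uniforms strictly inside (0, 1)
    z = _ppf(u)                          # normal scores per gene
    corr = np.corrcoef(z, rowvar=False)  # gene-gene copula correlation
    return {"corr": corr, "sorted_counts": np.sort(counts, axis=0)}


def simulate(model, n_cells, depth_scale=1.0, seed=0):
    """Draw any number of synthetic cells; depth_scale in (0, 1] thins reads."""
    if not 0.0 < depth_scale <= 1.0:
        raise ValueError("this sketch only supports depth_scale in (0, 1]")
    rng = np.random.default_rng(seed)
    corr = model["corr"]
    sorted_counts = model["sorted_counts"]
    n_real, n_genes = sorted_counts.shape
    # Correlated normals -> uniforms -> empirical quantiles of each gene.
    z = rng.multivariate_normal(np.zeros(n_genes), corr, size=n_cells)
    u = _cdf(z)
    idx = np.minimum((u * n_real).astype(int), n_real - 1)
    sim = np.take_along_axis(sorted_counts, idx, axis=0)
    # Lower sequencing depth by keeping each read with probability depth_scale.
    if depth_scale < 1.0:
        sim = rng.binomial(sim, depth_scale)
    return sim
```

In this sketch the columns of the simulated matrix correspond one-to-one to the genes in the input (genes preserved), the copula correlation matrix carries the gene-gene dependence, and `n_cells` and `depth_scale` can be varied independently of the training data.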

We verify that scDesign2 generates more realistic synthetic data than existing simulators do, for four single-cell RNA-seq protocols (10x Genomics, CEL-Seq2, Fluidigm C1, and Smart-Seq2) and two single-cell spatial transcriptomics protocols (MERFISH and pciSeq). Using two computational tasks (cell clustering and rare cell type detection) as examples, we demonstrate that scDesign2 provides informative guidance on deciding the optimal sequencing depth and cell number in single-cell RNA-seq experimental design, and that scDesign2 can effectively benchmark computational methods under varying sequencing depths and cell numbers.


In addition to its capacity to generate high-fidelity data, scDesign2 is advantageous in its interpretability (its model parameters have direct biological meanings), flexibility (it is adaptable to any count-based single-cell gene expression protocol), and sample efficiency (its training does not require many real cells). Thanks to its interpretability, scDesign2 can also serve as an intermediate step in single-cell data analysis pipelines. For example, its estimated gene correlations may assist gene-set enrichment analysis, gene network analysis, and alignment of cells across batches.

Media Contact: 

Leticia Ortiz | Marketing & Communications | Building a community around data science in biomedicine
leticiaortiz@mednet.ucla.edu
@CompMedUCLA