We are interested in developing computational technologies to understand the functions of coding and non-coding elements, especially in the context of human physiology and disease. We are focusing on the following areas:

Algorithm development for functional screening (esp. CRISPR/Cas9 knockout screening)

We developed a comprehensive computational solution for functional screens using CRISPR/Cas9, including guide-RNA design algorithms:

Algorithms for the modeling and processing of CRISPR screens:

Algorithms for modeling single-cell CRISPR screens:

Databases for large-scale genetic screens spanning multiple phenotypes:

And so on. These algorithms have become popular in the field: the MAGeCK suite has reached over 20k paper visits and over 90,000 software downloads. These software tools enable researchers to identify interesting hits from screens, and to perform joint analysis of multiple screening experiments:


Figure 1. Analyzing gene functions using MAGeCK-VISPR in a single experiment (left) and two experiments (right)
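To make the idea of "identifying hits from a screen" concrete, here is a minimal sketch in Python. It is not the MAGeCK algorithm, which relies on a more careful statistical model. It only illustrates the general shape of the analysis: normalize guide read counts, compute a log fold change per guide, and summarize guides per gene. The guide names, gene names, counts, and the median summary are all assumptions made for illustration.

```python
import math
from statistics import median


def normalize(counts):
    """Scale raw guide counts to counts per million so samples are comparable."""
    total = sum(counts.values())
    return {guide: c * 1e6 / total for guide, c in counts.items()}


def gene_scores(control, treatment, guide_to_gene, pseudocount=1.0):
    """Return the median log2 fold change (treatment vs. control) per gene."""
    ctrl = normalize(control)
    trt = normalize(treatment)
    per_gene = {}
    for guide, gene in guide_to_gene.items():
        lfc = math.log2((trt[guide] + pseudocount) / (ctrl[guide] + pseudocount))
        per_gene.setdefault(gene, []).append(lfc)
    return {gene: median(values) for gene, values in per_gene.items()}


def rank_genes(scores):
    """Order genes from most depleted (most negative score) to most enriched."""
    return sorted(scores, key=scores.get)


# Hypothetical toy data: two guides per gene.
control = {"g1": 100, "g2": 100, "g3": 100, "g4": 100}
treatment = {"g1": 10, "g2": 20, "g3": 180, "g4": 190}
guide_to_gene = {"g1": "GENE_A", "g2": "GENE_A", "g3": "GENE_B", "g4": "GENE_B"}

scores = gene_scores(control, treatment, guide_to_gene)
ranking = rank_genes(scores)
```

In this toy example GENE_A is depleted and GENE_B is enriched, so GENE_A comes first in the ranking. A real analysis would also need replicate handling, variance modeling, and multiple-testing control.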

Functional analysis of coding and non-coding elements from screening and genomics data

Using the computational frameworks we developed, we collaborated with experimental and clinical scientists around the world to study DNA functions and their associations with human diseases.

Example 1: understanding gene perturbations at a single cell level

CRISPR/Cas9-based functional screening coupled with single-cell RNA-seq (“single-cell CRISPR screening”) is an exciting new technology that combines genome engineering with single-cell sequencing. It is particularly helpful for understanding gene regulatory networks and enhancer-gene regulation at a large scale. We proposed scMAGeCK, a computational framework to systematically identify genes and non-coding elements associated with multiple expression-based phenotypes in single-cell CRISPR screening. Furthermore, we collaborated with various labs to answer key biological questions, including questions about embryonic stem cell differentiation. scMAGeCK is a novel and effective computational tool for studying genotype-phenotype relationships at the single-cell level. scMAGeCK was published in Genome Biology.


Figure 2. Overview of the scMAGeCK algorithm to analyze single-cell CRISPR screens
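One generic way to link perturbations to an expression phenotype is a regularized linear model: the expression of a marker gene in each cell is regressed on indicators of which perturbations that cell received. The sketch below illustrates this idea only. It is not the scMAGeCK implementation, and the design matrix, expression values, and ridge penalty are hypothetical.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def perturbation_effects(design, expression, ridge=0.01):
    """Ridge regression of one gene's expression on 0/1 perturbation indicators.

    design[i][j] is 1 if cell i carries perturbation j, else 0.
    Columns and the response are centered, which stands in for an intercept.
    """
    n_cells, n_pert = len(design), len(design[0])
    col_means = [sum(row[j] for row in design) / n_cells for j in range(n_pert)]
    y_mean = sum(expression) / n_cells
    X = [[row[j] - col_means[j] for j in range(n_pert)] for row in design]
    y = [v - y_mean for v in expression]
    # Normal equations: (X^T X + ridge * I) beta = X^T y
    xtx = [[sum(X[i][a] * X[i][b] for i in range(n_cells)) + (ridge if a == b else 0.0)
            for b in range(n_pert)] for a in range(n_pert)]
    xty = [sum(X[i][a] * y[i] for i in range(n_cells)) for a in range(n_pert)]
    return solve(xtx, xty)


# Hypothetical data: 2 control cells, 2 cells with perturbation A, 2 with perturbation B.
design = [[0, 0], [0, 0], [1, 0], [1, 0], [0, 1], [0, 1]]
expression = [5.0, 5.0, 2.0, 2.0, 5.0, 5.0]
effects = perturbation_effects(design, expression)
```

Here the coefficient for perturbation A is close to -3 (expression drops from 5 to 2), while the coefficient for perturbation B is close to 0. The small ridge penalty shrinks both slightly toward zero.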

Example 2: targeting endocrine resistant breast cancer

Over 70% of breast cancer patients are ER positive, and endocrine therapy has been a standard treatment for these patients for decades. However, most patients with advanced-stage disease will eventually develop resistance to ER inhibition therapies through unknown mechanisms. We collaborated with the Myles Brown lab (at Dana-Farber Cancer Institute/Harvard Medical School) to study the mechanisms of breast cancer endocrine resistance and potential treatments for it. By analyzing genome-wide CRISPR knockout screening data, we found an unusual tumor suppressor, c-src tyrosine kinase (CSK), whose loss accelerates cell growth in the absence of hormone and is associated with high-grade tumors and worse survival rates in patients.

We also used screens to identify genes that are synthetic lethal with CSK loss and can therefore serve as drug targets. The top hit (a PAK family kinase) was confirmed as a vulnerable target for endocrine-resistant patients, and a small-molecule PAK inhibitor suppressed tumor growth in various confirmation experiments. In other words, we not only found a biomarker that is responsible for breast cancer drug resistance, but also found a potential drug that can be repurposed to treat these patients.
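The logic of a synthetic lethality comparison can be sketched as follows: a gene is a candidate if knocking it out hurts cells that have lost the partner gene much more than it hurts wild-type cells. This is a simplified illustration and not the analysis used in the study. The gene names, scores, and cutoffs below are hypothetical.

```python
def synthetic_lethal_candidates(wildtype_scores, knockout_scores,
                                min_difference=1.0, essential_cutoff=-1.0):
    """Return (gene, difference) pairs, most strongly selective first.

    Scores are gene-level depletion scores (more negative = stronger dropout).
    A gene qualifies if it is not already essential in wild-type cells and its
    score is at least `min_difference` lower in the knockout background.
    """
    candidates = []
    for gene, ko_score in knockout_scores.items():
        wt_score = wildtype_scores.get(gene)
        if wt_score is None:
            continue  # gene not measured in both screens
        difference = ko_score - wt_score
        if wt_score > essential_cutoff and difference <= -min_difference:
            candidates.append((gene, difference))
    return sorted(candidates, key=lambda item: item[1])


# Hypothetical gene-level scores from two screens.
wildtype = {"GENE_X": -0.1, "GENE_Y": -2.5, "GENE_Z": 0.2}
partner_null = {"GENE_X": -2.0, "GENE_Y": -2.6, "GENE_Z": 0.1}
hits = synthetic_lethal_candidates(wildtype, partner_null)
```

In this toy input only GENE_X qualifies: GENE_Y is depleted in both backgrounds (generally essential), and GENE_Z barely changes.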

The paper was published in PNAS in 2018, and a corresponding patent application has been submitted.


Figure 3. Analyzing critical genes in breast cancer using MAGeCK and MAGeCK-VISPR.

Example 3: studying functional long non-coding RNAs in cancer

Long non-coding RNAs (lncRNAs) are not translated into protein, but they play important roles in many biological processes, including cancer. In collaboration with the Wensheng Wei laboratory (Peking University), we developed a novel computational and experimental protocol to screen for lncRNAs using paired gRNAs (pgRNAs). This technology introduces a pair of gRNAs simultaneously into one cell, and is able to efficiently knock out non-coding elements by introducing large genomic deletions. We demonstrated its ability to knock out lncRNAs in a fast and efficient manner.

The paper was published in Nature Biotechnology.


Figure 4. lncRNA screening: designing algorithm (left) and identifying top hits (right) using CRISPR screening.
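The core idea of a paired-gRNA design, two cuts that together delete a region, can be sketched with a simple enumeration. The code below is an illustrative assumption and not the published design algorithm: it pairs candidate cut sites on either side of a target region and prefers shorter deletions. The coordinates, flank size, and ranking rule are hypothetical, and a real design would also weigh on-target efficiency and off-target risk.

```python
from itertools import product


def design_pairs(cut_sites, target_start, target_end, flank=500, max_pairs=3):
    """Pick pairs of cut sites that flank [target_start, target_end].

    One site must lie within `flank` bases upstream of the target and the other
    within `flank` bases downstream, so the deletion spans the whole target.
    Pairs are ordered by deletion length, shortest first.
    """
    upstream = [p for p in cut_sites if target_start - flank <= p < target_start]
    downstream = [p for p in cut_sites if target_end < p <= target_end + flank]
    pairs = list(product(upstream, downstream))
    pairs.sort(key=lambda pair: pair[1] - pair[0])
    return pairs[:max_pairs]


# Hypothetical cut-site coordinates around a target at positions 1000-2000.
sites = [900, 950, 1200, 2050, 2300, 2700]
pairs = design_pairs(sites, target_start=1000, target_end=2000)
```

With these inputs the site at 1200 is ignored because it lies inside the target, and the site at 2700 is ignored because it is beyond the downstream flank. The shortest deletion pairs 950 with 2050.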

Transcriptome dynamics from RNA-seq and scRNA-seq

RNA-Seq is an exciting technology for studying the transcriptome via second-generation sequencing. We studied the problem of de novo transcriptome assembly from RNA-Seq reads: reconstructing all possible messenger RNA compositions simultaneously, without using any information from current gene annotations. We developed a series of influential algorithms for RNA-seq transcriptome assembly and expression analysis: IsoInfer, IsoLasso, CEM and ISP. IsoInfer and IsoLasso were the first algorithms to use combinatorial methods and regularized least squares methods to study the assembly problem in RNA-seq.
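As a rough illustration of how regularized least squares applies to this problem, suppose the read coverage observed on each exon segment is modeled as the sum of the abundances of the candidate isoforms that contain that segment. An L1 penalty with a non-negativity constraint then favors explanations that use few isoforms. The sketch below solves this toy formulation by coordinate descent. It is an assumption-laden example, not the actual IsoLasso formulation, and the segment-by-isoform matrix and coverage values are hypothetical.

```python
def nonneg_lasso(A, y, lam=0.1, n_iter=200):
    """Minimize 0.5 * ||y - A x||^2 + lam * sum(x) subject to x >= 0.

    A[i][j] is 1 if candidate isoform j contains segment i, else 0.
    y[i] is the observed coverage of segment i.
    """
    n_rows, n_cols = len(A), len(A[0])
    x = [0.0] * n_cols
    residual = list(y)  # equals y - A x for the current x (all zeros at start)
    for _ in range(n_iter):
        for j in range(n_cols):
            col = [A[i][j] for i in range(n_rows)]
            norm = sum(v * v for v in col)
            if norm == 0:
                continue
            # Correlation of column j with the residual that excludes x[j].
            rho = sum(col[i] * (residual[i] + col[i] * x[j]) for i in range(n_rows))
            new_value = max(0.0, (rho - lam) / norm)
            delta = new_value - x[j]
            if delta:
                for i in range(n_rows):
                    residual[i] -= col[i] * delta
                x[j] = new_value
    return x


# Three segments (rows) and three candidate isoforms (columns):
# isoform 0 uses all segments, isoform 1 uses segments 0 and 2, isoform 2 uses 1 and 2.
A = [[1, 1, 0],
     [1, 0, 1],
     [1, 1, 1]]
coverage = [15.0, 10.0, 15.0]  # generated from abundances 10, 5 and 0
abundance = nonneg_lasso(A, coverage)
```

On this toy input the estimate stays close to the generating abundances, and the third isoform, which is not needed to explain the coverage, is driven to zero.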

We are now working on single-cell RNA-seq (scRNA-seq), an exciting new technology to study transcriptome dynamics at the single-cell level.


Figure 5. The IsoLasso splicing model