- Both labs are pursuing optimizations and extensions of these methods. This could be done independently, but this RFA offers an opportunity to work collaboratively and to leverage new ideas for the community. Collaborative efforts can certainly extend to other interested groups from the RFA as well.
- We will work together towards an ambitious but achievable goal: integrating all scRNA-seq data from the human nervous system (REFs to be added). This includes the Darmanis dataset (Fluidigm, whole cells, deep), Allen (whole cells and nuclei, SS2, deep), Kun Zhang dataset 1 (Fluidigm, nuclei, deep), Patch-seq (UCLA, SS2, deep), Kun Zhang dataset 2 (scNuc-Seq, nuclei, shallow), and Regev/Habib (scNuc-Seq, nuclei, shallow).
- Advantages enabled by integration:
- Significantly boost statistical power by increasing the number of cells. Find consistent signals that appear across datasets, while still classifying rare cell types that may be unique to a subset of datasets. Side benefit: enables a systematic understanding of which genes (and cell types) are preferentially sampled by nuc-Seq versus whole-cell RNA-seq, which is essential for the HCA.
Aim 2:
The approaches outlined above focus on improving the performance of data integration methods when combining multiple datasets generated from the same underlying population of cells (e.g., a specific tissue) but generated using different technologies or collected from different individuals.
In the context of developmental biology, and when comparing tissues across species, the assumption that we are considering the same underlying population of cells does not hold. For example, in the context of tissue differentiation, cells collected at different stages of development will consist of a mix of common cell types (e.g., precursor populations present across multiple time points), transitional populations present at only specific time points, and, ultimately, terminally differentiated cells.
At present, technological limitations mean that cells are sampled sequentially, so constructing pseudotemporal differentiation trajectories requires combining information across batches. Within this context, we propose to jointly learn the biological manifold while correcting for batch effects.
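For reference, the following is a minimal sketch (using scanpy; the file names, root cell, and parameter choices are illustrative assumptions rather than part of the proposal) of the naive baseline in which per-time-point batches are simply concatenated before trajectory inference. Without explicit correction, the resulting pseudotime can track batch rather than differentiation, which is the problem the joint approach aims to address.

```python
# Naive baseline: concatenate per-time-point batches, then run standard
# diffusion pseudotime with no batch correction (illustrative sketch only).
import scanpy as sc

# hypothetical per-time-point files
adatas = [sc.read_h5ad(f"timepoint_{t}.h5ad") for t in (0, 1, 2)]
adata = adatas[0].concatenate(*adatas[1:], batch_key="timepoint")

sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.pca(adata, n_comps=30)
sc.pp.neighbors(adata, n_neighbors=15)
sc.tl.diffmap(adata)
adata.uns["iroot"] = 0        # choice of root cell is illustrative
sc.tl.dpt(adata)              # diffusion pseudotime across all batches
```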
To this end, we propose to extend the Mutual Nearest Neighbor approach by employing a more formal factor analysis framework. Specifically, we suppose that variability in the expression profiles within the combined dataset (i.e., considering cells from all timepoints) can be explained by a series of hidden factors that we want to infer. Each factor will be “active” for a given set of genes that co-vary consistently across the entire dataset.
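As a point of reference, below is a minimal sketch of the MNN pairing step that we propose to extend: a pair of cells, one from each batch, forms a mutual nearest neighbor pair if each cell is among the other's k nearest neighbors in the shared gene space. The cosine normalization, the value of k, and the helper name are illustrative assumptions, not specifics of the final method.

```python
# Identify mutual nearest neighbor (MNN) pairs between two batches
# measured on a common set of genes (illustrative sketch).
import numpy as np
from sklearn.neighbors import NearestNeighbors

def mutual_nearest_neighbors(batch1, batch2, k=20):
    """batch1, batch2: cells x genes matrices on a common gene set.

    Returns (i, j) index pairs such that cell i of batch1 and cell j of
    batch2 are mutual nearest neighbors.
    """
    # cosine-normalize so Euclidean distance reflects angular similarity
    b1 = batch1 / np.linalg.norm(batch1, axis=1, keepdims=True)
    b2 = batch2 / np.linalg.norm(batch2, axis=1, keepdims=True)

    # neighbors of each batch1 cell among batch2 cells, and vice versa
    nn12 = NearestNeighbors(n_neighbors=k).fit(b2).kneighbors(b1, return_distance=False)
    nn21 = NearestNeighbors(n_neighbors=k).fit(b1).kneighbors(b2, return_distance=False)

    pairs = []
    for i, neighbors_of_i in enumerate(nn12):
        for j in neighbors_of_i:
            if i in nn21[j]:          # mutual: i is also among j's neighbors
                pairs.append((i, j))
    return pairs
```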
To disentangle batch effects from biological signal, we will assume that batch effects apply to large numbers of genes and will thus be captured by dense factors with many active genes (including housekeeping genes). By contrast, informative factors, which correspond to the biological signal of interest (e.g., a factor corresponding to differentiation towards a particular lineage), will be reflected in expression changes in a smaller number of genes and will thus be less dense. Additionally, prior information corresponding to the stages at which samples were collected can help guide the choice of factors. One challenge here will be to develop statistical approaches that can reliably distinguish such biological changes from batch effects.
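A minimal sketch of how this density heuristic could be operationalized is shown below, using scikit-learn's FactorAnalysis as a stand-in for the model we will develop; the number of factors, loading cutoff, and density threshold are illustrative assumptions, not values from the proposal.

```python
# Fit a linear factor model to the pooled expression matrix and flag factors
# whose loadings are "active" for a large fraction of genes as candidate
# batch/technical factors; sparser factors are candidate biological factors.
import numpy as np
from sklearn.decomposition import FactorAnalysis

def flag_dense_factors(X, n_factors=20, loading_cutoff=0.1, dense_fraction=0.3):
    """X: cells x genes matrix of log-normalized expression, all batches pooled."""
    fa = FactorAnalysis(n_components=n_factors).fit(X)
    loadings = fa.components_                      # n_factors x n_genes
    active = np.abs(loadings) > loading_cutoff     # genes each factor touches
    active_fraction = active.mean(axis=1)          # fraction of active genes per factor
    is_dense = active_fraction > dense_fraction    # dense factors ~ batch effects
    return loadings, is_dense

# Example usage: loadings, is_dense = flag_dense_factors(np.log1p(counts))
# np.where(is_dense)[0] then lists candidate batch factors to down-weight or remove.
```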
[JM1] Agreed – we can handle multiple batches, but the order does have some effect; it would be nice if the user did not have to specify this, or if we could come up with a better solution. Something hierarchical, perhaps weighted by some measure of quality/complexity, might work reasonably well in this regard.
[JM2] Theoretically I don’t think this is a problem for us, as such cells should simply not be paired with another population. But demonstrating this robustly in practice is not trivial, so I think a rigorous simulation study would be very helpful in this regard.
[JM3] Absolutely – perhaps we need to think about simulations here? On the other hand, bootstrapping may not add that much to the analysis.