pipeline_singleR.py

Overview

This pipeline runs singleR for cell identity prediction. singleR:

runs at the cell level (cells are scored independently)
uses a non-parametric correlation test (i.e. monotonic transformations of the test data have no effect).

Given these facts, in cellhub we run singleR on the raw counts upstream to (a) help with cell QC and (b) save time in the interpretation phase.
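The invariance to monotonic transformations can be checked with a small rank-correlation example. This is a minimal sketch, assuming the non-parametric test is a Spearman-style rank correlation; the helper functions and the expression values are hypothetical and are not part of cellhub or singleR.

```python
import math


def ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical expression values for six genes (illustration only).
reference = [5.1, 0.2, 3.3, 8.0, 1.7, 0.0]
raw_counts = [12, 0, 7, 150, 3, 0]
log_counts = [math.log1p(c) for c in raw_counts]  # monotonic transform

rho_raw = spearman(raw_counts, reference)
rho_log = spearman(log_counts, reference)
print(rho_raw == rho_log)
```

Because log1p preserves the ordering of the counts, the ranks and therefore the correlation are unchanged, which is why scoring raw counts and scoring log-normalised counts should lead to the same result under this assumption.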

This pipeline operates on the ensembl_ids.

Usage

See Installation and Usage for general information on how to use CGAT pipelines.

Configuration

The pipeline should be run in the cellhub directory.

To obtain a configuration file run “cellhub singleR config”.

Inputs

Per-sample Matrix Market files (from the cellhub API).
References for singleR, obtained via the R Bioconductor ‘celldex’ library. As downloading the references is very slow, they need to be downloaded manually and “stashed” as rds files in an appropriate location using the R/scripts/singleR_stash_references.R script. This location is then specified in the yaml file.

Pipeline output

The pipeline saves the singleR scores and predictions for each of the specified references on the cellhub API.

Code

cellhub.pipeline_singleR.genSingleRjobs(): Generate the singleR jobs.

cellhub.pipeline_singleR.singleR(infile, outfile): Perform cell identity prediction with singleR.

cellhub.pipeline_singleR.concatenate(infile, outfile): Concatenate the label predictions across all the samples.

cellhub.pipeline_singleR.summary(infile, outfile): Make a summary table that can be included in the cell metadata packages.

cellhub.pipeline_singleR.singleRAPI(infiles, outfile): Add the singleR results to the cellhub API.
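As a rough illustration of what a concatenation step like the one listed above involves, the sketch below merges per-sample prediction tables into one table. It is not the cellhub implementation: the function name, the tab-separated layout, and the column names (barcode, label, sample_id) are all assumptions made for the example.

```python
import csv
import io


def concatenate_predictions(sample_tables):
    """Merge per-sample tables into one list of rows.

    sample_tables maps a sample identifier to an open text handle
    containing a tab-separated table with a header row (assumed layout).
    Each output row is tagged with the sample it came from.
    """
    merged = []
    for sample_id, handle in sample_tables.items():
        for row in csv.DictReader(handle, delimiter="\t"):
            row["sample_id"] = sample_id
            merged.append(row)
    return merged


# Hypothetical per-sample label tables (illustration only).
tables = {
    "sampleA": io.StringIO("barcode\tlabel\nAAAC\tT cell\nAAAG\tB cell\n"),
    "sampleB": io.StringIO("barcode\tlabel\nCCCA\tMonocyte\n"),
}

merged = concatenate_predictions(tables)
for row in merged:
    print(row["sample_id"], row["barcode"], row["label"])
```

Tagging each row with its sample keeps barcodes unambiguous after merging, since the same barcode sequence can occur in more than one sample.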