This pipeline performs clustering of integrated single cell datasets. Starting from an anndata object with integrated coordinates (e.g. from Harmony or scVI) the pipeline:

  • Computes the neighbour graph using the HNSW alogrithm

  • Performs UMAP compuation and Leiden clustering (with ScanPy)

  • Visualises QC statistics on the UMAPs and by cluster

  • Visualises singleR results

  • Finds cluster marker genes (using the ScanPy ‘rank_genes_groups’ function)

  • Performs pathway analysis of the cluster phenotypes (with gsfisher)

  • Prepares marker gene and summary reports

  • Export an anndata for viewing with cellxgene

For plotting of data in R, the pipeline saves the input anndata in loom format and reads data in R with the loomR library.


See Installation and Usage on general information how to use CGAT pipelines.


It is recommended to perform the clustering in a new directory.

The pipeline requires a configured pipeline_cluster.yml file.

Default configuration files can be generated by executing:

python <srcdir>/pipeline_cluster.py config


The pipeline starts from an anndata object with the following structure.

  • anndata.var.index -> gene names - these will be displayed in the plots

  • anndata.var.gene_ids -> ensembl gene ids, optional but required for pathway analysis

  • anndata.X -> scaled data (a dense matrix)

  • anndata.layers[“counts”] -> raw counts (a sparse matrix)

  • anndata.layers[“log1p”] -> total count normalised, 1og1p transformed data (a sparse matrix)

  • anndata.obs -> metadata (typically passed through from original cellhub object)

  • anndata.obsm[“X_rdim_name”] -> containing the integrated coordinates (where “rdim_name” matches the “runspec_rdim_name” parameter). TODO: rename this parameter.

It is strongly recommended to retain the information for all of the genes in all of the matrices (i.e. do not subset to HVGs!). This is important for marker gene discovery and pathway analysis.

Pipeline output

The pipeline produces the following outputs:

  1. Summary report

  • Containing the overview UMAP plots, visualisation of QC and SingleR information

  • The result of the per-cluster pathway analysis

  1. Marker report

  • Containing heatmaps, violin plots, expression dotplots, MA and volcano plots for each cluster.

  1. xlsx spreadsheets for the marker gene and pathway results

  2. Anndata objects ready to be viewed with cellxgene


cellhub.pipeline_cluster.taskSummary(infile, outfile)

Make a summary of optional tasks that will be run

cellhub.pipeline_cluster.preflight(infile, outfile)

Preflight sanity checks.

cellhub.pipeline_cluster.metadata(infile, outfile)

Export the metadata (obs) from the source anndata for use in the plotting tasks

cellhub.pipeline_cluster.loom(infile, outfile)

Export the data to the loom file format. This is used as an exchange format for plotting in R.

cellhub.pipeline_cluster.neighbourGraph(infile, outfile)

Compute the neighbor graph. A miniminal anndata is saved for UMAP computation and clustering.

cellhub.pipeline_cluster.scanpyCluster(infile, outfile)

Discover clusters using ScanPy.

cellhub.pipeline_cluster.cluster(infile, outfile)

Post-process the clustering result.

cellhub.pipeline_cluster.compareClusters(infile, outfile)

Draw a dendrogram showing the relationship between the clusters.

cellhub.pipeline_cluster.clustree(infile, outfile)

Run clustree.

cellhub.pipeline_cluster.paga(infile, outfile)

Run partition-based graph abstraction (PAGA) see: https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1663-x

cellhub.pipeline_cluster.UMAP(infile, outfile)

Compute the UMAP layout.

cellhub.pipeline_cluster.plotRdimsFactors(infiles, outfile)

Visualise factors of interest on the UMAP.

cellhub.pipeline_cluster.plotRdimsClusters(infile, outfile)

Visualise the clusters on the UMAP

cellhub.pipeline_cluster.plotRdimsSingleR(infile, outfile)

Plot the SingleR primary identity assignments on a UMAP

cellhub.pipeline_cluster.plotSingleR(infile, outfile)

Make singleR heatmaps for the references present on the cellhub API.

cellhub.pipeline_cluster.summariseSingleR(infile, outfile)

Collect the single R plots into a section for the Summary report.

cellhub.pipeline_cluster.plotGroupNumbers(infiles, outfile)

Plot statistics on cells by group, e.g. numbers of cells per cluster.

Plots are defined on a case-by-case basis in the yaml.

cellhub.pipeline_cluster.clusterStats(infile, outfile)

Compute per-cluster statistics (e.g. mean expression level).

cellhub.pipeline_cluster.findMarkers(infile, outfile)

Find per-cluster marker genes. Just execute the rank_genes_groups routine, no filtering here.

cellhub.pipeline_cluster.summariseMarkers(infiles, outfile)

Summarise the differentially expressed marker genes. P-values are adjusted per-cluster are filtering out genes with low expression levels and fold changes. Per-cluster gene universes are prepared for the pathway analysis.

cellhub.pipeline_cluster.topMarkerHeatmap(infiles, outfile)

Make the top marker heatmap.

cellhub.pipeline_cluster.dePlots(infile, outfile)

Make per-cluster diagnoistic differential expression plots (MA and volcano plots).

cellhub.pipeline_cluster.markerPlots(infiles, outfile)

Make the per-cluster marker plots TODO: add some version of split dot plots back..

cellhub.pipeline_cluster.plotMarkerNumbers(infile, outfile)

Summarise the numbers of marker genes for each cluster.

cellhub.pipeline_cluster.markers(infile, outfile)

Target to run marker gene plotting tasks.


Helper function for parsing the lists of GMT files

cellhub.pipeline_cluster.genesetAnalysis(infile, outfile)

Naive geneset over-enrichment analysis of cluster marker genes.

Testing is performed with the gsfisher package.

GO categories and KEGG pathways are tested by default.

Arbitrary sets of genes cat be supplied as GMT files (e.g. such as those from MSigDB).

cellhub.pipeline_cluster.summariseGenesetAnalysis(infile, outfile)

Summarise the geneset over-enrichment analyses of cluster marker genes.

Enriched pathways are saved as dotplots and exported in excel format.

cellhub.pipeline_cluster.genesets(infile, outfile)

Intermediate target to run geneset tasks

cellhub.pipeline_cluster.plots(infile, outfile)

Target to run all the plotting functions.

cellhub.pipeline_cluster.latexVars(infiles, outfile)

Prepare a file containing the latex variable definitions for the reports.

cellhub.pipeline_cluster.summaryReportSource(infile, outfile)

Write the latex source for the summary report.

cellhub.pipeline_cluster.summaryReport(infile, outfile)

Compile the summary report to PDF format.

cellhub.pipeline_cluster.markerReportSource(infile, outfile)

Write the latex source file for the marker report.

cellhub.pipeline_cluster.markerReport(infile, outfile)

Prepare a PDF report visualising the discovered cluster markers.

cellhub.pipeline_cluster.export(infile, outfile)

Link output files to a directory in the “reports.dir” folder.

Prepare folders containing the reports, differentially expressed genes and geneset tables for each analysis.

TODO: link in the cellxgene anndata files.


Target for building the reports.

cellhub.pipeline_cluster.cellxgene(infile, outfile)

Export an anndata object for cellxgene.