pipeline_cellranger.py

Overview

This pipeline performs the following functions:

Alignment and quantitation of 10x GEX, CITE-seq and VDJ sequencing data.

Usage

See Installation and Usage for general information on how to use CGAT pipelines.

Configuration

The pipeline requires a configured “pipeline_cellranger.yml” file.

Default configuration files can be generated by executing:

python <srcdir>/pipeline_cellranger.py config

Inputs

In addition to the “pipeline_cellranger.yml” file, the pipeline requires two inputs:

a “samples.tsv” file describing the samples
a “libraries.tsv” table containing the sample prefixes, feature type and fastq paths.

(i) samples.tsv

A table describing the samples and libraries to be analysed.

It must have the following columns:

“sample_id”: a unique identifier for the biological sample being analysed.
“library_id”: a unique identifier for the sequencing libraries generated from a single channel on a single 10x chip. Use the same “library_id” for Gene Expression, Antibody Capture, VDJ-T and VDJ-B libraries that are generated from the same channel.

Additional arbitrary columns describing the sample metadata should be included to aid the downstream analysis, for example:

“condition”
“replicate”
“timepoint”
“genotype”
“age”
“sex”

For HTO hashing experiments include a column containing details of the hash tag, e.g.

“hto_id”
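
The column requirements above can be checked before the pipeline is started. The following is a minimal sketch, not part of the pipeline itself: the function name "read_samples" and the example sample names and metadata values are hypothetical.

```python
import csv
import io

# Columns that the documentation above lists as mandatory.
REQUIRED_SAMPLE_COLUMNS = ("sample_id", "library_id")


def read_samples(handle):
    """Read a tab-separated samples table and check the required columns."""
    reader = csv.DictReader(handle, delimiter="\t")
    missing = [c for c in REQUIRED_SAMPLE_COLUMNS
               if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError("samples.tsv is missing columns: " + ", ".join(missing))
    return list(reader)


# Illustrative content only: the identifiers and metadata are made up.
example = (
    "sample_id\tlibrary_id\tcondition\treplicate\n"
    "donor1_ctrl\tlib01\tcontrol\t1\n"
    "donor2_stim\tlib02\tstimulated\t1\n"
)
samples = read_samples(io.StringIO(example))
```

For a real run, pass an open file handle for “samples.tsv” instead of the in-memory example.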

(ii) libraries.tsv

A table that links individual sequencing libraries, library types and FASTQ file locations.

It must have the following columns:

“library_id”: Must match the library_ids provided in the “samples.tsv” file, for details see above.
“feature_type”: One of “Gene Expression”, “Antibody Capture”, “VDJ-T” or “VDJ-B”. Case sensitive.
“fastq_path”: the location of the folder containing the fastq files
“sample”: this will be passed to the “--sample” parameter of the cellranger pipelines (see: https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/using/fastq-input). It is only used to match the relevant FASTQ files: it does not have to match the “sample_id” provided in the “samples.tsv” table, and is not used in downstream analysis.
“chemistry”: The 10x reaction chemistry, the options are:
- ‘auto’ for autodetection,
- ‘threeprime’ for Single Cell 3’,
- ‘fiveprime’ for Single Cell 5’,
- ‘SC3Pv1’,
- ‘SC3Pv2’,
- ‘SC3Pv3’,
- ‘SC5P-PE’,
- ‘SC5P-R2’ for Single Cell 5’, paired-end/R2-only,
- ‘SC-FB’ for Single Cell Antibody-only 3’ v2 or 5’.
“expect_cells”: An integer specifying the expected number of cells.

It is recommended to include the following columns:

“chip”: a unique ID for the 10x Chip
“channel_id”: an integer denoting the channel on the chip
“date”: the date the 10x run was performed
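
Because “feature_type” is case sensitive and “chemistry” accepts only a fixed set of values, it can help to check the table before running. The sketch below is an illustration under assumptions rather than the pipeline's own validation: the function name "check_libraries", the paths and the prefixes are hypothetical, while the permitted values are copied from the lists above.

```python
import csv
import io

REQUIRED_COLUMNS = ("library_id", "feature_type", "fastq_path",
                    "sample", "chemistry", "expect_cells")
FEATURE_TYPES = {"Gene Expression", "Antibody Capture", "VDJ-T", "VDJ-B"}
CHEMISTRIES = {"auto", "threeprime", "fiveprime", "SC3Pv1", "SC3Pv2",
               "SC3Pv3", "SC5P-PE", "SC5P-R2", "SC-FB"}


def check_libraries(handle):
    """Return a list of problems found in a tab-separated libraries table."""
    reader = csv.DictReader(handle, delimiter="\t")
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        return ["missing columns: " + ", ".join(missing)]
    problems = []
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        if row["feature_type"] not in FEATURE_TYPES:
            problems.append(
                f"line {line_no}: unknown feature_type {row['feature_type']!r}")
        if row["chemistry"] not in CHEMISTRIES:
            problems.append(
                f"line {line_no}: unknown chemistry {row['chemistry']!r}")
        if not (row["expect_cells"] or "").isdigit():
            problems.append(f"line {line_no}: expect_cells is not an integer")
    return problems


# Illustrative table: the second library row uses the wrong letter case.
example = (
    "library_id\tfeature_type\tfastq_path\tsample\tchemistry\texpect_cells\n"
    "lib01\tGene Expression\t/path/to/gex\tprefix1\tSC3Pv3\t5000\n"
    "lib01\tantibody capture\t/path/to/adt\tprefix2\tSC3Pv3\t5000\n"
)
problems = check_libraries(io.StringIO(example))
```

An empty list means that no problems were detected by these three checks; it does not guarantee that cellranger will accept the inputs.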

Note: Use of the cellranger “--lanes” parameter is not supported. This means that data from all the lanes present in the given location for the given “sample” prefix will be run. This applies to both Gene Expression and VDJ analysis. If you need to analyse data from a single lane, link the data from that lane into a different location.
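
One way to link the data from a single lane into a different location is sketched below. It assumes the bcl2fastq-style file naming (for example prefix1_S1_L001_R1_001.fastq.gz); the function name "link_lane", the folder names and the prefix are hypothetical, and the demonstration runs on empty placeholder files.

```python
import os
import tempfile
from pathlib import Path


def link_lane(src_dir, dest_dir, sample, lane):
    """Symlink the FASTQ files of one lane for one sample prefix into dest_dir."""
    # Assumes names such as prefix1_S1_L001_R1_001.fastq.gz
    pattern = f"{sample}_S*_L{lane:03d}_*_001.fastq.gz"
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    linked = []
    for fastq in sorted(Path(src_dir).glob(pattern)):
        os.symlink(fastq.resolve(), dest / fastq.name)
        linked.append(fastq.name)
    return linked


# Demonstration on empty placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "all_lanes"
    src.mkdir()
    for lane_name in ("L001", "L002"):
        for read in ("R1", "R2"):
            (src / f"prefix1_S1_{lane_name}_{read}_001.fastq.gz").touch()
    lane1_files = link_lane(src, Path(tmp) / "lane1_only", "prefix1", 1)
```

The new folder can then be given as the “fastq_path” for the library in “libraries.tsv”.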

Note: To combine sequencing data from different flow cells, add additional rows to the table. Rows with identical “library_id” and “feature_type” are automatically combined by the pipelines. If you are doing this for VDJ data, the data from the different flow cells must be in different folders as explained in the note below.

Note: For V(D)J analysis, if you need to combine FASTQ files that have different “sample” prefixes (i.e. from different flow cells), the FASTQ files for each “sample” prefix must be presented in separate folders. This is because, despite the docs indicating otherwise (https://support.10xgenomics.com/single-cell-vdj/software/pipelines/latest/using/vdj), “cellranger vdj” does not support this:

--sample prefix1,prefix2 --fastqs all_data/,all_data/

but it does support:

--sample prefix1,prefix2 --fastqs flow_cell_1/,flow_cell_2/
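
The two notes above can be illustrated with a short sketch that groups table rows by “library_id” and “feature_type” and refuses VDJ inputs where one folder holds more than one “sample” prefix. This is an illustration only, not the pipeline's implementation: the function names "group_libraries" and "vdj_arguments" are hypothetical.

```python
from collections import defaultdict


def group_libraries(rows):
    """Group table rows that share library_id and feature_type."""
    groups = defaultdict(list)
    for row in rows:
        groups[(row["library_id"], row["feature_type"])].append(row)
    return dict(groups)


def vdj_arguments(rows):
    """Build the --sample/--fastqs values for the rows of one VDJ library."""
    prefixes_by_folder = defaultdict(set)
    for row in rows:
        prefixes_by_folder[row["fastq_path"]].add(row["sample"])
    for folder, prefixes in prefixes_by_folder.items():
        if len(prefixes) > 1:
            raise ValueError(
                f"{folder} holds several sample prefixes: {sorted(prefixes)}")
    return ["--sample", ",".join(r["sample"] for r in rows),
            "--fastqs", ",".join(r["fastq_path"] for r in rows)]


# Two rows for the same VDJ-T library, sequenced on two flow cells.
rows = [
    {"library_id": "lib01", "feature_type": "VDJ-T",
     "sample": "prefix1", "fastq_path": "flow_cell_1/"},
    {"library_id": "lib01", "feature_type": "VDJ-T",
     "sample": "prefix2", "fastq_path": "flow_cell_2/"},
]
grouped = group_libraries(rows)
args = vdj_arguments(grouped[("lib01", "VDJ-T")])
```

With both prefixes in a single folder such as all_data/, "vdj_arguments" raises a ValueError instead of producing the unsupported combination.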

Dependencies

This pipeline requires:

cgat-core: https://github.com/cgat-developers/cgat-core
cellranger: https://support.10xgenomics.com/single-cell-gene-expression/

Pipeline logic

The pipeline is designed to:

map libraries in parallel to speed up analysis
submit standalone cellranger jobs rather than use the cellranger cluster mode, which can cause problems on HPC clusters that are difficult to debug
map ADT data together with GEX data, so that the ADT analysis takes advantage of the GEX cell calls
map VDJ-T and VDJ-B libraries using the “cellranger vdj” command.
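
To map ADT and GEX data together, “cellranger count” accepts a libraries CSV file with the columns fastqs, sample and library_type through its “--libraries” option. The sketch below shows how such a file could be derived from rows of “libraries.tsv”. Whether the pipeline writes the file in exactly this way is an assumption, and the function name "count_libraries_csv" and the example paths are hypothetical.

```python
import csv
import io

# Feature types that are mapped together by "cellranger count".
COUNT_FEATURE_TYPES = ("Gene Expression", "Antibody Capture")


def count_libraries_csv(rows, library_id):
    """Return the text of a cellranger count libraries CSV for one library_id."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["fastqs", "sample", "library_type"])
    for row in rows:
        if (row["library_id"] == library_id
                and row["feature_type"] in COUNT_FEATURE_TYPES):
            writer.writerow(
                [row["fastq_path"], row["sample"], row["feature_type"]])
    return out.getvalue()


# Illustrative rows: the VDJ-T row is skipped because it is mapped separately.
rows = [
    {"library_id": "lib01", "feature_type": "Gene Expression",
     "fastq_path": "/path/to/gex", "sample": "prefix1"},
    {"library_id": "lib01", "feature_type": "Antibody Capture",
     "fastq_path": "/path/to/adt", "sample": "prefix2"},
    {"library_id": "lib01", "feature_type": "VDJ-T",
     "fastq_path": "/path/to/tcr", "sample": "prefix3"},
]
libraries_csv = count_libraries_csv(rows, "lib01")
```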

Note: 10x recommends use of “cellranger multi” for mapping libraries from samples with GEX and VDJ. This is so that barcodes present in the VDJ results but not the GEX cell calls can be removed from the VDJ results. Here for simplicity and to maximise parallelisation we use “cellranger vdj”: it is trivial to remove VDJ barcodes without a GEX overlap downstream.
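
Removing VDJ barcodes that lack a GEX cell call amounts to a set membership filter. A minimal sketch follows, assuming that the VDJ contig annotations have been read into rows with a “barcode” field and that the GEX cell barcodes are available as a list; the function name "keep_gex_barcodes", the “chain” values and the barcodes are illustrative.

```python
def keep_gex_barcodes(vdj_rows, gex_barcodes):
    """Keep only the VDJ rows whose barcode is among the GEX cell calls."""
    gex = set(gex_barcodes)
    return [row for row in vdj_rows if row["barcode"] in gex]


# Illustrative data: the second VDJ barcode has no GEX cell call.
gex_cells = ["AAACCTGAGAAACCAT-1", "AAACCTGAGAAACCGC-1"]
vdj_contigs = [
    {"barcode": "AAACCTGAGAAACCAT-1", "chain": "TRA"},
    {"barcode": "AAACCTGAGTTTTTTT-1", "chain": "TRB"},
    {"barcode": "AAACCTGAGAAACCAT-1", "chain": "TRB"},
]
filtered_contigs = keep_gex_barcodes(vdj_contigs, gex_cells)
```

The barcodes must be compared within the same library, since the same barcode sequence can occur in different 10x channels.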

Pipeline output

The pipeline returns:

the output of cellranger

Code

cellhub.pipeline_cellranger.count(infile, outfile): Execute the cellranger count pipeline

cellhub.pipeline_cellranger.mtxAPI(infile, outfile)

Register the count Matrix Market (mtx) files on the API endpoint

Inputs:

The input cellranger count folder layout is:

unfiltered “outs”:

library_id/outs/raw_feature_bc_matrix [mtx]
library_id/outs/raw_feature_bc_matrix.h5

filtered “outs”:

library_id/outs/filtered_feature_bc_matrix
library_id/outs/filtered_feature_bc_matrix.h5
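
The folder layout above can be expressed as a small path helper. This is a sketch for orientation only; the function name "count_outputs" is hypothetical and is not part of cellhub.

```python
from pathlib import PurePosixPath


def count_outputs(library_id, filtered=True):
    """Return the mtx folder and h5 file paths for one library."""
    prefix = "filtered" if filtered else "raw"
    outs = PurePosixPath(library_id) / "outs"
    return {
        "mtx": outs / f"{prefix}_feature_bc_matrix",
        "h5": outs / f"{prefix}_feature_bc_matrix.h5",
    }


filtered_paths = count_outputs("lib01")
raw_paths = count_outputs("lib01", filtered=False)
```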

cellhub.pipeline_cellranger.h5API(infile, outfile): Put the h5 files on the API

cellhub.pipeline_cellranger.tcr(infile, outfile): Execute the cellranger vdj pipeline for the TCR libraries

cellhub.pipeline_cellranger.registerTCR(infile, outfile): Register the TCR contigfiles on the API endpoint

cellhub.pipeline_cellranger.mergeTCR(infiles, outfile): Merge the TCR contig annotations

cellhub.pipeline_cellranger.registerMergedTCR(infile, outfile): Register the merged TCR contigfiles on the API endpoint

cellhub.pipeline_cellranger.bcr(infile, outfile): Execute the cellranger vdj pipeline for the BCR libraries

cellhub.pipeline_cellranger.registerBCR(infile, outfile): Register the individual BCR contigfiles on the API endpoint

cellhub.pipeline_cellranger.mergeBCR(infiles, outfile): Merge the BCR contigfiles

cellhub.pipeline_cellranger.registerMergedBCR(infile, outfile): Register the merged VDJ-B contigfiles on the API endpoint

cellhub.pipeline_cellranger.full(): Run the full pipeline.

cellhub.pipeline_cellranger.useCounts(infile, outfile): Set the cellranger counts as the source for downstream analysis. This task is not run by default.