IFNb PBMC example

Setting up

1. Clone the example template to a local folder

Clone the folders and files for the example into a local folder:

cp -r /path/to/cellhub/examples/ifnb_pbmc/* .

This will create three folders:

“cellhub”, where the preprocessing pipelines will be run and where the cellhub database will be created.
“integration”, where integration is to be performed.
“cluster”, where downstream analysis is performed with “cellhub cluster”.

The folders contain the necessary configuration files: please edit the cellhub/libraries.tsv file to point to the location of the FASTQ files on your system.

Running the pre-processing pipelines and creating the database

2. Running Cellranger

The first step is to configure and run pipeline_cellranger. This pipeline takes three inputs:

(i) A “samples.tsv” file that specifies information about the biological samples, the number of cells expected and the 10x chemistry version.
(ii) A tab-delimited “libraries.tsv” file that specifies the input 10x channel library identifiers (“library_id”), sample prefixes, library types and FASTQ paths.
(iii) A “pipeline_cellranger.yml” file that configures general options such as the computational resource specification and the location of the genomic references.

For more details please see: pipeline_cellranger.py.

For this example, preconfigured “samples.tsv”, “libraries.tsv” and “pipeline_cellranger.yml” files are provided in the cellhub directory.

Edit the “fastq_path” column of the “libraries.tsv” file so that it points to the folders containing the FASTQ files extracted from the original BAM files submitted by Kang et al. to GEO (GSE96583). The GEO identifiers are: unstimulated (GSM2560248) and stimulated (GSM2560249). The FASTQ files can be extracted with the 10x bamtofastq tool.
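Before launching the pipeline it can help to confirm that every edited path exists. The following is a minimal sketch, not part of cellhub: the helper name check_fastq_paths is hypothetical, and it assumes that the first row of “libraries.tsv” is a header containing a column named “fastq_path” and that each entry is a directory.

```shell
# check_fastq_paths FILE   (hypothetical helper, not a cellhub command)
# Prints each "fastq_path" entry of a tab-delimited FILE that is not an
# existing directory. Assumes the first row is a header row.
# Returns 0 if all paths exist, 1 if any are missing, 2 if the column is absent.
check_fastq_paths() {
    tsv="$1"
    # Find the 1-based index of the fastq_path column in the header row.
    col=$(head -n 1 "$tsv" | tr '\t' '\n' | grep -n -x 'fastq_path' | head -n 1 | cut -d: -f1)
    if [ -z "$col" ]; then
        echo "no fastq_path column in $tsv" >&2
        return 2
    fi
    # Collect the entries of that column that are not directories.
    missing=$(tail -n +2 "$tsv" | cut -f "$col" | while IFS= read -r p; do
        if [ -n "$p" ] && [ ! -d "$p" ]; then
            echo "missing: $p"
        fi
    done)
    if [ -n "$missing" ]; then
        echo "$missing"
        return 1
    fi
    return 0
}
```

With the function defined in the current shell, running “check_fastq_paths libraries.tsv” from the cellhub folder lists any paths that still need to be corrected.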

The pipeline is run as follows:

cellhub cellranger make full -v5 -p20

Finally, the count matrices must be manually registered on the API for downstream analysis:

cellhub cellranger_multi make useCounts

Note

When processing other datasets, the “samples.tsv” and “libraries.tsv” files must be created by the user. For more details on constructing these files please see pipeline_cellranger.py. A template “pipeline_cellranger.yml” file can be obtained using the “config” command, which is common to all cellhub pipelines:

# cellhub cellranger config

3. Running the cell QC pipeline

Next we run the cell QC pipeline:

# cellhub cell_qc config

cellhub cell_qc make full -v5 -p20

4. Running emptydrops and investigating ambient RNA (optional)

If desired we can run emptydrops:

# cellhub emptydrops config

cellhub emptydrops make full -v5 -p20

And investigate the ambient RNA:

# cellhub ambient_rna config

cellhub ambient_rna make full -v5 -p20

5. Run pipeline_singleR

SingleR is run on all the cells so that the results are available to help with QC as well as downstream analysis:

# cellhub singleR config

cellhub singleR make full -v5 -p20

As noted in the pipeline_singleR inputs section, the celldex references need to be stashed before the pipeline is run.

6. Loading the cell statistics into the celldb

The cell QC statistics and metadata (“samples.tsv”) are next loaded into a local sqlite database:

# cellhub celldb config

cellhub celldb make full -v5 -p20

7. Run pipeline_annotation

This pipeline retrieves the Ensembl and KEGG annotations needed for downstream analysis:

# cellhub annotation config

cellhub annotation make full -v5 -p10

Please note that the specified Ensembl version should match that used for the cellranger reference transcriptome.
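Steps 3 to 7 all use the same “cellhub <pipeline> make full” pattern, so they can be chained once each pipeline has been configured. The following is a minimal sketch under stated assumptions: run_steps, CELLHUB and JOBS are hypothetical names introduced here rather than cellhub features, and a single -p value is applied to every step (the annotation step above uses -p10).

```shell
# Command used to invoke cellhub. Override it for a dry run,
# for example: CELLHUB="echo cellhub"
CELLHUB="${CELLHUB:-cellhub}"
# Value passed to -p for every step (assumption: one value for all steps).
JOBS="${JOBS:-20}"

# run_steps PIPELINE...   (hypothetical helper, not a cellhub command)
# Runs "make full" for each named pipeline in order and stops at the
# first pipeline that exits with a non-zero status.
run_steps() {
    for step in "$@"; do
        echo "== $step =="
        if ! $CELLHUB "$step" make full -v5 "-p$JOBS"; then
            echo "step failed: $step" >&2
            return 1
        fi
    done
}
```

For example, “run_steps cell_qc singleR celldb annotation” runs the non-optional steps in the order given above. The celldex references must still be stashed before singleR is run, and the optional emptydrops and ambient_rna steps can be added to the argument list if desired.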

Performing cell QC

8. Assessment of cell quality

This step is left to the reader to perform manually because it needs to be carefully tailored to individual datasets.

Performing downstream analysis

9. Fetch cells for integration

We use pipeline_fetch_cells to retrieve the cells we want for downstream analysis. QC thresholds and, for example, the desired samples are specified in the pipeline_fetch_cells.yml file:

# It is recommended to fetch the cells into a separate directory for integration.
cd ../integration

# cellhub fetch_cells config
cellhub fetch_cells make full -v5 -p20

10. Integration

Run the provided Jupyter notebook to perform a basic Harmony integration of the data and to save the result in the appropriate anndata format (see the pipeline_cluster inputs section).

11. Clustering analysis

Cluster analysis is performed with pipeline_cluster (a separate directory is recommended for this so that multiple clustering runs can be performed as required):

# change into the clustering directory
cd ../cluster.dir

# check out the yml file
cellhub cluster config

# a suitable yml file has been provided so we can now launch the pipeline
cellhub cluster make full -v5 -p200

The pdf reports and excel files generated by the pipeline can be found in the “reports.dir” subfolder.

For interactive visualisation, the results are provided in cellxgene format. To view the cellxgene.h5ad files, you will first need to install cellxgene with “pip install cellxgene”. The cellxgene viewer can then be launched with:

# substitute "{x}" with the number of integrated components used for the clustering run.
cellxgene --no-upgrade-check launch out.{x}.comps.dir/cellxgene.h5ad
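If several clustering runs have been performed, it may be convenient to list the available cellxgene files before choosing one to launch. The following is a minimal sketch, assuming the “out.{x}.comps.dir/cellxgene.h5ad” layout shown above; the helper name list_cellxgene_files is hypothetical.

```shell
# list_cellxgene_files [DIR]   (hypothetical helper, not a cellhub command)
# Prints the path of every cellxgene.h5ad file found in an
# out.{x}.comps.dir subfolder of DIR (default: the current directory).
list_cellxgene_files() {
    run_dir="${1:-.}"
    for f in "$run_dir"/out.*.comps.dir/cellxgene.h5ad; do
        # An unmatched glob is left as a literal pattern, so test each entry.
        if [ -f "$f" ]; then
            echo "$f"
        fi
    done
    return 0
}
```

Any of the printed paths can then be passed to “cellxgene --no-upgrade-check launch”.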