pipeline_annotation.py

Overview

This pipeline retrieves gene annotations from Ensembl and pathway annotations from KEGG.

Usage

The annotation pipeline should be run in the cellhub directory.

Configuration

The pipeline requires a configured pipeline_cluster.yml file.

Default configuration files can be generated by executing:

python <srcdir>/pipeline_annotation.py config

The Ensembl version specified in the yaml file should match the version used to build the reference transcriptome for the mapping algorithm (e.g. Cell Ranger).

Inputs

This pipeline has no inputs.

Dependencies

This pipeline requires:

Pipeline output

The pipeline produces the following outputs:

api/annotation/ensembl/ensembl.to.entrez.tsv.gz

A mapping of ensembl_id to gene_name and entrez_id. Used by gsfisher for pathway analysis.
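As a minimal sketch of how a file like this could be inspected outside the pipeline, the snippet below reads a gzip-compressed TSV into a list of dictionaries using only the standard library. The helper name read_tsv_gz is hypothetical, and the assumption that the file has a header row with columns such as ensembl_id, gene_name and entrez_id is inferred from the description above rather than confirmed.

```python
import csv
import gzip


def read_tsv_gz(path):
    """Return the rows of a gzip-compressed, tab-separated file as dicts.

    Hypothetical helper, not part of cellhub. Assumes the first line of
    the file is a header row naming the columns.
    """
    # "rt" opens the compressed file in text mode so csv can parse it.
    with gzip.open(path, "rt", newline="") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))


# Example call (path taken from the output listing above):
# rows = read_tsv_gz("api/annotation/ensembl/ensembl.to.entrez.tsv.gz")
```

All values are returned as strings; convert entrez_id to an integer only after checking for empty entries.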

api/annotation/ensembl/ensembl.gene_name.map.tsv.gz

A unique mapping of ensembl_id -> gene_name. Missing gene names are replaced with ensembl_ids. The gene names have been made unique.
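The two rules described above (fall back to the ensembl_id when the gene name is missing, then make names unique) can be sketched as follows. This is an illustration only: the function name unique_gene_names is hypothetical, and the numeric suffix scheme (".1", ".2", ...) is an assumption, not necessarily the scheme the pipeline uses. The sketch also assumes each ensembl_id appears once in the input.

```python
def unique_gene_names(pairs):
    """Map each ensembl_id to a unique gene name.

    pairs: iterable of (ensembl_id, gene_name) tuples, where gene_name
    may be None or an empty string. Hypothetical helper for illustration.
    """
    seen = set()    # names already assigned
    counts = {}     # last suffix number tried for each base name
    mapping = {}
    for ensembl_id, gene_name in pairs:
        # Rule 1: a missing gene name is replaced with the ensembl_id.
        base = gene_name if gene_name else ensembl_id
        name = base
        # Rule 2: on a clash, append an increasing numeric suffix
        # (assumed scheme) until the name is unused.
        while name in seen:
            counts[base] = counts.get(base, 0) + 1
            name = f"{base}.{counts[base]}"
        seen.add(name)
        mapping[ensembl_id] = name
    return mapping
```

With this scheme the first occurrence of a gene name is kept unchanged and only later duplicates receive a suffix.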

api/annotation/kegg/kegg_pathways.rds

KEGG pathways in rds format for gsfisher.

cellhub.pipeline_annotation.fetchEnsembl(infile, outfile): Fetch the Ensembl annotations from BioMart. This task requires internet access.

cellhub.pipeline_annotation.ensemblAPI(infile, outfile): Add the Ensembl gene annotation results to the cellhub API.

cellhub.pipeline_annotation.fetchKegg(infile, outfile): Fetch the KEGG pathway annotations. This task requires internet access.

cellhub.pipeline_annotation.keggAPI(infile, outfile): Add the KEGG pathways to the cellhub API.