pipeline_annotation.py

Overview

This pipeline retrieves gene annotations from Ensembl (via BioMart) and KEGG pathway annotations, and registers both on the cellhub API.

Usage

The annotation pipeline should be run in the cellhub directory.

Configuration

The pipeline requires a configured pipeline_cluster.yml file.

Default configuration files can be generated by executing:

python <srcdir>/pipeline_annotation.py config

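The Ensembl version specified in the yaml file should match the version used to build the reference transcriptome for the mapping algorithm (e.g. Cellranger). Otherwise, gene identifiers in the count matrices may be missing from the annotation tables.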

Inputs

This pipeline has no inputs.

Dependencies

This pipeline requires:

Pipeline output

The pipeline produces the following outputs:

  1. api/annotation/ensembl/ensembl.to.entrez.tsv.gz

  • A mapping of ensembl_id to gene_name and entrez_id. Used by gsfisher for pathway analysis.

  2. api/annotation/ensembl/ensembl.gene_name.map.tsv.gz

  • A one-to-one mapping of ensembl_id -> gene_name. Missing gene names are replaced with the corresponding ensembl_id, and duplicated gene names are made unique.

  3. api/annotation/kegg/kegg_pathways.rds

  • KEGG pathways in RDS format for gsfisher.
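The rule described for ensembl.gene_name.map.tsv.gz (fill missing gene names with the ensembl_id, then make names unique) can be illustrated with a minimal sketch. The function name `make_unique_gene_names` and the suffixing scheme (appending the ensembl_id to duplicated names) are assumptions made for illustration; the pipeline's actual method for making names unique is not stated here and may differ.

```python
from collections import Counter


def make_unique_gene_names(records):
    """Map each ensembl_id to a unique gene name.

    `records` is an iterable of (ensembl_id, gene_name) pairs, where
    gene_name may be None or an empty string. ensembl_ids are assumed
    to be unique. Hypothetical helper, not part of cellhub.
    """
    # Step 1: fall back to the ensembl_id when the gene name is missing.
    filled = [(eid, name if name else eid) for eid, name in records]

    # Step 2: count how often each (filled) name occurs.
    counts = Counter(name for _, name in filled)

    # Step 3: keep names that occur once; disambiguate the rest.
    # Assumed scheme: append the ensembl_id to any duplicated name.
    return {
        eid: name if counts[name] == 1 else f"{name}_{eid}"
        for eid, name in filled
    }


records = [
    ("ENSG01", "TP53"),
    ("ENSG02", ""),      # missing gene name
    ("ENSG03", "DUP"),   # duplicated gene name
    ("ENSG04", "DUP"),
]
mapping = make_unique_gene_names(records)
print(mapping)
```

With these toy records, ENSG02 maps to its own identifier and the two DUP entries become DUP_ENSG03 and DUP_ENSG04, so every value in the mapping is distinct.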

cellhub.pipeline_annotation.fetchEnsembl(infile, outfile)

Fetch the Ensembl annotations from BioMart. This task requires internet access.
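How this task queries BioMart is not described here. As a rough illustration of what such a request can look like, the sketch below builds a BioMart XML query and the corresponding martservice URL without sending it. The helper names, the dataset `hsapiens_gene_ensembl`, the attribute names, and the host are all assumptions chosen for the example; the attribute names in particular can vary between Ensembl releases, and the pipeline itself may use a different client entirely.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode


def build_biomart_query(dataset, attributes):
    """Return a BioMart XML query string (hypothetical helper)."""
    query = ET.Element("Query", {
        "virtualSchemaName": "default",
        "formatter": "TSV",          # tab-separated output
        "header": "1",               # include a header row
        "uniqueRows": "1",           # drop duplicate rows
        "datasetConfigVersion": "0.6",
    })
    ds = ET.SubElement(query, "Dataset",
                       {"name": dataset, "interface": "default"})
    for attr in attributes:
        ET.SubElement(ds, "Attribute", {"name": attr})
    return ET.tostring(query, encoding="unicode")


def build_biomart_url(host, query_xml):
    """Return the martservice URL with the URL-encoded query."""
    return f"{host}/biomart/martservice?" + urlencode({"query": query_xml})


# Assumed dataset and attribute names, for illustration only.
query_xml = build_biomart_query(
    "hsapiens_gene_ensembl",
    ["ensembl_gene_id", "external_gene_name", "entrezgene_id"],
)
url = build_biomart_url("https://www.ensembl.org", query_xml)
print(url)
```

To match a specific Ensembl version, the host would need to point at the corresponding Ensembl archive site rather than the current release; that choice is left out of the sketch.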

cellhub.pipeline_annotation.ensemblAPI(infile, outfile)

Add the Ensembl gene annotation results to the cellhub API.

cellhub.pipeline_annotation.fetchKegg(infile, outfile)

Fetch the KEGG pathway annotations. This task requires internet access.

cellhub.pipeline_annotation.keggAPI(infile, outfile)

Add the KEGG pathways to the cellhub API.