api.py

Overview

The primary and secondary analysis pipelines define and register their outputs via a common API. The API consists of an “api” folder into which pipeline outputs are symlinked by the “register_dataset” method of the “api” class.

This module contains the code for registering and accessing pipeline outputs from a common location.

There are classes that provide methods for:

registering pipeline outputs to the common service endpoint
discovering the information available from the service endpoint (not yet written)
accessing information from the service endpoint (not yet written)

The service endpoint is the folder “api”. We use a REST-like syntax for providing access to the pipeline outputs.

Usage

Registering outputs on the service endpoint

All matrices registered on the API that hold per-cell statistics must have “library_id” and “barcode” columns. The library identifiers must correspond to those given in the “libraries.tsv” file. The “barcode” column should contain the unmodified Cellranger barcodes.
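The column requirement above can be checked before a table is registered. The following is a minimal sketch using only the standard library; the helper name missing_required_columns is hypothetical and is not part of cellhub, and tab-delimited input is assumed.

```python
import csv
import io

# Columns that per-cell tables must carry, as stated above.
REQUIRED_COLUMNS = ("library_id", "barcode")


def missing_required_columns(handle, delimiter="\t"):
    """Return the required columns that are absent from the header row.

    Hypothetical helper for illustration; not part of cellhub.
    """
    header = next(csv.reader(handle, delimiter=delimiter), [])
    return [col for col in REQUIRED_COLUMNS if col not in header]


# Example with an in-memory table; the values are made up for illustration.
example = io.StringIO(
    "library_id\tbarcode\tngenes\n"
    "libA\tAAACCTGAGAAACCAT-1\t1200\n"
)
print(missing_required_columns(example))  # prints []
```

An empty list means the table has both required columns. The check does not verify that the library identifiers match “libraries.tsv”.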

Please see pipeline_cellranger.py or pipeline_cell_qc.py source code for examples of how to register results on the API.

As an example, the code used to register the qcmetrics outputs is reproduced here with some comments:

import cellhub.tasks.api as api

file_set={}

...

# the set of files to be registered is defined as a dictionary
# the keys are arbitrary and will not appear in the api

file_set[library_id] = {"path": tsv_path,
                        "description": "qcmetric table for library " + library_id,
                        "format": "tsv"}

# an api object is created, passing the pipeline name
x = api.api("cell.qc")

# the dataset to be deposited is added
x.define_dataset(analysis_name="qcmetrics",
                 data_subset="filtered",
                 file_set=file_set,
                 analysis_description="per library tables of cell GEX qc statistics",
                 data_format="tsv")

# the dataset is linked in to the API
x.register_dataset()

Discovering available datasets

At present the API can be browsed on the command line. Programmatic access is expected in a future update.

Accessing datasets

At present datasets are accessed directly via the “api” endpoint.
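Until programmatic discovery is available, the endpoint folder can be walked with the standard library. The sketch below assumes that each registered dataset folder contains the manifest.yml file deposited at registration; the function name find_datasets is hypothetical and is not part of cellhub.

```python
import os
import tempfile


def find_datasets(endpoint="api"):
    """Return dataset folders (relative to the endpoint) holding a manifest.yml.

    Hypothetical helper; assumes one manifest.yml per registered dataset folder.
    """
    found = []
    for root, _dirs, files in os.walk(endpoint):
        if "manifest.yml" in files:
            found.append(os.path.relpath(root, endpoint))
    return sorted(found)


# Demonstration on a throwaway directory tree that mimics the endpoint layout.
with tempfile.TemporaryDirectory() as tmp:
    dataset_dir = os.path.join(tmp, "api", "cell.qc", "qcmetrics", "filtered")
    os.makedirs(dataset_dir)
    open(os.path.join(dataset_dir, "manifest.yml"), "w").close()
    print(find_datasets(os.path.join(tmp, "api")))
```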

Class and method documentation

class cellhub.tasks.api.api(pipeline=None, endpoint='api')

Bases: object

A class for defining and registering datasets on the cellhub api.

When initialising an instance of the class, the pipeline name is passed e.g.:

x = cellhub.tasks.api.api("cell_qc")

Note

Pipeline names are sanitised to replace spaces, underscores and hyphens with periods.
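The effect of this sanitisation can be illustrated with a one-line substitution. This is a sketch of the rule as described in the note, not the code used by the class; the function name sanitise_pipeline_name is hypothetical.

```python
import re


def sanitise_pipeline_name(name):
    """Replace each space, underscore or hyphen with a period.

    Hypothetical helper mirroring the rule described in the note.
    """
    return re.sub(r"[ _-]", ".", name)


print(sanitise_pipeline_name("cell_qc"))  # prints cell.qc
```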

define_dataset(analysis_name=None, analysis_description=None, data_subset=None, data_id=None, data_format=None, file_set=None)

Define the dataset.

The “data_subset”, “data_id” and “data_format” parameters are optional.

The file_set is a dictionary that contains the files to be registered:

{ "name": { "path": "path/to/file",
            "format": "file-format",
            "link_name": "api link name", # optional
            "description": "free-text" } }

The top-level “name” keys are arbitrary and are not exposed in the API.

For example, for Cellranger output the file_set dictionary looks like this:

{"barcodes": {"path": "path/to/barcodes.tsv",
              "format": "tsv",
              "description": "cell barcode file"},
 "features": {"path": "path/to/features.tsv",
              "format": "tsv",
              "description": "features file"},
 "matrix": {"path": "path/to/matrix.mtx",
            "format": "market-matrix",
            "description": "Market matrix file"}
}
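A file_set dictionary can be sanity-checked before it is passed to define_dataset. The sketch below assumes that “path”, “format” and “description” are required for every entry, since only “link_name” is marked optional above. The helper check_file_set is hypothetical and is not part of cellhub.

```python
# Assumption: every key except "link_name" is required.
REQUIRED_KEYS = {"path", "format", "description"}
OPTIONAL_KEYS = {"link_name"}


def check_file_set(file_set):
    """Return a list of human-readable problems found in a file_set dictionary."""
    problems = []
    for name, entry in file_set.items():
        keys = set(entry)
        missing = REQUIRED_KEYS - keys
        unknown = keys - REQUIRED_KEYS - OPTIONAL_KEYS
        if missing:
            problems.append(f"{name}: missing {sorted(missing)}")
        if unknown:
            problems.append(f"{name}: unknown {sorted(unknown)}")
    return problems


file_set = {
    "barcodes": {"path": "path/to/barcodes.tsv",
                 "format": "tsv",
                 "description": "cell barcode file"},
    "matrix": {"path": "path/to/matrix.mtx",
               "format": "market-matrix"},  # description left out on purpose
}
print(check_file_set(file_set))  # prints ["matrix: missing ['description']"]
```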

register_dataset()

Register the dataset on the service endpoint. The method:

creates the appropriate folders in the “api” endpoint folder
symlinks the source files to the target location
constructs and deposits the manifest.yml file
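The three steps above can be sketched with the standard library. This is an illustration under stated assumptions, not the implementation used by register_dataset: the function name link_files is hypothetical, the link name is assumed to default to the source file's base name when “link_name” is absent, and the manifest written here is a simplified stand-in whose layout is invented for the example.

```python
import os
import tempfile


def link_files(file_set, target_dir):
    """Create target_dir, symlink each file into it and write a minimal manifest.

    Hypothetical sketch; the real manifest.yml layout is not reproduced here.
    """
    # Step 1: create the folders.
    os.makedirs(target_dir, exist_ok=True)
    manifest_lines = []
    for entry in file_set.values():
        source = os.path.abspath(entry["path"])
        # Assumption: fall back to the source file name if no link_name is given.
        link_name = entry.get("link_name", os.path.basename(source))
        # Step 2: symlink the source file to the target location.
        os.symlink(source, os.path.join(target_dir, link_name))
        manifest_lines.append(f"{link_name}: {entry['description']}")
    # Step 3: deposit a (simplified) manifest next to the links.
    with open(os.path.join(target_dir, "manifest.yml"), "w") as handle:
        handle.write("\n".join(manifest_lines) + "\n")


# Demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    source_path = os.path.join(tmp, "qcmetrics.tsv")
    with open(source_path, "w") as handle:
        handle.write("library_id\tbarcode\n")
    target = os.path.join(tmp, "api", "cell.qc", "qcmetrics", "filtered")
    link_files({"libA": {"path": source_path,
                         "format": "tsv",
                         "description": "qcmetric table for library libA"}},
               target)
    print(sorted(os.listdir(target)))
```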

The location at which datasets will be registered is defined as:

api/pipeline.name/analysis_name/[data_subset/][data_id/][data_format/]

(data_subset, data_id and data_format are optional)
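The location template can be expressed as a small path-building function. This is a sketch of the layout described above rather than the code used by the class; the function name dataset_location is hypothetical, and the pipeline name is assumed to be already sanitised.

```python
import os


def dataset_location(pipeline, analysis_name, data_subset=None,
                     data_id=None, data_format=None, endpoint="api"):
    """Build the registration folder, skipping optional parts that are None.

    Hypothetical helper illustrating the documented layout.
    """
    parts = [endpoint, pipeline, analysis_name]
    parts += [p for p in (data_subset, data_id, data_format) if p is not None]
    return os.path.join(*parts)


# With the values from the qcmetrics example in the Usage section:
print(dataset_location("cell.qc", "qcmetrics",
                       data_subset="filtered", data_format="tsv"))
```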

show(): Print the api object for debugging.

reset_endpoint(): Clean the dataset endpoint