pipeline_celldb.py

Overview

This pipeline uploads the outputs from the preprocessing pipelines into an SQLite database. Cell identifiers are joined with sample metadata, QC statistics and other per-cell information in a database view called “final”.

Configuration

The pipeline requires a configured pipeline_celldb.yml file.

Default configuration files can be generated by executing:

cellhub celldb config

The user must edit the final_sql_query parameter in the configuration file to create the “final” database view appropriately.

Input files

  1. A user-supplied tab-separated sample metadata file (e.g. “samples.tsv”), specified via a path in the pipeline_celldb.yml configuration file. It should have columns for library_id and sample_id, as well as any other relevant experimental metadata such as condition, genotype, age, replicate, sex etc.

  2. The outputs of pipeline_cell_qc, which must be registered on the API.

  3. Optionally, the pipeline will load the results of pipeline_singleR and pipeline_dehash from the API if these tables are set to “active” in the configuration file.
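
As a rough illustration of the first input, the sketch below checks that a tab-separated metadata file carries the library_id and sample_id columns described above. The function name missing_columns and the example rows are hypothetical and are not part of cellhub.

```python
import csv
import io

# Columns the sample metadata file is expected to contain (from the text above).
REQUIRED_COLUMNS = {"library_id", "sample_id"}


def missing_columns(handle):
    """Return the required columns absent from a tab-separated file's header."""
    reader = csv.DictReader(handle, delimiter="\t")
    header = set(reader.fieldnames or [])
    return sorted(REQUIRED_COLUMNS - header)


# Hypothetical file content, held in memory for the purpose of the example.
example = io.StringIO(
    "library_id\tsample_id\tgenotype\n"
    "lib1\tsampleA\twildtype\n"
)
print(missing_columns(example))
```

For a real file, the same function could be called on an open file handle, e.g. `open("samples.tsv", newline="")`.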

Pipeline output

The pipeline returns an SQLite database that contains a “final” view which links cell identifiers with cell QC information, scrublet scores, user-provided metadata, cell type predictions (optional) and de-multiplexing assignments (optional). In the database, cell identity is encoded by the “barcode” and “library_id” fields, which are automatically indexed (as a multi-column index).
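
The structure described above can be sketched with Python's built-in sqlite3 module. This is a minimal illustration only: the table names (samples, gex_qcmetrics), the ngenes column, the index name and the SQL in the view are assumptions made for the example and do not reproduce the pipeline's actual schema or its final_sql_query.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Hypothetical tables standing in for the loaded metadata and QC metrics.
cur.execute("CREATE TABLE samples (library_id TEXT, sample_id TEXT, genotype TEXT)")
cur.execute("CREATE TABLE gex_qcmetrics (barcode TEXT, library_id TEXT, ngenes INTEGER)")

# Multi-column index on the two fields that together identify a cell.
cur.execute("CREATE INDEX gex_qcmetrics_idx ON gex_qcmetrics (barcode, library_id)")

cur.executemany(
    "INSERT INTO samples VALUES (?, ?, ?)",
    [("lib1", "sampleA", "wildtype"), ("lib2", "sampleB", "knockout")],
)
cur.executemany(
    "INSERT INTO gex_qcmetrics VALUES (?, ?, ?)",
    [("AAAC-1", "lib1", 1200), ("AAAG-1", "lib1", 300), ("AAAC-1", "lib2", 2500)],
)

# A "final" view joining per-cell QC rows to per-library sample metadata.
cur.execute(
    """
    CREATE VIEW final AS
    SELECT q.barcode, q.library_id, s.sample_id, s.genotype, q.ngenes
    FROM gex_qcmetrics AS q
    LEFT JOIN samples AS s ON q.library_id = s.library_id
    """
)
con.commit()

rows = cur.execute(
    "SELECT barcode, library_id, sample_id FROM final ORDER BY library_id, barcode"
).fetchall()
for row in rows:
    print(row)
```

Note that the same barcode can occur in more than one library, which is why both fields are needed to identify a cell.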

Code

cellhub.pipeline_celldb.connect()

Helper function to connect to the database.

cellhub.pipeline_celldb.load_samples(outfile)

Load the sample metadata table into the database.

cellhub.pipeline_celldb.load_gex_qcmetrics(outfile)

Load the gex qcmetrics into the database.

cellhub.pipeline_celldb.load_gex_scrublet(outfile)

Load the scrublet scores into the database.

cellhub.pipeline_celldb.load_singleR(outfile)

Load the singleR predictions into the database.

cellhub.pipeline_celldb.load_gmm_demux(outfile)

Load the gmm demux dehashing calls into the database.

cellhub.pipeline_celldb.load_demuxEM(outfile)

Load the demuxEM dehashing calls into the database.

cellhub.pipeline_celldb.load_souporcell(outfile)

Load the souporcell cluster results into the database.

cellhub.pipeline_celldb.final(outfile)

Construct a “final” view on the database from which the cells can be selected and fetched by pipeline_fetch_cells.py.
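
To illustrate how cells might be selected from such a view, here is a small sketch using sqlite3. The fetch_cells helper, the qc table, the ngenes column and the threshold are all hypothetical; the sketch does not show how pipeline_fetch_cells.py itself works.

```python
import sqlite3


def fetch_cells(con, min_genes):
    """Return (barcode, library_id) pairs from the "final" view passing a filter."""
    query = (
        "SELECT barcode, library_id FROM final "
        "WHERE ngenes >= ? ORDER BY library_id, barcode"
    )
    return con.execute(query, (min_genes,)).fetchall()


# Tiny in-memory stand-in for the pipeline's database (hypothetical schema).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE qc (barcode TEXT, library_id TEXT, ngenes INTEGER)")
con.executemany(
    "INSERT INTO qc VALUES (?, ?, ?)",
    [("AAAC-1", "lib1", 1200), ("AAAG-1", "lib1", 300), ("AAAC-1", "lib2", 2500)],
)
con.execute("CREATE VIEW final AS SELECT * FROM qc")

selected = fetch_cells(con, 500)
print(selected)
```

Passing the threshold as a bound parameter, rather than formatting it into the SQL string, keeps the query safe to reuse with user-supplied values.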