This pipeline uploads the outputs from the preprocessing pipelines into an SQLite database. Cell identifiers are joined with sample metadata, QC statistics and other per-cell information in a view called “final” in the database.


The pipeline requires a configured pipeline_celldb.yml file.

Default configuration files can be generated by executing:

cellhub celldb config

The user must edit the final_sql_query parameter in the configuration file to create the “final” database view appropriately.

Input files

  1. A user-supplied tab-separated sample metadata file (e.g. “samples.tsv”), specified via a path in the pipeline_celldb.yml configuration file. It should have columns for library_id and sample_id, as well as any other relevant experimental metadata such as condition, genotype, age, replicate, sex, etc.

  2. The outputs of pipeline_cell_qc, which must be registered on the API.

  3. Optionally, the pipeline will load the results of pipeline_singleR and pipeline_dehash from the API if these tables are set to “active” in the configuration file.
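As a rough illustration of the first input, the sketch below checks that a tab-separated metadata file carries the library_id and sample_id columns. The function name check_sample_metadata and the example rows are hypothetical and are not part of the pipeline.

```python
import csv
import io

# Columns the documentation says the metadata file should contain.
REQUIRED_COLUMNS = {"library_id", "sample_id"}


def check_sample_metadata(handle):
    """Read a tab-separated metadata file and return its rows as dicts.

    Raises ValueError if a required column is missing.
    (Hypothetical helper, for illustration only.)
    """
    reader = csv.DictReader(handle, delimiter="\t")
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        raise ValueError("missing columns: " + ", ".join(sorted(missing)))
    return list(reader)


# Made-up example content; a real file would be opened from disk.
example = io.StringIO(
    "library_id\tsample_id\tcondition\n"
    "lib1\ts1\tcontrol\n"
    "lib2\ts2\ttreated\n"
)
rows = check_sample_metadata(example)
```

Running a check like this before starting the pipeline is one way to catch a misnamed column early.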

Pipeline output

The pipeline returns an SQLite database that contains a “final” view which links cell identifiers with cell QC information, scrublet scores, user-provided metadata, cell type predictions (optional) and de-multiplexing assignments (optional). In the database, cell identity is encoded by the “barcode” and “library_id” fields, which are automatically indexed (as a multi-column index).
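The following minimal sketch shows why both fields are needed to identify a cell and what a multi-column index on them looks like in SQLite. The table name cells, the index name cells_idx, the ngenes column and the rows are assumptions made for the example; the pipeline's own table and index names may differ.

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Hypothetical per-cell table; the same barcode can occur in several libraries.
con.execute("CREATE TABLE cells (barcode TEXT, library_id TEXT, ngenes INTEGER)")
con.executemany(
    "INSERT INTO cells VALUES (?, ?, ?)",
    [("AAAC-1", "lib1", 1200), ("AAAC-1", "lib2", 800), ("GGGT-1", "lib1", 1500)],
)

# Multi-column index over the two fields that together identify a cell.
con.execute("CREATE INDEX cells_idx ON cells (barcode, library_id)")

# Look up a single cell by barcode and library.
row = con.execute(
    "SELECT ngenes FROM cells WHERE barcode = ? AND library_id = ?",
    ("AAAC-1", "lib2"),
).fetchone()

# Column names covered by the index, in index order.
index_cols = [r[2] for r in con.execute("PRAGMA index_info('cells_idx')")]
```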



Helper function to connect to the database.
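A connection helper of this kind could look roughly like the sketch below, which uses Python's standard sqlite3 module. The function name connect and its default argument are assumptions; this is not the pipeline's actual implementation.

```python
import sqlite3


def connect(database=":memory:"):
    """Open an SQLite connection whose rows can be accessed by column name.

    Hypothetical sketch; the default opens a throwaway in-memory database.
    """
    con = sqlite3.connect(database)
    con.row_factory = sqlite3.Row
    return con
```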


Load the sample metadata table.
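One simple way to load a tab-separated file into an SQLite table is sketched below using only the standard library. The function load_tsv, the table name sample and the example rows are hypothetical; every column is stored as TEXT to keep the sketch short, and the table name is interpolated directly, so it should only come from trusted configuration.

```python
import csv
import io
import sqlite3


def load_tsv(con, handle, table):
    """Create `table` from the header of a TSV file and insert its rows.

    Returns the number of columns. Hypothetical helper, for illustration.
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    columns = ", ".join('"{}" TEXT'.format(name) for name in header)
    con.execute('CREATE TABLE "{}" ({})'.format(table, columns))
    placeholders = ", ".join("?" for _ in header)
    con.executemany(
        'INSERT INTO "{}" VALUES ({})'.format(table, placeholders), reader
    )
    con.commit()
    return len(header)


con = sqlite3.connect(":memory:")
example = io.StringIO(
    "library_id\tsample_id\tcondition\n"
    "lib1\ts1\tcontrol\n"
    "lib2\ts2\ttreated\n"
)
ncols = load_tsv(con, example, "sample")
nrows = con.execute("SELECT COUNT(*) FROM sample").fetchone()[0]
```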


Load the GEX QC metrics into the database.


Load the scrublet scores into the database.


Load the singleR predictions into the database.


Load the GMM-Demux dehashing calls into the database.


Load the demuxEM dehashing calls into the database.


Load the souporcell clustering results into the database.


Construct a “final” view on the database from which the cells can be selected and fetched by pipeline_fetch_cells.py.
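To make the idea of the “final” view concrete, the sketch below builds a view from a query string that joins per-cell tables on barcode and library_id and attaches sample metadata by library_id. The table names (sample, gex_qcmetrics, gex_scrublet), the column names and the rows are assumptions for illustration; the real query is whatever the user supplies in the final_sql_query parameter.

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Hypothetical tables standing in for those loaded by the pipeline.
con.executescript(
    """
    CREATE TABLE sample (library_id TEXT, sample_id TEXT, condition TEXT);
    CREATE TABLE gex_qcmetrics (barcode TEXT, library_id TEXT, ngenes INTEGER);
    CREATE TABLE gex_scrublet (barcode TEXT, library_id TEXT, scrub_score REAL);
    INSERT INTO sample VALUES ('lib1', 's1', 'control');
    INSERT INTO gex_qcmetrics VALUES ('AAAC-1', 'lib1', 1200);
    INSERT INTO gex_qcmetrics VALUES ('GGGT-1', 'lib1', 1500);
    INSERT INTO gex_scrublet VALUES ('AAAC-1', 'lib1', 0.05);
    """
)

# Example query of the kind a user might place in final_sql_query (assumed).
final_sql_query = """
    SELECT qc.barcode, qc.library_id, s.sample_id, s.condition,
           qc.ngenes, scr.scrub_score
    FROM gex_qcmetrics qc
    LEFT JOIN gex_scrublet scr
           ON qc.barcode = scr.barcode AND qc.library_id = scr.library_id
    LEFT JOIN sample s
           ON qc.library_id = s.library_id
"""

# The view stores the query, not the data, so it reflects the current tables.
con.execute("CREATE VIEW final AS " + final_sql_query)

rows = con.execute("SELECT * FROM final ORDER BY barcode").fetchall()
```

Using LEFT JOIN here keeps cells that lack an optional annotation (the second cell has no scrublet score in this example); an inner join would drop them instead.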