pipeline_celldb.py
Overview
This pipeline uploads the outputs of the preprocessing pipelines into an SQLite database. Cell identifiers are joined with sample metadata, QC statistics and other per-cell information in a view (a virtual table) called “final” in the database.
Configuration
The pipeline requires a configured pipeline_celldb.yml file.
- Default configuration files can be generated by executing:
cellhub celldb config
The user must edit the final_sql_query parameter in the configuration file to create the “final” database view appropriately.
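As a rough illustration of what a query behind the “final” view might look like, the sketch below builds a toy in-memory SQLite database and creates a view from a query string. The table names (sample, gex_qcmetrics, gex_scrublet), the column names other than barcode, library_id and sample_id, and the way the query is wrapped in CREATE VIEW are all assumptions made for this example; check the default pipeline_celldb.yml for the names the pipeline actually uses.

```python
import sqlite3

# Toy in-memory database. All table names and the ngenes / scrub_score /
# genotype columns are hypothetical and used for illustration only.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE sample (library_id TEXT, sample_id TEXT, genotype TEXT);
CREATE TABLE gex_qcmetrics (barcode TEXT, library_id TEXT, ngenes INTEGER);
CREATE TABLE gex_scrublet (barcode TEXT, library_id TEXT, scrub_score REAL);

INSERT INTO sample VALUES ('lib1', 'sampleA', 'wt');
INSERT INTO gex_qcmetrics VALUES ('AAAC-1', 'lib1', 1200);
INSERT INTO gex_qcmetrics VALUES ('AAAG-1', 'lib1', 800);
INSERT INTO gex_scrublet VALUES ('AAAC-1', 'lib1', 0.05);
INSERT INTO gex_scrublet VALUES ('AAAG-1', 'lib1', 0.40);
""")

# A stand-in for the final_sql_query parameter: per-cell tables are joined
# on (barcode, library_id); sample metadata is joined on library_id.
final_sql_query = """
SELECT qc.barcode, qc.library_id, s.sample_id, s.genotype,
       qc.ngenes, scrub.scrub_score
FROM gex_qcmetrics qc
LEFT JOIN gex_scrublet scrub
       ON qc.barcode = scrub.barcode AND qc.library_id = scrub.library_id
LEFT JOIN sample s
       ON qc.library_id = s.library_id
"""

# Assumption: the query is turned into a view named "final".
con.execute("CREATE VIEW final AS " + final_sql_query)

rows = con.execute(
    "SELECT barcode, sample_id, scrub_score FROM final ORDER BY barcode"
).fetchall()
print(rows)
```

Each row of the resulting view describes one cell, with the sample metadata repeated for every cell from the same library.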
Input files
The pipeline requires a user-supplied, tab-separated sample metadata file (e.g. “samples.tsv”), whose path is set in the pipeline_celldb.yml configuration file. It should have library_id and sample_id columns, as well as any other relevant experimental metadata such as condition, genotype, age, replicate, sex etc.
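Before running the pipeline it can be useful to check that the sample metadata file has the required columns. The following is a minimal sketch using only the Python standard library; the helper name check_sample_table and the example rows are hypothetical and are not part of cellhub.

```python
import csv
import io

# Columns named as required in the description above.
REQUIRED_COLUMNS = {"library_id", "sample_id"}


def check_sample_table(handle):
    """Read a tab-separated sample table and verify the required columns.

    Hypothetical helper for illustration; returns the rows as dictionaries.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing required columns: {sorted(missing)}")
    return list(reader)


# Made-up example content standing in for a real samples.tsv file.
example = io.StringIO(
    "library_id\tsample_id\tcondition\tgenotype\n"
    "lib1\tsampleA\tcontrol\twt\n"
    "lib2\tsampleB\ttreated\tko\n"
)
samples = check_sample_table(example)
print(len(samples), samples[0]["sample_id"])
```

For a file on disk, the same helper could be called with an open file handle, for example `with open("samples.tsv", newline="") as fh: check_sample_table(fh)`.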
The pipeline requires the outputs of pipeline_cell_qc to be registered on the API.
Optionally, the pipeline will load the results of pipeline_singleR and pipeline_dehash from the API if these tables are set to “active” in the configuration file.
Pipeline output
The pipeline returns an SQLite database that contains a “final” view which links cell identifiers with cell QC information, scrublet scores, user-provided metadata, cell type predictions (optional) and de-multiplexing assignments (optional). In the database, cell identity is encoded by the “barcode” and “library_id” fields, which are automatically indexed (as a multi-column index).
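The sketch below shows, on a toy table, what a multi-column index on “barcode” and “library_id” looks like in SQLite and why both fields are used together to identify a cell. The table name gex_qcmetrics, the index name and the ngenes column are assumptions for this example; the pipeline creates its own indexes automatically.

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Hypothetical per-cell table; only barcode and library_id come from the
# description above.
con.execute(
    "CREATE TABLE gex_qcmetrics (barcode TEXT, library_id TEXT, ngenes INTEGER)"
)

# One index covering both columns that together identify a cell.
con.execute(
    "CREATE INDEX gex_qcmetrics_cell ON gex_qcmetrics (barcode, library_id)"
)

# In this toy data the same barcode appears in two libraries, so the
# barcode alone does not pick out a single cell.
con.execute("INSERT INTO gex_qcmetrics VALUES ('AAAC-1', 'lib1', 1200)")
con.execute("INSERT INTO gex_qcmetrics VALUES ('AAAC-1', 'lib2', 950)")

# PRAGMA index_info returns (seqno, cid, name) for each indexed column.
index_columns = [
    row[2] for row in con.execute("PRAGMA index_info('gex_qcmetrics_cell')")
]
print(index_columns)

# Selecting a single cell requires both fields.
cell = con.execute(
    "SELECT ngenes FROM gex_qcmetrics WHERE barcode = ? AND library_id = ?",
    ("AAAC-1", "lib2"),
).fetchone()
print(cell)
```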
Code
- cellhub.pipeline_celldb.connect()
Helper function to connect to DB
- cellhub.pipeline_celldb.load_samples(outfile)
Load the sample metadata table.
- cellhub.pipeline_celldb.load_gex_qcmetrics(outfile)
Load the gex qcmetrics into the database.
- cellhub.pipeline_celldb.load_gex_scrublet(outfile)
Load the scrublet scores into database.
- cellhub.pipeline_celldb.load_singleR(outfile)
Load the singleR predictions into the database.
- cellhub.pipeline_celldb.load_gmm_demux(outfile)
Load the gmm demux dehashing calls into the database.
- cellhub.pipeline_celldb.load_demuxEM(outfile)
Load the demuxEM dehashing calls into the database.
- cellhub.pipeline_celldb.load_souporcell(outfile)
Load the souporcell cluster result into the database.
- cellhub.pipeline_celldb.final(outfile)
Construct a “final” view on the database from which the cells can be selected and fetched by pipeline_fetch_cells.py