sfaira - data and model repository for single-cell data¶
sfaira is a model and a data repository in a single python package. We provide an interactive overview of the current state of the zoos on sfaira-portal.
sfaira fits into an environment of many other projects centred on making data and models accessible.
Data zoo¶
We focus on providing a python interface for interacting with locally stored data set collections without requiring dedicated data reading and annotation harmonisation scripts: such code is absorbed into our data zoo backend and can be conveniently triggered with short commands.
Model zoo¶
A large body of recent research has been devoted to improving models that learn representations of cells captured with single-cell RNA-seq. These models include embedding models such as autoencoders and cell type prediction models. Many of these models are implemented in software packages and can be deployed on new data sets. In many cases, it also makes sense to use pre-trained models to leverage previously published modelling results. We provide a single interface for interacting with such pre-trained models, which abstracts model settings into an API so that users can easily switch between different pre-trained models. Importantly, model execution is performed locally, so that data does not have to be uploaded to external servers, and model storage is decentralised, so that anybody can contribute models easily. Users benefit from easy, streamlined access to models that can be used in analysis workflows; developers benefit from being able to deploy models to a large community of users without having to set up a model zoo.
News¶
No news yet, stay tuned!
Latest additions¶
Installation¶
sfaira is pip installable.
PyPI¶
To install a sfaira release directly from PyPI, run:
pip install sfaira
Install a development version¶
To install a specific branch target_branch of sfaira from a clone, run:
cd target_directory
git clone https://github.com/theislab/sfaira.git
cd sfaira
git checkout target_branch
git pull
pip install -e .
In most cases, you would install one of the following:
- You may choose the branch release if you want a relatively stable version that is similar to the current release but may already have additional features.
- You may choose the branch dev if you want newer features than are available from release.
- You may choose a specific feature branch if you want to use or improve that feature before it is reviewed and merged into dev.
Note that the master branch only contains releases, so every installation based on the master branch can also be performed via PyPI.
API¶
Import sfaira as:
import sfaira
Data: data¶
Data loaders¶
The sfaira data zoo API.
Dataset-representing classes used for development:
- Container class that co-manages multiple data sets, removing the need to call Dataset() methods directly by wrapping them.
- Container for multiple DatasetGroup instances.
Interactive data class to use a loaded data object in the context of sfaira tools.
Dataset universe to interact with all data loader classes.
Stores¶
We distinguish stores for a single feature space, which could for example be a single organism, and those for multiple feature spaces. Critically, data from multiple feature spaces can be represented as one data array per feature space. In load_store, we represent a directory of datasets as an instance of a multi-feature-space store and discover all feature spaces present. This store can be subsetted to a single-feature-space store if, for example, only data corresponding to a single organism is desired.
The core API exposed to users is:
- load_store: Instantiates a distributed store class.
Store classes for a single feature space:
- Data set group class tailored to data access requirements common in high-performance computing (HPC).
Store classes for multiple feature spaces:
- Umbrella class for a dictionary over multiple instances of DistributedStoreSingleFeatureSpace.
A minimal usage sketch is shown below.
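A hedged sketch of instantiating and subsetting a store (the path and the store_format argument are assumptions; see the API reference for the exact signature):
import sfaira

# Sketch: "store/" is a placeholder directory of streamlined datasets.
store = sfaira.data.load_store(cache_path="store/", store_format="dao")
# Subset the multi-feature-space store, e.g. to a single organism:
store.subset(attr_key="organism", values=["Homo sapiens"])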
Carts¶
Stores represent on-disk data collections and perform operations such as subsetting. Ultimately, they are often used to emit data objects, which are "carts". Carts are specific to the underlying store's data format and expose iterators, data matrices and adaptors to machine learning framework data pipelines, such as tensorflow and torch data. Again, carts can cover one or multiple feature spaces.
- Cart for a DistributedStoreSingleFeatureSpace().
- Cart for a DistributedStoreMultipleFeatureSpaceBase().
The emission of data from cart iterators and adaptors is controlled by batch schedules, which direct how data is released from the underlying data matrix:
- Manages distribution of selected indices for a given data object over subsequent batches.
- Standard batched access to data.
- Balanced batches across meta data partitions of data.
- Meta data-defined blocks of observations in each batch.
- Emits full dataset as a single batch in each query.
For most purposes related to stochastic optimisation, BatchDesignBasic is chosen.
Estimator classes: estimators¶
Estimator classes from the sfaira model zoo API for advanced use.
- Estimator base class for keras models.
- Estimator class for the cell type model.
- Estimator class for the embedding model.
Model classes: models¶
Model classes from the sfaira model zoo API for advanced use.
Cell type models¶
Classes that wrap tensorflow cell type predictor models.
- Marker gene-based cell type classifier: learns whether each gene exceeds a required threshold and learns cell type assignment as a linear combination of these marker gene presence probabilities.
- Multi-layer perceptron to predict cell type.
Embedding models¶
Classes that wrap tensorflow embedding models.
- Combines the encoder and decoder into an end-to-end model for training.
Train: train¶
The interface for training sfaira compatible models.
Trainer classes¶
Classes that wrap estimator classes to use in grid search training.
Grid search summaries¶
Classes to pool evaluation metrics across fits in a grid search.
Versions: versions¶
The interface for sfaira metadata management.
Genomes¶
Genome management.
- Container class for a genome annotation for a specific release.
Metadata¶
Dataset metadata management. Base classes to manage ontology files:
- Basic unordered ontology container.
- Basic ordered ontology container.
Ontology-specific classes.
Class wrapping the cell type ontology for predictor models:
- Cell type universe (list) and ontology (hierarchy) container class.
Topologies¶
Model topology management.
- Class interface for a YAML-style defined model topology that loads a genome container tailored to the model.
User interface: ui¶
This sub-module gives users access to the model zoo, including model query from remote servers. This API is designed to be used in analysis workflows and does not require any understanding of the way models are defined and stored.
- This class performs data set handling and coordinates estimators for the different model types. Example code to obtain a UMAP plot of the embedding created from your data, with cell type labels, is sketched below.
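A hedged sketch of such a workflow (model ids, the embedding key and the prediction column are placeholders or assumptions; exact method names may differ between versions, see the class docstring):
import anndata
import scanpy
import sfaira

ui = sfaira.ui.UserInterface(sfaira_repo=True)   # query models from the public zoo
ui.zoo_embedding.model_id = "embedding_..."      # placeholder: pick an id from the model zoo
ui.zoo_celltype.model_id = "celltype_..."        # placeholder
ui.load_data(anndata.read_h5ad("my_data.h5ad"))  # your locally stored data
ui.load_model_embedding()
ui.load_model_celltype()
ui.predict_all()  # writes embedding and cell type predictions into the data object
adata = ui.data.adata
scanpy.pp.neighbors(adata, use_rep="X_sfaira")   # assumed .obsm key of the embedding
scanpy.tl.umap(adata)
scanpy.pl.umap(adata, color="celltypes_sfaira")  # assumed .obs key of the predictions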
Commandline interface¶
sfaira¶
Create and manage sfaira dataloaders.
sfaira [OPTIONS] COMMAND [ARGS]...
Options
- --version¶
Show the version and exit.
- -v, --verbose¶
Enable verbose output (print debug statements).
- -l, --log-file <log_file>¶
Save a verbose log to a file.
annotate-dataloader¶
Annotates a dataloader.
sfaira annotate-dataloader [OPTIONS]
Options
- --doi <doi>¶
Required. The DOI of the paper that the data loader refers to.
- --path-data <path_data>¶
Absolute path of the location of the raw data directory.
- --path-loader <path_loader>¶
Relative path from the current directory to the location of the data loader.
- --schema <schema>¶
The curation schema to check meta data availability for.
cache-clear¶
Clears sfaira cache, including ontology and genome cache.
sfaira cache-clear [OPTIONS]
cache-reload¶
Downloads new ontology versions into cache.
sfaira cache-reload [OPTIONS]
create-dataloader¶
Interactively create a new sfaira dataloader.
sfaira create-dataloader [OPTIONS]
Options
- --path-data <path_data>¶
Absolute path of the desired location of the raw data directory.
- --path-loader <path_loader>¶
Relative path from the current directory to the desired location of the data loader.
export-h5ad¶
Creates a collection of streamlined h5ad objects for a given DOI.
sfaira export-h5ad [OPTIONS]
Options
- --doi <doi>¶
Required. The DOI of the paper that the data loader refers to.
- --schema <schema>¶
Schema to streamline to, e.g. ‘cellxgene’
- --path-out <path_out>¶
Absolute path of the location of the streamlined output h5ads.
- --path-data <path_data>¶
Absolute path of the location of the raw data directory.
- --path-loader <path_loader>¶
Relative path from the current directory to the location of the data loader.
- --path-cache <path_cache>¶
The optional absolute path to a cached data library maintained by sfaira. Using such a cache speeds up loading in sequential runs but is not necessary.
finalize-dataloader¶
Formats .tsvs and runs a full data loader test.
sfaira finalize-dataloader [OPTIONS]
Options
- --doi <doi>¶
Required. The DOI of the paper that the data loader refers to.
- --path-data <path_data>¶
Absolute path of the location of the raw data directory.
- --path-loader <path_loader>¶
Relative path from the current directory to the location of the data loader.
- --schema <schema>¶
The curation schema to check meta data availability for.
publish-dataloader¶
Interactively create a GitHub pull request for a newly created data loader. This only works when called in the sfaira CLI docker container. Runs a full data loader test before starting the pull request.
sfaira publish-dataloader [OPTIONS]
test-dataloader¶
Runs a full data loader test.
sfaira test-dataloader [OPTIONS]
Options
- --doi <doi>¶
Required. The DOI of the paper that the data loader refers to.
- --path-data <path_data>¶
Absolute path of the location of the raw data directory.
- --path-loader <path_loader>¶
Relative path from the current directory to the location of the data loader.
- --schema <schema>¶
The curation schema to check meta data availability for.
validate-dataloader¶
Verifies the dataloader against sfaira’s requirements.
sfaira validate-dataloader [OPTIONS]
Options
- --doi <doi>¶
Required. The DOI of the paper that the data loader refers to.
- --path-loader <path_loader>¶
Relative path from the current directory to the desired location of the data loader.
- --schema <schema>¶
The curation schema to check meta data availability for.
validate-h5ad¶
Runs a component test on a streamlined h5ad object.
h5ad is the absolute path of the .h5ad file to test. schema is the schema type (e.g. "cellxgene") to test.
sfaira validate-h5ad [OPTIONS]
Options
- --h5ad <h5ad>¶
- --schema <schema>¶
Tutorials¶
We provide multiple tutorials in a separate repository:
- A tutorial for interacting with the data loaders via the Universe class (universe).
- A tutorial for general usage of the user interface (user_interface).
- A tutorial for zero-shot analysis with the user interface (pbmc3k).
- A tutorial for creating a meta data-based data zoo overview figure (meta_data).
The data life cycle¶
The life cycle of a single-cell count matrix often looks as follows:
1. Generation from primary read data in a read alignment pipeline.
2. Annotation with cell types and sample meta data.
3. Publication of annotated data, often together with a manuscript.
4. Curation of this public data set for the purpose of a meta study. In a python workflow, this curation step could be a scanpy script based on data from step 3, for example.
5. Usage of data curated specifically for the use case at hand, for example for a targeted analysis or a training of a machine learning model.
Steps 1-3 are often only performed once by the original authors of the data set, while steps 4 and 5 are repeated multiple times in the community for different meta studies. Sfaira offers the following functionality groups that accelerate steps along this pipeline:
Sfaira tools across life cycle¶
I) Data loaders¶
We maintain streamlined data loader code that improves curation (step 4) and makes this step sharable and iteratively improvable. Read more in our guide to data contribution Writing data loaders.
II) Dataset, DatasetGroup, DatasetSuperGroup¶
Using the data loaders from (I), we built an interface that can flexibly download, subset and curate data sets from the sfaira data zoo, thus improving usage (step 5). This interface can yield adata instances to be used in a scanpy pipeline, for example; see the sketch below. Read more in our guide to data consumption Using data loaders.
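For illustration, a hedged sketch of such a workflow (paths and the subset key/values are assumptions; see the Using data loaders guide for the exact interface):
import sfaira

# Sketch: paths are placeholders for local directories.
ds = sfaira.data.Universe(data_path="raw/", meta_path="meta/", cache_path="cache/")
ds.subset(key="organism", values=["Homo sapiens"])  # assumed metadata key and value
ds.download()  # fetch the raw files of the selected datasets
ds.load()      # execute the data loaders
# The loaded datasets can now be accessed as AnnData instances for a scanpy pipeline.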
III) Stores¶
Using the streamlined data set collections from (II), we built a computationally efficient data interface for machine learning on such large, distributed data set collections, thus improving usage (step 5): specifically, this interface is optimised for out-of-core observation-centric indexing in scenarios that are typical to machine learning on single-cell data. Read more in our guide to data stores Data stores.
FAIR data¶
FAIR data is a set of data management guidelines that are designed to improve data reuse and automated access (see also the original publication of FAIR for more details). The key data management topics addressed by FAIR are findability, accessibility, interoperability and reusability. Single-cell data sets are usually public and also adhere to varying degrees to FAIR principles. We designed sfaira so that it improves FAIR attributes of published data sets beyond their state at publication. Specifically, sfaira:
improves findability of data sets by serving data sets through complex meta data query.
improves accessibility of data sets by serving streamlined data sets.
improves interoperability of data sets by streamlining data using versioned meta data ontologies.
improves reusability of data sets by allowing for iterative improvements of meta data annotation and by shipping usage critical meta data.
Writing data loaders¶
For a high-level overview of data management in sfaira, read The data life cycle first. In brief, a data loader is a set of instructions that allows for streamlining of raw count matrices and meta data into objects of a target format. Here, streamlining means that gene names are controlled based on a genome assembly, metadata items are constrained to follow ontologies, and key study metadata are described. This streamlining increases accessibility and visibility of a dataset and makes it available to a large audience. In sfaira, data loaders are grouped by scientific study (DOI of a preprint or DOI of a publication). A data loader for a study is a directory named after the DOI of the study that contains code and text files. This directory is part of the sfaira python package and, thus, maintained on GitHub. This allows for data loaders to be maintained via GitHub workflows: contribution and fixes via pull requests and deployment via repository cloning and package installation.
A dataloader consists of four file components within a single directory:
1. an __init__.py file, which has the same content in all loaders,
2. an ID.py file that contains a load() function with basic instructions for loading the raw data from disk,
3. an ID.yaml file that describes most meta data,
4. ID*.tsv files with ontology-wise maps of free-text metadata items to a constrained vocabulary.
All dataset-specific components receive an ID that is set during the curation process. Below, we describe how multiple datasets within a study can be handled with the same dataloader. In cases where this is not efficient, one can go through the data loader creation process once for each dataset and then group the resulting loaders (file groups 1-4) in a single directory named after the study's DOI.
An experienced curator can directly write such a data loader. However, first-time contributors often struggle with the interplay of individual files, metadata maps from free-text annotation are notoriously buggy, and comprehensive testing is important, also for contributions by experienced curators. Therefore, we broke the process of writing a loader down into phases and built a CLI to guide users through this process. Each phase corresponds to one command (one execution of a shell command) in the CLI. In addition, the CLI guides the user through manual steps that are necessary in each phase. We structured the process of curation into four phases; a preparatory phase P precedes CLI execution and is described in this documentation.
- Phase P (prepare): data and python environment setup for curation.
- Phase 1 (create): a load() function (in a .py) and a YAML are written.
- Phase 2 (annotate): ontology-specific maps of free-text metadata to constrained vocabulary (in *.tsv) are written.
- Phase 3 (finalize): the data loader is tested and metadata are cleaned up.
- Phase 4 (publish): the data loader is uploaded to the sfaira GitHub repository.
An experienced curator could skip using the CLI for phase 1 and write the __init__.py, ID.py and ID.yaml by hand. In this case, we still highly recommend using the CLI for phases 2 and 3. Note that phase 2 is only necessary if you have free-text metadata that needs to be mapped; the CLI will point this out accordingly in phase 1. This 4-phase cycle completes initial curation and results in data loader code that can be pushed to the sfaira GitHub repository.
You have the choice between using a docker image or a sfaira installation (e.g. in conda) for phases P-4. The workflow is more restricted but safer in docker; we recommend docker if you are inexperienced with software development with conda, git and GitHub. Where appropriate, separate instruction options are given for workflows in conda and docker below. Overall, the workflow looks the same in both frameworks, though.
This cycle can be complemented by an optional workflow to cache curated .h5ad objects (e.g. on the cellxgene website):
- Phase 5 (export-h5ad): the data loader is used to create a streamlined .h5ad of a particular format.
- Phase 6 (validate-h5ad): the .h5ad from phase 5 is checked for compliance with a particular format (e.g. the cellxgene format).
The resulting .h5ad can be shared with collaborators or uploaded to data submission servers.
Create a new data loader¶
Phase P: Preparation¶
Before you start writing the data loader, we recommend completing these checks and preparation measures. Phase P is sub-structured into sub-phases:
- Pa. Name the data loader.
We will choose a name for the dataloader based on its DOI. Prefix the DOI with "d" and replace the special characters in the DOI with "_" to prevent copy mistakes, e.g. the DOI 10.1000/j.journal.2021.01.001 becomes d10_1000_j_journal_2021_01_001. Remember to replace this DOI with the DOI of the study you want to contribute; choose a publication (journal) DOI if available, otherwise a preprint DOI. If neither DOI is available, because this is unpublished data, for example, use an identifier that makes sense to you, that is prefixed with dno_doi and contains a name of an author of the dataset, e.g. dno_doi_einstein_brain_atlas. We will refer to this name as DOI-name; it will be used to label the contributed code and the stored data. A sketch of the renaming rule is shown below.
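The renaming rule can be sketched in a few lines of python (this helper is purely illustrative and not part of the sfaira API):
import re

def doi_to_name(doi: str) -> str:
    # Prefix with "d" and replace all special characters with "_".
    return "d" + re.sub(r"[^a-zA-Z0-9]", "_", doi)

# "10.1000/j.journal.2021.01.001" -> "d10_1000_j_journal_2021_01_001"
print(doi_to_name("10.1000/j.journal.2021.01.001"))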
- Pb. Check that the data loader was not already implemented.
We open issues for all planned data loaders, so you can search both the code base and our GitHub issues for matching data loaders before you start writing one. You can also search for GEO IDs in our code base, as they are included in the data URL that is annotated in the data loader. The core data loader identifier is the directory-compatible DOI, which is the DOI with all special characters replaced by "_" and a "d" prefix: "10.1016/j.cell.2019.06.029" becomes "d10_1016_j_cell_2019_06_029". Searching for this string should yield a match if it is already implemented; take care to look for both preprint and publication DOIs if both are available. We also mention publication names in issues; you will however not find these in the code.
- Pc. Prepare an installation of sfaira to use for data loader writing.
Instead of working in your own sfaira installation, you can download the sfaira data curation docker container, which saves you going through any of the conda steps here.
- Pc-docker.
Install docker (and start Docker Desktop if you're on Mac or Windows).
- Pull the latest version of the sfaira CLI container:
sudo docker pull leanderd/sfaira-cli:latest
- Run the sfaira CLI within the docker image. Please replace <path_data> and <path_loader> with paths to two empty directories on your machine. The sfaira CLI will use these to read your data files from and to write the dataloaders to, respectively:
PATH_DATA=<path_data> PATH_LOADER=<path_loader> sudo docker run --rm -it -v ${PATH_DATA}:/root/sfaira_data -v ${PATH_LOADER}:/root/sfaira_loader leanderd/sfaira-cli:latest
- Pc-conda.
Jump to step 4 if you do not require explanations of specific parts of the shell script.
- Install sfaira.
Clone sfaira into a local repository DIR_SFAIRA:
cd DIR_SFAIRA
git clone https://github.com/theislab/sfaira.git
cd sfaira
git checkout dev
- Prepare a local branch of sfaira dedicated to your loader.
You can name this branch after the DOI-name; prefix the branch with data/ as the suggested code change is a data addition:
cd DIR_SFAIRA
cd sfaira
git checkout dev
git pull
git checkout -b data/DOI-name
- Install sfaira into a conda environment.
You can for example use pip inside a conda environment dedicated to data curation:
cd DIR_SFAIRA
cd sfaira
git checkout -b data/DOI-name
conda create -n sfaira_loader
conda install -n sfaira_loader python=3.8
conda activate sfaira_loader
pip install -e .
- Summary of steps 1-3.
Pc1-3 are all covered by the following code block. Remember to name the git branch after your DOI:
cd DIR_SFAIRA
git clone https://github.com/theislab/sfaira.git
cd sfaira
git checkout dev
git pull
git checkout -b data/DOI-name
conda create -n sfaira_loader
conda install -n sfaira_loader python=3.8
conda activate sfaira_loader
pip install -e .
- Pd. Download the raw data into a local directory.
You will need to set a path under which the data files can be accessed by sfaira, in the following referred to as <path_data>/<DOI-name>/. Identify the raw data files and copy them into the data folder <path_data>/<DOI-name>/. Note that these should be the exact files that are downloadable from the download URL you provide in the dataloader: do not decompress these files if they are archives such as zip, tar or gz. In some cases, multiple processing forms of the raw data are available, sometimes even on different websites. Follow these rules to disambiguate the data source for the data loader:
- Rule 1: Prefer unprocessed gene expression count data over normalised data.
Often it makes sense to provide author-normalised data in a curated object in addition to count data.
- Rule 2: Prefer dedicated data archives over websites that may be temporary.
Examples of archives include EGA, GEO and zenodo; potentially temporary websites include institute websites and cloud files linked to a person's account.
Note that it may in exceptional cases make sense to collect count data and cell-wise meta data from different locations, or, similarly, to collect normalised and count matrices from different locations. You can supply multiple data URLs below, so collect all relevant files in this phase.
- Pe. Get an overview of the published data.
Data curation is much easier if you have an idea of what the data you are curating looks like before you start. In particular, you will notice a difference in your ability to fully leverage phase 1a if you prepare here. We recommend you load the cell-wise and gene-wise meta data in a python session and explore the type of meta data provided there (see the sketch below). You will receive further guidance throughout the curation process, but we recommend that you try to locate the following meta data items now already, if they are annotated in the data set, and determine whether they are shared across the dataset or specific to a feature or observation, where the latter usually corresponds to a column in .obs or .var of a published .h5ad, or to a corresponding column in a tabular file:
- single-cell assay
- cell type
- developmental stage
- disease state
- ethnicity (only relevant for human samples)
- organ / tissue
- organism
- sex
Note that these are also the key ontology-restricted and required meta data in the cellxgene curation schema. Next, we recommend you briefly consider the available features: Are count matrices, processed matrices or spliced/unspliced RNA published? Which gene identifiers are used (symbols or ENSEMBL IDs)? Which non-RNA modalities are present in the data?
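As a minimal sketch of such an exploration session (the file name and the "cell_type" column are hypothetical):
import anndata

adata = anndata.read_h5ad("published_object.h5ad")  # hypothetical published object
print(adata)                            # shapes and registered slots
print(adata.obs.columns.tolist())       # candidate observation-wise meta data columns
print(adata.var.head())                 # gene identifiers: symbols or ENSEMBL IDs?
print(adata.obs["cell_type"].unique())  # assuming such a column exists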
Phase 1: create¶
This phase creates a skeleton for a data loader: the __init__.py, .py and .yaml files.
Phase 1 is sub-structured into 2 sub-phases:
- 1a: Create template files (sfaira create-dataloader).
- 1b: Completion of created files (manual).
- 1a. Create template files.
When creating a dataloader with sfaira create-dataloader, dataloader-specific attributes such as organ, organism and many more are prompted for. We provide a description of all meta data items at the bottom of this page; note that these metadata underly specific formatting and ontology constraints described below. If the requested information is not available, simply hit enter to skip the entry. Note that some meta data items are always defined per data set, e.g. a DOI, whereas other meta data items may or may not be the same for all cells in a data set. For example, an entire data set may belong to one disease condition or one organ, or may consist of a pool of multiple samples that cover multiple values of the given metadata item. The questionnaire and YAML are set up to guide you through finding the best fit. Note that annotating dataset-wide is preferable where possible, as it results in briefer curation code. The CLI decides on an ID of this dataset within the loader that you are writing; this will be used to label all files associated with the current dataset. The CLI tells you how to continue from here: phase 1b) is always necessary, phase 2) is case-dependent, and mistakes in naming the data folder in phase Pd) are flagged here. As indicated at appropriate places by the CLI, some meta data are ontology constrained. You should input symbols, i.e. readable words and not IDs, in these places. For example, the .yaml entry organ could be "lung", which is a symbol in the UBERON ontology, whereas organ_obs_key could be any string pointing to a column in the .obs of the anndata instance that is output by load(), where the elements of the column are then mapped to UBERON terms in phase 2.
- 1a-docker.
You can run the create-dataloader command directly:
sfaira create-dataloader
- 1a-conda.
In the following command, replace DATA_DIR with the path <path_data>/ you used above. You can optionally supply --path-loader to create-dataloader to change the location of the created data loader to an arbitrary directory other than the internal collection of sfaira in ./sfaira/data/dataloaders/loaders/. Note: Use the default location if you want to commit and push changes from this sfaira clone.
sfaira create-dataloader --path-data DATA_DIR
- 1b. Manual completion of created files.
- Correct the .yaml file.
Correct errors in the <path_loader>/<DOI-name>/ID.yaml file and add further attributes you may have forgotten in step 2. See sec-multiple-files for short-cuts if you have multiple data sets. This step can be skipped if the .yaml is complete after phase 1a). Note on lists and dictionaries in the yaml file format: Sometimes you need to write a list in yaml, e.g. because you have multiple data URLs. A list looks as follows:
# Single URL:
download_url_data: "URL1"
# Two URLs:
download_url_data:
    - "URL1"
    - "URL2"
As suggested in this example, do not use lists of length 1. In contrast, you may need to map a specific sample_fns to a meta data item in multi-file loaders:
sample_fns:
    - "FN1"
    - "FN2"
[...]
assay_sc:
    FN1: 10x 3' v2
    FN2: 10x 3' v3
Take particular care with the usage of quotes and ":" when using maps as outlined in this example.
- Complete the load function.
Complete the load() function in <path_loader>/<DOI-name>/ID.py. If you need to read compressed files directly from python, consider our guide reading-compressed-files. If you need to read R files directly from python, consider our guide reading-r-files. A minimal sketch of a load() function is shown below.
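A minimal single-file load() sketch could look as follows (the file name is a placeholder; adapt the reading function to your raw data format):
import os
import anndata

def load(data_dir, **kwargs) -> anndata.AnnData:
    # "my_file.h5ad" is a placeholder for the raw file downloaded in phase Pd.
    fn = os.path.join(data_dir, "my_file.h5ad")
    adata = anndata.read_h5ad(fn)
    return adata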
Phase 2: annotate¶
This phase creates annotation map files: .tsv. The metadata items that require annotation maps are all non-empty entries that end on *obs_key under dataset_or_observation_wise in the .yaml and that are subject to an ontology (see field-descriptions). One file is created per such metadata ITEM; the corresponding file is <path_loader>/<DOI-name>/<ID>_<ITEM>.tsv. This means that a variable number of such files is created and, depending on the scenario, even no such files may be necessary: phase 2 can be entirely skipped if no annotation maps are necessary, which is indicated by the CLI at the end of phase 1a.
Phase 2 is sub-structured into 2 sub-phases:
- 2a: Create metadata annotation files (sfaira annotate-dataloader).
- 2b: Completion of annotation (manual).
- 2a. Create metadata annotation files (sfaira annotate-dataloader).
This creates <path_loader>/<DOI-name>/ID*.tsv files with meta data map suggestions for each meta data item that requires such maps. Note: You can identify the loader via --doi with the main DOI (i.e. journal > preprint if both are defined) or with the DOI-based data loader name defined by sfaira, i.e. <DOI-name> in <path_loader>/<DOI-name>, which is either d10_* or dno_doi_*.
- 2a-docker.
In the following command, replace DOI with the DOI of your data loader:
sfaira annotate-dataloader --doi DOI
- 2a-conda.
In the following command, replace DATA_DIR with the path <path_data>/ you used above and replace DOI with the DOI of your data loader. You can optionally supply --path-loader to annotate-dataloader if the data loader is not in the internal collection of sfaira in ./sfaira/data/dataloaders/loaders/:
sfaira annotate-dataloader --doi DOI --path-data DATA_DIR
- 2b. Completion of annotation (manual).
Each <path_loader>/<DOI-name>/ID*.tsv file contains two columns with one row for each unique free-text meta data item, e.g. each cell type label. One file is created for each *_obs_key that requires mapping to an ontology, which are: assay_sc_obs_key, cell_line_obs_key, cell_type_obs_key, development_stage_obs_key, disease_obs_key, ethnicity_obs_key, organ_obs_key, organism_obs_key, sex_obs_key. Depending on the number of such *_obs_key items that are set in the .yaml, you will have between 0 and 9 .tsv files.
- "source":
The first column is labeled "source" and contains free-text identifiers.
- "target":
The second column is labeled "target" and contains suggestions for matching symbols from the corresponding ontology.
The suggestions are based on multiple search criteria, mostly on similarity of the free-text token to tokens in the ontology. Suggested tokens are separated by ":" in the target column; for each token, the same number of suggestions is supplied. We use different search strategies on each token and separate the output by strategy with ":||:". You might notice that one strategy works well for a particular ID*.tsv and focus your attention on that group. It is now up to you to manually curate the suggestions in the "target" column of each .tsv file, for example in a text editor. Depending on the ontology and on the accuracy of the free-text annotation, these suggestions may be more or less helpful. The worst case is that you need to go to the search engine of the ontology at hand for each entry to check for matches. The best case is that you know the ontology well enough to choose from the suggestions, assuming that the best match is among the suggestions. Reality lies somewhere between the two; do not be too conservative with looking items up online. We suggest using the ontology search engine on the OLS web-interface for your manual queries. For each meta data item, the corresponding ontology is listed in the detailed meta data description field-descriptions. Make sure to read our notes on cell type curation celltype-annotation. A hypothetical example of a curated file is shown below.
Note 1: If you compare these ID*.tsv to .tsv files from published data loaders, you will notice that published ones contain a third column. This column is automatically added in phase 3 if the second column was correctly filled here.
Note 2: The two columns in the ID*.tsv are separated by a tab separator ("\t"); make sure not to accidentally delete this token. If you accidentally replace it with " ", you will receive errors in phase 3, so do a visual check after finishing your work on each ID*.tsv file.
Note 3: Perfect matches are filled without further suggestions; you can often directly leave these rows as they are after a brief sanity check.
Phase 3: finalize¶
- 3a. Clean and test data loader.
This command will test data loading and will format the metadata maps in the ID*.tsv files from phase 2b). If this command passes without further change requests, the data loader is finished and ready for phase 4. Note: You can identify the loader via --doi with the main DOI (i.e. journal > preprint if both are defined) or with the DOI-based data loader name defined by sfaira, i.e. <DOI-name> in <path_loader>/<DOI-name>, which is either d10_* or dno_doi_*.
- 3a-docker.
In the following command, replace DOI with the DOI of your data loader:
sfaira finalize-dataloader --doi DOI
- 3a-conda.
In the following command, replace DATA_DIR with the path <path_data>/ you used above and replace DOI with the DOI of your data loader. You can optionally supply --path-loader to finalize-dataloader if the data loader is not in the internal collection of sfaira in ./sfaira/data/dataloaders/loaders/. Once this command passes, it will give you a message you can use in phase 4 to document this test on the pull request:
sfaira finalize-dataloader --doi DOI --path-data DATA_DIR
Phase 4: publish¶
You will need to authenticate with GitHub during this phase. You can push the code from within the sfaira docker with a single command, or you can use git directly:
- 4a. Push the data loader to the public sfaira repository.
You will test the loader one last time; this test will not throw errors if you have not introduced changes since phase 3. Note: You can identify the loader via --doi with the main DOI (i.e. journal > preprint if both are defined) or with the DOI-based data loader name defined by sfaira, i.e. <DOI-name> in <path_loader>/<DOI-name>, which is either d10_* or dno_doi_*.
- 4a-docker.
If you are writing a data loader from within the sfaira data curation docker, you can run phase 4 with a single command. In the following commands, replace DOI with the DOI of your data loader:
sfaira test-dataloader --doi DOI
sfaira publish-dataloader
You will be prompted to paste your GitHub token in order to authenticate with GitHub. If you do not have a token, you can leave the field blank and you will be interactively guided through authenticating your GitHub account using your browser. (You will have to manually copy a URL into the browser at some point.) In certain cases you might be prompted to enter your GitHub username and password again during the process. Please note that this requires you to enter your username and GitHub token as before, not the password you use to log into github.com in your browser.
You will also be prompted the following by the CLI: "Where should we push the xxx branch?" You generally want to select the second option here ("Create a fork of theislab/sfaira") unless you are a member of the theislab organisation or have otherwise previously obtained write access to the sfaira repository. In that case you can select the first option ("theislab/sfaira").
- 4a-git.
You can contribute the data loader to public sfaira as code through a pull request. Note that you can also just keep the data loader in your local installation if you do not want to make it public. In the following commands, replace DATA_DIR with the path <path_data>/ you used above and replace DOI with the DOI of your data loader. If you have not modified any aspects of the data loader since phase 3, you can skip sfaira test-dataloader below. In order to create a pull request, you first need to fork the sfaira repository on GitHub. Once forked, you can use the code shown below to submit your new dataloader. Note: the CLI will ask you to copy a data loader testing summary into the pull request at the end of the output generated by finalize-dataloader.
sfaira test-dataloader --doi DOI --path-data DATA_DIR
cd DIR_SFAIRA
cd sfaira
git remote set-url origin https://github.com/<user>/sfaira.git  # Replace <user> with your github username.
git checkout dev
git add *
git commit -m "Completed data loader."
git push
After successfully pushing the new dataloader to your fork, you can go to github.com and create a pull request from your fork to the dev branch of the original sfaira repo. Please include the DOI of your added dataset in the PR title.
Phase 5: export-h5ad¶
Phase 5 and 6 are optional, see also introduction paragraphs on this documentation page.
- 5a. Export .h5ad file(s).
Write the streamlined dataset(s) corresponding to the data loader into .h5ad file(s) according to a specific set of rules (a schema, e.g. "cellxgene"). Note: You can identify the loader via --doi with the main DOI (i.e. journal > preprint if both are defined) or with the DOI-based data loader name defined by sfaira, i.e. <DOI-name> in <path_loader>/<DOI-name>, which is either d10_* or dno_doi_*.
- 5a-docker.
In the following command, replace DOI with the DOI of your data loader and replace SCHEMA with the target data schema. You can find the resulting h5ad file in the sfaira_data directory you specified when starting the container:
sfaira export-h5ad --doi DOI --schema SCHEMA --path-out /root/sfaira_data/
- 5a-conda.
In the following command, replace DATA_DIR with the path <path_data>/ you used above, replace DOI with the DOI of your data loader, replace SCHEMA with the target data schema, and replace OUT_DIR with the directory to which the objects are written. You can optionally supply --path-loader to export-h5ad if the data loader is not in the internal collection of sfaira in ./sfaira/data/dataloaders/loaders/:
sfaira export-h5ad --doi DOI --path-data DATA_DIR --schema SCHEMA --path-out OUT_DIR
Phase 6: validate-h5ad¶
Phase 5 and 6 are optional, see also introduction paragraphs on this documentation page.
- 6a. Validate the format of the .h5ad.
The streamlined .h5ad files from phase 5 are validated according to a specific set of rules (a schema).
- 6a-docker.
In the following command, replace FN with the file name of the .h5ad file to test (just the filename, not the full path), and replace SCHEMA with the target data schema. The h5ad file must be placed in the sfaira_data directory you specified when starting the container:
sfaira validate-h5ad --h5ad /root/sfaira_data/FN --schema SCHEMA
- 6a-conda.
In the following command, replace FN with the full path of the .h5ad file to test, and replace SCHEMA with the target data schema:
sfaira validate-h5ad --h5ad FN --schema SCHEMA
Advanced topics¶
Loading multiple files of similar structure¶
Only one loader has to be written for each set of similarly structured files that belong to one DOI. sample_fns in dataset_structure in the .yaml indicates the presence of these files. The identifiers listed there do not have to be the full file names. They are received by load() as the argument sample_fn and can then be used in custom code in load() to load the correct file. This allows sharing code across these files in load(). If these files share all meta data in the .yaml, you do not have to change anything else here. If some meta data items are file specific, you can further subdefine them under the keys in this .yaml via the identifiers stated here. In the following example, we show how this formalism can be used to identify one file declared as "A" as a healthy lung sample and another file "B" as a healthy pancreas sample.
dataset_structure:
    dataset_index: 1
    sample_fns:
    - "A"
    - "B"
dataset_wise:
    # ... part of yaml omitted ...
dataset_or_observation_wise:
    # ... part of yaml omitted ...
    healthy: True
    healthy_obs_key:
    individual:
    individual_obs_key:
    organ:
        A: "lung"
        B: "pancreas"
    organ_obs_key:
    # ... part of yaml omitted ...
Note that not all meta data items have to be subdefined into "A" and "B", but only the ones with differing values! The corresponding load() function would be:
import os

import anndata

def load(data_dir, sample_fn, fn=None) -> anndata.AnnData:
    # The following reads either my_file_A.h5ad or my_file_B.h5ad,
    # which correspond to A and B in the yaml.
    fn = os.path.join(data_dir, f"my_file_{sample_fn}.h5ad")
    adata = anndata.read(fn)
    return adata
Loaders for meta studies or atlases¶
Meta studies are studies on published gene expression data. Often, multiple previous studies are combined or meta data annotation is changed. Data sets from such meta studies can be added to sfaira just as primary data can be added; we ask for these studies to be identified through the meta data attribute primary_data to allow sfaira users to avoid duplicate cells in data universe partitions. Let's consider an example case: Study A published 2 data sets A1 and A2. Study B published 1 data set B1. Data loaders for A and B can be labeled as primary_data: True. Now, study C publishes 1 data set C1 that consists of A2 and B1. We can write a data loader for C and label it as primary_data: False; see the yaml sketch below. Moreover, when conducting study C, we could even base our analyses directly on the data loaders of A2 and B1 to make the data analysis pipeline more reproducible.
Curating cell type annotation¶
Common challenges in cell type curation include the following:
- A free-text label is used that is not well captured by the automated search.
Often, these are abbreviations or synonyms that can be mapped to the ontology after looking these terms up online or in the manuscript corresponding to the data loader. Indeed, it is good practice to manually verify non-trivial cell type label maps with a quick contextualisation in manuscript figures or text. As for all other ontology-constrained meta data, EBI OLS maintains a great interface to the ontology under CL.
- The free-text labels contain nested annotation.
For example, a low-resolution cluster may be annotated as “T cell” in one data set, while other data sets within the same study have more specific T cell labels. Simply map each of these labels to their best fit ontology name, you do not need to mitigate differential granularity.
- The free-text labels contain cellular phenotypes that map badly to the ontology.
A common example would be "cycling cells". In some tissues, these phenotypes can be related to specific cell types through knowledge of the phenotypes of the cell types that occur in that tissue. If this is not possible or you do not know the tissue well enough, you can leave the cell type as "UNKNOWN" and future curators may improve this annotation. In cases such as "cycling T cell", you may just resort to the parent label "T cell" unless you have reason to believe that "cycling" identifies a specific T cell subset here.
- The free-text labels are more fine-grained than the ontology.
A common example would be the addition of marker gene expression to cell cluster labels that are grouped under the same ontology identifier. Sometimes, these marker genes can be mapped to a child node of the ontology identifier. However, often they indicate cell state variation or other, not fully attributed, variation and do not need to be accounted for in this cell type curation step. These are often among the hardest cell type curation problems; keep in mind that you want to find a reasonable translation of the existing curation. You may be limited by the ontology or by the data reported by the authors, so keep an eye on the overall effort that you spend on optimizing these label maps.
- A new cell type is annotated in free-text but is not available in the ontology yet.
This is most likely only a problem for a limited period of time, during which the ontology maintainers work on adding this element. Choose the best match from the ontology and leave an issue on the sfaira GitHub describing the missing cell type. We can then later update this data loader once the ontology is updated.
Multi-modal data¶
Multi-modal data can be represented in the sfaira curation schema; here we briefly outline which modalities are supported and how they are accounted for. You can use any combination of orthogonal meta data, e.g. organ and disease annotation, with multi-modal measurements.
- RNA:
RNA is the standard modality in sfaira, unless otherwise specified, all information in this document is centered around RNA data.
- ATAC:
We support scATAC-seq and joint scRNA+ATAC-seq (multiome) data. In both cases, the ATAC data is commonly represented as a UMI count matrix of the dimensions
(observations x peaks)
. Here, peaks are defined by a peak calling algorithm as part of the read processing pipeline upstream of sfaira. Peak counts can be deposited in the core data matrices managed in sfaira. The corresponding feature meta data can be set such that they allow differentiation of RNA and peak features. These features are documented dataset-or-feature-wise and feature-wise.
- protein quantification through antibody quantification:
We support CITE-seq and spatial molecular profiling assays with protein quantification read-outs. In these cases, the protein data can be represented as a gene expression matrix of the dimensions (observations x proteins). In the case of oligonucleotide-tagged antibody quantification, e.g. in CITE-seq, this can also be a UMI matrix. The corresponding feature meta data can be set such that they allow differentiation of RNA and protein features. These features are documented dataset-or-feature-wise and feature-wise.
- spatial:
A couple of single-cell and spot-based assays have spatial coordinates associated with molecular profiles. We use relative coordinates of observations in a batch as (x, y, z) tuples to characterize the spatial information. Note that spatial proximity graphs and similar spatial analyses are down-stream analyses on these coordinates. These features are documented feature-wise.
- spliced, unspliced transcript and velocities:
We support gene expression matrices on the level of spliced and unspliced transcripts and the common processed format of an RNA velocity matrix. Note that the velocity matrix depends on the inference procedure. These matrices share .var annotation with the core RNA data matrix and can, therefore, be supplemented as further layers in the AnnData object without further effort (see the sketch after this list). This feature is documented data-matrices.
- V(D)J in TCR and BCR reconstructions:
V(D)J data is collected in parallel to RNA data in a couple of single-cell assays. We use key meta data defined by the AIRR consortium to characterize the reconstructed V(D)J genes, which are all direct outputs of V(D)J alignment pipelines and are stored in .obs. These features are documented feature-wise.
Reading compressed files¶
This is a collection of code snippets that can be used in the load() function to read compressed download files. See also the anndata and scanpy IO documentation.
- Read a .gz-compressed .mtx (.mtx.gz):
Note that this often occurs in cellranger output, for which there is a scanpy load function that applies to data of the following structure: ./PREFIX_matrix.mtx.gz, ./PREFIX_barcodes.tsv.gz, and ./PREFIX_features.tsv.gz. This can be read as:
import scanpy
adata = scanpy.read_10x_mtx("./", prefix="PREFIX_")
- Read from within a .gz archive (.gz):
Note: this requires temporary files, so avoid if read_function can read directly from .gz.
import gzip
import shutil
from tempfile import TemporaryDirectory

# Insert the file type as a string here so that read_function recognizes the decompressed file:
uncompressed_file_type = ""
with TemporaryDirectory() as tmpdir:
    tmppth = tmpdir + f"/decompressed.{uncompressed_file_type}"
    with gzip.open(fn, "rb") as input_f, open(tmppth, "wb") as output_f:
        shutil.copyfileobj(input_f, output_f)
    x = read_function(tmppth)
- Read from within a .tar archive (.tar.gz):
It is often useful to decompress the tar archive once manually to understand its internal directory structure. Let's assume you are interested in a file fn_target within a tar archive fn_tar, i.e. after decompressing the tar, the directory is <fn_tar>/<fn_target>.
import pandas
import tarfile

with tarfile.open(fn_tar) as tar:
    # Access files in the archive with tar.extractfile(fn_target), e.g.:
    tab = pandas.read_csv(tar.extractfile(fn_target))
Reading R files¶
Some studies deposit single-cell data in R language files, e.g. .rdata, .Rds or Seurat objects. These objects can be read with python functions in sfaira using anndata2ri and rpy2. These modules allow you to run R code from within python code:
import os

def load(data_dir, **kwargs):
    import anndata2ri
    from rpy2.robjects import r
    anndata2ri.activate()
    fn = os.path.join(data_dir, "SOME_FILE.rdata")
    seurat_object_name = "tissue"
    adata = r(
        f"library(Seurat)\n"
        f"load('{fn}')\n"
        f"new_obj = CreateSeuratObject(counts = {seurat_object_name}@raw.data)\n"
        f"new_obj@meta.data = {seurat_object_name}@meta.data\n"
        f"as.SingleCellExperiment(new_obj)\n"
    )
    return adata
Loading third party annotation¶
In some cases, the data set in question is already in the sfaira zoo but there is alternative (third party), cell-wise annotation of the data. This could be different cell type annotation, for example. The underlying data (count matrix and variable names) stay the same in these cases, and often even some cell-wise meta data are kept and only some are added or replaced. Therefore, these cases do not require an additional load() function. Instead, you can contribute load_annotation_*() functions into the .py file of the corresponding study. You can choose an arbitrary suffix for the function, but ideally one that identifies the source of this additional annotation in a human-readable manner, at least to someone who is familiar with this data set. Second, you need to add this function into the dictionary LOAD_ANNOTATION in the .py file, with the suffix as a key. If this dictionary does not exist yet, you need to add it into the .py file with this function as its sole entry. Here is an example of a .py file with additional annotation:
def load(data_dir, sample_fn, **kwargs):
    pass

def load_annotation_meta_study_x(data_dir, sample_fn, **kwargs):
    # Read a tabular file indexed with the observation names used in the adata used in load().
    pass

def load_annotation_meta_study_y(data_dir, sample_fn, **kwargs):
    # Read a tabular file indexed with the observation names used in the adata used in load().
    pass

LOAD_ANNOTATION = {
    "meta_study_x": load_annotation_meta_study_x,
    "meta_study_y": load_annotation_meta_study_y,
}
The table returned by load_annotation_meta_study_x needs to be indexed with the observation names used in .adata, the object generated in load(). If load_annotation_meta_study_x contains a subset of the observations defined in load(), and this alternative annotation is chosen, .adata is subsetted to these observations during loading.
You can also add functions in the .py file in the same DOI-based module in sfaira_extensions if you want to keep this additional annotation private. For this to work with a public data loader, you need nothing more than the .py file with this load_annotation_*() function and the LOAD_ANNOTATION dictionary of these private functions in sfaira_extensions.
To access additional annotation during loading, use the setter function additional_annotation_key on an instance of either Dataset, DatasetGroup or DatasetSuperGroup to define the data sets for which you want to load additional annotation and which additional annotation you want to load for them. See also the docstrings of these functions for further details on how these can be set. A sketch is shown below.
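A hedged sketch of this mechanism (dataset is assumed to be a Dataset instance for the study above; the key is the suffix from LOAD_ANNOTATION):
# Sketch: select the alternative annotation before loading; exact attribute
# access may differ, see the docstrings referenced above.
dataset.additional_annotation_key = "meta_study_x"
dataset.load()  # .adata now carries the annotation from load_annotation_meta_study_x()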
Required metadata¶
The CLI will flag any required meta data that is missing. Note that you can use the CLI under a specific schema, e.g. the more lenient sfaira schema (default) or the stricter cellxgene schema, by giving the argument --schema cellxgene to finalize-dataloader or test-dataloader. Moreover, .h5ad files from phase 5 can be checked for a match to a particular schema in phase 6. In brief, the following meta data are required (a minimal yaml skeleton is sketched below):
- dataset_structure:
    - dataset_index
    - sample_fns is required in multi-dataset loaders to define the number and identity of datasets.
- dataset_wise:
    - author
    - one DOI (i.e. either doi_journal or doi_preprint)
    - download_url_data
    - primary_data
    - year
- layers:
    - layer_counts or layer_processed
- dataset_or_feature_wise:
    - feature_type or feature_type_var_key
- dataset_or_observation_wise:
    - Either the dataset-wide item or the corresponding _obs_key is required to submit a data loader to sfaira:
        - assay_sc
        - organism
    - The following are encouraged in sfaira and required in the cellxgene schema:
        - assay_sc
        - cell_type
        - developmental_stage
        - disease
        - ethnicity
        - organ
        - organism
        - sex
- observation_wise:
    - None is required.
- feature_wise:
    - feature_id_var_key or feature_symbol_var_key
- meta:
    - version
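A hedged, minimal .yaml skeleton assembled from the required fields above (all values are placeholders; the template generated by the CLI contains many more optional fields):
dataset_structure:
    dataset_index: 1
    sample_fns:
dataset_wise:
    author: "Doe"
    doi_journal: "10.1000/j.journal.2021.01.001"
    download_url_data: "https://example.org/counts.h5ad"
    primary_data: True
    year: 2021
layers:
    layer_counts: "X"
dataset_or_feature_wise:
    feature_type: "rna"
dataset_or_observation_wise:
    assay_sc: "10x 3' v3"
    organism: "Homo sapiens"
feature_wise:
    feature_id_var_key: "ensembl_id"
meta:
    version: "1.0"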
Field descriptions¶
We constrain meta data by ontologies where possible. Meta data can either be dataset-wise, observation-wise or feature-wise.
Dataset structure¶
Dataset structure meta data are in the section dataset_structure in the .yaml file.
- dataset_index [int]
Numeric identifier of the first loader defined by this python file. Only relevant if multiple python files for one DOI generate loaders of the same name. In these cases, this numeric index can be used to distinguish them.
- sample_fns [list of strings]
If there are multiple data files which can be covered by one load() function and .yaml file because they are structured similarly, these can be identified here. See also the section Loading multiple files of similar structure. You can simply hardcode a file name in the load() function and skip defining it here if you are writing a single-file loader. Note: A sample is an object similar to a count matrix or a .h5ad; the definition of biological or technical batches or samples within one count matrix does not affect this entry.
Dataset-wise¶
Dataset-wise meta data are in the section dataset_wise in the .yaml file.
- author [list of strings]
List of author names of dataset (not of loader).
- doi [list of strings]
DOIs associated with dataset. These can be preprints and journal publication DOIs.
- download_url_data [list of strings]
Download links for data. Full URLs of all data files such as count matrices. Note that distinct observation-wise annotation files can be supplied in download_url_meta.
- download_url_meta [list of strings]
Download links for observation-wise meta data. Full URLs of all observation-wise meta data files. This attribute is optional and not necessary if observation-wise meta data are already in the files defined in download_url_data, as is often the case for .h5ad.
- primary_data: Whether this is the first publication to report this gene expression data {True, False}.
This is False if the study is a meta study that uses data that was previously published. This usually implies that one can also write a data loader for the data from the primary study. Usually, the data here contains new meta data or is combined with other data sets (e.g. in an "atlas"); therefore, this data loader is different from a data loader for the primary data. In sfaira, we maintain data loaders both for the corresponding primary and such meta data publications. See also the section on meta studies meta-studies.
- year: Year in which the sample was first described [integer]
Pre-print publication year.
Data matrices¶
A curated AnnData object may contain multiple data matrices: raw and processed gene expression counts, or spliced and unspliced count data and velocity estimates, for example. Minimally, you need to supply either of the matrices "counts" or "processed". In the following, "*counts" refers to the INTEGER count of alignment events (e.g. transcripts for RNA), and "*processed" refers to any processing that modifies these counts, for example normalization, batch correction or ambient RNA correction.
- layer_counts: The total event counts per feature, e.g. UMIs that align to a gene. {‘X’, ‘raw’, or a .layers key}
- layer_processed: Processed complement of ‘layer_counts’. {‘X’, ‘raw’, or a .layers key}
- layer_spliced_counts: The total spliced RNA counts per gene. {a .layers key}
- layer_spliced_processed: Processed complement of ‘layer_spliced_counts’. {a .layers key}
- layer_unspliced_counts: The total unspliced RNA counts per gene. {a .layers key}
- layer_unspliced_processed: Processed complement of ‘layer_unspliced_counts’. {a .layers key}
- layer_velocity: The RNA velocity estimates per gene. {a .layers key}
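The sketch referenced above: a hedged load() that returns integer counts in .X and a normalised complement in a layer; the reading code is a placeholder and the layer key is an arbitrary example:

import anndata
import numpy as np
import scipy.sparse

def load(data_dir, **kwargs):
    # Placeholder for real file reading: integer alignment-event counts.
    counts = scipy.sparse.csr_matrix(np.random.poisson(1.0, size=(100, 2000)))
    adata = anndata.AnnData(X=counts)
    # A processed complement of the count layer, here a simple depth normalisation.
    size_factors = np.asarray(counts.sum(axis=1)).clip(min=1.0)
    adata.layers["normalised"] = counts.multiply(1.0 / size_factors).tocsr()
    # The .yaml would then contain:
    #     layer_counts: X
    #     layer_processed: normalised
    return adata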
Dataset- or feature-wise¶
These meta data may be defined across the entire dataset or per feature and are in the section dataset_or_feature_wise in the .yaml file. They can all be supplied as NAME or as NAME_var_key: the former indicates that the entire data set has the value stated in the yaml; the latter, NAME_var_key, indicates that there is a column in adata.var emitted by the load() function of the name NAME_var_key which contains the annotation per feature for this meta data item. Note that in both cases the value, or the column values, have to fulfill the constraints imposed on the meta data item as outlined below.
- feature_reference and feature_reference_var_key [string]
The genome annotation release that was used to quantify the features presented here, e.g. “Homo_sapiens.GRCh38.105”. You can find all ENSEMBL gtf files on the ensembl ftp server. There, you'll find a summary of the gtf files by release, e.g. for 105, and a list across organisms for this release. The target reference name is the name of the gtf file that ends in .RELEASE.gtf.gz under the corresponding organism. For homo_sapiens and release 105, this yields the reference name “Homo_sapiens.GRCh38.105”.
- feature_type and feature_type_var_key {“rna”, “protein”, “peak”}
The type of a feature:
- “rna”: gene expression quantification on the level of RNA, e.g. from scRNA-seq or spatial RNA capture experiments
- “protein”: gene expression quantification on the level of proteins, e.g. via antibody counts in CITE-seq or spatial protocols
- “peak”: chromatin accessibility by peak, e.g. from scATAC-seq
Dataset- or observation-wise¶
These meta data may be defined across the entire dataset or per observation and are in the section dataset_or_observation_wise in the .yaml file. They can all be supplied as NAME or as NAME_obs_key: the former indicates that the entire data set has the value stated in the yaml; the latter, NAME_obs_key, indicates that there is a column in adata.obs emitted by the load() function of the name NAME_obs_key which contains the annotation per observation for this meta data item. Note that in both cases the value, or the column values, have to fulfill the constraints imposed on the meta data item as outlined below.
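For concreteness, a minimal, hedged sketch of both options, using organ as the meta data item; the reading code, values and column name are placeholders:

import anndata
import numpy as np
import scipy.sparse

def load(data_dir, **kwargs):
    # Placeholder for the real reading code.
    adata = anndata.AnnData(X=scipy.sparse.csr_matrix(np.ones((4, 10))))
    # Option 1 (dataset-wise): state the value in the .yaml, e.g. `organ: "lung"`,
    # and emit no column at all.
    # Option 2 (observation-wise): emit a column and point the .yaml at it via
    # `organ_obs_key: organ_free_annotation`.
    adata.obs["organ_free_annotation"] = ["lung", "lung", "blood", "blood"]
    return adata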
- assay_sc and assay_sc_obs_key [ontology term]
The EFO label corresponding to the single-cell assay of this sample. The relevant subset of EFO is the set of child nodes of “single cell library construction” (EFO:0010183).
- assay_differentiation and assay_differentiation_obs_key [string]
Try to provide a base differentiation protocol (e.g. “Lancaster, 2014”) as well as any amendments to the original protocol.
- assay_type_differentiation and assay_type_differentiation_obs_key {“guided”, “unguided”}
For cell-culture samples: Whether a guided (patterned) differentiation protocol was used in the experiment.
- bio_sample and bio_sample_obs_key [string]
Column name in adata.obs emitted by the load() function which reflects biologically distinct samples, either different in condition or biological replicates, as a categorical variable. The values of this column are not constrained and can be arbitrary identifiers of observation groups. You can build more fine-grained observation groupings by concatenating multiple column keys with * in this string, e.g. patient*treatment to get one bio_sample for each combination of patient and treatment. Note that the notion of a biologically distinct sample is somewhat subjective; we include this element so that researchers can, for example, distinguish technical and biological replicates within one study. See also the meta data items individual and tech_sample.
- cell_line and cell_line_obs_key [ontology term]
Cell line name from the cellosaurus cell line database.
- cell_type and cell_type_obs_key [ontology term]
Cell type name from the Cell Ontology (CL) database. Note that sometimes, original (free-text) cell type annotation is provided at different granularities. We recommend choosing the most fine-grained annotation here so that future re-annotation of the cell types in this loader is easier. You may trade the potential for re-annotation of the data loader against the size of the mapping .tsv file that is generated during annotation: this file has one row per free-text label and may be undesirably large in some cases, which reduces the accessibility of the data loader code for future curators. See also the section on cell type annotation.
- disease and disease_obs_key [ontology term]
Choose from MONDO.
- ethnicity and ethnicity_obs_key [ontology term]
Choose from HANCESTRO.
- gm and gm_obs_key [string]
Genetic modification. E.g. identify gene knock-outs or over-expression as a boolean indicator per cell or as guide RNA counts in approaches like CROP-seq or PERTURB-seq.
- individual and individual_obs_key [string]
Column name in adata.obs emitted by the load() function which reflects the individual sampled as a categorical variable. The values of this column are not constrained and can be arbitrary identifiers of observation groups. You can build more fine-grained observation groupings by concatenating multiple column keys with * in this string, e.g. group1*group2 to get one individual for each group1 and group2 entry. Note that the notion of an individual is somewhat ill-defined in some cases; we include this element so that researchers can distinguish sample groups that originate from biological material with distinct genotypes. See also the meta data items bio_sample and tech_sample.
- organ and organ_obs_key [ontology term]
The UBERON label of the sample; this meta data item takes tissue or organ identifiers from UBERON.
- organism and organism_obs_key [ontology term]
The NCBItaxon label of the main organism sampled here, e.g. “Homo sapiens” or “Mus musculus”. For a data matrix of an infection sample aligned against a joint human and virus reference genome, this would be “Homo sapiens”, as it is the “main organism” in this case. See also the documentation of feature_reference to see which organisms are supported.
- primary_data [bool]
Whether the dataset contains cells that were measured in this study (i.e. this is not a meta study on published data).
- sample_source and sample_source_obs_key {“primary_tissue”, “2d_culture”, “3d_culture”, “tumor”}
Which cellular system the sample was derived from.
- sex and sex_obs_key [ontology term]
Sex of the individual sampled. The PATO label corresponding to the sex of the sample; the relevant subset of PATO is the set of child nodes of “phenotypic sex” (PATO:0001894).
- source_doi and source_doi_obs_key [string]
If this dataset is not primary data, you can supply the source of the analyzed data as a DOI per dataset or per cell in this meta data item. The value of this meta data item (or the entries in the corresponding .obs column) needs to be a DOI.
- state_exact and state_exact_obs_key [string]
Free text description of the condition. If you give treatment concentrations, intervals or similar measurements, use square brackets around the quantity and include units, e.g. [1g].
- tech_sample and tech_sample_obs_key [string]
Column name in adata.obs emitted by the load() function which reflects technically distinct samples, either different in condition or technical replicates, as a categorical variable. Any data batch is a tech_sample. The values of this column are not constrained and can be arbitrary identifiers of observation groups. You can build more fine-grained observation groupings by concatenating multiple column keys with * in this string, e.g. patient*treatment*protocol to get one tech_sample for each combination of patient, treatment and measurement protocol. See also the meta data items bio_sample and individual.
- treatment and treatment_obs_key [string]
Treatment of sample, e.g. compound names in stimulation experiments.
Feature-wise¶
These meta data are always defined per feature and are in the section feature_wise in the .yaml file:
- feature_id_var_key [string]
Name of the column in adata.var emitted by the load() function which contains ENSEMBL gene IDs. This can also be “index” if the ENSEMBL gene IDs are in the index of the adata.var data frame. Note that you do not have to map IDs to a specific annotation release but can keep them in their original form. If available, IDs are preferred over symbols.
- feature_symbol_var_key [string]
Name of the column in adata.var emitted by the load() function which contains gene symbols: HGNC for human and MGI for mouse. This can also be “index” if the gene symbols are in the index of the adata.var data frame. Note that you do not have to map symbols to a specific annotation release but can keep them in their original form. A combined sketch of both columns follows below.
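The combined sketch referenced above; the IDs, symbols and column names are arbitrary examples, and the yaml keys in the comments follow the descriptions above:

import anndata
import numpy as np
import pandas as pd
import scipy.sparse

def load(data_dir, **kwargs):
    var = pd.DataFrame({
        # .yaml: feature_id_var_key: ensembl_id
        "ensembl_id": ["ENSG00000139618", "ENSG00000141510"],
        # .yaml: feature_symbol_var_key: gene_symbol
        "gene_symbol": ["BRCA2", "TP53"],
    })
    adata = anndata.AnnData(
        X=scipy.sparse.csr_matrix(np.ones((3, 2))),  # placeholder count matrix
        var=var,
    )
    return adata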
Observation-wise¶
These meta data are always defined per observation and are in the section observation_wise in the .yaml file:
The following items are only relevant for spatially resolved data, e.g. spot transcriptomics or MERFISH:
- spatial_x_coord, spatial_y_coord, spatial_z_coord [string]
Spatial coordinates (numeric) of observations. Most commonly, the centre of a segment or of a spot is indicated here. For 2D data, a z-coordinate is not relevant and can be skipped.
The following items are only relevant for V(D)J reconstruction data, e.g. TCR or BCR sequencing in single cells. These meta data items are described in the AIRR project; search this link for the element in question without the prefixed “vdj_”. These 10 meta data items describe chains (or loci). In accordance with the corresponding scirpy defaults, we allow for up to two loci per cell. In T cells, this corresponds to two VJ loci (TRA) and two VDJ loci (TRB). You can set the prefix of the column of each of the four loci below. In total, these 10+4 meta data queries in sfaira describe 4*10 columns in .obs after .load(). Note that for this to work, you need to stick to the naming convention PREFIX_SUFFIX. We recommend that you use scirpy.io functions for reading the V(D)J data in your load() so that the default meta data keys suggested by the CLI are used and this naming convention is guaranteed to be obeyed (see the sketch after the list below).
- vdj_vj_1_obs_key_prefix
Prefix of key of columns corresponding to first VJ gene.
- vdj_vj_2_obs_key_prefix
Prefix of key of columns corresponding to second VJ gene.
- vdj_vdj_1_obs_key_prefix
Prefix of key of columns corresponding to first VDJ gene.
- vdj_vdj_2_obs_key_prefix
Prefix of key of columns corresponding to second VDJ gene.
- vdj_c_call_obs_key_suffix
Suffix of key of columns corresponding to C gene.
- vdj_consensus_count_obs_key_suffix
Suffix of key of columns corresponding to number of reads contributing to consensus.
- vdj_d_call_obs_key_suffix
Suffix of key of columns corresponding to D gene.
- vdj_duplicate_count_obs_key_suffix
Suffix of key of columns corresponding to number of duplicate UMIs.
- vdj_j_call_obs_key_suffix
Suffix of key of columns corresponding to J gene.
- vdj_junction_obs_key_suffix
Suffix of key of columns corresponding to junction nt sequence.
- vdj_junction_aa_obs_key_suffix
Suffix of key of columns corresponding to junction aa sequence.
- vdj_locus_obs_key_suffix
Suffix of key of columns corresponding to gene locus, i.e. IGH, IGK or IGL for BCR data and TRA, TRB, TRD or TRG for TCR data.
- vdj_productive_obs_key_suffix
Suffix of key of columns corresponding to locus productivity: whether the V(D)J gene is productive.
- vdj_v_call_obs_key_suffix
Suffix of key of columns corresponding to V gene.
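The sketch referenced above: reading 10x V(D)J output with scirpy inside load(). The path handling is a placeholder, and the exact .obs columns written by scirpy (and hence the prefixes and suffixes to enter in the .yaml) depend on the scirpy version, so verify them on the returned object:

import scirpy as ir

def load(data_dir, sample_fn, **kwargs):
    # scirpy's readers return an AnnData whose .obs encodes the V(D)J calls,
    # which keeps the PREFIX_SUFFIX column convention consistent with sfaira.
    adata = ir.io.read_10x_vdj(f"{data_dir}/{sample_fn}")  # placeholder path
    return adata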
Meta¶
These meta data contain information about the curation process and schema:
- version: [string]
Version identifier of meta data scheme.
Using data loaders¶
For a high-level overview of data management in sfaira, read The data life cycle first.
Build data repository locally¶
Build a repository structure¶
Choose a directory to dedicate to the data base, called root in the following.
Run the sfaira download script (sfaira.data.utils.download_all). Alternatively, you can manually set up a data base by making subfolders for each study.
Note that the automated download is a feature of sfaira but not the core purpose of the package: sfaira allows you to efficiently interact with such a local data repository. Some data sets cannot be downloaded automatically and need manual intervention, which we report in the download script output.
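One way to build such a repository programmatically is via the dataset universe. The following is a hedged sketch: the Universe class path, constructor arguments and method names are assumptions based on the API overview and should be checked against your sfaira version.

import sfaira

# Assumed entry point: the dataset universe over all data loaders.
universe = sfaira.data.Universe(
    data_path="/path/to/root",    # raw downloads end up here
    meta_path="/path/to/meta",    # dataset meta data cache
    cache_path="/path/to/cache",  # cached .h5ad files
)
universe.subset(key="organism", values=["Homo sapiens"])  # optional subsetting (assumed signature)
universe.download()  # fetches everything that can be downloaded automatically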
Use 3rd party repositories¶
Some organizations provide streamlined data objects that can be directly consumed by data zoos such as sfaira. One example of such an organization is the cellxgene data portal. Through these repositories, one can easily build or extend a collection of data sets that can be interfaced with sfaira. Data loaders for cellxgene-structured data objects will be available soon! Contact us for support of any other repositories.
Data stores¶
For a high-level overview of data management in sfaira, read The data life cycle first.
Sfaira supports usage of distributed data for model training and execution.
The tools are summarized under sfaira.data.store.
In contrast to working with an AnnData instance in memory, these tools allow data sets that are saved in different files (because they come from different studies) to be used flexibly and out-of-core, which means without loading them into memory. A general use case is the training of a model on a large set of data sets, subsetted by particular cell-wise meta data, without first creating a merged AnnData instance in memory.
Build a distributed data repository¶
You can use the sfaira dataset API to write streamlined groups of adata instances to a particular disk location, which then is the store directory. Some of the array backends used for loading stores, such as dask, can read arrays from cloud servers. Therefore, these store directories can in some cases also be on cloud servers.
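As a hedged sketch of writing such a store, assuming the dataset collections expose a write_distributed_store method (the method name and its arguments are assumptions):

import sfaira

universe = sfaira.data.Universe(data_path="/path/to/root")
universe.load()  # load the (possibly subsetted) collection once
universe.write_distributed_store(
    dir_cache="/path/to/store",  # becomes the store directory
    store_format="dao",          # assumed flag selecting the DAO format described below
)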
Reading from a distributed data repository¶
The core use case is the consumption of data in batches from a python iterator (a “generator”). In contrast to using the full data matrix, this allows for workflows that never require the full data matrix in memory. These generators can, for example, be used directly in tensorflow or pytorch stochastic mini-batch learning pipelines. The core interface is sfaira.data.load_store(), which can be used to initialise a store instance that exposes a generator, for example. An important concept in store reading is that the data sets are already streamlined on disk, which means, for example, that they share the same feature space.
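A hedged sketch of this workflow; sfaira.data.load_store is named above, while the store_format, subset and generator arguments are assumptions about the store API:

import sfaira

store = sfaira.data.load_store(
    cache_path="/path/to/store",  # directory holding the streamlined data sets
    store_format="dao",           # assumed format selector
)
store.subset(attr_key="organism", values=["Homo sapiens"])  # assumed subsetting API
batches = store.generator(batch_size=128)  # assumed generator factory
# The resulting iterator can be plugged into tensorflow or pytorch input pipelines.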
Distributed access optimised (DAO) store¶
The DAO store format is an on-disk representation of single-cell data which is optimised for generator-based and distributed access. In brief, DAO stores optimise memory consumption and data batch access speed. Right now, we are using zarr and parquet; this may change in the future, and we will continue to work on this format under the project name “dao”. Note that data sets represented as DAO on disk can still be read into AnnData instances in memory if you wish!
Models¶
User interface¶
The user interface allows users to query model code and parameter estimates to run on local data. It takes care of downloading model parameters from the relevant cloud storage, loading parameters into a model instance locally and performing the forward pass. With the user interface, users only have to worry about which model they want to execute, not how this is facilitated.
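A hedged sketch of such a model query; the UserInterface entry point, the zoo_embedding attribute and the method names below are assumptions intended to convey the flow (pick a model, load parameters locally, run the forward pass) rather than the exact API:

import anndata
import sfaira

adata = anndata.read_h5ad("/path/to/local_data.h5ad")  # your locally held data set
ui = sfaira.ui.UserInterface(sfaira_repo=True)  # assumed entry point
ui.zoo_embedding.model_id = "..."  # placeholder ID of a pre-trained embedding model
ui.load_data(adata)                # register the local data (assumed method)
ui.load_model_embedding()          # downloads and instantiates parameters locally
ui.predict_embedding()             # forward pass on the local data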
Model management¶
A sfaira model is a class that inherits from BasicModel which defines a tf.keras.models.Model in self.training_model. This training_model describes the full forward pass. Additionally, embedding models also have an attribute X, a tf.keras.models.Model that describes the partial forward pass into the embedding layer.
Such a model class, e.g. ModelX, is wrapped by an inheriting class ModelXVersioned, which handles properties of the model architecture. In particular, ModelXVersioned:
- has access to the cell ontology container (a daughter class of CelltypeVersionsBase) that corresponds to this model, if applicable,
- has access to a map of version IDs to architectural hyperparameter settings (Topologies), allowing this class to set depth, width, etc. of the model directly based on the name of the yielded model,
- has access to the feature space of the model, including its gene names, which are defined by the model topology in Topologies.
Contribute models¶
Models can be contributed and used in two ways:
- full model code in the sfaira repo
- sfaira-compatible model code in an external package (to come)
Training¶
Estimator classes¶
We define estimator classes that hold model instances as an attribute and orchestrate all major aspects of model fitting, such as data loading, data streaming and model evaluation.
Ecosystem¶
scanpy¶
scanpy provides an environment of tools that can be used to analyse single-cell data in python. sfaira allows users to easily query third-party data sets and models to complement these analysis workflows.
Data zoo¶
Data providers which streamline data¶
Some organizations provide streamlined data objects that can be directly consumed by data zoos such as sfaira. Examples of such data providers are:
- the cellxgene data portal
Through these repositories, one can easily build or extend a collection of data sets that can be interfaced with sfaira. Data loaders for cellxgene-structured data objects will be available soon, and we are working on interfacing more such organisations! Contact us for support of any other repositories.
Study-centric data set servers¶
Many authors provide their data sets on dedicated servers:
cloud storage servers
manuscript supplements
Our data zoo interface is able to represent these data sets such that they can be queried in a streamlined fashion, together with many other data sets.
Single-cell study look-up tables¶
Svensson et al. published a single-cell database in the form of a table in which each row contains a description of a study which published single-cell RNA-seq data. Some of these data sets are already included in sfaira, consider also our interactive website for a graphical user interface to our complete data zoo. Note that this website can be used as a look-up table but sfaira also allows you to directly load and interact with these data sets.
Roadmap¶
Cell ontologies¶
We are currently migrating our ontology to use the Cell Ontology as a backbone. For details, read through this milestone.
Interface online data repositories¶
We are preparing to interface online data repositories which provide streamlined data. This allows users to build local data set collections more easily because these providers usually have a clear download interface; consider the cellxgene data portal, for example. We aim to represent both these data set portals and data sets that have not been streamlined in such a fashion, to provide a comprehensive collection of as many data sets as possible.
FAQ¶
Data zoo¶
How is the load() function used in data loading?¶
load() contains all processing steps that load raw data files into a ready-to-use adata object. This adata object can be cached as an h5ad file named after the dataset ID for faster reloading (if allow_caching=True), in which case the code in load() is skipped on reload. load() can be triggered to reload from scratch even if cached data is available (if use_cached=False).
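A hedged usage sketch of the two flags named above; how the Dataset instance is obtained (here via an assumed universe accessor) is illustrative:

import sfaira

universe = sfaira.data.Universe(data_path="/path/to/root")
dataset = list(universe.datasets.values())[0]  # assumed accessor to a single Dataset
dataset.load(allow_caching=True)  # runs load() once and caches the result as .h5ad
dataset.load(use_cached=False)    # reloads from the raw files despite the cache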
How is the feature space (gene names) manipulated during data loading?¶
Sfaira provides both gene names and ENSEMBL IDs. Missing IDs will automatically be inferred from the gene names and vice versa. Version tags on ENSEMBL gene IDs will be removed if specified (if remove_gene_version=True); in this case, counts are aggregated across these features. Sfaira makes sure that gene IDs in a dataset match IDs of chosen reference genomes.
Datasets, DatasetGroups, DatasetSuperGroups - what are they?¶
Dataset: Custom class that loads a specific dataset.
DatasetGroup: A dataset group manages a collection of data loaders (multiple instances of Dataset). This is useful to group, for example, all data loaders corresponding to a certain study or a certain tissue.
DatasetSuperGroup: A group of DatasetGroups that allows easy addition of multiple instances of DatasetGroup.
Basics of sfaira lazy loading via split into constructor and load() function.¶
The constructor of a dataset defines all metadata associated with this data set.
The loading of the actual data happens in the load()
function and not in the constructor.
This is useful as it allows initialising the datasets and accessing dataset metadata
without loading the actual count data.
DatasetGroups can contain initialised Datasets and can be sub-setted based on metadata
before loading is triggered across the entire group.
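A hedged sketch of this pattern, with the constructor and subset arguments assumed as in the earlier examples:

import sfaira

# Constructing the universe only instantiates Datasets and their meta data;
# no count data is read at this point.
universe = sfaira.data.Universe(data_path="/path/to/root")
# Subsetting on meta data while everything is still lazy:
universe.subset(key="organ", values=["lung"])
# Only now is count data read, and only for the remaining datasets:
universe.load()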
Changelog¶
This project adheres to Semantic Versioning.
0.2.1 (2020-09-07)¶
Added
A commandline interface with Click, Rich and Questionary
upgrade command, which checks whether the latest version of sfaira is installed on every sfaira startup and upgrades it if not.
create-dataloader command which allows for the interactive creation of a sfaira dataloader script
clean-dataloader command which cleans a dataloader script created with sfaira create-dataloader
lint-dataloader command which runs static checks on the style and completeness of a dataloader script
test-dataloader command which runs a unittest on a provided dataloader
Fixed
Dependencies
Deprecated