sfaira.data.DatasetGroup

class sfaira.data.DatasetGroup(datasets: dict, collection_id: str = 'default')

Container class that co-manages multiple data sets. It wraps the methods of the contained Dataset() instances, so they do not need to be called on each data set directly.

Example:

#query loaders lung
#from sfaira.dev.data.loaders.lung import DatasetGroupLung as DatasetGroup
#dsg_humanlung = DatasetGroup(path='path/to/data')
#dsg_humanlung.load_all(match_to_reference='Homo_sapiens_GRCh38_97')
#dsg_humanlung[some_id]
#dsg_humanlung.adata
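The container idea can be illustrated with a minimal, self-contained sketch. It does not use sfaira: `ToyDataset` and `ToyDatasetGroup` are hypothetical stand-ins, the DOIs are placeholders, and the way annotations are gathered here is only an assumption about what "propagating" an annotation from contained data sets could look like.

```python
from typing import Dict, List


class ToyDataset:
    """Hypothetical stand-in for a single data set (not sfaira's Dataset)."""

    def __init__(self, dataset_id: str, doi: str, n_cells: int):
        self.id = dataset_id
        self.doi = doi
        self.n_cells = n_cells
        self.loaded = False

    def load(self) -> None:
        # A real loader would read data from disk here.
        self.loaded = True


class ToyDatasetGroup:
    """Hypothetical container that forwards calls to every contained data set."""

    def __init__(self, datasets: Dict[str, ToyDataset], collection_id: str = "default"):
        self.datasets = datasets
        self.collection_id = collection_id

    def __getitem__(self, dataset_id: str) -> ToyDataset:
        # Index the group by data set id.
        return self.datasets[dataset_id]

    @property
    def ids(self) -> List[str]:
        return list(self.datasets.keys())

    @property
    def doi(self) -> List[str]:
        # Collect the unique DOI annotations of the contained data sets.
        return sorted({ds.doi for ds in self.datasets.values()})

    def load(self) -> None:
        # One call on the group replaces one call per data set.
        for ds in self.datasets.values():
            ds.load()

    def ncells(self) -> int:
        return sum(ds.n_cells for ds in self.datasets.values())


# Placeholder ids and DOIs, for illustration only.
group = ToyDatasetGroup({
    "lung_a": ToyDataset("lung_a", doi="10.0000/example.a", n_cells=100),
    "lung_b": ToyDataset("lung_b", doi="10.0000/example.b", n_cells=250),
})
group.load()
print(group.ids, group.ncells(), group.doi)
```

The group keeps its data sets in a dict keyed by id, which is what makes both indexing by id and group-wide calls straightforward.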

Attributes

`adata`
`adata_ls`
`additional_annotation_key`
`collection_id`
`doi`	Propagates DOI annotation from contained datasets.
`ids`
`ontology_celltypes`	May be replaced by ontology_container_sfaira in the future.
`ontology_container_sfaira`
`supplier`	Propagates supplier annotation from contained datasets.
`datasets`

Methods

`collapse_counts`()	Collapse count matrix along duplicated index.
`download`(**kwargs)
`load`([annotated_only, load_raw, ...])	Load all datasets in group (option for temporary loading).
`ncells`([annotated_only])
`ncells_bydataset`([annotated_only])
`obs_concat`([keys])	Returns concatenation of all .obs.
`project_celltypes_to_ontology`([...])	Project free text cell type names to ontology based on mapping table.
`show_summary`()
`streamline_features`([match_to_release, ...])	Subset and sort genes to genes defined in an assembly or genes of a particular type, such as protein coding.
    :param match_to_release: Which genome annotation release to map the feature space to. Note that assemblies from Ensembl are usually named Organism.Assembly.Release; this is the Release string. Can be:
        - str: Name of the release.
        - dict: Mapping of organism to name of the release (see str format). Chooses the release for each data set based on its organism annotation.
    :param remove_gene_version: Whether to remove the version number after the period sometimes found in Ensembl gene ids.
    :param subset_genes_to_type: Type(s) to subset to. Can be a single type, a list of types, or None. Types can be:
        - None: All genes in assembly.
        - "protein_coding": All protein coding genes in assembly.
`streamline_metadata`([schema, clean_obs, ...])	Streamline the adata instance in each data set to output format.
`subset`(key, values)	Subset list of adata objects based on sample-wise properties.
`subset_cells`(key, values)	Subset list of adata objects based on cell-wise properties.
`write_backed`(adata_backed, genome, idx[, ...])	Loads data set group into slice of backed anndata object.
`write_distributed_store`(dir_cache[, ...])	Write data set into a format that allows distributed access to data set on disk.
`write_ontology_class_maps`(fn, attrs[, ...])	Write cell type maps of free text cell types to ontology classes.
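The parameter semantics of `streamline_features` and the idea behind `collapse_counts` can be sketched in plain Python. This is not sfaira's implementation: the helper names, the toy gene ids, the organism keys and the release values in the dict are all assumptions chosen for illustration. The sketch shows how a str or dict release argument might be resolved per data set, how a version suffix could be stripped from a gene id, and how columns that end up sharing an id could be summed.

```python
from collections import defaultdict
from typing import Dict, List, Tuple, Union


def resolve_release(match_to_release: Union[str, Dict[str, str]], organism: str) -> str:
    """Pick the release for one data set: a str applies to all, a dict is keyed by organism."""
    if isinstance(match_to_release, dict):
        return match_to_release[organism]
    return match_to_release


def strip_gene_version(gene_id: str) -> str:
    """Drop a trailing version suffix, e.g. 'GENE1.4' -> 'GENE1'."""
    return gene_id.split(".")[0]


def collapse_duplicated(gene_ids: List[str], counts: List[List[int]]) -> Tuple[List[str], List[List[int]]]:
    """Sum columns of a cells x genes matrix that share a gene id, keeping first-seen order."""
    order: List[str] = []
    column_groups: Dict[str, List[int]] = defaultdict(list)
    for j, gene in enumerate(gene_ids):
        if gene not in column_groups:
            order.append(gene)
        column_groups[gene].append(j)
    collapsed = [[sum(row[j] for j in column_groups[gene]) for gene in order] for row in counts]
    return order, collapsed


# Toy gene ids (not real Ensembl ids) and a 2 cells x 3 genes count matrix.
gene_ids = ["GENE1.1", "GENE2.3", "GENE1.2"]
counts = [[1, 2, 3], [0, 5, 4]]

stripped = [strip_gene_version(g) for g in gene_ids]
genes, collapsed = collapse_duplicated(stripped, counts)
print(genes)
print(collapsed)

# Hypothetical organism keys and release values.
releases = {"human": "97", "mouse": "98"}
print(resolve_release("97", "human"), resolve_release(releases, "mouse"))
```

Stripping versions is one way duplicated ids can arise, which is why the two steps are shown together; whether sfaira chains them in this order is not stated here.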