sfaira.data.DatasetBase.streamline_features

DatasetBase.streamline_features(match_to_release: Optional[Union[str, Dict[str, str]]] = None, remove_gene_version: bool = True, subset_genes_to_type: Union[None, str, List[str]] = None, schema: Optional[str] = None)

Subset and sort genes to those defined in an assembly, or to genes of a particular type, such as protein coding. This also adds missing Ensembl ID or gene symbol columns if match_to_release is not set to False, and removes all adata.var columns that are not defined as gene_id_ensembl_var_key or gene_id_symbol_var_key in the dataloader.

Parameters

match_to_release –
Which genome annotation release to map the feature space to. Note that assemblies from Ensembl are usually named as Organism.Assembly.Release; this is the Release string. Can be:
- str: Provide the name of the release.
- dict: Mapping of organism to name of the release (see str format). Chooses release for each
  data set based on organism annotation.
remove_gene_version – Whether to remove the version number after the period sometimes found in Ensembl gene IDs (for example, the ".2" in "ENSG00000139618.2").
subset_genes_to_type –
Type(s) to subset to. Can be a single type, a list of types, or None. Types can be:
- None: All genes in assembly.
- "protein_coding": All protein coding genes in assembly.
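
To make the effect of remove_gene_version and subset_genes_to_type more concrete, here is a minimal, self-contained sketch. It is not sfaira's implementation: the helpers strip_gene_version and select_genes_by_type, the toy reference table, and its non-coding biotype labels are all hypothetical and exist only for illustration. It assumes the version suffix is a trailing ".<digits>" and that the reference is an ordered list of (gene ID, gene type) pairs.

```python
import re
from typing import List, Optional, Sequence, Tuple, Union

# Hypothetical helpers for illustration only; these are NOT part of sfaira.

# Assumption: a version suffix is a trailing period followed by digits.
_VERSION_SUFFIX = re.compile(r"\.\d+$")


def strip_gene_version(gene_id: str) -> str:
    """Return the gene ID without a trailing '.<digits>' version suffix."""
    return _VERSION_SUFFIX.sub("", gene_id)


def select_genes_by_type(
    reference: Sequence[Tuple[str, str]],
    gene_types: Union[None, str, List[str]] = None,
) -> List[str]:
    """Return reference gene IDs, in reference order, filtered by gene type.

    gene_types mirrors the shape of subset_genes_to_type:
    None keeps every gene, a string or a list of strings keeps matching types.
    """
    if gene_types is None:
        return [gene_id for gene_id, _ in reference]
    wanted = {gene_types} if isinstance(gene_types, str) else set(gene_types)
    return [gene_id for gene_id, gene_type in reference if gene_type in wanted]


# Toy reference table (made-up biotype labels for the non-coding entries).
toy_reference = [
    ("ENSG00000000001", "protein_coding"),
    ("ENSG00000000002", "lncRNA"),
    ("ENSG00000000003", "protein_coding"),
]

versioned_ids = ["ENSG00000000003.7", "ENSG00000000001"]
clean_ids = [strip_gene_version(g) for g in versioned_ids]
coding_ids = select_genes_by_type(toy_reference, "protein_coding")
```

In this sketch, clean_ids holds the IDs without version suffixes, and coding_ids follows the order of the toy reference rather than the order of the input data, which is one plausible reading of "subset and sort genes to genes defined in an assembly".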