sfaira.data.DatasetSuperGroup.remove_duplicates

DatasetSuperGroup.remove_duplicates(supplier_hierarchy: str = 'cellxgene,sfaira')

Remove duplicate data loaders from the super group, e.g. loaders that map to the same DOI.

Any DOI match is treated as a duplicate and removed, whether the match is on the pre-print or the journal publication DOI. Data sets without a DOI are removed, too. Loaders are kept according to the priority order given in supplier_hierarchy. Requires a super group with homogeneous suppliers across DatasetGroups and raises an error otherwise. This holds for sfaira-maintained libraries but may not hold if custom-assembled DatasetGroups are used.

Parameters

supplier_hierarchy

Hierarchy to resolve duplicates by. Comma-separated string that indicates which data provider takes priority, highest priority first. Choose “cellxgene,sfaira” to prioritise data sets downloaded from cellxgene. Choose “sfaira,cellxgene” to prioritise data sets produced by local raw data processing pipelines.

  • cellxgene: cellxgene downloads

  • sfaira: local raw file processing
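
The priority rule described above can be illustrated with a small standalone sketch. This is not sfaira's implementation: the Loader tuple, the resolve_duplicates function, the loader names and the DOIs are all hypothetical stand-ins, and the sketch assumes that each loader carries a supplier name and zero or more DOIs. Rejecting suppliers that are missing from the hierarchy is a choice made for the sketch and is separate from the homogeneity check that the method itself performs.

```python
from typing import List, NamedTuple, Set, Tuple


class Loader(NamedTuple):
    """Hypothetical stand-in for a data loader; not a sfaira class."""
    name: str
    supplier: str
    dois: Tuple[str, ...]  # pre-print and/or journal DOI; may be empty


def resolve_duplicates(
    loaders: List[Loader],
    supplier_hierarchy: str = "cellxgene,sfaira",
) -> List[Loader]:
    """Keep at most one loader per DOI, preferring earlier suppliers."""
    hierarchy = [s.strip() for s in supplier_hierarchy.split(",")]
    # Sketch-only choice: refuse suppliers that the hierarchy does not rank.
    unknown = {ld.supplier for ld in loaders} - set(hierarchy)
    if unknown:
        raise ValueError(f"suppliers not in hierarchy: {sorted(unknown)}")

    seen: Set[str] = set()
    kept: List[Loader] = []
    # Visit suppliers from highest to lowest priority.
    for supplier in hierarchy:
        for ld in loaders:
            if ld.supplier != supplier:
                continue
            # Drop loaders without a DOI and loaders whose pre-print or
            # journal DOI was already claimed by a kept loader.
            if not ld.dois or seen.intersection(ld.dois):
                continue
            seen.update(ld.dois)
            kept.append(ld)
    return kept


# Placeholder DOIs, invented for illustration only.
loaders = [
    Loader("a_local", "sfaira", ("10.1101/preprint-a", "10.1000/journal-a")),
    Loader("a_portal", "cellxgene", ("10.1000/journal-a",)),
    Loader("b_local", "sfaira", ("10.1000/journal-b",)),
    Loader("c_no_doi", "sfaira", ()),
]

print([ld.name for ld in resolve_duplicates(loaders)])
# ['a_portal', 'b_local']
print([ld.name for ld in resolve_duplicates(loaders, "sfaira,cellxgene")])
# ['a_local', 'b_local']
```

With the default hierarchy, the cellxgene copy of data set "a" is kept and the local copy is dropped. Reversing the hierarchy keeps the local copy instead. The loader without a DOI is dropped in both cases.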

Returns