Unsupervised learning

Bout aggregation helper

simba.unsupervised.bout_aggregator.bout_aggregator(data, clfs, feature_names, video_info, min_bout_length=0, aggregator='MEAN')[source]

Helper to aggregate features to bout-level representations for unsupervised analysis.

Parameters
  • data (pd.DataFrame) – DataFrame with features.

  • clfs (List[str]) – Names of classifiers

  • feature_names – Names of features

  • aggregator (Optional[Literal['MEAN', 'MEDIAN']]) – Aggregation type, either 'MEAN' or 'MEDIAN'. Default 'MEAN'.

  • min_bout_length (Optional[int]) – The length of the shortest allowed bout in milliseconds. Default 0, which keeps all bouts.

  • video_info (pd.DataFrame) – DataFrame holding video names, fps, resolution, etc., typically located at project_folder/logs/video_info.csv in the SimBA project.

Return pd.DataFrame

Featurized data at aggregate bout level.
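
The aggregation described above can be sketched as follows. This is a minimal illustration, not SimBA's implementation: the helper name aggregate_bouts, the single-classifier signature, the fps argument (passed directly instead of being looked up in video_info), and the CLASSIFIER and END_FRAME output columns are all assumptions made for the example.

```python
import pandas as pd


def aggregate_bouts(data, clf, feature_names, fps, min_bout_length=0, aggregator="MEAN"):
    """Collapse consecutive frames where `clf` == 1 into one row per bout (hypothetical helper)."""
    present = data[clf] == 1
    # A new run starts wherever the classifier value changes between adjacent frames.
    run_id = present.astype(int).diff().ne(0).cumsum()
    rows = []
    for _, run in data[present].groupby(run_id[present]):
        duration_ms = len(run) / fps * 1000
        if duration_ms < min_bout_length:
            continue  # bout is shorter than the allowed minimum
        if aggregator == "MEAN":
            stats = run[feature_names].mean()
        else:
            stats = run[feature_names].median()
        row = stats.to_dict()
        row.update({"CLASSIFIER": clf, "START_FRAME": run.index[0], "END_FRAME": run.index[-1]})
        rows.append(row)
    return pd.DataFrame(rows)


# Toy frame-wise data: three "Attack" bouts lasting 3, 1 and 2 frames.
frames = pd.DataFrame({
    "Attack": [0, 1, 1, 1, 0, 1, 0, 1, 1],
    "speed": [0.0, 2.0, 4.0, 6.0, 0.0, 9.0, 0.0, 1.0, 3.0],
})
bouts = aggregate_bouts(frames, clf="Attack", feature_names=["speed"], fps=30, min_bout_length=50)
print(bouts)
```

At 30 fps a single frame lasts about 33 ms, so the one-frame bout falls below the 50 ms threshold and is dropped, while the other two bouts are each reduced to one row.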

Cluster frequentist statistics calculator

class simba.unsupervised.cluster_frequentist_calculator.ClusterFrequentistCalculator(config_path, data_path, settings)[source]

Bases: simba.mixins.unsupervised_mixin.UMLMixin, simba.mixins.config_reader.ConfigReader

Class for computing frequentist statistics based on cluster assignment labels for explainability purposes.

Parameters
  • config_path (Union[str, os.PathLike]) – path to SimBA configparser.ConfigParser project_config.ini

  • data_path (Union[str, os.PathLike]) – path to pickle holding unsupervised results in simba.unsupervised.data_map.yaml format.

  • settings (dict) – Dict holding which statistical tests to use, with test name as keys and booleans as values.

Example

>>> settings = {'scaled': True, 'ANOVA': True, 'tukey_posthoc': True, 'descriptive_statistics': True}
>>> calculator = ClusterFrequentistCalculator(config_path='unsupervised/project_folder/project_config.ini', data_path='unsupervised/cluster_models/quizzical_rhodes.pickle', settings=settings)
>>> calculator.run()
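
To illustrate what an 'ANOVA' setting of this kind typically computes, the sketch below runs a one-way ANOVA of a single feature across cluster labels with scipy.stats.f_oneway. The data are synthetic and the helper anova_by_cluster is hypothetical; how SimBA organizes or stores these statistics is not shown in this section.

```python
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(0)

# Synthetic bout-level data: 3 clusters with 30 observations each.
labels = np.repeat([0, 1, 2], 30)
feature = rng.normal(loc=labels * 5.0, scale=1.0)  # cluster means near 0, 5 and 10
noise = rng.normal(size=labels.size)               # unrelated to cluster membership


def anova_by_cluster(values, labels):
    """One-way ANOVA of `values` grouped by cluster label (hypothetical helper)."""
    groups = [values[labels == c] for c in np.unique(labels)]
    f_stat, p_value = f_oneway(*groups)
    return float(f_stat), float(p_value)


f_feature, p_feature = anova_by_cluster(feature, labels)
f_noise, p_noise = anova_by_cluster(noise, labels)
print(f"feature: F={f_feature:.1f}, p={p_feature:.3g}")
print(f"noise:   F={f_noise:.1f}, p={p_noise:.3g}")
```

A feature that differs between clusters yields a large F statistic, which is the kind of signal that helps explain what separates the clusters.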

Cluster video visualizer

class simba.unsupervised.cluster_video_visualizer.ClusterVideoVisualizer(config_path, data_path, max_videos=None, speed=1.0, bg_clr=(255, 255, 255), plot_type='SKELETON')[source]

Bases: simba.mixins.config_reader.ConfigReader, simba.mixins.unsupervised_mixin.UMLMixin

Class for creating video examples of cluster assignments.

Parameters
  • config_path (Union[str, os.PathLike]) – Path to SimBA project configuration file.

  • data_path (Union[str, os.PathLike]) – Path to pickle file containing unsupervised results.

  • max_videos (Optional[Union[int, None]]) – Maximum number of videos to create for each cluster. Defaults to None.

  • speed (Optional[float]) – Speed of the generated videos. Defaults to 1.0.

  • bg_clr (Optional[Tuple[int, int, int]]) – Background color of the videos as RGB tuple. Defaults to white (255, 255, 255).

  • plot_type (Optional[Literal]) – Type of plot to generate ('VIDEO', 'HULL', 'SKELETON', 'POINTS'). Defaults to 'SKELETON'.

Example

>>> config_path = '/Users/simon/Desktop/envs/simba/troubleshooting/NG_Unsupervised/project_folder/project_config.ini'
>>> data_path = '/Users/simon/Desktop/envs/simba/troubleshooting/NG_Unsupervised/project_folder/cluster_mdls/hopeful_khorana.pickle'
>>> visualizer = ClusterVideoVisualizer(config_path=config_path, data_path=data_path, bg_clr=(0, 0, 255), max_videos=20, speed=0.2, plot_type='POINTS')
>>> visualizer.run()

Cluster XAI calculator

class simba.unsupervised.cluster_xai_calculator.ClusterXAICalculator(data_path, config_path, settings)[source]

Bases: simba.mixins.unsupervised_mixin.UMLMixin, simba.mixins.config_reader.ConfigReader

Class for building RF models on top of cluster assignments, and calculating latent space explainability metrics based on RF models.

Parameters
  • config_path (Union[str, os.PathLike]) – path to SimBA configparser.ConfigParser project_config.ini

  • data_path (Union[str, os.PathLike]) – path to pickle holding unsupervised results in data_map.yaml format.

  • settings (Dict[str, Any]) – Dict holding which explainability tests to use.

Example

>>> settings = {'gini_importance': True, 'permutation_importance': True, 'shap': {'method': 'cluster_paired', 'create': True, 'sample': 100}}
>>> calculator = ClusterXAICalculator(config_path='unsupervised/project_folder/project_config.ini', data_path='unsupervised/cluster_models/quizzical_rhodes.pickle', settings=settings)
>>> calculator.run()
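
The general idea of fitting a random forest on cluster assignments and reading out Gini and permutation importances can be sketched with scikit-learn. This is an illustration on synthetic data, not SimBA's code: the feature names and forest settings are assumptions, and the SHAP step is omitted.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
n = 200

# Synthetic cluster assignments and two features: one tracks the cluster, one is noise.
cluster = rng.integers(0, 2, size=n)
X = np.column_stack([
    cluster * 3.0 + rng.normal(scale=0.5, size=n),
    rng.normal(size=n),
])
feature_names = ["informative", "noise"]  # hypothetical names

# Fit a forest that predicts the cluster label from the features.
rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, cluster)

# Gini (impurity-based) importance comes directly from the fitted forest.
gini = dict(zip(feature_names, rf.feature_importances_))

# Permutation importance: mean drop in accuracy when one feature is shuffled.
perm = permutation_importance(rf, X, cluster, n_repeats=5, random_state=0)
permutation = dict(zip(feature_names, perm.importances_mean))

print(gini)
print(permutation)
```

Both measures should rank the feature that tracks the cluster label above the noise feature.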

Data extractor

class simba.unsupervised.data_extractor.DataExtractor(config_path, data_path, data_types, settings=None)[source]

Bases: simba.mixins.unsupervised_mixin.UMLMixin, simba.mixins.config_reader.ConfigReader

Extracts human-readable data from directory of pickles or single pickle file that holds unsupervised analyses.

Parameters
  • config_path – path to SimBA configparser.ConfigParser project_config.ini

  • data_path – path to pickle holding unsupervised results in data_map.yaml format.

  • data_type – The type of data to extract. E.g., CLUSTERER_PARAMETERS, DIMENSIONALITY_REDUCTION_PARAMETERS, SCALER, SCALED_DATA, LOW_VARIANCE_FIELDS, FEATURE_NAMES, FRAME_FEATURES, FRAME_POSE, FRAME_TARGET, BOUTS_FEATURES, BOUTS_TARGETS, BOUTS_DIM_CORDS

  • settings – User-defined parameters for data extraction.

Example

>>> extractor = DataExtractor(data_path='unsupervised/cluster_models/awesome_curran.pickle', data_type=['BOUTS_TARGETS'], settings=None, config_path='unsupervised/project_folder/project_config.ini')
>>> extractor.run()

Data creator

class simba.unsupervised.dataset_creator.DatasetCreator(config_path, settings)[source]

Bases: simba.mixins.config_reader.ConfigReader, simba.mixins.unsupervised_mixin.UMLMixin

Transforms raw frame-wise supervised classification data into aggregated data for unsupervised analyses. Saves the aggregated data to the logs directory of the SimBA project.

Parameters
  • config_path (Union[str, os.PathLike]) – path to SimBA configparser.ConfigParser project_config.ini

  • settings (Dict[str, Any]) – Attributes for which data should be included and how the data should be aggregated.

Example

>>> settings = {'data_slice': 'ALL FEATURES (EXCLUDING POSE)', 'clf_slice': 'Attack', 'bout_aggregation_type': 'MEDIAN', 'min_bout_length': 66, 'feature_path': '/Users/simon/Desktop/envs/simba_dev/simba/assets/unsupervised/features.csv'}
>>> db_creator = DatasetCreator(config_path='/Users/simon/Desktop/envs/simba/troubleshooting/NG_Unsupervised/project_folder/project_config.ini', settings=settings)
>>> db_creator.run()

Density-based cluster validation (DBCV)

class simba.unsupervised.dbcv_calculator.DBCVCalculator(config_path, data_path)[source]

Bases: simba.mixins.unsupervised_mixin.UMLMixin, simba.mixins.config_reader.ConfigReader

Density-based Cluster Validation (DBCV).

Note

Jitted version of DBCV. Achieves faster runtime by replacing scipy.spatial.distance.cdist in the original DBCV with LLVM-compiled code, as discussed HERE. A further non-jitted implementation can be found in the hdbscan library. AWS Denseclus HDBSCAN appears to expose DBCV as an attribute of the returned object, which may be a faster alternative.

Parameters
  • embedders_path (str) – Directory holding dimensionality reduction models in pickle format.

  • clusterers_path (str) – Directory holding cluster models in pickle format.

References

1. Moulavi et al., Density-Based Clustering Validation, SIAM 2014, https://doi.org/10.1137/1.9781611973440.96

Examples

>>> dbcv_calculator = DBCVCalculator(clusterers_path='unsupervised/cluster_models', config_path='my_simba_config')
>>> results = dbcv_calculator.run()
DBCV(X, labels)[source]
Parameters
  • X (np.ndarray) – 2D array of shape len(observations) x len(dimensionality reduced dimensions)

  • labels (np.ndarray) – 1D array with cluster labels

Return float

DBCV cluster validity score

static calculate_dists(X, arrays_by_cluster)[source]
Parameters
  • X (np.ndarray) – 2D array of shape len(observations) x len(dimensionality reduced dimensions)

  • arrays_by_cluster (numba.types.ListType[list[np.ndarray]]) – Numba typed List of list with 2d arrays.

Return Tuple[np.ndarray, bool]

Graph of all pairwise mutual reachability distances between points, of size X.shape[0] x X.shape[0], and a boolean indicating whether any issues were detected, including: (i) any cluster consisting of a single observation.
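
For orientation, the mutual reachability distance between two points is the largest of their two core distances and their pairwise distance. The NumPy sketch below builds such a graph under stated assumptions: the function names are hypothetical, the core distances are taken as a given input (how they are derived is not covered here), and plain broadcasting stands in for the jitted distance computation.

```python
import numpy as np


def pairwise_euclidean(X):
    """Dense Euclidean distance matrix of shape (n, n), computed by broadcasting."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def mutual_reachability(X, core_dists):
    """max(core_i, core_j, d_ij) for every pair of points (hypothetical helper)."""
    d = pairwise_euclidean(X)
    pair_core = np.maximum(core_dists[:, None], core_dists[None, :])
    return np.maximum(d, pair_core)


X = np.array([[0.0, 0.0], [3.0, 4.0], [0.0, 1.0]])
core = np.array([1.0, 2.0, 6.0])  # assumed per-point core distances
graph = mutual_reachability(X, core)
print(graph)
```

The result is a symmetric X.shape[0] x X.shape[0] array in which sparse points (large core distance) are pushed away from all others.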

Embedding correlation calculator

class simba.unsupervised.embedding_correlation_calculator.EmbeddingCorrelationCalculator(data_path, config_path, settings)[source]

Bases: simba.mixins.unsupervised_mixin.UMLMixin, simba.mixins.config_reader.ConfigReader

Class for correlating dimensionality reduction features with original features for explainability purposes.

Parameters
  • config_path (str) – path to SimBA configparser.ConfigParser project_config.ini

  • data_path (str) – path to pickle holding unsupervised results in data_map.yaml format.

  • settings (dict) – dict holding which statistical tests to use and how to create plots.

Example

>>> settings = {'correlation_methods': ['pearson', 'kendall', 'spearman'], 'plots': {'create': True, 'correlations': 'pearson', 'palette': 'jet'}}
>>> calculator = EmbeddingCorrelationCalculator(config_path='unsupervised/project_folder/project_config.ini', data_path='unsupervised/cluster_models/quizzical_rhodes.pickle', settings=settings)
>>> calculator.run()
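
Correlating embedding dimensions with the original features can be sketched as below. The example uses synthetic data, computes Pearson correlations only, and the helper pearson_table is hypothetical; the rank-based methods named in the settings and the plotting step are not reproduced.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100

# Synthetic 2D embedding and two original features:
# the first feature follows embedding dimension 0, the second is noise.
embedding = rng.normal(size=(n, 2))
features = np.column_stack([
    2.0 * embedding[:, 0] + rng.normal(scale=0.1, size=n),
    rng.normal(size=n),
])


def pearson_table(embedding, features):
    """Pearson r between every embedding dimension (rows) and every feature (columns)."""
    n_dims = embedding.shape[1]
    # With rowvar=False, columns are variables; the result covers all 4 variables.
    full = np.corrcoef(embedding, features, rowvar=False)
    # Keep only the block relating embedding dimensions to features.
    return full[:n_dims, n_dims:]


table = pearson_table(embedding, features)
print(table.round(2))
```

A strong correlation in one cell suggests that the corresponding embedding axis largely reflects that feature.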

Grid-search visualizer

class simba.unsupervised.grid_search_visualizers.GridSearchVisualizer(model_dir, save_dir, settings)[source]

Bases: simba.mixins.unsupervised_mixin.UMLMixin

Visualize grid-searched latent spaces in .png format.

Parameters
  • model_dir – path to pickle holding unsupervised results in data_map.yaml format.

  • save_dir – directory holding one or more unsupervised results in pickle data_map.yaml format.

  • settings – User-defined image attributes (e.g., continuous and categorical palettes).

Example

>>> settings = {'CATEGORICAL_PALETTE': 'Pastel1', 'CONTINUOUS_PALETTE': 'magma', 'SCATTER_SIZE': 10}
>>> visualizer = GridSearchVisualizer(model_dir='/Users/simon/Desktop/envs/troubleshooting/unsupervised/cluster_models_042023', save_dir='/Users/simon/Desktop/envs/troubleshooting/unsupervised/images', settings=settings)
>>> visualizer.continuous_visualizer(continuous_vars=['START_FRAME'])
>>> visualizer.categorical_visualizer(categoricals=['CLUSTER'])

HDBSCAN clusterer

class simba.unsupervised.hdbscan_clusterer.HDBSCANClusterer[source]

Bases: simba.mixins.unsupervised_mixin.UMLMixin

Methods for grid-search HDBSCAN model fit and transform. Defaults to GPU and cuml.cluster.HDBSCAN; falls back to hdbscan.HDBSCAN if no GPU is available.

fit(data_path, save_dir, hyper_parameters)[source]
Parameters
  • data_path – Path holding pickled unsupervised dimensionality reduction results in data_map.yaml format

  • save_dir – Empty directory where to save the HDBSCAN results.

  • hyper_parameters – dict holding hyperparameters in list format

Returns

Example I

Grid-search fit:

>>> hyper_parameters = {'alpha': [1.0], 'min_cluster_size': [10], 'min_samples': [1], 'cluster_selection_epsilon': [20]}
>>> embedding_dir = '/Users/simon/Desktop/envs/troubleshooting/unsupervised/dr_models'
>>> save_dir = '/Users/simon/Desktop/envs/troubleshooting/unsupervised/cluster_models'
>>> config_path = '/Users/simon/Desktop/envs/troubleshooting/unsupervised/project_folder/project_config.ini'
>>> clusterer = HDBSCANClusterer(data_path=embedding_dir, save_dir=save_dir)
>>> clusterer.fit(hyper_parameters=hyper_parameters)
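
Hyperparameters given "in list format" suggest that one model is fitted per combination of values. Assuming the grid is the Cartesian product of the lists (an assumption, not something stated above), the combinations can be enumerated with itertools.product. The helper expand_grid is hypothetical.

```python
from itertools import product


def expand_grid(hyper_parameters):
    """Return one dict per combination of the listed hyperparameter values."""
    keys = list(hyper_parameters)
    value_lists = [hyper_parameters[k] for k in keys]
    return [dict(zip(keys, combo)) for combo in product(*value_lists)]


hyper_parameters = {
    'alpha': [1.0],
    'min_cluster_size': [10, 20],
    'min_samples': [1, 5],
    'cluster_selection_epsilon': [20],
}
grid = expand_grid(hyper_parameters)
for params in grid:
    print(params)
```

Two values for min_cluster_size and two for min_samples give four combinations, so grid size grows multiplicatively with every list that holds more than one value.
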
transform(data_path, model, save_dir=None, settings=None, save_format='csv')[source]
Parameters
  • data_path – Path to directory holding pickled unsupervised dimensionality reduction results in data_map.yaml format

  • model – Path to pickle holding hdbscan model in data_map.yaml format.

  • save_dir – Empty directory where to save the HDBSCAN results. If none, then keep results in memory under self.results.

  • settings – User-defined params.

Example I

Transform:

>>> data_path = '/Users/simon/Desktop/envs/simba/troubleshooting/NG_Unsupervised/project_folder/logs/unsupervised_data_20240218134920.pickle'
>>> mdl_path = '/Users/simon/Desktop/envs/simba/troubleshooting/NG_Unsupervised/project_folder/cluster_mdls/hopeful_khorana.pickle'
>>> clusterer = HDBSCANClusterer()
>>> settings = {'DATA_FORMAT': 'scaled', 'CLASSIFICATIONS': True}
>>> results = clusterer.transform(data_path=data_path, model=mdl_path, settings=settings)

UMAP embedder

class simba.unsupervised.umap_embedder.UmapEmbedder[source]

Bases: simba.mixins.unsupervised_mixin.UMLMixin

Methods for grid-search UMAP model fit and transform. Defaults to cuml.UMAP when a GPU is available; otherwise falls back to umap.UMAP.

Parameters
  • data_path – Path holding pickled dataset created by simba.unsupervised.dataset_creator.DatasetCreator.

  • save_dir – Empty directory where to save the UMAP results.

  • hyper_parameters – dict holding UMAP hyperparameters in list format.

Example I

Fit.

>>> hyper_parameters = {'n_neighbors': [10, 2], 'min_distance': [1.0], 'spread': [1.0], 'scaler': 'MIN-MAX', 'variance': 0.25, "multicolinearity": 0.5}
>>> data_path = 'unsupervised/project_folder/logs/unsupervised_data_20230416145821.pickle'
>>> save_dir = 'unsupervised/dr_models'
>>> config_path = 'unsupervised/project_folder/project_config.ini'
>>> embedder = UmapEmbedder(data_path=data_path, save_dir=save_dir)
>>> embedder.fit(hyper_parameters=hyper_parameters)
transform(data_path, model, settings=None, save_dir=None, save_format='csv')[source]
Example I

Transform.

>>> data_path = '/Users/simon/Desktop/envs/simba/troubleshooting/NG_Unsupervised/project_folder/logs/unsupervised_data_20240215143716.pickle'
>>> model_path = '/Users/simon/Desktop/envs/simba/troubleshooting/NG_Unsupervised/project_folder/clustering1704/academic_montalcini.pickle'
>>> embedder = UmapEmbedder()
>>> embedder.transform(save_dir=None, data_path=data_path, model=model_path, settings=None)
>>> embedder.transform(save_dir=None, data_path=data_path, model=model_path, settings={'DATA_FORMAT': 'scaled', 'CLASSIFICATIONS': True})

Unsupervised mixins