Unsupervised analyses in SimBA

Bout aggregation helper

simba.unsupervised.bout_aggregator.bout_aggregator(data: DataFrame, clfs: List[str], feature_names: List[str], video_info: DataFrame, min_bout_length: Optional[int] = 0, aggregator: Optional[typing_extensions.Literal['MEAN', 'MEDIAN']] = 'MEAN') -> DataFrame

Helper to aggregate features to bout-level representations for unsupervised analysis.

Parameters
  • data (pd.DataFrame) – DataFrame with features.

  • clfs (List[str]) – Names of classifiers

  • feature_names (List[str]) – Names of features

  • aggregator (Optional[Literal['MEAN', 'MEDIAN']]) – Aggregation type, e.g., ‘MEAN’, ‘MEDIAN’. Default ‘MEAN’.

  • min_bout_length (Optional[int]) – The length of the shortest allowed bout in milliseconds. Default 0, which means all bouts are included.

  • video_info (pd.DataFrame) – DataFrame holding video names, fps, resolution, etc. Typically located at project_folder/logs/video_info.csv of the SimBA project.

Return pd.DataFrame

Featurized data at aggregate bout level.
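
As a rough illustration of what bout-level aggregation means, the sketch below collapses consecutive frames where a classifier is present (value 1) into one row per bout. It is written in plain Python to stay dependency-free and is not SimBA's implementation, which operates on DataFrames. The function name aggregate_bouts, the END_FRAME key, and the single fps argument (standing in for the video_info lookup) are assumptions made for the example.

```python
from itertools import groupby
from statistics import mean, median
from typing import Dict, List


def aggregate_bouts(clf: List[int],
                    features: Dict[str, List[float]],
                    fps: float,
                    min_bout_length_ms: int = 0,
                    aggregator: str = 'MEAN') -> List[Dict[str, float]]:
    """Collapse runs of consecutive 1s in `clf` into one aggregated row per bout."""
    agg = {'MEAN': mean, 'MEDIAN': median}[aggregator]
    bouts = []
    frame = 0
    for present, run in groupby(clf):
        n_frames = len(list(run))
        start, end = frame, frame + n_frames - 1
        frame += n_frames
        if present != 1:
            continue  # frames where the behavior is absent
        if (n_frames / fps) * 1000 < min_bout_length_ms:
            continue  # bout shorter than the allowed minimum
        row = {'START_FRAME': start, 'END_FRAME': end}
        for name, values in features.items():
            row[name] = agg(values[start:end + 1])
        bouts.append(row)
    return bouts
```

With fps=10, a run of three frames lasts 300 ms, so min_bout_length_ms=200 keeps it and drops any single-frame bout.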

Cluster frequentist statistics calculator

class simba.unsupervised.cluster_frequentist_calculator.ClusterFrequentistCalculator(config_path: Union[str, PathLike], data_path: Union[str, PathLike], settings: Dict[str, bool])

Bases: UnsupervisedMixin, ConfigReader

Class for computing frequentist statistics based on cluster assignment labels for explainability purposes.

Parameters
  • config_path (Union[str, os.PathLike]) – path to SimBA configparser.ConfigParser project_config.ini

  • data_path (Union[str, os.PathLike]) – path to pickle holding unsupervised results in simba.unsupervised.data_map.yaml format.

  • settings (dict) – Dict holding which statistical tests to use, with test name as keys and booleans as values.

Example

>>> settings = {'scaled': True, 'ANOVA': True, 'tukey_posthoc': True, 'descriptive_statistics': True}
>>> calculator = ClusterFrequentistCalculator(config_path='unsupervised/project_folder/project_config.ini', data_path='unsupervised/cluster_models/quizzical_rhodes.pickle', settings=settings)
>>> calculator.run()
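
To make the 'ANOVA' setting concrete, the following sketch computes a one-way ANOVA F statistic for a single feature grouped by cluster label. It is a minimal stand-alone illustration, not the code SimBA runs; the function name one_way_anova_f and the input layout (cluster label mapped to a list of feature values) are assumptions.

```python
from typing import Dict, List


def one_way_anova_f(groups: Dict[int, List[float]]) -> float:
    """F statistic: between-cluster variance over within-cluster variance."""
    all_values = [v for values in groups.values() for v in values]
    grand_mean = sum(all_values) / len(all_values)
    ss_between = 0.0
    ss_within = 0.0
    for values in groups.values():
        group_mean = sum(values) / len(values)
        ss_between += len(values) * (group_mean - grand_mean) ** 2
        ss_within += sum((v - group_mean) ** 2 for v in values)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)
```

A large F value indicates that the feature differs more between clusters than within them. Turning F into a p-value requires the F distribution, which scipy.stats.f_oneway handles.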

Cluster video visualizer

class simba.unsupervised.cluster_video_visualizer.ClusterVideoVisualizer(config_path: Union[str, PathLike], data_path: Union[str, PathLike], max_videos: Optional[int] = None, speed: Optional[int] = 1.0, bg_clr: Optional[Tuple[int, int, int]] = (255, 255, 255), plot_type: Optional[typing_extensions.Literal['VIDEO', 'HULL', 'SKELETON', 'POINTS']] = 'SKELETON')

Bases: ConfigReader, UnsupervisedMixin

Class for creating video examples of cluster assignments.

Parameters
  • config_path (Union[str, os.PathLike]) – Path to SimBA project configuration file.

  • data_path (Union[str, os.PathLike]) – Path to pickle file containing unsupervised results.

  • max_videos (Optional[Union[int, None]]) – Maximum number of videos to create for each cluster. Defaults to None.

  • speed (Optional[int]) – Speed of the generated videos. Defaults to 1.0.

  • bg_clr (Optional[Tuple[int, int, int]]) – Background color of the videos as RGB tuple. Defaults to white (255, 255, 255).

  • plot_type (Optional[Literal]) – Type of plot to generate (‘VIDEO’, ‘HULL’, ‘SKELETON’, ‘POINTS’). Defaults to ‘SKELETON’.

Example

>>> config_path = '/Users/simon/Desktop/envs/simba/troubleshooting/NG_Unsupervised/project_folder/project_config.ini'
>>> data_path = '/Users/simon/Desktop/envs/simba/troubleshooting/NG_Unsupervised/project_folder/cluster_mdls/hopeful_khorana.pickle'
>>> visualizer = ClusterVideoVisualizer(config_path=config_path, data_path=data_path, bg_clr=(0, 0, 255), max_videos=20, speed=0.2, plot_type='POINTS')
>>> visualizer.run()

Cluster XAI calculator

class simba.unsupervised.cluster_xai_calculator.ClusterXAICalculator(data_path: Union[str, PathLike], config_path: Union[str, PathLike], settings: Dict[str, Any])

Bases: UnsupervisedMixin, ConfigReader

Class for building RF models on top of cluster assignments, and calculating latent space explainability metrics based on RF models.

Parameters
  • config_path (Union[str, os.PathLike]) – path to SimBA configparser.ConfigParser project_config.ini

  • data_path (Union[str, os.PathLike]) – path to pickle holding unsupervised results in data_map.yaml format.

  • settings (Dict[str, Any]) – Dict holding which explainability tests to use.

Example

>>> settings = {'gini_importance': True, 'permutation_importance': True, 'shap': {'method': 'cluster_paired', 'create': True, 'sample': 100}}
>>> calculator = ClusterXAICalculator(config_path='unsupervised/project_folder/project_config.ini', data_path='unsupervised/cluster_models/quizzical_rhodes.pickle', settings=settings)
>>> calculator.run()

Data extractor

class simba.unsupervised.data_extractor.DataExtractor(config_path: Union[str, PathLike], data_path: Union[str, PathLike], data_types: List[str], settings: Optional[dict] = None)

Bases: UnsupervisedMixin, ConfigReader

Extracts human-readable data from a directory of pickles, or a single pickle file, holding unsupervised analyses.

Parameters
  • config_path – path to SimBA configparser.ConfigParser project_config.ini

  • data_path – path to pickle holding unsupervised results in data_map.yaml format.

  • data_types – The types of data to extract. E.g., CLUSTERER_PARAMETERS, DIMENSIONALITY_REDUCTION_PARAMETERS, SCALER, SCALED_DATA, LOW_VARIANCE_FIELDS, FEATURE_NAMES, FRAME_FEATURES, FRAME_POSE, FRAME_TARGET, BOUTS_FEATURES, BOUTS_TARGETS, BOUTS_DIM_CORDS

  • settings – User-defined parameters for data extraction.

Example

>>> extractor = DataExtractor(data_path='unsupervised/cluster_models/awesome_curran.pickle', data_types=['BOUTS_TARGETS'], settings=None, config_path='unsupervised/project_folder/project_config.ini')
>>> extractor.run()
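
Before choosing which data types to extract, it can help to look at what a results pickle contains. The sketch below is a generic helper that is not part of SimBA: it assumes the pickle deserializes to a nested dict (an assumption about the data_map.yaml layout) and lists keys with their value types. The name describe_pickle is hypothetical.

```python
import pickle
from typing import Any, List


def describe_pickle(path: str, max_depth: int = 2) -> List[str]:
    """Return indented 'key: type' lines for a pickled, possibly nested, dict."""
    # Only unpickle files from a trusted source: pickle can execute code on load.
    with open(path, 'rb') as handle:
        obj = pickle.load(handle)

    lines: List[str] = []

    def walk(node: Any, depth: int) -> None:
        # Descend only into dicts, and only down to max_depth levels.
        if isinstance(node, dict) and depth < max_depth:
            for key, value in node.items():
                lines.append(f"{'  ' * depth}{key}: {type(value).__name__}")
                walk(value, depth + 1)

    walk(obj, 0)
    return lines
```

Loading a real results pickle this way requires the libraries whose objects it stores to be importable in the current environment.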

Data creator

class simba.unsupervised.dataset_creator.DatasetCreator(config_path: Union[str, PathLike], settings: Dict[str, Any])

Bases: ConfigReader, UnsupervisedMixin

Transform raw frame-wise supervised classification data into aggregated data for unsupervised analyses. Saves the aggregated data in the logs directory of the SimBA project.

Parameters
  • config_path (Union[str, os.PathLike]) – path to SimBA configparser.ConfigParser project_config.ini

  • settings (Dict[str, Any]) – Attributes for which data should be included and how the data should be aggregated.

Example

>>> settings = {'data_slice': 'ALL FEATURES (EXCLUDING POSE)', 'clf_slice': 'Attack', 'bout_aggregation_type': 'MEDIAN', 'min_bout_length': 66, 'feature_path': '/Users/simon/Desktop/envs/simba_dev/simba/assets/unsupervised/features.csv'}
>>> db_creator = DatasetCreator(config_path='/Users/simon/Desktop/envs/simba/troubleshooting/NG_Unsupervised/project_folder/project_config.ini', settings=settings)
>>> db_creator.run()

Density-based cluster validation (DBCV)

class simba.unsupervised.dbcv_calculator.DBCVCalculator(config_path: Union[str, PathLike], data_path: Union[str, PathLike])

Bases: UnsupervisedMixin, ConfigReader

Density-based Cluster Validation (DBCV).

Note

Jitted version of DBCV. Runtime is reduced by replacing scipy.spatial.distance.cdist in the original DBCV implementation with LLVM-compiled code. A further non-jitted implementation can be found in the hdbscan library. AWS Denseclus HDBSCAN appears to expose DBCV as an attribute of the returned object, which may be a faster alternative.

Parameters
  • embedders_path (str) – Directory holding dimensionality reduction models in pickle format.

  • clusterers_path (str) – Directory holding cluster models in pickle format.

References

[1] Moulavi et al., Density-Based Clustering Validation, SIAM 2014, https://doi.org/10.1137/1.9781611973440.96

Examples

>>> dbcv_calculator = DBCVCalculator(data_path='unsupervised/cluster_models', config_path='my_simba_config')
>>> results = dbcv_calculator.run()

DBCV(X: ndarray, labels: ndarray) -> float

Parameters
  • X (np.ndarray) – 2D array of shape len(observations) x len(dimensionality reduced dimensions)

  • labels (np.ndarray) – 1D array with cluster labels

Returns float

DBCV cluster validity score

static calculate_dists(X: ndarray, arrays_by_cluster: List) -> Tuple[ndarray, bool]

Parameters
  • X (np.ndarray) – 2D array of shape len(observations) x len(dimensionality reduced dimensions)

  • arrays_by_cluster (numba.types.ListType[list[np.ndarray]]) – Numba typed List of list with 2d arrays.

Returns Tuple[np.ndarray, bool]

Graph of all pair-wise mutual reachability distances between points, of size X.shape[0] x X.shape[0], and a boolean representing whether any issues were detected, including (i) any cluster consisting of a single observation.
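
For readers unfamiliar with the term, the sketch below shows one way to build a mutual reachability graph, using the all-points core distance described by Moulavi et al. [1]. It is a slow, pure-Python illustration and not SimBA's jitted implementation; the function names and the choice to flag single-observation clusters with an infinite core distance are assumptions made for the example.

```python
import math
from typing import List, Sequence, Tuple

Point = Sequence[float]


def all_points_core_distances(points: List[Point], labels: List[int]) -> Tuple[List[float], bool]:
    """Core distance of each point relative to the other points in its own cluster."""
    n_dims = len(points[0])
    core, issue = [], False
    for i, (p, label) in enumerate(zip(points, labels)):
        neighbours = [math.dist(p, q) for j, (q, other) in enumerate(zip(points, labels))
                      if other == label and j != i]
        if not neighbours:
            # Single-observation cluster: core distance is undefined, so flag it.
            core.append(math.inf)
            issue = True
        elif any(d == 0.0 for d in neighbours):
            # Duplicate point in the cluster: the core distance tends to zero.
            core.append(0.0)
        else:
            inv = sum((1.0 / d) ** n_dims for d in neighbours) / len(neighbours)
            core.append(inv ** (-1.0 / n_dims))
    return core, issue


def mutual_reachability(points: List[Point], labels: List[int]) -> Tuple[List[List[float]], bool]:
    """Symmetric n x n graph of max(core_i, core_j, euclidean distance)."""
    core, issue = all_points_core_distances(points, labels)
    n = len(points)
    graph = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dist = max(core[i], core[j], math.dist(points[i], points[j]))
            graph[i][j] = graph[j][i] = dist
    return graph, issue
```

In this sketch the returned boolean plays the same role as the issue flag described above: it turns True as soon as any cluster holds a single observation.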

Embedding correlation calculator

class simba.unsupervised.embedding_correlation_calculator.EmbeddingCorrelationCalculator(data_path: Union[str, PathLike], config_path: Union[str, PathLike], settings: Dict[str, Any])

Bases: UnsupervisedMixin, ConfigReader

Class for correlating dimensionality reduction features with original features for explainability purposes.

[Image: EmbeddingCorrelationCalculator.png]

Parameters
  • config_path (str) – path to SimBA configparser.ConfigParser project_config.ini

  • data_path (str) – path to pickle holding unsupervised results in data_map.yaml format.

  • settings (dict) – dict holding which statistical tests to use and how to create plots.

Example

>>> settings = {'correlation_methods': ['pearson', 'kendall', 'spearman'], 'plots': {'create': True, 'correlations': 'pearson', 'palette': 'jet'}}
>>> calculator = EmbeddingCorrelationCalculator(config_path='unsupervised/project_folder/project_config.ini', data_path='unsupervised/cluster_models/quizzical_rhodes.pickle', settings=settings)
>>> calculator.run()
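
As an illustration of what the 'pearson' and 'spearman' entries in correlation_methods measure, the sketch below correlates one embedding dimension with one original feature. It is a dependency-free example rather than SimBA's code; the function names are assumptions, and Kendall's tau is left out for brevity.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Linear correlation; raises ZeroDivisionError if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Rank correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

A feature that increases monotonically but non-linearly along an embedding axis gives a Spearman coefficient of 1 and a Pearson coefficient below 1, which is one reason to compute more than one method.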

Grid-search visualizer

class simba.unsupervised.grid_search_visualizers.GridSearchVisualizer(model_dir: Union[str, PathLike], save_dir: Union[str, PathLike], settings: Dict[str, Any])

Bases: UnsupervisedMixin

Visualize grid-searched latent spaces in .png format.

[Image: GridSearchVisualizer.png]

Parameters
  • model_dir – directory holding one or more unsupervised results in pickle data_map.yaml format.

  • save_dir – directory where the output images are saved.

  • settings – User-defined image attributes (e.g., continuous and categorical palettes)

Example

>>> settings = {'CATEGORICAL_PALETTE': 'Pastel1', 'CONTINUOUS_PALETTE': 'magma', 'SCATTER_SIZE': 10}
>>> visualizer = GridSearchVisualizer(model_dir='/Users/simon/Desktop/envs/troubleshooting/unsupervised/cluster_models_042023', save_dir='/Users/simon/Desktop/envs/troubleshooting/unsupervised/images', settings=settings)
>>> visualizer.continuous_visualizer(continuous_vars=['START_FRAME'])
>>> visualizer.categorical_visualizer(categoricals=['CLUSTER'])

HDBSCAN clusterer

UMAP embedder