Unsupervised analyses in SimBA
Bout aggregation helper
- simba.unsupervised.bout_aggregator.bout_aggregator(data: DataFrame, clfs: List[str], feature_names: List[str], video_info: DataFrame, min_bout_length: Optional[int] = 0, aggregator: Optional[typing_extensions.Literal['MEAN', 'MEDIAN']] = 'MEAN') DataFrame [source]
Helper to aggregate features to bout-level representations for unsupervised analysis.
- Parameters
data (pd.DataFrame) – DataFrame with features.
clfs (List[str]) – Names of classifiers
feature_names (List[str]) – Names of features
aggregator (Optional[Literal['MEAN', 'MEDIAN']]) – Aggregation type, e.g., ‘MEAN’, ‘MEDIAN’. Default ‘MEAN’.
min_bout_length (Optional[int]) – The length of the shortest allowed bout in milliseconds. Default 0 which means all bouts.
video_info (pd.DataFrame) – Dataframe holding video names, fps, resolution etc typically located at project_folder/logs/video_info.csv of SimBA project.
- Return pd.DataFrame
Featurized data at aggregate bout level.
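The aggregation described above can be sketched with plain pandas. This is a minimal illustration, not SimBA's implementation: it assumes the DataFrame index holds frame numbers, that a classifier column contains 0/1 values, and that a single fps value applies. The function name aggregate_bouts and the output columns CLASSIFIER, START_FRAME and END_FRAME are hypothetical.

```python
import pandas as pd


def aggregate_bouts(data, clf, feature_names, fps, min_bout_length_ms=0, aggregator="MEAN"):
    """Collapse frame-wise features into one row per bout of `clf` (hypothetical helper)."""
    present = data[clf] == 1
    # A new run starts whenever the classifier value changes between consecutive frames.
    run_id = (present != present.shift()).cumsum()
    rows = []
    for _, run in data[present].groupby(run_id[present]):
        # Convert the bout length from frames to milliseconds using the video fps.
        duration_ms = len(run) / fps * 1000
        if duration_ms < min_bout_length_ms:
            continue
        feats = run[feature_names]
        agg = feats.mean() if aggregator == "MEAN" else feats.median()
        rows.append({
            "CLASSIFIER": clf,
            "START_FRAME": int(run.index[0]),
            "END_FRAME": int(run.index[-1]),
            **agg.to_dict(),
        })
    return pd.DataFrame(rows)


# Toy data: one 3-frame bout (300 ms at 10 fps) and one 1-frame bout (100 ms).
frames = pd.DataFrame({
    "Attack": [0, 1, 1, 1, 0, 0, 1, 0],
    "speed": [0.0, 2.0, 4.0, 6.0, 0.0, 0.0, 10.0, 0.0],
})
bouts = aggregate_bouts(frames, "Attack", ["speed"], fps=10, min_bout_length_ms=150)
print(bouts)
```

With a 150 ms threshold only the three-frame bout survives, and its speed value is the mean over frames 1 to 3.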
Cluster frequentist statistics calculator
- class simba.unsupervised.cluster_frequentist_calculator.ClusterFrequentistCalculator(config_path: Union[str, PathLike], data_path: Union[str, PathLike], settings: Dict[str, bool])[source]
Bases:
UnsupervisedMixin
,ConfigReader
Class for computing frequentist statistics based on cluster assignment labels for explainability purposes.
- Parameters
config_path (Union[str, os.PathLike]) – path to SimBA configparser.ConfigParser project_config.ini
data_path (Union[str, os.PathLike]) – path to pickle holding unsupervised results in simba.unsupervised.data_map.yaml format.
settings (dict) – Dict holding which statistical tests to use, with test names as keys and booleans as values.
- Example
>>> settings = {'scaled': True, 'ANOVA': True, 'tukey_posthoc': True, 'descriptive_statistics': True}
>>> calculator = ClusterFrequentistCalculator(config_path='unsupervised/project_folder/project_config.ini', data_path='unsupervised/cluster_models/quizzical_rhodes.pickle', settings=settings)
>>> calculator.run()
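To make the 'ANOVA' setting concrete, the following sketch computes a one-way ANOVA F statistic per feature, grouped by cluster label, using NumPy only. It illustrates the statistic and is not the code the class runs; the helper name one_way_anova_f and the toy data are hypothetical. For p-values, scipy.stats.f_oneway is a common route.

```python
import numpy as np
import pandas as pd


def one_way_anova_f(values, labels):
    """Return the one-way ANOVA F statistic of `values` grouped by `labels`."""
    values = np.asarray(values, dtype=float)
    labels = np.asarray(labels)
    groups = [values[labels == k] for k in np.unique(labels)]
    grand_mean = values.mean()
    # Between-cluster sum of squares: distance of each cluster mean from the grand mean.
    ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    # Within-cluster sum of squares: spread of observations around their own cluster mean.
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
    df_between = len(groups) - 1
    df_within = len(values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


# Toy data: 'speed' separates the two clusters, 'angle' does not.
features = pd.DataFrame({
    "speed": [1.0, 2.0, 3.0, 7.0, 8.0, 9.0],
    "angle": [5.0, 6.0, 7.0, 5.0, 6.0, 7.0],
})
clusters = np.array([0, 0, 0, 1, 1, 1])
f_scores = {name: one_way_anova_f(features[name], clusters) for name in features.columns}
print(f_scores)
```

A large F for a feature indicates that its cluster means differ by more than the within-cluster spread, which is what makes the feature useful for explaining the cluster assignments.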
Cluster video visualizer
- class simba.unsupervised.cluster_video_visualizer.ClusterVideoVisualizer(config_path: Union[str, PathLike], data_path: Union[str, PathLike], max_videos: Optional[int] = None, speed: Optional[int] = 1.0, bg_clr: Optional[Tuple[int, int, int]] = (255, 255, 255), plot_type: Optional[typing_extensions.Literal['VIDEO', 'HULL', 'SKELETON', 'POINTS']] = 'SKELETON')[source]
Bases:
ConfigReader
,UnsupervisedMixin
Class for creating video examples of cluster assignments.
- Parameters
config_path (Union[str, os.PathLike]) – Path to SimBA project configuration file.
data_path (Union[str, os.PathLike]) – Path to pickle file containing unsupervised results.
max_videos (Optional[Union[int, None]]) – Maximum number of videos to create for each cluster. Defaults to None.
speed (Optional[int]) – Speed of the generated videos. Defaults to 1.0.
bg_clr (Optional[Tuple[int, int, int]]) – Background color of the videos as RGB tuple. Defaults to white (255, 255, 255).
plot_type (Optional[Literal]) – Type of plot to generate (‘VIDEO’, ‘HULL’, ‘SKELETON’, ‘POINTS’). Defaults to ‘SKELETON’.
- Example
>>> config_path = '/Users/simon/Desktop/envs/simba/troubleshooting/NG_Unsupervised/project_folder/project_config.ini'
>>> data_path = '/Users/simon/Desktop/envs/simba/troubleshooting/NG_Unsupervised/project_folder/cluster_mdls/hopeful_khorana.pickle'
>>> visualizer = ClusterVideoVisualizer(config_path=config_path, data_path=data_path, bg_clr=(0, 0, 255), max_videos=20, speed=0.2, plot_type='POINTS')
>>> visualizer.run()
Cluster XAI calculator
- class simba.unsupervised.cluster_xai_calculator.ClusterXAICalculator(data_path: Union[str, PathLike], config_path: Union[str, PathLike], settings: Dict[str, Any])[source]
Bases:
UnsupervisedMixin
,ConfigReader
Class for building RF models on top of cluster assignments, and calculating latent space explainability metrics based on RF models.
- Parameters
config_path (Union[str, os.PathLike]) – path to SimBA configparser.ConfigParser project_config.ini
data_path (Union[str, os.PathLike]) – path to pickle holding unsupervised results in data_map.yaml format.
settings (Dict[str, Any]) – Dict holding which explainability tests to use.
- Example
>>> settings = {'gini_importance': True, 'permutation_importance': True, 'shap': {'method': 'cluster_paired', 'create': True, 'sample': 100}}
>>> calculator = ClusterXAICalculator(config_path='unsupervised/project_folder/project_config.ini', data_path='unsupervised/cluster_models/quizzical_rhodes.pickle', settings=settings)
>>> calculator.run()
Data extractor
- class simba.unsupervised.data_extractor.DataExtractor(config_path: Union[str, PathLike], data_path: Union[str, PathLike], data_types: List[str], settings: Optional[dict] = None)[source]
Bases:
UnsupervisedMixin
,ConfigReader
Extracts human-readable data from directory of pickles or single pickle file that holds unsupervised analyses.
- Parameters
config_path – path to SimBA configparser.ConfigParser project_config.ini
data_path – path to pickle holding unsupervised results in data_map.yaml format.
data_types – The types of data to extract. E.g., CLUSTERER_PARAMETERS, DIMENSIONALITY_REDUCTION_PARAMETERS, SCALER, SCALED_DATA, LOW_VARIANCE_FIELDS, FEATURE_NAMES, FRAME_FEATURES, FRAME_POSE, FRAME_TARGET, BOUTS_FEATURES, BOUTS_TARGETS, BOUTS_DIM_CORDS
settings – User-defined parameters for data extraction.
- Example
>>> extractor = DataExtractor(data_path='unsupervised/cluster_models/awesome_curran.pickle', data_types=['BOUTS_TARGETS'], settings=None, config_path='unsupervised/project_folder/project_config.ini')
>>> extractor.run()
Data creator
- class simba.unsupervised.dataset_creator.DatasetCreator(config_path: Union[str, PathLike], settings: Dict[str, Any])[source]
Bases:
ConfigReader
,UnsupervisedMixin
Transform raw frame-wise supervised classification data into aggregated data for unsupervised analyses. Saves the aggregated data into the logs directory of the SimBA project.
- Parameters
config_path (Union[str, os.PathLike]) – path to SimBA configparser.ConfigParser project_config.ini
settings (Dict[str, Any]) – Attributes for which data should be included and how the data should be aggregated.
- Example
>>> settings = {'data_slice': 'ALL FEATURES (EXCLUDING POSE)', 'clf_slice': 'Attack', 'bout_aggregation_type': 'MEDIAN', 'min_bout_length': 66, 'feature_path': '/Users/simon/Desktop/envs/simba_dev/simba/assets/unsupervised/features.csv'}
>>> db_creator = DatasetCreator(config_path='/Users/simon/Desktop/envs/simba/troubleshooting/NG_Unsupervised/project_folder/project_config.ini', settings=settings)
>>> db_creator.run()
Density-based cluster validation (DBCV)
- class simba.unsupervised.dbcv_calculator.DBCVCalculator(config_path: Union[str, PathLike], data_path: Union[str, PathLike])[source]
Bases:
UnsupervisedMixin
,ConfigReader
Density-based Cluster Validation (DBCV).
Note
Jitted version of DBCV. Runtime is reduced by replacing scipy.spatial.distance.cdist in the original DBCV with LLVM-compiled code. A further non-jitted implementation can be found in the hdbscan library. AWS Denseclus HDBSCAN appears to expose DBCV as an attribute of the returned object, which may be a faster alternative.
- Parameters
References
- [1] Moulavi et al., Density-Based Clustering Validation, SIAM 2014, https://doi.org/10.1137/1.9781611973440.96
Examples
>>> dbcv_calculator = DBCVCalculator(data_path='unsupervised/cluster_models', config_path='my_simba_config')
>>> results = dbcv_calculator.run()
- DBCV(X: ndarray, labels: ndarray) float [source]
- Parameters
X (np.ndarray) – 2D array of shape len(observations) x len(dimensionality reduced dimensions)
labels (np.ndarray) – 1D array with cluster labels
- Returns float
DBCV cluster validity score
- static calculate_dists(X: ndarray, arrays_by_cluster: List) Tuple[ndarray, bool] [source]
- Parameters
X (np.ndarray) – 2D array of shape len(observations) x len(dimensionality reduced dimensions)
arrays_by_cluster (numba.types.ListType[list[np.ndarray]]) – Numba typed List of list with 2d arrays.
- Returns Tuple[np.ndarray, bool]
Graph of all pair-wise mutual reachability distances between points, of size X.shape[0] x X.shape[0], and a boolean representing whether any issues were detected, including: (i) any cluster consisting of a single observation.
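The mutual reachability graph can be sketched in plain NumPy. This follows the all-points core distance as defined in Moulavi et al. and is an unjitted illustration, not SimBA's calculate_dists. It assumes Euclidean distance and that every cluster holds at least two distinct points, so the single-observation case flagged above is not handled. The function names are hypothetical.

```python
import numpy as np


def core_distances(X, labels):
    """All-points core distance of every observation with respect to its own cluster."""
    n_dims = X.shape[1]
    core = np.zeros(X.shape[0])
    for label in np.unique(labels):
        idx = np.flatnonzero(labels == label)
        pts = X[idx]
        # Euclidean distances between all pairs of points inside the cluster.
        dists = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
        # 1 / inf = 0, which removes each point's zero distance to itself from the sum.
        np.fill_diagonal(dists, np.inf)
        inv = (1.0 / dists) ** n_dims
        core[idx] = (inv.sum(axis=1) / (len(idx) - 1)) ** (-1.0 / n_dims)
    return core


def mutual_reachability(X, labels):
    """Pair-wise max(core(a), core(b), d(a, b)) as an X.shape[0] x X.shape[0] array."""
    core = core_distances(X, labels)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    return np.maximum(dists, np.maximum(core[:, None], core[None, :]))


# Toy data: two clusters of two points each in a 2D embedding.
X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 0.0], [5.0, 2.0]])
labels = np.array([0, 0, 1, 1])
core = core_distances(X, labels)
graph = mutual_reachability(X, labels)
print(core)
print(graph)
```

In this sketch the diagonal of the graph holds each point's own core distance, because the distance of a point to itself is zero.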
Embedding correlation calculator
- class simba.unsupervised.embedding_correlation_calculator.EmbeddingCorrelationCalculator(data_path: Union[str, PathLike], config_path: Union[str, PathLike], settings: Dict[str, Any])[source]
Bases:
UnsupervisedMixin
,ConfigReader
Class for correlating dimensionality reduction features with original features for explainability purposes.
- Parameters
- Example
>>> settings = {'correlation_methods': ['pearson', 'kendall', 'spearman'], 'plots': {'create': True, 'correlations': 'pearson', 'palette': 'jet'}} >>> calculator = EmbeddingCorrelationCalculator(config_path='unsupervised/project_folder/project_config.ini', data_path='unsupervised/cluster_models/quizzical_rhodes.pickle', settings=settings) >>> calculator.run()
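As a rough illustration of what correlating embedding dimensions with original features involves, the sketch below uses pandas DataFrame.corr on toy data. It is not the class's implementation: the function name correlate_embedding and the column names are hypothetical, and both frames are assumed to share the same row index. Only 'pearson' and 'spearman' are used here; 'kendall' is also accepted by DataFrame.corr but relies on SciPy.

```python
import pandas as pd


def correlate_embedding(features, embedding, methods=("pearson", "spearman")):
    """Correlate each original feature with each embedding dimension."""
    combined = pd.concat([features, embedding], axis=1)
    results = {}
    for method in methods:
        corr = combined.corr(method=method)
        # Keep only the feature-by-embedding block of the full correlation matrix.
        results[method] = corr.loc[features.columns, embedding.columns]
    return results


# Toy data: embedding dimension X is a linear function of 'speed'.
features = pd.DataFrame({
    "speed": [1.0, 2.0, 3.0, 4.0, 5.0],
    "angle": [5.0, 3.0, 4.0, 1.0, 2.0],
})
embedding = pd.DataFrame({
    "X": [2.0, 4.0, 6.0, 8.0, 10.0],
    "Y": [3.0, 1.0, 4.0, 1.0, 5.0],
})
results = correlate_embedding(features, embedding)
print(results["pearson"])
```

Each result is a table with one row per original feature and one column per embedding dimension, which is the shape a heatmap plot would typically consume.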
Grid-search visualizer
- class simba.unsupervised.grid_search_visualizers.GridSearchVisualizer(model_dir: Union[str, PathLike], save_dir: Union[str, PathLike], settings: Dict[str, Any])[source]
Bases:
UnsupervisedMixin
Visualize grid-searched latent spaces in .png format.
- Parameters
model_dir – path to pickle holding unsupervised results in data_map.yaml format.
save_dir – directory holding one or more unsupervised results in pickle data_map.yaml format.
settings – User-defined image attributes (e.g., continuous and categorical palettes)
- Example
>>> settings = {'CATEGORICAL_PALETTE': 'Pastel1', 'CONTINUOUS_PALETTE': 'magma', 'SCATTER_SIZE': 10}
>>> visualizer = GridSearchVisualizer(model_dir='/Users/simon/Desktop/envs/troubleshooting/unsupervised/cluster_models_042023', save_dir='/Users/simon/Desktop/envs/troubleshooting/unsupervised/images', settings=settings)
>>> visualizer.continuous_visualizer(continuous_vars=['START_FRAME'])
>>> visualizer.categorical_visualizer(categoricals=['CLUSTER'])