Unsupervised learning
Bout aggregation helper
- simba.unsupervised.bout_aggregator.bout_aggregator(data, clfs, feature_names, video_info, min_bout_length=0, aggregator='MEAN')
Helper to aggregate features to bout-level representations for unsupervised analysis.
- Parameters
data (pd.DataFrame) – DataFrame with features.
clfs (List[str]) – Names of classifiers.
feature_names – Names of features.
aggregator (Optional[Literal['MEAN', 'MEDIAN']]) – Aggregation type, e.g., 'MEAN', 'MEDIAN'. Default 'MEAN'.
min_bout_length (Optional[int]) – The length of the shortest allowed bout in milliseconds. Default 0, which means all bouts.
video_info (pd.DataFrame) – DataFrame holding video names, fps, resolution, etc., typically located at project_folder/logs/video_info.csv of the SimBA project.
- Return pd.DataFrame
Featurized data at aggregate bout level.
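The idea behind bout-level aggregation can be sketched in plain Python. This is a minimal illustration, not the SimBA implementation: the function name `aggregate_bouts`, the row keys `CLASSIFIER`, `START_FRAME` and `END_FRAME`, and the list-of-dicts input are all assumptions made for the example. It assumes a bout is a run of consecutive frames where the classifier column equals 1, and that the millisecond threshold is converted to frames using the video's fps.

```python
from statistics import mean, median
from typing import List


def aggregate_bouts(
    frames: List[dict],
    clf: str,
    feature_names: List[str],
    fps: float,
    min_bout_length_ms: int = 0,
    aggregator: str = "MEAN",
) -> List[dict]:
    """Collapse runs of consecutive frames where ``clf`` == 1 into one row per bout."""
    agg = {"MEAN": mean, "MEDIAN": median}[aggregator]
    # Convert the millisecond threshold into a frame count for this video.
    min_frames = min_bout_length_ms * fps / 1000
    bouts, start = [], None
    # A sentinel "behavior absent" frame closes a bout that runs to the last frame.
    for idx, frame in enumerate(frames + [{clf: 0}]):
        if frame[clf] == 1 and start is None:
            start = idx
        elif frame[clf] != 1 and start is not None:
            if idx - start >= min_frames:
                row = {"CLASSIFIER": clf, "START_FRAME": start, "END_FRAME": idx - 1}
                for name in feature_names:
                    row[name] = agg([f[name] for f in frames[start:idx]])
                bouts.append(row)
            start = None
    return bouts


# Hypothetical frame-wise data: one classifier column and one feature column.
frames = [
    {"Attack": 0, "speed": 1.0},
    {"Attack": 1, "speed": 2.0},
    {"Attack": 1, "speed": 4.0},
    {"Attack": 1, "speed": 9.0},
    {"Attack": 0, "speed": 1.0},
    {"Attack": 1, "speed": 5.0},
]
bouts = aggregate_bouts(frames, "Attack", ["speed"], fps=30, min_bout_length_ms=66, aggregator="MEDIAN")
```

At 30 fps, a 66 ms threshold corresponds to roughly two frames, so the three-frame bout is kept (median speed 4.0) and the trailing one-frame bout is dropped.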
Cluster frequentist statistics calculator
- class simba.unsupervised.cluster_frequentist_calculator.ClusterFrequentistCalculator(config_path, data_path, settings)
Bases:
simba.mixins.unsupervised_mixin.UMLMixin, simba.mixins.config_reader.ConfigReader
Class for computing frequentist statistics based on cluster assignment labels for explainability purposes.
- Parameters
config_path (Union[str, os.PathLike]) – Path to SimBA configparser.ConfigParser project_config.ini.
data_path (Union[str, os.PathLike]) – Path to pickle holding unsupervised results in simba.unsupervised.data_map.yaml format.
settings (dict) – Dict holding which statistical tests to use, with test names as keys and booleans as values.
- Example
>>> settings = {'scaled': True, 'ANOVA': True, 'tukey_posthoc': True, 'descriptive_statistics': True}
>>> calculator = ClusterFrequentistCalculator(config_path='unsupervised/project_folder/project_config.ini', data_path='unsupervised/cluster_models/quizzical_rhodes.pickle', settings=settings)
>>> calculator.run()
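As a rough illustration of what an 'ANOVA' test on cluster labels measures, the sketch below computes a one-way ANOVA F statistic for a single feature split by cluster. It is a dependency-free sketch under the assumption that the test is a standard one-way ANOVA per feature; it is not SimBA's code, and `one_way_anova_f` and the toy data are hypothetical.

```python
from statistics import mean
from typing import Dict, List


def one_way_anova_f(groups: Dict[int, List[float]]) -> float:
    """Return the one-way ANOVA F statistic for one feature split by cluster."""
    values = [v for g in groups.values() for v in g]
    grand_mean = mean(values)
    k, n = len(groups), len(values)
    # Between-cluster sum of squares: cluster means versus the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups.values())
    # Within-cluster sum of squares: observations versus their own cluster mean.
    ss_within = sum((v - mean(g)) ** 2 for g in groups.values() for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Hypothetical feature values keyed by cluster label.
speed_by_cluster = {0: [1.0, 2.0, 3.0], 1: [2.0, 3.0, 4.0], 2: [5.0, 6.0, 7.0]}
f_stat = one_way_anova_f(speed_by_cluster)
```

A large F value indicates that the feature differs more between clusters than within them, which is what makes it useful for explaining cluster assignments.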
Cluster video visualizer
- class simba.unsupervised.cluster_video_visualizer.ClusterVideoVisualizer(config_path, data_path, max_videos=None, speed=1.0, bg_clr=(255, 255, 255), plot_type='SKELETON')
Bases:
simba.mixins.config_reader.ConfigReader, simba.mixins.unsupervised_mixin.UMLMixin
Class for creating video examples of cluster assignments.
- Parameters
config_path (Union[str, os.PathLike]) – Path to SimBA project configuration file.
data_path (Union[str, os.PathLike]) – Path to pickle file containing unsupervised results.
max_videos (Optional[Union[int, None]]) – Maximum number of videos to create for each cluster. Defaults to None.
speed (Optional[float]) – Speed of the generated videos. Defaults to 1.0.
bg_clr (Optional[Tuple[int, int, int]]) – Background color of the videos as RGB tuple. Defaults to white (255, 255, 255).
plot_type (Optional[Literal]) – Type of plot to generate ('VIDEO', 'HULL', 'SKELETON', 'POINTS'). Defaults to 'SKELETON'.
- Example
>>> config_path = '/Users/simon/Desktop/envs/simba/troubleshooting/NG_Unsupervised/project_folder/project_config.ini'
>>> data_path = '/Users/simon/Desktop/envs/simba/troubleshooting/NG_Unsupervised/project_folder/cluster_mdls/hopeful_khorana.pickle'
>>> visualizer = ClusterVideoVisualizer(config_path=config_path, data_path=data_path, bg_clr=(0, 0, 255), max_videos=20, speed=0.2, plot_type='POINTS')
>>> visualizer.run()
Cluster XAI calculator
- class simba.unsupervised.cluster_xai_calculator.ClusterXAICalculator(data_path, config_path, settings)
Bases:
simba.mixins.unsupervised_mixin.UMLMixin, simba.mixins.config_reader.ConfigReader
Class for building RF models on top of cluster assignments and calculating latent space explainability metrics based on the RF models.
- Parameters
config_path (Union[str, os.PathLike]) – Path to SimBA configparser.ConfigParser project_config.ini.
data_path (Union[str, os.PathLike]) – Path to pickle holding unsupervised results in data_map.yaml format.
settings (Dict[str, Any]) – Dict holding which explainability tests to use.
- Example
>>> settings = {'gini_importance': True, 'permutation_importance': True, 'shap': {'method': 'cluster_paired', 'create': True, 'sample': 100}}
>>> calculator = ClusterXAICalculator(config_path='unsupervised/project_folder/project_config.ini', data_path='unsupervised/cluster_models/quizzical_rhodes.pickle', settings=settings)
>>> calculator.run()
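To make the 'permutation_importance' setting more concrete, here is a minimal sketch of the general technique: shuffle one feature at a time and record the mean drop in accuracy. It uses a hypothetical rule-based `toy_model` in place of a fitted random forest, and `permutation_importance` here is an illustrative helper, not the function SimBA calls.

```python
import random
from typing import Callable, List, Sequence


def permutation_importance(
    predict: Callable[[Sequence[float]], int],
    X: List[List[float]],
    y: List[int],
    n_repeats: int = 10,
    seed: int = 0,
) -> List[float]:
    """Mean drop in accuracy when each feature column is shuffled in turn."""
    rng = random.Random(seed)

    def accuracy(rows: List[List[float]]) -> float:
        return sum(predict(r) == label for r, label in zip(rows, y)) / len(y)

    baseline = accuracy(X)
    importances = []
    for col in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            shuffled = [row[col] for row in X]
            rng.shuffle(shuffled)
            # Replace only the column under test; all other features stay intact.
            permuted = [row[:col] + [v] + row[col + 1:] for row, v in zip(X, shuffled)]
            drops.append(baseline - accuracy(permuted))
        importances.append(sum(drops) / n_repeats)
    return importances


# Hypothetical stand-in for a fitted classifier: cluster 1 if feature 0 exceeds 0.5.
def toy_model(row: Sequence[float]) -> int:
    return int(row[0] > 0.5)


X = [[0.1, 3.0], [0.9, 1.0], [0.2, 2.0], [0.8, 4.0], [0.3, 5.0], [0.7, 0.0]]
y = [0, 1, 0, 1, 0, 1]
scores = permutation_importance(toy_model, X, y)
```

Because the toy model ignores the second feature, shuffling it never changes the predictions and its importance is exactly zero, while shuffling the first feature lowers accuracy.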
Data extractor
- class simba.unsupervised.data_extractor.DataExtractor(config_path, data_path, data_types, settings=None)
Bases:
simba.mixins.unsupervised_mixin.UMLMixin, simba.mixins.config_reader.ConfigReader
Extracts human-readable data from a directory of pickles or a single pickle file that holds unsupervised analyses.
- Parameters
config_path – Path to SimBA configparser.ConfigParser project_config.ini.
data_path – Path to pickle holding unsupervised results in data_map.yaml format.
data_types – The types of data to extract, e.g., CLUSTERER_PARAMETERS, DIMENSIONALITY_REDUCTION_PARAMETERS, SCALER, SCALED_DATA, LOW_VARIANCE_FIELDS, FEATURE_NAMES, FRAME_FEATURES, FRAME_POSE, FRAME_TARGET, BOUTS_FEATURES, BOUTS_TARGETS, BOUTS_DIM_CORDS.
settings – User-defined parameters for data extraction.
- Example
>>> extractor = DataExtractor(data_path='unsupervised/cluster_models/awesome_curran.pickle', data_types=['BOUTS_TARGETS'], settings=None, config_path='unsupervised/project_folder/project_config.ini')
>>> extractor.run()
Data creator
- class simba.unsupervised.dataset_creator.DatasetCreator(config_path, settings)
Bases:
simba.mixins.config_reader.ConfigReader, simba.mixins.unsupervised_mixin.UMLMixin
Transforms raw frame-wise supervised classification data into aggregated data for unsupervised analyses. Saves the aggregated data into the logs directory of the SimBA project.
- Parameters
config_path (Union[str, os.PathLike]) – Path to SimBA configparser.ConfigParser project_config.ini.
settings (Dict[str, Any]) – Attributes for which data should be included and how the data should be aggregated.
- Example
>>> settings = {'data_slice': 'ALL FEATURES (EXCLUDING POSE)', 'clf_slice': 'Attack', 'bout_aggregation_type': 'MEDIAN', 'min_bout_length': 66, 'feature_path': '/Users/simon/Desktop/envs/simba_dev/simba/assets/unsupervised/features.csv'}
>>> db_creator = DatasetCreator(config_path='/Users/simon/Desktop/envs/simba/troubleshooting/NG_Unsupervised/project_folder/project_config.ini', settings=settings)
>>> db_creator.run()
Density-based cluster validation (DBCV)
- class simba.unsupervised.dbcv_calculator.DBCVCalculator(config_path, data_path)
Bases:
simba.mixins.unsupervised_mixin.UMLMixin, simba.mixins.config_reader.ConfigReader
Density-based Cluster Validation (DBCV).
Note
Jitted version of DBCV. Runtime is reduced by replacing scipy.spatial.distance.cdist in the original DBCV implementation with LLVM-compiled code. A further non-jitted implementation can be found in the hdbscan library. AWS DenseClus HDBSCAN appears to expose DBCV as an attribute of the returned object, which may be a faster alternative.
- Parameters
References
[1] Moulavi et al., Density-Based Clustering Validation, SIAM 2014, https://doi.org/10.1137/1.9781611973440.96
Examples
>>> dbcv_calculator = DBCVCalculator(data_path='unsupervised/cluster_models', config_path='my_simba_config')
>>> results = dbcv_calculator.run()
- DBCV(X, labels)
- Parameters
X (np.ndarray) – 2D array of shape len(observations) x len(dimensionality reduced dimensions).
labels (np.ndarray) – 1D array with cluster labels.
- Return float
DBCV cluster validity score.
- static calculate_dists(X, arrays_by_cluster)
- Parameters
X (np.ndarray) – 2D array of shape len(observations) x len(dimensionality reduced dimensions).
arrays_by_cluster (numba.types.ListType[list[np.ndarray]]) – Numba typed list of lists holding 2D arrays.
- Return Tuple[np.ndarray, bool]
Graph of all pairwise mutual reachability distances between points, of size X.shape[0] x X.shape[0], and a boolean representing whether any issues were detected, including (i) any cluster consisting of a single observation.
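The mutual reachability graph can be sketched in plain Python from the definitions in Moulavi et al.: the all-points core distance of a point is an inverse-distance power mean over the other members of its cluster, and the mutual reachability distance between two points is the maximum of their two core distances and their Euclidean distance. The sketch below is an assumption-laden illustration, not the jitted SimBA code: `mutual_reachability` is a hypothetical name, single-observation clusters simply get a core distance of 0 and raise the flag, the diagonal is left at zero, and duplicate points (zero distance within a cluster) are not handled.

```python
import math
from typing import Dict, List, Tuple

Point = Tuple[float, ...]


def mutual_reachability(X: List[Point], labels: List[int]) -> Tuple[List[List[float]], bool]:
    """Pairwise mutual reachability distances plus a flag for single-point clusters."""
    dims = len(X[0])
    members: Dict[int, List[int]] = {}
    for idx, label in enumerate(labels):
        members.setdefault(label, []).append(idx)

    core = [0.0] * len(X)
    issues = False
    for idxs in members.values():
        if len(idxs) < 2:
            # A single-observation cluster has no neighbours to define a core distance.
            issues = True
            continue
        for i in idxs:
            # All-points core distance: inverse-distance power mean within the cluster.
            total = sum((1.0 / math.dist(X[i], X[j])) ** dims for j in idxs if j != i)
            core[i] = (total / (len(idxs) - 1)) ** (-1.0 / dims)

    # Mutual reachability: max of both core distances and the Euclidean distance.
    graph = [
        [0.0 if i == j else max(core[i], core[j], math.dist(X[i], X[j])) for j in range(len(X))]
        for i in range(len(X))
    ]
    return graph, issues


# Hypothetical 2D embedding: three points in cluster 0 and a lone point in cluster 1.
X = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
labels = [0, 0, 0, 1]
graph, issues = mutual_reachability(X, labels)
```

In this toy input the lone point in cluster 1 sets the flag, which mirrors the single-observation issue described above.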
Embedding correlation calculator
- class simba.unsupervised.embedding_correlation_calculator.EmbeddingCorrelationCalculator(data_path, config_path, settings)
Bases:
simba.mixins.unsupervised_mixin.UMLMixin, simba.mixins.config_reader.ConfigReader
Class for correlating dimensionality reduction features with original features for explainability purposes.
- Parameters
- Example
>>> settings = {'correlation_methods': ['pearson', 'kendall', 'spearman'], 'plots': {'create': True, 'correlations': 'pearson', 'palette': 'jet'}}
>>> calculator = EmbeddingCorrelationCalculator(config_path='unsupervised/project_folder/project_config.ini', data_path='unsupervised/cluster_models/quizzical_rhodes.pickle', settings=settings)
>>> calculator.run()
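Correlating embedding dimensions with the original features amounts to computing one coefficient per (dimension, feature) pair. The sketch below shows this for the 'pearson' method only, in plain Python; `pearson`, `correlate_embedding`, the dimension names "X" and "Y", and the toy data are hypothetical and do not reflect SimBA internals.

```python
import math
from typing import Dict, List, Tuple


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation coefficient between two equally long sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def correlate_embedding(
    embedding: Dict[str, List[float]], features: Dict[str, List[float]]
) -> Dict[Tuple[str, str], float]:
    """Correlate every embedding dimension with every original feature."""
    return {
        (dim, feat): pearson(dim_vals, feat_vals)
        for dim, dim_vals in embedding.items()
        for feat, feat_vals in features.items()
    }


# Hypothetical two-dimensional embedding and two original features.
embedding = {"X": [0.0, 1.0, 2.0, 3.0], "Y": [3.0, 1.0, 2.0, 0.0]}
features = {"speed": [1.0, 3.0, 5.0, 7.0], "distance": [8.0, 6.0, 4.0, 2.0]}
table = correlate_embedding(embedding, features)
```

Features whose coefficients are close to 1 or -1 for a given dimension are the ones that dimension most plausibly encodes.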
Grid-search visualizer
- class simba.unsupervised.grid_search_visualizers.GridSearchVisualizer(model_dir, save_dir, settings)
Bases:
simba.mixins.unsupervised_mixin.UMLMixin
Visualize grid-searched latent spaces in .png format.
- Parameters
model_dir – Path to pickle holding unsupervised results in data_map.yaml format.
save_dir – Directory holding one or more unsupervised results in pickle data_map.yaml format.
settings – User-defined image attributes (e.g., continuous and categorical palettes).
- Example
>>> settings = {'CATEGORICAL_PALETTE': 'Pastel1', 'CONTINUOUS_PALETTE': 'magma', 'SCATTER_SIZE': 10}
>>> visualizer = GridSearchVisualizer(model_dir='/Users/simon/Desktop/envs/troubleshooting/unsupervised/cluster_models_042023', save_dir='/Users/simon/Desktop/envs/troubleshooting/unsupervised/images', settings=settings)
>>> visualizer.continuous_visualizer(continuous_vars=['START_FRAME'])
>>> visualizer.categorical_visualizer(categoricals=['CLUSTER'])
HDBSCAN clusterer
- class simba.unsupervised.hdbscan_clusterer.HDBSCANClusterer
Bases:
simba.mixins.unsupervised_mixin.UMLMixin
Methods for grid-search HDBSCAN model fit and transform. Defaults to GPU and cuml.cluster.HDBSCAN; if no GPU is available, falls back to hdbscan.HDBSCAN.
- fit(data_path, save_dir, hyper_parameters)
- Parameters
data_path – Path holding pickled unsupervised dimensionality reduction results in data_map.yaml format.
save_dir – Empty directory where to save the HDBSCAN results.
hyper_parameters – Dict holding hyperparameters in list format.
- Returns
- Example I
Grid-search fit:
>>> hyper_parameters = {'alpha': [1.0], 'min_cluster_size': [10], 'min_samples': [1], 'cluster_selection_epsilon': [20]}
>>> embedding_dir = '/Users/simon/Desktop/envs/troubleshooting/unsupervised/dr_models'
>>> save_dir = '/Users/simon/Desktop/envs/troubleshooting/unsupervised/cluster_models'
>>> config_path = '/Users/simon/Desktop/envs/troubleshooting/unsupervised/project_folder/project_config.ini'
>>> clusterer = HDBSCANClusterer()
>>> clusterer.fit(data_path=embedding_dir, save_dir=save_dir, hyper_parameters=hyper_parameters)
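Hyperparameters are passed "in list format" so that several values per parameter can be searched. A plausible reading, stated here as an assumption rather than a description of SimBA's internals, is that one model is fitted per combination in the Cartesian product of the lists. The hypothetical helper `expand_grid` sketches that expansion:

```python
from itertools import product
from typing import Any, Dict, Iterator, List


def expand_grid(hyper_parameters: Dict[str, List[Any]]) -> Iterator[Dict[str, Any]]:
    """Yield one parameter dict per combination of the listed values."""
    keys = list(hyper_parameters)
    for combo in product(*(hyper_parameters[k] for k in keys)):
        yield dict(zip(keys, combo))


# Two values for two of the parameters give 1 * 2 * 2 * 1 = 4 combinations.
hyper_parameters = {
    "alpha": [1.0],
    "min_cluster_size": [10, 20],
    "min_samples": [1, 5],
    "cluster_selection_epsilon": [20],
}
grid = list(expand_grid(hyper_parameters))
```

Under this reading, the example above with a single value per parameter would fit exactly one model.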
- transform(data_path, model, save_dir=None, settings=None, save_format='csv')
- Parameters
data_path – Path to directory holding pickled unsupervised dimensionality reduction results in data_map.yaml format.
model – Path to pickle holding HDBSCAN model in data_map.yaml format.
save_dir – Empty directory where to save the HDBSCAN results. If None, the results are kept in memory under self.results.
settings – User-defined parameters.
- Example I
Transform:
>>> data_path = '/Users/simon/Desktop/envs/simba/troubleshooting/NG_Unsupervised/project_folder/logs/unsupervised_data_20240218134920.pickle'
>>> mdl_path = '/Users/simon/Desktop/envs/simba/troubleshooting/NG_Unsupervised/project_folder/cluster_mdls/hopeful_khorana.pickle'
>>> clusterer = HDBSCANClusterer()
>>> settings = {'DATA_FORMAT': 'scaled', 'CLASSIFICATIONS': True}
>>> results = clusterer.transform(data_path=data_path, model=mdl_path, settings=settings)
UMAP embedder
- class simba.unsupervised.umap_embedder.UmapEmbedder
Bases:
simba.mixins.unsupervised_mixin.UMLMixin
Methods for grid-search UMAP model fit and transform. Defaults to GPU and cuml.UMAP if a GPU is available; otherwise falls back to umap.UMAP.
- Parameters
data_path – Path holding pickled data-set created by simba.unsupervised.dataset_creator.DatasetCreator.
save_dir – Empty directory where to save the UMAP results.
hyper_parameters – Dict holding UMAP hyperparameters in list format.
- Example I
Fit.
>>> hyper_parameters = {'n_neighbors': [10, 2], 'min_distance': [1.0], 'spread': [1.0], 'scaler': 'MIN-MAX', 'variance': 0.25, "multicolinearity": 0.5}
>>> data_path = 'unsupervised/project_folder/logs/unsupervised_data_20230416145821.pickle'
>>> save_dir = 'unsupervised/dr_models'
>>> config_path = 'unsupervised/project_folder/project_config.ini'
>>> embedder = UmapEmbedder()
>>> embedder.fit(data_path=data_path, save_dir=save_dir, hyper_parameters=hyper_parameters)
- transform(data_path, model, settings=None, save_dir=None, save_format='csv')
- Example I
Transform.
>>> data_path = '/Users/simon/Desktop/envs/simba/troubleshooting/NG_Unsupervised/project_folder/logs/unsupervised_data_20240215143716.pickle'
>>> model_path = '/Users/simon/Desktop/envs/simba/troubleshooting/NG_Unsupervised/project_folder/clustering1704/academic_montalcini.pickle'
>>> embedder = UmapEmbedder()
>>> embedder.transform(save_dir=None, data_path=data_path, model=model_path, settings=None)
>>> embedder.transform(save_dir=None, data_path=data_path, model=model_path, settings={'DATA_FORMAT': 'scaled', 'CLASSIFICATIONS': True})