Model mixin

Utilities for fit, inference, and evaluation of classifiers.

class simba.mixins.train_model_mixin.TrainModelMixin

Train model methods

bout_train_test_splitter(x_df: DataFrame, y_df: DataFrame, test_size: float) → Tuple[DataFrame, DataFrame, Series, Series]

Helper to split train and test based on annotated bouts.

_images/bout_vs_frames_split.png
Parameters
  • x_df (pd.DataFrame) – Features

  • y_df (pd.Series) – Target

  • test_size (float) – Size of test as ratio of all annotated bouts (e.g., 0.2).

Returns

Size-4 tuple with DataFrames or Series representing (i) features for training, (ii) features for testing, (iii) target for training, (iv) target for testing.

Return type

Tuple[pd.DataFrame, pd.DataFrame, pd.Series, pd.Series]

Examples

>>> x = pd.DataFrame(data=[[11, 23, 12], [87, 65, 76], [23, 73, 27], [10, 29, 2], [12, 32, 42], [32, 73, 2], [21, 83, 98], [98, 1, 1]])
>>> y = pd.Series([0, 0, 1, 1, 0, 0, 1, 1])
>>> x_train, x_test, y_train, y_test = TrainModelMixin().bout_train_test_splitter(x_df=x, y_df=y, test_size=0.5)
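
The idea of splitting by bouts rather than by individual frames can be sketched as follows. This is a minimal illustration, not SimBA's implementation: the helper name `split_by_runs` and the choice to treat every run of identical labels as one unit are assumptions made for the example.

```python
import numpy as np
import pandas as pd

def split_by_runs(x_df, y, test_size, seed=0):
    """Hypothetical helper: assign whole runs of identical labels to train or test."""
    x_df = x_df.reset_index(drop=True)
    y = pd.Series(y).reset_index(drop=True)
    # A new run starts wherever the label differs from the previous frame.
    run_id = (y != y.shift()).cumsum() - 1
    n_runs = int(run_id.max()) + 1
    n_test = max(1, int(round(n_runs * test_size)))
    rng = np.random.default_rng(seed)
    test_runs = rng.choice(n_runs, size=n_test, replace=False)
    is_test = run_id.isin(test_runs)
    return x_df[~is_test], x_df[is_test], y[~is_test], y[is_test], run_id

x = pd.DataFrame(np.arange(24).reshape(8, 3))
y = pd.Series([0, 0, 1, 1, 0, 0, 1, 1])
x_train, x_test, y_train, y_test, run_id = split_by_runs(x, y, test_size=0.5)
```

Because whole runs are moved together, neighbouring frames from the same bout never end up on both sides of the split, which is the property the figure above contrasts with a frame-wise split.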
calc_learning_curve(x_y_df: DataFrame, clf_name: str, shuffle_splits: int, dataset_splits: int, tt_size: float, rf_clf: RandomForestClassifier, save_dir: Union[str, PathLike], save_file_no: Optional[int] = None, multiclass: Optional[bool] = False, scoring: Optional[str] = 'f1', plot: Optional[bool] = True) → None

Helper to compute random forest learning curves with cross-validation.

_images/learning_curves.png
Parameters
  • x_y_df (pd.DataFrame) – Dataframe holding features and target.

  • clf_name (str) – Name of the classifier.

  • shuffle_splits (int) – Number of cross-validation datasets at each data split.

  • dataset_splits (int) – Number of data splits.

  • tt_size (float) – The size of the test set as a ratio of the dataset. E.g., 0.2.

  • rf_clf (RandomForestClassifier) – A sklearn RandomForestClassifier object.

  • save_dir (str) – Directory where to save output in CSV file format.

  • save_file_no (Optional[int]) – If integer, represents the count of the classifier within a grid search. If None, the classifier is not part of a grid search.

  • multiclass (bool) – If True, the target consists of several categories [0, 1, 2, …] and scoring becomes None. If False, scoring is f1.

  • scoring (Optional[str]) – The score of the models to present. Default: 'f1'.

  • plot (Optional[bool]) – If True, creates plot with the train fraction size on x and scoring on y.

Returns

None. Results are stored in save_dir.
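
How the `shuffle_splits`, `dataset_splits` and `tt_size` arguments could map onto scikit-learn's learning-curve utilities is sketched below. This is an assumption about the mapping, shown on synthetic data, and not the SimBA implementation.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import ShuffleSplit, learning_curve

# Synthetic stand-in for the features and target (illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

shuffle_splits, dataset_splits, tt_size = 3, 4, 0.2

# shuffle_splits random train/test partitions are evaluated at each training-set size.
cv = ShuffleSplit(n_splits=shuffle_splits, test_size=tt_size, random_state=0)
train_sizes, train_scores, test_scores = learning_curve(
    RandomForestClassifier(n_estimators=20, random_state=0),
    X, y,
    cv=cv,
    scoring="f1",
    train_sizes=np.linspace(0.1, 1.0, dataset_splits),
)

# One row per training-set size, one column per shuffle split.
mean_test_f1 = test_scores.mean(axis=1)
```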

calc_permutation_importance(x_test: ndarray, y_test: ndarray, clf: RandomForestClassifier, feature_names: List[str], clf_name: str, save_dir: Optional[Union[str, PathLike]] = None, save_file_no: Optional[int] = None, plot: Optional[bool] = True, n_repeats: Optional[int] = 10) → Union[None, Tuple[DataFrame, Union[None, ndarray]]]

Computes feature permutation importance scores.

Parameters
  • x_test (np.ndarray) – 2d feature test data of shape len(frames) x len(features).

  • y_test (np.ndarray) – 2d feature target test data of shape len(frames) x 1.

  • clf (RandomForestClassifier) – Random forest classifier object.

  • feature_names (List[str]) – Names of features in x_test.

  • clf_name (str) – Name of classifier in y_test.

  • save_dir (str) – Directory where to save results in CSV format. If None, then returns the dataframe and the plot (if plot is True).

  • plot (Optional[bool]) – If True, creates bar plot chart and saves in same directory as the CSV file.

  • save_file_no (Optional[int]) – If permutation importance calculation is part of a grid search, provide integer identifier representing the model in the grid search sequence. This will be used as suffix in output filename.

Returns

Either None or a Tuple with the dataframe and the plot. A CSV file representing the permutation importances is stored in save_dir if save_dir is passed.
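
Permutation importance measures how much a score drops when one feature column is shuffled. A minimal sketch using scikit-learn's `permutation_importance` on synthetic data is shown below; the output column names are hypothetical and the SimBA method may organise its results differently.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# Synthetic data in which only feature "f0" carries signal (illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 4))
y = (X[:, 0] > 0).astype(int)
feature_names = ["f0", "f1", "f2", "f3"]
x_train, x_test, y_train, y_test = X[:300], X[300:], y[:300], y[300:]

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(x_train, y_train)

# Each feature is shuffled n_repeats times on the held-out data.
result = permutation_importance(clf, x_test, y_test, n_repeats=10, random_state=0)

importance_df = (
    pd.DataFrame({
        "FEATURE": feature_names,            # hypothetical column names
        "MEAN": result.importances_mean,
        "STDEV": result.importances_std,
    })
    .sort_values("MEAN", ascending=False)
    .reset_index(drop=True)
)
```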

calc_pr_curve(rf_clf: RandomForestClassifier, x_df: DataFrame, y_df: DataFrame, clf_name: str, save_dir: Union[str, PathLike], multiclass: bool = False, plot: Optional[bool] = True, classifier_map: Optional[Dict[int, str]] = None, save_file_no: Optional[int] = None) → None

Compute random forest precision-recall curve.

_images/pr_curves.png
Parameters
  • rf_clf (RandomForestClassifier) – sklearn RandomForestClassifier object.

  • x_df (pd.DataFrame) – Pandas dataframe holding test features.

  • y_df (pd.DataFrame) – Pandas dataframe holding test target.

  • clf_name (str) – Classifier name.

  • save_dir (str) – Directory where to save output in CSV file format.

  • multiclass (bool) – If the classifier is a multi-classifier. Default: False.

  • plot (Optional[bool]) – If True, creates and saves line plot PR curve in the same location as the output CSV file.

  • classifier_map (Dict[int, str]) – If multiclass, dictionary mapping integers to classifier names.

  • save_file_no (Optional[int]) – If integer, represents the count of the classifier within a grid search. If None, the classifier is not part of a grid search.

Returns

None. Results are stored in save_dir.
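
For the binary case, a precision-recall curve can be derived from positive-class probabilities as sketched below with scikit-learn's `precision_recall_curve`. The four-sample arrays are made up for illustration, and the F1 column is an addition of this sketch, not necessarily part of SimBA's output.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Toy ground truth and positive-class probabilities (illustration only).
y_true = np.array([0, 0, 1, 1])
p_present = np.array([0.1, 0.4, 0.35, 0.8])

precision, recall, thresholds = precision_recall_curve(y_true, p_present)

# The final (precision=1, recall=0) point has no threshold, so drop it
# before computing F1 at each threshold.
f1 = 2 * precision[:-1] * recall[:-1] / (precision[:-1] + recall[:-1])
```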

check_df_dataset_integrity(df: DataFrame, file_name: str, logs_path: Union[str, PathLike]) → None

Helper to check for non-numerical values (np.inf, -np.inf, NaN, None) in a single dataframe.

Parameters

df (pd.DataFrame) – Features

Raises

NoDataError – If data contains np.inf, -np.inf, None.

check_raw_dataset_integrity(df: DataFrame, logs_path: Optional[Union[str, PathLike]]) → None

Helper to check column-wise NaNs in raw input data for fitting model.

Parameters
  • df (pd.DataFrame) – Raw input data for fitting the model.

  • logs_path (str) – The logs directory of the SimBA project.

Raises

FaultyTrainingSetError – When the dataset contains NaNs

check_sampled_dataset_integrity(x_df: DataFrame, y_df: DataFrame) → None

Helper to check for non-numerical entries post data sampling.

Parameters
  • x_df (pd.DataFrame) – Features

  • y_df (pd.DataFrame) – Target

Raises

FaultyTrainingSetError – Training or testing data sets contain non-numerical values

check_validity_of_meta_files(data_df: DataFrame, meta_file_paths: List[Union[str, PathLike]])

clf_define(n_estimators: Optional[int] = 2000, max_depth: Optional[int] = None, max_features: Optional[Union[int, str]] = 'sqrt', n_jobs: Optional[int] = -1, criterion: Optional[str] = 'gini', min_samples_leaf: Optional[int] = 1, bootstrap: Optional[bool] = True, verbose: Optional[int] = 1, class_weight: Optional[dict] = None, cuda: Optional[bool] = False) → RandomForestClassifier

clf_fit(clf: RandomForestClassifier, x_df: DataFrame, y_df: DataFrame, selected_feature_names: Optional[List[str]] = None, verbose: bool = False) → RandomForestClassifier

Helper to fit clf model.

EXPECTED RUNTIMES

FEATURE COUNT   OBSERVATION COUNT   GPU TIME (S)   CPU TIME (S)
500             1000                9.7055         3.6682
500             10000               14.1444        18.6429
500             100000              74.8791        400.252
750             1000                7.0187         4.0107
750             10000               14.9369        22.482
750             100000              87.2037        449.0292
1000            1000                8.3954         3.9517
1000            10000               17.4793        25.5511
1000            100000              100.3192       515.217

MAX DEPTH: 32; ESTIMATORS: 2k; CPU: 32 cores; GPU: NVIDIA GeForce RTX 4070

See also

To define a cuml/sklearn object, see simba.mixins.train_model_mixin.TrainModelMixin.clf_define()

Parameters
  • clf – Un-fitted random forest classifier object, either from sklearn or cuml.

  • x_df (pd.DataFrame) – Pandas dataframe with features.

  • y_df (pd.DataFrame) – Pandas dataframe/Series with target.

  • selected_feature_names (Optional[List[str]]) – Optional subset of feature column names from x_df to fit on. If None, fits on all features. Default: None.

Returns

Fitted random forest classifier object

Return type

RandomForestClassifier

clf_predict_proba(clf: RandomForestClassifier, x_df: Union[DataFrame, ndarray], multiclass: bool = False, model_name: Optional[str] = None, data_path: Optional[Union[str, PathLike]] = None, verbose: bool = False) → ndarray

Helper to predict class probabilities using a fitted random forest classifier.

Computes prediction probabilities for binary or multiclass classification using either scikit-learn or cuML RandomForestClassifier. For binary classifiers, returns the probability of the positive class (class 1). For multiclass classifiers, returns probabilities for all classes.

EXPECTED RUNTIMES

OBSERVATION COUNT   GPU TIME (S)   CPU TIME (S)
100_000             1.5299         4.0823
500_000             2.8537         24.9888
1_000_000           9.2034         51.5734
1_500_000           23.548         83.1209
2_000_000           50.6484        131.5

CLF X 10k obs / 750 features; MAX DEPTH: 32; ESTIMATORS: 2k; CPU: 32 cores; GPU: NVIDIA GeForce RTX 4070

Parameters
  • clf (Union[RandomForestClassifier, cuRF]) – Fitted random forest classifier object from sklearn or cuml.

  • x_df (Union[pd.DataFrame, np.ndarray]) – Features for data to predict. DataFrame or array of shape (n_samples, n_features).

  • multiclass (bool) – If True, the classifier predicts more than 2 classes. If False, binary classifier (default: False).

  • model_name (Optional[str]) – Name of the model for error messages and logging. Default: None.

  • data_path (Optional[Union[str, os.PathLike]]) – Path to the data file being processed, used in error messages. Default: None.

  • verbose (bool) – If True, print inference progress and timing information. Default: False.

Returns

Prediction probabilities. For binary classifiers: 1D array of shape (n_samples,) with probability of positive class. For multiclass: 2D array of shape (n_samples, n_classes) with probabilities for each class.

Return type

np.ndarray
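
The two return shapes described above can be illustrated directly with scikit-learn. This is a sketch on synthetic data, assuming a plain sklearn `RandomForestClassifier`; it does not reproduce SimBA's validation or logging.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic binary problem (illustration only).
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 3))
y = (x[:, 0] > 0).astype(int)
clf = RandomForestClassifier(n_estimators=25, random_state=0).fit(x, y)

proba = clf.predict_proba(x)   # shape (n_samples, n_classes): the multiclass-style result
p_present = proba[:, 1]        # binary case: keep only the positive class (class 1)
```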

create_clf_report(rf_clf: RandomForestClassifier, x_df: DataFrame, y_df: DataFrame, class_names: List[str], save_dir: Union[str, PathLike], digits: Optional[int] = 4, clf_name: Optional[str] = None, img_size: tuple = (2500, 4500), cmap: str = 'coolwarm', threshold: float = 0.5, svg: bool = False, save_file_no: Optional[int] = None, dpi: int = 300) → None

Create classifier truth table report.

Generates a classification report heatmap visualization showing precision, recall, F1-score, and support for each class. The report is displayed as a heatmap with annotations showing metric values. Predictions are made using the provided threshold to convert probabilities to binary predictions.

See also

Documentation

_images/clf_report.png _static/img/clf_report.webp _images/clf_report_mosaic.webp
Parameters
  • rf_clf (Union[RandomForestClassifier, cuRF]) – sklearn RandomForestClassifier or cuRF object.

  • x_df (pd.DataFrame) – DataFrame holding test features. Must match the feature set used for training.

  • y_df (pd.DataFrame) – DataFrame holding test target values. Should be binary (0/1) for binary classification.

  • class_names (List[str]) – List of class names. E.g., ['Attack absent', 'Attack present']. Must match the order of classes in the classifier.

  • save_dir (Union[str, os.PathLike]) – Directory where to save the classification report image.

  • digits (Optional[int]) – Number of decimal places in the classification report metrics. Default: 4.

  • clf_name (Optional[str]) – Name of the classifier. If not None, used in the output filename. If None, uses class_names[1].

  • img_size (Tuple[int, int]) – Size of the output image in pixels (width, height). Default: (2500, 4500).

  • cmap (str) – Colormap palette for the heatmap. Default: "coolwarm" (blue to red).

  • threshold (float) – Classification threshold for converting probabilities to binary predictions. Values above threshold become 1, below become 0. Default: 0.5.

  • svg (bool) – If True, save as SVG format. If False (default), save as PNG format.

  • save_file_no (Optional[int]) – If integer, represents the count of the classifier within a grid search. Used in filename generation. If None, the classifier is not part of a grid search.

  • dpi (int) – Resolution (dots per inch) for the output image. Default: 300.

Returns

None. Classification report image is saved to save_dir.
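
The numbers behind such a report can be sketched without the plotting step: probabilities are thresholded into 0/1 predictions, which are then summarised per class. The arrays below are made up for illustration, and using scikit-learn's `classification_report` here is an assumption of this sketch rather than a statement about SimBA's internals.

```python
import numpy as np
from sklearn.metrics import classification_report

class_names = ["Attack absent", "Attack present"]
y_true = np.array([0, 0, 1, 1, 1, 0])
p_present = np.array([0.1, 0.6, 0.8, 0.4, 0.9, 0.2])  # toy positive-class probabilities

threshold = 0.5
y_pred = (p_present > threshold).astype(int)  # values above threshold become 1

# output_dict=True returns nested dicts instead of a formatted string,
# which is convenient when the values are to be drawn as a heatmap.
report = classification_report(
    y_true, y_pred, target_names=class_names, digits=4, output_dict=True
)
```

Raising `threshold` trades recall of the "present" class for precision, which is why the threshold is exposed as a parameter.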

create_example_dt(rf_clf: RandomForestClassifier, clf_name: str, feature_names: List[str], class_names: List[str], save_dir: str, tree_id: Optional[int] = 3, save_file_no: Optional[int] = None) → None

Helper to produce visualization of random forest decision tree using graphviz.

_images/create_example_dt.png
Parameters
  • rf_clf (RandomForestClassifier) – sklearn RandomForestClassifier object.

  • clf_name (str) – Classifier name.

  • feature_names (List[str]) – List of feature names.

  • class_names (List[str]) – List of classes. E.g., ['Attack absent', 'Attack present'].

  • save_dir (str) – Directory where to save output in CSV file format.

  • save_file_no (Optional[int]) – If integer, represents the count of the classifier within a grid search. If None, the classifier is not part of a grid search.

create_meta_data_csv_training_multiple_models(meta_data, clf_name, save_dir, save_file_no: Optional[int] = None) → None

create_meta_data_csv_training_one_model(meta_data_lst: list, clf_name: str, save_dir: Union[str, PathLike]) → None

Helper to save single model meta data (hyperparameters, sampling settings etc.) from list format into SimBA compatible CSV config file.

Parameters
  • meta_data_lst (list) – Meta data in list format.

  • clf_name (str) – Name of classifier.

  • save_dir (str) – Directory where to save output in CSV file format.

create_shap_log(rf_clf: RandomForestClassifier, x: Union[DataFrame, ndarray], y: Union[DataFrame, Series, ndarray], x_names: List[str], clf_name: str, cnt_present: int, cnt_absent: int, verbose: bool = True, plot: bool = True, save_it: Optional[int] = 100, save_dir: Optional[Union[str, PathLike]] = None, save_file_suffix: Optional[int] = None) → Union[None, Tuple[DataFrame, DataFrame, Dict[str, DataFrame], ndarray]]

Compute SHAP values for a random forest classifier.

This method computes SHAP (SHapley Additive exPlanations) values for a given random forest classifier. The SHAP value for feature 'i' in the context of a prediction 'f' and input 'x' is calculated using the following formula:

\phi_i(f, x) = \sum_{S \subseteq F \setminus \{i\}} \frac{|S|!(|F| - |S| - 1)!}{|F|!} (f_{S \cup \{i\}}(x_{S \cup \{i\}}) - f_S(x_S))
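
To make the formula concrete, the sketch below evaluates it by brute force over every subset S for a toy value function with three features. It only illustrates the weighting; TreeSHAP reaches the same quantity far more efficiently for tree models, and the names `shapley_values` and `toy_value` are hypothetical.

```python
from itertools import combinations
from math import factorial

def shapley_values(value_fn, n_features):
    """Exact Shapley values for a set function value_fn(frozenset) -> float."""
    features = range(n_features)
    phi = []
    for i in features:
        others = [j for j in features if j != i]
        total = 0.0
        for size in range(len(others) + 1):
            for subset in combinations(others, size):
                s = frozenset(subset)
                # |S|! (|F| - |S| - 1)! / |F|!
                weight = (factorial(len(s)) * factorial(n_features - len(s) - 1)
                          / factorial(n_features))
                # Marginal contribution of feature i when added to S.
                total += weight * (value_fn(s | {i}) - value_fn(s))
        phi.append(total)
    return phi

def toy_value(subset):
    """Toy model output when only the features in `subset` are known."""
    value = 0.0
    if 0 in subset:
        value += 2.0
    if 1 in subset:
        value += 1.0
    if 0 in subset and 1 in subset:
        value += 1.0  # interaction term, shared between features 0 and 1
    return value

phi = shapley_values(toy_value, n_features=3)
```

The values sum to the difference between the full-feature output and the empty-set output, and the unused third feature receives zero.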

Note

Documentation Uses TreeSHAP Documentation

_images/shap.png

See also

For multicore solution, see create_shap_log_mp(). For GPU method, see create_shap_log().

Parameters
  • rf_clf (RandomForestClassifier) – sklearn random forest classifier.

  • x (Union[pd.DataFrame, np.ndarray]) – Test features.

  • y (Union[pd.DataFrame, pd.Series, np.ndarray]) – Test target.

  • x_names (List[str]) – Feature names.

  • clf_name (str) – Classifier name.

  • cnt_present (int) – Number of behavior-present frames to calculate SHAP values for.

  • cnt_absent (int) – Number of behavior-absent frames to calculate SHAP values for.

  • save_it (int) – Save iteration cadence. If None, then only saves at completion.

  • save_dir (str) – Optional directory where to save output in CSV file format. If None, the data is returned.

  • save_file_suffix (Optional[int]) – If integer, represents the count of the classifier within a grid search. If None, the classifier is not part of a grid search.

Example

>>> import numpy as np
>>> import pandas as pd
>>> from simba.mixins.train_model_mixin import TrainModelMixin
>>> x_cols = list(pd.read_csv('/Users/simon/Desktop/envs/simba/simba/tests/data/sample_data/shap_test.csv', index_col=0).columns)
>>> x = pd.DataFrame(np.random.randint(0, 500, (9000, len(x_cols))), columns=x_cols)
>>> y = pd.Series(np.random.randint(0, 2, (9000,)))
>>> rf_clf = TrainModelMixin().clf_define(n_estimators=100)
>>> rf_clf = TrainModelMixin().clf_fit(clf=rf_clf, x_df=x, y_df=y)
>>> feature_names = [str(col) for col in x.columns]
>>> TrainModelMixin().create_shap_log(rf_clf=rf_clf, x=x, y=y, x_names=feature_names, clf_name='test', save_it=10, cnt_present=50, cnt_absent=50, plot=True, save_dir=r'/Users/simon/Desktop/feltz')
create_shap_log_concurrent_mp(rf_clf: Union[RandomForestClassifier, str, PathLike], x: Union[DataFrame, ndarray], y: Union[DataFrame, Series, ndarray], x_names: List[str], clf_name: str, cnt_present: int, cnt_absent: int, core_cnt: int = -1, chunk_size: int = 100, verbose: bool = True, save_dir: Optional[Union[str, PathLike]] = None, save_file_suffix: Optional[int] = None, plot: bool = False) → Union[None, Tuple[DataFrame, DataFrame, Dict[str, DataFrame], ndarray]]

Compute SHAP values using multiprocessing.

See also

Documentation

For single-core solution, see create_shap_log(). For GPU method, see create_shap_log(). For multiprocessing imap method (reliably runs on Windows and Mac), see create_shap_log_mp().

_images/shap.png
Parameters
  • rf_clf (Union[RandomForestClassifier, str, os.PathLike]) – Fitted sklearn random forest classifier, or path to a fitted, pickled sklearn random forest classifier.

  • x (Union[pd.DataFrame, np.ndarray]) – Test features.

  • y (Union[pd.DataFrame, pd.Series, np.ndarray]) – Test target.

  • x_names (List[str]) – Feature names.

  • clf_name (str) – Classifier name.

  • cnt_present (int) – Number of behavior-present frames to calculate SHAP values for.

  • cnt_absent (int) – Number of behavior-absent frames to calculate SHAP values for.

  • chunk_size (int) – How many observations to process in each chunk. Increase value for faster processing if your memory allows.

  • verbose (bool) – If True, prints progress.

  • save_dir (Optional[Union[str, os.PathLike]]) – Optional directory where to store the results. If None, then the results are returned.

  • save_file_suffix (Optional[int]) – Optional suffix to add to the SHAP output filenames. Useful for grid searches where multiple SHAP output files are to be stored in the same save_dir.

  • plot (bool) – If True, create SHAP aggregation and plots.

Example

>>> from simba.mixins.config_reader import ConfigReader
>>> from simba.mixins.train_model_mixin import TrainModelMixin
>>> from simba.utils.read_write import read_df
>>> CONFIG_PATH = r"C:/troubleshooting/mitra/project_folder/project_config.ini"
>>> RF_PATH = r"C:/troubleshooting/mitra/models/validations/straub_tail_5_new/straub_tail_5.sav"
>>> DATA_PATH = r"C:/troubleshooting/mitra/project_folder/csv/targets_inserted/new_straub/appended/501_MA142_Gi_CNO_0514.csv"
>>> config = ConfigReader(config_path=CONFIG_PATH)
>>> df = read_df(file_path=DATA_PATH, file_type='csv')
>>> y = df['straub_tail']
>>> x = df.drop(['immobility', 'rearing', 'grooming', 'circling', 'shaking', 'lay-on-belly', 'straub_tail'], axis=1)
>>> x = x.drop(config.bp_col_names, axis=1)
>>> TrainModelMixin().create_shap_log_concurrent_mp(rf_clf=RF_PATH, x=x, y=y, x_names=list(x.columns), clf_name='straub_tail', cnt_absent=100, cnt_present=10, core_cnt=10)
create_shap_log_mp(rf_clf: RandomForestClassifier, x: Union[DataFrame, ndarray], y: Union[DataFrame, Series, ndarray], x_names: List[str], clf_name: str, cnt_present: int, cnt_absent: int, core_cnt: int = -1, chunk_size: int = 100, verbose: bool = True, save_dir: Optional[Union[str, PathLike]] = None, save_file_suffix: Optional[int] = None, plot: bool = False) → Union[None, Tuple[DataFrame, DataFrame, Dict[str, DataFrame], ndarray]]

Compute SHAP values using multiprocessing.

See also

Documentation

For single-core solution, see create_shap_log(). For GPU method, see create_shap_log(). For multiprocessing concurrent futures method (should be more reliable on Linux distros), see create_shap_log_concurrent_mp().

_images/shap.png
Parameters
  • rf_clf (RandomForestClassifier) – Fitted sklearn random forest classifier.

  • x (Union[pd.DataFrame, np.ndarray]) – Test features.

  • y (Union[pd.DataFrame, pd.Series, np.ndarray]) – Test target.

  • x_names (List[str]) – Feature names.

  • clf_name (str) – Classifier name.

  • cnt_present (int) – Number of behavior-present frames to calculate SHAP values for.

  • cnt_absent (int) – Number of behavior-absent frames to calculate SHAP values for.

  • chunk_size (int) – How many observations to process in each chunk. Increase value for faster processing if your memory allows.

  • verbose (bool) – If True, prints progress.

  • save_dir (Optional[Union[str, os.PathLike]]) – Optional directory where to store the results. If None, then the results are returned.

  • save_file_suffix (Optional[int]) – Optional suffix to add to the SHAP output filenames. Useful for grid searches where multiple SHAP output files are to be stored in the same save_dir.

  • plot (bool) – If True, create SHAP aggregation and plots.

Example

>>> from simba.mixins.train_model_mixin import TrainModelMixin
>>> x_cols = list(pd.read_csv('/Users/simon/Desktop/envs/simba/simba/tests/data/sample_data/shap_test.csv', index_col=0).columns)
>>> x = pd.DataFrame(np.random.randint(0, 500, (9000, len(x_cols))), columns=x_cols)
>>> y = pd.Series(np.random.randint(0, 2, (9000,)))
create_x_importance_bar_chart(rf_clf: RandomForestClassifier, x_names: list, clf_name: str, save_dir: str, n_bars: int, palette: Optional[str] = 'hot', save_file_no: Optional[int] = None) → None

Helper to create a bar chart displaying the top N gini or entropy feature importance scores.

See also

Documentation

_images/gini_bar_chart.png
Parameters
  • rf_clf (RandomForestClassifier) – sklearn RandomForestClassifier object.

  • x_names (List[str]) – Names of features.

  • clf_name (str) – Name of classifier.

  • save_dir (str) – Directory where to save output in CSV file format.

  • n_bars (int) – Number of bars in the plot.

  • save_file_no (Optional[int]) – If integer, represents the count of the classifier within a grid search. If None, the classifier is not part of a grid search.

Returns

None. Results are stored in save_dir.

create_x_importance_log(rf_clf: RandomForestClassifier, x_names: List[str], clf_name: str, precision: int = 25, sort_ascending: bool = False, verbose: bool = True, save_dir: Optional[str] = None, save_file_no: Optional[int] = None) → Union[None, DataFrame]

Compute gini / entropy based feature importance scores.

Calculates mean and standard deviation of feature importances across all trees in the RandomForestClassifier. Results are sorted by mean importance (descending by default) and can be saved to CSV or returned as a DataFrame.

See also

To plot gini / entropy based feature importance scores, see create_x_importance_bar_chart()

Parameters
  • rf_clf (Union[RandomForestClassifier, cuRF]) – sklearn RandomForestClassifier or cuRF object.

  • x_names (List[str]) – Names of features. Must match the number of features in the classifier.

  • clf_name (str) – Name of classifier. Used in output filename if save_dir is provided.

  • precision (int) – Number of decimal places for rounding feature importance values. Default: 25.

  • sort_ascending (bool) – If True, sort features by importance in ascending order. If False (default), sort in descending order.

  • verbose (bool) – If True (default), print progress messages. If False, suppress output.

  • save_dir (Optional[str]) – Directory where to save output in CSV file format. If None, then returns the DataFrame instead of saving.

  • save_file_no (Optional[int]) – If integer, represents the count of the classifier within a grid search. Used in filename generation. If None, the classifier is not part of a grid search.

Returns

If save_dir is provided, returns None and saves CSV file. If save_dir is None, returns DataFrame with columns: 'FEATURE', 'FEATURE_IMPORTANCE_MEAN', 'FEATURE_IMPORTANCE_STDEV'.

Return type

Union[None, pd.DataFrame]
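
The mean and standard deviation across trees described above can be sketched for a sklearn forest as follows. The data is synthetic, and this is an illustration of the calculation under the assumption of a plain sklearn `RandomForestClassifier`, not the SimBA code path (which also handles cuRF objects).

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic data where only feature "c" carries signal (illustration only).
rng = np.random.default_rng(0)
x = rng.normal(size=(300, 4))
y = (x[:, 2] > 0).astype(int)
x_names = ["a", "b", "c", "d"]
rf_clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(x, y)

# One row of impurity-based importances per tree in the forest.
per_tree = np.array([tree.feature_importances_ for tree in rf_clf.estimators_])

importance_df = (
    pd.DataFrame({
        "FEATURE": x_names,
        "FEATURE_IMPORTANCE_MEAN": per_tree.mean(axis=0),
        "FEATURE_IMPORTANCE_STDEV": per_tree.std(axis=0),
    })
    .sort_values(by="FEATURE_IMPORTANCE_MEAN", ascending=False)
    .reset_index(drop=True)
)
```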

cuml_rf_x_importances(nodes: dict, n_features: int) → ndarray

Method for computing feature importances from cuml RF object.

From szchixy.

static define_scaler(scaler_name: typing_extensions.Literal['min-max', 'standard', 'quantile']) → Union[MinMaxScaler, StandardScaler, QuantileTransformer]

Defines a sklearn scaler object. See UMLOptions.SCALER_OPTIONS.value for accepted scalers.

Example

>>> TrainModelMixin.define_scaler(scaler_name='min-max')
define_tree_explainer(clf: RandomForestClassifier, data: Optional[ndarray] = None, model_output: str = 'raw', feature_perturbation: str = 'tree_path_dependent') → TreeExplainer

delete_other_annotation_columns(df: DataFrame, annotations_lst: List[str], raise_error: bool = True) → DataFrame

Helper to drop fields that contain annotations which are not the target.

Parameters
  • df (pd.DataFrame) – Dataframe holding features and annotations.

  • annotations_lst (List[str]) – Column fields to be removed from df.

  • raise_error (bool) – If True, throw error if annotation column doesn't exist. Else, skip. Default: True.

Returns

Dataframe without non-target annotation columns

Return type

pd.DataFrame

Examples

>>> self.delete_other_annotation_columns(df=df, annotations_lst=['Sniffing'])
dviz_classification_visualization(x_train: ndarray, y_train: ndarray, clf_name: str, class_names: List[str], save_dir: str) → None

Helper to create visualization of example decision tree using dtreeviz.

Parameters
  • x_train (np.ndarray) – Training features.

  • y_train (np.ndarray) – Training targets.

  • clf_name (str) – Name of classifier.

  • class_names (List[str]) – List of class names. E.g., ['Attack absent', 'Attack present'].

  • save_dir (str) – Directory where to save output in CSV file format.

static find_collinear_features(data: DataFrame, threshold: float) → List[str]

Identify collinear features in a pandas DataFrame for removal.

Finds pairs of features with Pearson correlation coefficients above the specified threshold and returns the names of features that should be removed to reduce multicollinearity.

Serves as a validation wrapper around numba implementation.

See also

For the underlying numba-accelerated implementation, see simba.mixins.train_model_mixin.TrainModelMixin.find_highly_correlated_fields(). For non-numba statistical methods, see simba.mixins.statistics_mixin.Statistics.find_collinear_features().

EXPECTED RUNTIMES

FEATURES N   TIME (S)
100          1.0479
200          2.3715
400          6.1663
800          23.639
1600         160.69

ITERATIONS: 3; CPU: Intel(R) Core(TM) i9-14900KF; OBSERVATION COUNT: 1M

Parameters
  • data (pd.DataFrame) – Input DataFrame containing numeric features. Each column represents a feature and each row represents an observation. Must contain only numeric data types.

  • threshold (float) – Correlation threshold for identifying collinear features. Must be between 0.0 and 1.0. Higher values (e.g., 0.9) identify only very highly correlated features, while lower values (e.g., 0.1) identify more loosely correlated features.

Returns

List of column names that are highly correlated with other features and should be considered for removal to reduce multicollinearity.

Return type

List[str]

Example

>>> a = np.random.randint(0, 5, (1_000_000, 100))
>>> df = pd.DataFrame(a)
>>> c = TrainModelMixin.find_collinear_features(data=df, threshold=0.0025)
static find_highly_correlated_fields(data: ndarray, threshold: float, field_names: ListType[unicode_type]) → List[str]

Find highly correlated fields in a dataset using Pearson product-moment correlation coefficient.

Calculates the absolute correlation coefficients between columns in a given dataset and identifies pairs of columns that have a correlation coefficient greater than the specified threshold. For every pair of correlated features identified, the function returns the field name of one feature. These field names can later be dropped from the input data to reduce memory requirements and collinearity.

Parameters
  • data (np.ndarray) – Two-dimensional numpy array with features represented as columns and frames represented as rows.

  • threshold (float) – Threshold value for significant collinearity.

  • field_names (List[str]) – List mapping the column names in data to a field name. Use types.ListType(types.unicode_type) to take advantage of JIT compilation.

Returns

Unique field names that correlate with at least one other field above the threshold value.

Return type

List[str]

Example

>>> import numpy as np
>>> from numba import typed
>>> data = np.random.randint(0, 1000, (1000, 5000)).astype(np.float32)
>>> field_names = [f'Feature_{i+1}' for i in range(data.shape[1])]
>>> highly_correlated_fields = TrainModelMixin().find_highly_correlated_fields(data=data, field_names=typed.List(field_names), threshold=0.10)
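
The selection rule described above (flag one member of each pair whose absolute Pearson correlation exceeds the threshold) can be sketched in plain NumPy. The helper name `correlated_columns` and the choice to flag the later column of each pair are assumptions of this sketch; the numba implementation may pick differently.

```python
import numpy as np

def correlated_columns(data, field_names, threshold):
    """Hypothetical helper: names of columns correlated with an earlier column."""
    corr = np.abs(np.corrcoef(data, rowvar=False))
    # Keep only pairs (i, j) with i < j so each pair is considered once.
    upper = np.triu(corr, k=1)
    return [field_names[j] for j in range(corr.shape[1]) if (upper[:, j] > threshold).any()]

rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = 2.0 * a + 0.01 * rng.normal(size=500)   # nearly a copy of column "a"
data = np.column_stack([a, b, c])

to_drop = correlated_columns(data, ["a", "b", "c"], threshold=0.9)
```

Dropping the returned columns leaves one representative of each correlated group in the data.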
static find_low_variance_fields(data: DataFrame, variance_threshold: float) → List[str]

Finds fields with variance below provided threshold.

Parameters
  • data (pd.DataFrame) – Dataframe with continuous numerical features.

  • variance_threshold (float) – Variance threshold (0.0-1.0).

Return type

List[str]
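
A minimal sketch of the idea, assuming plain per-column sample variance on the raw values; whether SimBA rescales the features before comparing against the threshold is not stated here. The helper name `low_variance_columns` is hypothetical.

```python
import pandas as pd

def low_variance_columns(data, variance_threshold):
    """Hypothetical helper: columns whose sample variance is below the threshold."""
    variances = data.var()  # pandas default: sample variance (ddof=1) per column
    return list(variances[variances < variance_threshold].index)

df = pd.DataFrame({
    "steady": [1.0, 1.0, 1.0, 1.01],
    "varied": [0.0, 5.0, 10.0, 15.0],
})
low = low_variance_columns(df, variance_threshold=0.1)
```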

static fit_scaler(scaler: Union[MinMaxScaler, QuantileTransformer, StandardScaler], data: Union[DataFrame, ndarray]) → object

get_all_clf_names(config: ConfigParser, target_cnt: int) → List[str]

Helper to get all classifier names in a SimBA project.

Parameters
  • config (configparser.ConfigParser) – Parsed SimBA project_config.ini.

  • target_cnt (int) – Number of classifiers (targets) in the SimBA project.

Returns

All classifier names in project

Return type

List[str]

Example

>>> self.get_all_clf_names(config=config, target_cnt=2)
['Attack', 'Sniffing']
get_model_info(config: ConfigParser, model_cnt: int) → Dict[int, Any]

Helper to read N SimBA random forest config meta files into a Python dict in memory.

Returns

Dictionary with integers as keys and hyperparameter dictionaries as values.

Return type

dict

insert_column_headers_for_outlier_correction(data_df: DataFrame, new_headers: List[str], filepath: Union[str, PathLike]) → DataFrame

Helper to insert new column headers onto a dataframe following outlier correction.

Parameters
  • data_df (pd.DataFrame) – Dataframe with headers to be replaced.

  • filepath (str) – Path to where data_df is stored on disk.

Returns

DataFrame with the corrected headers following outlier correction.

partial_dependence_calculator(clf: RandomForestClassifier, x_df: DataFrame, clf_name: str, save_dir: Union[str, PathLike], clf_cnt: Optional[int] = None, grid_resolution: Optional[int] = 50, plot: Optional[bool] = True) → None

Compute feature partial dependencies for every feature in training set.

Parameters
  • clf (RandomForestClassifier) – Random forest classifier.

  • x_df (pd.DataFrame) – Features training set.

  • clf_name (str) – Name of classifier.

  • save_dir (str) – Directory where to save the data.

  • clf_cnt (Optional[int]) – If integer, represents the count of the classifier within a grid search. If None, the classifier is not part of a grid search.

print_machine_model_information(model_dict: dict) → None

Helper to print model information in tabular form.

Parameters

model_dict (dict) – Dictionary holding model meta data in SimBA meta-config format.

random_multiclass_bout_sampler(x_df: DataFrame, y_df: DataFrame, target_field: str, target_var: int, sampling_ratio: Union[float, Dict[int, float]], raise_error: bool = False) → DataFrame

Randomly sample multiclass behavioral bouts.

This function performs random sampling on a multiclass dataset to balance the class distribution. From each class, the function selects a count of "bouts" where the count is computed as a ratio of a user-specified class variable count. All bout observations in the user-specified class are selected.

Parameters
  • x_df (pd.DataFrame) – A dataframe holding features.

  • y_df (pd.DataFrame) – A dataframe holding target.

  • target_field (str) – The name of the target column.

  • target_var (int) – The variable in the target that should serve as baseline. E.g., 0 if 0 represents no behavior.

  • sampling_ratio (Union[float, dict]) – The ratio of target_var bout observations that should be sampled of non-target_var observations. E.g., if float 1.0, and there are 10 bouts of target_var observations in the dataset, then 10 bouts of each non-target_var observations will be sampled. If different under-sampling ratios for different class variables are needed, use dict with the class variable name as key and ratio relative to target_var as the value.

  • raise_error (bool) – If True, then raises error if there are not enough observations of the non-target_var fulfilling the sampling_ratio. Else, takes all observations even though not enough to reach criterion.

Raises

SamplingError – If any of the following conditions are met:

  • No bouts of the target class are detected in the data.

  • The target variable is present in the sampling ratio dictionary.

  • The sampling ratio dictionary contains non-integer keys or non-float values less than 0.0.

  • The variable specified in the sampling ratio is not present in the DataFrame.

  • The sampling ratio results in a sample size of zero or less.

  • The requested sample size exceeds the available data and raise_error is True.

Returns

Resampled features, and resampled associated target.

Return type

(pd.DataFrame, pd.DataFrame)

Examples

>>> df = pd.read_csv('/Users/simon/Desktop/envs/troubleshooting/multilabel/project_folder/csv/targets_inserted/01.YC015YC016phase45-sample_sampler.csv', index_col=0)
>>> undersampled_df = TrainModelMixin().random_multiclass_bout_sampler(data=df, target_field='syllable_class', target_var=0, sampling_ratio={1: 1.0, 2: 1, 3: 1}, raise_error=True)
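The bout-level sampling described above can be sketched with NumPy alone. This is an illustration under stated assumptions, not SimBA's implementation: `find_bouts` and `sample_bouts` are hypothetical helpers, a bout is assumed to be a run of consecutive frames sharing one class label, and the error handling listed under Raises is omitted.

```python
import numpy as np

def find_bouts(y, cls):
    """Return (start, end) index pairs (end exclusive) for runs where y == cls."""
    mask = np.concatenate(([0], (np.asarray(y) == cls).astype(int), [0]))
    changes = np.diff(mask)
    starts = np.flatnonzero(changes == 1)
    ends = np.flatnonzero(changes == -1)
    return list(zip(starts.tolist(), ends.tolist()))

def sample_bouts(y, target_var, sampling_ratio, seed=0):
    """Keep every target_var bout; sample ratio * n_target_bouts bouts per other class."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    kept = find_bouts(y, target_var)
    n_target = len(kept)
    for cls in np.unique(y):
        if cls == target_var:
            continue
        # A dict gives one ratio per class; a float applies to all classes.
        ratio = sampling_ratio[int(cls)] if isinstance(sampling_ratio, dict) else sampling_ratio
        bouts = find_bouts(y, cls)
        n = min(len(bouts), int(round(n_target * ratio)))
        picks = rng.choice(len(bouts), size=n, replace=False)
        kept.extend(bouts[i] for i in picks)
    # Expand the selected bouts back into sorted frame indices.
    return np.sort(np.concatenate([np.arange(s, e) for s, e in kept]))

y = np.array([0, 0, 1, 1, 0, 2, 2, 0, 0, 1, 0, 2, 0])
idx = sample_bouts(y, target_var=0, sampling_ratio={1: 0.4, 2: 0.2})
```

With five bouts of class 0, the ratios above request two bouts of class 1 and one bout of class 2; `idx` can then be used to index both the features and the target.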
random_multiclass_frm_sampler(x_df: DataFrame, y_df: DataFrame, target_field: str, target_var: int, sampling_ratio: Union[float, Dict[int, float]], raise_error: bool = False)[source]

Random multiclass undersampler.

This function performs random under-sampling on a multiclass dataset to balance the class distribution. From each class, the function selects a number of frames computed as a ratio relative to a user-specified class variable.

All observations in the user-specified class are selected.

Parameters
  • x_df (pd.DataFrame) – A dataframe holding features.

  • y_df (pd.DataFrame) – A dataframe holding target.

  • target_field (str) – The name of the target column.

  • target_var (int) – The variable in the target that should serve as baseline. E.g., 0 if 0 represents no behavior.

  • sampling_ratio (Union[float, dict]) – The number of observations to sample from each non-target_var class, expressed as a ratio of the number of target_var observations. E.g., if float 1.0, and there are 10 target_var observations in the dataset, then 10 observations of each non-target_var class will be sampled. If different under-sampling ratios are needed for different class variables, use a dict with the class variable name as key and the ratio relative to target_var as the value.

  • raise_error (bool) – If True, raises an error if there are not enough observations of a non-target_var class to fulfill the sampling_ratio. Otherwise, takes all available observations even though they do not reach the criterion.

Return (pd.DataFrame, pd.DataFrame)

resampled features, and resampled associated target.

Examples

>>> df = pd.read_csv('/Users/simon/Desktop/envs/troubleshooting/multilabel/project_folder/csv/targets_inserted/01.YC015YC016phase45-sample_sampler.csv', index_col=0)
>>> TrainModelMixin().random_multiclass_frm_sampler(data_df=df, target_field='syllable_class', target_var=0, sampling_ratio=0.20)
>>> TrainModelMixin().random_multiclass_frm_sampler(data_df=df, target_field='syllable_class', target_var=0, sampling_ratio={1: 0.1, 2: 0.2, 3: 0.3})
random_undersampler(x_train: ndarray, y_train: ndarray, sample_ratio: float) Tuple[DataFrame, DataFrame][source]

Perform random under-sampling of behavior-absent frames in a dataframe.

Parameters
  • x_train (np.ndarray) – 2-dimensional array representing the features in the training set.

  • y_train (np.ndarray) – Array representing the target in the training set.

  • sample_ratio (float) – Ratio of behavior-absent frames to keep relative to the behavior-present frames. E.g., 1.0 returns an equal count of behavior-absent and behavior-present frames. 2.0 returns twice as many behavior-absent frames as behavior-present frames.

Returns

Size-2 tuple with DataFrames representing the under-sampled feature set and under-sampled target set.

Return type

Tuple[pd.DataFrame, pd.DataFrame]

Examples

>>> self.random_undersampler(x_train=x_train, y_train=y_train, sample_ratio=1.0)
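The effect of sample_ratio can be illustrated with a short NumPy sketch. `undersample_absent` is a hypothetical stand-in written for this example, assuming a binary target where 1 marks behavior-present frames; it is not the SimBA method itself.

```python
import numpy as np

def undersample_absent(x, y, sample_ratio, seed=0):
    """Keep all behavior-present rows and sample_ratio times as many absent rows."""
    rng = np.random.default_rng(seed)
    present = np.flatnonzero(y == 1)
    absent = np.flatnonzero(y == 0)
    n_keep = int(present.size * sample_ratio)
    if n_keep > absent.size:
        raise ValueError("Not enough behavior-absent frames for this sample_ratio.")
    kept_absent = rng.choice(absent, size=n_keep, replace=False)
    idx = np.sort(np.concatenate([present, kept_absent]))
    return x[idx], y[idx]

x = np.arange(40).reshape(20, 2)       # 20 frames, 2 features
y = np.array([1] * 4 + [0] * 16)       # 4 present frames, 16 absent frames
x_u, y_u = undersample_absent(x, y, sample_ratio=2.0)
```

With four behavior-present frames and a ratio of 2.0, eight behavior-absent frames are retained, giving twelve rows in total.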
read_all_files_in_folder(file_paths: List[str], file_type: str, classifier_names: Optional[List[str]] = None, raise_bool_clf_error: bool = True) Tuple[DataFrame, List[int]][source]

Read in all data files in a folder into a single pd.DataFrame.

Note

For improved runtime using multiprocessing and pyarrow, use read_all_files_in_folder_mp(). For improved runtime using the concurrent library, use simba.mixins.train_model_mixin.TrainModelMixin.read_all_files_in_folder_mp_futures().

Parameters
  • file_paths (List[str]) – List of file paths representing files to be read in.

  • file_type (str) – The type of files to be read in (e.g., csv)

  • classifier_names (Optional[List[str]]) – Optional list of classifier names representing fields of human annotations. If not None, then assert that classifier names are present in each data file.

Returns

Concatenated DataFrame of all data represented in file_paths, and an aligned list of frame numbers associated with the rows in the DataFrame.

Return type

Tuple[pd.DataFrame, List[int]]

Examples

>>> self.read_all_files_in_folder(file_paths=['targets_inserted/Video_1.csv', 'targets_inserted/Video_2.csv'], file_type='csv', classifier_names=['Attack'])
static read_all_files_in_folder_mp(file_paths: List[str], file_type: typing_extensions.Literal['csv', 'parquet', 'pickle'], classifier_names: Optional[List[str]] = None, raise_bool_clf_error: bool = True) Tuple[DataFrame, List[int]][source]

Multiprocessing helper function to read in all data files in a folder to a single pd.DataFrame for downstream ML. Defaults to ceil(CPU COUNT / 2) cores. Asserts that all classifiers have annotation fields present in each dataframe.

Note

If multiprocessing fails, reverts to simba.mixins.train_model_mixin.read_all_files_in_folder()

Parameters
  • file_paths (List[str]) – List of file paths.

  • file_type (Literal['csv', 'parquet', 'pickle']) – The file type of file_paths. Options: csv, parquet, or pickle.

  • classifier_names (Optional[List[str]]) – List of classifier names representing fields of human annotations. If not None, then assert that classifier names are present in each data file.

Returns

Concatenated DataFrame of all data represented in file_paths, and an aligned list of frame numbers associated with the rows in the DataFrame.

Return type

Tuple[pd.DataFrame, List[int]]

read_all_files_in_folder_mp_futures(annotations_file_paths: List[str], file_type: typing_extensions.Literal['csv', 'parquet', 'pickle'], classifier_names: Optional[List[str]] = None, raise_bool_clf_error: bool = True) Tuple[DataFrame, List[int]][source]

Multiprocessing helper function to read in all data files in a folder to a single pd.DataFrame for downstream ML through concurrent.futures. Asserts that all classifiers have annotation fields present in each dataframe.

Note

A concurrent.futures alternative to simba.mixins.train_model_mixin.read_all_files_in_folder_mp(), which uses multiprocessing.ProcessPoolExecutor and has been reported unstable on Linux machines.

If multiprocessing fails, reverts to simba.mixins.train_model_mixin.read_all_files_in_folder()

See also

For a single-process method, use read_all_files_in_folder(). For improved runtime using multiprocessing and pyarrow, use read_all_files_in_folder_mp().

Parameters
  • annotations_file_paths (List[str]) – List of file paths.

  • file_type (Literal['csv', 'parquet', 'pickle']) – The file type of annotations_file_paths. Options: csv, parquet, or pickle.

  • classifier_names (Optional[List[str]]) – List of classifier names representing fields of human annotations. If not None, then assert that classifier names are present in each data file.

  • raise_bool_clf_error (bool) – If True, raises an error if a classifier column contains values other than 0 and 1.

Returns

Concatenated DataFrame of all data represented in annotations_file_paths, and an aligned list of frame numbers associated with the rows in the DataFrame.

Return type

Tuple[pd.DataFrame, List[int]]
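The general pattern of reading several annotation files concurrently, checking for classifier columns, and returning the concatenated rows with aligned frame numbers can be sketched with the standard library. This is a simplified illustration, not SimBA's code: it substitutes a thread pool and the csv module for process-based workers and pandas, `read_one` and `read_all` are hypothetical names, and frame numbers are assumed to restart at 0 for each file.

```python
import csv
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

def read_one(path, classifier_names):
    """Read one CSV and check that every classifier column is present."""
    with open(path, newline="") as f:
        rows = list(csv.DictReader(f))
    missing = [c for c in classifier_names if rows and c not in rows[0]]
    if missing:
        raise KeyError(f"{path} lacks annotation columns: {missing}")
    return rows

def read_all(paths, classifier_names, workers=2):
    """Read files concurrently; return all rows plus per-file frame numbers."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the same order as the input paths.
        per_file = list(pool.map(lambda p: read_one(p, classifier_names), paths))
    rows, frame_numbers = [], []
    for file_rows in per_file:
        rows.extend(file_rows)
        frame_numbers.extend(range(len(file_rows)))
    return rows, frame_numbers

# Write two small hypothetical annotation files to a temporary directory.
tmp_dir = tempfile.mkdtemp()
paths = []
for name, n_rows in (("Video_1.csv", 3), ("Video_2.csv", 2)):
    path = os.path.join(tmp_dir, name)
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["feature_1", "Attack"])
        writer.writerows([[i * 0.5, i % 2] for i in range(n_rows)])
    paths.append(path)

rows, frame_numbers = read_all(paths, classifier_names=["Attack"])
```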

read_in_all_model_names_to_remove(config: ConfigParser, model_cnt: int, clf_name: str) List[str][source]

Helper to find all field names that are annotations but are not the target.

Parameters
  • config (configparser.ConfigParser) – Configparser object holding data from the project_config.ini

  • model_cnt (int) – Number of classifiers in the SimBA project

  • clf_name (str) – Name of the classifier.

Returns

List of non-target annotation column names.

Return type

List[str]

Examples

>>> self.read_in_all_model_names_to_remove(config=config, model_cnt=2, clf_name='Attack')
read_model_settings_from_config(config: ConfigParser)[source]
read_pickle(file_path: Union[str, PathLike]) RandomForestClassifier[source]

Read pickled RandomForestClassifier object.

Parameters

file_path (Union[str, os.PathLike]) – Path to pickle file on disk.

Returns

A scikit-learn RandomForestClassifier object.

Return type

RandomForestClassifier

save_rf_model(rf_clf: RandomForestClassifier, clf_name: str, save_dir: Union[str, PathLike], save_file_no: Optional[int] = None) None[source]

Helper to save pickled classifier object to disk.

See also

To write a pickle, can also use write_pickle(). To read a pickle, see read_pickle().

Parameters
  • rf_clf (RandomForestClassifier) – sklearn random forest classifier

  • clf_name (str) – Classifier name

  • save_dir (str) – Directory where to save output as pickle.

  • save_file_no (Optional[int]) – If integer, represents the count of the classifier within a grid search. If None, the classifier is not part of a grid search.

Returns

None. Results are saved in save_dir.

static scaler_inverse_transform(data: DataFrame, scaler: Union[MinMaxScaler, StandardScaler, QuantileTransformer], name: Optional[str] = '') DataFrame[source]
static scaler_transform(data: DataFrame, scaler: Union[MinMaxScaler, StandardScaler, QuantileTransformer], name: Optional[str] = '') DataFrame[source]

Helper to transform a dataframe using a previously fitted scaler.

Parameters
  • data (pd.DataFrame) – Data to transform.

  • scaler – Fitted scaler.

smote_oversampler(x_train: DataFrame, y_train: DataFrame, sample_ratio: float) Tuple[ndarray, ndarray][source]

Helper to perform SMOTE oversampling of behavior-present annotations.

Parameters
  • x_train (np.ndarray) – Features in train set

  • y_train (np.ndarray) – Target in train set

  • sample_ratio (float) – Over-sampling ratio

Returns

Size-2 tuple of arrays representing the over-sampled feature set and over-sampled target set.

Return type

Tuple[np.ndarray, np.ndarray]

Examples

>>> self.smote_oversampler(x_train=x_train, y_train=y_train, sample_ratio=1.0)
smoteen_oversampler(x_train: DataFrame, y_train: DataFrame, sample_ratio: float) Tuple[ndarray, ndarray][source]

Helper to perform SMOTEENN (SMOTE followed by edited nearest neighbours cleaning) oversampling of behavior-present annotations.

Parameters
  • x_train (np.ndarray) – Features in train set

  • y_train (np.ndarray) – Target in train set

  • sample_ratio (float) – Over-sampling ratio

Returns

Size-2 tuple of arrays representing the over-sampled feature set and over-sampled target set.

Return type

Tuple[np.ndarray, np.ndarray]

Examples

>>> self.smoteen_oversampler(x_train=x_train, y_train=y_train, sample_ratio=1.0)
static split_and_group_df(df: DataFrame, splits: int, include_split_order: bool = True) Tuple[List[DataFrame], int][source]

Helper to split a dataframe for multiprocessing. If include_split_order, then include the group number in split data as a column. Returns split data and approximations of number of observations per split.
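A rough sketch of this kind of split, written against a plain NumPy array rather than a DataFrame so it stays dependency-light. `split_and_group` is a hypothetical stand-in, and returning the size of the first chunk as the approximate observations per split is an assumption.

```python
import numpy as np

def split_and_group(data, splits, include_split_order=True):
    """Split rows into roughly equal chunks, optionally tagging each with its group number."""
    chunks = np.array_split(data, splits)
    if include_split_order:
        # Append the chunk index as an extra trailing column.
        chunks = [np.column_stack([c, np.full(len(c), i)]) for i, c in enumerate(chunks)]
    obs_per_split = len(chunks[0])
    return chunks, obs_per_split

data = np.arange(20).reshape(10, 2)    # 10 observations, 2 features
chunks, obs_per_split = split_and_group(data, splits=3)
```

Tagging each chunk with its group number lets results returned out of order by worker processes be sorted back into the original order.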

split_df_to_x_y(df: DataFrame, clf_name: str) Tuple[DataFrame, DataFrame][source]

Helper to split dataframe into features and target.

Parameters
  • df (pd.DataFrame) – Dataframe holding features and annotations.

  • clf_name (str) – Name of target.

Returns

Size-2 tuple containing two dataframes - the features, and the target.

Return type

Tuple[pd.DataFrame, pd.DataFrame]

Examples

>>> self.split_df_to_x_y(df=df, clf_name='Attack')

Batch random forest inference

class simba.model.inference_batch.InferenceBatch(config_path: Union[str, PathLike], features_dir: Optional[Union[str, PathLike]] = None, save_dir: Optional[Union[str, PathLike]] = None, minimum_bout_length: Optional[int] = None, feature_subsets_by_clf: Optional[Dict[str, Dict[str, List[str]]]] = None, model_dict: Optional[Dict[str, Dict[str, Union[str, int, float]]]] = None, save_agg_stats: Optional[Union[str, PathLike]] = None, verbose: bool = True)[source]

Run classifier inference on all files within the project_folder/csv/features_extracted directory. Results are stored in the project_folder/csv/machine_results directory of the SimBA project.

Note

To compute aggregate statistics from the output of this class, see simba.data_processors.agg_clf_calculator.AggregateClfCalculator()

Parameters
  • config_path (Union[str, os.PathLike]) – Path to SimBA project config file in Configparser format.

  • features_dir (Optional[Union[str, os.PathLike]]) – Optional directory containing featurized files in CSV or parquet format. If None, then the project_folder/csv/features_extracted directory of the project will be used.

  • save_dir (Optional[Union[str, os.PathLike]]) – Optional directory to save the data for the analyzed videos. If None, then the project_folder/csv/machine_results directory of the project will be used.

  • minimum_bout_length (Optional[int]) – Optional minimum bout length (milliseconds) override. If None, classifier-specific minimum bout settings from project configuration are used.

  • feature_subsets_by_clf (Optional[Dict[str, Dict[str, List[str]]]]) – Optional per-classifier feature subsets to use during inference. Format: {classifier_name: {subset_name: [feature_col_1, feature_col_2, ...]}}. If provided, each classifier is applied once per subset and outputs are suffixed with the subset name.

  • model_dict (Optional[Dict[str, Dict[str, Union[str, int, float]]]]) – Optional override of the classifiers to run. Format: {model_name: {'model_path': '/path/to/clf.sav', 'minimum_bout_length': 100, 'threshold': 0.5}}. If None, classifier definitions are read from the project config (current behavior). When provided, these models replace the project-config classifiers for this run.

  • save_agg_stats (Optional[Union[str, os.PathLike]]) – Optional directory in which to save aggregate classifier statistics. If None, no aggregate statistics are computed. If a directory is provided, simba.data_processors.agg_clf_calculator.AggregateClfCalculator is run after inference completes, reading from this class's save_dir and writing its CSV outputs to save_agg_stats.

  • verbose (bool) – If True, print progress and status messages during inference. Default: True.
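The nested formats for feature_subsets_by_clf and model_dict can be easier to read when written out. In the sketch below only the nesting follows the parameter descriptions above; the classifier name, subset names, feature column names, and model path are hypothetical, and `check_model_dict` is an illustrative helper rather than part of SimBA.

```python
# Hypothetical names and paths; only the dictionary nesting follows the documented formats.
feature_subsets_by_clf = {
    "Attack": {
        "animal_1": ["Nose_1_movement", "Tail_base_1_movement"],
        "animal_2": ["Nose_2_movement", "Tail_base_2_movement"],
    }
}

model_dict = {
    "Attack": {
        "model_path": "/path/to/Attack.sav",
        "minimum_bout_length": 100,   # milliseconds
        "threshold": 0.5,
    }
}

REQUIRED_KEYS = {"model_path", "minimum_bout_length", "threshold"}

def check_model_dict(models):
    """Return the names of entries missing any of the documented keys."""
    return [name for name, cfg in models.items() if not REQUIRED_KEYS <= set(cfg)]

incomplete = check_model_dict(model_dict)

# Passing the dictionaries would then look like this (requires a SimBA project):
# InferenceBatch(config_path='MyConfigPath', model_dict=model_dict,
#                feature_subsets_by_clf=feature_subsets_by_clf).run()
```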

Example I

>>> inferencer = InferenceBatch(config_path='MyConfigPath')
>>> inferencer.run()
Example II

>>> inferencer = InferenceBatch(config_path=r"D:/troubleshooting/mitra/project_folder/project_config.ini", features_dir=r"D:/troubleshooting/mitra/project_folder/videos/bg_removed/rotated/tail_features/APPENDED")
>>> inferencer.run()
run()[source]

Batch multi-animal random forest inference

class simba.model.inference_multi_animal_batch.InferenceMultiAnimalBatch(config_path: Union[str, PathLike], clf_name: str)[source]

Run a single trained behavior classifier across every animal in a SimBA project, producing per-animal predictions in the output CSVs.

See also

Training counterpart: simba.model.grid_search_rf.GridSearchRandomForestClassifier with feature_subset_suffix='_animal_<N>'.

Parameters
  • config_path (Union[str, os.PathLike]) – Path to the SimBA project_config.ini.

  • clf_name (str) – Name of the configured classifier to run multi-animal inference for.

Example

>>> InferenceMultiAnimalBatch(config_path=r'/path/project_folder/project_config.ini', clf_name='wing_wave').run()
run() None[source]

Batch multi-class random forest inference

class simba.model.inference_multiclass_batch.InferenceMulticlassBatch(config_path: str)[source]
run()[source]

Grid-search random forest classifiers

class simba.model.grid_search_rf.GridSearchRandomForestClassifier(config_path: Union[str, PathLike], feature_subset_suffix: Optional[str] = None, target_dir: Optional[Union[str, PathLike]] = None, save_dir: Optional[Union[str, PathLike]] = None)[source]

Train one or more random-forest classifiers from SimBA meta-config files.

Reads model hyperparameters and sampling settings from meta files in project_folder/configs and trains one model per valid meta file. Training data is loaded from annotated target files and saved models plus evaluation artifacts are written to the configured output directory.

Note

Searches the SimBA project project_folder/configs directory for meta files and builds one model per valid config file. Tutorial.

Parameters
  • config_path (Union[str, os.PathLike]) – Path to SimBA project config file in ConfigParser format.

  • feature_subset_suffix (Optional[str]) – Optional suffix used to subset feature columns before training. If set, only feature columns ending with this suffix are retained.

  • target_dir (Optional[Union[str, os.PathLike]]) – Optional directory with annotated target files (CSV or parquet, matching project file type). If None, project default targets directory is used.

  • save_dir (Optional[Union[str, os.PathLike]]) – Optional directory where trained models and evaluation artifacts are saved. If None, defaults to <model_dir>/validations from project configuration.

Example

>>> _ = GridSearchRandomForestClassifier(config_path='MyConfigPath').run()
perform_sampling(meta_dict: dict)[source]
run()[source]

Grid-search random forest multi-classifiers

class simba.model.grid_search_multiclass_rf.GridSearchMulticlassRandomForestClassifier(config_path: Union[str, PathLike])[source]
perform_sampling(meta_dict: dict)[source]
run()[source]

Random forest inference - validation

class simba.model.inference_validation.InferenceValidation(config_path: Union[str, PathLike], input_file_path: Union[str, PathLike], clf_path: Union[str, PathLike])[source]

Run a single classifier on a single featurized input file. Results are saved within the project_folder/csv/validation directory of the SimBA project by default.

Parameters
  • config_path (str) – Path to SimBA project config file in Configparser format.

  • input_file_path (str) – Path to file containing features.

  • clf_path (str) – Path to pickled sklearn random forest classifier.

Note

Tutorial

Example

>>> InferenceValidation(config_path=r"MyProjectConfigPath", input_file_path=r"FeatureFilePath", clf_path=r"ClassifierPath")

Fit random forest classifier

class simba.model.train_rf.TrainRandomForestClassifier(config_path: Union[str, PathLike])[source]

Train a single random forest model using the hyperparameter settings and evaluation methods stored within the SimBA project config .ini file (global environment).

Parameters

config_path (Union[str, os.PathLike]) – Path to SimBA project config file in Configparser format.

Note

Tutorial

Example

>>> model_trainer = TrainRandomForestClassifier(config_path='MyConfigPath')
>>> model_trainer.run()
>>> model_trainer.save()
perform_sampling()[source]

Method for sampling data for training and testing, and for performing over- and under-sampling of the training sets as indicated within the SimBA project config.

run()[source]

Method for training single random forest model.

save() None[source]

Method for saving pickled RF model. The model is saved in the models/generated_models directory of the SimBA project tree.

Fit random forest classifier - multi-class

class simba.model.train_multiclass_rf.TrainMultiClassRandomForestClassifier(config_path: Union[str, PathLike])[source]
run()[source]
save_model() None[source]

Method for saving pickled RF model. The model is saved in the models/generated_models directory of the SimBA project tree.

Ordinal classifier methods

class simba.model.ordinal_clf.OrdinalClassifier[source]

This class implements a strategy for ordinal classification by fitting multiple binary classifiers to predict thresholds between classes.

It is particularly useful for problems where the target variable has an inherent order but uneven intervals between levels. This includes human severity scores, for example seizure, stereotypy, convulsion, or bizarre-behavior scores ranging from 0-5.
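The threshold strategy follows the approach of Frank and Hall (reference 1 below): for K ordered classes, K-1 binary classifiers each estimate the probability that the target exceeds a given level, and per-class probabilities are recovered by differencing. The NumPy sketch below shows only that combination step with made-up probabilities; `combine_threshold_probabilities` is a hypothetical helper, and whether OrdinalClassifier.predict_proba combines its models in exactly this way is an assumption.

```python
import numpy as np

def combine_threshold_probabilities(p_greater):
    """Convert P(y > level_k), shape (n_samples, K-1), into class probabilities (n_samples, K)."""
    p_greater = np.asarray(p_greater, dtype=float)
    first = 1.0 - p_greater[:, :1]                  # P(y = lowest level)
    middle = p_greater[:, :-1] - p_greater[:, 1:]   # P(y = level_k) for the inner levels
    last = p_greater[:, -1:]                        # P(y = highest level)
    return np.hstack([first, middle, last])

# Made-up outputs of three binary threshold classifiers for two samples (four ordered levels).
p_greater = np.array([[0.9, 0.6, 0.2],
                      [0.3, 0.1, 0.05]])
proba = combine_threshold_probabilities(p_greater)
pred = proba.argmax(axis=1)   # index of the most probable level per sample
```

Each row of `proba` sums to 1 by construction. If the binary classifiers are not monotonically consistent, individual differences can come out negative, which is a known limitation of this simple scheme.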

Warning

For larger data sizes (>2m), pass a GPU cuml.ensemble.RandomForestClassifier object.

References

1

Frank, Eibe, and Mark Hall. “A Simple Approach to Ordinal Classification.” In Machine Learning: ECML 2001, edited by Luc De Raedt and Peter Flach, 2167:145–56. Lecture Notes in Computer Science. Berlin, Heidelberg: Springer Berlin Heidelberg, 2001. https://doi.org/10.1007/3-540-44795-4_13.

2

Sabnis, Gautam, Leinani Hession, J. Matthew Mahoney, Arie Mobley, Marina Santos, and Vivek Kumar. “Visual Detection of Seizures in Mice Using Supervised Machine Learning,” May 31, 2024. https://doi.org/10.1101/2024.05.29.596520.

3

Another implementation / benchmarking by Lee Prevost - https://github.com/leeprevost/OrdinalClassifier/tree/main.

4

https://github.com/fabianp/mord

5

Michael J Wurm, Paul J Rathouz, and Bret M Hanlon. Regularized ordinal regression and the ordinalnet r package. Journal of Statistical Software, 99(6), 2021.

Example

>>> X = np.random.randint(0, 500, (100, 50))
>>> y = np.random.randint(1, 6, (100))
>>> rf_mdl = TrainModelMixin().clf_define(cuda=False)
>>> fitted_mdl = OrdinalClassifier.fit(X, y, rf_mdl, -1)
>>> y_hat = OrdinalClassifier.predict_proba(X, fitted_mdl)
>>> y = OrdinalClassifier.predict(X, fitted_mdl)
>>> save_path = r"/mnt/c/Users/sroni/Downloads/Box4-20191208T1639-1652/ord_mdl/mdl.pickle"
>>> OrdinalClassifier.save(mdl=fitted_mdl, save_path=save_path)
>>> rf_mdl = OrdinalClassifier.load(file_path=save_path)
>>> y_hat = OrdinalClassifier.predict_proba(X, rf_mdl)
static fit(X: ndarray, y: ndarray, clf: RandomForestClassifier, core_cnt: int = -1, parallel: Optional[bool] = True) Dict[int, RandomForestClassifier][source]
static load(file_path: Union[str, PathLike]) Dict[int, RandomForestClassifier][source]
static predict(X: ndarray, mdl: Dict[int, RandomForestClassifier]) ndarray[source]
static predict_proba(X: ndarray, mdl: Dict[int, RandomForestClassifier]) ndarray[source]
static save(mdl: Dict[int, RandomForestClassifier], save_path: Union[str, PathLike])[source]

Regression - metrics

simba.model.regression.metrics.mean_absolute_error(y_true: ndarray, y_pred: ndarray, weights: Optional[ndarray] = None) float[source]

Compute the Mean Absolute Error (MAE) between the true and predicted values.

Parameters
  • y_true (np.ndarray) – A 1D array of true values (ground truth).

  • y_pred (np.ndarray) – A 1D array of predicted values.

  • weights (np.ndarray) – An optional 1D array of weights for each observation. If provided, the weighted MAE is computed.

Returns

The Mean Absolute Error (MAE) as a float. A lower value indicates a better fit.

Return type

float

simba.model.regression.metrics.mean_absolute_percentage_error(y_true: ndarray, y_pred: ndarray, epsilon=1e-10, weights: Optional[ndarray] = None) float[source]

Compute the Mean Absolute Percentage Error (MAPE).

Parameters
  • y_true (np.ndarray) – The array containing the true values (dependent variable) of the dataset. Should be a 1D numeric array of shape (n,).

  • y_pred (np.ndarray) – The array containing the predicted values for the dataset. Should be a 1D numeric array of shape (n,) and of the same length as y_true.

  • epsilon (float) – A small pseudovalue to replace zeros in y_true to avoid division by zero errors.

  • weights (Optional[np.ndarray]) – An optional 1D array of weights to apply to each error. If provided, the weighted mean absolute percentage error is computed.

Returns

The Mean Absolute Percentage Error (MAPE) as a float, in percentage format. A lower value indicates better prediction accuracy.

Return type

float

Example

>>> x, y = np.random.random(size=(100000,)), np.random.random(size=(100000,))
>>> mean_absolute_percentage_error(y_true=x, y_pred=y)
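As a point of reference, the computation described above can be written in a few lines of NumPy. This is a sketch of the textbook definition, not the SimBA source: `mape` is a hypothetical name, and normalizing the weighted variant by the sum of weights (via `np.average`) is an assumption.

```python
import numpy as np

def mape(y_true, y_pred, epsilon=1e-10, weights=None):
    """Mean absolute percentage error, in percent; zeros in y_true are replaced by epsilon."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    denom = np.where(y_true == 0, epsilon, y_true)
    pct_errors = np.abs((y_true - y_pred) / denom) * 100.0
    return float(np.average(pct_errors, weights=weights))

score = mape(y_true=[100.0, 200.0], y_pred=[110.0, 180.0])
weighted = mape(y_true=[100.0, 200.0], y_pred=[110.0, 160.0], weights=[3.0, 1.0])
```

Note that when y_true contains zeros, dividing by epsilon yields very large percentage errors, so MAPE is best reserved for targets that are bounded away from zero.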
simba.model.regression.metrics.mean_squared_error(y_true: ndarray, y_pred: ndarray, weights: Optional[ndarray] = None) float[source]

Compute the Mean Squared Error (MSE) between the true and predicted values.

Parameters
  • y_true (np.ndarray) – The array containing the true values (dependent variable) of the dataset. Should be a 1D numeric array of shape (n,).

  • y_pred (np.ndarray) – The array containing the predicted values for the dataset. Should be a 1D numeric array of shape (n,) and of the same length as y_true.

  • weights (Optional[np.ndarray]) – An optional 1D array of weights to apply to each squared error. If provided, the weighted mean squared error is computed.

Returns

The Mean Squared Error (MSE) as a float. A lower value indicates better model accuracy.

Return type

float

simba.model.regression.metrics.r2_score(y_true: ndarray, y_pred: ndarray, weights: Optional[ndarray] = None) float[source]

Compute the R^2 (coefficient of determination) score.

Parameters
  • y_true (np.ndarray) – 1D array of true values (dependent variable).

  • y_pred (np.ndarray) – 1D array of predicted values, same length as y_true.

  • weights (np.ndarray) – Optional 1D array of weights for each observation.

Returns

The R^2 score as a float. A value closer to 1 indicates better fit.

Return type

float
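For readers who want the formula behind the score, R^2 = 1 - SS_res / SS_tot, where SS_res is the sum of squared residuals and SS_tot is the sum of squared deviations of y_true from its mean. The NumPy sketch below implements that definition; `r2` is a hypothetical name, and applying the weights to both sums and to the mean is an assumption about the weighted variant.

```python
import numpy as np

def r2(y_true, y_pred, weights=None):
    """Coefficient of determination: 1 - SS_res / SS_tot, optionally weighted."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    w = np.ones_like(y_true) if weights is None else np.asarray(weights, dtype=float)
    mean_true = np.average(y_true, weights=w)
    ss_res = np.sum(w * (y_true - y_pred) ** 2)
    ss_tot = np.sum(w * (y_true - mean_true) ** 2)
    return float(1.0 - ss_res / ss_tot)

y_true = np.array([1.0, 2.0, 3.0, 4.0])
good = r2(y_true, np.array([1.1, 1.9, 3.2, 3.9]))
baseline = r2(y_true, np.full(4, y_true.mean()))   # always predicting the mean
```

Predicting the mean of y_true gives a score of 0, and predictions worse than that baseline give negative scores.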

simba.model.regression.metrics.root_mean_squared_error(y_true: ndarray, y_pred: ndarray, weights: Optional[ndarray] = None) float[source]

Compute the Root Mean Squared Error (RMSE) between the true and predicted values.

Parameters
  • y_true (np.ndarray) – The array containing the true values (dependent variable) of the dataset. Should be a 1D numeric array of shape (n,).

  • y_pred (np.ndarray) – The array containing the predicted values for the dataset. Should be a 1D numeric array of shape (n,) and of the same length as y_true.

  • weights (Optional[np.ndarray]) – An optional 1D array of weights to apply to each squared error. If provided, the weighted root mean squared error is computed.

Returns

The Root Mean Squared Error (RMSE) as a float. A lower value indicates better model accuracy.

Return type

float
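RMSE is the square root of MSE, which puts the error back in the units of the target. A minimal NumPy sketch of both, assuming the weighted variants normalize by the sum of weights (the function names `mse` and `rmse` are illustrative only):

```python
import numpy as np

def mse(y_true, y_pred, weights=None):
    """Mean of squared errors, optionally weighted per observation."""
    sq_errors = (np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)) ** 2
    return float(np.average(sq_errors, weights=weights))

def rmse(y_true, y_pred, weights=None):
    """Square root of the (optionally weighted) mean squared error."""
    return float(np.sqrt(mse(y_true, y_pred, weights=weights)))

plain = rmse(y_true=[0.0, 0.0, 0.0, 0.0], y_pred=[2.0, 2.0, 2.0, 2.0])
weighted = rmse(y_true=[0.0, 0.0], y_pred=[0.0, 4.0], weights=[3.0, 1.0])
```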

Regression - fit and transform

simba.model.regression.model.evaluate_xgb(y_pred: ndarray, y_true: ndarray, metrics: List[str], stratified: Optional[bool] = False) dict[source]

Evaluates the performance of a regression model (e.g., XGBoost) by calculating selected metrics. Optionally, the evaluation can be stratified by unique values in the true target variable (y_true), where performance is computed separately for each class/level.

Parameters
  • y_pred (np.ndarray) – Predicted values generated by the model, must have the same shape as y_true.

  • y_true (np.ndarray) – True target values to compare the predictions against.

  • metrics (List[str]) – List of metrics to compute.

  • stratified (Optional[bool]) – If True, computes the metric for each unique class/level in y_true. If False (default), computes the metric for the entire dataset.

Returns

A dictionary containing the computed metrics.

Return type

dict

Example

>>> x = pd.DataFrame(np.random.randint(0, 500, (100, 20)))
>>> y = np.random.randint(1, 6, (100,))
>>> mdl = fit_xgb(x=x, y=y, mdl=xgb_define())
>>> new_x = pd.DataFrame(np.random.randint(0, 500, (100, 20)))
>>> y_pred = transform_xgb(x=new_x, mdl=mdl)
>>> evaluate_xgb(y_pred=y_pred, y_true=y, metrics=['MAE', 'MAPE', 'RMSE', 'MSE'])
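To make the stratified option concrete, the sketch below computes metrics either over the whole dataset or separately for each unique level of y_true. It uses NumPy only and is not the SimBA function: `evaluate`, the `METRICS` table, and the nested layout of the returned dictionary are assumptions for illustration.

```python
import numpy as np

# Hypothetical metric table; each function takes (y_true, y_pred) arrays.
METRICS = {
    "MAE": lambda t, p: float(np.mean(np.abs(t - p))),
    "MSE": lambda t, p: float(np.mean((t - p) ** 2)),
    "RMSE": lambda t, p: float(np.sqrt(np.mean((t - p) ** 2))),
}

def evaluate(y_pred, y_true, metrics, stratified=False):
    """Compute metrics overall, or per unique level of y_true when stratified."""
    y_pred = np.asarray(y_pred, dtype=float)
    y_true = np.asarray(y_true, dtype=float)
    if not stratified:
        return {m: METRICS[m](y_true, y_pred) for m in metrics}
    results = {}
    for level in np.unique(y_true):
        mask = y_true == level
        results[float(level)] = {m: METRICS[m](y_true[mask], y_pred[mask]) for m in metrics}
    return results

y_true = np.array([1, 1, 2, 2])
y_pred = np.array([1.0, 2.0, 2.0, 4.0])
overall = evaluate(y_pred, y_true, metrics=["MAE", "RMSE"])
per_level = evaluate(y_pred, y_true, metrics=["MAE"], stratified=True)
```

Stratifying in this way shows whether errors concentrate at particular score levels, which an overall average can hide.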
simba.model.regression.model.fit_xgb(x: DataFrame, y: ndarray, mdl: XGBRegressor) XGBRegressor[source]

Fits an XGBoost regressor model to the given data.

Parameters
  • x (pd.DataFrame) – Input feature matrix where each row represents a sample and each column a feature. The data must have numeric types.

  • y (np.ndarray) – Target values, must be a 1-dimensional array of numeric types with the same number of rows as x.

  • mdl (xgb.XGBRegressor) – Defined xgb.XGBRegressor. E.g., can be defined with simba.model.regression.model.xgb_define().

Returns

Trained XGBoost regressor model.

Return type

xgb.XGBRegressor

Example

>>> x = pd.DataFrame(np.random.randint(0, 500, (100, 20)))
>>> y = np.random.randint(1, 6, (100,))
>>> mdl = fit_xgb(x=x, y=y, mdl=xgb_define())
simba.model.regression.model.transform_xgb(x: DataFrame, mdl: XGBRegressor) ndarray[source]

Transforms the input data using the provided XGBoost model by making predictions.

Parameters
  • x (pd.DataFrame) – Input feature matrix where each row represents a sample and each column a feature. The data must have numeric types.

  • mdl (xgb.XGBRegressor) – Trained XGBoost model to use for making predictions.

Returns

Predictions rounded to 2 decimal places.

Return type

np.ndarray

Example

>>> x, y = pd.DataFrame(np.random.randint(0, 500, (100, 20))), np.random.randint(1, 6, (100,))
>>> mdl = fit_xgb(x=x, y=y, mdl=xgb_define())
>>> new_x = pd.DataFrame(np.random.randint(0, 500, (100, 20)))
>>> results = transform_xgb(x=new_x, mdl=mdl)
simba.model.regression.model.xgb_define(objective: str = 'reg:squarederror', n_estimators: int = 100, max_depth: int = 6, verbosity: int = 1, learning_rate: float = 0.3, eta: float = 0.3, gamma: float = 0.0, tree_method: str = 'auto') XGBRegressor[source]

Defines an XGBoost regressor.

Parameters
  • objective (str) – The learning objective for the model.

  • n_estimators (int) – Number of boosting rounds. Must be greater than or equal to 1. Default is 100.

  • max_depth (int) – Maximum depth of a tree. Increasing this value makes the model more complex and more likely to overfit. Must be greater than or equal to 1. Default is 6.

  • verbosity (int) – Verbosity of the training process (0-3).

  • learning_rate (float) – Step size shrinkage used to prevent overfitting. Lower values make the model more robust but require more boosting rounds. Must be between 0.1 and 1.0. Default is 0.3.

  • eta (float) – Learning rate alias. Must be between 0.0 and 1.0. Default is 0.3.

  • gamma (float) – Minimum loss reduction required to make a further partition on a leaf node of the tree. Larger values prevent overfitting. Must be greater than or equal to 0.0. Default is 0.0.

  • tree_method (str) – The tree construction algorithm used in XGBoost.

Returns

An initialized XGBoost Regressor with the specified configuration.

Return type

xgb.XGBRegressor

SAM2 segmentation inference

class simba.model.sam_inference.SamInference(video_path: Union[str, PathLike], weights_path: Union[str, PathLike], save_dir: Union[str, PathLike], prompts: Union[ndarray, List[List[int]]], labels: Union[ndarray, List[List[int]]], names: Tuple[str, ...], imgsz: Optional[int] = 1024, confidence: Optional[float] = 0.25, vertice_cnt: Optional[int] = 100)[source]
Example

>>> i = SamInference(video_path=r"MyVideo",
...                  labels=[[1]],
...                  prompts=[[166, 428]],
...                  weights_path=r"D:\yolo_weights\sam2.1_b.pt",
...                  save_dir=r"C:\troubleshooting\sam_results",
...                  names=('Animal1',))
>>> i.run()
run()[source]

Fit YOLO model

class simba.model.yolo_fit.FitYolo(model_yaml: Union[str, PathLike], save_path: Union[str, PathLike], weights_path: Optional[Union[str, PathLike]] = None, epochs: int = 200, batch: Union[int, float] = 16, plots: bool = True, imgsz: int = 640, format: Optional[str] = None, device: Union[typing_extensions.Literal['cpu'], int] = 0, verbose: bool = True, workers: int = 8, patience: int = 500, device_id: Optional[int] = None)[source]

Fit an Ultralytics YOLO model (detection, pose, or segmentation) from SimBA projects with parameter validation.

Note

  • Works with any Ultralytics model flavour (bbox, pose, segmentation).

  • Download starter weights from HuggingFace.

  • Example dataset YAMLs: bbox, pose.

See also

simba.bounding_box_tools.yolo.utils.fit_yolo() for the functional API. simba.bounding_box_tools.yolo.utils.load_yolo_model() to load trained weights. For instructions, see YOLO Pose Estimation Training Documentation.

Parameters
  • weights_path (Union[str, os.PathLike]) – Path to base weights (e.g., yolo11n.pt or .onnx export).

  • model_yaml (Union[str, os.PathLike]) – Dataset configuration YAML describing dataset folders and class labels.

  • save_path (Union[str, os.PathLike]) – Directory where training outputs (weights, metrics, plots) are written.

  • epochs (int) – Training epochs to run. Must be ≥ 1. Default 200.

  • batch (Union[int, float]) – Batch size per step. Default 16.

  • plots (bool) – If True, Ultralytics saves training curves. Default True.

  • imgsz (int) – Square image resolution used during training. Default 640.

  • format (Optional[str]) – Optional weights format override. Must belong to simba.utils.enums.Options.VALID_YOLO_FORMATS. Default None.

  • device (Union[Literal['cpu'], int]) – Compute device string or CUDA index. Default 0.

  • verbose (bool) – Emit detailed progress information. Default True.

  • workers (int) – Data-loader worker processes. Use -1 for all cores. Default 8.

  • patience (int) – Early-stopping patience (epochs without improvement). Default 500.

Raises
Example
>>> fitter = FitYolo(
...     weights_path=r"D:\yolo_weights\yolo11n-pose.pt",
...     model_yaml=r"D:\datasets\pose_project\map.yaml",
...     save_path=r"D:\datasets\pose_project\mdl",
...     epochs=300,
...     batch=24,
...     device=0,
...     imgsz=640,
... )
>>> fitter.run()
run()[source]
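
The patience parameter described above counts consecutive epochs without improvement. The following is a generic sketch of that idea, not the Ultralytics or SimBA implementation: stop_epoch is a hypothetical helper, and the per-epoch validation scores are made-up illustration values.

```python
from typing import Iterable, Optional


def stop_epoch(scores: Iterable[float], patience: int) -> Optional[int]:
    """Return the 1-based epoch at which training would stop early, or None."""
    best = float("-inf")
    epochs_without_improvement = 0
    for epoch, score in enumerate(scores, start=1):
        if score > best:
            # A strictly better score resets the counter.
            best = score
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None


# Illustrative validation scores, one per epoch (not real training output).
val_scores = [0.50, 0.60, 0.61, 0.60, 0.59, 0.61, 0.62]
print(stop_epoch(val_scores, patience=3))
print(stop_epoch(val_scores, patience=5))
```

A patience value at least as large as epochs means the early stop can never trigger, so training always runs for the full epoch count.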

YOLO bounding-box inference

class simba.model.yolo_inference.YoloInference(weights: Union[str, PathLike, ultralytics.YOLO], video_path: Union[str, PathLike, List[Union[str, PathLike]]], verbose: Optional[bool] = False, save_dir: Optional[Union[str, PathLike]] = None, half_precision: Optional[bool] = True, device: Union[typing_extensions.Literal['cpu'], int] = 0, batch_size: Optional[int] = 400, core_cnt: int = 8, threshold: float = 0.25, max_detections: int = 300, max_per_class: Optional[int] = None, smoothing_method: Optional[typing_extensions.Literal['savitzky-golay', 'bartlett', 'blackman', 'boxcar', 'cosine', 'gaussian', 'hamming', 'exponential']] = None, smoothing_time_window: Optional[int] = None, interpolate: bool = False, imgsz: int = 320, bbox_size: Optional[Tuple[int, int]] = None, stream: Optional[bool] = True)[source]

Performs object detection inference on a video using a YOLO model.

Runs YOLO-based bounding-box detection on one or more video files, with support for GPU acceleration, batch processing, streaming, and optional saving of results. The model returns bounding box coordinates and class confidence scores for each frame. Results can be smoothed or interpolated to handle detection gaps.

See also

To perform bounding box and keypoint (pose) detection, see YOLOPoseInference(). To perform keypoint (pose) detection with tracking, see YOLOPoseTrackInference(). To visualize bounding boxes only, see YOLOVisualizer().

EXPECTED RUNTIMES

VIDEOS (COUNT)    FRAMES (COUNT)    TIME (S)       STDEV (S)
1                 9000              19.69          0.185202592
2                 18000             39.91333333    0.718424202
3                 27000             59.20333333    0.29143324
4                 36000             80.82          1.407870733

BATCH SIZE: 500; IMGSZ: 256; GPU: NVIDIA GeForce RTX 4070; CPU COUNT (LOADERS): 16; 3 runs.
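
The runtimes reported above can be converted to throughput to make them easier to compare with other hardware. This small sketch divides frame count by elapsed time. It assumes the TIME (S) column is the mean across the three runs, which the table does not state explicitly.

```python
# (videos, frames, seconds) copied from the expected-runtimes table.
runtimes = [
    (1, 9000, 19.69),
    (2, 18000, 39.91333333),
    (3, 27000, 59.20333333),
    (4, 36000, 80.82),
]

# Frames processed per second for each row.
fps_by_videos = {videos: frames / seconds for videos, frames, seconds in runtimes}

for videos, fps in fps_by_videos.items():
    print(f"{videos} video(s): {fps:.1f} frames/s")
```

Throughput stays roughly constant across rows, which suggests runtime scales close to linearly with frame count under these settings.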

Parameters
  • weights (Union[str, os.PathLike, YOLO]) – Path to YOLO model weights or a preloaded ultralytics.YOLO model instance.

  • video_path (Union[Union[str, os.PathLike], List[Union[str, os.PathLike]]]) – Input video path, list of paths, or directory containing videos.

  • verbose (Optional[bool]) – If True, print progress information.

  • save_dir (Optional[Union[str, os.PathLike]]) – Directory to save output CSV files. If None, results are returned in-memory.

  • half_precision (Optional[bool]) – If True, run inference in fp16 where supported.

  • device (Union[Literal['cpu'], int]) – Inference device ('cpu' or CUDA index).

  • batch_size (Optional[int]) – Number of frames per prediction batch.

  • core_cnt (int) – CPU thread count used by torch.

  • threshold (float) – Detection confidence threshold in [0.0, 1.0].

  • max_detections (int) – Maximum detections per frame (total, across all classes) returned by the model.

  • max_per_class (Optional[int]) – Maximum number of detections to retain per class per frame. E.g., if one 'resident' and one 'intruder' are expected, set this to 1. Defaults to None, meaning all detected instances of each class are retained (up to max_detections).

  • smoothing_method (Optional[Literal['savitzky-golay', 'bartlett', 'blackman', 'boxcar', 'cosine', 'gaussian', 'hamming', 'exponential']]) – Optional temporal smoothing method for bbox coordinates.

  • smoothing_time_window (Optional[int]) – Smoothing window in milliseconds. Used only when smoothing_method is not None.

  • interpolate (bool) – If True, interpolate missing bbox coordinates (nearest, per class).

  • imgsz (int) – Model inference image size.

  • bbox_size (Optional[Tuple[int, int]]) – Optional fixed bbox size (height, width) in pixels applied to detected boxes.

  • stream (Optional[bool]) – If True, use streaming predictions.

Returns

If save_dir is None, returns a dict mapping video name to result dataframe. Otherwise saves CSVs and returns None.

Return type

Union[None, Dict[str, pd.DataFrame]]
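
The effect of max_per_class described in the parameter list can be sketched as a per-class filter on one frame's detections. This is an illustration only: keep_top_per_class is a hypothetical helper, detections are reduced to (class name, confidence) pairs, and keeping the highest-confidence instances is an assumption, since the documentation does not say which instances are retained.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

# (class name, confidence); box coordinates are omitted for brevity.
Detection = Tuple[str, float]


def keep_top_per_class(detections: List[Detection],
                       max_per_class: Optional[int] = None) -> List[Detection]:
    """Hypothetical helper: keep at most max_per_class detections per class."""
    if max_per_class is None:
        # None means every detected instance is retained.
        return list(detections)
    by_class: Dict[str, List[Detection]] = defaultdict(list)
    for det in detections:
        by_class[det[0]].append(det)
    kept: List[Detection] = []
    for dets in by_class.values():
        # Assumption: the most confident instances are the ones kept.
        dets.sort(key=lambda d: d[1], reverse=True)
        kept.extend(dets[:max_per_class])
    return kept


# Made-up detections for a single frame.
frame = [("resident", 0.91), ("intruder", 0.40), ("resident", 0.55), ("intruder", 0.83)]
print(keep_top_per_class(frame, max_per_class=1))
```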

Example

>>> video_path = "/mnt/d/netholabs/yolo_videos/input/mp4_20250606083508/2025-05-28_19-50-23.mp4"
>>> i = YoloInference(
...     weights=r"/mnt/c/troubleshooting/coco_data/mdl/train8/weights/best.pt",
...     video_path=video_path,
...     save_dir=r"/mnt/c/troubleshooting/coco_data/mdl/results",
...     verbose=True,
...     device=0,
...     interpolate=True,
...     bbox_size=(128, 128)
... )
>>> i.run()
run()[source]
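
The interpolate option documented above fills frames where a class was not detected, using the nearest detected value. A minimal pure-Python sketch of nearest-value filling for one coordinate of one class follows. nearest_fill is a hypothetical helper and not the library's routine. Filling gaps at the start and end of the series and breaking ties toward the earlier frame are assumptions.

```python
from typing import List, Optional


def nearest_fill(values: List[Optional[float]]) -> List[Optional[float]]:
    """Hypothetical helper: replace None with the nearest observed value."""
    known = [i for i, v in enumerate(values) if v is not None]
    if not known:
        # Nothing observed, so there is nothing to copy from.
        return list(values)
    filled: List[Optional[float]] = []
    for i, v in enumerate(values):
        if v is None:
            # min() returns the first minimum, so ties go to the earlier frame.
            nearest = min(known, key=lambda k: abs(k - i))
            filled.append(values[nearest])
        else:
            filled.append(v)
    return filled


# Made-up x-coordinates for one class; None marks frames without a detection.
x_coords = [None, 10.0, None, None, None, 20.0, None]
print(nearest_fill(x_coords))
```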