Model mixin
Utilities for fitting, inference, and evaluation of classifiers.
- class simba.mixins.train_model_mixin.TrainModelMixin
Train model methods
- bout_train_test_splitter(x_df: DataFrame, y_df: DataFrame, test_size: float) -> Tuple[DataFrame, DataFrame, Series, Series]
Helper to split data into train and test sets based on annotated bouts.
- Parameters
x_df (pd.DataFrame) – Features
y_df (pd.Series) – Target
test_size (float) – Size of the test set as a ratio of all annotated bouts (e.g., 0.2).
- Returns
Size-4 tuple of DataFrames or Series representing (i) features for training, (ii) features for testing, (iii) target for training, (iv) target for testing.
- Return type
Tuple[pd.DataFrame, pd.DataFrame, pd.Series, pd.Series]
- Examples
>>> x = pd.DataFrame(data=[[11, 23, 12], [87, 65, 76], [23, 73, 27], [10, 29, 2], [12, 32, 42], [32, 73, 2], [21, 83, 98], [98, 1, 1]])
>>> y = pd.Series([0, 0, 1, 1, 0, 0, 1, 1])  # one label per row of x
>>> x_train, x_test, y_train, y_test = TrainModelMixin().bout_train_test_splitter(x_df=x, y_df=y, test_size=0.5)
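The bout-based idea can be sketched in plain Python, without pandas, under stated assumptions: a bout is a run of consecutive frames labeled 1, and whole bouts rather than single frames are assigned to the test set. The helper names find_bouts and bout_split are hypothetical and this is not SimBA's implementation. How behavior-absent frames are distributed is not described here, so the sketch simply leaves them in the training set.

```python
import random
from typing import List, Tuple


def find_bouts(y: List[int]) -> List[Tuple[int, int]]:
    """Return (start, end) index pairs, end inclusive, for each run of 1s in y."""
    bouts, start = [], None
    for i, value in enumerate(y):
        if value == 1 and start is None:
            start = i
        elif value != 1 and start is not None:
            bouts.append((start, i - 1))
            start = None
    if start is not None:  # a bout that runs to the last frame
        bouts.append((start, len(y) - 1))
    return bouts


def bout_split(y: List[int], test_size: float, seed: int = 0) -> Tuple[List[int], List[int]]:
    """Assign a test_size fraction of whole bouts to the test set (assumed behavior)."""
    bouts = find_bouts(y)
    n_test = int(round(len(bouts) * test_size))
    test_bouts = random.Random(seed).sample(bouts, n_test)
    test_idx = sorted(i for start, end in test_bouts for i in range(start, end + 1))
    test_set = set(test_idx)
    # Assumption: every frame not in a sampled bout stays in the training set.
    train_idx = [i for i in range(len(y)) if i not in test_set]
    return train_idx, test_idx


y = [0, 0, 1, 1, 0, 0, 1, 1]
bouts = find_bouts(y)
train_idx, test_idx = bout_split(y, test_size=0.5)
```

Splitting on bouts keeps all frames of one behavioral event on the same side of the split, so near-identical neighbouring frames cannot appear in both train and test data.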
- calc_learning_curve(x_y_df: DataFrame, clf_name: str, shuffle_splits: int, dataset_splits: int, tt_size: float, rf_clf: RandomForestClassifier, save_dir: Union[str, PathLike], save_file_no: Optional[int] = None, multiclass: Optional[bool] = False, scoring: Optional[str] = 'f1', plot: Optional[bool] = True) -> None
Helper to compute random forest learning curves with cross-validation.
- Parameters
x_y_df (pd.DataFrame) – Dataframe holding features and target.
clf_name (str) – Name of the classifier.
shuffle_splits (int) – Number of cross-validation datasets at each data split.
dataset_splits (int) – Number of data splits.
tt_size (float) – The size of the test set as a ratio of the dataset. E.g., 0.2.
rf_clf (RandomForestClassifier) – A sklearn RandomForestClassifier object.
save_dir (str) – Directory where to save output in CSV file format.
save_file_no (Optional[int]) – If integer, represents the count of the classifier within a grid search. If None, the classifier is not part of a grid search.
multiclass (bool) – If True, the target consists of several categories [0, 1, 2 ...] and scoring becomes None. If False, scoring is f1.
scoring (Optional[str]) – The score of the models to present. Default: 'f1'.
plot (Optional[bool]) – If True, creates a plot with the train fraction size on x and scoring on y.
- Returns
None. Results are stored in save_dir.
- calc_permutation_importance(x_test: ndarray, y_test: ndarray, clf: RandomForestClassifier, feature_names: List[str], clf_name: str, save_dir: Optional[Union[str, PathLike]] = None, save_file_no: Optional[int] = None, plot: Optional[bool] = True, n_repeats: Optional[int] = 10) -> Union[None, Tuple[DataFrame, Union[None, ndarray]]]
Computes feature permutation importance scores.
- Parameters
x_test (np.ndarray) – 2d feature test data of shape len(frames) x len(features)
y_test (np.ndarray) – 2d target test data of shape len(frames) x 1
clf (RandomForestClassifier) – Random forest classifier object
feature_names (List[str]) – Names of features in x_test
clf_name (str) – Name of classifier in y_test.
save_dir (str) – Directory where to save results in CSV format. If None, then returns the dataframe and the plot (if plot is True).
plot (Optional[bool]) – If True, creates a bar chart and saves it in the same directory as the CSV file.
save_file_no (Optional[int]) – If the permutation importance calculation is part of a grid search, provide an integer identifier representing the model in the grid search sequence. This will be used as a suffix in the output filename.
- Returns
Either None or a Tuple with the dataframe and the plot. A CSV file representing the permutation importances is stored in save_dir if save_dir is passed.
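As a rough illustration of what a permutation importance score measures, the following dependency-free sketch shuffles one feature column at a time and records the mean drop in accuracy over n_repeats shuffles. It is a minimal sketch, not the code this method runs: the accuracy metric, the list-of-lists data layout, and the toy predict function are all assumptions made for the example.

```python
import random
from typing import Callable, List


def accuracy(y_true: List[int], y_pred: List[int]) -> float:
    return sum(int(a == b) for a, b in zip(y_true, y_pred)) / len(y_true)


def permutation_importance(predict: Callable[[List[List[float]]], List[int]],
                           x: List[List[float]],
                           y: List[int],
                           n_repeats: int = 10,
                           seed: int = 0) -> List[float]:
    """Mean drop in accuracy after shuffling each feature column n_repeats times."""
    rng = random.Random(seed)
    baseline = accuracy(y, predict(x))
    importances = []
    for col in range(len(x[0])):
        drops = []
        for _ in range(n_repeats):
            shuffled = [row[col] for row in x]
            rng.shuffle(shuffled)
            # Rebuild the data with only this column permuted.
            x_perm = [row[:col] + [v] + row[col + 1:] for row, v in zip(x, shuffled)]
            drops.append(baseline - accuracy(y, predict(x_perm)))
        importances.append(sum(drops) / n_repeats)
    return importances


def predict(x: List[List[float]]) -> List[int]:
    # Hypothetical stand-in for a fitted classifier: uses feature 0, ignores feature 1.
    return [int(row[0] > 0.5) for row in x]


rng = random.Random(1)
x_test = [[rng.random(), rng.random()] for _ in range(200)]
y_test = predict(x_test)
scores = permutation_importance(predict, x_test, y_test, n_repeats=10)
```

A feature the model never consults scores exactly zero, while shuffling a feature the model relies on lowers accuracy and yields a positive score.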
- calc_pr_curve(rf_clf: RandomForestClassifier, x_df: DataFrame, y_df: DataFrame, clf_name: str, save_dir: Union[str, PathLike], multiclass: bool = False, plot: Optional[bool] = True, classifier_map: Optional[Dict[int, str]] = None, save_file_no: Optional[int] = None) -> None
Compute random forest precision-recall curve.
- Parameters
rf_clf (RandomForestClassifier) – sklearn RandomForestClassifier object.
x_df (pd.DataFrame) – Pandas dataframe holding test features.
y_df (pd.DataFrame) – Pandas dataframe holding test target.
clf_name (str) – Classifier name.
save_dir (str) – Directory where to save output in CSV file format.
multiclass (bool) – If the classifier is a multi-classifier. Default: False.
plot (Optional[bool]) – If True, creates and saves a line plot of the PR curve in the same location as the output CSV file.
classifier_map (Dict[int, str]) – If multiclass, dictionary mapping integers to classifier names.
save_file_no (Optional[int]) – If integer, represents the count of the classifier within a grid search. If None, the classifier is not part of a grid search.
- Returns
None. Results are stored in save_dir.
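For readers unfamiliar with how the points on a precision-recall curve arise, here is a minimal plain-Python sketch. It assumes a binary target, a list of positive-class probabilities, and a hand-picked list of thresholds; the function name and the convention that a probability equal to the threshold counts as positive are choices made for this example, not details taken from SimBA.

```python
from typing import List, Tuple


def precision_recall_points(y_true: List[int],
                            proba: List[float],
                            thresholds: List[float]) -> List[Tuple[float, float, float]]:
    """Return (threshold, precision, recall) for each threshold."""
    points = []
    for t in thresholds:
        pred = [int(p >= t) for p in proba]
        tp = sum(1 for yt, yp in zip(y_true, pred) if yt == 1 and yp == 1)
        fp = sum(1 for yt, yp in zip(y_true, pred) if yt == 0 and yp == 1)
        fn = sum(1 for yt, yp in zip(y_true, pred) if yt == 1 and yp == 0)
        # Convention assumed here: precision is 1.0 when nothing is predicted positive.
        precision = tp / (tp + fp) if (tp + fp) else 1.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        points.append((t, precision, recall))
    return points


y_true = [0, 0, 1, 1, 0, 1]
proba = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9]
curve = precision_recall_points(y_true, proba, thresholds=[0.3, 0.5, 0.85])
```

Raising the threshold trades recall for precision: at 0.3 every positive frame is found at the cost of one false positive, while at 0.5 there are no false positives but one positive frame is missed.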
- check_df_dataset_integrity(df: DataFrame, file_name: str, logs_path: Union[str, PathLike]) -> None
Helper to check for non-numerical np.inf, -np.inf, NaN, None in a single dataframe.
- Parameters
df (pd.DataFrame) – Features
- Raises
NoDataError – If data contains np.inf, -np.inf, None.
- check_raw_dataset_integrity(df: DataFrame, logs_path: Optional[Union[str, PathLike]]) -> None
Helper to check column-wise NaNs in raw input data for fitting model.
- Parameters
df (pd.DataFrame) – Raw input data to check.
logs_path (str) – The logs directory of the SimBA project
- Raises
FaultyTrainingSetError – When the dataset contains NaNs
- check_sampled_dataset_integrity(x_df: DataFrame, y_df: DataFrame) -> None
Helper to check for non-numerical entries post data sampling.
- Parameters
x_df (pd.DataFrame) – Features
y_df (pd.DataFrame) – Target
- Raises
FaultyTrainingSetError – Training or testing data sets contain non-numerical values
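The kind of check these integrity helpers describe can be sketched without pandas as a column-wise scan for entries that are None, NaN, infinite, or not numbers at all. The dict-of-lists layout, the function names, and the FaultyDataError class below are hypothetical stand-ins for illustration only; the real methods operate on DataFrames and raise SimBA's own error types.

```python
import math
from typing import Dict, List


class FaultyDataError(ValueError):
    """Hypothetical stand-in for SimBA's dataset integrity errors."""


def find_non_numeric(columns: Dict[str, List]) -> Dict[str, int]:
    """Count, per column, entries that are None, NaN, +/-inf, or not a number."""
    bad = {}
    for name, values in columns.items():
        count = 0
        for v in values:
            is_number = isinstance(v, (int, float)) and not isinstance(v, bool)
            if not is_number or not math.isfinite(v):
                count += 1
        if count:
            bad[name] = count
    return bad


def check_integrity(columns: Dict[str, List]) -> None:
    """Raise if any column holds a non-numerical entry."""
    bad = find_non_numeric(columns)
    if bad:
        raise FaultyDataError(f"Non-numerical values found (column: count): {bad}")


data = {"a": [1.0, 2.0], "b": [float("nan"), 3, None], "c": [float("inf"), 1]}
try:
    check_integrity(data)
except FaultyDataError as err:
    message = str(err)
```

Reporting the offending columns and counts in the error message makes it easier to trace the problem back to a specific feature.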
- check_validity_of_meta_files(data_df: DataFrame, meta_file_paths: List[Union[str, PathLike]])
- clf_define(n_estimators: Optional[int] = 2000, max_depth: Optional[int] = None, max_features: Optional[Union[int, str]] = 'sqrt', n_jobs: Optional[int] = -1, criterion: Optional[str] = 'gini', min_samples_leaf: Optional[int] = 1, bootstrap: Optional[bool] = True, verbose: Optional[int] = 1, class_weight: Optional[dict] = None, cuda: Optional[bool] = False) -> RandomForestClassifier
- clf_fit(clf: RandomForestClassifier, x_df: DataFrame, y_df: DataFrame, selected_feature_names: Optional[List[str]] = None, verbose: bool = False) -> RandomForestClassifier
Helper to fit clf model.
EXPECTED RUNTIMES
=============  =================  ============  ============
FEATURE COUNT  OBSERVATION COUNT  GPU TIME (S)  CPU TIME (S)
=============  =================  ============  ============
500            1000               9.7055        3.6682
500            10000              14.1444       18.6429
500            100000             74.8791       400.252
750            1000               7.0187        4.0107
750            10000              14.9369       22.482
750            100000             87.2037       449.0292
1000           1000               8.3954        3.9517
1000           10000              17.4793       25.5511
1000           100000             100.3192      515.217
=============  =================  ============  ============
MAX DEPTH: 32; ESTIMATORS: 2k; CPU: 32 cores; GPU: NVIDIA GeForce RTX 4070
See also
To define a cuml/sklearn object, see simba.mixins.train_model_mixin.TrainModelMixin.clf_define().
- Parameters
clf – Un-fitted random forest classifier object, either from sklearn or cuml.
x_df (pd.DataFrame) – Pandas dataframe with features.
y_df (pd.DataFrame) – Pandas dataframe/Series with target.
selected_feature_names (Optional[List[str]]) – Optional subset of feature column names from x_df to fit on. If None, fits on all features. Default: None.
- Returns
Fitted random forest classifier object
- Return type
RandomForestClassifier
- clf_predict_proba(clf: RandomForestClassifier, x_df: Union[DataFrame, ndarray], multiclass: bool = False, model_name: Optional[str] = None, data_path: Optional[Union[str, PathLike]] = None, verbose: bool = False) -> ndarray
Helper to predict class probabilities using a fitted random forest classifier.
Computes prediction probabilities for binary or multiclass classification using either scikit-learn or cuML RandomForestClassifier. For binary classifiers, returns the probability of the positive class (class 1). For multiclass classifiers, returns probabilities for all classes.
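EXPECTED RUNTIMES
=================  ============  ============
OBSERVATION COUNT  GPU TIME (S)  CPU TIME (S)
=================  ============  ============
100_000            1.5299        4.0823
500_000            2.8537        24.9888
1_000_000          9.2034        51.5734
1_500_000          23.548        83.1209
2_000_000          50.6484       131.5
=================  ============  ============
CLF X 10k obs / 750 features; MAX DEPTH: 32; ESTIMATORS: 2k; CPU: 32 cores; GPU: NVIDIA GeForce RTX 4070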
See also
To fit a classifier, see simba.mixins.train_model_mixin.TrainModelMixin.clf_fit(). To define a classifier, see simba.mixins.train_model_mixin.TrainModelMixin.clf_define().
- Parameters
clf (Union[RandomForestClassifier, cuRF]) – Fitted random forest classifier object from sklearn or cuml.
x_df (Union[pd.DataFrame, np.ndarray]) – Features for data to predict. DataFrame or array of shape (n_samples, n_features).
multiclass (bool) – If True, the classifier predicts more than 2 classes. If False, binary classifier (default: False).
model_name (Optional[str]) – Name of the model for error messages and logging. Default: None.
data_path (Optional[Union[str, os.PathLike]]) – Path to the data file being processed, used in error messages. Default: None.
verbose (bool) – If True, print inference progress and timing information. Default: False.
- Returns
Prediction probabilities. For binary classifiers: 1D array of shape (n_samples,) with probability of positive class. For multiclass: 2D array of shape (n_samples, n_classes) with probabilities for each class.
- Return type
np.ndarray
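The two return shapes described above can be illustrated with a small sketch. A scikit-learn style predict_proba call yields one probability column per class; the hypothetical helper below, which uses plain nested lists instead of arrays, shows the reduction to the positive-class column in the binary case and the pass-through in the multiclass case. It is an illustration of the documented output shapes, not the method's code.

```python
from typing import List, Union


def positive_or_all(proba: List[List[float]],
                    multiclass: bool = False) -> Union[List[float], List[List[float]]]:
    """Binary: keep only the class-1 column. Multiclass: keep every column."""
    if multiclass:
        return [list(row) for row in proba]
    if any(len(row) != 2 for row in proba):
        raise ValueError("A binary classifier should yield exactly two probability columns.")
    return [row[1] for row in proba]


binary_proba = [[0.9, 0.1], [0.2, 0.8]]                # columns: class 0, class 1
multi_proba = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]       # columns: classes 0, 1, 2

binary_out = positive_or_all(binary_proba)                  # shape (n_samples,)
multi_out = positive_or_all(multi_proba, multiclass=True)   # shape (n_samples, n_classes)
```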
- create_clf_report(rf_clf: RandomForestClassifier, x_df: DataFrame, y_df: DataFrame, class_names: List[str], save_dir: Union[str, PathLike], digits: Optional[int] = 4, clf_name: Optional[str] = None, img_size: tuple = (2500, 4500), cmap: str = 'coolwarm', threshold: float = 0.5, svg: bool = False, save_file_no: Optional[int] = None, dpi: int = 300) -> None
Create classifier truth table report.
Generates a classification report heatmap visualization showing precision, recall, F1-score, and support for each class. The report is displayed as a heatmap with annotations showing metric values. Predictions are made using the provided threshold to convert probabilities to binary predictions.
- Parameters
rf_clf (Union[RandomForestClassifier, cuRF]) – sklearn RandomForestClassifier or cuRF object.
x_df (pd.DataFrame) – DataFrame holding test features. Must match the feature set used for training.
y_df (pd.DataFrame) – DataFrame holding test target values. Should be binary (0/1) for binary classification.
class_names (List[str]) – List of class names. E.g., ['Attack absent', 'Attack present']. Must match the order of classes in the classifier.
save_dir (Union[str, os.PathLike]) – Directory where to save the classification report image.
digits (Optional[int]) – Number of decimal places in the classification report metrics. Default: 4.
clf_name (Optional[str]) – Name of the classifier. If not None, used in the output filename. If None, uses class_names[1].
img_size (Tuple[int, int]) – Size of the output image in pixels (width, height). Default: (2500, 4500).
cmap (str) – Colormap palette for the heatmap. Default: 'coolwarm' (blue to red).
threshold (float) – Classification threshold for converting probabilities to binary predictions. Values above threshold become 1, below become 0. Default: 0.5.
svg (bool) – If True, save as SVG format. If False (default), save as PNG format.
save_file_no (Optional[int]) – If integer, represents the count of the classifier within a grid search. Used in filename generation. If None, the classifier is not part of a grid search.
dpi (int) – Resolution (dots per inch) for the output image. Default: 300.
- Returns
None. Classification report image is saved to save_dir.
- create_example_dt(rf_clf: RandomForestClassifier, clf_name: str, feature_names: List[str], class_names: List[str], save_dir: str, tree_id: Optional[int] = 3, save_file_no: Optional[int] = None) -> None
Helper to produce visualization of random forest decision tree using graphviz.
- Parameters
rf_clf (RandomForestClassifier) – sklearn RandomForestClassifier object.
clf_name (str) – Classifier name.
feature_names (List[str]) – List of feature names.
class_names (List[str]) – List of classes. E.g., ['Attack absent', 'Attack present']
save_dir (str) – Directory where to save output in csv file format.
save_file_no (Optional[int]) – If integer, represents the count of the classifier within a grid search. If None, the classifier is not part of a grid search.
- create_meta_data_csv_training_multiple_models(meta_data, clf_name, save_dir, save_file_no: Optional[int] = None) -> None
- create_meta_data_csv_training_one_model(meta_data_lst: list, clf_name: str, save_dir: Union[str, PathLike]) -> None
Helper to save single model meta data (hyperparameters, sampling settings etc.) from list format into SimBA compatible CSV config file.
- create_shap_log(rf_clf: RandomForestClassifier, x: Union[DataFrame, ndarray], y: Union[DataFrame, Series, ndarray], x_names: List[str], clf_name: str, cnt_present: int, cnt_absent: int, verbose: bool = True, plot: bool = True, save_it: Optional[int] = 100, save_dir: Optional[Union[str, PathLike]] = None, save_file_suffix: Optional[int] = None) -> Union[None, Tuple[DataFrame, DataFrame, Dict[str, DataFrame], ndarray]]
Compute SHAP (SHapley Additive exPlanations) values for a random forest classifier. The SHAP value for feature 'i' in the context of a prediction 'f' and input 'x' is calculated using the following formula, where F is the set of all features and S ranges over the subsets of F that exclude i:
\phi_i(f, x) = \sum_{S \subseteq F \setminus \{i\}} \frac{|S|! \, (|F| - |S| - 1)!}{|F|!} \left( f_{S \cup \{i\}}(x_{S \cup \{i\}}) - f_S(x_S) \right)
Note
Uses TreeSHAP.
See also
For multicore solution, see create_shap_log_mp(). For GPU method, see create_shap_log().
- Parameters
rf_clf (RandomForestClassifier) – sklearn random forest classifier
x (Union[pd.DataFrame, np.ndarray]) – Test features.
y (Union[pd.DataFrame, pd.Series, np.ndarray]) – Test target.
x_names (List[str]) – Feature names.
clf_name (str) – Classifier name.
cnt_present (int) – Number of behavior-present frames to calculate SHAP values for.
cnt_absent (int) – Number of behavior-absent frames to calculate SHAP values for.
save_it (int) – Save iteration cadence. If None, then only saves at completion.
save_dir (str) – Optional directory where to save output in CSV file format. If None, the data is returned.
save_file_suffix (Optional[int]) – If integer, represents the count of the classifier within a grid search. If None, the classifier is not part of a grid search.
- Example
>>> import numpy as np
>>> import pandas as pd
>>> from simba.mixins.train_model_mixin import TrainModelMixin
>>> x_cols = list(pd.read_csv('/Users/simon/Desktop/envs/simba/simba/tests/data/sample_data/shap_test.csv', index_col=0).columns)
>>> x = pd.DataFrame(np.random.randint(0, 500, (9000, len(x_cols))), columns=x_cols)
>>> y = pd.Series(np.random.randint(0, 2, (9000,)))
>>> rf_clf = TrainModelMixin().clf_define(n_estimators=100)
>>> rf_clf = TrainModelMixin().clf_fit(clf=rf_clf, x_df=x, y_df=y)
>>> feature_names = [str(c) for c in x.columns]
>>> TrainModelMixin.create_shap_log(rf_clf=rf_clf, x=x, y=y, x_names=feature_names, clf_name='test', save_it=10, cnt_present=50, cnt_absent=50, plot=True, save_dir=r'/Users/simon/Desktop/feltz')
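To make the Shapley weighting concrete, the sketch below computes exact Shapley values by brute force for a tiny hand-written model. It is an illustration under assumptions and not what this method runs: TreeSHAP exploits tree structure to avoid enumerating subsets, whereas this example enumerates every subset and treats a feature as "absent" by replacing it with a baseline value. The function shapley_values, the toy model, and the zero baseline are all hypothetical.

```python
from itertools import combinations
from math import factorial
from typing import Callable, List, Set


def shapley_values(f: Callable[[List[float]], float],
                   x: List[float],
                   baseline: List[float]) -> List[float]:
    """Exact Shapley values of f at x, relative to a baseline input."""
    n = len(x)

    def value(subset: Set[int]) -> float:
        # Features in `subset` take their value from x, the rest from the baseline.
        z = [x[j] if j in subset else baseline[j] for j in range(n)]
        return f(z)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):  # subset sizes 0 .. n-1
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for combo in combinations(others, size):
                s = set(combo)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi


def model(z: List[float]) -> float:
    # Toy model: one additive term plus one interaction term.
    return 2.0 * z[0] + z[1] * z[2]


x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(model, x, baseline)
```

The additive term is credited entirely to feature 0, the interaction term is shared equally between features 1 and 2, and the values sum to the difference between the prediction at x and the prediction at the baseline. The cost grows exponentially with the number of features, which is why tree-specific algorithms are used in practice.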
- create_shap_log_concurrent_mp(rf_clf: Union[RandomForestClassifier, str, PathLike], x: Union[DataFrame, ndarray], y: Union[DataFrame, Series, ndarray], x_names: List[str], clf_name: str, cnt_present: int, cnt_absent: int, core_cnt: int = -1, chunk_size: int = 100, verbose: bool = True, save_dir: Optional[Union[str, PathLike]] = None, save_file_suffix: Optional[int] = None, plot: bool = False) -> Union[None, Tuple[DataFrame, DataFrame, Dict[str, DataFrame], ndarray]]
Compute SHAP values using multiprocessing.
See also
For single-core solution, see create_shap_log(). For GPU method, see create_shap_log(). For multiprocessing imap method (reliably runs on Windows and Mac), see create_shap_log_mp().
- Parameters
rf_clf (Union[RandomForestClassifier, str, os.PathLike]) – Fitted sklearn random forest classifier, or path to a fitted, pickled sklearn random forest classifier.
x (Union[pd.DataFrame, np.ndarray]) – Test features.
y (Union[pd.DataFrame, pd.Series, np.ndarray]) – Test target.
x_names (List[str]) – Feature names.
clf_name (str) – Classifier name.
cnt_present (int) – Number of behavior-present frames to calculate SHAP values for.
cnt_absent (int) – Number of behavior-absent frames to calculate SHAP values for.
chunk_size (int) – How many observations to process in each chunk. Increase value for faster processing if your memory allows.
verbose (bool) – If True, prints progress.
save_dir (Optional[Union[str, os.PathLike]]) – Optional directory where to store the results. If None, then the results are returned.
save_file_suffix (Optional[int]) – Optional suffix to add to the SHAP output filenames. Useful for grid searches where multiple SHAP output files are to be stored in the same save_dir.
plot (bool) – If True, create SHAP aggregation and plots.
- Example
>>> CONFIG_PATH = r"C:/troubleshooting/mitra/project_folder/project_config.ini"
>>> RF_PATH = r"C:/troubleshooting/mitra/models/validations/straub_tail_5_new/straub_tail_5.sav"
>>> DATA_PATH = r"C:/troubleshooting/mitra/project_folder/csv/targets_inserted/new_straub/appended/501_MA142_Gi_CNO_0514.csv"
>>> config = ConfigReader(config_path=CONFIG_PATH)
>>> df = read_df(file_path=DATA_PATH, file_type='csv')
>>> y = df['straub_tail']
>>> x = df.drop(['immobility', 'rearing', 'grooming', 'circling', 'shaking', 'lay-on-belly', 'straub_tail'], axis=1)
>>> x = x.drop(config.bp_col_names, axis=1)
>>> TrainModelMixin.create_shap_log_concurrent_mp(rf_clf=RF_PATH, x=x, y=y, x_names=list(x.columns), clf_name='straub_tail', cnt_absent=100, cnt_present=10, core_cnt=10)
- create_shap_log_mp(rf_clf: RandomForestClassifier, x: Union[DataFrame, ndarray], y: Union[DataFrame, Series, ndarray], x_names: List[str], clf_name: str, cnt_present: int, cnt_absent: int, core_cnt: int = -1, chunk_size: int = 100, verbose: bool = True, save_dir: Optional[Union[str, PathLike]] = None, save_file_suffix: Optional[int] = None, plot: bool = False) -> Union[None, Tuple[DataFrame, DataFrame, Dict[str, DataFrame], ndarray]]
Compute SHAP values using multiprocessing.
See also
For single-core solution, see create_shap_log(). For GPU method, see create_shap_log(). For multiprocessing concurrent futures method (should be more reliable on Linux distros), see create_shap_log_concurrent_mp().
- Parameters
rf_clf (RandomForestClassifier) – Fitted sklearn random forest classifier
x (Union[pd.DataFrame, np.ndarray]) – Test features.
y (Union[pd.DataFrame, pd.Series, np.ndarray]) – Test target.
x_names (List[str]) – Feature names.
clf_name (str) – Classifier name.
cnt_present (int) – Number of behavior-present frames to calculate SHAP values for.
cnt_absent (int) – Number of behavior-absent frames to calculate SHAP values for.
chunk_size (int) – How many observations to process in each chunk. Increase value for faster processing if your memory allows.
verbose (bool) – If True, prints progress.
save_dir (Optional[Union[str, os.PathLike]]) – Optional directory where to store the results. If None, then the results are returned.
save_file_suffix (Optional[int]) – Optional suffix to add to the SHAP output filenames. Useful for grid searches where multiple SHAP output files are to be stored in the same save_dir.
plot (bool) – If True, create SHAP aggregation and plots.
- Example
>>> from simba.mixins.train_model_mixin import TrainModelMixin
>>> x_cols = list(pd.read_csv('/Users/simon/Desktop/envs/simba/simba/tests/data/sample_data/shap_test.csv', index_col=0).columns)
>>> x = pd.DataFrame(np.random.randint(0, 500, (9000, len(x_cols))), columns=x_cols)
>>> y = pd.Series(np.random.randint(0, 2, (9000,)))
- create_x_importance_bar_chart(rf_clf: RandomForestClassifier, x_names: list, clf_name: str, save_dir: str, n_bars: int, palette: Optional[str] = 'hot', save_file_no: Optional[int] = None) -> None
Helper to create a bar chart displaying the top N gini or entropy feature importance scores.
- Parameters
rf_clf (RandomForestClassifier) – sklearn RandomForestClassifier object.
x_names (List[str]) – Names of features.
clf_name (str) – Name of classifier.
save_dir (str) – Directory where to save output in csv file format.
n_bars (int) – Number of bars in the plot.
save_file_no (Optional[int]) – If integer, represents the count of the classifier within a grid search. If None, the classifier is not part of a grid search.
- Returns
None. Results are stored in save_dir.
- create_x_importance_log(rf_clf: RandomForestClassifier, x_names: List[str], clf_name: str, precision: int = 25, sort_ascending: bool = False, verbose: bool = True, save_dir: Optional[str] = None, save_file_no: Optional[int] = None) -> Union[None, DataFrame]
Compute gini / entropy based feature importance scores.
Calculates mean and standard deviation of feature importances across all trees in the RandomForestClassifier. Results are sorted by mean importance (descending by default) and can be saved to CSV or returned as a DataFrame.
See also
To plot gini / entropy based feature importance scores, see create_x_importance_bar_chart().
- Parameters
rf_clf (Union[RandomForestClassifier, cuRF]) – sklearn RandomForestClassifier or cuRF object.
x_names (List[str]) – Names of features. Must match the number of features in the classifier.
clf_name (str) – Name of classifier. Used in output filename if save_dir is provided.
precision (int) – Number of decimal places for rounding feature importance values. Default: 25.
sort_ascending (bool) – If True, sort features by importance in ascending order. If False (default), sort in descending order.
verbose (bool) – If True (default), print progress messages. If False, suppress output.
save_dir (Optional[str]) – Directory where to save output in CSV file format. If None, then returns the DataFrame instead of saving.
save_file_no (Optional[int]) – If integer, represents the count of the classifier within a grid search. Used in filename generation. If None, the classifier is not part of a grid search.
- Returns
If save_dir is provided, returns None and saves CSV file. If save_dir is None, returns DataFrame with columns: 'FEATURE', 'FEATURE_IMPORTANCE_MEAN', 'FEATURE_IMPORTANCE_STDEV'.
- Return type
Union[None, pd.DataFrame]
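The aggregation described here (mean and standard deviation of each feature's importance across trees, then sorting) can be sketched in plain Python. The per-tree importance values below are made up for illustration, the function name is hypothetical, and the use of the population standard deviation is an assumption; only the three output column names come from the documentation above.

```python
from statistics import mean, pstdev
from typing import Dict, List


def importance_log(per_tree: List[List[float]],
                   names: List[str],
                   precision: int = 25,
                   sort_ascending: bool = False) -> List[Dict[str, object]]:
    """Aggregate per-tree feature importances into sorted mean/stdev rows."""
    rows = []
    for j, name in enumerate(names):
        column = [tree[j] for tree in per_tree]  # importance of feature j in each tree
        rows.append({
            "FEATURE": name,
            "FEATURE_IMPORTANCE_MEAN": round(mean(column), precision),
            # Assumption: population standard deviation across trees.
            "FEATURE_IMPORTANCE_STDEV": round(pstdev(column), precision),
        })
    rows.sort(key=lambda r: r["FEATURE_IMPORTANCE_MEAN"], reverse=not sort_ascending)
    return rows


# Illustrative importances for three trees and three features.
per_tree = [[0.6, 0.3, 0.1],
            [0.4, 0.5, 0.1],
            [0.5, 0.4, 0.1]]
log = importance_log(per_tree, ["speed", "distance", "angle"], precision=4)
```

A large standard deviation relative to the mean suggests that individual trees disagree about how useful a feature is.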
- cuml_rf_x_importances(nodes: dict, n_features: int) -> ndarray
Method for computing feature importances from cuml RF object.
From szchixy.
- static define_scaler(scaler_name: typing_extensions.Literal['min-max', 'standard', 'quantile']) -> Union[MinMaxScaler, StandardScaler, QuantileTransformer]
Defines a sklearn scaler object. See UMLOptions.SCALER_OPTIONS.value for accepted scalers.
- Example
>>> TrainModelMixin.define_scaler(scaler_name='min-max')
- define_tree_explainer(clf: RandomForestClassifier, data: Optional[ndarray] = None, model_output: str = 'raw', feature_perturbation: str = 'tree_path_dependent') -> TreeExplainer
- delete_other_annotation_columns(df: DataFrame, annotations_lst: List[str], raise_error: bool = True) -> DataFrame
Helper to drop fields that contain annotations which are not the target.
- Parameters
df (pd.DataFrame) – Dataframe holding features and annotations.
annotations_lst (List[str]) – Column fields to be removed from df.
raise_error (bool) – If True, throw error if annotation column doesn't exist. Else, skip. Default: True.
- Returns
Dataframe without non-target annotation columns
- Return type
pd.DataFrame
- Examples
>>> self.delete_other_annotation_columns(df=df, annotations_lst=['Sniffing'])
- dviz_classification_visualization(x_train: ndarray, y_train: ndarray, clf_name: str, class_names: List[str], save_dir: str) -> None
Helper to create visualization of example decision tree using dtreeviz.
- static find_collinear_features(data: DataFrame, threshold: float) -> List[str]
Identify collinear features in a pandas DataFrame for removal.
Finds pairs of features with Pearson correlation coefficients above the specified threshold and returns the names of features that should be removed to reduce multicollinearity.
Serves as a validation wrapper around numba implementation.
See also
For the underlying numba-accelerated implementation, see simba.mixins.train_model_mixin.TrainModelMixin.find_highly_correlated_fields(). For non-numba statistical methods, see simba.mixins.statistics_mixin.Statistics.find_collinear_features().
EXPECTED RUNTIMES
==========  ========
FEATURES N  TIME (S)
==========  ========
100         1.0479
200         2.3715
400         6.1663
800         23.639
1600        160.69
==========  ========
ITERATIONS: 3; CPU: Intel(R) Core(TM) i9-14900KF; OBSERVATION COUNT: 1M
- Parameters
data (pd.DataFrame) – Input DataFrame containing numeric features. Each column represents a feature and each row represents an observation. Must contain only numeric data types.
threshold (float) – Correlation threshold for identifying collinear features. Must be between 0.0 and 1.0. Higher values (e.g., 0.9) identify only very highly correlated features, while lower values (e.g., 0.1) identify more loosely correlated features.
- Returns
List of column names that are highly correlated with other features and should be considered for removal to reduce multicollinearity.
- Return type
List[str]
- Example
>>> a = np.random.randint(0, 5, (1_000_000, 100))
>>> df = pd.DataFrame(a)
>>> c = TrainModelMixin.find_collinear_features(data=df, threshold=0.0025)
- find_highly_correlated_fields(data: ndarray, threshold: float, field_names: List[str]) -> List[str]
Find highly correlated fields in a dataset using Pearson product-moment correlation coefficient.
Calculates the absolute correlation coefficients between columns in a given dataset and identifies pairs of columns that have a correlation coefficient greater than the specified threshold. For every pair of correlated features identified, the function returns the field name of one feature. These field names can later be dropped from the input data to reduce memory requirements and collinearity.
See also
For non-numba method, see simba.mixins.statistics_mixin.Statistics.find_collinear_features(). For validation wrapper, see simba.mixins.train_model_mixin.TrainModelMixin.find_collinear_features().
- Parameters
data (np.ndarray) – Two-dimensional numpy array with features represented as columns and frames represented as rows.
threshold (float) – Threshold value for significant collinearity.
field_names (List[str]) – List mapping the column names in data to a field name. Use types.ListType(types.unicode_type) to take advantage of JIT compilation.
- Returns
Unique field names that correlate with at least one other field above the threshold value.
- Return type
List[str]
- Example
>>> from numba import typed
>>> data = np.random.randint(0, 1000, (1000, 5000)).astype(np.float32)
>>> field_names = [f'Feature_{i+1}' for i in range(data.shape[1])]
>>> highly_correlated_fields = TrainModelMixin().find_highly_correlated_fields(data=data, field_names=typed.List(field_names), threshold=0.10)
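The pairwise logic described above can be sketched without numba or numpy. This is a minimal illustration under assumptions: the dict-of-lists layout and the function names are hypothetical, and the choice to report the later column of each correlated pair is a guess, since the text only says that one field name per pair is returned.

```python
from math import sqrt
from typing import Dict, List


def pearson(a: List[float], b: List[float]) -> float:
    """Pearson product-moment correlation; 0.0 if either input is constant."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / sqrt(var_a * var_b)


def find_correlated(columns: Dict[str, List[float]], threshold: float) -> List[str]:
    """Return unique names of columns whose |r| with an earlier column exceeds threshold."""
    names = list(columns)
    to_drop: List[str] = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if names[j] in to_drop:
                continue  # already flagged, keep the result unique
            r = pearson(columns[names[i]], columns[names[j]])
            if abs(r) > threshold:
                to_drop.append(names[j])  # assumption: flag the later column of the pair
    return to_drop


cols = {
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.0],   # exactly 2 * a, so r = 1.0
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],    # r = -0.3 against a and against b
}
strict = find_correlated(cols, threshold=0.9)
loose = find_correlated(cols, threshold=0.25)
```

Comparing every pair of columns is quadratic in the number of features, which is consistent with the sharp growth in the runtime table for find_collinear_features.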
- static find_low_variance_fields(data: DataFrame, variance_threshold: float) -> List[str]
Finds fields with variance below provided threshold.
- Parameters
data (pd.DataFrame) – Dataframe with continuous numerical features.
variance_threshold (float) – Variance threshold (0.0-1.0).
- Return type
List[str]
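A minimal sketch of a variance filter in plain Python follows. It assumes the population variance is computed on each column as given, with no prior scaling; whether SimBA scales the data first or uses the sample variance is not stated here. The function name and the example columns are hypothetical.

```python
from statistics import pvariance
from typing import Dict, List


def find_low_variance(columns: Dict[str, List[float]], variance_threshold: float) -> List[str]:
    """Return names of columns whose (population) variance is below the threshold."""
    return [name for name, values in columns.items()
            if pvariance(values) < variance_threshold]


columns = {
    "constant": [1, 1, 1, 1],    # variance 0
    "binary": [0, 1, 0, 1],      # variance 0.25
    "wide": [0, 10, 20, 30],     # variance 125
}
low = find_low_variance(columns, variance_threshold=0.3)
```

Fields that barely vary carry little information for a classifier, so dropping them reduces memory use at little cost.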
- static fit_scaler(scaler: Union[MinMaxScaler, QuantileTransformer, StandardScaler], data: Union[DataFrame, ndarray]) -> object
- get_all_clf_names(config: ConfigParser, target_cnt: int) -> List[str]
Helper to get all classifier names in a SimBA project.
- Parameters
config (configparser.ConfigParser) – Parsed SimBA project_config.ini
target_cnt (int) – Count of classifiers in the SimBA project.
- Returns
All classifier names in project
- Return type
List[str]
- Example
>>> self.get_all_clf_names(config=config, target_cnt=2)
['Attack', 'Sniffing']
- get_model_info(config: ConfigParser, model_cnt: int) -> Dict[int, Any]
Helper to read N SimBA random forest config meta files into a Python dict in memory.
- Parameters
config (configparser.ConfigParser) – Parsed SimBA project_config.ini
model_cnt (int) – Count of models
- Returns
Dictionary with integers as keys and hyperparameter dictionaries as values.
- Return type
dict
- insert_column_headers_for_outlier_correction(data_df: DataFrame, new_headers: List[str], filepath: Union[str, PathLike]) -> DataFrame
Helper to insert new column headers onto a dataframe following outlier correction.
- Parameters
data_df (pd.DataFrame) – Dataframe with headers to be replaced.
filepath (str) – Path to where data_df is stored on disk.
- Returns
DataFrame with the corrected headers following outlier correction.
- partial_dependence_calculator(clf: RandomForestClassifier, x_df: DataFrame, clf_name: str, save_dir: Union[str, PathLike], clf_cnt: Optional[int] = None, grid_resolution: Optional[int] = 50, plot: Optional[bool] = True) -> None
Compute feature partial dependencies for every feature in training set.
- Parameters
clf (RandomForestClassifier) – Random forest classifier
x_df (pd.DataFrame) – Features training set
clf_name (str) – Name of classifier
save_dir (str) – Directory where to save the data
clf_cnt (Optional[int]) – If integer, represents the count of the classifier within a grid search. If None, the classifier is not part of a grid search.
- print_machine_model_information(model_dict: dict) -> None
Helper to print model information in tabular form.
- Parameters
model_dict (dict) – Dictionary holding model meta data in SimBA meta-config format.
- random_multiclass_bout_sampler(x_df: DataFrame, y_df: DataFrame, target_field: str, target_var: int, sampling_ratio: Union[float, Dict[int, float]], raise_error: bool = False) -> DataFrame
Randomly sample multiclass behavioral bouts.
This function performs random sampling on a multiclass dataset to balance the class distribution. From each class, the function selects a count of "bouts" where the count is computed as a ratio of a user-specified class variable count. All bout observations in the user-specified class are selected.
- Parameters
x_df (pd.DataFrame) – A dataframe holding features.
y_df (pd.DataFrame) – A dataframe holding target.
target_field (str) – The name of the target column.
target_var (int) – The variable in the target that should serve as baseline. E.g., 0 if 0 represents no behavior.
sampling_ratio (Union[float, dict]) – The ratio of target_var bout observations that should be sampled of non-target_var observations. E.g., if float 1.0, and there are 10 bouts of target_var observations in the dataset, then 10 bouts of each non-target_var observations will be sampled. If different under-sampling ratios for different class variables are needed, use dict with the class variable name as key and ratio relative to target_var as the value.
raise_error (bool) – If True, then raises error if there are not enough observations of the non-target_var fulfilling the sampling_ratio. Else, takes all observations even though not enough to reach criterion.
- Raises
SamplingError – If any of the following conditions are met:
  - No bouts of the target class are detected in the data.
  - The target variable is present in the sampling ratio dictionary.
  - The sampling ratio dictionary contains non-integer keys or non-float values less than 0.0.
  - The variable specified in the sampling ratio is not present in the DataFrame.
  - The sampling ratio results in a sample size of zero or less.
  - The requested sample size exceeds the available data and raise_error is True.
- Returns
Resampled features, and resampled associated target.
- Return type
(pd.DataFrame, pd.DataFrame)
- Examples
>>> df = pd.read_csv('/Users/simon/Desktop/envs/troubleshooting/multilabel/project_folder/csv/targets_inserted/01.YC015YC016phase45-sample_sampler.csv', index_col=0)
>>> undersampled_df = TrainModelMixin().random_multiclass_bout_sampler(data=df, target_field='syllable_class', target_var=0, sampling_ratio={1: 1.0, 2: 1, 3: 1}, raise_error=True)
- random_multiclass_frm_sampler(x_df: DataFrame, y_df: DataFrame, target_field: str, target_var: int, sampling_ratio: Union[float, Dict[int, float]], raise_error: bool = False)[source]๏
Random multiclass undersampler.
This function performs random under-sampling on a multiclass dataset to balance the class distribution. From each class, the function selects a number of frames computed as a ratio relative to a user-specified class variable.
All the observations in the user-specified class are selected.
- Parameters
x_df (pd.DataFrame) – A dataframe holding features.
y_df (pd.DataFrame) – A dataframe holding the target.
target_field (str) – The name of the target column.
target_var (int) – The value in the target that should serve as the baseline. E.g., 0 if 0 represents no behavior.
sampling_ratio (Union[float, dict]) – The number of observations to sample from each non-target_var class, expressed as a ratio of the number of target_var observations. E.g., if float 1.0 and there are 10 target_var observations in the dataset, then 10 observations of each non-target_var class will be sampled. If different under-sampling ratios are needed for different classes, pass a dict with the class value as key and the ratio relative to target_var as value.
raise_error (bool) – If True, raises an error if there are not enough observations of a non-target_var class to fulfil the sampling_ratio. Else, takes all available observations even though they are too few to reach the criterion.
- Return (pd.DataFrame, pd.DataFrame)
resampled features, and resampled associated target.
- Examples
>>> df = pd.read_csv('/Users/simon/Desktop/envs/troubleshooting/multilabel/project_folder/csv/targets_inserted/01.YC015YC016phase45-sample_sampler.csv', index_col=0)
>>> TrainModelMixin().random_multiclass_frm_sampler(data_df=df, target_field='syllable_class', target_var=0, sampling_ratio=0.20)
>>> TrainModelMixin().random_multiclass_frm_sampler(data_df=df, target_field='syllable_class', target_var=0, sampling_ratio={1: 0.1, 2: 0.2, 3: 0.3})
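As a rough illustration of the frame-level ratio logic, the sketch below keeps every baseline frame and samples the other classes relative to the baseline count. `undersample_frames` is a hypothetical stand-in, not the SimBA method, and it assumes that a class with too few frames contributes all of them.

```python
import numpy as np
import pandas as pd


def undersample_frames(x_df, y_df, target_field, target_var, sampling_ratio, seed=0):
    """Keep all target_var rows; sample ratio * n_target rows from every other class."""
    rng = np.random.default_rng(seed)
    labels = y_df[target_field]
    keep = [labels.index[labels == target_var].to_numpy()]
    n_target = len(keep[0])
    for cls in sorted(set(labels.unique().tolist()) - {target_var}):
        # A dict gives a per-class ratio; a float applies to every class.
        ratio = sampling_ratio[cls] if isinstance(sampling_ratio, dict) else sampling_ratio
        cls_idx = labels.index[labels == cls].to_numpy()
        n = min(len(cls_idx), int(n_target * ratio))
        keep.append(rng.choice(cls_idx, size=n, replace=False))
    idx = np.sort(np.concatenate(keep))
    return x_df.loc[idx], y_df.loc[idx]


y_df = pd.DataFrame({"syllable_class": [0] * 4 + [1] * 10 + [2] * 10})
x_df = pd.DataFrame({"feature_1": range(24)})

x_half, y_half = undersample_frames(x_df, y_df, "syllable_class", 0, 0.5)
x_mixed, y_mixed = undersample_frames(x_df, y_df, "syllable_class", 0, {1: 1.0, 2: 0.5})
```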
- random_undersampler(x_train: ndarray, y_train: ndarray, sample_ratio: float) Tuple[DataFrame, DataFrame][source]๏
Perform random under-sampling of behavior-absent frames in a dataframe.
- Parameters
x_train (np.ndarray) – 2-dimensional array representing the features in the training set.
y_train (np.ndarray) – Array representing the target in the training set.
sample_ratio (float) – Ratio of behavior-absent frames to keep relative to the behavior-present frames. E.g., 1.0 returns an equal count of behavior-absent and behavior-present frames; 2.0 returns twice as many behavior-absent frames as behavior-present frames.
- Returns
Size-2 tuple with DataFrames representing the under-sampled feature set and under-sampled target set.
- Return type
Tuple[pd.DataFrame, pd.DataFrame]
- Examples
>>> self.random_undersampler(x_train=x_train, y_train=y_train, sample_ratio=1.0)
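The meaning of `sample_ratio` can be sketched with plain NumPy. This is an illustration under assumptions, not the library code: `random_undersample` is a hypothetical name, the target is assumed to be coded 1 (present) and 0 (absent), and raising when too few absent frames exist is an assumed behavior.

```python
import numpy as np


def random_undersample(x, y, sample_ratio, seed=0):
    """Keep all behavior-present rows plus sample_ratio * n_present behavior-absent rows."""
    rng = np.random.default_rng(seed)
    present = np.flatnonzero(y == 1)
    absent = np.flatnonzero(y == 0)
    n_absent = int(len(present) * sample_ratio)
    if n_absent > len(absent):
        # Assumed behavior: refuse to sample more absent frames than exist.
        raise ValueError(f"Requested {n_absent} absent frames but only {len(absent)} exist.")
    sampled = rng.choice(absent, size=n_absent, replace=False)
    keep = np.sort(np.concatenate([present, sampled]))
    return x[keep], y[keep]


y = np.array([1] * 5 + [0] * 20)
x = np.arange(50).reshape(25, 2)
x_us, y_us = random_undersample(x, y, sample_ratio=2.0)
```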
- read_all_files_in_folder(file_paths: List[str], file_type: str, classifier_names: Optional[List[str]] = None, raise_bool_clf_error: bool = True) Tuple[DataFrame, List[int]][source]๏
Read in all data files in a folder into a single pd.DataFrame.
Note
For improved runtime using multiprocessing and pyarrow, use read_all_files_in_folder_mp(). For improved runtime using the concurrent library, use simba.mixins.train_model_mixin.TrainModelMixin.read_all_files_in_folder_mp_futures().
- Parameters
file_paths (List[str]) – List of file paths representing files to be read in.
file_type (str) – The type of files to be read in (e.g., csv).
classifier_names (Optional[List[str]]) – Optional list of classifier names representing fields of human annotations. If not None, then assert that classifier names are present in each data file.
- Returns
Concatenated DataFrame of all data represented in file_paths, and an aligned list of frame numbers associated with the rows in the DataFrame.
- Return type
Tuple[pd.DataFrame, List[int]]
- Examples
>>> self.read_all_files_in_folder(file_paths=['targets_inserted/Video_1.csv', 'targets_inserted/Video_2.csv'], file_type='csv', classifier_names=['Attack'])
- static read_all_files_in_folder_mp(file_paths: List[str], file_type: typing_extensions.Literal['csv', 'parquet', 'pickle'], classifier_names: Optional[List[str]] = None, raise_bool_clf_error: bool = True) Tuple[DataFrame, List[int]][source]๏
Multiprocessing helper function to read in all data files in a folder to a single pd.DataFrame for downstream ML. Defaults to ceil(CPU COUNT / 2) cores. Asserts that all classifiers have annotation fields present in each dataframe.
Note
If multiprocessing fails, reverts to simba.mixins.train_model_mixin.read_all_files_in_folder().
See also
For the single-process method, use read_all_files_in_folder(). For the concurrent library, use simba.mixins.train_model_mixin.TrainModelMixin.read_all_files_in_folder_mp_futures().
- Parameters
file_paths (List[str]) – List of file paths.
file_type (Literal['csv', 'parquet', 'pickle']) – The file type of the files in file_paths. Options: csv, parquet, or pickle.
classifier_names (Optional[List[str]]) – List of classifier names representing fields of human annotations. If not None, then assert that classifier names are present in each data file.
- Returns
Concatenated DataFrame of all data represented in file_paths, and an aligned list of frame numbers associated with the rows in the DataFrame.
- Return type
Tuple[pd.DataFrame, List[int]]
- read_all_files_in_folder_mp_futures(annotations_file_paths: List[str], file_type: typing_extensions.Literal['csv', 'parquet', 'pickle'], classifier_names: Optional[List[str]] = None, raise_bool_clf_error: bool = True) Tuple[DataFrame, List[int]][source]๏
Multiprocessing helper function to read in all data files in a folder to a single pd.DataFrame for downstream ML through concurrent.futures. Asserts that all classifiers have annotation fields present in each dataframe.
Note
A concurrent.futures alternative to simba.mixins.train_model_mixin.read_all_files_in_folder_mp(), which uses multiprocessing.ProcessPoolExecutor and is reported to be unstable on Linux machines. If multiprocessing fails, reverts to simba.mixins.train_model_mixin.read_all_files_in_folder().
See also
For the single-process method, use read_all_files_in_folder(). For improved runtime using multiprocessing and pyarrow, use read_all_files_in_folder_mp().
- Parameters
annotations_file_paths (List[str]) – List of file paths.
file_type (Literal['csv', 'parquet', 'pickle']) – The file type of the files in annotations_file_paths. Options: csv, parquet, or pickle.
classifier_names (Optional[List[str]]) – List of classifier names representing fields of human annotations. If not None, then assert that classifier names are present in each data file.
raise_bool_clf_error (bool) – If True, raises an error if a classifier column contains values other than 0 and 1.
- Returns
Concatenated DataFrame of all data represented in the file paths, and an aligned list of frame numbers associated with the rows in the DataFrame.
- Return type
Tuple[pd.DataFrame, List[int]]
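The read-concatenate-align pattern these three readers share can be sketched with the standard library's concurrent.futures. The sketch below is an assumption-laden illustration rather than SimBA's code: `read_one` and `read_all` are hypothetical helpers, a thread pool is used for portability, and only CSV files are handled.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

import pandas as pd


def read_one(path, classifier_names):
    """Read one CSV and check that every annotation column is present."""
    df = pd.read_csv(path, index_col=0)
    missing = [c for c in classifier_names if c not in df.columns]
    if missing:
        raise ValueError(f"{path} lacks annotation columns: {missing}")
    return df


def read_all(paths, classifier_names):
    """Read files concurrently; return the concatenated data and aligned frame numbers."""
    with ThreadPoolExecutor() as pool:
        # map() preserves input order, so rows stay aligned with the file order.
        dfs = list(pool.map(lambda p: read_one(p, classifier_names), paths))
    frame_numbers = [i for df in dfs for i in df.index.tolist()]
    return pd.concat(dfs, axis=0).reset_index(drop=True), frame_numbers


with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for name, n in (("Video_1", 3), ("Video_2", 2)):
        path = os.path.join(tmp, f"{name}.csv")
        pd.DataFrame({"feature": range(n), "Attack": [0] * n}).to_csv(path)
        paths.append(path)
    data, frames = read_all(paths, classifier_names=["Attack"])
```

Returning the original frame numbers alongside the re-indexed DataFrame lets downstream code map each row back to its position in the source video.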
- read_in_all_model_names_to_remove(config: ConfigParser, model_cnt: int, clf_name: str) List[str][source]๏
Helper to find all field names that are annotations but are not the target.
- Parameters
config (configparser.ConfigParser) – Configparser object holding data from the project_config.ini.
model_cnt (int) – Number of classifiers in the SimBA project.
clf_name (str) – Name of the classifier.
- Returns
List of non-target annotation column names.
- Return type
List[str]
- Examples
>>> self.read_in_all_model_names_to_remove(config=config, model_cnt=2, clf_name='Attack')
- read_model_settings_from_config(config: ConfigParser)[source]๏
- read_pickle(file_path: Union[str, PathLike]) RandomForestClassifier[source]๏
Read pickled RandomForestClassifier object.
- Parameters
file_path (Union[str, os.PathLike]) – Path to pickle file on disk.
- Returns
A scikit-learn RandomForestClassifier object.
- Return type
RandomForestClassifier
- save_rf_model(rf_clf: RandomForestClassifier, clf_name: str, save_dir: Union[str, PathLike], save_file_no: Optional[int] = None) None[source]๏
Helper to save pickled classifier object to disk.
See also
To write a pickle, you can also use write_pickle(). To read a pickle, see read_pickle().
- Parameters
rf_clf (RandomForestClassifier) – sklearn random forest classifier.
clf_name (str) – Classifier name.
save_dir (str) – Directory where to save output as pickle.
save_file_no (Optional[int]) – If integer, represents the count of the classifier within a grid search. If None, the classifier is not part of a grid search.
- Returns
None. Results are saved in save_dir.
- static scaler_inverse_transform(data: DataFrame, scaler: Union[MinMaxScaler, StandardScaler, QuantileTransformer], name: Optional[str] = '') DataFrame[source]๏
- static scaler_transform(data: DataFrame, scaler: Union[MinMaxScaler, StandardScaler, QuantileTransformer], name: Optional[str] = '') DataFrame[source]๏
Helper to transform a dataframe using a previously fitted scaler.
- Parameters
data (pd.DataFrame) – Data to transform.
scaler (Union[MinMaxScaler, StandardScaler, QuantileTransformer]) – Fitted scaler.
- smote_oversampler(x_train: DataFrame, y_train: DataFrame, sample_ratio: float) Tuple[ndarray, ndarray][source]๏
Helper to perform SMOTE oversampling of behavior-present annotations.
- Parameters
x_train (np.ndarray) – Features in the training set.
y_train (np.ndarray) – Target in the training set.
sample_ratio (float) – Over-sampling ratio.
- Returns
Size-2 tuple of arrays representing the over-sampled feature set and over-sampled target set.
- Return type
Tuple[np.ndarray, np.ndarray]
- Examples
>>> self.smote_oversampler(x_train=x_train, y_train=y_train, sample_ratio=1.0)
- smoteen_oversampler(x_train: DataFrame, y_train: DataFrame, sample_ratio: float) Tuple[ndarray, ndarray][source]๏
Helper to perform SMOTEENN (SMOTE combined with Edited Nearest Neighbours) oversampling of behavior-present annotations.
- Parameters
x_train (np.ndarray) – Features in the training set.
y_train (np.ndarray) – Target in the training set.
sample_ratio (float) – Over-sampling ratio.
- Returns
Size-2 tuple of arrays representing the over-sampled feature set and over-sampled target set.
- Return type
Tuple[np.ndarray, np.ndarray]
- Examples
>>> self.smoteen_oversampler(x_train=x_train, y_train=y_train, sample_ratio=1.0)
- static split_and_group_df(df: ~pandas.core.frame.DataFrame, splits: int, include_split_order: bool = True) -> (typing.List[pandas.core.frame.DataFrame], <class 'int'>)[source]๏
Helper to split a dataframe for multiprocessing. If include_split_order, then include the group number in split data as a column. Returns split data and approximations of number of observations per split.
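A minimal sketch of this kind of split, assuming near-equal chunks and a group column named "group" (both the helper name `split_and_group` and the column name are assumptions made for illustration):

```python
import numpy as np
import pandas as pd


def split_and_group(df, splits, include_split_order=True):
    """Split df row-wise into `splits` near-equal chunks, optionally tagging each chunk."""
    positions = np.array_split(np.arange(len(df)), splits)
    chunks = [df.iloc[pos].copy() for pos in positions]
    if include_split_order:
        for group, chunk in enumerate(chunks):
            chunk["group"] = group  # lets workers report which chunk they processed
    obs_per_split = len(chunks[0])  # approximate: later chunks may be one row shorter
    return chunks, obs_per_split


df = pd.DataFrame({"feature": range(10)})
chunks, obs_per_split = split_and_group(df, splits=3)
```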
- split_df_to_x_y(df: DataFrame, clf_name: str) Tuple[DataFrame, DataFrame][source]๏
Helper to split dataframe into features and target.
- Parameters
df (pd.DataFrame) – Dataframe holding features and annotations.
clf_name (str) – Name of target.
- Returns
Size-2 tuple containing two dataframes - the features, and the target.
- Return type
Tuple[pd.DataFrame, pd.DataFrame]
- Examples
>>> self.split_df_to_x_y(df=df, clf_name='Attack')
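In pandas terms, the split amounts to pulling out the target column and dropping it from the features. The sketch below uses a hypothetical `split_x_y` helper; the optional `other_annotations` argument is an assumption showing how non-target annotation columns might be removed at the same time.

```python
import pandas as pd


def split_x_y(df, clf_name, other_annotations=()):
    """Return (features, target); non-target annotation columns are dropped from features."""
    y = df[clf_name]
    x = df.drop(columns=[clf_name, *other_annotations])
    return x, y


df = pd.DataFrame({"distance": [1.2, 3.4], "speed": [0.1, 0.9], "Attack": [0, 1], "Sniffing": [1, 0]})
x, y = split_x_y(df, clf_name="Attack", other_annotations=("Sniffing",))
```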
Batch random forest inference
- class simba.model.inference_batch.InferenceBatch(config_path: Union[str, PathLike], features_dir: Optional[Union[str, PathLike]] = None, save_dir: Optional[Union[str, PathLike]] = None, minimum_bout_length: Optional[int] = None, feature_subsets_by_clf: Optional[Dict[str, Dict[str, List[str]]]] = None, model_dict: Optional[Dict[str, Dict[str, Union[str, int, float]]]] = None, save_agg_stats: Optional[Union[str, PathLike]] = None, verbose: bool = True)[source]๏
Run classifier inference on all files within the project_folder/csv/features_extracted directory. Results are stored in the project_folder/csv/machine_results directory of the SimBA project.
Note
To compute aggregate statistics from the output of this class, see simba.data_processors.agg_clf_calculator.AggregateClfCalculator().
- Parameters
config_path (Union[str, os.PathLike]) – Path to SimBA project config file in Configparser format.
features_dir (Optional[Union[str, os.PathLike]]) – Optional directory containing featurized files in CSV or parquet format. If None, then the project_folder/csv/features_extracted directory of the project will be used.
save_dir (Optional[Union[str, os.PathLike]]) – Optional directory to save the data for the analyzed videos. If None, then the project_folder/csv/machine_results directory of the project will be used.
minimum_bout_length (Optional[int]) – Optional minimum bout length (milliseconds) override. If None, classifier-specific minimum bout settings from project configuration are used.
feature_subsets_by_clf (Optional[Dict[str, Dict[str, List[str]]]]) – Optional per-classifier feature subsets to use during inference. Format: {classifier_name: {subset_name: [feature_col_1, feature_col_2, ...]}}. If provided, each classifier is applied once per subset and outputs are suffixed with the subset name.
model_dict (Optional[Dict[str, Dict[str, Union[str, int, float]]]]) – Optional override of the classifiers to run. Format: {model_name: {'model_path': '/path/to/clf.sav', 'minimum_bout_length': 100, 'threshold': 0.5}}. If None, classifier definitions are read from the project config (current behavior). When provided, these models replace the project-config classifiers for this run.
save_agg_stats (Optional[Union[str, os.PathLike]]) – Optional directory in which to save aggregate classifier statistics. If None, no aggregate statistics are computed. If a directory is provided, simba.data_processors.agg_clf_calculator.AggregateClfCalculator is run after inference completes, reading from this class's save_dir and writing its CSV outputs to save_agg_stats.
verbose (bool) – If True, print progress and status messages during inference. Default: True.
- Example I
>>> inferencer = InferenceBatch(config_path='MyConfigPath')
>>> inferencer.run()
- Example II
>>> inferencer = InferenceBatch(config_path=r"D:/troubleshooting/mitra/project_folder/project_config.ini", features_dir=r"D:/troubleshooting/mitra/project_folder/videos/bg_removed/rotated/tail_features/APPENDED")
>>> inferencer.run()
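To make the effect of a millisecond minimum bout length concrete, the sketch below removes predicted bouts shorter than the threshold from a binary prediction vector. It is an illustration under assumptions: `drop_short_bouts` is a hypothetical helper, and the conversion from milliseconds to frames through the video frame rate is assumed rather than taken from SimBA's source.

```python
import numpy as np


def drop_short_bouts(pred, fps, min_bout_ms):
    """Set predicted bouts shorter than min_bout_ms to 0."""
    min_frames = int(fps * min_bout_ms / 1000)
    out = pred.copy()
    # Pad with zeros so bouts touching either end of the video are detected.
    edges = np.diff(np.concatenate(([0], pred, [0])))
    starts = np.flatnonzero(edges == 1)
    ends = np.flatnonzero(edges == -1)  # exclusive end position of each bout
    for start, end in zip(starts, ends):
        if end - start < min_frames:
            out[start:end] = 0
    return out


pred = np.array([0, 1, 1, 0, 1, 1, 1, 1, 0, 1])
cleaned = drop_short_bouts(pred, fps=30, min_bout_ms=100)  # 100 ms at 30 fps is 3 frames
```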
Batch multi-animal random forest inference
- class simba.model.inference_multi_animal_batch.InferenceMultiAnimalBatch(config_path: Union[str, PathLike], clf_name: str)[source]๏
Run a single trained behavior classifier across every animal in a SimBA project, producing per-animal predictions in the output CSVs.
See also
Training counterpart: simba.model.grid_search_rf.GridSearchRandomForestClassifier with feature_subset_suffix='_animal_<N>'.
- Parameters
config_path (Union[str, os.PathLike]) – Path to the SimBA project_config.ini.
clf_name (str) – Name of the configured classifier to run multi-animal inference for.
- Example
>>> InferenceMultiAnimalBatch(config_path=r'/path/project_folder/project_config.ini', clf_name='wing_wave').run()
Batch multi-class random forest inference
Grid-search random forest classifiers
- class simba.model.grid_search_rf.GridSearchRandomForestClassifier(config_path: Union[str, PathLike], feature_subset_suffix: Optional[str] = None, target_dir: Optional[Union[str, PathLike]] = None, save_dir: Optional[Union[str, PathLike]] = None)[source]๏
Train one or more random-forest classifiers from SimBA meta-config files.
Reads model hyperparameters and sampling settings from meta files in project_folder/configs and trains one model per valid meta file. Training data is loaded from annotated target files, and saved models plus evaluation artifacts are written to the configured output directory.
Note
Searches the SimBA project project_folder/configs directory for meta files and builds one model per valid config file. Tutorial.
- Parameters
config_path (Union[str, os.PathLike]) – Path to SimBA project config file in ConfigParser format.
feature_subset_suffix (Optional[str]) – Optional suffix used to subset feature columns before training. If set, only feature columns ending with this suffix are retained.
target_dir (Optional[Union[str, os.PathLike]]) – Optional directory with annotated target files (CSV or parquet, matching project file type). If None, project default targets directory is used.
save_dir (Optional[Union[str, os.PathLike]]) – Optional directory where trained models and evaluation artifacts are saved. If None, defaults to <model_dir>/validations from project configuration.
- Example
>>> _ = GridSearchRandomForestClassifier(config_path='MyConfigPath').run()
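The suffix-based feature subsetting described for feature_subset_suffix can be pictured as a simple column filter. The following sketch is illustrative only: `subset_by_suffix` and the column names are hypothetical, and keeping the annotation column alongside the matching features is an assumption.

```python
import pandas as pd


def subset_by_suffix(df, suffix, keep=()):
    """Retain columns ending with `suffix`, plus any explicitly kept columns."""
    cols = [c for c in df.columns if c.endswith(suffix) or c in keep]
    return df[cols]


df = pd.DataFrame({
    "nose_x_animal_1": [1.0],
    "nose_x_animal_2": [2.0],
    "wing_wave": [0],
})
sub = subset_by_suffix(df, "_animal_1", keep=("wing_wave",))
```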
Grid-search random forest multi-classifiers
Random forest inference - validation
- class simba.model.inference_validation.InferenceValidation(config_path: Union[str, PathLike], input_file_path: Union[str, PathLike], clf_path: Union[str, PathLike])[source]๏
Run a single classifier on a single featurized input file. Results are saved within the project_folder/csv/validation directory of the SimBA project by default.
- Parameters
Note
- Example
>>> InferenceValidation(config_path=r"MyProjectConfigPath", input_file_path=r"FeatureFilePath", clf_path=r"ClassifierPath")
Fit random forest classifier
- class simba.model.train_rf.TrainRandomForestClassifier(config_path: Union[str, PathLike])[source]๏
Train a single random forest model using hyperparameter settings and evaluation methods stored within the SimBA project config .ini file (global environment).
- Parameters
config_path (Union[str, os.PathLike]) – Path to SimBA project config file in Configparser format.
Note
- Example
>>> model_trainer = TrainRandomForestClassifier(config_path='MyConfigPath')
>>> model_trainer.run()
>>> model_trainer.save()
Fit random forest classifier - multi-class
Ordinal classifier methods
- class simba.model.ordinal_clf.OrdinalClassifier[source]๏
This class implements a strategy for ordinal classification by fitting multiple binary classifiers to predict thresholds between classes.
It is particularly useful for problems where the target variable has an inherent order but uneven intervals between levels. This includes human severity scores, for example seizure, stereotypy, convulsion, or bizarre-behavior scores ranging from 0 to 5.
Warning
For larger data sizes (>2m), pass a GPU cuml.ensemble.RandomForestClassifier object.
Note
References
- 1
Frank, Eibe, and Mark Hall. "A Simple Approach to Ordinal Classification." In Machine Learning: ECML 2001, edited by Luc De Raedt and Peter Flach, 2167:145–56. Lecture Notes in Computer Science. Berlin, Heidelberg: Springer Berlin Heidelberg, 2001. https://doi.org/10.1007/3-540-44795-4_13.
- 2
Sabnis, Gautam, Leinani Hession, J. Matthew Mahoney, Arie Mobley, Marina Santos, and Vivek Kumar. "Visual Detection of Seizures in Mice Using Supervised Machine Learning," May 31, 2024. https://doi.org/10.1101/2024.05.29.596520.
- 3
Another implementation / benchmarking by Lee Prevost - https://github.com/leeprevost/OrdinalClassifier/tree/main.
- 4
- 5
Michael J Wurm, Paul J Rathouz, and Bret M Hanlon. Regularized ordinal regression and the ordinalnet r package. Journal of Statistical Software, 99(6), 2021.
- Example
>>> X = np.random.randint(0, 500, (100, 50))
>>> y = np.random.randint(1, 6, (100))
>>> rf_mdl = TrainModelMixin().clf_define(cuda=False)
>>> fitted_mdl = OrdinalClassifier.fit(X, y, rf_mdl, -1)
>>> y_hat = OrdinalClassifier.predict_proba(X, fitted_mdl)
>>> y = OrdinalClassifier.predict(X, fitted_mdl)
>>> save_path = r"/mnt/c/Users/sroni/Downloads/Box4-20191208T1639-1652/ord_mdl/mdl.pickle"
>>> OrdinalClassifier.save(mdl=fitted_mdl, save_path=save_path)
>>> rf_mdl = OrdinalClassifier.load(file_path=save_path)
>>> y_hat = OrdinalClassifier.predict_proba(X, rf_mdl)
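The threshold strategy of Frank and Hall (reference 1) fits one binary classifier per boundary between adjacent classes, each estimating P(y > k), and recovers per-class probabilities by differencing. The sketch below shows only that combination step with made-up probabilities; `ordinal_class_probs` is a hypothetical helper and is not claimed to mirror the internals of OrdinalClassifier.

```python
import numpy as np


def ordinal_class_probs(p_greater):
    """Convert P(y > k) for K-1 thresholds (columns) into probabilities for K ordered classes."""
    n = p_greater.shape[0]
    # P(y > lowest - 1) = 1 and P(y > highest) = 0 bracket the thresholds.
    padded = np.hstack([np.ones((n, 1)), p_greater, np.zeros((n, 1))])
    # P(y = class_i) = P(y > class_{i-1}) - P(y > class_i)
    return padded[:, :-1] - padded[:, 1:]


# Two observations, four ordered classes (1..4), hence three thresholds.
p_greater = np.array([[0.9, 0.6, 0.2],
                      [0.3, 0.1, 0.0]])
probs = ordinal_class_probs(p_greater)
labels = np.array([1, 2, 3, 4])[probs.argmax(axis=1)]
```

Because each row telescopes from 1 down to 0, the class probabilities sum to 1 whenever the threshold probabilities are non-increasing.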
Regression - metrics
- simba.model.regression.metrics.mean_absolute_error(y_true: ndarray, y_pred: ndarray, weights: Optional[ndarray] = None) float[source]๏
Compute the Mean Absolute Error (MAE) between the true and predicted values.
- Parameters
y_true (np.ndarray) – A 1D array of true values (ground truth).
y_pred (np.ndarray) – A 1D array of predicted values.
weights (np.ndarray) – An optional 1D array of weights for each observation. If provided, the weighted MAE is computed.
- Returns
The Mean Absolute Error (MAE) as a float. A lower value indicates a better fit.
- Return type
float
- simba.model.regression.metrics.mean_absolute_percentage_error(y_true: ndarray, y_pred: ndarray, epsilon=1e-10, weights: Optional[ndarray] = None) float[source]๏
Compute the Mean Absolute Percentage Error (MAPE) between the true and predicted values.
- Parameters
y_true (np.ndarray) – The array containing the true values (dependent variable) of the dataset. Should be a 1D numeric array of shape (n,).
y_pred (np.ndarray) – The array containing the predicted values for the dataset. Should be a 1D numeric array of shape (n,) and of the same length as y_true.
epsilon (float) – A small pseudovalue to replace zeros in y_true to avoid division by zero errors.
weights (Optional[np.ndarray]) – An optional 1D array of weights to apply to each error. If provided, the weighted mean absolute percentage error is computed.
- Returns
The Mean Absolute Percentage Error (MAPE) as a float, in percentage format. A lower value indicates better prediction accuracy.
- Return type
float
- Example
>>> x, y = np.random.random(size=(100000,)), np.random.random(size=(100000,))
>>> mean_absolute_percentage_error(y_true=x, y_pred=y)
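The role of epsilon is easiest to see in a small NumPy sketch. This is a plausible reading of the description above rather than the library source: the hypothetical `mape` substitutes epsilon for exact zeros in y_true before dividing, then averages (optionally with weights) and scales to percent.

```python
import numpy as np


def mape(y_true, y_pred, epsilon=1e-10, weights=None):
    """Mean absolute percentage error in percent, guarding against zeros in y_true."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Replace exact zeros so the division below is always defined.
    denom = np.where(y_true == 0, epsilon, y_true)
    ape = np.abs((y_true - y_pred) / denom)
    return float(np.average(ape, weights=weights) * 100)


score = mape(y_true=[100.0, 200.0], y_pred=[110.0, 180.0])  # each error is 10% of its true value
with_zero = mape(y_true=[0.0, 200.0], y_pred=[0.0, 180.0])  # zero target, zero error: no blow-up
```

Note that a zero in y_true paired with a non-zero prediction produces a very large percentage error, so MAPE is best reserved for targets that stay away from zero.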
- simba.model.regression.metrics.mean_squared_error(y_true: ndarray, y_pred: ndarray, weights: Optional[ndarray] = None) float[source]๏
Compute the Mean Squared Error (MSE) between the true and predicted values.
- Parameters
y_true (np.ndarray) – The array containing the true values (dependent variable) of the dataset. Should be a 1D numeric array of shape (n,).
y_pred (np.ndarray) – The array containing the predicted values for the dataset. Should be a 1D numeric array of shape (n,) and of the same length as y_true.
weights (Optional[np.ndarray]) – An optional 1D array of weights to apply to each squared error. If provided, the weighted mean squared error is computed.
- Returns
The Mean Squared Error (MSE) as a float. A lower value indicates better model accuracy.
- Return type
float
- simba.model.regression.metrics.r2_score(y_true: ndarray, y_pred: ndarray, weights: Optional[ndarray] = None) float[source]๏
Compute the R^2 (coefficient of determination) score.
- Parameters
y_true (np.ndarray) – 1D array of true values (dependent variable).
y_pred (np.ndarray) – 1D array of predicted values, same length as y_true.
weights (np.ndarray) – Optional 1D array of weights for each observation.
- Returns
The R^2 score as a float. A value closer to 1 indicates a better fit.
- Return type
float
- simba.model.regression.metrics.root_mean_squared_error(y_true: ndarray, y_pred: ndarray, weights: Optional[ndarray] = None) float[source]๏
Compute the Root Mean Squared Error (RMSE) between the true and predicted values.
- Parameters
y_true (np.ndarray) – The array containing the true values (dependent variable) of the dataset. Should be a 1D numeric array of shape (n,).
y_pred (np.ndarray) – The array containing the predicted values for the dataset. Should be a 1D numeric array of shape (n,) and of the same length as y_true.
weights (Optional[np.ndarray]) – An optional 1D array of weights to apply to each squared error. If provided, the weighted root mean squared error is computed.
- Returns
The Root Mean Squared Error (RMSE) as a float. A lower value indicates better model accuracy.
- Return type
float
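For orientation, the four remaining metrics can be written in a few lines of NumPy using their textbook definitions. The functions below are hypothetical stand-ins, not the SimBA implementations, which may differ in validation and numerical details; `weights=None` gives the unweighted mean.

```python
import numpy as np


def mae(y_true, y_pred, weights=None):
    return float(np.average(np.abs(y_true - y_pred), weights=weights))


def mse(y_true, y_pred, weights=None):
    return float(np.average((y_true - y_pred) ** 2, weights=weights))


def rmse(y_true, y_pred, weights=None):
    return float(np.sqrt(mse(y_true, y_pred, weights)))


def r2(y_true, y_pred, weights=None):
    # 1 - (residual variation / total variation around the mean of y_true)
    mean = np.average(y_true, weights=weights)
    ss_res = np.average((y_true - y_pred) ** 2, weights=weights)
    ss_tot = np.average((y_true - mean) ** 2, weights=weights)
    return float(1 - ss_res / ss_tot)


y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.5, 2.0, 2.5, 4.0])
scores = {"MAE": mae(y_true, y_pred), "MSE": mse(y_true, y_pred),
          "RMSE": rmse(y_true, y_pred), "R2": r2(y_true, y_pred)}
```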
Regression - fit and transform
- simba.model.regression.model.evaluate_xgb(y_pred: ndarray, y_true: ndarray, metrics: List[str], stratified: Optional[bool] = False) dict[source]๏
Evaluates the performance of a regression model (e.g., XGBoost) by calculating selected metrics. Optionally, the evaluation can be stratified by unique values in the true target variable (y_true), where performance is computed separately for each class/level.
- Parameters
y_pred (np.ndarray) – Predicted values generated by the model. Must have the same shape as y_true.
y_true (np.ndarray) – True target values to compare the predictions against.
metrics (List[str]) – List of metrics to compute.
stratified (Optional[bool]) – If True, computes the metric for each unique class/level in y_true. If False (default), computes the metric for the entire dataset.
- Returns
A dictionary containing the computed metrics.
- Return type
dict
- Example
>>> x = pd.DataFrame(np.random.randint(0, 500, (100, 20)))
>>> y = np.random.randint(1, 6, (100,))
>>> mdl = fit_xgb(x=x, y=y, mdl=xgb_define())
>>> new_x = pd.DataFrame(np.random.randint(0, 500, (100, 20)))
>>> y_pred = transform_xgb(x=new_x, mdl=mdl)
>>> evaluate_xgb(y_pred=y_pred, y_true=y, metrics=['MAE', 'MAPE', 'RMSE', 'MSE'])
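Stratified evaluation means computing each metric separately within every level of y_true. A minimal sketch follows; the `evaluate` helper, the `METRICS` lookup table, and the layout of the returned dictionary are all assumptions made for illustration and may not match what evaluate_xgb returns.

```python
import numpy as np

# Hypothetical metric table; each entry maps (y_true, y_pred) to a float.
METRICS = {
    "MAE": lambda t, p: float(np.mean(np.abs(t - p))),
    "MSE": lambda t, p: float(np.mean((t - p) ** 2)),
    "RMSE": lambda t, p: float(np.sqrt(np.mean((t - p) ** 2))),
}


def evaluate(y_pred, y_true, metrics, stratified=False):
    """Compute each metric overall, or once per unique level of y_true."""
    results = {}
    for name in metrics:
        fn = METRICS[name]
        if stratified:
            results[name] = {
                int(level): fn(y_true[y_true == level], y_pred[y_true == level])
                for level in np.unique(y_true)
            }
        else:
            results[name] = fn(y_true, y_pred)
    return results


y_true = np.array([1, 1, 2, 2])
y_pred = np.array([1.0, 2.0, 2.0, 4.0])
overall = evaluate(y_pred, y_true, metrics=["MAE"])
by_level = evaluate(y_pred, y_true, metrics=["MAE"], stratified=True)
```

Per-level scores reveal whether a model that looks accurate overall is systematically worse at particular severity levels.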
- simba.model.regression.model.fit_xgb(x: DataFrame, y: ndarray, mdl: XGBRegressor) XGBRegressor[source]๏
Fits an XGBoost regressor model to the given data.
- Parameters
x (pd.DataFrame) – Input feature matrix where each row represents a sample and each column a feature. The data must have numeric types.
y (np.ndarray) – Target values. Must be a 1-dimensional array of numeric types with the same number of rows as x.
mdl (xgb.XGBRegressor) – Defined xgb.XGBRegressor. E.g., can be defined with simba.model.regression.model.xgb_define().
- Returns
Trained XGBoost regressor model.
- Return type
xgb.XGBRegressor
- Example
>>> x = pd.DataFrame(np.random.randint(0, 500, (100, 20)))
>>> y = np.random.randint(1, 6, (100,))
>>> mdl = fit_xgb(x=x, y=y, mdl=xgb_define())
- simba.model.regression.model.transform_xgb(x: DataFrame, mdl: XGBRegressor) ndarray[source]๏
Transforms the input data using the provided XGBoost model by making predictions.
- Parameters
x (pd.DataFrame) – Input feature matrix where each row represents a sample and each column a feature. The data must have numeric types.
mdl (xgb.XGBRegressor) – Trained XGBoost model to use for making predictions.
- Returns
Predictions rounded to 2 decimal places.
- Return type
np.ndarray
- Example
>>> x, y = pd.DataFrame(np.random.randint(0, 500, (100, 20))), np.random.randint(1, 6, (100,))
>>> mdl = fit_xgb(x=x, y=y, mdl=xgb_define())
>>> new_x = pd.DataFrame(np.random.randint(0, 500, (100, 20)))
>>> results = transform_xgb(x=new_x, mdl=mdl)
- simba.model.regression.model.xgb_define(objective: str = 'reg:squarederror', n_estimators: int = 100, max_depth: int = 6, verbosity: int = 1, learning_rate: float = 0.3, eta: float = 0.3, gamma: float = 0.0, tree_method: str = 'auto') XGBRegressor[source]๏
Defines an XGBoost regressor.
- Parameters
objective (str) – The learning objective for the model.
n_estimators (int) – Number of boosting rounds. Must be greater than or equal to 1. Default is 100.
max_depth (int) – Maximum depth of a tree. Increasing this value makes the model more complex and more likely to overfit. Must be greater than or equal to 1. Default is 6.
verbosity (int) – Verbosity of the training process (0-3).
learning_rate (float) – Step size shrinkage used to prevent overfitting. Lower values make the model more robust but require more boosting rounds. Must be between 0.1 and 1.0. Default is 0.3.
eta (float) – Learning rate alias. Must be between 0.0 and 1.0. Default is 0.3.
gamma (float) – Minimum loss reduction required to make a further partition on a leaf node of the tree. Larger values prevent overfitting. Must be greater than or equal to 0.0. Default is 0.0.
tree_method (str) – The tree construction algorithm used in XGBoost.
- Returns
An initialized XGBoost Regressor with the specified configuration.
- Return type
xgb.XGBRegressor
SAM2 segmentation inference
- class simba.model.sam_inference.SamInference(video_path: Union[str, PathLike], weights_path: Union[str, PathLike], save_dir: Union[str, PathLike], prompts: Union[ndarray, List[List[int]]], labels: Union[ndarray, List[List[int]]], names: Tuple[str, ...], imgsz: Optional[int] = 1024, confidence: Optional[float] = 0.25, vertice_cnt: Optional[int] = 100)[source]๏
- Example
>>> i = SamInference(
...     video_path=r"MyVideo",
...     labels=[[1]],
...     prompts=[[166, 428]],
...     weights_path=r"D:\yolo_weights\sam2.1_b.pt",
...     save_dir=r'C:\troubleshooting\sam_results',
...     names=('Animal1',),
... )
>>> i.run()
Fit YOLO model
- class simba.model.yolo_fit.FitYolo(model_yaml: Union[str, PathLike], save_path: Union[str, PathLike], weights_path: Optional[Union[str, PathLike]] = None, epochs: int = 200, batch: Union[int, float] = 16, plots: bool = True, imgsz: int = 640, format: Optional[str] = None, device: Union[typing_extensions.Literal['cpu'], int] = 0, verbose: bool = True, workers: int = 8, patience: int = 500, device_id: Optional[int] = None)[source]๏
Fit an Ultralytics YOLO model (detection, pose, or segmentation) from SimBA projects with parameter validation.
Note
Works with any Ultralytics model flavour (bbox, pose, segmentation).
Download starter weights from HuggingFace.
See also
simba.bounding_box_tools.yolo.utils.fit_yolo() for the functional API. simba.bounding_box_tools.yolo.utils.load_yolo_model() to load trained weights. For instructions, see YOLO Pose Estimation Training Documentation.
- Parameters
weights_path (Union[str, os.PathLike]) – Path to base weights (e.g., yolo11n.pt or .onnx export).
model_yaml (Union[str, os.PathLike]) – Dataset configuration YAML describing dataset folders and class labels.
save_path (Union[str, os.PathLike]) – Directory where training outputs (weights, metrics, plots) are written.
epochs (int) – Training epochs to run. Must be ≥ 1. Default 200.
batch (Union[int, float]) – Batch size per step. Default 16.
plots (bool) – If True, Ultralytics saves training curves. Default True.
imgsz (int) – Square image resolution used during training. Default 640.
format (Optional[str]) – Optional weights format override. Must belong to simba.utils.enums.Options.VALID_YOLO_FORMATS. Default None.
device (Union[Literal['cpu'], int]) – Compute device string or CUDA index. Default 0.
verbose (bool) – Emit detailed progress information. Default True.
workers (int) – Data-loader worker processes. Use -1 for all cores. Default 8.
patience (int) – Early-stopping patience (epochs without improvement). Default 500.
- Raises
SimBAGPUError – If no CUDA-capable GPU is detected.
SimBAPAckageVersionError – If ultralytics is unavailable in the environment.
FileNotFoundError – If weights_path or model_yaml do not exist.
ValueError – If provided arguments fail SimBA validation checks.
- Example
>>> fitter = FitYolo(
...     weights_path=r"D:\yolo_weights\yolo11n-pose.pt",
...     model_yaml=r"D:\datasets\pose_project\map.yaml",
...     save_path=r"D:\datasets\pose_project\mdl",
...     epochs=300,
...     batch=24,
...     device=0,
...     imgsz=640,
... )
>>> fitter.run()
YOLO bounding-box inference
- class simba.model.yolo_inference.YoloInference(weights: Union[str, PathLike, ultralytics.YOLO], video_path: Union[str, PathLike, List[Union[str, PathLike]]], verbose: Optional[bool] = False, save_dir: Optional[Union[str, PathLike]] = None, half_precision: Optional[bool] = True, device: Union[typing_extensions.Literal['cpu'], int] = 0, batch_size: Optional[int] = 400, core_cnt: int = 8, threshold: float = 0.25, max_detections: int = 300, max_per_class: Optional[int] = None, smoothing_method: Optional[typing_extensions.Literal['savitzky-golay', 'bartlett', 'blackman', 'boxcar', 'cosine', 'gaussian', 'hamming', 'exponential']] = None, smoothing_time_window: Optional[int] = None, interpolate: bool = False, imgsz: int = 320, bbox_size: Optional[Tuple[int, int]] = None, stream: Optional[bool] = True)[source]๏
Performs object detection inference on a video using a YOLO model.
Runs YOLO-based object detection (bounding-box) on one or more video files. It supports GPU acceleration, batch processing, streaming, and optional result saving. The model returns bounding box coordinates and class confidence scores for each frame. Results can be smoothed or interpolated to handle detection gaps.
See also
To perform bounding box and keypoint (pose) detection, see YOLOPoseInference(). To perform keypoint (pose) detection with tracking, see YOLOPoseTrackInference(). To visualize bounding boxes only, see YOLOVisualizer().
EXPECTED RUNTIMES
VIDEOS (COUNT)   FRAMES (COUNT)   TIME (S)      STDEV (S)
1                9000             19.69         0.185202592
2                18000            39.91333333   0.718424202
3                27000            59.20333333   0.29143324
4                36000            80.82         1.407870733
BATCH SIZE: 500
IMGSZ: 256
NVIDIA GeForce RTX 4070
CPU COUNT (LOADERS): 16
3 runs
- Parameters
weights (Union[str, os.PathLike, YOLO]) – Path to YOLO model weights or a preloaded ultralytics.YOLO model instance.
video_path (Union[Union[str, os.PathLike], List[Union[str, os.PathLike]]]) – Input video path, list of paths, or directory containing videos.
verbose (Optional[bool]) – If True, print progress information.
save_dir (Optional[Union[str, os.PathLike]]) – Directory to save output CSV files. If None, results are returned in-memory.
half_precision (Optional[bool]) – If True, run inference in fp16 where supported.
device (Union[Literal['cpu'], int]) – Inference device ('cpu' or CUDA index).
batch_size (Optional[int]) – Number of frames per prediction batch.
core_cnt (int) – CPU thread count used by torch.
threshold (float) – Detection confidence threshold in [0.0, 1.0].
max_detections (int) – Maximum detections per frame (total, across all classes) returned by the model.
max_per_class (Optional[int]) – Maximum number of detections to retain per class per frame. E.g., if one "resident" and one "intruder" are expected, set this to 1. Defaults to None, meaning all detected instances of each class are retained (up to max_detections).
smoothing_method (Optional[Literal['savitzky-golay', 'bartlett', 'blackman', 'boxcar', 'cosine', 'gaussian', 'hamming', 'exponential']]) – Optional temporal smoothing method for bbox coordinates.
smoothing_time_window (Optional[int]) – Smoothing window in milliseconds. Used only when smoothing_method is not None.
interpolate (bool) – If True, interpolate missing bbox coordinates (nearest, per class).
imgsz (int) – Model inference image size.
bbox_size (Optional[Tuple[int, int]]) – Optional fixed bbox size (height, width) in pixels applied to detected boxes.
stream (Optional[bool]) – If True, use streaming predictions.
- Returns
If save_dir is None, returns a dict mapping video name to result dataframe. Otherwise saves CSVs and returns None.
- Return type
Union[None, Dict[str, pd.DataFrame]]
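The effect of max_per_class can be pictured as a per-frame filter. The sketch below is not SimBA's implementation: it assumes that the retained detections are the highest-confidence ones within each class, which the documentation does not state. The `Detection` tuple layout and the function `keep_top_per_class` are hypothetical names used only for illustration.

```python
from collections import defaultdict
from typing import List, Optional, Tuple

# Hypothetical layout: (class name, confidence, (x1, y1, x2, y2)).
Detection = Tuple[str, float, Tuple[float, float, float, float]]


def keep_top_per_class(
    detections: List[Detection], max_per_class: Optional[int] = None
) -> List[Detection]:
    """Keep at most `max_per_class` detections per class for a single frame.

    Assumption: detections are ranked by confidence within each class.
    With `max_per_class=None`, every detection is kept.
    """
    if max_per_class is None:
        return list(detections)
    by_class = defaultdict(list)
    for det in detections:
        by_class[det[0]].append(det)
    kept: List[Detection] = []
    for dets in by_class.values():
        # Highest confidence first, then truncate to the per-class cap.
        dets.sort(key=lambda d: d[1], reverse=True)
        kept.extend(dets[:max_per_class])
    return kept


frame = [
    ("resident", 0.91, (10.0, 10.0, 60.0, 60.0)),
    ("resident", 0.40, (200.0, 50.0, 240.0, 90.0)),
    ("intruder", 0.75, (100.0, 120.0, 150.0, 170.0)),
]
kept = keep_top_per_class(frame, max_per_class=1)
```

With one "resident" and one "intruder" expected, a cap of 1 drops the lower-confidence duplicate "resident" box and leaves one box per class.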
- Example
>>> video_path = "/mnt/d/netholabs/yolo_videos/input/mp4_20250606083508/2025-05-28_19-50-23.mp4"
>>> i = YoloInference(
...     weights=r"/mnt/c/troubleshooting/coco_data/mdl/train8/weights/best.pt",
...     video_path=video_path,
...     save_dir=r"/mnt/c/troubleshooting/coco_data/mdl/results",
...     verbose=True,
...     device=0,
...     interpolate=True,
...     bbox_size=(128, 128)
... )
>>> i.run()
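The example above sets interpolate=True, which the parameter description summarizes as nearest-value interpolation per class. As a rough illustration of what nearest-value gap filling means for one coordinate of one class, here is a minimal pure-Python sketch. It is an assumption about the general technique, not SimBA's code; `fill_nearest` is a hypothetical helper, and ties between two equally distant frames are resolved toward the earlier frame in this sketch.

```python
from typing import List, Optional


def fill_nearest(values: List[Optional[float]]) -> List[Optional[float]]:
    """Replace missing entries (None) with the value from the nearest observed frame.

    `values` holds one bounding-box coordinate per frame for a single class.
    If no frame has an observation, the input is returned unchanged.
    """
    known = [i for i, v in enumerate(values) if v is not None]
    if not known:
        return list(values)
    filled: List[Optional[float]] = []
    for i, v in enumerate(values):
        if v is None:
            # min() returns the first minimum, so ties go to the earlier frame.
            nearest = min(known, key=lambda k: abs(k - i))
            filled.append(values[nearest])
        else:
            filled.append(v)
    return filled


# One coordinate across six frames; None marks frames without a detection.
x_left = [None, 10.0, None, None, 20.0, None]
x_left_filled = fill_nearest(x_left)
```

Each missing frame takes the value of the closest frame with a detection, so leading and trailing gaps are filled from the first and last observed values respectively.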