SimBA utilities

Variable checks

simba.utils.checks.check_all_file_names_are_represented_in_video_log(video_info_df: DataFrame, data_paths: List[Union[str, PathLike]]) None[source]

Helper to check that all files are represented in a dataframe of the SimBA project_folder/logs/video_info.csv file.

Parameters
  • video_info_df (pd.DataFrame) – Dataframe representation of the project_folder/logs/video_info.csv file.

  • data_paths (List[Union[str, os.PathLike]]) – List of file-paths.

Raises

ParametersFileError – The list is empty.

simba.utils.checks.check_ffmpeg_available(raise_error: Optional[bool] = False) Optional[bool][source]

Helper to check if FFmpeg is available via subprocess ffmpeg.

Parameters

raise_error (Optional[bool]) – If True, raises FFMPEGNotFoundError if FFmpeg can’t be found. Else return False. Default False.

Returns bool

True if ffmpeg returns not None and raise_error is False. Else False.
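A minimal sketch of this kind of availability check, using only the standard library. The helper name `ffmpeg_is_available` is hypothetical and this is not SimBA's implementation; it only illustrates probing for the executable and returning a bool instead of raising.

```python
import shutil
import subprocess


def ffmpeg_is_available() -> bool:
    """Hypothetical helper: True if an `ffmpeg` executable can be run."""
    # shutil.which returns None when no executable named ffmpeg is on PATH.
    if shutil.which("ffmpeg") is None:
        return False
    try:
        # `ffmpeg -version` prints version information and exits with code 0.
        result = subprocess.run(["ffmpeg", "-version"], capture_output=True, timeout=10)
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


print(ffmpeg_is_available())
```

A caller wanting the raise_error=True behaviour described above could raise its own exception when this returns False.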

simba.utils.checks.check_file_exist_and_readable(file_path: Union[str, PathLike]) None[source]

Checks if a path points to a readable file.

Parameters

file_path (str) – Path to file on disk.

Raises
simba.utils.checks.check_float(name: str, value: Any, max_value: Optional[float] = None, min_value: Optional[float] = None, raise_error: bool = True) Tuple[bool, str][source]

Check if variable is a valid float.

Parameters
  • name (str) – Name of variable

  • value (Any) – Value of variable

  • max_value (Optional[float]) – Maximum allowed value of the float. If None, then no maximum. Default: None.

  • min_value (Optional[float]) – Minimum allowed value of the float. If None, then no minimum. Default: None.

  • raise_error (Optional[bool]) – If True, then raise error if invalid float. Default: True.

Return bool

False if invalid. True if valid.

Return str

If invalid, then error msg. Else empty str.

Examples

>>> check_float(name='My_float', value=0.5, max_value=1.0, min_value=0.0)
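The (bool, str) return convention described above can be sketched as follows. This is an illustration only, not SimBA's code: the name `float_check_sketch` is hypothetical, and the built-in ValueError stands in for whichever error class SimBA raises.

```python
from typing import Any, Optional, Tuple


def float_check_sketch(name: str,
                       value: Any,
                       max_value: Optional[float] = None,
                       min_value: Optional[float] = None,
                       raise_error: bool = True) -> Tuple[bool, str]:
    """Hypothetical re-implementation of a float validity check."""
    msg = ""
    try:
        number = float(value)
    except (TypeError, ValueError):
        msg = f"{name} should be a float, got {value!r}"
    else:
        # Range checks only run when the value could be converted.
        if min_value is not None and number < min_value:
            msg = f"{name} should be at least {min_value}, got {number}"
        elif max_value is not None and number > max_value:
            msg = f"{name} should be at most {max_value}, got {number}"
    if msg and raise_error:
        raise ValueError(msg)  # assumption: stand-in for SimBA's own error type
    return (msg == "", msg)


print(float_check_sketch(name="My_float", value=0.5, max_value=1.0, min_value=0.0))
print(float_check_sketch(name="My_float", value=1.5, max_value=1.0, raise_error=False))
```

With raise_error=False the caller receives the message and can decide how to report it.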
simba.utils.checks.check_if_2d_array_has_min_unique_values(data: ndarray, min: int) bool[source]

Check if a 2D NumPy array has at least a minimum number of unique rows.

For example, use when creating shapely Polygons or Linestrings, which typically require at least two or three unique body-part coordinates.

Parameters
  • data (np.ndarray) – Input 2D array to be checked.

  • min (int) – Minimum number of unique rows required.

Return bool

True if the input array has at least the specified minimum number of unique rows, False otherwise.

Example

>>> data = np.array([[0, 0], [0, 0], [0, 0], [0, 1]])
>>> check_if_2d_array_has_min_unique_values(data=data, min=2)
>>> True
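The idea of counting unique rows can be sketched without NumPy by treating each row as a tuple. The function name `has_min_unique_rows` is hypothetical; with a NumPy array, `np.unique(data, axis=0)` would give the unique rows instead.

```python
from typing import Sequence


def has_min_unique_rows(rows: Sequence[Sequence[float]], min_unique: int) -> bool:
    """Hypothetical helper: True if `rows` holds at least `min_unique` distinct rows."""
    # Tuples are hashable, so a set keeps exactly one copy of each distinct row.
    unique_rows = {tuple(row) for row in rows}
    return len(unique_rows) >= min_unique


rows = [[0, 0], [0, 0], [0, 0], [0, 1]]
print(has_min_unique_rows(rows, 2))  # True: (0, 0) and (0, 1)
print(has_min_unique_rows(rows, 3))  # False: only two distinct rows
```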
simba.utils.checks.check_if_df_field_is_boolean(df: DataFrame, field: str, raise_error: Optional[bool] = True, bool_values: Optional[Tuple[Any]] = (0, 1), df_name: Optional[str] = '')[source]

Helper to check if a dataframe field is a boolean value

simba.utils.checks.check_if_dir_exists(in_dir: Union[str, PathLike], source: Optional[str] = None, create_if_not_exist: Optional[bool] = False) None[source]

Check if a directory path exists.

Parameters
  • in_dir (Union[str, os.PathLike]) – Putative directory path.

  • source (Optional[str]) – String source for interpretable error messaging.

  • create_if_not_exist (Optional[bool]) – If directory does not exist, then create it. Default False.

Raises

NotDirectoryError – The directory does not exist.
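The check-or-create behaviour can be sketched with the standard library. The name `ensure_dir` is hypothetical, and the built-in NotADirectoryError is used here as a stand-in for SimBA's NotDirectoryError.

```python
import os
import tempfile


def ensure_dir(in_dir: str, create_if_not_exist: bool = False) -> None:
    """Hypothetical helper: verify a directory exists, optionally creating it."""
    if os.path.isdir(in_dir):
        return
    if create_if_not_exist:
        os.makedirs(in_dir, exist_ok=True)
        return
    # Assumption: built-in exception standing in for SimBA's own error class.
    raise NotADirectoryError(f"{in_dir} is not a valid directory")


with tempfile.TemporaryDirectory() as root:
    new_dir = os.path.join(root, "logs")
    ensure_dir(new_dir, create_if_not_exist=True)
    created = os.path.isdir(new_dir)
print(created)
```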

simba.utils.checks.check_if_filepath_list_is_empty(filepaths: List[str], error_msg: str) None[source]

Check if a list is empty

Parameters

filepaths (List[str]) – List of file-paths.

Raises

NoFilesFoundError – The list is empty.

simba.utils.checks.check_if_headers_in_dfs_are_unique(dfs: List[DataFrame]) List[str][source]

Helper to check that headers in multiple dataframes are unique.

Parameters

dfs (List[pd.DataFrame]) – List of dataframes.

Return List[str]

List of column headers seen in multiple dataframes. Empty if none.

Examples

>>> df_1, df_2 = pd.DataFrame([[1, 2]], columns=['My_column_1', 'My_column_2']), pd.DataFrame([[4, 2]], columns=['My_column_3', 'My_column_1'])
>>> check_if_headers_in_dfs_are_unique(dfs=[df_1, df_2])
>>> ['My_column_1']
simba.utils.checks.check_if_keys_exist_in_dict(data: dict, key: Union[str, int, tuple, List], name: Optional[str] = '', raise_error: Optional[bool] = True) bool[source]
simba.utils.checks.check_if_list_contains_values(data: List[Union[str, int, float]], values: List[Union[str, int, float]], name: str, raise_error: bool = True) None[source]

Helper to check if values are represented in a list. E.g., make sure annotations of behavior absent and present are represented in the annotation column.

Parameters
  • data (List[Union[float, int, str]]) – List of values. E.g., annotation column represented as list.

  • values (List[Union[float, int, str]]) – Values to confirm are present. E.g., [0, 1].

  • name (str) – Arbitrary name of the data for more useful error msg.

  • raise_error (bool) – If True, raise error if not all values can be found in data. Else, print warning.

Example

>>> check_if_list_contains_values(data=[1,2, 3, 4, 0], values=[0, 1, 6], name='My_data')
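In the example above, the value 6 is requested but absent from the data. A minimal sketch of how the missing values could be found (the helper name `missing_values` is hypothetical, not part of SimBA):

```python
from typing import List, Union

Value = Union[str, int, float]


def missing_values(data: List[Value], values: List[Value]) -> List[Value]:
    """Hypothetical helper: return the entries of `values` not present in `data`."""
    return [value for value in values if value not in data]


print(missing_values(data=[1, 2, 3, 4, 0], values=[0, 1, 6]))  # [6]
```

A non-empty result corresponds to the case where the check raises an error or prints a warning.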
simba.utils.checks.check_if_module_has_import(parsed_file: Module, import_name: str) bool[source]

Check if a Python module has a specific import statement.

Used for e.g., user custom feature extraction classes in simba.utils.custom_feature_extractor.CustomFeatureExtractor.

Parameters
  • parsed_file (ast.Module) – The abstract syntax tree (AST) of the Python module.

  • import_name (str) – The name of the module or package to check for in the import statements.

Return bool

True if the specified import is found in the module, False otherwise.

Example

>>> parsed_file = ast.parse(Path('/simba/misc/piotr.py').read_text())
>>> check_if_module_has_import(parsed_file=parsed_file, import_name='argparse')
>>> True
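One way such a check could work is to walk the AST and inspect its import nodes. This is a sketch using the standard ast module; the name `module_imports` is hypothetical, and whether SimBA also matches `from X import ...` statements is an assumption.

```python
import ast


def module_imports(parsed: ast.Module, import_name: str) -> bool:
    """Hypothetical helper: True if `import_name` is imported anywhere in the module."""
    for node in ast.walk(parsed):
        if isinstance(node, ast.Import):
            # `import a, b` yields one alias per imported name.
            if any(alias.name == import_name for alias in node.names):
                return True
        elif isinstance(node, ast.ImportFrom):
            # Assumption: `from os import path` counts as importing `os`.
            if node.module == import_name:
                return True
    return False


parsed = ast.parse("import argparse\nfrom os import path\n")
print(module_imports(parsed, "argparse"))  # True
print(module_imports(parsed, "numpy"))     # False
```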
simba.utils.checks.check_if_string_value_is_valid_video_timestamp(value: str, name: str) None[source]

Helper to check if a string is in a valid HH:MM:SS format

Parameters
  • value (str) – Timestamp in HH:MM:SS format.

  • name (str) – An arbitrary string name of the timestamp.

Raises

InvalidInputError – If the timestamp is in invalid format

Example

>>> check_if_string_value_is_valid_video_timestamp(value='00:0b:10', name='My time stamp')
>>> "InvalidInputError: My time stamp is should be in the format XX:XX:XX where X is an integer between 0-9"
>>> check_if_string_value_is_valid_video_timestamp(value='00:00:10', name='My time stamp')
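The XX:XX:XX format named in the error message can be expressed as a regular expression. A minimal sketch (the name `is_hhmmss` is hypothetical; it checks only the digit pattern, so values such as '99:99:99' pass):

```python
import re

# Two digits, colon, two digits, colon, two digits.
_HHMMSS = re.compile(r"\d{2}:\d{2}:\d{2}")


def is_hhmmss(value: str) -> bool:
    """Hypothetical helper: True if `value` matches the XX:XX:XX digit pattern."""
    return _HHMMSS.fullmatch(value) is not None


print(is_hhmmss("00:00:10"))  # True
print(is_hhmmss("00:0b:10"))  # False
```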
simba.utils.checks.check_if_valid_img(data: ndarray, source: Optional[str] = '', raise_error: Optional[bool] = True) Optional[bool][source]

Check if a variable is a valid image.

Parameters
  • source (str) – Name of the variable and/or class origin for informative error messaging and logging.

  • data (np.ndarray) – Data variable to check if a valid image representation.

  • raise_error (Optional[bool]) – If True, raise InvalidInputError if invalid image representation. Else, return bool.

simba.utils.checks.check_if_valid_input(name: str, input: str, options: List[str], raise_error: bool = True) Tuple[bool, str][source]

Check if string variable is valid option

Parameters
  • name (str) – Arbitrary name of variable.

  • input (str) – Value of variable.

  • options (List[str]) – Allowed options of input.

  • raise_error (Optional[bool]) – If True, then raise error if invalid value. Default: True.

Return bool

False if invalid. True if valid.

Return str

If invalid, then error msg. Else, empty str.

Example

>>> check_if_valid_input(name='split_eval', input='gini', options=['entropy', 'gini'])
>>> (True, '')
simba.utils.checks.check_if_valid_rgb_str(input: str, delimiter: str = ',', return_cleaned_rgb_tuple: bool = True, reverse_returned: bool = True)[source]

Helper to check if a string is a valid representation of an RGB color.

Parameters
  • input (str) – Value to check as string. E.g., ‘(166, 29, 12)’ or ‘22,32,999’

  • delimiter (str) – The delimiter between subsequent values in the rgb input string.

  • return_cleaned_rgb_tuple (bool) – If True, and input is a valid rgb, then returns a “clean” rgb tuple: Eg. ‘166, 29, 12’ -> (166, 29, 12). Else, returns None.

  • reverse_returned (bool) – If True and return_cleaned_rgb_tuple is True, reverses the returned cleaned rgb tuple (e.g., RGB becomes BGR) before returning it.

Example

>>> check_if_valid_rgb_str(input='(50, 25, 100)', return_cleaned_rgb_tuple=True, reverse_returned=True)
>>> (100, 25, 50)
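The cleaning and reversing steps can be sketched as follows. The name `parse_rgb` is hypothetical, and returning None for invalid input is an assumption made for illustration; SimBA may raise an error instead.

```python
from typing import Optional, Tuple


def parse_rgb(text: str, delimiter: str = ",", reverse: bool = False) -> Optional[Tuple[int, ...]]:
    """Hypothetical helper: parse '(r, g, b)' or 'r,g,b' into a tuple of ints."""
    # Drop surrounding whitespace and optional parentheses.
    cleaned = text.strip().strip("()")
    parts = [part.strip() for part in cleaned.split(delimiter)]
    if len(parts) != 3 or not all(part.isdecimal() for part in parts):
        return None
    values = tuple(int(part) for part in parts)
    # Each channel has to fit in 8 bits.
    if any(value > 255 for value in values):
        return None
    return values[::-1] if reverse else values


print(parse_rgb("(50, 25, 100)", reverse=True))  # (100, 25, 50)
print(parse_rgb("22,32,999"))                    # None: 999 is out of range
```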
simba.utils.checks.check_if_valid_rgb_tuple(data: Tuple[int, int, int]) bool[source]
simba.utils.checks.check_if_video_corrupted(video: Union[str, PathLike, VideoCapture], frame_interval: Optional[int] = None, frame_n: Optional[int] = 20, raise_error: Optional[bool] = True) None[source]

Check if a video file is corrupted by inspecting a set of its frames.

Note

For decent run-time regardless of video length, pass a smaller frame_n (<100).

Parameters
  • video (Union[str, os.PathLike, cv2.VideoCapture]) – Path to the video file or cv2.VideoCapture OpenCV object.

  • frame_interval (Optional[int]) – Interval between frames to be checked. If None, frame_n will be used.

  • frame_n (Optional[int]) – Number of frames to be checked, sampled at the largest allowed interval. If None, frame_interval will be used.

  • raise_error (Optional[bool]) – Whether to raise an error if corruption is found. If False, prints warning.

Return None

Example

>>> check_if_video_corrupted(video='/Users/simon/Downloads/NOR ENCODING FExMP8.mp4')
simba.utils.checks.check_instance(source: str, instance: object, accepted_types: Union[Tuple[Any], Any]) None[source]

Check if an instance is an acceptable type.

Parameters
  • source (str) – Arbitrary name of instance used for interpretable error msg. Can also be the name of the method.

  • instance (object) – A data object.

  • accepted_types (Union[Tuple[object], object]) – Accepted instance types. E.g., (Polygon, pd.DataFrame) or Polygon.

simba.utils.checks.check_int(name: str, value: Any, max_value: Optional[int] = None, min_value: Optional[int] = None, raise_error: Optional[bool] = True) Tuple[bool, str][source]

Check if variable is a valid integer.

Parameters
  • name (str) – Name of variable

  • value (Any) – Value of variable

  • max_value (Optional[int]) – Maximum allowed value of the variable. If None, then no maximum. Default: None.

  • min_value (Optional[int]) – Minimum allowed value of the variable. If None, then no minimum. Default: None.

  • raise_error (Optional[bool]) – If True, then raise error if invalid integer. Default: True.

Return bool

False if invalid. True if valid.

Return str

If invalid, then error msg. Else empty str.

Examples

>>> check_int(name='My_fps', value=25, min_value=1)
simba.utils.checks.check_iterable_length(source: str, val: int, exact_accepted_length: Optional[int] = None, max: Optional[int] = inf, min: int = 1) None[source]
simba.utils.checks.check_minimum_roll_windows(roll_windows_values: List[int], minimum_fps: float) List[int][source]

Remove any rolling temporal windows that are shorter than a single frame in any of the videos within the project.

Parameters
  • roll_windows_values (List[int]) – Rolling temporal windows represented as frame counts. E.g., [10, 15, 30, 60]

  • minimum_fps (float) – The lowest fps of the videos that are to be analyzed. E.g., 10.

Return List[int]

roll_windows_values without impassable windows.

simba.utils.checks.check_nvidea_gpu_available() bool[source]

Helper to check if an NVIDIA GPU is available via nvidia-smi. Returns bool: True if nvidia-smi returns not None. Else False.

simba.utils.checks.check_same_number_of_rows_in_dfs(dfs: List[DataFrame]) bool[source]

Helper to check that each dataframe in list contains an equal number of rows

Parameters

dfs (List[pd.DataFrame]) – List of dataframes.

Return bool

True if dataframes have an equal number of rows. Else False.

Example

>>> df_1, df_2 = pd.DataFrame([[1, 2], [1, 2]]), pd.DataFrame([[4, 2], [9, 3], [1, 5]])
>>> check_same_number_of_rows_in_dfs(dfs=[df_1, df_2])
>>> False
>>> df_1, df_2 = pd.DataFrame([[1, 2], [1, 2]]), pd.DataFrame([[4, 2], [9, 3]])
>>> check_same_number_of_rows_in_dfs(dfs=[df_1, df_2])
>>> True
simba.utils.checks.check_str(name: str, value: Any, options: Optional[Tuple[Any]] = (), allow_blank: bool = False, raise_error: bool = True) Tuple[bool, str][source]

Check if variable is a valid string.

Parameters
  • name (str) – Name of variable

  • value (Any) – Value of variable

  • options (Optional[Tuple[Any]]) – Tuple of allowed strings. If empty tuple, then any string allowed. Default: ().

  • allow_blank (Optional[bool]) – If True, allow empty string. Default: False.

  • raise_error (Optional[bool]) – If True, then raise error if invalid string. Default: True.

Return bool

False if invalid. True if valid.

Return str

If invalid, then error msg. Else empty str.

Examples

>>> check_str(name='split_eval', value='gini', options=['entropy', 'gini'])
simba.utils.checks.check_that_column_exist(df: DataFrame, column_name: Union[str, PathLike, List[str]], file_name: str) None[source]

Check if single named field or a list of fields exist within a dataframe.

Parameters
  • df (pd.DataFrame) – Dataframe to check.

  • column_name (Union[str, List[str]]) – Name or names of field(s).

  • file_name (str) – Path of df on disk.

Raises

ColumnNotFoundError – The column_name does not exist within df.

simba.utils.checks.check_that_dir_has_list_of_filenames(dir: Union[str, PathLike], file_name_lst: List[str], file_type: Optional[str] = 'csv')[source]

Check that all file names in a list have an equivalent file in a specified directory. E.g., check if all files in the outlier corrected folder have an equivalent file in the features_extracted directory.

Example

>>> file_name_lst = glob.glob('/Users/simon/Desktop/envs/troubleshooting/two_black_animals_14bp/project_folder/csv/outlier_corrected_movement' + '/*.csv')
>>> check_that_dir_has_list_of_filenames(dir = '/Users/simon/Desktop/envs/troubleshooting/two_black_animals_14bp/project_folder/csv/features_extracted', file_name_lst=file_name_lst)
simba.utils.checks.check_that_directory_is_empty(directory: Union[str, PathLike], raise_error: Optional[bool] = True) None[source]

Checks if a directory is empty. If the directory has content, then returns False or raises DirectoryNotEmptyError.

Parameters

directory (str) – Directory to check.

Raises

DirectoryNotEmptyError – If directory contains files.

simba.utils.checks.check_that_hhmmss_start_is_before_end(start_time: str, end_time: str, name: str) None[source]

Helper to check that a start time in HH:MM:SS or HH:MM:SS:MS format is before an end time in HH:MM:SS or HH:MM:SS:MS format

Parameters
  • start_time (str) – Period start time in HH:MM:SS format.

  • end_time (str) – Period end time in HH:MM:SS format.

  • name (str) – Name of the variable.

Raises

InvalidInputError – If end time is before the start time.

Example

>>> check_that_hhmmss_start_is_before_end(start_time='00:00:05', end_time='00:00:01', name='My time period')
>>> "InvalidInputError: My time period has an end-time which is before the start-time"
>>> check_that_hhmmss_start_is_before_end(start_time='00:00:01', end_time='00:00:05', name='My time period')
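The comparison amounts to converting both timestamps to seconds. A minimal sketch covering only the HH:MM:SS form (the HH:MM:SS:MS variant is not handled, and both function names are hypothetical):

```python
def hhmmss_to_seconds(timestamp: str) -> int:
    """Hypothetical helper: convert 'HH:MM:SS' into a number of seconds."""
    hours, minutes, seconds = (int(part) for part in timestamp.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def end_precedes_start(start_time: str, end_time: str) -> bool:
    """True in the error case described above: the end time is before the start time."""
    return hhmmss_to_seconds(end_time) < hhmmss_to_seconds(start_time)


print(end_precedes_start(start_time="00:00:05", end_time="00:00:01"))  # True
print(end_precedes_start(start_time="00:00:01", end_time="00:00:05"))  # False
```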
simba.utils.checks.check_umap_hyperparameters(hyper_parameters: Dict[str, Any]) None[source]

Checks if a dictionary of parameters (umap, scaling, etc.) is valid for grid-search umap dimensionality reduction.

Parameters

hyper_parameters (dict) – Dictionary holding umap hyperparameters.

Raises

InvalidInputError – If any input is invalid

Example

>>> check_umap_hyperparameters(hyper_parameters={'n_neighbors': [2], 'min_distance': [0.1], 'spread': [1], 'scaler': 'MIN-MAX', 'variance': 0.2})
simba.utils.checks.check_valid_array(data: ndarray, source: Optional[str] = '', accepted_ndims: Optional[List[Tuple[int]]] = None, accepted_sizes: Optional[List[int]] = None, accepted_axis_0_shape: Optional[List[int]] = None, accepted_axis_1_shape: Optional[List[int]] = None, accepted_dtypes: Optional[List[str]] = None, accepted_values: Optional[List[Any]] = None, accepted_shapes: Optional[List[Tuple[int]]] = None, min_axis_0: Optional[int] = None, max_axis_1: Optional[int] = None, min_axis_1: Optional[int] = None) None[source]

Check if the given array satisfies specified criteria regarding its dimensions, shape, and data type.

Parameters
  • data (np.ndarray) – The numpy array to be checked.

  • source (Optional[str]) – A string identifying the source, name, or purpose of the array for interpretable error messaging.

  • accepted_ndims (Optional[Tuple[int]]) – List of tuples representing acceptable dimensions. If provided, checks whether the array’s number of dimensions matches any tuple in the list.

  • accepted_axis_0_shape (Optional[List[int]]) – List of accepted number of rows of 2-dimensional array. Will also raise error if value passed and input is not a 2-dimensional array.

  • accepted_axis_1_shape (Optional[List[int]]) – List of accepted number of columns or fields of 2-dimensional array. Will also raise error if value passed and input is not a 2-dimensional array.

  • accepted_sizes (Optional[List[int]]) – List of acceptable sizes for the array’s shape. If provided, checks whether the length of the array’s shape matches any value in the list.

  • accepted_dtypes (Optional[List[str]]) – List of acceptable data types for the array. If provided, checks whether the array’s data type matches any string in the list.

Example

>>> data = np.array([[1, 2], [3, 4]])
>>> check_valid_array(data, source="Example", accepted_ndims=(4, 3), accepted_sizes=[2], accepted_dtypes=['int'])
simba.utils.checks.check_valid_dataframe(df: DataFrame, source: Optional[str] = '', valid_dtypes: Optional[Tuple[Any]] = None, required_fields: Optional[List[str]] = None, min_axis_0: Optional[int] = None, min_axis_1: Optional[int] = None, max_axis_0: Optional[int] = None, max_axis_1: Optional[int] = None)[source]

Helper to check if a dataframe is valid

simba.utils.checks.check_valid_extension(path: Union[str, PathLike], accepted_extensions: Union[List[str], str])[source]

Checks if the file extension of the provided path is in the list of accepted extensions.

Parameters
  • path (Union[str, os.PathLike]) – The path to the file whose extension needs to be checked.

  • accepted_extensions (List[str]) – A list of accepted file extensions. E.g., [‘pickle’, ‘csv’].
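An extension check of this kind can be sketched with pathlib. The name `has_accepted_extension` is hypothetical, and ignoring case and a leading dot are assumptions made for this illustration.

```python
from pathlib import Path
from typing import List, Union


def has_accepted_extension(path: Union[str, Path], accepted_extensions: Union[List[str], str]) -> bool:
    """Hypothetical helper: True if the file extension of `path` is accepted."""
    if isinstance(accepted_extensions, str):
        accepted_extensions = [accepted_extensions]
    # Assumption: compare case-insensitively and ignore a leading dot.
    accepted = {ext.lower().lstrip(".") for ext in accepted_extensions}
    return Path(path).suffix.lower().lstrip(".") in accepted


print(has_accepted_extension("data/video_1.CSV", ["pickle", "csv"]))  # True
print(has_accepted_extension("data/video_1.mp4", "csv"))              # False
```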

simba.utils.checks.check_valid_hex_color(color_hex: str, raise_error: Optional[bool] = True) bool[source]

Check if given string represents a valid hexadecimal color code.

Parameters
  • color_hex (str) – A string representing a hexadecimal color code, either in the format ‘#RRGGBB’ or ‘#RGB’.

  • raise_error (bool) – If True, raise an exception when the color_hex is invalid; if False, return False instead. Default is True.

Return bool

True if the color_hex is a valid hexadecimal color code; False otherwise (if raise_error is False).

Raises

IntegerError – If the color_hex is an invalid hexadecimal color code and raise_error is True.
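The two accepted formats, '#RRGGBB' and '#RGB', can be captured with one regular expression. A minimal sketch (the name `is_valid_hex_color` is hypothetical; it only returns a bool and never raises):

```python
import re

# '#' followed by exactly three or exactly six hexadecimal digits.
_HEX_COLOR = re.compile(r"#(?:[0-9a-fA-F]{3}|[0-9a-fA-F]{6})")


def is_valid_hex_color(color_hex: str) -> bool:
    """Hypothetical helper: True if `color_hex` looks like '#RGB' or '#RRGGBB'."""
    return _HEX_COLOR.fullmatch(color_hex) is not None


print(is_valid_hex_color("#FF5733"))  # True
print(is_valid_hex_color("#abc"))     # True
print(is_valid_hex_color("#GG5733"))  # False
```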

simba.utils.checks.check_valid_lst(data: list, source: Optional[str] = '', valid_dtypes: Optional[Tuple[Any]] = None, valid_values: Optional[List[Any]] = None, min_len: Optional[int] = 1, max_len: Optional[int] = None, exact_len: Optional[int] = None, raise_error: Optional[bool] = True) bool[source]

Check the validity of a list based on passed criteria.

Parameters
  • data (list) – The input list to be validated.

  • source (Optional[str]) – A string indicating the source or context of the data for informative error messaging.

  • valid_dtypes (Optional[Tuple[Any]]) – A tuple of accepted data types. If provided, check if all elements in the list have data types in this tuple.

  • valid_values (Optional[List[Any]]) – A list of accepted list values. If provided, check if all elements in the list have matching values in this list.

  • min_len (Optional[int]) – The minimum allowed length of the list.

  • max_len (Optional[int]) – The maximum allowed length of the list.

  • raise_error (Optional[bool]) – If True, raise an InvalidInputError if any validation fails. If False, return False instead of raising an error.

Return bool

True if all validation criteria are met, False otherwise.

Example

>>> check_valid_lst(data=[1, 2, 'three'], valid_dtypes=(int, str), min_len=2, max_len=5)
>>> check_valid_lst(data=[1, 2, 3], valid_dtypes=(int,), min_len=3)
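The criteria listed above can be combined as in the sketch below. It is an illustration, not SimBA's code: the name `list_is_valid` is hypothetical and it always returns a bool rather than raising.

```python
from typing import Any, List, Optional, Tuple


def list_is_valid(data: Any,
                  valid_dtypes: Optional[Tuple[type, ...]] = None,
                  valid_values: Optional[List[Any]] = None,
                  min_len: int = 1,
                  max_len: Optional[int] = None) -> bool:
    """Hypothetical helper: apply length, type and value checks to a list."""
    if not isinstance(data, list):
        return False
    if len(data) < min_len:
        return False
    if max_len is not None and len(data) > max_len:
        return False
    # Every element has to be an instance of one of the accepted types.
    if valid_dtypes is not None and not all(isinstance(item, valid_dtypes) for item in data):
        return False
    # Every element has to appear among the accepted values.
    if valid_values is not None and not all(item in valid_values for item in data):
        return False
    return True


print(list_is_valid([1, 2, "three"], valid_dtypes=(int, str), min_len=2, max_len=5))  # True
print(list_is_valid([1, 2, "three"], valid_dtypes=(int,), min_len=2))                 # False
```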
simba.utils.checks.check_valid_tuple(x: tuple, source: Optional[str] = '', accepted_lengths: Optional[Tuple[int]] = None, valid_dtypes: Optional[Tuple[Any]] = None)[source]
simba.utils.checks.check_video_and_data_frm_count_align(video: Union[str, PathLike, VideoCapture], data: Union[str, PathLike, DataFrame], name: Optional[str] = '', raise_error: Optional[bool] = True) None[source]

Check if the frame count of a video matches the row count of a data file.

Parameters
  • video (Union[str, os.PathLike, cv2.VideoCapture]) – Path to the video file or cv2.VideoCapture object.

  • data (Union[str, os.PathLike, pd.DataFrame]) – Path to the data file or DataFrame containing the data.

  • name (Optional[str]) – Name of the video (optional for interpretable error msgs).

  • raise_error (Optional[bool]) – Whether to raise an error if the counts don’t align (default is True). If False, prints warning.

Return None

Example

>>> data_1 = '/Users/simon/Desktop/envs/simba/troubleshooting/mouse_open_field/project_folder/csv/outlier_corrected_movement_location/SI_DAY3_308_CD1_PRESENT.csv'
>>> video_1 = '/Users/simon/Desktop/envs/simba/troubleshooting/mouse_open_field/project_folder/frames/output/ROI_analysis/SI_DAY3_308_CD1_PRESENT.mp4'
>>> check_video_and_data_frm_count_align(video=video_1, data=data_1, raise_error=True)
simba.utils.checks.check_video_has_rois(roi_dict: dict, video_names: List[str], roi_names: List[str])[source]

Check that specified videos all have user-defined ROIs with specified names.

simba.utils.checks.get_fn_ext(filepath: Union[PathLike, str]) Tuple[str, str, str][source]

Split file path into three components: (i) directory, (ii) file name, and (iii) file extension.

Parameters

filepath (str) – Path to file.

Return str

File directory name

Return str

File name

Return str

File extension

Example

>>> get_fn_ext(filepath='C:/My_videos/MyVideo.mp4')
>>> ('My_videos', 'MyVideo', '.mp4')

SimBA project config creator

class simba.utils.config_creator.ProjectConfigCreator(project_path: str, project_name: str, target_list: List[str], pose_estimation_bp_cnt: str, body_part_config_idx: int, animal_cnt: int, file_type: str = 'csv')[source]

Bases: object

Create SimBA project directory tree and associated project_config.ini config file.

Parameters
  • project_path (str) – path to directory where to save the SimBA project directory tree

  • project_name (str) – Name of the SimBA project

  • target_list (List[str]) – Classifier names in the SimBA project

  • pose_estimation_bp_cnt (str) – String representing the number of body-parts in the pose-estimation data used in the simba project. E.g., ‘4’, ‘7’, ‘8’, ‘9’, ‘14’, ‘16’ or ‘user_defined’, ‘3D_user_defined’.

  • body_part_config_idx (int) – The index of the SimBA GUI dropdown pose-estimation selection. E.g., 1. I.e., the row representing your pose-estimated body-parts in this file.

  • animal_cnt (int) – Number of animals tracked in the input pose-estimation data.

  • file_type (str) – The SimBA project file type. OPTIONS: csv or parquet.

Note

Tutorial.

Examples

>>> _ = ProjectConfigCreator(project_path = 'project/path', project_name='project_name', target_list=['Attack'], pose_estimation_bp_cnt='16', body_part_config_idx=9, animal_cnt=2, file_type='csv')

Data utilities

simba.utils.data.add_missing_ROI_cols(shape_df: DataFrame) DataFrame[source]

Add missing ROI definitions in ROI info dataframes created by the first version of the SimBA ROI user-interface but analyzed using newer versions of SimBA.

Parameters

shape_df (pd.DataFrame) – Dataframe holding ROI definitions.

Returns DataFrame

simba.utils.data.beta(a, b, size=None)

Draw samples from a Beta distribution.

The Beta distribution is a special case of the Dirichlet distribution, and is related to the Gamma distribution. It has the probability distribution function

f(x; a,b) = \frac{1}{B(\alpha, \beta)} x^{\alpha - 1} (1 - x)^{\beta - 1},

where the normalization, B, is the beta function,

B(\alpha, \beta) = \int_0^1 t^{\alpha - 1} (1 - t)^{\beta - 1} dt.

It is often seen in Bayesian inference and order statistics.

Note

New code should use the beta method of a default_rng() instance instead; please see the random-quick-start.

Parameters
  • a (float or array_like of floats) – Alpha, positive (>0).

  • b (float or array_like of floats) – Beta, positive (>0).

  • size (int or tuple of ints, optional) – Output shape. If the given shape is, e.g., (m, n, k), then m * n * k samples are drawn. If size is None (default), a single value is returned if a and b are both scalars. Otherwise, np.broadcast(a, b).size samples are drawn.

Returns

out – Drawn samples from the parameterized beta distribution.

Return type

ndarray or scalar

See also

Generator.beta

which should be used for new code.

simba.utils.data.binomial(n, p, size=None)

Draw samples from a binomial distribution.

Samples are drawn from a binomial distribution with specified parameters, n trials and p probability of success, where n is an integer >= 0 and p is in the interval [0,1]. (n may be input as a float, but it is truncated to an integer in use.)

Note

New code should use the binomial method of a default_rng() instance instead; please see the random-quick-start.

Parameters
  • n (int or array_like of ints) – Parameter of the distribution, >= 0. Floats are also accepted, but they will be truncated to integers.

  • p (float or array_like of floats) – Parameter of the distribution, >= 0 and <=1.

  • size (int or tuple of ints, optional) – Output shape. If the given shape is, e.g., (m, n, k), then m * n * k samples are drawn. If size is None (default), a single value is returned if n and p are both scalars. Otherwise, np.broadcast(n, p).size samples are drawn.

Returns

out – Drawn samples from the parameterized binomial distribution, where each sample is equal to the number of successes over the n trials.

Return type

ndarray or scalar

See also

scipy.stats.binom

probability density function, distribution or cumulative density function, etc.

Generator.binomial

which should be used for new code.

Notes

The probability density for the binomial distribution is

P(N) = \binom{n}{N}p^N(1-p)^{n-N},

where n is the number of trials, p is the probability of success, and N is the number of successes.

When estimating the standard error of a proportion in a population by using a random sample, the normal distribution works well unless the product p*n <=5, where p = population proportion estimate, and n = number of samples, in which case the binomial distribution is used instead. For example, a sample of 15 people shows 4 who are left handed, and 11 who are right handed. Then p = 4/15 = 27%. 0.27*15 = 4, so the binomial distribution should be used in this case.

References

[1] Dalgaard, Peter, “Introductory Statistics with R”, Springer-Verlag, 2002.

[2] Glantz, Stanton A. “Primer of Biostatistics.”, McGraw-Hill, Fifth Edition, 2002.

[3] Lentner, Marvin, “Elementary Applied Statistics”, Bogden and Quigley, 1972.

[4] Weisstein, Eric W. “Binomial Distribution.” From MathWorld–A Wolfram Web Resource. http://mathworld.wolfram.com/BinomialDistribution.html

[5] Wikipedia, “Binomial distribution”, https://en.wikipedia.org/wiki/Binomial_distribution

Examples

Draw samples from the distribution:

>>> n, p = 10, .5  # number of trials, probability of each trial
>>> s = np.random.binomial(n, p, 1000)
# result of flipping a coin 10 times, tested 1000 times.

A real world example. A company drills 9 wild-cat oil exploration wells, each with an estimated probability of success of 0.1. All nine wells fail. What is the probability of that happening?

Let’s do 20,000 trials of the model, and count the number that generate zero positive results.

>>> sum(np.random.binomial(9, 0.1, 20000) == 0)/20000.
# answer = 0.38885, or 38%.
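The simulated estimate above can be cross-checked against the closed-form probability P(N) given in the Notes. A small standard-library sketch (the function name `binomial_pmf` is chosen here for illustration):

```python
from math import comb


def binomial_pmf(successes: int, n: int, p: float) -> float:
    """P(N) = C(n, N) * p**N * (1 - p)**(n - N) for N successes in n trials."""
    return comb(n, successes) * p ** successes * (1 - p) ** (n - successes)


# Probability that all 9 wells fail when each succeeds with probability 0.1.
print(round(binomial_pmf(0, 9, 0.1), 4))  # 0.3874
```

The exact value 0.9 ** 9 is about 0.3874, consistent with the simulated estimate of roughly 38%.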
simba.utils.data.bucket_data(data: ndarray, method: typing_extensions.Literal['fd', 'doane', 'auto', 'scott', 'stone', 'rice', 'sturges', 'sqrt'] = 'auto') Tuple[float, int][source]

Computes the optimal bin count and bin width non-heuristically using specified method.

Parameters
  • data (np.ndarray) – 1D array of numerical data.

  • method (str) – The method to compute optimal bin count and bin width. These methods differ in how they estimate the optimal bin count and width. Defaults to ‘auto’, which represents the maximum of the Sturges and Freedman-Diaconis estimators. Available methods are ‘fd’, ‘doane’, ‘auto’, ‘scott’, ‘stone’, ‘rice’, ‘sturges’, ‘sqrt’.

Returns Tuple[float, int]

A tuple containing the optimal bin width and bin count.

Example

>>> data = np.random.randint(low=1, high=1000, size=(1, 100))
>>> bucket_data(data=data, method='fd')
>>> (190.8, 6)
>>> bucket_data(data=data, method='doane')
>>> (106.0, 10)
simba.utils.data.chisquare(df, size=None)

Draw samples from a chi-square distribution.

When df independent random variables, each with standard normal distributions (mean 0, variance 1), are squared and summed, the resulting distribution is chi-square (see Notes). This distribution is often used in hypothesis testing.

Note

New code should use the chisquare method of a default_rng() instance instead; please see the random-quick-start.

Parameters
  • df (float or array_like of floats) – Number of degrees of freedom, must be > 0.

  • size (int or tuple of ints, optional) – Output shape. If the given shape is, e.g., (m, n, k), then m * n * k samples are drawn. If size is None (default), a single value is returned if df is a scalar. Otherwise, np.array(df).size samples are drawn.

Returns

out – Drawn samples from the parameterized chi-square distribution.

Return type

ndarray or scalar

Raises

ValueError – When df <= 0 or when an inappropriate size (e.g. size=-1) is given.

See also

Generator.chisquare

which should be used for new code.

Notes

The variable obtained by summing the squares of df independent, standard normally distributed random variables:

Q = \sum_{i=1}^{\mathtt{df}} X^2_i

is chi-square distributed, denoted

Q \sim \chi^2_k.

The probability density function of the chi-squared distribution is

p(x) = \frac{(1/2)^{k/2}}{\Gamma(k/2)} x^{k/2 - 1} e^{-x/2},

where k = \mathtt{df} is the number of degrees of freedom and \Gamma is the gamma function,

\Gamma(x) = \int_0^{\infty} t^{x - 1} e^{-t} dt.
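The construction in these Notes (summing the squares of df independent standard normal draws) can be simulated directly with the standard library. This is a sketch for integer df only; the name `chisquare_sample` is hypothetical. It relies on the fact that a chi-square variable with k degrees of freedom has mean k.

```python
import random


def chisquare_sample(df: int, rng: random.Random) -> float:
    """Hypothetical helper: one chi-square draw as a sum of df squared N(0, 1) draws."""
    return sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(df))


rng = random.Random(0)  # fixed seed so the run is repeatable
samples = [chisquare_sample(3, rng) for _ in range(20000)]
mean = sum(samples) / len(samples)
# The sample mean should land close to the number of degrees of freedom (3).
print(round(mean, 2))
```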

References

[1] NIST “Engineering Statistics Handbook” https://www.itl.nist.gov/div898/handbook/eda/section3/eda3666.htm

Examples

>>> np.random.chisquare(2,4)
array([ 1.89920014,  9.00867716,  3.13710533,  5.62318272]) # random
simba.utils.data.choice(a, size=None, replace=True, p=None)

Generates a random sample from a given 1-D array

New in version 1.7.0.

Note

New code should use the choice method of a default_rng() instance instead; please see the random-quick-start.

Parameters
  • a (1-D array-like or int) – If an ndarray, a random sample is generated from its elements. If an int, the random sample is generated as if a were np.arange(a)

  • size (int or tuple of ints, optional) – Output shape. If the given shape is, e.g., (m, n, k), then m * n * k samples are drawn. Default is None, in which case a single value is returned.

  • replace (boolean, optional) – Whether the sample is with or without replacement

  • p (1-D array-like, optional) – The probabilities associated with each entry in a. If not given the sample assumes a uniform distribution over all entries in a.

Returns

samples – The generated random samples

Return type

single item or ndarray

Raises

ValueError – If a is an int and less than zero, if a or p are not 1-dimensional, if a is an array-like of size 0, if p is not a vector of probabilities, if a and p have different lengths, or if replace=False and the sample size is greater than the population size

See also

randint, shuffle, permutation

Generator.choice

which should be used in new code

Notes

Sampling random rows from a 2-D array is not possible with this function, but is possible with Generator.choice through its axis keyword.

Examples

Generate a uniform random sample from np.arange(5) of size 3:

>>> np.random.choice(5, 3)
array([0, 3, 4]) # random
>>> #This is equivalent to np.random.randint(0,5,3)

Generate a non-uniform random sample from np.arange(5) of size 3:

>>> np.random.choice(5, 3, p=[0.1, 0, 0.3, 0.6, 0])
array([3, 3, 0]) # random

Generate a uniform random sample from np.arange(5) of size 3 without replacement:

>>> np.random.choice(5, 3, replace=False)
array([3,1,0]) # random
>>> #This is equivalent to np.random.permutation(np.arange(5))[:3]

Generate a non-uniform random sample from np.arange(5) of size 3 without replacement:

>>> np.random.choice(5, 3, replace=False, p=[0.1, 0, 0.3, 0.6, 0])
array([2, 3, 0]) # random

Any of the above can be repeated with an arbitrary array-like instead of just integers. For instance:

>>> aa_milne_arr = ['pooh', 'rabbit', 'piglet', 'Christopher']
>>> np.random.choice(aa_milne_arr, 5, p=[0.5, 0.1, 0.1, 0.3])
array(['pooh', 'pooh', 'pooh', 'Christopher', 'piglet'], # random
      dtype='<U11')
simba.utils.data.convert_roi_definitions(roi_definitions_path: Union[str, PathLike], save_dir: Union[str, PathLike]) None[source]

Helper to convert SimBA ROI_definitions.h5 file into human-readable format.

Parameters
  • roi_definitions_path (Union[str, os.PathLike]) – Path to SimBA ROI_definitions.h5 on disk.

  • save_dir (Union[str, os.PathLike]) – Directory location where the output data should be stored

simba.utils.data.create_color_palette(pallete_name: str, increments: int, as_rgb_ratio: Optional[bool] = False, as_hex: Optional[bool] = False) list[source]

Create a list of colors in RGB from specified color palette.

Parameters
  • pallete_name (str) – Palette name (e.g., jet)

  • increments (int) – Numbers of colors in the color palette to create.

  • as_rgb_ratio (Optional[bool]) – Return RGB values as ratios in the range 0-1 rather than 0-255. Default: False

  • as_hex (Optional[bool]) – Return values as HEX. Default: False

Note

If both as_rgb_ratio and as_hex are True, HEX values are returned.

Return list

Color palette values.

Example

>>> create_color_palette(pallete_name='jet', increments=3)
[[127.5, 0.0, 0.0], [255.0, 212.5, 0.0], [0.0, 229.81481481481478, 255.0], [0.0, 0.0, 127.5]]
>>> create_color_palette(pallete_name='jet', increments=3, as_rgb_ratio=True)
[[0.5, 0.0, 0.0], [1.0, 0.8333333333333334, 0.0], [0.0, 0.9012345679012345, 1.0], [0.0, 0.0, 0.5]]
>>> create_color_palette(pallete_name='jet', increments=3, as_hex=True)
['#800000', '#ffd400', '#00e6ff', '#000080']
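
The three output formats in the example are related by simple arithmetic. The sketch below shows one way the 0-255 values could map to ratios and HEX strings; rgb_to_ratio and rgb_to_hex are hypothetical helpers, and the use of Python's round is an assumption chosen because it reproduces the values shown above, not a statement of how SimBA performs the conversion.

```python
def rgb_to_ratio(rgb):
    """Scale 0-255 channel values to the 0-1 range."""
    return [channel / 255 for channel in rgb]

def rgb_to_hex(rgb):
    """Format 0-255 channel values as a '#rrggbb' string."""
    return "#" + "".join(f"{round(channel):02x}" for channel in rgb)

palette = [[127.5, 0.0, 0.0], [255.0, 212.5, 0.0]]
ratios = [rgb_to_ratio(color) for color in palette]
hex_codes = [rgb_to_hex(color) for color in palette]
print(ratios)
print(hex_codes)
```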
simba.utils.data.create_color_palettes(no_animals: int, map_size: int, cmaps: Optional[List[str]] = None) List[List[int]][source]

Create a list of lists of BGR colors, one list for each animal. Each list is pulled from a different matplotlib color map.

Parameters
  • no_animals (int) – Number of different palette lists

  • map_size (int) – Number of colors in each created palette.

Return List[List[int]]

BGR colors

Example

>>> create_color_palettes(no_animals=2, map_size=2)
>>> [[[255.0, 0.0, 255.0], [0.0, 255.0, 255.0]], [[102.0, 127.5, 0.0], [102.0, 255.0, 255.0]]]
simba.utils.data.detect_bouts(data_df: DataFrame, target_lst: List[str], fps: int) DataFrame[source]

Detect behavior “bouts” (e.g., continuous sequences of classified behavior-present frames) for specified classifiers.

Note

Can be any field of boolean type. E.g., target_lst = [‘Inside_ROI_1’] also works for bouts inside ROI shape.

Parameters
  • data_df (pd.DataFrame) – Dataframe with fields representing classifications in boolean type.

  • target_lst (List[str]) – Classifier names. E.g., [‘Attack’, ‘Sniffing’, ‘Grooming’] or ROIs

  • fps (int) – The fps of the input video.

Return pd.DataFrame

Dataframe where bouts are represented by rows and fields are represented by ‘Event type ‘, ‘Start time’, ‘End time’, ‘Start frame’, ‘End frame’, ‘Bout time’

Example

>>> data_df = read_df(file_path='tests/data/test_projects/two_c57/project_folder/csv/machine_results/Together_1.csv', file_type='csv')
>>> detect_bouts(data_df=data_df, target_lst=['Attack', 'Sniffing'], fps=25)
    'Event'  'Start_time'  'End Time'  'Start_frame'  'End_frame'  'Bout_time'
0   'Attack'    5.03          5.33          151        159            0.30
1   'Attack'    5.87          6.23          176        186            0.37
2  'Sniffing'   3.47          3.83          104        114            0.37
simba.utils.data.detect_bouts_multiclass(data: DataFrame, target: str, fps: int = 1, classifier_map: Optional[Dict[int, str]] = None) DataFrame[source]

Detect bouts in a multiclass time series dataset and return the bout event types, their start times, end times and duration.

Parameters
  • data (pd.DataFrame) – A Pandas DataFrame containing multiclass time series data.

  • target (str) – Name of the target column in data.

  • fps (int) – Frames per second of the video used to collect data. Default is 1.

  • classifier_map (Dict[int, str]) – A dictionary mapping class labels to their names. Used to replace numeric labels with descriptive names. If None, then numeric event labels are kept.

Example

>>> df = pd.DataFrame({'value': [0, 0, 0, 2, 2, 1, 1, 1, 3, 3]})
>>> detect_bouts_multiclass(data=df, target='value', fps=3, classifier_map={0: 'None', 1: 'sharp', 2: 'track', 3: 'sync'})
   'Event'  'Start_time'  'End_time'  'Start_frame'  'End_frame'  'Bout_time'
0   'None'    0.000000  1.000000          0.0        2.0   1.000000
1   'sharp'   1.666667  2.666667          5.0        7.0   1.000000
2   'track'   1.000000  1.666667          3.0        4.0   0.666667
3   'sync'    2.666667  3.333333          8.0        9.0   0.666667
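
The bout table above can be reproduced with a run-length pass over the label sequence. The following is a minimal standard-library sketch, not the SimBA implementation: find_bouts is a hypothetical helper, and the time convention (start time is start frame / fps, end time is (end frame + 1) / fps) is inferred from the example output rather than taken from the source code.

```python
from itertools import groupby

def find_bouts(labels, fps):
    """Return one dict per run of identical consecutive labels."""
    bouts, frame = [], 0
    for label, run in groupby(labels):
        length = len(list(run))
        start_frame, end_frame = frame, frame + length - 1
        bouts.append({
            "Event": label,
            "Start_time": start_frame / fps,
            "End_time": (end_frame + 1) / fps,  # end of the last frame
            "Start_frame": start_frame,
            "End_frame": end_frame,
            "Bout_time": length / fps,
        })
        frame += length
    return bouts

names = {0: "None", 1: "sharp", 2: "track", 3: "sync"}
values = [0, 0, 0, 2, 2, 1, 1, 1, 3, 3]
bouts = find_bouts([names[v] for v in values], fps=3)
for bout in bouts:
    print(bout)
```

The sketch returns bouts in chronological order, whereas the table above lists them grouped by event label; the frame and time values are the same.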
simba.utils.data.dirichlet(alpha, size=None)

Draw samples from the Dirichlet distribution.

Draw size samples of dimension k from a Dirichlet distribution. A Dirichlet-distributed random variable can be seen as a multivariate generalization of a Beta distribution. The Dirichlet distribution is a conjugate prior of a multinomial distribution in Bayesian inference.

Note

New code should use the dirichlet method of a default_rng() instance instead; please see the random-quick-start.

Parameters
  • alpha (sequence of floats, length k) – Parameter of the distribution (length k for sample of length k).

  • size (int or tuple of ints, optional) – Output shape. If the given shape is, e.g., (m, n), then m * n * k samples are drawn. Default is None, in which case a vector of length k is returned.

Returns

samples – The drawn samples, of shape (size, k).

Return type

ndarray

Raises

ValueError – If any value in alpha is less than or equal to zero

See also

Generator.dirichlet

which should be used for new code.

Notes

The Dirichlet distribution is a distribution over vectors x that fulfil the conditions x_i>0 and \sum_{i=1}^k x_i = 1.

The probability density function p of a Dirichlet-distributed random vector X is proportional to

p(x) \propto \prod_{i=1}^{k}{x^{\alpha_i-1}_i},

where \alpha is a vector containing the positive concentration parameters.

The method uses the following property for computation: let Y be a random vector which has components that follow a standard gamma distribution, then X = \frac{1}{\sum_{i=1}^k{Y_i}} Y is Dirichlet-distributed

References

1

David MacKay, “Information Theory, Inference and Learning Algorithms,” chapter 23, http://www.inference.org.uk/mackay/itila/

2

Wikipedia, “Dirichlet distribution”, https://en.wikipedia.org/wiki/Dirichlet_distribution

Examples

Taking an example cited in Wikipedia, this distribution can be used if one wanted to cut strings (each of initial length 1.0) into K pieces with different lengths, where each piece had, on average, a designated average length, but allowing some variation in the relative sizes of the pieces.

>>> s = np.random.dirichlet((10, 5, 3), 20).transpose()
>>> import matplotlib.pyplot as plt
>>> plt.barh(range(20), s[0])
>>> plt.barh(range(20), s[1], left=s[0], color='g')
>>> plt.barh(range(20), s[2], left=s[0]+s[1], color='r')
>>> plt.title("Lengths of Strings")
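
The gamma-normalisation property described in the Notes can be sketched with the standard library alone. The helper dirichlet_draw, the seed, and the sample count below are illustrative assumptions; each Y_i is drawn from a gamma distribution with shape alpha_i and unit scale, then the vector is divided by its sum.

```python
import random

def dirichlet_draw(alpha, rng):
    """Draw one Dirichlet sample by normalising independent gamma variates."""
    y = [rng.gammavariate(a, 1.0) for a in alpha]  # Y_i ~ Gamma(alpha_i, 1)
    total = sum(y)
    return [value / total for value in y]

rng = random.Random(1)
alpha = (10, 5, 3)
samples = [dirichlet_draw(alpha, rng) for _ in range(5000)]
# Each sample has positive components that sum to 1; the component means
# approach alpha_i / sum(alpha).
means = [sum(s[i] for s in samples) / len(samples) for i in range(len(alpha))]
print([round(m, 2) for m in means])
```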
simba.utils.data.exponential(scale=1.0, size=None)

Draw samples from an exponential distribution.

Its probability density function is

f(x; \frac{1}{\beta}) = \frac{1}{\beta} \exp(-\frac{x}{\beta}),

for x > 0 and 0 elsewhere. \beta is the scale parameter, which is the inverse of the rate parameter \lambda = 1/\beta. The rate parameter is an alternative, widely used parameterization of the exponential distribution [3].

The exponential distribution is a continuous analogue of the geometric distribution. It describes many common situations, such as the size of raindrops measured over many rainstorms [1], or the time between page requests to Wikipedia [2].

Note

New code should use the exponential method of a default_rng() instance instead; please see the random-quick-start.

Parameters
  • scale (float or array_like of floats) – The scale parameter, \beta = 1/\lambda. Must be non-negative.

  • size (int or tuple of ints, optional) – Output shape. If the given shape is, e.g., (m, n, k), then m * n * k samples are drawn. If size is None (default), a single value is returned if scale is a scalar. Otherwise, np.array(scale).size samples are drawn.

Returns

out – Drawn samples from the parameterized exponential distribution.

Return type

ndarray or scalar

See also

Generator.exponential

which should be used for new code.

References

1

Peyton Z. Peebles Jr., “Probability, Random Variables and Random Signal Principles”, 4th ed, 2001, p. 57.

2

Wikipedia, “Poisson process”, https://en.wikipedia.org/wiki/Poisson_process

3

Wikipedia, “Exponential distribution”, https://en.wikipedia.org/wiki/Exponential_distribution
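
The relationship between the scale \beta and the density above can be illustrated by inverse-transform sampling: setting u = 1 - \exp(-x/\beta) and solving for x gives x = -\beta \ln(1 - u). This is a standard-library sketch under that assumption; exponential_draw and the seed are hypothetical and do not describe how NumPy generates its samples.

```python
import math
import random

def exponential_draw(scale, rng):
    """Inverse-transform sample: solve u = 1 - exp(-x / scale) for x."""
    u = rng.random()  # uniform on [0, 1), so 1 - u is never zero
    return -scale * math.log(1.0 - u)

rng = random.Random(2)
scale = 2.0  # beta; the rate is lambda = 1 / beta = 0.5
draws = [exponential_draw(scale, rng) for _ in range(20000)]
sample_mean = sum(draws) / len(draws)
print(round(sample_mean, 2))  # close to the scale parameter
```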

simba.utils.data.f(dfnum, dfden, size=None)

Draw samples from an F distribution.

Samples are drawn from an F distribution with specified parameters, dfnum (degrees of freedom in numerator) and dfden (degrees of freedom in denominator), where both parameters must be greater than zero.

The random variate of the F distribution (also known as the Fisher distribution) is a continuous probability distribution that arises in ANOVA tests, and is the ratio of two chi-square variates.

Note

New code should use the f method of a default_rng() instance instead; please see the random-quick-start.

Parameters
  • dfnum (float or array_like of floats) – Degrees of freedom in numerator, must be > 0.

  • dfden (float or array_like of float) – Degrees of freedom in denominator, must be > 0.

  • size (int or tuple of ints, optional) – Output shape. If the given shape is, e.g., (m, n, k), then m * n * k samples are drawn. If size is None (default), a single value is returned if dfnum and dfden are both scalars. Otherwise, np.broadcast(dfnum, dfden).size samples are drawn.

Returns

out – Drawn samples from the parameterized Fisher distribution.

Return type

ndarray or scalar

See also

scipy.stats.f

probability density function, distribution or cumulative density function, etc.

Generator.f

which should be used for new code.

Notes

The F statistic is used to compare in-group variances to between-group variances. Calculating the distribution depends on the sampling, and so it is a function of the respective degrees of freedom in the problem. The variable dfnum is the between-groups degrees of freedom, the number of groups minus one, while dfden is the within-groups degrees of freedom, the sum of the number of samples in each group minus the number of groups.

References

1

Glantz, Stanton A. “Primer of Biostatistics.”, McGraw-Hill, Fifth Edition, 2002.

2

Wikipedia, “F-distribution”, https://en.wikipedia.org/wiki/F-distribution

Examples

An example from Glantz[1], pp 47-40:

Two groups, children of diabetics (25 people) and children from people without diabetes (25 controls). Fasting blood glucose was measured; the case group had a mean value of 86.1 and the controls had a mean value of 82.2. Standard deviations were 2.09 and 2.49 respectively. Are these data consistent with the null hypothesis that the parents’ diabetic status does not affect their children’s blood glucose levels? Calculating the F statistic from the data gives a value of 36.01.

Draw samples from the distribution:

>>> dfnum = 1. # between group degrees of freedom
>>> dfden = 48. # within groups degrees of freedom
>>> s = np.random.f(dfnum, dfden, 1000)

The lower bound for the top 1% of the samples is:

>>> np.sort(s)[-10]
7.61988120985 # random

So there is about a 1% chance that the F statistic will exceed 7.62. The measured value is 36, so the null hypothesis is rejected at the 1% level.
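
The F statistic and degrees of freedom quoted in this example can be approximated from the summary statistics alone. The sketch below assumes the textbook one-way ANOVA formula for equally sized groups (between-groups mean square over pooled within-groups variance); the variable names are illustrative. It yields a value of roughly 36, and the small gap from the quoted 36.01 presumably reflects rounding in the reported means and standard deviations.

```python
n_per_group = 25
means = [86.1, 82.2]
stds = [2.09, 2.49]

k = len(means)
grand_mean = sum(means) / k  # valid because the groups have equal size
dfnum = k - 1                # between-groups degrees of freedom
dfden = k * n_per_group - k  # within-groups degrees of freedom

ms_between = n_per_group * sum((m - grand_mean) ** 2 for m in means) / dfnum
ms_within = sum(s ** 2 for s in stds) / k  # pooled variance, equal group sizes
f_stat = ms_between / ms_within
print(dfnum, dfden, round(f_stat, 2))
```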

simba.utils.data.fast_mean_rank(data: ndarray, descending: bool = True)[source]

Jitted helper to rank values in 1D array using mean method.

Parameters
  • data (np.ndarray) – 1D array of feature values.

  • descending (bool) – If True, ranks are returned such that low values get a high rank. If False, low values get a low rank. Default: True.

References

Modified from James Webber gist on GitHub.

Example

>>> data = np.array([1, 1, 3, 4, 5, 6, 7, 8, 9, 10])
>>> fast_mean_rank(data=data, descending=True)
[9.5, 9.5, 8. , 7. , 6. , 5. , 4. , 3. , 2. , 1. ]
simba.utils.data.fast_minimum_rank(data: ndarray, descending: bool = True)[source]

Jitted helper to rank values in 1D array using minimum method.

Parameters
  • data (np.ndarray) – 1D array of feature values.

  • descending (bool) – If True, ranks are returned such that low values get a high rank. If False, low values get a low rank. Default: True.

References

Jérôme Richard on StackOverflow.

Example

>>> data = np.array([1, 1, 3, 4, 5, 6, 7, 8, 9, 10])
>>> fast_minimum_rank(data=data, descending=True)
[9, 9, 8, 7, 6, 5, 4, 3, 2, 1]
>>> fast_minimum_rank(data=data, descending=False)
[ 1,  1,  3,  4,  5,  6,  7,  8,  9, 10]
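
The difference between the mean and minimum tie-handling methods used by the two ranking helpers can be seen in a plain-Python sketch. The function rank_values below is hypothetical and quadratic in the input length; it is meant to show the ranking rule, not the jitted SimBA implementation.

```python
def rank_values(values, method="mean", descending=True):
    """Rank values; ties share the mean or the minimum of their positions."""
    ranks = []
    for v in values:
        if descending:
            ahead = sum(1 for w in values if w > v)  # values ranked before v
        else:
            ahead = sum(1 for w in values if w < v)
        ties = sum(1 for w in values if w == v)      # includes v itself
        if method == "min":
            ranks.append(ahead + 1)
        else:
            ranks.append(ahead + (ties + 1) / 2)
    return ranks

data = [1, 1, 3, 4, 5, 6, 7, 8, 9, 10]
mean_ranks = rank_values(data, method="mean", descending=True)
min_ranks_desc = rank_values(data, method="min", descending=True)
min_ranks_asc = rank_values(data, method="min", descending=False)
print(mean_ranks)
print(min_ranks_desc)
print(min_ranks_asc)
```

With the mean method the two tied values of 1 occupy positions 9 and 10 and both receive 9.5; with the minimum method both receive 9.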
simba.utils.data.find_bins(data: Dict[str, List[int]], bracket_type: typing_extensions.Literal['QUANTILE', 'QUANTIZE'], bracket_cnt: int, normalization_method: typing_extensions.Literal['ALL VIDEOS', 'BY VIDEO']) Dict[str, ndarray][source]

Helper to find bin cut-off points.

Parameters
  • data (dict) – Dictionary with video names as keys and list of values of size len(frames).

  • bracket_type (Literal[str]) – ‘QUANTILE’ or ‘QUANTIZE’

  • bracket_cnt (int) – Number of bins.

  • normalization_method (str) – Create bins based on data in all videos (“ALL VIDEOS”) or create different bins per video (‘BY VIDEO’)

Returns dict

Dictionary with the video names as keys and the bin cut-off points as arrays of size bracket_cnt x 2.

simba.utils.data.find_frame_numbers_from_time_stamp(start_time: str, end_time: str, fps: int) List[int][source]

Given start and end timestamps in HH:MM:SS formats and the fps, return the frame numbers representing the time period.

Parameters
  • start_time (str) – Period start time in HH:MM:SS format.

  • end_time (str) – Period end time in HH:MM:SS format.

  • fps (int) – Framerate of the video.

Returns List[int]

Frame numbers within the period.

Example

>>> find_frame_numbers_from_time_stamp(start_time='00:00:00', end_time='00:00:01', fps=10)
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
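
The timestamp-to-frame conversion can be sketched in a few lines of plain Python. The helper frames_between is hypothetical; treating the end frame as inclusive is an assumption inferred from the example output above, which contains 11 frames for a one-second window at 10 fps.

```python
def frames_between(start_time, end_time, fps):
    """Return the frame numbers from start_time to end_time, both inclusive."""
    def to_seconds(stamp):
        hours, minutes, seconds = (int(part) for part in stamp.split(":"))
        return hours * 3600 + minutes * 60 + seconds

    first = to_seconds(start_time) * fps
    last = to_seconds(end_time) * fps
    return list(range(first, last + 1))

frames = frames_between("00:00:00", "00:00:01", fps=10)
print(frames)
```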
simba.utils.data.find_ranked_colors(data: Dict[str, float], palette: str, as_hex: Optional[bool] = False) Dict[str, Union[Tuple[int], str]][source]

Find ranked colors for a given data dictionary values based on a specified color palette.

The key with the highest value in the data dictionary is assigned the most intense palette color, while the key with the lowest value in the data dictionary is assigned the least intense palette color.

Parameters
  • data – A dictionary where keys are labels and values are numerical scores.

  • palette – A string representing the name of the color palette to use (e.g., ‘magma’).

  • as_hex – If True, return colors in hexadecimal format; if False, return as RGB tuples. Default is False.

Returns

A dictionary where keys are labels and values are corresponding colors based on ranking.

Examples

>>> data = {'Animal_1': 0.34786870380536705, 'Animal_2': 0.4307923198152757, 'Animal_3': 0.221338976379357}
>>> find_ranked_colors(data=data, palette='magma', as_hex=True)
>>> {'Animal_2': '#040000', 'Animal_1': '#7937b7', 'Animal_3': '#bffdfc'}
simba.utils.data.freedman_diaconis(data: ~numpy.array) -> (<class 'float'>, <class 'int'>)[source]

Use Freedman-Diaconis rule to compute optimal count of histogram bins and their width.

Note

Can also use simba.utils.data.bucket_data passing method fd.

References
2

Reference freedman_diaconis.
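
For orientation, the Freedman-Diaconis rule sets the bin width to 2 * IQR / n^(1/3) and derives the bin count from the data range. The sketch below is a standard-library illustration under stated assumptions: the helper name, the return order (width, then count), and the quartile interpolation method are choices made here and may differ from what SimBA does.

```python
import math
import statistics

def freedman_diaconis_sketch(data):
    """Return (bin_width, bin_count) from the Freedman-Diaconis rule."""
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    bin_width = 2 * (q3 - q1) / len(data) ** (1 / 3)
    bin_count = math.ceil((max(data) - min(data)) / bin_width)
    return bin_width, bin_count

width, count = freedman_diaconis_sketch(list(range(1, 101)))
print(round(width, 2), count)
```

For the integers 1 to 100 the interquartile range is 49.5, which gives a width of about 21.33 and five bins.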

simba.utils.data.gamma(shape, scale=1.0, size=None)

Draw samples from a Gamma distribution.

Samples are drawn from a Gamma distribution with specified parameters, shape (sometimes designated “k”) and scale (sometimes designated “theta”), where both parameters are > 0.

Note

New code should use the gamma method of a default_rng() instance instead; please see the random-quick-start.

Parameters
  • shape (float or array_like of floats) – The shape of the gamma distribution. Must be non-negative.

  • scale (float or array_like of floats, optional) – The scale of the gamma distribution. Must be non-negative. Default is equal to 1.

  • size (int or tuple of ints, optional) – Output shape. If the given shape is, e.g., (m, n, k), then m * n * k samples are drawn. If size is None (default), a single value is returned if shape and scale are both scalars. Otherwise, np.broadcast(shape, scale).size samples are drawn.

Returns

out – Drawn samples from the parameterized gamma distribution.

Return type

ndarray or scalar

See also

scipy.stats.gamma

probability density function, distribution or cumulative density function, etc.

Generator.gamma

which should be used for new code.

Notes

The probability density for the Gamma distribution is

p(x) = x^{k-1}\frac{e^{-x/\theta}}{\theta^k\Gamma(k)},

where k is the shape and \theta the scale, and \Gamma is the Gamma function.

The Gamma distribution is often used to model the times to failure of electronic components, and arises naturally in processes for which the waiting times between Poisson distributed events are relevant.

References

1

Weisstein, Eric W. “Gamma Distribution.” From MathWorld–A Wolfram Web Resource. http://mathworld.wolfram.com/GammaDistribution.html

2

Wikipedia, “Gamma distribution”, https://en.wikipedia.org/wiki/Gamma_distribution

Examples

Draw samples from the distribution:

>>> shape, scale = 2., 2.  # mean=4, std=2*sqrt(2)
>>> s = np.random.gamma(shape, scale, 1000)

Display the histogram of the samples, along with the probability density function:

>>> import matplotlib.pyplot as plt
>>> import scipy.special as sps  
>>> count, bins, ignored = plt.hist(s, 50, density=True)
>>> y = bins**(shape-1)*(np.exp(-bins/scale) /  
...                      (sps.gamma(shape)*scale**shape))
>>> plt.plot(bins, y, linewidth=2, color='r')  
>>> plt.show()
simba.utils.data.geometric(p, size=None)

Draw samples from the geometric distribution.

Bernoulli trials are experiments with one of two outcomes: success or failure (an example of such an experiment is flipping a coin). The geometric distribution models the number of trials that must be run in order to achieve success. It is therefore supported on the positive integers, k = 1, 2, ....

The probability mass function of the geometric distribution is

f(k) = (1 - p)^{k - 1} p

where p is the probability of success of an individual trial.

Note

New code should use the geometric method of a default_rng() instance instead; please see the random-quick-start.

Parameters
  • p (float or array_like of floats) – The probability of success of an individual trial.

  • size (int or tuple of ints, optional) – Output shape. If the given shape is, e.g., (m, n, k), then m * n * k samples are drawn. If size is None (default), a single value is returned if p is a scalar. Otherwise, np.array(p).size samples are drawn.

Returns

out – Drawn samples from the parameterized geometric distribution.

Return type

ndarray or scalar

See also

Generator.geometric

which should be used for new code.

Examples

Draw ten thousand values from the geometric distribution, with the probability of an individual success equal to 0.35:

>>> z = np.random.geometric(p=0.35, size=10000)

How many trials succeeded after a single run?

>>> (z == 1).sum() / 10000.
0.34889999999999999 #random
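
The simulated share of first-trial successes above can be compared with the probability mass function directly, since f(1) = p. The sketch below also draws samples by inverse transform using the standard library; geometric_pmf, geometric_draw, and the seed are illustrative assumptions and do not reflect NumPy's internal algorithm.

```python
import math
import random

def geometric_pmf(k, p):
    """Probability that the first success occurs on trial k (k = 1, 2, ...)."""
    return (1 - p) ** (k - 1) * p

def geometric_draw(p, rng):
    """Inverse-transform sample for 0 < p < 1."""
    u = rng.random()
    return max(1, math.ceil(math.log(1.0 - u) / math.log(1.0 - p)))

p = 0.35
rng = random.Random(3)
draws = [geometric_draw(p, rng) for _ in range(10000)]
share_first_try = sum(1 for k in draws if k == 1) / len(draws)
print(geometric_pmf(1, p), round(share_first_try, 3))
```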
simba.utils.data.get_mode(x: ndarray) Union[float, int][source]

Get the mode (most frequent value) within an array

simba.utils.data.get_state()

Return a tuple representing the internal state of the generator.

For more details, see set_state.

Parameters

legacy (bool, optional) – Flag indicating to return a legacy tuple state when the BitGenerator is MT19937, instead of a dict.

Returns

out – The returned tuple has the following items:

  1. the string ‘MT19937’.

  2. a 1-D array of 624 unsigned integer keys.

  3. an integer pos.

  4. an integer has_gauss.

  5. a float cached_gaussian.

If legacy is False, or the BitGenerator is not MT19937, then state is returned as a dictionary.

Return type

{tuple(str, ndarray of 624 uints, int, int, float), dict}

See also

set_state

Notes

set_state and get_state are not needed to work with any of the random distributions in NumPy. If the internal state is manually altered, the user should know exactly what he/she is doing.

simba.utils.data.gumbel(loc=0.0, scale=1.0, size=None)

Draw samples from a Gumbel distribution.

Draw samples from a Gumbel distribution with specified location and scale. For more information on the Gumbel distribution, see Notes and References below.

Note

New code should use the gumbel method of a default_rng() instance instead; please see the random-quick-start.

Parameters
  • loc (float or array_like of floats, optional) – The location of the mode of the distribution. Default is 0.

  • scale (float or array_like of floats, optional) – The scale parameter of the distribution. Default is 1. Must be non-negative.

  • size (int or tuple of ints, optional) – Output shape. If the given shape is, e.g., (m, n, k), then m * n * k samples are drawn. If size is None (default), a single value is returned if loc and scale are both scalars. Otherwise, np.broadcast(loc, scale).size samples are drawn.

Returns

out – Drawn samples from the parameterized Gumbel distribution.

Return type

ndarray or scalar

See also

scipy.stats.gumbel_l, scipy.stats.gumbel_r, scipy.stats.genextreme, weibull

Generator.gumbel

which should be used for new code.

Notes

The Gumbel (or Smallest Extreme Value (SEV) or the Smallest Extreme Value Type I) distribution is one of a class of Generalized Extreme Value (GEV) distributions used in modeling extreme value problems. The Gumbel is a special case of the Extreme Value Type I distribution for maximums from distributions with “exponential-like” tails.

The probability density for the Gumbel distribution is

p(x) = \frac{e^{-(x - \mu)/ \beta}}{\beta} e^{ -e^{-(x - \mu)/
\beta}},

where \mu is the mode, a location parameter, and \beta is the scale parameter.

The Gumbel (named for German mathematician Emil Julius Gumbel) was used very early in the hydrology literature, for modeling the occurrence of flood events. It is also used for modeling maximum wind speed and rainfall rates. It is a “fat-tailed” distribution - the probability of an event in the tail of the distribution is larger than if one used a Gaussian, hence the surprisingly frequent occurrence of 100-year floods. Floods were initially modeled as a Gaussian process, which underestimated the frequency of extreme events.

It is one of a class of extreme value distributions, the Generalized Extreme Value (GEV) distributions, which also includes the Weibull and Frechet.

The function has a mean of \mu + 0.57721\beta and a variance of \frac{\pi^2}{6}\beta^2.

References

1

Gumbel, E. J., “Statistics of Extremes,” New York: Columbia University Press, 1958.

2

Reiss, R.-D. and Thomas, M., “Statistical Analysis of Extreme Values from Insurance, Finance, Hydrology and Other Fields,” Basel: Birkhauser Verlag, 2001.

Examples

Draw samples from the distribution:

>>> mu, beta = 0, 0.1 # location and scale
>>> s = np.random.gumbel(mu, beta, 1000)

Display the histogram of the samples, along with the probability density function:

>>> import matplotlib.pyplot as plt
>>> count, bins, ignored = plt.hist(s, 30, density=True)
>>> plt.plot(bins, (1/beta)*np.exp(-(bins - mu)/beta)
...          * np.exp( -np.exp( -(bins - mu) /beta) ),
...          linewidth=2, color='r')
>>> plt.show()

Show how an extreme value distribution can arise from a Gaussian process and compare to a Gaussian:

>>> means = []
>>> maxima = []
>>> for i in range(0,1000) :
...    a = np.random.normal(mu, beta, 1000)
...    means.append(a.mean())
...    maxima.append(a.max())
>>> count, bins, ignored = plt.hist(maxima, 30, density=True)
>>> beta = np.std(maxima) * np.sqrt(6) / np.pi
>>> mu = np.mean(maxima) - 0.57721*beta
>>> plt.plot(bins, (1/beta)*np.exp(-(bins - mu)/beta)
...          * np.exp(-np.exp(-(bins - mu)/beta)),
...          linewidth=2, color='r')
>>> plt.plot(bins, 1/(beta * np.sqrt(2 * np.pi))
...          * np.exp(-(bins - mu)**2 / (2 * beta**2)),
...          linewidth=2, color='g')
>>> plt.show()
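
The mean and variance stated in the Notes can be checked numerically. This sketch samples by inverting the Gumbel cumulative distribution function exp(-exp(-(x - mu) / beta)), whose derivative is the density given above; gumbel_draw, the seed, and the sample count are illustrative assumptions, and only the standard library is used.

```python
import math
import random

def gumbel_draw(mu, beta, rng):
    """Inverse-transform sample from the Gumbel CDF exp(-exp(-(x - mu) / beta))."""
    u = rng.random()
    while u == 0.0:  # avoid log(0)
        u = rng.random()
    return mu - beta * math.log(-math.log(u))

mu, beta = 0.0, 0.1
rng = random.Random(4)
draws = [gumbel_draw(mu, beta, rng) for _ in range(50000)]
mean = sum(draws) / len(draws)
variance = sum((x - mean) ** 2 for x in draws) / len(draws)
# Compare the sample moments with mu + 0.57721 * beta and (pi^2 / 6) * beta^2.
print(round(mean, 4), round(mu + 0.57721 * beta, 4))
print(round(variance, 5), round(math.pi ** 2 / 6 * beta ** 2, 5))
```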
simba.utils.data.hist_1d(data: ndarray, bins: int, range: ndarray)[source]
simba.utils.data.hypergeometric(ngood, nbad, nsample, size=None)

Draw samples from a Hypergeometric distribution.

Samples are drawn from a hypergeometric distribution with specified parameters, ngood (ways to make a good selection), nbad (ways to make a bad selection), and nsample (number of items sampled, which is less than or equal to the sum ngood + nbad).

Note

New code should use the hypergeometric method of a default_rng() instance instead; please see the random-quick-start.

Parameters
  • ngood (int or array_like of ints) – Number of ways to make a good selection. Must be nonnegative.

  • nbad (int or array_like of ints) – Number of ways to make a bad selection. Must be nonnegative.

  • nsample (int or array_like of ints) – Number of items sampled. Must be at least 1 and at most ngood + nbad.

  • size (int or tuple of ints, optional) – Output shape. If the given shape is, e.g., (m, n, k), then m * n * k samples are drawn. If size is None (default), a single value is returned if ngood, nbad, and nsample are all scalars. Otherwise, np.broadcast(ngood, nbad, nsample).size samples are drawn.

Returns

out – Drawn samples from the parameterized hypergeometric distribution. Each sample is the number of good items within a randomly selected subset of size nsample taken from a set of ngood good items and nbad bad items.

Return type

ndarray or scalar

See also

scipy.stats.hypergeom

probability density function, distribution or cumulative density function, etc.

Generator.hypergeometric

which should be used for new code.

Notes

The probability mass function of the hypergeometric distribution is

P(x) = \frac{\binom{g}{x}\binom{b}{n-x}}{\binom{g+b}{n}},

where 0 \le x \le n and n-b \le x \le g

for P(x) the probability of x good results in the drawn sample, g = ngood, b = nbad, and n = nsample.

Consider an urn containing black and white marbles, ngood of them black and nbad of them white. If you draw nsample marbles without replacement, then the hypergeometric distribution describes the distribution of black marbles in the drawn sample.

Note that this distribution is very similar to the binomial distribution, except that in this case, samples are drawn without replacement, whereas in the Binomial case samples are drawn with replacement (or the sample space is infinite). As the sample space becomes large, this distribution approaches the binomial.

References

1

Lentner, Marvin, “Elementary Applied Statistics”, Bogden and Quigley, 1972.

2

Weisstein, Eric W. “Hypergeometric Distribution.” From MathWorld–A Wolfram Web Resource. http://mathworld.wolfram.com/HypergeometricDistribution.html

3

Wikipedia, “Hypergeometric distribution”, https://en.wikipedia.org/wiki/Hypergeometric_distribution

Examples

Draw samples from the distribution:

>>> ngood, nbad, nsamp = 100, 2, 10
# number of good, number of bad, and number of samples
>>> s = np.random.hypergeometric(ngood, nbad, nsamp, 1000)
>>> from matplotlib.pyplot import hist
>>> hist(s)
#   note that it is very unlikely to grab both bad items

Suppose you have an urn with 15 white and 15 black marbles. If you pull 15 marbles at random, how likely is it that 12 or more of them are one color?

>>> s = np.random.hypergeometric(15, 15, 15, 100000)
>>> sum(s>=12)/100000. + sum(s<=3)/100000.
#   answer = 0.003 ... pretty unlikely!
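
The urn question above also has an exact answer from the formula for P(x) in the Notes, which gives a reference point for the simulated estimate. The helper hypergeom_pmf below is a hypothetical name; it uses math.comb from the standard library.

```python
from math import comb

def hypergeom_pmf(x, ngood, nbad, nsample):
    """P(x good items in a sample of nsample drawn without replacement)."""
    return comb(ngood, x) * comb(nbad, nsample - x) / comb(ngood + nbad, nsample)

# Urn with 15 marbles of each colour, 15 drawn: 12 or more of one colour
# means x >= 12 or x <= 3 of the first colour.
p_extreme = sum(
    hypergeom_pmf(x, 15, 15, 15) for x in range(16) if x >= 12 or x <= 3
)
print(round(p_extreme, 4))
```

The exact probability is about 0.0028, consistent with the simulated value of roughly 0.003.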
simba.utils.data.laplace(loc=0.0, scale=1.0, size=None)

Draw samples from the Laplace or double exponential distribution with specified location (or mean) and scale (decay).

The Laplace distribution is similar to the Gaussian/normal distribution, but is sharper at the peak and has fatter tails. It represents the difference between two independent, identically distributed exponential random variables.

Note

New code should use the laplace method of a default_rng() instance instead; please see the random-quick-start.

Parameters
  • loc (float or array_like of floats, optional) – The position, \mu, of the distribution peak. Default is 0.

  • scale (float or array_like of floats, optional) – \lambda, the exponential decay. Default is 1. Must be non-negative.

  • size (int or tuple of ints, optional) – Output shape. If the given shape is, e.g., (m, n, k), then m * n * k samples are drawn. If size is None (default), a single value is returned if loc and scale are both scalars. Otherwise, np.broadcast(loc, scale).size samples are drawn.

Returns

out – Drawn samples from the parameterized Laplace distribution.

Return type

ndarray or scalar

See also

Generator.laplace

which should be used for new code.

Notes

It has the probability density function

f(x; \mu, \lambda) = \frac{1}{2\lambda}
\exp\left(-\frac{|x - \mu|}{\lambda}\right).

The first law of Laplace, from 1774, states that the frequency of an error can be expressed as an exponential function of the absolute magnitude of the error, which leads to the Laplace distribution. For many problems in economics and health sciences, this distribution seems to model the data better than the standard Gaussian distribution.

References

1

Abramowitz, M. and Stegun, I. A. (Eds.). “Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables, 9th printing,” New York: Dover, 1972.

2

Kotz, Samuel, et al. “The Laplace Distribution and Generalizations,” Birkhauser, 2001.

3

Weisstein, Eric W. “Laplace Distribution.” From MathWorld–A Wolfram Web Resource. http://mathworld.wolfram.com/LaplaceDistribution.html

4

Wikipedia, “Laplace distribution”, https://en.wikipedia.org/wiki/Laplace_distribution

Examples

Draw samples from the distribution

>>> loc, scale = 0., 1.
>>> s = np.random.laplace(loc, scale, 1000)

Display the histogram of the samples, along with the probability density function:

>>> import matplotlib.pyplot as plt
>>> count, bins, ignored = plt.hist(s, 30, density=True)
>>> x = np.arange(-8., 8., .01)
>>> pdf = np.exp(-abs(x-loc)/scale)/(2.*scale)
>>> plt.plot(x, pdf)

Plot Gaussian for comparison:

>>> g = (1/(scale * np.sqrt(2 * np.pi)) *
...      np.exp(-(x - loc)**2 / (2 * scale**2)))
>>> plt.plot(x,g)
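
The statement that a Laplace variate is the difference between two independent, identically distributed exponential variables can be turned into a sampler. This is a standard-library sketch under that property; laplace_draw, the seed, and the sample count are illustrative assumptions. The sample variance should approach 2 * scale**2, the known variance of the Laplace distribution.

```python
import random

def laplace_draw(loc, scale, rng):
    """Laplace sample as loc plus the difference of two exponential variates."""
    rate = 1.0 / scale  # random.expovariate takes the rate, not the scale
    return loc + rng.expovariate(rate) - rng.expovariate(rate)

loc, scale = 0.0, 1.0
rng = random.Random(5)
draws = [laplace_draw(loc, scale, rng) for _ in range(50000)]
mean = sum(draws) / len(draws)
variance = sum((x - mean) ** 2 for x in draws) / len(draws)
print(round(mean, 2), round(variance, 2))  # near loc and 2 * scale**2
```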
simba.utils.data.logistic(loc=0.0, scale=1.0, size=None)

Draw samples from a logistic distribution.

Samples are drawn from a logistic distribution with specified parameters, loc (location or mean, also median), and scale (>0).

Note

New code should use the logistic method of a default_rng() instance instead; please see the random-quick-start.

Parameters
  • loc (float or array_like of floats, optional) – Parameter of the distribution. Default is 0.

  • scale (float or array_like of floats, optional) – Parameter of the distribution. Must be non-negative. Default is 1.

  • size (int or tuple of ints, optional) – Output shape. If the given shape is, e.g., (m, n, k), then m * n * k samples are drawn. If size is None (default), a single value is returned if loc and scale are both scalars. Otherwise, np.broadcast(loc, scale).size samples are drawn.

Returns

out – Drawn samples from the parameterized logistic distribution.

Return type

ndarray or scalar

See also

scipy.stats.logistic

probability density function, distribution or cumulative density function, etc.

Generator.logistic

which should be used for new code.

Notes

The probability density for the Logistic distribution is

P(x) = \frac{e^{-(x-\mu)/s}}{s(1+e^{-(x-\mu)/s})^2},

where \mu = location and s = scale.
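For this density the mean is \mu and the variance is s^2 \pi^2 / 3. The sketch below compares both with sample statistics; it assumes the Generator interface, and the seed and sample size are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed
loc, scale = 10.0, 1.0
s = rng.logistic(loc, scale, 200_000)

expected_var = (scale * np.pi) ** 2 / 3  # variance of the logistic distribution
sample_mean, sample_var = s.mean(), s.var()
print(sample_mean, sample_var, expected_var)
```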

The Logistic distribution is used in Extreme Value problems where it can act as a mixture of Gumbel distributions, in Epidemiology, and by the World Chess Federation (FIDE) where it is used in the Elo ranking system, assuming the performance of each player is a logistically distributed random variable.

References

1

Reiss, R.-D. and Thomas M. (2001), “Statistical Analysis of Extreme Values, from Insurance, Finance, Hydrology and Other Fields,” Birkhauser Verlag, Basel, pp 132-133.

2

Weisstein, Eric W. “Logistic Distribution.” From MathWorld–A Wolfram Web Resource. http://mathworld.wolfram.com/LogisticDistribution.html

3

Wikipedia, “Logistic-distribution”, https://en.wikipedia.org/wiki/Logistic_distribution

Examples

Draw samples from the distribution:

>>> loc, scale = 10, 1
>>> s = np.random.logistic(loc, scale, 10000)
>>> import matplotlib.pyplot as plt
>>> count, bins, ignored = plt.hist(s, bins=50)

# plot against distribution

>>> def logist(x, loc, scale):
...     return np.exp((loc-x)/scale)/(scale*(1+np.exp((loc-x)/scale))**2)
>>> lgst_val = logist(bins, loc, scale)
>>> plt.plot(bins, lgst_val * count.max() / lgst_val.max())
>>> plt.show()
simba.utils.data.lognormal(mean=0.0, sigma=1.0, size=None)

Draw samples from a log-normal distribution.

Draw samples from a log-normal distribution with specified mean, standard deviation, and array shape. Note that the mean and standard deviation are not the values for the distribution itself, but of the underlying normal distribution it is derived from.

Note

New code should use the lognormal method of a default_rng() instance instead; please see the random-quick-start.

Parameters
  • mean (float or array_like of floats, optional) – Mean value of the underlying normal distribution. Default is 0.

  • sigma (float or array_like of floats, optional) – Standard deviation of the underlying normal distribution. Must be non-negative. Default is 1.

  • size (int or tuple of ints, optional) – Output shape. If the given shape is, e.g., (m, n, k), then m * n * k samples are drawn. If size is None (default), a single value is returned if mean and sigma are both scalars. Otherwise, np.broadcast(mean, sigma).size samples are drawn.

Returns

out – Drawn samples from the parameterized log-normal distribution.

Return type

ndarray or scalar

See also

scipy.stats.lognorm

probability density function, distribution, cumulative density function, etc.

Generator.lognormal

which should be used for new code.

Notes

A variable x has a log-normal distribution if log(x) is normally distributed. The probability density function for the log-normal distribution is:

p(x) = \frac{1}{\sigma x \sqrt{2\pi}}
e^{(-\frac{(ln(x)-\mu)^2}{2\sigma^2})}

where \mu is the mean and \sigma is the standard deviation of the normally distributed logarithm of the variable. A log-normal distribution results if a random variable is the product of a large number of independent, identically-distributed variables in the same way that a normal distribution results if the variable is the sum of a large number of independent, identically-distributed variables.
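Because mean and sigma describe the underlying normal distribution, they are recovered from the logarithm of the samples rather than from the samples themselves. A small sketch of that distinction, assuming the Generator interface (seed and sample size are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed
mu, sigma = 3.0, 1.0
s = rng.lognormal(mu, sigma, 200_000)

# Statistics of log(s) should match the parameters passed in.
log_s = np.log(s)
log_mean, log_std = log_s.mean(), log_s.std()

# The mean of the log-normal distribution itself is exp(mu + sigma**2 / 2).
dist_mean = np.exp(mu + sigma**2 / 2)
print(log_mean, log_std, s.mean(), dist_mean)
```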

References

1

Limpert, E., Stahel, W. A., and Abbt, M., “Log-normal Distributions across the Sciences: Keys and Clues,” BioScience, Vol. 51, No. 5, May, 2001. https://stat.ethz.ch/~stahel/lognormal/bioscience.pdf

2

Reiss, R.D. and Thomas, M., “Statistical Analysis of Extreme Values,” Basel: Birkhauser Verlag, 2001, pp. 31-32.

Examples

Draw samples from the distribution:

>>> mu, sigma = 3., 1. # mean and standard deviation
>>> s = np.random.lognormal(mu, sigma, 1000)

Display the histogram of the samples, along with the probability density function:

>>> import matplotlib.pyplot as plt
>>> count, bins, ignored = plt.hist(s, 100, density=True, align='mid')
>>> x = np.linspace(min(bins), max(bins), 10000)
>>> pdf = (np.exp(-(np.log(x) - mu)**2 / (2 * sigma**2))
...        / (x * sigma * np.sqrt(2 * np.pi)))
>>> plt.plot(x, pdf, linewidth=2, color='r')
>>> plt.axis('tight')
>>> plt.show()

Demonstrate that the products of random samples, here drawn from a normal distribution shifted so that the values stay positive, can be fit well by a log-normal probability density function.

>>> # Generate a thousand samples: each is the product of 100 random
>>> # values, drawn from a normal distribution.
>>> b = []
>>> for i in range(1000):
...    a = 10. + np.random.standard_normal(100)
...    b.append(np.prod(a))
>>> b = np.array(b) / np.min(b) # rescale so the smallest value is 1
>>> count, bins, ignored = plt.hist(b, 100, density=True, align='mid')
>>> sigma = np.std(np.log(b))
>>> mu = np.mean(np.log(b))
>>> x = np.linspace(min(bins), max(bins), 10000)
>>> pdf = (np.exp(-(np.log(x) - mu)**2 / (2 * sigma**2))
...        / (x * sigma * np.sqrt(2 * np.pi)))
>>> plt.plot(x, pdf, color='r', linewidth=2)
>>> plt.show()
simba.utils.data.logseries(p, size=None)

Draw samples from a logarithmic series distribution.

Samples are drawn from a log series distribution with specified shape parameter, 0 < p < 1.

Note

New code should use the logseries method of a default_rng() instance instead; please see the random-quick-start.

Parameters
  • p (float or array_like of floats) – Shape parameter for the distribution. Must be in the range (0, 1).

  • size (int or tuple of ints, optional) – Output shape. If the given shape is, e.g., (m, n, k), then m * n * k samples are drawn. If size is None (default), a single value is returned if p is a scalar. Otherwise, np.array(p).size samples are drawn.

Returns

out – Drawn samples from the parameterized logarithmic series distribution.

Return type

ndarray or scalar

See also

scipy.stats.logser

probability density function, distribution or cumulative density function, etc.

Generator.logseries

which should be used for new code.

Notes

The probability density for the Log Series distribution is

P(k) = \frac{-p^k}{k \ln(1-p)},

where p = probability.
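The formula can be checked numerically: the probabilities should sum to 1 over k = 1, 2, 3, ..., and P(1) should match the share of samples equal to 1. In this sketch `logseries_pmf` is a helper name introduced for illustration, and the seed, sample size and cut-off at k = 199 are arbitrary:

```python
import numpy as np

def logseries_pmf(k, p):
    """P(k) = -p**k / (k * ln(1 - p)) for k = 1, 2, 3, ..."""
    return -(p ** k) / (k * np.log1p(-p))

p = 0.6
k = np.arange(1, 200)
pmf = logseries_pmf(k, p)
total = pmf.sum()  # the tail beyond k = 199 is negligible for p = 0.6

rng = np.random.default_rng(0)  # arbitrary seed
s = rng.logseries(p, 100_000)
freq_k1 = np.mean(s == 1)       # empirical estimate of P(1)
print(total, pmf[0], freq_k1)
```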

The log series distribution is frequently used to represent species richness and occurrence, first proposed by Fisher, Corbet, and Williams in 1943 [2]. It may also be used to model the numbers of occupants seen in cars [3].

References

1

Buzas, Martin A.; Culver, Stephen J., Understanding regional species diversity through the log series distribution of occurrences: BIODIVERSITY RESEARCH Diversity & Distributions, Volume 5, Number 5, September 1999, pp. 187-195(9).

2

Fisher, R.A., A.S. Corbet, and C.B. Williams. 1943. The relation between the number of species and the number of individuals in a random sample of an animal population. Journal of Animal Ecology, 12:42-58.

3

D. J. Hand, F. Daly, D. Lunn, E. Ostrowski, A Handbook of Small Data Sets, CRC Press, 1994.

4

Wikipedia, “Logarithmic distribution”, https://en.wikipedia.org/wiki/Logarithmic_distribution

Examples

Draw samples from the distribution:

>>> a = .6
>>> s = np.random.logseries(a, 10000)
>>> import matplotlib.pyplot as plt
>>> count, bins, ignored = plt.hist(s)

# plot against distribution

>>> def logseries(k, p):
...     return -p**k/(k*np.log(1-p))
>>> plt.plot(bins, logseries(bins, a)*count.max()/
...          logseries(bins, a).max(), 'r')
>>> plt.show()
simba.utils.data.multinomial(n, pvals, size=None)

Draw samples from a multinomial distribution.

The multinomial distribution is a multivariate generalization of the binomial distribution. Take an experiment with one of p possible outcomes. An example of such an experiment is throwing a die, where the outcome can be 1 through 6. Each sample drawn from the distribution represents n such experiments. Its values, X_i = [X_0, X_1, ..., X_p], represent the number of times the outcome was i.

Note

New code should use the multinomial method of a default_rng() instance instead; please see the random-quick-start.

Parameters
  • n (int) – Number of experiments.

  • pvals (sequence of floats, length p) – Probabilities of each of the p different outcomes. These must sum to 1 (however, the last element is always assumed to account for the remaining probability, as long as sum(pvals[:-1]) <= 1).

  • size (int or tuple of ints, optional) – Output shape. If the given shape is, e.g., (m, n, k), then m * n * k samples are drawn. Default is None, in which case a single value is returned.

Returns

out – The drawn samples, of shape size, if that was provided. If not, the shape is (N,).

In other words, each entry out[i,j,...,:] is an N-dimensional value drawn from the distribution.
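The output shape and the meaning of the last axis can be illustrated as follows. This is a sketch assuming the Generator interface, with an arbitrary seed and size:

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed
n, pvals = 20, [1 / 6] * 6      # 20 throws of a fair six-sided die
out = rng.multinomial(n, pvals, size=(2, 3))

print(out.shape)         # size plus one trailing axis with one count per outcome
print(out.sum(axis=-1))  # the counts of every sample add up to n
```

Each slice out[i, j, :] is one sample, so summing over the last axis gives back the number of experiments.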

Return type

ndarray

See also

Generator.multinomial

which should be used for new code.

Examples

Throw a die 20 times:

>>> np.random.multinomial(20, [1/6.]*6, size=1)
array([[4, 1, 7, 5, 2, 1]]) # random

It landed 4 times on 1, once on 2, etc.

Now, throw the die 20 times, and 20 times again:

>>> np.random.multinomial(20, [1/6.]*6, size=2)
array([[3, 4, 3, 3, 4, 3], # random
       [2, 4, 3, 4, 0, 7]])

For the first run, we threw 3 times 1, 4 times 2, etc. For the second, we threw 2 times 1, 4 times 2, etc.

A loaded die is more likely to land on number 6:

>>> np.random.multinomial(100, [1/7.]*5 + [2/7.])
array([11, 16, 14, 17, 16, 26]) # random

The probability inputs should be normalized. As an implementation detail, the value of the last entry is ignored and assumed to take up any leftover probability mass, but this should not be relied on. A biased coin which has twice as much weight on one side as on the other should be sampled like so:

>>> np.random.multinomial(100, [1.0 / 3, 2.0 / 3])  # RIGHT
array([38, 62]) # random

not like:

>>> np.random.multinomial(100, [1.0, 2.0])  # WRONG
Traceback (most recent call last):
ValueError: pvals < 0, pvals > 1 or pvals contains NaNs
simba.utils.data.multivariate_normal(mean, cov, size=None, check_valid='warn', tol=1e-08)

Draw random samples from a multivariate normal distribution.

The multivariate normal, multinormal or Gaussian distribution is a generalization of the one-dimensional normal distribution to higher dimensions. Such a distribution is specified by its mean and covariance matrix. These parameters are analogous to the mean (average or “center”) and variance (standard deviation, or “width,” squared) of the one-dimensional normal distribution.

Note

New code should use the multivariate_normal method of a default_rng() instance instead; please see the random-quick-start.

Parameters
  • mean (1-D array_like, of length N) – Mean of the N-dimensional distribution.

  • cov (2-D array_like, of shape (N, N)) – Covariance matrix of the distribution. It must be symmetric and positive-semidefinite for proper sampling.

  • size (int or tuple of ints, optional) – Given a shape of, for example, (m,n,k), m*n*k samples are generated, and packed in an m-by-n-by-k arrangement. Because each sample is N-dimensional, the output shape is (m,n,k,N). If no shape is specified, a single (N-D) sample is returned.

  • check_valid ({ 'warn', 'raise', 'ignore' }, optional) – Behavior when the covariance matrix is not positive semidefinite.

  • tol (float, optional) – Tolerance when checking the singular values in covariance matrix. cov is cast to double before the check.

Returns

out – The drawn samples, of shape size, if that was provided. If not, the shape is (N,).

In other words, each entry out[i,j,...,:] is an N-dimensional value drawn from the distribution.

Return type

ndarray

See also

Generator.multivariate_normal

which should be used for new code.

Notes

The mean is a coordinate in N-dimensional space, which represents the location where samples are most likely to be generated. This is analogous to the peak of the bell curve for the one-dimensional or univariate normal distribution.

Covariance indicates the level to which two variables vary together. From the multivariate normal distribution, we draw N-dimensional samples, X = [x_1, x_2, ... x_N]. The covariance matrix element C_{ij} is the covariance of x_i and x_j. The element C_{ii} is the variance of x_i (i.e. its “spread”).

Instead of specifying the full covariance matrix, popular approximations include:

  • Spherical covariance (cov is a multiple of the identity matrix)

  • Diagonal covariance (cov has non-negative elements, and only on the diagonal)

This geometrical property can be seen in two dimensions by plotting generated data-points:

>>> mean = [0, 0]
>>> cov = [[1, 0], [0, 100]]  # diagonal covariance

Diagonal covariance means that points are oriented along x or y-axis:

>>> import matplotlib.pyplot as plt
>>> x, y = np.random.multivariate_normal(mean, cov, 5000).T
>>> plt.plot(x, y, 'x')
>>> plt.axis('equal')
>>> plt.show()

Note that the covariance matrix must be positive semidefinite (a.k.a. nonnegative-definite). Otherwise, the behavior of this method is undefined and backwards compatibility is not guaranteed.
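One way to check the positive-semidefinite requirement before sampling is to inspect the eigenvalues of the symmetric matrix. The sketch below does that and then confirms that the sample covariance approximates cov; it assumes the Generator interface, and the seed and sample count are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed
mean = np.array([0.0, 0.0])
cov = np.array([[1.0, 0.0], [0.0, 100.0]])  # diagonal covariance

# A symmetric matrix is positive semidefinite when no eigenvalue is negative.
is_psd = bool(np.all(np.linalg.eigvalsh(cov) >= 0))

samples = rng.multivariate_normal(mean, cov, size=50_000)
sample_cov = np.cov(samples, rowvar=False)  # 2 x 2 estimate of cov
print(is_psd)
print(sample_cov)
```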

References

1

Papoulis, A., “Probability, Random Variables, and Stochastic Processes,” 3rd ed., New York: McGraw-Hill, 1991.

2

Duda, R. O., Hart, P. E., and Stork, D. G., “Pattern Classification,” 2nd ed., New York: Wiley, 2001.

Examples

>>> mean = (1, 2)
>>> cov = [[1, 0], [0, 1]]
>>> x = np.random.multivariate_normal(mean, cov, (3, 3))
>>> x.shape
(3, 3, 2)

The standard deviation of each component is 1 here, so a component falls below its mean plus 0.6 with a probability of about 0.73, and the following holds for roughly half of all draws:

>>> list((x[0,0,:] - mean) < 0.6)
[True, True] # random
simba.utils.data.negative_binomial(n, p, size=None)

Draw samples from a negative binomial distribution.

Samples are drawn from a negative binomial distribution with specified parameters, n successes and p probability of success where n is > 0 and p is in the interval [0, 1].

Note

New code should use the negative_binomial method of a default_rng() instance instead; please see the random-quick-start.

Parameters
  • n (float or array_like of floats) – Parameter of the distribution, > 0.

  • p (float or array_like of floats) – Parameter of the distribution, >= 0 and <=1.

  • size (int or tuple of ints, optional) – Output shape. If the given shape is, e.g., (m, n, k), then m * n * k samples are drawn. If size is None (default), a single value is returned if n and p are both scalars. Otherwise, np.broadcast(n, p).size samples are drawn.

Returns

out – Drawn samples from the parameterized negative binomial distribution, where each sample is equal to N, the number of failures that occurred before a total of n successes was reached.

Return type

ndarray or scalar

See also

Generator.negative_binomial

which should be used for new code.

Notes

The probability mass function of the negative binomial distribution is

P(N;n,p) = \frac{\Gamma(N+n)}{N!\Gamma(n)}p^{n}(1-p)^{N},

where n is the number of successes, p is the probability of success, N+n is the number of trials, and \Gamma is the gamma function. When n is an integer, \frac{\Gamma(N+n)}{N!\Gamma(n)} = \binom{N+n-1}{N}, which is the more common form of this term in the pmf. The negative binomial distribution gives the probability of N failures given n successes, with a success on the last trial.

If one throws a die repeatedly until the third time a “1” appears, then the probability distribution of the number of non-“1”s that appear before the third “1” is a negative binomial distribution.
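The die example can be simulated directly. With n = 3 and p = 1/6, the expected number of non-“1” throws is n(1 - p)/p = 15. A minimal sketch, assuming the Generator interface and an arbitrary seed and sample size:

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed
n, p = 3, 1 / 6                 # stop at the third "1" on a fair die
failures = rng.negative_binomial(n, p, 200_000)

expected_mean = n * (1 - p) / p  # mean number of failures before n successes
sample_mean = failures.mean()
print(expected_mean, sample_mean)
```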

References

1

Weisstein, Eric W. “Negative Binomial Distribution.” From MathWorld–A Wolfram Web Resource. http://mathworld.wolfram.com/NegativeBinomialDistribution.html

2

Wikipedia, “Negative binomial distribution”, https://en.wikipedia.org/wiki/Negative_binomial_distribution

Examples

Draw samples from the distribution:

A real world example. A company drills wild-cat oil exploration wells, each with an estimated probability of success of 0.1. What is the probability of having one success for each successive well, that is what is the probability of a single success after drilling 5 wells, after 6 wells, etc.?

>>> s = np.random.negative_binomial(1, 0.1, 100000)
>>> for i in range(1, 11): 
...    probability = sum(s<i) / 100000.
...    print(i, "wells drilled, probability of one success =", probability)
simba.utils.data.noncentral_chisquare(df, nonc, size=None)

Draw samples from a noncentral chi-square distribution.

The noncentral \chi^2 distribution is a generalization of the \chi^2 distribution.

Note

New code should use the noncentral_chisquare method of a default_rng() instance instead; please see the random-quick-start.

Parameters
  • df (float or array_like of floats) –

    Degrees of freedom, must be > 0.

    Changed in version 1.10.0: Earlier NumPy versions required dfnum > 1.

  • nonc (float or array_like of floats) – Non-centrality, must be non-negative.

  • size (int or tuple of ints, optional) – Output shape. If the given shape is, e.g., (m, n, k), then m * n * k samples are drawn. If size is None (default), a single value is returned if df and nonc are both scalars. Otherwise, np.broadcast(df, nonc).size samples are drawn.

Returns

out – Drawn samples from the parameterized noncentral chi-square distribution.

Return type

ndarray or scalar

See also

Generator.noncentral_chisquare

which should be used for new code.

Notes

The probability density function for the noncentral Chi-square distribution is

P(x;df,nonc) = \sum^{\infty}_{i=0}
\frac{e^{-nonc/2}(nonc/2)^{i}}{i!}
P_{Y_{df+2i}}(x),

where Y_{q} is the Chi-square with q degrees of freedom.
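The sum above describes a Poisson mixture: pick i from a Poisson distribution with mean nonc/2, then draw from a chi-square with df + 2i degrees of freedom. The sketch below builds samples that way and compares them with direct draws; both means should be near df + nonc. It assumes the Generator interface, and the seed and sample size are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed
df, nonc, size = 3.0, 20.0, 200_000

direct = rng.noncentral_chisquare(df, nonc, size)

# Mixture construction: i ~ Poisson(nonc / 2), then x ~ chi-square(df + 2 * i).
i = rng.poisson(nonc / 2, size)
mixture = rng.chisquare(df + 2 * i)

direct_mean, mixture_mean = direct.mean(), mixture.mean()
print(direct_mean, mixture_mean, df + nonc)
```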

References

1

Wikipedia, “Noncentral chi-squared distribution” https://en.wikipedia.org/wiki/Noncentral_chi-squared_distribution

Examples

Draw values from the distribution and plot the histogram

>>> import matplotlib.pyplot as plt
>>> values = plt.hist(np.random.noncentral_chisquare(3, 20, 100000),
...                   bins=200, density=True)
>>> plt.show()

Draw values from a noncentral chisquare with very small noncentrality, and compare to a chisquare.

>>> plt.figure()
>>> values = plt.hist(np.random.noncentral_chisquare(3, .0000001, 100000),
...                   bins=np.arange(0., 25, .1), density=True)
>>> values2 = plt.hist(np.random.chisquare(3, 100000),
...                    bins=np.arange(0., 25, .1), density=True)
>>> plt.plot(values[1][0:-1], values[0]-values2[0], 'ob')
>>> plt.show()

Demonstrate how large values of non-centrality lead to a more symmetric distribution.

>>> plt.figure()
>>> values = plt.hist(np.random.noncentral_chisquare(3, 20, 100000),
...                   bins=200, density=True)
>>> plt.show()
simba.utils.data.noncentral_f(dfnum, dfden, nonc, size=None)

Draw samples from the noncentral F distribution.

Samples are drawn from an F distribution with specified parameters, dfnum (degrees of freedom in numerator) and dfden (degrees of freedom in denominator), where both parameters are > 0. nonc is the non-centrality parameter.

Note

New code should use the noncentral_f method of a default_rng() instance instead; please see the random-quick-start.

Parameters
  • dfnum (float or array_like of floats) –

    Numerator degrees of freedom, must be > 0.

    Changed in version 1.14.0: Earlier NumPy versions required dfnum > 1.

  • dfden (float or array_like of floats) – Denominator degrees of freedom, must be > 0.

  • nonc (float or array_like of floats) – Non-centrality parameter, the sum of the squares of the numerator means, must be >= 0.

  • size (int or tuple of ints, optional) – Output shape. If the given shape is, e.g., (m, n, k), then m * n * k samples are drawn. If size is None (default), a single value is returned if dfnum, dfden, and nonc are all scalars. Otherwise, np.broadcast(dfnum, dfden, nonc).size samples are drawn.

Returns

out – Drawn samples from the parameterized noncentral Fisher distribution.

Return type

ndarray or scalar

See also

Generator.noncentral_f

which should be used for new code.

Notes

When calculating the power of an experiment (power = probability of rejecting the null hypothesis when a specific alternative is true) the non-central F statistic becomes important. When the null hypothesis is true, the F statistic follows a central F distribution. When the null hypothesis is not true, it follows a non-central F distribution.
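A power estimate of this kind can be sketched by simulation: take the 95th percentile of the central F distribution as the critical value, then count how often noncentral draws exceed it. The degrees of freedom, non-centrality, significance level, seed and number of draws below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed
dfnum, dfden, nonc = 3, 20, 3.0
n_draws = 200_000

# Critical value: empirical 95th percentile under the null hypothesis.
central = rng.f(dfnum, dfden, n_draws)
f_crit = np.quantile(central, 0.95)

# Power: share of draws under the alternative that exceed the critical value.
noncentral = rng.noncentral_f(dfnum, dfden, nonc, n_draws)
power = np.mean(noncentral > f_crit)
print(f_crit, power)
```

An analytic critical value would normally be preferred over the empirical quantile; the simulation is used here only to keep the sketch within NumPy.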

References

1

Weisstein, Eric W. “Noncentral F-Distribution.” From MathWorld–A Wolfram Web Resource. http://mathworld.wolfram.com/NoncentralF-Distribution.html

2

Wikipedia, “Noncentral F-distribution”, https://en.wikipedia.org/wiki/Noncentral_F-distribution

Examples

In a study, testing for a specific alternative to the null hypothesis requires use of the Noncentral F distribution. We need to calculate the area in the tail of the distribution that exceeds the value of the F distribution for the null hypothesis. We’ll plot the two probability distributions for comparison.

>>> dfnum = 3 # between group deg of freedom
>>> dfden = 20 # within groups degrees of freedom
>>> nonc = 3.0
>>> nc_vals = np.random.noncentral_f(dfnum, dfden, nonc, 1000000)
>>> NF = np.histogram(nc_vals, bins=50, density=True)
>>> c_vals = np.random.f(dfnum, dfden, 1000000)
>>> F = np.histogram(c_vals, bins=50, density=True)
>>> import matplotlib.pyplot as plt
>>> plt.plot(F[1][1:], F[0])
>>> plt.plot(NF[1][1:], NF[0])
>>> plt.show()
simba.utils.data.normal(loc=0.0, scale=1.0, size=None)

Draw random samples from a normal (Gaussian) distribution.

The probability density function of the normal distribution, first derived by De Moivre and 200 years later by both Gauss and Laplace independently [2], is often called the bell curve because of its characteristic shape (see the example below).

The normal distribution occurs often in nature. For example, it describes the commonly occurring distribution of samples influenced by a large number of tiny, random disturbances, each with its own unique distribution [2].

Note

New code should use the normal method of a default_rng() instance instead; please see the random-quick-start.

Parameters
  • loc (float or array_like of floats) – Mean (“centre”) of the distribution.

  • scale (float or array_like of floats) – Standard deviation (spread or “width”) of the distribution. Must be non-negative.

  • size (int or tuple of ints, optional) – Output shape. If the given shape is, e.g., (m, n, k), then m * n * k samples are drawn. If size is None (default), a single value is returned if loc and scale are both scalars. Otherwise, np.broadcast(loc, scale).size samples are drawn.

Returns

out – Drawn samples from the parameterized normal distribution.

Return type

ndarray or scalar

See also

scipy.stats.norm

probability density function, distribution or cumulative density function, etc.

Generator.normal

which should be used for new code.

Notes

The probability density for the Gaussian distribution is

p(x) = \frac{1}{\sqrt{ 2 \pi \sigma^2 }}
e^{ - \frac{ (x - \mu)^2 } {2 \sigma^2} },

where \mu is the mean and \sigma the standard deviation. The square of the standard deviation, \sigma^2, is called the variance.

The function has its peak at the mean, and its “spread” increases with the standard deviation (the function reaches 0.607 times its maximum at \mu + \sigma and \mu - \sigma [2]). This implies that normal is more likely to return samples lying close to the mean, rather than those far away.
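The 0.607 factor is exp(-1/2), the ratio of the density one standard deviation from the mean to the density at the mean. The sketch below evaluates that ratio and the share of samples within one standard deviation; `normal_pdf` is a helper name introduced for illustration, and the seed and sample size are arbitrary:

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    """Gaussian density with mean mu and standard deviation sigma."""
    return np.exp(-((x - mu) ** 2) / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

mu, sigma = 0.0, 0.1
ratio = normal_pdf(mu + sigma, mu, sigma) / normal_pdf(mu, mu, sigma)
print(round(ratio, 3))  # 0.607, i.e. exp(-1/2)

rng = np.random.default_rng(0)  # arbitrary seed
s = rng.normal(mu, sigma, 100_000)
within_one_sigma = np.mean(np.abs(s - mu) < sigma)  # roughly 0.68
print(within_one_sigma)
```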

References

1

Wikipedia, “Normal distribution”, https://en.wikipedia.org/wiki/Normal_distribution

2

P. R. Peebles Jr., “Central Limit Theorem” in “Probability, Random Variables and Random Signal Principles”, 4th ed., 2001, pp. 51, 51, 125.

Examples

Draw samples from the distribution:

>>> mu, sigma = 0, 0.1 # mean and standard deviation
>>> s = np.random.normal(mu, sigma, 1000)

Verify the mean and the standard deviation:

>>> abs(mu - np.mean(s))
0.0  # may vary
>>> abs(sigma - np.std(s, ddof=1))
0.0  # may vary

Display the histogram of the samples, along with the probability density function:

>>> import matplotlib.pyplot as plt
>>> count, bins, ignored = plt.hist(s, 30, density=True)
>>> plt.plot(bins, 1/(sigma * np.sqrt(2 * np.pi)) *
...                np.exp( - (bins - mu)**2 / (2 * sigma**2) ),
...          linewidth=2, color='r')
>>> plt.show()

Two-by-four array of samples from N(3, 6.25):

>>> np.random.normal(3, 2.5, size=(2, 4))
array([[-4.49401501,  4.00950034, -1.81814867,  7.29718677],   # random
       [ 0.39924804,  4.68456316,  4.99394529,  4.84057254]])  # random
simba.utils.data.pareto(a, size=None)

Draw samples from a Pareto II or Lomax distribution with specified shape.

The Lomax or Pareto II distribution is a shifted Pareto distribution. The classical Pareto distribution can be obtained from the Lomax distribution by adding 1 and multiplying by the scale parameter m (see Notes). The smallest value of the Lomax distribution is zero while for the classical Pareto distribution it is mu, where the standard Pareto distribution has location mu = 1. Lomax can also be considered as a simplified version of the Generalized Pareto distribution (available in SciPy), with the scale set to one and the location set to zero.

The Pareto distribution must be greater than zero, and is unbounded above. It is also known as the “80-20 rule”. In this distribution, 80 percent of the weights are in the lowest 20 percent of the range, while the other 20 percent fill the remaining 80 percent of the range.

Note

New code should use the pareto method of a default_rng() instance instead; please see the random-quick-start.

Parameters
  • a (float or array_like of floats) – Shape of the distribution. Must be positive.

  • size (int or tuple of ints, optional) – Output shape. If the given shape is, e.g., (m, n, k), then m * n * k samples are drawn. If size is None (default), a single value is returned if a is a scalar. Otherwise, np.array(a).size samples are drawn.

Returns

out – Drawn samples from the parameterized Pareto distribution.

Return type

ndarray or scalar

See also

scipy.stats.lomax

probability density function, distribution or cumulative density function, etc.

scipy.stats.genpareto

probability density function, distribution or cumulative density function, etc.

Generator.pareto

which should be used for new code.

Notes

The probability density for the Pareto distribution is

p(x) = \frac{am^a}{x^{a+1}}

where a is the shape and m the scale.
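Since the function draws from the Lomax distribution, samples from the classical Pareto density above are obtained by adding 1 and multiplying by m. For a > 1 the classical Pareto mean is am/(a - 1), which gives a simple check. A sketch assuming the Generator interface, with an arbitrary seed and sample size:

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed
a, m = 3.0, 2.0                 # shape and scale

lomax = rng.pareto(a, 200_000)  # Lomax (Pareto II) draws, smallest value 0
classical = (lomax + 1) * m     # classical Pareto draws, smallest value m

expected_mean = a * m / (a - 1)  # valid only for a > 1
sample_mean = classical.mean()
print(expected_mean, sample_mean)
```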

The Pareto distribution, named after the Italian economist Vilfredo Pareto, is a power law probability distribution useful in many real world problems. Outside the field of economics it is generally referred to as the Bradford distribution. Pareto developed the distribution to describe the distribution of wealth in an economy. It has also found use in insurance, web page access statistics, oil field sizes, and many other problems, including the download frequency for projects in Sourceforge [1]. It is one of the so-called “fat-tailed” distributions.

References

1

Francis Hunt and Paul Johnson, On the Pareto Distribution of Sourceforge projects.

2

Pareto, V. (1896). Course of Political Economy. Lausanne.

3

Reiss, R.D., Thomas, M.(2001), Statistical Analysis of Extreme Values, Birkhauser Verlag, Basel, pp 23-30.

4

Wikipedia, “Pareto distribution”, https://en.wikipedia.org/wiki/Pareto_distribution

Examples

Draw samples from the distribution:

>>> a, m = 3., 2.  # shape and mode
>>> s = (np.random.pareto(a, 1000) + 1) * m

Display the histogram of the samples, along with the probability density function:

>>> import matplotlib.pyplot as plt
>>> count, bins, _ = plt.hist(s, 100, density=True)
>>> fit = a*m**a / bins**(a+1)
>>> plt.plot(bins, max(count)*fit/max(fit), linewidth=2, color='r')
>>> plt.show()
simba.utils.data.permutation(x)

Randomly permute a sequence, or return a permuted range.

If x is a multi-dimensional array, it is only shuffled along its first index.
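For a two-dimensional array this means that whole rows are reordered while the contents of each row stay together. A short sketch, assuming the Generator interface and an arbitrary seed:

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed
arr = np.arange(9).reshape((3, 3))
permuted = rng.permutation(arr)  # returns a shuffled copy; arr is unchanged

# The same rows are present before and after, only their order may differ.
rows_before = sorted(map(tuple, arr))
rows_after = sorted(map(tuple, permuted))
print(rows_before == rows_after)  # True
```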

Note

New code should use the permutation method of a default_rng() instance instead; please see the random-quick-start.

Parameters

x (int or array_like) – If x is an integer, randomly permute np.arange(x). If x is an array, make a copy and shuffle the elements randomly.

Returns

out – Permuted sequence or array range.

Return type

ndarray

See also

Generator.permutation

which should be used for new code.

Examples

>>> np.random.permutation(10)
array([1, 7, 4, 3, 0, 9, 2, 5, 8, 6]) # random
>>> np.random.permutation([1, 4, 9, 12, 15])
array([15,  1,  9,  4, 12]) # random
>>> arr = np.arange(9).reshape((3, 3))
>>> np.random.permutation(arr)
array([[6, 7, 8], # random
       [0, 1, 2],
       [3, 4, 5]])
simba.utils.data.plug_holes_shortest_bout(data_df: DataFrame, clf_name: str, fps: int, shortest_bout: int) DataFrame[source]

Removes behavior “bouts” that are shorter than the minimum user-specified length within a dataframe.

Parameters
  • data_df (pd.DataFrame) – Pandas Dataframe with classifier prediction data.

  • clf_name (str) – Name of the classifier field.

  • fps (int) – The fps of the input video.

  • shortest_bout (int) – The shortest valid behavior bout in milliseconds.

Return pd.DataFrame data_df

Dataframe where behavior bouts with invalid lengths have been removed (< shortest_bout)
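The millisecond threshold has to be converted to a number of frames before it can be compared with runs in the data. The following is an illustrative sketch only, not SimBA's implementation: `fill_short_gaps` is a hypothetical helper that works on the classifier column as a plain array, and it assumes that a run of 0s flanked by 1s is filled when it spans at most fps * shortest_bout / 1000 frames, which is consistent with the example in this entry.

```python
import numpy as np

def fill_short_gaps(values, fps, shortest_bout):
    """Hypothetical helper: fill short runs of 0s that sit between 1s."""
    max_gap = int(fps * shortest_bout / 1000)  # milliseconds -> frames
    filled = np.asarray(values).copy()
    gap_start = None  # index of the first 0 in the current gap, if any
    seen_one = False  # a gap only counts once a 1 has preceded it
    for idx, value in enumerate(filled):
        if value == 1:
            if gap_start is not None and idx - gap_start <= max_gap:
                filled[gap_start:idx] = 1
            gap_start, seen_one = None, True
        elif seen_one and gap_start is None:
            gap_start = idx
    return filled

print(fill_short_gaps([1, 0, 1, 1, 1], fps=10, shortest_bout=2000).tolist())
# [1, 1, 1, 1, 1]
```

At 10 fps a 2000 ms threshold corresponds to 20 frames, so the single-frame gap is filled. Leading and trailing runs of 0s are left alone because they are not enclosed by 1s on both sides.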

Example

>>> data_df = pd.DataFrame(data=[1, 0, 1, 1, 1], columns=['target'])
>>> plug_holes_shortest_bout(data_df=data_df, clf_name='target', fps=10, shortest_bout=2000)
   target
0       1
1       1
2       1
3       1
4       1
simba.utils.data.poisson(lam=1.0, size=None)

Draw samples from a Poisson distribution.

The Poisson distribution is the limit of the binomial distribution for large N.

Note

New code should use the poisson method of a default_rng() instance instead; please see the random-quick-start.

Parameters
  • lam (float or array_like of floats) – Expectation of interval, must be >= 0. A sequence of expectation intervals must be broadcastable over the requested size.

  • size (int or tuple of ints, optional) – Output shape. If the given shape is, e.g., (m, n, k), then m * n * k samples are drawn. If size is None (default), a single value is returned if lam is a scalar. Otherwise, np.array(lam).size samples are drawn.

Returns

out – Drawn samples from the parameterized Poisson distribution.

Return type

ndarray or scalar

See also

Generator.poisson

which should be used for new code.

Notes

The Poisson distribution

f(k; \lambda)=\frac{\lambda^k e^{-\lambda}}{k!}

For events with an expected separation \lambda the Poisson distribution f(k; \lambda) describes the probability of k events occurring within the observed interval \lambda.

Because the output is limited to the range of the C int64 type, a ValueError is raised when lam is within 10 sigma of the maximum representable value.
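Two properties mentioned in this entry can be checked by simulation: the mean and variance of a Poisson distribution both equal \lambda, and a binomial distribution with many trials and n * p = \lambda behaves almost identically. A sketch assuming the Generator interface; the seed, sample size and number of trials are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed
lam, size = 5.0, 200_000

poisson_draws = rng.poisson(lam, size)

# Binomial with many trials and a small success probability, n * p = lam.
n_trials = 10_000
binomial_draws = rng.binomial(n_trials, lam / n_trials, size)

print(poisson_draws.mean(), poisson_draws.var())    # both close to lam
print(binomial_draws.mean(), binomial_draws.var())  # nearly the same values
```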

References

1

Weisstein, Eric W. “Poisson Distribution.” From MathWorld–A Wolfram Web Resource. http://mathworld.wolfram.com/PoissonDistribution.html

2

Wikipedia, “Poisson distribution”, https://en.wikipedia.org/wiki/Poisson_distribution

Examples

Draw samples from the distribution:

>>> import numpy as np
>>> s = np.random.poisson(5, 10000)

Display histogram of the sample:

>>> import matplotlib.pyplot as plt
>>> count, bins, ignored = plt.hist(s, 14, density=True)
>>> plt.show()

Draw 100 values each for lambda 100 and 500:

>>> s = np.random.poisson(lam=(100., 500.), size=(100, 2))
simba.utils.data.power(a, size=None)

Draws samples in [0, 1] from a power distribution with positive exponent a - 1.

Also known as the power function distribution.

Note

New code should use the power method of a default_rng() instance instead; please see the random-quick-start.

Parameters
  • a (float or array_like of floats) – Parameter of the distribution. Must be positive.

  • size (int or tuple of ints, optional) – Output shape. If the given shape is, e.g., (m, n, k), then m * n * k samples are drawn. If size is None (default), a single value is returned if a is a scalar. Otherwise, np.array(a).size samples are drawn.

Returns

out – Drawn samples from the parameterized power distribution.

Return type

ndarray or scalar

Raises

ValueError – If a <= 0.

See also

Generator.power

which should be used for new code.

Notes

The probability density function is

P(x; a) = ax^{a-1}, 0 \le x \le 1, a>0.

The power function distribution is just the inverse of the Pareto distribution. It may also be seen as a special case of the Beta distribution.

It is used, for example, in modeling the over-reporting of insurance claims.

References

1

Christian Kleiber, Samuel Kotz, “Statistical size distributions in economics and actuarial sciences”, Wiley, 2003.

2

Heckert, N. A. and Filliben, James J. “NIST Handbook 148: Dataplot Reference Manual, Volume 2: Let Subcommands and Library Functions”, National Institute of Standards and Technology Handbook Series, June 2003. https://www.itl.nist.gov/div898/software/dataplot/refman2/auxillar/powpdf.pdf

Examples

Draw samples from the distribution:

>>> a = 5. # shape
>>> samples = 1000
>>> s = np.random.power(a, samples)

Display the histogram of the samples, along with the probability density function:

>>> import matplotlib.pyplot as plt
>>> count, bins, ignored = plt.hist(s, bins=30)
>>> x = np.linspace(0, 1, 100)
>>> y = a*x**(a-1.)
>>> normed_y = samples*np.diff(bins)[0]*y
>>> plt.plot(x, normed_y)
>>> plt.show()

Compare the power function distribution to the inverse of the Pareto.

>>> from scipy import stats 
>>> rvs = np.random.power(5, 1000000)
>>> rvsp = np.random.pareto(5, 1000000)
>>> xx = np.linspace(0,1,100)
>>> powpdf = stats.powerlaw.pdf(xx,5)  
>>> plt.figure()
>>> plt.hist(rvs, bins=50, density=True)
>>> plt.plot(xx,powpdf,'r-')  
>>> plt.title('np.random.power(5)')
>>> plt.figure()
>>> plt.hist(1./(1.+rvsp), bins=50, density=True)
>>> plt.plot(xx,powpdf,'r-')  
>>> plt.title('inverse of 1 + np.random.pareto(5)')
>>> plt.figure()
>>> plt.hist(1./(1.+rvsp), bins=50, density=True)
>>> plt.plot(xx,powpdf,'r-')  
>>> plt.title('inverse of stats.pareto(5)')
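The density in the Notes has the cumulative distribution x^a on [0, 1], so a uniform draw raised to the power 1/a follows the same distribution. The sketch below checks this against Generator.power, assuming NumPy 1.17 or later; the seed and variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed for repeatability
a = 5.0
n = 100_000

# Direct draw from the power distribution.
direct = rng.power(a, size=n)

# Inverse-CDF draw: the CDF of P(x; a) = a*x**(a-1) on [0, 1] is x**a,
# so U**(1/a) with U ~ Uniform[0, 1) follows the same distribution.
via_uniform = rng.random(n) ** (1.0 / a)

# Both sample means should be close to the theoretical mean a / (a + 1).
expected_mean = a / (a + 1.0)
```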
simba.utils.data.rand(d0, d1, ..., dn)

Random values in a given shape.

Note

This is a convenience function for users porting code from Matlab, and wraps random_sample. That function takes a tuple to specify the size of the output, which is consistent with other NumPy functions like numpy.zeros and numpy.ones.

Create an array of the given shape and populate it with random samples from a uniform distribution over [0, 1).

Parameters
  • d0, d1, ..., dn (int, optional) – The dimensions of the returned array, must be non-negative. If no argument is given a single Python float is returned.

Returns

out – Random values.

Return type

ndarray, shape (d0, d1, ..., dn)

See also

random

Examples

>>> np.random.rand(3,2)
array([[ 0.14022471,  0.96360618],  #random
       [ 0.37601032,  0.25528411],  #random
       [ 0.49313049,  0.94909878]]) #random
simba.utils.data.randint(low, high=None, size=None, dtype=int)

Return random integers from low (inclusive) to high (exclusive).

Return random integers from the “discrete uniform” distribution of the specified dtype in the “half-open” interval [low, high). If high is None (the default), then results are from [0, low).

Note

New code should use the integers method of a default_rng() instance instead; please see the random-quick-start.

Parameters
  • low (int or array-like of ints) – Lowest (signed) integers to be drawn from the distribution (unless high=None, in which case this parameter is one above the highest such integer).

  • high (int or array-like of ints, optional) – If provided, one above the largest (signed) integer to be drawn from the distribution (see above for behavior if high=None). If array-like, must contain integer values.

  • size (int or tuple of ints, optional) – Output shape. If the given shape is, e.g., (m, n, k), then m * n * k samples are drawn. Default is None, in which case a single value is returned.

  • dtype (dtype, optional) –

    Desired dtype of the result. Byteorder must be native. The default value is int.

    New in version 1.11.0.

Returns

out – size-shaped array of random integers from the appropriate distribution, or a single such random int if size not provided.

Return type

int or ndarray of ints

See also

random_integers

similar to randint, only for the closed interval [low, high], and 1 is the lowest value if high is omitted.

Generator.integers

which should be used for new code.

Examples

>>> np.random.randint(2, size=10)
array([1, 0, 0, 0, 1, 1, 0, 0, 1, 0]) # random
>>> np.random.randint(1, size=10)
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

Generate a 2 x 4 array of ints between 0 and 4, inclusive:

>>> np.random.randint(5, size=(2, 4))
array([[4, 0, 2, 1], # random
       [3, 2, 2, 0]])

Generate a 1 x 3 array with 3 different upper bounds

>>> np.random.randint(1, [3, 5, 10])
array([2, 2, 9]) # random

Generate a 1 by 3 array with 3 different lower bounds

>>> np.random.randint([1, 5, 7], 10)
array([9, 8, 7]) # random

Generate a 2 by 4 array using broadcasting with dtype of uint8

>>> np.random.randint([1, 3, 5, 7], [[10], [20]], dtype=np.uint8)
array([[ 8,  6,  9,  7], # random
       [ 1, 16,  9, 12]], dtype=uint8)
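For new code, the note above points to Generator.integers. A minimal sketch, assuming NumPy 1.17 or later (seed and variable names are illustrative), showing the half-open default and the endpoint=True option for a closed interval:

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed

# Half-open interval [0, 5): same convention as randint.
half_open = rng.integers(0, 5, size=(2, 4))

# Closed interval [1, 6]: endpoint=True includes the upper bound.
closed = rng.integers(1, 6, size=1000, endpoint=True)
```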
simba.utils.data.randn(d0, d1, ..., dn)

Return a sample (or samples) from the “standard normal” distribution.

Note

This is a convenience function for users porting code from Matlab, and wraps standard_normal. That function takes a tuple to specify the size of the output, which is consistent with other NumPy functions like numpy.zeros and numpy.ones.

Note

New code should use the standard_normal method of a default_rng() instance instead; please see the random-quick-start.

If positive int_like arguments are provided, randn generates an array of shape (d0, d1, ..., dn), filled with random floats sampled from a univariate “normal” (Gaussian) distribution of mean 0 and variance 1. A single float randomly sampled from the distribution is returned if no argument is provided.

Parameters
  • d0, d1, ..., dn (int, optional) – The dimensions of the returned array, must be non-negative. If no argument is given a single Python float is returned.

Returns

Z – A (d0, d1, ..., dn)-shaped array of floating-point samples from the standard normal distribution, or a single such float if no parameters were supplied.

Return type

ndarray or float

See also

standard_normal

Similar, but takes a tuple as its argument.

normal

Also accepts mu and sigma arguments.

Generator.standard_normal

which should be used for new code.

Notes

For random samples from N(\mu, \sigma^2), use:

sigma * np.random.randn(...) + mu

Examples

>>> np.random.randn()
2.1923875335537315  # random

Two-by-four array of samples from N(3, 6.25):

>>> 3 + 2.5 * np.random.randn(2, 4)
array([[-4.49401501,  4.00950034, -1.81814867,  7.29718677],   # random
       [ 0.39924804,  4.68456316,  4.99394529,  4.84057254]])  # random
simba.utils.data.random(size=None)

Return random floats in the half-open interval [0.0, 1.0). Alias for random_sample to ease forward-porting to the new random API.

simba.utils.data.random_integers(low, high=None, size=None)

Random integers of type np.int_ between low and high, inclusive.

Return random integers of type np.int_ from the “discrete uniform” distribution in the closed interval [low, high]. If high is None (the default), then results are from [1, low]. The np.int_ type translates to the C long integer type and its precision is platform dependent.

This function has been deprecated. Use randint instead.

Deprecated since version 1.11.0.

Parameters
  • low (int) – Lowest (signed) integer to be drawn from the distribution (unless high=None, in which case this parameter is the highest such integer).

  • high (int, optional) – If provided, the largest (signed) integer to be drawn from the distribution (see above for behavior if high=None).

  • size (int or tuple of ints, optional) – Output shape. If the given shape is, e.g., (m, n, k), then m * n * k samples are drawn. Default is None, in which case a single value is returned.

Returns

out – size-shaped array of random integers from the appropriate distribution, or a single such random int if size not provided.

Return type

int or ndarray of ints

See also

randint

Similar to random_integers, only for the half-open interval [low, high), and 0 is the lowest value if high is omitted.

Notes

To sample from N evenly spaced floating-point numbers between a and b, use:

a + (b - a) * (np.random.random_integers(N) - 1) / (N - 1.)

Examples

>>> np.random.random_integers(5)
4 # random
>>> type(np.random.random_integers(5))
<class 'numpy.int64'>
>>> np.random.random_integers(5, size=(3,2))
array([[5, 4], # random
       [3, 3],
       [4, 5]])

Choose five random numbers from the set of five evenly-spaced numbers between 0 and 2.5, inclusive (i.e., from the set {0, 5/8, 10/8, 15/8, 20/8}):

>>> 2.5 * (np.random.random_integers(5, size=(5,)) - 1) / 4.
array([ 0.625,  1.25 ,  0.625,  0.625,  2.5  ]) # random

Roll two six sided dice 1000 times and sum the results:

>>> d1 = np.random.random_integers(1, 6, 1000)
>>> d2 = np.random.random_integers(1, 6, 1000)
>>> dsums = d1 + d2

Display results as a histogram:

>>> import matplotlib.pyplot as plt
>>> count, bins, ignored = plt.hist(dsums, 11, density=True)
>>> plt.show()
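Because random_integers is deprecated in favor of randint, the dice and evenly spaced examples above can be rewritten as follows. This is a sketch only; the key difference is that randint excludes its upper bound, so the closed interval [1, 6] becomes randint(1, 7). The seed is arbitrary.

```python
import numpy as np

np.random.seed(0)  # legacy global seed, used here only for repeatability

# random_integers(1, 6, 1000) draws from the closed interval [1, 6];
# randint excludes its upper bound, so pass high + 1.
d1 = np.random.randint(1, 6 + 1, 1000)
d2 = np.random.randint(1, 6 + 1, 1000)
dsums = d1 + d2

# The N evenly spaced floats between a and b from the Notes, using randint:
# random_integers(N) - 1 covers 0..N-1, which is randint(0, N).
a, b, N = 0.0, 2.5, 5
grid_draws = a + (b - a) * np.random.randint(0, N, size=5) / (N - 1.0)
```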
simba.utils.data.random_sample(size=None)

Return random floats in the half-open interval [0.0, 1.0).

Results are from the “continuous uniform” distribution over the stated interval. To sample Unif[a, b), b > a, multiply the output of random_sample by (b - a) and add a:

(b - a) * random_sample() + a

Note

New code should use the random method of a default_rng() instance instead; please see the random-quick-start.

Parameters

size (int or tuple of ints, optional) – Output shape. If the given shape is, e.g., (m, n, k), then m * n * k samples are drawn. Default is None, in which case a single value is returned.

Returns

out – Array of random floats of shape size (unless size=None, in which case a single float is returned).

Return type

float or ndarray of floats

See also

Generator.random

which should be used for new code.

Examples

>>> np.random.random_sample()
0.47108547995356098 # random
>>> type(np.random.random_sample())
<class 'float'>
>>> np.random.random_sample((5,))
array([ 0.30220482,  0.86820401,  0.1654503 ,  0.11659149,  0.54323428]) # random

Three-by-two array of random numbers from [-5, 0):

>>> 5 * np.random.random_sample((3, 2)) - 5
array([[-3.99149989, -0.52338984], # random
       [-2.99091858, -0.79479508],
       [-1.23204345, -1.75224494]])
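The same scale-and-shift recipe works with the Generator API recommended in the note above. A short sketch, assuming NumPy 1.17 or later, with an arbitrary seed:

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed

a, b = -5.0, 0.0
# Scale and shift draws from [0, 1) to [a, b).
values = (b - a) * rng.random((3, 2)) + a
```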
simba.utils.data.rayleigh(scale=1.0, size=None)

Draw samples from a Rayleigh distribution.

The \chi and Weibull distributions are generalizations of the Rayleigh.

Note

New code should use the rayleigh method of a default_rng() instance instead; please see the random-quick-start.

Parameters
  • scale (float or array_like of floats, optional) – Scale, also equals the mode. Must be non-negative. Default is 1.

  • size (int or tuple of ints, optional) – Output shape. If the given shape is, e.g., (m, n, k), then m * n * k samples are drawn. If size is None (default), a single value is returned if scale is a scalar. Otherwise, np.array(scale).size samples are drawn.

Returns

out – Drawn samples from the parameterized Rayleigh distribution.

Return type

ndarray or scalar

See also

Generator.rayleigh

which should be used for new code.

Notes

The probability density function for the Rayleigh distribution is

P(x;scale) = \frac{x}{scale^2}e^{\frac{-x^2}{2 \cdotp scale^2}}

The Rayleigh distribution would arise, for example, if the East and North components of the wind velocity had identical zero-mean Gaussian distributions. Then the wind speed would have a Rayleigh distribution.

References

1

Brighton Webs Ltd., “Rayleigh Distribution,” https://web.archive.org/web/20090514091424/http://brighton-webs.co.uk:80/distributions/rayleigh.asp

2

Wikipedia, “Rayleigh distribution” https://en.wikipedia.org/wiki/Rayleigh_distribution

Examples

Draw values from the distribution and plot the histogram

>>> from matplotlib.pyplot import hist
>>> values = hist(np.random.rayleigh(3, 100000), bins=200, density=True)

Wave heights tend to follow a Rayleigh distribution. If the mean wave height is 1 meter, what fraction of waves are likely to be larger than 3 meters?

>>> meanvalue = 1
>>> modevalue = np.sqrt(2 / np.pi) * meanvalue
>>> s = np.random.rayleigh(modevalue, 1000000)

The percentage of waves larger than 3 meters is:

>>> 100.*sum(s>3)/1000000.
0.087300000000000003 # random
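The simulated percentage can be checked analytically. Integrating the density in the Notes from x to infinity gives the tail probability exp(-x^2 / (2 scale^2)). The sketch below evaluates it for the wave example using only the standard library; variable names are illustrative.

```python
import math

mean_height = 1.0
# The mean of a Rayleigh distribution is scale * sqrt(pi / 2), so solve for scale.
scale = math.sqrt(2.0 / math.pi) * mean_height

# Tail probability P(X > x) = exp(-x**2 / (2 * scale**2)).
threshold = 3.0
tail_probability = math.exp(-threshold**2 / (2.0 * scale**2))
percent_above = 100.0 * tail_probability  # roughly 0.085 percent
```

The analytic value of about 0.085% is consistent with the simulated figure shown above.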
simba.utils.data.run_user_defined_feature_extraction_class(file_path: Union[str, PathLike], config_path: Union[str, PathLike]) None[source]

Loads and executes a user-defined feature extraction class within a .py file.

Parameters
  • file_path (Union[str, os.PathLike]) – Path to .py file holding user-defined feature extraction class.

  • config_path (str) – Path to SimBA project config file.

Warning

Legacy function. The GUI since 12/23 uses simba.utils.custom_feature_extractor.UserDefinedFeatureExtractor.

Note

Tutorial.

If the file_path contains multiple classes, then the first class will be used.

The user defined class needs to contain a config_path init argument.

If the feature extraction class contains an if __name__ == "__main__": entry point and uses argparse, then the custom feature extraction module will be executed through python subprocess.

Otherwise, it will be executed using sys.

I recommend using the if __name__ == "__main__": and subprocess alternative, as the feature extraction class will then be executed in a separate process and any multicore parallel processes within the user feature extraction class will not be throttled by the graphical interface mainloop.

Example

>>> run_user_defined_feature_extraction_class(config_path='/Users/simon/Desktop/envs/troubleshooting/circular_features_zebrafish/project_folder/project_config.ini', file_path='/Users/simon/Desktop/fish_feature_extractor_2023_version_5.py')
>>> run_user_defined_feature_extraction_class(config_path='/Users/simon/Desktop/envs/troubleshooting/piotr/project_folder/train-20231108-sh9-frames-with-p-lt-2_plus3-&3_best-f1.ini', file_path='/simba/misc/piotr.py')
simba.utils.data.sample_df_n_by_unique(df: DataFrame, field: str, n: int) DataFrame[source]

Randomly sample at most N rows per unique value in specified field of a dataframe.

For example, sample 100 observations from each inferred cluster assignment.

Parameters
  • df (pd.DataFrame) – The dataframe to sample from.

  • field (str) – The column name in the DataFrame to use for sampling based on unique values.

  • n (int) – The maximum number of rows to sample for each unique value in the specified column.

Return pd.DataFrame

A dataframe containing randomly sampled rows.
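The described behavior can be sketched with plain pandas. The helper name sample_at_most_n_per_group, its seed argument, and the toy data are hypothetical and are not SimBA's implementation; the sketch only illustrates "at most n rows per unique value".

```python
import pandas as pd


def sample_at_most_n_per_group(df: pd.DataFrame, field: str, n: int, seed: int = 0) -> pd.DataFrame:
    """Hypothetical stand-in: keep at most ``n`` randomly chosen rows per unique value of ``field``."""
    sampled = [
        # Groups smaller than n are returned whole (in shuffled order).
        group.sample(n=min(n, len(group)), random_state=seed)
        for _, group in df.groupby(field)
    ]
    return pd.concat(sampled)


# Toy data: cluster 0 has five rows, cluster 1 has two rows.
df = pd.DataFrame({"cluster": [0, 0, 0, 0, 0, 1, 1], "value": range(7)})
result = sample_at_most_n_per_group(df, field="cluster", n=3)
counts = result["cluster"].value_counts().to_dict()
```

Here cluster 0 is cut down to three rows, while cluster 1 keeps both of its rows because it has fewer than n.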

simba.utils.data.seed(self, seed=None)

Reseed a legacy MT19937 BitGenerator.

Notes

This is a convenience, legacy function.

The best practice is to not reseed a BitGenerator, rather to recreate a new one. This method is here for legacy reasons. This example demonstrates best practice.

>>> from numpy.random import MT19937
>>> from numpy.random import RandomState, SeedSequence
>>> rs = RandomState(MT19937(SeedSequence(123456789)))
# Later, you want to restart the stream
>>> rs = RandomState(MT19937(SeedSequence(987654321)))
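With the newer Generator API the same practice applies: build a new generator rather than reseeding an old one. A minimal sketch, assuming NumPy 1.17 or later, showing that two generators built from the same seed reproduce the same stream:

```python
import numpy as np

# Create a fresh Generator for each stream instead of reseeding an existing one.
rng_a = np.random.default_rng(123456789)
first = rng_a.random(3)

# Later, to replay the same stream, build a new Generator from the same seed.
rng_b = np.random.default_rng(123456789)
second = rng_b.random(3)
```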
simba.utils.data.set_state(state)

Set the internal state of the generator from a tuple.

For use if one has reason to manually (re-)set the internal state of the bit generator used by the RandomState instance. By default, RandomState uses the “Mersenne Twister” [1] pseudo-random number generating algorithm.

Parameters

state ({tuple(str, ndarray of 624 uints, int, int, float), dict}) –

The state tuple has the following items:

  1. the string ‘MT19937’, specifying the Mersenne Twister algorithm.

  2. a 1-D array of 624 unsigned integer keys.

  3. an integer pos.

  4. an integer has_gauss.

  5. a float cached_gaussian.

If state is a dictionary, it is directly set using the BitGenerator's state property.

Returns

out – Returns ‘None’ on success.

Return type

None

See also

get_state

Notes

set_state and get_state are not needed to work with any of the random distributions in NumPy. If the internal state is manually altered, the user should know exactly what he/she is doing.

For backwards compatibility, the form (str, array of 624 uints, int) is also accepted although it is missing some information about the cached Gaussian value: state = ('MT19937', keys, pos).

References

1

M. Matsumoto and T. Nishimura, “Mersenne Twister: A 623-dimensionally equidistributed uniform pseudorandom number generator,” ACM Trans. on Modeling and Computer Simulation, Vol. 8, No. 1, pp. 3-30, Jan. 1998.
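A short round-trip sketch of get_state and set_state on a legacy RandomState instance: saving the state, drawing, restoring, and drawing again yields identical values. The seed is arbitrary.

```python
import numpy as np

rs = np.random.RandomState(0)  # arbitrary seed

# Tuple form: ('MT19937', keys, pos, has_gauss, cached_gaussian)
saved = rs.get_state()
first = rs.random_sample(3)

# Rewind the generator to the saved state and draw again.
rs.set_state(saved)
replay = rs.random_sample(3)
```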

simba.utils.data.shuffle(x)

Modify a sequence in-place by shuffling its contents.

This function only shuffles the array along the first axis of a multi-dimensional array. The order of sub-arrays is changed but their contents remain the same.

Note

New code should use the shuffle method of a default_rng() instance instead; please see the random-quick-start.

Parameters

x (array_like) – The array or list to be shuffled.

Return type

None

See also

Generator.shuffle

which should be used for new code.

Examples

>>> arr = np.arange(10)
>>> np.random.shuffle(arr)
>>> arr
[1 7 5 2 9 4 3 6 0 8] # random

Multi-dimensional arrays are only shuffled along the first axis:

>>> arr = np.arange(9).reshape((3, 3))
>>> np.random.shuffle(arr)
>>> arr
array([[3, 4, 5], # random
       [6, 7, 8],
       [0, 1, 2]])
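The first-axis behavior can be verified with Generator.shuffle, which also accepts an axis argument for shuffling along another dimension. A sketch assuming NumPy 1.18 or later, with an arbitrary seed:

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed

arr = np.arange(9).reshape((3, 3))
original_rows = {tuple(row) for row in arr.tolist()}

# In-place shuffle along the first axis: rows are reordered, row contents are untouched.
rng.shuffle(arr)
shuffled_rows = {tuple(row) for row in arr.tolist()}

# Generator.shuffle also accepts an axis argument; axis=1 reorders the columns instead.
cols = np.arange(9).reshape((3, 3))
rng.shuffle(cols, axis=1)
```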
simba.utils.data.slice_roi_dict_for_video(data: Dict[str, DataFrame], video_name: str) Tuple[Dict[str, DataFrame], List[str]][source]

Given a dictionary of dataframes representing different ROIs (created by simba.mixins.config_reader.ConfigReader.read_roi_data), retain only the ROIs belonging to the specified video.

simba.utils.data.slp_to_df_convert(file_path: Union[str, PathLike], headers: List[str], joined_tracks: Optional[bool] = False, multi_index: Optional[bool] = True, drop_body_parts: Optional[List[str]] = None) DataFrame[source]

Helper to convert .slp pose-estimation data in h5 format to pandas dataframe.

Parameters
  • file_path (Union[str, os.PathLike]) – Path to SLEAP H5 file on disk.

  • headers (List[str]) – List of strings representing output dataframe headers.

  • joined_tracks (bool) – If True, the h5 file has been created by joining multiple .slp files.

  • multi_index (bool) – If True, inserts multi-index place-holders in the output dataframe (used in SimBA data import).

  • drop_body_parts (Optional[List[str]]) – Body-parts that should be removed from the SLEAP H5 dataset before import into SimBA. Use the body-part names as defined in SLEAP. Default: None.

Raises
  • InvalidFileTypeError – If file_path is not a valid SLEAP H5 pose-estimation file.

  • DataHeaderError – If the SLEAP file contains more or fewer body-parts than suggested by len(headers).

Return pd.DataFrame

With animal ID, Track ID and body-part names as columns.

Example

>>> headers = ['d_nose_1', 'd_neck_1', 'd_back_1', 'd_tail_1', 'nest_s_2', 'nest_cc_2', 'nest_cv_2', 'nest_cc_2', 'nest_csc_2', 'nest_cscd_2']
>>> new_headers = []
>>> for h in headers: new_headers.append(h + '_x'); new_headers.append(h + '_y'); new_headers.append(h + '_p')
>>> df = slp_to_df_convert(file_path='/Users/simon/Desktop/envs/troubleshooting/ryan/LBN4a_Ctrl_P05_1_2022-01-15_08-16-20c.h5', headers=new_headers, joined_tracks=True)
simba.utils.data.smooth_data_gaussian(config: ConfigParser, file_path: str, time_window_parameter: int) None[source]

Perform Gaussian smoothing of pose-estimation data.

Important

Overwrites the input data with smoothened data.

Parameters
  • config (configparser.ConfigParser) – Parsed SimBA project_config.ini file.

  • file_path (str) – Path to pose estimation data.

  • time_window_parameter (int) – Gaussian rolling window size in milliseconds.

Example

>>> config = read_config_file(ini_path='/Users/simon/Desktop/envs/troubleshooting/Tests_022023/project_folder/project_config.ini')
>>> smooth_data_gaussian(config=config, file_path='/Users/simon/Desktop/envs/troubleshooting/Tests_022023/project_folder/csv/input_csv/Together_1.csv', time_window_parameter=500)
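To illustrate what a millisecond window means for frame-based data, the sketch below converts the window to a frame count and applies a Gaussian-weighted moving average with NumPy. It is not SimBA's implementation: the helper names, the fps value of 30, the kernel standard deviation of 5 frames, and the edge padding are all assumptions chosen for illustration.

```python
import numpy as np


def window_ms_to_frames(time_window_ms: int, fps: float) -> int:
    """Convert a window length in milliseconds to a whole number of frames (at least 1)."""
    return max(1, int(time_window_ms * fps / 1000.0))


def gaussian_smooth(values: np.ndarray, window_frames: int, std: float = 5.0) -> np.ndarray:
    """Gaussian-weighted moving average; hypothetical helper, std is an assumed value."""
    # Build a normalised Gaussian kernel spanning `window_frames` samples.
    offsets = np.arange(window_frames) - (window_frames - 1) / 2.0
    kernel = np.exp(-0.5 * (offsets / std) ** 2)
    kernel /= kernel.sum()
    # Pad with edge values so the output has the same length as the input.
    left = (window_frames - 1) // 2
    right = window_frames - 1 - left
    padded = np.pad(np.asarray(values, dtype=float), (left, right), mode="edge")
    return np.convolve(padded, kernel, mode="valid")


frames = window_ms_to_frames(500, fps=30.0)  # 500 ms at an assumed 30 fps

# A noisy body-part coordinate around x = 10 (synthetic data).
rng = np.random.default_rng(0)
noisy = 10.0 + rng.normal(0.0, 1.0, size=200)
smoothed = gaussian_smooth(noisy, frames)
```

The smoothed series keeps the same length as the input and has a visibly lower spread than the raw coordinate.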
simba.utils.data.smooth_data_savitzky_golay(config: ConfigParser, file_path: Union[str, PathLike], time_window_parameter: int, overwrite: Optional[bool] = True) None[source]

Perform Savitzky-Golay smoothing of pose-estimation data within a file.

Important

Overwrites the input data with smoothened data when overwrite is True (the default).

Parameters
  • config (configparser.ConfigParser) – Parsed SimBA project_config.ini file.

  • file_path (str) – Path to pose estimation data.

  • time_window_parameter (int) – Savitzky-Golay rolling window size in milliseconds.

  • overwrite (bool) – If True, overwrites the input data. If False, returns the smoothened dataframe.

Example

>>> config = read_config_file(config_path='Tests_022023/project_folder/project_config.ini')
>>> smooth_data_savitzky_golay(config=config, file_path='Tests_022023/project_folder/csv/input_csv/Together_1.csv', time_window_parameter=500)
simba.utils.data.standard_cauchy(size=None)

Draw samples from a standard Cauchy distribution with mode = 0.

Also known as the Lorentz distribution.

Note

New code should use the standard_cauchy method of a default_rng() instance instead; please see the random-quick-start.

Parameters

size (int or tuple of ints, optional) – Output shape. If the given shape is, e.g., (m, n, k), then m * n * k samples are drawn. Default is None, in which case a single value is returned.

Returns

samples – The drawn samples.

Return type

ndarray or scalar

See also

Generator.standard_cauchy

which should be used for new code.

Notes

The probability density function for the full Cauchy distribution is

P(x; x_0, \gamma) = \frac{1}{\pi \gamma \bigl[ 1+
(\frac{x-x_0}{\gamma})^2 \bigr] }

and the Standard Cauchy distribution just sets x_0=0 and \gamma=1

The Cauchy distribution arises in the solution to the driven harmonic oscillator problem, and also describes spectral line broadening. It also describes the distribution of values at which a line tilted at a random angle will cut the x axis.

When studying hypothesis tests that assume normality, seeing how the tests perform on data from a Cauchy distribution is a good indicator of their sensitivity to a heavy-tailed distribution, since the Cauchy looks very much like a Gaussian distribution, but with heavier tails.

References

1

NIST/SEMATECH e-Handbook of Statistical Methods, “Cauchy Distribution”, https://www.itl.nist.gov/div898/handbook/eda/section3/eda3663.htm

2

Weisstein, Eric W. “Cauchy Distribution.” From MathWorld–A Wolfram Web Resource. http://mathworld.wolfram.com/CauchyDistribution.html

3

Wikipedia, “Cauchy distribution” https://en.wikipedia.org/wiki/Cauchy_distribution

Examples

Draw samples and plot the distribution:

>>> import matplotlib.pyplot as plt
>>> s = np.random.standard_cauchy(1000000)
>>> s = s[(s>-25) & (s<25)]  # truncate distribution so it plots well
>>> plt.hist(s, bins=100)
>>> plt.show()
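The heavy tails can also be checked numerically. The standard Cauchy cumulative distribution is 1/2 + arctan(x)/\pi, so its quartiles are -1, 0 and +1, and roughly 2.5% of draws fall outside [-25, 25]. A sketch assuming NumPy 1.17 or later, with an arbitrary seed:

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed
s = rng.standard_cauchy(200_000)

# The standard Cauchy CDF is 1/2 + arctan(x)/pi, so the quartiles sit at -1, 0 and +1.
q25, q50, q75 = np.percentile(s, [25, 50, 75])

# Heavy tails: a small but non-negligible share of draws lands far from the mode.
share_beyond_25 = np.mean(np.abs(s) > 25)
```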
simba.utils.data.standard_exponential(size=None)

Draw samples from the standard exponential distribution.

standard_exponential is identical to the exponential distribution with a scale parameter of 1.

Note

New code should use the standard_exponential method of a default_rng() instance instead; please see the random-quick-start.

Parameters

size (int or tuple of ints, optional) – Output shape. If the given shape is, e.g., (m, n, k), then m * n * k samples are drawn. Default is None, in which case a single value is returned.

Returns

out – Drawn samples.

Return type

float or ndarray

See also

Generator.standard_exponential

which should be used for new code.

Examples

Output a 3x8000 array:

>>> n = np.random.standard_exponential((3, 8000))
simba.utils.data.standard_gamma(shape, size=None)

Draw samples from a standard Gamma distribution.

Samples are drawn from a Gamma distribution with specified parameters, shape (sometimes designated “k”) and scale=1.

Note

New code should use the standard_gamma method of a default_rng() instance instead; please see the random-quick-start.

Parameters
  • shape (float or array_like of floats) – Parameter, must be non-negative.

  • size (int or tuple of ints, optional) – Output shape. If the given shape is, e.g., (m, n, k), then m * n * k samples are drawn. If size is None (default), a single value is returned if shape is a scalar. Otherwise, np.array(shape).size samples are drawn.

Returns

out – Drawn samples from the parameterized standard gamma distribution.

Return type

ndarray or scalar

See also

scipy.stats.gamma

probability density function, distribution or cumulative density function, etc.

Generator.standard_gamma

which should be used for new code.

Notes

The probability density for the Gamma distribution is

p(x) = x^{k-1}\frac{e^{-x/\theta}}{\theta^k\Gamma(k)},

where k is the shape and \theta the scale, and \Gamma is the Gamma function.

The Gamma distribution is often used to model the times to failure of electronic components, and arises naturally in processes for which the waiting times between Poisson distributed events are relevant.

References

1

Weisstein, Eric W. “Gamma Distribution.” From MathWorld–A Wolfram Web Resource. http://mathworld.wolfram.com/GammaDistribution.html

2

Wikipedia, “Gamma distribution”, https://en.wikipedia.org/wiki/Gamma_distribution

Examples

Draw samples from the distribution:

>>> shape, scale = 2., 1. # shape k and scale theta (fixed at 1 for standard_gamma)
>>> s = np.random.standard_gamma(shape, 1000000)

Display the histogram of the samples, along with the probability density function:

>>> import matplotlib.pyplot as plt
>>> import scipy.special as sps  
>>> count, bins, ignored = plt.hist(s, 50, density=True)
>>> y = bins**(shape-1) * ((np.exp(-bins/scale))/  
...                       (sps.gamma(shape) * scale**shape))
>>> plt.plot(bins, y, linewidth=2, color='r')  
>>> plt.show()
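With scale fixed at 1, the mean and variance of the standard Gamma distribution both equal the shape k, and multiplying the samples by \theta gives draws with scale \theta. A sketch with the Generator API, assuming NumPy 1.17 or later; the seed and the value of theta are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed
k = 2.0      # shape
theta = 3.0  # scale, chosen only for illustration

standard = rng.standard_gamma(k, size=200_000)

# With scale = 1 both the mean and the variance equal the shape k.
mean_standard = standard.mean()
var_standard = standard.var()

# Multiplying by theta gives Gamma(k, theta) samples, whose mean is k * theta.
scaled = theta * standard
```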
simba.utils.data.standard_normal(size=None)

Draw samples from a standard Normal distribution (mean=0, stdev=1).

Note

New code should use the standard_normal method of a default_rng() instance instead; please see the random-quick-start.

Parameters

size (int or tuple of ints, optional) – Output shape. If the given shape is, e.g., (m, n, k), then m * n * k samples are drawn. Default is None, in which case a single value is returned.

Returns

out – A floating-point array of shape size of drawn samples, or a single sample if size was not specified.

Return type

float or ndarray

See also

normal

Equivalent function with additional loc and scale arguments for setting the mean and standard deviation.

Generator.standard_normal

which should be used for new code.

Notes

For random samples from N(\mu, \sigma^2), use one of:

mu + sigma * np.random.standard_normal(size=...)
np.random.normal(mu, sigma, size=...)

Examples

>>> np.random.standard_normal()
2.1923875335537315 #random
>>> s = np.random.standard_normal(8000)
>>> s
array([ 0.6888893 ,  0.78096262, -0.89086505, ...,  0.49876311,  # random
       -0.38672696, -0.4685006 ])                                # random
>>> s.shape
(8000,)
>>> s = np.random.standard_normal(size=(3, 4, 2))
>>> s.shape
(3, 4, 2)

Two-by-four array of samples from N(3, 6.25):

>>> 3 + 2.5 * np.random.standard_normal(size=(2, 4))
array([[-4.49401501,  4.00950034, -1.81814867,  7.29718677],   # random
       [ 0.39924804,  4.68456316,  4.99394529,  4.84057254]])  # random
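The two recipes from the Notes can be written with the Generator API as follows. A sketch assuming NumPy 1.17 or later; the seed and sample size are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed
mu, sigma = 3.0, 2.5

# Shift and scale standard normal draws to obtain N(mu, sigma**2) samples.
shifted = mu + sigma * rng.standard_normal(size=100_000)

# Generator.normal takes the mean and standard deviation directly.
direct = rng.normal(mu, sigma, size=100_000)
```

Both arrays should have a sample mean near 3 and a sample standard deviation near 2.5 (variance 6.25).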
simba.utils.data.standard_t(df, size=None)

Draw samples from a standard Student’s t distribution with df degrees of freedom.

A special case of the hyperbolic distribution. As df gets large, the result resembles that of the standard normal distribution (standard_normal).

Note

New code should use the standard_t method of a default_rng() instance instead; please see the random-quick-start.

Parameters
  • df (float or array_like of floats) – Degrees of freedom, must be > 0.

  • size (int or tuple of ints, optional) – Output shape. If the given shape is, e.g., (m, n, k), then m * n * k samples are drawn. If size is None (default), a single value is returned if df is a scalar. Otherwise, np.array(df).size samples are drawn.

Returns

out – Drawn samples from the parameterized standard Student’s t distribution.

Return type

ndarray or scalar

See also

Generator.standard_t

which should be used for new code.

Notes

The probability density function for the t distribution is

P(x, df) = \frac{\Gamma(\frac{df+1}{2})}{\sqrt{\pi df}
\Gamma(\frac{df}{2})}\Bigl( 1+\frac{x^2}{df} \Bigr)^{-(df+1)/2}

The t test is based on an assumption that the data come from a Normal distribution. The t test provides a way to test whether the sample mean (that is the mean calculated from the data) is a good estimate of the true mean.

The derivation of the t-distribution was first published in 1908 by William Gosset while working for the Guinness Brewery in Dublin. Due to proprietary issues, he had to publish under a pseudonym, and so he used the name Student.

References

1

Dalgaard, Peter, “Introductory Statistics With R”, Springer, 2002.

2

Wikipedia, “Student’s t-distribution” https://en.wikipedia.org/wiki/Student%27s_t-distribution

Examples

From Dalgaard page 83 [1], suppose the daily energy intake for 11 women in kilojoules (kJ) is:

>>> intake = np.array([5260., 5470, 5640, 6180, 6390, 6515, 6805, 7515, \
...                    7515, 8230, 8770])

Does their energy intake deviate systematically from the recommended value of 7725 kJ?

We have 10 degrees of freedom, so is the sample mean within 95% of the recommended value?

>>> s = np.random.standard_t(10, size=100000)
>>> np.mean(intake)
6753.636363636364
>>> intake.std(ddof=1)
1142.1232221373727

Calculate the t statistic, setting the ddof parameter to the unbiased value so the divisor in the standard deviation will be degrees of freedom, N-1.

>>> t = (np.mean(intake)-7725)/(intake.std(ddof=1)/np.sqrt(len(intake)))
>>> import matplotlib.pyplot as plt
>>> h = plt.hist(s, bins=100, density=True)

For a one-sided t-test, how far out in the distribution does the t statistic appear?

>>> np.sum(s<t) / float(len(s))
0.0090699999999999999  #random

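So the one-sided p-value is about 0.009: if the null hypothesis were true, a t statistic this low would occur in only about 0.9% of samples. This is below the conventional 5% significance level, so the null hypothesis of no deviation from the recommended value is rejected.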
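The whole calculation can be collected into one self-contained sketch using the Generator API (NumPy 1.17 or later assumed; the seed and variable names are illustrative). The p-value is a Monte Carlo estimate, so it varies slightly with the seed.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed

intake = np.array([5260., 5470, 5640, 6180, 6390, 6515, 6805, 7515,
                   7515, 8230, 8770])
recommended = 7725.0

# t statistic: (sample mean - hypothesised mean) / standard error of the mean.
n = len(intake)
t_stat = (intake.mean() - recommended) / (intake.std(ddof=1) / np.sqrt(n))

# Monte Carlo one-sided p-value: share of t(n - 1) draws at or below the observed statistic.
draws = rng.standard_t(n - 1, size=100_000)
p_one_sided = np.mean(draws <= t_stat)
```

The t statistic is about -2.82, and only a small fraction of the simulated t values fall below it.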

simba.utils.data.triangular(left, mode, right, size=None)

Draw samples from the triangular distribution over the interval [left, right].

The triangular distribution is a continuous probability distribution with lower limit left, peak at mode, and upper limit right. Unlike the other distributions, these parameters directly define the shape of the pdf.

Note

New code should use the triangular method of a default_rng() instance instead; see the NumPy random quick start.

Parameters
  • left (float or array_like of floats) – Lower limit.

  • mode (float or array_like of floats) – The value where the peak of the distribution occurs. The value must fulfill the condition left <= mode <= right.

  • right (float or array_like of floats) – Upper limit, must be larger than left.

  • size (int or tuple of ints, optional) – Output shape. If the given shape is, e.g., (m, n, k), then m * n * k samples are drawn. If size is None (default), a single value is returned if left, mode, and right are all scalars. Otherwise, np.broadcast(left, mode, right).size samples are drawn.

Returns

out – Drawn samples from the parameterized triangular distribution.

Return type

ndarray or scalar

See also

Generator.triangular

which should be used for new code.

Notes

The probability density function for the triangular distribution is

P(x;l, m, r) = \begin{cases}
\frac{2(x-l)}{(r-l)(m-l)}& \text{for $l \leq x \leq m$},\\
\frac{2(r-x)}{(r-l)(r-m)}& \text{for $m \leq x \leq r$},\\
0& \text{otherwise}.
\end{cases}

The triangular distribution is often used in ill-defined problems where the underlying distribution is not known, but some knowledge of the limits and mode exists. Often it is used in simulations.

References

1

Wikipedia, “Triangular distribution” https://en.wikipedia.org/wiki/Triangular_distribution

Examples

Draw values from the distribution and plot the histogram:

>>> import matplotlib.pyplot as plt
>>> h = plt.hist(np.random.triangular(-3, 0, 8, 100000), bins=200,
...              density=True)
>>> plt.show()
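
As a quick sanity check without plotting, the sample mean can be compared with the closed-form mean of a triangular distribution, (left + mode + right) / 3. This is a minimal sketch using the Generator API recommended in the note above; the seed and variable names are assumptions made for illustration:

```python
import numpy as np

left, mode, right = -3.0, 0.0, 8.0

rng = np.random.default_rng(0)  # seed is an arbitrary assumption
samples = rng.triangular(left, mode, right, size=100_000)

# For a triangular distribution the mean is (left + mode + right) / 3.
expected_mean = (left + mode + right) / 3.0
sample_mean = samples.mean()

print(f"sample mean = {sample_mean:.3f}, expected = {expected_mean:.3f}")
print(f"range = [{samples.min():.3f}, {samples.max():.3f}]")
```
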
simba.utils.data.uniform(low=0.0, high=1.0, size=None)

Draw samples from a uniform distribution.

Samples are uniformly distributed over the half-open interval [low, high) (includes low, but excludes high). In other words, any value within the given interval is equally likely to be drawn by uniform.

Note

New code should use the uniform method of a default_rng() instance instead; see the NumPy random quick start.

Parameters
  • low (float or array_like of floats, optional) – Lower boundary of the output interval. All values generated will be greater than or equal to low. The default value is 0.

  • high (float or array_like of floats) – Upper boundary of the output interval. All values generated will be less than or equal to high. The default value is 1.0.

  • size (int or tuple of ints, optional) – Output shape. If the given shape is, e.g., (m, n, k), then m * n * k samples are drawn. If size is None (default), a single value is returned if low and high are both scalars. Otherwise, np.broadcast(low, high).size samples are drawn.

Returns

out – Drawn samples from the parameterized uniform distribution.

Return type

ndarray or scalar

See also

randint

Discrete uniform distribution, yielding integers.

random_integers

Discrete uniform distribution over the closed interval [low, high].

random_sample

Floats uniformly distributed over [0, 1).

random

Alias for random_sample.

rand

Convenience function that accepts dimensions as input, e.g., rand(2,2) would generate a 2-by-2 array of floats, uniformly distributed over [0, 1).

Generator.uniform

which should be used for new code.

Notes

The probability density function of the uniform distribution is

p(x) = \frac{1}{b - a}

anywhere within the interval [a, b), and zero elsewhere.

When high == low, values of low will be returned. If high < low, the results are officially undefined and may eventually raise an error, i.e. do not rely on this function to behave when passed arguments satisfying that inequality condition. The high limit may be included in the returned array of floats due to floating-point rounding in the equation low + (high-low) * random_sample(). For example:

>>> x = np.float32(5*0.99999999)
>>> x
5.0

Examples

Draw samples from the distribution:

>>> s = np.random.uniform(-1,0,1000)

All values are within the given interval:

>>> np.all(s >= -1)
True
>>> np.all(s < 0)
True

Display the histogram of the samples, along with the probability density function:

>>> import matplotlib.pyplot as plt
>>> count, bins, ignored = plt.hist(s, 15, density=True)
>>> plt.plot(bins, np.ones_like(bins), linewidth=2, color='r')
>>> plt.show()
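
The same checks can be written against the Generator API that the note above recommends for new code. This is a minimal sketch; the seed and the names in_range and density are assumptions made for illustration:

```python
import numpy as np

low, high = -1.0, 0.0

rng = np.random.default_rng(0)  # seed is an arbitrary assumption
s = rng.uniform(low, high, size=1000)

# Every sample should lie in the half-open interval [low, high).
in_range = bool(np.all((s >= low) & (s < high)))

# The density is constant at 1 / (high - low) inside the interval.
density = 1.0 / (high - low)

print(in_range, density, s.mean())
```
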
simba.utils.data.vonmises(mu, kappa, size=None)

Draw samples from a von Mises distribution.

Samples are drawn from a von Mises distribution with specified mode (mu) and dispersion (kappa), on the interval [-pi, pi].

The von Mises distribution (also known as the circular normal distribution) is a continuous probability distribution on the unit circle. It may be thought of as the circular analogue of the normal distribution.

Note

New code should use the vonmises method of a default_rng() instance instead; see the NumPy random quick start.

Parameters
  • mu (float or array_like of floats) – Mode (“center”) of the distribution.

  • kappa (float or array_like of floats) – Dispersion of the distribution, has to be >=0.

  • size (int or tuple of ints, optional) – Output shape. If the given shape is, e.g., (m, n, k), then m * n * k samples are drawn. If size is None (default), a single value is returned if mu and kappa are both scalars. Otherwise, np.broadcast(mu, kappa).size samples are drawn.

Returns

out – Drawn samples from the parameterized von Mises distribution.

Return type

ndarray or scalar

See also

scipy.stats.vonmises

probability density function, distribution, or cumulative distribution function, etc.

Generator.vonmises

which should be used for new code.

Notes

The probability density for the von Mises distribution is

p(x) = \frac{e^{\kappa \cos(x-\mu)}}{2\pi I_0(\kappa)},

where \mu is the mode and \kappa the dispersion, and I_0(\kappa) is the modified Bessel function of order 0.

The von Mises distribution is named for Richard Edler von Mises, who was born in Austria-Hungary, in what is now Ukraine. He fled to the United States in 1939 and became a professor at Harvard. He worked in probability theory, aerodynamics, fluid mechanics, and philosophy of science.

References

1

Abramowitz, M. and Stegun, I. A. (Eds.). “Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables, 9th printing,” New York: Dover, 1972.

2

von Mises, R., “Mathematical Theory of Probability and Statistics”, New York: Academic Press, 1964.

Examples

Draw samples from the distribution:

>>> mu, kappa = 0.0, 4.0 # mean and dispersion
>>> s = np.random.vonmises(mu, kappa, 1000)

Display the histogram of the samples, along with the probability density function:

>>> import matplotlib.pyplot as plt
>>> from scipy.special import i0  
>>> plt.hist(s, 50, density=True)
>>> x = np.linspace(-np.pi, np.pi, num=51)
>>> y = np.exp(kappa*np.cos(x-mu))/(2*np.pi*i0(kappa))  
>>> plt.plot(x, y, linewidth=2, color='r')  
>>> plt.show()
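
The density in the Notes section can be checked numerically without SciPy, since NumPy provides the modified Bessel function of order 0 as np.i0. The following is a minimal sketch; the helper name vonmises_pdf, the grid size, and the seed are assumptions made for illustration:

```python
import numpy as np

mu, kappa = 0.0, 4.0

def vonmises_pdf(x, mu, kappa):
    """Density from the Notes section; np.i0 is the order-0 modified Bessel function."""
    return np.exp(kappa * np.cos(x - mu)) / (2 * np.pi * np.i0(kappa))

# Left-endpoint Riemann sum over one full period [-pi, pi).
x = np.linspace(-np.pi, np.pi, num=2000, endpoint=False)
dx = 2 * np.pi / x.size
total = np.sum(vonmises_pdf(x, mu, kappa)) * dx

rng = np.random.default_rng(0)  # seed is an arbitrary assumption
s = rng.vonmises(mu, kappa, size=1000)
all_in_range = bool(np.all(np.abs(s) <= np.pi))

print(f"integral ~ {total:.6f}")
print(f"samples in [-pi, pi]: {all_in_range}")
```
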
simba.utils.data.wald(mean, scale, size=None)

Draw samples from a Wald, or inverse Gaussian, distribution.

As the scale approaches infinity, the distribution becomes more like a Gaussian. Some references claim that the Wald is an inverse Gaussian with mean equal to 1, but this is by no means universal.

The inverse Gaussian distribution was first studied in relationship to Brownian motion. In 1956 M.C.K. Tweedie used the name inverse Gaussian because there is an inverse relationship between the time to cover a unit distance and distance covered in unit time.

Note

New code should use the wald method of a default_rng() instance instead; see the NumPy random quick start.

Parameters
  • mean (float or array_like of floats) – Distribution mean, must be > 0.

  • scale (float or array_like of floats) – Scale parameter, must be > 0.

  • size (int or tuple of ints, optional) – Output shape. If the given shape is, e.g., (m, n, k), then m * n * k samples are drawn. If size is None (default), a single value is returned if mean and scale are both scalars. Otherwise, np.broadcast(mean, scale).size samples are drawn.

Returns

out – Drawn samples from the parameterized Wald distribution.

Return type

ndarray or scalar

See also

Generator.wald

which should be used for new code.

Notes

The probability density function for the Wald distribution is

P(x; mean, scale) = \sqrt{\frac{scale}{2\pi x^3}} \, e^{\frac{-scale(x-mean)^2}{2 \cdot mean^2 x}}

As noted above, the inverse Gaussian distribution first arose from attempts to model Brownian motion. It is also a competitor to the Weibull for use in reliability modeling and in modeling stock returns and interest rate processes.

References

1

Brighton Webs Ltd., Wald Distribution, https://web.archive.org/web/20090423014010/http://www.brighton-webs.co.uk:80/distributions/wald.asp

2

Chhikara, Raj S., and Folks, J. Leroy, “The Inverse Gaussian Distribution: Theory, Methodology, and Applications”, CRC Press, 1988.

3

Wikipedia, “Inverse Gaussian distribution” https://en.wikipedia.org/wiki/Inverse_Gaussian_distribution

Examples

Draw values from the distribution and plot the histogram:

>>> import matplotlib.pyplot as plt
>>> h = plt.hist(np.random.wald(3, 2, 100000), bins=200, density=True)
>>> plt.show()
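
The samples can also be checked against the known moments of the inverse Gaussian distribution, E[X] = mean and Var[X] = mean^3 / scale. This is a minimal sketch using the Generator API; the seed, the sample size, and the name expected_var are assumptions made for illustration:

```python
import numpy as np

mean, scale = 3.0, 2.0

rng = np.random.default_rng(0)  # seed is an arbitrary assumption
samples = rng.wald(mean, scale, size=200_000)

# Known moments of the inverse Gaussian distribution:
# E[X] = mean and Var[X] = mean**3 / scale.
expected_var = mean**3 / scale

print(f"sample mean = {samples.mean():.3f} (expected {mean})")
print(f"sample var  = {samples.var():.3f} (expected {expected_var})")
```

The distribution has a heavy right tail, so the sample variance converges more slowly than the sample mean.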
simba.utils.data.weibull(a, size=None)

Draw samples from a Weibull distribution.

Draw samples from a 1-parameter Weibull distribution with the given shape parameter a.

X = (-\ln(U))^{1/a}

Here, U is drawn from the uniform distribution over (0,1].

The more common 2-parameter Weibull, including a scale parameter \lambda, is just X = \lambda(-\ln(U))^{1/a}.

Note

New code should use the weibull method of a default_rng() instance instead; see the NumPy random quick start.

Parameters
  • a (float or array_like of floats) – Shape parameter of the distribution. Must be nonnegative.

  • size (int or tuple of ints, optional) – Output shape. If the given shape is, e.g., (m, n, k), then m * n * k samples are drawn. If size is None (default), a single value is returned if a is a scalar. Otherwise, np.array(a).size samples are drawn.

Returns

out – Drawn samples from the parameterized Weibull distribution.

Return type

ndarray or scalar

See also

scipy.stats.weibull_max, scipy.stats.weibull_min, scipy.stats.genextreme, gumbel

Generator.weibull

which should be used for new code.

Notes

The Weibull (or Type III asymptotic extreme value distribution for smallest values, SEV Type III, or Rosin-Rammler distribution) is one of a class of Generalized Extreme Value (GEV) distributions used in modeling extreme value problems. This class includes the Gumbel and Frechet distributions.

The probability density for the Weibull distribution is

p(x) = \frac{a}{\lambda}\Bigl(\frac{x}{\lambda}\Bigr)^{a-1}e^{-(x/\lambda)^a},

where a is the shape and \lambda the scale.

The function has its peak (the mode) at \lambda(\frac{a-1}{a})^{1/a}.

When a = 1, the Weibull distribution reduces to the exponential distribution.

References

1

Waloddi Weibull, Royal Technical University, Stockholm, 1939 “A Statistical Theory Of The Strength Of Materials”, Ingeniorsvetenskapsakademiens Handlingar Nr 151, 1939, Generalstabens Litografiska Anstalts Forlag, Stockholm.

2

Waloddi Weibull, “A Statistical Distribution Function of Wide Applicability”, Journal Of Applied Mechanics ASME Paper 1951.

3

Wikipedia, “Weibull distribution”, https://en.wikipedia.org/wiki/Weibull_distribution

Examples

Draw samples from the distribution:

>>> a = 5. # shape
>>> s = np.random.weibull(a, 1000)

Display the histogram of the samples, along with the probability density function:

>>> import matplotlib.pyplot as plt
>>> def weib(x, n, a):
...     # Two-parameter Weibull pdf with scale n and shape a.
...     return (a / n) * (x / n)**(a - 1) * np.exp(-(x / n)**a)
>>> count, bins, ignored = plt.hist(s)
>>> x = np.arange(1, 100.) / 50.
>>> # Rescale the pdf so its peak matches the tallest histogram bar.
>>> scale = count.max() / weib(x, 1., a).max()
>>> plt.plot(x, weib(x, 1., a) * scale)
>>> plt.show()
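
Since the function only draws from the 1-parameter form, a scale parameter has to be applied by multiplication, as described at the top of this entry. The following is a minimal sketch, assuming a hypothetical scale value lam = 2.0 and an arbitrary seed, which compares the sample mean with the closed-form mean \lambda \Gamma(1 + 1/a):

```python
import math
import numpy as np

a, lam = 5.0, 2.0  # shape and scale; lam is a hypothetical example value

rng = np.random.default_rng(0)  # seed is an arbitrary assumption
# weibull draws from the 1-parameter form; multiplying by lam applies the scale.
samples = lam * rng.weibull(a, size=100_000)

# Mean of the 2-parameter Weibull: lam * Gamma(1 + 1/a).
expected_mean = lam * math.gamma(1.0 + 1.0 / a)
# Mode from the Notes section: lam * ((a - 1) / a) ** (1 / a).
mode = lam * ((a - 1.0) / a) ** (1.0 / a)

print(f"sample mean = {samples.mean():.4f}, expected = {expected_mean:.4f}")
print(f"mode = {mode:.4f}")
```
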
simba.utils.data.zipf(a, size=None)

Draw samples from a Zipf distribution.

Samples are drawn from a Zipf distribution with specified parameter a > 1.

The Zipf distribution (also known as the zeta distribution) is a discrete probability distribution that satisfies Zipf’s law: the frequency of an item is inversely proportional to its rank in a frequency table.

Note

New code should use the zipf method of a default_rng() instance instead; see the NumPy random quick start.

Parameters
  • a (float or array_like of floats) – Distribution parameter. Must be greater than 1.

  • size (int or tuple of ints, optional) – Output shape. If the given shape is, e.g., (m, n, k), then m * n * k samples are drawn. If size is None (default), a single value is returned if a is a scalar. Otherwise, np.array(a).size samples are drawn.

Returns

out – Drawn samples from the parameterized Zipf distribution.

Return type

ndarray or scalar

See also

scipy.stats.zipf

probability mass function, distribution, or cumulative distribution function, etc.

Generator.zipf

which should be used for new code.

Notes

The probability mass function for the Zipf distribution is

p(x) = \frac{x^{-a}}{\zeta(a)},

for integers x \geq 1, where \zeta is the Riemann zeta function.

It is named for the American linguist George Kingsley Zipf, who noted that the frequency of any word in a sample of a language is inversely proportional to its rank in the frequency table.

References

1

Zipf, G. K., “Selected Studies of the Principle of Relative Frequency in Language,” Cambridge, MA: Harvard Univ. Press, 1932.

Examples

Draw samples from the distribution:

>>> a = 2. # parameter
>>> s = np.random.zipf(a, 1000)

Display the histogram of the samples, along with the probability density function:

>>> import matplotlib.pyplot as plt
>>> from scipy import special  

Truncate s values at 50 so plot is interesting:

>>> count, bins, ignored = plt.hist(s[s<50], 50, density=True)
>>> x = np.arange(1., 50.)
>>> y = x**(-a) / special.zeta(a)
>>> plt.plot(x, y/max(y), linewidth=2, color='r')  
>>> plt.show()
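
The probability formula in the Notes section can be compared directly with empirical frequencies. This is a minimal sketch for the special case a = 2, where \zeta(2) = \pi^2 / 6 is known in closed form, so SciPy is not needed; the seed and the names pmf and empirical are assumptions made for illustration:

```python
import numpy as np

a = 2.0

rng = np.random.default_rng(0)  # seed is an arbitrary assumption
s = rng.zipf(a, size=100_000)

# zeta(2) = pi**2 / 6 in closed form (valid only for a = 2).
zeta_a = np.pi**2 / 6

# Theoretical probabilities for the first five integers.
k = np.arange(1, 6, dtype=float)
pmf = k**(-a) / zeta_a

# Empirical relative frequencies of the same integers.
empirical = np.array([np.mean(s == ki) for ki in k])

for ki, p, e in zip(k, pmf, empirical):
    print(f"k={int(ki)}: pmf={p:.4f}, empirical={e:.4f}")
```

For other values of a, scipy.special.zeta can supply the normalizing constant.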

Enums

class simba.utils.enums.ConfigKey(value)[source]

Bases: Enum

An enumeration.

ANIMAL_CNT = 'animal_no'
BODYPART_DIRECTION_VALUE = 'bodypart_direction'
CREATE_ENSEMBLE_SETTINGS = 'create ensemble settings'
DIRECTIONALITY_SETTINGS = 'Directionality settings'
DISTANCE_MM = 'distance_mm'
DISTANCE_PLOT_SETTINGS = 'Distance plot'
FILE_TYPE = 'workflow_file_type'
FOLDER_PATH = 'folder_path'
FRAME_SETTINGS = 'Frame settings'
GENERAL_SETTINGS = 'General settings'
HEATMAP_SETTINGS = 'Heatmap settings'
LINE_PLOT_SETTINGS = 'Line plot settings'
LOCATION_CRITERION = 'location_criterion'
MIN_BOUT_LENGTH = 'Minimum_bout_lengths'
MODEL_DIR = 'model_dir'
MOVEMENT_CRITERION = 'movement_criterion'
MULTI_ANIMAL_IDS = 'ID_list'
MULTI_ANIMAL_ID_SETTING = 'Multi animal IDs'
OS = 'OS_system'
OUTLIER_SETTINGS = 'Outlier settings'
PATH_PLOT_SETTINGS = 'Path plot settings'
POSE_SETTING = 'pose_estimation_body_parts'
PROBABILITY_THRESHOLD = 'probability_threshold'
PROCESS_MOVEMENT_SETTINGS = 'process movements'
PROJECT_NAME = 'project_name'
PROJECT_PATH = 'project_path'
RF_JOBS = 'RF_n_jobs'
ROI_ANIMAL_CNT = 'no_of_animals'
ROI_SETTINGS = 'ROI settings'
SKLEARN_BP_PROB_THRESH = 'bp_threshold_sklearn'
SML_SETTINGS = 'SML settings'
TARGET_CNT = 'No_targets'
THRESHOLD_SETTINGS = 'threshold_settings'
VALIDATION_SETTINGS = 'validation/run model'
VALIDATION_VIDEO = 'generate_validation_video'
VIDEO_INFO_CSV = 'video_info.csv'
class simba.utils.enums.Defaults(value)[source]

Bases: Enum

An enumeration.

BROWSE_FILE_BTN_TEXT = 'Browse File'
BROWSE_FOLDER_BTN_TEXT = 'Browse Folder'
CHUNK_SIZE = 1
LARGE_MAX_TASK_PER_CHILD = 1000
MAXIMUM_MAX_TASK_PER_CHILD = 8000
MAX_TASK_PER_CHILD = 10
NO_FILE_SELECTED_TEXT = 'No file selected'
SPLASH_TIME = 2500
STR_SPLIT_DELIMITER = '\t'
WELCOME_MSG = 'Welcome fellow scientists! \n SimBA v.1.91.9 \n '
class simba.utils.enums.DirNames(value)[source]

Bases: Enum

An enumeration.

BP_NAMES = 'bp_names'
CONFIGS = 'configs'
CSV = 'csv'
FEATURES_EXTRACTED = 'features_extracted'
FRAMES = 'frames'
INPUT = 'input'
INPUT_CSV = 'input_csv'
LOGS = 'logs'
MACHINE_RESULTS = 'machine_results'
MEASURES = 'measures'
MODEL = 'models'
OUTLIER_MOVEMENT = 'outlier_corrected_movement'
OUTLIER_MOVEMENT_LOCATION = 'outlier_corrected_movement_location'
OUTPUT = 'output'
POSE_CONFIGS = 'pose_configs'
PROJECT = 'project_folder'
TARGETS_INSERTED = 'targets_inserted'
VIDEOS = 'videos'
class simba.utils.enums.Dtypes(value)[source]

Bases: Enum

An enumeration.

ENTROPY = 'entropy'
FLOAT = 'float'
FOLDER = 'folder_path'
INT = 'int'
NAN = 'NaN'
NONE = 'None'
SQRT = 'sqrt'
STR = 'str'
class simba.utils.enums.Formats(value)[source]

Bases: Enum

An enumeration.

AREA = 'area'
AVI_CODEC = 'XVID'
BATCH_CODEC = 'libx264'
CSV = 'csv'
DLC_FILETYPES = {'box': ['bx.h5', 'bx_filtered.h5'], 'ellipse': ['el.h5', 'el_filtered.h5'], 'skeleton': ['sk.h5', 'sk_filtered.h5']}
DLC_NETWORK_FILE_NAMES = ['dlc_resnet50', 'dlc_resnet_50', 'dlc_dlcrnetms5', 'dlc_effnet_b0', 'dlc_resnet101']
FONT = 4
H5 = 'h5'
LABELFRAME_HEADER_CLICKABLE_COLOR = '#0563c1'
LABELFRAME_HEADER_CLICKABLE_FORMAT = ('Helvetica', 12, 'bold', 'underline')
LABELFRAME_HEADER_FORMAT = ('Helvetica', 12, 'bold')
MP4_CODEC = 'mp4v'
PARQUET = 'parquet'
PERIMETER = 'perimeter'
PICKLE = 'pickle'
ROOT_WINDOW_SIZE = '750x750'
TKINTER_FONT = ('Rockwell', 11)
XLXS = 'xlsx'
class simba.utils.enums.GeometryEnum(value)[source]

Bases: Enum

An enumeration.

CAP_STYLE_MAP = {'flat': 3, 'round': 1, 'square': 2}
CONTOURS_MODE_MAP = {'all': 1, 'exterior': 0, 'interior': 3}
CONTOURS_RETRIEVAL_MAP = {'kcos': 4, 'l1': 3, 'none': 0, 'simple': 2}
HISTOGRAM_COMPARISON_MAP = {'bhattacharyya': 3, 'chi_square': 1, 'chi_square_alternative': 5, 'correlation': 0, 'hellinger': 4, 'intersection': 2}
RANKING_METHODS = ['area', 'min_distance', 'max_distance', 'mean_distance', 'left_to_right', 'top_to_bottom']
class simba.utils.enums.Keys(value)[source]

Bases: Enum

An enumeration.

DOCUMENTATION = 'documentation'
FRAME_COUNT = 'frame_count'
ROI_CIRCLES = 'circleDf'
ROI_POLYGONS = 'polygons'
ROI_RECTANGLES = 'rectangles'
class simba.utils.enums.Labelling(value)[source]

Bases: Enum

An enumeration.

MAX_FRM_SIZE = (1280, 650)
PADDING = 5
PLAY_VIDEO_SCRIPT_PATH = '/home/docs/checkouts/readthedocs.org/user_builds/simba-uw-tf-dev/checkouts/latest/simba/labelling/play_annotation_video.py'
VALID_ANNOTATIONS_ADVANCED = [0, 1, 2]
VIDEO_FRAME_SIZE = (700, 500)

Bases: Enum

An enumeration.

ADDITIONAL_IMPORTS = 'https://github.com/sgoldenlab/simba/blob/master/docs/Scenario1.md#step-2-optional-step--import-more-dlc-tracking-data-or-videos'
ADVANCED_LBL = 'https://github.com/sgoldenlab/simba/blob/master/docs/advanced_labelling.md'
AGGREGATE_BOOL_STATS = 'https://github.com/sgoldenlab/simba/blob/master/docs/ROI_tutorial.md#compute-aggregate-conditional-statistics-from-boolean-fields'
ANALYZE_ML_RESULTS = 'https://github.com/sgoldenlab/simba/blob/master/docs/Scenario2.md#part-4--analyze-machine-results'
ANALYZE_ROI = 'https://github.com/sgoldenlab/simba/blob/master/docs/ROI_tutorial.md#part-2-analyzing-roi-data'
APPEND_ROI_FEATURES = 'https://github.com/sgoldenlab/simba/blob/master/docs/ROI_tutorial.md#part-3-generating-features-from-roi-data'
BATCH_PREPROCESS = 'https://github.com/sgoldenlab/simba/blob/master/docs/tutorial_process_videos.md'
BBOXES = 'https://github.com/sgoldenlab/simba/blob/master/docs/anchored_rois.md'
CIRCLE_CROP = 'https://github.com/sgoldenlab/simba/blob/master/docs/Tutorial_tools.md#circle-crop'
CLF_VALIDATION = 'https://github.com/sgoldenlab/simba/blob/master/docs/classifier_validation.md'
CONCAT_VIDEOS = 'https://github.com/sgoldenlab/simba/blob/master/docs/Scenario2.md#merging-concatenating-videos'
COUNT_ANNOTATIONS_IN_PROJECT = 'https://github.com/sgoldenlab/simba/blob/master/docs/label_behavior.md#count-annotations-in-simba-project'
COUNT_ANNOTATIONS_OUTSIDE_PROJECT = 'https://github.com/sgoldenlab/simba/blob/master/docs/Tutorial_tools.md#extract-project-annotation-counts'
CREATE_PROJECT = 'https://github.com/sgoldenlab/simba/blob/master/docs/Scenario1.md#step-1-generate-project-config'
CUE_LIGHTS = 'https://github.com/sgoldenlab/simba/blob/master/docs/cue_light_tutorial.md'
DATA_ANALYSIS = 'https://github.com/sgoldenlab/simba/blob/master/docs/Scenario2.md#part-4--analyze-machine-results'
DATA_TABLES = 'https://github.com/sgoldenlab/simba/blob/master/docs/Scenario2.md#visualizing-data-tables'
DIRECTING_ANIMALS_PLOTS = 'https://github.com/sgoldenlab/simba/blob/master/docs/directionality_between_animals.md'
DISTANCE_PLOTS = 'https://github.com/sgoldenlab/simba/blob/master/docs/Scenario2.md#visualizing-distance-plots'
DOWNSAMPLE = 'https://github.com/sgoldenlab/simba/blob/master/docs/Tutorial_tools.md#downsample-video'
EXTRACT_FEATURES = 'https://github.com/sgoldenlab/simba/blob/master/docs/Scenario1.md#step-5-extract-features'
FEATURE_SUBSETS = 'https://github.com/sgoldenlab/simba/blob/master/docs/feature_subsets.md'
FSTTC = 'https://github.com/sgoldenlab/simba/blob/master/docs/FSTTC.md'
GANTT_PLOTS = 'https://github.com/sgoldenlab/simba/blob/master/docs/Scenario2.md#visualizing-gantt-charts'
GITHUB_REPO = 'https://github.com/sgoldenlab/simba'
GITTER = 'https://gitter.im/SimBA-Resource/community'
HEATMAP_CLF = 'https://github.com/sgoldenlab/simba/blob/master/docs/Scenario2.md#visualizing-classification-heatmaps'
HEATMAP_LOCATION = 'https://github.com/sgoldenlab/simba/blob/master/docs/ROI_tutorial.md#heatmaps'
KLEINBERG = 'https://github.com/sgoldenlab/simba/blob/master/docs/kleinberg_filter.md'
LABEL_BEHAVIOR = 'https://github.com/sgoldenlab/simba/blob/master/docs/label_behavior.md'
LOAD_PROJECT = 'https://github.com/sgoldenlab/simba/blob/master/docs/Scenario1.md#part-2-load-project-1'
OSF_REPO = 'https://osf.io/tmu6y/'
OULIERS = 'https://github.com/sgoldenlab/simba/blob/master/misc/Outlier_settings.pdf'
OUTLIERS_DOC = 'https://github.com/sgoldenlab/simba/blob/master/docs/Scenario1.md#step-4-outlier-correction'
OUT_OF_SAMPLE_VALIDATION = 'https://github.com/sgoldenlab/simba/blob/master/docs/Scenario1.md#step-8-evaluating-the-model-on-new-out-of-sample-data'
PATH_PLOTS = 'https://github.com/sgoldenlab/simba/blob/master/docs/Scenario2.md#visualizing-path-plots'
PLOTLY = 'https://github.com/sgoldenlab/simba/blob/master/docs/plotly_dash.md'
PSEUDO_LBL = 'https://github.com/sgoldenlab/simba/blob/master/docs/pseudoLabel.md'
REMOVE_CLF = 'https://github.com/sgoldenlab/simba/blob/master/docs/Scenario1.md#step-2-optional-step--import-more-dlc-tracking-data-or-videos'
ROI = 'https://github.com/sgoldenlab/simba/blob/master/docs/ROI_tutorial_new.md'
ROI_DATA_ANALYSIS = 'https://github.com/sgoldenlab/simba/blob/master/docs/ROI_tutorial.md#part-2-analyzing-roi-data'
ROI_DATA_PLOT = 'https://github.com/sgoldenlab/simba/blob/master/docs/ROI_tutorial.md#part-4-visualizing-roi-data'
ROI_FEATURES = 'https://github.com/sgoldenlab/simba/blob/master/docs/ROI_tutorial.md#part-3-generating-features-from-roi-data'
ROI_FEATURES_PLOT = 'https://github.com/sgoldenlab/simba/blob/master/docs/ROI_tutorial.md#part-5-visualizing-roi-features'
SCENARIO_2 = 'https://github.com/sgoldenlab/simba/blob/master/docs/Scenario2.md'
SET_RUN_ML_PARAMETERS = 'https://github.com/sgoldenlab/simba/blob/master/docs/Scenario2.md#part-3-run-the-classifier-on-new-data'
SKLEARN_PLOTS = 'https://github.com/sgoldenlab/simba/blob/master/docs/Scenario2.md#visualizing-classifications'
THIRD_PARTY_ANNOTATION = 'https://github.com/sgoldenlab/simba/blob/master/docs/third_party_annot.md'
THIRD_PARTY_ANNOTATION_NEW = 'https://github.com/sgoldenlab/simba/blob/master/docs/third_party_annot_new.md'
TRAIN_ML_MODEL = 'https://github.com/sgoldenlab/simba/blob/master/docs/Scenario1.md#step-7-train-machine-model'
USER_DEFINED_FEATURE_EXTRACTION = 'https://github.com/sgoldenlab/simba/blob/master/docs/extractFeatures.md'
VIDEO_PARAMETERS = 'https://github.com/sgoldenlab/simba/blob/master/docs/Scenario1.md#step-3-set-video-parameters'
VIDEO_TOOLS = 'https://github.com/sgoldenlab/simba/blob/master/docs/Tutorial_tools.md'
VISUALIZATION = 'https://github.com/sgoldenlab/simba/blob/master/docs/Scenario2.md#part-5--visualizing-results'
VISUALIZE_CLF_PROBABILITIES = 'https://github.com/sgoldenlab/simba/blob/master/docs/Scenario2.md#visualizing-classification-probabilities'
class simba.utils.enums.MLParamKeys(value)[source]

Bases: Enum

An enumeration.

CLASSIFIER = 'classifier'
CLASSIFIER_MAP = 'classifier_map'
CLASSIFIER_NAME = 'classifier_name'
CLASS_CUSTOM_WEIGHTS = 'class_custom_weights'
CLASS_WEIGHTS = 'class_weights'
CLF_REPORT = 'generate_classification_report'
EX_DECISION_TREE = 'generate_example_decision_tree'
EX_DECISION_TREE_FANCY = 'generate_example_decision_tree_fancy'
IMPORTANCE_BARS_N = 'N_feature_importance_bars'
IMPORTANCE_BAR_CHART = 'generate_features_importance_bar_graph'
IMPORTANCE_LOG = 'generate_features_importance_log'
LEARNING_CURVE = 'generate_sklearn_learning_curves'
LEARNING_CURVE_DATA_SPLITS = 'learning_curve_data_splits'
LEARNING_CURVE_K_SPLITS = 'learning_curve_k_splits'
LEARNING_DATA_SPLITS = 'LearningCurve_shuffle_data_splits'
MIN_LEAF = 'rf_min_sample_leaf'
MODEL_TO_RUN = 'model_to_run'
N_FEATURE_IMPORTANCE_BARS = 'n_feature_importance_bars'
OVERSAMPLE_RATIO = 'over_sample_ratio'
OVERSAMPLE_SETTING = 'over_sample_setting'
PARTIAL_DEPENDENCY = 'partial_dependency'
PERMUTATION_IMPORTANCE = 'compute_feature_permutation_importance'
PRECISION_RECALL = 'generate_precision_recall_curves'
RF_CRITERION = 'rf_criterion'
RF_ESTIMATORS = 'rf_n_estimators'
RF_MAX_DEPTH = 'rf_max_depth'
RF_MAX_FEATURES = 'rf_max_features'
RF_METADATA = 'generate_rf_model_meta_data_file'
SAVE_TRAIN_TEST_FRM_IDX = 'save_train_test_frm_idx'
SHAP_ABSENT = 'shap_target_absent_no'
SHAP_MULTIPROCESS = 'shap_multiprocess'
SHAP_PRESENT = 'shap_target_present_no'
SHAP_SAVE_ITERATION = 'shap_save_iteration'
SHAP_SCORES = 'generate_shap_scores'
TRAIN_TEST_SPLIT_TYPE = 'train_test_split_type'
TT_SIZE = 'train_test_size'
UNDERSAMPLE_RATIO = 'under_sample_ratio'
UNDERSAMPLE_SETTING = 'under_sample_setting'
class simba.utils.enums.Methods(value)[source]

Bases: Enum

An enumeration.

ADDITIONAL_THIRD_PARTY_CLFS = 'ADDITIONAL third-party behavior detected'
ANOVA = 'ANOVA'
BORIS = 'BORIS'
CLASSIC_TRACKING = 'Classic tracking'
CREATE_POSE_CONFIG = 'Create pose config...'
ERROR = 'ERROR'
GAUSSIAN = 'Gaussian'
INVALID_THIRD_PARTY_APPENDER_FILE = 'INVALID annotations file data format'
MULTI_TRACKING = 'Multi tracking'
RANDOM_UNDERSAMPLE = 'random undersample'
SAVITZKY_GOLAY = 'Savitzky Golay'
SMOTE = 'SMOTE'
SMOTEENN = 'SMOTEENN'
SPLIT_TYPE_BOUTS = 'BOUTS'
SPLIT_TYPE_FRAMES = 'FRAMES'
THIRD_PARTY_ANNOTATION_FILE_NOT_FOUND = 'Annotations data file NOT FOUND'
THIRD_PARTY_EVENT_COUNT_CONFLICT = 'Annotations EVENT COUNT conflict'
THIRD_PARTY_EVENT_OVERLAP = 'Annotations OVERLAP inaccuracy'
THIRD_PARTY_FPS_CONFLICT = 'Annotations and pose FPS conflict'
THIRD_PARTY_FRAME_COUNT_CONFLICT = 'Annotations and pose FRAME COUNT conflict'
THREE_D_TRACKING = '3D tracking'
USER_DEFINED = 'user_defined'
WARNING = 'WARNING'
ZERO_THIRD_PARTY_VIDEO_ANNOTATIONS = 'ZERO third-party video annotations found'
ZERO_THIRD_PARTY_VIDEO_BEHAVIOR_ANNOTATIONS = 'ZERO third-party video behavior annotations found'
class simba.utils.enums.OS(value)[source]

Bases: Enum

An enumeration.

LINUX = 'Linux'
MAC = 'Darwin'
PYTHON_VER = '3.6'
WINDOWS = 'Windows'
class simba.utils.enums.Options(value)[source]

Bases: Enum

An enumeration.

ALL_IMAGE_FORMAT_OPTIONS = ('.bmp', '.png', '.jpeg', '.jpg', '.webp')
ALL_IMAGE_FORMAT_STR_OPTIONS = '.bmp .png .jpeg .jpg'
ALL_VIDEO_FORMAT_OPTIONS = ('.avi', '.mp4', '.mov', '.flv', '.m4v')
ALL_VIDEO_FORMAT_STR_OPTIONS = '.avi .mp4 .mov .flv .m4v'
BOOL_STR_OPTIONS = ['TRUE', 'FALSE']
BUCKET_METHODS = ['fd', 'doane', 'auto', 'scott', 'stone', 'rice', 'sturges', 'sqrt']
CLASSICAL_TRACKING_OPTIONS = ['1 animal; 4 body-parts', '1 animal; 7 body-parts', '1 animal; 8 body-parts', '1 animal; 9 body-parts', '2 animals; 8 body-parts', '2 animals; 14 body-parts', '2 animals; 16 body-parts', 'MARS']
CLASS_WEIGHT_OPTIONS = ['None', 'balanced', 'balanced_subsample', 'custom']
CLF_CRITERION = ['gini', 'entropy']
CLF_DESCRIPTIVES_OPTIONS = ['Bout count', 'Total event duration (s)', 'Mean event bout duration (s)', 'Median event bout duration (s)', 'First event occurrence (s)', 'Mean event bout interval duration (s)', 'Median event bout interval duration (s)']
CLF_MAX_FEATURES = ['sqrt', 'log', 'None']
CLF_MODELS = ['RF', 'GBC', 'XGBoost']
CLF_TEST_SIZE_OPTIONS = ['0.1', '0.2', '0.3', '0.4', '0.5', '0.6', '0.7', '0.8', '0.9']
DPI_OPTIONS = [100, 200, 400, 800, 1600, 3200]
FEATURE_SUBSET_OPTIONS = ['Two-point body-part distances (mm)', 'Within-animal three-point body-part angles (degrees)', 'Within-animal three-point convex hull perimeters (mm)', 'Within-animal four-point convex hull perimeters (mm)', 'Entire animal convex hull perimeters (mm)', 'Entire animal convex hull area (mm2)', 'Frame-by-frame body-part movements (mm)', 'Frame-by-frame body-part distances to ROI centers (mm)', 'Frame-by-frame body-parts inside ROIs (Boolean)']
GANTT_VALIDATION_OPTIONS = ['None', 'Gantt chart: final frame only (slightly faster)', 'Gantt chart: video']
HEATMAP_BIN_SIZE_OPTIONS = ['10×10', '20×20', '40×40', '80×80', '100×100', '160×160', '320×320', '640×640', '1280×1280']
HEATMAP_SHADING_OPTIONS = ['gouraud', 'flat']
IMPORT_TYPE_OPTIONS = ['CSV (DLC/DeepPoseKit)', 'JSON (BENTO)', 'H5 (multi-animal DLC)', 'SLP (SLEAP)', 'CSV (SLEAP)', 'H5 (SLEAP)', 'TRK (multi-animal APT)', 'MAT (DANNCE 3D)']
INTERPOLATION_OPTIONS = ['Animal(s): Nearest', 'Animal(s): Linear', 'Animal(s): Quadratic', 'Body-parts: Nearest', 'Body-parts: Linear', 'Body-parts: Quadratic']
INTERPOLATION_OPTIONS_W_NONE = ['None', 'Animal(s): Nearest', 'Animal(s): Linear', 'Animal(s): Quadratic', 'Body-parts: Nearest', 'Body-parts: Linear', 'Body-parts: Quadratic']
MIN_MAX_SCALER = 'MIN-MAX'
MULTI_ANIMAL_TRACKING_OPTIONS = ['Multi-animals; 4 body-parts', 'Multi-animals; 7 body-parts', 'Multi-animals; 8 body-parts', 'AMBER']
MULTI_DLC_TYPE_IMPORT_OPTION = ['skeleton', 'box', 'ellipse']
OVERSAMPLE_OPTIONS = ['None', 'SMOTE', 'SMOTEENN']
PALETTE_OPTIONS = ['magma', 'jet', 'inferno', 'plasma', 'viridis', 'gnuplot2', 'RdBu', 'winter']
PALETTE_OPTIONS_CATEGORICAL = ['Pastel1', 'Pastel2', 'Paired', 'Accent', 'Dark2', 'Set1', 'Set2', 'Set3', 'tab10', 'tab20']
PERFORM_FLAGS = ['yes', True, 'True']
QUANTILE_SCALER = 'QUANTILE'
RESOLUTION_OPTIONS = ['320×240', '640×480', '720×480', '800×640', '960×800', '1120×960', '1280×720', '1980×1080']
RESOLUTION_OPTIONS_2 = ['AUTO', 240, 320, 480, 640, 720, 800, 960, 1120, 1080, 1980]
ROLLING_WINDOW_DIVISORS = [2, 5, 6, 7.5, 15]
RUN_OPTIONS_FLAGS = ['yes', True, 'True', 'False', 'no', False, 'true', 'false']
SCALER_NAMES = ['MIN-MAX', 'STANDARD', 'QUANTILE']
SCALER_OPTIONS = ['MIN-MAX', 'STANDARD', 'QUANTILE']
SMOOTHING_OPTIONS = ['Gaussian', 'Savitzky Golay']
SMOOTHING_OPTIONS_W_NONE = ['None', 'Gaussian', 'Savitzky Golay']
SPEED_OPTIONS = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2.0]
STANDARD_SCALER = 'STANDARD'
THIRD_PARTY_ANNOTATION_APPS_OPTIONS = ['BORIS', 'ETHOVISION', 'OBSERVER', 'SOLOMON', 'DEEPETHOGRAM', 'BENTO']
THIRD_PARTY_ANNOTATION_ERROR_OPTIONS = ['INVALID annotations file data format', 'ADDITIONAL third-party behavior detected', 'Annotations OVERLAP conflict', 'ZERO third-party video behavior annotations found', 'Annotations and pose FRAME COUNT conflict', 'Annotations EVENT COUNT conflict', 'Annotations data file NOT FOUND']
THREE_DIM_TRACKING_OPTIONS = ['3D tracking']
TIMEBINS_MEASURMENT_OPTIONS = ['First occurrence (s)', 'Event count', 'Total event duration (s)', 'Mean event duration (s)', 'Median event duration (s)', 'Mean event interval (s)', 'Median event interval (s)']
TRACKING_TYPE_OPTIONS = ['Classic tracking', 'Multi tracking', '3D tracking']
TRAIN_TEST_SPLIT = ['FRAMES', 'BOUTS']
UNDERSAMPLE_OPTIONS = ['None', 'random undersample']
UNSUPERVISED_FEATURE_OPTIONS = ['INCLUDE FEATURE DATA (ORIGINAL)', 'INCLUDE FEATURES (SCALED)', 'EXCLUDE FEATURE DATA']
VIDEO_FORMAT_OPTIONS = ['mp4', 'avi']
WORKFLOW_FILE_TYPE_OPTIONS = ['csv', 'parquet']
WORKFLOW_FILE_TYPE_STR_OPTIONS = '.csv .parquet'
class simba.utils.enums.Paths(value)[source]

Bases: Enum

An enumeration.

ABOUT_ME = PosixPath('assets/img/about_me.png')
ANNOTATED_FRAMES_DIR = PosixPath('frames/output/annotated_frames')
BG_IMG_PATH = PosixPath('assets/img/bg_2024.png')
BODY_PART_DIRECTIONALITY_DF_DIR = PosixPath('logs/body_part_directionality_dataframes')
BP_NAMES = PosixPath('logs/measures/pose_configs/bp_names/project_bp_names.csv')
CLF_DATA_VALIDATION_DIR = PosixPath('csv/validation')
CLF_VALIDATION_DIR = PosixPath('frames/output/classifier_validation')
CLUSTER_EXAMPLES = PosixPath('frames/output/cluster_examples')
CONCAT_VIDEOS_DIR = PosixPath('frames/output/merged')
CRITICAL_VALUES = PosixPath('simba/assets/lookups/critical_values_05.pickle')
DATA_TABLE = PosixPath('frames/output/live_data_table')
DETAILED_ROI_DATA_DIR = PosixPath('logs/Detailed_ROI_data')
DIRECTING_ANIMALS_OUTPUT_PATH = PosixPath('frames/output/ROI_directionality_visualize')
DIRECTING_BETWEEN_ANIMALS_OUTPUT_PATH = PosixPath('frames/output/Directing_animals')
DIRECTING_BETWEEN_ANIMAL_BODY_PART_OUTPUT_PATH = PosixPath('frames/output/Body_part_directing_animals')
DIRECTIONALITY_DF_DIR = PosixPath('logs/directionality_dataframes')
FEATURES_EXTRACTED_DIR = PosixPath('csv/features_extracted')
FRAMES_OUTPUT_DIR = PosixPath('frames/output')
GANTT_PLOT_DIR = PosixPath('frames/output/gantt_plots')
HEATMAP_CLF_LOCATION_DIR = PosixPath('frames/output/heatmaps_classifier_locations')
HEATMAP_LOCATION_DIR = PosixPath('frames/output/heatmaps_locations')
ICON_ASSETS = PosixPath('assets/icons')
INPUT_CSV = PosixPath('csv/input_csv')
INPUT_FRAMES_DIR = PosixPath('frames/input')
LINE_PLOT_DIR = PosixPath('frames/output/line_plot')
LOGO_ICON_DARWIN_PATH = PosixPath('assets/icons/SimBA_logo.png')
LOGO_ICON_WINDOWS_PATH = PosixPath('assets/icons/SimBA_logo.ico')
MACHINE_RESULTS_DIR = PosixPath('csv/machine_results')
OUTLIER_CORRECTED = PosixPath('csv/outlier_corrected_movement_location')
OUTLIER_CORRECTED_MOVEMENT = PosixPath('csv/outlier_corrected_movement')
PATH_PLOT_DIR = PosixPath('frames/output/path_plots')
PROBABILITY_PLOTS_DIR = PosixPath('frames/output/probability_plots')
PROJECT_POSE_CONFIG_NAMES = PosixPath('pose_configurations/configuration_names/pose_config_names.csv')
ROI_ANALYSIS = PosixPath('frames/output/ROI_analysis')
ROI_DEFINITIONS = PosixPath('measures/ROI_definitions.h5')
ROI_FEATURES = PosixPath('frames/output/ROI_features')
SCHEMATICS = PosixPath('pose_configurations/schematics')
SHAP_LOGS = PosixPath('logs/shap')
SIMBA_BP_CONFIG_PATH = PosixPath('pose_configurations/bp_names/bp_names.csv')
SIMBA_FEATURE_EXTRACTION_COL_NAMES_PATH = PosixPath('assets/lookups/feature_extraction_headers.csv')
SIMBA_NO_ANIMALS_PATH = PosixPath('pose_configurations/no_animals/no_animals.csv')
SIMBA_SHAP_CATEGORIES_PATH = PosixPath('assets/shap/feature_categories/shap_feature_categories.csv')
SIMBA_SHAP_IMG_PATH = PosixPath('assets/shap')
SINGLE_CLF_VALIDATION = PosixPath('frames/output/validation')
SKLEARN_RESULTS = PosixPath('frames/output/sklearn_results')
SPLASH_PATH_LINUX = PosixPath('assets/img/splash.PNG')
SPLASH_PATH_MOVIE = PosixPath('assets/img/splash_2024.mp4')
SPLASH_PATH_WINDOWS = PosixPath('assets/img/splash.png')
SPONTANEOUS_ALTERNATION_VIDEOS_DIR = PosixPath('frames/output/spontanous_alternation')
TARGETS_INSERTED_DIR = PosixPath('csv/targets_inserted')
TEST_PATH = '/Users/simon/Desktop/envs/simba_dev/simba/'
UNSUPERVISED_MODEL_NAMES = PosixPath('assets/lookups/model_names.parquet')
VIDEO_INFO = PosixPath('logs/video_info.csv')
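
Most members above store relative paths. A minimal sketch of joining such a value onto a root directory follows. The enum `SketchPaths`, the helper `resolve`, and the project directory are hypothetical stand-ins; only the two path values are copied from the listing, and the assumption that they are relative to a project folder is not confirmed by it.

```python
import os
from enum import Enum
from pathlib import PurePosixPath


class SketchPaths(Enum):
    # Hypothetical stand-in enum; the two values are copied from the listing above.
    VIDEO_INFO = PurePosixPath("logs/video_info.csv")
    MACHINE_RESULTS_DIR = PurePosixPath("csv/machine_results")


def resolve(project_dir: str, member: SketchPaths) -> str:
    """Join a root directory with the relative path stored in an enum member."""
    return os.path.join(project_dir, str(member.value))


# The project directory below is made up for illustration.
print(resolve("/data/project_folder", SketchPaths.VIDEO_INFO))
```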
class simba.utils.enums.TagNames(value)[source]

Bases: Enum

An enumeration.

CLASS_INIT = 'CLASS_INIT'
COMPLETE = 'complete'
ERROR = 'error'
GREETING = 'greeting'
STANDARD = 'standard'
TRASH = 'trash'
WARNING = 'warning'
class simba.utils.enums.TestPaths(value)[source]

Bases: Enum

An enumeration.

CRITICAL_VALUES = '../simba/assets/lookups/critical_values_05.pickle'
class simba.utils.enums.TextOptions(value)[source]

Bases: Enum

An enumeration.

BORDER_BUFFER_X = 5
BORDER_BUFFER_Y = 10
COLOR = (147, 20, 255)
FIRST_LINE_SPACING = 2
FONT = 0
FONT_SCALER = 0.8
LINE_SPACING = 1
LINE_THICKNESS = 2
RADIUS_SCALER = 10
RESOLUTION_SCALER = 1500
SPACE_SCALER = 25
TEXT_THICKNESS = 1
class simba.utils.enums.UMAPParam(value)[source]

Bases: Enum

An enumeration.

HYPERPARAMETERS = ['n_neighbors', 'min_distance', 'spread', 'scaler', 'variance']
MIN_DISTANCE = 'min_distance'
N_NEIGHBORS = 'n_neighbors'
SCALER = 'scaler'
SPREAD = 'spread'
VARIANCE = 'variance'

Errors

exception simba.utils.errors.AdvancedLabellingError(frame: str, lbl_lst: list, unlabel_lst: list, source: str = '', show_window: bool = False)[source]

Bases: SimbaError

exception simba.utils.errors.AnimalNumberError(msg: str, source: str = '', show_window: bool = False)[source]

Bases: SimbaError

exception simba.utils.errors.AnnotationFileNotFoundError(video_name: str, source: str = '', show_window: bool = False)[source]

Bases: SimbaError

exception simba.utils.errors.ArrayError(msg: str, source: str = '', show_window: bool = False)[source]

Bases: SimbaError

exception simba.utils.errors.BodypartColumnNotFoundError(msg: str, source: str = '', show_window: bool = False)[source]

Bases: SimbaError

exception simba.utils.errors.ClassifierInferenceError(msg: str, source: str = '', show_window: bool = False)[source]

Bases: SimbaError

exception simba.utils.errors.ColumnNotFoundError(column_name: str, file_name: str, source: str = '', show_window: bool = False)[source]

Bases: SimbaError

exception simba.utils.errors.CorruptedFileError(msg: str, source: str = '', show_window: bool = False)[source]

Bases: SimbaError

exception simba.utils.errors.CountError(msg: str, source: str = '', show_window: bool = False)[source]

Bases: SimbaError

exception simba.utils.errors.DataHeaderError(msg: str, source: str = '', show_window: bool = False)[source]

Bases: SimbaError

exception simba.utils.errors.DirectoryExistError(msg: str, source: str = '', show_window: bool = False)[source]

Bases: SimbaError

exception simba.utils.errors.DirectoryNotEmptyError(msg: str, source: str = '', show_window: bool = False)[source]

Bases: SimbaError

exception simba.utils.errors.DuplicationError(msg: str, source: str = '', show_window: bool = False)[source]

Bases: SimbaError

exception simba.utils.errors.FFMPEGCodecGPUError(msg: str, source: str = '', show_window: bool = False)[source]

Bases: SimbaError

exception simba.utils.errors.FFMPEGNotFoundError(msg: str, source: str = '', show_window: bool = False)[source]

Bases: SimbaError

exception simba.utils.errors.FaultyTrainingSetError(msg: str, source: str = '', show_window: bool = False)[source]

Bases: SimbaError

exception simba.utils.errors.FeatureNumberMismatchError(msg: str, source: str = '', show_window: bool = False)[source]

Bases: SimbaError

exception simba.utils.errors.FileExistError(msg: str, source: str = '', show_window: bool = False)[source]

Bases: SimbaError

exception simba.utils.errors.FloatError(msg: str, source: str = '', show_window: bool = False)[source]

Bases: SimbaError

exception simba.utils.errors.FrameRangeError(msg: str, source: str = '', show_window: bool = False)[source]

Bases: SimbaError

exception simba.utils.errors.IntegerError(msg: str, source: str = '', show_window: bool = False)[source]

Bases: SimbaError

exception simba.utils.errors.InvalidFileTypeError(msg: str, source: str = '', show_window: bool = False)[source]

Bases: SimbaError

exception simba.utils.errors.InvalidFilepathError(msg: str, source: str = '', show_window: bool = False)[source]

Bases: SimbaError

exception simba.utils.errors.InvalidHyperparametersFileError(msg: str, source: str = '', show_window: bool = False)[source]

Bases: SimbaError

exception simba.utils.errors.InvalidInputError(msg: str, source: str = '', show_window: bool = False)[source]

Bases: SimbaError

exception simba.utils.errors.InvalidVideoFileError(msg: str, source: str = '', show_window: bool = False)[source]

Bases: SimbaError

exception simba.utils.errors.MissingColumnsError(msg: str, source: str = '', show_window: bool = False)[source]

Bases: SimbaError

exception simba.utils.errors.MissingProjectConfigEntryError(msg: str, source: str = '', show_window: bool = False)[source]

Bases: SimbaError

exception simba.utils.errors.MixedMosaicError(msg: str, source: str = '', show_window: bool = False)[source]

Bases: SimbaError

exception simba.utils.errors.NoChoosenClassifierError(source: str = '', show_window: bool = False)[source]

Bases: SimbaError

exception simba.utils.errors.NoChoosenMeasurementError(source: str = '', show_window: bool = False)[source]

Bases: SimbaError

exception simba.utils.errors.NoChoosenROIError(source: str = '', show_window: bool = False)[source]

Bases: SimbaError

exception simba.utils.errors.NoDataError(msg: str, source: str = '', show_window: bool = False)[source]

Bases: SimbaError

exception simba.utils.errors.NoFilesFoundError(msg: str, source: str = '', show_window: bool = False)[source]

Bases: SimbaError

exception simba.utils.errors.NoROIDataError(msg: str, source: str = '', show_window: bool = False)[source]

Bases: SimbaError

exception simba.utils.errors.NoSpecifiedOutputError(msg: str, source: str = '', show_window: bool = True)[source]

Bases: SimbaError

exception simba.utils.errors.NotDirectoryError(msg: str, source: str = '', show_window: bool = False)[source]

Bases: SimbaError

exception simba.utils.errors.ParametersFileError(msg: str, source: str = '', show_window: bool = False)[source]

Bases: SimbaError

exception simba.utils.errors.PermissionError(msg: str, source: str = '', show_window: bool = False)[source]

Bases: SimbaError

exception simba.utils.errors.ROICoordinatesNotFoundError(expected_file_path: str, source: str = '', show_window: bool = False)[source]

Bases: SimbaError

exception simba.utils.errors.SamplingError(msg: str, source: str = '', show_window: bool = False)[source]

Bases: SimbaError

exception simba.utils.errors.SimbaError(msg: str, source: str = ' ', show_window: bool = False)[source]

Bases: Exception

print_and_log_error()[source]
exception simba.utils.errors.StringError(msg: str, source: str = '', show_window: bool = False)[source]

Bases: SimbaError

exception simba.utils.errors.ThirdPartyAnnotationEventCountError(video_name: str, clf_name: str, start_event_cnt: int, stop_event_cnt: int, source: str = '', show_window: bool = False)[source]

Bases: SimbaError

exception simba.utils.errors.ThirdPartyAnnotationFileNotFoundError(video_name: str, source: str = '', show_window: bool = False)[source]

Bases: SimbaError

exception simba.utils.errors.ThirdPartyAnnotationOverlapError(video_name: str, clf_name: str, source: str = '', show_window: bool = False)[source]

Bases: SimbaError

exception simba.utils.errors.ThirdPartyAnnotationsAdditionalClfError(video_name: str, clf_names: list, source: str = '', show_window: bool = False)[source]

Bases: SimbaError

exception simba.utils.errors.ThirdPartyAnnotationsClfMissingError(video_name: str, clf_name: str, source: str = '', show_window: bool = False)[source]

Bases: SimbaError

exception simba.utils.errors.ThirdPartyAnnotationsFpsConflictError(video_name: str, annotation_fps: int, video_fps: int, source: str = '', show_window: bool = False)[source]

Bases: SimbaError

exception simba.utils.errors.ThirdPartyAnnotationsMissingAnnotationsError(video_name: str, clf_names: list, source: str = '', show_window: bool = False)[source]

Bases: SimbaError

exception simba.utils.errors.ThirdPartyAnnotationsOutsidePoseEstimationDataError(video_name: str, frm_cnt: int, clf_name: Optional[str] = None, annotation_frms: Optional[int] = None, first_error_frm: Optional[int] = None, ambiguous_cnt: Optional[int] = None, source: str = '', show_window: bool = False)[source]

Bases: SimbaError
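
Every exception listed above derives from SimbaError, so a caller can handle any of them through the base class. The sketch below illustrates that pattern with hypothetical stand-in classes (`SketchSimbaError` and its subclasses) that copy only the `msg`, `source`, and `show_window` arguments from the signatures; it does not reproduce any printing or logging the real classes may perform.

```python
class SketchSimbaError(Exception):
    """Hypothetical stand-in for a shared base class such as SimbaError."""

    def __init__(self, msg: str, source: str = "", show_window: bool = False):
        super().__init__(msg)
        self.msg = msg
        self.source = source
        self.show_window = show_window


class SketchNoFilesFoundError(SketchSimbaError):
    """Hypothetical subclass, mirroring the 'Bases: SimbaError' pattern."""


class SketchFrameRangeError(SketchSimbaError):
    """Second hypothetical subclass."""


def run_step(make_error) -> str:
    """Raise the error built by make_error and handle it via the base class."""
    try:
        raise make_error()
    except SketchSimbaError as err:
        return f"{type(err).__name__}: {err.msg} (source={err.source})"


print(run_step(lambda: SketchNoFilesFoundError(msg="no videos", source="demo")))
```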

Lookups

class simba.utils.lookups.SharedCounter(initval=0)[source]

Bases: object

Counter that can be shared across processes on different cores

increment()[source]
value()[source]
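
One common way to build a counter that several processes can update is a `multiprocessing.Value` guarded by a `multiprocessing.Lock`. The sketch below follows that approach; the class name `SketchSharedCounter` is hypothetical, and whether SharedCounter is implemented this way is an assumption.

```python
import multiprocessing


class SketchSharedCounter:
    """Hypothetical process-safe counter built on shared memory and a lock."""

    def __init__(self, initval: int = 0):
        self._val = multiprocessing.Value("i", initval)  # shared C int
        self._lock = multiprocessing.Lock()

    def increment(self) -> None:
        # The lock makes the read-modify-write step atomic across processes.
        with self._lock:
            self._val.value += 1

    def value(self) -> int:
        with self._lock:
            return self._val.value


if __name__ == "__main__":
    counter = SketchSharedCounter(initval=0)
    for _ in range(3):
        counter.increment()
    print(counter.value())
```

To share such a counter, it would typically be passed to worker processes when they are created, for example as an argument to `multiprocessing.Process`.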
simba.utils.lookups.cardinality_to_integer_lookup() Dict[str, int][source]

Create dictionary that maps cardinal compass directions to integers.

Example

>>> data = ["N", "NE", "E", "SE", "S", "SW", "W", "NW"]
>>> [cardinality_to_integer_lookup()[d] for d in data]
>>> [0, 1, 2, 3, 4, 5, 6, 7]
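
A minimal sketch of such a lookup, assuming the integers follow the clockwise order shown in the example (N=0 through NW=7). The function names are hypothetical; the second one inverts the mapping.

```python
from typing import Dict, List

# Order taken from the example above; the numbering is an assumption.
COMPASS_POINTS: List[str] = ["N", "NE", "E", "SE", "S", "SW", "W", "NW"]


def sketch_cardinality_to_integer() -> Dict[str, int]:
    """Map each compass direction to its position in COMPASS_POINTS."""
    return {name: idx for idx, name in enumerate(COMPASS_POINTS)}


def sketch_integer_to_cardinality() -> Dict[int, str]:
    """Invert the mapping so integers map back to compass directions."""
    return {idx: name for name, idx in sketch_cardinality_to_integer().items()}


print([sketch_cardinality_to_integer()[d] for d in ["S", "NW", "N"]])
```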
simba.utils.lookups.create_color_palettes(no_animals: int, map_size: int) List[List[int]][source]

Create a list of lists of BGR colors, one list per animal. Each list is pulled from a different matplotlib color map.

Parameters
  • no_animals (int) – Number of different palette lists

  • map_size (int) – Number of colors in each created palette.

Return List[List[int]]

BGR colors

Example

>>> create_color_palettes(no_animals=2, map_size=2)
>>> [[[255.0, 0.0, 255.0], [0.0, 255.0, 255.0]], [[102.0, 127.5, 0.0], [102.0, 255.0, 255.0]]]
simba.utils.lookups.get_body_part_configurations() Dict[str, Union[str, PathLike]][source]

Return dict with named body-part schematics of pose-estimation schemas in SimBA installation as keys, and paths to the images representing those body-part schematics as values.

simba.utils.lookups.get_bp_config_code_class_pairs() Dict[str, object][source]

Helper to match SimBA project_config.ini [create ensemble settings][pose_estimation_body_parts] setting to feature extraction module class.

simba.utils.lookups.get_bp_config_codes() Dict[str, str][source]

Helper to match SimBA project_config.ini [create ensemble settings][pose_estimation_body_parts] to string names.

simba.utils.lookups.get_categorical_palettes()[source]
simba.utils.lookups.get_cmaps() List[str][source]

Get list of named matplotlib color palettes.

simba.utils.lookups.get_color_dict() Dict[str, Tuple[int]][source]

Get dict of color names as keys and RGB tuples as values

simba.utils.lookups.get_emojis() Dict[str, str][source]

Helper to get dictionary of emojis with names as keys and emojis as values. Note, the same emojis are represented differently in different python versions.

simba.utils.lookups.get_ffmpeg_crossfade_methods()[source]
simba.utils.lookups.get_icons_paths() Dict[str, Union[str, PathLike]][source]

Helper to get dictionary with icons with the icon names as keys (grabbed from file-name) and their file paths as values.

simba.utils.lookups.get_log_config()[source]
simba.utils.lookups.get_meta_data_file_headers() List[str][source]

Get List of headers for SimBA classifier metadata output.

Return List[str]

simba.utils.lookups.get_named_colors() List[str][source]

Get list of named matplotlib colors.

simba.utils.lookups.get_third_party_appender_file_formats() Dict[str, str][source]

Helper to get dictionary that maps different third-party annotation tools with different file formats.

simba.utils.lookups.integer_to_cardinality_lookup()[source]

Create dictionary that maps integers to cardinal compass directions.

simba.utils.lookups.percent_to_crf_lookup() Dict[str, int][source]

Create dictionary that matches human-readable percent values to FFmpeg Constant Rate Factor (CRF) values that regulate video quality in CPU codecs. Higher CRF values translate to lower video quality and reduced file sizes.

simba.utils.lookups.percent_to_qv_lk()[source]

Create dictionary that matches human-readable percent values to FFmpeg quality scores that regulate video quality in CPU codecs. Higher FFmpeg quality scores map to smaller, lower-quality videos. Used in some AVI codecs such as ‘divx’ and ‘mjpeg’.

simba.utils.lookups.video_quality_to_preset_lookup() Dict[str, str][source]

Create dictionary that matches human-readable video quality settings to FFmpeg presets for GPU codecs.

Printing

class simba.utils.printing.SimbaTimer(start: bool = False)[source]

Bases: object

Timer class for keeping track of start and end-times of calls

start_timer()[source]
stop_timer()[source]
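
A minimal sketch of a start/stop timer with the same constructor argument and method names. The class name `SketchTimer` and the attributes `start_time` and `elapsed_time` are hypothetical and may not match SimbaTimer's own attributes.

```python
import time
from typing import Optional


class SketchTimer:
    """Hypothetical timer that records the elapsed time between start and stop."""

    def __init__(self, start: bool = False):
        self.start_time: Optional[float] = None
        self.elapsed_time: Optional[float] = None
        if start:
            self.start_timer()

    def start_timer(self) -> None:
        self.start_time = time.perf_counter()

    def stop_timer(self) -> float:
        if self.start_time is None:
            raise RuntimeError("Timer was never started.")
        self.elapsed_time = time.perf_counter() - self.start_time
        return self.elapsed_time


timer = SketchTimer(start=True)
sum(range(100_000))  # some work to time
print(f"elapsed: {timer.stop_timer():.4f}s")
```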
simba.utils.printing.log_event(logger_name: str, log_type: typing_extensions.Literal['CLASS_INIT', 'error', 'warning'], msg: str)[source]
simba.utils.printing.perform_timing(func)[source]
simba.utils.printing.stdout_success(msg: str, source: Optional[str] = '', elapsed_time: Optional[str] = None) None[source]

Helper to parse msg of completed operation to SimBA main interface.

Parameters
  • msg (str) – Message to be parsed.

  • source (Optional[str]) – Optional string indicating the source method or function of the msg for logging.

  • elapsed_time (Optional[str]) – Optional string indicating the runtime of the completed operation.

Return None

simba.utils.printing.stdout_trash(msg: str, source: Optional[str] = '', elapsed_time: Optional[str] = None) None[source]

Helper to parse msg of delete operation to SimBA main interface.

Parameters
  • msg (str) – Message to be parsed.

  • source (Optional[str]) – Optional string indicating the source method or function of the operation for logging.

  • elapsed_time (Optional[str]) – Optional string indicating the runtime.

Return None

simba.utils.printing.stdout_warning(msg: str, elapsed_time: Optional[str] = None) None[source]

Helper to parse warning msg to SimBA main interface.

Parameters
  • msg (str) – Message to be parsed.

  • source (Optional[str]) – Optional string indicating the source method or function of the msg for logging.

  • elapsed_time (Optional[str]) – Optional string indicating the runtime.

Return None

Reading and writing

simba.utils.read_write.archive_processed_files(config_path: Union[str, PathLike], archive_name: str) None[source]

Archive files within a SimBA project.

Parameters
  • config_path (str) – Path to SimBA project project_config.ini.

  • archive_name (str) – Name of archive.

See also

Tutorial

Example

>>> archive_processed_files(config_path='project_folder/project_config.ini', archive_name='my_archive')
simba.utils.read_write.check_if_hhmmss_timestamp_is_valid_part_of_video(timestamp: str, video_path: Union[str, PathLike]) None[source]

Helper to check that a timestamp in HH:MM:SS format is a valid timestamp in a video file.

Parameters
  • timestamp (str) – Timestamp in HH:MM:SS format.

  • video_path (str) – Path to a video file.

Raises

FrameRangeError – If timestamp is not in the video file. E.g., timestamp 00:01:00 will raise FrameRangeError if the video is 59s long.

Example

>>> check_if_hhmmss_timestamp_is_valid_part_of_video(timestamp='01:00:05', video_path='/Users/simon/Desktop/video_tests/Together_1.avi')
>>> "FrameRangeError: The timestamp '01:00:05' does not occur in video Together_1.avi, the video has length 10s"
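
The core of this check can be sketched as converting the timestamp to seconds and comparing it with the video length. The helpers below are hypothetical and take the video length in seconds as an argument instead of reading it from a file; treating a timestamp equal to the video length as valid is an assumption.

```python
def hhmmss_to_seconds(timestamp: str) -> int:
    """Convert an 'HH:MM:SS' string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in timestamp.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def timestamp_in_video(timestamp: str, video_length_s: float) -> bool:
    """Return True if the timestamp falls within a video of the given length."""
    return hhmmss_to_seconds(timestamp) <= video_length_s


# Mirrors the case described above: 00:01:00 is outside a 59 s video.
print(timestamp_in_video("00:01:00", video_length_s=59))
```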
simba.utils.read_write.clean_sleap_file_name(filename: str) str[source]

Clean a SLEAP input filename by removing the ‘.analysis’ suffix and the project name prefix to match the original video name.

Note

Modified from vtsai881.

Parameters

filename (str) – The original filename to be cleaned to match video name.

Returns str

The cleaned filename.

Example

>>> clean_sleap_file_name("projectname.v00x.00x_videoname.analysis.csv")
>>> 'videoname.csv'
>>> clean_sleap_file_name("projectname.v00x.00x_videoname.analysis.h5")
>>> 'videoname.h5'
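
A rough sketch of string handling that reproduces the two examples above. The function name is hypothetical, and the parsing rule (drop the trailing '.analysis', then keep what follows the first underscore in the last dot-separated part) is an assumption that would not hold for video names containing dots.

```python
import os


def sketch_clean_sleap_file_name(filename: str) -> str:
    """Strip a 'project.vNNN.NNN_' prefix and '.analysis' suffix from a file name."""
    stem, ext = os.path.splitext(filename)
    if not stem.lower().endswith(".analysis"):
        return filename  # not a SLEAP analysis file name; leave untouched
    stem = stem[: -len(".analysis")]
    last_part = stem.split(".")[-1]  # e.g. '00x_videoname'
    if "_" not in last_part:
        return filename
    return last_part.split("_", 1)[1] + ext


print(sketch_clean_sleap_file_name("projectname.v00x.00x_videoname.analysis.csv"))
```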
simba.utils.read_write.clean_sleap_filenames_in_directory(dir: Union[str, PathLike]) None[source]

Clean up SLEAP input filenames in the specified directory by removing a prefix and a suffix, and renaming the files to match the names of the original video files.

Note

Modified from vtsai881.

Parameters

dir (Union[str, os.PathLike]) – The directory path where the SLEAP CSV or H5 files are located.

Example

>>> clean_sleap_filenames_in_directory(dir='/Users/simon/Desktop/envs/troubleshooting/Hornet_SLEAP/import/')
simba.utils.read_write.concatenate_videos_in_folder(in_folder: Union[str, PathLike], save_path: Union[str, PathLike], file_paths: Optional[List[Union[str, PathLike]]] = None, video_format: Optional[str] = 'mp4', substring: Optional[str] = None, remove_splits: Optional[bool] = True, gpu: Optional[bool] = False, verbose: Optional[bool] = True) None[source]

Concatenate (temporally) all video files in a folder into a single video.

Important

Input video parts will be joined in alphanumeric order, so the files should ideally have sequentially numbered file names, e.g., 1.mp4, 2.mp4….

Note

If substring and file_paths are both not None, then file_paths will be filtered and only file paths containing substring will be retained.

Parameters
  • in_folder (Union[str, os.PathLike]) – Path to folder holding un-concatenated video files.

  • save_path (Union[str, os.PathLike]) – Path where the output file is saved. Note: If the path exists, it will be overwritten.

  • file_paths (Optional[List[Union[str, os.PathLike]]]) – If not None, then the files that should be joined. If None, then all files. Default None.

  • video_format (Optional[str]) – Format of the input video clips in in_folder that should be concatenated. Default: mp4.

  • substring (Optional[str]) – If a string, then only videos in in_folder with a filename that contains substring will be joined. If None, then all are joined. Default: None.

  • remove_splits (Optional[bool]) – If true, the input splits in the in_folder will be removed following concatenation. Default: True.
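
The ordering and substring filtering described above can be sketched as follows. The helper names are hypothetical, and reading "alphanumeric order" as a natural sort (so that 2.mp4 precedes 10.mp4) is an assumption. The output uses the `file '<path>'` line format read by FFmpeg's concat demuxer; whether SimBA joins the clips through that demuxer is not stated here.

```python
import re
from typing import List, Optional


def natural_sort_key(file_name: str) -> list:
    """Split a name into text and integer tokens so '2.mp4' sorts before '10.mp4'."""
    return [int(tok) if tok.isdigit() else tok.lower() for tok in re.split(r"(\d+)", file_name)]


def build_concat_list(file_names: List[str], substring: Optional[str] = None) -> str:
    """Filter by substring (if given), order the names, and format one per line."""
    if substring is not None:
        file_names = [f for f in file_names if substring in f]
    ordered = sorted(file_names, key=natural_sort_key)
    return "\n".join(f"file '{name}'" for name in ordered)


print(build_concat_list(["10.mp4", "2.mp4", "1.mp4"]))
```

A list written to a text file in this format could then be passed to FFmpeg with `-f concat -i <list file>`.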

simba.utils.read_write.convert_csv_to_parquet(directory: Union[str, PathLike]) None[source]

Convert all csv files in a folder to parquet format.

Parameters

directory (str) – Path to directory holding csv files.

Raises

NoFilesFoundError – The directory has no csv files.

Examples

>>> convert_csv_to_parquet(directory='project_folder/csv/input_csv')
simba.utils.read_write.convert_parquet_to_csv(directory: str) None[source]

Convert all parquet files in a directory to csv format.

Parameters

directory (str) – Path to directory holding parquet files

Raises

NoFilesFoundError – The directory has no parquet files.

Examples

>>> convert_parquet_to_csv(directory='project_folder/csv/input_csv')
simba.utils.read_write.copy_files_in_directory(in_dir: Union[str, PathLike], out_dir: Union[str, PathLike], raise_error: bool = True, filetype: Optional[str] = None) None[source]

Copy files from the specified input directory to the output directory.

Parameters
  • in_dir (Union[str, os.PathLike]) – The input directory from which files will be copied.

  • out_dir (Union[str, os.PathLike]) – The output directory where files will be copied to.

  • raise_error (bool) – If True, raise an error if no files are found in the input directory. Default is True.

  • filetype (Optional[str]) – If specified, only copy files with the given file extension. Default is None, meaning all files will be copied.

Example

>>> copy_files_in_directory('/input_dir', '/output_dir', raise_error=True, filetype='txt')
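
A minimal standard-library sketch of the same idea: collect the files (optionally by extension), raise if none are found, and copy them. The function name is hypothetical, it raises the built-in `FileNotFoundError` as a stand-in for a SimBA error, and unlike the documented function it returns the copied paths for convenience.

```python
import glob
import os
import shutil
import tempfile
from typing import List, Optional


def sketch_copy_files_in_directory(in_dir: str, out_dir: str, raise_error: bool = True,
                                   filetype: Optional[str] = None) -> List[str]:
    """Copy files from in_dir to out_dir, optionally restricted to one extension."""
    pattern = f"*.{filetype}" if filetype is not None else "*"
    file_paths = sorted(p for p in glob.glob(os.path.join(in_dir, pattern)) if os.path.isfile(p))
    if not file_paths and raise_error:
        raise FileNotFoundError(f"No files matching {pattern} found in {in_dir}")
    os.makedirs(out_dir, exist_ok=True)
    return [shutil.copy(p, out_dir) for p in file_paths]


def demo() -> List[str]:
    """Copy only the .txt files between two temporary directories."""
    with tempfile.TemporaryDirectory() as root:
        in_dir, out_dir = os.path.join(root, "in"), os.path.join(root, "out")
        os.makedirs(in_dir)
        for name in ("a.txt", "b.txt", "c.csv"):
            with open(os.path.join(in_dir, name), "w") as f:
                f.write("x")
        sketch_copy_files_in_directory(in_dir, out_dir, filetype="txt")
        return sorted(os.listdir(out_dir))


print(demo())
```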
simba.utils.read_write.copy_files_to_directory(file_paths: List[Union[str, PathLike]], dir: Union[str, PathLike], verbose: Optional[bool] = True, integer_save_names: Optional[bool] = False) List[Union[str, PathLike]][source]

Copy a list of files to a specified directory.

Parameters
  • file_paths (List[Union[str, os.PathLike]]) – List of paths to the files to be copied.

  • dir (Union[str, os.PathLike]) – Path to the directory where files will be copied.

  • verbose (Optional[bool]) – If True, prints progress information. Default True.

  • integer_save_names (Optional[bool]) – If True, saves files with integer names. E.g., file one in file_paths will be saved as dir/0.

Return List[Union[str, os.PathLike]]

List of paths to the copied files

simba.utils.read_write.copy_multiple_videos_to_project(config_path: Union[str, PathLike], source: Union[str, PathLike], file_type: str, symlink: Optional[bool] = False, allowed_video_formats: Optional[Tuple[str]] = ('avi', 'mp4')) None[source]

Import directory of videos to SimBA project.

Parameters
  • config_path (Union[str, os.PathLike]) – Path to SimBA project config file in Configparser format.

  • source (Union[str, os.PathLike]) – Path to directory with video files outside SimBA project.

  • file_type (str) – Video format of imported videos (e.g., mp4 or avi).

  • symlink (Optional[bool]) – If True, creates soft copies rather than hard copies. Default: False.

  • allowed_video_formats (Optional[Tuple[str]]) – Allowed video formats. DEFAULT: avi or mp4

simba.utils.read_write.copy_single_video_to_project(simba_ini_path: Union[str, PathLike], source_path: Union[str, PathLike], symlink: bool = False, allowed_video_formats: Optional[Tuple[str]] = ('avi', 'mp4'), overwrite: Optional[bool] = False) None[source]

Import single video file to SimBA project

Parameters
  • simba_ini_path (Union[str, os.PathLike]) – path to SimBA project config file in Configparser format

  • source_path (Union[str, os.PathLike]) – Path to video file outside SimBA project.

  • symlink (Optional[bool]) – If True, creates soft copy rather than hard copy. Default: False.

  • allowed_video_formats (Optional[Tuple[str]]) – Allowed video formats. DEFAULT: avi or mp4

  • overwrite (Optional[bool]) – If True, overwrites existing video if it exists in SimBA project. Else, raise FileExistError.

simba.utils.read_write.create_directory(path: Union[str, PathLike])[source]
simba.utils.read_write.drop_df_fields(data: DataFrame, fields: List[str], raise_error: Optional[bool] = False) DataFrame[source]

Drops specified fields in dataframe.

Parameters
  • data (pd.DataFrame) – Data in pandas format.

  • fields (List[str]) – Columns to drop.

Return pd.DataFrame

simba.utils.read_write.find_all_videos_in_directory(directory: Union[str, PathLike], as_dict: Optional[bool] = False, raise_error: bool = False, video_formats: Optional[Tuple[str]] = ('.avi', '.mp4', '.mov', '.flv', '.m4v')) Union[dict, list][source]

Get all video file paths within a directory

Parameters
  • directory (str) – Directory to search for video files.

  • as_dict (bool) – If True, returns dictionary with the video name as key and file path as value.

  • raise_error (bool) – If True, raise error if no videos are found. Else, issue a NoFileFoundWarning.

  • video_formats (Tuple[str]) – Acceptable video formats. Default: ‘.avi’, ‘.mp4’, ‘.mov’, ‘.flv’, ‘.m4v’.

Return List[str] or Dict[str, str]

Raises

NoFilesFoundError – If raise_error is True and directory has no files in formats video_formats.

Examples

>>> find_all_videos_in_directory(directory='project_folder/videos')
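
A minimal sketch of the extension-based search and the `as_dict` switch. The function name is hypothetical, and returning file names (rather than full paths) in list mode is an assumption about the return format.

```python
import os
import tempfile
from typing import Dict, List, Tuple, Union


def sketch_find_videos(directory: str, as_dict: bool = False,
                       video_formats: Tuple[str, ...] = (".avi", ".mp4", ".mov", ".flv", ".m4v")
                       ) -> Union[List[str], Dict[str, str]]:
    """List files in directory whose extension is one of video_formats."""
    found = sorted(f for f in os.listdir(directory)
                   if os.path.splitext(f)[1].lower() in video_formats)
    if as_dict:
        # Key: file name without extension. Value: path to the file.
        return {os.path.splitext(f)[0]: os.path.join(directory, f) for f in found}
    return found


def demo() -> Tuple[List[str], List[str]]:
    """Create two video-named files and one text file, then search the directory."""
    with tempfile.TemporaryDirectory() as d:
        for name in ("b.mp4", "a.avi", "notes.txt"):
            open(os.path.join(d, name), "w").close()
        return sketch_find_videos(d), sorted(sketch_find_videos(d, as_dict=True))


print(demo())
```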
simba.utils.read_write.find_all_videos_in_project(videos_dir: Union[str, PathLike], basename: Optional[bool] = False) List[str][source]

Get filenames of .avi and .mp4 files within a directory

Parameters
  • videos_dir (str) – Directory holding video files.

  • basename (bool) – If true returns basenames, else file paths.

Example

>>> find_all_videos_in_project(videos_dir='project_folder/videos')
>>> ['project_folder/videos/Together_2.avi', 'project_folder/videos/Together_3.avi', 'project_folder/videos/Together_1.avi']
simba.utils.read_write.find_core_cnt() Tuple[int, int][source]

Find the local CPU count and a quarter of the CPU count.

Return int

The local cpu count

Return int

The local cpu count // 4

Example

>>> find_core_cnt()
>>> (8, 2)
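
The two returned values can be sketched with the standard library as shown below. The function name is hypothetical, and using `multiprocessing.cpu_count()` as the source of the count is an assumption.

```python
import multiprocessing
from typing import Tuple


def sketch_find_core_cnt() -> Tuple[int, int]:
    """Return the local CPU count and the count floor-divided by four."""
    cpu_cnt = multiprocessing.cpu_count()
    return cpu_cnt, cpu_cnt // 4


print(sketch_find_core_cnt())
```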
simba.utils.read_write.find_files_of_filetypes_in_directory(directory: str, extensions: list, raise_warning: Optional[bool] = True, raise_error: Optional[bool] = False) List[str][source]

Find all files in a directory of specified extensions/types.

Parameters
  • directory (str) – Directory holding files.

  • extensions (List[str]) – Accepted file extensions.

  • raise_warning (bool) – If True, raise warning if no files are found.

Return List[str]

All files in directory with extensions.

Example

>>> find_files_of_filetypes_in_directory(directory='project_folder/videos', extensions=['mp4', 'avi', 'png'], raise_warning=False)
simba.utils.read_write.find_max_vertices_coordinates(shapes: List[Union[Polygon, LineString, MultiPolygon, Point]], buffer: Optional[int] = None) Tuple[int, int][source]

Find the maximum x and y coordinates among the vertices of a list of Shapely geometries.

Can be useful for plotting purposes, to determine the required size of the canvas to fit all geometries.

Parameters
  • shapes (List[Union[Polygon, LineString, MultiPolygon, Point]]) – A list of Shapely geometries including Polygons, LineStrings, MultiPolygons, and Points.

  • buffer (Optional[int]) – If int, adds to maximum x and y.

Returns Tuple[int, int]

A tuple containing the maximum x and y coordinates found among the vertices.

Example

>>> polygon = Polygon([(0, 0), (1, 0), (1, 1), (0, 1)])
>>> line = LineString([(1, 1), (2, 2), (3, 1), (4, 0)])
>>> multi_polygon = MultiPolygon([Polygon([(0, 0), (1, 0), (1, 1), (0, 1)]), Polygon([(1, 1), (2, 1), (2, 2), (1, 2)])])
>>> point = Point(3, 4)
>>> find_max_vertices_coordinates([polygon, line, multi_polygon, point])
>>> (4, 4)
simba.utils.read_write.find_time_stamp_from_frame_numbers(start_frame: int, end_frame: int, fps: float) List[str][source]

Given start and end frame numbers and frames per second (fps), return a list of formatted time stamps corresponding to the frame range start and end time.

Parameters
  • start_frame (int) – The starting frame index.

  • end_frame (int) – The ending frame index.

  • fps (float) – Frames per second.

Return List[str]

A list of time stamps in the format ‘HH:MM:SS:MS’.

Example

>>> find_time_stamp_from_frame_numbers(start_frame=11, end_frame=20, fps=3.4)
>>> ['00:00:03:235', '00:00:05:882']
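
The conversion can be sketched as dividing each frame number by the fps and formatting the result as HH:MM:SS:MS. The helper names are hypothetical, and truncating (rather than rounding) to whole milliseconds is an assumption; it reproduces the example above.

```python
from typing import List


def frame_to_timestamp(frame: int, fps: float) -> str:
    """Format the time at which a frame occurs as 'HH:MM:SS:MS'."""
    total_ms = int(frame / fps * 1000)  # truncate to whole milliseconds
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    seconds, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}:{ms:03d}"


def sketch_find_time_stamps(start_frame: int, end_frame: int, fps: float) -> List[str]:
    """Return the start and end time stamps of a frame range."""
    return [frame_to_timestamp(start_frame, fps), frame_to_timestamp(end_frame, fps)]


print(sketch_find_time_stamps(start_frame=11, end_frame=20, fps=3.4))
```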
simba.utils.read_write.find_video_of_file(video_dir: Union[str, PathLike], filename: str, raise_error: bool = False) Union[str, PathLike][source]

Helper to find the video file within the SimBA project that represents a known data file path.

Parameters
  • video_dir (str) – Directory holding putative video file.

  • filename (str) – Data file name, e.g., Video_1.

  • raise_error (bool) – If True, raise error if no file can be found. Else, print warning. Default: False

Return str

Video path.

Raises

NoFilesFoundError – No video file representing file found.

Examples

>>> find_video_of_file(video_dir='project_folder/videos', filename='Together_1')
>>> 'project_folder/videos/Together_1.avi'
simba.utils.read_write.get_all_clf_names(config: ConfigParser, target_cnt: int) List[str][source]

Get all classifier names in a SimBA project.

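Parameters
  • config (ConfigParser) – Parsed SimBA project_config.ini.

  • target_cnt (int) – Number of classifiers in the SimBA project.
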
Return List[str]

Classifier model names

Example

>>> get_all_clf_names(config=config, target_cnt=2)
>>> ['Attack', 'Sniffing']
simba.utils.read_write.get_bp_headers(body_parts_lst: List[str]) list[source]

Helper to create ordered list of all column header fields from body-part names for SimBA project dataframes.

Parameters

body_parts_lst (List[str]) – Body-part names in the SimBA project

Return List[str]

Body-part headers

Example

>>> get_bp_headers(body_parts_lst=['Nose'])
>>> ['Nose_x', 'Nose_y', 'Nose_p']
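
A minimal sketch of the header expansion, assuming each body-part contributes `_x`, `_y`, and `_p` columns in that order, as in the example above. The function name is hypothetical.

```python
from typing import List


def sketch_get_bp_headers(body_parts_lst: List[str]) -> List[str]:
    """Expand each body-part name into its x, y, and p column headers."""
    headers = []
    for bp in body_parts_lst:
        headers.extend([f"{bp}_x", f"{bp}_y", f"{bp}_p"])
    return headers


print(sketch_get_bp_headers(["Nose", "Tail"]))
```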
simba.utils.read_write.get_file_name_info_in_directory(directory: Union[str, PathLike], file_type: str) Dict[str, str][source]

Get dict of all file paths in a directory with specified extension as values and file base names as keys.

Parameters
  • directory (str) – Directory containing files.

  • file_type (str) – File-type in directory of interest

Return dict

All found files as values and file base names as keys.

Example

>>> get_file_name_info_in_directory(directory='C:\project_folder\csv\machine_results', file_type='csv')
>>> {'Video_1': 'C:\project_folder\csv\machine_results\Video_1'}
simba.utils.read_write.get_fn_ext(filepath: ~typing.Union[~os.PathLike, str]) -> (<class 'str'>, <class 'str'>, <class 'str'>)[source]

Split file path into three components: (i) directory, (ii) file name, and (iii) file extension.

Parameters

filepath (str) – Path to file.

Return str

File directory name

Return str

File name

Return str

File extension

Example

>>> get_fn_ext(filepath='C:/My_videos/MyVideo.mp4')
('My_videos', 'MyVideo', '.mp4')
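
The three-way split can be sketched with os.path. This is an assumption about how such a split might be done, not the SimBA source, and split_path_sketch is a hypothetical name. Note that os.path.dirname returns the full parent path, so the directory component here may differ in form from the one shown in the example above.

```python
import os
from typing import Tuple


def split_path_sketch(filepath: str) -> Tuple[str, str, str]:
    """Split a path into (directory, file name without extension, extension)."""
    directory = os.path.dirname(filepath)
    file_name, extension = os.path.splitext(os.path.basename(filepath))
    return directory, file_name, extension


print(split_path_sketch('C:/My_videos/MyVideo.mp4'))
```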
simba.utils.read_write.get_memory_usage_of_df(df: DataFrame) Dict[str, float][source]

Get the RAM memory usage of a dataframe.

Parameters

df (pd.DataFrame) – Parsed dataframe

Return dict

The memory usage of the dataframe in bytes, mb, and gb.

Example

>>> df = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD'))
>>> get_memory_usage_of_df(df=df)
{'bytes': 3328, 'megabytes': 0.003328, 'gigabytes': 3e-06}
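
The figures in the example above are consistent with decimal units (1 megabyte = 1e6 bytes, 1 gigabyte = 1e9 bytes) rounded to six decimals. The sketch below shows only that unit conversion, under that assumption; memory_units_sketch is a hypothetical name, and obtaining the byte count from a dataframe is left out so the block needs no third-party packages.

```python
from typing import Dict


def memory_units_sketch(n_bytes: int) -> Dict[str, float]:
    """Express a byte count in bytes, megabytes, and gigabytes (decimal units)."""
    return {
        'bytes': n_bytes,
        'megabytes': round(n_bytes / 1e6, 6),
        'gigabytes': round(n_bytes / 1e9, 6),
    }


print(memory_units_sketch(3328))
```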
simba.utils.read_write.get_number_of_header_columns_in_df(df: DataFrame) int[source]

Returns the count of non-numerical header rows in dataframe. E.g., can be helpful to determine if dataframe is multi-index columns.

Parameters

df (pd.DataFrame) – Dataframe to check the count of non-numerical header rows for.

Example

>>> get_number_of_header_columns_in_df(df='project_folder/csv/input_csv/Video_1.csv')
3
simba.utils.read_write.get_pkg_version(pkg: str)[source]

Helper to get the version of a package in the current python environment.

Example

>>> get_pkg_version(pkg='simba-uw-tf-dev')
1.82.7
>>> get_pkg_version(pkg='bla-bla')
None
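
Returning None for a package that is not installed, as in the second call above, can be sketched with importlib.metadata from the standard library (Python 3.8+). Whether SimBA uses this module is not stated here, so treat the approach and the name pkg_version_sketch as assumptions.

```python
from importlib import metadata
from typing import Optional


def pkg_version_sketch(pkg: str) -> Optional[str]:
    """Return the installed version string of pkg, or None if it is not installed."""
    try:
        return metadata.version(pkg)
    except metadata.PackageNotFoundError:
        return None
```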
simba.utils.read_write.get_unique_values_in_iterable(data: Iterable, name: Optional[str] = '', min: Optional[int] = 1, max: Optional[int] = None) int[source]

Helper to get and check the number of unique variables in iterable. E.g., check the number of unique identified clusters.

Parameters
  • data (np.ndarray) – 1D iterable.

  • name (Optional[str]) – Arbitrary name of iterable for informative error messaging.

  • min (Optional[int]) – Optional minimum number of unique variables. Default 1.

  • max (Optional[int]) – Optional maximum number of unique variables. Default None.
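
Counting unique values and checking the count against optional bounds can be sketched as follows. The function name, the parameter names min_unique and max_unique (chosen to avoid shadowing Python built-ins), and the use of ValueError are assumptions for illustration; the exception SimBA raises is not documented above.

```python
from typing import Iterable, Optional


def count_unique_sketch(data: Iterable,
                        name: str = '',
                        min_unique: Optional[int] = 1,
                        max_unique: Optional[int] = None) -> int:
    """Return the number of unique values in data, enforcing optional bounds."""
    count = len(set(data))
    if min_unique is not None and count < min_unique:
        raise ValueError(f'{name} has {count} unique values, fewer than the minimum {min_unique}')
    if max_unique is not None and count > max_unique:
        raise ValueError(f'{name} has {count} unique values, more than the maximum {max_unique}')
    return count


# For instance, checking that a clustering produced at least two clusters.
print(count_unique_sketch([0, 1, 1, 2], name='cluster labels', min_unique=2))
```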

simba.utils.read_write.get_video_meta_data(video_path: Union[str, PathLike], fps_as_int: bool = True) dict[source]

Read video metadata (fps, resolution, frame cnt etc.) from video file (e.g., mp4).

Parameters
  • video_path (str) – Path to a video file.

  • fps_as_int (bool) – If True, force video fps to int through floor rounding, else float. Default = True.

Return dict

Video file meta data.

Example

>>> get_video_meta_data('test_data/video_tests/Video_1.avi')
{'video_name': 'Video_1', 'fps': 30, 'width': 400, 'height': 600, 'frame_count': 300, 'resolution_str': '400 x 600', 'video_length_s': 10}
simba.utils.read_write.read_config_entry(config: ConfigParser, section: str, option: str, data_type: str, default_value: Optional[Any] = None, options: Optional[List] = None) Union[float, int, str][source]

Helper to read entry in SimBA project_config.ini parsed by configparser.ConfigParser.

Parameters
  • config (configparser.ConfigParser) – Parsed SimBA project_config.ini. Use simba.utils.read_config_file() to parse file.

  • section (str) – Section name of entry to parse.

  • option (str) – Option name of entry to parse.

  • data_type (str) – Type of data to parse. E.g., str, int, float.

  • default_value (Optional[Any]) – If no matching entry can be found in the project_config.ini, use this as default.

  • options (Optional[List] or None) – List of valid options. If not None, checks that the returned entry value exists in this list.

Return Any

Example

>>> read_config_entry(config='project_folder/project_config.ini', section='General settings', option='project_name', data_type='str')
'two_animals_14_bps'
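
The combination of type conversion, default fallback, and option validation described above can be sketched with configparser. This is an illustration under assumptions: read_entry_sketch is a hypothetical name, and KeyError and ValueError stand in for whatever exceptions SimBA raises.

```python
from configparser import ConfigParser
from typing import Any, List, Optional

CONVERTERS = {'str': str, 'int': int, 'float': float}


def read_entry_sketch(config: ConfigParser, section: str, option: str, data_type: str,
                      default_value: Optional[Any] = None,
                      options: Optional[List] = None) -> Any:
    """Read one entry, convert it to data_type, and validate it against options."""
    if config.has_option(section, option):
        value = CONVERTERS[data_type](config.get(section, option))
    elif default_value is not None:
        # No matching entry: fall back to the supplied default.
        value = default_value
    else:
        raise KeyError(f'No entry {option} in section {section}')
    if options is not None and value not in options:
        raise ValueError(f'{value} is not one of {options}')
    return value


config = ConfigParser()
config.read_string('[General settings]\nproject_name = two_animals_14_bps\nanimal_no = 2\n')
print(read_entry_sketch(config, 'General settings', 'project_name', 'str'))
```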
simba.utils.read_write.read_config_file(config_path: Union[str, PathLike]) ConfigParser[source]

Helper to parse SimBA project project_config.ini file

Parameters

config_path (str) – Path to project_config.ini file

Return configparser.ConfigParser

parsed project_config.ini file

Raises

MissingProjectConfigEntryError – Invalid file format.

Example

>>> read_config_file(config_path='project_folder/project_config.ini')
simba.utils.read_write.read_data_paths(path: Optional[Union[str, PathLike]], default: List[Union[str, PathLike]], default_name: Optional[str] = '', file_type: Optional[str] = 'csv') List[str][source]

Helper to flexibly read in a set of file-paths.

Parameters
  • path (Union[str, os.PathLike]) – None or path to a file or a folder or list of paths to files.

  • default (List[Union[str, os.PathLike]]) – If path is None. Use this passed list of file paths.

  • default_name (Optional[str]) – A readable name representing the default for interpretable error msgs. Defaults to empty string.

  • file_type (Optional[str]) – If path is a directory, read in all files in directory with this file extension. Default: csv.

Return List[str]

List of file paths.
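
The four accepted forms of path (None, a list of files, a directory, or a single file) can be resolved as in the sketch below. It is a simplified illustration, not the SimBA code: read_paths_sketch is a hypothetical name, and the existence checks and error messages that default_name supports are omitted.

```python
import glob
import os
from typing import List, Optional, Sequence, Union


def read_paths_sketch(path: Optional[Union[str, Sequence[str]]],
                      default: List[str],
                      file_type: str = 'csv') -> List[str]:
    """Resolve path into a list of file paths."""
    if path is None:
        return list(default)
    if isinstance(path, (list, tuple)):
        return [str(p) for p in path]
    if os.path.isdir(path):
        # Directory: collect every file with the requested extension.
        return sorted(glob.glob(os.path.join(path, f'*.{file_type}')))
    return [str(path)]
```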

simba.utils.read_write.read_df(file_path: Union[str, PathLike], file_type: Union[str, PathLike], has_index: Optional[bool] = True, remove_columns: Optional[List[str]] = None, usecols: Optional[List[str]] = None, anipose_data: Optional[bool] = False, check_multiindex: Optional[bool] = False, multi_index_headers_to_keep: Optional[int] = None) DataFrame[source]

Read single tabular data file or pickle

Note

For improved runtime, defaults to pyarrow.csv.read_csv() if file type is csv.

Parameters
  • file_path (str) – Path to data file

  • file_type (str) – Type of data. OPTIONS: ‘parquet’, ‘csv’, ‘pickle’.

  • has_index (Optional[bool]) – If the input file has an initial index column. Default: True.

  • remove_columns (Optional[List[str]]) – If not None, then remove columns in list.

  • usecols (Optional[List[str]]) – If not None, then keep columns in list.

  • check_multiindex (bool) – Check if the file has multi-index headers. Default: False.

  • multi_index_headers_to_keep (int) – If reading a multi-index file and one of the dropped multi-index levels should be kept as the header in the output, specify the index of that multi-index header as int.

Return pd.DataFrame

Example

>>> read_df(file_path='project_folder/csv/input_csv/Video_1.csv', file_type='csv', check_multiindex=True)
simba.utils.read_write.read_frm_of_video(video_path: Union[str, PathLike, VideoCapture], frame_index: int = 0, opacity: Optional[float] = None, size: Optional[Tuple[int, int]] = None, greyscale: Optional[bool] = False, clahe: Optional[bool] = False) ndarray[source]

Reads single image from video file.

Parameters
  • video_path (Union[str, os.PathLike]) – Path to video file, or cv2.VideoCapture object.

  • frame_index (int) – The frame of video to return. Default: 0.

  • opacity (Optional[float]) – Value between 0 and 100 or None. If float, returns image with opacity. 100 fully opaque. 0.0 fully transparent.

  • size (Optional[Tuple[int, int]]) – If tuple, resizes the image to size. Else, returns original image size.

  • greyscale (Optional[bool]) – If true, returns the greyscale image. Default False.

  • clahe (Optional[bool]) – If true, returns clahe enhanced image. Default False.

Return np.ndarray

Image as numpy array.

Example

>>> img = read_frm_of_video(video_path='/Users/simon/Desktop/envs/platea_featurizer/data/video/3D_Mouse_5-choice_MouseTouchBasic_s9_a4_grayscale.mp4', clahe=True)
>>> cv2.imshow('img', img)
>>> cv2.waitKey(5000)
simba.utils.read_write.read_meta_file(meta_file_path: Union[str, PathLike]) dict[source]

Read in single SimBA modelconfig meta file CSV to python dictionary.

Parameters

meta_file_path (str) – Path to SimBA config meta file

Return dict

Dictionary holding model parameters.

Example

>>> read_meta_file('project_folder/configs/Attack_meta_0.csv')
{'Classifier_name': 'Attack', 'RF_n_estimators': 2000, 'RF_max_features': 'sqrt', 'RF_criterion': 'gini', ...}
simba.utils.read_write.read_pickle(data_path: Union[str, PathLike], verbose: Optional[bool] = False) dict[source]

Read a single or directory of pickled objects. If directory, returns dict with numerical sequential integer keys for each object.

Parameters
  • data_path (str) – Pickled file path, or directory of pickled files.

  • verbose (Optional[bool]) – If True, prints progress. Default False.

Return dict

Example

>>> data = read_pickle(data_path='/test/unsupervised/cluster_models')
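
Reading either one pickle or a directory of pickles into a dict with sequential integer keys can be sketched as below. The name read_pickle_sketch, the .pickle extension filter, and keys starting at 0 are assumptions; the text above only states that the keys are sequential integers.

```python
import glob
import os
import pickle
from typing import Any


def read_pickle_sketch(data_path: str, verbose: bool = False) -> Any:
    """Load one pickle, or every .pickle file in a directory keyed by 0, 1, 2, ..."""
    if os.path.isdir(data_path):
        data = {}
        for idx, file_path in enumerate(sorted(glob.glob(os.path.join(data_path, '*.pickle')))):
            if verbose:
                print(f'Reading {file_path} ...')
            with open(file_path, 'rb') as f:
                data[idx] = pickle.load(f)
        return data
    with open(data_path, 'rb') as f:
        return pickle.load(f)
```

Only unpickle files from a trusted source, since loading a pickle can execute arbitrary code.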
simba.utils.read_write.read_project_path_and_file_type(config: ~configparser.ConfigParser) -> (<class 'str'>, <class 'str'>)[source]

Helper to read the path and file type of the SimBA project from the project_config.ini.

Parameters

config (configparser.ConfigParser) – parsed SimBA config in configparser.ConfigParser format

Return str

The path of the project project_folder.

Return str

The set file type of the project (i.e., csv or parquet).

simba.utils.read_write.read_roi_data(roi_path: ~typing.Union[str, ~os.PathLike]) -> (<class 'pandas.core.frame.DataFrame'>, <class 'pandas.core.frame.DataFrame'>, <class 'pandas.core.frame.DataFrame'>)[source]

Method to read in ROI definitions from SimBA project

simba.utils.read_write.read_simba_meta_files(folder_path: str, raise_error: bool = False) List[str][source]

Read in the paths of SimBA model config meta files in a directory (e.g., project_folder/configs). Only files with the meta suffix are considered.

Parameters
  • folder_path (str) – directory with SimBA model config meta files

  • raise_error (bool) – If True, raise error if no files are found with meta suffix. Else, print warning. Default: False.

Return List[str]

List of paths to SimBA model config meta files.

Example

>>> read_simba_meta_files(folder_path='/project_folder/configs')
['project_folder/configs/Attack_meta_1.csv', 'project_folder/configs/Attack_meta_0.csv']
simba.utils.read_write.read_video_info(vid_info_df: ~pandas.core.frame.DataFrame, video_name: str) -> (<class 'pandas.core.frame.DataFrame'>, <class 'float'>, <class 'float'>)[source]

Helper to read the metadata (pixels per mm, resolution, fps etc) from the video_info.csv for a single input file/video

Parameters
  • vid_info_df (pd.DataFrame) – Parsed project_folder/logs/video_info.csv file. This file can be parsed by simba.utils.read_write.read_video_info_csv().

  • video_name (str) – Name of the video as represented in the Video column of the project_folder/logs/video_info.csv file.

Returns pd.DataFrame

One row DataFrame representing the video in the project_folder/logs/video_info.csv file.

Return float

The frame rate of the video as represented in the project_folder/logs/video_info.csv file

Return float

The pixels per millimeter of the video as represented in the project_folder/logs/video_info.csv file

Raises

ParametersFileError – The video is not accurately represented in the project_folder/logs/video_info.csv file.

Example

>>> vid_info_df = read_video_info_csv(file_path='project_folder/logs/video_info.csv')
>>> read_video_info(vid_info_df=vid_info_df, video_name='Together_1')
simba.utils.read_write.read_video_info_csv(file_path: Union[str, PathLike]) DataFrame[source]

Read the project_folder/logs/video_info.csv of the SimBA project as a pd.DataFrame

Parameters

file_path (str) – Path to the SimBA project project_folder/logs/video_info.csv file

Return pd.DataFrame

Raises
  • ParametersFileError – Invalid format of project_folder/logs/video_info.csv.

  • InvalidValueWarning – Some videos are registered with FPS >= 1.

Example

>>> read_video_info_csv(file_path='project_folder/logs/video_info.csv')
simba.utils.read_write.remove_a_folder(folder_dir: Union[str, PathLike]) None[source]

Helper to remove a directory

simba.utils.read_write.remove_files(file_paths: List[Union[str, PathLike]], raise_error: Optional[bool] = False) None[source]

Delete (remove) the files specified within a list of filepaths.

Parameters
  • file_paths (List[Union[str, os.PathLike]]) – A list of file paths to be removed.

  • raise_error (Optional[bool]) – If True, raise exceptions for errors during file deletion. Else, pass. Defaults to False.

Examples

>>> file_paths = ['/path/to/file1.txt', '/path/to/file2.txt']
>>> remove_files(file_paths, raise_error=True)
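
The effect of raise_error can be sketched with os.remove: failures are either re-raised or silently skipped. The name remove_files_sketch is hypothetical, and the sketch propagates the built-in OSError rather than any SimBA-specific exception.

```python
import os
from typing import List


def remove_files_sketch(file_paths: List[str], raise_error: bool = False) -> None:
    """Delete each file in file_paths, optionally re-raising on failure."""
    for file_path in file_paths:
        try:
            os.remove(file_path)
        except OSError:
            if raise_error:
                raise
            # raise_error is False: skip files that cannot be removed.
```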
simba.utils.read_write.seconds_to_timestamp(seconds: int) str[source]

Convert an integer number representing seconds to a HH:MM:SS format.
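
The conversion amounts to two integer divisions. The sketch below shows the arithmetic under the assumption of zero-padded two-digit fields; seconds_to_timestamp_sketch is a hypothetical name and the handling of values above 99 hours in SimBA is not documented here.

```python
def seconds_to_timestamp_sketch(seconds: int) -> str:
    """Format a number of seconds as HH:MM:SS."""
    hours, remainder = divmod(int(seconds), 3600)
    minutes, secs = divmod(remainder, 60)
    return f'{hours:02d}:{minutes:02d}:{secs:02d}'


print(seconds_to_timestamp_sketch(3661))  # 01:01:01
```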

simba.utils.read_write.str_2_bool(input_str: str) bool[source]

Helper to convert string representation of bool to bool.

Example

>>> str_2_bool(input_str='yes')
True
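
A string-to-bool helper of this kind usually checks membership in a set of truthy strings. The exact set SimBA accepts is not listed above, so the TRUTHY set below (and the name str_2_bool_sketch) is an assumption; only 'yes' is confirmed by the example.

```python
# Assumed truthy strings; only 'yes' appears in the documented example.
TRUTHY = {'yes', 'true', '1'}


def str_2_bool_sketch(input_str: str) -> bool:
    """Return True if input_str is a recognised truthy string (case-insensitive)."""
    return input_str.strip().lower() in TRUTHY


print(str_2_bool_sketch('yes'))  # True
```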
simba.utils.read_write.tabulate_clf_info(clf_path: Union[str, PathLike]) None[source]

Print the hyperparameters and creation date of a pickled classifier.

Parameters

clf_path (str) – Path to classifier

Raises

InvalidFilepathError – The file is not a pickle or not a scikit-learn RF classifier.

simba.utils.read_write.web_callback(url: str) None[source]
simba.utils.read_write.write_df(df: DataFrame, file_type: str, save_path: Union[str, PathLike], multi_idx_header: bool = False) None[source]

Write single tabular data file.

Note

For improved runtime, defaults to pyarrow.csv if file_type == csv.

Parameters
  • df (pd.DataFrame) – Pandas dataframe to save to disk.

  • file_type (str) – Type of data. OPTIONS: parquet, csv, pickle.

  • save_path (str) – Location where to store the data.

  • multi_idx_header (bool) – Whether the data has multi-index headers. Default: False.

Example

>>> write_df(df=df, file_type='csv', save_path='project_folder/csv/input_csv/Video_1.csv')
simba.utils.read_write.write_pickle(data: Dict[str, Any], save_path: Union[str, PathLike]) None[source]

Write a single object as pickle.

Parameters
  • data (Dict[str, Any]) – Object to pickle.

  • save_path (str) – Location of saved pickle.

Example

>>> write_pickle(data=my_model, save_path='/test/unsupervised/cluster_models/My_model.pickle')

Warnings

simba.utils.warnings.BodypartColumnNotFoundWarning(**kwargs)[source]
simba.utils.warnings.BorisPointEventsWarning(**kwargs)[source]
simba.utils.warnings.CorruptedFileWarning(**kwargs)[source]
simba.utils.warnings.CropWarning(**kwargs)[source]
simba.utils.warnings.DataHeaderWarning(**kwargs)[source]
simba.utils.warnings.DuplicateNamesWarning(**kwargs)[source]
simba.utils.warnings.FFMpegCodecWarning(**kwargs)[source]
simba.utils.warnings.FFMpegNotFoundWarning(**kwargs)[source]
simba.utils.warnings.FileExistWarning(**kwargs)[source]
simba.utils.warnings.FrameRangeWarning(**kwargs)[source]
simba.utils.warnings.IdenticalInputWarning(**kwargs)[source]
simba.utils.warnings.InValidUserInputWarning(**kwargs)[source]
simba.utils.warnings.InvalidValueWarning(**kwargs)[source]
simba.utils.warnings.KleinbergWarning(**kwargs)[source]
simba.utils.warnings.MissingUserInputWarning(**kwargs)[source]
simba.utils.warnings.MultiProcessingFailedWarning(**kwargs)[source]
simba.utils.warnings.NoDataFoundWarning(**kwargs)[source]
simba.utils.warnings.NoFileFoundWarning(**kwargs)[source]
simba.utils.warnings.NoModuleWarning(**kwargs)[source]
simba.utils.warnings.NotEnoughDataWarning(**kwargs)[source]
simba.utils.warnings.PythonVersionWarning(**kwargs)[source]
simba.utils.warnings.ROIWarning(**kwargs)[source]
simba.utils.warnings.SameInputAndOutputWarning(**kwargs)[source]
simba.utils.warnings.SamplingWarning(**kwargs)[source]
simba.utils.warnings.ShapWarning(**kwargs)[source]
simba.utils.warnings.SkippingFileWarning(**kwargs)[source]
simba.utils.warnings.SkippingRuleWarning(**kwargs)[source]
simba.utils.warnings.ThirdPartyAnnotationEventCountWarning(video_name: str, clf_name: str, start_event_cnt: int, stop_event_cnt: int, source: str = '', log_status: bool = False)[source]
simba.utils.warnings.ThirdPartyAnnotationFileNotFoundWarning(video_name: str, source: str = '', log_status: bool = False)[source]
simba.utils.warnings.ThirdPartyAnnotationOverlapWarning(video_name: str, clf_name: str, source: str = '', log_status: bool = False)[source]
simba.utils.warnings.ThirdPartyAnnotationsAdditionalClfWarning(video_name: str, clf_names: list, source: str = '', log_status: bool = False)[source]
simba.utils.warnings.ThirdPartyAnnotationsClfMissingWarning(video_name: str, clf_name: str)[source]
simba.utils.warnings.ThirdPartyAnnotationsFpsConflictWarning(video_name: str, annotation_fps: int, video_fps: int, source: str = '')[source]
simba.utils.warnings.ThirdPartyAnnotationsInvalidFileFormatWarning(annotation_app: str, file_path: str, source: str = '', log_status: bool = False)[source]
simba.utils.warnings.ThirdPartyAnnotationsMissingAnnotationsWarning(video_name: str, clf_names: list, source: str = '', log_status: bool = False)[source]
simba.utils.warnings.ThirdPartyAnnotationsOutsidePoseEstimationDataWarning(video_name: str, frm_cnt: int, log_status: bool = False, clf_name: Optional[str] = None, annotation_frms: Optional[int] = None, first_error_frm: Optional[int] = None, ambiguous_cnt: Optional[int] = None)[source]
simba.utils.warnings.log_warning(func)[source]