Mixins
Config reader methods
Methods for reading SimBA configparser.ConfigParser project config and associated project data.
- class simba.mixins.config_reader.ConfigReader(config_path: str, read_video_info: bool = True, create_logger: bool = True)[source]
Bases:
object
Methods for reading SimBA configparser.ConfigParser project config and associated project data.
- Parameters
config_path (str) – Path to the SimBA project_config.ini.
read_video_info (bool) – If True, read the project_folder/logs/video_info.csv file.
- add_missing_ROI_cols(shape_df: DataFrame) DataFrame [source]
Helper to add missing ROI definitions ('Color BGR', 'Thickness', 'Color name') in ROI info dataframes created by the first version of the SimBA ROI user-interface but analyzed using newer versions of SimBA.
- Parameters
shape_df (pd.DataFrame) – Dataframe holding ROI definitions.
- Return pd.DataFrame
Dataframe with 'Color BGR', 'Thickness', and 'Color name' fields added.
- check_multi_animal_status() None [source]
Helper to check if the project is a multi-animal SimBA project.
- create_body_part_dictionary(multi_animal_status: bool, animal_id_lst: list, animal_cnt: int, x_cols: List[str], y_cols: List[str], p_cols: Optional[List[str]] = None, colors: Optional[List[List[Tuple[int, int, int]]]] = None) Dict[str, Union[List[str], List[Tuple]]] [source]
Helper to create dict of dict lookup of body-parts where the keys are animal names, and values are the body-part names.
- Parameters
multi_animal_status (bool) – If True, it is a multi-animal SimBA project.
animal_id_lst (List[str]) – Animal names, e.g., ['Simon', 'JJ']. Note: in a single-animal project, this is overridden and set to Animal_1.
animal_cnt (int) – Number of animals in the SimBA project.
x_cols (List[str]) – Column names for body-part coordinates on the x-axis. Returned by simba.mixins.config_reader.ConfigReader.get_body_part_names().
y_cols (List[str]) – Column names for body-part coordinates on the y-axis. Returned by simba.mixins.config_reader.ConfigReader.get_body_part_names().
p_cols (Optional[List[str]]) – Column names for body-part pose-estimation probability values. Returned by simba.mixins.config_reader.ConfigReader.get_body_part_names().
colors (Optional[List[List[Tuple[int, int, int]]]]) – Optional BGR colors to associate with the body-parts. Returned by simba.utils.data.create_color_palettes().
- Return dict
Nested dictionary with animal names as keys and body-part names, coordinates, and colors as values.
- Example
>>> ConfigReader.create_body_part_dictionary(multi_animal_status=True, animal_id_lst=['simon',])
>>> {'simon': {'X_bps': ['Nose_1_x', 'Ear_left_1_x', 'Ear_right_1_x', 'Center_1_x', 'Lat_left_1_x', 'Lat_right_1_x', 'Tail_base_1_x', 'Tail_end_1_x'], 'Y_bps': ['Nose_1_y', 'Ear_left_1_y', 'Ear_right_1_y', 'Center_1_y', 'Lat_left_1_y', 'Lat_right_1_y', 'Tail_base_1_y', 'Tail_end_1_y'], 'colors': [[255.0, 0.0, 255.0], [223.125, 31.875, 255.0], [191.25, 63.75, 255.0], [159.375, 95.625, 255.0], [127.5, 127.5, 255.0], [95.625, 159.375, 255.0], [63.75, 191.25, 255.0], [31.875, 223.125, 255.0], [0.0, 255.0, 255.0]], 'P_bps': ['Nose_1_p', 'Ear_left_1_p', 'Ear_right_1_p', 'Center_1_p', 'Lat_left_1_p', 'Lat_right_1_p', 'Tail_base_1_p', 'Tail_end_1_p']}, 'jj': {'X_bps': ['Nose_2_x', 'Ear_left_2_x', 'Ear_right_2_x', 'Center_2_x', 'Lat_left_2_x', 'Lat_right_2_x', 'Tail_base_2_x', 'Tail_end_2_x'], 'Y_bps': ['Nose_2_y', 'Ear_left_2_y', 'Ear_right_2_y', 'Center_2_y', 'Lat_left_2_y', 'Lat_right_2_y', 'Tail_base_2_y', 'Tail_end_2_y'], 'colors': [[102.0, 127.5, 0.0], [102.0, 143.4375, 31.875], [102.0, 159.375, 63.75], [102.0, 175.3125, 95.625], [102.0, 191.25, 127.5], [102.0, 207.1875, 159.375], [102.0, 223.125, 191.25], [102.0, 239.0625, 223.125], [102.0, 255.0, 255.0]], 'P_bps': ['Nose_2_p', 'Ear_left_2_p', 'Ear_right_2_p', 'Center_2_p', 'Lat_left_2_p', 'Lat_right_2_p', 'Tail_base_2_p', 'Tail_end_2_p']}}
- drop_bp_cords(df: DataFrame, raise_error: bool = False) DataFrame [source]
Helper to remove pose-estimation fields from dataframe.
- Parameters
df (pd.DataFrame) – pandas dataframe containing pose-estimation fields (body-part x, y, p fields)
raise_error (bool) – If True, raise an error if body-parts cannot be found; else, print a warning.
- Return pd.DataFrame
Dataframe without pose-estimation fields.
- Example
>>> config_reader = ConfigReader(config_path='test/project_folder/project_config.csv')
>>> df = read_df(config_reader.machine_results_paths[0], file_type='csv')
>>> df = config_reader.drop_bp_cords(df=df)
- find_animal_name_from_body_part_name(bp_name: str, bp_dict: dict) str [source]
Given a body-part name and an animal body-part dict, returns the animal name.
- Parameters
bp_name (str) – Name of the body-part, e.g., 'Ear_1'.
bp_dict (dict) – Nested dict holding animal names as keys and body-part names and coordinates as values. Created by simba.mixins.config_reader.ConfigReader.create_body_part_dictionary().
- Return str
- Example
>>> config_reader = ConfigReader(config_path='tests/data/test_projects/two_c57/project_folder/project_config.ini')
>>> ConfigReader.find_animal_name_from_body_part_name(bp_name='Ear_1', bp_dict=config_reader.animal_bp_dict)
>>> 'simon'
- find_video_of_file(video_dir: Union[str, PathLike], filename: str, raise_error: bool = False) Union[str, PathLike] [source]
Helper to find the video file representing a known data file basename.
- Parameters
video_dir (Union[str, os.PathLike]) – Directory holding the putative video file.
filename (str) – Data file name, e.g., 'Video_1'.
raise_error (bool) – If True, raise an error if no video can be found.
- Return Union[str, os.PathLike]
Path to the video file.
- Example
>>> config_reader = ConfigReader(config_path='My_SimBA_Config')
>>> config_reader.find_video_of_file(video_dir=config_reader.video_dir, filename='Video1')
>>> '/project_folder/videos/Video1.mp4'
- get_all_clf_names() List[str] [source]
Helper to return all classifier names in SimBA project
- Return List[str]
- get_body_part_names()[source]
Helper to extract pose-estimation data field names (x, y, p)
- Example
>>> config_reader = ConfigReader(config_path='test/project_config.csv')
>>> config_reader.get_body_part_names()
- get_bp_headers() None [source]
Helper to create ordered list of all column header fields for SimBA project dataframes.
>>> config_reader = ConfigReader(config_path='test/project_folder/project_config.ini')
>>> config_reader.get_bp_headers()
- get_number_of_header_columns_in_df(df: DataFrame) int [source]
Helper to find the count of non-numerical rows at the top of a dataframe.
- Parameters
df (pd.DataFrame) – Dataframe to find the count of non-numerical header rows in.
- Return int
- Raises
DataHeaderError – If all rows are non-numerical.
- Example
>>> ConfigReader.get_number_of_header_columns_in_df(df=pd.DataFrame(data=[[1, 2, 3], [1, 2, 3]]))
>>> 0
>>> ConfigReader.get_number_of_header_columns_in_df(df=pd.DataFrame(data=[['Head_1', 'Body_2', 'Tail_3'], ['Some_nonsense', 'A_mistake', 'Maybe_multi_headers?'], [11, 99, 109], [122, 43, 2091]]))
>>> 2
- insert_column_headers_for_outlier_correction(data_df: DataFrame, new_headers: List[str], filepath: str) DataFrame [source]
Helper to insert new column headers onto a dataframe.
- Parameters
data_df (pd.DataFrame) – Dataframe to insert the new headers into.
new_headers (List[str]) – New column header names.
filepath (str) – Path to the data file the dataframe represents.
- Return pd.DataFrame
Dataframe with the new headers.
- Raises
DataHeaderWarning – If the number of new headers does not match the number of columns in the dataframe.
- Example
>>> df = pd.DataFrame(data=[[1, 2, 3], [1, 2, 3]], columns=['Feature_1', 'Feature_2', 'Feature_3'])
>>> ConfigReader.insert_column_headers_for_outlier_correction(data_df=df, new_headers=['Feature_4', 'Feature_5', 'Feature_6'], filepath='test/my_test_file.csv')
- read_config_entry(config: ConfigParser, section: str, option: str, data_type: typing_extensions.Literal['str', 'int', 'float', 'folder_path'], default_value: Optional[Any] = None, options: Optional[List[Any]] = None) Union[str, int, float] [source]
Helper to read an entry from a configparser.ConfigParser object.
- Parameters
config (ConfigParser) – Parsed SimBA project config
section (str) – Project config section name
option (str) – Project config option name
data_type (str) – Type of data, e.g., 'str', 'int', 'float', 'folder_path'.
default_value (Optional[Any]) – If the entry cannot be found, default to this value.
options (Optional[List[Any]]) – If not None, a list of viable entries.
- Raises
InvalidInputError – If the returned value is not in options.
MissingProjectConfigEntryError – If no entry is found and no default_value is provided.
- Return Union[str, float, int, os.PathLike]
- Example
>>> config = ConfigReader(config_path='tests/data/test_projects/two_c57/project_folder/project_config.ini')
>>> config.read_config_entry(config=config.config, section='Multi animal IDs', option='id_list', data_type='str')
>>> 'simon,jj'
- read_video_info(video_name: str, raise_error: Optional[bool] = True) Tuple[pd.DataFrame, float, float][source]
Helper to read the meta-data (pixels per mm, resolution, fps) from the video_info.csv for a single input file.
- Parameters
video_name (str) – Name of the video to retrieve the metadata for.
raise_error (Optional[bool]) – If True, raise an error if the video metadata cannot be found. Default: True.
- Raises
ParametersFileError – If raise_error and the video metadata is not found.
DuplicationError – If the file contains multiple entries for the same video.
- Return (pd.DataFrame, float, float)
All video info, pixels per mm, and fps.
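The lookup this method performs can be sketched in plain pandas. This is an illustrative equivalent, not SimBA's implementation; the helper name lookup_video_info and the column names 'Video', 'fps', and 'pixels/mm' are assumptions about the video_info.csv layout.

```python
import pandas as pd

def lookup_video_info(video_info_df: pd.DataFrame, video_name: str):
    """Sketch: return (rows, pixels per mm, fps) for one video, raising on
    missing or duplicated entries as the docstring above describes."""
    rows = video_info_df[video_info_df["Video"] == video_name]
    if len(rows) == 0:
        raise ValueError(f"No video_info entry found for {video_name}")
    if len(rows) > 1:
        raise ValueError(f"Duplicate video_info entries found for {video_name}")
    return rows, float(rows["pixels/mm"].values[0]), float(rows["fps"].values[0])

# Hypothetical video_info table for illustration.
video_info = pd.DataFrame(
    {"Video": ["Video_1", "Video_2"], "fps": [30.0, 25.0], "pixels/mm": [4.56, 3.2]}
)
_, px_per_mm, fps = lookup_video_info(video_info, "Video_1")
```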
- read_video_info_csv(file_path: str) DataFrame [source]
Helper to read the project_folder/logs/video_info.csv of the SimBA project into a pd.DataFrame.
- Parameters
file_path (str) – Path to the video_info.csv file.
- Return pd.DataFrame
- remove_a_folder(folder_dir: str, raise_error: Optional[bool] = False) None [source]
Helper to remove single directory.
- Parameters
folder_dir (str) – Directory to remove.
raise_error (bool) – If True, raise NotDirectoryError if the folder does not exist.
- Raises
NotDirectoryError – If raise_error and the directory does not exist.
- Example
>>> self.remove_a_folder(folder_dir='gerbil/gerbil_data/featurized_data/temp')
- remove_multiple_folders(folders: List[PathLike], raise_error: Optional[bool] = False) None [source]
Helper to remove multiple directories.
- Parameters
folders (List[os.PathLike]) – List of directory paths.
raise_error (bool) – If True, raise NotDirectoryError if a folder does not exist; if False, pass. Default: False.
- Raises
NotDirectoryError – If raise_error and a directory does not exist.
- Example
>>> self.remove_multiple_folders(folders=['gerbil/gerbil_data/featurized_data/temp'])
- remove_roi_features(data_dir: Union[str, PathLike]) None [source]
Helper to remove ROI-based features from datasets within a directory. The identified ROI-based fields are moved to the project_folder/logs/ROI_data_{datetime} directory.
Note
ROI-based features are identified by the combined criteria of (i) the prefix of the field being a named ROI in the project_folder/logs/ROI_definitions.h5 file, and (ii) the suffix of the field being contained in ['in zone', 'in zone_cumulative_time', 'in zone_cumulative_percent', 'distance', 'facing'].
- Parameters
data_dir (Union[str, os.PathLike]) – Directory with data to remove ROI features from.
- Example
>>> self.remove_roi_features('/project_folder/csv/features_extracted')
Feature extraction methods
- class simba.mixins.feature_extraction_mixin.FeatureExtractionMixin(config_path: Optional[str] = None)[source]
Bases:
object
Methods for featurizing pose-estimation data.
- Parameters
config_path (Optional[str]) – Path to the SimBA project_config.ini.
- static angle3pt(ax: float, ay: float, bx: float, by: float, cx: float, cy: float) float [source]
Jitted helper for single frame 3-point angle.
See also
For 3-point angles across multiple frames and improved runtime, see simba.mixins.feature_extraction_mixin.FeatureExtractionMixin.angle3pt_serialized().
- Example
>>> FeatureExtractionMixin.angle3pt(ax=122.0, ay=198.0, bx=237.0, by=138.0, cx=191.0, cy=109)
>>> 59.78156901181637
- static angle3pt_serialized(data: ndarray) ndarray [source]
Jitted helper for frame-wise 3-point angles.
- Parameters
data (ndarray) – 2D numerical array with frame number on x and [ax, ay, bx, by, cx, cy] on y.
- Return ndarray
1d float numerical array of size data.shape[0] with angles.
- Examples
>>> coordinates = np.random.randint(1, 10, size=(6, 6))
>>> FeatureExtractionMixin.angle3pt_serialized(data=coordinates)
>>> [ 67.16634582, 1.84761027, 334.23067238, 258.69006753, 11.30993247, 288.43494882]
- static cdist(array_1: ndarray, array_2: ndarray) ndarray [source]
Jitted analogue of scipy.spatial.distance.cdist for two 2D arrays. Use to calculate Euclidean distances between all coordinates in one array and all coordinates in a second array. E.g., computes the distances between all body-parts of one animal and all body-parts of a second animal.
- Parameters
array_1 (np.ndarray) – 2D array of body-part coordinates
array_2 (np.ndarray) – 2D array of body-part coordinates
- Return np.ndarray
2D array of Euclidean distances between body-parts in array_1 and array_2.
- Example
>>> array_1 = np.random.randint(1, 10, size=(3, 2)).astype(np.float32)
>>> array_2 = np.random.randint(1, 10, size=(3, 2)).astype(np.float32)
>>> FeatureExtractionMixin.cdist(array_1=array_1, array_2=array_2)
>>> [[7.07106781, 1. , 3.60555124],
>>> [3.60555124, 6.3245554 , 2. ],
>>> [3.1622777 , 5.38516474, 4.12310553]]
- static cdist_3d(data: ndarray) ndarray [source]
Jitted analogue of scipy.spatial.distance.cdist for a 3D array. Use to calculate Euclidean distances between all coordinates of one array and itself.
- Parameters
data (np.ndarray) – 3D array of body-part coordinates of size len(frames) x -1 x 2.
- Return np.ndarray
3D array of size data.shape[0], data.shape[1], data.shape[1].
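The per-frame self-distance computation can be sketched with scipy. This is an illustrative equivalent of what the jitted method computes, under the documented in/out shapes; the helper name cdist_3d_sketch is an assumption.

```python
import numpy as np
from scipy.spatial.distance import cdist

def cdist_3d_sketch(data: np.ndarray) -> np.ndarray:
    """For each frame (2D slice), compute pairwise Euclidean distances
    between all coordinates in that slice and itself."""
    out = np.zeros((data.shape[0], data.shape[1], data.shape[1]))
    for i in range(data.shape[0]):
        out[i] = cdist(data[i], data[i])
    return out

# Two frames, three body-parts, (x, y) coordinates.
frames = np.array([[[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]],
                   [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0]]])
distances = cdist_3d_sketch(frames)
```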
- change_in_bodypart_euclidean_distance(location_1: ndarray, location_2: ndarray, fps: int, px_per_mm: float, time_windows: ndarray = array([0.2, 0.4, 0.8, 1.6])) ndarray [source]
Computes the difference between the distance of two body-parts in the current frame versus N seconds ago. Used for computing whether animal body-parts are traveling away from or towards each other within defined time-windows.
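The computation can be sketched in numpy: distance between the two body-parts now, minus the same distance one time-window earlier. An illustrative sketch of the described behavior, not SimBA's implementation; the helper name, the zero-padding of the first frames, and the window-to-lag rounding are assumptions.

```python
import numpy as np

def distance_change_sketch(location_1, location_2, fps, px_per_mm, time_windows):
    """Sketch: frame-wise distance between two body-parts in millimeters,
    minus that distance time_window seconds earlier (0 before the window)."""
    dist_mm = np.linalg.norm(location_1 - location_2, axis=1) / px_per_mm
    out = np.zeros((dist_mm.shape[0], len(time_windows)))
    for i, w in enumerate(time_windows):
        lag = int(w * fps)  # assumes lag >= 1 frame
        out[lag:, i] = dist_mm[lag:] - dist_mm[:-lag]
    return out

# One body-part static, the other moving away at 10 px/frame.
loc_1 = np.array([[0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0]])
loc_2 = np.array([[10.0, 0.0], [20.0, 0.0], [30.0, 0.0], [40.0, 0.0]])
changes = distance_change_sketch(loc_1, loc_2, fps=10, px_per_mm=1.0, time_windows=[0.2])
```

Positive values indicate the body-parts moved apart over the window; negative values indicate approach.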
- check_directionality_cords() dict [source]
Helper to check if ear and nose body-parts are present within the pose-estimation data.
- Return dict
Body-part names of ear and nose body-parts as values and animal names as keys. If empty, ear and nose body-parts are not present within the pose-estimation data
- check_directionality_viable()[source]
Check if it is possible to calculate directionality statistics (i.e., nose and ear coordinates from pose-estimation have to be present).
- Return bool
If True, directionality is viable. Else, not viable.
- Return np.ndarray nose_coord
If viable, then 2D array with coordinates of the nose in all frames. Else, empty array.
- Return np.ndarray ear_left_coord
If viable, then 2D array with coordinates of the left ear in all frames. Else, empty array.
- Return np.ndarray ear_right_coord
If viable, then 2D array with coordinates of the right ear in all frames. Else, empty array.
- static convex_hull_calculator_mp(arr: ndarray, px_per_mm: float) float [source]
Calculate single frame convex hull perimeter length in millimeters.
See also
For acceptable run-time, call using parallel.delayed. For large data, use simba.feature_extractors.perimeter_jit.jitted_hull(), which returns perimeter length OR area.
- Parameters
arr (np.ndarray) – 2D array of size len(body-parts) x 2.
px_per_mm (float) – Video pixels per millimeter.
- Return float
The length of the animal perimeter in millimeters.
- Example
>>> coordinates = np.random.randint(1, 200, size=(6, 2)).astype(np.float32)
>>> FeatureExtractionMixin.convex_hull_calculator_mp(arr=coordinates, px_per_mm=4.56)
>>> 98.6676814218373
- static cosine_similarity(data: ndarray) ndarray [source]
Jitted analogue of sklearn.metrics.pairwise.cosine_similarity. Similar to scipy.cdist; calculates the cosine similarity between all pairs in a 2D array.
- Example
>>> data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]).astype(np.float32)
>>> FeatureExtractionMixin().cosine_similarity(data=data)
>>> [[1.0, 0.974, 0.959], [0.974, 1.0, 0.998], [0.959, 0.998, 1.0]]
- static count_values_in_range(data: ndarray, ranges: ndarray) ndarray [source]
Jitted helper finding the count of values that fall within ranges. E.g., count the number of pose-estimated body-parts that fall within defined brackets of probabilities per frame.
- Parameters
data (np.ndarray) – 2D numpy array with frames on X.
ranges (np.ndarray) – 2D numpy array representing the brackets. E.g., [[0, 0.1], [0.1, 0.5]]
- Return np.ndarray
2D numpy array of size data.shape[0], ranges.shape[1]
- Example
>>> FeatureExtractionMixin.count_values_in_range(data=np.random.random((3,10)), ranges=np.array([[0.0, 0.25], [0.25, 0.5]]))
>>> [[6, 1], [3, 2], [2, 1]]
- static create_shifted_df(df: DataFrame, periods: int = 1) DataFrame [source]
Create dataframe including duplicated shifted (periods=1) columns with _shifted suffix.
- Parameters
df (pd.DataFrame) – Input dataframe.
- Return pd.DataFrame
Dataframe including original and shifted columns.
- Example
>>> df = pd.DataFrame(np.random.randint(0,100,size=(3, 1)), columns=['Feature_1'])
>>> FeatureExtractionMixin.create_shifted_df(df=df)
>>>    Feature_1  Feature_1_shifted
>>> 0         76               76.0
>>> 1         41               76.0
>>> 2         89               41.0
- dataframe_gaussian_smoother(df: DataFrame, fps: int, time_window: int = 100) DataFrame [source]
Column-wise Gaussian smoothing of dataframe.
- Parameters
df (pd.DataFrame) – Dataframe to smooth.
fps (int) – Framerate of the video representing the data.
time_window (int) – Smoothing time window. Default: 100.
- Return pd.DataFrame
Dataframe with smoothed data.
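Column-wise Gaussian smoothing can be sketched with pandas' Gaussian-weighted rolling mean. An illustrative sketch only: the interpretation of time_window as milliseconds, the std value, and the centering are assumptions, not SimBA's exact parameters.

```python
import numpy as np
import pandas as pd

def gaussian_smooth_sketch(df: pd.DataFrame, fps: int, time_window_ms: int = 100) -> pd.DataFrame:
    """Sketch: rolling Gaussian-weighted mean over a window covering
    time_window_ms milliseconds of frames."""
    frames_in_window = max(int(fps * (time_window_ms / 1000.0)), 1)
    return (df.rolling(window=frames_in_window, win_type="gaussian",
                       center=True, min_periods=1)
              .mean(std=5))  # std chosen arbitrarily for illustration

df = pd.DataFrame({"Nose_x": np.arange(20, dtype=np.float64)})
smoothed = gaussian_smooth_sketch(df, fps=30, time_window_ms=200)
```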
- dataframe_savgol_smoother(df: DataFrame, fps: int, time_window: int = 150) DataFrame [source]
Column-wise Savitzky-Golay smoothing of dataframe.
- Parameters
df (pd.DataFrame) – Dataframe to smooth.
fps (int) – Framerate of the video representing the data.
time_window (int) – Smoothing time window. Default: 150.
- Return pd.DataFrame
Dataframe with smoothed data.
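Savitzky-Golay smoothing can be sketched with scipy.signal.savgol_filter. An illustrative sketch only: the millisecond interpretation of time_window, the odd-window adjustment, and polyorder=3 are assumptions, not SimBA's exact parameters.

```python
import numpy as np
import pandas as pd
from scipy.signal import savgol_filter

def savgol_smooth_sketch(df: pd.DataFrame, fps: int, time_window_ms: int = 150) -> pd.DataFrame:
    """Sketch: per-column Savitzky-Golay filter over a window covering
    time_window_ms milliseconds of frames."""
    window = int(fps * (time_window_ms / 1000.0))
    if window % 2 == 0:
        window += 1          # savgol_filter requires an odd window length
    window = max(window, 5)  # keep window above the polynomial order
    out = df.copy()
    for col in df.columns:
        out[col] = savgol_filter(df[col].values, window_length=window, polyorder=3)
    return out

# A linear ramp is reproduced exactly by a degree-3 Savitzky-Golay fit.
df = pd.DataFrame({"Nose_x": np.linspace(0.0, 10.0, 50)})
smoothed = savgol_smooth_sketch(df, fps=30, time_window_ms=150)
```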
- static euclidean_distance(bp_1_x: ndarray, bp_2_x: ndarray, bp_1_y: ndarray, bp_2_y: ndarray, px_per_mm: float) ndarray [source]
Helper to compute the Euclidean distance in millimeters between two body-parts in all frames of a video
See also
Use simba.mixins.feature_extraction_mixin.FeatureExtractionMixin.framewise_euclidean_distance() for improved run-times.
- Parameters
bp_1_x (np.ndarray) – 2D array of size len(frames) x 1 with bodypart 1 x-coordinates.
bp_2_x (np.ndarray) – 2D array of size len(frames) x 1 with bodypart 2 x-coordinates.
bp_1_y (np.ndarray) – 2D array of size len(frames) x 1 with bodypart 1 y-coordinates.
bp_2_y (np.ndarray) – 2D array of size len(frames) x 1 with bodypart 2 y-coordinates.
px_per_mm (float) – Pixels per millimeter of the video.
- Return np.ndarray
2D array of size len(frames) x 1 with distances between body-part 1 and 2 in millimeters
- Example
>>> x1, x2 = np.random.randint(1, 10, size=(10, 1)), np.random.randint(1, 10, size=(10, 1))
>>> y1, y2 = np.random.randint(1, 10, size=(10, 1)), np.random.randint(1, 10, size=(10, 1))
>>> FeatureExtractionMixin.euclidean_distance(bp_1_x=x1, bp_2_x=x2, bp_1_y=y1, bp_2_y=y2, px_per_mm=4.56)
- static find_midpoints(bp_1: ndarray, bp_2: ndarray, percentile: float = 0.5) ndarray [source]
Compute the midpoints between two sets of 2D points based on a given percentile.
- Parameters
bp_1 (np.ndarray) – An array of 2D points representing the first set of points. Rows represent frames; the first column represents x-coordinates and the second column y-coordinates.
bp_2 (np.ndarray) – An array of 2D points representing the second set of points. Rows represent frames; the first column represents x-coordinates and the second column y-coordinates.
percentile (float) – The percentile value determining the distance between the points used for calculating midpoints. When set to 0.5, the midpoint lies halfway between the two points.
- Returns
np.ndarray: An array of 2D points representing the midpoints between the points in bp_1 and bp_2 based on the specified percentile.
- Example
>>> bp_1 = np.array([[1, 3], [30, 10]]).astype(np.int64)
>>> bp_2 = np.array([[10, 4], [20, 1]]).astype(np.int64)
>>> FeatureExtractionMixin().find_midpoints(bp_1=bp_1, bp_2=bp_2, percentile=0.5)
>>> [[ 5, 3], [25, 6]]
- static framewise_euclidean_distance(location_1: ndarray, location_2: ndarray, px_per_mm: float, centimeter: bool = False) ndarray [source]
Jitted helper finding frame-wise distances between two moving locations in millimeter or centimeter.
- Parameters
location_1 (np.ndarray) – 2D array of size len(frames) x 2 with the first location coordinates.
location_2 (np.ndarray) – 2D array of size len(frames) x 2 with the second location coordinates.
px_per_mm (float) – Pixels per millimeter of the video.
centimeter (bool) – If True, return distances in centimeters rather than millimeters. Default: False.
- Return np.ndarray
1D array of size location_1.shape[0].
- Example
>>> loc_1 = np.random.randint(1, 200, size=(6, 2)).astype(np.float32)
>>> loc_2 = np.random.randint(1, 200, size=(6, 2)).astype(np.float32)
>>> FeatureExtractionMixin.framewise_euclidean_distance(location_1=loc_1, location_2=loc_2, px_per_mm=4.56, centimeter=False)
>>> [49.80098657, 46.54963644, 49.60650394, 70.35919993, 37.91069901, 71.95422524]
- static framewise_euclidean_distance_roi(location_1: ndarray, location_2: ndarray, px_per_mm: float, centimeter: bool = False) ndarray [source]
Find frame-wise distances between a moving location (location_1) and static location (location_2) in millimeter or centimeter.
- Parameters
location_1 (np.ndarray) – 2D array of size len(frames) x 2 with the moving location coordinates.
location_2 (np.ndarray) – 2D array of size 1 x 2 with the static location coordinates.
px_per_mm (float) – Pixels per millimeter of the video.
centimeter (bool) – If True, return distances in centimeters rather than millimeters. Default: False.
- Return np.ndarray
1D array of size location_1.shape[0].
- Example
>>> loc_1 = np.random.randint(1, 200, size=(6, 2)).astype(np.float32)
>>> loc_2 = np.random.randint(1, 200, size=(1, 2)).astype(np.float32)
>>> FeatureExtractionMixin.framewise_euclidean_distance_roi(location_1=loc_1, location_2=loc_2, px_per_mm=4.56, centimeter=False)
>>> [11.31884926, 13.84534585, 6.09712224, 17.12773976, 19.32066031, 12.18043378]
>>> FeatureExtractionMixin.framewise_euclidean_distance_roi(location_1=loc_1, location_2=loc_2, px_per_mm=4.56, centimeter=True)
>>> [1.13188493, 1.38453458, 0.60971222, 1.71277398, 1.93206603, 1.21804338]
- static framewise_inside_polygon_roi(bp_location: ndarray, roi_coords: ndarray) ndarray [source]
Jitted helper for frame-wise detection if animal is inside static polygon ROI.
Note
Modified from epifanio
- Parameters
bp_location (np.ndarray) – 2d numeric np.ndarray size len(frames) x 2
roi_coords (np.ndarray) – 2d numeric np.ndarray size len(polygon points) x 2
- Return ndarray
2d numeric boolean np.ndarray size len(frames) x 1 with 0 representing outside the polygon and 1 representing inside the polygon
- Example
>>> bp_loc = np.random.randint(1, 10, size=(6, 2)).astype(np.float32)
>>> roi_coords = np.random.randint(1, 10, size=(10, 2)).astype(np.float32)
>>> FeatureExtractionMixin.framewise_inside_polygon_roi(bp_location=bp_loc, roi_coords=roi_coords)
>>> [0, 0, 0, 1]
- static framewise_inside_rectangle_roi(bp_location: ndarray, roi_coords: ndarray) ndarray [source]
Jitted helper for frame-wise analysis if animal is inside static rectangular ROI.
- Parameters
bp_location (np.ndarray) – 2d numeric np.ndarray size len(frames) x 2
roi_coords (np.ndarray) – 2d numeric np.ndarray of size 2x2 (top left [x, y], bottom right [x, y])
- Return ndarray
2d numeric boolean np.ndarray size len(frames) x 1 with 0 representing outside the rectangle and 1 representing inside the rectangle
- Example
>>> bp_loc = np.random.randint(1, 10, size=(6, 2)).astype(np.float32)
>>> roi_coords = np.random.randint(1, 10, size=(2, 2)).astype(np.float32)
>>> FeatureExtractionMixin.framewise_inside_rectangle_roi(bp_location=bp_loc, roi_coords=roi_coords)
>>> [0, 0, 0, 0, 0, 0]
- get_bp_headers() None [source]
Helper to create ordered list of all column header fields for SimBA project dataframes.
- get_feature_extraction_headers(pose: str) List[str] [source]
Helper to return the header names (body-part location columns) that should be used during feature extraction.
- Parameters
pose (str) – Pose-estimation setting, e.g., '16'.
- Return List[str]
The names and order of the pose-estimation columns.
- insert_default_headers_for_feature_extraction(df: DataFrame, headers: List[str], pose_config: str, filename: str) DataFrame [source]
Helper to insert correct body-part column names prior to default feature extraction methods.
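The core of such a helper can be sketched in plain pandas: verify the header count matches the column count, then assign, naming the offending file in any error. An illustrative sketch; the helper name and error wording are assumptions.

```python
import pandas as pd

def insert_headers_sketch(df: pd.DataFrame, headers, filename: str) -> pd.DataFrame:
    """Sketch: replace column names after verifying the count matches."""
    if len(headers) != len(df.columns):
        raise ValueError(
            f"{filename}: expected {len(df.columns)} columns, got {len(headers)} headers")
    out = df.copy()
    out.columns = headers
    return out

df = pd.DataFrame([[1, 2, 3]], columns=["a", "b", "c"])
renamed = insert_headers_sketch(df, headers=["Nose_x", "Nose_y", "Nose_p"], filename="Video_1.csv")
```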
- static jitted_line_crosses_to_nonstatic_targets(left_ear_array: ndarray, right_ear_array: ndarray, nose_array: ndarray, target_array: ndarray) ndarray [source]
Jitted helper to calculate if an animal is directing towards another animals body-part coordinate, given the target body-part and the left ear, right ear, and nose coordinates of the observer.
Note
Input left ear, right ear, and nose coordinates of the observer are returned by simba.mixins.feature_extraction_mixin.FeatureExtractionMixin.check_directionality_viable().
- Parameters
left_ear_array (np.ndarray) – 2D array of size len(frames) x 2 with the coordinates of the observer animal's left ear
right_ear_array (np.ndarray) – 2D array of size len(frames) x 2 with the coordinates of the observer animal's right ear
nose_array (np.ndarray) – 2D array of size len(frames) x 2 with the coordinates of the observer animal's nose
target_array (np.ndarray) – 2D array of size len(frames) x 2 with the target body-part location
- Return np.ndarray
2D array of size len(frames) x 4. The first column represents the side of the observer that the target is in view of: 0 = left side, 1 = right side, 2 = not in view. The second and third columns represent the x and y location of the observer animal's 'eye' (half-way between the ear and the nose). The fourth column represents if the target is in view (bool).
- static jitted_line_crosses_to_static_targets(left_ear_array: ndarray, right_ear_array: ndarray, nose_array: ndarray, target_array: ndarray) ndarray [source]
Jitted helper to calculate if an animal is directing towards a static location (ROI centroid), given the target location and the left ear, right ear, and nose coordinates of the observer.
Note
Input left ear, right ear, and nose coordinates of the observer are returned by simba.mixins.feature_extraction_mixin.FeatureExtractionMixin.check_directionality_viable().
- Parameters
left_ear_array (np.ndarray) – 2D array of size len(frames) x 2 with the coordinates of the observer animal's left ear
right_ear_array (np.ndarray) – 2D array of size len(frames) x 2 with the coordinates of the observer animal's right ear
nose_array (np.ndarray) – 2D array of size len(frames) x 2 with the coordinates of the observer animal's nose
target_array (np.ndarray) – 1D array with the x, y coordinates of the target location
- Return np.ndarray
2D array of size len(frames) x 4. The first column represents the side of the observer that the target is in view of: 0 = left side, 1 = right side, 2 = not in view. The second and third columns represent the x and y location of the observer animal's 'eye' (half-way between the ear and the nose). The fourth column represents if the target is in view (bool).
- static line_crosses_to_static_targets(p: List[float], q: List[float], n: List[float], M: List[float], coord: List[float]) (bool, List[float])[source]
Legacy non-jitted helper to calculate if an animal is directing towards a static coordinate (e.g., ROI centroid).
- Parameters
- Return bool
If True, static coordinate is in view.
- Return List
If True, the coordinate of the observing animal's 'eye' (half-way between nose and ear).
- static minimum_bounding_rectangle(points: ndarray) ndarray [source]
Finds the minimum bounding rectangle from convex hull vertices.
- Parameters
points (np.ndarray) – 2D array representing the convex hull vertices of the animal.
- Return np.ndarray
2D array representing the minimum bounding rectangle of the convex hull vertices of the animal.
Note
Modified from JesseBuesking. See simba.mixins.feature_extractors.perimeter_jit.jitted_hull() for computing the convex hull vertices. TODO: Place in numba njit.
- Example
>>> points = np.random.randint(1, 10, size=(10, 2))
>>> FeatureExtractionMixin.minimum_bounding_rectangle(points=points)
>>> [[10.7260274 , 3.39726027], [ 1.4109589 , -0.09589041], [-0.31506849, 4.50684932], [ 9., 8. ]]
- static windowed_frequentist_distribution_tests(data: ndarray, feature_name: str, fps: int) DataFrame [source]
Calculates feature value distributions and feature peak counts in 1-s sequential time-bins.
Computes (i) feature value distributions in 1-s sequential time-bins: Kolmogorov-Smirnov and T-tests; (ii) feature values against a normal distribution: Shapiro-Wilk; (iii) peak counts in rolling 1-s feature windows: scipy.signal.find_peaks.
- Parameters
data (np.ndarray) – Single feature 1D array
feature_name (str) – The name of the input feature.
fps (int) – The framerate of the video representing the data.
- Return pd.DataFrame
Of size len(data) x 4 with columns representing KS, T, Shapiro-Wilk, and peak count statistics.
- Example
>>> feature_data = np.random.randint(1, 10, size=(100))
>>> FeatureExtractionMixin.windowed_frequentist_distribution_tests(data=feature_data, fps=25, feature_name='Anima_1_velocity')
Geometry transformation methods
- class simba.mixins.geometry_mixin.GeometryMixin[source]
Bases:
object
Methods to perform geometry transformation of pose-estimation data. This includes creating bounding boxes, line objects, circles etc. from pose-estimated body-parts and computing metric representations of the relationships between created shapes or their attributes (sizes, distances etc.).
As of 01/24, very much a work in progress that relies heavily on shapely.
Note
These methods generally do not create visualizations - they mainly generate geometry data-objects or metrics. To create visualizations with geometries overlay on videos, pass returned shapes to simba.plotting.geometry_plotter.GeometryPlotter.
- static adjust_geometry_locations(geometries: List[Polygon], shift: Tuple[int, int], minimum: Optional[Tuple[int, int]] = (0, 0), maximum: Optional[Tuple[int, int]] = (inf, inf)) List[Polygon] [source]
Shift a set of geometries a specified distance in the x- and/or y-axis.
- Parameters
geometries (List[Polygon]) – List of input polygons to be adjusted.
shift (Tuple[int, int]) – Tuple specifying the shift distances in the x and y-axis.
minimum (Optional[Tuple[int, int]]) – Minimum allowed coordinates of Polygon points on the x and y axes. Default: (0, 0).
maximum (Optional[Tuple[int, int]]) – Maximum allowed coordinates of Polygon points on x and y axes. Default: (np.inf, np.inf).
- Return List[Polygon]
List of adjusted polygons.
- Example
>>> shapes = GeometryMixin().adjust_geometry_locations(geometries=shapes, shift=(0, 333))
- static area(shape: Union[MultiPolygon, Polygon], pixels_per_mm: float)[source]
Calculate the area of a geometry in square millimeters.
Note
If certain that the input data is a valid Polygon, consider using simba.feature_extractors.perimeter_jit.jitted_hull().
- Parameters
shape (Union[MultiPolygon, Polygon]) – The geometry (MultiPolygon or Polygon) for which to calculate the area.
pixels_per_mm (float) – The pixel-to-millimeter conversion factor.
- Return float
The area of the geometry in square millimeters.
- Example
>>> polygon = GeometryMixin().bodyparts_to_polygon(np.array([[10, 10], [10, 100], [100, 10], [100, 100]]))
>>> GeometryMixin().area(shape=polygon, pixels_per_mm=4.9)
>>> 1701.556313816644
- static bodyparts_to_circle(data: ndarray, parallel_offset: float, pixels_per_mm: Optional[int] = 1) Polygon [source]
Create a circle geometry from a single body-part (x,y) coordinate.
Note
For multiple frames, call this method using multiframe_bodyparts_to_circle().
- Parameters
data (np.ndarray) – The body-part coordinate xy as a 1d array. E.g., np.array([364, 308])
parallel_offset (float) – The radius of the resultant circle in millimeters.
pixels_per_mm (int) – The pixels per millimeter of the video. If not passed, 1 is used, meaning the radius is in pixels rather than millimeters.
- Return Polygon
Shapely Polygon of circular shape.
- Example
>>> data = np.array([364, 308])
>>> polygon = GeometryMixin().bodyparts_to_circle(data=data, parallel_offset=10, pixels_per_mm=4)
- static bodyparts_to_line(data: ndarray, buffer: Optional[int] = None, px_per_mm: Optional[float] = None) Union[Polygon, LineString] [source]
Convert body-part coordinates to a Linestring.
Note
If buffer and px_per_mm are provided, the returned object will be a LineString buffered into a 2D rectangle with the specified area.
- Example
>>> data = np.array([[364, 308],[383, 323], [403, 335],[423, 351]])
>>> line = GeometryMixin().bodyparts_to_line(data=data)
>>> line = GeometryMixin().bodyparts_to_line(data=data, buffer=10, px_per_mm=4)
- static bodyparts_to_multistring_skeleton(data: ndarray) MultiLineString [source]
Create a multistring skeleton from a 3d array where each 2d array represents start and end coordinates of a line within the skeleton.
- Parameters
data (np.ndarray) – A 3D numpy array where each 2D array represents the start position and end position of each LineString.
- Returns MultiLineString
Shapely MultiLineString representing animal skeleton.
- Example
>>> skeleton = np.array([[[5, 5], [1, 10]], [[5, 5], [9, 10]], [[9, 10], [1, 10]], [[9, 10], [9, 25]], [[1, 10], [1, 25]], [[9, 25], [5, 50]], [[1, 25], [5, 50]]])
>>> shape_multistring = GeometryMixin().bodyparts_to_multistring_skeleton(data=skeleton)
- static bodyparts_to_points(data: ndarray, buffer: Optional[int] = None, px_per_mm: Optional[int] = None) List[Union[Point, Polygon]] [source]
Convert body-part coordinates to Point geometries.
- Parameters
data (np.ndarray) – 2D array with body-part coordinates where rows are frames and columns are x and y coordinates.
buffer (Optional[int]) – If not None, then the area of the Point. Thus, if not None, then returns Polygons representing the Points.
px_per_mm (Optional[int]) – Pixels to millimeter conversion factor. Required if buffer is not None.
- Example
>>> data = np.random.randint(0, 100, (1, 2))
>>> GeometryMixin().bodyparts_to_points(data=data)
- static bodyparts_to_polygon(data: ndarray, cap_style: typing_extensions.Literal['round', 'square', 'flat'] = 'round', parallel_offset: int = 1, pixels_per_mm: int = 1, simplify_tolerance: float = 2, preserve_topology: bool = True) Polygon [source]
Convert body-part coordinates to a Polygon.
- Example
>>> data = [[[364, 308],[383, 323],[403, 335], [423, 351]]]
>>> GeometryMixin().bodyparts_to_polygon(data=data)
- static bucket_img_into_grid_hexagon(bucket_size_mm: float, img_size: Tuple[int, int], px_per_mm: float) Tuple[Dict[Tuple[int, int], Polygon], float] [source]
Bucketize an image into hexagons and return a dictionary of polygons representing the hexagon locations.
- Parameters
bucket_size_mm (float) – The width/height of each hexagon bucket in millimeters.
img_size (Tuple[int, int]) – Size of the image in pixels (width, height).
px_per_mm (float) – Pixels per millimeter conversion factor.
- Return Tuple[Dict[Tuple[int, int], Polygon], float]
First value is a dictionary where keys are (row, column) indices of the bucket, and values are Shapely Polygon objects representing the corresponding hexagon buckets. Second value is the aspect ratio of the hexagonal grid.
- Example
>>> polygons, aspect_ratio = GeometryMixin().bucket_img_into_grid_hexagon(bucket_size_mm=10, img_size=(800, 600), px_per_mm=5.0)
- static bucket_img_into_grid_points(point_distance: int, px_per_mm: float, img_size: Tuple[int, int], border_sites: Optional[bool] = True) Dict[Tuple[int, int], Point] [source]
Generate a grid of evenly spaced points within an image. Use for creating spatial markers within an arena.
- Parameters
point_distance (int) – Distance between adjacent points in millimeters.
px_per_mm (float) – Pixels per millimeter conversion factor.
img_size (Tuple[int, int]) – Size of the image in pixels (width, height).
border_sites (Optional[bool]) – If True, includes points on the border of the image. Default is True.
- Returns Dict[Tuple[int, int], Point]
Dictionary where keys are (row, column) indices of the point, and values are Shapely Point objects.
- Example
>>> GeometryMixin.bucket_img_into_grid_points(point_distance=20, px_per_mm=4, img_size=img.shape, border_sites=False)
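The spacing logic can be sketched in pure Python: convert the millimeter distance to pixels and step across the image, optionally dropping border rows and columns (`grid_points` is a simplified illustration, not the exact SimBA implementation):

```python
def grid_points(point_distance_mm, px_per_mm, width, height, border_sites=True):
    """Return {(row, col): (x, y)} of evenly spaced pixel coordinates in an image."""
    step = int(point_distance_mm * px_per_mm)          # spacing in pixels
    xs = list(range(0, width + 1, step))
    ys = list(range(0, height + 1, step))
    if not border_sites:                               # drop points on the image border
        xs = [x for x in xs if 0 < x < width]
        ys = [y for y in ys if 0 < y < height]
    return {(r, c): (x, y) for r, y in enumerate(ys) for c, x in enumerate(xs)}

# 20 mm spacing at 4 px/mm = one point every 80 px:
pts = grid_points(point_distance_mm=20, px_per_mm=4, width=320, height=160)
```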
- static bucket_img_into_grid_square(img_size: Iterable[int], bucket_grid_size_mm: Optional[float] = None, bucket_grid_size: Optional[Iterable[int]] = None, px_per_mm: Optional[float] = None, add_correction: Optional[bool] = True) Tuple[Dict[Tuple[int, int], Polygon], float] [source]
Bucketize an image into squares and return a dictionary of polygons representing the bucket locations.
- Parameters
img_size (Iterable[int]) – 2-value tuple, list or array representing the width and height of the image in pixels.
bucket_grid_size_mm (Optional[float]) – The width/height of each square bucket in millimeters. E.g., 50 will create 5cm by 5cm squares. If None, then buckets will be defined by the bucket_grid_size argument.
bucket_grid_size (Optional[Iterable[int]]) – 2-value tuple, list or array representing the grid as number of horizontal squares x number of vertical squares. If None, then buckets will be defined by the bucket_grid_size_mm argument.
px_per_mm (Optional[float]) – Pixels per millimeter conversion factor. Necessary if buckets are defined by the bucket_grid_size_mm argument.
add_correction (Optional[bool]) – If True, performs correction by adding extra columns or rows to cover any remaining space when using bucket_grid_size_mm. Default True.
- Example
>>> img = cv2.imread('/Users/simon/Desktop/Screenshot 2024-01-21 at 10.15.55 AM.png', 1)
>>> polygons = GeometryMixin().bucket_img_into_grid_square(bucket_grid_size=(10, 5), bucket_grid_size_mm=None, img_size=(img.shape[1], img.shape[0]), px_per_mm=5.0)
>>> for k, v in polygons[0].items(): cv2.polylines(img, [np.array(v.exterior.coords).astype(int)], True, (255, 0, 133), 2)
>>> cv2.imshow('img', img)
>>> cv2.waitKey()
- static buffer_shape(shape: Union[Polygon, LineString], size_mm: int, pixels_per_mm: float, cap_style: typing_extensions.Literal['round', 'square', 'flat'] = 'round') Polygon [source]
Create a buffered shape by applying a buffer operation to the input polygon or linestring.
- Parameters
shape (Union[Polygon, LineString]) – The input Polygon or LineString to be buffered.
size_mm (int) – The size of the buffer in millimeters. Use a negative value for an inward buffer.
pixels_per_mm (float) – The conversion factor from millimeters to pixels.
cap_style (Literal['round', 'square', 'flat']) – The cap style for the buffer. Valid values are ‘round’, ‘square’, or ‘flat’. Defaults to ‘round’.
- Return Polygon
The buffered shape.
- Example
>>> polygon = GeometryMixin().bodyparts_to_polygon(np.array([[100, 110],[100, 100],[110, 100],[110, 110]])) >>> buffered_polygon = GeometryMixin().buffer_shape(shape=polygon, size_mm=-1, pixels_per_mm=1)
- static compute_pct_shape_overlap(shapes: ndarray, denominator: Optional[typing_extensions.Literal['difference', 'shape_1', 'shape_2']] = 'difference') int [source]
Compute the percentage of overlap between two shapes.
- Parameters
shapes (np.ndarray) – A 2D array, where each sub-array has two Polygon or LineString shapes.
denominator (Optional[Literal['difference', 'shape_1', 'shape_2']]) – If difference, then percent overlap is calculated using the non-intersection area as denominator. If shape_1, percent overlap is calculated using the area of the first shape as denominator. If shape_2, percent overlap is calculated using the area of the second shape as denominator. Default: difference.
- Return int
The percentage of overlap between the two shapes, as an integer.
- Example
>>> polygon_1 = GeometryMixin().bodyparts_to_polygon(np.array([[364, 308],[383, 323],[403, 335],[423, 351]]))
>>> polygon_2 = GeometryMixin().bodyparts_to_polygon(np.array([[356, 307],[376, 319],[396, 331],[419, 347]]))
>>> polygon_1 = [polygon_1 for x in range(100)]
>>> polygon_2 = [polygon_2 for x in range(100)]
>>> data = np.column_stack((polygon_1, polygon_2))
>>> results = GeometryMixin.compute_pct_shape_overlap(shapes=data)
- static compute_shape_overlap(shapes: List[Union[Polygon, LineString]]) int [source]
Computes whether two geometrical shapes (Polygon or LineString) overlap or are disjoint.
Note
Only returns whether the two shapes are overlapping or not overlapping. If the amount of overlap is required, use GeometryMixin().compute_pct_shape_overlap().
- Parameters
shapes (List[Union[LineString, Polygon]]) – A list of two input Polygon or LineString shapes.
- Return int
Returns 1 if the two shapes overlap, otherwise returns 0.
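Conceptually this is an intersection predicate cast to an integer. For axis-aligned bounding boxes, the same binary test can be sketched without dependencies (`boxes_overlap` is a conceptual illustration of the predicate, not SimBA's Shapely-backed implementation):

```python
def boxes_overlap(box_1, box_2):
    """Return 1 if two (min_x, min_y, max_x, max_y) boxes overlap, else 0."""
    disjoint = (box_1[2] < box_2[0] or box_2[2] < box_1[0] or
                box_1[3] < box_2[1] or box_2[3] < box_1[1])
    return 0 if disjoint else 1

print(boxes_overlap((0, 0, 10, 10), (5, 5, 20, 20)))    # overlapping boxes -> 1
print(boxes_overlap((0, 0, 10, 10), (20, 20, 30, 30)))  # disjoint boxes -> 0
```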
- static contours_to_geometries(contours: List[ndarray], force_rectangles: Optional[bool] = True) List[Polygon] [source]
Convert a list of contours to a list of geometries.
E.g., convert a list of contours detected with ImageMixin.find_contours to a list of Shapely geometries that can be used within the GeometryMixin.
contours (List[np.ndarray]) – List of contours represented as 2D arrays.
force_rectangles (Optional[bool]) – If True, then force the resulting geometries to be rectangular.
- Return List[Polygon]
List of Shapely Polygons.
- Example
>>> video_frm = read_frm_of_video(video_path='/Users/simon/Desktop/envs/platea_featurizer/data/video/3D_Mouse_5-choice_MouseTouchBasic_s9_a4_grayscale.mp4')
>>> contours = ImageMixin.find_contours(img=video_frm)
>>> GeometryMixin.contours_to_geometries(contours=contours)
- static crosses(shapes: List[LineString]) bool [source]
Check if two LineString objects cross each other.
- Parameters
shapes (List[LineString]) – A list containing two LineString objects.
- Return bool
True if the LineStrings cross each other, False otherwise.
- Example
>>> line_1 = GeometryMixin().bodyparts_to_line(np.array([[10, 10],[20, 10],[30, 10],[40, 10]]))
>>> line_2 = GeometryMixin().bodyparts_to_line(np.array([[25, 5],[25, 20],[25, 30],[25, 40]]))
>>> GeometryMixin().crosses(shapes=[line_1, line_2])
>>> True
- cumsum_bool_geometries(data: ndarray, geometries: Dict[Tuple[int, int], Polygon], bool_data: ndarray, fps: Optional[float] = None, core_cnt: Optional[int] = -1) ndarray [source]
Compute the cumulative sums of boolean events within polygon geometries over time using multiprocessing.
E.g., compute the cumulative time of classified events within spatial locations at all time-points of the video.
- Parameters
data (np.ndarray) – Array containing spatial data with shape (n, 2). E.g., 2D-array with body-part coordinates.
geometries (Dict[Tuple[int, int], Polygon]) – Dictionary of polygons representing spatial regions. Created by GeometryMixin.bucket_img_into_grid_square.
bool_data (np.ndarray) – Boolean array with shape (data.shape[0],) or (data.shape[0], 1) indicating the presence or absence in each frame.
fps (Optional[float]) – Frames per second. If provided, the result is normalized by the frame rate.
core_cnt (Optional[int]) – Number of CPU cores to use for parallel processing. Default is -1, which means using all available cores.
- Returns np.ndarray
Array of size (frames x horizontal bins x vertical bins) with times in seconds (if fps passed) or frames (if fps not passed)
- Example
>>> geometries = GeometryMixin.bucket_img_into_grid_square(bucket_grid_size_mm=50, img_size=(800, 800), px_per_mm=5.0)[0]
>>> coord_data = np.random.randint(0, 800, (500, 2))
>>> bool_data = np.random.randint(0, 2, (500,))
>>> x = GeometryMixin().cumsum_bool_geometries(data=coord_data, geometries=geometries, bool_data=bool_data, fps=15)
>>> x.shape
>>> (500, 4, 4)
- cumsum_coord_geometries(data: ndarray, geometries: Dict[Tuple[int, int], Polygon], fps: Optional[int] = None, core_cnt: Optional[int] = -1, verbose: Optional[bool] = True)[source]
Compute the cumulative time a body-part has spent inside a grid of geometries using multiprocessing.
- Parameters
data (np.ndarray) – Input data array where rows represent frames and columns represent body-part x and y coordinates.
geometries (Dict[Tuple[int, int], Polygon]) – Dictionary of polygons representing spatial regions. Created by GeometryMixin.bucket_img_into_grid_square.
fps (Optional[int]) – Frames per second (fps) for time normalization. If None, cumulative sum of frame count is returned.
- Example
>>> img_geometries = GeometryMixin.bucket_img_into_grid_square(img_size=(640, 640), bucket_grid_size=(10, 10), px_per_mm=1)
>>> bp_arr = np.random.randint(0, 640, (5000, 2))
>>> geo_data = GeometryMixin().cumsum_coord_geometries(data=bp_arr, geometries=img_geometries[0], verbose=False, fps=1)
- static delaunay_triangulate_keypoints(data: ndarray) List[Polygon] [source]
Triangulates a set of 2D keypoints. E.g., use to polygonize an animal hull, or triangulate a gridpoint arena.
This method takes a 2D numpy array representing a set of keypoints and triangulates them using the Delaunay triangulation algorithm. The input array should have two columns corresponding to the x and y coordinates of the keypoints.
- Parameters
data (np.ndarray) – NumPy array of body part coordinates. Each subarray represents the coordinates of a body part.
- Returns List[Polygon]
A list of Polygon objects representing the triangles formed by the Delaunay triangulation.
- Example
>>> data = np.array([[126, 122],[152, 116],[136, 85],[167, 172],[161, 206],[197, 193],[191, 237]])
>>> triangulated_hull = GeometryMixin().delaunay_triangulate_keypoints(data=data)
- static difference(shapes=typing.List[typing.Union[shapely.geometry.linestring.LineString, shapely.geometry.polygon.Polygon, shapely.geometry.multipolygon.MultiPolygon]]) Polygon [source]
Calculate the difference between a shape and one or more potentially overlapping shapes.
- Parameters
shapes (List[Union[LineString, Polygon, MultiPolygon]]) – A list of geometries.
- Returns
The first geometry in shapes is returned with all parts that overlap the other geometries in shapes removed.
- Example
>>> polygon_1 = GeometryMixin().bodyparts_to_polygon(np.array([[10, 10], [10, 100], [100, 10], [100, 100]]))
>>> polygon_2 = GeometryMixin().bodyparts_to_polygon(np.array([[25, 25],[25, 75],[90, 25],[90, 75]]))
>>> polygon_3 = GeometryMixin().bodyparts_to_polygon(np.array([[1, 25],[1, 75],[110, 25],[110, 75]]))
>>> difference = GeometryMixin().difference(shapes=[polygon_1, polygon_2, polygon_3])
- static extend_line_to_bounding_box_edges(line_points: ndarray, bounding_box: ndarray) ndarray [source]
Jitted helper to extend a line segment defined by two points to fit within a bounding box.
- Parameters
line_points (np.ndarray) – Coordinates of the line segment’s two points. Two rows and each row represents a point (x, y).
bounding_box (np.ndarray) – Bounding box coordinates in the format (min_x, min_y, max_x, max_y).
- Returns np.ndarray
Intersection points where the extended line crosses the bounding box edges. The shape of the array is (2, 2), where each row represents a point (x, y).
- Example
>>> line_points = np.array([[25, 25], [45, 25]]).astype(np.float32)
>>> bounding_box = np.array([0, 0, 50, 50]).astype(np.float32)
>>> GeometryMixin().extend_line_to_bounding_box_edges(line_points, bounding_box)
>>> [[ 0. 25.] [50. 25.]]
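The extension logic can be sketched with the line's parametric form: compute the slope and intersect the infinite line with each box edge, keeping crossings that land on the box boundary (`extend_line_to_bbox` is a simplified pure-Python sketch, not the jitted SimBA code):

```python
def extend_line_to_bbox(p1, p2, bbox):
    """Intersect the infinite line through p1, p2 with box (min_x, min_y, max_x, max_y)."""
    min_x, min_y, max_x, max_y = bbox
    (x1, y1), (x2, y2) = p1, p2
    pts = []
    if x1 == x2:                           # vertical line: hits top and bottom edges
        pts = [(x1, min_y), (x1, max_y)]
    else:
        slope = (y2 - y1) / (x2 - x1)
        for x in (min_x, max_x):           # crossings of left/right edges
            y = y1 + slope * (x - x1)
            if min_y <= y <= max_y:
                pts.append((x, y))
        if slope != 0:                     # crossings of bottom/top edges
            for y in (min_y, max_y):
                x = x1 + (y - y1) / slope
                if min_x <= x <= max_x:
                    pts.append((x, y))
    uniq = sorted(set(pts))                # deduplicate corner hits
    return uniq[:2] if len(uniq) > 2 else uniq

# The horizontal line through (25, 25), (45, 25) extends to the left/right edges:
print(extend_line_to_bbox((25, 25), (45, 25), (0, 0, 50, 50)))
```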
- static filter_low_p_bps_for_shapes(x: ndarray, p: ndarray, threshold: float)[source]
Filter body-part data for geometry construction while maintaining valid geometry arrays.
Given a 3D array representing body-parts across time, and a second 3D array representing probabilities of those body-parts across time, we want to “remove” body-parts with low detection probabilities while keeping the array sizes intact and suitable for geometry construction. To do this, we find body-parts with detection probabilities below the threshold and replace them with a body-part in the same frame that does not fall below the threshold. However, constructing a geometry requires >= 3 unique key-point locations. Thus, no substitution can be made when there are fewer than three unique body-part locations above the threshold within a frame.
- Example
>>> x = np.random.randint(0, 500, (18000, 7, 2))
>>> p = np.random.random(size=(18000, 7, 1))
>>> x = GeometryMixin.filter_low_p_bps_for_shapes(x=x, p=p, threshold=0.1)
>>> x = x.reshape(x.shape[0], int(x.shape[1] * 2))
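The substitution strategy described above can be sketched in NumPy: for each frame, body-parts below the threshold are replaced by a confident body-part, provided at least three unique above-threshold locations exist (`substitute_low_p` is an illustrative sketch, not the SimBA implementation):

```python
import numpy as np

def substitute_low_p(x, p, threshold):
    """x: (frames, body-parts, 2) coordinates; p: (frames, body-parts, 1) probabilities."""
    out = x.copy()
    for i in range(x.shape[0]):
        good = np.where(p[i, :, 0] >= threshold)[0]
        if len(np.unique(x[i, good], axis=0)) < 3:
            continue                          # fewer than 3 unique confident points: leave frame as-is
        bad = np.where(p[i, :, 0] < threshold)[0]
        out[i, bad] = x[i, good[0]]           # substitute with a confident body-part
    return out

x = np.array([[[0, 0], [10, 0], [0, 10], [99, 99]]], dtype=float)
p = np.array([[[0.9], [0.9], [0.9], [0.05]]])
res = substitute_low_p(x, p, threshold=0.1)   # the low-probability (99, 99) point is replaced
```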
- static geometry_contourcomparison(imgs: List[Union[ndarray, Tuple[VideoCapture, int]]], geometry: Optional[Polygon] = None, method: Optional[typing_extensions.Literal['all', 'exterior']] = 'all', canny: Optional[bool] = True) float [source]
Compare contours between a geometry in two images using shape matching.
Important
If there is non-pose-related noise in the environment (e.g., non-experiment-related intermittent light or shade sources that go on and off), this will negatively affect the reliability of contour comparisons.
Used to pick up very subtle changes around pose-estimated body-part locations.
- Parameters
imgs (List[Union[np.ndarray, Tuple[cv2.VideoCapture, int]]]) – List of two input images. Can either be two images in numpy array format OR two tuples with a cv2.VideoCapture object and a frame index.
geometry (Optional[Polygon]) – If a Polygon, then the geometry within the two images to compare. If None, then the entire images are compared.
method (Literal['all', 'exterior']) – The method used for contour comparison.
canny (Optional[bool]) – If True, applies Canny edge detection before contour comparison. Helps reduce noise and enhance contours. Default is True.
- Returns float
Contour matching score between the two images. Lower scores indicate higher similarity.
- Example
>>> img_1 = cv2.imread('/Users/simon/Desktop/envs/troubleshooting/Emergence/project_folder/videos/Example_1_frames/1978.png').astype(np.uint8)
>>> img_2 = cv2.imread('/Users/simon/Desktop/envs/troubleshooting/Emergence/project_folder/videos/Example_1_frames/1977.png').astype(np.uint8)
>>> data = pd.read_csv('/Users/simon/Desktop/envs/troubleshooting/Emergence/project_folder/csv/outlier_corrected_movement_location/Example_1.csv', nrows=1, usecols=['Nose_x', 'Nose_y']).fillna(-1).values.astype(np.int64)
>>> geometry = GeometryMixin().bodyparts_to_circle(data[0, :], 100)
>>> GeometryMixin().geometry_contourcomparison(imgs=[img_1, img_2], geometry=geometry, canny=True, method='exterior')
>>> 22.54
- static geometry_histocomparison(imgs: List[Union[ndarray, Tuple[VideoCapture, int]]], geometry: Optional[Polygon] = None, method: Optional[typing_extensions.Literal['chi_square', 'correlation', 'intersection', 'bhattacharyya', 'hellinger', 'chi_square_alternative', 'kl_divergence']] = 'correlation', absolute: Optional[bool] = True) float [source]
Retrieve histogram similarities within a geometry inside two images.
For example, the polygon may represent an area around a rodents head. While the front paws are not pose-estimated, computing the histograms of the geometry in two sequential images gives indication of non-freezing.
Important
If there is non-pose-related noise in the environment (e.g., non-experiment-related light sources that go on and off, or waving window curtains causing changes in histogram values without affecting pose), this will negatively affect the reliability of histogram comparisons.
- Parameters
imgs (List[Union[np.ndarray, Tuple[cv2.VideoCapture, int]]]) – List of two input images. Can either be two images in numpy array format OR two tuples with a cv2.VideoCapture object and a frame index.
geometry (Optional[Polygon]) – If a Polygon, then the geometry within the two images to compare. If None, then the entire images are compared.
method (Literal['correlation', 'chi_square']) – The method used for comparison. E.g., if correlation, then small output values suggest large differences between the current versus prior image. If chi_square, then large output values suggest large differences between the geometries.
absolute (Optional[bool]) – If True, returns the absolute difference between the two histograms. If False, returns (image2 histogram) - (image1 histogram).
- Return float
Value representing the histogram similarities between the geometry in the two images.
- Example
>>> img_1 = cv2.imread('/Users/simon/Desktop/envs/troubleshooting/Emergence/project_folder/videos/Example_1_frames/1.png')
>>> img_2 = cv2.imread('/Users/simon/Desktop/envs/troubleshooting/Emergence/project_folder/videos/Example_1_frames/2.png')
>>> data_path = '/Users/simon/Desktop/envs/troubleshooting/Emergence/project_folder/csv/outlier_corrected_movement_location/Example_1.csv'
>>> data = pd.read_csv(data_path, nrows=1, usecols=['Nose_x', 'Nose_y']).fillna(-1).values.astype(np.int64)
>>> polygon = GeometryMixin().bodyparts_to_circle(data[0], 100)
>>> GeometryMixin().geometry_histocomparison(imgs=[img_1, img_2], geometry=polygon, method='correlation')
>>> 0.9999769684923543
>>> img_2 = cv2.imread('/Users/simon/Desktop/envs/troubleshooting/Emergence/project_folder/videos/Example_1_frames/41411.png')
>>> GeometryMixin().geometry_histocomparison(imgs=[img_1, img_2], geometry=polygon, method='correlation')
>>> 0.6732792208872572
>>> img_1 = (cv2.VideoCapture('/Users/simon/Desktop/envs/troubleshooting/Emergence/project_folder/videos/Example_1.mp4'), 1)
>>> img_2 = (cv2.VideoCapture('/Users/simon/Desktop/envs/troubleshooting/Emergence/project_folder/videos/Example_1.mp4'), 2)
>>> GeometryMixin().geometry_histocomparison(imgs=[img_1, img_2], geometry=polygon, method='correlation')
>>> 0.9999769684923543
- static geometry_video(shapes: List[List[Union[LineString, Polygon, MultiPolygon, MultiLineString, MultiPoint]]], save_path: Union[str, PathLike], size: Optional[Tuple[int]], fps: Optional[int] = 10, verbose: Optional[bool] = False, bg_img: Optional[ndarray] = None, bg_clr: Optional[Tuple[int]] = None) None [source]
Helper to create a geometry video from a list of shapes.
Note
If more aesthetic videos are needed, overlaid on video, then use simba.plotting.geometry_plotter.GeometryPlotter.
If single images of geometries are needed, then use simba.mixins.geometry_mixin.view_shapes.
- Parameters
shapes (List[List[Union[LineString, Polygon, MultiPolygon, MultiPoint, MultiLineString]]]) – List of lists containing geometric shapes to be included in the video. Each sublist represents a frame, and each element within the sublist represents a shape for that frame.
save_path (Union[str, os.PathLike]) – Path where the resulting video will be saved.
size (Optional[Tuple[int]]) – Tuple specifying the size of the output video in pixels (width, height).
fps (Optional[int]) – Frames per second of the output video. Defaults to 10.
verbose (Optional[bool]) – If True, then prints progress frame-by-frame. Default: False.
bg_img (Optional[np.ndarray]) – Background image to be used as the canvas for drawing shapes. Defaults to None. Could be e.g., a low opacity image of the arena.
bg_clr (Optional[Tuple[int]]) – Background color specified as a tuple of RGB values. Defaults to white.
- static get_center(shape: Union[LineString, Polygon, MultiPolygon]) ndarray [source]
Compute the center of a geometry.
- Example
>>> multipolygon = MultiPolygon([Polygon([[200, 110],[200, 100],[200, 100],[200, 110]]), Polygon([[70, 70],[70, 60],[10, 50],[1, 70]])])
>>> GeometryMixin().get_center(shape=multipolygon)
>>> [33.96969697, 62.32323232]
- static get_geometry_brightness_intensity(img: Union[ndarray, Tuple[VideoCapture, int]], geometries: List[Union[Polygon, ndarray]], ignore_black: Optional[bool] = True) ndarray [source]
Calculate the average brightness intensity within a geometry region-of-interest of an image.
E.g., can be used with hardcoded thresholds or kmeans models (simba.mixins.statistics_mixin.Statistics.kmeans_1d) to detect if a light source is in an ON or OFF state.
- Parameters
img (np.ndarray) – Either an image in numpy array format OR a tuple with cv2.VideoCapture object and the frame index.
geometries (List[Union[Polygon, np.ndarray]]) – A list of shapes either as vertices in a numpy array, or as shapely Polygons.
ignore_black (Optional[bool]) – If non-rectangular geometries, then pixels that don’t belong to the geometry are masked in black. If True, then these pixels will be ignored when computing averages.
- Example
>>> img = cv2.imread('/Users/simon/Desktop/envs/troubleshooting/Emergence/project_folder/videos/Example_1_frames/1.png').astype(np.uint8)
>>> data_path = '/Users/simon/Desktop/envs/troubleshooting/Emergence/project_folder/csv/outlier_corrected_movement_location/Example_1.csv'
>>> data = pd.read_csv(data_path, usecols=['Nose_x', 'Nose_y']).sample(n=3).fillna(1).values.astype(np.int64)
>>> geometries = []
>>> for frm_data in data: geometries.append(GeometryMixin().bodyparts_to_circle(frm_data, 100))
>>> GeometryMixin().get_geometry_brightness_intensity(img=img, geometries=geometries, ignore_black=False)
>>> [125.0, 113.0, 118.0]
- static hausdorff_distance(geometries: List[List[Union[Polygon, LineString]]]) ndarray [source]
The Hausdorff distance measures the similarity between sequential time-series geometries. It is defined as the maximum of the distances from each point in one set to the nearest point in the other set.
Hausdorff distance can be used to measure the similarity of the geometry in one frame relative to the geometry in the next frame. Larger values indicate that the animal has a different shape than in the preceding frame.
- Parameters
geometries (List[List[Union[Polygon, LineString]]]) – List of list where each list has two geometries.
- Return np.ndarray
1D array of Hausdorff distances of the geometries in each list.
- Example
>>> x = Polygon([[0,1], [0, 2], [1,1]])
>>> y = Polygon([[0,1], [0, 2], [0,1]])
>>> GeometryMixin.hausdorff_distance(geometries=[[x, y]])
>>> [1.]
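For discrete point sets, the symmetric Hausdorff distance can be computed by brute force in NumPy (`hausdorff` is an illustrative helper; SimBA's method operates on Shapely geometries):

```python
import numpy as np

def hausdorff(a, b):
    """Symmetric Hausdorff distance between two (n, 2) point arrays."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)   # pairwise distance matrix
    # max over directed distances a -> b and b -> a:
    return max(d.min(axis=1).max(), d.min(axis=0).max())

a = np.array([[0.0, 1.0], [0.0, 2.0], [1.0, 1.0]])
b = np.array([[0.0, 1.0], [0.0, 2.0]])
print(hausdorff(a, b))   # 1.0: the vertex (1, 1) is distance 1 from its nearest neighbour
```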
- static is_containing(shapes=typing.List[typing.Union[shapely.geometry.polygon.Polygon, shapely.geometry.linestring.LineString]]) bool [source]
Check if one geometry fully contains another.
- Example
- static is_shape_covered(shapes: List[Union[LineString, Polygon, MultiPolygon, MultiPoint]]) bool [source]
Check if one geometry fully covers another.
- Parameters
shapes (Union[LineString, Polygon, MultiPolygon, MultiPoint]) – List of 2 geometries, checks if the second geometry fully covers the first geometry.
- Return bool
True if the second geometry fully covers the first geometry, otherwise False.
- Example
>>> polygon_1 = GeometryMixin().bodyparts_to_polygon(np.array([[10, 10], [10, 100], [100, 10], [100, 100]]))
>>> polygon_2 = GeometryMixin().bodyparts_to_polygon(np.array([[25, 25], [25, 75], [90, 25], [90, 75]]))
>>> GeometryMixin().is_shape_covered(shapes=[polygon_2, polygon_1])
>>> True
- static is_touching(shapes=typing.List[typing.Union[shapely.geometry.polygon.Polygon, shapely.geometry.linestring.LineString]]) bool [source]
Check if two geometries touch each other.
Note
Different from GeometryMixin().crosses: Touches requires a common boundary, and does not require the sharing of interior space.
- Parameters
shapes (List[Union[LineString, Polygon]]) – A list containing two LineString or Polygon geometries.
- Return bool
True if the geometries touch each other, False otherwise.
- Example
>>> rectangle_1 = Polygon(np.array([[0, 0], [10, 10], [0, 10], [10, 0]]))
>>> rectangle_2 = Polygon(np.array([[20, 20], [30, 30], [20, 30], [30, 20]]))
>>> GeometryMixin().is_touching(shapes=[rectangle_1, rectangle_2])
>>> False
- static length(shape: Union[LineString, MultiLineString], pixels_per_mm: float, unit: typing_extensions.Literal['mm', 'cm', 'dm', 'm'] = 'mm') float [source]
Calculate the length of a LineString geometry.
- Parameters
shape (LineString) – The LineString geometry for which the length is to be calculated.
pixels_per_mm (float) – The pixel-to-millimeter conversion factor.
unit (Literal['mm', 'cm', 'dm', 'm']) – The desired unit for the length measurement (‘mm’, ‘cm’, ‘dm’, ‘m’).
- Return float
The length of the LineString geometry in the specified unit.
- Example
>>> line_1 = GeometryMixin().bodyparts_to_line(np.array([[10, 70],[20, 60],[30, 50],[40, 70]]))
>>> GeometryMixin().length(shape=line_1, pixels_per_mm=1.0)
>>> 50.6449510224598
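The conversion reduces to summing segment lengths in pixels, dividing by `pixels_per_mm`, then scaling to the requested unit (`line_length` is a sketch of the arithmetic, not SimBA's Shapely-backed code):

```python
import math

UNIT_SCALE = {'mm': 1.0, 'cm': 10.0, 'dm': 100.0, 'm': 1000.0}

def line_length(points, pixels_per_mm, unit='mm'):
    """Length of a polyline given as [(x, y), ...] pixel coordinates."""
    px = sum(math.dist(points[i], points[i + 1]) for i in range(len(points) - 1))
    return (px / pixels_per_mm) / UNIT_SCALE[unit]

# Same body-part path as the example above, at 1.0 px/mm:
pts = [(10, 70), (20, 60), (30, 50), (40, 70)]
print(line_length(pts, pixels_per_mm=1.0))   # 50.6449510224598
```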
- static line_split_bounding_box(intersections: ndarray, bounding_box: ndarray) GeometryCollection [source]
Split a bounding box into two parts using an extended line.
Note
The extended line can be found from body-parts using GeometryMixin().extend_line_to_bounding_box_edges.
- Parameters
intersections (np.ndarray) – Intersection points where the extended line crosses the bounding box edges. The shape of the array is (2, 2), where each row represents a point (x, y).
bounding_box (np.ndarray) – Bounding box coordinates in the format (min_x, min_y, max_x, max_y).
- Returns GeometryCollection
A collection of polygons resulting from splitting the bounding box with the extended line.
- Example
>>> line_points = np.array([[25, 25], [45, 25]]).astype(np.float32)
>>> bounding_box = np.array([0, 0, 50, 50]).astype(np.float32)
>>> intersection_points = GeometryMixin().extend_line_to_bounding_box_edges(line_points, bounding_box)
>>> GeometryMixin().line_split_bounding_box(intersections=intersection_points, bounding_box=bounding_box)
- static linear_frechet_distance(x: ndarray, y: ndarray, sample: int = 100) float [source]
Compute the Linear Fréchet Distance between two trajectories.
The Fréchet Distance measures the dissimilarity between two continuous curves or trajectories represented as sequences of points in a 2-dimensional space.
- Parameters
x (np.ndarray) – First 2D array of size len(frames) representing body-part coordinates x and y.
y (np.ndarray) – Second 2D array of size len(frames) representing body-part coordinates x and y.
sample (int) – The downsampling factor for the trajectories (default is 100). If sample > 1, the trajectories are downsampled by selecting every sample-th point.
Note
Slightly modified from João Paulo Figueira
- Example
>>> x = np.random.randint(0, 100, (10000, 2)).astype(np.float32)
>>> y = np.random.randint(0, 100, (10000, 2)).astype(np.float32)
>>> distance = GeometryMixin.linear_frechet_distance(x=x, y=y, sample=100)
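A minimal discrete Fréchet distance via dynamic programming (`discrete_frechet` is illustrative; the SimBA method additionally downsamples the trajectories via the `sample` argument):

```python
import numpy as np

def discrete_frechet(p, q):
    """Discrete Fréchet distance between (n, 2) and (m, 2) trajectories."""
    n, m = len(p), len(q)
    d = np.linalg.norm(p[:, None, :] - q[None, :, :], axis=2)  # pairwise distances
    ca = np.full((n, m), -1.0)                                 # coupling table
    ca[0, 0] = d[0, 0]
    for i in range(1, n):
        ca[i, 0] = max(ca[i - 1, 0], d[i, 0])
    for j in range(1, m):
        ca[0, j] = max(ca[0, j - 1], d[0, j])
    for i in range(1, n):
        for j in range(1, m):
            ca[i, j] = max(min(ca[i - 1, j], ca[i - 1, j - 1], ca[i, j - 1]), d[i, j])
    return ca[-1, -1]

# Two parallel trajectories offset by 1 unit have Fréchet distance 1.0:
p = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
q = np.array([[0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
print(discrete_frechet(p, q))
```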
- static locate_line_point(path: Union[LineString, ndarray], geometry: Union[LineString, Polygon, Point], px_per_mm: Optional[float] = 1, fps: Optional[float] = 1, core_cnt: Optional[int] = -1, distance_min: Optional[bool] = True, time_prior: Optional[bool] = True) Dict[str, float] [source]
Compute the time and distance travelled along a path to reach the most proximal point in reference to a second geometry.
Note
To compute the time and distance travelled along a path to reach the most distal point relative to a second geometry, pass distance_min = False.
To compute the time and distance travelled along a path after reaching the most distal or proximal point relative to a second geometry, pass time_prior = False.
- Example
>>> line = LineString([[10, 10], [7.5, 7.5], [15, 15], [7.5, 7.5]])
>>> polygon = Polygon([[0, 5], [0, 0], [5, 0], [5, 5]])
>>> GeometryMixin.locate_line_point(path=line, geometry=polygon)
>>> {'distance_value': 3.5355339059327378, 'distance_travelled': 3.5355339059327378, 'time_travelled': 1.0, 'distance_index': 1}
- static minimum_rotated_rectangle(shape=<class 'shapely.geometry.polygon.Polygon'>) Polygon [source]
Calculate the minimum rotated rectangle that bounds a given polygon.
The minimum rotated rectangle, also known as the minimum bounding rectangle (MBR) or oriented bounding box (OBB), is the smallest rectangle that can fully contain a given polygon or set of points while allowing rotation. It is defined by its center, dimensions (length and width), and rotation angle.
- Parameters
shape (Polygon) – The Polygon for which the minimum rotated rectangle is to be calculated.
- Return Polygon
The minimum rotated rectangle geometry that bounds the input polygon.
- Example
>>> polygon = GeometryMixin().bodyparts_to_polygon(np.array([[364, 308],[383, 323],[403, 335],[423, 351]]))
>>> rectangle = GeometryMixin().minimum_rotated_rectangle(shape=polygon)
- static multiframe_bodypart_to_point(data: ndarray, core_cnt: Optional[int] = -1, buffer: Optional[int] = None, px_per_mm: Optional[int] = None) Union[List[Point], List[List[Point]]] [source]
Process multiple frames of body part data in parallel and convert them to shapely Points.
This function takes multi-frame body-part data represented as an array and converts it into points. It utilizes multiprocessing for parallel processing.
- Parameters
data (np.ndarray) – 2D or 3D array with body-part coordinates where rows are frames and columns are x and y coordinates.
core_cnt (Optional[int]) – The number of cores to use. If -1, then all available cores.
buffer (Optional[int]) – If not None, then the area of the Point. Thus, if not None, then returns Polygons representing the Points.
px_per_mm (Optional[int]) – Pixels to millimeter conversion factor. Required if buffer is not None.
- Returns Union[List[Point], List[List[Point]]]
If input is a 2D array, then list of Points. If 3D array, then list of list of Points.
Note
If buffer and px_per_mm are not None, then the points will be buffered into 2D polygons with the specified buffered area. If buffer is provided, then also provide px_per_mm for an accurate conversion factor between pixels and millimeters.
- Example
>>> data = np.random.randint(0, 100, (100, 2))
>>> points_lst = GeometryMixin().multiframe_bodypart_to_point(data=data, buffer=10, px_per_mm=4)
>>> data = np.random.randint(0, 100, (10, 10, 2))
>>> point_lst_of_lst = GeometryMixin().multiframe_bodypart_to_point(data=data)
- multiframe_bodyparts_to_circle(data: ndarray, parallel_offset: int = 1, core_cnt: int = -1, pixels_per_mm: Optional[int] = 1) List[Polygon] [source]
Convert a set of pose-estimated key-points to circles with specified radius using multiprocessing.
- Parameters
data (np.ndarray) – The body-part xy coordinates as a 2D array where rows are frames and columns represent x and y coordinates. E.g., np.array([[364, 308], [369, 309]])
parallel_offset (int) – The radius of the resultant circle in millimeters.
core_cnt (int) – Number of CPU cores to use. Defaults to -1 meaning all available cores will be used.
pixels_per_mm (int) – The pixels per millimeter of the video. If not passed, 1 will be used, meaning the radius is interpreted in pixels rather than millimeters.
- Returns List[Polygon]
List of shapely Polygons of circular shape, of length data.shape[0].
- Example
>>> data = np.random.randint(0, 100, (100, 2))
>>> circles = GeometryMixin().multiframe_bodyparts_to_circle(data=data)
- multiframe_bodyparts_to_line(data: ndarray, buffer: Optional[int] = None, px_per_mm: Optional[float] = None, core_cnt: Optional[int] = -1) List[LineString] [source]
Convert multiframe body-parts data to a list of LineString objects using multiprocessing.
- Parameters
data (np.ndarray) – Input array representing multiframe body-parts data. It should be a 3D array with dimensions (frames, points, coordinates).
buffer (Optional[int]) – If not None, then the linestring will be expanded into a 2D geometry polygon with area buffer.
px_per_mm (Optional[float]) – If buffer is not None, then provide the pixels to millimeter conversion factor.
core_cnt (Optional[int]) – Number of CPU cores to use for parallel processing. If set to -1, the function will automatically determine the available core count.
- Return List[LineString]
A list of LineString objects representing the body-parts trajectories.
- Example
>>> data = np.random.randint(0, 100, (100, 2))
>>> data = data.reshape(50, -1, data.shape[1])
>>> lines = GeometryMixin().multiframe_bodyparts_to_line(data=data)
- multiframe_bodyparts_to_multistring_skeleton(data_df: DataFrame, skeleton: Iterable[str], core_cnt: Optional[int] = -1, verbose: Optional[bool] = False, video_name: Optional[bool] = False, animal_names: Optional[bool] = False) List[Union[LineString, MultiLineString]] [source]
Convert body parts to LineString skeleton representations in a video using multiprocessing.
- Parameters
data_df (pd.DataFrame) – Pose-estimation data.
skeleton (Iterable[str]) – Iterable of body part pairs defining the skeleton structure. Eg., [[‘Center’, ‘Lat_left’], [‘Center’, ‘Lat_right’], [‘Center’, ‘Nose’], [‘Center’, ‘Tail_base’]]
core_cnt (Optional[int]) – Number of CPU cores to use for parallel processing. Default is -1, which uses all available cores.
verbose (Optional[bool]) – If True, print progress information during computation. Default is False.
video_name (Optional[bool]) – If True, include video name in progress information. Default is False.
animal_names (Optional[bool]) – If True, include animal names in progress information. Default is False.
- Return List[Union[LineString, MultiLineString]]
List of LineString or MultiLineString objects representing the computed skeletons.
- Example
>>> df = pd.read_csv('/Users/simon/Desktop/envs/troubleshooting/Rat_NOR/project_folder/csv/machine_results/08102021_DOT_Rat7_8(2).csv', nrows=500).fillna(0).astype(int)
>>> skeleton = [['Center', 'Lat_left'], ['Center', 'Lat_right'], ['Center', 'Nose'], ['Center', 'Tail_base'], ['Lat_left', 'Tail_base'], ['Lat_right', 'Tail_base'], ['Nose', 'Ear_left'], ['Nose', 'Ear_right'], ['Ear_left', 'Lat_left'], ['Ear_right', 'Lat_right']]
>>> geometries = GeometryMixin().multiframe_bodyparts_to_multistring_skeleton(data_df=df, skeleton=skeleton, core_cnt=2, verbose=True)
- multiframe_bodyparts_to_polygon(data: ndarray, video_name: Optional[str] = None, animal_name: Optional[str] = None, verbose: Optional[bool] = False, cap_style: Optional[typing_extensions.Literal['round', 'square', 'flat']] = 'round', parallel_offset: Optional[int] = 1, pixels_per_mm: Optional[float] = None, simplify_tolerance: Optional[float] = 2, preserve_topology: bool = True, core_cnt: int = -1) List[Polygon] [source]
Convert multidimensional NumPy array representing body part coordinates to a list of Polygons.
- Parameters
data (np.ndarray) – NumPy array of body part coordinates. Each subarray represents the coordinates of a body part.
cap_style (Literal['round', 'square', 'flat']) – Style of line cap for parallel offset. Options: ‘round’, ‘square’, ‘flat’.
parallel_offset (int) – Offset distance for parallel lines. Default is 1.
simplify_tolerance (float) – Tolerance parameter for simplifying geometries. Default is 2.
- Example
>>> data = np.array([[[364, 308], [383, 323], [403, 335], [423, 351]], [[356, 307], [376, 319], [396, 331], [419, 347]]])
>>> GeometryMixin().multiframe_bodyparts_to_polygon(data=data)
- multiframe_compute_pct_shape_overlap(shape_1: List[Polygon], shape_2: List[Polygon], core_cnt: Optional[int] = -1, video_name: Optional[str] = None, verbose: Optional[bool] = False, animal_names: Optional[Tuple[str]] = None, denominator: Optional[typing_extensions.Literal['difference', 'shape_1', 'shape_2']] = 'difference') List[float] [source]
Compute the percentage overlap between corresponding Polygons in two lists.
- Parameters
shape_1 (List[Polygon]) – List of Polygons.
shape_2 (List[Polygon]) – List of Polygons with the same length as shape_1.
core_cnt (int) – Number of CPU cores to use for parallel processing. Default is -1, which uses all available cores.
video_name (Optional[str]) – If not None, then the name of the video being processed for interpretable progress msgs.
verbose (Optional[bool]) – If True, then prints interpretable progress msgs.
animal_names (Optional[Tuple[str]]) – If not None, then a two-tuple of animal names (or alternative shape names) for interpretable progress msgs.
- Return List[float]
List of percentage overlap between corresponding Polygons.
- multiframe_compute_shape_overlap(shape_1: List[Polygon], shape_2: List[Polygon], core_cnt: Optional[int] = -1, verbose: Optional[bool] = False, names: Optional[Tuple[str]] = None) List[int] [source]
Multiprocess compute overlap between corresponding Polygons in two lists.
Note
Only returns whether two shapes are overlapping or not. If the amount of overlap is required, use GeometryMixin().multiframe_compute_pct_shape_overlap().
- Parameters
shape_1 (List[Polygon]) – List of Polygons.
shape_2 (List[Polygon]) – List of Polygons with the same length as shape_1.
core_cnt (int) – Number of CPU cores to use for parallel processing. Default is -1, which uses all available cores.
- Return List[int]
List of overlap between corresponding Polygons: 1 if the shapes overlap, else 0.
- multiframe_delaunay_triangulate_keypoints(data: ndarray, core_cnt: int = -1) List[List[Polygon]] [source]
>>> data_path = '/Users/simon/Desktop/envs/troubleshooting/Rat_NOR/project_folder/csv/machine_results/08102021_DOT_Rat7_8(2).csv'
>>> data = pd.read_csv(data_path, index_col=0).head(1000).iloc[:, 0:21]
>>> data = data[data.columns.drop(list(data.filter(regex='_p')))]
>>> animal_data = data.values.reshape(len(data), -1, 2).astype(int)
>>> tri = GeometryMixin().multiframe_delaunay_triangulate_keypoints(data=animal_data)
- multiframe_difference(shapes: Iterable[Union[LineString, Polygon, MultiPolygon]], core_cnt: Optional[int] = -1, verbose: Optional[bool] = False, animal_names: Optional[str] = None, video_name: Optional[str] = None) List[Union[Polygon, MultiPolygon]] [source]
Compute the multi-frame difference for a collection of shapes using parallel processing.
- Parameters
shapes (Iterable[Union[LineString, Polygon, MultiPolygon]]) – A collection of shapes, where each shape is a list containing two geometries.
core_cnt (int) – The number of CPU cores to use for parallel processing. Default is -1, which automatically detects the available cores.
verbose (Optional[bool]) – If True, print progress messages during computation. Default is False.
animal_names (Optional[str]) – Optional string representing the names of animals for informative messages.
video_name (Optional[str]) – Optional string representing the name of the video for informative messages.
- Return List[Union[Polygon, MultiPolygon]]
A list of geometries representing the multi-frame difference.
- multiframe_hausdorff_distance(geometries: List[Union[Polygon, LineString]], lag: Optional[int] = 1, core_cnt: Optional[int] = -1) List[float] [source]
The Hausdorff distance measures the similarity between sequential time-series geometries.
- Example
>>> df = read_df(file_path='/Users/simon/Desktop/envs/simba/troubleshooting/mouse_open_field/project_folder/csv/outlier_corrected_movement_location/SI_DAY3_308_CD1_PRESENT.csv', file_type='csv')
>>> cols = [x for x in df.columns if not x.endswith('_p')]
>>> data = df[cols].values.reshape(len(df), -1, 2).astype(int)
>>> geometries = GeometryMixin().multiframe_bodyparts_to_polygon(data=data, pixels_per_mm=1, parallel_offset=1, verbose=False, core_cnt=-1)
>>> hausdorff_distances = GeometryMixin.multiframe_hausdorff_distance(geometries=geometries)
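For intuition, the Hausdorff distance between two point sets can be sketched in pure Python. This is a simplified, brute-force stand-in for the shapely-backed computation above; the helper name is illustrative, not SimBA API:

```python
import math

def discrete_hausdorff(a, b):
    # Directed distance: for each point in p, find its nearest neighbour in q,
    # then take the worst (largest) of those nearest-neighbour distances.
    def directed(p, q):
        return max(min(math.dist(x, y) for y in q) for x in p)
    # The Hausdorff distance is symmetric: the max of both directed distances.
    return max(directed(a, b), directed(b, a))

print(discrete_hausdorff([(0, 0)], [(3, 4)]))  # -> 5.0 (a 3-4-5 triangle)
```

Identical point sets give a distance of zero, which is why a lag of 1 over a slowly moving animal yields small values.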
- multiframe_is_shape_covered(shape_1: List[Polygon], shape_2: List[Polygon], core_cnt: Optional[int] = -1) List[bool] [source]
For each shape in time-series of shapes, check if another shape in the same time-series fully covers the first shape.
- Example
>>> shape_1 = GeometryMixin().multiframe_bodyparts_to_polygon(data=np.random.randint(0, 200, (100, 6, 2)))
>>> shape_2 = [Polygon([[0, 0], [20, 20], [20, 10], [10, 20]]) for x in range(len(shape_1))]
>>> GeometryMixin.multiframe_is_shape_covered(shape_1=shape_1, shape_2=shape_2, core_cnt=3)
- multiframe_length(shapes: List[Union[LineString, MultiLineString]], pixels_per_mm: float, core_cnt: int = -1, unit: typing_extensions.Literal['mm', 'cm', 'dm', 'm'] = 'mm') List[float] [source]
- Example
>>> data = np.random.randint(0, 100, (5000, 2))
>>> data = data.reshape(2500, -1, data.shape[1])
>>> lines = GeometryMixin().multiframe_bodyparts_to_line(data=data)
>>> lengths = GeometryMixin().multiframe_length(shapes=lines, pixels_per_mm=1.0)
- multiframe_minimum_rotated_rectangle(shapes: List[Polygon], video_name: Optional[str] = None, verbose: Optional[bool] = False, animal_name: Optional[bool] = None, core_cnt: int = -1) List[Polygon] [source]
Compute the minimum rotated rectangle for each Polygon in a list using multiprocessing.
- Parameters
shapes (List[Polygon]) – List of Polygons.
core_cnt – Number of CPU cores to use for parallel processing. Default is -1, which uses all available cores.
- multiframe_shape_distance(shape_1: List[Union[Polygon, LineString]], shape_2: List[Union[Polygon, LineString]], pixels_per_mm: float, unit: typing_extensions.Literal['mm', 'cm', 'dm', 'm'] = 'mm', core_cnt=-1) List[float] [source]
Compute shape distances between corresponding shapes in two lists of LineString or Polygon geometries for multiple frames.
- Parameters
shape_1 (List[Union[LineString, Polygon]]) – List of LineString or Polygon geometries.
shape_2 (List[Union[LineString, Polygon]]) – List of LineString or Polygon geometries with the same length as shape_1.
pixels_per_mm (float) – Conversion factor from pixels to millimeters.
unit (Literal['mm', 'cm', 'dm', 'm']) – Unit of measurement for the result. Options: ‘mm’, ‘cm’, ‘dm’, ‘m’. Default: ‘mm’.
core_cnt – Number of CPU cores to use for parallel processing. Default is -1, which uses all available cores.
- Return List[float]
List of shape distances between corresponding shapes in passed unit.
- multiframe_symmetric_difference(shapes: Iterable[Union[LineString, MultiLineString]], core_cnt: int = -1)[source]
Compute the symmetric differences between corresponding LineString or MultiLineString geometries using multiprocessing.
- Example
>>> data_1 = np.random.randint(0, 100, (5000, 2)).reshape(1000, -1, 2)
>>> data_2 = np.random.randint(0, 100, (5000, 2)).reshape(1000, -1, 2)
>>> polygon_1 = GeometryMixin().multiframe_bodyparts_to_polygon(data=data_1)
>>> polygon_2 = GeometryMixin().multiframe_bodyparts_to_polygon(data=data_2)
>>> data = np.array([polygon_1, polygon_2]).T
>>> symmetric_differences = GeometryMixin().multiframe_symmetric_difference(shapes=data)
- multiframe_union(shapes: Iterable[Union[LineString, MultiLineString]], core_cnt: int = -1) Iterable[Union[LineString, MultiLineString]] [source]
- Example
>>> data_1 = np.random.randint(0, 100, (5000, 2)).reshape(1000, -1, 2)
>>> data_2 = np.random.randint(0, 100, (5000, 2)).reshape(1000, -1, 2)
>>> polygon_1 = GeometryMixin().multiframe_bodyparts_to_polygon(data=data_1)
>>> polygon_2 = GeometryMixin().multiframe_bodyparts_to_polygon(data=data_2)
>>> data = np.array([polygon_1, polygon_2]).T
>>> unions = GeometryMixin().multiframe_union(shapes=data)
- multifrm_geometry_histocomparison(video_path: Union[str, PathLike], data: ndarray, shape_type: typing_extensions.Literal['rectangle', 'circle', 'line'], lag: Optional[int] = 2, core_cnt: Optional[int] = -1, pixels_per_mm: int = 1, parallel_offset: int = 1) ndarray [source]
Perform geometry histocomparison on multiple video frames using multiprocessing.
Note
Comparisons are made using the intersections of the two image geometries, meaning that the same experimental area of the image and arena is used in the comparison, and shifts in animal location cannot account for variability.
- Parameters
video_path (Union[str, os.PathLike]) – Path to the video file.
data (np.ndarray) – Input data, typically containing coordinates of one or several body-parts.
shape_type (Literal['rectangle', 'circle', 'line']) – Type of shape for comparison.
lag (Optional[int]) – Number of frames to lag between comparisons. Default is 2.
core_cnt (Optional[int]) – Number of CPU cores to use for parallel processing. Default is -1 which is all available cores.
pixels_per_mm (Optional[int]) – Pixels per millimeter for conversion. Default is 1.
parallel_offset (Optional[int]) – Size of the geometry ROI in millimeters. Default 1.
- Returns np.ndarray
The difference between the successive geometry histograms.
- Example
>>> data = pd.read_csv('/Users/simon/Desktop/envs/troubleshooting/Emergence/project_folder/csv/outlier_corrected_movement_location/Example_1.csv', nrows=2100, usecols=['Nose_x', 'Nose_y']).fillna(-1).values.astype(np.int64)
>>> results = GeometryMixin().multifrm_geometry_histocomparison(video_path='/Users/simon/Desktop/envs/troubleshooting/Emergence/project_folder/videos/Example_1.mp4', data=data, shape_type='circle', pixels_per_mm=1, parallel_offset=100)
>>> data = pd.read_csv('/Users/simon/Desktop/envs/troubleshooting/Emergence/project_folder/csv/outlier_corrected_movement_location/Example_2.csv', nrows=2100, usecols=['Nose_x', 'Nose_y', 'Tail_base_x', 'Tail_base_y', 'Center_x', 'Center_y']).fillna(-1).values.astype(np.int64)
>>> results = GeometryMixin().multifrm_geometry_histocomparison(video_path='/Users/simon/Desktop/envs/troubleshooting/Emergence/project_folder/videos/Example_1.mp4', data=data, shape_type='rectangle', pixels_per_mm=1, parallel_offset=1)
- static point_lineside(lines: ndarray, points: ndarray) ndarray [source]
Determine the relative position of a point (left vs right) with respect to a line in each frame.
- Parameters
lines (numpy.ndarray) – An array of shape (N, 2, 2) representing N lines, where each line is defined by two points. The first point that denotes the beginning of the line, the second point denotes the end of the line.
points (numpy.ndarray) – An array of shape (N, 2) representing N points.
- Return np.ndarray
An array of length N containing the results for each line. 2 if the point is on the right side of the line. 1 if the point is on the left side of the line. 0 if the point is on the line.
- Example
>>> lines = np.array([[[25, 25], [25, 20]], [[15, 25], [15, 20]], [[15, 25], [50, 20]]]).astype(np.float32)
>>> points = np.array([[20, 0], [15, 20], [90, 0]]).astype(np.float32)
>>> GeometryMixin().point_lineside(lines=lines, points=points)
>>> [1., 0., 1.]
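The left/right test underneath this method can be sketched with a 2D cross product. This is a pure-Python illustration; the sign-to-label mapping below is chosen to reproduce the documented example values and is an assumption about SimBA's convention:

```python
def lineside(line_start, line_end, point):
    (x1, y1), (x2, y2) = line_start, line_end
    px, py = point
    # 2D cross product of the line direction vector and the start->point vector.
    d = (x2 - x1) * (py - y1) - (y2 - y1) * (px - x1)
    if d == 0:
        return 0  # point lies exactly on the (infinite) line
    return 1 if d < 0 else 2  # 1 = left, 2 = right (assumed mapping)

# Values from the documented example above:
print(lineside((25, 25), (25, 20), (20, 0)))   # -> 1
print(lineside((15, 25), (15, 20), (15, 20)))  # -> 0
```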
- static rank_shapes(shapes: List[Polygon], method: typing_extensions.Literal['area', 'min_distance', 'max_distance', 'mean_distance', 'left_to_right', 'top_to_bottom'], deviation: Optional[bool] = False, descending: Optional[bool] = True) List[Polygon] [source]
Rank a list of polygon geometries based on a specified method. E.g., order the list of geometries according to sizes or distances to each other or from left to right etc.
- Parameters
shapes (List[Polygon]) – List of Shapely polygons to be ranked. List has to contain two or more shapes.
method (Literal['area', 'min_center_distance', 'max_center_distance', 'mean_shape_distance']) – The ranking method to use.
deviation (Optional[bool]) – If True, rank based on absolute deviation from the mean. Default: False.
descending (Optional[bool]) – If True, rank in descending order; otherwise, rank in ascending order. Default: True.
- Returns
The input list of Shapely polygons sorted according to the specified ranking method.
- static shape_distance(shapes: List[Union[LineString, Polygon, Point]], pixels_per_mm: float, unit: typing_extensions.Literal['mm', 'cm', 'dm', 'm'] = 'mm') float [source]
Calculate the distance between two geometries in specified units.
- Parameters
shapes (List[Union[LineString, Polygon]]) – A list containing two LineString or Polygon geometries.
pixels_per_mm (float) – The conversion factor from pixels to millimeters.
unit (Literal['mm', 'cm', 'dm', 'm']) – The desired unit for the distance calculation. Options: ‘mm’, ‘cm’, ‘dm’, ‘m’. Defaults to ‘mm’.
- Return float
The distance between the two geometries in the specified unit.
>>> shape_1 = Polygon([(0, 0), (10, 10), (0, 10), (10, 0)])
>>> shape_2 = Polygon([(0, 0), (10, 10), (0, 10), (10, 0)])
>>> GeometryMixin.shape_distance(shapes=[shape_1, shape_2], pixels_per_mm=1)
>>> 0
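The unit handling reduces to simple arithmetic: the pixel distance is divided by pixels_per_mm, then scaled to the requested unit. A minimal sketch (the helper name is illustrative, not SimBA API):

```python
def px_distance_to_unit(distance_px, pixels_per_mm, unit='mm'):
    # Convert pixel distance to millimeters, then scale to the target unit.
    scale = {'mm': 1.0, 'cm': 10.0, 'dm': 100.0, 'm': 1000.0}
    return (distance_px / pixels_per_mm) / scale[unit]

print(px_distance_to_unit(250, 5, unit='cm'))  # -> 5.0 (250 px = 50 mm = 5 cm)
```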
- static simba_roi_to_geometries(rectangles_df: DataFrame, circles_df: DataFrame, polygons_df: DataFrame, color: Optional[bool] = False) dict [source]
Convert SimBA dataframes holding ROI geometries to a nested dictionary holding Shapely polygons.
- Example
>>> #config_path = '/Users/simon/Desktop/envs/simba/troubleshooting/spontenous_alternation/project_folder/project_config.ini'
>>> #config = ConfigReader(config_path=config_path)
>>> #config.read_roi_data()
>>> #GeometryMixin.simba_roi_to_geometries(rectangles_df=config.rectangles_df, circles_df=config.circles_df, polygons_df=config.polygon_df)
- static static_point_lineside(lines: ndarray, point: ndarray) ndarray [source]
Determine the relative position (left vs right) of a static point with respect to multiple lines.
Note
Modified from rayryeng.
- Parameters
lines (numpy.ndarray) – An array of shape (N, 2, 2) representing N lines, where each line is defined by two points. The first point that denotes the beginning of the line, the second point denotes the end of the line.
point (numpy.ndarray) – A 2-element array representing the coordinates of the static point.
- Return np.ndarray
An array of length N containing the results for each line. 2 if the point is on the right side of the line. 1 if the point is on the left side of the line. 0 if the point is on the line.
- Example
>>> line = np.array([[[25, 25], [25, 20]], [[15, 25], [15, 20]], [[15, 25], [50, 20]]]).astype(np.float32)
>>> point = np.array([20, 0]).astype(np.float64)
>>> GeometryMixin().static_point_lineside(lines=line, point=point)
>>> [1. 2. 1.]
- static symmetric_difference(shapes: List[Union[LineString, Polygon, MultiPolygon]]) List[Union[Polygon, MultiPolygon]] [source]
Computes a new geometry consisting of the parts that are exclusive to each input geometry.
In other words, it includes the parts that are unique to each geometry while excluding the parts that are common to both.
- Parameters
shapes (List[Union[LineString, Polygon, MultiPolygon]]) – A list of LineString, Polygon, or MultiPolygon geometries to find the symmetric difference.
- Return List[Union[Polygon, MultiPolygon]]
A list containing the resulting geometries after performing symmetric difference operations.
- Example
>>> polygon_1 = GeometryMixin().bodyparts_to_polygon(np.array([[10, 10], [10, 100], [100, 10], [100, 100]]))
>>> polygon_2 = GeometryMixin().bodyparts_to_polygon(np.array([[1, 25], [1, 75], [110, 25], [110, 75]]))
>>> symmetric_difference = GeometryMixin.symmetric_difference(shapes=[polygon_1, polygon_2])
- static to_linestring(data: ndarray) LineString [source]
Convert a 2D array of x and y coordinates to a shapely linestring.
Linestrings are useful for representing an animal's path, and for answering questions like: (i) How far along its path was the animal most proximal to geometry X? (ii) How far had the animal travelled at time T? (iii) When does the animal's path intersect geometry X?
- Parameters
data (np.ndarray) – 2D array with floats or ints of size Nx2 representing body-part coordinates.
- Example
>>> data = np.load('/Users/simon/Desktop/envs/simba/simba/simba/sandbox/data.npy')
>>> linestring = GeometryMixin.to_linestring(data=data)
- static union(shapes: List[Union[LineString, Polygon, MultiPolygon]]) Union[MultiPolygon, Polygon, MultiLineString] [source]
Compute the union of multiple geometries.
- Parameters
shapes (List[Union[LineString, Polygon, MultiPolygon]]) – A list of LineString, Polygon, or MultiPolygon geometries to be unioned.
- Return Union[MultiPolygon, Polygon]
The resulting geometry after performing the union operation.
- Example
>>> polygon_1 = GeometryMixin().bodyparts_to_polygon(np.array([[10, 10], [10, 100], [100, 10], [100, 100]]))
>>> polygon_2 = GeometryMixin().bodyparts_to_polygon(np.array([[1, 25], [1, 75], [110, 25], [110, 75]]))
>>> union = GeometryMixin().union(shapes=[polygon_1, polygon_2])
- static view_shapes(shapes: List[Union[LineString, Polygon, MultiPolygon, MultiLineString]], bg_img: Optional[ndarray] = None, bg_clr: Optional[Tuple[int]] = None, size: Optional[int] = None, color_palette: Optional[str] = None) ndarray [source]
Helper function to draw shapes on a white canvas or a specified background image. Useful for quick troubleshooting.
- Example
>>> multipolygon_1 = MultiPolygon([Polygon([[200, 110], [200, 100], [200, 100], [200, 110]]), Polygon([[70, 70], [70, 60], [10, 50], [1, 70]])])
>>> polygon_1 = GeometryMixin().bodyparts_to_polygon(np.array([[100, 110], [100, 100], [110, 100], [110, 110]]))
>>> line_1 = GeometryMixin().bodyparts_to_line(np.array([[10, 70], [20, 60], [30, 50], [40, 70]]))
>>> img = GeometryMixin.view_shapes(shapes=[line_1, polygon_1, multipolygon_1])
Network (Graph) methods
- class simba.mixins.network_mixin.NetworkMixin[source]
Bases:
object
Methods to create and analyze time-dependent graphs from pose-estimation data.
When working with pose-estimation data for more than two animals - over extended periods - it can be beneficial to represent the data as a graph where the animals feature as nodes and their relationship strengths are represented as edges.
When formatted as a graph, we can compute (i) how the relationships between animal pairs change across time and recordings, (ii) the relative importances and hierarchies of individual animals within the group, or (iii) identify sub-groups within the network.
The critical component determining the results is how edge weights are represented. These edge weight values could be the amount of time animal bounding boxes overlap each other, aggregate distances between the animals, or how much time animals engage in coordinated behaviors. These values can be computed through other SimBA mixin methods.
Very much a work-in-progress; the methods so far primarily depend on networkx.
References
See below references for mature and reliable packages (12/2023):
- static berger_parker(x: ndarray) float [source]
Berger-Parker index for the given one-dimensional array. The Berger-Parker index is a measure of category dominance, calculated as the ratio of the frequency of the most abundant category to the total number of observations. It answers how dominated a cluster or community is by a single category of a categorical variable.
The Berger-Parker index (BP) is calculated using the formula:
BP = f_max / N
where: - ( f_max ) is the frequency of the most abundant category, - ( N ) is the total number of observations.
- Parameters
x (np.ndarray) – One-dimensional numpy array containing the values for which the Berger-Parker index is calculated.
- Return float
Berger-Parker index value for the input array x
- Example
>>> x = np.random.randint(0, 25, (100,)).astype(np.float32)
>>> z = NetworkMixin.berger_parker(x=x)
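The formula is straightforward to verify in plain Python (an illustrative stand-in, not the SimBA implementation):

```python
from collections import Counter

def berger_parker(x):
    # Dominance: frequency of the most abundant category over total observations.
    return max(Counter(x).values()) / len(x)

print(berger_parker([1, 1, 1, 2]))  # -> 0.75 (category 1 makes up 3 of 4 observations)
```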
- static brillouins_index(x: array) float [source]
Calculate Brillouin’s Diversity Index for a given array of values.
Brillouin’s Diversity Index is a measure of cluster/community diversity that accounts for both richness and evenness of distribution.
Brillouin’s Diversity Index (H) is calculated using the formula:
H = (ln(n!) - Σ ln(N_i!)) / n
where: - ( H ) is Brillouin’s Diversity Index, - ( N_i ) is the count of individuals in the i-th species, - ( n ) is the total number of individuals, and the sum runs over the S unique species.
- Parameters
x (np.array) – One-dimensional numpy array containing the values for which Brillouin’s Index is calculated.
- Return float
Brillouin’s Diversity Index value for the input array x
- Example
>>> x = np.random.randint(0, 10, (100,))
>>> NetworkMixin.brillouins_index(x)
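A plain-Python check of the formula, using math.lgamma to evaluate ln(k!) without computing huge factorials (illustrative, not the SimBA implementation):

```python
import math
from collections import Counter

def brillouins_index(x):
    n = len(x)
    # ln(k!) == lgamma(k + 1); avoids overflowing on large factorials.
    log_total = math.lgamma(n + 1)
    log_counts = sum(math.lgamma(c + 1) for c in Counter(x).values())
    return (log_total - log_counts) / n

print(brillouins_index([1, 1, 1, 1]))  # -> 0.0 (a single category: no diversity)
```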
- static create_graph(data: Dict[Tuple[str, str], float]) Graph [source]
Create a single undirected graph with single edges from a dictionary.
- Parameters
data (Dict[Tuple[str, str], float]) – A dictionary where keys are tuples representing node pairs and values are the corresponding edge weights.
- Returns nx.Graph
A networkx graph with nodes and edges defined by the input data.
- Example
>>> data = {('Animal_1', 'Animal_2'): 1.0, ('Animal_1', 'Animal_3'): 0.2, ('Animal_2', 'Animal_3'): 0.5}
>>> graph = NetworkMixin.create_graph(data=data)
- static create_multigraph(data: Dict[Tuple[str, str], List[float]]) MultiGraph [source]
Create a multi-graph from a dictionary of node pairs and associated edge weights.
For example, creates a multi-graph where node edges represent animal relationship weights at different timepoints.
- Parameters
data (Dict[Tuple[str, str], List[float]]) – A dictionary where keys are tuples representing node pairs, and values are lists of edge weights associated with each pair.
- Returns nx.MultiGraph
A NetworkX multigraph with nodes and edges specified by the input data. Each edge is labeled and weighted based on the provided information.
- Example
>>> data = {('Animal_1', 'Animal_2'): [0, 0, 0, 6], ('Animal_1', 'Animal_3'): [0, 0, 0, 0], ('Animal_1', 'Animal_4'): [0, 0, 0, 0], ('Animal_1', 'Animal_5'): [0, 0, 0, 0], ('Animal_2', 'Animal_3'): [0, 0, 0, 0], ('Animal_2', 'Animal_4'): [5, 0, 0, 2], ('Animal_2', 'Animal_5'): [0, 0, 0, 0], ('Animal_3', 'Animal_4'): [0, 0, 0, 0], ('Animal_3', 'Animal_5'): [0, 2, 22, 0], ('Animal_4', 'Animal_5'): [0, 0, 0, 0]}
>>> NetworkMixin().create_multigraph(data=data)
- static girvan_newman(graph: Graph, levels: Optional[int] = 1, most_valuable_edge: object = None)[source]
- Example
>>> graph = NetworkMixin.create_graph({('Animal_1', 'Animal_2'): 0.0, ('Animal_1', 'Animal_3'): 0.0, ('Animal_1', 'Animal_4'): 0.0, ('Animal_1', 'Animal_5'): 0.0, ('Animal_2', 'Animal_3'): 1.0, ('Animal_2', 'Animal_4'): 1.0, ('Animal_2', 'Animal_5'): 1.0, ('Animal_3', 'Animal_4'): 1.0, ('Animal_3', 'Animal_5'): 1.0, ('Animal_4', 'Animal_5'): 1.0})
>>> NetworkMixin().girvan_newman(graph=graph, levels=1)
>>> [({'Animal_1'}, {'Animal_2', 'Animal_3', 'Animal_4', 'Animal_5'})]
- static graph_current_flow_closeness_centrality(graph: Graph, weights: Optional[str] = 'weight')[source]
- Example
>>> graph = NetworkMixin.create_graph(data={('Animal_1', 'Animal_2'): 1.0, ('Animal_1', 'Animal_3'): 0.2, ('Animal_2', 'Animal_3'): 0.5})
>>> NetworkMixin().graph_current_flow_closeness_centrality(graph=graph)
- static graph_katz_centrality(graph: Graph, weights: Optional[str] = 'weight', alpha: Optional[float] = 0.85)[source]
Katz centrality is an algorithm in NetworkX that measures the relative influence of a node in a network.
See networkx documentation
- Example
>>> graph = NetworkMixin.create_graph(data={('Animal_1', 'Animal_2'): 1.0, ('Animal_1', 'Animal_3'): 0.2, ('Animal_2', 'Animal_3'): 0.5})
>>> NetworkMixin().graph_katz_centrality(graph=graph)
- static graph_page_rank(graph: Graph, weights: Optional[str] = 'weight', alpha: Optional[float] = 0.85, max_iter: Optional[int] = 100) Dict[str, float] [source]
Calculate the PageRank of nodes in a graph.
- Example
>>> graph = NetworkMixin.create_graph(data={('Animal_1', 'Animal_2'): 1.0, ('Animal_1', 'Animal_3'): 0.2, ('Animal_2', 'Animal_3'): 0.5})
>>> NetworkMixin().graph_page_rank(graph=graph)
- static margalef_diversification_index(x: array) float [source]
Calculate the Margalef Diversification Index for a given array of values.
The Margalef Diversification Index is a measure of category diversity. It quantifies the richness of a community/cluster relative to the number of individuals. A high Margalef Diversification Index indicates a high diversity of categories relative to the number of observations. A low Margalef Diversification Index suggests a lower diversity of categories relative to the number of observations.
The Margalef Diversification Index (D) is calculated using the formula:
D = (S - 1) / ln(N)
where: - ( S ) is the number of unique categories, - ( N ) is the total number of individuals.
- Parameters
x (np.array) – One-dimensional numpy array containing nominal values for which the Margalef Diversification Index is calculated.
- Return float
Margalef Diversification Index value for the input array x
- Example
>>> x = np.random.randint(0, 100, (100,))
>>> NetworkMixin.margalef_diversification_index(x=x)
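The formula can be checked in plain Python (an illustrative stand-in, not the SimBA implementation):

```python
import math

def margalef_index(x):
    # (unique categories - 1) divided by ln(total observations).
    return (len(set(x)) - 1) / math.log(len(x))

print(margalef_index([0, 1, 2] * 10))  # 3 categories over 30 observations: 2 / ln(30)
```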
- static menhinicks_index(x: array) float [source]
Calculate the Menhinick’s Index for a given array of values.
Menhinick’s Index is a measure of category richness. It quantifies the number of categories relative to the square root of the total number of observations. A high Menhinick’s Index suggests a high diversity of categories relative to the number of observations. A low Menhinick’s Index indicates a lower diversity of categories relative to the number of observations.
Menhinick’s Index (D) is calculated using the formula:
D = S / sqrt(N)
where: - ( S ) is the number of unique categories, - ( N ) is the total number of observations.
- param np.array x
One-dimensional numpy array containing the integer values representing nominal values for which Menhinick’s Index is calculated.
- return float
Menhinick’s Index value for the input array x
- Example
>>> x = np.random.randint(0, 5, (1000,))
>>> NetworkMixin.menhinicks_index(x=x)
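Again, the index reduces to two counts (an illustrative stand-in, not the SimBA implementation):

```python
import math

def menhinicks_index(x):
    # Unique categories divided by the square root of total observations.
    return len(set(x)) / math.sqrt(len(x))

print(menhinicks_index([0, 1, 2, 3] * 4))  # -> 1.0 (4 categories, 16 observations)
```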
- static multigraph_page_rank(graph: MultiGraph, weights: Optional[str] = 'weight', alpha: Optional[float] = 0.85, max_iter: Optional[int] = 100) Dict[str, List[float]] [source]
Calculate multi-graph PageRank scores for each node in a MultiGraph.
For example, each node-pair in a graph has N undirected edges representing the weighted relationship between the two nodes at each observed point in time. Calculates the page rank of each node at each observed time point.
- Parameters
graph (nx.MultiGraph) – The input MultiGraph, created by NetworkMixin.create_multigraph().
- Example
>>> multigraph = NetworkMixin().create_multigraph(data={('Animal_1', 'Animal_2'): [0, 0, 0, 6], ('Animal_1', 'Animal_3'): [0, 0, 0, 0], ('Animal_1', 'Animal_4'): [0, 0, 0, 0], ('Animal_1', 'Animal_5'): [0, 0, 0, 0], ('Animal_2', 'Animal_3'): [0, 0, 0, 0], ('Animal_2', 'Animal_4'): [5, 0, 0, 2], ('Animal_2', 'Animal_5'): [0, 0, 0, 0], ('Animal_3', 'Animal_4'): [0, 0, 0, 0], ('Animal_3', 'Animal_5'): [0, 2, 22, 0], ('Animal_4', 'Animal_5'): [0, 0, 0, 0]})
>>> NetworkMixin().multigraph_page_rank(graph=multigraph)
>>> {'Animal_1': [0.06122524589028524, 0.06122524589028524, 0.06122524589028524, 0.32739635847890775], 'Animal_2': [0.06122524589028524, 0.40816213116457223, 0.06122524589028524, 0.442259400816002], 'Animal_3': [0.40816213116457223, 0.06122524589028524, 0.40816213116457223, 0.04545454545454547], 'Animal_4': [0.06122524589028524, 0.40816213116457223, 0.06122524589028524, 0.13943514979599955], 'Animal_5': [0.40816213116457223, 0.06122524589028524, 0.40816213116457223, 0.04545454545454547]}
- static shannon_diversity_index(x: ndarray) float [source]
Calculate the Shannon Diversity Index for a given array of categories. The Shannon Diversity Index is a measure of diversity in a categorical feature, taking into account both the number of different categories (richness) and their relative abundances (evenness). It answers how homogeneous a cluster or community is for a categorical variable. A low value indicates that one or a few categories dominate.
H = -\sum_{i=1}^{n} p_i \ln(p_i)
where: - ( p_i ) is the proportion of individuals belonging to the i-th category, - ( n ) is the total number of categories.
- Parameters
x (np.ndarray) – One-dimensional numpy array containing the categories for which the Shannon Diversity Index is calculated.
- Return float
Shannon Diversity Index value for the input array x
- Example
>>> x = np.random.randint(0, 100, (100, )) >>> NetworkMixin.shannon_diversity_index(x=x)
- static simpson_index(x: ndarray) float [source]
Calculate Simpson’s diversity index for a given array of values.
Simpson’s diversity index is a measure of diversity that takes into account the number of different categories present in the input data as well as the relative abundance of each category. It answers how homogeneous a cluster or community is for a categorical input variable.
D = \frac{\sum n(n-1)}{N(N-1)}
where: - ( n ) is the number of individuals of a particular category, - ( N ) is the total number of individuals, - the sum runs over all categories.
- param np.ndarray x
1-dimensional numpy array containing the values representing categories for which Simpson’s index is calculated.
- return float
Simpson’s diversity index value for the input array x
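No usage example accompanies this method. As a sketch of the formula above (a hypothetical standalone implementation, not the class method itself):

```python
import numpy as np

def simpson_index(x: np.ndarray) -> float:
    """Simpson's index: D = sum(n * (n - 1)) / (N * (N - 1)), where n is
    the count of each category and N is the total observation count."""
    _, counts = np.unique(x, return_counts=True)
    n_total = counts.sum()
    return float(np.sum(counts * (counts - 1)) / (n_total * (n_total - 1)))

print(simpson_index(np.array([1, 1, 1, 1])))  # 1.0: a single category dominates
print(simpson_index(np.array([0, 1, 2, 3])))  # 0.0: every observation is unique
```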
- static sorensen_dice_coefficient(x: ndarray, y: ndarray) float [source]
Calculate Sørensen’s Similarity Index between two communities/clusters.
The Sørensen similarity index, also known as the overlap index, quantifies the overlap between two populations by comparing the number of shared categories to the total number of categories in both populations. It ranges from zero, indicating no overlap, to one, representing perfect overlap.
Sørensen’s Similarity Index (S) is calculated using the formula:
S = \frac{2|X \cap Y|}{|X| + |Y|}
where: - ( S ) is Sørensen’s Similarity Index, - ( X ) and ( Y ) are the sets representing the categories in the first and second communities, respectively, - ( |X \cap Y| ) is the number of shared categories between the two communities, - ( |X| ) and ( |Y| ) are the total number of categories in the first and second communities, respectively.
- Parameters
x – 1D numpy array with nominal values for the first cluster/community.
y – 1D numpy array with nominal values for the second cluster/community.
- Returns
Sørensen’s Similarity Index between x and y.
- Example
>>> x = np.random.randint(0, 10, (100,)) >>> y = np.random.randint(0, 10, (100,)) >>> NetworkMixin.sorensen_dice_coefficient(x=x, y=y)
- static visualize(graph: Graph, save_path: Optional[Union[str, PathLike]] = None, node_size: Optional[Union[float, Dict[str, float]]] = 25.0, palette: Optional[Union[str, Dict[str, str]]] = 'magma', img_size: Optional[Tuple[int, int]] = (500, 500)) Union[None, Network] [source]
Visualizes a network graph using the vis.js library and saves the result as an HTML file.
Note
Multi-networks created by
simba.mixins.network_mixin.create_multigraph
can be a little messy to look at. Instead, creates separate objects and files with single edges from each time-point.- Parameters
graph (Union[nx.Graph, nx.MultiGraph]) – The input graph to be visualized.
save_path (Optional[Union[str, os.PathLike]]) – The path to save the HTML file. If None, the graph is not saved but returned. Default: None.
node_size (Optional[Union[float, Dict[str, float]]]) – The size of nodes. Can be a single float or a dictionary mapping node names to their respective sizes. Default: 25.0.
palette (Optional[Union[str, Dict[str, str]]]) – The color palette for nodes. Can be a single string representing a palette name or a dictionary mapping node names to their respective colors. Default: magma.
img_size (Optional[Tuple[int, int]]) – The size of the resulting image in pixels, represented as (width, height). Default: 500x500.
- Example
>>> graph = NetworkMixin.create_graph(data={('Animal_1', 'Animal_2'): 1.0, ('Animal_1', 'Animal_3'): 0.2, ('Animal_2', 'Animal_3'): 0.5}) >>> graph_pg = NetworkMixin().graph_page_rank(graph=graph)
Feature extraction supplement methods
- class simba.mixins.feature_extraction_supplement_mixin.FeatureExtractionSupplemental[source]
Bases:
FeatureExtractionMixin
Additional feature extraction methods not called by default feature extraction classes from
simba.feature_extractors
.- static border_distances(data: ndarray, pixels_per_mm: float, img_resolution: ndarray, time_window: float, fps: int)[source]
Compute the mean distance of a key-point to the left, right, top, and bottom sides of the image in rolling time-windows. Uses a straight line.
Attention
Output for initial frames where [current_frm - window_size] < 0 will be populated with
-1
.- Parameters
data (np.ndarray) – 2d array of size len(frames)x2 with body-part coordinates.
img_resolution (np.ndarray) – Resolution of video in WxH format.
pixels_per_mm (float) – Pixels per millimeter of recorded video.
fps (int) – FPS of the recorded video
time_window (float) – Rolling time-window as a float in seconds. E.g.,
0.2
- Returns np.ndarray
Size data.shape[0] x 4 array with millimeter distances from the LEFT, RIGHT, TOP, and BOTTOM image borders.
- Example
>>> data = np.array([[250, 250], [250, 250], [250, 250], [500, 500],[500, 500], [500, 500]]).astype(float) >>> img_resolution = np.array([500, 500]) >>> FeatureExtractionSupplemental().border_distances(data=data, img_resolution=img_resolution, time_window=1, fps=2, pixels_per_mm=1) >>> [[-1, -1, -1, -1][250, 250, 250, 250][250, 250, 250, 250][375, 125, 375, 125][500, 0, 500, 0][500, 0, 500, 0]]
- static consecutive_time_series_categories_count(data: ndarray, fps: int)[source]
Compute the count of consecutive milliseconds the feature value has remained static. For example, compute for how long in milliseconds the animal has remained in the current cardinal direction or within an ROI.
- Parameters
data (np.ndarray) – 1d array of feature values
fps (int) – Frame-rate of video.
- Returns np.ndarray
Array of size data.shape[0]
- Example
>>> data = np.array([0, 1, 1, 1, 4, 5, 6, 7, 8, 9]) >>> FeatureExtractionSupplemental().consecutive_time_series_categories_count(data=data, fps=10) >>> [0.1, 0.1, 0.2, 0.3, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1] >>> data = np.array(['A', 'B', 'B', 'B', 'C', 'D', 'E', 'F', 'G', 'H']) >>> FeatureExtractionSupplemental().consecutive_time_series_categories_count(data=data, fps=10) >>> [0.1, 0.1, 0.2, 0.3, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1]
- static distance_and_velocity(x: ndarray, fps: float, pixels_per_mm: float, centimeters: Optional[bool] = True) Tuple[float, float] [source]
Calculate total movement and mean velocity from a sequence of position data.
- Parameters
x – Array containing movement data. For example, created by
simba.mixins.FeatureExtractionMixin.framewise_euclidean_distance
.fps – Frames per second of the data.
pixels_per_mm – Conversion factor from pixels to millimeters.
centimeters (Optional[bool]) – If True, results are returned in centimeters and centimeters per second. Defaults to True.
- Return Tuple[float, float]
A tuple containing total movement and mean velocity.
- Example
>>> x = np.random.randint(0, 100, (100,)) >>> sum_movement, avg_velocity = FeatureExtractionSupplemental.distance_and_velocity(x=x, fps=10, pixels_per_mm=10, centimeters=True)
- euclidean_distance_timeseries_change(location_1: ndarray, location_2: ndarray, fps: int, px_per_mm: float, time_windows: ndarray = array([0.2, 0.4, 0.8, 1.6])) ndarray [source]
Compute the difference in distance between two points in the current frame versus N.N seconds ago. E.g., computes if two points are traveling away from each other (positive output values) or towards each other (negative output values) relative to reference time-point(s)
- Parameters
location_1 (ndarray) – 2D array of size len(frames) x 2 representing pose-estimated locations of body-part one
location_2 (ndarray) – 2D array of size len(frames) x 2 representing pose-estimated locations of body-part two
fps (int) – Fps of the recorded video.
px_per_mm (float) – The pixels per millimeter in the video.
time_windows (np.ndarray) – Time windows to compare.
- Return np.array
Array of size location_1.shape[0] x time_windows.shape[0]
- Example
>>> location_1 = np.random.randint(low=0, high=100, size=(2000, 2)).astype('float32') >>> location_2 = np.random.randint(low=0, high=100, size=(2000, 2)).astype('float32') >>> distances = self.euclidean_distance_timeseries_change(location_1=location_1, location_2=location_2, fps=10, px_per_mm=4.33, time_windows=np.array([0.2, 0.4, 0.8, 1.6]))
- static find_path_loops(data: ndarray) Dict[Tuple[int], List[int]] [source]
Compute the loops detected within a 2-dimensional path.
- Parameters
data (np.ndarray) – Nx2 2-dimensional array with the x and y coordinates represented on axis 1.
- Returns
Dictionary with the coordinate tuple(x, y) as keys, and sequential frame numbers as values when animals visited, and re-visited the key coordinate.
- Example
>>> data = read_df(file_path='/Users/simon/Desktop/envs/simba/troubleshooting/mouse_open_field/project_folder/csv/outlier_corrected_movement_location/SI_DAY3_308_CD1_PRESENT.csv', usecols=['Center_x', 'Center_y'], file_type='csv').values.astype(int) >>> FeatureExtractionSupplemental.find_path_loops(data=data)
- static peak_ratio(data: ndarray, bin_size_s: int, fps: int)[source]
Compute the ratio of peak values relative to the number of values within each sequential time-period of
bin_size_s
seconds. A peak is defined as a value higher than the prior observation (i.e., no future data is involved in the comparison).- Parameters
- Return np.ndarray
Array of size data.shape[0] with peak counts as ratio of len(frames).
- Example
>>> data = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]) >>> FeatureExtractionSupplemental().peak_ratio(data=data, bin_size_s=1, fps=10) >>> [0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9] >>> data = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]) >>> FeatureExtractionSupplemental().peak_ratio(data=data, bin_size_s=1, fps=10) >>> [0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. ]
- static rolling_categorical_switches_ratio(data: ndarray, time_windows: ndarray, fps: int) ndarray [source]
Compute the ratio of categorical feature switches within rolling windows.
Attention
Output for initial frames where [current_frm - window_size] < 0, are populated with
0
.- Parameters
data (np.ndarray) – 1d array of feature values
time_windows (np.ndarray) – Rolling time-windows as floats in seconds. E.g., [0.2, 0.4, 0.6]
fps (int) – fps of the recorded video
- Returns np.ndarray
Size data.shape[0] x time_windows.shape[0] array
- Example
>>> data = np.array([0, 1, 1, 1, 4, 5, 6, 7, 8, 9]) >>> FeatureExtractionSupplemental().rolling_categorical_switches_ratio(data=data, time_windows=np.array([1.0]), fps=10) >>> [[-1][-1][-1][-1][-1][-1][-1][-1][-1][ 0.7]] >>> data = np.array(['A', 'B', 'B', 'B', 'C', 'D', 'E', 'F', 'G', 'H']) >>> FeatureExtractionSupplemental().rolling_categorical_switches_ratio(data=data, time_windows=np.array([1.0]), fps=10) >>> [[-1][-1][-1][-1][-1][-1][-1][-1][-1][ 0.7]]
- static rolling_horizontal_vs_vertical_movement(data: ndarray, pixels_per_mm: float, time_windows: ndarray, fps: int) ndarray [source]
Compute the movement along the x-axis relative to the y-axis in rolling time bins.
Attention
Output for initial frames where [current_frm - window_size] < 0, are populated with
0
.- Parameters
- Returns np.ndarray
Size data.shape[0] x time_windows.shape[0]. Greater values denote greater movement on the x-axis relative to the y-axis.
- Example
>>> data = np.array([[250, 250], [250, 250], [250, 250], [250, 500], [500, 500], [500, 500]]).astype(float) >>> FeatureExtractionSupplemental().rolling_horizontal_vs_vertical_movement(data=data, time_windows=np.array([1.0]), fps=2, pixels_per_mm=1) >>> [[ -1.][ 0.][ 0.][-250.][ 250.][ 0.]]
- static sequential_lag_analysis(data: DataFrame, criterion: str, target: str, time_window: float, fps: float)[source]
Perform sequential lag analysis to determine the temporal relationship between two events.
For every onset of behavior C, count the proportion of behavior T onsets in the time-window preceding the onset of behavior C vs the proportion of behavior T onsets in the time-window following the onset of behavior C.
A value closer to 1.0 indicates that behavior T always precedes behavior C. A value closer to 0.0 indicates that behavior T follows behavior C. A value of -1.0 indicates that behavior T neither precedes nor follows behavior C.
See also
simba.data_processors.fsttc_calculator.FSTTCCalculator
- Example
>>> df = pd.DataFrame(np.random.randint(0, 2, (100, 2)), columns=['Attack', 'Sniffing']) >>> FeatureExtractionSupplemental.sequential_lag_analysis(data=df, criterion='Attack', target='Sniffing', fps=5, time_window=2.0)
References
- 1
Casarrubea et al., Structural analyses in the study of behavior: From rodents to non-human primates, Frontiers in Psychology, 2022.
- static spontaneous_alternations(data: DataFrame, arm_names: List[str], center_name: str) Tuple[Dict[Union[str, Tuple[int]], int]] [source]
Detects spontaneous alternations between a set of user-defined ROIs.
- Parameters
data (pd.DataFrame) – DataFrame containing shape data where each row represents a frame and each column represents a shape where 0 represents not in ROI and 1 represents inside the ROI
arm_names (List[str]) – List of column names in the DataFrame corresponding to arm ROI names.
- Returns Dict[Union[str, Tuple[str], Union[int, float, List[int]]]]
Dict with the following keys and values:
‘pct_alternation’: Percent alternation computed as (spontaneous alternation cnt / (total number of arm entries - (number of arms - 1))) × 100
‘alternation_cnt’: The sliding count of ROI entry sequences of length len(shape_names) that are all unique.
‘same_arm_returns_cnt’: Aggregate count of sequential visits to the same ROI.
‘alternate_arm_returns_cnt’: Aggregate count of errors which are not same-arm-return errors.
‘error_cnt’: Aggregate error count (same_arm_returns_cnt + alternate_arm_returns_cnt),
‘same_arm_returns_dict’: Dictionary where keys are ROI names and values are lists of frames when same-arm-return errors were committed.
‘alternate_arm_returns_dict’: Dictionary where keys are ROI names and values are lists of frames when alternate-arm-return errors were committed.
‘alternations_dict’: Dictionary with the keys being unique ROI name tuple sequences of length len(shape_names) and values are a list of frames when the sequence was completed.
‘arm_entry_sequence’: Pandas dataframe with three columns: the sequence of arm names entered, the frame the animal entered the arm, and the frame the animal left the arm.
- Example
>>> data = np.zeros((100, 4), dtype=int) >>> random_indices = np.random.randint(0, 4, size=100) >>> for i in range(100): data[i, random_indices[i]] = 1 >>> df = pd.DataFrame(data, columns=['left', 'top', 'right', 'bottom']) >>> spontanous_alternations = FeatureExtractionSupplemental.spontaneous_alternations(data=df, shape_names=['left', 'top', 'right', 'bottom'])
- static velocity_aggregator(config_path: Union[str, PathLike], data_dir: Union[str, PathLike], body_part: str, ts_plot: Optional[bool] = True)[source]
Aggregate and plot velocity data from multiple pose-estimation files.
- Parameters
config_path (Union[str, os.PathLike]) – Path to SimBA configuration file.
data_dir (Union[str, os.PathLike]) – Directory containing data files.
body_part (str) – Body part to use when calculating velocity.
ts_plot (Optional[bool]) – Whether to generate a time series plot of velocities for each data file. Defaults to True.
- Example
>>> config_path = '/Users/simon/Desktop/envs/simba/troubleshooting/two_black_animals_14bp/project_folder/project_config.ini' >>> data_dir = '/Users/simon/Desktop/envs/simba/troubleshooting/two_black_animals_14bp/project_folder/csv/outlier_corrected_movement_location' >>> body_part = 'Nose_1' >>> FeatureExtractionSupplemental.velocity_aggregator(config_path=config_path, data_dir=data_dir, body_part=body_part)
Statistics methods
- class simba.mixins.statistics_mixin.Statistics[source]
Bases:
FeatureExtractionMixin
Primarily frequentist statistics methods used for feature extraction or drift assessment.
Note
Most methods implemented using numba parallelization for improved run-times. See line graph below for expected run-times for a few methods included in this class.
Most methods have numba-typed signatures to decrease compilation time through reduced type inference. Make sure to pass the correct dtypes as indicated by the signature decorators. If dtype is not specified at array creation, it will typically be
float64
or
int64
. As most methods here use
float32
for the input data argument, make sure to downcast. This class contains a few probability distribution comparison methods. These are being moved to
simba.sandbox.distances
(05.24).- static adjusted_mutual_info(x: ndarray, y: ndarray) float [source]
Calculate the Adjusted Mutual Information (AMI) between two clusterings as a measure of similarity.
Calculates the Adjusted Mutual Information (AMI) between two sets of cluster labels. AMI measures the agreement between two clustering results, accounting for chance agreement. The value of AMI ranges from 0 (indicating no agreement) to 1 (perfect agreement).
AMI(x, y) = \frac{MI(x, y) - E(MI(x, y))}{\max(H(x), H(y)) - E(MI(x, y))}
- where:
( MI(x, y) ) is the mutual information between ( x ) and ( y ),
( E(MI(x, y)) ) is the expected mutual information,
( H(x) ) and ( H(y) ) are the entropies of ( x ) and ( y ), respectively.
- param np.ndarray x
1D array representing the labels of the first model.
- param np.ndarray y
1D array representing the labels of the second model.
- return float
Score between 0 and 1, where 1 indicates perfect clustering agreement.
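No usage example accompanies this method. As a sanity check of AMI behavior, scikit-learn's reference implementation (assumed available; not the jitted SimBA method) shows that relabeling clusters does not affect the score:

```python
import numpy as np
from sklearn.metrics import adjusted_mutual_info_score

# Two clusterings that are identical up to a relabeling of cluster ids.
x = np.array([0, 0, 1, 1, 2, 2])
y = np.array([1, 1, 0, 0, 2, 2])
print(adjusted_mutual_info_score(x, y))  # 1.0: perfect agreement

# Unrelated random labels drift toward 0 (chance-level agreement).
rng = np.random.default_rng(42)
print(adjusted_mutual_info_score(rng.integers(0, 3, 600), rng.integers(0, 3, 600)))
```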
- static adjusted_rand(x: ndarray, y: ndarray) float [source]
Calculate the Adjusted Rand Index (ARI) between two clusterings.
The Adjusted Rand Index (ARI) is a measure of the similarity between two clusterings. It considers all pairs of samples and counts pairs that are assigned to the same or different clusters in both the true and predicted clusterings.
The ARI is defined as:
- where:
TP (True Positive) is the number of pairs of elements that are in the same cluster in both x and y,
FP (False Positive) is the number of pairs of elements that are in the same cluster in y but not in x,
FN (False Negative) is the number of pairs of elements that are in the same cluster in x but not in y,
TN (True Negative) is the number of pairs of elements that are in different clusters in both x and y.
The ARI value ranges from -1 to 1. A value of 1 indicates perfect clustering agreement, 0 indicates random clustering, and negative values indicate disagreement between the clusterings.
Note
Modified from scikit-learn
- Parameters
x (np.ndarray) – 1D array representing the labels of the first model.
y (np.ndarray) – 1D array representing the labels of the second model.
- Return float
A value of 1 indicates perfect clustering agreement, a value of 0 indicates random clustering, and negative values indicate disagreement between the clusterings.
- Example
>>> x = np.array([0, 0, 0, 0, 0]) >>> y = np.array([1, 1, 1, 1, 1]) >>> Statistics.adjusted_rand(x=x, y=y) >>> 1.0
- static bray_curtis_dissimilarity(x: ndarray, w: Optional[ndarray] = None) ndarray [source]
Jitted compute of the Bray-Curtis dissimilarity matrix between samples based on feature values.
The Bray-Curtis dissimilarity measures the dissimilarity between two samples based on their feature values. It is useful for finding similar frames based on behavior.
Note
Adapted from pynndescent.
- Parameters
x (np.ndarray) – 2d array with likely normalized feature values.
w (Optional[np.ndarray]) – Optional 2d array with weights of same size as x. Default None and all observations will have the same weight.
- Returns np.ndarray
2d array with same size as x representing dissimilarity values, where 0 indicates identical observations and 1 indicates completely dissimilar observations.
- Example
>>> x = np.array([[1, 1, 1, 1, 1], [0, 0, 0, 0, 0], [0, 0, 0, 0, 0], [1, 1, 1, 1, 1]]).astype(np.float32) >>> Statistics().bray_curtis_dissimilarity(x=x) >>> [[0, 1., 1., 0.], [1., 0., 0., 1.], [1., 0., 0., 1.], [0., 1., 1., 0.]]
- static brunner_munzel(sample_1: ndarray, sample_2: ndarray) float [source]
Jitted compute of Brunner-Munzel W between two distributions.
The Brunner-Munzel W statistic compares the central tendency and the spread of two independent samples. It is useful for comparing the distribution of a continuous variable between two groups, especially when the assumptions of parametric tests like the t-test are violated.
Note
Modified from scipy.stats.brunnermunzel
- where:
( n_x ) and ( n_y ) are the sizes of sample_1 and sample_2 respectively,
( \bar{R}_x ) and ( \bar{R}_y ) are the mean ranks of sample_1 and sample_2 respectively,
( S_x ) and ( S_y ) are the dispersion statistics of sample_1 and sample_2 respectively.
- Parameters
sample_1 (ndarray) – First 1d array representing feature values.
sample_2 (ndarray) – Second 1d array representing feature values.
- Returns float
Brunner-Munzel W.
- Example
>>> sample_1, sample_2 = np.random.normal(loc=10, scale=2, size=10), np.random.normal(loc=20, scale=2, size=10) >>> Statistics().brunner_munzel(sample_1=sample_1, sample_2=sample_2) >>> 0.5751408161437165
- static calinski_harabasz(x: ndarray, y: ndarray) float [source]
Compute the Calinski-Harabasz score to evaluate clustering quality.
The Calinski-Harabasz score is a measure of cluster separation and compactness. It is calculated as the ratio of the between-cluster dispersion to the within-cluster dispersion. A higher score indicates better clustering.
Note
Modified from scikit-learn
- Parameters
x – 2D array representing the data points. Shape (n_samples, n_features/n_dimension).
y – 1D array representing cluster labels for each data point. Shape (n_samples,).
- Return float
Calinski-Harabasz score.
- Example
>>> x = np.random.random((100, 2)).astype(np.float32) >>> y = np.random.randint(0, 100, (100,)).astype(np.int64) >>> Statistics.calinski_harabasz(x=x, y=y)
- static chi_square(sample_1: ndarray, sample_2: ndarray, critical_values: Optional[ndarray] = None, type: Optional[typing_extensions.Literal['goodness_of_fit', 'independence']] = 'goodness_of_fit') Tuple[float, Optional[bool]] [source]
Jitted compute of chi square between two categorical distributions.
- Parameters
sample_1 (ndarray) – First 1d array representing feature values.
sample_2 (ndarray) – Second 1d array representing feature values.
critical_values (ndarray) – 2D array with where indexes represent degrees of freedom and values represent critical values. Can be found in
simba.assets.critical_values_05.pickle
Note
Requires sample_1 and sample_2 to be numeric. If working with strings, convert to numeric category values before using chi_square.
- Example
>>> sample_1 = np.array([1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 5]).astype(np.float32) >>> sample_2 = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]).astype(np.float32) >>> critical_values = pickle.load(open("simba/assets/lookups/critical_values_5.pickle", "rb"))['chi_square']['one_tail'].values >>> Statistics.chi_square(sample_1=sample_2, sample_2=sample_1, critical_values=critical_values, type='goodness_of_fit') >>> (8.333, False) >>>
- static cochrans_q(data: ndarray) Tuple[float, float] [source]
Compute Cochrans Q for 2-dimensional boolean array.
Cochran’s Q statistic is used to test for significant differences between more than two proportions. It can be used to evaluate if the performance of multiple (>=2) classifiers on the same data is the same or significantly different.
Note
If two classifiers, consider
simba.mixins.statistics.Statistics.mcnemar
.Useful background: https://psych.unl.edu/psycrs/handcomp/hccochran.PDF
- Parameters
data (np.ndarray) – Two dimensional array of boolean values where axis 1 represents classifiers or features and rows represent frames.
- Return Tuple[float, float]
Cochran’s Q statistic and significance value.
- Example
>>> data = np.random.randint(0, 2, (100000, 4)) >>> Statistics.cochrans_q(data=data)
- static cohens_d(sample_1: ndarray, sample_2: ndarray) float [source]
Jitted compute of Cohen’s d between two distributions.
Cohen’s d is a measure of effect size that quantifies the difference between the means of two distributions in terms of their standard deviation. It is calculated as the difference between the means of the two distributions divided by the pooled standard deviation.
Higher values indicate a larger effect size, with 0.2 considered a small effect, 0.5 a medium effect, and 0.8 or above a large effect. Negative values indicate that the mean of sample 2 is larger than the mean of sample 1.
d = \frac{\bar{x}_1 - \bar{x}_2}{s_{pooled}}
- where:
( \bar{x}_1 ) and ( \bar{x}_2 ) are the means of sample_1 and sample_2 respectively,
( s_1 ) and ( s_2 ) are the standard deviations of sample_1 and sample_2 respectively, and ( s_{pooled} ) is the pooled standard deviation computed from ( s_1 ) and ( s_2 ).
- Parameters
sample_1 (ndarray) – First 1d array representing feature values.
sample_2 (ndarray) – Second 1d array representing feature values.
- Returns float
Cohens D statistic.
- Example
>>> sample_1 = [2, 4, 7, 3, 7, 35, 8, 9] >>> sample_2 = [4, 8, 14, 6, 14, 70, 16, 18] >>> Statistics().cohens_d(sample_1=sample_1, sample_2=sample_2) >>> -0.5952099775170546
- static cohens_h(sample_1: ndarray, sample_2: ndarray) float [source]
Jitted compute Cohen’s h effect size for two samples of binary [0, 1] values. Cohen’s h is a measure of effect size for comparing two independent samples based on the differences in proportions of the two samples.
Where N_1 and N_2 are the sample sizes of sample_1 and sample_2, respectively.
- Parameters
sample_1 (np.ndarray) – 1D array with binary [0, 1] values (e.g., first classifier inference values).
sample_2 (np.ndarray) – 1D array with binary [0, 1] values (e.g., second classifier inference values).
- Return float
Cohen’s h effect size.
- Example
>>> sample_1 = np.array([1, 0, 0, 1]) >>> sample_2 = np.array([1, 1, 1, 0]) >>> Statistics().cohens_h(sample_1=sample_1, sample_2=sample_2) >>> -0.5235987755982985
- static cohens_kappa(sample_1: ndarray, sample_2: ndarray)[source]
Jitted compute Cohen’s Kappa coefficient for two binary samples.
Cohen’s Kappa coefficient measures the agreement between two sets of binary ratings, taking into account agreement occurring by chance. It ranges from -1 to 1, where 1 indicates perfect agreement, 0 indicates agreement by chance, and -1 indicates complete disagreement.
- where:
( kappa ) is Cohen’s Kappa coefficient,
( w_{ij} ) are the weights,
( D_{ij} ) are the observed frequencies,
( E_{ij} ) are the expected frequencies.
- Example
>>> sample_1 = np.random.randint(0, 2, size=(10000,)) >>> sample_2 = np.random.randint(0, 2, size=(10000,)) >>> Statistics.cohens_kappa(sample_1=sample_1, sample_2=sample_2)
- static concordance_ratio(x: ndarray, invert: bool) float [source]
Calculate the concordance ratio of a 2D numpy array.
- Parameters
x (np.ndarray) – A 2D numpy array with ordinals represented as integers.
invert (bool) – If True, the concordance ratio is inverted and the discordance ratio is returned.
- Return float
The concordance ratio, representing the count of rows with only one unique value divided by the total number of rows in the array.
- Example
>>> x = np.random.randint(0, 2, (5000, 4)) >>> results = Statistics.concordance_ratio(x=x, invert=False)
- static cov_matrix(data: ndarray)[source]
Jitted helper to compute the covariance matrix of the input data. Helper for computing cronbach alpha, multivariate analysis, and distance computations.
- Parameters
data (np.ndarray) – 2-dimensional numpy array representing the input data with shape (n, m), where n is the number of observations and m is the number of features.
- Returns
Covariance matrix of the input data with shape (m, m). The (i, j)-th element of the matrix represents the covariance between the i-th and j-th features in the data.
- Example
>>> data = np.random.randint(0,2, (200, 40)).astype(np.float32) >>> covariance_matrix = Statistics.cov_matrix(data=data)
- static d_prime(x: ndarray, y: ndarray, lower_limit: Optional[float] = 0.0001, upper_limit: Optional[float] = 0.9999) float [source]
Computes d-prime from two Boolean 1d arrays, e.g., between classifications and ground truth.
D-prime (d’) is a measure of signal detection performance, indicating the ability to discriminate between signal and noise. It is computed as the difference between the inverse cumulative distribution function (CDF) of the hit rate and the false alarm rate.
- Parameters
x (np.ndarray) – Boolean 1D array of response values, where 1 represents presence, and 0 representing absence.
y (np.ndarray) – Boolean 1D array of ground truth, where 1 represents presence, and 0 representing absence.
lower_limit (Optional[float]) – Lower limit to bound hit and false alarm rates. Defaults to 0.0001.
upper_limit (Optional[float]) – Upper limit to bound hit and false alarm rates. Defaults to 0.9999.
- Return float
The calculated d’ (d-prime) value.
- Example
>>> x = np.random.randint(0, 2, (1000,)) >>> y = np.random.randint(0, 2, (1000,)) >>> Statistics.d_prime(x=x, y=y)
- davis_bouldin(x: ndarray, y: ndarray) float [source]
Calculate the Davis-Bouldin index for evaluating clustering performance.
Davis-Bouldin index measures the clustering quality based on the within-cluster similarity and between-cluster dissimilarity. Lower values indicate better clustering.
Note
Modified from scikit-learn
- Parameters
x (np.ndarray) – 2D array representing the data points. Shape (n_samples, n_features/n_dimension).
y (np.ndarray) – 1D array representing cluster labels for each data point. Shape (n_samples,).
- Return float
Davis-Bouldin score.
- Example
>>> x = np.random.randint(0, 100, (100, 2)) >>> y = np.random.randint(0, 3, (100,)) >>> Statistics.davis_bouldin(x=x, y=y)
- static dunn_index(x: ndarray, y: ndarray) float [source]
Calculate the Dunn index to evaluate the quality of clustered labels.
This function calculates the Dunn index, which is a measure of clustering quality. The index considers the ratio of the minimum inter-cluster distance to the maximum intra-cluster distance. A higher Dunn index indicates better clustering.
Note
Modified from jqmviegas
Wiki https://en.wikipedia.org/wiki/Dunn_index
Uses Euclidean distances.
- Parameters
x (np.ndarray) – 2D array representing the data points. Shape (n_samples, n_features).
y (np.ndarray) – 1D array representing cluster labels for each data point. Shape (n_samples,).
- Return float
The Dunn index value
- Example
>>> x = np.random.randint(0, 100, (100, 2)) >>> y = np.random.randint(0, 3, (100,)) >>> Statistics.dunn_index(x=x, y=y)
- static elliptic_envelope(data: ndarray, contamination: Optional[float] = 0.1, normalize: Optional[bool] = False, groupby_idx: Optional[int] = None) ndarray [source]
Compute the Mahalanobis distances of each observation in the input array using Elliptic Envelope method.
- Parameters
- Return np.ndarray
The Mahalanobis distances of each observation in array. Larger values indicate outliers.
- Example
>>> data, lbls = make_blobs(n_samples=2000, n_features=2, centers=1, random_state=42) >>> envelope_score = Statistics.elliptic_envelope(data=data, normalize=True) >>> results = np.hstack((data[:, 0:2], envelope_score.reshape(envelope_score.shape[0], 1))) >>> results = pd.DataFrame(results, columns=['X', 'Y', 'ENVELOPE SCORE']) >>> PlottingMixin.continuous_scatter(data=results, palette='seismic', bg_clr='lightgrey', columns=['X', 'Y', 'ENVELOPE SCORE'], size=30)
- static eta_squared(x: ndarray, y: ndarray) float [source]
Calculate eta-squared, a measure of between-subjects effect size.
Eta-squared (η²) is calculated as the ratio of the sum of squares between groups to the total sum of squares. Values range from 0 to 1, where larger values indicate a stronger effect size.
η² = SS_between / (SS_between + SS_within)
- where:
SS_between is the sum of squares between groups.
SS_within is the sum of squares within groups.
- param np.ndarray x
1D array containing the dependent variable data.
- param np.ndarray y
1D array containing the grouping variable (categorical) data of the same size as x.
- return float
The eta-squared value representing the proportion of variance in the dependent variable that is attributable to the grouping variable.
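The ratio above can be sketched directly in NumPy (the function name is illustrative, not SimBA's API):

```python
import numpy as np

def eta_squared(x: np.ndarray, y: np.ndarray) -> float:
    # eta^2 = SS_between / (SS_between + SS_within), grouping the
    # dependent variable x by the categorical labels in y.
    grand_mean = x.mean()
    ss_between = ss_within = 0.0
    for lbl in np.unique(y):
        grp = x[y == lbl]
        ss_between += grp.size * (grp.mean() - grand_mean) ** 2
        ss_within += ((grp - grp.mean()) ** 2).sum()
    return ss_between / (ss_between + ss_within)
```

Groups that are internally constant give η² = 1 (all variance is between groups); groups with identical means give η² = 0.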
- static find_collinear_features(df: DataFrame, threshold: float, method: Optional[typing_extensions.Literal['pearson', 'spearman', 'kendall']] = 'pearson', verbose: Optional[bool] = False) List[str] [source]
Identify collinear features in the dataframe based on the specified correlation method and threshold.
- Parameters
df (pd.DataFrame) – Input DataFrame containing features.
threshold (float) – Threshold value to determine collinearity.
method (Optional[Literal['pearson', 'spearman', 'kendall']]) – Method for calculating correlation. Defaults to ‘pearson’.
- Returns
Set of feature names identified as collinear. Returns one feature for every feature pair with correlation value above specified threshold.
- Example
>>> x = pd.DataFrame(np.random.randint(0, 100, (100, 100)))
>>> names = Statistics.find_collinear_features(df=x, threshold=0.2, method='pearson', verbose=True)
- static fowlkes_mallows(x: ndarray, y: ndarray) float [source]
Calculate the Fowlkes-Mallows Index (FMI) between two clusterings.
The Fowlkes-Mallows index (FMI) is a measure of similarity between two clusterings. It compares the similarity of the clusters obtained by two different clustering algorithms or procedures.
The index is defined as the geometric mean of the pairwise precision and recall:
FMI = TP / √((TP + FP) · (TP + FN))
- where:
TP (True Positive) is the number of pairs of elements that are in the same cluster in both x and y,
FP (False Positive) is the number of pairs of elements that are in the same cluster in y but not in x,
FN (False Negative) is the number of pairs of elements that are in the same cluster in x but not in y.
Note
Modified from scikit-learn
- Parameters
x (np.ndarray) – 1D array representing the labels of the first model.
y (np.ndarray) – 1D array representing the labels of the second model.
- Return float
Score between 0 and 1. 1 indicates perfect clustering agreement, 0 indicates random clustering.
- static grubbs_test(x: ndarray, left_tail: Optional[bool] = False) float [source]
Perform Grubbs’ test to detect outliers if the minimum or maximum value in a feature series is an outlier.
Grubbs’ test is a statistical test used to detect outliers in a univariate data set. It calculates the Grubbs’ test statistic as the absolute difference between the extreme value (either the minimum or maximum) and the sample mean, divided by the sample standard deviation.
G = |x̄ − x_extreme| / s
- where:
x̄ is the sample mean,
x_extreme is the minimum or maximum value of the sample (depending on the tail being tested),
s is the sample standard deviation.
- param np.ndarray x
1D array representing numeric data.
- param Optional[bool] left_tail
If True, the test calculates the Grubbs’ test statistic for the left tail (minimum value). If False (default), it calculates the statistic for the right tail (maximum value).
- return float
The computed Grubbs’ test statistic.
- example
>>> x = np.random.random((100,))
>>> Statistics.grubbs_test(x=x)
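A minimal NumPy sketch of the statistic described above (the function name is illustrative, not SimBA's API):

```python
import numpy as np

def grubbs_statistic(x: np.ndarray, left_tail: bool = False) -> float:
    # G = |mean - extreme| / s, where the extreme value is the minimum
    # (left tail) or the maximum (right tail), and s is the sample
    # standard deviation (ddof=1).
    extreme = x.min() if left_tail else x.max()
    return abs(x.mean() - extreme) / x.std(ddof=1)
```

For a sample with one large right-tail value, the right-tail statistic exceeds the left-tail statistic.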
- static hamming_distance(x: ndarray, y: ndarray, sort: Optional[bool] = False, w: Optional[ndarray] = None) float [source]
Jitted compute of the Hamming similarity between two vectors.
The Hamming distance measures the dissimilarity between two binary vectors as the proportion of positions at which the corresponding elements differ.
Note
If w is not provided, equal weights are assumed. Adapted from pynndescent.
D(x, y) = (1/n) · Σ w_i · 1[x_i ≠ y_i]
- where:
n is the length of the vectors,
w_i is the weight associated with the ith element of the vectors.
- Parameters
x (np.ndarray) – First binary vector.
y (np.ndarray) – Second binary vector.
w (Optional[np.ndarray]) – Optional weights for each element. Can be classification probabilities. If not provided, equal weights are assumed.
sort (Optional[bool]) – If True, sorts x and y prior to hamming distance calculation. Default, False.
- Example
>>> x, y = np.random.randint(0, 2, (10,)).astype(np.int8), np.random.randint(0, 2, (10,)).astype(np.int8)
>>> Statistics().hamming_distance(x=x, y=y)
>>> 0.91
- static hartley_fmax(x: ndarray, y: ndarray) float [source]
Compute Hartley’s Fmax statistic to test for equality of variances between two features or groups.
Hartley’s Fmax statistic is used to test whether two samples have equal variances. It is calculated as the ratio of the largest sample variance to the smallest sample variance. Values close to one represent closer to equal variance.
F_max = max(Var(x), Var(y)) / min(Var(x), Var(y))
- where:
Var(x) is the variance of sample x,
Var(y) is the variance of sample y.
- param np.ndarray x
1D array representing numeric data of the first group/feature.
- param np.ndarray y
1D array representing numeric data of the second group/feature.
- example
>>> x = np.random.random((100,))
>>> y = np.random.random((100,))
>>> Statistics.hartley_fmax(x=x, y=y)
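The ratio above is a two-liner in NumPy; a sketch (function name illustrative, not SimBA's API):

```python
import numpy as np

def hartley_fmax(x: np.ndarray, y: np.ndarray) -> float:
    # Ratio of the larger sample variance to the smaller (ddof=1).
    # Values close to 1 indicate near-equal variances.
    vx, vy = np.var(x, ddof=1), np.var(y, ddof=1)
    return max(vx, vy) / min(vx, vy)
```

The statistic is symmetric in its arguments by construction.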
- hbos(data: ndarray, bucket_method: typing_extensions.Literal['fd', 'doane', 'auto', 'scott', 'stone', 'rice', 'sturges', 'sqrt'] = 'auto') ndarray [source]
Jitted compute of Histogram-based Outlier Scores (HBOS). HBOS quantifies the abnormality of data points based on the densities of their feature values within their respective buckets over all feature values.
- Parameters
data (np.ndarray) – 2d array with frames represented by rows and columns representing feature values.
bucket_method (Literal) – Estimator determining optimal bucket count and bucket width. Default: The maximum of the Sturges and Freedman-Diaconis estimators.
- Return np.ndarray
Array of size data.shape[0] representing outlier scores, with higher values representing greater outliers.
- Example
>>> sample_1 = np.random.randint(low=1, high=3, size=(10, 50)).astype(np.float64)
>>> sample_2 = np.random.randint(low=7, high=21, size=(2, 50)).astype(np.float64)
>>> data = np.vstack([sample_1, sample_2])
>>> Statistics().hbos(data=data)
- hellinger_distance(x: ndarray, y: ndarray, bucket_method: Optional[typing_extensions.Literal['fd', 'doane', 'auto', 'scott', 'stone', 'rice', 'sturges', 'sqrt']] = 'auto') float [source]
Compute the Hellinger distance between two vector distributions.
Note
The Hellinger distance is bounded and ranges from 0 to √2. A distance of √2 indicates that the two distributions are maximally dissimilar.
- Parameters
x (np.ndarray) – First 1D array representing a probability distribution.
y (np.ndarray) – Second 1D array representing a probability distribution.
bucket_method (Optional[Literal['fd', 'doane', 'auto', 'scott', 'stone', 'rice', 'sturges', 'sqrt']]) – Method for computing histogram bins. Default is ‘auto’.
- Returns float
Hellinger distance between the two input probability distributions.
- Example
>>> x = np.random.randint(0, 9000, (500000,))
>>> y = np.random.randint(0, 9000, (500000,))
>>> Statistics().hellinger_distance(x=x, y=y, bucket_method='auto')
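One plausible sketch of the computation, assuming shared bin edges over both samples and the √2-bounded convention from the note above (the exact binning strategy of the SimBA implementation is not shown here, and the function name is illustrative):

```python
import numpy as np

def hellinger_distance(x: np.ndarray, y: np.ndarray, bucket_method: str = "auto") -> float:
    # Bin both samples on a shared set of edges, normalise the counts
    # to probabilities, then apply
    # H(P, Q) = sqrt(sum((sqrt(p) - sqrt(q))**2)), bounded by sqrt(2).
    edges = np.histogram_bin_edges(np.hstack([x, y]), bins=bucket_method)
    p, _ = np.histogram(x, bins=edges)
    q, _ = np.histogram(y, bins=edges)
    p, q = p / p.sum(), q / q.sum()
    return float(np.sqrt(((np.sqrt(p) - np.sqrt(q)) ** 2).sum()))
```

Identical samples give 0; samples with disjoint support approach √2.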
- static independent_samples_t(sample_1: np.ndarray, sample_2: np.ndarray, critical_values: Optional[np.ndarray] = None) → Tuple[float, Union[None, bool]][source]
Jitted compute independent-samples t-test statistic and boolean significance between two distributions.
Note
Critical values are stored in simba.assets.lookups.critical_values_**.pickle
The t-statistic for independent samples t-test is calculated using the following formula:
t = (x̄_1 − x̄_2) / (s_p · √(1/n_1 + 1/n_2))
- where:
x̄_1 and x̄_2 are the means of sample_1 and sample_2 respectively,
s_p is the pooled standard deviation,
n_1 and n_2 are the sample sizes of sample_1 and sample_2 respectively.
- parameter ndarray sample_1
First 1d array representing feature values.
- parameter ndarray sample_2
Second 1d array representing feature values.
- parameter ndarray critical_values
2d array where the first column represents degrees of freedom and second column represents critical values.
- returns Tuple[float, Union[None, bool]]
Representing the t-statistic and associated significance. The second element is None if critical_values is None; else True or False, with True representing a significant result.
- example
>>> sample_1 = np.array([1, 2, 3, 1, 3, 2, 1, 10, 8, 4, 10])
>>> sample_2 = np.array([2, 5, 10, 4, 8, 10, 7, 10, 7, 10, 10])
>>> Statistics().independent_samples_t(sample_1=sample_1, sample_2=sample_2)
>>> (-2.5266046804590183, None)
>>> critical_values = pickle.load(open("simba/assets/lookups/critical_values_05.pickle", "rb"))['independent_t_test']['one_tail'].values
>>> Statistics().independent_samples_t(sample_1=sample_1, sample_2=sample_2, critical_values=critical_values)
>>> (-2.5266046804590183, True)
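The pooled-variance formula above can be sketched in plain NumPy, omitting the critical-value lookup (function name illustrative, not SimBA's API):

```python
import numpy as np

def independent_samples_t(sample_1: np.ndarray, sample_2: np.ndarray) -> float:
    # t = (m1 - m2) / (s_p * sqrt(1/n1 + 1/n2)), with s_p the pooled
    # standard deviation computed from the two sample variances (ddof=1).
    n1, n2 = sample_1.size, sample_2.size
    pooled_var = ((n1 - 1) * np.var(sample_1, ddof=1) +
                  (n2 - 1) * np.var(sample_2, ddof=1)) / (n1 + n2 - 2)
    return float((sample_1.mean() - sample_2.mean()) /
                 (np.sqrt(pooled_var) * np.sqrt(1 / n1 + 1 / n2)))
```

Note the statistic is antisymmetric in its arguments and zero for identical samples.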
- static isolation_forest(x: ndarray, estimators: Union[int, float] = 0.2, groupby_idx: Optional[int] = None, normalize: Optional[bool] = False)[source]
An implementation of the Isolation Forest algorithm for outlier detection.
Note
The isolation forest scores are negated. Thus, higher values indicate more atypical (outlier) data points.
- Parameters
x (np.ndarray) – 2-D array with feature values.
estimators (Union[int, float]) – Number of estimators (trees). If the value is a float, then interpreted as a ratio of the number of observations in x.
groupby_idx (Optional[int]) – If int, then the index along axis 1 of x for which to group the data and compute scores on each segment. E.g., can be a field holding a cluster identifier.
normalize (Optional[bool]) – Whether to normalize the outlier score between 0 and 1. Defaults to False.
- Returns
Array of size x.shape[0] with outlier scores; higher values indicate more atypical observations.
- Example
>>> x, lbls = make_blobs(n_samples=10000, n_features=2, centers=10, random_state=42)
>>> x = np.hstack((x, lbls.reshape(-1, 1)))
>>> scores = isolation_forest(x=x, estimators=10, normalize=True)
>>> results = np.hstack((x[:, 0:2], scores.reshape(scores.shape[0], 1)))
>>> results = pd.DataFrame(results, columns=['X', 'Y', 'ISOLATION SCORE'])
>>> PlottingMixin.continuous_scatter(data=results, palette='seismic', bg_clr='lightgrey', columns=['X', 'Y', 'ISOLATION SCORE'], size=30)
- static jaccard_distance(x: ndarray, y: ndarray) float [source]
Calculate the Jaccard distance between two 1D NumPy arrays.
The Jaccard distance is a measure of dissimilarity between two sets. It is defined as one minus the size of the intersection of the sets divided by the size of the union of the sets.
- Parameters
x (np.ndarray) – The first 1D NumPy array.
y (np.ndarray) – The second 1D NumPy array.
- Return float
The Jaccard distance between arrays x and y.
- Example
>>> x = np.random.randint(0, 5, (100))
>>> y = np.random.randint(0, 7, (100))
>>> Statistics.jaccard_distance(x=x, y=y)
>>> 0.2857143
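Consistent with the example above (where the value sets {0..4} and {0..6} give 2/7), the arrays are treated as sets of observed values. A sketch (function name illustrative, not SimBA's API):

```python
import numpy as np

def jaccard_distance(x: np.ndarray, y: np.ndarray) -> float:
    # Treat each array as the set of values it contains and return
    # 1 - |intersection| / |union|.
    xs, ys = set(np.unique(x)), set(np.unique(y))
    return 1.0 - len(xs & ys) / len(xs | ys)
```

Arrays containing the same set of values give a distance of 0 regardless of element order.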
- jensen_shannon_divergence(sample_1: ndarray, sample_2: ndarray, bucket_method: typing_extensions.Literal['fd', 'doane', 'auto', 'scott', 'stone', 'rice', 'sturges', 'sqrt'] = 'auto') float [source]
Compute Jensen-Shannon divergence between two distributions. Useful for (i) measure drift in datasets, and (ii) featurization of distribution shifts across sequential time-bins.
Note
JSD = 0: Indicates that the two distributions are identical. 0 < JSD < 1: Indicates a degree of dissimilarity between the distributions, with values closer to 1 indicating greater dissimilarity. JSD = 1: Indicates that the two distributions are maximally dissimilar.
JSD(P_1 ‖ P_2) = (KL(P_1 ‖ M) + KL(P_2 ‖ M)) / 2, where M = (P_1 + P_2) / 2
- parameter ndarray sample_1
First 1d array representing feature values.
- parameter ndarray sample_2
Second 1d array representing feature values.
- parameter Literal bucket_method
Estimator determining optimal bucket count and bucket width. Default: The maximum of the Sturges and Freedman-Diaconis estimators.
- returns float
Jensen-Shannon divergence between sample_1 and sample_2.
- example
>>> sample_1, sample_2 = np.array([1, 2, 3, 4, 5, 10, 1, 2, 3]), np.array([1, 5, 10, 9, 10, 1, 10, 6, 7])
>>> Statistics().jensen_shannon_divergence(sample_1=sample_1, sample_2=sample_2, bucket_method='fd')
>>> 0.30806541358219786
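SimBA bins the raw samples before computing the divergence; the core formula, applied to already-binned counts or probability vectors, can be sketched as follows (assuming log base 2 so the result lies in [0, 1]; function name illustrative, not SimBA's API):

```python
import numpy as np

def jensen_shannon_divergence(p: np.ndarray, q: np.ndarray) -> float:
    # JSD = (KL(P || M) + KL(Q || M)) / 2 with M = (P + Q) / 2,
    # computed in bits (log base 2) so the value lies in [0, 1].
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    def kl(a, b):
        mask = a > 0  # 0 * log(0) is treated as 0
        return (a[mask] * np.log2(a[mask] / b[mask])).sum()
    return float(0.5 * kl(p, m) + 0.5 * kl(q, m))
```

Identical distributions give 0; distributions with disjoint support give 1.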
- static kendall_tau(sample_1: ndarray, sample_2: ndarray) Tuple[float, float] [source]
Jitted compute of Kendall Tau (rank correlation coefficient). Non-parametric method for computing correlation between two time-series features. Returns tau and associated z-score.
Kendall Tau is a measure of the correspondence between two rankings. It compares the number of concordant pairs (pairs of elements that are in the same order in both rankings) to the number of discordant pairs (pairs of elements that are in different orders in the rankings).
Kendall Tau is calculated using the following formula:
τ = (C − D) / (C + D)
where C is the count of concordant pairs and D is the count of discordant pairs.
- Parameters
sample_1 (ndarray) – First 1D array with feature values.
sample_2 (ndarray) – Second 1D array with feature values.
- Returns Tuple[float, float]
Kendall Tau and associated z-score.
- Examples
>>> sample_1 = np.array([4, 2, 3, 4, 5, 7]).astype(np.float32)
>>> sample_2 = np.array([1, 2, 3, 4, 5, 7]).astype(np.float32)
>>> Statistics().kendall_tau(sample_1=sample_1, sample_2=sample_2)
>>> (0.7333333333333333, 2.0665401605809928)
- static kmeans_1d(data: ndarray, k: int, max_iters: int, calc_medians: bool) Tuple[ndarray, ndarray, Union[None, DictType]] [source]
Perform k-means clustering on a 1-dimensional dataset.
- Parameters
data (np.ndarray) – 1D array of numeric data to cluster.
k (int) – Number of clusters.
max_iters (int) – Maximum number of iterations to run.
calc_medians (bool) – If True, also compute the median of each cluster.
- Returns Tuple
Tuple of three elements. Final centroids of the clusters. Labels assigned to each data point based on clusters. Cluster medians (if calc_medians is True), otherwise None.
- Example
>>> data_1d = np.array([1, 2, 3, 55, 65, 40, 43, 40]).astype(np.float64)
>>> centroids, labels, medians = Statistics().kmeans_1d(data_1d, 2, 1000, True)
- static kruskal_wallis(sample_1: ndarray, sample_2: ndarray) float [source]
Compute the Kruskal-Wallis H statistic between two distributions.
The Kruskal-Wallis test is a non-parametric method for testing whether samples originate from the same distribution. It ranks all the values from the combined samples, then calculates the H statistic based on the ranks.
H = 12 / (n(n + 1)) · (R_1² / n_1 + R_2² / n_2) − 3(n + 1)
- where:
n is the total number of observations,
n_1 and n_2 are the number of observations in sample 1 and sample 2 respectively,
R_1 and R_2 are the sums of ranks for sample 1 and sample 2 respectively.
- Parameters
sample_1 (ndarray) – First 1d array representing feature values.
sample_2 (ndarray) – Second 1d array representing feature values.
- Returns float
Kruskal-Wallis H statistic.
- Example
>>> sample_1 = np.array([1, 1, 3, 4, 5]).astype(np.float64)
>>> sample_2 = np.array([6, 7, 8, 9, 10]).astype(np.float64)
>>> Statistics().kruskal_wallis(sample_1=sample_1, sample_2=sample_2)
>>> 39.4
- kullback_leibler_divergence(sample_1: ndarray, sample_2: ndarray, fill_value: Optional[int] = 1, bucket_method: typing_extensions.Literal['fd', 'doane', 'auto', 'scott', 'stone', 'rice', 'sturges', 'sqrt'] = 'auto') float [source]
Compute Kullback-Leibler divergence between two distributions.
Note
Empty bins (0 observations in a bin) are replaced with the passed fill_value.
The KL divergence ranges from 0 to positive infinity. When the KL divergence is zero, it indicates that the two distributions are identical. As the KL divergence increases, it signifies an increasing difference between the distributions.
KL(P ‖ Q) = Σ P(x) · log(P(x) / Q(x))
- parameter ndarray sample_1
First 1d array representing feature values.
- parameter ndarray sample_2
Second 1d array representing feature values.
- parameter Optional[int] fill_value
Optional pseudo-count used to fill empty buckets in the sample_2 histogram.
- parameter Literal bucket_method
Estimator determining optimal bucket count and bucket width. Default: The maximum of the Sturges and Freedman-Diaconis estimators.
- returns float
Kullback-Leibler divergence between sample_1 and sample_2.
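A minimal sketch of the histogram-based computation, assuming bin edges taken from the first sample and the pseudo-count fill described in the note (the exact SimBA binning may differ; function name illustrative):

```python
import numpy as np

def kl_divergence(sample_1: np.ndarray, sample_2: np.ndarray,
                  fill_value: int = 1, bucket_method: str = "auto") -> float:
    # Bin both samples on edges derived from sample_1; empty bins are
    # replaced with fill_value (a pseudo-count) before normalising,
    # then KL(P || Q) = sum(p * log(p / q)).
    edges = np.histogram_bin_edges(sample_1, bins=bucket_method)
    p, _ = np.histogram(sample_1, bins=edges)
    q, _ = np.histogram(sample_2, bins=edges)
    p = np.where(p == 0, fill_value, p).astype(float)
    q = np.where(q == 0, fill_value, q).astype(float)
    p, q = p / p.sum(), q / q.sum()
    return float((p * np.log(p / q)).sum())
```

Identical samples give 0; by Gibbs' inequality the result is never negative.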
- static levenes(sample_1: np.ndarray, sample_2: np.ndarray, critical_values: Optional[np.ndarray] = None) → Tuple[float, Union[bool, None]][source]
Compute Levene’s W statistic, a test for the equality of variances between two samples.
Levene’s test is a statistical test used to determine whether two or more groups have equal variances. It is often used as an alternative to the Bartlett test when the assumption of normality is violated. The function computes the Levene’s W statistic, which measures the degree of difference in variances between the two samples.
- Parameters
sample_1 (ndarray) – First 1d array representing feature values.
sample_2 (ndarray) – Second 1d array representing feature values.
critical_values (ndarray) – 2D array where the first column represents dfn and the first row represents dfd, with cell values representing critical values. Can be found in simba.assets.critical_values_05.pickle.
- Returns tuple[float, Union[bool, None]]
Levene’s W statistic and a boolean indicating whether the test is statistically significant (if critical values is not None).
- Examples
>>> sample_1 = np.array(list(range(0, 50)))
>>> sample_2 = np.array(list(range(25, 100)))
>>> Statistics().levenes(sample_1=sample_1, sample_2=sample_2)
>>> 12.63909108903254
>>> critical_values = pickle.load(open("simba/assets/lookups/critical_values_5.pickle", "rb"))['f']['one_tail'].values
>>> Statistics().levenes(sample_1=sample_1, sample_2=sample_2, critical_values=critical_values)
>>> (12.63909108903254, True)
- static local_outlier_factor(data: ndarray, k: Union[int, float] = 5, contamination: Optional[float] = 1e-10, normalize: Optional[bool] = False, groupby_idx: Optional[int] = None) ndarray [source]
Compute the local outlier factor of each observation.
Note
The final LOF scores are negated. Thus, higher values indicate more atypical (outlier) data points. The method calls sklearn.neighbors.LocalOutlierFactor directly. An own jit-compiled implementation was attempted, but runtime was roughly 3x slower than sklearn.neighbors.LocalOutlierFactor.
If groupby_idx is not None, it is the index along axis 1 of the data array for which to group the data and compute LOF within each segment/cluster. E.g., it can be a field holding a cluster identifier. Outliers are thus computed within each segment/cluster, ensuring that other segments cannot affect outlier scores when analyzing each cluster.
If groupby_idx is provided, then all observations with cluster/segment variable -1 will be treated as unclustered and assigned the max outlier score found within the clustered observations.
- Parameters
data (ndarray) – 2D array with feature values where rows represent frames and columns represent features.
k (Union[int, float]) – Number of neighbors to evaluate for each observation. If the value is a float, then interpreted as the ratio of data.shape[0]. If the value is an integer, then it represents the number of neighbours to evaluate.
contamination (Optional[float]) – Small pseudonumber to avoid DivisionByZero error.
normalize (Optional[bool]) – Whether to normalize the distances between 0 and 1. Defaults to False.
groupby_idx (Optional[int]) – If int, then the index along axis 1 of data for which to group the data and compute LOF on each segment. E.g., can be a field holding a cluster identifier.
- Returns np.ndarray
Array of size data.shape[0] with local outlier scores.
- Example
>>> data, lbls = make_blobs(n_samples=2000, n_features=2, centers=10, random_state=42)
>>> data = np.hstack((data, lbls.reshape(-1, 1)))
>>> lof = Statistics.local_outlier_factor(data=data, groupby_idx=2, k=100, normalize=True)
>>> results = np.hstack((data[:, 0:2], lof.reshape(lof.shape[0], 1)))
>>> PlottingMixin.continuous_scatter(data=results, palette='seismic', bg_clr='lightgrey', size=30)
- static mad_median_rule(data: ndarray, k: int) ndarray [source]
Detect outliers using the MAD-Median Rule. Returns a 1D array of size data.shape[0] with 1 representing outlier and 0 representing inlier.
- Parameters
data (np.ndarray) – 1D array of feature values.
k (int) – Outlier threshold multiplier; observations whose distance from the median exceeds k times the MAD are flagged.
- Example
>>> data = np.random.randint(0, 600, (9000000,)).astype(np.float32)
>>> Statistics.mad_median_rule(data=data, k=1)
- static mahalanobis_distance_cdist(data: ndarray) ndarray [source]
Compute the Mahalanobis distance between every pair of observations in a 2D array using numba.
The Mahalanobis distance is a measure of the distance between a point and a distribution. It accounts for correlations between variables and the scales of the variables, making it suitable for datasets where features are not independent and have different variances.
Note
Significantly reduced runtime versus scipy.cdist Mahalanobis, but only with larger feature sets (> 10-50 features).
However, Mahalanobis distance may not be suitable in certain scenarios, such as:
- When the dataset is small and the covariance matrix is not accurately estimated.
- When the dataset contains outliers that significantly affect the estimation of the covariance matrix.
- When the assumptions of multivariate normality are violated.
- Parameters
data (np.ndarray) – 2D array with feature observations. Frames on axis 0 and feature values on axis 1
- Return np.ndarray
Pairwise Mahalanobis distance matrix where element (i, j) represents the Mahalanobis distance between observations i and j.
- Example
>>> data = np.random.randint(0, 50, (1000, 200)).astype(np.float32)
>>> x = Statistics.mahalanobis_distance_cdist(data=data)
- static manhattan_distance_cdist(data: ndarray) ndarray [source]
Compute the pairwise Manhattan distance matrix between points in a 2D array.
Can be preferred over Euclidean distance in scenarios where the movement is restricted to grid-based paths and/or the data is high dimensional.
- Parameters
data (np.ndarray) – 2D array where each row represents a featurized observation (e.g., frame).
- Return np.ndarray
Pairwise Manhattan distance matrix where element (i, j) represents the distance between points i and j.
- Example
>>> data = np.random.randint(0, 50, (10000, 2))
>>> Statistics.manhattan_distance_cdist(data=data)
- static mann_whitney(sample_1: ndarray, sample_2: ndarray) float [source]
Jitted compute of Mann-Whitney U between two distributions.
The Mann-Whitney U test is used to assess whether the distributions of two groups are the same or different based on their ranks. It is commonly used as an alternative to the t-test when the assumptions of normality and equal variances are violated.
U = min(U_1, U_2)
- Where:
U is the Mann-Whitney U statistic,
U_1 is the U statistic computed from the rank sum of sample 1,
U_2 is the U statistic computed from the rank sum of sample 2.
- Parameters
sample_1 (ndarray) – First 1d array representing feature values.
sample_2 (ndarray) – Second 1d array representing feature values.
- Returns float
The Mann-Whitney U statistic.
- References
Modified from James Webber gist on GitHub.
- Example
>>> sample_1 = np.array([1, 1, 3, 4, 5])
>>> sample_2 = np.array([6, 7, 8, 9, 10])
>>> results = Statistics().mann_whitney(sample_1=sample_1, sample_2=sample_2)
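The rank-based computation can be sketched with scipy's rank helper, assuming the common min(U_1, U_2) convention (function name illustrative, not SimBA's jitted API):

```python
import numpy as np
from scipy.stats import rankdata

def mann_whitney_u(sample_1: np.ndarray, sample_2: np.ndarray) -> float:
    # Rank the pooled samples (average ranks for ties), convert the
    # rank sum of sample_1 to its U statistic, and return min(U1, U2).
    n1, n2 = sample_1.size, sample_2.size
    ranks = rankdata(np.hstack([sample_1, sample_2]))
    u1 = ranks[:n1].sum() - n1 * (n1 + 1) / 2
    u2 = n1 * n2 - u1
    return float(min(u1, u2))
```

Fully separated samples give U = 0; identical samples give the maximal-overlap value n1·n2/2.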
- static mcnemar(x: ndarray, y: ndarray, ground_truth: ndarray, continuity_corrected: Optional[bool] = True) Tuple[float, float] [source]
McNemar’s test to compare the difference in predictive accuracy of two models.
E.g., can be used to compute if the accuracy of two classifiers are significantly different when transforming the same data.
Note
Adapted from mlextend.
- Parameters
x (np.ndarray) – 1-dimensional Boolean array with predictions of the first model.
y (np.ndarray) – 1-dimensional Boolean array with predictions of the second model.
ground_truth (np.ndarray) – 1-dimensional Boolean array with ground truth labels.
continuity_corrected (Optional[bool]) – Whether to apply continuity correction. Default is True.
- Example
>>> x = np.random.randint(0, 2, (100000,))
>>> y = np.random.randint(0, 2, (100000,))
>>> ground_truth = np.random.randint(0, 2, (100000,))
>>> Statistics.mcnemar(x=x, y=y, ground_truth=ground_truth)
- static one_way_anova(sample_1: np.ndarray, sample_2: np.ndarray, critical_values: Optional[np.ndarray] = None) → Tuple[float, float][source]
Jitted compute of one-way ANOVA F statistics and associated p-value for two distributions.
- Parameters
sample_1 (ndarray) – First 1d array representing feature values.
sample_2 (ndarray) – Second 1d array representing feature values.
- Returns Tuple[float, float]
Representing ANOVA F statistic and associated probability value.
- Example
>>> sample_1 = np.array([1, 2, 3, 1, 3, 2, 1, 10, 8, 4, 10])
>>> sample_2 = np.array([8, 5, 5, 8, 8, 9, 10, 1, 7, 10, 10])
>>> Statistics().one_way_anova(sample_1=sample_2, sample_2=sample_1)
- static pct_in_top_n(x: ndarray, n: float) float [source]
Compute the percentage of elements in the top ‘n’ frequencies in the input array.
This function calculates the percentage of elements that belong to the ‘n’ most frequent categories in the input array ‘x’.
- Parameters
x (np.ndarray) – Input array.
n (float) – Number of top frequencies.
- Return float
Percentage of elements in the top ‘n’ frequencies.
- Example
>>> x = np.random.randint(0, 10, (100,))
>>> Statistics.pct_in_top_n(x=x, n=5)
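A minimal sketch of the computation, assuming the result is returned as a fraction of the array size (function name illustrative, not SimBA's API):

```python
import numpy as np

def pct_in_top_n(x: np.ndarray, n: int) -> float:
    # Fraction of elements belonging to the n most frequent values.
    _, counts = np.unique(x, return_counts=True)
    return float(np.sort(counts)[::-1][:n].sum() / x.size)
```

For x = [1, 1, 1, 2, 2, 3], the single most frequent value covers half the elements, and the top two cover 5/6.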
- static pearsons_r(sample_1: ndarray, sample_2: ndarray) float [source]
Calculate the Pearson correlation coefficient (Pearson’s r) between two numeric samples.
Pearson’s r is a measure of the linear correlation between two sets of data points. It quantifies the strength and direction of the linear relationship between the two variables. The coefficient varies between -1 and 1, with -1 indicating a perfect negative linear relationship, 1 indicating a perfect positive linear relationship, and 0 indicating no linear relationship.
Pearson’s r is calculated using the formula:
r = Σ(x_i − x̄)(y_i − ȳ) / √(Σ(x_i − x̄)² · Σ(y_i − ȳ)²)
- where:
x_i and y_i are individual data points in sample_1 and sample_2, respectively,
x̄ and ȳ are the means of sample_1 and sample_2, respectively.
- param np.ndarray sample_1
First numeric sample.
- param np.ndarray sample_2
Second numeric sample.
- return float
Pearson’s correlation coefficient between the two samples.
- example
>>> sample_1 = np.array([7, 2, 9, 4, 5, 6, 7, 8, 9]).astype(np.float32)
>>> sample_2 = np.array([1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5, 8.5, 9.5]).astype(np.float32)
>>> Statistics().pearsons_r(sample_1=sample_1, sample_2=sample_2)
>>> 0.47
- static phi_coefficient(data: ndarray) float [source]
Compute the phi coefficient for a Nx2 array of binary data.
The phi coefficient (a.k.a Matthews Correlation Coefficient (MCC)), is a measure of association for binary data in a 2x2 contingency table. It quantifies the degree of association or correlation between two binary variables (e.g., binary classification targets).
The formula for the phi coefficient is defined as:
φ = (BC · AD − (C1 − BC)(R1 − BC)) / √(C1 · C2 · R1 · R2)
- where:
BC: Hits (response and truth are both 1)
AD: Correct rejections (response and truth are both 0)
C1, C2: Counts of occurrences where the response is 1 and 0, respectively.
R1, R2: Counts of occurrences where the truth is 1 and 0, respectively.
- Parameters
data (np.ndarray) – A NumPy array containing binary data organized in two columns. Each row represents a pair of binary values for two variables. Columns represent two features or two binary classification results.
- Returns float
The calculated phi coefficient, a value between 0 and 1. A value of 0 indicates no association between the variables, while 1 indicates a perfect association.
- Example
>>> data = np.array([[0, 1], [1, 0], [1, 0], [1, 1]]).astype(np.int64)
>>> Statistics().phi_coefficient(data=data)
>>> 0.8164965809277261
>>> data = np.random.randint(0, 2, (100, 2))
>>> result = Statistics.phi_coefficient(data=data)
- population_stability_index(sample_1: ndarray, sample_2: ndarray, fill_value: Optional[int] = 1, bucket_method: Optional[typing_extensions.Literal['fd', 'doane', 'auto', 'scott', 'stone', 'rice', 'sturges', 'sqrt']] = 'auto') float [source]
Compute Population Stability Index (PSI) comparing two distributions.
The Population Stability Index (PSI) is a measure of the difference in distribution patterns between two groups of data. A low PSI value indicates a minimal or negligible change in the distribution patterns between the two samples. A high PSI value suggests a significant difference in the distribution patterns between the two samples.
Note
Empty bins (0 observations in a bin) are replaced with fill_value. The PSI value ranges from 0 to positive infinity.
The Population Stability Index (PSI) is calculated as:
PSI = Σ (p_2 − p_1) · ln(p_2 / p_1)
- where:
p_1 and p_2 are the proportions of observations in the bins for sample 1 and sample 2 respectively.
- Parameters
sample_1 (ndarray) – First 1d array representing feature values.
sample_2 (ndarray) – Second 1d array representing feature values.
fill_value (Optional[int]) – Empty bins (0 observations in a bin) are replaced with fill_value. Default 1.
bucket_method (Literal) – Estimator determining optimal bucket count and bucket width. Default: The maximum of the Sturges and Freedman-Diaconis estimators.
- Returns float
PSI distance between sample_1 and sample_2.
- Example
>>> sample_1, sample_2 = np.random.randint(0, 100, (100,)), np.random.randint(0, 10, (100,))
>>> Statistics().population_stability_index(sample_1=sample_1, sample_2=sample_2, fill_value=1, bucket_method='auto')
>>> 3.9657026867553817
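A minimal sketch of the formula, assuming shared bin edges over both samples and the pseudo-count fill described above (the exact SimBA binning may differ; function name illustrative):

```python
import numpy as np

def population_stability_index(sample_1: np.ndarray, sample_2: np.ndarray,
                               fill_value: int = 1, bucket_method: str = "auto") -> float:
    # Bin both samples on a shared set of edges; empty bins are replaced
    # with fill_value, then PSI = sum((p2 - p1) * ln(p2 / p1)).
    edges = np.histogram_bin_edges(np.hstack([sample_1, sample_2]), bins=bucket_method)
    c1, _ = np.histogram(sample_1, bins=edges)
    c2, _ = np.histogram(sample_2, bins=edges)
    c1 = np.where(c1 == 0, fill_value, c1).astype(float)
    c2 = np.where(c2 == 0, fill_value, c2).astype(float)
    p1, p2 = c1 / c1.sum(), c2 / c2.sum()
    return float(((p2 - p1) * np.log(p2 / p1)).sum())
```

Each term (p_2 − p_1) · ln(p_2 / p_1) is non-negative, so the PSI is zero only when the binned distributions coincide.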
- static relative_risk(x: ndarray, y: ndarray) float [source]
Calculate the relative risk between two binary arrays.
Relative risk (RR) is the ratio of the probability of an event occurring in one group/feature/cluster/variable (x) to the probability of the event occurring in another group/feature/cluster/variable (y).
- Parameters
x (np.ndarray) – The first 1D binary array.
y (np.ndarray) – The second 1D binary array.
- Return float
The relative risk between arrays x and y.
- Example
>>> Statistics.relative_risk(x=np.array([0, 1, 1]), y=np.array([0, 1, 0]))
>>> 2.0
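The ratio of event probabilities reduces to a one-liner on binary arrays; a sketch (function name illustrative, not SimBA's API):

```python
import numpy as np

def relative_risk(x: np.ndarray, y: np.ndarray) -> float:
    # P(event in x) / P(event in y); assumes y contains at least one 1.
    return float(x.mean() / y.mean())
```

This reproduces the example above: event rates 2/3 and 1/3 give a relative risk of 2.0.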
- static rolling_cohens_d(data: ndarray, time_windows: ndarray, fps: float) ndarray [source]
Jitted compute of rolling Cohen’s D statistic comparing the current time-window of size N to the preceding window of size N.
- Parameters
data (ndarray) – 1D array of size len(frames) representing feature values.
time_windows (np.ndarray[int]) – Time windows to compute Cohen's D for, in seconds.
fps (int) – Frame-rate of recorded video.
- Returns np.ndarray
Array of size data.shape[0] x window_sizes.shape[1] with Cohens D.
- Example
>>> sample_1, sample_2 = np.random.normal(loc=10, scale=1, size=4), np.random.normal(loc=11, scale=2, size=4)
>>> sample = np.hstack((sample_1, sample_2))
>>> Statistics().rolling_cohens_d(data=sample, time_windows=np.array([1]), fps=4)
>>> [[0.], [0.], [0.], [0.], [0.14718302], [0.14718302], [0.14718302], [0.14718302]]
- static rolling_independent_sample_t(data: ndarray, time_window: float, fps: float) ndarray [source]
Jitted compute independent-sample t-statistics for sequentially binned values in a time-series. E.g., compute t-test statistics when comparing Feature N in the current 1s time-window, versus Feature N in the previous 1s time-window.
- Parameters
data (ndarray) – 1D array of feature values.
time_window (float) – Size of the time window in seconds.
fps (float) – Frame-rate of recorded video.
Attention
Each window is compared to the prior window. Output for the windows without a prior window (the first window) is -1.
- Example
>>> data_1, data_2 = np.random.normal(loc=10, scale=2, size=10), np.random.normal(loc=20, scale=2, size=10)
>>> data = np.hstack([data_1, data_2])
>>> Statistics().rolling_independent_sample_t(data, time_window=1, fps=10)
>>> [-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -6.88741389, -6.88741389, -6.88741389, -6.88741389, -6.88741389, -6.88741389, -6.88741389, -6.88741389, -6.88741389, -6.88741389]
- rolling_jensen_shannon_divergence(data: ndarray, time_windows: ndarray, fps: int, bucket_method: typing_extensions.Literal['fd', 'doane', 'auto', 'scott', 'stone', 'rice', 'sturges', 'sqrt'] = 'auto') ndarray [source]
Compute rolling Jensen-Shannon divergence comparing the current time-window of size N to the preceding window of size N.
- Parameters
data (ndarray) – 1D array of size len(frames) representing feature values.
time_windows (np.ndarray[ints]) – Time windows to compute JS for in seconds.
fps (int) – Frame-rate of recorded video.
bucket_method (Literal) – Estimator determining optimal bucket count and bucket width. Default: The maximum of the Sturges and Freedman-Diaconis estimators
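The per-window quantity is the Jensen-Shannon divergence between the two windows' histograms. A minimal NumPy sketch of that core computation (illustrative only, independent of SimBA's implementation; base-2 logs bound the result to [0, 1]):

```python
import numpy as np

def jensen_shannon_divergence(p_counts: np.ndarray, q_counts: np.ndarray) -> float:
    # p_counts and q_counts are histogram counts over shared bins.
    p = p_counts / p_counts.sum()
    q = q_counts / q_counts.sum()
    m = 0.5 * (p + q)
    def kl(a: np.ndarray, b: np.ndarray) -> float:
        mask = a > 0  # treat 0 * log(0) as 0
        return float(np.sum(a[mask] * np.log2(a[mask] / b[mask])))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```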
- rolling_kullback_leibler_divergence(data: ndarray, time_windows: ndarray, fps: int, fill_value: int = 1, bucket_method: typing_extensions.Literal['fd', 'doane', 'auto', 'scott', 'stone', 'rice', 'sturges', 'sqrt'] = 'auto') ndarray [source]
Compute rolling Kullback-Leibler divergence comparing the current time-window of size N to the preceding window of size N.
Note
Empty bins (0 observations in bin) are replaced with fill_value.
- Parameters
data (ndarray) – 1d array representing feature values.
time_windows (np.ndarray[floats]) – Time windows to compute KL divergence for in seconds.
fps (int) – Frame-rate of recorded video.
fill_value (int) – Value used to replace empty bins.
bucket_method (Literal) – Estimator determining optimal bucket count and bucket width. Default: The maximum of the Sturges and Freedman-Diaconis estimators.
- Returns np.ndarray
Size data.shape[0] x time_windows.shape[0] with Kullback-Leibler divergence. Columns represent different time windows.
- Example
>>> sample_1, sample_2 = np.random.normal(loc=10, scale=700, size=5), np.random.normal(loc=50, scale=700, size=5)
>>> data = np.hstack((sample_1, sample_2))
>>> Statistics().rolling_kullback_leibler_divergence(data=data, time_windows=np.array([1]), fps=2)
- static rolling_levenes(data: ndarray, time_windows: ndarray, fps: float) float [source]
Jitted compute of rolling Levene’s W comparing the current time-window of size N to the preceding window of size N.
Note
The first time bin (which has no preceding time bin) is given fill value 0.
- Parameters
data (ndarray) – 1D array of size len(frames) representing feature values.
time_windows (np.ndarray) – Time windows to compute Levene's W for in seconds.
fps (float) – Frame-rate of recorded video.
- Returns np.ndarray
Levene’s W data of size len(data) x len(time_windows).
- Example
>>> data = np.random.randint(0, 50, (100)).astype(np.float64)
>>> Statistics().rolling_levenes(data=data, time_windows=np.array([1]).astype(np.float64), fps=5.0)
- static rolling_mann_whitney(data: ndarray, time_windows: ndarray, fps: float)[source]
Jitted compute of rolling Mann-Whitney U comparing the current time-window of size N to the preceding window of size N.
Note
The first time bin (which has no preceding time bin) is given fill value 0.
- Parameters
data (ndarray) – 1D array of size len(frames) representing feature values.
time_windows (np.ndarray) – Time windows to compute Mann-Whitney U for in seconds.
fps (float) – Frame-rate of recorded video.
- Returns np.ndarray
Mann-Whitney U data of size len(data) x len(time_windows).
- Examples
>>> data = np.random.randint(0, 4, (200)).astype(np.float32)
>>> results = Statistics().rolling_mann_whitney(data=data, time_windows=np.array([1.0]), fps=1)
- static rolling_one_way_anova(data: ndarray, time_windows: ndarray, fps: int) ndarray [source]
Jitted compute of rolling one-way ANOVA F-statistic comparing the current time-window of size N to the preceding window of size N.
- Parameters
data (ndarray) – 1D array of size len(frames) representing feature values.
time_windows (np.ndarray[ints]) – Time windows to compute ANOVAs for in seconds.
fps (int) – Frame-rate of recorded video.
- Example
>>> sample = np.random.normal(loc=10, scale=1, size=10).astype(np.float32)
>>> Statistics().rolling_one_way_anova(data=sample, time_windows=np.array([1.0]), fps=2)
>>> [[0.00000000e+00][0.00000000e+00][2.26221263e-06][2.26221263e-06][5.39119950e-03][5.39119950e-03][1.46725486e-03][1.46725486e-03][1.16392111e-02][1.16392111e-02]]
- rolling_population_stability_index(data: ndarray, time_windows: ndarray, fps: int, fill_value: int = 1, bucket_method: typing_extensions.Literal['fd', 'doane', 'auto', 'scott', 'stone', 'rice', 'sturges', 'sqrt'] = 'auto') ndarray [source]
Compute rolling Population Stability Index (PSI) comparing the current time-window of size N to the preceding window of size N.
Note
Empty bins (0 observations in bin) are replaced with fill_value.
- Parameters
data (ndarray) – 1d array representing feature values.
time_windows (np.ndarray) – Time windows to compute PSI for in seconds.
fps (int) – Frame-rate of recorded video.
fill_value (int) – Value used to replace empty bins (0 observations in bin).
bucket_method (Literal) – Estimator determining optimal bucket count and bucket width. Default: The maximum of the Sturges and Freedman-Diaconis estimators.
- Returns np.ndarray
PSI data of size len(data) x len(time_windows).
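The per-window PSI between two histograms reduces to a short formula. A minimal NumPy sketch of that core step (illustrative, independent of SimBA's implementation; the empty-bin replacement mirrors the fill_value behaviour described above):

```python
import numpy as np

def population_stability_index(expected_counts: np.ndarray,
                               observed_counts: np.ndarray,
                               fill_value: float = 1.0) -> float:
    # Empty bins are replaced with fill_value before normalizing so the
    # log-ratio stays finite, as in the docstring above.
    e = np.where(expected_counts == 0, fill_value, expected_counts).astype(np.float64)
    o = np.where(observed_counts == 0, fill_value, observed_counts).astype(np.float64)
    e, o = e / e.sum(), o / o.sum()
    return float(np.sum((o - e) * np.log(o / e)))
```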
- rolling_shapiro_wilks(data: ndarray, time_window: float, fps: int) ndarray [source]
Compute Shapiro-Wilks normality statistics for sequentially binned values in a time-series. E.g., compute the normality statistics of Feature N in each window of time_window seconds.
- Parameters
data (ndarray) – 1D array of size len(frames) representing feature values.
time_window (float) – The size of the sequential time windows in seconds.
fps (int) – Frame-rate of recorded video.
- Return np.ndarray
Array of size data.shape[0] with Shapiro-Wilks normality statistics
- Example
>>> data = np.random.randint(low=0, high=100, size=(200)).astype('float32')
>>> results = Statistics().rolling_shapiro_wilks(data=data, time_window=1, fps=30)
- static rolling_two_sample_ks(data: ndarray, time_window: float, fps: float) ndarray [source]
Jitted compute of Kolmogorov two-sample statistics for sequentially binned values in a time-series. E.g., compute KS statistics when comparing Feature N in the current 1s time-window versus Feature N in the previous 1s time-window.
- Parameters
data (ndarray) – 1D array with feature values.
time_window (float) – The size of the sequential time windows in seconds.
fps (float) – Frame-rate of recorded video.
- Return np.ndarray
Array of size data.shape[0] with KS statistics
- Example
>>> data = np.random.randint(low=0, high=100, size=(200)).astype('float32')
>>> results = Statistics().rolling_two_sample_ks(data=data, time_window=1, fps=30)
- rolling_wasserstein_distance(data: ndarray, time_windows: ndarray, fps: int, bucket_method: typing_extensions.Literal['fd', 'doane', 'auto', 'scott', 'stone', 'rice', 'sturges', 'sqrt'] = 'auto') ndarray [source]
Compute rolling Wasserstein distance comparing the current time-window of size N to the preceding window of size N.
- Parameters
data (ndarray) – 1D array of size len(frames) representing feature values.
time_windows (np.ndarray[ints]) – Time windows to compute Wasserstein distance for in seconds.
fps (int) – Frame-rate of recorded video.
bucket_method (Literal) – Estimator determining optimal bucket count and bucket width. Default: The maximum of the Sturges and Freedman-Diaconis estimators
- Returns np.ndarray
Size data.shape[0] x time_windows.shape[0] with Wasserstein distance. Columns represent different time windows.
- Example
>>> data = np.random.randint(0, 100, (100,))
>>> Statistics().rolling_wasserstein_distance(data=data, time_windows=np.array([1, 2]), fps=30)
- static sliding_autocorrelation(data: ndarray, max_lag: float, time_window: float, fps: float)[source]
Jitted computation of sliding autocorrelations, which measures the correlation of a feature with itself using lagged windows.
- Parameters
data (ndarray) – 1D array with feature values.
max_lag (float) – Maximum lag in seconds used when computing the autocorrelation.
time_window (float) – Size of the sliding time window in seconds.
fps (float) – Frame-rate of recorded video.
- Return np.ndarray
1D array containing the sliding autocorrelation values.
- Example
>>> data = np.array([0,1,2,3,4,5,6,7,8,1,10,11,12,13,14]).astype(np.float32)
>>> Statistics().sliding_autocorrelation(data=data, max_lag=0.5, time_window=1.0, fps=10)
>>> [ 0., 0., 0., 0., 0., 0., 0., 0., 0., -3.686, -2.029, -1.323, -1.753, -3.807, -4.634]
- static sliding_dominant_frequencies(data: ndarray, fps: float, k: int, time_windows: ndarray, window_function: Optional[typing_extensions.Literal['Hann', 'Hamming', 'Blackman']] = None)[source]
Find the K dominant frequencies within a feature vector using sliding windows
- static sliding_eta_squared(x: ndarray, y: ndarray, window_sizes: ndarray, sample_rate: int) ndarray [source]
Calculate sliding window eta-squared, a measure of effect size for between-subjects designs, over multiple window sizes.
- Parameters
x (np.ndarray) – The array containing the dependent variable data.
y (np.ndarray) – The array containing the grouping variable (categorical) data.
window_sizes (np.ndarray) – 1D array of window sizes in seconds.
sample_rate (int) – The sampling rate of the data in frames per second.
- Return np.ndarray
Array of size x.shape[0] x window_sizes.shape[0] with sliding eta squared values.
- Example
>>> x = np.random.randint(0, 10, (10000,))
>>> y = np.random.randint(0, 2, (10000,))
>>> Statistics.sliding_eta_squared(x=x, y=y, window_sizes=np.array([1.0, 2.0]), sample_rate=10)
- static sliding_independent_samples_t(data: ndarray, time_window: float, slide_time: float, critical_values: ndarray, fps: float) ndarray [source]
Jitted compute of sliding independent sample t-test. Compares the feature values in current time-window to prior time-windows to find the length in time to the most recent time-window where a significantly different feature value distribution is detected.
- Parameters
data (ndarray) – 1D array with feature values.
time_window (float) – The sizes of the two feature value windows being compared in seconds.
slide_time (float) – The slide size of the second window.
critical_values (ndarray) – 2D array where indexes represent degrees of freedom and values represent critical T values. Can be found in simba.assets.critical_values_05.pickle.
fps (int) – The fps of the recorded video.
- Returns np.ndarray
1D array of size len(data) with values representing time to most recent significantly different feature distribution.
- Example
>>> data = np.random.randint(0, 50, (10)).astype(np.float32)
>>> critical_values = pickle.load(open("simba/assets/lookups/critical_values_05.pickle", "rb"))['independent_t_test']['one_tail'].values.astype(np.float32)
>>> results = Statistics().sliding_independent_samples_t(data=data, time_window=0.5, fps=5.0, critical_values=critical_values, slide_time=0.30)
- static sliding_kendall_tau(sample_1: ndarray, sample_2: ndarray, time_windows: ndarray, fps: float) ndarray [source]
Compute sliding Kendall’s Tau correlation coefficient.
Calculates Kendall’s Tau correlation coefficient between two samples over sliding time windows. Kendall’s Tau is a measure of correlation between two ranked datasets.
The computation is based on the formula:
τ = (C − D) / (½ n(n − 1))
where C (concordant pairs) counts pairs of elements with the same order in both samples, and D (discordant pairs) counts pairs with different orders.
- Parameters
sample_1 (np.ndarray) – First sample for comparison.
sample_2 (np.ndarray) – Second sample for comparison.
time_windows (np.ndarray) – Rolling time windows in seconds.
fps (float) – Frames per second (FPS) of the recorded video.
- Returns
Array of Kendall’s Tau correlation coefficients corresponding to each time window.
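The per-window statistic is Kendall's tau-a from the formula above. A minimal (O(n²), unoptimized) NumPy sketch of that core computation, independent of SimBA's sliding implementation:

```python
import numpy as np

def kendall_tau(sample_1: np.ndarray, sample_2: np.ndarray) -> float:
    # tau-a: (concordant - discordant) / (n * (n - 1) / 2); ties count as neither.
    n = sample_1.size
    concordant = discordant = 0
    for i in range(n - 1):
        for j in range(i + 1, n):
            s = np.sign(sample_1[j] - sample_1[i]) * np.sign(sample_2[j] - sample_2[i])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)
```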
- static sliding_kurtosis(data: ndarray, time_windows: ndarray, sample_rate: int) ndarray [source]
Compute the kurtosis of a 1D array within sliding time windows.
- Parameters
data (np.ndarray) – Input data array.
time_windows (np.ndarray) – 1D array of time window durations in seconds.
sample_rate (int) – Sampling rate of the data in samples per second.
- Return np.ndarray
2D array of kurtosis values with rows corresponding to data points and columns corresponding to time windows.
- Example
>>> data = np.random.randint(0, 100, (10,))
>>> kurtosis = Statistics().sliding_kurtosis(data=data.astype(np.float32), time_windows=np.array([1.0, 2.0]), sample_rate=2)
- static sliding_mad_median_rule(data: ndarray, k: int, time_windows: ndarray, fps: float) ndarray [source]
Count the number of outliers in a sliding time-window using the MAD-Median Rule.
The MAD-Median Rule is a robust method for outlier detection. It calculates the median absolute deviation (MAD) and uses it to identify outliers based on a threshold defined as k times the MAD.
- Parameters
data (np.ndarray) – 1D array with feature values.
k (int) – Threshold multiplier; values deviating more than k times the MAD from the median are counted as outliers.
time_windows (np.ndarray) – 1D array of sliding time window durations in seconds.
fps (float) – Frame-rate of recorded video.
- Return np.ndarray
Array of size (data.shape[0], time_windows.shape[0]) with counts of outliers detected.
- Example
>>> data = np.random.randint(0, 50, (50000,)).astype(np.float32)
>>> Statistics.sliding_mad_median_rule(data=data, k=2, time_windows=np.array([20.0]), fps=1.0)
- static sliding_pearsons_r(sample_1: ndarray, sample_2: ndarray, time_windows: ndarray, fps: int) ndarray [source]
Given two 1D arrays of size N, create a sliding window of size time_windows[i] * fps and return Pearson's R between the values in the two 1D arrays in each window. Addresses "what is the correlation between Feature 1 and Feature 2 in the current X.X seconds of the video?".
- Parameters
sample_1 (np.ndarray) – First 1D array with feature values.
sample_2 (np.ndarray) – Second 1D array with feature values.
time_windows (np.ndarray) – Sliding time windows in seconds.
fps (int) – Frame-rate of recorded video.
- Returns np.ndarray
2d array of Pearsons R of size len(sample_1) x len(time_windows). Note, if sliding window is 10 frames, the first 9 entries will be filled with 0.
- Example
>>> sample_1 = np.random.randint(0, 50, (10)).astype(np.float32)
>>> sample_2 = np.random.randint(0, 50, (10)).astype(np.float32)
>>> Statistics().sliding_pearsons_r(sample_1=sample_1, sample_2=sample_2, time_windows=np.array([0.5]), fps=10)
>>> [[-1.][-1.][-1.][-1.][0.227][-0.319][-0.196][0.474][-0.061][0.713]]
- static sliding_phi_coefficient(data: ndarray, window_sizes: ndarray, sample_rate: int) ndarray [source]
Calculate sliding phi coefficients for a 2x2 contingency table derived from binary data.
Computes sliding phi coefficients for a 2x2 contingency table derived from binary data over different time windows. The phi coefficient is a measure of association between two binary variables, and sliding phi coefficients can reveal changes in association over time.
- Parameters
data (np.ndarray) – A 2D NumPy array containing binary data organized in two columns. Each row represents a pair of binary values for two variables.
window_sizes (np.ndarray) – 1D NumPy array specifying the time windows (in seconds) over which to calculate the sliding phi coefficients.
sample_rate (int) – The sampling rate or time interval (in samples per second, e.g., fps) at which data points were collected.
- Returns np.ndarray
A 2D NumPy array containing the calculated sliding phi coefficients. Each row corresponds to the phi coefficients calculated for a specific time point, the columns correspond to time-windows.
- Example
>>> data = np.random.randint(0, 2, (200, 2))
>>> Statistics().sliding_phi_coefficient(data=data, window_sizes=np.array([1.0, 4.0]), sample_rate=10)
- static sliding_relative_risk(x: ndarray, y: ndarray, window_sizes: ndarray, sample_rate: int) ndarray [source]
Calculate sliding relative risk values between two binary arrays using different window sizes.
- Parameters
x (np.ndarray) – The first 1D binary array.
y (np.ndarray) – The second 1D binary array.
window_sizes (np.ndarray) – 1D array of window sizes in seconds.
sample_rate (int) – The sampling rate of the data in frames per second.
- Return np.ndarray
Array of size x.shape[0] x window_sizes.shape[0] with sliding relative risk values.
- Example
>>> Statistics.sliding_relative_risk(x=np.array([0, 1, 1, 0]), y=np.array([0, 1, 0, 0]), window_sizes=np.array([1.0]), sample_rate=2)
- static sliding_skew(data: ndarray, time_windows: ndarray, sample_rate: int) ndarray [source]
Compute the skewness of a 1D array within sliding time windows.
- Parameters
data (np.ndarray) – Input data array.
time_windows (np.ndarray) – 1D array of time window durations in seconds.
sample_rate (int) – Sampling rate of the data in samples per second.
- Return np.ndarray
2D array of skewness values with rows corresponding to data points and columns corresponding to time windows.
- Example
>>> data = np.random.randint(0, 100, (10,))
>>> skewness = Statistics().sliding_skew(data=data.astype(np.float32), time_windows=np.array([1.0, 2.0]), sample_rate=2)
- static sliding_spearman_rank_correlation(sample_1: ndarray, sample_2: ndarray, time_windows: ndarray, fps: int) ndarray [source]
Given two 1D arrays of size N, create a sliding window of size time_windows[i] * fps and return Spearman's rank correlation between the values in the two 1D arrays in each window. Addresses "what is the correlation between Feature 1 and Feature 2 in the current X.X seconds of the video?".
- Parameters
sample_1 (np.ndarray) – First 1D array with feature values.
sample_2 (np.ndarray) – Second 1D array with feature values.
time_windows (np.ndarray) – Sliding time windows in seconds.
fps (int) – Frame-rate of recorded video.
- Returns np.ndarray
2d array of Spearman's ranks of size len(sample_1) x len(time_windows). Note, if the sliding window is 10 frames, the first 9 entries will be filled with 0. The 10th value represents the correlation in the first 10 frames.
- Example
>>> sample_1 = np.array([9,10,13,22,15,18,15,19,32,11]).astype(np.float32)
>>> sample_2 = np.array([11, 12, 15, 19, 21, 26, 19, 20, 22, 19]).astype(np.float32)
>>> Statistics().sliding_spearman_rank_correlation(sample_1=sample_1, sample_2=sample_2, time_windows=np.array([0.5]), fps=10)
- static sliding_z_scores(data: ndarray, time_windows: ndarray, fps: int) ndarray [source]
Calculate sliding Z-scores for a given data array over specified time windows.
This function computes sliding Z-scores for a 1D data array over different time windows. The sliding Z-score is a measure of how many standard deviations a data point is from the mean of the surrounding data within the specified time window. This can be useful for detecting anomalies or variations in time-series data.
- Parameters
data (ndarray) – 1D NumPy array containing the time-series data.
time_windows (np.ndarray) – 1D NumPy array specifying the time windows in seconds over which to calculate the Z-scores.
fps (int) – Frames per second, used to convert time windows from seconds to the corresponding number of data points.
- Returns np.ndarray
A 2D NumPy array containing the calculated Z-scores. Each row corresponds to a data point, and each column corresponds to a time window.
- Example
>>> data = np.random.randint(0, 100, (1000,)).astype(np.float32)
>>> z_scores = Statistics().sliding_z_scores(data=data, time_windows=np.array([1.0, 2.5]), fps=10)
- static sokal_sneath(x: ndarray, y: ndarray, w: Optional[ndarray] = None) float64 [source]
Jitted calculation of the Sokal-Sneath coefficient between two binary vectors (e.g., of classified behaviors). 0 represents independence, 1 represents complete interdependence.
Note
Adapted from pynndescent.
- Parameters
x (np.ndarray) – First binary vector.
y (np.ndarray) – Second binary vector.
w (Optional[np.ndarray]) – Optional weights for each element. Can be classification probabilities. If not provided, equal weights are assumed.
- Example
>>> x = np.array([0, 1, 0, 0, 1]).astype(np.int8)
>>> y = np.array([1, 0, 1, 1, 0]).astype(np.int8)
>>> Statistics().sokal_sneath(x, y)
>>> 0.0
- static spearman_rank_correlation(sample_1: ndarray, sample_2: ndarray) float [source]
Jitted compute of Spearman’s rank correlation coefficient between two samples.
Spearman’s rank correlation coefficient assesses how well the relationship between two variables can be described using a monotonic function. It computes the strength and direction of the monotonic relationship between ranked variables.
ρ = 1 − (6 Σ d_i²) / (n(n² − 1))
where d_i is the difference between the ranks of corresponding elements in sample_1 and sample_2, and n is the number of observations.
- Parameters
sample_1 (np.ndarray) – First 1D array containing feature values.
sample_2 (np.ndarray) – Second 1D array containing feature values.
- Return float
Spearman’s rank correlation coefficient.
- Example
>>> sample_1 = np.array([7, 2, 9, 4, 5, 6, 7, 8, 9]).astype(np.float32)
>>> sample_2 = np.array([1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5, 8.5, 9.5]).astype(np.float32)
>>> Statistics().spearman_rank_correlation(sample_1=sample_1, sample_2=sample_2)
>>> 0.0003979206085205078
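The rank-difference formula above can be sketched directly in NumPy (illustrative only, and assuming no ties so that argsort-of-argsort yields valid ranks):

```python
import numpy as np

def spearman_rho(sample_1: np.ndarray, sample_2: np.ndarray) -> float:
    # 0-based ranks via double argsort (valid when there are no ties),
    # then rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1)).
    rank_1 = np.argsort(np.argsort(sample_1))
    rank_2 = np.argsort(np.argsort(sample_2))
    d = rank_1 - rank_2
    n = sample_1.size
    return float(1 - (6 * np.sum(d ** 2)) / (n * (n ** 2 - 1)))
```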
- static total_variation_distance(x: ndarray, y: ndarray, bucket_method: Optional[typing_extensions.Literal['fd', 'doane', 'auto', 'scott', 'stone', 'rice', 'sturges', 'sqrt']] = 'auto')[source]
Calculate the total variation distance between two probability distributions.
- Parameters
x (np.ndarray) – A 1-D array representing the first sample.
y (np.ndarray) – A 1-D array representing the second sample.
bucket_method (Optional[str]) – The method used to determine the number of bins for histogram computation. Supported methods are ‘fd’ (Freedman-Diaconis), ‘doane’, ‘auto’, ‘scott’, ‘stone’, ‘rice’, ‘sturges’, and ‘sqrt’. Defaults to ‘auto’.
- Return float
The total variation distance between the two distributions.
TV(P, Q) = ½ Σ |P(i) − Q(i)|, where P(i) and Q(i) are the probabilities assigned by the distributions P and Q to the same event i, respectively.
- Example
>>> Statistics.total_variation_distance(x=np.array([1, 5, 10, 20, 50]), y=np.array([1, 5, 10, 100, 110]))
>>> 0.3999999761581421
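Once the two samples are binned into shared histograms, the distance itself is a one-liner. A minimal NumPy sketch of that final step (independent of SimBA's bucketing logic):

```python
import numpy as np

def total_variation(p_counts: np.ndarray, q_counts: np.ndarray) -> float:
    # TV(P, Q) = 0.5 * sum(|P(i) - Q(i)|) over shared histogram bins.
    p = p_counts / p_counts.sum()
    q = q_counts / q_counts.sum()
    return float(0.5 * np.abs(p - q).sum())
```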
- static two_sample_ks(sample_1: ~numpy.ndarray, sample_2: ~numpy.ndarray, critical_values: ~typing.Optional[array(float64, 2d, A)] = None) -> (<class 'float'>, typing.Union[bool, NoneType])[source]
Jitted compute the two-sample Kolmogorov-Smirnov (KS) test statistic and, optionally, test for statistical significance.
The two-sample KS test is a non-parametric test that compares the cumulative distribution functions (ECDFs) of two independent samples to assess whether they come from the same distribution.
KS statistic (D) is calculated as the maximum absolute difference between the empirical cumulative distribution functions (ECDFs) of the two samples.
If critical_values are provided, the function checks the significance of the KS statistic against the critical values.
- Parameters
sample_1 (np.ndarray) – The first sample array for the KS test.
sample_2 (np.ndarray) – The second sample array for the KS test.
critical_values (Optional[float64[:, :]]) – An array of critical values for the KS test. If provided, the function will also check the significance of the KS statistic against the critical values. Default: None.
- Returns (float Union[bool, None])
Returns a tuple containing the KS statistic and a boolean indicating whether the test is statistically significant.
- Example
>>> sample_1 = np.array([1, 2, 3, 1, 3, 2, 1, 10, 8, 4, 10]).astype(np.float32)
>>> sample_2 = np.array([10, 5, 10, 4, 8, 10, 7, 10, 7, 10, 10]).astype(np.float32)
>>> critical_values = pickle.load(open("simba/assets/lookups/critical_values_5.pickle", "rb"))['two_sample_KS']['one_tail'].values
>>> two_sample_ks(sample_1=sample_1, sample_2=sample_2, critical_values=critical_values)
>>> (0.7272727272727273, True)
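The ECDF-based D statistic described above can be sketched in plain NumPy (an illustration of the statistic, not SimBA's jitted implementation, and without the critical-value significance check):

```python
import numpy as np

def two_sample_ks_stat(sample_1: np.ndarray, sample_2: np.ndarray) -> float:
    # D = max |ECDF_1(v) - ECDF_2(v)|, evaluated at every observed value v.
    values = np.sort(np.concatenate([sample_1, sample_2]))
    ecdf_1 = np.searchsorted(np.sort(sample_1), values, side='right') / sample_1.size
    ecdf_2 = np.searchsorted(np.sort(sample_2), values, side='right') / sample_2.size
    return float(np.abs(ecdf_1 - ecdf_2).max())
```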
- wasserstein_distance(sample_1: ndarray, sample_2: ndarray, bucket_method: typing_extensions.Literal['fd', 'doane', 'auto', 'scott', 'stone', 'rice', 'sturges', 'sqrt'] = 'auto') float [source]
Compute Wasserstein distance between two distributions.
Note
Uses
stats.wasserstein_distance
. I have tried to movestats.wasserstein_distance
to jitted method extensively, but this doesn’t give significant runtime improvement. Rate-limiter appears to be the _hist_1d.- Parameters
sample_1 (ndarray) – First 1d array representing feature values.
sample_2 (ndarray) – Second 1d array representing feature values.
bucket_method (Literal) – Estimator determining optimal bucket count and bucket width. Default: The maximum of the Sturges and Freedman-Diaconis estimators
- Returns float
Wasserstein distance between sample_1 and sample_2.
- Example
>>> sample_1 = np.random.normal(loc=10, scale=2, size=10)
>>> sample_2 = np.random.normal(loc=10, scale=3, size=10)
>>> Statistics().wasserstein_distance(sample_1=sample_1, sample_2=sample_2)
>>> 0.020833333333333332
- static wilcoxon(x: ndarray, y: ndarray) Tuple[float, float] [source]
Perform the Wilcoxon signed-rank test for paired samples.
Wilcoxon signed-rank test is a non-parametric statistical hypothesis test used to compare two related samples, matched samples, or repeated measurements on a single sample to assess whether their population mean ranks differ.
- Parameters
x (np.ndarray) – 1D array representing the observations for the first sample.
y (np.ndarray) – 1D array representing the observations for the second sample.
- Returns
A tuple containing the test statistic (z-score) and the effect size (r).
The test statistic (z-score) measures the deviation of the observed ranks sum from the expected sum.
The effect size (r) measures the strength of association between the variables.
- static youden_j(sample_1: ndarray, sample_2: ndarray) float [source]
Calculate Youden’s J statistic from two binary arrays.
Youden’s J statistic is a measure of the overall performance of a binary classification test, taking into account both sensitivity (true positive rate) and specificity (true negative rate).
- Parameters
sample_1 (np.ndarray) – The first binary array.
sample_2 (np.ndarray) – The second binary array.
- Return float
Youden’s J statistic.
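The statistic reduces to sensitivity + specificity − 1 over the 2x2 confusion counts. A minimal NumPy sketch (illustrative only; treating sample_1 as ground truth and sample_2 as predictions is an assumption, since the docstring does not assign roles to the two arrays):

```python
import numpy as np

def youden_j(sample_1: np.ndarray, sample_2: np.ndarray) -> float:
    # J = sensitivity + specificity - 1, from the 2x2 confusion counts.
    tp = np.sum((sample_1 == 1) & (sample_2 == 1))
    tn = np.sum((sample_1 == 0) & (sample_2 == 0))
    fp = np.sum((sample_1 == 0) & (sample_2 == 1))
    fn = np.sum((sample_1 == 1) & (sample_2 == 0))
    return float(tp / (tp + fn) + tn / (tn + fp) - 1)
```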
- static yule_coef(x: ndarray, y: ndarray, w: Optional[ndarray] = None) float64 [source]
Jitted calculation of the Yule coefficient between two binary vectors (e.g., of classified behaviors). 0 represents independence, 2 represents complete interdependence.
Note
Adapted from pynndescent.
- Parameters
x (np.ndarray) – First binary vector.
y (np.ndarray) – Second binary vector.
w (Optional[np.ndarray]) – Optional weights for each element. Can be classification probabilities. If not provided, equal weights are assumed.
- Example
>>> x = np.random.randint(0, 2, (50,)).astype(np.int8)
>>> y = x ^ 1
>>> Statistics().yule_coef(x=x, y=y)
>>> 2
>>> random_indices = np.random.choice(len(x), size=len(x)//2, replace=False)
>>> y = np.copy(x)
>>> y[random_indices] = 1 - y[random_indices]
>>> Statistics().yule_coef(x=x, y=y)
>>> 0.99
Circular feature extraction methods
- class simba.mixins.circular_statistics.CircularStatisticsMixin[source]
Bases:
object
Mixin for circular statistics. Unlike linear data, circular data wrap around in a circular or periodic manner: e.g., two measurements of 360° vs. 1° are more similar than two measurements of 1° vs. 3°. The minimum and maximum values are connected, forming a closed loop, and we therefore need specialized statistical methods.
These methods have support for multiple animals and base radial directions derived from two or three body-parts.
Methods are adapted from the referenced packages below, which are far more mature. However, runtime on standard hardware (multicore CPU) is prioritized here and is typically orders of magnitude faster than the referenced libraries.
See image below for example of expected run-times for a small set of method examples included in this class.
Note
Many methods have numba typed signatures to decrease compilation time through reduced type inference. Make sure to pass the correct dtypes as indicated by the signature decorators.
Important
See references below for mature packages computing more extensive circular measurements.
References
- 1
- 2
- 3
- 4
- 5
- 6
- static agg_angular_diff_timebins(data: ndarray, time_windows: ndarray, fps: int) ndarray [source]
Compute the difference between the median angle in the current time-window versus the previous time-window. For example, computes the difference between the median angle in the first 1s of the video versus the second 1s of the video, the second 1s versus the third 1s, etc.
Note
The first time-bin of the video cannot be compared against the prior time-bin and is populated with 0.
- Parameters
data (ndarray) – 1D array of size len(frames) representing degrees.
time_windows (np.ndarray) – Rolling time-windows in seconds.
fps (int) – fps of the recorded video
- Example
>>> data = np.random.normal(loc=45, scale=3, size=20).astype(np.float32)
>>> CircularStatisticsMixin().agg_angular_diff_timebins(data=data, time_windows=np.array([1.0]), fps=5.0)
- static circular_correlation(sample_1: ndarray, sample_2: ndarray) float [source]
Jitted compute of circular correlation coefficient of two samples using the cross-correlation coefficient. Ranges from -1 to 1: 1 indicates perfect positive correlation, -1 indicates perfect negative correlation, 0 indicates no correlation.
Note
Adapted from
astropy.stats.circstats.circcorrcoef
.- Parameters
sample_1 (np.ndarray) – Angular data for e.g., Animal 1
sample_2 (np.ndarray) – Angular data for e.g., Animal 2
- Returns float
The correlation between the two distributions.
- Example
>>> sample_1 = np.array([50, 90, 20, 60, 20, 90]).astype(np.float32)
>>> sample_2 = np.array([50, 90, 70, 60, 20, 90]).astype(np.float32)
>>> CircularStatisticsMixin().circular_correlation(sample_1=sample_1, sample_2=sample_2)
>>> 0.7649115920066833
- static circular_hotspots(data: ndarray, bins: ndarray) ndarray [source]
Calculate the proportion of data points falling within circular bins.
- Parameters
data (ndarray) – 1D array of circular data measured in degrees.
bins (ndarray) – 2D array of shape (N, 2) representing circular bins, each row defining [start_degree, end_degree] inclusive.
- Return np.ndarray
1D array containing the proportion of data points that fall within each specified circular bin.
- Example
>>> data = np.array([270, 360, 10, 90, 91, 180, 185, 260]).astype(np.float32)
>>> bins = np.array([[270, 90], [91, 269]])
>>> CircularStatisticsMixin().circular_hotspots(data=data, bins=bins)
>>> [0.5, 0.5]
>>> bins = np.array([[270, 0], [1, 90], [91, 180], [181, 269]])
>>> CircularStatisticsMixin().circular_hotspots(data=data, bins=bins)
>>> [0.25, 0.25, 0.25, 0.25]
- static circular_mean(data: ndarray) float [source]
Jitted compute of the circular mean of single sample.
- Parameters
data (np.ndarray) – 1D array of size len(frames) representing angles in degrees.
- Returns float
The circular mean of the angles in degrees.
- Example
>>> data = np.array([50, 90, 70, 60, 20, 90]).astype(np.float32)
>>> CircularStatisticsMixin().circular_mean(data=data)
>>> 63.737892150878906
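The circular mean is the direction of the average unit vector. A minimal NumPy sketch of the idea (illustrative, independent of SimBA's jitted implementation):

```python
import numpy as np

def circular_mean(data_degrees: np.ndarray) -> float:
    # Average the unit vectors for each angle, then convert the resultant
    # direction back to degrees in [0, 360).
    radians = np.deg2rad(data_degrees)
    mean_angle = np.arctan2(np.mean(np.sin(radians)), np.mean(np.cos(radians)))
    return float(np.rad2deg(mean_angle) % 360)
```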
- static circular_range(data: ndarray) float [source]
Jitted compute of circular range in degrees. The range is defined as the angular span of the shortest arc that can contain all the data points. A smaller range indicates a more concentrated distribution, while a larger range suggests a more dispersed distribution.
- Parameters
data (ndarray) – 1D array of circular data measured in degrees
- Return np.ndarray
The circular range in degrees.
- Example
>>> CircularStatisticsMixin().circular_range(np.array([350, 20, 60, 100]).astype(np.float32))
>>> 110.0
>>> CircularStatisticsMixin().circular_range(np.array([110, 20, 60, 100]).astype(np.float32))
>>> 90.0
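One way to compute the shortest containing arc is via the largest gap between sorted angles. A minimal NumPy sketch (illustrative only; SimBA's jitted implementation may differ):

```python
import numpy as np

def circular_range(data_degrees: np.ndarray) -> float:
    # Shortest arc containing all points: 360 minus the largest gap between
    # consecutive angles when sorted around the circle (including wrap-around).
    angles = np.sort(np.asarray(data_degrees, dtype=np.float64) % 360)
    gaps = np.diff(np.concatenate([angles, [angles[0] + 360]]))
    return float(360 - gaps.max())
```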
- static circular_std(data: ndarray) float [source]
Jitted compute of the circular standard deviation from a single distribution of angles in degrees.
- Parameters
data (ndarray) – 1D array of size len(frames) with angles in degrees
- Returns float
The standard deviation of the data sample in degrees
The circular standard deviation is computed as σ = √(−2 ln R), where R is the mean resultant length of the unit vectors (cos θ, sin θ) and θ represents the angles in radians.
- Example
>>> data = np.array([180, 221, 32, 42, 212, 101, 139, 41, 69, 171, 149, 200]).astype(np.float32)
>>> CircularStatisticsMixin().circular_std(data=data)
>>> 75.03725024504664
- static degrees_to_cardinal(data: ndarray) List[str] [source]
Convert degree angles to cardinal direction bucket e.g., 0 -> “N”, 180 -> “S”
Note
To convert cardinal literals to integers, map using simba.utils.enums.lookups.cardinality_to_integer_lookup. To convert integers to cardinal literals, map using simba.utils.enums.lookups.integer_to_cardinality_lookup.
- Parameters
degree_angles (np.ndarray) – 1d array of degrees. Note: returned by self.head_direction.
- Return List[str]
List of strings representing frame-wise cardinality.
- Example
>>> data = np.array(list(range(0, 405, 45))).astype(np.float32)
>>> CircularStatisticsMixin().degrees_to_cardinal(degree_angles=data)
>>> ['N', 'NE', 'E', 'SE', 'S', 'SW', 'W', 'NW', 'N']
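The bucketing amounts to rounding each angle to the nearest 45° sector. A minimal NumPy sketch reproducing the example above (illustrative, independent of SimBA's implementation):

```python
import numpy as np

def degrees_to_cardinal(degree_angles: np.ndarray):
    # Each 45-degree sector is centred on one of the eight compass directions.
    directions = ['N', 'NE', 'E', 'SE', 'S', 'SW', 'W', 'NW']
    idx = np.round(np.asarray(degree_angles) / 45.0).astype(int) % 8
    return [directions[i] for i in idx]
```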
- static direction_three_bps(nose_loc: ndarray, left_ear_loc: ndarray, right_ear_loc: ndarray) ndarray [source]
Jitted helper to compute the degree angle from three body-parts. Computes the angles in degrees left_ear <-> nose and right_ear <-> nose and returns the midpoint.
- Parameters
nose_loc (ndarray) – 2D array of size len(frames)x2 representing nose coordinates
left_ear_loc (ndarray) – 2D array of size len(frames)x2 representing left ear coordinates
right_ear_loc (ndarray) – 2D array of size len(frames)x2 representing right ear coordinates
- Return np.ndarray
Array of size nose_loc.shape[0] with direction in degrees.
- Example
>>> nose_loc = np.random.randint(low=0, high=500, size=(50, 2)).astype(np.float32)
>>> left_ear_loc = np.random.randint(low=0, high=500, size=(50, 2)).astype(np.float32)
>>> right_ear_loc = np.random.randint(low=0, high=500, size=(50, 2)).astype(np.float32)
>>> results = CircularStatisticsMixin().direction_three_bps(nose_loc=nose_loc, left_ear_loc=left_ear_loc, right_ear_loc=right_ear_loc)
- static direction_two_bps(anterior_loc: ndarray, posterior_loc: ndarray) ndarray [source]
Jitted method computing degree directionality from two body-parts, e.g., nape and nose, or swim_bladder and tail.
- Parameters
anterior_loc (np.ndarray) – Size len(frames) x 2 array with x and y coordinates of the anterior body-part.
posterior_loc (np.ndarray) – Size len(frames) x 2 array with x and y coordinates of the posterior body-part.
- Return np.ndarray
Frame-wise directionality in degrees.
- Example
>>> swim_bladder_loc = np.random.randint(low=0, high=500, size=(50, 2)).astype(np.float32)
>>> tail_loc = np.random.randint(low=0, high=500, size=(50, 2)).astype(np.float32)
>>> CircularStatisticsMixin().direction_two_bps(anterior_loc=swim_bladder_loc, posterior_loc=tail_loc)
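A minimal NumPy sketch of the underlying idea, assuming the standard mathematical angle convention (SimBA's jitted implementation may use an image-coordinate convention instead):

```python
import numpy as np

def direction_two_bps(anterior_loc, posterior_loc):
    """Frame-wise direction (degrees, 0-360) of the posterior -> anterior vector.

    Sketch using the standard math convention: 0 degrees points along +x,
    angles increase counter-clockwise.
    """
    delta = anterior_loc - posterior_loc
    return np.degrees(np.arctan2(delta[:, 1], delta[:, 0])) % 360

posterior = np.zeros((3, 2), dtype=np.float32)
anterior = np.array([[1, 0], [0, 1], [-1, 0]], dtype=np.float32)
print(direction_two_bps(anterior, posterior))  # [  0.  90. 180.]
```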
- static fit_circle(data: ndarray, max_iterations: Optional[int] = 400) ndarray [source]
Fit a circle to a dataset using the least squares method.
This function fits a circle to a dataset using the least squares method. The circle is defined by the equation (x − x_c)² + (y − y_c)² = r², where (x_c, y_c) is the circle center and r is the radius.
Note
Adapted to numba JIT from the circle-fit hyperLSQ method.
References
- 1
Kanatani, Rangarajan, Hyper least squares fitting of circles and ellipses, Computational Statistics & Data Analysis, vol. 55, pp. 2197-2208, 2011.
- 2
Lapp, Salazar, Champagne. Automated maternal behavior during early life in rodents (AMBER) pipeline, Scientific Reports, 13:18277, 2023.
- Parameters
data (np.ndarray) – A 3D NumPy array with shape (N, M, 2). N represent frames, M represents the number of body-parts, and 2 represents x and y coordinates.
max_iterations (int) – The maximum number of iterations for fitting the circle.
- Returns np.ndarray
Array with shape (N, 3) with N representing frame and 3 representing (i) X-coordinate of the circle center, (ii) Y-coordinate of the circle center, and (iii) Radius of the circle
- Example
>>> data = np.array([[[5, 10], [10, 5], [15, 10], [10, 15]]])
>>> CircularStatisticsMixin().fit_circle(data=data, max_iterations=88)
>>> [[10, 10, 5]]
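For intuition, a circle fit can be sketched with the simpler Kåsa linear least-squares method (an illustrative sketch only; SimBA's fit_circle uses the more robust hyperLSQ algorithm, though the two agree on well-spread points):

```python
import numpy as np

def fit_circle_kasa(points):
    """Fit a circle to 2D points with the Kasa linear least-squares method.

    Solves x^2 + y^2 = a*x + b*y + c in the least-squares sense; the center
    is then (a/2, b/2) and the radius sqrt(c + (a/2)^2 + (b/2)^2).
    """
    x, y = points[:, 0], points[:, 1]
    A = np.column_stack([x, y, np.ones_like(x)])
    b = x ** 2 + y ** 2
    a_, b_, c_ = np.linalg.lstsq(A, b, rcond=None)[0]
    cx, cy = a_ / 2, b_ / 2
    r = np.sqrt(c_ + cx ** 2 + cy ** 2)
    return cx, cy, r

# Four points on a circle centred at (10, 10) with radius 5:
pts = np.array([[5, 10], [10, 5], [15, 10], [10, 15]], dtype=np.float64)
print(fit_circle_kasa(pts))  # approximately (10.0, 10.0, 5.0)
```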
- static instantaneous_angular_velocity(data: ndarray, bin_size: int) ndarray [source]
Jitted compute of absolute angular change in the smallest possible time bin.
Note
If the smallest possible frame-to-frame time-bin in Video 1 is 33ms (recorded at 30fps), and the smallest possible frame-to-frame time-bin in Video 2 is 66ms (recorded at 15fps), we correct for this across recordings using the bin_size argument. E.g., when passing angular data from Video 1 we set bin_size to 2, and when passing angular data from Video 2 we set bin_size to 1, to allow comparisons of instantaneous angular velocity between Video 1 and Video 2.
When the current frame minus bin_size results in a negative index, -1 is returned.
- Parameters
data (ndarray) – 1D array of size len(frames) representing degrees.
bin_size (int) – The number of frames prior to compare the current angular velocity against.
- Example
>>> data = np.array([350, 360, 365, 360]).astype(np.float32)
>>> CircularStatisticsMixin().instantaneous_angular_velocity(data=data, bin_size=1)
>>> [-1., 10.00002532, 4.999999, 4.999999]
>>> CircularStatisticsMixin().instantaneous_angular_velocity(data=data, bin_size=2)
>>> [-1., -1., 15.00002432, 0.]
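The behaviour documented above can be sketched in plain NumPy (illustrative, not the jitted implementation):

```python
import numpy as np

def instantaneous_angular_velocity(data, bin_size):
    """Absolute circular difference between frame i and frame i - bin_size.

    Indices where i - bin_size < 0 are populated with -1, matching the docs.
    """
    results = np.full(data.shape[0], -1.0)
    for i in range(bin_size, data.shape[0]):
        d = abs(data[i] - data[i - bin_size]) % 360
        results[i] = min(d, 360 - d)  # shortest arc between the two angles
    return results

data = np.array([350, 360, 365, 360], dtype=np.float32)
print(instantaneous_angular_velocity(data, bin_size=1))  # [-1. 10.  5.  5.]
print(instantaneous_angular_velocity(data, bin_size=2))  # [-1. -1. 15.  0.]
```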
- static kuipers_two_sample_test(sample_1: ndarray, sample_2: ndarray) float [source]
Compute the Kuiper’s two-sample test statistic for circular distributions.
Kuiper’s two-sample test is a non-parametric test used to determine if two samples are drawn from the same circular distribution. It is particularly useful for circular data, such as angles or directions.
Note
Adapted from Kuiper by Anne Archibald.
- Parameters
sample_1 (ndarray) – The first circular sample array in degrees.
sample_2 (ndarray) – The second circular sample array in degrees.
- Return float
Kuiper’s test statistic.
- Example
>>> sample_1, sample_2 = np.random.normal(loc=45, scale=1, size=100).astype(np.float32), np.random.normal(loc=180, scale=20, size=100).astype(np.float32)
>>> CircularStatisticsMixin().kuipers_two_sample_test(sample_1=sample_1, sample_2=sample_2)
- static mean_resultant_vector_length(data: ndarray) float [source]
Jitted compute of the mean resultant vector length of a single sample. Captures the overall “pull” or “tendency” of the data points towards a central direction on the circle with a range between 0 and 1.
The mean resultant vector length is computed as R = (1/N) · √((Σᵢ cos θᵢ)² + (Σᵢ sin θᵢ)²), where N is the number of data points and θᵢ is the angle of the i-th data point in radians. Equivalently, R = (1/N) Σᵢ cos(θᵢ − θ̄), where θ̄ is the mean angle.
- Parameters
data (np.ndarray) – 1D array of size len(frames) representing angles in degrees.
- Returns float
The mean resultant vector of the angles. 1 represents tendency towards a single point. 0 represents no central point.
- Example
>>> data = np.array([50, 90, 70, 60, 20, 90]).astype(np.float32)
>>> CircularStatisticsMixin().mean_resultant_vector_length(data=data)
>>> 0.9132277170817057
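The computation can be sketched in a few lines of NumPy, reproducing the documented example value:

```python
import numpy as np

def mean_resultant_vector_length(data):
    """Mean resultant vector length R of angles given in degrees (sketch)."""
    theta = np.deg2rad(data)
    # Average the unit vectors, then take the length of the mean vector.
    return np.sqrt(np.cos(theta).mean() ** 2 + np.sin(theta).mean() ** 2)

data = np.array([50, 90, 70, 60, 20, 90], dtype=np.float32)
print(mean_resultant_vector_length(data))  # close to the documented 0.9132...
```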
- static rao_spacing(data: array)[source]
Jitted compute of Rao’s spacing for angular data.
Computes the uniformity of a circular dataset in degrees. Low output values represent concentrated angularity, while high values represent dispersed angularity.
- Parameters
data (ndarray) – 1D array of size len(frames) with data in degrees.
- Return int
Rao’s spacing measure, indicating the dispersion or concentration of angular data points.
- References
- 1
UCSB.
- Example
>>> data = np.random.randint(0, 360, (5000,)).astype(np.float32)
>>> CircularStatisticsMixin().rao_spacing(data=data)
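Rao's spacing can be sketched as follows (illustrative, not the jitted implementation): perfectly uniform spacings give 0, concentrated angles give large values.

```python
import numpy as np

def rao_spacing(data):
    """Rao's spacing U for angular data in degrees (plain-NumPy sketch).

    U = 0.5 * sum(|T_i - lambda|), where T_i are the spacings between
    adjacent sorted angles (including the wrap-around spacing) and
    lambda = 360 / n is the expected spacing under uniformity.
    """
    angles = np.sort(data % 360)
    n = angles.shape[0]
    wrap = 360.0 - angles[-1] + angles[0]  # spacing across the 0/360 boundary
    t = np.append(np.diff(angles), wrap)
    lam = 360.0 / n
    return 0.5 * np.sum(np.abs(t - lam))

print(rao_spacing(np.array([0, 90, 180, 270], dtype=np.float32)))  # 0.0 (perfectly uniform)
print(rao_spacing(np.array([0, 1, 2, 3], dtype=np.float32)))       # 267.0 (concentrated)
```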
- static rayleigh(data: ndarray) Tuple[float, float] [source]
Jitted compute of Rayleigh Z (test of non-uniformity) of single sample of circular data in degrees.
The Rayleigh Z statistic is computed as Z = nR², where n is the sample size and R is the mean resultant vector length.
The associated p-value is then computed from Z and the sample size.
- Parameters
data (ndarray) – 1D array of size len(frames) representing degrees.
- Returns Tuple[float, float]
Tuple with Rayleigh Z score and associated probability value.
- Example
>>> data = np.array([350, 360, 365, 360, 100, 109, 232, 123, 42, 3, 4, 145]).astype(np.float32)
>>> CircularStatisticsMixin().rayleigh(data=data)
>>> (2.3845645695246467, 0.9842236169985417)
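The Z statistic can be sketched from the mean resultant vector length (a sketch of the test statistic only; the p-value computation is omitted):

```python
import numpy as np

def rayleigh_z(data):
    """Rayleigh Z for a sample of angles in degrees: Z = n * R**2 (sketch).

    R is the mean resultant vector length. Larger Z means stronger
    non-uniformity (concentration around a mean direction).
    """
    theta = np.deg2rad(data)
    r = np.sqrt(np.cos(theta).mean() ** 2 + np.sin(theta).mean() ** 2)
    return data.shape[0] * r ** 2

concentrated = np.array([44, 45, 45, 46], dtype=np.float32)
dispersed = np.array([0, 90, 180, 270], dtype=np.float32)
print(rayleigh_z(concentrated))  # close to n = 4, since R is close to 1
print(rayleigh_z(dispersed))     # close to 0
```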
- static rolling_rayleigh_z(data: ndarray, time_windows: ndarray, fps: int) Tuple[ndarray, ndarray] [source]
Jitted compute of Rayleigh Z (test of non-uniformity) of circular data within sliding time-window.
Note
Adapted from pingouin.circular.circ_rayleigh and pycircstat.tests.rayleigh.
- Parameters
data (ndarray) – 1D array of size len(frames) representing degrees.
time_windows (np.ndarray) – Rolling time-windows as floats in seconds. Two windows of 0.5s and 1s would be represented as np.array([0.5, 1.0])
fps (int) – fps of the recorded video
- Returns Tuple[np.ndarray, np.ndarray]
Two 2d arrays with the first representing Rayleigh Z scores and second representing associated p values.
- Example
>>> data = np.random.randint(low=0, high=361, size=(100,)).astype(np.float32)
>>> CircularStatisticsMixin().rolling_rayleigh_z(data=data, time_windows=np.array([0.5, 1.0]), fps=10)
- static rotational_direction(data: ndarray, stride: int = 1) ndarray [source]
Jitted compute of frame-by-frame rotational direction within a 1D timeseries array of angular data.
- Parameters
data (ndarray) – 1D array of size len(frames) representing degrees.
stride (int) – The number of frames between the two compared frames. Default: 1.
- Return numpy.ndarray
An array of directional indicators:
- 0 indicates no rotational change relative to the prior frame.
- 1 indicates a clockwise rotational change relative to the prior frame.
- 2 indicates a counter-clockwise rotational change relative to the prior frame.
Note
For the first frame, no rotation is possible, so it is populated with -1.
Frame-by-frame rotations of exactly 180° are denoted as clockwise rotations.
- Example
>>> data = np.array([45, 50, 35, 50, 80, 350, 350, 0, 180]).astype(np.float32)
>>> CircularStatisticsMixin().rotational_direction(data)
>>> [-1., 1., 2., 1., 1., 2., 0., 1., 1.]
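The documented convention can be sketched in plain NumPy (illustrative only): increasing angles, including exact 180° changes, count as clockwise.

```python
import numpy as np

def rotational_direction(data):
    """Frame-wise rotation: -1 first frame, 0 none, 1 clockwise, 2 counter-clockwise."""
    results = np.full(data.shape[0], -1.0)
    for i in range(1, data.shape[0]):
        d = (data[i] - data[i - 1]) % 360
        if d == 0:
            results[i] = 0
        elif d <= 180:        # shortest (or tied 180-degree) path is clockwise
            results[i] = 1
        else:
            results[i] = 2    # shortest path is counter-clockwise
    return results

data = np.array([45, 50, 35, 50, 80, 350, 350, 0, 180], dtype=np.float32)
print(rotational_direction(data))  # [-1. 1. 2. 1. 1. 2. 0. 1. 1.]
```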
- static sliding_angular_diff(data: ndarray, time_windows: ndarray, fps: float) ndarray [source]
Computes the angular difference in the current frame versus N seconds previously. For example, if the current angle is 45 degrees, and the angle N seconds previously was 350 degrees, then the difference is 55 degrees.
Note
Frames where the current frame minus N seconds prior results in a negative index are populated with 0.
Results are rounded to the nearest integer.
- Parameters
data (ndarray) – 1D array of size len(frames) representing degrees.
time_windows (np.ndarray) – Rolling time-windows as floats in seconds.
fps (int) – fps of the recorded video
- Example
>>> data = np.array([350, 350, 1, 1]).astype(np.float32)
>>> CircularStatisticsMixin().sliding_angular_diff(data=data, fps=1.0, time_windows=np.array([1.0]))
- static sliding_bearing(x: ndarray, lag: float, fps: float) ndarray [source]
Calculates the sliding bearing (direction) of movement in degrees for a sequence of 2D points representing a single body-part.
Note
To calculate frame-by-frame bearing, pass fps == 1 and lag == 1.
- Parameters
x (np.ndarray) – An array of shape (n, 2) representing the time-series sequence of 2D points.
lag (float) – The lag time (in seconds) used for calculating the sliding bearing. E.g., if 1, then bearing will be calculated using coordinates in the current frame vs the frame 1s previously.
fps (float) – The sample rate (frames per second) of the sequence.
- Return np.ndarray
An array containing the sliding bearings (in degrees) for each point in the sequence.
- Example
>>> x = np.array([[10, 10], [20, 10]])
>>> CircularStatisticsMixin.sliding_bearing(x=x, lag=1, fps=1)
>>> [-1. 90.]
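A minimal sketch reproducing the documented example, assuming a compass-style bearing measured from the positive y-axis towards the positive x-axis (SimBA's implementation may normalize the angle range differently):

```python
import numpy as np

def sliding_bearing(x, lag, fps):
    """Bearing (degrees) of movement over a lag of lag * fps frames (sketch).

    Frames without a full preceding lag are populated with -1.
    """
    shift = int(lag * fps)
    results = np.full(x.shape[0], -1.0)
    for i in range(shift, x.shape[0]):
        dx = x[i, 0] - x[i - shift, 0]
        dy = x[i, 1] - x[i - shift, 1]
        # arctan2(dx, dy): 0 degrees points along +y, 90 degrees along +x.
        results[i] = np.degrees(np.arctan2(dx, dy))
    return results

x = np.array([[10, 10], [20, 10]], dtype=np.float32)
print(sliding_bearing(x, lag=1, fps=1))  # [-1. 90.]
```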
- static sliding_circular_correlation(sample_1: ndarray, sample_2: ndarray, time_windows: ndarray, fps: float) ndarray [source]
Jitted compute of correlations between two angular distributions in sliding time-windows using the cross-correlation coefficient.
Note
Values prior to the ending of the first time window will be filled with 0.
- Parameters
sample_1 (np.ndarray) – First 1D array of angular data in degrees.
sample_2 (np.ndarray) – Second 1D array of angular data in degrees.
time_windows (np.ndarray) – Rolling time-windows as floats in seconds.
fps (float) – Frame-rate of the recorded video.
- Return np.ndarray
Array of size len(sample_1) x len(time_window) with correlation coefficients.
- Example
>>> sample_1 = np.random.randint(0, 361, (200,)).astype(np.float32)
>>> sample_2 = np.random.randint(0, 361, (200,)).astype(np.float32)
>>> CircularStatisticsMixin().sliding_circular_correlation(sample_1=sample_1, sample_2=sample_2, time_windows=np.array([0.5, 1.0]), fps=10.0)
- static sliding_circular_hotspots(data: ndarray, bins: ndarray, time_window: float, fps: float) ndarray [source]
Jitted compute of sliding circular hotspots in a dataset. Calculates circular hotspots in a time-series dataset by sliding a time window across the data and computing hotspot statistics for specified circular bins.
- Parameters
data (np.ndarray) – 1D array of circular data measured in degrees.
bins (np.ndarray) – 2D array where each row holds the start and end angle (in degrees) of a circular bin.
time_window (float) – Size of the sliding time window in seconds.
fps (float) – Frame-rate of the recorded video.
- Return np.ndarray
A 2D numpy array where each row corresponds to a time point in data, and each column represents a circular bin. The values in the array represent the proportion of data points within each bin at each time point.
Note
The function utilizes the Numba JIT compiler for improved performance.
Circular bin definitions should follow the convention where angles are specified in degrees within the range [0, 360], and the bins are defined using start and end angles inclusive. For example, (0, 90) represents the first quadrant in a circular space.
Output data at the beginning of the series, where a full time-window is not yet satisfied (e.g., the first 9 observations when fps is 10 and time_windows = [1.0]), will be populated with 0.
Warning
Note that 0 can be used as a bin-edge, but 360 should not be used as a bin-edge. Instead, use 0 and 359, or 1 and 360.
- Example
>>> data = np.array([270, 360, 10, 20, 90, 91, 180, 185, 260, 265]).astype(np.float32)
>>> bins = np.array([[270, 90], [91, 268]])
>>> CircularStatisticsMixin().sliding_circular_hotspots(data=data, bins=bins, time_window=0.5, fps=10)
>>> [[-1. , -1. ],
>>>  [-1. , -1. ],
>>>  [-1. , -1. ],
>>>  [-1. , -1. ],
>>>  [ 0.5, 0. ],
>>>  [ 0.4, 0.1],
>>>  [ 0.3, 0.2],
>>>  [ 0.2, 0.3],
>>>  [ 0.1, 0.4],
>>>  [ 0. , 0.5]]
- static sliding_circular_mean(data: ndarray, time_windows: ndarray, fps: int) ndarray [source]
Compute the circular mean in degrees within sliding temporal windows.
- Parameters
data (np.ndarray) – 1d array with feature values in degrees.
time_windows (np.ndarray) – Rolling time-windows as floats in seconds. E.g., [0.2, 0.4, 0.6]
fps (int) – fps of the recorded video
- Returns np.ndarray
Size data.shape[0] x time_windows.shape[0] array
Attention
The returned values represent the angular mean dispersion in the time-window [current_frame - time_window -> current_frame]. -1 is returned when current_frame - time_window is less than 0.
- Example
>>> data = np.random.normal(loc=45, scale=1, size=20).astype(np.float32)
>>> CircularStatisticsMixin().sliding_circular_mean(data=data, time_windows=np.array([0.5, 1.0]), fps=10)
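The per-window computation reduces to the circular mean, which can be sketched as follows (illustrative only; the sliding variant applies this to each [frame - window, frame] slice):

```python
import numpy as np

def circular_mean(data):
    """Circular mean (degrees) of angles given in degrees (sketch).

    Averages the unit vectors and takes the angle of the mean vector.
    """
    theta = np.deg2rad(data)
    return np.degrees(np.arctan2(np.sin(theta).mean(), np.cos(theta).mean())) % 360

m = circular_mean(np.array([350.0, 10.0]))
print(m)  # ~0 (mod 360), not the arithmetic mean 180
```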
- static sliding_circular_range(data: ndarray, time_windows: ndarray, fps: int) ndarray [source]
Jitted compute of sliding circular range for a time series of circular data. The range is defined as the angular span of the shortest arc that can contain all the data points. Measures the circular spread of data within sliding time windows of specified duration.
Note
Output data at the beginning of the series, where a full time-window is not yet satisfied (e.g., the first 9 observations when fps is 10 and time_windows = [1.0]), will be populated with 0.
- Parameters
data (np.ndarray) – 1D array of circular data measured in degrees
time_windows (np.ndarray) – Size of sliding time window in seconds. E.g., two windows of 0.5s and 1s would be represented as np.array([0.5, 1.0])
fps (int) – Frame-rate of recorded video.
- Return np.ndarray
Array of size len(sample_1) x len(time_window) with angular ranges in degrees.
- Examples
>>> data = np.array([260, 280, 300, 340, 360, 0, 10, 350, 0, 15]).astype(np.float32)
>>> CircularStatisticsMixin().sliding_circular_range(data=data, time_windows=np.array([0.5]), fps=10)
>>> [[ -1.], [ -1.], [ -1.], [ -1.], [100.], [80.], [70.], [30.], [20.], [25.]]
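The per-window computation reduces to the circular range, which can be sketched as follows (illustrative only): sort the angles, find the largest gap between neighbours including the wrap-around gap, and subtract it from 360.

```python
import numpy as np

def circular_range(data):
    """Angular span (degrees) of the shortest arc containing all angles (sketch)."""
    angles = np.sort(data % 360)
    if angles.shape[0] < 2:
        return 0.0
    # Gaps between adjacent sorted angles, plus the gap across the 0/360 boundary.
    gaps = np.append(np.diff(angles), 360.0 - angles[-1] + angles[0])
    return 360.0 - np.max(gaps)

print(circular_range(np.array([350.0, 0.0, 10.0])))        # 20.0
print(circular_range(np.array([0.0, 90.0, 180.0, 270.0])))  # 270.0
```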
- static sliding_circular_std(data: ndarray, fps: int, time_windows: ndarray) ndarray [source]
Compute standard deviation of angular data in sliding time windows.
- Parameters
data (ndarray) – 1D array of size len(frames) representing degrees.
time_windows (np.ndarray) – Sliding time-windows as floats in seconds.
fps (int) – fps of the recorded video
- Returns np.ndarray
Size data.shape[0] x time_windows.shape[0] with angular standard deviations in rolling time windows in degrees.
- Example
>>> data = np.array([180, 221, 32, 42, 212, 101, 139, 41, 69, 171, 149, 200]).astype(np.float32)
>>> CircularStatisticsMixin().sliding_circular_std(data=data, time_windows=np.array([0.5]), fps=10)
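The per-window computation reduces to the circular standard deviation, sketched here with σ = √(−2 · ln R), where R is the mean resultant vector length (the full-sample result matches the value in the circular_std example above):

```python
import numpy as np

def circular_std(data):
    """Circular standard deviation (degrees) of angles in degrees (sketch)."""
    theta = np.deg2rad(data)
    r = np.sqrt(np.cos(theta).mean() ** 2 + np.sin(theta).mean() ** 2)
    r = min(r, 1.0)  # guard against rounding pushing R slightly above 1
    return np.degrees(np.sqrt(-2 * np.log(r)))

print(circular_std(np.array([45.0, 45.0, 45.0])))  # ~0 for identical angles
data = np.array([180, 221, 32, 42, 212, 101, 139, 41, 69, 171, 149, 200], dtype=np.float32)
print(circular_std(data))  # close to the documented 75.037...
```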
- static sliding_kuipers_two_sample_test(sample_1: ndarray, sample_2: ndarray, time_windows: ndarray, fps: int) ndarray [source]
Jitted compute of Kuipers two-sample test comparing two distributions with sliding time window.
This function calculates the Kuipers two-sample test statistic for each time window, sliding through the given circular data sequences.
- Parameters
sample_1 (np.ndarray) – The first circular sample array in degrees.
sample_2 (np.ndarray) – The second circular sample array in degrees.
time_windows (np.ndarray) – An array containing the time window sizes (in seconds) for which the Kuipers two-sample test will be computed.
fps (int) – The frames per second, representing the sampling rate of the data.
- Returns np.ndarray
A 2D array containing the Kuipers two-sample test statistics for each time window and each time step.
- Examples
>>> sample_1 = np.random.randint(low=0, high=360, size=(100,)).astype(np.float64)
>>> sample_2 = np.random.randint(low=0, high=360, size=(100,)).astype(np.float64)
>>> D = CircularStatisticsMixin().sliding_kuipers_two_sample_test(sample_1=sample_1, sample_2=sample_2, time_windows=np.array([0.5, 5]), fps=2)
- static sliding_mean_resultant_vector_length(data: ndarray, fps: int, time_windows: ndarray) ndarray [source]
Jitted compute of the mean resultant vector within sliding time window. Captures the overall “pull” or “tendency” of the data points towards a central direction on the circle with a range between 0 and 1.
Attention
The returned values represent the resultant vector length in the time-window [(current_frame - time_window) -> current_frame]. -1 is returned where current_frame - time_window is less than 0.
- Parameters
data (np.ndarray) – 1D array of size len(frames) representing degrees.
time_windows (np.ndarray) – Rolling time-windows as floats in seconds.
fps (int) – fps of the recorded video
- Returns np.ndarray
Size len(data) x len(time_windows) array representing the resultant vector length in the prior time_window.
- Example
>>> data_1, data_2 = np.random.normal(loc=45, scale=1, size=100), np.random.normal(loc=90, scale=45, size=100)
>>> data = np.hstack([data_1, data_2])
>>> CircularStatisticsMixin().sliding_mean_resultant_vector_length(data=data.astype(np.float32), time_windows=np.array([1.0]), fps=10)
- static sliding_rao_spacing(data: ndarray, time_windows: ndarray, fps: int) ndarray [source]
Jitted compute of the uniformity of a circular dataset in sliding windows.
- Parameters
data (ndarray) – 1D array of size len(frames) representing degrees.
time_windows (np.ndarray) – Rolling time-windows as floats in seconds.
fps (int) – fps of the recorded video
- Return np.ndarray
Array of size len(data) x len(time_windows) representing Rao's spacing U in every sliding window [-window:n].
Rao's spacing is calculated as U = ½ Σᵢ |Tᵢ − λ|, where Tᵢ is the spacing between adjacent sorted data points in the sliding window, λ = 360/n is the expected spacing under uniformity, and n is the number of data points in the sliding window.
Note
For frames occurring before a complete time window, 0.0 is returned.
- References
- 1
UCSB.
- Example
>>> data = np.random.randint(low=0, high=360, size=(500,)).astype(np.float32)
>>> result = CircularStatisticsMixin().sliding_rao_spacing(data=data, time_windows=np.array([0.5, 1.0]), fps=10)
Plotting methods
- class simba.mixins.plotting_mixin.PlottingMixin[source]
Bases:
object
Methods for visualizations
- static categorical_scatter(data: Union[ndarray, DataFrame], columns: Optional[List[str]] = ('X', 'Y', 'Cluster'), palette: Optional[str] = 'Set1', show_box: Optional[bool] = False, size: Optional[int] = 10, title: Optional[str] = None, save_path: Optional[Union[str, PathLike]] = None)[source]
Create a 2D scatterplot with a categorical legend.
- Parameters
data (Union[np.ndarray, pd.DataFrame]) – Input data, either a NumPy array or a pandas DataFrame.
columns (Optional[List[str]]) – A list of column names for the x-axis, y-axis, and the categorical variable respectively. Default is [“X”, “Y”, “Cluster”].
palette (Optional[str]) – The color palette to be used for the categorical variable. Default is “Set1”.
show_box (Optional[bool]) – Whether to display the plot axis. Default is False.
size (Optional[int]) – Size of markers in the scatterplot. Default is 10.
title (Optional[str]) – Title for the plot. Default is None.
save_path (Optional[Union[str, os.PathLike]]) – The path where the plot will be saved. Default is None which returns the image.
- Returns matplotlib.axes._subplots.AxesSubplot or None
The scatterplot if ‘save_path’ is not provided, otherwise None.
- static continuous_scatter(data: Union[ndarray, DataFrame], columns: Optional[List[str]] = ('X', 'Y', 'Cluster'), palette: Optional[str] = 'magma', show_box: Optional[bool] = False, size: Optional[int] = 10, title: Optional[str] = None, bg_clr: Optional[str] = None, save_path: Optional[Union[str, PathLike]] = None)[source]
Create a 2D scatterplot with a continuous legend
- create_gantt_img(bouts_df: DataFrame, clf_name: str, image_index: int, fps: int, gantt_img_title: str)[source]
Helper to create a single gantt plot based on the data preceding the input image.
- Parameters
bouts_df (pd.DataFrame) – Dataframe holding information on individual bouts created by simba.misc_tools.get_bouts_for_gantt().
clf_name (str) – Name of the classifier.
image_index (int) – The count of the image. E.g., 1000 will create a gantt image representing frames 1-1000.
fps (int) – The fps of the input video.
gantt_img_title (str) – Title of the image.
:return np.ndarray
- create_single_color_lst(pallete_name: typing_extensions.Literal[<Options.PALETTE_OPTIONS: ['magma', 'jet', 'inferno', 'plasma', 'viridis', 'gnuplot2', 'RdBu', 'winter']>], increments: int, as_rgb_ratio: bool = False, as_hex: bool = False) List[Union[str, int, float]] [source]
Helper to create a color palette of bgr colors in a list.
- Parameters
pallete_name (str) – Name of the palette, e.g., 'magma' or 'jet'.
increments (int) – Number of colors to include in the created palette.
as_rgb_ratio (bool) – If True, return the colors as RGB ratios.
as_hex (bool) – If True, return the colors as HEX strings.
:return list
Note
If as_rgb_ratio AND as_hex, then returns HEX.
- static draw_lines_on_img(img: ndarray, start_positions: ndarray, end_positions: ndarray, color: Tuple[int, int, int], highlight_endpoint: Optional[bool] = False, thickness: Optional[int] = 2, circle_size: Optional[int] = 2) ndarray [source]
Helper to draw a set of lines onto an image.
- Parameters
img (np.ndarray) – The image to draw the lines on.
start_positions (np.ndarray) – 2D numpy array representing the start positions of the lines in x, y format.
end_positions (np.ndarray) – 2D numpy array representing the end positions of the lines in x, y format.
color (Tuple[int, int, int]) – The color of the lines in BGR format.
highlight_endpoint (Optional[bool]) – If True, highlights the ends of the lines with circles.
thickness (Optional[int]) – The thickness of the lines.
circle_size (Optional[int]) – If highlight_endpoint is True, the size of the highlighted points.
- Return np.ndarray
The image with the lines overlayed.
- get_bouts_for_gantt(data_df: DataFrame, clf_name: str, fps: int) ndarray [source]
Helper to detect all behavior bouts for a specific classifier.
- static insert_directing_line(directing_df: DataFrame, img: ndarray, shape_name: str, animal_name: str, frame_id: int, color: Optional[Tuple[int]] = (0, 0, 255), thickness: Optional[int] = 2, style: Optional[str] = 'lines') ndarray [source]
Helper to insert lines between the actor ‘eye’ and the ROI centers.
- Parameters
directing_df (pd.DataFrame) – Dataframe containing eye and ROI locations. Stored as results in an instance of simba.roi_tools.ROI_directing_analyzer.DirectingROIAnalyzer.
img (np.ndarray) – The image to draw the line on.
shape_name (str) – The name of the shape to draw the line to.
animal_name (str) – The name of the animal
frame_id (int) – The frame number in the video
color (Optional[Tuple[int]]) – The color of the line
thickness (Optional[int]) – The thickness of the line.
style (Optional[str]) – The style of the line. “lines” or “funnel”.
- Return np.ndarray
The input image with the line.
- static joint_plot(data: Union[ndarray, DataFrame], columns: Optional[List[str]] = ('X', 'Y', 'Cluster'), palette: Optional[str] = 'Set1', kind: Optional[str] = 'scatter', size: Optional[int] = 10, title: Optional[str] = None, save_path: Optional[Union[str, PathLike]] = None)[source]
Generate a joint plot.
Useful when visualizing embedded behavior data latent spaces with dense and overlapping scatters.
- Parameters
data (Union[np.ndarray, pd.DataFrame]) – Input data, either a NumPy array or a pandas DataFrame.
columns (Optional[List[str]]) – Names of columns if input is dataframe, default is [“X”, “Y”, “Cluster”].
palette (Optional[str]) – Palette for the plot, default is “Set1”.
kind (Optional[str]) – Type of plot (“scatter”, “kde”, “hist”, or “reg”), default is “scatter”.
size (Optional[int]) – Size of markers for scatter plot, default is 10.
title (Optional[str]) – Title of the plot, default is None.
save_path (Optional[Union[str, os.PathLike]]) – Path to save the plot image, default is None.
- Returns sns.JointGrid or None
JointGrid object if save_path is None, else None.
- Example
>>> data, lbls = make_blobs(n_samples=100000, n_features=2, centers=10, random_state=42)
>>> data = np.hstack((data, lbls.reshape(-1, 1)))
>>> PlottingMixin.joint_plot(data=data, columns=['X', 'Y', 'Cluster'], title='The plot')
- make_distance_plot(data: array, line_attr: Dict[int, str], style_attr: Dict[str, Any], fps: int, save_img: bool = False, save_path: Optional[str] = None) ndarray [source]
Helper to make a single line plot .png image with N lines.
- Parameters
data (np.array) – Two-dimensional array where rows represent frames and columns represent intertwined x and y coordinates.
line_attr (dict) – Line color attributes.
style_attr (dict) – Plot attributes (size, font size, line width etc).
fps (int) – Video frame rate.
save_path (Optional[str]) – Location to store the output .png image. If None, the image is returned.
- Example
>>> fps = 10
>>> data = np.random.random((100, 2))
>>> line_attr = {0: ['Blue'], 1: ['Red']}
>>> save_path = '/_tests/final_frm.png'
>>> style_attr = {'width': 640, 'height': 480, 'line width': 6, 'font size': 8, 'y_max': 'auto'}
>>> self.make_distance_plot(fps=fps, data=data, line_attr=line_attr, style_attr=style_attr, save_path=save_path)
- static make_line_plot_plotly(data: List[ndarray], colors: List[str], show_box: Optional[bool] = True, show_grid: Optional[bool] = False, width: Optional[int] = 640, height: Optional[int] = 480, line_width: Optional[int] = 6, font_size: Optional[int] = 8, bg_clr: Optional[str] = 'white', x_lbl_divisor: Optional[float] = None, title: Optional[str] = None, y_lbl: Optional[str] = None, x_lbl: Optional[str] = None, y_max: Optional[int] = -1, line_opacity: Optional[int] = 0.5, save_path: Optional[Union[str, PathLike]] = None)[source]
Create a line plot using Plotly.
Note
Plotly can be more reliable than matplotlib on some systems when accessed through multiprocessing calls.
If not called through multiprocessing, consider using simba.mixins.plotting_mixin.PlottingMixin.make_line_plot().
Uses kaleido to transform the image to a numpy array or save it to disk.
- Parameters
data (List[np.ndarray]) – List of 1D numpy arrays representing lines.
colors (List[str]) – List of named colors of size len(data).
show_box (bool) – Whether to show the plot box (axes, title, etc.).
show_grid (bool) – Whether to show gridlines on the plot.
width (int) – Width of the plot in pixels.
height (int) – Height of the plot in pixels.
line_width (int) – Width of the lines in the plot.
font_size (int) – Font size for axis labels and tick labels.
bg_clr (str) – Background color of the plot.
x_lbl_divisor (float) – Divisor for adjusting the tick spacing on the x-axis.
title (str) – Title of the plot.
y_lbl (str) – Label for the y-axis.
x_lbl (str) – Label for the x-axis.
y_max (int) – Maximum value for the y-axis.
line_opacity (float) – Opacity of the lines in the plot.
save_path (Union[str, os.PathLike]) – Path to save the plot image. If None, returns a numpy array of the plot.
- Returns
If save_path is None, returns a numpy array representing the plot image.
- Example
>>> p = np.random.randint(0, 50, (100,))
>>> y = np.random.randint(0, 50, (200,))
>>> img = PlottingMixin.make_line_plot_plotly(data=[p, y], show_box=False, font_size=20, bg_clr='white', show_grid=False, x_lbl_divisor=30, colors=['Red', 'Green'], save_path='/Users/simon/Desktop/envs/simba/troubleshooting/beepboop174/project_folder/frames/output/line_plot/Trial 3_final_img.png')
- static make_path_plot(data: List[ndarray], colors: List[Tuple[int, int, int]], width: Optional[int] = 640, height: Optional[int] = 480, max_lines: Optional[int] = None, bg_clr: Optional[Union[Tuple[int, int, int], ndarray]] = (255, 255, 255), circle_size: Optional[int] = 3, font_size: Optional[float] = 2.0, font_thickness: Optional[int] = 2, line_width: Optional[int] = 2, animal_names: Optional[List[str]] = None, clf_attr: Optional[Dict[str, Any]] = None, save_path: Optional[Union[str, PathLike]] = None) Union[None, ndarray] [source]
Creates a path plot visualization from the given data.
- Parameters
data (List[np.ndarray]) – List of numpy arrays containing path data.
colors (List[Tuple[int, int, int]]) – List of RGB tuples representing colors for each path.
width – Width of the output image (default is 640 pixels).
height – Height of the output image (default is 480 pixels).
max_lines – Maximum number of lines to plot from each path data.
bg_clr – Background color of the plot (default is white).
circle_size – Size of the circle marker at the end of each path (default is 3).
font_size – Font size for displaying animal names (default is 2.0).
font_thickness – Thickness of the font for displaying animal names (default is 2).
line_width – Width of the lines representing paths (default is 2).
animal_names – List of names for the animals corresponding to each path.
clf_attr – Dictionary containing attributes for classification markers.
save_path – Path to save the generated plot image.
- Returns
If save_path is None, returns the generated image as a numpy array, otherwise, returns None.
- Example
>>> x = np.random.randint(0, 500, (100, 2))
>>> y = np.random.randint(0, 500, (100, 2))
>>> position_data = np.random.randint(0, 500, (100, 2))
>>> clf_data_1 = np.random.randint(0, 2, (100,))
>>> clf_data_2 = np.random.randint(0, 2, (100,))
>>> clf_data = {'Attack': {'color': (155, 1, 10), 'size': 30, 'positions': position_data, 'clfs': clf_data_1}, 'Sniffing': {'color': (155, 90, 10), 'size': 30, 'positions': position_data, 'clfs': clf_data_2}}
>>> PlottingMixin.make_path_plot(data=[x, y], colors=[(0, 255, 0), (255, 0, 0)], clf_attr=clf_data)
- make_probability_plot(data: Series, style_attr: dict, clf_name: str, fps: int, save_path: str) ndarray [source]
Make a single classifier probability plot png image.
- Parameters
data (pd.Series) – Series holding the classifier probability values.
style_attr (dict) – Plot style attributes (size, font size, line width etc.).
clf_name (str) – Name of the classifier.
fps (int) – Frame-rate of the video.
save_path (str) – Location to store output .png image.
- Example
>>> data = pd.Series(np.random.random((100, 1)).flatten())
>>> style_attr = {'width': 640, 'height': 480, 'font size': 10, 'line width': 6, 'color': 'blue', 'circle size': 20}
>>> clf_name = 'Attack'
>>> fps = 10
>>> save_path = '/_test/frames/output/probability_plots/Together_1_final_frame.png'
>>> _ = self.make_probability_plot(data=data, style_attr=style_attr, clf_name=clf_name, fps=fps, save_path=save_path)
- static polygons_onto_image(img: ndarray, polygons: DataFrame, show_center: Optional[bool] = False, show_tags: Optional[bool] = False, circle_size: Optional[int] = 2) ndarray [source]
Helper to insert polygon overlays onto an image.
- Parameters
img (np.ndarray) – The image to draw the polygons on.
polygons (pd.DataFrame) – Dataframe holding the polygon definitions to overlay.
show_center (Optional[bool]) – If True, show the center point of each polygon.
show_tags (Optional[bool]) – If True, show the polygon tag locations.
circle_size (Optional[int]) – The size of the circles drawn at centers and tags.
- Returns
The input image with the polygon overlays.
- remove_a_folder(folder_dir: str) None [source]
Helper to remove a directory, used for cleaning up smaller multiprocessed videos following concatenation.
- resize_gantt(gantt_img: array, img_height: int) ndarray [source]
Helper to resize image while retaining aspect ratio.
- static rotate_img(img: ndarray, right: bool) ndarray [source]
Rotate a color image 90 degrees to the left or right.
- Parameters
img (np.ndarray) – Input image as numpy array in uint8 format.
right (bool) – If True, rotates to the right. If False, rotates to the left.
- Returns
The rotated image as a numpy array of uint8 format.
- Example
>>> img = cv2.imread('/Users/simon/Desktop/test.png') >>> rotated_img = PlottingMixin.rotate_img(img=img, right=False)
- split_and_group_df(df: ~pandas.core.frame.DataFrame, splits: int, include_row_index: bool = False, include_split_order: bool = True) -> (typing.List[pandas.core.frame.DataFrame], <class 'int'>)[source]
Helper to split a dataframe for multiprocessing. If include_split_order, then include the group number in split data as a column. If include_row_index, includes a column representing the row index in the array, which can be helpful for knowing the frame indexes while multiprocessing videos. Returns split data and approximations of number of observations per split.
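The behaviour can be sketched as follows (a sketch under the assumption that the split-order column is named 'group'; the actual column name in SimBA may differ):

```python
import numpy as np
import pandas as pd

def split_and_group_df(df, splits, include_split_order=True):
    """Split a dataframe into roughly equal chunks for multiprocessing (sketch).

    Optionally adds a 'group' column holding the chunk index, and returns
    the chunks plus the approximate number of observations per chunk.
    """
    idx_chunks = np.array_split(np.arange(len(df)), splits)
    chunks = [df.iloc[idx].copy() for idx in idx_chunks]
    if include_split_order:
        for i, chunk in enumerate(chunks):
            chunk['group'] = i
    return chunks, len(chunks[0])

df = pd.DataFrame({'frame': range(10)})
chunks, obs_per_split = split_and_group_df(df, splits=3)
print(len(chunks), obs_per_split)  # 3 4
```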
GUI pop-up methods
- class simba.mixins.pop_up_mixin.PopUpMixin(title: str, config_path: Optional[str] = None, main_scrollbar: Optional[bool] = True, size: Tuple[int, int] = (960, 720))[source]
Bases:
object
Methods for pop-up windows in SimBA. E.g., common methods for creating pop-up windows with drop-downs, checkboxes, entry-boxes, listboxes etc.
- Parameters
title (str) – Pop-up window title
config_path (Optional[configparser.Configparser]) – Path to SimBA project_config.ini. If a path is passed, the project config is read in. If None, the project config is not read in.
size (tuple) – The size (height x width) of the pop-up window in pixels.
main_scrollbar (bool) – If True, the pop-up window is scrollable.
- add_to_listbox_from_entrybox(list_box: Listbox, entry_box: Entry_Box)[source]
Add a value that populates a tkinter entry_box to a tkinter listbox.
- Parameters
list_box (Listbox) – The tkinter Listbox to add the value to.
entry_box (Entry_Box) – The tkinter Entry_Box containing the value that should be added to the list_box.
- add_value_to_listbox(list_box: Listbox, value: float)[source]
Add a float value to a tkinter listbox.
- Parameters
list_box (Listbox) – The tkinter Listbox to add the value to.
value (float) – Value to add to the listbox.
- add_values_to_several_listboxes(list_boxes: List[Listbox], values: List[float])[source]
Add N values to N listboxes. E.g., values[0] will be added to list_boxes[0].
- Parameters
list_boxes (List[Listbox]) – List of Listboxes that the values should be added to.
values (List[float]) – List of floats that will be added to the list_boxes.
- children_cnt_main() int [source]
Find the number of children (e.g., labelframes) that currently exist within the main pop-up window. Useful for finding the row at which a new frame should be inserted in the window.
- create_cb_frame(main_frm: Frame, cb_titles: List[str], frm_title: str, command: object = None) Dict[str, BooleanVar] [source]
Creates a labelframe with one checkbox per title in cb_titles, and inserts the labelframe at the bottom of the pop-up window.
- Parameters
- Return Dict[str, BooleanVar]
Dictionary holding the
cb_titles
as keys and the BooleanVar representing if the checkbox is ticked or not.
- create_choose_number_of_body_parts_frm(project_body_parts: List[str], run_function: object)[source]
Many menus depend on how many animals the user chooses to compute metrics for, so these menus must be populated dynamically. This function creates a single drop-down menu where the user selects the number of animals to compute metrics for. It inserts this drop-down at the bottom of the pop-up window, and ties the dropdown menu choice to a callback.
- create_clf_checkboxes(main_frm: Frame, clfs: List[str], title: str = 'SELECT CLASSIFIER ANNOTATIONS')[source]
Creates a labelframe with one checkbox per classifier, and inserts the labelframe into the bottom of the pop-up window.
Note
Legacy. Use
create_cb_frame
instead.
- create_dropdown_frame(main_frm: Frame, drop_down_titles: List[str], drop_down_options: List[str], frm_title: str) Dict[str, DropDownMenu] [source]
Creates a labelframe with dropdown menus and inserts it at the bottom of the pop-up window.
- Parameters
- Return Dict[str, BooleanVar]
Dictionary holding the
drop_down_titles
and the drop-down menus as values.
- create_run_frm(run_function: Callable, title: Optional[str] = 'RUN', btn_txt_clr: Optional[str] = 'black') None [source]
Create a label frame with a single button with a specified callback.
- enable_dropdown_from_checkbox(check_box_var: BooleanVar, dropdown_menus: List[DropDownMenu])[source]
Given a single checkbox, enable a bunch of dropdowns if the checkbox is ticked, and disable the dropdowns if the checkbox is un-ticked.
- Parameters
check_box_var (BooleanVar) – The checkbox associated tkinter BooleanVar.
dropdown_menus (List[DropDownMenu]) – List of dropdowns whose status is controlled by the
check_box_var
.
- enable_entrybox_from_checkbox(check_box_var: BooleanVar, entry_boxes: List[Entry_Box], reverse: bool = False)[source]
Given a single checkbox, enable or disable a bunch of entry-boxes based on the status of the checkbox.
- Parameters
check_box_var (BooleanVar) – The checkbox associated tkinter BooleanVar.
entry_boxes (List[Entry_Box]) – List of entry-boxes whose status is controlled by the
check_box_var
.
reverse (bool) – If False, the entry-boxes are enabled when the checkbox is ticked. Else, the entry-boxes are enabled when the checkbox is unticked. Default: False.
- frame_children(frame: Frame) int [source]
Find the number of children (e.g., labelframes) that currently exist within a specified frame. Similar to
children_cnt_main
, but accepts a specific frame rather than hardcoding the main frame.
- place_frm_at_top_right(frm: Toplevel)[source]
Place a TopLevel tkinter pop-up at the top right of the monitor. Note: call before putting scrollbars or converting to Canvas.
- remove_from_listbox(list_box: Listbox)[source]
Remove the current selection from a listbox.
- Parameters
list_box (Listbox) – The listbox that the current selection should be removed from.
- update_file_select_box_from_dropdown(filename: str, fileselectbox: FileSelect)[source]
Updates the text inside a tkinter FileSelect entrybox with a new string.
Pose importing methods
- class simba.mixins.pose_importer_mixin.PoseImporterMixin[source]
Bases:
object
Methods for importing pose-estimation data.
- link_video_paths_to_data_paths(data_paths: List[str], video_paths: List[str], str_splits: Optional[List[str]] = None, filename_cleaning_func: object = None) dict [source]
Given a list of paths to video files and a separate list of paths to data files, create a dictionary pairing each video file to a datafile based on the file names of the video and data file.
- Parameters
data_paths (List[str]) – List of full paths to data files, e.g., CSV or H5 files.
video_paths (List[str]) – List of full paths to video files, e.g., MP4 or AVI files.
str_splits (Optional[List[str]]) – Optional list of substrings that the data_paths would need to be split at in order to find a matching video name. E.g., [‘dlc_resnet50’].
filename_cleaning_func (Optional[object]) – Optional filename cleaning function that the data_paths filenames would have to pass through in order to find a matching video name. E.g.,
simba.utils.read_write.clean_sleap_filename(filepath)
.
- Returns dict
Dictionary with the data/file name as keys, and the video and data paths as values.
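A minimal sketch of the name-matching idea (the helper name and return layout are illustrative assumptions, not the SimBA implementation):

```python
import os

def pair_videos_to_data(data_paths, video_paths, str_splits=None):
    # Index videos by filename stem, then match each data file to a video
    # after optionally stripping tool-specific substrings (e.g., 'dlc_resnet50').
    video_names = {os.path.splitext(os.path.basename(p))[0]: p for p in video_paths}
    pairs = {}
    for data_path in data_paths:
        name = os.path.splitext(os.path.basename(data_path))[0]
        candidates = [name]
        for s in (str_splits or []):
            if s in name:
                candidates.append(name.split(s)[0].rstrip('_'))
        for candidate in candidates:
            if candidate in video_names:
                pairs[candidate] = {'DATA': data_path, 'VIDEO': video_names[candidate]}
                break
    return pairs

pairs = pair_videos_to_data(data_paths=['Video_1_dlc_resnet50.csv'],
                            video_paths=['/videos/Video_1.mp4'],
                            str_splits=['dlc_resnet50'])
```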
Modelling methods
- class simba.mixins.train_model_mixin.TrainModelMixin[source]
Bases:
object
Train model methods
- bout_train_test_splitter(x_df: DataFrame, y_df: Series, test_size: float) Tuple[DataFrame, DataFrame, Series, Series] [source]
Helper to split train and test based on annotated bouts.
- Parameters
x_df (pd.DataFrame) – Features
y_df (pd.Series) – Target
test_size (float) – Size of test as ratio of all annotated bouts (e.g.,
0.2
).
- Return np.ndarray x_train
Features for training
- Return np.ndarray x_test
Features for testing
- Return np.ndarray y_train
Target for training
- Return np.ndarray y_test
Target for testing
- Examples
>>> x = pd.DataFrame(data=[[11, 23, 12], [87, 65, 76], [23, 73, 27], [10, 29, 2], [12, 32, 42], [32, 73, 2], [21, 83, 98], [98, 1, 1]])
>>> y = pd.Series([0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1])
>>> x_train, x_test, y_train, y_test = TrainModelMixin().bout_train_test_splitter(x_df=x, y_df=y, test_size=0.5)
- calc_learning_curve(x_y_df: DataFrame, clf_name: str, shuffle_splits: int, dataset_splits: int, tt_size: float, rf_clf: RandomForestClassifier, save_dir: str, save_file_no: Optional[int] = None, multiclass: bool = False) None [source]
Helper to compute random forest learning curves with cross-validation.
- Parameters
x_y_df (pd.DataFrame) – Dataframe holding features and target.
clf_name (str) – Name of the classifier
shuffle_splits (int) – Number of cross-validation datasets at each data split.
dataset_splits (int) – Number of data splits.
tt_size (float) – test size
rf_clf (RandomForestClassifier) – sklearn RandomForestClassifier object
save_dir (str) – Directory where to save output in csv file format.
save_file_no (Optional[int]) – If integer, represents the count of the classifier within a grid search. If none, the classifier is not part of a grid search.
multiclass (bool) – If True, then the target consists of several categories [0, 1, 2 …] and scoring becomes
None
. If False, then scoring is
f1
.
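The cross-validated learning-curve computation can be reproduced directly with scikit-learn; this standalone sketch uses toy data in place of a SimBA feature dataframe:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import ShuffleSplit, learning_curve

# Toy data standing in for a SimBA features/target dataframe.
rng = np.random.RandomState(0)
x, y = rng.rand(120, 4), rng.randint(0, 2, 120)

# shuffle_splits -> n_splits, tt_size -> test_size, dataset_splits -> train_sizes.
cv = ShuffleSplit(n_splits=3, test_size=0.2, random_state=0)
train_sizes, _, test_scores = learning_curve(
    RandomForestClassifier(n_estimators=10, random_state=0),
    x, y, cv=cv, scoring='f1', train_sizes=np.linspace(0.25, 1.0, 4))
mean_test_scores = test_scores.mean(axis=1)  # one mean F1 score per data split
```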
- calc_permutation_importance(x_test: ndarray, y_test: ndarray, clf: RandomForestClassifier, feature_names: List[str], clf_name: str, save_dir: Union[str, PathLike], save_file_no: Optional[int] = None) None [source]
Computes feature permutation importance scores.
- Parameters
x_test (np.ndarray) – 2d feature test data of shape len(frames) x len(features)
y_test (np.ndarray) – 2d feature target test data of shape len(frames) x 1
clf (RandomForestClassifier) – random forest classifier
feature_names (List[str]) – Names of features in x_test
clf_name (str) – Name of classifier in y_test
save_dir (str) – Directory where to save results in CSV format
save_file_no (Optional[int]) – If permutation importance calculation is part of a grid search, provide an integer identifier representing the model in the grid search sequence. Will be used as a suffix in the output filename.
- calc_pr_curve(rf_clf: RandomForestClassifier, x_df: DataFrame, y_df: DataFrame, clf_name: str, save_dir: str, multiclass: bool = False, classifier_map: Optional[Dict[int, str]] = None, save_file_no: Optional[int] = None) None [source]
Helper to compute random forest precision-recall curve.
- Parameters
rf_clf (RandomForestClassifier) – sklearn RandomForestClassifier object.
x_df (pd.DataFrame) – Pandas dataframe holding test features.
y_df (pd.DataFrame) – Pandas dataframe holding test target.
clf_name (str) – Classifier name.
save_dir (str) – Directory where to save output in csv file format.
multiclass (bool) – If the classifier is a multi-classifier. Default: False.
classifier_map (Dict[int, str]) – If multiclass, dictionary mapping integers to classifier names.
save_file_no (Optional[int]) – If integer, represents the count of the classifier within a grid search. If none, the classifier is not part of a grid search.
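The underlying computation follows scikit-learn's precision_recall_curve on the behavior-present probability column; a minimal sketch with toy data:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_recall_curve

# Toy stand-in for SimBA test features/target.
rng = np.random.RandomState(0)
x, y = rng.rand(100, 4), rng.randint(0, 2, 100)
clf = RandomForestClassifier(n_estimators=10, random_state=0).fit(x, y)

# Probability of the behavior-present class (column 1 of predict_proba).
probs = clf.predict_proba(x)[:, 1]
precision, recall, thresholds = precision_recall_curve(y, probs)
```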
- check_df_dataset_integrity(df: DataFrame, file_name: str, logs_path: Union[str, PathLike]) None [source]
Helper to check for np.inf, -np.inf, NaN, and None entries in a single dataframe.
- Parameters
x_df (pd.DataFrame) – Features
- Raises
NoDataError – If data contains np.inf, -np.inf, NaN, or None.
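The check can be sketched in pandas (illustrative; the real method also logs the offending columns to the project logs directory):

```python
import numpy as np
import pandas as pd

def invalid_columns(df: pd.DataFrame) -> list:
    # Treat inf/-inf as missing, then flag any column containing NaN/None.
    bad = df.replace([np.inf, -np.inf], np.nan).isna().any()
    return list(bad[bad].index)

df = pd.DataFrame({'ok': [1.0, 2.0], 'has_nan': [1.0, np.nan], 'has_inf': [np.inf, 2.0]})
```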
- check_raw_dataset_integrity(df: DataFrame, logs_path: Optional[Union[str, PathLike]]) None [source]
Helper to check column-wise NaNs in raw input data for fitting model.
- Parameters
df (pd.DataFrame) – Raw input data.
logs_path (str) – The logs directory of the SimBA project
- Raises
FaultyTrainingSetError – When the dataset contains NaNs
- check_sampled_dataset_integrity(x_df: DataFrame, y_df: DataFrame) None [source]
Helper to check for non-numerical entries post data sampling
- Parameters
x_df (pd.DataFrame) – Features
y_df (pd.DataFrame) – Target
- Raises
FaultyTrainingSetError – Training or testing data sets contain non-numerical values
- clf_fit(clf: RandomForestClassifier, x_df: DataFrame, y_df: DataFrame) RandomForestClassifier [source]
Helper to fit clf model
- Parameters
clf (RandomForestClassifier) – Un-fitted random forest classifier object
x_df (pd.DataFrame) – Pandas dataframe with features.
y_df (pd.DataFrame) – Pandas dataframe/Series with target
- Return RandomForestClassifier
Fitted random forest classifier object
- clf_predict_proba(clf: RandomForestClassifier, x_df: DataFrame, multiclass: bool = False, model_name: Optional[str] = None, data_path: Optional[Union[str, PathLike]] = None) ndarray [source]
- Parameters
- Return np.ndarray
2D array with frame represented by rows and present/absent probabilities as columns
- Raises
FeatureNumberMismatchError – If shape of x_df and clf.n_features_ or n_features_in_ show mismatch
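The shape guard plus probability computation can be sketched as follows (a plain ValueError standing in for SimBA's FeatureNumberMismatchError):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Toy features/target standing in for SimBA data.
rng = np.random.RandomState(0)
x_df = pd.DataFrame(rng.rand(50, 4), columns=[f'f{i}' for i in range(4)])
clf = RandomForestClassifier(n_estimators=10, random_state=0).fit(x_df, rng.randint(0, 2, 50))

# Guard against feature-count mismatch before predicting.
if x_df.shape[1] != clf.n_features_in_:
    raise ValueError(f'Expected {clf.n_features_in_} features, got {x_df.shape[1]}')
probs = clf.predict_proba(x_df)  # rows = frames, columns = absent/present probabilities
```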
- create_clf_report(rf_clf: RandomForestClassifier, x_df: DataFrame, y_df: DataFrame, class_names: List[str], save_dir: str, clf_name: Optional[str] = None, save_file_no: Optional[int] = None) None [source]
Helper to create classifier truth table report.
See also
- Parameters
rf_clf (RandomForestClassifier) – sklearn RandomForestClassifier object.
x_df (pd.DataFrame) – dataframe holding test features
y_df (pd.DataFrame) – dataframe holding test target
class_names (List[str]) – List of classes. E.g., [‘Attack absent’, ‘Attack present’]
clf_name (Optional[str]) – Name of the classifier. If not None, then used in the output file name.
save_dir (str) – Directory where to save output in csv file format.
save_file_no (Optional[int]) – If integer, represents the count of the classifier within a grid search. If none, the classifier is not part of a grid search.
- create_example_dt(rf_clf: RandomForestClassifier, clf_name: str, feature_names: List[str], class_names: List[str], save_dir: str, tree_id: Optional[int] = 3, save_file_no: Optional[int] = None) None [source]
Helper to produce visualization of random forest decision tree using graphviz.
Note
- Parameters
rf_clf (RandomForestClassifier) – sklearn RandomForestClassifier object.
clf_name (str) – Classifier name.
feature_names (List[str]) – List of feature names.
class_names (List[str]) – List of classes. E.g., [‘Attack absent’, ‘Attack present’]
save_dir (str) – Directory where to save output in csv file format.
save_file_no (Optional[int]) – If integer, represents the count of the classifier within a grid search. If none, the classifier is not part of a grid search.
- create_meta_data_csv_training_one_model(meta_data_lst: list, clf_name: str, save_dir: Union[str, PathLike]) None [source]
Helper to save single model meta data (hyperparameters, sampling settings etc.) from list format into SimBA compatible CSV config file.
- create_shap_log(ini_file_path: str, rf_clf: RandomForestClassifier, x_df: DataFrame, y_df: Series, x_names: List[str], clf_name: str, cnt_present: int, cnt_absent: int, save_it: int = 100, save_path: Optional[Union[str, PathLike]] = None, save_file_no: Optional[int] = None) Union[None, Tuple[DataFrame]] [source]
Compute SHAP values for a random forest classifier.
This method computes SHAP (SHapley Additive exPlanations) values for a given random forest classifier.
See also
Note
For improved run-times, use multiprocessing through
simba.mixins.train_model_mixins.TrainModelMixin.create_shap_log_mp()
Uses TreeSHAP. The SHAP value for feature ‘i’ in the context of a prediction ‘f’ and input ‘x’ is the weighted average of the marginal contribution of feature ‘i’ over all feature subsets.
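The Shapley weighting this refers to, reconstructed here from the standard SHAP definition (an assumption, not taken from the SimBA source):

```latex
\phi_i(f, x) = \sum_{S \subseteq F \setminus \{i\}}
  \frac{|S|! \, (|F| - |S| - 1)!}{|F|!}
  \left[ f_{S \cup \{i\}}(x_{S \cup \{i\}}) - f_S(x_S) \right]
```

where F is the set of all features and S ranges over feature subsets excluding feature i; TreeSHAP computes this exactly for tree ensembles in polynomial time.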
- Parameters
ini_file_path (str) – Path to the SimBA project_config.ini
rf_clf (RandomForestClassifier) – sklearn random forest classifier
x_df (pd.DataFrame) – Test features.
y_df (pd.DataFrame) – Test target.
x_names (List[str]) – Feature names.
clf_name (str) – Classifier name.
cnt_present (int) – Number of behavior-present frames to calculate SHAP values for.
cnt_absent (int) – Number of behavior-absent frames to calculate SHAP values for.
save_path (str) – Directory where to save output in csv file format.
save_file_no (Optional[int]) – If integer, represents the count of the classifier within a grid search. If none, the classifier is not part of a grid search.
- create_shap_log_mp(ini_file_path: str, rf_clf: RandomForestClassifier, x_df: DataFrame, y_df: DataFrame, x_names: List[str], clf_name: str, cnt_present: int, cnt_absent: int, batch_size: int = 10, save_path: Optional[Union[str, PathLike]] = None, save_file_no: Optional[int] = None) Union[None, Tuple[DataFrame]] [source]
Helper to compute SHAP values using multiprocessing. For a single-core alternative, see simba.mixins.train_model_mixins.TrainModelMixin.create_shap_log().
- Parameters
ini_file_path (str) – Path to the SimBA project_config.ini
rf_clf (RandomForestClassifier) – sklearn random forest classifier
x_df (pd.DataFrame) – Test features.
y_df (pd.DataFrame) – Test target.
x_names (List[str]) – Feature names.
clf_name (str) – Classifier name.
cnt_present (int) – Number of behavior-present frames to calculate SHAP values for.
cnt_absent (int) – Number of behavior-absent frames to calculate SHAP values for.
save_dir (Optional[str, os.PathLike]) – Optional directory where to save output in csv file format. If None, then returns the dataframes.
save_file_no (Optional[int]) – If integer, represents the count of the classifier within a grid search. If none, the classifier is not part of a grid search.
- create_x_importance_bar_chart(rf_clf: RandomForestClassifier, x_names: list, clf_name: str, save_dir: str, n_bars: int, palette: Optional[str] = 'hot', save_file_no: Optional[int] = None) None [source]
Helper to create a bar chart displaying the top N gini or entropy feature importance scores.
See also
- Parameters
rf_clf (RandomForestClassifier) – sklearn RandomForestClassifier object.
x_names (List[str]) – Names of features.
clf_name (str) – Name of classifier.
save_dir (str) – Directory where to save output in csv file format.
n_bars (int) – Number of bars in the plot.
save_file_no (Optional[int]) – If integer, represents the count of the classifier within a grid search. If none, the classifier is not part of a grid search
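Extracting and ranking the importances reduces to a few lines of pandas; a sketch with toy data (plotting omitted):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Toy features/target standing in for SimBA data.
rng = np.random.RandomState(0)
x = pd.DataFrame(rng.rand(80, 5), columns=[f'feat_{i}' for i in range(5)])
clf = RandomForestClassifier(n_estimators=25, random_state=0).fit(x, rng.randint(0, 2, 80))

# Rank gini importances and keep the top n_bars for the bar chart.
n_bars = 3
importances = pd.Series(clf.feature_importances_, index=x.columns)
top = importances.sort_values(ascending=False).head(n_bars)
```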
- create_x_importance_log(rf_clf: RandomForestClassifier, x_names: List[str], clf_name: str, save_dir: str, save_file_no: Optional[int] = None) None [source]
Helper to save gini or entropy based feature importance scores.
Note
- Parameters
rf_clf (RandomForestClassifier) – sklearn RandomForestClassifier object.
x_names (List[str]) – Names of features.
clf_name (str) – Name of classifier
save_dir (str) – Directory where to save output in csv file format.
save_file_no (Optional[int]) – If integer, represents the count of the classifier within a grid search. If none, the classifier is not part of a grid search.
- static define_scaler(scaler_name: typing_extensions.Literal['MIN-MAX', 'STANDARD', 'QUANTILE']) Union[MinMaxScaler, StandardScaler, QuantileTransformer] [source]
Defines a sklearn scaler object. See
UMLOptions.SCALER_OPTIONS.value
for accepted scalers.
- Example
>>> TrainModelMixin.define_scaler(scaler_name='MIN-MAX')
- delete_other_annotation_columns(df: DataFrame, annotations_lst: List[str], raise_error: bool = True) DataFrame [source]
Helper to drop fields that contain annotations which are not the target.
- Parameters
df (pd.DataFrame) – Dataframe holding features and annotations.
annotations_lst (List[str]) – column fields to be removed from df
raise_error (bool) – If True, throw an error if an annotation column doesn’t exist. Else, skip. Default: True.
- Return pd.DataFrame
Dataframe without non-target annotation columns
- Examples
>>> self.delete_other_annotation_columns(df=df, annotations_lst=['Sniffing'])
- dviz_classification_visualization(x_train: ndarray, y_train: ndarray, clf_name: str, class_names: List[str], save_dir: str) None [source]
Helper to create visualization of example decision tree using dtreeviz.
- find_highly_correlated_fields(data: ndarray, field_names: List[str], threshold: float) List[str] [source]
Find highly correlated fields in a dataset.
Calculates the absolute correlation coefficients between columns in a given dataset and identifies pairs of columns that have a correlation coefficient greater than the specified threshold. For every pair of correlated features identified, the function returns the field name of one feature. These field names can later be dropped from the input data to reduce memory requirements and collinearity.
- Parameters
data (np.ndarray) – Two dimensional numpy array with features represented as columns and frames represented as rows.
threshold (float) – Threshold value for significant collinearity.
field_names (List[str]) – List mapping the column names in data to a field name. Use types.ListType(types.unicode_type) to take advantage of JIT compilation
- Return List[str]
Unique field names that correlates with at least one other field above the threshold value.
- Example
>>> data = np.random.randint(0, 1000, (1000, 5000)).astype(np.float32)
>>> field_names = []
>>> for i in range(data.shape[1]): field_names.append(f'Feature_{i+1}')
>>> highly_correlated_fields = TrainModelMixin().find_highly_correlated_fields(data=data, field_names=typed.List(field_names), threshold=0.10)
- static find_low_variance_fields(data: DataFrame, variance_threshold: float) List[str] [source]
Finds fields with variance below provided threshold.
- Parameters
data (pd.DataFrame) – Dataframe with continuous numerical features.
variance_threshold (float) – Variance threshold (0.0-1.0).
- Return List[str]
Names of fields with variance below the threshold.
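A pandas sketch of the idea (the real method may instead use sklearn's VarianceThreshold; this is an assumption):

```python
import pandas as pd

def low_variance_fields(data: pd.DataFrame, variance_threshold: float) -> list:
    # Return names of columns whose variance falls below the threshold.
    variances = data.var()
    return list(variances[variances < variance_threshold].index)

df = pd.DataFrame({'constant': [1.0, 1.0, 1.0, 1.0], 'varying': [0.0, 5.0, 10.0, 15.0]})
```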
- get_all_clf_names(config: ConfigParser, target_cnt: int) List[str] [source]
Helper to get all classifier names in a SimBA project.
- Parameters
config (configparser.ConfigParser) – Parsed SimBA project_config.ini
target_cnt (int) – Number of classifiers in the SimBA project.
- Return List[str]
All classifier names in project
- Example
>>> self.get_all_clf_names(config=config, target_cnt=2)
>>> ['Attack', 'Sniffing']
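The lookup can be sketched with configparser; the section and option names below follow the SimBA project_config.ini convention and are assumptions:

```python
from configparser import ConfigParser

# Minimal stand-in for a SimBA project_config.ini.
config = ConfigParser()
config.read_string("""
[SML settings]
target_name_1 = Attack
target_name_2 = Sniffing
""")

target_cnt = 2
clf_names = [config.get('SML settings', f'target_name_{i + 1}').strip() for i in range(target_cnt)]
```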
- get_model_info(config: ConfigParser, model_cnt: int) Dict[int, Any] [source]
Helper to read in N SimBA random forest config meta files to python dict memory.
- Parameters
config (configparser.ConfigParser) – Parsed SimBA project_config.ini
model_cnt (int) – Count of models
- Return dict
Dictionary with integers as keys and hyperparameter dictionaries as values.
- insert_column_headers_for_outlier_correction(data_df: DataFrame, new_headers: List[str], filepath: Union[str, PathLike]) DataFrame [source]
Helper to insert new column headers onto a dataframe following outlier correction.
- partial_dependence_calculator(clf: RandomForestClassifier, x_df: DataFrame, clf_name: str, save_dir: Union[str, PathLike], clf_cnt: Optional[int] = None) None [source]
Compute feature partial dependencies for every feature in training set.
- Parameters
clf (RandomForestClassifier) – Random forest classifier
x_df (pd.DataFrame) – Features training set
clf_name (str) – Name of classifier
save_dir (str) – Directory where to save the data
clf_cnt (Optional[int]) – If integer, represents the count of the classifier within a grid search. If none, the classifier is not part of a grid search.
- print_machine_model_information(model_dict: dict) None [source]
Helper to print model information in tabular form.
- Parameters
model_dict (dict) – dictionary holding model meta data in SimBA meta-config format.
- random_multiclass_bout_sampler(x_df: DataFrame, y_df: DataFrame, target_field: str, target_var: int, sampling_ratio: Union[float, Dict[int, float]], raise_error: bool = False) DataFrame [source]
Randomly sample multiclass behavioral bouts.
This function performs random sampling on a multiclass dataset to balance the class distribution. From each class, the function selects a number of bouts computed as a ratio of the bout count in a user-specified baseline class. All bout observations in the user-specified baseline class are selected.
- Parameters
x_df (pd.DataFrame) – A dataframe holding features.
y_df (pd.DataFrame) – A dataframe holding target.
target_field (str) – The name of the target column.
target_var (int) – The variable in the target that should serve as baseline. E.g.,
0
if
0
represents no behavior.
sampling_ratio (Union[float, dict]) – The ratio of non-target_var bout observations to sample relative to the target_var bout count. E.g., if float
1.0
and there are 10 bouts of target_var observations in the dataset, then 10 bouts of each non-target_var observation will be sampled. If different under-sampling ratios for different class variables are needed, use a dict with the class variable name as key and the ratio relative to target_var as the value.
raise_error (bool) – If True, raises an error if there are not enough observations of the non-target_var fulfilling the sampling_ratio. Else, takes all available observations even if not enough to reach the criterion.
- Raises
SamplingError – If any of the following conditions are met:
- No bouts of the target class are detected in the data.
- The target variable is present in the sampling ratio dictionary.
- The sampling ratio dictionary contains non-integer keys or non-float values less than 0.0.
- The variable specified in the sampling ratio is not present in the DataFrame.
- The sampling ratio results in a sample size of zero or less.
- The requested sample size exceeds the available data and raise_error is True.
- Return (pd.DataFrame, pd.DataFrame)
resampled features, and resampled associated target.
- Examples
>>> df = pd.read_csv('/Users/simon/Desktop/envs/troubleshooting/multilabel/project_folder/csv/targets_inserted/01.YC015YC016phase45-sample_sampler.csv', index_col=0)
>>> undersampled_df = TrainModelMixin().random_multiclass_bout_sampler(data=df, target_field='syllable_class', target_var=0, sampling_ratio={1: 1.0, 2: 1, 3: 1}, raise_error=True)
- random_multiclass_frm_sampler(x_df: DataFrame, y_df: DataFrame, target_field: str, target_var: int, sampling_ratio: Union[float, Dict[int, float]], raise_error: bool = False)[source]
Random multiclass undersampler.
This function performs random under-sampling on a multiclass dataset to balance the class distribution. From each class, the function selects a number of frames computed as a ratio relative to a user-specified class variable.
All the observations in the user-specified class are selected.
- Parameters
x_df (pd.DataFrame) – A dataframe holding features.
y_df (pd.DataFrame) – A dataframe holding target.
target_field (str) – The name of the target column.
target_var (int) – The variable in the target that should serve as baseline. E.g.,
0
if
0
represents no behavior.
sampling_ratio (Union[float, dict]) – The ratio of non-target_var observations to sample relative to the target_var count. E.g., if float
1.0
and there are 10 target_var observations in the dataset, then 10 of each non-target_var observation will be sampled. If different under-sampling ratios for different class variables are needed, use a dict with the class variable name as key and the ratio relative to target_var as the value.
raise_error (bool) – If True, raises an error if there are not enough observations of the non-target_var fulfilling the sampling_ratio. Else, takes all available observations even if not enough to reach the criterion.
- Return (pd.DataFrame, pd.DataFrame)
resampled features, and resampled associated target.
- Examples
>>> df = pd.read_csv('/Users/simon/Desktop/envs/troubleshooting/multilabel/project_folder/csv/targets_inserted/01.YC015YC016phase45-sample_sampler.csv', index_col=0)
>>> TrainModelMixin().random_multiclass_frm_sampler(data_df=df, target_field='syllable_class', target_var=0, sampling_ratio=0.20)
>>> TrainModelMixin().random_multiclass_frm_sampler(data_df=df, target_field='syllable_class', target_var=0, sampling_ratio={1: 0.1, 2: 0.2, 3: 0.3})
- random_undersampler(x_train: ndarray, y_train: ndarray, sample_ratio: float) Tuple[DataFrame, DataFrame] [source]
Helper to perform random under-sampling of behavior-absent frames in a dataframe.
- Parameters
x_train (np.ndarray) – Features in train set
y_train (np.ndarray) – Target in train set
sample_ratio (float) – Ratio of behavior-absent frames to keep relative to the behavior-present frames. E.g.,
1.0
returns an equal count of behavior-absent and behavior-present frames.
2.0
returns twice as many behavior-absent frames as behavior-present frames.
- Return pd.DataFrame
Under-sampled feature-set
- Return pd.DataFrame
Under-sampled target-set
- Examples
>>> self.random_undersampler(x_train=x_train, y_train=y_train, sample_ratio=1.0)
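The sampling logic can be sketched in pandas (the function name is illustrative; the real helper may differ in shuffling and tie-breaking):

```python
import numpy as np
import pandas as pd

def random_undersample(x_train: pd.DataFrame, y_train: pd.Series, sample_ratio: float):
    # Keep every behavior-present frame; sample behavior-absent frames at
    # sample_ratio times the behavior-present count.
    present_idx = y_train[y_train == 1].index
    absent_idx = y_train[y_train == 0].index
    n_absent = int(len(present_idx) * sample_ratio)
    sampled = np.random.RandomState(0).choice(absent_idx, size=n_absent, replace=False)
    keep = present_idx.union(pd.Index(sampled))
    return x_train.loc[keep], y_train.loc[keep]

x = pd.DataFrame({'f': range(10)})
y = pd.Series([1, 1, 0, 0, 0, 0, 0, 0, 0, 0])
x_bal, y_bal = random_undersample(x, y, sample_ratio=1.0)
```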
- read_all_files_in_folder(file_paths: List[str], file_type: str, classifier_names: Optional[List[str]] = None, raise_bool_clf_error: bool = True) Tuple[DataFrame, List[int]] [source]
Read in all data files in a folder to a single pd.DataFrame for downstream ML algo. Asserts that all classifiers have annotation fields present in concatenated dataframe.
Note
For improved runtime through pyarrow, use
simba.mixins.train_model_mixin.read_all_files_in_folder_mp()
- Parameters
file_paths (List[str]) – List of file paths representing files to be read in.
file_type (str) – The file type of the files to be read in, e.g., csv or parquet.
classifier_names (str or None) – List of classifier names representing fields of human annotations. If not None, then assert that classifier names are present in each data file.
- Return pd.DataFrame
Concatenated dataframe of all data represented in
file_paths
.- Return List[int]
The frame numbers (index) of the sampled data.
- Examples
>>> self.read_all_files_in_folder(file_paths=['targets_inserted/Video_1.csv', 'targets_inserted/Video_2.csv'], file_type='csv', classifier_names=['Attack'])
- static read_all_files_in_folder_mp(file_paths: List[str], file_type: Literal['csv', 'parquet', 'pickle'], classifier_names: Optional[List[str]] = None, raise_bool_clf_error: bool = True) Tuple[DataFrame, List[int]] [source]
Multiprocessing helper function to read in all data files in a folder to a single pd.DataFrame for downstream ML. Defaults to ceil(CPU COUNT / 2) cores. Asserts that all classifiers have annotation fields present in each dataframe.
Note
If multiprocess failure, reverts to
simba.mixins.train_model_mixin.read_all_files_in_folder()
- Parameters
- Return pd.DataFrame
Concatenated dataframe of all data in
file_paths
- Return List[int]
List of frame indexes of all concatenated files.
- read_all_files_in_folder_mp_futures(annotations_file_paths: List[str], file_type: Literal['csv', 'parquet', 'pickle'], classifier_names: Optional[List[str]] = None, raise_bool_clf_error: bool = True) Tuple[DataFrame, List[int]] [source]
Multiprocessing helper function to read in all data files in a folder to a single pd.DataFrame for downstream ML through
concurrent.Futures
. Asserts that all classifiers have annotation fields present in each dataframe.Note
A
concurrent.Futures
alternative to
simba.mixins.train_model_mixin.read_all_files_in_folder_mp()
, which uses
multiprocessing.ProcessPoolExecutor
and has been reported unstable on Linux machines.
If multiprocess failure, reverts to
simba.mixins.train_model_mixin.read_all_files_in_folder()
- Parameters
file_paths (List[str]) – List of file-paths
file_type – The filetype of
file_paths
. OPTIONS: csv or parquet.
classifier_names (Optional[List[str]]) – List of classifier names representing fields of human annotations. If not None, then assert that classifier names are present in each data file.
raise_bool_clf_error (bool) – If True, raises an error if a classifier column contains values outside 0 and 1.
- Return pd.DataFrame
Concatenated dataframe of all data in
file_paths
.
- read_in_all_model_names_to_remove(config: ConfigParser, model_cnt: int, clf_name: str) List[str] [source]
Helper to find all field names that are annotations but are not the target.
- Parameters
config (configparser.ConfigParser) – Configparser object holding data from the project_config.ini
model_cnt (int) – Number of classifiers in the SimBA project
clf_name (str) – Name of the classifier.
- Return List[str]
List of non-target annotation column names.
- Examples
>>> self.read_in_all_model_names_to_remove(config=config, model_cnt=2, clf_name=['Attack'])
- read_pickle(file_path: Union[str, PathLike]) object [source]
Read pickle file
- Parameters
file_path (str) – Path to pickle file on disk.
- Return object
The unpickled object.
- save_rf_model(rf_clf: RandomForestClassifier, clf_name: str, save_dir: Union[str, PathLike], save_file_no: Optional[int] = None) None [source]
Helper to save pickled classifier object to disk.
- Parameters
rf_clf (RandomForestClassifier) – sklearn random forest classifier
clf_name (str) – Classifier name
save_dir (str) – Directory where to save output in csv file format.
save_file_no (Optional[int]) – If integer, represents the count of the classifier within a grid search. If none, the classifier is not part of a grid search.
- static scaler_transform(data: DataFrame, scaler: Union[MinMaxScaler, StandardScaler, QuantileTransformer], name: Optional[str] = '') DataFrame [source]
Helper to transform a dataframe using a previously fitted scaler.
- Parameters
data (pd.DataFrame) – Data to transform.
scaler – fitted scaler.
- smote_oversampler(x_train: DataFrame, y_train: DataFrame, sample_ratio: float) Tuple[ndarray, ndarray] [source]
Helper to perform SMOTE oversampling of behavior-present annotations.
- Parameters
x_train (np.ndarray) – Features in train set
y_train (np.ndarray) – Target in train set
sample_ratio (float) – Over-sampling ratio
- Return np.ndarray
Oversampled features.
- Return np.ndarray
Oversampled target.
- Examples
>>> self.smote_oversampler(x_train=x_train, y_train=y_train, sample_ratio=1.0)
- smoteen_oversampler(x_train: ~pandas.core.frame.DataFrame, y_train: ~pandas.core.frame.DataFrame, sample_ratio: float) -> (<class 'numpy.ndarray'>, <class 'numpy.ndarray'>)[source]
Helper to perform SMOTE-ENN (SMOTE followed by Edited Nearest Neighbours cleaning) resampling of behavior-present annotations.
- Parameters
x_train (np.ndarray) – Features in train set
y_train (np.ndarray) – Target in train set
sample_ratio (float) – Over-sampling ratio
- Return np.ndarray
Oversampled features.
- Return np.ndarray
Oversampled target.
- Examples
>>> self.smoteen_oversampler(x_train=x_train, y_train=y_train, sample_ratio=1.0)
- static split_and_group_df(df: ~pandas.core.frame.DataFrame, splits: int, include_split_order: bool = True) -> (typing.List[pandas.core.frame.DataFrame], <class 'int'>)[source]
Helper to split a dataframe for multiprocessing. If include_split_order, then include the group number in split data as a column. Returns split data and approximations of number of observations per split.
- split_df_to_x_y(df: ~pandas.core.frame.DataFrame, clf_name: str) -> (<class 'pandas.core.frame.DataFrame'>, <class 'pandas.core.frame.DataFrame'>)[source]
Helper to split dataframe into features and target.
- Parameters
df (pd.DataFrame) – Dataframe holding features and annotations.
clf_name (str) – Name of target.
- Return pd.DataFrame
features
- Return pd.DataFrame
target
- Examples
>>> self.split_df_to_x_y(df=df, clf_name='Attack')
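The split itself is a one-liner in pandas. A minimal sketch of the feature/target separation described above (the actual SimBA method also validates the target column; this is an illustrative re-implementation, not SimBA's code):

```python
import pandas as pd

def split_df_to_x_y(df: pd.DataFrame, clf_name: str):
    # Target column becomes y; all remaining columns form the feature matrix x.
    y = df[clf_name]
    x = df.drop(columns=[clf_name])
    return x, y

df = pd.DataFrame({'feature_1': [0.1, 0.2], 'feature_2': [0.3, 0.4], 'Attack': [0, 1]})
x, y = split_df_to_x_y(df=df, clf_name='Attack')
```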
Time-series feature methods
- class simba.mixins.timeseries_features_mixin.TimeseriesFeatureMixin[source]
Bases:
object
Time-series methods focused on signal complexity in sliding windows, mainly in the time domain: FFT-based methods (e.g., through scipy) have so far not been fast enough for rolling windows over large datasets.
Note
Many methods have numba typed signatures to decrease compilation time through reduced type inference. Make sure to pass the correct dtypes as indicated by the signature decorators.
Important
See references for mature packages computing more extensive time-series measurements.
- static acceleration(data: ndarray, pixels_per_mm: float, fps: int, time_window: float = 1, unit: typing_extensions.Literal['mm', 'cm', 'dm', 'm'] = 'mm') ndarray [source]
Compute acceleration.
Computes acceleration from a sequence of body-part coordinates over time. It calculates the difference in velocity between consecutive frames and provides an array of accelerations.
The computation is based on the formula:
a_t = v_t - v_{t-w}, with v_t = ||x_t - x_{t-w}|| / c
where ||·|| calculates the Euclidean norm, the shift by w = fps × time_window moves the array by w frames, and c = pixels_per_mm is the conversion factor from pixels to millimeters.
Note
By default, acceleration is calculated as the change in velocity in millimeters/s. To change the denominator, modify the time_window argument. To change the numerator, modify the unit argument (accepted: mm, cm, dm, m).
- Parameters
data (np.ndarray) – 1D array of framewise euclidean distances.
pixels_per_mm (float) – Pixels per millimeter of the recorded video.
fps (int) – Frames per second (FPS) of the recorded video.
time_window (float) – Rolling time window in seconds. Default is 1.0 representing 1 second.
unit (Literal['mm', 'cm', 'dm', 'm']) – If acceleration should be presented as millimeter, centimeters, decimeter, or meter. Default millimeters.
- Returns
Array of accelerations corresponding to each frame.
- Example
>>> data = np.array([1, 2, 3, 4, 5, 5, 5, 5, 5, 6]).astype(np.float32)
>>> TimeseriesFeatureMixin().acceleration(data=data, pixels_per_mm=1.0, fps=2, time_window=1.0)
>>> [ 0., 0., 0., 0., -1., -1., 0., 0., 1., 1.]
- static benford_correlation(data: ndarray) float [source]
Jitted compute of the correlation between the Benford’s Law distribution and the first-digit distribution of given data.
Benford’s Law describes the expected distribution of leading (first) digits in many real-life datasets. This function calculates the correlation between the expected Benford’s Law distribution and the actual distribution of the first digits in the provided data.
Note
Adapted from tsfresh.
The returned correlation values are calculated using Pearson’s correlation coefficient.
- Parameters
data (np.ndarray) – The input 1D array containing the time series data.
- Return float
The correlation coefficient between the Benford’s Law distribution and the first-digit distribution in the input data. A higher correlation value suggests that the data follows the expected distribution more closely.
- Examples
>>> data = np.array([1, 8, 2, 10, 8, 6, 8, 1, 1, 1]).astype(np.float32)
>>> TimeseriesFeatureMixin().benford_correlation(data=data)
>>> 0.6797500374831786
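The computation can be sketched in plain NumPy. This is an illustrative re-implementation (assuming positive values ≥ 1), not SimBA's jitted code, but it reproduces the example above:

```python
import numpy as np

def benford_correlation(data: np.ndarray) -> float:
    # Expected Benford's Law probabilities for leading digits 1..9.
    expected = np.log10(1.0 + 1.0 / np.arange(1, 10))
    # Observed leading-digit frequencies (assumes values >= 1).
    first_digits = np.array([int(str(int(abs(v)))[0]) for v in data])
    observed = np.array([np.mean(first_digits == d) for d in range(1, 10)])
    # Pearson correlation between expected and observed distributions.
    return float(np.corrcoef(expected, observed)[0, 1])

data = np.array([1, 8, 2, 10, 8, 6, 8, 1, 1, 1], dtype=np.float64)
result = benford_correlation(data)  # ~0.6798, matching the example above
```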
- static crossings(data: ndarray, val: float) int [source]
Jitted compute of the number of times that sequential values in a time-series cross a defined value.
- Parameters
data (np.ndarray) – Time-series data.
val (float) – Cross value. E.g., to count the number of zero-crossings, pass 0.
- Return int
Count of events where sequential values cross val.
- Example
>>> data = np.array([3.9, 7.5, 4.2, 6.2, 7.5, 3.9, 6.2, 6.5, 7.2, 9.5]).astype(np.float32)
>>> TimeseriesFeatureMixin().crossings(data=data, val=7)
>>> 5
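In plain NumPy, the crossing count reduces to a sign-change test on data - val. A minimal sketch (assuming no sample equals val exactly; SimBA's jitted version may handle that edge differently):

```python
import numpy as np

def crossings(data: np.ndarray, val: float) -> int:
    # A crossing occurs where consecutive samples lie on opposite sides of `val`.
    sides = np.sign(data - val)
    return int(np.sum(sides[:-1] * sides[1:] < 0))

data = np.array([3.9, 7.5, 4.2, 6.2, 7.5, 3.9, 6.2, 6.5, 7.2, 9.5], dtype=np.float64)
result = crossings(data=data, val=7.0)  # 5, matching the example above
```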
- static dominant_frequencies(data: ndarray, fps: float, k: int, window_function: Optional[typing_extensions.Literal['Hann', 'Hamming', 'Blackman']] = None)[source]
Find the K dominant frequencies within a feature vector.
- static granger_tests(data: DataFrame, variables: List[str], lag: int, test: typing_extensions.Literal['ssr_ftest', 'ssr_chi2test', 'lrtest', 'params_ftest'] = 'ssr_chi2test') DataFrame [source]
Perform Granger causality tests between pairs of variables in a DataFrame.
This function computes Granger causality tests between pairs of variables in a DataFrame using the statsmodels library. The Granger causality test assesses whether one time series variable (predictor) can predict another time series variable (outcome). This test can help determine the presence of causal relationships between variables.
Note
Modified from Selva Prabhakaran.
- Example
>>> x = np.random.randint(0, 50, (100, 2))
>>> data = pd.DataFrame(x, columns=['r', 'k'])
>>> TimeseriesFeatureMixin.granger_tests(data=data, variables=['r', 'k'], lag=4, test='ssr_chi2test')
>>>         r       k
>>> r  1.0000  0.4312
>>> k  0.3102  1.0000
- static higuchi_fractal_dimension(data: ndarray, kmax: int = 10)[source]
Jitted compute of the Higuchi Fractal Dimension of a given time series data. The Higuchi Fractal Dimension provides a measure of the fractal complexity of a time series.
- Parameters
data (np.ndarray) – A 1-dimensional numpy array containing the time series data.
kmax (int) – The maximum value of k used in the calculation. Increasing kmax considers longer sequences of data, providing a more detailed analysis of fractal complexity. Default is 10.
- Returns float
The Higuchi Fractal Dimension of the input time series.
Note
Adapted from eeglib.
- Example
>>> t = np.linspace(0, 50, int(44100 * 2.0), endpoint=False)
>>> sine_wave = 1.0 * np.sin(2 * np.pi * 1.0 * t).astype(np.float32)
>>> sine_wave = (sine_wave - np.min(sine_wave)) / (np.max(sine_wave) - np.min(sine_wave))
>>> TimeseriesFeatureMixin().higuchi_fractal_dimension(data=sine_wave, kmax=10)
>>> 1.0001506805419922
>>> np.random.shuffle(sine_wave)
>>> TimeseriesFeatureMixin().higuchi_fractal_dimension(data=sine_wave, kmax=10)
>>> 1.9996402263641357
- static hjort_parameters(data: ~numpy.ndarray) -> (<class 'float'>, <class 'float'>, <class 'float'>)[source]
Jitted compute of Hjorth parameters for a given time series data. Hjorth parameters describe mobility, complexity, and activity of a time series.
- Parameters
data (numpy.ndarray) – A 1-dimensional numpy array containing the time series data.
- Returns
A tuple containing the following Hjorth parameters:
- activity (float): The activity of the time series, which is the variance of the input data.
- mobility (float): The mobility of the time series, calculated as the square root of the variance of the first derivative of the input data divided by the variance of the input data.
- complexity (float): The complexity of the time series, calculated as the square root of the variance of the second derivative of the input data divided by the variance of the first derivative, and then divided by the mobility.
- Example
>>> data = np.array([1.0, 2.0, 3.0, 4.0, 5.0], dtype=np.float32)
>>> TimeseriesFeatureMixin().hjort_parameters(data)
>>> (2.5, 0.5, 0.4082482904638631)
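The three definitions above can be sketched with population variance. This is a textbook sketch, not SimBA's jitted implementation, which may differ in normalization details:

```python
import numpy as np

def hjorth_parameters(data: np.ndarray):
    d1 = np.diff(data)            # first derivative
    d2 = np.diff(d1)              # second derivative
    activity = np.var(data)       # variance of the signal
    mobility = np.sqrt(np.var(d1) / activity)
    complexity = np.sqrt(np.var(d2) / np.var(d1)) / mobility
    return activity, mobility, complexity

activity, mobility, complexity = hjorth_parameters(np.array([1.0, 3.0, 2.0, 4.0]))
```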
- static line_length(data: ndarray) float [source]
Calculate the line length of a 1D array.
Line length is a measure of signal complexity and is computed by summing the absolute differences between consecutive elements of the input array. Used in EEG analysis and other signal processing applications to quantify variations in the signal.
- Parameters
data (numpy.ndarray) – The 1D array for which the line length is to be calculated.
- Return float
The line length of the input array, indicating its complexity.
LL = Σ_{i=1}^{N-1} |x[i] - x[i-1]|
where:
- LL is the line length.
- N is the number of elements in the input data array.
- x[i] represents the value of the data at index i.
- Example
>>> data = np.array([1, 4, 2, 3, 5, 6, 8, 7, 9, 10]).astype(np.float32)
>>> TimeseriesFeatureMixin().line_length(data=data)
>>> 12.0
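The textbook line-length formula translates directly to NumPy. A minimal sketch (SimBA's jitted version may differ in edge handling):

```python
import numpy as np

def line_length(data: np.ndarray) -> float:
    # Sum of absolute differences between consecutive samples.
    return float(np.sum(np.abs(np.diff(data))))

result = line_length(np.array([1.0, 4.0, 2.0, 3.0]))  # |3| + |-2| + |1| = 6.0
```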
- static local_maxima_minima(data: ndarray, maxima: bool = True) ndarray [source]
Jitted compute of the local maxima or minima, defined as values which are higher or lower than their immediately preceding and following time-series neighbors, respectively. Returns a 2D np.ndarray with columns representing idx and values of the local maxima or minima.
- Parameters
data (np.ndarray) – Time-series data.
maxima (bool) – If True, returns maxima. Else, minima.
- Return np.ndarray
2D np.ndarray with columns representing idx in input data in first column and values of local maxima in second column
- Example
>>> data = np.array([3.9, 7.5, 4.2, 6.2, 7.5, 3.9, 6.2, 6.5, 7.2, 9.5]).astype(np.float32)
>>> TimeseriesFeatureMixin().local_maxima_minima(data=data, maxima=True)
>>> [[1, 7.5], [4, 7.5], [9, 9.5]]
>>> TimeseriesFeatureMixin().local_maxima_minima(data=data, maxima=False)
>>> [[0, 3.9], [2, 4.2], [5, 3.9]]
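A vectorized sketch of the same logic, which, like the example above, compares endpoints only against their single existing neighbor (an illustrative re-implementation, not SimBA's jitted code):

```python
import numpy as np

def local_maxima_minima(data: np.ndarray, maxima: bool = True) -> np.ndarray:
    # Pad with -inf (maxima) or inf (minima) so endpoints are compared
    # only against their single existing neighbor.
    pad = -np.inf if maxima else np.inf
    padded = np.concatenate(([pad], data, [pad]))
    if maxima:
        hits = (data > padded[:-2]) & (data > padded[2:])
    else:
        hits = (data < padded[:-2]) & (data < padded[2:])
    idx = np.flatnonzero(hits)
    # First column: index in input data; second column: value at that index.
    return np.column_stack((idx, data[idx]))

data = np.array([3.9, 7.5, 4.2, 6.2, 7.5, 3.9, 6.2, 6.5, 7.2, 9.5], dtype=np.float64)
peaks = local_maxima_minima(data=data, maxima=True)     # rows at idx 1, 4, 9
valleys = local_maxima_minima(data=data, maxima=False)  # rows at idx 0, 2, 5
```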
- static longest_strike(data: ndarray, threshold: float, above: bool = True) int [source]
Jitted compute of the length of the longest consecutive sequence of values in the input data that either exceed or fall below a specified threshold.
- Parameters
data (np.ndarray) – The input 1D NumPy array containing the values to be analyzed.
threshold (float) – The threshold value used for the comparison.
above (bool) – If True, the function looks for strikes where values are above or equal to the threshold. If False, it looks for strikes where values are below or equal to the threshold.
- Return int
The length of the longest strike that satisfies the condition.
- Example
>>> data = np.array([1, 8, 2, 10, 8, 6, 8, 1, 1, 1]).astype(np.float32)
>>> TimeseriesFeatureMixin().longest_strike(data=data, threshold=7, above=True)
>>> 2
>>> TimeseriesFeatureMixin().longest_strike(data=data, threshold=7, above=False)
>>> 3
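The run-length logic can be sketched as a single pass over a boolean mask (an illustrative re-implementation; SimBA's version is numba-jitted but equivalent in spirit):

```python
import numpy as np

def longest_strike(data: np.ndarray, threshold: float, above: bool = True) -> int:
    # Length of the longest run of consecutive values satisfying the condition.
    mask = data >= threshold if above else data <= threshold
    longest = current = 0
    for hit in mask:
        current = current + 1 if hit else 0
        longest = max(longest, current)
    return longest

data = np.array([1, 8, 2, 10, 8, 6, 8, 1, 1, 1], dtype=np.float64)
above_strike = longest_strike(data=data, threshold=7, above=True)   # 2
below_strike = longest_strike(data=data, threshold=7, above=False)  # 3
```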
- static percent_beyond_n_std(data: ndarray, n: float) float [source]
Jitted compute of the ratio of values in time-series more than N standard deviations from the mean of the time-series.
- Parameters
data (np.ndarray) – 1D array representing time-series.
n (float) – Standard deviation cut-off.
- Returns float
Ratio of values in data that fall more than n standard deviations from the mean of data.
Note
Adapted from cesium.
Oddity: the mean calculation is incorrect if passing float32 data but correct if passing float64.
- Examples
>>> data = np.array([3.9, 7.5, 4.2, 6.2, 7.5, 3.9, 6.2, 6.5, 7.2, 9.5]).astype(np.float32)
>>> TimeseriesFeatureMixin().percent_beyond_n_std(data=data, n=1)
>>> 0.1
- static percent_in_percentile_window(data: ndarray, upper_pct: int, lower_pct: int)[source]
Jitted compute of the ratio of values in a time-series that fall between the upper and lower percentiles.
- Parameters
data (np.ndarray) – 1D array representing a time-series.
upper_pct (int) – The upper percentile value for the window (e.g., 70 for the 70th percentile).
lower_pct (int) – The lower percentile value for the window (e.g., 30 for the 30th percentile).
- Returns float
Ratio of values in data that fall within the upper_pct and lower_pct percentiles.
Note
Adapted from cesium.
- Example
>>> data = np.array([3.9, 7.5, 4.2, 6.2, 7.5, 3.9, 6.2, 6.5, 7.2, 9.5]).astype(np.float32)
>>> TimeseriesFeatureMixin().percent_in_percentile_window(data, upper_pct=70, lower_pct=30)
>>> 0.4
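With np.percentile the computation is a two-liner. An illustrative sketch (not SimBA's jitted code) that reproduces the example above:

```python
import numpy as np

def percent_in_percentile_window(data: np.ndarray, upper_pct: int, lower_pct: int) -> float:
    # Fraction of samples falling inside the [lower_pct, upper_pct] percentile band.
    upper, lower = np.percentile(data, [upper_pct, lower_pct])
    return float(np.mean((data >= lower) & (data <= upper)))

data = np.array([3.9, 7.5, 4.2, 6.2, 7.5, 3.9, 6.2, 6.5, 7.2, 9.5], dtype=np.float64)
result = percent_in_percentile_window(data, upper_pct=70, lower_pct=30)  # 0.4
```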
- static percentile_difference(data: ndarray, upper_pct: int, lower_pct: int) float [source]
Jitted compute of the difference between the upper and lower percentiles of the data, as a percentage of the median value. Helps in understanding the spread or variability of the data within the specified percentiles.
Note
Adapted from cesium.
- Parameters
data (np.ndarray) – 1D array representing a time-series.
upper_pct (int) – The upper percentile value (e.g., 95 for the 95th percentile).
lower_pct (int) – The lower percentile value (e.g., 5 for the 5th percentile).
- Returns float
The difference between the upper and lower percentiles of the data as a percentage of the median value.
- Examples
>>> data = np.array([3.9, 7.5, 4.2, 6.2, 7.5, 3.9, 6.2, 6.5, 7.2, 9.5]).astype(np.float32)
>>> TimeseriesFeatureMixin().percentile_difference(data=data, upper_pct=95, lower_pct=5)
>>> 0.7401574764125177
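The measure can be sketched directly from its description (an illustrative float64 re-implementation, not SimBA's jitted code; it agrees with the example above to within float32 rounding):

```python
import numpy as np

def percentile_difference(data: np.ndarray, upper_pct: int, lower_pct: int) -> float:
    # Spread between the two percentiles, expressed relative to the median.
    upper, lower = np.percentile(data, [upper_pct, lower_pct])
    return float(np.abs(upper - lower) / np.median(data))

data = np.array([3.9, 7.5, 4.2, 6.2, 7.5, 3.9, 6.2, 6.5, 7.2, 9.5], dtype=np.float64)
result = percentile_difference(data=data, upper_pct=95, lower_pct=5)  # ~0.7402
```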
- static permutation_entropy(data: ndarray, dimension: int, delay: int) float [source]
Calculate the permutation entropy of a time series.
Permutation entropy is a measure of the complexity of a time series data by quantifying the irregularity and unpredictability of its order patterns. It is computed based on the frequency of unique order patterns of a given dimension in the time series data.
The permutation entropy (PE) is calculated using the following formula:
PE = -Σ_i p_i · ln(p_i)
where:
- PE is the permutation entropy.
- p_i is the probability of each unique order pattern.
- Parameters
data (np.ndarray) – The 1D array containing the time series data.
dimension (int) – The embedding dimension (length of the order patterns).
delay (int) – The time delay between samples within each order pattern.
- Return float
The permutation entropy of the time series, indicating its complexity and predictability. A higher permutation entropy value indicates higher complexity and unpredictability in the time series.
- Example
>>> t = np.linspace(0, 50, int(44100 * 2.0), endpoint=False)
>>> sine_wave = 1.0 * np.sin(2 * np.pi * 1.0 * t).astype(np.float32)
>>> TimeseriesFeatureMixin().permutation_entropy(data=sine_wave, dimension=3, delay=1)
>>> 0.701970058666407
>>> np.random.shuffle(sine_wave)
>>> TimeseriesFeatureMixin().permutation_entropy(data=sine_wave, dimension=3, delay=1)
>>> 1.79172449934604
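The pattern-counting recipe can be sketched as follows, using natural-log entropy (consistent with the shuffled-data example above, which approaches ln(3!) ≈ 1.792). An illustrative re-implementation, not SimBA's code:

```python
import numpy as np
from collections import Counter

def permutation_entropy(data: np.ndarray, dimension: int, delay: int) -> float:
    # Count ordinal (argsort) patterns over all embedded windows.
    n = len(data) - (dimension - 1) * delay
    patterns = Counter(
        tuple(np.argsort(data[i:i + dimension * delay:delay])) for i in range(n)
    )
    probs = np.array(list(patterns.values()), dtype=np.float64) / n
    # Shannon entropy (natural log) of the pattern distribution.
    return float(-np.sum(probs * np.log(probs)))

# A monotone series has a single order pattern, hence zero entropy.
monotone = permutation_entropy(np.arange(10, dtype=np.float64), dimension=3, delay=1)
```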
- static petrosian_fractal_dimension(data: ndarray) float [source]
Calculate the Petrosian Fractal Dimension (PFD) of a given time series. The PFD is a measure of the irregularity or self-similarity of a time series: larger values indicate higher complexity, lower values indicate lower complexity.
Note
The PFD is computed based on the number of sign changes in the first derivative of the time series. If the input data is empty or no sign changes are found, the PFD is returned as -1.0. Adapted from eeglib.
- Parameters
data (np.ndarray) – A 1-dimensional numpy array containing the time series data.
- Returns float
The Petrosian Fractal Dimension of the input time series.
- Examples
>>> t = np.linspace(0, 50, int(44100 * 2.0), endpoint=False)
>>> sine_wave = 1.0 * np.sin(2 * np.pi * 1.0 * t).astype(np.float32)
>>> TimeseriesFeatureMixin().petrosian_fractal_dimension(data=sine_wave)
>>> 1.0000398187022719
>>> np.random.shuffle(sine_wave)
>>> TimeseriesFeatureMixin().petrosian_fractal_dimension(data=sine_wave)
>>> 1.0211625348743218
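The sign-change recipe in the note corresponds to the standard Petrosian formula PFD = log10(N) / (log10(N) + log10(N / (N + 0.4 · NΔ))), where NΔ is the number of sign changes in the first derivative. A hypothetical re-implementation (not SimBA's jitted code):

```python
import numpy as np
from math import log10

def petrosian_fractal_dimension(data: np.ndarray) -> float:
    n = data.shape[0]
    diff = np.diff(data)
    # Count sign changes in the first derivative.
    sign_changes = int(np.sum(np.sign(diff[:-1]) * np.sign(diff[1:]) < 0))
    if n == 0 or sign_changes == 0:
        return -1.0
    return log10(n) / (log10(n) + log10(n / (n + 0.4 * sign_changes)))

result = petrosian_fractal_dimension(np.array([1.0, 2.0, 3.0, 2.0, 3.0, 2.0]))
```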
- static sliding_benford_correlation(data: ndarray, time_windows: ndarray, sample_rate: int) ndarray [source]
Calculate the sliding Benford’s Law correlation coefficient for a given dataset within specified time windows.
Benford’s Law is a statistical phenomenon where the leading digits of many datasets follow a specific distribution pattern. This function calculates the correlation between the observed distribution of leading digits in a dataset and the ideal Benford’s Law distribution.
Note
Adapted from tsfresh.
The returned correlation values are calculated using Pearson’s correlation coefficient.
The correlation coefficient is calculated between the observed leading digit distribution and the ideal Benford’s Law distribution.
- Parameters
data (np.ndarray) – The input 1D array containing the time series data.
time_windows (np.ndarray) – A 1D array containing the time windows (in seconds) for which the correlation will be calculated at different points in the dataset.
sample_rate (int) – The sample rate, indicating how many data points are collected per second.
- Return np.ndarray
2D array containing the correlation coefficient values for each time window, with time window lengths represented by different columns.
- Examples
>>> data = np.array([1, 8, 2, 10, 8, 6, 8, 1, 1, 1]).astype(np.float32)
>>> TimeseriesFeatureMixin.sliding_benford_correlation(data=data, time_windows=np.array([1.0]), sample_rate=2)
>>> [[ 0.][0.447][0.017][0.877][0.447][0.358][0.358][0.447][0.864][0.864]]
- static sliding_crossings(data: ndarray, val: float, time_windows: ndarray, fps: int) ndarray [source]
Compute the number of crossings over sliding windows in a data array.
Computes the number of times a value in the data array crosses a given threshold value within sliding windows of varying sizes. The number of crossings is computed for each window size and stored in the result array where columns represents time windows.
Note
For frames occurring before a complete time window, -1.0 is returned.
- Parameters
data (np.ndarray) – The input 1D data array.
val (float) – The threshold value for counting crossings.
time_windows (np.ndarray) – Array of window sizes (in seconds).
fps (int) – Frames per second (FPS) of the recorded video.
- Return np.ndarray
An array containing the number of crossings for each window size and data point. The shape of the result array is (data.shape[0], window_sizes.shape[0]).
- Example
>>> data = np.array([3.9, 7.5, 4.2, 6.2, 7.5, 3.9, 6.2, 6.5, 7.2, 9.5]).astype(np.float32)
>>> results = TimeseriesFeatureMixin().sliding_crossings(data=data, time_windows=np.array([1.0]), fps=2.0, val=7.0)
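A pure NumPy sketch of the sliding logic, including the -1.0 padding for frames before the first complete window (an illustrative re-implementation; the jitted SimBA version is much faster but follows the same semantics):

```python
import numpy as np

def sliding_crossings(data: np.ndarray, val: float, time_windows: np.ndarray, fps: float) -> np.ndarray:
    # -1.0 marks frames occurring before the first complete window.
    results = np.full((data.shape[0], time_windows.shape[0]), -1.0)
    for j, seconds in enumerate(time_windows):
        window = int(seconds * fps)
        for i in range(window - 1, data.shape[0]):
            segment = data[i - window + 1 : i + 1]
            sides = np.sign(segment - val)
            results[i, j] = np.sum(sides[:-1] * sides[1:] < 0)
    return results

data = np.array([3.9, 7.5, 4.2, 6.2, 7.5, 3.9, 6.2, 6.5, 7.2, 9.5], dtype=np.float64)
results = sliding_crossings(data=data, val=7.0, time_windows=np.array([1.0]), fps=2.0)
```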
- static sliding_descriptive_statistics(data: ndarray, window_sizes: ndarray, sample_rate: int, statistics: typing_extensions.Literal['var', 'max', 'min', 'std', 'median', 'mean', 'mad', 'sum', 'mac', 'rms', 'absenergy']) ndarray [source]
Jitted compute of descriptive statistics over sliding windows in 1D data array.
Computes various descriptive statistics (e.g., variance, maximum, minimum, standard deviation, median, mean, median absolute deviation) for sliding windows of varying sizes applied to the input data array.
- Parameters
data (np.ndarray) – 1D input data array.
window_sizes (np.ndarray) – Array of window sizes (in seconds).
sample_rate (int) – Sampling rate of the data in samples per second.
statistics (types.ListType(types.unicode_type)) – List of statistics to compute. Options: ‘var’, ‘max’, ‘min’, ‘std’, ‘median’, ‘mean’, ‘mad’, ‘sum’, ‘mac’, ‘rms’, ‘absenergy’.
- Return np.ndarray
Array containing the selected descriptive statistics for each window size, data point, and statistic type. The shape of the result array is (len(statistics), data.shape[0], window_sizes.shape[0]).
Note
The statistics parameter should be a list containing one or more of the following statistics:
‘var’ (variance), ‘max’ (maximum), ‘min’ (minimum), ‘std’ (standard deviation), ‘median’ (median), ‘mean’ (mean), ‘mad’ (median absolute deviation), ‘sum’ (sum), ‘mac’ (mean absolute change), ‘rms’ (root mean square), ‘absenergy’ (absolute energy).
If the statistics list is [‘var’, ‘max’, ‘mean’], the first-dimension order in the result array will be: [variance, maximum, mean].
- Example
>>> data = np.array([1, 4, 2, 3, 5, 6, 8, 7, 9, 10]).astype(np.float32)
>>> results = TimeseriesFeatureMixin().sliding_descriptive_statistics(data=data, window_sizes=np.array([1.0, 5.0]), sample_rate=2, statistics=typed.List(['var', 'max']))
- static sliding_displacement(x: ndarray, time_windows: ndarray, fps: float, px_per_mm: float) ndarray [source]
Calculate sliding Euclidean displacement of a body-part point over time windows.
- Parameters
x (np.ndarray) – 2D array of body-part coordinates, of shape (N, 2).
time_windows (np.ndarray) – Array of time windows (in seconds).
fps (float) – Frames per second (FPS) of the recorded video.
px_per_mm (float) – Pixels per millimeter of the recorded video.
- Return np.ndarray
1D array containing the calculated displacements.
- Example
>>> x = np.random.randint(0, 50, (100, 2)).astype(np.int32)
>>> TimeseriesFeatureMixin.sliding_displacement(x=x, time_windows=np.array([1.0]), fps=1.0, px_per_mm=1.0)
- static sliding_hjort_parameters(data: ndarray, window_sizes: ndarray, sample_rate: int) ndarray [source]
Jitted compute of Hjorth parameters, including mobility, complexity, and activity, for sliding windows of varying sizes applied to the input data array.
- Parameters
data (np.ndarray) – Input data array.
window_sizes (np.ndarray) – Array of window sizes (in seconds).
sample_rate (int) – Sampling rate of the data in samples per second.
- Return np.ndarray
An array containing Hjorth parameters for each window size and data point. The shape of the result array is (3, data.shape[0], window_sizes.shape[0]). The three parameters are stored in the first dimension (0 - mobility, 1 - complexity, 2 - activity), and the remaining dimensions correspond to data points and window sizes.
- static sliding_line_length(data: ndarray, window_sizes: ndarray, sample_rate: int) ndarray [source]
Jitted compute of sliding line length for a given time series using different window sizes.
The function computes line length for the input data using various window sizes. It returns a 2D array where each row corresponds to a position in the time series, and each column corresponds to a different window size. The line length is calculated for each window, and the results are returned as a 2D array of float32 values.
- Parameters
data (np.ndarray) – 1D array input data.
window_sizes (np.ndarray) – An array of window sizes (in seconds) to use for line length calculation.
sample_rate (int) – The sampling rate (samples per second) of the time series data.
- Return np.ndarray
A 2D array containing line length values for each window size at each position in the time series.
- Examples
>>> data = np.array([1, 4, 2, 3, 5, 6, 8, 7, 9, 10]).astype(np.float32)
>>> TimeseriesFeatureMixin().sliding_line_length(data=data, window_sizes=np.array([1.0]), sample_rate=2)
- static sliding_longest_strike(data: ndarray, threshold: float, time_windows: ndarray, sample_rate: int, above: bool) ndarray [source]
Jitted compute of the length of the longest strike of values within sliding time windows that satisfy a given condition.
Calculates the length of the longest consecutive sequence of values in a 1D NumPy array, where each sequence is determined by a sliding time window. The condition is specified by a threshold, and you can choose whether to look for values above or below the threshold.
- Parameters
data (np.ndarray) – The input 1D NumPy array containing the values to be analyzed.
threshold (float) – The threshold value used for the comparison.
time_windows (np.ndarray) – An array containing the time window sizes in seconds.
sample_rate (int) – The sample rate in samples per second.
above (bool) – If True, the function looks for strikes where values are above or equal to the threshold. If False, it looks for strikes where values are below or equal to the threshold.
- Return np.ndarray
A 2D NumPy array with dimensions (data.shape[0], time_windows.shape[0]). Each element in the array represents the length of the longest strike that satisfies the condition for the corresponding time window.
- Example
>>> data = np.array([1, 8, 2, 10, 8, 6, 8, 1, 1, 1]).astype(np.float32)
>>> TimeseriesFeatureMixin().sliding_longest_strike(data=data, threshold=7, above=True, time_windows=np.array([1.0]), sample_rate=2)
>>> [[-1.][ 1.][ 1.][ 1.][ 2.][ 1.][ 1.][ 1.][ 0.][ 0.]]
>>> TimeseriesFeatureMixin().sliding_longest_strike(data=data, threshold=7, above=False, time_windows=np.array([1.0]), sample_rate=2)
>>> [[-1.][ 1.][ 1.][ 1.][ 0.][ 1.][ 1.][ 1.][ 2.][ 2.]]
- static sliding_pct_in_top_n(x: ndarray, windows: ndarray, n: int, fps: float) ndarray [source]
Compute the percentage of elements in the top ‘n’ frequencies in sliding windows of the input array.
Note
To compute the percentage of elements in the top ‘n’ frequencies in the entire array, use simba.mixins.statistics_mixin.Statistics.pct_in_top_n.
- Parameters
x (np.ndarray) – 1D input data array.
windows (np.ndarray) – Array of window sizes (in seconds).
n (int) – Number of top frequencies to consider.
fps (float) – Frames per second (sample rate) of the data.
- Return np.ndarray
2D array of computed percentages of elements in the top ‘n’ frequencies for each sliding window.
- Example
>>> x = np.random.randint(0, 10, (100000,)) >>> results = TimeseriesFeatureMixin.sliding_pct_in_top_n(x=x, windows=np.array([1.0]), n=4, fps=10)
- static sliding_percent_beyond_n_std(data: ndarray, n: float, window_sizes: ndarray, sample_rate: int) ndarray [source]
Computes the percentage of data points that exceed ‘n’ standard deviations from the mean for each position in the time series using various window sizes. It returns a 2D array where each row corresponds to a position in the time series, and each column corresponds to a different window size. The results are given as a percentage of data points beyond the threshold.
- Parameters
data (np.ndarray) – The input time series data.
n (float) – The number of standard deviations to determine the threshold.
window_sizes (np.ndarray) – An array of window sizes (in seconds) to use for the sliding calculation.
sample_rate (int) – The sampling rate (samples per second) of the time series data.
- Return np.ndarray
A 2D array containing the percentage of data points beyond the specified ‘n’ standard deviations for each window size.
- static sliding_percent_in_percentile_window(data: ndarray, upper_pct: int, lower_pct: int, window_sizes: ndarray, sample_rate: int)[source]
Jitted compute of the percentage of data points falling within a percentile window in a sliding manner.
The function computes the percentage of data points within the specified percentile window for each position in the time series using various window sizes. It returns a 2D array where each row corresponds to a position in the time series, and each column corresponds to a different window size. The results are given as a percentage of data points within the percentile window.
- Parameters
data (np.ndarray) – The input time series data.
upper_pct (int) – The upper percentile value for the window (e.g., 95 for the 95th percentile).
lower_pct (int) – The lower percentile value for the window (e.g., 5 for the 5th percentile).
window_sizes (np.ndarray) – An array of window sizes (in seconds) to use for the sliding calculation.
sample_rate (int) – The sampling rate (samples per second) of the time series data.
- Return np.ndarray
A 2D array containing the percentage of data points within the percentile window for each window size.
- static sliding_percentile_difference(data: ndarray, upper_pct: int, lower_pct: int, window_sizes: ndarray, fps: int) ndarray [source]
Jitted compute of the difference between the upper and lower percentiles within a sliding window for each position in the time series using various window sizes. It returns a 2D array where each row corresponds to a position in the time series, and each column corresponds to a different window size. The results are calculated as the absolute difference between the upper and lower percentiles divided by the median of the window.
- Parameters
data (np.ndarray) – The input time series data.
upper_pct (int) – The upper percentile value for the window (e.g., 95 for the 95th percentile).
lower_pct (int) – The lower percentile value for the window (e.g., 5 for the 5th percentile).
window_sizes (np.ndarray) – An array of window sizes (in seconds) to use for the sliding calculation.
fps (int) – The sampling rate (samples per second) of the time series data.
- Return np.ndarray
A 2D array containing the difference between upper and lower percentiles for each window size.
- static sliding_petrosian_fractal_dimension(data: ndarray, window_sizes: ndarray, sample_rate: int) ndarray [source]
Jitted compute of Petrosian Fractal Dimension over sliding windows in a data array.
This method computes the Petrosian Fractal Dimension for sliding windows of varying sizes applied to the input data array. The Petrosian Fractal Dimension is a measure of signal complexity.
- Parameters
data (np.ndarray) – Input data array.
window_sizes (np.ndarray) – Array of window sizes (in seconds).
sample_rate (int) – Sampling rate of the data in samples per second.
- Return np.ndarray
An array containing Petrosian Fractal Dimension values for each window size and data point. The shape of the result array is (data.shape[0], window_sizes.shape[0]).
- static sliding_stationary(data: ~numpy.ndarray, time_windows: ~numpy.ndarray, sample_rate: int, test: typing_extensions.Literal['ADF', 'KPSS', 'ZA'] = 'adf') -> (<class 'numpy.ndarray'>, <class 'numpy.ndarray'>)[source]
Perform the Augmented Dickey-Fuller (ADF), Kwiatkowski-Phillips-Schmidt-Shin (KPSS), or Zivot-Andrews test on sliding windows of time series data. Parallel processing using all available cores is used to accelerate computation.
Note
ADF: A high p-value suggests non-stationarity, while a low p-value indicates stationarity.
KPSS: A high p-value suggests stationarity, while a low p-value indicates non-stationarity.
ZA: A high p-value suggests non-stationarity, while a low p-value indicates stationarity.
- Parameters
data (np.ndarray) – 1-D NumPy array containing the time series data to be tested.
time_windows (np.ndarray) – A 1-D NumPy array containing the time window sizes in seconds.
sample_rate (np.ndarray) – The sample rate of the time series data (samples per second).
test (Literal) – Test to perform. Options: ‘ADF’ (Augmented Dickey-Fuller), ‘KPSS’ (Kwiatkowski-Phillips-Schmidt-Shin), ‘ZA’ (Zivot-Andrews).
- Return (np.ndarray, np.ndarray)
A tuple of two 2-D NumPy arrays containing test statistics and p-values.
- The first array (stat) contains the test statistics.
- The second array (p_vals) contains the corresponding p-values.
- Example
>>> data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
>>> TimeseriesFeatureMixin().sliding_stationary(data=data, time_windows=np.array([2.0]), test='KPSS', sample_rate=2)
- static sliding_two_signal_crosscorrelation(x: ndarray, y: ndarray, windows: ndarray, sample_rate: float, normalize: bool, lag: float) ndarray [source]
Calculate sliding (lagged) cross-correlation between two signals, e.g., the movement and velocity of two animals.
Note
If no lag needed, pass lag 0.0.
- Parameters
x (np.ndarray) – The first input signal.
y (np.ndarray) – The second input signal.
windows (np.ndarray) – Array of window lengths in seconds.
sample_rate (float) – Sampling rate of the signals (in Hz or FPS).
normalize (bool) – If True, normalize the signals before computing the correlation.
lag (float) – Time lag between the signals in seconds.
- Returns
2D array of sliding cross-correlation values. Each row corresponds to a time index, and each column corresponds to a window size specified in the windows parameter.
- Example
>>> x = np.random.randint(0, 10, size=(20,))
>>> y = np.random.randint(0, 10, size=(20,))
>>> TimeseriesFeatureMixin.sliding_two_signal_crosscorrelation(x=x, y=y, windows=np.array([1.0, 1.2]), sample_rate=10, normalize=True, lag=0.0)
- static sliding_variance(data: ndarray, window_sizes: ndarray, sample_rate: int) ndarray [source]
Jitted compute of the variance of data within sliding windows of varying sizes applied to the input data array. Variance is a measure of data dispersion or spread.
- Parameters
data (np.ndarray) – 1D input data array.
window_sizes (np.ndarray) – Array of window sizes (in seconds).
sample_rate (int) – Sampling rate of the data in samples per second.
- Returns
Variance values for each window size and data point. The shape of the result array is (data.shape[0], window_sizes.shape[0]).
- Example
>>> data = np.array([1, 2, 3, 1, 2, 9, 17, 2, 10, 4]).astype(np.float32)
>>> TimeseriesFeatureMixin().sliding_variance(data=data, window_sizes=np.array([0.5]), sample_rate=10)
>>> [[-1.], [-1.], [-1.], [-1.], [0.56], [8.23], [35.84], [39.20], [34.15], [30.15]]
- static spike_finder(data: ndarray, sample_rate: int, baseline: float, min_spike_amplitude: float, min_fwhm: float = -inf, min_half_width: float = -inf) float [source]
Identify and characterize spikes in a given time-series data sequence. This method identifies spikes in the input data based on the specified criteria and characterizes each detected spike by computing its amplitude, full-width at half maximum (FWHM), and half-width.
- Parameters
data (np.ndarray) – A 1D array containing the input data sequence to analyze.
sample_rate (int) – The sample rate, indicating how many data points are collected per second.
baseline (float) – The baseline value used to identify spikes. Any data point above (baseline + min_spike_amplitude) is considered part of a spike.
min_spike_amplitude (float) – The minimum amplitude (above baseline) required for a spike to be considered.
min_fwhm (Optional[float]) – The minimum full-width at half maximum (FWHM) for a spike to be included. If not specified, it defaults to negative infinity, meaning it is not considered for filtering.
min_half_width (Optional[float]) – The minimum half-width required for a spike to be included. If not specified, it defaults to negative infinity, meaning it is not considered for filtering.
- Return tuple
A tuple containing three elements:
spike_idx (List[np.ndarray]): A list of 1D arrays, each representing the indices of the data points belonging to a detected spike.
spike_vals (List[np.ndarray]): A list of 1D arrays, each containing the values of the data points within a detected spike.
spike_dict (Dict[int, Dict[str, float]]): A dictionary where the keys are spike indices, and the values are dictionaries containing spike characteristics including ‘amplitude’ (spike amplitude), ‘fwhm’ (FWHM), and ‘half_width’ (half-width).
Note
The function uses the Numba JIT (Just-In-Time) compilation for optimized performance. Without fastmath=True there is no runtime improvement over standard numpy.
- Example
>>> data = np.array([0.1, 0.1, 0.3, 0.1, 10, 10, 8, 0.1, 0.1, 0.1, 10, 10, 8, 99, 0.1, 99, 99, 0.1]).astype(np.float32)
>>> spike_idx, spike_vals, spike_stats = TimeseriesFeatureMixin().spike_finder(data=data, baseline=1, min_spike_amplitude=5, sample_rate=2, min_fwhm=-np.inf, min_half_width=0.0002)
- static spike_train_finder(data: ndarray, spike_idx: list, sample_rate: float, min_spike_train_length: float = inf, max_spike_train_separation: float = inf)[source]
Identify and analyze spike trains from a list of spike indices.
This function takes spike indices and additional information, such as the data, sample rate, minimum spike train length, and maximum spike train separation, to identify and analyze spike trains in the data.
Note
The function may return an empty dictionary if no spike trains meet the criteria.
A required input is `spike_idx`, which is returned by `spike_finder()`.
- Parameters
data (types.List(types.Array(types.int64, 1, 'C'))) – The data from which spike trains are extracted.
spike_idx (list) – A list of spike indices, typically as integer timestamps.
sample_rate (float) – The sample rate of the data.
min_spike_train_length (Optional[float]) – The minimum length a spike train must have to be considered. Default is set to positive infinity, meaning no minimum length is enforced.
max_spike_train_separation (Optional[float]) – The maximum allowable separation between spikes in the same train. Default is set to positive infinity, meaning no maximum separation is enforced.
- Return DictType[int64,DictType[unicode_type,float64]]
A dictionary containing information about identified spike trains.
- Each entry in the returned dictionary is indexed by an integer, and contains the following information:
‘train_start_time’: Start time of the spike train in seconds.
‘train_end_time’: End time of the spike train in seconds.
‘train_start_obs’: Start time index in observations.
‘train_end_obs’: End time index in observations.
‘spike_cnt’: Number of spikes in the spike train.
‘train_length_obs_cnt’: Length of the spike train in observations.
‘train_length_obs_s’: Length of the spike train in seconds.
‘train_spike_mean_lengths_s’: Mean length of individual spikes in seconds.
‘train_spike_std_length_obs’: Standard deviation of spike lengths in observations.
‘train_spike_std_length_s’: Standard deviation of spike lengths in seconds.
‘train_spike_max_length_obs’: Maximum spike length in observations.
‘train_spike_max_length_s’: Maximum spike length in seconds.
‘train_spike_min_length_obs’: Minimum spike length in observations.
‘train_spike_min_length_s’: Minimum spike length in seconds.
‘train_mean_amplitude’: Mean amplitude of the spike train.
‘train_std_amplitude’: Standard deviation of spike amplitudes.
‘train_min_amplitude’: Minimum spike amplitude.
‘train_max_amplitude’: Maximum spike amplitude.
- Example
>>> data = np.array([0.1, 0.1, 0.3, 0.1, 10, 10, 8, 0.1, 0.1, 0.1, 10, 10, 8, 99, 0.1, 99, 99, 0.1]).astype(np.float32)
>>> spike_idx, _, _ = TimeseriesFeatureMixin().spike_finder(data=data, baseline=0.3, min_spike_amplitude=0.2, sample_rate=2, min_fwhm=-np.inf, min_half_width=-np.inf)
>>> results = TimeseriesFeatureMixin().spike_train_finder(data=data, spike_idx=typed.List(spike_idx), sample_rate=2.0, min_spike_train_length=2.0, max_spike_train_separation=2.0)
- static time_since_previous_target_value(data: ndarray, value: float, fps: int, inverse: bool = False) ndarray [source]
Calculate the time duration (in seconds) between each data point and the previous occurrence of a specific value within the data array.
- Parameters
data (np.ndarray) – The input 1D array containing the time series data.
value (float) – The specific value to search for in the data array.
fps (int) – The sampling rate at which the data points were collected. It is used to calculate the time duration in seconds.
inverse (bool) – If True, the function calculates the time since the previous value that is NOT equal to the specified ‘value’. If False, it calculates the time since the previous occurrence of the specified ‘value’.
- Returns np.ndarray
A 1D NumPy array containing the time duration (in seconds) since the previous occurrence of the specified ‘value’ for each data point.
- Example
>>> data = np.array([8, 8, 2, 10, 8, 6, 8, 1, 1, 1]).astype(np.float32)
>>> TimeseriesFeatureMixin().time_since_previous_target_value(data=data, value=8.0, inverse=False, fps=2.0)
>>> [0. , 0. , 0.5, 1. , 0. , 0.5, 0. , 0.5, 1. , 1.5]
>>> TimeseriesFeatureMixin().time_since_previous_target_value(data=data, value=8.0, inverse=True, fps=2.0)
>>> [-1. , -1. , 0. , 0. , 0.5, 0. , 0.5, 0. , 0. , 0. ]
- static time_since_previous_threshold(data: ndarray, threshold: float, fps: int, above: bool) ndarray [source]
Jitted compute of the time (in seconds) that has elapsed since the last occurrence of a value above (or below) a specified threshold in a time series. The time series is assumed to have a constant sample rate.
- Parameters
data (np.ndarray) – The input 1D array containing the time series data.
threshold (float) – The threshold value used for the comparison.
fps (int) – The sample rate of the time series in samples per second.
above (bool) – If True, the function looks for values above or equal to the threshold. If False, it looks for values below or equal to the threshold.
- Return np.ndarray
A 1D array of the same length as the input data. Each element represents the time elapsed (in seconds) since the last occurrence of the threshold value. If no threshold value is found before the current data point, the corresponding result is set to -1.0.
- Examples
>>> data = np.array([1, 8, 2, 10, 8, 6, 8, 1, 1, 1]).astype(np.float32)
>>> TimeseriesFeatureMixin().time_since_previous_threshold(data=data, threshold=7.0, above=True, fps=2.0)
>>> [-1. , 0. , 0.5, 0. , 0. , 0.5, 0. , 0.5, 1. , 1.5]
>>> TimeseriesFeatureMixin().time_since_previous_threshold(data=data, threshold=7.0, above=False, fps=2.0)
>>> [0. , 0.5, 0. , 0.5, 1. , 0. , 0.5, 0. , 0. , 0. ]
Unsupervised methods
Annotation GUI methods
- class simba.mixins.annotator_mixin.AnnotatorMixin(config_path: Union[str, PathLike], video_path: Union[str, PathLike], data_path: Union[str, PathLike], frame_size: Optional[Tuple[int]] = (1280, 650), title: Optional[str] = None)[source]
Bases:
ConfigReader
Methods for creating tkinter GUI frames and functions associated with annotating videos.
Currently under development (starting 01/24). As the number of different annotation methods and interfaces increases, this class will contain common methods for all annotation interfaces to decrease code duplication.
- Parameters
config_path (Union[str, os.PathLike]) – Path to SimBA project config ini file.
video_path (Union[str, os.PathLike]) – Path to video file to-be annotated.
data_path (Union[str, os.PathLike]) – Path to featurized pose-estimation data associated with the video.
frame_size (Optional[Tuple[int]]) – The size of the subframe displaying the video frame in the GUI.
- change_frame(new_frm_id: int, min_frm: Optional[int] = None, max_frm: Optional[int] = None, update_funcs: Optional[List[Callable[int, None]]] = None, store_funcs: Optional[List[Callable[None]]] = None, keep_radio_btn_choices: Optional[bool] = False) None [source]
Change the frame displayed in annotator GUI.
Note
store_funcs will be executed before update_funcs.
- Parameters
new_frm_id (int) – The frame number of the new frame.
min_frm (Optional[int]) – If the minimum frame number is not the first frame of the video, pass the minimum frame number here.
max_frm (Optional[int]) – If the maximum frame number is not the last frame of the video, pass the max frame number here.
update_funcs (Optional[List[Callable[[int], None]]]) – Optional functions that accept the new frame number and are called when the frame changes. E.g., useful if updating the frame number should cause any other Frame to display the new frame number.
store_funcs (Optional[List[Callable[[], None]]]) – Optional functions that saves user frame selections in memory.
keep_radio_btn_choices (Optional[bool]) – If True, any update_funcs that would update the radio button choices in the newly displayed frame are suppressed. Thus, the choices from the prior frame are maintained.
- find_proximal_annotated_frm(forwards: bool, present: bool)[source]
Helper to find the most proximal preceding or proceeding frame where any behavior is annotated as present or absent.
- get_annotation_of_frame(frm_num: int, clf: str, allowed_vals: Optional[List[Optional[int]]] = None) Optional[int] [source]
Helper to retrieve the stored annotation of a specific classifier at a specific frame index
Creates a horizontal frame navigation bar where the buttons are tied to callbacks for changing and displaying video frames.
- Parameters
parent (Frame) – The tkinter Frame to place navigation bar within.
update_funcs (Optional[List[Callable[[int], None]]]) – Optional list of callables that accepts a single integer inputs. Can be methods that updates part of the interface as the frame number changes.
store_funcs (Optional[List[Callable[[], None]]]) – Optional list of callables without arguments. Can be methods that stores the selections in memory as users proceeds through the frames.
size (Optional[Tuple[int, int]]) – The size of the navigation bar in h x w. Default 300 x 700.
loc (Optional[Tuple[int, int]]) – The grid location (row, column) within the parent frame at which the navigation bar should be displayed. Default: (1, 0).
previous_next_clf (Optional[bool]) – If True, then include four buttons allowing users to navigate to the most proximal preceding or proceeding frame where behaviors are annotated as present or absent.
- store_targeted_annotations_frames()[source]
Method to store annotations in memory while annotating targeted bouts frame-wise
- targeted_annotations_frames_save()[source]
Method to save annotations to disk when using targeted bout frame-wise annotations
- targeted_bouts_pane(parent: Frame) Frame [source]
Create a pane for choosing bouts start and end and a radiobutton truth table for targeted bouts annotations. Used by simba.labelling.targeted_annotations_bouts.TargetedAnnotatorBouts
- targeted_frames_selection_pane(parent: Frame, loc: Optional[Tuple[int, int]] = (0, 1)) None [source]
Creates a vertical pane that includes tkinter frames for selecting bouts and annotating behaviours in those bouts frame-wise.
- update_clf_radiobtns(frm_num: int)[source]
Update helper to set the radio button to annotated values
- update_current_selected_frm_lbl(new_frm: Union[int, str])[source]
Helper to update label showing current frame text shown when annotating bouts frame-wise.
Create a vertical navigation pane for playing a video and displaying and activating keyboard shortcuts when annotating bouts.
- Parameters
parent (Frame) – The tkinter Frame to place the vertical navigation bar within.
save_func (Callable[[], None]) – The save-data-to-disk function that should be called when using the save data shortcut.
update_funcs (Optional[List[Callable[[int], None]]]) – Optional list of callables that accepts a single integer inputs. Can be methods that updates part of the interface as the frame number changes.
store_funcs (Optional[List[Callable[[], None]]]) – Optional list of callables without arguments. Can be methods that stores the selections in memory as users proceeds through the frames.
loc (Optional[Tuple[int, int]]) – The grid location (row, column) within the parent frame at which the navigation bar should be displayed. Default: (1, 0).
- video_frm_label(frm_number: int, max_size: Optional[Tuple[int]] = None, loc: Tuple[int] = (0, 0)) None [source]
Inserts a video frame as a tkinter label at a specified maximum size at specified grid location.
- Parameters
frm_number (int) – The frame number of the video that should be displayed as a tkinter label.
max_size (Optional[Tuple[int, int]]) – The maximum size of the image when displayed. If None, then the `frame_size` defined at instance init is used.
loc (Tuple[int, int]) – The grid location (row, column) within the main frame at which the video frame should be displayed.
Image attribute extraction
- class simba.mixins.image_mixin.ImageMixin[source]
Bases:
object
Methods to slice and compute attributes of images and comparing those attributes across sequential images.
This can be helpful when the behaviors studied are very subtle and the signal is very low in relation to the noise within the pose-estimated data. In these use-cases, we cannot use pose-estimated data directly, and we instead study histograms, contours and other image metrics within images derived from the intersection of geometries (like a circle around the nose) across sequential images. Often these methods are called using image masks created from pose-estimated points within `simba.mixins.geometry_mixin.GeometryMixin` methods.
Important
If there is non-pose related noise in the environment (e.g., non-experiment related light sources that go on and off, or other image noise that doesn’t necessarily affect pose-estimation reliability), this will negatively affect the reliability of most image attribute comparisons.
- static add_img_border_and_flood_fill(img: array, invert: Optional[bool] = False, size: Optional[int] = 1) ndarray [source]
Add a border to the input image and perform flood fill.
E.g., Used to remove any black pixel areas connected to the border of the image. Used to remove noise if noise is defined as being connected to the edges of the image.
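The border-and-flood-fill idea can be sketched without OpenCV. This is a minimal pure-Python/NumPy illustration of the concept (it is not SimBA's implementation, which uses cv2): black pixels connected to the image border are filled, while interior black pixels survive.

```python
from collections import deque
import numpy as np

def remove_border_connected_black(img: np.ndarray, fill_value: int = 255) -> np.ndarray:
    """Flood-fill black (0) pixels connected to the image border with fill_value."""
    out = img.copy()
    h, w = out.shape
    q = deque()
    # seed the fill from every black border pixel
    for x in range(w):
        for y in (0, h - 1):
            if out[y, x] == 0:
                q.append((y, x))
    for y in range(h):
        for x in (0, w - 1):
            if out[y, x] == 0:
                q.append((y, x))
    # breadth-first fill of connected black regions
    while q:
        y, x = q.popleft()
        if 0 <= y < h and 0 <= x < w and out[y, x] == 0:
            out[y, x] = fill_value
            q.extend([(y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)])
    return out

img = np.full((5, 5), 255, dtype=np.uint8)
img[0:2, 0:2] = 0      # black blob touching the border: removed
img[2, 2] = 0          # interior black pixel: kept
cleaned = remove_border_connected_black(img)
```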
- static brightness_intensity(imgs: List[ndarray], ignore_black: Optional[bool] = True) List[float] [source]
Compute the average brightness intensity within each image within a list.
For example, (i) create a list of images containing a light cue ROI, (ii) compute brightness in each image, (iii) perform kmeans on brightness, and get the frames when the light cue is on vs off.
- Parameters
imgs (List[np.ndarray]) – List of images as arrays to calculate average brightness intensity within.
ignore_black (Optional[bool]) – If True, ignores black pixels. If the images are non-rectangular geometric slices created by `slice_shapes_in_img`, then pixels that don’t belong to the shape have been masked in black.
- Returns List[float]
List of floats of size len(imgs) with brightness intensities.
- Example
>>> img = cv2.imread('/Users/simon/Desktop/envs/troubleshooting/khan/project_folder/videos/stitched_frames/0.png').astype(np.uint8)
>>> ImageMixin.brightness_intensity(imgs=[img], ignore_black=False)
>>> [159.0]
- static canny_edge_detection(img: ndarray, threshold_1: int = 30, threshold_2: int = 200, aperture_size: int = 3, l2_gradient: bool = False) ndarray [source]
Apply Canny edge detection to the input image.
- static find_contours(img: ndarray, mode: Optional[typing_extensions.Literal['all', 'exterior']] = 'all', method: Optional[typing_extensions.Literal['simple', 'none', 'l1', 'kcos']] = 'simple') ndarray [source]
Find contours in the input image.
- Parameters
img (np.ndarray) – Input image as a NumPy array.
mode (Optional[Literal['all', 'exterior']]) – Contour retrieval mode, i.e. which contours should be kept. Default is ‘all’.
method (Optional[Literal['simple', 'none', 'l1', 'kcos']]) – Contour approximation method. Default is ‘simple’.
- static get_contourmatch(img_1: ndarray, img_2: ndarray, mode: Optional[typing_extensions.Literal['all', 'exterior']] = 'all', method: Optional[typing_extensions.Literal['simple', 'none', 'l2', 'kcos']] = 'simple', canny: Optional[bool] = True) float [source]
Calculate contour similarity between two images.
- Parameters
img_1 (np.ndarray) – First input image (numpy array).
img_2 (np.ndarray) – Second input image (numpy array).
mode (Optional[Literal['all', 'exterior']]) – Mode for contour extraction. Options: ‘all’ (all contours) or ‘exterior’ (only exterior contours). Defaults to ‘all’.
- Return float
Contour similarity score between the two images. Lower values indicate greater similarity, and higher values indicate greater dissimilarity.
- Example
>>> img_1 = cv2.imread('/Users/simon/Desktop/envs/troubleshooting/khan/project_folder/videos/stitched_frames/0.png').astype(np.uint8)
>>> img_2 = cv2.imread('/Users/simon/Desktop/envs/troubleshooting/khan/project_folder/videos/stitched_frames/3.png').astype(np.uint8)
>>> ImageMixin.get_contourmatch(img_1=img_1, img_2=img_2, mode='exterior')
- static get_histocomparison(img_1: ndarray, img_2: ndarray, method: Optional[typing_extensions.Literal['chi_square', 'correlation', 'intersection', 'bhattacharyya', 'hellinger', 'chi_square_alternative', 'kl_divergence']] = 'correlation', absolute: Optional[bool] = True)[source]
- Example
>>> img_1 = cv2.imread('/Users/simon/Desktop/envs/troubleshooting/khan/project_folder/videos/stitched_frames/0.png').astype(np.uint8)
>>> img_2 = cv2.imread('/Users/simon/Desktop/envs/troubleshooting/khan/project_folder/videos/stitched_frames/3.png').astype(np.uint8)
>>> ImageMixin.get_histocomparison(img_1=img_1, img_2=img_2, method='chi_square_alternative')
- static img_emd(imgs: Optional[List[ndarray]] = None, img_1: Optional[ndarray] = None, img_2: Optional[ndarray] = None, lower_bound: Optional[float] = 0.5)[source]
Compute Wasserstein distance between two images represented as numpy arrays.
- Example
>>> img_1 = cv2.imread('/Users/simon/Desktop/envs/troubleshooting/Emergence/project_folder/videos/Example_1_frames/24.png', 0).astype(np.float32)
>>> img_2 = cv2.imread('/Users/simon/Desktop/envs/troubleshooting/Emergence/project_folder/videos/Example_1_frames/1984.png', 0).astype(np.float32)
>>> ImageMixin.img_emd(img_1=img_1, img_2=img_2, lower_bound=0.5)
>>> 10.658767700195312
- static img_matrix_mse(imgs: ndarray) ndarray [source]
Compute the mean squared error (MSE) matrix table for a stack of images.
This function calculates the MSE between each pair of images in the input array and returns a symmetric matrix where each element (i, j) represents the MSE between the i-th and j-th images. Useful for assessing image similarities and anomalies.
- Parameters
imgs (np.ndarray) – A stack of images represented as a numpy array.
- Return np.ndarray
The MSE matrix table.
- Example
>>> imgs = ImageMixin().read_img_batch_from_video(video_path='/Users/simon/Desktop/envs/troubleshooting/two_black_animals_14bp/videos/Together_1.avi', start_frm=0, end_frm=50)
>>> imgs = np.stack(list(imgs.values()))
>>> ImageMixin().img_matrix_mse(imgs=imgs)
- static img_moments(img: ndarray, hu_moments: Optional[bool] = False) ndarray [source]
Compute image moments.
- Parameters
img (np.ndarray) – The input image.
hu_moments (Optional[bool]) – If True, returns the 7 Hu moments. Else, returns the moments.
- Returns np.ndarray
A 24x1 2d-array if hu_moments is False, 7x1 2d-array if hu_moments is True.
- Example
>>> img_1 = cv2.imread('/Users/simon/Desktop/envs/troubleshooting/khan/project_folder/videos/stitched_frames/0.png').astype(np.uint8)
>>> ImageMixin.img_moments(img=img_1, hu_moments=True)
>>> [[ 1.01270313e-03], [ 8.85983106e-10], [ 4.67680675e-13], [ 1.00442018e-12], [-4.64181508e-25], [-2.49036749e-17], [ 5.08375216e-25]]
- static img_sliding_mse(imgs: ndarray, slide_size: int = 1) ndarray [source]
Pairwise comparison of images in sliding windows using mean squared errors
- Example
>>> imgs = ImageMixin().read_all_img_in_dir(dir='/Users/simon/Desktop/envs/troubleshooting/two_black_animals_14bp/project_folder/Together_4_cropped_frames')
>>> imgs = np.stack(list(imgs.values()))
>>> mse = ImageMixin().img_sliding_mse(imgs=imgs, slide_size=2)
- static img_stack_mse(imgs_1: ndarray, imgs_2: ndarray) ndarray [source]
Pairwise comparison of images in two stacks of equal length using mean squared errors.
Note
Useful for noting subtle changes; e.g., each image in imgs_2 may equal the corresponding image in imgs_1 shifted by 1. Images have to be in uint8 format. Also see `img_sliding_mse`.
- Parameters
imgs_1 (np.ndarray) – First three (non-color) or four (color) dimensional stack of images in array format.
imgs_2 (np.ndarray) – Second three (non-color) or four (color) dimensional stack of images in array format.
- Return np.ndarray
Array of size len(imgs_1) comparing `imgs_1` and `imgs_2` at each index using mean squared errors at each pixel location.
- Example
>>> img_1 = cv2.imread('/Users/simon/Desktop/envs/troubleshooting/khan/project_folder/videos/stitched_frames/0.png').astype(np.uint8)
>>> img_2 = cv2.imread('/Users/simon/Desktop/envs/troubleshooting/khan/project_folder/videos/stitched_frames/10.png').astype(np.uint8)
>>> imgs_1 = np.stack((img_1, img_2)); imgs_2 = np.stack((img_2, img_2))
>>> ImageMixin.img_stack_mse(imgs_1=imgs_1, imgs_2=imgs_2)
>>> [637, 0]
>>> imgs = ImageMixin().read_all_img_in_dir(dir='/Users/simon/Desktop/envs/troubleshooting/two_black_animals_14bp/project_folder/Together_4_cropped_frames')
>>> imgs_1 = np.stack(list(imgs.values()))
>>> imgs_2 = np.roll(imgs_1, -1, axis=0)
>>> mse = ImageMixin().img_stack_mse(imgs_1=imgs_1, imgs_2=imgs_2)
- static img_stack_to_bw(imgs: ndarray, lower_thresh: int, upper_thresh: int, invert: bool)[source]
Convert a stack of color images into black and white format.
Note
If converting a single image, consider `simba.mixins.image_mixin.ImageMixin.img_to_bw()`
- Parameters
imgs (np.ndarray) – 4-dimensional array of color images.
lower_thresh (Optional[int]) – Lower threshold value for binary conversion. Pixels below this value become black. Default is 20.
upper_thresh (Optional[int]) – Upper threshold value for binary conversion. Pixels above this value become white. Default is 250.
invert (Optional[bool]) – Flag indicating whether to invert the binary image (black becomes white and vice versa). Default is True.
- Return np.ndarray
4-dimensional array with black and white image.
- Example
>>> imgs = ImageMixin.read_img_batch_from_video(video_path='/Users/simon/Downloads/3A_Mouse_5-choice_MouseTouchBasic_a1.mp4', start_frm=0, end_frm=100)
>>> imgs = np.stack(list(imgs.values()), axis=0)
>>> bw_imgs = ImageMixin.img_stack_to_bw(imgs=imgs, upper_thresh=255, lower_thresh=20, invert=False)
- static img_stack_to_greyscale(imgs: ndarray)[source]
Jitted conversion of a 4D stack of color images (RGB format) to grayscale.
- Parameters
imgs (np.ndarray) – A 4D array representing color images. It should have the shape (num_images, height, width, 3) where the last dimension represents the color channels (R, G, B).
- Returns np.ndarray
A 3D array containing the grayscale versions of the input images. The shape of the output array is (num_images, height, width).
- Example
>>> imgs = ImageMixin().read_img_batch_from_video(video_path='/Users/simon/Desktop/envs/troubleshooting/two_black_animals_14bp/videos/Together_1.avi', start_frm=0, end_frm=100)
>>> imgs = np.stack(list(imgs.values()))
>>> imgs_gray = ImageMixin.img_stack_to_greyscale(imgs=imgs)
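As a rough illustration of what the jitted conversion does, here is a plain NumPy sketch using the standard BT.601 luminosity weights; the exact channel weighting (and channel order, e.g. BGR vs RGB for frames read via OpenCV) in SimBA's implementation may differ.

```python
import numpy as np

def stack_to_greyscale(imgs: np.ndarray) -> np.ndarray:
    """Convert a (N, H, W, 3) image stack to (N, H, W) greyscale.

    Uses BT.601 luminosity weights, assuming RGB channel order.
    """
    weights = np.array([0.299, 0.587, 0.114], dtype=np.float32)  # R, G, B
    grey = imgs.astype(np.float32) @ weights  # weighted sum over the channel axis
    return grey.astype(np.uint8)

imgs = np.random.randint(0, 255, size=(4, 8, 8, 3), dtype=np.uint8)
grey = stack_to_greyscale(imgs)
# grey.shape == (4, 8, 8)
```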
- static img_stack_to_video(imgs: Dict[int, ndarray], save_path: Union[str, PathLike], fps: int, verbose: Optional[bool] = True)[source]
Convert a dictionary of images into a video file.
Note
The input dict can be created with ImageMixin().slice_shapes_in_imgs()
- Parameters
imgs (Dict[int, np.ndarray]) – A dictionary containing frames of the video, where the keys represent frame indices and the values are numpy arrays representing the images.
save_path (Union[str, os.PathLike]) – The path to save the output video file.
fps (int) – Frames per second (FPS) of the output video.
verbose (Optional[bool]) – If True, prints progress messages. Defaults to True.
- static img_to_bw(img: ndarray, lower_thresh: Optional[int] = 20, upper_thresh: Optional[int] = 250, invert: Optional[bool] = True) ndarray [source]
Convert an image to black and white (binary).
Note
If converting multiple images from colour to black and white, consider `simba.mixins.image_mixin.ImageMixin.img_stack_to_bw()`
- Parameters
img (np.ndarray) – Input image as a NumPy array.
lower_thresh (Optional[int]) – Lower threshold value for binary conversion. Pixels below this value become black. Default is 20.
upper_thresh (Optional[int]) – Upper threshold value for binary conversion. Pixels above this value become white. Default is 250.
invert (Optional[bool]) – Flag indicating whether to invert the binary image (black becomes white and vice versa). Default is True.
- Return np.ndarray
Binary black and white image.
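A minimal NumPy sketch of the thresholding idea. This is not SimBA's cv2-based implementation: the simplified rule here applies only the lower cut, so in-between pixels map to white.

```python
import numpy as np

def to_black_and_white(img: np.ndarray, lower_thresh: int = 20,
                       upper_thresh: int = 250, invert: bool = True) -> np.ndarray:
    """Binarize a greyscale image: pixels below lower_thresh become 0, all others 255.

    Simplified single-cut rule; upper_thresh is accepted but unused in this sketch.
    """
    bw = np.where(img < lower_thresh, 0, 255).astype(np.uint8)
    if invert:
        bw = 255 - bw  # black becomes white and vice versa
    return bw

img = np.array([[5, 100], [250, 30]], dtype=np.uint8)
bw = to_black_and_white(img, invert=False)
# bw == [[0, 255], [255, 255]]
```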
- static orb_matching_similarity_(img_1: ndarray, img_2: ndarray, method: typing_extensions.Literal['knn', 'match', 'radius'] = 'knn', mask: Optional[ndarray] = None, threshold: Optional[int] = 0.75) int [source]
Perform ORB feature matching between two sets of images.
>>> img_1 = cv2.imread('/Users/simon/Desktop/envs/troubleshooting/khan/project_folder/videos/stitched_frames/0.png').astype(np.uint8)
>>> img_2 = cv2.imread('/Users/simon/Desktop/envs/troubleshooting/khan/project_folder/videos/stitched_frames/10.png').astype(np.uint8)
>>> ImageMixin().orb_matching_similarity_(img_1=img_1, img_2=img_2, method='radius')
>>> 4
- static pad_img_stack(image_dict: Dict[int, ndarray], pad_value: Optional[int] = 0) Dict[int, ndarray] [source]
Pad images in a dictionary stack so that all have the same dimensions (the dimensions of the largest image in the stack).
- Parameters
image_dict (Dict[int, np.ndarray]) – A dictionary mapping integer keys to numpy arrays representing images.
pad_value (Optional[int]) – The pixel value used for padding. Default is 0 (black).
- Return Dict[int, np.ndarray]
A dictionary mapping integer keys to numpy arrays representing padded images.
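A plain NumPy sketch of the padding behavior described above, assuming bottom/right padding with a constant value (the exact padding placement in SimBA's implementation may differ):

```python
import numpy as np

def pad_img_stack(image_dict: dict, pad_value: int = 0) -> dict:
    """Pad every image to the height/width of the largest image in the stack."""
    max_h = max(img.shape[0] for img in image_dict.values())
    max_w = max(img.shape[1] for img in image_dict.values())
    out = {}
    for k, img in image_dict.items():
        pad_h, pad_w = max_h - img.shape[0], max_w - img.shape[1]
        # pad bottom and right; leave any trailing color axis untouched
        pads = [(0, pad_h), (0, pad_w)] + [(0, 0)] * (img.ndim - 2)
        out[k] = np.pad(img, pads, mode='constant', constant_values=pad_value)
    return out

imgs = {0: np.ones((4, 6), dtype=np.uint8), 1: np.ones((8, 3), dtype=np.uint8)}
padded = pad_img_stack(imgs)
# every image now has shape (8, 6)
```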
- static read_all_img_in_dir(dir: Union[str, PathLike], core_cnt: Optional[int] = -1) Dict[str, ndarray] [source]
Helper to read in all images within a directory using multiprocessing. Returns a dictionary with the image name as key and the images in array format as values.
- Example
>>> imgs = ImageMixin().read_all_img_in_dir(dir='/Users/simon/Desktop/envs/troubleshooting/two_black_animals_14bp/project_folder/Together_4_cropped_frames')
- static read_img_batch_from_video(video_path: Union[str, PathLike], start_frm: int, end_frm: int, core_cnt: Optional[int] = -1) Dict[int, ndarray] [source]
Read a batch of frames from a video file. This method reads frames from a specified range of frames within a video file using multiprocessing.
- Parameters
video_path (Union[str, os.PathLike]) – Path to the video file.
start_frm (int) – Starting frame index.
end_frm (int) – Ending frame index.
core_cnt (Optional[int]) – Number of CPU cores to use for parallel processing. Default is -1, indicating using all available cores.
greyscale (Optional[bool]) – If True, reads the images as greyscale. If False, then as original color scale. Default: True.
- Returns Dict[int, np.ndarray]
A dictionary containing frame indices as keys and corresponding frame arrays as values.
- Example
>>> ImageMixin().read_img_batch_from_video(video_path='/Users/simon/Desktop/envs/troubleshooting/two_black_animals_14bp/videos/Together_1.avi', start_frm=0, end_frm=50)
- static segment_img_horizontal(img: ndarray, pct: int, lower: Optional[bool] = True, both: Optional[bool] = False) ndarray [source]
Segment a horizontal part of the input image.
This function segments either the lower, upper, or both lower and upper part of the input image based on the specified percentage.
- Parameters
img (np.ndarray) – Input image as a NumPy array.
pct (int) – Percentage of the image to be segmented. If lower is True, it represents the lower part; if False, it represents the upper part.
lower (Optional[bool]) – Flag indicating whether to segment the lower part (True) or upper part (False) of the image. Default is True.
both (Optional[bool]) – If True, removes both the upper pct and lower pct and keeps middle part.
- Return np.array
Segmented part of the image.
- Example
>>> img = cv2.imread('/Users/simon/Desktop/test.png')
>>> img = ImageMixin.segment_img_horizontal(img=img, pct=10, both=True)
- static segment_img_stack_horizontal(imgs: ndarray, pct: int, lower: bool, both: bool) ndarray [source]
Segment a horizontal part of all images in stack.
- Example
>>> imgs = ImageMixin.read_img_batch_from_video(video_path='/Users/simon/Downloads/3A_Mouse_5-choice_MouseTouchBasic_a1.mp4', start_frm=0, end_frm=400)
>>> imgs = np.stack(list(imgs.values()), axis=0)
>>> sliced_imgs = ImageMixin.segment_img_stack_horizontal(imgs=imgs, pct=50, lower=True, both=False)
- static segment_img_vertical(img: ndarray, pct: int, left: Optional[bool] = True, both: Optional[bool] = False) ndarray [source]
Segment a vertical part of the input image.
This function segments either the left, right or both the left and right part of input image based on the specified percentage.
- Parameters
img (np.ndarray) – Input image as a NumPy array.
pct (int) – Percentage of the image to be segmented. If left is True, it represents the left part; if False, it represents the right part.
left (Optional[bool]) – Flag indicating whether to segment the left part (True) or right part (False) of the image. Default is True.
both (Optional[bool]) – If True, removes both the left pct and right pct and keeps middle part.
- Return np.array
Segmented part of the image.
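The vertical segmentation reduces to column slicing in NumPy. A minimal sketch, assuming percentages are measured from the respective edge:

```python
import numpy as np

def segment_img_vertical(img: np.ndarray, pct: int, left: bool = True, both: bool = False) -> np.ndarray:
    """Slice a vertical percentage of an image by column indexing."""
    w = img.shape[1]
    cut = int(w * pct / 100)
    if both:
        return img[:, cut:w - cut]        # drop pct from both sides, keep the middle band
    return img[:, :cut] if left else img[:, w - cut:]

img = np.arange(100).reshape(10, 10)
left_half = segment_img_vertical(img, pct=50, left=True)
# left_half.shape == (10, 5)
mid = segment_img_vertical(img, pct=10, both=True)
# mid.shape == (10, 8)
```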
- static slice_shapes_in_img(img: Union[ndarray, Tuple[VideoCapture, int]], geometries: List[Union[Polygon, ndarray]]) List[ndarray] [source]
Slice regions of interest (ROIs) from an image based on provided shapes.
Note
Use for slicing one or several static geometries from a single image. If you have several images and shifting geometries across images, consider `simba.mixins.image_mixin.ImageMixin.slice_shapes_in_imgs` which uses CPU multiprocessing.
- Parameters
img (Union[np.ndarray, Tuple[cv2.VideoCapture, int]]) – Either an image in numpy array format OR a tuple with a cv2.VideoCapture object and the frame index.
geometries (List[Union[Polygon, np.ndarray]]) – A list of shapes, either as vertices in a numpy array or as shapely Polygons.
- Returns List[np.ndarray]
List of sliced ROIs from the input image.
>>> img = cv2.imread('/Users/simon/Desktop/envs/troubleshooting/Emergence/project_folder/videos/img_comparisons_4/1.png')
>>> img_video = cv2.VideoCapture('/Users/simon/Desktop/envs/troubleshooting/Emergence/project_folder/videos/Example_1.mp4')
>>> data_path = '/Users/simon/Desktop/envs/troubleshooting/Emergence/project_folder/csv/outlier_corrected_movement_location/Example_1.csv'
>>> data = pd.read_csv(data_path, nrows=4, usecols=['Nose_x', 'Nose_y']).fillna(-1).values.astype(np.int64)
>>> shapes = []
>>> for frm_data in data: shapes.append(GeometryMixin().bodyparts_to_circle(frm_data, 100))
>>> ImageMixin().slice_shapes_in_img(img=(img_video, 1), geometries=shapes)
- slice_shapes_in_imgs(imgs: Union[ndarray, PathLike], shapes: Union[ndarray, List[Polygon]], core_cnt: Optional[int] = -1, verbose: Optional[bool] = False) Dict[int, ndarray] [source]
Slice regions from a stack of images or a video file, where the regions are based on defined shapes. Uses multiprocessing.
For example, given a stack of N images, and N*X geometries representing the region around the animal body-part(s), slice out the X geometries from each of the N images and return the sliced areas.
- Example I
>>> imgs = ImageMixin().read_img_batch_from_video(video_path='/Users/simon/Desktop/envs/troubleshooting/Emergence/project_folder/videos/Example_1.mp4', start_frm=0, end_frm=10)
>>> imgs = np.stack(list(imgs.values()))
>>> imgs_gray = ImageMixin().img_stack_to_greyscale(imgs=imgs)
>>> data = pd.read_csv('/Users/simon/Desktop/envs/troubleshooting/Emergence/project_folder/csv/outlier_corrected_movement_location/Example_1.csv', nrows=11).fillna(-1)
>>> nose_array, tail_array = data.loc[0:10, ['Nose_x', 'Nose_y']].values.astype(np.float32), data.loc[0:10, ['Tail_base_x', 'Tail_base_y']].values.astype(np.float32)
>>> nose_shapes, tail_shapes = [], []
>>> for frm_data in nose_array: nose_shapes.append(GeometryMixin().bodyparts_to_circle(frm_data, 80))
>>> for frm_data in tail_array: tail_shapes.append(GeometryMixin().bodyparts_to_circle(frm_data, 80))
>>> shapes = np.array(np.vstack([nose_shapes, tail_shapes]).T)
>>> sliced_images = ImageMixin().slice_shapes_in_imgs(imgs=imgs_gray, shapes=shapes)
- Example II
>>> video_path = '/Users/simon/Desktop/envs/troubleshooting/Emergence/project_folder/videos/Example_1_clipped.mp4'
>>> data_path = r'/Users/simon/Desktop/envs/troubleshooting/Emergence/project_folder/csv/outlier_corrected_movement_location/Example_1_clipped.csv'
>>> df = pd.read_csv(data_path, usecols=['Nose_x', 'Nose_y', 'Tail_base_x', 'Tail_base_y']).fillna(0).values.astype(int)
>>> data = df.reshape(len(df), -1, int(df.shape[1]/2))
>>> geometries = GeometryMixin().multiframe_bodyparts_to_line(data=data, buffer=30, px_per_mm=4.1)
>>> imgs = ImageMixin().slice_shapes_in_imgs(imgs=video_path, shapes=geometries)
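Both examples above require project data on disk. The per-frame work that `slice_shapes_in_imgs` distributes across cores can be sketched single-process with pure NumPy; `slice_rois_in_stack` below is a hypothetical simplification that, for each of the N frames, crops the bounding box of each of that frame's X geometries:

```python
from typing import Dict, List
import numpy as np

def slice_rois_in_stack(imgs: np.ndarray, rois: List[List[np.ndarray]]) -> Dict[int, List[np.ndarray]]:
    """For frame i, crop the bounding box of every (N, 2) vertex array in rois[i].
    Hypothetical single-process sketch of the work the multiprocess method distributes."""
    results = {}
    for idx, frame in enumerate(imgs):
        crops = []
        for verts in rois[idx]:
            x0, y0 = verts.min(axis=0)
            x1, y1 = verts.max(axis=0)
            crops.append(frame[y0:y1 + 1, x0:x1 + 1])
        results[idx] = crops
    return results

stack = np.zeros((3, 50, 50), dtype=np.uint8)          # stack of 3 frames
square = np.array([[5, 5], [15, 5], [15, 15], [5, 15]])
per_frame = [[square, square + 10] for _ in range(3)]  # 2 geometries per frame
sliced = slice_rois_in_stack(stack, per_frame)
print(len(sliced), sliced[0][0].shape)  # 3 (11, 11)
```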
- static template_matching_cpu(video_path: Union[str, PathLike], img: ndarray, core_cnt: Optional[int] = -1, return_img: Optional[bool] = False) Tuple[int, dict, Union[None, ndarray]] [source]
Perform template matching on CPU using multiprocessing for parallelization.
For example, given a cropped image, find the frame in a video from which it was most likely cropped.
- Parameters
video_path (Union[str, os.PathLike]) – Path to the video file on disk.
img (np.ndarray) – Template image for matching. E.g., a cropped image from video_path.
core_cnt (Optional[int]) – Number of CPU cores to use for parallel processing. Default is -1 (max available cores).
return_img (Optional[bool]) – Whether to return the best-match image annotated with a rectangle around the matched template area. Default is False.
- Returns Tuple[int, dict, Union[None, np.ndarray]]
A tuple containing: (i) int: frame index of the frame with the best match. (ii) dict: Dictionary containing results (probability and match location) for each frame. (iii) Union[None, np.ndarray]: Annotated image with rectangles around matches (if return_img is True), otherwise None.
- Example
>>> img = cv2.imread('/Users/simon/Desktop/envs/troubleshooting/two_black_animals_14bp/videos/Screenshot 2024-01-17 at 12.45.55 PM.png')
>>> results = ImageMixin().template_matching_cpu(video_path='/Users/simon/Desktop/envs/troubleshooting/two_black_animals_14bp/videos/Together_1.avi', img=img, return_img=True)
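The matching itself can be illustrated without video files or OpenCV. The sketch below, with the hypothetical name `template_match_frames`, scores every window of every frame by sum of squared differences (a stand-in for cv2.matchTemplate) and returns the best-matching frame index plus a per-frame score dict, mirroring the (frame index, results dict) shape of the return value described above:

```python
from typing import Dict, Tuple
import numpy as np

def template_match_frames(frames: np.ndarray, template: np.ndarray) -> Tuple[int, Dict[int, float]]:
    """Return the index of the frame whose best window most closely matches
    `template` (lowest sum of squared differences), plus per-frame best scores.
    Hypothetical single-process sketch; the real method parallelizes over cores."""
    th, tw = template.shape
    tmpl = template.astype(np.int64)
    scores = {}
    for idx, frame in enumerate(frames):
        best = np.inf
        # Slide the template over every valid window position.
        for y in range(frame.shape[0] - th + 1):
            for x in range(frame.shape[1] - tw + 1):
                window = frame[y:y + th, x:x + tw].astype(np.int64)
                best = min(best, float(np.sum((window - tmpl) ** 2)))
        scores[idx] = best
    return min(scores, key=scores.get), scores

rng = np.random.default_rng(0)
frames = rng.integers(0, 255, size=(4, 20, 20), dtype=np.uint8)
template = frames[2, 5:12, 3:10].copy()  # crop taken from frame 2
best_frame, scores = template_match_frames(frames, template)
print(best_frame)  # 2, the frame the template was cropped from (SSD = 0)
```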