Mixins
Config reader methods
Methods for reading SimBA configparser.ConfigParser project config and associated project data.
- class simba.mixins.config_reader.ConfigReader(config_path: str, read_video_info: bool = True, create_logger: bool = True)[source]
Bases:
object
Methods for reading SimBA configparser.ConfigParser project config and associated project data.
- Parameters
config_path (str) – Path to the SimBA project_config.ini.
read_video_info (bool) – If True, read the project_folder/logs/video_info.csv file.
- add_missing_ROI_cols(shape_df: DataFrame) DataFrame [source]
Helper to add missing ROI definitions ('Color BGR', 'Thickness', 'Color name') in ROI info dataframes created by the first version of the SimBA ROI user-interface but analyzed using newer versions of SimBA.
- Parameters
shape_df (pd.DataFrame) – Dataframe holding ROI definitions.
- Return pd.DataFrame
Dataframe with 'Color BGR', 'Thickness', and 'Color name' fields added.
- check_multi_animal_status() None [source]
Helper to check if the project is a multi-animal SimBA project.
- create_body_part_dictionary(multi_animal_status: bool, animal_id_lst: list, animal_cnt: int, x_cols: List[str], y_cols: List[str], p_cols: Optional[List[str]] = None, colors: Optional[List[List[Tuple[int, int, int]]]] = None) Dict[str, Union[List[str], List[Tuple]]] [source]
Helper to create dict of dict lookup of body-parts where the keys are animal names, and values are the body-part names.
- Parameters
multi_animal_status (bool) – If True, it is a multi-animal SimBA project.
animal_id_lst (List[str]) – Animal names, e.g., ['Simon', 'JJ']. Note: in a single-animal project, this is overridden and set to Animal_1.
animal_cnt (int) – Number of animals in the SimBA project.
x_cols (List[str]) – Column names for body-part coordinates on the x-axis. Returned by simba.mixins.config_reader.ConfigReader.get_body_part_names().
y_cols (List[str]) – Column names for body-part coordinates on the y-axis. Returned by simba.mixins.config_reader.ConfigReader.get_body_part_names().
p_cols (Optional[List[str]]) – Column names for body-part pose-estimation probability values. Returned by simba.mixins.config_reader.ConfigReader.get_body_part_names().
colors (Optional[List[List[Tuple[int, int, int]]]]) – Optional BGR colors to associate with the body-parts. Returned by simba.utils.data.create_color_palettes().
- Return dict
Nested dictionary with animal names as keys and body-part names, coordinates, and colors as values.
- Example
>>> ConfigReader.create_body_part_dictionary(multi_animal_status=True, animal_id_lst=['simon',])
>>> {'simon': {'X_bps': ['Nose_1_x', 'Ear_left_1_x', 'Ear_right_1_x', 'Center_1_x', 'Lat_left_1_x', 'Lat_right_1_x', 'Tail_base_1_x', 'Tail_end_1_x'], 'Y_bps': ['Nose_1_y', 'Ear_left_1_y', 'Ear_right_1_y', 'Center_1_y', 'Lat_left_1_y', 'Lat_right_1_y', 'Tail_base_1_y', 'Tail_end_1_y'], 'colors': [[255.0, 0.0, 255.0], [223.125, 31.875, 255.0], [191.25, 63.75, 255.0], [159.375, 95.625, 255.0], [127.5, 127.5, 255.0], [95.625, 159.375, 255.0], [63.75, 191.25, 255.0], [31.875, 223.125, 255.0], [0.0, 255.0, 255.0]], 'P_bps': ['Nose_1_p', 'Ear_left_1_p', 'Ear_right_1_p', 'Center_1_p', 'Lat_left_1_p', 'Lat_right_1_p', 'Tail_base_1_p', 'Tail_end_1_p']}, 'jj': {'X_bps': ['Nose_2_x', 'Ear_left_2_x', 'Ear_right_2_x', 'Center_2_x', 'Lat_left_2_x', 'Lat_right_2_x', 'Tail_base_2_x', 'Tail_end_2_x'], 'Y_bps': ['Nose_2_y', 'Ear_left_2_y', 'Ear_right_2_y', 'Center_2_y', 'Lat_left_2_y', 'Lat_right_2_y', 'Tail_base_2_y', 'Tail_end_2_y'], 'colors': [[102.0, 127.5, 0.0], [102.0, 143.4375, 31.875], [102.0, 159.375, 63.75], [102.0, 175.3125, 95.625], [102.0, 191.25, 127.5], [102.0, 207.1875, 159.375], [102.0, 223.125, 191.25], [102.0, 239.0625, 223.125], [102.0, 255.0, 255.0]], 'P_bps': ['Nose_2_p', 'Ear_left_2_p', 'Ear_right_2_p', 'Center_2_p', 'Lat_left_2_p', 'Lat_right_2_p', 'Tail_base_2_p', 'Tail_end_2_p']}}
- drop_bp_cords(df: DataFrame, raise_error: bool = False) DataFrame [source]
Helper to remove pose-estimation fields from dataframe.
- Parameters
df (pd.DataFrame) – pandas dataframe containing pose-estimation fields (body-part x, y, p fields)
raise_error (bool) – If True, raise an error if body-parts cannot be found; else, print a warning.
- Return pd.DataFrame
Dataframe without pose-estimation fields.
- Example
>>> config_reader = ConfigReader(config_path='test/project_folder/project_config.csv')
>>> df = read_df(config_reader.machine_results_paths[0], file_type='csv')
>>> df = config_reader.drop_bp_cords(df=df)
- find_animal_name_from_body_part_name(bp_name: str, bp_dict: dict) str [source]
Given a body-part name and an animal body-part dict, returns the animal name.
- Parameters
bp_name (str) – Name of the body-part, e.g., 'Ear_1'.
bp_dict (dict) – Nested dict holding animal names as keys and body-part names and coordinates as values. Created by simba.mixins.config_reader.ConfigReader.create_body_part_dictionary().
- Return str
- Example
>>> config_reader = ConfigReader(config_path='tests/data/test_projects/two_c57/project_folder/project_config.ini')
>>> ConfigReader.find_animal_name_from_body_part_name(bp_name='Ear_1', bp_dict=config_reader.animal_bp_dict)
>>> 'simon'
- find_video_of_file(video_dir: Union[str, PathLike], filename: str, raise_error: bool = False) Union[str, PathLike] [source]
Helper to find the video file representing a known data file basename.
- Parameters
video_dir (Union[str, os.PathLike]) – Directory holding the putative video file.
filename (str) – Data file name, e.g., 'Video_1'.
raise_error (bool) – If True, raise an error if no video can be found.
- Return Union[str, os.PathLike]
Path to the video file.
- Example
>>> config_reader = ConfigReader(config_path='My_SimBA_Config')
>>> config_reader.find_video_of_file(video_dir=config_reader.video_dir, filename='Video1')
>>> '/project_folder/videos/Video1.mp4'
- get_all_clf_names() List[str] [source]
Helper to return all classifier names in SimBA project
- Return List[str]
- get_body_part_names()[source]
Helper to extract pose-estimation data field names (x, y, p)
- Example
>>> config_reader = ConfigReader(config_path='test/project_config.csv')
>>> config_reader.get_body_part_names()
- get_bp_headers() None [source]
Helper to create ordered list of all column header fields for SimBA project dataframes.
>>> config_reader = ConfigReader(config_path='test/project_folder/project_config.ini')
>>> config_reader.get_bp_headers()
- get_number_of_header_columns_in_df(df: DataFrame) int [source]
Helper to find the count of non-numerical rows at the top of a dataframe.
- Parameters
df (pd.DataFrame) – Dataframe to find the count of non-numerical header rows in.
- Return int
- Raises
DataHeaderError – If all rows are non-numerical.
- Example
>>> ConfigReader.get_number_of_header_columns_in_df(df=pd.DataFrame(data=[[1, 2, 3], [1, 2, 3]]))
>>> 0
>>> ConfigReader.get_number_of_header_columns_in_df(df=pd.DataFrame(data=[['Head_1', 'Body_2', 'Tail_3'], ['Some_nonsense', 'A_mistake', 'Maybe_multi_headers?'], [11, 99, 109], [122, 43, 2091]]))
>>> 2
- insert_column_headers_for_outlier_correction(data_df: DataFrame, new_headers: List[str], filepath: str) DataFrame [source]
Helper to insert new column headers onto a dataframe.
- Parameters
data_df (pd.DataFrame) – Dataframe to insert the new headers into.
new_headers (List[str]) – New column header names.
filepath (str) – Path to the data file the dataframe represents.
- Return pd.DataFrame
Dataframe with the new headers.
- Raises
DataHeaderWarning – If the number of new headers does not match the number of columns in the dataframe.
- Example
>>> df = pd.DataFrame(data=[[1, 2, 3], [1, 2, 3]], columns=['Feature_1', 'Feature_2', 'Feature_3'])
>>> ConfigReader.insert_column_headers_for_outlier_correction(data_df=df, new_headers=['Feature_4', 'Feature_5', 'Feature_6'], filepath='test/my_test_file.csv')
- read_config_entry(config: ConfigParser, section: str, option: str, data_type: typing_extensions.Literal['str', 'int', 'float', 'folder_path'], default_value: Optional[Any] = None, options: Optional[List[Any]] = None) Union[str, int, float] [source]
Helper to read an entry from a configparser.ConfigParser object.
- Parameters
config (ConfigParser) – Parsed SimBA project config
section (str) – Project config section name
option (str) – Project config option name
data_type (str) – Type of data, e.g., 'str', 'int', 'float', 'folder_path'.
default_value (Optional[Any]) – If the entry cannot be found, default to this value.
options (Optional[List[Any]]) – If not None, a list of viable entries.
- Raises
InvalidInputError – If the returned value is not in options.
MissingProjectConfigEntryError – If no entry is found and no default_value is provided.
- Return Union[str, float, int, os.PathLike]
- Example
>>> config = ConfigReader(config_path='tests/data/test_projects/two_c57/project_folder/project_config.ini')
>>> config.read_config_entry(config=config.config, section='Multi animal IDs', option='id_list', data_type='str')
>>> 'simon,jj'
- read_video_info(video_name: str, raise_error: Optional[bool] = True) Tuple[pd.DataFrame, float, float][source]
Helper to read the meta-data (pixels per mm, resolution, fps) from the video_info.csv for a single input file.
- Parameters
video_name (str) – Name of the video to retrieve the metadata for.
raise_error (Optional[bool]) – If True, raise an error if the video metadata cannot be found. Default: True.
- Raises
ParametersFileError – If raise_error and the video metadata is not found.
DuplicationError – If the file contains multiple entries for the same video.
- Return (pd.DataFrame, float, float)
All video info, pixels per mm, and fps.
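The lookup this method performs can be sketched in plain pandas. This is an illustrative equivalent, not SimBA's implementation; the helper name lookup_video_info and the column names 'Video', 'fps', and 'pixels/mm' are assumptions about the video_info.csv layout.

```python
import pandas as pd

def lookup_video_info(video_info_df: pd.DataFrame, video_name: str):
    """Sketch: return (rows, pixels per mm, fps) for one video, raising on
    missing or duplicated entries as the docstring above describes."""
    rows = video_info_df[video_info_df["Video"] == video_name]
    if len(rows) == 0:
        raise ValueError(f"No video_info entry found for {video_name}")
    if len(rows) > 1:
        raise ValueError(f"Duplicate video_info entries found for {video_name}")
    return rows, float(rows["pixels/mm"].values[0]), float(rows["fps"].values[0])

# Hypothetical video_info table for illustration.
video_info = pd.DataFrame(
    {"Video": ["Video_1", "Video_2"], "fps": [30.0, 25.0], "pixels/mm": [4.56, 3.2]}
)
_, px_per_mm, fps = lookup_video_info(video_info, "Video_1")
```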
- read_video_info_csv(file_path: str) DataFrame [source]
Helper to read the project_folder/logs/video_info.csv of the SimBA project into a pd.DataFrame.
- Parameters
file_path (str) – Path to the video_info.csv file.
- Return pd.DataFrame
- remove_a_folder(folder_dir: str, raise_error: Optional[bool] = False) None [source]
Helper to remove single directory.
- Parameters
folder_dir (str) – Directory to remove.
raise_error (bool) – If True, raise NotDirectoryError if the folder does not exist.
- Raises
NotDirectoryError – If raise_error and the directory does not exist.
- Example
>>> self.remove_a_folder(folder_dir='gerbil/gerbil_data/featurized_data/temp')
- remove_multiple_folders(folders: List[PathLike], raise_error: Optional[bool] = False) None [source]
Helper to remove multiple directories.
- Parameters
folders (List[os.PathLike]) – List of directory paths.
raise_error (bool) – If True, raise NotDirectoryError if a folder does not exist; if False, pass. Default: False.
- Raises
NotDirectoryError – If raise_error and a directory does not exist.
- Example
>>> self.remove_multiple_folders(folders=['gerbil/gerbil_data/featurized_data/temp'])
- remove_roi_features(data_dir: Union[str, PathLike]) None [source]
Helper to remove ROI-based features from datasets within a directory. The identified ROI-based fields are moved to the project_folder/logs/ROI_data_{datetime} directory.
Note
ROI-based features are identified by the combined criteria of (i) the prefix of the field being a named ROI in the project_folder/logs/ROI_definitions.h5 file, and (ii) the suffix of the field being contained in ['in zone', 'in zone_cumulative_time', 'in zone_cumulative_percent', 'distance', 'facing'].
- Parameters
data_dir (Union[str, os.PathLike]) – Directory with data to remove ROI features from.
- Example
>>> self.remove_roi_features('/project_folder/csv/features_extracted')
Feature extraction methods
- class simba.mixins.feature_extraction_mixin.FeatureExtractionMixin(config_path: Optional[str] = None)[source]
Bases:
object
Methods for featurizing pose-estimation data.
- Parameters
config_path (Optional[str]) – Path to the SimBA project_config.ini.
- static angle3pt(ax: float, ay: float, bx: float, by: float, cx: float, cy: float) float [source]
Jitted helper for single frame 3-point angle.
See also
For 3-point angles across multiple frames and improved runtime, see simba.mixins.feature_extraction_mixin.FeatureExtractionMixin.angle3pt_serialized().
- Example
>>> FeatureExtractionMixin.angle3pt(ax=122.0, ay=198.0, bx=237.0, by=138.0, cx=191.0, cy=109)
>>> 59.78156901181637
- static angle3pt_serialized(data: ndarray) ndarray [source]
Jitted helper for frame-wise 3-point angles.
- Parameters
data (ndarray) – 2D numerical array with frame number on x and [ax, ay, bx, by, cx, cy] on y.
- Return ndarray
1d float numerical array of size data.shape[0] with angles.
- Examples
>>> coordinates = np.random.randint(1, 10, size=(6, 6))
>>> FeatureExtractionMixin.angle3pt_serialized(data=coordinates)
>>> [ 67.16634582, 1.84761027, 334.23067238, 258.69006753, 11.30993247, 288.43494882]
- static cdist(array_1: ndarray, array_2: ndarray) ndarray [source]
Jitted analogue of scipy.spatial.distance.cdist for two 2D arrays. Use to calculate Euclidean distances between all coordinates in one array and all coordinates in a second array. E.g., computes the distances between all body-parts of one animal and all body-parts of a second animal.
- Parameters
array_1 (np.ndarray) – 2D array of body-part coordinates
array_2 (np.ndarray) – 2D array of body-part coordinates
- Return np.ndarray
2D array of Euclidean distances between body-parts in array_1 and array_2.
- Example
>>> array_1 = np.random.randint(1, 10, size=(3, 2)).astype(np.float32)
>>> array_2 = np.random.randint(1, 10, size=(3, 2)).astype(np.float32)
>>> FeatureExtractionMixin.cdist(array_1=array_1, array_2=array_2)
>>> [[7.07106781, 1. , 3.60555124],
>>> [3.60555124, 6.3245554 , 2. ],
>>> [3.1622777 , 5.38516474, 4.12310553]]
- static cdist_3d(data: ndarray) ndarray [source]
Jitted analogue of scipy.spatial.distance.cdist for a 3D array. Use to calculate Euclidean distances between all coordinates of one array and itself.
- Parameters
data (np.ndarray) – 3D array of body-part coordinates of size len(frames) x -1 x 2.
- Return np.ndarray
3D array of size data.shape[0], data.shape[1], data.shape[1].
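The per-frame self-distance computation can be sketched with scipy. This is an illustrative equivalent of what the jitted method computes, under the documented in/out shapes; the helper name cdist_3d_sketch is an assumption.

```python
import numpy as np
from scipy.spatial.distance import cdist

def cdist_3d_sketch(data: np.ndarray) -> np.ndarray:
    """For each frame (2D slice), compute pairwise Euclidean distances
    between all coordinates in that slice and itself."""
    out = np.zeros((data.shape[0], data.shape[1], data.shape[1]))
    for i in range(data.shape[0]):
        out[i] = cdist(data[i], data[i])
    return out

# Two frames, three body-parts, (x, y) coordinates.
frames = np.array([[[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]],
                   [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0]]])
distances = cdist_3d_sketch(frames)
```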
- change_in_bodypart_euclidean_distance(location_1: ndarray, location_2: ndarray, fps: int, px_per_mm: float, time_windows: ndarray = array([0.2, 0.4, 0.8, 1.6])) ndarray [source]
Computes the difference between the distance of two body-parts in the current frame versus N seconds ago. Used for computing whether animal body-parts are traveling away from or towards each other within defined time-windows.
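The computation can be sketched in numpy: distance between the two body-parts now, minus the same distance one time-window earlier. An illustrative sketch of the described behavior, not SimBA's implementation; the helper name, the zero-padding of the first frames, and the window-to-lag rounding are assumptions.

```python
import numpy as np

def distance_change_sketch(location_1, location_2, fps, px_per_mm, time_windows):
    """Sketch: frame-wise distance between two body-parts in millimeters,
    minus that distance time_window seconds earlier (0 before the window)."""
    dist_mm = np.linalg.norm(location_1 - location_2, axis=1) / px_per_mm
    out = np.zeros((dist_mm.shape[0], len(time_windows)))
    for i, w in enumerate(time_windows):
        lag = int(w * fps)  # assumes lag >= 1 frame
        out[lag:, i] = dist_mm[lag:] - dist_mm[:-lag]
    return out

# One body-part static, the other moving away at 10 px/frame.
loc_1 = np.array([[0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0]])
loc_2 = np.array([[10.0, 0.0], [20.0, 0.0], [30.0, 0.0], [40.0, 0.0]])
changes = distance_change_sketch(loc_1, loc_2, fps=10, px_per_mm=1.0, time_windows=[0.2])
```

Positive values indicate the body-parts moved apart over the window; negative values indicate approach.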
- check_directionality_cords() dict [source]
Helper to check if ear and nose body-parts are present within the pose-estimation data.
- Return dict
Body-part names of ear and nose body-parts as values and animal names as keys. If empty, ear and nose body-parts are not present within the pose-estimation data
- check_directionality_viable()[source]
Check if it is possible to calculate directionality statistics (i.e., nose and ear coordinates from pose-estimation have to be present).
- Return bool
If True, directionality is viable. Else, not viable.
- Return np.ndarray nose_coord
If viable, then 2D array with coordinates of the nose in all frames. Else, empty array.
- Return np.ndarray ear_left_coord
If viable, then 2D array with coordinates of the left ear in all frames. Else, empty array.
- Return np.ndarray ear_right_coord
If viable, then 2D array with coordinates of the right ear in all frames. Else, empty array.
- static convex_hull_calculator_mp(arr: ndarray, px_per_mm: float) float [source]
Calculate single frame convex hull perimeter length in millimeters.
See also
For acceptable run-time, call using parallel.delayed. For large data, use simba.feature_extractors.perimeter_jit.jitted_hull(), which returns perimeter length OR area.
- Parameters
arr (np.ndarray) – 2D array of size len(body-parts) x 2.
px_per_mm (float) – Video pixels per millimeter.
- Return float
The length of the animal perimeter in millimeters.
- Example
>>> coordinates = np.random.randint(1, 200, size=(6, 2)).astype(np.float32)
>>> FeatureExtractionMixin.convex_hull_calculator_mp(arr=coordinates, px_per_mm=4.56)
>>> 98.6676814218373
- static cosine_similarity(data: ndarray) ndarray [source]
Jitted analogue of sklearn.metrics.pairwise.cosine_similarity. Similar to scipy.cdist; calculates the cosine similarity between all pairs in a 2D array.
- Example
>>> data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]).astype(np.float32)
>>> FeatureExtractionMixin().cosine_similarity(data=data)
>>> [[1.0, 0.974, 0.959], [0.974, 1.0, 0.998], [0.959, 0.998, 1.0]]
- static count_values_in_range(data: ndarray, ranges: ndarray) ndarray [source]
Jitted helper finding the count of values that fall within ranges. E.g., count the number of pose-estimated body-parts that fall within defined brackets of probabilities per frame.
- Parameters
data (np.ndarray) – 2D numpy array with frames on X.
ranges (np.ndarray) – 2D numpy array representing the brackets. E.g., [[0, 0.1], [0.1, 0.5]]
- Return np.ndarray
2D numpy array of size data.shape[0], ranges.shape[1]
- Example
>>> FeatureExtractionMixin.count_values_in_range(data=np.random.random((3,10)), ranges=np.array([[0.0, 0.25], [0.25, 0.5]]))
>>> [[6, 1], [3, 2], [2, 1]]
- static create_shifted_df(df: DataFrame, periods: int = 1) DataFrame [source]
Create dataframe including duplicated shifted (periods=1) columns with _shifted suffix.
- Parameters
df (pd.DataFrame) – Input dataframe.
- Return pd.DataFrame
Dataframe including original and shifted columns.
- Example
>>> df = pd.DataFrame(np.random.randint(0,100,size=(3, 1)), columns=['Feature_1'])
>>> FeatureExtractionMixin.create_shifted_df(df=df)
>>>    Feature_1  Feature_1_shifted
>>> 0         76               76.0
>>> 1         41               76.0
>>> 2         89               41.0
- dataframe_gaussian_smoother(df: DataFrame, fps: int, time_window: int = 100) DataFrame [source]
Column-wise Gaussian smoothing of dataframe.
- Parameters
df (pd.DataFrame) – Dataframe to smooth.
fps (int) – Framerate of the video representing the data.
time_window (int) – Smoothing time window. Default: 100.
- Return pd.DataFrame
Dataframe with smoothed data.
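Column-wise Gaussian smoothing can be sketched with pandas' Gaussian-weighted rolling mean. An illustrative sketch only: the interpretation of time_window as milliseconds, the std value, and the centering are assumptions, not SimBA's exact parameters.

```python
import numpy as np
import pandas as pd

def gaussian_smooth_sketch(df: pd.DataFrame, fps: int, time_window_ms: int = 100) -> pd.DataFrame:
    """Sketch: rolling Gaussian-weighted mean over a window covering
    time_window_ms milliseconds of frames."""
    frames_in_window = max(int(fps * (time_window_ms / 1000.0)), 1)
    return (df.rolling(window=frames_in_window, win_type="gaussian",
                       center=True, min_periods=1)
              .mean(std=5))  # std chosen arbitrarily for illustration

df = pd.DataFrame({"Nose_x": np.arange(20, dtype=np.float64)})
smoothed = gaussian_smooth_sketch(df, fps=30, time_window_ms=200)
```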
- dataframe_savgol_smoother(df: DataFrame, fps: int, time_window: int = 150) DataFrame [source]
Column-wise Savitzky-Golay smoothing of dataframe.
- Parameters
df (pd.DataFrame) – Dataframe to smooth.
fps (int) – Framerate of the video representing the data.
time_window (int) – Smoothing time window. Default: 150.
- Return pd.DataFrame
Dataframe with smoothed data.
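Savitzky-Golay smoothing can be sketched with scipy.signal.savgol_filter. An illustrative sketch only: the millisecond interpretation of time_window, the odd-window adjustment, and polyorder=3 are assumptions, not SimBA's exact parameters.

```python
import numpy as np
import pandas as pd
from scipy.signal import savgol_filter

def savgol_smooth_sketch(df: pd.DataFrame, fps: int, time_window_ms: int = 150) -> pd.DataFrame:
    """Sketch: per-column Savitzky-Golay filter over a window covering
    time_window_ms milliseconds of frames."""
    window = int(fps * (time_window_ms / 1000.0))
    if window % 2 == 0:
        window += 1          # savgol_filter requires an odd window length
    window = max(window, 5)  # keep window above the polynomial order
    out = df.copy()
    for col in df.columns:
        out[col] = savgol_filter(df[col].values, window_length=window, polyorder=3)
    return out

# A linear ramp is reproduced exactly by a degree-3 Savitzky-Golay fit.
df = pd.DataFrame({"Nose_x": np.linspace(0.0, 10.0, 50)})
smoothed = savgol_smooth_sketch(df, fps=30, time_window_ms=150)
```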
- static euclidean_distance(bp_1_x: ndarray, bp_2_x: ndarray, bp_1_y: ndarray, bp_2_y: ndarray, px_per_mm: float) ndarray [source]
Helper to compute the Euclidean distance in millimeters between two body-parts in all frames of a video
See also
Use simba.mixins.feature_extraction_mixin.FeatureExtractionMixin.framewise_euclidean_distance() for improved run-times.
- Parameters
bp_1_x (np.ndarray) – 2D array of size len(frames) x 1 with bodypart 1 x-coordinates.
bp_2_x (np.ndarray) – 2D array of size len(frames) x 1 with bodypart 2 x-coordinates.
bp_1_y (np.ndarray) – 2D array of size len(frames) x 1 with bodypart 1 y-coordinates.
bp_2_y (np.ndarray) – 2D array of size len(frames) x 1 with bodypart 2 y-coordinates.
px_per_mm (float) – Pixels per millimeter of the video.
- Return np.ndarray
2D array of size len(frames) x 1 with distances between body-part 1 and 2 in millimeters
- Example
>>> x1, x2 = np.random.randint(1, 10, size=(10, 1)), np.random.randint(1, 10, size=(10, 1))
>>> y1, y2 = np.random.randint(1, 10, size=(10, 1)), np.random.randint(1, 10, size=(10, 1))
>>> FeatureExtractionMixin.euclidean_distance(bp_1_x=x1, bp_2_x=x2, bp_1_y=y1, bp_2_y=y2, px_per_mm=4.56)
- static find_midpoints(bp_1: ndarray, bp_2: ndarray, percentile: float = 0.5) ndarray [source]
Compute the midpoints between two sets of 2D points based on a given percentile.
- Parameters
bp_1 (np.ndarray) – An array of 2D points representing the first set of points. Rows represent frames; the first column represents x-coordinates and the second column y-coordinates.
bp_2 (np.ndarray) – An array of 2D points representing the second set of points. Rows represent frames; the first column represents x-coordinates and the second column y-coordinates.
percentile (float) – The percentile value determining the distance between the points used for calculating midpoints. When set to 0.5, the midpoint lies halfway between the two points.
- Returns
np.ndarray: An array of 2D points representing the midpoints between the points in bp_1 and bp_2 based on the specified percentile.
- Example
>>> bp_1 = np.array([[1, 3], [30, 10]]).astype(np.int64)
>>> bp_2 = np.array([[10, 4], [20, 1]]).astype(np.int64)
>>> FeatureExtractionMixin().find_midpoints(bp_1=bp_1, bp_2=bp_2, percentile=0.5)
>>> [[ 5, 3], [25, 6]]
- static framewise_euclidean_distance(location_1: ndarray, location_2: ndarray, px_per_mm: float, centimeter: bool = False) ndarray [source]
Jitted helper finding frame-wise distances between two moving locations in millimeter or centimeter.
- Parameters
location_1 (np.ndarray) – 2D array of size len(frames) x 2 with the first location coordinates.
location_2 (np.ndarray) – 2D array of size len(frames) x 2 with the second location coordinates.
px_per_mm (float) – Pixels per millimeter of the video.
centimeter (bool) – If True, return distances in centimeters rather than millimeters. Default: False.
- Return np.ndarray
1D array of size location_1.shape[0].
- Example
>>> loc_1 = np.random.randint(1, 200, size=(6, 2)).astype(np.float32)
>>> loc_2 = np.random.randint(1, 200, size=(6, 2)).astype(np.float32)
>>> FeatureExtractionMixin.framewise_euclidean_distance(location_1=loc_1, location_2=loc_2, px_per_mm=4.56, centimeter=False)
>>> [49.80098657, 46.54963644, 49.60650394, 70.35919993, 37.91069901, 71.95422524]
- static framewise_euclidean_distance_roi(location_1: ndarray, location_2: ndarray, px_per_mm: float, centimeter: bool = False) ndarray [source]
Find frame-wise distances between a moving location (location_1) and static location (location_2) in millimeter or centimeter.
- Parameters
location_1 (np.ndarray) – 2D array of size len(frames) x 2 with the moving location coordinates.
location_2 (np.ndarray) – 2D array of size 1 x 2 with the static location coordinates.
px_per_mm (float) – Pixels per millimeter of the video.
centimeter (bool) – If True, return distances in centimeters rather than millimeters. Default: False.
- Return np.ndarray
1D array of size location_1.shape[0].
- Example
>>> loc_1 = np.random.randint(1, 200, size=(6, 2)).astype(np.float32)
>>> loc_2 = np.random.randint(1, 200, size=(1, 2)).astype(np.float32)
>>> FeatureExtractionMixin.framewise_euclidean_distance_roi(location_1=loc_1, location_2=loc_2, px_per_mm=4.56, centimeter=False)
>>> [11.31884926, 13.84534585, 6.09712224, 17.12773976, 19.32066031, 12.18043378]
>>> FeatureExtractionMixin.framewise_euclidean_distance_roi(location_1=loc_1, location_2=loc_2, px_per_mm=4.56, centimeter=True)
>>> [1.13188493, 1.38453458, 0.60971222, 1.71277398, 1.93206603, 1.21804338]
- static framewise_inside_polygon_roi(bp_location: ndarray, roi_coords: ndarray) ndarray [source]
Jitted helper for frame-wise detection if animal is inside static polygon ROI.
Note
Modified from epifanio
- Parameters
bp_location (np.ndarray) – 2d numeric np.ndarray size len(frames) x 2
roi_coords (np.ndarray) – 2d numeric np.ndarray size len(polygon points) x 2
- Return ndarray
2d numeric boolean np.ndarray size len(frames) x 1 with 0 representing outside the polygon and 1 representing inside the polygon
- Example
>>> bp_loc = np.random.randint(1, 10, size=(6, 2)).astype(np.float32)
>>> roi_coords = np.random.randint(1, 10, size=(10, 2)).astype(np.float32)
>>> FeatureExtractionMixin.framewise_inside_polygon_roi(bp_location=bp_loc, roi_coords=roi_coords)
>>> [0, 0, 0, 1]
- static framewise_inside_rectangle_roi(bp_location: ndarray, roi_coords: ndarray) ndarray [source]
Jitted helper for frame-wise analysis if animal is inside static rectangular ROI.
- Parameters
bp_location (np.ndarray) – 2d numeric np.ndarray size len(frames) x 2
roi_coords (np.ndarray) – 2d numeric np.ndarray of size 2x2 (top left [x, y], bottom right [x, y])
- Return ndarray
2d numeric boolean np.ndarray size len(frames) x 1 with 0 representing outside the rectangle and 1 representing inside the rectangle
- Example
>>> bp_loc = np.random.randint(1, 10, size=(6, 2)).astype(np.float32)
>>> roi_coords = np.random.randint(1, 10, size=(2, 2)).astype(np.float32)
>>> FeatureExtractionMixin.framewise_inside_rectangle_roi(bp_location=bp_loc, roi_coords=roi_coords)
>>> [0, 0, 0, 0, 0, 0]
- get_bp_headers() None [source]
Helper to create ordered list of all column header fields for SimBA project dataframes.
- get_feature_extraction_headers(pose: str) List[str] [source]
Helper to return the header names (body-part location columns) that should be used during feature extraction.
- Parameters
pose (str) – Pose-estimation setting, e.g., '16'.
- Return List[str]
The names and order of the pose-estimation columns.
- insert_default_headers_for_feature_extraction(df: DataFrame, headers: List[str], pose_config: str, filename: str) DataFrame [source]
Helper to insert correct body-part column names prior to default feature extraction methods.
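The core of such a helper can be sketched in plain pandas: verify the header count matches the column count, then assign, naming the offending file in any error. An illustrative sketch; the helper name and error wording are assumptions.

```python
import pandas as pd

def insert_headers_sketch(df: pd.DataFrame, headers, filename: str) -> pd.DataFrame:
    """Sketch: replace column names after verifying the count matches."""
    if len(headers) != len(df.columns):
        raise ValueError(
            f"{filename}: expected {len(df.columns)} columns, got {len(headers)} headers")
    out = df.copy()
    out.columns = headers
    return out

df = pd.DataFrame([[1, 2, 3]], columns=["a", "b", "c"])
renamed = insert_headers_sketch(df, headers=["Nose_x", "Nose_y", "Nose_p"], filename="Video_1.csv")
```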
- static jitted_line_crosses_to_nonstatic_targets(left_ear_array: ndarray, right_ear_array: ndarray, nose_array: ndarray, target_array: ndarray) ndarray [source]
Jitted helper to calculate if an animal is directing towards another animals body-part coordinate, given the target body-part and the left ear, right ear, and nose coordinates of the observer.
Note
Input left ear, right ear, and nose coordinates of the observer are returned by simba.mixins.feature_extraction_mixin.FeatureExtractionMixin.check_directionality_viable().
- Parameters
left_ear_array (np.ndarray) – 2D array of size len(frames) x 2 with the coordinates of the observer animal's left ear
right_ear_array (np.ndarray) – 2D array of size len(frames) x 2 with the coordinates of the observer animal's right ear
nose_array (np.ndarray) – 2D array of size len(frames) x 2 with the coordinates of the observer animal's nose
target_array (np.ndarray) – 2D array of size len(frames) x 2 with the target body-part location
- Return np.ndarray
2D array of size len(frames) x 4. The first column represents the side of the observer that the target is in view of: 0 = left side, 1 = right side, 2 = not in view. The second and third columns represent the x and y location of the observer animal's 'eye' (half-way between the ear and the nose). The fourth column represents if the target is in view (bool).
- static jitted_line_crosses_to_static_targets(left_ear_array: ndarray, right_ear_array: ndarray, nose_array: ndarray, target_array: ndarray) ndarray [source]
Jitted helper to calculate if an animal is directing towards a static location (ROI centroid), given the target location and the left ear, right ear, and nose coordinates of the observer.
Note
Input left ear, right ear, and nose coordinates of the observer are returned by simba.mixins.feature_extraction_mixin.FeatureExtractionMixin.check_directionality_viable().
- Parameters
left_ear_array (np.ndarray) – 2D array of size len(frames) x 2 with the coordinates of the observer animal's left ear
right_ear_array (np.ndarray) – 2D array of size len(frames) x 2 with the coordinates of the observer animal's right ear
nose_array (np.ndarray) – 2D array of size len(frames) x 2 with the coordinates of the observer animal's nose
target_array (np.ndarray) – 1D array with the x, y coordinates of the target location
- Return np.ndarray
2D array of size len(frames) x 4. The first column represents the side of the observer that the target is in view of: 0 = left side, 1 = right side, 2 = not in view. The second and third columns represent the x and y location of the observer animal's 'eye' (half-way between the ear and the nose). The fourth column represents if the target is in view (bool).
- static line_crosses_to_static_targets(p: List[float], q: List[float], n: List[float], M: List[float], coord: List[float]) (bool, List[float])[source]
Legacy non-jitted helper to calculate if an animal is directing towards a static coordinate (e.g., ROI centroid).
- Parameters
- Return bool
If True, static coordinate is in view.
- Return List
If True, the coordinate of the observing animal's 'eye' (half-way between nose and ear).
- static minimum_bounding_rectangle(points: ndarray) ndarray [source]
Finds the minimum bounding rectangle from convex hull vertices.
- Parameters
points (np.ndarray) – 2D array representing the convex hull vertices of the animal.
- Return np.ndarray
2D array representing the minimum bounding rectangle of the convex hull vertices of the animal.
Note
Modified from JesseBuesking. See simba.mixins.feature_extractors.perimeter_jit.jitted_hull() for computing the convex hull vertices. TODO: Place in numba njit.
- Example
>>> points = np.random.randint(1, 10, size=(10, 2))
>>> FeatureExtractionMixin.minimum_bounding_rectangle(points=points)
>>> [[10.7260274 , 3.39726027], [ 1.4109589 , -0.09589041], [-0.31506849, 4.50684932], [ 9., 8. ]]
- static windowed_frequentist_distribution_tests(data: ndarray, feature_name: str, fps: int) DataFrame [source]
Calculates feature value distributions and feature peak counts in 1-s sequential time-bins.
Computes (i) feature value distributions in 1-s sequential time-bins: Kolmogorov-Smirnov and T-tests; (ii) feature values against a normal distribution: Shapiro-Wilk; (iii) peak counts in rolling 1-s feature windows: scipy.signal.find_peaks.
- Parameters
data (np.ndarray) – Single feature 1D array
feature_name (str) – The name of the input feature.
fps (int) – The framerate of the video representing the data.
- Return pd.DataFrame
Of size len(data) x 4 with columns representing KS, T, Shapiro-Wilk, and peak count statistics.
- Example
>>> feature_data = np.random.randint(1, 10, size=(100))
>>> FeatureExtractionMixin.windowed_frequentist_distribution_tests(data=feature_data, fps=25, feature_name='Anima_1_velocity')
Geometry transformation methods
- class simba.mixins.geometry_mixin.GeometryMixin[source]
Bases:
object
Methods to perform geometry transformation of pose-estimation data. This includes creating bounding boxes, line objects, circles etc. from pose-estimated body-parts and computing metric representations of the relationships between created shapes or their attributes (sizes, distances etc.).
As of 01/24, very much a work in progress that relies heavily on shapely.
Note
These methods generally do not create visualizations - they mainly generate geometry data-objects or metrics. To create visualizations with geometries overlay on videos, pass returned shapes to simba.plotting.geometry_plotter.GeometryPlotter.
- static adjust_geometry_locations(geometries: List[Polygon], shift: Tuple[int, int], minimum: Optional[Tuple[int, int]] = (0, 0), maximum: Optional[Tuple[int, int]] = (inf, inf)) List[Polygon] [source]
Shift a set of geometries a specified distance in the x- and/or y-axis.
- Parameters
geometries (List[Polygon]) – List of input polygons to be adjusted.
shift (Tuple[int, int]) – Tuple specifying the shift distances in the x and y-axis.
minimum (Optional[Tuple[int, int]]) – Minimum allowed coordinates of Polygon points on the x and y axes. Default: (0, 0).
maximum (Optional[Tuple[int, int]]) – Maximum allowed coordinates of Polygon points on x and y axes. Default: (np.inf, np.inf).
- Return List[Polygon]
List of adjusted polygons.
- Example
>>> shapes = GeometryMixin().adjust_geometry_locations(geometries=shapes, shift=(0, 333))
- static area(shape: Union[MultiPolygon, Polygon], pixels_per_mm: float)[source]
Calculate the area of a geometry in square millimeters.
Note
If certain that the input data is a valid Polygon, consider using simba.feature_extractors.perimeter_jit.jitted_hull().
- Parameters
shape (Union[MultiPolygon, Polygon]) – The geometry (MultiPolygon or Polygon) for which to calculate the area.
pixels_per_mm (float) – The pixel-to-millimeter conversion factor.
- Return float
The area of the geometry in square millimeters.
- Example
>>> polygon = GeometryMixin().bodyparts_to_polygon(np.array([[10, 10], [10, 100], [100, 10], [100, 100]]))
>>> GeometryMixin().area(shape=polygon, pixels_per_mm=4.9)
>>> 1701.556313816644
- static bodyparts_to_circle(data: ndarray, parallel_offset: float, pixels_per_mm: Optional[int] = 1) Polygon [source]
Create a circle geometry from a single body-part (x,y) coordinate.
Note
For multiple frames, call this method using multiframe_bodyparts_to_circle().
- Parameters
data (np.ndarray) – The body-part coordinate xy as a 1d array. E.g., np.array([364, 308])
parallel_offset (float) – The radius of the resultant circle in millimeters.
pixels_per_mm (int) – The pixels per millimeter of the video. If not passed, 1 is used, meaning the radius is in pixels rather than millimeters.
- Return Polygon
Shapely Polygon of circular shape.
- Example
>>> data = np.array([364, 308])
>>> polygon = GeometryMixin().bodyparts_to_circle(data=data, parallel_offset=10, pixels_per_mm=4)
- static bodyparts_to_line(data: ndarray, buffer: Optional[int] = None, px_per_mm: Optional[float] = None) Union[Polygon, LineString] [source]
Convert body-part coordinates to a Linestring.
Note
If buffer and px_per_mm are provided, the returned object will be a LineString buffered into a 2D rectangle with the specified area.
- Example
>>> data = np.array([[364, 308],[383, 323], [403, 335],[423, 351]])
>>> line = GeometryMixin().bodyparts_to_line(data=data)
>>> line = GeometryMixin().bodyparts_to_line(data=data, buffer=10, px_per_mm=4)
- static bodyparts_to_multistring_skeleton(data: ndarray) MultiLineString [source]
Create a multistring skeleton from a 3d array where each 2d array represents start and end coordinates of a line within the skeleton.
- Parameters
data (np.ndarray) – A 3D numpy array where each 2D array represents the start position and end position of each LineString.
- Returns MultiLineString
Shapely MultiLineString representing animal skeleton.
- Example
>>> skeleton = np.array([[[5, 5], [1, 10]], [[5, 5], [9, 10]], [[9, 10], [1, 10]], [[9, 10], [9, 25]], [[1, 10], [1, 25]], [[9, 25], [5, 50]], [[1, 25], [5, 50]]])
>>> shape_multistring = GeometryMixin().bodyparts_to_multistring_skeleton(data=skeleton)
- static bodyparts_to_points(data: ndarray, buffer: Optional[int] = None, px_per_mm: Optional[int] = None) List[Union[Point, Polygon]] [source]
Convert body-part coordinates to Point geometries.
- Parameters
data (np.ndarray) – 2D array with body-part coordinates where rows are frames and columns are x and y coordinates.
buffer (Optional[int]) – If not None, then the area of the Point. Thus, if not None, then returns Polygons representing the Points.
px_per_mm (Optional[int]) – Pixels to millimeter conversion factor. Required if buffer is not None.
- Example
>>> data = np.random.randint(0, 100, (1, 2))
>>> GeometryMixin().bodyparts_to_points(data=data)
- static bodyparts_to_polygon(data: ndarray, cap_style: typing_extensions.Literal['round', 'square', 'flat'] = 'round', parallel_offset: int = 1, pixels_per_mm: int = 1, simplify_tolerance: float = 2, preserve_topology: bool = True) Polygon [source]
Convert body-part coordinates to a Polygon.
- Example
>>> data = [[[364, 308],[383, 323],[403, 335], [423, 351]]]
>>> GeometryMixin().bodyparts_to_polygon(data=data)
- static bucket_img_into_grid_hexagon(bucket_size_mm: float, img_size: Tuple[int, int], px_per_mm: float) Tuple[Dict[Tuple[int, int], Polygon], float] [source]
Bucketize an image into hexagons and return a dictionary of polygons representing the hexagon locations.
- Parameters
bucket_size_mm (float) – The width/height of each hexagon bucket in millimeters.
img_size (Tuple[int, int]) – Size of the image in pixels (width, height).
px_per_mm (float) – Pixels per millimeter conversion factor.
- Return Tuple[Dict[Tuple[int, int], Polygon], float]
First value is a dictionary where keys are (row, column) indices of the bucket, and values are Shapely Polygon objects representing the corresponding hexagon buckets. Second value is the aspect ratio of the hexagonal grid.
- Example
>>> polygons, aspect_ratio = GeometryMixin().bucket_img_into_grid_hexagon(bucket_size_mm=10, img_size=(800, 600), px_per_mm=5.0)
- static bucket_img_into_grid_points(point_distance: int, px_per_mm: float, img_size: Tuple[int, int], border_sites: Optional[bool] = True) Dict[Tuple[int, int], Point] [source]
Generate a grid of evenly spaced points within an image. Use for creating spatial markers within an arena.
- Parameters
point_distance (int) – Distance between adjacent points in millimeters.
px_per_mm (float) – Pixels per millimeter conversion factor.
img_size (Tuple[int, int]) – Size of the image in pixels (width, height).
border_sites (Optional[bool]) – If True, includes points on the border of the image. Default is True.
- Returns Dict[Tuple[int, int], Point]
Dictionary where keys are (row, column) indices of the point, and values are Shapely Point objects.
- Example
>>> GeometryMixin.bucket_img_into_grid_points(point_distance=20, px_per_mm=4, img_size=img.shape, border_sites=False)
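The spacing logic can be sketched in pure Python: convert the millimeter distance to pixels and step across the image, optionally dropping border rows and columns (`grid_points` is a simplified illustration, not the exact SimBA implementation):

```python
def grid_points(point_distance_mm, px_per_mm, width, height, border_sites=True):
    """Return {(row, col): (x, y)} of evenly spaced pixel coordinates in an image."""
    step = int(point_distance_mm * px_per_mm)          # spacing in pixels
    xs = list(range(0, width + 1, step))
    ys = list(range(0, height + 1, step))
    if not border_sites:                               # drop points on the image border
        xs = [x for x in xs if 0 < x < width]
        ys = [y for y in ys if 0 < y < height]
    return {(r, c): (x, y) for r, y in enumerate(ys) for c, x in enumerate(xs)}

# 20 mm spacing at 4 px/mm = one point every 80 px:
pts = grid_points(point_distance_mm=20, px_per_mm=4, width=320, height=160)
```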
- static bucket_img_into_grid_square(img_size: Iterable[int], bucket_grid_size_mm: Optional[float] = None, bucket_grid_size: Optional[Iterable[int]] = None, px_per_mm: Optional[float] = None, add_correction: Optional[bool] = True) Tuple[Dict[Tuple[int, int], Polygon], float] [source]
Bucketize an image into squares and return a dictionary of polygons representing the bucket locations.
- Parameters
img_size (Iterable[int]) – 2-value tuple, list or array representing the width and height of the image in pixels.
bucket_grid_size_mm (Optional[float]) – The width/height of each square bucket in millimeters. E.g., 50 will create 5cm by 5cm squares. If None, then buckets will be defined by the bucket_grid_size argument.
bucket_grid_size (Optional[Iterable[int]]) – 2-value tuple, list or array representing the grid as number of horizontal squares x number of vertical squares. If None, then buckets will be defined by the bucket_grid_size_mm argument.
px_per_mm (Optional[float]) – Pixels per millimeter conversion factor. Necessary if buckets are defined by the bucket_grid_size_mm argument.
add_correction (Optional[bool]) – If True, performs correction by adding extra columns or rows to cover any remaining space when using bucket_grid_size_mm. Default True.
- Example
>>> img = cv2.imread('/Users/simon/Desktop/Screenshot 2024-01-21 at 10.15.55 AM.png', 1)
>>> polygons = GeometryMixin().bucket_img_into_grid_square(bucket_grid_size=(10, 5), bucket_grid_size_mm=None, img_size=(img.shape[1], img.shape[0]), px_per_mm=5.0)
>>> for k, v in polygons[0].items(): cv2.polylines(img, [np.array(v.exterior.coords).astype(int)], True, (255, 0, 133), 2)
>>> cv2.imshow('img', img)
>>> cv2.waitKey()
- static buffer_shape(shape: Union[Polygon, LineString], size_mm: int, pixels_per_mm: float, cap_style: typing_extensions.Literal['round', 'square', 'flat'] = 'round') Polygon [source]
Create a buffered shape by applying a buffer operation to the input polygon or linestring.
- Parameters
shape (Union[Polygon, LineString]) – The input Polygon or LineString to be buffered.
size_mm (int) – The size of the buffer in millimeters. Use a negative value for an inward buffer.
pixels_per_mm (float) – The conversion factor from millimeters to pixels.
cap_style (Literal['round', 'square', 'flat']) – The cap style for the buffer. Valid values are ‘round’, ‘square’, or ‘flat’. Defaults to ‘round’.
- Return Polygon
The buffered shape.
- Example
>>> polygon = GeometryMixin().bodyparts_to_polygon(np.array([[100, 110],[100, 100],[110, 100],[110, 110]])) >>> buffered_polygon = GeometryMixin().buffer_shape(shape=polygon, size_mm=-1, pixels_per_mm=1)
- static compute_pct_shape_overlap(shapes: ndarray, denominator: Optional[typing_extensions.Literal['difference', 'shape_1', 'shape_2']] = 'difference') int [source]
Compute the percentage of overlap between two shapes.
- Parameters
shapes (np.ndarray) – A 2D array, where each sub-array has two Polygon or LineString shapes.
denominator (Optional[Literal['difference', 'shape_1', 'shape_2']]) – If difference, then percent overlap is calculated using the non-intersection area as denominator. If shape_1, percent overlap is calculated using the area of the first shape as denominator. If shape_2, percent overlap is calculated using the area of the second shape as denominator. Default: difference.
- Return int
The percentage of overlap between the two shapes, as an integer.
- Example
>>> polygon_1 = GeometryMixin().bodyparts_to_polygon(np.array([[364, 308],[383, 323],[403, 335],[423, 351]]))
>>> polygon_2 = GeometryMixin().bodyparts_to_polygon(np.array([[356, 307],[376, 319],[396, 331],[419, 347]]))
>>> polygon_1 = [polygon_1 for x in range(100)]
>>> polygon_2 = [polygon_2 for x in range(100)]
>>> data = np.column_stack((polygon_1, polygon_2))
>>> results = GeometryMixin.compute_pct_shape_overlap(shapes=data)
- static compute_shape_overlap(shapes: List[Union[Polygon, LineString]]) int [source]
Computes whether two geometrical shapes (Polygon or LineString) overlap or are disjoint.
Note
Only returns whether the two shapes are overlapping or not overlapping. If the amount of overlap is required, use GeometryMixin().compute_pct_shape_overlap().
- Parameters
shapes (List[Union[LineString, Polygon]]) – A list of two input Polygon or LineString shapes.
- Return int
Returns 1 if the two shapes overlap, otherwise returns 0.
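Conceptually this is an intersection predicate cast to an integer. For axis-aligned bounding boxes, the same binary test can be sketched without dependencies (`boxes_overlap` is a conceptual illustration of the predicate, not SimBA's Shapely-backed implementation):

```python
def boxes_overlap(box_1, box_2):
    """Return 1 if two (min_x, min_y, max_x, max_y) boxes overlap, else 0."""
    disjoint = (box_1[2] < box_2[0] or box_2[2] < box_1[0] or
                box_1[3] < box_2[1] or box_2[3] < box_1[1])
    return 0 if disjoint else 1

print(boxes_overlap((0, 0, 10, 10), (5, 5, 20, 20)))    # overlapping boxes -> 1
print(boxes_overlap((0, 0, 10, 10), (20, 20, 30, 30)))  # disjoint boxes -> 0
```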
- static contours_to_geometries(contours: List[ndarray], force_rectangles: Optional[bool] = True) List[Polygon] [source]
Convert a list of contours to a list of geometries.
E.g., convert a list of contours detected with ImageMixin.find_contours to a list of Shapely geometries that can be used within the GeometryMixin.
contours (List[np.ndarray]) – List of contours represented as 2D arrays.
force_rectangles (Optional[bool]) – If True, then force the resulting geometries to be rectangular.
- Return List[Polygon]
List of Shapely Polygons.
- Example
>>> video_frm = read_frm_of_video(video_path='/Users/simon/Desktop/envs/platea_featurizer/data/video/3D_Mouse_5-choice_MouseTouchBasic_s9_a4_grayscale.mp4')
>>> contours = ImageMixin.find_contours(img=video_frm)
>>> GeometryMixin.contours_to_geometries(contours=contours)
- static crosses(shapes: List[LineString]) bool [source]
Check if two LineString objects cross each other.
- Parameters
shapes (List[LineString]) – A list containing two LineString objects.
- Return bool
True if the LineStrings cross each other, False otherwise.
- Example
>>> line_1 = GeometryMixin().bodyparts_to_line(np.array([[10, 10],[20, 10],[30, 10],[40, 10]]))
>>> line_2 = GeometryMixin().bodyparts_to_line(np.array([[25, 5],[25, 20],[25, 30],[25, 40]]))
>>> GeometryMixin().crosses(shapes=[line_1, line_2])
>>> True
- cumsum_bool_geometries(data: ndarray, geometries: Dict[Tuple[int, int], Polygon], bool_data: ndarray, fps: Optional[float] = None, core_cnt: Optional[int] = -1) ndarray [source]
Compute the cumulative sums of boolean events within polygon geometries over time using multiprocessing.
E.g., compute the cumulative time of classified events within spatial locations at all time-points of the video.
- Parameters
data (np.ndarray) – Array containing spatial data with shape (n, 2). E.g., 2D-array with body-part coordinates.
geometries (Dict[Tuple[int, int], Polygon]) – Dictionary of polygons representing spatial regions. Created by GeometryMixin.bucket_img_into_grid_square.
bool_data (np.ndarray) – Boolean array with shape (data.shape[0],) or (data.shape[0], 1) indicating the presence or absence in each frame.
fps (Optional[float]) – Frames per second. If provided, the result is normalized by the frame rate.
core_cnt (Optional[int]) – Number of CPU cores to use for parallel processing. Default is -1, which means using all available cores.
- Returns np.ndarray
Array of size (frames x horizontal bins x vertical bins) with times in seconds (if fps passed) or frames (if fps not passed)
- Example
>>> geometries = GeometryMixin.bucket_img_into_grid_square(bucket_grid_size_mm=50, img_size=(800, 800), px_per_mm=5.0)[0]
>>> coord_data = np.random.randint(0, 800, (500, 2))
>>> bool_data = np.random.randint(0, 2, (500,))
>>> x = GeometryMixin().cumsum_bool_geometries(data=coord_data, geometries=geometries, bool_data=bool_data, fps=15)
>>> x.shape
>>> (500, 4, 4)
- cumsum_coord_geometries(data: ndarray, geometries: Dict[Tuple[int, int], Polygon], fps: Optional[int] = None, core_cnt: Optional[int] = -1, verbose: Optional[bool] = True)[source]
Compute the cumulative time a body-part has spent inside a grid of geometries using multiprocessing.
- Parameters
data (np.ndarray) – Input data array where rows represent frames and columns represent body-part x and y coordinates.
geometries (Dict[Tuple[int, int], Polygon]) – Dictionary of polygons representing spatial regions. Created by GeometryMixin.bucket_img_into_grid_square.
fps (Optional[int]) – Frames per second (fps) for time normalization. If None, cumulative sum of frame count is returned.
- Example
>>> img_geometries = GeometryMixin.bucket_img_into_grid_square(img_size=(640, 640), bucket_grid_size=(10, 10), px_per_mm=1)
>>> bp_arr = np.random.randint(0, 640, (5000, 2))
>>> geo_data = GeometryMixin().cumsum_coord_geometries(data=bp_arr, geometries=img_geometries[0], verbose=False, fps=1)
- static delaunay_triangulate_keypoints(data: ndarray) List[Polygon] [source]
Triangulates a set of 2D keypoints. E.g., use to polygonize an animal hull, or triangulate a gridpoint arena.
This method takes a 2D numpy array representing a set of keypoints and triangulates them using the Delaunay triangulation algorithm. The input array should have two columns corresponding to the x and y coordinates of the keypoints.
- Parameters
data (np.ndarray) – NumPy array of body part coordinates. Each subarray represents the coordinates of a body part.
- Returns List[Polygon]
A list of Polygon objects representing the triangles formed by the Delaunay triangulation.
- Example
>>> data = np.array([[126, 122],[152, 116],[136, 85],[167, 172],[161, 206],[197, 193],[191, 237]])
>>> triangulated_hull = GeometryMixin().delaunay_triangulate_keypoints(data=data)
- static difference(shapes=typing.List[typing.Union[shapely.geometry.linestring.LineString, shapely.geometry.polygon.Polygon, shapely.geometry.multipolygon.MultiPolygon]]) Polygon [source]
Calculate the difference between a shape and one or more potentially overlapping shapes.
- Parameters
shapes (List[Union[LineString, Polygon, MultiPolygon]]) – A list of geometries.
- Returns
The first geometry in shapes is returned with all parts that overlap the other geometries in shapes removed.
- Example
>>> polygon_1 = GeometryMixin().bodyparts_to_polygon(np.array([[10, 10], [10, 100], [100, 10], [100, 100]]))
>>> polygon_2 = GeometryMixin().bodyparts_to_polygon(np.array([[25, 25],[25, 75],[90, 25],[90, 75]]))
>>> polygon_3 = GeometryMixin().bodyparts_to_polygon(np.array([[1, 25],[1, 75],[110, 25],[110, 75]]))
>>> difference = GeometryMixin().difference(shapes=[polygon_1, polygon_2, polygon_3])
- static extend_line_to_bounding_box_edges(line_points: ndarray, bounding_box: ndarray) ndarray [source]
Jitted helper to extend a line segment defined by two points to fit within a bounding box.
- Parameters
line_points (np.ndarray) – Coordinates of the line segment’s two points. Two rows and each row represents a point (x, y).
bounding_box (np.ndarray) – Bounding box coordinates in the format (min_x, min_y, max_x, max_y).
- Returns np.ndarray
Intersection points where the extended line crosses the bounding box edges. The shape of the array is (2, 2), where each row represents a point (x, y).
- Example
>>> line_points = np.array([[25, 25], [45, 25]]).astype(np.float32)
>>> bounding_box = np.array([0, 0, 50, 50]).astype(np.float32)
>>> GeometryMixin().extend_line_to_bounding_box_edges(line_points, bounding_box)
>>> [[ 0. 25.] [50. 25.]]
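The extension logic can be sketched with the line's parametric form: compute the slope and intersect the infinite line with each box edge, keeping crossings that land on the box boundary (`extend_line_to_bbox` is a simplified pure-Python sketch, not the jitted SimBA code):

```python
def extend_line_to_bbox(p1, p2, bbox):
    """Intersect the infinite line through p1, p2 with box (min_x, min_y, max_x, max_y)."""
    min_x, min_y, max_x, max_y = bbox
    (x1, y1), (x2, y2) = p1, p2
    pts = []
    if x1 == x2:                           # vertical line: hits top and bottom edges
        pts = [(x1, min_y), (x1, max_y)]
    else:
        slope = (y2 - y1) / (x2 - x1)
        for x in (min_x, max_x):           # crossings of left/right edges
            y = y1 + slope * (x - x1)
            if min_y <= y <= max_y:
                pts.append((x, y))
        if slope != 0:                     # crossings of bottom/top edges
            for y in (min_y, max_y):
                x = x1 + (y - y1) / slope
                if min_x <= x <= max_x:
                    pts.append((x, y))
    uniq = sorted(set(pts))                # deduplicate corner hits
    return uniq[:2] if len(uniq) > 2 else uniq

# The horizontal line through (25, 25), (45, 25) extends to the left/right edges:
print(extend_line_to_bbox((25, 25), (45, 25), (0, 0, 50, 50)))
```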
- static filter_low_p_bps_for_shapes(x: ndarray, p: ndarray, threshold: float)[source]
Filter body-part data for geometry construction while maintaining valid geometry arrays.
Given a 3D array representing body-parts across time, and a second 3D array representing probabilities of those body-parts across time, we want to “remove” body-parts with low detection probabilities while keeping the array sizes intact and suitable for geometry construction. To do this, we find body-parts with detection probabilities below the threshold and replace them with a body-part in the same frame that does not fall below the threshold. However, constructing a geometry requires >= 3 unique key-point locations. Thus, no substitution can be made when there are fewer than three unique body-part locations above the threshold within a frame.
- Example
>>> x = np.random.randint(0, 500, (18000, 7, 2))
>>> p = np.random.random(size=(18000, 7, 1))
>>> x = GeometryMixin.filter_low_p_bps_for_shapes(x=x, p=p, threshold=0.1)
>>> x = x.reshape(x.shape[0], int(x.shape[1] * 2))
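The substitution strategy described above can be sketched in NumPy: for each frame, body-parts below the threshold are replaced by a confident body-part, provided at least three unique above-threshold locations exist (`substitute_low_p` is an illustrative sketch, not the SimBA implementation):

```python
import numpy as np

def substitute_low_p(x, p, threshold):
    """x: (frames, body-parts, 2) coordinates; p: (frames, body-parts, 1) probabilities."""
    out = x.copy()
    for i in range(x.shape[0]):
        good = np.where(p[i, :, 0] >= threshold)[0]
        if len(np.unique(x[i, good], axis=0)) < 3:
            continue                          # fewer than 3 unique confident points: leave frame as-is
        bad = np.where(p[i, :, 0] < threshold)[0]
        out[i, bad] = x[i, good[0]]           # substitute with a confident body-part
    return out

x = np.array([[[0, 0], [10, 0], [0, 10], [99, 99]]], dtype=float)
p = np.array([[[0.9], [0.9], [0.9], [0.05]]])
res = substitute_low_p(x, p, threshold=0.1)   # the low-probability (99, 99) point is replaced
```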
- static geometry_contourcomparison(imgs: List[Union[ndarray, Tuple[VideoCapture, int]]], geometry: Optional[Polygon] = None, method: Optional[typing_extensions.Literal['all', 'exterior']] = 'all', canny: Optional[bool] = True) float [source]
Compare contours between a geometry in two images using shape matching.
Important
If there is non-pose-related noise in the environment (e.g., non-experiment-related intermittent light or shade sources that go on and off), this will negatively affect the reliability of contour comparisons.
Used to pick up very subtle changes around pose-estimated body-part locations.
- Parameters
imgs (List[Union[np.ndarray, Tuple[cv2.VideoCapture, int]]]) – List of two input images. Can either be two images in numpy array format OR two tuples with a cv2.VideoCapture object and a frame index.
geometry (Optional[Polygon]) – If a Polygon, then the geometry within the two images to compare. If None, then the entire images are compared.
method (Literal['all', 'exterior']) – The method used for contour comparison.
canny (Optional[bool]) – If True, applies Canny edge detection before contour comparison. Helps reduce noise and enhance contours. Default is True.
- Returns float
Contour matching score between the two images. Lower scores indicate higher similarity.
- Example
>>> img_1 = cv2.imread('/Users/simon/Desktop/envs/troubleshooting/Emergence/project_folder/videos/Example_1_frames/1978.png').astype(np.uint8)
>>> img_2 = cv2.imread('/Users/simon/Desktop/envs/troubleshooting/Emergence/project_folder/videos/Example_1_frames/1977.png').astype(np.uint8)
>>> data = pd.read_csv('/Users/simon/Desktop/envs/troubleshooting/Emergence/project_folder/csv/outlier_corrected_movement_location/Example_1.csv', nrows=1, usecols=['Nose_x', 'Nose_y']).fillna(-1).values.astype(np.int64)
>>> geometry = GeometryMixin().bodyparts_to_circle(data[0, :], 100)
>>> GeometryMixin().geometry_contourcomparison(imgs=[img_1, img_2], geometry=geometry, canny=True, method='exterior')
>>> 22.54
- static geometry_histocomparison(imgs: List[Union[ndarray, Tuple[VideoCapture, int]]], geometry: Optional[Polygon] = None, method: Optional[typing_extensions.Literal['chi_square', 'correlation', 'intersection', 'bhattacharyya', 'hellinger', 'chi_square_alternative', 'kl_divergence']] = 'correlation', absolute: Optional[bool] = True) float [source]
Retrieve histogram similarities within a geometry inside two images.
For example, the polygon may represent an area around a rodents head. While the front paws are not pose-estimated, computing the histograms of the geometry in two sequential images gives indication of non-freezing.
Important
If there is non-pose-related noise in the environment (e.g., non-experiment-related light sources that go on and off, or waving window curtains causing changes in histogram values without affecting pose), this will negatively affect the reliability of histogram comparisons.
- Parameters
imgs (List[Union[np.ndarray, Tuple[cv2.VideoCapture, int]]]) – List of two input images. Can either be two images in numpy array format OR two tuples with a cv2.VideoCapture object and a frame index.
geometry (Optional[Polygon]) – If a Polygon, then the geometry within the two images to compare. If None, then the entire images are compared.
method (Literal['correlation', 'chi_square']) – The method used for comparison. E.g., if correlation, then small output values suggest large differences between the current versus prior image. If chi_square, then large output values suggest large differences between the geometries.
absolute (Optional[bool]) – If True, returns the absolute difference between the two histograms. If False, returns (image2 histogram) - (image1 histogram).
- Return float
Value representing the histogram similarities between the geometry in the two images.
- Example
>>> img_1 = cv2.imread('/Users/simon/Desktop/envs/troubleshooting/Emergence/project_folder/videos/Example_1_frames/1.png')
>>> img_2 = cv2.imread('/Users/simon/Desktop/envs/troubleshooting/Emergence/project_folder/videos/Example_1_frames/2.png')
>>> data_path = '/Users/simon/Desktop/envs/troubleshooting/Emergence/project_folder/csv/outlier_corrected_movement_location/Example_1.csv'
>>> data = pd.read_csv(data_path, nrows=1, usecols=['Nose_x', 'Nose_y']).fillna(-1).values.astype(np.int64)
>>> polygon = GeometryMixin().bodyparts_to_circle(data[0], 100)
>>> GeometryMixin().geometry_histocomparison(imgs=[img_1, img_2], geometry=polygon, method='correlation')
>>> 0.9999769684923543
>>> img_2 = cv2.imread('/Users/simon/Desktop/envs/troubleshooting/Emergence/project_folder/videos/Example_1_frames/41411.png')
>>> GeometryMixin().geometry_histocomparison(imgs=[img_1, img_2], geometry=polygon, method='correlation')
>>> 0.6732792208872572
>>> img_1 = (cv2.VideoCapture('/Users/simon/Desktop/envs/troubleshooting/Emergence/project_folder/videos/Example_1.mp4'), 1)
>>> img_2 = (cv2.VideoCapture('/Users/simon/Desktop/envs/troubleshooting/Emergence/project_folder/videos/Example_1.mp4'), 2)
>>> GeometryMixin().geometry_histocomparison(imgs=[img_1, img_2], geometry=polygon, method='correlation')
>>> 0.9999769684923543
- static geometry_video(shapes: List[List[Union[LineString, Polygon, MultiPolygon, MultiLineString, MultiPoint]]], save_path: Union[str, PathLike], size: Optional[Tuple[int]], fps: Optional[int] = 10, verbose: Optional[bool] = False, bg_img: Optional[ndarray] = None, bg_clr: Optional[Tuple[int]] = None) None [source]
Helper to create a geometry video from a list of shapes.
Note
If more aesthetic videos are needed, overlaid on video, then use simba.plotting.geometry_plotter.GeometryPlotter.
If single images of geometries are needed, then use simba.mixins.geometry_mixin.view_shapes.
- Parameters
shapes (List[List[Union[LineString, Polygon, MultiPolygon, MultiPoint, MultiLineString]]]) – List of lists containing geometric shapes to be included in the video. Each sublist represents a frame, and each element within the sublist represents a shape for that frame.
save_path (Union[str, os.PathLike]) – Path where the resulting video will be saved.
size (Optional[Tuple[int]]) – Tuple specifying the size of the output video in pixels (width, height).
fps (Optional[int]) – Frames per second of the output video. Defaults to 10.
verbose (Optional[bool]) – If True, then prints progress frame-by-frame. Default: False.
bg_img (Optional[np.ndarray]) – Background image to be used as the canvas for drawing shapes. Defaults to None. Could be e.g., a low opacity image of the arena.
bg_clr (Optional[Tuple[int]]) – Background color specified as a tuple of RGB values. Defaults to white.
- static get_center(shape: Union[LineString, Polygon, MultiPolygon]) ndarray [source]
Compute the center of a geometry.
- Example
>>> multipolygon = MultiPolygon([Polygon([[200, 110],[200, 100],[200, 100],[200, 110]]), Polygon([[70, 70],[70, 60],[10, 50],[1, 70]])])
>>> GeometryMixin().get_center(shape=multipolygon)
>>> [33.96969697, 62.32323232]
- static get_geometry_brightness_intensity(img: Union[ndarray, Tuple[VideoCapture, int]], geometries: List[Union[Polygon, ndarray]], ignore_black: Optional[bool] = True) ndarray [source]
Calculate the average brightness intensity within a geometry region-of-interest of an image.
E.g., can be used with hardcoded thresholds or kmeans models (simba.mixins.statistics_mixin.Statistics.kmeans_1d) to detect if a light source is in an ON or OFF state.
- Parameters
img (np.ndarray) – Either an image in numpy array format OR a tuple with cv2.VideoCapture object and the frame index.
geometries (List[Union[Polygon, np.ndarray]]) – A list of shapes either as vertices in a numpy array, or as shapely Polygons.
ignore_black (Optional[bool]) – If non-rectangular geometries, then pixels that don’t belong to the geometry are masked in black. If True, then these pixels will be ignored when computing averages.
- Example
>>> img = cv2.imread('/Users/simon/Desktop/envs/troubleshooting/Emergence/project_folder/videos/Example_1_frames/1.png').astype(np.uint8)
>>> data_path = '/Users/simon/Desktop/envs/troubleshooting/Emergence/project_folder/csv/outlier_corrected_movement_location/Example_1.csv'
>>> data = pd.read_csv(data_path, usecols=['Nose_x', 'Nose_y']).sample(n=3).fillna(1).values.astype(np.int64)
>>> geometries = []
>>> for frm_data in data: geometries.append(GeometryMixin().bodyparts_to_circle(frm_data, 100))
>>> GeometryMixin().get_geometry_brightness_intensity(img=img, geometries=geometries, ignore_black=False)
>>> [125.0, 113.0, 118.0]
- static hausdorff_distance(geometries: List[List[Union[Polygon, LineString]]]) ndarray [source]
The Hausdorff distance measures the similarity between sequential time-series geometries. It is defined as the maximum of the distances from each point in one set to the nearest point in the other set.
Hausdorff distance can be used to measure the similarity of the geometry in one frame relative to the geometry in the next frame. Larger values indicate that the animal has a different shape than in the preceding frame.
- Parameters
geometries (List[List[Union[Polygon, LineString]]]) – List of list where each list has two geometries.
- Return np.ndarray
1D array of Hausdorff distances of the geometries in each list.
- Example
>>> x = Polygon([[0,1], [0, 2], [1,1]])
>>> y = Polygon([[0,1], [0, 2], [0,1]])
>>> GeometryMixin.hausdorff_distance(geometries=[[x, y]])
>>> [1.]
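For discrete point sets, the symmetric Hausdorff distance can be computed by brute force in NumPy (`hausdorff` is an illustrative helper; SimBA's method operates on Shapely geometries):

```python
import numpy as np

def hausdorff(a, b):
    """Symmetric Hausdorff distance between two (n, 2) point arrays."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)   # pairwise distance matrix
    # max over directed distances a -> b and b -> a:
    return max(d.min(axis=1).max(), d.min(axis=0).max())

a = np.array([[0.0, 1.0], [0.0, 2.0], [1.0, 1.0]])
b = np.array([[0.0, 1.0], [0.0, 2.0]])
print(hausdorff(a, b))   # 1.0: the vertex (1, 1) is distance 1 from its nearest neighbour
```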
- static is_containing(shapes=typing.List[typing.Union[shapely.geometry.polygon.Polygon, shapely.geometry.linestring.LineString]]) bool [source]
Check if one geometry fully contains another.
- Example
- static is_shape_covered(shapes: List[Union[LineString, Polygon, MultiPolygon, MultiPoint]]) bool [source]
Check if one geometry fully covers another.
- Parameters
shapes (Union[LineString, Polygon, MultiPolygon, MultiPoint]) – List of 2 geometries, checks if the second geometry fully covers the first geometry.
- Return bool
True if the second geometry fully covers the first geometry, otherwise False.
- Example
>>> polygon_1 = GeometryMixin().bodyparts_to_polygon(np.array([[10, 10], [10, 100], [100, 10], [100, 100]]))
>>> polygon_2 = GeometryMixin().bodyparts_to_polygon(np.array([[25, 25], [25, 75], [90, 25], [90, 75]]))
>>> GeometryMixin().is_shape_covered(shapes=[polygon_2, polygon_1])
>>> True
- static is_touching(shapes=typing.List[typing.Union[shapely.geometry.polygon.Polygon, shapely.geometry.linestring.LineString]]) bool [source]
Check if two geometries touch each other.
Note
Different from GeometryMixin().crosses: Touches requires a common boundary, and does not require the sharing of interior space.
- Parameters
shapes (List[Union[LineString, Polygon]]) – A list containing two LineString or Polygon geometries.
- Return bool
True if the geometries touch each other, False otherwise.
- Example
>>> rectangle_1 = Polygon(np.array([[0, 0], [10, 10], [0, 10], [10, 0]]))
>>> rectangle_2 = Polygon(np.array([[20, 20], [30, 30], [20, 30], [30, 20]]))
>>> GeometryMixin().is_touching(shapes=[rectangle_1, rectangle_2])
>>> False
- static length(shape: Union[LineString, MultiLineString], pixels_per_mm: float, unit: typing_extensions.Literal['mm', 'cm', 'dm', 'm'] = 'mm') float [source]
Calculate the length of a LineString geometry.
- Parameters
shape (LineString) – The LineString geometry for which the length is to be calculated.
pixels_per_mm (float) – The pixel-to-millimeter conversion factor.
unit (Literal['mm', 'cm', 'dm', 'm']) – The desired unit for the length measurement (‘mm’, ‘cm’, ‘dm’, ‘m’).
- Return float
The length of the LineString geometry in the specified unit.
- Example
>>> line_1 = GeometryMixin().bodyparts_to_line(np.array([[10, 70],[20, 60],[30, 50],[40, 70]]))
>>> GeometryMixin().length(shape=line_1, pixels_per_mm=1.0)
>>> 50.6449510224598
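The conversion reduces to summing segment lengths in pixels, dividing by `pixels_per_mm`, then scaling to the requested unit (`line_length` is a sketch of the arithmetic, not SimBA's Shapely-backed code):

```python
import math

UNIT_SCALE = {'mm': 1.0, 'cm': 10.0, 'dm': 100.0, 'm': 1000.0}

def line_length(points, pixels_per_mm, unit='mm'):
    """Length of a polyline given as [(x, y), ...] pixel coordinates."""
    px = sum(math.dist(points[i], points[i + 1]) for i in range(len(points) - 1))
    return (px / pixels_per_mm) / UNIT_SCALE[unit]

# Same body-part path as the example above, at 1.0 px/mm:
pts = [(10, 70), (20, 60), (30, 50), (40, 70)]
print(line_length(pts, pixels_per_mm=1.0))   # 50.6449510224598
```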
- static line_split_bounding_box(intersections: ndarray, bounding_box: ndarray) GeometryCollection [source]
Split a bounding box into two parts using an extended line.
Note
The extended line can be found from body-parts using GeometryMixin().extend_line_to_bounding_box_edges.
- Parameters
intersections (np.ndarray) – Intersection points where the extended line crosses the bounding box edges. The shape of the array is (2, 2), where each row represents a point (x, y).
bounding_box (np.ndarray) – Bounding box coordinates in the format (min_x, min_y, max_x, max_y).
- Returns GeometryCollection
A collection of polygons resulting from splitting the bounding box with the extended line.
- Example
>>> line_points = np.array([[25, 25], [45, 25]]).astype(np.float32)
>>> bounding_box = np.array([0, 0, 50, 50]).astype(np.float32)
>>> intersection_points = GeometryMixin().extend_line_to_bounding_box_edges(line_points, bounding_box)
>>> GeometryMixin().line_split_bounding_box(intersections=intersection_points, bounding_box=bounding_box)
- static linear_frechet_distance(x: ndarray, y: ndarray, sample: int = 100) float [source]
Compute the Linear Fréchet Distance between two trajectories.
The Fréchet Distance measures the dissimilarity between two continuous curves or trajectories represented as sequences of points in a 2-dimensional space.
- Parameters
x (np.ndarray) – First 2D array of size len(frames) representing body-part coordinates x and y.
y (np.ndarray) – Second 2D array of size len(frames) representing body-part coordinates x and y.
sample (int) – The downsampling factor for the trajectories (default is 100). If sample > 1, the trajectories are downsampled by selecting every sample-th point.
Note
Slightly modified from João Paulo Figueira
- Example
>>> x = np.random.randint(0, 100, (10000, 2)).astype(np.float32)
>>> y = np.random.randint(0, 100, (10000, 2)).astype(np.float32)
>>> distance = GeometryMixin.linear_frechet_distance(x=x, y=y, sample=100)
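A minimal discrete Fréchet distance via dynamic programming (`discrete_frechet` is illustrative; the SimBA method additionally downsamples the trajectories via the `sample` argument):

```python
import numpy as np

def discrete_frechet(p, q):
    """Discrete Fréchet distance between (n, 2) and (m, 2) trajectories."""
    n, m = len(p), len(q)
    d = np.linalg.norm(p[:, None, :] - q[None, :, :], axis=2)  # pairwise distances
    ca = np.full((n, m), -1.0)                                 # coupling table
    ca[0, 0] = d[0, 0]
    for i in range(1, n):
        ca[i, 0] = max(ca[i - 1, 0], d[i, 0])
    for j in range(1, m):
        ca[0, j] = max(ca[0, j - 1], d[0, j])
    for i in range(1, n):
        for j in range(1, m):
            ca[i, j] = max(min(ca[i - 1, j], ca[i - 1, j - 1], ca[i, j - 1]), d[i, j])
    return ca[-1, -1]

# Two parallel trajectories offset by 1 unit have Fréchet distance 1.0:
p = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
q = np.array([[0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
print(discrete_frechet(p, q))
```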
- static locate_line_point(path: Union[LineString, ndarray], geometry: Union[LineString, Polygon, Point], px_per_mm: Optional[float] = 1, fps: Optional[float] = 1, core_cnt: Optional[int] = -1, distance_min: Optional[bool] = True, time_prior: Optional[bool] = True) Dict[str, float] [source]
Compute the time and distance travelled along a path to reach the most proximal point in reference to a second geometry.
Note
To compute the time and distance travelled along a path to reach the most distal point relative to a second geometry, pass distance_min = False.
To compute the time and distance travelled along a path after reaching the most distal or proximal point relative to a second geometry, pass time_prior = False.
- Example
>>> line = LineString([[10, 10], [7.5, 7.5], [15, 15], [7.5, 7.5]])
>>> polygon = Polygon([[0, 5], [0, 0], [5, 0], [5, 5]])
>>> GeometryMixin.locate_line_point(path=line, geometry=polygon)
>>> {'distance_value': 3.5355339059327378, 'distance_travelled': 3.5355339059327378, 'time_travelled': 1.0, 'distance_index': 1}
- static minimum_rotated_rectangle(shape=<class 'shapely.geometry.polygon.Polygon'>) Polygon [source]
Calculate the minimum rotated rectangle that bounds a given polygon.
The minimum rotated rectangle, also known as the minimum bounding rectangle (MBR) or oriented bounding box (OBB), is the smallest rectangle that can fully contain a given polygon or set of points while allowing rotation. It is defined by its center, dimensions (length and width), and rotation angle.
- Parameters
shape (Polygon) – The Polygon for which the minimum rotated rectangle is to be calculated.
- Return Polygon
The minimum rotated rectangle geometry that bounds the input polygon.
- Example
>>> polygon = GeometryMixin().bodyparts_to_polygon(np.array([[364, 308],[383, 323],[403, 335],[423, 351]]))
>>> rectangle = GeometryMixin().minimum_rotated_rectangle(shape=polygon)
- static multiframe_bodypart_to_point(data: ndarray, core_cnt: Optional[int] = -1, buffer: Optional[int] = None, px_per_mm: Optional[int] = None) Union[List[Point], List[List[Point]]] [source]
Process multiple frames of body part data in parallel and convert them to shapely Points.
This function takes multi-frame body-part data represented as an array and converts it into points. It utilizes multiprocessing for parallel processing.
- Parameters
data (np.ndarray) – 2D or 3D array with body-part coordinates where rows are frames and columns are x and y coordinates.
core_cnt (Optional[int]) – The number of cores to use. If -1, then all available cores.
buffer (Optional[int]) – If not None, then the area of the Point. Thus, if not None, then returns Polygons representing the Points.
px_per_mm (Optional[int]) – Pixels to millimeter conversion factor. Required if buffer is not None.
- Returns Union[List[Point], List[List[Point]]]
If input is a 2D array, then list of Points. If 3D array, then list of list of Points.
Note
If buffer and px_per_mm are not None, then the points will be buffered into 2D polygons with the specified buffered area. If buffer is provided, then also provide px_per_mm for an accurate conversion factor between pixels and millimeters.
- Example
>>> data = np.random.randint(0, 100, (100, 2))
>>> points_lst = GeometryMixin().multiframe_bodypart_to_point(data=data, buffer=10, px_per_mm=4)
>>> data = np.random.randint(0, 100, (10, 10, 2))
>>> point_lst_of_lst = GeometryMixin().multiframe_bodypart_to_point(data=data)
- multiframe_bodyparts_to_circle(data: ndarray, parallel_offset: int = 1, core_cnt: int = -1, pixels_per_mm: Optional[int] = 1) List[Polygon] [source]
Convert a set of pose-estimated key-points to circles with specified radius using multiprocessing.
- Parameters
data (np.ndarray) – The body-part xy coordinates as a 2D array where rows are frames and columns represent x and y coordinates. E.g., np.array([[364, 308], [369, 309]])
parallel_offset (int) – The radius of the resultant circle in millimeters.
core_cnt (int) – Number of CPU cores to use. Defaults to -1 meaning all available cores will be used.
pixels_per_mm (int) – The pixels per millimeter of the video. If not passed, 1 will be used, meaning the radius is interpreted in pixels rather than millimeters.
- Returns List[Polygon]
List of shapely Polygons of circular shape, of length data.shape[0].
- Example
>>> data = np.random.randint(0, 100, (100, 2))
>>> circles = GeometryMixin().multiframe_bodyparts_to_circle(data=data)
- multiframe_bodyparts_to_line(data: ndarray, buffer: Optional[int] = None, px_per_mm: Optional[float] = None, core_cnt: Optional[int] = -1) List[LineString] [source]
Convert multiframe body-parts data to a list of LineString objects using multiprocessing.
- Parameters
data (np.ndarray) – Input array representing multiframe body-parts data. It should be a 3D array with dimensions (frames, points, coordinates).
buffer (Optional[int]) – If not None, then the linestring will be expanded into a 2D geometry polygon with area buffer.
px_per_mm (Optional[float]) – If buffer is not None, then provide the pixels to millimeter conversion factor.
core_cnt (Optional[int]) – Number of CPU cores to use for parallel processing. If set to -1, the function will automatically determine the available core count.
- Return List[LineString]
A list of LineString objects representing the body-parts trajectories.
- Example
>>> data = np.random.randint(0, 100, (100, 2))
>>> data = data.reshape(50, -1, data.shape[1])
>>> lines = GeometryMixin().multiframe_bodyparts_to_line(data=data)
- multiframe_bodyparts_to_multistring_skeleton(data_df: DataFrame, skeleton: Iterable[str], core_cnt: Optional[int] = -1, verbose: Optional[bool] = False, video_name: Optional[bool] = False, animal_names: Optional[bool] = False) List[Union[LineString, MultiLineString]] [source]
Convert body parts to LineString skeleton representations in a video using multiprocessing.
- Parameters
data_df (pd.DataFrame) – Pose-estimation data.
skeleton (Iterable[str]) – Iterable of body part pairs defining the skeleton structure. Eg., [[‘Center’, ‘Lat_left’], [‘Center’, ‘Lat_right’], [‘Center’, ‘Nose’], [‘Center’, ‘Tail_base’]]
core_cnt (Optional[int]) – Number of CPU cores to use for parallel processing. Default is -1, which uses all available cores.
verbose (Optional[bool]) – If True, print progress information during computation. Default is False.
video_name (Optional[bool]) – If True, include video name in progress information. Default is False.
animal_names (Optional[bool]) – If True, include animal names in progress information. Default is False.
- Return List[Union[LineString, MultiLineString]]
List of LineString or MultiLineString objects representing the computed skeletons.
- Example
>>> df = pd.read_csv('/Users/simon/Desktop/envs/troubleshooting/Rat_NOR/project_folder/csv/machine_results/08102021_DOT_Rat7_8(2).csv', nrows=500).fillna(0).astype(int)
>>> skeleton = [['Center', 'Lat_left'], ['Center', 'Lat_right'], ['Center', 'Nose'], ['Center', 'Tail_base'], ['Lat_left', 'Tail_base'], ['Lat_right', 'Tail_base'], ['Nose', 'Ear_left'], ['Nose', 'Ear_right'], ['Ear_left', 'Lat_left'], ['Ear_right', 'Lat_right']]
>>> geometries = GeometryMixin().multiframe_bodyparts_to_multistring_skeleton(data_df=df, skeleton=skeleton, core_cnt=2, verbose=True)
- multiframe_bodyparts_to_polygon(data: ndarray, video_name: Optional[str] = None, animal_name: Optional[str] = None, verbose: Optional[bool] = False, cap_style: Optional[typing_extensions.Literal['round', 'square', 'flat']] = 'round', parallel_offset: Optional[int] = 1, pixels_per_mm: Optional[float] = None, simplify_tolerance: Optional[float] = 2, preserve_topology: bool = True, core_cnt: int = -1) List[Polygon] [source]
Convert multidimensional NumPy array representing body part coordinates to a list of Polygons.
- Parameters
data (np.ndarray) – NumPy array of body part coordinates. Each subarray represents the coordinates of a body part.
cap_style (Literal['round', 'square', 'flat']) – Style of line cap for parallel offset. Options: ‘round’, ‘square’, ‘flat’.
parallel_offset (int) – Offset distance for parallel lines. Default is 1.
simplify_tolerance (float) – Tolerance parameter for simplifying geometries. Default is 2.
- Example
>>> data = np.array([[[364, 308], [383, 323], [403, 335], [423, 351]], [[356, 307], [376, 319], [396, 331], [419, 347]]])
>>> GeometryMixin().multiframe_bodyparts_to_polygon(data=data)
- multiframe_compute_pct_shape_overlap(shape_1: List[Polygon], shape_2: List[Polygon], core_cnt: Optional[int] = -1, video_name: Optional[str] = None, verbose: Optional[bool] = False, animal_names: Optional[Tuple[str]] = None, denominator: Optional[typing_extensions.Literal['difference', 'shape_1', 'shape_2']] = 'difference') List[float] [source]
Compute the percentage overlap between corresponding Polygons in two lists.
- Parameters
shape_1 (List[Polygon]) – List of Polygons.
shape_2 (List[Polygon]) – List of Polygons with the same length as shape_1.
core_cnt (int) – Number of CPU cores to use for parallel processing. Default is -1, which uses all available cores.
video_name (Optional[str]) – If not None, then the name of the video being processed for interpretable progress msgs.
verbose (Optional[bool]) – If True, then prints interpretable progress msgs.
animal_names (Optional[Tuple[str]]) – If not None, then a two-tuple of animal names (or alternative shape names) for interpretable progress msgs.
- Return List[float]
List of percentage overlap between corresponding Polygons.
- multiframe_compute_shape_overlap(shape_1: List[Polygon], shape_2: List[Polygon], core_cnt: Optional[int] = -1, verbose: Optional[bool] = False, names: Optional[Tuple[str]] = None) List[int] [source]
Multiprocess compute overlap between corresponding Polygons in two lists.
Note
Only returns whether two shapes are overlapping or not. If the amount of overlap is required, use GeometryMixin().multiframe_compute_pct_shape_overlap().
- Parameters
shape_1 (List[Polygon]) – List of Polygons.
shape_2 (List[Polygon]) – List of Polygons with the same length as shape_1.
core_cnt (int) – Number of CPU cores to use for parallel processing. Default is -1, which uses all available cores.
- Return List[int]
List of overlap between corresponding Polygons: 1 if the shapes overlap, else 0.
- multiframe_delaunay_triangulate_keypoints(data: ndarray, core_cnt: int = -1) List[List[Polygon]] [source]
>>> data_path = '/Users/simon/Desktop/envs/troubleshooting/Rat_NOR/project_folder/csv/machine_results/08102021_DOT_Rat7_8(2).csv'
>>> data = pd.read_csv(data_path, index_col=0).head(1000).iloc[:, 0:21]
>>> data = data[data.columns.drop(list(data.filter(regex='_p')))]
>>> animal_data = data.values.reshape(len(data), -1, 2).astype(int)
>>> tri = GeometryMixin().multiframe_delaunay_triangulate_keypoints(data=animal_data)
- multiframe_difference(shapes: Iterable[Union[LineString, Polygon, MultiPolygon]], core_cnt: Optional[int] = -1, verbose: Optional[bool] = False, animal_names: Optional[str] = None, video_name: Optional[str] = None) List[Union[Polygon, MultiPolygon]] [source]
Compute the multi-frame difference for a collection of shapes using parallel processing.
- Parameters
shapes (Iterable[Union[LineString, Polygon, MultiPolygon]]) – A collection of shapes, where each shape is a list containing two geometries.
core_cnt (int) – The number of CPU cores to use for parallel processing. Default is -1, which automatically detects the available cores.
verbose (Optional[bool]) – If True, print progress messages during computation. Default is False.
animal_names (Optional[str]) – Optional string representing the names of animals for informative messages.
video_name (Optional[str]) – Optional string representing the name of the video for informative messages.
- Return List[Union[Polygon, MultiPolygon]]
A list of geometries representing the multi-frame difference.
- multiframe_hausdorff_distance(geometries: List[Union[Polygon, LineString]], lag: Optional[int] = 1, core_cnt: Optional[int] = -1) List[float] [source]
The Hausdorff distance measures the similarity between sequential time-series geometries.
- Example
>>> df = read_df(file_path='/Users/simon/Desktop/envs/simba/troubleshooting/mouse_open_field/project_folder/csv/outlier_corrected_movement_location/SI_DAY3_308_CD1_PRESENT.csv', file_type='csv')
>>> cols = [x for x in df.columns if not x.endswith('_p')]
>>> data = df[cols].values.reshape(len(df), -1, 2).astype(int)
>>> geometries = GeometryMixin().multiframe_bodyparts_to_polygon(data=data, pixels_per_mm=1, parallel_offset=1, verbose=False, core_cnt=-1)
>>> hausdorff_distances = GeometryMixin.multiframe_hausdorff_distance(geometries=geometries)
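For intuition, the Hausdorff distance between two point sets can be sketched in pure Python. This is a simplified, brute-force stand-in for the shapely-backed computation above; the helper name is illustrative, not SimBA API:

```python
import math

def discrete_hausdorff(a, b):
    # Directed distance: for each point in p, find its nearest neighbour in q,
    # then take the worst (largest) of those nearest-neighbour distances.
    def directed(p, q):
        return max(min(math.dist(x, y) for y in q) for x in p)
    # The Hausdorff distance is symmetric: the max of both directed distances.
    return max(directed(a, b), directed(b, a))

print(discrete_hausdorff([(0, 0)], [(3, 4)]))  # -> 5.0 (a 3-4-5 triangle)
```

Identical point sets give a distance of zero, which is why a lag of 1 over a slowly moving animal yields small values.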
- multiframe_is_shape_covered(shape_1: List[Polygon], shape_2: List[Polygon], core_cnt: Optional[int] = -1) List[bool] [source]
For each shape in time-series of shapes, check if another shape in the same time-series fully covers the first shape.
- Example
>>> shape_1 = GeometryMixin().multiframe_bodyparts_to_polygon(data=np.random.randint(0, 200, (100, 6, 2)))
>>> shape_2 = [Polygon([[0, 0], [20, 20], [20, 10], [10, 20]]) for x in range(len(shape_1))]
>>> GeometryMixin.multiframe_is_shape_covered(shape_1=shape_1, shape_2=shape_2, core_cnt=3)
- multiframe_length(shapes: List[Union[LineString, MultiLineString]], pixels_per_mm: float, core_cnt: int = -1, unit: typing_extensions.Literal['mm', 'cm', 'dm', 'm'] = 'mm') List[float] [source]
- Example
>>> data = np.random.randint(0, 100, (5000, 2))
>>> data = data.reshape(2500, -1, data.shape[1])
>>> lines = GeometryMixin().multiframe_bodyparts_to_line(data=data)
>>> lengths = GeometryMixin().multiframe_length(shapes=lines, pixels_per_mm=1.0)
- multiframe_minimum_rotated_rectangle(shapes: List[Polygon], video_name: Optional[str] = None, verbose: Optional[bool] = False, animal_name: Optional[bool] = None, core_cnt: int = -1) List[Polygon] [source]
Compute the minimum rotated rectangle for each Polygon in a list using multiprocessing.
- Parameters
shapes (List[Polygon]) – List of Polygons.
core_cnt – Number of CPU cores to use for parallel processing. Default is -1, which uses all available cores.
- multiframe_shape_distance(shape_1: List[Union[Polygon, LineString]], shape_2: List[Union[Polygon, LineString]], pixels_per_mm: float, unit: typing_extensions.Literal['mm', 'cm', 'dm', 'm'] = 'mm', core_cnt=-1) List[float] [source]
Compute shape distances between corresponding shapes in two lists of LineString or Polygon geometries for multiple frames.
- Parameters
shape_1 (List[Union[LineString, Polygon]]) – List of LineString or Polygon geometries.
shape_2 (List[Union[LineString, Polygon]]) – List of LineString or Polygon geometries with the same length as shape_1.
pixels_per_mm (float) – Conversion factor from pixels to millimeters.
unit (Literal['mm', 'cm', 'dm', 'm']) – Unit of measurement for the result. Options: ‘mm’, ‘cm’, ‘dm’, ‘m’. Default: ‘mm’.
core_cnt – Number of CPU cores to use for parallel processing. Default is -1, which uses all available cores.
- Return List[float]
List of shape distances between corresponding shapes in passed unit.
- multiframe_symmetric_difference(shapes: Iterable[Union[LineString, MultiLineString]], core_cnt: int = -1)[source]
Compute the symmetric differences between corresponding LineString or MultiLineString geometries using multiprocessing.
- Example
>>> data_1 = np.random.randint(0, 100, (5000, 2)).reshape(1000, -1, 2)
>>> data_2 = np.random.randint(0, 100, (5000, 2)).reshape(1000, -1, 2)
>>> polygon_1 = GeometryMixin().multiframe_bodyparts_to_polygon(data=data_1)
>>> polygon_2 = GeometryMixin().multiframe_bodyparts_to_polygon(data=data_2)
>>> data = np.array([polygon_1, polygon_2]).T
>>> symmetric_differences = GeometryMixin().multiframe_symmetric_difference(shapes=data)
- multiframe_union(shapes: Iterable[Union[LineString, MultiLineString]], core_cnt: int = -1) Iterable[Union[LineString, MultiLineString]] [source]
- Example
>>> data_1 = np.random.randint(0, 100, (5000, 2)).reshape(1000, -1, 2)
>>> data_2 = np.random.randint(0, 100, (5000, 2)).reshape(1000, -1, 2)
>>> polygon_1 = GeometryMixin().multiframe_bodyparts_to_polygon(data=data_1)
>>> polygon_2 = GeometryMixin().multiframe_bodyparts_to_polygon(data=data_2)
>>> data = np.array([polygon_1, polygon_2]).T
>>> unions = GeometryMixin().multiframe_union(shapes=data)
- multifrm_geometry_histocomparison(video_path: Union[str, PathLike], data: ndarray, shape_type: typing_extensions.Literal['rectangle', 'circle', 'line'], lag: Optional[int] = 2, core_cnt: Optional[int] = -1, pixels_per_mm: int = 1, parallel_offset: int = 1) ndarray [source]
Perform geometry histocomparison on multiple video frames using multiprocessing.
Note
Comparisons are made using the intersections of the two image geometries, meaning that the same experimental area of the image and arena is used in the comparison, and shifts in animal location cannot account for variability.
- Parameters
video_path (Union[str, os.PathLike]) – Path to the video file.
data (np.ndarray) – Input data, typically containing coordinates of one or several body-parts.
shape_type (Literal['rectangle', 'circle', 'line']) – Type of shape for comparison.
lag (Optional[int]) – Number of frames to lag between comparisons. Default is 2.
core_cnt (Optional[int]) – Number of CPU cores to use for parallel processing. Default is -1 which is all available cores.
pixels_per_mm (Optional[int]) – Pixels per millimeter for conversion. Default is 1.
parallel_offset (Optional[int]) – Size of the geometry ROI in millimeters. Default 1.
- Returns np.ndarray
The difference between the successive geometry histograms.
- Example
>>> data = pd.read_csv('/Users/simon/Desktop/envs/troubleshooting/Emergence/project_folder/csv/outlier_corrected_movement_location/Example_1.csv', nrows=2100, usecols=['Nose_x', 'Nose_y']).fillna(-1).values.astype(np.int64)
>>> results = GeometryMixin().multifrm_geometry_histocomparison(video_path='/Users/simon/Desktop/envs/troubleshooting/Emergence/project_folder/videos/Example_1.mp4', data=data, shape_type='circle', pixels_per_mm=1, parallel_offset=100)
>>> data = pd.read_csv('/Users/simon/Desktop/envs/troubleshooting/Emergence/project_folder/csv/outlier_corrected_movement_location/Example_2.csv', nrows=2100, usecols=['Nose_x', 'Nose_y', 'Tail_base_x', 'Tail_base_y', 'Center_x', 'Center_y']).fillna(-1).values.astype(np.int64)
>>> results = GeometryMixin().multifrm_geometry_histocomparison(video_path='/Users/simon/Desktop/envs/troubleshooting/Emergence/project_folder/videos/Example_1.mp4', data=data, shape_type='rectangle', pixels_per_mm=1, parallel_offset=1)
- static point_lineside(lines: ndarray, points: ndarray) ndarray [source]
Determine the relative position of a point (left vs right) with respect to a line in each frame.
- Parameters
lines (numpy.ndarray) – An array of shape (N, 2, 2) representing N lines, where each line is defined by two points. The first point that denotes the beginning of the line, the second point denotes the end of the line.
points (numpy.ndarray) – An array of shape (N, 2) representing N points.
- Return np.ndarray
An array of length N containing the results for each line. 2 if the point is on the right side of the line. 1 if the point is on the left side of the line. 0 if the point is on the line.
- Example
>>> lines = np.array([[[25, 25], [25, 20]], [[15, 25], [15, 20]], [[15, 25], [50, 20]]]).astype(np.float32)
>>> points = np.array([[20, 0], [15, 20], [90, 0]]).astype(np.float32)
>>> GeometryMixin().point_lineside(lines=lines, points=points)
>>> [1., 0., 1.]
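The left/right test underneath this method can be sketched with a 2D cross product. This is a pure-Python illustration; the sign-to-label mapping below is chosen to reproduce the documented example values and is an assumption about SimBA's convention:

```python
def lineside(line_start, line_end, point):
    (x1, y1), (x2, y2) = line_start, line_end
    px, py = point
    # 2D cross product of the line direction vector and the start->point vector.
    d = (x2 - x1) * (py - y1) - (y2 - y1) * (px - x1)
    if d == 0:
        return 0  # point lies exactly on the (infinite) line
    return 1 if d < 0 else 2  # 1 = left, 2 = right (assumed mapping)

# Values from the documented example above:
print(lineside((25, 25), (25, 20), (20, 0)))   # -> 1
print(lineside((15, 25), (15, 20), (15, 20)))  # -> 0
```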
- static rank_shapes(shapes: List[Polygon], method: typing_extensions.Literal['area', 'min_distance', 'max_distance', 'mean_distance', 'left_to_right', 'top_to_bottom'], deviation: Optional[bool] = False, descending: Optional[bool] = True) List[Polygon] [source]
Rank a list of polygon geometries based on a specified method. E.g., order the list of geometries according to sizes or distances to each other or from left to right etc.
- Parameters
shapes (List[Polygon]) – List of Shapely polygons to be ranked. List has to contain two or more shapes.
method (Literal['area', 'min_center_distance', 'max_center_distance', 'mean_shape_distance']) – The ranking method to use.
deviation (Optional[bool]) – If True, rank based on absolute deviation from the mean. Default: False.
descending (Optional[bool]) – If True, rank in descending order; otherwise, rank in ascending order. Default: True.
- Returns
The input list of Shapely polygons sorted according to the specified ranking method.
- static shape_distance(shapes: List[Union[LineString, Polygon, Point]], pixels_per_mm: float, unit: typing_extensions.Literal['mm', 'cm', 'dm', 'm'] = 'mm') float [source]
Calculate the distance between two geometries in specified units.
- Parameters
shapes (List[Union[LineString, Polygon]]) – A list containing two LineString or Polygon geometries.
pixels_per_mm (float) – The conversion factor from pixels to millimeters.
unit (Literal['mm', 'cm', 'dm', 'm']) – The desired unit for the distance calculation. Options: ‘mm’, ‘cm’, ‘dm’, ‘m’. Defaults to ‘mm’.
- Return float
The distance between the two geometries in the specified unit.
>>> shape_1 = Polygon([(0, 0), (10, 10), (0, 10), (10, 0)])
>>> shape_2 = Polygon([(0, 0), (10, 10), (0, 10), (10, 0)])
>>> GeometryMixin.shape_distance(shapes=[shape_1, shape_2], pixels_per_mm=1)
>>> 0
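The unit handling reduces to simple arithmetic: the pixel distance is divided by pixels_per_mm, then scaled to the requested unit. A minimal sketch (the helper name is illustrative, not SimBA API):

```python
def px_distance_to_unit(distance_px, pixels_per_mm, unit='mm'):
    # Convert pixel distance to millimeters, then scale to the target unit.
    scale = {'mm': 1.0, 'cm': 10.0, 'dm': 100.0, 'm': 1000.0}
    return (distance_px / pixels_per_mm) / scale[unit]

print(px_distance_to_unit(250, 5, unit='cm'))  # -> 5.0 (250 px = 50 mm = 5 cm)
```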
- static simba_roi_to_geometries(rectangles_df: DataFrame, circles_df: DataFrame, polygons_df: DataFrame, color: Optional[bool] = False) dict [source]
Convert SimBA dataframes holding ROI geometries to a nested dictionary holding Shapely polygons.
- Example
>>> #config_path = '/Users/simon/Desktop/envs/simba/troubleshooting/spontenous_alternation/project_folder/project_config.ini'
>>> #config = ConfigReader(config_path=config_path)
>>> #config.read_roi_data()
>>> #GeometryMixin.simba_roi_to_geometries(rectangles_df=config.rectangles_df, circles_df=config.circles_df, polygons_df=config.polygon_df)
- static static_point_lineside(lines: ndarray, point: ndarray) ndarray [source]
Determine the relative position (left vs right) of a static point with respect to multiple lines.
Note
Modified from rayryeng.
- Parameters
lines (numpy.ndarray) – An array of shape (N, 2, 2) representing N lines, where each line is defined by two points. The first point that denotes the beginning of the line, the second point denotes the end of the line.
point (numpy.ndarray) – A 2-element array representing the coordinates of the static point.
- Return np.ndarray
An array of length N containing the results for each line. 2 if the point is on the right side of the line. 1 if the point is on the left side of the line. 0 if the point is on the line.
- Example
>>> line = np.array([[[25, 25], [25, 20]], [[15, 25], [15, 20]], [[15, 25], [50, 20]]]).astype(np.float32)
>>> point = np.array([20, 0]).astype(np.float64)
>>> GeometryMixin().static_point_lineside(lines=line, point=point)
>>> [1. 2. 1.]
- static symmetric_difference(shapes: List[Union[LineString, Polygon, MultiPolygon]]) List[Union[Polygon, MultiPolygon]] [source]
Computes a new geometry consisting of the parts that are exclusive to each input geometry.
In other words, it includes the parts that are unique to each geometry while excluding the parts that are common to both.
- Parameters
shapes (List[Union[LineString, Polygon, MultiPolygon]]) – A list of LineString, Polygon, or MultiPolygon geometries to find the symmetric difference.
- Return List[Union[Polygon, MultiPolygon]]
A list containing the resulting geometries after performing symmetric difference operations.
- Example
>>> polygon_1 = GeometryMixin().bodyparts_to_polygon(np.array([[10, 10], [10, 100], [100, 10], [100, 100]]))
>>> polygon_2 = GeometryMixin().bodyparts_to_polygon(np.array([[1, 25], [1, 75], [110, 25], [110, 75]]))
>>> symmetric_difference = GeometryMixin.symmetric_difference(shapes=[polygon_1, polygon_2])
- static to_linestring(data: ndarray) LineString [source]
Convert a 2D array of x and y coordinates to a shapely linestring.
Linestrings are useful for representing an animal's path, and for answering questions like: (i) How far along its path was the animal most proximal to geometry X? (ii) How far had the animal travelled at time T? (iii) When does the animal's path intersect geometry X?
- Parameters
data (np.ndarray) – 2D array with floats or ints of size Nx2 representing body-part coordinates.
- Example
>>> data = np.load('/Users/simon/Desktop/envs/simba/simba/simba/sandbox/data.npy')
>>> linestring = GeometryMixin.to_linestring(data=data)
- static union(shapes: List[Union[LineString, Polygon, MultiPolygon]]) Union[MultiPolygon, Polygon, MultiLineString] [source]
Compute the union of multiple geometries.
- Parameters
shapes (List[Union[LineString, Polygon, MultiPolygon]]) – A list of LineString, Polygon, or MultiPolygon geometries to be unioned.
- Return Union[MultiPolygon, Polygon]
The resulting geometry after performing the union operation.
- Example
>>> polygon_1 = GeometryMixin().bodyparts_to_polygon(np.array([[10, 10], [10, 100], [100, 10], [100, 100]]))
>>> polygon_2 = GeometryMixin().bodyparts_to_polygon(np.array([[1, 25], [1, 75], [110, 25], [110, 75]]))
>>> union = GeometryMixin().union(shapes=[polygon_1, polygon_2])
- static view_shapes(shapes: List[Union[LineString, Polygon, MultiPolygon, MultiLineString]], bg_img: Optional[ndarray] = None, bg_clr: Optional[Tuple[int]] = None, size: Optional[int] = None, color_palette: Optional[str] = None) ndarray [source]
Helper function to draw shapes on a white canvas or a specified background image. Useful for quick troubleshooting.
- Example
>>> multipolygon_1 = MultiPolygon([Polygon([[200, 110], [200, 100], [200, 100], [200, 110]]), Polygon([[70, 70], [70, 60], [10, 50], [1, 70]])])
>>> polygon_1 = GeometryMixin().bodyparts_to_polygon(np.array([[100, 110], [100, 100], [110, 100], [110, 110]]))
>>> line_1 = GeometryMixin().bodyparts_to_line(np.array([[10, 70], [20, 60], [30, 50], [40, 70]]))
>>> img = GeometryMixin.view_shapes(shapes=[line_1, polygon_1, multipolygon_1])
Network (Graph) methods
- class simba.mixins.network_mixin.NetworkMixin[source]
Bases:
object
Methods to create and analyze time-dependent graphs from pose-estimation data.
When working with pose-estimation data for more than two animals - over extended periods - it can be beneficial to represent the data as a graph where the animals feature as nodes and their relationship strengths are represented as edges.
When formatted as a graph, we can compute (i) how the relationships between animal pairs change across time and recordings, (ii) the relative importances and hierarchies of individual animals within the group, or (iii) identify sub-groups within the network.
The critical component determining the results is how edge weights are represented. These edge weight values could be the amount of time animal bounding boxes overlap each other, aggregate distances between the animals, or how much time animals engage in coordinated behaviors. These values can be computed through other SimBA mixin methods.
Very much a work-in-progress; the methods so far primarily depend on networkx.
References
See below references for mature and reliable packages (12/2023):
- static berger_parker(x: ndarray) float [source]
Berger-Parker index for the given one-dimensional array. The Berger-Parker index is a measure of category dominance, calculated as the ratio of the frequency of the most abundant category to the total number of observations. It answers how dominated a cluster or community is by a single category of a categorical variable.
The Berger-Parker index (BP) is calculated using the formula:
BP = f_max / N
where: - ( f_max ) is the frequency of the most abundant category, - ( N ) is the total number of observations.
- Parameters
x (np.ndarray) – One-dimensional numpy array containing the values for which the Berger-Parker index is calculated.
- Return float
Berger-Parker index value for the input array x
- Example
>>> x = np.random.randint(0, 25, (100,)).astype(np.float32)
>>> z = NetworkMixin.berger_parker(x=x)
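The formula is straightforward to verify in plain Python (an illustrative stand-in, not the SimBA implementation):

```python
from collections import Counter

def berger_parker(x):
    # Dominance: frequency of the most abundant category over total observations.
    return max(Counter(x).values()) / len(x)

print(berger_parker([1, 1, 1, 2]))  # -> 0.75 (category 1 makes up 3 of 4 observations)
```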
- static brillouins_index(x: array) float [source]
Calculate Brillouin’s Diversity Index for a given array of values.
Brillouin’s Diversity Index is a measure of cluster/community diversity that accounts for both richness and evenness of distribution.
Brillouin’s Diversity Index (H) is calculated using the formula:
H = (ln(n!) - Σ ln(N_i!)) / n
where: - ( H ) is Brillouin’s Diversity Index, - ( N_i ) is the count of individuals in the i-th species, - ( n ) is the total number of individuals, and the sum runs over the S unique species.
- Parameters
x (np.array) – One-dimensional numpy array containing the values for which Brillouin’s Index is calculated.
- Return float
Brillouin’s Diversity Index value for the input array x
- Example
>>> x = np.random.randint(0, 10, (100,))
>>> NetworkMixin.brillouins_index(x)
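A plain-Python check of the formula, using math.lgamma to evaluate ln(k!) without computing huge factorials (illustrative, not the SimBA implementation):

```python
import math
from collections import Counter

def brillouins_index(x):
    n = len(x)
    # ln(k!) == lgamma(k + 1); avoids overflowing on large factorials.
    log_total = math.lgamma(n + 1)
    log_counts = sum(math.lgamma(c + 1) for c in Counter(x).values())
    return (log_total - log_counts) / n

print(brillouins_index([1, 1, 1, 1]))  # -> 0.0 (a single category: no diversity)
```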
- static create_graph(data: Dict[Tuple[str, str], float]) Graph [source]
Create a single undirected graph with single edges from a dictionary.
- Parameters
data (Dict[Tuple[str, str], float]) – A dictionary where keys are tuples representing node pairs and values are the corresponding edge weights.
- Returns nx.Graph
A networkx graph with nodes and edges defined by the input data.
- Example
>>> data = {('Animal_1', 'Animal_2'): 1.0, ('Animal_1', 'Animal_3'): 0.2, ('Animal_2', 'Animal_3'): 0.5}
>>> graph = NetworkMixin.create_graph(data=data)
- static create_multigraph(data: Dict[Tuple[str, str], List[float]]) MultiGraph [source]
Create a multi-graph from a dictionary of node pairs and associated edge weights.
For example, creates a multi-graph where node edges represent animal relationship weights at different timepoints.
- Parameters
data (Dict[Tuple[str, str], List[float]]) – A dictionary where keys are tuples representing node pairs, and values are lists of edge weights associated with each pair.
- Returns nx.MultiGraph
A NetworkX multigraph with nodes and edges specified by the input data. Each edge is labeled and weighted based on the provided information.
- Example
>>> data = {('Animal_1', 'Animal_2'): [0, 0, 0, 6], ('Animal_1', 'Animal_3'): [0, 0, 0, 0], ('Animal_1', 'Animal_4'): [0, 0, 0, 0], ('Animal_1', 'Animal_5'): [0, 0, 0, 0], ('Animal_2', 'Animal_3'): [0, 0, 0, 0], ('Animal_2', 'Animal_4'): [5, 0, 0, 2], ('Animal_2', 'Animal_5'): [0, 0, 0, 0], ('Animal_3', 'Animal_4'): [0, 0, 0, 0], ('Animal_3', 'Animal_5'): [0, 2, 22, 0], ('Animal_4', 'Animal_5'): [0, 0, 0, 0]}
>>> NetworkMixin().create_multigraph(data=data)
- static girvan_newman(graph: Graph, levels: Optional[int] = 1, most_valuable_edge: object = None)[source]
- Example
>>> graph = NetworkMixin.create_graph({('Animal_1', 'Animal_2'): 0.0, ('Animal_1', 'Animal_3'): 0.0, ('Animal_1', 'Animal_4'): 0.0, ('Animal_1', 'Animal_5'): 0.0, ('Animal_2', 'Animal_3'): 1.0, ('Animal_2', 'Animal_4'): 1.0, ('Animal_2', 'Animal_5'): 1.0, ('Animal_3', 'Animal_4'): 1.0, ('Animal_3', 'Animal_5'): 1.0, ('Animal_4', 'Animal_5'): 1.0})
>>> NetworkMixin().girvan_newman(graph=graph, levels=1)
>>> [({'Animal_1'}, {'Animal_2', 'Animal_3', 'Animal_4', 'Animal_5'})]
- static graph_current_flow_closeness_centrality(graph: Graph, weights: Optional[str] = 'weight')[source]
- Example
>>> graph = NetworkMixin.create_graph(data={('Animal_1', 'Animal_2'): 1.0, ('Animal_1', 'Animal_3'): 0.2, ('Animal_2', 'Animal_3'): 0.5})
>>> NetworkMixin().graph_current_flow_closeness_centrality(graph=graph)
- static graph_katz_centrality(graph: Graph, weights: Optional[str] = 'weight', alpha: Optional[float] = 0.85)[source]
Katz centrality is an algorithm in NetworkX that measures the relative influence of a node in a network.
See networkx documentation
- Example
>>> graph = NetworkMixin.create_graph(data={('Animal_1', 'Animal_2'): 1.0, ('Animal_1', 'Animal_3'): 0.2, ('Animal_2', 'Animal_3'): 0.5})
>>> NetworkMixin().graph_katz_centrality(graph=graph)
- static graph_page_rank(graph: Graph, weights: Optional[str] = 'weight', alpha: Optional[float] = 0.85, max_iter: Optional[int] = 100) Dict[str, float] [source]
Calculate the PageRank of nodes in a graph.
- Example
>>> graph = NetworkMixin.create_graph(data={('Animal_1', 'Animal_2'): 1.0, ('Animal_1', 'Animal_3'): 0.2, ('Animal_2', 'Animal_3'): 0.5})
>>> NetworkMixin().graph_page_rank(graph=graph)
- static margalef_diversification_index(x: array) float [source]
Calculate the Margalef Diversification Index for a given array of values.
The Margalef Diversification Index is a measure of category diversity. It quantifies the richness of a community/cluster relative to the number of individuals. A high Margalef Diversification Index indicates a high diversity of categories relative to the number of observations. A low Margalef Diversification Index suggests a lower diversity of categories relative to the number of observations.
The Margalef Diversification Index (D) is calculated using the formula:
D = (S - 1) / ln(N)
where: - ( S ) is the number of unique categories, - ( N ) is the total number of individuals.
- Parameters
x (np.array) – One-dimensional numpy array containing nominal values for which the Margalef Diversification Index is calculated.
- Return float
Margalef Diversification Index value for the input array x
- Example
>>> x = np.random.randint(0, 100, (100,))
>>> NetworkMixin.margalef_diversification_index(x=x)
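The formula can be checked in plain Python (an illustrative stand-in, not the SimBA implementation):

```python
import math

def margalef_index(x):
    # (unique categories - 1) divided by ln(total observations).
    return (len(set(x)) - 1) / math.log(len(x))

print(margalef_index([0, 1, 2] * 10))  # 3 categories over 30 observations: 2 / ln(30)
```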
- static menhinicks_index(x: array) float [source]
Calculate the Menhinick’s Index for a given array of values.
Menhinick’s Index is a measure of category richness. It quantifies the number of categories relative to the square root of the total number of observations. A high Menhinick’s Index suggests a high diversity of categories relative to the number of observations. A low Menhinick’s Index indicates a lower diversity of categories relative to the number of observations.
Menhinick’s Index (D) is calculated using the formula:
D = S / sqrt(N)
where: - ( S ) is the number of unique categories, - ( N ) is the total number of observations.
- param np.array x
One-dimensional numpy array containing the integer values representing nominal values for which Menhinick’s Index is calculated.
- return float
Menhinick’s Index value for the input array x
- Example
>>> x = np.random.randint(0, 5, (1000,))
>>> NetworkMixin.menhinicks_index(x=x)
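Again, the index reduces to two counts (an illustrative stand-in, not the SimBA implementation):

```python
import math

def menhinicks_index(x):
    # Unique categories divided by the square root of total observations.
    return len(set(x)) / math.sqrt(len(x))

print(menhinicks_index([0, 1, 2, 3] * 4))  # -> 1.0 (4 categories, 16 observations)
```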
- static multigraph_page_rank(graph: MultiGraph, weights: Optional[str] = 'weight', alpha: Optional[float] = 0.85, max_iter: Optional[int] = 100) Dict[str, List[float]] [source]
Calculate multi-graph PageRank scores for each node in a MultiGraph.
For example, each node-pair in a graph has N undirected edges representing the weighted relationship between the two nodes at each observed point in time. Calculates the page rank of each node at each observed time point.
- Parameters
graph (nx.MultiGraph) – The input MultiGraph, created by NetworkMixin.create_multigraph().
- Example
>>> multigraph = NetworkMixin().create_multigraph(data={('Animal_1', 'Animal_2'): [0, 0, 0, 6], ('Animal_1', 'Animal_3'): [0, 0, 0, 0], ('Animal_1', 'Animal_4'): [0, 0, 0, 0], ('Animal_1', 'Animal_5'): [0, 0, 0, 0], ('Animal_2', 'Animal_3'): [0, 0, 0, 0], ('Animal_2', 'Animal_4'): [5, 0, 0, 2], ('Animal_2', 'Animal_5'): [0, 0, 0, 0], ('Animal_3', 'Animal_4'): [0, 0, 0, 0], ('Animal_3', 'Animal_5'): [0, 2, 22, 0], ('Animal_4', 'Animal_5'): [0, 0, 0, 0]})
>>> NetworkMixin().multigraph_page_rank(graph=multigraph)
>>> {'Animal_1': [0.06122524589028524, 0.06122524589028524, 0.06122524589028524, 0.32739635847890775], 'Animal_2': [0.06122524589028524, 0.40816213116457223, 0.06122524589028524, 0.442259400816002], 'Animal_3': [0.40816213116457223, 0.06122524589028524, 0.40816213116457223, 0.04545454545454547], 'Animal_4': [0.06122524589028524, 0.40816213116457223, 0.06122524589028524, 0.13943514979599955], 'Animal_5': [0.40816213116457223, 0.06122524589028524, 0.40816213116457223, 0.04545454545454547]}
- static shannon_diversity_index(x: ndarray) float [source]
Calculate the Shannon Diversity Index for a given array of categories. The Shannon Diversity Index is a measure of diversity in a categorical feature, taking into account both the number of different categories (richness) and their relative abundances (evenness). It answers how homogeneous a cluster or community is for a categorical variable. A low value indicates that one or a few categories dominate.
H = -\sum_{i=1}^{n} p_i \ln(p_i)
where: - ( p_i ) is the proportion of individuals belonging to the i-th category, - ( n ) is the total number of categories.
- Parameters
x (np.ndarray) – One-dimensional numpy array containing the categories for which the Shannon Diversity Index is calculated.
- Return float
Shannon Diversity Index value for the input array x
- Example
>>> x = np.random.randint(0, 100, (100, )) >>> NetworkMixin.shannon_diversity_index(x=x)
- static simpson_index(x: ndarray) float [source]
Calculate Simpson’s diversity index for a given array of values.
Simpson’s diversity index is a measure of diversity that takes into account the number of different categories present in the input data as well as the relative abundance of each category. It answers how homogeneous a cluster or community is for a categorical input variable.
D = \frac{\sum n(n-1)}{N(N-1)}
where: - ( n ) is the number of individuals of a particular category, - ( N ) is the total number of individuals, - the sum runs over all categories.
- param np.ndarray x
1-dimensional numpy array containing the values representing categories for which Simpson’s index is calculated.
- return float
Simpson’s diversity index value for the input array x
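No usage example accompanies this method. As a sketch of the formula above (a hypothetical standalone implementation, not the class method itself):

```python
import numpy as np

def simpson_index(x: np.ndarray) -> float:
    """Simpson's index: D = sum(n * (n - 1)) / (N * (N - 1)), where n is
    the count of each category and N is the total observation count."""
    _, counts = np.unique(x, return_counts=True)
    n_total = counts.sum()
    return float(np.sum(counts * (counts - 1)) / (n_total * (n_total - 1)))

print(simpson_index(np.array([1, 1, 1, 1])))  # 1.0: a single category dominates
print(simpson_index(np.array([0, 1, 2, 3])))  # 0.0: every observation is unique
```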
- static sorensen_dice_coefficient(x: ndarray, y: ndarray) float [source]
Calculate Sørensen’s Similarity Index between two communities/clusters.
The Sørensen similarity index, also known as the overlap index, quantifies the overlap between two populations by comparing the number of shared categories to the total number of categories in both populations. It ranges from zero, indicating no overlap, to one, representing perfect overlap.
Sørensen’s Similarity Index (S) is calculated using the formula:
S = \frac{2|X \cap Y|}{|X| + |Y|}
where: - ( S ) is Sørensen’s Similarity Index, - ( X ) and ( Y ) are the sets representing the categories in the first and second communities, respectively, - ( |X \cap Y| ) is the number of shared categories between the two communities, - ( |X| ) and ( |Y| ) are the total number of categories in the first and second communities, respectively.
- Parameters
x – 1D numpy array with nominal values for the first cluster/community.
y – 1D numpy array with nominal values for the second cluster/community.
- Returns
Sørensen’s Similarity Index between x and y.
- Example
>>> x = np.random.randint(0, 10, (100,)) >>> y = np.random.randint(0, 10, (100,)) >>> NetworkMixin.sorensen_dice_coefficient(x=x, y=y)
- static visualize(graph: Graph, save_path: Optional[Union[str, PathLike]] = None, node_size: Optional[Union[float, Dict[str, float]]] = 25.0, palette: Optional[Union[str, Dict[str, str]]] = 'magma', img_size: Optional[Tuple[int, int]] = (500, 500)) Union[None, Network] [source]
Visualizes a network graph using the vis.js library and saves the result as an HTML file.
Note
Multi-networks created by
simba.mixins.network_mixin.create_multigraph
can be a little messy to look at. Instead, creates separate objects and files with single edges from each time-point.- Parameters
graph (Union[nx.Graph, nx.MultiGraph]) – The input graph to be visualized.
save_path (Optional[Union[str, os.PathLike]]) – The path to save the HTML file. If None, the graph is not saved but returned. Default: None.
node_size (Optional[Union[float, Dict[str, float]]]) – The size of nodes. Can be a single float or a dictionary mapping node names to their respective sizes. Default: 25.0.
palette (Optional[Union[str, Dict[str, str]]]) – The color palette for nodes. Can be a single string representing a palette name or a dictionary mapping node names to their respective colors. Default: magma.
img_size (Optional[Tuple[int, int]]) – The size of the resulting image in pixels, represented as (width, height). Default: 500x500.
- Example
>>> graph = NetworkMixin.create_graph(data={('Animal_1', 'Animal_2'): 1.0, ('Animal_1', 'Animal_3'): 0.2, ('Animal_2', 'Animal_3'): 0.5}) >>> graph_pg = NetworkMixin().graph_page_rank(graph=graph)
Feature extraction supplement methods
- class simba.mixins.feature_extraction_supplement_mixin.FeatureExtractionSupplemental[source]
Bases:
FeatureExtractionMixin
Additional feature extraction methods not called by default feature extraction classes from
simba.feature_extractors
.- static border_distances(data: ndarray, pixels_per_mm: float, img_resolution: ndarray, time_window: float, fps: int)[source]
Compute the mean distance of a key-point to the left, right, top, and bottom sides of the image in rolling time-windows. Uses a straight line.
Attention
Output for initial frames where [current_frm - window_size] < 0 will be populated with
-1
.- Parameters
data (np.ndarray) – 2d array of size len(frames)x2 with body-part coordinates.
img_resolution (np.ndarray) – Resolution of video in WxH format.
pixels_per_mm (float) – Pixels per millimeter of recorded video.
fps (int) – FPS of the recorded video
time_window (float) – Rolling time-window as a float in seconds. E.g.,
0.2
- Returns np.ndarray
Size data.shape[0] x 4 array with millimeter distances from the LEFT, RIGHT, TOP, and BOTTOM image borders.
- Example
>>> data = np.array([[250, 250], [250, 250], [250, 250], [500, 500],[500, 500], [500, 500]]).astype(float) >>> img_resolution = np.array([500, 500]) >>> FeatureExtractionSupplemental().border_distances(data=data, img_resolution=img_resolution, time_window=1, fps=2, pixels_per_mm=1) >>> [[-1, -1, -1, -1][250, 250, 250, 250][250, 250, 250, 250][375, 125, 375, 125][500, 0, 500, 0][500, 0, 500, 0]]
- static consecutive_time_series_categories_count(data: ndarray, fps: int)[source]
Compute the count of consecutive milliseconds the feature value has remained static. For example, compute for how long in milliseconds the animal has remained in the current cardinal direction or within an ROI.
- Parameters
data (np.ndarray) – 1d array of feature values
fps (int) – Frame-rate of video.
- Returns np.ndarray
Array of size data.shape[0]
- Example
>>> data = np.array([0, 1, 1, 1, 4, 5, 6, 7, 8, 9]) >>> FeatureExtractionSupplemental().consecutive_time_series_categories_count(data=data, fps=10) >>> [0.1, 0.1, 0.2, 0.3, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1] >>> data = np.array(['A', 'B', 'B', 'B', 'C', 'D', 'E', 'F', 'G', 'H']) >>> FeatureExtractionSupplemental().consecutive_time_series_categories_count(data=data, fps=10) >>> [0.1, 0.1, 0.2, 0.3, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1]
- static distance_and_velocity(x: ndarray, fps: float, pixels_per_mm: float, centimeters: Optional[bool] = True) Tuple[float, float] [source]
Calculate total movement and mean velocity from a sequence of position data.
- Parameters
x – Array containing movement data. For example, created by
simba.mixins.FeatureExtractionMixin.framewise_euclidean_distance
.fps – Frames per second of the data.
pixels_per_mm – Conversion factor from pixels to millimeters.
centimeters (Optional[bool]) – If True, results are returned in centimeters and centimeters per second. Defaults to True.
- Return Tuple[float, float]
A tuple containing total movement and mean velocity.
- Example
>>> x = np.random.randint(0, 100, (100,)) >>> sum_movement, avg_velocity = FeatureExtractionSupplemental.distance_and_velocity(x=x, fps=10, pixels_per_mm=10, centimeters=True)
- euclidean_distance_timeseries_change(location_1: ndarray, location_2: ndarray, fps: int, px_per_mm: float, time_windows: ndarray = array([0.2, 0.4, 0.8, 1.6])) ndarray [source]
Compute the difference in distance between two points in the current frame versus N.N seconds ago. E.g., computes if two points are traveling away from each other (positive output values) or towards each other (negative output values) relative to reference time-point(s)
- Parameters
location_1 (ndarray) – 2D array of size len(frames) x 2 representing pose-estimated locations of body-part one
location_2 (ndarray) – 2D array of size len(frames) x 2 representing pose-estimated locations of body-part two
fps (int) – Fps of the recorded video.
px_per_mm (float) – The pixels per millimeter in the video.
time_windows (np.ndarray) – Time windows to compare.
- Return np.array
Array of size location_1.shape[0] x time_windows.shape[0]
- Example
>>> location_1 = np.random.randint(low=0, high=100, size=(2000, 2)).astype('float32') >>> location_2 = np.random.randint(low=0, high=100, size=(2000, 2)).astype('float32') >>> distances = self.euclidean_distance_timeseries_change(location_1=location_1, location_2=location_2, fps=10, px_per_mm=4.33, time_windows=np.array([0.2, 0.4, 0.8, 1.6]))
- static find_path_loops(data: ndarray) Dict[Tuple[int], List[int]] [source]
Compute the loops detected within a 2-dimensional path.
- Parameters
data (np.ndarray) – Nx2 2-dimensional array with the x and y coordinates represented on axis 1.
- Returns
Dictionary with the coordinate tuple(x, y) as keys, and sequential frame numbers as values when animals visited, and re-visited the key coordinate.
- Example
>>> data = read_df(file_path='/Users/simon/Desktop/envs/simba/troubleshooting/mouse_open_field/project_folder/csv/outlier_corrected_movement_location/SI_DAY3_308_CD1_PRESENT.csv', usecols=['Center_x', 'Center_y'], file_type='csv').values.astype(int) >>> FeatureExtractionSupplemental.find_path_loops(data=data)
- static peak_ratio(data: ndarray, bin_size_s: int, fps: int)[source]
Compute the ratio of peak values relative to the number of values within each sequential time-period of
bin_size_s
seconds. A peak is defined as a value higher than the prior observation (i.e., no future data is involved in the comparison).- Parameters
- Return np.ndarray
Array of size data.shape[0] with peak counts as ratio of len(frames).
- Example
>>> data = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]) >>> FeatureExtractionSupplemental().peak_ratio(data=data, bin_size_s=1, fps=10) >>> [0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9] >>> data = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]) >>> FeatureExtractionSupplemental().peak_ratio(data=data, bin_size_s=1, fps=10) >>> [0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. ]
- static rolling_categorical_switches_ratio(data: ndarray, time_windows: ndarray, fps: int) ndarray [source]
Compute the ratio of categorical feature switches within rolling windows.
Attention
Output for initial frames where [current_frm - window_size] < 0, are populated with
0
.- Parameters
data (np.ndarray) – 1d array of feature values
time_windows (np.ndarray) – Rolling time-windows as floats in seconds. E.g., [0.2, 0.4, 0.6]
fps (int) – fps of the recorded video
- Returns np.ndarray
Size data.shape[0] x time_windows.shape[0] array
- Example
>>> data = np.array([0, 1, 1, 1, 4, 5, 6, 7, 8, 9]) >>> FeatureExtractionSupplemental().rolling_categorical_switches_ratio(data=data, time_windows=np.array([1.0]), fps=10) >>> [[-1][-1][-1][-1][-1][-1][-1][-1][-1][ 0.7]] >>> data = np.array(['A', 'B', 'B', 'B', 'C', 'D', 'E', 'F', 'G', 'H']) >>> FeatureExtractionSupplemental().rolling_categorical_switches_ratio(data=data, time_windows=np.array([1.0]), fps=10) >>> [[-1][-1][-1][-1][-1][-1][-1][-1][-1][ 0.7]]
- static rolling_horizontal_vs_vertical_movement(data: ndarray, pixels_per_mm: float, time_windows: ndarray, fps: int) ndarray [source]
Compute the movement along the x-axis relative to the y-axis in rolling time bins.
Attention
Output for initial frames where [current_frm - window_size] < 0, are populated with
0
.- Parameters
- Returns np.ndarray
Size data.shape[0] x time_windows.shape[0]. Greater values denote greater movement on the x-axis relative to the y-axis.
- Example
>>> data = np.array([[250, 250], [250, 250], [250, 250], [250, 500], [500, 500], [500, 500]]).astype(float) >>> FeatureExtractionSupplemental().rolling_horizontal_vs_vertical_movement(data=data, time_windows=np.array([1.0]), fps=2, pixels_per_mm=1) >>> [[ -1.][ 0.][ 0.][-250.][ 250.][ 0.]]
- static sequential_lag_analysis(data: DataFrame, criterion: str, target: str, time_window: float, fps: float)[source]
Perform sequential lag analysis to determine the temporal relationship between two events.
For every onset of behavior C, count the proportion of behavior T onsets in the time-window preceding the onset of behavior C vs the proportion of behavior T onsets in the time-window following the onset of behavior C.
A value closer to 1.0 indicates that behavior T always precedes behavior C. A value closer to 0.0 indicates that behavior T follows behavior C. A value of -1.0 indicates that behavior T neither precedes nor follows behavior C.
See also
simba.data_processors.fsttc_calculator.FSTTCCalculator
- Example
>>> df = pd.DataFrame(np.random.randint(0, 2, (100, 2)), columns=['Attack', 'Sniffing']) >>> FeatureExtractionSupplemental.sequential_lag_analysis(data=df, criterion='Attack', target='Sniffing', fps=5, time_window=2.0)
References
- 1
Casarrubea et al., Structural analyses in the study of behavior: From rodents to non-human primates, Frontiers in Psychology, 2022.
- static spontaneous_alternations(data: DataFrame, arm_names: List[str], center_name: str) Tuple[Dict[Union[str, Tuple[int]], int]] [source]
Detects spontaneous alternations between a set of user-defined ROIs.
- Parameters
data (pd.DataFrame) – DataFrame containing shape data where each row represents a frame and each column represents a shape where 0 represents not in ROI and 1 represents inside the ROI
arm_names (List[str]) – List of column names in the DataFrame corresponding to arm ROI names.
- Returns Dict[Union[str, Tuple[str], Union[int, float, List[int]]]]
Dict with the following keys and values:
‘pct_alternation’: Percent alternation computed as (spontaneous alternation cnt / (total number of arm entries - (number of arms - 1))) × 100
‘alternation_cnt’: The sliding count of ROI entry sequences of length len(shape_names) that are all unique.
‘same_arm_returns_cnt’: Aggregate count of sequential visits to the same ROI.
‘alternate_arm_returns_cnt’: Aggregate count of errors which are not same-arm-return errors.
‘error_cnt’: Aggregate error count (same_arm_returns_cnt + alternate_arm_returns_cnt),
‘same_arm_returns_dict’: Dictionary where keys are ROI names and values are lists of frames when same-arm-return errors were committed.
‘alternate_arm_returns_dict’: Dictionary where keys are ROI names and values are lists of frames when alternate-arm-return errors were committed.
‘alternations_dict’: Dictionary with the keys being unique ROI name tuple sequences of length len(shape_names) and values are a list of frames when the sequence was completed.
‘arm_entry_sequence’: Pandas dataframe with three columns: the sequence of arm names entered, the frame the animal entered the arm, and the frame the animal left the arm.
- Example
>>> data = np.zeros((100, 4), dtype=int) >>> random_indices = np.random.randint(0, 4, size=100) >>> for i in range(100): data[i, random_indices[i]] = 1 >>> df = pd.DataFrame(data, columns=['left', 'top', 'right', 'bottom']) >>> spontanous_alternations = FeatureExtractionSupplemental.spontaneous_alternations(data=df, shape_names=['left', 'top', 'right', 'bottom'])
- static velocity_aggregator(config_path: Union[str, PathLike], data_dir: Union[str, PathLike], body_part: str, ts_plot: Optional[bool] = True)[source]
Aggregate and plot velocity data from multiple pose-estimation files.
- Parameters
config_path (Union[str, os.PathLike]) – Path to SimBA configuration file.
data_dir (Union[str, os.PathLike]) – Directory containing data files.
body_part (str) – Body part to use when calculating velocity.
ts_plot (Optional[bool]) – Whether to generate a time series plot of velocities for each data file. Defaults to True.
- Example
>>> config_path = '/Users/simon/Desktop/envs/simba/troubleshooting/two_black_animals_14bp/project_folder/project_config.ini' >>> data_dir = '/Users/simon/Desktop/envs/simba/troubleshooting/two_black_animals_14bp/project_folder/csv/outlier_corrected_movement_location' >>> body_part = 'Nose_1' >>> FeatureExtractionSupplemental.velocity_aggregator(config_path=config_path, data_dir=data_dir, body_part=body_part)
Statistics methods
- class simba.mixins.statistics_mixin.Statistics[source]
Bases:
FeatureExtractionMixin
Primarily frequentist statistics methods used for feature extraction or drift assessment.
Note
Most methods implemented using numba parallelization for improved run-times. See line graph below for expected run-times for a few methods included in this class.
Most methods have numba-typed signatures to decrease compilation time through reduced type inference. Make sure to pass the correct dtypes as indicated by the signature decorators. If dtype is not specified at array creation, it will typically be
float64
or
int64
. As most methods here use
float32
for the input data argument, make sure to downcast. This class contains a few probability distribution comparison methods. These are being moved to
simba.sandbox.distances
(05.24).- static adjusted_mutual_info(x: ndarray, y: ndarray) float [source]
Calculate the Adjusted Mutual Information (AMI) between two clusterings as a measure of similarity.
Calculates the Adjusted Mutual Information (AMI) between two sets of cluster labels. AMI measures the agreement between two clustering results, accounting for chance agreement. The value of AMI ranges from 0 (indicating no agreement) to 1 (perfect agreement).
AMI(x, y) = \frac{MI(x, y) - E(MI(x, y))}{\max(H(x), H(y)) - E(MI(x, y))}
- where:
( MI(x, y) ) is the mutual information between ( x ) and ( y ),
( E(MI(x, y)) ) is the expected mutual information,
( H(x) ) and ( H(y) ) are the entropies of ( x ) and ( y ), respectively.
- param np.ndarray x
1D array representing the labels of the first model.
- param np.ndarray y
1D array representing the labels of the second model.
- return float
Score between 0 and 1, where 1 indicates perfect clustering agreement.
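No usage example accompanies this method. As a sanity check of AMI behavior, scikit-learn's reference implementation (assumed available; not the jitted SimBA method) shows that relabeling clusters does not affect the score:

```python
import numpy as np
from sklearn.metrics import adjusted_mutual_info_score

# Two clusterings that are identical up to a relabeling of cluster ids.
x = np.array([0, 0, 1, 1, 2, 2])
y = np.array([1, 1, 0, 0, 2, 2])
print(adjusted_mutual_info_score(x, y))  # 1.0: perfect agreement

# Unrelated random labels drift toward 0 (chance-level agreement).
rng = np.random.default_rng(42)
print(adjusted_mutual_info_score(rng.integers(0, 3, 600), rng.integers(0, 3, 600)))
```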
- static adjusted_rand(x: ndarray, y: ndarray) float [source]
Calculate the Adjusted Rand Index (ARI) between two clusterings.
The Adjusted Rand Index (ARI) is a measure of the similarity between two clusterings. It considers all pairs of samples and counts pairs that are assigned to the same or different clusters in both the true and predicted clusterings.
The ARI is defined as:
- where:
TP (True Positive) is the number of pairs of elements that are in the same cluster in both x and y,
FP (False Positive) is the number of pairs of elements that are in the same cluster in y but not in x,
FN (False Negative) is the number of pairs of elements that are in the same cluster in x but not in y,
TN (True Negative) is the number of pairs of elements that are in different clusters in both x and y.
The ARI value ranges from -1 to 1. A value of 1 indicates perfect clustering agreement, 0 indicates random clustering, and negative values indicate disagreement between the clusterings.
Note
Modified from scikit-learn
- Parameters
x (np.ndarray) – 1D array representing the labels of the first model.
y (np.ndarray) – 1D array representing the labels of the second model.
- Return float
A value of 1 indicates perfect clustering agreement, a value of 0 indicates random clustering, and negative values indicate disagreement between the clusterings.
- Example
>>> x = np.array([0, 0, 0, 0, 0]) >>> y = np.array([1, 1, 1, 1, 1]) >>> Statistics.adjusted_rand(x=x, y=y) >>> 1.0
- static bray_curtis_dissimilarity(x: ndarray, w: Optional[ndarray] = None) ndarray [source]
Jitted compute of the Bray-Curtis dissimilarity matrix between samples based on feature values.
The Bray-Curtis dissimilarity measures the dissimilarity between two samples based on their feature values. It is useful for finding similar frames based on behavior.
Note
Adapted from pynndescent.
- Parameters
x (np.ndarray) – 2d array with likely normalized feature values.
w (Optional[np.ndarray]) – Optional 2d array with weights of same size as x. Default None and all observations will have the same weight.
- Returns np.ndarray
2d array with same size as x representing dissimilarity values, where 0 indicates identical observations and 1 indicates completely dissimilar observations.
- Example
>>> x = np.array([[1, 1, 1, 1, 1], [0, 0, 0, 0, 0], [0, 0, 0, 0, 0], [1, 1, 1, 1, 1]]).astype(np.float32) >>> Statistics().bray_curtis_dissimilarity(x=x) >>> [[0, 1., 1., 0.], [1., 0., 0., 1.], [1., 0., 0., 1.], [0., 1., 1., 0.]]
- static brunner_munzel(sample_1: ndarray, sample_2: ndarray) float [source]
Jitted compute of Brunner-Munzel W between two distributions.
The Brunner-Munzel W statistic compares the central tendency and the spread of two independent samples. It is useful for comparing the distribution of a continuous variable between two groups, especially when the assumptions of parametric tests like the t-test are violated.
Note
Modified from scipy.stats.brunnermunzel
- where:
( n_x ) and ( n_y ) are the sizes of sample_1 and sample_2 respectively,
( \bar{R}_x ) and ( \bar{R}_y ) are the mean ranks of sample_1 and sample_2 respectively,
( S_x ) and ( S_y ) are the dispersion statistics of sample_1 and sample_2 respectively.
- Parameters
sample_1 (ndarray) – First 1d array representing feature values.
sample_2 (ndarray) – Second 1d array representing feature values.
- Returns float
Brunner-Munzel W.
- Example
>>> sample_1, sample_2 = np.random.normal(loc=10, scale=2, size=10), np.random.normal(loc=20, scale=2, size=10) >>> Statistics().brunner_munzel(sample_1=sample_1, sample_2=sample_2) >>> 0.5751408161437165
- static calinski_harabasz(x: ndarray, y: ndarray) float [source]
Compute the Calinski-Harabasz score to evaluate clustering quality.
The Calinski-Harabasz score is a measure of cluster separation and compactness. It is calculated as the ratio of the between-cluster dispersion to the within-cluster dispersion. A higher score indicates better clustering.
Note
Modified from scikit-learn
- Parameters
x – 2D array representing the data points. Shape (n_samples, n_features/n_dimension).
y – 1D array representing cluster labels for each data point. Shape (n_samples,).
- Return float
Calinski-Harabasz score.
- Example
>>> x = np.random.random((100, 2)).astype(np.float32) >>> y = np.random.randint(0, 100, (100,)).astype(np.int64) >>> Statistics.calinski_harabasz(x=x, y=y)
- static chi_square(sample_1: ndarray, sample_2: ndarray, critical_values: Optional[ndarray] = None, type: Optional[typing_extensions.Literal['goodness_of_fit', 'independence']] = 'goodness_of_fit') Tuple[float, Optional[bool]] [source]
Jitted compute of chi square between two categorical distributions.
- Parameters
sample_1 (ndarray) – First 1d array representing feature values.
sample_2 (ndarray) – Second 1d array representing feature values.
critical_values (ndarray) – 2D array with where indexes represent degrees of freedom and values represent critical values. Can be found in
simba.assets.critical_values_05.pickle
Note
Requires sample_1 and sample_2 to be numeric. If working with strings, convert to numeric category values before using chi_square.
- Example
>>> sample_1 = np.array([1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 5]).astype(np.float32) >>> sample_2 = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]).astype(np.float32) >>> critical_values = pickle.load(open("simba/assets/lookups/critical_values_5.pickle", "rb"))['chi_square']['one_tail'].values >>> Statistics.chi_square(sample_1=sample_2, sample_2=sample_1, critical_values=critical_values, type='goodness_of_fit') >>> (8.333, False) >>>
- static cochrans_q(data: ndarray) Tuple[float, float] [source]
Compute Cochrans Q for 2-dimensional boolean array.
Cochran’s Q statistic is used to test for significant differences between more than two proportions. It can be used to evaluate if the performance of multiple (>=2) classifiers on the same data is the same or significantly different.
Note
If two classifiers, consider
simba.mixins.statistics.Statistics.mcnemar
.Useful background: https://psych.unl.edu/psycrs/handcomp/hccochran.PDF
- Parameters
data (np.ndarray) – Two dimensional array of boolean values where axis 1 represents classifiers or features and rows represent frames.
- Return Tuple[float, float]
Cochran’s Q statistic and significance value.
- Example
>>> data = np.random.randint(0, 2, (100000, 4)) >>> Statistics.cochrans_q(data=data)
- static cohens_d(sample_1: ndarray, sample_2: ndarray) float [source]
Jitted compute of Cohen’s d between two distributions.
Cohen’s d is a measure of effect size that quantifies the difference between the means of two distributions in terms of their standard deviation. It is calculated as the difference between the means of the two distributions divided by the pooled standard deviation.
Higher values indicate a larger effect size, with 0.2 considered a small effect, 0.5 a medium effect, and 0.8 or above a large effect. Negative values indicate that the mean of sample 2 is larger than the mean of sample 1.
d = \frac{\bar{x}_1 - \bar{x}_2}{s_{pooled}}
- where:
( \bar{x}_1 ) and ( \bar{x}_2 ) are the means of sample_1 and sample_2 respectively,
( s_1 ) and ( s_2 ) are the standard deviations of sample_1 and sample_2 respectively, and ( s_{pooled} ) is the pooled standard deviation computed from ( s_1 ) and ( s_2 ).
- Parameters
sample_1 (ndarray) – First 1d array representing feature values.
sample_2 (ndarray) – Second 1d array representing feature values.
- Returns float
Cohens D statistic.
- Example
>>> sample_1 = [2, 4, 7, 3, 7, 35, 8, 9] >>> sample_2 = [4, 8, 14, 6, 14, 70, 16, 18] >>> Statistics().cohens_d(sample_1=sample_1, sample_2=sample_2) >>> -0.5952099775170546
- static cohens_h(sample_1: ndarray, sample_2: ndarray) float [source]
Jitted compute Cohen’s h effect size for two samples of binary [0, 1] values. Cohen’s h is a measure of effect size for comparing two independent samples based on the differences in proportions of the two samples.
Where N_1 and N_2 are the sample sizes of sample_1 and sample_2, respectively.
- Parameters
sample_1 (np.ndarray) – 1D array with binary [0, 1] values (e.g., first classifier inference values).
sample_2 (np.ndarray) – 1D array with binary [0, 1] values (e.g., second classifier inference values).
- Return float
Cohen’s h effect size.
- Example
>>> sample_1 = np.array([1, 0, 0, 1]) >>> sample_2 = np.array([1, 1, 1, 0]) >>> Statistics().cohens_h(sample_1=sample_1, sample_2=sample_2) >>> -0.5235987755982985
- static cohens_kappa(sample_1: ndarray, sample_2: ndarray)[source]
Jitted compute Cohen’s Kappa coefficient for two binary samples.
Cohen’s Kappa coefficient measures the agreement between two sets of binary ratings, taking into account agreement occurring by chance. It ranges from -1 to 1, where 1 indicates perfect agreement, 0 indicates agreement by chance, and -1 indicates complete disagreement.
- where:
( kappa ) is Cohen’s Kappa coefficient,
( w_{ij} ) are the weights,
( D_{ij} ) are the observed frequencies,
( E_{ij} ) are the expected frequencies.
- Example
>>> sample_1 = np.random.randint(0, 2, size=(10000,)) >>> sample_2 = np.random.randint(0, 2, size=(10000,)) >>> Statistics.cohens_kappa(sample_1=sample_1, sample_2=sample_2)
- static concordance_ratio(x: ndarray, invert: bool) float [source]
Calculate the concordance ratio of a 2D numpy array.
- Parameters
x (np.ndarray) – A 2D numpy array with ordinals represented as integers.
invert (bool) – If True, the concordance ratio is inverted and the discordance ratio is returned.
- Return float
The concordance ratio, representing the count of rows with only one unique value divided by the total number of rows in the array.
- Example
>>> x = np.random.randint(0, 2, (5000, 4)) >>> results = Statistics.concordance_ratio(x=x, invert=False)
- static cov_matrix(data: ndarray)[source]
Jitted helper to compute the covariance matrix of the input data. Helper for computing cronbach alpha, multivariate analysis, and distance computations.
- Parameters
data (np.ndarray) – 2-dimensional numpy array representing the input data with shape (n, m), where n is the number of observations and m is the number of features.
- Returns
Covariance matrix of the input data with shape (m, m). The (i, j)-th element of the matrix represents the covariance between the i-th and j-th features in the data.
- Example
>>> data = np.random.randint(0,2, (200, 40)).astype(np.float32) >>> covariance_matrix = Statistics.cov_matrix(data=data)
- static d_prime(x: ndarray, y: ndarray, lower_limit: Optional[float] = 0.0001, upper_limit: Optional[float] = 0.9999) float [source]
Computes d-prime from two Boolean 1d arrays, e.g., between classifications and ground truth.
D-prime (d’) is a measure of signal detection performance, indicating the ability to discriminate between signal and noise. It is computed as the difference between the inverse cumulative distribution function (CDF) of the hit rate and the false alarm rate.
- Parameters
x (np.ndarray) – Boolean 1D array of response values, where 1 represents presence, and 0 representing absence.
y (np.ndarray) – Boolean 1D array of ground truth, where 1 represents presence, and 0 representing absence.
lower_limit (Optional[float]) – Lower limit to bound hit and false alarm rates. Defaults to 0.0001.
upper_limit (Optional[float]) – Upper limit to bound hit and false alarm rates. Defaults to 0.9999.
- Return float
The calculated d’ (d-prime) value.
- Example
>>> x = np.random.randint(0, 2, (1000,)) >>> y = np.random.randint(0, 2, (1000,)) >>> Statistics.d_prime(x=x, y=y)
- davis_bouldin(x: ndarray, y: ndarray) float [source]
Calculate the Davis-Bouldin index for evaluating clustering performance.
Davis-Bouldin index measures the clustering quality based on the within-cluster similarity and between-cluster dissimilarity. Lower values indicate better clustering.
Note
Modified from scikit-learn
- Parameters
x (np.ndarray) – 2D array representing the data points. Shape (n_samples, n_features/n_dimension).
y (np.ndarray) – 1D array representing cluster labels for each data point. Shape (n_samples,).
- Return float
Davis-Bouldin score.
- Example
>>> x = np.random.randint(0, 100, (100, 2)) >>> y = np.random.randint(0, 3, (100,)) >>> Statistics.davis_bouldin(x=x, y=y)
- static dunn_index(x: ndarray, y: ndarray) float [source]
Calculate the Dunn index to evaluate the quality of clustered labels.
This function calculates the Dunn index, which is a measure of clustering quality. The index considers the ratio of the minimum inter-cluster distance to the maximum intra-cluster distance. A higher Dunn index indicates better clustering.
Note
Modified from jqmviegas
Wiki https://en.wikipedia.org/wiki/Dunn_index
Uses Euclidean distances.
- Parameters
x (np.ndarray) – 2D array representing the data points. Shape (n_samples, n_features).
y (np.ndarray) – 1D array representing cluster labels for each data point. Shape (n_samples,).
- Return float
The Dunn index value
- Example
>>> x = np.random.randint(0, 100, (100, 2)) >>> y = np.random.randint(0, 3, (100,)) >>> Statistics.dunn_index(x=x, y=y)
- static elliptic_envelope(data: ndarray, contamination: Optional[float] = 0.1, normalize: Optional[bool] = False, groupby_idx: Optional[int] = None) ndarray [source]
Compute the Mahalanobis distances of each observation in the input array using Elliptic Envelope method.
- Parameters
- Return np.ndarray
The Mahalanobis distances of each observation in array. Larger values indicate outliers.
- Example
>>> data, lbls = make_blobs(n_samples=2000, n_features=2, centers=1, random_state=42) >>> envelope_score = Statistics.elliptic_envelope(data=data, normalize=True) >>> results = np.hstack((data[:, 0:2], envelope_score.reshape(envelope_score.shape[0], 1))) >>> results = pd.DataFrame(results, columns=['X', 'Y', 'ENVELOPE SCORE']) >>> PlottingMixin.continuous_scatter(data=results, palette='seismic', bg_clr='lightgrey', columns=['X', 'Y', 'ENVELOPE SCORE'], size=30)
- static eta_squared(x: ndarray, y: ndarray) float [source]
Calculate eta-squared, a measure of between-subjects effect size.
Eta-squared (η²) is calculated as the ratio of the sum of squares between groups to the total sum of squares. Values range from 0 to 1, where larger values indicate a stronger effect size.
η² = SS_between / (SS_between + SS_within)
- where:
SS_between is the sum of squares between groups.
SS_within is the sum of squares within groups.
- param np.ndarray x
1D array containing the dependent variable data.
- param np.ndarray y
1D array containing the grouping variable (categorical) data of the same size as x.
- return float
The eta-squared value representing the proportion of variance in the dependent variable that is attributable to the grouping variable.
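The ratio above can be sketched directly in NumPy (the function name is illustrative, not SimBA's API):

```python
import numpy as np

def eta_squared(x: np.ndarray, y: np.ndarray) -> float:
    # eta^2 = SS_between / (SS_between + SS_within), grouping the
    # dependent variable x by the categorical labels in y.
    grand_mean = x.mean()
    ss_between = ss_within = 0.0
    for lbl in np.unique(y):
        grp = x[y == lbl]
        ss_between += grp.size * (grp.mean() - grand_mean) ** 2
        ss_within += ((grp - grp.mean()) ** 2).sum()
    return ss_between / (ss_between + ss_within)
```

Groups that are internally constant give η² = 1 (all variance is between groups); groups with identical means give η² = 0.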
- static find_collinear_features(df: DataFrame, threshold: float, method: Optional[typing_extensions.Literal['pearson', 'spearman', 'kendall']] = 'pearson', verbose: Optional[bool] = False) List[str] [source]
Identify collinear features in the dataframe based on the specified correlation method and threshold.
- Parameters
df (pd.DataFrame) – Input DataFrame containing features.
threshold (float) – Threshold value to determine collinearity.
method (Optional[Literal['pearson', 'spearman', 'kendall']]) – Method for calculating correlation. Defaults to ‘pearson’.
- Returns
Set of feature names identified as collinear. Returns one feature for every feature pair with correlation value above specified threshold.
- Example
>>> x = pd.DataFrame(np.random.randint(0, 100, (100, 100)))
>>> names = Statistics.find_collinear_features(df=x, threshold=0.2, method='pearson', verbose=True)
- static fowlkes_mallows(x: ndarray, y: ndarray) float [source]
Calculate the Fowlkes-Mallows Index (FMI) between two clusterings.
The Fowlkes-Mallows index (FMI) is a measure of similarity between two clusterings. It compares the similarity of the clusters obtained by two different clustering algorithms or procedures.
The index is defined as the geometric mean of the pairwise precision and recall:
FMI = TP / √((TP + FP) · (TP + FN))
- where:
TP (True Positive) is the number of pairs of elements that are in the same cluster in both x and y,
FP (False Positive) is the number of pairs of elements that are in the same cluster in y but not in x,
FN (False Negative) is the number of pairs of elements that are in the same cluster in x but not in y.
Note
Modified from scikit-learn
- Parameters
x (np.ndarray) – 1D array representing the labels of the first model.
y (np.ndarray) – 1D array representing the labels of the second model.
- Return float
Score between 0 and 1. 1 indicates perfect clustering agreement, 0 indicates random clustering.
- static grubbs_test(x: ndarray, left_tail: Optional[bool] = False) float [source]
Perform Grubbs’ test to detect outliers if the minimum or maximum value in a feature series is an outlier.
Grubbs’ test is a statistical test used to detect outliers in a univariate data set. It calculates the Grubbs’ test statistic as the absolute difference between the extreme value (either the minimum or maximum) and the sample mean, divided by the sample standard deviation.
G = |x̄ − x_extreme| / s
- where:
x̄ is the sample mean,
x_extreme is the minimum or maximum value of the sample (depending on the tail being tested),
s is the sample standard deviation.
- param np.ndarray x
1D array representing numeric data.
- param Optional[bool] left_tail
If True, the test calculates the Grubbs’ test statistic for the left tail (minimum value). If False (default), it calculates the statistic for the right tail (maximum value).
- return float
The computed Grubbs’ test statistic.
- example
>>> x = np.random.random((100,))
>>> Statistics.grubbs_test(x=x)
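A minimal NumPy sketch of the statistic described above (the function name is illustrative, not SimBA's API):

```python
import numpy as np

def grubbs_statistic(x: np.ndarray, left_tail: bool = False) -> float:
    # G = |mean - extreme| / s, where the extreme value is the minimum
    # (left tail) or the maximum (right tail), and s is the sample
    # standard deviation (ddof=1).
    extreme = x.min() if left_tail else x.max()
    return abs(x.mean() - extreme) / x.std(ddof=1)
```

For a sample with one large right-tail value, the right-tail statistic exceeds the left-tail statistic.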
- static hamming_distance(x: ndarray, y: ndarray, sort: Optional[bool] = False, w: Optional[ndarray] = None) float [source]
Jitted compute of the Hamming similarity between two vectors.
The Hamming distance measures the dissimilarity between two binary vectors as the proportion of positions at which the corresponding elements differ.
Note
If w is not provided, equal weights are assumed. Adapted from pynndescent.
D(x, y) = (1/n) · Σ w_i · 1[x_i ≠ y_i]
- where:
n is the length of the vectors,
w_i is the weight associated with the ith element of the vectors.
- Parameters
x (np.ndarray) – First binary vector.
y (np.ndarray) – Second binary vector.
w (Optional[np.ndarray]) – Optional weights for each element. Can be classification probabilities. If not provided, equal weights are assumed.
sort (Optional[bool]) – If True, sorts x and y prior to hamming distance calculation. Default, False.
- Example
>>> x, y = np.random.randint(0, 2, (10,)).astype(np.int8), np.random.randint(0, 2, (10,)).astype(np.int8)
>>> Statistics().hamming_distance(x=x, y=y)
>>> 0.91
- static hartley_fmax(x: ndarray, y: ndarray) float [source]
Compute Hartley’s Fmax statistic to test for equality of variances between two features or groups.
Hartley’s Fmax statistic is used to test whether two samples have equal variances. It is calculated as the ratio of the largest sample variance to the smallest sample variance. Values close to one represent closer to equal variance.
F_max = max(Var(x), Var(y)) / min(Var(x), Var(y))
- where:
Var(x) is the variance of sample x,
Var(y) is the variance of sample y.
- param np.ndarray x
1D array representing numeric data of the first group/feature.
- param np.ndarray y
1D array representing numeric data of the second group/feature.
- example
>>> x = np.random.random((100,))
>>> y = np.random.random((100,))
>>> Statistics.hartley_fmax(x=x, y=y)
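The ratio above is a two-liner in NumPy; a sketch (function name illustrative, not SimBA's API):

```python
import numpy as np

def hartley_fmax(x: np.ndarray, y: np.ndarray) -> float:
    # Ratio of the larger sample variance to the smaller (ddof=1).
    # Values close to 1 indicate near-equal variances.
    vx, vy = np.var(x, ddof=1), np.var(y, ddof=1)
    return max(vx, vy) / min(vx, vy)
```

The statistic is symmetric in its arguments by construction.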
- hbos(data: ndarray, bucket_method: typing_extensions.Literal['fd', 'doane', 'auto', 'scott', 'stone', 'rice', 'sturges', 'sqrt'] = 'auto') ndarray [source]
Jitted compute of Histogram-based Outlier Scores (HBOS). HBOS quantifies the abnormality of data points based on the densities of their feature values within their respective buckets over all feature values.
- Parameters
data (np.ndarray) – 2d array with frames represented by rows and columns representing feature values.
bucket_method (Literal) – Estimator determining optimal bucket count and bucket width. Default: The maximum of the Sturges and Freedman-Diaconis estimators.
- Return np.ndarray
Array of size data.shape[0] representing outlier scores, with higher values representing greater outliers.
- Example
>>> sample_1 = np.random.randint(low=1, high=3, size=(10, 50)).astype(np.float64)
>>> sample_2 = np.random.randint(low=7, high=21, size=(2, 50)).astype(np.float64)
>>> data = np.vstack([sample_1, sample_2])
>>> Statistics().hbos(data=data)
- hellinger_distance(x: ndarray, y: ndarray, bucket_method: Optional[typing_extensions.Literal['fd', 'doane', 'auto', 'scott', 'stone', 'rice', 'sturges', 'sqrt']] = 'auto') float [source]
Compute the Hellinger distance between two vector distributions.
Note
The Hellinger distance is bounded and ranges from 0 to √2. A distance of √2 indicates that the two distributions are maximally dissimilar.
- Parameters
x (np.ndarray) – First 1D array representing a probability distribution.
y (np.ndarray) – Second 1D array representing a probability distribution.
bucket_method (Optional[Literal['fd', 'doane', 'auto', 'scott', 'stone', 'rice', 'sturges', 'sqrt']]) – Method for computing histogram bins. Default is ‘auto’.
- Returns float
Hellinger distance between the two input probability distributions.
- Example
>>> x = np.random.randint(0, 9000, (500000,))
>>> y = np.random.randint(0, 9000, (500000,))
>>> Statistics().hellinger_distance(x=x, y=y, bucket_method='auto')
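One plausible sketch of the computation, assuming shared bin edges over both samples and the √2-bounded convention from the note above (the exact binning strategy of the SimBA implementation is not shown here, and the function name is illustrative):

```python
import numpy as np

def hellinger_distance(x: np.ndarray, y: np.ndarray, bucket_method: str = "auto") -> float:
    # Bin both samples on a shared set of edges, normalise the counts
    # to probabilities, then apply
    # H(P, Q) = sqrt(sum((sqrt(p) - sqrt(q))**2)), bounded by sqrt(2).
    edges = np.histogram_bin_edges(np.hstack([x, y]), bins=bucket_method)
    p, _ = np.histogram(x, bins=edges)
    q, _ = np.histogram(y, bins=edges)
    p, q = p / p.sum(), q / q.sum()
    return float(np.sqrt(((np.sqrt(p) - np.sqrt(q)) ** 2).sum()))
```

Identical samples give 0; samples with disjoint support approach √2.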
- static independent_samples_t(sample_1: np.ndarray, sample_2: np.ndarray, critical_values: Optional[np.ndarray] = None) → Tuple[float, Union[None, bool]][source]
Jitted compute independent-samples t-test statistic and boolean significance between two distributions.
Note
Critical values are stored in simba.assets.lookups.critical_values_**.pickle
The t-statistic for independent samples t-test is calculated using the following formula:
t = (x̄_1 − x̄_2) / (s_p · √(1/n_1 + 1/n_2))
- where:
x̄_1 and x̄_2 are the means of sample_1 and sample_2 respectively,
s_p is the pooled standard deviation,
n_1 and n_2 are the sample sizes of sample_1 and sample_2 respectively.
- parameter ndarray sample_1
First 1d array representing feature values.
- parameter ndarray sample_2
Second 1d array representing feature values.
- parameter ndarray critical_values
2d array where the first column represents degrees of freedom and second column represents critical values.
- returns Tuple[float, Union[None, bool]]
Representing the t-statistic and associated significance. The second element is None if critical_values is None; else True or False, with True representing a significant result.
- example
>>> sample_1 = np.array([1, 2, 3, 1, 3, 2, 1, 10, 8, 4, 10])
>>> sample_2 = np.array([2, 5, 10, 4, 8, 10, 7, 10, 7, 10, 10])
>>> Statistics().independent_samples_t(sample_1=sample_1, sample_2=sample_2)
>>> (-2.5266046804590183, None)
>>> critical_values = pickle.load(open("simba/assets/lookups/critical_values_05.pickle", "rb"))['independent_t_test']['one_tail'].values
>>> Statistics().independent_samples_t(sample_1=sample_1, sample_2=sample_2, critical_values=critical_values)
>>> (-2.5266046804590183, True)
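The pooled-variance formula above can be sketched in plain NumPy, omitting the critical-value lookup (function name illustrative, not SimBA's API):

```python
import numpy as np

def independent_samples_t(sample_1: np.ndarray, sample_2: np.ndarray) -> float:
    # t = (m1 - m2) / (s_p * sqrt(1/n1 + 1/n2)), with s_p the pooled
    # standard deviation computed from the two sample variances (ddof=1).
    n1, n2 = sample_1.size, sample_2.size
    pooled_var = ((n1 - 1) * np.var(sample_1, ddof=1) +
                  (n2 - 1) * np.var(sample_2, ddof=1)) / (n1 + n2 - 2)
    return float((sample_1.mean() - sample_2.mean()) /
                 (np.sqrt(pooled_var) * np.sqrt(1 / n1 + 1 / n2)))
```

Note the statistic is antisymmetric in its arguments and zero for identical samples.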
- static isolation_forest(x: ndarray, estimators: Union[int, float] = 0.2, groupby_idx: Optional[int] = None, normalize: Optional[bool] = False)[source]
An implementation of the Isolation Forest algorithm for outlier detection.
Note
The isolation forest scores are negated. Thus, higher values indicate more atypical (outlier) data points.
- Parameters
x (np.ndarray) – 2-D array with feature values.
estimators (Union[int, float]) – Number of estimators (trees). If the value is a float, then interpreted as a ratio of the number of observations in x.
groupby_idx (Optional[int]) – If int, then the index along axis 1 of x for which to group the data and compute scores on each segment. E.g., can be a field holding a cluster identifier.
normalize (Optional[bool]) – Whether to normalize the outlier score between 0 and 1. Defaults to False.
- Returns
Array of size x.shape[0] with outlier scores; higher values indicate more atypical observations.
- Example
>>> x, lbls = make_blobs(n_samples=10000, n_features=2, centers=10, random_state=42)
>>> x = np.hstack((x, lbls.reshape(-1, 1)))
>>> scores = isolation_forest(x=x, estimators=10, normalize=True)
>>> results = np.hstack((x[:, 0:2], scores.reshape(scores.shape[0], 1)))
>>> results = pd.DataFrame(results, columns=['X', 'Y', 'ISOLATION SCORE'])
>>> PlottingMixin.continuous_scatter(data=results, palette='seismic', bg_clr='lightgrey', columns=['X', 'Y', 'ISOLATION SCORE'], size=30)
- static jaccard_distance(x: ndarray, y: ndarray) float [source]
Calculate the Jaccard distance between two 1D NumPy arrays.
The Jaccard distance is a measure of dissimilarity between two sets. It is defined as one minus the size of the intersection of the sets divided by the size of the union of the sets.
- Parameters
x (np.ndarray) – The first 1D NumPy array.
y (np.ndarray) – The second 1D NumPy array.
- Return float
The Jaccard distance between arrays x and y.
- Example
>>> x = np.random.randint(0, 5, (100))
>>> y = np.random.randint(0, 7, (100))
>>> Statistics.jaccard_distance(x=x, y=y)
>>> 0.2857143
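Consistent with the example above (where the value sets {0..4} and {0..6} give 2/7), the arrays are treated as sets of observed values. A sketch (function name illustrative, not SimBA's API):

```python
import numpy as np

def jaccard_distance(x: np.ndarray, y: np.ndarray) -> float:
    # Treat each array as the set of values it contains and return
    # 1 - |intersection| / |union|.
    xs, ys = set(np.unique(x)), set(np.unique(y))
    return 1.0 - len(xs & ys) / len(xs | ys)
```

Arrays containing the same set of values give a distance of 0 regardless of element order.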
- jensen_shannon_divergence(sample_1: ndarray, sample_2: ndarray, bucket_method: typing_extensions.Literal['fd', 'doane', 'auto', 'scott', 'stone', 'rice', 'sturges', 'sqrt'] = 'auto') float [source]
Compute Jensen-Shannon divergence between two distributions. Useful for (i) measure drift in datasets, and (ii) featurization of distribution shifts across sequential time-bins.
Note
JSD = 0: Indicates that the two distributions are identical. 0 < JSD < 1: Indicates a degree of dissimilarity between the distributions, with values closer to 1 indicating greater dissimilarity. JSD = 1: Indicates that the two distributions are maximally dissimilar.
JSD(P_1 ‖ P_2) = (KL(P_1 ‖ M) + KL(P_2 ‖ M)) / 2, where M = (P_1 + P_2) / 2
- parameter ndarray sample_1
First 1d array representing feature values.
- parameter ndarray sample_2
Second 1d array representing feature values.
- parameter Literal bucket_method
Estimator determining optimal bucket count and bucket width. Default: The maximum of the Sturges and Freedman-Diaconis estimators.
- returns float
Jensen-Shannon divergence between sample_1 and sample_2.
- example
>>> sample_1, sample_2 = np.array([1, 2, 3, 4, 5, 10, 1, 2, 3]), np.array([1, 5, 10, 9, 10, 1, 10, 6, 7])
>>> Statistics().jensen_shannon_divergence(sample_1=sample_1, sample_2=sample_2, bucket_method='fd')
>>> 0.30806541358219786
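SimBA bins the raw samples before computing the divergence; the core formula, applied to already-binned counts or probability vectors, can be sketched as follows (assuming log base 2 so the result lies in [0, 1]; function name illustrative, not SimBA's API):

```python
import numpy as np

def jensen_shannon_divergence(p: np.ndarray, q: np.ndarray) -> float:
    # JSD = (KL(P || M) + KL(Q || M)) / 2 with M = (P + Q) / 2,
    # computed in bits (log base 2) so the value lies in [0, 1].
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    def kl(a, b):
        mask = a > 0  # 0 * log(0) is treated as 0
        return (a[mask] * np.log2(a[mask] / b[mask])).sum()
    return float(0.5 * kl(p, m) + 0.5 * kl(q, m))
```

Identical distributions give 0; distributions with disjoint support give 1.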
- static kendall_tau(sample_1: ndarray, sample_2: ndarray) Tuple[float, float] [source]
Jitted compute of Kendall Tau (rank correlation coefficient). Non-parametric method for computing correlation between two time-series features. Returns tau and associated z-score.
Kendall Tau is a measure of the correspondence between two rankings. It compares the number of concordant pairs (pairs of elements that are in the same order in both rankings) to the number of discordant pairs (pairs of elements that are in different orders in the rankings).
Kendall Tau is calculated using the following formula:
τ = (C − D) / (C + D)
where C is the count of concordant pairs and D is the count of discordant pairs.
- Parameters
sample_1 (ndarray) – First 1D array with feature values.
sample_2 (ndarray) – Second 1D array with feature values.
- Returns Tuple[float, float]
Kendall Tau and associated z-score.
- Examples
>>> sample_1 = np.array([4, 2, 3, 4, 5, 7]).astype(np.float32)
>>> sample_2 = np.array([1, 2, 3, 4, 5, 7]).astype(np.float32)
>>> Statistics().kendall_tau(sample_1=sample_1, sample_2=sample_2)
>>> (0.7333333333333333, 2.0665401605809928)
- static kmeans_1d(data: ndarray, k: int, max_iters: int, calc_medians: bool) Tuple[ndarray, ndarray, Union[None, DictType]] [source]
Perform k-means clustering on a 1-dimensional dataset.
- Parameters
data (np.ndarray) – 1D array of numeric data to cluster.
k (int) – Number of clusters.
max_iters (int) – Maximum number of iterations to run.
calc_medians (bool) – If True, also compute the median of each cluster.
- Returns Tuple
Tuple of three elements. Final centroids of the clusters. Labels assigned to each data point based on clusters. Cluster medians (if calc_medians is True), otherwise None.
- Example
>>> data_1d = np.array([1, 2, 3, 55, 65, 40, 43, 40]).astype(np.float64)
>>> centroids, labels, medians = Statistics().kmeans_1d(data_1d, 2, 1000, True)
- static kruskal_wallis(sample_1: ndarray, sample_2: ndarray) float [source]
Compute the Kruskal-Wallis H statistic between two distributions.
The Kruskal-Wallis test is a non-parametric method for testing whether samples originate from the same distribution. It ranks all the values from the combined samples, then calculates the H statistic based on the ranks.
H = 12 / (n(n + 1)) · (R_1² / n_1 + R_2² / n_2) − 3(n + 1)
- where:
n is the total number of observations,
n_1 and n_2 are the number of observations in sample 1 and sample 2 respectively,
R_1 and R_2 are the sums of ranks for sample 1 and sample 2 respectively.
- Parameters
sample_1 (ndarray) – First 1d array representing feature values.
sample_2 (ndarray) – Second 1d array representing feature values.
- Returns float
Kruskal-Wallis H statistic.
- Example
>>> sample_1 = np.array([1, 1, 3, 4, 5]).astype(np.float64)
>>> sample_2 = np.array([6, 7, 8, 9, 10]).astype(np.float64)
>>> Statistics().kruskal_wallis(sample_1=sample_1, sample_2=sample_2)
>>> 39.4
- kullback_leibler_divergence(sample_1: ndarray, sample_2: ndarray, fill_value: Optional[int] = 1, bucket_method: typing_extensions.Literal['fd', 'doane', 'auto', 'scott', 'stone', 'rice', 'sturges', 'sqrt'] = 'auto') float [source]
Compute Kullback-Leibler divergence between two distributions.
Note
Empty bins (0 observations in a bin) are replaced with the passed fill_value.
The KL divergence ranges from 0 to positive infinity. When the KL divergence is zero, it indicates that the two distributions are identical. As the KL divergence increases, it signifies an increasing difference between the distributions.
KL(P ‖ Q) = Σ P(x) · log(P(x) / Q(x))
- parameter ndarray sample_1
First 1d array representing feature values.
- parameter ndarray sample_2
Second 1d array representing feature values.
- parameter Optional[int] fill_value
Optional pseudo-count used to fill empty buckets in the sample_2 histogram.
- parameter Literal bucket_method
Estimator determining optimal bucket count and bucket width. Default: The maximum of the Sturges and Freedman-Diaconis estimators.
- returns float
Kullback-Leibler divergence between sample_1 and sample_2.
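A minimal sketch of the histogram-based computation, assuming bin edges taken from the first sample and the pseudo-count fill described in the note (the exact SimBA binning may differ; function name illustrative):

```python
import numpy as np

def kl_divergence(sample_1: np.ndarray, sample_2: np.ndarray,
                  fill_value: int = 1, bucket_method: str = "auto") -> float:
    # Bin both samples on edges derived from sample_1; empty bins are
    # replaced with fill_value (a pseudo-count) before normalising,
    # then KL(P || Q) = sum(p * log(p / q)).
    edges = np.histogram_bin_edges(sample_1, bins=bucket_method)
    p, _ = np.histogram(sample_1, bins=edges)
    q, _ = np.histogram(sample_2, bins=edges)
    p = np.where(p == 0, fill_value, p).astype(float)
    q = np.where(q == 0, fill_value, q).astype(float)
    p, q = p / p.sum(), q / q.sum()
    return float((p * np.log(p / q)).sum())
```

Identical samples give 0; by Gibbs' inequality the result is never negative.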
- static levenes(sample_1: np.ndarray, sample_2: np.ndarray, critical_values: Optional[np.ndarray] = None) → Tuple[float, Union[bool, None]][source]
Compute Levene’s W statistic, a test for the equality of variances between two samples.
Levene’s test is a statistical test used to determine whether two or more groups have equal variances. It is often used as an alternative to the Bartlett test when the assumption of normality is violated. The function computes the Levene’s W statistic, which measures the degree of difference in variances between the two samples.
- Parameters
sample_1 (ndarray) – First 1d array representing feature values.
sample_2 (ndarray) – Second 1d array representing feature values.
critical_values (ndarray) – 2D array where the first column represents dfn and the first row represents dfd, with cell values representing critical values. Can be found in simba.assets.critical_values_05.pickle.
- Returns tuple[float, Union[bool, None]]
Levene’s W statistic and a boolean indicating whether the test is statistically significant (if critical values is not None).
- Examples
>>> sample_1 = np.array(list(range(0, 50)))
>>> sample_2 = np.array(list(range(25, 100)))
>>> Statistics().levenes(sample_1=sample_1, sample_2=sample_2)
>>> 12.63909108903254
>>> critical_values = pickle.load(open("simba/assets/lookups/critical_values_5.pickle", "rb"))['f']['one_tail'].values
>>> Statistics().levenes(sample_1=sample_1, sample_2=sample_2, critical_values=critical_values)
>>> (12.63909108903254, True)
- static local_outlier_factor(data: ndarray, k: Union[int, float] = 5, contamination: Optional[float] = 1e-10, normalize: Optional[bool] = False, groupby_idx: Optional[int] = None) ndarray [source]
Compute the local outlier factor of each observation.
Note
The final LOF scores are negated. Thus, higher values indicate more atypical (outlier) data points. The method calls sklearn.neighbors.LocalOutlierFactor directly. An own jit-compiled implementation was attempted, but runtime was roughly 3x slower than sklearn.neighbors.LocalOutlierFactor.
If groupby_idx is not None, it is the index along axis 1 of the data array for which to group the data and compute LOF within each segment/cluster. E.g., it can be a field holding a cluster identifier. Outliers are thus computed within each segment/cluster, ensuring that other segments cannot affect outlier scores when analyzing each cluster.
If groupby_idx is provided, then all observations with cluster/segment variable -1 will be treated as unclustered and assigned the max outlier score found within the clustered observations.
- Parameters
data (ndarray) – 2D array with feature values where rows represent frames and columns represent features.
k (Union[int, float]) – Number of neighbors to evaluate for each observation. If the value is a float, then interpreted as the ratio of data.shape[0]. If the value is an integer, then it represents the number of neighbours to evaluate.
contamination (Optional[float]) – Small pseudonumber to avoid DivisionByZero error.
normalize (Optional[bool]) – Whether to normalize the distances between 0 and 1. Defaults to False.
groupby_idx (Optional[int]) – If int, then the index along axis 1 of data for which to group the data and compute LOF on each segment. E.g., can be a field holding a cluster identifier.
- Returns np.ndarray
Array of size data.shape[0] with local outlier scores.
- Example
>>> data, lbls = make_blobs(n_samples=2000, n_features=2, centers=10, random_state=42)
>>> data = np.hstack((data, lbls.reshape(-1, 1)))
>>> lof = Statistics.local_outlier_factor(data=data, groupby_idx=2, k=100, normalize=True)
>>> results = np.hstack((data[:, 0:2], lof.reshape(lof.shape[0], 1)))
>>> PlottingMixin.continuous_scatter(data=results, palette='seismic', bg_clr='lightgrey', size=30)
- static mad_median_rule(data: ndarray, k: int) ndarray [source]
Detect outliers using the MAD-Median Rule. Returns a 1D array of size data.shape[0] with 1 representing outlier and 0 representing inlier.
- Parameters
data (np.ndarray) – 1D array of feature values.
k (int) – Outlier threshold multiplier; observations whose distance from the median exceeds k times the MAD are flagged.
- Example
>>> data = np.random.randint(0, 600, (9000000,)).astype(np.float32)
>>> Statistics.mad_median_rule(data=data, k=1)
- static mahalanobis_distance_cdist(data: ndarray) ndarray [source]
Compute the Mahalanobis distance between every pair of observations in a 2D array using numba.
The Mahalanobis distance is a measure of the distance between a point and a distribution. It accounts for correlations between variables and the scales of the variables, making it suitable for datasets where features are not independent and have different variances.
Note
Significantly reduced runtime versus scipy.cdist Mahalanobis, but only with larger feature sets (> 10-50 features).
However, Mahalanobis distance may not be suitable in certain scenarios, such as:
- When the dataset is small and the covariance matrix is not accurately estimated.
- When the dataset contains outliers that significantly affect the estimation of the covariance matrix.
- When the assumptions of multivariate normality are violated.
- Parameters
data (np.ndarray) – 2D array with feature observations. Frames on axis 0 and feature values on axis 1
- Return np.ndarray
Pairwise Mahalanobis distance matrix where element (i, j) represents the Mahalanobis distance between observations i and j.
- Example
>>> data = np.random.randint(0, 50, (1000, 200)).astype(np.float32)
>>> x = Statistics.mahalanobis_distance_cdist(data=data)
- static manhattan_distance_cdist(data: ndarray) ndarray [source]
Compute the pairwise Manhattan distance matrix between points in a 2D array.
Can be preferred over Euclidean distance in scenarios where the movement is restricted to grid-based paths and/or the data is high dimensional.
- Parameters
data (np.ndarray) – 2D array where each row represents a featurized observation (e.g., frame).
- Return np.ndarray
Pairwise Manhattan distance matrix where element (i, j) represents the distance between points i and j.
- Example
>>> data = np.random.randint(0, 50, (10000, 2))
>>> Statistics.manhattan_distance_cdist(data=data)
- static mann_whitney(sample_1: ndarray, sample_2: ndarray) float [source]
Jitted compute of Mann-Whitney U between two distributions.
The Mann-Whitney U test is used to assess whether the distributions of two groups are the same or different based on their ranks. It is commonly used as an alternative to the t-test when the assumptions of normality and equal variances are violated.
U = min(U_1, U_2)
- Where:
U is the Mann-Whitney U statistic,
U_1 is the U statistic computed from the rank sum of sample 1,
U_2 is the U statistic computed from the rank sum of sample 2.
- Parameters
sample_1 (ndarray) – First 1d array representing feature values.
sample_2 (ndarray) – Second 1d array representing feature values.
- Returns float
The Mann-Whitney U statistic.
- References
Modified from James Webber gist on GitHub.
- Example
>>> sample_1 = np.array([1, 1, 3, 4, 5])
>>> sample_2 = np.array([6, 7, 8, 9, 10])
>>> results = Statistics().mann_whitney(sample_1=sample_1, sample_2=sample_2)
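The rank-based computation can be sketched with scipy's rank helper, assuming the common min(U_1, U_2) convention (function name illustrative, not SimBA's jitted API):

```python
import numpy as np
from scipy.stats import rankdata

def mann_whitney_u(sample_1: np.ndarray, sample_2: np.ndarray) -> float:
    # Rank the pooled samples (average ranks for ties), convert the
    # rank sum of sample_1 to its U statistic, and return min(U1, U2).
    n1, n2 = sample_1.size, sample_2.size
    ranks = rankdata(np.hstack([sample_1, sample_2]))
    u1 = ranks[:n1].sum() - n1 * (n1 + 1) / 2
    u2 = n1 * n2 - u1
    return float(min(u1, u2))
```

Fully separated samples give U = 0; identical samples give the maximal-overlap value n1·n2/2.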
- static mcnemar(x: ndarray, y: ndarray, ground_truth: ndarray, continuity_corrected: Optional[bool] = True) Tuple[float, float] [source]
McNemar’s test to compare the difference in predictive accuracy of two models.
E.g., can be used to compute if the accuracy of two classifiers are significantly different when transforming the same data.
Note
Adapted from mlextend.
- Parameters
x (np.ndarray) – 1-dimensional Boolean array with predictions of the first model.
y (np.ndarray) – 1-dimensional Boolean array with predictions of the second model.
ground_truth (np.ndarray) – 1-dimensional Boolean array with ground truth labels.
continuity_corrected (Optional[bool]) – Whether to apply continuity correction. Default is True.
- Example
>>> x = np.random.randint(0, 2, (100000,))
>>> y = np.random.randint(0, 2, (100000,))
>>> ground_truth = np.random.randint(0, 2, (100000,))
>>> Statistics.mcnemar(x=x, y=y, ground_truth=ground_truth)
- static one_way_anova(sample_1: np.ndarray, sample_2: np.ndarray, critical_values: Optional[np.ndarray] = None) → Tuple[float, float][source]
Jitted compute of one-way ANOVA F statistics and associated p-value for two distributions.
- Parameters
sample_1 (ndarray) – First 1d array representing feature values.
sample_2 (ndarray) – Second 1d array representing feature values.
- Returns Tuple[float, float]
Representing ANOVA F statistic and associated probability value.
- Example
>>> sample_1 = np.array([1, 2, 3, 1, 3, 2, 1, 10, 8, 4, 10])
>>> sample_2 = np.array([8, 5, 5, 8, 8, 9, 10, 1, 7, 10, 10])
>>> Statistics().one_way_anova(sample_1=sample_2, sample_2=sample_1)
- static pct_in_top_n(x: ndarray, n: float) float [source]
Compute the percentage of elements in the top ‘n’ frequencies in the input array.
This function calculates the percentage of elements that belong to the ‘n’ most frequent categories in the input array ‘x’.
- Parameters
x (np.ndarray) – Input array.
n (float) – Number of top frequencies.
- Return float
Percentage of elements in the top ‘n’ frequencies.
- Example
>>> x = np.random.randint(0, 10, (100,))
>>> Statistics.pct_in_top_n(x=x, n=5)
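A minimal sketch of the computation, assuming the result is returned as a fraction of the array size (function name illustrative, not SimBA's API):

```python
import numpy as np

def pct_in_top_n(x: np.ndarray, n: int) -> float:
    # Fraction of elements belonging to the n most frequent values.
    _, counts = np.unique(x, return_counts=True)
    return float(np.sort(counts)[::-1][:n].sum() / x.size)
```

For x = [1, 1, 1, 2, 2, 3], the single most frequent value covers half the elements, and the top two cover 5/6.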
- static pearsons_r(sample_1: ndarray, sample_2: ndarray) float [source]
Calculate the Pearson correlation coefficient (Pearson’s r) between two numeric samples.
Pearson’s r is a measure of the linear correlation between two sets of data points. It quantifies the strength and direction of the linear relationship between the two variables. The coefficient varies between -1 and 1, with -1 indicating a perfect negative linear relationship, 1 indicating a perfect positive linear relationship, and 0 indicating no linear relationship.
Pearson’s r is calculated using the formula:
r = Σ(x_i − x̄)(y_i − ȳ) / √(Σ(x_i − x̄)² · Σ(y_i − ȳ)²)
- where:
x_i and y_i are individual data points in sample_1 and sample_2, respectively,
x̄ and ȳ are the means of sample_1 and sample_2, respectively.
- param np.ndarray sample_1
First numeric sample.
- param np.ndarray sample_2
Second numeric sample.
- return float
Pearson’s correlation coefficient between the two samples.
- example
>>> sample_1 = np.array([7, 2, 9, 4, 5, 6, 7, 8, 9]).astype(np.float32)
>>> sample_2 = np.array([1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5, 8.5, 9.5]).astype(np.float32)
>>> Statistics().pearsons_r(sample_1=sample_1, sample_2=sample_2)
>>> 0.47
- static phi_coefficient(data: ndarray) float [source]
Compute the phi coefficient for a Nx2 array of binary data.
The phi coefficient (a.k.a Matthews Correlation Coefficient (MCC)), is a measure of association for binary data in a 2x2 contingency table. It quantifies the degree of association or correlation between two binary variables (e.g., binary classification targets).
The formula for the phi coefficient is defined as:
φ = (BC · AD − (C1 − BC)(R1 − BC)) / √(C1 · C2 · R1 · R2)
- where:
BC: Hits (response and truth are both 1)
AD: Correct rejections (response and truth are both 0)
C1, C2: Counts of occurrences where the response is 1 and 0, respectively.
R1, R2: Counts of occurrences where the truth is 1 and 0, respectively.
- Parameters
data (np.ndarray) – A NumPy array containing binary data organized in two columns. Each row represents a pair of binary values for two variables. Columns represent two features or two binary classification results.
- Returns float
The calculated phi coefficient, a value between 0 and 1. A value of 0 indicates no association between the variables, while 1 indicates a perfect association.
- Example
>>> data = np.array([[0, 1], [1, 0], [1, 0], [1, 1]]).astype(np.int64)
>>> Statistics().phi_coefficient(data=data)
>>> 0.8164965809277261
>>> data = np.random.randint(0, 2, (100, 2))
>>> result = Statistics.phi_coefficient(data=data)
- population_stability_index(sample_1: ndarray, sample_2: ndarray, fill_value: Optional[int] = 1, bucket_method: Optional[typing_extensions.Literal['fd', 'doane', 'auto', 'scott', 'stone', 'rice', 'sturges', 'sqrt']] = 'auto') float [source]
Compute Population Stability Index (PSI) comparing two distributions.
The Population Stability Index (PSI) is a measure of the difference in distribution patterns between two groups of data. A low PSI value indicates a minimal or negligible change in the distribution patterns between the two samples. A high PSI value suggests a significant difference in the distribution patterns between the two samples.
Note
Empty bins (0 observations in a bin) are replaced with fill_value. The PSI value ranges from 0 to positive infinity.
The Population Stability Index (PSI) is calculated as:
PSI = Σ (p_2 − p_1) · ln(p_2 / p_1)
- where:
p_1 and p_2 are the proportions of observations in the bins for sample 1 and sample 2 respectively.
- Parameters
sample_1 (ndarray) – First 1d array representing feature values.
sample_2 (ndarray) – Second 1d array representing feature values.
fill_value (Optional[int]) – Empty bins (0 observations in a bin) are replaced with fill_value. Default 1.
bucket_method (Literal) – Estimator determining optimal bucket count and bucket width. Default: The maximum of the Sturges and Freedman-Diaconis estimators.
- Returns float
PSI distance between sample_1 and sample_2.
- Example
>>> sample_1, sample_2 = np.random.randint(0, 100, (100,)), np.random.randint(0, 10, (100,))
>>> Statistics().population_stability_index(sample_1=sample_1, sample_2=sample_2, fill_value=1, bucket_method='auto')
>>> 3.9657026867553817
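A minimal sketch of the formula, assuming shared bin edges over both samples and the pseudo-count fill described above (the exact SimBA binning may differ; function name illustrative):

```python
import numpy as np

def population_stability_index(sample_1: np.ndarray, sample_2: np.ndarray,
                               fill_value: int = 1, bucket_method: str = "auto") -> float:
    # Bin both samples on a shared set of edges; empty bins are replaced
    # with fill_value, then PSI = sum((p2 - p1) * ln(p2 / p1)).
    edges = np.histogram_bin_edges(np.hstack([sample_1, sample_2]), bins=bucket_method)
    c1, _ = np.histogram(sample_1, bins=edges)
    c2, _ = np.histogram(sample_2, bins=edges)
    c1 = np.where(c1 == 0, fill_value, c1).astype(float)
    c2 = np.where(c2 == 0, fill_value, c2).astype(float)
    p1, p2 = c1 / c1.sum(), c2 / c2.sum()
    return float(((p2 - p1) * np.log(p2 / p1)).sum())
```

Each term (p_2 − p_1) · ln(p_2 / p_1) is non-negative, so the PSI is zero only when the binned distributions coincide.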
- static relative_risk(x: ndarray, y: ndarray) float [source]
Calculate the relative risk between two binary arrays.
Relative risk (RR) is the ratio of the probability of an event occurring in one group/feature/cluster/variable (x) to the probability of the event occurring in another group/feature/cluster/variable (y).
- Parameters
x (np.ndarray) – The first 1D binary array.
y (np.ndarray) – The second 1D binary array.
- Return float
The relative risk between arrays x and y.
- Example
>>> Statistics.relative_risk(x=np.array([0, 1, 1]), y=np.array([0, 1, 0]))
>>> 2.0
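The ratio of event probabilities reduces to a one-liner on binary arrays; a sketch (function name illustrative, not SimBA's API):

```python
import numpy as np

def relative_risk(x: np.ndarray, y: np.ndarray) -> float:
    # P(event in x) / P(event in y); assumes y contains at least one 1.
    return float(x.mean() / y.mean())
```

This reproduces the example above: event rates 2/3 and 1/3 give a relative risk of 2.0.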
- static rolling_cohens_d(data: ndarray, time_windows: ndarray, fps: float) ndarray [source]
Jitted compute of rolling Cohen’s D statistic comparing the current time-window of size N to the preceding window of size N.
- Parameters
data (ndarray) – 1D array of size len(frames) representing feature values.
time_windows (np.ndarray[int]) – Time windows to compute Cohen's D for, in seconds.
fps (int) – Frame-rate of recorded video.
- Returns np.ndarray
Array of size data.shape[0] x window_sizes.shape[1] with Cohens D.
- Example
>>> sample_1, sample_2 = np.random.normal(loc=10, scale=1, size=4), np.random.normal(loc=11, scale=2, size=4)
>>> sample = np.hstack((sample_1, sample_2))
>>> Statistics().rolling_cohens_d(data=sample, time_windows=np.array([1]), fps=4)
>>> [[0.], [0.], [0.], [0.], [0.14718302], [0.14718302], [0.14718302], [0.14718302]]
- static rolling_independent_sample_t(data: ndarray, time_window: float, fps: float) ndarray [source]
Jitted compute independent-sample t-statistics for sequentially binned values in a time-series. E.g., compute t-test statistics when comparing Feature N in the current 1s time-window, versus Feature N in the previous 1s time-window.
- Parameters
data (ndarray) – 1D array of feature values.
time_window (float) – Size of the time window in seconds.
fps (float) – Frame-rate of recorded video.
Attention
Each window is compared to the prior window. Output for the windows without a prior window (the first window) is -1.
- Example
>>> data_1, data_2 = np.random.normal(loc=10, scale=2, size=10), np.random.normal(loc=20, scale=2, size=10)
>>> data = np.hstack([data_1, data_2])
>>> Statistics().rolling_independent_sample_t(data, time_window=1, fps=10)
>>> [-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -6.88741389, -6.88741389, -6.88741389, -6.88741389, -6.88741389, -6.88741389, -6.88741389, -6.88741389, -6.88741389, -6.88741389]
- rolling_jensen_shannon_divergence(data: ndarray, time_windows: ndarray, fps: int, bucket_method: typing_extensions.Literal['fd', 'doane', 'auto', 'scott', 'stone', 'rice', 'sturges', 'sqrt'] = 'auto') ndarray [source]
Compute rolling Jensen-Shannon divergence comparing the current time-window of size N to the preceding window of size N.
- Parameters
data (ndarray) – 1D array of size len(frames) representing feature values.
time_windows (np.ndarray[ints]) – Time windows to compute JS for in seconds.
fps (int) – Frame-rate of recorded video.
bucket_method (Literal) – Estimator determining optimal bucket count and bucket width. Default: The maximum of the Sturges and Freedman-Diaconis estimators
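The per-window quantity is the Jensen-Shannon divergence between the two windows' histograms. A minimal NumPy sketch of that core computation (illustrative only, independent of SimBA's implementation; base-2 logs bound the result to [0, 1]):

```python
import numpy as np

def jensen_shannon_divergence(p_counts: np.ndarray, q_counts: np.ndarray) -> float:
    # p_counts and q_counts are histogram counts over shared bins.
    p = p_counts / p_counts.sum()
    q = q_counts / q_counts.sum()
    m = 0.5 * (p + q)
    def kl(a: np.ndarray, b: np.ndarray) -> float:
        mask = a > 0  # treat 0 * log(0) as 0
        return float(np.sum(a[mask] * np.log2(a[mask] / b[mask])))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```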
- rolling_kullback_leibler_divergence(data: ndarray, time_windows: ndarray, fps: int, fill_value: int = 1, bucket_method: typing_extensions.Literal['fd', 'doane', 'auto', 'scott', 'stone', 'rice', 'sturges', 'sqrt'] = 'auto') ndarray [source]
Compute rolling Kullback-Leibler divergence comparing the current time-window of size N to the preceding window of size N.
Note
Empty bins (0 observations in bin) are replaced with fill_value.
- Parameters
data (ndarray) – 1d array representing feature values.
time_windows (np.ndarray[floats]) – Time windows to compute KL divergence for in seconds.
fps (int) – Frame-rate of recorded video.
fill_value (int) – Value used to replace empty bins.
bucket_method (Literal) – Estimator determining optimal bucket count and bucket width. Default: The maximum of the Sturges and Freedman-Diaconis estimators.
- Returns np.ndarray
Size data.shape[0] x time_windows.shape[0] with Kullback-Leibler divergence. Columns represent different time windows.
- Example
>>> sample_1, sample_2 = np.random.normal(loc=10, scale=700, size=5), np.random.normal(loc=50, scale=700, size=5)
>>> data = np.hstack((sample_1, sample_2))
>>> Statistics().rolling_kullback_leibler_divergence(data=data, time_windows=np.array([1]), fps=2)
- static rolling_levenes(data: ndarray, time_windows: ndarray, fps: float) float [source]
Jitted compute of rolling Levene’s W comparing the current time-window of size N to the preceding window of size N.
Note
The first time bin (which has no preceding time bin) is given fill value 0.
- Parameters
data (ndarray) – 1D array of size len(frames) representing feature values.
time_windows (np.ndarray) – Time windows to compute Levene's W for in seconds.
fps (float) – Frame-rate of recorded video.
- Returns np.ndarray
Levene’s W data of size len(data) x len(time_windows).
- Example
>>> data = np.random.randint(0, 50, (100)).astype(np.float64)
>>> Statistics().rolling_levenes(data=data, time_windows=np.array([1]).astype(np.float64), fps=5.0)
- static rolling_mann_whitney(data: ndarray, time_windows: ndarray, fps: float)[source]
Jitted compute of rolling Mann-Whitney U comparing the current time-window of size N to the preceding window of size N.
Note
The first time bin (which has no preceding time bin) is given fill value 0.
- Parameters
data (ndarray) – 1D array of size len(frames) representing feature values.
time_windows (np.ndarray) – Time windows to compute Mann-Whitney U for in seconds.
fps (float) – Frame-rate of recorded video.
- Returns np.ndarray
Mann-Whitney U data of size len(data) x len(time_windows).
- Examples
>>> data = np.random.randint(0, 4, (200)).astype(np.float32)
>>> results = Statistics().rolling_mann_whitney(data=data, time_windows=np.array([1.0]), fps=1)
- static rolling_one_way_anova(data: ndarray, time_windows: ndarray, fps: int) ndarray [source]
Jitted compute of rolling one-way ANOVA F-statistic comparing the current time-window of size N to the preceding window of size N.
- Parameters
data (ndarray) – 1D array of size len(frames) representing feature values.
time_windows (np.ndarray[ints]) – Time windows to compute ANOVAs for in seconds.
fps (int) – Frame-rate of recorded video.
- Example
>>> sample = np.random.normal(loc=10, scale=1, size=10).astype(np.float32)
>>> Statistics().rolling_one_way_anova(data=sample, time_windows=np.array([1.0]), fps=2)
>>> [[0.00000000e+00][0.00000000e+00][2.26221263e-06][2.26221263e-06][5.39119950e-03][5.39119950e-03][1.46725486e-03][1.46725486e-03][1.16392111e-02][1.16392111e-02]]
- rolling_population_stability_index(data: ndarray, time_windows: ndarray, fps: int, fill_value: int = 1, bucket_method: typing_extensions.Literal['fd', 'doane', 'auto', 'scott', 'stone', 'rice', 'sturges', 'sqrt'] = 'auto') ndarray [source]
Compute rolling Population Stability Index (PSI) comparing the current time-window of size N to the preceding window of size N.
Note
Empty bins (0 observations in bin) are replaced with fill_value.
- Parameters
data (ndarray) – 1d array representing feature values.
time_windows (np.ndarray) – Time windows to compute PSI for in seconds.
fps (int) – Frame-rate of recorded video.
fill_value (int) – Value used to replace empty bins (0 observations in bin).
bucket_method (Literal) – Estimator determining optimal bucket count and bucket width. Default: The maximum of the Sturges and Freedman-Diaconis estimators.
- Returns np.ndarray
PSI data of size len(data) x len(time_windows).
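The per-window PSI between two histograms reduces to a short formula. A minimal NumPy sketch of that core step (illustrative, independent of SimBA's implementation; the empty-bin replacement mirrors the fill_value behaviour described above):

```python
import numpy as np

def population_stability_index(expected_counts: np.ndarray,
                               observed_counts: np.ndarray,
                               fill_value: float = 1.0) -> float:
    # Empty bins are replaced with fill_value before normalizing so the
    # log-ratio stays finite, as in the docstring above.
    e = np.where(expected_counts == 0, fill_value, expected_counts).astype(np.float64)
    o = np.where(observed_counts == 0, fill_value, observed_counts).astype(np.float64)
    e, o = e / e.sum(), o / o.sum()
    return float(np.sum((o - e) * np.log(o / e)))
```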
- rolling_shapiro_wilks(data: ndarray, time_window: float, fps: int) ndarray [source]
Compute Shapiro-Wilks normality statistics for sequentially binned values in a time-series. E.g., compute the normality statistics of Feature N in each window of time_window seconds.
- Parameters
data (ndarray) – 1D array of size len(frames) representing feature values.
time_window (float) – The size of the sequential time windows in seconds.
fps (int) – Frame-rate of recorded video.
- Return np.ndarray
Array of size data.shape[0] with Shapiro-Wilks normality statistics
- Example
>>> data = np.random.randint(low=0, high=100, size=(200)).astype('float32')
>>> results = Statistics().rolling_shapiro_wilks(data=data, time_window=1, fps=30)
- static rolling_two_sample_ks(data: ndarray, time_window: float, fps: float) ndarray [source]
Jitted compute of Kolmogorov two-sample statistics for sequentially binned values in a time-series. E.g., compute KS statistics when comparing Feature N in the current 1s time-window versus Feature N in the previous 1s time-window.
- Parameters
data (ndarray) – 1D array with feature values.
time_window (float) – The size of the sequential time windows in seconds.
fps (float) – Frame-rate of recorded video.
- Return np.ndarray
Array of size data.shape[0] with KS statistics
- Example
>>> data = np.random.randint(low=0, high=100, size=(200)).astype('float32')
>>> results = Statistics().rolling_two_sample_ks(data=data, time_window=1, fps=30)
- rolling_wasserstein_distance(data: ndarray, time_windows: ndarray, fps: int, bucket_method: typing_extensions.Literal['fd', 'doane', 'auto', 'scott', 'stone', 'rice', 'sturges', 'sqrt'] = 'auto') ndarray [source]
Compute rolling Wasserstein distance comparing the current time-window of size N to the preceding window of size N.
- Parameters
data (ndarray) – 1D array of size len(frames) representing feature values.
time_windows (np.ndarray[ints]) – Time windows to compute Wasserstein distance for in seconds.
fps (int) – Frame-rate of recorded video.
bucket_method (Literal) – Estimator determining optimal bucket count and bucket width. Default: The maximum of the Sturges and Freedman-Diaconis estimators
- Returns np.ndarray
Size data.shape[0] x time_windows.shape[0] with Wasserstein distance. Columns represent different time windows.
- Example
>>> data = np.random.randint(0, 100, (100,))
>>> Statistics().rolling_wasserstein_distance(data=data, time_windows=np.array([1, 2]), fps=30)
- static sliding_autocorrelation(data: ndarray, max_lag: float, time_window: float, fps: float)[source]
Jitted computation of sliding autocorrelations, which measures the correlation of a feature with itself using lagged windows.
- Parameters
data (ndarray) – 1D array with feature values.
max_lag (float) – Maximum lag in seconds used when computing the autocorrelation.
time_window (float) – Size of the sliding time window in seconds.
fps (float) – Frame-rate of recorded video.
- Return np.ndarray
1D array containing the sliding autocorrelation values.
- Example
>>> data = np.array([0,1,2,3,4,5,6,7,8,1,10,11,12,13,14]).astype(np.float32)
>>> Statistics().sliding_autocorrelation(data=data, max_lag=0.5, time_window=1.0, fps=10)
>>> [ 0., 0., 0., 0., 0., 0., 0., 0., 0., -3.686, -2.029, -1.323, -1.753, -3.807, -4.634]
- static sliding_dominant_frequencies(data: ndarray, fps: float, k: int, time_windows: ndarray, window_function: Optional[typing_extensions.Literal['Hann', 'Hamming', 'Blackman']] = None)[source]
Find the K dominant frequencies within a feature vector using sliding windows
- static sliding_eta_squared(x: ndarray, y: ndarray, window_sizes: ndarray, sample_rate: int) ndarray [source]
Calculate sliding window eta-squared, a measure of effect size for between-subjects designs, over multiple window sizes.
- Parameters
x (np.ndarray) – The array containing the dependent variable data.
y (np.ndarray) – The array containing the grouping variable (categorical) data.
window_sizes (np.ndarray) – 1D array of window sizes in seconds.
sample_rate (int) – The sampling rate of the data in frames per second.
- Return np.ndarray
Array of size x.shape[0] x window_sizes.shape[0] with sliding eta squared values.
- Example
>>> x = np.random.randint(0, 10, (10000,))
>>> y = np.random.randint(0, 2, (10000,))
>>> Statistics.sliding_eta_squared(x=x, y=y, window_sizes=np.array([1.0, 2.0]), sample_rate=10)
- static sliding_independent_samples_t(data: ndarray, time_window: float, slide_time: float, critical_values: ndarray, fps: float) ndarray [source]
Jitted compute of sliding independent sample t-test. Compares the feature values in current time-window to prior time-windows to find the length in time to the most recent time-window where a significantly different feature value distribution is detected.
- Parameters
data (ndarray) – 1D array with feature values.
time_window (float) – The sizes of the two feature value windows being compared in seconds.
slide_time (float) – The slide size of the second window.
critical_values (ndarray) – 2D array where indexes represent degrees of freedom and values represent critical T values. Can be found in simba.assets.critical_values_05.pickle.
fps (int) – The fps of the recorded video.
- Returns np.ndarray
1D array of size len(data) with values representing time to most recent significantly different feature distribution.
- Example
>>> data = np.random.randint(0, 50, (10)).astype(np.float32)
>>> critical_values = pickle.load(open("simba/assets/lookups/critical_values_05.pickle", "rb"))['independent_t_test']['one_tail'].values.astype(np.float32)
>>> results = Statistics().sliding_independent_samples_t(data=data, time_window=0.5, fps=5.0, critical_values=critical_values, slide_time=0.30)
- static sliding_kendall_tau(sample_1: ndarray, sample_2: ndarray, time_windows: ndarray, fps: float) ndarray [source]
Compute sliding Kendall’s Tau correlation coefficient.
Calculates Kendall’s Tau correlation coefficient between two samples over sliding time windows. Kendall’s Tau is a measure of correlation between two ranked datasets.
The computation is based on the formula:
τ = (C − D) / (½ n(n − 1))
where C (concordant pairs) counts pairs of elements with the same order in both samples, and D (discordant pairs) counts pairs with different orders.
- Parameters
sample_1 (np.ndarray) – First sample for comparison.
sample_2 (np.ndarray) – Second sample for comparison.
time_windows (np.ndarray) – Rolling time windows in seconds.
fps (float) – Frames per second (FPS) of the recorded video.
- Returns
Array of Kendall’s Tau correlation coefficients corresponding to each time window.
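The per-window statistic is Kendall's tau-a from the formula above. A minimal (O(n²), unoptimized) NumPy sketch of that core computation, independent of SimBA's sliding implementation:

```python
import numpy as np

def kendall_tau(sample_1: np.ndarray, sample_2: np.ndarray) -> float:
    # tau-a: (concordant - discordant) / (n * (n - 1) / 2); ties count as neither.
    n = sample_1.size
    concordant = discordant = 0
    for i in range(n - 1):
        for j in range(i + 1, n):
            s = np.sign(sample_1[j] - sample_1[i]) * np.sign(sample_2[j] - sample_2[i])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)
```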
- static sliding_kurtosis(data: ndarray, time_windows: ndarray, sample_rate: int) ndarray [source]
Compute the kurtosis of a 1D array within sliding time windows.
- Parameters
data (np.ndarray) – Input data array.
time_windows (np.ndarray) – 1D array of time window durations in seconds.
sample_rate (int) – Sampling rate of the data in samples per second.
- Return np.ndarray
2D array of kurtosis values with rows corresponding to data points and columns corresponding to time windows.
- Example
>>> data = np.random.randint(0, 100, (10,))
>>> kurtosis = Statistics().sliding_kurtosis(data=data.astype(np.float32), time_windows=np.array([1.0, 2.0]), sample_rate=2)
- static sliding_mad_median_rule(data: ndarray, k: int, time_windows: ndarray, fps: float) ndarray [source]
Count the number of outliers in a sliding time-window using the MAD-Median Rule.
The MAD-Median Rule is a robust method for outlier detection. It calculates the median absolute deviation (MAD) and uses it to identify outliers based on a threshold defined as k times the MAD.
- Parameters
data (np.ndarray) – 1D array with feature values.
k (int) – Threshold multiplier; values deviating more than k times the MAD from the median are counted as outliers.
time_windows (np.ndarray) – 1D array of sliding time window durations in seconds.
fps (float) – Frame-rate of recorded video.
- Return np.ndarray
Array of size (data.shape[0], time_windows.shape[0]) with counts of outliers detected.
- Example
>>> data = np.random.randint(0, 50, (50000,)).astype(np.float32)
>>> Statistics.sliding_mad_median_rule(data=data, k=2, time_windows=np.array([20.0]), fps=1.0)
- static sliding_pearsons_r(sample_1: ndarray, sample_2: ndarray, time_windows: ndarray, fps: int) ndarray [source]
Given two 1D arrays of size N, create a sliding window of size time_windows[i] * fps and return Pearson's R between the values in the two 1D arrays in each window. Addresses "what is the correlation between Feature 1 and Feature 2 in the current X.X seconds of the video?".
- Parameters
sample_1 (np.ndarray) – First 1D array with feature values.
sample_2 (np.ndarray) – Second 1D array with feature values.
time_windows (np.ndarray) – Sliding time windows in seconds.
fps (int) – Frame-rate of recorded video.
- Returns np.ndarray
2d array of Pearsons R of size len(sample_1) x len(time_windows). Note, if sliding window is 10 frames, the first 9 entries will be filled with 0.
- Example
>>> sample_1 = np.random.randint(0, 50, (10)).astype(np.float32)
>>> sample_2 = np.random.randint(0, 50, (10)).astype(np.float32)
>>> Statistics().sliding_pearsons_r(sample_1=sample_1, sample_2=sample_2, time_windows=np.array([0.5]), fps=10)
>>> [[-1.][-1.][-1.][-1.][0.227][-0.319][-0.196][0.474][-0.061][0.713]]
- static sliding_phi_coefficient(data: ndarray, window_sizes: ndarray, sample_rate: int) ndarray [source]
Calculate sliding phi coefficients for a 2x2 contingency table derived from binary data.
Computes sliding phi coefficients for a 2x2 contingency table derived from binary data over different time windows. The phi coefficient is a measure of association between two binary variables, and sliding phi coefficients can reveal changes in association over time.
- Parameters
data (np.ndarray) – A 2D NumPy array containing binary data organized in two columns. Each row represents a pair of binary values for two variables.
window_sizes (np.ndarray) – 1D NumPy array specifying the time windows (in seconds) over which to calculate the sliding phi coefficients.
sample_rate (int) – The sampling rate or time interval (in samples per second, e.g., fps) at which data points were collected.
- Returns np.ndarray
A 2D NumPy array containing the calculated sliding phi coefficients. Each row corresponds to the phi coefficients calculated for a specific time point, the columns correspond to time-windows.
- Example
>>> data = np.random.randint(0, 2, (200, 2))
>>> Statistics().sliding_phi_coefficient(data=data, window_sizes=np.array([1.0, 4.0]), sample_rate=10)
- static sliding_relative_risk(x: ndarray, y: ndarray, window_sizes: ndarray, sample_rate: int) ndarray [source]
Calculate sliding relative risk values between two binary arrays using different window sizes.
- Parameters
x (np.ndarray) – The first 1D binary array.
y (np.ndarray) – The second 1D binary array.
window_sizes (np.ndarray) – 1D array of window sizes in seconds.
sample_rate (int) – The sampling rate of the data in frames per second.
- Return np.ndarray
Array of size x.shape[0] x window_sizes.shape[0] with sliding relative risk values.
- Example
>>> Statistics.sliding_relative_risk(x=np.array([0, 1, 1, 0]), y=np.array([0, 1, 0, 0]), window_sizes=np.array([1.0]), sample_rate=2)
- static sliding_skew(data: ndarray, time_windows: ndarray, sample_rate: int) ndarray [source]
Compute the skewness of a 1D array within sliding time windows.
- Parameters
data (np.ndarray) – Input data array.
time_windows (np.ndarray) – 1D array of time window durations in seconds.
sample_rate (int) – Sampling rate of the data in samples per second.
- Return np.ndarray
2D array of skewness values with rows corresponding to data points and columns corresponding to time windows.
- Example
>>> data = np.random.randint(0, 100, (10,))
>>> skewness = Statistics().sliding_skew(data=data.astype(np.float32), time_windows=np.array([1.0, 2.0]), sample_rate=2)
- static sliding_spearman_rank_correlation(sample_1: ndarray, sample_2: ndarray, time_windows: ndarray, fps: int) ndarray [source]
Given two 1D arrays of size N, create a sliding window of size time_windows[i] * fps and return Spearman's rank correlation between the values in the two 1D arrays in each window. Addresses "what is the correlation between Feature 1 and Feature 2 in the current X.X seconds of the video?".
- Parameters
sample_1 (np.ndarray) – First 1D array with feature values.
sample_2 (np.ndarray) – Second 1D array with feature values.
time_windows (np.ndarray) – Sliding time windows in seconds.
fps (int) – Frame-rate of recorded video.
- Returns np.ndarray
2d array of Spearman's ranks of size len(sample_1) x len(time_windows). Note, if the sliding window is 10 frames, the first 9 entries will be filled with 0. The 10th value represents the correlation in the first 10 frames.
- Example
>>> sample_1 = np.array([9,10,13,22,15,18,15,19,32,11]).astype(np.float32)
>>> sample_2 = np.array([11, 12, 15, 19, 21, 26, 19, 20, 22, 19]).astype(np.float32)
>>> Statistics().sliding_spearman_rank_correlation(sample_1=sample_1, sample_2=sample_2, time_windows=np.array([0.5]), fps=10)
- static sliding_z_scores(data: ndarray, time_windows: ndarray, fps: int) ndarray [source]
Calculate sliding Z-scores for a given data array over specified time windows.
This function computes sliding Z-scores for a 1D data array over different time windows. The sliding Z-score is a measure of how many standard deviations a data point is from the mean of the surrounding data within the specified time window. This can be useful for detecting anomalies or variations in time-series data.
- Parameters
data (ndarray) – 1D NumPy array containing the time-series data.
time_windows (np.ndarray) – 1D NumPy array specifying the time windows in seconds over which to calculate the Z-scores.
fps (int) – Frames per second, used to convert time windows from seconds to the corresponding number of data points.
- Returns np.ndarray
A 2D NumPy array containing the calculated Z-scores. Each row corresponds to a data point, and each column corresponds to a time window.
- Example
>>> data = np.random.randint(0, 100, (1000,)).astype(np.float32)
>>> z_scores = Statistics().sliding_z_scores(data=data, time_windows=np.array([1.0, 2.5]), fps=10)
- static sokal_sneath(x: ndarray, y: ndarray, w: Optional[ndarray] = None) float64 [source]
Jitted calculation of the Sokal-Sneath coefficient between two binary vectors (e.g., of classified behaviors). 0 represents independence, 1 represents complete interdependence.
Note
Adapted from pynndescent.
- Parameters
x (np.ndarray) – First binary vector.
y (np.ndarray) – Second binary vector.
w (Optional[np.ndarray]) – Optional weights for each element. Can be classification probabilities. If not provided, equal weights are assumed.
- Example
>>> x = np.array([0, 1, 0, 0, 1]).astype(np.int8)
>>> y = np.array([1, 0, 1, 1, 0]).astype(np.int8)
>>> Statistics().sokal_sneath(x, y)
>>> 0.0
- static spearman_rank_correlation(sample_1: ndarray, sample_2: ndarray) float [source]
Jitted compute of Spearman’s rank correlation coefficient between two samples.
Spearman’s rank correlation coefficient assesses how well the relationship between two variables can be described using a monotonic function. It computes the strength and direction of the monotonic relationship between ranked variables.
ρ = 1 − (6 Σ d_i²) / (n(n² − 1))
where d_i is the difference between the ranks of corresponding elements in sample_1 and sample_2, and n is the number of observations.
- Parameters
sample_1 (np.ndarray) – First 1D array containing feature values.
sample_2 (np.ndarray) – Second 1D array containing feature values.
- Return float
Spearman’s rank correlation coefficient.
- Example
>>> sample_1 = np.array([7, 2, 9, 4, 5, 6, 7, 8, 9]).astype(np.float32)
>>> sample_2 = np.array([1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5, 8.5, 9.5]).astype(np.float32)
>>> Statistics().spearman_rank_correlation(sample_1=sample_1, sample_2=sample_2)
>>> 0.0003979206085205078
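The rank-difference formula above can be sketched directly in NumPy (illustrative only, and assuming no ties so that argsort-of-argsort yields valid ranks):

```python
import numpy as np

def spearman_rho(sample_1: np.ndarray, sample_2: np.ndarray) -> float:
    # 0-based ranks via double argsort (valid when there are no ties),
    # then rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1)).
    rank_1 = np.argsort(np.argsort(sample_1))
    rank_2 = np.argsort(np.argsort(sample_2))
    d = rank_1 - rank_2
    n = sample_1.size
    return float(1 - (6 * np.sum(d ** 2)) / (n * (n ** 2 - 1)))
```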
- static total_variation_distance(x: ndarray, y: ndarray, bucket_method: Optional[typing_extensions.Literal['fd', 'doane', 'auto', 'scott', 'stone', 'rice', 'sturges', 'sqrt']] = 'auto')[source]
Calculate the total variation distance between two probability distributions.
- Parameters
x (np.ndarray) – A 1-D array representing the first sample.
y (np.ndarray) – A 1-D array representing the second sample.
bucket_method (Optional[str]) – The method used to determine the number of bins for histogram computation. Supported methods are ‘fd’ (Freedman-Diaconis), ‘doane’, ‘auto’, ‘scott’, ‘stone’, ‘rice’, ‘sturges’, and ‘sqrt’. Defaults to ‘auto’.
- Return float
The total variation distance between the two distributions.
TV(P, Q) = ½ Σ |P(i) − Q(i)|, where P(i) and Q(i) are the probabilities assigned by the distributions P and Q to the same event i, respectively.
- Example
>>> Statistics.total_variation_distance(x=np.array([1, 5, 10, 20, 50]), y=np.array([1, 5, 10, 100, 110]))
>>> 0.3999999761581421
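Once the two samples are binned into shared histograms, the distance itself is a one-liner. A minimal NumPy sketch of that final step (independent of SimBA's bucketing logic):

```python
import numpy as np

def total_variation(p_counts: np.ndarray, q_counts: np.ndarray) -> float:
    # TV(P, Q) = 0.5 * sum(|P(i) - Q(i)|) over shared histogram bins.
    p = p_counts / p_counts.sum()
    q = q_counts / q_counts.sum()
    return float(0.5 * np.abs(p - q).sum())
```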
- static two_sample_ks(sample_1: ~numpy.ndarray, sample_2: ~numpy.ndarray, critical_values: ~typing.Optional[array(float64, 2d, A)] = None) -> (<class 'float'>, typing.Union[bool, NoneType])[source]
Jitted compute the two-sample Kolmogorov-Smirnov (KS) test statistic and, optionally, test for statistical significance.
The two-sample KS test is a non-parametric test that compares the cumulative distribution functions (ECDFs) of two independent samples to assess whether they come from the same distribution.
KS statistic (D) is calculated as the maximum absolute difference between the empirical cumulative distribution functions (ECDFs) of the two samples.
If critical_values are provided, the function checks the significance of the KS statistic against the critical values.
- Parameters
sample_1 (np.ndarray) – The first sample array for the KS test.
sample_2 (np.ndarray) – The second sample array for the KS test.
critical_values (Optional[float64[:, :]]) – An array of critical values for the KS test. If provided, the function will also check the significance of the KS statistic against the critical values. Default: None.
- Returns (float Union[bool, None])
Returns a tuple containing the KS statistic and a boolean indicating whether the test is statistically significant.
- Example
>>> sample_1 = np.array([1, 2, 3, 1, 3, 2, 1, 10, 8, 4, 10]).astype(np.float32)
>>> sample_2 = np.array([10, 5, 10, 4, 8, 10, 7, 10, 7, 10, 10]).astype(np.float32)
>>> critical_values = pickle.load(open("simba/assets/lookups/critical_values_5.pickle", "rb"))['two_sample_KS']['one_tail'].values
>>> two_sample_ks(sample_1=sample_1, sample_2=sample_2, critical_values=critical_values)
>>> (0.7272727272727273, True)
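The ECDF-based D statistic described above can be sketched in plain NumPy (an illustration of the statistic, not SimBA's jitted implementation, and without the critical-value significance check):

```python
import numpy as np

def two_sample_ks_stat(sample_1: np.ndarray, sample_2: np.ndarray) -> float:
    # D = max |ECDF_1(v) - ECDF_2(v)|, evaluated at every observed value v.
    values = np.sort(np.concatenate([sample_1, sample_2]))
    ecdf_1 = np.searchsorted(np.sort(sample_1), values, side='right') / sample_1.size
    ecdf_2 = np.searchsorted(np.sort(sample_2), values, side='right') / sample_2.size
    return float(np.abs(ecdf_1 - ecdf_2).max())
```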
- wasserstein_distance(sample_1: ndarray, sample_2: ndarray, bucket_method: typing_extensions.Literal['fd', 'doane', 'auto', 'scott', 'stone', 'rice', 'sturges', 'sqrt'] = 'auto') float [source]
Compute Wasserstein distance between two distributions.
Note
Uses
stats.wasserstein_distance
. I have tried to movestats.wasserstein_distance
to jitted method extensively, but this doesn’t give significant runtime improvement. Rate-limiter appears to be the _hist_1d.- Parameters
sample_1 (ndarray) – First 1d array representing feature values.
sample_2 (ndarray) – Second 1d array representing feature values.
bucket_method (Literal) – Estimator determining optimal bucket count and bucket width. Default: The maximum of the Sturges and Freedman-Diaconis estimators
- Returns float
Wasserstein distance between sample_1 and sample_2.
- Example
>>> sample_1 = np.random.normal(loc=10, scale=2, size=10)
>>> sample_2 = np.random.normal(loc=10, scale=3, size=10)
>>> Statistics().wasserstein_distance(sample_1=sample_1, sample_2=sample_2)
>>> 0.020833333333333332
- static wilcoxon(x: ndarray, y: ndarray) Tuple[float, float] [source]
Perform the Wilcoxon signed-rank test for paired samples.
Wilcoxon signed-rank test is a non-parametric statistical hypothesis test used to compare two related samples, matched samples, or repeated measurements on a single sample to assess whether their population mean ranks differ.
- Parameters
x (np.ndarray) – 1D array representing the observations for the first sample.
y (np.ndarray) – 1D array representing the observations for the second sample.
- Returns
A tuple containing the test statistic (z-score) and the effect size (r).
The test statistic (z-score) measures the deviation of the observed ranks sum from the expected sum.
The effect size (r) measures the strength of association between the variables.
- static youden_j(sample_1: ndarray, sample_2: ndarray) float [source]
Calculate Youden’s J statistic from two binary arrays.
Youden’s J statistic is a measure of the overall performance of a binary classification test, taking into account both sensitivity (true positive rate) and specificity (true negative rate).
- Parameters
sample_1 (np.ndarray) – The first binary array.
sample_2 (np.ndarray) – The second binary array.
- Return float
Youden’s J statistic.
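The statistic reduces to sensitivity + specificity − 1 over the 2x2 confusion counts. A minimal NumPy sketch (illustrative only; treating sample_1 as ground truth and sample_2 as predictions is an assumption, since the docstring does not assign roles to the two arrays):

```python
import numpy as np

def youden_j(sample_1: np.ndarray, sample_2: np.ndarray) -> float:
    # J = sensitivity + specificity - 1, from the 2x2 confusion counts.
    tp = np.sum((sample_1 == 1) & (sample_2 == 1))
    tn = np.sum((sample_1 == 0) & (sample_2 == 0))
    fp = np.sum((sample_1 == 0) & (sample_2 == 1))
    fn = np.sum((sample_1 == 1) & (sample_2 == 0))
    return float(tp / (tp + fn) + tn / (tn + fp) - 1)
```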
- static yule_coef(x: ndarray, y: ndarray, w: Optional[ndarray] = None) float64 [source]
Jitted calculation of the Yule coefficient between two binary vectors (e.g., of classified behaviors). 0 represents independence, 2 represents complete interdependence.
Note
Adapted from pynndescent.
- Parameters
x (np.ndarray) – First binary vector.
y (np.ndarray) – Second binary vector.
w (Optional[np.ndarray]) – Optional weights for each element. Can be classification probabilities. If not provided, equal weights are assumed.
- Example
>>> x = np.random.randint(0, 2, (50,)).astype(np.int8)
>>> y = x ^ 1
>>> Statistics().yule_coef(x=x, y=y)
>>> 2
>>> random_indices = np.random.choice(len(x), size=len(x)//2, replace=False)
>>> y = np.copy(x)
>>> y[random_indices] = 1 - y[random_indices]
>>> Statistics().yule_coef(x=x, y=y)
>>> 0.99
Circular feature extraction methods
- class simba.mixins.circular_statistics.CircularStatisticsMixin[source]
Bases:
object
Mixin for circular statistics. Unlike linear data, circular data wrap around in a circular or periodic manner: e.g., two measurements of 360° vs. 1° are more similar than two measurements of 1° vs. 3°. The minimum and maximum values are connected, forming a closed loop, and we therefore need specialized statistical methods.
These methods have support for multiple animals and base radial directions derived from two or three body-parts.
Methods are adapted from the referenced packages below, which are far more mature. However, runtime on standard hardware (multicore CPU) is prioritized here and is typically orders of magnitude faster than the referenced libraries.
See image below for example of expected run-times for a small set of method examples included in this class.
Note
Many methods have numba typed signatures to decrease compilation time through reduced type inference. Make sure to pass the correct dtypes as indicated by the signature decorators.
Important
See references below for mature packages computing more extensive circular measurements.
References
- 1
- 2
- 3
- 4
- 5
- 6
- static agg_angular_diff_timebins(data: ndarray, time_windows: ndarray, fps: int) ndarray [source]
Compute the difference between the median angle in the current time-window versus the previous time-window. For example, computes the difference between the median angle in the first 1s of the video versus the second 1s of the video, the second 1s versus the third 1s, etc.
Note
The first time-bin of the video cannot be compared against the prior time-bin and is populated with 0.
- Parameters
data (ndarray) – 1D array of size len(frames) representing degrees.
time_windows (np.ndarray) – Rolling time-windows in seconds.
fps (int) – fps of the recorded video
- Example
>>> data = np.random.normal(loc=45, scale=3, size=20).astype(np.float32)
>>> CircularStatisticsMixin().agg_angular_diff_timebins(data=data, time_windows=np.array([1.0]), fps=5.0)
- static circular_correlation(sample_1: ndarray, sample_2: ndarray) float [source]
Jitted compute of circular correlation coefficient of two samples using the cross-correlation coefficient. Ranges from -1 to 1: 1 indicates perfect positive correlation, -1 indicates perfect negative correlation, 0 indicates no correlation.
Note
Adapted from
astropy.stats.circstats.circcorrcoef
.- Parameters
sample_1 (np.ndarray) – Angular data for e.g., Animal 1
sample_2 (np.ndarray) – Angular data for e.g., Animal 2
- Returns float
The correlation between the two distributions.
- Example
>>> sample_1 = np.array([50, 90, 20, 60, 20, 90]).astype(np.float32)
>>> sample_2 = np.array([50, 90, 70, 60, 20, 90]).astype(np.float32)
>>> CircularStatisticsMixin().circular_correlation(sample_1=sample_1, sample_2=sample_2)
>>> 0.7649115920066833
- static circular_hotspots(data: ndarray, bins: ndarray) ndarray [source]
Calculate the proportion of data points falling within circular bins.
- Parameters
data (ndarray) – 1D array of circular data measured in degrees.
bins (ndarray) – 2D array of shape (N, 2) representing circular bins, each row defining [start_degree, end_degree] inclusive.
- Return np.ndarray
1D array containing the proportion of data points that fall within each specified circular bin.
- Example
>>> data = np.array([270, 360, 10, 90, 91, 180, 185, 260]).astype(np.float32)
>>> bins = np.array([[270, 90], [91, 269]])
>>> CircularStatisticsMixin().circular_hotspots(data=data, bins=bins)
>>> [0.5, 0.5]
>>> bins = np.array([[270, 0], [1, 90], [91, 180], [181, 269]])
>>> CircularStatisticsMixin().circular_hotspots(data=data, bins=bins)
>>> [0.25, 0.25, 0.25, 0.25]
- static circular_mean(data: ndarray) float [source]
Jitted compute of the circular mean of single sample.
- Parameters
data (np.ndarray) – 1D array of size len(frames) representing angles in degrees.
- Returns float
The circular mean of the angles in degrees.
- Example
>>> data = np.array([50, 90, 70, 60, 20, 90]).astype(np.float32)
>>> CircularStatisticsMixin().circular_mean(data=data)
>>> 63.737892150878906
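The circular mean is the direction of the average unit vector. A minimal NumPy sketch of the idea (illustrative, independent of SimBA's jitted implementation):

```python
import numpy as np

def circular_mean(data_degrees: np.ndarray) -> float:
    # Average the unit vectors for each angle, then convert the resultant
    # direction back to degrees in [0, 360).
    radians = np.deg2rad(data_degrees)
    mean_angle = np.arctan2(np.mean(np.sin(radians)), np.mean(np.cos(radians)))
    return float(np.rad2deg(mean_angle) % 360)
```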
- static circular_range(data: ndarray) float [source]
Jitted compute of circular range in degrees. The range is defined as the angular span of the shortest arc that can contain all the data points. A smaller range indicates a more concentrated distribution, while a larger range suggests a more dispersed distribution.
- Parameters
data (ndarray) – 1D array of circular data measured in degrees
- Return np.ndarray
The circular range in degrees.
- Example
>>> CircularStatisticsMixin().circular_range(np.array([350, 20, 60, 100]).astype(np.float32))
>>> 110.0
>>> CircularStatisticsMixin().circular_range(np.array([110, 20, 60, 100]).astype(np.float32))
>>> 90.0
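One way to compute the shortest containing arc is via the largest gap between sorted angles. A minimal NumPy sketch (illustrative only; SimBA's jitted implementation may differ):

```python
import numpy as np

def circular_range(data_degrees: np.ndarray) -> float:
    # Shortest arc containing all points: 360 minus the largest gap between
    # consecutive angles when sorted around the circle (including wrap-around).
    angles = np.sort(np.asarray(data_degrees, dtype=np.float64) % 360)
    gaps = np.diff(np.concatenate([angles, [angles[0] + 360]]))
    return float(360 - gaps.max())
```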
- static circular_std(data: ndarray) float [source]
Jitted compute of the circular standard deviation from a single distribution of angles in degrees.
- Parameters
data (ndarray) – 1D array of size len(frames) with angles in degrees
- Returns float
The standard deviation of the data sample in degrees
The circular standard deviation is computed as σ = √(−2 ln R), where R is the mean resultant length of the unit vectors (cos θ, sin θ) and θ represents the angles in radians.
- Example
>>> data = np.array([180, 221, 32, 42, 212, 101, 139, 41, 69, 171, 149, 200]).astype(np.float32)
>>> CircularStatisticsMixin().circular_std(data=data)
>>> 75.03725024504664
- static degrees_to_cardinal(data: ndarray) List[str] [source]
Convert degree angles to cardinal direction bucket e.g., 0 -> “N”, 180 -> “S”
Note
To convert cardinal literals to integers, map using simba.utils.enums.lookups.cardinality_to_integer_lookup. To convert integers to cardinal literals, map using simba.utils.enums.lookups.integer_to_cardinality_lookup.
- Parameters
degree_angles (np.ndarray) – 1d array of degrees. Note: returned by self.head_direction.
- Return List[str]
List of strings representing frame-wise cardinality.
- Example
>>> data = np.array(list(range(0, 405, 45))).astype(np.float32)
>>> CircularStatisticsMixin().degrees_to_cardinal(degree_angles=data)
>>> ['N', 'NE', 'E', 'SE', 'S', 'SW', 'W', 'NW', 'N']
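The bucketing amounts to rounding each angle to the nearest 45° sector. A minimal NumPy sketch reproducing the example above (illustrative, independent of SimBA's implementation):

```python
import numpy as np

def degrees_to_cardinal(degree_angles: np.ndarray):
    # Each 45-degree sector is centred on one of the eight compass directions.
    directions = ['N', 'NE', 'E', 'SE', 'S', 'SW', 'W', 'NW']
    idx = np.round(np.asarray(degree_angles) / 45.0).astype(int) % 8
    return [directions[i] for i in idx]
```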
- static direction_three_bps(nose_loc: ndarray, left_ear_loc: ndarray, right_ear_loc: ndarray) ndarray [source]
Jitted helper to compute the degree angle from three body-parts. Computes the angles in degrees left_ear <-> nose and right_ear <-> nose and returns the midpoint.
- Parameters
nose_loc (ndarray) – 2D array of size len(frames)x2 representing nose coordinates
left_ear_loc (ndarray) – 2D array of size len(frames)x2 representing left ear coordinates
right_ear_loc (ndarray) – 2D array of size len(frames)x2 representing right ear coordinates
- Return np.ndarray
Array of size nose_loc.shape[0] with direction in degrees.
- Example
>>> nose_loc = np.random.randint(low=0, high=500, size=(50, 2)).astype(np.float32)
>>> left_ear_loc = np.random.randint(low=0, high=500, size=(50, 2)).astype(np.float32)
>>> right_ear_loc = np.random.randint(low=0, high=500, size=(50, 2)).astype(np.float32)
>>> results = CircularStatisticsMixin().direction_three_bps(nose_loc=nose_loc, left_ear_loc=left_ear_loc, right_ear_loc=right_ear_loc)
- static direction_two_bps(anterior_loc: ndarray, posterior_loc: ndarray) ndarray [source]
Jitted method computing degree directionality from two body-parts, e.g., nape and nose, or swim_bladder and tail.
- Parameters
anterior_loc (np.ndarray) – Size len(frames) x 2 array with x and y coordinates of the anterior body-part.
posterior_loc (np.ndarray) – Size len(frames) x 2 array with x and y coordinates of the posterior body-part.
- Return np.ndarray
Frame-wise directionality in degrees.
- Example
>>> swim_bladder_loc = np.random.randint(low=0, high=500, size=(50, 2)).astype(np.float32)
>>> tail_loc = np.random.randint(low=0, high=500, size=(50, 2)).astype(np.float32)
>>> CircularStatisticsMixin().direction_two_bps(anterior_loc=swim_bladder_loc, posterior_loc=tail_loc)
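A minimal NumPy sketch of the underlying idea, assuming the standard mathematical angle convention (SimBA's jitted implementation may use an image-coordinate convention instead):

```python
import numpy as np

def direction_two_bps(anterior_loc, posterior_loc):
    """Frame-wise direction (degrees, 0-360) of the posterior -> anterior vector.

    Sketch using the standard math convention: 0 degrees points along +x,
    angles increase counter-clockwise.
    """
    delta = anterior_loc - posterior_loc
    return np.degrees(np.arctan2(delta[:, 1], delta[:, 0])) % 360

posterior = np.zeros((3, 2), dtype=np.float32)
anterior = np.array([[1, 0], [0, 1], [-1, 0]], dtype=np.float32)
print(direction_two_bps(anterior, posterior))  # [  0.  90. 180.]
```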
- static fit_circle(data: ndarray, max_iterations: Optional[int] = 400) ndarray [source]
Fit a circle to a dataset using the least squares method.
This function fits a circle to a dataset using the least squares method. The circle is defined by the equation (x − x_c)² + (y − y_c)² = r², where (x_c, y_c) is the circle center and r is the radius.
Note
Adapted to numba JIT from the circle-fit hyperLSQ method.
References
- 1
Kanatani, Rangarajan, Hyper least squares fitting of circles and ellipses, Computational Statistics & Data Analysis, vol. 55, pp. 2197-2208, 2011.
- 2
Lapp, Salazar, Champagne. Automated maternal behavior during early life in rodents (AMBER) pipeline, Scientific Reports, 13:18277, 2023.
- Parameters
data (np.ndarray) – A 3D NumPy array with shape (N, M, 2). N represent frames, M represents the number of body-parts, and 2 represents x and y coordinates.
max_iterations (int) – The maximum number of iterations for fitting the circle.
- Returns np.ndarray
Array with shape (N, 3) with N representing frame and 3 representing (i) X-coordinate of the circle center, (ii) Y-coordinate of the circle center, and (iii) Radius of the circle
- Example
>>> data = np.array([[[5, 10], [10, 5], [15, 10], [10, 15]]])
>>> CircularStatisticsMixin().fit_circle(data=data, max_iterations=88)
>>> [[10, 10, 5]]
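For intuition, a circle fit can be sketched with the simpler Kåsa linear least-squares method (an illustrative sketch only; SimBA's fit_circle uses the more robust hyperLSQ algorithm, though the two agree on well-spread points):

```python
import numpy as np

def fit_circle_kasa(points):
    """Fit a circle to 2D points with the Kasa linear least-squares method.

    Solves x^2 + y^2 = a*x + b*y + c in the least-squares sense; the center
    is then (a/2, b/2) and the radius sqrt(c + (a/2)^2 + (b/2)^2).
    """
    x, y = points[:, 0], points[:, 1]
    A = np.column_stack([x, y, np.ones_like(x)])
    b = x ** 2 + y ** 2
    a_, b_, c_ = np.linalg.lstsq(A, b, rcond=None)[0]
    cx, cy = a_ / 2, b_ / 2
    r = np.sqrt(c_ + cx ** 2 + cy ** 2)
    return cx, cy, r

# Four points on a circle centred at (10, 10) with radius 5:
pts = np.array([[5, 10], [10, 5], [15, 10], [10, 15]], dtype=np.float64)
print(fit_circle_kasa(pts))  # approximately (10.0, 10.0, 5.0)
```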
- static instantaneous_angular_velocity(data: ndarray, bin_size: int) ndarray [source]
Jitted compute of absolute angular change in the smallest possible time bin.
Note
If the smallest possible frame-to-frame time-bin in Video 1 is 33ms (recorded at 30fps), and the smallest possible frame-to-frame time-bin in Video 2 is 66ms (recorded at 15fps), we correct for this across recordings using the bin_size argument. E.g., when passing angular data from Video 1 we set bin_size to 2, and when passing angular data from Video 2 we set bin_size to 1, to allow comparisons of instantaneous angular velocity between Video 1 and Video 2.
When the current frame minus bin_size results in a negative index, -1 is returned.
- Parameters
data (ndarray) – 1D array of size len(frames) representing degrees.
bin_size (int) – The number of frames prior to compare the current angular velocity against.
- Example
>>> data = np.array([350, 360, 365, 360]).astype(np.float32)
>>> CircularStatisticsMixin().instantaneous_angular_velocity(data=data, bin_size=1)
>>> [-1., 10.00002532, 4.999999, 4.999999]
>>> CircularStatisticsMixin().instantaneous_angular_velocity(data=data, bin_size=2)
>>> [-1., -1., 15.00002432, 0.]
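The behaviour documented above can be sketched in plain NumPy (illustrative, not the jitted implementation):

```python
import numpy as np

def instantaneous_angular_velocity(data, bin_size):
    """Absolute circular difference between frame i and frame i - bin_size.

    Indices where i - bin_size < 0 are populated with -1, matching the docs.
    """
    results = np.full(data.shape[0], -1.0)
    for i in range(bin_size, data.shape[0]):
        d = abs(data[i] - data[i - bin_size]) % 360
        results[i] = min(d, 360 - d)  # shortest arc between the two angles
    return results

data = np.array([350, 360, 365, 360], dtype=np.float32)
print(instantaneous_angular_velocity(data, bin_size=1))  # [-1. 10.  5.  5.]
print(instantaneous_angular_velocity(data, bin_size=2))  # [-1. -1. 15.  0.]
```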
- static kuipers_two_sample_test(sample_1: ndarray, sample_2: ndarray) float [source]
Compute the Kuiper’s two-sample test statistic for circular distributions.
Kuiper’s two-sample test is a non-parametric test used to determine if two samples are drawn from the same circular distribution. It is particularly useful for circular data, such as angles or directions.
Note
Adapted from Kuiper by Anne Archibald.
- Parameters
sample_1 (ndarray) – The first circular sample array in degrees.
sample_2 (ndarray) – The second circular sample array in degrees.
- Return float
Kuiper’s test statistic.
- Example
>>> sample_1, sample_2 = np.random.normal(loc=45, scale=1, size=100).astype(np.float32), np.random.normal(loc=180, scale=20, size=100).astype(np.float32)
>>> CircularStatisticsMixin().kuipers_two_sample_test(sample_1=sample_1, sample_2=sample_2)
- static mean_resultant_vector_length(data: ndarray) float [source]
Jitted compute of the mean resultant vector length of a single sample. Captures the overall “pull” or “tendency” of the data points towards a central direction on the circle with a range between 0 and 1.
The mean resultant vector length is computed as R = (1/N) · √((Σᵢ cos θᵢ)² + (Σᵢ sin θᵢ)²), where N is the number of data points and θᵢ is the angle of the i-th data point in radians. Equivalently, R = (1/N) Σᵢ cos(θᵢ − θ̄), where θ̄ is the mean angle.
- Parameters
data (np.ndarray) – 1D array of size len(frames) representing angles in degrees.
- Returns float
The mean resultant vector of the angles. 1 represents tendency towards a single point. 0 represents no central point.
- Example
>>> data = np.array([50, 90, 70, 60, 20, 90]).astype(np.float32)
>>> CircularStatisticsMixin().mean_resultant_vector_length(data=data)
>>> 0.9132277170817057
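The computation can be sketched in a few lines of NumPy, reproducing the documented example value:

```python
import numpy as np

def mean_resultant_vector_length(data):
    """Mean resultant vector length R of angles given in degrees (sketch)."""
    theta = np.deg2rad(data)
    # Average the unit vectors, then take the length of the mean vector.
    return np.sqrt(np.cos(theta).mean() ** 2 + np.sin(theta).mean() ** 2)

data = np.array([50, 90, 70, 60, 20, 90], dtype=np.float32)
print(mean_resultant_vector_length(data))  # close to the documented 0.9132...
```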
- static rao_spacing(data: array)[source]
Jitted compute of Rao’s spacing for angular data.
Computes the uniformity of a circular dataset in degrees. Low output values represent concentrated angularity, while high values represent dispersed angularity.
- Parameters
data (ndarray) – 1D array of size len(frames) with data in degrees.
- Return int
Rao’s spacing measure, indicating the dispersion or concentration of angular data points.
- References
- 1
UCSB.
- Example
>>> data = np.random.randint(0, 360, (5000,)).astype(np.float32)
>>> CircularStatisticsMixin().rao_spacing(data=data)
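Rao's spacing can be sketched as follows (illustrative, not the jitted implementation): perfectly uniform spacings give 0, concentrated angles give large values.

```python
import numpy as np

def rao_spacing(data):
    """Rao's spacing U for angular data in degrees (plain-NumPy sketch).

    U = 0.5 * sum(|T_i - lambda|), where T_i are the spacings between
    adjacent sorted angles (including the wrap-around spacing) and
    lambda = 360 / n is the expected spacing under uniformity.
    """
    angles = np.sort(data % 360)
    n = angles.shape[0]
    wrap = 360.0 - angles[-1] + angles[0]  # spacing across the 0/360 boundary
    t = np.append(np.diff(angles), wrap)
    lam = 360.0 / n
    return 0.5 * np.sum(np.abs(t - lam))

print(rao_spacing(np.array([0, 90, 180, 270], dtype=np.float32)))  # 0.0 (perfectly uniform)
print(rao_spacing(np.array([0, 1, 2, 3], dtype=np.float32)))       # 267.0 (concentrated)
```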
- static rayleigh(data: ndarray) Tuple[float, float] [source]
Jitted compute of Rayleigh Z (test of non-uniformity) of single sample of circular data in degrees.
The Rayleigh Z statistic is computed as Z = nR², where n is the sample size and R is the mean resultant vector length.
The associated p-value is then computed from Z and the sample size.
- Parameters
data (ndarray) – 1D array of size len(frames) representing degrees.
- Returns Tuple[float, float]
Tuple with Rayleigh Z score and associated probability value.
- Example
>>> data = np.array([350, 360, 365, 360, 100, 109, 232, 123, 42, 3, 4, 145]).astype(np.float32)
>>> CircularStatisticsMixin().rayleigh(data=data)
>>> (2.3845645695246467, 0.9842236169985417)
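The Z statistic can be sketched from the mean resultant vector length (a sketch of the test statistic only; the p-value computation is omitted):

```python
import numpy as np

def rayleigh_z(data):
    """Rayleigh Z for a sample of angles in degrees: Z = n * R**2 (sketch).

    R is the mean resultant vector length. Larger Z means stronger
    non-uniformity (concentration around a mean direction).
    """
    theta = np.deg2rad(data)
    r = np.sqrt(np.cos(theta).mean() ** 2 + np.sin(theta).mean() ** 2)
    return data.shape[0] * r ** 2

concentrated = np.array([44, 45, 45, 46], dtype=np.float32)
dispersed = np.array([0, 90, 180, 270], dtype=np.float32)
print(rayleigh_z(concentrated))  # close to n = 4, since R is close to 1
print(rayleigh_z(dispersed))     # close to 0
```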
- static rolling_rayleigh_z(data: ndarray, time_windows: ndarray, fps: int) Tuple[ndarray, ndarray] [source]
Jitted compute of Rayleigh Z (test of non-uniformity) of circular data within sliding time-window.
Note
Adapted from pingouin.circular.circ_rayleigh and pycircstat.tests.rayleigh.
- Parameters
data (ndarray) – 1D array of size len(frames) representing degrees.
time_windows (np.ndarray) – Rolling time-windows as floats in seconds. Two windows of 0.5s and 1s would be represented as np.array([0.5, 1.0])
fps (int) – fps of the recorded video
- Returns Tuple[np.ndarray, np.ndarray]
Two 2d arrays with the first representing Rayleigh Z scores and second representing associated p values.
- Example
>>> data = np.random.randint(low=0, high=361, size=(100,)).astype(np.float32)
>>> CircularStatisticsMixin().rolling_rayleigh_z(data=data, time_windows=np.array([0.5, 1.0]), fps=10)
- static rotational_direction(data: ndarray, stride: int = 1) ndarray [source]
Jitted compute of frame-by-frame rotational direction within a 1D timeseries array of angular data.
- Parameters
data (ndarray) – 1D array of size len(frames) representing degrees.
stride (int) – The number of frames between the two compared frames. Default: 1.
- Return numpy.ndarray
An array of directional indicators:
- 0 indicates no rotational change relative to the prior frame.
- 1 indicates a clockwise rotational change relative to the prior frame.
- 2 indicates a counter-clockwise rotational change relative to the prior frame.
Note
For the first frame, no rotation is possible, so it is populated with -1.
Frame-by-frame rotations of exactly 180° are denoted as clockwise rotations.
- Example
>>> data = np.array([45, 50, 35, 50, 80, 350, 350, 0, 180]).astype(np.float32)
>>> CircularStatisticsMixin().rotational_direction(data)
>>> [-1., 1., 2., 1., 1., 2., 0., 1., 1.]
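The documented convention can be sketched in plain NumPy (illustrative only): increasing angles, including exact 180° changes, count as clockwise.

```python
import numpy as np

def rotational_direction(data):
    """Frame-wise rotation: -1 first frame, 0 none, 1 clockwise, 2 counter-clockwise."""
    results = np.full(data.shape[0], -1.0)
    for i in range(1, data.shape[0]):
        d = (data[i] - data[i - 1]) % 360
        if d == 0:
            results[i] = 0
        elif d <= 180:        # shortest (or tied 180-degree) path is clockwise
            results[i] = 1
        else:
            results[i] = 2    # shortest path is counter-clockwise
    return results

data = np.array([45, 50, 35, 50, 80, 350, 350, 0, 180], dtype=np.float32)
print(rotational_direction(data))  # [-1. 1. 2. 1. 1. 2. 0. 1. 1.]
```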
- static sliding_angular_diff(data: ndarray, time_windows: ndarray, fps: float) ndarray [source]
Computes the angular difference in the current frame versus N seconds previously. For example, if the current angle is 45 degrees, and the angle N seconds previously was 350 degrees, then the difference is 55 degrees.
Note
Frames where the current frame minus N seconds prior results in a negative index are populated with 0.
Results are rounded to the nearest integer.
- Parameters
data (ndarray) – 1D array of size len(frames) representing degrees.
time_windows (np.ndarray) – Rolling time-windows as floats in seconds.
fps (int) – fps of the recorded video
- Example
>>> data = np.array([350, 350, 1, 1]).astype(np.float32)
>>> CircularStatisticsMixin().sliding_angular_diff(data=data, fps=1.0, time_windows=np.array([1.0]))
- static sliding_bearing(x: ndarray, lag: float, fps: float) ndarray [source]
Calculates the sliding bearing (direction) of movement in degrees for a sequence of 2D points representing a single body-part.
Note
To calculate frame-by-frame bearing, pass fps == 1 and lag == 1.
- Parameters
x (np.ndarray) – An array of shape (n, 2) representing the time-series sequence of 2D points.
lag (float) – The lag time (in seconds) used for calculating the sliding bearing. E.g., if 1, then bearing will be calculated using coordinates in the current frame vs the frame 1s previously.
fps (float) – The sample rate (frames per second) of the sequence.
- Return np.ndarray
An array containing the sliding bearings (in degrees) for each point in the sequence.
- Example
>>> x = np.array([[10, 10], [20, 10]])
>>> CircularStatisticsMixin.sliding_bearing(x=x, lag=1, fps=1)
>>> [-1. 90.]
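A minimal sketch reproducing the documented example, assuming a compass-style bearing measured from the positive y-axis towards the positive x-axis (SimBA's implementation may normalize the angle range differently):

```python
import numpy as np

def sliding_bearing(x, lag, fps):
    """Bearing (degrees) of movement over a lag of lag * fps frames (sketch).

    Frames without a full preceding lag are populated with -1.
    """
    shift = int(lag * fps)
    results = np.full(x.shape[0], -1.0)
    for i in range(shift, x.shape[0]):
        dx = x[i, 0] - x[i - shift, 0]
        dy = x[i, 1] - x[i - shift, 1]
        # arctan2(dx, dy): 0 degrees points along +y, 90 degrees along +x.
        results[i] = np.degrees(np.arctan2(dx, dy))
    return results

x = np.array([[10, 10], [20, 10]], dtype=np.float32)
print(sliding_bearing(x, lag=1, fps=1))  # [-1. 90.]
```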
- static sliding_circular_correlation(sample_1: ndarray, sample_2: ndarray, time_windows: ndarray, fps: float) ndarray [source]
Jitted compute of correlations between two angular distributions in sliding time-windows using the cross-correlation coefficient.
Note
Values prior to the ending of the first time window will be filled with 0.
- Parameters
sample_1 (np.ndarray) – First 1D array of angular data in degrees.
sample_2 (np.ndarray) – Second 1D array of angular data in degrees.
time_windows (np.ndarray) – Rolling time-windows as floats in seconds.
fps (float) – Frame-rate of the recorded video.
- Return np.ndarray
Array of size len(sample_1) x len(time_window) with correlation coefficients.
- Example
>>> sample_1 = np.random.randint(0, 361, (200,)).astype(np.float32)
>>> sample_2 = np.random.randint(0, 361, (200,)).astype(np.float32)
>>> CircularStatisticsMixin().sliding_circular_correlation(sample_1=sample_1, sample_2=sample_2, time_windows=np.array([0.5, 1.0]), fps=10.0)
- static sliding_circular_hotspots(data: ndarray, bins: ndarray, time_window: float, fps: float) ndarray [source]
Jitted compute of sliding circular hotspots in a dataset. Calculates circular hotspots in a time-series dataset by sliding a time window across the data and computing hotspot statistics for specified circular bins.
- Parameters
data (np.ndarray) – 1D array of circular data measured in degrees.
bins (np.ndarray) – 2D array where each row holds the start and end angle (in degrees) of a circular bin.
time_window (float) – Size of the sliding time window in seconds.
fps (float) – Frame-rate of the recorded video.
- Return np.ndarray
A 2D numpy array where each row corresponds to a time point in data, and each column represents a circular bin. The values in the array represent the proportion of data points within each bin at each time point.
Note
The function utilizes the Numba JIT compiler for improved performance.
Circular bin definitions should follow the convention where angles are specified in degrees within the range [0, 360], and the bins are defined using start and end angles inclusive. For example, (0, 90) represents the first quadrant in a circular space.
Output data at the beginning of the series, where a full time-window is not yet satisfied (e.g., the first 9 observations when fps is 10 and time_windows = [1.0]), will be populated with 0.
Warning
Note that 0 can be used as a bin-edge, but 360 should not be used as a bin-edge. Instead, use 0 and 359, or 1 and 360.
- Example
>>> data = np.array([270, 360, 10, 20, 90, 91, 180, 185, 260, 265]).astype(np.float32)
>>> bins = np.array([[270, 90], [91, 268]])
>>> CircularStatisticsMixin().sliding_circular_hotspots(data=data, bins=bins, time_window=0.5, fps=10)
>>> [[-1. , -1. ],
>>>  [-1. , -1. ],
>>>  [-1. , -1. ],
>>>  [-1. , -1. ],
>>>  [ 0.5, 0. ],
>>>  [ 0.4, 0.1],
>>>  [ 0.3, 0.2],
>>>  [ 0.2, 0.3],
>>>  [ 0.1, 0.4],
>>>  [ 0. , 0.5]]
- static sliding_circular_mean(data: ndarray, time_windows: ndarray, fps: int) ndarray [source]
Compute the circular mean in degrees within sliding temporal windows.
- Parameters
data (np.ndarray) – 1d array with feature values in degrees.
time_windows (np.ndarray) – Rolling time-windows as floats in seconds. E.g., [0.2, 0.4, 0.6]
fps (int) – fps of the recorded video
- Returns np.ndarray
Size data.shape[0] x time_windows.shape[0] array
Attention
The returned values represent the angular mean dispersion in the time-window [current_frame - time_window -> current_frame]. -1 is returned when current_frame - time_window is less than 0.
- Example
>>> data = np.random.normal(loc=45, scale=1, size=20).astype(np.float32)
>>> CircularStatisticsMixin().sliding_circular_mean(data=data, time_windows=np.array([0.5, 1.0]), fps=10)
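The per-window computation reduces to the circular mean, which can be sketched as follows (illustrative only; the sliding variant applies this to each [frame - window, frame] slice):

```python
import numpy as np

def circular_mean(data):
    """Circular mean (degrees) of angles given in degrees (sketch).

    Averages the unit vectors and takes the angle of the mean vector.
    """
    theta = np.deg2rad(data)
    return np.degrees(np.arctan2(np.sin(theta).mean(), np.cos(theta).mean())) % 360

m = circular_mean(np.array([350.0, 10.0]))
print(m)  # ~0 (mod 360), not the arithmetic mean 180
```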
- static sliding_circular_range(data: ndarray, time_windows: ndarray, fps: int) ndarray [source]
Jitted compute of sliding circular range for a time series of circular data. The range is defined as the angular span of the shortest arc that can contain all the data points. Measures the circular spread of data within sliding time windows of specified duration.
Note
Output data at the beginning of the series, where a full time-window is not yet satisfied (e.g., the first 9 observations when fps is 10 and time_windows = [1.0]), will be populated with 0.
- Parameters
data (np.ndarray) – 1D array of circular data measured in degrees
time_windows (np.ndarray) – Size of sliding time window in seconds. E.g., two windows of 0.5s and 1s would be represented as np.array([0.5, 1.0])
fps (int) – Frame-rate of recorded video.
- Return np.ndarray
Array of size len(sample_1) x len(time_window) with angular ranges in degrees.
- Examples
>>> data = np.array([260, 280, 300, 340, 360, 0, 10, 350, 0, 15]).astype(np.float32)
>>> CircularStatisticsMixin().sliding_circular_range(data=data, time_windows=np.array([0.5]), fps=10)
>>> [[ -1.], [ -1.], [ -1.], [ -1.], [100.], [80.], [70.], [30.], [20.], [25.]]
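The per-window computation reduces to the circular range, which can be sketched as follows (illustrative only): sort the angles, find the largest gap between neighbours including the wrap-around gap, and subtract it from 360.

```python
import numpy as np

def circular_range(data):
    """Angular span (degrees) of the shortest arc containing all angles (sketch)."""
    angles = np.sort(data % 360)
    if angles.shape[0] < 2:
        return 0.0
    # Gaps between adjacent sorted angles, plus the gap across the 0/360 boundary.
    gaps = np.append(np.diff(angles), 360.0 - angles[-1] + angles[0])
    return 360.0 - np.max(gaps)

print(circular_range(np.array([350.0, 0.0, 10.0])))        # 20.0
print(circular_range(np.array([0.0, 90.0, 180.0, 270.0])))  # 270.0
```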
- static sliding_circular_std(data: ndarray, fps: int, time_windows: ndarray) ndarray [source]
Compute standard deviation of angular data in sliding time windows.
- Parameters
data (ndarray) – 1D array of size len(frames) representing degrees.
time_windows (np.ndarray) – Sliding time-windows as floats in seconds.
fps (int) – fps of the recorded video
- Returns np.ndarray
Size data.shape[0] x time_windows.shape[0] with angular standard deviations in rolling time windows in degrees.
- Example
>>> data = np.array([180, 221, 32, 42, 212, 101, 139, 41, 69, 171, 149, 200]).astype(np.float32)
>>> CircularStatisticsMixin().sliding_circular_std(data=data, time_windows=np.array([0.5]), fps=10)
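The per-window computation reduces to the circular standard deviation, sketched here with σ = √(−2 · ln R), where R is the mean resultant vector length (the full-sample result matches the value in the circular_std example above):

```python
import numpy as np

def circular_std(data):
    """Circular standard deviation (degrees) of angles in degrees (sketch)."""
    theta = np.deg2rad(data)
    r = np.sqrt(np.cos(theta).mean() ** 2 + np.sin(theta).mean() ** 2)
    r = min(r, 1.0)  # guard against rounding pushing R slightly above 1
    return np.degrees(np.sqrt(-2 * np.log(r)))

print(circular_std(np.array([45.0, 45.0, 45.0])))  # ~0 for identical angles
data = np.array([180, 221, 32, 42, 212, 101, 139, 41, 69, 171, 149, 200], dtype=np.float32)
print(circular_std(data))  # close to the documented 75.037...
```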
- static sliding_kuipers_two_sample_test(sample_1: ndarray, sample_2: ndarray, time_windows: ndarray, fps: int) ndarray [source]
Jitted compute of Kuipers two-sample test comparing two distributions with sliding time window.
This function calculates the Kuipers two-sample test statistic for each time window, sliding through the given circular data sequences.
- Parameters
sample_1 (np.ndarray) – The first circular sample array in degrees.
sample_2 (np.ndarray) – The second circular sample array in degrees.
time_windows (np.ndarray) – An array containing the time window sizes (in seconds) for which the Kuipers two-sample test will be computed.
fps (int) – The frames per second, representing the sampling rate of the data.
- Returns np.ndarray
A 2D array containing the Kuipers two-sample test statistics for each time window and each time step.
- Examples
>>> sample_1 = np.random.randint(low=0, high=360, size=(100,)).astype(np.float64)
>>> sample_2 = np.random.randint(low=0, high=360, size=(100,)).astype(np.float64)
>>> D = CircularStatisticsMixin().sliding_kuipers_two_sample_test(sample_1=sample_1, sample_2=sample_2, time_windows=np.array([0.5, 5]), fps=2)
- static sliding_mean_resultant_vector_length(data: ndarray, fps: int, time_windows: ndarray) ndarray [source]
Jitted compute of the mean resultant vector within sliding time window. Captures the overall “pull” or “tendency” of the data points towards a central direction on the circle with a range between 0 and 1.
Attention
The returned values represent the resultant vector length in the time-window [(current_frame - time_window) -> current_frame]. -1 is returned where current_frame - time_window is less than 0.
- Parameters
data (np.ndarray) – 1D array of size len(frames) representing degrees.
time_windows (np.ndarray) – Rolling time-windows as floats in seconds.
fps (int) – fps of the recorded video
- Returns np.ndarray
Size len(data) x len(time_windows) array representing the resultant vector length in the prior time_window.
- Example
>>> data_1, data_2 = np.random.normal(loc=45, scale=1, size=100), np.random.normal(loc=90, scale=45, size=100)
>>> data = np.hstack([data_1, data_2])
>>> CircularStatisticsMixin().sliding_mean_resultant_vector_length(data=data.astype(np.float32), time_windows=np.array([1.0]), fps=10)
- static sliding_rao_spacing(data: ndarray, time_windows: ndarray, fps: int) ndarray [source]
Jitted compute of the uniformity of a circular dataset in sliding windows.
- Parameters
data (ndarray) – 1D array of size len(frames) representing degrees.
time_windows (np.ndarray) – Rolling time-windows as floats in seconds.
fps (int) – fps of the recorded video
- Return np.ndarray
Array of size len(data) x len(time_windows) representing Rao's spacing U in every sliding window [-window:n].
Rao's spacing is calculated as U = ½ Σᵢ |Tᵢ − λ|, where Tᵢ is the spacing between adjacent sorted data points in the sliding window, λ = 360/n is the expected spacing under uniformity, and n is the number of data points in the sliding window.
Note
For frames occurring before a complete time window, 0.0 is returned.
- References
- 1
UCSB.
- Example
>>> data = np.random.randint(low=0, high=360, size=(500,)).astype(np.float32)
>>> result = CircularStatisticsMixin().sliding_rao_spacing(data=data, time_windows=np.array([0.5, 1.0]), fps=10)
Plotting methods
- class simba.mixins.plotting_mixin.PlottingMixin[source]
Bases:
object
Methods for visualizations
- static categorical_scatter(data: Union[ndarray, DataFrame], columns: Optional[List[str]] = ('X', 'Y', 'Cluster'), palette: Optional[str] = 'Set1', show_box: Optional[bool] = False, size: Optional[int] = 10, title: Optional[str] = None, save_path: Optional[Union[str, PathLike]] = None)[source]
Create a 2D scatterplot with a categorical legend.
- Parameters
data (Union[np.ndarray, pd.DataFrame]) – Input data, either a NumPy array or a pandas DataFrame.
columns (Optional[List[str]]) – A list of column names for the x-axis, y-axis, and the categorical variable respectively. Default is [“X”, “Y”, “Cluster”].
palette (Optional[str]) – The color palette to be used for the categorical variable. Default is “Set1”.
show_box (Optional[bool]) – Whether to display the plot axis. Default is False.
size (Optional[int]) – Size of markers in the scatterplot. Default is 10.
title (Optional[str]) – Title for the plot. Default is None.
save_path (Optional[Union[str, os.PathLike]]) – The path where the plot will be saved. Default is None which returns the image.
- Returns matplotlib.axes._subplots.AxesSubplot or None
The scatterplot if ‘save_path’ is not provided, otherwise None.
- static continuous_scatter(data: Union[ndarray, DataFrame], columns: Optional[List[str]] = ('X', 'Y', 'Cluster'), palette: Optional[str] = 'magma', show_box: Optional[bool] = False, size: Optional[int] = 10, title: Optional[str] = None, bg_clr: Optional[str] = None, save_path: Optional[Union[str, PathLike]] = None)[source]
Create a 2D scatterplot with a continuous legend
- create_gantt_img(bouts_df: DataFrame, clf_name: str, image_index: int, fps: int, gantt_img_title: str)[source]
Helper to create a single gantt plot based on the data preceding the input image.
- Parameters
bouts_df (pd.DataFrame) – Dataframe holding information on individual bouts created by simba.misc_tools.get_bouts_for_gantt().
clf_name (str) – Name of the classifier.
image_index (int) – The count of the image. E.g., 1000 will create a gantt image representing frames 1-1000.
fps (int) – The fps of the input video.
gantt_img_title (str) – Title of the image.
:return np.ndarray
- create_single_color_lst(pallete_name: typing_extensions.Literal[<Options.PALETTE_OPTIONS: ['magma', 'jet', 'inferno', 'plasma', 'viridis', 'gnuplot2', 'RdBu', 'winter']>], increments: int, as_rgb_ratio: bool = False, as_hex: bool = False) List[Union[str, int, float]] [source]
Helper to create a color palette of bgr colors in a list.
- Parameters
pallete_name (str) – Name of the palette, e.g., 'magma' or 'jet'.
increments (int) – Number of colors to include in the created palette.
as_rgb_ratio (bool) – If True, return the colors as RGB ratios.
as_hex (bool) – If True, return the colors as HEX strings.
:return list
Note
If as_rgb_ratio AND as_hex, then returns HEX.
- static draw_lines_on_img(img: ndarray, start_positions: ndarray, end_positions: ndarray, color: Tuple[int, int, int], highlight_endpoint: Optional[bool] = False, thickness: Optional[int] = 2, circle_size: Optional[int] = 2) ndarray [source]
Helper to draw a set of lines onto an image.
- Parameters
img (np.ndarray) – The image to draw the lines on.
start_positions (np.ndarray) – 2D numpy array representing the start positions of the lines in x, y format.
end_positions (np.ndarray) – 2D numpy array representing the end positions of the lines in x, y format.
color (Tuple[int, int, int]) – The color of the lines in BGR format.
highlight_endpoint (Optional[bool]) – If True, highlights the ends of the lines with circles.
thickness (Optional[int]) – The thickness of the lines.
circle_size (Optional[int]) – If highlight_endpoint is True, the size of the highlighted points.
- Return np.ndarray
The image with the lines overlayed.
- get_bouts_for_gantt(data_df: DataFrame, clf_name: str, fps: int) ndarray [source]
Helper to detect all behavior bouts for a specific classifier.
- static insert_directing_line(directing_df: DataFrame, img: ndarray, shape_name: str, animal_name: str, frame_id: int, color: Optional[Tuple[int]] = (0, 0, 255), thickness: Optional[int] = 2, style: Optional[str] = 'lines') ndarray [source]
Helper to insert lines between the actor ‘eye’ and the ROI centers.
- Parameters
directing_df (pd.DataFrame) – Dataframe containing eye and ROI locations. Stored as results in an instance of simba.roi_tools.ROI_directing_analyzer.DirectingROIAnalyzer.
img (np.ndarray) – The image to draw the line on.
shape_name (str) – The name of the shape to draw the line to.
animal_name (str) – The name of the animal
frame_id (int) – The frame number in the video
color (Optional[Tuple[int]]) – The color of the line
thickness (Optional[int]) – The thickness of the line.
style (Optional[str]) – The style of the line. “lines” or “funnel”.
- Return np.ndarray
The input image with the line.
- static joint_plot(data: Union[ndarray, DataFrame], columns: Optional[List[str]] = ('X', 'Y', 'Cluster'), palette: Optional[str] = 'Set1', kind: Optional[str] = 'scatter', size: Optional[int] = 10, title: Optional[str] = None, save_path: Optional[Union[str, PathLike]] = None)[source]
Generate a joint plot.
Useful when visualizing embedded behavior data latent spaces with dense and overlapping scatters.
- Parameters
data (Union[np.ndarray, pd.DataFrame]) – Input data, either a NumPy array or a pandas DataFrame.
columns (Optional[List[str]]) – Names of columns if input is dataframe, default is [“X”, “Y”, “Cluster”].
palette (Optional[str]) – Palette for the plot, default is “Set1”.
kind (Optional[str]) – Type of plot (“scatter”, “kde”, “hist”, or “reg”), default is “scatter”.
size (Optional[int]) – Size of markers for scatter plot, default is 10.
title (Optional[str]) – Title of the plot, default is None.
save_path (Optional[Union[str, os.PathLike]]) – Path to save the plot image, default is None.
- Returns sns.JointGrid or None
JointGrid object if save_path is None, else None.
- Example
>>> data, lbls = make_blobs(n_samples=100000, n_features=2, centers=10, random_state=42)
>>> data = np.hstack((data, lbls.reshape(-1, 1)))
>>> PlottingMixin.joint_plot(data=data, columns=['X', 'Y', 'Cluster'], title='The plot')
- make_distance_plot(data: array, line_attr: Dict[int, str], style_attr: Dict[str, Any], fps: int, save_img: bool = False, save_path: Optional[str] = None) ndarray [source]
Helper to make a single line plot .png image with N lines.
- Parameters
data (np.array) – Two-dimensional array where rows represent frames and columns represent intertwined x and y coordinates.
line_attr (dict) – Line color attributes.
style_attr (dict) – Plot attributes (size, font size, line width etc).
fps (int) – Video frame rate.
save_path (Optional[str]) – Location to store the output .png image. If None, the image is returned.
- Example
>>> fps = 10
>>> data = np.random.random((100, 2))
>>> line_attr = {0: ['Blue'], 1: ['Red']}
>>> save_path = '/_tests/final_frm.png'
>>> style_attr = {'width': 640, 'height': 480, 'line width': 6, 'font size': 8, 'y_max': 'auto'}
>>> self.make_distance_plot(fps=fps, data=data, line_attr=line_attr, style_attr=style_attr, save_path=save_path)
- static make_line_plot_plotly(data: List[ndarray], colors: List[str], show_box: Optional[bool] = True, show_grid: Optional[bool] = False, width: Optional[int] = 640, height: Optional[int] = 480, line_width: Optional[int] = 6, font_size: Optional[int] = 8, bg_clr: Optional[str] = 'white', x_lbl_divisor: Optional[float] = None, title: Optional[str] = None, y_lbl: Optional[str] = None, x_lbl: Optional[str] = None, y_max: Optional[int] = -1, line_opacity: Optional[int] = 0.5, save_path: Optional[Union[str, PathLike]] = None)[source]
Create a line plot using Plotly.
Note
Plotly can be more reliable than matplotlib on some systems when accessed through multiprocessing calls.
If not called through multiprocessing, consider using simba.mixins.plotting_mixin.PlottingMixin.make_line_plot().
Uses kaleido to transform the image to a numpy array or save it to disk.
- Parameters
data (List[np.ndarray]) – List of 1D numpy arrays representing lines.
colors (List[str]) – List of named colors of size len(data).
show_box (bool) – Whether to show the plot box (axes, title, etc.).
show_grid (bool) – Whether to show gridlines on the plot.
width (int) – Width of the plot in pixels.
height (int) – Height of the plot in pixels.
line_width (int) – Width of the lines in the plot.
font_size (int) – Font size for axis labels and tick labels.
bg_clr (str) – Background color of the plot.
x_lbl_divisor (float) – Divisor for adjusting the tick spacing on the x-axis.
title (str) – Title of the plot.
y_lbl (str) – Label for the y-axis.
x_lbl (str) – Label for the x-axis.
y_max (int) – Maximum value for the y-axis.
line_opacity (float) – Opacity of the lines in the plot.
save_path (Union[str, os.PathLike]) – Path to save the plot image. If None, returns a numpy array of the plot.
- Returns
If save_path is None, returns a numpy array representing the plot image.
- Example
>>> p = np.random.randint(0, 50, (100,))
>>> y = np.random.randint(0, 50, (200,))
>>> img = PlottingMixin.make_line_plot_plotly(data=[p, y], show_box=False, font_size=20, bg_clr='white', show_grid=False, x_lbl_divisor=30, colors=['Red', 'Green'], save_path='/Users/simon/Desktop/envs/simba/troubleshooting/beepboop174/project_folder/frames/output/line_plot/Trial 3_final_img.png')
- static make_path_plot(data: List[ndarray], colors: List[Tuple[int, int, int]], width: Optional[int] = 640, height: Optional[int] = 480, max_lines: Optional[int] = None, bg_clr: Optional[Union[Tuple[int, int, int], ndarray]] = (255, 255, 255), circle_size: Optional[int] = 3, font_size: Optional[float] = 2.0, font_thickness: Optional[int] = 2, line_width: Optional[int] = 2, animal_names: Optional[List[str]] = None, clf_attr: Optional[Dict[str, Any]] = None, save_path: Optional[Union[str, PathLike]] = None) Union[None, ndarray] [source]
Creates a path plot visualization from the given data.
- Parameters
data (List[np.ndarray]) – List of numpy arrays containing path data.
colors (List[Tuple[int, int, int]]) – List of RGB tuples representing colors for each path.
width – Width of the output image (default is 640 pixels).
height – Height of the output image (default is 480 pixels).
max_lines – Maximum number of lines to plot from each path data.
bg_clr – Background color of the plot (default is white).
circle_size – Size of the circle marker at the end of each path (default is 3).
font_size – Font size for displaying animal names (default is 2.0).
font_thickness – Thickness of the font for displaying animal names (default is 2).
line_width – Width of the lines representing paths (default is 2).
animal_names – List of names for the animals corresponding to each path.
clf_attr – Dictionary containing attributes for classification markers.
save_path – Path to save the generated plot image.
- Returns
If save_path is None, returns the generated image as a numpy array, otherwise, returns None.
- Example
>>> x = np.random.randint(0, 500, (100, 2))
>>> y = np.random.randint(0, 500, (100, 2))
>>> position_data = np.random.randint(0, 500, (100, 2))
>>> clf_data_1 = np.random.randint(0, 2, (100,))
>>> clf_data_2 = np.random.randint(0, 2, (100,))
>>> clf_data = {'Attack': {'color': (155, 1, 10), 'size': 30, 'positions': position_data, 'clfs': clf_data_1}, 'Sniffing': {'color': (155, 90, 10), 'size': 30, 'positions': position_data, 'clfs': clf_data_2}}
>>> PlottingMixin.make_path_plot(data=[x, y], colors=[(0, 255, 0), (255, 0, 0)], clf_attr=clf_data)
- make_probability_plot(data: Series, style_attr: dict, clf_name: str, fps: int, save_path: str) ndarray [source]
Make a single classifier probability plot png image.
- Parameters
data (pd.Series) – Series holding the classifier probability values.
style_attr (dict) – Plot style attributes (size, font size, line width etc.).
clf_name (str) – Name of the classifier.
fps (int) – Frame-rate of the video.
save_path (str) – Location to store output .png image.
- Example
>>> data = pd.Series(np.random.random((100, 1)).flatten())
>>> style_attr = {'width': 640, 'height': 480, 'font size': 10, 'line width': 6, 'color': 'blue', 'circle size': 20}
>>> clf_name = 'Attack'
>>> fps = 10
>>> save_path = '/_test/frames/output/probability_plots/Together_1_final_frame.png'
>>> _ = self.make_probability_plot(data=data, style_attr=style_attr, clf_name=clf_name, fps=fps, save_path=save_path)
- static polygons_onto_image(img: ndarray, polygons: DataFrame, show_center: Optional[bool] = False, show_tags: Optional[bool] = False, circle_size: Optional[int] = 2) ndarray [source]
Helper to insert polygon overlays onto an image.
- Parameters
img (np.ndarray) – The image to draw the polygons on.
polygons (pd.DataFrame) – Dataframe holding the polygon definitions to overlay.
show_center (Optional[bool]) – If True, show the center point of each polygon.
show_tags (Optional[bool]) – If True, show the polygon tag locations.
circle_size (Optional[int]) – The size of the circles drawn at centers and tags.
- Returns
The input image with the polygon overlays.
- remove_a_folder(folder_dir: str) None [source]
Helper to remove a directory, used for cleaning up smaller multiprocessed videos following concatenation.
- resize_gantt(gantt_img: array, img_height: int) ndarray [source]
Helper to resize image while retaining aspect ratio.
- static rotate_img(img: ndarray, right: bool) ndarray [source]
Rotate a color image 90 degrees to the left or right.
- Parameters
img (np.ndarray) – Input image as numpy array in uint8 format.
right (bool) – If True, rotates to the right. If False, rotates to the left.
- Returns
The rotated image as a numpy array of uint8 format.
- Example
>>> img = cv2.imread('/Users/simon/Desktop/test.png') >>> rotated_img = PlottingMixin.rotate_img(img=img, right=False)
- split_and_group_df(df: ~pandas.core.frame.DataFrame, splits: int, include_row_index: bool = False, include_split_order: bool = True) -> (typing.List[pandas.core.frame.DataFrame], <class 'int'>)[source]
Helper to split a dataframe for multiprocessing. If include_split_order, then include the group number in split data as a column. If include_row_index, includes a column representing the row index in the array, which can be helpful for knowing the frame indexes while multiprocessing videos. Returns split data and approximations of number of observations per split.
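The behaviour can be sketched as follows (a sketch under the assumption that the split-order column is named 'group'; the actual column name in SimBA may differ):

```python
import numpy as np
import pandas as pd

def split_and_group_df(df, splits, include_split_order=True):
    """Split a dataframe into roughly equal chunks for multiprocessing (sketch).

    Optionally adds a 'group' column holding the chunk index, and returns
    the chunks plus the approximate number of observations per chunk.
    """
    idx_chunks = np.array_split(np.arange(len(df)), splits)
    chunks = [df.iloc[idx].copy() for idx in idx_chunks]
    if include_split_order:
        for i, chunk in enumerate(chunks):
            chunk['group'] = i
    return chunks, len(chunks[0])

df = pd.DataFrame({'frame': range(10)})
chunks, obs_per_split = split_and_group_df(df, splits=3)
print(len(chunks), obs_per_split)  # 3 4
```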
GUI pop-up methods
- class simba.mixins.pop_up_mixin.PopUpMixin(title: str, config_path: Optional[str] = None, main_scrollbar: Optional[bool] = True, size: Tuple[int, int] = (960, 720))[source]
Bases:
object
Methods for pop-up windows in SimBA. E.g., common methods for creating pop-up windows with drop-downs, checkboxes, entry-boxes, listboxes etc.
- Parameters
title (str) – Pop-up window title
config_path (Optional[configparser.Configparser]) – Path to SimBA project_config.ini. If a path is passed, the project config is read in. If None, the project config is not read in.
size (tuple) – The size (height x width) of the pop-up window in pixels.
main_scrollbar (bool) – If True, the pop-up window is scrollable.
- add_to_listbox_from_entrybox(list_box: Listbox, entry_box: Entry_Box)[source]
Add a value that populates a tkinter entry_box to a tkinter listbox.
- Parameters
list_box (Listbox) – The tkinter Listbox to add the value to.
entry_box (Entry_Box) – The tkinter Entry_Box containing the value that should be added to the list_box.
- add_value_to_listbox(list_box: Listbox, value: float)[source]
Add a float value to a tkinter listbox.
- Parameters
list_box (Listbox) – The tkinter Listbox to add the value to.
value (float) – Value to add to the listbox.
- add_values_to_several_listboxes(list_boxes: List[Listbox], values: List[float])[source]
Add N values to N listboxes. E.g., values[0] will be added to list_boxes[0].
- Parameters
list_boxes (List[Listbox]) – List of Listboxes that the values should be added to.
values (List[float]) – List of floats that will be added to the list_boxes.
- children_cnt_main() int [source]
Find the number of children (e.g., labelframes) that currently exist within the main pop-up window. Useful for finding the row at which a new frame should be inserted in the window.
- create_cb_frame(main_frm: Frame, cb_titles: List[str], frm_title: str, command: object = None) Dict[str, BooleanVar] [source]
Creates a labelframe with one checkbox per title in cb_titles, and inserts the labelframe at the bottom of the pop-up window.
- Parameters
- Return Dict[str, BooleanVar]
Dictionary holding the
cb_titles
as keys and the BooleanVar representing if the checkbox is ticked or not.
- create_choose_number_of_body_parts_frm(project_body_parts: List[str], run_function: object)[source]
Many menus depend on how many animals the user chooses to compute metrics for, so these menus must be populated dynamically. This function creates a single drop-down menu where the user selects the number of animals to compute metrics for. It inserts this drop-down at the bottom of the pop-up window, and ties the dropdown menu choice to a callback.
- create_clf_checkboxes(main_frm: Frame, clfs: List[str], title: str = 'SELECT CLASSIFIER ANNOTATIONS')[source]
Creates a labelframe with one checkbox per classifier, and inserts the labelframe into the bottom of the pop-up window.
Note
Legacy. Use
create_cb_frame
instead.
- create_dropdown_frame(main_frm: Frame, drop_down_titles: List[str], drop_down_options: List[str], frm_title: str) Dict[str, DropDownMenu] [source]
Creates a labelframe with dropdown menus and inserts it at the bottom of the pop-up window.
- Parameters
- Return Dict[str, BooleanVar]
Dictionary holding the
drop_down_titles
and the drop-down menus as values.
- create_run_frm(run_function: Callable, title: Optional[str] = 'RUN', btn_txt_clr: Optional[str] = 'black') None [source]
Create a label frame with a single button with a specified callback.
- enable_dropdown_from_checkbox(check_box_var: BooleanVar, dropdown_menus: List[DropDownMenu])[source]
Given a single checkbox, enable a bunch of dropdowns if the checkbox is ticked, and disable the dropdowns if the checkbox is un-ticked.
- Parameters
check_box_var (BooleanVar) – The checkbox associated tkinter BooleanVar.
dropdown_menus (List[DropDownMenu]) – List of dropdowns whose status is controlled by the
check_box_var
.
- enable_entrybox_from_checkbox(check_box_var: BooleanVar, entry_boxes: List[Entry_Box], reverse: bool = False)[source]
Given a single checkbox, enable or disable a bunch of entry-boxes based on the status of the checkbox.
- Parameters
check_box_var (BooleanVar) – The checkbox associated tkinter BooleanVar.
entry_boxes (List[Entry_Box]) – List of entry-boxes whose status is controlled by the
check_box_var
.
reverse (bool) – If False, the entry-boxes are enabled when the checkbox is ticked. Else, the entry-boxes are enabled when the checkbox is unticked. Default: False.
- frame_children(frame: Frame) int [source]
Find the number of children (e.g., labelframes) that currently exist within a specified frame. Similar to
children_cnt_main
, but accepts a specific frame rather than hardcoding the main frame.
- place_frm_at_top_right(frm: Toplevel)[source]
Place a TopLevel tkinter pop-up at the top right of the monitor. Note: call before putting scrollbars or converting to Canvas.
- remove_from_listbox(list_box: Listbox)[source]
Remove the current selection from a listbox.
- Parameters
list_box (Listbox) – The listbox that the current selection should be removed from.
- update_file_select_box_from_dropdown(filename: str, fileselectbox: FileSelect)[source]
Updates the text inside a tkinter FileSelect entrybox with a new string.
Pose importing methods
- class simba.mixins.pose_importer_mixin.PoseImporterMixin[source]
Bases:
object
Methods for importing pose-estimation data.
- link_video_paths_to_data_paths(data_paths: List[str], video_paths: List[str], str_splits: Optional[List[str]] = None, filename_cleaning_func: object = None) dict [source]
Given a list of paths to video files and a separate list of paths to data files, create a dictionary pairing each video file to a datafile based on the file names of the video and data file.
- Parameters
data_paths (List[str]) – List of full paths to data files, e.g., CSV or H5 files.
video_paths (List[str]) – List of full paths to video files, e.g., MP4 or AVI files.
str_splits (Optional[List[str]]) – Optional list of substrings that the data_paths would need to be split at in order to find a matching video name. E.g., [‘dlc_resnet50’].
filename_cleaning_func (Optional[object]) – Optional filename cleaning function that the data_paths filenames would have to pass through in order to find a matching video name. E.g.,
simba.utils.read_write.clean_sleap_filename(filepath)
.
- Returns dict
Dictionary with the data/file name as keys, and the video and data paths as values.
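A minimal sketch of the name-matching idea (the helper name and return layout are illustrative assumptions, not the SimBA implementation):

```python
import os

def pair_videos_to_data(data_paths, video_paths, str_splits=None):
    # Index videos by filename stem, then match each data file to a video
    # after optionally stripping tool-specific substrings (e.g., 'dlc_resnet50').
    video_names = {os.path.splitext(os.path.basename(p))[0]: p for p in video_paths}
    pairs = {}
    for data_path in data_paths:
        name = os.path.splitext(os.path.basename(data_path))[0]
        candidates = [name]
        for s in (str_splits or []):
            if s in name:
                candidates.append(name.split(s)[0].rstrip('_'))
        for candidate in candidates:
            if candidate in video_names:
                pairs[candidate] = {'DATA': data_path, 'VIDEO': video_names[candidate]}
                break
    return pairs

pairs = pair_videos_to_data(data_paths=['Video_1_dlc_resnet50.csv'],
                            video_paths=['/videos/Video_1.mp4'],
                            str_splits=['dlc_resnet50'])
```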
Modelling methods
- class simba.mixins.train_model_mixin.TrainModelMixin[source]
Bases:
object
Train model methods
- bout_train_test_splitter(x_df: DataFrame, y_df: Series, test_size: float) Tuple[DataFrame, DataFrame, Series, Series] [source]
Helper to split train and test based on annotated bouts.
- Parameters
x_df (pd.DataFrame) – Features
y_df (pd.Series) – Target
test_size (float) – Size of test as ratio of all annotated bouts (e.g.,
0.2
).
- Return np.ndarray x_train
Features for training
- Return np.ndarray x_test
Features for testing
- Return np.ndarray y_train
Target for training
- Return np.ndarray y_test
Target for testing
- Examples
>>> x = pd.DataFrame(data=[[11, 23, 12], [87, 65, 76], [23, 73, 27], [10, 29, 2], [12, 32, 42], [32, 73, 2], [21, 83, 98], [98, 1, 1]])
>>> y = pd.Series([0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1])
>>> x_train, x_test, y_train, y_test = TrainModelMixin().bout_train_test_splitter(x_df=x, y_df=y, test_size=0.5)
- calc_learning_curve(x_y_df: DataFrame, clf_name: str, shuffle_splits: int, dataset_splits: int, tt_size: float, rf_clf: RandomForestClassifier, save_dir: str, save_file_no: Optional[int] = None, multiclass: bool = False) None [source]
Helper to compute random forest learning curves with cross-validation.
- Parameters
x_y_df (pd.DataFrame) – Dataframe holding features and target.
clf_name (str) – Name of the classifier
shuffle_splits (int) – Number of cross-validation datasets at each data split.
dataset_splits (int) – Number of data splits.
tt_size (float) – test size
rf_clf (RandomForestClassifier) – sklearn RandomForestClassifier object
save_dir (str) – Directory where to save output in csv file format.
save_file_no (Optional[int]) – If integer, represents the count of the classifier within a grid search. If none, the classifier is not part of a grid search.
multiclass (bool) – If True, then the target consists of several categories [0, 1, 2 …] and scoring becomes
None
. If False, then scoring is
f1
.
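The cross-validated learning-curve computation can be reproduced directly with scikit-learn; this standalone sketch uses toy data in place of a SimBA feature dataframe:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import ShuffleSplit, learning_curve

# Toy data standing in for a SimBA features/target dataframe.
rng = np.random.RandomState(0)
x, y = rng.rand(120, 4), rng.randint(0, 2, 120)

# shuffle_splits -> n_splits, tt_size -> test_size, dataset_splits -> train_sizes.
cv = ShuffleSplit(n_splits=3, test_size=0.2, random_state=0)
train_sizes, _, test_scores = learning_curve(
    RandomForestClassifier(n_estimators=10, random_state=0),
    x, y, cv=cv, scoring='f1', train_sizes=np.linspace(0.25, 1.0, 4))
mean_test_scores = test_scores.mean(axis=1)  # one mean F1 score per data split
```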
- calc_permutation_importance(x_test: ndarray, y_test: ndarray, clf: RandomForestClassifier, feature_names: List[str], clf_name: str, save_dir: Union[str, PathLike], save_file_no: Optional[int] = None) None [source]
Computes feature permutation importance scores.
- Parameters
x_test (np.ndarray) – 2d feature test data of shape len(frames) x len(features)
y_test (np.ndarray) – 2d feature target test data of shape len(frames) x 1
clf (RandomForestClassifier) – random forest classifier
feature_names (List[str]) – Names of features in x_test
clf_name (str) – Name of classifier in y_test
save_dir (str) – Directory where to save results in CSV format
save_file_no (Optional[int]) – If permutation importance calculation is part of a grid search, provide an integer identifier representing the model in the grid search sequence. Will be used as a suffix in the output filename.
- calc_pr_curve(rf_clf: RandomForestClassifier, x_df: DataFrame, y_df: DataFrame, clf_name: str, save_dir: str, multiclass: bool = False, classifier_map: Optional[Dict[int, str]] = None, save_file_no: Optional[int] = None) None [source]
Helper to compute random forest precision-recall curve.
- Parameters
rf_clf (RandomForestClassifier) – sklearn RandomForestClassifier object.
x_df (pd.DataFrame) – Pandas dataframe holding test features.
y_df (pd.DataFrame) – Pandas dataframe holding test target.
clf_name (str) – Classifier name.
save_dir (str) – Directory where to save output in csv file format.
multiclass (bool) – If the classifier is a multi-classifier. Default: False.
classifier_map (Dict[int, str]) – If multiclass, dictionary mapping integers to classifier names.
save_file_no (Optional[int]) – If integer, represents the count of the classifier within a grid search. If none, the classifier is not part of a grid search.
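The underlying computation follows scikit-learn's precision_recall_curve on the behavior-present probability column; a minimal sketch with toy data:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_recall_curve

# Toy stand-in for SimBA test features/target.
rng = np.random.RandomState(0)
x, y = rng.rand(100, 4), rng.randint(0, 2, 100)
clf = RandomForestClassifier(n_estimators=10, random_state=0).fit(x, y)

# Probability of the behavior-present class (column 1 of predict_proba).
probs = clf.predict_proba(x)[:, 1]
precision, recall, thresholds = precision_recall_curve(y, probs)
```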
- check_df_dataset_integrity(df: DataFrame, file_name: str, logs_path: Union[str, PathLike]) None [source]
Helper to check for np.inf, -np.inf, NaN, and None entries in a single dataframe.
- Parameters
x_df (pd.DataFrame) – Features
- Raises
NoDataError – If data contains np.inf, -np.inf, NaN, or None.
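The check can be sketched in pandas (illustrative; the real method also logs the offending columns to the project logs directory):

```python
import numpy as np
import pandas as pd

def invalid_columns(df: pd.DataFrame) -> list:
    # Treat inf/-inf as missing, then flag any column containing NaN/None.
    bad = df.replace([np.inf, -np.inf], np.nan).isna().any()
    return list(bad[bad].index)

df = pd.DataFrame({'ok': [1.0, 2.0], 'has_nan': [1.0, np.nan], 'has_inf': [np.inf, 2.0]})
```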
- check_raw_dataset_integrity(df: DataFrame, logs_path: Optional[Union[str, PathLike]]) None [source]
Helper to check column-wise NaNs in raw input data for fitting model.
- Parameters
df (pd.DataFrame) – Raw input data.
logs_path (str) – The logs directory of the SimBA project
- Raises
FaultyTrainingSetError – When the dataset contains NaNs
- check_sampled_dataset_integrity(x_df: DataFrame, y_df: DataFrame) None [source]
Helper to check for non-numerical entries post data sampling
- Parameters
x_df (pd.DataFrame) – Features
y_df (pd.DataFrame) – Target
- Raises
FaultyTrainingSetError – Training or testing data sets contain non-numerical values
- clf_fit(clf: RandomForestClassifier, x_df: DataFrame, y_df: DataFrame) RandomForestClassifier [source]
Helper to fit clf model
- Parameters
clf (RandomForestClassifier) – Un-fitted random forest classifier object
x_df (pd.DataFrame) – Pandas dataframe with features.
y_df (pd.DataFrame) – Pandas dataframe/Series with target
- Return RandomForestClassifier
Fitted random forest classifier object
- clf_predict_proba(clf: RandomForestClassifier, x_df: DataFrame, multiclass: bool = False, model_name: Optional[str] = None, data_path: Optional[Union[str, PathLike]] = None) ndarray [source]
- Parameters
- Return np.ndarray
2D array with frame represented by rows and present/absent probabilities as columns
- Raises
FeatureNumberMismatchError – If shape of x_df and clf.n_features_ or n_features_in_ show mismatch
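The shape guard plus probability computation can be sketched as follows (a plain ValueError standing in for SimBA's FeatureNumberMismatchError):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Toy features/target standing in for SimBA data.
rng = np.random.RandomState(0)
x_df = pd.DataFrame(rng.rand(50, 4), columns=[f'f{i}' for i in range(4)])
clf = RandomForestClassifier(n_estimators=10, random_state=0).fit(x_df, rng.randint(0, 2, 50))

# Guard against feature-count mismatch before predicting.
if x_df.shape[1] != clf.n_features_in_:
    raise ValueError(f'Expected {clf.n_features_in_} features, got {x_df.shape[1]}')
probs = clf.predict_proba(x_df)  # rows = frames, columns = absent/present probabilities
```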
- create_clf_report(rf_clf: RandomForestClassifier, x_df: DataFrame, y_df: DataFrame, class_names: List[str], save_dir: str, clf_name: Optional[str] = None, save_file_no: Optional[int] = None) None [source]
Helper to create classifier truth table report.
See also
- Parameters
rf_clf (RandomForestClassifier) – sklearn RandomForestClassifier object.
x_df (pd.DataFrame) – dataframe holding test features
y_df (pd.DataFrame) – dataframe holding test target
class_names (List[str]) – List of classes. E.g., [‘Attack absent’, ‘Attack present’]
clf_name (Optional[str]) – Name of the classifier. If not None, then used in the output file name.
save_dir (str) – Directory where to save output in csv file format.
save_file_no (Optional[int]) – If integer, represents the count of the classifier within a grid search. If none, the classifier is not part of a grid search.
- create_example_dt(rf_clf: RandomForestClassifier, clf_name: str, feature_names: List[str], class_names: List[str], save_dir: str, tree_id: Optional[int] = 3, save_file_no: Optional[int] = None) None [source]
Helper to produce visualization of random forest decision tree using graphviz.
Note
- Parameters
rf_clf (RandomForestClassifier) – sklearn RandomForestClassifier object.
clf_name (str) – Classifier name.
feature_names (List[str]) – List of feature names.
class_names (List[str]) – List of classes. E.g., [‘Attack absent’, ‘Attack present’]
save_dir (str) – Directory where to save output in csv file format.
save_file_no (Optional[int]) – If integer, represents the count of the classifier within a grid search. If none, the classifier is not part of a grid search.
- create_meta_data_csv_training_one_model(meta_data_lst: list, clf_name: str, save_dir: Union[str, PathLike]) None [source]
Helper to save single model meta data (hyperparameters, sampling settings etc.) from list format into SimBA compatible CSV config file.
- create_shap_log(ini_file_path: str, rf_clf: RandomForestClassifier, x_df: DataFrame, y_df: Series, x_names: List[str], clf_name: str, cnt_present: int, cnt_absent: int, save_it: int = 100, save_path: Optional[Union[str, PathLike]] = None, save_file_no: Optional[int] = None) Union[None, Tuple[DataFrame]] [source]
Compute SHAP values for a random forest classifier.
This method computes SHAP (SHapley Additive exPlanations) values for a given random forest classifier.
See also
Note
For improved run-times, use multiprocessing through
simba.mixins.train_model_mixins.TrainModelMixin.create_shap_log_mp()
Uses TreeSHAP. The SHAP value for feature ‘i’ in the context of a prediction ‘f’ and input ‘x’ is the weighted average of the marginal contribution of feature ‘i’ over all feature subsets.
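The Shapley weighting this refers to, reconstructed here from the standard SHAP definition (an assumption, not taken from the SimBA source):

```latex
\phi_i(f, x) = \sum_{S \subseteq F \setminus \{i\}}
  \frac{|S|! \, (|F| - |S| - 1)!}{|F|!}
  \left[ f_{S \cup \{i\}}(x_{S \cup \{i\}}) - f_S(x_S) \right]
```

where F is the set of all features and S ranges over feature subsets excluding feature i; TreeSHAP computes this exactly for tree ensembles in polynomial time.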
- Parameters
ini_file_path (str) – Path to the SimBA project_config.ini
rf_clf (RandomForestClassifier) – sklearn random forest classifier
x_df (pd.DataFrame) – Test features.
y_df (pd.DataFrame) – Test target.
x_names (List[str]) – Feature names.
clf_name (str) – Classifier name.
cnt_present (int) – Number of behavior-present frames to calculate SHAP values for.
cnt_absent (int) – Number of behavior-absent frames to calculate SHAP values for.
save_path (str) – Directory where to save output in csv file format.
save_file_no (Optional[int]) – If integer, represents the count of the classifier within a grid search. If none, the classifier is not part of a grid search.
- create_shap_log_mp(ini_file_path: str, rf_clf: RandomForestClassifier, x_df: DataFrame, y_df: DataFrame, x_names: List[str], clf_name: str, cnt_present: int, cnt_absent: int, batch_size: int = 10, save_path: Optional[Union[str, PathLike]] = None, save_file_no: Optional[int] = None) Union[None, Tuple[DataFrame]] [source]
Helper to compute SHAP values using multiprocessing. For a single-core alternative, see simba.mixins.train_model_mixins.TrainModelMixin.create_shap_log().
- Parameters
ini_file_path (str) – Path to the SimBA project_config.ini
rf_clf (RandomForestClassifier) – sklearn random forest classifier
x_df (pd.DataFrame) – Test features.
y_df (pd.DataFrame) – Test target.
x_names (List[str]) – Feature names.
clf_name (str) – Classifier name.
cnt_present (int) – Number of behavior-present frames to calculate SHAP values for.
cnt_absent (int) – Number of behavior-absent frames to calculate SHAP values for.
save_dir (Optional[str, os.PathLike]) – Optional directory where to save output in csv file format. If None, then returns the dataframes.
save_file_no (Optional[int]) – If integer, represents the count of the classifier within a grid search. If none, the classifier is not part of a grid search.
- create_x_importance_bar_chart(rf_clf: RandomForestClassifier, x_names: list, clf_name: str, save_dir: str, n_bars: int, palette: Optional[str] = 'hot', save_file_no: Optional[int] = None) None [source]
Helper to create a bar chart displaying the top N gini or entropy feature importance scores.
See also
- Parameters
rf_clf (RandomForestClassifier) – sklearn RandomForestClassifier object.
x_names (List[str]) – Names of features.
clf_name (str) – Name of classifier.
save_dir (str) – Directory where to save output in csv file format.
n_bars (int) – Number of bars in the plot.
save_file_no (Optional[int]) – If integer, represents the count of the classifier within a grid search. If none, the classifier is not part of a grid search
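Extracting and ranking the importances reduces to a few lines of pandas; a sketch with toy data (plotting omitted):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Toy features/target standing in for SimBA data.
rng = np.random.RandomState(0)
x = pd.DataFrame(rng.rand(80, 5), columns=[f'feat_{i}' for i in range(5)])
clf = RandomForestClassifier(n_estimators=25, random_state=0).fit(x, rng.randint(0, 2, 80))

# Rank gini importances and keep the top n_bars for the bar chart.
n_bars = 3
importances = pd.Series(clf.feature_importances_, index=x.columns)
top = importances.sort_values(ascending=False).head(n_bars)
```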
- create_x_importance_log(rf_clf: RandomForestClassifier, x_names: List[str], clf_name: str, save_dir: str, save_file_no: Optional[int] = None) None [source]
Helper to save gini or entropy based feature importance scores.
Note
- Parameters
rf_clf (RandomForestClassifier) – sklearn RandomForestClassifier object.
x_names (List[str]) – Names of features.
clf_name (str) – Name of classifier
save_dir (str) – Directory where to save output in csv file format.
save_file_no (Optional[int]) – If integer, represents the count of the classifier within a grid search. If none, the classifier is not part of a grid search.
- static define_scaler(scaler_name: typing_extensions.Literal['MIN-MAX', 'STANDARD', 'QUANTILE']) Union[MinMaxScaler, StandardScaler, QuantileTransformer] [source]
Defines a sklearn scaler object. See
UMLOptions.SCALER_OPTIONS.value
for accepted scalers.
- Example
>>> TrainModelMixin.define_scaler(scaler_name='MIN-MAX')
- delete_other_annotation_columns(df: DataFrame, annotations_lst: List[str], raise_error: bool = True) DataFrame [source]
Helper to drop fields that contain annotations which are not the target.
- Parameters
df (pd.DataFrame) – Dataframe holding features and annotations.
annotations_lst (List[str]) – column fields to be removed from df
raise_error (bool) – If True, throw an error if an annotation column doesn’t exist. Else, skip. Default: True.
- Return pd.DataFrame
Dataframe without non-target annotation columns
- Examples
>>> self.delete_other_annotation_columns(df=df, annotations_lst=['Sniffing'])
- dviz_classification_visualization(x_train: ndarray, y_train: ndarray, clf_name: str, class_names: List[str], save_dir: str) None [source]
Helper to create visualization of example decision tree using dtreeviz.
- find_highly_correlated_fields(data: ndarray, field_names: List[str], threshold: float) List[str] [source]
Find highly correlated fields in a dataset.
Calculates the absolute correlation coefficients between columns in a given dataset and identifies pairs of columns that have a correlation coefficient greater than the specified threshold. For every pair of correlated features identified, the function returns the field name of one feature. These field names can later be dropped from the input data to reduce memory requirements and collinearity.
- Parameters
data (np.ndarray) – Two dimensional numpy array with features represented as columns and frames represented as rows.
threshold (float) – Threshold value for significant collinearity.
field_names (List[str]) – List mapping the column names in data to a field name. Use types.ListType(types.unicode_type) to take advantage of JIT compilation
- Return List[str]
Unique field names that correlates with at least one other field above the threshold value.
- Example
>>> data = np.random.randint(0, 1000, (1000, 5000)).astype(np.float32)
>>> field_names = []
>>> for i in range(data.shape[1]): field_names.append(f'Feature_{i+1}')
>>> highly_correlated_fields = TrainModelMixin().find_highly_correlated_fields(data=data, field_names=typed.List(field_names), threshold=0.10)
- static find_low_variance_fields(data: DataFrame, variance_threshold: float) List[str] [source]
Finds fields with variance below provided threshold.
- Parameters
data (pd.DataFrame) – Dataframe with continuous numerical features.
variance_threshold (float) – Variance threshold (0.0-1.0).
- Return List[str]
Names of fields with variance below the threshold.
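A pandas sketch of the idea (the real method may instead use sklearn's VarianceThreshold; this is an assumption):

```python
import pandas as pd

def low_variance_fields(data: pd.DataFrame, variance_threshold: float) -> list:
    # Return names of columns whose variance falls below the threshold.
    variances = data.var()
    return list(variances[variances < variance_threshold].index)

df = pd.DataFrame({'constant': [1.0, 1.0, 1.0, 1.0], 'varying': [0.0, 5.0, 10.0, 15.0]})
```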
- get_all_clf_names(config: ConfigParser, target_cnt: int) List[str] [source]
Helper to get all classifier names in a SimBA project.
- Parameters
config (configparser.ConfigParser) – Parsed SimBA project_config.ini
target_cnt (int) – Number of classifiers in the SimBA project.
- Return List[str]
All classifier names in project
- Example
>>> self.get_all_clf_names(config=config, target_cnt=2)
>>> ['Attack', 'Sniffing']
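The lookup can be sketched with configparser; the section and option names below follow the SimBA project_config.ini convention and are assumptions:

```python
from configparser import ConfigParser

# Minimal stand-in for a SimBA project_config.ini.
config = ConfigParser()
config.read_string("""
[SML settings]
target_name_1 = Attack
target_name_2 = Sniffing
""")

target_cnt = 2
clf_names = [config.get('SML settings', f'target_name_{i + 1}').strip() for i in range(target_cnt)]
```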
- get_model_info(config: ConfigParser, model_cnt: int) Dict[int, Any] [source]
Helper to read in N SimBA random forest config meta files to python dict memory.
- Parameters
config (configparser.ConfigParser) – Parsed SimBA project_config.ini
model_cnt (int) – Count of models
- Return dict
Dictionary with integers as keys and hyperparameter dictionaries as values.
- insert_column_headers_for_outlier_correction(data_df: DataFrame, new_headers: List[str], filepath: Union[str, PathLike]) DataFrame [source]
Helper to insert new column headers onto a dataframe following outlier correction.
- partial_dependence_calculator(clf: RandomForestClassifier, x_df: DataFrame, clf_name: str, save_dir: Union[str, PathLike], clf_cnt: Optional[int] = None) None [source]
Compute feature partial dependencies for every feature in training set.
- Parameters
clf (RandomForestClassifier) – Random forest classifier
x_df (pd.DataFrame) – Features training set
clf_name (str) – Name of classifier
save_dir (str) – Directory where to save the data
clf_cnt (Optional[int]) – If integer, represents the count of the classifier within a grid search. If none, the classifier is not part of a grid search.
- print_machine_model_information(model_dict: dict) None [source]
Helper to print model information in tabular form.
- Parameters
model_dict (dict) – dictionary holding model meta data in SimBA meta-config format.
- random_multiclass_bout_sampler(x_df: DataFrame, y_df: DataFrame, target_field: str, target_var: int, sampling_ratio: Union[float, Dict[int, float]], raise_error: bool = False) DataFrame [source]
Randomly sample multiclass behavioral bouts.
This function performs random sampling on a multiclass dataset to balance the class distribution. From each class, the function selects a number of bouts computed as a ratio of the bout count in a user-specified baseline class. All bout observations in the user-specified baseline class are selected.
- Parameters
x_df (pd.DataFrame) – A dataframe holding features.
y_df (pd.DataFrame) – A dataframe holding target.
target_field (str) – The name of the target column.
target_var (int) – The variable in the target that should serve as baseline. E.g.,
0
if
0
represents no behavior.
sampling_ratio (Union[float, dict]) – The ratio of non-target_var bout observations to sample relative to the target_var bout count. E.g., if float
1.0
and there are 10 bouts of target_var observations in the dataset, then 10 bouts of each non-target_var observation will be sampled. If different under-sampling ratios for different class variables are needed, use a dict with the class variable name as key and the ratio relative to target_var as the value.
raise_error (bool) – If True, raises an error if there are not enough observations of the non-target_var fulfilling the sampling_ratio. Else, takes all available observations even if not enough to reach the criterion.
- Raises
SamplingError – If any of the following conditions are met:
- No bouts of the target class are detected in the data.
- The target variable is present in the sampling ratio dictionary.
- The sampling ratio dictionary contains non-integer keys or non-float values less than 0.0.
- The variable specified in the sampling ratio is not present in the DataFrame.
- The sampling ratio results in a sample size of zero or less.
- The requested sample size exceeds the available data and raise_error is True.
- Return (pd.DataFrame, pd.DataFrame)
resampled features, and resampled associated target.
- Examples
>>> df = pd.read_csv('/Users/simon/Desktop/envs/troubleshooting/multilabel/project_folder/csv/targets_inserted/01.YC015YC016phase45-sample_sampler.csv', index_col=0)
>>> undersampled_df = TrainModelMixin().random_multiclass_bout_sampler(data=df, target_field='syllable_class', target_var=0, sampling_ratio={1: 1.0, 2: 1, 3: 1}, raise_error=True)
- random_multiclass_frm_sampler(x_df: DataFrame, y_df: DataFrame, target_field: str, target_var: int, sampling_ratio: Union[float, Dict[int, float]], raise_error: bool = False)[source]
Random multiclass undersampler.
This function performs random under-sampling on a multiclass dataset to balance the class distribution. From each class, the function selects a number of frames computed as a ratio relative to a user-specified class variable.
All the observations in the user-specified class are selected.
- Parameters
x_df (pd.DataFrame) – A dataframe holding features.
y_df (pd.DataFrame) – A dataframe holding target.
target_field (str) – The name of the target column.
target_var (int) – The variable in the target that should serve as baseline. E.g.,
0
if
0
represents no behavior.
sampling_ratio (Union[float, dict]) – The ratio of non-target_var observations to sample relative to the target_var count. E.g., if float
1.0
and there are 10 target_var observations in the dataset, then 10 of each non-target_var observation will be sampled. If different under-sampling ratios for different class variables are needed, use a dict with the class variable name as key and the ratio relative to target_var as the value.
raise_error (bool) – If True, raises an error if there are not enough observations of the non-target_var fulfilling the sampling_ratio. Else, takes all available observations even if not enough to reach the criterion.
- Return (pd.DataFrame, pd.DataFrame)
resampled features, and resampled associated target.
- Examples
>>> df = pd.read_csv('/Users/simon/Desktop/envs/troubleshooting/multilabel/project_folder/csv/targets_inserted/01.YC015YC016phase45-sample_sampler.csv', index_col=0)
>>> TrainModelMixin().random_multiclass_frm_sampler(data_df=df, target_field='syllable_class', target_var=0, sampling_ratio=0.20)
>>> TrainModelMixin().random_multiclass_frm_sampler(data_df=df, target_field='syllable_class', target_var=0, sampling_ratio={1: 0.1, 2: 0.2, 3: 0.3})
- random_undersampler(x_train: ndarray, y_train: ndarray, sample_ratio: float) Tuple[DataFrame, DataFrame] [source]
Helper to perform random under-sampling of behavior-absent frames in a dataframe.
- Parameters
x_train (np.ndarray) – Features in train set
y_train (np.ndarray) – Target in train set
sample_ratio (float) – Ratio of behavior-absent frames to keep relative to the behavior-present frames. E.g.,
1.0
returns an equal count of behavior-absent and behavior-present frames.
2.0
returns twice as many behavior-absent frames as behavior-present frames.
- Return pd.DataFrame
Under-sampled feature-set
- Return pd.DataFrame
Under-sampled target-set
- Examples
>>> self.random_undersampler(x_train=x_train, y_train=y_train, sample_ratio=1.0)
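The sampling logic can be sketched in pandas (the function name is illustrative; the real helper may differ in shuffling and tie-breaking):

```python
import numpy as np
import pandas as pd

def random_undersample(x_train: pd.DataFrame, y_train: pd.Series, sample_ratio: float):
    # Keep every behavior-present frame; sample behavior-absent frames at
    # sample_ratio times the behavior-present count.
    present_idx = y_train[y_train == 1].index
    absent_idx = y_train[y_train == 0].index
    n_absent = int(len(present_idx) * sample_ratio)
    sampled = np.random.RandomState(0).choice(absent_idx, size=n_absent, replace=False)
    keep = present_idx.union(pd.Index(sampled))
    return x_train.loc[keep], y_train.loc[keep]

x = pd.DataFrame({'f': range(10)})
y = pd.Series([1, 1, 0, 0, 0, 0, 0, 0, 0, 0])
x_bal, y_bal = random_undersample(x, y, sample_ratio=1.0)
```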
- read_all_files_in_folder(file_paths: List[str], file_type: str, classifier_names: Optional[List[str]] = None, raise_bool_clf_error: bool = True) Tuple[DataFrame, List[int]] [source]
Read in all data files in a folder to a single pd.DataFrame for downstream ML algo. Asserts that all classifiers have annotation fields present in concatenated dataframe.
Note
For improved runtime through pyarrow, use
simba.mixins.train_model_mixin.read_all_files_in_folder_mp()
- Parameters
file_paths (List[str]) – List of file paths representing files to be read in.
file_type (str) – The file type of the files to be read in, e.g., csv or parquet.
classifier_names (str or None) – List of classifier names representing fields of human annotations. If not None, then assert that classifier names are present in each data file.
- Return pd.DataFrame
Concatenated dataframe of all data represented in
file_paths
.- Return List[int]
The frame numbers (index) of the sampled data.
- Examples
>>> self.read_all_files_in_folder(file_paths=['targets_inserted/Video_1.csv', 'targets_inserted/Video_2.csv'], file_type='csv', classifier_names=['Attack'])
- static read_all_files_in_folder_mp(file_paths: List[str], file_type: Literal['csv', 'parquet', 'pickle'], classifier_names: Optional[List[str]] = None, raise_bool_clf_error: bool = True) Tuple[DataFrame, List[int]] [source]
Multiprocessing helper function to read in all data files in a folder to a single pd.DataFrame for downstream ML. Defaults to ceil(CPU COUNT / 2) cores. Asserts that all classifiers have annotation fields present in each dataframe.
Note
If multiprocess failure, reverts to
simba.mixins.train_model_mixin.read_all_files_in_folder()
- Parameters
- Return pd.DataFrame
Concatenated dataframe of all data in
file_paths
- Return List[int]
List of frame indexes of all concatenated files.
- read_all_files_in_folder_mp_futures(annotations_file_paths: List[str], file_type: Literal['csv', 'parquet', 'pickle'], classifier_names: Optional[List[str]] = None, raise_bool_clf_error: bool = True) Tuple[DataFrame, List[int]] [source]
Multiprocessing helper function to read in all data files in a folder to a single pd.DataFrame for downstream ML through
concurrent.Futures
. Asserts that all classifiers have annotation fields present in each dataframe.Note
A
concurrent.Futures
alternative to
simba.mixins.train_model_mixin.read_all_files_in_folder_mp()
, which uses
multiprocessing.ProcessPoolExecutor
and has been reported unstable on Linux machines.
If multiprocess failure, reverts to
simba.mixins.train_model_mixin.read_all_files_in_folder()
- Parameters
file_paths (List[str]) – List of file-paths
file_type – The filetype of
file_paths
. OPTIONS: csv or parquet.
classifier_names (Optional[List[str]]) – List of classifier names representing fields of human annotations. If not None, then assert that classifier names are present in each data file.
raise_bool_clf_error (bool) – If True, raises an error if a classifier column contains values outside 0 and 1.
- Return pd.DataFrame
Concatenated dataframe of all data in
file_paths
.
- read_in_all_model_names_to_remove(config: ConfigParser, model_cnt: int, clf_name: str) List[str] [source]
Helper to find all field names that are annotations but are not the target.
- Parameters
config (configparser.ConfigParser) – Configparser object holding data from the project_config.ini
model_cnt (int) – Number of classifiers in the SimBA project
clf_name (str) – Name of the classifier.
- Return List[str]
List of non-target annotation column names.
- Examples
>>> self.read_in_all_model_names_to_remove(config=config, model_cnt=2, clf_name=['Attack'])
- read_pickle(file_path: Union[str, PathLike]) object [source]
Read pickle file
- Parameters
file_path (str) – Path to pickle file on disk.
- Return object
The unpickled object.
- save_rf_model(rf_clf: RandomForestClassifier, clf_name: str, save_dir: Union[str, PathLike], save_file_no: Optional[int] = None) None [source]
Helper to save pickled classifier object to disk.
- Parameters
rf_clf (RandomForestClassifier) – sklearn random forest classifier
clf_name (str) – Classifier name
save_dir (str) – Directory where to save output in csv file format.
save_file_no (Optional[int]) – If integer, represents the count of the classifier within a grid search. If none, the classifier is not part of a grid search.
- static scaler_transform(data: DataFrame, scaler: Union[MinMaxScaler, StandardScaler, QuantileTransformer], name: Optional[str] = '') DataFrame [source]
Helper to transform a dataframe using a previously fitted scaler.
- Parameters
data (pd.DataFrame) – Data to transform.
scaler – fitted scaler.
- smote_oversampler(x_train: DataFrame, y_train: DataFrame, sample_ratio: float) Tuple[ndarray, ndarray] [source]
Helper to perform SMOTE oversampling of behavior-present annotations.
- Parameters
x_train (np.ndarray) – Features in train set
y_train (np.ndarray) – Target in train set
sample_ratio (float) – Over-sampling ratio
- Return np.ndarray
Oversampled features.
- Return np.ndarray
Oversampled target.
- Examples
>>> self.smote_oversampler(x_train=x_train, y_train=y_train, sample_ratio=1.0)
- smoteen_oversampler(x_train: ~pandas.core.frame.DataFrame, y_train: ~pandas.core.frame.DataFrame, sample_ratio: float) -> (<class 'numpy.ndarray'>, <class 'numpy.ndarray'>)[source]
Helper to perform SMOTE-ENN (SMOTE followed by Edited Nearest Neighbours cleaning) resampling of behavior-present annotations.
- Parameters
x_train (np.ndarray) – Features in train set
y_train (np.ndarray) – Target in train set
sample_ratio (float) – Over-sampling ratio
- Return np.ndarray
Oversampled features.
- Return np.ndarray
Oversampled target.
- Examples
>>> self.smoteen_oversampler(x_train=x_train, y_train=y_train, sample_ratio=1.0)
- static split_and_group_df(df: ~pandas.core.frame.DataFrame, splits: int, include_split_order: bool = True) -> (typing.List[pandas.core.frame.DataFrame], <class 'int'>)[source]
Helper to split a dataframe for multiprocessing. If include_split_order, then include the group number in split data as a column. Returns split data and approximations of number of observations per split.
- split_df_to_x_y(df: ~pandas.core.frame.DataFrame, clf_name: str) -> (<class 'pandas.core.frame.DataFrame'>, <class 'pandas.core.frame.DataFrame'>)[source]
Helper to split dataframe into features and target.
- Parameters
df (pd.DataFrame) – Dataframe holding features and annotations.
clf_name (str) – Name of target.
- Return pd.DataFrame
features
- Return pd.DataFrame
target
- Examples
>>> self.split_df_to_x_y(df=df, clf_name='Attack')
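The split itself is a one-liner in pandas. A minimal sketch of the feature/target separation described above (the actual SimBA method also validates the target column; this is an illustrative re-implementation, not SimBA's code):

```python
import pandas as pd

def split_df_to_x_y(df: pd.DataFrame, clf_name: str):
    # Target column becomes y; all remaining columns form the feature matrix x.
    y = df[clf_name]
    x = df.drop(columns=[clf_name])
    return x, y

df = pd.DataFrame({'feature_1': [0.1, 0.2], 'feature_2': [0.3, 0.4], 'Attack': [0, 1]})
x, y = split_df_to_x_y(df=df, clf_name='Attack')
```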
Time-series feature methods
- class simba.mixins.timeseries_features_mixin.TimeseriesFeatureMixin[source]
Bases:
object
Time-series methods focused on signal complexity in sliding windows, mainly in the time domain: FFT-based methods (e.g., through scipy) have so far not been fast enough for rolling windows over large datasets.
Note
Many methods have numba typed signatures to decrease compilation time through reduced type inference. Make sure to pass the correct dtypes as indicated by the signature decorators.
Important
See references for mature packages computing more extensive time-series measurements.
- static acceleration(data: ndarray, pixels_per_mm: float, fps: int, time_window: float = 1, unit: typing_extensions.Literal['mm', 'cm', 'dm', 'm'] = 'mm') ndarray [source]
Compute acceleration.
Computes acceleration from a sequence of body-part coordinates over time. It calculates the difference in velocity between consecutive frames and provides an array of accelerations.
The computation is based on the formula:
a_t = v_t - v_{t-w}, with v_t = ||x_t - x_{t-w}|| / c
where ||·|| calculates the Euclidean norm, the shift by w = fps × time_window moves the array by w frames, and c = pixels_per_mm is the conversion factor from pixels to millimeters.
Note
By default, acceleration is calculated as the change in velocity in millimeters/s. To change the denominator, modify the time_window argument. To change the numerator, modify the unit argument (accepted: mm, cm, dm, m).
- Parameters
data (np.ndarray) – 1D array of framewise euclidean distances.
pixels_per_mm (float) – Pixels per millimeter of the recorded video.
fps (int) – Frames per second (FPS) of the recorded video.
time_window (float) – Rolling time window in seconds. Default is 1.0 representing 1 second.
unit (Literal['mm', 'cm', 'dm', 'm']) – If acceleration should be presented as millimeter, centimeters, decimeter, or meter. Default millimeters.
- Returns
Array of accelerations corresponding to each frame.
- Example
>>> data = np.array([1, 2, 3, 4, 5, 5, 5, 5, 5, 6]).astype(np.float32)
>>> TimeseriesFeatureMixin().acceleration(data=data, pixels_per_mm=1.0, fps=2, time_window=1.0)
>>> [ 0., 0., 0., 0., -1., -1., 0., 0., 1., 1.]
- static benford_correlation(data: ndarray) float [source]
Jitted compute of the correlation between the Benford’s Law distribution and the first-digit distribution of given data.
Benford’s Law describes the expected distribution of leading (first) digits in many real-life datasets. This function calculates the correlation between the expected Benford’s Law distribution and the actual distribution of the first digits in the provided data.
Note
Adapted from tsfresh.
The returned correlation values are calculated using Pearson’s correlation coefficient.
- Parameters
data (np.ndarray) – The input 1D array containing the time series data.
- Return float
The correlation coefficient between the Benford’s Law distribution and the first-digit distribution in the input data. A higher correlation value suggests that the data follows the expected distribution more closely.
- Examples
>>> data = np.array([1, 8, 2, 10, 8, 6, 8, 1, 1, 1]).astype(np.float32)
>>> TimeseriesFeatureMixin().benford_correlation(data=data)
>>> 0.6797500374831786
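The computation can be sketched in plain NumPy. This is an illustrative re-implementation (assuming positive values ≥ 1), not SimBA's jitted code, but it reproduces the example above:

```python
import numpy as np

def benford_correlation(data: np.ndarray) -> float:
    # Expected Benford's Law probabilities for leading digits 1..9.
    expected = np.log10(1.0 + 1.0 / np.arange(1, 10))
    # Observed leading-digit frequencies (assumes values >= 1).
    first_digits = np.array([int(str(int(abs(v)))[0]) for v in data])
    observed = np.array([np.mean(first_digits == d) for d in range(1, 10)])
    # Pearson correlation between expected and observed distributions.
    return float(np.corrcoef(expected, observed)[0, 1])

data = np.array([1, 8, 2, 10, 8, 6, 8, 1, 1, 1], dtype=np.float64)
result = benford_correlation(data)  # ~0.6798, matching the example above
```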
- static crossings(data: ndarray, val: float) int [source]
Jitted compute of the number of times that sequential values in a time-series cross a defined value.
- Parameters
data (np.ndarray) – Time-series data.
val (float) – Cross value. E.g., to count the number of zero-crossings, pass 0.
- Return int
Count of events where sequential values cross val.
- Example
>>> data = np.array([3.9, 7.5, 4.2, 6.2, 7.5, 3.9, 6.2, 6.5, 7.2, 9.5]).astype(np.float32)
>>> TimeseriesFeatureMixin().crossings(data=data, val=7)
>>> 5
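In plain NumPy, the crossing count reduces to a sign-change test on data - val. A minimal sketch (assuming no sample equals val exactly; SimBA's jitted version may handle that edge differently):

```python
import numpy as np

def crossings(data: np.ndarray, val: float) -> int:
    # A crossing occurs where consecutive samples lie on opposite sides of `val`.
    sides = np.sign(data - val)
    return int(np.sum(sides[:-1] * sides[1:] < 0))

data = np.array([3.9, 7.5, 4.2, 6.2, 7.5, 3.9, 6.2, 6.5, 7.2, 9.5], dtype=np.float64)
result = crossings(data=data, val=7.0)  # 5, matching the example above
```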
- static dominant_frequencies(data: ndarray, fps: float, k: int, window_function: Optional[typing_extensions.Literal['Hann', 'Hamming', 'Blackman']] = None)[source]
Find the K dominant frequencies within a feature vector.
- static granger_tests(data: DataFrame, variables: List[str], lag: int, test: typing_extensions.Literal['ssr_ftest', 'ssr_chi2test', 'lrtest', 'params_ftest'] = 'ssr_chi2test') DataFrame [source]
Perform Granger causality tests between pairs of variables in a DataFrame.
This function computes Granger causality tests between pairs of variables in a DataFrame using the statsmodels library. The Granger causality test assesses whether one time series variable (predictor) can predict another time series variable (outcome). This test can help determine the presence of causal relationships between variables.
Note
Modified from Selva Prabhakaran.
- Example
>>> x = np.random.randint(0, 50, (100, 2))
>>> data = pd.DataFrame(x, columns=['r', 'k'])
>>> TimeseriesFeatureMixin.granger_tests(data=data, variables=['r', 'k'], lag=4, test='ssr_chi2test')
>>>         r       k
>>> r  1.0000  0.4312
>>> k  0.3102  1.0000
- static higuchi_fractal_dimension(data: ndarray, kmax: int = 10)[source]
Jitted compute of the Higuchi Fractal Dimension of a given time series data. The Higuchi Fractal Dimension provides a measure of the fractal complexity of a time series.
- Parameters
data (np.ndarray) – A 1-dimensional numpy array containing the time series data.
kmax (int) – The maximum value of k used in the calculation. Increasing kmax considers longer sequences of data, providing a more detailed analysis of fractal complexity. Default is 10.
- Returns float
The Higuchi Fractal Dimension of the input time series.
Note
Adapted from eeglib.
- Example
>>> t = np.linspace(0, 50, int(44100 * 2.0), endpoint=False)
>>> sine_wave = 1.0 * np.sin(2 * np.pi * 1.0 * t).astype(np.float32)
>>> sine_wave = (sine_wave - np.min(sine_wave)) / (np.max(sine_wave) - np.min(sine_wave))
>>> TimeseriesFeatureMixin().higuchi_fractal_dimension(data=sine_wave, kmax=10)
>>> 1.0001506805419922
>>> np.random.shuffle(sine_wave)
>>> TimeseriesFeatureMixin().higuchi_fractal_dimension(data=sine_wave, kmax=10)
>>> 1.9996402263641357
- static hjort_parameters(data: ~numpy.ndarray) -> (<class 'float'>, <class 'float'>, <class 'float'>)[source]
Jitted compute of Hjorth parameters for a given time series data. Hjorth parameters describe mobility, complexity, and activity of a time series.
- Parameters
data (numpy.ndarray) – A 1-dimensional numpy array containing the time series data.
- Returns
A tuple containing the following Hjorth parameters:
- activity (float): The activity of the time series, which is the variance of the input data.
- mobility (float): The mobility of the time series, calculated as the square root of the variance of the first derivative of the input data divided by the variance of the input data.
- complexity (float): The complexity of the time series, calculated as the square root of the variance of the second derivative of the input data divided by the variance of the first derivative, and then divided by the mobility.
- Example
>>> data = np.array([1.0, 2.0, 3.0, 4.0, 5.0], dtype=np.float32)
>>> TimeseriesFeatureMixin().hjort_parameters(data)
>>> (2.5, 0.5, 0.4082482904638631)
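The three definitions above can be sketched with population variance. This is a textbook sketch, not SimBA's jitted implementation, which may differ in normalization details:

```python
import numpy as np

def hjorth_parameters(data: np.ndarray):
    d1 = np.diff(data)            # first derivative
    d2 = np.diff(d1)              # second derivative
    activity = np.var(data)       # variance of the signal
    mobility = np.sqrt(np.var(d1) / activity)
    complexity = np.sqrt(np.var(d2) / np.var(d1)) / mobility
    return activity, mobility, complexity

activity, mobility, complexity = hjorth_parameters(np.array([1.0, 3.0, 2.0, 4.0]))
```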
- static line_length(data: ndarray) float [source]
Calculate the line length of a 1D array.
Line length is a measure of signal complexity and is computed by summing the absolute differences between consecutive elements of the input array. Used in EEG analysis and other signal processing applications to quantify variations in the signal.
- Parameters
data (numpy.ndarray) – The 1D array for which the line length is to be calculated.
- Return float
The line length of the input array, indicating its complexity.
LL = Σ_{i=1}^{N-1} |x[i] - x[i-1]|
where:
- LL is the line length.
- N is the number of elements in the input data array.
- x[i] represents the value of the data at index i.
- Example
>>> data = np.array([1, 4, 2, 3, 5, 6, 8, 7, 9, 10]).astype(np.float32)
>>> TimeseriesFeatureMixin().line_length(data=data)
>>> 12.0
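The textbook line-length formula translates directly to NumPy. A minimal sketch (SimBA's jitted version may differ in edge handling):

```python
import numpy as np

def line_length(data: np.ndarray) -> float:
    # Sum of absolute differences between consecutive samples.
    return float(np.sum(np.abs(np.diff(data))))

result = line_length(np.array([1.0, 4.0, 2.0, 3.0]))  # |3| + |-2| + |1| = 6.0
```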
- static local_maxima_minima(data: ndarray, maxima: bool = True) ndarray [source]
Jitted compute of the local maxima or minima, defined as values which are higher or lower than their immediately preceding and following time-series neighbors, respectively. Returns a 2D np.ndarray with columns representing idx and values of the local maxima or minima.
- Parameters
data (np.ndarray) – Time-series data.
maxima (bool) – If True, returns maxima. Else, minima.
- Return np.ndarray
2D np.ndarray with columns representing idx in input data in first column and values of local maxima in second column
- Example
>>> data = np.array([3.9, 7.5, 4.2, 6.2, 7.5, 3.9, 6.2, 6.5, 7.2, 9.5]).astype(np.float32)
>>> TimeseriesFeatureMixin().local_maxima_minima(data=data, maxima=True)
>>> [[1, 7.5], [4, 7.5], [9, 9.5]]
>>> TimeseriesFeatureMixin().local_maxima_minima(data=data, maxima=False)
>>> [[0, 3.9], [2, 4.2], [5, 3.9]]
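A vectorized sketch of the same logic, which, like the example above, compares endpoints only against their single existing neighbor (an illustrative re-implementation, not SimBA's jitted code):

```python
import numpy as np

def local_maxima_minima(data: np.ndarray, maxima: bool = True) -> np.ndarray:
    # Pad with -inf (maxima) or inf (minima) so endpoints are compared
    # only against their single existing neighbor.
    pad = -np.inf if maxima else np.inf
    padded = np.concatenate(([pad], data, [pad]))
    if maxima:
        hits = (data > padded[:-2]) & (data > padded[2:])
    else:
        hits = (data < padded[:-2]) & (data < padded[2:])
    idx = np.flatnonzero(hits)
    # First column: index in input data; second column: value at that index.
    return np.column_stack((idx, data[idx]))

data = np.array([3.9, 7.5, 4.2, 6.2, 7.5, 3.9, 6.2, 6.5, 7.2, 9.5], dtype=np.float64)
peaks = local_maxima_minima(data=data, maxima=True)     # rows at idx 1, 4, 9
valleys = local_maxima_minima(data=data, maxima=False)  # rows at idx 0, 2, 5
```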
- static longest_strike(data: ndarray, threshold: float, above: bool = True) int [source]
Jitted compute of the length of the longest consecutive sequence of values in the input data that either exceed or fall below a specified threshold.
- Parameters
data (np.ndarray) – The input 1D NumPy array containing the values to be analyzed.
threshold (float) – The threshold value used for the comparison.
above (bool) – If True, the function looks for strikes where values are above or equal to the threshold. If False, it looks for strikes where values are below or equal to the threshold.
- Return int
The length of the longest strike that satisfies the condition.
- Example
>>> data = np.array([1, 8, 2, 10, 8, 6, 8, 1, 1, 1]).astype(np.float32)
>>> TimeseriesFeatureMixin().longest_strike(data=data, threshold=7, above=True)
>>> 2
>>> TimeseriesFeatureMixin().longest_strike(data=data, threshold=7, above=False)
>>> 3
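The run-length logic can be sketched as a single pass over a boolean mask (an illustrative re-implementation; SimBA's version is numba-jitted but equivalent in spirit):

```python
import numpy as np

def longest_strike(data: np.ndarray, threshold: float, above: bool = True) -> int:
    # Length of the longest run of consecutive values satisfying the condition.
    mask = data >= threshold if above else data <= threshold
    longest = current = 0
    for hit in mask:
        current = current + 1 if hit else 0
        longest = max(longest, current)
    return longest

data = np.array([1, 8, 2, 10, 8, 6, 8, 1, 1, 1], dtype=np.float64)
above_strike = longest_strike(data=data, threshold=7, above=True)   # 2
below_strike = longest_strike(data=data, threshold=7, above=False)  # 3
```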
- static percent_beyond_n_std(data: ndarray, n: float) float [source]
Jitted compute of the ratio of values in time-series more than N standard deviations from the mean of the time-series.
- Parameters
data (np.ndarray) – 1D array representing time-series.
n (float) – Standard deviation cut-off.
- Returns float
Ratio of values in data that fall more than n standard deviations from the mean of data.
Note
Adapted from cesium.
Oddity: the mean calculation is incorrect if passing float32 data but correct if passing float64.
- Examples
>>> data = np.array([3.9, 7.5, 4.2, 6.2, 7.5, 3.9, 6.2, 6.5, 7.2, 9.5]).astype(np.float32)
>>> TimeseriesFeatureMixin().percent_beyond_n_std(data=data, n=1)
>>> 0.1
- static percent_in_percentile_window(data: ndarray, upper_pct: int, lower_pct: int)[source]
Jitted compute of the ratio of values in a time-series that fall between the upper and lower percentiles.
- Parameters
data (np.ndarray) – 1D array representing a time-series.
upper_pct (int) – The upper percentile value for the window (e.g., 70 for the 70th percentile).
lower_pct (int) – The lower percentile value for the window (e.g., 30 for the 30th percentile).
- Returns float
Ratio of values in data that fall within the upper_pct and lower_pct percentiles.
Note
Adapted from cesium.
- Example
>>> data = np.array([3.9, 7.5, 4.2, 6.2, 7.5, 3.9, 6.2, 6.5, 7.2, 9.5]).astype(np.float32)
>>> TimeseriesFeatureMixin().percent_in_percentile_window(data, upper_pct=70, lower_pct=30)
>>> 0.4
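With np.percentile the computation is a two-liner. An illustrative sketch (not SimBA's jitted code) that reproduces the example above:

```python
import numpy as np

def percent_in_percentile_window(data: np.ndarray, upper_pct: int, lower_pct: int) -> float:
    # Fraction of samples falling inside the [lower_pct, upper_pct] percentile band.
    upper, lower = np.percentile(data, [upper_pct, lower_pct])
    return float(np.mean((data >= lower) & (data <= upper)))

data = np.array([3.9, 7.5, 4.2, 6.2, 7.5, 3.9, 6.2, 6.5, 7.2, 9.5], dtype=np.float64)
result = percent_in_percentile_window(data, upper_pct=70, lower_pct=30)  # 0.4
```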
- static percentile_difference(data: ndarray, upper_pct: int, lower_pct: int) float [source]
Jitted compute of the difference between the upper and lower percentiles of the data, as a percentage of the median value. Helps in understanding the spread or variability of the data within the specified percentiles.
Note
Adapted from cesium.
- Parameters
data (np.ndarray) – 1D array representing a time-series.
upper_pct (int) – The upper percentile value (e.g., 95 for the 95th percentile).
lower_pct (int) – The lower percentile value (e.g., 5 for the 5th percentile).
- Returns float
The difference between the upper and lower percentiles of the data as a percentage of the median value.
- Examples
>>> data = np.array([3.9, 7.5, 4.2, 6.2, 7.5, 3.9, 6.2, 6.5, 7.2, 9.5]).astype(np.float32)
>>> TimeseriesFeatureMixin().percentile_difference(data=data, upper_pct=95, lower_pct=5)
>>> 0.7401574764125177
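The measure can be sketched directly from its description (an illustrative float64 re-implementation, not SimBA's jitted code; it agrees with the example above to within float32 rounding):

```python
import numpy as np

def percentile_difference(data: np.ndarray, upper_pct: int, lower_pct: int) -> float:
    # Spread between the two percentiles, expressed relative to the median.
    upper, lower = np.percentile(data, [upper_pct, lower_pct])
    return float(np.abs(upper - lower) / np.median(data))

data = np.array([3.9, 7.5, 4.2, 6.2, 7.5, 3.9, 6.2, 6.5, 7.2, 9.5], dtype=np.float64)
result = percentile_difference(data=data, upper_pct=95, lower_pct=5)  # ~0.7402
```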
- static permutation_entropy(data: ndarray, dimension: int, delay: int) float [source]
Calculate the permutation entropy of a time series.
Permutation entropy is a measure of the complexity of a time series data by quantifying the irregularity and unpredictability of its order patterns. It is computed based on the frequency of unique order patterns of a given dimension in the time series data.
The permutation entropy (PE) is calculated using the following formula:
PE = -Σ_i p_i · ln(p_i)
where:
- PE is the permutation entropy.
- p_i is the probability of each unique order pattern.
- Parameters
data (np.ndarray) – The 1D array containing the time series data.
dimension (int) – The embedding dimension (length of the order patterns).
delay (int) – The time delay between samples within each order pattern.
- Return float
The permutation entropy of the time series, indicating its complexity and predictability. A higher permutation entropy value indicates higher complexity and unpredictability in the time series.
- Example
>>> t = np.linspace(0, 50, int(44100 * 2.0), endpoint=False)
>>> sine_wave = 1.0 * np.sin(2 * np.pi * 1.0 * t).astype(np.float32)
>>> TimeseriesFeatureMixin().permutation_entropy(data=sine_wave, dimension=3, delay=1)
>>> 0.701970058666407
>>> np.random.shuffle(sine_wave)
>>> TimeseriesFeatureMixin().permutation_entropy(data=sine_wave, dimension=3, delay=1)
>>> 1.79172449934604
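The pattern-counting recipe can be sketched as follows, using natural-log entropy (consistent with the shuffled-data example above, which approaches ln(3!) ≈ 1.792). An illustrative re-implementation, not SimBA's code:

```python
import numpy as np
from collections import Counter

def permutation_entropy(data: np.ndarray, dimension: int, delay: int) -> float:
    # Count ordinal (argsort) patterns over all embedded windows.
    n = len(data) - (dimension - 1) * delay
    patterns = Counter(
        tuple(np.argsort(data[i:i + dimension * delay:delay])) for i in range(n)
    )
    probs = np.array(list(patterns.values()), dtype=np.float64) / n
    # Shannon entropy (natural log) of the pattern distribution.
    return float(-np.sum(probs * np.log(probs)))

# A monotone series has a single order pattern, hence zero entropy.
monotone = permutation_entropy(np.arange(10, dtype=np.float64), dimension=3, delay=1)
```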
- static petrosian_fractal_dimension(data: ndarray) float [source]
Calculate the Petrosian Fractal Dimension (PFD) of a given time series. The PFD is a measure of the irregularity or self-similarity of a time series: larger values indicate higher complexity, lower values indicate lower complexity.
Note
The PFD is computed based on the number of sign changes in the first derivative of the time series. If the input data is empty or no sign changes are found, the PFD is returned as -1.0. Adapted from eeglib.
- Parameters
data (np.ndarray) – A 1-dimensional numpy array containing the time series data.
- Returns float
The Petrosian Fractal Dimension of the input time series.
- Examples
>>> t = np.linspace(0, 50, int(44100 * 2.0), endpoint=False)
>>> sine_wave = 1.0 * np.sin(2 * np.pi * 1.0 * t).astype(np.float32)
>>> TimeseriesFeatureMixin().petrosian_fractal_dimension(data=sine_wave)
>>> 1.0000398187022719
>>> np.random.shuffle(sine_wave)
>>> TimeseriesFeatureMixin().petrosian_fractal_dimension(data=sine_wave)
>>> 1.0211625348743218
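The sign-change recipe in the note corresponds to the standard Petrosian formula PFD = log10(N) / (log10(N) + log10(N / (N + 0.4 · NΔ))), where NΔ is the number of sign changes in the first derivative. A hypothetical re-implementation (not SimBA's jitted code):

```python
import numpy as np
from math import log10

def petrosian_fractal_dimension(data: np.ndarray) -> float:
    n = data.shape[0]
    diff = np.diff(data)
    # Count sign changes in the first derivative.
    sign_changes = int(np.sum(np.sign(diff[:-1]) * np.sign(diff[1:]) < 0))
    if n == 0 or sign_changes == 0:
        return -1.0
    return log10(n) / (log10(n) + log10(n / (n + 0.4 * sign_changes)))

result = petrosian_fractal_dimension(np.array([1.0, 2.0, 3.0, 2.0, 3.0, 2.0]))
```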
- static sliding_benford_correlation(data: ndarray, time_windows: ndarray, sample_rate: int) ndarray [source]
Calculate the sliding Benford’s Law correlation coefficient for a given dataset within specified time windows.
Benford’s Law is a statistical phenomenon where the leading digits of many datasets follow a specific distribution pattern. This function calculates the correlation between the observed distribution of leading digits in a dataset and the ideal Benford’s Law distribution.
Note
Adapted from tsfresh.
The returned correlation values are calculated using Pearson’s correlation coefficient.
The correlation coefficient is calculated between the observed leading digit distribution and the ideal Benford’s Law distribution.
- Parameters
data (np.ndarray) – The input 1D array containing the time series data.
time_windows (np.ndarray) – A 1D array containing the time windows (in seconds) for which the correlation will be calculated at different points in the dataset.
sample_rate (int) – The sample rate, indicating how many data points are collected per second.
- Return np.ndarray
2D array containing the correlation coefficient values for each time window, with time window lengths represented by different columns.
- Examples
>>> data = np.array([1, 8, 2, 10, 8, 6, 8, 1, 1, 1]).astype(np.float32)
>>> TimeseriesFeatureMixin.sliding_benford_correlation(data=data, time_windows=np.array([1.0]), sample_rate=2)
>>> [[ 0.][0.447][0.017][0.877][0.447][0.358][0.358][0.447][0.864][0.864]]
- static sliding_crossings(data: ndarray, val: float, time_windows: ndarray, fps: int) ndarray [source]
Compute the number of crossings over sliding windows in a data array.
Computes the number of times a value in the data array crosses a given threshold value within sliding windows of varying sizes. The number of crossings is computed for each window size and stored in the result array where columns represents time windows.
Note
For frames occurring before a complete time window, -1.0 is returned.
- Parameters
data (np.ndarray) – The input 1D data array.
val (float) – The threshold value for counting crossings.
time_windows (np.ndarray) – Array of window sizes (in seconds).
fps (int) – Frames per second (FPS) of the recorded video.
- Return np.ndarray
An array containing the number of crossings for each window size and data point. The shape of the result array is (data.shape[0], window_sizes.shape[0]).
- Example
>>> data = np.array([3.9, 7.5, 4.2, 6.2, 7.5, 3.9, 6.2, 6.5, 7.2, 9.5]).astype(np.float32)
>>> results = TimeseriesFeatureMixin().sliding_crossings(data=data, time_windows=np.array([1.0]), fps=2.0, val=7.0)
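A pure NumPy sketch of the sliding logic, including the -1.0 padding for frames before the first complete window (an illustrative re-implementation; the jitted SimBA version is much faster but follows the same semantics):

```python
import numpy as np

def sliding_crossings(data: np.ndarray, val: float, time_windows: np.ndarray, fps: float) -> np.ndarray:
    # -1.0 marks frames occurring before the first complete window.
    results = np.full((data.shape[0], time_windows.shape[0]), -1.0)
    for j, seconds in enumerate(time_windows):
        window = int(seconds * fps)
        for i in range(window - 1, data.shape[0]):
            segment = data[i - window + 1 : i + 1]
            sides = np.sign(segment - val)
            results[i, j] = np.sum(sides[:-1] * sides[1:] < 0)
    return results

data = np.array([3.9, 7.5, 4.2, 6.2, 7.5, 3.9, 6.2, 6.5, 7.2, 9.5], dtype=np.float64)
results = sliding_crossings(data=data, val=7.0, time_windows=np.array([1.0]), fps=2.0)
```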
- static sliding_descriptive_statistics(data: ndarray, window_sizes: ndarray, sample_rate: int, statistics: typing_extensions.Literal['var', 'max', 'min', 'std', 'median', 'mean', 'mad', 'sum', 'mac', 'rms', 'absenergy']) ndarray [source]
Jitted compute of descriptive statistics over sliding windows in 1D data array.
Computes various descriptive statistics (e.g., variance, maximum, minimum, standard deviation, median, mean, median absolute deviation) for sliding windows of varying sizes applied to the input data array.
- Parameters
data (np.ndarray) – 1D input data array.
window_sizes (np.ndarray) – Array of window sizes (in seconds).
sample_rate (int) – Sampling rate of the data in samples per second.
statistics (types.ListType(types.unicode_type)) – List of statistics to compute. Options: ‘var’, ‘max’, ‘min’, ‘std’, ‘median’, ‘mean’, ‘mad’, ‘sum’, ‘mac’, ‘rms’, ‘absenergy’.
- Return np.ndarray
Array containing the selected descriptive statistics for each window size, data point, and statistic type. The shape of the result array is (len(statistics), data.shape[0], window_sizes.shape[0]).
Note
The statistics parameter should be a list containing one or more of the following statistics:
‘var’ (variance), ‘max’ (maximum), ‘min’ (minimum), ‘std’ (standard deviation), ‘median’ (median), ‘mean’ (mean), ‘mad’ (median absolute deviation), ‘sum’ (sum), ‘mac’ (mean absolute change), ‘rms’ (root mean square), ‘absenergy’ (absolute energy).
If the statistics list is [‘var’, ‘max’, ‘mean’], the first-dimension order in the result array will be: [variance, maximum, mean].
- Example
>>> data = np.array([1, 4, 2, 3, 5, 6, 8, 7, 9, 10]).astype(np.float32)
>>> results = TimeseriesFeatureMixin().sliding_descriptive_statistics(data=data, window_sizes=np.array([1.0, 5.0]), sample_rate=2, statistics=typed.List(['var', 'max']))
- static sliding_displacement(x: ndarray, time_windows: ndarray, fps: float, px_per_mm: float) ndarray [source]
Calculate sliding Euclidean displacement of a body-part point over time windows.
- Parameters
x (np.ndarray) – 2D array of body-part coordinates, of shape (N, 2).
time_windows (np.ndarray) – Array of time windows (in seconds).
fps (float) – Frames per second (FPS) of the recorded video.
px_per_mm (float) – Pixels per millimeter of the recorded video.
- Return np.ndarray
1D array containing the calculated displacements.
- Example
>>> x = np.random.randint(0, 50, (100, 2)).astype(np.int32)
>>> TimeseriesFeatureMixin.sliding_displacement(x=x, time_windows=np.array([1.0]), fps=1.0, px_per_mm=1.0)
- static sliding_hjort_parameters(data: ndarray, window_sizes: ndarray, sample_rate: int) ndarray [source]
Jitted compute of Hjorth parameters, including mobility, complexity, and activity, for sliding windows of varying sizes applied to the input data array.
- Parameters
data (np.ndarray) – Input data array.
window_sizes (np.ndarray) – Array of window sizes (in seconds).
sample_rate (int) – Sampling rate of the data in samples per second.
- Return np.ndarray
An array containing Hjorth parameters for each window size and data point. The shape of the result array is (3, data.shape[0], window_sizes.shape[0]). The three parameters are stored in the first dimension (0 - mobility, 1 - complexity, 2 - activity), and the remaining dimensions correspond to data points and window sizes.
- static sliding_line_length(data: ndarray, window_sizes: ndarray, sample_rate: int) ndarray [source]
Jitted compute of sliding line length for a given time series using different window sizes.
The function computes line length for the input data using various window sizes. It returns a 2D array where each row corresponds to a position in the time series, and each column corresponds to a different window size. The line length is calculated for each window, and the results are returned as a 2D array of float32 values.
- Parameters
data (np.ndarray) – 1D array input data.
window_sizes (np.ndarray) – An array of window sizes (in seconds) to use for line length calculation.
sample_rate (int) – The sampling rate (samples per second) of the time series data.
- Return np.ndarray
A 2D array containing line length values for each window size at each position in the time series.
- Examples
>>> data = np.array([1, 4, 2, 3, 5, 6, 8, 7, 9, 10]).astype(np.float32)
>>> TimeseriesFeatureMixin().sliding_line_length(data=data, window_sizes=np.array([1.0]), sample_rate=2)
- static sliding_longest_strike(data: ndarray, threshold: float, time_windows: ndarray, sample_rate: int, above: bool) ndarray [source]
Jitted compute of the length of the longest strike of values within sliding time windows that satisfy a given condition.
Calculates the length of the longest consecutive sequence of values in a 1D NumPy array, where each sequence is determined by a sliding time window. The condition is specified by a threshold, and you can choose whether to look for values above or below the threshold.
- Parameters
data (np.ndarray) – The input 1D NumPy array containing the values to be analyzed.
threshold (float) – The threshold value used for the comparison.
time_windows (np.ndarray) – An array containing the time window sizes in seconds.
sample_rate (int) – The sample rate in samples per second.
above (bool) – If True, the function looks for strikes where values are above or equal to the threshold. If False, it looks for strikes where values are below or equal to the threshold.
- Return np.ndarray
A 2D NumPy array with dimensions (data.shape[0], time_windows.shape[0]). Each element in the array represents the length of the longest strike that satisfies the condition for the corresponding time window.
- Example
>>> data = np.array([1, 8, 2, 10, 8, 6, 8, 1, 1, 1]).astype(np.float32)
>>> TimeseriesFeatureMixin().sliding_longest_strike(data=data, threshold=7, above=True, time_windows=np.array([1.0]), sample_rate=2)
>>> [[-1.][ 1.][ 1.][ 1.][ 2.][ 1.][ 1.][ 1.][ 0.][ 0.]]
>>> TimeseriesFeatureMixin().sliding_longest_strike(data=data, threshold=7, above=False, time_windows=np.array([1.0]), sample_rate=2)
>>> [[-1.][ 1.][ 1.][ 1.][ 0.][ 1.][ 1.][ 1.][ 2.][ 2.]]
- static sliding_pct_in_top_n(x: ndarray, windows: ndarray, n: int, fps: float) ndarray [source]
Compute the percentage of elements in the top ‘n’ frequencies in sliding windows of the input array.
Note
To compute the percentage of elements in the top ‘n’ frequencies in the entire array, use simba.mixins.statistics_mixin.Statistics.pct_in_top_n.
- Parameters
x (np.ndarray) – 1D input data array.
windows (np.ndarray) – Array of window sizes (in seconds).
n (int) – Number of top frequencies to consider.
fps (float) – Frames per second (sample rate) of the data.
- Return np.ndarray
2D array of computed percentages of elements in the top ‘n’ frequencies for each sliding window.
- Example
>>> x = np.random.randint(0, 10, (100000,)) >>> results = TimeseriesFeatureMixin.sliding_pct_in_top_n(x=x, windows=np.array([1.0]), n=4, fps=10)
- static sliding_percent_beyond_n_std(data: ndarray, n: float, window_sizes: ndarray, sample_rate: int) ndarray [source]
Computes the percentage of data points that exceed ‘n’ standard deviations from the mean for each position in the time series using various window sizes. It returns a 2D array where each row corresponds to a position in the time series, and each column corresponds to a different window size. The results are given as a percentage of data points beyond the threshold.
- Parameters
data (np.ndarray) – The input time series data.
n (float) – The number of standard deviations to determine the threshold.
window_sizes (np.ndarray) – An array of window sizes (in seconds) to use for the sliding calculation.
sample_rate (int) – The sampling rate (samples per second) of the time series data.
- Return np.ndarray
A 2D array containing the percentage of data points beyond the specified ‘n’ standard deviations for each window size.
- static sliding_percent_in_percentile_window(data: ndarray, upper_pct: int, lower_pct: int, window_sizes: ndarray, sample_rate: int)[source]
Jitted compute of the percentage of data points falling within a percentile window in a sliding manner.
The function computes the percentage of data points within the specified percentile window for each position in the time series using various window sizes. It returns a 2D array where each row corresponds to a position in the time series, and each column corresponds to a different window size. The results are given as a percentage of data points within the percentile window.
- Parameters
data (np.ndarray) – The input time series data.
upper_pct (int) – The upper percentile value for the window (e.g., 95 for the 95th percentile).
lower_pct (int) – The lower percentile value for the window (e.g., 5 for the 5th percentile).
window_sizes (np.ndarray) – An array of window sizes (in seconds) to use for the sliding calculation.
sample_rate (int) – The sampling rate (samples per second) of the time series data.
- Return np.ndarray
A 2D array containing the percentage of data points within the percentile window for each window size.
- static sliding_percentile_difference(data: ndarray, upper_pct: int, lower_pct: int, window_sizes: ndarray, fps: int) ndarray [source]
Jitted compute of the difference between the upper and lower percentiles within a sliding window for each position in the time series using various window sizes. It returns a 2D array where each row corresponds to a position in the time series, and each column corresponds to a different window size. The results are calculated as the absolute difference between the upper and lower percentiles divided by the median of the window.
- Parameters
data (np.ndarray) – The input time series data.
upper_pct (int) – The upper percentile value for the window (e.g., 95 for the 95th percentile).
lower_pct (int) – The lower percentile value for the window (e.g., 5 for the 5th percentile).
window_sizes (np.ndarray) – An array of window sizes (in seconds) to use for the sliding calculation.
fps (int) – The sampling rate (samples per second) of the time series data.
- Return np.ndarray
A 2D array containing the difference between upper and lower percentiles for each window size.
- static sliding_petrosian_fractal_dimension(data: ndarray, window_sizes: ndarray, sample_rate: int) ndarray [source]
Jitted compute of Petrosian Fractal Dimension over sliding windows in a data array.
This method computes the Petrosian Fractal Dimension for sliding windows of varying sizes applied to the input data array. The Petrosian Fractal Dimension is a measure of signal complexity.
- Parameters
data (np.ndarray) – Input data array.
window_sizes (np.ndarray) – Array of window sizes (in seconds).
sample_rate (int) – Sampling rate of the data in samples per second.
- Return np.ndarray
An array containing Petrosian Fractal Dimension values for each window size and data point. The shape of the result array is (data.shape[0], window_sizes.shape[0]).
- static sliding_stationary(data: ~numpy.ndarray, time_windows: ~numpy.ndarray, sample_rate: int, test: typing_extensions.Literal['ADF', 'KPSS', 'ZA'] = 'adf') -> (<class 'numpy.ndarray'>, <class 'numpy.ndarray'>)[source]
Perform the Augmented Dickey-Fuller (ADF), Kwiatkowski-Phillips-Schmidt-Shin (KPSS), or Zivot-Andrews test on sliding windows of time series data. Parallel processing using all available cores is used to accelerate computation.
Note
ADF: A high p-value suggests non-stationarity, while a low p-value indicates stationarity.
KPSS: A high p-value suggests stationarity, while a low p-value indicates non-stationarity.
ZA: A high p-value suggests non-stationarity, while a low p-value indicates stationarity.
- Parameters
data (np.ndarray) – 1-D NumPy array containing the time series data to be tested.
time_windows (np.ndarray) – A 1-D NumPy array containing the time window sizes in seconds.
sample_rate (np.ndarray) – The sample rate of the time series data (samples per second).
test (Literal) – Test to perform. Options: ‘ADF’ (Augmented Dickey-Fuller), ‘KPSS’ (Kwiatkowski-Phillips-Schmidt-Shin), ‘ZA’ (Zivot-Andrews).
- Return (np.ndarray, np.ndarray)
A tuple of two 2-D NumPy arrays containing test statistics and p-values.
- The first array (stat) contains the test statistics.
- The second array (p_vals) contains the corresponding p-values.
- Example
>>> data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
>>> TimeseriesFeatureMixin().sliding_stationary(data=data, time_windows=np.array([2.0]), test='KPSS', sample_rate=2)
- static sliding_two_signal_crosscorrelation(x: ndarray, y: ndarray, windows: ndarray, sample_rate: float, normalize: bool, lag: float) ndarray [source]
Calculate sliding (lagged) cross-correlation between two signals, e.g., the movement and velocity of two animals.
Note
If no lag needed, pass lag 0.0.
- Parameters
x (np.ndarray) – The first input signal.
y (np.ndarray) – The second input signal.
windows (np.ndarray) – Array of window lengths in seconds.
sample_rate (float) – Sampling rate of the signals (in Hz or FPS).
normalize (bool) – If True, normalize the signals before computing the correlation.
lag (float) – Time lag between the signals in seconds.
- Returns
2D array of sliding cross-correlation values. Each row corresponds to a time index, and each column corresponds to a window size specified in the windows parameter.
- Example
>>> x = np.random.randint(0, 10, size=(20,))
>>> y = np.random.randint(0, 10, size=(20,))
>>> TimeseriesFeatureMixin.sliding_two_signal_crosscorrelation(x=x, y=y, windows=np.array([1.0, 1.2]), sample_rate=10, normalize=True, lag=0.0)
- static sliding_variance(data: ndarray, window_sizes: ndarray, sample_rate: int) ndarray [source]
Jitted compute of the variance of data within sliding windows of varying sizes applied to the input data array. Variance is a measure of data dispersion or spread.
- Parameters
data (np.ndarray) – 1D input data array.
window_sizes (np.ndarray) – Array of window sizes (in seconds).
sample_rate (int) – Sampling rate of the data in samples per second.
- Returns
Variance values for each window size and data point. The shape of the result array is (data.shape[0], window_sizes.shape[0]).
- Example
>>> data = np.array([1, 2, 3, 1, 2, 9, 17, 2, 10, 4]).astype(np.float32)
>>> TimeseriesFeatureMixin().sliding_variance(data=data, window_sizes=np.array([0.5]), sample_rate=10)
>>> [[-1.], [-1.], [-1.], [-1.], [0.56], [8.23], [35.84], [39.20], [34.15], [30.15]]
- static spike_finder(data: ndarray, sample_rate: int, baseline: float, min_spike_amplitude: float, min_fwhm: float = -inf, min_half_width: float = -inf) float [source]
Identify and characterize spikes in a given time-series data sequence. This method identifies spikes in the input data based on the specified criteria and characterizes each detected spike by computing its amplitude, full-width at half maximum (FWHM), and half-width.
- Parameters
data (np.ndarray) – A 1D array containing the input data sequence to analyze.
sample_rate (int) – The sample rate, indicating how many data points are collected per second.
baseline (float) – The baseline value used to identify spikes. Any data point above (baseline + min_spike_amplitude) is considered part of a spike.
min_spike_amplitude (float) – The minimum amplitude (above baseline) required for a spike to be considered.
min_fwhm (Optional[float]) – The minimum full-width at half maximum (FWHM) for a spike to be included. If not specified, it defaults to negative infinity, meaning it is not considered for filtering.
min_half_width (Optional[float]) – The minimum half-width required for a spike to be included. If not specified, it defaults to negative infinity, meaning it is not considered for filtering.
- Return tuple
A tuple containing three elements:
spike_idx (List[np.ndarray]): A list of 1D arrays, each representing the indices of the data points belonging to a detected spike.
spike_vals (List[np.ndarray]): A list of 1D arrays, each containing the values of the data points within a detected spike.
spike_dict (Dict[int, Dict[str, float]]): A dictionary where the keys are spike indices, and the values are dictionaries containing spike characteristics including ‘amplitude’ (spike amplitude), ‘fwhm’ (FWHM), and ‘half_width’ (half-width).
Note
The function uses the Numba JIT (Just-In-Time) compilation for optimized performance. Without fastmath=True there is no runtime improvement over standard numpy.
- Example
>>> data = np.array([0.1, 0.1, 0.3, 0.1, 10, 10, 8, 0.1, 0.1, 0.1, 10, 10, 8, 99, 0.1, 99, 99, 0.1]).astype(np.float32)
>>> spike_idx, spike_vals, spike_stats = TimeseriesFeatureMixin().spike_finder(data=data, baseline=1, min_spike_amplitude=5, sample_rate=2, min_fwhm=-np.inf, min_half_width=0.0002)
- static spike_train_finder(data: ndarray, spike_idx: list, sample_rate: float, min_spike_train_length: float = inf, max_spike_train_separation: float = inf)[source]
Identify and analyze spike trains from a list of spike indices.
This function takes spike indices and additional information, such as the data, sample rate, minimum spike train length, and maximum spike train separation, to identify and analyze spike trains in the data.
Note
The function may return an empty dictionary if no spike trains meet the criteria.
A required input is `spike_idx`, which is returned by `spike_finder()`.
- Parameters
data (types.List(types.Array(types.int64, 1, 'C'))) – The data from which spike trains are extracted.
spike_idx (list) – A list of spike indices, typically as integer timestamps.
sample_rate (float) – The sample rate of the data.
min_spike_train_length (Optional[float]) – The minimum length a spike train must have to be considered. Default is set to positive infinity, meaning no minimum length is enforced.
max_spike_train_separation (Optional[float]) – The maximum allowable separation between spikes in the same train. Default is set to positive infinity, meaning no maximum separation is enforced.
- Return DictType[int64,DictType[unicode_type,float64]]
A dictionary containing information about identified spike trains.
- Each entry in the returned dictionary is indexed by an integer, and contains the following information:
‘train_start_time’: Start time of the spike train in seconds.
‘train_end_time’: End time of the spike train in seconds.
‘train_start_obs’: Start time index in observations.
‘train_end_obs’: End time index in observations.
‘spike_cnt’: Number of spikes in the spike train.
‘train_length_obs_cnt’: Length of the spike train in observations.
‘train_length_obs_s’: Length of the spike train in seconds.
‘train_spike_mean_lengths_s’: Mean length of individual spikes in seconds.
‘train_spike_std_length_obs’: Standard deviation of spike lengths in observations.
‘train_spike_std_length_s’: Standard deviation of spike lengths in seconds.
‘train_spike_max_length_obs’: Maximum spike length in observations.
‘train_spike_max_length_s’: Maximum spike length in seconds.
‘train_spike_min_length_obs’: Minimum spike length in observations.
‘train_spike_min_length_s’: Minimum spike length in seconds.
‘train_mean_amplitude’: Mean amplitude of the spike train.
‘train_std_amplitude’: Standard deviation of spike amplitudes.
‘train_min_amplitude’: Minimum spike amplitude.
‘train_max_amplitude’: Maximum spike amplitude.
- Example
>>> data = np.array([0.1, 0.1, 0.3, 0.1, 10, 10, 8, 0.1, 0.1, 0.1, 10, 10, 8, 99, 0.1, 99, 99, 0.1]).astype(np.float32)
>>> spike_idx, _, _ = TimeseriesFeatureMixin().spike_finder(data=data, baseline=0.3, min_spike_amplitude=0.2, sample_rate=2, min_fwhm=-np.inf, min_half_width=-np.inf)
>>> results = TimeseriesFeatureMixin().spike_train_finder(data=data, spike_idx=typed.List(spike_idx), sample_rate=2.0, min_spike_train_length=2.0, max_spike_train_separation=2.0)
- static time_since_previous_target_value(data: ndarray, value: float, fps: int, inverse: bool = False) ndarray [source]
Calculate the time duration (in seconds) between each data point and the previous occurrence of a specific value within the data array.
- Parameters
data (np.ndarray) – The input 1D array containing the time series data.
value (float) – The specific value to search for in the data array.
fps (int) – The sampling rate at which the data points were collected. It is used to calculate the time duration in seconds.
inverse (bool) – If True, the function calculates the time since the previous value that is NOT equal to the specified ‘value’. If False, it calculates the time since the previous occurrence of the specified ‘value’.
- Returns np.ndarray
A 1D NumPy array containing the time duration (in seconds) since the previous occurrence of the specified ‘value’ for each data point.
- Example
>>> data = np.array([8, 8, 2, 10, 8, 6, 8, 1, 1, 1]).astype(np.float32)
>>> TimeseriesFeatureMixin().time_since_previous_target_value(data=data, value=8.0, inverse=False, fps=2.0)
>>> [0. , 0. , 0.5, 1. , 0. , 0.5, 0. , 0.5, 1. , 1.5]
>>> TimeseriesFeatureMixin().time_since_previous_target_value(data=data, value=8.0, inverse=True, fps=2.0)
>>> [-1. , -1. , 0. , 0. , 0.5, 0. , 0.5, 0. , 0. , 0. ]
- static time_since_previous_threshold(data: ndarray, threshold: float, fps: int, above: bool) ndarray [source]
Jitted compute of the time (in seconds) that has elapsed since the last occurrence of a value above (or below) a specified threshold in a time series. The time series is assumed to have a constant sample rate.
- Parameters
data (np.ndarray) – The input 1D array containing the time series data.
threshold (float) – The threshold value used for the comparison.
fps (int) – The sample rate of the time series in samples per second.
above (bool) – If True, the function looks for values above or equal to the threshold. If False, it looks for values below or equal to the threshold.
- Return np.ndarray
A 1D array of the same length as the input data. Each element represents the time elapsed (in seconds) since the last occurrence of the threshold value. If no threshold value is found before the current data point, the corresponding result is set to -1.0.
- Examples
>>> data = np.array([1, 8, 2, 10, 8, 6, 8, 1, 1, 1]).astype(np.float32)
>>> TimeseriesFeatureMixin().time_since_previous_threshold(data=data, threshold=7.0, above=True, fps=2.0)
>>> [-1. , 0. , 0.5, 0. , 0. , 0.5, 0. , 0.5, 1. , 1.5]
>>> TimeseriesFeatureMixin().time_since_previous_threshold(data=data, threshold=7.0, above=False, fps=2.0)
>>> [0. , 0.5, 0. , 0.5, 1. , 0. , 0.5, 0. , 0. , 0. ]
Unsupervised methods
Annotation GUI methods
- class simba.mixins.annotator_mixin.AnnotatorMixin(config_path: Union[str, PathLike], video_path: Union[str, PathLike], data_path: Union[str, PathLike], frame_size: Optional[Tuple[int]] = (1280, 650), title: Optional[str] = None)[source]
Bases:
ConfigReader
Methods for creating tkinter GUI frames and functions associated with annotating videos.
Currently under development (starting 01/24). As the number of different annotation methods and interfaces increases, this class will contain common methods for all annotation interfaces to decrease code duplication.
- Parameters
config_path (Union[str, os.PathLike]) – Path to SimBA project config ini file.
video_path (Union[str, os.PathLike]) – Path to video file to-be annotated.
data_path (Union[str, os.PathLike]) – Path to featurized pose-estimation data associated with the video.
frame_size (Optional[Tuple[int]]) – The size of the subframe displaying the video frame in the GUI.
- change_frame(new_frm_id: int, min_frm: Optional[int] = None, max_frm: Optional[int] = None, update_funcs: Optional[List[Callable[int, None]]] = None, store_funcs: Optional[List[Callable[None]]] = None, keep_radio_btn_choices: Optional[bool] = False) None [source]
Change the frame displayed in annotator GUI.
Note
store_funcs will be executed before update_funcs.
- Parameters
new_frm_id (int) – The frame number of the new frame.
min_frm (Optional[int]) – If the minimum frame number is not the first frame of the video, pass the minimum frame number here.
max_frm (Optional[int]) – If the maximum frame number is not the last frame of the video, pass the max frame number here.
update_funcs (Optional[List[Callable[[int], None]]]) – Optional functions that accept the new frame number and are called when the frame changes. E.g., useful if updating the frame number should cause any other Frame to display the new frame number.
store_funcs (Optional[List[Callable[[], None]]]) – Optional functions that saves user frame selections in memory.
keep_radio_btn_choices (Optional[bool]) – If True, any update_funcs that would update the radio button choices in the newly displayed frame are suppressed. Thus, the choices from the prior frame are maintained.
- find_proximal_annotated_frm(forwards: bool, present: bool)[source]
Helper to find the most proximal preceding or proceeding frame where any behavior is annotated as present or absent.
- get_annotation_of_frame(frm_num: int, clf: str, allowed_vals: Optional[List[Optional[int]]] = None) Optional[int] [source]
Helper to retrieve the stored annotation of a specific classifier at a specific frame index
Creates a horizontal frame navigation bar where the buttons are tied to callbacks for changing and displaying video frames.
- Parameters
parent (Frame) – The tkinter Frame to place navigation bar within.
update_funcs (Optional[List[Callable[[int], None]]]) – Optional list of callables that accepts a single integer inputs. Can be methods that updates part of the interface as the frame number changes.
store_funcs (Optional[List[Callable[[], None]]]) – Optional list of callables without arguments. Can be methods that stores the selections in memory as users proceeds through the frames.
size (Optional[Tuple[int, int]]) – The size of the navigation bar in h x w. Default 300 x 700.
loc (Optional[Tuple[int, int]]) – The grid location (row, column) within the parent frame at which the navigation bar should be displayed. Default: (1, 0).
previous_next_clf (Optional[bool]) – If True, then include four buttons allowing users to navigate to the most proximal preceding or proceeding frame where behaviors are annotated as present or absent.
- store_targeted_annotations_frames()[source]
Method to store annotations in memory while annotating targeted bouts frame-wise
- targeted_annotations_frames_save()[source]
Method to save annotations to disk when using targeted bout frame-wise annotations
- targeted_bouts_pane(parent: Frame) Frame [source]
Create a pane for choosing bouts start and end and a radiobutton truth table for targeted bouts annotations. Used by simba.labelling.targeted_annotations_bouts.TargetedAnnotatorBouts
- targeted_frames_selection_pane(parent: Frame, loc: Optional[Tuple[int, int]] = (0, 1)) None [source]
Creates a vertical pane that includes tkinter frames for selecting bouts and annotating behaviours in those bouts frame-wise.
- update_clf_radiobtns(frm_num: int)[source]
Update helper to set the radio button to annotated values
- update_current_selected_frm_lbl(new_frm: Union[int, str])[source]
Helper to update label showing current frame text shown when annotating bouts frame-wise.
Create a vertical navigation pane for playing a video and displaying and activating keyboard shortcuts when annotating bouts.
- Parameters
parent (Frame) – The tkinter Frame to place the vertical navigation bar within.
save_func (Callable[[], None]) – The save-data-to-disk function that should be called when using the save data shortcut.
update_funcs (Optional[List[Callable[[int], None]]]) – Optional list of callables that accepts a single integer inputs. Can be methods that updates part of the interface as the frame number changes.
store_funcs (Optional[List[Callable[[], None]]]) – Optional list of callables without arguments. Can be methods that stores the selections in memory as users proceeds through the frames.
loc (Optional[Tuple[int, int]]) – The grid location (row, column) within the parent frame at which the navigation bar should be displayed. Default: (1, 0).
- video_frm_label(frm_number: int, max_size: Optional[Tuple[int]] = None, loc: Tuple[int] = (0, 0)) None [source]
Inserts a video frame as a tkinter label at a specified maximum size at specified grid location.
- Parameters
frm_number (int) – The frame number of the video that should be displayed as a tkinter label.
max_size (Optional[Tuple[int, int]]) – The maximum size of the image when displayed. If None, then the `frame_size` defined at instance init is used.
loc (Tuple[int, int]) – The grid location (row, column) within the main frame at which the video frame should be displayed.
Image attribute extraction
- class simba.mixins.image_mixin.ImageMixin[source]
Bases:
object
Methods to slice and compute attributes of images and comparing those attributes across sequential images.
This can be helpful when the behaviors studied are very subtle and the signal is very low in relation to the noise within the pose-estimated data. In these use-cases, we cannot use pose-estimated data directly, and we instead study histograms, contours and other image metrics within images derived from the intersection of geometries (like a circle around the nose) across sequential images. Often these methods are called using image masks created from pose-estimated points within `simba.mixins.geometry_mixin.GeometryMixin` methods.
Important
If there is non-pose related noise in the environment (e.g., non-experiment related light sources that go on and off, or other image noise that doesn’t necessarily affect pose-estimation reliability), this will negatively affect the reliability of most image attribute comparisons.
- static add_img_border_and_flood_fill(img: array, invert: Optional[bool] = False, size: Optional[int] = 1) ndarray [source]
Add a border to the input image and perform flood fill.
E.g., Used to remove any black pixel areas connected to the border of the image. Used to remove noise if noise is defined as being connected to the edges of the image.
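The border-and-flood-fill idea can be sketched without OpenCV. This is a minimal pure-Python/NumPy illustration of the concept (it is not SimBA's implementation, which uses cv2): black pixels connected to the image border are filled, while interior black pixels survive.

```python
from collections import deque
import numpy as np

def remove_border_connected_black(img: np.ndarray, fill_value: int = 255) -> np.ndarray:
    """Flood-fill black (0) pixels connected to the image border with fill_value."""
    out = img.copy()
    h, w = out.shape
    q = deque()
    # seed the fill from every black border pixel
    for x in range(w):
        for y in (0, h - 1):
            if out[y, x] == 0:
                q.append((y, x))
    for y in range(h):
        for x in (0, w - 1):
            if out[y, x] == 0:
                q.append((y, x))
    # breadth-first fill of connected black regions
    while q:
        y, x = q.popleft()
        if 0 <= y < h and 0 <= x < w and out[y, x] == 0:
            out[y, x] = fill_value
            q.extend([(y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)])
    return out

img = np.full((5, 5), 255, dtype=np.uint8)
img[0:2, 0:2] = 0      # black blob touching the border: removed
img[2, 2] = 0          # interior black pixel: kept
cleaned = remove_border_connected_black(img)
```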
- static brightness_intensity(imgs: List[ndarray], ignore_black: Optional[bool] = True) List[float] [source]
Compute the average brightness intensity within each image within a list.
For example, (i) create a list of images containing a light cue ROI, (ii) compute brightness in each image, (iii) perform kmeans on brightness, and get the frames when the light cue is on vs off.
- Parameters
imgs (List[np.ndarray]) – List of images as arrays to calculate average brightness intensity within.
ignore_black (Optional[bool]) – If True, ignores black pixels. If the images are non-rectangular geometric slices created by `slice_shapes_in_img`, then pixels that don’t belong to the shape have been masked in black.
- Returns List[float]
List of floats of size len(imgs) with brightness intensities.
- Example
>>> img = cv2.imread('/Users/simon/Desktop/envs/troubleshooting/khan/project_folder/videos/stitched_frames/0.png').astype(np.uint8)
>>> ImageMixin.brightness_intensity(imgs=[img], ignore_black=False)
>>> [159.0]
- static canny_edge_detection(img: ndarray, threshold_1: int = 30, threshold_2: int = 200, aperture_size: int = 3, l2_gradient: bool = False) ndarray [source]
Apply Canny edge detection to the input image.
- static find_contours(img: ndarray, mode: Optional[typing_extensions.Literal['all', 'exterior']] = 'all', method: Optional[typing_extensions.Literal['simple', 'none', 'l1', 'kcos']] = 'simple') ndarray [source]
Find contours in the input image.
- Parameters
img (np.ndarray) – Input image as a NumPy array.
mode (Optional[Literal['all', 'exterior']]) – Contour retrieval mode, i.e. which contours should be kept. Default is ‘all’.
method (Optional[Literal['simple', 'none', 'l1', 'kcos']]) – Contour approximation method. Default is ‘simple’.
- static get_contourmatch(img_1: ndarray, img_2: ndarray, mode: Optional[typing_extensions.Literal['all', 'exterior']] = 'all', method: Optional[typing_extensions.Literal['simple', 'none', 'l2', 'kcos']] = 'simple', canny: Optional[bool] = True) float [source]
Calculate contour similarity between two images.
- Parameters
img_1 (np.ndarray) – First input image (numpy array).
img_2 (np.ndarray) – Second input image (numpy array).
mode (Optional[Literal['all', 'exterior']]) – Mode for contour extraction. Options: ‘all’ (all contours) or ‘exterior’ (only exterior contours). Defaults to ‘all’.
- Return float
Contour similarity score between the two images. Lower values indicate greater similarity, and higher values indicate greater dissimilarity.
- Example
>>> img_1 = cv2.imread('/Users/simon/Desktop/envs/troubleshooting/khan/project_folder/videos/stitched_frames/0.png').astype(np.uint8)
>>> img_2 = cv2.imread('/Users/simon/Desktop/envs/troubleshooting/khan/project_folder/videos/stitched_frames/3.png').astype(np.uint8)
>>> ImageMixin.get_contourmatch(img_1=img_1, img_2=img_2, mode='exterior')
- static get_histocomparison(img_1: ndarray, img_2: ndarray, method: Optional[typing_extensions.Literal['chi_square', 'correlation', 'intersection', 'bhattacharyya', 'hellinger', 'chi_square_alternative', 'kl_divergence']] = 'correlation', absolute: Optional[bool] = True)[source]
- Example
>>> img_1 = cv2.imread('/Users/simon/Desktop/envs/troubleshooting/khan/project_folder/videos/stitched_frames/0.png').astype(np.uint8)
>>> img_2 = cv2.imread('/Users/simon/Desktop/envs/troubleshooting/khan/project_folder/videos/stitched_frames/3.png').astype(np.uint8)
>>> ImageMixin.get_histocomparison(img_1=img_1, img_2=img_2, method='chi_square_alternative')
- static img_emd(imgs: Optional[List[ndarray]] = None, img_1: Optional[ndarray] = None, img_2: Optional[ndarray] = None, lower_bound: Optional[float] = 0.5)[source]
Compute Wasserstein distance between two images represented as numpy arrays.
- Example
>>> img_1 = cv2.imread('/Users/simon/Desktop/envs/troubleshooting/Emergence/project_folder/videos/Example_1_frames/24.png', 0).astype(np.float32)
>>> img_2 = cv2.imread('/Users/simon/Desktop/envs/troubleshooting/Emergence/project_folder/videos/Example_1_frames/1984.png', 0).astype(np.float32)
>>> ImageMixin.img_emd(img_1=img_1, img_2=img_2, lower_bound=0.5)
>>> 10.658767700195312
- static img_matrix_mse(imgs: ndarray) ndarray [source]
Compute the mean squared error (MSE) matrix table for a stack of images.
This function calculates the MSE between each pair of images in the input array and returns a symmetric matrix where each element (i, j) represents the MSE between the i-th and j-th images. Useful for assessing image similarities and anomalies.
- Parameters
imgs (np.ndarray) – A stack of images represented as a numpy array.
- Return np.ndarray
The MSE matrix table.
- Example
>>> imgs = ImageMixin().read_img_batch_from_video(video_path='/Users/simon/Desktop/envs/troubleshooting/two_black_animals_14bp/videos/Together_1.avi', start_frm=0, end_frm=50)
>>> imgs = np.stack(list(imgs.values()))
>>> ImageMixin().img_matrix_mse(imgs=imgs)
- static img_moments(img: ndarray, hu_moments: Optional[bool] = False) ndarray [source]
Compute image moments.
- Parameters
img (np.ndarray) – The input image.
hu_moments (Optional[bool]) – If True, returns the 7 Hu moments. Else, returns the moments.
- Returns np.ndarray
A 24x1 2d-array if hu_moments is False, 7x1 2d-array if hu_moments is True.
- Example
>>> img_1 = cv2.imread('/Users/simon/Desktop/envs/troubleshooting/khan/project_folder/videos/stitched_frames/0.png').astype(np.uint8)
>>> ImageMixin.img_moments(img=img_1, hu_moments=True)
>>> [[ 1.01270313e-03], [ 8.85983106e-10], [ 4.67680675e-13], [ 1.00442018e-12], [-4.64181508e-25], [-2.49036749e-17], [ 5.08375216e-25]]
- static img_sliding_mse(imgs: ndarray, slide_size: int = 1) ndarray [source]
Pairwise comparison of images in sliding windows using mean squared errors
- Example
>>> imgs = ImageMixin().read_all_img_in_dir(dir='/Users/simon/Desktop/envs/troubleshooting/two_black_animals_14bp/project_folder/Together_4_cropped_frames')
>>> imgs = np.stack(list(imgs.values()))
>>> mse = ImageMixin().img_sliding_mse(imgs=imgs, slide_size=2)
- static img_stack_mse(imgs_1: ndarray, imgs_2: ndarray) ndarray [source]
Pairwise comparison of images in two stacks of equal length using mean squared errors.
Note
Useful for noting subtle changes; e.g., each image in imgs_2 may equal the corresponding image in imgs_1 shifted by 1. Images have to be in uint8 format. Also see `img_sliding_mse`.
- Parameters
imgs_1 (np.ndarray) – First three (non-color) or four (color) dimensional stack of images in array format.
imgs_2 (np.ndarray) – Second three (non-color) or four (color) dimensional stack of images in array format.
- Return np.ndarray
Array of size len(imgs_1) comparing `imgs_1` and `imgs_2` at each index using mean squared errors at each pixel location.
- Example
>>> img_1 = cv2.imread('/Users/simon/Desktop/envs/troubleshooting/khan/project_folder/videos/stitched_frames/0.png').astype(np.uint8)
>>> img_2 = cv2.imread('/Users/simon/Desktop/envs/troubleshooting/khan/project_folder/videos/stitched_frames/10.png').astype(np.uint8)
>>> imgs_1 = np.stack((img_1, img_2)); imgs_2 = np.stack((img_2, img_2))
>>> ImageMixin.img_stack_mse(imgs_1=imgs_1, imgs_2=imgs_2)
>>> [637, 0]
>>> imgs = ImageMixin().read_all_img_in_dir(dir='/Users/simon/Desktop/envs/troubleshooting/two_black_animals_14bp/project_folder/Together_4_cropped_frames')
>>> imgs_1 = np.stack(list(imgs.values()))
>>> imgs_2 = np.roll(imgs_1, -1, axis=0)
>>> mse = ImageMixin().img_stack_mse(imgs_1=imgs_1, imgs_2=imgs_2)
- static img_stack_to_bw(imgs: ndarray, lower_thresh: int, upper_thresh: int, invert: bool)[source]
Convert a stack of color images into black and white format.
Note
If converting a single image, consider `simba.mixins.image_mixin.ImageMixin.img_to_bw()`
- Parameters
imgs (np.ndarray) – 4-dimensional array of color images.
lower_thresh (Optional[int]) – Lower threshold value for binary conversion. Pixels below this value become black. Default is 20.
upper_thresh (Optional[int]) – Upper threshold value for binary conversion. Pixels above this value become white. Default is 250.
invert (Optional[bool]) – Flag indicating whether to invert the binary image (black becomes white and vice versa). Default is True.
- Return np.ndarray
4-dimensional array with black and white image.
- Example
>>> imgs = ImageMixin.read_img_batch_from_video(video_path='/Users/simon/Downloads/3A_Mouse_5-choice_MouseTouchBasic_a1.mp4', start_frm=0, end_frm=100)
>>> imgs = np.stack(list(imgs.values()), axis=0)
>>> bw_imgs = ImageMixin.img_stack_to_bw(imgs=imgs, upper_thresh=255, lower_thresh=20, invert=False)
- static img_stack_to_greyscale(imgs: ndarray)[source]
Jitted conversion of a 4D stack of color images (RGB format) to grayscale.
- Parameters
imgs (np.ndarray) – A 4D array representing color images. It should have the shape (num_images, height, width, 3) where the last dimension represents the color channels (R, G, B).
- Returns np.ndarray
A 3D array containing the grayscale versions of the input images. The shape of the output array is (num_images, height, width).
- Example
>>> imgs = ImageMixin().read_img_batch_from_video(video_path='/Users/simon/Desktop/envs/troubleshooting/two_black_animals_14bp/videos/Together_1.avi', start_frm=0, end_frm=100)
>>> imgs = np.stack(list(imgs.values()))
>>> imgs_gray = ImageMixin.img_stack_to_greyscale(imgs=imgs)
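As a rough illustration of what the jitted conversion does, here is a plain NumPy sketch using the standard BT.601 luminosity weights; the exact channel weighting (and channel order, e.g. BGR vs RGB for frames read via OpenCV) in SimBA's implementation may differ.

```python
import numpy as np

def stack_to_greyscale(imgs: np.ndarray) -> np.ndarray:
    """Convert a (N, H, W, 3) image stack to (N, H, W) greyscale.

    Uses BT.601 luminosity weights, assuming RGB channel order.
    """
    weights = np.array([0.299, 0.587, 0.114], dtype=np.float32)  # R, G, B
    grey = imgs.astype(np.float32) @ weights  # weighted sum over the channel axis
    return grey.astype(np.uint8)

imgs = np.random.randint(0, 255, size=(4, 8, 8, 3), dtype=np.uint8)
grey = stack_to_greyscale(imgs)
# grey.shape == (4, 8, 8)
```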
- static img_stack_to_video(imgs: Dict[int, ndarray], save_path: Union[str, PathLike], fps: int, verbose: Optional[bool] = True)[source]
Convert a dictionary of images into a video file.
Note
The input dict can be created with ImageMixin().slice_shapes_in_imgs()
- Parameters
imgs (Dict[int, np.ndarray]) – A dictionary containing frames of the video, where the keys represent frame indices and the values are numpy arrays representing the images.
save_path (Union[str, os.PathLike]) – The path to save the output video file.
fps (int) – Frames per second (FPS) of the output video.
verbose (Optional[bool]) – If True, prints progress messages. Defaults to True.
- static img_to_bw(img: ndarray, lower_thresh: Optional[int] = 20, upper_thresh: Optional[int] = 250, invert: Optional[bool] = True) ndarray [source]
Convert an image to black and white (binary).
Note
If converting multiple images from colour to black and white, consider `simba.mixins.image_mixin.ImageMixin.img_stack_to_bw()`
- Parameters
img (np.ndarray) – Input image as a NumPy array.
lower_thresh (Optional[int]) – Lower threshold value for binary conversion. Pixels below this value become black. Default is 20.
upper_thresh (Optional[int]) – Upper threshold value for binary conversion. Pixels above this value become white. Default is 250.
invert (Optional[bool]) – Flag indicating whether to invert the binary image (black becomes white and vice versa). Default is True.
- Return np.ndarray
Binary black and white image.
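A minimal NumPy sketch of the thresholding idea. This is not SimBA's cv2-based implementation: the simplified rule here applies only the lower cut, so in-between pixels map to white.

```python
import numpy as np

def to_black_and_white(img: np.ndarray, lower_thresh: int = 20,
                       upper_thresh: int = 250, invert: bool = True) -> np.ndarray:
    """Binarize a greyscale image: pixels below lower_thresh become 0, all others 255.

    Simplified single-cut rule; upper_thresh is accepted but unused in this sketch.
    """
    bw = np.where(img < lower_thresh, 0, 255).astype(np.uint8)
    if invert:
        bw = 255 - bw  # black becomes white and vice versa
    return bw

img = np.array([[5, 100], [250, 30]], dtype=np.uint8)
bw = to_black_and_white(img, invert=False)
# bw == [[0, 255], [255, 255]]
```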
- static orb_matching_similarity_(img_1: ndarray, img_2: ndarray, method: typing_extensions.Literal['knn', 'match', 'radius'] = 'knn', mask: Optional[ndarray] = None, threshold: Optional[int] = 0.75) int [source]
Perform ORB feature matching between two sets of images.
>>> img_1 = cv2.imread('/Users/simon/Desktop/envs/troubleshooting/khan/project_folder/videos/stitched_frames/0.png').astype(np.uint8)
>>> img_2 = cv2.imread('/Users/simon/Desktop/envs/troubleshooting/khan/project_folder/videos/stitched_frames/10.png').astype(np.uint8)
>>> ImageMixin().orb_matching_similarity_(img_1=img_1, img_2=img_2, method='radius')
>>> 4
- static pad_img_stack(image_dict: Dict[int, ndarray], pad_value: Optional[int] = 0) Dict[int, ndarray] [source]
Pad images in a dictionary stack so that all have the same dimensions (the dimensions of the largest image in the stack).
- Parameters
image_dict (Dict[int, np.ndarray]) – A dictionary mapping integer keys to numpy arrays representing images.
pad_value (Optional[int]) – The pixel value used for padding. Default is 0 (black).
- Return Dict[int, np.ndarray]
A dictionary mapping integer keys to numpy arrays representing padded images.
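A plain NumPy sketch of the padding behavior described above, assuming bottom/right padding with a constant value (the exact padding placement in SimBA's implementation may differ):

```python
import numpy as np

def pad_img_stack(image_dict: dict, pad_value: int = 0) -> dict:
    """Pad every image to the height/width of the largest image in the stack."""
    max_h = max(img.shape[0] for img in image_dict.values())
    max_w = max(img.shape[1] for img in image_dict.values())
    out = {}
    for k, img in image_dict.items():
        pad_h, pad_w = max_h - img.shape[0], max_w - img.shape[1]
        # pad bottom and right; leave any trailing color axis untouched
        pads = [(0, pad_h), (0, pad_w)] + [(0, 0)] * (img.ndim - 2)
        out[k] = np.pad(img, pads, mode='constant', constant_values=pad_value)
    return out

imgs = {0: np.ones((4, 6), dtype=np.uint8), 1: np.ones((8, 3), dtype=np.uint8)}
padded = pad_img_stack(imgs)
# every image now has shape (8, 6)
```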
- static read_all_img_in_dir(dir: Union[str, PathLike], core_cnt: Optional[int] = -1) Dict[str, ndarray] [source]
Helper to read in all images within a directory using multiprocessing. Returns a dictionary with the image name as key and the images in array format as values.
- Example
>>> imgs = ImageMixin().read_all_img_in_dir(dir='/Users/simon/Desktop/envs/troubleshooting/two_black_animals_14bp/project_folder/Together_4_cropped_frames')
- static read_img_batch_from_video(video_path: Union[str, PathLike], start_frm: int, end_frm: int, core_cnt: Optional[int] = -1) Dict[int, ndarray] [source]
Read a batch of frames from a video file. This method reads frames from a specified range of frames within a video file using multiprocessing.
- Parameters
video_path (Union[str, os.PathLike]) – Path to the video file.
start_frm (int) – Starting frame index.
end_frm (int) – Ending frame index.
core_cnt (Optional[int]) – Number of CPU cores to use for parallel processing. Default is -1, indicating using all available cores.
greyscale (Optional[bool]) – If True, reads the images as greyscale. If False, then as original color scale. Default: True.
- Returns Dict[int, np.ndarray]
A dictionary containing frame indices as keys and corresponding frame arrays as values.
- Example
>>> ImageMixin().read_img_batch_from_video(video_path='/Users/simon/Desktop/envs/troubleshooting/two_black_animals_14bp/videos/Together_1.avi', start_frm=0, end_frm=50)
- static segment_img_horizontal(img: ndarray, pct: int, lower: Optional[bool] = True, both: Optional[bool] = False) ndarray [source]
Segment a horizontal part of the input image.
This function segments either the lower, upper, or both lower and upper part of the input image based on the specified percentage.
- Parameters
img (np.ndarray) – Input image as a NumPy array.
pct (int) – Percentage of the image to be segmented. If lower is True, it represents the lower part; if False, it represents the upper part.
lower (Optional[bool]) – Flag indicating whether to segment the lower part (True) or upper part (False) of the image. Default is True.
both (Optional[bool]) – If True, removes both the upper pct and lower pct and keeps middle part.
- Return np.array
Segmented part of the image.
- Example
>>> img = cv2.imread('/Users/simon/Desktop/test.png')
>>> img = ImageMixin.segment_img_horizontal(img=img, pct=10, both=True)
- static segment_img_stack_horizontal(imgs: ndarray, pct: int, lower: bool, both: bool) ndarray [source]
Segment a horizontal part of all images in stack.
- Example
>>> imgs = ImageMixin.read_img_batch_from_video(video_path='/Users/simon/Downloads/3A_Mouse_5-choice_MouseTouchBasic_a1.mp4', start_frm=0, end_frm=400)
>>> imgs = np.stack(list(imgs.values()), axis=0)
>>> sliced_imgs = ImageMixin.segment_img_stack_horizontal(imgs=imgs, pct=50, lower=True, both=False)
- static segment_img_vertical(img: ndarray, pct: int, left: Optional[bool] = True, both: Optional[bool] = False) ndarray [source]
Segment a vertical part of the input image.
This function segments either the left, right or both the left and right part of input image based on the specified percentage.
- Parameters
img (np.ndarray) – Input image as a NumPy array.
pct (int) – Percentage of the image to be segmented. If left is True, it represents the left part; if False, it represents the right part.
left (Optional[bool]) – Flag indicating whether to segment the left part (True) or right part (False) of the image. Default is True.
both (Optional[bool]) – If True, removes both the left pct and right pct and keeps middle part.
- Return np.array
Segmented part of the image.
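The vertical segmentation reduces to column slicing in NumPy. A minimal sketch, assuming percentages are measured from the respective edge:

```python
import numpy as np

def segment_img_vertical(img: np.ndarray, pct: int, left: bool = True, both: bool = False) -> np.ndarray:
    """Slice a vertical percentage of an image by column indexing."""
    w = img.shape[1]
    cut = int(w * pct / 100)
    if both:
        return img[:, cut:w - cut]        # drop pct from both sides, keep the middle band
    return img[:, :cut] if left else img[:, w - cut:]

img = np.arange(100).reshape(10, 10)
left_half = segment_img_vertical(img, pct=50, left=True)
# left_half.shape == (10, 5)
mid = segment_img_vertical(img, pct=10, both=True)
# mid.shape == (10, 8)
```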
- static slice_shapes_in_img(img: Union[ndarray, Tuple[VideoCapture, int]], geometries: List[Union[Polygon, ndarray]]) List[ndarray] [source]
Slice regions of interest (ROIs) from an image based on provided shapes.
Note
Use for slicing one or several static geometries from a single image. If you have several images and shifting geometries across images, consider `simba.mixins.image_mixin.ImageMixin.slice_shapes_in_imgs` which uses CPU multiprocessing.
- Parameters
img (Union[np.ndarray, Tuple[cv2.VideoCapture, int]]) – Either an image in numpy array format OR a tuple with a cv2.VideoCapture object and the frame index.
geometries (List[Union[Polygon, np.ndarray]]) – A list of shapes, either as vertices in a numpy array or as shapely Polygons.
- Returns List[np.ndarray]
List of sliced ROIs from the input image.
>>> img = cv2.imread('/Users/simon/Desktop/envs/troubleshooting/Emergence/project_folder/videos/img_comparisons_4/1.png')
>>> img_video = cv2.VideoCapture('/Users/simon/Desktop/envs/troubleshooting/Emergence/project_folder/videos/Example_1.mp4')
>>> data_path = '/Users/simon/Desktop/envs/troubleshooting/Emergence/project_folder/csv/outlier_corrected_movement_location/Example_1.csv'
>>> data = pd.read_csv(data_path, nrows=4, usecols=['Nose_x', 'Nose_y']).fillna(-1).values.astype(np.int64)
>>> shapes = []
>>> for frm_data in data: shapes.append(GeometryMixin().bodyparts_to_circle(frm_data, 100))
>>> ImageMixin().slice_shapes_in_img(img=(img_video, 1), geometries=shapes)
- slice_shapes_in_imgs(imgs: Union[ndarray, PathLike], shapes: Union[ndarray, List[Polygon]], core_cnt: Optional[int] = -1, verbose: Optional[bool] = False) Dict[int, ndarray] [source]
Slice regions from a stack of images or a video file, where the regions are based on defined shapes. Uses multiprocessing.
For example, given a stack of N images, and N*X geometries representing the region around the animal body-part(s), slice out the X geometries from each of the N images and return the sliced areas.
- Example I
>>> imgs = ImageMixin().read_img_batch_from_video(video_path='/Users/simon/Desktop/envs/troubleshooting/Emergence/project_folder/videos/Example_1.mp4', start_frm=0, end_frm=10)
>>> imgs = np.stack(list(imgs.values()))
>>> imgs_gray = ImageMixin().img_stack_to_greyscale(imgs=imgs)
>>> data = pd.read_csv('/Users/simon/Desktop/envs/troubleshooting/Emergence/project_folder/csv/outlier_corrected_movement_location/Example_1.csv', nrows=11).fillna(-1)
>>> nose_array, tail_array = data.loc[0:10, ['Nose_x', 'Nose_y']].values.astype(np.float32), data.loc[0:10, ['Tail_base_x', 'Tail_base_y']].values.astype(np.float32)
>>> nose_shapes, tail_shapes = [], []
>>> for frm_data in nose_array: nose_shapes.append(GeometryMixin().bodyparts_to_circle(frm_data, 80))
>>> for frm_data in tail_array: tail_shapes.append(GeometryMixin().bodyparts_to_circle(frm_data, 80))
>>> shapes = np.array(np.vstack([nose_shapes, tail_shapes]).T)
>>> sliced_images = ImageMixin().slice_shapes_in_imgs(imgs=imgs_gray, shapes=shapes)
- Example II
>>> video_path = '/Users/simon/Desktop/envs/troubleshooting/Emergence/project_folder/videos/Example_1_clipped.mp4'
>>> data_path = r'/Users/simon/Desktop/envs/troubleshooting/Emergence/project_folder/csv/outlier_corrected_movement_location/Example_1_clipped.csv'
>>> df = pd.read_csv(data_path, usecols=['Nose_x', 'Nose_y', 'Tail_base_x', 'Tail_base_y']).fillna(0).values.astype(int)
>>> data = df.reshape(len(df), -1, int(df.shape[1]/2))
>>> geometries = GeometryMixin().multiframe_bodyparts_to_line(data=data, buffer=30, px_per_mm=4.1)
>>> imgs = ImageMixin().slice_shapes_in_imgs(imgs=video_path, shapes=geometries)
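Both examples above require project data on disk. The per-frame work that `slice_shapes_in_imgs` distributes across cores can be sketched single-process with pure NumPy; `slice_rois_in_stack` below is a hypothetical simplification that, for each of the N frames, crops the bounding box of each of that frame's X geometries:

```python
from typing import Dict, List
import numpy as np

def slice_rois_in_stack(imgs: np.ndarray, rois: List[List[np.ndarray]]) -> Dict[int, List[np.ndarray]]:
    """For frame i, crop the bounding box of every (N, 2) vertex array in rois[i].
    Hypothetical single-process sketch of the work the multiprocess method distributes."""
    results = {}
    for idx, frame in enumerate(imgs):
        crops = []
        for verts in rois[idx]:
            x0, y0 = verts.min(axis=0)
            x1, y1 = verts.max(axis=0)
            crops.append(frame[y0:y1 + 1, x0:x1 + 1])
        results[idx] = crops
    return results

stack = np.zeros((3, 50, 50), dtype=np.uint8)          # stack of 3 frames
square = np.array([[5, 5], [15, 5], [15, 15], [5, 15]])
per_frame = [[square, square + 10] for _ in range(3)]  # 2 geometries per frame
sliced = slice_rois_in_stack(stack, per_frame)
print(len(sliced), sliced[0][0].shape)  # 3 (11, 11)
```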
- static template_matching_cpu(video_path: Union[str, PathLike], img: ndarray, core_cnt: Optional[int] = -1, return_img: Optional[bool] = False) Tuple[int, dict, Union[None, ndarray]] [source]
Perform template matching on CPU using multiprocessing for parallelization.
For example, given a cropped image, find the frame in a video from which it was most likely cropped.
- Parameters
video_path (Union[str, os.PathLike]) – Path to the video file on disk.
img (np.ndarray) – Template image for matching. E.g., a cropped image from video_path.
core_cnt (Optional[int]) – Number of CPU cores to use for parallel processing. Default is -1 (max available cores).
return_img (Optional[bool]) – Whether to return the best-match image annotated with a rectangle around the matched template area. Default is False.
- Returns Tuple[int, dict, Union[None, np.ndarray]]
A tuple containing: (i) int: frame index of the frame with the best match. (ii) dict: Dictionary containing results (probability and match location) for each frame. (iii) Union[None, np.ndarray]: Annotated image with rectangles around matches (if return_img is True), otherwise None.
- Example
>>> img = cv2.imread('/Users/simon/Desktop/envs/troubleshooting/two_black_animals_14bp/videos/Screenshot 2024-01-17 at 12.45.55 PM.png')
>>> results = ImageMixin().template_matching_cpu(video_path='/Users/simon/Desktop/envs/troubleshooting/two_black_animals_14bp/videos/Together_1.avi', img=img, return_img=True)
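The matching itself can be illustrated without video files or OpenCV. The sketch below, with the hypothetical name `template_match_frames`, scores every window of every frame by sum of squared differences (a stand-in for cv2.matchTemplate) and returns the best-matching frame index plus a per-frame score dict, mirroring the (frame index, results dict) shape of the return value described above:

```python
from typing import Dict, Tuple
import numpy as np

def template_match_frames(frames: np.ndarray, template: np.ndarray) -> Tuple[int, Dict[int, float]]:
    """Return the index of the frame whose best window most closely matches
    `template` (lowest sum of squared differences), plus per-frame best scores.
    Hypothetical single-process sketch; the real method parallelizes over cores."""
    th, tw = template.shape
    tmpl = template.astype(np.int64)
    scores = {}
    for idx, frame in enumerate(frames):
        best = np.inf
        # Slide the template over every valid window position.
        for y in range(frame.shape[0] - th + 1):
            for x in range(frame.shape[1] - tw + 1):
                window = frame[y:y + th, x:x + tw].astype(np.int64)
                best = min(best, float(np.sum((window - tmpl) ** 2)))
        scores[idx] = best
    return min(scores, key=scores.get), scores

rng = np.random.default_rng(0)
frames = rng.integers(0, 255, size=(4, 20, 20), dtype=np.uint8)
template = frames[2, 5:12, 3:10].copy()  # crop taken from frame 2
best_frame, scores = template_match_frames(frames, template)
print(best_frame)  # 2, the frame the template was cropped from (SSD = 0)
```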