Shapley calculations: Example II (multiple cores)

In this example, we have already trained a classifier and still have the data used to train it. We now want to compute SHAP explainability scores for this classifier using multiple CPU cores to speed up the calculation. Run time should decrease roughly in proportion to the number of cores used, although the overhead of copying the model and data to each worker means the speed-up is rarely perfectly linear. Because the model has to be pushed to each core, it’s advisable to use as slim a model as possible.
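To get a feel for how "slim" a model is before pushing it to several cores, we can measure its serialized size. The sketch below assumes the model reaches each worker through standard Python pickling, which is how the `multiprocessing` module normally passes objects. `pickled_size_mb` is a hypothetical helper, not a SimBA function, and the dictionaries are stand-ins for a real classifier.

```python
import pickle


def pickled_size_mb(obj) -> float:
    """Return the size of ``obj`` in megabytes once serialized with pickle."""
    return len(pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)) / 1e6


# Stand-in objects for illustration only. In the notebook you would pass
# the loaded classifier (``clf``) instead.
small_model = {"trees": [list(range(10)) for _ in range(10)]}
large_model = {"trees": [list(range(1000)) for _ in range(100)]}

print(f"small model: {pickled_size_mb(small_model):.4f} MB")
print(f"large model: {pickled_size_mb(large_model):.4f} MB")
```

A larger serialized size means more data to copy to every worker, so a model with fewer or shallower trees is typically cheaper to distribute.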

[1]:
from simba.mixins.train_model_mixin import TrainModelMixin
from simba.mixins.config_reader import ConfigReader
from simba.utils.read_write import read_config_file, read_pickle
import glob
[2]:
# DEFINITIONS
CONFIG_PATH = r"C:\troubleshooting\mitra\project_folder\project_config.ini"
CLASSIFIER_PATH = r"C:\troubleshooting\mitra\models\generated_models\grooming.sav"
CLASSIFIER_NAME = 'grooming'
COUNT_PRESENT = 250
COUNT_ABSENT = 250
[3]:
# READ IN THE CONFIG AND THE CLASSIFIER
config = read_config_file(config_path=CONFIG_PATH)
config_object = ConfigReader(config_path=CONFIG_PATH)
clf = read_pickle(data_path=CLASSIFIER_PATH)
[4]:
# READ IN THE DATA

# Collect the paths to all files inside the project_folder/csv/targets_inserted directory
file_paths = glob.glob(config_object.targets_folder + '/*' + config_object.file_type)

# Read in the data held in all files in ``file_paths`` defined above
data, _ = TrainModelMixin().read_all_files_in_folder_mp(file_paths=file_paths, file_type=config.get('General settings', 'workflow_file_type').strip())

# Find all behavior annotations that are NOT the target. E.g., if SHAP values are going to be calculated for Attack, we need to find which other annotations exist in the data (e.g., Escape and Defensive) so that they can be removed.
non_target_annotations = TrainModelMixin().read_in_all_model_names_to_remove(config=config, model_cnt=config_object.clf_cnt, clf_name=CLASSIFIER_NAME)

# We remove the body-part coordinate columns and the annotations which are not the target from the data
data = data.drop(non_target_annotations + config_object.bp_headers, axis=1)

# We place the target data in its own variable
target_df = data.pop(CLASSIFIER_NAME)

Dataset size: 544.10988MB / 0.54411GB
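Before requesting `COUNT_PRESENT` and `COUNT_ABSENT` observations, it can be worth confirming that the annotations contain at least that many frames of each kind. The sketch below is one way to check this. It assumes annotations are coded as 1 (behavior present) and 0 (behavior absent). `check_sample_counts` is a hypothetical helper, not part of SimBA, and the label list is made up for illustration.

```python
def check_sample_counts(labels, cnt_present, cnt_absent):
    """Count present (1) and absent (0) labels and report any shortfall.

    Returns a tuple of (n_present, n_absent, problems), where ``problems``
    is a list of human-readable messages (empty if the counts suffice).
    """
    n_present = sum(1 for value in labels if value == 1)
    n_absent = sum(1 for value in labels if value == 0)
    problems = []
    if n_present < cnt_present:
        problems.append(f"only {n_present} present frames, {cnt_present} requested")
    if n_absent < cnt_absent:
        problems.append(f"only {n_absent} absent frames, {cnt_absent} requested")
    return n_present, n_absent, problems


# Made-up annotations for illustration. In the notebook you would pass
# ``target_df`` together with COUNT_PRESENT and COUNT_ABSENT.
example_labels = [1] * 300 + [0] * 1000
n_present, n_absent, problems = check_sample_counts(example_labels, 250, 250)
print(n_present, n_absent, problems)
```

If `problems` is not empty, lowering the requested counts to what the data can supply is one way to proceed.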
[5]:
TrainModelMixin().create_shap_log_mp(rf_clf=clf,
                                     x=data,
                                     y=target_df,
                                     x_names=list(data.columns),
                                     clf_name=CLASSIFIER_NAME,
                                     cnt_present=COUNT_PRESENT,
                                     cnt_absent=COUNT_ABSENT,
                                     core_cnt=2,
                                     chunk_size=100,
                                     verbose=True,
                                     save_dir=config_object.logs_path,
                                     save_file_suffix=1,
                                     plot=True)
Computing 500 SHAP values (MULTI-CORE BATCH SIZE: 100, FOLLOW PROGRESS IN OS TERMINAL)...
Concatenating multi-processed SHAP data (batch 1/5)
[Parallel(n_jobs=32)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=32)]: Done 136 tasks      | elapsed:    0.0s
[Parallel(n_jobs=32)]: Done 386 tasks      | elapsed:    0.0s
[Parallel(n_jobs=32)]: Done 500 out of 500 | elapsed:    0.0s finished
Concatenating multi-processed SHAP data (batch 2/5)
[Parallel(n_jobs=32)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=32)]: Done 136 tasks      | elapsed:    0.0s
[Parallel(n_jobs=32)]: Done 386 tasks      | elapsed:    0.0s
[Parallel(n_jobs=32)]: Done 500 out of 500 | elapsed:    0.0s finished
Concatenating multi-processed SHAP data (batch 3/5)
[Parallel(n_jobs=32)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=32)]: Done 136 tasks      | elapsed:    0.0s
[Parallel(n_jobs=32)]: Done 386 tasks      | elapsed:    0.0s
[Parallel(n_jobs=32)]: Done 500 out of 500 | elapsed:    0.0s finished
Concatenating multi-processed SHAP data (batch 4/5)
[Parallel(n_jobs=32)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=32)]: Done 136 tasks      | elapsed:    0.0s
[Parallel(n_jobs=32)]: Done 386 tasks      | elapsed:    0.0s
[Parallel(n_jobs=32)]: Done 500 out of 500 | elapsed:    0.0s finished
Concatenating multi-processed SHAP data (batch 5/5)
[Parallel(n_jobs=32)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=32)]: Done 136 tasks      | elapsed:    0.0s
[Parallel(n_jobs=32)]: Done 386 tasks      | elapsed:    0.0s
[Parallel(n_jobs=32)]: Done 500 out of 500 | elapsed:    0.0s finished
SIMBA COMPLETE: SHAP calculations complete (elapsed time: 231.2415s)    complete
SIMBA WARNING: ShapWarning: SHAP visualizations/aggregate stats skipped (only viable for projects with two animals and default 7 or 8 body-parts per animal) ...        warning
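The log above reports five batches, which is consistent with 250 present plus 250 absent observations being processed in chunks of 100. The sketch below reproduces that arithmetic. How SimBA splits the observations internally is an assumption here, and `split_into_batches` is a hypothetical helper used only for illustration.

```python
import math


def split_into_batches(n_observations: int, chunk_size: int):
    """Split observation indices 0..n_observations-1 into consecutive chunks."""
    return [
        range(start, min(start + chunk_size, n_observations))
        for start in range(0, n_observations, chunk_size)
    ]


n_observations = 250 + 250  # COUNT_PRESENT + COUNT_ABSENT
chunk_size = 100            # the chunk_size argument used in the call above

batches = split_into_batches(n_observations, chunk_size)
n_batches = math.ceil(n_observations / chunk_size)

print(f"{n_observations} observations -> {len(batches)} batches")
```

With these values, a smaller `chunk_size` gives more, shorter batches, and the last batch is shorter than the rest whenever the total is not a multiple of `chunk_size`.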