Machine learning model monitoring
Model monitoring in 1 hour:
Introduction:
Once a model is deployed in production, continuous model monitoring is required. In this article I will mainly focus on the aspects below:
1. Performance monitoring
2. Drift monitoring
3. Fairness
4. Explainability
Importance of model monitoring:
Maintaining the same performance in production as in the train/test phase is challenging. Below are a few cases where this becomes challenging:
1. A variety of data that was not seen in the training phase can appear in production.
2. Tastes/preferences of customers can change over a period of time. E.g., a customer prefers visiting malls for shopping, but during COVID-19 or other pandemic times, the same customer prefers online shopping.
3. A model trained on data from a particular period (rainy season data) may not work well on data from another period (summer data). E.g., a temperature prediction model trained on rainy season data may not predict accurately on summer season data.
4. The distribution of observations may change over time: low-frequency observations become high-frequency observations and vice versa.
5. Changes/degradation/malfunctioning in the sensors/external tools/applications that collect data reduce the quality of inputs to the models.
6. In a few use cases, customer preferences or tastes depend on recent data/actions. The model needs retraining with recent data to maintain performance.
Model performance may gradually degrade over a period of time or stop performing well suddenly.
Identifying that a model is performing badly as soon as possible, and bringing it down or retraining/replacing it before there is a negative impact on the business, is crucial, and this requires continuous monitoring in production.
What do we monitor in production?
Performance, drift, explainability, and fairness are a few high-level metric categories most frequently used for monitoring.
In production, monitoring one metric alone is not enough to get hints about where there is a high chance of model failure.
1. Performance metric monitoring:
In production, when model performance degrades, we need to look into why there is a deviation, and if needed the model must be replaced/retrained with a new dataset.
Eg 1: Loan repayment prediction for a bank (classification use case).
Performance metric: accuracy
Sample dataset:
Case 1:
Test accuracy, accuracy(test) = 90% (accuracy on January data)
March production accuracy, accuracy(prod) = 85% (accuracy on March data, assuming the model was deployed in March)
April production accuracy, accuracy(prod) = 50% (accuracy on April data)
Observation: Model performance started degrading and underperforming in April.
Action item: The data scientist has to look at why the model degraded and consider changing the model or retraining it with a new dataset.
By this time you might have the questions below:
a. How much deviation in a metric is considered a really significant deviation? Is there a systematic way to decide that there is a deviation?
Ans: Due to variability, the performance score fluctuates up (above 90%) or down (below 90%) around the actual value. Follow any one of the procedures below:
Procedure 1: Based on the impact on the business. A few businesses will be impacted more even by a slight performance degradation. The business team decides on some lower threshold, let's say lower threshold = 87%. When model performance goes below the lower threshold, it is considered a significant deviation.
Procedure 2: Assume the performance of the model follows a normal distribution. Find the mean and standard deviation of accuracy in the development phase on the test dataset as below (see the sketch after these steps):
Let's say there are K records in the test dataset.
Step 1: Randomly sample M records with replacement from the test dataset.
Step 2: Calculate accuracy on the sample selected in step 1.
Step 3: Repeat step 1 and step 2 N times and get N accuracies.
Step 4: Find the mean and standard deviation (std) of the N accuracies.
Step 5: Set lower threshold = mean(N accuracies) - 1.96 * std.
Step 6: Set upper threshold = mean(N accuracies) + 1.96 * std.
b. Do we need to consider retraining the model if performance in production is far better than test performance?
Ans: If performance goes beyond the upper threshold, one reason for this behaviour is a change in the distribution of the data: at training time the observations were hard ones, while in production the model starts receiving simple/easy observations. This is considered a change in distribution, and it is worth looking at the retraining option to see if performance can be improved further.
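A minimal sketch of Procedure 2 in Python, assuming a fitted scikit-learn-style classifier `model` and held-out test arrays `X_test`, `y_test` (these names, and the defaults for M and N, are illustrative assumptions, not from the article):

```python
import numpy as np
from sklearn.metrics import accuracy_score

def bootstrap_accuracy_thresholds(model, X_test, y_test, n_iter=1000, sample_size=None, seed=42):
    """Estimate lower/upper accuracy thresholds by bootstrapping the test set."""
    X_test, y_test = np.asarray(X_test), np.asarray(y_test)
    rng = np.random.default_rng(seed)
    k = len(y_test)                           # K records in the test dataset
    m = sample_size or k                      # M records per bootstrap sample
    accuracies = []
    for _ in range(n_iter):                   # step 3: repeat N times
        idx = rng.integers(0, k, size=m)      # step 1: sample with replacement
        accuracies.append(accuracy_score(y_test[idx], model.predict(X_test[idx])))  # step 2
    mean, std = np.mean(accuracies), np.std(accuracies)  # step 4
    return mean - 1.96 * std, mean + 1.96 * std          # steps 5-6: lower, upper thresholds
```

A production accuracy below the lower threshold (or above the upper one) is then treated as a significant deviation.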
c. What other performance metrics need to be monitored?
Monitoring only accuracy does not give the complete picture. Monitoring other metrics like TPR, TNR, FPR, FNR, AUC, F1, etc. helps us understand the model's behaviour well.
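As a minimal sketch of computing several of these metrics with scikit-learn (the arrays below are illustrative dummy values, not real data):

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, roc_auc_score

# Illustrative ground truth, predicted probabilities, and predicted labels.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_prob = np.array([0.9, 0.2, 0.7, 0.4, 0.1, 0.6, 0.8, 0.3])
y_pred = (y_prob >= 0.5).astype(int)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
metrics = {
    "tpr": tp / (tp + fn),   # true positive rate (recall)
    "tnr": tn / (tn + fp),   # true negative rate (specificity)
    "fpr": fp / (fp + tn),   # false positive rate
    "fnr": fn / (fn + tp),   # false negative rate
    "f1": f1_score(y_true, y_pred),
    "auc": roc_auc_score(y_true, y_prob),
}
print(metrics)
```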
d. What plots help us to understand model performance better?
e. When is performance monitoring not possible?
In the case of Eg 1, the ground truth is available only after completion of the loan tenure; for a few customers that is 30 years. Since the ground truth is unavailable immediately, it is not possible to calculate performance.
The workaround is to monitor drift.
Over-time monitoring:
Monitoring over time, rather than over one particular month or week, gives more insights, as shown in fig 1.
Note: Here the Y-axis can be other metrics like TPR, F1, AUC, etc., and the X-axis can be weekly, monthly, quarterly, etc. I recommend trying all possible combinations and seeing whether we get more insights.
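A plot like fig 1 can be produced by grouping production predictions by period and plotting the metric over time. A minimal sketch, assuming a prediction log with `timestamp`, `y_true`, and `y_pred` columns (the data and the 87% threshold below are illustrative):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Illustrative production log: one row per scored observation.
df = pd.DataFrame({
    "timestamp": pd.to_datetime(["2023-03-05", "2023-03-20", "2023-04-02", "2023-04-25"]),
    "y_true":    [1, 0, 1, 1],
    "y_pred":    [1, 0, 0, 0],
})

# Accuracy per month (the X-axis could equally be weeks or quarters).
df["month"] = df["timestamp"].dt.to_period("M").astype(str)
df["correct"] = (df["y_true"] == df["y_pred"])
monthly_acc = df.groupby("month")["correct"].mean()

monthly_acc.plot(marker="o", label="production accuracy")
plt.axhline(0.87, color="red", linestyle="--", label="lower threshold (87%)")
plt.xlabel("month")
plt.ylabel("accuracy")
plt.legend()
plt.show()
```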
Fairness:
In a few countries/locations, decisions based on race, religion, national origin, gender, marital status, age, privileged/unprivileged group, etc. are illegal. Such social bias by the model needs to be identified, if it exists, and the model replaced/retrained.
One simple technique is to remove these protected features from model training. However, even after removing them, proxy features for the protected features can be induced into the model, and these are hard to identify in EDA.
E.g., in table 1, salary is highly correlated with gender and acts as a proxy feature for gender. All female customers' employment status is 0 (unemployed), which is particularly true in a few communities. So even after removing gender, the employment status feature may act as a proxy for gender, and the model may still discriminate by gender.
How do we test whether the model is biased towards any particular privileged/unprivileged group?
Procedure 1: A simple procedure is to calculate the performance metric for each category level of the feature and make sure performance is almost the same for all groups (a minimal sketch follows the steps below).
E.g.:
Step 1: Divide the test dataset into 2 groups, male and female. One of the groups is called the privileged group; all observations not in the privileged group are considered the unprivileged group.
Step 2: Calculate the performance metric (accuracy), accuracy(p), for the privileged group.
Step 3: Calculate the performance metric (accuracy), accuracy(un_p), for the unprivileged group.
Step 4: Compare both performance metrics. If accuracy(p) ≈ accuracy(un_p), then the model is considered unbiased and fair.
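The sketch below assumes a scored test set with a `gender` column plus `y_true` and `y_pred` columns (all names and values are illustrative):

```python
import pandas as pd

# Illustrative scored test set; the protected attribute is kept only for auditing.
test_df = pd.DataFrame({
    "gender": ["male", "male", "female", "female", "female"],
    "y_true": [1, 0, 1, 0, 1],
    "y_pred": [1, 0, 0, 0, 1],
})

# Steps 1-3: accuracy per group (privileged vs. unprivileged).
acc = (test_df["y_true"] == test_df["y_pred"]).groupby(test_df["gender"]).mean()
print(acc)

# Step 4: compare; a large gap hints at social bias.
print("accuracy gap:", abs(acc["male"] - acc["female"]))
```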
Procedure 2: Advanced metrics like equal opportunity and predictive parity help to identify social bias.
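For example, the equal opportunity difference compares the true positive rate (TPR) of the two groups; a value near zero indicates equal opportunity. The function below is an illustrative sketch, not a fixed definition from the article:

```python
import numpy as np

def equal_opportunity_difference(y_true, y_pred, group, privileged="male"):
    """TPR(privileged) - TPR(unprivileged); values near 0 indicate equal opportunity."""
    y_true, y_pred, group = map(np.asarray, (y_true, y_pred, group))

    def tpr(mask):
        positives = mask & (y_true == 1)
        return (y_pred[positives] == 1).mean() if positives.any() else np.nan

    return tpr(group == privileged) - tpr(group != privileged)

# e.g., using the test_df from the previous sketch:
# equal_opportunity_difference(test_df["y_true"], test_df["y_pred"], test_df["gender"])
```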
Be careful with external features. Even when protected features like gender are not part of training (i.e., they are external features), it is a must to test fairness on these features.
Drift:
Let me start this with a question:
When performance monitoring is available, why do we need to monitor drift?
Ans: The direct check of whether the model is doing well or badly is to calculate the performance metric. But for many use cases, the ground truth will not be available on the same day or immediately. Sometimes getting the ground truth is costly, as it needs to be taken from third-party systems or needs heavy preprocessing. One of the reasons for model performance changes, either an increase or a decrease, is a change in the distribution of the data. A change in the distribution of the data is a strong signal that there may be a change in the performance of the model.
4 types of drift need to be monitored:
1. Concept drift
2. Input drift
3. Decision drift
4. Label drift
Drift for each column:
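The article does not fix a particular test for per-column drift, so here is a minimal sketch using a two-sample Kolmogorov-Smirnov test on each numeric column, comparing the training distribution with the production distribution (the function name, threshold, and column handling are illustrative assumptions):

```python
import pandas as pd
from scipy.stats import ks_2samp

def detect_column_drift(train_df: pd.DataFrame, prod_df: pd.DataFrame, alpha: float = 0.05):
    """Run a two-sample KS test per numeric column; small p-values suggest drift."""
    report = {}
    for col in train_df.select_dtypes("number").columns:
        stat, p_value = ks_2samp(train_df[col].dropna(), prod_df[col].dropna())
        report[col] = {"ks_stat": stat, "p_value": p_value, "drifted": p_value < alpha}
    return pd.DataFrame(report).T
```

Categorical columns would need a different test (e.g., a chi-squared test on category frequencies).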
Explainability:
In production, the ML model serves many input requests and makes predictions. In domains like banking, teams focus not only on the accuracy of predictions but also on explainability.
E.g., when a loan application is rejected by the ML model, the applicant asks the bank why the application got rejected. If it is accepted by the ML model, the bank needs the reason why it was approved, or needs to know the good qualities of the applicant, etc.
Models like linear/decision tree models are implicitly interpretable, but less accurate in their predictions.
Techniques like LIME and SHAP help to get explainability for each prediction, irrespective of which model is used. The best resource you can find is at https://christophm.github.io/interpretable-ml-book/
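As a minimal SHAP sketch (the dataset and model below are illustrative; in practice explain the deployed model on its real features):

```python
import shap
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Illustrative data and model standing in for the deployed loan model.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
model = GradientBoostingClassifier(random_state=0).fit(X, y)

# SHAP values: per-feature contribution (in log-odds) to each individual prediction.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)

# Global summary plot: which features drive the predictions the most.
shap.summary_plot(shap_values, X)
```

LIME's LimeTabularExplainer can be used in a similar way to explain one prediction at a time.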