Evaluating Reliability in Medical DNNs: A Critical Analysis of Feature and Confidence-Based OOD Detection

Harry Anthony & Konstantinos Kamnitsas


Out-of-distribution Detection and Failure Detection

Consider the setting where we are training a Deep Neural Network (DNN) for medical image classification, such as diagnosing skin lesion images. We have a set of images and corresponding labels that make up our training data. Once the model is trained, it can be applied to images not seen during training. If an image comes from the same distribution as the training data, we call it in-distribution (ID). However, if the image is significantly different from the training data, we call it out-of-distribution (OOD). DNN predictions on OOD data are both unpredictable and unreliable, which motivates the field of out-of-distribution detection: detecting predictions made on OOD data and discarding them to improve trust in the model's predictions.

We analysed post-hoc out-of-distribution detection methods (methods that use the parameters or outputs of a pre-trained model), which we grouped into two categories: confidence-based methods, which use the model's output probabilities (e.g. MCP), and feature-based methods, which use the model's hidden feature representations (e.g. the Mahalanobis score).
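
To make the distinction concrete, here is a minimal sketch (not the paper's exact implementation) of one scoring function from each family: MCP computed from the softmax output, and a simplified Mahalanobis distance in feature space. The full Mahalanobis method fits class-conditional Gaussians with a shared covariance; this sketch fits a single Gaussian, and all tensor shapes and helper names are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def mcp_score(logits: torch.Tensor) -> torch.Tensor:
    """Confidence-based score: Maximum Class Probability (higher = more ID-like)."""
    return F.softmax(logits, dim=1).max(dim=1).values

def fit_id_gaussian(id_features: torch.Tensor):
    """Fit a Gaussian (mean and precision matrix) to held-out ID feature vectors."""
    mean = id_features.mean(dim=0)
    centred = id_features - mean
    cov = centred.T @ centred / (id_features.shape[0] - 1)
    precision = torch.linalg.pinv(cov)  # pseudo-inverse for numerical stability
    return mean, precision

def mahalanobis_score(features: torch.Tensor, mean: torch.Tensor,
                      precision: torch.Tensor) -> torch.Tensor:
    """Feature-based score: negative Mahalanobis distance to the ID feature
    distribution (higher = more ID-like)."""
    diff = features - mean
    return -torch.einsum("bi,ij,bj->b", diff, precision, diff)

# Toy usage with random tensors standing in for a model's logits and
# penultimate-layer features.
logits = torch.randn(4, 3)            # (batch, num_classes)
id_features = torch.randn(100, 16)    # features of held-out ID data
mean, precision = fit_id_gaussian(id_features)
test_features = torch.randn(4, 16)
print(mcp_score(logits), mahalanobis_score(test_features, mean, precision))
```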

We evaluated the OOD detection methods on unseen images for 2 criteria:

To study these methods, we created 2 NEW out-of-distribution detection benchmarks:

Studying the results, we see that:

But why do these trends occur? To study this, we created counterfactual data by synthetically removing the OOD artefact from each OOD image in our benchmarks. We did this using intra-image interpolation, replacing the artefact region with a patch taken from elsewhere in the same image. These counterfactual images are available on our GitHub repository if you would like to use them in your own research, along with more details on how the data was made.
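
As a rough illustration of the idea (not the exact interpolation scheme used to build the released data), the sketch below overwrites a known artefact region with a patch from the same image; the bounding box and source offset are assumed to be given.

```python
import numpy as np

def make_counterfactual(image: np.ndarray, artefact_box: tuple,
                        source_offset: tuple) -> np.ndarray:
    """Replace the artefact region with a patch taken from the same image.

    image        : (H, W) or (H, W, C) array containing an OOD artefact
    artefact_box : (y0, y1, x0, x1) bounding box around the artefact
    source_offset: (dy, dx) shift to a clean region of the same image
    """
    y0, y1, x0, x1 = artefact_box
    dy, dx = source_offset
    counterfactual = image.copy()
    counterfactual[y0:y1, x0:x1] = image[y0 + dy:y1 + dy, x0 + dx:x1 + dx]
    return counterfactual

# Toy usage: remove a bright square "artefact" from a synthetic image.
img = np.zeros((64, 64))
img[10:20, 10:20] = 1.0                                # simulated artefact
clean = make_counterfactual(img, (10, 20, 10, 20), (30, 30))
```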

We then studied the model predictions and the eXplainable AI heatmaps (using LRP), both with and without the OOD artefact. What we observe is that OOD artefacts can cause HIGH-confidence diagnoses! But confidence-based methods assume that OOD artefacts should lead to high-entropy outputs, and this mismatch is a key reason why confidence-based methods have poor OOD detection performance.
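
As a toy illustration of that assumption, predictive entropy can be computed from the softmax output: confidence-based scores expect OOD inputs to produce flat, high-entropy outputs, whereas the artefacts above often produce the opposite. The example values below are illustrative, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def predictive_entropy(logits: torch.Tensor) -> torch.Tensor:
    """Entropy of the softmax output: low when confident, high when uncertain."""
    probs = F.softmax(logits, dim=1)
    return -(probs * probs.clamp_min(1e-12).log()).sum(dim=1)

confident = torch.tensor([[8.0, 0.0, 0.0]])   # peaked output -> low entropy
uncertain = torch.tensor([[1.0, 1.0, 1.0]])   # flat output   -> high entropy (ln 3)
print(predictive_entropy(confident), predictive_entropy(uncertain))
```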

We then studied the counterfactual datasets, looking at the model predictions both with and without the OOD artefact, and grouped the data into four categories:

We can then study what happens when we apply OOD detection methods. We study both a confidence-based method (MCP) and a feature-based method (Mahalanobis score). We first calculate the scoring function for each test image, then apply a threshold at the 75th percentile of the held-out ID data and discard all diagnoses below the threshold. Looking at the results, we see that the strengths of one method are the weaknesses of the other. This motivates us to combine BOTH a confidence-based method and a feature-based method to mitigate their respective weaknesses. We see that combining these methods results in more trustworthy predictions, but at a higher dismissal rate.
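
A minimal sketch of this rejection scheme is below, written to follow the description above: fit a threshold on held-out ID scores, dismiss predictions scoring below it, and dismiss a prediction if either method flags it. It assumes both scoring functions return higher values for more in-distribution inputs; the function names and the exact combination rule are illustrative assumptions rather than the paper's API.

```python
import numpy as np

def fit_threshold(id_scores: np.ndarray, percentile: float = 75.0) -> float:
    """Set the dismissal threshold at a percentile of held-out ID scores."""
    return np.percentile(id_scores, percentile)

def accept_mask(test_scores: np.ndarray, threshold: float) -> np.ndarray:
    """Keep predictions whose score is at or above the threshold."""
    return test_scores >= threshold

def combined_accept(mcp_scores: np.ndarray, maha_scores: np.ndarray,
                    mcp_thr: float, maha_thr: float) -> np.ndarray:
    """Dismiss a prediction if *either* method flags it: more trustworthy
    retained predictions, at the cost of a higher dismissal rate."""
    return accept_mask(mcp_scores, mcp_thr) & accept_mask(maha_scores, maha_thr)

# Toy usage with random scores standing in for held-out ID and test data.
rng = np.random.default_rng(0)
id_mcp, id_maha = rng.random(500), rng.random(500)
test_mcp, test_maha = rng.random(8), rng.random(8)
keep = combined_accept(test_mcp, test_maha,
                       fit_threshold(id_mcp), fit_threshold(id_maha))
print(keep)
```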

Dive into our research!

📄 Paper 📊 Poster </> Code