Pallavi Krishna
4 min read · Jun 29, 2022


Confusion Matrix and the problem of visualization

The true power of visualizations lies in simplicity and clarity. While much is made of complex data visualizations, often stacked one over the other, the simplicity of plain numbers is often disregarded.

To explain the point, let’s take the example of the Confusion Matrix.

The Confusion Matrix is one of the critical visualizations used in data science. While working with data, we often come across results that are hard to present well, and what gets left out of a presentation is sometimes exactly the information that would change our opinion of the data. Let us see how this happens, first for an Amazon rating and then for a classification model.

Consider a simple thing like the rating of a product in a specific category on Amazon. Next to the product, we immediately see the number of ratings and the average rating.

Based on this, we immediately see that a product rated 4.5 by only 20 people is not necessarily better than a product rated 4 by 3,971 people. (In a very early version of Amazon, we did not see the number of people who had rated a product.)

What if we only saw the rating? And not the number of ratings?

Immediately we see what a loss of information that is.

What if we also saw these numbers —

  1. The number of times a product has been rated in this category vs. the number of times it has been bought. If we also saw the total number of times this product was bought, say 7,000, it would give us an added view of the product.
  2. Out of the total number of items bought in this category, the number of times this particular item was bought. Say, out of all the smart audio sunglasses bought, how many times was this particular pair bought? That number would give me a sense of which were the best-selling smart audio sunglasses; it could also be its position in the bestseller list for that particular category, the information I would otherwise search for in Amazon's bestseller lists.
  3. Additionally, the number of times people have bought this item in the last 24 hours could also help, and is used by some online marketplaces to great effect. (A sketch of such a listing follows below.)
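To make the difference concrete, here is a minimal Python sketch of such a listing. The product names and all the numbers are invented for illustration; the point is only how differently the same two products read when the counts appear next to the average rating.

```python
# Hypothetical listings: names and numbers are made up for illustration.
products = [
    {"name": "Smart Audio Sunglasses A", "avg_rating": 4.5,
     "num_ratings": 20, "num_bought": 150, "bought_last_24h": 2},
    {"name": "Smart Audio Sunglasses B", "avg_rating": 4.0,
     "num_ratings": 3971, "num_bought": 7000, "bought_last_24h": 85},
]

# View 1: only the average rating. Product A looks strictly better.
for p in products:
    print(f'{p["name"]}: {p["avg_rating"]} stars')

print()

# View 2: the same products with the extra numbers next to the rating.
for p in products:
    print(
        f'{p["name"]}: {p["avg_rating"]} stars '
        f'({p["num_ratings"]} ratings, {p["num_bought"]} bought, '
        f'{p["bought_last_24h"]} in the last 24 hours)'
    )
```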

It has to be noted that marketplace economics and regulations would also have to enter the conversation at this point.

However, purely for the purpose of visualization, we can see how one additional number, one additional statistic, can change our perception by at least 102 degrees, if not a full 180.

The Confusion Matrix is one such critical and brilliant use of visualization in evaluating a model.

In a classification model, accuracy (accuracy = number of correct predictions / total number of predictions), otherwise a great metric for testing a model, does not work very well on its own.

Why is this so? Because only if we see the misclassifications will we understand our model's behavior.
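As a small illustration of why accuracy alone falls short, here is a minimal sketch (with made-up labels) in which two hypothetical models score the same 80 percent accuracy while making very different kinds of mistakes.

```python
# Two hypothetical models evaluated on the same 10 ground-truth labels.
# 1 = has the illness, 0 = does not (data invented for illustration).
y_true   = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred_a = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]  # one false negative, one false positive
y_pred_b = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]  # two false negatives, no false positives

def accuracy(y_true, y_pred):
    # accuracy = number of correct predictions / total number of predictions
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Both models score 80% accuracy, yet their mistakes are very different.
print(accuracy(y_true, y_pred_a))  # 0.8
print(accuracy(y_true, y_pred_b))  # 0.8
```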

Let's take the Wikipedia explanation of the Confusion Matrix, which is fairly comprehensive: a 2x2 table with the actual classes along one axis and the predicted classes along the other, so that every prediction falls into one of four cells.

If my model predicted whether a person would contract a particular kind of illness with 80 percent accuracy, what do I understand? I understand only that 20 percent of the time, my model will predict wrongly.

In the same model evaluation, if I add that 10 percent of the time it wrongly predicted that a person would get cancer, and 10 percent of the time it wrongly predicted that a person would not get cancer, things become much clearer, and I know whether I want to use this model or not. To get the exact numbers, let's look at the Confusion Matrix of our model.
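Since the original figure is not reproduced here, the following is a minimal sketch of what that matrix might look like. The test set of 15 people, the labels, and the predictions are all invented for illustration, chosen so that the model makes exactly the three mistakes described below; scikit-learn's confusion_matrix is just one convenient way to compute it.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical ground truth and predictions for 15 people.
# 1 = has cancer, 0 = does not have cancer (invented for illustration).
y_true = [1, 1, 1, 1, 1,  0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0,  1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

# Rows are the actual class, columns are the predicted class.
# With labels=[0, 1] the layout is:
#   [[true negatives,  false positives],
#    [false negatives, true positives ]]
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
print(cm)
# [[ 9  1]
#  [ 2  3]]
```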

So my model has

  1. Predicted wrongly a total of 3 times, which is what the accuracy of the model captures.
  2. Among those wrong predictions, we see that it has wrongly predicted that a person has cancer only once, which we call a false positive.
  3. It has wrongly predicted that a person does not have cancer twice, which we call a false negative.

In this case, we are more concerned about a false negative than a false positive, for obvious reasons, but this can change from case to case.

This is the data a confusion matrix provides us, and this visualization of the data gives us a fair way to assess a classification model.

Where

True Positive / True Negative = when the model prediction is the same as the actual / ground truth / label

False Positive = when the model predicts the positive class (here, has cancer) but the actual label is negative

False Negative = when the model predicts the negative class (does not have cancer) but the actual label is positive
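Continuing the same hypothetical example, the four quantities can be read straight off the matrix. For a binary problem, scikit-learn lets us unpack them in one line, and the 80 percent accuracy falls out of the same four numbers.

```python
from sklearn.metrics import confusion_matrix

# Same hypothetical labels as before: 1 = has cancer, 0 = does not.
y_true = [1, 1, 1, 1, 1,  0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0,  1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

# For a binary problem, ravel() flattens the 2x2 matrix into
# (true negatives, false positives, false negatives, true positives).
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

print(tn, fp, fn, tp)                   # 9 1 2 3
print((tp + tn) / (tp + tn + fp + fn))  # 0.8 (the 80 percent accuracy from before)
```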

Just this representation of the model's metrics allows us to interpret the model differently, and more accurately.

Nothing demonstrates the power of visualization to me better than the Confusion Matrix. It doesn't have to be a fancy graph or a large, complex visualization; it can be just two or more different numbers or statistics next to each other. And sometimes that is all we need to get our much-revered insights.

Happy Analyzing!
