# Confused by The Confusion Matrix: What’s the difference between Hit Rate, True Positive Rate, Sensitivity, Recall and Statistical Power?

If you tried to answer the question in the title, you’ll be disappointed to find out that it is actually a trick question: there is essentially no difference between the listed terms. Just like the issue mentioned in ANCOVA and Moderation, different terms are often used for the same thing, especially when they belong to different fields. This post will attempt to dispel the confusion by bringing these terms together and explaining how to interpret the cells of a confusion matrix in the context of detecting an effect.

The image above captures the commonly used terms for each cell in the confusion matrix. The blue cells are desired outcomes, while the red cells are errors. Here I use the context of whether an effect exists and whether the effect was observed, as opposed to the respective “Actual Class” and “Predicted Class” that are commonly used to explain confusion matrices. I believe this is a more intuitive way to understand the information in each cell, as the terms “Actual Class” and “Predicted Class” often confuse people as well.

“True Positive”, “False Positive”, “True Negative” and “False Negative” are perhaps the most popular labels for the cells in a confusion matrix, but how many times have you seen someone pause to try and figure out what “True Negative” and “False Negative” mean? The “Type I” and “Type II” errors of statistical hypothesis testing are even worse, as the names do not even give any clue about the type of error being committed. Some people may not even realise that these statistical terms are actually part of a confusion matrix.

Considering how unintuitive these terms are, I prefer to use the signal detection theory version of these labels: “Hit”, “False Alarm”, “Correct Rejection” and “Miss”. It actually makes a lot of sense when we put them into context:

• If I observed an effect when the effect exists, it is a Hit
• If I didn’t observe an effect when the effect exists, it is a Miss
• If I observed an effect when no effect exists, it is a False Alarm
• If I didn’t observe an effect when no effect exists, it is a Correct Rejection
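The four rules above can be sketched as a small lookup. This is only an illustration: the function name `label_outcome` and the list of studies are made up for this example.

```python
from collections import Counter

def label_outcome(effect_exists: bool, effect_observed: bool) -> str:
    """Map one (reality, observation) pair to its signal detection label."""
    if effect_exists:
        return "Hit" if effect_observed else "Miss"
    return "False Alarm" if effect_observed else "Correct Rejection"

# Hypothetical studies, each recorded as (effect exists, effect observed).
studies = [
    (True, True), (True, False), (False, True),
    (False, False), (False, False), (True, True),
]

# Each cell of the confusion matrix is simply a count of one label.
cells = Counter(label_outcome(exists, observed) for exists, observed in studies)

for label in ("Hit", "Miss", "False Alarm", "Correct Rejection"):
    print(f"{label}: {cells[label]}")
```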

This version of the labels already carries the connotation of what is right and what is wrong, so it becomes very easy to infer immediately what exactly is going on. For this reason, I will use this version of the labels for the rest of the post. At this point, it is important to note that the cells in the confusion matrix contain the absolute number of occurrences for each situation, which should NOT be confused with rates such as “Hit Rate” or “True Positive Rate” that express probabilities. I will next explain how these different rates are calculated.

## Probabilities Based on The Existence of An Effect

Now that you are aware that the cells in a confusion matrix contain the absolute number of occurrences, you might be wondering how the other terms like “Hit Rate” and “True Positive Rate” come about. These rates are calculated conditional on the existence of an effect, that is, with reference to only a specific portion of the confusion matrix and not the whole of it (red outlines in image directly above). For example, “Hit Rate” is calculated by taking the number of Hits divided by the total number of occurrences when the effect exists (i.e. total number of Hits plus Misses); the “Miss Rate” would then simply be 1 minus the “Hit Rate”. Conversely, the “False Alarm Rate” is calculated by taking the number of False Alarms divided by the total number of occurrences when the effect doesn’t exist (i.e. total number of False Alarms plus Correct Rejections); the “Correct Rejection Rate” would then simply be 1 minus the “False Alarm Rate”.
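As a minimal sketch of these calculations, the snippet below computes the four rates from cell counts. The counts and the function name `rates_given_effect` are invented purely for illustration.

```python
def rates_given_effect(hits, misses, false_alarms, correct_rejections):
    """Rates conditioned on whether the effect actually exists."""
    effect_exists = hits + misses                   # left-hand portion of the matrix
    no_effect = false_alarms + correct_rejections   # right-hand portion of the matrix

    hit_rate = hits / effect_exists
    false_alarm_rate = false_alarms / no_effect
    return {
        "hit_rate": hit_rate,
        "miss_rate": 1 - hit_rate,
        "false_alarm_rate": false_alarm_rate,
        "correct_rejection_rate": 1 - false_alarm_rate,
    }

# Hypothetical counts: 100 cases where the effect exists, 900 where it does not.
rates = rates_given_effect(hits=80, misses=20, false_alarms=45, correct_rejections=855)

for name, value in rates.items():
    print(f"{name}: {value:.3f}")
```

Note that each pair of rates sums to 1 only within its own portion of the matrix; the Hit Rate and the False Alarm Rate are computed from different denominators and are free to vary independently.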

This is where various terms for the same concept start to appear. “Sensitivity” and “Recall” are the same as “Hit Rate”, “Specificity” and “Selectivity” are the same as “Correct Rejection Rate”, and “Fall-out” is the same as “False Alarm Rate”. “Sensitivity” and “Specificity” are more commonly used in the medical field, where the interest is in measuring the performance of a diagnostic test, while “Recall” and “Fall-out” are more commonly used in machine learning to measure the quality of predictions. Even the field of statistics has its own lingo, although they are all really referring to the same thing.

Most researchers should be familiar with the “Significance Level” (α), which is the probability of committing a Type I error. It is probably not too hard to relate this to a “False Alarm Rate”, since it is the probability of observing an effect when the effect doesn’t exist. But some researchers may not have connected the concept of “Statistical Power” with “Hit Rate”. The word “power” easily misleads researchers into thinking that a study with high power is powerful enough to make a conclusion. This is not what statistical power means; it is simply the probability of observing an effect when the effect exists. This becomes especially clear when we think of a Type II error as a Miss and its probability (β) as the Miss Rate: 1 minus the Miss Rate (1 − β) gives us the probability of a Hit when the effect exists.
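To make the correspondence concrete, here is a rough simulation sketch. It assumes a one-sided z-test and an effect size of 2.5 standard errors; both are choices made only for this illustration, as are all names and constants in the code.

```python
import random
from statistics import NormalDist

ALPHA = 0.05          # significance level, i.e. the intended False Alarm Rate
TRUE_SHIFT = 2.5      # assumed size of the real effect, in standard errors
N_STUDIES = 20_000    # simulated studies per condition

# One-sided z-test: an effect is "observed" when the statistic exceeds this cutoff.
critical_z = NormalDist().inv_cdf(1 - ALPHA)

rng = random.Random(42)  # fixed seed so the sketch is repeatable

def observed_rate(shift: float) -> float:
    """Share of simulated studies in which an effect is observed."""
    observed = sum(rng.gauss(shift, 1.0) > critical_z for _ in range(N_STUDIES))
    return observed / N_STUDIES

false_alarm_rate = observed_rate(0.0)      # no effect exists
hit_rate = observed_rate(TRUE_SHIFT)       # effect exists

# Statistical power computed analytically for the same test.
theoretical_power = 1 - NormalDist(TRUE_SHIFT, 1.0).cdf(critical_z)

print(f"False Alarm Rate: {false_alarm_rate:.3f} (alpha = {ALPHA})")
print(f"Hit Rate:         {hit_rate:.3f} (power = {theoretical_power:.3f})")
```

Under these assumptions, the simulated False Alarm Rate should land close to α and the simulated Hit Rate close to the analytical power.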

I hope this clears up what the different rates mean, and you are now comfortable with the various terms. At the end of this post, I will use the confusion matrix to illustrate the difference between Frequentist and Bayesian hypothesis testing. But before that, I will explain what happens when probabilities are calculated based on whether or not an effect was observed.

## Probabilities Based on Whether Effect was Observed

A lesser-known but nonetheless important way of measuring accuracy is to calculate probabilities based on whether an effect was observed, such as the probability that an effect exists when an effect is observed. As before, the rates are calculated with reference to only a specific portion of the confusion matrix (red outlines in the image directly above). For example, “False Discovery Rate” is calculated by taking the number of False Alarms divided by the total number of occurrences when an effect was observed (i.e. total number of Hits plus False Alarms), while the “False Omission Rate” is calculated by taking the number of Misses divided by the total number of occurrences when an effect was not observed (i.e. total number of Misses plus Correct Rejections).

The terms “True Discovery Rate” and “True Omission Rate” have been placed in parentheses because they are not established terms. The more commonly used terms are “Positive Predictive Value” and “Negative Predictive Value”, but I think it is a lot more intuitive to use “True Discovery Rate” as it connotes “the rate of what is true in what I have discovered”. After all, it is just the complement of its counterpart (1 minus the “False Discovery Rate”), which makes the association a lot more direct. Another term for “True Discovery Rate” is “Precision”, which is the machine learning term commonly paired with “Recall” that has been introduced above.
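The rates in this section can be sketched in the same way as before. The counts and the function name `rates_given_observation` are hypothetical and used only for illustration.

```python
def rates_given_observation(hits, misses, false_alarms, correct_rejections):
    """Rates conditioned on whether an effect was observed."""
    observed = hits + false_alarms                # top portion of the matrix
    not_observed = misses + correct_rejections    # bottom portion of the matrix

    false_discovery_rate = false_alarms / observed
    false_omission_rate = misses / not_observed
    return {
        "false_discovery_rate": false_discovery_rate,
        # a.k.a. Positive Predictive Value or Precision
        "true_discovery_rate": 1 - false_discovery_rate,
        "false_omission_rate": false_omission_rate,
        # a.k.a. Negative Predictive Value
        "true_omission_rate": 1 - false_omission_rate,
    }

# Hypothetical counts: 125 observed effects, 875 non-observations.
rates = rates_given_observation(hits=80, misses=20, false_alarms=45, correct_rejections=855)

for name, value in rates.items():
    print(f"{name}: {value:.3f}")
```

With these made-up counts, the Hit Rate is 80 / 100 = 0.8, yet only 80 of the 125 observed effects are real, which is why the two kinds of rates need to be read separately.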

Knowing the True Discovery Rate is as important as knowing the Hit Rate, if not more so. This is especially the case when the prevalence of the effect is low. In such situations, the chance that an observed effect is actually a False Alarm becomes very high. This is where Bayesians try to account for what Frequentists leave out of their analyses, which I will be discussing next using confusion matrices.

## Bonus: Frequentist vs Bayesian Analysis using Confusion Matrices

In my previous post on Bayesian Analysis & The Replication Crisis, I introduced the fundamental principles of Bayesian analysis, and how those principles may help to address the Replication Crisis. While working on this post, I realised that Bayesian analysis can be explained so much more easily with the help of a confusion matrix, provided that you are already familiar with it. The terms in Bayesian analysis are again another variation of what we have already learnt in the confusion matrix, but I will come to that in a while.

First, let us consider the scenario I mentioned where the prevalence of an effect is not very large. So far, we have been looking at confusion matrices that have been divided into four equal quadrants. In reality, however, the proportions are usually not equal. A more realistic confusion matrix with actual proportions has been constructed below:

When we talk about the prevalence of an effect, we are essentially referring to the probability that an effect exists in the first place. This is also known as the base rate, or the prior odds/probabilities commonly used in Bayesian analysis (different terms referring to the same thing again). We often assume that observing an effect by chance is a 50-50 affair. But as you can see from the image directly above, the area occupied by “Hits” and “Misses” is much smaller than the area occupied by “False Alarms” and “Correct Rejections”. This shows that the prevalence of the effect is small, and a chance observation is certainly not 50-50. Coupled with a low Hit Rate, the area occupied by Hits compared to everything else becomes very small. Even though the proportion of Correct Rejections is the largest, the observation of an effect is simply too unreliable because there is a very high chance of it being a False Alarm instead of a Hit.
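A quick numerical sketch shows how much prevalence matters. The prevalence, Hit Rate and False Alarm Rate below are invented values chosen only to illustrate the point, and `expected_cells` is a hypothetical helper.

```python
def expected_cells(n, prevalence, hit_rate, false_alarm_rate):
    """Expected cell counts for n tested effects under the given rates."""
    effect_exists = n * prevalence
    no_effect = n - effect_exists
    hits = effect_exists * hit_rate
    false_alarms = no_effect * false_alarm_rate
    return {
        "hits": hits,
        "misses": effect_exists - hits,
        "false_alarms": false_alarms,
        "correct_rejections": no_effect - false_alarms,
    }

def false_discovery_rate(cells):
    """Share of observed effects that are actually False Alarms."""
    return cells["false_alarms"] / (cells["hits"] + cells["false_alarms"])

# Same Hit Rate and False Alarm Rate, different prevalence (all values assumed).
rare = expected_cells(n=1000, prevalence=0.10, hit_rate=0.35, false_alarm_rate=0.05)
even = expected_cells(n=1000, prevalence=0.50, hit_rate=0.35, false_alarm_rate=0.05)

print(f"FDR at 10% prevalence: {false_discovery_rate(rare):.2f}")
print(f"FDR at 50% prevalence: {false_discovery_rate(even):.2f}")
```

With these assumed numbers, more than half of the observed effects are False Alarms when the effect is rare, compared with about one in eight when prevalence is 50-50, even though the False Alarm Rate is 0.05 in both cases.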

In the Frequentist approach to hypothesis testing, the most important statistic has traditionally been the p-value. The p-value is the probability of obtaining a result at least as extreme as the one observed when the effect doesn’t exist, and an effect is declared only when the p-value falls below the significance level (α). Hence, the smaller this threshold, the lower the False Alarm Rate (or Type I Error Rate), and the higher the Correct Rejection Rate. However, note that this method does not take the Hit Rate (or Statistical Power) into account, and it is also not concerned with the prevalence of the effect. This results in the problems illustrated by the confusion matrix with actual proportions.

In the Bayesian approach, Prior Probabilities are updated with Current Evidence (also known as Likelihood) to produce Posterior Probabilities. If all these sound foreign to you, please refer to my previous post for the explanations. But even without referring to my previous post, I discovered that these are again different terms that refer to the same things in the confusion matrix. Referring to the image directly above, Prior Probability is actually the prevalence or base rate of an effect’s existence (green outlines in confusion matrix on the left). Likelihood is a little more complicated: it corresponds to the Hit Rate (or Statistical Power, red outlines in confusion matrix on the left), which is then divided by the probability of observing an effect in general (yellow outlines in confusion matrix on the left). Posterior Probability turns out to be the True Discovery Rate (red outlines in confusion matrix on the right), which cannot be calculated directly, but can be estimated based on the Prior Probability taken from previously known information, and the Likelihood obtained from information in the present study. Comparing the True Discovery Rate (or Posterior Probability) with the Base Rate (or Prior Probability) shows how far the evidence has shifted our belief; expressed as odds, the ratio of the posterior odds to the prior odds gives the Bayes Factor, which suggests if the evidence is leaning towards the Null or Alternative Hypothesis.
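The mapping can be checked numerically with Bayes’ theorem. This is a sketch under assumed values: the prior, Hit Rate, False Alarm Rate and cell counts are invented, and the Bayes Factor is computed with the conventional odds-based definition (posterior odds divided by prior odds).

```python
def posterior_probability(prior, hit_rate, false_alarm_rate):
    """P(effect exists | effect observed) via Bayes' theorem."""
    # Probability of observing an effect in general (Hits plus False Alarms).
    p_observed = hit_rate * prior + false_alarm_rate * (1 - prior)
    return hit_rate * prior / p_observed

# Assumed inputs: base rate (prior), Hit Rate (power), False Alarm Rate (alpha).
prior, hit_rate, false_alarm_rate = 0.10, 0.35, 0.05

posterior = posterior_probability(prior, hit_rate, false_alarm_rate)

# Cross-check against counts: 1000 cases at these rates give 35 Hits, 45 False Alarms.
true_discovery_rate = 35 / (35 + 45)

# How much the evidence shifted belief, on the probability and the odds scale.
posterior_to_prior = posterior / prior
prior_odds = prior / (1 - prior)
posterior_odds = posterior / (1 - posterior)
bayes_factor = posterior_odds / prior_odds  # reduces to hit_rate / false_alarm_rate

print(f"Posterior probability: {posterior:.4f}")
print(f"True Discovery Rate:   {true_discovery_rate:.4f}")
print(f"Posterior / prior:     {posterior_to_prior:.3f}")
print(f"Bayes Factor:          {bayes_factor:.3f}")
```

The posterior computed from the three rates matches the True Discovery Rate read straight off the counts, which is the equivalence described above.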

* * * * * * * * * *

The purpose of using confusion matrices to differentiate the two statistical approaches is not to show that the Bayesian approach is better because it takes more information into account. The point I am trying to drive home is that, very often, these concepts seem confusing when unfamiliar terms are used. But if we realise that there are equivalent terms that are more intuitive or more familiar to us, the concepts become a lot easier to understand. I still do not know why these terms are not standardised across the different fields, but I will continue to advocate the use of the signal detection theory version of “Hit”, “False Alarm”, “Correct Rejection” and “Miss”, because they are really a lot more intuitive to conceptualise.