In the previous post, I explained how to interpret a confusion matrix by clarifying the many different terms that actually mean the same thing. Unfortunately, the confusion does not end there. If you read the previous post, you have probably realised that a simple Hit Rate or False Alarm Rate is not very informative on its own, because each rate looks at only one portion of the confusion matrix. For that reason, different fields have come up with different measures of accuracy that try to combine information from several portions of the confusion matrix. This post covers the more commonly known ones.
What you read in the title was not a mistake – there is indeed an official measure of accuracy called “Accuracy”. This is obviously a setup for confusion when someone asks, “So what is the accuracy of your test’s performance?” Does the person mean accuracy in general, or the specific measure of “Accuracy” with its own method of calculation?
The calculation of “Accuracy” is actually very simple and intuitive. It is basically the number of Hits plus Correct Rejections divided by the total number of occurrences in the entire confusion matrix (i.e. the proportion occupied by blue cells in the whole matrix).
Accuracy = (Hits + Correct Rejections) / Total Occurrences
However, “Accuracy” suffers from the same problem as the Frequentist analysis mentioned in the previous post: it does not take prevalence into account. When the prevalence of an effect’s existence is very low, high Accuracy can be achieved as long as the Correct Rejection Rate is high, even if few or none of the existing effects are detected (see confusion matrix with actual proportions). This is known as the “Accuracy Paradox”. Because of this, other measures of accuracy are generally recommended over “Accuracy”.
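To make the Accuracy Paradox concrete, here is a minimal Python sketch. The function name `accuracy` and the counts are hypothetical, chosen only for illustration: the effect exists in 10 out of 1,000 cases, and the test never reports observing it.

```python
def accuracy(hits, misses, false_alarms, correct_rejections):
    """Proportion of all occurrences that fall in the Hit or Correct Rejection cells."""
    total = hits + misses + false_alarms + correct_rejections
    return (hits + correct_rejections) / total


# Hypothetical counts: prevalence is 10 / 1000, and every case is rejected.
acc = accuracy(hits=0, misses=10, false_alarms=0, correct_rejections=990)
print(acc)  # 0.99, even though not a single existing effect was detected
```

An Accuracy of 0.99 looks impressive here, yet the Hit Rate is 0.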
Cohen’s Kappa is a statistic that adjusts the Observed Accuracy by taking chance into account. Chance is defined as the Expected Accuracy, calculated by adding the probability of randomly observing an effect when the effect exists to the probability of randomly not observing an effect when the effect doesn’t exist.
Using the image directly above, the probability of randomly observing an effect when the effect exists is calculated by multiplying the prevalence of an effect’s existence (green outlines in confusion matrix on the left) by the probability of making an observation in general (yellow outlines in confusion matrix on the left). Similarly, the probability of randomly not observing an effect when the effect doesn’t exist is calculated by multiplying the prevalence of an effect’s non-existence (green outlines in confusion matrix on the right) by the probability of not making an observation in general (yellow outlines in confusion matrix on the right). The sum of these two probabilities makes up the Expected Accuracy.
The Expected Accuracy is then subtracted from the Observed Accuracy, giving the supposed accuracy that is not by chance (numerator). This non-chance Accuracy is then compared to the probability of not being accurate by chance (1 minus the Expected Accuracy, denominator) as a ratio, which produces the Kappa statistic. It is possible for the statistic to be negative, suggesting that the Observed Accuracy is worse than random.
Kappa = (Observed Accuracy – Expected Accuracy) / (1 – Expected Accuracy)
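The steps above can be sketched in Python as follows. This is only an illustration: the function name `cohens_kappa`, the variable names, and the counts are all hypothetical, and the function does not guard against an Expected Accuracy of exactly 1 (which would divide by zero).

```python
def cohens_kappa(hits, misses, false_alarms, correct_rejections):
    """Kappa = (Observed Accuracy - Expected Accuracy) / (1 - Expected Accuracy)."""
    total = hits + misses + false_alarms + correct_rejections
    observed_accuracy = (hits + correct_rejections) / total

    # Prevalence of the effect's existence, and probability of making an
    # observation in general (regardless of whether the effect exists).
    prevalence = (hits + misses) / total
    p_observe = (hits + false_alarms) / total

    # Chance agreement: randomly observing an existing effect, plus
    # randomly not observing a non-existent effect.
    expected_accuracy = prevalence * p_observe + (1 - prevalence) * (1 - p_observe)

    return (observed_accuracy - expected_accuracy) / (1 - expected_accuracy)


# Hypothetical counts: Observed Accuracy = 0.7, Expected Accuracy = 0.5.
kappa = cohens_kappa(hits=40, misses=10, false_alarms=20, correct_rejections=30)
print(round(kappa, 3))  # 0.4
```

With these counts, an Observed Accuracy of 0.7 shrinks to a Kappa of about 0.4 once the 0.5 expected by chance is removed.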
The Kappa statistic also has its limitations, mainly because of how it can be misinterpreted if information about the distribution of occurrences in the confusion matrix is not taken into account. I will not be delving into these limitations, but more information can be found here.
The F1 score (also known as the F-score or F-measure, but NOT to be confused with the F-statistic or F-value) is a measure commonly used for binary classification accuracy in machine learning. Its calculation takes both the Hit Rate (or Recall) and True Discovery Rate (or Precision) into consideration.
The F1 score is the harmonic mean of the Hit Rate and True Discovery Rate; the harmonic mean is a suitable way of averaging rates. It is calculated by taking the inverse of the average of the inverse Hit Rate and the inverse True Discovery Rate, which simplifies to the following formula:
F1 = 2 * [(Hit Rate * True Discovery Rate) / (Hit Rate + True Discovery Rate)]
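As a quick check that the simplified formula really is the harmonic mean, here is a small Python sketch. The function names and counts are hypothetical, and the sketch assumes at least one Hit (with zero Hits the divisions below would fail).

```python
def f1_score(hits, misses, false_alarms):
    """F1 from the simplified formula; assumes hits > 0."""
    hit_rate = hits / (hits + misses)                    # Recall
    true_discovery_rate = hits / (hits + false_alarms)   # Precision
    return 2 * (hit_rate * true_discovery_rate) / (hit_rate + true_discovery_rate)


def harmonic_mean(a, b):
    """Inverse of the average of the inverses."""
    return 1 / ((1 / a + 1 / b) / 2)


# Hypothetical counts: Hit Rate = 0.8, True Discovery Rate = 2/3.
f1 = f1_score(hits=40, misses=10, false_alarms=20)
print(round(f1, 4))                         # 0.7273
print(round(harmonic_mean(0.8, 2 / 3), 4))  # 0.7273
```

Note that the function takes no Correct Rejections argument at all: that cell of the confusion matrix never enters the calculation.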
A criticism of the F1 score is that it only focuses on Hits and neglects Correct Rejections. In a situation where there are Correct Rejections but no Hits, the F1 score would not be informative as the value returned is 0. Hence, depending on the F1 score alone to determine accuracy is not sufficient.
Matthews Correlation Coefficient (MCC)
The Matthews correlation coefficient (MCC), more generally known as the phi-coefficient (φ), is also commonly used for binary classification accuracy in machine learning. Its calculation takes all cells of the confusion matrix into consideration, and is in essence a correlation coefficient between the existence of the effect and the observation of the effect. Since it is a correlation coefficient, a returned value of +1.00 represents perfect accuracy, 0 represents accuracy no better than chance, and -1.00 represents perfect inaccuracy. It is also related to the χ² statistic, in that MCC² is the average of χ² over the n occurrences (i.e. MCC² = χ² / n). The MCC is calculated based on the following formula:
MCC = (Hits * Correct Rejections – False Alarms * Misses) / √[(Hits + False Alarms) * (Hits + Misses) * (Correct Rejections + False Alarms) * (Correct Rejections + Misses)]
The MCC is considered to be more informative than the F1 score and Accuracy, because it takes the distribution of occurrences in the confusion matrix into account.
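The following Python sketch computes the MCC using the standard four-cell formula (product of the correct cells minus product of the incorrect cells, divided by the square root of the product of the four marginal totals) and checks the MCC² = χ² / n relationship numerically. The function names and counts are hypothetical, the χ² here is the plain Pearson statistic without continuity correction, and neither function guards against an empty row or column.

```python
import math


def mcc(hits, misses, false_alarms, correct_rejections):
    """Matthews correlation coefficient (phi) for a 2x2 confusion matrix."""
    numerator = hits * correct_rejections - false_alarms * misses
    denominator = math.sqrt(
        (hits + false_alarms) * (hits + misses)
        * (correct_rejections + false_alarms) * (correct_rejections + misses)
    )
    return numerator / denominator


def chi_squared(hits, misses, false_alarms, correct_rejections):
    """Pearson chi-squared for the same table, from observed vs expected counts."""
    n = hits + misses + false_alarms + correct_rejections
    # Rows: effect observed / not observed. Columns: effect exists / doesn't.
    table = [[hits, false_alarms], [misses, correct_rejections]]
    row_totals = [sum(row) for row in table]
    col_totals = [table[0][j] + table[1][j] for j in range(2)]
    total = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            total += (table[i][j] - expected) ** 2 / expected
    return total


# Hypothetical counts, n = 100.
counts = dict(hits=40, misses=10, false_alarms=20, correct_rejections=30)
print(round(mcc(**counts), 4))                # 0.4082
print(round(chi_squared(**counts) / 100, 4))  # 0.1667
print(round(mcc(**counts) ** 2, 4))           # 0.1667
```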
Signal vs Noise Measures
Signal-to-noise Ratio is a measure more commonly used in engineering to compare the level of a desired signal to the level of background noise. However, the terms “Signal” and “Noise” are again different terms for referring to “Hits” and “False Alarms” respectively. Hence, a simple signal-to-noise ratio would just be a comparison of the number of Hits to the number of False Alarms.
d’ goes a step further than the signal-to-noise ratio, and is commonly used in Signal Detection Theory. It takes the distributions of signal and noise into account by measuring the distance between their means in standard deviation units, and can be calculated using the Z-scores of the Hit Rate and False Alarm Rate (i.e. the inverse of the standard normal cumulative distribution function applied to each rate), based on the following formula:
d’ = Z(Hit Rate) − Z(False Alarm Rate)
A flaw of the d’ is that it assumes the standard deviations for signal and noise are equal, even though they can be quite different in reality.
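A minimal Python sketch of the d’ calculation is shown below, using `statistics.NormalDist` from the standard library (Python 3.8+) for the Z transformation. The function name `d_prime` and the example rates are hypothetical. The sketch inherits the equal-variance assumption just described, and `inv_cdf` raises an error for rates of exactly 0 or 1, so such rates would need to be adjusted beforehand.

```python
from statistics import NormalDist


def d_prime(hit_rate, false_alarm_rate):
    """d' = Z(Hit Rate) - Z(False Alarm Rate); rates must lie strictly between 0 and 1."""
    z = NormalDist().inv_cdf  # inverse CDF of the standard normal distribution
    return z(hit_rate) - z(false_alarm_rate)


# Hypothetical rates.
print(round(d_prime(0.8, 0.4), 3))  # 1.095
print(d_prime(0.5, 0.5))            # 0.0, signal is indistinguishable from noise
```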
The ROC curve is a graphical plot of the Hit Rate against the False Alarm Rate at different threshold settings, also commonly used in Signal Detection Theory. It is generated by sweeping the threshold across its range and plotting, at each setting, the cumulative proportion of Hits (the Hit Rate) on the y-axis against the cumulative proportion of False Alarms (the False Alarm Rate) on the x-axis.
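As an illustration of how the points on the curve arise, the Python sketch below sweeps a threshold over some made-up scores. Everything here is hypothetical: the function name `roc_points`, the scores, and the convention that a case counts as "observed" when its score is at or above the threshold.

```python
def roc_points(signal_scores, noise_scores):
    """Return (False Alarm Rate, Hit Rate) pairs, from the strictest threshold to the laxest."""
    thresholds = sorted(set(signal_scores) | set(noise_scores), reverse=True)
    points = [(0.0, 0.0)]  # a threshold above every score: nothing is observed
    for t in thresholds:
        hit_rate = sum(s >= t for s in signal_scores) / len(signal_scores)
        false_alarm_rate = sum(s >= t for s in noise_scores) / len(noise_scores)
        points.append((false_alarm_rate, hit_rate))
    return points


# Hypothetical scores for cases where the effect exists (signal) or not (noise).
signal = [0.9, 0.8, 0.6, 0.4]
noise = [0.7, 0.5, 0.3, 0.2]
for false_alarm_rate, hit_rate in roc_points(signal, noise):
    print(false_alarm_rate, hit_rate)
```

Lowering the threshold can only add Hits and False Alarms, so both rates climb together from (0, 0) to (1, 1).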
The Area Under the ROC Curve (AUC) is the main measure of accuracy, whereby the larger the area the better the accuracy. Perfect accuracy is represented by the entire square, but since the Hit Rate and False Alarm Rate trade off against each other, it is almost impossible to achieve perfect accuracy. Under the same equal-variance assumption, the AUC is also related to d’ by the following formula, where Φ is the cumulative distribution function of the standard normal distribution:
AUC = Φ(d’ / √2)
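The Python sketch below shows two ways of arriving at an AUC. The first applies the trapezoidal rule to a handful of ROC points; the second evaluates the standard equal-variance relationship AUC = Φ(d’ / √2), where Φ is the standard normal cumulative distribution function. The function names and the example points are hypothetical, and the second function is only as good as the equal-variance assumption behind d’.

```python
from math import sqrt
from statistics import NormalDist


def auc_trapezoid(points):
    """Area under (False Alarm Rate, Hit Rate) points, sorted by False Alarm Rate."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


def auc_from_d_prime(d_prime):
    """AUC implied by d' under equal-variance normal signal and noise distributions."""
    return NormalDist().cdf(d_prime / sqrt(2))


# Hypothetical ROC points running from (0, 0) to (1, 1).
print(auc_trapezoid([(0.0, 0.0), (0.25, 0.5), (0.5, 0.75), (1.0, 1.0)]))  # 0.65625

print(auc_from_d_prime(0.0))  # 0.5, i.e. the diagonal of the square
print(auc_from_d_prime(1.0))  # roughly 0.76
```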
* * * * * * * * * *
The purpose of compiling all these commonly used measures of accuracy is not just for easy reference, but to also show how they are related to one another. The take-away from learning all these different measures is to realise that there is no one single measure that best represents accuracy. To understand the accuracy of a test’s performance, it is important to take all information into consideration, and not hastily jump to conclusions based on a few numbers. Ultimately, good statistical thinking is still the key to better appreciation of nuances in data analysis.