---
title: 'The "Zoo" of Performance Metrics: A Complete Guide to the Confusion Matrix'
description: "A comprehensive breakdown of every confusion matrix metric, explaining what it is, when to use it, and the formula behind it."
published: 2026-01-26
tags: ["ml"]
---

In data science, "Accuracy" is rarely enough. Depending on your problem—whether you are diagnosing a deadly disease, detecting spam, or predicting stock market trends—the way you define "success" changes completely.

This guide walks through **every single metric** derived from the Confusion Matrix, explaining what it is, when to use it, and the formula behind it.

---

You have probably seen the following table from the [Wikipedia article](https://en.wikipedia.org/wiki/Sensitivity_and_specificity) on sensitivity and specificity.
![](./wiki_table.png)

![](./what_the_fuck_is_going_on_here.gif)

## 1. The Foundation: The Four Quadrants

Everything starts here. These are the raw counts from your model's predictions.

* **True Positive (TP):** *The Hit.* The model predicted yes, and it was yes. (e.g., The fire alarm rang, and there was a fire.)
* **True Negative (TN):** *The Correct Rejection.* The model predicted no, and it was no. (e.g., The alarm stayed silent, and there was no fire.)
* **False Positive (FP):** *The False Alarm (Type I Error).* The model predicted yes, but it was actually no. (e.g., The alarm rang, but it was just burnt toast.)
* **False Negative (FN):** *The Miss (Type II Error).* The model predicted no, but it was actually yes. (e.g., The house burned down, and the alarm said nothing.)
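
As a minimal sketch, the four counts can be tallied directly from paired labels. The helper name `confusion_counts` and the sample data are made up for illustration, and the code assumes labels are encoded as 1 (positive) and 0 (negative).

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels where 1 = positive, 0 = negative."""
    tp = tn = fp = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == 1 and actual == 1:
            tp += 1  # the hit
        elif predicted == 0 and actual == 0:
            tn += 1  # the correct rejection
        elif predicted == 1 and actual == 0:
            fp += 1  # the false alarm
        else:
            fn += 1  # the miss (only case left for 0/1 labels)
    return tp, tn, fp, fn


# Hypothetical fire-alarm log: 1 = fire / alarm rang, 0 = no fire / alarm silent.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
print(f"TP={tp} TN={tn} FP={fp} FN={fn}")
```

Every metric in the rest of this guide is some ratio of these four numbers.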

---

## 2. The "Row" Metrics: How good are we at finding the truth?

These metrics look at the *actual* conditions (the rows of the matrix) and ask: "Did we capture them?"

### **True Positive Rate (TPR)**

* **Also known as:** Recall, Sensitivity, Hit Rate, Power.
* **Formula:** $\text{TPR} = \frac{\text{TP}}{\text{TP} + \text{FN}}$
* **The Gist:** Of all the *actual* positive cases, what percentage did we catch?
* **When to use:** Crucial for medical screenings or safety systems where missing a case is dangerous.

### **False Negative Rate (FNR)**

* **Also known as:** Miss Rate.
* **Formula:** $\text{FNR} = \frac{\text{FN}}{\text{TP} + \text{FN}}$
* **The Gist:** The complement of Recall ($\text{FNR} = 1 - \text{TPR}$). What percentage of actual positive cases did we miss?
* **When to use:** When you need to report the "failure rate" of a safety system.

### **True Negative Rate (TNR)**

* **Also known as:** Specificity, Selectivity.
* **Formula:** $\text{TNR} = \frac{\text{TN}}{\text{TN} + \text{FP}}$
* **The Gist:** Of all the *actual* negative cases, what percentage did we correctly identify as safe?
* **When to use:** Important in drug testing or criminal justice (you don't want to convict an innocent person).

### **False Positive Rate (FPR)**

* **Also known as:** Fall-out, Probability of False Alarm.
* **Formula:** $\text{FPR} = \frac{\text{FP}}{\text{TN} + \text{FP}}$
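* **The Gist:** Of all the *actual* negative cases, what percentage set off a false alarm? It is the complement of Specificity ($\text{FPR} = 1 - \text{TNR}$).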
* **When to use:** In user experience contexts (e.g., users will turn off a security camera if it sends too many false alerts).
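
Here is a small sketch of the four row metrics computed from raw counts. The function name `row_metrics` and the counts (100 actual positives, 900 actual negatives) are hypothetical. It assumes both classes are present, otherwise a division by zero occurs.

```python
def row_metrics(tp, tn, fp, fn):
    """Rates conditioned on the actual class (the rows of the matrix)."""
    actual_pos = tp + fn  # everything that was really positive
    actual_neg = tn + fp  # everything that was really negative
    return {
        "TPR": tp / actual_pos,
        "FNR": fn / actual_pos,
        "TNR": tn / actual_neg,
        "FPR": fp / actual_neg,
    }


# Hypothetical counts for illustration only.
rates = row_metrics(tp=80, tn=810, fp=90, fn=20)
for name, value in rates.items():
    print(f"{name}: {value:.3f}")
```

Note that the metrics come in complementary pairs: TPR + FNR = 1 and TNR + FPR = 1, so reporting one from each pair is enough.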

---

## 3. The "Column" Metrics: Can we trust the prediction?

These metrics look at the *predicted* values (the columns) and ask: "If the model says X, should I believe it?"

### **Positive Predictive Value (PPV)**

* **Also known as:** Precision.
* **Formula:** $\text{PPV} = \frac{\text{TP}}{\text{TP} + \text{FP}}$
* **The Gist:** If the model screams "Positive!", what is the probability it's telling the truth?
* **When to use:** Spam filters, YouTube recommendations, or high-stakes stock picking.

### **False Discovery Rate (FDR)**

* **Formula:** $\text{FDR} = \frac{\text{FP}}{\text{TP} + \text{FP}}$
* **The Gist:** If the model predicts positive, how often is it lying?
* **When to use:** Used heavily in multiple hypothesis testing and genetics to control the noise in large datasets.

### **Negative Predictive Value (NPV)**

* **Formula:** $\text{NPV} = \frac{\text{TN}}{\text{TN} + \text{FN}}$
* **The Gist:** If the model says "Negative/Safe", how sure can we be?
* **When to use:** Reassuring patients. If a cancer screening is negative, high NPV means the patient can truly relax.

### **False Omission Rate (FOR)**

* **Formula:** $\text{FOR} = \frac{\text{FN}}{\text{TN} + \text{FN}}$
* **The Gist:** Of all the cases we labeled "Negative", how many were actually dangerous misses?
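
The column metrics can be sketched the same way, this time dividing by what the model *predicted*. The function name `column_metrics` and the counts are hypothetical. The code assumes the model produced at least one positive and one negative prediction.

```python
def column_metrics(tp, tn, fp, fn):
    """Rates conditioned on the predicted class (the columns of the matrix)."""
    predicted_pos = tp + fp  # everything the model flagged as positive
    predicted_neg = tn + fn  # everything the model flagged as negative
    return {
        "PPV": tp / predicted_pos,
        "FDR": fp / predicted_pos,
        "NPV": tn / predicted_neg,
        "FOR": fn / predicted_neg,
    }


# Hypothetical counts for illustration only.
trust = column_metrics(tp=80, tn=810, fp=90, fn=20)
for name, value in trust.items():
    print(f"{name}: {value:.3f}")
```

With these counts the model catches 80 of 100 positives, yet fewer than half of its positive predictions are correct, which is why row and column metrics need to be read together.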

---

## 4. The Context Metrics: The Environment

These metrics describe the dataset itself, not just the model.

### **Prevalence**

* **Formula:** $\text{Prevalence} = \frac{\text{TP} + \text{FN}}{\text{TP} + \text{TN} + \text{FP} + \text{FN}}$
* **The Gist:** How common is the positive case in the whole population?
* **Why it matters:** In very rare diseases (low prevalence), it is easy to get high Accuracy just by guessing "No" every time.

### **Prevalence Threshold (PT)**

* **Formula:** $\text{PT} = \frac{\sqrt{\text{TPR} \times \text{FPR}} - \text{FPR}}{\text{TPR} - \text{FPR}}$
* **The Gist:** This is an advanced metric used in medical testing. It marks the prevalence level below which the *Positive Predictive Value (PPV)* of a test starts to fall off sharply, so positive results become increasingly unreliable. It helps determine if a screening test is even worth doing on a specific population.
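
To see how prevalence drags PPV down, the sketch below evaluates the formula above and then recomputes PPV at several prevalence levels using Bayes' rule. The function names and the test characteristics (TPR = 0.8, FPR = 0.1) are assumptions chosen for illustration. The formula is undefined when TPR equals FPR.

```python
import math


def prevalence_threshold(tpr, fpr):
    """Prevalence threshold, following the formula given above."""
    return (math.sqrt(tpr * fpr) - fpr) / (tpr - fpr)


def ppv_at_prevalence(tpr, fpr, prevalence):
    """PPV via Bayes' rule for a test with fixed TPR and FPR."""
    true_pos = tpr * prevalence          # share of population: sick and flagged
    false_pos = fpr * (1 - prevalence)   # share of population: healthy and flagged
    return true_pos / (true_pos + false_pos)


tpr, fpr = 0.8, 0.1  # hypothetical test characteristics
pt = prevalence_threshold(tpr, fpr)
print(f"prevalence threshold: {pt:.3f}")

for prevalence in (0.001, 0.01, 0.1, pt, 0.5):
    ppv = ppv_at_prevalence(tpr, fpr, prevalence)
    print(f"prevalence {prevalence:.3f} -> PPV {ppv:.3f}")
```

The same test that looks trustworthy at 50% prevalence produces mostly false alarms when the condition is rare.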

---

## 5. The "Composite" Metrics: The Summary Scores

Single numbers that try to summarize the entire matrix.

### **Accuracy (ACC)**

* **Formula:** $\text{ACC} = \frac{\text{TP} + \text{TN}}{\text{TP} + \text{TN} + \text{FP} + \text{FN}}$
* **The Gist:** The ratio of correct predictions to total predictions.
* **Warning:** Misleading on imbalanced datasets.

### **Balanced Accuracy (BA)**

* **Formula:** $\text{BA} = \frac{\text{TPR} + \text{TNR}}{2}$
* **The Gist:** The arithmetic mean of Sensitivity and Specificity.
* **Why it's better:** It treats the Positive and Negative classes equally, even if the dataset doesn't.
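
A quick sketch of why Accuracy misleads on imbalanced data: a "model" that always answers "No" on a hypothetical rare-disease dataset (10 sick, 990 healthy). The function names and counts are made up for illustration.

```python
def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)


def balanced_accuracy(tp, tn, fp, fn):
    tpr = tp / (tp + fn)  # sensitivity
    tnr = tn / (tn + fp)  # specificity
    return (tpr + tnr) / 2


# Always predicting "No": every sick patient is missed, every healthy one is "correct".
always_no = dict(tp=0, tn=990, fp=0, fn=10)

print(accuracy(**always_no))           # 0.99
print(balanced_accuracy(**always_no))  # 0.5
```

Accuracy rewards the useless model with 99%, while Balanced Accuracy reports 0.5, the same score as a coin flip.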

### **F1 Score**

* **Formula:** $\text{F1} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$
* **The Gist:** The go-to metric for imbalanced classification when you care about the Positive class. It penalizes you heavily if *either* Precision or Recall is low.
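
The "penalizes heavily" behaviour comes from F1 being a harmonic mean. The sketch below compares it with a plain average for a few made-up Precision/Recall pairs. Returning 0 when both inputs are 0 is a convention assumed here, not something the formula defines.

```python
def f1_score(precision, recall):
    if precision + recall == 0:
        return 0.0  # assumed convention for the undefined 0/0 case
    return 2 * precision * recall / (precision + recall)


# Hypothetical Precision/Recall pairs: Precision stays high while Recall collapses.
for precision, recall in [(0.9, 0.9), (0.9, 0.5), (0.9, 0.1)]:
    arithmetic = (precision + recall) / 2
    harmonic = f1_score(precision, recall)
    print(f"P={precision}, R={recall}: average={arithmetic:.2f}, F1={harmonic:.2f}")
```

When Recall drops to 0.1, the plain average still reads 0.50 but F1 falls to 0.18.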

### **Matthews Correlation Coefficient (MCC)**

* **Also known as:** Phi Coefficient.
* **Formula:** $\text{MCC} = \frac{\text{TP} \times \text{TN} - \text{FP} \times \text{FN}}{\sqrt{(\text{TP} + \text{FP})(\text{TP} + \text{FN})(\text{TN} + \text{FP})(\text{TN} + \text{FN})}}$
* **The Gist:** A correlation coefficient between observed and predicted classifications. It ranges from -1 (total disagreement) to +1 (perfect prediction).
* **Why it's the Gold Standard:** Unlike F1, MCC considers True Negatives, so it only scores high when the model does well on both classes. It is often regarded as one of the most robust single metrics for imbalanced data.
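
To illustrate the role of True Negatives, the sketch below scores two hypothetical confusion matrices that differ *only* in TN. The function names and counts are made up, and returning 0 when the denominator is zero is a common convention assumed here.

```python
import math


def f1_from_counts(tp, fp, fn):
    return 2 * tp / (2 * tp + fp + fn)


def mcc(tp, tn, fp, fn):
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # assumed convention when a row or column of the matrix is empty
    return (tp * tn - fp * fn) / denom


# Same TP, FP, FN; only the number of True Negatives changes.
many_negatives = dict(tp=80, tn=810, fp=90, fn=20)
few_negatives = dict(tp=80, tn=10, fp=90, fn=20)

for name, m in [("many TN", many_negatives), ("few TN", few_negatives)]:
    f1 = f1_from_counts(m["tp"], m["fp"], m["fn"])
    print(f"{name}: F1={f1:.3f}, MCC={mcc(**m):.3f}")
```

F1 is identical for both matrices because it never looks at TN. MCC drops below zero in the second case, where the model wrongly flags 90 of the 100 actual negatives.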

### **Fowlkes–Mallows Index (FM)**

* **Formula:** $\text{FM} = \sqrt{\text{PPV} \times \text{TPR}}$
* **The Gist:** The geometric mean of Precision and Recall.
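* **Comment:** Very similar to F1, but never lower: the geometric mean is always greater than or equal to the harmonic mean, and the two are equal only when Precision equals Recall. It was originally proposed to measure the similarity between two clusterings.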
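
A short sketch comparing the two means on the same hypothetical counts. The variable names and numbers are illustrative only.

```python
import math

# Hypothetical counts for illustration only.
tp, fp, fn = 80, 90, 20

ppv = tp / (tp + fp)  # precision
tpr = tp / (tp + fn)  # recall

fm = math.sqrt(ppv * tpr)            # geometric mean of precision and recall
f1 = 2 * ppv * tpr / (ppv + tpr)     # harmonic mean of precision and recall

print(f"FM={fm:.3f}, F1={f1:.3f}")
```

The gap between FM and F1 widens as Precision and Recall move further apart.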

### **Threat Score (TS)**

* **Also known as:** Jaccard Index, Critical Success Index (CSI).
* **Formula:** $\text{TS} = \frac{\text{TP}}{\text{TP} + \text{FN} + \text{FP}}$
* **The Gist:** It measures the overlap between truth and prediction, ignoring the True Negatives entirely.
* **When to use:** Weather forecasting (predicting rain) and Computer Vision (Image Segmentation/IoU).
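
The link to the Jaccard Index is easiest to see with sets. In the sketch below, the item IDs stand in for anything countable (pixels in a segmentation mask, rainy days, and so on). The names and data are hypothetical, and returning 0 for two empty sets is an assumed convention.

```python
def jaccard(actual_positive, predicted_positive):
    """Overlap of two sets: |intersection| / |union|."""
    union = actual_positive | predicted_positive
    if not union:
        return 0.0  # assumed convention when both sets are empty
    return len(actual_positive & predicted_positive) / len(union)


# Hypothetical item IDs (think pixel indices in a segmentation mask).
actual_positive = {1, 2, 3, 4, 5}
predicted_positive = {4, 5, 6}

tp = len(actual_positive & predicted_positive)  # 2: IDs 4 and 5
fn = len(actual_positive - predicted_positive)  # 3: IDs 1, 2 and 3
fp = len(predicted_positive - actual_positive)  # 1: ID 6

threat_score = tp / (tp + fn + fp)
print(threat_score, jaccard(actual_positive, predicted_positive))
```

Both computations give the same value, because the union of the two sets is exactly TP + FN + FP. Items that are in neither set (the True Negatives) never enter the calculation.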

---

## 6. The "Information" Metrics: Medical & Probability

Advanced metrics for understanding the *value* a test adds.

### **Likelihood Ratios (LR+ and LR-)**

* **LR+:** $\text{LR+} = \frac{\text{TPR}}{\text{FPR}}$ — How much more likely is a positive test found in a sick person vs. a healthy person? (You want this high, >10).
* **LR-:** $\text{LR-} = \frac{\text{FNR}}{\text{TNR}}$ — How much more likely is a negative test found in a sick person vs. a healthy person? (You want this low, <0.1).
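
Likelihood ratios are typically applied by converting a pre-test probability to odds, multiplying by the ratio, and converting back. The sketch below assumes a hypothetical test (TPR = 0.8, FPR = 0.1) and a hypothetical 10% pre-test probability. The function names are illustrative.

```python
def likelihood_ratios(tpr, fpr):
    fnr = 1 - tpr
    tnr = 1 - fpr
    return tpr / fpr, fnr / tnr  # (LR+, LR-)


def post_test_probability(pre_test_probability, likelihood_ratio):
    """Update a probability with a likelihood ratio by working in odds."""
    pre_odds = pre_test_probability / (1 - pre_test_probability)
    post_odds = pre_odds * likelihood_ratio
    return post_odds / (1 + post_odds)


lr_pos, lr_neg = likelihood_ratios(tpr=0.8, fpr=0.1)
pre_test = 0.10  # hypothetical: 10% of patients have the condition

after_positive = post_test_probability(pre_test, lr_pos)
after_negative = post_test_probability(pre_test, lr_neg)

print(f"LR+={lr_pos:.2f}, LR-={lr_neg:.2f}")
print(f"after a positive test: {after_positive:.3f}")
print(f"after a negative test: {after_negative:.3f}")
```

Here a positive result raises the probability of disease from 10% to about 47%, and a negative result lowers it to about 2%.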

### **Diagnostic Odds Ratio (DOR)**

* **Formula:** $\text{DOR} = \frac{\text{LR+}}{\text{LR-}}$
* **The Gist:** A single number indicating the effectiveness of a diagnostic test. A value of 1 means the test is useless; higher is better. It is widely used in meta-analyses of medical studies.
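
Expanding the ratio shows that DOR can also be read straight off the four counts as $\frac{\text{TP} \times \text{TN}}{\text{FP} \times \text{FN}}$. The sketch below checks both routes on hypothetical counts. It assumes FP and FN are both non-zero, since the ratio is undefined otherwise.

```python
def diagnostic_odds_ratio(tp, tn, fp, fn):
    """DOR from raw counts; undefined if fp or fn is zero."""
    return (tp * tn) / (fp * fn)


# Hypothetical counts for illustration only.
tp, tn, fp, fn = 80, 810, 90, 20

lr_pos = (tp / (tp + fn)) / (fp / (fp + tn))  # TPR / FPR
lr_neg = (fn / (tp + fn)) / (tn / (fp + tn))  # FNR / TNR

dor_from_counts = diagnostic_odds_ratio(tp, tn, fp, fn)
dor_from_ratios = lr_pos / lr_neg

print(f"{dor_from_counts:.1f} {dor_from_ratios:.1f}")
```

Both routes agree up to floating-point rounding.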

### **Informedness (Youden's J)**

* **Formula:** $\text{Informedness} = \text{TPR} + \text{TNR} - 1$
* **The Gist:** The probability that an informed decision is made, rather than a random guess. It ranges from -1 (always wrong) through 0 (guessing) to +1 (perfect). It is essentially Balanced Accuracy rescaled so that chance sits at 0: $\text{Informedness} = 2 \times \text{BA} - 1$.

### **Markedness (MK)**

* **Formula:** $\text{Markedness} = \text{PPV} + \text{NPV} - 1$
* **The Gist:** The counterpart to Informedness. It measures how much confidence the *prediction* gives you about the *real condition*. If Markedness is 0, your predictions (Positive/Negative) tell you nothing about the actual reality.
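
Informedness and Markedness are the row-wise and column-wise views of the same matrix, and their product equals the square of MCC. The sketch below checks this numerically on hypothetical counts. The variable names are illustrative.

```python
import math

# Hypothetical counts for illustration only.
tp, tn, fp, fn = 80, 810, 90, 20

tpr = tp / (tp + fn)
tnr = tn / (tn + fp)
ppv = tp / (tp + fp)
npv = tn / (tn + fn)

informedness = tpr + tnr - 1   # row view: how well actual classes are captured
markedness = ppv + npv - 1     # column view: how trustworthy predictions are
balanced_accuracy = (tpr + tnr) / 2

mcc = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
)

print(f"Informedness={informedness:.3f}, Markedness={markedness:.3f}")
print(f"MCC^2={mcc ** 2:.4f}, Informedness*Markedness={informedness * markedness:.4f}")
```

In other words, MCC is the geometric mean of Informedness and Markedness (carrying their shared sign), which is one way to see why it balances both perspectives.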

---

## Conclusion: Which one should you pick?

Don't let the list overwhelm you. You usually only need to pick 2 or 3 based on your "Risk Profile":

1. **Risk of Missing Out (FOMO)?** Use **Recall** (Sensitivity).
2. **Risk of False Alarms?** Use **Precision**.
3. **Imbalanced Data?** Use **MCC** or **F1 Score**.
4. **Medical Diagnosis?** Use **Sensitivity, Specificity, and Likelihood Ratios**.

The key is understanding what kind of mistake would be most costly in your specific context, and choosing metrics that directly measure that risk.