Confusion Matrix and Its Implementation in Cyber Crime Investigation

Devanshu Singh
4 min read · Jun 5, 2021

Hello everyone! In this article I will explain the confusion matrix in a very simple way and show how it can be applied in cyber crime investigation. Let's start with the confusion matrix itself.

What is a Confusion Matrix?

In simple words, a confusion matrix is a table that is often used to describe the performance of a classification model (or “classifier”) on a set of test data for which the true values are known. The confusion matrix itself is relatively simple to understand, but the related terminology can be confusing.

Let’s start with an example confusion matrix for a binary classifier (though it can easily be extended to the case of more than two classes):
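n = 165          Predicted: NO    Predicted: YES
Actual: NO            50                10
Actual: YES            5               100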

What can we learn from the above matrix?

  • There are two possible predicted classes: “yes” and “no”. If we were predicting the presence of a disease, for example, “yes” would mean the patient has the disease, and “no” would mean the patient doesn’t have the disease.
  • The classifier made a total of 165 predictions (i.e., 165 patients were tested for the presence of that disease).
  • Out of those 165 cases, the classifier predicted “yes” 110 times, and “no” 55 times.
  • In reality, 105 patients in the sample have the disease, and 60 patients do not.

Let’s now define the most basic terms, which are whole numbers (not rates):

  • True Positives (TP): These are cases in which we predicted yes (they have the disease), and they do have the disease.
  • True Negatives (TN): We predicted no, and they don’t have the disease.
  • False Positives (FP): We predicted yes, but they don’t actually have the disease. (Also known as a “Type I error.”)
  • False Negatives (FN): We predicted no, but they actually do have the disease. (Also known as a “Type II error.”)
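These four counts are exactly what most ML libraries report. As a quick illustration (a minimal sketch, assuming scikit-learn is installed and using made-up labels), here is how they can be extracted in Python:

# A minimal sketch (illustrative only): extracting TP, TN, FP, FN with scikit-learn.
# The tiny label lists below are made up; 1 = "yes" (has the disease), 0 = "no".
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # true values
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]   # classifier predictions

# For binary labels {0, 1}, scikit-learn orders the matrix as [[TN, FP], [FN, TP]].
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")   # TP=3, TN=3, FP=1, FN=1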

I’ve added these terms to the confusion matrix, and also added the row and column totals:
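n = 165          Predicted: NO    Predicted: YES    Total
Actual: NO          TN = 50          FP = 10           60
Actual: YES          FN = 5         TP = 100          105
Total                    55              110           165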

This is a list of rates that are often computed from a confusion matrix for a binary classifier:

Accuracy: Overall, how often is the classifier correct?

  • (TP+TN)/total = (100+50)/165 = 0.91

Misclassification Rate: Overall, how often is it wrong?

  • (FP+FN)/total = (10+5)/165 = 0.09

  • equivalent to 1 minus Accuracy
  • also known as “Error Rate”

True Positive Rate: When it’s actually yes, how often does it predict yes?

  • TP/actual yes = 100/105 = 0.95
  • also known as “Sensitivity” or “Recall”

False Positive Rate: When it’s actually no, how often does it predict yes?

  • FP/actual no = 10/60 = 0.17

True Negative Rate: When it’s actually no, how often does it predict no?

  • TN/actual no = 50/60 = 0.83
  • equivalent to 1 minus False Positive Rate
  • also known as “Specificity”

Precision: When it predicts yes, how often is it correct?

  • TP/predicted yes = 100/110 = 0.91

Prevalence: How often does the yes condition actually occur in our sample?

  • actual yes/total = 105/165 = 0.64
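To make these definitions concrete, here is a small Python sketch (illustrative only) that recomputes every rate above from the example counts:

# Recomputing the rates above from the example confusion matrix counts.
TP, TN, FP, FN = 100, 50, 10, 5

total = TP + TN + FP + FN        # 165 predictions in total
actual_yes = TP + FN             # 105 patients actually have the disease
actual_no = TN + FP              # 60 patients actually do not
predicted_yes = TP + FP          # 110 "yes" predictions

accuracy = (TP + TN) / total                  # 0.91
misclassification_rate = (FP + FN) / total    # 0.09, i.e. 1 - accuracy
true_positive_rate = TP / actual_yes          # 0.95  (sensitivity / recall)
false_positive_rate = FP / actual_no          # 0.17
true_negative_rate = TN / actual_no           # 0.83  (specificity)
precision = TP / predicted_yes                # 0.91
prevalence = actual_yes / total               # 0.64

print(f"accuracy={accuracy:.2f}, precision={precision:.2f}, recall={true_positive_rate:.2f}")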

Confusion Matrix’s implementation in monitoring Cyber Attacks:

This example uses the data set prepared for The Third International Knowledge Discovery and Data Mining Tools Competition, which was held in conjunction with KDD-99, The Fifth International Conference on Knowledge Discovery and Data Mining. The competition task was to build a network intrusion detector: a predictive model capable of distinguishing between “bad” connections, called intrusions or attacks, and “good” normal connections. This database contains a standard set of data to be audited, which includes a wide variety of intrusions simulated in a military network environment.
In the KDD99 dataset, attacks fall into four main categories (DoS, U2R, R2L, and Probe), which are further divided into 22 specific attack types, tabulated below:
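Category    Attack types
DoS         back, land, neptune, pod, smurf, teardrop
U2R         buffer_overflow, loadmodule, perl, rootkit
R2L         ftp_write, guess_passwd, imap, multihop, phf, spy, warezclient, warezmaster
Probe       ipsweep, nmap, portsweep, satan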

In the KDD Cup 99, the criterion used for evaluating the participant entries was the Cost Per Test (CPT), computed using the confusion matrix and a given cost matrix.
  • True Positive (TP): the number of records classified as attacks that really are attacks.
  • True Negative (TN): the number of records classified as normal that really are normal.
  • False Positive (FP): the number of records classified as attacks that are actually normal (false alarms).
  • False Negative (FN): the number of records classified as normal that are actually attacks (missed attacks).
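As a rough sketch of how this looks in practice (not the official KDD Cup evaluation code, and using invented labels), the same counts and the usual detection-rate / false-alarm-rate metrics can be computed in Python:

# A minimal sketch: scoring an intrusion detector with a confusion matrix.
# The labels/predictions below are made up; a real evaluation would use the
# KDD99 test records and a trained model's predictions.
from sklearn.metrics import confusion_matrix

y_true = ["attack", "normal", "attack", "attack", "normal", "normal", "attack"]
y_pred = ["attack", "normal", "normal", "attack", "attack", "normal", "attack"]

# With labels=["normal", "attack"], the matrix is [[TN, FP], [FN, TP]],
# where "attack" is treated as the positive class.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=["normal", "attack"]).ravel()

detection_rate = tp / (tp + fn)     # share of real attacks that were detected
false_alarm_rate = fp / (fp + tn)   # share of normal traffic flagged as an attack
print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")
print(f"Detection rate: {detection_rate:.2f}, False alarm rate: {false_alarm_rate:.2f}")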

Conclusion:

A confusion matrix is a tabular summary of the number of correct and incorrect predictions made by a classifier. It is used to measure the performance of a classification model through performance metrics like accuracy, precision, recall, and F1-score.
Need for the Confusion Matrix in Machine Learning:
  • It evaluates the performance of classification models when they make predictions on test data and tells us how good our classification model is.
  • It tells us not only the errors made by the classifier but also the type of each error, i.e. whether it is a Type I or Type II error.
  • With the help of the confusion matrix, we can calculate different metrics for the model, such as accuracy and precision, as in the sketch below.
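For completeness, one last illustrative sketch (made-up labels, assuming scikit-learn) showing that accuracy, precision, recall, and F1-score all come from the same predictions that build the confusion matrix:

# Accuracy, precision, recall, and F1-score from the same predictions
# that produce the confusion matrix. Labels are made up for illustration.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

print("Accuracy: ", accuracy_score(y_true, y_pred))    # (TP + TN) / total
print("Precision:", precision_score(y_true, y_pred))   # TP / (TP + FP)
print("Recall:   ", recall_score(y_true, y_pred))      # TP / (TP + FN)
print("F1-score: ", f1_score(y_true, y_pred))          # harmonic mean of precision and recall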

The confusion matrix is a matrix used to determine the performance of a classification model for a given set of test data. It can only be computed if the true values for the test data are known. The matrix itself is easy to understand and to implement when testing an ML model.
