Analysis of the UNSW-NB15 Data Set for Network Anomaly Detection Systems using Machine Learning

1.0 Summary

Analysis of the UNSW-NB15 Data Set

The UNSW-NB15 data set is a benchmark that can be used to test Network Anomaly Detection Systems. It contains normal network traffic and nine types of attacks both represented by forty- nine features extracted from network traffic flow. In the paper The evaluation of Network Anomaly Detection Systems: Statistical analysis of the UNSW-NB15 data set and the comparison with the KDD99 data-set (hereinafter referred to as the Evaluation Paper), the authors demonstrate the complexity of UNSW-NB15 using three approaches; statistical analysis, feature correlation and five machine learning models (Moustafa & Slay 2016). The results are compared to an older benchmark, namely the KDDCUP1999 data set which does not include modern types of attacks nor network traffic flows (KDDCUP1999 2007).

In addition to repeating the analysis carried out in the Evaluation Paper a Support Vector Machine (SVM) and a Neural Network (NN) model both with tuned parameters are implemented to increase the accuracy and decrease the False Alarm Rate (FAR) of the predictions.

The results obtained agree with those presented in the Evaluation Paper and confirm that the UNSW-NB15 data set is more complex than the KDDCUP1999 data set. Further both the tuned SVM and NN show very high accuracies but there are still an unacceptable number of false positives and negatives.
2.0 Introduction

The aim of this work is to repeat the analysis carried out in the Evaluation Paper and extend it by tuning a Support Vector Machine (SVM) and a Neural Network (NN) model to increase the accuracy and decrease the False Alarm Rate (FAR).

The three approaches used in the Evaluation Paper were statistical analysis, feature correlation analysis and five machine learning model implementations.
The statistical analysis methods were performed with the SPSS tool and include:

i. Kolmogorov–Smirnov (K-S) test ii. Multivariate Skewness

iii. Multivariate Kurtosis

The i. ii.

The i. ii. iii. iv. v.

correlation methods were performed with SPSS, MatLab and Python and include: Pearson Correlation Coefficient (PCC)
Gain Ratio

machine learning models were implemented with SQL Server and Visual Studio and include: Naïve Bayes (NB)
Decision Tree (DT)
Neural Network (NN)

Logistic Regression (LR) Clustering

The new SVM was implemented in R and the NN was implemented with SQL Server and Visual Studio.

This paper is organised as follows: Section 3.0 Approach outlines the data sets, features and pre-processing used in the analysis. Section 3.1 Statistical Analysis details the statistical tests performed and Section 3.2 Feature Correlation Analysis outlines the PCC and Gain Ratio tests. Section 3.3 Machine Learning implements the models reported in the Evaluation Paper and Section 3.4 details the tuned SVM and NN Models. The results are presented in Section 4.0 and the conclusions are discussed in Section 5.0.

3.0 Approach

There are three features in all of the data sets detailed in Table 1 that are represented by character strings, these features were mapped to integers using the SPSS Transform > Automatic Recode tool (SPSS tool 2020). Specifically, the features mapped are protonum, servicenum and statenum in the UNSW-NB15 data sets and protocol, service and flag features in the KDDCUP1999 data sets, please see Tables 2 and 3. These mapped features were used in all of the statistical analysis’, feature correlations and machine learning models.

Further because the features in the data sets have a large distance between the minimum and maximum values, they were standardized using the Z-Score tool in SPSS, specifically with Analyze > DescriptiveStatistics > Descriptive and using the Save standardised values as variables option.

Screen Shot 2021-04-19 at 3.15.05 pm

Screen Shot 2021-04-19 at 3.15.31 pm

Screen Shot 2021-04-19 at 3.15.41 pm

3.1 Statistical Analysis

The SPSS tool (SPSS tool 2020) was used to analyze the UNSW-NB15 Training and Testing data sets (Moustafa 2018). Table 1 and Table 2 detail the records and features for both the sets. The three techniques used were the Kolmogorov–Smirnov (K-S) test, Multivariate Skewness and Multivariate Kurtosis.

3.1.0 The K-S test is used to compare the probability distributions of the features in the training and testing sets. The results obtained by using the SPSS tool Analyze > DescriptiveStatistics > Explore and Normality plots with tests, agree with those presented in the Evaluation Paper. This illustrates that the probability distributions of the features in the training and testing sets show a strong similarity. Please see Figures 1 and 2 below and Appendix D for the full list of results.

Screen Shot 2021-04-19 at 3.16.48 pm

Screen Shot 2021-04-19 at 3.16.58 pm

3.1.1 The Multivariate Skewness analysis is used to measure the asymmetry of the feature distributions and was performed with SPSS, specifically using Analyze > Frequencies > Distribution > Skewness and Kurtosis. The results obtained match those presented in the Evaluation Paper. The results show that the training and testing set features have very similar distributions with only the sttl feature having a negative skewness in both sets. The swin and dwin features in the training set show a very slight negative skewness and the trans_depth and response_body_length are very positively skewed. Please see Figures 3 and 4 below and Appendix D for the full list of results.

Screen Shot 2021-04-19 at 3.17.41 pm

Screen Shot 2021-04-19 at 3.17.53 pm

3.1.2 The Multivariate Kurtosis analysis is used to measure how extreme the tails of the feature distributions are and was performed with SPSS as per section 3.1.1. The results obtained agree with those presented in the Evaluation Paper. The high Kurtosis values indicate a large number of outliers and low Kurtosis values indicate a lack of outliers. The trans_depth and response_body_length in both the testing and training sets have a large number of outliers while djit in the testing set has a moderate number of outliers. Some other features also have some outliers but importantly these are present in both the testing and training sets. Please see Figure 5 below and Appendix D.

Screen Shot 2021-04-19 at 3.18.33 pm

3.2 Feature Correlation Analysis

The correlation of the features in the data sets was tested using two approaches; the Pearson Correlation Coefficient was used to analyze the features without the label and Gain Ratio was used to analyse the features with the label.

3.2.0 The Pearson Correlation Coefficient (PCC) was used to measure the correlation between the features with SPSS specifically using Analyze > Correlate > Bivariate with the Pearson Correlation Coefficient ticked. Please see Figure 6.

The results obtained match those presented in the Evaluation Paper. PCC is a measure of the strength of association between the features in each data set and are scored between -1 and 1. A correlation of -1 indicates an individual features average correlation with all other features is perfectly negatively linearly related. A correlation of 0 indicates an individual features average correlation with all other features does not have any linear relation. A correlation of 1 indicates an individual features average correlation with all other features is perfectly positively linearly related.

Figure 7 and Appendix D show that none of the features have a high degree of correlation with each other but this lack of correlation is shared by both the testing and training sets.

Screen Shot 2021-04-19 at 3.19.03 pm

Screen Shot 2021-04-19 at 3.19.12 pm

3.2 Machine Learning Models

Visual Studio Business Intelligence 2008 was used in the Evaluation Paper but it is no longer supported by Microsoft. The tools used in this work were Visual Studio 2019 (Visual Studio 2019), SQL Server 2019 (SQL Server 2019) and the SQL Server Analysis Services (SSAS) from the Microsoft Analysis Services 2020 library (Microsoft Analysis Services 2020).

Five ML models were used to evaluate the complexity of the UNSW datasets and to compare the results to the KDDCUP1999 data set. The specific data sets and features used are detailed in Tables 1, 2 and 3. Please see Appendix I and J which shows the Visual Studio Mining Structure and the SQL Server structure.

The five models used are Naïve Bayes (NB), Decision Tree (DT), Neural Network (NN), Logistic Regression (LR) and Clustering. The authors did not provide enough information to implement the models such that the results could be exactly reproduced with the newer tools. Three sets of tests were run on the UNSW-NB15 data sets with varying parameters and two sets of tests were run on the KDDCUP1999 data sets.

In the Evaluation Paper the accuracy is defined as:
(True Positive + True Negative) /
(True Positive + True Negative + False Positive + False Negative).

In the Evaluation Paper the False Alarm Rate (FAR) is defined by:
False Positive Rate (FPR) = False Positive / (False Positive + True Negative) False Negative Rate (FNR) = False Negative / (False Negative + True Positive) FAR is given by (FPR + FNR) / 2

3.3 SVM and NN Models

The data set used for training and testing the SVM and NN was the UNSW-NB15_1.csv which contains 700,001 records and is a simple split of the entire set of records. It is not curated like the training and testing sets used in the above analysis; it only has 22,215 attack records but is more a reflection of the raw network data flow. The features used in the previously analysis were used here with the same mappings between strings and integers using the SPSS transform tool. Consideration was given to using the source and destination IP and Port fields but after closer examination it appears that the malicious traffic stems from one or two addresses so using these features would be essentially putting a label on the records, that is IP address X equals malicious traffic.

3.4.0 SVM

R Studio was used to implement the SVM please see Appendix C for the code. The trainControl function was used with method set to CV and number set to 5, during training this splits the training data into 5 sets and uses each set as testing data for the model trained on the other 4 sets. This 5-fold cross validation estimates the prediction error by taking the average of the

Percentage

error across all tests. The preProcess function was used with the parameters set to centre and scale which standardise the data so that the features have a mean value of 0 and a standard deviation of 1. The tuneLength parameter was set to 5 and this allows the model to choose 5 different sets of parameters to determine the optimum for the data set. 30% of the data was held out for testing.

The results are presented below in Figure 14 and using the definition of accuracy and FAR outlined in section 3.3 the SVM achieves an accuracy of 99.48% and a FAR of 1.05%

Screen Shot 2021-04-19 at 3.21.29 pm

3.4.1 Neural Network

Using the same data set and features as outlined for the SVM a NN was implemented in Visual Studio 2019 SSAS. Two of the Mining Structure parameters were changed, HoldoutMaxPercent was set to 30 and the HoldoutSeed was set to 1. Two of the Neural Network Algorithm parameters were changed, MAXIMUM_STATES was set to 1000 and SAMPLE_SIZE was set to 700,000 * 0.7.

The results are presented below in Figure 15 and using the definition of accuracy and FAR outlined in section 3.3 the NN achieves an accuracy of 99.31% and a FAR of 4.45%

Screen Shot 2021-04-19 at 3.21.37 pm

Screen Shot 2021-04-19 at 3.22.38 pm

Screen Shot 2021-04-19 at 3.22.46 pm