Difference between revisions of "AI: Anomaly Detection in logfiles"
Line 35: | Line 35: | ||
logfile_features.dropna(inplace=True) | logfile_features.dropna(inplace=True) | ||
Our dataset has labels which define if its an attack or not, so we replace them with numericals (0 and 1) | Our dataset has labels which define if its an attack or not, so we replace them with numericals (0 and 1) | ||
In our case Benign is for standard user traffic and DoS attacks-Slowloris and DoS attacks-GoldenEye is for traffic where a DoS attack occured | |||
logfile_features["Label"].replace({"Benign": 0, "DoS attacks-Slowloris": 1, "DoS attacks-GoldenEye": 1}, inplace=True) | logfile_features["Label"].replace({"Benign": 0, "DoS attacks-Slowloris": 1, "DoS attacks-GoldenEye": 1}, inplace=True) | ||
Next we shuffle our dataset | Next we shuffle our dataset | ||
Line 60: | Line 61: | ||
normed_test_data = norm(test_dataset, train_stats) | normed_test_data = norm(test_dataset, train_stats) | ||
normed_valid_dataset = norm(valid_dataset, train_stats) | normed_valid_dataset = norm(valid_dataset, train_stats) | ||
=== Step 2 - Create a model === | === Step 2 - Create a model === |
Revision as of 14:49, 13 July 2022
Summary
This guide will create a basic AI model to perform binary classification in order to detect anomalies in logfiles. This AI model is also suitable for the Jetson AGX Xavier Development Kit
Requirements
- Packages: TensorFlow, Keras, Pandas, sklearn, numpy, seaborn, matplotlib
- Software: Pycharm or any other python editor
Description
Step 0 - Import the needed packages/libraries
from keras.callbacks import EarlyStopping, ModelCheckpoint # for training from keras.models import Sequential, load_model # for model from keras.layers import Dense, Activation # layers and activation function import pandas as pd # read/prep dataset pd.options.mode.chained_assignment = None # removes warning import numpy as np # read/prep dataset import sklearn.model_selection as sk # dataset splitting import tensorflow as tf # for model import seaborn as sns # plotting from sklearn.metrics import confusion_matrix # confusion matrix from matplotlib import pyplot as plt # plotting
Step 1 - Read the dataset
First we need to read the data, for that we can use the predefined function from pandas 'read_csv'
logfile_features = pd.read_csv(path)
Afterwards we replace the infinite values with nans and drop them all together
logfile_features.replace([np.inf, -np.inf], np.nan, inplace=True) logfile_features.dropna(inplace=True)
Our dataset has labels which define if its an attack or not, so we replace them with numericals (0 and 1) In our case Benign is for standard user traffic and DoS attacks-Slowloris and DoS attacks-GoldenEye is for traffic where a DoS attack occured
logfile_features["Label"].replace({"Benign": 0, "DoS attacks-Slowloris": 1, "DoS attacks-GoldenEye": 1}, inplace=True)
Next we shuffle our dataset
logfile_features = logfile_features.sample(frac=1)
Now we need to split our data into 3 parts: Training data (60%), Test data (20%) and Validation data (20%). To do that we use the following methods
train_dataset, temp_test_dataset = sk.train_test_split(logfile_features, test_size=0.4) test_dataset, valid_dataset = sk.train_test_split(temp_test_dataset, test_size=0.5)
Next we extract the labels from the actual dataset, we need them extra for our training
train_labels = train_dataset.pop('Label') test_labels = test_dataset.pop('Label') valid_labels = valid_dataset.pop('Label')
To norm our data correctly we need to get the stats of our training data (mean and standard deviation)
train_stats = train_dataset.describe() train_stats = train_stats.transpose()
For norming we use a following method
def norm(x, stats): return (x - stats['mean']) / stats['std']
This method can then be used the following way
normed_train_data = norm(train_dataset, train_stats) normed_test_data = norm(test_dataset, train_stats) normed_valid_dataset = norm(valid_dataset, train_stats)
Step 2 - Create a model
First we need to create a sequential model, which can be trained later
model = Sequential()
The next step is to create an input layer with exactly as many nodes as features in our training data
model.add(Dense(normed_train_data.shape[1],input_shape=(normed_train_data.shape[1],))
Next a hidden layer consisting of 128 nodes with the ReLU (Rectified Linear Unit) activation function
model.add(Dense(128, Activation('relu')))
And finally the output layer consisting of 1 node which represents 'attack' or 'no attack'
model.add(Dense(1))
Now we could change the learning rate to a specific value, but we just leave it at the default 0.001
learning_rate = 0.001
For the optimizer we just use the Adam Optimizer with the pre-defined learning rate
optimizer = tf.optimizers.Adam(learning_rate)
Lastly we need to compile the model, for the loss function we use BinaryCrossentropy, our optimizer and the metric should be the accuarcy of the model
model.compile(loss=tf.keras.losses.BinaryCrossentropy(from_logits=True), optimizer=optimizer, metrics=['accuracy'])
Step 3 - Train the model
First we set our epochs, a complete pass of the normed training data through the model, and our batch size, after how many datapoints the model gets updated
EPOCHS = 5000 batch_size = 1024
To not have to wait for 5000 training epochs to finish and to prevent overfitting we can set an early stop
es = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=2)
Finally the training, model fitting, can start
with tf.device('/CPU:0'): # with tf.device('/GPU:0'): # wenn man mit der Grafikkarte trainieren will history = model.fit( normed_train_data, train_labels, batch_size=batch_size, epochs=EPOCHS, verbose=1, shuffle=True, steps_per_epoch=int(normed_train_data.shape[0] / batch_size), validation_data=(normed_valid_dataset, valid_labels), callbacks=[es], )
Step 4 - Plot the results
After the training has been completed you can easily plot the accuarcy and validation accuracy during the training using
plt.plot(history.history['accuracy']) plt.plot(history.history['val_accuracy']) plt.title('model accuracy') plt.ylabel('accuracy') plt.xlabel('epoch') plt.legend(['Train', 'Cross-Validation'], loc='upper left') plt.show()
To show the accuarcy when tested against the test data, meaning data the model didn't train with, we can use a confusion matrix the following
ax = plt.subplot() predict_results = model.predict(normed_test_data) predict_results = (predict_results > 0.5) cm = confusion_matrix(test_labels, predict_results) ax.set_xlabel('Predicted labels') ax.set_ylabel('True labels') ax.set_title('Confusion Matrix') ax.xaxis.set_ticklabels(['No Attack', 'Attack']) ax.yaxis.set_ticklabels(['No Attack', 'Attack']) plt.show()