AI: Anomaly Detection in logfiles

From Embedded Lab Vienna for IoT & Security

Summary

This guide shows how to build a basic AI model that performs binary classification in order to detect anomalies in logfiles. The model is also suitable for the Jetson AGX Xavier Development Kit.

Requirements

  • Packages: TensorFlow, Keras, Pandas, sklearn, numpy, seaborn, matplotlib
  • Software: PyCharm or any other Python editor
  • Dataset: We used the following dataset

Description

Step 0 - Import the needed packages/libraries

from keras.callbacks import EarlyStopping, ModelCheckpoint # for training
from keras.models import Sequential, load_model # for model
from keras.layers import Dense, Activation # layers and activation function
import pandas as pd # read/prep dataset
pd.options.mode.chained_assignment = None # removes warning
import numpy as np # read/prep dataset
import sklearn.model_selection as sk # dataset splitting
import tensorflow as tf # for model
import seaborn as sns # plotting
from sklearn.metrics import confusion_matrix # confusion matrix
from matplotlib import pyplot as plt # plotting

Step 1 - Read the dataset

First we need to read the data. For that we can use the pandas function read_csv:

logfile_features = pd.read_csv(path)

Afterwards we replace infinite values with NaNs and drop all rows containing them:

logfile_features.replace([np.inf, -np.inf], np.nan, inplace=True)
logfile_features.dropna(inplace=True)

Our dataset has labels that define whether a row is an attack or not, so we replace them with numeric values (0 and 1).

In our case 'Benign' stands for regular user traffic, while 'DoS attacks-Slowloris' and 'DoS attacks-GoldenEye' stand for traffic where a DoS attack occurred:

logfile_features["Label"].replace({"Benign": 0, "DoS attacks-Slowloris": 1, "DoS attacks-GoldenEye": 1}, inplace=True)

Next we shuffle our dataset

logfile_features = logfile_features.sample(frac=1) 
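
Note that sample keeps the original row indices. If you prefer a clean, consecutive index after shuffling, you can optionally reset it:

logfile_features = logfile_features.reset_index(drop=True) # optional: renumber the shuffled rows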

Now we need to split our data into three parts: training data (60%), test data (20%) and validation data (20%). To do that we use the following calls:

train_dataset, temp_test_dataset = sk.train_test_split(logfile_features, test_size=0.4)
test_dataset, valid_dataset = sk.train_test_split(temp_test_dataset, test_size=0.5)
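
If you want to double-check the 60/20/20 split, printing the sizes of the three sets is a quick sanity check:

print(len(train_dataset), len(test_dataset), len(valid_dataset)) # roughly 60%, 20% and 20% of the rows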

Next we extract the labels from the datasets, since we need them separately for training:

train_labels = train_dataset.pop('Label')
test_labels = test_dataset.pop('Label')
valid_labels = valid_dataset.pop('Label')

To normalize our data correctly we need the statistics of our training data (mean and standard deviation):

train_stats = train_dataset.describe()
train_stats = train_stats.transpose()

For normalization we use the following function:

def norm(x, stats):
    return (x - stats['mean']) / stats['std']

This function can then be used as follows:

normed_train_data = norm(train_dataset, train_stats)
normed_test_data = norm(test_dataset, train_stats)
normed_valid_dataset = norm(valid_dataset, train_stats)

Step 2 - Create a model

First we need to create a sequential model, which can be trained later

model = Sequential()

The next step is to create an input layer with exactly as many nodes as there are features in our training data:

model.add(Dense(normed_train_data.shape[1], input_shape=(normed_train_data.shape[1],)))

Next a hidden layer consisting of 128 nodes with the ReLU (Rectified Linear Unit) activation function

model.add(Dense(128, activation='relu'))

And finally the output layer consisting of 1 node which represents 'attack' or 'no attack'

model.add(Dense(1))

Now we could change the learning rate to a specific value, but we just leave it at the default 0.001

learning_rate = 0.001

For the optimizer we just use the Adam Optimizer with the pre-defined learning rate

optimizer = tf.optimizers.Adam(learning_rate)

Lastly we need to compile the model: as the loss function we use BinaryCrossentropy, we pass our optimizer, and as the metric we use the accuracy of the model:

model.compile(loss=tf.keras.losses.BinaryCrossentropy(from_logits=True), 
  optimizer=optimizer,
  metrics=['accuracy'])
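
Optionally, model.summary() prints the resulting architecture, which is useful to confirm the layer sizes before training:

model.summary() # prints the layers, output shapes and number of trainable parameters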

Step 3 - Train the model

First we set the number of epochs (one epoch is a complete pass of the normalized training data through the model) and the batch size (the number of data points after which the model weights are updated):

EPOCHS = 5000
batch_size = 1024

To avoid waiting for all 5000 training epochs to finish and to prevent overfitting, we set up early stopping:

es = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=2)
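The imported ModelCheckpoint callback could additionally be used to keep the weights of the best epoch; a minimal sketch (the filename 'best_model.h5' is just an example) would be:

mc = ModelCheckpoint('best_model.h5', monitor='val_loss', mode='min', save_best_only=True, verbose=1)
# pass it together with the early stopping callback: callbacks=[es, mc]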

Finally the training (model fitting) can start:

with tf.device('/CPU:0'):
    # with tf.device('/GPU:0'): # use this instead if you want to train on the GPU
    history = model.fit(
        normed_train_data,
        train_labels,
        batch_size=batch_size,
        epochs=EPOCHS,
        verbose=1,
        shuffle=True,
        steps_per_epoch=int(normed_train_data.shape[0] / batch_size),
        validation_data=(normed_valid_dataset, valid_labels),
        callbacks=[es],
    )
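
After training, the model can be saved and later reloaded with the already imported load_model; a short sketch (the filename is an assumption):

model.save('logfile_anomaly_model.h5') # stores architecture, weights and optimizer state
# model = load_model('logfile_anomaly_model.h5') # reload it later, e.g. on the Jetson AGX Xavier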

Step 4 - Plot the results

After the training has been completed you can easily plot the accuracy and validation accuracy during training using

plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['Train', 'Cross-Validation'], loc='upper left')
plt.show()

To show the accuracy when tested against the test data, meaning data the model did not train with, we can plot a confusion matrix as follows

ax = plt.subplot()
predict_results = model.predict(normed_test_data)
predict_results = (predict_results > 0) # the model outputs logits; a logit above 0 corresponds to a probability above 0.5
cm = confusion_matrix(test_labels, predict_results)
sns.heatmap(cm, annot=True, fmt='d', ax=ax) # draw the confusion matrix as a heatmap
ax.set_xlabel('Predicted labels')
ax.set_ylabel('True labels')
ax.set_title('Confusion Matrix')
ax.xaxis.set_ticklabels(['No Attack', 'Attack'])
ax.yaxis.set_ticklabels(['No Attack', 'Attack'])
plt.show()
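
In addition to the confusion matrix, the loss and accuracy on the test data can be computed directly with model.evaluate (a quick sketch):

test_loss, test_acc = model.evaluate(normed_test_data, test_labels, verbose=0) # evaluate on unseen data
print('Test accuracy:', test_acc)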

Figures: Model Accuracy plot and Confusion Matrix (generated by the code above)

Used Hardware

References