Difference between revisions of "AI: Anomaly Detection in logfiles"
Line 13: | Line 13: | ||
== Description == | == Description == | ||
=== Step 0 - Import the needed packages/libraries | |||
from keras.callbacks import EarlyStopping, ModelCheckpoint | |||
from keras.models import Sequential, load_model | |||
from keras.layers import Dense | |||
import pandas as pd | |||
import sklearn.model_selection as sk | |||
from DataCleaner import cleaner | |||
pd.options.mode.chained_assignment = None | |||
import numpy as np | |||
import tensorflow as tf | |||
import seaborn as sns | |||
from sklearn.metrics import confusion_matrix | |||
from matplotlib import pyplot as plt | |||
=== Step 1 - Read the dataset === | === Step 1 - Read the dataset === | ||
Line 51: | Line 66: | ||
=== Step 2 - Create a model === | === Step 2 - Create a model === | ||
First we need to create a sequential model, which can be trained later | First we need to create a sequential model, which can be trained later | ||
model = Sequential() | model = Sequential() | ||
Revision as of 14:37, 13 July 2022
Summary
This guide will create a basic AI model to perform binary classification in order to detect anomalies in logfiles. This AI model is also suitable for the Jetson AGX Xavier Development Kit
Requirements
- Packages: TensorFlow, Keras, Pandas, sklearn, numpy, seaborn, matplotlib
- Software: Pycharm or any other python editor
Description
=== Step 0 - Import the needed packages/libraries
from keras.callbacks import EarlyStopping, ModelCheckpoint from keras.models import Sequential, load_model from keras.layers import Dense import pandas as pd import sklearn.model_selection as sk from DataCleaner import cleaner
pd.options.mode.chained_assignment = None import numpy as np import tensorflow as tf import seaborn as sns from sklearn.metrics import confusion_matrix from matplotlib import pyplot as plt
Step 1 - Read the dataset
First we need to read the data, for that we can use the predefined function from pandas 'read_csv'
logfile_features = pd.read_csv(path)
Afterwards we replace the infinite values with nans and drop them all together
logfile_features.replace([np.inf, -np.inf], np.nan, inplace=True) logfile_features.dropna(inplace=True)
Our dataset has labels which define if its an attack or not, so we replace them with numericals (0 and 1)
logfile_features["Label"].replace({"Benign": 0, "DoS attacks-Slowloris": 1, "DoS attacks-GoldenEye": 1}, inplace=True)
Next we shuffle our dataset
logfile_features = logfile_features.sample(frac=1)
Now we need to split our data into 3 parts: Training data (60%), Test data (20%) and Validation data (20%). To do that we use the following methods
train_dataset, temp_test_dataset = sk.train_test_split(logfile_features, test_size=0.4) test_dataset, valid_dataset = sk.train_test_split(temp_test_dataset, test_size=0.5)
Next we extract the labels from the actual dataset, we need them extra for our training
train_labels = train_dataset.pop('Label') test_labels = test_dataset.pop('Label') valid_labels = valid_dataset.pop('Label')
To norm our data correctly we need to get the stats of our training data (mean and standard deviation)
train_stats = train_dataset.describe() train_stats = train_stats.transpose()
For norming we use a following method
def norm(x, stats): return (x - stats['mean']) / stats['std']
This method can then be used the following way
normed_train_data = norm(train_dataset, train_stats) normed_test_data = norm(test_dataset, train_stats) normed_valid_dataset = norm(valid_dataset, train_stats)
Step 2 - Create a model
First we need to create a sequential model, which can be trained later
model = Sequential()
The next step is to create an input layer with exactly as many nodes as features in our training data
model.add(Dense(normed_train_data.shape[1],input_shape=(normed_train_data.shape[1],))
Next a hidden layer consisting of 128 nodes with the ReLU (Rectified Linear Unit) activation function
model.add(Dense(128, Activation('relu')))
And finally the output layer consisting of 1 node which represents 'attack' or 'no attack'
model.add(Dense(1))
Now we could change the learning rate to a specific value, but we just leave it at the default 0.001
learning_rate = 0.001
For the optimizer we just use the Adam Optimizer with the pre-defined learning rate
optimizer = tf.optimizers.Adam(learning_rate)
Lastly we need to compile the model, for the loss function we use BinaryCorssentropy, our optimizer and the metric should be the accuarcy of the model
model.compile(loss=tf.keras.losses.BinaryCrossentropy(from_logits=True), optimizer=optimizer, metrics=['accuracy'])
Step 3 - Train the model
First we set our epochs, a complete pass of the normed training data through the model, and our batch size, after how many datapoints the model gets updated
EPOCHS = 5000 batch_size = 1024
To not have to wait for 5000 training epochs to finish and to prevent overfitting we can set an early stop
es = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=2)
Finally the training, model fitting, can start
with tf.device('/CPU:0'): # with tf.device('/GPU:0'): # wenn man mit der Grafikkarte trainieren will history = model.fit( normed_train_data, train_labels, batch_size=batch_size, epochs=EPOCHS, verbose=1, shuffle=True, steps_per_epoch=int(normed_train_data.shape[0] / batch_size), validation_data=(normed_valid_dataset, valid_labels), callbacks=[es], )
Step 4 - Plot the results
After the training has been completed you can easily plot the accuarcy and validation accuracy during the training using
plt.plot(history.history['accuracy']) plt.plot(history.history['val_accuracy']) plt.title('model accuracy') plt.ylabel('accuracy') plt.xlabel('epoch') plt.legend(['Train', 'Cross-Validation'], loc='upper left') plt.show()
To show the accuarcy when tested against the test data, meaning data the model didn't train with, we can use a confusion matrix the following
ax = plt.subplot() predict_results = model.predict(normed_test_data) predict_results = (predict_results > 0.5) cm = confusion_matrix(test_labels, predict_results) ax.set_xlabel('Predicted labels') ax.set_ylabel('True labels') ax.set_title('Confusion Matrix') ax.xaxis.set_ticklabels(['No Attack', 'Attack']) ax.yaxis.set_ticklabels(['No Attack', 'Attack']) plt.show()