AI: Anomaly Detection in logfiles


 ➤ IMPORTANT: This page is still under construction.

Summary

This guide will create a basic AI model that performs binary classification in order to detect anomalies in logfiles. This AI model is also suitable for the Jetson AGX Xavier Development Kit.

Requirements

  • Packages: TensorFlow, Keras, Pandas, scikit-learn (sklearn), NumPy, Seaborn, Matplotlib (imported as sketched below)
  • Software: PyCharm or any other Python editor
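
All code snippets in this guide assume the following imports. This is only a minimal sketch: the alias sk for sklearn.model_selection is an assumption derived from its use in the dataset split below, since the original guide does not show its import statements.

# assumed imports for the snippets in this guide
import numpy as np
import pandas as pd
import tensorflow as tf
import sklearn.model_selection as sk
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense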

Description

Step 1 - Read the dataset

First we need to read the data. For that we can use the predefined pandas function 'read_csv':

logfile_features = pd.read_csv(path)
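
Here, path is expected to point to the CSV export of the logfile dataset. A self-contained example could look as follows; the filename is purely hypothetical:

path = "logfiles.csv"  # hypothetical filename, adjust to the location of your dataset
logfile_features = pd.read_csv(path)
print(logfile_features.head())  # peek at the first rows to confirm the file was read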

Afterwards we replace the infinite values with NaNs and then drop all rows that contain NaNs:

logfile_features.replace([np.inf, -np.inf], np.nan, inplace=True)
logfile_features.dropna(inplace=True)

Our dataset has labels which define whether an entry is an attack or not, so we replace them with numerical values (0 and 1):

logfile_features["Label"].replace({"Benign": 0, "DoS attacks-Slowloris": 1, "DoS attacks-GoldenEye": 1}, inplace=True)

Next we shuffle our dataset

logfile_features = logfile_features.sample(frac=1) 

Now we need to split our data into 3 parts: Training data (60%), Test data (20%) and Validation data (20%). To do that we use the following function calls:

train_dataset, temp_test_dataset = sk.train_test_split(logfile_features, test_size=0.4)
test_dataset, valid_dataset = sk.train_test_split(temp_test_dataset, test_size=0.5)
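
As a quick sanity check (illustrative only, not part of the original guide), the resulting proportions can be confirmed by printing the size of each split:

# should be roughly 60% / 20% / 20% of the cleaned dataset
print(len(train_dataset), len(test_dataset), len(valid_dataset))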

Step 2 - Create a model

First we need to create a sequential model, which can be trained later.

model = Sequential()

The next step is to create an input layer consisting of 63 nodes, one for every feature we have in our dataset.

model.add(Dense(63))

Next a hidden layer consisting of 128 nodes with the ReLU (Rectified Linear Unit) activation function.

model.add(Dense(128, activation='relu'))

And finally the output layer consisting of 1 node which represents 'attack' or 'no attack'

model.add(Dense(1))
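
To inspect the resulting architecture at this point (illustrative only, not part of the original guide), the model can be built for the expected 63 input features and summarised:

# build the model for 63 input features so the layer shapes are resolved
model.build(input_shape=(None, 63))
model.summary()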

Now we could change the learning rate to a specific value, but we just leave it at the default 0.001

learning_rate = 0.001

For the optimizer we just use the Adam Optimizer with the pre-defined learning rate.

optimizer = tf.optimizers.Adam(learning_rate)

Lastly we need to compile the model. For the loss function we use BinaryCrossentropy, for the optimizer our previously defined Adam optimizer, and as the metric we use the accuracy of the model:

model.compile(loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              optimizer=optimizer,
              metrics=['accuracy'])
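
With the model compiled, a short illustrative training run could look like the following. The separation of the "Label" column from the features, the type cast, and the chosen epochs and batch_size are assumptions for this sketch, not part of the original guide; it presumes that only the 63 numeric feature columns and the Label column remain in the datasets.

# separate labels from features in each split (assumes only the 63 numeric feature columns remain)
train_labels = train_dataset.pop("Label").astype("float32")
valid_labels = valid_dataset.pop("Label").astype("float32")

# illustrative training run; epochs and batch_size are arbitrary choices
model.fit(train_dataset, train_labels,
          validation_data=(valid_dataset, valid_labels),
          epochs=10, batch_size=256)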



Used Hardware