Microsoft Malware Detection | Bhoomeendra Singh Sisodiya

This project is about classifying malware. We are given a dataset containing 9 classes of malware, provided by Microsoft and available on Kaggle. The dataset is huge and contains 10,868 files.

Dataset

For each malware sample we are given one file containing the assembly code and one file containing the byte code. Each sample belongs to one of 9 classes:

Ramnit
Lollipop
Kelihos_ver3
Vundo
Simda
Tracur
Kelihos_ver1
Obfuscator.ACY
Gatak

The challenging part of the problem is extracting useful features from the dataset.

Data Exploration

The data is imbalanced: the number of samples differs significantly from class to class.
We also looked at the distribution of file sizes in the dataset.
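The class imbalance noted above can be quantified with a simple count per class. The snippet below is a minimal sketch on made-up toy labels, not the real dataset; the helper names `class_distribution` and `imbalance_ratio` are hypothetical and only for illustration.

```python
from collections import Counter

def class_distribution(labels):
    """Return {class_id: (count, fraction)} for an iterable of class labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {cls: (n, n / total) for cls, n in sorted(counts.items())}

def imbalance_ratio(labels):
    """Ratio between the largest and the smallest class size."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())

# Toy labels for illustration only (assumption: classes are integer ids).
toy_labels = [1, 1, 1, 1, 2, 2, 3]
dist = class_distribution(toy_labels)
ratio = imbalance_ratio(toy_labels)

for cls, (n, frac) in dist.items():
    print(f"class {cls}: {n} samples ({frac:.1%})")
print(f"largest/smallest class ratio: {ratio:.1f}")
```

A large ratio is a hint that per-class metrics or stratified splits may be worth considering when evaluating the model.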

Data Preprocessing

The data comes as assembly code and byte code, and we need to extract features from it. For that we used a bag-of-words approach: we extracted unigrams and bigrams from the binary files and the assembly files, and used their frequencies as features.
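The bag-of-words step can be sketched as n-gram counting over the tokens of a file. This is a minimal illustration, not the project's actual code. It assumes each line of a byte file starts with an address column followed by two-character hex tokens (with `??` for unknown bytes); the function names and the sample text are hypothetical.

```python
from collections import Counter

def parse_hex_tokens(text):
    """Split a hex dump into byte tokens, dropping the first column of each line.

    Assumption: the first whitespace-separated field on every line is an address.
    """
    tokens = []
    for line in text.splitlines():
        parts = line.split()
        tokens.extend(p.lower() for p in parts[1:])
    return tokens

def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

# Made-up sample, for illustration only.
sample = "00401000 56 8D 44 24\n00401004 56 8D ?? 00"
tokens = parse_hex_tokens(sample)
unigrams = ngram_counts(tokens, 1)
bigrams = ngram_counts(tokens, 2)

print(unigrams.most_common(3))
print(bigrams.most_common(3))
```

The same `ngram_counts` helper would apply to assembly files if they are first tokenized into opcodes or other symbols; how that tokenization was done in this project is not stated here.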
Image features: We also extracted image features from the binary files, where each byte in the file is represented as a pixel in the image. We used the first 800 pixels of the image formed from each binary file. This approach has been shown to work in one of the best solutions to the problem.
We also extracted the size of the files as a feature.
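The image and file-size features described above can be sketched as follows. This is an illustration under stated assumptions rather than the project's code: each byte is taken as one grayscale intensity in the range 0 to 255, files shorter than 800 bytes are zero-padded (an assumption, the original does not say how short files were handled), and the names `first_pixels` and `to_image_rows` are hypothetical.

```python
def first_pixels(byte_values, n_pixels=800):
    """Return the first n_pixels byte values as intensities, zero-padded if short."""
    pixels = list(byte_values[:n_pixels])
    pixels.extend([0] * (n_pixels - len(pixels)))  # padding is an assumption
    return pixels

def to_image_rows(byte_values, width=32):
    """Arrange byte values into rows of a fixed width, like a grayscale image."""
    return [list(byte_values[i:i + width]) for i in range(0, len(byte_values), width)]

# Toy content standing in for a real file (1024 bytes).
raw = bytes(range(256)) * 4

pixel_features = first_pixels(raw)
size_feature = len(raw)            # file size in bytes
features = pixel_features + [size_feature]
rows = to_image_rows(raw)

print(len(features), features[:5], features[-1])
```

Because the image is filled row by row, the first 800 pixels are simply the first 800 bytes of the file, so no image library is needed to compute them.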

Model

For the model we used an XGBoost classifier, trained on the features extracted from the dataset. We achieved a log loss of 0.0203 on the test split of the dataset.
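For readers unfamiliar with the metric, multi-class log loss is the mean negative log of the probability the model assigns to the true class. The sketch below is a plain-Python illustration of that definition, not the evaluation code used in the project. The function name, the clipping constant `eps`, and the toy predictions are assumptions.

```python
import math

def multiclass_log_loss(y_true, probs, eps=1e-15):
    """Mean negative log-probability assigned to the true class.

    y_true: list of integer class indices (0-based, an assumption).
    probs:  list of per-sample probability lists, one entry per class.
    """
    total = 0.0
    for label, row in zip(y_true, probs):
        p = min(max(row[label], eps), 1 - eps)  # clip to avoid log(0)
        total += -math.log(p)
    return total / len(y_true)

# Toy predictions for illustration only.
y_true = [0, 1, 2]
probs = [
    [0.90, 0.05, 0.05],
    [0.10, 0.80, 0.10],
    [0.20, 0.20, 0.60],
]
loss = multiclass_log_loss(y_true, probs)
print(f"log loss: {loss:.4f}")
```

Since exp(-loss) is the geometric mean of the probabilities given to the correct class, a log loss of 0.0203 corresponds to roughly 0.98 on average.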