This project explores machine learning methods for distinguishing between benign and malicious web payloads, which could be used to enhance the capabilities of web application firewalls (WAFs) and intrusion detection systems (IDSs). It contains an end-to-end pipeline covering data preprocessing, feature engineering, model selection, and hyperparameter tuning. Two models, a Random Forest model and a fine-tuned CodeBERT model, were trained and evaluated, reaching F1-Scores of 0.9913 and 0.9970, respectively. The models' performance was benchmarked and visualized, providing insights into their real-world applicability in web security contexts.
- Machine Learning: Scikit-learn, HuggingFace Transformers
- Data Processing: Pandas, NumPy, NLTK
- Data Visualization: Matplotlib, Seaborn
- Hyperparameter Tuning: Optuna
configs # Configuration files for the pipeline
┣ roberta.json # Configuration for fine-tuned RoBERTa pipeline
┣ codebert.json # Configuration for fine-tuned CodeBERT pipeline
┗ rf.json # Configuration for Random Forest pipeline
data # Data storage
┗ dataset.csv # Merged payload dataset
models # Repository for trained model artifacts
┣ rf # Random Forest model artifacts
┣ roberta # RoBERTa model artifacts
┗ codebert # CodeBERT model artifacts
notebooks # Jupyter notebooks for EDA and experimentation
┣ eda.ipynb # Exploratory data analysis notebook
┣ evaluate_bert.ipynb # Evaluation of the RoBERTa model
┣ evaluate_rf.ipynb # Evaluation of the Random Forest model
┣ hyperopt_rf.ipynb # Hyperparameter tuning for Random Forest
┣ hyperopt_tfidf.ipynb # Hyperparameter tuning for TF-IDF
┗ merge.ipynb # Merging of the public datasets
src # Source code of the project
┣ pipeline.py # Configurable end-to-end pipeline
┣ preprocess.py # Data preprocessing and feature engineering
┣ tfidf.py # TF-IDF vectorization
┣ rf.py # Random Forest model training and evaluation
┗ bert.py # RoBERTa model fine-tuning and evaluation
test # Test suite for the project
Pipfile # Pipenv dependencies file
Pipfile.lock # Lock file ensuring deterministic builds
README.md # Project documentation and usage
download.sh # Script to download the datasets
Get up and running with the project:
- Setup and Activate Virtual Environment:

      pip install pipenv
      pipenv install --dev
      pipenv shell

- Run Tests:

      pytest

- Execute a Pipeline:

      python -m src.pipeline --config configs/rf.json

In-depth exploration of the dataset was carried out in the `eda.ipynb` notebook to identify key patterns and discrepancies in benign and malicious payloads. The following observations were made:
- Class Imbalance: A notable discrepancy in the representation of benign vs. malicious payloads, with malicious payloads accounting for only 12.39% of the dataset.
- Encoding / Obfuscation: A variety of encoding and obfuscation techniques were discovered, including URL, HTML, and hex encoding.
- Payload Length: Malicious payloads tend to be longer than benign ones.
- Special Characters: Malicious payloads tend to contain more special characters than benign ones.
- Common Words: In benign payloads, words seem to relate more to web elements like `a`, `href`, `class`, `li`, and `title`. In malicious payloads, words like `alert`, `script`, and `xss` dominate, indicating common attack patterns in cross-site scripting (XSS).
- Feature Correlation: The `label` has a moderate positive correlation with `special_chars_count`, indicating that the presence of more special characters could be a sign of malicious intent.
- Script Overview:
  - A preprocessing script (`preprocess.py`) performs several actions on the dataset:
    - NA Handling: Omission of entries with missing values.
    - Duplicate Handling: Removal of duplicate payloads.
    - URL and HTML Decoding: Application of URL and HTML decoding to payloads.
    - Text Normalization: Conversion of payloads to lowercase.
    - Feature Engineering: Introduction of two new features:
      - `payload_len`: The length of the payload.
      - `special_chars_count`: The count of special characters.
    - Train-Test Split: 80/20 split while maintaining label distribution.
    - Data Saving: The preprocessed data is stored in Parquet format for optimized I/O operations.
    - Statistical Summary: Generation and storage of label and category count summaries.
  - For further exploration, statistical summaries are stored in JSON format, facilitating convenient subsequent analysis and record-keeping.
- Feature Engineering:
  - TF-IDF Vectorization: The payload text for the Random Forest model is tokenized using the NLTK tokenizer and vectorized using the TF-IDF vectorizer.
  - RoBERTa Tokenization: The payload text for the RoBERTa model is tokenized using the RoBERTa tokenizer, a subword tokenizer that can handle out-of-vocabulary words.
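The decoding, normalization, and feature-engineering steps listed above can be sketched with the Python standard library alone. This is a minimal illustration, not the contents of `preprocess.py`: the function names are hypothetical, the order of decoding (URL first, then HTML) is an assumption, decoding is applied only once, and "special characters" is assumed here to mean the ASCII punctuation in `string.punctuation`, which may differ from the definition the project uses.

```python
import html
import string
from urllib.parse import unquote

# Assumption: "special characters" means ASCII punctuation.
SPECIAL_CHARS = set(string.punctuation)


def normalize_payload(payload: str) -> str:
    """URL-decode, HTML-unescape, and lowercase a payload (single pass)."""
    decoded = unquote(payload)        # e.g. %3C -> <
    decoded = html.unescape(decoded)  # e.g. &lt; -> <
    return decoded.lower()


def engineer_features(payload: str) -> dict:
    """Return the normalized payload plus the two hand-crafted features."""
    text = normalize_payload(payload)
    return {
        "payload": text,
        "payload_len": len(text),
        "special_chars_count": sum(ch in SPECIAL_CHARS for ch in text),
    }


if __name__ == "__main__":
    print(engineer_features("%3Cscript%3Ealert(1)%3C%2Fscript%3E"))
    print(engineer_features("&lt;IMG SRC=x&gt;"))
```

Payloads that are encoded more than once would need repeated decoding, which this sketch deliberately leaves out.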
The selection of Random Forest and RoBERTa was informed by the understanding developed during data exploration and the inherent capabilities of the models.
- Random Forest:
  - Insensitivity to Imbalanced Data: Can handle the class imbalance observed in the data.
  - Interpretability: Offers insights into feature importances.
- RoBERTa:
  - Sequential Understanding: Can understand the sequential nature of payloads, which is important for understanding the structure of HTML and JavaScript.
  - Transfer Learning: Leverages pre-existing knowledge from pre-training, potentially providing robustness in understanding varied payload structures.
- Process & Technique:
  - Random Forest: Performed in the `hyperopt_rf.ipynb` and `hyperopt_tfidf.ipynb` notebooks using Optuna, validated through cross-validation maximizing the F1-Score.
  - RoBERTa: Two pre-trained models were fine-tuned using the HuggingFace Transformers library. First, the original pre-trained RoBERTa base model was fine-tuned on the dataset. Then CodeBERT, a RoBERTa model pre-trained on CodeSearchNet, was fine-tuned on the dataset.
- Improvements:
  - The hyperparameter tuning process resulted in a slight improvement in the model's performance, as shown below:
  - Random Forest:
    - Before: F1-Score of 0.9837
    - After: F1-Score of 0.9913
  - RoBERTa:
    - Original: F1-Score of 0.9880
    - CodeBERT: F1-Score of 0.9970
    - Insight:
      - CodeBERT outperformed the original RoBERTa model (0.9970 vs. 0.9880), suggesting that pre-training on code data can be beneficial for understanding web payloads.
Model | F1-Score | Precision | Recall | Inference Time |
---|---|---|---|---|
Random Forest | 0.9913 | 0.9913 | 0.9913 | ~25 ms (CPU) |
CodeBERT | 0.9970 | 0.9988 | 0.9952 | ~10 ms (GPU) |
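As a quick consistency check on the table above, the F1-Score is the harmonic mean of precision and recall. The sketch below recomputes it from the reported precision and recall values. The helper name `f1_from_pr` is illustrative and not part of the project code, and the check assumes the reported metrics are not averaged across classes in a way that would break this identity.

```python
def f1_from_pr(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Precision and recall as reported in the results table.
    print(f"Random Forest: {f1_from_pr(0.9913, 0.9913):.4f}")  # 0.9913
    print(f"CodeBERT:      {f1_from_pr(0.9988, 0.9952):.4f}")  # 0.9970
```

Both recomputed values agree with the F1-Scores in the table to four decimal places.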
- Random Forest:
  - Real-World Applicability: Given its high precision and recall, this model shows promise for real-world applications. However, its inference time is relatively high, which could be a concern in real-time threat detection.
- CodeBERT:
  - Real-World Applicability: With superior performance metrics and lower GPU-based inference time, the fine-tuned CodeBERT model presents itself as a well-suited option for real-time threat detection in web traffic, particularly when GPU resources are accessible.
- Inference Time Optimization: Explore methods to speed up the model's inference, possibly through low-level optimizations using languages like Rust.
- Enhanced Feature Engineering: Investigate the impact of integrating hand-crafted features, such as special character counts, with the text data used in the transformer-based models. This could potentially improve model performance.
- Model Architecture Exploration: Explore the impact of using different transformer architectures, such as BERT, XLNet, and GPT-2, on the model's performance.
- Expand Dataset: The dataset used in this project is relatively small and could be expanded to include more malicious payloads, potentially improving model performance.