SQL Injection (SQLi) is a major security vulnerability in which attacker-supplied input alters the SQL statements an application sends to its database. This project implements a machine learning-based binary classification system to distinguish between malicious and benign SQL queries.
The approach models SQL queries as text data and applies Bag-of-Words feature extraction, followed by:
- Logistic Regression (baseline)
- Feed-Forward Neural Network (improved model)
Let a SQL query be represented as a sequence of tokens:

$$q = (t_1, t_2, \dots, t_n)$$

Using Bag-of-Words (CountVectorizer), the query is transformed into a feature vector:

$$x = (x_1, x_2, \dots, x_d) \in \mathbb{R}^d$$

where:
- $d$ = vocabulary size
- $x_i$ = frequency of token $i$
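The mapping from a query to its count vector can be sketched with the standard library alone. This is an illustration, not the project's code: the helper names `tokenize`, `build_vocabulary`, and `vectorize` are hypothetical, and the regular expression is an assumption chosen to mirror CountVectorizer's default behaviour (lowercasing, tokens of two or more word characters). Stop-word removal and frequency cut-offs are left out here.

```python
import re
from collections import Counter

# Assumed tokenizer: lowercase, then runs of two or more word characters.
TOKEN_RE = re.compile(r"\b\w\w+\b")

def tokenize(query):
    return TOKEN_RE.findall(query.lower())

def build_vocabulary(queries):
    """Map each distinct token to a column index (sorted for determinism)."""
    tokens = sorted({tok for q in queries for tok in tokenize(q)})
    return {tok: i for i, tok in enumerate(tokens)}

def vectorize(query, vocabulary):
    """Return the count vector x, where x[i] is the frequency of token i."""
    counts = Counter(tokenize(query))
    x = [0] * len(vocabulary)
    for tok, n in counts.items():
        if tok in vocabulary:
            x[vocabulary[tok]] = n
    return x

# Two made-up queries, used only for illustration.
queries = [
    "SELECT name FROM users WHERE id = 5",
    "SELECT name FROM users WHERE id = 5 OR 1=1 UNION SELECT password FROM users",
]
vocab = build_vocabulary(queries)
x = vectorize(queries[1], vocab)

print(sorted(vocab))  # the d = 9 tokens, in column order
print(x)              # [2, 1, 1, 1, 1, 2, 1, 2, 1]
```

Note that single-character tokens such as `5` and the `1=1` fragment produce no features under this tokenizer, so the signal for the second query comes from tokens like `union` and `password`.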
The probability of a query being malicious is:

$$P(y = 1 \mid x) = \sigma(w^\top x + b), \qquad \sigma(z) = \frac{1}{1 + e^{-z}}$$

where:
- $w \in \mathbb{R}^d$ : weight vector
- $b$ : bias
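A minimal sketch of this prediction step, assuming the standard logistic (sigmoid) form of Logistic Regression. The weights, bias, and input below are hand-picked hypothetical values, not parameters learned by the project.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict_proba(x, w, b):
    """P(malicious | x) = sigmoid(w . x + b)."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return sigmoid(z)

# Hypothetical weights for a 3-token vocabulary, chosen for illustration.
w = [1.5, -0.5, 2.0]
b = -1.0
x = [1, 2, 0]  # token counts for one query

# z = 1.5*1 + (-0.5)*2 + 2.0*0 - 1.0 = -0.5
p = predict_proba(x, w, b)
print(round(p, 4))  # 0.3775
```

A query is typically labeled malicious when `p` exceeds a threshold such as 0.5; the threshold is a design choice that trades precision against recall.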
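Loss Function (Binary Cross-Entropy):

$$L = -\frac{1}{N} \sum_{k=1}^{N} \left[ y_k \log \hat{y}_k + (1 - y_k) \log (1 - \hat{y}_k) \right]$$

where $N$ is the number of training queries, $y_k \in \{0, 1\}$ is the true label of query $k$, and $\hat{y}_k$ is its predicted probability of being malicious.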
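Binary cross-entropy averaged over a batch can be written in a few lines. The function name and the clipping constant `eps` are assumptions made for this sketch; clipping only guards against `log(0)`.

```python
import math

def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

# Confident and correct predictions give a small loss ...
loss_good = binary_cross_entropy([1, 0], [0.9, 0.1])
# ... while confident and wrong predictions are penalized heavily.
loss_bad = binary_cross_entropy([1, 0], [0.1, 0.9])
print(round(loss_good, 4), round(loss_bad, 4))  # 0.1054 2.3026
```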
The neural network used consists of multiple dense layers:
Additional components:
- Batch Normalization
- Dropout (0.5)
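To make the roles of these components concrete, here is a dependency-free sketch of one forward pass. Everything about the shape of the network is an assumption: the layer sizes, the single hidden layer, the ReLU activation, and the ordering Dense, Batch Normalization, ReLU, Dropout are illustrative and not taken from the project. Only the dropout rate of 0.5 comes from the text above.

```python
import math
import random

def dense(batch, W, b):
    """Affine layer: each input row x maps to W x + b."""
    return [[sum(wij * xj for wij, xj in zip(row_w, x)) + bi
             for row_w, bi in zip(W, b)] for x in batch]

def batch_norm(batch, eps=1e-5):
    """Normalize each feature to zero mean and unit variance over the batch.
    The learnable scale and shift are omitted, and batch statistics are used
    in both modes (real frameworks use running averages at inference)."""
    n = len(batch)
    out = [row[:] for row in batch]
    for j in range(len(batch[0])):
        col = [row[j] for row in batch]
        mean = sum(col) / n
        var = sum((v - mean) ** 2 for v in col) / n
        for i in range(n):
            out[i][j] = (batch[i][j] - mean) / math.sqrt(var + eps)
    return out

def relu(batch):
    return [[max(0.0, v) for v in row] for row in batch]

def dropout(batch, rate, training, rng):
    """Inverted dropout: zero units with probability `rate` during training
    and rescale the survivors; identity at inference time."""
    if not training:
        return batch
    keep = 1.0 - rate
    return [[v / keep if rng.random() < keep else 0.0 for v in row]
            for row in batch]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(batch, params, training, rng):
    h = dense(batch, params["W1"], params["b1"])
    h = dropout(relu(batch_norm(h)), 0.5, training, rng)
    logits = dense(h, params["W2"], params["b2"])
    return [sigmoid(row[0]) for row in logits]

rng = random.Random(0)
d, hidden = 4, 3  # hypothetical vocabulary size and hidden width
params = {
    "W1": [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(hidden)],
    "b1": [0.0] * hidden,
    "W2": [[rng.uniform(-1, 1) for _ in range(hidden)]],
    "b2": [0.0],
}
batch = [[1, 0, 2, 0], [0, 1, 0, 1], [2, 2, 0, 0]]  # made-up count vectors
probs = forward(batch, params, training=False, rng=rng)
print(probs)  # one probability per query
```

In practice a framework such as Keras or PyTorch would supply these layers along with automatic differentiation; the sketch only shows what each component does to the data.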
- Input: labeled SQL queries (malicious / benign)
- No heavy manual preprocessing
- Vectorization using CountVectorizer:
  - `min_df = 2`
  - `max_df = 0.7`
  - English stopwords removed
- Bag-of-Words encoding
- Captures token frequency patterns in SQL queries
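The effect of the vocabulary settings above can be illustrated without scikit-learn. The sketch below assumes the usual CountVectorizer semantics: an integer `min_df` is a minimum number of queries, and a float `max_df` is a maximum fraction of queries. The stop-word set is a tiny stand-in chosen for illustration, and `select_vocabulary` is a hypothetical helper name.

```python
import re
from collections import Counter

TOKEN_RE = re.compile(r"\b\w\w+\b")

# A tiny illustrative stop-word set (assumption); real lists are much longer.
STOP_WORDS = {"and", "or", "the", "from", "where"}

def tokenize(query):
    return [t for t in TOKEN_RE.findall(query.lower()) if t not in STOP_WORDS]

def select_vocabulary(queries, min_df=2, max_df=0.7):
    """Keep tokens that occur in at least `min_df` queries and in at most
    a fraction `max_df` of all queries."""
    doc_freq = Counter()
    for q in queries:
        doc_freq.update(set(tokenize(q)))  # count each token once per query
    max_count = max_df * len(queries)
    kept = sorted(t for t, df in doc_freq.items() if min_df <= df <= max_count)
    return {tok: i for i, tok in enumerate(kept)}

# Four made-up queries, used only for illustration.
queries = [
    "SELECT name FROM users WHERE id = 5",
    "SELECT email FROM users WHERE name = 'bob'",
    "SELECT id FROM orders UNION SELECT password FROM users",
    "SELECT total FROM orders WHERE id = 7 UNION SELECT card FROM payments",
]
vocab = select_vocabulary(queries)
print(vocab)  # {'name': 0, 'orders': 1, 'union': 2}
```

With four queries, `max_df = 0.7` removes any token that appears in three or more of them (here `select`, `users`, and `id`), while `min_df = 2` removes tokens seen only once (such as `email` or `password`). One consequence worth keeping in mind is that very common SQL keywords can be filtered out entirely.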
- Logistic Regression trained as baseline
- Neural Network trained using:
  - Binary Cross-Entropy Loss
  - Backpropagation
  - Gradient-based optimization
- Accuracy
- Precision
- Recall
- F1-score
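These four metrics all derive from the confusion-matrix counts, with "malicious" treated as the positive class. The function name and the toy labels below are assumptions for this sketch.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = malicious)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Toy labels: 4 malicious and 6 benign queries, with one miss and one false alarm.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
metrics = classification_metrics(y_true, y_pred)
print(metrics)  # accuracy 0.8, precision 0.75, recall 0.75, f1 0.75
```

For an intrusion-detection setting, recall measures how many attacks are caught, while precision measures how many alerts are genuine.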
- Logistic Regression Accuracy: 94.3%
- Neural Network Accuracy: 95.86%
- Precision: 98.63%
- Recall: 89.84%
- F1-score: 94.03%
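As a quick consistency check, assuming the precision, recall, and F1-score above were measured for the same model on the same test set, the F1-score follows from the harmonic mean of precision $P$ and recall $R$:

$$F_1 = \frac{2PR}{P + R} = \frac{2 \times 0.9863 \times 0.8984}{0.9863 + 0.8984} = \frac{1.7722}{1.8847} \approx 0.9403$$

This matches the reported 94.03%.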
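The neural network is about 1.6 percentage points more accurate than the Logistic Regression baseline (95.86% vs. 94.3%), which is consistent with its ability to model non-linear relationships between tokens.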
Malicious SQL queries contain identifiable patterns such as:
- `' OR 1=1 --`
- `UNION SELECT`
These patterns are captured as token frequency features, allowing the models to learn associations between specific token combinations and malicious behavior.
Queries are represented as vectors:

$$x \in \mathbb{R}^d$$
Neural networks involve stacked matrix multiplications.
Optimization is performed using gradient descent.
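Gradients are computed as:

$$\nabla_\theta L = \frac{\partial L}{\partial \theta}$$

Parameters are updated iteratively to minimize the loss:

$$\theta \leftarrow \theta - \eta \, \nabla_\theta L$$

where $\theta$ denotes the model parameters (weights and biases), $L$ the loss, and $\eta$ the learning rate.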
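For the Logistic Regression baseline, the gradient of the binary cross-entropy loss has a simple closed form: the mean of (predicted probability minus label) times the input. The sketch below applies plain batch gradient descent with that gradient. The learning rate, epoch count, and toy data are assumptions for illustration, and the project's actual optimizer is not specified in the text.

```python
import math

def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict_proba(x, w, b):
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

def train(X, y, lr=0.5, epochs=200, eps=1e-12):
    """Batch gradient descent on mean binary cross-entropy."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    losses = []
    for _ in range(epochs):
        probs = [predict_proba(x, w, b) for x in X]
        clipped = [min(max(p, eps), 1.0 - eps) for p in probs]
        losses.append(-sum(yi * math.log(p) + (1 - yi) * math.log(1.0 - p)
                           for yi, p in zip(y, clipped)) / n)
        # Gradient of the loss: mean of (p - y) * x for w, mean of (p - y) for b.
        errors = [p - yi for p, yi in zip(probs, y)]
        grad_w = [sum(e * x[j] for e, x in zip(errors, X)) / n for j in range(d)]
        grad_b = sum(errors) / n
        # Step against the gradient.
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b, losses

# Toy count vectors over the hypothetical tokens [select, union, or];
# a query is labeled malicious (1) when it contains "union" or "or".
X = [[1, 0, 0], [2, 0, 0], [1, 1, 0], [1, 0, 1], [2, 1, 1], [1, 0, 0]]
y = [0, 0, 1, 1, 1, 0]

w, b, losses = train(X, y)
predictions = [int(predict_proba(x, w, b) >= 0.5) for x in X]
print(losses[0], losses[-1])  # the loss starts at log(2) and decreases
print(predictions)
```

For the neural network the update rule is the same, but the gradients for each layer are obtained by backpropagation rather than from a closed-form expression.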
A Bag-of-Words representation combined with Logistic Regression and a Feed-Forward Neural Network is effective for SQL injection detection. The neural network provides improved accuracy by capturing non-linear relationships in token patterns.