NEU steel surface defect classification project for a resume-ready portfolio. It includes a classical computer vision baseline and a transfer-learning CNN.
Apple MDE teams need reliable inspection signals to catch surface defects before they become yield loss, rework, or customer-facing quality escapes. This project shows how a manufacturing defect-classification workflow can compare a classical vision baseline against a ResNet18 transfer-learning model, then use confusion matrices and misclassified samples to understand inspection risk.
| Model | Accuracy | Macro Precision | Macro Recall |
|---|---|---|---|
| HOG + SVM Baseline | 61.48% | 60.98% | 61.48% |
| ResNet18 Transfer Learning | 99.63% | 99.64% | 99.63% |
The ResNet18 model misclassified only 1 sample out of 270 test images. The misclassified sample was an Inclusion defect predicted as Pitted Surface, which was further analyzed from a manufacturing false-negative risk perspective.
Classify six steel surface defects from the NEU dataset:
| Code | Class |
|---|---|
| Cr | Crazing |
| In | Inclusion |
| Pa | Patches |
| PS | Pitted surface |
| RS | Rolled-in scale |
| Sc | Scratches |
- OpenCV preprocessing: grayscale loading, resize to 200x200, and training-only augmentation (see the sketch after this list).
- Baseline: HOG features + linear SVM.
- Deep model: PyTorch ResNet18 transfer learning.
- Evaluation: accuracy, macro precision, macro recall, classification report, confusion matrix.
- Error analysis: use the confusion matrix and misclassified samples to explain which defect classes are confused and why.
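The preprocessing bullet above is small enough to sketch directly. The snippet below is illustrative and assumes OpenCV and NumPy; the flip-based augmentation and function names are placeholders rather than the exact transforms used by the training scripts.

```python
import cv2
import numpy as np

IMG_SIZE = 200

def load_image(path: str) -> np.ndarray:
    """Load a steel-surface image as grayscale and resize to 200x200."""
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    if img is None:
        raise FileNotFoundError(path)
    return cv2.resize(img, (IMG_SIZE, IMG_SIZE))

def augment(img: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Training-only augmentation: random horizontal/vertical flips."""
    if rng.random() < 0.5:
        img = cv2.flip(img, 1)  # horizontal flip
    if rng.random() < 0.5:
        img = cv2.flip(img, 0)  # vertical flip
    return img
```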
```
python -m venv .venv
.\.venv\Scripts\Activate.ps1
pip install -r requirements.txt
```

The original dataset can stay outside the repo. This command creates data/splits.csv with stratified train/val/test splits:

```
python scripts/prepare_dataset.py --dataset-dir "C:\Users\18747\Desktop\NEU surface defect database"
```

Default split is 70% train, 15% validation, 15% test.
Note: data/splits.csv should be regenerated on each machine because image paths depend on the local dataset location. The split file is ignored by Git; keep only data/README.md in the repository.
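Under the hood, a 70/15/15 stratified split can be produced with two chained train_test_split calls. The sketch below is an assumption about how scripts/prepare_dataset.py might implement it; the column names and splits.csv schema are illustrative.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def make_splits(df: pd.DataFrame, seed: int = 42) -> pd.DataFrame:
    """df has columns 'path' and 'label'; adds a 'split' column (70/15/15)."""
    train, rest = train_test_split(
        df, test_size=0.30, stratify=df["label"], random_state=seed
    )
    val, test = train_test_split(
        rest, test_size=0.50, stratify=rest["label"], random_state=seed
    )
    df = df.copy()
    df.loc[train.index, "split"] = "train"
    df.loc[val.index, "split"] = "val"
    df.loc[test.index, "split"] = "test"
    return df
```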
- Random seed: 42
- Image size: 200x200
- Split: 70% train, 15% validation, 15% test
- HOG + SVM: HOG features with LinearSVC and balanced class weights
- ResNet18: ImageNet pretrained weights, AdamW optimizer, learning rate 3e-4, batch size 32, epochs 15
- Model selection: best validation accuracy
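The ResNet18 entries above translate into a few lines of PyTorch/torchvision. The sketch below uses only the configuration values listed here; the layer-freezing policy and grayscale-to-RGB handling are not specified in this README and are omitted, so treat it as an outline rather than the exact contents of scripts/train_resnet18.py.

```python
import torch
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 6  # Cr, In, Pa, PS, RS, Sc

# Start from ImageNet-pretrained weights and replace the classification head.
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
criterion = nn.CrossEntropyLoss()
```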
```
python scripts/train_hog_svm.py --splits data/splits.csv
```

Outputs:

- outputs/hog_svm/metrics.json
- outputs/hog_svm/classification_report.txt
- outputs/hog_svm/confusion_matrix.png
- models/hog_svm.joblib
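For reference, the core of the baseline is a short scikit-image / scikit-learn pipeline. The sketch below is illustrative: the HOG parameters are assumed values, and only "HOG features + LinearSVC with balanced class weights" comes from the configuration above.

```python
import numpy as np
from skimage.feature import hog
from sklearn.svm import LinearSVC

def hog_features(images):
    """images: iterable of 200x200 grayscale arrays -> HOG feature matrix."""
    return np.array([
        hog(img, orientations=9, pixels_per_cell=(8, 8), cells_per_block=(2, 2))
        for img in images
    ])

# Balanced class weights guard against any residual class imbalance.
clf = LinearSVC(class_weight="balanced", max_iter=5000)
# clf.fit(hog_features(train_images), train_labels)
```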
```
python scripts/train_resnet18.py --splits data/splits.csv --epochs 15 --batch-size 32
```

To save the full training log:

```
python scripts/train_resnet18.py --splits data/splits.csv --epochs 15 --batch-size 32 *> reports/training_log_resnet18.txt
```

The first pretrained run may download ImageNet weights. If the machine has no network access, use:

```
python scripts/train_resnet18.py --splits data/splits.csv --epochs 15 --batch-size 32 --no-pretrained
```

Outputs:

- outputs/resnet18/metrics.json
- outputs/resnet18/classification_report.txt
- outputs/resnet18/confusion_matrix.png
- models/resnet18.pth
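The metrics.json values and the classification report map directly onto scikit-learn calls. An illustrative sketch of that evaluation step (variable names are placeholders):

```python
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    precision_score,
    recall_score,
)

def evaluate(y_true, y_pred, class_names):
    """Compute the metrics reported in this README for one model."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "macro_precision": precision_score(y_true, y_pred, average="macro"),
        "macro_recall": recall_score(y_true, y_pred, average="macro"),
        "report": classification_report(y_true, y_pred, target_names=class_names),
        "confusion_matrix": confusion_matrix(y_true, y_pred).tolist(),
    }
```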
Test set size: 270 images, with 45 images per class.
| Model | Accuracy | Macro Precision | Macro Recall |
|---|---|---|---|
| HOG + SVM | 61.48% | 60.98% | 61.48% |
| ResNet18 Transfer Learning | 99.63% | 99.64% | 99.63% |
Compared with the HOG + SVM baseline, the ResNet18 transfer-learning model improves test accuracy from 61.48% to 99.63%.
Result artifacts committed under reports/:
- reports/metrics_hog_svm.json
- reports/metrics_resnet18.json
- reports/classification_report_hog_svm.txt
- reports/classification_report_resnet18.txt
- reports/confusion_hog_svm.json
- reports/confusion_resnet18.json
- reports/figures/hog_svm_confusion_matrix.png
- reports/figures/resnet18_confusion_matrix.png
- reports/misclassified_samples/resnet18_001_true_In_pred_PS.png
Based on the HOG + SVM confusion matrix, the baseline performed weakly on Pa/Patches and PS/Pitted surface. The largest confusion pairs were Pa -> Cr with 12 samples, PS -> Cr with 11 samples, and PS -> In with 8 samples. This is consistent with HOG relying on local edge and gradient patterns instead of higher-level defect texture.
Based on the ResNet18 confusion matrix, the model made one test-set error: In_240.bmp, true class In/Inclusion, predicted as PS/Pitted surface. A likely reason is that both classes contain small dark local defect patterns, and the difference may depend on local density and distribution.
See reports/error_analysis.md for the detailed analysis.
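Ranking the off-diagonal cells of a confusion matrix is enough to surface pairs like Pa -> Cr automatically. A small illustrative helper (not the project's exact analysis code):

```python
import numpy as np

def top_confusions(cm, class_names, k=3):
    """Largest off-diagonal confusion-matrix cells as (true, predicted, count)."""
    cm = np.asarray(cm)
    pairs = [
        (class_names[i], class_names[j], int(cm[i, j]))
        for i in range(len(class_names))
        for j in range(len(class_names))
        if i != j and cm[i, j] > 0
    ]
    return sorted(pairs, key=lambda p: p[2], reverse=True)[:k]
```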
- Camera captures the steel surface image.
- Preprocessing normalizes size, grayscale format, and contrast.
- The CNN classifier predicts defect class and confidence score.
- Low-confidence or high-risk predictions enter manual review (see the routing sketch after this list).
- Per-class recall and false negative rate are monitored weekly.
- Misclassified and borderline samples are added to the retraining set.
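The routing step in this flow can be sketched as a softmax-confidence check, assuming a trained PyTorch classifier. The 0.90 threshold below is a placeholder that would be tuned on validation data:

```python
import torch
import torch.nn.functional as F

CONFIDENCE_THRESHOLD = 0.90  # placeholder value; tune on validation data

@torch.no_grad()
def classify_batch(model, images, class_names):
    """Return (predicted class, confidence, needs_manual_review) per image."""
    model.eval()
    probs = F.softmax(model(images), dim=1)
    conf, idx = probs.max(dim=1)
    return [
        (class_names[i], c, c < CONFIDENCE_THRESHOLD)
        for i, c in zip(idx.tolist(), conf.tolist())
    ]
```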
- The NEU dataset is relatively small and clean compared with real production images.
- The test set contains 270 images, so 99.63% accuracy corresponds to one misclassified image.
- Real production deployment would require validation on recent line images, camera and lighting drift checks, and a recall-oriented threshold.
- This project is a classification prototype, not a full production inspection system.
In a real production line, reducing false negatives matters more than maximizing overall accuracy. Practical measures include using a defect/non-defect threshold tuned for high recall, reviewing low-confidence predictions, collecting more hard negative and borderline defect samples, monitoring per-class recall, and adding lighting/camera checks so missed defects are not caused by image acquisition drift.
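One way to obtain such a recall-oriented threshold is to pick it directly from validation scores of known defect samples. The helper below is an illustrative sketch (not part of this repository), and the 0.99 target recall is an example value:

```python
import numpy as np

def threshold_for_recall(defect_scores, target_recall=0.99):
    """Highest score threshold that still flags at least target_recall of
    known defect samples (scores where higher means 'more defective')."""
    scores = np.sort(np.asarray(defect_scores, dtype=float))[::-1]  # descending
    k = int(np.ceil(target_recall * len(scores)))  # defects that must be caught
    return float(scores[min(k, len(scores)) - 1])
```

Samples scoring below the chosen threshold would go to manual review, trading extra review load for fewer missed defects.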

