This project is an end-to-end machine learning workflow to predict the sale price of bulldozers using historical auction data.
It is based on the Bluebook for Bulldozers Kaggle competition and follows a complete ML pipeline — from data preprocessing to model evaluation and prediction.
The goal of this project is to predict the future sale price of a bulldozer based on its characteristics and past sales data.
Since the output is a continuous value, this is a regression problem.
Additionally, because the data includes time-based features (sale dates), it also involves time series forecasting concepts.
The dataset comes from the Kaggle competition and consists of:
- Train.csv → Historical sales data through the end of 2011
- Valid.csv → Validation data (Jan–Apr 2012)
- Test.csv → Test data (May–Nov 2012, without target)
Key features include:
- Machine specifications (ModelID, ProductSize, etc.)
- Usage data (MachineHoursCurrentMeter)
- Sale information (state, auctioneerID)
- Time-based feature (saledate)

The target variable is SalePrice.
This project follows a structured ML pipeline:
Problem definition: predict the bulldozer sale price from historical auction data.
- Load dataset with Pandas
- Understand structure, missing values, and data types
- Identify key features
- Convert saledate into:
  - Year
  - Month
  - Day
  - Day of week
- Handle missing values:
  - Numerical → filled with median
  - Categorical → converted to numerical codes
- Add missing-value indicator columns
- Convert categorical variables into numeric format
- Ensure training and test data have the same feature structure
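The loading and preprocessing steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the inline CSV rows are made up, and the helper names `add_date_parts` and `fill_and_encode`, as well as the `sale*` and `_is_missing` column names, are assumptions chosen for the example.

```python
import io

import pandas as pd

# Hypothetical sample rows standing in for Train.csv; the real file has many more columns.
csv_text = """SalesID,saledate,MachineHoursCurrentMeter,ProductSize,SalePrice
1,2011-03-15,1200,Small,30000
2,2010-11-02,,,52000
3,2009-07-21,800,Large,41000
"""

# Load with Pandas and parse the sale date as a datetime column.
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["saledate"])

# Inspect structure: number of missing values per column.
missing_counts = df.isna().sum()


def add_date_parts(data: pd.DataFrame, column: str = "saledate") -> pd.DataFrame:
    """Replace a datetime column with year, month, day and day-of-week columns."""
    out = data.copy()
    dates = pd.to_datetime(out[column])
    out["saleYear"] = dates.dt.year
    out["saleMonth"] = dates.dt.month
    out["saleDay"] = dates.dt.day
    out["saleDayOfWeek"] = dates.dt.dayofweek  # Monday = 0
    return out.drop(columns=[column])


def fill_and_encode(data: pd.DataFrame) -> pd.DataFrame:
    """Fill numeric gaps with the median and turn other columns into integer codes."""
    out = data.copy()
    for name in list(out.columns):
        col = out[name]
        if pd.api.types.is_numeric_dtype(col):
            if col.isna().any():
                # Indicator column records which rows were originally missing.
                out[name + "_is_missing"] = col.isna()
                out[name] = col.fillna(col.median())
        else:
            out[name + "_is_missing"] = col.isna()
            # Category codes use -1 for missing values; adding 1 maps missing to 0.
            out[name] = col.astype("category").cat.codes + 1
    return out


processed = fill_and_encode(add_date_parts(df))
print(missing_counts)
print(processed)
```

In a real run, the medians and category mappings would normally be computed on the training data and then reused for the validation and test sets, so that no information leaks from later sales into the model.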
- Model used: RandomForestRegressor
- Reason:
  - Works well on structured/tabular data
  - Handles non-linear relationships
  - No need for feature scaling
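As a rough illustration of the modelling step, the sketch below fits a RandomForestRegressor on synthetic data. The data, the random seed, and the choice of `n_estimators=50` are assumptions made for this example only; they are not the project's settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for the preprocessed features and sale prices.
rng = np.random.default_rng(42)
X_train = rng.uniform(0, 10, size=(200, 3))
# Non-linear target: the forest needs no feature scaling to fit this.
y_train = 1000 + 50 * X_train[:, 0] ** 2 + rng.normal(0, 5, size=200)

model = RandomForestRegressor(n_estimators=50, random_state=42)
model.fit(X_train, y_train)

# Predict prices for the first five rows.
preds = model.predict(X_train[:5])
print(preds)
```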
The evaluation metric used is:
- RMSLE (Root Mean Squared Log Error)
Why RMSLE?
- Penalizes relative (percentage) errors rather than absolute differences
- Suitable for price prediction problems with wide value ranges
Other metrics:
- MAE (Mean Absolute Error)
- R² Score
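The metrics above could be computed along these lines. RMSLE is the square root of the mean of (log(1 + predicted) − log(1 + actual))², which is why it reacts to ratios rather than raw differences. The helper names `rmsle` and `show_scores` and the sample prices are assumptions for this sketch.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_log_error, r2_score


def rmsle(y_true, y_pred):
    """Root mean squared log error between actual and predicted values."""
    return float(np.sqrt(mean_squared_log_error(y_true, y_pred)))


def show_scores(y_true, y_pred):
    """Collect the three metrics in one dictionary."""
    return {
        "MAE": float(mean_absolute_error(y_true, y_pred)),
        "RMSLE": rmsle(y_true, y_pred),
        "R^2": float(r2_score(y_true, y_pred)),
    }


# Made-up prices for illustration.
actual = [10000.0, 50000.0]
predicted = [11000.0, 45000.0]
scores = show_scores(actual, predicted)
print(scores)

# A 10% overestimate costs about the same on a cheap and an expensive machine.
cheap = rmsle([10000.0], [11000.0])
expensive = rmsle([100000.0], [110000.0])
print(cheap, expensive)
```

The last two values are nearly identical even though the absolute errors differ by a factor of ten, which is the property that makes RMSLE a reasonable fit for prices spanning a wide range.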
- Used RandomizedSearchCV for hyperparameter tuning
- Tuned parameters such as:
  - n_estimators
  - max_depth
  - min_samples_split
  - max_features
Randomized search samples a fixed number of parameter combinations instead of trying every one, which improves model performance while keeping training time manageable.
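A minimal sketch of such a search is shown below. The candidate values in `param_distributions`, the synthetic data, and the settings `n_iter=5` and `cv=3` are assumptions picked to keep the example fast; the project's actual grid is not given here.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the preprocessed training data.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(120, 4))
y = 500 + 30 * X[:, 0] + 10 * X[:, 1] ** 2 + rng.normal(0, 3, size=120)

# Illustrative candidate values for the four tuned parameters.
param_distributions = {
    "n_estimators": [10, 20, 40],
    "max_depth": [None, 3, 5, 10],
    "min_samples_split": [2, 4, 8],
    "max_features": [0.5, 1.0, "sqrt"],
}

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,  # only 5 sampled combinations instead of all 108
    cv=3,
    random_state=0,
)
search.fit(X, y)

best_params = search.best_params_
print(best_params)
```

Because the sale data is ordered in time, the default shuffled cross-validation folds used here may be optimistic; scoring candidates on a later-dated validation set is one alternative.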
- Preprocessed test data using the same pipeline
- Matched feature columns with training data
- Generated predictions using the trained model
- Created submission file with:
  - SalesID
  - Predicted SalePrice
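The prediction steps can be sketched as follows. The tiny frames, the treatment of SalesID as an identifier rather than a feature, and the fill value of 0 for columns missing from the test set are all assumptions made for this illustration.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Made-up, already preprocessed training and test frames.
train = pd.DataFrame({
    "ModelID": [1, 2, 3, 4],
    "saleYear": [2009, 2010, 2011, 2011],
    "SalePrice": [20000.0, 35000.0, 50000.0, 42000.0],
})
test = pd.DataFrame({
    "SalesID": [101, 102],
    "saleYear": [2012, 2012],
    "ModelID": [2, 3],
})

feature_columns = [c for c in train.columns if c != "SalePrice"]
model = RandomForestRegressor(n_estimators=10, random_state=0)
model.fit(train[feature_columns], train["SalePrice"])

# Match the training feature set and column order; absent columns are filled with 0.
X_test = test.reindex(columns=feature_columns, fill_value=0)

submission = pd.DataFrame({
    "SalesID": test["SalesID"],
    "SalePrice": model.predict(X_test),
})

# Pass a file path to to_csv to write the file; with no path it returns the CSV text.
csv_text = submission.to_csv(index=False)
print(csv_text)
```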
If you have any suggestions or feedback, feel free to connect!