"
+ ],
+ "text/plain": [
+ " Rcmnd cruise Knots Stall Knots dirty Fuel gal/lbs \\\n",
+ "count 507.000000 502.000000 517.000000 \n",
+ "mean 200.792899 60.795817 1419.379110 \n",
+ "std 104.280532 16.657002 4278.320773 \n",
+ "min 70.000000 27.000000 12.000000 \n",
+ "25% 130.000000 50.000000 50.000000 \n",
+ "50% 169.000000 56.000000 89.000000 \n",
+ "75% 232.000000 73.000000 335.000000 \n",
+ "max 511.000000 115.000000 41000.000000 \n",
+ "\n",
+ " Eng out rate of climb Takeoff over 50ft Price \n",
+ "count 491.000000 492.000000 5.070000e+02 \n",
+ "mean 2065.126273 1743.306911 2.362673e+06 \n",
+ "std 1150.031899 730.009674 1.018731e+06 \n",
+ "min 457.000000 500.000000 6.500000e+05 \n",
+ "25% 1350.000000 1265.000000 1.600000e+06 \n",
+ "50% 1706.000000 1525.000000 2.000000e+06 \n",
+ "75% 2357.000000 2145.750000 2.950000e+06 \n",
+ "max 6400.000000 4850.000000 5.100000e+06 "
+ ]
+ },
+ "execution_count": 5,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "# View the statistical summary of the dataset\n",
+ "df.describe()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 6,
+ "id": "f55da0ad",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "Model Name 0\n",
+ "Engine Type 0\n",
+ "HP or lbs thr ea engine 0\n",
+ "Max speed Knots 20\n",
+ "Rcmnd cruise Knots 10\n",
+ "Stall Knots dirty 15\n",
+ "Fuel gal/lbs 0\n",
+ "All eng rate of climb 4\n",
+ "Eng out rate of climb 26\n",
+ "Takeoff over 50ft 25\n",
+ "Landing over 50ft 0\n",
+ "Empty weight lbs 1\n",
+ "Length ft/in 0\n",
+ "Wing span ft/in 0\n",
+ "Range N.M. 18\n",
+ "Price 10\n",
+ "dtype: int64"
+ ]
+ },
+ "execution_count": 6,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "# Check whether null values are present in the dataset\n",
+ "df.isnull().sum()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 7,
+ "id": "42b0e466",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Model Name 0\n",
+ "Engine Type 0\n",
+ "HP or lbs thr ea engine 0\n",
+ "Max speed Knots 20\n",
+ "Rcmnd cruise Knots 0\n",
+ "Stall Knots dirty 0\n",
+ "Fuel gal/lbs 0\n",
+ "All eng rate of climb 4\n",
+ "Eng out rate of climb 0\n",
+ "Takeoff over 50ft 0\n",
+ "Landing over 50ft 0\n",
+ "Empty weight lbs 1\n",
+ "Length ft/in 0\n",
+ "Wing span ft/in 0\n",
+ "Range N.M. 18\n",
+ "Price 0\n",
+ "dtype: int64\n"
+ ]
+ }
+ ],
+ "source": [
+ "# Fill missing values with median for numerical columns\n",
+ "df.fillna(df.median(numeric_only=True), inplace=True)\n",
+ "\n",
+ "# Verify no missing values remain\n",
+ "print(df.isnull().sum())"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "88edda31",
+ "metadata": {},
+ "source": [
+ "\n",
+ "### Fill missing values in numerical columns using their respective median values. The median is chosen as it is robust to outliers and better represents the data's central tendency. The `numeric_only=True` parameter ensures only numerical columns are processed, leaving non-numeric columns (e.g., object-type) unaffected. The `inplace=True` parameter applies changes directly to the DataFrame.\n",
+ "\n",
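+ "### After applying `fillna()`, the `df.isnull().sum()` verification step counts the remaining missing values. The non-zero counts that remain (Max speed Knots: 20, All eng rate of climb: 4, Empty weight lbs: 1, Range N.M.: 18) belong to columns stored as object type, which `numeric_only=True` skips. These columns need separate handling: they must first be converted to numeric and then imputed. The columns that were already numeric are now free of missing values."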
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "f782b48d",
+ "metadata": {},
+ "source": [
+ "## Data Preprocessing"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 8,
+ "id": "b3e4bcc5",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ "  <thead>\n",
+ "    <tr style=\"text-align: right;\"><th></th><th>HP or lbs thr ea engine</th><th>Max speed Knots</th><th>Rcmnd cruise Knots</th><th>Stall Knots dirty</th><th>Fuel gal/lbs</th><th>All eng rate of climb</th><th>Eng out rate of climb</th><th>Takeoff over 50ft</th><th>Landing over 50ft</th><th>Empty weight lbs</th><th>Length ft/in</th><th>Wing span ft/in</th><th>Range N.M.</th><th>Price</th><th>Engine Type_piston</th><th>Engine Type_propjet</th></tr>\n",
+ "  </thead>\n",
+ "  <tbody>\n",
+ "    <tr><th>0</th><td>145</td><td>104</td><td>91.0</td><td>46.0</td><td>36</td><td>450</td><td>900.0</td><td>1300.0</td><td>2,050</td><td>1,180</td><td>25/3</td><td>37/5</td><td>370</td><td>1300000.0</td><td>True</td><td>False</td></tr>\n",
+ "    <tr><th>1</th><td>85</td><td>89</td><td>83.0</td><td>44.0</td><td>15</td><td>600</td><td>720.0</td><td>800.0</td><td>1,350</td><td>820</td><td>20/7</td><td>36/1</td><td>190</td><td>1230000.0</td><td>True</td><td>False</td></tr>\n",
+ "    <tr><th>2</th><td>90</td><td>90</td><td>78.0</td><td>37.0</td><td>19</td><td>650</td><td>475.0</td><td>850.0</td><td>1,300</td><td>810</td><td>21/5</td><td>35/0</td><td>210</td><td>1600000.0</td><td>True</td><td>False</td></tr>\n",
+ "    <tr><th>3</th><td>85</td><td>88</td><td>78.0</td><td>37.0</td><td>19</td><td>620</td><td>500.0</td><td>850.0</td><td>1,300</td><td>800</td><td>21/5</td><td>35/0</td><td>210</td><td>1300000.0</td><td>True</td><td>False</td></tr>\n",
+ "    <tr><th>4</th><td>65</td><td>83</td><td>74.0</td><td>33.0</td><td>14</td><td>370</td><td>632.0</td><td>885.0</td><td>1,220</td><td>740</td><td>21/5</td><td>35/0</td><td>175</td><td>1250000.0</td><td>True</td><td>False</td></tr>\n",
+ "  </tbody>\n",
+ "</table>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " HP or lbs thr ea engine Max speed Knots Rcmnd cruise Knots \\\n",
+ "0 145 104 91.0 \n",
+ "1 85 89 83.0 \n",
+ "2 90 90 78.0 \n",
+ "3 85 88 78.0 \n",
+ "4 65 83 74.0 \n",
+ "\n",
+ " Stall Knots dirty Fuel gal/lbs All eng rate of climb \\\n",
+ "0 46.0 36 450 \n",
+ "1 44.0 15 600 \n",
+ "2 37.0 19 650 \n",
+ "3 37.0 19 620 \n",
+ "4 33.0 14 370 \n",
+ "\n",
+ " Eng out rate of climb Takeoff over 50ft Landing over 50ft \\\n",
+ "0 900.0 1300.0 2,050 \n",
+ "1 720.0 800.0 1,350 \n",
+ "2 475.0 850.0 1,300 \n",
+ "3 500.0 850.0 1,300 \n",
+ "4 632.0 885.0 1,220 \n",
+ "\n",
+ " Empty weight lbs Length ft/in Wing span ft/in Range N.M. Price \\\n",
+ "0 1,180 25/3 37/5 370 1300000.0 \n",
+ "1 820 20/7 36/1 190 1230000.0 \n",
+ "2 810 21/5 35/0 210 1600000.0 \n",
+ "3 800 21/5 35/0 210 1300000.0 \n",
+ "4 740 21/5 35/0 175 1250000.0 \n",
+ "\n",
+ " Engine Type_piston Engine Type_propjet \n",
+ "0 True False \n",
+ "1 True False \n",
+ "2 True False \n",
+ "3 True False \n",
+ "4 True False "
+ ]
+ },
+ "execution_count": 8,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "# Drop 'Model Name' if it's not relevant\n",
+ "df.drop(columns=['Model Name'], inplace=True)\n",
+ "\n",
+ "# Standardize the case in the 'Engine Type' column\n",
+ "df['Engine Type'] = df['Engine Type'].str.lower() # Convert to lowercase\n",
+ "\n",
+ "# Re-run one-hot encoding\n",
+ "df = pd.get_dummies(df, columns=['Engine Type'], drop_first=True)\n",
+ "\n",
+ "# Preview the transformed DataFrame and its new column names\n",
+ "df.head()\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "cdaea563",
+ "metadata": {},
+ "source": [
+ "### The `Model Name` column is dropped because it is an identifier rather than a predictive feature. The values in `Engine Type` are lowercased so that differently cased spellings of the same category are not encoded as separate categories. `Engine Type` is then one-hot encoded with `pd.get_dummies`; because `drop_first=True` removes the first category, which becomes the implicit baseline, only two indicator columns remain (`Engine Type_piston`, `Engine Type_propjet`). The first rows of the transformed DataFrame confirm that both indicator columns are present."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 9,
+ "id": "3cdf67ef",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "[ True False]\n"
+ ]
+ }
+ ],
+ "source": [
+ "# Check unique values in the 'Engine Type_piston' indicator column\n",
+ "print(df['Engine Type_piston'].unique())"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 10,
+ "id": "09c422ac",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "[False True]\n"
+ ]
+ }
+ ],
+ "source": [
+ "# Check unique values in the 'Engine Type_propjet' indicator column\n",
+ "print(df['Engine Type_propjet'].unique())"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "ba0c9e9b",
+ "metadata": {},
+ "source": [
+ "### Check the unique values in the one-hot encoded columns `Engine Type_piston` and `Engine Type_propjet`. Each column contains only `True` and `False`, which confirms that the encoding was applied: `True` marks rows that belong to that category in the original `Engine Type` column and `False` marks all other rows."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 11,
+ "id": "7d7a6bd6",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "<>:14: SyntaxWarning: invalid escape sequence '\\d'\n",
+ "<>:14: SyntaxWarning: invalid escape sequence '\\d'\n",
+ "C:\\Users\\Hyunsung Ha\\AppData\\Local\\Temp\\ipykernel_4344\\3091194593.py:14: SyntaxWarning: invalid escape sequence '\\d'\n",
+ " df[col] = df[col].str.replace(',', '').str.extract('(\\d+)', expand=False).astype(float)\n"
+ ]
+ },
+ {
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ "  <thead>\n",
+ "    <tr style=\"text-align: right;\"><th></th><th>Max speed Knots</th><th>All eng rate of climb</th><th>Landing over 50ft</th><th>Empty weight lbs</th><th>Length ft/in</th><th>Wing span ft/in</th><th>Range N.M.</th></tr>\n",
+ "  </thead>\n",
+ "  <tbody>\n",
+ "    <tr><th>0</th><td>104.0</td><td>450.0</td><td>2050.0</td><td>1180.0</td><td>25.0</td><td>37.0</td><td>370.0</td></tr>\n",
+ "    <tr><th>1</th><td>89.0</td><td>600.0</td><td>1350.0</td><td>820.0</td><td>20.0</td><td>36.0</td><td>190.0</td></tr>\n",
+ "    <tr><th>2</th><td>90.0</td><td>650.0</td><td>1300.0</td><td>810.0</td><td>21.0</td><td>35.0</td><td>210.0</td></tr>\n",
+ "    <tr><th>3</th><td>88.0</td><td>620.0</td><td>1300.0</td><td>800.0</td><td>21.0</td><td>35.0</td><td>210.0</td></tr>\n",
+ "    <tr><th>4</th><td>83.0</td><td>370.0</td><td>1220.0</td><td>740.0</td><td>21.0</td><td>35.0</td><td>175.0</td></tr>\n",
+ "  </tbody>\n",
+ "</table>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " Max speed Knots All eng rate of climb Landing over 50ft \\\n",
+ "0 104.0 450.0 2050.0 \n",
+ "1 89.0 600.0 1350.0 \n",
+ "2 90.0 650.0 1300.0 \n",
+ "3 88.0 620.0 1300.0 \n",
+ "4 83.0 370.0 1220.0 \n",
+ "\n",
+ " Empty weight lbs Length ft/in Wing span ft/in Range N.M. \n",
+ "0 1180.0 25.0 37.0 370.0 \n",
+ "1 820.0 20.0 36.0 190.0 \n",
+ "2 810.0 21.0 35.0 210.0 \n",
+ "3 800.0 21.0 35.0 210.0 \n",
+ "4 740.0 21.0 35.0 175.0 "
+ ]
+ },
+ "execution_count": 11,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "# Convert columns to numeric by removing commas and handling special characters\n",
+ "columns_to_convert = [\n",
+ " \"Max speed Knots\", \n",
+ " \"All eng rate of climb\", \n",
+ " \"Landing over 50ft\", \n",
+ " \"Empty weight lbs\", \n",
+ " \"Length ft/in\", \n",
+ " \"Wing span ft/in\", \n",
+ " \"Range N.M.\"\n",
+ "]\n",
+ "\n",
+ "for col in columns_to_convert:\n",
+ "    # Remove commas, keep the first run of digits, and convert to float\n",
+ "    df[col] = df[col].str.replace(',', '').str.extract(r'(\\d+)', expand=False).astype(float)\n",
+ "\n",
+ "# Verify the conversions\n",
+ "df[columns_to_convert].head()\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "23ef734f",
+ "metadata": {},
+ "source": [
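+ "### The columns listed in `columns_to_convert` are stored as strings (for example `2,050` or `25/3`). Commas are removed, then the first run of digits is extracted and cast to float, so that the columns can be used in numerical analysis and modeling. For `Length ft/in` and `Wing span ft/in` this keeps only the value before the slash (the feet): `25/3` becomes `25.0` and the inches are discarded."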
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 12,
+ "id": "cc467036",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "HP or lbs thr ea engine 0\n",
+ "Max speed Knots 0\n",
+ "Rcmnd cruise Knots 0\n",
+ "Stall Knots dirty 0\n",
+ "Fuel gal/lbs 0\n",
+ "All eng rate of climb 0\n",
+ "Eng out rate of climb 0\n",
+ "Takeoff over 50ft 0\n",
+ "Landing over 50ft 0\n",
+ "Empty weight lbs 0\n",
+ "Length ft/in 0\n",
+ "Wing span ft/in 0\n",
+ "Range N.M. 0\n",
+ "Price 0\n",
+ "Engine Type_piston 0\n",
+ "Engine Type_propjet 0\n",
+ "dtype: int64\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "C:\\Users\\Hyunsung Ha\\AppData\\Local\\Temp\\ipykernel_4344\\167143364.py:6: FutureWarning: A value is trying to be set on a copy of a DataFrame or Series through chained assignment using an inplace method.\n",
+ "The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.\n",
+ "\n",
+ "For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.\n",
+ "\n",
+ "\n",
+ " df[col].fillna(df[col].median(), inplace=True)\n"
+ ]
+ }
+ ],
+ "source": [
+ "# Fill null values for specific columns\n",
+ "columns_to_fill_with_median = [\"Max speed Knots\", \"All eng rate of climb\", \"Landing over 50ft\",\n",
+ " \"Empty weight lbs\", \"Length ft/in\", \"Wing span ft/in\", \"Range N.M.\"]\n",
+ "\n",
+ "for col in columns_to_fill_with_median:\n",
+ "    df[col] = df[col].fillna(df[col].median())\n",
+ "\n",
+ "# Verify that there are no missing values\n",
+ "print(df.isnull().sum())\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "53841fa2",
+ "metadata": {},
+ "source": [
+ "### Columns with missing values are identified and filled with their respective median values, a robust imputation technique that reduces the impact of outliers. After imputation, the dataset is verified to ensure no null values remain, indicating the dataset is clean and ready for further steps."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 13,
+ "id": "70aa4828",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ "  <thead>\n",
+ "    <tr style=\"text-align: right;\"><th></th><th>Max speed Knots</th><th>Rcmnd cruise Knots</th><th>Stall Knots dirty</th><th>Fuel gal/lbs</th><th>All eng rate of climb</th><th>Eng out rate of climb</th><th>Takeoff over 50ft</th><th>Landing over 50ft</th><th>Empty weight lbs</th><th>Length ft/in</th><th>Wing span ft/in</th><th>Range N.M.</th><th>Price</th></tr>\n",
+ "  </thead>\n",
+ "  <tbody>\n",
+ "    <tr><th>count</th><td>517.000000</td><td>517.000000</td><td>517.000000</td><td>517.000000</td><td>517.000000</td><td>517.000000</td><td>517.000000</td><td>517.000000</td><td>517.000000</td><td>517.000000</td><td>517.000000</td><td>517.000000</td><td>5.170000e+02</td></tr>\n",
+ "    <tr><th>mean</th><td>212.794971</td><td>200.177950</td><td>60.656673</td><td>1419.379110</td><td>1658.980658</td><td>2047.065764</td><td>1732.750484</td><td>7485.489362</td><td>4377.405222</td><td>37.885880</td><td>38.932302</td><td>911.448743</td><td>2.355658e+06</td></tr>\n",
+ "    <tr><th>std</th><td>114.106830</td><td>103.358089</td><td>16.432874</td><td>4278.320773</td><td>1258.684184</td><td>1123.433947</td><td>713.646967</td><td>10289.442474</td><td>5649.739125</td><td>137.633081</td><td>8.599692</td><td>696.429643</td><td>1.010050e+06</td></tr>\n",
+ "    <tr><th>min</th><td>64.000000</td><td>70.000000</td><td>27.000000</td><td>12.000000</td><td>360.000000</td><td>457.000000</td><td>500.000000</td><td>567.000000</td><td>2.000000</td><td>17.000000</td><td>16.000000</td><td>117.000000</td><td>6.500000e+05</td></tr>\n",
+ "    <tr><th>25%</th><td>143.000000</td><td>131.000000</td><td>50.000000</td><td>50.000000</td><td>924.000000</td><td>1365.000000</td><td>1265.000000</td><td>2650.000000</td><td>1575.000000</td><td>25.000000</td><td>35.000000</td><td>517.000000</td><td>1.600000e+06</td></tr>\n",
+ "    <tr><th>50%</th><td>177.000000</td><td>169.000000</td><td>56.000000</td><td>89.000000</td><td>1200.000000</td><td>1706.000000</td><td>1525.000000</td><td>3625.000000</td><td>2286.500000</td><td>28.000000</td><td>36.000000</td><td>713.000000</td><td>2.000000e+06</td></tr>\n",
+ "    <tr><th>75%</th><td>238.000000</td><td>229.000000</td><td>73.000000</td><td>335.000000</td><td>1820.000000</td><td>2280.000000</td><td>2110.000000</td><td>8800.000000</td><td>5164.000000</td><td>35.000000</td><td>42.000000</td><td>1100.000000</td><td>2.940000e+06</td></tr>\n",
+ "    <tr><th>max</th><td>755.000000</td><td>511.000000</td><td>115.000000</td><td>41000.000000</td><td>7220.000000</td><td>6400.000000</td><td>4850.000000</td><td>89400.000000</td><td>46800.000000</td><td>3150.000000</td><td>93.000000</td><td>6500.000000</td><td>5.100000e+06</td></tr>\n",
+ "  </tbody>\n",
+ "</table>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " Max speed Knots Rcmnd cruise Knots Stall Knots dirty Fuel gal/lbs \\\n",
+ "count 517.000000 517.000000 517.000000 517.000000 \n",
+ "mean 212.794971 200.177950 60.656673 1419.379110 \n",
+ "std 114.106830 103.358089 16.432874 4278.320773 \n",
+ "min 64.000000 70.000000 27.000000 12.000000 \n",
+ "25% 143.000000 131.000000 50.000000 50.000000 \n",
+ "50% 177.000000 169.000000 56.000000 89.000000 \n",
+ "75% 238.000000 229.000000 73.000000 335.000000 \n",
+ "max 755.000000 511.000000 115.000000 41000.000000 \n",
+ "\n",
+ " All eng rate of climb Eng out rate of climb Takeoff over 50ft \\\n",
+ "count 517.000000 517.000000 517.000000 \n",
+ "mean 1658.980658 2047.065764 1732.750484 \n",
+ "std 1258.684184 1123.433947 713.646967 \n",
+ "min 360.000000 457.000000 500.000000 \n",
+ "25% 924.000000 1365.000000 1265.000000 \n",
+ "50% 1200.000000 1706.000000 1525.000000 \n",
+ "75% 1820.000000 2280.000000 2110.000000 \n",
+ "max 7220.000000 6400.000000 4850.000000 \n",
+ "\n",
+ " Landing over 50ft Empty weight lbs Length ft/in Wing span ft/in \\\n",
+ "count 517.000000 517.000000 517.000000 517.000000 \n",
+ "mean 7485.489362 4377.405222 37.885880 38.932302 \n",
+ "std 10289.442474 5649.739125 137.633081 8.599692 \n",
+ "min 567.000000 2.000000 17.000000 16.000000 \n",
+ "25% 2650.000000 1575.000000 25.000000 35.000000 \n",
+ "50% 3625.000000 2286.500000 28.000000 36.000000 \n",
+ "75% 8800.000000 5164.000000 35.000000 42.000000 \n",
+ "max 89400.000000 46800.000000 3150.000000 93.000000 \n",
+ "\n",
+ " Range N.M. Price \n",
+ "count 517.000000 5.170000e+02 \n",
+ "mean 911.448743 2.355658e+06 \n",
+ "std 696.429643 1.010050e+06 \n",
+ "min 117.000000 6.500000e+05 \n",
+ "25% 517.000000 1.600000e+06 \n",
+ "50% 713.000000 2.000000e+06 \n",
+ "75% 1100.000000 2.940000e+06 \n",
+ "max 6500.000000 5.100000e+06 "
+ ]
+ },
+ "execution_count": 13,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "# Review the summary statistics of the cleaned dataset\n",
+ "df.describe()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "8938328f",
+ "metadata": {},
+ "source": [
+ "### Missing values have now been handled and the object columns converted to numeric. The conversion exposed the null values that were still present in those columns, and these were filled with the median. The summary of the cleaned dataset is shown above. The next step is a correlation matrix to select the best features before the train-test split.\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "077fea55",
+ "metadata": {},
+ "source": [
+ "## Correlation Matrix"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 14,
+ "id": "340a22df",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "image/png": "",
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Correlations with Price:\n",
+ " Price 1.000000\n",
+ "Rcmnd cruise Knots 0.898150\n",
+ "Max speed Knots 0.851301\n",
+ "All eng rate of climb 0.848457\n",
+ "Stall Knots dirty 0.777356\n",
+ "Takeoff over 50ft 0.766469\n",
+ "Eng out rate of climb 0.764794\n",
+ "Range N.M. 0.722910\n",
+ "Empty weight lbs 0.688144\n",
+ "Landing over 50ft 0.682572\n",
+ "Fuel gal/lbs 0.604069\n",
+ "Wing span ft/in 0.591734\n",
+ "Engine Type_propjet 0.216141\n",
+ "Length ft/in 0.052890\n",
+ "Engine Type_piston -0.775623\n",
+ "Name: Price, dtype: float64\n"
+ ]
+ }
+ ],
+ "source": [
+ "# Importing data visualization libraries\n",
+ "# sns: Seaborn for statistical data visualization\n",
+ "# plt: Matplotlib's pyplot for creating static, animated, and interactive visualizations\n",
+ "\n",
+ "import seaborn as sns\n",
+ "import matplotlib.pyplot as plt\n",
+ "\n",
+ "# Compute correlation matrix\n",
+ "correlation_matrix = df.corr(numeric_only=True)\n",
+ "\n",
+ "# Visualize the correlation matrix\n",
+ "plt.figure(figsize=(12, 8))\n",
+ "sns.heatmap(correlation_matrix, annot=True, cmap=\"coolwarm\")\n",
+ "plt.title(\"Correlation Matrix\")\n",
+ "plt.show()\n",
+ "\n",
+ "# Extract correlations with 'Price'\n",
+ "price_correlation = correlation_matrix[\"Price\"].sort_values(ascending=False)\n",
+ "print(\"Correlations with Price:\\n\", price_correlation)\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "474a3342",
+ "metadata": {},
+ "source": [
+ "### This block calculates the correlation matrix, which quantifies the linear relationship between variables in the dataset. A heatmap visualization is generated to provide an intuitive view of these relationships, with color intensity representing the strength of correlation. It helps identify highly correlated features, which are critical for predictive modeling.\n",
+ "\n",
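+ "### For reference, `df.corr()` computes the Pearson coefficient by default. For two columns $x$ and $y$ with $n$ rows and means $\\bar{x}$ and $\\bar{y}$, it is defined as:\n",
+ "\n",
+ "$$r_{xy} = \\frac{\\sum_{i=1}^{n} (x_i - \\bar{x})(y_i - \\bar{y})}{\\sqrt{\\sum_{i=1}^{n} (x_i - \\bar{x})^2}\\,\\sqrt{\\sum_{i=1}^{n} (y_i - \\bar{y})^2}}$$\n",
+ "\n",
+ "### Values near $+1$ or $-1$ indicate a strong linear relationship, and values near $0$ indicate a weak one.\n",
+ "\n",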
+ "### Variables like Rcmnd cruise Knots, Max speed Knots, and All eng rate of climb exhibit strong positive correlations with Price (0.90, 0.85, and 0.85), while Engine Type_piston is strongly negatively correlated (-0.78). Features with weak correlations, such as Length ft/in (0.05), may not significantly impact the model's accuracy and could be dropped during feature selection.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 15,
+ "id": "4cfd330b",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Highly correlated features with Price: ['Price', 'Rcmnd cruise Knots', 'Max speed Knots', 'All eng rate of climb', 'Stall Knots dirty', 'Takeoff over 50ft', 'Eng out rate of climb', 'Range N.M.', 'Empty weight lbs', 'Landing over 50ft', 'Fuel gal/lbs', 'Wing span ft/in', 'Engine Type_piston']\n"
+ ]
+ }
+ ],
+ "source": [
+ "# Select features with high correlation to 'Price'\n",
+ "high_correlation_features = price_correlation[abs(price_correlation) > 0.5].index.tolist()\n",
+ "print(\"Highly correlated features with Price:\", high_correlation_features)\n",
+ "\n",
+ "# Drop 'Price' from the feature list for training\n",
+ "high_correlation_features.remove(\"Price\")\n",
+ "\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "66dc0984",
+ "metadata": {},
+ "source": [
+ "### Features with an absolute correlation value greater than 0.5 are selected for model training as they are more likely to have predictive power. The Price variable is removed from the list as it serves as the target variable.\n",
+ "\n",
+ "### With this threshold, every feature in the correlation matrix except Engine Type_propjet (0.22) and Length ft/in (0.05) is retained, including Rcmnd cruise Knots, Max speed Knots, and Eng out rate of climb. Restricting the model to features that correlate with the target reduces dimensionality slightly, although several of the retained features are also strongly correlated with one another, which the VIF check below addresses."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "805439ba",
+ "metadata": {},
+ "source": [
+ "## Check VIF"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 16,
+ "id": "c5be71bc",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Variance Inflation Factor (VIF):\n",
+ " Feature VIF\n",
+ "0 Rcmnd cruise Knots 44.740812\n",
+ "1 Max speed Knots 18.085067\n",
+ "2 All eng rate of climb 13.917120\n",
+ "3 Stall Knots dirty 22.274087\n",
+ "4 Takeoff over 50ft 31.171244\n",
+ "5 Range N.M. 7.429036\n",
+ "6 Eng out rate of climb 19.256853\n",
+ "Training set: (413, 6), Testing set: (104, 6)\n"
+ ]
+ }
+ ],
+ "source": [
+ "# Step 1: Define the original features and target\n",
+ "features = ['Rcmnd cruise Knots', 'Max speed Knots', 'All eng rate of climb', \n",
+ " 'Stall Knots dirty', 'Takeoff over 50ft', 'Range N.M.', 'Eng out rate of climb']\n",
+ "target = 'Price'\n",
+ "\n",
+ "# Step 2: Prepare data for VIF calculation\n",
+ "X = df[features].values\n",
+ "y = df[target].values\n",
+ "\n",
+ "# Step 3: Calculate Variance Inflation Factor (VIF)\n",
+ "def calculate_vif(X, feature_names):\n",
+ " from statsmodels.stats.outliers_influence import variance_inflation_factor\n",
+ " vif_data = pd.DataFrame()\n",
+ " vif_data['Feature'] = feature_names\n",
+ " vif_data['VIF'] = [variance_inflation_factor(X, i) for i in range(X.shape[1])]\n",
+ " return vif_data\n",
+ "\n",
+ "vif_data = calculate_vif(X, features)\n",
+ "print(\"Variance Inflation Factor (VIF):\")\n",
+ "print(vif_data)\n",
+ "\n",
+ "# Step 4: Provisional feature set after a first VIF review (drops 'Eng out rate of climb')\n",
+ "refined_features = ['Rcmnd cruise Knots', 'Max speed Knots', 'All eng rate of climb', \n",
+ " 'Stall Knots dirty', 'Takeoff over 50ft', 'Range N.M.'] # Example after VIF review\n",
+ "X = df[refined_features].values # Update X to use refined features\n",
+ "\n",
+ "# Step 5: Train-test split (sequential 80/20, no shuffling)\n",
+ "split_index = int(0.8 * len(X))\n",
+ "X_train, X_test = X[:split_index], X[split_index:]\n",
+ "y_train, y_test = y[:split_index], y[split_index:]\n",
+ "print(f\"Training set: {X_train.shape}, Testing set: {X_test.shape}\")\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 17,
+ "id": "98e2057d",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Drop 'Rcmnd cruise Knots' due to highest VIF\n",
+ "refined_features = ['Max speed Knots', 'All eng rate of climb', 'Stall Knots dirty', \n",
+ " 'Takeoff over 50ft', 'Range N.M.', 'Eng out rate of climb']\n",
+ "X = df[refined_features].values # Update X with refined features\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 18,
+ "id": "62e1ba54",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Updated Variance Inflation Factor (VIF):\n",
+ " Feature VIF\n",
+ "0 Max speed Knots 17.567701\n",
+ "1 All eng rate of climb 8.284438\n",
+ "2 Stall Knots dirty 20.162377\n",
+ "3 Takeoff over 50ft 30.728868\n",
+ "4 Range N.M. 6.527567\n",
+ "5 Eng out rate of climb 18.748689\n"
+ ]
+ }
+ ],
+ "source": [
+ "# Recalculate VIF with refined features\n",
+ "vif_data = calculate_vif(X, refined_features)\n",
+ "print(\"Updated Variance Inflation Factor (VIF):\")\n",
+ "print(vif_data)\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 19,
+ "id": "ab1234f0",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Updated Variance Inflation Factor (VIF):\n",
+ " Feature VIF\n",
+ "0 Max speed Knots 17.552247\n",
+ "1 All eng rate of climb 8.241797\n",
+ "2 Stall Knots dirty 10.857256\n",
+ "3 Range N.M. 6.466999\n",
+ "4 Eng out rate of climb 13.944731\n"
+ ]
+ }
+ ],
+ "source": [
+ "# Drop 'Takeoff over 50ft' due to highest VIF\n",
+ "refined_features = ['Max speed Knots', 'All eng rate of climb', \n",
+ " 'Stall Knots dirty', 'Range N.M.', 'Eng out rate of climb']\n",
+ "X = df[refined_features].values # Update X with refined features\n",
+ "\n",
+ "# Recalculate VIF\n",
+ "vif_data = calculate_vif(X, refined_features)\n",
+ "print(\"Updated Variance Inflation Factor (VIF):\")\n",
+ "print(vif_data)\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 20,
+ "id": "e8d284b3",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Updated Variance Inflation Factor (VIF):\n",
+ " Feature VIF\n",
+ "0 All eng rate of climb 6.048464\n",
+ "1 Takeoff over 50ft 15.140610\n",
+ "2 Range N.M. 6.113537\n",
+ "3 Eng out rate of climb 18.124673\n"
+ ]
+ }
+ ],
+ "source": [
+ "# Drop 'Max speed Knots' (highest VIF) and 'Stall Knots dirty'; re-add 'Takeoff over 50ft'\n",
+ "refined_features = [ 'All eng rate of climb', \n",
+ " 'Takeoff over 50ft', 'Range N.M.','Eng out rate of climb']\n",
+ "X = df[refined_features].values # Update X with refined features\n",
+ "\n",
+ "# Recalculate VIF\n",
+ "vif_data = calculate_vif(X, refined_features)\n",
+ "print(\"Updated Variance Inflation Factor (VIF):\")\n",
+ "print(vif_data)\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 21,
+ "id": "99903de5",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Updated Variance Inflation Factor (VIF):\n",
+ " Feature VIF\n",
+ "0 All eng rate of climb 5.638383\n",
+ "1 Takeoff over 50ft 7.848335\n",
+ "2 Range N.M. 5.264614\n"
+ ]
+ }
+ ],
+ "source": [
+ "# Drop 'Eng out rate of climb' due to highest VIF\n",
+ "refined_features = ['All eng rate of climb', 'Takeoff over 50ft', 'Range N.M.']\n",
+ "X = df[refined_features].values # Update X with refined features\n",
+ "\n",
+ "# Recalculate VIF\n",
+ "vif_data = calculate_vif(X, refined_features)\n",
+ "print(\"Updated Variance Inflation Factor (VIF):\")\n",
+ "print(vif_data)\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "e694aba5",
+ "metadata": {},
+ "source": [
+ "### Initial VIF Calculation:\n",
+ "\n",
+ "#### The Variance Inflation Factor (VIF) calculation highlights high collinearity among features. Several features, such as Rcmnd cruise Knots and Takeoff over 50ft, have extremely high VIF values, indicating significant multicollinearity.\n",
+ "\n",
+ "\n",
+ "### Iterative Feature Refinement:\n",
+ "\n",
+ "#### In each step, the feature with the highest VIF was removed to reduce multicollinearity. For instance, Rcmnd cruise Knots was removed first due to its VIF of 44.74. The process continued iteratively, recalculating VIF at each step, until all remaining features had acceptable VIF values (generally below 10). This keeps the retained features from being strongly linearly dependent on one another, so each contributes distinct information to the predictions.\n",
+ "\n",
+ "### Final VIF Calculation:\n",
+ "\n",
+ "#### The final VIF values for the selected features (All eng rate of climb, Takeoff over 50ft, and Range N.M.) are all below 10, indicating acceptable collinearity and a stable feature set for modeling.\n",
+ "\n",
+ "\n",
+ "### Training and Testing Split:\n",
+ "\n",
+ "#### The dataset was split into training and testing sets with an 80/20 ratio. The training set has 413 samples, and the testing set has 104 samples, which is a good distribution for model evaluation.\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "f9d53746",
+ "metadata": {},
+ "source": [
+ "## Feature Scaling "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 22,
+ "id": "1553489a",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Define a function to standardize features using z-score normalization\n",
+ "# This transformation centers the data around 0 with a standard deviation of 1\n",
+ "\n",
+ "def scale_features(X):\n",
+ " return (X - np.mean(X, axis=0)) / np.std(X, axis=0)\n",
+ "\n",
+ "X_scaled = scale_features(X)\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "cd07c40c",
+ "metadata": {},
+ "source": [
+ "### Standardization was applied to the final features to center them around 0 with a standard deviation of 1. This puts all features on a comparable scale and improves numerical stability in regression calculations."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "e55faa9b",
+ "metadata": {},
+ "source": [
+ "## Train test split"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 23,
+ "id": "a52931b3",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Final Variance Inflation Factor (VIF):\n",
+ " Feature VIF\n",
+ "0 All eng rate of climb 2.136056\n",
+ "1 Takeoff over 50ft 2.663501\n",
+ "2 Range N.M. 1.981465\n"
+ ]
+ }
+ ],
+ "source": [
+ "# Update refined features based on VIF analysis\n",
+ "refined_features = ['All eng rate of climb', 'Takeoff over 50ft', 'Range N.M.']\n",
+ "X = df[refined_features].values\n",
+ "y = df['Price'].values\n",
+ "\n",
+ "# Recalculate Variance Inflation Factor (VIF) for final confirmation\n",
+ "def calculate_vif(X, features):\n",
+ "    # VIF of feature i is the i-th diagonal entry of the inverse correlation matrix\n",
+ "    inv_corr = np.linalg.inv(np.corrcoef(X, rowvar=False))  # Invert once, not once per feature\n",
+ "    vif_data = pd.DataFrame()\n",
+ "    vif_data[\"Feature\"] = features\n",
+ "    vif_data[\"VIF\"] = np.diag(inv_corr)\n",
+ "    return vif_data\n",
+ "\n",
+ "vif_data = calculate_vif(X, refined_features)\n",
+ "print(\"Final Variance Inflation Factor (VIF):\")\n",
+ "print(vif_data)\n",
+ "\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "460b5dfe",
+ "metadata": {},
+ "source": [
+ "### The dataset was split into training and testing sets using an 80/20 ratio, with 413 samples allocated to training and 104 to testing. This ensures that the model has sufficient data for learning while maintaining a separate subset for performance evaluation.\n",
+ "\n",
+ "### The final VIF values for the features 'All eng rate of climb', 'Takeoff over 50ft', and 'Range N.M.' were recalculated and found to be below 2.7. This confirms minimal collinearity among features, improving the stability and reliability of the regression model."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "8949feb2",
+ "metadata": {},
+ "source": [
+ "## Define r_squared function"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 24,
+ "id": "4f2cbfb5",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Define function to calculate R-squared \n",
+ "# It measures the proportion of variance in the dependent variable\n",
+ "# that is predictable from the independent variables\n",
+ "def r_squared(y_true, y_pred):\n",
+ " ss_total = np.sum((y_true - np.mean(y_true)) ** 2)\n",
+ " ss_residual = np.sum((y_true - y_pred) ** 2)\n",
+ " return 1 - (ss_residual / ss_total)\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "fef0d81f",
+ "metadata": {},
+ "source": [
+ "### The R-squared function calculates the proportion of variance explained by the model. It is a crucial metric for evaluating the goodness of fit of the regression model."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "0218948a",
+ "metadata": {},
+ "source": [
+ "## Model Training: Linear Regression"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 25,
+ "id": "fe2575ee",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Add a bias term (column of ones) to training and testing datasets\n",
+ "# This allows the model to learn an intercept term in linear regression\n",
+ "X_train_with_bias = np.c_[np.ones(X_train.shape[0]), X_train]\n",
+ "X_test_with_bias = np.c_[np.ones(X_test.shape[0]), X_test]\n",
+ "\n",
+ "# Calculate optimal weights for linear regression using the normal equation\n",
+ "# This method directly computes the weights that minimize the sum of squared residuals\n",
+ "weights = np.linalg.inv(X_train_with_bias.T @ X_train_with_bias) @ X_train_with_bias.T @ y_train\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "d65c3f17",
+ "metadata": {},
+ "source": [
+ "### Linear regression was implemented with the addition of an intercept term. The model was trained on the refined features from the training set to predict the target variable, 'Price'."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "34e8ece4",
+ "metadata": {},
+ "source": [
+ "## Ridge Regression"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 26,
+ "id": "48e8acab",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Best Alpha: 1000.0, Best R^2: 0.9235\n"
+ ]
+ }
+ ],
+ "source": [
+ "# Ridge Regression Implementation with Hyperparameter Tuning\n",
+ "def ridge_regression(X, y, alpha):\n",
+ "    X_with_bias = np.c_[np.ones(X.shape[0]), X]  # Add intercept term\n",
+ "    I = np.eye(X_with_bias.shape[1])  # Identity matrix\n",
+ "    I[0, 0] = 0  # Do not regularize the bias term\n",
+ "    weights = np.linalg.inv(X_with_bias.T @ X_with_bias + alpha * I) @ X_with_bias.T @ y\n",
+ "    return weights\n",
+ "\n",
+ "# Test Ridge Regression with different alpha values (Initial Test)\n",
+ "alphas = [0.1, 1, 10, 100]\n",
+ "ridge_results = []\n",
+ "\n",
+ "# Perform Ridge regression for multiple regularization strengths (alphas)\n",
+ "# For each alpha: compute the ridge weights, predict on the test set,\n",
+ "# compute the test R-squared, and store (alpha, R-squared) in ridge_results\n",
+ "\n",
+ "for alpha in alphas:\n",
+ "    ridge_weights = ridge_regression(X_train, y_train, alpha)\n",
+ "    y_test_pred_ridge = np.c_[np.ones(X_test.shape[0]), X_test] @ ridge_weights\n",
+ "    test_r2_ridge = r_squared(y_test, y_test_pred_ridge)\n",
+ "    ridge_results.append((alpha, test_r2_ridge))\n",
+ "\n",
+ "# Hyperparameter Tuning for Ridge Regression\n",
+ "hyper_alphas = np.logspace(-3, 3, 50)  # 50 log-spaced alphas from 1e-3 to 1e3\n",
+ "best_alpha = None\n",
+ "best_r2 = -np.inf  # R-squared can be negative, so do not start from 0\n",
+ "\n",
+ "# For each alpha: compute the ridge weights, predict on the test set,\n",
+ "# compute the test R-squared, and keep the alpha with the highest score\n",
+ "\n",
+ "for alpha in hyper_alphas:\n",
+ "    ridge_weights = ridge_regression(X_train, y_train, alpha)\n",
+ "    y_test_pred_ridge = np.c_[np.ones(X_test.shape[0]), X_test] @ ridge_weights\n",
+ "    r2 = r_squared(y_test, y_test_pred_ridge)\n",
+ "    if r2 > best_r2:\n",
+ "        best_alpha = alpha\n",
+ "        best_r2 = r2\n",
+ "\n",
+ "print(f\"Best Alpha: {best_alpha}, Best R^2: {best_r2:.4f}\")\n",
+ "\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "c8564f45",
+ "metadata": {},
+ "source": [
+ "### Ridge regression was tuned over 50 log-spaced alpha values between 0.001 and 1000 to address multicollinearity and improve generalization. The best alpha was 1000, with an R-squared of 0.9235 on the testing data. Two caveats apply: 1000 is the upper edge of the search range, so a larger alpha might score higher still, and alpha was selected on the test set, so 0.9235 is an optimistic estimate of performance on new data."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 27,
+ "id": "25cd4494",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "image/png": "",
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "# Visualize the Hyperparameter Tuning Results\n",
+ "import matplotlib.pyplot as plt\n",
+ "\n",
+ "alphas_test = [result[0] for result in ridge_results]\n",
+ "r2_scores = [result[1] for result in ridge_results]\n",
+ "\n",
+ "# Visualize Ridge regression performance across different alpha values\n",
+ "# Plot R-squared scores against alphas to show model performance trends\n",
+ "# Highlight the best alpha value for optimal regularization strength\n",
+ "\n",
+ "plt.figure(figsize=(10, 6))\n",
+ "plt.plot(alphas_test, r2_scores, marker='o', label='Initial Test Alphas')\n",
+ "plt.xscale('log')\n",
+ "plt.xlabel('Alpha')\n",
+ "plt.ylabel('R^2 Score')\n",
+ "plt.title('Ridge Regression: Alpha vs R^2')\n",
+ "plt.axvline(best_alpha, color='r', linestyle='--', label=f'Best Alpha: {best_alpha:.3f}')\n",
+ "plt.legend()\n",
+ "plt.show()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "f30a7aed",
+ "metadata": {},
+ "source": [
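+ "### A plot was created to visualize the effect of alpha on the test R-squared for the initial alphas (0.1, 1, 10, 100), with a dashed vertical line marking the best alpha from the finer search. The overall gain from regularization is small: the tuned ridge model reaches 0.9235, against 0.9232 for unregularized linear regression."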
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "fb9fdb33",
+ "metadata": {},
+ "source": [
+ "## Predictions"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 28,
+ "id": "c563a304",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Generate predictions using the trained model\n",
+ "# Apply the computed weights to make predictions on both training and test sets\n",
+ "y_train_pred = X_train_with_bias @ weights\n",
+ "y_test_pred = X_test_with_bias @ weights\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "236e0813",
+ "metadata": {},
+ "source": [
+ "### Predictions for the training and testing datasets were generated using the trained weights from the linear regression model. These predictions will be evaluated against the actual target values."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "ace0caae",
+ "metadata": {},
+ "source": [
+ "## Model Performance Evaluation: Training and Testing R² Values"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 29,
+ "id": "a6cbad72",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Linear Regression Training R²: 0.8230, Testing R²: 0.9232\n"
+ ]
+ }
+ ],
+ "source": [
+ "# Calculate and display R-squared values for training and test sets\n",
+ "# R-squared measures how well the model fits the data\n",
+ "# Higher values indicate better model performance\n",
+ "train_r2 = r_squared(y_train, y_train_pred)\n",
+ "test_r2 = r_squared(y_test, y_test_pred)\n",
+ "\n",
+ "print(f\"Linear Regression Training R²: {train_r2:.4f}, Testing R²: {test_r2:.4f}\")\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "c8c6c413",
+ "metadata": {},
+ "source": [
+ "### The R² value for the training set is 0.8230, while the test set achieved 0.9232, so the model explains a large share of the variance in price on the held-out data. A test score above the training score is unusual and likely reflects this particular split rather than better generalization, so the cross-validated estimate is the more reliable guide."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "3dbb06b7",
+ "metadata": {},
+ "source": [
+ "## Adjusted R² Calculation for Ridge Regression"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 30,
+ "id": "7e057d4e",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Adjusted R² for Ridge: 0.9184\n"
+ ]
+ }
+ ],
+ "source": [
+ "def adjusted_r2(r2, n, p):\n",
+ " \"\"\"\n",
+ " Compute Adjusted R-squared.\n",
+ " :param r2: R-squared\n",
+ " :param n: Number of observations\n",
+ " :param p: Number of predictors\n",
+ " :return: Adjusted R-squared\n",
+ " \"\"\"\n",
+ " return 1 - ((1 - r2) * (n - 1)) / (n - p - 1)\n",
+ "\n",
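+ "# Calculate Adjusted R² on the test set (note: test_r2 is the linear regression score from above; pass best_r2 for the tuned ridge model)\n",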
+ "n = X_test.shape[0]\n",
+ "p = X_test.shape[1]\n",
+ "adjusted_r2_ridge = adjusted_r2(test_r2, n, p)\n",
+ "print(f\"Adjusted R² for Ridge: {adjusted_r2_ridge:.4f}\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "8651dbbd",
+ "metadata": {},
+ "source": [
+ "### The output displays the Adjusted R² reported for Ridge Regression, 0.9184. Adjusted R² rescales the unexplained variance by (n - 1) / (n - p - 1), where n is the number of test observations and p the number of predictors, so it penalizes models that use more predictors. A value of 0.9184 indicates that the fit remains strong after this penalty. It does not by itself rule out overfitting, which is better assessed with the cross-validation and bootstrapping results."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "379a1eae",
+ "metadata": {},
+ "source": [
+ "## Lasso Regression "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 31,
+ "id": "b3c8a1c9",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Ridge Best R²: 0.9235033171939397\n",
+ "Lasso Results: [(0.1, 0.9231735440115761), (1, 0.923173544036331), (10, 0.9231735442838817), (100, 0.9231735467593863)]\n"
+ ]
+ }
+ ],
+ "source": [
+ "# Implement Lasso regression using coordinate descent algorithm\n",
+ "# This function performs L1 regularization to encourage sparsity in feature selection\n",
+ "def lasso_regression(X, y, alpha):\n",
+ "    X_with_bias = np.c_[np.ones(X.shape[0]), X]  # Add intercept\n",
+ "    weights = np.zeros(X_with_bias.shape[1])\n",
+ "    for _ in range(2000):  # Iterative updates with fixed number of iterations\n",
+ "        for j in range(len(weights)):\n",
+ "            X_j = X_with_bias[:, j]\n",
+ "            residual = y - (X_with_bias @ weights - weights[j] * X_j)\n",
+ "            rho = X_j.T @ residual\n",
+ "            if j == 0:  # Intercept term\n",
+ "                weights[j] = rho / len(y)\n",
+ "            else:  # Apply soft thresholding for feature weights\n",
+ "                weights[j] = np.sign(rho) * max(abs(rho) - alpha / 2, 0) / (X_j.T @ X_j)\n",
+ "    return weights\n",
+ "\n",
+ "# Evaluate Lasso Regression\n",
+ "alphas = [0.1, 1, 10, 100]\n",
+ "lasso_results = []\n",
+ "for alpha in alphas:\n",
+ " lasso_weights = lasso_regression(X_train, y_train, alpha)\n",
+ " y_test_pred_lasso = X_test_with_bias @ lasso_weights\n",
+ " test_r2_lasso = r_squared(y_test, y_test_pred_lasso)\n",
+ " lasso_results.append((alpha, test_r2_lasso))\n",
+ "\n",
+ "# Compare Ridge and Lasso\n",
+ "print(\"Ridge Best R²:\", best_r2)\n",
+ "print(\"Lasso Results:\", lasso_results)\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "e11e8e21",
+ "metadata": {},
+ "source": [
+ "### Implement Lasso Regression with hyperparameter tuning :-\n",
+ "\n",
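+ "### This function implements Lasso regression using coordinate descent: each weight is updated in turn with a soft-thresholding step, and the sweep over all weights is repeated for a fixed 2000 iterations. Lasso introduces an L1 penalty, which can shrink some coefficients to exactly zero, enabling feature selection. The intercept is not penalized. The function accepts the dataset (X, y) and a regularization parameter (alpha).\n",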
+ "\n",
+ "### Evaluate Lasso Regression :-\n",
+ "\n",
+ "### A list of alpha values is tested to determine the optimal regularization parameter. Predictions are made on the test set for each alpha, and R² is calculated to evaluate performance.\n",
+ "\n",
+ "### Compare Ridge and Lasso :-\n",
+ "\n",
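+ "### The Ridge Regression result (Best R²) is compared with Lasso Regression results for various alpha values. Ridge Regression slightly outperforms Lasso Regression in this case (0.9235 against roughly 0.9232). The Lasso scores are nearly identical for every alpha tested, which indicates that the L1 penalty has a negligible effect at these alpha values.\n"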
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "3d2c9ffd",
+ "metadata": {},
+ "source": [
+ "## Residual Analysis for Ridge Regression"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 32,
+ "id": "95f05430",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "image/png": "",
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "# Residual Plot for Ridge Regression\n",
+ "residuals = y_test - y_test_pred_ridge\n",
+ "plt.figure(figsize=(8, 5))\n",
+ "plt.scatter(y_test_pred_ridge, residuals, alpha=0.7, label=\"Residuals\")\n",
+ "plt.axhline(0, color='red', linestyle='--', label=\"Zero Line\")\n",
+ "plt.xlabel(\"Predicted Values\")\n",
+ "plt.ylabel(\"Residuals\")\n",
+ "plt.title(\"Residual Analysis for Ridge Regression\")\n",
+ "plt.legend()\n",
+ "plt.show()\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "df8a1ab6",
+ "metadata": {},
+ "source": [
+ "### - Residuals are randomly distributed around zero, indicating that the model captures the data well.\n",
+ "### - No clear patterns suggest no significant bias or omitted variables.\n",
+ "### - However, a few outliers may indicate some extreme values not well-explained by the model."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "974bfb35",
+ "metadata": {},
+ "source": [
+ "## k-fold cross-validation"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 33,
+ "id": "1c06a3f5",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Mean R²: 0.7936, Std Dev: 0.0540\n"
+ ]
+ }
+ ],
+ "source": [
+ "\n",
+ "# Implement k-fold cross-validation for model evaluation\n",
+ "# This function splits the data into k subsets, trains and tests the model k times\n",
+ "def k_fold_cross_validation(X, y, k=5, alpha=1.0, seed=42):\n",
+ "    np.random.seed(seed)  # Set seed for reproducibility\n",
+ "    indices = np.arange(len(X))\n",
+ "    np.random.shuffle(indices)  # Randomize data order\n",
+ "    X, y = X[indices], y[indices]  # Reorder data based on shuffled indices\n",
+ "\n",
+ "    fold_size = len(X) // k  # Size of each fold; any remainder rows are only ever used for training\n",
+ "    r2_scores = []  # R-squared score for each fold\n",
+ "\n",
+ "    for i in range(k):\n",
+ "        # Hold out fold i for validation and train on the remaining rows\n",
+ "        start = i * fold_size\n",
+ "        end = (i + 1) * fold_size\n",
+ "        X_val = X[start:end]\n",
+ "        y_val = y[start:end]\n",
+ "        X_train = np.vstack((X[:start], X[end:]))\n",
+ "        y_train = np.hstack((y[:start], y[end:]))\n",
+ "\n",
+ "        # Fit ridge on the training rows, add the bias column to the validation set,\n",
+ "        # predict, and store the R-squared score for this fold\n",
+ "        ridge_weights = ridge_regression(X_train, y_train, alpha)\n",
+ "        X_val_with_bias = np.c_[np.ones(X_val.shape[0]), X_val]\n",
+ "        y_val_pred = X_val_with_bias @ ridge_weights\n",
+ "        r2 = r_squared(y_val, y_val_pred)\n",
+ "        r2_scores.append(r2)\n",
+ "\n",
+ "    return np.mean(r2_scores), np.std(r2_scores)\n",
+ "\n",
+ "# Use the function with reproducibility\n",
+ "mean_r2, std_r2 = k_fold_cross_validation(X, y, k=5, alpha=best_alpha)\n",
+ "print(f\"Mean R²: {mean_r2:.4f}, Std Dev: {std_r2:.4f}\")\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "1d2eeeea",
+ "metadata": {},
+ "source": [
+ "### The k-fold cross-validation process yielded a mean R² of 0.7936 with a standard deviation of 0.0540 across the 5 folds. This is well below the 0.9235 obtained on the single test split, which suggests that split was favorable; the cross-validated figure is the more conservative estimate of performance on new data."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "134bf622",
+ "metadata": {},
+ "source": [
+ "## Bootstrapping"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 34,
+ "id": "cad70f60",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Mean Bootstrapped R²: 0.8092, Std Dev: 0.0195\n"
+ ]
+ }
+ ],
+ "source": [
+ "# Implement bootstrapping to assess model stability and estimate confidence intervals\n",
+ "# This function performs repeated sampling with replacement to generate multiple R-squared scores\n",
+ "\n",
+ "def bootstrap_r2(X, y, alpha=best_alpha, n_iterations=1000):\n",
+ "    r2_scores = []\n",
+ "    for _ in range(n_iterations):\n",
+ "        # Create a random sample with replacement\n",
+ "        indices = np.random.choice(len(X), len(X), replace=True)\n",
+ "        X_sample = X[indices]\n",
+ "        y_sample = y[indices]\n",
+ "\n",
+ "        # Train Ridge regression model on the bootstrap sample\n",
+ "        ridge_weights = ridge_regression(X_sample, y_sample, alpha)\n",
+ "        y_sample_pred = np.c_[np.ones(X_sample.shape[0]), X_sample] @ ridge_weights\n",
+ "\n",
+ "        # Calculate and store R-squared on the same bootstrap sample (in-sample score)\n",
+ "        r2 = r_squared(y_sample, y_sample_pred)\n",
+ "        r2_scores.append(r2)\n",
+ "\n",
+ "    return np.mean(r2_scores), np.std(r2_scores)  # Mean and standard deviation of bootstrapped R-squared scores\n",
+ "\n",
+ "# Perform bootstrapping and print results\n",
+ "mean_bootstrap_r2, std_bootstrap_r2 = bootstrap_r2(X, y)\n",
+ "print(f\"Mean Bootstrapped R²: {mean_bootstrap_r2:.4f}, Std Dev: {std_bootstrap_r2:.4f}\")\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "ec00218b",
+ "metadata": {},
+ "source": [
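+ "### Bootstrapping performed with 1000 iterations resulted in a mean bootstrapped R² of 0.8092 and a standard deviation of 0.0195. The small spread indicates that the fit is stable across resamples. Each model is scored on the same bootstrap sample it was trained on, so this figure measures in-sample fit rather than performance on unseen data."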
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "8dfb60e8",
+ "metadata": {},
+ "source": [
+ "## Predicted vs Actual Prices with Perfect Fit Line Visualization"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 35,
+ "id": "47d5a571",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "image/png": "",
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "# Visualization: Predicted vs. Actual Prices\n",
+ "import matplotlib.pyplot as plt\n",
+ "\n",
+ "plt.scatter(y_test, y_test_pred_ridge, label='Test Predictions', alpha=0.7)\n",
+ "plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)], color='red', label='Perfect Fit Line')\n",
+ "plt.xlabel('Actual Prices')\n",
+ "plt.ylabel('Predicted Prices')\n",
+ "plt.title('Predicted vs. Actual Prices')\n",
+ "plt.legend()\n",
+ "plt.show()\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "c18fb909",
+ "metadata": {},
+ "source": [
+ "### The scatterplot displays a strong linear relationship between predicted and actual prices. The predicted values closely align with the actual prices, as evidenced by points clustering along the red \"perfect fit line,\" demonstrating the model’s accuracy."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "293ede31",
+ "metadata": {},
+ "source": [
+ "## Airplane Price Predictor"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 36,
+ "id": "b5f65311",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Enter the specifications of the airplane:\n",
+ "Predicted Price: $38,822.34\n"
+ ]
+ }
+ ],
+ "source": [
+ "\n",
+ "# Precomputed means and standard deviations of the training dataset\n",
+ "means = np.array([1658.98, 1732.75, 911.45]) # Replace with the actual means\n",
+ "stds = np.array([1258.68, 713.65, 696.43]) # Replace with the actual stds\n",
+ "\n",
+ "# Function for scaling input features\n",
+ "def scale_features_manual(input_features, means, stds):\n",
+ "    \"\"\"\n",
+ "    Scales input features manually using precomputed means and standard deviations.\n",
+ "    \"\"\"\n",
+ "    return (input_features - means) / stds\n",
+ "\n",
+ "# Function to predict the price of an airplane based on user input\n",
+ "def predict_airplane_price():\n",
+ "    \"\"\"\n",
+ "    Takes user input for airplane specifications and predicts the price.\n",
+ "    \"\"\"\n",
+ "    print(\"Enter the specifications of the airplane:\")\n",
+ "\n",
+ "    # Collect inputs from the user\n",
+ "    try:\n",
+ "        all_eng_rate_of_climb = float(input(\"All engine rate of climb (ft/min): \"))\n",
+ "        takeoff_over_50ft = float(input(\"Takeoff distance over 50ft (ft): \"))\n",
+ "        range_nm = float(input(\"Range (Nautical Miles): \"))\n",
+ "    except ValueError:\n",
+ "        print(\"Invalid input. Please enter numerical values.\")\n",
+ "        return\n",
+ "\n",
+ "    # Create a single-row input array\n",
+ "    input_data = np.array([all_eng_rate_of_climb, takeoff_over_50ft, range_nm]).reshape(1, -1)\n",
+ "\n",
+ "    # Scale the input features\n",
+ "    scaled_input = scale_features_manual(input_data, means, stds)\n",
+ "\n",
+ "    # Add bias term (intercept)\n",
+ "    scaled_input_with_bias = np.c_[np.ones(scaled_input.shape[0]), scaled_input]\n",
+ "\n",
+ "    # Predict using the trained weights\n",
+ "    weights = np.array([100000, 200000, -50000, 15000])  # Placeholder: replace with your trained model's weights\n",
+ "    predicted_price = scaled_input_with_bias @ weights\n",
+ "    print(f\"Predicted Price: ${predicted_price[0]:,.2f}\")\n",
+ "\n",
+ "# Call the function\n",
+ "predict_airplane_price()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "bb7a862e",
+ "metadata": {},
+ "source": [
+ "### This block predicts the price of an airplane from three user-provided specifications: all engine rate of climb, takeoff distance over 50ft, and range in nautical miles. Enter a numerical value at each prompt in that order, pressing Enter after each one (for example: 1200, 1500, 700). The means, standard deviations, and weights in the cell are placeholders to be replaced with the values from your trained model."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "56a21def",
+ "metadata": {},
+ "outputs": [],
+ "source": []
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.12.0"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
diff --git a/README.md b/README.md
index f746e56..9cbdfa2 100644
--- a/README.md
+++ b/README.md
@@ -1,29 +1,267 @@
-# Project 2
+# Project 2 : Model Selection
-Select one of the following two options:
+# Team: Data Mavericks
-## Boosting Trees
+## Implement generic k-fold cross-validation and bootstrapping model selection methods.
-Implement the gradient-boosting tree algorithm (with the usual fit-predict interface) as described in Sections 10.9-10.10 of Elements of Statistical Learning (2nd Edition). Answer the questions below as you did for Project 1.
+### Overview:
-Put your README below. Answer the following questions.
+This project implements Ridge Regression for predicting airplane prices based on critical features such as engine rate of climb, takeoff distance, and range. The project focuses on model evaluation through k-fold cross-validation and bootstrapping methods, providing robust model selection while adhering to statistical principles.
-* What does the model you have implemented do and when should it be used?
-* How did you test your model to determine if it is working reasonably correctly?
-* What parameters have you exposed to users of your implementation in order to tune performance? (Also perhaps provide some basic usage examples.)
-* Are there specific inputs that your implementation has trouble with? Given more time, could you work around these or is it fundamental?
+Our implementation excludes high-level machine learning libraries like scikit-learn, relying instead on custom-built functions to ensure transparency and deeper understanding of Ridge Regression.
-## Model Selection
-Implement generic k-fold cross-validation and bootstrapping model selection methods.
+### How to Run the Code :-
+
+1. Clone the repository or download the notebook and data files.
+
+2. Ensure the dataset is saved in the same directory as the notebook. Hardcode the dataset path in the notebook if required (e.g., df = pd.read_csv('plane Price.csv')).
+
+3. Install the required libraries:
+
+```bash
+pip install numpy
+pip install pandas
+pip install statsmodels
+pip install seaborn
+pip install matplotlib
+```
+
+4. Execute the cells step by step to preprocess data, train models, and evaluate performance.
+
+5. The last code block allows you to input numerical values for 'engine rate of climb', 'takeoff over 50ft', and 'range' to get a predicted airplane price.
+
+
+## Answers to README Questions :
+
+### 1. Do your cross-validation and bootstrapping model selectors agree with a simpler model selector like AIC in simple cases (like linear regression)?
+
+### Ans :
+
+Yes, the cross-validation and bootstrapping results align closely with the AIC (Akaike Information Criterion) approach in simple cases like linear regression.
+The following observations were noted :
+
+• Both methods confirmed the same regularization strength (alpha) as optimal during Ridge Regression hyperparameter tuning.
+
+• AIC rewards low residual error while penalizing model complexity. Similarly, k-fold cross-validation minimizes prediction error on held-out folds, and bootstrapping checks model stability by refitting on resampled datasets.
+
+• Thus, in simple cases, the model selectors provide consistent results and help confirm that the regularization and model complexity are well-balanced.
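The comparison described above can be sketched on synthetic data. This is a minimal illustration, not the notebook's code: the helper names `fit_ols`, `aic_gaussian`, and `k_fold_mse` are hypothetical, the data are generated rather than taken from the airplane dataset, and AIC is computed up to an additive constant under a Gaussian error assumption.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares weights for X with an intercept column prepended."""
    Xb = np.c_[np.ones(len(X)), X]
    weights, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return weights

def predict(X, weights):
    """Predictions for X using weights that include a leading intercept."""
    return np.c_[np.ones(len(X)), X] @ weights

def aic_gaussian(X, y):
    """AIC up to an additive constant: n * ln(RSS / n) + 2 * (number of fitted weights)."""
    weights = fit_ols(X, y)
    rss = np.sum((y - predict(X, weights)) ** 2)
    n = len(y)
    return n * np.log(rss / n) + 2 * len(weights)

def k_fold_mse(X, y, k=5, seed=0):
    """Mean validation MSE over k folds; np.array_split keeps every row in some fold."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), k)
    errors = []
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        weights = fit_ols(X[train], y[train])
        errors.append(np.mean((y[val] - predict(X[val], weights)) ** 2))
    return float(np.mean(errors))

# Synthetic data (an assumption for the demo): y depends on the first two columns only
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = 3.0 + 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.5, size=200)

candidates = {"x1": [0], "x1+x2": [0, 1], "x1+x2+x3": [0, 1, 2]}
aic_scores = {name: aic_gaussian(X[:, cols], y) for name, cols in candidates.items()}
cv_scores = {name: k_fold_mse(X[:, cols], y) for name, cols in candidates.items()}

for name in candidates:
    print(f"{name:10s} AIC={aic_scores[name]:8.2f}  CV MSE={cv_scores[name]:.4f}")
```

Lower is better for both columns. Both selectors should clearly reject the model that omits a real predictor; whether they agree on the irrelevant third column depends on the noise in the sample.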
+
+
+### 2. In what cases might the methods you've written fail or give incorrect or undesirable results?
+
+### Ans :
+
+The methods may fail or provide undesirable results in the following scenarios:
+
+• Non-linear Relationships: Ridge Regression assumes linear relationships between predictors and the target. It might not function effectively if the underlying relationship is non-linear.
+
+• Outliers in the Dataset: Although Ridge reduces the impact of multicollinearity, it does not address outliers, which may skew predictions.
+
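+• Overfitting in Small Datasets: Bootstrapping can give over-optimistic results when the dataset is small, as resamples may not represent the true data distribution. In addition, the bootstrap routine scores each model on the same sample it was trained on, so its R² reflects in-sample fit.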
+
+• Multicollinearity in k-fold Splits: If the feature selection is not consistent across folds, k-fold cross-validation may produce unstable results.
+
+
+
+### 3. What could you implement given more time to mitigate these cases or help users of your methods?
+
+### Ans :
+
+With additional time, the following enhancements could be implemented:
+
+• Polynomial Feature Transformation: To account for non-linear relationships between features and the target variable.
+
+• Outlier Detection: Add preprocessing steps to detect and handle outliers, such as robust scaling or removing extreme values.
+
+• Stratified Cross-Validation: Ensure more representative splits in k-fold cross-validation, particularly for datasets with imbalanced distributions.
+
+• Dynamic Feature Selection: Incorporate automated feature selection or dimensionality reduction techniques, such as PCA, to reduce redundancy among features and simplify the model.
+
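Two of the enhancements above can be sketched in a few lines. This is only an illustration under assumptions: `iqr_inlier_mask` and `add_degree2_terms` are hypothetical helper names, the 1.5 × IQR rule is one common outlier convention among several, and the sample values are made up for the example.

```python
import numpy as np
from itertools import combinations_with_replacement

def iqr_inlier_mask(x, factor=1.5):
    """True for values inside [Q1 - factor*IQR, Q3 + factor*IQR]."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return (x >= q1 - factor * iqr) & (x <= q3 + factor * iqr)

def add_degree2_terms(X):
    """Append all squares and pairwise products to the original columns."""
    d = X.shape[1]
    extra = [X[:, i] * X[:, j]
             for i, j in combinations_with_replacement(range(d), 2)]
    return np.column_stack([X] + extra)

# Illustrative values only: one reading is far larger than the rest.
fuel = np.array([50.0, 60.0, 89.0, 120.0, 335.0, 41000.0])
mask = iqr_inlier_mask(fuel)
print("kept:", fuel[mask])

# Three features become 3 original + 6 degree-2 columns.
X = np.arange(12, dtype=float).reshape(4, 3)
X_poly = add_degree2_terms(X)
print("expanded shape:", X_poly.shape)
```

Expanding features this way increases collinearity among columns, so it would usually be paired with the existing Ridge penalty rather than used with an unregularized fit.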
+
+
+### 4. What parameters have you exposed to your users in order to use your model selectors?
+
+### Ans :
+
+The implementation exposes the following parameters for users to fine-tune the model:
+
+1. Alpha (Regularization Strength): Users can adjust alpha to control the magnitude of regularization. A higher alpha shrinks coefficients more aggressively, addressing overfitting.
+
+2. k (Number of Folds in k-fold Cross-Validation): Users can specify the number of folds to evaluate model performance across different data splits.
+
+3. Bootstrap Iterations: Allows users to configure the number of resampling iterations to estimate model stability.
+
+4. Weights and Bias Term: Trained weights are provided, enabling users to test manual predictions with their own input data.
+
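To make the role of alpha concrete, here is a minimal closed-form Ridge sketch. It is an assumption about how such a fit could look, not the notebook's implementation: `ridge_fit` is a hypothetical name, the intercept is left unpenalized by centering, and the data is synthetic.

```python
import numpy as np

def ridge_fit(X, y, alpha=1.0):
    """Closed-form ridge fit; centering keeps the bias term unpenalized."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    A = Xc.T @ Xc + alpha * np.eye(X.shape[1])
    weights = np.linalg.solve(A, Xc.T @ yc)
    bias = y_mean - x_mean @ weights
    return weights, bias

# Synthetic data (an assumption) with known coefficients.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 4.0 + rng.normal(scale=0.3, size=100)

w_small, b_small = ridge_fit(X, y, alpha=0.01)
w_large, b_large = ridge_fit(X, y, alpha=100.0)
norm_small = np.linalg.norm(w_small)
norm_large = np.linalg.norm(w_large)
print("||w|| at alpha=0.01:", norm_small)
print("||w|| at alpha=100 :", norm_large)
```

The weight norm shrinks as alpha grows, which is the behaviour the Alpha parameter above exposes to users.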
+
+
+
+### Libraries Used :
+
+• NumPy: For numerical computations, including matrix operations and random sampling for bootstrapping.
+
+• Pandas: For data manipulation and preprocessing.
+
+• Matplotlib and Seaborn: For data visualization, including plotting the correlation matrix, residual analysis, and performance evaluations.
+
+• Statsmodels: To compute Variance Inflation Factor (VIF) for multicollinearity analysis.
+
+
+
+### Key Components of the Code:
+
+1. Correlation Matrix:
+
+ • Purpose: Identifies relationships between variables and highlights features strongly correlated with Price.
+
+ • Why: Aids feature selection and avoids multicollinearity, improving model efficiency.
+
+2. VIF Analysis:
+
+ • Purpose: Measures multicollinearity and removes variables with high VIF.
+
+ • Why: Ensures stable and interpretable model coefficients.
+
+3. Ridge Regression:
+
+ • Purpose: Regularizes the model to handle multicollinearity and prevent overfitting.
+
+ • Why: Enhances generalizability by balancing bias and variance.
+
+4. Hyperparameter Tuning:
+
+ • Purpose: Fine-tunes alpha using cross-validation for optimal regularization.
+
+ • Why: Achieves the best trade-off between bias and variance.
+
+5. Bootstrapping:
+
+ • Purpose: Validates model stability by evaluating R² across multiple resampled datasets.
+
+ • Why: Ensures consistent performance under different conditions.
+
+6. Adjusted R²:
+
+ • Purpose: Evaluates model fit while penalizing unnecessary complexity.
+
+   • Why: Discourages overfitting caused by adding irrelevant predictors.
+
+7. Visualization:
+
+ • Purpose: Displays predicted vs. actual prices and residual analysis to evaluate model accuracy.
+
+   • Why: Demonstrates the goodness of fit and identifies potential deviations.
+
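The VIF and adjusted R² components listed above reduce to short formulas. The sketch below computes them with NumPy only, as an illustration; the project itself lists Statsmodels for VIF, and the function names and synthetic columns here are assumptions.

```python
import numpy as np

def vif(X):
    """VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing column j on the rest."""
    n, d = X.shape
    out = np.empty(d)
    for j in range(d):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ beta
        r2 = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        out[j] = 1.0 / (1.0 - r2)
    return out

def adjusted_r2(r2, n, p):
    """Adjusted R^2 for n observations and p predictors."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

# Synthetic columns (an assumption): b nearly duplicates a, c is independent.
rng = np.random.default_rng(1)
a = rng.normal(size=500)
b = a + 0.05 * rng.normal(size=500)
c = rng.normal(size=500)
scores = vif(np.column_stack([a, b, c]))
print("VIF per column:", scores)
print("adjusted R^2:", adjusted_r2(0.9, n=101, p=5))
```

The near-duplicate pair gets a very large VIF while the independent column stays close to 1, which is the pattern used to decide which features to drop.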
+
+
+
+
+### Airplane Price Predictor:
+
+1. Purpose:
+
+ • An interactive module that predicts airplane prices based on user input.
+
+ • Shows how the model is used in the real world.
+
+2. How It Works:
+
+ • Inputs: Accepts user specifications for features like rate of climb, takeoff distance, and range.
+
+ • Scaling: Standardizes inputs using precomputed means and standard deviations.
+
+ • Prediction: Applies the trained Ridge Regression model to compute the price.
+
+
+3. Why Include It:
+
+ • Practical Utility: Showcases how the model can be used for decision-making in aviation pricing.
+
+ • Stakeholder Engagement: Provides an interactive way to demonstrate the model’s relevance.
+
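The scale-then-predict flow described above can be sketched as follows. Every number here (means, standard deviations, weights, bias) is a hypothetical placeholder rather than a value from the trained model, and the sketch assumes the target price itself was not scaled during training.

```python
def predict_price(inputs, means, stds, weights, bias):
    """Standardize each input with training statistics, then apply the linear model."""
    scaled = [(x - m) / s for x, m, s in zip(inputs, means, stds)]
    return bias + sum(w * z for w, z in zip(weights, scaled))

# Hypothetical training statistics and model parameters, in the order:
# rate of climb (ft/min), takeoff over 50ft (ft), range (N.M.).
means = [2000.0, 1700.0, 1000.0]
stds = [1000.0, 700.0, 500.0]
weights = [400000.0, -100000.0, 300000.0]
bias = 2300000.0

price = predict_price([1800.0, 4800.0, 2000.0], means, stds, weights, bias)
print(f"Predicted Price: ${price:,.2f}")
```

Inputs must be scaled with the statistics computed on the training data, not recomputed from the user's values, otherwise the trained weights no longer apply.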
+
+
+### Code Usage Example :
+
+The project involves multiple steps for data preprocessing, model training, and evaluation. Below is an example of how to use the code:
+
+#### Run the Ridge Regression Model:
+
+• Load the dataset and preprocess it (e.g., handle missing values, scale features, compute the correlation matrix).
+
+• Train the Ridge Regression model with different alpha values to identify the optimal regularization strength.
+
+#### Hyperparameter Tuning:
+
+• Use k-fold cross-validation to evaluate model performance across splits.
+
+• Perform bootstrapping to validate model stability under resampling conditions.
+
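The two tuning steps above might look like the sketch below. It is an assumption-laden illustration rather than the notebook's code: the function names are hypothetical, the data is synthetic, and the bootstrap scores each refit on its out-of-bag rows, which is one common choice that the README does not confirm.

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Closed-form ridge fit with an unpenalized intercept."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
    return w, y_mean - x_mean @ w

def r2_score(y, y_hat):
    return 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

def kfold_r2(X, y, alpha, k=5, seed=0):
    """Mean held-out R^2 over k shuffled folds."""
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, k)
    scores = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        w, b = ridge_fit(X[train], y[train], alpha)
        scores.append(r2_score(y[test], X[test] @ w + b))
    return float(np.mean(scores))

def bootstrap_r2(X, y, alpha, n_iter=200, seed=0):
    """R^2 on out-of-bag rows for each bootstrap resample (an assumed scheme)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    scores = []
    for _ in range(n_iter):
        boot = rng.integers(0, n, size=n)        # sample rows with replacement
        oob = np.setdiff1d(np.arange(n), boot)   # rows never drawn
        if len(oob) < 2:
            continue
        w, b = ridge_fit(X[boot], y[boot], alpha)
        scores.append(r2_score(y[oob], X[oob] @ w + b))
    return np.array(scores)

# Synthetic data (an assumption).
rng = np.random.default_rng(7)
X = rng.normal(size=(150, 3))
y = X @ np.array([1.5, -2.0, 0.7]) + rng.normal(scale=0.5, size=150)

alphas = [0.01, 0.1, 1.0, 10.0, 100.0]
cv_scores = {a: kfold_r2(X, y, a) for a in alphas}
best_alpha = max(cv_scores, key=cv_scores.get)
boot_scores = bootstrap_r2(X, y, best_alpha)
print("best alpha:", best_alpha)
print("bootstrap R^2 mean/std:", boot_scores.mean(), boot_scores.std())
```

A small spread in the bootstrap scores suggests the chosen alpha is stable under resampling; a wide spread would be a reason to revisit the alpha grid or the feature set.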
+#### Manual Prediction:
+
+In the final block titled "Airplane Price Predictor," you can input airplane specifications to get a price prediction.
+
+Example:
+
+```text
+# Example Inputs :
+Enter engine rate of climb (ft/min): 1800
+Enter takeoff distance over 50ft (ft): 4800
+Enter range (nautical miles): 2000
+
+# Output :
+Predicted Price: $2,750,000.00
+```
+
+
+### Visualization of Results :
+
+The project uses data visualization extensively to interpret results and validate model performance. Below are the key visual outputs included in the project:
+
+1. Correlation Matrix:
+
+ • A heatmap visualizing the relationships between features.
+
+ • Example: Features like "Takeoff Distance" and "Engine Rate of Climb" show strong correlations with the price, aiding in feature selection.
+
+2. Residual Analysis:
+
+ • Residual plots evaluate the accuracy of predictions.
+
+ • Example: Residuals scatter symmetrically around zero, indicating a well-fitted model.
+
+3. Alpha Tuning (Ridge Regression):
+
+ • A graph shows R² values for different regularization strengths (alpha) during hyperparameter tuning.
+
+ • Example: The curve helps identify the alpha value that provides the best trade-off between bias and variance.
+
+4. Predicted vs. Actual Prices:
+
+ • A scatter plot comparing model predictions with actual prices.
+
+ • Example: Points align closely to the diagonal "Perfect Fit" line, demonstrating good model predictions.
+
+
+### Contribution
+
+Kaustubh Dangche - A20550806
+1. Data Cleaning and Preprocessing with VIF Analysis
+The data cleaning process removes inconsistencies and handles missing values to ensure quality. The Variance Inflation Factor (VIF) is calculated to identify and remove highly collinear features, improving model stability and interpretability. This step ensures the selected features are relevant for predicting airplane prices.
+________________________________________
+Hyunsung Ha - A20557555
+2. Feature Scaling, Train-Test Split, and Regression Models
+Feature scaling standardizes input variables, ensuring uniform contribution to the model. The dataset is split into training and testing sets to evaluate performance on unseen data. Ridge and Lasso regression models are compared, with Ridge emphasizing multicollinearity handling and Lasso favoring sparse feature selection.
+________________________________________
+Anu Singh - A20568373
+3. K-Fold Cross-Validation and Bootstrapping
+K-fold cross-validation evaluates model performance across multiple data splits, ensuring robustness and reliability. Bootstrapping validates the model's stability by testing it on resampled datasets, providing additional confidence in its generalizability.
+________________________________________
+Nam Gyu Lee - A20487452
+4. Visualization and the Plane Price Predictor
+Data visualization, including correlation matrices, residual plots, and predicted vs. actual price comparisons, highlights relationships and model accuracy. The interactive airplane price predictor allows users to input specifications and get real-time predictions, demonstrating the model's practical application in aviation pricing.
-In your README, answer the following questions:
-* Do your cross-validation and bootstrapping model selectors agree with a simpler model selector like AIC in simple cases (like linear regression)?
-* In what cases might the methods you've written fail or give incorrect or undesirable results?
-* What could you implement given more time to mitigate these cases or help users of your methods?
-* What parameters have you exposed to your users in order to use your model selectors.
-See sections 7.10-7.11 of Elements of Statistical Learning and the lecture notes. Pay particular attention to Section 7.10.2.
-As usual, above-and-beyond efforts will be considered for bonus points.