Mason Phung
Last updated: December 2024
Space Exploration Technologies Corp., or SpaceX, is an American spacecraft manufacturer known for its successful missions sending spacecraft and astronauts to the International Space Station. It is also well known for its VTVL (vertical takeoff, vertical landing) rocket launches, in which rockets land and are reused, saving the company a large share of its launch costs.
One of SpaceX's most popular rockets, the Falcon 9, has landed and been reflown more than 200 times. The rocket is advertised on SpaceX's website with a launch cost of 62 million dollars, while other providers charge upward of 165 million dollars per launch. Much of the savings comes from SpaceX's ability to reuse the first stage. Therefore, if we can determine whether the first stage will land, we can estimate the cost of a launch.
We work at a business in the aerospace industry that is developing a space rocket and researching different rocket technologies and its competitors.
In this project, we collect and analyze data from past Falcon 9 launches, then try to predict the outcome of future launches using different Machine Learning models. The main purpose is to find which factors contribute to the success of each flight and to build a Machine Learning model that can predict the outcome of a rocket launch.
- The original questions and problem sets were initially part of the IBM Data Science Specialization capstone project.
- After completing the base project, I created a new personal version of the analysis, the dashboard app, and the machine learning section by adding extra integrations, new features, and customization (details below).
- I have refined this project several times since then!
Python programming and SQL data analysis.
- Data collection: API data collection with `requests` and web scraping with `BeautifulSoup`
- Data analysis with SQL using a local `MySQL` database and the `SQLAlchemy` toolkit for Python
- Exploratory analysis by visualization with `matplotlib` and `seaborn`
- Building an interactive dashboard using `dash` and `plotly.express`
- Training classification models (Logistic Regression, SVM, Decision Tree, kNN, XGBoost, Multi-layer perceptron) with hyperparameter tuning to predict future landing outcomes with `scikit-learn`
The data analysis project provides an in-depth view of the space industry. The data not only presents the important features to consider in each rocket launch, but also shows which essential data should be recorded once a launch is finished.
The exploratory analysis shows the remarkable progress the company made over a period of 10 years: the success rate rose from about 20% in the first 10 launches to around 70-80% from the 20th flight onward, and reached 100% after the 80th flight.
The increased success rate allowed the company to conduct further experiments with higher payloads. According to the data, the carried payload tended to increase launch after launch. From an average of about 3000 kg per launch in 2010-2014, the average payload increased by about 30% every 2 years and rose sharply, by 230%, in the last 3 years (2018-2020). Note that this is the period in which SpaceX achieved its highest launch success rate.
Since the launches have a high success rate overall, geography may be an important contributor to this outcome. The three launch sites are all located on the coast of the United States. Some notable points:
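The trend described above comes from grouping launches by flight number and computing the share of successful landings in each group. A minimal sketch of that calculation follows, assuming outcomes are stored as a list of 0/1 values ordered by flight number. The function name `success_rate_by_bin` and the toy sequence are hypothetical, used for illustration only, and are not the project's data or code.

```python
from typing import List, Tuple


def success_rate_by_bin(
    outcomes: List[int], bin_size: int = 10
) -> List[Tuple[int, int, float]]:
    """Return (first_flight, last_flight, success_rate) for consecutive bins.

    `outcomes` is ordered by flight number: 1 = landed, 0 = did not land.
    """
    if bin_size <= 0:
        raise ValueError("bin_size must be positive")
    rates = []
    for start in range(0, len(outcomes), bin_size):
        chunk = outcomes[start:start + bin_size]
        # Flight numbers are 1-based, list indices are 0-based.
        rates.append((start + 1, start + len(chunk), sum(chunk) / len(chunk)))
    return rates


# Toy sequence for illustration only, NOT the project's data.
toy_outcomes = [0, 0, 1, 0, 0, 0, 1, 0, 0, 0,
                1, 1, 0, 1, 1, 1, 1, 0, 1, 1]

for first, last, rate in success_rate_by_bin(toy_outcomes, bin_size=10):
    print(f"flights {first}-{last}: {rate:.0%} success")
```

With real data, the same grouping could be done on a DataFrame column, but the binning logic would stay the same.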
- The sites are far from urban and residential areas.
- The sites are close to forests or natural/wildlife areas.
- Each site has one side facing the ocean.
- Railway routes run through the sites (black/white lines).
The first three criteria contribute to the safety of the population. Based on industry information, the failure rate of rocket launches is usually around 4-5%, while in the initial phase it can be up to 30% (in our analysis, we observed a failure rate of 80% in the first 10 launches). Therefore, it is extremely important for launch sites to be built close to uninhabited zones (such as forests and wildlife areas). Moreover, as the ocean is also considered uninhabited, it is an ideal place for failing rockets to come down, which reduces noise, damage, and other negative effects on the home country.
In this project, we need to use classification models because the target variable (class: success or not success) is categorical. All of the machine learning models performed well and yielded high accuracy scores.
The models were better at predicting successful launches (class = 1, with an average f1-score of 87.5%), while they seem to be a little less reliable in predicting the other outcome. This can be explained by the imbalance between class=0 and class=1: there are twice as many class=1 samples as class=0 samples. This imbalance skews the weight of the two classes and can affect the accuracy score negatively. In our work, we applied two strategies to mitigate this problem: stratification, and the parameter class_weight = 'balanced', which weights each class inversely to its frequency so that the majority class (class 1) does not dominate training.
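To make the two strategies concrete, here is a minimal pure-Python sketch using the 60/30 class split reported for this dataset. The weight formula, `n_samples / (n_classes * class_count)`, is the one scikit-learn documents for `class_weight='balanced'`. The helper names `balanced_class_weights` and `stratified_split` are hypothetical and only illustrate the idea; in the project itself these steps would more likely be handled by scikit-learn (for example `train_test_split(..., stratify=y)`), which is an assumption about the implementation.

```python
import random
from collections import Counter, defaultdict


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx), sampling each class separately so the
    test set keeps roughly the same class proportions as the full data."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# 60 successful (1) and 30 failed (0) launches, as counted in this project.
labels = [1] * 60 + [0] * 30

weights = balanced_class_weights(labels)
train_idx, test_idx = stratified_split(labels, test_fraction=0.2)

print(weights)
print(Counter(labels[i] for i in test_idx))
```

The minority class receives twice the weight of the majority class, and the test split preserves the 2:1 ratio. Neither step adds information about failed launches; they only stop the existing imbalance from distorting training and evaluation.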
There are many factors that we still need to consider to make the models work better and to make sure they predict without bias or overfitting.
- More data is needed, especially for failed launches. Out of 90 launches, 60 were successful and only 30 failed. Not only does this make the class weights imbalanced when modeling, it also suggests that we may not have enough `failed` data for the models to train on (we applied stratification to mitigate this, but it is clearly better to have actual data).
- The dataset is relatively small. With more data, the models would have more to learn from, which helps prevent overfitting and biased predictions.
- Removed lengthy features and converted repeated code into predefined functions.
- Added step explanations, result comments to each section.
- Set up and worked with SQL in a local MySQL database (Originally IBM Skills Network Lab Environment).
- Added new ML models with correlation-based features selection and hyperparameter tuning.
- Estimated the models' performance with classification report, ROC curve and AUC score.
- Styled with `dash_bootstrap_components`, external CSS, and a responsive customized layout.
- Added multiple-input callbacks.
- Added extra plots, statistic summary, global attribute selection and year slider.
Ver 3.0
- Fix linguist to detect Python.
- Split the analysis notebook into small notebooks for easier interpretation.
- Add some SQL scripts in `/sql`.
- Refine working repo.
Ver 2.0
- Remove separate notebooks and combine all parts of the project into a single Jupyter notebook (the dash app is still in a separate .py file).
- Refine the data cleaning part and add new cleaning features so the collected data can be used in all parts of the project (originally, later parts used pre-cleaned data, not the data collected at the beginning).
- Remove multiple excess datasets and only use the original collected datasets.


