Author: Jesus Antonanzas Acero, Alex Carrillo Alza
Version: 1.0
Email: jesus.maria.antonanzas@est.fib.upc.edu, alex.carrillo.alza@est.fib.upc.edu
Info: BDA, GCED, Big Data Analytics project
Date: 16/12/2019
Our project contains 5 .py scripts:

- config.py: configures the Spark environment when necessary
- load_into_hdfs.py: loads the sensor CSVs into HDFS in Avro format
- data_management.py: creates the training data matrix
- data_analysis.py: trains a decision tree classifier model on the training data
- data_classifier.py: classifies new flight observations
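As an illustration of the kind of work load_into_hdfs.py performs, here is a minimal sketch of reading the sensor CSVs and persisting them as Avro. The function name, the paths, and the assumption that the spark-avro package is on the classpath are ours, not part of the project code.

```python
def write_sensors_as_avro(spark, csv_dir, avro_dir):
    """Minimal sketch: read sensor CSVs and persist them as Avro.

    `spark` is an active SparkSession; the spark-avro package must be
    available (shipped as an external module for Spark 2.4+).
    All names and paths here are illustrative, not the project's own.
    """
    # Read every CSV in the directory, keeping the header row as column names.
    df = spark.read.option("header", "true").csv(csv_dir)
    # Overwrite any previous load and store the data in Avro format.
    df.write.mode("overwrite").format("avro").save(avro_dir)

# Usage on a Spark-enabled environment (paths are placeholders):
#   spark = SparkSession.builder.appName("load_into_hdfs").getOrCreate()
#   write_sensors_as_avro(spark, "resources", "hdfs://<namenode>/<target-dir>")
```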
These scripts must be placed in the same directory as the resources directory included in the code skeleton, so that the local reading option for the sensor data (CSVs) works.
load_into_hdfs.py must be executed first if you want to use HDFS.
Note that when using this option, the HDFS path into which the files are loaded has to be changed explicitly, as well as the reading path in
data_management.py and data_classifier.py.
If nothing is specified in data_management.py or data_classifier.py, the CSVs are read locally.
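The local-versus-HDFS choice described above can be sketched with a small, hypothetical helper; the function name, parameters, and example paths are ours, not the project's actual API.

```python
def resolve_input_path(hdfs_dir=None, local_dir="resources"):
    """Return the directory from which the sensor CSVs are read.

    If an HDFS directory is specified explicitly, it takes precedence;
    otherwise the CSVs are read from the local resources directory.
    Names and paths here are illustrative, not the project's own.
    """
    return hdfs_dir if hdfs_dir else local_dir

# Nothing specified -> local read, as described above:
print(resolve_input_path())  # -> resources
# An explicit HDFS path overrides the local default:
print(resolve_input_path("hdfs://namenode:9000/user/bda/sensors"))
```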
Then, the order of execution is:

1. load_into_hdfs.py (optional)
2. data_management.py
3. data_analysis.py
4. data_classifier.py
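Assuming the scripts are launched with spark-submit (an assumption on our part; they may equally be run with plain python if config.py sets up Spark itself), the sequence above would look like:

```shell
spark-submit load_into_hdfs.py    # optional: load the sensor CSVs into HDFS as Avro
spark-submit data_management.py   # build the training data matrix
spark-submit data_analysis.py     # train the decision tree classifier
spark-submit data_classifier.py   # classify new flight observations
```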
Note that these scripts will write three objects:

- data_matrix
- model
- test_matrix