1101-datascience · github-classroom · Dec 16, 2021 · Jan 3, 2022 · Jan 4, 2022 · Jan 4, 2022
diff --git a/.gitignore b/.gitignore
@@ -0,0 +1 @@
+.DS_Store
diff --git a/README.md b/README.md
@@ -1,51 +1,213 @@
-# [GroupID] Title of your final project
-
-### Groups
-* name, student ID1
-* name, student ID2
-* name, student ID3
-* ...
-
-### Goal
-A breif introduction about your project, i.e., what is your goal?
+# [Group4] Bankruptcy Prediction
 
+### Group members
+* 鄭宇傑, 108703014
+* 賴冠瑜, 108703019
+* 張瀚文, 108304003
+* 江宗樺, 108703029
+* 田詠恩, 108703030
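+### Goal
+More than 95% of the companies in the dataset did not go bankrupt (Bankruptcy == 0),
+so predicting 0 for every company already yields a very high accuracy.
+We therefore set our goal to maximize recall: flag as many companies at risk of failing as possible, so that they can be monitored or early countermeasures can be taken, and examine the problems they may be facing. This is the main goal of the project.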
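+
+To see why accuracy is misleading here, consider a hypothetical set of 1000 companies of which 950 did not go bankrupt (illustrative numbers chosen to match the 95% figure, not the actual dataset counts). With $TP$, $TN$, $FP$, $FN$ counted for the positive class Bankruptcy == 1, a model that predicts 0 for every company has $TP = 0$, $TN = 950$, $FP = 0$, $FN = 50$, so:
+
+$$
+\text{accuracy} = \frac{TP + TN}{TP + TN + FP + FN} = \frac{0 + 950}{1000} = 0.95,
+\qquad
+\text{recall} = \frac{TP}{TP + FN} = \frac{0}{0 + 50} = 0
+$$
+
+Such a model looks accurate yet never flags a single failing company, which is why recall is the more informative target here.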
 ### Demo 
-You should provide an example commend to reproduce your result
-```R
-Rscript code/your_script.R --input data/training --output results/performance.tsv
-```
-* any on-line visualization
+* [ShinyApp](https://yjack0000.shinyapps.io/shinyui/?_ga=2.142920117.1862022445.1641973117-1531152518.1641397296)
+* Rscript usage
+```console
+  Rscript pcaRpart.R (max depth) (threshold)
+  ```
 
-## Folder organization and its related information
+## Folder organization and related information
 
 ### docs
-* Your presentation, 1101_datascience_FP_<yourID|groupName>.ppt/pptx/pdf, by **Jan. 13**
-* Any related document for the final project
-  * papers
-  * software user guide
+* [Google Slide for Presentation](https://docs.google.com/presentation/d/1TWPNksUenzi-DsquO6Yv7WBCVPvZE-HgyjMmvAcAH3U/edit#slide=id.g10d591fe8d9_0_169)
 
 ### data
 
-* Source
-* Input format
-* Any preprocessing?
-  * Handle missing data
-  * Scale value
+* [Source](https://www.kaggle.com/fedesoriano/company-bankruptcy-prediction)
+* [Format](https://github.com/1101-datascience/finalproject_group4/tree/main/data)
+* Preprocessing
+  * Data analysis
+  ![Correlation](./graph/corr2.png)
+  * PCA
+  ![PCA](./graph/PC.png)
+  ![Variable Explained](./graph/var_explain.png)
+  * Normalize
 
 ### code
 
-* Which method do you use?
-* What is a null model for comparison?
-* How do your perform evaluation? ie. cross-validation, or addtional indepedent data set
+* Methods we use
+  * decision tree
+  ```console
+  Rscript pcaRpart.R (max depth) (threshold)
+  ```
+
+    ![Decision Tree](./graph/decision_tree.png)
+  * random forest
+  ```console
+  Rscript pcaRandomForest.R (tree number)
+  ```
+
+    ![Random Forest](./graph/random_forest.png)
+  * logistic regression 
+  ```console
+  Rscript pcaGlm.R (threshold)
+  ```
+
+    ![Logistic Regression](./graph/logistic_regression.jpeg)
+  * cnn
+
+
+* Null model: predicts 1 for every company
+* Data split
+
+  ![Data Split](./graph/data_split.png)
+
+
+* Use SMOTE to generate additional training data
+
+  ![smote](./graph/original.png)
+  ![smote](./graph/oversample.png)
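+
+  As a brief sketch of the standard SMOTE procedure (the notation below is ours, not taken from the project code): for a minority-class sample $x_i$, one of its $k$ nearest minority-class neighbours $x_{nn}$ is picked at random, and a synthetic sample is placed on the segment between them:
+
+  $$
+  x_{\text{new}} = x_i + \lambda \, (x_{nn} - x_i), \qquad \lambda \sim \mathcal{U}(0, 1)
+  $$
+
+  Oversampling is typically applied to the training split only, so that the validation and test sets keep the original class ratio.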
 
 ### results
 
-* Which metric do you use 
-  * precision, recall, R-square
-* Is your improvement significant?
-* What is the challenge part of your project?
+* We use precision, accuracy, and recall to evaluate the models
+* Improvements
+  1. Run PCA and keep the first 40 principal components
+  2. Use k-fold validation to select the model with the highest recall
+  3. Tune hyperparameters such as threshold and max depth
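+
+  For step 1, the share of variance explained by the first $m$ principal components follows from the component standard deviations $s_j$ returned by PCA (`pca$sdev` in the random forest code below; $p$ is the total number of components, and the symbols are our notation):
+
+  $$
+  \text{explained}(m) = \frac{\sum_{j=1}^{m} s_j^2}{\sum_{j=1}^{p} s_j^2}
+  $$
+
+  Here $m = 40$.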
+
+* decision tree (see the code folder for details)
+  ```r
+  library(rpart)   # decision trees
+  library(dplyr)   # filter()
+
+  # `res` holds the data split built in the full script.
+  model <- rpart(Bankrupt. ~ .,
+                 data = res[["train"]],
+                 control = rpart.control(maxdepth = 20),
+                 method = "class")
+
+  # Compare the true labels with the predictions on the training data.
+  trainframe <- data.frame(truth = res[["train"]]$Bankrupt.,
+                           pred = predict(model, type = "class"))
+
+  # Confusion-matrix counts, with Bankrupt. == 1 as the positive class.
+  TP.train <- nrow(filter(trainframe, truth == pred, truth == 1))
+  TN.train <- nrow(filter(trainframe, truth == pred, truth == 0))
+  FP.train <- nrow(filter(trainframe, truth != pred, truth == 0))
+  FN.train <- nrow(filter(trainframe, truth != pred, truth == 1))
+
+  accuracy.train <- (TP.train + TN.train) / nrow(trainframe)
+  fallback.train <- TP.train / (TP.train + FN.train)           # recall
+  precision.train <- TP.train / (TP.train + FP.train)
+  NegativePrecision.train <- TN.train / (TN.train + FN.train)  # negative predictive value
+  ```
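+
+  In terms of the confusion-matrix counts, with Bankrupt. == 1 as the positive class, the four quantities computed above correspond to the following definitions (`fallback` is what is usually called recall, and `NegativePrecision` the negative predictive value, NPV):
+
+  $$
+  \text{accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \quad
+  \text{recall} = \frac{TP}{TP + FN}, \quad
+  \text{precision} = \frac{TP}{TP + FP}, \quad
+  \text{NPV} = \frac{TN}{TN + FN}
+  $$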
+
+  ![decision tree](./graph/DecisionTree-ConfusionMatrix.png)  ![decision tree](./graph/DecisionTree-PCA-ConfusionMatrix.png)
+
+* random forest (see the code folder for details)
+
+  ```r
+  library(randomForest)
+
+  # `data` (the full data set), `spec` (a named vector of split proportions)
+  # and `tree` (the number of trees, from the command line) are defined
+  # in the full script.
+
+  # Randomly assign each row to one of the splits named in `spec`.
+  g <- sample(cut(
+    seq(nrow(data)),
+    nrow(data) * cumsum(c(0, spec)),
+    labels = names(spec)
+  ))
+
+  # final data
+  res <- split(data, g)
+
+  # Fit PCA on the training predictors only.
+  in_d <- res[["train"]]
+  in_d <- in_d[, !colnames(in_d) %in% c("Net.Income.Flag", "Bankrupt.")]
+  pca <- prcomp(in_d, center = TRUE, scale. = TRUE)
+
+  # Proportion of variance explained by each component.
+  std_dev <- pca$sdev
+  pr_var <- std_dev^2
+  prop_varex <- pr_var / sum(pr_var)
+  # plot(prop_varex, type = "l")
+
+  # Build the training data from Bankrupt. and the top 40 components.
+  train.data <- data.frame(Bankrupt. = res[["train"]]$Bankrupt., pca$x)
+  train.data <- train.data[, 1:41]
+  # Columns 2:41 are PC1..PC40.
+  train.data.var <- colnames(train.data[, 2:41])
+
+  # Project the validation data onto the same components.
+  val.data <- predict(pca, newdata = res[["validate"]])
+
+  model <- randomForest(x = train.data[, train.data.var],
+                        y = as.factor(train.data$Bankrupt.),
+                        ntree = as.integer(tree), importance = TRUE)
+
+  val <- data.frame(truth = res[["validate"]]$Bankrupt.,
+                    prediction = predict(model, val.data))
+
+  # confusion matrix of validation
+  cm <- table(val)
+  ```
+
+  ![random forest](./graph/RandomForest-ConfusionMatrix.png)  ![random forest](./graph/RandomForest-PCA-ConfusionMatrix.png)
+
+* logistic regression (see the code folder for details)
+  ```r
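+  library(dplyr)   # mutate()
+
+  # Binomial regression with a probit link on the PCA-transformed training data.
+  # `train.data`, `pca`, `res` and the threshold `ts` come from the full script.
+  model <- glm(formula = Bankrupt. ~ .,
+               family = binomial(link = "probit"),
+               epsilon = 1e-14,
+               data = train.data)
+
+  # Project the validation data and keep the first 40 components.
+  val.data <- predict(pca, newdata = res[["validate"]])
+  val.data <- as.data.frame(val.data)
+  val.data <- val.data[, 1:40]
+
+  # Note: predict() on a glm returns the linear predictor by default;
+  # pass type = "response" to obtain probabilities instead.
+  val <- data.frame(truth = res[["validate"]]$Bankrupt.,
+                    prediction = predict(model, val.data))
+  val <- mutate(val, pred = ifelse(prediction > ts, 1, 0))
+
+  # confusion matrix of validation (truth vs. thresholded prediction)
+  cm <- table(val[, c(1, 3)])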
+  ```
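+
+  The probit model above assumes $P(\text{Bankrupt.} = 1 \mid z) = \Phi(\beta_0 + \beta^\top z)$, where $z$ holds the 40 principal-component scores and $\Phi$ is the standard normal CDF (notation ours). The final label is obtained by thresholding the model output $s(z)$ at `ts`:
+
+  $$
+  \hat{y} = \begin{cases} 1 & \text{if } s(z) > ts \\ 0 & \text{otherwise} \end{cases}
+  $$
+
+  Lowering `ts` can only move predictions from 0 to 1, so recall never decreases, though precision usually suffers.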
+
+  ![logistic regression](./graph/LogisticRegression-ConfusionMatrix.png)  ![logistic regression](./graph/LogisticRegression-PCA-ConfusionMatrix.png)
+* cnn (a fully connected network built from nn.Linear layers; see the code folder for details)
+  ```python
+  import torch.nn as nn
+
+
+  class BinaryClassification(nn.Module):
+      def __init__(self):
+          super().__init__()
+          # The network takes 95 input features.
+          self.layer_1 = nn.Linear(95, 200)
+          self.layer_2 = nn.Linear(200, 64)
+          self.layer_out = nn.Linear(64, 1)
+
+          self.relu = nn.ReLU()
+          self.dropout = nn.Dropout(p=0.5)
+          self.batchnorm1 = nn.BatchNorm1d(200)
+          self.batchnorm2 = nn.BatchNorm1d(64)
+
+      def forward(self, inputs):
+          # Two hidden layers: linear -> ReLU -> batch norm.
+          x = self.relu(self.layer_1(inputs))
+          x = self.batchnorm1(x)
+          x = self.relu(self.layer_2(x))
+          x = self.batchnorm2(x)
+          x = self.dropout(x)
+          # Single raw output (logit); no sigmoid is applied here.
+          x = self.layer_out(x)
+          return x
+  ```
+
+  ![cnn](./graph/cnn-ConfusionMatrix.png)
+
+* Challenges
+  * The training data is imbalanced (most companies did not go bankrupt)
+  * Tuning the hyperparameters
+  * Testing different numbers of epochs
+  * Choosing between precision and recall
+
 
 ## References
-* Code/implementation which you include/reference (__You should indicate in your presentation if you use code for others. Otherwise, cheating will result in 0 score for final project.__)
-* Packages you use
-* Related publications
+* https://www.kaggle.com/jerryfang5/bankrutcy-prediciton-by-r/notebook
+* https://www.kaggle.com/seongwonr/bankruptcy-prediction-with-smote
+* https://colab.research.google.com/drive/12wXAyrbX8Ji5J6CNAEIQwtDOaxy8BCIO?usp=sharing
+* https://towardsdatascience.com/deep-learning-using-pytorch-for-tabular-data-c68017d8b480
diff --git a/code/empty b/code/empty