45 commits
0187fc3
Setting up GitHub Classroom Feedback
github-classroom[bot] Dec 16, 2021
83d9718
Bankrutcy Prediciton
YJack0000 Jan 3, 2022
eb88096
Update team member names
YJack0000 Jan 4, 2022
8122c79
Delete empty
YJack0000 Jan 4, 2022
7d2a0ed
Correlation
YJack0000 Jan 5, 2022
d6ae567
Add gitignore and decision_tree
MarkLai0317 Jan 6, 2022
3246e71
Add panel
YJack0000 Jan 6, 2022
a7179ea
Merge branch 'main' of https://github.com/1101-datascience/finalproje…
YJack0000 Jan 6, 2022
cb65dee
upload pca with decision tree
bryant-nn Jan 6, 2022
6cc1302
Merge branch 'main' of https://github.com/1101-datascience/finalproje…
YJack0000 Jan 6, 2022
aa822a4
README v1.0
YJack0000 Jan 12, 2022
cd44583
No changes
YJack0000 Jan 12, 2022
350d5fc
Add shiny
YJack0000 Jan 12, 2022
8e7bdf6
sihny
YJack0000 Jan 12, 2022
80a9135
Change to README
YJack0000 Jan 12, 2022
464ca3d
README.
YJack0000 Jan 12, 2022
962e45e
Fix formatting
YJack0000 Jan 12, 2022
93dc5cb
Update README.md
YJack0000 Jan 12, 2022
46183c2
Update README.md
YJack0000 Jan 12, 2022
ffb6ed2
epoch graph
MarkLai0317 Jan 12, 2022
4a093b7
d
YJack0000 Jan 12, 2022
1c5c9e2
Merge branch 'main' of https://github.com/1101-datascience/finalproje…
YJack0000 Jan 12, 2022
9ba3a58
Images
YJack0000 Jan 12, 2022
17e2fad
upload pca with decision tree, random forest, lr
bryant-nn Jan 12, 2022
0191279
add image
MarkLai0317 Jan 12, 2022
0ad53aa
readme
MarkLai0317 Jan 12, 2022
59e6dc9
load plcture with decision tree, random forest, lr
bryant-nn Jan 12, 2022
3754a1f
Resolve merge conflict
MarkLai0317 Jan 12, 2022
4657fbc
Merge branch 'main' of github.com:1101-datascience/finalproject_group…
MarkLai0317 Jan 12, 2022
53d00dd
Update readme; command line still missing
MarkLai0317 Jan 12, 2022
9877e83
Done
MarkLai0317 Jan 12, 2022
2109dc1
Add random forest code
MarkLai0317 Jan 12, 2022
627d086
add link
MarkLai0317 Jan 12, 2022
4717158
last version
YJack0000 Jan 12, 2022
3d2fa7a
Merge branch 'main' of https://github.com/1101-datascience/finalproje…
YJack0000 Jan 12, 2022
d85e1bb
[What]Finish slides and add some image
CTHua Jan 12, 2022
e4556f3
[What]Export MARP to PDF
CTHua Jan 12, 2022
7a3a4f2
[What]Fix README to preview of slide
CTHua Jan 12, 2022
1c45d18
[What]Update README
CTHua Jan 12, 2022
21aaf1a
[What]Fix conclusion
CTHua Jan 13, 2022
b1e0316
Update slides.md
CTHua Jan 13, 2022
3f840be
[What]Update conclusion
CTHua Jan 13, 2022
5debc6c
[What]Update PDF
CTHua Jan 13, 2022
81a02ad
ppt
MarkLai0317 Jan 13, 2022
db2d11a
[What]Fix PDF
CTHua Jan 13, 2022
1 change: 1 addition & 0 deletions .gitignore
@@ -0,0 +1 @@
.DS_Store
232 changes: 197 additions & 35 deletions README.md
@@ -1,51 +1,213 @@
# [GroupID] Title of your final project

### Groups
* name, student ID1
* name, student ID2
* name, student ID3
* ...

### Goal
A brief introduction about your project, i.e., what is your goal?
# [Group4] Bankruptcy Prediction

### Team members
* 鄭宇傑, 108703014
* 賴冠瑜, 108703019
* 張瀚文, 108304003
* 江宗樺, 108703029
* 田詠恩, 108703030
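### Goal
More than 95% of the companies in the data set did not go bankrupt (Bankruptcy == 0),
so a model that predicts the majority class for every company already achieves very high accuracy.
We therefore set our goal to maximizing recall. The main aim of this project is to flag as many companies at risk of failing as possible, so that they can be monitored, countermeasures can be taken early, and the problems they may be facing can be examined.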
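
To make the accuracy versus recall argument concrete, here is a minimal sketch in Python. The labels are hypothetical toy data (95 healthy companies and 5 bankrupt ones), not the project's data set, and `accuracy_and_recall` is an illustrative helper rather than a function from the code folder.

```python
def accuracy_and_recall(truth, pred):
    """Return (accuracy, recall) for binary labels where 1 means bankrupt."""
    tp = sum(1 for t, p in zip(truth, pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(truth, pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(truth, pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / len(truth)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, recall


# Hypothetical toy labels: 95 healthy companies (0) and 5 bankrupt ones (1).
truth = [0] * 95 + [1] * 5
# A constant predictor that always outputs the majority class.
majority_pred = [0] * len(truth)

acc, rec = accuracy_and_recall(truth, majority_pred)
print(f"accuracy={acc:.2f}, recall={rec:.2f}")  # accuracy=0.95, recall=0.00
```

The constant predictor looks excellent on accuracy while finding none of the bankrupt companies, which is why recall is the more informative target here.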
### Demo
You should provide an example command to reproduce your result
```R
Rscript code/your_script.R --input data/training --output results/performance.tsv
```
* any on-line visualization
* [ShinyApp](https://yjack0000.shinyapps.io/shinyui/?_ga=2.142920117.1862022445.1641973117-1531152518.1641397296)
* Rscript usage
```console
Rscript pcaRpart.R (max depth) (threshold)
```

## Folder organization and its related information
## File structure and related information

### docs
* Your presentation, 1101_datascience_FP_<yourID|groupName>.ppt/pptx/pdf, by **Jan. 13**
* Any related document for the final project
* papers
* software user guide
* [Google Slide for Presentation](https://docs.google.com/presentation/d/1TWPNksUenzi-DsquO6Yv7WBCVPvZE-HgyjMmvAcAH3U/edit#slide=id.g10d591fe8d9_0_169)

### data

* Source
* Input format
* Any preprocessing?
* Handle missing data
* Scale value
* [Source](https://www.kaggle.com/fedesoriano/company-bankruptcy-prediction)
* [Format](https://github.com/1101-datascience/finalproject_group4/tree/main/data)
* Preprocessing
* Data analysis
![Correlation](./graph/corr2.png)
* PCA
![PCA](./graph/PC.png)
![Variable Explained](./graph/var_explain.png)
* Normalize
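
The normalization and PCA steps listed above can be illustrated with a small Python sketch. It assumes z-score standardization and takes the component standard deviations as given; the function names and the toy numbers are hypothetical and only show how the "variance explained" curve is derived and how a component count could be read off it.

```python
import statistics


def standardize(column):
    """Center a numeric column to mean 0 and scale it to unit sample standard deviation."""
    mean = statistics.fmean(column)
    sd = statistics.stdev(column)
    return [(x - mean) / sd for x in column]


def explained_variance_ratio(component_sdevs):
    """Share of the total variance carried by each principal component."""
    variances = [s ** 2 for s in component_sdevs]
    total = sum(variances)
    return [v / total for v in variances]


def components_needed(ratios, target):
    """Smallest number of leading components whose cumulative share reaches target."""
    cumulative = 0.0
    for k, ratio in enumerate(ratios, start=1):
        cumulative += ratio
        if cumulative >= target:
            return k
    return len(ratios)


# Hypothetical standard deviations of three components.
ratios = explained_variance_ratio([2.0, 1.0, 1.0])
print(standardize([1.0, 2.0, 3.0]))    # [-1.0, 0.0, 1.0]
print(components_needed(ratios, 0.8))  # 2
```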

### code

* Which method do you use?
* What is a null model for comparison?
* How do you perform evaluation? i.e., cross-validation, or an additional independent data set
* method we use
* decision tree
```console
Rscript pcaRpart.R (max depth) (threshold)
```

![Decision Tree](./graph/decision_tree.png)
* random forest
```console
Rscript pcaRandomForest.R (tree number)
```

![Random Forest](./graph/random_forest.png)
* logistic regression
```console
Rscript pcaGlm.R (threshold)
```

![Logistic Regression](./graph/logistic_regression.jpeg)
* cnn


* Null model: predicts all 1
* data split

![Data Split](./graph/data_split.png)
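
As a rough illustration of the data split, the sketch below shuffles row indices and cuts them into named parts. The part names `train` and `validate` follow the names used in the R code; the `test` part, the fractions, and the helper `split_rows` are assumptions made for this example, since the actual proportions are only shown in the figure.

```python
import random


def split_rows(n_rows, fractions, seed=0):
    """Shuffle row indices and cut them into named, non-overlapping parts."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    names = list(fractions)
    parts, start = {}, 0
    for i, name in enumerate(names):
        # The last part takes all remaining rows so that none are lost to rounding.
        if i == len(names) - 1:
            end = n_rows
        else:
            end = start + round(n_rows * fractions[name])
        parts[name] = indices[start:end]
        start = end
    return parts


# Hypothetical fractions, for illustration only.
parts = split_rows(1000, {"train": 0.6, "validate": 0.2, "test": 0.2})
print({name: len(rows) for name, rows in parts.items()})
```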


* Use SMOTE to generate additional training data

![smote](./graph/original.png)
![smote](./graph/oversample.png)
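
For readers unfamiliar with SMOTE, the following is a simplified sketch of the idea, not the implementation used in this project: each synthetic point is placed at a random position on the segment between a minority-class point and one of its `k` nearest minority-class neighbours. The function name, parameters, and toy points are hypothetical.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating minority points with their neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of base (excluding base itself), by Euclidean distance.
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Hypothetical minority-class points with two features each.
minority_points = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote(minority_points, n_new=6, k=2)
print(len(new_points))  # 6
```

Synthetic points should be generated from the training split only, so that no information from the validation rows leaks into training.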

### results

* Which metric do you use
* precision, recall, R-square
* Is your improvement significant?
* What is the challenge part of your project?
* Use precision, accuracy, and recall to evaluate the models
* improvement
1. Run PCA and use the first 40 principal components
2. Use k-fold validation to select the model with the highest recall
3. Tune hyperparameters such as threshold and max depth
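
Steps 2 and 3 above can be sketched as follows. This is a minimal outline under stated assumptions: rows are already shuffled, folds are contiguous, and `recall_on_fold` is a hypothetical callback that fits a model for one hyperparameter value and returns its validation recall. It is not the project's own selection code.

```python
def k_fold_indices(n_rows, k):
    """Yield (train_indices, validation_indices) for k contiguous folds."""
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_rows))
        yield train, val
        start += size


def select_by_mean_recall(candidates, n_rows, k, recall_on_fold):
    """Return the candidate with the highest mean validation recall over k folds.

    recall_on_fold(candidate, train_idx, val_idx) is a hypothetical callback that
    fits a model on the training rows and returns its recall on the validation rows.
    """
    def mean_recall(candidate):
        scores = [recall_on_fold(candidate, tr, va)
                  for tr, va in k_fold_indices(n_rows, k)]
        return sum(scores) / len(scores)

    return max(candidates, key=mean_recall)


# Stand-in callback with made-up recall values per max depth, for illustration only.
fake_recall = {5: 0.4, 10: 0.7, 20: 0.6}
best_depth = select_by_mean_recall(
    [5, 10, 20], n_rows=10, k=3,
    recall_on_fold=lambda depth, tr, va: fake_recall[depth],
)
print(best_depth)  # 10
```

With classes this imbalanced, stratified folds (keeping the share of bankrupt companies similar in every fold) may be worth considering, since a fold without any positive example has an undefined recall.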

* decision tree (see the code folder for details)
```r
library(rpart)  # rpart(), rpart.control()
library(dplyr)  # filter()

# Fit a classification tree on the training split.
model <- rpart(Bankrupt. ~ .,
               data = res[["train"]],
               control = rpart.control(maxdepth = 20),
               method = "class")

# Compare the true labels with the predictions on the training data.
trainframe <- data.frame(truth = res[["train"]]$Bankrupt.,
                         pred = predict(model, type = "class"))

# Confusion-matrix counts (1 = bankrupt).
TP.train <- nrow(filter(trainframe, truth == pred, truth == 1))
TN.train <- nrow(filter(trainframe, truth == pred, truth == 0))
FP.train <- nrow(filter(trainframe, truth != pred, truth == 0))
FN.train <- nrow(filter(trainframe, truth != pred, truth == 1))

accuracy.train <- (TP.train + TN.train) / nrow(trainframe)
fallback.train <- TP.train / (TP.train + FN.train)  # recall
precision.train <- TP.train / (TP.train + FP.train)
NegativePrecision.train <- TN.train / (TN.train + FN.train)  # negative predictive value
```

![decision tree](./graph/DecisionTree-ConfusionMatrix.png) ![decision tree](./graph/DecisionTree-PCA-ConfusionMatrix.png)

* random forest (see the code folder for details)

```r
library(randomForest)

# `data`, `spec` (a named vector of split fractions) and `tree` (the number of
# trees, taken from the command line) are defined elsewhere in the script.
# Randomly assign every row to one of the splits named in `spec`.
g = sample(cut(
  seq(nrow(data)),
  nrow(data)*cumsum(c(0,spec)),
  labels = names(spec)
))

# final data
res = split(data, g)

# train pca
in_d <- res[["train"]]
in_d = in_d[,!colnames(in_d) %in% c('Net.Income.Flag','Bankrupt.')]
pca <- prcomp(in_d, center=TRUE, scale=TRUE)

#--- proportion of variance explained by each component
std_dev <- pca$sdev
pr_var <- std_dev^2
prop_varex <- pr_var/sum(pr_var)
#plot(prop_varex, type = 'lines')

#--- build train data with Bankrupt. and the top 40 components
train.data <- data.frame(Bankrupt. = res[["train"]]$Bankrupt., pca$x)
train.data <- train.data[,1:41]
# NOTE: columns 2:40 select PC1..PC39; use 2:41 to include all 40 components.
train.data.var <- colnames(train.data[,2:40])

#--- project the validation data onto the principal components
val.data <- predict(pca, newdata = res[["validate"]])

model <- randomForest(x = train.data[,train.data.var], y = as.factor(train.data$Bankrupt.),
                      ntree = as.integer(tree), importance = T)

val <- data.frame(truth = res[["validate"]]$Bankrupt.,
                  prediction = predict(model, val.data))
#val <- mutate(val, result = ifelse(prediction > 0.5, 1, 0))

# confusion matrix of validation
cm <- table(val)
```

![random forest](./graph/RandomForest-ConfusionMatrix.png) ![random forest](./graph/RandomForest-PCA-ConfusionMatrix.png)

* logistic regression (see the code folder for details)
```r
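# NOTE: link = 'probit' fits a probit model; use link = 'logit' for logistic regression.
model <- glm(formula = Bankrupt. ~ . ,
             family = binomial(link='probit'),
             epsilon = 1e-14,
             data = train.data)

# val data: project onto the principal components and keep the first 40
val.data <- predict(pca, newdata = res[["validate"]])
val.data <- as.data.frame(val.data)
val.data <- val.data[,1:40]

# data frame of val
# NOTE: predict() returns values on the link scale by default; pass
# type = "response" if `ts` is meant to be a probability threshold.
val <- data.frame(truth = res[["validate"]]$Bankrupt.,
                  prediction = predict(model, val.data))
val <- mutate(val, pred = ifelse(prediction > ts, 1, 0))

# confusion matrix of validation (columns: truth, pred)
cm <- table(val[,c(1,3)])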
```

![logistic regression](./graph/LogisticRegression-ConfusionMatrix.png) ![logistic regression](./graph/LogisticRegression-PCA-ConfusionMatrix.png)
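
The validation code above turns a score into a class with `prediction > ts`, so the threshold directly trades precision against recall. The sketch below shows that effect on hypothetical scores and labels; `precision_recall` is an illustrative helper, and the numbers are made up for the example.

```python
def precision_recall(truth, scores, threshold):
    """Precision and recall when every score above the threshold is labelled 1."""
    tp = fp = fn = 0
    for t, s in zip(truth, scores):
        pred = 1 if s > threshold else 0
        if pred == 1 and t == 1:
            tp += 1
        elif pred == 1 and t == 0:
            fp += 1
        elif pred == 0 and t == 1:
            fn += 1
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical validation labels and predicted scores.
truth = [0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.05, 0.10, 0.20, 0.35, 0.40, 0.60, 0.70, 0.90]

for ts in (0.3, 0.5):
    precision, recall = precision_recall(truth, scores, ts)
    print(f"ts={ts}: precision={precision:.2f}, recall={recall:.2f}")
# ts=0.3: precision=0.60, recall=1.00
# ts=0.5: precision=0.67, recall=0.67
```

Lowering the threshold catches every bankrupt company in this toy example, at the cost of more false alarms.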
* cnn (see the code folder for details)
```python
import torch.nn as nn


# Fully connected network with two hidden layers (labelled "cnn" in this README).
class BinaryClassification(nn.Module):
    def __init__(self):
        super(BinaryClassification, self).__init__()
        # Number of input features is 95.
        self.layer_1 = nn.Linear(95, 200)
        self.layer_2 = nn.Linear(200, 64)
        self.layer_out = nn.Linear(64, 1)

        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(p=0.5)
        self.batchnorm1 = nn.BatchNorm1d(200)
        self.batchnorm2 = nn.BatchNorm1d(64)

    def forward(self, inputs):
        x = self.relu(self.layer_1(inputs))
        x = self.batchnorm1(x)
        x = self.relu(self.layer_2(x))
        x = self.batchnorm2(x)
        x = self.dropout(x)
        # Raw score (logit); no sigmoid is applied here.
        x = self.layer_out(x)

        return x
```

![cnn](./graph/cnn-ConfusionMatrix.png)
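
The network's `forward` returns a raw score from `layer_out` without a sigmoid. Assuming it was trained with a logit-based loss such as `nn.BCEWithLogitsLoss` (an assumption, the training loop is not shown here), a prediction would be obtained by applying a sigmoid and a threshold. The helpers below are a plain-Python sketch of that last step, with hypothetical names and a default threshold of 0.5.

```python
import math


def sigmoid(logit):
    """Numerically stable logistic function mapping a raw score to a probability."""
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    z = math.exp(logit)
    return z / (1.0 + z)


def predict_label(logit, threshold=0.5):
    """Label a company as bankrupt (1) when its predicted probability exceeds the threshold."""
    return 1 if sigmoid(logit) > threshold else 0


print(sigmoid(0.0))         # 0.5
print(predict_label(2.0))   # 1
print(predict_label(-2.0))  # 0
```

Choosing a threshold below 0.5 would favour recall over precision, in line with the project goal.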

* Challenges
* Imbalanced training data (most companies did not go bankrupt)
* tuning of hyperparameters
* testing of different epoch
* choose between precision and recall


## References
* Code/implementation which you include/reference (__You should indicate in your presentation if you use code for others. Otherwise, cheating will result in 0 score for final project.__)
* Packages you use
* Related publications
* https://www.kaggle.com/jerryfang5/bankrutcy-prediciton-by-r/notebook
* https://www.kaggle.com/seongwonr/bankruptcy-prediction-with-smote
* https://colab.research.google.com/drive/12wXAyrbX8Ji5J6CNAEIQwtDOaxy8BCIO?usp=sharing
* https://towardsdatascience.com/deep-learning-using-pytorch-for-tabular-data-c68017d8b480
Empty file removed code/empty
Empty file.