
[Group 3] Sberbank Russian Housing Market

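Group members

  • 柯敦瀚, Statistics, 4th-year undergraduate, 107304050
  • 陳羿丞, Computer Science, 1st-year master's, 110753138
  • 蘇俊憲, Computer Science, 1st-year master's, 110753158
  • 朱進益, Computer Science, 1st-year master's, 110753144
  • 洪丞桀, Computer Science, 3rd-year undergraduate, 108703045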

Goal

Sberbank, Russia’s oldest and largest bank, helps its customers by predicting realty prices, so that renters, developers, and lenders can be more confident when they sign a lease or purchase a building.

Given the dataset Sberbank provides, our goal is to predict the price of each property.

Demo

To reproduce our results, run:

Rscript code/DSfinal_xgb_fix.R

Folder organization and its related information

docs

  • 1101_datascience_FP_group3.pptx
  • 1101_datascience_FP_group3.pdf

data

  • Source: the Kaggle competition "Sberbank Russian Housing Market"

  • Input format:

    • Three .csv files:

      • train.csv
      • test.csv
      • macro.csv
    • Features of dataset:

      Our prediction target is the "price_doc" variable. The dataset is large: it contains more than 30,000 records and nearly 300 variables, including house features such as area, floor, and location.

      It also provides environmental data, such as nearby green space, nearby coffee shops, and the number of roads near the property. A separate file, "macro.csv", provides macroeconomic data for Russia, such as GDP and labor income.

      We found that many of the variables contributed little. We tried several ways of filling in the missing values, but dropping the affected data turned out to work better.

  • Any preprocessing?

    • Handle missing data
      • Remove variables whose missing rate is greater than 35%
      • Replace incorrect or implausible values using common sense
      • Remove variables that have a low correlation coefficient with the target
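The two filtering steps above can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the code in code/DSfinal_xgb_fix.R: the toy data frame, the helper names `drop_high_missing` and `drop_low_correlation`, and the correlation cutoff of 0.1 are all hypothetical (only the 35% missing-rate threshold comes from the list above).

```r
# Toy data standing in for train.csv. Every column except price_doc is made up.
n <- 100
toy <- data.frame(
  full_sq   = seq(30, 120, length.out = n),   # strongly related to the target
  noise     = rep(c(-1, 1), length.out = n),  # nearly unrelated to the target
  mostly_na = c(rep(NA, 60), seq_len(40))     # 60% missing
)
toy$price_doc <- 1e5 * toy$full_sq + 5e4 * sin(seq_len(n))

# Step 1: drop variables whose share of missing values exceeds the threshold.
drop_high_missing <- function(df, max_missing = 0.35) {
  missing_rate <- colMeans(is.na(df))
  df[, missing_rate <= max_missing, drop = FALSE]
}

# Step 2: drop numeric predictors weakly correlated with the target.
# min_abs_cor = 0.1 is an assumed cutoff; the README does not state one.
drop_low_correlation <- function(df, target, min_abs_cor = 0.1) {
  predictors <- setdiff(names(df), target)
  is_num <- vapply(df[predictors], is.numeric, logical(1))
  cors <- vapply(
    predictors[is_num],
    function(v) cor(df[[v]], df[[target]], use = "pairwise.complete.obs"),
    numeric(1)
  )
  weak <- names(cors)[is.na(cors) | abs(cors) < min_abs_cor]
  df[, setdiff(names(df), weak), drop = FALSE]
}

cleaned <- drop_low_correlation(drop_high_missing(toy), target = "price_doc")
names(cleaned)
```

On this toy data, `mostly_na` is removed by the missing-rate filter and `noise` by the correlation filter, leaving `full_sq` and `price_doc`.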

code

  • Which methods do you use?
    • Random forest (randomForest)
    • XGBoost (xgboost) ✓ (the best)
  • What is the null model for comparison?
    • Random forest and XGBoost without model tuning
  • How do you perform evaluation, e.g. cross-validation or an additional independent data set?
    • k-fold cross-validation
    • An additional independent data set (the test dataset)
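As an illustration of the k-fold procedure, here is a minimal base-R sketch. It makes several assumptions that are not taken from the project: the data are synthetic, k is set to 5, the helper `k_fold_rmse` is hypothetical, and a plain linear model (`lm`) stands in for the random forest and XGBoost models so the example runs without extra packages.

```r
set.seed(42)  # fold assignment is random; the seed value is arbitrary

# Synthetic stand-in for the training data.
n <- 200
toy <- data.frame(full_sq = runif(n, 30, 120))
toy$price_doc <- 1e5 * toy$full_sq + rnorm(n, sd = 2e5)

# Returns one validation RMSE per fold.
k_fold_rmse <- function(df, formula, k = 5) {
  # Assign every row to one of k folds of (almost) equal size.
  folds <- sample(rep(seq_len(k), length.out = nrow(df)))
  target <- all.vars(formula)[1]
  vapply(seq_len(k), function(i) {
    train <- df[folds != i, , drop = FALSE]
    valid <- df[folds == i, , drop = FALSE]
    fit <- lm(formula, data = train)       # stand-in for the real model
    pred <- predict(fit, newdata = valid)
    sqrt(mean((valid[[target]] - pred)^2))
  }, numeric(1))
}

fold_rmse <- k_fold_rmse(toy, price_doc ~ full_sq, k = 5)
mean(fold_rmse)  # cross-validated RMSE estimate
```

Each row is used for validation exactly once, so the mean of the per-fold errors estimates how the model would perform on unseen data before the held-out test set is touched.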

results (needs updating)

  • Which metrics do you use?
    • RMSE, RMSLE
  • Is your improvement significant?
    • About a 1% improvement
  • What is the challenging part of your project?
    • Choosing a suitable model for this dataset
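For reference, the two metrics can be written in base R as below. This is a sketch with made-up numbers; the function names are hypothetical, and RMSLE is assumed to use the common log(1 + x) form, which the README does not spell out.

```r
# Root mean squared error.
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# Root mean squared logarithmic error, using log(1 + x).
# Defined only for non-negative values, which holds for prices.
rmsle <- function(actual, predicted) {
  stopifnot(all(actual >= 0), all(predicted >= 0))
  sqrt(mean((log1p(predicted) - log1p(actual))^2))
}

# Made-up values, for illustration only.
actual    <- c(100, 200, 300)
predicted <- c(110, 190, 330)

rmse(actual, predicted)
rmsle(actual, predicted)
```

RMSE penalizes absolute errors, so expensive properties dominate it, while RMSLE penalizes relative errors and treats cheap and expensive properties more evenly.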

References

library(randomForest)
library(xgboost)
library(ggplot2) # Data visualization
library(readr) # CSV file I/O, e.g. the read_csv function
library(caret)
library(dummies)
library(vegan)
library(DMwR)
library(data.table)
library(lubridate)
library(methods)
library(tidyverse)
library(scales)
library(corrplot)
library(DT)

About

finalproject-finalproject_group3 created by GitHub Classroom
