[Group 3] Sberbank Russian Housing Market

Groups

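柯敦瀚, Statistics, 4th year, 107304050
陳羿丞, Computer Science, master's 1st year, 110753138
蘇俊憲, Computer Science, master's 1st year, 110753158
朱進益, Computer Science, master's 1st year, 110753144
洪丞桀, Computer Science, 3rd year, 108703045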

Goal

Sberbank, Russia’s oldest and largest bank, helps its customers by predicting realty prices so that renters, developers, and lenders can be more confident when they sign a lease or purchase a building.

Given the dataset Sberbank provides, our goal is to predict the price of each property.

Demo

Run the following command to reproduce our results:

Rscript code/DSfinal_xgb_fix.R

any on-line visualization

Folder organization and its related information

docs

1101_datascience_FP_group3.pptx
1101_datascience_FP_group3.pdf

data

Source:
- Sberbank Russian Housing Market
Input format:
- Three .csv files:
  - train.csv
  - test.csv
  - macro.csv
- Features of dataset:
  
  Our prediction target is the "price_doc" variable. The dataset is large: it contains more than 30,000 records and nearly 300 variables, including house features such as area, floor, and location.

  It also provides neighborhood data such as green areas, nearby coffee shops, and the number of roads near each property. A separate file, "macro.csv", provides government economic data for Russia, such as GDP and labor income.

  We found that most of the variables were not useful. We tried several ways to fill in the missing values, but dropping those variables turned out to work better.
Any preprocessing?
- Handle missing data
  - Remove variables with a missing rate greater than 35%
  - Replace incorrect or implausible values using common-sense rules
  - Remove variables with a low correlation coefficient
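
The missing-rate and correlation filters listed above can be sketched in base R as follows. This is only an illustration, not the project's actual script: the toy data frame, the helper names `drop_high_missing` and `drop_low_correlation`, the choice to measure correlation against the target, and the 0.1 correlation threshold are all assumptions made for the example. Only the 35% missing-rate cutoff comes from the list above.

```r
# Toy data standing in for train.csv (values are made up for illustration).
df <- data.frame(
  price_doc = c(5.0, 6.1, 7.2, 8.0, 9.1, 10.2),
  full_sq   = c(30, 38, 45, 52, 60, 66),
  mostly_na = c(NA, NA, NA, 1, 2, NA),
  noise     = c(1, 2, 3, 3, 2, 1)
)

# Hypothetical helper: drop columns whose share of NA values exceeds max_missing.
drop_high_missing <- function(data, max_missing = 0.35) {
  missing_rate <- colMeans(is.na(data))
  data[, missing_rate <= max_missing, drop = FALSE]
}

# Hypothetical helper: drop numeric predictors whose absolute Pearson
# correlation with the target is below min_abs_cor (threshold is an assumption).
drop_low_correlation <- function(data, target, min_abs_cor = 0.1) {
  numeric_cols <- names(data)[vapply(data, is.numeric, logical(1))]
  predictors <- setdiff(numeric_cols, target)
  cors <- vapply(predictors, function(col) {
    cor(data[[col]], data[[target]], use = "pairwise.complete.obs")
  }, numeric(1))
  weak <- predictors[is.na(cors) | abs(cors) < min_abs_cor]
  data[, setdiff(names(data), weak), drop = FALSE]
}

cleaned <- drop_low_correlation(drop_high_missing(df), target = "price_doc")
print(names(cleaned))
```

In this toy example `mostly_na` is removed by the missing-rate filter and `noise` by the correlation filter, leaving the target and `full_sq`.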

code

Which method do you use?
- Random forest
- xgboost ✓ (best result)
What is a null model for comparison?
- Random forest and xgboost without model tuning
How do you perform evaluation? i.e., cross-validation, or an additional independent data set
- k-fold cross-validation
- An additional independent data set (the test dataset)
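
As a rough illustration of the k-fold cross-validation mentioned above, the sketch below splits a synthetic data set into k folds and reports the validation RMSE of each fold. A plain `lm` fit is used only as a stand-in so that the example runs with base R; it is not the project's random forest or xgboost model. The data and the function name `k_fold_rmse` are hypothetical.

```r
set.seed(42)

# Synthetic data: one predictor with a noisy linear relation to the response.
n <- 100
toy <- data.frame(x = runif(n, 20, 100))
toy$y <- 0.1 * toy$x + rnorm(n, sd = 0.5)

# Hypothetical helper: returns one validation RMSE per fold.
k_fold_rmse <- function(data, k = 5) {
  # Assign each row to one of k folds at random, with near-equal fold sizes.
  fold_id <- sample(rep(seq_len(k), length.out = nrow(data)))
  vapply(seq_len(k), function(fold) {
    train_part <- data[fold_id != fold, , drop = FALSE]
    valid_part <- data[fold_id == fold, , drop = FALSE]
    fit <- lm(y ~ x, data = train_part)           # stand-in model
    pred <- predict(fit, newdata = valid_part)
    sqrt(mean((valid_part$y - pred)^2))           # RMSE on the held-out fold
  }, numeric(1))
}

fold_rmse <- k_fold_rmse(toy, k = 5)
mean_rmse <- mean(fold_rmse)
print(fold_rmse)
print(mean_rmse)
```

Averaging the per-fold errors gives a single score that can be compared between a tuned model and an untuned baseline.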

results (needs updating)

Which metric do you use
- RMSE, RMSLE
Is your improvement significant?
- About a 1% improvement in accuracy
What is the challenging part of your project?
- Choosing a suitable model for this dataset
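
For reference, the two metrics named above can be written as small base R functions. This is a minimal sketch using the usual definitions: RMSE is the square root of the mean squared error, and RMSLE is the same quantity computed on `log(1 + value)`. The function names and the example numbers are made up for illustration.

```r
# Root mean squared error.
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# Root mean squared logarithmic error; log1p(x) computes log(1 + x).
# Assumes all values are greater than -1 (prices are non-negative).
rmsle <- function(actual, predicted) {
  sqrt(mean((log1p(predicted) - log1p(actual))^2))
}

# Made-up prices, only to show how the functions are called.
actual    <- c(100, 200, 300)
predicted <- c(110, 190, 300)

print(rmse(actual, predicted))
print(rmsle(actual, predicted))
```

Because RMSLE works on the log scale, it penalizes relative errors rather than absolute ones, which suits price data that spans a wide range.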

References

Code/implementation which you include/reference (You should indicate in your presentation if you use code from others. Otherwise, cheating will result in a score of 0 for the final project.)
Packages you use

library(randomForest)
library(xgboost)
library(ggplot2) # Data visualization
library(readr) # CSV file I/O, e.g. the read_csv function
library(caret)
library(dummies)
library(vegan)
library(DMwR)
library(data.table)
library(lubridate)
library(methods)
library(tidyverse)
library(scales)
library(corrplot)
library(DT)

Related publications
