The purpose of this issue is to create a feature engineering script that may be run repeatedly on the occupancy permit dataset, as new entries or files (by year) are added.
This issue was originally posted in the dc_doh_hackathon repository, which can be found here:
issue_10
Start with the Occupancy Permit data in the /Data Sets/Occupancy Permits/ folder in Dropbox.
Write a script that uses this data to produce a feature data table for the number of new occupancy permits issued in the last 4 weeks.
You can find the data format and examples on the Feature Dataset Format tab in this document.
Input:
CSV files with data for each given year
Output:
A script that produces a CSV file with the below format:
- 1 row for each occupancy permit type, and each week, year, and census block
- The dataset should include the following columns:
feature_id: The ID for the feature, in this case, "occupancy_permits_issued_last_4_weeks"
feature_type: Occupancy permit type, found in the EVENTTYPESCODEDESC column of the source data
feature_subtype: Left blank
year: The ISO-8601 year of the feature value
week: The ISO-8601 week number of the feature value
census_block_2010: The 2010 Census Block of the feature value
value: The value of the feature, i.e. the number of new occupancy permits of the specified types issued in the given census block during the previous 4 weeks starting from the year and week above.
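One possible reading of the `value` definition can be sketched as follows. This is only a minimal sketch under stated assumptions: it treats the 4-week window as the given ISO week plus the three weeks before it, it assumes each permit has already been assigned a census block, and it emits rows only where the count is non-zero. The function names `iso_week_start` and `four_week_counts` and the input tuple layout are hypothetical, not part of the source data.

```python
from collections import Counter
from datetime import date, timedelta

FEATURE_ID = "occupancy_permits_issued_last_4_weeks"


def iso_week_start(d):
    """Return the Monday of the ISO-8601 week that contains date d."""
    return d - timedelta(days=d.isoweekday() - 1)


def four_week_counts(permits, window_weeks=4):
    """Count permits per (type, block) over a trailing window of ISO weeks.

    permits: iterable of (permit_type, issue_date, census_block) tuples
             (hypothetical layout; the block is assumed to be known already).
    Assumption: the window is the given week plus the preceding
    window_weeks - 1 weeks.
    """
    # Step 1: permits issued in each individual ISO week.
    weekly = Counter()
    for permit_type, issued, block in permits:
        weekly[(permit_type, block, iso_week_start(issued))] += 1

    # Step 2: a permit issued in week w falls inside the trailing window
    # of weeks w, w+1, ..., w+window_weeks-1, so add it to each of them.
    rolling = Counter()
    for (permit_type, block, monday), n in weekly.items():
        for offset in range(window_weeks):
            rolling[(permit_type, block, monday + timedelta(weeks=offset))] += n

    # Step 3: emit one row per key in the feature dataset format.
    rows = []
    for (permit_type, block, monday), n in sorted(rolling.items()):
        iso_year, iso_week, _ = monday.isocalendar()
        rows.append({
            "feature_id": FEATURE_ID,
            "feature_type": permit_type,
            "feature_subtype": "",
            "year": iso_year,
            "week": iso_week,
            "census_block_2010": block,
            "value": n,
        })
    return rows
```

Working from the Monday of each week and converting back with `isocalendar()` keeps windows that cross a year boundary consistent with ISO-8601 numbering. If rows with a value of zero are also required, the week/block/type grid would need to be filled in separately.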
The final script must be able to be run from the command line taking three arguments:
- A folder with the occupancy permit data files (the script should concatenate and merge the files in the directory as appropriate)
- The shapefile for census blocks
- The output CSV filename
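The three-argument interface and the file concatenation step might look roughly like the sketch below. It uses only the standard library; the argument names, `parse_args`, and `read_permit_rows` are hypothetical, and it assumes the yearly files are UTF-8 CSVs with a header row. Reading the shapefile and assigning census blocks is deliberately left out, since that depends on whichever geospatial library is chosen.

```python
import argparse
import csv
from pathlib import Path


def parse_args(argv=None):
    """Parse the three positional arguments described in the issue."""
    parser = argparse.ArgumentParser(
        description="Build the occupancy permit feature table."
    )
    parser.add_argument(
        "permit_dir", help="folder containing the yearly occupancy permit CSV files"
    )
    parser.add_argument(
        "census_shapefile", help="path to the census block shapefile"
    )
    parser.add_argument(
        "output_csv", help="filename for the output feature CSV"
    )
    return parser.parse_args(argv)


def read_permit_rows(permit_dir):
    """Concatenate the rows of every *.csv file in permit_dir.

    Assumption: each file has a header row; rows are returned as dicts
    keyed by column name, in sorted filename order.
    """
    rows = []
    for path in sorted(Path(permit_dir).glob("*.csv")):
        with path.open(newline="", encoding="utf-8") as handle:
            rows.extend(csv.DictReader(handle))
    return rows
```

A script built this way would be invoked along the lines of `python <script>.py <permit folder> <census block shapefile> <output.csv>`, with `parse_args()` called from the script's entry point.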
Please also provide a README.md that describes the script and how to run it.
You can model the solution for the command line modifications after the files here or here.
Place all of your files in the codefordc/the-rat-hack repository under a new scripts/feature_engineering/extract_occupancy_permit_features/ folder
**Hints:**
The solution to Hackathon issue_3 may provide some helpful inspiration for the data cleaning steps.