This repository includes scripts and support files for 2019 IRIS-NCSES record linkage.
This script requires the additional installation of at least one package (unidecode). Execute the first command below to ensure that all requirements for the main script are present, then the second to run the script:
python -m pip install --user -r requirements.txt
python NCSES_clean_names.py
This code cleans and normalizes name fields, month, and year of birth. Key steps:
- Create nickname lookup from nickname CSV (NICKNAME_FILENAME).
- Pull the source data input (INPUT_FILENAME).
- Clean and normalize each field.
- Apply nickname lookup function to assign a first name group from first given name.
- Output to a ready-to-hash CSV (OUTPUT_FILENAME).
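The cleaning and nickname steps above can be sketched roughly as follows. This is a minimal illustration, not the script's actual code: the function names are hypothetical, the two-column nickname CSV layout is an assumption, and the standard-library `unicodedata` module is swapped in for unidecode so the example runs without extra packages.

```python
import csv
import io
import unicodedata


def clean_name(raw):
    """Lowercase, strip accents, and keep letters only (hypothetical helper)."""
    # NFKD splits accented letters into a base letter plus a combining mark;
    # encoding to ASCII with errors="ignore" then drops the marks.
    decomposed = unicodedata.normalize("NFKD", raw)
    ascii_text = decomposed.encode("ascii", "ignore").decode("ascii")
    return "".join(ch for ch in ascii_text.lower() if ch.isalpha())


def load_nicknames(csv_file):
    """Build a {nickname: name group} dict, assuming two-column rows."""
    return {clean_name(nick): clean_name(group)
            for nick, group in csv.reader(csv_file)}


def nickname_group(first_word, lookup):
    """Return the name group, falling back to the cleaned word itself."""
    return lookup.get(first_word, first_word)


# Made-up nickname file contents, for illustration only.
lookup = load_nicknames(io.StringIO("kit,christopher\nchris,christopher\n"))
print(clean_name("Clarke"))              # clarke
print(nickname_group("kit", lookup))     # christopher
print(nickname_group("emilia", lookup))  # emilia
```

Falling back to the cleaned first word when no nickname entry exists matches the worked example in this README, where emilia maps to emilia.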
INPUT_FILENAME and OUTPUT_FILENAME should be customized as needed.
- They can be relative (sourcenames.csv, ./input/rawdata.csv) or absolute (C:/data/raw.csv).
- Use forward slashes / in filenames, not backslash \. Windows natively handles either.
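As a small illustration of the filename advice, the sketch below sets the two constants to made-up values and shows one likely reason to prefer forward slashes: in an ordinary Python string literal a backslash can start an escape sequence. The paths are hypothetical, and the README itself does not state this as the reason.

```python
from pathlib import PureWindowsPath

# Hypothetical values; substitute your own file locations.
INPUT_FILENAME = "./input/rawdata.csv"   # relative path
OUTPUT_FILENAME = "C:/data/cleaned.csv"  # absolute path

# Windows path handling treats both separators the same way.
same = PureWindowsPath("C:/data/raw.csv") == PureWindowsPath("C:\\data\\raw.csv")
print(same)  # True

# In a plain string literal, "\r" is a carriage return, so this
# backslash path silently contains a control character.
risky = "input\rawdata.csv"
print("\r" in risky)  # True
```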
Other constants are fixed configurations that should not be changed independently.
The INPUT_FIELDS variable specifies the following fields that must be in the source name CSV:
- name_first_middle - concatenation of all given names: first(s) and/or middle(s)
- name_last - last name as provided by source
- mob - month of birth
- yob - year of birth
All other fields in the source CSV (e.g. IDs) will be passed directly to the cleaned CSV.
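A rough sketch of how the required-field check and the passthrough of extra columns could work is shown below. The list form of `INPUT_FIELDS`, the helper `split_header`, and the `person_id` column are assumptions made for illustration; the script's own implementation may differ.

```python
import csv
import io

# Assumed to be a simple list of the four required column names.
INPUT_FIELDS = ["name_first_middle", "name_last", "mob", "yob"]


def split_header(header):
    """Return (missing required fields, passthrough fields) for a CSV header."""
    missing = [f for f in INPUT_FIELDS if f not in header]
    passthrough = [f for f in header if f not in INPUT_FIELDS]
    return missing, passthrough


# Made-up source file with one extra ID column.
source = io.StringIO(
    "person_id,name_first_middle,name_last,mob,yob\n"
    "17,Kit,Harington,??,1986\n"
)
reader = csv.DictReader(source)
missing, passthrough = split_header(reader.fieldnames)
print(missing)      # []
print(passthrough)  # ['person_id']
```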
The script writes the following outgoing fields, which the OUTPUT_FIELDS variable helps validate:
- cleaned versions of each input field, with new names for each field: given, family, month, year
- complete concatenated given + family: complete
- name group assigned from the first word of first name: given_nickname
- given name trio that breaks first/middle after the first word: given_first_word, given_middle_initial, given_all_but_first
- given name trio that breaks first/middle before the last word: given_all_but_final, given_final_initial, given_final_word
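Input:

| Field | Record 1 | Record 2 |
| --- | --- | --- |
| name_first_middle | Emilia Isobel Euphemia Rose | Kit |
| name_last | Clarke | Harington |
| mob | 10 | ?? |
| yob | 1986 | 1986 |

Output:

| Field | Record 1 | Record 2 |
| --- | --- | --- |
| given | emiliaisobeleuphemiarose | kit |
| family | clarke | harington |
| month | 10 | |
| year | 1986 | 1986 |
| complete | emiliaisobeleuphemiaroseclarke | kitharington |
| given_nickname | emilia | christopher |
| given_first_word | emilia | kit |
| given_middle_initial | i | |
| given_all_but_first | isobeleuphemiarose | |
| given_all_but_final | emiliaisobeleuphemia | |
| given_final_initial | r | |
| given_final_word | rose | |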
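
The given-name fields in the example can be reproduced with a short sketch like the one below. The function `split_given` is hypothetical, the cleaning is simplified (no accent handling), and leaving the final-word trio blank for single-word names is inferred from the Kit example rather than confirmed by the script.

```python
def split_given(name_first_middle):
    """Derive the given-name output fields from the raw first/middle string."""
    # Simplified cleaning: lowercase each word and keep letters only.
    words = ["".join(ch for ch in w.lower() if ch.isalpha())
             for w in name_first_middle.split()]
    words = [w for w in words if w]
    rest = words[1:]  # everything after the first word
    return {
        "given": "".join(words),
        "given_first_word": words[0] if words else "",
        "given_middle_initial": rest[0][0] if rest else "",
        "given_all_but_first": "".join(rest),
        "given_all_but_final": "".join(words[:-1]),
        # Assumption: the final-word fields stay blank for one-word names.
        "given_final_initial": words[-1][0] if rest else "",
        "given_final_word": words[-1] if rest else "",
    }


fields = split_given("Emilia Isobel Euphemia Rose")
print(fields["given_all_but_first"])  # isobeleuphemiarose
print(fields["given_all_but_final"])  # emiliaisobeleuphemia
print(split_given("Kit")["given_first_word"])  # kit
```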