The column map will be more complicated with the need to ingest two slightly different flat files (_flat_file.csv and _reference_panel.csv) as discussed in #161 (comment). I also found myself constantly toggling back and forth between the separate column_map.tsv and the upload script to figure out how the columns are being used, so it makes more sense to just hard-code the column map in the script.
Update column map based on `0906.xlsx_H1_flat_file.csv` in comparison to the matching Excel file `20240906\ H1N1.xlsx` available on VIDRL's OneDrive.
Avoid pandas typing issues by just using the Python csv module to read and write the flat files. Mimics `augur curate` with independent functions for reading, curating, and writing records.
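A minimal sketch of the read/curate/write split described above, using only the standard `csv` module. The header names in `COLUMN_MAP` and the function names are hypothetical, chosen for illustration; they are not the column names or functions in the actual upload script.

```python
import csv
import io

# Hypothetical column map: flat-file header -> standardized field name.
# The real headers in the VIDRL flat files may differ.
COLUMN_MAP = {
    "Test Virus": "virus_strain",
    "Reference Virus": "serum_strain",
    "Titer": "titer",
}


def read_records(handle):
    """Yield one dict per CSV row, keyed by the header row. All values stay strings."""
    yield from csv.DictReader(handle)


def curate_records(records, column_map=COLUMN_MAP):
    """Rename mapped columns, drop unmapped ones, and strip surrounding whitespace."""
    for record in records:
        yield {
            new_name: (record.get(old_name) or "").strip()
            for old_name, new_name in column_map.items()
        }


def write_records(records, handle, fieldnames):
    """Write curated records as CSV with a header row."""
    writer = csv.DictWriter(handle, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for record in records:
        writer.writerow(record)
```

Because each step is a generator over plain dicts, records stream through without any type inference, which is the pandas behavior the commit is trying to avoid.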
Doing this in preparation for processing the flat files that include human sera measurements. The human serum ids will be parsed the same way for the flat files to ensure that we use the same standardized id.
Strip the "pool" suffix from the serum strain name, standardize the egg or cell type, and standardize the serum id. While looking into this change, I discovered that the strain name used for the human sera references in H1 and H3 is the egg vaccine strain regardless of passage annotation. Currently unclear if this is an error in the flat files or if we've misunderstood the passage annotations for human sera data. Once we clear this up, we should add some type of vaccine strain verification so that we can flag mismatches like this automatically.
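As a rough illustration of the first two steps, the sketch below strips a trailing "pool" token and maps a passage annotation onto "egg" or "cell". The exact suffix format and the raw passage values in the flat files are assumptions, and both function names are hypothetical. Serum id standardization is left out because its rules are not described here.

```python
import re

# Assumes the suffix is the word "pool" at the end of the name, separated by
# whitespace, an underscore, or a hyphen. The real flat files may differ.
POOL_SUFFIX = re.compile(r"[\s_-]+pool\s*$", re.IGNORECASE)


def strip_pool_suffix(strain):
    """Return the strain name without a trailing 'pool' token."""
    return POOL_SUFFIX.sub("", strain).strip()


def standardize_passage(raw):
    """Map a free-text egg/cell annotation onto 'egg' or 'cell'; None if unrecognized."""
    value = raw.strip().lower()
    if value in {"e", "egg"}:
        return "egg"
    if value in {"c", "cell"}:
        return "cell"
    return None
```

Requiring a separator before "pool" keeps names that merely end in those letters (a place name, for instance) from being truncated.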
In order to include the "assay_date" in the uploaded data, the VIDRL column needs to be "date" so that it can be parsed within `elife_upload` as "assay_date". This is an ugly workaround, but it's similar to how `cdc_upload` handles the field.¹

¹ <https://github.com/nextstrain/fauna/blob/b133974275ee1ed4e91816c76db6b7616247b6dc/tdb/cdc_upload.py#L58>
Validate records in a single flat file. Ensure that each serum abbreviation maps to a single serum strain and that all records have the same test date. As a side effect, the validated `serum_abbr_map` and `test_date` are returned to be used for processing the reference panel records in the following commits.
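The two checks can be sketched as follows. The record keys (`serum_abbr`, `serum_strain`, `test_date`) and the function name are assumptions for illustration; only the returned `serum_abbr_map` and `test_date` are named in the commit message.

```python
def validate_records(records):
    """Check one flat file's records and return (serum_abbr_map, test_date).

    Raises ValueError if a serum abbreviation maps to more than one serum
    strain, or if the records do not share exactly one test date.
    """
    serum_abbr_map = {}
    test_dates = set()

    for record in records:
        abbr = record["serum_abbr"]
        strain = record["serum_strain"]
        # setdefault stores the first strain seen for this abbreviation
        # and returns the stored value on later encounters.
        known_strain = serum_abbr_map.setdefault(abbr, strain)
        if known_strain != strain:
            raise ValueError(
                f"Serum abbreviation {abbr!r} maps to both "
                f"{known_strain!r} and {strain!r}"
            )
        test_dates.add(record["test_date"])

    if len(test_dates) != 1:
        raise ValueError(f"Expected exactly one test date, found {sorted(test_dates)}")

    return serum_abbr_map, test_dates.pop()
```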
Pull out curation into individual functions that can be shared with the curation of the _reference_panel.csv file.
Ingests the matching "*_reference_panel.csv" for a provided "*_flat_file" fstem if the reference panel file exists. The records parsed from the reference panel file are appended to the same tmp file that is then passed to elife_upload.py. This currently includes "extra" records compared to the Excel files, where the human sera pool strain is the "test virus" against the other references. If I strip the `pool` suffix from the human sera pool strain, the measurements are exact duplicates of the measurements for the matching reference strain. We will need to decide whether or not these records should be dropped.
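One way to locate the matching file is to swap the `_flat_file` ending of the fstem for `_reference_panel.csv` and check that the result exists. This is a sketch under the assumption that both files sit in the same directory; the function names are hypothetical.

```python
from pathlib import Path

FLAT_FILE_SUFFIX = "_flat_file"
REFERENCE_PANEL_SUFFIX = "_reference_panel.csv"


def reference_panel_name(fstem):
    """Return the expected reference panel filename for a '*_flat_file' fstem.

    Returns None if the fstem does not end in '_flat_file'.
    """
    if not fstem.endswith(FLAT_FILE_SUFFIX):
        return None
    return fstem[: -len(FLAT_FILE_SUFFIX)] + REFERENCE_PANEL_SUFFIX


def find_reference_panel(directory, fstem):
    """Return the Path to the matching reference panel file, or None if absent."""
    name = reference_panel_name(fstem)
    if name is None:
        return None
    candidate = Path(directory) / name
    return candidate if candidate.is_file() else None
```

For the fstem `0906.xlsx_H1_flat_file`, the expected name is `0906.xlsx_H1_reference_panel.csv`.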
Based on comment in Slack¹ that the "e" or "c" suffix in the serum ID is not a reliable indicator of human serum passage.

¹ <https://bedfordlab.slack.com/archives/C03KWDET9/p1728430958054989?thread_ts=1699914235.686809&cid=C03KWDET9>
Based on meeting with VIDRL, we should only keep homologous titers for `virus_strain` that includes "pool" suffix. This will act as a proxy homologous titer for the human serum references. All other virus strains that include the "pool" suffix are ignored because they are duplicate data.
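The filtering rule can be sketched as a predicate over records. How homology is decided is an assumption here: the sketch compares the virus and serum strain names after removing the "pool" suffix and ignoring case. The field names `virus_strain` and `serum_strain` and the suffix format are likewise illustrative.

```python
import re

# Assumed suffix format: "pool" at the end, preceded by a separator.
POOL_SUFFIX = re.compile(r"[\s_-]+pool\s*$", re.IGNORECASE)


def base_name(strain):
    """Strain name without the 'pool' suffix, lowercased for comparison."""
    return POOL_SUFFIX.sub("", strain).strip().lower()


def keep_record(record):
    """Return True if the record should be kept.

    Records whose virus strain has no 'pool' suffix are always kept.
    Records whose virus strain has the suffix are kept only when the
    titer is homologous, i.e. virus and serum refer to the same strain.
    """
    virus = record["virus_strain"]
    if not POOL_SUFFIX.search(virus):
        return True
    return base_name(virus) == base_name(record["serum_strain"])
```

Applied as `filter(keep_record, records)`, this drops the duplicate pool-versus-other-reference measurements while retaining the proxy homologous titer.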
Based on meeting with VIDRL, a/b and _1/_2 reference panel files are created from the same Excel file so they are duplicates, while capital A/B files are separate assays. So, this change allows us to check for the a/b and _1/_2 patterns and ignore the reference panel file if it's a duplicate. This means we always ingest the a or _1 file but ignore the b and _2 files.
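A hedged sketch of the duplicate check follows. Where the a/b or _1/_2 marker sits in the real filenames is not stated here, so this assumes a hypothetical "assay label" (for example `0906b` or `0906_2`) that has already been extracted from the filename, with the marker at its end.

```python
import re

# Assumption: the duplicate marker is a lowercase "b" directly after a digit,
# or a literal "_2", at the end of the label. Matching is case-sensitive so
# that capital "B" labels (separate assays) are not treated as duplicates.
DUPLICATE_MARKER = re.compile(r"(?:(?<=\d)b|_2)$")


def is_duplicate_panel(assay_label):
    """Return True if the label marks the second copy (b or _2) of a panel."""
    return DUPLICATE_MARKER.search(assay_label) is not None
```

Under these assumptions, `0906a` and `0906_1` are ingested, `0906b` and `0906_2` are skipped, and `0906B` is treated as its own assay.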
Use the latest flat file column `original designation` to get the original strain name, which has not gone through VIDRL's strain name standardization.
This PR is mainly blocked now on this open TODO:
The issue is that we know of one instance where flat files lacked records that existed in the Excel spreadsheets. It is possible that this missing data issue was related to a change in the way flat files were generated, so I'm going to check recent flat files against what we have ingested from Excel to look for other instances of missing data.
Description of proposed changes
Update ingest of VIDRL flat files for the latest version available via OneDrive.
Example command that I've been running during my testing to upload to the `test_tdb` database.

This automatically ingests both the `_flat_file` and the matching `_reference_panel` file if the `_reference_panel` file exists.

Related issue(s)
Resolves #161
TODOs
- `reference passage` mismatch
- `_reference_panel` file with `pool` suffix in `reference strain`
- `_reference_` files
- `>10240` -> `20480` is okay (Slack thread)