Commit ed48bd5

feat: path cleanup, moved data folder, and readme updates
feat: path cleanup and readme updates
2 parents 60de935 + 9ac85ce commit ed48bd5

20 files changed

Lines changed: 350 additions & 91 deletions

.gitignore

Lines changed: 16 additions & 3 deletions
@@ -19,6 +19,19 @@ __pycache__
 .env
 
 # data
-src/load_data/saved_data/
-src/fetch_articles/saved_data/downloaded_pmcids.json
-src/fetch_articles/saved_data/articles/
+data/articles/
+data/variantAnnotations/
+data/unique_pmcids.json
+data/pmid_list.json
+data/downloaded_pmcids.json
+
+*.zip
+*.tar.gz
+*.tar.bz2
+*.tar.xz
+*.tar.lzma
+*.tar.lz
+*.tar.lzo
+
+.DS_Store
+

README.MD

Lines changed: 27 additions & 9 deletions
@@ -6,19 +6,37 @@
 
 # AutoGKB
 
-
+Goals:
+1. Fetch annotated articles from variantAnnotations stored in PharmGKB API
+2. Create a general benchmark for an extraction system that can output a score for an extraction system
+Given: Article, Ground Truth Variants (Manually extracted and recorded in var_drug_ann.tsv:)
+Input: Extracted Variants
+Output: Score
+3. System for extracting drug related variants annotations from an article. Associations in which the variant affects a drug dose, response, metabolism, etc.
+4. Continously fetch new pharmacogenomic articles
 
 ## Description
 
 This repository contains Python scripts for running and building a Pharmacogenomic Agentic system to annotate and label genetic variants based on their phenotypical associations from journal articles.
 
 
 ## Progress Tracker
-| Task | Status |
-| --- | --- |
-| Download the zip of variants from pharmgkb ||
-| Get a PMID list from the variants tsv (column PMID) ||
-| Convert the PMID to PMCID ||
-| Update to use non-official pmid to pmcid | |
-| Fetch the content from the PMCID | |
-| Create pairing of annotations to article | |
+| Category | Task | Status |
+| --- | --- | --- |
+| Initial Download | Download the zip of variants from pharmgkb ||
+| | Get a PMID list from the variants tsv (column PMID) ||
+| | Convert the PMID to PMCID ||
+| | Update to use non-official pmid to pmcid (aaron's method) | |
+| | Fetch the content from the PMCID ||
+| Benchmark | Create pairings of annotations to articles | |
+| | Create a niave score of number of matches | |
+| | Create group wise score | |
+| | Look into advanced scoring based on distance from truth per term | |
+| Workflows | Integrate Aaron's current approach | |
+| | Document on individual annotation meanings | |
+| | Delegate annotation groupings to team members | |
+| New Article Fetching | Replicate PharGKB current workflow | |
+
+## System Overview
+![Annotations Diagram](assets/annotations_diagram.svg)
+
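
Reviewer note: the tracker step "Get a PMID list from the variants tsv (column PMID)" could look roughly like the sketch below. This is not the repository's code. The function name `extract_pmids` and the sample rows are hypothetical; only the `PMID` column name and the tab-separated format come from the README text.

```python
import csv
import io


def extract_pmids(tsv_text: str, column: str = "PMID") -> list[str]:
    """Return the unique, non-empty values of `column`, in first-seen order."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    seen: dict[str, None] = {}  # dict keeps insertion order and removes duplicates
    for row in reader:
        pmid = (row.get(column) or "").strip()
        if pmid:
            seen.setdefault(pmid, None)
    return list(seen)


# Made-up rows for illustration; the real input would be var_drug_ann.tsv.
sample = "Variant Annotation ID\tPMID\n1\t111\n2\t222\n3\t111\n4\t\n"
pmids = extract_pmids(sample)
print(pmids)
```

Rows with an empty PMID are skipped, and duplicates collapse to their first occurrence, which matches the idea of a "unique" list without assuming any particular ordering in the source file.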

assets/annotations_diagram.svg

Lines changed: 1 addition & 0 deletions

data/README.md

Lines changed: 17 additions & 0 deletions
@@ -0,0 +1,17 @@
+# Data
+
+This directory contains the primary data files used by the AutoGKB project.
+
+## Directory Structure
+
+- **articles/** - Contains XML files of articles from PubMed Central (PMC), identified by their PMCID (e.g., PMC1234567.xml). These articles are used for text mining and information extraction.
+
+- **variantAnnotations/** - Contains clinical variant annotations and related data:
+- `var_drug_ann.tsv` - Variant-drug annotations. This is what is used in this repo.
+- This can be downloaded using download_and_extract_variant_annotations from the load_variants module
+
+- **Support Files**:
+- `pmcid_mapping.json` - Maps between PMIDs and PMCIDs
+- `unique_pmcids.json` - List of unique PMCIDs in the dataset
+- `pmid_list.json` - List of PMIDs in the dataset
+- `downloaded_pmcids.json` - Tracking which PMCIDs have been downloaded
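
Reviewer note: one way a tracking file such as `downloaded_pmcids.json` could be rebuilt from the contents of `articles/` is sketched below. The commit does not show how `update_downloaded_pmcids` works, so the helper names and the JSON layout (a flat, sorted list of PMCIDs) are assumptions made for illustration only.

```python
import json
import tempfile
from pathlib import Path


def list_downloaded_pmcids(articles_dir: Path) -> list[str]:
    """Collect PMCIDs from files named like PMC1234567.xml, sorted."""
    return sorted(p.stem for p in articles_dir.glob("PMC*.xml"))


def write_tracking_file(articles_dir: Path, out_path: Path) -> list[str]:
    """Write the PMCID list as JSON (assumed layout) and return it."""
    pmcids = list_downloaded_pmcids(articles_dir)
    out_path.write_text(json.dumps(pmcids, indent=2))
    return pmcids


# Demonstration in a throwaway directory, so no real data is touched.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    articles = root / "articles"
    articles.mkdir()
    (articles / "PMC1234567.xml").write_text("<article/>")
    (articles / "notes.txt").write_text("ignored: not an article file")
    tracked = write_tracking_file(articles, root / "downloaded_pmcids.json")
    reloaded = json.loads((root / "downloaded_pmcids.json").read_text())
```

Deriving the list from the directory on each run keeps the tracking file from drifting out of sync with what is actually on disk.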

src/fetch_articles/saved_data/pmcid_mapping.json renamed to data/pmcid_mapping.json

File renamed without changes.

pixi.toml

Lines changed: 3 additions & 1 deletion
@@ -12,7 +12,9 @@ platforms = ["osx-arm64"]
 version = "0.1.0"
 
 [tasks]
-update-downloaded-pmcids = "python -c 'from src.fetch_articles.article_downloader import update_downloaded_pmcids; update_downloaded_pmcids()'"
+download-variants = "python -m src.load_variants.load_clinical_variants"
+update-download-map = "python -c 'from src.fetch_articles.article_downloader import update_downloaded_pmcids; update_downloaded_pmcids()'"
+download-articles = "python -m src.fetch_articles.article_downloader"
 
 [dependencies]
 seaborn = ">=0.13.2,<0.14"

src/benchmark/README.md

Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
+# Benchmark
+
+## Functions
+1. Calculate the niave difference between an extracted variant and the ground truth variant on Variant Annotation ID
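
Reviewer note: the naive comparison described above might be sketched as follows, assuming each annotation is a flat dictionary of strings and that rows are paired on `Variant Annotation ID`. The function names and the `Gene` and `Drug` fields are hypothetical placeholders, not columns confirmed by this commit.

```python
def naive_match_score(extracted: dict, truth: dict, ignore: tuple = ()) -> float:
    """Fraction of ground-truth fields (outside `ignore`) matched exactly."""
    fields = [key for key in truth if key not in ignore]
    if not fields:
        return 0.0
    matches = sum(1 for key in fields if extracted.get(key) == truth[key])
    return matches / len(fields)


def score_by_annotation_id(extracted_rows, truth_rows,
                           id_field="Variant Annotation ID"):
    """Pair rows on `id_field` and score each pair; unpaired rows are skipped."""
    truth_by_id = {row[id_field]: row for row in truth_rows}
    scores = {}
    for row in extracted_rows:
        truth = truth_by_id.get(row.get(id_field))
        if truth is not None:
            # The ID is excluded from scoring since it always matches after pairing.
            scores[row[id_field]] = naive_match_score(row, truth, ignore=(id_field,))
    return scores


# Placeholder rows for illustration only.
truth_rows = [{"Variant Annotation ID": "1", "Gene": "CYP2D6", "Drug": "codeine"}]
extracted_rows = [{"Variant Annotation ID": "1", "Gene": "CYP2D6", "Drug": "morphine"}]
scores = score_by_annotation_id(extracted_rows, truth_rows)
print(scores)
```

Exact string equality is the simplest possible notion of a match; the "distance from truth per term" item in the progress tracker would replace the equality check with a graded comparison.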

src/benchmark/__init__.py

Whitespace-only changes.

src/dataset/README.md

Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
+# Dataset
+
+## Goal
+Convert the loaded files into a dataset where the annotations and raw text are paired with each other
+
+## Subgoals
+1. Understand the formats of the annotations
+2. Choose a format for the dataset
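
Reviewer note: since the dataset format is still open, the sketch below shows just one possible shape, with one record per article holding its raw text and every annotation that cites it. It assumes a PMID-to-PMCID dictionary and an in-memory map of article text; the function name, record keys, and sample values are all hypothetical.

```python
from collections import defaultdict


def pair_annotations_with_text(annotations, pmid_to_pmcid, article_text):
    """Group annotation rows by article and attach that article's raw text."""
    grouped = defaultdict(list)
    for row in annotations:
        pmcid = pmid_to_pmcid.get(row["PMID"])
        # Skip annotations whose article has no PMCID or was not downloaded.
        if pmcid in article_text:
            grouped[pmcid].append(row)
    return [
        {"pmcid": pmcid, "text": article_text[pmcid], "annotations": rows}
        for pmcid, rows in grouped.items()
    ]


# Placeholder inputs for illustration only.
annotations = [
    {"Variant Annotation ID": "1", "PMID": "111"},
    {"Variant Annotation ID": "2", "PMID": "111"},
    {"Variant Annotation ID": "3", "PMID": "333"},  # no mapped article
]
pmid_to_pmcid = {"111": "PMC1234567"}
article_text = {"PMC1234567": "<article/>"}
dataset = pair_annotations_with_text(annotations, pmid_to_pmcid, article_text)
```

Grouping by article rather than by annotation keeps each document's text stored once, which suits a benchmark that scores all extracted variants for an article together.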

src/dataset/__init__.py

Whitespace-only changes.
