This repository contains the codebase for the WOS Parser, a data processing tool designed to handle and process Web of Science (WOS) data. This document outlines the relationships between the code components and explains the architecture of the project.
The WOS Parser project is composed of several Python modules that work together to process large sets of WOS data, focusing on extracting and analyzing bibliographic information. The project is organized in a modular structure, ensuring scalability and clear separation of concerns.
The main modules and their roles include:

- `parser_proc_main.py`: Entry point of the application. This module initializes the various components and orchestrates the data processing flow.
- `paper_info_load_api.py`: Handles loading of WOS paper data from the provided input files or directories and organizes the information into dictionaries. It also defines methods for temporary data processing.
- `paper_parser.py`: Contains the main `PaperInfo` class, responsible for extracting and managing bibliographic metadata such as authors, titles, abstracts, grants, keywords, and references. It interacts with various supporting managers.
- `fu_manager.py`: Manages funding (FU) information. This module defines an `FUManager` class to process, store, and output grant- and funding-related information.
- `state_code_analysis.py`: Loads and analyzes a list of U.S. state postal codes for validation purposes.
- `rp_author_manager.py`: Manages reprint (RP) and corresponding author information. Provides mechanisms to load, process, and output reprint authors' information.
- `common_def.py`: Defines constants and shared configurations used across the codebase; for example, output directories and file paths are centralized in this module.
- `filter_dumplicate.py`: Identifies and removes duplicate entries from the input dataset.
- `txtparser.py`: Contains helper functions for parsing and processing text-based WOS data.
Below is a description of how the different modules in the codebase interact with each other:
`parser_proc_main.py` acts as the main application entry point. It calls the following modules to perform data processing:

- `state_code_analysis`: Loads state postal codes using `load_state_code`.
- `proc_history_manager`: Loads historical Unique Topic Session (UTS) data using `load_history_uts`.
- `paper_info_load_api`: Manages paper input loading and processing using the `load_paper_input` and `paper_info_proc` methods.
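The orchestration described above might look roughly like the following sketch. The real modules are stubbed out here so the control flow is visible; the actual function signatures in the repository may differ.

```python
# Hypothetical sketch of the entry-point flow: load validation data and
# processing history, then run a per-paper callback over new records only.

def load_state_code():
    """Stand-in for state_code_analysis.load_state_code."""
    return {"CA", "NY", "TX"}

def load_history_uts():
    """Stand-in for proc_history_manager.load_history_uts."""
    return {"WOS:000001"}

def load_paper_input(input_uts, callback, history):
    """Stand-in for paper_info_load_api.load_paper_input: skip records
    already in the processing history, run the callback on the rest."""
    return [callback(ut) for ut in input_uts if ut not in history]

def paper_info_proc(ut):
    """Stand-in for the per-paper processing callback."""
    return {"ut": ut, "parsed": True}

def main(input_uts):
    state_codes = load_state_code()   # validation data
    history = load_history_uts()      # previously processed UTs
    return load_paper_input(input_uts, paper_info_proc, history)

results = main(["WOS:000001", "WOS:000002"])
```

Keeping the processing history as a set makes the "already processed?" check constant-time, which matters when re-running the parser over large inputs.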
`paper_info_load_api.py` loads input data and processes it using a provided callback function. It interacts with:

- `parser_proc_main.py`: Calls this module to load data inputs.
- `paper_parser.PaperInfo`: A key class used for parsing and processing paper-specific metadata.
- `proc_history_manager`: Checks whether UTs are already in the processing history.
The `PaperInfo` class in `paper_parser.py` serves as the core of the application. It interacts with other classes and methods to load and parse data. It calls:

- `fu_manager.FUManager.load_fu`: Processes grants and funding information.
- `rp_author_manager.RPAuthorManager.load_rp_authors`: Processes reprint authors and assigns them to the corresponding paper metadata.
- `author_addr.AuthorAddrManager`: Manages address-related information.

It outputs the major datasets (e.g., titles, keywords) to files defined in `common_def.FilePathDef`.
`fu_manager.py` defines the `FUManager` and `FUInfo` classes to handle funding-related data for each paper. The methods include:

- Loading funding and grant information, parsing the input data for each funding detail.
- Interacting with `paper_parser.PaperInfo` for grant processing and output.
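WOS `FU` fields typically list funding agencies with their grant numbers in brackets, e.g. `NSF [ABC-1234]; Wellcome Trust`. A parser along those lines might look like the sketch below; this is an illustration only, and `FUManager`'s real parsing may differ.

```python
import re

# One "Agency [grant1, grant2]" chunk; agencies are separated by ";".
# The bracketed grant list is optional.
FU_CHUNK = re.compile(r"^(?P<agency>[^\[]+?)(?:\s*\[(?P<grants>[^\]]*)\])?$")

def parse_fu_field(fu_field):
    """Split a WOS FU string into (agency, [grant numbers]) pairs."""
    entries = []
    for chunk in fu_field.split(";"):
        match = FU_CHUNK.match(chunk.strip())
        if not match or not match.group("agency").strip():
            continue
        grants = match.group("grants") or ""
        entries.append((match.group("agency").strip(),
                        [g.strip() for g in grants.split(",") if g.strip()]))
    return entries

funding = parse_fu_field("NSF [ABC-1234, DEF-5678]; Wellcome Trust")
```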
`rp_author_manager.py` defines the `RPAuthorManager` and `RPAuthorInfo` classes to manage reprint (RP) authors' metadata. Key methods:

- `load_rp_authors`: Loads authors using regular expression patterns defined in this module.
- `output_rp_authors`: Writes the processed authors' metadata to a file.
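As a rough illustration of the regex-driven loading mentioned above, a WOS `RP` field of the form `Name (corresponding author), Address.` could be split like this. The pattern and field layout here are assumptions, not the module's actual implementation:

```python
import re

# Hypothetical pattern for a WOS RP (reprint address) entry such as:
#   "Smith, J (corresponding author), Stanford Univ, Stanford, CA 94305 USA."
# The real patterns live in rp_author_manager.py and may differ.
RP_PATTERN = re.compile(
    r"(?P<name>[^(]+?)\s*\(corresponding author\),\s*(?P<address>.+?)\.?$"
)

def load_rp_author(rp_field):
    """Split one RP entry into the author name and address parts."""
    match = RP_PATTERN.match(rp_field)
    if match is None:
        return None
    return {"name": match.group("name").strip(),
            "address": match.group("address").strip()}

author = load_rp_author(
    "Smith, J (corresponding author), Stanford Univ, Stanford, CA 94305 USA."
)
```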
`state_code_analysis.py` is responsible for:

- Loading U.S. state postal codes from a CSV file.
- Validating state codes using a lookup set.
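A minimal sketch of that load-and-validate step follows; the CSV column layout and function names are assumptions, and the real definitions live in `state_code_analysis.py`:

```python
import csv
import io

def load_state_codes(csv_text):
    """Load two-letter state postal codes from CSV text into a set,
    giving O(1) membership checks. Assumes a 'code' column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["code"].strip().upper() for row in reader}

def is_valid_state_code(code, state_codes):
    """Validate a candidate code against the lookup set."""
    return code.strip().upper() in state_codes

# Tiny inline example standing in for the real CSV file.
states = load_state_codes("code,name\nCA,California\nNY,New York\n")
```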
`common_def.py` acts as a shared configuration file. It defines the file paths and formats for the output files.
`filter_dumplicate.py`:

- Utilizes `paper_info_load_api` for paper input processing.
- Deduplicates the input data based on unique identifiers such as UT and Title.
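The deduplication step described above can be sketched as keeping the first record seen for each (UT, normalized title) key. The key choice and record layout here are assumptions based on this description, not the module's actual code:

```python
def filter_duplicates(records):
    """Keep the first occurrence of each record, keyed on the WOS UT
    identifier and a normalized title (assumed dict keys 'UT'/'TI')."""
    seen = set()
    unique = []
    for rec in records:
        key = (rec["UT"], rec["TI"].strip().lower())
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique

papers = [
    {"UT": "WOS:0001", "TI": "Graph Mining"},
    {"UT": "WOS:0001", "TI": "graph mining "},   # duplicate after normalizing
    {"UT": "WOS:0002", "TI": "Topic Models"},
]
deduped = filter_duplicates(papers)
```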
`txtparser.py` defines utility functions for extracting specific metadata from text inputs. It processes detailed attributes such as publication year, citation counts, and abstracts.
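For instance, WOS plain-text exports tag each field with a two-letter prefix (`PY` for publication year, `TC` for times cited, `AB` for abstract), so a helper along these lines could pull those attributes out. This is an illustrative sketch, not the module's actual code:

```python
def extract_fields(record_text, tags=("PY", "TC", "AB")):
    """Extract tagged fields from a WOS plain-text record, where each
    field line starts with a two-letter tag followed by its value."""
    fields = {}
    for line in record_text.splitlines():
        tag, _, value = line.partition(" ")
        if tag in tags:
            fields[tag] = value.strip()
    return fields

record = "TI A Sample Paper\nPY 2015\nTC 42\nAB A short abstract."
meta = extract_fields(record)
```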
- Input: The system takes WOS input data files or directories as the source. `parser_proc_main.py` initializes the input process by calling methods from `paper_info_load_api.py`.
- Parsing: `paper_info_load_api.load_paper_input` parses the input files and extracts metadata into dictionaries. The metadata is further processed by `PaperInfo` and its methods, e.g., `PaperInfo.load_fu` and `PaperInfo.load_rp`.
- Post-processing:
  - Duplicate and invalid records are filtered by `filter_dumplicate.py`.
  - Address and author information is processed by `author_name`, `author_addr_manager`, and `rp_author_manager`.
  - Funding and grants are handled by `fu_manager.FUManager`.
- Output: Metadata such as abstracts, titles, authors, grants, and citations is written to files, with locations and formats defined in `common_def`.
```shell
git clone https://github.com/DongboShi/wos_parser.git
cd wos_parser
pip install -r requirements.txt
```

To run the WOS Parser, use the following command from the project directory:

```shell
python parser_proc_main.py
```

For faster processing of large datasets, enable concurrent processing:

```shell
# Automatic worker detection
python parser_proc_main.py --parallel

# Specify worker count
python parser_proc_main.py --parallel --workers 4
```

Concurrent processing advantages:
- Dramatically improved performance for large datasets
- Efficient CPU utilization across multiple cores
- Real-time progress monitoring
- Comprehensive error tracking and reporting
Make sure to place your input files in the corresponding `paper_input_uniq` directory.
```
wos_parser/
├── parser_proc_main.py     # Entry point supporting concurrent execution
├── parallel_processor.py   # Concurrent processing implementation
├── paper_info_load_api.py  # Paper data loading and API definitions
├── filter_dumplicate.py    # Deduplication handling
├── paper_parser.py         # Core PaperInfo class for managing paper metadata
├── fu_manager.py           # Funding and grants management
├── rp_author_manager.py    # Reprint authors and corresponding authors management
├── state_code_analysis.py  # State code data loading and validation
├── txtparser.py            # Utility functions for text parsing
├── common_def.py           # Common constants and file path configurations
└── README.md               # Documentation
```
This project is maintained by DongboShi.
This repository is distributed under the terms of the MIT License. See the LICENSE file for more details.