This repository contains the complete data processing pipeline for the FORRT Replication Database (FReD) and FLoRA datasets, plus the backend API for the Zotero Replication Checker plugin.
- Overview
- Repository Structure
- Quick Start
- Data Processing Pipelines
- COS Integration
- API Documentation
- Contributing
This repository manages two independent datasets:
- FReD (FORRT Replication Effects Database)
  - Effect-level data: individual effect sizes from replications
  - Created by merging individual effects with paper-level success coding
  - Output: `output/FReD.xlsx`
- FLoRA (FORRT Literature on Replications and Reproductions Archive)
  - Paper-level data: metadata about replication and reproduction studies
  - Combines both replications and reproductions from two separate Google Sheets
  - Deduplicated by original-replication/reproduction DOI pairs
  - Enriched with keywords and language metadata from OpenAlex
  - Output: `output/flora.csv`
Both datasets are augmented with:
- CrossRef metadata (titles, authors, years)
- APA-formatted references with manual override support (sketched after this list)
- Author overlap detection
- OpenAlex keywords
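As a rough illustration of the CrossRef part of that augmentation, here is a minimal sketch using the rcrossref package from the install list below; the DOIs are placeholders, and the real helpers in `R/` add caching and manual overrides:

```r
# Sketch only: fetch bibliographic metadata and APA references for a few DOIs.
# The pipeline goes through the cached helpers in R/crossref_cache.R instead of
# calling the API directly.
library(rcrossref)

dois <- c("10.1234/example.one", "10.5678/example.two")      # placeholder DOIs

meta     <- cr_works(dois = dois)$data                        # titles, authors, years, ...
apa_refs <- cr_cn(dois, format = "text", style = "apa")       # APA-formatted reference strings
```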
The datasets power the Zotero Replication Checker API backend via privacy-preserving DOI hash lookups.
fred_data/
├── R/ # Shared helper functions
│ ├── cache_config.R # Cache paths by data type
│ ├── data_cleaning.R # FReD data cleaning
│ ├── crossref_cache.R # Citation & author caching
│ ├── augmentation.R # Augmentation functions
│ └── release_helpers.R # OSF release automation
│
├── pipelines/ # Independent data pipelines
│ ├── fred/
│ │ ├── prepare_fred.qmd # Download → clean → augment → save
│ │ └── release_fred.qmd # Release to OSF (optional)
│ │
│ └── flora/
│ ├── prepare_flora.qmd # Download → deduplicate → augment → save
│ └── release_flora.qmd # Release to OSF (optional)
│
├── cache/ # Cache files (gitignored)
│ ├── crossref_doi_cache.rds # DOI metadata
│ ├── crossref_citations.rds # APA/BibTeX references
│ ├── crossref_authors.xlsx # Author lists
│ ├── author_overlap.xlsx # Overlap calculations
│ ├── manual_references.xlsx # Manual reference overrides
│ └── openalex_keywords.csv # Keywords cache
│
├── output/ # Generated datasets (gitignored)
│ ├── FReD.xlsx # Effect-level dataset
│ └── flora.csv # Paper-level dataset
│
├── cos_integration/ # COS test data (optional)
│ ├── cos_test_set_phase1.csv
│ ├── cos_test_set_phase1_prepared.xlsx
│ ├── prepare_cos_data.R
│ └── README.md # COS toggle instructions
│
├── fred_dynamodb_loader/ # API backend loader
├── release/ # Release automation scripts
├── COS Reports/ # COS competition reports
├── archive/ # Historical files
│ └── old_scripts/ # Previous pipeline versions
│
└── [Documentation files]
├── README.md # This file
├── .env.example # Environment variables template
├── REORGANIZATION_PROGRESS.md
├── PHASE2_SUMMARY.md
├── PHASE3-4_SUMMARY.md
└── IMPLEMENTATION_STATUS.md
# Clone repository
git clone https://github.com/forrtproject/FReD-data.git
cd fred_data
# Set up environment variables
cp .env.example .env
# Edit .env and add:
# - OSF_TOKEN (for releases)
# - ENABLE_COS_MERGE (TRUE/FALSE)
# - OPENALEX_MAILTO (your email for API)
# Install R dependencies (one-time setup)
Rscript -e "install.packages(c('tidyverse', 'readxl', 'openxlsx', 'rcrossref', 'osfr', 'quarto'))"

# Prepare FReD (effect-level dataset)
quarto render pipelines/fred/prepare_fred.qmd
# Prepare FLoRA (paper-level dataset)
quarto render pipelines/flora/prepare_flora.qmd
# Output files created:
# - output/FReD.xlsx (effect-level data with augmentation)
# - output/flora.csv (paper-level data with augmentation)

File: `pipelines/fred/prepare_fred.qmd`
8-step process:
- Load helpers - Source R functions for cleaning and augmentation
- Download - Fetch validated FReD data from Google Sheets
- COS Integration (optional) - Merge COS Phase 1 test data if enabled
- Clean - Standardize formatting, remove duplicates, fix DOIs
- Validate - Check data quality (placeholder step until the validation module is complete)
- Generate IDs - Create fred_id, entry_id, effect_id
- Augment:
  - Author overlap detection (% shared authors)
  - Clean references (APA-formatted with manual overrides)
  - Keywords from OpenAlex (optional; a lookup sketch follows this list)
- Save - Output to `output/FReD.xlsx`
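A sketch of that OpenAlex lookup, assuming the httr and jsonlite packages (not in the install list above) and a hypothetical `fetch_openalex()` helper; the pipeline's own helper is `augment_with_keywords()`, which also caches results:

```r
# Illustrative only: query the public OpenAlex API for one DOI and extract the
# language code and keyword labels.
library(httr)
library(jsonlite)

fetch_openalex <- function(doi, mailto = Sys.getenv("OPENALEX_MAILTO")) {
  resp <- GET(paste0("https://api.openalex.org/works/doi:", doi),
              query = list(mailto = mailto))           # polite-pool contact address
  if (http_error(resp)) return(NULL)
  work <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))
  list(language = work$language,
       keywords = work$keywords$display_name)
}
```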
Run:
# Without COS data (default)
quarto render pipelines/fred/prepare_fred.qmd
# With COS data merged
export ENABLE_COS_MERGE=TRUE
quarto render pipelines/fred/prepare_fred.qmd
# Output: output/FReD.xlsx
# HTML report: pipelines/fred/prepare_fred.html

File: `pipelines/flora/prepare_flora.qmd`
12-step process:
- Load helpers - Source R functions for augmentation
- Download & Combine - Fetch both replications and reproductions from Google Sheets, combine on common columns
- Prepare - Select relevant columns
- Deduplicate - Remove duplicate (doi_o, doi_r) pairs
- Validate DOIs - Ensure format starts with "10."
- Fetch metadata - CrossRef/DataCite lookup (framework ready)
- Augment with references - Clean references (APA-formatted with manual overrides)
- Add IDs - Privacy-preserving 3-char DOI hash prefixes (sketched after this list)
- Add language & keywords - Fetch from OpenAlex API (only fills empty fields)
- Format - Reorder columns for output (includes keywords and language)
- Save - Output to `output/flora.csv`
- Summary - Report statistics on papers, coverage, and augmentation success
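The hash-prefix IDs from the Add IDs step could look roughly like the sketch below; the exact normalization and hash algorithm used by the pipeline are not documented here, so treat this as an assumption (it uses the digest package):

```r
# Hypothetical sketch of a privacy-preserving 3-character DOI hash prefix.
# The actual scheme used by the pipeline and the Zotero plugin may differ.
library(digest)

doi_hash_prefix <- function(doi, n = 3) {
  doi <- tolower(trimws(doi))                                  # normalize before hashing
  substr(digest(doi, algo = "sha256", serialize = FALSE), 1, n)
}

doi_hash_prefix("10.1234/example")                             # returns a 3-char hex prefix
```

Because only a short prefix is shared, the full DOI presumably never leaves the client; the API returns all entries under that prefix and the plugin matches the exact DOI on its side.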
Run:
quarto render pipelines/flora/prepare_flora.qmd
# Output: output/flora.csv
# HTML report: pipelines/flora/prepare_flora.html

All helper functions are in `R/` and ready to use:
- `clean_fred_data(data)` - Standardize formatting, fix DOIs, remove non-printable characters
- `get_apa_references(dois)` - Get APA references (manual → cache → API lookup)
- `get_crossref_authors(dois)` - Fetch author lists from CrossRef
- `compute_author_overlap(data)` - Calculate author overlap (sketched below)
- `augment_with_author_overlap(data)` - Add author overlap columns
- `augment_with_clean_references(data)` - Add APA reference columns
- `augment_with_keywords(data)` - Add OpenAlex keywords
- `release_to_osf(dataset_path, ...)` - Release a dataset to OSF with versioning
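A rough sketch of the overlap idea behind `compute_author_overlap()`; the real implementation works from cached CrossRef author lists, and its exact definition of overlap may differ:

```r
# Illustrative only: percentage of distinct authors shared between the original
# study and its replication, given two vectors of author names.
author_overlap_pct <- function(authors_original, authors_replication) {
  a <- unique(tolower(trimws(authors_original)))
  b <- unique(tolower(trimws(authors_replication)))
  100 * length(intersect(a, b)) / max(length(union(a, b)), 1)
}

author_overlap_pct(c("Smith, J.", "Doe, A."), c("Doe, A.", "Lee, K."))   # ~33.3
```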
Usage:
source("R/augmentation.R")
# Augment data
data <- augment_with_author_overlap(data)
data <- augment_with_clean_references(data)
data <- augment_with_keywords(data)

COS (Center for Open Science) Phase 1 test data can optionally be merged with FReD.
# Method 1: Environment variable
export ENABLE_COS_MERGE=TRUE
quarto render pipelines/fred/prepare_fred.qmd
# Method 2: .env file
# Edit .env and set: ENABLE_COS_MERGE=TRUE
quarto render pipelines/fred/prepare_fred.qmd
# Disable (default)
export ENABLE_COS_MERGE=FALSE
quarto render pipelines/fred/prepare_fred.qmd

When ENABLE_COS_MERGE=TRUE:
- The FReD pipeline downloads the main dataset
- Merges it with the COS data on common columns (sketched below)
- Both are processed identically (cleaning, validation, augmentation)
- Single output: `FReD.xlsx` containing both datasets
See: cos_integration/README.md for detailed instructions
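A minimal sketch of the merge-on-common-columns step, assuming both sheets are already loaded as data frames; `merge_cos()` is a hypothetical name, and the actual code lives in `pipelines/fred/prepare_fred.qmd`:

```r
# Keep only the columns both datasets share, then stack them so the combined
# data flow through the same cleaning, validation, and augmentation steps.
library(dplyr)

merge_cos <- function(fred_data, cos_data) {
  common <- intersect(names(fred_data), names(cos_data))
  bind_rows(fred_data[common], cos_data[common])
}
```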
All caches are organized by data type (rather than by purpose), so the same lookups can be reused across both pipelines:
| Cache File | Type | Purpose |
|---|---|---|
| `crossref_doi_cache.rds` | DOI Metadata | CrossRef/DataCite results |
| `crossref_citations.rds` | References | APA/BibTeX formatted citations |
| `crossref_authors.xlsx` | Authors | Author lists by DOI |
| `author_overlap.xlsx` | Overlap Data | Computed author overlaps |
| `manual_references.xlsx` | Overrides | Manual reference corrections |
| `openalex_keywords.csv` | Keywords | OpenAlex keywords by DOI |
Three-tier lookup for references (fastest to slowest):
1. Manual reference overrides (`manual_references.xlsx`)
2. Cached references (RDS cache files)
3. Live CrossRef API call
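A schematic of those three tiers; the column names and cache layout are assumptions, and the real, vectorized implementation is `get_apa_references()` in `R/crossref_cache.R`:

```r
# Sketch only: resolve one DOI to an APA reference, trying manual overrides,
# then the on-disk cache, then a live CrossRef call whose result is cached.
library(readxl)
library(rcrossref)

get_reference <- function(doi,
                          manual_path = "cache/manual_references.xlsx",
                          cache_path  = "cache/crossref_citations.rds") {
  # Tier 1: manual override (assumes columns 'doi' and 'reference')
  if (file.exists(manual_path)) {
    manual <- read_excel(manual_path)
    hit <- manual$reference[manual$doi == doi]
    if (length(hit) > 0) return(hit[1])
  }

  # Tier 2: cached reference (assumes a named list keyed by DOI)
  cache <- if (file.exists(cache_path)) readRDS(cache_path) else list()
  if (!is.null(cache[[doi]])) return(cache[[doi]])

  # Tier 3: live CrossRef lookup, stored back into the cache
  ref <- cr_cn(doi, format = "text", style = "apa")
  cache[[doi]] <- ref
  saveRDS(cache, cache_path)
  ref
}
```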
See the original README content for the Zotero Replication Checker API endpoints:
- Prefix Lookup (privacy-preserving 3-char hash lookups)
- Original DOI Lookup (direct DOI searches)
The API is powered by the FLoRA dataset (`output/flora.csv`) loaded into DynamoDB.
API Backend: fred_dynamodb_loader/load_fred_to_dynamodb.py
This reorganization introduces breaking changes:
- Old scripts moved to `archive/old_scripts/`
- All helper functions now in `R/`
- Pipelines now in `pipelines/fred/` and `pipelines/flora/` (not at the repository root)
- Output files now in `output/` (use the pipelines to generate them)
- No symlinks created at the root
Migration path:
- Run the new pipelines: `quarto render pipelines/fred/prepare_fred.qmd`
- Use `output/FReD.xlsx` and `output/flora.csv` as outputs
- All helper functions are available via `source("R/...")` in your scripts
Set in .env file (or shell environment):
# OSF Release Authentication
OSF_TOKEN=your_osf_token_here
# COS Integration Toggle
ENABLE_COS_MERGE=FALSE # Set to TRUE to merge COS test data
# OpenAlex API Contact
OPENALEX_MAILTO=your_email@example.com

All cache paths are defined in `R/cache_config.R`:
CACHE_DIR <- "cache"
CROSSREF_DOI_CACHE <- "cache/crossref_doi_cache.rds"
CROSSREF_CITATIONS_CACHE <- "cache/crossref_citations.rds"
# ... etc

To change cache locations, edit `R/cache_config.R` and update the paths.
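One way the pipelines can pick these settings up at run time is sketched below; the actual mechanism may differ, and `FRED_CACHE_DIR` is a hypothetical variable:

```r
# Load the .env file into the session (works when it only contains NAME=value
# lines) and read the toggles defined above.
readRenviron(".env")

enable_cos      <- toupper(Sys.getenv("ENABLE_COS_MERGE", "FALSE")) == "TRUE"
openalex_mailto <- Sys.getenv("OPENALEX_MAILTO")

# Example of relocating caches when editing R/cache_config.R:
CACHE_DIR <- Sys.getenv("FRED_CACHE_DIR", "cache")   # hypothetical override
```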
Download fails:

- Check your internet connection
- Verify the Google Sheets URLs are still active
- Check that the sheets still export as CSV

Cache problems:

- Caches are auto-generated on first run
- Ensure the `cache/` directory exists
- Check file permissions

COS data not merging:

- Ensure `ENABLE_COS_MERGE=TRUE`
- Verify that `cos_integration/cos_test_set_phase1_prepared.xlsx` exists
- Run `Rscript cos_integration/prepare_cos_data.R` to prepare the COS data

Augmentation fails:

- Check internet connection for the CrossRef API
- Verify `manual_references.xlsx` exists if using manual overrides
- Check that `OPENALEX_MAILTO` is set for OpenAlex requests
# Enable verbose logging
quarto render pipelines/fred/prepare_fred.qmd --quiet false

# Test cleaning
source("R/data_cleaning.R")
result <- clean_fred_data(sample_data)
# Test augmentation
source("R/augmentation.R")
data <- augment_with_author_overlap(sample_data)
# Test caching
source("R/crossref_cache.R")
refs <- get_apa_references(c("10.1234/example"))

To add a new augmentation:

- Create the function in `R/augmentation.R`
- Follow the pattern `augment_with_[feature](data)` (see the skeleton below)
- Add cache management as needed
- Call it from the appropriate pipeline file
- Document it in the pipeline comments
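A skeleton following that pattern; the feature name and column are placeholders:

```r
# Template for a new augmentation step; replace "altmetrics" with your feature.
augment_with_altmetrics <- function(data) {
  # 1. Fetch or compute the new values, reusing/creating a cache under cache/
  # 2. Join the values onto the dataset without dropping rows
  # 3. Return the augmented data frame
  data$altmetric_score <- NA_real_   # placeholder until real values are fetched
  data
}
```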
- FORRT Project: https://forrt.org
- FReD Dataset: https://osf.io/9r62x (OSF project)
- Zotero Plugin: Replication Checker plugin in Zotero marketplace
- API Backend: `fred_dynamodb_loader/` (AWS Lambda + DynamoDB)
[See LICENSE file in repository]
For questions about the data processing pipeline:
- Open an issue on GitHub
- Contact FORRT team
For API issues:
- See API documentation in this README
- Check `fred_dynamodb_loader/` for the backend code
Last Updated: 2025-12-17 Version: 2.0 (Reorganized with independent pipelines) Status: Production-ready