From 84e404c2a6f3617b89dd5a548da26ce0539e3675 Mon Sep 17 00:00:00 2001 From: hk-2029 Date: Fri, 13 Mar 2026 17:41:27 +0000 Subject: [PATCH 01/23] update --- .DS_Store | Bin 10244 -> 10244 bytes documentation/README.md | 19 +-- documentation/architecture.md | 66 +++++---- documentation/configuration.md | 14 +- documentation/getting-started.md | 2 +- documentation/gui-guide.md | 51 ------- documentation/model-concepts.md | 207 +++++++++++++++++++++++++++++ documentation/scenario-cookbook.md | 181 +++++++++---------------- 8 files changed, 328 insertions(+), 212 deletions(-) delete mode 100644 documentation/gui-guide.md create mode 100644 documentation/model-concepts.md diff --git a/.DS_Store b/.DS_Store index f18e97d977195ebd0d92543d0bc90a3bca6de531..1a9d76103f7b49dead6334eac2fb04f54716e0eb 100644 GIT binary patch delta 14 VcmZn(XbITBDagpMnM?4O7yu%g1X2J1 delta 14 VcmZn(XbITBDagpUnM?4O7yu%m1XBP2 diff --git a/documentation/README.md b/documentation/README.md index b36ace796..0210599e6 100644 --- a/documentation/README.md +++ b/documentation/README.md @@ -4,23 +4,24 @@ This documentation is structured to support both first-time users and contributo ## Recommended reading order -1. [Getting Started](getting-started.md) -2. [CLI Reference](cli-reference.md) -3. [Configuration](configuration.md) -4. [Scenario Cookbook](scenario-cookbook.md) -5. [Data and Outputs](data-and-outputs.md) -6. [Troubleshooting](troubleshooting.md) +1. [Model Concepts](model-concepts.md) — what SimPaths simulates, agents, annual cycle, alignment, EUROMOD +2. [Getting Started](getting-started.md) — prerequisites, build, first run +3. [CLI Reference](cli-reference.md) — all flags for `singlerun.jar` and `multirun.jar` +4. [Configuration](configuration.md) — YAML structure and all config keys +5. [Scenario Cookbook](scenario-cookbook.md) — provided configs and how to build your own +6. [Data and Outputs](data-and-outputs.md) — input layout, setup artifacts, output files +7. 
[Troubleshooting](troubleshooting.md) — common errors and fixes For contributors and advanced users: -- [Architecture](architecture.md) -- [Development and Testing](development.md) -- [GUI Guide](gui-guide.md) +- [Architecture](architecture.md) — source package structure and data flow +- [Development and Testing](development.md) — build, tests, CI, contributor workflow ## Scope These guides cover: +- Understanding the simulation model and its mechanisms - Building SimPaths with Maven - Running single-run and multi-run workflows - Configuring model, collector, and runtime behavior via YAML diff --git a/documentation/architecture.md b/documentation/architecture.md index a0e168edf..69c19c36e 100644 --- a/documentation/architecture.md +++ b/documentation/architecture.md @@ -1,44 +1,64 @@ # Architecture +For a conceptual overview of the simulation (agents, annual cycle, modules, alignment), see [Model Concepts](model-concepts.md). This page covers source-level structure and data flow. + +--- + ## High-level module map Core package layout under `src/main/java/simpaths/`: -- `experiment/`: simulation entry points and orchestration -- `model/`: core simulation entities and yearly process logic -- `data/`: parameters, setup routines, filters, statistics helpers +| Package | Contents | +|---------|----------| +| `experiment/` | Entry points, orchestration, and runtime managers (`SimPathsStart`, `SimPathsMultiRun`, `SimPathsCollector`, `SimPathsObserver`) | +| `model/` | Core simulation entities (`Person`, `BenefitUnit`, `Household`), yearly process logic, alignment routines, labour market, union matching, tax evaluation, intertemporal decision module | +| `data/` | Parameters, setup routines, input parsers, filters, statistics helpers | + +--- ## Primary entry points -- `simpaths.experiment.SimPathsStart` - - Builds/refreshes setup artifacts - - Launches single simulation run (GUI or headless) -- `simpaths.experiment.SimPathsMultiRun` - - Loads YAML config - - Iterates runs 
with optional seed/innovation logic - - Supports persistence mode switching +### `simpaths.experiment.SimPathsStart` + +- Builds or refreshes setup artifacts (H2 database, policy schedule) +- Launches a single simulation run, GUI or headless + +### `simpaths.experiment.SimPathsMultiRun` + +- Loads a YAML config from `config/` +- Iterates runs with optional seed or innovation logic +- Supports persistence mode switching across runs + +--- ## Runtime managers -The simulation engine registers: +All three are registered with the JAS-mine simulation engine at startup. They live in `simpaths.experiment`: -- `SimPathsModel`: state evolution and process scheduling -- `SimPathsCollector`: statistics computation and export -- `SimPathsObserver`: GUI observation layer (when GUI is enabled) +| Class | Role | +|-------|------| +| `SimPathsModel` | Owns the agent collections, builds the event schedule, fires yearly processes | +| `SimPathsCollector` | Computes and exports statistics at scheduled intervals | +| `SimPathsObserver` | GUI observation layer, only active when GUI is enabled | + +--- ## Data flow -1. Setup stage prepares policy schedule and input database. -2. Runtime model loads parameters and input maps. -3. Collector computes and exports statistics at scheduled intervals. -4. Output files are written to run folders under `output/`. +1. **Setup stage** — `SimPathsStart` or `multirun -DBSetup` generates `input/input.mv.db`, `input/EUROMODpolicySchedule.xlsx`, and `input/DatabaseCountryYear.xlsx`. +2. **Initialisation** — `SimPathsModel.buildObjects()` loads parameters, reads the initial population CSV, and hydrates agent collections. +3. **Yearly loop** — `SimPathsModel.buildSchedule()` registers all process events in fixed order. Each year the engine fires them sequentially across `Person`, `BenefitUnit`, and model-level processes. See [Model Concepts — Annual simulation cycle](model-concepts.md#annual-simulation-cycle) for the full ordered list. +4. 
**Collection** — `SimPathsCollector` computes cross-sectional statistics and writes CSV outputs at the end of each year. +5. **Output** — files land in timestamped run folders under `output/`. + +--- ## Configuration flow -`SimPathsMultiRun` combines: +`SimPathsMultiRun` applies values in three layers (later layers override earlier ones): -- defaults in class fields -- overrides from `config/.yml` -- final CLI overrides at invocation time +1. Class field defaults +2. Values from `config/.yml` +3. CLI flags provided at invocation -This layered strategy supports reproducible batch runs with targeted command-line changes. +This layered strategy supports reproducible batch runs with targeted command-line overrides without editing YAML files. diff --git a/documentation/configuration.md b/documentation/configuration.md index 4e8a1426a..9641a31db 100644 --- a/documentation/configuration.md +++ b/documentation/configuration.md @@ -2,17 +2,13 @@ SimPaths multi-run behavior is controlled by YAML files in `config/`. -Examples in this repository include: +This repository ships with three configs: -- `default.yml` -- `test_create_database.yml` -- `test_run.yml` -- `create database.yml` -- `sc analysis*.yml` -- `intertemporal elasticity.yml` -- `labour supply elasticity.yml` +- `default.yml` — standard baseline run +- `test_create_database.yml` — database setup using training data +- `test_run.yml` — short integration-test run -For command-by-command guidance for each provided config, see [Scenario Cookbook](scenario-cookbook.md). +For command-by-command guidance and a template for building your own config, see [Scenario Cookbook](scenario-cookbook.md). ## How config is applied diff --git a/documentation/getting-started.md b/documentation/getting-started.md index 6a93e977d..9ba29a426 100644 --- a/documentation/getting-started.md +++ b/documentation/getting-started.md @@ -62,4 +62,4 @@ Use `-g true` (default behavior in several flows) to run with GUI components. 
In headless/remote environments, set `-g false`. -See [GUI Guide](gui-guide.md) for screenshots. +For GUI usage, see the GUI section of the user guide on the project website. diff --git a/documentation/gui-guide.md b/documentation/gui-guide.md deleted file mode 100644 index 40ad53d96..000000000 --- a/documentation/gui-guide.md +++ /dev/null @@ -1,51 +0,0 @@ -# GUI Guide - -The GUI is available in single-run and multi-run workflows when enabled. - -## Enable GUI - -Single run: - -```bash -java -jar singlerun.jar -g true -``` - -Multi run: - -```bash -java -jar multirun.jar -config default.yml -g true -``` - -## Screenshots - -Main GUI: - -![SimPaths GUI](figures/SimPaths%20GUI.png) - -Control buttons: - -![SimPaths Buttons](figures/SimPaths-Buttons.png) - -Parameter selection: - -![SimPaths Parameters](figures/SimPaths%20parameters.png) - -Charts overview: - -![Charts](figures/Charts.png) - -Chart properties: - -![Chart Properties](figures/Chart%20Properties.png) - -Chart zoom example: - -![Chart Zoom](figures/SimPaths-Chart-Zoom.png) - -Output stream panel: - -![Output Stream](figures/Output%20stream.png) - -## Headless note - -In remote servers or CI, run with `-g false`. diff --git a/documentation/model-concepts.md b/documentation/model-concepts.md new file mode 100644 index 000000000..e154e820a --- /dev/null +++ b/documentation/model-concepts.md @@ -0,0 +1,207 @@ +# Model Concepts + +This page explains what SimPaths simulates, how it is structured, and how its core mechanisms work. It is intended as the conceptual companion to the operational guides (getting-started, configuration, cli-reference). + +--- + +## What SimPaths is + +SimPaths is a dynamic population microsimulation model. It takes a sample of real households as a starting population and advances them forward in time, year by year, simulating individual life events using a combination of statistical regression models and rule-based processes. 
+ +The output is a longitudinal synthetic population whose trajectories can be used to study policy scenarios, distributional outcomes, and the long-run consequences of demographic and economic change. + +### Supported countries + +| Code | Country | +|------|---------| +| `UK` | United Kingdom | +| `IT` | Italy | + +Country selection affects which initial population, regional classifications, EUROMOD policy schedule, and regression coefficients are loaded. The two countries share the same model structure but are fully parameterised separately. + +--- + +## Agent hierarchy + +The simulation maintains three nested entity types. + +### Person + +The individual. Each person carries their own demographic, health, education, labour, and income attributes. Almost all behavioural processes (health, education, fertility, partnership, labour supply) are resolved at the person level. + +Key attributes tracked per person include: + +- **Demographics**: age, gender, region +- **Education**: highest qualification (Low / Medium / High / InEducation), mother's and father's education +- **Labour market status**: one of `EmployedOrSelfEmployed`, `NotEmployed`, `Student`, `Retired`; weekly hours worked; wage rate; work history in months +- **Health**: physical health (SF-12 PCS), mental health (SF-12 MCS, GHQ-12 psychological distress score, caseness indicator), life satisfaction, EQ-5D utility score, disability / care-need flag +- **Partnership**: partner reference, years in partnership +- **Income**: gross labour income, capital income, pension income, benefit receipt flags (UC and non-UC) +- **Social care**: formal and informal care hours received per week; care provision hours per week +- **Financial wellbeing**: equivalised disposable income, lifetime income trajectory, financial distress flag + +### BenefitUnit + +The tax-and-benefit assessment unit — typically an adult (or a couple) and their dependent children. 
Benefits and taxes are computed at this level, mirroring how real-world tax-benefit systems work. + +Key attributes include: + +- region, homeownership flag +- equivalised disposable income (EDI) and year-on-year change in log-EDI +- poverty flag (< 60 % of median equivalised household disposable income) +- discretionary consumption (when intertemporal optimisation is enabled) + +### Household + +A grouping of benefit units sharing an address. Used for aggregation and housing-related logic. A household may contain more than one benefit unit (for example, adult children living with parents before leaving home). + +--- + +## Annual simulation cycle + +SimPaths uses **discrete annual time steps**. Within each year, processes fire in a fixed order. The table below lists the ordered steps as scheduled in `SimPathsModel.buildSchedule()`. + +| # | Process | Level | Description | +|---|---------|-------|-------------| +| 1 | StartYear | model | Housekeeping, year logging | +| 2 | RationalOptimisation | model | *First year only.* Pre-computes intertemporal decision grids (if enabled) | +| 3 | UpdateParameters | model | Loads year-specific parameters and time-series factors | +| 4 | GarbageCollection | model | Removes stale entity references | +| 5 | UpdateWealth | benefit unit | Updates savings/wealth stocks (if intertemporal enabled) | +| 6 | Update | benefit unit | Refreshes household composition counts, clears state flags | +| 7 | Update | person | Refreshes individual-level state variables and lags | +| 8 | **Aging** | person | Increments age; checks whether individuals age out of the population | +| 9 | ConsiderRetirement | person | Stochastic retirement decision | +| 10 | **InSchool** | person | Whether person remains in / enters education (age 16–29) | +| 11 | InSchoolAlignment | model | Aligns school participation rates to targets (if enabled) | +| 12 | LeavingSchool | person | Transition out of education | +| 13 | EducationLevelAlignment | model | Aligns completed 
education distribution (if enabled) | +| 14 | Homeownership | benefit unit | Homeownership transition | +| 15 | **Health** | person | Updates physical health and disability status | +| 16 | UpdatePotentialHourlyEarnings | person | Refreshes wage potential for labour supply decisions | +| 17 | CohabitationAlignment | model | Aligns cohabitation share to targets (if enabled) | +| 18 | **Cohabitation** | person | Entry into partnership | +| 19 | PartnershipDissolution | person | Exit from partnership (separation / bereavement) | +| 20 | **UnionMatching** | model | Matches unpartnered individuals into new couples | +| 21 | FertilityAlignment | model | Aligns birth rates to projected fertility (if enabled) | +| 22 | **Fertility** | person | Fertility decision for women of childbearing age | +| 23 | GiveBirth | person | Adds newborn children to simulation | +| 24 | SocialCareReceipt | person | Care receipt (formal and informal) for those with care need | +| 25 | SocialCareProvision | person | Provision of informal care by eligible individuals | +| 26 | **Unemployment** | person | Unemployment transitions | +| 27 | UpdateStates | benefit unit | Refreshes joint labour states for IO decision (if enabled) | +| 28 | **LabourMarketAndIncomeUpdate** | model | Resolves labour supply, imputes taxes and benefits via EUROMOD donor matching | +| 29 | ReceivesBenefits | benefit unit | Assigns benefit receipt flags from donor match | +| 30 | ProjectDiscretionaryConsumption | benefit unit | Consumption/savings decision (if intertemporal enabled) | +| 31 | ProjectEquivConsumption | person | Computes individual equivalised consumption share | +| 32 | CalculateChangeInEDI | benefit unit | Updates equivalised disposable income and year-on-year change | +| 33 | ReviseLifetimeIncome | person | Updates lifetime income trajectory (if intertemporal enabled) | +| 34 | **FinancialDistress** | person | Financial distress indicator | +| 35–40 | **Mental health and wellbeing** | person | GHQ-12 
psychological distress (levels and caseness, two-step); SF-12 MCS and PCS; life satisfaction (all two-step) | +| 41 | **ConsiderMortality** | person | Stochastic mortality | +| 42 | HealthEQ5D | person | EQ-5D utility score update | +| 43 | PopulationAlignment | model | Re-weights or resamples population to match demographic projections | +| 44 | EndYear / UpdateYear | model | Year-end housekeeping | + +--- + +## Modules in depth + +### Education + +Individuals aged 16–29 are assessed each year for whether they remain in education (`InSchool`) and whether they have left (`LeavingSchool`). Upon leaving, their highest completed qualification (Low / Medium / High) is determined. Parent education levels are tracked as covariates in child education models. + +Optional alignment (`alignInSchool`, `alignEducation`) can anchor simulated shares to empirical targets. + +### Health + +Physical health is updated annually using regression models. The result feeds into disability and care-need flags, which then govern social care processes. + +Mental health and wellbeing are resolved later in the cycle (after income is determined), reflecting the evidence that material conditions affect mental health outcomes. Multiple constructs are tracked: + +- **GHQ-12 psychological distress** — continuous Likert score (0–36) and caseness indicator (derived from the 0–12 caseness score), each resolved in two steps (baseline prediction + exposure adjustment) +- **SF-12 MCS/PCS** — mental and physical component summary scores, two-step +- **Life satisfaction** — 0–10 score, two-step +- **EQ-5D** — health utility index updated at year end + +### Partnership + +Cohabitation entry is modelled at the person level; union matching is handled at the model level via a matching algorithm that pairs eligible singles. Partnership dissolution (separation or death of partner) is also modelled stochastically. Alignment of cohabitation shares to targets is available via `alignCohabitation`.
+ +### Fertility + +Women of childbearing age receive a fertility draw each year. A separate alignment step (`FertilityAlignment`) can scale individual probabilities to match aggregate fertility projections from population statistics. + +### Social care + +When `projectSocialCare` is enabled, individuals with a care need draw formal and informal care hours. Separate provision processes model informal care given by family members and others. A market-clearing step can reconcile supply and demand. + +### Labour market and income + +Labour supply is resolved for each benefit unit by choosing over a discrete set of hours options for each adult. The model supports: + +- **Intertemporal optimisation** (`enableIntertemporalOptimisations`): decision grids are pre-computed in the first year; each subsequent year agents select hours to maximise inter-period utility given expected future income. +- **Static labour supply** (default): hours are drawn from regression models without forward-looking optimisation. + +After hours are chosen, taxes and benefits are imputed using **EUROMOD donor matching** (see below). + +### Homeownership + +Homeownership transitions are modelled at the benefit unit level using a regression model, updating the homeownership flag each year. + +### Population alignment + +At the end of each year, `PopulationAlignment` re-weights or resamples the population to keep aggregate age-sex distributions consistent with external demographic projections. This ensures the simulated population does not drift away from official forecasts over long horizons. + +--- + +## Alignment + +Alignment is a technique used in microsimulation to prevent simulated aggregate rates from drifting away from known targets. Rather than discarding individual-level stochastic variation, alignment rescales or resamples agents' outcomes so that the population total matches a target share or count. 
+ +SimPaths uses alignment for several dimensions, each controlled by a boolean flag in the config: + +| Flag | What it aligns | Default | +|------|---------------|---------| +| `alignPopulation` | Age-sex population totals to demographic projections | `true` | +| `alignCohabitation` | Share of individuals in partnerships | `true` | +| `alignFertility` | Birth rates to projected fertility rates | `false` | +| `alignInSchool` | School participation rate (age 16–29) | `false` | +| `alignEducation` | Completed education level distribution | `false` | +| `alignEmployment` | Employment share | `false` | + +--- + +## EUROMOD integration and tax-benefit imputation + +SimPaths does not compute taxes and benefits directly from first principles. Instead it uses a **donor matching** approach: + +1. Before or at the start of a run, a database of tax-benefit outcomes is generated by running EUROMOD (a static tax-benefit microsimulation model) over a population of "donor" households across each policy year. +2. During simulation, each benefit unit selects a donor whose characteristics (labour supply hours, earnings, household composition, region, year) closely match its own. +3. The donor's EUROMOD-calculated disposable income, tax liability, and benefit amounts are imputed to the simulated benefit unit. + +This means simulated households benefit from EUROMOD's detailed and annually updated policy rules without requiring SimPaths to re-implement the full tax-benefit schedule. The policy schedule loaded per simulation year is controlled by `input/EUROMODpolicySchedule.xlsx`. + +--- + +## Intertemporal optimisation + +When `enableIntertemporalOptimisations: true`, SimPaths solves a life-cycle consumption and labour supply problem. Decision grids are pre-computed in year 0 (`RationalOptimisation`) by solving backwards over the remaining simulation horizon. At each subsequent year, agents look up their optimal choice from the pre-computed grid given their current state. 
+ +This is computationally intensive. It is disabled by default. The sensitivity of behaviour to assumed interest rates and labour income can be explored using the `interestRateInnov` and `disposableIncomeFromLabourInnov` parameters. + +--- + +## Key input files + +Most files in `input/` are regression coefficient tables (`reg_*.xlsx`), alignment targets (`align_*.xlsx`), and scenario overrides (`scenario_*.xlsx`). The most important ones to understand: + +| File | Purpose | +|------|---------| +| `input/EUROMODpolicySchedule.xlsx` | Maps simulation years to EUROMOD policy systems; generated by setup | +| `input/DatabaseCountryYear.xlsx` | Country- and year-specific macro parameters (wages, prices, etc.) | +| `input/input.mv.db` | H2 database containing the donor tax-benefit unit pool; generated by setup | +| `input/InitialPopulations/…/population_initial_UK_2019.csv` | Starting population cross-section | +| `input/reg_labourSupplyUtility.xlsx` | Labour supply utility function coefficients | +| `input/reg_lifetime_incomes.xlsx` | Lifetime income projection coefficients (used by IO module) | +| `input/align_popProjections.xlsx` | Official population projections used for demographic alignment | diff --git a/documentation/scenario-cookbook.md b/documentation/scenario-cookbook.md index 1d8576068..848e6e906 100644 --- a/documentation/scenario-cookbook.md +++ b/documentation/scenario-cookbook.md @@ -1,16 +1,16 @@ # Scenario Cookbook -This guide maps every provided YAML scenario in `config/` to its intended use. +This guide maps every YAML config currently in `config/` to its intended use, and explains how to build your own. -All commands below assume you are running from repository root after building jars. +All commands assume you are running from repository root after building jars. -## Baseline and testing scenarios +--- -### `default.yml` +## Provided configs -Use when you want the standard baseline run with conservative defaults. 
+### `default.yml` -Command: +The standard baseline run with conservative defaults. Use this as your starting point for any new analysis. ```bash java -jar multirun.jar -config default.yml -g false @@ -18,9 +18,7 @@ java -jar multirun.jar -config default.yml -g false ### `test_create_database.yml` -Use for test-oriented database setup with training data (`trainingFlag: true`). - -Command: +Test-oriented database setup using training data (`trainingFlag: true`). Creates the H2 donor database needed before running simulations. ```bash java -jar multirun.jar -DBSetup -config test_create_database.yml @@ -28,144 +26,89 @@ java -jar multirun.jar -DBSetup -config test_create_database.yml ### `test_run.yml` -Use for integration-style short runs (2 runs, test settings). - -Command: +Short integration-style run (2 runs, test settings, training data). Used by CI and useful for reproducing CI behavior locally. ```bash java -jar multirun.jar -config test_run.yml -P root ``` -### `programming test.yml` +--- -Use for quick developer smoke runs with smaller population and simplified behavior flags. - -Command: - -```bash -java -jar multirun.jar -config "programming test.yml" -g false -``` +## Building your own config -## Setup-focused scenario +Place a new `.yml` file in `config/` and pass it via `-config`. You only need to specify the values you want to override — everything else inherits defaults from `default.yml` or class field defaults. -### `create database.yml` +### Minimal template -Use to build a full database object set for UK long-horizon work. This file sets `flagDatabaseSetup: true` in `innovation_args`, so it runs setup mode. 
+```yaml +maxNumberOfRuns: 5 +executeWithGui: false +randomSeed: 42 +startYear: 2019 +endYear: 2030 +countryString: UK +popSize: 20000 -Command: - -```bash -java -jar multirun.jar -config "create database.yml" +collector_args: + persistStatistics: true + persistStatistics2: true + persistStatistics3: true + persistPersons: false + persistBenefitUnits: false + persistHouseholds: false ``` -## Sensitivity and robustness scenarios - -### `random seed.yml` - -Use to run multiple replications with random-seed iteration enabled. +### Enabling alignment -Command: +To align simulated aggregates to external targets, add `model_args` with the relevant flags: -```bash -java -jar multirun.jar -config "random seed.yml" -g false +```yaml +model_args: + alignPopulation: true + alignCohabitation: true + alignFertility: true + alignInSchool: true + alignEducation: true ``` -### `intertemporal elasticity.yml` +See [Configuration](configuration.md) for a full list of `model_args` toggles, and [Model Concepts](model-concepts.md) for what each alignment dimension does. -Use for intertemporal elasticity sensitivity (3 runs with interest-rate innovation pattern). +### Running sensitivity analyses -Command: +To vary a parameter across runs, use `innovation_args`. For example, to sweep the intertemporal interest-rate innovation: -```bash -java -jar multirun.jar -config "intertemporal elasticity.yml" -g false -``` +```yaml +maxNumberOfRuns: 3 +model_args: + enableIntertemporalOptimisations: true -### `labour supply elasticity.yml` - -Use for labour-supply elasticity sensitivity (3 runs with labour-income innovation pattern). - -Command: - -```bash -java -jar multirun.jar -config "labour supply elasticity.yml" -g false -``` - -## Targeted output scenarios - -### `employmentTransStats.yml` - -Use when you mainly want employment transition statistics and minimal other persisted outputs. 
- -Command: - -```bash -java -jar multirun.jar -config employmentTransStats.yml -g false +innovation_args: + intertemporalElasticityInnov: true ``` -## Social care scenario family +### Saving and reusing a behavioural grid -### `sc calibration.yml` +If you have computed a decision grid for a baseline scenario and want to reuse it in a counterfactual: -Use to calibrate preference parameters for social care analysis. +```yaml +# Baseline run — saves the grid +model_args: + enableIntertemporalOptimisations: true + saveBehaviour: true + # readGrid is set to the run name automatically -Command: - -```bash -java -jar multirun.jar -config "sc calibration.yml" -g false +# Counterfactual run — loads the saved grid +model_args: + enableIntertemporalOptimisations: true + useSavedBehaviour: true + readGrid: "my_baseline_run" ``` -### `sc analysis0.yml` - -Base social care analysis run with social care enabled and alignment on. - -Command: - -```bash -java -jar multirun.jar -config "sc analysis0.yml" -g false -``` - -### `sc analysis1.yml` - -Main social care analysis run with named behavioral grid output (`saveBehaviour: true`, `readGrid: "sc analysis1"`). - -Command: - -```bash -java -jar multirun.jar -config "sc analysis1.yml" -g false -``` - -### `sc analysis1b.yml` - -Variant of analysis1 with `alignPopulation: false` and `useSavedBehaviour: true` for comparison. - -Command: - -```bash -java -jar multirun.jar -config "sc analysis1b.yml" -g false -``` - -### `sc analysis2.yml` - -Zero-costs social care scenario (`flagSuppressChildcareCosts: true`, `flagSuppressSocialCareCosts: true`). - -Command: - -```bash -java -jar multirun.jar -config "sc analysis2.yml" -g false -``` - -### `sc analysis3.yml` - -Ignore-costs response scenario that reuses behavior from analysis2 (`useSavedBehaviour: true`, `readGrid: "sc analysis2"`). 
-
-Command:
-
-```bash
-java -jar multirun.jar -config "sc analysis3.yml" -g false
-```
+---
 
 ## Practical notes
 
-- Use quotes around config filenames that contain spaces.
+- Use quotes around config filenames that contain spaces: `-config "my config.yml"`.
 - Add `-f` to write run logs to `output/logs/`.
-- Override config values via CLI flags when needed (for example `-n`, `-r`, `-P`, `-g`).
+- Override individual values at runtime without editing the YAML, for example `-n 10` overrides `maxNumberOfRuns`.
+- Add `-P none` when you do not need the processed dataset to persist between runs (faster).

From 4a0bb37ad98cb861d3537c9813564a0c2a85a4a1 Mon Sep 17 00:00:00 2001
From: hk-2029
Date: Sat, 14 Mar 2026 13:13:12 +0000
Subject: [PATCH 02/23] fix: set heading colours to navy, fix grey faded headings

---
 documentation/wiki/assets/css/extra.css | 15 +++++++++------
 1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/documentation/wiki/assets/css/extra.css b/documentation/wiki/assets/css/extra.css
index 04fef6eca..94ee99161 100644
--- a/documentation/wiki/assets/css/extra.css
+++ b/documentation/wiki/assets/css/extra.css
@@ -190,10 +190,6 @@
    CONTENT — TYPOGRAPHY
    ═══════════════════════════════════════════════ */
 
-.md-content {
-  max-width: 820px;
-}
-
 .md-typeset {
   font-size: 0.82rem;
   line-height: 1.6;
@@ -203,6 +199,7 @@
   font-weight: 600;
   font-size: 1.45rem;
   letter-spacing: -0.015em;
+  color: var(--sp-primary);
   border-bottom: 2px solid transparent;
   border-image: var(--sp-gradient) 1;
   padding-bottom: 0.3rem;
@@ -215,7 +212,7 @@
   letter-spacing: -0.01em;
   margin-top: 1.4rem;
   margin-bottom: 0.45rem;
-  color: var(--md-default-fg-color);
+  color: var(--sp-primary);
 }
 
 .md-typeset h3 {
@@ -223,7 +220,13 @@
   font-size: 1.1rem;
   margin-top: 1rem;
   margin-bottom: 0.3rem;
-  color: var(--md-default-fg-color);
+  color: var(--sp-primary);
+}
+
+[data-md-color-scheme="slate"] .md-typeset h1,
+[data-md-color-scheme="slate"] .md-typeset h2,
+[data-md-color-scheme="slate"] .md-typeset h3 {
+  color: #a8c8e8;
 }
 
 /* Paragraph justification */

From b90c278879d1e051e0f6e13d27d2d2581bea5232 Mon Sep 17 00:00:00 2001
From: hk-2029
Date: Sat, 14 Mar 2026 13:43:10 +0000
Subject: [PATCH 03/23] docs: trim redundancy, fix accuracy issues, improve navigation

- model-concepts.md: remove Modules in depth (redundant with website),
  compress intro, remove Key input files table, trim EUROMOD section;
  keep agent hierarchy with code-level attributes, process order table,
  alignment flags table, and IO/intertemporal explanations
- data-and-outputs.md: replace flat list with directory tree, remove
  reg_/align_ file listing (covered by parameterisation page), add
  pointer to variable codebook for output CSV descriptions
- README.md: add one-liner clarifying scope vs website, remove
  redundant Scope section
- scenario-cookbook.md: note that model_args keys map to @GUIparameter
  fields on SimPathsModel
- wiki/getting-started/environment-setup.md: fix Java version (19, not
  11) and build command (mvn clean package, not install -DskipTests)

Co-Authored-By: Claude Sonnet 4.6
---
 documentation/README.md                       |  15 +-
 documentation/data-and-outputs.md             |  80 ++++----
 documentation/model-concepts.md               | 187 +++++-------------
 documentation/scenario-cookbook.md            |   2 +
 .../wiki/getting-started/environment-setup.md |   8 +-
 5 files changed, 107 insertions(+), 185 deletions(-)

diff --git a/documentation/README.md b/documentation/README.md
index 0210599e6..8c2252510 100644
--- a/documentation/README.md
+++ b/documentation/README.md
@@ -1,6 +1,8 @@
 # SimPaths Documentation
 
-This documentation is structured to support both first-time users and contributors.
+These files are a **CLI- and developer-workflow quick reference** for working directly with the repository — building, running, configuring, and troubleshooting from the command line. For the full model documentation (simulated modules, parameterisation, GUI usage, country variants, research), see the [website](../documentation/wiki/index.md).
+
+---
 
 ## Recommended reading order
 
@@ -17,17 +19,6 @@ For contributors and advanced users:
 - [Architecture](architecture.md) — source package structure and data flow
 - [Development and Testing](development.md) — build, tests, CI, contributor workflow
 
-## Scope
-
-These guides cover:
-
-- Understanding the simulation model and its mechanisms
-- Building SimPaths with Maven
-- Running single-run and multi-run workflows
-- Configuring model, collector, and runtime behavior via YAML
-- Understanding expected input/output files and generated artifacts
-- Running unit and integration tests locally and in CI
-
 ## Conventions
 
 - Commands are shown from the repository root.
diff --git a/documentation/data-and-outputs.md b/documentation/data-and-outputs.md
index 0e7ef0d13..81e894c68 100644
--- a/documentation/data-and-outputs.md
+++ b/documentation/data-and-outputs.md
@@ -1,56 +1,62 @@
 # Data and Outputs
 
-## Data availability model
+## Data availability
 
 - Source code and documentation are open.
-- Full research input datasets are not freely redistributable.
-- Training data is included to support development, local testing, and CI.
+- Full research input datasets (UKHLS initial population, UKMOD policy outputs) are not freely redistributable — see [Getting Started / Data](../documentation/wiki/getting-started/data/index.md) on the website for access instructions.
+- Training data is included in the repository to support development, local testing, and CI.
 
 ## Input directory layout
 
-Key paths:
-
-- `input/`:
-  - regression and scenario Excel files (`reg_*.xlsx`, `scenario_*.xlsx`, `align_*.xlsx`)
-  - generated setup files (`input.mv.db`, `EUROMODpolicySchedule.xlsx`, `DatabaseCountryYear.xlsx`)
-- `input/InitialPopulations/`:
-  - `training/population_initial_UK_2019.csv`
-  - `compile/` scripts for preparing initial-population inputs
-- `input/EUROMODoutput/`:
-  - `training/*.txt` policy outputs and schedule artifacts
+```
+input/
+├── InitialPopulations/
+│   ├── training/  # training population CSV (included in repo)
+│   └── compile/  # Stata do-files for extracting UKHLS data
+├── EUROMODoutput/
+│   └── training/  # training UKMOD outputs (included in repo)
+├── input.mv.db  # H2 donor database — generated by setup
+├── EUROMODpolicySchedule.xlsx  # policy year mapping — generated by setup
+├── DatabaseCountryYear.xlsx  # macro parameters — generated by setup
+├── reg_*.xlsx  # regression coefficient tables
+├── align_*.xlsx  # alignment targets
+├── projections_*.xlsx  # demographic projections
+└── scenario_*.xlsx  # scenario-specific parameter overrides
+```
+
+For a description of each `reg_`, `align_`, and `scenario_` file, see [Model Parameterisation](../documentation/wiki/overview/parameterisation.md) on the website.
 
 ## Setup-generated artifacts
 
-Running setup mode (`singlerun` setup or `multirun -DBSetup`) creates or refreshes:
-
-- `input/input.mv.db`
-- `input/EUROMODpolicySchedule.xlsx`
-- `input/DatabaseCountryYear.xlsx`
+Running setup mode (`singlerun -Setup` or `multirun -DBSetup`) creates or refreshes:
 
-## Output directory layout
+- `input/input.mv.db` — H2 database of EUROMOD donor tax-benefit outcomes
+- `input/EUROMODpolicySchedule.xlsx` — maps simulation years to EUROMOD policy systems
+- `input/DatabaseCountryYear.xlsx` — country- and year-specific macro parameters
 
-Simulation runs produce timestamped folders under `output/`, typically with:
+These three files must exist before any simulation run. If they are missing, re-run setup.
 
-- `csv/` generated statistics and exported entities
-- `database/` run-specific persistence output
-- `input/` copied or persisted run input artifacts
-
-Common CSV files include:
+## Output directory layout
 
-- `Statistics1.csv`
-- `Statistics21.csv`
-- `Statistics31.csv`
-- `EmploymentStatistics1.csv`
-- `HealthStatistics1.csv`
+Simulation runs produce timestamped folders under `output/`:
 
-## Logging output
+```
+output//
+├── csv/
+│   ├── Statistics1.csv
+│   ├── Statistics21.csv
+│   ├── Statistics31.csv
+│   ├── EmploymentStatistics1.csv
+│   └── HealthStatistics1.csv
+├── database/  # run-specific persistence output
+└── input/  # copied/persisted run input artifacts
+```
 
-If `-f` is enabled with `multirun.jar`, logs are written to:
+For a description of the variables in these CSV files, see `documentation/SimPaths_Variable_Codebook.xlsx`.
 
-- `output/logs/run_.txt` (stdout capture)
-- `output/logs/run_.log` (log4j output)
+## Logging
 
-## Validation and analysis assets
+With `-f` on `multirun.jar`, logs are written to:
 
-- `validation/` contains validation artifacts and graph assets.
-- `analysis/` contains `.do` scripts and spreadsheets used for downstream analysis.
+- `output/logs/run_.txt` — stdout capture
+- `output/logs/run_.log` — log4j output
diff --git a/documentation/model-concepts.md b/documentation/model-concepts.md
index e154e820a..a9f7a543e 100644
--- a/documentation/model-concepts.md
+++ b/documentation/model-concepts.md
@@ -1,23 +1,8 @@
 # Model Concepts
 
-This page explains what SimPaths simulates, how it is structured, and how its core mechanisms work. It is intended as the conceptual companion to the operational guides (getting-started, configuration, cli-reference).
+SimPaths is a dynamic population microsimulation model that advances a starting population of real households forward in time, year by year, simulating individual life events through statistical regression models and rule-based processes. For the full academic description — including the 11 simulated modules — see the [Overview](../documentation/wiki/overview/index.md) section of the website, in particular [Simulated Modules](../documentation/wiki/overview/simulated-modules.md).
 
----
-
-## What SimPaths is
-
-SimPaths is a dynamic population microsimulation model. It takes a sample of real households as a starting population and advances them forward in time, year by year, simulating individual life events using a combination of statistical regression models and rule-based processes.
-
-The output is a longitudinal synthetic population whose trajectories can be used to study policy scenarios, distributional outcomes, and the long-run consequences of demographic and economic change.
-
-### Supported countries
-
-| Code | Country |
-|------|---------|
-| `UK` | United Kingdom |
-| `IT` | Italy |
-
-Country selection affects which initial population, regional classifications, EUROMOD policy schedule, and regression coefficients are loaded. The two countries share the same model structure but are fully parameterised separately.
+This page covers what you need to understand the **code and configuration**: agent structure, the annual process order, alignment flags, and the tax-benefit system.
 
 ---
 
@@ -27,143 +12,95 @@ The simulation maintains three nested entity types.
 
 ### Person
 
-The individual. Each person carries their own demographic, health, education, labour, and income attributes. Almost all behavioural processes (health, education, fertility, partnership, labour supply) are resolved at the person level.
+The individual. Each person carries their own demographic, health, education, labour, and income attributes. Almost all behavioural processes are resolved at the person level.
 
-Key attributes tracked per person include:
+Key attributes tracked per person:
 
 - **Demographics**: age, gender, region
-- **Education**: highest qualification (Low / Medium / High / InEducation), mother's and father's education
-- **Labour market status**: one of `EmployedOrSelfEmployed`, `NotEmployed`, `Student`, `Retired`; weekly hours worked; wage rate; work history in months
-- **Health**: physical health (SF-12 PCS), mental health (SF-12 MCS, GHQ-12 psychological distress score, caseness indicator), life satisfaction, EQ-5D utility score, disability / care-need flag
+- **Education**: highest qualification (`Low` / `Medium` / `High` / `InEducation`), mother's and father's education
+- **Labour market status**: `EmployedOrSelfEmployed`, `NotEmployed`, `Student`, or `Retired`; weekly hours worked; wage rate; work history in months
+- **Health**: physical health (SF-12 PCS), mental health (SF-12 MCS, GHQ-12 psychological distress, caseness indicator), life satisfaction (0–10), EQ-5D utility score, disability/care-need flag
 - **Partnership**: partner reference, years in partnership
 - **Income**: gross labour income, capital income, pension income, benefit receipt flags (UC and non-UC)
-- **Social care**: formal and informal care hours received per week; care provision hours per week
+- **Social care**: formal and informal care hours received per week; informal care hours provided per week
 - **Financial wellbeing**: equivalised disposable income, lifetime income trajectory, financial distress flag
 
 ### BenefitUnit
 
-The tax-and-benefit assessment unit — typically an adult (or a couple) and their dependent children. Benefits and taxes are computed at this level, mirroring how real-world tax-benefit systems work.
+The tax-and-benefit assessment unit — typically an adult (or couple) and their dependent children. Taxes and benefits are computed here, mirroring how real-world tax-benefit systems work.
 
-Key attributes include:
+Key attributes:
 
-- region, homeownership flag
-- equivalised disposable income (EDI) and year-on-year change in log-EDI
-- poverty flag (< 60 % of median equivalised household disposable income)
-- discretionary consumption (when intertemporal optimisation is enabled)
+- Region, homeownership flag, wealth
+- Equivalised disposable income (EDI) and year-on-year change in log-EDI
+- Poverty flag (< 60% of median equivalised household disposable income)
+- Discretionary consumption (when intertemporal optimisation is enabled)
 
 ### Household
 
-A grouping of benefit units sharing an address. Used for aggregation and housing-related logic. A household may contain more than one benefit unit (for example, adult children living with parents before leaving home).
+A grouping of benefit units sharing an address. Used for aggregation and housing-related logic. A household may contain more than one benefit unit (e.g. adult children living with parents before leaving home).
 
 ---
 
 ## Annual simulation cycle
 
-SimPaths uses **discrete annual time steps**. Within each year, processes fire in a fixed order. The table below lists the ordered steps as scheduled in `SimPathsModel.buildSchedule()`.
+SimPaths uses **discrete annual time steps**. Within each year, processes fire in a fixed order defined in `SimPathsModel.buildSchedule()`.
 
 | # | Process | Level | Description |
 |---|---------|-------|-------------|
-| 1 | StartYear | model | Housekeeping, year logging |
+| 1 | StartYear | model | Year logging and housekeeping |
 | 2 | RationalOptimisation | model | *First year only.* Pre-computes intertemporal decision grids (if enabled) |
 | 3 | UpdateParameters | model | Loads year-specific parameters and time-series factors |
 | 4 | GarbageCollection | model | Removes stale entity references |
 | 5 | UpdateWealth | benefit unit | Updates savings/wealth stocks (if intertemporal enabled) |
-| 6 | Update | benefit unit | Refreshes household composition counts, clears state flags |
-| 7 | Update | person | Refreshes individual-level state variables and lags |
-| 8 | **Aging** | person | Increments age; checks whether individuals age out of the population |
+| 6 | Update | benefit unit | Refreshes composition counts, clears state flags |
+| 7 | Update | person | Refreshes state variables and lag values |
+| 8 | Aging | person | Increments age; dependent children reaching independence are split into their own benefit unit |
 | 9 | ConsiderRetirement | person | Stochastic retirement decision |
-| 10 | **InSchool** | person | Whether person remains in / enters education (age 16–29) |
-| 11 | InSchoolAlignment | model | Aligns school participation rates to targets (if enabled) |
-| 12 | LeavingSchool | person | Transition out of education |
+| 10 | InSchool | person | Whether person remains in / enters education (age 16–29) |
+| 11 | InSchoolAlignment | model | Aligns school participation rate to targets (if enabled) |
+| 12 | LeavingSchool | person | Transition out of education; assigns completed qualification |
 | 13 | EducationLevelAlignment | model | Aligns completed education distribution (if enabled) |
 | 14 | Homeownership | benefit unit | Homeownership transition |
-| 15 | **Health** | person | Updates physical health and disability status |
-| 16 | UpdatePotentialHourlyEarnings | person | Refreshes wage potential for labour supply decisions |
+| 15 | Health | person | Updates physical health and disability status |
+| 16 | UpdatePotentialHourlyEarnings | person | Refreshes wage potential prior to labour supply decisions |
 | 17 | CohabitationAlignment | model | Aligns cohabitation share to targets (if enabled) |
-| 18 | **Cohabitation** | person | Entry into partnership |
-| 19 | PartnershipDissolution | person | Exit from partnership (separation / bereavement) |
-| 20 | **UnionMatching** | model | Matches unpartnered individuals into new couples |
-| 21 | FertilityAlignment | model | Aligns birth rates to projected fertility (if enabled) |
-| 22 | **Fertility** | person | Fertility decision for women of childbearing age |
-| 23 | GiveBirth | person | Adds newborn children to simulation |
-| 24 | SocialCareReceipt | person | Care receipt (formal and informal) for those with care need |
-| 25 | SocialCareProvision | person | Provision of informal care by eligible individuals |
-| 26 | **Unemployment** | person | Unemployment transitions |
-| 27 | UpdateStates | benefit unit | Refreshes joint labour states for IO decision (if enabled) |
-| 28 | **LabourMarketAndIncomeUpdate** | model | Resolves labour supply, imputes taxes and benefits via EUROMOD donor matching |
-| 29 | ReceivesBenefits | benefit unit | Assigns benefit receipt flags from donor match |
+| 18 | Cohabitation | person | Entry into partnership |
+| 19 | PartnershipDissolution | person | Exit from partnership (separation or bereavement) |
+| 20 | UnionMatching | model | Matches unpartnered individuals into new couples |
+| 21 | FertilityAlignment | model | Scales birth probabilities to projected fertility rates (if enabled) |
+| 22 | Fertility | person | Fertility decision for women of childbearing age |
+| 23 | GiveBirth | person | Adds newborn children to the simulation |
+| 24 | SocialCareReceipt | person | Formal and informal care receipt for those with a care need |
+| 25 | SocialCareProvision | person | Informal care provision by eligible individuals |
+| 26 | Unemployment | person | Unemployment transitions |
+| 27 | UpdateStates | benefit unit | Refreshes joint labour states for IO decisions (if enabled) |
+| 28 | LabourMarketAndIncomeUpdate | model | Resolves labour supply; imputes taxes and benefits via EUROMOD donor matching |
+| 29 | ReceivesBenefits | benefit unit | Assigns benefit receipt flags from the donor match |
 | 30 | ProjectDiscretionaryConsumption | benefit unit | Consumption/savings decision (if intertemporal enabled) |
 | 31 | ProjectEquivConsumption | person | Computes individual equivalised consumption share |
 | 32 | CalculateChangeInEDI | benefit unit | Updates equivalised disposable income and year-on-year change |
 | 33 | ReviseLifetimeIncome | person | Updates lifetime income trajectory (if intertemporal enabled) |
-| 34 | **FinancialDistress** | person | Financial distress indicator |
-| 35–40 | **Mental health and wellbeing** | person | GHQ-12 psychological distress (levels and caseness, two-step); SF-12 MCS and PCS; life satisfaction (all two-step) |
-| 41 | **ConsiderMortality** | person | Stochastic mortality |
+| 34 | FinancialDistress | person | Financial distress indicator |
+| 35–40 | Mental health and wellbeing | person | GHQ-12 distress (levels + caseness, two steps each); SF-12 MCS and PCS (two steps each); life satisfaction (two steps) |
+| 41 | ConsiderMortality | person | Stochastic mortality |
 | 42 | HealthEQ5D | person | EQ-5D utility score update |
-| 43 | PopulationAlignment | model | Re-weights or resamples population to match demographic projections |
+| 43 | PopulationAlignment | model | Re-weights/resamples population to match demographic projections |
 | 44 | EndYear / UpdateYear | model | Year-end housekeeping |
 
----
-
-## Modules in depth
-
-### Education
-
-Individuals aged 16–29 are assessed each year for whether they remain in education (`InSchool`) and whether they have left (`LeavingSchool`). Upon leaving, their highest completed qualification (Low / Medium / High) is determined. Parent education levels are tracked as covariates in child education models.
-
-Optional alignment (`alignInSchool`, `alignEducation`) can anchor simulated shares to empirical targets.
-
-### Health
-
-Physical health is updated annually using regression models. The result feeds into disability and care-need flags, which then govern social care processes.
-
-Mental health and wellbeing are resolved later in the cycle (after income is determined), reflecting the evidence that material conditions affect mental health outcomes. Multiple constructs are tracked:
-
-- **GHQ-12 psychological distress** — continuous score (0–12 Likert) and caseness indicator, each resolved in two steps (baseline prediction + exposure adjustment)
-- **SF-12 MCS/PCS** — mental and physical component summary scores, two-step
-- **Life satisfaction** — 0–10 score, two-step
-- **EQ-5D** — health utility index updated at year end
-
-### Partnership
-
-Cohabitation entry is modelled at the person level; union matching is handled at the model level via a matching algorithm that pairs eligible singles. Partnership dissolution (separation or death of partner) is also modelled stochastically. Alignment of cohabitation shares to targets is available via `alignCohabitation`.
-
-### Fertility
-
-Women of childbearing age receive a fertility draw each year. A separate alignment step (`FertilityAlignment`) can scale individual probabilities to match aggregate fertility projections from population statistics.
-
-### Social care
-
-When `projectSocialCare` is enabled, individuals with a care need draw formal and informal care hours. Separate provision processes model informal care given by family members and others. A market-clearing step can reconcile supply and demand.
-
-### Labour market and income
-
-Labour supply is resolved for each benefit unit by choosing over a discrete set of hours options for each adult. The model supports:
-
-- **Intertemporal optimisation** (`enableIntertemporalOptimisations`): decision grids are pre-computed in the first year; each subsequent year agents select hours to maximise inter-period utility given expected future income.
-- **Static labour supply** (default): hours are drawn from regression models without forward-looking optimisation.
-
-After hours are chosen, taxes and benefits are imputed using **EUROMOD donor matching** (see below).
-
-### Homeownership
-
-Homeownership transitions are modelled at the benefit unit level using a regression model, updating the homeownership flag each year.
-
-### Population alignment
-
-At the end of each year, `PopulationAlignment` re-weights or resamples the population to keep aggregate age-sex distributions consistent with external demographic projections. This ensures the simulated population does not drift away from official forecasts over long horizons.
+The first simulation year runs a subset of these (some states are inherited directly from input data). All subsequent years run the full schedule.
 
 ---
 
 ## Alignment
 
-Alignment is a technique used in microsimulation to prevent simulated aggregate rates from drifting away from known targets. Rather than discarding individual-level stochastic variation, alignment rescales or resamples agents' outcomes so that the population total matches a target share or count.
+Alignment prevents simulated aggregate rates from drifting away from known targets. Rather than discarding individual-level stochastic variation, it rescales or resamples agents' outcomes so the population total matches a target share or count.
 
-SimPaths uses alignment for several dimensions, each controlled by a boolean flag in the config:
+Each dimension is controlled by a boolean flag in `model_args`:
 
 | Flag | What it aligns | Default |
-|------|---------------|---------|
-| `alignPopulation` | Age-sex population totals to demographic projections | `true` |
+|------|----------------|---------|
+| `alignPopulation` | Age-sex-region population totals to demographic projections | `true` |
 | `alignCohabitation` | Share of individuals in partnerships | `true` |
 | `alignFertility` | Birth rates to projected fertility rates | `false` |
 | `alignInSchool` | School participation rate (age 16–29) | `false` |
@@ -172,36 +109,20 @@ SimPaths uses alignment for several dimensions, each controlled by a boolean fla
 
 ---
 
-## EUROMOD integration and tax-benefit imputation
+## Tax-benefit system (EUROMOD donor matching)
 
-SimPaths does not compute taxes and benefits directly from first principles. Instead it uses a **donor matching** approach:
+SimPaths does not compute taxes and benefits from first principles. It uses **donor matching**:
 
-1. Before or at the start of a run, a database of tax-benefit outcomes is generated by running EUROMOD (a static tax-benefit microsimulation model) over a population of "donor" households across each policy year.
-2. During simulation, each benefit unit selects a donor whose characteristics (labour supply hours, earnings, household composition, region, year) closely match its own.
-3. The donor's EUROMOD-calculated disposable income, tax liability, and benefit amounts are imputed to the simulated benefit unit.
+1. A database of tax-benefit outcomes is pre-computed by running EUROMOD/UKMOD over a population of "donor" households for each policy year.
+2. Each simulated benefit unit selects a donor whose characteristics (labour hours, earnings, household composition, region, year) closely match its own.
+3. The donor's computed disposable income, tax, and benefit amounts are imputed to the simulated unit.
 
-This means simulated households benefit from EUROMOD's detailed and annually updated policy rules without requiring SimPaths to re-implement the full tax-benefit schedule. The policy schedule loaded per simulation year is controlled by `input/EUROMODpolicySchedule.xlsx`.
+This gives SimPaths annually updated policy rules without re-implementing the full tax-benefit schedule. See [Tax-Benefit Donors (UK)](../documentation/wiki/getting-started/data/tax-benefit-donors-uk.md) for how to generate the donor database.
 
 ---
 
 ## Intertemporal optimisation
 
-When `enableIntertemporalOptimisations: true`, SimPaths solves a life-cycle consumption and labour supply problem. Decision grids are pre-computed in year 0 (`RationalOptimisation`) by solving backwards over the remaining simulation horizon. At each subsequent year, agents look up their optimal choice from the pre-computed grid given their current state.
-
-This is computationally intensive. It is disabled by default. The sensitivity of behaviour to assumed interest rates and labour income can be explored using the `interestRateInnov` and `disposableIncomeFromLabourInnov` parameters.
-
----
-
-## Key input files
-
-Most files in `input/` are regression coefficient tables (`reg_*.xlsx`), alignment targets (`align_*.xlsx`), and scenario overrides (`scenario_*.xlsx`). The most important ones to understand:
+When `enableIntertemporalOptimisations: true`, SimPaths solves a life-cycle consumption and labour supply problem. Decision grids are pre-computed in year 0 (`RationalOptimisation`) by solving backwards over the remaining horizon. In each subsequent year agents look up their optimal choice from the grid given their current state.
 
-| File | Purpose |
-|------|---------|
-| `input/EUROMODpolicySchedule.xlsx` | Maps simulation years to EUROMOD policy systems; generated by setup |
-| `input/DatabaseCountryYear.xlsx` | Country- and year-specific macro parameters (wages, prices, etc.) |
-| `input/input.mv.db` | H2 database containing the donor tax-benefit unit pool; generated by setup |
-| `input/InitialPopulations/…/population_initial_UK_2019.csv` | Starting population cross-section |
-| `input/reg_labourSupplyUtility.xlsx` | Labour supply utility function coefficients |
-| `input/reg_lifetime_incomes.xlsx` | Lifetime income projection coefficients (used by IO module) |
-| `input/align_popProjections.xlsx` | Official population projections used for demographic alignment |
+This is computationally intensive and disabled by default. When enabled, `saveBehaviour` and `useSavedBehaviour` allow a baseline grid to be reused in counterfactual runs without recomputing it — see [Scenario Cookbook](scenario-cookbook.md) for an example.
diff --git a/documentation/scenario-cookbook.md b/documentation/scenario-cookbook.md
index 848e6e906..6300f6392 100644
--- a/documentation/scenario-cookbook.md
+++ b/documentation/scenario-cookbook.md
@@ -38,6 +38,8 @@ java -jar multirun.jar -config test_run.yml -P root
 
 Place a new `.yml` file in `config/` and pass it via `-config`. You only need to specify the values you want to override — everything else inherits defaults from `default.yml` or class field defaults.
 
+The keys under `model_args` map directly to the `@GUIparameter`-annotated fields on `SimPathsModel` — so anything you can set in the GUI can also be set here.
+
 ### Minimal template
 
 ```yaml
diff --git a/documentation/wiki/getting-started/environment-setup.md b/documentation/wiki/getting-started/environment-setup.md
index d121cab7c..e3731590a 100644
--- a/documentation/wiki/getting-started/environment-setup.md
+++ b/documentation/wiki/getting-started/environment-setup.md
@@ -6,8 +6,8 @@
 
 ## Requirements
 
-- Java Development Kit (JDK) 11 or later
-- Apache Maven 3.6 or later
+- Java Development Kit (JDK) 19 (the project targets Java 19 — earlier versions will not compile)
+- Apache Maven 3.8 or later
 - Git
 
 ## Cloning the repository
@@ -20,7 +20,9 @@ cd SimPaths
 ## Building the project
 
 ```bash
-mvn clean install -DskipTests
+mvn clean package
 ```
 
+This produces `singlerun.jar` and `multirun.jar` at the repository root.
+
 Refer to the [Working in GitHub](../developer-guide/working-in-github.md) guide for the full development workflow.

From 1bf0811a2504a4cf25a7c503a2ef26cd73f86998 Mon Sep 17 00:00:00 2001
From: hk-2029
Date: Sat, 14 Mar 2026 14:22:36 +0000
Subject: [PATCH 04/23] docs: annotate default.yml and expand configuration.md
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

- config/default.yml: add inline comment for every field, including
  exact numeric behavior for innovation_args shocks and S-Index
  parameter meanings
- documentation/configuration.md: replace flat bullet lists with
  tables; add S-Index explanation, full innovation_args behavior (seed
  increment, ±0.0075 interest rate, ±0.01 income shock), and clear
  Statistics1/2/3/Employment/Health file descriptions

Co-Authored-By: Claude Sonnet 4.6
---
 config/default.yml             | 248 ++++++++++++++++++++++-----------
 documentation/configuration.md | 139 +++++++++++++-----
 2 files changed, 271 insertions(+), 116 deletions(-)

diff --git a/config/default.yml b/config/default.yml
index 631b016c3..d4aafba31 100644
--- a/config/default.yml
+++ b/config/default.yml
@@ -1,89 +1,177 @@
-# This file can be used to override defaults for multirun arguments.
-# Arguments of the SimPathsMultiRun object overridden by the command-line
-
-maxNumberOfRuns: 1
-executeWithGui: false
-randomSeed: 606
-startYear: 2019
-endYear: 2022
-popSize: 50000
-# countryString: "United Kingdom"
-# integrationTest: false
-
-# Arguments passed to the SimPathsModel
+# SimPaths multi-run configuration file.
+# Uncomment and edit any field to override its default value.
+# CLI flags take final precedence over anything set here.
+
+# ── Top-level run arguments ────────────────────────────────────────────────────
+
+maxNumberOfRuns: 1  # number of sequential simulation runs
+executeWithGui: false  # true = launch JAS-mine GUI; false = headless (required on servers/CI)
+randomSeed: 606  # seed for the first run; incremented automatically if randomSeedInnov is true
+startYear: 2019  # first year of simulation (must have matching input/donor data)
+endYear: 2022  # last year of simulation (inclusive)
+popSize: 50000  # simulated population size (larger = more accurate, slower)
+# countryString: "United Kingdom"  # "United Kingdom" or "Italy" (auto-detected from donor DB if omitted)
+# integrationTest: false  # true = write output to a fixed folder for comparison in CI tests
+
+
+# ── model_args: passed to SimPathsModel ───────────────────────────────────────
+# All keys map directly to @GUIparameter fields on SimPathsModel.
+# Values shown are the class defaults.
+
 model_args:
-# maxAge: 130
-# fixTimeTrend: true
-# timeTrendStopsIn: 2017
-# timeTrendStopsInMonetaryProcesses: 2017
-# fixRandomSeed: true
-# sIndexTimeWindow: 5
-# sIndexAlpha: 2
-# sIndexDelta: 0
-# savingRate: 0
-# initialisePotentialEarningsFromDatabase: true
-# useWeights: false
-# useSBAMMatching:
-# projectMortality: true
-# alignPopulation: true
-# alignFertility: true
-# alignEducation: false
-# alignInSchool: false
-# alignCohabitation: false
-# labourMarketCovid19On: false
-# projectFormalChildcare: true
-# donorPoolAveraging: true
-# alignEmployment: false
-# projectSocialCare: false
-# addRegressionStochasticComponent: true
-# fixRegressionStochasticComponent: false
-# flagSuppressChildcareCosts: false
-# flagSuppressSocialCareCosts: false
+
+  # --- Time trend controls ---
+# maxAge: 130  # maximum age kept in simulation; persons above this are removed
+# fixTimeTrend: true  # if true, freezes the time trend in regression equations
+# timeTrendStopsIn: 2017  # year at which the time trend is frozen (if fixTimeTrend: true)
+# timeTrendStopsInMonetaryProcesses: 2017  # same freeze year applied to monetary/income regressions only
+
+  # --- Random number controls ---
+# fixRandomSeed: true  # if true, each run uses the same fixed seed (randomSeedIfFixed)
+
+  # --- Income security (S-Index) ---
+  # The S-Index is an economic (in)security index computed from a rolling window of
+  # equivalised consumption, discounted and weighted by a risk-aversion parameter.
+  # SIndex_p50 is reported in Statistics1.csv each year.
+# sIndexTimeWindow: 5  # length of rolling window in years (default 5)
+# sIndexAlpha: 2  # coefficient of relative risk aversion (higher = more sensitivity to drops)
+# sIndexDelta: 0.98  # annual discount factor applied to past consumption observations
+
+  # --- Savings ---
+# savingRate: 0.056  # fraction of equivalised disposable income saved (used when IO is disabled);
+        # default is OECD average UK household saving rate 2000–2019
+
+  # --- Wage initialisation ---
+# initialisePotentialEarningsFromDatabase: true  # initialise wage potential from donor DB rather than input CSV
+
+  # --- Population weighting ---
+# useWeights: false  # if true, apply survey weights in alignment and statistics calculations
+
+  # --- Matching method ---
+# useSBAMMatching:  # if true, use SBAM instead of standard union-matching algorithm
+
+  # --- Demographic projections ---
+# projectMortality: true  # if false, disables stochastic mortality (population does not die)
+
+  # --- Alignment flags ---
+  # See model-concepts.md for a full explanation of what alignment does.
+# alignPopulation: true  # align age-sex-region totals to official population projections
+# alignFertility: true  # scale birth probabilities to match projected fertility rates
+# alignEducation: false  # align completed education distribution to targets
+# alignInSchool: false  # align school participation rate (age 16–29) to targets
+# alignCohabitation: false  # align share of cohabiting individuals to targets
+# alignEmployment: false  # align employment share to targets
+
+  # --- Labour market modules ---
+# labourMarketCovid19On: false  # enable reduced-form month-by-month COVID-19 labour market module
+        # (applies to years 2020–2021 in the baseline parameterisation)
+
+  # --- Social care and childcare ---
+# projectFormalChildcare: true  # simulate formal childcare costs
+# projectSocialCare: false  # simulate social care receipt and provision module
+# flagSuppressChildcareCosts: false  # if true, set formal childcare costs to zero (scenario use)
+# flagSuppressSocialCareCosts: false  # if true, set social care costs to zero (scenario use)
+
+  # --- Tax-benefit imputation ---
+# donorPoolAveraging: true  # if true, average disposable income over k nearest-neighbour donors
+        # rather than using the single closest donor; reduces imputation volatility
+
+  # --- Regression stochasticity ---
+# addRegressionStochasticComponent: true  # include the residual draw in regression predictions
+# fixRegressionStochasticComponent: false  # if true, draw the residual once and hold it fixed
+        # across years (currently applies to wage equations only)
+
+  # --- Time-series defaults ---
+# flagDefaultToTimeSeriesAverages: true  # if true, use the sample average of time-series variables
+        # rather than the year-specific value when data is unavailable
+
+  # --- Intertemporal optimisation (IO) ---
+  # Enables backward-induction life-cycle solution for consumption and labour supply.
+  # Decision grids are pre-computed in year 0; agents look up optimal choices each year.
+  # Computationally intensive — disabled by default.
 # enableIntertemporalOptimisations: true
-# flagDefaultToTimeSeriesAverages: true
-# responsesToLowWageOffer: true
-# responsesToPension: false
-# saveImperfectTaxDBMatches: false
-# useSavedBehaviour: false
-# readGrid: "laptop serial"
-# saveBehaviour: true
-# employmentOptionsOfPrincipalWorker: 3
-# employmentOptionsOfSecondaryWorker: 3
-# responsesToEducation: true
-# responsesToRetirement: false
-# responsesToHealth: true
-# responsesToDisability: false
-# minAgeForPoorHealth: 50
-# responsesToRegion: false
-# ignoreTargetsAtPopulationLoad: false
-
-# Arguments that alter processing of the SimPathsMultiRun object
+
+  # IO state-space: which characteristics agents respond to when choosing labour/consumption.
+  # Each flag adds a dimension to the grid and increases solve time.
+# responsesToHealth: true  # include physical health in IO state space
+# responsesToDisability: false  # include disability status in IO state space
+# responsesToEducation: true  # include student and education level in IO state space
+# responsesToPension: false  # include private pension wealth in IO state space
+# responsesToRetirement: false  # include retirement state (and private pension) in IO state space
+# responsesToLowWageOffer: true  # include unemployment/low-wage-offer risk in IO state space
+# responsesToRegion: false  # include geographic region in IO state space
+# minAgeForPoorHealth: 45  # minimum age from which less-than-perfect health enters state space
+
+  # IO employment options
+# employmentOptionsOfPrincipalWorker: 3  # number of discrete hours options for the principal earner
+# employmentOptionsOfSecondaryWorker: 3  # number of discrete hours options for the secondary earner
+
+  # IO grid persistence — save/reuse pre-computed grids across runs
+# saveBehaviour: true  # save decision grids to output folder after solving
+# useSavedBehaviour: false  # load grids from a previous run instead of recomputing
+# readGrid: "test1"  # name of the run whose grids to load (must match a folder in output/)
+
+  # IO diagnostics
+# saveImperfectTaxDBMatches: false  # log cases where tax-benefit donor matching falls back to a coarser regime
+
+  # --- Population load ---
+# ignoreTargetsAtPopulationLoad: false  # if true, skip alignment-target checks when loading the initial population
+
+
+# ── innovation_args: parameter variation across sequential runs ────────────────
+# These flags control how parameters change between run 0, run 1, run 2, etc.
+# Useful for sensitivity analysis and uncertainty quantification.
+
 innovation_args:
-# randomSeedInnov: false
-# flagDatabaseSetup: false
-# intertemporalElasticityInnov: false
-# labourSupplyElasticityInnov: true
+# randomSeedInnov: true  # if true, increment randomSeed by 1 for each successive run
+  # (default true — each run gets a distinct seed)
+# flagDatabaseSetup: false  # if true, run database setup instead of simulation
+  # (equivalent to -DBSetup on the command line)
+# intertemporalElasticityInnov: false  # if true, applies interest rate shocks across runs:
+  # run 1: +0.0075 (higher return to saving)
+  # run 2: -0.0075 (lower return to saving)
+  # requires maxNumberOfRuns >= 3 to see all variants
+# labourSupplyElasticityInnov: false  # if true, applies disposable income shocks across runs:
+  # run 1: +0.01 (higher net labour income)
+  # run 2: -0.01 (lower net labour income)
+  # requires maxNumberOfRuns >= 3 to see all variants
+
+
+# ── collector_args: output collection and export ───────────────────────────────
+# Controls what SimPathsCollector writes to CSV / database each year.
+#
+# Output files:
+#   Statistics1.csv — income distribution: Gini coefficients, income percentiles, median EDI, S-Index
+#   Statistics2.csv — demographic validation: partnership rates, employment, health, disability by age/gender
+#   Statistics3.csv — alignment diagnostics: simulated vs target rates and adjustment factors
+#   EmploymentStatistics.csv — labour market transitions and participation rates
+#   HealthStatistics.csv — health measures (SF-12, GHQ-12, EQ-5D) by age/gender
 collector_args:
-# calculateGiniCoefficients: false
-# exportToDatabase: false
-# exportToCSV: true
-# persistStatistics: true
-# persistStatistics2: true
-# persistStatistics3: true
-# persistPersons: false
-# persistBenefitUnits: false
-# persistHouseholds: false
-# persistEmploymentStatistics: false
-# dataDumpStartTime: 0L
-# dataDumpTimePeriod: 1.0
+# calculateGiniCoefficients: false  # compute Gini coefficients (also populates GUI charts); off by default for speed
+# exportToDatabase: false  # write outputs to H2 database (in addition to or instead of CSV)
+# exportToCSV: true  # write outputs to CSV files under output//csv/
+# persistStatistics: true  # write Statistics1.csv (income distribution)
+# persistStatistics2: true  # write Statistics2.csv (demographic validation outputs)
+# persistStatistics3: true  # write Statistics3.csv (alignment diagnostics)
+# persistPersons: false  # write one row per person per year (large files)
+# persistBenefitUnits: false  # write one row per benefit unit per year (large files)
+# persistHouseholds: false  # write one row per household per year
+# persistEmploymentStatistics: false  # write EmploymentStatistics.csv
+# dataDumpStartTime: 0L  # first year to write output (0 = startYear)
+# dataDumpTimePeriod: 1.0  # output frequency in years (1.0 = every year)
+
+
+# ── parameter_args: file paths and global flags ───────────────────────────────
 parameter_args:
-# input_directory: input
-# input_directory_initial_populations: input/InitialPopulations
-# euromod_output_directory: input/EUROMODoutput
-# trainingFlag: false
-# includeYears:
+# input_directory: input  # path to input data folder
+# input_directory_initial_populations: input/InitialPopulations  # path to initial population CSVs
+# euromod_output_directory: input/EUROMODoutput  # path to EUROMOD/UKMOD output files
+# trainingFlag: false  # if true, use training data from input/…/training/ subfolders
+  # (set automatically by test configs; do not set for research runs)
+# includeYears:  # list of policy years for which EUROMOD donor data is available;
+  # only these years will be included in the donor database
 # - 2011
 # - 2012
 # - 2013
@@ -96,4 +184,4 @@ parameter_args:
 # - 2020
 # - 2021
 # - 2022
-# - 2023
\ No newline at end of file
+# - 2023
diff --git a/documentation/configuration.md b/documentation/configuration.md
index 9641a31db..98f091be1 100644
--- a/documentation/configuration.md
+++ b/documentation/configuration.md
@@ -4,7 +4,7 @@ SimPaths multi-run behavior is controlled by YAML files in `config/`.
 This repository ships with three configs:
-- `default.yml` — standard baseline run
+- `default.yml` — standard baseline run (well-commented reference for all fields)
 - `test_create_database.yml` — database setup using training data
 - `test_run.yml` — short integration-test run
@@ -21,55 +21,125 @@ For command-by-command guidance and a template for building your own config, see
 ### Core run arguments
-Common fields:
+| Key | Default | Description |
+|-----|---------|-------------|
+| `maxNumberOfRuns` | `1` | Number of sequential simulation runs |
+| `executeWithGui` | `false` | `true` launches the JAS-mine GUI; `false` = headless (required on servers/CI) |
+| `randomSeed` | `606` | RNG seed for the first run; auto-incremented when `randomSeedInnov` is true |
+| `startYear` | `2019` | First simulation year (must have matching input/donor data) |
+| `endYear` | `2022` | Last simulation year (inclusive) |
+| `popSize` | `50000` | Simulated population size; larger = more accurate but slower |
+| `countryString` | auto | `"United Kingdom"` or `"Italy"`; auto-detected from donor DB if omitted |
+| `integrationTest` | `false` | Writes output to a fixed folder for CI comparison |
-- `countryString`
-- `maxNumberOfRuns`
-- `executeWithGui`
-- `randomSeed`
-- `startYear`
-- `endYear`
-- `popSize`
-- `integrationTest`
+---
 ### `model_args`
-Passed into `SimPathsModel` via reflection.
+Keys map directly to `@GUIparameter`-annotated fields on `SimPathsModel`. Anything settable in the GUI can also be set here.
-Typical toggles include:
+#### Alignment flags
-- alignment flags (`alignPopulation`, `alignFertility`, `alignEmployment`, ...)
-- behavioral switches (`enableIntertemporalOptimisations`, `responsesToHealth`, ...)
-- persistence of behavioral grids (`saveBehaviour`, `useSavedBehaviour`, `readGrid`)
+Alignment prevents aggregate rates from drifting from known targets. Each dimension is independently controlled:
-### `collector_args`
+| Flag | Default | What it aligns |
+|------|---------|----------------|
+| `alignPopulation` | `true` | Age-sex-region totals to demographic projections |
+| `alignCohabitation` | `true` | Share of individuals in partnerships |
+| `alignFertility` | `false` | Birth rates to projected fertility rates |
+| `alignInSchool` | `false` | School participation rate (age 16–29) |
+| `alignEducation` | `false` | Completed education level distribution |
+| `alignEmployment` | `false` | Employment share |
+
+See [Model Concepts — Alignment](model-concepts.md#alignment) for a fuller explanation.
+
+#### Income security (S-Index)
+
+The S-Index is an economic (in)security measure computed each year per person and reported in `Statistics1.csv` as `SIndex_p50`. It takes a rolling window of equivalised consumption observations, applies exponential discounting, and weights losses more heavily than gains according to a risk-aversion parameter.
+
+| Parameter | Default | Meaning |
+|-----------|---------|---------|
+| `sIndexTimeWindow` | `5` | Length of rolling window in years |
+| `sIndexAlpha` | `2` | Coefficient of relative risk aversion — higher values make the index more sensitive to consumption drops |
+| `sIndexDelta` | `0.98` | Annual discount factor applied to past consumption observations |
-Controls output collection and export behavior (via `SimPathsCollector`), including:
+#### Intertemporal optimisation (IO)
-- `persistStatistics`, `persistStatistics2`, `persistStatistics3`
-- `persistPersons`, `persistBenefitUnits`, `persistHouseholds`
-- `exportToCSV`, `exportToDatabase`
+Enables a backward-induction life-cycle solution for consumption and labour supply. Decision grids are pre-computed in year 0; agents look up their optimal choice each year. Computationally intensive — disabled by default.
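
The solve-once, look-up-later idea in the IO description can be illustrated with a toy model. This is only a minimal sketch under loud assumptions, not the SimPaths solver: the one-dimensional wealth grid, log utility, fixed income, zero interest rate, and the names `solve_grid` and `lookup` are all hypothetical.

```python
import math


def solve_grid(wealth_grid, periods, income=1.0, beta=0.98):
    """Toy backward induction over a wealth grid (hypothetical, not the SimPaths solver).

    Assumes non-negative grid points and positive income, so that saving the
    lowest grid point is always feasible.
    Returns policy[t][i]: consumption chosen in period t at wealth_grid[i].
    """
    # Final period: nothing to save for, so consume all resources.
    value = [math.log(w + income) for w in wealth_grid]
    policy = [[w + income for w in wealth_grid]]
    # Step backwards through the earlier periods.
    for _ in range(periods - 1):
        new_value, new_policy = [], []
        for w in wealth_grid:
            best_v, best_c = -math.inf, None
            for j, w_next in enumerate(wealth_grid):
                c = w + income - w_next  # consumption implied by saving w_next
                if c <= 0:
                    continue  # infeasible: log utility needs positive consumption
                v = math.log(c) + beta * value[j]
                if v > best_v:
                    best_v, best_c = v, c
            new_value.append(best_v)
            new_policy.append(best_c)
        value = new_value
        policy.append(new_policy)
    policy.reverse()  # policies were collected last period first
    return policy


def lookup(policy, wealth_grid, period, wealth):
    """Return the pre-computed choice at the grid point nearest to `wealth`."""
    i = min(range(len(wealth_grid)), key=lambda k: abs(wealth_grid[k] - wealth))
    return policy[period][i]


grid = [0.0, 1.0, 2.0, 3.0]
policy = solve_grid(grid, periods=3)
print(lookup(policy, grid, period=0, wealth=1.9))
```

The expensive part is the nested loop in `solve_grid`; `lookup` is cheap, which is why it can pay to solve the grid once and reuse it. Each extra state variable would multiply the number of grid points, consistent with the note that every state-space flag increases solve time.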
+
+The IO state-space flags control which personal characteristics enter the grid (each adds a dimension and increases solve time):
+
+| Flag | Default |
+|------|---------|
+| `responsesToHealth` | `true` |
+| `responsesToDisability` | `false` |
+| `responsesToEducation` | `true` |
+| `responsesToPension` | `false` |
+| `responsesToRetirement` | `false` |
+| `responsesToLowWageOffer` | `true` |
+| `responsesToRegion` | `false` |
+
+Grid persistence flags allow a baseline grid to be solved once and reused in counterfactual runs (`saveBehaviour: true` / `useSavedBehaviour: true` with `readGrid: ""`). See [Scenario Cookbook](scenario-cookbook.md) for an example.
+
+---
 ### `innovation_args`
-Controls iteration logic across runs, such as:
+Controls how parameters change across sequential runs (run 0, run 1, run 2, …). Useful for sensitivity analysis and uncertainty quantification.
+
+| Flag | Default | Behavior |
+|------|---------|----------|
+| `randomSeedInnov` | `true` | Increments `randomSeed` by 1 for each successive run so each gets a distinct seed |
+| `flagDatabaseSetup` | `false` | If `true`, runs database setup instead of simulation (equivalent to `-DBSetup` on the CLI) |
+| `intertemporalElasticityInnov` | `false` | If `true`, applies interest rate shocks: run 1 = +0.0075 (higher return to saving), run 2 = −0.0075 (lower return to saving). Requires `maxNumberOfRuns >= 3` to see all variants. |
+| `labourSupplyElasticityInnov` | `false` | If `true`, applies disposable income shocks: run 1 = +0.01 (higher net labour income), run 2 = −0.01 (lower net labour income). Requires `maxNumberOfRuns >= 3`. |
+
+---
+
+### `collector_args`
+
+Controls what `SimPathsCollector` writes to CSV or database each simulation year.
+
+#### Output files
+
+| File | Content | Enabled by |
+|------|---------|-----------|
+| `Statistics1.csv` | Income distribution: Gini coefficients, income percentiles, median equivalised disposable income (EDI), S-Index | `persistStatistics: true` |
+| `Statistics2.csv` | Demographic validation: partnership rates, employment rates, health and disability measures by age and gender | `persistStatistics2: true` |
+| `Statistics3.csv` | Alignment diagnostics: simulated vs target rates and the adjustment factors applied | `persistStatistics3: true` |
+| `EmploymentStatistics.csv` | Labour market transitions and participation rates | `persistEmploymentStatistics: true` |
+| `HealthStatistics.csv` | Health measures (SF-12, GHQ-12, EQ-5D) by age and gender | *(written automatically when health statistics are computed)* |
+
+For a description of the variables in these files, see `documentation/SimPaths_Variable_Codebook.xlsx`.
+
+#### Other collector flags
+
+| Flag | Default | Description |
+|------|---------|-------------|
+| `calculateGiniCoefficients` | `false` | Compute Gini coefficients (also populates GUI charts); off by default for speed |
+| `exportToCSV` | `true` | Write outputs to CSV files under `output//csv/` |
+| `exportToDatabase` | `false` | Write outputs to H2 database in addition to or instead of CSV |
+| `persistPersons` | `false` | Write one row per person per year (produces large files) |
+| `persistBenefitUnits` | `false` | Write one row per benefit unit per year (produces large files) |
+| `persistHouseholds` | `false` | Write one row per household per year |
+| `dataDumpStartTime` | `0` | First year to write output (`0` = `startYear`) |
+| `dataDumpTimePeriod` | `1.0` | Output frequency in years (`1.0` = every year) |
-- `randomSeedInnov`
-- `intertemporalElasticityInnov`
-- `labourSupplyElasticityInnov`
-- `flagDatabaseSetup`
+---
 ### `parameter_args`
-Overrides values from `Parameters` (paths and model-global flags).
+Overrides file paths and model-global flags in `Parameters`.
-Common examples:
+| Key | Default | Description |
+|-----|---------|-------------|
+| `input_directory` | `input` | Path to input data folder |
+| `input_directory_initial_populations` | `input/InitialPopulations` | Path to initial population CSVs |
+| `euromod_output_directory` | `input/EUROMODoutput` | Path to EUROMOD/UKMOD output files |
+| `trainingFlag` | `false` | If `true`, loads training data from `input/.../training/` subfolders (set automatically by test configs) |
+| `includeYears` | *(all)* | List of policy years for which EUROMOD donor data is available; only these years enter the donor database |
-- `trainingFlag`
-- `working_directory`
-- `input_directory`
-- `input_directory_initial_populations`
-- `euromod_output_directory`
+---
 ## Minimal example
@@ -85,13 +155,10 @@ collector_args:
   persistStatistics: true
   persistStatistics2: true
   persistStatistics3: true
-  persistPersons: false
-  persistBenefitUnits: false
-  persistHouseholds: false
 ```
 Run it:
 ```bash
-java -jar multirun.jar -config test_run.yml
+java -jar multirun.jar -config my_run.yml
 ```

From 83af77668708f84db30eb74339cdc481765932b2 Mon Sep 17 00:00:00 2001
From: hk-2029
Date: Sat, 14 Mar 2026 16:43:16 +0000
Subject: [PATCH 05/23] docs: fix factual errors, add validation and data-pipeline guides
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

- scenario-cookbook.md: correct that custom configs inherit from class field
  defaults, not from default.yml; fix readGrid comment (not set automatically —
  must be set manually to match the output folder name)
- data-and-outputs.md: expand compile/ description to reflect the full data
  pipeline; add CSV naming convention note (EntityClass + RunNumber) and inline
  descriptions for each output file
- validation-guide.md: new file documenting the two-stage validation workflow
  (9 estimate-validation scripts + 28 simulated-output scripts, setup
  instructions, interpretation guidance)
- data-pipeline.md: new file documenting the full chain from raw
  UKHLS/BHPS/WAS/EUROMOD data to simulation-ready inputs, with per-script
  tables for all three parts and a when-to-re-run guide
- README.md: add data-pipeline.md and validation-guide.md to reading list
- config/default.yml: correct flagDefaultToTimeSeriesAverages default to false
---
 config/default.yml | 2 +-
 documentation/README.md | 2 +
 documentation/data-and-outputs.md | 16 +++--
 documentation/data-pipeline.md | 110 +++++++++++++++++++++++++++++
 documentation/scenario-cookbook.md | 8 +--
 documentation/validation-guide.md | 96 +++++++++++++++++++++++++
 6 files changed, 223 insertions(+), 11 deletions(-)
 create mode 100644 documentation/data-pipeline.md
 create mode 100644 documentation/validation-guide.md

diff --git a/config/default.yml b/config/default.yml
index d4aafba31..ca449ce17 100644
--- a/config/default.yml
+++ b/config/default.yml
@@ -82,7 +82,7 @@ model_args:
   # across years (currently applies to wage equations only)
   # --- Time-series defaults ---
-# flagDefaultToTimeSeriesAverages: true  # if true, use the sample average of time-series variables
+# flagDefaultToTimeSeriesAverages: false  # if true, use the sample average of time-series variables
   # rather than the year-specific value when data is unavailable
   # --- Intertemporal optimisation (IO) ---
diff --git a/documentation/README.md b/documentation/README.md
index 8c2252510..7e17949b0 100644
--- a/documentation/README.md
+++ b/documentation/README.md
@@ -18,6 +18,8 @@ For contributors and advanced users:
 - [Architecture](architecture.md) — source package structure and data flow
 - [Development and Testing](development.md) — build, tests, CI, contributor workflow
+- [Data Pipeline](data-pipeline.md) — how input files are generated from UKHLS/EUROMOD/WAS survey data
+- [Validation Guide](validation-guide.md) — two-stage validation workflow (estimate validation + simulated output validation)
 ## Conventions
diff --git a/documentation/data-and-outputs.md b/documentation/data-and-outputs.md
index 81e894c68..4f4c7b38c 100644
--- a/documentation/data-and-outputs.md
+++ b/documentation/data-and-outputs.md
@@ -12,7 +12,9 @@ input/
 ├── InitialPopulations/
 │   ├── training/  # training population CSV (included in repo)
-│   └── compile/  # Stata do-files for extracting UKHLS data
+│   └── compile/  # full data pipeline: builds initial population CSVs from UKHLS/BHPS/WAS,
+│                 # reconstructs employment histories, and estimates all reg_*.xlsx coefficients
+│                 # (see Data Pipeline for details)
 ├── EUROMODoutput/
 │   └── training/  # training UKMOD outputs (included in repo)
 ├── input.mv.db  # H2 donor database — generated by setup
@@ -43,15 +45,17 @@ Simulation runs produce timestamped folders under `output/`:
 ```
 output//
 ├── csv/
-│   ├── Statistics1.csv
-│   ├── Statistics21.csv
-│   ├── Statistics31.csv
-│   ├── EmploymentStatistics1.csv
-│   └── HealthStatistics1.csv
+│   ├── Statistics1.csv  # income distribution (Gini, percentiles, S-Index)
+│   ├── Statistics21.csv  # demographic validation (employment, health, partnership by age/gender)
+│   ├── Statistics31.csv  # alignment diagnostics (simulated vs target rates)
+│   ├── EmploymentStatistics1.csv  # labour market transitions and participation rates
+│   └── HealthStatistics1.csv  # health measures (SF-12, GHQ-12, EQ-5D) by age/gender
 ├── database/  # run-specific persistence output
 └── input/  # copied/persisted run input artifacts
 ```
+CSV filenames follow the pattern `<EntityClass><RunNumber>.csv`. With a single run the suffix is `1`; with multiple runs each run produces its own numbered file (e.g. `Statistics12.csv` for Statistics of run 2).
+
 For a description of the variables in these CSV files, see `documentation/SimPaths_Variable_Codebook.xlsx`.
 ## Logging
diff --git a/documentation/data-pipeline.md b/documentation/data-pipeline.md
new file mode 100644
index 000000000..0c38df715
--- /dev/null
+++ b/documentation/data-pipeline.md
@@ -0,0 +1,110 @@
+# Data Pipeline
+
+This page explains how the simulation-ready input files in `input/` are generated from raw survey data, and what to do if you need to update or extend them.
+
+The pipeline has three independent parts: (1) initial populations, (2) regression coefficients, (3) alignment targets. Each can be re-run separately.
+
+---
+
+## Data sources
+
+| Source | Description | Access |
+|--------|-------------|--------|
+| **UKHLS** (Understanding Society) | Main household panel survey; waves 1 to O (UKDA-6614-stata) | Requires EUL licence from UK Data Service |
+| **BHPS** (British Household Panel Survey) | Historical predecessor to UKHLS; used for pre-2009 employment history | Bundled with UKHLS EUL |
+| **WAS** (Wealth and Assets Survey) | Biennial survey of household wealth; waves 1 to 7 (UKDA-7215-stata) | Requires EUL licence from UK Data Service |
+| **EUROMOD / UKMOD** | Tax-benefit microsimulation system | See [Tax-Benefit Donors (UK)](../documentation/wiki/getting-started/data/tax-benefit-donors-uk.md) on the website |
+
+---
+
+## Part 1 — Initial populations (`input/InitialPopulations/compile/`)
+
+**What it produces:** Annual CSV files `population_initial_UK_.csv` used as the starting population for each simulation run.
+
+**Master script:** `input/InitialPopulations/compile/00_master.do`
+
+The pipeline runs in numbered stages:
+
+| Script | What it does |
+|--------|-------------|
+| `01_prepare_UKHLS_pooled_data.do` | Pools and standardises UKHLS waves |
+| `02_create_UKHLS_variables.do` | Constructs all required variables (demographics, labour, health, income, wealth flags) and applies simulation-consistency rules (retirement as absorbing state, education age bounds, work/hours consistency) |
+| `02_01_checks.do` | Data quality checks |
+| `03_social_care_received.do` | Social care receipt variables |
+| `04_social_care_provided.do` | Informal care provision variables |
+| `05_create_benefit_units.do` | Groups individuals into benefit units (tax units) following UK tax-benefit rules |
+| `06_reweight_and_slice.do` | Reweighting and year-specific slicing |
+| `07_was_wealth_data.do` | Prepares Wealth and Assets Survey data |
+| `08_wealth_to_ukhls.do` | Merges WAS wealth into UKHLS records |
+| `09_finalise_input_data.do` | Final cleaning and formatting |
+| `10_check_yearly_data.do` | Per-year consistency checks |
+| `99_training_data.do` | Extracts the small training subset committed to the repo |
+
+### Employment history sub-pipeline (`compile/do_emphist/`)
+
+Reconstructs each respondent's monthly employment history from January 2007 onwards by combining UKHLS and BHPS interview records. The output variable `liwwh` (months employed since Jan 2007) feeds into the labour supply models.
+
+| Script | Purpose |
+|--------|---------|
+| `00_Master_emphist.do` | Master; sets parameters and calls sub-scripts |
+| `01_Intdate.do` – `07_Empcal1a.do` | Sequential stages: interview dating, BHPS linkage, employment spell reconstruction, new-entrant identification |
+
+---
+
+## Part 2 — Regression coefficients (`input/InitialPopulations/compile/RegressionEstimates/`)
+
+**What it produces:** The `reg_*.xlsx` coefficient tables read by `Parameters.java` at simulation startup.
+
+**Master script:** `input/InitialPopulations/compile/RegressionEstimates/master.do`
+
+> **Note:** Income and union-formation regressions depend on predicted wages, so `reg_wages.do` must complete before `reg_income.do` and `reg_partnership.do`. All other scripts can run in any order.
+
+**Required Stata packages:** `fre`, `tsspell`, `carryforward`, `outreg2`, `oparallel`, `gologit2`, `winsor`, `reghdfe`, `ftools`, `require`
+
+| Script | Module | Method |
+|--------|--------|--------|
+| `reg_wages.do` | Hourly wages | Heckman selection model (males and females separately) |
+| `reg_income.do` | Non-labour income | Hurdle model (selection + amount); requires predicted wages |
+| `reg_partnership.do` | Partnership formation/dissolution | Probit; requires predicted wages |
+| `reg_education.do` | Education transitions | Generalised ordered logit |
+| `reg_fertility.do` | Fertility | Probit |
+| `reg_health.do` | Physical health (SF-12 PCS) | Linear regression |
+| `reg_health_mental.do` | Mental health (GHQ-12, SF-12 MCS) | Linear regression |
+| `reg_health_wellbeing.do` | Life satisfaction | Linear regression |
+| `reg_home_ownership.do` | Homeownership transitions | Probit |
+| `reg_retirement.do` | Retirement | Probit |
+| `reg_leave_parental_home.do` | Leaving parental home | Probit |
+| `reg_socialcare.do` | Social care receipt and provision | Probit / ordered logit |
+| `reg_unemployment.do` | Unemployment transitions | Probit |
+| `reg_financial_distress.do` | Financial distress | Probit |
+
+After running, output Excel files are placed in `input/` (overwriting the existing `reg_*.xlsx` files).
+
+---
+
+## Part 3 — Alignment targets (`input/DoFilesTarget/`)
+
+**What it produces:** The `align_*.xlsx` and `*_targets.xlsx` files that the alignment modules use to rescale simulated rates.
+
+| Script | Output file |
+|--------|------------|
+| `01_employment_shares_initpopdata.do` | `input/employment_targets.xlsx` — employment shares by benefit-unit subgroup and year |
+| `01_inSchool_targets_initpopdata.do` | `input/inSchool_targets.xlsx` — school participation rates by year |
+| `03_calculate_partneredShare_initialPop_BUlogic.do` | `input/partnered_share_targets.xlsx` — partnership shares by year |
+| `03_calculate_partnership_target.do` | Supplementary partnership targets |
+| `02_person_risk_employment_stats.do` | `employment_risk_emp_stats.csv` — person-level at-risk diagnostics used for employment alignment group construction |
+
+Population projection targets (`align_popProjections.xlsx`) and fertility/mortality projections (`projections_*.xlsx`) come from ONS published projections and are not generated by these scripts.
+
+---
+
+## When to re-run each part
+
+| Situation | What to re-run |
+|-----------|---------------|
+| Adding a new data year to the simulation | Part 1 (re-slice the population for the new year) + Part 3 (update alignment targets) |
+| Re-estimating a behavioural module | Part 2 (the affected `reg_*.do` script only) + Stage 1 validation |
+| Updating employment alignment targets | Part 3 (`01_employment_shares_initpopdata.do`) |
+| Adding a new country | All three parts with country-appropriate data sources |
+
+After re-running any part, re-run setup (`singlerun -Setup` or `multirun -DBSetup`) to rebuild `input/input.mv.db` before running the simulation.
diff --git a/documentation/scenario-cookbook.md b/documentation/scenario-cookbook.md
index 6300f6392..129f4ad16 100644
--- a/documentation/scenario-cookbook.md
+++ b/documentation/scenario-cookbook.md
@@ -36,7 +36,7 @@ java -jar multirun.jar -config test_run.yml -P root
 ## Building your own config
-Place a new `.yml` file in `config/` and pass it via `-config`. You only need to specify the values you want to override — everything else inherits defaults from `default.yml` or class field defaults.
+Place a new `.yml` file in `config/` and pass it via `-config`. You only need to specify the values you want to change — everything else falls back to the Java class field defaults. Each config file is independent; there is no inheritance from `default.yml` or any other YAML file.
 The keys under `model_args` map directly to the `@GUIparameter`-annotated fields on `SimPathsModel` — so anything you can set in the GUI can also be set here.
@@ -93,17 +93,17 @@ innovation_args:
 If you have computed a decision grid for a baseline scenario and want to reuse it in a counterfactual:
 ```yaml
-# Baseline run — saves the grid
+# Baseline run — saves the grid to output//
 model_args:
   enableIntertemporalOptimisations: true
   saveBehaviour: true
-  # readGrid is set to the run name automatically
 # Counterfactual run — loads the saved grid
+# readGrid must be set to the exact output folder name of the baseline run
 model_args:
   enableIntertemporalOptimisations: true
   useSavedBehaviour: true
-  readGrid: "my_baseline_run"
+  readGrid: "my_baseline_run"  # replace with the actual folder name under output/
 ```
 ---
diff --git a/documentation/validation-guide.md b/documentation/validation-guide.md
new file mode 100644
index 000000000..be6201cfe
--- /dev/null
+++ b/documentation/validation-guide.md
@@ -0,0 +1,96 @@
+# Validation Guide
+
+SimPaths uses a two-stage validation workflow in `validation/`. Stage 1 checks that each estimated regression model is well-specified before simulation; stage 2 checks that full simulation output matches observed survey data.
+
+---
+
+## Stage 1 — Estimate validation (`validation/01_estimate_validation/`)
+
+**When to run:** After updating or re-estimating any regression module (i.e. after re-running scripts in `input/InitialPopulations/compile/RegressionEstimates/`).
+
+**What it does:** For each behavioural module, the script loads the estimation sample, computes predicted values from the estimated coefficients, adds individual heterogeneity via 20 stochastic draws (as in multiple imputation), and overlays the predicted and observed distributions as histograms.
+
+| Script | Module validated |
+|--------|----------------|
+| `int_val_wages.do` | Hourly wages — Heckman selection model, separately for males/females with and without previous wage history |
+| `int_val_education.do` | Education transitions (3 processes) |
+| `int_val_fertility.do` | Fertility (2 processes) |
+| `int_val_health.do` | Physical health transitions |
+| `int_val_home_ownership.do` | Homeownership transitions |
+| `int_val_income.do` | Income processes — hurdle models (selection and amount) |
+| `int_val_leave_parental_home.do` | Leaving parental home |
+| `int_val_partnership.do` | Partnership formation and dissolution |
+| `int_val_retirement.do` | Retirement transitions |
+
+**Outputs:** PNG graphs saved under `validation/01_estimate_validation/graphs//`. Each graph shows predicted (red) vs observed (black outline) distributions. If the shapes diverge substantially, the regression may be mis-specified or the estimation sample may need updating.
+
+---
+
+## Stage 2 — Simulated output validation (`validation/02_simulated_output_validation/`)
+
+**When to run:** After completing a baseline simulation run that you want to assess for plausibility.
+
+**What it does:** Loads your simulation output CSVs, loads UKHLS initial population data as an observational benchmark, and produces side-by-side time-series plots comparing 18 simulated outcomes against the observed distributions with confidence intervals.
+ +### Setup + +Before running, open `00_master.do` and set the global paths: + +```stata +global path "/your/local/path/to/validation/02_simulated_output_validation" +global dir_sim "/your/output//csv" * folder with simulation CSVs +global dir_obs "/path/to/ukhls/initial/populations" +``` + +Then run `00_master.do`. It calls all sub-scripts in order. + +### Scripts and what they check + +**Data preparation (run first, automatically called by master):** + +| Script | Purpose | +|--------|---------| +| `01_prepare_simulated_data.do` | Loads `Household.csv`, `BenefitUnit.csv`, `Person.csv` from the simulation output | +| `02_create_simulated_variables.do` | Derives analysis variables (sex, age groups, labour supply, income); produces full sample and ages 18–65 subset | +| `03_prepare_UKHLS_data.do` | Loads UKHLS observed data; prepares disposable income and matching variables | +| `05_create_UKHLS_validation_targets.do` | Creates target variables from UKHLS initial population CSVs by year | + +**Comparison plots (18 scripts, `06_01` through `06_18`):** + +| Script | What is compared | +|--------|-----------------| +| `06_01_plot_activity_status.do` | Economic activity: employed, student, inactive, retired by age group | +| `06_02_plot_education_level.do` | Completed education distribution over time | +| `06_03_plot_gross_income.do` | Gross benefit-unit income | +| `06_04_plot_gross_labour_income.do` | Gross labour income | +| `06_05_plot_capital_income.do` | Capital income (interest, dividends) | +| `06_06_plot_pension_income.do` | Pension income | +| `06_07_plot_disposable_income.do` | Disposable income after taxes and benefits | +| `06_08_plot_equivalised_disposable_income.do` | Household-size-adjusted disposable income | +| `06_09_plot_hourly_wages.do` | Hourly wages for employees | +| `06_10_plot_hours_worked.do` | Weekly hours worked by employment status | +| `06_11_plot_income_shares.do` | Income distribution across quintiles | +| 
`06_12_plot_partnership_status.do` | Partnership status (single, married, cohabiting, previously partnered) |
+| `06_13_plot_health.do` | Physical and mental health (SF-12 PCS and MCS) |
+| `06_14_plot_at_risk_of_poverty.do` | At-risk-of-poverty rate |
+| `06_15_plot_inequality.do` | Income inequality (p90/p50 ratio) |
+| `06_16_plot_number_children.do` | Number of dependent children |
+| `06_17_plot_disability.do` | Disability prevalence |
+| `06_18_plot_social_care.do` | Social care receipt |
+
+**Correlation analysis:**
+
+| Script | Purpose |
+|--------|---------|
+| `07_01_correlations.do` | Checks that key relationships between variables (e.g. income and employment, health and age) are preserved in the simulated data relative to UKHLS |
+
+**Outputs:** PNG graphs saved under `validation/02_simulated_output_validation/graphs//`, organised by topic (income, health, inequality, partnership, etc.). A reference set from a named run (`20250909_run`) is already committed and can serve as a baseline for comparison.
+
+---
+
+## Interpreting results
+
+- **Stage 1:** Predicted and observed histograms should broadly overlap. Systematic divergence (e.g. predicted wages consistently too high) indicates a problem with the estimation or variable construction.
+- **Stage 2:** Simulated time-series should track UKHLS trends within reasonable uncertainty bounds. Large divergence in levels suggests a miscalibration; divergence in trends suggests a missing time-series process or a misspecified time-trend parameter.
+
+The validation suite does not produce a single pass/fail metric — it is a diagnostic tool to inform judgement about whether a given parameterisation is fit for the intended research purpose.
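
The distinction drawn under "Interpreting results" between divergence in levels and divergence in trends can also be checked numerically. The following is a minimal Java sketch under stated assumptions: `DivergenceSketch` and its methods are hypothetical names that do not exist in SimPaths or in the Stata validation scripts, and the yearly shares in `main` are made up. The two numbers it returns are a rough aid to reading the plots, not a pass/fail test.

```java
/** Hypothetical helper for illustration; not part of SimPaths or its validation suite. */
final class DivergenceSketch {

    private DivergenceSketch() {}

    /** Mean of (simulated - observed) over all years: a rough measure of a gap in levels. */
    static double meanLevelGap(double[] simulated, double[] observed) {
        requireComparable(simulated, observed);
        double sum = 0.0;
        for (int i = 0; i < simulated.length; i++) {
            sum += simulated[i] - observed[i];
        }
        return sum / simulated.length;
    }

    /** Difference between the OLS slopes of the two series: a rough measure of a gap in trends. */
    static double trendGap(double[] simulated, double[] observed) {
        requireComparable(simulated, observed);
        return olsSlope(simulated) - olsSlope(observed);
    }

    /** Ordinary least-squares slope of a series regressed on the year index 0, 1, 2, ... */
    static double olsSlope(double[] series) {
        int n = series.length;
        if (n < 2) {
            throw new IllegalArgumentException("need at least two years");
        }
        double meanX = (n - 1) / 2.0;
        double meanY = 0.0;
        for (double y : series) {
            meanY += y;
        }
        meanY /= n;
        double sxy = 0.0;
        double sxx = 0.0;
        for (int i = 0; i < n; i++) {
            double dx = i - meanX;
            sxy += dx * (series[i] - meanY);
            sxx += dx * dx;
        }
        return sxy / sxx;
    }

    private static void requireComparable(double[] a, double[] b) {
        if (a.length == 0 || a.length != b.length) {
            throw new IllegalArgumentException("series must be non-empty and of equal length");
        }
    }

    public static void main(String[] args) {
        // Made-up yearly shares, for illustration only.
        double[] observed = {0.60, 0.62, 0.64, 0.66};
        double[] simulated = {0.70, 0.72, 0.74, 0.76};
        System.out.println("level gap: " + meanLevelGap(simulated, observed));
        System.out.println("trend gap: " + trendGap(simulated, observed));
    }
}
```

In the made-up series above the simulated values sit about ten percentage points above the observed ones while both rise at the same rate, which is the pattern the text associates with miscalibration rather than with a missing time trend.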
From 24e06c6a492f10d65d8916dcb8ab1203b26b8b3f Mon Sep 17 00:00:00 2001
From: hk-2029
Date: Sat, 14 Mar 2026 17:05:29 +0000
Subject: [PATCH 06/23] docs: rename scenario-cookbook.md to run-configuration.md

---
 documentation/README.md | 2 +-
 documentation/configuration.md | 4 ++--
 documentation/model-concepts.md | 2 +-
 documentation/{scenario-cookbook.md => run-configuration.md} | 2 +-
 4 files changed, 5 insertions(+), 5 deletions(-)
 rename documentation/{scenario-cookbook.md => run-configuration.md} (99%)

diff --git a/documentation/README.md b/documentation/README.md
index 7e17949b0..08a0f501a 100644
--- a/documentation/README.md
+++ b/documentation/README.md
@@ -10,7 +10,7 @@ These files are a **CLI- and developer-workflow quick reference** for working di
 2. [Getting Started](getting-started.md) — prerequisites, build, first run
 3. [CLI Reference](cli-reference.md) — all flags for `singlerun.jar` and `multirun.jar`
 4. [Configuration](configuration.md) — YAML structure and all config keys
-5. [Scenario Cookbook](scenario-cookbook.md) — provided configs and how to build your own
+5. [Run Configuration](run-configuration.md) — provided configs and how to build your own
 6. [Data and Outputs](data-and-outputs.md) — input layout, setup artifacts, output files
 7. [Troubleshooting](troubleshooting.md) — common errors and fixes
 
diff --git a/documentation/configuration.md b/documentation/configuration.md
index 98f091be1..90fca123f 100644
--- a/documentation/configuration.md
+++ b/documentation/configuration.md
@@ -8,7 +8,7 @@ This repository ships with three configs:
 - `test_create_database.yml` — database setup using training data
 - `test_run.yml` — short integration-test run
 
-For command-by-command guidance and a template for building your own config, see [Scenario Cookbook](scenario-cookbook.md).
+For command-by-command guidance and a template for building your own config, see [Run Configuration](run-configuration.md).
 
 ## How config is applied
 
@@ -79,7 +79,7 @@ The IO state-space flags control which personal characteristics enter the grid (
 | `responsesToLowWageOffer` | `true` |
 | `responsesToRegion` | `false` |
 
-Grid persistence flags allow a baseline grid to be solved once and reused in counterfactual runs (`saveBehaviour: true` / `useSavedBehaviour: true` with `readGrid: ""`). See [Scenario Cookbook](scenario-cookbook.md) for an example.
+Grid persistence flags allow a baseline grid to be solved once and reused in counterfactual runs (`saveBehaviour: true` / `useSavedBehaviour: true` with `readGrid: ""`). See [Run Configuration](run-configuration.md) for an example.
 
 ---
 
diff --git a/documentation/model-concepts.md b/documentation/model-concepts.md
index a9f7a543e..2130d3e6b 100644
--- a/documentation/model-concepts.md
+++ b/documentation/model-concepts.md
@@ -125,4 +125,4 @@ This gives SimPaths annually updated policy rules without re-implementing the fu
 
 When `enableIntertemporalOptimisations: true`, SimPaths solves a life-cycle consumption and labour supply problem. Decision grids are pre-computed in year 0 (`RationalOptimisation`) by solving backwards over the remaining horizon. In each subsequent year agents look up their optimal choice from the grid given their current state.
 
-This is computationally intensive and disabled by default. When enabled, `saveBehaviour` and `useSavedBehaviour` allow a baseline grid to be reused in counterfactual runs without recomputing it — see [Scenario Cookbook](scenario-cookbook.md) for an example.
+This is computationally intensive and disabled by default. When enabled, `saveBehaviour` and `useSavedBehaviour` allow a baseline grid to be reused in counterfactual runs without recomputing it — see [Run Configuration](run-configuration.md) for an example.
diff --git a/documentation/scenario-cookbook.md b/documentation/run-configuration.md
similarity index 99%
rename from documentation/scenario-cookbook.md
rename to documentation/run-configuration.md
index 129f4ad16..d4938cb1d 100644
--- a/documentation/scenario-cookbook.md
+++ b/documentation/run-configuration.md
@@ -1,4 +1,4 @@
-# Scenario Cookbook
+# Run Configuration
 
 This guide maps every YAML config currently in `config/` to its intended use, and explains how to build your own.
 
From 9c757d280f059190aeeb8a38b1485bc97c81a891 Mon Sep 17 00:00:00 2001
From: hk-2029
Date: Mon, 16 Mar 2026 09:36:39 +0000
Subject: [PATCH 07/23] docs: write file-organisation page for Developer Guide internals

---
 .../internals/file-organisation.md | 178 +++++++++++++++++-
 1 file changed, 175 insertions(+), 3 deletions(-)

diff --git a/documentation/wiki/developer-guide/internals/file-organisation.md b/documentation/wiki/developer-guide/internals/file-organisation.md
index 742a1ef3a..041ac3b83 100644
--- a/documentation/wiki/developer-guide/internals/file-organisation.md
+++ b/documentation/wiki/developer-guide/internals/file-organisation.md
@@ -1,5 +1,177 @@
 # File Organisation
 
-!!! warning "In progress"
-    This page is under development. Contributions welcome —
-    see the [Developer Guide](../index.md) for how to contribute.
+This page describes the directory and package layout of the SimPaths repository. For the generic JAS-mine project structure, see [Project Structure](../jasmine/project-structure.md).
+
+## 1. 
Top-level directories
+
+| Directory | Contents |
+| --- | --- |
+| `config/` | YAML configuration files for batch runs (`default.yml`, `test_create_database.yml`, `test_run.yml`) |
+| `input/` | Survey-derived input data, EUROMOD donor files, Stata scripts for data preparation |
+| `output/` | Simulation output (created at runtime; each run produces a timestamped subfolder) |
+| `src/` | All Java source code (main and test) plus resources |
+| `target/` | Maven build output: compiled classes and runnable JARs (`singlerun.jar`, `multirun.jar`) |
+| `validation/` | Stata scripts and reference graphs for two-stage model validation |
+| `documentation/` | Markdown documentation and wiki source files |
+| `.github/workflows/` | CI pipeline (`SimPathsBuild.yml`) and Javadoc publishing (`publish-javadoc.yml`) |
+
+Root-level files include `pom.xml` (Maven project definition) and `README.md`.
+
+## 2. Source code — `src/main/java/simpaths/`
+
+### `experiment/`
+
+Entry points and orchestration. Contains the four manager classes required by the JAS-mine architecture:
+
+| Class | Role |
+| --- | --- |
+| `SimPathsStart` | Entry point for interactive single runs. Builds the GUI dialog, creates database tables from CSV, and launches the simulation. |
+| `SimPathsMultiRun` | Entry point for batch runs. Reads a YAML config file, iterates over runs with optional parameter variation (innovation shocks), and manages run labelling. |
+| `SimPathsCollector` | Collector manager. Computes aggregate statistics each simulated year and exports them to output CSV files (Statistics, Statistics2, Statistics3). |
+| `SimPathsObserver` | Observer manager. Builds real-time GUI charts for monitoring the simulation while it runs. |
+
+### `model/`
+
+Core simulation logic. The central class is `SimPathsModel`, which owns all agent collections, builds the yearly event schedule (44 ordered processes), and coordinates the annual simulation cycle.
+
+Agent classes:
+
+| Class | Description |
+| --- | --- |
+| `Person` | Individual agent. Carries all demographics, health, education, labour, income, and social care state. Contains the per-person process methods invoked by the schedule. |
+| `BenefitUnit` | Tax-and-benefit assessment unit: one or two adults plus their dependents. Tax-benefit evaluation is performed at this level. |
+| `Household` | Grouping of benefit units sharing the same address. |
+
+Other key classes in `model/`:
+
+| Class | Purpose |
+| --- | --- |
+| `SimPathsModel` | Model manager. Initialises the population, registers all 44 yearly processes with the JAS-mine scheduler, manages alignment and aggregate state. |
+| `TaxEvaluation` | Orchestrates EUROMOD donor matching to impute taxes and benefits onto simulated benefit units. |
+| `UnionMatching` | Partnership formation algorithm. Matches unpartnered individuals into couples based on characteristics and preferences. |
+| `LabourMarket` | Labour market clearing: matches labour supply decisions to employment outcomes. |
+| `Innovations` | Applies parameter shocks (innovation perturbations) across sequential runs for sensitivity analysis. |
+| `Validator` | Runtime consistency checks on the simulated population. |
+| `*Alignment` classes | `FertilityAlignment`, `ActivityAlignmentV2`, `InSchoolAlignment`, `PartnershipAlignment`, `SocialCareAlignment` — each aligns a specific outcome to external calibration targets. |
+
+### `model/enums/`
+
+46 enumeration classes defining the categorical variables used throughout the simulation: `Gender`, `Education`, `Labour`, `HealthStatus`, `Country`, `Region`, `Ethnicity`, `Occupancy`, and others. These are referenced by the ORM for database persistence and by regression models for covariate encoding.
+
+### `model/decisions/`
+
+Intertemporal optimisation (IO) module. 
When IO is enabled, this package pre-computes decision grids by backward induction over a discretised state space, and agents look up optimal consumption–labour choices each simulated year.
+
+Key classes:
+
+| Class | Purpose |
+| --- | --- |
+| `DecisionParams` | Defines the state-space dimensions and grid parameters for the optimisation problem. |
+| `ManagerPopulateGrids` | Populates the state-space grid points and evaluates value functions by backward induction. |
+| `ManagerSolveGrids` | Solves for optimal policy at each grid point. |
+| `ManagerFileGrids` | Reads and writes pre-computed grids to disk, so they can be reused across runs. |
+| `Grids` | Container for the set of solved decision grids. |
+| `States` | Enumerates the state variables that define each grid point. |
+| `Expectations` / `LocalExpectations` | Computes expected future values over stochastic transitions. |
+| `CESUtility` | CES utility function used in the optimisation. |
+
+### `model/taxes/`
+
+EUROMOD donor-matching subsystem. Imputes taxes and benefits onto simulated benefit units by matching them to pre-computed EUROMOD donor records.
+
+| Class | Purpose |
+| --- | --- |
+| `DonorTaxImputation` | Main entry point. Implements the three-step matching process: coarse-exact matching on characteristics, income proximity filtering, and candidate selection/averaging. |
+| `KeyFunction` / `KeyFunction1`–`4` | Four progressively relaxed matching-key definitions. The system tries the tightest key first and falls back through wider keys if no donors are found. |
+| `DonorKeys` | Builds composite matching keys from benefit-unit characteristics. |
+| `DonorTaxUnit` / `DonorPerson` | Represent the pre-computed EUROMOD donor records loaded from the database. |
+| `CandidateList` | Ranked list of donor matches for a given benefit unit, sorted by income proximity. |
+| `Match` / `Matches` | Store the final selected donor(s) and their imputed tax-benefit values. 
|
+
+The `taxes/database/` sub-package handles loading donor data from the H2 database into memory (`TaxDonorDataParser`, `DatabaseExtension`, `MatchIndices`).
+
+### `model/lifetime_incomes/`
+
+Synthetic lifetime income trajectory generator. When IO is enabled, this package creates projected income paths for birth cohorts using an AR(2) process anchored to age-gender geometric means, and matches simulated persons to donor income profiles.
+
+| Class | Purpose |
+| --- | --- |
+| `ManagerProjectLifetimeIncomes` | Generates the synthetic income trajectory database for all birth cohorts in the simulation horizon. |
+| `LifetimeIncomeImputation` | Matches each simulated person to a donor income trajectory via binary search on the income CDF. |
+| `AnnualIncome` | Implements the AR(2) income process with age-gender anchoring. |
+| `BirthCohort` | Groups individuals by birth year for cohort-level income projection. |
+| `Individual` | Entity carrying age dummies and log GDP per capita for income regression. |
+
+### `data/`
+
+Parameters, input parsing, regression management, and utility classes.
+
+| Class | Purpose |
+| --- | --- |
+| `Parameters` | Central parameter store. Loads all regression coefficients, alignment targets, projections, and scenario tables from Excel files at simulation start. |
+| `ManagerRegressions` | Manages the regression coefficient files (`reg_*.xlsx`) and provides methods for evaluating regression equations. |
+| `RegressionName` | Enum-like catalogue of all named regression models used in the simulation. |
+| `ScenarioTable` | Reads scenario-specific parameter overrides from Excel files. |
+| `MahalanobisDistance` | Mahalanobis distance computation, used in donor matching. |
+| `RootSearch` / `RootSearch2` | Numerical root-finding routines for alignment. |
+
+Sub-packages:
+
+- **`data/filters/`** — 42 cross-section filter classes (e.g. `FemaleAgeGroupCSfilter`, `RegionEducationWorkingCSfilter`). 
Each defines a predicate for selecting subsets of agents by demographic characteristics, used in alignment and statistics collection.
+- **`data/startingpop/`** — `DataParser` reads the initial population CSV files and constructs the starting agent objects; `Processed` tracks which records have been loaded.
+- **`data/statistics/`** — `Statistics`, `Statistics2`, `Statistics3` define the output entities whose fields are exported to CSVs by the Collector. `EmploymentStatistics` and `HealthStatistics` compute domain-specific aggregate indicators.
+
+## 3. Test code — `src/test/java/simpaths/`
+
+Test packages mirror the main source structure:
+
+| Package | Contents |
+| --- | --- |
+| `simpaths/model/` | Unit tests for agent classes and simulation logic |
+| `simpaths/data/` | Tests for parameter loading and data utilities |
+| `simpaths/data/filters/` | Tests for cross-section filters |
+| `simpaths/data/statistics/` | Tests for statistics computation |
+| `simpaths/experiment/` | Tests for entry points and configuration parsing |
+| `simpaths/integrationtest/` | `RunSimPathsIntegrationTest` — end-to-end test that builds the database and runs a short simulation. The `expected/` subfolder contains reference output for comparison. |
+| `simpaths/testinput/` | Test fixture data files |
+
+## 4. Resources — `src/main/resources/`
+
+| File | Purpose |
+| --- | --- |
+| `hibernate.cfg.xml` | Hibernate ORM configuration for the embedded H2 database |
+| `log4j.properties` | Logging configuration |
+| `META-INF/` | Persistence unit definitions |
+| `images/` | Icons and images used by the GUI |
+
+## 5. 
Input data — `input/`
+
+| Subdirectory | Contents |
+| --- | --- |
+| `InitialPopulations/training/` | Small training population CSV committed to the repo for testing |
+| `InitialPopulations/compile/` | 13 Stata do-files that build the full initial population from UKHLS/BHPS/WAS survey data |
+| `InitialPopulations/compile/do_emphist/` | 8 Stata scripts that reconstruct monthly employment histories back to 2007 |
+| `InitialPopulations/compile/RegressionEstimates/` | 14 Stata scripts that estimate regression coefficients and produce the `reg_*.xlsx` files |
+| `DoFilesTarget/` | 5 Stata scripts that generate alignment target files (employment shares, education targets, partnership rates) |
+| `EUROMODoutput/` | Pre-computed EUROMOD tax-benefit donor files, one per policy year. These are loaded into the H2 database during the setup phase. |
+
+The full input data (survey microdata and EUROMOD output) is not committed to the repository due to data licence restrictions. The `training/` subfolder contains a small synthetic subset for CI and testing.
+
+## 6. Validation — `validation/`
+
+| Subdirectory | Contents |
+| --- | --- |
+| `01_estimate_validation/do_files/` | 9 Stata scripts that compare predicted versus observed values for each regression module |
+| `01_estimate_validation/graphs/` | Output graphs from estimate validation |
+| `02_simulated_output_validation/do_files/` | 28 Stata scripts that compare simulation output against UKHLS observed data across 18 outcomes |
+| `02_simulated_output_validation/graphs/` | Reference comparison plots from a baseline validation run |
+
+## 7. Configuration — `config/`
+
+| File | Purpose |
+| --- | --- |
+| `default.yml` | Default configuration for batch runs. Fully annotated with inline comments. |
+| `test_create_database.yml` | Rebuilds the H2 database from input CSVs (used during setup). |
+| `test_run.yml` | Minimal configuration for CI testing. 
|
+
+Each YAML file is standalone — there is no inheritance between config files. Keys map directly to fields in `SimPathsMultiRun` and `SimPathsModel`. See the [Configuration](../../../../documentation/configuration.md) reference for a complete listing of all keys.
From 75d611fba5c4ac392afbaaa2a88c28ea460adb2b Mon Sep 17 00:00:00 2001
From: hk-2029
Date: Mon, 16 Mar 2026 13:58:07 +0000
Subject: [PATCH 08/23] docs: fix IO description in file-organisation, remove incorrect module label

---
 .../wiki/developer-guide/internals/file-organisation.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/documentation/wiki/developer-guide/internals/file-organisation.md b/documentation/wiki/developer-guide/internals/file-organisation.md
index 041ac3b83..466e21f6c 100644
--- a/documentation/wiki/developer-guide/internals/file-organisation.md
+++ b/documentation/wiki/developer-guide/internals/file-organisation.md
@@ -60,7 +60,7 @@ Other key classes in `model/`:
 
 ### `model/decisions/`
 
-Intertemporal optimisation (IO) module. When IO is enabled, this package pre-computes decision grids by backward induction over a discretised state space, and agents look up optimal consumption–labour choices each simulated year.
+Intertemporal optimisation (IO) computational engine. When IO is enabled, computing optimal consumption–labour choices for every agent at every time step during the simulation would be prohibitively slow. This package solves the problem once before the simulation runs: it constructs a grid covering all meaningful combinations of state variables (wealth, age, health, family status, etc.), then works backwards from the end of life to find the optimal choice at each grid point (backward induction). During the simulation, agents simply look up their current state in the pre-computed grid rather than solving an optimisation problem.
 
 Key classes:
 
From ebad3dbff69806ba5c2c40f9f2afd0e9fec6b448 Mon Sep 17 00:00:00 2001
From: hk-2029
Date: Mon, 16 Mar 2026 14:12:48 +0000
Subject: [PATCH 09/23] =?UTF-8?q?docs:=20correct=20training=20data=20descr?=
 =?UTF-8?q?iption=20=E2=80=94=20de-identified=20synthetic,=20not=20just=20?=
 =?UTF-8?q?a=20small=20subset?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 documentation/data-pipeline.md | 2 +-
 .../wiki/developer-guide/internals/file-organisation.md | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/documentation/data-pipeline.md b/documentation/data-pipeline.md
index 0c38df715..c27c404cb 100644
--- a/documentation/data-pipeline.md
+++ b/documentation/data-pipeline.md
@@ -38,7 +38,7 @@ The pipeline runs in numbered stages:
 | `08_wealth_to_ukhls.do` | Merges WAS wealth into UKHLS records |
 | `09_finalise_input_data.do` | Final cleaning and formatting |
 | `10_check_yearly_data.do` | Per-year consistency checks |
-| `99_training_data.do` | Extracts the small training subset committed to the repo |
+| `99_training_data.do` | Produces the de-identified synthetic population committed to `input/InitialPopulations/training/`. Selects a random subset of households, anonymises all IDs, collapses wave identifiers, randomises survey weights, and adds 15% random noise to all continuous variables (income, wealth, wages, care hours). This produces a file that is structurally identical to the real population data but contains no traceable individual records, so it can be distributed with the repo without breaching data licence terms. 
|
 
 ### Employment history sub-pipeline (`compile/do_emphist/`)
 
diff --git a/documentation/wiki/developer-guide/internals/file-organisation.md b/documentation/wiki/developer-guide/internals/file-organisation.md
index 466e21f6c..53936cf0a 100644
--- a/documentation/wiki/developer-guide/internals/file-organisation.md
+++ b/documentation/wiki/developer-guide/internals/file-organisation.md
@@ -148,7 +148,7 @@ Test packages mirror the main source structure:
 
 | Subdirectory | Contents |
 | --- | --- |
-| `InitialPopulations/training/` | Small training population CSV committed to the repo for testing |
+| `InitialPopulations/training/` | De-identified synthetic population CSV that can be distributed with the repo. Generated by `99_training_data.do`: household IDs anonymised, all continuous variables perturbed with 15% random noise, survey weights randomised. Structurally identical to the real data but contains no traceable individual records, so it is not subject to data licence restrictions. Used for CI testing and getting started without access to the full survey data. 
|
 | `InitialPopulations/compile/` | 13 Stata do-files that build the full initial population from UKHLS/BHPS/WAS survey data |
 | `InitialPopulations/compile/do_emphist/` | 8 Stata scripts that reconstruct monthly employment histories back to 2007 |
 | `InitialPopulations/compile/RegressionEstimates/` | 14 Stata scripts that estimate regression coefficients and produce the `reg_*.xlsx` files |
From dd994aade14b691a55a87652e6ddae5f56160c15 Mon Sep 17 00:00:00 2001
From: hk-2029
Date: Mon, 16 Mar 2026 14:16:36 +0000
Subject: [PATCH 10/23] docs: fix script/class counts, remove stale sentence, add missing utility scripts

---
 documentation/data-pipeline.md | 2 ++
 .../wiki/developer-guide/internals/file-organisation.md | 6 +++---
 2 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/documentation/data-pipeline.md b/documentation/data-pipeline.md
index c27c404cb..a43d1db2b 100644
--- a/documentation/data-pipeline.md
+++ b/documentation/data-pipeline.md
@@ -77,6 +77,8 @@ Reconstructs each respondent's monthly employment history from January 2007 onwa
 | `reg_socialcare.do` | Social care receipt and provision | Probit / ordered logit |
 | `reg_unemployment.do` | Unemployment transitions | Probit |
 | `reg_financial_distress.do` | Financial distress | Probit |
+| `programs.do` | Shared utility programs called by the estimation scripts | — |
+| `variable_update.do` | Prepares and recodes variables before estimation | — |
 
 After running, output Excel files are placed in `input/` (overwriting the existing `reg_*.xlsx` files).
 
diff --git a/documentation/wiki/developer-guide/internals/file-organisation.md b/documentation/wiki/developer-guide/internals/file-organisation.md
index 53936cf0a..7d73e8bab 100644
--- a/documentation/wiki/developer-guide/internals/file-organisation.md
+++ b/documentation/wiki/developer-guide/internals/file-organisation.md
@@ -117,7 +117,7 @@ Parameters, input parsing, regression management, and utility classes.
 
 Sub-packages:
 
-- **`data/filters/`** — 42 cross-section filter classes (e.g. `FemaleAgeGroupCSfilter`, `RegionEducationWorkingCSfilter`). Each defines a predicate for selecting subsets of agents by demographic characteristics, used in alignment and statistics collection.
+- **`data/filters/`** — 43 cross-section filter classes (e.g. `FemaleAgeGroupCSfilter`, `RegionEducationWorkingCSfilter`). Each defines a predicate for selecting subsets of agents by demographic characteristics, used in alignment and statistics collection.
 - **`data/startingpop/`** — `DataParser` reads the initial population CSV files and constructs the starting agent objects; `Processed` tracks which records have been loaded.
 - **`data/statistics/`** — `Statistics`, `Statistics2`, `Statistics3` define the output entities whose fields are exported to CSVs by the Collector. `EmploymentStatistics` and `HealthStatistics` compute domain-specific aggregate indicators.
 
@@ -151,11 +151,11 @@ Test packages mirror the main source structure:
 | `InitialPopulations/training/` | De-identified synthetic population CSV that can be distributed with the repo. Generated by `99_training_data.do`: household IDs anonymised, all continuous variables perturbed with 15% random noise, survey weights randomised. Structurally identical to the real data but contains no traceable individual records, so it is not subject to data licence restrictions. Used for CI testing and getting started without access to the full survey data. 
|
 | `InitialPopulations/compile/` | 13 Stata do-files that build the full initial population from UKHLS/BHPS/WAS survey data |
 | `InitialPopulations/compile/do_emphist/` | 8 Stata scripts that reconstruct monthly employment histories back to 2007 |
-| `InitialPopulations/compile/RegressionEstimates/` | 14 Stata scripts that estimate regression coefficients and produce the `reg_*.xlsx` files |
+| `InitialPopulations/compile/RegressionEstimates/` | 17 Stata scripts that estimate regression coefficients and produce the `reg_*.xlsx` files. The 14 `reg_*.do` scripts each estimate one behavioural model; `programs.do` defines shared utility functions used by the estimation scripts; `variable_update.do` prepares variables before estimation; `master.do` orchestrates the full run. |
 | `DoFilesTarget/` | 5 Stata scripts that generate alignment target files (employment shares, education targets, partnership rates) |
 | `EUROMODoutput/` | Pre-computed EUROMOD tax-benefit donor files, one per policy year. These are loaded into the H2 database during the setup phase. |
 
-The full input data (survey microdata and EUROMOD output) is not committed to the repository due to data licence restrictions. The `training/` subfolder contains a small synthetic subset for CI and testing.
+The full input data (survey microdata and EUROMOD output) is not committed to the repository due to data licence restrictions.
 
 ## 6. 
Validation — `validation/`
 
From a41a221cc829a584e46a3ed6917b0662247f74cc Mon Sep 17 00:00:00 2001
From: hk-2029
Date: Mon, 16 Mar 2026 14:18:11 +0000
Subject: [PATCH 11/23] docs: condense overlong table entries in file-organisation

---
 .../wiki/developer-guide/internals/file-organisation.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/documentation/wiki/developer-guide/internals/file-organisation.md b/documentation/wiki/developer-guide/internals/file-organisation.md
index 7d73e8bab..0de8ffdf5 100644
--- a/documentation/wiki/developer-guide/internals/file-organisation.md
+++ b/documentation/wiki/developer-guide/internals/file-organisation.md
@@ -148,10 +148,10 @@ Test packages mirror the main source structure:
 
 | Subdirectory | Contents |
 | --- | --- |
-| `InitialPopulations/training/` | De-identified synthetic population CSV that can be distributed with the repo. Generated by `99_training_data.do`: household IDs anonymised, all continuous variables perturbed with 15% random noise, survey weights randomised. Structurally identical to the real data but contains no traceable individual records, so it is not subject to data licence restrictions. Used for CI testing and getting started without access to the full survey data. |
+| `InitialPopulations/training/` | De-identified synthetic population CSV for CI testing and getting started. Real survey data cannot be committed due to licence restrictions; this file is a privacy-protected substitute with anonymised IDs and noise-perturbed continuous variables. |
 | `InitialPopulations/compile/` | 13 Stata do-files that build the full initial population from UKHLS/BHPS/WAS survey data |
 | `InitialPopulations/compile/do_emphist/` | 8 Stata scripts that reconstruct monthly employment histories back to 2007 |
-| `InitialPopulations/compile/RegressionEstimates/` | 17 Stata scripts that estimate regression coefficients and produce the `reg_*.xlsx` files. 
The 14 `reg_*.do` scripts each estimate one behavioural model; `programs.do` defines shared utility functions used by the estimation scripts; `variable_update.do` prepares variables before estimation; `master.do` orchestrates the full run. |
+| `InitialPopulations/compile/RegressionEstimates/` | 17 Stata scripts that estimate regression coefficients and produce the `reg_*.xlsx` files (14 `reg_*.do` estimation scripts plus `master.do`, `programs.do`, `variable_update.do`) |
 | `DoFilesTarget/` | 5 Stata scripts that generate alignment target files (employment shares, education targets, partnership rates) |
 | `EUROMODoutput/` | Pre-computed EUROMOD tax-benefit donor files, one per policy year. These are loaded into the H2 database during the setup phase. |
 
From 00d3a0a41c23543b05c2330a100f883911da3d13 Mon Sep 17 00:00:00 2001
From: hk-2029
Date: Mon, 16 Mar 2026 16:01:51 +0000
Subject: [PATCH 12/23] docs: delete architecture.md, merge run-configuration into configuration

---
 CLAUDE.md | 94 ++++++++++++++++
 documentation/README.md | 8 +-
 documentation/architecture.md | 64 -----------
 documentation/configuration.md | 173 +++++++++--------------------
 documentation/model-concepts.md | 2 +-
 documentation/run-configuration.md | 116 -------------------
 6 files changed, 151 insertions(+), 306 deletions(-)
 create mode 100644 CLAUDE.md
 delete mode 100644 documentation/architecture.md
 delete mode 100644 documentation/run-configuration.md

diff --git a/CLAUDE.md b/CLAUDE.md
new file mode 100644
index 000000000..fbfd20763
--- /dev/null
+++ b/CLAUDE.md
@@ -0,0 +1,94 @@
+# CLAUDE.md
+
+This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
+
+## Build Commands
+
+```bash
+# Build (skip tests)
+mvn clean package -DskipTests
+
+# Run unit tests
+mvn test
+
+# Run a single test class
+mvn test -Dtest=PersonTest
+
+# Run all tests including integration tests
+mvn verify
+```
+
+The build produces two runnable JARs:
+- `target/singlerun.jar` — single simulation run (GUI or headless)
+- `target/multirun.jar` — batch runs from a YAML config file
+
+## Running the Simulation
+
+```bash
+# Single run (headless, UK, setup from scratch)
+java -jar target/singlerun.jar -g false -c UK -Setup
+
+# Multi-run batch from config
+java -jar target/multirun.jar -config config/default.yml -g false
+```
+
+Key CLI flags: `-c` (country), `-s` (start year), `-e` (end year), `-g` (GUI true/false), `-Setup` (rebuild database), `-r` (random seed), `-p` (population size).
+
+## Architecture
+
+SimPaths is a discrete-time (annual steps) agent-based microsimulation framework built on the [JAS-mine](https://www.jas-mine.net/) engine. It projects life histories forward across labour, family, health, and financial domains.
+
+### Agent Hierarchy
+
+```
+Household → BenefitUnit(s) → Person(s)
+```
+
+- **Person** (`simpaths/model/Person.java`) — individual agent; carries all demographics, health, education, labour, and income state.
+- **BenefitUnit** (`simpaths/model/BenefitUnit.java`) — tax/benefit assessment unit (one or two adults + dependents).
+- **Household** (`simpaths/model/Household.java`) — grouping of benefit units at the same address.
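
The containment relationship in the agent hierarchy can be sketched in plain Java. This is a hypothetical illustration only: the record names (`PersonSketch`, `BenefitUnitSketch`, `HouseholdSketch`) and their fields are assumptions made for this example and do not reproduce the real `Person`, `BenefitUnit` and `Household` classes, which carry far more state.

```java
import java.util.List;

/** Hypothetical sketch of Household -> BenefitUnit(s) -> Person(s); not the real SimPaths classes. */
final class AgentHierarchySketch {

    private AgentHierarchySketch() {}

    /** Stand-in for an individual agent. */
    record PersonSketch(long id, int age) {}

    /** Stand-in for a tax/benefit assessment unit: one or two adults plus dependents. */
    record BenefitUnitSketch(long id, List<PersonSketch> adults, List<PersonSketch> dependents) {
        BenefitUnitSketch {
            if (adults.isEmpty() || adults.size() > 2) {
                throw new IllegalArgumentException("a benefit unit has one or two adults");
            }
            adults = List.copyOf(adults);         // defensive, immutable copies
            dependents = List.copyOf(dependents);
        }

        /** Number of persons in this benefit unit. */
        int size() {
            return adults.size() + dependents.size();
        }
    }

    /** Stand-in for a group of benefit units at the same address. */
    record HouseholdSketch(long id, List<BenefitUnitSketch> benefitUnits) {
        HouseholdSketch {
            benefitUnits = List.copyOf(benefitUnits);
        }

        /** Number of persons across all benefit units in the household. */
        int size() {
            return benefitUnits.stream().mapToInt(BenefitUnitSketch::size).sum();
        }
    }

    public static void main(String[] args) {
        // A couple with one child, plus an adult relative forming a second benefit unit.
        BenefitUnitSketch family = new BenefitUnitSketch(
                1L,
                List.of(new PersonSketch(1L, 41), new PersonSketch(2L, 39)),
                List.of(new PersonSketch(3L, 7)));
        BenefitUnitSketch relative = new BenefitUnitSketch(
                2L, List.of(new PersonSketch(4L, 70)), List.of());
        HouseholdSketch household = new HouseholdSketch(1L, List.of(family, relative));
        System.out.println("persons in household: " + household.size());
    }
}
```

Modelling the benefit unit as its own level, rather than attaching persons directly to a household, mirrors the point made above that tax and benefit assessment happens per benefit unit while an address can hold more than one of them.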
+ +### Package Map + +| Package | Responsibility | +|---|---| +| `simpaths/experiment/` | Entry points and orchestration: `SimPathsStart`, `SimPathsMultiRun`, `SimPathsCollector`, `SimPathsObserver` | +| `simpaths/model/` | Core simulation logic: agent classes, annual process methods, alignment, labour market, tax evaluation, intertemporal decisions | +| `simpaths/data/` | Parameters, setup routines, input parsers, filters, statistics helpers, regression managers, EUROMOD donor matching | + +### Simulation Engine + +`SimPathsModel.java` is the central manager registered with JAS-mine. It owns all agent collections and builds the ordered event schedule. Each simulated year runs **44 ordered processes** covering: +1. Year setup / parameter updates +2. Demographic events (ageing, mortality, fertility, education) +3. Labour market transitions +4. Partnership dynamics (cohabitation, separation, union matching via `UnionMatching.java`) +5. Health and wellbeing +6. Tax-benefit evaluation (via EUROMOD donor matching in `TaxEvaluation.java`) +7. Financial outcomes and aggregate alignment to calibration targets + +### Configuration System + +Runtime parameters live in `config/default.yml` (template) and are loaded by `SimPathsMultiRun`. The layered override order is: **class defaults → YAML values → CLI flags**. + +Key top-level YAML keys: `maxNumberOfRuns`, `executeWithGui`, `randomSeed`, `startYear`, `endYear`, `popSize`. Model-specific keys toggle alignment, time-trend controls, and individual module switches. + +### Data / Database + +The initial population and EUROMOD donor data are stored in an embedded **H2 database** built during the `-Setup` phase. Integration tests that rebuild or query the database are in `src/test/java/simpaths/integrationtest/`. 
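The layered override order given under Configuration System (class defaults, then YAML values, then CLI flags) can be sketched as a simple merge. This is an assumption-laden illustration in Python, not the logic of `SimPathsMultiRun`: the function `resolve_settings`, the convention that an unset CLI flag is `None`, and all numeric values are hypothetical.

```python
from typing import Any, Dict


def resolve_settings(
    class_defaults: Dict[str, Any],
    yaml_values: Dict[str, Any],
    cli_flags: Dict[str, Any],
) -> Dict[str, Any]:
    """Merge three layers; later layers override earlier ones."""
    merged = dict(class_defaults)   # layer 1: class field defaults
    merged.update(yaml_values)      # layer 2: values read from the YAML file
    # layer 3: only CLI flags that were actually supplied (assumed non-None)
    merged.update({k: v for k, v in cli_flags.items() if v is not None})
    return merged


# Illustrative values only; they are not taken from default.yml.
class_defaults = {"maxNumberOfRuns": 1, "randomSeed": 0, "popSize": 1000}
yaml_values = {"randomSeed": 42, "popSize": 20000}
cli_flags = {"maxNumberOfRuns": 10, "popSize": None}  # popSize not passed on the CLI

settings = resolve_settings(class_defaults, yaml_values, cli_flags)
print(settings)
```

Here `maxNumberOfRuns` comes from the CLI layer, `randomSeed` and `popSize` from the YAML layer, and any key absent from both would keep its class default.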
+ +## Key Tech + +- **Java 19**, Maven 3.x +- **JAS-mine 4.3.25** — microsimulation engine and GUI +- **JUnit 5 + Mockito 5** for tests +- **Apache Commons Math3, CLI, CSV** and **SnakeYAML** for utilities + +## Documentation + +Detailed guides are in `documentation/`: +- `model-concepts.md` — agent lifecycle and annual-cycle detail +- `configuration.md` — YAML structure, config keys, and how to write your own +- `data-pipeline.md` — how input data is prepared and loaded +- `validation-guide.md` — model validation procedures +- `cli-reference.md` — full CLI argument reference \ No newline at end of file diff --git a/documentation/README.md b/documentation/README.md index 08a0f501a..cb73dffe1 100644 --- a/documentation/README.md +++ b/documentation/README.md @@ -9,14 +9,12 @@ These files are a **CLI- and developer-workflow quick reference** for working di 1. [Model Concepts](model-concepts.md) — what SimPaths simulates, agents, annual cycle, alignment, EUROMOD 2. [Getting Started](getting-started.md) — prerequisites, build, first run 3. [CLI Reference](cli-reference.md) — all flags for `singlerun.jar` and `multirun.jar` -4. [Configuration](configuration.md) — YAML structure and all config keys -5. [Run Configuration](run-configuration.md) — provided configs and how to build your own -6. [Data and Outputs](data-and-outputs.md) — input layout, setup artifacts, output files -7. [Troubleshooting](troubleshooting.md) — common errors and fixes +4. [Configuration](configuration.md) — YAML structure, config keys, and how to write your own +5. [Data and Outputs](data-and-outputs.md) — input layout, setup artifacts, output files +6. 
[Troubleshooting](troubleshooting.md) — common errors and fixes For contributors and advanced users: -- [Architecture](architecture.md) — source package structure and data flow - [Development and Testing](development.md) — build, tests, CI, contributor workflow - [Data Pipeline](data-pipeline.md) — how input files are generated from UKHLS/EUROMOD/WAS survey data - [Validation Guide](validation-guide.md) — two-stage validation workflow (estimate validation + simulated output validation) diff --git a/documentation/architecture.md b/documentation/architecture.md deleted file mode 100644 index 69c19c36e..000000000 --- a/documentation/architecture.md +++ /dev/null @@ -1,64 +0,0 @@ -# Architecture - -For a conceptual overview of the simulation (agents, annual cycle, modules, alignment), see [Model Concepts](model-concepts.md). This page covers source-level structure and data flow. - ---- - -## High-level module map - -Core package layout under `src/main/java/simpaths/`: - -| Package | Contents | -|---------|----------| -| `experiment/` | Entry points, orchestration, and runtime managers (`SimPathsStart`, `SimPathsMultiRun`, `SimPathsCollector`, `SimPathsObserver`) | -| `model/` | Core simulation entities (`Person`, `BenefitUnit`, `Household`), yearly process logic, alignment routines, labour market, union matching, tax evaluation, intertemporal decision module | -| `data/` | Parameters, setup routines, input parsers, filters, statistics helpers | - ---- - -## Primary entry points - -### `simpaths.experiment.SimPathsStart` - -- Builds or refreshes setup artifacts (H2 database, policy schedule) -- Launches a single simulation run, GUI or headless - -### `simpaths.experiment.SimPathsMultiRun` - -- Loads a YAML config from `config/` -- Iterates runs with optional seed or innovation logic -- Supports persistence mode switching across runs - ---- - -## Runtime managers - -All three are registered with the JAS-mine simulation engine at startup. 
They live in `simpaths.experiment`: - -| Class | Role | -|-------|------| -| `SimPathsModel` | Owns the agent collections, builds the event schedule, fires yearly processes | -| `SimPathsCollector` | Computes and exports statistics at scheduled intervals | -| `SimPathsObserver` | GUI observation layer, only active when GUI is enabled | - ---- - -## Data flow - -1. **Setup stage** — `SimPathsStart` or `multirun -DBSetup` generates `input/input.mv.db`, `input/EUROMODpolicySchedule.xlsx`, and `input/DatabaseCountryYear.xlsx`. -2. **Initialisation** — `SimPathsModel.buildObjects()` loads parameters, reads the initial population CSV, and hydrates agent collections. -3. **Yearly loop** — `SimPathsModel.buildSchedule()` registers all process events in fixed order. Each year the engine fires them sequentially across `Person`, `BenefitUnit`, and model-level processes. See [Model Concepts — Annual simulation cycle](model-concepts.md#annual-simulation-cycle) for the full ordered list. -4. **Collection** — `SimPathsCollector` computes cross-sectional statistics and writes CSV outputs at the end of each year. -5. **Output** — files land in timestamped run folders under `output/`. - ---- - -## Configuration flow - -`SimPathsMultiRun` applies values in three layers (later layers override earlier ones): - -1. Class field defaults -2. Values from `config/.yml` -3. CLI flags provided at invocation - -This layered strategy supports reproducible batch runs with targeted command-line overrides without editing YAML files. diff --git a/documentation/configuration.md b/documentation/configuration.md index 90fca123f..7d271ff02 100644 --- a/documentation/configuration.md +++ b/documentation/configuration.md @@ -1,23 +1,43 @@ # Configuration -SimPaths multi-run behavior is controlled by YAML files in `config/`. +SimPaths batch runs are controlled by YAML files in `config/`. The main config is `default.yml`, which is fully annotated with inline comments. 
-This repository ships with three configs: +--- + +## Quick run + +After building, three commands are all you need: + +```bash +mvn clean package +java -jar multirun.jar -DBSetup +java -jar multirun.jar +``` + +The first builds the JARs. The second creates the H2 donor database from the input data. The third runs the simulation using `default.yml`. + +To use a different config file: -- `default.yml` — standard baseline run (well-commented reference for all fields) -- `test_create_database.yml` — database setup using training data -- `test_run.yml` — short integration-test run +```bash +java -jar multirun.jar -config my_run.yml +``` -For command-by-command guidance and a template for building your own config, see [Run Configuration](run-configuration.md). +--- ## How config is applied `SimPathsMultiRun` loads `config/` and applies values in two stages: -1. YAML values initialize runtime fields and argument maps. +1. YAML values initialise runtime fields and argument maps. 2. CLI flags override those values if provided. -## Top-level keys +If a key is not specified in the YAML, the Java class field default is used. Each config file is standalone — there is no inheritance between config files. + +--- + +## Writing your own config + +Place a new `.yml` file in `config/` and pass it via `-config`. You only need to specify the values you want to change — everything else falls back to the Java class field defaults. 
### Core run arguments @@ -25,130 +45,33 @@ For command-by-command guidance and a template for building your own config, see |-----|---------|-------------| | `maxNumberOfRuns` | `1` | Number of sequential simulation runs | | `executeWithGui` | `false` | `true` launches the JAS-mine GUI; `false` = headless (required on servers/CI) | -| `randomSeed` | `606` | RNG seed for the first run; auto-incremented when `randomSeedInnov` is true | +| `randomSeed` | `606` | RNG seed for the first run | | `startYear` | `2019` | First simulation year (must have matching input/donor data) | | `endYear` | `2022` | Last simulation year (inclusive) | | `popSize` | `50000` | Simulated population size; larger = more accurate but slower | | `countryString` | auto | `"United Kingdom"` or `"Italy"`; auto-detected from donor DB if omitted | -| `integrationTest` | `false` | Writes output to a fixed folder for CI comparison | ---- - -### `model_args` - -Keys map directly to `@GUIparameter`-annotated fields on `SimPathsModel`. Anything settable in the GUI can also be set here. - -#### Alignment flags - -Alignment prevents aggregate rates from drifting from known targets. Each dimension is independently controlled: - -| Flag | Default | What it aligns | -|------|---------|----------------| -| `alignPopulation` | `true` | Age-sex-region totals to demographic projections | -| `alignCohabitation` | `true` | Share of individuals in partnerships | -| `alignFertility` | `false` | Birth rates to projected fertility rates | -| `alignInSchool` | `false` | School participation rate (age 16–29) | -| `alignEducation` | `false` | Completed education level distribution | -| `alignEmployment` | `false` | Employment share | - -See [Model Concepts — Alignment](model-concepts.md#alignment) for a fuller explanation. - -#### Income security (S-Index) +### Collector arguments -The S-Index is an economic (in)security measure computed each year per person and reported in `Statistics1.csv` as `SIndex_p50`. 
It takes a rolling window of equivalised consumption observations, applies exponential discounting, and weights losses more heavily than gains according to a risk-aversion parameter. - -| Parameter | Default | Meaning | -|-----------|---------|---------| -| `sIndexTimeWindow` | `5` | Length of rolling window in years | -| `sIndexAlpha` | `2` | Coefficient of relative risk aversion — higher values make the index more sensitive to consumption drops | -| `sIndexDelta` | `0.98` | Annual discount factor applied to past consumption observations | - -#### Intertemporal optimisation (IO) - -Enables a backward-induction life-cycle solution for consumption and labour supply. Decision grids are pre-computed in year 0; agents look up their optimal choice each year. Computationally intensive — disabled by default. - -The IO state-space flags control which personal characteristics enter the grid (each adds a dimension and increases solve time): - -| Flag | Default | -|------|---------| -| `responsesToHealth` | `true` | -| `responsesToDisability` | `false` | -| `responsesToEducation` | `true` | -| `responsesToPension` | `false` | -| `responsesToRetirement` | `false` | -| `responsesToLowWageOffer` | `true` | -| `responsesToRegion` | `false` | - -Grid persistence flags allow a baseline grid to be solved once and reused in counterfactual runs (`saveBehaviour: true` / `useSavedBehaviour: true` with `readGrid: ""`). See [Run Configuration](run-configuration.md) for an example. - ---- - -### `innovation_args` - -Controls how parameters change across sequential runs (run 0, run 1, run 2, …). Useful for sensitivity analysis and uncertainty quantification. 
- -| Flag | Default | Behavior | -|------|---------|----------| -| `randomSeedInnov` | `true` | Increments `randomSeed` by 1 for each successive run so each gets a distinct seed | -| `flagDatabaseSetup` | `false` | If `true`, runs database setup instead of simulation (equivalent to `-DBSetup` on the CLI) | -| `intertemporalElasticityInnov` | `false` | If `true`, applies interest rate shocks: run 1 = +0.0075 (higher return to saving), run 2 = −0.0075 (lower return to saving). Requires `maxNumberOfRuns >= 3` to see all variants. | -| `labourSupplyElasticityInnov` | `false` | If `true`, applies disposable income shocks: run 1 = +0.01 (higher net labour income), run 2 = −0.01 (lower net labour income). Requires `maxNumberOfRuns >= 3`. | - ---- - -### `collector_args` - -Controls what `SimPathsCollector` writes to CSV or database each simulation year. - -#### Output files - -| File | Content | Enabled by | -|------|---------|-----------| -| `Statistics1.csv` | Income distribution: Gini coefficients, income percentiles, median equivalised disposable income (EDI), S-Index | `persistStatistics: true` | -| `Statistics2.csv` | Demographic validation: partnership rates, employment rates, health and disability measures by age and gender | `persistStatistics2: true` | -| `Statistics3.csv` | Alignment diagnostics: simulated vs target rates and the adjustment factors applied | `persistStatistics3: true` | -| `EmploymentStatistics.csv` | Labour market transitions and participation rates | `persistEmploymentStatistics: true` | -| `HealthStatistics.csv` | Health measures (SF-12, GHQ-12, EQ-5D) by age and gender | *(written automatically when health statistics are computed)* | - -For a description of the variables in these files, see `documentation/SimPaths_Variable_Codebook.xlsx`. 
- -#### Other collector flags +The `collector_args` section controls what output files are produced: | Flag | Default | Description | |------|---------|-------------| -| `calculateGiniCoefficients` | `false` | Compute Gini coefficients (also populates GUI charts); off by default for speed | +| `persistStatistics` | `true` | Write `Statistics1.csv` — income distribution, Gini, S-Index | +| `persistStatistics2` | `true` | Write `Statistics2.csv` — demographic validation by age and gender | +| `persistStatistics3` | `true` | Write `Statistics3.csv` — alignment diagnostics | | `exportToCSV` | `true` | Write outputs to CSV files under `output//csv/` | -| `exportToDatabase` | `false` | Write outputs to H2 database in addition to or instead of CSV | -| `persistPersons` | `false` | Write one row per person per year (produces large files) | -| `persistBenefitUnits` | `false` | Write one row per benefit unit per year (produces large files) | -| `persistHouseholds` | `false` | Write one row per household per year | -| `dataDumpStartTime` | `0` | First year to write output (`0` = `startYear`) | -| `dataDumpTimePeriod` | `1.0` | Output frequency in years (`1.0` = every year) | - ---- - -### `parameter_args` - -Overrides file paths and model-global flags in `Parameters`. -| Key | Default | Description | -|-----|---------|-------------| -| `input_directory` | `input` | Path to input data folder | -| `input_directory_initial_populations` | `input/InitialPopulations` | Path to initial population CSVs | -| `euromod_output_directory` | `input/EUROMODoutput` | Path to EUROMOD/UKMOD output files | -| `trainingFlag` | `false` | If `true`, loads training data from `input/.../training/` subfolders (set automatically by test configs) | -| `includeYears` | *(all)* | List of policy years for which EUROMOD donor data is available; only these years enter the donor database | - ---- +For a description of the variables in these files, see `documentation/SimPaths_Variable_Codebook.xlsx`. 
-## Minimal example +### Minimal example ```yaml -maxNumberOfRuns: 2 +maxNumberOfRuns: 5 executeWithGui: false -randomSeed: 100 +randomSeed: 42 startYear: 2019 -endYear: 2022 +endYear: 2030 popSize: 20000 collector_args: @@ -157,8 +80,18 @@ collector_args: persistStatistics3: true ``` -Run it: +--- -```bash -java -jar multirun.jar -config my_run.yml -``` +## Additional arguments + +The YAML file supports several other argument sections (`model_args`, `innovation_args`, `parameter_args`) that control alignment flags, intertemporal optimisation settings, sensitivity analysis parameters, and file paths. Many of these are for specific analyses and some are under active review. The annotated `default.yml` file documents all available keys with inline comments. + +Note that some settings — particularly alignment — are primarily controlled in `SimPathsModel.java` rather than through the YAML file. + +--- + +## Practical notes + +- Use quotes around config filenames that contain spaces: `-config "my config.yml"`. +- Add `-f` to write run logs to `output/logs/`. +- Override individual values at runtime without editing the YAML, for example `-n 10` overrides `maxNumberOfRuns`. diff --git a/documentation/model-concepts.md b/documentation/model-concepts.md index 2130d3e6b..8a150938c 100644 --- a/documentation/model-concepts.md +++ b/documentation/model-concepts.md @@ -125,4 +125,4 @@ This gives SimPaths annually updated policy rules without re-implementing the fu When `enableIntertemporalOptimisations: true`, SimPaths solves a life-cycle consumption and labour supply problem. Decision grids are pre-computed in year 0 (`RationalOptimisation`) by solving backwards over the remaining horizon. In each subsequent year agents look up their optimal choice from the grid given their current state. -This is computationally intensive and disabled by default. 
When enabled, `saveBehaviour` and `useSavedBehaviour` allow a baseline grid to be reused in counterfactual runs without recomputing it — see [Run Configuration](run-configuration.md) for an example. +This is computationally intensive and disabled by default. When enabled, `saveBehaviour` and `useSavedBehaviour` allow a baseline grid to be reused in counterfactual runs without recomputing it — see the annotated `config/default.yml` for the relevant keys. diff --git a/documentation/run-configuration.md b/documentation/run-configuration.md deleted file mode 100644 index d4938cb1d..000000000 --- a/documentation/run-configuration.md +++ /dev/null @@ -1,116 +0,0 @@ -# Run Configuration - -This guide maps every YAML config currently in `config/` to its intended use, and explains how to build your own. - -All commands assume you are running from repository root after building jars. - ---- - -## Provided configs - -### `default.yml` - -The standard baseline run with conservative defaults. Use this as your starting point for any new analysis. - -```bash -java -jar multirun.jar -config default.yml -g false -``` - -### `test_create_database.yml` - -Test-oriented database setup using training data (`trainingFlag: true`). Creates the H2 donor database needed before running simulations. - -```bash -java -jar multirun.jar -DBSetup -config test_create_database.yml -``` - -### `test_run.yml` - -Short integration-style run (2 runs, test settings, training data). Used by CI and useful for reproducing CI behavior locally. - -```bash -java -jar multirun.jar -config test_run.yml -P root -``` - ---- - -## Building your own config - -Place a new `.yml` file in `config/` and pass it via `-config`. You only need to specify the values you want to change — everything else falls back to the Java class field defaults. Each config file is independent; there is no inheritance from `default.yml` or any other YAML file. 
- -The keys under `model_args` map directly to the `@GUIparameter`-annotated fields on `SimPathsModel` — so anything you can set in the GUI can also be set here. - -### Minimal template - -```yaml -maxNumberOfRuns: 5 -executeWithGui: false -randomSeed: 42 -startYear: 2019 -endYear: 2030 -countryString: UK -popSize: 20000 - -collector_args: - persistStatistics: true - persistStatistics2: true - persistStatistics3: true - persistPersons: false - persistBenefitUnits: false - persistHouseholds: false -``` - -### Enabling alignment - -To align simulated aggregates to external targets, add `model_args` with the relevant flags: - -```yaml -model_args: - alignPopulation: true - alignCohabitation: true - alignFertility: true - alignInSchool: true - alignEducation: true -``` - -See [Configuration](configuration.md) for a full list of `model_args` toggles, and [Model Concepts](model-concepts.md) for what each alignment dimension does. - -### Running sensitivity analyses - -To vary a parameter across runs, use `innovation_args`. For example, to sweep the intertemporal interest-rate innovation: - -```yaml -maxNumberOfRuns: 3 -model_args: - enableIntertemporalOptimisations: true - -innovation_args: - intertemporalElasticityInnov: true -``` - -### Saving and reusing a behavioural grid - -If you have computed a decision grid for a baseline scenario and want to reuse it in a counterfactual: - -```yaml -# Baseline run — saves the grid to output// -model_args: - enableIntertemporalOptimisations: true - saveBehaviour: true - -# Counterfactual run — loads the saved grid -# readGrid must be set to the exact output folder name of the baseline run -model_args: - enableIntertemporalOptimisations: true - useSavedBehaviour: true - readGrid: "my_baseline_run" # replace with the actual folder name under output/ -``` - ---- - -## Practical notes - -- Use quotes around config filenames that contain spaces: `-config "my config.yml"`. -- Add `-f` to write run logs to `output/logs/`. 
-- Override individual values at runtime without editing the YAML, for example `-n 10` overrides `maxNumberOfRuns`. -- Add `-P none` when you do not need the processed dataset to persist between runs (faster). From b6b0f4e51feb234edd7a1ccd068aae511f543473 Mon Sep 17 00:00:00 2001 From: hk-2029 Date: Mon, 16 Mar 2026 17:09:05 +0000 Subject: [PATCH 13/23] docs: remove country selection references, UK only --- documentation/cli-reference.md | 3 +-- documentation/configuration.md | 1 - documentation/data-pipeline.md | 1 - documentation/getting-started.md | 18 ++++-------------- documentation/troubleshooting.md | 2 +- 5 files changed, 6 insertions(+), 19 deletions(-) diff --git a/documentation/cli-reference.md b/documentation/cli-reference.md index 7535ba00b..e1099b0b9 100644 --- a/documentation/cli-reference.md +++ b/documentation/cli-reference.md @@ -12,7 +12,6 @@ java -jar singlerun.jar [options] | Option | Meaning | |---|---| -| `-c`, `--country ` | Country code (`UK` or `IT`) | | `-s`, `--startYear ` | Simulation start year | | `-Setup` | Setup only (do not run simulation) | | `-Run` | Run only (skip setup) | @@ -30,7 +29,7 @@ Notes: Setup only: ```bash -java -jar singlerun.jar -c UK -s 2019 -g false -Setup --rewrite-policy-schedule +java -jar singlerun.jar -s 2019 -g false -Setup --rewrite-policy-schedule ``` Run only (after setup exists): diff --git a/documentation/configuration.md b/documentation/configuration.md index 7d271ff02..53c15705e 100644 --- a/documentation/configuration.md +++ b/documentation/configuration.md @@ -49,7 +49,6 @@ Place a new `.yml` file in `config/` and pass it via `-config`. 
You only need to | `startYear` | `2019` | First simulation year (must have matching input/donor data) | | `endYear` | `2022` | Last simulation year (inclusive) | | `popSize` | `50000` | Simulated population size; larger = more accurate but slower | -| `countryString` | auto | `"United Kingdom"` or `"Italy"`; auto-detected from donor DB if omitted | ### Collector arguments diff --git a/documentation/data-pipeline.md b/documentation/data-pipeline.md index a43d1db2b..454abd3e8 100644 --- a/documentation/data-pipeline.md +++ b/documentation/data-pipeline.md @@ -107,6 +107,5 @@ Population projection targets (`align_popProjections.xlsx`) and fertility/mortal | Adding a new data year to the simulation | Part 1 (re-slice the population for the new year) + Part 3 (update alignment targets) | | Re-estimating a behavioural module | Part 2 (the affected `reg_*.do` script only) + Stage 1 validation | | Updating employment alignment targets | Part 3 (`01_employment_shares_initpopdata.do`) | -| Adding a new country | All three parts with country-appropriate data sources | After re-running any part, re-run setup (`singlerun -Setup` or `multirun -DBSetup`) to rebuild `input/input.mv.db` before running the simulation. diff --git a/documentation/getting-started.md b/documentation/getting-started.md index 9ba29a426..6fdc3ab53 100644 --- a/documentation/getting-started.md +++ b/documentation/getting-started.md @@ -28,23 +28,13 @@ SimPaths supports two entry points: ## First run (headless) -### Step 1: setup input artifacts - ```bash -java -jar singlerun.jar -c UK -s 2019 -g false -Setup --rewrite-policy-schedule +mvn clean package +java -jar multirun.jar -DBSetup +java -jar multirun.jar ``` -This prepares required setup files such as: - -- `input/input.mv.db` -- `input/EUROMODpolicySchedule.xlsx` -- `input/DatabaseCountryYear.xlsx` - -### Step 2: execute a multi-run configuration - -```bash -java -jar multirun.jar -config default.yml -g false -``` +The first command builds the JARs. 
The second creates the H2 donor database from the input data. The third runs the simulation using `default.yml`. Results are written under `output//`. diff --git a/documentation/troubleshooting.md b/documentation/troubleshooting.md index d9e69082c..5ff3bee8b 100644 --- a/documentation/troubleshooting.md +++ b/documentation/troubleshooting.md @@ -26,7 +26,7 @@ Fix: - Re-run setup with rewrite enabled: ```bash -java -jar singlerun.jar -c UK -s 2019 -g false --rewrite-policy-schedule -Setup +java -jar singlerun.jar -s 2019 -g false --rewrite-policy-schedule -Setup ``` ## GUI errors on server or CI From 70c3d34be20a0e6caf97fe2f812704930e18480b Mon Sep 17 00:00:00 2001 From: hk-2029 Date: Mon, 16 Mar 2026 22:16:11 +0000 Subject: [PATCH 14/23] docs: replace data-and-outputs with repository-structure, remove getting-started and development - New repository-structure.md with full directory tree - Prerequisites and quick run consolidated into configuration.md - Training mode info moved to repository-structure.md - Condensed 99_training_data.do table entry in data-pipeline.md - Removed country variants reference from README.md --- documentation/README.md | 12 ++-- documentation/configuration.md | 8 ++- documentation/data-and-outputs.md | 66 -------------------- documentation/data-pipeline.md | 2 +- documentation/development.md | 61 ------------------- documentation/getting-started.md | 55 ----------------- documentation/repository-structure.md | 88 +++++++++++++++++++++++++++ 7 files changed, 99 insertions(+), 193 deletions(-) delete mode 100644 documentation/data-and-outputs.md delete mode 100644 documentation/development.md delete mode 100644 documentation/getting-started.md create mode 100644 documentation/repository-structure.md diff --git a/documentation/README.md b/documentation/README.md index cb73dffe1..6bb2d3678 100644 --- a/documentation/README.md +++ b/documentation/README.md @@ -1,21 +1,19 @@ # SimPaths Documentation -These files are a **CLI- and 
developer-workflow quick reference** for working directly with the repository — building, running, configuring, and troubleshooting from the command line. For the full model documentation (simulated modules, parameterisation, GUI usage, country variants, research), see the [website](../documentation/wiki/index.md). +These files are a **quick reference** for working directly with the repository — building, running, configuring, and troubleshooting from the command line. For the full model documentation (simulated modules, parameterisation, GUI usage, research), see the [website](../documentation/wiki/index.md). --- ## Recommended reading order 1. [Model Concepts](model-concepts.md) — what SimPaths simulates, agents, annual cycle, alignment, EUROMOD -2. [Getting Started](getting-started.md) — prerequisites, build, first run -3. [CLI Reference](cli-reference.md) — all flags for `singlerun.jar` and `multirun.jar` -4. [Configuration](configuration.md) — YAML structure, config keys, and how to write your own -5. [Data and Outputs](data-and-outputs.md) — input layout, setup artifacts, output files -6. [Troubleshooting](troubleshooting.md) — common errors and fixes +2. [Configuration](configuration.md) — prerequisites, quick run, YAML structure, config keys +3. [Repository Structure](repository-structure.md) — directory layout, input files, output files +4. [CLI Reference](cli-reference.md) — all flags for `singlerun.jar` and `multirun.jar` +5. 
[Troubleshooting](troubleshooting.md) — common errors and fixes For contributors and advanced users: -- [Development and Testing](development.md) — build, tests, CI, contributor workflow - [Data Pipeline](data-pipeline.md) — how input files are generated from UKHLS/EUROMOD/WAS survey data - [Validation Guide](validation-guide.md) — two-stage validation workflow (estimate validation + simulated output validation) diff --git a/documentation/configuration.md b/documentation/configuration.md index 53c15705e..28d2b9199 100644 --- a/documentation/configuration.md +++ b/documentation/configuration.md @@ -1,12 +1,14 @@ # Configuration -SimPaths batch runs are controlled by YAML files in `config/`. The main config is `default.yml`, which is fully annotated with inline comments. +## Prerequisites ---- +- Java 19 +- Maven 3.8+ +- Optional IDE: IntelliJ IDEA (import as a Maven project) ## Quick run -After building, three commands are all you need: +Three commands are all you need: ```bash mvn clean package diff --git a/documentation/data-and-outputs.md b/documentation/data-and-outputs.md deleted file mode 100644 index 4f4c7b38c..000000000 --- a/documentation/data-and-outputs.md +++ /dev/null @@ -1,66 +0,0 @@ -# Data and Outputs - -## Data availability - -- Source code and documentation are open. -- Full research input datasets (UKHLS initial population, UKMOD policy outputs) are not freely redistributable — see [Getting Started / Data](../documentation/wiki/getting-started/data/index.md) on the website for access instructions. -- Training data is included in the repository to support development, local testing, and CI. 
- -## Input directory layout - -``` -input/ -├── InitialPopulations/ -│ ├── training/ # training population CSV (included in repo) -│ └── compile/ # full data pipeline: builds initial population CSVs from UKHLS/BHPS/WAS, -│ # reconstructs employment histories, and estimates all reg_*.xlsx coefficients -│ # (see Data Pipeline for details) -├── EUROMODoutput/ -│ └── training/ # training UKMOD outputs (included in repo) -├── input.mv.db # H2 donor database — generated by setup -├── EUROMODpolicySchedule.xlsx # policy year mapping — generated by setup -├── DatabaseCountryYear.xlsx # macro parameters — generated by setup -├── reg_*.xlsx # regression coefficient tables -├── align_*.xlsx # alignment targets -├── projections_*.xlsx # demographic projections -└── scenario_*.xlsx # scenario-specific parameter overrides -``` - -For a description of each `reg_`, `align_`, and `scenario_` file, see [Model Parameterisation](../documentation/wiki/overview/parameterisation.md) on the website. - -## Setup-generated artifacts - -Running setup mode (`singlerun -Setup` or `multirun -DBSetup`) creates or refreshes: - -- `input/input.mv.db` — H2 database of EUROMOD donor tax-benefit outcomes -- `input/EUROMODpolicySchedule.xlsx` — maps simulation years to EUROMOD policy systems -- `input/DatabaseCountryYear.xlsx` — country- and year-specific macro parameters - -These three files must exist before any simulation run. If they are missing, re-run setup. 
- -## Output directory layout - -Simulation runs produce timestamped folders under `output/`: - -``` -output// -├── csv/ -│ ├── Statistics1.csv # income distribution (Gini, percentiles, S-Index) -│ ├── Statistics21.csv # demographic validation (employment, health, partnership by age/gender) -│ ├── Statistics31.csv # alignment diagnostics (simulated vs target rates) -│ ├── EmploymentStatistics1.csv # labour market transitions and participation rates -│ └── HealthStatistics1.csv # health measures (SF-12, GHQ-12, EQ-5D) by age/gender -├── database/ # run-specific persistence output -└── input/ # copied/persisted run input artifacts -``` - -CSV filenames follow the pattern `.csv`. With a single run the suffix is `1`; with multiple runs each run produces its own numbered file (e.g. `Statistics12.csv` for Statistics of run 2). - -For a description of the variables in these CSV files, see `documentation/SimPaths_Variable_Codebook.xlsx`. - -## Logging - -With `-f` on `multirun.jar`, logs are written to: - -- `output/logs/run_.txt` — stdout capture -- `output/logs/run_.log` — log4j output diff --git a/documentation/data-pipeline.md b/documentation/data-pipeline.md index 454abd3e8..554e9c540 100644 --- a/documentation/data-pipeline.md +++ b/documentation/data-pipeline.md @@ -38,7 +38,7 @@ The pipeline runs in numbered stages: | `08_wealth_to_ukhls.do` | Merges WAS wealth into UKHLS records | | `09_finalise_input_data.do` | Final cleaning and formatting | | `10_check_yearly_data.do` | Per-year consistency checks | -| `99_training_data.do` | Produces the de-identified synthetic population committed to `input/InitialPopulations/training/`. Selects a random subset of households, anonymises all IDs, collapses wave identifiers, randomises survey weights, and adds 15% random noise to all continuous variables (income, wealth, wages, care hours). 
This produces a file that is structurally identical to the real population data but contains no traceable individual records, so it can be distributed with the repo without breaching data licence terms. | +| `99_training_data.do` | Produces the de-identified training population committed to `input/InitialPopulations/training/` | ### Employment history sub-pipeline (`compile/do_emphist/`) diff --git a/documentation/development.md b/documentation/development.md deleted file mode 100644 index c5f5c4da9..000000000 --- a/documentation/development.md +++ /dev/null @@ -1,61 +0,0 @@ -# Development and Testing - -## Build - -Compile and package: - -```bash -mvn clean package -``` - -## Tests - -### Unit tests - -Run unit tests (Surefire): - -```bash -mvn test -``` - -### Integration tests - -Run integration tests (Failsafe): - -```bash -mvn verify -``` - -Integration tests exercise setup and run flows and compare generated CSV outputs to expected files in: - -- `src/test/java/simpaths/integrationtest/expected/` - -## CI workflows - -GitHub workflows in `.github/workflows/` run: - -- build and package on pull requests to `main` and `develop` -- integration tests (`mvn verify`) -- smoke runs for `singlerun.jar` and `multirun.jar` with persistence variants -- Javadoc generation and publish (on `develop` pushes) - -## Javadoc - -Generate locally: - -```bash -mvn javadoc:javadoc -``` - -## Typical contributor flow - -1. Create a feature branch in your fork. -2. Implement and test changes. -3. Run `mvn verify` before opening a PR. -4. Open a PR against `develop` (or `main` for stable fixes, when appropriate). - -## Debugging tips - -- Use `-g false` on headless systems. -- Use `-f` with `multirun.jar` to capture logs in `output/logs/`. -- Start from `config/test_create_database.yml` and `config/test_run.yml` when reproducing CI behavior. 
diff --git a/documentation/getting-started.md b/documentation/getting-started.md deleted file mode 100644 index 6fdc3ab53..000000000 --- a/documentation/getting-started.md +++ /dev/null @@ -1,55 +0,0 @@ -# Getting Started - -## Prerequisites - -- Java 19 -- Maven 3.8+ -- Optional IDE: IntelliJ IDEA (import as a Maven project) - -## Build - -From repository root: - -```bash -mvn clean package -``` - -Artifacts produced at the root: - -- `singlerun.jar` -- `multirun.jar` - -## Understand run modes - -SimPaths supports two entry points: - -- `singlerun.jar` (`SimPathsStart`): setup and single simulation execution -- `multirun.jar` (`SimPathsMultiRun`): repeated runs across seeds/scenarios - -## First run (headless) - -```bash -mvn clean package -java -jar multirun.jar -DBSetup -java -jar multirun.jar -``` - -The first command builds the JARs. The second creates the H2 donor database from the input data. The third runs the simulation using `default.yml`. - -Results are written under `output//`. - -## Training vs full data mode - -- The repository includes training data under: - - `input/InitialPopulations/training/` - - `input/EUROMODoutput/training/` -- If no initial-population CSV files are found in the main input location, SimPaths automatically switches to training mode. -- Training mode supports development and CI, but is not intended for research interpretation. - -## GUI usage - -Use `-g true` (default behavior in several flows) to run with GUI components. - -In headless/remote environments, set `-g false`. - -For GUI usage, see the GUI section of the user guide on the project website. 
diff --git a/documentation/repository-structure.md b/documentation/repository-structure.md new file mode 100644 index 000000000..4a750f34b --- /dev/null +++ b/documentation/repository-structure.md @@ -0,0 +1,88 @@ +# Repository Structure + +``` +SimPaths/ +├── config/ # YAML configuration files for simulation runs +│ ├── default.yml # Default simulation parameters (fully annotated) +│ ├── test_create_database.yml # Database creation config (CI) +│ └── test_run.yml # Test run config (CI) +│ +├── documentation/ # Quick-reference docs (this folder) +│ ├── wiki/ # Website source (model description, guides, research) +│ ├── SimPaths_Variable_Codebook.xlsx # Variable definitions for output CSVs +│ ├── SimPaths Stata Parameters.xlsx # Parameter comparison: Stata do-files vs Java +│ └── SimPathsUK_Schedule.xlsx # Event schedule with corresponding Java classes +│ +├── input/ # Input data and parameters +│ ├── InitialPopulations/ +│ │ ├── training/ # De-identified training population (included in repo) +│ │ └── compile/ # Stata pipeline: builds populations, estimates regressions +│ │ ├── do_emphist/ # Employment history reconstruction sub-pipeline +│ │ └── RegressionEstimates/ # Regression coefficient estimation scripts +│ ├── EUROMODoutput/ +│ │ └── training/ # Training UKMOD outputs (included in repo) +│ ├── DoFilesTarget/ # Stata scripts that generate alignment targets +│ ├── reg_*.xlsx # Regression coefficient tables +│ ├── align_*.xlsx # Alignment targets +│ ├── projections_*.xlsx # ONS demographic projections +│ ├── scenario_*.xlsx # Scenario-specific parameter overrides +│ ├── policy parameters.xlsx # Tax-benefit policy parameters +│ ├── validation_statistics.xlsx # Validation targets +│ ├── input.mv.db # H2 donor database (generated by setup) +│ ├── EUROMODpolicySchedule.xlsx # Policy year mapping (generated by setup) +│ └── DatabaseCountryYear.xlsx # Macro parameters (generated by setup) +│ +├── output/ # Simulation outputs (created at runtime) +│ └── / +│ ├── csv/ 
+│ │ ├── Statistics1.csv # Income distribution, Gini, S-Index +│ │ ├── Statistics2.csv # Demographics by age and gender +│ │ ├── Statistics3.csv # Alignment diagnostics +│ │ ├── Person.csv # Person-level output +│ │ ├── BenefitUnit.csv # Benefit-unit-level output +│ │ └── Household.csv # Household-level output +│ ├── database/ # Run-specific persistence output +│ └── input/ # Copied run input artifacts +│ +├── src/ +│ ├── main/java/simpaths/ +│ │ ├── data/ # Parameters, input parsing, filters, statistics +│ │ ├── experiment/ # Entry points: SimPathsStart, SimPathsMultiRun, +│ │ │ # SimPathsCollector, SimPathsObserver +│ │ └── model/ # Core simulation: Person, BenefitUnit, Household, +│ │ ├── decisions/ # intertemporal optimisation grids +│ │ ├── enums/ # categorical variable definitions +│ │ ├── taxes/ # EUROMOD donor matching +│ │ └── lifetime_incomes/ # synthetic income trajectory generation +│ └── test/java/simpaths/ # Unit and integration tests +│ +├── validation/ # Stata validation scripts and reference graphs +│ ├── 01_estimate_validation/ # Predicted vs observed for each regression module +│ └── 02_simulated_output_validation/ # Simulated output vs UKHLS survey data +│ +├── pom.xml # Maven build configuration +├── singlerun.jar # Single-run executable +└── multirun.jar # Multi-run executable +``` + +CSV filenames follow the pattern `.csv`. With a single run the suffix is `1`; with multiple runs each run produces its own numbered file. + +For a description of the variables in output CSV files, see `documentation/SimPaths_Variable_Codebook.xlsx`. For a description of each `reg_*`, `align_*`, and `scenario_*` input file, see [Model Parameterisation](../documentation/wiki/overview/parameterisation.md) on the website. 
+ +## Setup-generated artifacts + +Running setup (`multirun -DBSetup`) creates or refreshes three files in `input/`: + +- `input.mv.db` — H2 database of EUROMOD donor tax-benefit outcomes +- `EUROMODpolicySchedule.xlsx` — maps simulation years to EUROMOD policy systems +- `DatabaseCountryYear.xlsx` — year-specific macro parameters + +These must exist before any simulation run. If they are missing, re-run setup. + +## Training mode + +The repository includes de-identified training data under `input/InitialPopulations/training/` and `input/EUROMODoutput/training/`. If no initial-population CSV files are found in the main input location, SimPaths automatically switches to training mode. Training mode supports development and CI but is not intended for research interpretation. + +## Logging + +With `-f` on `multirun.jar`, logs are written to `output/logs/run_.txt` (stdout) and `output/logs/run_.log` (log4j). From c4374ba17d2d1eccca8633ca46e0386c0cb8228b Mon Sep 17 00:00:00 2001 From: Hrushikesh Kalakandra Date: Mon, 16 Mar 2026 23:05:28 +0000 Subject: [PATCH 15/23] Enhance file organization documentation with details Expanded the file organization section to provide a detailed directory structure and descriptions of contents for the SimPaths repository. Updated the source code and input data sections to clarify their purposes and organization. --- .../internals/file-organisation.md | 172 +++++++++--------- 1 file changed, 81 insertions(+), 91 deletions(-) diff --git a/documentation/wiki/developer-guide/internals/file-organisation.md b/documentation/wiki/developer-guide/internals/file-organisation.md index 0de8ffdf5..f5eed984f 100644 --- a/documentation/wiki/developer-guide/internals/file-organisation.md +++ b/documentation/wiki/developer-guide/internals/file-organisation.md @@ -2,33 +2,76 @@ This page describes the directory and package layout of the SimPaths repository. 
For the generic JAS-mine project structure, see [Project Structure](../jasmine/project-structure.md). -## 1. Top-level directories - -| Directory | Contents | -| --- | --- | -| `config/` | YAML configuration files for batch runs (`default.yml`, `test_create_database.yml`, `test_run.yml`) | -| `input/` | Survey-derived input data, EUROMOD donor files, Stata scripts for data preparation | -| `output/` | Simulation output (created at runtime; each run produces a timestamped subfolder) | -| `src/` | All Java source code (main and test) plus resources | -| `target/` | Maven build output: compiled classes and runnable JARs (`singlerun.jar`, `multirun.jar`) | -| `validation/` | Stata scripts and reference graphs for two-stage model validation | -| `documentation/` | Markdown documentation and wiki source files | -| `.github/workflows/` | CI pipeline (`SimPathsBuild.yml`) and Javadoc publishing (`publish-javadoc.yml`) | - -Root-level files include `pom.xml` (Maven project definition) and `README.md`. - -## 2. Source code — `src/main/java/simpaths/` - -### `experiment/` - -Entry points and orchestration. Contains the four manager classes required by the JAS-mine architecture: - -| Class | Role | -| --- | --- | -| `SimPathsStart` | Entry point for interactive single runs. Builds the GUI dialog, creates database tables from CSV, and launches the simulation. | -| `SimPathsMultiRun` | Entry point for batch runs. Reads a YAML config file, iterates over runs with optional parameter variation (innovation shocks), and manages run labelling. | -| `SimPathsCollector` | Collector manager. Computes aggregate statistics each simulated year and exports them to output CSV files (Statistics, Statistics2, Statistics3). | -| `SimPathsObserver` | Observer manager. Builds real-time GUI charts for monitoring the simulation while it runs. 
| +# Repository Structure + +``` +SimPaths/ +├── config/ # YAML configuration files for simulation runs +│ ├── default.yml # Default simulation parameters (fully annotated) +│ ├── test_create_database.yml # Database creation config (CI) +│ └── test_run.yml # Test run config (CI) +│ +├── documentation/ # Quick-reference docs (this folder) +│ ├── wiki/ # Website source (model description, guides, research) +│ ├── SimPaths_Variable_Codebook.xlsx # Variable definitions for output CSVs +│ ├── SimPaths Stata Parameters.xlsx # Parameter comparison: Stata do-files vs Java +│ └── SimPathsUK_Schedule.xlsx # Event schedule with corresponding Java classes +│ +├── input/ # Input data and parameters +│ ├── InitialPopulations/ +│ │ ├── training/ # De-identified training population (included in repo) +│ │ └── compile/ # Stata pipeline: builds populations, estimates regressions +│ │ ├── do_emphist/ # Employment history reconstruction sub-pipeline +│ │ └── RegressionEstimates/ # Regression coefficient estimation scripts +│ ├── EUROMODoutput/ +│ │ └── training/ # Training UKMOD outputs (included in repo) +│ ├── DoFilesTarget/ # Stata scripts that generate alignment targets +│ ├── reg_*.xlsx # Regression coefficient tables +│ ├── align_*.xlsx # Alignment targets +│ ├── projections_*.xlsx # ONS demographic projections +│ ├── scenario_*.xlsx # Scenario-specific parameter overrides +│ ├── policy parameters.xlsx # Tax-benefit policy parameters +│ ├── validation_statistics.xlsx # Validation targets +│ ├── input.mv.db # H2 donor database (generated by setup) +│ ├── EUROMODpolicySchedule.xlsx # Policy year mapping (generated by setup) +│ └── DatabaseCountryYear.xlsx # Macro parameters (generated by setup) +│ +├── output/ # Simulation outputs (created at runtime) +│ └── / +│ ├── csv/ +│ │ ├── Statistics1.csv # Income distribution, Gini, S-Index +│ │ ├── Statistics2.csv # Demographics by age and gender +│ │ ├── Statistics3.csv # Alignment diagnostics +│ │ ├── Person.csv # Person-level output 
+│ │ ├── BenefitUnit.csv # Benefit-unit-level output +│ │ └── Household.csv # Household-level output +│ ├── database/ # Run-specific persistence output +│ └── input/ # Copied run input artifacts +│ +├── src/ +│ ├── main/java/simpaths/ +│ │ ├── data/ # Parameters, input parsing, filters, statistics +│ │ ├── experiment/ # Entry points: SimPathsStart, SimPathsMultiRun, +│ │ │ # SimPathsCollector, SimPathsObserver +│ │ └── model/ # Core simulation: Person, BenefitUnit, Household, +│ │ ├── decisions/ # intertemporal optimisation grids +│ │ ├── enums/ # categorical variable definitions +│ │ ├── taxes/ # EUROMOD donor matching +│ │ └── lifetime_incomes/ # synthetic income trajectory generation +│ └── test/java/simpaths/ # Unit and integration tests +│ +├── validation/ # Stata validation scripts and reference graphs +│ ├── 01_estimate_validation/ # Predicted vs observed for each regression module +│ └── 02_simulated_output_validation/ # Simulated output vs UKHLS survey data +│ +├── pom.xml # Maven build configuration +├── singlerun.jar # Single-run executable +└── multirun.jar # Multi-run executable +``` + +CSV filenames follow the pattern `.csv`. With a single run the suffix is `1`; with multiple runs each run produces its own numbered file. + +## Source code — `src/main/java/simpaths/` ### `model/` @@ -102,76 +145,23 @@ Synthetic lifetime income trajectory generator. When IO is enabled, this package | `BirthCohort` | Groups individuals by birth year for cohort-level income projection. | | `Individual` | Entity carrying age dummies and log GDP per capita for income regression. | -### `data/` - -Parameters, input parsing, regression management, and utility classes. -| Class | Purpose | -| --- | --- | -| `Parameters` | Central parameter store. Loads all regression coefficients, alignment targets, projections, and scenario tables from Excel files at simulation start. 
| -| `ManagerRegressions` | Manages the regression coefficient files (`reg_*.xlsx`) and provides methods for evaluating regression equations. | -| `RegressionName` | Enum-like catalogue of all named regression models used in the simulation. | -| `ScenarioTable` | Reads scenario-specific parameter overrides from Excel files. | -| `MahalanobisDistance` | Mahalanobis distance computation, used in donor matching. | -| `RootSearch` / `RootSearch2` | Numerical root-finding routines for alignment. | +For a description of the variables in output CSV files, see `documentation/SimPaths_Variable_Codebook.xlsx`. For a description of each `reg_*`, `align_*`, and `scenario_*` input file, see [Model Parameterisation](../documentation/wiki/overview/parameterisation.md) on the website. -Sub-packages: +## Setup-generated artifacts -- **`data/filters/`** — 43 cross-section filter classes (e.g. `FemaleAgeGroupCSfilter`, `RegionEducationWorkingCSfilter`). Each defines a predicate for selecting subsets of agents by demographic characteristics, used in alignment and statistics collection. -- **`data/startingpop/`** — `DataParser` reads the initial population CSV files and constructs the starting agent objects; `Processed` tracks which records have been loaded. -- **`data/statistics/`** — `Statistics`, `Statistics2`, `Statistics3` define the output entities whose fields are exported to CSVs by the Collector. `EmploymentStatistics` and `HealthStatistics` compute domain-specific aggregate indicators. +Running setup (`multirun -DBSetup`) creates or refreshes three files in `input/`: -## 3. Test code — `src/test/java/simpaths/` +- `input.mv.db` — H2 database of EUROMOD donor tax-benefit outcomes +- `EUROMODpolicySchedule.xlsx` — maps simulation years to EUROMOD policy systems +- `DatabaseCountryYear.xlsx` — year-specific macro parameters -Test packages mirror the main source structure: +These must exist before any simulation run. If they are missing, re-run setup. 
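
The requirement above (all three setup-generated files present in `input/` before a run) can be checked from a wrapper script. The following is a minimal sketch, not part of SimPaths: the function name `missing_setup_artifacts` and its optional directory argument are assumptions made for illustration, while the three file names are the ones listed above.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not shipped with SimPaths): prints the name of every
# setup-generated artifact that is absent from the given input directory.
# The directory argument is an assumption; it defaults to "input".
missing_setup_artifacts() {
    local input_dir="${1:-input}"
    local name
    for name in input.mv.db EUROMODpolicySchedule.xlsx DatabaseCountryYear.xlsx; do
        # Report only the files that do not exist as regular files.
        [ -f "$input_dir/$name" ] || printf '%s\n' "$name"
    done
}
```

If the function prints anything, setup has not been completed for that directory, and a wrapper could then run `java -jar multirun.jar -DBSetup` before starting a simulation. Printing the missing names, rather than returning only a status code, keeps the output usable in log messages.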
-| Package | Contents | -| --- | --- | -| `simpaths/model/` | Unit tests for agent classes and simulation logic | -| `simpaths/data/` | Tests for parameter loading and data utilities | -| `simpaths/data/filters/` | Tests for cross-section filters | -| `simpaths/data/statistics/` | Tests for statistics computation | -| `simpaths/experiment/` | Tests for entry points and configuration parsing | -| `simpaths/integrationtest/` | `RunSimPathsIntegrationTest` — end-to-end test that builds the database and runs a short simulation. The `expected/` subfolder contains reference output for comparison. | -| `simpaths/testinput/` | Test fixture data files | +## Training mode -## 4. Resources — `src/main/resources/` +The repository includes de-identified training data under `input/InitialPopulations/training/` and `input/EUROMODoutput/training/`. If no initial-population CSV files are found in the main input location, SimPaths automatically switches to training mode. Training mode supports development and CI but is not intended for research interpretation. -| File | Purpose | -| --- | --- | -| `hibernate.cfg.xml` | Hibernate ORM configuration for the embedded H2 database | -| `log4j.properties` | Logging configuration | -| `META-INF/` | Persistence unit definitions | -| `images/` | Icons and images used by the GUI | - -## 5. Input data — `input/` - -| Subdirectory | Contents | -| --- | --- | -| `InitialPopulations/training/` | De-identified synthetic population CSV for CI testing and getting started. Real survey data cannot be committed due to licence restrictions; this file is a privacy-protected substitute with anonymised IDs and noise-perturbed continuous variables. 
| -| `InitialPopulations/compile/` | 13 Stata do-files that build the full initial population from UKHLS/BHPS/WAS survey data | -| `InitialPopulations/compile/do_emphist/` | 8 Stata scripts that reconstruct monthly employment histories back to 2007 | -| `InitialPopulations/compile/RegressionEstimates/` | 17 Stata scripts that estimate regression coefficients and produce the `reg_*.xlsx` files (14 `reg_*.do` estimation scripts plus `master.do`, `programs.do`, `variable_update.do`) | -| `DoFilesTarget/` | 5 Stata scripts that generate alignment target files (employment shares, education targets, partnership rates) | -| `EUROMODoutput/` | Pre-computed EUROMOD tax-benefit donor files, one per policy year. These are loaded into the H2 database during the setup phase. | - -The full input data (survey microdata and EUROMOD output) is not committed to the repository due to data licence restrictions. - -## 6. Validation — `validation/` - -| Subdirectory | Contents | -| --- | --- | -| `01_estimate_validation/do_files/` | 9 Stata scripts that compare predicted versus observed values for each regression module | -| `01_estimate_validation/graphs/` | Output graphs from estimate validation | -| `02_simulated_output_validation/do_files/` | 28 Stata scripts that compare simulation output against UKHLS observed data across 18 outcomes | -| `02_simulated_output_validation/graphs/` | Reference comparison plots from a baseline validation run | - -## 7. Configuration — `config/` - -| File | Purpose | -| --- | --- | -| `default.yml` | Default configuration for batch runs. Fully annotated with inline comments. | -| `test_create_database.yml` | Rebuilds the H2 database from input CSVs (used during setup). | -| `test_run.yml` | Minimal configuration for CI testing. | +## Logging -Each YAML file is standalone — there is no inheritance between config files. Keys map directly to fields in `SimPathsMultiRun` and `SimPathsModel`. 
See the [Configuration](../../../../documentation/configuration.md) reference for a complete listing of all keys. +With `-f` on `multirun.jar`, logs are written to `output/logs/run_.txt` (stdout) and `output/logs/run_.log` (log4j). From 6963f52e1397760742087a6a8ff937effa3ee74f Mon Sep 17 00:00:00 2001 From: Hrushikesh Kalakandra Date: Mon, 16 Mar 2026 23:14:35 +0000 Subject: [PATCH 16/23] Delete documentation/repository-structure.md --- documentation/repository-structure.md | 88 --------------------------- 1 file changed, 88 deletions(-) delete mode 100644 documentation/repository-structure.md diff --git a/documentation/repository-structure.md b/documentation/repository-structure.md deleted file mode 100644 index 4a750f34b..000000000 --- a/documentation/repository-structure.md +++ /dev/null @@ -1,88 +0,0 @@ -# Repository Structure - -``` -SimPaths/ -├── config/ # YAML configuration files for simulation runs -│ ├── default.yml # Default simulation parameters (fully annotated) -│ ├── test_create_database.yml # Database creation config (CI) -│ └── test_run.yml # Test run config (CI) -│ -├── documentation/ # Quick-reference docs (this folder) -│ ├── wiki/ # Website source (model description, guides, research) -│ ├── SimPaths_Variable_Codebook.xlsx # Variable definitions for output CSVs -│ ├── SimPaths Stata Parameters.xlsx # Parameter comparison: Stata do-files vs Java -│ └── SimPathsUK_Schedule.xlsx # Event schedule with corresponding Java classes -│ -├── input/ # Input data and parameters -│ ├── InitialPopulations/ -│ │ ├── training/ # De-identified training population (included in repo) -│ │ └── compile/ # Stata pipeline: builds populations, estimates regressions -│ │ ├── do_emphist/ # Employment history reconstruction sub-pipeline -│ │ └── RegressionEstimates/ # Regression coefficient estimation scripts -│ ├── EUROMODoutput/ -│ │ └── training/ # Training UKMOD outputs (included in repo) -│ ├── DoFilesTarget/ # Stata scripts that generate alignment targets -│ ├── 
reg_*.xlsx # Regression coefficient tables -│ ├── align_*.xlsx # Alignment targets -│ ├── projections_*.xlsx # ONS demographic projections -│ ├── scenario_*.xlsx # Scenario-specific parameter overrides -│ ├── policy parameters.xlsx # Tax-benefit policy parameters -│ ├── validation_statistics.xlsx # Validation targets -│ ├── input.mv.db # H2 donor database (generated by setup) -│ ├── EUROMODpolicySchedule.xlsx # Policy year mapping (generated by setup) -│ └── DatabaseCountryYear.xlsx # Macro parameters (generated by setup) -│ -├── output/ # Simulation outputs (created at runtime) -│ └── / -│ ├── csv/ -│ │ ├── Statistics1.csv # Income distribution, Gini, S-Index -│ │ ├── Statistics2.csv # Demographics by age and gender -│ │ ├── Statistics3.csv # Alignment diagnostics -│ │ ├── Person.csv # Person-level output -│ │ ├── BenefitUnit.csv # Benefit-unit-level output -│ │ └── Household.csv # Household-level output -│ ├── database/ # Run-specific persistence output -│ └── input/ # Copied run input artifacts -│ -├── src/ -│ ├── main/java/simpaths/ -│ │ ├── data/ # Parameters, input parsing, filters, statistics -│ │ ├── experiment/ # Entry points: SimPathsStart, SimPathsMultiRun, -│ │ │ # SimPathsCollector, SimPathsObserver -│ │ └── model/ # Core simulation: Person, BenefitUnit, Household, -│ │ ├── decisions/ # intertemporal optimisation grids -│ │ ├── enums/ # categorical variable definitions -│ │ ├── taxes/ # EUROMOD donor matching -│ │ └── lifetime_incomes/ # synthetic income trajectory generation -│ └── test/java/simpaths/ # Unit and integration tests -│ -├── validation/ # Stata validation scripts and reference graphs -│ ├── 01_estimate_validation/ # Predicted vs observed for each regression module -│ └── 02_simulated_output_validation/ # Simulated output vs UKHLS survey data -│ -├── pom.xml # Maven build configuration -├── singlerun.jar # Single-run executable -└── multirun.jar # Multi-run executable -``` - -CSV filenames follow the pattern `.csv`. 
With a single run the suffix is `1`; with multiple runs each run produces its own numbered file. - -For a description of the variables in output CSV files, see `documentation/SimPaths_Variable_Codebook.xlsx`. For a description of each `reg_*`, `align_*`, and `scenario_*` input file, see [Model Parameterisation](../documentation/wiki/overview/parameterisation.md) on the website. - -## Setup-generated artifacts - -Running setup (`multirun -DBSetup`) creates or refreshes three files in `input/`: - -- `input.mv.db` — H2 database of EUROMOD donor tax-benefit outcomes -- `EUROMODpolicySchedule.xlsx` — maps simulation years to EUROMOD policy systems -- `DatabaseCountryYear.xlsx` — year-specific macro parameters - -These must exist before any simulation run. If they are missing, re-run setup. - -## Training mode - -The repository includes de-identified training data under `input/InitialPopulations/training/` and `input/EUROMODoutput/training/`. If no initial-population CSV files are found in the main input location, SimPaths automatically switches to training mode. Training mode supports development and CI but is not intended for research interpretation. - -## Logging - -With `-f` on `multirun.jar`, logs are written to `output/logs/run_.txt` (stdout) and `output/logs/run_.log` (log4j). From 54110cfad874ebc61949e0b648525a5b0eb71b63 Mon Sep 17 00:00:00 2001 From: Hrushikesh Kalakandra Date: Mon, 16 Mar 2026 23:16:48 +0000 Subject: [PATCH 17/23] Update recommended reading order in README Removed Repository Structure and CLI Reference from recommended reading order. --- documentation/README.md | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/documentation/README.md b/documentation/README.md index 6bb2d3678..85bb0cc1e 100644 --- a/documentation/README.md +++ b/documentation/README.md @@ -8,9 +8,7 @@ These files are a **quick reference** for working directly with the repository 1. 
[Model Concepts](model-concepts.md) — what SimPaths simulates, agents, annual cycle, alignment, EUROMOD 2. [Configuration](configuration.md) — prerequisites, quick run, YAML structure, config keys -3. [Repository Structure](repository-structure.md) — directory layout, input files, output files -4. [CLI Reference](cli-reference.md) — all flags for `singlerun.jar` and `multirun.jar` -5. [Troubleshooting](troubleshooting.md) — common errors and fixes +3. [Troubleshooting](troubleshooting.md) — common errors and fixes For contributors and advanced users: From 3f42a5db8291d27ba2a05078dcccd3f66f2347a9 Mon Sep 17 00:00:00 2001 From: hk-2029 Date: Mon, 16 Mar 2026 23:54:39 +0000 Subject: [PATCH 18/23] docs: restructure file-organisation, remove cli-reference and repository-structure --- documentation/cli-reference.md | 89 ------------------- .../internals/file-organisation.md | 78 ++++++++-------- 2 files changed, 37 insertions(+), 130 deletions(-) delete mode 100644 documentation/cli-reference.md diff --git a/documentation/cli-reference.md b/documentation/cli-reference.md deleted file mode 100644 index e1099b0b9..000000000 --- a/documentation/cli-reference.md +++ /dev/null @@ -1,89 +0,0 @@ -# CLI Reference - -## `singlerun.jar` (`SimPathsStart`) - -Usage: - -```bash -java -jar singlerun.jar [options] -``` - -### Options - -| Option | Meaning | -|---|---| -| `-s`, `--startYear ` | Simulation start year | -| `-Setup` | Setup only (do not run simulation) | -| `-Run` | Run only (skip setup) | -| `-r`, `--rewrite-policy-schedule` | Rebuild policy schedule from policy files | -| `-g`, `--showGui ` | Enable or disable GUI | -| `-h`, `--help` | Print help | - -Notes: - -- `-Setup` and `-Run` are mutually exclusive. -- For non-GUI environments, use `-g false`. 
- -### Examples - -Setup only: - -```bash -java -jar singlerun.jar -s 2019 -g false -Setup --rewrite-policy-schedule -``` - -Run only (after setup exists): - -```bash -java -jar singlerun.jar -g false -Run -``` - -## `multirun.jar` (`SimPathsMultiRun`) - -Usage: - -```bash -java -jar multirun.jar [options] -``` - -### Options - -| Option | Meaning | -|---|---| -| `-p`, `--popSize ` | Simulated population size | -| `-s`, `--startYear ` | Start year | -| `-e`, `--endYear ` | End year | -| `-DBSetup` | Database setup mode | -| `-n`, `--maxNumberOfRuns ` | Number of sequential runs | -| `-r`, `--randomSeed ` | Seed for first run | -| `-g`, `--executeWithGui ` | Enable or disable GUI | -| `-config ` | Config file in `config/` (default: `default.yml`) | -| `-f` | Write stdout and logs to `output/logs/` | -| `-P`, `--persist ` | Persistence strategy for processed dataset | -| `-h`, `--help` | Print help | - -Persistence modes: - -- `root` (default): persist to root input area for reuse -- `run`: persist per run output folder -- `none`: no processed-data persistence - -### Examples - -Create setup database using config: - -```bash -java -jar multirun.jar -DBSetup -config test_create_database.yml -``` - -Run two simulations with root persistence: - -```bash -java -jar multirun.jar -config test_run.yml -P root -``` - -Run without persistence and with file logging: - -```bash -java -jar multirun.jar -config default.yml -P none -f -``` diff --git a/documentation/wiki/developer-guide/internals/file-organisation.md b/documentation/wiki/developer-guide/internals/file-organisation.md index f5eed984f..e5cbaa619 100644 --- a/documentation/wiki/developer-guide/internals/file-organisation.md +++ b/documentation/wiki/developer-guide/internals/file-organisation.md @@ -75,17 +75,7 @@ CSV filenames follow the pattern `.csv`. With a single r ### `model/` -Core simulation logic. 
The central class is `SimPathsModel`, which owns all agent collections, builds the yearly event schedule (44 ordered processes), and coordinates the annual simulation cycle. - -Agent classes: - -| Class | Description | -| --- | --- | -| `Person` | Individual agent. Carries all demographics, health, education, labour, income, and social care state. Contains the per-person process methods invoked by the schedule. | -| `BenefitUnit` | Tax-and-benefit assessment unit: one or two adults plus their dependents. Tax-benefit evaluation is performed at this level. | -| `Household` | Grouping of benefit units sharing the same address. | - -Other key classes in `model/`: +Core simulation logic. The three agent classes — `Person`, `BenefitUnit`, `Household` — are described in the [Model Concepts](../../../../documentation/model-concepts.md) page. Other key classes: | Class | Purpose | | --- | --- | @@ -97,15 +87,42 @@ Other key classes in `model/`: | `Validator` | Runtime consistency checks on the simulated population. | | `*Alignment` classes | `FertilityAlignment`, `ActivityAlignmentV2`, `InSchoolAlignment`, `PartnershipAlignment`, `SocialCareAlignment` — each aligns a specific outcome to external calibration targets. | -### `model/enums/` +Sub-packages: + +- **`model/enums/`** — 46 enumeration classes defining categorical variables: `Gender`, `Education`, `Labour`, `HealthStatus`, `Region`, `Ethnicity`, `Occupancy`, and others. Referenced by the ORM for database persistence and by regression models for covariate encoding. +- **`model/decisions/`** — Intertemporal optimisation (IO) computational engine. Pre-computes decision grids by backward induction so agents can look up optimal consumption–labour choices during the simulation. +- **`model/taxes/`** — EUROMOD donor-matching subsystem. Imputes taxes and benefits onto simulated benefit units by matching them to pre-computed EUROMOD donor records. 
+- **`model/lifetime_incomes/`** — Synthetic lifetime income trajectory generator. Creates projected income paths for birth cohorts using an AR(2) process, used when IO is enabled. + +For a description of the variables in output CSV files, see `documentation/SimPaths_Variable_Codebook.xlsx`. For a description of each `reg_*`, `align_*`, and `scenario_*` input file, see [Model Parameterisation](../documentation/wiki/overview/parameterisation.md) on the website. + +## Setup-generated artifacts + +Running setup (`multirun -DBSetup`) creates or refreshes three files in `input/`: + +- `input.mv.db` — H2 database of EUROMOD donor tax-benefit outcomes +- `EUROMODpolicySchedule.xlsx` — maps simulation years to EUROMOD policy systems +- `DatabaseCountryYear.xlsx` — year-specific macro parameters + +These must exist before any simulation run. If they are missing, re-run setup. + +## Training mode + +The repository includes de-identified training data under `input/InitialPopulations/training/` and `input/EUROMODoutput/training/`. If no initial-population CSV files are found in the main input location, SimPaths automatically switches to training mode. Training mode supports development and CI but is not intended for research interpretation. -46 enumeration classes defining the categorical variables used throughout the simulation: `Gender`, `Education`, `Labour`, `HealthStatus`, `Country`, `Region`, `Ethnicity`, `Occupancy`, and others. These are referenced by the ORM for database persistence and by regression models for covariate encoding. +## Logging + +With `-f` on `multirun.jar`, logs are written to `output/logs/run_.txt` (stdout) and `output/logs/run_.log` (log4j). -### `model/decisions/` +--- -Intertemporal optimisation (IO) computational engine. When IO is enabled, computing optimal consumption–labour choices for every agent at every time step during the simulation would be prohibitively slow. 
This package solves the problem once before the simulation runs: it constructs a grid covering all meaningful combinations of state variables (wealth, age, health, family status, etc.), then works backwards from the end of life to find the optimal choice at each grid point (backward induction). During the simulation, agents simply look up their current state in the pre-computed grid rather than solving an optimisation problem. +## Sub-package detail -Key classes: +The following sub-packages are self-contained subsystems whose internals are not obvious from the class names alone. + +### `model/decisions/` — IO engine + +When IO is enabled, computing optimal consumption–labour choices for every agent at every time step would be prohibitively slow. This package solves the problem once before the simulation runs: it constructs a grid covering all meaningful combinations of state variables (wealth, age, health, family status, etc.), then works backwards from the end of life to find the optimal choice at each grid point (backward induction). During the simulation, agents simply look up their current state in the pre-computed grid. | Class | Purpose | | --- | --- | @@ -118,9 +135,9 @@ Key classes: | `Expectations` / `LocalExpectations` | Computes expected future values over stochastic transitions. | | `CESUtility` | CES utility function used in the optimisation. | -### `model/taxes/` +### `model/taxes/` — EUROMOD donor matching -EUROMOD donor-matching subsystem. Imputes taxes and benefits onto simulated benefit units by matching them to pre-computed EUROMOD donor records. +Imputes taxes and benefits onto simulated benefit units by matching them to pre-computed EUROMOD donor records. | Class | Purpose | | --- | --- | @@ -133,9 +150,9 @@ EUROMOD donor-matching subsystem. Imputes taxes and benefits onto simulated bene The `taxes/database/` sub-package handles loading donor data from the H2 database into memory (`TaxDonorDataParser`, `DatabaseExtension`, `MatchIndices`). 
-### `model/lifetime_incomes/` +### `model/lifetime_incomes/` — synthetic income trajectories -Synthetic lifetime income trajectory generator. When IO is enabled, this package creates projected income paths for birth cohorts using an AR(2) process anchored to age-gender geometric means, and matches simulated persons to donor income profiles. +When IO is enabled, this package creates projected income paths for birth cohorts using an AR(2) process anchored to age-gender geometric means, and matches simulated persons to donor income profiles. | Class | Purpose | | --- | --- | @@ -144,24 +161,3 @@ Synthetic lifetime income trajectory generator. When IO is enabled, this package | `AnnualIncome` | Implements the AR(2) income process with age-gender anchoring. | | `BirthCohort` | Groups individuals by birth year for cohort-level income projection. | | `Individual` | Entity carrying age dummies and log GDP per capita for income regression. | - - -For a description of the variables in output CSV files, see `documentation/SimPaths_Variable_Codebook.xlsx`. For a description of each `reg_*`, `align_*`, and `scenario_*` input file, see [Model Parameterisation](../documentation/wiki/overview/parameterisation.md) on the website. - -## Setup-generated artifacts - -Running setup (`multirun -DBSetup`) creates or refreshes three files in `input/`: - -- `input.mv.db` — H2 database of EUROMOD donor tax-benefit outcomes -- `EUROMODpolicySchedule.xlsx` — maps simulation years to EUROMOD policy systems -- `DatabaseCountryYear.xlsx` — year-specific macro parameters - -These must exist before any simulation run. If they are missing, re-run setup. - -## Training mode - -The repository includes de-identified training data under `input/InitialPopulations/training/` and `input/EUROMODoutput/training/`. If no initial-population CSV files are found in the main input location, SimPaths automatically switches to training mode. 
Training mode supports development and CI but is not intended for research interpretation. - -## Logging - -With `-f` on `multirun.jar`, logs are written to `output/logs/run_.txt` (stdout) and `output/logs/run_.log` (log4j). From 2b3a8ac5ebdc02727f539a824b9cbedff37a84a7 Mon Sep 17 00:00:00 2001 From: Hrushikesh Kalakandra Date: Tue, 17 Mar 2026 00:02:55 +0000 Subject: [PATCH 19/23] Condense file organisation Updated the documentation for file organization, setup-generated artifacts, training mode, and logging. --- .../internals/file-organisation.md | 71 +++++++------------ 1 file changed, 25 insertions(+), 46 deletions(-) diff --git a/documentation/wiki/developer-guide/internals/file-organisation.md b/documentation/wiki/developer-guide/internals/file-organisation.md index e5cbaa619..0720bb9d1 100644 --- a/documentation/wiki/developer-guide/internals/file-organisation.md +++ b/documentation/wiki/developer-guide/internals/file-organisation.md @@ -69,52 +69,6 @@ SimPaths/ └── multirun.jar # Multi-run executable ``` -CSV filenames follow the pattern `.csv`. With a single run the suffix is `1`; with multiple runs each run produces its own numbered file. - -## Source code — `src/main/java/simpaths/` - -### `model/` - -Core simulation logic. The three agent classes — `Person`, `BenefitUnit`, `Household` — are described in the [Model Concepts](../../../../documentation/model-concepts.md) page. Other key classes: - -| Class | Purpose | -| --- | --- | -| `SimPathsModel` | Model manager. Initialises the population, registers all 44 yearly processes with the JAS-mine scheduler, manages alignment and aggregate state. | -| `TaxEvaluation` | Orchestrates EUROMOD donor matching to impute taxes and benefits onto simulated benefit units. | -| `UnionMatching` | Partnership formation algorithm. Matches unpartnered individuals into couples based on characteristics and preferences. | -| `LabourMarket` | Labour market clearing: matches labour supply decisions to employment outcomes. 
| -| `Innovations` | Applies parameter shocks (innovation perturbations) across sequential runs for sensitivity analysis. | -| `Validator` | Runtime consistency checks on the simulated population. | -| `*Alignment` classes | `FertilityAlignment`, `ActivityAlignmentV2`, `InSchoolAlignment`, `PartnershipAlignment`, `SocialCareAlignment` — each aligns a specific outcome to external calibration targets. | - -Sub-packages: - -- **`model/enums/`** — 46 enumeration classes defining categorical variables: `Gender`, `Education`, `Labour`, `HealthStatus`, `Region`, `Ethnicity`, `Occupancy`, and others. Referenced by the ORM for database persistence and by regression models for covariate encoding. -- **`model/decisions/`** — Intertemporal optimisation (IO) computational engine. Pre-computes decision grids by backward induction so agents can look up optimal consumption–labour choices during the simulation. -- **`model/taxes/`** — EUROMOD donor-matching subsystem. Imputes taxes and benefits onto simulated benefit units by matching them to pre-computed EUROMOD donor records. -- **`model/lifetime_incomes/`** — Synthetic lifetime income trajectory generator. Creates projected income paths for birth cohorts using an AR(2) process, used when IO is enabled. - -For a description of the variables in output CSV files, see `documentation/SimPaths_Variable_Codebook.xlsx`. For a description of each `reg_*`, `align_*`, and `scenario_*` input file, see [Model Parameterisation](../documentation/wiki/overview/parameterisation.md) on the website. - -## Setup-generated artifacts - -Running setup (`multirun -DBSetup`) creates or refreshes three files in `input/`: - -- `input.mv.db` — H2 database of EUROMOD donor tax-benefit outcomes -- `EUROMODpolicySchedule.xlsx` — maps simulation years to EUROMOD policy systems -- `DatabaseCountryYear.xlsx` — year-specific macro parameters - -These must exist before any simulation run. If they are missing, re-run setup. 
- -## Training mode - -The repository includes de-identified training data under `input/InitialPopulations/training/` and `input/EUROMODoutput/training/`. If no initial-population CSV files are found in the main input location, SimPaths automatically switches to training mode. Training mode supports development and CI but is not intended for research interpretation. - -## Logging - -With `-f` on `multirun.jar`, logs are written to `output/logs/run_.txt` (stdout) and `output/logs/run_.log` (log4j). - ---- ## Sub-package detail @@ -161,3 +115,28 @@ When IO is enabled, this package creates projected income paths for birth cohort | `AnnualIncome` | Implements the AR(2) income process with age-gender anchoring. | | `BirthCohort` | Groups individuals by birth year for cohort-level income projection. | | `Individual` | Entity carrying age dummies and log GDP per capita for income regression. | + +CSV filenames follow the pattern `.csv`. With a single run the suffix is `1`; with multiple runs each run produces its own numbered file. + +For a description of the variables in output CSV files, see `documentation/SimPaths_Variable_Codebook.xlsx`. For a description of each `reg_*`, `align_*`, and `scenario_*` input file, see [Model Parameterisation](../documentation/wiki/overview/parameterisation.md) on the website. + +## Setup-generated artifacts + +Running setup (`multirun -DBSetup`) creates or refreshes three files in `input/`: + +- `input.mv.db` — H2 database of EUROMOD donor tax-benefit outcomes +- `EUROMODpolicySchedule.xlsx` — maps simulation years to EUROMOD policy systems +- `DatabaseCountryYear.xlsx` — year-specific macro parameters + +These must exist before any simulation run. If they are missing, re-run setup. + +## Training mode + +The repository includes de-identified training data under `input/InitialPopulations/training/` and `input/EUROMODoutput/training/`. 
If no initial-population CSV files are found in the main input location, SimPaths automatically switches to training mode. Training mode supports development and CI but is not intended for research interpretation. + +## Logging + +With `-f` on `multirun.jar`, logs are written to `output/logs/run_.txt` (stdout) and `output/logs/run_.log` (log4j). + +--- + From 434fb066e7aa608efd07fcd275810abca0deef37 Mon Sep 17 00:00:00 2001 From: hk-2029 Date: Tue, 17 Mar 2026 00:12:26 +0000 Subject: [PATCH 20/23] docs: fold troubleshooting into configuration, delete troubleshooting.md --- documentation/README.md | 1 - documentation/configuration.md | 3 ++ documentation/troubleshooting.md | 83 -------------------------------- 3 files changed, 3 insertions(+), 84 deletions(-) delete mode 100644 documentation/troubleshooting.md diff --git a/documentation/README.md b/documentation/README.md index 85bb0cc1e..878e0da41 100644 --- a/documentation/README.md +++ b/documentation/README.md @@ -8,7 +8,6 @@ These files are a **quick reference** for working directly with the repository 1. [Model Concepts](model-concepts.md) — what SimPaths simulates, agents, annual cycle, alignment, EUROMOD 2. [Configuration](configuration.md) — prerequisites, quick run, YAML structure, config keys -3. [Troubleshooting](troubleshooting.md) — common errors and fixes For contributors and advanced users: diff --git a/documentation/configuration.md b/documentation/configuration.md index 28d2b9199..514396909 100644 --- a/documentation/configuration.md +++ b/documentation/configuration.md @@ -96,3 +96,6 @@ Note that some settings — particularly alignment — are primarily controlled - Use quotes around config filenames that contain spaces: `-config "my config.yml"`. - Add `-f` to write run logs to `output/logs/`. - Override individual values at runtime without editing the YAML, for example `-n 10` overrides `maxNumberOfRuns`. 
+- If you see `Config file not found`, the `-config` flag points to a file not present in `config/` — check the filename and extension. +- If `EUROMODpolicySchedule.xlsx` is missing, re-run setup: `java -jar multirun.jar -DBSetup`. +- On headless servers or CI, always use `executeWithGui: false` in your YAML (or `-g false` on the command line) to avoid GUI errors. diff --git a/documentation/troubleshooting.md b/documentation/troubleshooting.md deleted file mode 100644 index 5ff3bee8b..000000000 --- a/documentation/troubleshooting.md +++ /dev/null @@ -1,83 +0,0 @@ -# Troubleshooting - -## `Config file not found` - -Cause: - -- `-config` points to a file not present in `config/`. - -Fix: - -- Verify filename and extension. -- Example: - -```bash -java -jar multirun.jar -config default.yml -``` - -## Missing `EUROMODpolicySchedule.xlsx` - -Cause: - -- Setup has not generated schedule files yet. - -Fix: - -- Re-run setup with rewrite enabled: - -```bash -java -jar singlerun.jar -s 2019 -g false --rewrite-policy-schedule -Setup -``` - -## GUI errors on server or CI - -Cause: - -- Running GUI mode in headless environment. - -Fix: - -- Disable GUI: - -```bash --g false -``` - -## Start year rejected or inconsistent - -Cause: - -- Chosen year is outside available input/training data bounds. - -Fix: - -- Use a year covered by available input files. -- For training-only mode, use the provided training start year (2019 in this repository setup). - -## Expected CSV files not found after run - -Cause: - -- Collector settings disabled certain exports. -- Run failed before collector dump phase. - -Fix: - -- Check `collector_args` in YAML. -- Re-run with `-f` and inspect `output/logs/run_.txt` and `.log`. - -## Integration test output mismatch - -Cause: - -- Simulation behavior changed or output schema changed. - -Fix: - -1. Confirm differences are intended. -2. Replace expected files in `src/test/java/simpaths/integrationtest/expected/` with verified new outputs. -3. 
Re-run: - -```bash -mvn verify -``` From b2567117a018ac14ef9e2d6cc343270e1ba8312e Mon Sep 17 00:00:00 2001 From: hk-2029 Date: Tue, 17 Mar 2026 07:49:40 +0000 Subject: [PATCH 21/23] docs: add mkdocs.yml and fix deploy workflow for automatic site builds - Add docs_dir: documentation/wiki to mkdocs.yml so MkDocs finds the source files - Fix deploy-docs.yml path filter from docs/** to documentation/wiki/** - Website now auto-deploys on push to main when wiki files change --- .github/workflows/deploy-docs.yml | 48 +++++++++ mkdocs.yml | 157 ++++++++++++++++++++++++++++++ 2 files changed, 205 insertions(+) create mode 100644 .github/workflows/deploy-docs.yml create mode 100644 mkdocs.yml diff --git a/.github/workflows/deploy-docs.yml b/.github/workflows/deploy-docs.yml new file mode 100644 index 000000000..66dce2744 --- /dev/null +++ b/.github/workflows/deploy-docs.yml @@ -0,0 +1,48 @@ +name: Deploy Documentation + +on: + push: + branches: + - main + paths: + - 'documentation/wiki/**' + - 'mkdocs.yml' + workflow_dispatch: + +permissions: + contents: read + pages: write + id-token: write + +concurrency: + group: "pages" + cancel-in-progress: false + +jobs: + build-and-deploy: + runs-on: ubuntu-latest + steps: + - name: Checkout repository + uses: actions/checkout@v4 + with: + fetch-depth: 0 + + - name: Set up Python + uses: actions/setup-python@v5 + with: + python-version: '3.x' + + - name: Install MkDocs and dependencies + run: pip install "mkdocs>=1.6,<2.0" mkdocs-material + + - name: Build documentation site + run: mkdocs build --strict + + - name: Upload Pages artifact + uses: actions/upload-pages-artifact@v3 + with: + path: site/ + + - name: Deploy to GitHub Pages + id: deployment + uses: actions/deploy-pages@v4 diff --git a/mkdocs.yml b/mkdocs.yml new file mode 100644 index 000000000..31873be04 --- /dev/null +++ b/mkdocs.yml @@ -0,0 +1,157 @@ +site_name: SimPaths Documentation +site_description: >- + An open-source microsimulation framework for modelling individual 
+ and household life course events across the UK and Europe. +site_url: https://centreformicrosimulation.github.io/SimPaths/ +repo_url: https://github.com/centreformicrosimulation/SimPaths +repo_name: centreformicrosimulation/SimPaths + +docs_dir: documentation/wiki + +copyright: >- + Copyright © Matteo Richiardi, Patryk Bronka, Justin van de Ven — + Centre for Microsimulation and Policy Analysis + +theme: + name: material + palette: + - scheme: default + primary: custom + accent: custom + toggle: + icon: material/weather-night + name: Switch to dark mode + - scheme: slate + primary: custom + accent: custom + toggle: + icon: material/weather-sunny + name: Switch to light mode + font: false + icon: + repo: fontawesome/brands/github + logo: material/chart-timeline-variant-shimmer + features: + - navigation.tabs + - navigation.tabs.sticky + - navigation.sections + - navigation.indexes + - navigation.top + - navigation.footer + - navigation.tracking + - toc.follow + - search.suggest + - search.highlight + - search.share + - content.code.copy + - content.code.annotate + +extra_css: + - assets/css/extra.css + +plugins: + - search: + lang: en + +markdown_extensions: + - admonition + - pymdownx.details + - pymdownx.superfences + - pymdownx.tabbed: + alternate_style: true + - pymdownx.highlight: + anchor_linenums: true + line_spans: __span + pygments_lang_class: true + - pymdownx.inlinehilite + - pymdownx.snippets + - pymdownx.emoji: + emoji_index: !!python/name:material.extensions.emoji.twemoji + emoji_generator: !!python/name:material.extensions.emoji.to_svg + - toc: + permalink: true + toc_depth: 3 + - attr_list + - md_in_html + - tables + - footnotes + - def_list + +extra: + social: + - icon: fontawesome/brands/github + link: https://github.com/centreformicrosimulation/SimPaths + name: GitHub + - icon: fontawesome/solid/globe + link: https://www.microsimulation.ac.uk/ + name: Centre for Microsimulation and Policy Analysis + +nav: + - Home: index.md + + - Overview: + - 
overview/index.md + - Model Description: overview/model-description.md + - Simulated Modules: overview/simulated-modules.md + - Model Parameterisation: overview/parameterisation.md + - Country Variants: overview/country-variants.md + - How to Cite: overview/how-to-cite.md + + - Getting Started: + - getting-started/index.md + - Environment Setup: getting-started/environment-setup.md + - Input Data: + - getting-started/data/index.md + - Initial Population (UK): getting-started/data/initial-population-uk.md + - Tax-Benefit Donors (UK): getting-started/data/tax-benefit-donors-uk.md + - Running Your First Simulation: getting-started/first-simulation.md + - Video Tutorials: getting-started/video-tutorials.md + + - User Guide: + - user-guide/index.md + - Single Runs: user-guide/single-runs.md + - Multiple Runs: user-guide/multiple-runs.md + - Graphical User Interface: user-guide/gui.md + - Modifying Parameters: user-guide/modifying-parameters.md + - Modifying Tax-Benefit Settings: user-guide/tax-benefit-parameters.md + - Uncertainty Analysis: user-guide/uncertainty-analysis.md + + - Developer Guide: + - developer-guide/index.md + - Working in GitHub: developer-guide/working-in-github.md + - JAS-mine Architecture: + - developer-guide/jasmine/index.md + - Project Structure: developer-guide/jasmine/project-structure.md + - The Model and the Schedule: developer-guide/jasmine/model-and-schedule.md + - The Start Class: developer-guide/jasmine/start-class.md + - The MultiRun Class: developer-guide/jasmine/multirun-class.md + - Updating JAS-mine: developer-guide/jasmine/updating-jasmine.md + - SimPaths Internals: + - developer-guide/internals/index.md + - SimPaths API: developer-guide/internals/api.md + - File Organisation: developer-guide/internals/file-organisation.md + - The SimPathsModel Class: developer-guide/internals/simpaths-model.md + - Start Class Implementation: developer-guide/internals/start-class-implementation.md + - MultiRun Implementation: 
developer-guide/internals/multirun-implementation.md + - How-To Guides: + - developer-guide/how-to/index.md + - Introduce a New Variable: developer-guide/how-to/new-variable.md + - Add Parameters to the GUI: developer-guide/how-to/add-gui-parameters.md + - Perform MultiRun Simulations: developer-guide/how-to/multirun-simulations.md + - JAS-mine Reference: + - jasmine-reference/index.md + - Statistical Package: jasmine-reference/statistical-package.md + - Collection Filters: jasmine-reference/collection-filters.md + - Alignment Library: jasmine-reference/alignment-library.md + - Matching Library: jasmine-reference/matching-library.md + - Regression Library: jasmine-reference/regression-library.md + - Saving Outputs: jasmine-reference/saving-outputs.md + - Querying the Database: jasmine-reference/querying-database.md + - Links and Resources: jasmine-reference/links.md + - Enums: jasmine-reference/enums.md + + - Model Validation: + - validation/index.md + + - Research: + - research/index.md From 0ff255f3b6314e86d758def1f815f6e192139b30 Mon Sep 17 00:00:00 2001 From: hk-2029 Date: Tue, 17 Mar 2026 08:13:07 +0000 Subject: [PATCH 22/23] docs: consolidate into single README.md, remove redundant files --- documentation/README.md | 276 ++++++++++++++++++++++++++++-- documentation/configuration.md | 101 ----------- documentation/data-pipeline.md | 111 ------------ documentation/model-concepts.md | 128 -------------- documentation/validation-guide.md | 96 ----------- 5 files changed, 264 insertions(+), 448 deletions(-) delete mode 100644 documentation/configuration.md delete mode 100644 documentation/data-pipeline.md delete mode 100644 documentation/model-concepts.md delete mode 100644 documentation/validation-guide.md diff --git a/documentation/README.md b/documentation/README.md index 878e0da41..a4a0f661c 100644 --- a/documentation/README.md +++ b/documentation/README.md @@ -1,21 +1,273 @@ -# SimPaths Documentation +# SimPaths Quick Reference -These files are a **quick 
reference** for working directly with the repository — building, running, configuring, and troubleshooting from the command line. For the full model documentation (simulated modules, parameterisation, GUI usage, research), see the [website](../documentation/wiki/index.md). +A command-line quick reference for building, running, configuring, and validating SimPaths. For the full model documentation — simulated modules, parameterisation, GUI usage, research — see the [website](../documentation/wiki/index.md). --- -## Recommended reading order +## 1. Building and running -1. [Model Concepts](model-concepts.md) — what SimPaths simulates, agents, annual cycle, alignment, EUROMOD -2. [Configuration](configuration.md) — prerequisites, quick run, YAML structure, config keys +### Prerequisites -For contributors and advanced users: +- Java 19 +- Maven 3.8+ +- Optional IDE: IntelliJ IDEA (import as a Maven project) -- [Data Pipeline](data-pipeline.md) — how input files are generated from UKHLS/EUROMOD/WAS survey data -- [Validation Guide](validation-guide.md) — two-stage validation workflow (estimate validation + simulated output validation) +### Quick run -## Conventions +Three commands are all you need: -- Commands are shown from the repository root. -- Paths are relative to the repository root. -- `default.yml` refers to `config/default.yml`. +```bash +mvn clean package +java -jar multirun.jar -DBSetup +java -jar multirun.jar +``` + +The first builds the JARs. The second creates the H2 donor database from the input data. The third runs the simulation using `default.yml`. + +To use a different config file: + +```bash +java -jar multirun.jar -config my_run.yml +``` + +--- + +## 2. Configuration + +SimPaths batch runs are controlled by YAML files in `config/`. The main config is `default.yml`, which is fully annotated with inline comments. + +### How config is applied + +`SimPathsMultiRun` loads `config/` and applies values in two stages: + +1. 
YAML values initialise runtime fields and argument maps. +2. CLI flags override those values if provided. + +If a key is not specified in the YAML, the Java class field default is used. Each config file is standalone — there is no inheritance between config files. + +### Writing your own config + +Place a new `.yml` file in `config/` and pass it via `-config`. You only need to specify the values you want to change — everything else falls back to the Java class field defaults. + +#### Core run arguments + +| Key | Default | Description | +|-----|---------|-------------| +| `maxNumberOfRuns` | `1` | Number of sequential simulation runs | +| `executeWithGui` | `false` | `true` launches the JAS-mine GUI; `false` = headless (required on servers/CI) | +| `randomSeed` | `606` | RNG seed for the first run | +| `startYear` | `2019` | First simulation year (must have matching input/donor data) | +| `endYear` | `2022` | Last simulation year (inclusive) | +| `popSize` | `50000` | Simulated population size; larger = more accurate but slower | + +#### Collector arguments + +The `collector_args` section controls what output files are produced: + +| Flag | Default | Description | +|------|---------|-------------| +| `persistStatistics` | `true` | Write `Statistics1.csv` — income distribution, Gini, S-Index | +| `persistStatistics2` | `true` | Write `Statistics2.csv` — demographic validation by age and gender | +| `persistStatistics3` | `true` | Write `Statistics3.csv` — alignment diagnostics | +| `exportToCSV` | `true` | Write outputs to CSV files under `output//csv/` | + +For a description of the variables in these files, see `documentation/SimPaths_Variable_Codebook.xlsx`. 
+ +#### Minimal example + +```yaml +maxNumberOfRuns: 5 +executeWithGui: false +randomSeed: 42 +startYear: 2019 +endYear: 2030 +popSize: 20000 + +collector_args: + persistStatistics: true + persistStatistics2: true + persistStatistics3: true +``` + +### Additional arguments + +The YAML file supports several other argument sections (`model_args`, `innovation_args`, `parameter_args`) that control alignment flags, intertemporal optimisation settings, sensitivity analysis parameters, and file paths. Many of these are for specific analyses and some are under active review. The annotated `default.yml` file documents all available keys with inline comments. + +Note that some settings — particularly alignment — are primarily controlled in `SimPathsModel.java` rather than through the YAML file. + +### Practical notes + +- Use quotes around config filenames that contain spaces: `-config "my config.yml"`. +- Add `-f` to write run logs to `output/logs/`. +- Override individual values at runtime without editing the YAML, for example `-n 10` overrides `maxNumberOfRuns`. +- If you see `Config file not found`, the `-config` flag points to a file not present in `config/` — check the filename and extension. +- If `EUROMODpolicySchedule.xlsx` is missing, re-run setup: `java -jar multirun.jar -DBSetup`. +- On headless servers or CI, always use `executeWithGui: false` in your YAML (or `-g false` on the command line) to avoid GUI errors. + +--- + +## 3. Data pipeline + +This section explains how the simulation-ready input files in `input/` are generated from raw survey data, and what to do if you need to update or extend them. + +The pipeline has three independent parts: (1) initial populations, (2) regression coefficients, (3) alignment targets. Each can be re-run separately. 
+ +### Data sources + +| Source | Description | Access | +|--------|-------------|--------| +| **UKHLS** (Understanding Society) | Main household panel survey; waves 1 to O (UKDA-6614-stata) | Requires EUL licence from UK Data Service | +| **BHPS** (British Household Panel Survey) | Historical predecessor to UKHLS; used for pre-2009 employment history | Bundled with UKHLS EUL | +| **WAS** (Wealth and Assets Survey) | Biennial survey of household wealth; waves 1 to 7 (UKDA-7215-stata) | Requires EUL licence from UK Data Service | +| **EUROMOD / UKMOD** | Tax-benefit microsimulation system | See [Tax-Benefit Donors (UK)](../documentation/wiki/getting-started/data/tax-benefit-donors-uk.md) on the website | + +### Part 1 — Initial populations (`input/InitialPopulations/compile/`) + +**What it produces:** Annual CSV files `population_initial_UK_.csv` used as the starting population for each simulation run. + +**Master script:** `input/InitialPopulations/compile/00_master.do` + +The pipeline runs in numbered stages: + +| Script | What it does | +|--------|-------------| +| `01_prepare_UKHLS_pooled_data.do` | Pools and standardises UKHLS waves | +| `02_create_UKHLS_variables.do` | Constructs all required variables (demographics, labour, health, income, wealth flags) and applies simulation-consistency rules (retirement as absorbing state, education age bounds, work/hours consistency) | +| `02_01_checks.do` | Data quality checks | +| `03_social_care_received.do` | Social care receipt variables | +| `04_social_care_provided.do` | Informal care provision variables | +| `05_create_benefit_units.do` | Groups individuals into benefit units (tax units) following UK tax-benefit rules | +| `06_reweight_and_slice.do` | Reweighting and year-specific slicing | +| `07_was_wealth_data.do` | Prepares Wealth and Assets Survey data | +| `08_wealth_to_ukhls.do` | Merges WAS wealth into UKHLS records | +| `09_finalise_input_data.do` | Final cleaning and formatting | +| 
`10_check_yearly_data.do` | Per-year consistency checks | +| `99_training_data.do` | Produces the de-identified training population committed to `input/InitialPopulations/training/` | + +#### Employment history sub-pipeline (`compile/do_emphist/`) + +Reconstructs each respondent's monthly employment history from January 2007 onwards by combining UKHLS and BHPS interview records. The output variable `liwwh` (months employed since Jan 2007) feeds into the labour supply models. + +| Script | Purpose | +|--------|---------| +| `00_Master_emphist.do` | Master; sets parameters and calls sub-scripts | +| `01_Intdate.do` – `07_Empcal1a.do` | Sequential stages: interview dating, BHPS linkage, employment spell reconstruction, new-entrant identification | + +### Part 2 — Regression coefficients (`input/InitialPopulations/compile/RegressionEstimates/`) + +**What it produces:** The `reg_*.xlsx` coefficient tables read by `Parameters.java` at simulation startup. + +**Master script:** `input/InitialPopulations/compile/RegressionEstimates/master.do` + +> **Note:** Income and union-formation regressions depend on predicted wages, so `reg_wages.do` must complete before `reg_income.do` and `reg_partnership.do`. All other scripts can run in any order. 
+ +**Required Stata packages:** `fre`, `tsspell`, `carryforward`, `outreg2`, `oparallel`, `gologit2`, `winsor`, `reghdfe`, `ftools`, `require` + +| Script | Module | Method | +|--------|--------|--------| +| `reg_wages.do` | Hourly wages | Heckman selection model (males and females separately) | +| `reg_income.do` | Non-labour income | Hurdle model (selection + amount); requires predicted wages | +| `reg_partnership.do` | Partnership formation/dissolution | Probit; requires predicted wages | +| `reg_education.do` | Education transitions | Generalised ordered logit | +| `reg_fertility.do` | Fertility | Probit | +| `reg_health.do` | Physical health (SF-12 PCS) | Linear regression | +| `reg_health_mental.do` | Mental health (GHQ-12, SF-12 MCS) | Linear regression | +| `reg_health_wellbeing.do` | Life satisfaction | Linear regression | +| `reg_home_ownership.do` | Homeownership transitions | Probit | +| `reg_retirement.do` | Retirement | Probit | +| `reg_leave_parental_home.do` | Leaving parental home | Probit | +| `reg_socialcare.do` | Social care receipt and provision | Probit / ordered logit | +| `reg_unemployment.do` | Unemployment transitions | Probit | +| `reg_financial_distress.do` | Financial distress | Probit | +| `programs.do` | Shared utility programs called by the estimation scripts | — | +| `variable_update.do` | Prepares and recodes variables before estimation | — | + +After running, output Excel files are placed in `input/` (overwriting the existing `reg_*.xlsx` files). + +### Part 3 — Alignment targets (`input/DoFilesTarget/`) + +**What it produces:** The `align_*.xlsx` and `*_targets.xlsx` files that the alignment modules use to rescale simulated rates. 
+ +| Script | Output file | +|--------|------------| +| `01_employment_shares_initpopdata.do` | `input/employment_targets.xlsx` — employment shares by benefit-unit subgroup and year | +| `01_inSchool_targets_initpopdata.do` | `input/inSchool_targets.xlsx` — school participation rates by year | +| `03_calculate_partneredShare_initialPop_BUlogic.do` | `input/partnered_share_targets.xlsx` — partnership shares by year | +| `03_calculate_partnership_target.do` | Supplementary partnership targets | +| `02_person_risk_employment_stats.do` | `employment_risk_emp_stats.csv` — person-level at-risk diagnostics used for employment alignment group construction | + +Population projection targets (`align_popProjections.xlsx`) and fertility/mortality projections (`projections_*.xlsx`) come from ONS published projections and are not generated by these scripts. + +### When to re-run each part + +| Situation | What to re-run | +|-----------|---------------| +| Adding a new data year to the simulation | Part 1 (re-slice the population for the new year) + Part 3 (update alignment targets) | +| Re-estimating a behavioural module | Part 2 (the affected `reg_*.do` script only) + Stage 1 validation | +| Updating employment alignment targets | Part 3 (`01_employment_shares_initpopdata.do`) | + +After re-running any part, re-run setup (`singlerun -Setup` or `multirun -DBSetup`) to rebuild `input/input.mv.db` before running the simulation. + +--- + +## 4. Validation + +SimPaths uses a two-stage validation workflow in `validation/`. Stage 1 checks that each estimated regression model is well-specified before simulation; stage 2 checks that full simulation output matches observed survey data. For the conceptual overview and detailed setup instructions, see [Model Validation](../documentation/wiki/validation/index.md) on the website. + +### Stage 1 — Estimate validation (`validation/01_estimate_validation/`) + +**When to run:** After updating or re-estimating any regression module (i.e. 
after re-running scripts in `input/InitialPopulations/compile/RegressionEstimates/`). + +**What it does:** For each behavioural module, the script loads the estimation sample, computes predicted values from the estimated coefficients, adds individual heterogeneity via 20 stochastic draws (as in multiple imputation), and overlays the predicted and observed distributions as histograms. + +| Script | Module validated | +|--------|----------------| +| `int_val_wages.do` | Hourly wages — Heckman selection model, separately for males/females with and without previous wage history | +| `int_val_education.do` | Education transitions (3 processes) | +| `int_val_fertility.do` | Fertility (2 processes) | +| `int_val_health.do` | Physical health transitions | +| `int_val_home_ownership.do` | Homeownership transitions | +| `int_val_income.do` | Income processes — hurdle models (selection and amount) | +| `int_val_leave_parental_home.do` | Leaving parental home | +| `int_val_partnership.do` | Partnership formation and dissolution | +| `int_val_retirement.do` | Retirement transitions | + +**Outputs:** PNG graphs saved under `validation/01_estimate_validation/graphs//`. Each graph shows predicted (red) vs observed (black outline) distributions. + +### Stage 2 — Simulated output validation (`validation/02_simulated_output_validation/`) + +**When to run:** After completing a baseline simulation run that you want to assess for plausibility. + +**What it does:** Loads your simulation output CSVs, loads UKHLS initial population data as an observational benchmark, and produces side-by-side time-series plots comparing 18 simulated outcomes against the observed distributions with confidence intervals. 
+ +**Comparison plots (18 scripts, `06_01` through `06_18`):** + +| Script | What is compared | +|--------|-----------------| +| `06_01_plot_activity_status.do` | Economic activity: employed, student, inactive, retired by age group | +| `06_02_plot_education_level.do` | Completed education distribution over time | +| `06_03_plot_gross_income.do` | Gross benefit-unit income | +| `06_04_plot_gross_labour_income.do` | Gross labour income | +| `06_05_plot_capital_income.do` | Capital income (interest, dividends) | +| `06_06_plot_pension_income.do` | Pension income | +| `06_07_plot_disposable_income.do` | Disposable income after taxes and benefits | +| `06_08_plot_equivalised_disposable_income.do` | Household-size-adjusted disposable income | +| `06_09_plot_hourly_wages.do` | Hourly wages for employees | +| `06_10_plot_hours_worked.do` | Weekly hours worked by employment status | +| `06_11_plot_income_shares.do` | Income distribution across quintiles | +| `06_12_plot_partnership_status.do` | Partnership status (single, married, cohabiting, previously partnered) | +| `06_13_plot_health.do` | Physical and mental health (SF-12 PCS and MCS) | +| `06_14_plot_at_risk_of_poverty.do` | At-risk-of-poverty rate | +| `06_15_plot_inequality.do` | Income inequality (p90/p50 ratio) | +| `06_16_plot_number_children.do` | Number of dependent children | +| `06_17_plot_disability.do` | Disability prevalence | +| `06_18_plot_social_care.do` | Social care receipt | + +**Outputs:** PNG graphs saved under `validation/02_simulated_output_validation/graphs//`. A reference set from a baseline run (`20250909_run`) is already committed for comparison. + +### Interpreting results + +- **Stage 1:** Predicted and observed histograms should broadly overlap. Systematic divergence indicates a problem with the estimation or variable construction. +- **Stage 2:** Simulated time-series should track UKHLS trends within reasonable uncertainty bounds. 
Large divergence in levels suggests a miscalibration; divergence in trends suggests a missing time-series process or a misspecified time-trend parameter. + +The validation suite does not produce a single pass/fail metric — it is a diagnostic tool to inform judgement about whether a given parameterisation is fit for the intended research purpose. diff --git a/documentation/configuration.md b/documentation/configuration.md deleted file mode 100644 index 514396909..000000000 --- a/documentation/configuration.md +++ /dev/null @@ -1,101 +0,0 @@ -# Configuration - -## Prerequisites - -- Java 19 -- Maven 3.8+ -- Optional IDE: IntelliJ IDEA (import as a Maven project) - -## Quick run - -Three commands are all you need: - -```bash -mvn clean package -java -jar multirun.jar -DBSetup -java -jar multirun.jar -``` - -The first builds the JARs. The second creates the H2 donor database from the input data. The third runs the simulation using `default.yml`. - -To use a different config file: - -```bash -java -jar multirun.jar -config my_run.yml -``` - ---- - -## How config is applied - -`SimPathsMultiRun` loads `config/` and applies values in two stages: - -1. YAML values initialise runtime fields and argument maps. -2. CLI flags override those values if provided. - -If a key is not specified in the YAML, the Java class field default is used. Each config file is standalone — there is no inheritance between config files. - ---- - -## Writing your own config - -Place a new `.yml` file in `config/` and pass it via `-config`. You only need to specify the values you want to change — everything else falls back to the Java class field defaults. 
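The precedence described above (Java class field defaults, then YAML values, then CLI flags) can be sketched as a simple layered merge. This is a minimal illustration, not the actual `SimPathsMultiRun` code: the class `ConfigPrecedenceSketch`, the method `resolve`, and the use of plain maps are assumptions made for the example. Only the key names, the default values, and the `-n 10` override are taken from this page.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch of config precedence; not the SimPathsMultiRun implementation. */
public class ConfigPrecedenceSketch {

    /**
     * Merges three layers of settings. Later layers win:
     * Java field defaults < YAML values < CLI overrides.
     */
    public static Map<String, Object> resolve(Map<String, Object> defaults,
                                              Map<String, Object> yaml,
                                              Map<String, Object> cli) {
        Map<String, Object> resolved = new LinkedHashMap<>(defaults);
        resolved.putAll(yaml); // stage 1: YAML values replace the defaults
        resolved.putAll(cli);  // stage 2: CLI flags replace the YAML values
        return resolved;
    }

    public static void main(String[] args) {
        // Defaults as documented for the core run arguments
        Map<String, Object> defaults = Map.of("maxNumberOfRuns", 1, "randomSeed", 606);
        // Assumed content of a user config such as my_run.yml
        Map<String, Object> yaml = Map.of("maxNumberOfRuns", 5);
        // Assumed effect of passing "-n 10" on the command line
        Map<String, Object> cli = Map.of("maxNumberOfRuns", 10);

        Map<String, Object> resolved = resolve(defaults, yaml, cli);
        System.out.println("maxNumberOfRuns = " + resolved.get("maxNumberOfRuns"));
        System.out.println("randomSeed = " + resolved.get("randomSeed"));
    }
}
```

In this sketch `maxNumberOfRuns` resolves to 10 (the CLI value), while `randomSeed` keeps its default of 606 because neither the YAML layer nor the CLI layer sets it.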
- -### Core run arguments - -| Key | Default | Description | -|-----|---------|-------------| -| `maxNumberOfRuns` | `1` | Number of sequential simulation runs | -| `executeWithGui` | `false` | `true` launches the JAS-mine GUI; `false` = headless (required on servers/CI) | -| `randomSeed` | `606` | RNG seed for the first run | -| `startYear` | `2019` | First simulation year (must have matching input/donor data) | -| `endYear` | `2022` | Last simulation year (inclusive) | -| `popSize` | `50000` | Simulated population size; larger = more accurate but slower | - -### Collector arguments - -The `collector_args` section controls what output files are produced: - -| Flag | Default | Description | -|------|---------|-------------| -| `persistStatistics` | `true` | Write `Statistics1.csv` — income distribution, Gini, S-Index | -| `persistStatistics2` | `true` | Write `Statistics2.csv` — demographic validation by age and gender | -| `persistStatistics3` | `true` | Write `Statistics3.csv` — alignment diagnostics | -| `exportToCSV` | `true` | Write outputs to CSV files under `output//csv/` | - -For a description of the variables in these files, see `documentation/SimPaths_Variable_Codebook.xlsx`. - -### Minimal example - -```yaml -maxNumberOfRuns: 5 -executeWithGui: false -randomSeed: 42 -startYear: 2019 -endYear: 2030 -popSize: 20000 - -collector_args: - persistStatistics: true - persistStatistics2: true - persistStatistics3: true -``` - ---- - -## Additional arguments - -The YAML file supports several other argument sections (`model_args`, `innovation_args`, `parameter_args`) that control alignment flags, intertemporal optimisation settings, sensitivity analysis parameters, and file paths. Many of these are for specific analyses and some are under active review. The annotated `default.yml` file documents all available keys with inline comments. 
- -Note that some settings — particularly alignment — are primarily controlled in `SimPathsModel.java` rather than through the YAML file. - ---- - -## Practical notes - -- Use quotes around config filenames that contain spaces: `-config "my config.yml"`. -- Add `-f` to write run logs to `output/logs/`. -- Override individual values at runtime without editing the YAML, for example `-n 10` overrides `maxNumberOfRuns`. -- If you see `Config file not found`, the `-config` flag points to a file not present in `config/` — check the filename and extension. -- If `EUROMODpolicySchedule.xlsx` is missing, re-run setup: `java -jar multirun.jar -DBSetup`. -- On headless servers or CI, always use `executeWithGui: false` in your YAML (or `-g false` on the command line) to avoid GUI errors. diff --git a/documentation/data-pipeline.md b/documentation/data-pipeline.md deleted file mode 100644 index 554e9c540..000000000 --- a/documentation/data-pipeline.md +++ /dev/null @@ -1,111 +0,0 @@ -# Data Pipeline - -This page explains how the simulation-ready input files in `input/` are generated from raw survey data, and what to do if you need to update or extend them. - -The pipeline has three independent parts: (1) initial populations, (2) regression coefficients, (3) alignment targets. Each can be re-run separately. 
- ---- - -## Data sources - -| Source | Description | Access | -|--------|-------------|--------| -| **UKHLS** (Understanding Society) | Main household panel survey; waves 1 to O (UKDA-6614-stata) | Requires EUL licence from UK Data Service | -| **BHPS** (British Household Panel Survey) | Historical predecessor to UKHLS; used for pre-2009 employment history | Bundled with UKHLS EUL | -| **WAS** (Wealth and Assets Survey) | Biennial survey of household wealth; waves 1 to 7 (UKDA-7215-stata) | Requires EUL licence from UK Data Service | -| **EUROMOD / UKMOD** | Tax-benefit microsimulation system | See [Tax-Benefit Donors (UK)](../documentation/wiki/getting-started/data/tax-benefit-donors-uk.md) on the website | - ---- - -## Part 1 — Initial populations (`input/InitialPopulations/compile/`) - -**What it produces:** Annual CSV files `population_initial_UK_.csv` used as the starting population for each simulation run. - -**Master script:** `input/InitialPopulations/compile/00_master.do` - -The pipeline runs in numbered stages: - -| Script | What it does | -|--------|-------------| -| `01_prepare_UKHLS_pooled_data.do` | Pools and standardises UKHLS waves | -| `02_create_UKHLS_variables.do` | Constructs all required variables (demographics, labour, health, income, wealth flags) and applies simulation-consistency rules (retirement as absorbing state, education age bounds, work/hours consistency) | -| `02_01_checks.do` | Data quality checks | -| `03_social_care_received.do` | Social care receipt variables | -| `04_social_care_provided.do` | Informal care provision variables | -| `05_create_benefit_units.do` | Groups individuals into benefit units (tax units) following UK tax-benefit rules | -| `06_reweight_and_slice.do` | Reweighting and year-specific slicing | -| `07_was_wealth_data.do` | Prepares Wealth and Assets Survey data | -| `08_wealth_to_ukhls.do` | Merges WAS wealth into UKHLS records | -| `09_finalise_input_data.do` | Final cleaning and formatting | -| 
`10_check_yearly_data.do` | Per-year consistency checks | -| `99_training_data.do` | Produces the de-identified training population committed to `input/InitialPopulations/training/` | - -### Employment history sub-pipeline (`compile/do_emphist/`) - -Reconstructs each respondent's monthly employment history from January 2007 onwards by combining UKHLS and BHPS interview records. The output variable `liwwh` (months employed since Jan 2007) feeds into the labour supply models. - -| Script | Purpose | -|--------|---------| -| `00_Master_emphist.do` | Master; sets parameters and calls sub-scripts | -| `01_Intdate.do` – `07_Empcal1a.do` | Sequential stages: interview dating, BHPS linkage, employment spell reconstruction, new-entrant identification | - ---- - -## Part 2 — Regression coefficients (`input/InitialPopulations/compile/RegressionEstimates/`) - -**What it produces:** The `reg_*.xlsx` coefficient tables read by `Parameters.java` at simulation startup. - -**Master script:** `input/InitialPopulations/compile/RegressionEstimates/master.do` - -> **Note:** Income and union-formation regressions depend on predicted wages, so `reg_wages.do` must complete before `reg_income.do` and `reg_partnership.do`. All other scripts can run in any order. 
- -**Required Stata packages:** `fre`, `tsspell`, `carryforward`, `outreg2`, `oparallel`, `gologit2`, `winsor`, `reghdfe`, `ftools`, `require` - -| Script | Module | Method | -|--------|--------|--------| -| `reg_wages.do` | Hourly wages | Heckman selection model (males and females separately) | -| `reg_income.do` | Non-labour income | Hurdle model (selection + amount); requires predicted wages | -| `reg_partnership.do` | Partnership formation/dissolution | Probit; requires predicted wages | -| `reg_education.do` | Education transitions | Generalised ordered logit | -| `reg_fertility.do` | Fertility | Probit | -| `reg_health.do` | Physical health (SF-12 PCS) | Linear regression | -| `reg_health_mental.do` | Mental health (GHQ-12, SF-12 MCS) | Linear regression | -| `reg_health_wellbeing.do` | Life satisfaction | Linear regression | -| `reg_home_ownership.do` | Homeownership transitions | Probit | -| `reg_retirement.do` | Retirement | Probit | -| `reg_leave_parental_home.do` | Leaving parental home | Probit | -| `reg_socialcare.do` | Social care receipt and provision | Probit / ordered logit | -| `reg_unemployment.do` | Unemployment transitions | Probit | -| `reg_financial_distress.do` | Financial distress | Probit | -| `programs.do` | Shared utility programs called by the estimation scripts | — | -| `variable_update.do` | Prepares and recodes variables before estimation | — | - -After running, output Excel files are placed in `input/` (overwriting the existing `reg_*.xlsx` files). - ---- - -## Part 3 — Alignment targets (`input/DoFilesTarget/`) - -**What it produces:** The `align_*.xlsx` and `*_targets.xlsx` files that the alignment modules use to rescale simulated rates. 
- -| Script | Output file | -|--------|------------| -| `01_employment_shares_initpopdata.do` | `input/employment_targets.xlsx` — employment shares by benefit-unit subgroup and year | -| `01_inSchool_targets_initpopdata.do` | `input/inSchool_targets.xlsx` — school participation rates by year | -| `03_calculate_partneredShare_initialPop_BUlogic.do` | `input/partnered_share_targets.xlsx` — partnership shares by year | -| `03_calculate_partnership_target.do` | Supplementary partnership targets | -| `02_person_risk_employment_stats.do` | `employment_risk_emp_stats.csv` — person-level at-risk diagnostics used for employment alignment group construction | - -Population projection targets (`align_popProjections.xlsx`) and fertility/mortality projections (`projections_*.xlsx`) come from ONS published projections and are not generated by these scripts. - ---- - -## When to re-run each part - -| Situation | What to re-run | -|-----------|---------------| -| Adding a new data year to the simulation | Part 1 (re-slice the population for the new year) + Part 3 (update alignment targets) | -| Re-estimating a behavioural module | Part 2 (the affected `reg_*.do` script only) + Stage 1 validation | -| Updating employment alignment targets | Part 3 (`01_employment_shares_initpopdata.do`) | - -After re-running any part, re-run setup (`singlerun -Setup` or `multirun -DBSetup`) to rebuild `input/input.mv.db` before running the simulation. diff --git a/documentation/model-concepts.md b/documentation/model-concepts.md deleted file mode 100644 index 8a150938c..000000000 --- a/documentation/model-concepts.md +++ /dev/null @@ -1,128 +0,0 @@ -# Model Concepts - -SimPaths is a dynamic population microsimulation model that advances a starting population of real households forward in time, year by year, simulating individual life events through statistical regression models and rule-based processes. 
For the full academic description — including the 11 simulated modules — see the [Overview](../documentation/wiki/overview/index.md) section of the website, in particular [Simulated Modules](../documentation/wiki/overview/simulated-modules.md). - -This page covers what you need to understand the **code and configuration**: agent structure, the annual process order, alignment flags, and the tax-benefit system. - ---- - -## Agent hierarchy - -The simulation maintains three nested entity types. - -### Person - -The individual. Each person carries their own demographic, health, education, labour, and income attributes. Almost all behavioural processes are resolved at the person level. - -Key attributes tracked per person: - -- **Demographics**: age, gender, region -- **Education**: highest qualification (`Low` / `Medium` / `High` / `InEducation`), mother's and father's education -- **Labour market status**: `EmployedOrSelfEmployed`, `NotEmployed`, `Student`, or `Retired`; weekly hours worked; wage rate; work history in months -- **Health**: physical health (SF-12 PCS), mental health (SF-12 MCS, GHQ-12 psychological distress, caseness indicator), life satisfaction (0–10), EQ-5D utility score, disability/care-need flag -- **Partnership**: partner reference, years in partnership -- **Income**: gross labour income, capital income, pension income, benefit receipt flags (UC and non-UC) -- **Social care**: formal and informal care hours received per week; informal care hours provided per week -- **Financial wellbeing**: equivalised disposable income, lifetime income trajectory, financial distress flag - -### BenefitUnit - -The tax-and-benefit assessment unit — typically an adult (or couple) and their dependent children. Taxes and benefits are computed here, mirroring how real-world tax-benefit systems work. 
- -Key attributes: - -- Region, homeownership flag, wealth -- Equivalised disposable income (EDI) and year-on-year change in log-EDI -- Poverty flag (< 60% of median equivalised household disposable income) -- Discretionary consumption (when intertemporal optimisation is enabled) - -### Household - -A grouping of benefit units sharing an address. Used for aggregation and housing-related logic. A household may contain more than one benefit unit (e.g. adult children living with parents before leaving home). - ---- - -## Annual simulation cycle - -SimPaths uses **discrete annual time steps**. Within each year, processes fire in a fixed order defined in `SimPathsModel.buildSchedule()`. - -| # | Process | Level | Description | -|---|---------|-------|-------------| -| 1 | StartYear | model | Year logging and housekeeping | -| 2 | RationalOptimisation | model | *First year only.* Pre-computes intertemporal decision grids (if enabled) | -| 3 | UpdateParameters | model | Loads year-specific parameters and time-series factors | -| 4 | GarbageCollection | model | Removes stale entity references | -| 5 | UpdateWealth | benefit unit | Updates savings/wealth stocks (if intertemporal enabled) | -| 6 | Update | benefit unit | Refreshes composition counts, clears state flags | -| 7 | Update | person | Refreshes state variables and lag values | -| 8 | Aging | person | Increments age; dependent children reaching independence are split into their own benefit unit | -| 9 | ConsiderRetirement | person | Stochastic retirement decision | -| 10 | InSchool | person | Whether person remains in / enters education (age 16–29) | -| 11 | InSchoolAlignment | model | Aligns school participation rate to targets (if enabled) | -| 12 | LeavingSchool | person | Transition out of education; assigns completed qualification | -| 13 | EducationLevelAlignment | model | Aligns completed education distribution (if enabled) | -| 14 | Homeownership | benefit unit | Homeownership transition | -| 15 | Health | 
person | Updates physical health and disability status | -| 16 | UpdatePotentialHourlyEarnings | person | Refreshes wage potential prior to labour supply decisions | -| 17 | CohabitationAlignment | model | Aligns cohabitation share to targets (if enabled) | -| 18 | Cohabitation | person | Entry into partnership | -| 19 | PartnershipDissolution | person | Exit from partnership (separation or bereavement) | -| 20 | UnionMatching | model | Matches unpartnered individuals into new couples | -| 21 | FertilityAlignment | model | Scales birth probabilities to projected fertility rates (if enabled) | -| 22 | Fertility | person | Fertility decision for women of childbearing age | -| 23 | GiveBirth | person | Adds newborn children to the simulation | -| 24 | SocialCareReceipt | person | Formal and informal care receipt for those with a care need | -| 25 | SocialCareProvision | person | Informal care provision by eligible individuals | -| 26 | Unemployment | person | Unemployment transitions | -| 27 | UpdateStates | benefit unit | Refreshes joint labour states for IO decisions (if enabled) | -| 28 | LabourMarketAndIncomeUpdate | model | Resolves labour supply; imputes taxes and benefits via EUROMOD donor matching | -| 29 | ReceivesBenefits | benefit unit | Assigns benefit receipt flags from the donor match | -| 30 | ProjectDiscretionaryConsumption | benefit unit | Consumption/savings decision (if intertemporal enabled) | -| 31 | ProjectEquivConsumption | person | Computes individual equivalised consumption share | -| 32 | CalculateChangeInEDI | benefit unit | Updates equivalised disposable income and year-on-year change | -| 33 | ReviseLifetimeIncome | person | Updates lifetime income trajectory (if intertemporal enabled) | -| 34 | FinancialDistress | person | Financial distress indicator | -| 35–40 | Mental health and wellbeing | person | GHQ-12 distress (levels + caseness, two steps each); SF-12 MCS and PCS (two steps each); life satisfaction (two steps) | -| 41 | 
ConsiderMortality | person | Stochastic mortality | -| 42 | HealthEQ5D | person | EQ-5D utility score update | -| 43 | PopulationAlignment | model | Re-weights/resamples population to match demographic projections | -| 44 | EndYear / UpdateYear | model | Year-end housekeeping | - -The first simulation year runs a subset of these (some states are inherited directly from input data). All subsequent years run the full schedule. - ---- - -## Alignment - -Alignment prevents simulated aggregate rates from drifting away from known targets. Rather than discarding individual-level stochastic variation, it rescales or resamples agents' outcomes so the population total matches a target share or count. - -Each dimension is controlled by a boolean flag in `model_args`: - -| Flag | What it aligns | Default | -|------|----------------|---------| -| `alignPopulation` | Age-sex-region population totals to demographic projections | `true` | -| `alignCohabitation` | Share of individuals in partnerships | `true` | -| `alignFertility` | Birth rates to projected fertility rates | `false` | -| `alignInSchool` | School participation rate (age 16–29) | `false` | -| `alignEducation` | Completed education level distribution | `false` | -| `alignEmployment` | Employment share | `false` | - ---- - -## Tax-benefit system (EUROMOD donor matching) - -SimPaths does not compute taxes and benefits from first principles. It uses **donor matching**: - -1. A database of tax-benefit outcomes is pre-computed by running EUROMOD/UKMOD over a population of "donor" households for each policy year. -2. Each simulated benefit unit selects a donor whose characteristics (labour hours, earnings, household composition, region, year) closely match its own. -3. The donor's computed disposable income, tax, and benefit amounts are imputed to the simulated unit. - -This gives SimPaths annually updated policy rules without re-implementing the full tax-benefit schedule. 
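The three donor-matching steps above can be illustrated with a small nearest-neighbour sketch. It is only an illustration under stated assumptions and does not reproduce the SimPaths matching code: the class `DonorMatchSketch`, the `Donor` fields, and the distance function are hypothetical, and the example matches on weekly hours and gross earnings only, whereas the text also lists household composition, region, and year.

```java
import java.util.List;

/** Hypothetical sketch of donor matching; not the SimPaths implementation. */
public class DonorMatchSketch {

    /** One pre-computed tax-benefit outcome for a donor unit (illustrative fields). */
    public static final class Donor {
        public final double weeklyHours;
        public final double grossEarnings;
        public final double disposableIncome;

        public Donor(double weeklyHours, double grossEarnings, double disposableIncome) {
            this.weeklyHours = weeklyHours;
            this.grossEarnings = grossEarnings;
            this.disposableIncome = disposableIncome;
        }
    }

    /** Assumed distance: relative earnings gap plus hours gap scaled by a 40-hour week. */
    static double distance(Donor donor, double weeklyHours, double grossEarnings) {
        double earningsGap = Math.abs(donor.grossEarnings - grossEarnings)
                / Math.max(1.0, grossEarnings);
        double hoursGap = Math.abs(donor.weeklyHours - weeklyHours) / 40.0;
        return earningsGap + hoursGap;
    }

    /** Returns the disposable income of the donor closest to the simulated unit. */
    public static double imputeDisposableIncome(List<Donor> donors,
                                                double weeklyHours,
                                                double grossEarnings) {
        if (donors.isEmpty()) {
            throw new IllegalArgumentException("donor list must not be empty");
        }
        Donor best = donors.get(0);
        for (Donor candidate : donors) {
            if (distance(candidate, weeklyHours, grossEarnings)
                    < distance(best, weeklyHours, grossEarnings)) {
                best = candidate;
            }
        }
        return best.disposableIncome;
    }

    public static void main(String[] args) {
        // Made-up donor values, for illustration only
        List<Donor> donors = List.of(
                new Donor(0, 0, 9000),
                new Donor(20, 12000, 15000),
                new Donor(40, 30000, 24500));
        // A simulated unit working 38 hours with gross earnings of 28000
        System.out.println(imputeDisposableIncome(donors, 38, 28000));
    }
}
```

The point of the design is that the simulated unit never evaluates tax rules itself: it only looks up the outcome already computed for a similar donor, which is why refreshing the donor database is enough to pick up new policy years.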
See [Tax-Benefit Donors (UK)](../documentation/wiki/getting-started/data/tax-benefit-donors-uk.md) for how to generate the donor database. - ---- - -## Intertemporal optimisation - -When `enableIntertemporalOptimisations: true`, SimPaths solves a life-cycle consumption and labour supply problem. Decision grids are pre-computed in year 0 (`RationalOptimisation`) by solving backwards over the remaining horizon. In each subsequent year agents look up their optimal choice from the grid given their current state. - -This is computationally intensive and disabled by default. When enabled, `saveBehaviour` and `useSavedBehaviour` allow a baseline grid to be reused in counterfactual runs without recomputing it — see the annotated `config/default.yml` for the relevant keys. diff --git a/documentation/validation-guide.md b/documentation/validation-guide.md deleted file mode 100644 index be6201cfe..000000000 --- a/documentation/validation-guide.md +++ /dev/null @@ -1,96 +0,0 @@ -# Validation Guide - -SimPaths uses a two-stage validation workflow in `validation/`. Stage 1 checks that each estimated regression model is well-specified before simulation; stage 2 checks that full simulation output matches observed survey data. - ---- - -## Stage 1 — Estimate validation (`validation/01_estimate_validation/`) - -**When to run:** After updating or re-estimating any regression module (i.e. after re-running scripts in `input/InitialPopulations/compile/RegressionEstimates/`). - -**What it does:** For each behavioural module, the script loads the estimation sample, computes predicted values from the estimated coefficients, adds individual heterogeneity via 20 stochastic draws (as in multiple imputation), and overlays the predicted and observed distributions as histograms. 
-
-| Script | Module validated |
-|--------|----------------|
-| `int_val_wages.do` | Hourly wages — Heckman selection model, separately for males/females with and without previous wage history |
-| `int_val_education.do` | Education transitions (3 processes) |
-| `int_val_fertility.do` | Fertility (2 processes) |
-| `int_val_health.do` | Physical health transitions |
-| `int_val_home_ownership.do` | Homeownership transitions |
-| `int_val_income.do` | Income processes — hurdle models (selection and amount) |
-| `int_val_leave_parental_home.do` | Leaving parental home |
-| `int_val_partnership.do` | Partnership formation and dissolution |
-| `int_val_retirement.do` | Retirement transitions |
-
-**Outputs:** PNG graphs saved under `validation/01_estimate_validation/graphs//`. Each graph shows predicted (red) vs observed (black outline) distributions. If the shapes diverge substantially, the regression may be mis-specified or the estimation sample may need updating.
-
----
-
-## Stage 2 — Simulated output validation (`validation/02_simulated_output_validation/`)
-
-**When to run:** After completing a baseline simulation run that you want to assess for plausibility.
-
-**What it does:** Loads your simulation output CSVs, loads UKHLS initial population data as an observational benchmark, and produces side-by-side time-series plots comparing 18 simulated outcomes against the observed distributions with confidence intervals.
-
-### Setup
-
-Before running, open `00_master.do` and set the global paths:
-
-```stata
-global path "/your/local/path/to/validation/02_simulated_output_validation"
-global dir_sim "/your/output//csv" * folder with simulation CSVs
-global dir_obs "/path/to/ukhls/initial/populations"
-```
-
-Then run `00_master.do`. It calls all sub-scripts in order.
-
-### Scripts and what they check
-
-**Data preparation (run first, automatically called by master):**
-
-| Script | Purpose |
-|--------|---------|
-| `01_prepare_simulated_data.do` | Loads `Household.csv`, `BenefitUnit.csv`, `Person.csv` from the simulation output |
-| `02_create_simulated_variables.do` | Derives analysis variables (sex, age groups, labour supply, income); produces full sample and ages 18–65 subset |
-| `03_prepare_UKHLS_data.do` | Loads UKHLS observed data; prepares disposable income and matching variables |
-| `05_create_UKHLS_validation_targets.do` | Creates target variables from UKHLS initial population CSVs by year |
-
-**Comparison plots (18 scripts, `06_01` through `06_18`):**
-
-| Script | What is compared |
-|--------|-----------------|
-| `06_01_plot_activity_status.do` | Economic activity: employed, student, inactive, retired by age group |
-| `06_02_plot_education_level.do` | Completed education distribution over time |
-| `06_03_plot_gross_income.do` | Gross benefit-unit income |
-| `06_04_plot_gross_labour_income.do` | Gross labour income |
-| `06_05_plot_capital_income.do` | Capital income (interest, dividends) |
-| `06_06_plot_pension_income.do` | Pension income |
-| `06_07_plot_disposable_income.do` | Disposable income after taxes and benefits |
-| `06_08_plot_equivalised_disposable_income.do` | Household-size-adjusted disposable income |
-| `06_09_plot_hourly_wages.do` | Hourly wages for employees |
-| `06_10_plot_hours_worked.do` | Weekly hours worked by employment status |
-| `06_11_plot_income_shares.do` | Income distribution across quintiles |
-| `06_12_plot_partnership_status.do` | Partnership status (single, married, cohabiting, previously partnered) |
-| `06_13_plot_health.do` | Physical and mental health (SF-12 PCS and MCS) |
-| `06_14_plot_at_risk_of_poverty.do` | At-risk-of-poverty rate |
-| `06_15_plot_inequality.do` | Income inequality (p90/p50 ratio) |
-| `06_16_plot_number_children.do` | Number of dependent children |
-| `06_17_plot_disability.do` | Disability prevalence |
-| `06_18_plot_social_care.do` | Social care receipt |
-
-**Correlation analysis:**
-
-| Script | Purpose |
-|--------|---------|
-| `07_01_correlations.do` | Checks that key relationships between variables (e.g. income and employment, health and age) are preserved in the simulated data relative to UKHLS |
-
-**Outputs:** PNG graphs saved under `validation/02_simulated_output_validation/graphs//`, organised by topic (income, health, inequality, partnership, etc.). A reference set from a named run (`20250909_run`) is already committed and can serve as a baseline for comparison.
-
----
-
-## Interpreting results
-
-- **Stage 1:** Predicted and observed histograms should broadly overlap. Systematic divergence (e.g. predicted wages consistently too high) indicates a problem with the estimation or variable construction.
-- **Stage 2:** Simulated time-series should track UKHLS trends within reasonable uncertainty bounds. Large divergence in levels suggests a miscalibration; divergence in trends suggests a missing time-series process or a misspecified time-trend parameter.
-
-The validation suite does not produce a single pass/fail metric — it is a diagnostic tool to inform judgement about whether a given parameterisation is fit for the intended research purpose.
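
As a sanity check before running `00_master.do`, it may help to confirm that all 18 comparison scripts (`06_01` through `06_18`) are present. The following is a minimal sketch, assuming the scripts sit directly in one directory and follow the `06_NN_` naming shown in the table; the function name `missing_plot_scripts` is hypothetical and not part of SimPaths.

```shell
# Hypothetical helper: print the prefix of every Stage 2 comparison script
# (06_01_ .. 06_18_) that has no matching .do file in the given directory.
missing_plot_scripts() {
    dir="$1"
    i=1
    while [ "$i" -le 18 ]; do
        prefix=$(printf '06_%02d_' "$i")
        found=0
        for f in "$dir"/"$prefix"*.do; do
            # An unmatched glob stays literal, so test that the file really exists.
            if [ -e "$f" ]; then
                found=1
                break
            fi
        done
        [ "$found" -eq 1 ] || echo "$prefix"
        i=$((i + 1))
    done
}
```

Calling `missing_plot_scripts validation/02_simulated_output_validation` prints nothing when every script is found, and one prefix per line otherwise.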
From 99fc809e39918a8286258279a359d469ed0f9be8 Mon Sep 17 00:00:00 2001
From: hk-2029
Date: Tue, 17 Mar 2026 09:09:57 +0000
Subject: [PATCH 23/23] docs: move quick start to root README, slim documentation/README to data pipeline only

---
 README.md               |  28 ++++++-
 documentation/README.md | 174 +---------------------------------------
 2 files changed, 30 insertions(+), 172 deletions(-)

diff --git a/README.md b/README.md
index 6417b97a6..bf5c76ffe 100644
--- a/README.md
+++ b/README.md
@@ -8,7 +8,33 @@ SimPaths is an open-source framework for modelling individual and household life
 
 SimPaths models currently exist for the UK, Greece, Hungary, Italy, and Poland. This page refers to the UK model; the other European models are available at the corresponding [SimPathsEU](https://github.com/centreformicrosimulation/SimPathsEU) page.
 
-The entire SimPaths documentation is available on its [WikiPage](https://github.com/centreformicrosimulation/SimPaths/wiki), which includes: a detailed description of its building blocks; instructions on how to set up and run the model; information about contributing to the model's development.
+The entire SimPaths documentation is available on its [website](https://centreformicrosimulation.github.io/SimPaths/), which includes: a detailed description of its building blocks; instructions on how to set up and run the model; information about contributing to the model's development.
+
+## Quick start
+
+### Prerequisites
+
+- Java 19
+- Maven 3.8+
+- Optional IDE: IntelliJ IDEA (import as a Maven project)
+
+### Build and run
+
+```bash
+mvn clean package
+java -jar multirun.jar -DBSetup
+java -jar multirun.jar
+```
+
+The first command builds the JARs. The second creates the H2 donor database from the input data. The third runs the simulation using `default.yml`.
+
+To use a different config file:
+
+```bash
+java -jar multirun.jar -config my_run.yml
+```
+
+For configuration options, see the annotated `config/default.yml`. For the data pipeline and further reference, see [`documentation/`](documentation/README.md).
diff --git a/documentation/README.md b/documentation/README.md
index a4a0f661c..c9756cd50 100644
--- a/documentation/README.md
+++ b/documentation/README.md
@@ -1,113 +1,9 @@
-# SimPaths Quick Reference
+# Data Pipeline Reference
 
-A command-line quick reference for building, running, configuring, and validating SimPaths. For the full model documentation — simulated modules, parameterisation, GUI usage, research — see the [website](../documentation/wiki/index.md).
+For building and running SimPaths, see the [root README](../README.md). For the full model documentation, see the [website](https://centreformicrosimulation.github.io/SimPaths/).
 
 ---
 
-## 1. Building and running
-
-### Prerequisites
-
-- Java 19
-- Maven 3.8+
-- Optional IDE: IntelliJ IDEA (import as a Maven project)
-
-### Quick run
-
-Three commands are all you need:
-
-```bash
-mvn clean package
-java -jar multirun.jar -DBSetup
-java -jar multirun.jar
-```
-
-The first builds the JARs. The second creates the H2 donor database from the input data. The third runs the simulation using `default.yml`.
-
-To use a different config file:
-
-```bash
-java -jar multirun.jar -config my_run.yml
-```
-
----
-
-## 2. Configuration
-
-SimPaths batch runs are controlled by YAML files in `config/`. The main config is `default.yml`, which is fully annotated with inline comments.
-
-### How config is applied
-
-`SimPathsMultiRun` loads `config/` and applies values in two stages:
-
-1. YAML values initialise runtime fields and argument maps.
-2. CLI flags override those values if provided.
-
-If a key is not specified in the YAML, the Java class field default is used. Each config file is standalone — there is no inheritance between config files.
-
-### Writing your own config
-
-Place a new `.yml` file in `config/` and pass it via `-config`. You only need to specify the values you want to change — everything else falls back to the Java class field defaults.
-
-#### Core run arguments
-
-| Key | Default | Description |
-|-----|---------|-------------|
-| `maxNumberOfRuns` | `1` | Number of sequential simulation runs |
-| `executeWithGui` | `false` | `true` launches the JAS-mine GUI; `false` = headless (required on servers/CI) |
-| `randomSeed` | `606` | RNG seed for the first run |
-| `startYear` | `2019` | First simulation year (must have matching input/donor data) |
-| `endYear` | `2022` | Last simulation year (inclusive) |
-| `popSize` | `50000` | Simulated population size; larger = more accurate but slower |
-
-#### Collector arguments
-
-The `collector_args` section controls what output files are produced:
-
-| Flag | Default | Description |
-|------|---------|-------------|
-| `persistStatistics` | `true` | Write `Statistics1.csv` — income distribution, Gini, S-Index |
-| `persistStatistics2` | `true` | Write `Statistics2.csv` — demographic validation by age and gender |
-| `persistStatistics3` | `true` | Write `Statistics3.csv` — alignment diagnostics |
-| `exportToCSV` | `true` | Write outputs to CSV files under `output//csv/` |
-
-For a description of the variables in these files, see `documentation/SimPaths_Variable_Codebook.xlsx`.
-
-#### Minimal example
-
-```yaml
-maxNumberOfRuns: 5
-executeWithGui: false
-randomSeed: 42
-startYear: 2019
-endYear: 2030
-popSize: 20000
-
-collector_args:
-  persistStatistics: true
-  persistStatistics2: true
-  persistStatistics3: true
-```
-
-### Additional arguments
-
-The YAML file supports several other argument sections (`model_args`, `innovation_args`, `parameter_args`) that control alignment flags, intertemporal optimisation settings, sensitivity analysis parameters, and file paths. Many of these are for specific analyses and some are under active review. The annotated `default.yml` file documents all available keys with inline comments.
-
-Note that some settings — particularly alignment — are primarily controlled in `SimPathsModel.java` rather than through the YAML file.
-
-### Practical notes
-
-- Use quotes around config filenames that contain spaces: `-config "my config.yml"`.
-- Add `-f` to write run logs to `output/logs/`.
-- Override individual values at runtime without editing the YAML, for example `-n 10` overrides `maxNumberOfRuns`.
-- If you see `Config file not found`, the `-config` flag points to a file not present in `config/` — check the filename and extension.
-- If `EUROMODpolicySchedule.xlsx` is missing, re-run setup: `java -jar multirun.jar -DBSetup`.
-- On headless servers or CI, always use `executeWithGui: false` in your YAML (or `-g false` on the command line) to avoid GUI errors.
-
----
-
-## 3. Data pipeline
-
 This section explains how the simulation-ready input files in `input/` are generated from raw survey data, and what to do if you need to update or extend them.
 
 The pipeline has three independent parts: (1) initial populations, (2) regression coefficients, (3) alignment targets. Each can be re-run separately.
@@ -119,7 +15,7 @@ The pipeline has three independent parts: (1) initial populations, (2) regressio
 | **UKHLS** (Understanding Society) | Main household panel survey; waves 1 to O (UKDA-6614-stata) | Requires EUL licence from UK Data Service |
 | **BHPS** (British Household Panel Survey) | Historical predecessor to UKHLS; used for pre-2009 employment history | Bundled with UKHLS EUL |
 | **WAS** (Wealth and Assets Survey) | Biennial survey of household wealth; waves 1 to 7 (UKDA-7215-stata) | Requires EUL licence from UK Data Service |
-| **EUROMOD / UKMOD** | Tax-benefit microsimulation system | See [Tax-Benefit Donors (UK)](../documentation/wiki/getting-started/data/tax-benefit-donors-uk.md) on the website |
+| **EUROMOD / UKMOD** | Tax-benefit microsimulation system | See [Tax-Benefit Donors (UK)](wiki/getting-started/data/tax-benefit-donors-uk.md) on the website |
 
 ### Part 1 — Initial populations (`input/InitialPopulations/compile/`)
 
@@ -207,67 +103,3 @@ Population projection targets (`align_popProjections.xlsx`) and fertility/mortal
 | Updating employment alignment targets | Part 3 (`01_employment_shares_initpopdata.do`) |
 
 After re-running any part, re-run setup (`singlerun -Setup` or `multirun -DBSetup`) to rebuild `input/input.mv.db` before running the simulation.
-
----
-
-## 4. Validation
-
-SimPaths uses a two-stage validation workflow in `validation/`. Stage 1 checks that each estimated regression model is well-specified before simulation; stage 2 checks that full simulation output matches observed survey data. For the conceptual overview and detailed setup instructions, see [Model Validation](../documentation/wiki/validation/index.md) on the website.
-
-### Stage 1 — Estimate validation (`validation/01_estimate_validation/`)
-
-**When to run:** After updating or re-estimating any regression module (i.e. after re-running scripts in `input/InitialPopulations/compile/RegressionEstimates/`).
-
-**What it does:** For each behavioural module, the script loads the estimation sample, computes predicted values from the estimated coefficients, adds individual heterogeneity via 20 stochastic draws (as in multiple imputation), and overlays the predicted and observed distributions as histograms.
-
-| Script | Module validated |
-|--------|----------------|
-| `int_val_wages.do` | Hourly wages — Heckman selection model, separately for males/females with and without previous wage history |
-| `int_val_education.do` | Education transitions (3 processes) |
-| `int_val_fertility.do` | Fertility (2 processes) |
-| `int_val_health.do` | Physical health transitions |
-| `int_val_home_ownership.do` | Homeownership transitions |
-| `int_val_income.do` | Income processes — hurdle models (selection and amount) |
-| `int_val_leave_parental_home.do` | Leaving parental home |
-| `int_val_partnership.do` | Partnership formation and dissolution |
-| `int_val_retirement.do` | Retirement transitions |
-
-**Outputs:** PNG graphs saved under `validation/01_estimate_validation/graphs//`. Each graph shows predicted (red) vs observed (black outline) distributions.
-
-### Stage 2 — Simulated output validation (`validation/02_simulated_output_validation/`)
-
-**When to run:** After completing a baseline simulation run that you want to assess for plausibility.
-
-**What it does:** Loads your simulation output CSVs, loads UKHLS initial population data as an observational benchmark, and produces side-by-side time-series plots comparing 18 simulated outcomes against the observed distributions with confidence intervals.
-
-**Comparison plots (18 scripts, `06_01` through `06_18`):**
-
-| Script | What is compared |
-|--------|-----------------|
-| `06_01_plot_activity_status.do` | Economic activity: employed, student, inactive, retired by age group |
-| `06_02_plot_education_level.do` | Completed education distribution over time |
-| `06_03_plot_gross_income.do` | Gross benefit-unit income |
-| `06_04_plot_gross_labour_income.do` | Gross labour income |
-| `06_05_plot_capital_income.do` | Capital income (interest, dividends) |
-| `06_06_plot_pension_income.do` | Pension income |
-| `06_07_plot_disposable_income.do` | Disposable income after taxes and benefits |
-| `06_08_plot_equivalised_disposable_income.do` | Household-size-adjusted disposable income |
-| `06_09_plot_hourly_wages.do` | Hourly wages for employees |
-| `06_10_plot_hours_worked.do` | Weekly hours worked by employment status |
-| `06_11_plot_income_shares.do` | Income distribution across quintiles |
-| `06_12_plot_partnership_status.do` | Partnership status (single, married, cohabiting, previously partnered) |
-| `06_13_plot_health.do` | Physical and mental health (SF-12 PCS and MCS) |
-| `06_14_plot_at_risk_of_poverty.do` | At-risk-of-poverty rate |
-| `06_15_plot_inequality.do` | Income inequality (p90/p50 ratio) |
-| `06_16_plot_number_children.do` | Number of dependent children |
-| `06_17_plot_disability.do` | Disability prevalence |
-| `06_18_plot_social_care.do` | Social care receipt |
-
-**Outputs:** PNG graphs saved under `validation/02_simulated_output_validation/graphs//`. A reference set from a baseline run (`20250909_run`) is already committed for comparison.
-
-### Interpreting results
-
-- **Stage 1:** Predicted and observed histograms should broadly overlap. Systematic divergence indicates a problem with the estimation or variable construction.
-- **Stage 2:** Simulated time-series should track UKHLS trends within reasonable uncertainty bounds. Large divergence in levels suggests a miscalibration; divergence in trends suggests a missing time-series process or a misspecified time-trend parameter.
-
-The validation suite does not produce a single pass/fail metric — it is a diagnostic tool to inform judgement about whether a given parameterisation is fit for the intended research purpose.
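
The run flags mentioned in this patch (`-config`, `-n`, `-g false`, `-f`) can be combined in a small wrapper. This is a sketch under stated assumptions: the function name `simpaths_cmd` is hypothetical, it only prints the command rather than running it, and using all four flags in a single call is assumed to be valid rather than confirmed by the text above.

```shell
# Hypothetical helper: print (do not run) a headless multirun command.
#   $1 = config file name inside config/
#   $2 = number of runs (overrides maxNumberOfRuns from the YAML)
simpaths_cmd() {
    config="$1"
    runs="$2"
    # The config name is quoted so that names containing spaces survive.
    # -g false disables the GUI; -f writes run logs to output/logs/.
    printf 'java -jar multirun.jar -config "%s" -n %s -g false -f\n' "$config" "$runs"
}
```

Calling `simpaths_cmd "my config.yml" 10` prints the command for inspection; once it looks right, the printed line can be piped to `sh` to execute it.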