- [2026-04] Paper accepted to ACL 2026 Findings.
Large language models often need to integrate multiple information sources in real-world systems such as RAG and chat-based assistants. In these settings, a model may need to balance: (1) its own parametric knowledge, (2) a user's assertion, and (3) information attributed to retrieved documents.
Prior work on knowledge conflict and sycophancy has mostly studied binary conflicts: parametric knowledge vs. retrieved context, or parametric knowledge vs. user beliefs. However, realistic assistant systems often involve all three sources at the same time.
To study this setting, we propose a three-source interaction framework that evaluates how LLMs weigh parametric knowledge, user assertions, and document assertions under controlled source-conflict scenarios.
- Source reliance: How do LLMs weigh internal parametric knowledge, user assertions, and document assertions?
- Discrimination ability: Can LLMs distinguish helpful external information from harmful or misleading information?
- Post-training effects: How does post-training affect model preferences in multi-source settings?
We construct 13 probe variants for each question, including:
- a bare probe with no external assertion,
- single-source probes with either a user or document assertion,
- double-source probes where both user and document assertions are present,
- cases where external assertions are correct, incorrect, or conflicting.
We evaluate 27 LLMs from three model families (GPT-4o, Llama 3 / 3.1, and Qwen3) on CommonsenseQA and GSM8K-MC.
- Document preference: Most models rely more on document-attributed assertions than user-attributed assertions.
- Post-training reinforces document reliance: Instruction-tuned or post-trained models tend to show stronger document preference.
- Models are often impressionable: Many models accept helpful external information but also fail to resist harmful or misleading assertions.
- Fine-tuning helps: Supervised fine-tuning on diverse source-interaction data improves models' ability to distinguish helpful from harmful external information, while largely preserving general capabilities.
Models do not always combine sources rationally: they may over-trust document-like content, underweight user input, or absorb misleading claims. We see this work as a first step toward a broader research direction on multi-source knowledge integration, with natural extensions to:
- Multi-source knowledge integration: Extending the framework beyond user and document assertions to other sources, such as tool outputs, search results, memory modules, database entries, environment feedback, or other agents' messages.
- Multi-modal source interaction: Studying how external information from different modalities, such as text, images, audio, and video, affects model decisions, especially when different modalities provide conflicting or partially consistent evidence.
- Agentic settings: Applying the framework to language agents that perform planning, reasoning, tool use, and interaction with external environments. In these settings, agents may need to reconcile information from users, retrieved documents, tools, sensors, previous actions, and changing world states.
- Robust training and mitigation: Developing training methods that help models selectively accept helpful information while resisting misleading or unreliable sources, especially in more realistic long-context, multi-turn, and multi-modal settings.
The pipeline has four stages: (1) Setup → (2) Data → (3) Probe inference → (4) Analysis, plus optional SFT and standard-benchmark evaluation of base vs. fine-tuned models.
conda create -n <env> python=3.10 -y
conda activate <env>
pip install -r requirements.txt

Build the datasets used in the paper (CSQA, GSM8K-MC):

python data/prepare_datasets.py

This produces:
- Per-dataset JSONL files in data/processed_datasets/{csqa,gsm8k}_default_split/
- A manifest data/prepared_datasets.yaml mapping dataset names to file paths
Dataset and model definitions live in configs/: datasets_config.yaml
declares each dataset's HF source, split ratios, and seed (consumed by
prepare_datasets.py); models_config.yaml maps the short model names used
across the pipeline (e.g. qwen3_8b, llama3_8b_instruct) to their HF
checkpoints.
The exact splits used in the paper are already checked in under
data/processed_datasets/, so this step is only needed if you want to
regenerate from scratch.
Drives a batch of experiments end-to-end. A recipe expands to many
(dataset, model, prior_token, instruction, cot_flag) combinations via a
cartesian product; the pipeline iterates over every combination and, for
each one, runs the bare probe, generates tier sentences, then runs all 8
non-bare probe variants (upos/uneg, dpos/dneg, plus 4 double-prior
variants) and aggregates them with the bare probe into a per-question
merged_results.jsonl.
experiments/build_experiments.py          # recipe → flat config
        │
        ▼
experiments/generated/<recipe>_config.yaml
        │
        ▼
runner/run_full_pipeline.py               # top-level driver
        │  for each experiment in the config:
        │
        ├── (1) experiments/run_experiments.py --probe-variant bare
        │       └── experiments/run_single_probe.py
        │           └── core/compute_probs_single_variant.py        (HF models)
        │               or core/compute_answers_and_probs_openai.py (OpenAI)
        │               [reasoning models first call core/generate_with_vllm.py]
        │       then build canonical_wrong.jsonl from bare results
        │
        ├── (2) core/dataset_tier_generator.py
        │       ├── T1: deterministic templates (local)
        │       └── T2: GPT-4o paraphrases (via core/llm_client.py)
        │
        └── (3) experiments/run_experiments.py      # 8 non-bare probes
                └── experiments/run_batch_probes_efficient.py
                    └── core/compute_probs_multi_variant.py  (single model load)
                then experiments/build_merged_results.py for aggregation
        │
        ▼
results/<experiment_name>/
        ├── probs_<variant>.jsonl × 9
        ├── canonical_wrong.jsonl
        └── merged_results.jsonl          # consumed by sft/extract_sft_data_v5.py
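
The recipe expansion described above is a plain cartesian product over the recipe fields. The following is a minimal sketch of that idea, not the code in experiments/build_experiments.py: the field names come from the tuple listed above, while the dict layout, the example values for `prior_token` and `instruction`, and the `expand_recipe` helper are assumptions made for illustration.

```python
from itertools import product

# Hypothetical recipe. Only the field names are taken from the README;
# the values and the dict-of-lists layout are illustrative assumptions.
recipe = {
    "dataset": ["csqa", "gsm8k"],
    "model": ["qwen3_8b", "llama3_8b_instruct"],
    "prior_token": ["default"],
    "instruction": ["plain"],
    "cot_flag": [False, True],
}


def expand_recipe(recipe):
    """Return one flat experiment dict per combination of field values."""
    fields = list(recipe)
    value_lists = [recipe[field] for field in fields]
    return [dict(zip(fields, combo)) for combo in product(*value_lists)]


experiments = expand_recipe(recipe)
print(len(experiments))  # 2 * 2 * 1 * 1 * 2 = 8 combinations
print(experiments[0])
```

Each resulting dict corresponds to one experiment that the driver would then take through the bare probe, tier generation, and the non-bare probes.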
# 1. Compose the experiment list from a recipe (cartesian product of fields).
python experiments/build_experiments.py \
--recipes experiments/recipes/exp1.yaml \
--out experiments/generated/exp1_config.yaml
# 2. Drive the full 3-step pipeline (bare → tier → all probes) for every experiment.
python runner/run_full_pipeline.py --config experiments/generated/exp1_config.yaml

Turns each experiment's merged_results.jsonl into per-experiment metric
JSONs (logistic regression, choice-level ratios, distribution statistics),
then rolls everything up across experiments into summary CSVs.
results/<experiment_name>/merged_results.jsonl     # produced by the probe pipeline
        │
        ▼
analyze/compute_all_metrics.py                     # orchestrator (one experiment at a time)
        ├── analyze/logistic_regression_analysis.py   → logistic_regression_results.json
        │                                               (+ logistic_regression_breakdown.json)
        ├── analyze/choice_metrics_analysis.py        → choice_metrics.json
        └── analyze/distribution_metrics_analysis.py  → distribution_metrics.json
        │
        ▼
results/<experiment_name>/   (now also contains the 3 per-experiment metric JSONs)
        │
        ▼
analyze/generate_core_metrics_report.py            # cross-experiment roll-up
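
To make the roll-up step concrete, here is a rough sketch of how per-experiment metric JSONs could be gathered into a single CSV. It is not the logic of analyze/generate_core_metrics_report.py: the `roll_up` helper is hypothetical, and it assumes each metric file holds a top-level JSON object whose scalar fields are worth one column each. Only the file name choice_metrics.json and the results/<experiment_name>/ layout are taken from this README.

```python
import csv
import json
from pathlib import Path


def roll_up(results_dir, metric_file, out_csv):
    """Collect scalar top-level fields from every experiment's metric JSON into one CSV."""
    rows = []
    for path in sorted(Path(results_dir).glob(f"*/{metric_file}")):
        with path.open() as f:
            metrics = json.load(f)  # assumed to be a JSON object (dict)
        row = {"experiment": path.parent.name}
        # Keep scalar values only; nested structures would need their own flattening.
        row.update(
            {k: v for k, v in metrics.items() if isinstance(v, (int, float, str, bool))}
        )
        rows.append(row)

    # Union of column names in first-seen order, so experiments with extra fields still fit.
    fieldnames = list(dict.fromkeys(key for row in rows for key in row))
    with open(out_csv, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames, restval="")
        writer.writeheader()
        writer.writerows(rows)
    return rows
```

A call such as `roll_up("results", "choice_metrics.json", "choice_summary.csv")` would produce one row per experiment directory; the output file name here is only an example.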
# 1. Compute per-experiment metrics (logistic / choice / distribution) for every
# experiment directory. Reads merged_results.jsonl, writes 3 JSONs alongside it.
python analyze/compute_all_metrics.py --config experiments/generated/exp1_config.yaml
# 2. Roll up the per-experiment JSONs into four combined CSVs.
python analyze/generate_core_metrics_report.py \
--results-dir results \
--output-dir analyze/exp_results

Composes SFT training datasets from pre-computed probe-inference results in
results/, mixing helpful and harmful source-conflict examples to teach
models to selectively accept external information. The generated
*_candidates.jsonl files serve as input to the actual LoRA fine-tuning,
which we run using
LlamaFactory.
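
As a rough illustration of what mixing helpful and harmful examples can look like, the sketch below samples a seeded, shuffled mix with a target share of harmful cases. It is not the extraction script shipped in sft/: the `mix_examples` helper, the `harmful_fraction` parameter, and the toy record fields are all hypothetical.

```python
import random


def mix_examples(helpful, harmful, n_total, harmful_fraction=0.5, seed=0):
    """Sample a shuffled mix with roughly the requested share of harmful examples."""
    rng = random.Random(seed)  # fixed seed keeps the mix reproducible
    n_harmful = min(len(harmful), round(n_total * harmful_fraction))
    n_helpful = min(len(helpful), n_total - n_harmful)
    mixed = rng.sample(harmful, n_harmful) + rng.sample(helpful, n_helpful)
    rng.shuffle(mixed)
    return mixed


# Toy records; real candidates would carry the prompt and target answer.
helpful = [{"id": i, "kind": "helpful"} for i in range(10)]
harmful = [{"id": i, "kind": "harmful"} for i in range(10)]

mixed = mix_examples(helpful, harmful, n_total=8, harmful_fraction=0.25)
print(sum(ex["kind"] == "harmful" for ex in mixed), "harmful out of", len(mixed))
```

The intent is that helpful examples teach the model to accept a correct external assertion, while harmful ones teach it to keep the correct answer despite a misleading assertion.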
The trained LoRA adapters are released on Hugging Face:
Evaluates base vs. SFT models on standard benchmarks (MMLU-Pro, Math Level 5)
to quantify how task-specific SFT on GSM8K / CSQA affects general
capabilities, i.e. whether the source-balancing fine-tuning causes
regression elsewhere. Built as a thin wrapper around
lm-evaluation-harness.
- run.sh: wrapper around lm_eval (supports vLLM and HF backends).
- analyze_mmlupro.py / analyze_mathlvl5.py: aggregate per-model results into a markdown report (overall accuracy + per-sample forgetting/robustness analysis).
- compute_forgetting.py: generic base-vs-SFT per-sample comparison.
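
The per-sample base-vs-SFT comparison can be pictured as follows. This is a minimal sketch rather than the contents of compute_forgetting.py: the `compare_per_sample` helper is hypothetical, and defining the forgetting rate as the share of base-correct samples that the SFT model gets wrong is an assumption.

```python
def compare_per_sample(base_correct, sft_correct):
    """Compare per-sample correctness of a base model and its fine-tuned version.

    Both arguments map a sample id to True (answered correctly) or False.
    Only samples present in both mappings are compared.
    """
    shared = base_correct.keys() & sft_correct.keys()
    forgotten = sorted(s for s in shared if base_correct[s] and not sft_correct[s])
    gained = sorted(s for s in shared if not base_correct[s] and sft_correct[s])
    n_base_right = sum(1 for s in shared if base_correct[s])
    return {
        "n_shared": len(shared),
        "forgotten": forgotten,  # right before SFT, wrong after
        "gained": gained,        # wrong before SFT, right after
        # Assumed definition: forgotten samples over base-correct samples.
        "forgetting_rate": len(forgotten) / n_base_right if n_base_right else 0.0,
    }


base = {"q1": True, "q2": True, "q3": False, "q4": True}
sft = {"q1": True, "q2": False, "q3": True, "q4": True}
report = compare_per_sample(base, sft)
print(report["forgotten"], report["gained"])
```

Looking at forgotten and gained samples separately matters because two models with the same overall accuracy can still differ on which questions they answer correctly.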
cd lm_harness_eval
# 1. Evaluate each model. Results go under ./results/<bench>/<model_tag>/.
./run.sh --backend vllm --gpu 0 \
--model Qwen/Qwen3-8B \
--tasks mmlu_pro --limit 100 \
--output ./results/mmlupro/qwen3_8b_base
# For merged SFT models, pass the base tokenizer:
./run.sh --backend vllm --gpu 0 \
--model /path/to/sft_merged_model \
--tokenizer Qwen/Qwen3-8B \
--tasks mmlu_pro --limit 100 \
--output ./results/mmlupro/qwen3_8b_sft_gsm8k
# 2. Aggregate into a markdown report.
python analyze_mmlupro.py --results-dir ./results/mmlupro --output mmlupro_analysis.md
python analyze_mathlvl5.py --results-dir ./results/mathlvl5 --output mathlvl5_analysis.md

The analyze scripts expect result subdirectories named
{qwen3_8b,llama3_8b}_{base,sft_gsm8k,sft_csqa}; adjust the families dict
at the top of each script for other models.
Tasks used in the paper:
| Benchmark | --tasks | Typical --limit |
|---|---|---|
| MMLU-Pro | mmlu_pro | 100 per subject |
| Math Level 5 | leaderboard_math_hard | all (~1,324) |
If you use this code or our findings in your work, please cite our paper: