GS-SoCo evaluates whether frontier LLMs can reason within diverse cultural frameworks, not just recall cultural facts. It targets the gap between cultural knowledge and cultural cognition: can a model predict behavior, interpret figurative language, and navigate norms as a culturally embedded social agent?
If you are reviewing the competition submission, start with:
- PUBLIC_REVIEW_GUIDE.md for the canonical public-review path and terminology
- the public Kaggle benchmark: https://www.kaggle.com/benchmarks/elormyevudza/gs-soco
- docs/project_state/post_submission_record_2026-04-04.md for the corrected submission-time snapshot and post-submission audit trail
- README.md for the benchmark overview and current results
- LITERATURE_SOURCES.md for the consolidated literature grounding used in benchmark formation
- docs/project_state/submission_writeup_final_2026-04-02.md for the final competition writeup
- docs/project_state/v9_private_frontier_results_2026-04-02.md for the final held-out frontier results
- docs/project_state/prize_positioning_appendix_2026-04-04.md for the benchmark comparison and figure set
- docs/project_state/source_traceability_appendix_2026-04-04.md and docs/project_state/answer_adjudication_appendix_2026-04-04.md for the trust and traceability materials
This repository also contains the full archived iteration history from earlier
benchmark lines. The current competition artifact is the frozen v9_private
holdout. v9_private is the internal variant name: the episodes were held out
during development and are now committed in the repo for transparency and
reproducibility.
The repository was public at submission time under the name KaggleBench and
now resolves canonically as GS-SoCo at
https://github.com/Elormyevu/GS-SoCo.
Archived planning logs and drafts are retained on purpose, but they may use
superseded terminology or intermediate metrics; for judge-facing review, use
the files listed above.
Existing cultural benchmarks (CulturalBench, GlobalOpinionQA, NORMSAGE) primarily test knowledge recall. GS-SoCo tests cultural reasoning and inference: the ability to predict social outcomes, interpret figurative language, and navigate norms as a culturally embedded agent.
Key diagnostic metric: The Western Default Rate (WDR) tracks how often models select the tagged Western-default distractor when they fail on culturally grounded rows, giving labs actionable signal beyond raw accuracy.
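
As a rough illustration of the WDR definition above, the sketch below computes the rate over a handful of toy rows. The field names (`is_cultural`, `chosen`, `correct`, `western_default`) and the choice of failed cultural rows as the denominator are assumptions made for this example; the repository's own implementation in benchmark.py may differ.

```python
from typing import Iterable, Mapping, Optional


def western_default_rate(rows: Iterable[Mapping[str, object]]) -> Optional[float]:
    """Share of failed cultural rows where the tagged Western-default option was chosen.

    Assumed (hypothetical) row fields: is_cultural, chosen, correct, western_default.
    Returns None when there are no failed cultural rows to measure.
    """
    failures = 0
    western_failures = 0
    for row in rows:
        # Skip ablation rows and rows answered correctly.
        if not row["is_cultural"] or row["chosen"] == row["correct"]:
            continue
        failures += 1
        tagged = row.get("western_default")
        if tagged is not None and row["chosen"] == tagged:
            western_failures += 1
    return western_failures / failures if failures else None


# Toy data, not benchmark results.
toy_rows = [
    {"is_cultural": True, "chosen": "B", "correct": "A", "western_default": "B"},
    {"is_cultural": True, "chosen": "C", "correct": "A", "western_default": "B"},
    {"is_cultural": True, "chosen": "A", "correct": "A", "western_default": "B"},
    {"is_cultural": False, "chosen": "D", "correct": "A", "western_default": None},
]
wdr = western_default_rate(toy_rows)
print(wdr)  # 0.5: one of the two failed cultural rows picked the tagged distractor
```
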
| Benchmark | Primary target | Public/frozen final eval | Multi-step reasoning | Explicit default-misread diagnostic |
|---|---|---|---|---|
| CulturalBench | cultural knowledge QA | public benchmark | no | no |
| GlobalOpinionQA | subjective opinion representation | public benchmark | no | no |
| NORMSAGE | culture-aware norm discovery from conversations | not a private held-out frontier benchmark | no | no |
| GS-SoCo | culturally grounded social inference under ambiguity | frozen final holdout disclosed in repo | yes | yes — WDR |
GS-SoCo is closest to these benchmarks in spirit, but it occupies a different
evaluation slot: frozen multi-step social inference rather than public
knowledge recall, survey-opinion similarity, or on-the-fly norm extraction.
For the full comparison and figure set, see
docs/project_state/prize_positioning_appendix_2026-04-04.md.
The competition benchmark is a frozen v9_private holdout backed by a
broader v9 development line: an iterative, multi-episode framework designed
to survive frontier model evaluation through strict cross-model survivor
filtering. In judge-facing prose, the more precise term is frozen final
holdout, not "private test set."
Public Kaggle benchmark:
https://www.kaggle.com/benchmarks/elormyevudza/gs-soco
Important: the public Kaggle benchmark reruns current hosted model versions, so
its live scores can drift after submission. For the frozen 2026-04-04
competition numbers, use this README, the final writeup, and the durable
record in docs/project_state/post_submission_record_2026-04-04.md.
| | v1 (legacy) | v9 (active) |
|---|---|---|
| Architecture | Single-item MCQ | Multi-step episodes (3 roles/episode) |
| Items | 184 (172 cultural + 12 ablation) | 36-row frozen final holdout, backed by a 41-instance survivor bank |
| Difficulty | Solved by Gemini 2.5 Flash (100%) | Curated from a 41-instance strict-survivor bank |
| Validation | Local baselines + audit | Cross-model Kaggle evaluation (Gemini, Claude, DeepSeek) |
The production holdout (data_v9_private/) now contains 8
literature-grounded survivor episodes plus 4 ablation anchors:
| Model | Overall Accuracy | Cultural Accuracy | Cultural Episode Pass | WDR | Ablation Accuracy |
|---|---|---|---|---|---|
| Gemini 2.5 Flash | 0.8611 | 0.7917 | 0.3750 | 0.8000 | 1.0000 |
| Claude Opus 4.6 | 0.8611 | 0.7917 | 0.3750 | 1.0000 | 1.0000 |
| DeepSeek V3.2 | 0.8333 | 0.7500 | 0.3750 | 0.5000 | 1.0000 |
The hardest step is the clarifier step, not the final response choice.
Across the frontier trio, clarifier accuracy is only 0.3750 for all three
models, while response accuracy stays at 1.0000 and avoid-step accuracy stays
perfect for Gemini and Claude (1.0000) and high for DeepSeek (0.8750).
Each frontier model passes 3/8 cultural episodes and 4/4 ablation episodes.
When Claude fails culturally, every failure is a Western-default failure.
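
The step-level figures above are enough to reproduce the headline accuracies in the table. The sketch below is a sanity check only: it assumes 8 cultural episodes and 4 ablation episodes of 3 steps each (24 cultural rows plus 12 ablation rows, matching the 36-row holdout) and that every ablation row is correct, as the reported ablation accuracy of 1.0000 implies. The helper name `headline` is made up for this example.

```python
CULTURAL_EPISODES = 8
ABLATION_ROWS = 12  # 4 ablation episodes x 3 steps, all correct per the table

# Per-step cultural accuracies as reported in the text above.
step_accuracy = {
    "Gemini 2.5 Flash": {"clarifier": 0.3750, "response": 1.0000, "avoid": 1.0000},
    "Claude Opus 4.6": {"clarifier": 0.3750, "response": 1.0000, "avoid": 1.0000},
    "DeepSeek V3.2": {"clarifier": 0.3750, "response": 1.0000, "avoid": 0.8750},
}


def headline(steps):
    """Return (cultural accuracy, overall accuracy), rounded to 4 decimals."""
    # Convert each per-step accuracy back into a count of correct rows.
    correct_cultural = sum(round(acc * CULTURAL_EPISODES) for acc in steps.values())
    cultural_rows = CULTURAL_EPISODES * len(steps)
    cultural_acc = correct_cultural / cultural_rows
    overall_acc = (correct_cultural + ABLATION_ROWS) / (cultural_rows + ABLATION_ROWS)
    return round(cultural_acc, 4), round(overall_acc, 4)


for model, steps in step_accuracy.items():
    print(model, headline(steps))
```

For Gemini and Claude this gives 19/24 cultural rows and 31/36 rows overall; for DeepSeek, 18/24 and 30/36.
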
The saved trio results should be read as a deliberately adversarial frontier stress test: the same three model families were used in the survivor loop and in the frozen final evaluation. In this frozen set, WDR is driven by the explicit Western-default distractor on the clarifier row of each cultural episode, so it is best read as a diagnostic of the decisive misread step rather than as a complete error taxonomy.
Episode-level frontier pressure on the 8 cultural episodes:
- 2 universal failures: rv9priv_cs_001 (Javanese elder-deference vs translation) and rv9priv_tom_002 (Amhara reciprocal welcome vs romantic-intimacy misread)
- 5 cross-model failures: failed by at least 2/3 frontier models
- 8/8 cultural episodes: failed by at least one frontier model
- 3 task families covered with Global South literature grounding intact
This frozen final holdout is now much more compelling on the benchmark's intended
metric than the old 6-episode pool: the frontier trio averages 0.7778
cultural accuracy but only 0.3750 cultural episode pass rate, meaning the
models fully solve fewer than half of the cultural episodes.
This final holdout is not the whole benchmark-development story. It is a
curated championship set drawn from 41 strict-survivor instances across the
v9 development line, collapsed into 8 distinct final episodes to avoid
repetitive clones while preserving literature grounding, regional spread, and
cross-model difficulty. The frozen set should therefore be read as a set of
family representatives curated from a strict-survivor bank, not as a claim that
every final episode remains a >=2/3 failure on every later rerun. A
survivor-bank audit later confirmed that the 41-instance bank already
collapses to the same 8 unique strict-survivor families represented here, so
further widening from existing survivors would mostly add duplicates rather than
new cognitive coverage. A later time-boxed v9_batch19 expansion branch also
tried six new-family candidates and still produced no new promotable strict
survivors.
Each v9 episode is a 3-step reasoning sequence:
| Step | Role | What It Tests |
|---|---|---|
| 1 | Clarifier | Identify the missing culturally decisive distinction |
| 2 | Response | Choose the best culturally grounded interpretation or action |
| 3 | Avoid | Reject the most tempting Western-default trap |
An episode passes only if all 3 steps are correct. This eliminates lucky guesses and tests end-to-end cultural cognition.
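
The all-or-nothing rule can be sketched as follows. The tuple layout `(episode_id, role, is_correct)` and the function name are assumptions for illustration, not the repository's data schema.

```python
from collections import defaultdict

STEP_ROLES = ("clarifier", "response", "avoid")


def episode_pass_rate(step_results):
    """step_results: iterable of (episode_id, role, is_correct) tuples (hypothetical layout)."""
    by_episode = defaultdict(dict)
    for episode_id, role, is_correct in step_results:
        by_episode[episode_id][role] = bool(is_correct)
    if not by_episode:
        return 0.0
    # An episode counts only when every one of the three roles is present and correct.
    passed = sum(
        all(steps.get(role, False) for role in STEP_ROLES)
        for steps in by_episode.values()
    )
    return passed / len(by_episode)


# Toy data: 5 of 6 steps are correct, but only 1 of 2 episodes passes.
toy = [
    ("ep1", "clarifier", True), ("ep1", "response", True), ("ep1", "avoid", True),
    ("ep2", "clarifier", False), ("ep2", "response", True), ("ep2", "avoid", True),
]
rate = episode_pass_rate(toy)
print(rate)  # 0.5
```

The toy case shows why row accuracy and episode pass rate can diverge sharply: a single missed clarifier step fails the whole episode.
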
| Task | Focus | v9_private Episodes |
|---|---|---|
| Code-Switching | Social function of language switches | 4 |
| Cultural Theory of Mind | Predicting mental states in cultural context | 2 |
| Proverbial Reasoning | Culturally-embedded figurative language | 2 |
| Norm Navigation | Choosing appropriate behavior (Taboo Threshold design) | 0* |
*All tested models have passed every norm navigation episode so far. The "Taboo Threshold" design, in which a protagonist must violate a minor norm to protect a sacred taboo, is an active research direction.
| Metric | What it measures |
|---|---|
| Episode Pass Rate | Fraction of 3-step episodes fully correct |
| Cultural Episode Pass Rate | Episode pass rate on cultural content only |
| Ablation Gap | Ablation − cultural accuracy = cultural reasoning penalty |
| Western Default Rate | Frequency of tagged Western-default errors on rows with explicit western-default distractors |
| Per-region breakdown | Identifies geographic blind spots |
| Wilson CI | 95% confidence interval for accuracy |
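
Two of the metrics above are simple enough to sketch directly. The code below uses the standard Wilson score interval and the subtraction from the Ablation Gap row; the function names are illustrative and are not taken from benchmark.py. The example counts (31 of 36 rows, 19 of 24 cultural rows, 12 of 12 ablation rows) are what the reported accuracies imply for the 36-row holdout.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 gives roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def ablation_gap(ablation_acc: float, cultural_acc: float) -> float:
    """Ablation accuracy minus cultural accuracy."""
    return ablation_acc - cultural_acc


low, high = wilson_interval(31, 36)
gap = ablation_gap(12 / 12, 19 / 24)
print(f"95% Wilson CI: [{low:.3f}, {high:.3f}]")
print(f"Ablation gap: {gap:.4f}")
```

With only 36 rows the interval is wide, which is one reason to read the episode-level metrics alongside raw accuracy.
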
- Episode structure prevents lucky single-item guesses (must get all 3 steps correct)
- Answer positions balanced within each batch (deterministic seed)
- Distractor verbosity balanced against correct answers (longest-option heuristic near chance)
- Canary strings embedded for contamination detection
- Ablation controls provide a pure-reasoning baseline without cultural content
- Cross-model survivor filtering: only episodes failed by ≥2 frontier models are promoted
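
The survivor-filtering rule in the last bullet can be sketched as a simple threshold on how many models failed each episode. The episode IDs, model labels, and function name below are placeholders invented for this example; the actual promotion pipeline may apply additional checks.

```python
def strict_survivors(failures_by_episode, min_failing_models=2):
    """Return episode IDs failed by at least `min_failing_models` models.

    failures_by_episode maps episode_id -> set of model names that failed it
    (a hypothetical structure for this sketch).
    """
    return sorted(
        episode_id
        for episode_id, failed_models in failures_by_episode.items()
        if len(failed_models) >= min_failing_models
    )


# Toy data with placeholder IDs and model labels, not benchmark results.
toy_failures = {
    "candidate_001": {"model_a", "model_b", "model_c"},  # universal failure
    "candidate_002": {"model_a", "model_c"},             # cross-model failure
    "candidate_003": {"model_b"},                        # single-model failure
    "candidate_004": set(),                              # solved by every model
}
survivors = strict_survivors(toy_failures)
print(survivors)  # ['candidate_001', 'candidate_002']
```
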
- Use the public benchmark page: https://www.kaggle.com/benchmarks/elormyevudza/gs-soco
- For the exact published competition snapshot, use benchmark GS-SoCo, version 1
- For private author-side reruns or task development, attach gs-soco-benchmark-data to a Kaggle Benchmark Task and run the notebook
- Results are saved under /kaggle/working/gs_soco_runs/<timestamp>_<model>_direct/

```python
from benchmark import build_full_dataset, evaluate_local

# Use v9_private for the validated survivor pool
df = build_full_dataset(variant="v9_private", include_ablation=True)


def my_model(prompt: str) -> str:
    # Replace this stub with a call to your model API and return its answer text.
    raise NotImplementedError("Call your model API here")


results = evaluate_local(my_model, df)
```

```bash
pytest tests/ -v --tb=short
python scripts/audit_dataset.py
python scripts/run_local_baselines.py
```

```
├── data/                              # Source data modules
│   ├── reasoning_v9_private_data.py   # 8 curated family representatives + 4 ablation anchors
│   ├── reasoning_v9_batch17_data.py   # Taboo Threshold batch
│   ├── reasoning_v9_dev_data.py       # Episode infrastructure
│   └── ablation_data.py               # Culturally-neutral baselines
├── data_v9_private/                   # Materialized v9_private JSONL
│   ├── cultural_reasoning.jsonl
│   ├── ablations.jsonl
│   └── episodes.jsonl
├── benchmark.py                       # SDK harness + local evaluation + metrics
├── kaggle_notebook.py                 # Kaggle evaluation script
├── dataset_audit.py                   # Dataset integrity + balance checks
├── DATASHEET.md                       # Gebru et al. (2021) documentation
└── tests/
    ├── test_dataset_integrity.py
    ├── test_v9_pipeline.py
    └── test_reasoning_v9_batch17.py
```

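
To inspect the materialized files under data_v9_private/ without going through the harness, a generic JSONL reader is enough. This is a minimal sketch that assumes only the standard one-JSON-object-per-line layout; it makes no assumption about the fields inside each record, and `load_jsonl` is a helper written for this example rather than a function from the repository.

```python
import json
from pathlib import Path


def load_jsonl(path):
    """Read one JSON object per non-empty line and return them as a list."""
    records = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


# Report record counts for whichever files are present in the working tree.
data_dir = Path("data_v9_private")
for name in ("cultural_reasoning.jsonl", "ablations.jsonl", "episodes.jsonl"):
    path = data_dir / name
    if path.exists():
        print(name, len(load_jsonl(path)))
```
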

```bibtex
@misc{gssoco2026,
  title={GS-SoCo: A Global South Social Cognition Benchmark for Evaluating Cultural Reasoning in Large Language Models},
  author={GS-SoCo Benchmark Authors},
  year={2026},
  note={Kaggle Benchmarks Platform}
}
```

CC-BY-4.0. See LICENSE.