GS-SoCo: Global South Social Cognition Benchmark

GS-SoCo evaluates whether frontier LLMs can reason within diverse cultural frameworks, not just recall cultural facts. It targets the gap between cultural knowledge and cultural cognition: can a model predict behavior, interpret figurative language, and navigate norms as a culturally embedded social agent?

Start Here

If you are reviewing the competition submission, start with:

PUBLIC_REVIEW_GUIDE.md for the canonical public-review path and terminology
the public Kaggle benchmark: https://www.kaggle.com/benchmarks/elormyevudza/gs-soco
docs/project_state/post_submission_record_2026-04-04.md for the corrected submission-time snapshot and post-submission audit trail
README.md for the benchmark overview and current results
LITERATURE_SOURCES.md for the consolidated literature grounding used in benchmark formation
docs/project_state/submission_writeup_final_2026-04-02.md for the final competition writeup
docs/project_state/v9_private_frontier_results_2026-04-02.md for the final held-out frontier results
docs/project_state/prize_positioning_appendix_2026-04-04.md for the benchmark comparison and figure set
docs/project_state/source_traceability_appendix_2026-04-04.md and docs/project_state/answer_adjudication_appendix_2026-04-04.md for the trust and traceability materials

This repository also contains the full archived iteration history from earlier benchmark lines. The current competition artifact is the frozen v9_private holdout. v9_private is the internal variant name: the episodes were held out during development and are now committed in the repo for transparency and reproducibility. The repository was public at submission time under the name KaggleBench and now resolves canonically as GS-SoCo at https://github.com/Elormyevu/GS-SoCo. Archived planning logs and drafts are retained on purpose, but they may use superseded terminology or intermediate metrics; for judge-facing review, use the files listed above.

Why This Benchmark?

Existing cultural benchmarks (CulturalBench, GlobalOpinionQA, NORMSAGE) primarily test knowledge recall. GS-SoCo tests cultural reasoning and inference: the ability to predict social outcomes, interpret figurative language, and navigate norms as a culturally embedded agent.

Key diagnostic metric: The Western Default Rate (WDR) tracks how often models select the tagged Western-default distractor when they fail on culturally grounded rows, giving labs actionable signal beyond raw accuracy.
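The WDR definition above can be sketched in a few lines. This is a minimal illustration, not the code in benchmark.py: the `Row` fields are hypothetical names, and the denominator is assumed to be all failed cultural rows.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Row:
    """One scored cultural row (field names are hypothetical)."""
    chosen: str                     # option label the model picked
    correct: str                    # gold option label
    western_default: Optional[str]  # tagged Western-default distractor, if any


def western_default_rate(rows: list[Row]) -> Optional[float]:
    """Share of failed rows where the pick was the tagged Western-default distractor."""
    failures = [r for r in rows if r.chosen != r.correct]
    if not failures:
        return None  # undefined when the model makes no errors
    hits = sum(
        1 for r in failures
        if r.western_default is not None and r.chosen == r.western_default
    )
    return hits / len(failures)


rows = [
    Row("A", "A", "C"),  # correct, so it does not enter the denominator
    Row("C", "B", "C"),  # failure on the Western-default distractor
    Row("D", "B", "C"),  # failure on a different distractor
    Row("C", "A", "C"),  # failure on the Western-default distractor
]
wdr = western_default_rate(rows)
print(wdr)
```

Returning `None` when there are no failures keeps a perfect model from being reported with a misleading WDR of 0.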

Why GS-SoCo Is Distinct

Benchmark	Primary target	Public/frozen final eval	Multi-step reasoning	Explicit default-misread diagnostic
CulturalBench	cultural knowledge QA	public benchmark	no	no
GlobalOpinionQA	subjective opinion representation	public benchmark	no	no
NORMSAGE	culture-aware norm discovery from conversations	not a private held-out frontier benchmark	no	no
GS-SoCo	culturally grounded social inference under ambiguity	frozen final holdout disclosed in repo	yes	yes — WDR

GS-SoCo is closest to these benchmarks in spirit, but it occupies a different evaluation slot: frozen multi-step social inference rather than public knowledge recall, survey-opinion similarity, or on-the-fly norm extraction. For the full comparison and figure set, see docs/project_state/prize_positioning_appendix_2026-04-04.md.

Current Status

The competition benchmark is a frozen v9_private holdout backed by a broader v9 development line: an iterative, multi-episode framework designed to survive frontier model evaluation through strict cross-model survivor filtering. In judge-facing prose, the more precise term is frozen final holdout, not "private test set."

Public Kaggle benchmark: https://www.kaggle.com/benchmarks/elormyevudza/gs-soco

Important: the public Kaggle benchmark reruns current hosted model versions, so its live scores can drift after submission. For the frozen 2026-04-04 competition numbers, use this README, the final writeup, and the durable record in docs/project_state/post_submission_record_2026-04-04.md.

	v1 (legacy)	v9 (active)
Architecture	Single-item MCQ	Multi-step episodes (3 roles/episode)
Items	184 (172 cultural + 12 ablation)	36-row frozen final holdout, backed by a 41-instance survivor bank
Difficulty	Solved by Gemini 2.5 Flash (100%)	Curated from a 41-instance strict-survivor bank
Validation	Local baselines + audit	Cross-model Kaggle evaluation (Gemini, Claude, DeepSeek)

v9_private: Frozen Final Holdout

The production holdout (data_v9_private/) now contains 8 literature-grounded survivor episodes plus 4 ablation anchors. Frozen frontier results on this set:

Model	Overall Accuracy	Cultural Accuracy	Cultural Episode Pass	WDR	Ablation Accuracy
Gemini 2.5 Flash	`0.8611`	`0.7917`	`0.3750`	`0.8000`	`1.0000`
Claude Opus 4.6	`0.8611`	`0.7917`	`0.3750`	`1.0000`	`1.0000`
DeepSeek V3.2	`0.8333`	`0.7500`	`0.3750`	`0.5000`	`1.0000`

The hardest step is the clarifier step, not the final response choice. Across the frontier trio, clarifier accuracy is only 0.3750 for all three models, while response accuracy stays at 1.0000 and avoid-step accuracy stays perfect for Gemini and Claude (1.0000) and high for DeepSeek (0.8750). Each frontier model passes 3/8 cultural episodes and 4/4 ablation episodes. When Claude fails culturally, every failure is a Western-default failure.
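As a cross-check, the row-level figures in the table can be recomputed from the step accuracies quoted above. The sketch assumes 8 cultural episodes and 4 ablation episodes of 3 steps each (24 + 12 rows, matching the 36-row holdout) and perfect ablation accuracy, as reported. It is plain arithmetic, not the benchmark's scoring code.

```python
# Step-level accuracies quoted above, per model: (clarifier, response, avoid).
STEP_ACC = {
    "Gemini 2.5 Flash": (0.3750, 1.0000, 1.0000),
    "Claude Opus 4.6": (0.3750, 1.0000, 1.0000),
    "DeepSeek V3.2": (0.3750, 1.0000, 0.8750),
}
CULTURAL_EPISODES = 8
ABLATION_EPISODES = 4
STEPS = 3


def reconcile(step_acc):
    """Return (cultural accuracy, overall accuracy), rounded to 4 places."""
    cultural_correct = sum(acc * CULTURAL_EPISODES for acc in step_acc)
    cultural_rows = CULTURAL_EPISODES * STEPS
    ablation_rows = ABLATION_EPISODES * STEPS  # all ablation rows are correct
    cultural = cultural_correct / cultural_rows
    overall = (cultural_correct + ablation_rows) / (cultural_rows + ablation_rows)
    return round(cultural, 4), round(overall, 4)


summary = {model: reconcile(acc) for model, acc in STEP_ACC.items()}
for model, (cultural, overall) in summary.items():
    print(f"{model}: cultural={cultural:.4f} overall={overall:.4f}")
```

The recomputed values agree with the Cultural Accuracy and Overall Accuracy columns for all three models.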

The saved trio results should be read as a deliberately adversarial frontier stress test: the same three model families were used in the survivor loop and in the frozen final evaluation. In this frozen set, WDR is driven by the explicit Western-default distractor on the clarifier row of each cultural episode, so it is best read as a diagnostic of the decisive misread step rather than as a complete error taxonomy.

Episode-level frontier pressure on the 8 cultural episodes:

2 universal failures: rv9priv_cs_001 (Javanese elder-deference vs translation) and rv9priv_tom_002 (Amhara reciprocal welcome vs romantic-intimacy misread)
5 cross-model failures: failed by at least 2/3 frontier models
8/8 cultural episodes: failed by at least one frontier model
3 task families covered with Global South literature grounding intact

This frozen final holdout is now much more compelling on the benchmark's intended metric than the old 6-episode pool: the frontier trio averages 0.7778 cultural accuracy but only a 0.3750 cultural episode pass rate, meaning each model fully solves just 3 of the 8 cultural episodes.

This final holdout is not the whole benchmark-development story. It is a curated championship set drawn from 41 strict-survivor instances across the v9 development line, collapsed into 8 distinct final episodes to avoid repetitive clones while preserving literature grounding, regional spread, and cross-model difficulty. The frozen set should therefore be read as a set of family representatives curated from a strict-survivor bank, not as a claim that every final episode remains a >=2/3 failure on every later rerun. A survivor-bank audit later confirmed that the 41-instance bank already collapses to the same 8 unique strict-survivor families represented here, so further widening from existing survivors would mostly add duplicates rather than new cognitive coverage. A later time-boxed v9_batch19 expansion branch also tried six new-family candidates and still produced no new promotable strict survivors.

Task Architecture (v9)

Each v9 episode is a 3-step reasoning sequence:

Step	Role	What It Tests
1	Clarifier	Identify the missing culturally decisive distinction
2	Response	Choose the best culturally grounded interpretation or action
3	Avoid	Reject the most tempting Western-default trap

An episode passes only if all 3 steps are correct. This makes a pass by lucky guessing much less likely than on a single item and tests end-to-end cultural cognition.
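The all-or-nothing rule can be sketched as follows. The tuple layout and role labels are assumptions made for illustration and may not match the schema used in benchmark.py.

```python
from collections import defaultdict

ROLES = ("clarifier", "response", "avoid")


def episode_pass_rate(rows):
    """rows: iterable of (episode_id, role, is_correct) tuples (hypothetical layout)."""
    by_episode = defaultdict(dict)
    for episode_id, role, is_correct in rows:
        by_episode[episode_id][role] = bool(is_correct)
    if not by_episode:
        raise ValueError("no rows to score")
    # A missing step counts as a failure, so incomplete episodes never pass.
    passed = sum(
        all(steps.get(role, False) for role in ROLES)
        for steps in by_episode.values()
    )
    return passed / len(by_episode)


rows = [
    ("ep_a", "clarifier", True), ("ep_a", "response", True), ("ep_a", "avoid", True),
    ("ep_b", "clarifier", False), ("ep_b", "response", True), ("ep_b", "avoid", True),
]
rate = episode_pass_rate(rows)
print(rate)
```

In this toy input, row-level accuracy is 5/6 while only one of the two episodes passes, which is the gap the episode metric is meant to expose.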

Task Families

Task	Focus	v9_private Episodes
Code-Switching	Social function of language switches	4
Cultural Theory of Mind	Predicting mental states in cultural context	2
Proverbial Reasoning	Culturally-embedded figurative language	2
Norm Navigation	Choosing appropriate behavior (Taboo Threshold design)	0*

*All tested models have so far passed the norm navigation episodes, so none appear in v9_private. The "Taboo Threshold" design, where a protagonist must violate a minor norm to protect a sacred taboo, is an active research direction.

Metrics

Metric	What it measures
Episode Pass Rate	Fraction of 3-step episodes fully correct
Cultural Episode Pass Rate	Episode pass rate on cultural content only
Ablation Gap	Ablation − cultural accuracy = cultural reasoning penalty
Western Default Rate	Frequency of tagged Western-default errors on rows with explicit Western-default distractors
Per-region breakdown	Identifies geographic blind spots
Wilson CI	95% confidence interval for accuracy
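Two of the metrics above reduce to short formulas. The sketch below uses the standard Wilson score interval and the ablation gap as defined in the table. The function names are illustrative, not the ones in benchmark.py.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.959964) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.959964 gives ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def ablation_gap(ablation_acc: float, cultural_acc: float) -> float:
    """Ablation accuracy minus cultural accuracy: the cultural reasoning penalty."""
    return ablation_acc - cultural_acc


# Example: 19 of 24 cultural rows correct, with perfect ablation accuracy.
low, high = wilson_interval(19, 24)
gap = ablation_gap(1.0, 19 / 24)
print(f"95% CI: [{low:.4f}, {high:.4f}]  gap: {gap:.4f}")
```

For 19 correct out of 24, the interval spans roughly 0.60 to 0.91, so differences of a row or two between models on a holdout this small should be read with caution.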

Anti-Gaming Properties

Episode structure prevents lucky single-item guesses (must get all 3 steps correct)
Answer positions balanced within each batch (deterministic seed)
Distractor verbosity balanced against correct answers (longest-option heuristic near chance)
Canary strings embedded for contamination detection
Ablation controls establish a pure-reasoning baseline without cultural content
Cross-model survivor filtering: only episodes failed by ≥2 of the 3 frontier models are promoted
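The survivor-filtering rule in the last item can be sketched as a simple threshold over per-model outcomes. The result layout, model keys, and candidate IDs are hypothetical; the actual promotion pipeline may apply further checks.

```python
def promote_survivors(results, min_failures=2):
    """Keep episodes failed by at least `min_failures` models.

    results: {episode_id: {model_name: passed_bool}} (hypothetical layout).
    """
    survivors = []
    for episode_id, by_model in results.items():
        failures = sum(1 for passed in by_model.values() if not passed)
        if failures >= min_failures:
            survivors.append(episode_id)
    return survivors


results = {
    "candidate_1": {"gemini": False, "claude": False, "deepseek": True},
    "candidate_2": {"gemini": True, "claude": False, "deepseek": True},
    "candidate_3": {"gemini": False, "claude": False, "deepseek": False},
}
survivors = promote_survivors(results)
print(survivors)
```

Here `candidate_2` is dropped because only one model fails it, while the other two candidates meet the threshold.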

Quick Start

On Kaggle (recommended)

Use the public benchmark page: https://www.kaggle.com/benchmarks/elormyevudza/gs-soco
For the exact published competition snapshot, use benchmark GS-SoCo, version 1
For private author-side reruns or task development, attach gs-soco-benchmark-data to a Kaggle Benchmark Task and run the notebook
Results are saved under /kaggle/working/gs_soco_runs/<timestamp>_<model>_direct/

Locally

from benchmark import build_full_dataset, evaluate_local

# Use v9_private for the validated survivor pool
df = build_full_dataset(variant="v9_private", include_ablation=True)

def my_model(prompt: str) -> str:
    # Replace this stub with a call to your model API and return its text answer.
    raise NotImplementedError("connect your model API here")

results = evaluate_local(my_model, df)

Validation

pytest tests/ -v --tb=short
python scripts/audit_dataset.py
python scripts/run_local_baselines.py

File Structure

├── data/                          # Source data modules
│   ├── reasoning_v9_private_data.py  # 8 curated family representatives + 4 ablation anchors
│   ├── reasoning_v9_batch17_data.py  # Taboo Threshold batch
│   ├── reasoning_v9_dev_data.py      # Episode infrastructure
│   └── ablation_data.py              # Culturally-neutral baselines
├── data_v9_private/               # Materialized v9_private JSONL
│   ├── cultural_reasoning.jsonl
│   ├── ablations.jsonl
│   └── episodes.jsonl
├── benchmark.py                   # SDK harness + local evaluation + metrics
├── kaggle_notebook.py             # Kaggle evaluation script
├── dataset_audit.py               # Dataset integrity + balance checks
├── DATASHEET.md                   # Gebru et al. (2021) documentation
└── tests/
    ├── test_dataset_integrity.py
    ├── test_v9_pipeline.py
    └── test_reasoning_v9_batch17.py

Citation

@misc{gssoco2026,
  title={GS-SoCo: A Global South Social Cognition Benchmark for Evaluating Cultural Reasoning in Large Language Models},
  author={GS-SoCo Benchmark Authors},
  year={2026},
  note={Kaggle Benchmarks Platform}
}

License

CC-BY-4.0 — see LICENSE.
