
GS-SoCo: Global South Social Cognition Benchmark

GS-SoCo evaluates whether frontier LLMs can reason within diverse cultural frameworks, not just recall cultural facts. It targets the gap between cultural knowledge and cultural cognition: can a model predict behavior, interpret figurative language, and navigate norms as a culturally embedded social agent?

Start Here

If you are reviewing the competition submission, start with:

This repository also contains the full archived iteration history from earlier benchmark lines. The current competition artifact is the frozen v9_private holdout. v9_private is the internal variant name: the episodes were held out during development and are now committed in the repo for transparency and reproducibility. The repository was public at submission time under the name KaggleBench and now resolves canonically as GS-SoCo at https://github.com/Elormyevu/GS-SoCo. Archived planning logs and drafts are retained on purpose, but they may use superseded terminology or intermediate metrics; for judge-facing review, use the files listed above.

Why This Benchmark?

Existing cultural benchmarks (CulturalBench, GlobalOpinionQA, NORMSAGE) primarily test knowledge recall. GS-SoCo tests cultural reasoning and inference: the ability to predict social outcomes, interpret figurative language, and navigate norms as a culturally embedded agent.

Key diagnostic metric: The Western Default Rate (WDR) tracks how often models select the tagged Western-default distractor when they fail on culturally grounded rows, giving labs actionable signal beyond raw accuracy.
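
A minimal sketch of how WDR could be computed from per-row results. The row keys (`is_cultural`, `chosen`, `answer`, `western_default`) are hypothetical and are not taken from `benchmark.py`. The denominator used here, all failed cultural rows, is one plausible reading of the definition above.

```python
from typing import Optional

def western_default_rate(rows: list[dict]) -> Optional[float]:
    """Share of failed cultural rows where the model chose the tagged distractor.

    Each row is a dict with hypothetical keys:
      is_cultural     - False for ablation rows
      chosen, answer  - option labels such as "A".."D"
      western_default - label of the tagged distractor, or None if untagged
    """
    failures = [
        r for r in rows
        if r["is_cultural"] and r["chosen"] != r["answer"]
    ]
    if not failures:
        return None  # undefined when the model makes no cultural errors
    hits = sum(
        1 for r in failures
        if r["western_default"] is not None and r["chosen"] == r["western_default"]
    )
    return hits / len(failures)

# Toy rows for illustration only, not real benchmark data.
toy_rows = [
    {"is_cultural": True, "chosen": "B", "answer": "A", "western_default": "B"},
    {"is_cultural": True, "chosen": "C", "answer": "A", "western_default": "B"},
    {"is_cultural": True, "chosen": "A", "answer": "A", "western_default": "B"},
    {"is_cultural": False, "chosen": "D", "answer": "D", "western_default": None},
]
print(western_default_rate(toy_rows))
```

Returning `None` when there are no failures keeps a perfect model from being reported with a misleading WDR of zero.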

Why GS-SoCo Is Distinct

| Benchmark | Primary target | Public/frozen final eval | Multi-step reasoning | Explicit default-misread diagnostic |
| --- | --- | --- | --- | --- |
| CulturalBench | cultural knowledge QA | public benchmark | no | no |
| GlobalOpinionQA | subjective opinion representation | public benchmark | no | no |
| NORMSAGE | culture-aware norm discovery from conversations | not a private held-out frontier benchmark | no | no |
| GS-SoCo | culturally grounded social inference under ambiguity | frozen final holdout disclosed in repo | yes | yes (WDR) |

GS-SoCo is closest to these benchmarks in spirit, but it occupies a different evaluation slot: frozen multistep social inference rather than public knowledge recall, survey-opinion similarity, or on-the-fly norm extraction. For the full comparison and figure set, see docs/project_state/prize_positioning_appendix_2026-04-04.md.

Current Status

The competition benchmark is a frozen v9_private holdout backed by a broader v9 development line: an iterative, multi-episode framework designed to survive frontier model evaluation through strict cross-model survivor filtering. In judge-facing prose, the more precise term is frozen final holdout, not "private test set."

Public Kaggle benchmark: https://www.kaggle.com/benchmarks/elormyevudza/gs-soco

Important: the public Kaggle benchmark reruns current hosted model versions, so its live scores can drift after submission. For the frozen 2026-04-04 competition numbers, use this README, the final writeup, and the durable record in docs/project_state/post_submission_record_2026-04-04.md.

| | v1 (legacy) | v9 (active) |
| --- | --- | --- |
| Architecture | Single-item MCQ | Multi-step episodes (3 roles/episode) |
| Items | 184 (172 cultural + 12 ablation) | 36-row frozen final holdout, backed by a 41-instance survivor bank |
| Difficulty | Solved by Gemini 2.5 Flash (100%) | Curated from a 41-instance strict-survivor bank |
| Validation | Local baselines + audit | Cross-model Kaggle evaluation (Gemini, Claude, DeepSeek) |

v9_private: Frozen Final Holdout

The production holdout (data_v9_private/) now contains 8 literature-grounded survivor episodes plus 4 ablation anchors:

| Model | Overall Accuracy | Cultural Accuracy | Cultural Episode Pass | WDR | Ablation Accuracy |
| --- | --- | --- | --- | --- | --- |
| Gemini 2.5 Flash | 0.8611 | 0.7917 | 0.3750 | 0.8000 | 1.0000 |
| Claude Opus 4.6 | 0.8611 | 0.7917 | 0.3750 | 1.0000 | 1.0000 |
| DeepSeek V3.2 | 0.8333 | 0.7500 | 0.3750 | 0.5000 | 1.0000 |

The hardest step is the clarifier step, not the final response choice. Across the frontier trio, clarifier accuracy is only 0.3750 for all three models, while response accuracy stays at 1.0000 and avoid-step accuracy stays perfect for Gemini and Claude (1.0000) and high for DeepSeek (0.8750). Each frontier model passes 3/8 cultural episodes and 4/4 ablation episodes. When Claude fails culturally, every failure is a Western-default failure.
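
The per-step figures above can be cross-checked against the results table with a short calculation. This sketch assumes each of the 8 cultural episodes contributes exactly one clarifier, one response, and one avoid row (24 cultural rows), and that all 12 ablation rows are correct, as the 1.0000 ablation accuracy implies. The constant and function names are illustrative only.

```python
CULTURAL_EPISODES = 8   # assumed: one clarifier, response, and avoid row each
ABLATION_ROWS = 12      # 4 ablation anchors x 3 steps, all correct per the table

# Per-step cultural accuracies quoted in the paragraph above.
STEP_ACCURACY = {
    "Gemini 2.5 Flash": {"clarifier": 0.3750, "response": 1.0000, "avoid": 1.0000},
    "Claude Opus 4.6": {"clarifier": 0.3750, "response": 1.0000, "avoid": 1.0000},
    "DeepSeek V3.2": {"clarifier": 0.3750, "response": 1.0000, "avoid": 0.8750},
}

def row_counts(steps: dict[str, float]) -> tuple[int, int]:
    """Convert per-step accuracies into (correct, total) cultural row counts."""
    correct = sum(round(acc * CULTURAL_EPISODES) for acc in steps.values())
    total = CULTURAL_EPISODES * len(steps)
    return correct, total

for model, steps in STEP_ACCURACY.items():
    correct, total = row_counts(steps)
    cultural = correct / total
    overall = (correct + ABLATION_ROWS) / (total + ABLATION_ROWS)
    print(f"{model}: cultural={cultural:.4f} overall={overall:.4f}")
```

Under these assumptions the calculation gives 19/24 and 31/36 for Gemini and Claude and 18/24 and 30/36 for DeepSeek, which round to the cultural and overall accuracies reported in the table.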

The saved trio results should be read as a deliberately adversarial frontier stress test: the same three model families were used in the survivor loop and in the frozen final evaluation. In this frozen set, WDR is driven by the explicit Western-default distractor on the clarifier row of each cultural episode, so it is best read as a diagnostic of the decisive misread step rather than as a complete error taxonomy.

Episode-level frontier pressure on the 8 cultural episodes:

  • 2 universal failures: rv9priv_cs_001 (Javanese elder-deference vs translation) and rv9priv_tom_002 (Amhara reciprocal welcome vs romantic-intimacy misread)
  • 5 cross-model failures: failed by at least 2/3 frontier models
  • 8/8 cultural episodes: failed by at least one frontier model
  • 3 task families covered with Global South literature grounding intact
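
The episode-level buckets above can be derived from a per-model pass/fail matrix. The sketch below uses hypothetical episode IDs and model names with toy data. It treats the buckets as nested (a universal failure also counts as a cross-model failure), which appears consistent with each model failing 5 of the 8 cultural episodes.

```python
def bucket_by_failures(passed: dict[str, dict[str, bool]]) -> dict[str, list[str]]:
    """Group episode IDs by how many models failed them.

    `passed` maps episode_id -> {model_name: True if the model passed}.
    Buckets are nested: universal is a subset of cross_model, which is
    a subset of any_model.
    """
    buckets: dict[str, list[str]] = {"universal": [], "cross_model": [], "any_model": []}
    for episode_id, per_model in passed.items():
        n_failed = sum(1 for ok in per_model.values() if not ok)
        if n_failed == len(per_model):
            buckets["universal"].append(episode_id)
        if n_failed >= 2:
            buckets["cross_model"].append(episode_id)
        if n_failed >= 1:
            buckets["any_model"].append(episode_id)
    return buckets

# Toy matrix for illustration only; IDs and outcomes are made up.
toy_passed = {
    "ep_a": {"m1": False, "m2": False, "m3": False},
    "ep_b": {"m1": False, "m2": False, "m3": True},
    "ep_c": {"m1": True, "m2": True, "m3": False},
    "ep_d": {"m1": True, "m2": True, "m3": True},
}
buckets = bucket_by_failures(toy_passed)
print({name: len(ids) for name, ids in buckets.items()})
```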

This frozen final holdout is now much more compelling on the benchmark's intended metric than the old 6-episode pool: the frontier trio averages 0.7778 cultural accuracy but only 0.3750 cultural episode pass rate, meaning the models fully solve fewer than half of the cultural episodes.

This final holdout is not the whole benchmark-development story. It is a curated championship set drawn from 41 strict-survivor instances across the v9 development line, collapsed into 8 distinct final episodes to avoid repetitive clones while preserving literature grounding, regional spread, and cross-model difficulty. The frozen set should therefore be read as a set of family representatives curated from a strict-survivor bank, not as a claim that every final episode remains a >=2/3 failure on every later rerun. A survivor-bank audit later confirmed that the 41-instance bank already collapses to the same 8 unique strict-survivor families represented here, so further widening from existing survivors would mostly add duplicates rather than new cognitive coverage. A later time-boxed v9_batch19 expansion branch also tried six new-family candidates and still produced no new promotable strict survivors.

[Figure: GS-SoCo v9 survivor funnel]

[Figure: GS-SoCo final holdout frontier results]

Task Architecture (v9)

Each v9 episode is a 3-step reasoning sequence:

| Step | Role | What It Tests |
| --- | --- | --- |
| 1 | Clarifier | Identify the missing culturally decisive distinction |
| 2 | Response | Choose the best culturally grounded interpretation or action |
| 3 | Avoid | Reject the most tempting Western-default trap |

An episode passes only if all 3 steps are correct. This eliminates lucky guesses and tests end-to-end cultural cognition.
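
The all-or-nothing rule can be sketched as follows. The row keys (`episode_id`, `role`, `correct`) are hypothetical, and the number of answer options per step is left as a parameter because it is not stated here.

```python
from collections import defaultdict

STEPS = ("clarifier", "response", "avoid")

def episode_pass_rate(rows: list[dict]) -> float:
    """Fraction of episodes in which every one of the three steps is correct."""
    by_episode: dict[str, dict[str, bool]] = defaultdict(dict)
    for r in rows:
        by_episode[r["episode_id"]][r["role"]] = r["correct"]
    if not by_episode:
        raise ValueError("no rows to score")
    # A missing step counts as a failure for that episode.
    passed = [all(steps.get(role, False) for role in STEPS) for steps in by_episode.values()]
    return sum(passed) / len(passed)

def random_guess_pass_probability(n_options: int, n_steps: int = 3) -> float:
    """Chance that uniform random guessing gets every step of an episode right."""
    return (1 / n_options) ** n_steps

# Toy rows for illustration only.
toy_rows = [
    {"episode_id": "e1", "role": "clarifier", "correct": True},
    {"episode_id": "e1", "role": "response", "correct": True},
    {"episode_id": "e1", "role": "avoid", "correct": True},
    {"episode_id": "e2", "role": "clarifier", "correct": False},
    {"episode_id": "e2", "role": "response", "correct": True},
    {"episode_id": "e2", "role": "avoid", "correct": True},
]
print(episode_pass_rate(toy_rows))
print(random_guess_pass_probability(n_options=4))  # 4 options is an assumption
```

In the toy data, row-level accuracy is 5/6 while the episode pass rate is 1/2, which mirrors the gap between cultural accuracy and cultural episode pass rate reported for the frozen holdout.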

Task Families

| Task | Focus | v9_private Episodes |
| --- | --- | --- |
| Code-Switching | Social function of language switches | 4 |
| Cultural Theory of Mind | Predicting mental states in cultural context | 2 |
| Proverbial Reasoning | Culturally-embedded figurative language | 2 |
| Norm Navigation | Choosing appropriate behavior (Taboo Threshold design) | 0* |

*All tested models so far pass the norm navigation episodes. The "Taboo Threshold" design, in which a protagonist must violate a minor norm to protect a sacred taboo, is an active research direction.

Metrics

| Metric | What it measures |
| --- | --- |
| Episode Pass Rate | Fraction of 3-step episodes fully correct |
| Cultural Episode Pass Rate | Episode pass rate on cultural content only |
| Ablation Gap | Ablation − cultural accuracy = cultural reasoning penalty |
| Western Default Rate | Frequency of tagged Western-default errors on rows with explicit western-default distractors |
| Per-region breakdown | Identifies geographic blind spots |
| Wilson CI | 95% confidence interval for accuracy |
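
A minimal sketch of two of these metrics, the Wilson interval and the ablation gap, using the standard Wilson score formula. The function names are illustrative and are not the API of `benchmark.py`. The 19/24 count is inferred from the reported 0.7917 cultural accuracy on 24 cultural rows.

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return max(0.0, center - half), min(1.0, center + half)

def ablation_gap(ablation_accuracy: float, cultural_accuracy: float) -> float:
    """Ablation accuracy minus cultural accuracy."""
    return ablation_accuracy - cultural_accuracy

# 19 correct out of 24 cultural rows (inferred count, see the note above).
low, high = wilson_interval(19, 24)
print(f"cultural accuracy 95% CI: [{low:.3f}, {high:.3f}]")
print(f"ablation gap: {ablation_gap(1.0, 19 / 24):.4f}")
```

With only 24 cultural rows the interval is wide, so small accuracy differences between models on this holdout should be read with caution.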

Anti-Gaming Properties

  • Episode structure prevents lucky single-item guesses (must get all 3 steps correct)
  • Answer positions balanced within each batch (deterministic seed)
  • Distractor verbosity balanced against correct answers (longest-option heuristic near chance)
  • Canary strings embedded for contamination detection
  • Ablation controls establish a baseline for pure reasoning without cultural content
  • Cross-model survivor filtering: only episodes failed by at least 2 frontier models are promoted
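
The verbosity-balance property can be checked with a longest-option baseline: always pick the longest option and compare its accuracy with chance. This is a sketch under assumed item keys (`options`, `answer`) and toy data. It is not the check implemented in `dataset_audit.py`.

```python
def longest_option_accuracy(items: list[dict]) -> float:
    """Accuracy of a baseline that always picks the longest option text.

    Each item is a dict with hypothetical keys:
      options - mapping of option label to option text
      answer  - label of the correct option
    Ties go to the first longest option in insertion order.
    """
    if not items:
        raise ValueError("no items to score")
    hits = 0
    for item in items:
        options = item["options"]
        longest = max(options, key=lambda label: len(options[label]))
        hits += longest == item["answer"]
    return hits / len(items)

# Toy items for illustration only.
toy_items = [
    {"options": {"A": "short", "B": "a much longer option text", "C": "mid length", "D": "tiny"}, "answer": "B"},
    {"options": {"A": "the longest option in this item", "B": "brief", "C": "medium one", "D": "x"}, "answer": "C"},
    {"options": {"A": "aa", "B": "bbb", "C": "cccc cccc cccc", "D": "d"}, "answer": "A"},
    {"options": {"A": "one", "B": "two", "C": "three", "D": "four four four"}, "answer": "B"},
]
print(longest_option_accuracy(toy_items))
```

With four options per item, a value near 0.25 would indicate that option length carries little signal about the correct answer.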

Quick Start

On Kaggle (recommended)

  1. Use the public benchmark page: https://www.kaggle.com/benchmarks/elormyevudza/gs-soco
  2. For the exact published competition snapshot, use benchmark GS-SoCo, version 1
  3. For private author-side reruns or task development, attach gs-soco-benchmark-data to a Kaggle Benchmark Task and run the notebook
  4. Results are saved under /kaggle/working/gs_soco_runs/<timestamp>_<model>_direct/

Locally

from benchmark import build_full_dataset, evaluate_local

# Use v9_private for the validated survivor pool
df = build_full_dataset(variant="v9_private", include_ablation=True)

def my_model(prompt: str) -> str:
    # Placeholder answer so the example runs; replace with your model API call
    response = "A"
    return response

results = evaluate_local(my_model, df)

Validation

pytest tests/ -v --tb=short
python scripts/audit_dataset.py
python scripts/run_local_baselines.py

File Structure

├── data/                          # Source data modules
│   ├── reasoning_v9_private_data.py  # 8 curated family representatives + 4 ablation anchors
│   ├── reasoning_v9_batch17_data.py  # Taboo Threshold batch
│   ├── reasoning_v9_dev_data.py      # Episode infrastructure
│   └── ablation_data.py              # Culturally-neutral baselines
├── data_v9_private/               # Materialized v9_private JSONL
│   ├── cultural_reasoning.jsonl
│   ├── ablations.jsonl
│   └── episodes.jsonl
├── benchmark.py                   # SDK harness + local evaluation + metrics
├── kaggle_notebook.py             # Kaggle evaluation script
├── dataset_audit.py               # Dataset integrity + balance checks
├── DATASHEET.md                   # Gebru et al. (2021) documentation
└── tests/
    ├── test_dataset_integrity.py
    ├── test_v9_pipeline.py
    └── test_reasoning_v9_batch17.py

Citation

@misc{gssoco2026,
  title={GS-SoCo: A Global South Social Cognition Benchmark for Evaluating Cultural Reasoning in Large Language Models},
  author={GS-SoCo Benchmark Authors},
  year={2026},
  note={Kaggle Benchmarks Platform}
}

License

CC-BY-4.0 — see LICENSE.

About

Global South social cognition benchmark for frontier models, built as a private held-out evaluation for culturally grounded reasoning.
