Exploring how language models converge on token sequences — and what adversarial pressure reveals about their training.
```shell
uv run --with mlx-lm mlx_example.py -l 20 -m sequential    # Solo: 2 iterations
uv run --with mlx-lm mlx_example.py -l 20 -m adversarial   # Battle: never converges
```

| Mode | Solo | Adversarial |
|---|---|---|
| Sequential (greedy L→R) | 2 iterations | Never converges (oscillates) |
| Single-token + blocking | ~2.5n iterations | ~3n iterations |
A single model has coherent internal preferences and quickly finds a stable state. Two different models have incompatible preferences and fight forever.
The most interesting finding: adversarial pressure reveals what models are made of.
| Model | Solo Mode | Adversarial Mode |
|---|---|---|
| Llama 3B | Varied: `\n`, `,`, `{}` braces | Collapsed: `!` (100%) |
| Qwen 1.5B | Structured: `\n\n`, `#` headers | Collapsed: spaces, Chinese chars |
1. Adversarial pressure collapses diversity
In solo mode, models use context-appropriate tokens. Under adversarial pressure, they fall back to whatever they're most confident about regardless of context.
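One way to quantify that collapse is the normalized Shannon entropy of each model's insertion distribution (1.0 = maximally varied, 0.0 = a single token). A minimal sketch, not part of the script itself:

```python
import math
from collections import Counter

def insertion_diversity(insertions: list[str]) -> float:
    """Shannon entropy of the insertion distribution, normalized to [0, 1]."""
    counts = Counter(insertions)
    if len(counts) <= 1:
        return 0.0                       # a single repeated token: fully collapsed
    total = len(insertions)
    h = -sum((n / total) * math.log2(n / total) for n in counts.values())
    return h / math.log2(len(counts))    # divide by max entropy for this many tokens

insertion_diversity(["!"] * 43)              # fully collapsed -> 0.0
insertion_diversity(["\n", ",", "{", "}"])   # fully varied -> 1.0
```

Llama's adversarial fingerprint (43× `!`) scores 0.0 by this measure; its varied solo behavior scores near 1.0.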
2. The `!` finding
Llama never uses `!` in solo mode. But under adversarial pressure, it's the only thing it uses. This isn't Llama's "favorite" token — it's Llama's most defensible token. The one it can justify in the widest range of contexts.
3. Chinese tokens only under pressure
Qwen's Chinese tokens (输入 "input", 错误 "error", 背景 "background", 的 possessive "of") don't appear in solo mode. They emerge when Qwen is actively contested — a fallback to vocabulary that Llama can't contest effectively.
4. Training strata revealed
Solo evaluation shows surface behavior. Adversarial pressure reveals composition — the training data that shows through when everything else is stripped away.
| Mode | Description |
|---|---|
| `sequential` | Greedy left-to-right, recompute after each token |
| `single` | One token per iteration + oscillation blocking |
| `batch` | Replace all mismatches at once |
| `adversarial` | Two models, full sequential pass each |
| `adversarial-single` | Two models, one token each + blocking + fingerprinting |
```shell
# Solo modes
python mlx_example.py -l LENGTH -m single|batch|sequential

# Adversarial modes (with fingerprinting)
python mlx_example.py -l LENGTH -m adversarial|adversarial-single \
    --model-a mlx-community/Qwen2.5-1.5B-Instruct-4bit \
    --model-b mlx-community/Llama-3.2-3B-Instruct-4bit

# Options
-l, --length      Number of random tokens (default: 10)
-n, --iterations  Max iterations (default: 100)
-m, --mode        Convergence mode
--model-a         First model
--model-b         Second model (adversarial modes)
```

- Start: Generate random token sequence
- Each iteration: Find highest-entropy position where model's prediction differs from actual
- Replace: Swap actual token with model's argmax prediction
- Track: Record what tokens each model tries to insert (fingerprinting)
- Block: Positions that flip 3+ times get blocked to prevent infinite oscillation
- Repeat: Until convergence or oscillation detected
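The steps above can be sketched model-agnostically. Here `predict` is a hypothetical stand-in for an mlx-lm forward pass that returns each position's (argmax token, entropy) pair — a sketch of the loop, not the script's actual implementation:

```python
from typing import Callable

def converge(tokens: list[int],
             predict: Callable[[list[int]], list[tuple[int, float]]],
             max_iters: int = 100,
             flip_limit: int = 3) -> list[int]:
    """Single-token convergence with oscillation blocking.

    `predict(tokens)` must return, for each position, a
    (argmax_token, entropy) pair — standing in for a real model pass.
    """
    flips = [0] * len(tokens)           # per-position flip counts
    for _ in range(max_iters):
        preds = predict(tokens)
        # Mismatched positions that aren't blocked yet
        candidates = [(ent, i, tok) for i, (tok, ent) in enumerate(preds)
                      if tok != tokens[i] and flips[i] < flip_limit]
        if not candidates:
            return tokens               # converged (or every mismatch is blocked)
        ent, i, tok = max(candidates)   # highest-entropy mismatch wins
        tokens[i] = tok                 # replace with the model's argmax
        flips[i] += 1                   # 3+ flips blocks the position
    return tokens
```

With a toy predictor that always wants token 0, `converge([1, 2, 3], ...)` settles to `[0, 0, 0]` in three iterations — one replacement per pass, highest-entropy position first.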
```
MODEL FINGERPRINTS (tokens each model compulsively inserts)
============================================================
Model A (Qwen) top insertions (44 total):
  27x ' '
   5x '-'
   4x '!'
   1x '的'
   1x '们'

Model B (Llama) top insertions (43 total):
  43x '!'
```
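The tally behind a report like this is a straightforward counter over each model's attempted insertions; a minimal sketch (assuming insertions are collected as decoded token strings):

```python
from collections import Counter

def fingerprint(insertions: list[str], top: int = 5) -> str:
    """Format a model's insertion counts like the report above."""
    counts = Counter(insertions)
    return "\n".join(f"{n}x {tok!r}" for tok, n in counts.most_common(top))

print(fingerprint(["!"] * 43))  # -> 43x '!'
```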
This technique works like a mass spectrometer for models:
- Apply adversarial pressure to strip away surface behavior
- What remains reveals the training composition
- Different models have different "elemental signatures"
Not what models prefer — what they're made of.