@oskarvanderwal oskarvanderwal commented Apr 28, 2022

I can successfully run the CrowS-Pairs (multilingual) tasks for the prompts we have written.

For `python main.py --model gpt2 --device cpu --tasks crows_pairs_english` I get:

gpt2 (), limit: None, provide_description: False, num_fewshot: 0, batch_size: None
|       Task        |     Prompt      |Version|Metric|Value |   |Stderr|
|-------------------|-----------------|------:|------|-----:|---|-----:|
|crows_pairs_english|                1|      0|acc   |0.5092|±  |0.0122|
|crows_pairs_english|                2|      0|acc   |0.5069|±  |0.0122|
|crows_pairs_english|                3|      0|acc   |0.5069|±  |0.0122|
|crows_pairs_english|                4|      0|acc   |0.5194|±  |0.0122|
|crows_pairs_english|A_preference     |      0|acc   |0.4764|±  |0.0122|
|crows_pairs_english|A_stereotype_true|      0|acc   |0.4949|±  |0.0122|

There is no official implementation of the CrowS-Pairs benchmark that works for autoregressive models like GPT-2.
With another implementation of CrowS-Pairs (an older version, though), I get a bias score of 0.593501326259947 for GPT-2. That is quite a bit higher, but it doesn't say much, since the two measures are operationalized so differently.
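
To make the difference in operationalization concrete, here is a minimal sketch of one way a likelihood-based bias score could be computed for an autoregressive model: the fraction of sentence pairs in which the stereotypical sentence gets the higher log-likelihood. This is an assumption for illustration, not a description of either implementation; `bias_score` and the toy numbers are hypothetical. The prompted tasks in this PR instead report `acc` per prompt.

```python
from typing import Sequence, Tuple


def bias_score(pair_logliks: Sequence[Tuple[float, float]]) -> float:
    """Fraction of pairs where the stereotypical sentence scores higher.

    Each item is (loglik_stereotypical, loglik_anti_stereotypical).
    Hypothetical helper: not taken from any CrowS-Pairs codebase.
    """
    if not pair_logliks:
        raise ValueError("need at least one sentence pair")
    preferred = sum(1 for stereo, anti in pair_logliks if stereo > anti)
    return preferred / len(pair_logliks)


# Made-up log-likelihoods, for illustration only (not model output).
toy_pairs = [(-42.1, -43.0), (-30.5, -29.9), (-55.2, -56.8), (-18.4, -18.9)]
score = bias_score(toy_pairs)
print(score)  # 0.75: three of the four toy pairs favour the stereotypical sentence
```

Under this definition, 0.5 would mean no systematic preference, which is why a likelihood-based score and a prompt-based accuracy are not directly comparable even when both sit near 0.5.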

For `python main.py --model gpt2 --device cpu --tasks crows_pairs_french` I get:

|       Task       |       Prompt       |Version|Metric|Value |   |Stderr|
|------------------|--------------------|------:|------|-----:|---|-----:|
|crows_pairs_french|A_preference_fr     |      0|acc   |0.4997|±  |0.0122|
|crows_pairs_french|A_reality_check_fr  |      0|acc   |0.5134|±  |0.0122|
|crows_pairs_french|A_stereotype_true_fr|      0|acc   |0.5224|±  |0.0122|
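
As a rough sanity check on the tables above, the sketch below expresses each accuracy as a distance from 0.5 in units of its reported standard error. It assumes that 0.5 is the relevant chance level and that `Stderr` is the standard error of the mean accuracy; both are assumptions, and `z_from_chance` is a hypothetical helper.

```python
# (task, prompt, acc, stderr) copied from the result tables above.
results = [
    ("crows_pairs_english", "1", 0.5092, 0.0122),
    ("crows_pairs_english", "2", 0.5069, 0.0122),
    ("crows_pairs_english", "3", 0.5069, 0.0122),
    ("crows_pairs_english", "4", 0.5194, 0.0122),
    ("crows_pairs_english", "A_preference", 0.4764, 0.0122),
    ("crows_pairs_english", "A_stereotype_true", 0.4949, 0.0122),
    ("crows_pairs_french", "A_preference_fr", 0.4997, 0.0122),
    ("crows_pairs_french", "A_reality_check_fr", 0.5134, 0.0122),
    ("crows_pairs_french", "A_stereotype_true_fr", 0.5224, 0.0122),
]

CHANCE = 0.5  # assumed chance level, not stated in the PR


def z_from_chance(acc: float, stderr: float, chance: float = CHANCE) -> float:
    """Distance of an accuracy from the chance level, in standard errors."""
    return (acc - chance) / stderr


z_scores = [z_from_chance(acc, se) for _, _, acc, se in results]
for (task, prompt, acc, _), z in zip(results, z_scores):
    print(f"{task:<20} {prompt:<21} acc={acc:.4f} z={z:+.2f}")
```

Under these assumptions, every prompt lands within two standard errors of 0.5.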

@StellaAthena
Collaborator

Great work, thanks for the PR

@StellaAthena StellaAthena merged commit 22155f7 into bigscience-workshop:master Apr 28, 2022