Skip to content

Crowd-AI-Lab/CodedLang

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

38 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

"Newspaper Eat" Means "Not Tasty": A Taxonomy and Benchmark for Coded Language in Real-World Chinese Online Reviews

This repository provides datasets and code for the ACL 2026 long paper on Coded Language.

"Newspaper Eat" Means "Not Tasty": A Taxonomy and Benchmark for Coded Language in Real-World Chinese Online Reviews
Ruyuan Wan, Changye Li, Ting-Hao 'Kenneth' Huang
ACL 2026

Overview

Example of creative coded language in a Google Maps restaurant review.
Top: the original user review in Chinese, as posted on Google Maps.
Left: the machine-generated English translation produced by Google Translate.
Right: the inferred underlying meaning obtained by decoding phonetic substitutions.

CodedLang Dataset

We introduce CodedLang, a dataset of 7,744 Chinese Google Maps reviews, including 900 reviews with span-level annotations of coded language.

🤗 Hugging Face Dataset: https://huggingface.co/datasets/RuyuanWan/CodedLang

Taxonomy

We define 7 categories of coded language:

  • Ambiguous Homophones
  • Non-Lexical Homophones
  • Phonetic Substitution
  • Emoji Substitution
  • Orthographic Substitution
  • Cross-Lingual Phonetic Encoding
  • Cipher

Data Sources

Our dataset is constructed based on two large-scale Google Maps review datasets:

Annotations

Each sample includes:

  • id: Index of the review
  • original_review: Original Chinese review text
  • translated_review: English translation of the original review
    • Note: 309 reviews are missing translations from the raw Google Reviews dataset and are kept as-is
  • rating: Review rating (1–5 stars)
  • char_mask_review: Character-level masked version of the review
  • span_mask_review: Span-level masked version of the review
  • decode_review: Decoded version of the review, where coded language is replaced with the intended meaning.
  • coded_lang_class: Taxonomy label(s) for coded language classes
  • code_span: Text span of coded expressions
  • coded_language: Binary label
    • 1: contains coded language
    • 0: non-coded review

Coded Language Dictionary

We provide a coded language dictionary derived from human annotations:

Dictionary Fields

Each entry includes:

  • code_span: Coded expression
  • decode: Decoded form
  • coded_lang_class: Corresponding coded language category

For phonetic-based coded language, we additionally provide:

  • mid_homophone: Intermediate homophone form (if applicable)
  • code_pinyin: Pinyin of the coded expression
  • decode_pinyin: Pinyin of the decoded expression
  • code_ipa: IPA representation of the coded expression
  • decode_ipa: IPA representation of the decoded expression

Benchmark Results

We evaluate language models on taxonomy-aware coded language classification.

Multi-label coded language classification performance across categories by DeepSeek-V3.2. The highest (lowest) scores in each column are highlighted in bold (underlined).

Results show substantial performance variation across coded language categories, cross-lingual phonetic encoding remaining particularly challenging for current language models.

For full benchmark results and analyses, please refer to the paper.


Citation

If you find this work useful for your research, please cite our paper:

@inproceedings{wan2026newspaper,
  title={"Newspaper Eat" Means "Not Tasty": A Taxonomy and Benchmark for Coded Languages in Real-World Chinese Online Reviews},
  author={Wan, Ruyuan and Li, Changye and Huang, Ting-Hao'Kenneth'},
  booktitle={The 64th Annual Meeting of the Association for Computational Linguistics},
  year={2026}
}

About

Data and code for: "Newspaper Eat" Means "Not Tasty": A Taxonomy and Benchmark for Coded Languages in Real-World Chinese Online Reviews (ACL 2026) (https://arxiv.org/abs/2601.19932)

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors