Do POS tags still help NER when the backbone is already BERT?
This project studies whether explicit POS-tag information helps BERT-based Named Entity Recognition on CoNLL-2003.
We compare a standard BERT NER baseline with several POS-aware NER variants. POS tags are injected either after the encoder, before the final classifier, or inside the encoder. We also test whether POS tagging errors propagate into NER by corrupting POS tags with a realistic error distribution based on a POS tagger confusion matrix.
Can POS tags improve BERT-based NER, and do POS tagging errors meaningfully affect NER predictions?
artifacts/
dataset/
conll2003/ # Local CoNLL-2003 dataset files
model/
bert-base-cased/ # Local BERT tokenizer/config files
notebooks/
01_results_analysis.ipynb # Main notebook for analyzing saved JSON results
scripts/
01_generate_pos_oof.py
02_train_ner_baseline.py
03_train_ner_pos_encoder_independent.py
04_train_ner_pos_encoder_level.py
05_controlled_pos_corruption.py
src/
constants.py # Shared constants
data.py # Dataset loading helpers
tokenization.py # Tokenization and label/POS alignment
collators.py # Custom data collators
models.py # POS-aware NER models
metrics.py # NER/POS metrics
evaluation.py # Evaluation helpers
pos_corruption.py # POS corruption utilities
train_manual.py # Manual PyTorch training loop helpers
seed.py # Seeding utilities
results/
metrics/
ner_baseline.json
ner_gold_pos.json
ner_predicted_pos.json
pos/
confusion_matrix.json
pos_corruption/
early_fusion.json
late_fusion.json
figures/ # Generated plots
poster/ # Final poster files
requirements.txt
README.md
git clone https://github.com/<your-username>/nlp-case-study.git
cd nlp-case-studyThe scripts expect the CoNLL-2003 dataset and BERT model files to be available under artifacts/:
artifacts/
dataset/
conll2003/
model/
bert-base-cased/You can either place the files manually into these folders or download them with Hugging Face tools.
Example:
from datasets import load_dataset
from transformers import AutoTokenizer, AutoConfig
dataset = load_dataset("lhoestq/conll2003")
dataset.save_to_disk("artifacts/dataset/conll2003")
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
config = AutoConfig.from_pretrained("bert-base-cased")
tokenizer.save_pretrained("artifacts/model/bert-base-cased")
config.save_pretrained("artifacts/model/bert-base-cased")If you use a different local path, update the script arguments accordingly, for example:
--dataset_name artifacts/dataset/conll2003
--model_name artifacts/model/bert-base-casedUsing Conda Env:
conda create -n nlp-case-study python=3.10 -y
conda activate nlp-case-study
pip install --upgrade pip
pip install -r requirements.txtTo check that the environment works:
python -c "import torch, transformers, datasets; print('Environment is ready')"Run all commands from the project root:
cd /Users/chrnegor/Documents/study/nlp-case-studyscripts/01_generate_pos_oof.py
This script trains the POS tagger used to generate predicted POS features for the NER experiments. It uses k-fold out-of-fold prediction on the training split to avoid leakage: each training example receives POS predictions from a model that was not trained on that example. The script also trains a final POS tagger on the full training set and uses it to predict POS tags and POS logits for the validation and test splits.
Example:
python scripts/01_generate_pos_oof.py \
--model_name artifacts/model/bert-base-cased \
--dataset_name artifacts/dataset/conll2003 \
--output_dir ./case_study_outputs/pos_oof \
--num_folds 5 \
--num_train_epochs 3 \
--train_batch_size 16 \
--eval_batch_size 16 \
--fp16scripts/02_train_ner_baseline.py
This script trains the baseline BERT-based NER model without any POS features. It runs training for several random seeds, evaluates the model on the validation and test splits, saves the best checkpoint, and writes aggregated mean/std metrics.
Example:
python scripts/02_train_ner_baseline.py \
--model_name artifacts/model/bert-base-cased \
--dataset_name artifacts/dataset/conll2003 \
--output_dir ./case_study_outputs/ner_baseline \
--seeds 42 43 44 \
--num_train_epochs 3 \
--train_batch_size 16 \
--eval_batch_size 16 \
--eval_steps 100 \
--fp16scripts/03_train_ner_pos_encoder_independent.py
This script trains the POS-aware NER model with encoder-independent fusion. In this setup, BERT first produces contextual token representations, and POS features are added only before the final NER classifier. The POS input can come from gold POS tags or from predicted POS tags. The script supports several POS representations: one-hot vectors, POS logits, and trainable POS embeddings.
Example:
python scripts/03_train_ner_pos_encoder_independent.py \
--model_name artifacts/model/bert-base-cased \
--dataset_name artifacts/dataset/conll2003 \
--predicted_pos_dataset_path /root/case_study_outputs/dataset_with_predicted_pos \
--output_dir ./nlp_case_study_outputs/ner_with_predicted_pos_embd \
--pos_source gold \
--pos_feature_type trainable_embed \
--pos_embed_dim 32 \
--seeds 42 43 44 \
--num_train_epochs 3 \
--train_batch_size 16 \
--eval_batch_size 16scripts/04_train_ner_pos_encoder_level.py
This script trains the NER model with encoder-level POS fusion. Instead of adding POS only before the classifier, this experiment injects trainable POS embeddings into the hidden states of the BERT encoder. The injection point can be controlled with --inject_layer_offset_from_end. For example, 2 injects POS before the last two encoder layers. This tests whether deeper POS integration makes the NER model use POS information more effectively.
Example:
python scripts/04_train_ner_pos_encoder_level.py \
--model_nameartifacts/model/bert-base-cased \
--dataset_name artifacts/dataset/conll2003 \
--predicted_pos_dataset_path /root/case_study_outputs/dataset_with_predicted_pos \
--output_dir ./nlp_case_study_outputs/ner_with_mid_pos_injection \
--pos_source gold \
--pos_embed_dim 32 \
--inject_layer_offset_from_end 2 \
--seeds 42 43 44 \
--num_train_epochs 3 \
--train_batch_size 16 \
--eval_batch_size 16scripts/05_controlled_pos_corruption.py
This script runs the controlled POS corruption experiment. It loads a trained POS-aware NER checkpoint and a POS confusion matrix. Then it corrupts gold POS tags on the test set at different corruption rates. The replacement POS tags are sampled from the real error distribution of the POS tagger, rather than uniformly at random. This makes the corruption more realistic and lets us measure whether POS errors propagate into NER predictions.
Example:
python scripts/05_controlled_pos_corruption.py \
--ner_model_dir artifacts/model/ner_gold_pos_ckpt/best_checkpoint \
--pos_confusion_dir artifacts/model/pos_model/best_checkpoint \
--dataset_name /root/nlp-case-study/artifacts/dataset/conll2003 \
--fusion_type encoder_level \
--pos_feature_type trainable_embed \
--pos_embed_dim 32 \
--inject_layer_offset_from_end 2 \
--batch_size 16 \
--corruption_rates 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0This project focuses on BERT-based NER on CoNLL-2003. The findings should not be interpreted as evidence that POS tags are never useful for NER.
Explicit POS features may still help in other settings, such as:
- smaller or non-contextual encoders;
- low-resource datasets;
- noisier domains;
- languages with richer morphology;
- architectures designed to rely more strongly on syntactic features.
