STRLite trains scene text recognition models in two stages: MAE pretraining for visual representation learning, followed by autoregressive decoder fine-tuning for text generation.
Repository: https://github.com/balaboom123/STRLite
We provide installation instructions in INSTALLATION.md.
We describe how to prepare the datasets in DATASET.md.
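DATASET.md is the source of truth for the exact data layout. Since the paths below all point at lmdb stores, and many STR codebases follow the lmdb convention popularized by deep-text-recognition-benchmark (`num-samples`, `image-%09d`, `label-%09d` keys), here is a minimal reader sketch under that assumption — verify against DATASET.md before relying on it:

```python
import io

import lmdb
from PIL import Image

def read_sample(lmdb_path: str, index: int):
    """Read one (image, label) pair from an STR lmdb dataset.

    Assumes the common image-%09d / label-%09d key layout; the actual
    layout used by STRLite is documented in DATASET.md.
    """
    env = lmdb.open(lmdb_path, readonly=True, lock=False, readahead=False)
    with env.begin() as txn:
        total = int(txn.get(b"num-samples"))
        assert 1 <= index <= total, f"index out of range (1..{total})"
        img_bytes = txn.get(f"image-{index:09d}".encode())
        label = txn.get(f"label-{index:09d}".encode()).decode()
    return Image.open(io.BytesIO(img_bytes)).convert("RGB"), label
```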
- ViT-Tiny pretrained on U14M-U.

| Variants | Embedding | Depth | Heads | Parameters | Download |
|----------|-----------|-------|-------|------------|----------|
| ViT-Tiny | 192 | 12 | 12 | 6M | HuggingFace |
To pre-train the ViT backbone on your own dataset, see §3.1 MAE Pretraining.
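For orientation, the configuration in the table maps onto timm's `VisionTransformer` roughly as follows. The input geometry and patch size here are assumptions (typical STR values), not STRLite's confirmed settings, and STRLite's own model builder may differ:

```python
from timm.models.vision_transformer import VisionTransformer

# Dimensions taken from the table above; img_size/patch_size are assumed.
vit_tiny = VisionTransformer(
    img_size=(32, 128),   # common STR input shape, not confirmed for STRLite
    patch_size=(4, 8),    # assumption
    embed_dim=192, depth=12, num_heads=12,
    num_classes=0,        # backbone only, no classification head
)
print(f"{sum(p.numel() for p in vit_tiny.parameters()) / 1e6:.1f}M params")
```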
- STRLite fine-tuned on U14M-L-Filtered.

| Variants | Acc on Common Benchmarks | Acc on U14M Benchmarks | Download |
|----------|--------------------------|------------------------|----------|
| STRLite | 93.82 | 81.03 | HuggingFace |
To fine-tune or evaluate the model, see §3.2 Fine-tuning and §3.3 Evaluation.
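To inspect a released checkpoint before wiring it into training, something like the following works. The repo id and filename below are placeholders — substitute the real ones from the HuggingFace link in the table:

```python
import torch
from huggingface_hub import hf_hub_download

# Placeholder repo id / filename — take the real ones from the table above.
ckpt = hf_hub_download(repo_id="<hf-user>/STRLite", filename="strlite.pth")
state = torch.load(ckpt, map_location="cpu")
print(type(state), list(state)[:10])  # inspect keys before load_state_dict
```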
Results: STRLite accuracy (%) with and without MAE pretraining, reported on the six common STR benchmarks and on the U14M benchmarks.
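"Accuracy" here is the standard STR word accuracy: the percentage of images whose predicted string exactly matches the ground truth. A minimal sketch — the exact normalization (case folding, punctuation filtering) varies by benchmark and is an assumption here:

```python
def word_accuracy(preds: list[str], gts: list[str]) -> float:
    """Exact-match word accuracy in percent (case-insensitive here)."""
    hits = sum(p.lower() == g.lower() for p, g in zip(preds, gts))
    return 100.0 * hits / len(gts)
```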
The end-to-end workflow is: pretrain a MAE encoder, fine-tune with an autoregressive decoder, then evaluate a checkpoint on validation or test benchmarks.
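Conceptually, the two stages look like the sketch below: MAE-style random masking of patch tokens for pretraining, then greedy autoregressive decoding from encoder features at inference. All names, shapes, and the `decoder(tokens, memory)` signature are illustrative assumptions, not STRLite's actual modules:

```python
import torch

def random_masking(patches: torch.Tensor, mask_ratio: float = 0.75):
    """Stage 1 (MAE): keep a random subset of patch tokens; the encoder
    sees only the kept tokens and a decoder reconstructs the rest."""
    b, n, d = patches.shape
    n_keep = int(n * (1 - mask_ratio))
    keep = torch.rand(b, n).argsort(dim=1)[:, :n_keep]  # random indices per image
    return torch.gather(patches, 1, keep.unsqueeze(-1).expand(-1, -1, d))

@torch.no_grad()
def greedy_decode(decoder, memory, bos_id=1, max_len=25):
    """Stage 2: generate text tokens one at a time, conditioned on the
    visual features ("memory") produced by the fine-tuned encoder."""
    tokens = torch.full((memory.size(0), 1), bos_id, dtype=torch.long)
    for _ in range(max_len):
        logits = decoder(tokens, memory)                 # (B, T, vocab)
        next_tok = logits[:, -1].argmax(-1, keepdim=True)
        tokens = torch.cat([tokens, next_tok], dim=1)
    return tokens
```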
Single-GPU pretraining:

```bash
python main_pretrain.py data_path='[/path/to/lmdb_pretrain]'
```

Distributed example:

```bash
torchrun --nproc_per_node=8 main_pretrain.py \
    data_path='[/path/to/lmdb_pretrain]'
```

Fine-tuning from a pretrained MAE checkpoint:

```bash
python main_finetune.py \
    train_data_path='[/path/to/lmdb_train]' \
    val_data_path='[/path/to/lmdb_val]' \
    pretrained_mae=/path/to/pretrain_checkpoint.pth
```

Eval via the fine-tune script (evaluates `val_data_path`):

```bash
python main_finetune.py \
    train_data_path='[/path/to/lmdb_train]' \
    val_data_path='[/path/to/lmdb_val]' \
    resume=/path/to/finetune_checkpoint.pth \
    eval=true
```

Standalone eval (recommended for benchmark reporting):

```bash
python eval.py \
    resume=/path/to/finetune_checkpoint.pth \
    test_data_path='[/path/to/lmdb_test]'
```
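The `key=value` arguments above (including the bracketed list syntax for data paths) look like OmegaConf/Hydra-style CLI overrides. A sketch of how such arguments are typically merged into a config, assuming OmegaConf — check main_pretrain.py for the repo's real mechanism:

```python
import sys

from omegaconf import OmegaConf

# Stand-in defaults for the project's real config file.
base = OmegaConf.create({"data_path": None, "eval": False})
overrides = OmegaConf.from_cli(sys.argv[1:])  # parses key=value pairs
cfg = OmegaConf.merge(base, overrides)
print(OmegaConf.to_yaml(cfg))
```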