We used Label Studio to label text regions and their contents. AnnotationExport is a CLI script that builds datasets for TrOCR, YOLO and CRAFT from Label Studio annotations.
- Images are taken from an S3 bucket
- Annotations from Label Studio can be read from S3, from a local .json export file, or from both
- The resulting dataset can be saved to S3, to a local folder, or to both
- Create a virtual environment (optional)
python -m venv venv
# for windows:
./venv/Scripts/activate.ps1
# for linux:
source ./venv/bin/activate
Either:
- Install package from GitHub
pip install git+https://github.com/DialecticalHTR/AnnotationExporter.git
- Create .env in the folder where you'll run the app, copy the contents of .env.template into it, and add the necessary data
or
- Clone repository
git clone https://github.com/DialecticalHTR/AnnotationExporter.git
- Install as editable package
pip install -e .
- Rename env.template to .env, add necessary data
Parameters:
- --from source_type path: the annotation source type and the path to the data. Source type can be s3 or export (a local Label Studio JSON file). You can supply multiple annotation sources.
- --to output_type path: the output type and the path where the dataset will be saved. Output type can be s3 or folder. You can have multiple outputs at the same time.
- --data model_type: the dataset to generate. Dataset type can be trocr, yolo or craft. The default dataset is TrOCR.
Example:
anno-exporter --from s3 dialectichtr-data --from export 13.json --to folder output --data yolo
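
To make the shape of the arguments concrete, here is a minimal Python sketch of how repeatable two-value flags such as --from and --to could be parsed with argparse. It is an illustration only and not the tool's actual implementation: the function names build_parser and parse_cli, the attribute names sources, outputs and dataset, and the validation messages are all assumptions.

```python
import argparse

# Allowed values as listed under "Parameters" above.
SOURCE_TYPES = {"s3", "export"}
OUTPUT_TYPES = {"s3", "folder"}
DATASET_TYPES = ["trocr", "yolo", "craft"]


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that mimics the documented flags (hypothetical sketch)."""
    parser = argparse.ArgumentParser(prog="anno-exporter")
    # action="append" with nargs=2 lets a flag be repeated; each use
    # contributes one [type, path] pair to the resulting list.
    parser.add_argument(
        "--from", dest="sources", action="append", nargs=2,
        metavar=("source_type", "path"), required=True,
    )
    parser.add_argument(
        "--to", dest="outputs", action="append", nargs=2,
        metavar=("output_type", "path"), required=True,
    )
    parser.add_argument(
        "--data", dest="dataset", choices=DATASET_TYPES, default="trocr",
    )
    return parser


def parse_cli(argv: list[str]) -> argparse.Namespace:
    """Parse argv and reject unknown source or output types."""
    parser = build_parser()
    args = parser.parse_args(argv)
    for kind, _path in args.sources:
        if kind not in SOURCE_TYPES:
            parser.error(f"unknown source type: {kind}")
    for kind, _path in args.outputs:
        if kind not in OUTPUT_TYPES:
            parser.error(f"unknown output type: {kind}")
    return args
```

Under these assumptions, passing the arguments from the example above to parse_cli yields two source pairs (one s3, one export), a single folder output, and yolo as the dataset type. Omitting --data falls back to trocr.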