End-to-end pipeline for turning raw eye-tracking recordings of programmers reading source code into fixation × code-token AOI tables. Outputs feed the ECPG (Stage II) and EC-GazeFormer (Stage III) modeling lines of the parent PhD thesis.
┌──────────────────────────────────────────────┐
│  Raw Tobii TSV → Canonical fixation rows     │
│  (parsers / I-DT)                            │
└────────────────┬─────────────────────────────┘
                 │
┌────────────────▼─────────────────────────────┐
│  Stimulus image → OCR token AOIs             │
│  (aoi.aoi_detector)                          │
└────────────────┬─────────────────────────────┘
                 │
┌────────────────▼─────────────────────────────┐
│  Fixation hit-test → fixation × token table  │
│  (aoi.hit_test)                              │
└────────────────┬─────────────────────────────┘
                 │
                 ▼
        ECPG / EC-GazeFormer
git clone <repo-url>
cd GAZE2CodeToolkit
conda env create -f environment.yml
conda activate gaze2code
# Sanity check
python -c "from g2c.parsers import available_datasets; print(available_datasets())"
# → ['UNL_UM', 'YMU_UM']
You will need to install the Tesseract OCR binary separately. See the "System dependencies" section below.
git clone <repo-url>
cd GAZE2CodeToolkit
python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install --upgrade pip
pip install -r requirements.txt
python -c "from g2c.parsers import available_datasets; print(available_datasets())"
Only one item is not pip-installable: the Tesseract OCR binary, which
backs pytesseract. If you use Option A (conda) above, this is
already handled: environment.yml pulls Tesseract from conda-forge and
you can skip the rest of this section.
If you use Option B (pip/venv), install Tesseract through your OS package manager:
| OS | Command |
|---|---|
| Ubuntu / Debian | sudo apt-get install tesseract-ocr |
| macOS (Homebrew) | brew install tesseract |
| Windows 11 | UB-Mannheim installer — https://github.com/UB-Mannheim/tesseract/wiki |
Verify the install:
tesseract --version
- Download tesseract-ocr-w64-setup-5.x.x.exe from the UB-Mannheim builds and run it. Keep the default install path (C:\Program Files\Tesseract-OCR\) and make sure the English language pack is selected.
- Add C:\Program Files\Tesseract-OCR to your system PATH (Win+R → sysdm.cpl → Advanced → Environment Variables → Path → New), then open a new terminal and confirm with tesseract --version.
- If you would rather not edit PATH, point pytesseract at the binary explicitly inside your code:
  import pytesseract
  pytesseract.pytesseract.tesseract_cmd = (
      r"C:\Program Files\Tesseract-OCR\tesseract.exe"
  )
Homebrew puts tesseract on PATH automatically — pytesseract will
find it with no further configuration. If brew itself is not
installed:
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
On Apple Silicon the binary lives at /opt/homebrew/bin/tesseract; on
Intel Macs at /usr/local/bin/tesseract. Either way, which tesseract
should return a path after brew install tesseract.
On Debian / Ubuntu the tesseract-ocr package installs both the binary
and the default English language data:
sudo apt-get update
sudo apt-get install tesseract-ocr
# Optional: extra language packs, e.g. simplified Chinese
sudo apt-get install tesseract-ocr-chi-sim
Other distros:
| Distro | Command |
|---|---|
| Fedora / RHEL | sudo dnf install tesseract |
| Arch / Manjaro | sudo pacman -S tesseract tesseract-data-eng |
| openSUSE | sudo zypper install tesseract-ocr |
The package manager places the binary at /usr/bin/tesseract, which is
already on PATH. WSL2 users follow the same Debian/Ubuntu instructions.
A Tesseract installed inside WSL2 is visible only to the Linux side, not
to Windows, so install it in both environments if you want to use
both.
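Whichever platform you are on, the check described above (the binary must be reachable on PATH) can also be done from Python before any OCR call. The following is a minimal sketch that uses only the standard library; the helper names find_tesseract and tesseract_version are illustrative and are not part of g2c.

```python
import shutil
import subprocess


def find_tesseract(binary_name="tesseract"):
    """Return the absolute path of the binary if it is on PATH, else None."""
    return shutil.which(binary_name)


def tesseract_version(binary_path):
    """Return the first line printed by `<binary> --version`."""
    result = subprocess.run(
        [binary_path, "--version"],
        capture_output=True,
        text=True,
        check=False,
    )
    # Some builds print the banner to stderr, so fall back to it.
    output = result.stdout or result.stderr
    return output.splitlines()[0] if output else ""


path = find_tesseract()
if path is None:
    print("tesseract not found on PATH; see the install steps above")
else:
    print(f"found {path}: {tesseract_version(path)}")
```

If the binary is found but pytesseract still fails, setting pytesseract.pytesseract.tesseract_cmd to the printed path is the same explicit override shown in the Windows steps.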
GAZE2CodeToolkit/
├── environment.yml # Conda env (Python 3.10 + Tesseract)
├── requirements.txt # pip-installable deps
├── README.md # ← you are here
├── g2c/ # Python package
│ ├── parsers/ # Dataset parsers (UNL_UM, YMU_UM, ...)
│ │ ├── tobii.py # Unified Tobii TSV parser
│ │ ├── datasets_config.py # Per-dataset config (paths, columns, ...)
│ │ ├── load.py # parsers.load(name, sample_size=...)
│ │ ├── UNL_UM.py / YMU_UM.py # Thin back-compat wrappers
│ │ └── ...
│ ├── fixation_classification/ # I-DT classifier
│ ├── aoi/ # OCR AOI detection, hit-test
│ ├── util/ # Helpers (lines, stimuli, exports)
│ └── visualization/ # Trial overlay, heatmap, timeline
├── cli/ # Command-line entry points (see cli/README.md)
│ ├── extract_fixations.py
│ ├── extract_aoi.py
│ ├── visualize.py
│ ├── score_expertise.py
│ └── evaluate_ocr.py
├── app.py # Streamlit web UI entry point
├── webapp/ # Streamlit tab modules and shared helpers
│ ├── components.py
│ ├── state.py
│ └── tabs/{extract,aoi,visualize,evaluate}.py
├── datasets/ # Dataset directories (see config; not in git)
├── output/ # Default location for generated CSVs / PNGs
└── *.ipynb # Legacy Jupyter notebooks (kept as research record)
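The fixation_classification/ package listed above holds the I-DT classifier. For readers unfamiliar with it, here is a minimal sketch of the textbook dispersion-threshold (I-DT) idea: grow a window of gaze samples while its dispersion stays under a threshold and emit the window as one fixation. This is not the toolkit's implementation; the function names, the (t, x, y) sample layout, and the threshold values are assumptions made for illustration.

```python
def dispersion(window):
    """Dispersion of a window of (t, x, y) samples: x-range plus y-range."""
    xs = [x for _, x, _ in window]
    ys = [y for _, _, y in window]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))


def idt_fixations(samples, max_dispersion, min_duration):
    """Group time-sorted (t, x, y) samples into fixations (textbook I-DT)."""
    fixations = []
    i, n = 0, len(samples)
    while i < n:
        # Find the smallest window starting at i that spans min_duration.
        j = i
        while j < n and samples[j][0] - samples[i][0] < min_duration:
            j += 1
        if j == n:
            break  # remaining samples cannot span min_duration
        if dispersion(samples[i:j + 1]) <= max_dispersion:
            # Extend the window while the dispersion stays under the threshold.
            while j + 1 < n and dispersion(samples[i:j + 2]) <= max_dispersion:
                j += 1
            window = samples[i:j + 1]
            fixations.append({
                "start": window[0][0],
                "duration": window[-1][0] - window[0][0],
                "x": sum(x for _, x, _ in window) / len(window),
                "y": sum(y for _, _, y in window) / len(window),
            })
            i = j + 1
        else:
            i += 1  # drop the first sample and try again
    return fixations


# Synthetic samples: two stable clusters separated by a jump.
samples = [
    (0, 100, 100), (10, 101, 99), (20, 100, 101), (30, 99, 100), (40, 100, 100),
    (50, 300, 200), (60, 300, 200), (70, 300, 200), (80, 300, 200), (90, 300, 200),
]
fixations = idt_fixations(samples, max_dispersion=10, min_duration=30)
print(len(fixations), fixations[0])
```

The two thresholds trade off against each other: a larger max_dispersion merges neighbouring fixations, while a larger min_duration discards short ones.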
from g2c.parsers import load
eye_events, samples = load("YMU_UM", sample_size=10)
eye_events.head()
# eye_tracker, experiment_id, participant_id, filename, trial_id,
# stimuli_module, stimuli_name, timestamp, duration, x0, y0, x1, y1,
# token, pupil_l, pupil_r, amplitude, peak_velocity, eye_event_type
A single unified Tobii parser handles both UNL_UM and YMU_UM. Adding
a third Tobii-export dataset is a configuration entry in
g2c/parsers/datasets_config.py; no new code is required.
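As a quick check on a freshly loaded table, the columns listed above are enough to compute per-trial fixation counts and mean durations. The sketch below works on plain dict rows so that it runs without the datasets; if eye_events is a pandas DataFrame, eye_events.to_dict("records") yields rows in this shape. The helper name summarize_fixations, the "Fixation" and "Saccade" label values, and the duration values are assumptions for illustration.

```python
from collections import defaultdict


def summarize_fixations(rows, fixation_label="Fixation"):
    """Return {trial_id: {"n_fixations": int, "mean_duration": float}}."""
    totals = defaultdict(lambda: [0, 0.0])  # trial_id -> [count, summed duration]
    for row in rows:
        if row["eye_event_type"] != fixation_label:
            continue  # skip any non-fixation events
        entry = totals[row["trial_id"]]
        entry[0] += 1
        entry[1] += row["duration"]
    return {
        trial: {"n_fixations": count, "mean_duration": total / count}
        for trial, (count, total) in totals.items()
    }


# Synthetic rows using a subset of the canonical column names.
rows = [
    {"trial_id": "introduction-Q1", "eye_event_type": "Fixation", "duration": 200},
    {"trial_id": "introduction-Q1", "eye_event_type": "Fixation", "duration": 300},
    {"trial_id": "introduction-Q1", "eye_event_type": "Saccade", "duration": 40},
    {"trial_id": "introduction-Q5", "eye_event_type": "Fixation", "duration": 150},
]
summary = summarize_fixations(rows)
print(summary)
```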
The legacy UNL_UM() / YMU_UM() functions still work for backward
compatibility:
from g2c.parsers import UNL_UM, YMU_UM
ee, sm = YMU_UM(sample_size=10)
from g2c import aoi
aoi_df = aoi.aoi_detector(
"datasets/YMU_UM/stimuli/Quiz - introduction-Q1 (localhost).png",
scale_factor=2.0, min_confidence=60, psm="6", oem="3",
)
fixation_aoi = aoi.aoi_tokens_matcher(
"output/ymu_um/aoi_tokens_structure/aoi_introduction-Q1_tokens_structure.csv",
eye_events, trial_id="introduction-Q1", radius=25,
)
from g2c import visualization
visualization.draw_trial(eye_events, samples, draw_fixation=True, r3=3, r5=1)
visualization.draw_heatmap(eye_events, contours=False, sigma_value=17,
vmin=0, vmax=100)
A browser UI wraps the CLIs into four interactive tabs (Extract, AOI Detection, Visualize, Evaluate OCR):
cd GAZE2CodeToolkit
streamlit run app.py
The app opens at http://localhost:8501. Both pre-configured datasets
and uploaded files are supported. OCR runs block the UI and show a
spinner until they finish. Generated CSVs are written to output/ exactly
as the CLIs write them, and can also be downloaded from the UI.
To get a public link (e.g. for a thesis Appendix A reference), deploy to
Streamlit Cloud by pointing it at this repo and GAZE2CodeToolkit/app.py.
Five headless CLI scripts replace the legacy .ipynb notebooks. They are
the recommended entry points for new runs.
# From inside GAZE2CodeToolkit/
# 1) Parse a dataset and dump fixations as CSVs
python -m cli.extract_fixations --dataset UNL_UM --mode by-task \
--out-dir output/unl_um/group/fixations
# 2) OCR a stimulus image and hit-test fixations against the AOIs
python -m cli.extract_aoi --dataset YMU_UM --by-task \
--trial-id introduction-Q5 \
--image-dir datasets/YMU_UM/stimuli \
--image-prefix "Quiz - " --image-suffix " (localhost).png" \
--out-dir output/ymu_um
# 3) Save visualizations as PNG
python -m cli.visualize --dataset YMU_UM \
--trial-id introduction-Q5 --experiment-id Participant52 \
--kinds trial heatmap \
--out-dir output/ymu_um/viz
# 4) Score MCQ answers into per-participant expertise totals
python -m cli.score_expertise \
--input data/ymu_um/python_mcq_answers.csv \
--output data/ymu_um/python_mcq_scores.csv
# 5) Evaluate OCR output against a ground-truth CSV
python -m cli.evaluate_ocr \
--ground-truth output/ocr_groundtruth/Q5_ground_truth04.csv \
--detected output/orc_detection/Q5_detected_tokens.csv \
--out-dir output/ocr_eval/Q5
See cli/README.md for the full list of CLI flags and additional
examples.
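Step 2 above hit-tests fixations against the OCR token AOIs. A minimal sketch of one way such a test can work follows: a fixation is assigned to the nearest token bounding box whose edge lies within a pixel radius of the fixation point. This is an illustration under stated assumptions, not the code in g2c.aoi; the TokenAOI class, the hit_test function, and the reading of radius as a pixel tolerance around the box are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class TokenAOI:
    """Axis-aligned bounding box of one OCR token, in image pixels."""
    token: str
    x0: float
    y0: float
    x1: float
    y1: float


def distance_to_box(x: float, y: float, aoi: TokenAOI) -> float:
    """Euclidean distance from a point to the box (0 if the point is inside)."""
    dx = max(aoi.x0 - x, 0.0, x - aoi.x1)
    dy = max(aoi.y0 - y, 0.0, y - aoi.y1)
    return (dx * dx + dy * dy) ** 0.5


def hit_test(x: float, y: float, aois: Sequence[TokenAOI],
             radius: float = 25.0) -> Optional[TokenAOI]:
    """Return the nearest AOI within `radius` pixels of (x, y), or None."""
    best, best_dist = None, float("inf")
    for aoi in aois:
        dist = distance_to_box(x, y, aoi)
        if dist <= radius and dist < best_dist:
            best, best_dist = aoi, dist
    return best


aois = [TokenAOI("def", 10, 10, 40, 30), TokenAOI("main", 50, 10, 100, 30)]
inside = hit_test(20, 20, aois)    # point lies inside the "def" box
near = hit_test(47, 20, aois)      # in the gap, closer to "main"
miss = hit_test(200, 200, aois)    # farther than the radius from every box
print(inside, near, miss)
```

Picking the nearest box rather than the first match matters for densely packed code tokens, where a tolerance radius can overlap several neighbours.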
Two Tobii-based datasets are wired up out of the box:
| Dataset | Hardware | Default sample size | Configured in |
|---|---|---|---|
| UNL_UM | Tobii Pro Nano (Tobii I-VT export) | 44 participants | datasets_config.DATASETS["UNL_UM"] |
| YMU_UM | Tobii Pro Nano (Tobii I-VT export) | 84 participants | datasets_config.DATASETS["YMU_UM"] |
Raw TSVs are not shipped with the toolkit. Configure paths to your
local copy in g2c/parsers/datasets_config.py:
"YMU_UM": {
"raw_dir": "datasets/YMU_UM/rawdata",
"stimuli_dir": "datasets/YMU_UM/stimuli",
...
}
For a new dataset that uses the same Tobii Pro Nano export format,
edit g2c/parsers/datasets_config.py:
DATASETS["MY_NEW_DATASET"] = {
"eye_tracker": "Tobii I-VT (Fixation)",
"raw_dir": "datasets/MyDataset/rawdata",
"stimuli_dir": "datasets/MyDataset/stimuli",
"stimuli_names": ("Task1 (localhost)", "Task2 (localhost)", ...),
"n_stimuli": 7,
"columns": TOBII_PRO_COLUMNS, # already defined at the top of the file
"participant_col": "Participant name",
"trial_split": {"strategy": "paired_markers", "per_trial": 3},
"stimuli_name_template": "{event_value}.png",
"trial_id_strategy": "first_word",
"fixation_label": "Fixation",
"default_sample_size": 30,
...
}
Then call parsers.load("MY_NEW_DATASET"). No new code is needed.
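Before the first load of a new entry, it can help to confirm that the entry has the expected keys and that its directories exist. The sketch below is a hypothetical pre-flight helper, not part of g2c: EXAMPLE_KEYS lists only the keys visible in the example above (the real config contains more, which are elided there), and check_dataset_entry is an illustrative name.

```python
from pathlib import Path

# Only the keys shown in the example entry; the real config has more.
EXAMPLE_KEYS = (
    "eye_tracker", "raw_dir", "stimuli_dir", "stimuli_names", "n_stimuli",
    "columns", "participant_col", "trial_split", "stimuli_name_template",
    "trial_id_strategy", "fixation_label", "default_sample_size",
)


def check_dataset_entry(entry, required=EXAMPLE_KEYS):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = [f"missing key: {key}" for key in required if key not in entry]
    for key in ("raw_dir", "stimuli_dir"):
        # Paths are resolved relative to the current working directory.
        if key in entry and not Path(entry[key]).is_dir():
            problems.append(f"{key} is not a directory: {entry[key]}")
    return problems


# A deliberately incomplete entry to show the report format.
demo_entry = {
    "raw_dir": ".",
    "stimuli_dir": "datasets/MyDataset/no-such-dir",
    "fixation_label": "Fixation",
}
for problem in check_dataset_entry(demo_entry):
    print(problem)
```

Run it from inside GAZE2CodeToolkit/ so that relative paths such as datasets/MyDataset/rawdata resolve the same way they do for the CLIs.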
The original five Jupyter notebooks at the repository root are kept as research-history artefacts and still run unchanged:
| Notebook | CLI replacement |
|---|---|
| g2c_fixation_extractor.ipynb | python -m cli.extract_fixations |
| g2c_aoi_extractor.ipynb | python -m cli.extract_aoi |
| g2c_visionizer.ipynb | python -m cli.visualize |
| g2c_expertise.ipynb | python -m cli.score_expertise |
| evaluate_ocr.ipynb | python -m cli.evaluate_ocr |
To launch the notebooks, install jupyter into the same environment and
run jupyter notebook.
If you use this toolkit in academic work, please cite the parent thesis
(or the associated GAZE2Code paper for the OCR-AOI methodology) — see
the top-level repository README for the full reference.
See COPYING.txt and LICENSE at the repository root.
For questions, collaboration, or reuse inquiries:
- Wudao Yang, Universiti Malaya / Yunnan Minzu University, 📧 s2137045@siswa.um.edu.my