A CLI tool that keeps @Types headers in CHAT corpus files (.cha) in sync with their canonical values defined in 0types.txt files.
Each directory containing .cha files can have a 0types.txt file that defines the correct @Types header for that directory. Subdirectories without their own 0types.txt inherit from the nearest ancestor that has one.
The tool walks the directory tree, reads the 0types.txt files, then for each .cha file:
- Replaces the @Types line if it differs from the canonical value
- Inserts the @Types line if the file doesn't have one
- Skips the file if the header already matches
The tool is designed to run as a git pre-commit hook in *-data repos. Run the bootstrap script once per machine:
# Build from source and install hook + binary:
./bootstrap.sh
# Or with a pre-built binary (for machines without Rust):
./bootstrap.sh --binary /path/to/update-chat-types
This does the following:
- Installs the binary to ~/.talkbank/bin/update-chat-types
- Installs the hook to ~/.talkbank/hooks/pre-commit
- Sets git config --global core.hooksPath ~/.talkbank/hooks
- Adds ~/.talkbank/bin to PATH in your shell profile
After bootstrap, every git commit in a *-data repo will automatically fix @Types headers and include the fixes in the commit. Non-data repos are unaffected (the hook exits silently).
For remote setup via SSH:
scp update-chat-types user@machine:~/
ssh user@machine 'bash -s -- --binary ~/update-chat-types' < bootstrap.sh
To install only the binary from a source checkout:
cargo install --path .
# Update all .cha files under a directory
update-chat-types --chat-dir /path/to/corpus
# Preview changes without modifying files
update-chat-types --chat-dir /path/to/corpus --dry-run
Output lists each modified file path relative to --chat-dir:
Updated 3 CHAT files:
Eng-NA/Bates/010600a.cha
Eng-NA/Bates/010600b.cha
Eng-NA/Brown/eve01.cha
The hook in hooks/pre-commit:
- Only activates in repos whose name matches *-data
- Gracefully degrades: if the binary isn't on PATH, prints a warning and allows the commit
- Auto-stages fixes: runs git add -u -- '*.cha' so @Types corrections are included in the commit
- Only blocks commits on tool errors (e.g., malformed 0types.txt)
Given this directory structure:
corpus/
├── 0types.txt          # @Types: long, toyplay, TD
├── session1.cha        # will use "long, toyplay, TD"
├── narratives/
│   ├── 0types.txt      # @Types: long, narrative, TD
│   └── story1.cha      # will use "long, narrative, TD"
└── freeplay/
    └── play1.cha       # inherits "long, toyplay, TD" from parent
Running update-chat-types --chat-dir corpus updates all .cha files to match their respective 0types.txt.
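The inheritance rule in this example can be sketched as a lookup that walks up from a file's directory until it finds a directory with a known @Types value. This is a minimal illustration, not the tool's actual code: resolve_types and the types_by_dir map are hypothetical names, and the map is assumed to have been filled from the 0types.txt files found during the directory walk.

```rust
use std::collections::HashMap;
use std::path::{Path, PathBuf};

/// Hypothetical helper (not part of the tool's public API): returns the
/// canonical @Types value for `dir`, taken from the nearest ancestor
/// (including `dir` itself) that has an entry in `types_by_dir`.
fn resolve_types<'a>(
    types_by_dir: &'a HashMap<PathBuf, String>,
    dir: &Path,
) -> Option<&'a str> {
    // `ancestors()` yields `dir` first, then its parent, and so on upward.
    dir.ancestors()
        .find_map(|ancestor| types_by_dir.get(ancestor))
        .map(String::as_str)
}
```

With entries for corpus and corpus/narratives, a lookup for corpus/freeplay falls back to the value defined in corpus, matching the tree above.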
cargo check # Type-check without building
cargo test # Run all unit + integration tests
cargo build # Debug build
cargo build --release # Optimized release build (LTO + stripped)
cargo bench # Run Criterion benchmarks
cargo insta review # Review pending snapshot changes
Run a single test:
cargo test <test_name> # e.g. cargo test test_get_types
- src/main.rs: CLI entry point using clap. Calls update_types_in_place() and prints modified file paths.
- src/lib.rs: Core library. All public functions return anyhow::Result.
- hooks/pre-commit: Git pre-commit hook (installed via core.hooksPath).
- bootstrap.sh: One-time setup script to install the binary and hook.
Public API (4 functions):
- get_types(&Path) -> Result<Option<String>>: extract the @Types header from a .cha file (streaming, stops after 30 lines or first utterance)
- read_types_file(&Path) -> Result<String>: read the @Types value from a 0types.txt file
- update_types_to_new_path(&Path, &Path, &str, bool) -> Result<bool>: update a single file's @Types header via atomic temp file write
- update_types_in_place(&Path, bool) -> Result<Vec<PathBuf>>: orchestrator that walks the directory, collects type mappings, updates all .cha files, and returns the paths of modified files
Key internal helper:
- classify_header_line(&str, &str) -> HeaderAction: pure function that classifies each header line as Replace, AlreadyOk, Splice, or Continue
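To make the four outcomes concrete, here is one way such a classifier could look. It is a sketch under stated assumptions, not the code in src/lib.rs: it assumes the second argument is the full canonical header line, and that Splice means "no @Types line was seen before the first utterance, so insert one here". The real function may define these cases differently.

```rust
/// The four outcomes named above; the variant meanings are assumptions.
#[derive(Debug, PartialEq, Eq)]
enum HeaderAction {
    Replace,   // an @Types line whose content differs from the canonical one
    AlreadyOk, // an @Types line that already matches
    Splice,    // first utterance reached: insert the canonical line before it
    Continue,  // any other header line: keep scanning
}

/// Illustrative classifier. `canonical` is assumed to be the complete
/// header line, e.g. "@Types:\tlong, toyplay, TD", without a line ending.
fn classify_header_line(line: &str, canonical: &str) -> HeaderAction {
    // Ignore the trailing line ending when comparing.
    let line = line.trim_end_matches(&['\r', '\n'][..]);
    if line.starts_with("@Types:") {
        if line == canonical {
            HeaderAction::AlreadyOk
        } else {
            HeaderAction::Replace
        }
    } else if line.starts_with('*') {
        HeaderAction::Splice
    } else {
        HeaderAction::Continue
    }
}
```

Keeping the classifier free of I/O is what makes it easy to cover with parameterized unit tests.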
This tool is designed to be fast enough for use as a pre-commit hook, even on large corpora with thousands of files.
Zero-cost unchanged files. When a file's @Types already matches the canonical value, the tool reads only the header (~14 lines), determines no change is needed, and moves on. No temp file is created, no bytes are copied. This is the common case for pre-commit hooks where most files are already correct.
Single directory walk. The entire directory tree is traversed once with WalkDir. During that single pass, the tool simultaneously builds the type inheritance map, collects 0types.txt locations, and gathers all .cha file paths. The previous implementation walked the tree twice.
Raw byte copy after header. When a file does need updating, only the header prefix (~14 lines) is parsed line-by-line. Once the @Types decision is made, the entire remainder of the file — which can be thousands of lines of transcript — is copied as raw bytes via io::copy, with no per-line UTF-8 decoding or re-encoding.
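The header-then-raw-copy idea can be sketched with only the standard library. The function name rewrite_types is hypothetical, the sketch handles only the replace case (not insertion of a missing header), and it assumes `canonical` is the full header line without a trailing newline.

```rust
use std::io::{self, BufRead, Write};

/// Illustrative only: copy header lines until the @Types line, substitute
/// the canonical line, then stream the rest of the input as raw bytes.
fn rewrite_types<R: BufRead, W: Write>(
    mut input: R,
    mut output: W,
    canonical: &str,
) -> io::Result<()> {
    let mut line = Vec::new();
    loop {
        line.clear();
        // Read one line as bytes; no UTF-8 validation happens here.
        if input.read_until(b'\n', &mut line)? == 0 {
            break; // end of input without finding an @Types line
        }
        if line.starts_with(b"@Types:") {
            output.write_all(canonical.as_bytes())?;
            output.write_all(b"\n")?;
            break; // decision made; stop line-by-line processing
        }
        output.write_all(&line)?;
    }
    // Everything after the header is copied without per-line work.
    io::copy(&mut input, &mut output)?;
    Ok(())
}
```

Because the reader and writer are generic, the same function works on in-memory buffers in tests and on buffered files in practice.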
No regex. All header matching uses Rust byte-prefix patterns ([b'@', b'T', b'y', b'p', b'e', b's', b':', ..]), avoiding the cost of compiling and executing regex automata. This also eliminates regex, regex-automata, regex-syntax, and aho-corasick as dependencies.
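As a small illustration of the byte-prefix approach, Rust slice patterns can test a line's prefix directly on bytes. The helper names below are hypothetical; only the pattern itself is taken from the description above.

```rust
/// Hypothetical helper: true if the raw line starts with "@Types:".
fn is_types_header(line: &[u8]) -> bool {
    // A slice pattern with `..` matches any line that has this prefix.
    matches!(line, [b'@', b'T', b'y', b'p', b'e', b's', b':', ..])
}

/// Hypothetical helper: true if the raw line is an utterance (starts with '*').
fn is_utterance(line: &[u8]) -> bool {
    matches!(line, [b'*', ..])
}
```

These checks compile down to a few byte comparisons and need no runtime pattern compilation.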
Atomic file writes. Modified files are written to a NamedTempFile created in the same directory as the target, then atomically renamed via persist(). This prevents partial writes and avoids cross-device rename errors.
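The write-then-rename pattern can be approximated with the standard library alone. This sketch is not the tool's implementation: the tool uses NamedTempFile and persist(), which also generate a unique temp name and clean up on failure, whereas the hypothetical write_atomically below uses a fixed ".tmp" suffix for brevity.

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::{Path, PathBuf};

/// Illustrative only: write `contents` to a sibling temp file, then rename
/// it over `target` so readers never observe a half-written file.
fn write_atomically(target: &Path, contents: &[u8]) -> io::Result<()> {
    // Place the temp file next to the target so the rename stays on one
    // filesystem (a rename across devices would fail).
    let mut tmp_name = target.as_os_str().to_owned();
    tmp_name.push(".tmp");
    let tmp_path = PathBuf::from(tmp_name);

    {
        let mut tmp = File::create(&tmp_path)?;
        tmp.write_all(contents)?;
        tmp.sync_all()?; // flush data to disk before the rename
    }
    fs::rename(&tmp_path, target)
}
```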
- Unit tests (src/lib.rs): rstest parameterized tests for classify_header_line, get_types, read_types_file
- Integration tests (tests/integration.rs): mutation tests using TempDir for filesystem isolation: replace, splice, noop, dry run, full directory walk, edge cases
- Snapshot tests (tests/snapshots/): insta snapshots for replace and splice output verification
- fixtures/*.cha (small-types.cha, big-types.cha, no-types.cha, tiny-types.cha): unit test fixtures
- fixtures/test-dir/: nested directory structure with 0types.txt and .cha files for testing directory inheritance
- The @Types header is always within the first ~30 lines of a .cha file, before any utterance lines (lines starting with *)
- 0types.txt files contain the canonical @Types: value for all .cha files in that directory (and subdirectories without their own 0types.txt)
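The first assumption is what allows header extraction to stop early. A minimal sketch of such a bounded scan follows; find_types is a hypothetical name, and returning the trimmed value after "@Types:" is an assumption about the format, not a description of get_types itself.

```rust
use std::io::{self, BufRead};

/// Illustrative only: scan at most 30 lines, stopping at the first
/// utterance, and return the value following "@Types:" if present.
fn find_types<R: BufRead>(reader: R) -> io::Result<Option<String>> {
    for line in reader.lines().take(30) {
        let line = line?;
        if line.starts_with('*') {
            break; // utterances begin; the header section is over
        }
        if let Some(value) = line.strip_prefix("@Types:") {
            return Ok(Some(value.trim().to_string()));
        }
    }
    Ok(None)
}
```

A file that violates the assumption (for example, an @Types line after the first utterance) is treated here as having no header.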