CreatmanCEO · Jun 4, 2026
diff --git a/docs/plans/2026-06-04-hebrew-translator-docx-inplace-design.md b/docs/plans/2026-06-04-hebrew-translator-docx-inplace-design.md
@@ -0,0 +1,57 @@
+# Hebrew Translator — DOCX in-place pixel-faithful (design)
+
+**Date:** 2026-06-04
+**Status:** Approved (design-lock)
+**Builds on:** Phase 1 (segment-aligned pipeline + viewer, live). This is the first pixel-faithful **file output** sub-phase. PDF positional overlay is the next sub-phase after this.
+
+## Goal
+For DOCX input, produce a downloadable translated `.docx` that preserves the original layout, images, tables, and paragraph styles by editing the original `word/document.xml` **in place** (paragraph-level), instead of regenerating a bare document.
+
+## Decisions (locked)
+| Topic | Decision |
+|-------|----------|
+| First format | DOCX in-place (PDF overlay = next sub-phase) |
+| Write-back granularity | **Paragraph-level**: the translated paragraph goes into the first `w:t` element; the paragraph's remaining `w:t` elements are blanked |
+| Translation passes | **One** — XML-paragraph extraction feeds both the viewer and the in-place writer |
+| Libs | `jszip` (zip r/w) + `@xmldom/xmldom` (XML DOM) |
+
+## Architecture
+
+### Extraction (`docx` input → blocks + retained XML)
+- `jszip` opens the DOCX; read `word/document.xml`; parse to DOM (`@xmldom/xmldom`).
+- Walk the body `w:p` paragraphs in document order, including those inside table cells (`w:tc`). `pIndex` is a paragraph's position among all `w:p` elements, empty ones included, so write-back can locate the same element again. For each non-empty paragraph, the block content is the concatenation of its `w:t` texts. Produce blocks `{ id, type:'paragraph', pIndex, content }` → `buildSegments` → `TranslationDocument` (sentences) as in Phase 1.
+- Retain the parsed DOM + JSZip instance for write-back.
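A minimal sketch of the block-building step, assuming the `w:t` texts of each paragraph have already been pulled out of the DOM into plain arrays. `toBlocks` and the `p<index>` id format are hypothetical names, not part of the plan; the point it illustrates is that `pIndex` counts every `w:p`, including the ones that are skipped.

```javascript
// Hypothetical helper: paragraphRuns[i] holds the w:t texts of the i-th w:p element.
function toBlocks(paragraphRuns) {
  const blocks = [];
  paragraphRuns.forEach((runs, pIndex) => {
    const content = runs.join('');
    // Empty and image-only paragraphs produce no block but still consume an index.
    if (content.trim() === '') return;
    blocks.push({ id: `p${pIndex}`, type: 'paragraph', pIndex, content });
  });
  return blocks;
}

// Three paragraphs, the middle one empty: two blocks come out.
const blocks = toBlocks([['שלום ', 'עולם'], [], ['Second para']]);
console.log(blocks.map(b => b.pIndex));
```

Because the indices are not renumbered after skipping, the second block keeps `pIndex` 2 and write-back can address the original element directly.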
+
+### Translation
+- Same batch-aligned translator (Groq primary, Claude fallback). One pass. Per block, the paragraph translation = the joined sentence targets.
+
+### Write-back (in-place, paragraph-level)
+- For each block: locate `w:p[pIndex]`; set the FIRST `w:t` text = NFC(target); set remaining `w:t` in that paragraph to empty. Preserve paragraph props, first-run props, images, tables, drawings — everything else untouched.
+- Serialize DOM → overwrite the `word/document.xml` entry in the zip → emit `.docx`.
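The paragraph-level rule above can be sketched on plain strings, without a DOM. `writeParagraph` is a hypothetical name and the array-of-texts model is an assumption made for illustration; the real writer operates on `w:t` elements.

```javascript
// Hypothetical model: runTexts are the w:t texts of one paragraph, in document order.
function writeParagraph(runTexts, target) {
  if (runTexts.length === 0) return runTexts;   // no w:t: nothing to write into
  const first = target.normalize('NFC');        // deterministic composed form
  // The first text carries the whole translation; the rest are emptied, not removed.
  return [first, ...runTexts.slice(1).map(() => '')];
}

// 'e' followed by U+0301 (combining acute) composes to U+00E9 under NFC.
const decomposed = 'Cafe\u0301 menu';
const result = writeParagraph(['שלום ', 'עולם', '!'], decomposed);
console.log(result.length, JSON.stringify(result[0]));
```

Keeping the emptied texts in place, rather than deleting their runs, is what leaves run properties and the surrounding structure untouched.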
+
+### Integration
+- Worker: `.docx` input → XML extractor + in-place writer (downloadable file). `.pdf` input → current path until the PDF-overlay sub-phase.
+- Viewer unchanged (same `TranslationDocument`).
+
+## Error handling
+- Paragraph with no runs / image-only → skipped (left as-is).
+- Missing translation for a block (graceful) → leave the original text (do not blank).
+- Unparseable / unexpectedly complex docx → **fall back to the existing flat `generateDOCX`** so the download always works.
+- `xml:space="preserve"` respected; NFC normalize.
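One way to picture the "download always works" rule is a small wrapper around the two writers. This is a sketch under assumptions: `withFlatFallback` is a hypothetical name, and `inPlace` / `flat` stand in for the in-place writer and the existing flat `generateDOCX`, whose real signatures are not shown here.

```javascript
// Hypothetical wrapper: try the in-place path, fall back to the flat path on any error.
async function withFlatFallback(inPlace, flat, log = console.warn) {
  try {
    return { output: await inPlace(), usedInplace: true };
  } catch (err) {
    log('DOCX in-place failed, falling back to flat:', err.message);
    return { output: await flat(), usedInplace: false };
  }
}

// Demo with stand-in functions: the in-place path fails, the flat path succeeds.
const demo = withFlatFallback(
  async () => { throw new Error('unparseable document.xml'); },
  async () => 'flat.docx',
  () => {}                       // silence the log in the demo
);
demo.then(r => console.log(r.usedInplace, r.output));
```

An error in the flat path itself still propagates, so the wrapper only guarantees a download as far as the flat generator does.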
+
+## Guardrails (design-guardrails-audit) — 🔴 = acceptance criteria
+1. 🔴 **Zip-bomb on read** — run existing `assertDocxSafe` (uncompressed-size cap) before JSZip extraction on the in-place path.
+2. 🔴 **XXE / entity expansion** — parsing untrusted `document.xml` must not resolve external entities; reject `DOCTYPE`/DTD (`@xmldom/xmldom` does not resolve external entities by default — verify + guard against DOCTYPE / billion-laughs).
+3. 🔴 **Output integrity** — minimal edits (only `w:t` text); re-parse the serialized XML to validate; **fall back to flat `generateDOCX` on any error** (never emit a corrupt file / never fail the job).
+4. 🟡 Headers/footers/footnotes (separate XML parts) NOT translated in v1 — explicit log/note, not silent.
+5. 🟡 Memory/size bounded by the existing file-size cap.
+6. 🟡 Cost — single translation pass (no double LLM calls).
+7. 🟢 Deterministic NFC normalization.
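Guardrail 2 can be sketched as a string-level check that runs before the XML ever reaches a parser. `assertNoDoctype` is a hypothetical helper name; the extra `<!ENTITY` test is an added assumption, since entity definitions already require a DTD.

```javascript
// Hypothetical helper: refuse any document that declares a DOCTYPE or an entity.
function assertNoDoctype(xml) {
  // Entity definitions need a DTD, so refusing DOCTYPE also rules out
  // billion-laughs style expansion. The ENTITY check is belt and braces.
  if (/<!DOCTYPE/i.test(xml) || /<!ENTITY/i.test(xml)) {
    throw new Error('DOCTYPE not allowed (XXE)');
  }
  return xml;
}

const clean = '<?xml version="1.0"?><w:document/>';
const hostile = '<?xml version="1.0"?><!DOCTYPE x [<!ENTITY a "aaaa">]><w:document/>';

let rejected = false;
try { assertNoDoctype(hostile); } catch (e) { rejected = true; }
console.log(assertNoDoctype(clean) === clean, rejected);
```

The check is deliberately conservative: it would also reject a `<!DOCTYPE` string inside a comment or CDATA section, which is an acceptable cost when the caller falls back to the flat path.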
+
+## Testing
+- Unit: paragraph extractor (generate a fixture `.docx` via the `docx` dep → extract → assert paragraphs + `pIndex`); write-back (inject translations → re-parse → first run replaced, others blank, structure intact); fallback on malformed input.
+- Integration: round-trip a generated `.docx` → translate (mock) → output re-parses and contains the translated text.
+
+## Scope (YAGNI)
+**In:** DOCX body paragraphs (incl. table cells) in-place, paragraph-level, fallback to flat on error, one translation pass.
+**Out:** headers/footers/footnotes, intra-paragraph run formatting, PDF overlay (next sub-phase).
diff --git a/docs/plans/2026-06-04-hebrew-translator-docx-inplace-implementation.md b/docs/plans/2026-06-04-hebrew-translator-docx-inplace-implementation.md
@@ -0,0 +1,191 @@
+# DOCX in-place pixel-faithful — Implementation Plan
+
+> **For Claude:** REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.
+
+**Goal:** For DOCX uploads, produce a translated `.docx` that preserves layout/images/tables by editing `word/document.xml` in place (paragraph-level), instead of regenerating a bare doc.
+
+**Architecture:** Unzip DOCX (`jszip`), parse `word/document.xml` (`@xmldom/xmldom`), extract body `w:p` paragraphs (incl. table cells) as blocks → existing segment translator (one pass) → write each paragraph's translation into its first `w:t` run (blank the rest) → repackage. Falls back to the existing flat `generateDOCX` on any error.
+
+**Tech Stack:** Node, jszip, @xmldom/xmldom, existing pipeline (buildSegments/buildTranslationDocument), vitest. `docx` dep used to generate test fixtures.
+
+**Design:** `docs/plans/2026-06-04-hebrew-translator-docx-inplace-design.md` (🔴 guardrails = acceptance criteria). **Branch:** `feat/docx-inplace`. **Test:** `npm test -- <path>`.
+
+---
+
+### Task 1: Add deps
+
+```bash
+npm install jszip @xmldom/xmldom --legacy-peer-deps
+git add package.json package-lock.json
+git commit -m "build: add jszip + @xmldom/xmldom for DOCX in-place"
+```
+Verify: `node -e "require('jszip'); require('@xmldom/xmldom'); console.log('ok')"`.
+
+---
+
+### Task 2: `extractParagraphs` (DOCX → blocks, XXE-guarded)
+
+**Files:** Create `server/services/docxInplace.js` + `server/services/__tests__/docxInplace.extract.test.js`.
+
+**Step 1 — failing test** (build a real fixture with the `docx` dep):
+```javascript
+import { it, expect } from 'vitest';
+import docx from 'docx';
+import { extractParagraphs } from '../docxInplace.js';
+
+async function makeDocx(paras) {
+  const d = new docx.Document({ sections: [{ children: paras.map(t =>
+    new docx.Paragraph({ children: [ new docx.TextRun(t) ] })) }] });
+  return docx.Packer.toBuffer(d);
+}
+
+it('extracts non-empty paragraphs with pIndex and content', async () => {
+  const buf = await makeDocx(['שלום עולם', '', 'Second para']);
+  const { paragraphs, documentXml } = await extractParagraphs(buf);
+  expect(paragraphs.length).toBe(2);                 // empty one skipped
+  expect(paragraphs[0].content).toBe('שלום עולם');
+  expect(typeof paragraphs[0].pIndex).toBe('number');
+  expect(documentXml).toContain('w:p');
+});
+
+it('rejects DOCTYPE (XXE guard)', async () => {
+  // craft a zip whose document.xml has a DOCTYPE
+  const JSZip = (await import('jszip')).default;
+  const zip = new JSZip();
+  zip.file('word/document.xml', '<?xml version="1.0"?><!DOCTYPE x><w:document/>');
+  const buf = await zip.generateAsync({ type: 'nodebuffer' });
+  await expect(extractParagraphs(buf)).rejects.toThrow(/DOCTYPE|entity/i);
+});
+```
+Note: `docx` may only expose a namespace under ESM. If `import docx from 'docx'` is undefined, use `import * as docx from 'docx'` (seen earlier in this repo).
+
+**Step 2:** `npm test -- server/services/__tests__/docxInplace.extract` → FAIL.
+
+**Step 3 — implement** in `server/services/docxInplace.js`:
+- `const JSZip = require('jszip'); const { DOMParser, XMLSerializer } = require('@xmldom/xmldom');`
+- `async function extractParagraphs(buffer)`:
+  - `const zip = await JSZip.loadAsync(buffer);`
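+  - `const entry = zip.file('word/document.xml'); if (!entry) throw new Error('word/document.xml missing'); const documentXml = await entry.async('string');` (`zip.file(name)` returns `null` for a missing entry, so fail with a clear message instead of a `TypeError`).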
+  - **XXE guard:** `if (/<!DOCTYPE/i.test(documentXml)) throw new Error('DOCTYPE not allowed (XXE)');`
+  - `const dom = new DOMParser().parseFromString(documentXml, 'text/xml');`
+  - `const ps = Array.from(dom.getElementsByTagName('w:p'));`
+  - For each `p` at index `pIndex`: text = concat of `Array.from(p.getElementsByTagName('w:t')).map(t => t.textContent).join('')`. If `text.trim()` non-empty → push `{ pIndex, content: text }`.
+  - return `{ paragraphs, zip, documentXml }`.
+- `module.exports = { extractParagraphs };` (add writeBack in Task 3).
+
+**Step 4:** PASS (2 tests). **Step 5:** commit `feat: DOCX XML paragraph extractor (XXE-guarded)`.
+
+---
+
+### Task 3: `writeBack` (paragraph-level in-place)
+
+**Files:** Modify `server/services/docxInplace.js`; create `server/services/__tests__/docxInplace.writeback.test.js`.
+
+**Step 1 — failing test:**
+```javascript
+import { it, expect } from 'vitest';
+import docx from 'docx';
+import { extractParagraphs, writeBack } from '../docxInplace.js';
+
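+async function makeDocx(paras) { return docx.Packer.toBuffer(new docx.Document({ sections: [{ children: paras.map(t => new docx.Paragraph({ children: [new docx.TextRun(t)] })) }] })); } // same helper as Task 2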
+
+it('writes translations into first run, blanks others, round-trips', async () => {
+  const buf = await makeDocx(['שלום עולם', 'Keep me']);
+  const { paragraphs, zip, documentXml } = await extractParagraphs(buf);
+  const mapping = {}; mapping[paragraphs[0].pIndex] = 'Hello world';
+  const out = await writeBack(zip, documentXml, mapping);
+  // re-extract from the output to verify
+  const re = await extractParagraphs(out);
+  const texts = re.paragraphs.map(p => p.content);
+  expect(texts).toContain('Hello world');           // translated paragraph replaced
+  expect(texts).toContain('Keep me');               // untouched paragraph preserved
+});
+
+it('throws on invalid mapping target type (caller falls back)', async () => {
+  const buf = await makeDocx(['a']);
+  const { paragraphs, zip, documentXml } = await extractParagraphs(buf);
+  const m = {}; m[paragraphs[0].pIndex] = { not: 'a string' };
+  await expect(writeBack(zip, documentXml, m)).rejects.toThrow();
+});
+```
+
+**Step 2:** FAIL.
+
+**Step 3 — implement** `async function writeBack(zip, documentXml, mapping)`:
+- parse fresh DOM from `documentXml`.
+- `const ps = Array.from(dom.getElementsByTagName('w:p'));`
+- for each `pIndexStr` in mapping: `const text = mapping[pIndexStr];` if `typeof text !== 'string'` throw; `const p = ps[Number(pIndexStr)];` if !p continue; `const ts = Array.from(p.getElementsByTagName('w:t'));` if `ts.length === 0` continue; set `ts[0].textContent = text.normalize('NFC');` and set `ts[0].setAttribute('xml:space','preserve');`; for `ts.slice(1)` set `.textContent = ''`.
+- `const outXml = new XMLSerializer().serializeToString(dom);`
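+- validate: re-parse `new DOMParser().parseFromString(outXml,'text/xml')` inside `try/catch`; if the parser throws (`@xmldom/xmldom` 0.9.x is expected to raise a `ParseError` on malformed XML) or the result has no `documentElement`, throw. Do not rely on a browser-style `parsererror` element, which xmldom is not known to emit.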
+- `zip.file('word/document.xml', outXml);`
+- `return zip.generateAsync({ type: 'nodebuffer' });`
+- export `writeBack`.
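The control flow of the loop above has two distinct outcomes, throw versus skip, and it can be sketched separately from the DOM edits. `planWrites` and `runCounts` are hypothetical names introduced for illustration; `runCounts[i]` is assumed to be the number of `w:t` elements in the i-th `w:p`.

```javascript
// Hypothetical helper: decide which mapping entries will actually be written.
function planWrites(mapping, runCounts) {
  const writes = [];
  for (const [key, text] of Object.entries(mapping)) {
    // A non-string target is a caller bug: throw so the worker falls back to flat.
    if (typeof text !== 'string') {
      throw new TypeError(`mapping[${key}] must be a string`);
    }
    const pIndex = Number(key);
    // Unknown index or a paragraph without w:t: nothing to write, keep going.
    if (!Number.isInteger(pIndex) || !(runCounts[pIndex] > 0)) continue;
    writes.push({ pIndex, text });
  }
  return writes;
}

// Paragraph 1 has no w:t and paragraph 7 does not exist: only paragraph 0 is written.
const writes = planWrites(
  { 0: 'Hello world', 1: 'no runs here', 7: 'out of range' },
  [2, 0, 1]
);
console.log(writes.length);
```

Skipping silently is safe here because a skipped paragraph simply keeps its original text, whereas a wrong type would otherwise end up serialized into the document.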
+
+**Step 4:** PASS. **Step 5:** commit `feat: DOCX in-place writeBack (paragraph-level, validated)`.
+
+---
+
+### Task 4: Wire into the worker (docx branch + fallback)
+
+**Files:** Modify `server/api/translate.js`.
+
+Read the current `documentQueue.process('translate', ...)`. Add a DOCX branch BEFORE the existing flat path:
+```javascript
+const ext = path.extname(filePath).slice(1).toLowerCase();
+let doc, outputPath, usedInplace = false;
+if (ext === 'docx') {
+  try {
+    await assertDocxSafe(filePath, MAX_DOCX_UNCOMPRESSED);          // zip-bomb guard; must run before JSZip touches the file
+    const buffer = await fs.readFile(filePath);
+    const { paragraphs, zip, documentXml } = await extractParagraphs(buffer);
+    const blocks = paragraphs.map(p => ({ type: 'paragraph', content: p.content }));
+    const { blocks: docBlocks, segments } = buildSegments(blocks);
+    if (docBlocks.length !== paragraphs.length) throw new Error('paragraph/segment count mismatch');
+    doc = await buildTranslationDocument({ blocks: docBlocks, segments },
+      (chunk) => aiProvider.translateBatchAligned(chunk, sourceLang || 'he', targetLang),
+      { sourceLang: sourceLang || 'he', targetLang, maxSegments: MAX_SEGMENTS, concurrency: 2, maxPerChunk: 8, maxTokens: 1200, owner:'anon', jobId:String(job.id), ts: Date.now(),
+        onCap:(i)=>console.warn(`cap ${i.total}>${i.cap}`) });
+    const mapping = {};
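+    docBlocks.forEach((b, i) => { const t = b.sentences.map(s => s.target); if (t.length && t.every(x => typeof x === 'string' && x.trim())) mapping[paragraphs[i].pIndex] = t.join(' '); }); // missing target: no entry, original text is kept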
+    const outBuf = await writeBack(zip, documentXml, mapping);
+    outputPath = path.join(path.dirname(filePath), `translated_${crypto.randomUUID()}.docx`);
+    await fs.writeFile(outputPath, outBuf);
+    usedInplace = true;
+  } catch (e) {
+    console.warn('DOCX in-place failed, falling back to flat:', e.message);
+  }
+}
+if (!usedInplace) {
+  // existing flat path: processDocument -> buildSegments -> buildTranslationDocument -> generateTranslatedDocument
+  // (keep current code; ensure it sets `doc` and `outputPath`)
+}
+const resultToken = crypto.randomUUID();
+saveResult(resultToken, doc);
+await job.progress(100);
+return { filename: path.basename(outputPath), resultToken, success: true };
+```
+- Add imports: `const { extractParagraphs, writeBack } = require('../services/docxInplace');`. Ensure `assertDocxSafe` is imported (from `../services/zipGuard`), that `fs` is the promises `fs` already in the file, and that `path` and `crypto` are in scope for the output filename.
+- Keep the existing flat path as the `if (!usedInplace)` body (refactor current middle into it). Ensure both paths produce `doc` (for saveResult) and `outputPath`.
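The mapping step is where the design rule "missing translation: leave the original text" has to be enforced, since an empty or `undefined` target would otherwise blank the paragraph. A minimal sketch, assuming the block and sentence shapes from the snippet above; `buildMapping` is a hypothetical helper name.

```javascript
// Hypothetical helper: turn translated blocks into a pIndex -> text mapping.
function buildMapping(paragraphs, docBlocks) {
  if (docBlocks.length !== paragraphs.length) {
    throw new Error('paragraph/segment count mismatch');
  }
  const mapping = {};
  docBlocks.forEach((block, i) => {
    const targets = block.sentences.map(s => s.target);
    const complete = targets.length > 0 &&
      targets.every(t => typeof t === 'string' && t.trim() !== '');
    // Incomplete translation: omit the entry so write-back leaves the original text.
    if (complete) mapping[paragraphs[i].pIndex] = targets.join(' ');
  });
  return mapping;
}

// The second block has no usable target, so only pIndex 0 gets an entry.
const mapping = buildMapping(
  [{ pIndex: 0, content: 'א. ב.' }, { pIndex: 3, content: 'ג.' }],
  [
    { sentences: [{ target: 'A.' }, { target: 'B.' }] },
    { sentences: [{ target: undefined }] },
  ]
);
console.log(Object.keys(mapping));
```

A paragraph is treated as all-or-nothing here: one missing sentence keeps the whole paragraph in the source language rather than mixing languages inside it. That is a design choice of this sketch, not something the plan specifies.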
+
+**Verify (no full unit test — needs LLM/redis):**
+- `npm test` (full) → no new failures.
+- `node -e "require('./server/api/translate.js'); console.log('loads'); process.exit(0)"`.
+- Optional: a focused integration test `server/services/__tests__/docxInplace.integration.test.js` that runs extract → buildSegments → buildTranslationDocument(fakeBatch upper-casing) → mapping → writeBack → re-extract asserts uppercased text. (Recommended; uses fake batch, no network.)
+
+**Commit:** `feat: wire DOCX in-place into worker with flat fallback`.
+
+---
+
+### Task 5: Deploy + verify
+
+- Merge `feat/docx-inplace` → main (PR). On sec: `git pull && docker compose -f docker-compose.prod.yml up -d --build app`.
+- Verify: generate a Hebrew `.docx` (multi-paragraph), upload to https://translator.creatman.site, download the result, confirm: opens in Word, paragraphs translated, layout/structure intact. Confirm a PDF upload still works (flat path unchanged). Log decision to memory.
+
+---
+
+## Acceptance criteria (design 🔴)
+- [ ] Zip-bomb: `assertDocxSafe` runs before extraction (Task 4)
+- [ ] XXE: DOCTYPE rejected (Task 2)
+- [ ] Output integrity: serialized XML re-validated; any error → flat fallback, job never fails (Tasks 3,4)
+- [ ] One translation pass (Task 4)
+- [ ] DOCX output preserves layout (paragraph-level in place); PDF path unchanged
diff --git a/package-lock.json b/package-lock.json
diff --git a/package.json b/package.json
@@ -23,6 +23,7 @@
   },
   "dependencies": {
     "@google-cloud/translate": "^7.0.0",
+    "@xmldom/xmldom": "^0.9.10",
     "bull": "^4.12.0",
     "cors": "^2.8.5",
     "docx": "^8.5.0",
@@ -35,6 +36,7 @@
     "hebrew-transliteration": "^2.0.0",
     "helmet": "^7.1.0",
     "ioredis": "^5.3.2",
+    "jszip": "^3.10.1",
     "mammoth": "^1.6.0",
     "mime-types": "^2.1.35",
     "multer": "^1.4.5-lts.1",