From eac29d225579bc7978ee4b44d224bb804225aaa4 Mon Sep 17 00:00:00 2001
From: Creatman <creatmanick@gmail.com>
Date: Thu, 4 Jun 2026 12:05:00 -0400
Subject: [PATCH 1/6] docs: DOCX in-place pixel-faithful design
 (paragraph-level, XML edit)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---
 ...4-hebrew-translator-docx-inplace-design.md | 57 +++++++++++++++++++
 1 file changed, 57 insertions(+)
 create mode 100644 docs/plans/2026-06-04-hebrew-translator-docx-inplace-design.md

diff --git a/docs/plans/2026-06-04-hebrew-translator-docx-inplace-design.md b/docs/plans/2026-06-04-hebrew-translator-docx-inplace-design.md
new file mode 100644
index 0000000..8821d3e
--- /dev/null
+++ b/docs/plans/2026-06-04-hebrew-translator-docx-inplace-design.md
@@ -0,0 +1,57 @@
+# Hebrew Translator — DOCX in-place pixel-faithful (design)
+
+**Date:** 2026-06-04
+**Status:** Approved (design-lock)
+**Builds on:** Phase 1 (segment-aligned pipeline + viewer, live). This is the first pixel-faithful **file output** sub-phase. PDF positional overlay is the next sub-phase after this.
+
+## Goal
+For DOCX input, produce a downloadable translated `.docx` that preserves the original layout, images, tables, and paragraph styles by editing the original `word/document.xml` **in place** (paragraph-level), instead of regenerating a bare document.
+
+## Decisions (locked)
+| Topic | Decision |
+|-------|----------|
+| First format | DOCX in-place (PDF overlay = next sub-phase) |
+| Write-back granularity | **Paragraph-level**: translated paragraph → first `w:t` text element; remaining `w:t` elements in the paragraph blanked |
+| Translation passes | **One** — XML-paragraph extraction feeds both the viewer and the in-place writer |
+| Libs | `jszip` (zip r/w) + `@xmldom/xmldom` (XML DOM) |
+
+## Architecture
+
+### Extraction (`docx` input → blocks + retained XML)
+- `jszip` opens the DOCX; read `word/document.xml`; parse to DOM (`@xmldom/xmldom`).
+- Walk body `w:p` paragraphs (including those inside table cells `w:tc`). For each non-empty paragraph at index `pIndex`, the block content = concatenation of its `w:t` run texts. Produce blocks `{ id, type:'paragraph', pIndex, content }` → `buildSegments` → `TranslationDocument` (sentences) as in Phase 1.
+- Retain the parsed DOM + JSZip instance for write-back.
+
+### Translation
+- Same batch-aligned translator (Groq primary, Claude fallback). One pass. Per block, the paragraph translation = the joined sentence targets.
+
+### Write-back (in-place, paragraph-level)
+- For each block: locate `w:p[pIndex]`; set the FIRST `w:t` text = NFC(target); set remaining `w:t` in that paragraph to empty. Preserve paragraph props, first-run props, images, tables, drawings — everything else untouched.
+- Serialize DOM → overwrite the `word/document.xml` entry in the zip → emit `.docx`.
+
+### Integration
+- Worker: `.docx` input → XML extractor + in-place writer (downloadable file). `.pdf` input → current path until the PDF-overlay sub-phase.
+- Viewer unchanged (same `TranslationDocument`).
+
+## Error handling
+- Paragraph with no runs / image-only → skipped (left as-is).
+- Missing translation for a block (graceful) → leave the original text (do not blank).
+- Unparseable / unexpectedly complex docx → **fall back to the existing flat `generateDOCX`** so the download always works.
+- `xml:space="preserve"` respected; NFC normalize.
+
+## Guardrails (design-guardrails-audit) — 🔴 = acceptance criteria
+1. 🔴 **Zip-bomb on read** — run existing `assertDocxSafe` (uncompressed-size cap) before JSZip extraction on the in-place path.
+2. 🔴 **XXE / entity expansion** — parsing untrusted `document.xml` must not resolve external entities; reject `DOCTYPE`/DTD (`@xmldom/xmldom` does not resolve external entities by default — verify + guard against DOCTYPE / billion-laughs).
+3. 🔴 **Output integrity** — minimal edits (only `w:t` text); re-parse the serialized XML to validate; **fall back to flat `generateDOCX` on any error** (never emit a corrupt file / never fail the job).
+4. 🟡 Headers/footers/footnotes (separate XML parts) NOT translated in v1 — explicit log/note, not silent.
+5. 🟡 Memory/size bounded by the existing file-size cap.
+6. 🟡 Cost — single translation pass (no double LLM calls).
+7. 🟢 Deterministic NFC normalization.
+
+## Testing
+- Unit: paragraph extractor (generate a fixture `.docx` via the `docx` dep → extract → assert paragraphs + `pIndex`); write-back (inject translations → re-parse → first run replaced, others blank, structure intact); fallback on malformed input.
+- Integration: round-trip a generated `.docx` → translate (mock) → output re-parses and contains the translated text.
+
+## Scope (YAGNI)
+**In:** DOCX body paragraphs (incl. table cells) in-place, paragraph-level, fallback to flat on error, one translation pass.
+**Out:** headers/footers/footnotes, intra-paragraph run formatting, PDF overlay (next sub-phase).

From 5727ced63a49a5e0677ad17a26f3070e1c796809 Mon Sep 17 00:00:00 2001
From: Creatman <creatmanick@gmail.com>
Date: Thu, 4 Jun 2026 12:06:54 -0400
Subject: [PATCH 2/6] docs: DOCX in-place implementation plan (5 tasks, TDD)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---
 ...-translator-docx-inplace-implementation.md | 191 ++++++++++++++++++
 1 file changed, 191 insertions(+)
 create mode 100644 docs/plans/2026-06-04-hebrew-translator-docx-inplace-implementation.md

diff --git a/docs/plans/2026-06-04-hebrew-translator-docx-inplace-implementation.md b/docs/plans/2026-06-04-hebrew-translator-docx-inplace-implementation.md
new file mode 100644
index 0000000..1eb5dd0
--- /dev/null
+++ b/docs/plans/2026-06-04-hebrew-translator-docx-inplace-implementation.md
@@ -0,0 +1,191 @@
+# DOCX in-place pixel-faithful — Implementation Plan
+
+> **For Claude:** REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.
+
+**Goal:** For DOCX uploads, produce a translated `.docx` that preserves layout/images/tables by editing `word/document.xml` in place (paragraph-level), instead of regenerating a bare doc.
+
+**Architecture:** Unzip DOCX (`jszip`), parse `word/document.xml` (`@xmldom/xmldom`), extract body `w:p` paragraphs (incl. table cells) as blocks → existing segment translator (one pass) → write each paragraph's translation into its first `w:t` element (blank the remaining `w:t` elements) → repackage. Falls back to the existing flat `generateDOCX` on any error.
+
+**Tech Stack:** Node, jszip, @xmldom/xmldom, existing pipeline (buildSegments/buildTranslationDocument), vitest. `docx` dep used to generate test fixtures.
+
+**Design:** `docs/plans/2026-06-04-hebrew-translator-docx-inplace-design.md` (🔴 guardrails = acceptance criteria). **Branch:** `feat/docx-inplace`. **Test:** `npm test -- <path>`.
+
+---
+
+### Task 1: Add deps
+
+```bash
+npm install jszip @xmldom/xmldom --legacy-peer-deps
+git add package.json package-lock.json
+git commit -m "build: add jszip + @xmldom/xmldom for DOCX in-place"
+```
+Verify: `node -e "require('jszip'); require('@xmldom/xmldom'); console.log('ok')"`.
+
+---
+
+### Task 2: `extractParagraphs` (DOCX → blocks, XXE-guarded)
+
+**Files:** Create `server/services/docxInplace.js` + `server/services/__tests__/docxInplace.extract.test.js`.
+
+**Step 1 — failing test** (build a real fixture with the `docx` dep):
+```javascript
+import { describe, it, expect } from 'vitest';
+import * as docx from 'docx'; // docx has no default export under vitest
+import { extractParagraphs } from '../docxInplace.js';
+
+async function makeDocx(paras) {
+  const d = new docx.Document({ sections: [{ children: paras.map(t =>
+    new docx.Paragraph({ children: [ new docx.TextRun(t) ] })) }] });
+  return docx.Packer.toBuffer(d);
+}
+
+it('extracts non-empty paragraphs with pIndex and content', async () => {
+  const buf = await makeDocx(['שלום עולם', '', 'Second para']);
+  const { paragraphs, documentXml } = await extractParagraphs(buf);
+  expect(paragraphs.length).toBe(2);                 // empty one skipped
+  expect(paragraphs[0].content).toBe('שלום עולם');
+  expect(typeof paragraphs[0].pIndex).toBe('number');
+  expect(documentXml).toContain('w:p');
+});
+
+it('rejects DOCTYPE (XXE guard)', async () => {
+  // craft a zip whose document.xml has a DOCTYPE
+  const JSZip = (await import('jszip')).default;
+  const zip = new JSZip();
+  zip.file('word/document.xml', '<?xml version="1.0"?><!DOCTYPE x><w:document/>');
+  const buf = await zip.generateAsync({ type: 'nodebuffer' });
+  await expect(extractParagraphs(buf)).rejects.toThrow(/DOCTYPE|entity/i);
+});
+```
+Note: `docx` exposes named ESM exports with no default export, so `import docx from 'docx'` resolves to `undefined` under vitest; use `import * as docx from 'docx'` (seen earlier in this repo).
+
+**Step 2:** `npm test -- server/services/__tests__/docxInplace.extract` → FAIL.
+
+**Step 3 — implement** in `server/services/docxInplace.js`:
+- `const JSZip = require('jszip'); const { DOMParser, XMLSerializer } = require('@xmldom/xmldom');`
+- `async function extractParagraphs(buffer)`:
+  - `const zip = await JSZip.loadAsync(buffer);`
+  - `const documentXml = await zip.file('word/document.xml').async('string');`
+  - **XXE guard:** `if (/<!DOCTYPE/i.test(documentXml)) throw new Error('DOCTYPE not allowed (XXE)');`
+  - `const dom = new DOMParser().parseFromString(documentXml, 'text/xml');`
+  - `const ps = Array.from(dom.getElementsByTagName('w:p'));`
+  - For each `p` at index `pIndex`: text = concat of `Array.from(p.getElementsByTagName('w:t')).map(t => t.textContent).join('')`. If `text.trim()` non-empty → push `{ pIndex, content: text }`.
+  - return `{ paragraphs, zip, documentXml }`.
+- `module.exports = { extractParagraphs };` (add writeBack in Task 3).
+
+**Step 4:** PASS (2 tests). **Step 5:** commit `feat: DOCX XML paragraph extractor (XXE-guarded)`.
+
+---
+
+### Task 3: `writeBack` (paragraph-level in-place)
+
+**Files:** Modify `server/services/docxInplace.js`; create `server/services/__tests__/docxInplace.writeback.test.js`.
+
+**Step 1 — failing test:**
+```javascript
+import { describe, it, expect } from 'vitest';
+import * as docx from 'docx'; // docx has no default export under vitest
+import { extractParagraphs, writeBack } from '../docxInplace.js';
+
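+async function makeDocx(paras) { return docx.Packer.toBuffer(new docx.Document({ sections: [{ children: paras.map(t => new docx.Paragraph({ children: [new docx.TextRun(t)] })) }] })); } // same helper as Task 2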
+
+it('writes translations into first run, blanks others, round-trips', async () => {
+  const buf = await makeDocx(['שלום עולם', 'Keep me']);
+  const { paragraphs, zip, documentXml } = await extractParagraphs(buf);
+  const mapping = {}; mapping[paragraphs[0].pIndex] = 'Hello world';
+  const out = await writeBack(zip, documentXml, mapping);
+  // re-extract from the output to verify
+  const re = await extractParagraphs(out);
+  const texts = re.paragraphs.map(p => p.content);
+  expect(texts).toContain('Hello world');           // translated paragraph replaced
+  expect(texts).toContain('Keep me');               // untouched paragraph preserved
+});
+
+it('throws on invalid mapping target type (caller falls back)', async () => {
+  const buf = await makeDocx(['a']);
+  const { paragraphs, zip, documentXml } = await extractParagraphs(buf);
+  const m = {}; m[paragraphs[0].pIndex] = { not: 'a string' };
+  await expect(writeBack(zip, documentXml, m)).rejects.toThrow();
+});
+```
+
+**Step 2:** FAIL.
+
+**Step 3 — implement** `async function writeBack(zip, documentXml, mapping)`:
+- parse fresh DOM from `documentXml`.
+- `const ps = Array.from(dom.getElementsByTagName('w:p'));`
+- for each `pIndexStr` in mapping: `const text = mapping[pIndexStr];` if `typeof text !== 'string'` throw; `const p = ps[Number(pIndexStr)];` if !p continue; `const ts = Array.from(p.getElementsByTagName('w:t'));` if `ts.length === 0` continue; set `ts[0].textContent = text.normalize('NFC');` and set `ts[0].setAttribute('xml:space','preserve');`; for `ts.slice(1)` set `.textContent = ''`.
+- `const outXml = new XMLSerializer().serializeToString(dom);`
+- validate: re-parse `new DOMParser().parseFromString(outXml,'text/xml')` — if it has a `parsererror` element, throw.
+- `zip.file('word/document.xml', outXml);`
+- `return zip.generateAsync({ type: 'nodebuffer' });`
+- export `writeBack`.
+
+**Step 4:** PASS. **Step 5:** commit `feat: DOCX in-place writeBack (paragraph-level, validated)`.
+
+---
+
+### Task 4: Wire into the worker (docx branch + fallback)
+
+**Files:** Modify `server/api/translate.js`.
+
+Read the current `documentQueue.process('translate', ...)`. Add a DOCX branch BEFORE the existing flat path:
+```javascript
+const ext = path.extname(filePath).slice(1).toLowerCase();
+let doc, outputPath, usedInplace = false;
+if (ext === 'docx') {
+  try {
+    await assertDocxSafe(filePath, MAX_DOCX_UNCOMPRESSED);          // zip-bomb guard; import assertDocxSafe from ../services/zipGuard if missing
+    const buffer = await fs.readFile(filePath);
+    const { paragraphs, zip, documentXml } = await extractParagraphs(buffer);
+    const blocks = paragraphs.map(p => ({ type: 'paragraph', content: p.content }));
+    const { blocks: docBlocks, segments } = buildSegments(blocks);
+    if (docBlocks.length !== paragraphs.length) throw new Error('paragraph/segment count mismatch');
+    doc = await buildTranslationDocument({ blocks: docBlocks, segments },
+      (chunk) => aiProvider.translateBatchAligned(chunk, sourceLang || 'he', targetLang),
+      { sourceLang: sourceLang || 'he', targetLang, maxSegments: MAX_SEGMENTS, concurrency: 2, maxPerChunk: 8, maxTokens: 1200, owner:'anon', jobId:String(job.id), ts: Date.now(),
+        onCap:(i)=>console.warn(`cap ${i.total}>${i.cap}`) });
+    const mapping = {};
+    docBlocks.forEach((b, i) => { const t = b.sentences.map(s => s.target).join(' '); if (t.trim()) mapping[paragraphs[i].pIndex] = t; }); // empty translation: keep the original text
+    const outBuf = await writeBack(zip, documentXml, mapping);
+    outputPath = path.join(path.dirname(filePath), `translated_${crypto.randomUUID()}.docx`);
+    await fs.writeFile(outputPath, outBuf);
+    usedInplace = true;
+  } catch (e) {
+    console.warn('DOCX in-place failed, falling back to flat:', e.message);
+  }
+}
+if (!usedInplace) {
+  // existing flat path: processDocument -> buildSegments -> buildTranslationDocument -> generateTranslatedDocument
+  // (keep current code; ensure it sets `doc` and `outputPath`)
+}
+const resultToken = crypto.randomUUID();
+saveResult(resultToken, doc);
+await job.progress(100);
+return { filename: path.basename(outputPath), resultToken, success: true };
+```
+- Add imports: `const { extractParagraphs, writeBack } = require('../services/docxInplace');` and ensure `assertDocxSafe` is imported (from `../services/zipGuard`) and `fs` is the promises fs already in the file.
+- Keep the existing flat path as the `if (!usedInplace)` body (refactor current middle into it). Ensure both paths produce `doc` (for saveResult) and `outputPath`.
+
+**Verify (no full unit test — needs LLM/redis):**
+- `npm test` (full) → no new failures.
+- `node -e "require('./server/api/translate.js'); console.log('loads'); process.exit(0)"`.
+- Optional: a focused integration test `server/services/__tests__/docxInplace.integration.test.js` that runs extract → buildSegments → buildTranslationDocument(fakeBatch upper-casing) → mapping → writeBack → re-extract, then asserts that the output text is upper-cased. (Recommended; uses a fake batch translator, no network.)
+
+**Commit:** `feat: wire DOCX in-place into worker with flat fallback`.
+
+---
+
+### Task 5: Deploy + verify
+
+- Merge `feat/docx-inplace` → main (PR). On sec: `git pull && docker compose -f docker-compose.prod.yml up -d --build app`.
+- Verify: generate a Hebrew `.docx` (multi-paragraph), upload to https://translator.creatman.site, download the result, confirm: opens in Word, paragraphs translated, layout/structure intact. Confirm a PDF upload still works (flat path unchanged). Log decision to memory.
+
+---
+
+## Acceptance criteria (design 🔴)
+- [ ] Zip-bomb: `assertDocxSafe` runs before extraction (Task 4)
+- [ ] XXE: DOCTYPE rejected (Task 2)
+- [ ] Output integrity: serialized XML re-validated; any error → flat fallback, job never fails (Tasks 3,4)
+- [ ] One translation pass (Task 4)
+- [ ] DOCX output preserves layout (paragraph-level in place); PDF path unchanged

From 08b9838ba2b18769facbc35bd6b2bab347913c73 Mon Sep 17 00:00:00 2001
From: Creatman <creatmanick@gmail.com>
Date: Thu, 4 Jun 2026 12:22:21 -0400
Subject: [PATCH 3/6] build: add jszip + @xmldom/xmldom for DOCX in-place

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---
 package-lock.json | 19 +++++++++++++++----
 package.json      |  2 ++
 2 files changed, 17 insertions(+), 4 deletions(-)

diff --git a/package-lock.json b/package-lock.json
index 71005ae..30b1321 100644
--- a/package-lock.json
+++ b/package-lock.json
@@ -9,6 +9,7 @@
       "version": "1.0.0",
       "dependencies": {
         "@google-cloud/translate": "^7.0.0",
+        "@xmldom/xmldom": "^0.9.10",
         "bull": "^4.12.0",
         "cors": "^2.8.5",
         "docx": "^8.5.0",
@@ -21,6 +22,7 @@
         "hebrew-transliteration": "^2.0.0",
         "helmet": "^7.1.0",
         "ioredis": "^5.3.2",
+        "jszip": "^3.10.1",
         "mammoth": "^1.6.0",
         "mime-types": "^2.1.35",
         "multer": "^1.4.5-lts.1",
@@ -3318,12 +3320,12 @@
       "license": "MIT"
     },
     "node_modules/@xmldom/xmldom": {
-      "version": "0.8.10",
-      "resolved": "https://registry.npmjs.org/@xmldom/xmldom/-/xmldom-0.8.10.tgz",
-      "integrity": "sha512-2WALfTl4xo2SkGCYRt6rDTFfk9R1czmBvUQy12gK2KuRKIpWEhcbbzy8EZXtz/jkRqHX8bFEc6FC1HjX4TUWYw==",
+      "version": "0.9.10",
+      "resolved": "https://registry.npmjs.org/@xmldom/xmldom/-/xmldom-0.9.10.tgz",
+      "integrity": "sha512-A9gOqLdi6cV4ibazAjcQufGj0B1y/vDqYrcuP6d/6x8P27gRS8643Dj9o1dEKtB6O7fwxb2FgBmJS2mX7gpvdw==",
       "license": "MIT",
       "engines": {
-        "node": ">=10.0.0"
+        "node": ">=14.6"
       }
     },
     "node_modules/abort-controller": {
@@ -9791,6 +9793,15 @@
         "node": ">=12.0.0"
       }
     },
+    "node_modules/mammoth/node_modules/@xmldom/xmldom": {
+      "version": "0.8.13",
+      "resolved": "https://registry.npmjs.org/@xmldom/xmldom/-/xmldom-0.8.13.tgz",
+      "integrity": "sha512-KRYzxepc14G/CEpEGc3Yn+JKaAeT63smlDr+vjB8jRfgTBBI9wRj/nkQEO+ucV8p8I9bfKLWp37uHgFrbntPvw==",
+      "license": "MIT",
+      "engines": {
+        "node": ">=10.0.0"
+      }
+    },
     "node_modules/mammoth/node_modules/argparse": {
       "version": "1.0.10",
       "resolved": "https://registry.npmjs.org/argparse/-/argparse-1.0.10.tgz",
diff --git a/package.json b/package.json
index bd080d3..aec231b 100644
--- a/package.json
+++ b/package.json
@@ -23,6 +23,7 @@
   },
   "dependencies": {
     "@google-cloud/translate": "^7.0.0",
+    "@xmldom/xmldom": "^0.9.10",
     "bull": "^4.12.0",
     "cors": "^2.8.5",
     "docx": "^8.5.0",
@@ -35,6 +36,7 @@
     "hebrew-transliteration": "^2.0.0",
     "helmet": "^7.1.0",
     "ioredis": "^5.3.2",
+    "jszip": "^3.10.1",
     "mammoth": "^1.6.0",
     "mime-types": "^2.1.35",
     "multer": "^1.4.5-lts.1",

From 4c827b4516a8b599e35a977931688b6839751cdc Mon Sep 17 00:00:00 2001
From: Creatman <creatmanick@gmail.com>
Date: Thu, 4 Jun 2026 12:24:31 -0400
Subject: [PATCH 4/6] feat: DOCX XML paragraph extractor (XXE-guarded)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
---
 .../__tests__/docxInplace.extract.test.js     | 35 ++++++++++++++++
 server/services/docxInplace.js                | 42 +++++++++++++++++++
 2 files changed, 77 insertions(+)
 create mode 100644 server/services/__tests__/docxInplace.extract.test.js
 create mode 100644 server/services/docxInplace.js

diff --git a/server/services/__tests__/docxInplace.extract.test.js b/server/services/__tests__/docxInplace.extract.test.js
new file mode 100644
index 0000000..d0ed5b1
--- /dev/null
+++ b/server/services/__tests__/docxInplace.extract.test.js
@@ -0,0 +1,35 @@
+import { describe, it, expect } from 'vitest';
+import * as docx from 'docx'; // docx ships ESM named exports (no default) under vitest
+import { extractParagraphs } from '../docxInplace.js';
+
+async function makeDocx(paras) {
+  const d = new docx.Document({
+    sections: [
+      {
+        children: paras.map(
+          (t) => new docx.Paragraph({ children: [new docx.TextRun(t)] })
+        ),
+      },
+    ],
+  });
+  return docx.Packer.toBuffer(d);
+}
+
+describe('extractParagraphs', () => {
+  it('extracts non-empty paragraphs with pIndex and content', async () => {
+    const buf = await makeDocx(['שלום עולם', '', 'Second para']);
+    const { paragraphs, documentXml } = await extractParagraphs(buf);
+    expect(paragraphs.length).toBe(2); // the empty paragraph is skipped
+    expect(paragraphs[0].content).toBe('שלום עולם');
+    expect(typeof paragraphs[0].pIndex).toBe('number');
+    expect(documentXml).toContain('w:p');
+  });
+
+  it('rejects DOCTYPE (XXE guard)', async () => {
+    const JSZip = (await import('jszip')).default;
+    const zip = new JSZip();
+    zip.file('word/document.xml', '<?xml version="1.0"?><!DOCTYPE x><w:document/>');
+    const buf = await zip.generateAsync({ type: 'nodebuffer' });
+    await expect(extractParagraphs(buf)).rejects.toThrow(/DOCTYPE|entity|XXE/i);
+  });
+});
diff --git a/server/services/docxInplace.js b/server/services/docxInplace.js
new file mode 100644
index 0000000..d54f214
--- /dev/null
+++ b/server/services/docxInplace.js
@@ -0,0 +1,42 @@
+const JSZip = require('jszip');
+const { DOMParser } = require('@xmldom/xmldom');
+
+/**
+ * Open a .docx buffer, parse word/document.xml, and return the non-empty body
+ * paragraphs with their positional index (within ALL <w:p>, including those
+ * nested in table cells), plus the loaded zip and the raw XML for write-back.
+ *
+ * Security: untrusted XML must reject DOCTYPE (XXE / billion-laughs guard).
+ *
+ * @param {Buffer} buffer - raw .docx file contents (a zip)
+ * @returns {Promise<{paragraphs: {pIndex:number, content:string}[], zip: import('jszip'), documentXml: string}>}
+ */
+async function extractParagraphs(buffer) {
+  const zip = await JSZip.loadAsync(buffer);
+  const entry = zip.file('word/document.xml');
+  if (!entry) throw new Error('not a docx: word/document.xml missing');
+
+  const documentXml = await entry.async('string');
+  // XXE / billion-laughs guard: refuse any DOCTYPE declaration.
+  if (/<!DOCTYPE/i.test(documentXml)) {
+    throw new Error('DOCTYPE not allowed (XXE guard)');
+  }
+
+  const dom = new DOMParser().parseFromString(documentXml, 'text/xml');
+
+  // getElementsByTagName finds all <w:p>, nested (table cells) included.
+  const ps = Array.from(dom.getElementsByTagName('w:p'));
+  const paragraphs = [];
+  ps.forEach((p, pIndex) => {
+    const text = Array.from(p.getElementsByTagName('w:t'))
+      .map((t) => t.textContent || '')
+      .join('');
+    if (text.trim().length > 0) {
+      paragraphs.push({ pIndex, content: text });
+    }
+  });
+
+  return { paragraphs, zip, documentXml };
+}
+
+module.exports = { extractParagraphs };

From f1d6d199b159cf4d8a160576524e9a74a4d35577 Mon Sep 17 00:00:00 2001
From: Creatman <creatmanick@gmail.com>
Date: Thu, 4 Jun 2026 13:20:58 -0400
Subject: [PATCH 5/6] feat: DOCX in-place writeBack (paragraph-level,
 validated)

---
 .../__tests__/docxInplace.writeback.test.js   | 39 ++++++++++++++
 server/services/docxInplace.js                | 52 ++++++++++++++++++-
 2 files changed, 89 insertions(+), 2 deletions(-)
 create mode 100644 server/services/__tests__/docxInplace.writeback.test.js

diff --git a/server/services/__tests__/docxInplace.writeback.test.js b/server/services/__tests__/docxInplace.writeback.test.js
new file mode 100644
index 0000000..84c6129
--- /dev/null
+++ b/server/services/__tests__/docxInplace.writeback.test.js
@@ -0,0 +1,39 @@
+import { describe, it, expect } from 'vitest';
+import * as docx from 'docx';
+import { extractParagraphs, writeBack } from '../docxInplace.js';
+
+async function makeDocx(paras) {
+  const d = new docx.Document({ sections: [{ children: paras.map(t =>
+    new docx.Paragraph({ children: [ new docx.TextRun(t) ] })) }] });
+  return docx.Packer.toBuffer(d);
+}
+
+describe('writeBack', () => {
+  it('writes translation into first run, blanks others, round-trips', async () => {
+    const buf = await makeDocx(['שלום עולם', 'Keep me']);
+    const { paragraphs, zip, documentXml } = await extractParagraphs(buf);
+    const mapping = {}; mapping[paragraphs[0].pIndex] = 'Hello world';
+    const out = await writeBack(zip, documentXml, mapping);
+    expect(Buffer.isBuffer(out)).toBe(true);
+    const re = await extractParagraphs(out);
+    const texts = re.paragraphs.map(p => p.content);
+    expect(texts).toContain('Hello world');   // translated paragraph replaced
+    expect(texts).toContain('Keep me');       // untouched paragraph preserved
+  });
+
+  it('throws when a mapping target is not a string (caller will fall back)', async () => {
+    const buf = await makeDocx(['a']);
+    const { paragraphs, zip, documentXml } = await extractParagraphs(buf);
+    const m = {}; m[paragraphs[0].pIndex] = { not: 'a string' };
+    await expect(writeBack(zip, documentXml, m)).rejects.toThrow();
+  });
+
+  it('leaves paragraphs not in the mapping unchanged', async () => {
+    const buf = await makeDocx(['one', 'two']);
+    const { paragraphs, zip, documentXml } = await extractParagraphs(buf);
+    const m = {}; m[paragraphs[1].pIndex] = 'TWO';
+    const out = await writeBack(zip, documentXml, m);
+    const texts = (await extractParagraphs(out)).paragraphs.map(p => p.content);
+    expect(texts).toEqual(['one', 'TWO']);
+  });
+});
diff --git a/server/services/docxInplace.js b/server/services/docxInplace.js
index d54f214..13ee20e 100644
--- a/server/services/docxInplace.js
+++ b/server/services/docxInplace.js
@@ -1,5 +1,5 @@
 const JSZip = require('jszip');
-const { DOMParser } = require('@xmldom/xmldom');
+const { DOMParser, XMLSerializer } = require('@xmldom/xmldom');
 
 /**
  * Open a .docx buffer, parse word/document.xml, and return the non-empty body
@@ -39,4 +39,52 @@ async function extractParagraphs(buffer) {
   return { paragraphs, zip, documentXml };
 }
 
-module.exports = { extractParagraphs };
+/**
+ * Inject translations into word/document.xml at paragraph level, preserving
+ * everything else (run/paragraph props, images, tables). For each pIndex in
+ * `mapping`, the paragraph's FIRST <w:t> receives the translated string (NFC
+ * normalized) and the remaining <w:t> of that paragraph are blanked. The zip
+ * is repackaged and returned as a Buffer.
+ *
+ * Security/robustness (🔴): the serialized XML is re-parsed for validation;
+ * any problem throws so the caller can fall back to the flat generator and
+ * never emits a corrupt file.
+ *
+ * @param {import('jszip')} zip - JSZip instance from extractParagraphs
+ * @param {string} documentXml - raw word/document.xml string
+ * @param {Object<string|number, string>} mapping - pIndex -> translated text
+ * @returns {Promise<Buffer>} repackaged .docx buffer
+ */
+async function writeBack(zip, documentXml, mapping) {
+  const dom = new DOMParser().parseFromString(documentXml, 'text/xml');
+  const ps = Array.from(dom.getElementsByTagName('w:p'));
+
+  for (const key of Object.keys(mapping)) {
+    const text = mapping[key];
+    if (typeof text !== 'string') {
+      throw new Error('mapping target must be a string');
+    }
+    const p = ps[Number(key)];
+    if (!p) continue;
+    const ts = Array.from(p.getElementsByTagName('w:t'));
+    if (ts.length === 0) continue;
+    ts[0].textContent = text.normalize('NFC');
+    ts[0].setAttribute('xml:space', 'preserve');
+    for (const t of ts.slice(1)) {
+      t.textContent = '';
+    }
+  }
+
+  const outXml = new XMLSerializer().serializeToString(dom);
+
+  // Validate: re-parse the output; xmldom normally throws on malformed XML, so the parsererror check is a defensive backstop.
+  const check = new DOMParser().parseFromString(outXml, 'text/xml');
+  if (!check || check.getElementsByTagName('parsererror').length > 0) {
+    throw new Error('writeBack produced invalid XML');
+  }
+
+  zip.file('word/document.xml', outXml);
+  return zip.generateAsync({ type: 'nodebuffer' });
+}
+
+module.exports = { extractParagraphs, writeBack };

From 2eda02bf8375c1749547fc138e43f43c7ade319d Mon Sep 17 00:00:00 2001
From: Creatman <creatmanick@gmail.com>
Date: Thu, 4 Jun 2026 13:24:25 -0400
Subject: [PATCH 6/6] feat: wire DOCX in-place into worker with flat fallback

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
---
 server/api/translate.js                       | 95 ++++++++++++++-----
 .../__tests__/docxInplace.integration.test.js | 30 ++++++
 2 files changed, 100 insertions(+), 25 deletions(-)
 create mode 100644 server/services/__tests__/docxInplace.integration.test.js

diff --git a/server/api/translate.js b/server/api/translate.js
index c2814dc..af394a2 100644
--- a/server/api/translate.js
+++ b/server/api/translate.js
@@ -14,6 +14,8 @@ const LiteLLMProvider = require('../adapters/ai/LiteLLMProvider');
 const { saveResult, getResult, recentUsage } = require('../services/resultStore');
 const { validateMagicBytes } = require('../middleware/fileValidation');
 const { emitToSession } = require('../socket/rooms');
+const { extractParagraphs, writeBack } = require('../services/docxInplace');
+const { assertDocxSafe } = require('../services/zipGuard');
 
 // Инициализируем сервисы
 const documentProcessor = new DocumentProcessor();
@@ -22,6 +24,9 @@ const aiProvider = new LiteLLMProvider();
 // Жёсткий предел на число сегментов в одном документе (DoS-guard + бюджет).
 const MAX_SEGMENTS = Number(process.env.MAX_SEGMENTS) || 1500;
 
+// Cap on the uncompressed DOCX size (zip-bomb guard) for the worker.
+const MAX_DOCX_UNCOMPRESSED = (Number(process.env.MAX_DOCX_UNCOMPRESSED_MB) || 100) * 1024 * 1024;
+
 // Создаем очередь для обработки документов
 const documentQueue = new Queue('document-processing', {
   redis: {
@@ -62,38 +67,78 @@ documentQueue.on('failed', (job, error) => {
 documentQueue.process('translate', async (job) => {
   try {
     const { filePath, sourceLang, targetLang, originalName } = job.data;
-    
+
     // Обновляем прогресс: Начало обработки
     await job.progress(10);
 
-    // Обрабатываем документ (плоский текст)
-    const processed = await documentProcessor.processDocument(filePath, targetLang);
-    await job.progress(40);
-
-    // Разбиваем на блоки и сегменты, затем строим TranslationDocument
-    const rawBlocks = toBlocks(processed.content);
-    const { blocks: docBlocks, segments } = buildSegments(rawBlocks);
-    const doc = await buildTranslationDocument(
-      { blocks: docBlocks, segments },
-      (chunk) => aiProvider.translateBatchAligned(chunk, sourceLang || 'he', targetLang),
-      { sourceLang: sourceLang || 'he', targetLang, maxSegments: MAX_SEGMENTS,
-        concurrency: 2, maxPerChunk: 8, maxTokens: 1200,
-        owner: 'anon', jobId: String(job.id), ts: Date.now(),
-        onCap: (info) => console.warn(`Segment cap hit: ${info.total} > ${info.cap} (job ${job.id})`) }
-    );
-    await job.progress(80);
+    const ext = path.extname(filePath).slice(1).toLowerCase();
+    let doc = null, outputPath = null, usedInplace = false;
+
+    // DOCX: translate the document in place, preserving the original layout.
+    // Any error falls back to the flat path below so the download never breaks.
+    if (ext === 'docx') {
+      try {
+        await assertDocxSafe(filePath, MAX_DOCX_UNCOMPRESSED);
+        const buffer = await fs.readFile(filePath);
+        const { paragraphs, zip, documentXml } = await extractParagraphs(buffer);
+        const blocks = paragraphs.map(p => ({ type: 'paragraph', content: p.content }));
+        const { blocks: docBlocks, segments } = buildSegments(blocks);
+        if (docBlocks.length !== paragraphs.length) {
+          throw new Error(`paragraph/segment count mismatch ${docBlocks.length} != ${paragraphs.length}`);
+        }
+        await job.progress(40);
+        doc = await buildTranslationDocument(
+          { blocks: docBlocks, segments },
+          (chunk) => aiProvider.translateBatchAligned(chunk, sourceLang || 'he', targetLang),
+          { sourceLang: sourceLang || 'he', targetLang, maxSegments: MAX_SEGMENTS,
+            concurrency: 2, maxPerChunk: 8, maxTokens: 1200,
+            owner: 'anon', jobId: String(job.id), ts: Date.now(),
+            onCap: (info) => console.warn(`Segment cap hit: ${info.total} > ${info.cap} (job ${job.id})`) }
+        );
+        await job.progress(80);
+        const mapping = {};
+        docBlocks.forEach((b, i) => { mapping[paragraphs[i].pIndex] = b.sentences.map(s => s.target).join(' '); });
+        const outBuf = await writeBack(zip, documentXml, mapping);
+        outputPath = path.join(path.dirname(filePath), `translated_${crypto.randomUUID()}.docx`);
+        await fs.writeFile(outputPath, outBuf);
+        usedInplace = true;
+      } catch (e) {
+        console.warn('DOCX in-place failed, falling back to flat:', e.message);
+      }
+    }
+
+    // Flat path (always for PDF, fallback for DOCX): extract the text and rebuild
+    // the translated document from scratch, losing the layout but guaranteeing a result.
+    if (!usedInplace) {
+      // Process the document (flat text)
+      const processed = await documentProcessor.processDocument(filePath, targetLang);
+      await job.progress(40);
+
+      // Split into blocks and segments, then build the TranslationDocument
+      const rawBlocks = toBlocks(processed.content);
+      const { blocks: docBlocks, segments } = buildSegments(rawBlocks);
+      doc = await buildTranslationDocument(
+        { blocks: docBlocks, segments },
+        (chunk) => aiProvider.translateBatchAligned(chunk, sourceLang || 'he', targetLang),
+        { sourceLang: sourceLang || 'he', targetLang, maxSegments: MAX_SEGMENTS,
+          concurrency: 2, maxPerChunk: 8, maxTokens: 1200,
+          owner: 'anon', jobId: String(job.id), ts: Date.now(),
+          onCap: (info) => console.warn(`Segment cap hit: ${info.total} > ${info.cap} (job ${job.id})`) }
+      );
+      await job.progress(80);
+
+      // downloadable file: flatten translated sentences per block
+      const fileBlocks = doc.blocks.map(b => ({ type: 'text', content: b.sentences.map(s => s.target).join(' ') }));
+      outputPath = path.join(
+        path.dirname(filePath),
+        `translated_${crypto.randomUUID()}${path.extname(filePath)}`
+      );
+      await documentProcessor.generateTranslatedDocument(fileBlocks, outputPath);
+    }
 
     // result for the viewer
     const resultToken = crypto.randomUUID();
     saveResult(resultToken, doc);
-
-    // downloadable file: flatten translated sentences per block
-    const fileBlocks = doc.blocks.map(b => ({ type: 'text', content: b.sentences.map(s => s.target).join(' ') }));
-    const outputPath = path.join(
-      path.dirname(filePath),
-      `translated_${crypto.randomUUID()}${path.extname(filePath)}`
-    );
-    await documentProcessor.generateTranslatedDocument(fileBlocks, outputPath);
     await job.progress(100);
 
     return {
diff --git a/server/services/__tests__/docxInplace.integration.test.js b/server/services/__tests__/docxInplace.integration.test.js
new file mode 100644
index 0000000..511bf31
--- /dev/null
+++ b/server/services/__tests__/docxInplace.integration.test.js
@@ -0,0 +1,30 @@
+import { describe, it, expect } from 'vitest';
+import * as docx from 'docx';
+import { extractParagraphs, writeBack } from '../docxInplace.js';
+import { buildSegments } from '../translationDocument.js';
+import { buildTranslationDocument } from '../pipeline.js';
+
+async function makeDocx(paras) {
+  const d = new docx.Document({ sections: [{ children: paras.map(t =>
+    new docx.Paragraph({ children: [ new docx.TextRun(t) ] })) }] });
+  return docx.Packer.toBuffer(d);
+}
+
+// fake batch: uppercases each segment's source
+const fakeBatch = async (chunk) => ({ items: chunk.map(s => ({ id: s.id, target: s.source.toUpperCase(), align: [] })), usage: null });
+
+describe('docx in-place integration', () => {
+  it('end-to-end docx in-place with a fake translator', async () => {
+    const buf = await makeDocx(['hello world', 'second line']);
+    const { paragraphs, zip, documentXml } = await extractParagraphs(buf);
+    const blocks = paragraphs.map(p => ({ type: 'paragraph', content: p.content }));
+    const { blocks: docBlocks, segments } = buildSegments(blocks);
+    expect(docBlocks.length).toBe(paragraphs.length);
+    const doc = await buildTranslationDocument({ blocks: docBlocks, segments }, fakeBatch, { sourceLang: 'he', targetLang: 'en' });
+    const mapping = {};
+    docBlocks.forEach((b, i) => { mapping[paragraphs[i].pIndex] = b.sentences.map(s => s.target).join(' '); });
+    const out = await writeBack(zip, documentXml, mapping);
+    const texts = (await extractParagraphs(out)).paragraphs.map(p => p.content);
+    expect(texts).toEqual(['HELLO WORLD', 'SECOND LINE']);
+  });
+});