From eac29d225579bc7978ee4b44d224bb804225aaa4 Mon Sep 17 00:00:00 2001 From: Creatman Date: Thu, 4 Jun 2026 12:05:00 -0400 Subject: [PATCH 1/6] docs: DOCX in-place pixel-faithful design (paragraph-level, XML edit) Co-Authored-By: Claude Opus 4.8 (1M context) --- ...4-hebrew-translator-docx-inplace-design.md | 57 +++++++++++++++++++ 1 file changed, 57 insertions(+) create mode 100644 docs/plans/2026-06-04-hebrew-translator-docx-inplace-design.md diff --git a/docs/plans/2026-06-04-hebrew-translator-docx-inplace-design.md b/docs/plans/2026-06-04-hebrew-translator-docx-inplace-design.md new file mode 100644 index 0000000..8821d3e --- /dev/null +++ b/docs/plans/2026-06-04-hebrew-translator-docx-inplace-design.md @@ -0,0 +1,57 @@ +# Hebrew Translator — DOCX in-place pixel-faithful (design) + +**Date:** 2026-06-04 +**Status:** Approved (design-lock) +**Builds on:** Phase 1 (segment-aligned pipeline + viewer, live). This is the first pixel-faithful **file output** sub-phase. PDF positional overlay is the next sub-phase after this. + +## Goal +For DOCX input, produce a downloadable translated `.docx` that preserves the original layout, images, tables, and paragraph styles by editing the original `word/document.xml` **in place** (paragraph-level), instead of regenerating a bare document. + +## Decisions (locked) +| Topic | Decision | +|-------|----------| +| First format | DOCX in-place (PDF overlay = next sub-phase) | +| Write-back granularity | **Paragraph-level**: translated paragraph → first `w:t` run; other runs blanked | +| Translation passes | **One** — XML-paragraph extraction feeds both the viewer and the in-place writer | +| Libs | `jszip` (zip r/w) + `@xmldom/xmldom` (XML DOM) | + +## Architecture + +### Extraction (`docx` input → blocks + retained XML) +- `jszip` opens the DOCX; read `word/document.xml`; parse to DOM (`@xmldom/xmldom`). +- Walk body `w:p` paragraphs (including those inside table cells `w:tc`). 
For each non-empty paragraph at index `pIndex`, the block content = concatenation of its `w:t` run texts. Produce blocks `{ id, type:'paragraph', pIndex, content }` → `buildSegments` → `TranslationDocument` (sentences) as in Phase 1. +- Retain the parsed DOM + JSZip instance for write-back. + +### Translation +- Same batch-aligned translator (Groq primary, Claude fallback). One pass. Per block, the paragraph translation = the joined sentence targets. + +### Write-back (in-place, paragraph-level) +- For each block: locate `w:p[pIndex]`; set the FIRST `w:t` text = NFC(target); set remaining `w:t` in that paragraph to empty. Preserve paragraph props, first-run props, images, tables, drawings — everything else untouched. +- Serialize DOM → overwrite the `word/document.xml` entry in the zip → emit `.docx`. + +### Integration +- Worker: `.docx` input → XML extractor + in-place writer (downloadable file). `.pdf` input → current path until the PDF-overlay sub-phase. +- Viewer unchanged (same `TranslationDocument`). + +## Error handling +- Paragraph with no runs / image-only → skipped (left as-is). +- Missing translation for a block (graceful) → leave the original text (do not blank). +- Unparseable / unexpectedly complex docx → **fall back to the existing flat `generateDOCX`** so the download always works. +- `xml:space="preserve"` respected; NFC normalize. + +## Guardrails (design-guardrails-audit) — 🔴 = acceptance criteria +1. 🔴 **Zip-bomb on read** — run existing `assertDocxSafe` (uncompressed-size cap) before JSZip extraction on the in-place path. +2. 🔴 **XXE / entity expansion** — parsing untrusted `document.xml` must not resolve external entities; reject `DOCTYPE`/DTD (`@xmldom/xmldom` does not resolve external entities by default — verify + guard against DOCTYPE / billion-laughs). +3. 
🔴 **Output integrity** — minimal edits (only `w:t` text); re-parse the serialized XML to validate; **fall back to flat `generateDOCX` on any error** (never emit a corrupt file / never fail the job). +4. 🟡 Headers/footers/footnotes (separate XML parts) NOT translated in v1 — explicit log/note, not silent. +5. 🟡 Memory/size bounded by the existing file-size cap. +6. 🟡 Cost — single translation pass (no double LLM calls). +7. 🟢 Deterministic NFC normalization. + +## Testing +- Unit: paragraph extractor (generate a fixture `.docx` via the `docx` dep → extract → assert paragraphs + `pIndex`); write-back (inject translations → re-parse → first run replaced, others blank, structure intact); fallback on malformed input. +- Integration: round-trip a generated `.docx` → translate (mock) → output re-parses and contains the translated text. + +## Scope (YAGNI) +**In:** DOCX body paragraphs (incl. table cells) in-place, paragraph-level, fallback to flat on error, one translation pass. +**Out:** headers/footers/footnotes, intra-paragraph run formatting, PDF overlay (next sub-phase). 
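Reviewer sketch for guardrail 2 (XXE / entity expansion), assuming the check is a plain pattern test on the raw `document.xml` string before any parsing. The helper name `assertNoDoctype` and the error text are illustrative and not taken from the patch.

```javascript
// Hypothetical helper: the design only requires that DOCTYPE/DTD input is rejected.
function assertNoDoctype(xml) {
  // Entities can only be declared through a DOCTYPE, so refusing it rules out
  // both external-entity (XXE) and entity-expansion (billion-laughs) payloads.
  if (/<!DOCTYPE/i.test(xml)) {
    throw new Error('DOCTYPE not allowed (XXE guard)');
  }
  return xml;
}

const safe = '<w:document><w:body/></w:document>';
const hostile = '<!DOCTYPE d [<!ENTITY a "aaaa">]><d>&a;</d>';

console.log(assertNoDoctype(safe) === safe); // true
try {
  assertNoDoctype(hostile);
} catch (err) {
  console.log(err.message); // DOCTYPE not allowed (XXE guard)
}
```

A raw-string match fails closed: a literal `<!DOCTYPE` inside a comment or CDATA section is refused too, which seems an acceptable cost for untrusted uploads.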
From 5727ced63a49a5e0677ad17a26f3070e1c796809 Mon Sep 17 00:00:00 2001 From: Creatman Date: Thu, 4 Jun 2026 12:06:54 -0400 Subject: [PATCH 2/6] docs: DOCX in-place implementation plan (5 tasks, TDD) Co-Authored-By: Claude Opus 4.8 (1M context) --- ...-translator-docx-inplace-implementation.md | 191 ++++++++++++++++++ 1 file changed, 191 insertions(+) create mode 100644 docs/plans/2026-06-04-hebrew-translator-docx-inplace-implementation.md diff --git a/docs/plans/2026-06-04-hebrew-translator-docx-inplace-implementation.md b/docs/plans/2026-06-04-hebrew-translator-docx-inplace-implementation.md new file mode 100644 index 0000000..1eb5dd0 --- /dev/null +++ b/docs/plans/2026-06-04-hebrew-translator-docx-inplace-implementation.md @@ -0,0 +1,191 @@ +# DOCX in-place pixel-faithful — Implementation Plan + +> **For Claude:** REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task. + +**Goal:** For DOCX uploads, produce a translated `.docx` that preserves layout/images/tables by editing `word/document.xml` in place (paragraph-level), instead of regenerating a bare doc. + +**Architecture:** Unzip DOCX (`jszip`), parse `word/document.xml` (`@xmldom/xmldom`), extract body `w:p` paragraphs (incl. table cells) as blocks → existing segment translator (one pass) → write each paragraph's translation into its first `w:t` run (blank the rest) → repackage. Falls back to the existing flat `generateDOCX` on any error. + +**Tech Stack:** Node, jszip, @xmldom/xmldom, existing pipeline (buildSegments/buildTranslationDocument), vitest. `docx` dep used to generate test fixtures. + +**Design:** `docs/plans/2026-06-04-hebrew-translator-docx-inplace-design.md` (🔴 guardrails = acceptance criteria). **Branch:** `feat/docx-inplace`. **Test:** `npm test -- `. 
+ +--- + +### Task 1: Add deps + +```bash +npm install jszip @xmldom/xmldom --legacy-peer-deps +git add package.json package-lock.json +git commit -m "build: add jszip + @xmldom/xmldom for DOCX in-place" +``` +Verify: `node -e "require('jszip'); require('@xmldom/xmldom'); console.log('ok')"`. + +--- + +### Task 2: `extractParagraphs` (DOCX → blocks, XXE-guarded) + +**Files:** Create `server/services/docxInplace.js` + `server/services/__tests__/docxInplace.extract.test.js`. + +**Step 1 — failing test** (build a real fixture with the `docx` dep): +```javascript +import { describe, it, expect } from 'vitest'; +import docx from 'docx'; +import { extractParagraphs } from '../docxInplace.js'; + +async function makeDocx(paras) { + const d = new docx.Document({ sections: [{ children: paras.map(t => + new docx.Paragraph({ children: [ new docx.TextRun(t) ] })) }] }); + return docx.Packer.toBuffer(d); +} + +it('extracts non-empty paragraphs with pIndex and content', async () => { + const buf = await makeDocx(['שלום עולם', '', 'Second para']); + const { paragraphs, documentXml } = await extractParagraphs(buf); + expect(paragraphs.length).toBe(2); // empty one skipped + expect(paragraphs[0].content).toBe('שלום עולם'); + expect(typeof paragraphs[0].pIndex).toBe('number'); + expect(documentXml).toContain('w:p'); +}); + +it('rejects DOCTYPE (XXE guard)', async () => { + // craft a zip whose document.xml has a DOCTYPE (illustrative payload) + const JSZip = (await import('jszip')).default; + const zip = new JSZip(); + zip.file('word/document.xml', '<?xml version="1.0"?><!DOCTYPE x [<!ENTITY e "boom">]><w:document/>'); + const buf = await zip.generateAsync({ type: 'nodebuffer' }); + await expect(extractParagraphs(buf)).rejects.toThrow(/DOCTYPE|entity/i); +}); +``` +Note `docx` is ESM-namespace — if `import docx from 'docx'` is undefined, use `import * as docx from 'docx'` (seen earlier in this repo). + +**Step 2:** `npm test -- server/services/__tests__/docxInplace.extract` → FAIL.
+ +**Step 3 — implement** in `server/services/docxInplace.js`: +- `const JSZip = require('jszip'); const { DOMParser, XMLSerializer } = require('@xmldom/xmldom');` +- `async function extractParagraphs(buffer)`: + - `const zip = await JSZip.loadAsync(buffer);` + - `const documentXml = await zip.file('word/document.xml').async('string');` + - **XXE guard:** `if (/<!DOCTYPE/i.test(documentXml)) throw new Error('DOCTYPE not allowed (XXE guard)');` + - `const dom = new DOMParser().parseFromString(documentXml, 'text/xml');` + - `const ps = Array.from(dom.getElementsByTagName('w:p'));` + - for each `p` at index `pIndex`: `const text = Array.from(p.getElementsByTagName('w:t')).map(t => t.textContent).join('')`. If `text.trim()` non-empty → push `{ pIndex, content: text }`. + - return `{ paragraphs, zip, documentXml }`. +- `module.exports = { extractParagraphs };` (add writeBack in Task 3). + +**Step 4:** PASS (2). **Step 5:** commit `feat: DOCX XML paragraph extractor (XXE-guarded)`. + +--- + +### Task 3: `writeBack` (paragraph-level in-place) + +**Files:** Modify `server/services/docxInplace.js`; create `server/services/__tests__/docxInplace.writeback.test.js`. + +**Step 1 — failing test:** +```javascript +import { describe, it, expect } from 'vitest'; +import docx from 'docx'; +import { extractParagraphs, writeBack } from '../docxInplace.js'; + +async function makeDocx(paras) { /* same helper as Task 2 */ } + +it('writes translations into first run, blanks others, round-trips', async () => { + const buf = await makeDocx(['שלום עולם', 'Keep me']); + const { paragraphs, zip, documentXml } = await extractParagraphs(buf); + const mapping = {}; mapping[paragraphs[0].pIndex] = 'Hello world'; + const out = await writeBack(zip, documentXml, mapping); + // re-extract from the output to verify + const re = await extractParagraphs(out); + const texts = re.paragraphs.map(p => p.content); + expect(texts).toContain('Hello world'); // translated paragraph replaced + expect(texts).toContain('Keep me'); // untouched paragraph preserved +}); + +it('throws on invalid mapping target type (caller falls back)', async () => { + const buf = await makeDocx(['a']); + const { paragraphs, zip, documentXml } = await extractParagraphs(buf); + const m = {}; m[paragraphs[0].pIndex] = { not: 'a string' }; + await
expect(writeBack(zip, documentXml, m)).rejects.toThrow(); +}); +``` + +**Step 2:** FAIL. + +**Step 3 — implement** `async function writeBack(zip, documentXml, mapping)`: +- parse fresh DOM from `documentXml`. +- `const ps = Array.from(dom.getElementsByTagName('w:p'));` +- for each `pIndexStr` in mapping: `const text = mapping[pIndexStr];` if `typeof text !== 'string'` throw; `const p = ps[Number(pIndexStr)];` if !p continue; `const ts = Array.from(p.getElementsByTagName('w:t'));` if `ts.length === 0` continue; set `ts[0].textContent = text.normalize('NFC');` and set `ts[0].setAttribute('xml:space','preserve');`; for `ts.slice(1)` set `.textContent = ''`. +- `const outXml = new XMLSerializer().serializeToString(dom);` +- validate: re-parse `new DOMParser().parseFromString(outXml,'text/xml')` — if it has a `parsererror` element, throw. +- `zip.file('word/document.xml', outXml);` +- `return zip.generateAsync({ type: 'nodebuffer' });` +- export `writeBack`. + +**Step 4:** PASS. **Step 5:** commit `feat: DOCX in-place writeBack (paragraph-level, validated)`. + +--- + +### Task 4: Wire into the worker (docx branch + fallback) + +**Files:** Modify `server/api/translate.js`. + +Read the current `documentQueue.process('translate', ...)`. Add a DOCX branch BEFORE the existing flat path: +```javascript +const ext = path.extname(filePath).slice(1).toLowerCase(); +let doc, outputPath, usedInplace = false; +if (ext === 'docx') { + try { + await assertDocxSafe(filePath, MAX_DOCX_UNCOMPRESSED); // existing import? 
add if missing + const buffer = await fs.readFile(filePath); + const { paragraphs, zip, documentXml } = await extractParagraphs(buffer); + const blocks = paragraphs.map(p => ({ type: 'paragraph', content: p.content })); + const { blocks: docBlocks, segments } = buildSegments(blocks); + if (docBlocks.length !== paragraphs.length) throw new Error('paragraph/segment count mismatch'); + doc = await buildTranslationDocument({ blocks: docBlocks, segments }, + (chunk) => aiProvider.translateBatchAligned(chunk, sourceLang || 'he', targetLang), + { sourceLang: sourceLang || 'he', targetLang, maxSegments: MAX_SEGMENTS, concurrency: 2, maxPerChunk: 8, maxTokens: 1200, owner:'anon', jobId:String(job.id), ts: Date.now(), + onCap:(i)=>console.warn(`cap ${i.total}>${i.cap}`) }); + const mapping = {}; + docBlocks.forEach((b, i) => { mapping[paragraphs[i].pIndex] = b.sentences.map(s => s.target).join(' '); }); + const outBuf = await writeBack(zip, documentXml, mapping); + outputPath = path.join(path.dirname(filePath), `translated_${crypto.randomUUID()}.docx`); + await fs.writeFile(outputPath, outBuf); + usedInplace = true; + } catch (e) { + console.warn('DOCX in-place failed, falling back to flat:', e.message); + } +} +if (!usedInplace) { + // existing flat path: processDocument -> buildSegments -> buildTranslationDocument -> generateTranslatedDocument + // (keep current code; ensure it sets `doc` and `outputPath`) +} +const resultToken = crypto.randomUUID(); +saveResult(resultToken, doc); +await job.progress(100); +return { filename: path.basename(outputPath), resultToken, success: true }; +``` +- Add imports: `const { extractParagraphs, writeBack } = require('../services/docxInplace');` and ensure `assertDocxSafe` is imported (from `../services/zipGuard`) and `fs` is the promises fs already in the file. +- Keep the existing flat path as the `if (!usedInplace)` body (refactor current middle into it). Ensure both paths produce `doc` (for saveResult) and `outputPath`. 
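As a rough illustration of how translations are keyed for write-back, the sketch below builds the `pIndex → text` mapping from plain objects. The `buildMapping` name and the sample data are hypothetical; the object shapes follow the worker snippet above (`paragraphs[i].pIndex`, `block.sentences[].target`).

```javascript
// Hypothetical helper mirroring the mapping step in the worker snippet.
function buildMapping(paragraphs, docBlocks) {
  // Blocks and paragraphs are paired by position, so the counts must agree.
  if (paragraphs.length !== docBlocks.length) {
    throw new Error('paragraph/segment count mismatch');
  }
  const mapping = {};
  docBlocks.forEach((block, i) => {
    // Key by the paragraph's position among all w:p elements, not by i:
    // skipped (empty or image-only) paragraphs leave gaps in pIndex.
    mapping[paragraphs[i].pIndex] = block.sentences.map((s) => s.target).join(' ');
  });
  return mapping;
}

// Sample data: the w:p at index 1 was empty, so extraction skipped it.
const paragraphs = [{ pIndex: 0 }, { pIndex: 2 }];
const docBlocks = [
  { sentences: [{ target: 'Hello world.' }] },
  { sentences: [{ target: 'Second.' }, { target: 'Third.' }] },
];

console.log(JSON.stringify(buildMapping(paragraphs, docBlocks)));
// {"0":"Hello world.","2":"Second. Third."}
```

Keying by `pIndex` lets `writeBack` address paragraphs in the untouched DOM directly, without re-deriving which ones were skipped during extraction.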
+ +**Verify (no full unit test — needs LLM/redis):** +- `npm test` (full) → no new failures. +- `node -e "require('./server/api/translate.js'); console.log('loads'); process.exit(0)"`. +- Optional: a focused integration test `server/services/__tests__/docxInplace.integration.test.js` that runs extract → buildSegments → buildTranslationDocument(fakeBatch upper-casing) → mapping → writeBack → re-extract asserts uppercased text. (Recommended; uses fake batch, no network.) + +**Commit:** `feat: wire DOCX in-place into worker with flat fallback`. + +--- + +### Task 5: Deploy + verify + +- Merge `feat/docx-inplace` → main (PR). On sec: `git pull && docker compose -f docker-compose.prod.yml up -d --build app`. +- Verify: generate a Hebrew `.docx` (multi-paragraph), upload to https://translator.creatman.site, download the result, confirm: opens in Word, paragraphs translated, layout/structure intact. Confirm a PDF upload still works (flat path unchanged). Log decision to memory. + +--- + +## Acceptance criteria (design 🔴) +- [ ] Zip-bomb: `assertDocxSafe` runs before extraction (Task 4) +- [ ] XXE: DOCTYPE rejected (Task 2) +- [ ] Output integrity: serialized XML re-validated; any error → flat fallback, job never fails (Tasks 3,4) +- [ ] One translation pass (Task 4) +- [ ] DOCX output preserves layout (paragraph-level in place); PDF path unchanged From 08b9838ba2b18769facbc35bd6b2bab347913c73 Mon Sep 17 00:00:00 2001 From: Creatman Date: Thu, 4 Jun 2026 12:22:21 -0400 Subject: [PATCH 3/6] build: add jszip + @xmldom/xmldom for DOCX in-place Co-Authored-By: Claude Opus 4.8 (1M context) --- package-lock.json | 19 +++++++++++++++---- package.json | 2 ++ 2 files changed, 17 insertions(+), 4 deletions(-) diff --git a/package-lock.json b/package-lock.json index 71005ae..30b1321 100644 --- a/package-lock.json +++ b/package-lock.json @@ -9,6 +9,7 @@ "version": "1.0.0", "dependencies": { "@google-cloud/translate": "^7.0.0", + "@xmldom/xmldom": "^0.9.10", "bull": "^4.12.0", "cors": 
"^2.8.5", "docx": "^8.5.0", @@ -21,6 +22,7 @@ "hebrew-transliteration": "^2.0.0", "helmet": "^7.1.0", "ioredis": "^5.3.2", + "jszip": "^3.10.1", "mammoth": "^1.6.0", "mime-types": "^2.1.35", "multer": "^1.4.5-lts.1", @@ -3318,12 +3320,12 @@ "license": "MIT" }, "node_modules/@xmldom/xmldom": { - "version": "0.8.10", - "resolved": "https://registry.npmjs.org/@xmldom/xmldom/-/xmldom-0.8.10.tgz", - "integrity": "sha512-2WALfTl4xo2SkGCYRt6rDTFfk9R1czmBvUQy12gK2KuRKIpWEhcbbzy8EZXtz/jkRqHX8bFEc6FC1HjX4TUWYw==", + "version": "0.9.10", + "resolved": "https://registry.npmjs.org/@xmldom/xmldom/-/xmldom-0.9.10.tgz", + "integrity": "sha512-A9gOqLdi6cV4ibazAjcQufGj0B1y/vDqYrcuP6d/6x8P27gRS8643Dj9o1dEKtB6O7fwxb2FgBmJS2mX7gpvdw==", "license": "MIT", "engines": { - "node": ">=10.0.0" + "node": ">=14.6" } }, "node_modules/abort-controller": { @@ -9791,6 +9793,15 @@ "node": ">=12.0.0" } }, + "node_modules/mammoth/node_modules/@xmldom/xmldom": { + "version": "0.8.13", + "resolved": "https://registry.npmjs.org/@xmldom/xmldom/-/xmldom-0.8.13.tgz", + "integrity": "sha512-KRYzxepc14G/CEpEGc3Yn+JKaAeT63smlDr+vjB8jRfgTBBI9wRj/nkQEO+ucV8p8I9bfKLWp37uHgFrbntPvw==", + "license": "MIT", + "engines": { + "node": ">=10.0.0" + } + }, "node_modules/mammoth/node_modules/argparse": { "version": "1.0.10", "resolved": "https://registry.npmjs.org/argparse/-/argparse-1.0.10.tgz", diff --git a/package.json b/package.json index bd080d3..aec231b 100644 --- a/package.json +++ b/package.json @@ -23,6 +23,7 @@ }, "dependencies": { "@google-cloud/translate": "^7.0.0", + "@xmldom/xmldom": "^0.9.10", "bull": "^4.12.0", "cors": "^2.8.5", "docx": "^8.5.0", @@ -35,6 +36,7 @@ "hebrew-transliteration": "^2.0.0", "helmet": "^7.1.0", "ioredis": "^5.3.2", + "jszip": "^3.10.1", "mammoth": "^1.6.0", "mime-types": "^2.1.35", "multer": "^1.4.5-lts.1", From 4c827b4516a8b599e35a977931688b6839751cdc Mon Sep 17 00:00:00 2001 From: Creatman Date: Thu, 4 Jun 2026 12:24:31 -0400 Subject: [PATCH 4/6] feat: DOCX XML paragraph 
extractor (XXE-guarded) Co-Authored-By: Claude Opus 4.8 --- .../__tests__/docxInplace.extract.test.js | 35 ++++++++++++++++ server/services/docxInplace.js | 42 +++++++++++++++++++ 2 files changed, 77 insertions(+) create mode 100644 server/services/__tests__/docxInplace.extract.test.js create mode 100644 server/services/docxInplace.js diff --git a/server/services/__tests__/docxInplace.extract.test.js b/server/services/__tests__/docxInplace.extract.test.js new file mode 100644 index 0000000..d0ed5b1 --- /dev/null +++ b/server/services/__tests__/docxInplace.extract.test.js @@ -0,0 +1,35 @@ +import { describe, it, expect } from 'vitest'; +import * as docx from 'docx'; // docx ships ESM named exports (no default) under vitest +import { extractParagraphs } from '../docxInplace.js'; + +async function makeDocx(paras) { + const d = new docx.Document({ + sections: [ + { + children: paras.map( + (t) => new docx.Paragraph({ children: [new docx.TextRun(t)] }) + ), + }, + ], + }); + return docx.Packer.toBuffer(d); +} + +describe('extractParagraphs', () => { + it('extracts non-empty paragraphs with pIndex and content', async () => { + const buf = await makeDocx(['שלום עולם', '', 'Second para']); + const { paragraphs, documentXml } = await extractParagraphs(buf); + expect(paragraphs.length).toBe(2); // the empty paragraph is skipped + expect(paragraphs[0].content).toBe('שלום עולם'); + expect(typeof paragraphs[0].pIndex).toBe('number'); + expect(documentXml).toContain('w:p'); + }); + + it('rejects DOCTYPE (XXE guard)', async () => { + const JSZip = (await import('jszip')).default; + const zip = new JSZip(); + zip.file('word/document.xml', '<?xml version="1.0"?><!DOCTYPE x [<!ENTITY e "boom">]><w:document/>'); // illustrative DOCTYPE payload + const buf = await zip.generateAsync({ type: 'nodebuffer' }); + await expect(extractParagraphs(buf)).rejects.toThrow(/DOCTYPE|entity|XXE/i); + }); +}); diff --git a/server/services/docxInplace.js b/server/services/docxInplace.js new file mode 100644 index 0000000..d54f214 --- /dev/null +++ b/server/services/docxInplace.js @@ -0,0 +1,42 @@
+const JSZip = require('jszip'); +const { DOMParser } = require('@xmldom/xmldom'); + +/** + * Open a .docx buffer, parse word/document.xml, and return the non-empty body + * paragraphs with their positional index (within ALL <w:p>, including those + * nested in table cells), plus the loaded zip and the raw XML for write-back. + * + * Security: untrusted XML must reject DOCTYPE (XXE / billion-laughs guard). + * + * @param {Buffer} buffer - raw .docx file contents (a zip) + * @returns {Promise<{paragraphs: {pIndex:number, content:string}[], zip: import('jszip'), documentXml: string}>} + */ +async function extractParagraphs(buffer) { + const zip = await JSZip.loadAsync(buffer); + const entry = zip.file('word/document.xml'); + if (!entry) throw new Error('not a docx: word/document.xml missing'); + + const documentXml = await entry.async('string'); + // XXE / billion-laughs guard: refuse any DOCTYPE declaration. + if (/<!DOCTYPE/i.test(documentXml)) { + throw new Error('DOCTYPE not allowed (XXE guard)'); + } + + const dom = new DOMParser().parseFromString(documentXml, 'text/xml'); + + // All <w:p>, nested (table cells) included. + const ps = Array.from(dom.getElementsByTagName('w:p')); + const paragraphs = []; + ps.forEach((p, pIndex) => { + const text = Array.from(p.getElementsByTagName('w:t')) + .map((t) => t.textContent || '') + .join(''); + if (text.trim().length > 0) { + paragraphs.push({ pIndex, content: text }); + } + }); + + return { paragraphs, zip, documentXml }; +} + +module.exports = { extractParagraphs }; From f1d6d199b159cf4d8a160576524e9a74a4d35577 Mon Sep 17 00:00:00 2001 From: Creatman Date: Thu, 4 Jun 2026 13:20:58 -0400 Subject: [PATCH 5/6] feat: DOCX in-place writeBack (paragraph-level, validated) --- .../__tests__/docxInplace.writeback.test.js | 39 ++++++++++++++ server/services/docxInplace.js | 52 ++++++++++++++++++- 2 files changed, 89 insertions(+), 2 deletions(-) create mode 100644 server/services/__tests__/docxInplace.writeback.test.js diff --git a/server/services/__tests__/docxInplace.writeback.test.js b/server/services/__tests__/docxInplace.writeback.test.js new file mode 100644 index 0000000..84c6129 --- /dev/null +++
b/server/services/__tests__/docxInplace.writeback.test.js @@ -0,0 +1,39 @@ +import { describe, it, expect } from 'vitest'; +import * as docx from 'docx'; +import { extractParagraphs, writeBack } from '../docxInplace.js'; + +async function makeDocx(paras) { + const d = new docx.Document({ sections: [{ children: paras.map(t => + new docx.Paragraph({ children: [ new docx.TextRun(t) ] })) }] }); + return docx.Packer.toBuffer(d); +} + +describe('writeBack', () => { + it('writes translation into first run, blanks others, round-trips', async () => { + const buf = await makeDocx(['שלום עולם', 'Keep me']); + const { paragraphs, zip, documentXml } = await extractParagraphs(buf); + const mapping = {}; mapping[paragraphs[0].pIndex] = 'Hello world'; + const out = await writeBack(zip, documentXml, mapping); + expect(Buffer.isBuffer(out)).toBe(true); + const re = await extractParagraphs(out); + const texts = re.paragraphs.map(p => p.content); + expect(texts).toContain('Hello world'); // translated paragraph replaced + expect(texts).toContain('Keep me'); // untouched paragraph preserved + }); + + it('throws when a mapping target is not a string (caller will fall back)', async () => { + const buf = await makeDocx(['a']); + const { paragraphs, zip, documentXml } = await extractParagraphs(buf); + const m = {}; m[paragraphs[0].pIndex] = { not: 'a string' }; + await expect(writeBack(zip, documentXml, m)).rejects.toThrow(); + }); + + it('leaves paragraphs not in the mapping unchanged', async () => { + const buf = await makeDocx(['one', 'two']); + const { paragraphs, zip, documentXml } = await extractParagraphs(buf); + const m = {}; m[paragraphs[1].pIndex] = 'TWO'; + const out = await writeBack(zip, documentXml, m); + const texts = (await extractParagraphs(out)).paragraphs.map(p => p.content); + expect(texts).toEqual(['one', 'TWO']); + }); +}); diff --git a/server/services/docxInplace.js b/server/services/docxInplace.js index d54f214..13ee20e 100644 --- a/server/services/docxInplace.js 
+++ b/server/services/docxInplace.js @@ -1,5 +1,5 @@ const JSZip = require('jszip'); -const { DOMParser } = require('@xmldom/xmldom'); +const { DOMParser, XMLSerializer } = require('@xmldom/xmldom'); /** * Open a .docx buffer, parse word/document.xml, and return the non-empty body @@ -39,4 +39,52 @@ async function extractParagraphs(buffer) { return { paragraphs, zip, documentXml }; } -module.exports = { extractParagraphs }; +/** + * Inject translations into word/document.xml at paragraph level, preserving + * everything else (run/paragraph props, images, tables). For each pIndex in + * `mapping`, the paragraph's FIRST <w:t> receives the translated string (NFC + * normalized) and the remaining <w:t> of that paragraph are blanked. The zip + * is repackaged and returned as a Buffer. + * + * Security/robustness (🔴): the serialized XML is re-parsed for validation; + * any problem throws so the caller can fall back to the flat generator and + * never emits a corrupt file. + * + * @param {import('jszip')} zip - JSZip instance from extractParagraphs + * @param {string} documentXml - raw word/document.xml string + * @param {Object} mapping - pIndex -> translated text + * @returns {Promise<Buffer>} repackaged .docx buffer + */ +async function writeBack(zip, documentXml, mapping) { + const dom = new DOMParser().parseFromString(documentXml, 'text/xml'); + const ps = Array.from(dom.getElementsByTagName('w:p')); + + for (const key of Object.keys(mapping)) { + const text = mapping[key]; + if (typeof text !== 'string') { + throw new Error('mapping target must be a string'); + } + const p = ps[Number(key)]; + if (!p) continue; + const ts = Array.from(p.getElementsByTagName('w:t')); + if (ts.length === 0) continue; + ts[0].textContent = text.normalize('NFC'); + ts[0].setAttribute('xml:space', 'preserve'); + for (const t of ts.slice(1)) { + t.textContent = ''; + } + } + + const outXml = new XMLSerializer().serializeToString(dom); + + // Validate: re-parse and reject anything that didn't round-trip
cleanly. + const check = new DOMParser().parseFromString(outXml, 'text/xml'); + if (!check || check.getElementsByTagName('parsererror').length > 0) { + throw new Error('writeBack produced invalid XML'); + } + + zip.file('word/document.xml', outXml); + return zip.generateAsync({ type: 'nodebuffer' }); +} + +module.exports = { extractParagraphs, writeBack }; From 2eda02bf8375c1749547fc138e43f43c7ade319d Mon Sep 17 00:00:00 2001 From: Creatman Date: Thu, 4 Jun 2026 13:24:25 -0400 Subject: [PATCH 6/6] feat: wire DOCX in-place into worker with flat fallback Co-Authored-By: Claude Opus 4.8 --- server/api/translate.js | 95 ++++++++++++++----- .../__tests__/docxInplace.integration.test.js | 30 ++++++ 2 files changed, 100 insertions(+), 25 deletions(-) create mode 100644 server/services/__tests__/docxInplace.integration.test.js diff --git a/server/api/translate.js b/server/api/translate.js index c2814dc..af394a2 100644 --- a/server/api/translate.js +++ b/server/api/translate.js @@ -14,6 +14,8 @@ const LiteLLMProvider = require('../adapters/ai/LiteLLMProvider'); const { saveResult, getResult, recentUsage } = require('../services/resultStore'); const { validateMagicBytes } = require('../middleware/fileValidation'); const { emitToSession } = require('../socket/rooms'); +const { extractParagraphs, writeBack } = require('../services/docxInplace'); +const { assertDocxSafe } = require('../services/zipGuard'); // Initialize services const documentProcessor = new DocumentProcessor(); @@ -22,6 +24,9 @@ const aiProvider = new LiteLLMProvider(); // Hard cap on the number of segments per document (DoS guard + budget). const MAX_SEGMENTS = Number(process.env.MAX_SEGMENTS) || 1500; +// Cap on the uncompressed DOCX size (zip-bomb guard) for the worker.
+const MAX_DOCX_UNCOMPRESSED = Number(process.env.MAX_DOCX_UNCOMPRESSED_MB || 100) * 1024 * 1024; + // Create the document-processing queue const documentQueue = new Queue('document-processing', { redis: { @@ -62,38 +67,78 @@ documentQueue.on('failed', (job, error) => { documentQueue.process('translate', async (job) => { try { const { filePath, sourceLang, targetLang, originalName } = job.data; - + // Update progress: processing started await job.progress(10); - // Process the document (flat text) - const processed = await documentProcessor.processDocument(filePath, targetLang); - await job.progress(40); - - // Split into blocks and segments, then build the TranslationDocument - const rawBlocks = toBlocks(processed.content); - const { blocks: docBlocks, segments } = buildSegments(rawBlocks); - const doc = await buildTranslationDocument( - { blocks: docBlocks, segments }, - (chunk) => aiProvider.translateBatchAligned(chunk, sourceLang || 'he', targetLang), - { sourceLang: sourceLang || 'he', targetLang, maxSegments: MAX_SEGMENTS, - concurrency: 2, maxPerChunk: 8, maxTokens: 1200, - owner: 'anon', jobId: String(job.id), ts: Date.now(), - onCap: (info) => console.warn(`Segment cap hit: ${info.total} > ${info.cap} (job ${job.id})`) } - ); - await job.progress(80); + const ext = path.extname(filePath).slice(1).toLowerCase(); + let doc = null, outputPath = null, usedInplace = false; + + // DOCX: translate the document in place, preserving the original layout. + // Any error → fall back to the flat path below so the download never breaks.
+ if (ext === 'docx') { + try { + await assertDocxSafe(filePath, MAX_DOCX_UNCOMPRESSED); + const buffer = await fs.readFile(filePath); + const { paragraphs, zip, documentXml } = await extractParagraphs(buffer); + const blocks = paragraphs.map(p => ({ type: 'paragraph', content: p.content })); + const { blocks: docBlocks, segments } = buildSegments(blocks); + if (docBlocks.length !== paragraphs.length) { + throw new Error(`paragraph/segment count mismatch ${docBlocks.length} != ${paragraphs.length}`); + } + await job.progress(40); + doc = await buildTranslationDocument( + { blocks: docBlocks, segments }, + (chunk) => aiProvider.translateBatchAligned(chunk, sourceLang || 'he', targetLang), + { sourceLang: sourceLang || 'he', targetLang, maxSegments: MAX_SEGMENTS, + concurrency: 2, maxPerChunk: 8, maxTokens: 1200, + owner: 'anon', jobId: String(job.id), ts: Date.now(), + onCap: (info) => console.warn(`Segment cap hit: ${info.total} > ${info.cap} (job ${job.id})`) } + ); + await job.progress(80); + const mapping = {}; + docBlocks.forEach((b, i) => { mapping[paragraphs[i].pIndex] = b.sentences.map(s => s.target).join(' '); }); + const outBuf = await writeBack(zip, documentXml, mapping); + outputPath = path.join(path.dirname(filePath), `translated_${crypto.randomUUID()}.docx`); + await fs.writeFile(outputPath, outBuf); + usedInplace = true; + } catch (e) { + console.warn('DOCX in-place failed, falling back to flat:', e.message); + } + } + + // Flat path (always for PDF, fallback for DOCX): extract the text and rebuild + // the translated document from scratch, losing layout but guaranteeing a result.
+ if (!usedInplace) { + // Process the document (flat text) + const processed = await documentProcessor.processDocument(filePath, targetLang); + await job.progress(40); + + // Split into blocks and segments, then build the TranslationDocument + const rawBlocks = toBlocks(processed.content); + const { blocks: docBlocks, segments } = buildSegments(rawBlocks); + doc = await buildTranslationDocument( + { blocks: docBlocks, segments }, + (chunk) => aiProvider.translateBatchAligned(chunk, sourceLang || 'he', targetLang), + { sourceLang: sourceLang || 'he', targetLang, maxSegments: MAX_SEGMENTS, + concurrency: 2, maxPerChunk: 8, maxTokens: 1200, + owner: 'anon', jobId: String(job.id), ts: Date.now(), + onCap: (info) => console.warn(`Segment cap hit: ${info.total} > ${info.cap} (job ${job.id})`) } + ); + await job.progress(80); + + // downloadable file: flatten translated sentences per block + const fileBlocks = doc.blocks.map(b => ({ type: 'text', content: b.sentences.map(s => s.target).join(' ') })); + outputPath = path.join( + path.dirname(filePath), + `translated_${crypto.randomUUID()}${path.extname(filePath)}` + ); + await documentProcessor.generateTranslatedDocument(fileBlocks, outputPath); + } // result for the viewer const resultToken = crypto.randomUUID(); saveResult(resultToken, doc); - - // downloadable file: flatten translated sentences per block - const fileBlocks = doc.blocks.map(b => ({ type: 'text', content: b.sentences.map(s => s.target).join(' ') })); - const outputPath = path.join( - path.dirname(filePath), - `translated_${crypto.randomUUID()}${path.extname(filePath)}` - ); - await documentProcessor.generateTranslatedDocument(fileBlocks, outputPath); await job.progress(100); return { diff --git a/server/services/__tests__/docxInplace.integration.test.js b/server/services/__tests__/docxInplace.integration.test.js new file mode 100644 index 0000000..511bf31 --- /dev/null +++ b/server/services/__tests__/docxInplace.integration.test.js @@ -0,0 +1,30 @@
+import { describe, it, expect } from 'vitest'; +import * as docx from 'docx'; +import { extractParagraphs, writeBack } from '../docxInplace.js'; +import { buildSegments } from '../translationDocument.js'; +import { buildTranslationDocument } from '../pipeline.js'; + +async function makeDocx(paras) { + const d = new docx.Document({ sections: [{ children: paras.map(t => + new docx.Paragraph({ children: [ new docx.TextRun(t) ] })) }] }); + return docx.Packer.toBuffer(d); +} + +// fake batch: uppercases each segment's source +const fakeBatch = async (chunk) => ({ items: chunk.map(s => ({ id: s.id, target: s.source.toUpperCase(), align: [] })), usage: null }); + +describe('docx in-place integration', () => { + it('end-to-end docx in-place with a fake translator', async () => { + const buf = await makeDocx(['hello world', 'second line']); + const { paragraphs, zip, documentXml } = await extractParagraphs(buf); + const blocks = paragraphs.map(p => ({ type: 'paragraph', content: p.content })); + const { blocks: docBlocks, segments } = buildSegments(blocks); + expect(docBlocks.length).toBe(paragraphs.length); + const doc = await buildTranslationDocument({ blocks: docBlocks, segments }, fakeBatch, { sourceLang: 'he', targetLang: 'en' }); + const mapping = {}; + docBlocks.forEach((b, i) => { mapping[paragraphs[i].pIndex] = b.sentences.map(s => s.target).join(' '); }); + const out = await writeBack(zip, documentXml, mapping); + const texts = (await extractParagraphs(out)).paragraphs.map(p => p.content); + expect(texts).toEqual(['HELLO WORLD', 'SECOND LINE']); + }); +});
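Reviewer sketch of the paragraph-level write-back rule exercised by the tests above, reduced to plain arrays so it runs without `jszip` or `@xmldom/xmldom`. Each array element stands in for the text of one `w:t` run; `applyToRuns` is a hypothetical name, and this is a simplified model of `writeBack`, not the function itself.

```javascript
// Hypothetical model: `runs` holds the text of each w:t in one paragraph.
function applyToRuns(runs, target) {
  if (typeof target !== 'string') {
    // Same contract as writeBack: a non-string target is an error,
    // which lets the caller fall back to the flat generator.
    throw new TypeError('mapping target must be a string');
  }
  if (runs.length === 0) return runs; // no text runs (e.g. image-only): leave as-is
  // The first run receives the whole NFC-normalized translation; later runs are blanked.
  return runs.map((_, i) => (i === 0 ? target.normalize('NFC') : ''));
}

console.log(JSON.stringify(applyToRuns(['שלום ', 'עולם'], 'Hello world')));
// ["Hello world",""]
```

Because the run elements are kept and only their text changes, formatting that applied to the later runs is effectively dropped, which matches the design's stated scope (intra-paragraph run formatting is out for v1).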