BACK-475 - Add-Word-(docx)-upload-to-enable-image-extraction-for-pasted-Word-content #648
Open
kuwork wants to merge 3 commits into
(cherry picked from commit e1ea342)
…ted-Word-content (cherry picked from commit c92e3b3)
What
Allow users to upload Word documents (`.docx`) directly into the Web UI editor. The backend extracts text and images as HTML, and the editor converts the result to Markdown with proper image references.

Why
The existing paste-as-markdown feature (BACK-208) cannot extract images from pasted Word content because browser clipboard APIs don't expose embedded images as extractable blobs. With direct `.docx` file upload, mammoth can read the docx archive and extract embedded images to the temp assets directory.

Changes
- `src/core/docx-converter.ts`: New module using `mammoth` to convert `.docx` → HTML. Embedded images are extracted via mammoth's `convertImage` callback and uploaded to `backlog/assets/.temp/` via `AssetManager`.
- `src/server/index.ts`: New `POST /api/docx/convert` endpoint. Accepts `multipart/form-data`, validates the `.docx` extension, and returns `{ html, images, messages }`.
- `src/web/components/PasteAwareMDEditor.tsx`: Added a Word upload button to the editor toolbar (`extraCommands`), drag-and-drop support, and a hidden file picker. Uploads the file to the backend, then runs `cleanHtml` + Turndown in the browser to produce Markdown.
- `src/web/utils/paste-as-markdown.ts`: Extracted `cleanHtml` as an exported async function with a new `keepMedia` option. This lets the docx upload path preserve server-side extracted images while the paste path continues to filter invalid local images.
- `src/web/lib/api.ts`: Added `convertDocx()` API client method.
- `src/test/server-docx-convert.test.ts`: Integration tests for the conversion endpoint (validation, conversion, image extraction to temp directory).
- New dependency: `mammoth` for docx parsing.

How it works
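As a rough illustration of two server-side details in this flow (the `.docx` extension check and the UUID filenames for extracted images), here is a minimal sketch. `isDocxFilename`, `tempImageName`, and the content-type table are hypothetical names invented for this example and are not taken from the PR; the actual endpoint and `AssetManager` may handle both differently.

```typescript
import { randomUUID } from "node:crypto";

// Hypothetical helper: accept only filenames that end in ".docx"
// (case-insensitive) and have at least one character before the extension.
function isDocxFilename(name: string): boolean {
  return /.+\.docx$/i.test(name.trim());
}

// Assumed mapping from image content type to file extension.
// The real converter may support a different set of types.
const EXTENSION_BY_TYPE: Record<string, string> = {
  "image/png": "png",
  "image/jpeg": "jpg",
  "image/gif": "gif",
};

// Hypothetical helper: build a collision-resistant filename for an image
// written to the temp assets directory. Unknown types fall back to ".bin".
function tempImageName(contentType: string): string {
  const extension = EXTENSION_BY_TYPE[contentType] ?? "bin";
  return `${randomUUID()}.${extension}`;
}

// Example usage (the generated name differs on every call):
console.log(isDocxFilename("Quarterly Report.DOCX"));
console.log(tempImageName("image/png"));
```

Checking only the extension is a cheap first gate; a stricter implementation could also confirm that the upload is a valid ZIP archive before handing it to the converter.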
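1. The user clicks the Word upload button or drags a `.docx` file onto the editor.
2. The frontend sends the file to `POST /api/docx/convert`.
3. The backend converts it with mammoth and saves embedded images to `.temp/` with UUID filenames.
4. The backend returns `{ html, images, messages }`.
5. The frontend runs `cleanHtml(html, { keepMedia: true })` to normalize Word HTML (flatten table cells, convert mso-lists, strip classes, etc.) while preserving `<img>` tags.
6. The existing `POST /api/assets/promote` flow promotes temp images to the permanent paste directory.

Testing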
- `bun test src/test/server-docx-convert.test.ts`: backend endpoint tests (4 pass)
- `bun test src/test/build.test.ts`: CLI compile still works (no jsdom in bundle)
- `bunx tsc --noEmit`: type check passes

Closes BACK-475
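
Reviewer note: the `keepMedia` split between the paste path and the docx upload path can be pictured with the sketch below. It is only an illustration under stated assumptions: `filterLocalImages` is a hypothetical stand-in, not the real `cleanHtml`; it uses a regular expression rather than DOM parsing; and it assumes that an "invalid local image" means an `<img>` whose `src` is a `file:` URL.

```typescript
// Hypothetical options object mirroring the keepMedia flag from the PR.
interface FilterOptions {
  keepMedia?: boolean;
}

// Matches an <img> tag and captures the value of its quoted src attribute.
const IMG_TAG = /<img\b[^>]*\bsrc\s*=\s*["']([^"']*)["'][^>]*>/gi;

// Hypothetical stand-in for the image-filtering step:
// - keepMedia: true  -> leave every <img> untouched (docx upload path)
// - keepMedia: false -> drop <img> tags pointing at file: URLs (paste path)
function filterLocalImages(html: string, options: FilterOptions = {}): string {
  if (options.keepMedia) {
    return html;
  }
  return html.replace(IMG_TAG, (tag: string, src: string) =>
    src.toLowerCase().startsWith("file:") ? "" : tag,
  );
}

const pasted = '<p>Intro</p><img src="file:///C:/Temp/image001.png">';
console.log(filterLocalImages(pasted)); // prints <p>Intro</p>
console.log(filterLocalImages(pasted, { keepMedia: true }) === pasted); // prints true
```

Defaulting `keepMedia` to off keeps the existing paste behavior unchanged, so only the docx upload call site needs to opt in.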