
Improve squeeze efficiency for large files #440

Open
luantak wants to merge 2 commits into johnwhitington:master from luantak:squeeze_optimization

Conversation

@luantak

@luantak luantak commented Mar 17, 2026

This makes squeeze (-squeeze and -squeeze-no-pagedata) much faster for huge PDF files. There is a moderate speedup for smaller files.

May also improve file size a little on some files.

I can provide concrete numbers for my improvement claims if necessary.

I'm not sure if all of this code belongs in cpdf or if some of it should go in camlpdf.

luantak added 2 commits March 17, 2026 18:38
…rgeted cleanup

Use new hash tables to bucket normalized objects and cache stream-content hashes, reducing repeated comparisons and avoiding unnecessary stream materialization during duplicate detection.

Track rewritten page streams and only run unreferenced-object cleanup and follow-up deduplication when page rewrites or recompression actually changed the PDF.
@johnwhitington
Owner

Thanks. I'll take a look soon.

But can you give me a couple of paragraphs of description to help me navigate the patch, please?

@luantak
Author

luantak commented Mar 18, 2026

Previously, cpdf squeeze found duplicate objects in a fairly expensive way. It would hash objects, sort and group them, and then do direct comparisons on the candidates. For stream objects, those comparisons often meant pulling in the full stream data just to decide whether two objects were actually the same.

The new version makes that duplicate-checking process much more selective. Instead of going quickly from “maybe similar” to “compare the whole object,” it filters things in stages.

First, it puts objects into hash-table buckets using a cheap normalized representation, so only objects that already look alike are grouped together. Anything that ends up alone is discarded immediately.
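Roughly, the bucketing stage works like this (a simplified sketch; `bucket_candidates` and `normalize` are illustrative names, not identifiers from the patch):

```ocaml
(* Group objects by a cheap normalized key, then drop every singleton
   bucket: an object alone in its bucket cannot have a duplicate. *)
let bucket_candidates normalize objs =
  let buckets = Hashtbl.create 64 in
  List.iter
    (fun obj ->
       let key = normalize obj in
       let group = try Hashtbl.find buckets key with Not_found -> [] in
       Hashtbl.replace buckets key (obj :: group))
    objs;
  (* Keep only buckets with at least two members; all further (more
     expensive) comparisons happen inside these small groups. *)
  Hashtbl.fold
    (fun _ group acc -> if List.length group > 1 then group :: acc else acc)
    buckets []
```

Only objects sharing a normalized key ever reach the later, costlier stages.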

For streams, it then refines those groups again using stronger hashes based on both the stream metadata and the stream contents, with those content hashes cached.

After that, actual equality checks only happen inside these much smaller groups. And even there, stream objects are checked cheaply first by comparing normalized dictionaries and lengths before the code touches the full byte data.
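The staged stream comparison can be sketched like this (assumed, simplified types and names; the real patch works on camlpdf's object representation, and the hash function here is plain MD5 via OCaml's Digest module):

```ocaml
type stream =
  { dict : (string * string) list;  (* normalized stream dictionary *)
    len : int;                      (* declared length *)
    bytes : string Lazy.t }         (* materialized only on demand *)

(* Content hashes are cached per object id, so each stream's bytes are
   forced and hashed at most once across all comparisons. *)
let hash_cache : (int, Digest.t) Hashtbl.t = Hashtbl.create 64

let content_hash id s =
  match Hashtbl.find_opt hash_cache id with
  | Some h -> h
  | None ->
      let h = Digest.string (Lazy.force s.bytes) in
      Hashtbl.add hash_cache id h;
      h

let streams_equal (id1, s1) (id2, s2) =
  (* Stage 1: cheap checks on length and normalized dictionary reject
     most non-matches without touching the stream data at all. *)
  s1.len = s2.len && s1.dict = s2.dict
  (* Stage 2: only then compare the (cached) content hashes. *)
  && content_hash id1 s1 = content_hash id2 s2
```

Because `&&` short-circuits, a mismatch in length or dictionary means the stream bytes are never decoded for that pair.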

In summary, duplicate detection no longer does expensive work on broad sets of objects. It narrows the field aggressively, rejects most non-matches early, and only pays the full comparison cost for the few objects that have already passed several cheap tests.

@johnwhitington
Owner

johnwhitington commented Mar 27, 2026

Manual testing shows big speed improvements on big files. Great! And all test outputs appear to open in PDF viewers. Manual inspection will be needed to see that they are fully valid.

But, with this patch, on Cpdf's (normal sized) test files:

$ time ./cpdftest -squeeze all >foo 2>&1

real	0m52.999s
user	0m51.539s
sys	0m0.898s

$ du -h PDFResults/
165M	PDFResults/squeeze
165M	PDFResults/

And with vanilla v2.9:

$ time ./cpdftest -squeeze all >foo 2>&1

real	0m25.396s
user	0m23.997s
sys	0m0.858s

$ du -h PDFResults/
184M	PDFResults/squeeze
184M	PDFResults/

So the time is doubled, but with a 10% size improvement. These are two separate issues - I need to understand why the output file sizes are smaller, because that probably exposes a bug in the existing code. Then we can look at which of your new methods can be applied to improve large-file speeds without degrading speeds on ordinary-sized files.

@luantak
Author

luantak commented May 7, 2026

@johnwhitington Sorry, I had notifications off for this thread for some reason.

Because it was a while ago this may not be the full story, but I think the file size improvements mainly came from these changes:

  1. Page streams and Form XObjects are now only rewritten if the recompressed, rewritten version is smaller than or equal to the original. Previously cpdf already rewrote page data and XObjects, but it did not do this size check, which meant the old logic could replace a compact original stream with a larger normalized stream.
  2. Inherited resources are now handled when parsing page/Form content. The old code had an explicit FIXME saying inherited resources had been tried before and reverted because sizes went up. This version reintroduces inherited-resource support, but combines it with the "only rewrite if smaller" protection, which makes it safer. This allows more streams/forms to be parsed and normalized without the previous size-regression problem.
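The size guard in point 1 amounts to something like this (a minimal sketch; `recompress` is a stand-in for cpdf's actual rewrite-and-recompress step, not a function from the patch):

```ocaml
(* Only accept the rewritten stream when it is no larger than the
   original; otherwise keep the original bytes untouched. This is what
   prevents normalization from ever growing a file. *)
let maybe_rewrite recompress original =
  let rewritten = recompress original in
  if String.length rewritten <= String.length original then rewritten
  else original
```

It is this guard that makes reintroducing inherited-resource handling safe: any stream whose normalized form comes out larger simply stays as it was.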

As for the time regression on your test PDFs, I have some ideas about what might cause it, but I still need to test them. Can you send me those files so I can investigate?

@johnwhitington
Owner

Thanks. I'll take another look before the next release.

The speed regression was measured over the total of all our test files (which cannot be made public, I'm afraid, because they come from customers).
