
Improve squeeze efficiency for large files #440

Open
luantak wants to merge 2 commits into johnwhitington:master from luantak:squeeze_optimization

Conversation

@luantak

@luantak luantak commented Mar 17, 2026

This makes squeeze (-squeeze and -squeeze-no-pagedata) much faster for huge PDF files. There is a moderate speedup for smaller files.

May also improve file size a little on some files.

I can provide concrete numbers for my improvement claims if necessary.

I'm not sure if all of this code belongs in cpdf or if some of it should go in camlpdf.

luantak added 2 commits March 17, 2026 18:38
…rgeted cleanup

Use new hash tables to bucket normalized objects and cache stream-content hashes, reducing repeated comparisons and avoiding unnecessary stream materialization during duplicate detection.

Track rewritten page streams and only run unreferenced-object cleanup and follow-up deduplication when page rewrites or recompression actually changed the PDF.
@johnwhitington
Owner

Thanks. I'll take a look soon.

But can you give me a couple of paragraphs of description to help me navigate the patch, please?

@luantak
Author

luantak commented Mar 18, 2026

Previously, cpdf squeeze found duplicate objects in a fairly expensive way. It would hash objects, sort and group them, and then do direct comparisons on the candidates. For stream objects, those comparisons often meant pulling in the full stream data just to decide whether two objects were actually the same.

The new version makes that duplicate-checking process much more selective. Instead of going quickly from “maybe similar” to “compare the whole object,” it filters things in stages.

First, it puts objects into hash-table buckets using a cheap normalized representation, so only objects that already look alike are grouped together. Anything that ends up alone is discarded immediately.
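Roughly, the bucketing stage works like this (a simplified sketch; `bucket_candidates` and `normalize` are illustrative names, not identifiers from the patch):

```ocaml
(* Group objects by a cheap normalized key, then drop every singleton
   bucket: an object alone in its bucket cannot have a duplicate. *)
let bucket_candidates normalize objs =
  let buckets = Hashtbl.create 64 in
  List.iter
    (fun obj ->
       let key = normalize obj in
       let group = try Hashtbl.find buckets key with Not_found -> [] in
       Hashtbl.replace buckets key (obj :: group))
    objs;
  (* Keep only buckets with at least two members; all further (more
     expensive) comparisons happen inside these small groups. *)
  Hashtbl.fold
    (fun _ group acc -> if List.length group > 1 then group :: acc else acc)
    buckets []
```

Only objects sharing a normalized key ever reach the later, costlier stages.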

For streams, it then refines those groups again using stronger hashes based on both the stream metadata and the stream contents, with those content hashes cached.

After that, actual equality checks only happen inside these much smaller groups. And even there, stream objects are checked cheaply first by comparing normalized dictionaries and lengths before the code touches the full byte data.
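The staged stream comparison can be sketched like this (assumed, simplified types and names; the real patch works on camlpdf's object representation, and the hash function here is plain MD5 via OCaml's Digest module):

```ocaml
type stream =
  { dict : (string * string) list;  (* normalized stream dictionary *)
    len : int;                      (* declared length *)
    bytes : string Lazy.t }         (* materialized only on demand *)

(* Content hashes are cached per object id, so each stream's bytes are
   forced and hashed at most once across all comparisons. *)
let hash_cache : (int, Digest.t) Hashtbl.t = Hashtbl.create 64

let content_hash id s =
  match Hashtbl.find_opt hash_cache id with
  | Some h -> h
  | None ->
      let h = Digest.string (Lazy.force s.bytes) in
      Hashtbl.add hash_cache id h;
      h

let streams_equal (id1, s1) (id2, s2) =
  (* Stage 1: cheap checks on length and normalized dictionary reject
     most non-matches without touching the stream data at all. *)
  s1.len = s2.len && s1.dict = s2.dict
  (* Stage 2: only then compare the (cached) content hashes. *)
  && content_hash id1 s1 = content_hash id2 s2
```

Because `&&` short-circuits, a mismatch in length or dictionary means the stream bytes are never decoded for that pair.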

In summary, duplicate detection no longer does expensive work on broad sets of objects. It narrows the field aggressively, rejects most non-matches early, and only pays the full comparison cost for the few objects that have already passed several cheap tests.

@johnwhitington
Owner

johnwhitington commented Mar 27, 2026

Manual testing shows big speed improvements on big files. Great! And all test outputs appear to open in PDF viewers. Manual inspection will be needed to see that they are fully valid.

But, with this patch, on Cpdf's (normal sized) test files:

$ time ./cpdftest -squeeze all >foo 2>&1

real	0m52.999s
user	0m51.539s
sys	0m0.898s

$ du -h PDFResults/
165M	PDFResults/squeeze
165M	PDFResults/

And with vanilla v2.9:

$ time ./cpdftest -squeeze all >foo 2>&1

real	0m25.396s
user	0m23.997s
sys	0m0.858s

$ du -h PDFResults/
184M	PDFResults/squeeze
184M	PDFResults/

So the time is doubled, but with a 10% size improvement. These are two separate issues - I need to understand why the output file sizes are smaller, because that probably exposes a bug in the existing code. Then we can look at which of your new methods can be applied to improve large-file speeds without degrading speeds on ordinary-sized files.

@luantak
Author

luantak commented May 7, 2026

@johnwhitington Sorry, I had notifications off for this thread for some reason.

Because it was a while ago this may not be the full story, but I think the file size improvements mainly came from these changes:

  1. Page streams and Form XObjects are now only rewritten if the recompressed, rewritten version is smaller than or equal to the original. Previously cpdf already rewrote page data and XObjects, but it did not do this size check, which meant the old logic could replace a compact original stream with a larger normalized stream.
  2. Inherited resources are now handled when parsing page/Form content. The old code had an explicit FIXME saying inherited resources had been tried before and reverted because sizes went up. This version reintroduces inherited-resource support, but combines it with the "only rewrite if smaller" protection, which makes it safer. This allows more streams/forms to be parsed and normalized without the previous size-regression problem.
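The size guard in point 1 amounts to something like this (a minimal sketch; `recompress` is a stand-in for cpdf's actual rewrite-and-recompress step, not a function from the patch):

```ocaml
(* Only accept the rewritten stream when it is no larger than the
   original; otherwise keep the original bytes untouched. This is what
   prevents normalization from ever growing a file. *)
let maybe_rewrite recompress original =
  let rewritten = recompress original in
  if String.length rewritten <= String.length original then rewritten
  else original
```

It is this guard that makes reintroducing inherited-resource handling safe: any stream whose normalized form comes out larger simply stays as it was.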

As for the time regression on your test PDFs, I have some ideas about what might cause it, but I still need to test them. Can you send me those files so I can investigate?

@johnwhitington
Owner

Thanks. I'll take another look before the next release.

The speed regression was measured over the total of all our test files (which cannot be made public, I'm afraid, because they come from customers).
