@tangledhelix
Member

Ports a feature from PPtools: find page numbers in various formats and display them (roman first, arabic next).

It understands id attribute formats such as Page_1, page_1, and page1, and also attempts to parse numbers from <span class="pagenum"> tags (p. 1, [Pg 1], etc.).
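A minimal sketch of how this kind of detection could work, for readers who want something concrete. It is not the code from this PR: the names `ID_RE`, `TEXT_RE`, `PageLabelParser`, and `find_page_labels` are hypothetical, the patterns are guesses based only on the formats listed above, and it uses the standard-library `re` instead of the `regex` module that pphtml.py imports.

```python
import re
from html.parser import HTMLParser

# Assumed patterns (not taken from the PR): id values such as Page_1,
# page_1, page1, Page_ii, and the trailing number in pagenum text such
# as "p. 1" or "[Pg 1]".
ID_RE = re.compile(r"^page_?([0-9]+|[ivxlcdm]+)$", re.IGNORECASE)
TEXT_RE = re.compile(r"\b([0-9]+|[ivxlcdm]+)\b\W*$", re.IGNORECASE)


class PageLabelParser(HTMLParser):
    """Collect page labels from id attributes and pagenum spans."""

    def __init__(self):
        super().__init__()
        self.labels = []          # labels in document order, no duplicates
        self._in_pagenum = False  # True while inside <span class="pagenum">

    def _add(self, label):
        if label not in self.labels:
            self.labels.append(label)

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        match = ID_RE.match(attrs.get("id") or "")
        if match:
            self._add(match.group(1))
        if tag == "span" and "pagenum" in (attrs.get("class") or "").split():
            self._in_pagenum = True

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_pagenum = False

    def handle_data(self, data):
        if self._in_pagenum:
            match = TEXT_RE.search(data)
            if match:
                self._add(match.group(1))


def find_page_labels(html):
    """Return the page labels found in an HTML string, as strings."""
    parser = PageLabelParser()
    parser.feed(html)
    parser.close()
    return parser.labels
```

The labels come back as strings, so a later step would still need to split them into roman and arabic and convert them to integers before looking for gaps.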

Example of what this looked like in PPTools:
[Image: Screenshot 2025-12-31 at 12 19 04 AM]

Example from this change:

----- document info ------------------------------------------------------------
[info] page numbers ( roman): i–xi
[info] page numbers (arabic): 13–168

Can display multiple ranges if numbers are missing from the sequence:

----- document info ------------------------------------------------------------
[info] page numbers ( roman): i–iii, v–xi
[info] page numbers (arabic): 13–23, 25–45, 47–168
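The gap handling shown above can be sketched as a small grouping step. This is only an illustration under the assumption that page numbers have already been converted to integers; `group_ranges` is a hypothetical name and not a function from this PR.

```python
def group_ranges(numbers):
    """Collapse integers into (first, last) runs of consecutive values."""
    runs = []
    for n in sorted(set(numbers)):
        if runs and n == runs[-1][1] + 1:
            # n continues the current run, so extend its upper bound.
            runs[-1] = (runs[-1][0], n)
        else:
            # A gap (or the first value) starts a new run.
            runs.append((n, n))
    return runs
```

With pages 24 and 46 missing from 13 through 168, this yields three runs, matching the three arabic ranges in the sample output.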

@tangledhelix
Member Author

Note that this adds to requirements.py: the roman module is now required.

ports a feature from PPtools: find page numbers in various formats and
display them (roman first, arabic next). understands different formats
like Page_1, page_1, page1 (in id attribute) and attempts to parse out
span class=pagenum formats like p. 1, [Pg 1] and so forth.
pphtml.py Outdated
import roman
from time import strftime
from html.parser import HTMLParser
import regex as re # for unicode support (pip install regex)
Member

Can we follow the convention of grouping the built-in packages together, a newline, and then the 3rd party?

import sys
import os
import argparse
import itertools
from time import strftime
from html.parser import HTMLParser

import regex as re  # for unicode support
import roman
from PIL import Image

Ideally each would be alpha-sorted but I'm not going to get wound around the axle about it.

Member Author

done! and sorted :)

@cpeel cpeel requested a review from srjfoo December 31, 2025 19:01
cpeel pushed a commit to DistributedProofreaders/ppwb that referenced this pull request Dec 31, 2025
@tangledhelix
Member Author

I just pushed an update that will show 3 instead of 3–3 when a page range is only one page long.
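One way the single-page case could be handled when formatting, as a rough sketch. `format_ranges` and its `fmt` parameter are hypothetical names, not the PR's code, and the input is assumed to be a list of `(first, last)` integer pairs.

```python
def format_ranges(runs, fmt=str):
    """Join (first, last) runs as text, collapsing one-page runs."""
    parts = []
    for first, last in runs:
        if first == last:
            # A run of one page is shown as "3", not "3–3".
            parts.append(fmt(first))
        else:
            # En dash between the endpoints, as in the sample output.
            parts.append(f"{fmt(first)}–{fmt(last)}")
    return ", ".join(parts)
```

For the roman line, `fmt` could be something like `lambda n: roman.toRoman(n).lower()` from the roman module; the sketch defaults to `str` so it runs without third-party packages.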

@tangledhelix
Member Author

I ran all of my own projects through this and it worked for all of them except for a couple that have some weird issue uploading. I don't think that's related to this change, though.

My projects have gone through Guiguts 1, ppgen, and Guiguts 2, so I think that shows this is able to handle all of those styles for page number markup.

Member

@cpeel cpeel left a comment

A sandbox with this code is available for testing at https://www.pgdp.org/~cpeel/ppwb/pphtml.php
