Add an auto-generated unicode character category file #4605
TheBlueMatt wants to merge 2 commits into lightningdevkit:main
👋 Thanks for assigning @tnull as a reviewer!
1a01b5a added detection of unicode format characters in `PrintableString`, but used a hard-coded table which may eventually become out of date. Here we switch to an auto-generated table, include all `General_Category` `Other` characters, and also ban unallocated code points. Finally, CI validates that the file is kept up to date. Written by Claude
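To make the description concrete, here is a minimal sketch of the kind of filtering it describes. It assumes flagged characters are replaced with U+FFFD when displayed. The names `is_other_sample` and `Sanitized` are hypothetical, and the hand-written ranges are a small, well-known subset of `General_Category` `Other`, not the generated table from this PR.

```rust
use core::fmt::{self, Write};

/// Hypothetical stand-in for the generated lookup: a few well-known
/// General_Category "Other" code points only, NOT the full table.
fn is_other_sample(c: char) -> bool {
    matches!(
        c,
        '\u{0000}'..='\u{001F}'      // Cc: C0 controls
            | '\u{007F}'..='\u{009F}' // Cc: DEL and C1 controls
            | '\u{00AD}'              // Cf: soft hyphen
            | '\u{200B}'..='\u{200F}' // Cf: zero-width chars and direction marks
            | '\u{202A}'..='\u{202E}' // Cf: bidi embedding/override controls
            | '\u{E000}'..='\u{F8FF}' // Co: BMP private use area
    )
}

/// Illustrative wrapper (the name is an assumption) that prints a string
/// with every flagged character replaced by U+FFFD.
struct Sanitized<'a>(&'a str);

impl<'a> fmt::Display for Sanitized<'a> {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        for c in self.0.chars() {
            let c = if is_other_sample(c) { char::REPLACEMENT_CHARACTER } else { c };
            f.write_char(c)?;
        }
        Ok(())
    }
}
```

With this sketch, `Sanitized("a\u{202E}b").to_string()` keeps `a` and `b` but swaps the bidi override for the replacement character, while ordinary text such as `héllo` passes through unchanged.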
b6f8c03 to bd75483
    let is_other = is_unicode_general_category_other(c);
    let is_unassigned = is_unicode_general_category_unassigned(c);
    let c = if c.is_control() || is_other || is_unassigned {
Nit: `c.is_control()` is now fully redundant: it checks `Cc` (Control), which is already covered by `is_unicode_general_category_other` (see `0x0000..=0x001F` and `0x007F..=0x009F` in `unicode.rs`). The old code needed it because `is_format_char` only covered `Cf`, but the new function covers all of `Cc` / `Cf` / `Cs` / `Co`.
Not a bug (the `||` short-circuits harmlessly), but it's potentially confusing because it suggests `is_other` doesn't handle control characters.
    - let is_other = is_unicode_general_category_other(c);
    - let is_unassigned = is_unicode_general_category_unassigned(c);
    - let c = if c.is_control() || is_other || is_unassigned {
    + let c = if is_unicode_general_category_other(c) || is_unicode_general_category_unassigned(c) {
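The redundancy claim can be checked mechanically. The sketch below assumes a sorted range table searched with binary search. `OTHER_RANGES_SAMPLE`, `in_ranges`, and `uncovered_control_chars` are hypothetical names, and the table is an illustrative excerpt containing the two `Cc` ranges cited in the comment, not the contents of `unicode.rs`.

```rust
use core::cmp::Ordering;

/// Illustrative excerpt only: sorted, non-overlapping inclusive ranges.
/// The first two are the Cc ranges cited in the review comment.
const OTHER_RANGES_SAMPLE: &[(u32, u32)] = &[
    (0x0000, 0x001F), // Cc: C0 controls
    (0x007F, 0x009F), // Cc: DEL and C1 controls
    (0x00AD, 0x00AD), // Cf: soft hyphen
    (0xE000, 0xF8FF), // Co: BMP private use area
];

/// Hypothetical lookup: binary search for a range containing `c`.
fn in_ranges(c: char, ranges: &[(u32, u32)]) -> bool {
    let cp = c as u32;
    ranges
        .binary_search_by(|&(lo, hi)| {
            if hi < cp {
                Ordering::Less // range lies entirely below the code point
            } else if lo > cp {
                Ordering::Greater // range lies entirely above the code point
            } else {
                Ordering::Equal // code point falls inside this range
            }
        })
        .is_ok()
}

/// Counts chars that `char::is_control` flags but the table misses.
fn uncovered_control_chars(ranges: &[(u32, u32)]) -> usize {
    (0..=0x10FFFFu32)
        .filter_map(char::from_u32)
        .filter(|c| c.is_control() && !in_ranges(*c, ranges))
        .count()
}
```

If `uncovered_control_chars` returns zero for the real table, dropping `c.is_control()` from the condition cannot change behavior. A check like this could live in a unit test next to the generated file.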
Review Summary
The CI workflow file ( Inline comments posted:
Previously flagged (still applicable):
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4605      +/-   ##
==========================================
+ Coverage   86.09%   86.16%   +0.07%
==========================================
  Files         157      158       +1
  Lines      108828   109323     +495
  Branches   108828   109323     +495
==========================================
+ Hits        93694    94199     +505
+ Misses      12519    12509      -10
  Partials     2615     2615
tnull left a comment
Alright, seems maybe a bit excessive, but fine by me. No need to check this on every CI run though, let's just move it to a weekly cronjob?
    - name: Run default clippy linting
      run: |
        ./ci/check-lint.sh
    - name: Check Unicode general-category table is up to date
Is it worth doing this every time we run CI? Can't we rather add it as a weekly cronjob?

I mean, not sure it matters that much, but done.
tnull left a comment
Looks good, please squash.
      schedule:
        - cron: '42 3 * * *'

      check-unicode:
Bug: Missing `jobs:` key. `check-unicode` is nested under `on:` (the trigger block) instead of under its own top-level `jobs:` key. GitHub Actions will either reject this workflow or silently run zero jobs.
Every other workflow in this repo has the correct structure:
    on:
      ...
    jobs:
      check-unicode:
        ...
You need to add `jobs:` at the top level (zero indent) before `check-unicode:`, and move `check-unicode:` out of the `on:` block so that it sits under `jobs:`.
    steps:
      - name: Check unicode file state
Bug: Missing `actions/checkout` step. Without checking out the repository, neither `contrib/gen_unicode_general_category.py` nor `lightning-types/src/unicode.rs` will exist on the runner, so the run step will fail immediately.
Every other workflow in this repo checks out first. Add:
    steps:
      - uses: actions/checkout@v4
      - name: Check unicode file state

              RUN_URL="https://github.com/${{ github.repository }}/actions/runs/${{ github.run_id }}"
              BODY="The unicode character listing is out of date, see $RUN_URL"
              gh issue create --title "$TITLE" --body "$BODY"
            fi
Bug: `fi` is indented at 8 spaces, but the YAML literal block scalar (`run: |`) has its content indented at 10 spaces. In YAML, once the block scalar's indentation is established by the first content line, any line with less indentation terminates the block. So `fi` falls outside the shell script, producing:
- A shell syntax error (unterminated `if`)
- A YAML parse error (`fi` isn't a valid key at this level)
Fix by indenting `fi` to 10 spaces (matching the rest of the script body):
    -         fi
    +           fi
Claude's reviews seem legit though