fix: preserve opaque <stem> contents in Sanitizer (MathML round-trip) #117

Open
opoudjis wants to merge 1 commit into lutaml-integration from fix/sanitizer-preserves-stem-opaque-content


Closes #116

See the linked issue for the full diagnosis, reproducer, and root-cause walk-through. This PR is the proposed fix.

Change

lib/relaton/bib/sanitizer.rb:

  • New OPAQUE = %w[stem].freeze set, listing elements whose children are non-basicdoc inline notation (MathML, AsciiMath, LaTeX, …) and must be preserved verbatim rather than sanitised against ALLOWED.
  • sanitize_children short-circuits before recursing into OPAQUE elements (after the RENAME pass, before the recursive descent), so their inner XML survives the walk untouched.
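The short-circuit described above can be sketched roughly as follows. This is a minimal illustration, not the code in lib/relaton/bib/sanitizer.rb: it uses REXML from the Ruby distribution instead of Nokogiri so that it is self-contained, it omits the RENAME pass, and the contents of `ALLOWED` as well as the `unwrap` and `sanitize` helpers are assumptions made for the example.

```ruby
require "rexml/document"

# Hypothetical allow-list for this sketch; the real ALLOWED set lives in sanitizer.rb.
ALLOWED = %w[em strong sub sup stem].freeze
# Elements whose inner XML is foreign notation and must be kept verbatim.
OPAQUE = %w[stem].freeze

# Hypothetical helper: replace an element with its own child nodes.
def unwrap(element)
  parent = element.parent
  # `children` returns a copy, so moving nodes while iterating is safe.
  element.children.each { |node| parent.insert_before(element, node) }
  element.remove
end

# Walk the element children of `parent`, unwrapping anything not in ALLOWED,
# but never descending into OPAQUE elements.
def sanitize_children(parent)
  parent.elements.to_a.each do |child|
    next if OPAQUE.include?(child.name) # leave the inner XML untouched
    sanitize_children(child)
    unwrap(child) unless ALLOWED.include?(child.name)
  end
end

# Hypothetical entry point: parse, sanitise, and serialise the root element.
def sanitize(xml)
  doc = REXML::Document.new(xml)
  sanitize_children(doc.root)
  doc.root.to_s
end
```

With this sketch, `<math>` inside `<stem>` is kept as markup, while the same `<math>` placed directly under a non-opaque parent is still unwrapped to its text.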

spec/relaton/bib/sanitizer_spec.rb:

  • Three new examples covering the opaque-<stem> cases — MathML inner elements survive, <stem> attributes plus deeply nested MathML survive, and siblings of <stem> are still sanitised (opacity is local).
  • Assertions use include-style matches because Nokogiri's serialiser may reflow whitespace around nested elements; the semantic claim is "inner elements survive, not just their text content".
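To illustrate why include-style matches are more robust than whole-string equality here, the following sketch compares two serialisations of the same tree that differ only in whitespace. The sample strings and the `fragments_survive?` helper are hypothetical and are not taken from the spec file.

```ruby
# Two serialisations of the same tree; only the whitespace differs.
compact  = "<stem><math><mi>x</mi><mn>2</mn></math></stem>"
reflowed = "<stem>\n  <math>\n    <mi>x</mi>\n    <mn>2</mn>\n  </math>\n</stem>"

# Hypothetical helper: true when every expected fragment appears in the output.
def fragments_survive?(xml, fragments)
  fragments.all? { |fragment| xml.include?(fragment) }
end

# Leaf-level fragments are unaffected by reflowed whitespace between elements.
expected = ["<math>", "<mi>x</mi>", "<mn>2</mn>"]

compact_ok  = fragments_survive?(compact, expected)
reflowed_ok = fragments_survive?(reflowed, expected)
# An output that kept only the text content would not pass the same check.
text_only_ok = fragments_survive?("<stem>x2</stem>", expected)
```

An exact comparison of `compact` and `reflowed` would fail even though both preserve the inner elements, whereas the fragment check accepts both and still rejects the text-only output.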

Verification

bundle exec rspec spec/relaton/bib/sanitizer_spec.rb: 36 examples, 0 failures.

Target branch

lutaml-integration — the 2.x line lives on this branch (main is still 1.x, last tagged v1.20.8).

Downstream

The regression that prompted this fix is in metanorma/metanorma-generic: the spec "Metanorma::Generic::BibdataConfig preserves embedded MathML when deserialising a bibdata title" fails until this lands and a new relaton-bib release ships. No Gemfile.devel pointer is being added to metanorma-generic because no metanorma-generic release is pending this cycle; its spec stays red until the next release-cycle gem bump.


Closes #116

The recursive sanitiser stripped MathML / AsciiMath / LaTeX from inside
<stem> elements because it descended into every child element before
checking the ALLOWED allow-list. <stem> itself is ALLOWED, but its
children (<math>, <mstyle>, <msub>, <mi>, <mn>, <asciimath>, …) are not,
so they were unwrapped to bare text. End-to-end symptom: bibliographic
titles with embedded math lost their notation content on YAML
round-trip, with only the text nodes surviving alongside their
whitespace skeleton.

Fix: introduce an OPAQUE set ({<stem>} for now) listing elements whose
contents are out-of-band inline notation and must be preserved verbatim.
sanitize_children skips recursion into OPAQUE elements after the rename
pass, so their inner XML survives untouched.

The downstream regression that prompted this is in
metanorma/metanorma-generic — the spec
"Metanorma::Generic::BibdataConfig preserves embedded MathML when
deserialising a bibdata title" fails until this lands and a new
relaton-bib release ships.

Three new sanitizer specs cover:
  - MathML inner elements survive inside <stem> (do not unwrap to text)
  - <stem> attributes plus deeply nested MathML survive together
  - siblings of <stem> are still sanitised; <stem>'s opacity is local

Assertions use include-style matches because Nokogiri's serialiser may
reflow whitespace around nested elements; the semantic claim is "inner
elements survive, not just their text content".