Allow <fn> in Sanitizer to preserve footnotes in bibliographic titles #115
The Sanitizer's ALLOWED set was strict basicdoc PureTextElement plus <p>, <eref>, <xref>. <fn> was not in the list, so the content= setter on LocalizedMarkedUpString (used by Title, Note, Abstract, and other text-bearing fields) unwrapped <fn> at assignment time and kept only its inner <p> body. Downstream consumers (relaton-render and isodoc) were left with an orphan <p> they could not place, producing a visible regression in formattedref rendering of ISO-style titles with footnotes (e.g. isodoc spec/isodoc/footnotes_spec.rb:4).
Adds <fn> to ALLOWED. <fn> is, strictly speaking, not in basicdoc PureTextElement, but it is a legitimate child of <title> in real Metanorma bibliographic input (ISO disclaimer footnotes), is in relaton-render's own inline-tag allow-list, and the existing "plus <p>, <eref>, <xref>" carve-out already concedes that the Sanitizer needs to be a touch broader than strict PureTextElement to handle real bibliographic content.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@andrew2net, could we please cut a v2.1.4 release from […] The concrete downstream impact: I've been carrying it as a […]
Summary
Adds <fn> to Relaton::Bib::Sanitizer::ALLOWED so that footnotes embedded in bibliographic titles (and other LocalizedMarkedUpString content fields) round-trip through content= instead of being unwrapped at assignment time.
Background: why this matters
The sanitiser landed in 4cfd83b ("Sanitize LocalizedMarkedUpString#content to basicdoc PureTextElement", referencing relaton/relaton-doi#21) on the lutaml-integration branch, shipped in 2.1.2 / 2.1.3. It prepends a ContentSanitization module onto the content= setter of LocalizedMarkedUpString, which is the base class for Title, Note, Abstract, Formattedref, Affiliation, Contributor, Docidentifier, etc.
The intent (clamping raw marked-up content to a known-safe inline set) is the right shape of fix. The execution has a gap:
<fn> is missing from ALLOWED, so when a bibitem whose title contains a footnote is loaded, the setter unwraps <fn> (children kept) and stores only the footnote's inner <p> body. The <fn> wrapper is gone before any downstream consumer sees the value.
Downstream impact: visible regression in isodoc
This causes a real round-trip regression in metanorma/isodoc, where the failing test is spec/isodoc/footnotes_spec.rb:4 ("processes IsoXML footnotes"). The test exercises exactly the ISO pattern of attaching a disclaimer footnote to a standard's title. With the current 2.1.3 sanitiser, the expected formattedref degrades: the footnote is stripped and its content is inlined as plain text. This also shifts every subsequent footnote number in the document (the title fn was reference 4, so all later references decrement).
The chain is:
1. Sanitizer (this gem) strips <fn> at content= time because fn is not in ALLOWED. The orphan <p> survives.
2. Downstream, the orphan <p> (which isn't in the downstream allow-list) is stripped, and the gsub chain in content() collapses the residual whitespace.
3. The text that was inside <fn> ends up concatenated into the surrounding <em>.
Notably, relaton-render's own allow-list (extended in relaton-render PR #76 / commit 27fde10) does include <fn>: that gem was correctly preserving footnotes inside titles before relaton-bib started pre-emptively stripping them.
I'm absorbing this regression downstream for the current release rather than blocking, but flagging it so the relaton-bib allow-list can catch up.
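The first link in that chain can be reproduced in isolation. What follows is a minimal sketch, not the gem's code: ToySanitizer, ToyContentSanitization and ToyMarkedUpString are hypothetical stand-ins, the allow-lists are an illustrative subset, the sample title is invented, and the regex-based unwrapping is an assumption chosen only to keep the example dependency-free.

```ruby
# Hypothetical stand-in for the sanitizer; NOT relaton-bib's implementation.
module ToySanitizer
  # Illustrative subset only; the gem's real ALLOWED list is longer.
  ALLOWED_WITHOUT_FN = %w[em strong sub sup tt p eref xref].freeze
  ALLOWED_WITH_FN = (ALLOWED_WITHOUT_FN + %w[fn]).freeze

  # Matches an opening or closing tag and captures its element name.
  TAG = %r{</?([A-Za-z][\w\-]*)[^>]*>}

  # Unwrap every tag not in `allowed`: drop the tag itself, keep its children.
  def self.sanitize(markup, allowed)
    markup.gsub(TAG) do
      allowed.include?(Regexp.last_match(1)) ? Regexp.last_match(0) : ""
    end
  end
end

# Prepended module: intercepts content= and sanitizes before storing.
module ToyContentSanitization
  def content=(value)
    super(ToySanitizer.sanitize(value, allowed))
  end
end

# Toy model of a text-bearing field such as a title.
class ToyMarkedUpString
  prepend ToyContentSanitization
  attr_reader :allowed
  attr_accessor :content

  def initialize(allowed)
    @allowed = allowed
  end
end

title = "Cereals<fn><p>Under revision.</p></fn>" # invented sample input

before = ToyMarkedUpString.new(ToySanitizer::ALLOWED_WITHOUT_FN)
before.content = title
puts before.content # the <fn> wrapper is gone, the orphan <p> remains

after = ToyMarkedUpString.new(ToySanitizer::ALLOWED_WITH_FN)
after.content = title
puts after.content # identical to the input
```

Because the module is prepended, its content= runs ahead of the accessor defined on the class, so the wrapper is already lost by the time any reader of the field sees the value. That is why no downstream consumer can recover it.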
Grammar context
The strict basicdoc grammar's PureTextElement (basicdoc.rng:1292-1308) does not include <fn>. Strictly, then, <fn> inside <title> (which biblio.rng pipes through LocalizedMarkedUpString → oneOrMore PureTextElement) is non-conformant.
In practice it has been routinely produced by isodoc/metanorma for years: ISO standards titles with disclaimer footnotes are a stock pattern. The existing sanitiser already concedes this kind of practical broadening: the commit message says "basicdoc PureTextElement set (plus <p>, <eref>, <xref>)", none of which is in strict PureTextElement either. <p> in particular is the inverse of <fn>: strictly it's a block, not an inline, but it has to be allowed because Abstract content is paragraph-shaped.
This PR proposes the same kind of carve-out for <fn> on the same grounds: it's a real element that real bibliographic input carries, and it's already accepted by downstream consumers.
Open question for follow-up
This is the minimum change to restore the title-fn round-trip. Other Metanorma inline elements that relaton-render's allow-list permits but relaton-bib's does not (link, bookmark, image, index, index-xref, concept, keyword, add, del, span, pagebreak, hr, erefstack, date, ruby) may or may not warrant similar treatment depending on whether they show up in real bibliographic content. I haven't audited that and don't want to broaden the allow-list speculatively in this PR. Worth a separate look.
Changes
- lib/relaton/bib/sanitizer.rb: add fn to ALLOWED; update the header comment to document the carve-out.
- spec/relaton/bib/sanitizer_spec.rb: add a test case asserting that <fn> with nested <p> body survives sanitisation (the existing per-tag preservation loop also automatically covers the bare <fn>x</fn> case via ALLOWED.each).
Test plan
- bundle exec rspec spec/relaton/bib/sanitizer_spec.rb: 33/33 pass on fix/sanitizer-allow-fn
- […] Gemfile.devel, spec/isodoc/footnotes_spec.rb:4 passes (4/4 examples)
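The two kinds of check described under Changes (a per-tag preservation loop driven by the allow-list, plus a dedicated case for <fn> with a nested <p> body) might look roughly like the following. This is a plain-Ruby sketch rather than the actual RSpec file: toy_sanitize, check and TOY_ALLOWED are hypothetical stand-ins, the allow-list is an illustrative subset, and the sample markup is invented.

```ruby
# Hypothetical stand-ins; NOT the gem's Sanitizer or its real ALLOWED list.
TOY_ALLOWED = %w[em strong sub sup tt p eref xref fn].freeze

# Unwrap tags outside the allow-list: drop the tag, keep its children.
def toy_sanitize(markup, allowed = TOY_ALLOWED)
  markup.gsub(%r{</?([A-Za-z][\w\-]*)[^>]*>}) do
    allowed.include?(Regexp.last_match(1)) ? Regexp.last_match(0) : ""
  end
end

# Tiny assertion helper so the sketch runs without a test framework.
def check(description, actual, expected)
  raise "FAIL #{description}: got #{actual.inspect}" unless actual == expected

  puts "ok  #{description}"
end

# Per-tag loop: every allowed tag, now including fn, survives as a bare wrapper.
TOY_ALLOWED.each do |tag|
  input = "<#{tag}>x</#{tag}>"
  check("preserves <#{tag}>", toy_sanitize(input), input)
end

# Dedicated case: <fn> with a nested <p> body round-trips unchanged.
nested = "Title<fn><p>Disclaimer.</p></fn>"
check("preserves <fn> with <p> body", toy_sanitize(nested), nested)

# Control: a tag outside the allow-list is still unwrapped.
check("unwraps <blink>", toy_sanitize("<blink>x</blink>"), "x")
```

Driving the loop from the allow-list itself means any tag added to the list later gets the bare-wrapper check with no further test edits. The nested case is still worth writing out separately, since it is the shape that real ISO titles carry.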