
Allow <fn> in Sanitizer to preserve footnotes in bibliographic titles #115

Merged: andrew2net merged 1 commit into lutaml-integration from fix/sanitizer-allow-fn on May 26, 2026
@opoudjis

Contributor

Summary

Adds <fn> to Relaton::Bib::Sanitizer::ALLOWED so that footnotes embedded in bibliographic titles (and other LocalizedMarkedUpString content fields) round-trip through content= instead of being unwrapped at assignment time.

Background — why this matters

The sanitiser landed in 4cfd83b ("Sanitize LocalizedMarkedUpString#content to basicdoc PureTextElement", referencing relaton/relaton-doi#21) on the lutaml-integration branch, shipped in 2.1.2 / 2.1.3. It prepends a ContentSanitization module onto the content= setter of LocalizedMarkedUpString, which is the base class for Title, Note, Abstract, Formattedref, Affiliation, Contributor, Docidentifier, etc.

The intent — clamping raw marked-up content to a known-safe inline set — is the right shape of fix. The execution has a gap: <fn> is missing from ALLOWED, so when a bibitem like

<title type="main" format="text/plain">Cereals and cereal products<fn reference="7"><p id="_x">ISO is a standards organisation.</p></fn></title>

is loaded, the setter unwraps <fn> (children kept) and stores

"Cereals and cereal products\n  <p id=\"_x\">ISO is a standards organisation.</p>\n"

The <fn> wrapper is gone before any downstream consumer sees the value.
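The unwrap behaviour described above can be sketched in plain Ruby. This is a minimal illustration, not the gem's implementation: the constant names, the helper `unwrap_disallowed`, and the regex-based tag matching are all assumptions made for the example, and the allow-list is reconstructed from the commit message (strict PureTextElement plus `<p>`, `<eref>`, `<xref>`). The real setter also introduces whitespace when it re-serialises, which this sketch does not reproduce.

```ruby
# Hypothetical stand-in for the pre-fix allow-list: strict PureTextElement
# plus <p>, <eref>, <xref>, as described in the commit message.
ALLOWED_BEFORE = %w[em strong sub sup tt underline strike smallcap br stem
                    p eref xref].freeze
# The same list with fn added, as this PR proposes.
ALLOWED_AFTER = (ALLOWED_BEFORE + %w[fn]).freeze

# Matches an opening, closing, or self-closing tag and captures its name.
TAG = %r{</?([A-Za-z][\w-]*)[^>]*>}

# Remove every tag whose name is not allow-listed, keeping its children
# (that is, unwrap the element rather than delete it).
def unwrap_disallowed(markup, allowed)
  markup.gsub(TAG) do
    allowed.include?(Regexp.last_match(1)) ? Regexp.last_match(0) : ""
  end
end

title = 'Cereals and cereal products<fn reference="7">' \
        '<p id="_x">ISO is a standards organisation.</p></fn>'

before = unwrap_disallowed(title, ALLOWED_BEFORE)
after = unwrap_disallowed(title, ALLOWED_AFTER)

puts before # the <fn> wrapper is gone, the orphan <p> remains
puts after  # identical to the input
```

With `fn` absent from the list the wrapper disappears and only the inner `<p>` is left; with `fn` present the title passes through unchanged.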

Downstream impact — visible regression in isodoc

This causes a real round-trip regression in metanorma/isodoc, where the failing test is spec/isodoc/footnotes_spec.rb:4 ("processes IsoXML footnotes"). The test exercises exactly the ISO pattern of attaching a disclaimer footnote to a standard's title:

<title type="main" format="text/plain">Cereals and cereal products<fn reference="7"><p>ISO is a standards organisation.</p></fn></title>

With the current 2.1.3 sanitiser the expected formattedref

<em>Cereals and cereal products<fn reference="4" ...><p>ISO is a standards organisation.</p>...</fn></em>

degrades to

<em>Cereals and cereal products ISO is a standards organisation.</em>

— footnote stripped, content inlined as plain text. This also shifts every subsequent footnote number in the document (the title fn was reference 4, so all later references decrement).

The chain is:

  1. relaton-bib Sanitizer (this gem) strips <fn> at content= time because fn is not in ALLOWED. The orphan <p> survives.
  2. relaton-render's own inline-tag sanitiser (parse.rb:123, allow-list at parse.rb:85-90) then strips the orphan <p> (which isn't in its allow-list) and the gsub chain in content() collapses the residual whitespace.
  3. The text content of the original <fn> ends up concatenated into the surrounding <em>.
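Steps 2 and 3 of the chain can be reproduced with a small sketch that starts from the value relaton-bib 2.1.3 stores. The `RENDER_ALLOWED` list below is an assumed subset of relaton-render's allow-list (the only properties taken from this report are that it contains `fn` and does not contain `p`), and `strip_unlisted` is a hypothetical helper, not relaton-render's code.

```ruby
# Assumed subset of relaton-render's allow-list: contains fn, lacks p.
RENDER_ALLOWED = %w[em strong sub sup tt fn].freeze

# Matches an opening, closing, or self-closing tag and captures its name.
INLINE_TAG = %r{</?([A-Za-z][\w-]*)[^>]*>}

# Drop tags that are not allow-listed, keeping their text content.
def strip_unlisted(markup, allowed)
  markup.gsub(INLINE_TAG) do
    allowed.include?(Regexp.last_match(1)) ? Regexp.last_match(0) : ""
  end
end

# Value stored after <fn> was unwrapped at content= time (quoted above).
stored = "Cereals and cereal products\n  " \
         "<p id=\"_x\">ISO is a standards organisation.</p>\n"

# Step 2: the orphan <p> is stripped, then residual whitespace is collapsed.
rendered = strip_unlisted(stored, RENDER_ALLOWED).gsub(/\s+/, " ").strip

# Step 3: the footnote text is now part of the surrounding <em>.
puts "<em>#{rendered}</em>"
```

The point of the sketch is that the second stage never sees an `<fn>` to preserve: by the time it runs, the only trace of the footnote is a `<p>` it has no reason to keep.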

Notably, relaton-render's own allow-list (extended in relaton-render PR #76 / commit 27fde10) does include <fn> — that gem was correctly preserving footnotes inside titles before relaton-bib started pre-emptively stripping them.

I'm absorbing this regression downstream for the current release rather than blocking, but flagging it so the relaton-bib allow-list can catch up.

Grammar context

The strict basicdoc grammar's PureTextElement (basicdoc.rng:1292-1308) is:

text | em | strong | sub | sup | tt | underline | strike | smallcap | br | stem

<fn> is not in it. Strictly, then, <fn> inside <title> (which biblio.rng pipes through LocalizedMarkedUpString to oneOrMore PureTextElement) is non-conformant.

In practice it has been routinely produced by isodoc/metanorma for years — ISO standards titles with disclaimer footnotes are a stock pattern. The existing sanitiser already concedes this kind of practical broadening: the commit message says "basicdoc PureTextElement set (plus <p>, <eref>, <xref>)" — none of which is in strict PureTextElement either. <p> in particular is the inverse of <fn>: strictly it's a block, not an inline, but it has to be allowed because Abstract content is paragraph-shaped.

This PR proposes the same kind of carve-out for <fn> on the same grounds: it's a real element that real bibliographic input carries, and it's already accepted by downstream consumers.

Open question for follow-up

This is the minimum change to restore the title-fn round-trip. Other Metanorma inline elements that relaton-render's allow-list permits but relaton-bib's does not — link, bookmark, image, index, index-xref, concept, keyword, add, del, span, pagebreak, hr, erefstack, date, ruby — may or may not warrant similar treatment depending on whether they show up in real bibliographic content. I haven't audited that and don't want to broaden the allow-list speculatively in this PR. Worth a separate look.
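One way to approach that audit is to tally which tags in a corpus of bibliographic content would be unwrapped by the allow-list. The sketch below is only a starting point under stated assumptions: `BIB_ALLOWED` is a reconstruction of the post-PR list from the text of this PR, `unwrapped_tag_counts` is a hypothetical helper, and the sample strings are made up for illustration rather than taken from real data.

```ruby
# Reconstructed allow-list after this PR (an assumption, not the gem's constant).
BIB_ALLOWED = %w[em strong sub sup tt underline strike smallcap br stem
                 p eref xref fn].freeze

# Count opening (or self-closing) tags in the samples that are not
# allow-listed and would therefore be unwrapped.
def unwrapped_tag_counts(samples, allowed)
  samples
    .flat_map { |xml| xml.scan(/<([A-Za-z][\w-]*)[^>]*>/).flatten }
    .reject { |name| allowed.include?(name) }
    .tally
end

# Made-up sample content, for illustration only.
samples = [
  'Cereals<fn reference="7"><p>ISO is a standards organisation.</p></fn>',
  'See <link target="https://example.org"/> and <span class="x">this</span>',
  '<em>Plain</em> title with <span>two</span> spans',
]

p unwrapped_tag_counts(samples, BIB_ALLOWED)
```

Run against real bibitems, a tally like this would show which of the listed elements actually occur and so which ones are worth considering for the allow-list.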

Changes

  • lib/relaton/bib/sanitizer.rb — add fn to ALLOWED; update the header comment to document the carve-out.
  • spec/relaton/bib/sanitizer_spec.rb — add a test case asserting that <fn> with nested <p> body survives sanitisation (the existing per-tag preservation loop also automatically covers the bare <fn>x</fn> case via ALLOWED.each).

Test plan

  • bundle exec rspec spec/relaton/bib/sanitizer_spec.rb — 33/33 pass on fix/sanitizer-allow-fn
  • With isodoc pointed at this branch via Gemfile.devel, spec/isodoc/footnotes_spec.rb:4 passes (4/4 examples)
  • Full relaton-bib suite — would appreciate CI confirmation
  • Downstream relaton-render and other relaton-consumers — no expected impact since this only adds a tag to the allow-list

🤖 Generated with Claude Code

The Sanitizer's ALLOWED set was strict basicdoc PureTextElement plus
<p>, <eref>, <xref>. <fn> was not in the list, so the content= setter
on LocalizedMarkedUpString (used by Title, Note, Abstract, and other
text-bearing fields) unwrapped <fn> at assignment time and kept only
its inner <p> body. Downstream consumers — relaton-render and isodoc —
were left with an orphan <p> they could not place, producing a visible
regression in formattedref rendering of ISO-style titles with
footnotes (e.g. isodoc spec/isodoc/footnotes_spec.rb:4).

Adds <fn> to ALLOWED. <fn> is strictly speaking not in basicdoc
PureTextElement, but it is a legitimate child of <title> in real
Metanorma bibliographic input (ISO disclaimer footnotes), is in
relaton-render's own inline-tag allow-list, and the existing "plus
<p>, <eref>, <xref>" carve-out already concedes that the Sanitizer
needs to be a touch broader than strict PureTextElement to handle
real bibliographic content.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@andrew2net andrew2net merged commit cd1a16b into lutaml-integration May 26, 2026
12 checks passed
opoudjis commented Jun 2, 2026

Contributor Author

@andrew2net — could we please cut a v2.1.4 release from lutaml-integration to ship this fix? It's merged but the merge sits one commit after the v2.1.3 tag, so downstream consumers pinning relaton-bib (~> 2.1.0) still get 2.1.3 with the bug.
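For reference, the pessimistic constraint `~> 2.1.0` accepts any 2.1.x release, so a 2.1.4 tag would be picked up by those consumers without changing their pins. A quick check with RubyGems' own version classes:

```ruby
require "rubygems"

# The constraint downstream consumers use, as quoted above.
pin = Gem::Requirement.new("~> 2.1.0")

# 2.1.3 is the current release, 2.1.4 the requested one, 2.2.0 a contrast case.
%w[2.1.3 2.1.4 2.2.0].each do |v|
  puts "#{v}: #{pin.satisfied_by?(Gem::Version.new(v))}"
end
```

This prints `true` for 2.1.3 and 2.1.4 and `false` for 2.2.0.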

The concrete downstream impact: isodoc/spec/isodoc/footnotes_spec.rb:4 ("processes IsoXML footnotes") fails on main against the published relaton-bib 2.1.3 because the bibitem-title footnote (<title type="main">Cereals and cereal products<fn>…</fn></title>) loses its <fn> wrapper at content-assignment time. The whole formattedref renders with the footnote content inlined as plain text inside <em> instead of preserved as a footnote element. Local verification with Gemfile.devel pointing at this branch makes the spec pass cleanly.

I've been carrying it as a Gemfile.devel override locally, but I'd like to land it without that workaround as soon as you have a chance.

🤖 Generated with Claude Code
