Summary
Relaton::Bib::Sanitizer.sanitize strips MathML (and any other non-basicdoc inline markup) from inside <stem> elements, even though <stem> itself is preserved. The result is that bibliographic titles containing <stem> lose the markup structure of the mathematical or technical notation on YAML round-trip; only the bare text of the stem survives.
Reproducer
require "relaton/bib"
input = "Prefix <stem><math><mi>d</mi><mn>6</mn></math><asciimath>d_6</asciimath></stem> Suffix"
output = Relaton::Bib::Sanitizer.sanitize(input)
puts input
puts output
Output:
Prefix <stem><math><mi>d</mi><mn>6</mn></math><asciimath>d_6</asciimath></stem> Suffix
Prefix <stem>d6d_6</stem> Suffix
The <stem> wrapper survives (it's in ALLOWED), but every element inside it is unwrapped: <math>, <mi>, <mn>, and <asciimath> in this reproducer, and likewise <mstyle> and <msub> in the fuller title shown under "End-to-end symptom". Only their text content remains (d, 6, d_6), concatenated with no element markup to disambiguate.
End-to-end symptom
Downstream, the metanorma-generic spec "Metanorma::Generic::BibdataConfig preserves embedded MathML when deserialising a bibdata title" fails. Expected YAML:
content: |-
  Internal Standard Reference Data for qNMR: 4,4-Dimethyl-4-silapentane-1-sulfonic acid-<stem block="false" type="MathML">
    <math>
      <mstyle displaystyle="false">
        <msub>
          <mi>d</mi>
          <mn>6</mn>
        </msub>
      </mstyle>
    </math>
    <asciimath>d_6</asciimath>
  </stem> [ISRD-07]
Actual YAML (post-Sanitizer):
content: "Internal Standard Reference Data for qNMR: 4,4-Dimethyl-4-silapentane-1-sulfonic
acid-<stem block=\"false\" type=\"MathML\">\n \n \n \n d\n 6\n
\ \n \n \n d_6\n</stem> [ISRD-07]"
The MathML structure is gone; only a whitespace skeleton and the bare text nodes (d, 6, d_6) survive.
Root cause
Sanitizer.sanitize_children recurses into every child element before deciding whether to unwrap it. The ALLOWED check is applied to each element in isolation, with no awareness of its ancestors, so the recursion has already unwrapped the disallowed descendants of <stem> by the time <stem> itself passes the check and is kept.
Current logic (lib/relaton/bib/sanitizer.rb, lines 31–39):
def self.sanitize_children(node)
  node.children.to_a.each do |child|
    next unless child.element?
    child.name = RENAME[child.name] if RENAME.key?(child.name)
    sanitize_children(child) # ← always recurses
    child.replace(child.children) unless ALLOWED.include?(child.name)
  end
end
For <stem>, the recursion is wrong: <stem> is an opaque container holding non-basicdoc inline notation (MathML / AsciiMath / LaTeX), not basicdoc inline markup that needs sanitising. Its contents should be preserved verbatim.
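The recurse-then-check order can be reproduced in isolation. The following is a minimal sketch, not the library's code: it uses REXML from the Ruby distribution as a stand-in for the Nokogiri-style API in the snippet above, the ALLOWED and RENAME values are hypothetical placeholders (the real constants are not shown here), and the unwrap and sanitize helpers are illustrative names only.

```ruby
require "rexml/document"

# Hypothetical stand-ins: the real ALLOWED and RENAME constants live in
# lib/relaton/bib/sanitizer.rb and are not reproduced here.
ALLOWED = %w[stem em strong].freeze
RENAME  = { "i" => "em", "b" => "strong" }.freeze

# Replace an element with its own children (a REXML approximation of
# child.replace(child.children) in the original snippet).
def unwrap(element)
  parent = element.parent
  element.children.each { |grandchild| parent.insert_before(element, grandchild) }
  parent.delete(element)
end

# Same control flow as the current logic: recurse first, then check
# ALLOWED for this one element, ignoring what its ancestors are.
def sanitize_children(node)
  node.children.each do |child|
    next unless child.is_a?(REXML::Element)

    child.name = RENAME[child.name] if RENAME.key?(child.name)
    sanitize_children(child)
    unwrap(child) unless ALLOWED.include?(child.name)
  end
end

# Illustrative wrapper: parse a fragment, sanitise it, serialise it again.
def sanitize(fragment)
  root = REXML::Document.new("<root>#{fragment}</root>").root
  sanitize_children(root)
  root.children.map(&:to_s).join
end

input = "Prefix <stem><math><mi>d</mi><mn>6</mn></math><asciimath>d_6</asciimath></stem> Suffix"
puts sanitize(input)
# => Prefix <stem>d6d_6</stem> Suffix
```

By the time the loop reaches the ALLOWED check for <stem>, the recursive call has already flattened everything inside it, which matches the output seen in the reproducer.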
Proposed fix
Add an OPAQUE set for elements whose contents are out-of-band notation, and skip recursion into them:
# Elements whose contents are non-basicdoc inline notation (MathML,
# AsciiMath, LaTeX) and must be preserved verbatim rather than sanitised
# against ALLOWED.
OPAQUE = %w[stem].freeze

def self.sanitize_children(node)
  node.children.to_a.each do |child|
    next unless child.element?
    child.name = RENAME[child.name] if RENAME.key?(child.name)
    next if OPAQUE.include?(child.name) # opaque: leave inner XML untouched
    sanitize_children(child)
    child.replace(child.children) unless ALLOWED.include?(child.name)
  end
end
This preserves the existing semantics for every ALLOWED element except <stem>, and converts <stem> from "strip everything inside that isn't in ALLOWED" to "leave inner content untouched".
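To illustrate what "leave inner content untouched" means in practice, here is a self-contained sketch of the opaque-skip behaviour. It is an approximation under stated assumptions: REXML stands in for the Nokogiri-style API used in the real file, the ALLOWED and RENAME values are hypothetical placeholders, and unwrap and sanitize are illustrative helper names rather than library methods.

```ruby
require "rexml/document"

# Hypothetical stand-ins for the real constants in sanitizer.rb.
ALLOWED = %w[stem em strong].freeze
RENAME  = { "i" => "em", "b" => "strong" }.freeze
OPAQUE  = %w[stem].freeze

# Replace an element with its own children.
def unwrap(element)
  parent = element.parent
  element.children.each { |grandchild| parent.insert_before(element, grandchild) }
  parent.delete(element)
end

# Proposed control flow: opaque elements are kept and never descended into.
def sanitize_children(node)
  node.children.each do |child|
    next unless child.is_a?(REXML::Element)

    child.name = RENAME[child.name] if RENAME.key?(child.name)
    next if OPAQUE.include?(child.name) # skip recursion and the ALLOWED check

    sanitize_children(child)
    unwrap(child) unless ALLOWED.include?(child.name)
  end
end

# Illustrative wrapper: parse a fragment, sanitise it, serialise it again.
def sanitize(fragment)
  root = REXML::Document.new("<root>#{fragment}</root>").root
  sanitize_children(root)
  root.children.map(&:to_s).join
end

puts sanitize("Prefix <stem><math><mi>d</mi><mn>6</mn></math><asciimath>d_6</asciimath></stem> Suffix")
# => Prefix <stem><math><mi>d</mi><mn>6</mn></math><asciimath>d_6</asciimath></stem> Suffix

puts sanitize("<span>A <i>b</i></span> <stem><i>x</i></stem>")
# => A <em>b</em> <stem><i>x</i></stem>
```

The second call shows one consequence worth noting: because recursion stops at <stem>, RENAME is not applied inside it either, so an <i> within a stem stays <i> while the same element outside is renamed. That seems consistent with treating the stem as verbatim notation, but it is a behaviour change a reviewer may want to confirm.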
Test plan
A new spec case covering the MathML-inside-<stem> reproducer above, plus the existing AsciiMath-inside-<stem> shape if not already covered. The downstream metanorma-generic spec then passes once a new relaton-bib release lands.