Sanitizer strips inner MathML/AsciiMath from <stem>, losing notation content on YAML round-trip #116

@opoudjis

Summary

Relaton::Bib::Sanitizer.sanitize strips MathML (and any other non-basicdoc inline markup) from inside <stem> elements, even though <stem> itself is preserved. The result is that bibliographic titles containing <stem> lose the entire mathematical or technical-notation content of the stem on YAML round-trip.

Reproducer

require "relaton/bib"

input  = "Prefix <stem><math><mi>d</mi><mn>6</mn></math><asciimath>d_6</asciimath></stem> Suffix"
output = Relaton::Bib::Sanitizer.sanitize(input)

puts input
puts output

Output:

Prefix <stem><math><mi>d</mi><mn>6</mn></math><asciimath>d_6</asciimath></stem> Suffix
Prefix <stem>d6d_6</stem> Suffix

The <stem> wrapper survives (it's in ALLOWED), but <math>, <mi>, <mn>, and <asciimath> are all unwrapped (as <mstyle> and <msub> would be, were they present). Only their text content remains: d, 6, d_6, concatenated with no element markup to disambiguate.

End-to-end symptom

Downstream, the metanorma-generic spec "Metanorma::Generic::BibdataConfig preserves embedded MathML when deserialising a bibdata title" fails. Expected YAML:

content: |-
  Internal Standard Reference Data for qNMR: 4,4-Dimethyl-4-silapentane-1-sulfonic acid-<stem block="false" type="MathML">
    <math>
      <mstyle displaystyle="false">
        <msub>
          <mi>d</mi>
          <mn>6</mn>
        </msub>
      </mstyle>
    </math>
    <asciimath>d_6</asciimath>
  </stem> [ISRD-07]

Actual YAML (post-Sanitizer):

content: "Internal Standard Reference Data for qNMR: 4,4-Dimethyl-4-silapentane-1-sulfonic
  acid-<stem block=\"false\" type=\"MathML\">\n  \n    \n      \n        d\n        6\n
  \     \n    \n  \n  d_6\n</stem> [ISRD-07]"

The MathML structure is gone; only the whitespace skeleton and the bare text nodes (d, 6, d_6) survive.

Root cause

Sanitizer.sanitize_children recurses into every child element before deciding whether to unwrap it, and it checks each element against ALLOWED by that element's own name only. An allowed ancestor therefore gives its descendants no protection: by the time <stem> passes its own allow-check, the recursion has already unwrapped the disallowed elements inside it.

Current logic (lib/relaton/bib/sanitizer.rb, lines 31–39):

def self.sanitize_children(node)
  node.children.to_a.each do |child|
    next unless child.element?

    child.name = RENAME[child.name] if RENAME.key?(child.name)
    sanitize_children(child)                                  # ← always recurses
    child.replace(child.children) unless ALLOWED.include?(child.name)
  end
end

For <stem>, the recursion is wrong: <stem> is an opaque container holding non-basicdoc inline notation (MathML / AsciiMath / LaTeX), not basicdoc inline markup that needs sanitising. Its contents should be preserved verbatim.
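
The traversal order can be illustrated without Nokogiri or relaton-bib. The following is a minimal sketch only: Node, build_title, and to_xml are hypothetical stand-ins, and ALLOWED here is an illustrative subset rather than the real list. It mirrors the recurse-then-unwrap order of the method above on the reproducer's tree.

```ruby
# Simplified stand-in for an XML tree: children are Node objects or Strings.
# This is NOT Nokogiri; it only models element names and nesting.
Node = Struct.new(:name, :children)

# Hypothetical subset of the real ALLOWED list, for illustration only.
ALLOWED = %w[stem em].freeze

# Mirrors the current logic: recurse first, then unwrap the child if its own
# name is not in ALLOWED. An allowed ancestor does not protect descendants.
def sanitize_children(node)
  node.children = node.children.flat_map do |child|
    next [child] unless child.is_a?(Node)   # text nodes pass through

    sanitize_children(child)                # always recurses
    ALLOWED.include?(child.name) ? [child] : child.children
  end
  node
end

# Serialise the model tree back to a string (no attributes, no escaping).
def to_xml(node)
  return node unless node.is_a?(Node)

  inner = node.children.map { |c| to_xml(c) }.join
  "<#{node.name}>#{inner}</#{node.name}>"
end

# The reproducer's title, wrapped in a hypothetical <title> root.
def build_title
  stem = Node.new("stem", [
    Node.new("math", [Node.new("mi", ["d"]), Node.new("mn", ["6"])]),
    Node.new("asciimath", ["d_6"])
  ])
  Node.new("title", ["Prefix ", stem, " Suffix"])
end

puts to_xml(sanitize_children(build_title))
```

In this model the <stem> element is kept, but its children have already been flattened to d, 6, d_6 by the time its own allow-check runs, which matches the reproducer output.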

Proposed fix

Add an OPAQUE set for elements whose contents are out-of-band notation, and skip recursion into them:

# Elements whose contents are non-basicdoc inline notation (MathML,
# AsciiMath, LaTeX) and must be preserved verbatim rather than sanitised
# against ALLOWED.
OPAQUE = %w[stem].freeze

def self.sanitize_children(node)
  node.children.to_a.each do |child|
    next unless child.element?

    child.name = RENAME[child.name] if RENAME.key?(child.name)
    next if OPAQUE.include?(child.name)   # opaque: leave inner XML untouched
    sanitize_children(child)
    child.replace(child.children) unless ALLOWED.include?(child.name)
  end
end

This preserves the existing semantics for every element outside <stem>, and changes <stem> from "strip everything inside that isn't in ALLOWED" to "leave inner content untouched".
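
The effect of the OPAQUE guard can be checked on a toy tree, independently of Nokogiri. This is a hedged sketch under stated assumptions: Node, build_title, and to_xml are hypothetical helpers, ALLOWED is an illustrative subset, and RENAME handling is omitted. It shows the subtree under <stem> surviving while elements outside it are still sanitised.

```ruby
# Simplified stand-in for an XML tree: children are Node objects or Strings.
# This is NOT Nokogiri; it only models element names and nesting.
Node = Struct.new(:name, :children)

ALLOWED = %w[stem em].freeze   # hypothetical subset, for illustration only
OPAQUE  = %w[stem].freeze      # elements whose contents are left untouched

# Same traversal as the proposed fix, minus RENAME: opaque elements are kept
# whole and never recursed into; everything else is sanitised as before.
def sanitize_children(node)
  node.children = node.children.flat_map do |child|
    next [child] unless child.is_a?(Node)         # text nodes pass through
    next [child] if OPAQUE.include?(child.name)   # opaque: keep subtree as-is

    sanitize_children(child)
    ALLOWED.include?(child.name) ? [child] : child.children
  end
  node
end

# Serialise the model tree back to a string (no attributes, no escaping).
def to_xml(node)
  return node unless node.is_a?(Node)

  inner = node.children.map { |c| to_xml(c) }.join
  "<#{node.name}>#{inner}</#{node.name}>"
end

# A disallowed <span>, the reproducer's <stem>, and an allowed <em>.
def build_title
  stem = Node.new("stem", [
    Node.new("math", [Node.new("mi", ["d"]), Node.new("mn", ["6"])]),
    Node.new("asciimath", ["d_6"])
  ])
  Node.new("title", [
    Node.new("span", ["Prefix "]),
    stem,
    Node.new("em", [" Suffix"])
  ])
end

puts to_xml(sanitize_children(build_title))
```

In this model the <span> is still unwrapped and the <em> is still kept, so only the handling of <stem> changes: its <math> and <asciimath> children come through intact.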

Test plan

Add a new spec case covering the MathML-inside-<stem> reproducer above, plus the AsciiMath-inside-<stem> shape if it is not already covered. The downstream metanorma-generic spec should then pass once a new relaton-bib release lands.
