Skip to content

[Comgr] Fix use-after-free in metadata merge for multiple MsgPack notes#1827

Open
lamb-j wants to merge 1 commit intoamd-mainfrom
comgr/fix-metadata-merge-deep-copy
Open

[Comgr] Fix use-after-free in metadata merge for multiple MsgPack notes#1827
lamb-j wants to merge 1 commit intoamd-mainfrom
comgr/fix-metadata-merge-deep-copy

Conversation

@lamb-j
Copy link
Collaborator

@lamb-j lamb-j commented Mar 20, 2026

When merging metadata from multiple MsgPack notes, the code was reusing the same Document for parsing each note while accumulating results in a separate DocNode (Root). Since DocNode contains pointers to memory owned by its Document, clearing and reusing the Document for subsequent notes invalidated the pointers held by the accumulated Root node, causing a use-after-free crash.

Fix this by using a temporary Document for parsing each note, then deep copying the parsed nodes into the main Document. This ensures all nodes in Root point to memory owned by a Document that is never cleared.

When merging metadata from multiple MsgPack notes, the code was reusing
the same Document for parsing each note while accumulating results in a
separate DocNode (Root). Since DocNode contains pointers to memory owned
by its Document, clearing and reusing the Document for subsequent notes
invalidated the pointers held by the accumulated Root node, causing a
use-after-free crash.

Fix this by using a temporary Document for parsing each note, then deep
copying the parsed nodes into the main Document. This ensures all nodes
in Root point to memory owned by a Document that is never cleared.

Co-Authored-By: Claude <noreply@anthropic.com>
@lamb-j lamb-j requested a review from slinder1 March 20, 2026 16:51
@lamb-j lamb-j requested a review from chinmaydd as a code owner March 20, 2026 16:51
@z1-cciauto
Copy link
Collaborator

Copy link

@chinmaydd chinmaydd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally LGTM, though I'll let @slinder1 stamp approval

@chinmaydd chinmaydd added the comgr Related to Code Object Manager label Mar 20, 2026
Copy link

@slinder1 slinder1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am still not entirely comfortable with the fact that we don't seem to fully understand what we are fixing. I would have to look at the ASAN report and do some debugging to be more helpful, and I don't have the time right now.

The code here LGTM, and because Document never frees the backing arrays/maps/strings it doesn't seem unreasonable to first parse into a temporary one anyway.

Final note: I believe the Document APIs around converting nodes in-place to maps and arrays has always had some lingering UB in that it effectively reinterprets an object allocated as the base type as a derived type without any placement new or anything to actually create the derived object. It may be that UB which ultimately confuses ASAN (which would be right to assume the derived type object is never actually created).

@lamb-j
Copy link
Collaborator Author

lamb-j commented Mar 23, 2026

I am still not entirely comfortable with the fact that we don't seem to fully understand what we are fixing. I would have to look at the ASAN report and do some debugging to be more helpful, and I don't have the time right now.

The code here LGTM, and because Document never frees the backing arrays/maps/strings it doesn't seem unreasonable to first parse into a temporary one anyway.

Final note: I believe the Document APIs around converting nodes in-place to maps and arrays has always had some lingering UB in that it effectively reinterprets an object allocated as the base type as a derived type without any placement new or anything to actually create the derived object. It may be that UB which ultimately confuses ASAN (which would be right to assume the derived type object is never actually created).

If I rebuild Comgr with BUILD_TYPE=Debug, I get this error:

bad map key type
UNREACHABLE executed at /home/lambj/git/lightning/llvm-project/build/install/include/llvm/BinaryFormat/MsgPackDocument.h:181!

I think the root issue is coming from this line that Sam added:
From.getMap()[PrintfStrKey] != To.getMap()[PrintfStrKey]

Before we were never comparing Maps between To/From, just returning early if both had printf, otherwise copy/pasting Printf to the new Document.

The comparison is pulling in some extra logic we didn't have before, and isn't doing what it appears to: https://github.com/llvm/llvm-project/blob/main/llvm/include/llvm/BinaryFormat/MsgPackDocument.h#L155.

If the maps are coming from the same document, it hits the llvm_unreachable(), due to amdhsa.printf being an array:

 switch (Lhs.getKind()) {
    case Type::Int:
      return Lhs.Int < Rhs.Int;
    case Type::UInt:
      return Lhs.UInt < Rhs.UInt;
    case Type::Nil:
      return false;
    case Type::Boolean:
      return Lhs.Bool < Rhs.Bool;
    case Type::Float:
      return Lhs.Float < Rhs.Float;
    case Type::String:
    case Type::Binary:
      return Lhs.Raw < Rhs.Raw;
    default:
      llvm_unreachable("bad map key type");
    }

If they're different documents (i.e. we create new documents like in this PR), it just always returns true:

    if (Lhs.KindAndDoc != Rhs.KindAndDoc) {
      if (Lhs.isEmpty())
        return true;
      return (unsigned)Lhs.getKind() < (unsigned)Rhs.getKind();
    }

TLDR: I don't think MsgPackDocument was set up for cross-document comparison

@yxsamliu do we need to do a proper comparison of the amdhsa.printf content? From your commit message, I see "The amdhsa.printf content is identical across all partitions". But I guess we're worried about cases where we may mergeNoteRecords() that weren't created by splitting a single device module via LTO partitions?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comgr Related to Code Object Manager

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants