Ignore whitespace for now, annotations and other changes for a large schema #48

Open
charlesmoore99 wants to merge 2 commits into ekrich:main from charlesmoore99:topic/annotation

@charlesmoore99

I don't write a lot of PRs so please be gentle.

I need to use exipg to generate a schema-aware EXI processor. The XSD is large (~7500 complex types and elements, plus a bunch of abstract base complex types, split across two files). I used Nagasena to generate .xsd.exi files for the two XSDs and fed them into exipg -static -schema=one.xsd.exi,two.xsd.exi exi_proc.c 2>&1

I ran into a couple problems with exipg.

First, crashes. Some of the buffers were too small for the XSD filenames on the command line, and some of the internal buffers were too small for some of the rather verbose type names.

Second, the two XSDs were loaded up with whitespace and annotation tags which exipg wasn't able to process.

I inflated the buffer sizes and added code to skip annotation and whiteSpace tags. It works for my bloated XSDs, and I thought I should PR this before I got pulled away to a new task, so that someone else might be able to find value in it, or so that someone can tell me there was a simpler way to do this.

Here's what this PR does

  1. Increases buffer sizes to handle large filenames and element names
  2. Moves argIndex inside a while loop so that it is not incremented when an annotation is skipped. Skipping the annotation does not allocate memory for it in the dyn array; on cleanup that dyn array gets freed, which results in unallocated memory being freed, which causes a crash and truncates the output.c file.
  3. Adds a string length guard to limit string length to buffer length on a string copy
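
To illustrate item 2, here is a minimal sketch of the general pattern: the output index advances only when a slot is actually filled, so cleanup code never walks over an entry that was skipped. This is not the EXIP code. `Entry` and `collect_non_annotations` are hypothetical names made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a schema child entry; not an EXIP type. */
typedef struct {
    const char *name;
} Entry;

/*
 * Copies every non-annotation entry into `out` and returns how many slots
 * were filled. The output index advances only when a slot is written, so
 * cleanup code that walks out[0..count) never sees an unfilled slot.
 */
static size_t collect_non_annotations(const Entry *in, size_t in_count,
                                      Entry *out, size_t out_cap)
{
    size_t out_index = 0;
    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < in_count && out_index < out_cap; i++) {
        if (strcmp(in[i].name, "annotation") == 0) {
            continue; /* skipped: nothing stored, index unchanged */
        }
        out[out_index] = in[i];
        out_index++; /* advance only after a slot was filled */
    }
    return out_index;
}
```

The returned count, not the input count, is what any later cleanup loop should iterate over.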

Increases buffer sizes to handle large filenames and element names

Moves argIndex inside a while loop so that it is not incremented
when an annotation is skipped.

Added a string length guard to limit string length to buffer length
@ekrich
Owner

ekrich commented Feb 26, 2026

Hey, no worries at all. I adopted this repo because we are trying to use it at work, so I am not an EXI expert, but I am learning.

A few questions.

  1. Do you need to use C for your project or can you use Java or Scala?
  2. Did you take a look at https://ekrich.github.io/exip/exip-user-guide.pdf, the last section for Schema Information?
  3. How did you find out about this repo?

If you can use Java then using EXIficient would be a good idea since EXIP is not production quality or fully functional. More info here: https://exificient.github.io/java/ Either way I would use the EXIficient GUI project to transform your schema into EXI for EXIP.

Forgive me if you already know this, but when EXIP talks about out-of-band options like the -opts argument, it means there is no EXI header (literally $EXI), so you must know the encode options and supply them for decode. When you use the EXI header and options, they are specified in the header and thus the decoder knows what to do.
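
As a rough illustration of the in-band case, the EXI specification defines the optional cookie as the four bytes `$EXI` at the very start of the stream. The sketch below only checks for that cookie. It does not parse the rest of the header (the distinguishing bits, the options presence bit, or the options themselves), and `has_exi_cookie` is a hypothetical helper, not an EXIP function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Returns true when the buffer begins with the 4-byte EXI cookie "$EXI". */
static bool has_exi_cookie(const unsigned char *buf, size_t len)
{
    static const unsigned char cookie[4] = { '$', 'E', 'X', 'I' };
    assert(buf != NULL || len == 0);
    /* The length check short-circuits, so memcmp never reads past `len`. */
    return len >= sizeof cookie && memcmp(buf, cookie, sizeof cookie) == 0;
}
```

A stream without the cookie can still be valid EXI, so a negative result only means the cookie was not written, not that the input is malformed.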

So using the EXIficient GUI you can encode your XSD. If you use the EXI header, you won't have to supply -opts when you run exipg. Select Configure Advanced EncodingOptions and select the first two options for the EXI cookie and EXI options. If you don't select anything else (all defaults) you should be good. You shouldn't have any comments or other things you don't need, and then maybe things will work as you expect.

I am wondering whether, if you try this, you may not need to change EXIP at all. I am not opposed to making changes, but I would need to be really confident about them.

Edit: where did you get Nagasena?

@ekrich ekrich mentioned this pull request Feb 27, 2026
@ekrich
Owner

ekrich commented Feb 27, 2026

I think I misunderstood a few things. The annotations/documents are not the same as the documentation you can put in an XML instance. I was confusing schema encoding to EXI with document encoding.

@charlesmoore99
Author

Thanks for taking the time to respond.

To answer your questions:

  1. If it were just me, I'd probably use Java, but the shop uses mostly C++ and Python. Python is too slow for this particular use case, and I don't really want to add Java to the tech stack.

  2. I have. It has been helpful.

  3. I started out using exip-0.5.4 from SourceForge and made the changes for whitespace and annotation using that code base. Then I found your project. I compared the code bases and it looked like they were equivalent enough that the whitespace and annotation changes would work, so I forked, updated to C99, made the changes, and PR'd them. I've tested them against a number of our doc types. To find your project, I think I googled exi 0.5.5 on a whim. I want to say that the first time I looked for EXI tools it didn't show up, but now it does every time.

I think I got Nagasena at openexi.sourceforge.net.

I have a quick and dirty Java program that uses the nagasena and nagasena-rt jar files to convert the XSD to an .xsd.exi (with the preserve prefixes option bit set and my namespace hard coded). This is for a project with an onerously large XSD that is updated every couple of weeks. There's going to be a build pipeline for building the .xsd.exi.

Using that .xsd.exi I have been able to both encode documents to EXI and decode them back to XML. I've also tested with the EXIficient GUI and with Nagasena to check interoperability. So far I'm able to decode messages that EXIP encoded, but I haven't tried the other way around yet.

Note: I ran into what I think is a bug in Nagasena: it decoded the documents to the wrong qname. Nagasena was setting the qname to the xsi:type. Overriding the decoder's SAX XMLFilterImpl to use prefix:localname instead fixed it, but I'm left wondering whether I'm doing something wrong with the encoding or whether it's a Nagasena bug.

Owner

@ekrich ekrich left a comment

Thank you for submitting the changes. If we could get a small XSD as a reproducer, so that I could add a test and see the problems for myself, that would be really helpful.

In a code base like this some changes can make a big impact so I need to be very conservative. Also, my experience and expertise in this area is limited.

{
TRY(getComplexTypeProtoGrammar(ctx, &treeTEntry->entry->child, &pg));
}
else if(treeTEntry->entry->child.entry->element == ELEMENT_ANNOTATION)
Owner

This change seems OK. Does the schema not have documentation tags inside the annotation, or does skipping here avoid that?

}
else
{
destroyDynArray(&partGrammarTbl.dynArray);
Owner

If I understand this correctly, this was a memory leak and this is the fix?

@@ -1775,17 +1781,20 @@ static errorCode getRestrictionSimpleProtoGrammar(BuildContext* ctx, QualifiedTr
// TODO: needs to be implemented. It is also needed for the XML Schema grammars
// COMMENT #SCHEMA#: ignore for now
DEBUG_MSG(INFO, DEBUG_GRAMMAR_GEN, ("\n>Type facet pattern is not implemented: at %s, line %d.", __FILE__, __LINE__));
Owner

Does this debug message show up normally, or do you have to set DEBUG mode?

{
- SET_TYPE_FACET(newSimpleType.content, TYPE_FACET_WHITE_SPACE);
- return EXIP_NOT_IMPLEMENTED_YET;
+ // skip whiteSpace
Owner

Here we need to leave those two lines. Since we are not implementing it now, we should probably have a debug message like the one above on R1783.

}
enumEntry = enumEntry->next;
enumIter++;
}
Owner

The change above seems fine, but the diff would be smaller if we kept the same order. Put the ANNOTATION check in the else if.

if(strstr(argv[argIndex], "-schema") != NULL)
{
- char *xsdList = argv[argIndex] + 7;
+ char *xsdList = argv[argIndex] + 8;
Owner

Need some help on the reason for this change.
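
One possible reading, inferred only from the command line in the PR description (`-schema=one.xsd.exi,two.xsd.exi`) and not confirmed by the author: `-schema` is 7 characters, so an offset of 7 leaves the pointer on the `=`, while 8 skips past it. A minimal sketch that derives the offset from the prefix rather than hard-coding it follows. `schema_list_from_arg` is a hypothetical helper, not part of exipg.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * If `arg` starts with "-schema=", returns a pointer to the text after the
 * '='; otherwise returns NULL. The offset comes from the prefix length
 * (8 characters including '=') instead of a hard-coded number.
 */
static const char *schema_list_from_arg(const char *arg)
{
    static const char prefix[] = "-schema=";
    const size_t prefix_len = sizeof prefix - 1; /* 8, excludes the NUL */
    assert(arg != NULL);
    return strncmp(arg, prefix, prefix_len) == 0 ? arg + prefix_len : NULL;
}
```

Matching the whole prefix at the start of the argument is also stricter than a `strstr` search, which would accept `-schema` anywhere in the string.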

FILE *schemaFile;
BinaryBuffer buffer[MAX_XSD_FILES_COUNT]; // up to 10 XSD files
- char schemaFileName[50];
+ char schemaFileName[2048];
Owner

Here we should probably use FILENAME_MAX from the C stdio.h header.
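
A minimal sketch of what that could look like, assuming the buffer is sized with `FILENAME_MAX` from `<stdio.h>` and that an over-long name should be rejected rather than truncated. `copy_file_name` is a hypothetical helper, not an existing exipg function.

```c
#include <assert.h>
#include <stdio.h>   /* FILENAME_MAX */
#include <string.h>

/*
 * Copies `src` into `dst` (capacity `cap`) and returns 0 on success, or -1
 * if the name would not fit, so an over-long path is rejected instead of
 * being silently truncated.
 */
static int copy_file_name(char *dst, size_t cap, const char *src)
{
    assert(dst != NULL && src != NULL);
    size_t len = strlen(src);
    if (len >= cap) {
        return -1; /* no room for the name plus its terminating NUL */
    }
    memcpy(dst, src, len + 1); /* len + 1 copies the NUL as well */
    return 0;
}
```

A caller would declare `char schemaFileName[FILENAME_MAX];` and pass `sizeof schemaFileName` as the capacity.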


// Maximum number of characters in a variable name buffer
- #define VAR_BUFFER_MAX_LENGTH 200
+ #define VAR_BUFFER_MAX_LENGTH 2048
Owner

Wondering why this value was chosen.

- char elemGrammar[20];
- char typeGrammar[20];
+ char elemGrammar[256];
+ char typeGrammar[256];
Owner

This one is also a very large buffer change, and other places in the code base use 20. It would be really helpful to have a small subset of your schema, or an equivalent, as a reproducer so I could see the problems myself and add it to the tests.

size_t cpyLen = str->length < (VAR_BUFFER_MAX_LENGTH-1) ? str->length : (VAR_BUFFER_MAX_LENGTH-1);
strncpy(displayStr, str->str, cpyLen);
displayStr[cpyLen] = '\0';
}
Owner

Could we remove the scope here?
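
One way to avoid the extra block scope would be to move the guarded copy into a small helper. The sketch below assumes a length-prefixed string that is not NUL-terminated. `LenString` and `copy_for_display` are hypothetical names for illustration, not EXIP types or functions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical length-prefixed string; the bytes are not NUL-terminated. */
typedef struct {
    const char *str;
    size_t length;
} LenString;

/*
 * Copies at most cap - 1 bytes of `s` into `dst`, always NUL-terminates,
 * and returns the number of bytes copied (excluding the NUL).
 */
static size_t copy_for_display(char *dst, size_t cap, const LenString *s)
{
    assert(dst != NULL && s != NULL && cap > 0);
    size_t n = s->length < cap - 1 ? s->length : cap - 1;
    memcpy(dst, s->str, n);
    dst[n] = '\0';
    return n;
}
```

With a helper like this, the call site becomes a single statement such as `copy_for_display(displayStr, VAR_BUFFER_MAX_LENGTH, str);` and needs no local `cpyLen` variable.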

@ekrich ekrich changed the title from "Topic/annotation" to "Ignore whitespace for now, annotations and other changes for a large schema" Mar 20, 2026

2 participants