Skip to content

refactor: split large grammar.js into modules#341

Merged
DerekStride merged 9 commits intoDerekStride:mainfrom
vergenzt:vergenzt/modularize-grammar-js
Feb 10, 2026
Merged

refactor: split large grammar.js into modules#341
DerekStride merged 9 commits intoDerekStride:mainfrom
vergenzt:vergenzt/modularize-grammar-js

Conversation

@vergenzt
Copy link
Copy Markdown
Contributor

@vergenzt vergenzt commented Dec 12, 2025

I tried to make no semantic changes to anything, but that could use some verifying. I did make substantial editorial decisions about what "belongs" where -- totally open to feedback on any of it.

I may try to do some sort of automated analysis to confirm that all of the rules expressions from before and after are the same. We'll see.

Would you be open to merging something like this?

I tried to make no semantic changes to anything, but that could use some verifying.
@vergenzt
Copy link
Copy Markdown
Contributor Author

vergenzt commented Dec 13, 2025

Okay I was able to figure out a good way to do that validation. I used an ast-grep rule to fetch all relevant grammar rule JS key/value pairs, normalized whitespace for each (since I dedented when extracting to the grammar/ folder), then sorted and wrote to a file and was able to diff the files.

On both main and this branch, I ran the following:

ast-grep scan --json --inline-rules='id: _
language: js
rule:
  kind: pair
  inside:
    kind: object
    inside:
      any:
        - pattern: module.exports = { $$$ }
        - kind: pair
          has:
            field: key
            regex: rules
' grammar* | jq '.[].text | gsub("\\\\s+"; " ")' | sort > rules-$(git rev-parse HEAD)

(edit 2025-12-29 to remove indent which arose from copy/paste)

This then produced the following files:

  • rules-224f3a49757b4997fc89b89fe31b84c635002534 (all rules detected at 224f3a4, tip of this branch)
  • rules-fe77f6868d6cdea593052a6af390116495093dc1 (all rules detected at fe77f68, current tip of main)

Then, git diff --no-index rules-fe77f6868d6cdea593052a6af390116495093dc1 rules-224f3a49757b4997fc89b89fe31b84c635002534 for me yields:

diff --git a/rules-fe77f6868d6cdea593052a6af390116495093dc1 b/rules-224f3a49757b4997fc89b89fe31b84c635002534
index c025468..320e991 100644
--- a/rules-fe77f6868d6cdea593052a6af390116495093dc1
+++ b/rules-224f3a49757b4997fc89b89fe31b84c635002534
@@ -148,7 +148,7 @@
 "direction: $ => choice($.keyword_desc, $.keyword_asc)"
 "distinct_from: $ => seq($.keyword_is, $.keyword_distinct, $.keyword_from)"
 "dollar_quote: () => /\\$[^\\$]*\\$/"
-"double: $ => choice( make_keyword(\"float8\"), unsigned_type($, parametric_type($, $.keyword_double, ['precision', 'scale'])), unsigned_type($, parametric_type($, seq($.keyword_double, $.keyword_precision), ['precision', 'scale'])), unsigned_type($, parametric_type($, $.keyword_real, ['precision', 'scale'])), )"
+"double: $ => choice( $.keyword_float8, unsigned_type($, parametric_type($, $.keyword_double, ['precision', 'scale'])), unsigned_type($, parametric_type($, seq($.keyword_double, $.keyword_precision), ['precision', 'scale'])), unsigned_type($, parametric_type($, $.keyword_real, ['precision', 'scale'])), )"
 "drop_column: $ => seq( $.keyword_drop, optional( $.keyword_column, ), optional($._if_exists), field('name', $.identifier), )"
 "drop_constraint: $ => seq( $.keyword_drop, $.keyword_constraint, optional($._if_exists), $.identifier, optional($._drop_behavior), )"
 "drop_database: $ => prec.left(seq( $.keyword_drop, $.keyword_database, optional($._if_exists), $.identifier, optional($.keyword_with), optional($.keyword_force), ))"
@@ -306,6 +306,7 @@
 "keyword_filter: _ => make_keyword(\"filter\")"
 "keyword_first: _ => make_keyword(\"first\")"
 "keyword_float: _ => make_keyword(\"float\")"
+"keyword_float8: _ => make_keyword(\"float8\")"
 "keyword_following: _ => make_keyword(\"following\")"
 "keyword_follows: _ => make_keyword(\"follows\")"
 "keyword_for: _ => make_keyword(\"for\")"

... which of course now reminds me that I did change one thing, which was to extract a separate keyword_float8 rule instead of the previous inline call to make_keyword. 😄

Looks like that's confirmed to be the only change though!!

@DerekStride
Copy link
Copy Markdown
Owner

I like the idea here, splitting everything out into separate modules seems nice, are you aware of other projects that make use of this pattern? We're already quite different in that we don't commit the generated code so I worry about adding more friction.

@vergenzt
Copy link
Copy Markdown
Contributor Author

vergenzt commented Dec 28, 2025

I like the idea here, splitting everything out into separate modules seems nice, are you aware of other projects that make use of this pattern? We're already quite different in that we don't commit the generated code so I worry about adding more friction.

From some searching, the following appear to be tree-sitter grammar definitions which require pieces of their definitions from other files:

To answer your question more directly though, no I'm not aware of specific other projects which split things out. I believe this is also the first major tree-sitter grammar I've attempted to make a contribution to -- but I'm pretty sure it's also the largest / most complex grammar I've encountered.

IMO this will substantially reduce friction for new contributors because potential contributions will have a clearer place to go.

to avoid changing any parsed output (which would require changes to highlights.scm and test/corpus)
@vergenzt
Copy link
Copy Markdown
Contributor Author

vergenzt commented Dec 29, 2025

FYI I pushed 8f8e5fd to revert the (admittedly minor) keyword change, and reran the diff script. Diff output from ebf08ba (now-current tip of main) to 8f8e5fd (new tip of this PR) is now empty.

(You can see full output here: https://gist.github.com/vergenzt/1614ef579a3ba452050b2338cfd7adcf)

@vergenzt
Copy link
Copy Markdown
Contributor Author

If you're not interested in this would you at least be open to some sort of well-defined order for rules in grammar.js? Maybe a topological sort or something?

@DerekStride
Copy link
Copy Markdown
Owner

DerekStride commented Jan 5, 2026

@matthias-Q Do you have any objections? If not, I'm good to merge these changes.

@matthias-Q
Copy link
Copy Markdown
Collaborator

No, I am totally fine with it.

@vergenzt
Copy link
Copy Markdown
Contributor Author

vergenzt commented Jan 9, 2026

Awesome! I'll try to get conflicts resolved asap and get this back to being ready to merge.

@clason
Copy link
Copy Markdown

clason commented Jan 10, 2026

PSA: upstream tree-sitter is moving away from CJS towards ESM (modules), mostly for improved compatibility with the new native JS runtime. You might wish to future-proof this while you're at it.

(I know you're pinning to tree-sitter 0.24 here (current is 0.26.3), but this will not be tenable forever -- you're missing out on useful features and bugfixes, and support for ABI 14 will eventually be removed.)

@vergenzt
Copy link
Copy Markdown
Contributor Author

vergenzt commented Feb 9, 2026

Hi all! Instead of rebasing this time I "merged" each of the PRs which landed in main since I branched off, and I ensured all changes to grammar.js in the original PR got re-applied into the split versions. I also then switched everything over to ES6 module syntax per #351.

It should be ready to go! Tests passed locally, so would love if someone could approve the CI jobs to run and confirm everything is still green.

Copy link
Copy Markdown
Owner

@DerekStride DerekStride left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was a big refactor, thanks for sticking with it

@DerekStride DerekStride merged commit b8926cd into DerekStride:main Feb 10, 2026
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants