pest to chumsky migration #185

gerau · 2025-12-18T11:12:21Z

No description provided.

apoelstra · 2025-12-18T13:21:06Z

cc @canndrew may want to keep an eye on progress here

gerau · 2026-01-12T13:16:25Z

Right now there is a working parser using the chumsky crate which replicates the behavior of the pest parser in terms of building a correct parse tree -- it should produce the same Simplicity program. This implementation also fixes #79.

Error reporting is currently broken because we need to replace the logic of parse::ParseFromStr to return multiple errors or handle recoverable errors differently, and error recovery is proving to be more overwhelming than I estimated it would be.

The code will be refactored because some parts are only half-finished (such as adding Spanned for certain names) and there are better ways to use parser combinators. However, I want to show this progress before implementing error recovery.

gerau · 2026-01-12T13:16:48Z

cc @canndrew

uncomputable · 2026-01-12T15:19:49Z

src/lib.rs

    }

    #[test]
-    #[ignore]


1b1e751 It's nice to see that chumsky seems to be faster than pest here.

canndrew · 2026-01-16T08:10:06Z

src/error.rs

+                    })
+                    .map_or(0, |ts| u32::from(ts) as usize);
+
+                let start_col = file[line_start_byte..self.span.start].chars().count();


Do we want to count columns as the number of Unicode code points? There's no good way to define "number of columns" in general for non-ASCII text, but LSP defines it as the number of UTF-16 code units, and that's the closest thing to a standard that I'm aware of.

Actually I just checked and LSP now allows you to choose between utf{8,16,32} at your leisure. But it's moot anyway since this is just deciding how long an underline to print and that's going to depend on the terminal.
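To make the difference between the column conventions concrete, here is a small sketch using only the standard library. The `columns` helper and the sample string are illustrative assumptions and not part of this PR; `chars().count()`, as used in the diff above, counts Unicode scalar values.

```rust
/// Returns the column of byte `offset` within its line under three
/// conventions: UTF-8 bytes, UTF-16 code units, and Unicode scalar values.
/// `offset` must lie on a char boundary of `file`.
/// (Hypothetical helper for illustration only.)
fn columns(file: &str, offset: usize) -> (usize, usize, usize) {
    // The line starts just after the previous newline, or at byte 0.
    let line_start = file[..offset].rfind('\n').map_or(0, |i| i + 1);
    let prefix = &file[line_start..offset];
    (
        prefix.len(),                  // UTF-8 bytes
        prefix.encode_utf16().count(), // UTF-16 code units
        prefix.chars().count(),        // Unicode scalar values
    )
}

fn main() {
    // Second line contains a 2-byte char (U+00E9) and a 4-byte char (U+1F600).
    let src = "let a = 1;\nlet \u{e9}\u{1F600}x = 2;";
    let offset = src.find('x').expect("sample contains 'x'");
    let (utf8, utf16, scalars) = columns(src, offset);
    println!("utf8={utf8} utf16={utf16} scalars={scalars}");
}
```

The three numbers differ as soon as the line contains non-ASCII text, which is why the length of a printed underline cannot be exact for every terminal.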

We should consider switching to ariadne for error pretty-printing, as it's the "sister-crate" for chumsky.

canndrew · 2026-01-16T08:19:09Z

It's weird that the lexer is treating all our built-in macro/function/etc names as being keywords. I realize that's how the compiler currently works, so it's okay to land this PR as-is to keep the changes small. But obviously we'd want to eventually treat these as just being identifiers.

gerau · 2026-01-21T13:32:34Z

I would like to provide more context on a few points:

- Some parsers try to recover to "default" values so that they can continue parsing and still report an error. If I understand correctly, most parsers implement this by adding error states to the parsed structures, so that the analysis stage of the compiler can handle these cases correctly. I haven't done this in this PR because it requires changing the analysis code as well. Right now, compilation does not progress to the analysis stage if there is a parsing error.
- I changed the lexer to not lex built-in types and functions as keywords, because doing so creates behavior that was not in the original pest parser (e.g. u1 was considered an UnsignedType even when it was defined as a variable). This does not require significant changes to the parser itself, so I think we should keep this change here.
- I didn't change the errors or their printing much, but I think we should consider refactoring the errors and using ariadne to collect and print them. It seems to pair fairly well with chumsky and would provide prettier errors than we currently have.
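The "error states in parsing structures" idea from the first point can be sketched roughly as follows. This is a toy example with a hypothetical `Item` type and a hand-written parser, not the PR's AST or its chumsky parsers: a failed item is recorded as an error and replaced by an `Error` node, and a later pass skips such nodes.

```rust
/// A toy AST node with an explicit error state (hypothetical, not the PR's AST).
#[derive(Debug, PartialEq)]
enum Item {
    Int(u64),
    /// Placeholder left behind where parsing failed.
    Error,
}

/// Parses comma-separated integers. A malformed item is recorded as an error
/// and replaced by `Item::Error`, so parsing continues with the next item.
fn parse_items(src: &str) -> (Vec<Item>, Vec<String>) {
    let mut items = Vec::new();
    let mut errors = Vec::new();
    for (index, raw) in src.split(',').enumerate() {
        let text = raw.trim();
        match text.parse::<u64>() {
            Ok(n) => items.push(Item::Int(n)),
            Err(_) => {
                errors.push(format!("item {index}: expected integer, found `{text}`"));
                items.push(Item::Error);
            }
        }
    }
    (items, errors)
}

/// A stand-in "analysis" pass that tolerates error nodes by skipping them.
fn sum_valid(items: &[Item]) -> u64 {
    items
        .iter()
        .map(|item| match item {
            Item::Int(n) => *n,
            Item::Error => 0,
        })
        .sum()
}

fn main() {
    let (items, errors) = parse_items("1, x, 3, ?");
    println!("{items:?}");
    for e in &errors {
        println!("error: {e}");
    }
    println!("sum of valid items: {}", sum_valid(&items));
}
```

The cost mentioned above shows up in `sum_valid`: every consumer of the tree has to decide what to do with an `Error` node, which is why this touches the analysis code.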

gerau · 2026-01-21T13:40:01Z

Also a note about performance: chumsky seems to be faster than pest in general. For example, on my machine, for a large .simf file generated by simplicity-bn254, chumsky parses 10 times faster than pest. The trade-off is slower compilation and lag in rust-analyzer, because chumsky is type-driven.

It would be nice if we could move the parser to a different crate, so it would not affect compile time too much, and the SimplicityHL parser could be used separately from the compiler.

gerau · 2026-01-21T13:41:31Z

cc @canndrew @KyrylR

KyrylR · 2026-01-22T14:48:09Z

src/error.rs

+                let start_pos = index.line_col(TextSize::from(self.span.start as u32));
+                let end_pos = index.line_col(TextSize::from(self.span.end as u32));


Can we implement the From trait to avoid repeating TextSize::from(self.span.start as u32)?

Span has a start and an end, so the only way to implement the From trait would be for a tuple of TextSizes. That isn't a very pretty solution either, so I'd prefer to stick with the current variant.
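One possible middle ground, sketched here under assumptions, is a small helper method on the span instead of a `From` impl. The `Span` below is a stand-in with the same field names as in the diff, and `offsets_u32` is a hypothetical name; each returned `u32` could then be passed to `TextSize::from` as in the diff above.

```rust
/// Stand-in for the PR's byte-offset span (field names follow the diff above).
#[derive(Clone, Copy, Debug)]
struct Span {
    start: usize,
    end: usize,
}

impl Span {
    /// Converts both offsets to `u32` in one place. Returns `None` when an
    /// offset does not fit, instead of silently truncating as `as u32` would.
    /// (Hypothetical helper, not part of the PR.)
    fn offsets_u32(self) -> Option<(u32, u32)> {
        let start = u32::try_from(self.start).ok()?;
        let end = u32::try_from(self.end).ok()?;
        Some((start, end))
    }
}

fn main() {
    let span = Span { start: 3, end: 9 };
    println!("{:?} -> {:?}", span, span.offsets_u32());
}
```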

apoelstra · 2026-01-23T18:47:34Z

> It would be nice if we could move the parser to a different crate, so it would not affect compile time too much, and the SimplicityHL parser could be used separately from the compiler.

Strongly agreed. If we had a public somewhat-standard AST type that the parser would produce, this would also let people implement some kinds of linters and/or formatters without needing support from us. (We will likely get some pressure to preserve whitespace and comments to help with this. Maybe we actually want two AST types, one that has whitespace and comments and one that's reduced somehow.)

In any case, this is all separate from this PR.

KyrylR · 2026-01-29T13:24:09Z

Please rebase onto master

KyrylR · 2026-01-29T13:48:32Z

src/lexer.rs

+        "true" => Token::Bool(true),
+        "false" => Token::Bool(false),
+        "bool" => Token::BooleanType,


Hardcoded built-in names - "true", "false", "bool" are treated as keywords at the lexer level

Let's create a GitHub issue, so we do not forget to resolve this

I don't think it's a problem for "true" or "false," because the Rust compiler doesn't allow variables to be named "true" or "false".

However, I can change it for "bool" very quickly, as it wouldn't be so difficult.
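A rough sketch of what moving "bool" out of the lexer could look like: the lexer keeps "true" and "false" as literals but emits a plain identifier for everything else, and the built-in name is resolved in a later step. The `Token` and `Type` enums and both function names are simplified stand-ins, not the PR's actual types.

```rust
/// Simplified stand-in for the lexer's token type.
#[derive(Debug, PartialEq)]
enum Token {
    Bool(bool),
    Ident(String),
}

/// Lexer side: only the boolean literals are special; "bool" is an identifier.
fn classify(word: &str) -> Token {
    match word {
        "true" => Token::Bool(true),
        "false" => Token::Bool(false),
        other => Token::Ident(other.to_string()),
    }
}

/// Simplified stand-in for a parsed type.
#[derive(Debug, PartialEq)]
enum Type {
    Boolean,
    Named(String),
}

/// Parser side: built-in type names are recognized only in type position.
fn resolve_type(tok: &Token) -> Option<Type> {
    match tok {
        Token::Bool(_) => None,
        Token::Ident(name) => Some(match name.as_str() {
            "bool" => Type::Boolean,
            other => Type::Named(other.to_string()),
        }),
    }
}

fn main() {
    for word in ["bool", "Foo", "true"] {
        let tok = classify(word);
        println!("{word}: {tok:?} -> {:?}", resolve_type(&tok));
    }
}
```

With this split, a variable named like a built-in type is still lexed as an identifier, and only the context decides whether it denotes the built-in.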

KyrylR · 2026-01-29T13:49:54Z

src/lexer.rs

+    let macros =
+        choice((just("assert!"), just("panic!"), just("dbg!"), just("list!"))).map(Token::Macro);


Macros are lexed as special Macro tokens, so adding new macros would require lexer changes. Let's mention this in the issue as well.

Maybe we could consider treating ident! as a general pattern.
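A minimal sketch of the general `ident!` pattern, written by hand with the standard library rather than with chumsky combinators. The function name `lex_macro` is hypothetical, and the check that rejects `!=` is a precaution that assumes the language has (or may get) such an operator.

```rust
/// Tries to lex `name!` at the start of `input`.
/// Returns the macro name and the remaining input on success.
/// (Hypothetical helper for illustration only.)
fn lex_macro(input: &str) -> Option<(&str, &str)> {
    // An identifier must start with an ASCII letter or underscore.
    let first = input.chars().next()?;
    if !(first.is_ascii_alphabetic() || first == '_') {
        return None;
    }
    // The identifier ends at the first character that is not [A-Za-z0-9_].
    let end = input
        .find(|c: char| !(c.is_ascii_alphanumeric() || c == '_'))
        .unwrap_or(input.len());
    // The identifier must be followed directly by '!'.
    let rest = input[end..].strip_prefix('!')?;
    // Do not mistake `a!=b` for a macro call named `a`.
    if rest.starts_with('=') {
        return None;
    }
    Some((&input[..end], rest))
}

fn main() {
    for src in ["assert!(x)", "my_macro!(1)", "a!=b", "foo(x)"] {
        println!("{src:?} -> {:?}", lex_macro(src));
    }
}
```

Whether a given name is a known macro would then be checked after lexing, so adding a macro no longer needs a lexer change.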

KyrylR · 2026-01-29T13:56:01Z

src/lexer.rs

+    // We would discard them for the compiler, but they are needed, for example, for the formatter.
+    Comment,


Could you create an issue "Formatter support" and mention that content is discarded for now?

KyrylR · 2026-01-29T14:04:23Z

src/lib.rs

    /// ## Errors
    ///
    /// The string is not a valid SimplicityHL program.
    pub fn new<Str: Into<Arc<str>>>(s: Str) -> Result<Self, String> {


Why can't we return an Error type instead of a String?

I think it's because at this stage we only want to have pretty-printed diagnostics?

KyrylR · 2026-01-29T14:07:18Z

src/lib.rs

 impl TemplateProgram {
    /// Parse the template of a SimplicityHL program.
    ///
    /// ## Errors
    ///
    /// The string is not a valid SimplicityHL program.
    pub fn new<Str: Into<Arc<str>>>(s: Str) -> Result<Self, String> {


TemplateProgram::new() collects errors but returns them as a single string. Are multiple errors combined into that one string?

Yes, you are correct.

KyrylR · 2026-01-29T14:09:57Z

src/lib.rs

@@ -664,4 +675,167 @@
            .with_witness_values(WitnessValues::default())
            .assert_run_success();
    }


Maybe we could also add a test that checks that multiple errors are detected?
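The shape of such a test might look like the sketch below. It uses a toy checker and a hypothetical `ToyCollector`, not the PR's `ErrorCollector` or `TemplateProgram`; the point is only that one input with two independent mistakes should yield a report that mentions both.

```rust
/// Toy error collector (hypothetical; the PR's `ErrorCollector` may differ).
#[derive(Default)]
struct ToyCollector {
    errors: Vec<String>,
}

impl ToyCollector {
    fn push(&mut self, message: String) {
        self.errors.push(message);
    }

    /// Joins all collected errors into one string, one error per line.
    fn into_result(self) -> Result<(), String> {
        if self.errors.is_empty() {
            Ok(())
        } else {
            Err(self.errors.join("\n"))
        }
    }
}

/// Toy check: every non-empty line must end with `;`.
/// It keeps going after the first problem so that all of them are reported.
fn check(src: &str) -> Result<(), String> {
    let mut collector = ToyCollector::default();
    for (i, line) in src.lines().enumerate() {
        let line = line.trim();
        if !line.is_empty() && !line.ends_with(';') {
            collector.push(format!("line {}: expected `;`", i + 1));
        }
    }
    collector.into_result()
}

fn main() {
    match check("let a = 1\nlet b = 2;\nlet c = 3") {
        Ok(()) => println!("ok"),
        Err(report) => println!("{report}"),
    }
}
```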

KyrylR · 2026-01-29T14:16:29Z

src/main.rs

+    let compiled =
+        match CompiledProgram::new(prog_text, Arguments::default(), include_debug_symbols) {
+            Ok(program) => program,
+            Err(e) => {
+                eprintln!("{}", e);
+                std::process::exit(1);
+            }
+        };


Maybe we could use approach like below?

    use std::process::{ExitCode, Termination};

    struct CliError(String);

    impl Termination for CliError {
        fn report(self) -> ExitCode {
            eprintln!("{}", self.0);
            ExitCode::from(1)
        }
    }

    fn main() -> Result<(), CliError> {
        // ...
        let compiled = CompiledProgram::new(prog_text, Arguments::default(), include_debug_symbols)
            .map_err(CliError)?;
        // ...
    }

I implemented this code because I was having some trouble with the error display -- it was showing up like a Debug print, and CliError doesn't solve that issue. I'd also like to avoid touching the CLI too much right now, as this PR is mainly focused on the parser changes.
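For context on the Debug-print issue: when `main` returns `Result<(), E>`, the standard library reports an `Err` using the `Debug` representation of `E`. One way to keep `Display` output without calling `std::process::exit` is to return `ExitCode` from `main`. The sketch below assumes a stand-in `compile` function in place of `CompiledProgram::new`.

```rust
use std::process::ExitCode;

/// Stand-in for the compile step; the real code would call
/// `CompiledProgram::new` (this toy version only rejects empty input).
fn compile(prog_text: &str) -> Result<usize, String> {
    if prog_text.trim().is_empty() {
        Err("error: empty program".to_string())
    } else {
        Ok(prog_text.len())
    }
}

fn main() -> ExitCode {
    match compile("fn main() {}") {
        Ok(size) => {
            println!("compiled {size} bytes of source");
            ExitCode::SUCCESS
        }
        Err(e) => {
            // `{}` uses Display, so a pretty-printed diagnostic stays readable.
            eprintln!("{e}");
            ExitCode::FAILURE
        }
    }
}
```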

KyrylR · 2026-01-29T14:27:51Z

src/parse.rs

+            tok.into_iter()
+                .map(|(tok, span)| (tok, Span::from(span)))
+                .filter(|(tok, _)| !matches!(tok, Token::Comment | Token::BlockComment))
+                .collect::<Vec<_>>()


Duplication with ParseFromStrWithErrors
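One way to remove the duplication would be a shared helper that both trait implementations call. The sketch below uses a simplified `Token` enum and a generic span type `S`; `strip_comments` is a hypothetical name, and the span conversion from the diff is left out.

```rust
/// Simplified stand-in for the lexer's token type.
#[derive(Debug, PartialEq)]
enum Token {
    Ident(String),
    Comment,
    BlockComment,
}

/// Drops comment tokens and keeps everything else in order.
/// Generic over the span type so it does not depend on the PR's `Span`.
fn strip_comments<S>(tokens: Vec<(Token, S)>) -> Vec<(Token, S)> {
    tokens
        .into_iter()
        .filter(|(tok, _)| !matches!(tok, Token::Comment | Token::BlockComment))
        .collect()
}

fn main() {
    let tokens = vec![
        (Token::Ident("a".to_string()), 0..1),
        (Token::Comment, 2..10),
        (Token::BlockComment, 11..20),
        (Token::Ident("b".to_string()), 21..22),
    ];
    println!("{:?}", strip_comments(tokens));
}
```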

KyrylR · 2026-01-29T14:33:32Z

src/parse.rs

            span: Span::DUMMY,
        })
    }
 }


The lack of unit tests is quite troublesome, though this file is already huge.

The lexer parses incoming code into tokens, which makes it simpler to process using `chumsky`.

This commit introduces multiple changes because it is a full rewrite of parsing and errors.

Changes in `error.rs`:
- Change `Span` to use byte offsets in place of old `Position`
- Add `line-index` crate to calculate line and column of byte offset
- Change `RichError` implementation to use new `Span` structure
- Implement `chumsky` error traits, so it can be used in error reporting of parsers
- Add `expected..found` error

Changes in `parse.rs`:
- Fully rewrite `pest` parsers to `chumsky` parsers.
- Change `ParseFromStr` trait to use this change.

This adds `ParseFromStrWithErrors`, which takes an `ErrorCollector` and returns an `Option` of the AST. It also changes `TemplateProgram` to use the new trait with the collector.

it's not slow anymore

This adds tests to ensure that the compiler using the `chumsky` parser produces the same Simplicity program as when using the `pest` parser for the default examples. The programs were compiled into .json files using an old `simc` version with debug symbols and are located in the `test-data/` folder.

gerau · 2026-01-30T14:30:46Z

cc @apoelstra

gerau mentioned this pull request Dec 26, 2025

Refactor parsing and analysis for better tooling support #191

Open

gerau force-pushed the simc/chumsky-migration branch from 6db55db to 1b1e751 Compare January 12, 2026 13:01

uncomputable reviewed Jan 12, 2026

gerau force-pushed the simc/chumsky-migration branch from 1b1e751 to 1e7c61b Compare January 14, 2026 15:10

canndrew reviewed Jan 16, 2026

src/error.rs (outdated)

canndrew reviewed Jan 16, 2026

src/error.rs (outdated)

canndrew reviewed Jan 16, 2026

gerau force-pushed the simc/chumsky-migration branch 3 times, most recently from bd5c30f to 24a6bc6 Compare January 21, 2026 13:08

KyrylR reviewed Jan 22, 2026

gerau force-pushed the simc/chumsky-migration branch from 24a6bc6 to b200640 Compare January 27, 2026 11:50

KyrylR reviewed Jan 29, 2026

gerau force-pushed the simc/chumsky-migration branch 2 times, most recently from b88ef62 to 4bf6252 Compare January 29, 2026 13:52

KyrylR reviewed Jan 29, 2026

gerau force-pushed the simc/chumsky-migration branch from 4bf6252 to 6a5bc25 Compare January 29, 2026 16:12

add lexer

5030190

gerau force-pushed the simc/chumsky-migration branch from 6a5bc25 to 7e5a257 Compare January 30, 2026 13:07

gerau mentioned this pull request Jan 30, 2026

Hardcoded built-ins #202

Open

gerau added 5 commits January 30, 2026 16:17

add ErrorCollector

f8ffaff

add multiple error handling

c24b901

remove #[ignore] above fuzz_slow_unit_1()

183fdae

gerau force-pushed the simc/chumsky-migration branch from 7e5a257 to e53f005 Compare January 30, 2026 14:21

gerau marked this pull request as ready for review January 30, 2026 14:29

gerau requested a review from delta1 as a code owner January 30, 2026 14:29

This was referenced Jan 30, 2026

Implement error states in parser #205

Open

Formatter support #206

Open
