@ChrHorn
Contributor

ChrHorn commented Oct 10, 2025

Not sure how often emoji variable names are used, but this roughly halves the size of the parser.c file (45 MB -> 24 MB), so it might be worth the tradeoff.

Regenerated parser is not committed.

Some Unicode characters or character groups lead to a large increase
in parser size. This change roughly halves the size of the generated
parser file.
@ObserverOfTime
Member

> Regenerated parser is not committed.

It should be. :)

@ChrHorn
Contributor Author

ChrHorn commented Oct 10, 2025

> > Regenerated parser is not committed.
>
> It should be. :)

Ok, I added the parser changes too. Makes sense with CI now enabled.

@clason
Contributor

clason commented Oct 13, 2025

You need to regenerate the parser with --abi 14 since the bindings are still ancient. (Bumping the ABI is technically a breaking change and hence should be done in a separate commit marked as such. Unless @savq disagrees, it would be time though.)

@clason
Contributor

clason commented Oct 13, 2025

For the record, dropping emojis halves not only the parser size but also the large state count, which directly affects performance. So I would argue that it's worth it.

(Having said that, I just did a quick test and the change did not affect either compiled parser size or parsing speed...)
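
One way to compare two generated parsers is to read the numeric constants that tree-sitter writes as `#define` lines near the top of `parser.c`. The following is a minimal sketch, assuming the file contains lines such as `#define LANGUAGE_VERSION 14` and `#define LARGE_STATE_COUNT ...`; the helper name `read_parser_constants` and the sample numbers are hypothetical and not taken from this PR.

```python
import re

# Matches lines such as "#define STATE_COUNT 1234" in a generated parser.c.
DEFINE_RE = re.compile(r"^#define[ \t]+([A-Z_]+)[ \t]+(\d+)[ \t]*$", re.MULTILINE)


def read_parser_constants(parser_c_text: str) -> dict[str, int]:
    """Return the integer #define constants found in parser.c source text."""
    return {name: int(value) for name, value in DEFINE_RE.findall(parser_c_text)}


# Made-up excerpt; the numbers are illustrative, not measured from this PR.
sample = """\
#define LANGUAGE_VERSION 14
#define STATE_COUNT 1000
#define LARGE_STATE_COUNT 200
"""

constants = read_parser_constants(sample)
print(constants["LANGUAGE_VERSION"], constants["LARGE_STATE_COUNT"])
```

Running this on the `parser.c` from before and after the change would show whether the ABI version and the state counts moved as expected.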

@ChrHorn
Contributor Author

ChrHorn commented Oct 14, 2025

> You need to regenerate the parser with --abi 14 since the bindings are still ancient. (Bumping the ABI is technically a breaking change and hence should be done in a separate commit marked as such. Unless @savq disagrees, it would be time though.)

I regenerated with --abi 14, but there still seem to be some issues with the CI.

> (Having said that, I just did a quick test and the change did not affect either compiled parser size or parsing speed...)

For me the compiled parser size is also cut roughly in half (julia.so 5.2 MB -> 2.5 MB). The speed difference is much less noticeable, about 10% faster on my end.
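
As a quick check of the "cut in half" figures quoted in this thread, the relative reductions can be computed directly. This is only an arithmetic sketch; the helper name `percent_reduction` is made up for illustration.

```python
def percent_reduction(before: float, after: float) -> float:
    """Relative size reduction, in percent, going from `before` to `after`."""
    return (before - after) / before * 100.0


# Sizes quoted in this thread, both in MB.
print(f"parser.c: {percent_reduction(45, 24):.1f}%")
print(f"julia.so: {percent_reduction(5.2, 2.5):.1f}%")
```

Both values land close to 50% (about 47% for `parser.c` and about 52% for `julia.so`), so "halved" is a fair summary for the source file and the compiled library alike.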

@clason
Contributor

clason commented Oct 14, 2025

Yeah, that's now a problem with CI. It'll probably be easiest to wait until a maintainer (not me) updates the whole shebang to the latest versions.

@clason
Contributor

clason commented Oct 14, 2025

@savq are you still maintaining the parser here?

@ObserverOfTime
Member

There are three options:

  1. Generate with ABI 15 (breaking)
  2. Adapt CI to generate with ABI 14
  3. Adapt CI to not generate at all

@savq
Collaborator

savq commented Oct 21, 2025

Hi. Thanks for the PR @ChrHorn

On the ABI: It hasn't been updated because there hasn't been much activity on the repo, but there are no blockers AFAIK.

On the PR itself: I feel like deliberately not parsing a part of the language isn't ideal, but having a smaller parser should be an option. We'd already discussed this in #144 and I still think generating two parsers would be the happy solution. So we'd have to generate a julia parser and a julia-ascii parser that removes emojis and unicode operators.

@clason
Contributor

clason commented Oct 21, 2025

I have already created a fork for nvim-treesitter where I will make that change; feel free to pull in the ABI 15 (including CI) changes from there if you are interested.

@ChrHorn
Contributor Author

ChrHorn commented Oct 21, 2025

> On the PR itself: I feel like deliberately not parsing a part of the language isn't ideal, but having a smaller parser should be an option. We'd already discussed this in #144 and I still think generating two parsers would be the happy solution. So we'd have to generate a julia parser and a julia-ascii parser that removes emojis and unicode operators.

Agreed that it's not ideal, but the smaller parser is very much worth it in my opinion. At least temporarily, until it is fixed upstream.
Only supporting ASCII is also not ideal. Unicode operators, unlike emojis (which I have only seen used as a joke), are very much used in the wild. There is also no benefit to restricting the identifiers any further. The issue is binary: you either hit it or you don't.

Could also open another upstream issue. tree-sitter/tree-sitter#3496 was closed and focused on WASM, but I suspect it's just a cascading issue stemming from this one.
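
To gauge whether a given codebase would be affected by dropping emoji identifiers, one could scan its Julia sources for emoji code points. The sketch below is hedged: the ranges in `EMOJI_RANGES` are an assumed approximation of "emoji" and may not match the exact character set removed by this change, the helper name `find_emoji_chars` is hypothetical, and the scan looks at every character (including strings and comments), not only identifiers.

```python
# Assumption: these code point ranges approximate "emoji"; the exact set of
# characters removed by the grammar change may differ.
EMOJI_RANGES = (
    (0x1F300, 0x1FAFF),  # broad range covering most pictographic emoji
    (0x2600, 0x27BF),    # miscellaneous symbols and dingbats
)


def find_emoji_chars(source: str) -> list[tuple[int, int, str]]:
    """Return (line, column, char) for each character in EMOJI_RANGES (1-based)."""
    hits = []
    for line_no, line in enumerate(source.splitlines(), start=1):
        for col_no, ch in enumerate(line, start=1):
            if any(lo <= ord(ch) <= hi for lo, hi in EMOJI_RANGES):
                hits.append((line_no, col_no, ch))
    return hits


# The Greek letter is outside the assumed ranges, so only the pizza is reported.
julia_snippet = "🍕 = 1\nα = 🍕 + 1\n"
for line_no, col_no, ch in find_emoji_chars(julia_snippet):
    print(f"line {line_no}, column {col_no}: U+{ord(ch):04X}")
```

An empty result for a project would suggest that the smaller parser loses nothing there, while Unicode operators and Greek identifiers stay untouched by this check.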

@savq
Collaborator

savq commented Nov 7, 2025

> Only supporting ASCII is also not ideal. Unicode operators, unlike emojis (which I have only seen used as a joke), are very much used in the wild. There is also no benefit to restricting the identifiers any further. The issue is binary: you either hit it or you don't.

I thought a bit more about this and I think I agree. People using tree-sitter for code analysis stuff are not spamming their code with emojis, so removing emoji identifiers is fine.


@clason I see that this change (along with the ABI update and #177) is already done in tree-sitter-grammars/tree-sitter-julia. Should I merge that instead and create a new release here?

@clason
Contributor

clason commented Nov 7, 2025

Yes, feel free to cherry-pick the commits (or make a master->master PR for rebasing), they should apply cleanly here (in their order).
