Improve JSON string encoding #6328

tlively · 2024-02-21T03:26:19Z

Catch and report all kinds of WTF-8 encoding errors in the source strings,
including invalid leading bytes, invalid trailing bytes, unexpected ends of
strings, and invalid surrogate sequences. Insert replacement characters into the
output as necessary. Add a TODO about minimizing size by escaping only those
code points mandated to be escaped by the JSON spec. Generally improve
readability of the code.
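As a rough illustration of the error classes listed above, here is a minimal sketch of a WTF-8 decoder that substitutes U+FFFD on errors. The names `takeCodePoint`, `decodeWTF8`, and `Decoded` are hypothetical and are not taken from the PR. Overlong encodings and values above U+10FFFF are not checked, and replacing the offending low surrogate is an assumption made for this sketch, not necessarily what the PR does.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical names throughout; this is not Binaryen's actual code.
constexpr uint32_t kReplacement = 0xFFFD;

struct Decoded {
  std::vector<uint32_t> codePoints;
  bool valid;
};

// Decodes one code point starting at s[pos] and advances pos past the bytes
// it consumed. Returns false on malformed input; the leading byte is always
// consumed so the caller makes progress.
bool takeCodePoint(const std::string& s, size_t& pos, uint32_t& u) {
  uint8_t lead = s[pos++];
  size_t trailing;
  if ((lead & 0x80) == 0x00) {
    u = lead;
    return true;
  } else if ((lead & 0xE0) == 0xC0) {
    trailing = 1;
    u = lead & 0x1F;
  } else if ((lead & 0xF0) == 0xE0) {
    trailing = 2;
    u = lead & 0x0F;
  } else if ((lead & 0xF8) == 0xF0) {
    trailing = 3;
    u = lead & 0x07;
  } else {
    return false; // invalid leading byte
  }
  for (size_t i = 0; i < trailing; ++i) {
    if (pos == s.size()) {
      return false; // unexpected end of string
    }
    uint8_t b = s[pos];
    if ((b & 0xC0) != 0x80) {
      return false; // invalid trailing byte; leave it for the next call
    }
    ++pos;
    u = (u << 6) | (b & 0x3F);
  }
  return true;
}

Decoded decodeWTF8(const std::string& s) {
  Decoded out;
  out.valid = true;
  bool lastWasHighSurrogate = false;
  size_t pos = 0;
  while (pos < s.size()) {
    uint32_t u = 0;
    if (!takeCodePoint(s, pos, u)) {
      out.valid = false;
      out.codePoints.push_back(kReplacement);
      lastWasHighSurrogate = false;
      continue;
    }
    bool isHigh = 0xD800 <= u && u <= 0xDBFF;
    bool isLow = 0xDC00 <= u && u <= 0xDFFF;
    if (lastWasHighSurrogate && isLow) {
      // WTF-8 allows lone surrogates, but a high/low pair must be encoded as
      // a single 4-byte sequence. Assumption: replace the low half.
      out.valid = false;
      u = kReplacement;
    }
    out.codePoints.push_back(u);
    lastWasHighSurrogate = isHigh;
  }
  return out;
}
```

A caller could report an error when `valid` is false and still use `codePoints` to produce output.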

tlively · 2024-02-21T03:26:33Z

Current dependencies on/for this PR:

Improve JSON string encoding #6328 👈
main

This stack of pull requests is managed by Graphite.

kripken · 2024-02-21T03:51:55Z

src/support/string.cpp

+        os << "\\b";
+        continue;
+      case '\f':
+        os << "\\f";


Are \b and \f necessary here? (What spec would I read that in?)

(if you add them to the test without this PR, does the test fail?)

See page 4 of the JSON spec: https://ecma-international.org/wp-content/uploads/ECMA-404_2nd_edition_december_2017.pdf. We could also use \uXXXX escapes, which is what the previous code would have done, but these are shorter.

Got it, thanks.

Maybe worth adding those to the test?
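To make the trade-off in this thread concrete, here is a minimal sketch of the two-character escapes that ECMA-404 defines, with a `\uXXXX` fallback for the remaining control characters. The function name `escapeASCIIForJSON` is hypothetical and not part of the PR, and the sketch assumes ASCII-only input (bytes at or above 0x80 are passed through untouched).

```cpp
#include <iomanip>
#include <sstream>
#include <string>

// Hypothetical helper, not Binaryen's actual code. Assumes ASCII input.
std::string escapeASCIIForJSON(const std::string& in) {
  std::ostringstream os;
  for (unsigned char c : in) {
    // Two-character escapes defined by the JSON grammar.
    switch (c) {
      case '"':
        os << "\\\"";
        continue;
      case '\\':
        os << "\\\\";
        continue;
      case '\b':
        os << "\\b";
        continue;
      case '\f':
        os << "\\f";
        continue;
      case '\n':
        os << "\\n";
        continue;
      case '\r':
        os << "\\r";
        continue;
      case '\t':
        os << "\\t";
        continue;
      default:
        break;
    }
    if (c < 0x20) {
      // Other control characters have no short form, so use \uXXXX.
      os << "\\u" << std::hex << std::setw(4) << std::setfill('0')
         << unsigned(c) << std::dec;
      continue;
    }
    os << c;
  }
  return os.str();
}
```

With this sketch a backspace becomes the two characters `\b` instead of the six characters `\u0008`, which is the size saving mentioned in the reply.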

kripken · 2024-02-21T03:54:03Z

src/support/string.cpp

+    // Print.cpp would consider the contents unprintable, messing up our test.
+    bool isNaivelyPrintable = 32 <= u && u < 127;
+    if (isNaivelyPrintable) {
+      assert(u < 0x80 && "need additional logic to emit valid UTF-8");


We only get here if u < 127, so the only difference with u < 0x80 = 128 is when u == 128? If so perhaps check that directly?

This is meant to guard against a future improvement where we would start emitting most code points directly as UTF-8 instead of escaping them. After the initial refactoring to let most code points hit this code path by default, this assertion will trigger if we don't add extra logic for code points 0x80 and greater. In the context of that future change, the assertion would be less robust if it checked against 128 specifically.
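The code path under discussion can be sketched roughly as follows: printable ASCII is written directly, and everything else is escaped as `\uXXXX`, using a UTF-16 surrogate pair above U+FFFF. The function name `emitCodePoint` is hypothetical, and the handling of `"` and `\` is added here only so the sketch produces valid JSON on its own; the snippet quoted in the review does not show how the PR handles them.

```cpp
#include <cassert>
#include <cstdint>
#include <iomanip>
#include <sstream>
#include <string>

// Hypothetical helper, not Binaryen's actual code.
std::string emitCodePoint(uint32_t u) {
  std::ostringstream os;
  if (u == '"' || u == '\\') {
    os << '\\' << char(u);
    return os.str();
  }
  bool isNaivelyPrintable = 32 <= u && u < 127;
  if (isNaivelyPrintable) {
    // Redundant today, but it fires if the printable range is ever widened
    // without adding real UTF-8 encoding for code points >= 0x80.
    assert(u < 0x80 && "need additional logic to emit valid UTF-8");
    os << char(u);
    return os.str();
  }
  auto writeEscape = [&](uint32_t unit) {
    os << "\\u" << std::hex << std::setw(4) << std::setfill('0') << unit
       << std::dec;
  };
  if (u < 0x10000) {
    writeEscape(u);
  } else {
    // Split into a UTF-16 surrogate pair.
    uint32_t v = u - 0x10000;
    writeEscape(0xD800 + (v >> 10));
    writeEscape(0xDC00 + (v & 0x3FF));
  }
  return os.str();
}
```

If `isNaivelyPrintable` were later relaxed to cover, say, all code points at or above 32, the assertion would trip on the first non-ASCII code point, which is the guard described in the reply.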

tlively · 2024-02-21T19:53:37Z

I added tests for all the error cases. Will land, but PTAL if you're interested.

tlively requested a review from kripken February 21, 2024 03:26

kripken approved these changes Feb 21, 2024

View reviewed changes

extend tests, fix warnings

95ca22e

tlively enabled auto-merge (squash) February 21, 2024 19:54

tlively merged commit 39ae6cf into main Feb 21, 2024

tlively deleted the improve-json-encode branch February 21, 2024 20:10

gkdn mentioned this pull request Aug 31, 2024

stringconsts gkdn/binaryen#1

Closed
