-
Notifications
You must be signed in to change notification settings - Fork 831
Improve JSON string encoding #6328
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Catch and report all kinds of WTF-8 encoding errors in the source strings, including invalid leading bytes, invalid trailing bytes, unexpected ends of strings, and invalid surrogate sequences. Insert replacement characters into the output as necessary. Add a TODO about minimizing size by escaping only those code points mandated to be escaped by the JSON spec. Generally improve readability of the code.
|
Current dependencies on/for this PR: This stack of pull requests is managed by Graphite. |
| os << "\\b"; | ||
| continue; | ||
| case '\f': | ||
| os << "\\f"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are \b and \f necessary here? (What spec would I read that in?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(if you add them to the test without this PR, does the test fail?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See page 4 of the JSON spec: https://ecma-international.org/wp-content/uploads/ECMA-404_2nd_edition_december_2017.pdf. We could also use \uXXXX escapes, which is what the previous code would have done, but these are shorter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it, thanks.
Maybe worth adding those to the test?
| // Print.cpp would consider the contents unprintable, messing up our test. | ||
| bool isNaivelyPrintable = 32 <= u && u < 127; | ||
| if (isNaivelyPrintable) { | ||
| assert(u < 0x80 && "need additional logic to emit valid UTF-8"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We only get here if u < 127, so the only difference with u < 0x80 = 128 is when u == 128? If so perhaps check that directly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is meant to guard against future improvement where we would start emitting most code points directly as UTF-8 instead of escaping them. After the initial refactoring to allow most code points to hit this code path by default, this assertion will trigger if we don't add extra logic for code points 0x80 and greater. The assertion would be less robust in the context of that future change if it checked against 128 specifically.
|
I added tests for all the error cases. Will land, but PTAL if you're interested. |
Catch and report all kinds of WTF-8 encoding errors in the source strings, including invalid leading bytes, invalid trailing bytes, unexpected ends of strings, and invalid surrogate sequences. Insert replacement characters into the output as necessary. Add a TODO about minimizing size by escaping only those code points mandated to be escaped by the JSON spec. Generally improve readability of the code.

Catch and report all kinds of WTF-8 encoding errors in the source strings,
including invalid leading bytes, invalid trailing bytes, unexpected ends of
strings, and invalid surrogate sequences. Insert replacement characters into the
output as necessary. Add a TODO about minimizing size by escaping only those
code points mandated to be escaped by the JSON spec. Generally improve
readability of the code.