-
Notifications
You must be signed in to change notification settings - Fork 831
[Strings] Represent string values as WTF-16 internally #6418
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
This stack of pull requests is managed by Graphite. Learn more about stacking. |
| - Add a new `BinaryenModuleReadWithFeatures` function to the C API that allows | ||
| to configure which features to enable in the parser. | ||
| - The build-time option to use legacy WasmGC opcodes is removed. | ||
| - The strings in `string.const` instructions must now be valid WTF-8. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this not a limitation for anyone we know?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It shouldn't be, no. Before this change, most invalid WTF-8 would end up littered with replacement characters in the output anyway, and certainly no one wants that.
| } | ||
| return makeStringConst(string); | ||
| // TODO: Use wtf16.view() once we have C++20. | ||
| return makeStringConst(wtf16.str()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's happening here? (comment might help) Specifically, wouldn't the input Literal already by a wtf16 - what are we converting from?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, so we're not doing an encoding conversion here, but rather a data structure conversion. The Literal stores a sequence of Literal integers and we need to convert that to a Name.
| (string.concat (string.const "a\F0") (string.const "b")) | ||
| (string.const "a\F0b") | ||
| (string.concat (string.const "\ED\A0\80") (string.const "\ED\BD\88")) | ||
| (string.const "\F0\90\8D\88") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe add a comment as to why these seemingly-very-different strings are in fact equal?
| ) | ||
| (drop | ||
| (string.const "invalid surrogate sequence \ED\A0\81\ED\B0\B7") | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we no longer accepting these as inputs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct. These test cases are all invalid WTF-8 and we were previously testing that we printed warnings and inserted replacement characters correctly, but now we just reject them in the parsers.
test/lit/strings.wast
Outdated
| ;; CHECK: (global $string-const stringref (string.const "string in a global \01\ff\00\t\t\n\n\r\r\"\"\'\'\\\\")) | ||
| (global $string-const stringref (string.const "string in a global \01\ff\00\t\09\n\0a\r\0d\"\22\'\27\\\5c")) | ||
| ;; CHECK: (global $string-const stringref (string.const "string in a global \c2\a3_\e2\82\ac_\f0\90\8d\88")) | ||
| (global $string-const stringref (string.const "string in a global \C2\A3_\E2\82\AC_\F0\90\8D\88")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it not worth still testing \n\r"' and the other corner cases here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh right, for the JSON escaping. I'll add that back in.
kripken
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm, modulo UTF8/16 nuances that I am not an expert on... but the fuzzer should now be able to test such things by comparing to V8. Might be worth fuzzing this for a while before landing.
WTF-16, i.e. arbitrary sequences of 16-bit values, is the encoding of Java and JavaScript strings, and using the same encoding makes the interpretation of string operations trivial, even when accounting for non-ascii characters. Specifically, use little-endian WTF-16. Re-encode string constants from WTF-8 to WTF-16 in the parsers, then back to WTF-8 in the writers. Update the constructor for string `Literal`s to interpret the string as WTF-16 and store a sequence of WTF-16 code units, i.e. 16-bit integers. Update `Builder::makeConstantExpression` accordingly to convert from the new `Literal` string representation back to a WTF-16 string. Update the interpreter to remove the logic for detecting non-ascii characters and bailing out. The naive implementations of all the string operations are correct now that our string encoding matches the JS string encoding.
src/wasm-interpreter.h
Outdated
| if (hasNonAsciiUpTo(refValues, endVal)) { | ||
| return Flow(NONCONSTANT_FLOW); | ||
|
|
||
| if (endVal > refValues.size()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The previous line ensures that endVal <= refValues.size(), so this check is always false.

WTF-16, i.e. arbitrary sequences of 16-bit values, is the encoding of Java and
JavaScript strings, and using the same encoding makes the interpretation of
string operations trivial, even when accounting for non-ascii characters.
Specifically, use little-endian WTF-16.
Re-encode string constants from WTF-8 to WTF-16 in the parsers, then back to
WTF-8 in the writers. Update the constructor for string
Literals to interpretthe string as WTF-16 and store a sequence of WTF-16 code units, i.e. 16-bit
integers. Update
Builder::makeConstantExpressionaccordingly to convert fromthe new
Literalstring representation back to a WTF-16 string.Update the interpreter to remove the logic for detecting non-ascii characters
and bailing out. The naive implementations of all the string operations are
correct now that our string encoding matches the JS string encoding.