Skip to content

Conversation

@tlively
Copy link
Member

@tlively tlively commented Mar 21, 2024

WTF-16, i.e. arbitrary sequences of 16-bit values, is the encoding of Java and
JavaScript strings, and using the same encoding makes the interpretation of
string operations trivial, even when accounting for non-ascii characters.
Specifically, use little-endian WTF-16.

Re-encode string constants from WTF-8 to WTF-16 in the parsers, then back to
WTF-8 in the writers. Update the constructor for string Literals to interpret
the string as WTF-16 and store a sequence of WTF-16 code units, i.e. 16-bit
integers. Update Builder::makeConstantExpression accordingly to convert from
the new Literal string representation back to a WTF-16 string.

Update the interpreter to remove the logic for detecting non-ascii characters
and bailing out. The naive implementations of all the string operations are
correct now that our string encoding matches the JS string encoding.

@tlively tlively requested a review from kripken March 21, 2024 06:03
@tlively
Copy link
Member Author

tlively commented Mar 21, 2024

This stack of pull requests is managed by Graphite. Learn more about stacking.

Join @tlively and the rest of your teammates on Graphite Graphite

- Add a new `BinaryenModuleReadWithFeatures` function to the C API that allows
to configure which features to enable in the parser.
- The build-time option to use legacy WasmGC opcodes is removed.
- The strings in `string.const` instructions must now be valid WTF-8.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this not a limitation for anyone we know?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It shouldn't be, no. Before this change, most invalid WTF-8 would end up littered with replacement characters in the output anyway, and certainly no one wants that.

}
return makeStringConst(string);
// TODO: Use wtf16.view() once we have C++20.
return makeStringConst(wtf16.str());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's happening here? (comment might help) Specifically, wouldn't the input Literal already by a wtf16 - what are we converting from?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, so we're not doing an encoding conversion here, but rather a data structure conversion. The Literal stores a sequence of Literal integers and we need to convert that to a Name.

(string.concat (string.const "a\F0") (string.const "b"))
(string.const "a\F0b")
(string.concat (string.const "\ED\A0\80") (string.const "\ED\BD\88"))
(string.const "\F0\90\8D\88")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add a comment as to why these seemingly-very-different strings are in fact equal?

)
(drop
(string.const "invalid surrogate sequence \ED\A0\81\ED\B0\B7")
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we no longer accepting these as inputs?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct. These test cases are all invalid WTF-8 and we were previously testing that we printed warnings and inserted replacement characters correctly, but now we just reject them in the parsers.

;; CHECK: (global $string-const stringref (string.const "string in a global \01\ff\00\t\t\n\n\r\r\"\"\'\'\\\\"))
(global $string-const stringref (string.const "string in a global \01\ff\00\t\09\n\0a\r\0d\"\22\'\27\\\5c"))
;; CHECK: (global $string-const stringref (string.const "string in a global \c2\a3_\e2\82\ac_\f0\90\8d\88"))
(global $string-const stringref (string.const "string in a global \C2\A3_\E2\82\AC_\F0\90\8D\88"))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it not worth still testing \n\r"' and the other corner cases here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh right, for the JSON escaping. I'll add that back in.

Copy link
Member

@kripken kripken left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, modulo UTF8/16 nuances that I am not an expert on... but the fuzzer should now be able to test such things by comparing to V8. Might be worth fuzzing this for a while before landing.

tlively added 9 commits March 22, 2024 15:05
WTF-16, i.e. arbitrary sequences of 16-bit values, is the encoding of Java and
JavaScript strings, and using the same encoding makes the interpretation of
string operations trivial, even when accounting for non-ascii characters.
Specifically, use little-endian WTF-16.

Re-encode string constants from WTF-8 to WTF-16 in the parsers, then back to
WTF-8 in the writers. Update the constructor for string `Literal`s to interpret
the string as WTF-16 and store a sequence of WTF-16 code units, i.e. 16-bit
integers. Update `Builder::makeConstantExpression` accordingly to convert from
the new `Literal` string representation back to a WTF-16 string.

Update the interpreter to remove the logic for detecting non-ascii characters
and bailing out. The naive implementations of all the string operations are
correct now that our string encoding matches the JS string encoding.
if (hasNonAsciiUpTo(refValues, endVal)) {
return Flow(NONCONSTANT_FLOW);

if (endVal > refValues.size()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The previous line ensures that endVal <= refValues.size(), so this check is always false.

@tlively tlively enabled auto-merge (squash) March 22, 2024 23:25
@tlively tlively merged commit b3fea30 into main Mar 22, 2024
@tlively tlively deleted the wtf-16 branch March 22, 2024 23:56
@gkdn gkdn mentioned this pull request Aug 31, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants