[Strings] Represent string values as WTF-16 internally #6418

tlively · 2024-03-21T06:03:52Z

WTF-16, i.e. arbitrary sequences of 16-bit values, is the encoding of Java and
JavaScript strings, and using the same encoding makes the interpretation of
string operations trivial, even when accounting for non-ascii characters.
Specifically, use little-endian WTF-16.

Re-encode string constants from WTF-8 to WTF-16 in the parsers, then back to
WTF-8 in the writers. Update the constructor for string Literals to interpret
the string as WTF-16 and store a sequence of WTF-16 code units, i.e. 16-bit
integers. Update Builder::makeConstantExpression accordingly to convert from
the new Literal string representation back to a WTF-16 string.

Update the interpreter to remove the logic for detecting non-ascii characters
and bailing out. The naive implementations of all the string operations are
correct now that our string encoding matches the JS string encoding.

tlively · 2024-03-21T06:04:06Z

[Strings] Represent string values as WTF-16 internally #6418 👈
main

This stack of pull requests is managed by Graphite. Learn more about stacking.

Join @tlively and the rest of your teammates on Graphite

kripken · 2024-03-21T16:54:26Z

CHANGELOG.md

 - Add a new `BinaryenModuleReadWithFeatures` function to the C API that allows
   to configure which features to enable in the parser.
 - The build-time option to use legacy WasmGC opcodes is removed.
+ - The strings in `string.const` instructions must now be valid WTF-8.


Is this not a limitation for anyone we know?

It shouldn't be, no. Before this change, most invalid WTF-8 would end up littered with replacement characters in the output anyway, and certainly no one wants that.

kripken · 2024-03-21T16:58:04Z

src/wasm-builder.h

      }
-      return makeStringConst(string);
+      // TODO: Use wtf16.view() once we have C++20.
+      return makeStringConst(wtf16.str());


What's happening here? (comment might help) Specifically, wouldn't the input Literal already by a wtf16 - what are we converting from?

Yes, so we're not doing an encoding conversion here, but rather a data structure conversion. The Literal stores a sequence of Literal integers and we need to convert that to a Name.

kripken · 2024-03-21T17:00:36Z

test/lit/passes/precompute-strings.wast

-   (string.concat (string.const "a\F0") (string.const "b"))
-   (string.const "a\F0b")
+   (string.concat (string.const "\ED\A0\80") (string.const "\ED\BD\88"))
+   (string.const "\F0\90\8D\88")


Maybe add a comment as to why these seemingly-very-different strings are in fact equal?

kripken · 2024-03-21T17:03:20Z

test/lit/passes/string-lowering.wast

-    )
-    (drop
-      (string.const "invalid surrogate sequence \ED\A0\81\ED\B0\B7")
-    )


Are we no longer accepting these as inputs?

Correct. These test cases are all invalid WTF-8 and we were previously testing that we printed warnings and inserted replacement characters correctly, but now we just reject them in the parsers.

kripken · 2024-03-21T17:04:03Z

test/lit/strings.wast

-  ;; CHECK:      (global $string-const stringref (string.const "string in a global \01\ff\00\t\t\n\n\r\r\"\"\'\'\\\\"))
-  (global $string-const stringref (string.const "string in a global \01\ff\00\t\09\n\0a\r\0d\"\22\'\27\\\5c"))
+  ;; CHECK:      (global $string-const stringref (string.const "string in a global \c2\a3_\e2\82\ac_\f0\90\8d\88"))
+  (global $string-const stringref (string.const "string in a global \C2\A3_\E2\82\AC_\F0\90\8D\88"))


Is it not worth still testing \n\r"' and the other corner cases here?

Oh right, for the JSON escaping. I'll add that back in.

kripken

lgtm, modulo UTF8/16 nuances that I am not an expert on... but the fuzzer should now be able to test such things by comparing to V8. Might be worth fuzzing this for a while before landing.

WTF-16, i.e. arbitrary sequences of 16-bit values, is the encoding of Java and JavaScript strings, and using the same encoding makes the interpretation of string operations trivial, even when accounting for non-ascii characters. Specifically, use little-endian WTF-16. Re-encode string constants from WTF-8 to WTF-16 in the parsers, then back to WTF-8 in the writers. Update the constructor for string `Literal`s to interpret the string as WTF-16 and store a sequence of WTF-16 code units, i.e. 16-bit integers. Update `Builder::makeConstantExpression` accordingly to convert from the new `Literal` string representation back to a WTF-16 string. Update the interpreter to remove the logic for detecting non-ascii characters and bailing out. The naive implementations of all the string operations are correct now that our string encoding matches the JS string encoding.

kripken · 2024-03-22T22:08:36Z

src/wasm-interpreter.h

-    if (hasNonAsciiUpTo(refValues, endVal)) {
-      return Flow(NONCONSTANT_FLOW);
+
+    if (endVal > refValues.size()) {


The previous line ensures that endVal <= refValues.size(), so this check is always false.

tlively requested a review from kripken March 21, 2024 06:03

kripken reviewed Mar 21, 2024

View reviewed changes

kripken approved these changes Mar 21, 2024

View reviewed changes

tlively force-pushed the wtf-16 branch from eddabc0 to fa879ae Compare March 21, 2024 23:52

tlively added 9 commits March 22, 2024 15:05

convert to WTF-16 in C API

f2baea4

restore most escapes in test

7df7e33

comment on surrogate concatenation

c3bfb29

comment in makeConstantExpression

cd41dfd

use wtf-16 in the fuzzer

8db8221

print string literals with WTF-8

73a66e1

do not confuse split_wast with fake escape

781efcc

allow strings in fuzzer initial contents

68de7bb

tlively force-pushed the wtf-16 branch from 371df4b to 68de7bb Compare March 22, 2024 22:06

kripken reviewed Mar 22, 2024

View reviewed changes

remove dead code

eba0bb8

kripken approved these changes Mar 22, 2024

View reviewed changes

tlively enabled auto-merge (squash) March 22, 2024 23:25

tlively merged commit b3fea30 into main Mar 22, 2024

tlively deleted the wtf-16 branch March 22, 2024 23:56

gkdn mentioned this pull request Aug 31, 2024

stringconsts gkdn/binaryen#1

Closed

[Strings] Represent string values as WTF-16 internally #6418

[Strings] Represent string values as WTF-16 internally #6418

Uh oh!

Conversation

tlively commented Mar 21, 2024

Uh oh!

tlively commented Mar 21, 2024

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

kripken left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants