Closed
Labels
confirmed-bug: Issues with confirmed bugs.
encoding: Issues and PRs related to the TextEncoder and TextDecoder APIs.
Description
Version
v18.5.0
Platform
No response
Subsystem
No response
What steps will reproduce the bug?
const decoder = new TextDecoder('Shift_JIS');
const s = decoder.decode(new Uint8Array([255]));
How often does it reproduce? Is there a required condition?
Always
What is the expected behavior?
const decoder = new TextDecoder('Shift_JIS');
const s = decoder.decode(new Uint8Array([255]));
console.log(s) // '�' === '\ufffd'
According to the WHATWG Encoding Standard, a decoder should emit � (U+FFFD, REPLACEMENT CHARACTER) when it encounters an unassigned code point during decoding.
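For comparison, here is a minimal sketch of that behavior using the UTF-8 decoder, which is not affected by this issue. It assumes only the standard `TextDecoder` constructor and its `fatal` option. The variable names are illustrative.

```javascript
// Baseline using UTF-8: the byte 0xFF is never valid in UTF-8.
const invalid = new Uint8Array([0xff]);

// Default (non-fatal) mode: invalid input is replaced with U+FFFD.
const lenient = new TextDecoder('utf-8');
const replaced = lenient.decode(invalid);
console.log(replaced === '\ufffd'); // true

// Fatal mode: the same input throws a TypeError instead of being replaced.
const strict = new TextDecoder('utf-8', { fatal: true });
let threw = false;
try {
  strict.decode(invalid);
} catch (err) {
  threw = err instanceof TypeError;
}
console.log(threw); // true
```

The Shift_JIS decoder is expected to behave the same way in non-fatal mode.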
What do you see instead?
const decoder = new TextDecoder('Shift_JIS');
const s = decoder.decode(new Uint8Array([255]));
console.log(s) // '\x1A'
From my investigation, ICU intentionally uses \x1A as the substitution character for unassigned code points in the Shift_JIS encoding, and Node.js passes it through unchanged.
Conversion Data - ICU Documentation
Which substitution character is used if a character cannot be converted?
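To check which substitution character a given Node.js build produces, a small probe along these lines may help. `probeSubstitution` is a hypothetical helper written for this report, not a Node.js API. It also assumes that a build without Shift_JIS support rejects the label when the decoder is constructed.

```javascript
// Hypothetical helper (not part of Node.js): decode some bytes and report
// the resulting code points so the substitution character is easy to see.
function probeSubstitution(label, bytes) {
  let decoder;
  try {
    decoder = new TextDecoder(label);
  } catch {
    // Assumption: unsupported labels are rejected at construction time.
    return { label, supported: false, codePoints: [] };
  }
  const text = decoder.decode(bytes);
  // Format each decoded character as U+XXXX.
  const codePoints = Array.from(text, (ch) =>
    'U+' + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, '0'));
  return { label, supported: true, codePoints };
}

const result = probeSubstitution('Shift_JIS', new Uint8Array([0xff]));
console.log(result);
```

Based on the behavior described above, an affected build should list U+001A, while a conforming build should list U+FFFD.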
Additional information
ICU provides the ucnv_setSubstChars utility to specify the substitution characters for any encoding, and it is already available in the ICU library that Node.js uses. I'm working on a fix.