is NFKD / NFKC normalization working properly? #4527

gpawru · 2024-01-17T00:42:47Z

Hello!

Am I missing something, or do I misunderstand the normalization process entirely?

I've got two sequences:

U+01C4 U+0323
U+0044 U+005A U+030C U+0323

If I look at UnicodeData.txt, I see:

01C4;LATIN CAPITAL LETTER DZ WITH CARON;Lu;0;L;<compat> 0044 017D;;;;N;LATIN CAPITAL LETTER D Z HACEK;;;01C6;01C5

and

017D;LATIN CAPITAL LETTER Z WITH CARON;Lu;0;L;005A 030C;;;;N;LATIN CAPITAL LETTER Z HACEK;;;017E;

As we can see, the full compatibility decomposition of U+01C4 is U+0044 U+005A U+030C.
Am I right so far?
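
To make the chain of mappings concrete, here is a minimal sketch of recursive decomposition. It hard-codes only the two UnicodeData.txt entries quoted above; the function names `decomposition` and `full_decomposition` are made up for illustration and are not part of any library.

```rust
/// Hand-written excerpt of the two mappings quoted above.
/// A real implementation would load the full table from UnicodeData.txt.
/// An empty slice means "no decomposition mapping".
fn decomposition(c: char) -> &'static [char] {
    match c {
        '\u{01C4}' => &['\u{0044}', '\u{017D}'], // <compat> mapping
        '\u{017D}' => &['\u{005A}', '\u{030C}'], // canonical mapping
        _ => &[],
    }
}

/// Recursively expands `c` until no mapping applies (illustrative helper).
fn decompose_into(c: char, out: &mut Vec<char>) {
    let parts = decomposition(c);
    if parts.is_empty() {
        out.push(c);
    } else {
        for &p in parts {
            decompose_into(p, out);
        }
    }
}

/// Full (compatibility) decomposition of a single character.
fn full_decomposition(c: char) -> Vec<char> {
    let mut out = Vec::new();
    decompose_into(c, &mut out);
    out
}

fn main() {
    let hex: Vec<String> = full_decomposition('\u{01C4}')
        .iter()
        .map(|&c| format!("{:04X}", c as u32))
        .collect();
    println!("{}", hex.join(" ")); // prints "0044 005A 030C"
}
```

U+01C4 first expands to U+0044 U+017D, and U+017D in turn expands to U+005A U+030C, which gives the three code points listed above.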

Let's append the non-starter U+0323 to both the non-decomposed and the decomposed form of U+01C4.
Here are the two sequences again:

U+01C4 U+0323
U+0044 U+005A U+030C U+0323

QUESTION: if I normalize these strings with the NFKD normalizer, should I get the same result for both? The test says otherwise.

Here is the test:

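// Assumes the icu_normalizer crate; with the icu meta-crate the path is
// icu::normalizer::DecomposingNormalizer.
use icu_normalizer::DecomposingNormalizer;

#[test]
fn test_nfkd() {
    let nfkd = DecomposingNormalizer::new_nfkd();

    // The same text, once with the precomposed U+01C4 and once already decomposed.
    let a = "\u{01C4}\u{0323}";
    let b = "\u{0044}\u{005A}\u{030C}\u{0323}";

    let result_a: Vec<char> = nfkd.normalize(a).chars().collect();
    let result_b: Vec<char> = nfkd.normalize(b).chars().collect();

    // Print each result as space-separated hexadecimal code points.
    for c in &result_a {
        print!("{:04X} ", u32::from(*c));
    }
    println!();

    for c in &result_b {
        print!("{:04X} ", u32::from(*c));
    }
    println!();
}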

And here is the result:

0044 005A 030C 0323 
0044 005A 0323 030C

Manishearth · 2024-01-17T03:59:12Z

Manishearth
Jan 17, 2024
Maintainer

This does seem to be a bug: because of its canonical combining class (ccc), U+0323 (ccc 220) should sort before U+030C (ccc 230).
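
For reference, a minimal sketch of the canonical ordering step that NFD/NFKD applies after decomposition: each run of non-starters is stably sorted by ccc. The `ccc` table below is hand-written and covers only the two marks in this thread, and `canonical_order` is a made-up name; this is not how ICU4X implements it.

```rust
/// Canonical combining class for the two marks discussed here.
/// Everything else is treated as a starter (ccc 0) in this sketch.
fn ccc(c: char) -> u8 {
    match c {
        '\u{0323}' => 220, // COMBINING DOT BELOW
        '\u{030C}' => 230, // COMBINING CARON
        _ => 0,
    }
}

/// Stably sorts every maximal run of non-starters by ccc, in place.
fn canonical_order(chars: &mut [char]) {
    let mut i = 0;
    while i < chars.len() {
        if ccc(chars[i]) == 0 {
            i += 1;
            continue;
        }
        let start = i;
        while i < chars.len() && ccc(chars[i]) != 0 {
            i += 1;
        }
        // sort_by_key is stable, so marks with equal ccc keep their order.
        chars[start..i].sort_by_key(|&c| ccc(c));
    }
}

fn main() {
    // Decomposed form of U+01C4 followed by U+0323.
    let mut s = ['\u{0044}', '\u{005A}', '\u{030C}', '\u{0323}'];
    canonical_order(&mut s);
    let hex: Vec<String> = s.iter().map(|&c| format!("{:04X}", c as u32)).collect();
    println!("{}", hex.join(" ")); // prints "0044 005A 0323 030C"
}
```

Under this ordering both inputs from the test should normalize to 0044 005A 0323 030C, which matches the second line of the reported output.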

cc @hsivonen @sffc

2 replies

gpawru Jan 17, 2024
Author

Thank you! I was checking the validity of my implementation of the NF(K)C algorithm, ran into this case, and it surprised me - I spent several hours searching for the error in my own code :)

hsivonen Jan 17, 2024
Maintainer

Thank you for reporting! PR: #4530
