Description
Matthew Donoughe opened MNG-8241 and commented
Java strings are (usually) Unicode, but a Java char is a 16-bit UTF-16 code unit, so a single char can only represent a subset of Unicode. ComparableVersion makes heavy use of String.charAt, which returns surrogate values instead of Unicode code points whenever a character takes more than 16 bits.
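The charAt behavior is easy to see in isolation. This is a minimal sketch, not taken from ComparableVersion. The class name is made up for illustration, and 𝟤 is spelled as its surrogate pair so the source file encoding doesn't matter:

```java
final class CharAtSurrogates {
    // U+1D7E4 MATHEMATICAL SANS-SERIF DIGIT TWO as its UTF-16 surrogate pair.
    static final String SANS_TWO = "\uD835\uDFE4";

    // Formats a UTF-16 code unit or a code point as upper-case hex.
    static String hex(int value) {
        return Integer.toHexString(value).toUpperCase();
    }

    public static void main(String[] args) {
        // charAt returns the two halves of the pair; neither is a character on its own.
        System.out.println(hex(SANS_TWO.charAt(0)));      // D835 (high surrogate)
        System.out.println(hex(SANS_TWO.charAt(1)));      // DFE4 (low surrogate)
        // codePointAt reassembles the pair into the actual code point.
        System.out.println(hex(SANS_TWO.codePointAt(0))); // 1D7E4
    }
}
```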
This leads to the following behavior:
java -jar ~/.m2/repository/org/apache/maven/maven-artifact/3.9.4/maven-artifact-3.9.4.jar 1 𝟤
Display parameters as parsed by Maven (in canonical form and as a list of tokens) and comparison result:
1. 1 -> 1; tokens: [1]
1 > 𝟤
2. 𝟤 -> 𝟤; tokens: [𝟤]
1 (DIGIT ONE) > 𝟤 (MATHEMATICAL SANS-SERIF DIGIT TWO) because ComparableVersion sees 𝟤 as two invalid characters and treats it as text.
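A sketch of why 𝟤 ends up as text, assuming the parser classifies each char with a check along the lines of Character.isDigit(char). That is an assumption about the implementation, not a quote from it, and the helper names are hypothetical. Neither surrogate is a digit, but the code point they encode is:

```java
final class DigitClassification {
    // U+1D7E4 MATHEMATICAL SANS-SERIF DIGIT TWO as its UTF-16 surrogate pair.
    static final String SANS_TWO = "\uD835\uDFE4";

    // Per-char view: true only if every UTF-16 code unit is a digit.
    static boolean allCharsAreDigits(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (!Character.isDigit(s.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    // Per-code-point view: true only if every Unicode code point is a digit.
    static boolean allCodePointsAreDigits(String s) {
        return s.codePoints().allMatch(Character::isDigit);
    }

    public static void main(String[] args) {
        System.out.println(allCharsAreDigits(SANS_TWO));                    // false
        System.out.println(allCodePointsAreDigits(SANS_TWO));               // true
        System.out.println(Character.digit(SANS_TWO.codePointAt(0), 10));   // 2
    }
}
```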
java -jar ~/.m2/repository/org/apache/maven/maven-artifact/3.9.4/maven-artifact-3.9.4.jar 0 𝟤
Display parameters as parsed by Maven (in canonical form and as a list of tokens) and comparison result:
1. 0 -> ; tokens: []
0 < 𝟤
2. 𝟤 -> 𝟤; tokens: [𝟤]
However, 0 (DIGIT ZERO) is still < 𝟤 (MATHEMATICAL SANS-SERIF DIGIT TWO). That is, 0 < 𝟤 < 1, the same way 0 < a < 1.
It's unclear whether this should be considered to be a bug or whether it's just an undocumented limitation. String.charAt and String.length should be avoided unless you can be sure the characters are all BMP (Basic Multilingual Plane).
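One way to walk a string without String.charAt is to step by code point. The following is a minimal sketch; the class and method are hypothetical and far simpler than real version parsing. It splits a string into runs of digits and non-digits, advancing by Character.charCount so a surrogate pair is never split:

```java
import java.util.ArrayList;
import java.util.List;

final class CodePointRuns {
    // Splits s into maximal runs of digit and non-digit code points.
    static List<String> runs(String s) {
        List<String> out = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            // Step over one code point: 1 char for BMP, 2 chars for a surrogate pair.
            int next = i + Character.charCount(cp);
            boolean lastOfRun = next >= s.length()
                    || Character.isDigit(s.codePointAt(next)) != Character.isDigit(cp);
            if (lastOfRun) {
                out.add(s.substring(start, next));
                start = next;
            }
            i = next;
        }
        return out;
    }

    public static void main(String[] args) {
        // "1", then U+1D7E4 (sans-serif digit two), then "a".
        List<String> parts = runs("1\uD835\uDFE4a");
        System.out.println(parts.size());           // 2
        System.out.println(parts.get(0).length());  // 3 chars: "1" plus one surrogate pair
        System.out.println(parts.get(1));           // a
    }
}
```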
I was initially worried that 𝟣𝟣𝟣𝟣𝟣 (MATHEMATICAL SANS-SERIF DIGIT ONE) > 22222 (DIGIT TWO) because "𝟣𝟣𝟣𝟣𝟣".length() is 10, greater than MAX_INTITEM_LENGTH, but that code doesn't even get hit because String.charAt is effectively producing "�����������". If the code is changed to identify non-BMP Nd-class digits like 𝟣 as digits, then the code that determines the required size of the data type needs to be updated to measure the length in code points instead of chars.
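A rough sketch of measuring a digit run in code points rather than chars. This is not the actual ComparableVersion code: MAX_INT_DIGITS is a hypothetical stand-in for MAX_INTITEM_LENGTH, its value of 9 is an assumption, and the helper methods are made up for illustration.

```java
final class DigitRunLength {
    // Stand-in for the MAX_INTITEM_LENGTH mentioned above; the value 9 is an assumption.
    static final int MAX_INT_DIGITS = 9;

    // Number of digits as a reader would count them: code points, not chars.
    static int digitCount(String run) {
        return run.codePointCount(0, run.length());
    }

    // Folds any Nd-class digits to ASCII so the run can be parsed as a plain number.
    static String toAsciiDigits(String run) {
        StringBuilder sb = new StringBuilder();
        run.codePoints().forEach(cp -> {
            int value = Character.digit(cp, 10);
            if (value < 0) {
                throw new IllegalArgumentException(
                        "not a decimal digit: U+" + Integer.toHexString(cp));
            }
            sb.append(value);
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // Five copies of U+1D7E3 MATHEMATICAL SANS-SERIF DIGIT ONE.
        String sansOnes = "\uD835\uDFE3\uD835\uDFE3\uD835\uDFE3\uD835\uDFE3\uD835\uDFE3";
        System.out.println(sansOnes.length());                                  // 10
        System.out.println(digitCount(sansOnes));                               // 5
        System.out.println(digitCount(sansOnes) <= MAX_INT_DIGITS);             // true
        System.out.println(Integer.parseInt(toAsciiDigits(sansOnes)) < 22222);  // true
    }
}
```

With the length taken in code points, the five-digit run stays within the assumed limit and compares as 11111 < 22222.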