Skip to content

Change UnbufferedCharStream to use code points #1796

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Mar 29, 2017
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
37 changes: 30 additions & 7 deletions runtime/Java/src/org/antlr/v4/runtime/UnbufferedCharStream.java
Original file line number Diff line number Diff line change
Expand Up @@ -25,7 +25,7 @@ public class UnbufferedCharStream implements CharStream {
* we keep adding to buffer. Otherwise, {@link #consume consume()} resets so
* we start filling at index 0 again.
*/
protected char[] data;
protected int[] data;

/**
* The number of characters currently in {@link #data data}.
Expand Down Expand Up @@ -82,7 +82,7 @@ public UnbufferedCharStream() {
/** Useful for subclasses that pull char from other than this.input. */
public UnbufferedCharStream(int bufferSize) {
n = 0;
data = new char[bufferSize];
data = new int[bufferSize];
}

public UnbufferedCharStream(InputStream input) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note this ends up calling new InputStreamReader(input) without specifying a Charset. That means this logic depends on the client's $LANG environment variable or equivalent, which is usually not what anyone wants.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like we should add an optional charset arg on one or more of the arguments. Also, doesn't the fill() method assume UTF-X? I.e., it would not work with some other encoding, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes! We absolutely should add a Charset arg, or there's no way to use anything but the default in the calling environment. (We should probably also change to default to UTF-8..)

fill() uses the InputStreamReader to get UTF-16 encoded chars. It will work with any encoding, although there's no way to specify an encoding other than the environment default with the current UnbufferedCharStream API.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, great! could you please make the changes? Does that mean we change this field to InputStream?

    protected Reader input;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I'll make the changes! Nope, we'll still use a Reader, we'll just use a different InputStreamReader constructor which allows us to specify a Charset.

Expand Down Expand Up @@ -145,13 +145,36 @@ protected void sync(int want) {
*/
protected int fill(int n) {
for (int i=0; i<n; i++) {
if (this.n > 0 && data[this.n - 1] == (char)IntStream.EOF) {
if (this.n > 0 && data[this.n - 1] == IntStream.EOF) {
return i;
}

try {
int c = nextChar();
add(c);
if (c > Character.MAX_VALUE || c == IntStream.EOF) {
add(c);
} else {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reminder that our "style" is else starts a line ;)

char ch = (char) c;
if (Character.isLowSurrogate(ch)) {
throw new RuntimeException("Invalid UTF-16 (low surrogate with no preceding high surrogate)");
} else if (Character.isHighSurrogate(ch)) {
int lowSurrogate = nextChar();
if (lowSurrogate > Character.MAX_VALUE) {
throw new RuntimeException("Invalid UTF-16 (high surrogate followed by code point > U+FFFF");
} else if (lowSurrogate == IntStream.EOF) {
throw new RuntimeException("Invalid UTF-16 (dangling high surrogate at end of file)");
} else {
char lowSurrogateChar = (char) lowSurrogate;
if (Character.isLowSurrogate(lowSurrogateChar)) {
add(Character.toCodePoint(ch, lowSurrogateChar));
} else {
throw new RuntimeException("Invalid UTF-16 (dangling high surrogate");
}
}
} else {
add(c);
}
}
}
catch (IOException ioe) {
throw new RuntimeException(ioe);
Expand All @@ -173,7 +196,7 @@ protected void add(int c) {
if ( n>=data.length ) {
data = Arrays.copyOf(data, data.length * 2);
}
data[n++] = (char)c;
data[n++] = c;
}

@Override
Expand All @@ -183,8 +206,8 @@ public int LA(int i) {
int index = p + i - 1;
if ( index < 0 ) throw new IndexOutOfBoundsException();
if ( index >= n ) return IntStream.EOF;
char c = data[index];
if ( c==(char)IntStream.EOF ) return IntStream.EOF;
int c = data[index];
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, just realized this can be tidied up to just return data[index];.

if ( c==IntStream.EOF ) return IntStream.EOF;
return c;
}

Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -313,6 +313,30 @@ public void testLastChar() {
assertEquals(expecting, tokens.getTokens().toString());
}

@Test public void testUnicodeSMP() throws Exception {
TestingUnbufferedCharStream input = createStream("\uD83C\uDF0E");
assertEquals(0x1F30E, input.LA(1));
assertEquals("\uD83C\uDF0E", input.getBuffer());
input.consume();
assertEquals(IntStream.EOF, input.LA(1));
assertEquals("\uFFFF", input.getBuffer());
}

@Test(expected = RuntimeException.class)
public void testDanglingHighSurrogateAtEOFThrows() throws Exception {
createStream("\uD83C");
}

@Test(expected = RuntimeException.class)
public void testDanglingHighSurrogateThrows() throws Exception {
createStream("\uD83C\u0123");
}

@Test(expected = RuntimeException.class)
public void testDanglingLowSurrogateThrows() throws Exception {
createStream("\uDF0E");
}

protected static TestingUnbufferedCharStream createStream(String text) {
return new TestingUnbufferedCharStream(new StringReader(text));
}
Expand All @@ -336,15 +360,28 @@ public TestingUnbufferedCharStream(Reader input, int bufferSize) {
*/
public String getRemainingBuffer() {
if ( n==0 ) return "";
return new String(data,p,n-p);
int len = n;
if (data[len-1] == IntStream.EOF) {
// Don't pass -1 to new String().
return new String(data,p,len-p-1) + "\uFFFF";
} else {
return new String(data,p,len-p);
}
}

/** For testing. What's in moving window buffer into data stream.
* From 0..p-1 have been consume.
*/
public String getBuffer() {
if ( n==0 ) return "";
return new String(data,0,n);
int len = n;
// Don't pass -1 to new String().
if (data[len-1] == IntStream.EOF) {
// Don't pass -1 to new String().
return new String(data,0,len-1) + "\uFFFF";
} else {
return new String(data,0,len);
}
}

}
Expand Down