Commit 2ab2763

Add a strict parsing mode to the JsonReader and improve its javadoc

By default, the JsonReader accepts unescaped control characters that are
embedded in a string or name (which is technically just a string).
According to RFC 8259, RFC 7159, and RFC 4627, this is forbidden.

From RFC 8259 Section 7 "Strings":

    "All Unicode characters may be placed within the quotation marks,
    except for the characters that MUST be escaped: quotation mark,
    reverse solidus, and the control characters (U+0000 through U+001F)."

Accepting unescaped control characters can at least cause some confusion
as the following naive program demonstrates:

    marcus@linux:~> cat NaiveJsonProcessor.java
    import java.io.IOException;
    import java.io.StringReader;

    import com.google.gson.stream.JsonReader;

    public class NaiveJsonProcessor {
        public static void parseAndLog(String jsonInput) throws IOException {
            JsonReader reader = new JsonReader(new StringReader(jsonInput));
            String parsed = reader.nextString();
            if (parsed.equals("foo")) {
                throw new IllegalStateException("foo is forbidden");
            }
            /*
             * According to the JsonReader's documentation "[...] this parser
             * is strict and only accepts JSON as specified by RFC 4627" (see
             * documentation of setLenient). Hence, we can safely log the
             * raw jsonInput to stdout because it contains no unescaped control
             * characters, which could be interpreted by a terminal.
             * Oops... wrong assumption:)
             */
            System.out.println("Processed: " + jsonInput);
        }

        public static void main(String[] args) {
            String jsonInput = "\"foobar\u001b[3D\u001b[K\"";
            try {
                // the log entry might confuse the user...
                parseAndLog(jsonInput);
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
    marcus@linux:~> javac -cp /path/to/gson/classes NaiveJsonProcessor.java
    marcus@linux:~> java -cp /path/to/gson/classes:. NaiveJsonProcessor
    Processed: "foo"
    marcus@linux:~>

Since the unescaped control characters of the raw jsonInput are
interpreted by the terminal, it _looks_ as if we processed the JSON text
"foo" even though this string should result in an IllegalStateException
(of course in reality we did _not_ process "foo").

Apart from this, the JsonReader accepts non-lowercase literals (like
tRuE, falSE, NULl). According to the previously mentioned RFCs, this is
forbidden.

From RFC 8259 Section 3 "Values":

    "[...]or one of the following three literal names:

        false null true

    The literal names MUST be lowercase."

To cope with this a strict mode is added to the JsonReader. In strict
mode, the JsonReader does not accept unescaped control characters in
strings and names. For this, the JsonReader raises an exception if it
encounters an unescaped control character in nextQuotedValue and
skipQuotedValue. Also, it does not accept non-lowercase literals. For
this, peekKeyword raises an exception if a non-lowercase literal is
encountered.

In order to avoid regressions, the strict mode is disabled by default
and the old behavior is retained. In strict mode, the JsonReader behaves
exactly as before (except in case of an unescaped control character or
non-lowercase literal, of course). For the details, see the new
JsonReaderStrictTest testcase.

The javadoc of the JsonReader is updated accordingly. As part of this
update, all references to a JSON RFC are changed to RFC 8259 (that's
what the JsonReader conforms to (in strict mode)).

Signed-off-by: Marcus Huewe <[email protected]>

1 parent 08aa02f commit 2ab2763

2 files changed: 311 additions, 7 deletions

gson/src/main/java/com/google/gson/stream/JsonReader.java

Lines changed: 89 additions & 7 deletions
@@ -25,7 +25,7 @@
 import java.util.Arrays;

 /**
- * Reads a JSON (<a href="http://www.ietf.org/rfc/rfc7159.txt">RFC 7159</a>)
+ * Reads a JSON (<a href="https://www.ietf.org/rfc/rfc8259.txt">RFC 8259</a>)
  * encoded value as a stream of tokens. This stream includes both literal
  * values (strings, numbers, booleans, and nulls) as well as the begin and
  * end delimiters of objects and arrays. The tokens are traversed in
@@ -182,6 +182,39 @@
  * non-execute prefix when {@link #setLenient(boolean) lenient parsing} is
  * enabled.
  *
+ * <h3>Available Parsing Modes</h3>
+ * This parser supports three different parsing modes:
+ * <ul>
+ * <li>strict mode
+ * <li>semi-strict mode (the default)
+ * <li>lenient mode
+ * </ul>
+ *
+ * <p>In strict mode, the parser only accepts a JSON text that exactly conforms
+ * to the grammar, which is specified in
+ * <a href="https://www.ietf.org/rfc/rfc8259.txt">RFC 8259</a>. The strict
+ * mode can be enabled by calling {@link #setStrict(boolean) setStrict(true)}.
+ * A leading byte order mark (U+FEFF) at the beginning of a JSON text is
+ * allowed.
+ *
+ * <p>In contrast to the strict mode, the semi-strict mode allows non-lowercase
+ * literals (like TRUE, fAlSe, NUlL etc.) and unescaped control characters in
+ * strings (and names). For the latter, consider the two Java strings
+ * {@code "\"unescaped\nnewline\""} and {@code "\"escaped\\u000anewline\""}.
+ * In semi-strict mode, both strings represent valid JSON texts and the parsed
+ * values are equal. In strict mode, only the latter string is accepted as a
+ * valid JSON text. The semi-strict mode, which is the default mode,
+ * corresponds to {@link #setStrict(boolean) setStrict(false)} and
+ * {@link #setLenient(boolean) setLenient(false)}.
+ *
+ * <p>In the lenient mode, the parser is very liberal in what it accepts.
+ * For the details, see the {@link #setLenient(boolean) setLenient} method.
+ * The lenient mode can be enabled by calling
+ * {@link #setLenient(boolean) setLenient(true)}.
+ *
+ * <p>Note: all three modes are mutually exclusive. Enabling one mode
+ * automatically disables the other two modes.
+ *
  * <p>Each {@code JsonReader} may be used to read a single JSON stream. Instances
  * of this class are not thread safe.
  *
@@ -228,6 +261,12 @@ public class JsonReader implements Closeable {
   /** True to accept non-spec compliant JSON */
   private boolean lenient = false;

+  /**
+   * True to accept a JSON text that exactly conforms to the
+   * <a href="https://www.ietf.org/rfc/rfc8259.txt">RFC 8259</a> grammar
+   */
+  private boolean strict = false;
+
   /**
    * Use a manual buffer to easily read and unread upcoming characters, and
    * also so we can create strings without an intermediate StringBuilder.
@@ -294,17 +333,16 @@ public JsonReader(Reader in) {

   /**
    * Configure this parser to be liberal in what it accepts. By default,
-   * this parser is strict and only accepts JSON as specified by <a
-   * href="http://www.ietf.org/rfc/rfc4627.txt">RFC 4627</a>. Setting the
-   * parser to lenient causes it to ignore the following syntax errors:
+   * this parser is semi-strict and accepts JSON as specified by <a
+   * href="https://www.ietf.org/rfc/rfc8259.txt">RFC 8259</a> (including some
+   * slight variations). Setting the parser to lenient causes it to ignore
+   * the following syntax errors:
    *
    * <ul>
    * <li>Streams that start with the <a href="#nonexecuteprefix">non-execute
    * prefix</a>, <code>")]}'\n"</code>.
-   * <li>Streams that include multiple top-level values. With strict parsing,
+   * <li>Streams that include multiple top-level values. With semi-strict parsing,
    * each stream must contain exactly one top-level value.
-   * <li>Top-level values of any type. With strict parsing, the top-level
-   * value must be an object or an array.
    * <li>Numbers may be {@link Double#isNaN() NaNs} or {@link
    * Double#isInfinite() infinities}.
    * <li>End of line comments starting with {@code //} or {@code #} and
@@ -323,6 +361,9 @@ public JsonReader(Reader in) {
    */
   public final void setLenient(boolean lenient) {
     this.lenient = lenient;
+    if (lenient) {
+      strict = false;
+    }
   }

   /**
@@ -332,6 +373,25 @@ public final boolean isLenient() {
     return lenient;
   }

+  /**
+   * Configure this parser to be very conservative in what it accepts. In
+   * strict mode, it only accepts a JSON text as specified in
+   * <a href="https://www.ietf.org/rfc/rfc8259.txt">RFC 8259</a>.
+   */
+  public final void setStrict(boolean strict) {
+    this.strict = strict;
+    if (strict) {
+      lenient = false;
+    }
+  }
+
+  /**
+   * Returns true if this parser is very conservative in what it accepts.
+   */
+  public final boolean isStrict() {
+    return strict;
+  }
+
   /**
    * Consumes the next token from the JSON stream and asserts that it is the
    * beginning of a new array.
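
The mutual exclusion that `setLenient` and `setStrict` implement in the hunks above can be sketched in isolation. The class below is a hypothetical stand-in, not Gson code: `ParsingModeFlags` and its `mode()` helper are names invented for illustration. It mirrors only the two boolean fields from the diff and assumes, as the new javadoc states, that semi-strict is simply the state in which both flags are false.

```java
/** Hypothetical stand-in for the two mode flags; this is not Gson's JsonReader. */
final class ParsingModeFlags {
  private boolean lenient = false;
  private boolean strict = false;

  void setLenient(boolean lenient) {
    this.lenient = lenient;
    if (lenient) {
      strict = false; // enabling lenient mode switches strict mode off
    }
  }

  void setStrict(boolean strict) {
    this.strict = strict;
    if (strict) {
      lenient = false; // enabling strict mode switches lenient mode off
    }
  }

  boolean isLenient() {
    return lenient;
  }

  boolean isStrict() {
    return strict;
  }

  /** Illustrative helper: names the active mode; both flags false means semi-strict. */
  String mode() {
    if (strict) {
      return "strict";
    }
    if (lenient) {
      return "lenient";
    }
    return "semi-strict";
  }
}
```

Because each setter only clears the other flag when called with `true`, passing `false` to either one never enables another mode; it just falls back towards the semi-strict default.
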
@@ -626,6 +686,14 @@ private int peekKeyword() throws IOException {
         return PEEKED_NONE;
       }
     }
+    if (strict) {
+      // a keyword/literal must be lowercase in strict mode
+      for (int i = 0; i < length; i++) {
+        if (buffer[pos + i] == keywordUpper.charAt(i)) {
+          syntaxError("Literal must be lowercase (strict mode)");
+        }
+      }
+    }

     if ((pos + length < limit || fillBuffer(length + 1))
         && isLiteral(buffer[pos + length])) {
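
The strict check added to `peekKeyword` can be illustrated with a small standalone sketch. Everything here is an assumption made for illustration: `LiteralCaseCheck` and `matchLiteral` are hypothetical names, the method works on a complete token string rather than on the reader's buffer, and it throws an unchecked `IllegalArgumentException` where the reader calls `syntaxError`. What it keeps from the diff is the idea that a character matching the uppercase form of the keyword is tolerated in semi-strict mode and rejected in strict mode.

```java
import java.util.Locale;

/** Hypothetical sketch of the lowercase-literal rule; not Gson code. */
final class LiteralCaseCheck {
  private static final String[] KEYWORDS = {"true", "false", "null"};

  /**
   * Returns the lowercase literal that the token spells, or null if the token
   * is not one of the three literals. In strict mode, a token containing an
   * uppercase letter is rejected with an exception.
   */
  static String matchLiteral(String token, boolean strict) {
    for (String keyword : KEYWORDS) {
      if (token.length() != keyword.length()) {
        continue;
      }
      String keywordUpper = keyword.toUpperCase(Locale.ROOT);
      boolean matches = true;
      boolean sawUppercase = false;
      for (int i = 0; i < keyword.length(); i++) {
        char c = token.charAt(i);
        if (c == keywordUpper.charAt(i)) {
          sawUppercase = true; // same test as the strict loop in the diff
        } else if (c != keyword.charAt(i)) {
          matches = false;
          break;
        }
      }
      if (!matches) {
        continue;
      }
      if (strict && sawUppercase) {
        throw new IllegalArgumentException("Literal must be lowercase (strict mode)");
      }
      return keyword;
    }
    return null;
  }
}
```

With this sketch, `matchLiteral("tRuE", false)` yields `"true"`, while `matchLiteral("tRuE", true)` throws.
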
@@ -985,6 +1053,11 @@ private String nextQuotedValue(char quote) throws IOException {
     // Like nextNonWhitespace, this uses locals 'p' and 'l' to save inner-loop field access.
     char[] buffer = this.buffer;
     StringBuilder builder = null;
+    /*
+     * Use a local variable to avoid inner-loop field access (as above).
+     * Note: strict == true implies quote == '"'
+     */
+    final boolean noUnescapedControlCharacter = strict;
     while (true) {
       int p = pos;
       int l = limit;
@@ -1014,6 +1087,8 @@ private String nextQuotedValue(char quote) throws IOException {
           p = pos;
           l = limit;
           start = p;
+        } else if (noUnescapedControlCharacter && c <= 0x1F) {
+          syntaxError("Illegal unescaped control character (strict mode)");
         } else if (c == '\n') {
           lineNumber++;
           lineStart = p;
@@ -1094,6 +1169,11 @@ private String nextUnquotedValue() throws IOException {
   private void skipQuotedValue(char quote) throws IOException {
     // Like nextNonWhitespace, this uses locals 'p' and 'l' to save inner-loop field access.
     char[] buffer = this.buffer;
+    /*
+     * Use a local variable to avoid inner-loop field access (as above).
+     * Note: strict == true implies quote == '"'
+     */
+    final boolean noUnescapedControlCharacter = strict;
     do {
       int p = pos;
       int l = limit;
@@ -1108,6 +1188,8 @@ private void skipQuotedValue(char quote) throws IOException {
           readEscapeCharacter();
           p = pos;
           l = limit;
+        } else if (noUnescapedControlCharacter && c <= 0x1F) {
+          syntaxError("Illegal unescaped control character (strict mode)");
         } else if (c == '\n') {
           lineNumber++;
           lineStart = p;
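
The `c <= 0x1F` test added to `nextQuotedValue` and `skipQuotedValue` can be tried out with a simplified scan. This is a sketch under stated assumptions, not the reader's implementation: `StrictStringScan` and `indexOfUnescapedControl` are hypothetical names, the input is assumed to be exactly one double-quoted JSON string, and escape sequences are not decoded or validated, only stepped over.

```java
/** Hypothetical sketch of the strict-mode control character test; not Gson code. */
final class StrictStringScan {
  /**
   * Scans a text that is assumed to be a single double-quoted JSON string.
   * Returns the index of the first raw character in U+0000..U+001F found
   * before the closing quote, or -1 if there is none.
   */
  static int indexOfUnescapedControl(String jsonString) {
    int i = 1; // skip the opening quote
    while (i < jsonString.length()) {
      char c = jsonString.charAt(i);
      if (c == '"') {
        return -1; // closing quote reached without a raw control character
      }
      if (c == '\\') {
        i++;
        // An escaped quote or backslash must not be mistaken for the end of
        // the string or the start of another escape, so step over it. Any
        // other character after the backslash is examined by the next pass.
        if (i < jsonString.length()
            && (jsonString.charAt(i) == '"' || jsonString.charAt(i) == '\\')) {
          i++;
        }
        continue;
      }
      if (c <= 0x1F) {
        return i; // the condition that strict mode reports as a syntax error
      }
      i++;
    }
    return -1;
  }
}
```

Applied to the two strings from the class javadoc, the scan returns 10 for `"\"unescaped\nnewline\""`, the position of the raw newline, and -1 for `"\"escaped\\u000anewline\""`, whose newline is written as an escape sequence.
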
