JSON scalar texts in the stream are not captured properly #3

@00dani

It is valid for a JSON text to consist of a single scalar value rather than an object or array. Python's json module supports this:

>>> import json
>>> json.loads('true')
True
>>> json.loads('false')
False
>>> json.loads('"an example"')
'an example'

However, splitstream does not split a stream containing such texts correctly. The keywords true, false, and null are silently dropped, as are numeric literals:

>>> import io; from splitstream import splitfile
>>> split_buf = lambda data: list(splitfile(io.BytesIO(data), format='json'))
>>> split_buf(b'true false null [5] null true false true false {"a": 6}')
[b'[5]', b'{"a": 6}']
>>> split_buf(b'4 5 6 7 []')
[b'[]']

String literals trigger different, but still incorrect, behaviour. If the stream contains no objects or arrays, the strings are silently dropped like the other scalars. However, if an object or array occurs somewhere after a string, the whole stream up to and including that object or array is captured as a single buffer:

>>> split_buf(b'"abc" 56 "def"')
[]
>>> split_buf(b'"abc" 56 "def" {} 3 4')
[b'"abc" 56 "def" {}']
>>> split_buf(b'"abc" 56 "def" {} 3 4 "5" 6 7 []')
[b'"abc" 56 "def" {}', b' 3 4 "5" 6 7 []']

Attempting to parse these buffers with json.loads, naturally, does not work.
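
To make the failure concrete, here is a small sketch that feeds the two buffers from the last example above to json.loads. The buffers are copied verbatim from that output; the variable names are only illustrative.

```python
import json

# Buffers copied from the splitstream output shown above.
buffers = [b'"abc" 56 "def" {}', b' 3 4 "5" 6 7 []']

errors = []
for buf in buffers:
    try:
        json.loads(buf)
    except json.JSONDecodeError as exc:
        # Each buffer holds several JSON texts, so the parser stops after
        # the first value and complains about what follows it.
        errors.append(exc.msg)
        print(f"{buf!r}: {exc}")
```

Both buffers are rejected because json.loads expects exactly one JSON text per call.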

The correct behaviour would be to split the stream on every top-level JSON value, producing a separate buffer for each. In other words:

>>> fixed_split_buf(b'true false null 1 "hello world" ["goodbye", "world"] {"a": 12, "b": [null]}')
[b'true', b'false', b'null', b'1', b'"hello world"', b'["goodbye", "world"]', b'{"a": 12, "b": [null]}']
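
For reference, the expected output can be reproduced with a rough pure-Python sketch built on json.JSONDecoder.raw_decode. This is not how splitstream works internally and is not a proposed patch: the name split_json_texts is hypothetical, and the sketch assumes the whole input fits in memory, is UTF-8 encoded, and separates adjacent scalars with whitespace.

```python
import json

JSON_WHITESPACE = " \t\r\n"

def split_json_texts(data: bytes) -> list[bytes]:
    """Return one bytes object per top-level JSON value found in data."""
    text = data.decode("utf-8")
    decoder = json.JSONDecoder()
    pieces = []
    pos = 0
    while True:
        # raw_decode does not skip leading whitespace, so do it here.
        while pos < len(text) and text[pos] in JSON_WHITESPACE:
            pos += 1
        if pos >= len(text):
            break
        # raw_decode parses one value starting at pos and returns the
        # index just past it; the parsed value itself is not needed.
        _, end = decoder.raw_decode(text, pos)
        pieces.append(text[pos:end].encode("utf-8"))
        pos = end
    return pieces

print(split_json_texts(b'4 5 6 7 []'))
```

Unlike splitstream, this parses every value in full rather than just locating its boundaries, so it is only meant to pin down the expected result for test cases like the one above.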
