Conversation

@cheqianh
Contributor

@cheqianh cheqianh commented Dec 13, 2023

Description:

This PR adds support for the multiple top-level values use case, so that we have an apples-to-apples performance comparison across Ion/JSON/CBOR.

Note:

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@cheqianh cheqianh marked this pull request as ready for review December 28, 2023 20:13
@cheqianh cheqianh changed the title Baseline Enhance the benchmark runner to support multiple top level objects use case. Dec 28, 2023
@cheqianh
Contributor Author

cheqianh commented Dec 28, 2023

Sample files multiple_top_level_object.json and multiple_top_level_object.cbor contain equivalent data:

```json
{"name":"John", "age":30, "car":null}
{"name":"Mike", "age":33, "car":null}
{"name":"Jack", "age":24, "car":null}
```

```python
# Flag to control if the JSON file is delimited by a newline.
generate_jsonl = True

with open(ion_file, 'rb') as fp:
```
Contributor

For posterity, can you reply here to explain what was wrong with this technique?

Contributor Author
@cheqianh cheqianh Dec 28, 2023

Sure, the reasons are:

  1. Since JSON doesn't support multiple top-level objects, we used to store all JSON objects within a list. Now we iterate over the top-level values and write them one by one.
  2. The CBOR conversion was wrong. We used the JSON-encoded bytes for the JSON-to-CBOR conversion instead of the Python objects decoded from the JSON, so the resulting CBOR data was incorrect.
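The first fix can be sketched with stdlib json alone. This is a minimal, hypothetical illustration (the sample values mirror the multiple_top_level_object.json file above; the real runner reads the generated sample files):

```python
import io
import json

# Hypothetical values standing in for the parsed top-level objects.
values = [
    {"name": "John", "age": 30, "car": None},
    {"name": "Mike", "age": 33, "car": None},
]

# Old approach: wrap everything in a list, producing one JSON document.
wrapped = json.dumps(values)

# New approach: write each top-level value on its own line (JSONL),
# which matches how Ion and CBOR handle multiple top-level values.
buf = io.StringIO()
for value in values:
    buf.write(json.dumps(value))
    buf.write("\n")
jsonl = buf.getvalue()

# Reading it back: iterate over the lines, decoding one value at a time.
decoded = [json.loads(line) for line in jsonl.splitlines()]
```

The JSONL form round-trips each value independently, so no artificial wrapper list ends up in the serialized output.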

Comment on lines +232 to +233
```python
(error_code, _, _) = run_cli([f'{command}', file, '--format', f'{format_option}', '--io-type', 'file'])
assert not error_code
```
Contributor

Are we able to assert that multiple top-level values are actually read/written?

Contributor Author
@cheqianh cheqianh Dec 29, 2023

As discussed offline, I'll create a few new tests for each generated test_fun to ensure they work correctly.

```python
loader = self.get_loader_dumper()
with open(self.get_input_file(), "rb") as fp:
    self._data_object = loader.load(fp)
format_option = self.get_format()
```
Contributor

Why have all this branching here? We already have loaders for each type. Put the loader-specific code there.

Contributor Author
@cheqianh cheqianh Jan 4, 2024

Okay, I'll refactor them into the format-specific loaders. I'll run the benchmark-cli again to ensure the refactor doesn't affect write/read performance.

Contributor Author
@cheqianh cheqianh Jan 4, 2024

I just refactored JSON and CBOR; the performance is almost the same, but there is a significant difference in peak memory usage. I'm going to investigate why.

Also, since we dump the entire document for Ion to avoid unnecessary symbol table writes, IonLoadDump needs to change too.

Contributor

I wonder if it is due to building the list?

Contributor

I'm also not sure why IonLoadDump needs to change. My understanding is that we're not flushing the writer so it will just buffer until it closes, then flush the symbol table and values. Right?

Contributor Author

Re:

> I'm also not sure why IonLoadDump needs to change. My understanding is that we're not flushing the writer so it will just buffer until it closes, then flush the symbol table and values. Right?

I saw a 20% difference between dumping each top-level object and dumping the whole document. I'll try to see why it takes less time when repeatedly calling dump.
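The two dump strategies being compared can be sketched with stdlib json (a hypothetical stand-in: Ion adds symbol-table effects on top of this, so the sketch only illustrates the measurement approach, not the 20% gap itself):

```python
import io
import json
import time

# Hypothetical workload standing in for the benchmark's data objects.
values = [{"name": "user", "age": i, "car": None} for i in range(10_000)]

def dump_one_by_one(objs):
    # Strategy 1: call dump once per top-level object.
    buf = io.StringIO()
    for obj in objs:
        json.dump(obj, buf)
        buf.write("\n")
    return buf.getvalue()

def dump_all_at_once(objs):
    # Strategy 2: hand the whole document to a single dump call.
    buf = io.StringIO()
    json.dump(objs, buf)
    return buf.getvalue()

start = time.perf_counter()
one_by_one = dump_one_by_one(values)
t1 = time.perf_counter() - start

start = time.perf_counter()
at_once = dump_all_at_once(values)
t2 = time.perf_counter() - start

# Both strategies serialize the same data; only the framing differs.
print(f"one by one: {t1:.4f}s, all at once: {t2:.4f}s")
```

For Ion, strategy 2 additionally avoids re-emitting the symbol table per value, which is why the whole-document dump is the fairer baseline.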

```python
with open(data_file, "rb") as f:
    return loader_dumper.load(f)
format_option = benchmark_spec.get_format()
if _format.format_is_ion(format_option):
```
Contributor

Put the loader-specific logic in custom loaders for each type; avoid branching here.

* Modifies the write benchmarking process to dump all data, preventing the repeated writing of symbol tables
* Throws an error for protocol buffer benchmarking
* Fixes some typos
* Adds return_object flag for debugging
* Adds and refactors benchmark-cli tests
@cheqianh
Contributor Author

cheqianh commented Jan 4, 2024

The commit 74d89ad refactors the loaders (read APIs) into separate files. Below are the new results for the baseline file - a large log.

| name | file_size (B) | time_min (ns) | time_mean (ns) | memory_usage_peak (B) |
| --- | --- | --- | --- | --- |
| Ion - before | 22263038 | 4730782811 | | 171,700 |
| Ion - after | 22263038 | 5092575416.60 | 5109052881 | 276,500 |
| json - before | 145330385 | 875974056 | | 77,446 |
| json - after | 145330385 | 871989650.00 | 872934369 | 112,139 |
| cbor - before | 116960762 | 995595161 | | 51,222 |
| cbor - after | 116960762 | 985745625.00 | 991288347 | 87,769 |

Notice that Ion is slower and has a higher memory peak, while JSON and CBOR are a little faster.

@cheqianh
Contributor Author

cheqianh commented Jan 5, 2024

Below is the table showing the differences in write performance after the dump refactor commit - 6dc3d96.

| name | file_size (B) | time_min (ns) | memory_usage_peak (B) |
| --- | --- | --- | --- |
| Ion - before | 22263038 | 2,332,061,975 (dump one by one)<br>2,010,238,558 (dump everything at once) | 43,540 (dump one by one)<br>42,573,352 (dump everything at once) |
| Ion - after | 22263038 | 4,139,845,225 (wrong result, didn't use the latest commit)<br>1,849,542,250 (dump everything at once) | 42,573,352 |
| json - before | 145330385 | 3,981,317,292 | 143,755,789 |
| json - after | 145330385 | 4,018,459,501 | 143,756,301 |
| cbor - before | 116960762 | 1,303,162,089 | 19,301 |
| cbor - after | 116960762 | 1,291,955,078 | 19,813 |

@tgregg
Contributor

tgregg commented Jan 5, 2024

Why is Ion read and write performance worse after the refactor?


@cheqianh
Contributor Author

cheqianh commented Jan 8, 2024

The commit 2b177af addresses the comments. A few comments and highlights below.

  1. Re: Enhance the benchmark runner to support multiple top level objects use case. #315 (comment)
    It seems that the generator can't be fully consumed by test_fun; it only parses the first value and returns when I pass a generator to a defined test_fun. I refactored the benchmark_spec.get_data_object() method to return a generator and reused it elsewhere to avoid repeating code. However, for benchmarking, we still call list(the_generator) to convert it into a list before passing it to test_fun. Do you have any concerns or comments about this?

  2. Re: Enhance the benchmark runner to support multiple top level objects use case. #315 (comment)
    I'm going to investigate why this happens and determine which benchmarking method is more accurate. Based on the previous results, there is a 20% difference between the two approaches.

  3. After the refactor, the Ion write performance dropped from about 2e9 ns to 4e9 ns. I'm not sure why; I'm investigating it now.

  4. There are some conflicts with the main branch (I guess because of the recent bare_value PR). I'll resolve them at the end.

@cheqianh cheqianh requested a review from rmarrowstone January 8, 2024 21:57
@rmarrowstone
Contributor

> It seems that the generator can't be fully consumed by test_fun; it will only parse the first value and return when I pass a generator to a defined test_fun. I refactored the benchmark_spec.get_data_object() method to return a generator and reused it in other places to avoid repeating code. However, for benchmarking, we still call list(the_generator) to convert it into a list before passing it to test_fun. Do you have any concerns or comments about this?

Can't we just pull off the generator and do nothing with the values in the test_fun?
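The suggestion - drain the generator without keeping the values - can be sketched like this (the values() generator is a hypothetical stand-in for the iterator the real loader returns):

```python
from collections import deque

def values():
    # Hypothetical generator standing in for load_dump.load()'s iterator.
    for i in range(3):
        yield {"id": i}

# Materializing with list() measures deserialization plus list building:
as_list = list(values())

# Draining without retaining values avoids the list-building cost;
# deque with maxlen=0 consumes an iterator at C speed and stores nothing.
deque(values(), maxlen=0)

# An equivalent, more explicit drain:
for _ in values():
    pass
```

Either drain ensures every top-level value is actually deserialized while keeping the list-construction overhead out of the measurement.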

@cheqianh
Contributor Author

cheqianh commented Jan 9, 2024

Highlight No. 3 in my comment above is solved, and I've updated the metrics table for the dump APIs.

This is because, for the before metrics, we benchmarked the latest ion-python commit to see how much improvement we made on the write side. However, since the PR is not merged yet, the after result was still based on the latest released version, 0.11.3. I manually overwrote it and updated the metrics. Going forward I will use commit d425587 (the current latest commit) for benchmarking comparisons in this PR.

@cheqianh
Contributor Author

cheqianh commented Jan 12, 2024

I made the changes in three separate commits for easier visibility. To review them together, you can find the link here.

As discussed offline, a fair apples-to-apples comparison would involve fully marshalling all top-level objects into memory. The library would then write them to the destination file as a stream.

Here are the details:

  • For the loading APIs of all three formats, load_dump.load() returns an iterator. The benchmark-cli will accumulate the execution time for deserializing each top-level object.
  • For the dumping APIs of all three formats, we provide test_fun with a series of top-level objects and benchmark the time it takes to write these objects as a stream.

I also modified the benchmark_spec.get_data_object() method to return a list instead of a generator. This change addresses a GHA pipeline failure on Windows PyPy. The root cause is that the method needs to keep the file open while returning a generator, which causes invalid accesses by other processes on some platforms.
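The pitfall and the fix can be sketched as follows (the function names are hypothetical; the real method lives on benchmark_spec):

```python
import json
import tempfile

def get_data_object_lazy(path):
    # Returning a generator means the file must stay open until the
    # generator is exhausted; on some platforms (e.g. Windows) another
    # process cannot touch the file while this handle is alive.
    def gen():
        with open(path) as fp:
            for line in fp:
                yield json.loads(line)
    return gen()

def get_data_object_eager(path):
    # Materializing inside the with-block closes the file before returning.
    with open(path) as fp:
        return [json.loads(line) for line in fp]

with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False) as f:
    f.write('{"name": "John"}\n{"name": "Mike"}\n')
    path = f.name

lazy = get_data_object_lazy(path)
first = next(lazy)
# The file handle stays open here until `lazy` is exhausted or collected.
rest = list(lazy)

objs = get_data_object_eager(path)
```

The eager version trades a little memory for a deterministic file lifetime, which is what the Windows PyPy failure required.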

@cheqianh cheqianh requested a review from tgregg January 12, 2024 19:26
```diff
 def dumps(self, obj):
     ion.c_ext = self._c_ext
-    return ion.dumps(obj, binary=self._binary)
+    ion.dumps(obj, binary=self._binary)
```
Contributor

Should this use sequence_as_stream=True?

Contributor Author

io_type=buffer currently doesn't handle the multiple top-level values use case; I opened an issue for that - #325

```python
data_obj = list(data_obj)

data_format = benchmark_spec.get_format()
if _format.format_is_protobuf(data_format):
```
Contributor

Why does this warrant its own error handling? Wouldn't it just fall into the default else?

Contributor Author

I'm not sure if protobuf can or should handle multiple top-level objects. I opened an issue for this - #326 - and throw an error for now, just in case.

```python
if custom_file:
    if _format.format_is_bytes(data_format):
        def test_fn():
            with open(custom_file, 'ab') as f:
```
Contributor

Prefer to avoid branches with duplicated code. Factor out what's different.

Consider:

```python
flags = 'ab' if _format.format_is_bytes(data_format) else 'at'
if custom_file:
    def fopen():
        return open(custom_file, flags)
else:
    def fopen():
        return tempfile.TemporaryFile(mode=flags)

def test_fn():
    with fopen() as f:
        loader_dumper.dump(data_obj, f)
```

I haven't run it, but that should work.

Contributor Author

Changed. Yeah it works, thanks for the recommendation.

```python
        or (format_option == Format.PROTOBUF.value) or (format_option == Format.SD_PROTOBUF.value)


def format_is_bytes(format_option):
```
Contributor

should this just call format_is_binary?

Contributor

Really, does format_is_binary exist for some other reason? Why not refactor it?

Contributor Author

ion_text is not a binary format, but it needs a 'b' flag to open the file.

Contributor

I understand the logic requirement. What I don't understand is what 'is_binary' is actually called for other than determining the flag.
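The distinction under discussion can be sketched with a hypothetical pair of predicates (the format names and sets here are illustrative, not the project's actual definitions): ion_text is not a binary encoding, yet its files need a 'b' open flag, so "needs bytes I/O" and "is a binary encoding" diverge on exactly that format.

```python
# Hypothetical format sets for illustration only.
BINARY_FORMATS = {"ion_binary", "cbor", "protobuf"}
NEEDS_BYTES_IO = BINARY_FORMATS | {"ion_text"}  # text format, bytes I/O

def format_is_binary(format_option):
    # True only for genuinely binary encodings.
    return format_option in BINARY_FORMATS

def format_is_bytes(format_option):
    # True for any format whose files must be opened in bytes mode.
    return format_option in NEEDS_BYTES_IO

# ion_text is where the two predicates disagree:
flags = "rb" if format_is_bytes("ion_text") else "r"
```

Collapsing the two into one function would force ion_text to be either wrongly "binary" or wrongly opened in text mode.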

"""
Create a benchmark function for the given `benchmark_spec`.
:param return_obj: If the test_fun returns the load object for debugging. It only works for `io-type=file` and
Contributor

I don't think we're using this pydoc style anywhere else. simpleion uses the Google pydoc style. That is the main API to this code; follow what it does.

Contributor Author

Changed.

```python
    return loader_dumper.loads(buffer)

elif match_arg == ['buffer', 'write', 'load_dump']:
    # This method returns a list
```
Contributor

remove?

Contributor Author

Removed.

Comment on lines 197 to 199
```python
obj = loader.load(fp)
for v in obj:
    rtn.append(v)
```
Contributor

Suggested change:

```python
rtn = [v for v in loader.load(fp)]
```

Shorter and, I believe, faster.

Contributor Author

Changed, thanks for catching this.

```python
import cbor2


class CborLoadDump:
```
Contributor

Cbor2LoadDump?

Contributor Author

Changed.

```diff
 def dumps(self, obj):
     ion.c_ext = self._c_ext
-    return ion.dumps(obj, binary=self._binary)
+    ion.dumps(obj, binary=self._binary)
```
Contributor

does this need to change too?

Contributor Author

Added back.

Contributor

The question was whether this needed sequence_as_stream=True.

```python
        yield json.loads(line)

    def loads(self, s):
        return json.loads(s)
```
Contributor

so loads can only handle single values?

Contributor Author

io_type=buffer currently doesn't handle the multiple top-level values use case; I opened an issue for that - #325

Contributor

Ok. At first blush it seems like just making multiple TLVs work there would be simpler than creating an issue and having divergent code paths, but we need to get this merged.

@cheqianh cheqianh merged commit aa26d86 into amazon-ion:master Jan 15, 2024