Enhance the benchmark runner to support multiple top level objects use case. #315
Conversation
Sample file:

# Flag to control if the JSON file is delimited by a newline.
generate_jsonl = True

with open(ion_file, 'br') as fp:
For posterity, can you reply here to explain what was wrong with this technique?
Sure, the reasons are:
- Since JSON doesn't support multiple top-level objects, we used to store all JSON objects within a list. Now we iterate over the top-level values and write them one by one.
- The CBOR conversion was wrong. We encoded the raw JSON bytes for the JSON/CBOR conversion instead of the Python objects decoded from the JSON, so the resulting CBOR data was incorrect.
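For reference, a minimal sketch of both fixes, assuming the data is newline-delimited JSON; the function names here are illustrative, not the actual benchmark code:

import json
import cbor2

def write_jsonl(objects, path):
    # Write each top-level value on its own line (JSONL) instead of
    # wrapping all values in a single JSON list.
    with open(path, 'w') as fp:
        for obj in objects:
            fp.write(json.dumps(obj))
            fp.write('\n')

def jsonl_to_cbor(json_path, cbor_path):
    # Encode the *decoded* Python objects; encoding the raw JSON bytes
    # themselves is what produced incorrect CBOR data.
    with open(json_path, 'r') as jf, open(cbor_path, 'wb') as cf:
        for line in jf:
            if line.strip():
                cbor2.dump(json.loads(line), cf)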
(error_code, _, _) = run_cli([f'{command}', file, '--format', f'{format_option}', '--io-type', 'file'])
assert not error_code
Are we able to assert that multiple top-level values are actually read/written?
As discussed offline, I'll create a few new tests for each generated test_fun to ensure they work correctly.
loader = self.get_loader_dumper()
with open(self.get_input_file(), "rb") as fp:
    self._data_object = loader.load(fp)
format_option = self.get_format()
Why have all this branching here? We already have loaders for each type; put the loader-specific code there.
Okay, I'll refactor this into the format-specific loader files. I'll run the benchmark CLI again to ensure the refactored read/write path doesn't affect performance.
Just refactored JSON and CBOR; the performance is almost the same, but there is a significant difference in peak memory usage. I'm going to investigate why.
Also, since we dump the entire document for Ion to avoid unnecessary symbol table writes, IonLoadDump needs to change too.
I wonder if it is due to building the list?
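To illustrate the suspicion, here is a sketch of the two read shapes, assuming loader.load yields top-level values one at a time (process is a placeholder, not a real helper):

def read_all(loader, fp):
    # Materializing the generator holds every top-level value in memory
    # at once, so peak memory tracks the whole document.
    return list(loader.load(fp))

def read_streaming(loader, fp, process):
    # Consuming the generator lazily keeps only one value live at a time.
    for value in loader.load(fp):
        process(value)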
I'm also not sure why IonLoadDump needs to change. My understanding is that we're not flushing the writer so it will just buffer until it closes, then flush the symbol table and values. Right?
Re: "I'm also not sure why IonLoadDump needs to change. My understanding is that we're not flushing the writer so it will just buffer until it closes, then flush the symbol table and values. Right?"
I saw a 20% difference between dumping each top-level object and dumping the whole document. I'll try to find out why it's faster when repeatedly calling dump.
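For context, the two write strategies being compared, sketched with amazon.ion.simpleion; values stands in for the in-memory list of top-level objects and out_file for the destination path:

from amazon.ion import simpleion

def dump_whole_document(values, out_file):
    # A single dump call: sequence_as_stream=True writes the values as
    # a stream under one symbol table.
    with open(out_file, 'wb') as fp:
        simpleion.dump(values, fp, binary=True, sequence_as_stream=True)

def dump_value_by_value(values, out_file):
    # One dump call per top-level value; each call writes its own
    # preamble, which is the repeated symbol table cost noted in this PR.
    with open(out_file, 'wb') as fp:
        for value in values:
            simpleion.dump(value, fp, binary=True)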
with open(data_file, "rb") as f:
    return loader_dumper.load(f)
format_option = benchmark_spec.get_format()
if _format.format_is_ion(format_option):
Put the loader-specific logic in custom loaders for each type; avoid branching here.
* Modifies the write benchmarking process to dump all data, preventing the repeated writing of symbol tables
* Throws an error for protocol buffer benchmarking
* Fixes some typos
* Adds return_object flag for debugging
* Adds and refactors benchmark-cli tests
Commit 74d89ad refactors the loaders (read APIs) into separate files. Below are the new results for the baseline file, a large log.

[results table omitted]

Noticed that Ion is slower and has a higher memory peak, while JSON and CBOR are a little bit faster.

Below is the table showing the differences in write performance after the dump refactor commit, 6dc3d96.

[results table omitted]

Why is Ion read and write performance worse after the refactor?
Commit 2b177af addressed the comment. A few comments and highlights below.
"Can't we just pull off the generator and do nothing with the values in the …"
Highlight No. 3 in my comment above is solved, and I updated the metrics table for the dump APIs. This is because for the …
… avoid file access issues in Windows PyPy.
I made the changes in three separate commits for easier visibility. To review them together, you can find the link here. As discussed offline, a fair apples-to-apples comparison would involve fully marshalling all top-level objects into memory; the library would then write them to the destination file as a stream. Here are the details:

I also modified the …
amazon/ionbenchmark/ion_load_dump.py
Outdated
 def dumps(self, obj):
     ion.c_ext = self._c_ext
-    return ion.dumps(obj, binary=self._binary)
+    ion.dumps(obj, binary=self._binary)
Should this use sequence_as_stream=True?
io_type=buffer currently doesn't handle the multiple top-level values use case; I opened an issue for that: #325.
data_obj = list(data_obj)

data_format = benchmark_spec.get_format()
if _format.format_is_protobuf(data_format):
Why does this warrant its own error handling? Wouldn't it just fall into the default else?
I'm not sure whether protobuf can or should handle multiple top-level objects. I opened an issue for this (#326) and throw an error for now, just in case.
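A sketch of the guard being described, reusing the names from the diff above; the exact error type and message are assumptions, not the merged code:

if _format.format_is_protobuf(data_format):
    # Unclear whether protobuf can or should handle multiple top-level
    # objects; fail loudly until issue #326 is resolved.
    raise NotImplementedError(
        'Protobuf benchmarking with multiple top-level objects is not '
        'supported yet; see issue #326.')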
if custom_file:
    if _format.format_is_bytes(data_format):
        def test_fn():
            with open(custom_file, 'ab') as f:
Prefer to avoid branches with duplicated code. Factor out what's different.
Consider:
flags = 'ab' if _format.format_is_bytes(data_format) else 'at'

if custom_file:
    def fopen():
        return open(custom_file, flags)
else:
    def fopen():
        return tempfile.TemporaryFile(mode=flags)

def test_fn():
    with fopen() as f:
        loader_dumper.dump(data_obj, f)

I haven't run it, but that should work.
Changed. Yeah, it works; thanks for the recommendation.
or (format_option == Format.PROTOBUF.value) or (format_option == Format.SD_PROTOBUF.value)

def format_is_bytes(format_option):
Should this just call format_is_binary?
Really, does format_is_binary exist for some other reason? Why not refactor it?
ion_text is not a binary format, but it needs a 'b' flag to open the file.
I understand the logic requirement. What I don't understand is what is_binary is actually called for, other than determining the flag.
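To make the distinction concrete, a sketch assuming the Format enum has the members shown (only PROTOBUF and SD_PROTOBUF appear in the diff above; the others are assumptions): format_is_binary describes the encoding, format_is_bytes describes the open-mode requirement, and ion_text is where they diverge.

def format_is_binary(format_option):
    # Formats whose encoding is binary.
    return (format_option == Format.ION_BINARY.value) \
        or (format_option == Format.CBOR.value) \
        or (format_option == Format.PROTOBUF.value)

def format_is_bytes(format_option):
    # Formats that must be opened with a 'b' flag. ion_text is a text
    # format but is still read and written through byte streams.
    return format_is_binary(format_option) \
        or (format_option == Format.ION_TEXT.value)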
| """ | ||
| Create a benchmark function for the given `benchmark_spec`. | ||
| :param return_obj: If the test_fun returns the load object for debugging. It only works for `io-type=file` and |
I don't think we're using this pydoc style anywhere else. simpleion uses the Google pydoc style; that is the main API to this code, so follow what it does.
Changed.
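For reference, a docstring in the Google style being asked for might look like this; the function name is illustrative and the parameter wording is taken from the diff above, not the final code:

def create_test_fn(benchmark_spec, return_obj=False):
    """Create a benchmark function for the given benchmark_spec.

    Args:
        benchmark_spec: Spec describing the command, format, and io-type.
        return_obj: If True, the test function returns the loaded object
            for debugging. It only works for io-type=file.

    Returns:
        A zero-argument callable to be timed.
    """
    ...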
return loader_dumper.loads(buffer)

elif match_arg == ['buffer', 'write', 'load_dump']:
    # This method returns a list
Remove?
Removed.
obj = loader.load(fp)
for v in obj:
    rtn.append(v)
Suggested change:
-    obj = loader.load(fp)
-    for v in obj:
-        rtn.append(v)
+    rtn = [v for v in loader.load(fp)]
Shorter and, I believe, faster.
Changed, thanks for catching this.
import cbor2

class CborLoadDump:
Cbor2LoadDump?
Changed.
amazon/ionbenchmark/ion_load_dump.py
Outdated
 def dumps(self, obj):
     ion.c_ext = self._c_ext
-    return ion.dumps(obj, binary=self._binary)
+    ion.dumps(obj, binary=self._binary)
Does this need to change too?
Added back.
The question was whether this needs sequence_as_stream=True.
yield json.loads(line)

def loads(self, s):
    return json.loads(s)
So loads can only handle single values?
io_type=buffer currently doesn't handle the multiple top-level values use case; I opened an issue for that: #325.
OK. At first blush, it seems like just handling multiple TLVs there would be simpler than creating an issue and having divergent code paths, but we need to get this merged.
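For what it's worth, handling multiple TLVs in the buffer path could look like this for the JSON loader, assuming the buffer is newline-delimited like the file path already is (loads_multi is a hypothetical name, not part of this PR):

import json

def loads_multi(s):
    # Yield every top-level value from a newline-delimited buffer
    # instead of assuming the buffer holds a single value.
    for line in s.splitlines():
        if line.strip():
            yield json.loads(line)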
Description:
This PR adds support for the multiple top-level objects use case, so that we have an apples-to-apples performance comparison for Ion/JSON/CBOR.
Note:
- This PR focuses on io_type=file. We might consider adding io_type=buffer later, but we need to figure out how to fairly use JSON and CBOR's load/dump APIs to iterate over each of multiple top-level objects (Benchmark-cli io-type=buffer should support benchmarking multiple top-level objects, #325).
- When we benchmark read speed for a JSON document that includes multiple top-level objects, we assume the given data is in JSONL, so we have to modify our JSON sample data (code here).
- I'm not sure if protobuf has the same issue for the multiple top-level objects use case. This needs to be figured out when we come back to protobuf in the future (Benchmark-cli protobuf multiple top level objects use case, #326).
Three files I used for testing locally; I uploaded them for visibility, but they will not be included when this PR is merged into the main branch. They are:
1. amazon/ionbenchmark/json_newline_vs_list_repro.py
2. amazon/ionbenchmark/format_conversion_correct.py
3. amazon/ionbenchmark/format_conversion_inaccurate.py
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.