Conversation

@cheqianh
Contributor

@cheqianh cheqianh commented Dec 13, 2023

Description:

This PR adds support for the multiple top-level values use case, so that we have an apples-to-apples performance comparison across Ion/JSON/CBOR.

Note:

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@cheqianh cheqianh marked this pull request as ready for review December 28, 2023 20:13
@cheqianh cheqianh changed the title Baseline Enhance the benchmark runner to support multiple top level objects use case. Dec 28, 2023
@cheqianh
Contributor Author

cheqianh commented Dec 28, 2023

Sample files multiple_top_level_object.json and multiple_top_level_object.cbor contain equivalent data:

```json
{"name":"John", "age":30, "car":null}
{"name":"Mike", "age":33, "car":null}
{"name":"Jack", "age":24, "car":null}
```

```python
# Flag to control if the JSON file is delimited by a newline.
generate_jsonl = True

with open(ion_file, 'rb') as fp:
```
Contributor

For posterity, can you reply here to explain what was wrong with this technique?

Contributor Author
@cheqianh cheqianh Dec 28, 2023

Sure, the reasons are:

  1. Since JSON doesn't support multiple top-level objects, we used to store all JSON objects within a list. Now we iterate over the top-level values and write them one by one.
  2. The CBOR conversion was wrong. We used the JSON-encoded bytes for the JSON-to-CBOR conversion instead of the Python objects decoded from the JSON, so the resulting CBOR data was incorrect.
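The first fix can be sketched with stdlib json alone. This is a minimal, hypothetical illustration (the sample values mirror the multiple_top_level_object.json file above; the real runner reads the generated sample files):

```python
import io
import json

# Hypothetical values standing in for the parsed top-level objects.
values = [
    {"name": "John", "age": 30, "car": None},
    {"name": "Mike", "age": 33, "car": None},
]

# Old approach: wrap everything in a list, producing one JSON document.
wrapped = json.dumps(values)

# New approach: write each top-level value on its own line (JSONL),
# which matches how Ion and CBOR handle multiple top-level values.
buf = io.StringIO()
for value in values:
    buf.write(json.dumps(value))
    buf.write("\n")
jsonl = buf.getvalue()

# Reading it back: iterate over the lines, decoding one value at a time.
decoded = [json.loads(line) for line in jsonl.splitlines()]
```

The JSONL form round-trips each value independently, so no artificial wrapper list ends up in the serialized output.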

Comment on lines +232 to +233
```python
(error_code, _, _) = run_cli([f'{command}', file, '--format', f'{format_option}', '--io-type', 'file'])
assert not error_code
```
Contributor

Are we able to assert that multiple top-level values are actually read/written?

Contributor Author
@cheqianh cheqianh Dec 29, 2023

As discussed offline, I'll create a few new tests for each generated test_fun to ensure they work correctly.

```python
loader = self.get_loader_dumper()
with open(self.get_input_file(), "rb") as fp:
    self._data_object = loader.load(fp)
format_option = self.get_format()
```
Contributor

Why have all this branching here? We already have loaders for each type. Put the loader-specific code there.

Contributor Author
@cheqianh cheqianh Jan 4, 2024

Okay, I'll refactor them into the format-specific loaders. I'll run the benchmark-cli again to ensure the refactor doesn't affect write/read performance.

Contributor Author
@cheqianh cheqianh Jan 4, 2024

I just refactored JSON and CBOR; the performance is almost the same, but there is a significant difference in peak memory usage. I'm going to investigate why.

Also, since we dump the entire document for Ion to avoid unnecessary symbol table writes, IonLoadDump needs to change too.

Contributor

I wonder if it is due to building the list?

Contributor

I'm also not sure why IonLoadDump needs to change. My understanding is that we're not flushing the writer so it will just buffer until it closes, then flush the symbol table and values. Right?

Contributor Author

Re:

> I'm also not sure why IonLoadDump needs to change. My understanding is that we're not flushing the writer so it will just buffer until it closes, then flush the symbol table and values. Right?

I saw a 20% difference between dumping each top-level object and dumping the whole document. I'll try to see why it takes less time when repeatedly calling dump.
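The two dump strategies being compared can be sketched with stdlib json (a hypothetical stand-in: Ion adds symbol-table effects on top of this, so the sketch only illustrates the measurement approach, not the 20% gap itself):

```python
import io
import json
import time

# Hypothetical workload standing in for the benchmark's data objects.
values = [{"name": "user", "age": i, "car": None} for i in range(10_000)]

def dump_one_by_one(objs):
    # Strategy 1: call dump once per top-level object.
    buf = io.StringIO()
    for obj in objs:
        json.dump(obj, buf)
        buf.write("\n")
    return buf.getvalue()

def dump_all_at_once(objs):
    # Strategy 2: hand the whole document to a single dump call.
    buf = io.StringIO()
    json.dump(objs, buf)
    return buf.getvalue()

start = time.perf_counter()
one_by_one = dump_one_by_one(values)
t1 = time.perf_counter() - start

start = time.perf_counter()
at_once = dump_all_at_once(values)
t2 = time.perf_counter() - start

# Both strategies serialize the same data; only the framing differs.
print(f"one by one: {t1:.4f}s, all at once: {t2:.4f}s")
```

For Ion, strategy 2 additionally avoids re-emitting the symbol table per value, which is why the whole-document dump is the fairer baseline.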

```python
with open(data_file, "rb") as f:
    return loader_dumper.load(f)
format_option = benchmark_spec.get_format()
if _format.format_is_ion(format_option):
```
Contributor

Put the loader-specific logic in custom loaders for each type; avoid branching here.

* Modifies the write benchmarking process to dump all data, preventing the repeated writing of symbol tables
* Throws an error for protocol buffer benchmarking
* Fixes some typos
* Adds return_object flag for debugging
* Adds and refactors benchmark-cli tests
@cheqianh
Contributor Author

cheqianh commented Jan 4, 2024

The commit 74d89ad refactors the loaders (read APIs) into separate files. Below are the new results for the baseline file - a large log.

| name | file_size (B) | time_min (ns) | time_mean (ns) | memory_usage_peak (B) |
| --- | --- | --- | --- | --- |
| Ion - before | 22263038 | 4730782811 | | 171,700 |
| Ion - after | 22263038 | 5092575416.60 | 5109052881 | 276,500 |
| json - before | 145330385 | 875974056 | | 77,446 |
| json - after | 145330385 | 871989650.00 | 872934369 | 112,139 |
| cbor - before | 116960762 | 995595161 | | 51,222 |
| cbor - after | 116960762 | 985745625.00 | 991288347 | 87,769 |

Notice that Ion is slower and has a higher memory peak, while JSON and CBOR are a little faster.

@cheqianh
Contributor Author

cheqianh commented Jan 5, 2024

Below is the table showing the differences in write performance after the dump refactor commit - 6dc3d96.

| name | file_size (B) | time_min (ns) | memory_usage_peak (B) |
| --- | --- | --- | --- |
| Ion - before | 22263038 | 2,332,061,975 (dump one by one)<br>2,010,238,558 (dump everything at once) | 43,540 (dump one by one)<br>42,573,352 (dump everything at once) |
| Ion - after | 22263038 | 4,139,845,225 (wrong result, didn't use the latest commit)<br>1,849,542,250 (dump everything at once) | 42,573,352 |
| json - before | 145330385 | 3,981,317,292 | 143,755,789 |
| json - after | 145330385 | 4,018,459,501 | 143,756,301 |
| cbor - before | 116960762 | 1,303,162,089 | 19,301 |
| cbor - after | 116960762 | 1,291,955,078 | 19,813 |

@tgregg
Contributor

tgregg commented Jan 5, 2024

Why is Ion read and write performance worse after the refactor?


@cheqianh
Contributor Author

cheqianh commented Jan 8, 2024

The commit 2b177af addresses the comments. A few comments and highlights below.

  1. Re: Enhance the benchmark runner to support multiple top level objects use case. #315 (comment)
    It seems that the generator can't be fully consumed by test_fun; it only parses the first value and returns when I pass a generator to a defined test_fun. I refactored the benchmark_spec.get_data_object() method to return a generator and reused it elsewhere to avoid repeating code. However, for benchmarking, we still call list(the_generator) to convert it into a list before passing it to test_fun. Do you have any concerns or comments about this?

  2. Re: Enhance the benchmark runner to support multiple top level objects use case. #315 (comment)
    I'm going to investigate why this happens and determine which benchmarking method is more accurate. Based on the previous results, there is a 20% difference between the two approaches.

  3. After the refactor, the Ion write performance dropped from about 2e9 ns to 4e9 ns. I'm not sure why; I'm investigating it now.

  4. There are some conflicts with the main branch (I guess because of the recent bare_value PR). I'll resolve them at the end.

@cheqianh cheqianh requested a review from rmarrowstone January 8, 2024 21:57
@rmarrowstone
Contributor

> It seems that the generator can't be fully consumed by test_fun; it will only parse the first value and return when I pass a generator to a defined test_fun. I refactored the benchmark_spec.get_data_object() method to return a generator and reused it in other places to avoid repeating code. However, for benchmarking, we still call list(the_generator) to convert it into a list before passing it to test_fun. Do you have any concerns or comments about this?

Can't we just pull off the generator and do nothing with the values in the test_fun?
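The suggestion - drain the generator without keeping the values - can be sketched like this (the values() generator is a hypothetical stand-in for the iterator the real loader returns):

```python
from collections import deque

def values():
    # Hypothetical generator standing in for load_dump.load()'s iterator.
    for i in range(3):
        yield {"id": i}

# Materializing with list() measures deserialization plus list building:
as_list = list(values())

# Draining without retaining values avoids the list-building cost;
# deque with maxlen=0 consumes an iterator at C speed and stores nothing.
deque(values(), maxlen=0)

# An equivalent, more explicit drain:
for _ in values():
    pass
```

Either drain ensures every top-level value is actually deserialized while keeping the list-construction overhead out of the measurement.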

@cheqianh
Contributor Author

cheqianh commented Jan 9, 2024

Highlight No. 3 in my comment above is solved, and I've updated the metrics table for the dump APIs.

This is because, for the before metrics, we benchmarked the latest ion-python commit to see how much improvement we made on the write side. However, since the PR is not merged yet, the after result was still based on the latest released version, 0.11.3. I manually overwrote it and updated the metrics. Going forward I will use commit d425587 (the current latest commit) for benchmarking comparisons in this PR.

@cheqianh
Contributor Author

cheqianh commented Jan 12, 2024

I made the changes in three separate commits for easier visibility. To review them together, you can find the link here.

As discussed offline, a fair apples-to-apples comparison would involve fully marshalling all top-level objects into memory. The library would then write them to the destination file as a stream.

Here are the details:

  • For the loading APIs of all three formats, load_dump.load() returns an iterator. The benchmark-cli will accumulate the execution time for deserializing each top-level object.
  • For the dumping APIs of all three formats, we provide test_fun with a series of top-level objects and benchmark the time it takes to write these objects as a stream.

I also modified the benchmark_spec.get_data_object() method to return a list instead of a generator. This change addresses a GHA pipeline failure on Windows PyPy. The root cause is that the method needs to keep the file open while returning a generator, which causes invalid accesses by other processes on some platforms.
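The pitfall and the fix can be sketched as follows (the function names are hypothetical; the real method lives on benchmark_spec):

```python
import json
import tempfile

def get_data_object_lazy(path):
    # Returning a generator means the file must stay open until the
    # generator is exhausted; on some platforms (e.g. Windows) another
    # process cannot touch the file while this handle is alive.
    def gen():
        with open(path) as fp:
            for line in fp:
                yield json.loads(line)
    return gen()

def get_data_object_eager(path):
    # Materializing inside the with-block closes the file before returning.
    with open(path) as fp:
        return [json.loads(line) for line in fp]

with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False) as f:
    f.write('{"name": "John"}\n{"name": "Mike"}\n')
    path = f.name

lazy = get_data_object_lazy(path)
first = next(lazy)
# The file handle stays open here until `lazy` is exhausted or collected.
rest = list(lazy)

objs = get_data_object_eager(path)
```

The eager version trades a little memory for a deterministic file lifetime, which is what the Windows PyPy failure required.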

@cheqianh cheqianh requested a review from tgregg January 12, 2024 19:26
```diff
 def dumps(self, obj):
     ion.c_ext = self._c_ext
-    return ion.dumps(obj, binary=self._binary)
+    ion.dumps(obj, binary=self._binary)
```
Contributor

Should this use sequence_as_stream=True?

Contributor Author

io_type=buffer currently doesn't handle the multiple top-level values use case; I opened an issue for that - #325

```python
data_obj = list(data_obj)

data_format = benchmark_spec.get_format()
if _format.format_is_protobuf(data_format):
```
Contributor

Why does this warrant its own error handling? Wouldn't it just fall into the default else?

Contributor Author

I'm not sure if protobuf can or should handle multiple top-level objects. I opened an issue for this - #326 - and throw an error for now, just in case.

```python
if custom_file:
    if _format.format_is_bytes(data_format):
        def test_fn():
            with open(custom_file, 'ab') as f:
```
Contributor

Prefer to avoid branches with duplicated code. Factor out what's different.

Consider:

```python
flags = 'ab' if _format.format_is_bytes(data_format) else 'at'
if custom_file:
    def fopen():
        return open(custom_file, flags)
else:
    def fopen():
        return tempfile.TemporaryFile(mode=flags)

def test_fn():
    with fopen() as f:
        loader_dumper.dump(data_obj, f)
```

I haven't run it, but that should work.

Contributor Author

Changed. Yeah it works, thanks for the recommendation.

```python
        or (format_option == Format.PROTOBUF.value) or (format_option == Format.SD_PROTOBUF.value)


def format_is_bytes(format_option):
```
Contributor

should this just call format_is_binary?

Contributor

Really, does format_is_binary exist for some other reason? Why not refactor it?

Contributor Author

ion_text is not a binary format, but it needs a 'b' flag to open the file.

Contributor

I understand the logic requirement. What I don't understand is what 'is_binary' is actually called for other than determining the flag.
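The distinction under discussion can be sketched with a hypothetical pair of predicates (the format names and sets here are illustrative, not the project's actual definitions): ion_text is not a binary encoding, yet its files need a 'b' open flag, so "needs bytes I/O" and "is a binary encoding" diverge on exactly that format.

```python
# Hypothetical format sets for illustration only.
BINARY_FORMATS = {"ion_binary", "cbor", "protobuf"}
NEEDS_BYTES_IO = BINARY_FORMATS | {"ion_text"}  # text format, bytes I/O

def format_is_binary(format_option):
    # True only for genuinely binary encodings.
    return format_option in BINARY_FORMATS

def format_is_bytes(format_option):
    # True for any format whose files must be opened in bytes mode.
    return format_option in NEEDS_BYTES_IO

# ion_text is where the two predicates disagree:
flags = "rb" if format_is_bytes("ion_text") else "r"
```

Collapsing the two into one function would force ion_text to be either wrongly "binary" or wrongly opened in text mode.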

"""
Create a benchmark function for the given `benchmark_spec`.
:param return_obj: If the test_fun returns the load object for debugging. It only works for `io-type=file` and
Contributor

I don't think we're using this pydoc style anywhere else. simpleion uses the Google pydoc style. That is the main API to this code; follow what it does.

Contributor Author

Changed.

```python
    return loader_dumper.loads(buffer)

elif match_arg == ['buffer', 'write', 'load_dump']:
    # This method returns a list
```
Contributor

remove?

Contributor Author

Removed.

Comment on lines 197 to 199
```python
obj = loader.load(fp)
for v in obj:
    rtn.append(v)
```
Contributor

Suggested change:

```python
rtn = [v for v in loader.load(fp)]
```

Shorter and, I believe, faster.

Contributor Author

Changed, thanks for catching this.

```python
import cbor2


class CborLoadDump:
```
Contributor

Cbor2LoadDump?

Contributor Author

Changed.

```diff
 def dumps(self, obj):
     ion.c_ext = self._c_ext
-    return ion.dumps(obj, binary=self._binary)
+    ion.dumps(obj, binary=self._binary)
```
Contributor

does this need to change too?

Contributor Author

Added back.

Contributor

The question was whether this needed sequence_as_stream=True.

```python
        yield json.loads(line)

    def loads(self, s):
        return json.loads(s)
```
Contributor

so loads can only handle single values?

Contributor Author

io_type=buffer currently doesn't handle the multiple top-level values use case; I opened an issue for that - #325

Contributor

Ok. At first blush it seems like just making multiple TLVs work there would be simpler than creating an issue and having divergent code paths, but we need to get this merged.

@cheqianh cheqianh merged commit aa26d86 into amazon-ion:master Jan 15, 2024