Initial set of Data Reader changes for Streaming Ingestion #935
Conversation
Signed-off-by: Govind Kamat <[email protected]>
LGTM. @gkamat Could we sync to see how this works E2E?
# For testing. Set credentials with environment variables.
self.s3_client = client('s3')
self.chunk_size = IngestionManager.chunk_size * 1024**2
self.num_workers = os.cpu_count() * 2
Will we always want 2x the number of CPU cores or will we ever want to use less?
That is a good question, given that the cores are used for the bulk operations as well. This is one of the items to be evaluated during the scaling experiments #929, but for now the feature seems to perform generally well with this number.
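For the scaling experiments, one way the default could be made tunable without changing current behavior is sketched below; the `OSB_INGEST_WORKERS` environment variable and the function name are purely hypothetical and not part of this PR:

```python
import os

def ingest_worker_count(default_multiplier: int = 2) -> int:
    """Return the data-reader worker count.

    Defaults to 2x the CPU cores, matching this PR; OSB_INGEST_WORKERS is a
    hypothetical override so scaling experiments can try other values.
    """
    override = os.getenv("OSB_INGEST_WORKERS")
    if override is not None:
        return max(1, int(override))
    return (os.cpu_count() or 1) * default_multiplier
```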
@@ -412,6 +415,8 @@ def __init__(self, cfg):

     @staticmethod
     def prepare_docs(cfg, workload, corpus, preparator):
+        if corpus.streaming_ingestion:
+            return
Nit: A logging statement here would be useful. It could mention that the workload loader detected that the user is using streaming ingestion, and which cloud provider is in use (only aws for now).
The change still needs logging (and error checking) added throughout, but I will include this one for clarity.
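A minimal sketch of what that log line could look like; the logger setup, the `corpus.name` attribute, and the message wording are assumptions, not part of the diff:

```python
import logging

logger = logging.getLogger(__name__)

def prepare_docs(cfg, workload, corpus, preparator):
    if corpus.streaming_ingestion:
        # Hypothetical log line per the review suggestion: note that the
        # workload loader detected streaming ingestion and name the provider.
        logger.info("Workload loader detected streaming ingestion for corpus "
                    "[%s]; provider: aws (the only one supported for now). "
                    "Skipping on-disk document preparation.", corpus.name)
        return
```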
@@ -776,8 +777,7 @@ def __init__(self, corpora, batch_size, bulk_size, ingest_percentage, id_conflic
     # use a value > 0 so percent_completed returns a sensible value
     self.total_bulks = 1
     self.infinite = False
-    # TBD: obtain value from workload spec.
-    self.streaming_ingestion = False
+    self.streaming_ingestion = corpora[0].streaming_ingestion
With the index 0, will streaming ingestion always be limited to a single corpus?
Yes. Dividing the bandwidth allocated to the cloud-storage download across multiple document sources would not help, compared to optimizing the download speed for a single corpus. We could look into this as a later improvement, perhaps.
# pylint: disable = import-outside-toplevel
from osbenchmark.cloud_provider.vendors.s3_data_producer import S3DataProducer
producer = S3DataProducer("big5-corpus", keys, client_options, Slice.data_dir)
bucket = re.sub('^s3://', "", Slice.base_url)
Understand that this is still the initial set of changes for the MVP, but what are your thoughts on moving the instantiation of the producer outside of `_start_producer()`? We could have a separate method dedicated to setting up the producer -- such as `_setup_data_producer()` -- based on the `self.streaming_ingestion` flag detected in the workload. After that, the producer can be passed into `_start_producer()`. That way, the main workflow will be preserved and we can support other types of data producers later (e.g. `azure`, `gcp`, etc.)
That would certainly be the right way of modularizing this. For now, the focus was on getting the feature in end-to-end prior to the 2.0 release deadline.
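For later reference, a rough sketch of the shape the reviewer suggests; everything here except `S3DataProducer` and its call signature (taken from the hunk above) is hypothetical:

```python
from urllib.parse import urlparse

def _setup_data_producer(self, corpus_name, keys, client_options, data_dir, base_url):
    """Hypothetical factory per the review suggestion: choose a producer
    from the corpus base URL scheme, then hand it to _start_producer()."""
    scheme = urlparse(base_url).scheme
    if scheme == "s3":
        # pylint: disable = import-outside-toplevel
        from osbenchmark.cloud_provider.vendors.s3_data_producer import S3DataProducer
        return S3DataProducer(corpus_name, keys, client_options, data_dir)
    # azure, gcp, etc. could slot in here later.
    raise ValueError(f"Unsupported streaming ingestion provider: {scheme!r}")
```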
Yes, let's do that later today.
Signed-off-by: Govind Kamat <[email protected]>
Signed-off-by: Govind Kamat <[email protected]>
Description
Initial set of Data Reader changes for Streaming Ingestion. This is an interim check-in to make the feature available in OSB 2.0 as an alpha capability. Several enhancements are yet to be added; these will be made available progressively after the release. Integration tests are also to be added.
For now, the focus has been to ensure that existing OSB functionality is unaffected when Streaming Ingestion is not enabled.
Usage
To operate in Streaming Ingestion mode, the `streaming-ingestion` tag must be attached to the corpus in the workload specification. Currently, only `aws` is supported, and a single corpus must be accessible on S3, with the appropriate credentials enabled. For example, with the `http_logs` workload:
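(The example itself is not reproduced here. As a hypothetical sketch only, the corpus entry in the workload specification might look roughly like the following; the exact key name and value shape of the tag, the bucket path, and the document fields are assumptions, not taken from this PR.)

```json
{
  "corpora": [
    {
      "name": "http_logs",
      "base-url": "s3://your-bucket/corpora/http_logs",
      "streaming-ingestion": true,
      "documents": [
        {
          "source-file": "documents-181998.json.bz2",
          "target-index": "logs-181998"
        }
      ]
    }
  ]
}
```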
Issues Addressed
#918
Testing
[Describe how this change was tested]
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.