fix(file source) Fix a data corruption bug with multi-char delimiters #24028
base: master
…ppen right at a buffer boundary.
Hi @lfrancke, thanks for your contribution! Since this alters Vector's behavior, it is considered a user-facing change (I already edited the PR description). Could you please add a changelog? Thanks! Also, your changes seem sound to me, but I still need to review them more thoroughly. I will take a closer look soon.
Will do! Thanks.
I pushed the changelog and removed the debug statements. I believe it's ready for review.
NOTE: I removed async/await parts from the original patch as that comes after 0.49.0

```sh
pushd $(cargo patchable checkout vector 0.49.0)
git remote add lfrancke https://github.com/lfrancke/vector
git fetch lfrancke
git cherry-pick 3ce729073f23631dd7b5525be640b5fa15af0223
# and
git cherry-pick --continue
git commit --amend
popd
cargo patchable export vector 0.49.0
```
* chore(vector): Init patchable
* chore(stackable-devel): Make a special variant for Vector so that a different rust toolchain can be selected
* chore(stackable-devel): Add note about moving the version to boil-config.toml once renovate can check there (for consistency)
* chore(nix): Add rust and cargo dependencies

  Otherwise cargo can't be found

  ```
  error: the 'cargo' binary, normally provided by the 'cargo' component, is not applicable to the '1.89.0-x86_64-unknown-linux-gnu' toolchain
  ```
* chore(vector): Build from source (based on ubi9-rust-builder)

  NOTE: The ubi9-rust-builder could not be used as it contains `ONBUILD` steps which we need to run after patchable does its thing. Also, it is specifically designed for operators and their layout (under `rust/` and using workspaces).
* chore(nix): Remove unused image-tools
* chore(issue_template/vector): Update instructions for version bumps
* fix(vector): Cherry pick unmerged patch from vectordotdev/vector#24028

  NOTE: I removed async/await parts from the original patch as that comes after 0.49.0

  ```sh
  pushd $(cargo patchable checkout vector 0.49.0)
  git remote add lfrancke https://github.com/lfrancke/vector
  git fetch lfrancke
  git cherry-pick 3ce729073f23631dd7b5525be640b5fa15af0223
  # and
  git cherry-pick --continue
  git commit --amend
  popd
  cargo patchable export vector 0.49.0
  ```
* chore(vector): Add maintainer label

  This seems to be added to other images, so I'm just copying that.
* chore: Update changelog
* Apply suggestions from code review

  Co-authored-by: Techassi <[email protected]>
* chore(vector): Remove unused upload script
* chore(vector): Remove old comments, add new todo

---------

Co-authored-by: Techassi <[email protected]>
@thomasqueirozb a quick ping. Considering that this bug corrupts data, I hope a ping is fine here.
thomasqueirozb left a comment
Thanks for the contribution; this fix is very welcome! Sorry for the delay.
lib/file-source-common/src/buffer.rs
Outdated
// Check if the end matches a prefix of the delimiter.
// We iterate from longest to shortest prefix and break on first match.
// Performance: For typical 2-byte delimiters (CRLF), this is 1 iteration.
// For longer delimiters, this runs O(delim_len) times but only occurs
// at buffer boundaries (~every 8KB), making the impact negligible.
If I'm reading this correctly, this will only run more than once if the delimiter is larger than 8 KB or if we fail to fetch the whole delimiter in this iteration (maybe due to some I/O issue). Is this correct?
Instead of answering here I tried to add more comments: 0254302 (#24028)
I had to reread my code as well and it's only been a month. I hope these new explanations help.
One thing that I'm not sure about, though I'd be happy to leave it out of scope for now: what happens if someone uses a delimiter that is itself larger than 8 KB? I didn't think that through.
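For readers following this thread, the check described in the code comments can be sketched roughly as below. This is a minimal illustration, not the code from the PR: the function name `partial_delimiter_suffix_len` is hypothetical, and the sketch assumes the goal is to find how many bytes at the end of a buffer could be the start of a delimiter that continues in the next read.

```rust
/// Returns how many trailing bytes of `buf` match a proper prefix of `delim`.
/// Hypothetical helper for illustration; not the function used in the PR.
fn partial_delimiter_suffix_len(buf: &[u8], delim: &[u8]) -> usize {
    // A delimiter of 0 or 1 bytes can never be split across two reads.
    if delim.len() < 2 {
        return 0;
    }
    // Only proper prefixes matter here: a full match would already have been
    // found by the regular delimiter search.
    let max_len = (delim.len() - 1).min(buf.len());
    // Iterate from the longest to the shortest prefix and stop at the first match.
    for len in (1..=max_len).rev() {
        if buf.ends_with(&delim[..len]) {
            return len;
        }
    }
    0
}

fn main() {
    // A trailing "\r" may be the first half of a CRLF delimiter.
    assert_eq!(partial_delimiter_suffix_len(b"line one\r", b"\r\n"), 1);
    // Three of the four delimiter bytes arrived before the boundary.
    assert_eq!(partial_delimiter_suffix_len(b"abc<|>", b"<|>|"), 3);
    // No partial match at the end of the buffer.
    assert_eq!(partial_delimiter_suffix_len(b"abc", b"\r\n"), 0);
}
```

For a 2-byte delimiter the loop body runs once, which matches the performance note in the comment above; the cost grows linearly with the delimiter length.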
/// 1. Creates test data with delimiters positioned to split at buffer boundaries
/// 2. Tests multiple iterations to ensure state tracking works correctly
/// 3. Verifies all lines are correctly separated without merging
async fn test_delimiter_boundary_split_helper(delimiter: &[u8], num_lines: usize) {
Does this work if we add a #[cfg(test)] so that this isn't compiled when outside of tests?
Absolutely! Thanks.
Addressed in 6fc45e6 (#24028)
#[tokio::test]
async fn test_single_byte_delimiter_boundary() {
    // Test single-byte delimiter (should work without any special handling)
    test_delimiter_boundary_split_helper(b"\n", 5).await;
}

#[tokio::test]
async fn test_two_byte_delimiter_boundary() {
    // Test two-byte delimiter (CRLF case)
    test_delimiter_boundary_split_helper(b"\r\n", 5).await;
}

#[tokio::test]
async fn test_three_byte_delimiter_boundary() {
    test_delimiter_boundary_split_helper(b"|||", 5).await;
}

#[tokio::test]
async fn test_four_byte_delimiter_boundary() {
    test_delimiter_boundary_split_helper(b"<|>|", 5).await;
}

#[tokio::test]
async fn test_five_byte_delimiter_boundary() {
    test_delimiter_boundary_split_helper(b"<<>>>", 5).await;
}
These tests are nice. Are we able to build something similar by creating a test for the file source? I guess it shouldn't be that hard given the existing repro repo. If we create a file with the same data as in the repo we should be able to assert the data is correct (and also that the test should fail on master today).
I added something based on the existing test that was already there. I hope that's what you meant. It passes for me with my patch and fails for me without.
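As an illustration of what "delimiters positioned to split at buffer boundaries" can look like, here is one way such test data might be generated. This is a sketch under assumptions, not the helper from the PR: `build_boundary_data` and its `buf_size` parameter are hypothetical, the filler byte `x` is assumed not to occur in the delimiter, and the example uses a 16-byte buffer for readability rather than the roughly 8 KB mentioned in the review.

```rust
/// Builds `num_lines` lines of filler so that every delimiter starts on the
/// last byte before a multiple of `buf_size` and therefore straddles a read
/// boundary. Hypothetical helper; the filler byte b'x' is assumed not to
/// appear in `delim`.
fn build_boundary_data(delim: &[u8], num_lines: usize, buf_size: usize) -> Vec<u8> {
    assert!(delim.len() >= 2, "a single-byte delimiter cannot be split");
    assert!(buf_size > delim.len(), "each line needs at least one filler byte");
    let mut data = Vec::new();
    for i in 1..=num_lines {
        // Offset at which the i-th delimiter must begin: one byte before
        // the i-th buffer boundary.
        let delimiter_start = i * buf_size - 1;
        // Pad the current line with filler up to that offset.
        data.resize(delimiter_start, b'x');
        data.extend_from_slice(delim);
    }
    data
}

fn main() {
    let data = build_boundary_data(b"\r\n", 3, 16);
    // Reading in 16-byte chunks puts "\r" at the end of one chunk and "\n"
    // at the start of the next.
    let chunks: Vec<&[u8]> = data.chunks(16).collect();
    assert_eq!(chunks[0].last(), Some(&b'\r'));
    assert_eq!(chunks[1].first(), Some(&b'\n'));
}
```

Feeding `data.chunks(buf_size)` into the reader under test and comparing the resulting lines against the expected filler lines then exercises the boundary case on every line.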
Co-authored-by: Thomas <[email protected]>
Summary
Fix a problem where multi-character delimiters fail to be parsed if they fall right at a buffer boundary.
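The failure mode can be illustrated with a simplified model. The sketch below is not Vector's reader: `naive_split` and `boundary_aware_split` are hypothetical functions that consume pre-chunked input, and the second one re-scans the tail of the buffered bytes, which is a simpler technique than the prefix check shown in the review snippet. Both drop an unterminated final line for brevity.

```rust
/// Finds the first occurrence of `needle` in `haystack` (`needle` must be non-empty).
fn find(haystack: &[u8], needle: &[u8]) -> Option<usize> {
    haystack.windows(needle.len()).position(|w| w == needle)
}

/// Simplified model of the bug: each chunk is searched on its own, so a
/// delimiter split across two chunks is never recognized.
fn naive_split(chunks: &[&[u8]], delim: &[u8]) -> Vec<Vec<u8>> {
    let mut lines = Vec::new();
    let mut carry: Vec<u8> = Vec::new();
    for chunk in chunks {
        let mut rest: &[u8] = chunk;
        while let Some(pos) = find(rest, delim) {
            carry.extend_from_slice(&rest[..pos]);
            lines.push(std::mem::take(&mut carry));
            rest = &rest[pos + delim.len()..];
        }
        carry.extend_from_slice(rest);
    }
    lines
}

/// Boundary-aware variant: before searching, step back far enough into the
/// buffered bytes to cover a delimiter that began in the previous chunk.
fn boundary_aware_split(chunks: &[&[u8]], delim: &[u8]) -> Vec<Vec<u8>> {
    let mut lines = Vec::new();
    let mut pending: Vec<u8> = Vec::new();
    for chunk in chunks {
        let mut search_from = pending.len().saturating_sub(delim.len() - 1);
        pending.extend_from_slice(chunk);
        while let Some(pos) = find(&pending[search_from..], delim) {
            let end = search_from + pos;
            // Everything after the delimiter belongs to the next line.
            let rest = pending.split_off(end + delim.len());
            pending.truncate(end);
            lines.push(std::mem::replace(&mut pending, rest));
            search_from = 0;
        }
    }
    lines
}

fn main() {
    // The CRLF delimiter is split across the two chunks.
    let chunks: [&[u8]; 2] = [b"a\r", b"\nb\r\n"];
    // The naive version merges both lines and keeps the delimiter inside.
    assert_eq!(naive_split(&chunks, b"\r\n"), vec![b"a\r\nb".to_vec()]);
    assert_eq!(
        boundary_aware_split(&chunks, b"\r\n"),
        vec![b"a".to_vec(), b"b".to_vec()]
    );
}
```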
Vector configuration
See https://github.com/lfrancke/vector-repro-24027 for a reproduction repository
How did you test this PR?
The repro repo contains a test case which I used.
In addition, I added unit tests for delimiters of 1 to 5 characters.
Change Type
Is this a breaking change?
Does this PR include user facing changes?
no-changelog label to this PR.
References
Notes
@vectordotdev/vector to reach out to us regarding this PR.
pre-push hook, please see this template.
make fmt
make check-clippy (if there are failures it's possible some of them can be fixed with make clippy-fix)
make test
git merge origin master and git push.
Cargo.lock), please run make build-licenses to regenerate the license inventory and commit the changes (if any). More details here.