@lfrancke lfrancke commented Oct 20, 2025

Summary

Fix a bug where multi-character delimiters fail to be parsed when they are split across a buffer boundary.
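
To make the failure mode concrete, here is a minimal sketch (not the code from this PR) that compares searching each read chunk on its own against searching the whole stream. The function names `count_per_chunk` and `count_whole` are hypothetical, and the sketch assumes a non-empty delimiter.

```rust
/// Naive approach: search every chunk independently, as if a read
/// could never end in the middle of a delimiter.
/// Assumes `delim` is non-empty (`windows(0)` would panic).
fn count_per_chunk(chunks: &[&[u8]], delim: &[u8]) -> usize {
    chunks
        .iter()
        .map(|chunk| chunk.windows(delim.len()).filter(|w| *w == delim).count())
        .sum()
}

/// Reference behaviour: search the concatenated stream, so chunk
/// boundaries cannot hide a delimiter.
fn count_whole(chunks: &[&[u8]], delim: &[u8]) -> usize {
    let joined: Vec<u8> = chunks.concat();
    joined.windows(delim.len()).filter(|w| *w == delim).count()
}
```

For the stream `a\r\nb\r\nc` read as the two chunks `a\r\nb\r` and `\nc`, the per-chunk search sees only the first CRLF, so the lines `b` and `c` would be merged into one event.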

Vector configuration

See https://github.com/lfrancke/vector-repro-24027 for a reproduction repository

How did you test this PR?

The repro repo contains a test case, which I used.
In addition, I added unit tests for delimiters of 1 to 5 bytes.

Change Type

  • Bug fix
  • New feature
  • Non-functional (chore, refactoring, docs)
  • Performance

Is this a breaking change?

  • Yes
  • No

Does this PR include user facing changes?

  • Yes. Please add a changelog fragment based on our guidelines.
  • No. A maintainer will apply the no-changelog label to this PR.

References

Notes

  • Please read our Vector contributor resources.
  • Do not hesitate to use @vectordotdev/vector to reach out to us regarding this PR.
  • Some CI checks run only after we manually approve them.
    • We recommend adding a pre-push hook, please see this template.
    • Alternatively, we recommend running the following locally before pushing to the remote branch:
      • make fmt
      • make check-clippy (if there are failures it's possible some of them can be fixed with make clippy-fix)
      • make test
  • After a review is requested, please avoid force pushes to help us review incrementally.
    • Feel free to push as many commits as you want. They will be squashed into one before merging.
    • For example, you can run git merge origin master and git push.
  • If this PR introduces changes to Vector dependencies (modifies Cargo.lock), please
    run make build-licenses to regenerate the license inventory and commit the changes (if any). More details here.

@thomasqueirozb
Contributor

Hi @lfrancke thanks for your contribution! Since this is something that alters Vector behavior it is considered a user facing change (I edited the PR description already). Could you please add a changelog? Thanks!

Also, your changes seem sound to me but I still need to review them more thoroughly. I will take a closer look soon.

@lfrancke
Author

Will do! Thanks.
I see that I left two of my debugging statements in the test as well. I'll remove those too.

@lfrancke
Author

I pushed the changelog and removed the debug statements. It's ready for review I believe.

NickLarsenNZ added a commit to stackabletech/docker-images that referenced this pull request Oct 30, 2025
NOTE: I removed async/await parts from the original patch as that comes after 0.49.0

```sh
pushd $(cargo patchable checkout vector 0.49.0)

git remote add lfrancke https://github.com/lfrancke/vector

git fetch lfrancke

git cherry-pick 3ce729073f23631dd7b5525be640b5fa15af0223
# If the cherry-pick stops on conflicts, resolve them and run: git cherry-pick --continue
git commit --amend

popd
cargo patchable export vector 0.49.0
```
github-merge-queue bot pushed a commit to stackabletech/docker-images that referenced this pull request Oct 30, 2025
* chore(vector): Init patchable

* chore(stackable-devel): Make a special variant for Vector so that a different rust toolchain can be selected

* chore(stackable-devel): Add note about moving the version to
boil-config.toml once renovate can check there (for consistency)

* chore(nix): Add rust and cargo dependencies

Otherwise cargo can't be found

```
error: the 'cargo' binary, normally provided by the 'cargo' component, is not applicable to the '1.89.0-x86_64-unknown-linux-gnu' toolchain
```

* chore(vector): Build from source (based on ubi9-rust-builder)

NOTE: The ubi9-rust-builder could not be used as it contains `ONBUILD`
steps which we need to run after patchable does its thing. Also it is
specifically designed for operators and their layout (under `rust/` and
using workspaces).

* chore(nix): Remove unused image-tools

* chore(issue_template/vector): Update instructions for version bumps

* fix(vector): Cherry pick unmerged patch from vectordotdev/vector#24028

NOTE: I removed async/await parts from the original patch as that comes after 0.49.0

```sh
pushd $(cargo patchable checkout vector 0.49.0)

git remote add lfrancke https://github.com/lfrancke/vector

git fetch lfrancke

git cherry-pick 3ce729073f23631dd7b5525be640b5fa15af0223
# If the cherry-pick stops on conflicts, resolve them and run: git cherry-pick --continue
git commit --amend

popd
cargo patchable export vector 0.49.0
```

* chore(vector): Add maintainer label

This seems to be added to other images, so I'm just copying that.

* chore: Update changelog

* Apply suggestions from code review

Co-authored-by: Techassi <[email protected]>

* chore(vector): Remove unused upload script

* chore(vector): Remove old comments, add new todo

---------

Co-authored-by: Techassi <[email protected]>
@lfrancke
Author

lfrancke commented Nov 3, 2025

@thomasqueirozb a quick ping. Considering that it corrupts data I hope a ping is fine here.

Contributor

@thomasqueirozb thomasqueirozb left a comment

Thanks for the contribution, this fix is very welcome! Sorry for the delay

Comment on lines 103 to 107
// Check if the end matches a prefix of the delimiter.
// We iterate from longest to shortest prefix and break on first match.
// Performance: For typical 2-byte delimiters (CRLF), this is 1 iteration.
// For longer delimiters, this runs O(delim_len) times but only occurs
// at buffer boundaries (~every 8KB), making the impact negligible.
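
The check described in the quoted comment can be sketched roughly as follows. This is an illustration under assumptions, not the PR's actual code: the function name `partial_delim_suffix_len` is hypothetical, and it only looks for proper prefixes, because a complete delimiter at the end of the buffer would already have been found by the normal search.

```rust
/// Returns how many trailing bytes of `buf` match a proper prefix of
/// `delim`, or 0 if the buffer does not end in a partial delimiter.
/// Candidates are tried from the longest prefix to the shortest, and
/// the first match wins.
fn partial_delim_suffix_len(buf: &[u8], delim: &[u8]) -> usize {
    // The longest candidate is one byte shorter than the delimiter,
    // and can never be longer than the buffer itself.
    let max = delim.len().saturating_sub(1).min(buf.len());
    for len in (1..=max).rev() {
        if buf.ends_with(&delim[..len]) {
            return len;
        }
    }
    0
}
```

With a two-byte delimiter such as CRLF there is only one candidate prefix (`\r`), so the loop body runs once. A single-byte delimiter has no proper prefix, so the loop does not run at all.
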
Contributor

If I'm reading this correctly this means that this will only run more than once if the delimiter is >8kb or if we fail to fetch the whole delimiter this iteration (maybe due to some io issue). Is this correct?

Author

Instead of answering here I tried to add more comments: 0254302 (#24028)

I had to reread my code as well and it's only been a month. I hope these new explanations help.

Author

One thing I'm not sure about, though I'd be happy to leave it out of scope for now: what happens if someone uses a delimiter that is itself larger than 8kb? I didn't think that through.

/// 1. Creates test data with delimiters positioned to split at buffer boundaries
/// 2. Tests multiple iterations to ensure state tracking works correctly
/// 3. Verifies all lines are correctly separated without merging
async fn test_delimiter_boundary_split_helper(delimiter: &[u8], num_lines: usize) {
Contributor

Does this work if we add a #[cfg(test)] so that this isn't compiled when outside of tests?

Author

Absolutely! Thanks.

Addressed in 6fc45e6 (#24028)
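
The first step in the helper's doc comment, building data whose delimiters straddle buffer boundaries, could look something like the sketch below. It is not the helper from this PR: `build_boundary_data` and its `buf_size` parameter are hypothetical, and it assumes the delimiter contains no lowercase ASCII letters, since those are used as line filler.

```rust
/// Builds `num_lines` delimiter-terminated lines in which every
/// delimiter starts on the last byte of a `buf_size`-sized block, so
/// any delimiter longer than one byte is split across two reads.
/// Assumes `buf_size > delim.len()` and that `delim` contains no
/// lowercase ASCII letters (they are used as filler).
fn build_boundary_data(delim: &[u8], num_lines: usize, buf_size: usize) -> Vec<u8> {
    assert!(buf_size > delim.len(), "buffer must be larger than the delimiter");
    let mut data = Vec::new();
    for i in 0..num_lines {
        // Offset of the last byte of the block we are currently in.
        let target = (data.len() / buf_size + 1) * buf_size - 1;
        // Pad the line with a per-line filler byte up to that offset.
        let fill = b'a' + (i % 26) as u8;
        data.resize(target, fill);
        // The delimiter's first byte lands on the block's last byte.
        data.extend_from_slice(delim);
    }
    data
}
```

Reading the result in `buf_size`-byte chunks then leaves the first delimiter byte at the end of one chunk and the remaining bytes at the start of the next.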

Comment on lines +416 to +441
#[tokio::test]
async fn test_single_byte_delimiter_boundary() {
// Test single-byte delimiter (should work without any special handling)
test_delimiter_boundary_split_helper(b"\n", 5).await;
}

#[tokio::test]
async fn test_two_byte_delimiter_boundary() {
// Test two-byte delimiter (CRLF case)
test_delimiter_boundary_split_helper(b"\r\n", 5).await;
}

#[tokio::test]
async fn test_three_byte_delimiter_boundary() {
test_delimiter_boundary_split_helper(b"|||", 5).await;
}

#[tokio::test]
async fn test_four_byte_delimiter_boundary() {
test_delimiter_boundary_split_helper(b"<|>|", 5).await;
}

#[tokio::test]
async fn test_five_byte_delimiter_boundary() {
test_delimiter_boundary_split_helper(b"<<>>>", 5).await;
}
Contributor

These tests are nice. Are we able to build something similar by creating a test for the file source? I guess it shouldn't be that hard given the existing repro repo. If we create a file with the same data as in the repo we should be able to assert the data is correct (and also that the test should fail on master today).

Author

I added something based on the existing test that was already there. I hope that's what you meant. It passes for me with my patch and fails for me without.

@thomasqueirozb thomasqueirozb added the meta: awaiting author Pull requests that are awaiting their author. label Nov 17, 2025
@github-actions github-actions bot removed the meta: awaiting author Pull requests that are awaiting their author. label Nov 20, 2025
@github-actions github-actions bot added the domain: sources Anything related to the Vector's sources label Nov 20, 2025

Labels

domain: sources Anything related to the Vector's sources

Development

Successfully merging this pull request may close these issues.

Multi-byte line delimiters split across buffer boundaries cause log event merging

2 participants