Skip to content
This repository was archived by the owner on Apr 2, 2024. It is now read-only.

Conversation

jgpruitt
Copy link
Contributor

@jgpruitt jgpruitt commented Jan 31, 2022

Description

I have seen instances in which promscale attempts to insert spans that have an end time < start time. There is a database check constraint that catches this error, but it results in dropped spans. This patch catches the issue in Go and swaps the start and end times so that the spans will be inserted successfully.

Issue

Merge requirements

Please take into account the following non-code changes that you may need to make with your PR:

  • CHANGELOG entry for user-facing changes
  • Updated the relevant documentation

@jgpruitt jgpruitt self-assigned this Jan 31, 2022
@jgpruitt jgpruitt added the Bug Something isn't working label Jan 31, 2022
@jgpruitt jgpruitt linked an issue Jan 31, 2022 that may be closed by this pull request
@jgpruitt jgpruitt force-pushed the jp/span-reversed-times branch from 8381cf3 to 5969dfb Compare January 31, 2022 17:15
@jgpruitt jgpruitt requested a review from a team as a code owner January 31, 2022 17:15
@jgpruitt jgpruitt force-pushed the jp/span-reversed-times branch from 5969dfb to 129e939 Compare January 31, 2022 17:20
@JamesGuthrie
Copy link
Member

@jgpruitt do we know why we're receiving spans with end < start? I wonder if it's safe to just swap start and end, it could end up hiding something weird/buggy in the user's tracing implementation.

@jgpruitt
Copy link
Contributor Author

jgpruitt commented Feb 1, 2022

@JamesGuthrie I don't know why yet. I want to try to track it down later today.

I think these are our possible solutions:

  1. Let the database catch end < start which will rollback the transaction and drop a whole batch of spans on the floor. This is what we are doing now. I don't think we want to do this.
  2. Remove the constraint and allow the bad data to be inserted. Then, all your queries/dashboards may need logic to deal with this bad data.
  3. Discard the span with end < start. This would break the tree structure of the trace though.
  4. Attempt to "fix" the data. Either swap start and end, or set end to start, or set start to end.

IMO, 4 is the best approach, but they're all icky. WDYT?

@jgpruitt
Copy link
Contributor Author

jgpruitt commented Feb 1, 2022

@JamesGuthrie I've spent a couple of hours trying to track down the source of the end < start spans. I haven't found the source of the errors yet. I bypassed the opentelemetry-collector by sending my traces directly to promscale, and the errors kept showing up. So, I don't think it is the opentelemetry-collector. I removed the constraint so the spans would get inserted, but nothing about the spans lent any clues.

I *think I'm using the Python tracing libraries correctly. I referenced multiple tutorials.

Copy link
Member

@harkishen harkishen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this can be an issue upstream. Based on the previous comment, it seems that this happened due to python lib, so I think we should open an issue there and report this.

I am not confident in making our own fixes as this in other words is hiding the cause, which is basically data corruption as the user will never expect this. Hence, I don't think we should have this swapping fix, rather drop the entire span batch. We should be strict in ingesting data to avoid problems down the road.

@jgpruitt jgpruitt requested review from harkishen and cevian February 3, 2022 14:46
@jgpruitt jgpruitt force-pushed the jp/span-reversed-times branch from a69ca6b to b6602af Compare February 4, 2022 16:31
@niksajakovljevic
Copy link
Contributor

I agree with @Harkishen-Singh. I think we should try to get to the bottom of this. Does it only happen with Python libs (seems like race condition? ). start > end seems like a major data corruption. Imho swapping it makes it harder to reason about semantics of start and end and we might open the door for similar bugs to go undetected. I think it's great that you caught this @jgpruitt and that we have DB constraint in place!

@jgpruitt
Copy link
Contributor Author

jgpruitt commented Feb 8, 2022

I created another issue to track my efforts at identifying a source. #1127

IMO, having this "fix" in place is of benefit in case of future bugs

Copy link
Contributor

@cevian cevian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this solution as it follows the robustness principle.

Adds code to detect when a span's end time is less than its start time. This would be caught by a database constraint and prevent the insertion of the span. The new code will swap the start and end times to allow the span's insertion.
@jgpruitt jgpruitt force-pushed the jp/span-reversed-times branch from b6602af to a404386 Compare February 9, 2022 17:05
@jgpruitt jgpruitt dismissed harkishen’s stale review February 9, 2022 19:00

As a team, we decided to implement the fix as-is and follow up at a later date to track down the issue. The most important thing ATM is dealing with data issues.

@jgpruitt jgpruitt merged commit 32c9aa1 into master Feb 9, 2022
@jgpruitt jgpruitt deleted the jp/span-reversed-times branch February 9, 2022 19:00
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Trying to insert spans with end_time < start_time
6 participants