
Conversation

Member

@philipphofmann philipphofmann commented Sep 24, 2025

This RFC aims to develop a strategy for SDKs to prevent the loss of logs when an application terminates abnormally, such as a crash or watchdog termination.

Rendered RFC

@philipphofmann philipphofmann marked this pull request as ready for review September 25, 2025 06:23
@philipphofmann philipphofmann requested a review from a team September 25, 2025 12:31
Member

@AbhiPrasad AbhiPrasad left a comment

I think it would be good to add a note about recording client outcomes for this, for example if the FIFO queue is full.

Everything else lgtm!

1) if the application crashes between storing the envelope and deleting a `batch-processor-cache-file` or `log-crash-recover-file`.
2) if a crash occurs between step 2 and before step 3 of the FIFO queue process, where logs might exist in both the `batch-processor-cache-file` and the FIFO queue.

In both cases, the SDK will send duplicate logs on the next launch. While this isn't acceptable long-term, we accept it for now because solving this correctly is complicated and we don't currently handle this edge case for other telemetry data such as crashes. If duplicate logs become problematic, we can implement database-style atomic operations using marker files or something similar to prevent duplication.
Member

Can we solve this duplicate issue by attaching item IDs to logs? If so, I'd be open to doing that; the protocol can easily be expanded to allow SDKs to set this.

Member Author

Yes, we can. This would be pretty easy for 2), but for 1) we would somehow need to hold back sending envelopes until we run this check, which could be a bit tricky but doable. Do you think it's worth the effort to do this already in the first iteration?

Member

Do you think it's worth the effort to already do this in the first iteration?

Yes, because duplicate log data both has a billing impact (users pay for two logs instead of one) and risks reducing trust in the product if users are onboarding onto Sentry for the first time.

Member Author

OK, let's do it, even though it's some extra logic I only wanted to add if needed. Will update the RFC.
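
To make the item-ID approach concrete, here is a minimal sketch of what dedup on the next launch could look like. All names (`LogItem`, `LogDeduplicator`) are hypothetical and not part of the Sentry SDK or protocol; the point is only that logs recovered from the `batch-processor-cache-file` or `log-crash-recover-file` get filtered against the IDs of logs already handed to the transport.

```swift
import Foundation

// Hypothetical log item carrying a unique ID, as discussed above.
struct LogItem: Codable {
    let id: UUID
    let body: String
    let timestamp: Date
}

// Minimal dedup sketch: before re-sending logs recovered from disk,
// drop every item whose ID was already part of a sent envelope.
final class LogDeduplicator {
    private var sentIDs = Set<UUID>()

    // Called when an envelope containing these logs was handed to the transport.
    func markSent(_ items: [LogItem]) {
        sentIDs.formUnion(items.map(\.id))
    }

    // Called on the next launch for logs recovered from disk.
    func filterRecovered(_ items: [LogItem]) -> [LogItem] {
        items.filter { !sentIDs.contains($0.id) }
    }
}
```

As noted above for case 1), this only works if sending stored envelopes is held back until the recovered logs have been filtered.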


When the BatchProcessor receives a log, it performs the following steps:

1. Put the log into the FIFO queue on the calling thread.
Member

Do we have recommendations on queue size? Should we adjust this based on backpressure (like the amount of logs being recorded on the calling threads)?

Member Author

I just picked 64 for now. As we're immediately writing these to disk, I think this should be sufficient.

Member

I think 64 will not be enough, depending on the time interval. During startup an app can easily log more than that in a short time span, e.g. due to bootstrapping services.
But 64 is a relative value, so we should clarify what it is relative to.

Member Author

@philipphofmann philipphofmann Sep 30, 2025

For the FIFO queue we don't have a time interval. The SDK has to write the log to disk immediately after adding it to the queue, so there is no time interval for storing to disk. I updated the wording. Do you still believe 64 isn't enough in that case, @philprime and @AbhiPrasad?

Member Author

@AbhiPrasad, please have another look now.

If SDKs are using a flush interval of 5 seconds, that gives us about 12-13 logs per second over a span of 5 seconds.
I'd say it's a nice number for small/medium projects, but it would be nice if users could configure it.

Member Author

@philipphofmann philipphofmann Oct 2, 2025

@lucas-zimerman, yes, we can make this configurable, but I would only do so if required. As already pointed out, this number is for the in-memory FIFO queue. When users add logs, the SDKs first put them into the in-memory FIFO queue and then immediately store them to disk asynchronously. So the in-memory FIFO queue will only overflow if the bg thread can't keep up storing the logs to disk and then flushing them out. So even if a user adds 100 or even 1000 logs per second, SDKs should be able to keep up.
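
To illustrate the reasoning above, a minimal sketch of the bounded in-memory FIFO queue with a capacity of 64, including a hypothetical hook for the client-outcome note from earlier in the thread. The types and the `recordLostLog` callback are illustrative, not the actual SDK API.

```swift
import Foundation

// Sketch of the bounded in-memory FIFO queue described above (capacity 64).
// `recordLostLog` is a hypothetical hook for recording a client outcome
// when a log is dropped because the queue is full.
final class LogFIFOQueue {
    private let capacity = 64
    private var buffer: [String] = []   // serialized logs
    private let lock = NSLock()

    /// Returns false and records a lost log when the queue is full, which
    /// only happens if the background thread can't keep up writing logs
    /// to the batch-processor-cache-file.
    func enqueue(_ serializedLog: String, recordLostLog: () -> Void) -> Bool {
        lock.lock(); defer { lock.unlock() }
        guard buffer.count < capacity else {
            recordLostLog()
            return false
        }
        buffer.append(serializedLog)
        return true
    }

    /// Called by the background thread after the log was persisted to disk.
    func dequeue() -> String? {
        lock.lock(); defer { lock.unlock() }
        return buffer.isEmpty ? nil : buffer.removeFirst()
    }
}
```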

Member

@philprime philprime left a comment

Solid RFC, no other feedback than the already open discussions.

Member

@markushi markushi left a comment

Left a few minor comments, but this is already in very good shape!


@antonis antonis left a comment

The suggested approach looks solid to me! Thank you for driving this @philipphofmann 🙇

Also looping in @lucas-zimerman since he has more context on the RN Logs implementation and for awareness.


The BatchProcessor maintains its logic of batching multiple logs together into a single envelope to avoid multiple HTTP requests.

Hybrid SDKs pass every log down to the native SDKs, which will put every log in their BatchProcessor and its cache.

With that said, will the batch processor only serve as a queue for sending logs, or will it also invoke integrations for parsing the logs, like beforelog on the native side?

Are there hybrid SDKs using the batch processor at the moment?

Member Author

@philipphofmann philipphofmann Oct 2, 2025

batch processor only serve as a queue for sending logs

Only for sending logs. Logs go into the BatchProcessor after going through all integrations, beforeSend, etc. I updated the wording. Is it clear now, @lucas-zimerman?

@lucas-zimerman

Left a few minor comments, but this is looking good!


1. Put the log into the FIFO queue on the calling thread.
2. On a background thread, serialize the next log of the FIFO queue and store it in the `batch-processor-cache-file`.
3. Remove the log from the FIFO queue.
Member

I guess I'm a bit confused here - why do we want to remove the log from the queue? Or is this queue a different one from what we already have in the BatchProcessor?

My understanding was that, in addition to what we already have (the existing queue), we would just spin up a task to store the same log in a file. If the app crashes and the in-memory logs are lost, we'd use the file on the next launch to send the leftovers.

Member Author

When a crash occurs, the SDKs write the logs in the FIFO queue to the log-crash-recover-file and send these logs on the next SDK launch.

@romtsn, the FIFO queue exists to prevent log loss when a crash occurs immediately after logging.

logger.trace("Starting database connection")

// The above log must show up in Sentry
SentrySDK.crash()

If we don't use async-safe memory and only store the logs to disk on a BG thread, we can't guarantee that on Cocoa. Can you on Java?
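
For context, a deliberately simplified sketch of the crash-time path described here. A real implementation has to stay async-signal-safe: it would keep the queued logs as pre-serialized bytes in pre-allocated memory, open the file descriptor for the `log-crash-recover-file` at SDK init, and only call `write(2)` inside the handler. The Swift collections below are just for readability, and all names are illustrative, not the actual SDK.

```swift
import Darwin
import Foundation

// Sketch of the crash-time path: on crash, dump whatever is still in the
// in-memory FIFO queue to the log-crash-recover-file so the next launch
// can send it.
enum LogCrashRecovery {
    static var recoverFileDescriptor: Int32 = -1     // opened at SDK init
    static var pendingSerializedLogs: [[UInt8]] = [] // filled by the FIFO queue

    // Called from the crash/watchdog handler; only uses write(2).
    static func writePendingLogs() {
        guard recoverFileDescriptor >= 0 else { return }
        for bytes in pendingSerializedLogs {
            bytes.withUnsafeBufferPointer { buf in
                _ = write(recoverFileDescriptor, buf.baseAddress, buf.count)
            }
        }
    }
}
```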

Member

@romtsn romtsn Oct 3, 2025

As discussed in DMs, let's figure out the details later when you move this to the develop docs. My concern was that we're doing double the work here: removing log entries from the in-memory queue, but then also loading them from disk into memory again (step 4 below), which entails more I/O work.

So why not keep them in memory, use the in-memory queue as the main source, and only use the disk cache for exceptional cases (like crashes/watchdog terminations)?

Member Author

Yep, let's figure this out once we add this to the develop docs. Thanks for bringing it up @romtsn.
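
For when this moves to the develop docs, a minimal sketch of the alternative described above: the in-memory queue stays the source of truth for normal flushing, and the cache file is append-only on the happy path and only read back after an abnormal exit. Everything here (names, file handling) is illustrative, not a decided design.

```swift
import Foundation

// Sketch of the alternative above: the in-memory buffer is the main source;
// the cache file is a write-only crash backup during normal operation.
final class LogBuffer {
    private var inMemory: [Data] = []   // serialized logs
    private let cacheFileURL: URL       // batch-processor-cache-file
    private let queue = DispatchQueue(label: "log-buffer")

    init(cacheFileURL: URL) {
        self.cacheFileURL = cacheFileURL
    }

    func add(_ serializedLog: Data) {
        queue.async {
            self.inMemory.append(serializedLog)
            // Append to disk purely as a crash backup; the normal flush
            // path never reads this file back.
            self.append(serializedLog, to: self.cacheFileURL)
        }
    }

    /// Normal flush: build the envelope from memory, then reset the backup.
    func flush(send: @escaping ([Data]) -> Void) {
        queue.async {
            send(self.inMemory)
            self.inMemory.removeAll()
            try? Data().write(to: self.cacheFileURL)   // truncate the backup
        }
    }

    private func append(_ data: Data, to url: URL) {
        guard let handle = try? FileHandle(forWritingTo: url) else {
            try? data.write(to: url)   // file doesn't exist yet
            return
        }
        defer { handle.closeFile() }
        _ = handle.seekToEndOfFile()
        handle.write(data + Data("\n".utf8))
    }
}
```

This would avoid the double I/O mentioned above at the cost of keeping all unsent logs in memory until the next flush.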

Member

@AbhiPrasad AbhiPrasad left a comment

lgtm!

philipphofmann and others added 2 commits October 3, 2025 06:10
Member

@romtsn romtsn left a comment

One point for later, but LGTM otherwise!

@philipphofmann philipphofmann merged commit 71ce99e into main Oct 3, 2025
6 checks passed
@philipphofmann philipphofmann deleted the rfc/logs-for-crashes branch October 3, 2025 13:42