rfc(feature): Logs for Crashes #148
Conversation
AbhiPrasad
left a comment
I think it would be good to add a note about recording client outcomes for this, for example if the FIFO queue is full.
Everything else lgtm!
text/0148-logs-for-crashes.md
Outdated
1) if the application crashes between storing the envelope and deleting a `batch-processor-cache-file` or `log-crash-recover-file`.
2) if a crash occurs between step 2 and before step 3 of the FIFO queue process, where logs might exist in both the `batch-processor-cache-file` and the FIFO queue.

In both cases, the SDK will send duplicate logs on the next launch. While this isn't acceptable long-term, we accept it for now because solving this correctly is complicated and we don't currently handle this edge case for other telemetry data such as crashes. If duplicate logs become problematic, we can implement database-style atomic operations using marker files or something similar to prevent duplication.
Can we solve this duplicate issue by attaching item ids to logs? If so, I'd be open to doing that; the protocol can easily be expanded to allow SDKs to set this.
Yes, we can. This would be pretty easy for 2), but for 1) we would somehow need to hold back sending envelopes until we run this check, which could be a bit tricky but doable. Do you think it's worth the effort to already do this in the first iteration?
Do you think it's worth the effort to already do this in the first iteration?
Yes, because duplicate log data has both a billing impact (users pay for two logs instead of one) and the potential to reduce trust in the product if users are onboarding onto Sentry for the first time.
OK, let's do it, but it's some extra logic. I only want to add it if it's needed. Will update the RFC.
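As a rough illustration of the direction agreed on above, deduplication by item id could look roughly like this. This is a minimal sketch with hypothetical names and types, not the actual SDK API or envelope item format:

```swift
import Foundation

// Hypothetical log item carrying a unique id so the same log can be recognized
// if it ends up both in a stored envelope and in a recovery file on disk.
struct RecoverableLog: Codable {
    let itemId: UUID
    let body: String
    let timestamp: Date
}

// Before re-sending recovered logs on the next launch, drop every item whose id
// is already contained in an envelope that is still waiting to be sent.
func logsToResend(recovered: [RecoverableLog], idsInPendingEnvelopes: Set<UUID>) -> [RecoverableLog] {
    recovered.filter { !idsInPendingEnvelopes.contains($0.itemId) }
}
```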
When the BatchProcessor receives a log, it performs the following steps:

1. Put the log into the FIFO queue on the calling thread.
Do we have recommendations on queue size? Should we adjust this based on backpressure (like the amount of logs being recorded on the calling threads)?
I just picked 64 for now. As we're immediately writing these to disk, I think this should be sufficient.
I think 64 will not be enough, depending on the time interval. During startup an app can easily log more than that in a short time span due to e.g. bootstrapping services.
But 64 is a relative value, so we should clarify what it relates to.
For the FIFO queue we don't have a time interval. The SDK has to write the log immediately to disk after adding it to the queue. There is no time interval for storing to disk. I updated the wording. Do you still believe 64 isn't enough in that case, @philprime and @AbhiPrasad?
@AbhiPrasad, please have another look now.
If SDKs use a flush interval of 5 seconds, a queue size of 64 gives us roughly 12-13 logs per second over that 5-second span (64 / 5 ≈ 12.8).
I'd say it's a reasonable number for small/medium projects, but it would be nice if users could configure it.
@lucas-zimerman, yes, we can make this configurable, but I would only do so if required. As already pointed out, this number is for the in-memory FIFO queue. When users add logs, the SDKs first put them into the in-memory FIFO queue and then immediately store them to disk asynchronously. So the in-memory FIFO queue will only overflow if the background thread can't keep up storing the logs to disk and then flushing them out. So, even if a user adds 100 or even 1000 logs per second, SDKs should be able to keep up.
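To make this concrete, a minimal sketch of such a bounded in-memory FIFO queue could look like the following. The names are hypothetical, the capacity of 64 comes from this discussion, and the overflow branch is where a client outcome for a dropped log could be recorded, as suggested earlier:

```swift
import Foundation

// Sketch of the in-memory FIFO queue drained to disk by a background queue.
final class LogFifoQueue {
    private let capacity = 64
    private var logs: [String] = []
    private let lock = NSLock()
    private let diskQueue = DispatchQueue(label: "log-fifo-drain")
    private let writeToCacheFile: (String) -> Void

    init(writeToCacheFile: @escaping (String) -> Void) {
        self.writeToCacheFile = writeToCacheFile
    }

    /// Called on the thread that records the log. Returns false if the queue is
    /// full, i.e. the background thread can't keep up; that is the place where a
    /// client outcome for the dropped log could be recorded.
    @discardableResult
    func enqueue(_ log: String) -> Bool {
        lock.lock()
        guard logs.count < capacity else {
            lock.unlock()
            return false
        }
        logs.append(log)
        lock.unlock()

        // Immediately hand the log to a background queue that serializes it into
        // the batch-processor-cache-file and only then removes it from the queue.
        diskQueue.async { [weak self] in
            guard let self = self else { return }
            self.writeToCacheFile(log)
            self.lock.lock()
            if let index = self.logs.firstIndex(of: log) {
                self.logs.remove(at: index)
            }
            self.lock.unlock()
        }
        return true
    }
}
```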
philprime
left a comment
Solid RFC, no feedback other than the already open discussions.
markushi
left a comment
Left a few minor comments, but this is already in very good shape!
antonis
left a comment
The suggested approach looks solid to me! Thank you for driving this @philipphofmann 🙇
Also looping in @lucas-zimerman since he has more context on the RN Logs implementation and for awareness.
text/0148-logs-for-crashes.md
Outdated
The BatchProcessor maintains its logic of batching multiple logs together into a single envelope to avoid multiple HTTP requests.

Hybrid SDKs pass every log down to the native SDKs, which will put every log in their BatchProcessor and its cache.
With that said, will the batch processor only serve as a queue for sending logs, or will it also invoke integrations for processing the logs, like beforelog on the native side?
Are there hybrid SDKs using the batch processor at the moment?
batch processor only be served as a queue for sending logs
Only for sending logs. Logs go into the BatchProcessor after going through all integrations, beforeSend, etc. I updated the wording. Is it clear now, @lucas-zimerman?
Left a few minor comments, but this is looking good!

1. Put the log into the FIFO queue on the calling thread.
2. On a background thread, serialize the next log of the FIFO queue and store it in the `batch-processor-cache-file`.
3. Remove the log from the FIFO queue.
I guess I'm a bit confused here: why do we want to remove the log from the queue? Or is this queue a different one from what we already have in the BatchProcessor?
My understanding was that, in addition to what we already have (the existing queue), we would just spin up a task to store the same log in a file. If it crashes and the in-memory logs are lost, we'd use the file on the next launch to send the leftovers.
When a crash occurs, the SDKs write the logs in the FIFO queue to the `log-crash-recover-file` and send these logs on the next SDK launch.

@romtsn, the FIFO queue exists to prevent log loss when a crash occurs immediately after logging.

```swift
logger.trace("Starting database connection")
// The above log must show up in Sentry
SentrySDK.crash()
```

If we don't use the async-safe memory and only store it to disk on a BG thread, we can't guarantee that on Cocoa. Can you on Java?
As discussed in DMs, let's figure out the details later when you move this to the develop docs. My concern was that we're doing double the work here: removing log entries from the in-memory queue, but then also loading them from disk into memory again (step 4 below), which entails more I/O work.
So why not keep them in memory, use the in-memory queue as the main source, and only use the disk cache for exceptional cases (like crashes/watchdog terminations)?
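For comparison, the alternative described in this comment could look roughly like the sketch below. All names are hypothetical; the in-memory queue stays the source of truth for flushing, and the recovery file is only read back after an abnormal exit:

```swift
import Foundation

// Sketch of the alternative: keep logs in memory as the main source for flushing,
// and mirror them to a crash-recovery file that is only consumed after an
// abnormal termination.
final class InMemoryFirstBatcher {
    private var pending: [String] = []
    private let recoveryFileURL: URL

    init(recoveryFileURL: URL) {
        self.recoveryFileURL = recoveryFileURL
    }

    func add(_ log: String) {
        pending.append(log)
        // Mirror the current queue to disk so it survives a crash. In a real SDK
        // this would happen on a background queue and append instead of rewriting.
        try? pending.joined(separator: "\n").write(to: recoveryFileURL, atomically: true, encoding: .utf8)
    }

    /// Normal path: flush from memory and discard the recovery file.
    func flush(send: ([String]) -> Void) {
        send(pending)
        pending.removeAll()
        try? FileManager.default.removeItem(at: recoveryFileURL)
    }

    /// Next launch after a crash: send whatever the recovery file still contains.
    func recoverAfterCrash(send: ([String]) -> Void) {
        guard let contents = try? String(contentsOf: recoveryFileURL, encoding: .utf8) else { return }
        send(contents.split(separator: "\n").map(String.init))
        try? FileManager.default.removeItem(at: recoveryFileURL)
    }
}
```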
Yep, let's figure this out once we add this to the develop docs. Thanks for bringing it up @romtsn.
AbhiPrasad
left a comment
lgtm!
Co-authored-by: Abhijeet Prasad <[email protected]>
Co-authored-by: Roman Zavarnitsyn <[email protected]>
romtsn
left a comment
One point for later, but LGTM otherwise!
This RFC aims to develop a strategy for SDKs to prevent the loss of logs when an application terminates abnormally, such as a crash or watchdog termination.
Rendered RFC