Catharsis68
Contributor

Add New Metrics for Call Outcomes and ACK Timeouts

Overview

This PR introduces several new metric counters to improve observability around SIP call processing. The added metrics track:

  • Failed Calls: drachtio_calls_failed
  • Successful Calls: drachtio_calls_successful
  • ACK Timeouts: drachtio_ack_timeout

These metrics help us monitor and diagnose issues with SIP transactions, including cases where the final ACK is not received after a 200 OK response.

Changes

1. New Metric Definitions in src/drachtio.h

  • Added three new string constants:
    • STATS_COUNTER_FAILED_CALLS – tracks the number of calls that failed.
    • STATS_COUNTER_SUCCESSFUL_CALLS – tracks the number of successfully completed calls.
    • STATS_COUNTER_ACK_TIMEOUT – tracks instances of ACK timeouts.

These constants are passed to our existing stats macros (for example, STATS_COUNTER_INCREMENT) throughout the codebase.
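
As a rough illustration of how the three constants might be declared and used, here is a minimal, self-contained sketch. The metric name strings come from the overview above; the `const std::string` declaration form, the `CounterRegistry` class, and the `recordCallOutcome` helper are assumptions made for this example only and stand in for the real stats macros, whose exact signatures are not shown in this PR description.

```cpp
#include <cstdint>
#include <map>
#include <string>

// Metric names as listed in the overview of this PR.
// The declaration form (const std::string) is an assumption.
const std::string STATS_COUNTER_FAILED_CALLS = "drachtio_calls_failed";
const std::string STATS_COUNTER_SUCCESSFUL_CALLS = "drachtio_calls_successful";
const std::string STATS_COUNTER_ACK_TIMEOUT = "drachtio_ack_timeout";

// Hypothetical in-memory stand-in for the metrics backend that the
// stats macros would normally forward to.
class CounterRegistry {
public:
  void increment(const std::string& name, std::uint64_t by = 1) {
    counters_[name] += by;
  }

  // Returns 0 for a counter that has never been incremented.
  std::uint64_t value(const std::string& name) const {
    auto it = counters_.find(name);
    return it == counters_.end() ? 0 : it->second;
  }

private:
  std::map<std::string, std::uint64_t> counters_;
};

// Hypothetical helper: records the outcome of one finished call.
inline void recordCallOutcome(CounterRegistry& registry, bool succeeded) {
  registry.increment(succeeded ? STATS_COUNTER_SUCCESSFUL_CALLS
                               : STATS_COUNTER_FAILED_CALLS);
}
```

In the actual code the increment would go through the stats macros rather than a registry object; the sketch only shows that each call outcome maps to exactly one of the two counters.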

2. Metrics Increment for ACK Timeouts in src/sip-dialog-controller.cpp

  • In the SIP dialog controller, when an ACK timeout is detected (for example, during retransmission of a 200 OK response without receiving an ACK), the STATS_COUNTER_ACK_TIMEOUT metric is incremented.
  • This helps in flagging potential network issues or application delays affecting SIP transactions.
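
For context, the timeout condition can be sketched using the retransmission rule from RFC 3261 (section 13.3.1.4): a 2xx response to an INVITE is retransmitted at an interval that starts at T1 and doubles up to T2, and the UAS gives up if no ACK has arrived after 64*T1. The sketch below models only that rule. `retransmitOffsets` and `AckTimeoutTracker` are hypothetical names, the plain integer field stands in for STATS_COUNTER_ACK_TIMEOUT, and the exact place where the dialog controller performs the increment is not shown here.

```cpp
#include <algorithm>
#include <chrono>
#include <cstdint>
#include <vector>

using Millis = std::chrono::milliseconds;

// RFC 3261 default timer values.
constexpr Millis kT1{500};
constexpr Millis kT2{4000};

// Offsets (measured from the first 200 OK) at which the response is
// retransmitted while no ACK has arrived. The interval starts at T1,
// doubles after each retransmission, and is capped at T2. Nothing is
// scheduled at or beyond the 64*T1 deadline.
inline std::vector<Millis> retransmitOffsets(Millis t1 = kT1, Millis t2 = kT2) {
  std::vector<Millis> offsets;
  const Millis deadline = 64 * t1;
  Millis interval = t1;
  for (Millis at = interval; at < deadline; at += interval) {
    offsets.push_back(at);
    interval = std::min(interval * 2, t2);
  }
  return offsets;
}

// Hypothetical tracker; ackTimeouts stands in for STATS_COUNTER_ACK_TIMEOUT.
struct AckTimeoutTracker {
  std::uint64_t ackTimeouts = 0;

  // Returns true (and counts one timeout) when no ACK was received
  // within 64*T1 of sending the 200 OK.
  bool onTimerExpired(Millis elapsedSince200, bool ackReceived, Millis t1 = kT1) {
    if (ackReceived || elapsedSince200 < 64 * t1) {
      return false;
    }
    ++ackTimeouts;
    return true;
  }
};
```

With the default timers the deadline is 32 seconds, so a spike in this counter lags the underlying problem by roughly that long.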

3. Monitoring and Diagnostics Enhancements

  • Call Outcome Metrics: The new counters provide a clear view of successful versus failed calls, enabling more detailed performance analysis.
  • ACK Timeout Visibility: Real-time data on ACK timeouts supports proactive troubleshooting by surfacing missing ACKs as they occur.
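
One way to use the two call outcome counters together is to derive a failure ratio over a time window. Because counters only ever increase, the ratio is computed from the difference between two samples rather than from the raw totals. The sketch below assumes C++17; `CallCounters` and `failureRatio` are hypothetical names, and in practice this calculation would more likely live in the monitoring system than in the server itself.

```cpp
#include <cstdint>
#include <optional>

// One sample of the two call outcome counters.
struct CallCounters {
  std::uint64_t successful;  // drachtio_calls_successful
  std::uint64_t failed;      // drachtio_calls_failed
};

// Fraction of calls that failed between two samples. Returns nullopt
// when a counter went backwards (for example, after a process restart)
// or when no calls finished in the window.
inline std::optional<double> failureRatio(const CallCounters& earlier,
                                          const CallCounters& later) {
  if (later.successful < earlier.successful || later.failed < earlier.failed) {
    return std::nullopt;
  }
  const std::uint64_t ok = later.successful - earlier.successful;
  const std::uint64_t bad = later.failed - earlier.failed;
  const std::uint64_t total = ok + bad;
  if (total == 0) {
    return std::nullopt;
  }
  return static_cast<double>(bad) / static_cast<double>(total);
}
```

For example, samples of {100, 10} and then {175, 35} cover 100 finished calls, 25 of which failed, giving a ratio of 0.25.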

Testing and Verification

  • Local Testing: Confirmed that the new constants are correctly defined and that the ACK timeout counter is incremented when a timeout occurs.
  • Integration Testing: Observed the updated metrics in our monitoring dashboard during stress tests and verified that the new data points (including any added contextual metadata) also appear correctly in logs.
  • Logging: Enhanced logging in the relevant sections confirms that the metrics calls execute as expected during timeout scenarios.

Merging this PR improves visibility into SIP call outcomes and makes ACK timeouts and call failures easier to pinpoint.
