Description
Component(s)
exporter/opensearch
What happened?
## Description
Data is lost when the search engine becomes unavailable and is later brought back, even though the queue and retry mechanisms are configured.
## Test Strategy
- Initially, scaled the data pod replicas down to 0
- Produced 1 log per second for 10 seconds
- Brought the data pods back up after 10 seconds
## Observations
- The telemetry collector received 10 log records and exported them
- Only 9 of the 10 log records were stored once the backend became available again
Expected Result
All 10 log records are stored in the search engine
Actual Result
The search engine document count is 9 (expected 10)
Collector version
v0.121.0
OpenTelemetry Collector configuration
opensearch:
  sending_queue:
    enabled: true
    storage: file_storage/opensearch
    num_consumers: 2
    queue_size: 40000
  retry_on_failure:
    enabled: true
    initial_interval: 1s
    max_interval: 30s
    max_elapsed_time: 0
  http:
    endpoint: <SE host>:9000
extensions:
  file_storage/opensearch:
    directory: /tmp/otel/queue/opensearch
    create_directory: true
    timeout: 10s
processors:
  memory_limiter:
    check_interval: 5s
    limit_percentage: 90
    spike_limit_percentage: 15
  batch:
    send_batch_size: 50
    timeout: 2s
Log output
Additional context
POST /_bulk HTTP/1.1
Host: search-engine:9200
User-Agent: opensearch-go/2.3.0 (linux amd64; Go 1.23.3)
Content-Length: 1110
Content-Type: application/json
Accept-Encoding: gzip
{"create":{"_index":"ss4o_logs-default-namespace"}}
{"attributes":{"data_stream":{"dataset":"default","namespace":"namespace","type":"record"}},"body":"Hello! This is a Testing log","instrumentationScope":{"name":"test-app"},"observedTimestamp":"2025-03-19T12:22:44.663646777Z","resource":{"service.name":"test-app","telemetry.sdk.language":"go","telemetry.sdk.name":"opentelemetry","telemetry.sdk.version":"1.33.0"},"severity":{"number":5},"@timestamp":"2025-03-19T12:22:43.633743348Z"}
{"create":{"_index":"ss4o_logs-default-namespace"}}
{"attributes":{"data_stream":{"dataset":"default","namespace":"namespace","type":"record"}},"body":"Hello! This is a Testing log","instrumentationScope":{"name":"test-app"},"observedTimestamp":"2025-03-19T12:22:44.664584006Z","resource":{"service.name":"test-app","telemetry.sdk.language":"go","telemetry.sdk.name":"opentelemetry","telemetry.sdk.version":"1.33.0"},"severity":{"number":5},"@timestamp":"2025-03-19T12:22:44.648203511Z"}
HTTP/1.1 200 OK
content-type: application/json; charset=UTF-8
content-encoding: gzip
content-length: 580
{"took":59937,"errors":true,"items":[{"create":{"_index":"ss4o_logs-default-namespace","_id":"WaBbrpUBfWGueJ6HKMa7","status":503,"error":{"type":"unavailable_shards_exception","reason":"[ss4o_logs-default-namespace][1] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[ss4o_logs-default-namespace][1]] containing [index {[ss4o_logs-default-namespace][WaBbrpUBfWGueJ6HKMa7], source[{"attributes":{"data_stream":{"dataset":"default","namespace":"namespace","type":"record"}},"body":"Hello! This is a Testing log","instrumentationScope":{"name":"test-app"},"observedTimestamp":"2025-03-19T12:22:44.663646777Z","resource":{"service.name":"test-app","telemetry.sdk.language":"go","telemetry.sdk.name":"opentelemetry","telemetry.sdk.version":"1.33.0"},"severity":{"number":5},"@timestamp":"2025-03-19T12:22:43.633743348Z"}]}]]"}}},{"create":{"_index":"ss4o_logs-default-namespace","_id":"WqBbrpUBfWGueJ6HKMa7","status":503,"error":{"type":"unavailable_shards_exception","reason":"[ss4o_logs-default-namespace][4] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[ss4o_logs-default-namespace][4]] containing [index {[ss4o_logs-default-namespace][WqBbrpUBfWGueJ6HKMa7], source[{"attributes":{"data_stream":{"dataset":"default","namespace":"namespace","type":"record"}},"body":"Hello! This is a Testing log","instrumentationScope":{"name":"test-app"},"observedTimestamp":"2025-03-19T12:22:44.664584006Z","resource":{"service.name":"test-app","telemetry.sdk.language":"go","telemetry.sdk.name":"opentelemetry","telemetry.sdk.version":"1.33.0"},"severity":{"number":5},"@timestamp":"2025-03-19T12:22:44.648203511Z"}]}]]"}}}]}
POST /_bulk HTTP/1.1
Host: search-engine:9200
User-Agent: opensearch-go/2.3.0 (linux amd64; Go 1.23.3)
Content-Length: 555
Content-Type: application/json
Accept-Encoding: gzip
{"create":{"_index":"ss4o_logs-default-namespace"}}
{"attributes":{"data_stream":{"dataset":"default","namespace":"namespace","type":"record"}},"body":"Hello! This is a Testing log","instrumentationScope":{"name":"test-app"},"observedTimestamp":"2025-03-19T12:23:45.536550316Z","resource":{"service.name":"test-app","telemetry.sdk.language":"go","telemetry.sdk.name":"opentelemetry","telemetry.sdk.version":"1.33.0"},"severity":{"number":5},"@timestamp":"2025-03-19T12:22:43.633743348Z"}
HTTP/1.1 200 OK
content-type: application/json; charset=UTF-8
content-encoding: gzip
content-length: 208
{"took":1465,"errors":false,"items":[{"create":{"_index":"ss4o_logs-default-namespace","_id":"XaBcrpUBfWGueJ6HFsaB","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":0,"_primary_term":1,"status":201}}]}
In the capture above, the first bulk request contains 2 items and both fail with status 503 (unavailable_shards_exception), so both are retriable. The retry request contains only 1 of them, so the other record is lost.
Are any changes needed to get complete data in the search engine when the backend becomes unavailable and is later brought back?
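
To make the expected behaviour concrete, here is a minimal Python sketch of how a partially failed `_bulk` response could be turned into a retry payload that keeps every retriable item. This is not the exporter's actual code. The function names (`split_bulk_payload`, `build_retry_payload`) and the set of statuses treated as retriable are assumptions made for illustration. It also assumes every action is followed by a document line, as with the `create` actions in the capture above.

```python
import json

# Assumption: statuses treated as retriable; the exporter's real list may differ.
RETRIABLE_STATUSES = {429, 500, 502, 503, 504}


def split_bulk_payload(payload: str) -> list[tuple[str, str]]:
    """Split an NDJSON bulk body into (action_line, document_line) pairs."""
    lines = [line for line in payload.splitlines() if line.strip()]
    if len(lines) % 2 != 0:
        raise ValueError("expected action/document line pairs")
    return list(zip(lines[0::2], lines[1::2]))


def build_retry_payload(payload: str, response_body: str) -> str:
    """Return an NDJSON body holding every item that failed with a retriable status."""
    pairs = split_bulk_payload(payload)
    items = json.loads(response_body)["items"]
    if len(items) != len(pairs):
        raise ValueError("response item count does not match request item count")

    retry_lines: list[str] = []
    # Items in a bulk response are returned in the same order as the request.
    for (action, document), item in zip(pairs, items):
        # Each response item has a single key naming the action, e.g. "create".
        result = next(iter(item.values()))
        if result.get("status") in RETRIABLE_STATUSES:
            retry_lines.extend([action, document])
    return "".join(line + "\n" for line in retry_lines)
```

With the first response in the capture, where both items return 503, a function like this would produce a retry body containing both documents rather than one.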