
Loss of data when search engine is unavailable using Opensearch exporter #38846

@charan906

Component(s)

exporter/opensearch

What happened?

## Description
Data is lost when the search engine becomes unavailable and is later brought back, even though the sending queue and retry mechanisms are configured.

## Test Strategy

  1. Scaled the search engine data pod replicas to 0.
  2. Produced 1 log per second for 10 seconds (a sketch of such a producer follows this list).
  3. Brought the data pods back up after 10 seconds.
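
The exact log producer is not included in the report; the sketch below shows one way to emit 1 log record per second for 10 seconds with the OpenTelemetry Go SDK, matching the attributes visible in the captured _bulk payloads (scope `test-app`, severity number 5, Go SDK 1.33.0). The OTLP endpoint and exporter choice are assumptions.

```go
package main

import (
	"context"
	"time"

	"go.opentelemetry.io/otel/exporters/otlp/otlplog/otlploggrpc"
	"go.opentelemetry.io/otel/log"
	sdklog "go.opentelemetry.io/otel/sdk/log"
)

func main() {
	ctx := context.Background()

	// Assumption: the collector's OTLP/gRPC receiver is reachable on the
	// default localhost:4317 endpoint without TLS.
	exp, err := otlploggrpc.New(ctx, otlploggrpc.WithInsecure())
	if err != nil {
		panic(err)
	}
	provider := sdklog.NewLoggerProvider(
		sdklog.WithProcessor(sdklog.NewBatchProcessor(exp)),
	)
	defer provider.Shutdown(ctx)

	logger := provider.Logger("test-app")

	// Emit 1 log record per second for 10 seconds.
	for i := 0; i < 10; i++ {
		var rec log.Record
		rec.SetTimestamp(time.Now())
		rec.SetSeverity(log.SeverityDebug) // severity number 5, as seen in the payloads
		rec.SetBody(log.StringValue("Hello! This is a Testing log"))
		logger.Emit(ctx, rec)
		time.Sleep(time.Second)
	}
}
```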

## Observations

  1. The telemetry collector received all 10 log records and exported them.
  2. Only 9 of the 10 log records were stored once the backend became available again.

Expected Result

All 10 log records are stored in the search engine.

Actual Result

The search engine count is 9 (expected 10).
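
For completeness, a count check like the following (not part of the original report) is one way to compare the stored documents against the 10 produced records; the index name and port are taken from the captured _bulk traffic below, and plain HTTP without authentication is an assumption.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

func main() {
	// Index name as used by the exporter in the _bulk requests below.
	resp, err := http.Get("http://search-engine:9200/ss4o_logs-default-namespace/_count")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var out struct {
		Count int `json:"count"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		panic(err)
	}
	fmt.Printf("documents in index: %d (expected 10)\n", out.Count)
}
```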

Collector version

v0.121.0

OpenTelemetry Collector configuration

exporters:
  opensearch:
    sending_queue:
      enabled: true
      storage: file_storage/opensearch
      num_consumers: 2
      queue_size: 40000
    retry_on_failure:
      enabled: true
      initial_interval: 1s
      max_interval: 30s
      max_elapsed_time: 0
    http:
      endpoint: <SE host>:9000
extensions:
  file_storage/opensearch:
    directory: /tmp/otel/queue/opensearch
    create_directory: true
    timeout: 10s
processors:
  memory_limiter:
    check_interval: 5s
    limit_percentage: 90
    spike_limit_percentage: 15
  batch:
    send_batch_size: 50
    timeout: 2s
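
Not part of the report, but while the data pods are down the persistent queue can be sanity-checked by confirming that the file_storage directory configured above is actually being written. A minimal sketch, assuming the path from the configuration:

```go
package main

import (
	"fmt"
	"os"
)

func main() {
	// Directory configured for the file_storage/opensearch extension above.
	const dir = "/tmp/otel/queue/opensearch"

	entries, err := os.ReadDir(dir)
	if err != nil {
		panic(err)
	}
	var total int64
	for _, e := range entries {
		info, err := e.Info()
		if err != nil {
			continue
		}
		total += info.Size()
	}
	fmt.Printf("%d file(s), %d bytes queued under %s\n", len(entries), total, dir)
}
```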

Log output

Additional context

POST /_bulk HTTP/1.1
Host: search-engine:9200
User-Agent: opensearch-go/2.3.0 (linux amd64; Go 1.23.3)
Content-Length: 1110
Content-Type: application/json
Accept-Encoding: gzip

{"create":{"_index":"ss4o_logs-default-namespace"}}
{"attributes":{"data_stream":{"dataset":"default","namespace":"namespace","type":"record"}},"body":"Hello! This is a Testing log","instrumentationScope":{"name":"test-app"},"observedTimestamp":"2025-03-19T12:22:44.663646777Z","resource":{"service.name":"test-app","telemetry.sdk.language":"go","telemetry.sdk.name":"opentelemetry","telemetry.sdk.version":"1.33.0"},"severity":{"number":5},"@timestamp":"2025-03-19T12:22:43.633743348Z"}
{"create":{"_index":"ss4o_logs-default-namespace"}}
{"attributes":{"data_stream":{"dataset":"default","namespace":"namespace","type":"record"}},"body":"Hello! This is a Testing log","instrumentationScope":{"name":"test-app"},"observedTimestamp":"2025-03-19T12:22:44.664584006Z","resource":{"service.name":"test-app","telemetry.sdk.language":"go","telemetry.sdk.name":"opentelemetry","telemetry.sdk.version":"1.33.0"},"severity":{"number":5},"@timestamp":"2025-03-19T12:22:44.648203511Z"}
HTTP/1.1 200 OK
content-type: application/json; charset=UTF-8
content-encoding: gzip
content-length: 580

{"took":59937,"errors":true,"items":[{"create":{"_index":"ss4o_logs-default-namespace","_id":"WaBbrpUBfWGueJ6HKMa7","status":503,"error":{"type":"unavailable_shards_exception","reason":"[ss4o_logs-default-namespace][1] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[ss4o_logs-default-namespace][1]] containing [index {[ss4o_logs-default-namespace][WaBbrpUBfWGueJ6HKMa7], source[{"attributes":{"data_stream":{"dataset":"default","namespace":"namespace","type":"record"}},"body":"Hello! This is a Testing log","instrumentationScope":{"name":"test-app"},"observedTimestamp":"2025-03-19T12:22:44.663646777Z","resource":{"service.name":"test-app","telemetry.sdk.language":"go","telemetry.sdk.name":"opentelemetry","telemetry.sdk.version":"1.33.0"},"severity":{"number":5},"@timestamp":"2025-03-19T12:22:43.633743348Z"}]}]]"}}},{"create":{"_index":"ss4o_logs-default-namespace","_id":"WqBbrpUBfWGueJ6HKMa7","status":503,"error":{"type":"unavailable_shards_exception","reason":"[ss4o_logs-default-namespace][4] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[ss4o_logs-default-namespace][4]] containing [index {[ss4o_logs-default-namespace][WqBbrpUBfWGueJ6HKMa7], source[{"attributes":{"data_stream":{"dataset":"default","namespace":"namespace","type":"record"}},"body":"Hello! This is a Testing log","instrumentationScope":{"name":"test-app"},"observedTimestamp":"2025-03-19T12:22:44.664584006Z","resource":{"service.name":"test-app","telemetry.sdk.language":"go","telemetry.sdk.name":"opentelemetry","telemetry.sdk.version":"1.33.0"},"severity":{"number":5},"@timestamp":"2025-03-19T12:22:44.648203511Z"}]}]]"}}}]}

POST /_bulk HTTP/1.1
Host: search-engine:9200
User-Agent: opensearch-go/2.3.0 (linux amd64; Go 1.23.3)
Content-Length: 555
Content-Type: application/json
Accept-Encoding: gzip

{"create":{"_index":"ss4o_logs-default-namespace"}}
{"attributes":{"data_stream":{"dataset":"default","namespace":"namespace","type":"record"}},"body":"Hello! This is a Testing log","instrumentationScope":{"name":"test-app"},"observedTimestamp":"2025-03-19T12:23:45.536550316Z","resource":{"service.name":"test-app","telemetry.sdk.language":"go","telemetry.sdk.name":"opentelemetry","telemetry.sdk.version":"1.33.0"},"severity":{"number":5},"@timestamp":"2025-03-19T12:22:43.633743348Z"}
HTTP/1.1 200 OK
content-type: application/json; charset=UTF-8
content-encoding: gzip
content-length: 208

{"took":1465,"errors":false,"items":[{"create":{"_index":"ss4o_logs-default-namespace","_id":"XaBcrpUBfWGueJ6HFsaB","_version":1,"result":"created","_shards":

{"total":2,"successful":1,"failed":0}
,"_seq_no":0,"_primary_term":1,"status":201}}]}

When a bulk request fails with per-item 503 errors, even though both items are retriable, only 1 item is retried (see the second _bulk request above), which results in data loss.
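
For illustration, the sketch below parses an OpenSearch `_bulk` response and collects every item with a retriable status (429 or 5xx) so that all of them, not just one, can be re-submitted. The types and the status classification are assumptions made for the example; this is not the exporter's actual code.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// bulkResponse mirrors the parts of an OpenSearch _bulk reply needed here.
type bulkResponse struct {
	Errors bool `json:"errors"`
	Items  []map[string]struct {
		Status int `json:"status"`
		Error  *struct {
			Type   string `json:"type"`
			Reason string `json:"reason"`
		} `json:"error"`
	} `json:"items"`
}

// retriableIndexes returns the positions of every item that failed with a
// retriable status; all of these should be re-queued for the next attempt.
func retriableIndexes(body []byte) ([]int, error) {
	var resp bulkResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return nil, err
	}
	var retry []int
	for i, item := range resp.Items {
		for _, result := range item { // action key: "create", "index", ...
			if result.Status == 429 || result.Status >= 500 {
				retry = append(retry, i)
			}
		}
	}
	return retry, nil
}

func main() {
	// Two items, both rejected with 503 unavailable_shards_exception,
	// as in the captured response above.
	body := []byte(`{"errors":true,"items":[
	 {"create":{"status":503,"error":{"type":"unavailable_shards_exception","reason":"primary shard is not active"}}},
	 {"create":{"status":503,"error":{"type":"unavailable_shards_exception","reason":"primary shard is not active"}}}]}`)

	idx, err := retriableIndexes(body)
	if err != nil {
		panic(err)
	}
	fmt.Println("items to retry:", idx) // expected: [0 1]
}
```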

Are any changes needed to ensure complete data in the search engine when the backend becomes unavailable and is later brought back?
