Conversation

@douglascamata (Member) commented May 8, 2025

Description

This PR fixes a bug caused by broken logic in the startup process when there is no explicit OpAMP server configured in the Supervisor's agent config. Currently the startup works like this:

1. The bootstrap happens, which starts the OpAMP server in the Supervisor on a random port. The chosen port is saved on the Supervisor:

```go
func (s *Supervisor) getBootstrapInfo() (err error) {
	_, span := s.getTracer().Start(context.Background(), "GetBootstrapInfo")
	defer span.End()
	// The random port picked here is saved on the Supervisor.
	s.opampServerPort, err = s.getSupervisorOpAMPServerPort()
```
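For context, the port is random because asking the OS for a free port yields a different one on each call. A minimal, self-contained sketch of that pattern (not the actual `getSupervisorOpAMPServerPort` implementation):

```go
package main

import "net"

// getRandomFreePort asks the OS for an ephemeral port by listening on port 0,
// then releases it. Two calls will almost always return two different ports,
// which is exactly why picking a port twice during startup breaks the handshake.
func getRandomFreePort() (int, error) {
	l, err := net.Listen("tcp", "localhost:0")
	if err != nil {
		return 0, err
	}
	defer l.Close()
	return l.Addr().(*net.TCPAddr).Port, nil
}
```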

2. The initial merged configuration is written to a file, using the OpAMP server port already saved in the Supervisor.

3. The "real" OpAMP server is started on a different random port, so the Collector never connects back via OpAMP and is constantly restarted by the Supervisor (see the sketch of a fix after the snippet below):

```go
func (s *Supervisor) startOpAMPServer() error {
	s.opampServer = server.New(newLoggerFromZap(s.telemetrySettings.Logger, "opamp-server"))
	var err error
	// A second random port is picked here, overwriting the one that the merged
	// configuration file already points at.
	s.opampServerPort, err = s.getSupervisorOpAMPServerPort()
	if err != nil {
		return err
	}
```
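The fix makes both steps agree on a single port. A minimal sketch, assuming the fix reuses the port saved during bootstrap (the actual diff may differ):

```go
// Sketch only, not the actual diff: reuse the OpAMP server port chosen during
// bootstrap so it matches the port already written into the merged
// configuration file.
func (s *Supervisor) startOpAMPServer() error {
	s.opampServer = server.New(newLoggerFromZap(s.telemetrySettings.Logger, "opamp-server"))

	// Only pick a new random port if bootstrap has not picked one already.
	if s.opampServerPort == 0 {
		var err error
		s.opampServerPort, err = s.getSupervisorOpAMPServerPort()
		if err != nil {
			return err
		}
	}
	// ... start the server on s.opampServerPort ...
	return nil
}
```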

Testing

Added an e2e test to cover this scenario; a rough outline of what it checks follows.
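A hypothetical outline of such a test: start the Supervisor with a local agent configuration and verify the Collector connects back over OpAMP instead of being restarted in a loop. The helper names here are illustrative, not the repository's actual e2e helpers:

```go
import (
	"testing"
	"time"

	"github.com/stretchr/testify/require"
)

// Hypothetical outline; newFakeOpAMPBackend and startSupervisorWith are
// illustrative stand-ins, not the actual helpers in the repository.
func TestCollectorConnectsBackWithLocalConfig(t *testing.T) {
	backend := newFakeOpAMPBackend(t) // OpAMP server the Supervisor reports to

	// A local agent config lets the Supervisor start the Collector on the
	// first attempt inside Supervisor.Start, which is what triggers the bug.
	sup := startSupervisorWith(t, backend.Endpoint(), "testdata/local_config.yaml")
	t.Cleanup(sup.Shutdown)

	// Before the fix, the merged config pointed at a stale port, so the
	// Collector never connected back and was restarted forever.
	require.Eventually(t, func() bool {
		return sup.AgentConnected()
	}, 10*time.Second, 100*time.Millisecond, "Collector never connected back via OpAMP")
}
```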

@evan-bradley (Contributor) left a comment

Thank you for digging into this. The issue you point out and your fix both seem valid to me, but I'm not understanding why this doesn't happen when I run the Supervisor against the example server, or in any of the other E2E tests. What conditions trigger this error?

@douglascamata (Member, Author) commented May 9, 2025

> Thank you for digging into this. The issue you point out and your fix both seem valid to me, but I'm not understanding why this doesn't happen when I run the Supervisor against the example server, or in any of the other E2E tests. What conditions trigger this error?

@evan-bradley I think what triggers this is the presence of a local configuration that is enough for the Supervisor to start the Collector on the first attempt, inside Supervisor.Start. This scenario wasn't covered by any other e2e test before. Even if it had been, we didn't have any check confirming that the Collector could connect back to the Supervisor.

This bug doesn't happen when no local configuration is present, because the Supervisor has to receive some remote configuration first, and that path correctly composes and writes the effective configuration to the file before starting the Collector process (see lines 1342 and 1343 in the snippet below):

```go
case <-s.hasNewConfig:
	s.lastHealthFromClient = nil
	if !configApplyTimeoutTimer.Stop() {
		select {
		case <-configApplyTimeoutTimer.C: // Try to drain the channel
		default:
		}
	}
	configApplyTimeoutTimer.Reset(s.config.Agent.ConfigApplyTimeout)
	s.telemetrySettings.Logger.Debug("Restarting agent due to new config")
	restartTimer.Stop()
	// The effective configuration is composed and written to disk here,
	// before the Collector process is started again.
	s.stopAgentApplyConfig()
	status, err := s.startAgent()
```

@douglascamata (Member, Author) commented:
Hey @evan-bradley, could you give this another look, please? 🙏

@evan-bradley (Contributor) left a comment

Thanks @douglascamata. I was able to reproduce this when running the Supervisor locally by preventing the example server from sending any reply that would cause the Supervisor to restart the Collector. Normally an additional message is sent, which masks this bug.

@evan-bradley merged commit da60438 into open-telemetry:main on May 14, 2025 (173 checks passed).
@github-actions bot added this to the next release milestone on May 14, 2025.