Skip to content

Conversation

kostasrim
Copy link
Contributor

@kostasrim kostasrim commented Apr 4, 2025

The issue: client fails to connect and redis lib converts OsError to ConnectionError and rethrows. This happens rarely and the symptom can be treated by catching the exception.

_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
dragonfly/replication_test.py:1249: in block_during_takeover
    assert await c_blocking.execute_command("BLPOP BLOCKING_KEY1 BLOCKING_KEY2 100") is None
/usr/local/lib/python3.8/dist-packages/redis/asyncio/client.py:611: in execute_command
    conn = self.connection or await pool.get_connection(command_name, **options)
/usr/local/lib/python3.8/dist-packages/redis/asyncio/connection.py:1064: in get_connection
    await self.ensure_connection(connection)
/usr/local/lib/python3.8/dist-packages/redis/asyncio/connection.py:1097: in ensure_connection
    await connection.connect()
/usr/local/lib/python3.8/dist-packages/redis/asyncio/connection.py:296: in connect
    await self.on_connect()
/usr/local/lib/python3.8/dist-packages/redis/asyncio/connection.py:401: in on_connect
    await self.send_command("CLIENT", "SETINFO", "LIB-VER", self.lib_version)
/usr/local/lib/python3.8/dist-packages/redis/asyncio/connection.py:505: in send_command
    await self.send_packed_command(

and
redis.exceptions.ConnectionError: Error UNKNOWN while writing to socket. Connection lost.

Fixes # #4533

assert await c_blocking.execute_command("BLPOP BLOCKING_KEY1 BLOCKING_KEY2 100") is None
try:
assert await c_blocking.execute_command("BLPOP BLOCKING_KEY1 BLOCKING_KEY2 100") is None
except redis.exceptions.ConnectionError as e:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but why do we have connection error?
is it because takeover already started? if so isnt the solution is to increase a little bit the delay in takeover execution ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@adiholden Oh I thought I wrote it in the description. The client fails to connect with an os error. The redis lib converts the OsError to connection error.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if so isnt the solution is to increase a little bit the delay

Or just handle the ConnectionError. It doesn't eve happen that often 🤷‍♂️

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem with this kind of a fix that now you see it does not happen that often, but tomorrow someone will change the test and will remove the delay in takeover or something else in the test will change and we will not know this functionality is not tested at all as it might always get to the path of ConnectionError

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep but it's also flaky if you just increase the takeover because you don't know when BLPOP will run by the python's event loop.

I ended up increasing the takeover delay to 1s. Let's see how that goes in practice 😄

@kostasrim kostasrim requested a review from adiholden April 7, 2025 06:13
@adiholden adiholden merged commit 2d96a57 into main Apr 9, 2025
10 checks passed
@adiholden adiholden deleted the kpr5 branch April 9, 2025 06:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants