-
Notifications
You must be signed in to change notification settings - Fork 1.1k
feat(replication): allow non-tls connections between replica and master #1419 #1490
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
78fb3ca
to
70bde27
Compare
70bde27
to
af2d393
Compare
General comment about this: Whether we want TLS is a question of what network interface we listen on and its security. If the network doesn't need TLS then we don't need TLS at all, if the network needs TLS then we should always have TLS. From a security perspective I really don't like that it is specific to replication. |
Why do you think it's a security issue if it's specific for replication? If we are on a private network then why would the replica need to communicate over tls with master? (I think this was the original idea). I am not an expert in security, so plz add your concerns here and we can chat a little bit tomorrow on tomorrow's standup. |
My problem is that this is an unclean security abstraction: If the network is private then all connections should be without TLS. Making this relaxation specific for replication creates a difference in the network security level that is based on application logic and not on the network logic. |
} | ||
FiberSocketBase* peer = tls_sock ? (FiberSocketBase*)tls_sock.get() : socket_.get(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why was this removed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because I simplified it a little bit, it gets initialized at the top :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe I am missing something but in the old flow tls_sock is assigned to peer if tls_sock not null. I dont that you update peer with tls_sock in the new flow
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how is it working?
I would expect the test you wrote to fail.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed I messed this up on refactoring and HUGE thumps up for catching this. It works, because we only test the admin port and not the TLS connection itself -- we have 0 tests for that (although my other PR addresses this and it would have been caught there).
raise "Non tls connection connected on master with tls. This should NOT happen" | ||
except redis.ConnectionError: | ||
pass | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lets add a connection via admin port and run DBSIZE command, see that it passes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this can also be a test in connection_test.py as checking that running command via admin port does not require tls is not related to replication
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I ended up having it here. It's not related to replication but it doesn't require to create a new test case for that. It's small detail ;)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, I kind of agree with Adi here. The feature here is being able to connect to the admin port without using the tls. I think the slave part is redundant machinery to cover this test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe I missed something but @adiholden asked to promt admin port directly (on previous commits that was not the case because of how I initially implemented this) -- which I addressed.
Now for (# 2. Try to connect on master without admin port. This should fail.
), I think it's good to be here because it asserts that the changes/flags introduced that allowed non-tls connections on admin port did not also allow non-tls connections on other ports (it could be the case that I messed up the if conditions here: https://github.com/dragonflydb/dragonfly/pull/1490/files#diff-b88cbca1b2c9337ab2a37262f9d516b117f4996da25e2ae173e0e129f2b82b20R310) -- this guards as against that potential mistake and that's why I put it here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment here is that the changes in dfly code are not specific for replication, we enable connecting from admin port without tls. Therefor we would like to test this specifically and the replication test is another test to check that replication work with this, we can add this test but f.e if replication with no tls breaks this does not mean that connecting via admin port with no tls fails.
f.e if the PR was about adding admin port to dfly I would test it in connection test, it is importatnt to validate that replication as well works with admin port there fore I will add another test for replication with admin port
I believe that we need
c_replica = aioredis.Redis(port=replica.port) | ||
res = await c_replica.execute_command("REPLICAOF localhost " + str(ADMIN_PORT)) | ||
assert b"OK" == res | ||
time.sleep(10) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
instead of sleep
use
await check_all_replicas_finished([c_replica], c_master)
The clients here should be connected to admin port
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please remove the sleep
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would have taken an oath that I removed it, I really don't know how it got left -- removed
@dfly_args({"dbfilename": "test-tls-replication-{{timestamp}}"}) | ||
async def test_replication_all(df_local_factory, df_seeder_factory, t_master, t_replica): | ||
# 1. Spin up dragonfly without tls, debug populate and then shut it down | ||
master = df_local_factory.create(port=BASE_PORT, proactor_threads=t_master) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that this first step is redundant, as you can run the dubug populate command from admin port when starting master with the tls args
# b. Allow non-tls connection from replica to master | ||
master = df_local_factory.create(admin_port=ADMIN_PORT, admin_nopass=True, **with_tls_args, port=BASE_PORT, proactor_threads=t_master) | ||
master.start() | ||
# we need to sleep, because currently tls connections on master fail |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we need to sleep because lts connection fail? I dont understand this.. where do you create connection? how is sleeping helps here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that was left from the rpevious PR. I will remove it
replica = df_local_factory.create(port=BASE_PORT + 1, proactor_threads=t_replica,) | ||
replica.start() | ||
c_replica = aioredis.Redis(port=replica.port) | ||
res = await c_replica.execute_command("REPLICAOF localhost " + str(ADMIN_PORT)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"REPLICAOF localhost {master.admin_port}"
pass | ||
|
||
# 4. Spin up a replica and initiate a REPLICAOF | ||
replica = df_local_factory.create(port=BASE_PORT + 1, proactor_threads=t_replica,) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest creating replica with tls args as well, you can run the commands on replica using the admin port
"no_tls_on_admin_port": "true"} | ||
return with_tls_args | ||
|
||
def gen_tls_cert(dfly_path): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
consider making it a fixture (in conftest.py) similarly to other resources being consumed by tests.
src/facade/dragonfly_connection.h
Outdated
@@ -19,6 +19,7 @@ | |||
#include "core/fibers.h" | |||
#include "facade/facade_types.h" | |||
#include "facade/resp_expr.h" | |||
#include "util/tls/tls_socket.h" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why did you need it here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh it's FiberSocketBase *
, I completely missed that, I thought it's used explicitly. removed.
db_size = await c_replica.execute_command("DBSIZE") | ||
assert 100 == db_size | ||
|
||
master.stop() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe you dont need to call stop at the end of the test as this is done in df_local_factory
tests/dragonfly/connection_test.py
Outdated
@@ -2,6 +2,7 @@ | |||
import pytest | |||
import asyncio | |||
from redis import asyncio as aioredis | |||
from redis.exceptions import ConnectionError |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ConnectionError
is a Python builtin so it's probably best not to override it
tests/dragonfly/connection_test.py
Outdated
@pytest.mark.asyncio | ||
@dfly_args({"admin_nopass" : True}) | ||
async def test_reject_non_tls_connections_on_tls_master(gen_tls_cert, tls_server_key_file_name, tls_server_cert_file_name, df_local_factory, df_seeder_factory): | ||
gen_tls_cert |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
gen_tls_cert |
?
src/facade/dragonfly_connection.cc
Outdated
LOG(WARNING) << "Error handshaking " << aresult.error().message(); | ||
return; | ||
const bool no_tls_on_admin_port = absl::GetFlag(FLAGS_no_tls_on_admin_port); | ||
if (!no_tls_on_admin_port || (no_tls_on_admin_port && !IsAdmin())) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find this a bit more readable -
if (!no_tls_on_admin_port || (no_tls_on_admin_port && !IsAdmin())) { | |
if (!(IsAdmin() && no_tls_on_admin_port)) { |
@@ -159,6 +159,13 @@ def stop_all(self): | |||
def __str__(self): | |||
return f"Factory({self.args})" | |||
|
|||
@property | |||
def dfly_path(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe use os.path.basedir
here? hard-coding the test binary name is a bit annoying if people try to bisect regressions etc
tests/dragonfly/connection_test.py
Outdated
await c_master.execute_command("DBSIZE") | ||
raise "Non tls connection connected on master with tls. This should NOT happen" | ||
except redis.exceptions.ConnectionError: | ||
pass |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please connect via admin port also and check it success
@pytest.mark.parametrize("t_master, t_replica", replication_cases) | ||
@dfly_args({"dbfilename": "test-tls-replication-{{timestamp}}", "admin_nopass" : True}) | ||
async def test_no_tls_on_admin_port(gen_tls_cert, tls_server_key_file_name, tls_server_cert_file_name, df_local_factory, df_seeder_factory, t_master, t_replica): | ||
gen_tls_cert |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove gen_tls_cert
|
||
@pytest.mark.asyncio | ||
@pytest.mark.parametrize("t_master, t_replica", replication_cases) | ||
@dfly_args({"dbfilename": "test-tls-replication-{{timestamp}}", "admin_nopass" : True}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe you can drop the dbfilename param
why is the admin_nopass needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
to make our life easier so I don't have to setup a password
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why password is required if requirepass flag is not set?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
because from my understanding you still need to provide an auth token according to #1181
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I quote:
However for auth protected setups we still need to provide an auth token for clients that connect via the admin port.
ADMIN_PORT = 1211 | ||
|
||
""" | ||
Test replication with tls and non-tls options |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually this is just another replication test.. you can just move it to replicaiton_test.py
tests/dragonfly/conftest.py
Outdated
def tls_server_cert_file_name(): | ||
return "df-cert.pem" | ||
|
||
@pytest.fixture(scope="function") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think its better to use scope="session" so this function will run only once when we are running regression tests
# Use CA's private key to sign dragonfly's CSR and get back the signed certificate | ||
tls_server_cert = dfly_path + tls_server_cert_file_name | ||
step3 = fr'openssl x509 -req -in {tls_server_req} -days 1 -CA {ca_cert} -CAkey {ca_key} -CAcreateserial -out {tls_server_cert}' | ||
subprocess.run(step3, shell=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
gen_tls_cert can return the path of key and certificate file names and instead of defining tls_server_cert_file_name and tls_server_key_file_name set the filenames to constant var
Addresses #1419 and form the basis for #1418.