Skip to content

Conversation

kostasrim
Copy link
Contributor

@kostasrim kostasrim commented Jun 28, 2023

Addresses #1419 and form the basis for #1418.

  1. Allow non-tls connections between replica and master on admin port
  2. Add test
  3. Add tls certificate generator

@kostasrim kostasrim force-pushed the fix_replication_with_tls branch 2 times, most recently from 78fb3ca to 70bde27 Compare June 29, 2023 16:46
@kostasrim kostasrim force-pushed the fix_replication_with_tls branch from 70bde27 to af2d393 Compare June 30, 2023 15:20
@kostasrim kostasrim changed the title feat(replication): allow non-tls connections between replica and master (DO NOT REVIEW WIP) feat(replication): allow non-tls connections between replica and master #1419 Jun 30, 2023
@kostasrim kostasrim marked this pull request as ready for review June 30, 2023 15:26
@kostasrim kostasrim self-assigned this Jun 30, 2023
@royjacobson
Copy link
Contributor

General comment about this: Whether we want TLS is a question of what network interface we listen on and its security. If the network doesn't need TLS then we don't need TLS at all, if the network needs TLS then we should always have TLS. From a security perspective I really don't like that it is specific to replication.

@kostasrim
Copy link
Contributor Author

@royjacobson

General comment about this: Whether we want TLS is a question of what network interface we listen on and its security. If the network doesn't need TLS then we don't need TLS at all, if the network needs TLS then we should always have TLS. From a security perspective I really don't like that it is specific to replication.

Why do you think it's a security issue if it's specific for replication? If we are on a private network then why would the replica need to communicate over tls with master? (I think this was the original idea).

I am not an expert in security, so plz add your concerns here and we can chat a little bit tomorrow on tomorrow's standup.

@royjacobson
Copy link
Contributor

@royjacobson

General comment about this: Whether we want TLS is a question of what network interface we listen on and its security. If the network doesn't need TLS then we don't need TLS at all, if the network needs TLS then we should always have TLS. From a security perspective I really don't like that it is specific to replication.

Why do you think it's a security issue if it's specific for replication? If we are on a private network then why would the replica need to communicate over tls with master? (I think this was the original idea).

I am not an expert in security, so plz add your concerns here and we can chat a little bit tomorrow on tomorrow's standup.

My problem is that this is an unclean security abstraction: If the network is private then all connections should be without TLS. Making this relaxation specific for replication creates a difference in the network security level that is based on application logic and not on the network logic.

@kostasrim kostasrim requested a review from romange July 5, 2023 13:50
@kostasrim kostasrim requested a review from royjacobson July 5, 2023 13:59
}
FiberSocketBase* peer = tls_sock ? (FiberSocketBase*)tls_sock.get() : socket_.get();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why was this removed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because I simplified it a little bit, it gets initialized at the top :)

Copy link
Contributor

@adiholden adiholden Jul 5, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I am missing something but in the old flow tls_sock is assigned to peer if tls_sock not null. I dont that you update peer with tls_sock in the new flow

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how is it working?
I would expect the test you wrote to fail.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed I messed this up on refactoring and HUGE thumps up for catching this. It works, because we only test the admin port and not the TLS connection itself -- we have 0 tests for that (although my other PR addresses this and it would have been caught there).

raise "Non tls connection connected on master with tls. This should NOT happen"
except redis.ConnectionError:
pass

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lets add a connection via admin port and run DBSIZE command, see that it passes

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this can also be a test in connection_test.py as checking that running command via admin port does not require tls is not related to replication

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ended up having it here. It's not related to replication but it doesn't require to create a new test case for that. It's small detail ;)

Copy link
Collaborator

@romange romange Jul 5, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, I kind of agree with Adi here. The feature here is being able to connect to the admin port without using the tls. I think the slave part is redundant machinery to cover this test.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I missed something but @adiholden asked to promt admin port directly (on previous commits that was not the case because of how I initially implemented this) -- which I addressed.

Now for (# 2. Try to connect on master without admin port. This should fail.), I think it's good to be here because it asserts that the changes/flags introduced that allowed non-tls connections on admin port did not also allow non-tls connections on other ports (it could be the case that I messed up the if conditions here: https://github.com/dragonflydb/dragonfly/pull/1490/files#diff-b88cbca1b2c9337ab2a37262f9d516b117f4996da25e2ae173e0e129f2b82b20R310) -- this guards as against that potential mistake and that's why I put it here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment here is that the changes in dfly code are not specific for replication, we enable connecting from admin port without tls. Therefor we would like to test this specifically and the replication test is another test to check that replication work with this, we can add this test but f.e if replication with no tls breaks this does not mean that connecting via admin port with no tls fails.
f.e if the PR was about adding admin port to dfly I would test it in connection test, it is importatnt to validate that replication as well works with admin port there fore I will add another test for replication with admin port
I believe that we need

c_replica = aioredis.Redis(port=replica.port)
res = await c_replica.execute_command("REPLICAOF localhost " + str(ADMIN_PORT))
assert b"OK" == res
time.sleep(10)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead of sleep
use
await check_all_replicas_finished([c_replica], c_master)
The clients here should be connected to admin port

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please remove the sleep

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would have taken an oath that I removed it, I really don't know how it got left -- removed

@dfly_args({"dbfilename": "test-tls-replication-{{timestamp}}"})
async def test_replication_all(df_local_factory, df_seeder_factory, t_master, t_replica):
# 1. Spin up dragonfly without tls, debug populate and then shut it down
master = df_local_factory.create(port=BASE_PORT, proactor_threads=t_master)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that this first step is redundant, as you can run the dubug populate command from admin port when starting master with the tls args

@kostasrim kostasrim requested a review from adiholden July 5, 2023 14:47
# b. Allow non-tls connection from replica to master
master = df_local_factory.create(admin_port=ADMIN_PORT, admin_nopass=True, **with_tls_args, port=BASE_PORT, proactor_threads=t_master)
master.start()
# we need to sleep, because currently tls connections on master fail
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we need to sleep because lts connection fail? I dont understand this.. where do you create connection? how is sleeping helps here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that was left from the rpevious PR. I will remove it

replica = df_local_factory.create(port=BASE_PORT + 1, proactor_threads=t_replica,)
replica.start()
c_replica = aioredis.Redis(port=replica.port)
res = await c_replica.execute_command("REPLICAOF localhost " + str(ADMIN_PORT))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"REPLICAOF localhost {master.admin_port}"

pass

# 4. Spin up a replica and initiate a REPLICAOF
replica = df_local_factory.create(port=BASE_PORT + 1, proactor_threads=t_replica,)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest creating replica with tls args as well, you can run the commands on replica using the admin port

@kostasrim kostasrim enabled auto-merge (squash) July 5, 2023 15:38
@kostasrim kostasrim requested a review from adiholden July 5, 2023 15:39
"no_tls_on_admin_port": "true"}
return with_tls_args

def gen_tls_cert(dfly_path):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

consider making it a fixture (in conftest.py) similarly to other resources being consumed by tests.

@@ -19,6 +19,7 @@
#include "core/fibers.h"
#include "facade/facade_types.h"
#include "facade/resp_expr.h"
#include "util/tls/tls_socket.h"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why did you need it here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh it's FiberSocketBase *, I completely missed that, I thought it's used explicitly. removed.

db_size = await c_replica.execute_command("DBSIZE")
assert 100 == db_size

master.stop()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe you dont need to call stop at the end of the test as this is done in df_local_factory

@kostasrim kostasrim requested a review from adiholden July 6, 2023 07:32
@@ -2,6 +2,7 @@
import pytest
import asyncio
from redis import asyncio as aioredis
from redis.exceptions import ConnectionError
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ConnectionError is a Python builtin so it's probably best not to override it

@pytest.mark.asyncio
@dfly_args({"admin_nopass" : True})
async def test_reject_non_tls_connections_on_tls_master(gen_tls_cert, tls_server_key_file_name, tls_server_cert_file_name, df_local_factory, df_seeder_factory):
gen_tls_cert
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
gen_tls_cert

?

LOG(WARNING) << "Error handshaking " << aresult.error().message();
return;
const bool no_tls_on_admin_port = absl::GetFlag(FLAGS_no_tls_on_admin_port);
if (!no_tls_on_admin_port || (no_tls_on_admin_port && !IsAdmin())) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find this a bit more readable -

Suggested change
if (!no_tls_on_admin_port || (no_tls_on_admin_port && !IsAdmin())) {
if (!(IsAdmin() && no_tls_on_admin_port)) {

@@ -159,6 +159,13 @@ def stop_all(self):
def __str__(self):
return f"Factory({self.args})"

@property
def dfly_path(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe use os.path.basedir here? hard-coding the test binary name is a bit annoying if people try to bisect regressions etc

await c_master.execute_command("DBSIZE")
raise "Non tls connection connected on master with tls. This should NOT happen"
except redis.exceptions.ConnectionError:
pass
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please connect via admin port also and check it success

@pytest.mark.parametrize("t_master, t_replica", replication_cases)
@dfly_args({"dbfilename": "test-tls-replication-{{timestamp}}", "admin_nopass" : True})
async def test_no_tls_on_admin_port(gen_tls_cert, tls_server_key_file_name, tls_server_cert_file_name, df_local_factory, df_seeder_factory, t_master, t_replica):
gen_tls_cert
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove gen_tls_cert


@pytest.mark.asyncio
@pytest.mark.parametrize("t_master, t_replica", replication_cases)
@dfly_args({"dbfilename": "test-tls-replication-{{timestamp}}", "admin_nopass" : True})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe you can drop the dbfilename param
why is the admin_nopass needed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to make our life easier so I don't have to setup a password

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why password is required if requirepass flag is not set?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

because from my understanding you still need to provide an auth token according to #1181

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I quote:

However for auth protected setups we still need to provide an auth token for clients that connect via the admin port.

ADMIN_PORT = 1211

"""
Test replication with tls and non-tls options
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually this is just another replication test.. you can just move it to replicaiton_test.py

def tls_server_cert_file_name():
return "df-cert.pem"

@pytest.fixture(scope="function")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think its better to use scope="session" so this function will run only once when we are running regression tests

# Use CA's private key to sign dragonfly's CSR and get back the signed certificate
tls_server_cert = dfly_path + tls_server_cert_file_name
step3 = fr'openssl x509 -req -in {tls_server_req} -days 1 -CA {ca_cert} -CAkey {ca_key} -CAcreateserial -out {tls_server_cert}'
subprocess.run(step3, shell=True)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

gen_tls_cert can return the path of key and certificate file names and instead of defining tls_server_cert_file_name and tls_server_key_file_name set the filenames to constant var

@kostasrim kostasrim merged commit 15481b8 into main Jul 6, 2023
@romange romange deleted the fix_replication_with_tls branch July 6, 2023 13:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants