Skip to content

Conversation

MatMaul
Copy link
Contributor

@MatMaul MatMaul commented May 9, 2025

If a set of messages exceeds this limit, the messages are splitted across several EDUs.

Should fix #17035.

There is currently no official specced limit for EDUs, but the consensus seems to be that it would be useful to have one to avoid this bug by bounding the transaction size.

As a side effect it also limits the size of a single to-device message to a bit less than 65536.

This should probably be added to the spec similarly to the message size limit.

Pull Request Checklist

@MatMaul MatMaul marked this pull request as ready for review May 9, 2025 14:38
@MatMaul MatMaul requested a review from a team as a code owner May 9, 2025 14:38
Comment on lines +304 to +307
edu_contents = get_device_message_edu_contents(
sender_user_id, message_type, messages, context
)
remote_edu_contents[destination] = edu_contents
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of changing the structure of remote_edu_contents (was a map from destination to EDU meta) (to a map from destination to multiple EDU meta), could we just call add_messages_to_device_inbox(...) multiple times?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The multi version should have some gain performance side, since it's in an unique transaction, and pre-allocate all the stream ids.

I am fine if we decide to keep it simple and sacrifice some perf for that, but I am not sure it's worth it here it's not overly complicated.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MadLittleMods are you fine with keeping it like that ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't really investigated the intricacies here but my leaning would be on the simple solution I suggested.

Performance wasn't the thing we're trying to address.


for recipient, message in messages.items():
# We remove 2 for the curly braces and add 1 for the colon
message_entry_size = len(encode_canonical_json({recipient: message})) - 2 + 1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Drive-by thought: instead of trying to work out the lengths and calculate the number of messages we can add, it might be easier to just generate the EDU and then check the size of it. If its too big you half the number of messages and try again.

The common case will be that we don't need to split up the EDU, at the expense of duplicating some work. It feels a bit hacky, but I think might be a little less brittle?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure it will be that much simpler for comprehension TBH.

If we think we can eat the perf cost, the simpler is probably to call encode_canonical_json on the whole EDU for each added message, and remove it and create a new EDU if it's larger than the max.

My calculation tricks were to avoid doing a full serialization on each added message.

Copy link
Contributor Author

@MatMaul MatMaul May 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And add a special case tried first where we try to put everything in one ?

I don't know TBH, the idea of splitting in 2 is nice too but I feel like it is going to be quite annoying to implement and hence not simpler.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@erikjohnston @MadLittleMods thoughts on where we would like to go here ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The manual message serialization is a bit error prone and confusing to follow.

@erikjohnston's suggestion sounds good to me if you're willing to adapt

@MadLittleMods MadLittleMods changed the title Limit to_device EDU size to 65536 Limit outgoing to_device EDU size to 65536 Jun 3, 2025
MadLittleMods added a commit that referenced this pull request Jun 19, 2025
@mcalinghee
Copy link
Contributor

Looks like the CI tests are flaky as mentioned here : #18537

Comment on lines +158 to +162
# FIXME: Because huge log line is triggered in this test,
# trial breaks, sometimes (flakily) failing the test run.
# ref: https://github.com/twisted/twisted/issues/12482
# To remove this, we would need to fix the above issue and
# update, including in olddeps (so several years' wait).
Copy link
Contributor

@MadLittleMods MadLittleMods Aug 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like there is still another case somewhere -> https://github.com/element-hq/synapse/actions/runs/17109765260/job/48528124654?pr=18416 (twisted.protocols.amp.TooLong)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried this to track the case but I was not able to reproduce the issue on my side.
If you have an idea on how to track the issue, I would be interested.

In the meantime, I had a guess on what was the issue by looking at the code.

@mcalinghee
Copy link
Contributor

Not able to reproduce the issue here by running COMPLEMENT_DIR=../complement POSTGRES=1 WORKERS=1 ./scripts-dev/complement.sh
Probably missing something here.

# This is defined in the Matrix spec and enforced by the receiver.
MAX_EDUS_PER_TRANSACTION = 100
# A transaction can contain up to 100 EDUs but synapse reserves 10 EDUs for other purposes
SYNAPSE_EDUS_PER_TRANSACTION = 10
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
SYNAPSE_EDUS_PER_TRANSACTION = 10
NUMBER_OF_RESERVED_EDUS_PER_TRANSACTION = 10


# This is defined in the Matrix spec and enforced by the receiver.
MAX_EDUS_PER_TRANSACTION = 100
# A transaction can contain up to 100 EDUs but synapse reserves 10 EDUs for other purposes
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# A transaction can contain up to 100 EDUs but synapse reserves 10 EDUs for other purposes
# A transaction can contain up to 100 EDUs but synapse reserves 10 EDUs for other purposes
# like trickling out some device list updates.

sender_user_id: str,
message_type: str,
messages: Dict[str, Dict[str, JsonDict]],
context: Dict[str, Any],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
context: Dict[str, Any],
tracing_context: Dict[str, Any],

sender_user_id: str,
message_type: str,
messages: Dict[str, Dict[str, JsonDict]],
context: Dict[str, Any],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of passing this in, can we just call tracing_context = get_active_span_text_map() in the function?

sender_user_id: str,
message_type: str,
context: Dict[str, Any],
message_id: str = random_string(16),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this will work correctly and I'm surprised this isn't causing a lint to be triggered (we had a lint for this before). I guess this is something we need to re-enable -> https://docs.astral.sh/ruff/rules/function-call-in-default-argument/

random_string(16) will only be executed once and be shared across all messages. We should also have a test that catches this as well.

Comment on lines +304 to +307
edu_contents = get_device_message_edu_contents(
sender_user_id, message_type, messages, context
)
remote_edu_contents[destination] = edu_contents
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't really investigated the intricacies here but my leaning would be on the simple solution I suggested.

Performance wasn't the thing we're trying to address.

)

if len(current_edu_content["messages"]) > 0:
message_entry_size += 1 # Add 1 for the comma
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Give an example.

This is also confused because we also state "add 1 for the comma per message" above. Which comma's? Is having both correct?


for recipient, message in messages.items():
# We remove 2 for the curly braces and add 1 for the colon
message_entry_size = len(encode_canonical_json({recipient: message})) - 2 + 1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The manual message serialization is a bit error prone and confusing to follow.

@erikjohnston's suggestion sounds good to me if you're willing to adapt

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

A long queue of to-device messages can prevent outgoing federation working

4 participants