This repository was archived by the owner on Apr 26, 2024. It is now read-only.

backfill slow and expensive #2504

@richvdh

Description

Today we saw a pattern of requests from a client which appeared to have got stuck in a loop calling /messages on Matrix HQ, passing the same pagination parameters over and over again.

There's probably a bug in the client, but that isn't the point here. The main problem was that each request ended up burning 300ms of CPU time which, combined with the rate of requests, DoSed matrix.org.

Some example requests:

2017-10-05 14:58:49,106 - synapse.access.http.8080 - 91 - INFO - GET-4039099 - 8080 - {@user:matrix.org} Processed request: 1813ms (271ms, 0ms) (4ms/2) 9809B 200 "GET /_matrix/client/r0/rooms/!cURbafjkfsMDVwdRDQ:matrix.org/messages?access_token=<redacted>&from=t353343-282585688_300957161_39496_54465289_15147448_147235_1853980_2348789&dir=b&limit=20 HTTP/1.1" "Mozilla/5.0"
2017-10-05 14:58:49,520 - synapse.access.http.8080 - 91 - INFO - GET-4039004 - 8080 - {@user:matrix.org} Processed request: 4056ms (316ms, 0ms) (19ms/3) 8807B 200 "GET /_matrix/client/r0/rooms/!cURbafjkfsMDVwdRDQ:matrix.org/messages?access_token=<redacted>&from=t353323-282576586_300957161_39496_54465289_15147448_147235_1853980_2348789&dir=b&limit=20 HTTP/1.1" "Mozilla/5.0"
2017-10-05 14:58:49,522 - synapse.access.http.8080 - 91 - INFO - GET-4039030 - 8080 - {@user:matrix.org} Processed request: 3279ms (276ms, 0ms) (15ms/3) 8807B 200 "GET /_matrix/client/r0/rooms/!cURbafjkfsMDVwdRDQ:matrix.org/messages?access_token=<redacted>&from=t353323-282576586_300957161_39496_54465289_15147448_147235_1853980_2348789&dir=b&limit=20 HTTP/1.1" "Mozilla/5.0"
2017-10-05 14:58:49,572 - synapse.access.http.8080 - 91 - INFO - GET-4039061 - 8080 - {@user:matrix.org} Processed request: 2853ms (291ms, 0ms) (16ms/3) 8807B 200 "GET /_matrix/client/r0/rooms/!cURbafjkfsMDVwdRDQ:matrix.org/messages?access_token=<redacted>&from=t353323-282576586_300957161_39496_54465289_15147448_147235_1853980_2348789&dir=b&limit=20 HTTP/1.1" "Mozilla/5.0"
2017-10-05 14:58:50,089 - synapse.access.http.8080 - 91 - INFO - GET-4039231 - 8080 - {@user:matrix.org} Processed request: 1835ms (280ms, 0ms) (3ms/2) 9809B 200 "GET /_matrix/client/r0/rooms/!cURbafjkfsMDVwdRDQ:matrix.org/messages?access_token=<redacted>&from=t353343-282585688_300957161_39496_54465289_15147448_147235_1853980_2348789&dir=b&limit=20 HTTP/1.1" "Mozilla/5.0"
2017-10-05 14:58:51,605 - synapse.access.http.8080 - 91 - INFO - GET-4039382 - 8080 - {@user:matrix.org} Processed request: 2212ms (259ms, 0ms) (2ms/2) 8807B 200 "GET /_matrix/client/r0/rooms/!cURbafjkfsMDVwdRDQ:matrix.org/messages?access_token=<redacted>&from=t353323-282576586_300957161_39496_54465289_15147448_147235_1853980_2348789&dir=b&limit=20 HTTP/1.1" "Mozilla/5.0"
2017-10-05 14:58:52,000 - synapse.access.http.8080 - 91 - INFO - GET-4039452 - 8080 - {@user:matrix.org} Processed request: 1888ms (283ms, 0ms) (2ms/2) 8807B 200 "GET /_matrix/client/r0/rooms/!cURbafjkfsMDVwdRDQ:matrix.org/messages?access_token=<redacted>&from=t353323-282576586_300957161_39496_54465289_15147448_147235_1853980_2348789&dir=b&limit=20 HTTP/1.1" "Mozilla/5.0"
2017-10-05 14:58:52,003 - synapse.access.http.8080 - 91 - INFO - GET-4039453 - 8080 - {@user:matrix.org} Processed request: 1881ms (287ms, 0ms) (4ms/2) 8807B 200 "GET /_matrix/client/r0/rooms/!cURbafjkfsMDVwdRDQ:matrix.org/messages?access_token=<redacted>&from=t353323-282576586_300957161_39496_54465289_15147448_147235_1853980_2348789&dir=b&limit=20 HTTP/1.1" "Mozilla/5.0"
2017-10-05 14:58:52,241 - synapse.access.http.8080 - 91 - INFO - GET-4039454 - 8080 - {@user:matrix.org} Processed request: 2110ms (268ms, 0ms) (2ms/2) 8807B 200 "GET /_matrix/client/r0/rooms/!cURbafjkfsMDVwdRDQ:matrix.org/messages?access_token=<redacted>&from=t353323-282576586_300957161_39496_54465289_15147448_147235_1853980_2348789&dir=b&limit=20 HTTP/1.1" "Mozilla/5.0"
2017-10-05 14:58:52,481 - synapse.access.http.8080 - 91 - INFO - GET-4039483 - 8080 - {@user:matrix.org} Processed request: 1352ms (279ms, 0ms) (2ms/2) 8807B 200 "GET /_matrix/client/r0/rooms/!cURbafjkfsMDVwdRDQ:matrix.org/messages?access_token=<redacted>&from=t353323-282576586_300957161_39496_54465289_15147448_147235_1853980_2348789&dir=b&limit=20 HTTP/1.1" "Mozilla/5.0"

Each request prompted a backfill request to jki.re, for example:

2017-10-05 14:58:45,474 - synapse.access.http.8080 - 59 - INFO - GET-4039005 - 8080 - Received request: GET /_matrix/client/r0/rooms/!cURbafjkfsMDVwdRDQ:matrix.org/messages?access_token=<redacted>&from=t353343-282585688_300957161_39496_54465289_15147448_147235_1853980_2348789&dir=b&limit=20
2017-10-05 14:58:46,189 - synapse.http.outbound - 154 - INFO - GET-4039005 - {GET-O-6237} [jki.re] Sending request: GET matrix://jki.re/_matrix/federation/v1/backfill/!cURbafjkfsMDVwdRDQ:matrix.org/?limit=100&v=%241504371387441TzMaJ%3Ahveem.no&v=%2414998105322MIGsD%3Aarquilla.net&v=%2415030841510raCcl%3Ahveem.no&v=%2415068188390cXeVs%3Atapenet.org&v=%2414847676930qrrMY%3Adeuxfleurs.fr
2017-10-05 14:58:47,382 - synapse.http.outbound - 239 - INFO - GET-4039005 - {GET-O-6237} [jki.re] Result: 200 OK
2017-10-05 14:58:47,801 - synapse.access.http.8080 - 91 - INFO - GET-4039005 - 8080 - {@user:matrix.org} Processed request: 2327ms (283ms, 0ms) (3ms/2) 9809B 200 "GET /_matrix/client/r0/rooms/!cURbafjkfsMDVwdRDQ:matrix.org/messages?access_token=<redacted>&from=t353343-282585688_300957161_39496_54465289_15147448_147235_1853980_2348789&dir=b&limit=20 HTTP/1.1" "Mozilla/5.0"

It appears that there are five events in HQ in this timeframe which have prev_event links to events we do not have, so /messages requests in that area prompt a backfill attempt. jki.re is chosen somewhat arbitrarily, but it doesn't have the missing events either, so we make no forward progress. We then repeat this process for each subsequent client request.
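
For illustration, here is a heavily simplified Python sketch of the flow described above. All of the names and signatures (pick_backfill_server, federation_backfill, this version of maybe_backfill) are invented for the sketch and are not Synapse's actual implementation; it only models the behaviour visible in the logs.

import random

def pick_backfill_server(servers_in_room):
    # A single remote server in the room is chosen somewhat arbitrarily.
    return random.choice(servers_in_room)

def federation_backfill(server, room_id, extremities, limit=100):
    # Stand-in for the federation /backfill call; here it models a server
    # (like jki.re in the logs above) that does not have the missing events.
    return []

def maybe_backfill(room_id, servers_in_room, backward_extremities):
    server = pick_backfill_server(servers_in_room)
    events = federation_backfill(server, room_id, backward_extremities)
    # The empty result is accepted as-is: no other server is asked, the
    # failure is not recorded, and the next /messages request repeats this.
    return events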

There appear to be four separate problems here:

  • The 300ms of CPU time is somewhat inexplicable. I haven't figured out what it's doing, or whether it happens before or after the callout to jki.re. (/messages requests which do not trigger a backfill attempt do not seem to cause this CPU time usage, so it kinda has to be in the maybe_backfill codepath... but I can't see where)
  • When jki.re returns an empty list rather than any of the events we asked for, we are happy with that and don't ask any of the other servers in the room.
  • We don't record failures when backfilling, so we are doomed to repeat historical mistakes.
  • There is no attempt to ratelimit how often we do this (see the sketch after this list).
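
A minimal sketch of what the last two bullets could look like: record which (room, server) backfill attempts recently failed, and skip them for a while. The names and the 60-second interval are invented for illustration and are not part of Synapse.

import time

BACKFILL_RETRY_INTERVAL = 60  # seconds; arbitrary, for illustration only

# (room_id, server) -> timestamp of the last failed backfill attempt
_recent_backfill_failures = {}

def should_attempt_backfill(room_id, server, now=None):
    # Return False if this server recently failed to backfill this room.
    now = time.time() if now is None else now
    failed_at = _recent_backfill_failures.get((room_id, server))
    return failed_at is None or now - failed_at >= BACKFILL_RETRY_INTERVAL

def record_backfill_failure(room_id, server, now=None):
    # Remember that this server returned nothing useful for this room.
    _recent_backfill_failures[(room_id, server)] = time.time() if now is None else now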

Labels

  • A-Federation
  • A-Messages-Endpoint: /messages client API endpoint (`RoomMessageListRestServlet`) (which also triggers /backfill)
  • A-Performance: Performance, both client-facing and admin-facing
  • T-Task: Refactoring, removal, replacement, enabling or disabling functionality, other engineering tasks.
