-
-
Notifications
You must be signed in to change notification settings - Fork 2.1k
backfill slow and expensive #2504
Description
Today we saw a pattern of requests from a client which appeared to have got stuck in a loop calling /messages on matrix HQ (but passing the same pagination parameters over and over again).
There's probably a bug in the client, but that isn't the point here. The main problem was that each request ended up burning 300ms of CPU time, which combined with the rate of requests, DoSed matrix.org.
Some example requests:
2017-10-05 14:58:49,106 - synapse.access.http.8080 - 91 - INFO - GET-4039099 - 8080 - {@user:matrix.org} Processed request: 1813ms (271ms, 0ms) (4ms/2) 9809B 200 "GET /_matrix/client/r0/rooms/!cURbafjkfsMDVwdRDQ:matrix.org/messages?access_token=<redacted>&from=t353343-282585688_300957161_39496_54465289_15147448_147235_1853980_2348789&dir=b&limit=20 HTTP/1.1" "Mozilla/5.0"
2017-10-05 14:58:49,520 - synapse.access.http.8080 - 91 - INFO - GET-4039004 - 8080 - {@user:matrix.org} Processed request: 4056ms (316ms, 0ms) (19ms/3) 8807B 200 "GET /_matrix/client/r0/rooms/!cURbafjkfsMDVwdRDQ:matrix.org/messages?access_token=<redacted>&from=t353323-282576586_300957161_39496_54465289_15147448_147235_1853980_2348789&dir=b&limit=20 HTTP/1.1" "Mozilla/5.0"
2017-10-05 14:58:49,522 - synapse.access.http.8080 - 91 - INFO - GET-4039030 - 8080 - {@user:matrix.org} Processed request: 3279ms (276ms, 0ms) (15ms/3) 8807B 200 "GET /_matrix/client/r0/rooms/!cURbafjkfsMDVwdRDQ:matrix.org/messages?access_token=<redacted>&from=t353323-282576586_300957161_39496_54465289_15147448_147235_1853980_2348789&dir=b&limit=20 HTTP/1.1" "Mozilla/5.0"
2017-10-05 14:58:49,572 - synapse.access.http.8080 - 91 - INFO - GET-4039061 - 8080 - {@user:matrix.org} Processed request: 2853ms (291ms, 0ms) (16ms/3) 8807B 200 "GET /_matrix/client/r0/rooms/!cURbafjkfsMDVwdRDQ:matrix.org/messages?access_token=<redacted>&from=t353323-282576586_300957161_39496_54465289_15147448_147235_1853980_2348789&dir=b&limit=20 HTTP/1.1" "Mozilla/5.0"
2017-10-05 14:58:50,089 - synapse.access.http.8080 - 91 - INFO - GET-4039231 - 8080 - {@user:matrix.org} Processed request: 1835ms (280ms, 0ms) (3ms/2) 9809B 200 "GET /_matrix/client/r0/rooms/!cURbafjkfsMDVwdRDQ:matrix.org/messages?access_token=<redacted>&from=t353343-282585688_300957161_39496_54465289_15147448_147235_1853980_2348789&dir=b&limit=20 HTTP/1.1" "Mozilla/5.0"
2017-10-05 14:58:51,605 - synapse.access.http.8080 - 91 - INFO - GET-4039382 - 8080 - {@user:matrix.org} Processed request: 2212ms (259ms, 0ms) (2ms/2) 8807B 200 "GET /_matrix/client/r0/rooms/!cURbafjkfsMDVwdRDQ:matrix.org/messages?access_token=<redacted>&from=t353323-282576586_300957161_39496_54465289_15147448_147235_1853980_2348789&dir=b&limit=20 HTTP/1.1" "Mozilla/5.0"
2017-10-05 14:58:52,000 - synapse.access.http.8080 - 91 - INFO - GET-4039452 - 8080 - {@user:matrix.org} Processed request: 1888ms (283ms, 0ms) (2ms/2) 8807B 200 "GET /_matrix/client/r0/rooms/!cURbafjkfsMDVwdRDQ:matrix.org/messages?access_token=<redacted>&from=t353323-282576586_300957161_39496_54465289_15147448_147235_1853980_2348789&dir=b&limit=20 HTTP/1.1" "Mozilla/5.0"
2017-10-05 14:58:52,003 - synapse.access.http.8080 - 91 - INFO - GET-4039453 - 8080 - {@user:matrix.org} Processed request: 1881ms (287ms, 0ms) (4ms/2) 8807B 200 "GET /_matrix/client/r0/rooms/!cURbafjkfsMDVwdRDQ:matrix.org/messages?access_token=<redacted>&from=t353323-282576586_300957161_39496_54465289_15147448_147235_1853980_2348789&dir=b&limit=20 HTTP/1.1" "Mozilla/5.0"
2017-10-05 14:58:52,241 - synapse.access.http.8080 - 91 - INFO - GET-4039454 - 8080 - {@user:matrix.org} Processed request: 2110ms (268ms, 0ms) (2ms/2) 8807B 200 "GET /_matrix/client/r0/rooms/!cURbafjkfsMDVwdRDQ:matrix.org/messages?access_token=<redacted>&from=t353323-282576586_300957161_39496_54465289_15147448_147235_1853980_2348789&dir=b&limit=20 HTTP/1.1" "Mozilla/5.0"
2017-10-05 14:58:52,481 - synapse.access.http.8080 - 91 - INFO - GET-4039483 - 8080 - {@user:matrix.org} Processed request: 1352ms (279ms, 0ms) (2ms/2) 8807B 200 "GET /_matrix/client/r0/rooms/!cURbafjkfsMDVwdRDQ:matrix.org/messages?access_token=<redacted>&from=t353323-282576586_300957161_39496_54465289_15147448_147235_1853980_2348789&dir=b&limit=20 HTTP/1.1" "Mozilla/5.0"
Each request prompted a backfill request to jki.re
, for example:
2017-10-05 14:58:45,474 - synapse.access.http.8080 - 59 - INFO - GET-4039005 - 8080 - Received request: GET /_matrix/client/r0/rooms/!cURbafjkfsMDVwdRDQ:matrix.org/messages?access_token=<redacted>&from=t353343-282585688_300957161_39496_54465289_15147448_147235_1853980_2348789&dir=b&limit=20
2017-10-05 14:58:46,189 - synapse.http.outbound - 154 - INFO - GET-4039005 - {GET-O-6237} [jki.re] Sending request: GET matrix://jki.re/_matrix/federation/v1/backfill/!cURbafjkfsMDVwdRDQ:matrix.org/?limit=100&v=%241504371387441TzMaJ%3Ahveem.no&v=%2414998105322MIGsD%3Aarquilla.net&v=%2415030841510raCcl%3Ahveem.no&v=%2415068188390cXeVs%3Atapenet.org&v=%2414847676930qrrMY%3Adeuxfleurs.fr
2017-10-05 14:58:47,382 - synapse.http.outbound - 239 - INFO - GET-4039005 - {GET-O-6237} [jki.re] Result: 200 OK
2017-10-05 14:58:47,801 - synapse.access.http.8080 - 91 - INFO - GET-4039005 - 8080 - {@user:matrix.org} Processed request: 2327ms (283ms, 0ms) (3ms/2) 9809B 200 "GET /_matrix/client/r0/rooms/!cURbafjkfsMDVwdRDQ:matrix.org/messages?access_token=<redacted>&from=t353343-282585688_300957161_39496_54465289_15147448_147235_1853980_2348789&dir=b&limit=20 HTTP/1.1" "Mozilla/5.0"
It appears that there are five events in HQ in this timeframe which have prev_event links to events we do not have, so /messages requests in the area prompt a backfill attempt; jki.re is chosen somewhat arbitrarily, but also doesn't have the missing events - so we don't make any forward progress. We then repeat this process for each subsequent client request.
There appear to be four separate problems here:
- The 300ms of CPU time is somewhat inexplicable. I haven't figured out what it's doing, or whether it happens before or after the callout to jki.re. (
/messages
requests which do not trigger a backfill attempt do not seem to cause this CPU time usage, so it kinda has to be in themaybe_backfill
codepath... but I can't see where) - When jki.re returns the empty list rather than any of the events we have asked for, we are happy with that and don't ask any of the other servers in the room.
- We don't record failures when backfilling, so are doomed to repeat historical mistakes.
- There is no attempt to ratelimit how often we do this.