Actual behaviour:
When Akka HTTP parses a Content-Disposition header, it follows RFC6266 and applies RFC5987 encoding to non-ASCII characters in its filename field.
I.e., when it gets a Content-Disposition header with a filename field that contains non-ASCII characters, it generates a filename* field containing the filename content as percent-encoded UTF-8, as proposed in RFC5987.
In addition, it converts all non-ASCII characters to ? in the original filename field, following this RFC6266 recommendation:
When a "filename*" parameter is sent, to also generate a "filename" parameter as a fallback for user agents that do not support the "filename*" form, if possible. This can be done by substituting characters with US-ASCII sequences (e.g., Unicode character point U+00E4 (LATIN SMALL LETTER A WITH DIARESIS) by "ae"). Note that this may not be possible in some locales.
akka-http/akka-http-core/src/main/scala/akka/http/scaladsl/model/headers/headers.scala
Lines 491 to 499 in 7638ab4
withExtParamsSorted foreach {
  case (k, v) if k == "filename" =>
    r ~~ "; " ~~ k ~~ '=' ~~ '"'
    r.putReplaced(v, keep = safeChars, placeholder = '?') ~~ '"'
  case (k, v) if k endsWith "*" =>
    r ~~ "; " ~~ k ~~ '=' ~~ "UTF-8''"
    UriRendering.encode(r, v, UTF8, keep = `attr-char`, replaceSpaces = false)
  case (k, v) => r ~~ "; " ~~ k ~~ '=' ~~#! v
}
As a result, if I send a multipart request with non-ASCII characters in the filename field via curl, curl itself sends something like this:
Content-Disposition: form-data; name="test0"; filename="my_файл_123!.txt"\r\n
And when Akka HTTP parses it, it modifies it this way:
Content-Disposition: form-data; filename="my_????_123!.txt"; filename*=UTF-8''my_%D1%84%D0%B0%D0%B9%D0%BB_123!.txt; name="test0"
Notice that all non-ASCII characters were turned into ?. If my filename contained only non-ASCII characters, the resulting filename would be just ????.txt, regardless of whether the original was файл.txt or лайф.txt.
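To make the collision concrete, here is a minimal sketch in plain Scala that mimics the two rendering branches quoted above without depending on Akka HTTP. The object and method names are hypothetical, and the "safe" character set is an assumption (printable US-ASCII without the double quote and backslash); the real safeChars in Akka HTTP may differ.

```scala
import java.nio.charset.StandardCharsets.UTF_8

object PlaceholderFallback {
  // Assumption: printable US-ASCII except '"' and '\' counts as safe.
  private def isSafe(c: Char): Boolean =
    c >= 0x20 && c < 0x7f && c != '"' && c != '\\'

  // RFC 5987 attr-char: ALPHA / DIGIT / a small set of punctuation.
  private def isAttrChar(c: Char): Boolean =
    (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
      (c >= '0' && c <= '9') || "!#$&+-.^_`|~".indexOf(c) >= 0

  // Mimics the `filename` branch: every unsafe character becomes '?'.
  def fallback(name: String): String =
    name.map(c => if (isSafe(c)) c else '?')

  // Mimics the `filename*` branch: UTF-8 bytes, percent-encoded unless attr-char.
  def extValue(name: String): String = {
    val sb = new StringBuilder("UTF-8''")
    name.getBytes(UTF_8).foreach { b =>
      val c = (b & 0xff).toChar
      if (isAttrChar(c)) sb += c else sb ++= f"%%${b & 0xff}%02X"
    }
    sb.toString
  }
}
```

With this sketch, fallback("файл.txt") and fallback("лайф.txt") both yield ????.txt, while extValue still distinguishes the two names. A client that ignores filename* therefore cannot tell the files apart.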
The issue:
The latest HTML5 standard says:
For details on how to interpret multipart/form-data payloads, see RFC 7578. [RFC7578]
And RFC7578 explicitly forbids the use of RFC5987 for the filename field of the Content-Disposition header in the form-data case:
NOTE: The encoding method described in [RFC5987], which would add a "filename*" parameter to the Content-Disposition header field, MUST NOT be used.
Instead it proposes to use percent-encoding:
In most multipart types, the MIME header fields in each part are restricted to US-ASCII; for compatibility with those systems, file names normally visible to users MAY be encoded using the percent-encoding method in Section 2, following how a "file:" URI [URI-SCHEME] might be encoded.
And this percent-encoding is described this way:
Within this specification, "percent-encoding" (as defined in [RFC3986]) is offered as a possible way of encoding characters in file names that are otherwise disallowed, including non-ASCII characters, spaces, control characters, and so forth. The encoding is created replacing each non-ASCII or disallowed character with a sequence, where each byte of the UTF-8 encoding of the character is represented by a percent-sign (%) followed by the (case-insensitive) hexadecimal of that byte.
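The percent-encoding described in that quote could be sketched as follows. This is only an illustration: RFC7578 does not pin down the exact set of characters that stay unencoded, so the sketch assumes the RFC3986 "unreserved" set is kept and everything else is encoded. The names Rfc7578Filename, encode and decode are hypothetical and not part of any library.

```scala
import java.io.ByteArrayOutputStream
import java.nio.charset.StandardCharsets.UTF_8

object Rfc7578Filename {
  // Assumption: only RFC 3986 "unreserved" characters are left as-is.
  private def isUnreserved(c: Char): Boolean =
    (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
      (c >= '0' && c <= '9') || "-._~".indexOf(c) >= 0

  // Each byte of the UTF-8 encoding becomes '%' followed by two hex digits.
  def encode(name: String): String = {
    val sb = new StringBuilder
    name.getBytes(UTF_8).foreach { b =>
      val c = (b & 0xff).toChar
      if (isUnreserved(c)) sb += c else sb ++= f"%%${b & 0xff}%02X"
    }
    sb.toString
  }

  private def isHex(c: Char): Boolean = Character.digit(c, 16) >= 0

  // Reverses encode; a '%' not followed by two hex digits is kept literally.
  def decode(encoded: String): String = {
    val out = new ByteArrayOutputStream
    var i = 0
    while (i < encoded.length) {
      val c = encoded.charAt(i)
      if (c == '%' && i + 2 < encoded.length &&
          isHex(encoded.charAt(i + 1)) && isHex(encoded.charAt(i + 2))) {
        out.write(Integer.parseInt(encoded.substring(i + 1, i + 3), 16))
        i += 3
      } else {
        val cp = encoded.codePointAt(i)
        out.write(new String(Character.toChars(cp)).getBytes(UTF_8))
        i += Character.charCount(cp)
      }
    }
    new String(out.toByteArray, UTF_8)
  }
}
```

Under these assumptions, encode("my_файл_123!.txt") gives my_%D1%84%D0%B0%D0%B9%D0%BB_123%21.txt, and decode restores the original name. Unlike the ? placeholder, the encoded form keeps файл.txt and лайф.txt distinct.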
Some clients follow this standard, so they no longer expect a filename* field, since it is forbidden. Instead, they expect to see percent-encoding in the filename field. If non-ASCII characters in filename are simply replaced with a generic placeholder, this can cause problems: any file whose name consists of 4 non-ASCII characters becomes just ???? for these clients.
Proposals:
Unfortunately there is no standard approach to solving this issue. Other libraries, like http4s, playframework, etc, use slightly different approaches. I think that generally there are two ways to improve the situation:
1. Use the RFC7578 approach
It would probably be the "right" thing to do, but it is fairly dangerous, because it would break backwards compatibility for legacy clients that rely on the filename* field, i.e. the RFC5987 approach. That is clearly not the desired outcome.
2. Keep filename*, but apply percent-encoding to filename
This would still violate RFC7578, which says MUST NOT about using the RFC5987 encoding method. However, it would at least unblock clients that expect the filename field to be percent-encoded.
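As a rough illustration of the second option, the sketch below renders a form-data Content-Disposition value with a percent-encoded filename and, only when the name contains non-ASCII characters, an additional filename*. It is plain Scala and not a patch against Akka HTTP: HybridContentDisposition and render are hypothetical names, the parameter order is arbitrary, and the set of characters kept inside the quoted filename is an assumption.

```scala
import java.nio.charset.StandardCharsets.UTF_8

object HybridContentDisposition {
  // Percent-encodes every UTF-8 byte whose ASCII character is rejected by `keep`;
  // bytes >= 0x80 are always encoded.
  private def percentEncode(s: String, keep: Char => Boolean): String = {
    val sb = new StringBuilder
    s.getBytes(UTF_8).foreach { b =>
      val c = (b & 0xff).toChar
      if (c < 0x80 && keep(c)) sb += c else sb ++= f"%%${b & 0xff}%02X"
    }
    sb.toString
  }

  // Assumption: printable US-ASCII stays as-is, except '"', '\' and '%'.
  private def keepInQuoted(c: Char): Boolean =
    c >= 0x20 && c < 0x7f && c != '"' && c != '\\' && c != '%'

  // RFC 5987 attr-char.
  private def isAttrChar(c: Char): Boolean =
    (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
      (c >= '0' && c <= '9') || "!#$&+-.^_`|~".indexOf(c) >= 0

  // Assumption: `name` is plain ASCII without quotes, so it is not escaped here.
  def render(name: String, filename: String): String = {
    val quoted = "\"" + percentEncode(filename, keepInQuoted) + "\""
    val base = "form-data; name=\"" + name + "\"; filename=" + quoted
    if (filename.forall(_ < 0x80)) base
    else base + "; filename*=UTF-8''" + percentEncode(filename, isAttrChar)
  }
}
```

For the curl example above, render("test0", "my_файл_123!.txt") produces a filename of my_%D1%84%D0%B0%D0%B9%D0%BB_123!.txt alongside the same filename* value that Akka HTTP emits today, so legacy clients keep working while RFC7578-style clients get a decodable filename.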
But I am not sure which approach would be the best. Probably it deserves some community discussion to figure out the best way to move forward and resolve the issue.