Description
In Protocol.hs, the following passages use defaultInputType to parse incoming data into a string. I'm particularly interested in the case where that data is multipart form data.
-- | Builds an 'Input' object for a simple value.
simpleInput :: String -> Input
simpleInput v = Input { inputValue = BS.pack v,
                        inputFilename = Nothing,
                        inputContentType = defaultInputType }

-- | The default content-type for variables.
defaultInputType :: ContentType
defaultInputType = ContentType "text" "plain" [("charset","windows-1252")]
bodyPartToInput :: BodyPart -> (String,Input)
bodyPartToInput (BodyPart hs b) =
    case getContentDisposition hs of
      Just (ContentDisposition "form-data" ps) ->
          (lookupOrNil "name" ps,
           Input { inputValue = b,
                   inputFilename = lookup "filename" ps,
                   inputContentType = ctype })
      _ -> ("ERROR",simpleInput "ERROR") -- FIXME: report error
    where ctype = fromMaybe defaultInputType (getContentType hs)
A couple of things here are outdated, as far as I understand.
- The default input charset is windows-1252, but UTF-8 is the default now, at least for values in forms.
- The bodyPartToInput function overrides the default when getContentType hs finds a Content-Type header on the body part. However, that header is the only place it looks, while the current convention (in HTML 5) for communicating the charset of a multipart form is to send it in a special _charset_ field.
Here are links to the corresponding parts of the form data and HTML 5 specs which I'm looking at:
https://datatracker.ietf.org/doc/html/rfc7578#section-4.6
https://html.spec.whatwg.org/multipage/form-control-infrastructure.html#constructing-form-data-set
These issues make it difficult to interpret incoming form data with (non-ascii) Unicode characters.
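To make the second point concrete, here is a rough sketch of what honouring a _charset_ field could look like. It is not code from this library: formCharset and decodeField are hypothetical helpers, the fields are assumed to be already parsed into (name, value) pairs, and the text package is assumed to be available. Only UTF-8 is decoded properly; any other charset falls back to one character per byte, standing in for a real windows-1252 decoder.

```haskell
import Data.Char (toLower)
import qualified Data.ByteString.Lazy as BL
import qualified Data.ByteString.Lazy.Char8 as BLC
import qualified Data.Text.Lazy as TL
import qualified Data.Text.Lazy.Encoding as TLE
import Data.Text.Encoding.Error (lenientDecode)

-- | Hypothetical helper: the (lower-cased) charset named by a
-- "_charset_" form field, if the form contains one.
formCharset :: [(String, BL.ByteString)] -> Maybe String
formCharset fields = map toLower . BLC.unpack <$> lookup "_charset_" fields

-- | Hypothetical helper: decode one field value.
-- Assumption: a missing charset is treated as UTF-8. Any other charset
-- falls back to one Char per byte, which is NOT a real windows-1252 decoder.
decodeField :: Maybe String -> BL.ByteString -> String
decodeField charset bytes
  | charset `elem` [Nothing, Just "utf-8"] =
      -- Invalid byte sequences become U+FFFD instead of throwing.
      TL.unpack (TLE.decodeUtf8With lenientDecode bytes)
  | otherwise = BLC.unpack bytes
```

With fields such as [("_charset_", "UTF-8"), ("name", <bytes>)], one would call decodeField (formCharset fields) on each value.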
Edit: After another hour of reading the source code and running some tests, it seems that the code above has little to do with the final conversion of the message body into a String. The library correctly produces a ByteString and leaves the conversion to a String to the Data.ByteString.Lazy.Char8 module, which, as far as I understand, treats every byte as one character (and truncates characters above 255 when packing), so multi-byte UTF-8 sequences don't survive the round trip. Given that, the windows-1252 default makes some sense, and the content type field in Input is only meant for the file type when a file is provided, so reading the header should be fine.
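For anyone who wants to see the Char8 behaviour in isolation, here is a small sketch. It only uses the bytestring and text packages (the latter is my assumption, the library itself doesn't need it), and the names utf8Bytes, viaChar8, viaUtf8 and truncated are made up for the example.

```haskell
import Data.Word (Word8)
import qualified Data.ByteString.Lazy as BL
import qualified Data.ByteString.Lazy.Char8 as BLC
import qualified Data.Text.Lazy as TL
import qualified Data.Text.Lazy.Encoding as TLE

-- "é" (U+00E9) encoded as UTF-8 is the two bytes 0xC3 0xA9.
utf8Bytes :: BL.ByteString
utf8Bytes = BL.pack [0xC3, 0xA9]

-- Char8.unpack maps every byte to one Char, so we get two characters.
viaChar8 :: String
viaChar8 = BLC.unpack utf8Bytes                 -- "\195\169"

-- Decoding the same bytes as UTF-8 recovers the single intended character.
viaUtf8 :: String
viaUtf8 = TL.unpack (TLE.decodeUtf8 utf8Bytes)  -- "\233"

-- Char8.pack keeps only the low 8 bits of each Char: U+20AC becomes 0xAC.
truncated :: [Word8]
truncated = BL.unpack (BLC.pack "\x20AC")       -- [172]
```

So a String obtained through Char8 is only faithful when every character fits in a single byte, which is why working with the ByteString directly and decoding it explicitly sidesteps the problem.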
On the one hand, my practical issue is now resolved, as I'll just use the functions that return a ByteString instead of a String. On the other, perhaps the documentation should mention that the String functions only work reliably if the CGI requests stick to a limited character set. Or am I missing something?