Charset settings #48

@duelistone

In Protocol.hs, the following passages use defaultInputType to parse incoming data into a string. I'm particularly interested in the case where that data is multipart form data.

-- | Builds an 'Input' object for a simple value.
simpleInput :: String -> Input
simpleInput v = Input { inputValue = BS.pack v,
                        inputFilename = Nothing,
                        inputContentType = defaultInputType }

-- | The default content-type for variables.
defaultInputType :: ContentType
defaultInputType = ContentType "text" "plain" [("charset","windows-1252")]

bodyPartToInput :: BodyPart -> (String,Input)
bodyPartToInput (BodyPart hs b) =
    case getContentDisposition hs of
              Just (ContentDisposition "form-data" ps) ->
                  (lookupOrNil "name" ps,
                   Input { inputValue = b,
                           inputFilename = lookup "filename" ps,
                           inputContentType = ctype })
              _ -> ("ERROR",simpleInput "ERROR") -- FIXME: report error
    where ctype = fromMaybe defaultInputType (getContentType hs)

A couple of things here are outdated, as far as I understand.

  1. The default input charset is windows-1252, but utf-8 is now the default, at least for form values.
  2. The bodyPartToInput function lets a body part override the default with whatever getContentType hs finds. However, that only checks the part's Content-Type header, whereas the current convention (HTML 5) for communicating the charset of a multipart form is a special charset field (the _charset_ entry described in RFC 7578, section 4.6).
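To make the second point concrete, here is a minimal sketch of how a default charset could be picked from the decoded form fields. The function name formCharset and the plain association-list representation of the fields are hypothetical and are not part of this library; the utf-8 fallback is my assumption based on the specs linked in this issue.

```haskell
import Data.Char (toLower)
import qualified Data.ByteString.Lazy.Char8 as BS

-- Hypothetical helper, not part of the library: choose the charset for
-- text fields from a "_charset_" form entry (RFC 7578, section 4.6).
-- Falls back to "utf-8" (an assumption) when the entry is missing or empty.
formCharset :: [(String, BS.ByteString)] -> String
formCharset fields =
  case lookup "_charset_" fields of
    Just v | not (BS.null v) -> map toLower (BS.unpack v)
    _                        -> "utf-8"
```

The charset name is lowercased only so that callers can compare it without worrying about case.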

Here are links to the corresponding parts of the form data and HTML 5 specs that I'm looking at.

https://datatracker.ietf.org/doc/html/rfc7578#section-4.6
https://html.spec.whatwg.org/multipage/form-control-infrastructure.html#constructing-form-data-set

These issues make it difficult to interpret incoming form data that contains non-ASCII Unicode characters.

Edit: After another hour of looking through the source code and some tests, it seems that the code above has little to do with the final conversion of the message body into a String. Instead, the code in this library correctly produces a ByteString and leaves the job of converting it into a String to the Data.ByteString.Lazy.Char8 module, which works one byte per character (it truncates characters to 8 bits and does no multi-byte decoding), as far as I understand. Given that, the windows-1252 default makes some sense, and the content type field in Input is only for the file type when a file is provided, so reading the header should be fine.
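A small sketch of the Char8 behaviour described above, using only the bytestring package. The names utf8EAcute, unpackedEAcute and packedEuro are made up for illustration.

```haskell
import Data.Word (Word8)
import qualified Data.ByteString.Lazy as BL
import qualified Data.ByteString.Lazy.Char8 as BS

-- The UTF-8 encoding of "é" (U+00E9) is the two bytes 0xC3 0xA9.
utf8EAcute :: BL.ByteString
utf8EAcute = BL.pack [0xC3, 0xA9]

-- Char8 'unpack' maps every byte to its own Char, so the two-byte
-- sequence comes back as two characters instead of one.
unpackedEAcute :: String
unpackedEAcute = BS.unpack utf8EAcute

-- Char8 'pack' keeps only the low 8 bits of each Char, so the euro
-- sign (U+20AC) collapses to the single byte 0xAC.
packedEuro :: [Word8]
packedEuro = BL.unpack (BS.pack "\8364")
```

So a String obtained this way holds the raw bytes of the request, one per Char, rather than decoded Unicode text.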

On the one hand, my practical issue is now resolved, as I'll just use the functions that return a ByteString instead of a String. On the other hand, perhaps the documentation should mention that the String functions are only expected to work when CGI requests stick to a limited (single-byte) character set. Or am I missing something?
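For reference, a sketch of the workaround, assuming the request bytes are UTF-8 and that the text package is available. The helper name decodeInputUtf8 is hypothetical; its argument stands for whatever lazy ByteString the ByteString-returning functions hand back.

```haskell
import qualified Data.ByteString.Lazy as BL
import qualified Data.Text.Lazy as TL
import qualified Data.Text.Lazy.Encoding as TLE

-- Hypothetical helper: decode a raw input value as UTF-8 instead of
-- going through the byte-per-character String conversion.
-- Returns Nothing when the bytes are not valid UTF-8.
decodeInputUtf8 :: BL.ByteString -> Maybe String
decodeInputUtf8 bytes =
  case TLE.decodeUtf8' bytes of
    Left _     -> Nothing
    Right text -> Just (TL.unpack text)
```

Returning Maybe keeps malformed input visible to the caller; a lenient decoder that substitutes replacement characters would be another reasonable choice.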
