Character Encoding in AOLserver 3.0 and ACS

by Rob Mayoff


This is a work in progress. It is incomplete, inconsistent, and subject to radical change.

Note that this document applies only to the Tcl 8 version of AOLserver, also known as nsd8x, because Tcl 7 has no internationalization support. This document is also mainly concerned with the AOLserver Tcl API, because that is what we use at ArsDigita. There are probably problems in the C API as well that are not covered here.

Parts of this document where I am not sure of something and am specifically seeking advice have a blue background like this. If you have any feedback please e-mail me.


The Problem

Here's a simple example of the problem: you have a file on disk, named "hello.html" and stored using the ISO-8859-1 encoding:
Hello. My name is Günther Schmidt.
(That should say "Gunther" with an umlaut on the "u".) Since it's in ISO-8859-1 encoding, the u with umlaut is stored as one byte with value xFC. Suppose you send this file to the user using this script:
set fd [open /web/pages/hello.html r]
set content [read $fd [file size /web/pages/hello.html]]
close $fd
ns_return 200 text/html $content
Then the user will probably see this:
Hello. My name is Günther Schmidt.
(That should say "Gunther" with an umlaut on the "u".) But suppose you send this file using this script:
set fd [open /web/pages/hello.html r]
set content [read $fd [file size /web/pages/hello.html]]
close $fd
regsub {Hello\.} $content {Hello!} content
ns_return 200 text/html $content
Then the user will probably see this:
Hello! My name is GÃ¼nther Schmidt.
(That is, the "ü" comes out as "Ã" (an "A" with a tilde) followed by "¼" (a one-quarter fraction).) What happened? The reason it worked in the first case is that by default, AOLserver just ships out the raw bytes from the (ISO-8859-1-encoded) file, and the HTTP standard says that the client must assume a charset of ISO-8859-1 if no other charset is specified. The file encoding and the browser encoding matched, and AOLserver sent the data unmodified, so everything worked.

The second case is different. It turns out that Tcl 8.1 and later use Unicode. The interpreter normally stores strings using the UTF-8 encoding (which uses a variable number of bytes per character), and sometimes converts them to UCS-2 encoding (which uses 16-bit "wide characters"). The regsub command is one of those cases where conversions are involved. First, regsub converted the string to UCS-2. Tcl's UTF-8 parser is lenient, so the transformation ended up translating xFC into x00FC. (This happens to be the correct translation because UCS-2 is a superset of ISO-8859-1.) Then regsub did its matching and substitution. Then it converted the UCS-2 representation back to UTF-8. The UTF-8 encoding of x00FC is xC3 xBC. AOLserver does not know anything about UTF-8; it just sends whatever bytes you give it. In ISO-8859-1, xC3 is "Ã" (an "A" with a tilde) and xBC is "¼" (a one-quarter fraction), which is exactly what the user saw.

So regsub didn't do anything wrong. We gave it garbage (a non-UTF-8 string), so it gave us garbage. How do we solve this problem? We need to make sure that all of AOLserver's textual input is translated to its UTF-8 representation and that the UTF-8 is translated to the appropriate character encoding on output.
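
One way to fix this particular example (a sketch using the techniques described in the rest of this document) is to tell Tcl the file's encoding when you read it, and to name the charset when you send it:
set fd [open /web/pages/hello.html r]
fconfigure $fd -encoding iso8859-1   ;# the file is stored as ISO-8859-1
set content [read $fd]
close $fd
regsub {Hello\.} $content {Hello!} content
ns_return 200 "text/html; charset=iso-8859-1" $content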

Terminology

A character encoding is a mapping from a set of characters to a set of octet sequences. US-ASCII maps all of its characters to a single octet each. UTF-8 maps its characters to a variable number of octets each.

Charset is synonymous with "character encoding"; Internet standards use this term.

Tcl 8.1 and later use Unicode and UTF-8 internally and include support for converting between character encodings. The Tcl names for various encodings are different from the Internet standard names. So, in this document, I typically use the term "encoding" when I am referring to Tcl, and "charset" when I am referring to an Internet protocol feature.
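
For example, Tcl calls ISO-8859-1 "iso8859-1" and Shift-JIS "shiftjis", while the corresponding Internet charset names are "iso-8859-1" and "shift_jis". Tcl's encoding command lets you list the available encodings and convert explicitly ($bytes below is just a placeholder):
encoding names                           ;# list the encodings this Tcl build supports
encoding convertto iso8859-1 "Günther"   ;# Tcl string -> ISO-8859-1 bytes
encoding convertfrom iso8859-1 $bytes    ;# ISO-8859-1 bytes -> Tcl string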

Database Access

For database access, the only sane choice is to use a database that supports UTF-8. Then Tcl strings can be passed to and from the database client library unmodified. Trust me, you just want to use a UTF-8 database.

Configuration Files

AOLserver reads its configuration files (both Tcl and ini-format) with no character encoding translation. This means that you must store AOLserver configuration files in UTF-8.
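
If an existing config file contains non-ASCII characters in some other encoding, a few lines of Tcl will convert it once. This is only a sketch: the file names are placeholders, and the old file is assumed to be ISO-8859-1.
set in [open nsd-latin1.tcl r]
fconfigure $in -encoding iso8859-1   ;# encoding the old file was saved in (assumed)
set out [open nsd-utf8.tcl w]
fconfigure $out -encoding utf-8
puts -nonewline $out [read $in]
close $in
close $out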

Content Files

By "content file", I mean a file containing data to be sent to the client, not a file containing a program to be run. So an HTML or JPEG file is a content file, but a Tcl script is not.

AOLserver has several APIs for sending the contents of a file directly to the client. All of them send the contents of the file back to the client unmodified - no character encoding translation is performed. This means that it is up to you to ensure that the file's encoding is the same as the encoding the client expects.

The safest thing is to use only US-ASCII bytes in your text files - bytes with the high bit clear. Just about every character encoding you're likely to run across on the Web will be a superset of US-ASCII, so no matter what charset the client is expecting, your content will probably be displayed correctly. If you are sending an HTML (or XML) file, it can still represent any Unicode character using the &#nnn; notation (for example, &#252; produces the u with umlaut). However, if you have non-HTML files, or you don't want to deal with all those character references, you'll have to make sure your client knows what character set you're sending.

The client knows what character set to expect from the Content-Type header. You're probably used to seeing a header like this:

Content-Type: text/html
In fact, the header can specify a charset like this:
Content-Type: text/html; charset=iso-8859-1
If the header does not include a charset parameter, the HTTP standard specifies that the character set must be ISO-8859-1. In practice, though, clients may try to guess a character set, or they may let the user override the default character set. So you should always specify a character set for text content.

Typically, you determine the content-type to send for a file by calling ns_guesstype on it. ns_guesstype looks up the file extension in AOLserver's file extension table to pick the content-type. The default table is in the AOLserver manual. Some of the default mappings are:

Extension   Type
.html       text/html
.txt        text/plain
.jpg        image/jpeg
As you can see, no charset is specified for the text file types. That means that you can't predict how your text will appear on the user's screen unless you stick to US-ASCII bytes in your files. So you should override the mappings in your AOLserver config file. For example, if all your text files use the ISO-8859-1 encoding, you should put this in your config file:
nsd.ini:
[ns/mimetypes]
.html=text/html; charset=iso-8859-1
.txt=text/plain; charset=iso-8859-1

nsd.tcl:
ns_section ns/mimetypes
ns_param .html "text/html; charset=iso-8859-1"
ns_param .txt "text/plain; charset=iso-8859-1"
But all your text files might not use the same encoding. If you have files in various encodings, you need to make up extensions to identify the different encodings, rename your files accordingly, and map the extensions in your config file. For example:
nsd.ini:
[ns/mimetypes]
.html=text/html; charset=iso-8859-1
.txt=text/plain; charset=iso-8859-1
.html_sj=text/html; charset=shift_jis
.txt_sj=text/plain; charset=shift_jis
.html_ej=text/html; charset=euc-jp
.txt_ej=text/plain; charset=euc-jp

nsd.tcl:
ns_section ns/mimetypes
ns_param .html "text/html; charset=iso-8859-1"
ns_param .txt "text/plain; charset=iso-8859-1"
ns_param .html_sj "text/html; charset=shift_jis"
ns_param .txt_sj "text/plain; charset=shift_jis"
ns_param .html_ej "text/html; charset=euc-jp"
ns_param .txt_ej "text/plain; charset=euc-jp"
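
With mappings like these in place, ns_guesstype returns the whole configured value, charset parameter and all:
ns_guesstype japanesetext.html_sj   ;# returns "text/html; charset=shift_jis"
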
If you wish to translate the contents of a file to some other charset when you send it, you can use Tcl's file handling:
set fd [open somefile.html_sj r]
fconfigure $fd -encoding shiftjis
set html [read $fd [file size somefile.html_sj]]
close $fd
ns_return 200 "text/html; charset=euc-jp" $html

XXX ACS: ad_serve_html_file

Output from Tcl

Your Tcl programs (Tcl files, filters, and registered procs) can send content to the client using a number of commands, including ns_return, ns_respond, and ns_write. The commands for sending files are discussed under Content Files.

Tcl stores strings in memory using UTF-8. However, when you send content to the client from Tcl, you may not want the client to receive UTF-8; the client may not support it. So AOLserver can translate UTF-8 to a different charset.

If you use ns_return or ns_respond to send a Tcl string to the client, AOLserver determines what character set to use by examining the content type you specify:

  1. If your content-type includes a charset parameter, then AOLserver translates the string to that charset.
  2. Otherwise, if your content-type is text/anything, then AOLserver translates the string to the charset specified in the config file by ns/parameters/OutputCharset (iso-8859-1 by default).
  3. Otherwise, AOLserver sends the string unmodified.

In the second instance, where AOLserver uses the ns/parameters/OutputCharset, if ns/parameters/HackContentType is also set to true, then AOLserver will modify the Content-Type header to include the charset parameter. HackContentType is set by default, and I strongly recommend leaving it set, because it's always safer to tell the client explicitly what charset you are sending.

For example, the default configuration is equivalent to this:

[ns/parameters]
OutputCharset=iso-8859-1
HackContentType=true
So if you run this command:
ns_return 200 text/html $html
This header will be sent:
Content-Type: text/html; charset=iso-8859-1
And the contents of $html will be converted to the ISO-8859-1 encoding as they are sent to the client.
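
If you name a charset yourself, rule 1 applies instead and OutputCharset is ignored. For example, this command:
ns_return 200 "text/html; charset=shift_jis" $html
sends this header and converts the contents of $html to Shift-JIS:
Content-Type: text/html; charset=shift_jis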

If you write the headers to the client with ns_write instead of letting AOLserver do it (via ns_return or ns_respond), then AOLserver does not parse the content-type. You must explicitly tell it what charset to use immediately after you write the headers, by calling ns_startcontent in one of these forms:

ns_startcontent
Tells AOLserver that you have written the headers and do not wish the content to be translated.
ns_startcontent -charset charset
Tells AOLserver that you have written the headers and wish the following content to be translated to the specified charset.
ns_startcontent -type content-type
Tells AOLserver that you have written the headers and wish the following content to be translated to the charset specified by content-type, which should be the same value you sent to the client in the Content-Type header. If content-type does not contain a charset parameter, AOLserver translates to ISO-8859-1.
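
For example, a minimal handler that writes its own headers and wants its output translated to Shift-JIS could look like this sketch ($html stands for content you have already built):
ns_write "HTTP/1.0 200 OK
Content-Type: text/html; charset=shift_jis
\n"
ns_startcontent -charset shift_jis
ns_write $html
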
The client may specify what charsets it accepts by sending an Accept-Charset header in its HTTP request. If the Accept-Charset header is missing, then the client is assumed to accept any charset. The ns_choosecharset command will return the best charset to use, taking into account the Accept-Charset header and the charsets supported by AOLserver. The syntax is
ns_choosecharset ?-preference charset-list?

The ns_choosecharset algorithm:

  1. Set preferred-charsets to the list of charsets specified by the -preference flag. If that flag was not given, use the config parameter ns/parameters/PreferredCharsets. If the config parameter is missing, use {utf-8 iso-8859-1}. The list order is significant.
  2. Set acceptable-charsets to the intersection of the Accept-Charset charsets and the charsets supported by AOLserver.
  3. If acceptable-charsets is empty, return the charset specified by config parameter ns/parameters/DefaultCharset, or iso-8859-1 by default.
  4. Choose the first charset from preferred-charsets that also appears in acceptable-charsets. Return that charset.
  5. If no charset in preferred-charsets also appears in acceptable-charsets, then choose the first charset listed in Accept-Charset that also appears in acceptable-charsets. Return that charset.
  6. (Note: the last step will always return a charset because acceptable-charsets can only contain charsets listed in Accept-Charset.)

Example:

# Assume japanesetext.html_sj is stored in Shift-JIS encoding.
set fd [open japanesetext.html_sj r]
fconfigure $fd -encoding shiftjis
set html [read $fd [file size japanesetext.html_sj]]
close $fd

set charset [ns_choosecharset -preference {utf-8 shift_jis euc-jp iso-2022-jp}]
set type "text/html; charset=$charset"
ns_write "HTTP/1.0 200 OK
Content-Type: $type
\n"
ns_startcontent -type $type
ns_write $html

URL Encoding

Whether a URL is made up of "characters" or "bytes" is a complex issue (see RFC 2396 for details). Ultimately, though, URIs are transmitted over the network, so they must be reduced to bytes. However, HTTP limits the set of bytes used to transmit a URL. URLs containing bytes outside that set must be encoded for transmission.

In URL encoding, one byte may be encoded as three bytes which in US-ASCII represent a percent character ("%") followed by two hexadecimal digits.

After a URL is decoded, any bytes less than x80 represent US-ASCII characters. The problem with URLs and URL encoding is that, historically, no standard defined what bytes x80 and above represent. Various proposals, such as the IURI Internet-Draft, propose using UTF-8 exclusively as the character encoding in URLs, but existing software does not work that way.

AOLserver's ns_urlencode and ns_urldecode choose the character encoding to use in one of three ways:

  1. If the command was invoked with a -charset flag, use that charset. For example:
    ns_urlencode -charset shift_jis "\u304b"
    Unicode character U+304B is HIRAGANA LETTER KA. In Shift-JIS this is encoded as x82 xA9, so the command returns the string "%82%A9".
  2. If no -charset flag was given, then the ns_urlcharset command determines what encoding is used. The ns_urlcharset command sets the default charset for the ns_urlencode and ns_urldecode commands for one connection. For example, these commands have the same result as the preceding example:
    ns_urlcharset shiftjis
    ns_urlencode "\u304b"
    The ns_urlcharset command is only valid when called from a connection thread. Do not call it from an ns_schedule_proc thread.
  3. If neither of the preceding steps specified a charset, then the AOLserver config parameter ns/parameters/URLCharset determines the charset. The default value for the parameter is "iso-8859-1".

A URL, as seen by AOLserver in an HTTP request, consists of two parts, the path and the query. For example:

/register/user-new.tcl?first_names=Rob&last_name=Mayoff
  path:  /register/user-new.tcl
  query: first_names=Rob&last_name=Mayoff

We will consider the path part and the query part separately.

URL Path

AOLserver decodes the path part of the URL in the HTTP request before determining how to handle the URL. It does not run any Tcl code in the connection thread first, so AOLserver always uses the charset specified by ns/parameters/URLCharset to decode the path. You must use the same charset to encode URLs you send out, or you will have problems.
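
For example, when you generate a link whose path contains non-ASCII characters, let ns_urlencode produce the path segment, since it uses URLCharset by default. In this sketch, /download/ and $filename are just placeholders:
set href "/download/[ns_urlencode $filename]"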

However, other people might link to you from their servers and might be careless about the character encodings. So the safest practice is to use only US-ASCII characters in your URL paths if you possibly can.

Form Data in application/x-www-form-urlencoded Format

Form data comes from one of two places: the query part of the URL (for a GET form) or the body of the request (for a POST form). Either way, form data is URL-encoded for transmission. AOLserver has no standard way to determine what charset the browser used to encode the data. Typically the browser uses whatever charset the HTML page containing the form was in. If the HTML page was sent without a charset in the Content-Type header, then the browser is supposed to use ISO-8859-1, but browsers often guess or let the user override that default. Always specify a charset when you send a text document to avoid this problem.

If you always send data in a single charset, and you always specify the charset in the Content-Type header, then it is safe to assume that form data is always encoded using that charset. Just make that your ns/parameters/URLCharset and don't worry about it.

If you cannot limit yourself to a single charset, then you need to use some other technique. No matter how you do it, you must call ns_urlcharset before calling ns_conn form or ns_getform. If you call ns_urlcharset after you've asked AOLserver for the form, it will not work retroactively.

Here are two ways you could determine the charset:

  1. Put a hidden field in the form whose value names the charset the page was sent in, and have ns_formfieldcharset read it when the form comes back.
  2. Record the charset in a cookie when you send the page, and have ns_cookiecharset read it when the form comes back.
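
Whichever way you get the charset, the order matters. Here is a sketch, where $charset is assumed to hold the value taken from the hidden field or the cookie:
ns_urlcharset $charset   ;# must run before the form is parsed
set form [ns_getform]
set first_names [ns_set get $form first_names]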

Form Data in multipart/form-data Format

The browser sends data in multipart/form-data format when the FORM tag says enctype='multipart/form-data'. This format is based on the MIME standard and allows file upload (which application/x-www-form-urlencoded does not).

Alas, multipart/form-data format is no better than application/x-www-form-urlencoded format as far as character encoding issues are concerned. The MIME multipart format allows each form field to include its own Content-Type header with a charset parameter, but in practice clients do not send any indication of the charset used. So we must resort to the same tricks to decide what charset the data is in: always use the same charset, or use a hidden field or a cookie to determine the charset.

The ns_formfieldcharset and ns_cookiecharset commands work for fields in multipart/form-data format except file upload fields. We cannot know what character set the user stores his files in, so we don't know how to translate an uploaded file to utf-8 (assuming the uploaded file is even a text file). So the temporary files created by ns_getform contain the exact bytes sent by the client.

If you hand non-UTF-8 data to the Oracle client library when it thinks you are handing it UTF-8 data, it may crash. So when you are inserting an uploaded file into a CLOB, it is imperative that you run the file contents through Tcl's encoder first. I have not figured out a satisfactory way to automate this yet.
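
Until then, you can do the translation by hand when you know (or are willing to assume) the uploaded file's charset. Here is a sketch that assumes a Shift-JIS text file, with $tmpfile standing for the temporary file name from ns_getform:
set fd [open $tmpfile r]
fconfigure $fd -encoding shiftjis   ;# assumed charset of the uploaded file
set clob_value [read $fd]
close $fd
# clob_value is now an ordinary Tcl (UTF-8) string, safe to hand to the
# database driver for the CLOB insert.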

Cookies

The browser should not mess with cookie values; it should just send back exactly the bytes you sent it. However, it is common to URL-encode cookie values that might otherwise have unsafe characters in them. You need to be careful to use the same character encoding for encoding and decoding cookie values.
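
One way to stay consistent is to pass the same -charset flag to ns_urlencode and ns_urldecode on both ends. UTF-8 here is only an example, and the variable names are placeholders:
ns_urlencode -charset utf-8 $cookie_value     ;# when setting the cookie
ns_urldecode -charset utf-8 $incoming_value   ;# when reading it back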

ns_httpopen / ns_httpget

The ns_httpopen command now parses the Content-Type header from the remote server and sets the encoding on the read file descriptor appropriately. If the content from the remote server is a text type but no charset was specified, then ns_httpopen uses the config parameter ns/parameters/HttpOpenCharset, which specifies the charset to assume the remote server is sending (iso-8859-1 by default).
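
For example, a plain fetch with ns_httpget (which is built on ns_httpopen) should hand you a Tcl string that has already been converted from the charset the remote server declared, or from HttpOpenCharset if it declared none:
set page [ns_httpget http://www.example.com/]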

mayoff@arsdigita.com