Joshua Cranmer: Understanding email charsets
This time, I modified my data-collection scripts to make it much easier to mass-download NNTP messages. The first script lists all the newsgroups and then all the message IDs in those newsgroups, stuffing the results into a set to remove duplicates (cross-posts). The second script uses Python's nntplib package to attempt to download all of those messages. Of the 32,598,261 messages identified by the first script, I succeeded in obtaining 1,025,586 messages in full or in part. Some messages failed to download because they crashed nntplib (which appears to be unable to handle messages of unbounded length), and I suspect my newsserver connections may simply have timed out mid-download at times. Others failed because they expired before I could download them. All in all, 19,288 messages were not downloaded.
Analysis of the contents of messages was hampered by a strong desire to find techniques that mangle messages as little as possible. Prior experience with Python's message-parsing libraries led me to believe that they are rather poor at handling some of the crap that comes into existence, and the errors in nntplib suggest those problems haven't been fixed yet. The only message-parsing framework I truly trust to give me the necessary level of finesse is the JSMime library I'm writing, but that happens to be in the wrong language for this project. After reading some blog posts by Jeffrey Stedfast, though, I decided to give GMime a try instead of writing ad-hoc MIME parser #N.
Ultimately, I wrote a program to investigate the following questions about how messages operate in practice:
While those were the questions I originally sought to answer, I came up with others as I worked on my tool, some prompted by information I was already collecting anyway. The tool primarily uses GMime to convert the body parts to 8-bit text (no charset conversion) and to parse the Content-Type headers, which are really annoying to handle without writing a full parser. I used ICU for charset conversion and detection. RFC 2047 decoding is done largely by hand, since I needed very specific information that I couldn't convince GMime to give me. All the code I used is available upon request; the exact dataset is harder to transport, given that it is some 5.6GiB of data.
Other than GMime being built on GObject and exposing a C API, I can't complain much, although I didn't try to use it to do magic. Then again, in my experience (and as this post will probably convince you as well), you really want your MIME library to do charset magic for you, so in doing well for my needs, it's actually not doing well for a larger audience. ICU's C API similarly makes me want to complain. However, I'm now very suspicious of the quality of its charset-detection code, which is the main reason I used it. Figuring out how to get it to handle charset decoding errors also proved far more annoying than it should have been.
Some final background concerns the biases I expect in the dataset. As the approximately 1 million messages were drawn from the Python set iterator, I suspect there's no systematic bias towards or away from specific groups, except that the ~11K messages found in the eternal-september.* hierarchy are completely represented. The newsserver I used, Eternal September, has a respectably large set of newsgroups, although it is likely biased towards European languages and under-representative of East Asian ones. The less well-connected regions of South America, Africa, and central Asia are going to be almost completely unrepresented. The download process will also be biased away from particularly heinous messages (such as those with exceedingly long lines), since nntplib itself fails on them.
These being news messages, I also expect the use of 8-bit text to be far more common than it would be in regular mail messages. On a related note, the use of 8-bit in headers should be commensurately elevated compared to normal email. What should be far less common is HTML. I also expect undeclared charsets to be slightly more common.
Charset data was mostly collected on the basis of individual body parts within messages; some messages have more than one. Interestingly enough, the 1,025,587 messages yielded only 1,016,765 body parts with any text data, which indicates that either some messages on the server had only headers in the first place or the download process somehow managed to grab only the headers. There were also 393 messages that I identified as having parts with different charsets, which only further illustrates how annoying charsets are in messages.
The charset aliases are mostly uninteresting in their variance, except for the various labels used for US-ASCII (us-ascii, 646, and ANSI_X3.4-1968 being the less-well-known aliases), as well as the list of charsets whose names ICU was incapable of recognizing. Unknown charsets are treated as equivalent to undeclared charsets in further processing, as there were too few (45 in all) to merit separate handling.
For the next step, I used ICU to attempt to detect the actual charset of the body parts. ICU's charset detector doesn't support the full gamut of charsets, though, so charsets it doesn't claim to detect were instead processed by checking whether they decoded without error. Before running this detection, I check whether the text is pure ASCII (excluding control characters, so that charsets like ISO-2022-JP are still caught, and excluding + when the charset being checked is UTF-7). ICU has a mode that ignores all text inside things that look like HTML tags, and this mode is set for all HTML body parts.
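That pre-detection check can be restated as a short sketch (my actual tool is C++ using GMime and ICU; the exemptions for TAB/LF/CR here are an assumption on my part):

```python
def is_plain_ascii(data, declared=""):
    """Return True if this text should skip charset detection entirely.

    Control bytes other than common whitespace disqualify the text, so that
    ISO-2022-JP (whose escape sequences begin with ESC) is still detected;
    '+' disqualifies it when the declared charset is UTF-7, since '+'
    opens a UTF-7 shifted sequence.
    """
    for b in data:
        if b >= 0x80:                                  # 8-bit byte: not ASCII
            return False
        if b < 0x20 and b not in (0x09, 0x0A, 0x0D):   # control char, not TAB/LF/CR
            return False
        if b == 0x2B and declared.upper() == "UTF-7":  # '+' under UTF-7
            return False
    return True
```

Text that passes this check is counted in the ASCII column below rather than fed to the detector.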
I don't quite believe ICU's charset-detection results, so I've collapsed them into a simpler table capturing the most salient features. The Correct column indicates the cases where the detected result was the declared charset. The ASCII column captures the fraction that were pure ASCII. The UTF-8 column indicates that ICU reported the text as UTF-8 (it always seems to try this first). The Wrong C1 column refers to ISO-8859-1 text being detected as windows-1252 or vice versa, which ICU decides based on whether it sees an octet in the C1 range. The Other column covers all remaining cases, including invalid text for charsets not supported by ICU.
Declared | Correct | ASCII | UTF-8 | Wrong C1 | Other | Total |
---|---|---|---|---|---|---|
ISO-8859-1 | 230,526 | 225,667 | 883 | 8,119 | 1,035 | 466,230 |
Undeclared | | 148,054 | 1,116 | | 37,626 | 186,796 |
UTF-8 | 75,674 | 37,600 | | | 1,551 | 114,825 |
US-ASCII | | 98,238 | 0 | | 304 | 98,542 |
ISO-8859-15 | 67,529 | 18,527 | 0 | | | 86,056 |
windows-1252 | 21,414 | 4,370 | 154 | 3,319 | 130 | 29,387 |
ISO-8859-2 | 18,647 | 2,138 | 70 | 71 | 2,319 | 23,245 |
KOI8-R | 4,616 | 424 | 2 | | 1,112 | 6,154 |
GB2312 | 1,307 | 59 | 0 | | 112 | 1,478 |
Big5 | 622 | 608 | 0 | | 174 | 1,404 |
windows-1256 | 343 | 10 | 0 | | 45 | 398 |
IBM437 | 84 | 257 | 0 | | | 341 |
ISO-8859-13 | 311 | 6 | 0 | | | 317 |
windows-1251 | 131 | 97 | 1 | | 61 | 290 |
windows-1250 | 69 | 69 | 0 | 14 | 101 | 253 |
ISO-8859-7 | 26 | 26 | 0 | 0 | 131 | 183 |
ISO-8859-9 | 127 | 11 | 0 | 0 | 17 | 155 |
ISO-2022-JP | 76 | 69 | 0 | | 3 | 148 |
macintosh | 67 | 57 | 0 | | | 124 |
ISO-8859-16 | 0 | 15 | | | 101 | 116 |
UTF-7 | 51 | 4 | 0 | | | 55 |
x-mac-croatian | 0 | 13 | | | 25 | 38 |
KOI8-U | 28 | 2 | 0 | | | 30 |
windows-1255 | 0 | 18 | 0 | 0 | 6 | 24 |
ISO-8859-4 | 23 | 0 | 0 | | | 23 |
EUC-KR | 0 | 3 | 0 | | 16 | 19 |
ISO-8859-14 | 14 | 4 | 0 | | | 18 |
GB18030 | 14 | 3 | 0 | | 0 | 17 |
ISO-8859-8 | 0 | 0 | 0 | 0 | 16 | 16 |
TIS-620 | 15 | 0 | 0 | | | 15 |
Shift_JIS | 8 | 4 | 0 | | 1 | 13 |
ISO-8859-3 | 9 | 1 | | | 1 | 11 |
ISO-8859-10 | 10 | 0 | 0 | | | 10 |
KSC_5601 | 3 | 6 | 0 | | | 9 |
GBK | 4 | 2 | 0 | | | 6 |
windows-1253 | 0 | 3 | 0 | 0 | 2 | 5 |
ISO-8859-5 | 1 | 0 | 0 | | 3 | 4 |
IBM850 | 0 | 4 | 0 | | | 4 |
windows-1257 | 0 | 3 | 0 | | | 3 |
ISO-2022-JP-2 | 2 | 0 | 0 | | | 2 |
ISO-8859-6 | 0 | 1 | 0 | | 0 | 1 |
Total | 421,751 | 536,373 | 2,226 | 11,523 | 44,892 | 1,016,765 |
The most obvious thing this table shows is that the most common charsets remain ISO-8859-1, Windows-1252, US-ASCII, UTF-8, and ISO-8859-15, which is to be expected given the prior bias towards European languages in newsgroups. The low prevalence of ISO-2022-JP surprises me: it means a lower incidence of Japanese than I would have expected. Either that, or Japanese users have switched to UTF-8 en masse, which I consider very unlikely given that Japanese users have tended to resist the trend towards UTF-8 the most.
Beyond that, this dataset has caused me to lose trust in ICU's charset detectors. KOI8-R is recorded as 18% malformed text, most of which ICU believed to be ISO-8859-1 instead. Judging from the results, ICU appears to have a bias towards guessing ISO-8859-1, which means I don't believe the numbers in the Other column to be accurate at all. For some reason, I don't appear to have decoders for ISO-8859-16 or x-mac-croatian on my local machine, but some tests run by hand appear to indicate that those texts are valid, not incorrect.
Somewhere between 0.1% and 1.0% of all messages are subject to mojibake, depending on how much you trust the charset detector. The cases of UTF-8 being misdetected as non-UTF-8 could potentially be explained by texts having very few non-ASCII sequences (ICU requires four valid sequences before it confidently declares text UTF-8); someone who writes a post in English but has a non-ASCII signature (such as myself) could easily fall into this category. Even so, it suggests that there is enough mojibake around that users need to be able to override charset decisions.
The undeclared charsets are, in descending order of popularity, ISO-8859-1, Windows-1252, KOI8-R, ISO-8859-2, and UTF-8, which together describe 99% of all non-ASCII undeclared data. ISO-8859-1 and Windows-1252 are probably over-counted here, but the interesting tidbit is that KOI8-R is used half as much undeclared as declared, and I suspect it may be undercounted. The practice of using locale-default fallbacks, as Thunderbird does, appears to be the best way forward for now, although UTF-8 is growing enough in popularity that a specialized detector which decodes as UTF-8 whenever possible may be worth investigating (3% of all non-ASCII, undeclared messages are UTF-8).
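The suggested UTF-8-first fallback for undeclared charsets could be as simple as this sketch (the locale default shown is illustrative):

```python
def charset_for_undeclared(data, locale_default="iso-8859-1"):
    """Pick a charset for a body part with no declared charset: if the bytes
    are valid UTF-8 (pure ASCII included), treat them as UTF-8; otherwise
    fall back to the locale default, as Thunderbird's fallback scheme does."""
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return locale_default
```

This works because non-trivial text in a legacy 8-bit charset is almost never coincidentally valid UTF-8, so a successful UTF-8 decode is strong evidence.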
Unsurprisingly (considering I'm polling newsgroups), very few messages contained any HTML parts at all: there were only 1,032 such parts in the total sample, of which only 552 had non-ASCII characters and were therefore useful for the rest of this analysis. This means I'm skeptical of generalizing these results to email at large, but I'll still summarize the findings.
HTML, unlike plain text, contains a mechanism to explicitly identify the charset of a document. The official algorithm for determining the charset of an HTML file can be described simply as "look for a <meta> tag in the first 1024 bytes; if one can be found, attempt to extract a charset using one of several different techniques depending on what's present." Since doing this fully properly is complicated in library-less C++ code, I opted to look first for a <meta production, guess the extent of the tag, and try to find a charset= string somewhere in that tag. This appears to be more reflective of how the parsing is actually done in email clients than the proper HTML algorithm is. One difference is that my regular expressions also support the newer <meta charset="..."> construct, although I don't see any use of it.
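The loose sniff described above can be approximated with two regular expressions (a Python restatement of what my C++ tool does; the exact patterns are illustrative):

```python
import re

# Find a rough <meta ...> tag extent, then a charset= value inside it.
META_TAG = re.compile(rb"<meta[^>]*>", re.IGNORECASE)
CHARSET_VALUE = re.compile(rb"charset\s*=\s*['\"]?([A-Za-z0-9._:-]+)", re.IGNORECASE)

def sniff_html_charset(body):
    """Look in the first 1024 bytes for a <meta> tag declaring a charset.

    Matches both the http-equiv Content-Type form and the newer
    <meta charset="..."> form, since charset= appears in either."""
    for tag in META_TAG.finditer(body[:1024]):
        match = CHARSET_VALUE.search(tag.group(0))
        if match:
            return match.group(1).decode("ascii", "replace")
    return None
```

A proper implementation would follow the full HTML encoding-sniffing algorithm, but this is closer to what mail clients actually do.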
I found only 332 parts where the HTML declared a charset. Only 22 parts had both a MIME charset and an HTML charset that disagreed with each other. I neglected to count how many messages had HTML charsets but no MIME charsets, but random sampling appeared to indicate that this is very rare in the data set (the same order of magnitude as, or less than, the cases where they disagreed).
As for the question of who wins: of the 552 non-ASCII HTML parts, only 71 had a MIME charset that was not the valid (detected) charset. Then again, 71 parts did not have a valid HTML charset either, which strongly suggests that ICU was detecting the incorrect charset in those cases. Judging from manual inspection of such messages, it appears that the MIME charset ought to be preferred when it exists. There is also a large number of HTML charset specifications saying unicode, which ICU treats as UTF-16, which is most certainly wrong.
In the data set, 1,025,856 header blocks were processed for the following statistics. This is slightly more than the number of messages, since the headers of contained message/rfc822 parts were also processed. The good news is that 97% (996,103) of headers were completely ASCII. Of the remaining 29,753 headers, 3.6% (1,058) were UTF-8 and 43.6% (12,965) matched the declared charset of the first body part. This leaves 52.9% (15,730) that did not match that charset, however.
Now, NNTP messages can generally be expected to have a higher ratio of 8-bit headers, so this probably overstates the situation for most email messages. That said, the high incidence is definitely an indicator that even non-EAI-aware clients and servers cannot blindly presume that headers are 7-bit, nor can EAI-aware clients and servers presume that 8-bit headers are UTF-8. The high incidence of mismatches with the declared charset suggests that fallback-charset decoding of headers is a necessary step.
RFC 2047 encoded-words are also an interesting statistic to mine. I found 135,951 encoded-words in the data set, which is rather low considering that messages can reasonably be expected to carry more than one encoded-word. This is likely an artifact of NNTP's tendency towards 8-bit instead of 7-bit communication, and it understates their presence in regular email.
Counting encoded-words can be difficult, since there is a mechanism to let them continue across multiple pieces. For the purposes of this count, a sequence of such words counts as a single word, and the Continued column indicates how many had more than one element in a sequence. The 2047 Violation column counts the sequences where decoding the words individually does not yield the same result as decoding them as a whole, in violation of RFC 2047. The Only ASCII column counts words containing nothing but ASCII symbols, where the encoding was thus (mostly) pointless. The Invalid column counts the sequences that had a decoder error.
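That 2047 Violation test, restated for base64 (B) encoded-words, can be sketched as follows (a simplification; Q-encoded words would first need their quoted-printable layer undone):

```python
import base64

def decode_whole(parts, charset):
    """Decode a continued encoded-word sequence as one concatenated payload,
    which is what sloppy encoders implicitly require of the decoder."""
    return b"".join(base64.b64decode(p) for p in parts).decode(charset)

def violates_rfc2047(parts, charset):
    """RFC 2047 requires each encoded-word to decode on its own; a sequence
    violates it when piecewise decoding fails or differs from the whole."""
    try:
        piecewise = "".join(base64.b64decode(p).decode(charset) for p in parts)
    except UnicodeDecodeError:
        return True
    return piecewise != decode_whole(parts, charset)
```

The classic violation is a multibyte character split across two words: UTF-8 "für" cut after the first byte of ü decodes fine as a whole but not piecewise.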
Charset | Count | Continued | 2047 Violation | Only ASCII | Invalid |
---|---|---|---|---|---|
ISO-8859-1 | 56,355 | 15,610 | | 499 | 0 |
UTF-8 | 36,563 | 14,216 | 3,311 | 2,704 | 9,765 |
ISO-8859-15 | 20,699 | 5,695 | | 40 | 0 |
ISO-8859-2 | 11,247 | 2,669 | | 9 | 0 |
windows-1252 | 5,174 | 3,075 | | 26 | 0 |
KOI8-R | 3,523 | 1,203 | | 12 | 0 |
windows-1256 | 765 | 568 | | 0 | 0 |
Big5 | 511 | 46 | 28 | 0 | 171 |
ISO-8859-7 | 165 | 26 | | 0 | 3 |
windows-1251 | 157 | 30 | | 2 | 0 |
GB2312 | 126 | 35 | 6 | 0 | 51 |
ISO-2022-JP | 102 | 8 | 5 | 0 | 49 |
ISO-8859-13 | 78 | 45 | | 0 | 0 |
ISO-8859-9 | 76 | 21 | | 0 | 0 |
ISO-8859-4 | 71 | 2 | | 0 | 0 |
windows-1250 | 68 | 21 | | 0 | 0 |
ISO-8859-5 | 66 | 20 | | 0 | 0 |
US-ASCII | 38 | 10 | | 38 | 0 |
TIS-620 | 36 | 34 | | 0 | 0 |
KOI8-U | 25 | 11 | | 0 | 0 |
ISO-8859-16 | 22 | 1 | | 0 | 22 |
UTF-7 | 17 | 2 | 1 | 8 | 3 |
EUC-KR | 17 | 4 | 4 | 0 | 9 |
x-mac-croatian | 10 | 3 | | 0 | 10 |
Shift_JIS | 8 | 0 | 0 | 0 | 3 |
Unknown | 7 | 2 | | 0 | 7 |
ISO-2022-KR | 7 | 0 | 0 | 0 | 0 |
GB18030 | 6 | 1 | 0 | 0 | 1 |
windows-1255 | 4 | 0 | | 0 | 0 |
ISO-8859-14 | 3 | 0 | | 0 | 0 |
ISO-8859-3 | 2 | 1 | | 0 | 0 |
GBK | 2 | 0 | 0 | 0 | 2 |
ISO-8859-6 | 1 | 1 | | 0 | 0 |
Total | 135,951 | 43,360 | 3,361 | 3,338 | 10,096 |
This table somewhat mirrors the distribution of regular charsets, with one major class of differences: charsets representing non-Latin scripts (particularly Asian scripts) appear to be over-represented compared to their corresponding use in body parts. The exception to this rule is GB2312, which is far lower than its relative ranking would suggest; I attribute this to people using GB2312 being more likely to use 8-bit headers instead of RFC 2047 encoding, although I don't have direct evidence.
Clearly continuations are common, which is to be expected. The sad part is how few people bother to adhere to the specification here: out of 14,312 continuations in charsets that could violate the specification, 23.5% did. The mode-shifting charsets (ISO-2022-JP and EUC-KR) are violated basically every time, which suggests that no one bothered to check whether their encoder "returns to ASCII" at the end of each word (I know Thunderbird's does, but the others I checked don't appear to).
The number of invalid UTF-8 encoded-words, 26.7%, seems impossibly high to me. A brief check of my code indicates that it works incorrectly in the face of invalid continuations, which certainly exaggerates the effect but still leaves a value too high for my tastes. Of more note are the elevated counts for the East Asian charsets: Big5, GB2312, and ISO-2022-JP. I am not an expert in charsets, but I believe that Big5 and GB2312 in particular are each families of almost-but-not-quite-identical charsets, and it may be that ICU is choosing the wrong member of each family for these instances.
There is a surprisingly large number of encoded-words that encode only ASCII. Searching specifically for the ones using the US-ASCII charset, I found that these can be divided into three categories. One set comes from a few people who apparently have unsanitized whitespace (space and LF were the two I recall seeing) in the display name, producing encoded-words like =?us-ascii?Q?=09Edward_Rosten?=. Blame 40tude Dialog here. Another set encodes some basic characters (most commonly = and ?, although a few other specially interpreted characters popped up). The final set of errors were double-encoded words, such as =?us-ascii?Q?=3D=3FUTF-8=3FQ=3Ff=3DC3=3DBCr=3F=3D?=, which appear to all be generated by an Emacs-based newsreader.
One interesting thing about sifting the results is finding the crap that people's tools produce. By far the worst single instance of an RFC 2047 encoded-word that I found is this one: Subject: Re: [Kitchen Nightmares] Meow! Gordon Ramsay Is =?ISO-8859-1?B?UEgR lqZ VuIEhlYWQgVH rbGeOIFNob BJc RP2JzZXNzZW?= With My =?ISO-8859-1?B?SHVzYmFuZ JzX0JhbGxzL JfU2F5c19BbXiScw==?= Baking Company Owner (complete with embedded spaces), discovered when it crashed my ad-hoc base64 decoder (because of the spaces). The interesting thing is that even after working out the encoding, the text doesn't actually look like correct ISO-8859-1... or any obvious charset, for that matter.
I looked at the unknown charsets by hand. Most of them were actually empty charset names (as in =??B?Sy4gSC4gdm9uIFLDvGRlbg==?=), and all but one of the outright empty ones were generated by KNode and were really UTF-8. The remaining one was Windows-1252, generated by a minor newsreader.
Another important aspect of headers is how to handle 8-bit ones. RFC 5322 blindly hopes that headers are pure ASCII, while RFC 6532 dictates that they are UTF-8. Indeed, 97% of headers are ASCII, leaving just 29,753 that are not. Of these, only 1,058 (3.6%) are UTF-8 per RFC 6532. Deducing which charset they use is difficult, because the large amount of English text in header names and important control values will greatly skew any charset detector, and there is too little text to give a detector confidence. The only metric I could easily apply was testing Thunderbird's heuristic, "the header blocks are the same charset as the message contents", which only worked 45.2% of the time.
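Thunderbird's heuristic amounts to something like the following sketch (the latin-1 last resort is my own assumption, chosen because it maps every byte to some character rather than failing):

```python
def decode_header_block(raw, first_body_charset):
    """Decode a raw 8-bit header block by assuming it shares the charset of
    the message's first body part; fall back to latin-1 rather than raising."""
    try:
        return raw.decode(first_body_charset)
    except (UnicodeDecodeError, LookupError):
        return raw.decode("latin-1")  # never fails, but may produce mojibake
```

Per the numbers above, the body-charset assumption holds less than half the time, which is why the lossless fallback matters.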
While developing an earlier version of my scanning program, I was intrigued to know how often various content transfer encodings were used. I found 1,028,971 parts in all (1,027,474 of which are text parts). The transfer encoding of binary did manage to sneak in, with 57 such parts. Using 8-bit text was very popular, at 381,223 samples, second only to 7-bit at 496,114 samples. Quoted-printable had 144,932 samples and base64 only 6,640 samples. Extremely interesting is the presence of 4 illegal transfer encodings across 5 messages, two of them obvious typos and the others apparently a client mangling header continuations into the transfer-encoding.
So, drawing from the body of this data, I would like to make the following conclusions about using charsets in mail messages:
When I have time, I'm planning to take some of the more egregious or interesting messages in my dataset and package them into a database of emails to help create test suites for handling messages properly.
http://quetzalcoatal.blogspot.com/2014/03/understanding-email-charsets.html