Henri Sivonen: encoding_rs: a Web-Compatible Character Encoding Library in Rust

Monday, December 3, 2018, 13:30

encoding_rs is a high-decode-performance, low-legacy-encode-footprint and high-correctness implementation of the WHATWG Encoding Standard written in Rust. In Firefox 56, encoding_rs replaced uconv as the character encoding library used in Firefox. This wasn’t an addition of a component but an actual replacement: uconv was removed when encoding_rs landed. This writeup covers the motivation and design of encoding_rs, as well as some benchmark results.

Additionally, encoding_rs contains a submodule called encoding_rs::mem that’s meant for efficient encoding-related operations on UTF-16, UTF-8, and Latin1 in-memory strings—i.e., the kind of strings that are used in Gecko C++ code. This module is discussed separately after describing encoding_rs proper.

The C++ integration of encoding_rs is not covered here; it is covered in a separate write-up.

TL;DR

Rust’s borrow checker is used with on-stack structs that get optimized away to enforce an “at most once” property that matches reads and writes to buffer space availability checks in legacy CJK converters. Legacy CJK converters are the most risky area in terms of memory-safety bugs in a C or C++ implementation.

Decode is very fast relative to other libraries with the exception of some single-byte encodings on ARMv7. Particular effort has gone into validating UTF-8 and converting UTF-8 to UTF-16 efficiently. ASCII runs are handled using SIMD when it makes sense. There is tension between making ASCII even faster vs. making transitions between ASCII and non-ASCII more expensive. This tension is the clearest when encoding from UTF-16, but it’s there when decoding, too.

By default, there is no encode-specific data other than 32 bits per single-byte encoding. This makes legacy CJK encode extremely slow by default relative to other libraries but still fast enough for the browser use cases. That is, the amount of text one could reasonably submit at a time in a form submission encodes so fast even on a Raspberry Pi 3 (standing in for a low-end phone) that the user will not notice. Even with only 32 bits of encode-oriented data, multiple single-byte encoders are competitive with ICU, though only windows-1252 applied to ASCII or almost-ASCII input is competitive with Windows system encoders. Faster CJK legacy encode is available as a compile-time option. But ideally, you should only be using UTF-8 for output anyway.

(If you just want to see the benchmarks and don’t have time for the discussion of the API and implementation internals, you can skip to the benchmarking section.)

Scope

Excluding the encoding_rs::mem submodule, which is discussed after encoding_rs proper, encoding_rs implements the character encoding conversions defined in the Encoding Standard as well as the mapping from labels (i.e. strings in protocol text that identify encodings) to encodings.

Specifically, encoding_rs does the following:

  • Decodes a stream of bytes in an Encoding Standard-defined character encoding into valid aligned native-endian in-RAM UTF-16 (units of u16).
  • Encodes a stream of potentially-invalid aligned native-endian in-RAM UTF-16 (units of u16) into a sequence of bytes in an Encoding Standard-defined character encoding as if the lone surrogates had been replaced with the REPLACEMENT CHARACTER before performing the encode. (Gecko’s UTF-16 is potentially invalid.)
  • Decodes a stream of bytes in an Encoding Standard-defined character encoding into valid UTF-8.
  • Encodes a stream of valid UTF-8 into a sequence of bytes in an Encoding Standard-defined character encoding. (Rust’s UTF-8 is guaranteed-valid.)
  • Does the above in streaming (input and output split across multiple buffers) and non-streaming (whole input in a single buffer and whole output in a single buffer) variants.
  • Avoids copying (borrows) when possible in the non-streaming cases when decoding to or encoding from UTF-8.
  • Resolves textual labels that identify character encodings in protocol text into type-safe objects representing those encodings conceptually.
  • Maps the type-safe encoding objects onto strings suitable for returning from document.characterSet.
  • Validates UTF-8 (in common instruction set scenarios a bit faster for Web workloads than the Rust standard library; hopefully will get upstreamed some day) and ASCII.

Notably, the JavaScript APIs defined in the Encoding Standard are not implemented by encoding_rs directly. Instead, they are implemented in Gecko as a thin C++ layer that calls into encoding_rs.

Why is a Character Encoding Conversion Library Even Needed Anymore?

The Web is UTF-8 these days and Rust uses UTF-8 as the in-RAM Unicode representation, so why is a character encoding conversion library even needed anymore? The answer is, of course, “for legacy reasons”.

While the HTML spec requires the use of UTF-8 and the Web is over 90% UTF-8 (according to W3Techs, whose methodology is questionable considering that they report e.g. ISO-8859-1 separately from windows-1252 and GB2312 separately from GBK even though the Web Platform makes no such distinctions, but Google hasn’t published their numbers since 2012), users still need to access the part of the Web that has not migrated to UTF-8 yet. That part does not consist only of ancient static pages, either. For example, in Japan there are still news sites that publish new content every day in Shift_JIS. Over here in Finland, I do my banking using a Web UI that is still encoded in ISO-8859-15.

Another side of the legacy is inside the browser engine. Gecko, JavaScript and the DOM API originate from the 1990s, when the way to represent Unicode in RAM was in 16-bit units, as can also be seen in other software from that era, such as Windows NT, Java, Qt and ICU. (Unicode was formally extended beyond 16 bits in Unicode 2.0 in 1996, but non-Private Use Characters were not assigned outside the Basic Multilingual Plane until Unicode 3.1 in 2001.)

Why a Rewrite?

Regardless of the implementation language, the character encoding library in Gecko was in need of a rewrite for three reasons:

  1. The addition of Rust code in Firefox brought about the need to be able to convert to and from UTF-8 directly, and in terms of binary size, it didn’t make sense to have distinct libraries for converting to and from UTF-16 and for converting to and from UTF-8. Instead, a unified library using the same lookup tables for both was needed. The old code wasn’t designed to yield both UTF-16-targeting and UTF-8-targeting machine code from the same source. The addition of an efficient capability to decode to UTF-8 or to encode from UTF-8 would have involved a level of change comparable to a rewrite.

  2. The old library was crufty enough that it was easier to make correctness improvements by the means of a rewrite than by the means of incremental fixes.

    In Firefox 43, I had already rewritten the Big5 decoder and encoder in C++, because a rewrite was easier than modifying the old code. In that particular case, the old code used the Private Use Area (PUA) of the Basic Multilingual Plane (BMP) for Hong Kong Supplementary Character Set (HKSCS) characters. However, after the old code was written, HKSCS characters had been assigned proper code points in Unicode, but many of the assignments are on the Supplementary Ideographic Plane (Plane 2). When a fundamental assumption, such as all the characters in an encoding mapping to the BMP, no longer holds, a rewrite is easier than an incremental change.

    As another example (that showed up after the initial rewrite proposal but before the implementation got properly going), the ISO-2022-JP decoder had an XSS vulnerability that was difficult to fix without restructuring the existing code. I actually tried to write a patch for the old code and gave up.

    In general, the code structure of the old multi-byte decoders differed from the spec text so much that it would have been harder to try to figure out if the code does what the spec requires than to write new code according to the spec.

  3. The old code was written at a time when the exact set of behaviors that Web-exposed character encodings exhibit wasn’t fully understood. For this reason, the old code had generality that is no longer useful now that we know the full set of Web-exposed legacy encodings and can be confident that there will be no additional legacy encodings introduced with additional behaviors anymore.

    As the most notable example, the old code assumed that the lower half of single-byte encodings might not be ASCII. By the time of planning encoding_rs, single-byte encodings whose lower half wasn’t ASCII had already been removed as part of previous Encoding Standard-compliance efforts. Some of the multi-byte encoding handling code also had configurability for the single-byte mode that allowed for non-ASCII single-byte mode. However, some multi-byte encodings had already been migrated off the generic two-byte encoding handling code years ago.

    There had been generic two-byte encoding handling code, but it no longer made sense when only EUC-KR remained as an encoding exhibiting the generic characteristics. Big5 was able to decode to Plane 2, GBK had grown four-byte sequences as part of the evolution to GB18030, EUC-JP had grown support for three-byte sequences in order to support JIS X 0212 and Shift_JIS never had the EUC structure to begin with and had single-byte half-width katakana. Even EUC-KR itself had deviated from the original EUC structure by being extended to support all precomposed Hangul syllables (not just the ones in common use) in windows-949.

When a rewrite made sense in any case, it made sense to do the rewrite in Rust, because a rewrite of a clearly identifiable subsystem is exactly the kind of thing that is suitable for rewriting in Rust and the problem domain could use memory safety. The old library was created in early 1999, but it still had a buffer overrun discovered in it in 2016 (in code added in 2001 and 2002). This shows that the notion that code written in a memory-unsafe language becomes safe by being “battle-hardened” if it has been broadly deployed for an extended period of time is a myth. Memory safety needs a systematic approach. Calendar time and broad deployment are not sufficient to turn unsafe code into safe code.

(The above-mentioned bug discovered in 2016 wasn’t the last uconv security bug to be fixed. In 2018, a memory-safety-relevant integer overflow bug was discovered in uconv after uconv had already been replaced with encoding_rs in non-ESR Firefox but uconv was still within security support in ESR. However, that bug was in the new Big5 code that I wrote for Firefox 43, so it can’t be held against the ancient uconv code. I had fixed the corresponding encoding_rs bug before encoding_rs landed in Firefox 56. The uconv bug was fixed in Firefox ESR 52.7.)

Why not ICU or rust-encoding?

As noted above, a key requirement was the ability to decode to and from both UTF-16 and UTF-8, but ICU supports only decoding to and from UTF-16 and rust-encoding supports only decoding to and from UTF-8. Perhaps one might argue that pivoting via another UTF would be fast enough, but experience indicated that pivoting via another UTF posed at least a mental barrier: Even after the benefits of UTF-8 as an in-memory Unicode representation were known, Gecko subsystems had been written to use UTF-16 because that was what uconv decoded to.

A further problem with ICU is that it does not treat the Encoding Standard as its conformance target. Chrome patches ICU substantially for conformance. I didn’t want to maintain a similar patch set in the Gecko context and instead wanted a library that treats the Encoding Standard as its conformance target.

The invasiveness of the changes to rust-encoding that would have been needed to meet the API design, performance and UTF-16 targeting goals would have been large enough that it made sense to pursue them in a new project instead of trying to impose the requirements onto an existing project.

API Design Problems

In addition to internal problems, uconv also had a couple of API design problems. First, the decoder API lacked the ability to signal the end of the stream. This meant that there was no way for the decoder to generate a REPLACEMENT CHARACTER when the input stream ended with an incomplete byte sequence. It was possible for the caller to determine from the status code if the last buffer passed to the decoder ended with an incomplete byte sequence, but then it was up to the caller to generate the REPLACEMENT CHARACTER in that situation even though the decoder was generally expected to provide this service. As a result, only one caller in the code base, the TextDecoder implementation, did the right thing. Furthermore, even though the encoder side had an explicit way to signal the end of the stream, it was a separate method leading to more complexity for callers than just being able to say that a buffer is the last buffer.

Additionally, the API contract was unclear on whether it was supposed to fill buffers exactly, potentially splitting a surrogate pair across buffer boundaries, or whether it was supposed to guarantee output validity on a per-method-call basis. In a situation where the input and output buffers were exhausted simultaneously, it was unspecified whether the converter should signal that the input was exhausted or that the output was exhausted. In cases where it wasn’t the responsibility of the converter to handle the replacement of malformed byte sequences when decoding or unmappable characters when encoding, the API left needlessly much responsibility to the caller to advance over the faulty input and to figure out what the faulty input was in the case where that mattered, i.e. when encoding and producing numeric character references for unmappable characters.

Character encoding conversion APIs tend to exhibit common problems, so the above uconv issues didn’t make uconv particularly flawed compared to other character encoding conversion APIs out there. In fact, to uconv’s credit at least in the form that it had evolved into by the time I got involved, given enough output space uconv always consumed all the input provided to it. This is very important from the perspective of API usability. It’s all too common for character encoding conversion APIs to backtrack if the input buffer ends with an incomplete byte sequence and to report the incomplete byte sequence at the end of the input buffer as not consumed. This leaves it to the caller to take those unconsumed bytes and to copy them to the start of the next buffer so that they can be completed by the bytes that follow. Even worse, sometimes this behavior isn’t documented and is up to the caller of the API to discover by experimentation. This behavior also imposes a, typically undocumented, minimum input buffer size, because the input buffer has to be large enough for at least one complete byte sequence to fit. If the input trickles in byte by byte, it’s up to the caller to arrange them into chunks large enough to contain a complete byte sequence.

Sometimes, the API design problem described in the previous paragraph is conditional on requesting error reporting. When I was writing the Validator.nu HTML Parser, I discovered that the java.nio.charset character encoding conversion API was well-behaved when it was asked to handle errors on its own, but when the caller asked for the errors to be reported, the behavior undocumentedly changed to not consuming all the input offered even if there was enough output space. This was because the error reporting mechanism sought to designate the exact bytes in error by giving the caller the number of erroneous bytes corresponding to a single error. In order to make a single number make sense, the bytes always had to be counted backwards from the current position, which meant that the current position had to be placed such that it was at the end of the erroneous sequence and additionally the API sought to make it so that the entire erroneous sequence was in the buffer provided and not partially in a past already discarded buffer.

Additionally, as a matter that is more trivial to describe but security-wise potentially very serious, some character encoding conversion APIs provide a mode that ignores errors. Especially when decoding and especially in the context of input such as HTML that has executable (JavaScript) and non-executable parts, silently dropping erroneous byte sequences instead of replacing them with the REPLACEMENT CHARACTER is a security problem. Therefore, it’s a bad idea for a character encoding conversion API to offer a mode where errors are neither signaled to the caller nor replaced with the REPLACEMENT CHARACTER.

Finally, some APIs fail to provide a high-performance streaming mode where the caller is responsible for output buffer allocation. (This means two potential failures: first, failure to provide a streaming mode at all and, second, providing a streaming mode in which the converter seeks to control the output buffer allocation.)

In summary, in my experience, common character encoding conversion API design problems are the following:

  • Failure to provide a streaming mode
    • E.g. the kernel32 conversion APIs
  • In streaming mode, failure to let the caller signal the end of the stream
    • E.g. uconv decode API and Qt (.NET can signal this but the documentation says the converter ignores the invalid bytes at the end in that case! I hope the docs are wrong.)
  • In streaming mode, having a separate API entry point for signaling the end of the stream (as opposed to being able to flag a buffer as the last buffer) resulting in two API entry points that can generate output
    • E.g. uconv encode API
  • In streaming mode, given sufficient output space, failure to consume all provided input
    • E.g. java.nio.charset in error reporting mode, rust-encoding and iconv
  • In streaming mode, seeking to identify which bytes were in error but doing so with too simplistic a mechanism, leading to also having the problem from the previous item
    • E.g. java.nio.charset
  • In streaming mode, causing memory allocation when a conversion call is on the stack (as opposed to letting the caller be fully in charge of allocating buffers)
    • E.g. Qt, WebKit and rust-encoding
  • In streaming mode, failure to guarantee that the exhaustion of the input buffer is the condition that is reported if both the input and output buffers are exhausted at the same time
    • E.g. uconv
  • In streaming mode, seeking to fill the output buffer fully (even if doing so e.g. splits a surrogate pair) instead of guaranteeing that the output is valid on a per-buffer basis
    • E.g. ICU by documentation; many others silent on this matter in documentation, so who knows
  • Providing a mode that silently ignores erroneous input sequences
    • E.g. rust-encoding, java.nio.charset

All but the last item are specific to a streaming mode. Streaming is hard.

Other Design Considerations

There are other API design considerations that would be unfair to label as “problems”, but that are still very relevant to designing a new API. These relate mainly to error handling and byte order mark (BOM) handling.

Replacement of Errors

It is typical for character encoding conversion APIs to treat error handling as a mode that is set on a converter object as opposed to treating error handling as a different API entry point. API-wise it makes sense to have different entry points in order to have different return values for the two cases. Specifically, when the converter handles errors, the status of the conversion call cannot be that conversion stopped on an error for the caller to handle. Additionally, when the converter handles errors, it may make sense to provide a flag that indicates whether there were errors even though they were automatically handled.

Implementation-wise, experience suggests that baking error handling into each converter complicates code considerably and adds opportunities for bugs. Making the converter implementation always signal errors and having an optional wrapper that deals with those errors on behalf of the application developer leads to a much cleaner design. This design is a natural match for exposing different entry points: one entry point goes directly to the underlying converter and the other goes through the wrapper.

BOM Handling

BOM sniffing is subtle enough that it is a bad idea to leave it to the application. It’s more robust to bake it into the conversion library. In particular, getting BOM sniffing right when bytes arrive one at a time is not trivial for applications to handle. Like replacement of errors, different BOM handling modes can be implemented as wrappers around the underlying converters.

Extensibility

Especially in languages that provide a notion of inheritance, interfaces or traits, it is alluring for the API designer to seek to define an abstract conversion API that others can write more converters for. However, in the case of the Web, the set of encodings is closed and includes only those that are defined in the Encoding Standard. As far as the use cases in the Web context go, extensibility is not needed. On the contrary, especially in a code base that is also used in a non-Web context, as Gecko is used in Thunderbird for email, it is a feature that, on the Web side, we can be confident that a type representing an encoding defined in the Encoding Standard can’t exhibit behaviors from outside the Encoding Standard. By design, encoding_rs is not extensible, so an encoding_rs Encoding does not represent any imaginable character encoding but instead represents a character encoding from the Encoding Standard. For example, we know from the type that we don’t accidentally have a UTF-7 decoder in Gecko code that has Web expectations even though Thunderbird contains a UTF-7 decoder in its codebase. (If you are interested in decoding email in Rust, there is a crate that wraps encoding_rs, adds UTF-7 decoding and maintains a type distinction between Web encodings and email encodings.)

Additionally, in the context of Rust and its Foreign Function Interface (FFI), it helps that references are references to plain structs and not trait objects. Whereas C++ puts a vtable pointer on the objects allowing pointers to polymorphic types to have the same size as C pointers, Rust’s type erasure puts the vtable pointer in the reference. A Rust reference to a struct has the same machine representation as a plain (non-null) C pointer. A Rust reference to a trait-typed thing is actually two pointers: one to the instance and another to the vtable appropriate for the concrete type of the instance. Since interoperability with C++ is a core design goal for encoding_rs, using the kind of types whose references are the same as C pointers avoids the problem of losing the vtable pointer when crossing the FFI boundary.

Iterators vs. Slices

Conceptually a character encoding is a mapping from a stream of bytes onto a stream of Unicode scalar values and, in most cases, vice versa. Therefore, it would seem that the right abstraction for a converter is an iterator adaptor that consumes an iterator over bytes and yields Unicode scalar values (or vice versa).

There are two problems with modeling character encoding converters as iterator adaptors. First, it leaves optimization to the compiler, when manual optimizations across runs of code units are desirable. Specifically, it is a core goal for encoding_rs to make ASCII handling fast using SIMD, and the compiler does not have enough information about the data to know to produce ASCII-sequence-biased autovectorization. Second, Rust iterators are ill-suited for efficient and (from the C perspective) idiomatic exposure over the FFI.

The API style of uconv, java.nio.charset, iconv, etc., of providing input and output buffers of several code units at a time to the converter is friendly both to SIMD and to FFI (Rust slices trivially decompose to pointer and length in C). While this isn’t 100% rustic like iterators, slices still aren’t unrustic.

The API Design

This finally brings us to the actual API. There are three public structs: Encoding, Decoder and Encoder. From the point of view of the application developer, these act like traits (or interfaces or superclasses to use concepts from other languages) even though they are structs. Instead of using language implementation-provided vtables for dynamic dispatch, they internally have an enum that wraps private structs that are conceptually like subclasses. The use of private enum for dispatch avoids vtable pointers in FFI, makes the hierarchy intentionally non-extensible (see above) and allows BOM sniffing to change what encoding a Decoder is a decoder for.

There is one statically allocated instance of Encoding for each encoding defined in the Encoding Standard. These instances have publicly visible names that allow application code to statically refer to a specific encoding (commonly, you want to do this with UTF-8, windows-1252, and the replacement encoding). To find an Encoding instance dynamically at runtime based on a label obtained from protocol text, there is a static method fn Encoding::for_label(label: &[u8]) -> Option<&'static Encoding>.
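For example (a minimal sketch; for_label and the WINDOWS_1252 static are the crate’s actual API, and Encoding compares by reference identity):

use encoding_rs::{Encoding, WINDOWS_1252};

// "latin1" is a label for windows-1252 per the Encoding Standard.
assert_eq!(Encoding::for_label(b"latin1"), Some(WINDOWS_1252));
// Unknown labels resolve to None.
assert_eq!(Encoding::for_label(b"not-a-real-label"), None);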

The Encoding struct provides convenience methods for non-streaming conversions. These are “convenience” methods in the sense that they are implemented on top of Decoder and Encoder. An application that only uses non-streaming conversions only needs to deal with Encoding and doesn’t need to use Decoder and Encoder at all.

Streaming API

Decoder and Encoder provide streaming conversions and are allocated at runtime, because they encapsulate state related to the streaming conversion. On the Encoder side, only ISO-2022-JP is actually stateful, so most of the discussion here will focus on Decoder.

Internally, the encoding-specific structs wrapped by Decoder are macroized to generate decode to UTF-8 and decode to UTF-16 from the same source code (likewise for Encoder). Even though Rust applications are expected to use the UTF-8 case, I’m going to give examples using the UTF-16 case, because it doesn’t involve the distinction between &str and &[u8] which would distract from the more important issues.

The fundamental function that Decoder provides is:
fn decode_to_utf16_without_replacement(
    &mut self,
    src: &[u8],
    dst: &mut [u16],
    last: bool
) -> (DecoderResult, usize, usize)

This function wraps BOM sniffing around an underlying encoding-specific implementation that takes the same arguments and has the same return value. The Decoder-provided wrapper first exposes the input to a BOM sniffing state machine and once the state machine gets out of the way delegates to the underlying implementation. Decoder instances can’t be constructed by the application directly. Instead, they need to be obtained from factory functions on Encoding. The factory functions come in three flavors for three different BOM sniffing modes: full BOM sniffing (the default), which may cause the Decoder to morph into a decoder for a different encoding than initially (using enum for dispatch shows its usefulness here!), BOM removal (no morphing but the BOM for the encoding itself is skipped) and without BOM handling. The struct is the same in all cases, but the different factory methods initialize the state of the BOM sniffing state machine differently.
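In code, the three flavors look like this (a sketch; these are the crate’s actual factory method names):

use encoding_rs::UTF_8;

// Full BOM sniffing (the default): the decoder may morph into a decoder
// for a different encoding based on the BOM.
let sniffing = UTF_8.new_decoder();
// BOM removal: no morphing, but a BOM for the encoding itself is skipped.
let removing = UTF_8.new_decoder_with_bom_removal();
// No BOM handling: a BOM is decoded like any other input.
let plain = UTF_8.new_decoder_without_bom_handling();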

The method takes an input buffer (src) and an output buffer (dst) both of which are caller-allocated. The method then decodes bytes from src into Unicode scalar values that are stored (as UTF-16) into dst until one of the following three things happens:

  1. A malformed byte sequence is encountered.
  2. All the input bytes have been processed.
  3. The output buffer has been filled so near capacity that the decoder cannot be sure that processing an additional byte of input wouldn’t cause so much output that the output buffer would overflow.

The return value is a tuple of a status indicating which one of the three reasons to return happened, how many input bytes were read and how many output code units were written. The status is a DecoderResult enumeration (possibilities Malformed, InputEmpty and OutputFull corresponding to the three cases listed above).

The output written into dst is guaranteed to be valid UTF-16, and the output after each call is guaranteed to consist of complete characters. (I.e. the code unit sequence for the last character is guaranteed not to be split across output buffers.) This implies that the output buffer must be long enough for an astral character to fit (two UTF-16 code units) and the output buffer might not be fully filled. While it may seem wasteful not to fill the last slot of the output buffer in the common case, this design significantly simplifies the implementation while also simplifying callers by guaranteeing to the caller that it won’t have to deal with split surrogate pairs.

The boolean argument last indicates that the end of the stream is reached when all the bytes in src have been consumed.

A Decoder object can be used to incrementally decode a byte stream. During the processing of a single stream, the caller must call the method zero or more times with last set to false and then call decode_* at least once with last set to true. If the decode_* with last set to true returns InputEmpty, the processing of the stream has ended. Otherwise, the caller must call decode_* again with last set to true (or treat a Malformed result as a fatal error).

Once the stream has ended, the Decoder object must not be used anymore. That is, you need to create another one to process another stream. Unlike with some other libraries that encourage callers to recycle converters that are expensive to create, encoding_rs guarantees that converters are extremely cheap to create. (More on this later.)

When the decoder returns OutputFull or the decoder returns Malformed and the caller does not wish to treat it as a fatal error, the input buffer src may not have been completely consumed. In that case, the caller must pass the unconsumed contents of src to the method again upon the next call.

Typically the application doesn’t wish to do its own error handling and just wants errors to be replaced with the REPLACEMENT CHARACTER. For this use case, there is another method that wraps the previous method and provides the replacement. The wrapper looks like this:
fn decode_to_utf16(
    &mut self,
    src: &[u8],
    dst: &mut [u16],
    last: bool
) -> (CoderResult, usize, usize, bool)

Notably, the status enum is different, because the case of malformed sequences doesn’t need to be communicated to the application. Also, the return tuple includes a boolean flag to indicate whether there were errors.

Additionally, there is a method for querying the worst-case output size given the current state of the decoder and the length of an input buffer. If the length of the output buffer is at least the worst case, the decoder guarantees that it won’t return OutputFull.
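Putting the pieces together, a replacement-mode streaming decode loop looks roughly like this (a sketch; decode_to_utf16 and CoderResult are the crate’s API, while the chunked input as a slice of slices is an assumption of the example). Sizing the output buffer using the worst-case query would make the OutputFull arm unreachable:

use encoding_rs::{CoderResult, Decoder};

// Decodes a sequence of byte chunks to UTF-16 with malformed sequences
// replaced with the REPLACEMENT CHARACTER.
// (Assumes chunks is non-empty so that last is eventually true.)
fn decode_stream(mut decoder: Decoder, chunks: &[&[u8]]) -> Vec<u16> {
    let mut output = Vec::new();
    let mut buffer = [0u16; 1024];
    for (i, chunk) in chunks.iter().enumerate() {
        let last = i == chunks.len() - 1;
        let mut src: &[u8] = chunk;
        loop {
            let (result, read, written, _had_errors) =
                decoder.decode_to_utf16(src, &mut buffer, last);
            output.extend_from_slice(&buffer[..written]);
            src = &src[read..];
            match result {
                // All of src has been consumed; move to the next chunk.
                CoderResult::InputEmpty => break,
                // The output buffer filled up; flush it and decode more.
                CoderResult::OutputFull => {}
            }
        }
    }
    output
}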

Identifying Malformed Sequences

Initially, the plan was simply not to support applications that need to identify which input bytes were in error, because I thought that it wasn’t possible to do so without complicating the API for everyone else. However, very early into the implementation phase, I realized that it is possible to identify which bytes are in error without burdening applications that don’t care, provided that the applications that do want to know take responsibility for remembering the last N bytes decoded, where N is relatively small. It turns out that N is 6.

For a malformed sequence that corresponds to a single decode error (i.e. a single REPLACEMENT CHARACTER), a DecoderResult::Malformed(u8, u8) is returned. The first wrapped integer indicates the length of the malformed byte sequence. The second wrapped integer indicates the number of bytes that were consumed after the malformed sequence. If the second integer is zero, the last byte that was consumed is the last byte of the malformed sequence. The malformed bytes may have been part of an earlier input buffer, which is why remembering the recently decoded bytes is the responsibility of the application that wants to identify the bytes that were in error.

The first wrapped integer can have values 1, 2, 3 or 4. The second wrapped integer can have values 0, 1, 2 or 3. The worst-case sum of the two is 6, which happens with ISO-2022-JP.

Identifying Unmappable Characters

When encoding to an encoding other than UTF-8 (the Encoding Standard does not support encoding into UTF-16LE or UTF-16BE, and there is one Unicode scalar value that cannot be encoded into gb18030), it is possible that the encoding cannot represent a character that is being encoded. In this case, instead of returning backward-looking indices, EncoderResult::Unmappable(char) wraps the Unicode scalar value that needs to be replaced with a numeric character reference when performing replacement. In the case of ISO-2022-JP, this Unicode scalar value can be the REPLACEMENT CHARACTER instead of a value actually occurring in the input if the input contains U+000E, U+000F, or U+001B.

This asymmetry between how errors are signaled in the decoder and encoder scenarios makes the signaling appropriate for each scenario instead of optimizing for consistency where consistency isn’t needed.
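For instance, a caller that wants spec-style numeric character references for form submission could drive the encoder along these lines (a sketch; encode_from_utf8 and EncoderResult are the crate’s API, while the fixed-size output buffer and the panic are shortcuts of the example):

use encoding_rs::{EncoderResult, SHIFT_JIS};

let mut encoder = SHIFT_JIS.new_encoder();
let mut out = [0u8; 1024];
let mut total = 0;
// U+20BF BITCOIN SIGN is not encodable in Shift_JIS.
let mut src = "円 → ₿";
loop {
    let (result, read, written) =
        encoder.encode_from_utf8(src, &mut out[total..], true);
    total += written;
    src = &src[read..];
    match result {
        EncoderResult::InputEmpty => break, // done
        EncoderResult::Unmappable(c) => {
            // The fallback the spec requires for form submission:
            let ncr = format!("&#{};", c as u32);
            out[total..total + ncr.len()].copy_from_slice(ncr.as_bytes());
            total += ncr.len();
        }
        EncoderResult::OutputFull => panic!("output buffer too small"),
    }
}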

Non-Streaming API

As noted earlier, Encoding provides non-streaming convenience methods built on top of the streaming functionality. Instead of being simply wrappers for the streaming conversion, the non-streaming methods first try to check if the input is borrowable as output without conversion. For example, if the input is all ASCII and the encoding is ASCII-compatible, a Cow borrowing the input is returned. Likewise, the input is borrowed when the encoding is UTF-8 and the input is valid or when the encoding is ISO-2022-JP and the input contains no escape sequences. Here’s an example of a non-streaming conversion method:
fn decode_with_bom_removal<'a>(
    &'static self,
    bytes: &'a [u8]
) -> (Cow<'a, str>, bool)

(Cow is a Rust standard library type that wraps either an owned type or a corresponding borrowed type, so a heap allocation and copy can be avoided if the caller only needs a borrow. E.g., Cow<'a, str> wraps either a heap-allocated string or a pointer and a length designating a string view into memory owned by someone else. Lifetime 'a indicates that the lifetime of borrowed output depends on the lifetime of the input.)
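For example (a sketch; matching on the Cow makes the borrow observable):

use encoding_rs::UTF_8;
use std::borrow::Cow;

// Valid UTF-8 without a BOM: the output borrows the input; no copy is made.
let (text, had_errors) = UTF_8.decode_with_bom_removal(b"abc");
assert!(matches!(text, Cow::Borrowed("abc")));
assert!(!had_errors);

// A malformed byte forces an owned copy with U+FFFD substituted.
let (text, had_errors) = UTF_8.decode_with_bom_removal(b"a\xFFb");
assert_eq!(text, "a\u{FFFD}b");
assert!(had_errors);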

Internals

Internally, there are five guiding design principles.

  1. For the legacy CJK encodings the conversions to and from UTF-8 and UTF-16 should come from the same source code instead of being implemented twice. (For the UTFs and for single-byte encodings, there are enough optimization opportunities from having two implementations that it doesn’t make sense to keep those unified for the sake of unification.)
  2. Since Web content is either markup, which is runs of ASCII mixed with runs of potentially non-ASCII, or CSS and JS, which are almost entirely ASCII, handling of the ASCII range should be very fast and use SIMD where possible.
  3. Small binary size matters more than the speed of encode into legacy encodings.
  4. For performance, everything should be inlined into the conversion loop. (This rules out abstractions that would involve virtual calls from within the conversion loop.)
  5. The instantiation of converters should be very efficient—just a matter of initializing a few machine words. The instantiation should not read from the file system (other than the system lazily paging in the binary for encoding_rs itself), run decompression algorithms, allocate memory on the heap or compute derived lookup tables from other lookup tables.

Abstracting over UTF-8 and UTF-16

Even though in principle compile-time abstraction over UTF-8 and UTF-16 is a matter of monomorphizing over u8 and u16, handling the two cases using generics would be more complicated than handling them using macros. That’s why it’s handled using macros. The conversion algorithms are written as blocks of code that are inputs to macros that expand to provide the skeleton conversion loop and fill in the encoding-specific blocks of code. In the skeleton in the decode case, one instantiation uses a Utf8Destination struct and another uses a Utf16Destination struct both of which provide the same API for writing into them. In the encode case, the source struct varies similarly.
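A toy illustration of the pattern (hypothetical, heavily simplified types and macro; the crate’s real skeletons also thread sources, handles and error states through):

// Stand-ins for Utf8Destination and Utf16Destination, which expose
// the same writing API to the macroized conversion code.
struct Utf8Dest(Vec<u8>);
struct Utf16Dest(Vec<u16>);

impl Utf8Dest {
    fn write_ascii(&mut self, b: u8) { self.0.push(b); }
}
impl Utf16Dest {
    fn write_ascii(&mut self, b: u8) { self.0.push(b as u16); }
}

// The encoding-specific block is pasted into a skeleton loop once per
// destination type, yielding two monomorphic conversion functions.
macro_rules! decode_loop {
    ($name:ident, $dest:ty) => {
        fn $name(src: &[u8], dst: &mut $dest) {
            for &b in src {
                // ... encoding-specific handling would go here ...
                dst.write_ascii(b);
            }
        }
    };
}

decode_loop!(decode_to_utf8, Utf8Dest);
decode_loop!(decode_to_utf16, Utf16Dest);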

Using Rust Lifetimes to Match Buffer Accesses to Space Checks

The old code in uconv was relatively ad hoc in how it accessed the input and output buffers. It maybe did stuff, advanced some pointers, checked if the pointers reached the end of the buffer and maybe even backed off a bit in some places. It didn’t have an overarching pattern to how space availability was checked and matched to memory accesses so that no accesses could happen without a space check having happened first. For encoding_rs, I wanted to make sure that buffer access only goes forwards without backtracking more than the one byte that might get unread in error cases and that no read happens without checking that there is still data to be read and no write happens without checking that there is space in the output buffer.

Rust’s lifetimes can be used to enforce an “at most once” property. Immediately upon entering a conversion function, the input and output slices are wrapped in source and destination structs that maintain the current read or write position. I’ll use the write case as the example, but the read case works analogously. A decoder that only ever produces characters in the basic multilingual plane uses a BMP space checking method on the destination that takes the destination as a mutable reference (&mut self). If the destination is a UTF-8 destination, the method checks that there is space for at least three additional bytes. If the destination is a UTF-16 destination, the method checks that there is space for at least one additional code unit. If there is enough space, the caller receives a BMP handle whose lifetime is tied to the lifetime of the destination due to the handle containing the mutable reference to the destination. A mutable reference in Rust means exclusive access. Since a mutable reference to the destination is hidden inside the handle, no other method can be called on the destination until the handle goes out of scope. The handle provides a method for writing one BMP scalar value. That method takes the handle’s self by value, consuming the handle and preventing reuse.
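The shape of the pattern, as a toy sketch (simplified, hypothetical types; the real handles also cover astral characters and the unread mechanism described below):

struct Utf16Destination<'b> {
    slice: &'b mut [u16],
    pos: usize,
}

// Holding the handle mutably borrows the destination, so nothing else
// can touch the destination until the handle is gone.
struct BmpHandle<'a, 'b> {
    dest: &'a mut Utf16Destination<'b>,
}

impl<'b> Utf16Destination<'b> {
    // The space check: succeeds only if one BMP code unit fits.
    fn check_space_bmp<'a>(&'a mut self) -> Option<BmpHandle<'a, 'b>> {
        if self.pos < self.slice.len() {
            Some(BmpHandle { dest: self })
        } else {
            None // the conversion loop returns OutputFull
        }
    }
}

impl<'a, 'b> BmpHandle<'a, 'b> {
    // Takes self by value: the handle is consumed, so the write matched
    // to the space check above can happen at most once.
    fn write_bmp(self, unit: u16) {
        self.dest.slice[self.dest.pos] = unit;
        self.dest.pos += 1;
    }
}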

The general concept is that at the top of the loop, the conversion loop checks availability of data at the source and obtains a read handle or returns from the conversion function with InputEmpty, and then checks availability of space at the destination and obtains a write handle or returns from the conversion function with OutputFull. If neither check caused a return out of the conversion function, the conversion loop now hasn’t read or written either buffer but can be fully confident that it can successfully read from the input at most once and write a predetermined amount of units to the output at most once during the loop body. The handles go out of scope at the end of the loop body, and once the loop starts again, it’s time to check for input availability and output space availability again.

As an added twist, the read operation yields not only a byte of input but also an unread handle for unreading it, because in various error cases the spec calls for prepending input that was already read back to the input stream. In practice, all the cases in the spec can be handled by being able to unread at most one unit of input even though the spec text occasionally prepends more than one unit.

Optimizing ASCII and Multibyte Sequences

In practice, the ISO-2022-JP converters, which don’t need to be fast for Web use cases, use the above concept in its general form. For the ASCII-compatible encodings that are actually performance-relevant for Web use cases, there are a couple of elaborations.

First, the UTF-8 destination and the UTF-16 destination know how to copy ASCII from a byte source in an efficient way that handles more than one ASCII character per register (either a SIMD register or even an ALU register). So the main conversion loop starts with a call to a method that first tries to copy ASCII from the source to the destination and then returns a non-ASCII byte and a write handle if there’s space left in the destination. Once a non-ASCII byte is found, another loop is entered into that actually works with the handles.

Second, for multi-byte encodings, the loop that works with the handles doesn’t have a single scope per loop body. Once we’re done copying ASCII, the non-ASCII byte that we found is always a lead byte of a multi-byte sequence unless there is an error, and we are optimizing for the case where there is neither an error nor a buffer boundary. Therefore, it makes sense to start another scope that does the handle-obtaining space-check choreography again in the hope that the next byte will be a valid trail byte given the lead byte that we just saw. Then there is a third, innermost loop for reading the byte after that: if it is non-ASCII, we can continue the middle loop as if this non-ASCII byte had come from the end of the initial ASCII fast path, and if it is ASCII punctuation, we can spin in the innermost loop without trying to handle a longer ASCII run using SIMD, which would likely fail within CJK plain text. However, if we see non-punctuation ASCII, we can continue the outermost loop and go back to the ASCII fast path.

Not matching on a state variable indicating whether we’re expecting a lead or trail byte on a per-byte basis and instead using the program counter for the state distinguishing between lead and trail byte expectations is good for performance. However, it poses a new problem: What if the input buffer ends in the middle of a multi-byte sequence? Since we are using the program counter for state, the code for handling the trail byte in a two-byte encoding is only reachable by first executing the code for handling the lead byte, and since Rust doesn’t have goto or a way to store continuations, after a buffer boundary we can’t just restore the local variables and jump directly to the trail byte handling. To deal with this, the macro structure that allows the reuse of code for decoding both to UTF-8 and to UTF-16 also duplicates the block for handling the trail byte such that the same block occurs between the method entry and the conversion loop. If the previous buffer ended in the middle of a byte sequence, the next call to the conversion function handles the trail of that sequence before entering the actual conversion loop.

Optimizing UTF-8

The UTF-8 decoder does not use the same structure as the other multi-byte decoders. Dealing with invalid byte sequences in the middle of the buffer or valid byte sequences that cross a buffer boundary is implemented naïvely from the spec in a way that is instantiated via macro from the same code both when converting to UTF-8 and when converting to UTF-16. However, once that outer tier of conversion gets to a state where it expects the next UTF-8 byte sequence, it calls into fast-track code that only deals with valid UTF-8 and returns back to the outer tier that’s capable of dealing with invalid UTF-8 or partial sequences when it discovers an incomplete sequence at the end of the buffer or an invalid sequence in the middle. This inner fast track is implemented separately for decoding UTF-8 to UTF-8 and for decoding UTF-8 to UTF-16.

The UTF-8 to UTF-16 case is close to what one might expect from the above description of the legacy multibyte encodings. At the top of the loop, there is the call to the ASCII fast path that zero-extends ASCII to UTF-16 Basic Latin multiple code units at a time, and then byte sequences that start with a non-ASCII lead byte are handled as three cases: two-byte sequence, three-byte sequence or four-byte sequence. Lookup tables are used to check the validity of the combination of lead byte and second byte as explained below. The sequence is considered consumed only if it’s found to be valid. The corresponding UTF-16 code units are then written to the destination as normal u16 writes.

The UTF-8 to UTF-8 case is different. The input is read twice, but the writing is maximally efficient. First, a UTF-8 validation function is run on the input. This function only reads and doesn’t write and uses an ASCII validation fast path that checks more than one code unit at a time using SIMD or multiple code units per ALU word. The UTF-8 validation function is the UTF-8 to UTF-16 conversion function with all the writes removed. After the validation, the valid UTF-8 run is copied to the destination using std::ptr::copy_nonoverlapping(), which is the Rust interface to LLVM memcpy(). This way, the writing, which is generally less efficient than reading, can be done maximally efficiently instead of being done on a byte-by-byte basis for non-ASCII as would result from a read-once implementation. (Note that in the non-streaming case when the input is valid, both the second read and the writing are avoided. More on that later.)

It is not totally clear if this kind of double-reading is smart, since it is a pessimization for the 100% ASCII case. Intuitively, it should help the non-ASCII case, since even the non-ASCII parts can be written using SIMD. However, the 100% ASCII UTF-8 to UTF-8 streaming case, which copies instead of borrowing, runs on Haswell at about two thirds of memcpy() speed while the 100% ASCII windows-1252 to UTF-8 case (which writes the SIMD vectors right away without re-reading) runs at about memcpy() speed.

The hard parts of looping over potentially-invalid UTF-8 are:

  • Minimizing the performance impact of deciding if the lead byte is valid
  • Minimizing the performance impact of deciding if the second byte is valid considering that its valid range depends on the lead byte
  • Avoiding misprediction of the length of the byte sequence representing the next scalar value.

encoding_rs combines the solution for the first two problems. Once it’s known that the lead byte is not ASCII, the lead byte is used as an index to a lookup table that yields a byte whose lower two bits are always zero and that has exactly one of the other six bits set to represent the following cases:

  • Byte is not a legal lead byte.
  • Lead byte is associated with a normal-range second byte.
  • Lead byte for a three-byte sequence requires special lower bound for second byte.
  • Lead byte for a three-byte sequence requires special upper bound for second byte.
  • Lead byte for a four-byte sequence requires special lower bound for second byte.
  • Lead byte for a four-byte sequence requires special upper bound for second byte.

The second byte is used as an index to a lookup table yielding a byte whose low two bits are always zero, whose bit in the position corresponding to the lead being illegal is always one and whose other five bits are zero if the second byte is legal given the type of lead the bit position represents and one otherwise. When the bytes from the two lookup tables are ANDed together, the result is zero if the combination of lead byte and second byte is legal and non-zero otherwise.

When a trail byte is always known to have the normal range, as the third byte in a three-byte sequence is, we can check that the most significant bit is 1 and the second-most significant bit is zero. Note how the ANDing described in the above paragraph always leaves the two least-significant bits of the AND result as zeros. We shift the third byte of a three-byte sequence right by six and OR it with the AND result from the previous paragraph. Now the validity of the three-byte sequence can be decided in a single branch: If the result is 0x2, the sequence is valid. Otherwise, it’s invalid.

In the case of four-byte sequences, the number computed per above is extended to 16 bits and the two most-significant bits of the fourth byte are masked and shifted to bit positions 8 and 9. Now the validity of the four-byte sequence can be decided in a single branch: If the result is 0x202, the sequence is valid. Otherwise, it’s invalid.
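A sketch of the shape of these checks for the three-byte case (the flag assignments and on-the-fly functions are illustrative; the actual implementation indexes two static 256-entry tables instead):

// Bit flags; the low two bits always stay zero.
const NOT_A_LEAD: u8          = 1 << 2;
const NORMAL_SECOND: u8       = 1 << 3; // second byte in 0x80..=0xBF
const THREE_SPECIAL_LOWER: u8 = 1 << 4; // lead 0xE0: second in 0xA0..=0xBF
const THREE_SPECIAL_UPPER: u8 = 1 << 5; // lead 0xED: second in 0x80..=0x9F
const FOUR_SPECIAL_LOWER: u8  = 1 << 6; // lead 0xF0: second in 0x90..=0xBF
const FOUR_SPECIAL_UPPER: u8  = 1 << 7; // lead 0xF4: second in 0x80..=0x8F

// Exactly one bit set per lead byte (a lookup table in the real code).
fn lead_flags(lead: u8) -> u8 {
    match lead {
        0xC2..=0xDF | 0xE1..=0xEC | 0xEE..=0xEF | 0xF1..=0xF3 => NORMAL_SECOND,
        0xE0 => THREE_SPECIAL_LOWER,
        0xED => THREE_SPECIAL_UPPER,
        0xF0 => FOUR_SPECIAL_LOWER,
        0xF4 => FOUR_SPECIAL_UPPER,
        _ => NOT_A_LEAD,
    }
}

// The lead-is-illegal bit is always set; each other bit is set iff the
// second byte is NOT legal for that kind of lead (a table in the real code).
fn second_flags(second: u8) -> u8 {
    let mut flags = NOT_A_LEAD;
    if !(0x80..=0xBF).contains(&second) { flags |= NORMAL_SECOND; }
    if !(0xA0..=0xBF).contains(&second) { flags |= THREE_SPECIAL_LOWER; }
    if !(0x80..=0x9F).contains(&second) { flags |= THREE_SPECIAL_UPPER; }
    if !(0x90..=0xBF).contains(&second) { flags |= FOUR_SPECIAL_LOWER; }
    if !(0x80..=0x8F).contains(&second) { flags |= FOUR_SPECIAL_UPPER; }
    flags
}

// Single-branch validity check for a three-byte sequence: the AND is zero
// iff lead and second are compatible, and third >> 6 is 0b10 iff the third
// byte is a normal trail byte, so a valid sequence yields exactly 0x2.
fn valid_three_byte(lead: u8, second: u8, third: u8) -> bool {
    ((lead_flags(lead) & second_flags(second)) | (third >> 6)) == 0x2
}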

The fast path checks that there are at least 4 bytes of input on each iteration, so the bytes of any valid byte sequence for a single scalar value can be read without further bound checks. The code does use branches to decide whether to try to match the bytes as a two-byte, three-byte or four-byte sequence. I tried to handle the distinction between two-byte sequences and three-byte sequences branchlessly when converting UTF-8 to UTF-16: the mask applied to the lead byte, a mask for zeroing out the bits of the third byte in the two-byte case, and the shift amount (6 in the three-byte case, 0 in the two-byte case) were all taken from lookup tables. The result was slower than just having a branch to distinguish between two-byte sequences and three-byte sequences.

Now that there is branching to categorize the sequence length, it becomes of interest to avoid that branching. It’s also of interest to avoid going back to the SIMD ASCII fast path when the next lead is not ASCII. After a non-ASCII byte sequence, instead of looping back to the ASCII fast path, the next byte is read and checked. After a two-byte sequence, the next lead is checked for ASCIIness. If it’s not ASCII, the code loops back to the point where the SIMD ASCII path has just exited. I.e. there’s a non-ASCII byte as when exiting the ASCII SIMD fast path, but its non-ASCIIness was decided without SIMD. If the byte is an ASCII byte, it is processed and then the code loops back to the ASCII SIMD fast path.

Obviously, this is far from ideal. Avoiding immediate return to ASCII fast path after a two-byte character works within a non-Latin-script word but it doesn’t really help to let one ASCII character signal a return to SIMD when the one ASCII character is a single space between two non-Latin words. Unfortunately, trying to be smarter about avoiding too early looping back to the SIMD fast path would mean more branching, which itself has a cost.

In the two-byte case, if the next lead is non-ASCII, looping back to immediately after the exit from the ASCII fast path means that the next branch is anyway the branch to check if the lead is for a two-byte sequence, so this works out OK for words in non-Latin scripts in the two-byte-per-character part of the Basic Multilingual Plane. In the three-byte case, however, looping back to the point where the ASCII SIMD fast path ends would first run the check for a two-byte lead even though after a three-byte sequence the next lead is more likely to be for another three-byte sequence. Therefore, after a three-byte sequence, the first check performed on the next lead is to see if it, too, is for a three-byte sequence, in which case the code loops back to the start of the three-byte sequence processing code.

Optimizing UTF-16LE and UTF-16BE

UTF-16LE and UTF-16BE are rare enough on the Web that a browser can well get away with a totally naïve and slow from-the-spec implementation. Indeed, that’s what landed in Firefox 56. However, when talking about encoding_rs, it was annoying to always have the figurative asterisk next to UTF-16LE and UTF-16BE to disclose slowness when the rest was fast. To get rid of the figurative asterisk, UTF-16LE and UTF-16BE decode is now optimized, too.

If you read The Unicode Standard, you might be left with the impression that the difference between UTF-16 as an in-memory Unicode representation and UTF-16 as an interchange format is byte order. This is not the full story. There are three additional concerns. First, there is the concern of memory alignment. In the case of UTF-16 as an in-memory Unicode representation, a buffer of UTF-16 code units is aligned to start at a memory address that is a multiple of the size of the code unit. That is, such a buffer always starts at an even address. When UTF-16 as an interchange format is read using a byte-oriented I/O interface, it may happen that a buffer starts at an odd address. Even on CPU architectures that don’t distinguish between aligned and unaligned 16-bit reads and writes on the ISA layer, merely reinterpreting a pointer to bytes starting at an odd address as a pointer pointing to 16-bit units and then accessing it as if it were a normal buffer of 16-bit units is Undefined Behavior in C, C++, and Rust (as can in practice be revealed by autovectorization performed on the assumption of correct alignment). Second, there is the concern of buffers being an odd number of bytes in length, so special logic is needed to handle a UTF-16 code unit split across a buffer boundary. Third, there is the concern of unpaired surrogates, so even when decoding to UTF-16, the input can’t just be copied into the right alignment, potentially with byte order swapping, without inspecting the data.

The structure of the UTF-16LE and UTF-16BE decoders is modeled on the structure of the UTF-8 decoders: There’s a naïve from-the-spec outer tier that deals with invalid and partial sequences and an inner fast path that only deals with valid sequences.

At the core of the fast path is a struct called UnalignedU16Slice that wraps *const u8, i.e. a pointer that can point to either an even or an odd address, and a length in 16-bit units. It provides a way to make the unaligned slice one code unit shorter (to exclude a trailing high surrogate when needed), a way to take a tail subslice and ways to read a u16 or, if SIMD is enabled, a u16x8 in a way that assumes the slice might not be aligned. It also provides a way to copy, potentially with endianness swapping, Basic Multilingual Plane code units to a plain aligned &mut [u16] until the end of the buffer or a surrogate code unit is reached. If SIMD is enabled, both the endianness swapping and the surrogate check are SIMD-accelerated.
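The core unaligned read could look like this (a sketch of the idea; only the name UnalignedU16Slice comes from the crate, which also offers subslicing, SIMD reads and the copying operation described above):

struct UnalignedU16Slice {
    ptr: *const u8, // may be odd
    len: usize,     // length in 16-bit units
}

impl UnalignedU16Slice {
    // Reads the i-th 16-bit unit without assuming alignment;
    // read_unaligned makes this defined behavior even at an odd address.
    fn at(&self, i: usize) -> u16 {
        assert!(i < self.len);
        unsafe { (self.ptr.add(i * 2) as *const u16).read_unaligned() }
    }
}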

When decoding to UTF-16, there’s a loop that first tries to use the above-mentioned Basic Multilingual Plane fast path and, once a surrogate is found, handles the surrogates on a per-code-unit basis and returns back to the top of the loop if there was a valid pair.

When decoding to UTF-8, code copied and pasted from the UTF-16 to UTF-8 encoder is used. The difference is that instead of using &[u16] as the source, the source is an UnalignedU16Slice and, additionally, reads are followed with potential endian swapping. Additionally, unpaired surrogates are reported as errors in decode while UTF-16 to UTF-8 encode silently replaces unpaired surrogates with the REPLACEMENT CHARACTER. If SIMD is enabled, SIMD is used for the ASCII fast path. Both when decoding to UTF-8 and when decoding to UTF-16, endianness swapping is represented by a trait parameter, so the conversions are monomorphized into two copies: One that swaps endianness and one that doesn’t. This results in four conversion functions: Opposite-endian UTF-16 to UTF-8, same-endian UTF-16 to UTF-8, opposite-endian UTF-16 to UTF-16, same-endian UTF-16 to UTF-16. All these assume the worst for alignment. That is, code isn’t monomorphized for the aligned and unaligned cases. Unaligned access is fast on aarch64 and on the several most recent x86_64 microarchitectures, so optimizing performance of UTF-16LE and UTF-16BE in the aligned case for Core2 Duo-era x86_64 or for ARMv7 at the expense of binary size and source code complexity would be a bit too much considering that UTF-16LE and UTF-16BE performance doesn’t even really matter for Web use cases.

Optimizing x-user-defined

Unlike the other decoders, the x-user-defined decoder doesn’t have an optimized ASCII fast path. This is because the main remaining use case for x-user-defined is loading binary data via XMLHttpRequest in code written before proper binary data support via ArrayBuffers was introduced to JavaScript. (Note that when HTML is declared as x-user-defined via the meta tag, the windows-1252 decoder is used in place of the x-user-defined decoder.)

When decoding to UTF-8, the byte length of the output varies depending on content, so the operation is not suitable for SIMD. The loop simply works on a per-byte basis. However, when decoding to UTF-16 with SIMD enabled, each u8x16 vector is zero-extended into two u16x8 vectors. A mask is computed by a lane-wise greater-than comparison to see which lanes are not in the ASCII range. The mask is used to retain the corresponding lanes from a vector with all lanes set to 0xF700, and the result is added to the original u16x8 vector.

Portable SIMD

(Nightly) Rust provides access to portable SIMD, which closely maps to LLVM’s notion of portable SIMD. There are portable types, such as the u8x16 and u16x8 types used by encoding_rs.
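As an example of these types in use, the x-user-defined mapping described above can be sketched like this (using the current std::simd nightly feature rather than the portable SIMD crate that encoding_rs used when this was written; the function name is hypothetical):

#![feature(portable_simd)]
use std::simd::prelude::*;

// Each u16 lane holds one zero-extended input byte. Lanes above 0x7F
// get 0xF700 added, mapping 0x80..=0xFF to U+F780..=U+F7FF as
// x-user-defined requires; ASCII lanes pass through unchanged.
fn x_user_defined_lanes(zero_extended: u16x8) -> u16x8 {
    let non_ascii = zero_extended.simd_gt(u16x8::splat(0x7F));
    zero_extended + non_ascii.select(u16x8::splat(0xF700), u16x8::splat(0))
}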
