

Henri Sivonen: Always Use UTF-8 & Always Label Your HTML Saying So

Wednesday, 19 February 2020, 22:21

To avoid having to deal with escapes (other than for <, >, &, and "), to avoid data loss in form submission, to avoid XSS when serving user-provided content, and to comply with the HTML Standard, always encode your HTML as UTF-8. Furthermore, in order to let browsers know that the document is UTF-8-encoded, always label it as such. To label your document, you need to do at least one of the following:

  • Put <meta charset="utf-8"> as the first thing after the <head> tag.

    The meta tag, including its ending > character, needs to be within the first 1024 bytes of the file. Putting it right after <head> is the easiest way to get this right. Do not put comments before <head>.

  • Configure your server to send the header Content-Type: text/html; charset=utf-8 on the HTTP layer.

  • Start the document with the UTF-8 BOM, i.e. the bytes 0xEF, 0xBB, and 0xBF.

Doing more than one of these is OK.
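
As a rough illustration of the three labeling methods, here is a minimal Python sketch that reports which of them a given response uses. The function name utf8_labels and the regular expression are assumptions made for this example; real browsers run a more elaborate prescan algorithm, so treat this as a sanity check rather than a faithful reimplementation.

```python
import re
from typing import List, Optional

UTF8_BOM = b"\xef\xbb\xbf"

# Simplified pattern for <meta charset="utf-8">; this is an assumption of the
# sketch, not the prescan algorithm that browsers actually implement.
META_CHARSET = re.compile(
    rb'<meta\s+charset\s*=\s*["\']?utf-8["\']?\s*/?>', re.IGNORECASE
)


def utf8_labels(body: bytes, content_type: Optional[str] = None) -> List[str]:
    """Return which UTF-8 labels ("http", "bom", "meta") are present."""
    labels = []
    # HTTP layer: Content-Type: text/html; charset=utf-8
    if content_type and "charset=utf-8" in content_type.lower().replace(" ", ""):
        labels.append("http")
    # BOM: the document starts with the bytes 0xEF, 0xBB, 0xBF
    if body.startswith(UTF8_BOM):
        labels.append("bom")
    # meta: the whole tag, including its closing >, must end within 1024 bytes
    match = META_CHARSET.search(body)
    if match and match.end() <= 1024:
        labels.append("meta")
    return labels
```

An empty result means the document is unlabeled. More than one entry is fine, as noted above.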

Answers to Questions

The above says the important bit. Here are answers to further questions:

Why Do I Need to Label UTF-8 in HTML?

Because HTML didn’t support UTF-8 in the very beginning and legacy content can’t be expected to opt out, you need to opt into UTF-8 just like you need to opt into the standards mode (via <!DOCTYPE html>) and into mobile-friendly layout (via <meta name="viewport" content="width=device-width, initial-scale=1">).

Which Method Should I Choose?

<meta charset="utf-8"> has the benefit of keeping the label within your document even if you move it around. The main risk is that someone forgets that it needs to be within the first 1024 bytes and puts comments, Facebook metadata, rel=preloads, stylesheets or scripts before it. Always put that other stuff after it.

The HTTP header has the benefit that if you are setting up a new server that doesn’t have any old non-UTF-8 documents on it, you can configure the header once, and it works for all HTML documents on the server thereafter.
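
How the header is configured depends on the server software. As one hedged illustration, the sketch below uses Python's standard http.server module and overrides the extension-to-type map so that every .html file is sent with the UTF-8 label. The class name Utf8Handler, the serve helper, and the port are assumptions of this example, and the approach is only appropriate when every HTML file in the served directory really is UTF-8.

```python
from http.server import HTTPServer, SimpleHTTPRequestHandler


class Utf8Handler(SimpleHTTPRequestHandler):
    # Configure the label once; it then applies to every HTML file served.
    extensions_map = {
        **SimpleHTTPRequestHandler.extensions_map,
        ".html": "text/html; charset=utf-8",
        ".htm": "text/html; charset=utf-8",
    }


def serve(port: int = 8000) -> None:
    """Serve the current directory on localhost (blocks until interrupted)."""
    HTTPServer(("127.0.0.1", port), Utf8Handler).serve_forever()
```

Calling serve() from the directory that holds the documents starts the server.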

The BOM method has the problem that it’s too easy to edit the file in a text editor that removes the BOM and not notice that this has happened. However, if you are writing a serializer library and you are neither in control of the HTTP header nor can inject a meta tag without interfering with what your users are doing, you can make the serializer always start with the UTF-8 BOM and know that things will be OK.
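
For the serializer case, a minimal Python sketch could look like the following. The function name serialize_with_bom is hypothetical; the only thing it demonstrates is prepending the three BOM bytes to the UTF-8 output.

```python
import codecs


def serialize_with_bom(markup: str) -> bytes:
    """Encode markup as UTF-8 and start the output with the UTF-8 BOM."""
    # codecs.BOM_UTF8 is the byte sequence 0xEF, 0xBB, 0xBF.
    return codecs.BOM_UTF8 + markup.encode("utf-8")
```

Python's built-in "utf-8-sig" codec produces the same bytes when encoding, so markup.encode("utf-8-sig") is an equivalent one-liner.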

Can I Use UTF-16 Instead?

Don’t. If you serve user-provided content as UTF-16, it is possible to smuggle content that becomes executable when interpreted as other encodings. This is a cross-site scripting vulnerability if the user uses a browser that allows the user to manually override UTF-16 with another encoding.
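
To see why this is possible, consider the following Python sketch. It is a simplified illustration that assumes little-endian UTF-16 without a BOM and windows-1252 as the override encoding. It builds a four-character string that contains no < or > as text, yet whose UTF-16LE bytes are exactly the ASCII bytes of <script>.

```python
# The ASCII bytes of "<script>" form four valid UTF-16LE code units,
# so they decode to four harmless-looking CJK characters.
payload = b"<script>"
smuggled_text = payload.decode("utf-16-le")

# A site serving user content as UTF-16LE would emit these bytes.
served = smuggled_text.encode("utf-16-le")

# If the encoding is overridden to a single-byte encoding,
# the same bytes read as markup.
reinterpreted = served.decode("windows-1252")
```

Here smuggled_text would pass a filter that looks for angle brackets in the text, while reinterpreted is the literal string <script>.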

UTF-16 cannot be labeled via <meta charset>.

What about Plain Text?

The <meta charset="utf-8"> method is not available for plain text, but the other two are. In the case of plain text, the HTTP header is Content-Type: text/plain; charset=utf-8 instead.

What about JavaScript?

If you’ve labeled your HTML as UTF-8, you don’t need to label your UTF-8-encoded JavaScript files, since by default they inherit the encoding from the document that includes them. However, to make your JavaScript robust when referenced from non-UTF-8 HTML, you can use the UTF-8 BOM or the HTTP header, which is Content-Type: application/javascript; charset=utf-8 in the JavaScript case.

What about CSS?

If you’ve labeled your HTML as UTF-8, you don’t need to label your UTF-8-encoded CSS files, since by default they inherit the encoding from the document that includes them. However, to make your CSS robust when referenced from non-UTF-8 HTML, you can use the UTF-8 BOM or the HTTP header, which is Content-Type: text/css; charset=utf-8 in the CSS case, or you can put @charset "utf-8"; as the very first thing in the CSS file.
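
If a build step writes your CSS files, the @charset option can be automated. The sketch below is a hypothetical helper, not part of any library. It prepends the rule only when the stylesheet does not already start with it, and it checks for the exact lowercase spelling only.

```python
CHARSET_RULE = '@charset "utf-8";'


def ensure_css_charset(css: str) -> bytes:
    """Return UTF-8 bytes whose very first thing is the @charset rule."""
    # The rule only works as the first thing in the file, so it goes at byte 0.
    if not css.startswith(CHARSET_RULE):
        css = CHARSET_RULE + "\n" + css
    return css.encode("utf-8")
```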

What about XML (Including SVG)?

Unlabeled XML defaults to UTF-8, so you don’t need to label it.

What about JSON?

JSON must be UTF-8 and is processed as UTF-8, so there’s no labeling.

What about WebVTT?

WebVTT is always UTF-8, so there’s no labeling.

https://hsivonen.fi/label-utf-8/


 
