Why does Python's decode replace invalid bytes in an encoded string?

Trying to decode an invalidly encoded UTF-8 HTML page gives different results in Python, Firefox, and Chrome.

The invalid encoded fragment from the test page looks like 'PREFIX\xe3\xabSUFFIX'

>>> fragment = 'PREFIX\xe3\xabSUFFIX'
>>> fragment.decode('utf-8', 'strict')
...
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 6-8: invalid data

UPDATE: This question concluded in a bug report to the Python unicode component. The issue is reported to be fixed in Python 2.7.11 and 3.5.2.


Following are the replacement strategies used to handle decoding errors in Python, Firefox, and Chrome. Note how they differ, and especially how the Python builtin removes the valid S (along with the invalid byte sequence).

Python

The builtin replace error handler replaces the invalid \xe3\xab plus the S from SUFFIX with U+FFFD

>>> fragment.decode('utf-8', 'replace')
u'PREFIX\ufffdUFFIX'
>>> print _
PREFIX�UFFIX

Browsers

To test how browsers decode the invalid sequence of bytes, I will use a CGI script:

#!/usr/bin/env python
print """\
Content-Type: text/plain; charset=utf-8

PREFIX\xe3\xabSUFFIX"""

Firefox and Chrome both rendered:

PREFIX�SUFFIX

Why is the builtin replace error handler for str.decode removing the S from SUFFIX?

(Update 1)

According to the Wikipedia UTF-8 article (thanks mjv), the following byte ranges are used to indicate the start of a sequence of bytes:

  • 0xC2-0xDF: start of a 2-byte sequence
  • 0xE0-0xEF: start of a 3-byte sequence
  • 0xF0-0xF4: start of a 4-byte sequence
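As a quick sanity check, the lead-byte ranges above can be captured in a small helper (a Python 3 sketch of my own; the function name is not from the thread):

```python
def sequence_length(lead):
    """Number of bytes a UTF-8 sequence should span, judged by its lead byte."""
    if lead < 0x80:
        return 1                 # plain ASCII byte
    if 0xC2 <= lead <= 0xDF:
        return 2                 # start of a 2-byte sequence
    if 0xE0 <= lead <= 0xEF:
        return 3                 # start of a 3-byte sequence
    if 0xF0 <= lead <= 0xF4:
        return 4                 # start of a 4-byte sequence
    return 0                     # stray continuation byte or invalid lead

print(sequence_length(0xE3))     # 3 -- the lead byte in the test fragment
```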

The 'PREFIX\xe3\xabSUFFIX' test fragment has 0xE3; it instructs the Python decoder that a 3-byte sequence follows. The sequence is found invalid, and the Python decoder ignores the whole sequence including '\xabS', then continues after it, ignoring any possible correct sequence starting in the middle.

This means that for an invalid encoded sequence like '\xf0SUFFIX', it will decode u'\ufffdFIX' instead of u'\ufffdSUFFIX'.

Example 1: Introducing DOM parsing errors

>>> '<div>\xf0<div>Price: $20</div>...</div>'.decode('utf-8', 'replace')
u'<div>\ufffdv>Price: $20</div>...</div>'
>>> print _
<div>�v>Price: $20</div>...</div>

Example 2: Security issues (Also see Unicode security considerations):

>>> '\xf0<!-- <script>alert("hi!");</script> -->'.decode('utf-8', 'replace')
u'\ufffd- <script>alert("hi!");</script> -->'
>>> print _
�- <script>alert("hi!");</script> -->

Example 3: Removing valid information from a scraping application

>>> '\xf0' + u'it\u2019s'.encode('utf-8') # "it’s"
'\xf0it\xe2\x80\x99s'
>>> _.decode('utf-8', 'replace')
u'\ufffd\ufffd\ufffds'
>>> print _
���s

Using a CGI script to render this in browsers:

#!/usr/bin/env python
print """\
Content-Type: text/plain; charset=utf-8

\xf0it\xe2\x80\x99s"""

Rendered:

�it’s
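For comparison (not part of the original test): under a Python that includes the fix noted in the UPDATE above (2.7.11 / 3.5.2), the builtin 'replace' handler agrees with the browsers on this fragment. A Python 3 sketch:

```python
# The fixed decoder replaces only the lone \xf0 and keeps the valid
# "it's" bytes that follow, matching what Firefox and Chrome render.
data = b'\xf0' + 'it\u2019s'.encode('utf-8')
print(data.decode('utf-8', 'replace'))   # �it’s
```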

Is there any official recommended way for handling decoding replacements?

(Was UPDATE 2)

In a public review, the Unicode Technical Committee has opted for option 2 of the following candidates:

  1. Replace the entire ill-formed subsequence by a single U+FFFD.
  2. Replace each maximal subpart of the ill-formed subsequence by a single U+FFFD.
  3. Replace each code unit of the ill-formed subsequence by a single U+FFFD.

The UTC resolution is dated 2008-08-29; source: http://www.unicode.org/review/resolved-pri-100.html

UTC Public Review 121 also includes the invalid bytestream '\x61\xF1\x80\x80\xE1\x80\xC2\x62' as an example; it shows the decoding results for each option.

            61      F1      80      80      E1      80      C2      62
      1   U+0061  U+FFFD                                          U+0062
      2   U+0061  U+FFFD                  U+FFFD          U+FFFD  U+0062
      3   U+0061  U+FFFD  U+FFFD  U+FFFD  U+FFFD  U+FFFD  U+FFFD  U+0062

In plain Python the three results are:

  1. u'a\ufffdb' shows as a�b
  2. u'a\ufffd\ufffd\ufffdb' shows as a���b
  3. u'a\ufffd\ufffd\ufffd\ufffd\ufffd\ufffdb' shows as a������b

And here is what Python does for the invalid example bytestream:

>>> '\x61\xF1\x80\x80\xE1\x80\xC2\x62'.decode('utf-8', 'replace')
u'a\ufffd\ufffd\ufffd'
>>> print _
a���
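(For reference: on a Python with the fix from the UPDATE at the top, the same bytestream now decodes per option 2, matching the browsers. A Python 3 check:)

```python
# Fixed decoder: one U+FFFD per maximal subpart of the ill-formed
# subsequence (option 2), and the trailing b survives.
fixed = b'\x61\xF1\x80\x80\xE1\x80\xC2\x62'.decode('utf-8', 'replace')
print(fixed)   # a���b
```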

Again, using a CGI script to test how browsers render the buggy encoded bytes:

#!/usr/bin/env python
print """\
Content-Type: text/plain; charset=utf-8

\x61\xF1\x80\x80\xE1\x80\xC2\x62"""

Both Chrome and Firefox rendered:

a���b

Note that the browser-rendered result matches option 2 of the PR121 recommendation.

While option 3 looks easily implementable in Python, options 1 and 2 are a challenge.

>>> import codecs
>>> replace_option3 = lambda exc: (u'\ufffd', exc.start+1)
>>> codecs.register_error('replace_option3', replace_option3)
>>> '\x61\xF1\x80\x80\xE1\x80\xC2\x62'.decode('utf-8', 'replace_option3')
u'a\ufffd\ufffd\ufffd\ufffd\ufffd\ufffdb'
>>> print _
a������b
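The same per-byte strategy also works as a bytes error handler under Python 3 (the handler name here is arbitrary, my own choice for this sketch):

```python
import codecs

# Restart one byte past each error: option 3, one U+FFFD per bad byte.
codecs.register_error('replace_per_byte',
                      lambda exc: ('\ufffd', exc.start + 1))

result = b'\x61\xF1\x80\x80\xE1\x80\xC2\x62'.decode('utf-8', 'replace_per_byte')
print(result)   # a������b
```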

Best Answer

You know that your S is valid, with the benefit of both look-ahead and hindsight :-) Suppose there was originally a legal 3-byte UTF-8 sequence there, and the 3rd byte was corrupted in transmission ... with the change that you mention, you'd be complaining that a spurious S had not been replaced. There is no "right" way of doing it, without the benefit of error-correcting codes, or a crystal ball, or a tambourine.

Update

As @mjv remarked, the UTC issue is about how many U+FFFDs should be emitted.

In fact, Python is not using ANY of the UTC's 3 options.

Here is the UTC's sole example:

      61      F1      80      80      E1      80      C2      62
1   U+0061  U+FFFD                                          U+0062
2   U+0061  U+FFFD                  U+FFFD          U+FFFD  U+0062
3   U+0061  U+FFFD  U+FFFD  U+FFFD  U+FFFD  U+FFFD  U+FFFD  U+0062

Here is what Python does:

>>> bad = '\x61\xf1\x80\x80\xe1\x80\xc2\x62cdef'
>>> bad.decode('utf8', 'replace')
u'a\ufffd\ufffd\ufffdcdef'
>>>

Why?

F1 should start a 4-byte sequence, but E1 is not valid. One bad sequence, one replacement.
Start again at the next byte, the 3rd 80. Bang, another FFFD.
Start again at C2, which introduces a 2-byte sequence, but C2 62 is invalid, so bang again.
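The restart-after-promised-length behaviour described above can be simulated with a short sketch (my own Python 3 approximation, not the actual C decoder):

```python
def buggy_replace_decode(data):
    """Mimic the pre-fix CPython 'replace' handler: on any error, emit one
    U+FFFD and skip the full length promised by the lead byte, even when
    that swallows valid bytes such as the S in SUFFIX."""
    out, i = [], 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:
            n = 1                    # ASCII byte
        elif 0xC2 <= lead <= 0xDF:
            n = 2                    # lead byte promises 2 bytes
        elif 0xE0 <= lead <= 0xEF:
            n = 3                    # lead byte promises 3 bytes
        elif 0xF0 <= lead <= 0xF4:
            n = 4                    # lead byte promises 4 bytes
        else:
            n = 1                    # stray continuation or invalid lead
        try:
            out.append(data[i:i + n].decode('utf-8'))
        except UnicodeDecodeError:
            out.append('\ufffd')     # replace AND skip all n bytes
        i += n
    return ''.join(out)

print(buggy_replace_decode(b'\x61\xf1\x80\x80\xe1\x80\xc2\x62'))  # a���
print(buggy_replace_decode(b'PREFIX\xe3\xabSUFFIX'))              # PREFIX�UFFIX
```

Both outputs reproduce the buggy results shown earlier in this question.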

It's interesting that the UTC didn't mention what Python is doing (restarting after the number of bytes indicated by the lead character). Perhaps this is actually forbidden or deprecated somewhere in the Unicode standard. More reading required. Watch this space.

Update 2: Houston, we have a problem.

=== Quoted from Chapter 3 of Unicode 5.2 ===

Constraints on Conversion Processes

The requirement not to interpret any ill-formed code unit subsequences in a string as characters (see conformance clause C10) has important consequences for conversion processes.

Such a process might, for example, interpret UTF-8 code unit sequences as Unicode character sequences. If the converter encounters an ill-formed UTF-8 code unit sequence which starts with a valid first byte, but which does not continue with valid successor bytes (see Table 3-7), it must not consume the successor bytes as part of the ill-formed subsequence whenever those successor bytes themselves constitute part of a well-formed UTF-8 code unit subsequence.

If an implementation of a UTF-8 conversion process stops at the first error encountered, without reporting the end of any ill-formed UTF-8 code unit subsequence, then the requirement makes little practical difference. However, the requirement does introduce a significant constraint if the UTF-8 converter continues past the point of a detected error, perhaps by substituting one or more U+FFFD replacement characters for the uninterpretable, ill-formed UTF-8 code unit subsequence. For example, with the input UTF-8 code unit sequence <C2 41 42>, such a UTF-8 conversion process must not return <U+FFFD> or <U+FFFD, U+0042>, because either of those outputs would be the result of misinterpreting a well-formed subsequence as being part of the ill-formed subsequence. The expected return value for such a process would instead be <U+FFFD, U+0041, U+0042>.

For a UTF-8 conversion process to consume valid successor bytes is not only non-conformant, but also leaves the converter open to security exploits. See Unicode Technical Report #36, "Unicode Security Considerations."

=== End quote ===

It then goes on to discuss, with examples, the "how many FFFDs to emit" issue.

Using their example from the second-last quoted paragraph:

>>> bad2 = "\xc2\x41\x42"
>>> bad2.decode('utf8', 'replace')
u'\ufffdB'
# FAIL
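(On a Python with the fix from the UPDATE at the top, this example now keeps the A, as the standard requires. A Python 3 check:)

```python
# Fixed decoder: only the lone C2 lead byte is replaced; A and B survive.
print(b'\xc2\x41\x42'.decode('utf-8', 'replace'))   # �AB
```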

Note that this is a problem with both the 'replace' and 'ignore' options of str.decode('utf_8') -- it is all about omitting data, not about how many U+FFFDs are emitted; get the data-emitting part right and the U+FFFD issue falls out naturally, as explained in the part that I didn't quote.

Update 3 Current versions of Python (including 2.7) have unicodedata.unidata_version as '5.1.0' which may or may not indicate that the Unicode-related code is intended to conform to Unicode 5.1.0. In any case, the wordy prohibition of what Python is doing didn't appear in the Unicode standard until 5.2.0. I'll raise an issue on the Python tracker without mentioning the word 'oht'.encode('rot13').

Reported here
