Last modified: 2009-01-12 14:11:58 UTC
Consider the following query: http://localhost/w/api.php?action=query&format=xml&action=expandtemplates&text=%ef%bf%bd%f0%90%80%80%f3%b0%80%8fzzz It contains 6 characters: U+fffd, U+10000, U+f000f, U+007a, U+007a, and U+007a. In JSON encoding, they should be \ufffd\ud800\udc00\udb80\udc0fzzz (U+10000 and U+f000f must be encoded as surrogate pairs). If I change the format to jsonfm, the three non-ASCII characters are instead encoded as \ufffd\ud800dc00\udb80dc0fzzz (the backslash-u prefix of each low surrogate is missing), which cannot be decoded correctly. This should be relatively simple to fix, I think. If I change the format to json, it's even worse: the first two characters are output correctly as \ufffd\ud800\udc00, but the output stops there! Apparently PHP's built-in json_encode silently screws up anything over U+1ffff: U+20000-U+3ffff, U+80000-U+bffff, and U+100000-U+10ffff seem to be incorrectly encoded as U+10000-U+1ffff, while U+40000-U+7ffff and U+c0000-U+fffff seem to cause the silent truncation mentioned above. The only fix I can think of is to detect whether these characters are present and use the fallback code instead. I'll see about posting a patch later on.
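For reference, the expected escapes above can be reproduced with a short Python sketch. This is only an illustration of the surrogate-pair arithmetic, not the MediaWiki formatter code; the function name json_escape_non_ascii is made up for this example, and it deliberately ignores quotes, backslashes, and control characters.

```python
import json
from urllib.parse import unquote

def json_escape_non_ascii(text):
    """Escape non-ASCII characters as \\uXXXX, using surrogate pairs above U+FFFF.

    Hypothetical helper for illustration; ASCII is passed through unchanged,
    so quotes, backslashes, and control characters are NOT handled here.
    """
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:
            out.append(ch)
        elif cp <= 0xFFFF:
            out.append("\\u%04x" % cp)
        else:
            # Split the 20-bit offset into a high and a low surrogate.
            offset = cp - 0x10000
            high = 0xD800 + (offset >> 10)
            low = 0xDC00 + (offset & 0x3FF)
            out.append("\\u%04x\\u%04x" % (high, low))
    return "".join(out)

# The text parameter from the query above, percent-decoded as UTF-8.
text = unquote("%ef%bf%bd%f0%90%80%80%f3%b0%80%8fzzz")
escaped = json_escape_non_ascii(text)
print(escaped)  # \ufffd\ud800\udc00\udb80\udc0fzzz

# Round trip: a conforming JSON decoder recombines the surrogate pairs.
print(json.loads('"' + escaped + '"') == text)  # True
```

Each low surrogate needs its own backslash-u prefix; dropping it, as in the jsonfm output, leaves a lone high surrogate followed by literal hex digits.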
Created attachment 5625 (Patch). The PHP bug has been reported at http://bugs.php.net/bug.php?id=46944. This patch adjusts the fallback JSON encoder so that it can handle UTF-16 surrogate pairs, and removes some of the support for invalid UTF-8 encoded characters above U+10FFFF. It also adds a check to see whether PHP's built-in json_encode is affected by PHP bug 46944, and uses our fallback code if so.
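The detect-and-fall-back idea can be sketched as follows, in Python rather than PHP and not taken from the attached patch. All names here are hypothetical, and simulated_buggy_encode only imitates the folding behaviour described in the report (it does not reproduce the truncation case).

```python
import json

# Probe with U+20000, which lies in a range the report says is mis-encoded.
PROBE = "\U00020000"
EXPECTED = '"\\ud840\\udc00"'  # correct surrogate pair for U+20000

def encoder_is_broken(encode):
    """Return True if `encode` fails to produce the expected escapes for the probe."""
    try:
        return encode(PROBE).lower() != EXPECTED
    except Exception:
        return True

def simulated_buggy_encode(value):
    """Hypothetical stand-in for an affected encoder: folds code points
    above U+1FFFF into the range U+10000-U+1FFFF before encoding."""
    folded = "".join(
        chr(0x10000 | (ord(ch) & 0xFFFF)) if ord(ch) > 0x1FFFF else ch
        for ch in value
    )
    return json.dumps(folded)

def fallback_encode(value):
    """Hypothetical fallback; here it simply delegates to Python's encoder."""
    return json.dumps(value, ensure_ascii=True)

def choose_encoder(builtin, fallback):
    """Use the built-in encoder unless the probe shows it is broken."""
    return fallback if encoder_is_broken(builtin) else builtin

encoder = choose_encoder(simulated_buggy_encode, fallback_encode)
print(encoder("\U000f000fzzz"))  # "\udb80\udc0fzzz"
```

Probing once and caching the result keeps the check cheap, and it means a fixed built-in encoder is used again automatically without any version comparison.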
Heh, wrong example URL in the original post. That should obviously be http://en.wikipedia.org/w/api.php?action=query&format=xml&action=expandtemplates&text=%EF%BF%BD%F0%90%80%80%F3%B0%80%8Fzzz
Will try to review this soon.
On a side note, PHP reports this as being fixed now.
(In reply to comment #4)
> On a side note, PHP reports this as being fixed now.
That's nice, but it means that older versions of PHP still have broken JSON formatters. At a quick glance, the patch seems to account for that and only falls back to our own JSON formatter if PHP's is broken.
Slightly modified patch applied in r45674.