Last modified: 2012-05-07 21:22:55 UTC
extensions/PdfHandler.image.php fails to extract PDF text as UTF-8, see http://fr.wikisource.org/w/index.php?title=Page:Journal_des_d%C3%A9bats,_7_d%C3%A9cembre_1820.pdf/1&action=edit&redlink=1
retrieveMetaData() forces UTF-8 output encoding, but only for the metadata; this is not done for the text itself. For some reason, it looks like the pdftotext installed on the cluster doesn't use UTF-8 as its default output encoding. (Note: a PDF has no internal text encoding, as text is encoded as draw commands using the currently selected font.)
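To illustrate the kind of fix this suggests, here is a minimal sketch in Python (not the actual PdfHandler code, which is PHP). It assumes the text is extracted by calling pdftotext once per page and reading stdout. The helper names `build_pdftotext_command` and `extract_page_text` are hypothetical. The `-enc`, `-f` and `-l` options and the `-` output argument are standard pdftotext options.

```python
import subprocess


def build_pdftotext_command(pdf_path, page, binary="pdftotext"):
    """Build an argv list that extracts a single page as UTF-8 text.

    Hypothetical helper for illustration; not part of PdfHandler.
    """
    return [
        binary,
        "-enc", "UTF-8",   # request UTF-8 explicitly instead of relying on the default
        "-f", str(page),   # first page to convert
        "-l", str(page),   # last page to convert
        pdf_path,
        "-",               # write the text to stdout
    ]


def extract_page_text(pdf_path, page, binary="pdftotext"):
    """Run pdftotext and decode its output as UTF-8 (requires pdftotext installed)."""
    result = subprocess.run(
        build_pdftotext_command(pdf_path, page, binary),
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8")
```

Passing the encoding explicitly means the result no longer depends on whichever default the installed pdftotext build happens to use.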
Bug 35122 has more details. *** This bug has been marked as a duplicate of bug 35122 ***