Last modified: 2013-09-04 11:51:28 UTC
I'm creating a sitemap for my wiki with the following command:

php /memberroot/dmb/public_html/metabase/mw/maintenance/generateSitemap.php \
  --fspath /memberroot/dmb/public_html/metabase/mw/sitemap \
  --server http://metadatabase.org \
  --urlpath http://metadatabase.org/sitemap

When I load this into Google Webmaster Tools, almost everything works fine. However, a couple of pages have a weird 'null' (empty) lastmod field:

<url>
  <loc>http://metadatabase.org/wiki/Main_Page</loc>
  <lastmod></lastmod>
  <priority>1.0</priority>
</url>

and:

<url>
  <loc>http://metadatabase.org/wiki/Help:About</loc>
  <lastmod></lastmod>
  <priority>0.5</priority>
</url>

It's always these two pages! This causes Google to barf with an error about an incorrect date format.
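For anyone who wants to find the affected entries without waiting for Google to report them, here is a minimal sketch in Python that lists every <url> whose <lastmod> element is present but empty. The function names are made up for illustration, and the code assumes the sitemap file has already been read into a string. It ignores XML namespaces so that it works both on the bare excerpt above and on a namespaced sitemap.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an optional '{namespace}' prefix from an element tag."""
    return tag.rsplit('}', 1)[-1]


def urls_with_empty_lastmod(xml_text):
    """Return the <loc> of every <url> whose <lastmod> is present but empty.

    Illustrative helper, not part of MediaWiki. A <url> without any
    <lastmod> element is not reported, because the tag is optional.
    """
    root = ET.fromstring(xml_text)
    bad = []
    for element in root.iter():
        if local_name(element.tag) != 'url':
            continue
        loc = None
        lastmod = None
        for child in element:
            name = local_name(child.tag)
            if name == 'loc':
                loc = (child.text or '').strip()
            elif name == 'lastmod':
                lastmod = (child.text or '').strip()
        if lastmod == '':
            bad.append(loc)
    return bad


# Hypothetical sample built from the two kinds of entry discussed above.
SAMPLE = """<urlset>
  <url>
    <loc>http://metadatabase.org/wiki/Main_Page</loc>
    <lastmod></lastmod>
    <priority>1.0</priority>
  </url>
  <url>
    <loc>http://metadatabase.org/wiki/Other</loc>
    <lastmod>2011-07-05T21:40:06Z</lastmod>
    <priority>0.5</priority>
  </url>
</urlset>"""

if __name__ == '__main__':
    for loc in urls_with_empty_lastmod(SAMPLE):
        print(loc)
```

The second entry in the sample (wiki/Other) is invented purely to show a well-formed case.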
Here is the exact error message from Google Webmaster Tools:

6680 Invalid date
An invalid date was found. Please fix the date or formatting before resubmitting.
Parent tag: url
Tag: lastmod
Value:
Problem detected on: Jul 3, 2011
Another possibility might be http://www.mediawiki.org/wiki/Extension:GoogleNewsSitemap
Dan, can you check what the page_touched values for these rows in the page table are? Normally this should carry a timestamp, which in MediaWiki on MySQL is stored as a 14-character string (YYYYMMDDHHMMSS). Null or empty *should* end up formatting the current time, though there may be some bad values or such.
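To make the expected format concrete, here is a rough sketch in Python (not MediaWiki's own code) of turning a 14-character YYYYMMDDHHMMSS string into the kind of date a sitemap lastmod expects. The function name to_lastmod is hypothetical, and the sketch assumes the stored value is in UTC. Unlike the behaviour described above, it returns None for empty or malformed input rather than substituting the current time, so that bad rows are easy to spot.

```python
import re
from datetime import datetime, timezone


def to_lastmod(page_touched):
    """Convert a 14-digit YYYYMMDDHHMMSS string to a W3C datetime string.

    Returns None when the value is empty, not exactly 14 ASCII digits,
    or not a real calendar date and time. Assumes the value is in UTC.
    """
    if not page_touched or not re.fullmatch(r'[0-9]{14}', page_touched):
        return None
    try:
        parsed = datetime.strptime(page_touched, '%Y%m%d%H%M%S')
    except ValueError:
        # 14 digits, but not a valid date (for example month 13).
        return None
    return parsed.replace(tzinfo=timezone.utc).strftime('%Y-%m-%dT%H:%M:%SZ')


if __name__ == '__main__':
    for value in ('20090811074455', '', '20111305000000'):
        print(repr(value), '->', to_lastmod(value))
```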
I debugged this a bit with TimStarling, but the results are still a bit confusing. It seems that MediaWiki is corrupting the page_touched field!

I recently imported the data for this wiki from MW 1.11 (using MySQL dump 10.13 Distrib 5.1.56) into MW 1.17 (using MySQL 4.1.22-standard-log). Since that import, I touched a couple of pages (guess which?) and discovered this problem with the site map. Since reporting the two problem pages, I ran a process that touched many pages, and here is the state of the page_touched field:

mysql> select page_touched, count(*) from mb_page group by page_touched limit 20;
+----------------+----------+
| page_touched   | count(*) |
+----------------+----------+
| 2.01107052046E |      606 |
| 2.01107052047E |     1179 |
| 2.01107052048E |     1116 |
| 2.01107052049E |     1255 |
| 2.01107052092E |        2 |
| 2.01107052094E |        1 |
| 2.01107052095E |        1 |
| 2.01107052096E |        5 |
| 2.01107052097E |      275 |
| 2.01107052098E |      132 |
| 2.01107052227E |        2 |
| 2.01107052229E |        1 |
| 2.0110705223E+ |        1 |
| 20070810130314 |        1 |
| 20090609211125 |        1 |
| 20100315174918 |        1 |
| 20110705214006 |        1 |
+----------------+----------+
17 rows in set (0.01 sec)

With help from TimStarling, I checked the data in my import 'dump' file, and concluded that it looks fine. I re-imported the dump, and it looked fine in the database (no 'corruption' like the above). I then followed the 1.17 DB update procedure (first using the GUI and then again using the CLI), and both looked fine (no corruption). Then I 'recovered' the correct page_touched field using the update (to keep my changes post import). Then I edited a page, and saw the same corruption!
Before edit:

mysql> select * from mb_page where page_title = "Main_Page";
+---------+----------------+------------+-------------------+--------------+------------------+-------------+----------------+----------------+-------------+----------+
| page_id | page_namespace | page_title | page_restrictions | page_counter | page_is_redirect | page_is_new | page_random    | page_touched   | page_latest | page_len |
+---------+----------------+------------+-------------------+--------------+------------------+-------------+----------------+----------------+-------------+----------+
|    4792 |              0 | Main_Page  |                   |        93887 |                0 |           0 | 0.940133380737 | 20090811074455 |       15093 |     1069 |
+---------+----------------+------------+-------------------+--------------+------------------+-------------+----------------+----------------+-------------+----------+
2 rows in set (0.01 sec)

After edit:

mysql> select * from mb_page where page_title = "Main_Page";
+---------+----------------+------------+-------------------+--------------+------------------+-------------+----------------+----------------+-------------+----------+
| page_id | page_namespace | page_title | page_restrictions | page_counter | page_is_redirect | page_is_new | page_random    | page_touched   | page_latest | page_len |
+---------+----------------+------------+-------------------+--------------+------------------+-------------+----------------+----------------+-------------+----------+
|    4792 |              0 | Main_Page  |                   |        93888 |                0 |           0 | 0.940133380737 | 2.01107060996E |       15097 |     1074 |
+---------+----------------+------------+-------------------+--------------+------------------+-------------+----------------+----------------+-------------+----------+
2 rows in set (0.02 sec)

As I said, in the interim I had touched many pages. Going to Google Webmaster Tools, I now see many errors!

Seems pretty clear, now that I've set it out, that MW 1.17 (+ extensions) is borking the page_touched field on this version of MySQL, leading to an error in the sitemap.
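To put a number on the damage, here is a small Python sketch that tallies the grouped page_touched counts reported in this comment into well-formed (exactly 14 digits) and malformed values. The data is copied from the query output. The names ROWS, is_timestamp and tally are only illustrative, and the digit check is a simplifying assumption: it tests the shape of the value, not whether it is a real date.

```python
import re

# (page_touched, count) pairs as reported by the GROUP BY query.
ROWS = [
    ('2.01107052046E', 606),
    ('2.01107052047E', 1179),
    ('2.01107052048E', 1116),
    ('2.01107052049E', 1255),
    ('2.01107052092E', 2),
    ('2.01107052094E', 1),
    ('2.01107052095E', 1),
    ('2.01107052096E', 5),
    ('2.01107052097E', 275),
    ('2.01107052098E', 132),
    ('2.01107052227E', 2),
    ('2.01107052229E', 1),
    ('2.0110705223E+', 1),
    ('20070810130314', 1),
    ('20090609211125', 1),
    ('20100315174918', 1),
    ('20110705214006', 1),
]


def is_timestamp(value):
    """True when the value is exactly 14 ASCII digits (shape check only)."""
    return re.fullmatch(r'[0-9]{14}', value) is not None


def tally(rows):
    """Return (well_formed_pages, malformed_pages) for grouped counts."""
    good = sum(count for value, count in rows if is_timestamp(value))
    bad = sum(count for value, count in rows if not is_timestamp(value))
    return good, bad


if __name__ == '__main__':
    good, bad = tally(ROWS)
    print('well-formed:', good, 'malformed:', bad)
```

On the figures above this comes to 4 well-formed rows against 4576 malformed ones.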
Dan: Is this still a problem? If so, which MW version do you use nowadays?