Last modified: 2014-11-04 12:41:57 UTC
bits.wikimedia.org was set up to serve JavaScript and CSS in January 2010 (cf. [[wikitech:Server admin log/Archive 15#January 10]]). Its [[robots.txt]] file disallows everything:

$ curl bits.wikimedia.org/robots.txt
User-agent: *
Disallow: /

Comparing the French Wikipedia main page from 2009 (<https://web.archive.org/web/20090601000000*/http://fr.wikipedia.org/wiki/Accueil>) to 2010 (<https://web.archive.org/web/20110101000000*/http://fr.wikipedia.org/wiki/Accueil>) shows its effect: the generated snapshots look goofy. Perhaps we should specifically re-enable the IA bot?

This issue was reported by ytrezq in #wikimedia-operations on freenode.
This should certainly be done (cf. 0e30a230d8eb105ff3724d4aade4be57eadba2a1), but I can't tell whether the file lives in operations/puppet, where I only find files/misc/robots-txt-disallow (used for gerrit) and modules/mediawiki_singlenode/files/robots.txt (used in labs). IMHO the robots.txt "catchall" disallow rules should all be managed by files/misc/robots-txt-disallow, with ia_archiver allowed there.
The file is robots-private.txt in operations/mediawiki-config.
Change 144364 had a related patch set uploaded by Nemo bis: Allow Internet Archive's Wayback machine to get stuff from bits etc. https://gerrit.wikimedia.org/r/144364
Change 144364 merged by jenkins-bot: Allow Internet Archive's Wayback machine to get stuff from bits etc. https://gerrit.wikimedia.org/r/144364
https://bits.wikimedia.org/robots.txt looks good but the effects can be verified only after it gets recrawled; currently e.g. http://web.archive.org/save/http://bits.wikimedia.org/geoiplookup complains.
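While waiting for a recrawl, the rules themselves can be checked offline. The following is a minimal sketch using Python's standard `urllib.robotparser`. `OLD_RULES` is the file quoted in the original report; `NEW_RULES` is only an assumed shape for an "allow ia_archiver, disallow everyone else" file and is not copied from the merged change. The helper name `can_fetch` and the agent name `SomeOtherBot` are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Rules quoted in the original report: everything is disallowed for everyone.
OLD_RULES = """\
User-agent: *
Disallow: /
"""

# ASSUMPTION: a hypothetical replacement that lets the Internet Archive
# crawler through while keeping the catch-all disallow. The file actually
# deployed by change 144364 may differ.
NEW_RULES = """\
User-agent: ia_archiver
Allow: /

User-agent: *
Disallow: /
"""


def can_fetch(robots_txt, agent, url):
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    url = "https://bits.wikimedia.org/geoiplookup"
    for label, rules in (("old", OLD_RULES), ("new", NEW_RULES)):
        for agent in ("ia_archiver", "SomeOtherBot"):
            print(label, agent, can_fetch(rules, agent, url))
```

This only tells us how Python's parser reads the rules; the Wayback Machine's own crawler may interpret a file differently, so the recrawl is still the real test.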
It does look better. http://web.archive.org/web/20140711225404/http://it.wikipedia.org/wiki/Pagina_principale
(In reply to MZMcBride from comment #0)
> (<https://web.archive.org/web/20110101000000*/http://fr.wikipedia.org/wiki/Accueil>) shows its effect: the generated snapshots look goofy.

No longer?

(In reply to Nemo from comment #6)
> It does look better.
> http://web.archive.org/web/20140711225404/http://it.wikipedia.org/wiki/Pagina_principale

No longer. :( https://bits.wikimedia.org/robots.txt is a 404 now, what happened? Is that the cause?
(In reply to Nemo from comment #7)
> https://bits.wikimedia.org/robots.txt is a 404 now, what happened? Is that the cause?

I85dcec2a168b9b2d42678d1d9a0e314793c99e21 perhaps?
And now it looks (mostly) good again. Go figure. http://web.archive.org/web/20141104122645/http://it.wikipedia.org/wiki/Pagina_principale