Last modified: 2014-11-04 12:41:57 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T67464, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 65464 - bits.wikimedia.org discourages indexing, resulting in goofy archive.org snapshots
Status: RESOLVED FIXED
Product: Wikimedia
Classification: Unclassified
Component: General/Unknown
Version: wmf-deployment
Hardware: All All
Importance: Normal normal
Target Milestone: ---
Assigned To: Nobody - You can work on this!
Keywords: shell
Depends on:
Blocks:
Reported: 2014-05-18 20:36 UTC by MZMcBride
Modified: 2014-11-04 12:41 UTC

Description MZMcBride 2014-05-18 20:36:23 UTC
bits.wikimedia.org was set up to serve JavaScript and CSS in January 2010 (cf. [[wikitech:Server admin log/Archive 15#January 10]]).

Its [[robots.txt]] file disallows everything:
 
$ curl bits.wikimedia.org/robots.txt
User-agent: *
Disallow: /

Comparing the French Wikipedia main page from 2009 (<https://web.archive.org/web/20090601000000*/http://fr.wikipedia.org/wiki/Accueil>) to 2010 (<https://web.archive.org/web/20110101000000*/http://fr.wikipedia.org/wiki/Accueil>) shows its effect: the generated snapshots look goofy.

Perhaps we should specifically re-enable the IA bot?

This issue was reported by ytrezq in #wikimedia-operations in freenode.
Comment 1 Nemo 2014-07-06 15:13:33 UTC
This should certainly be done, cf. 0e30a230d8eb105ff3724d4aade4be57eadba2a1 ; but I don't understand if the file is in operations/puppet, where I only find files/misc/robots-txt-disallow (used for gerrit) and modules/mediawiki_singlenode/files/robots.txt (used in labs).

IMHO the robots.txt "catchall" disallow rules should all be managed by files/misc/robots-txt-disallow and ia_archiver allowed there.
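
To make the proposal concrete, here is a minimal sketch that checks robots.txt rules locally with Python's standard urllib.robotparser. DISALLOW_ALL is the file quoted in the description. ALLOW_IA is only an assumed variant that lets ia_archiver through; it is not the content of any actual Wikimedia file or patch. The helper name is_allowed and the agent name SomeOtherBot are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# The rules quoted in the bug description: every crawler is blocked.
DISALLOW_ALL = """\
User-agent: *
Disallow: /
"""

# Assumed variant (hypothetical, not taken from any real Wikimedia file):
# allow ia_archiver, keep the catch-all disallow for everyone else.
ALLOW_IA = """\
User-agent: ia_archiver
Allow: /

User-agent: *
Disallow: /
"""

# URL mentioned later in this bug as failing on web.archive.org/save.
URL = "http://bits.wikimedia.org/geoiplookup"


def is_allowed(robots_txt, user_agent, url):
    """Return True if user_agent may fetch url under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for label, rules in (("disallow all", DISALLOW_ALL), ("allow IA", ALLOW_IA)):
        for agent in ("ia_archiver", "SomeOtherBot"):
            print(label, agent, is_allowed(rules, agent, URL))
```

Under these assumptions only ia_archiver gets through with the second rule set, while other crawlers stay blocked. Real crawlers may interpret robots.txt differently from Python's parser, so this checks the intent of the rules, not the Wayback Machine's actual behaviour.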
Comment 2 Marius Hoch 2014-07-06 15:27:56 UTC
The file is robots-private.txt in operations/mediawiki-config
Comment 3 Gerrit Notification Bot 2014-07-06 16:14:58 UTC
Change 144364 had a related patch set uploaded by Nemo bis:
Allow Internet Archive's Wayback machine to get stuff from bits etc.

https://gerrit.wikimedia.org/r/144364
Comment 4 Gerrit Notification Bot 2014-07-08 18:25:00 UTC
Change 144364 merged by jenkins-bot:
Allow Internet Archive's Wayback machine to get stuff from bits etc.

https://gerrit.wikimedia.org/r/144364
Comment 5 Nemo 2014-07-08 18:32:26 UTC
https://bits.wikimedia.org/robots.txt looks good but the effects can be verified only after it gets recrawled; currently e.g. http://web.archive.org/save/http://bits.wikimedia.org/geoiplookup complains.
Comment 7 Nemo 2014-10-02 09:27:53 UTC
(In reply to MZMcBride from comment #0)
> (<https://web.archive.org/web/20110101000000*/http://fr.wikipedia.org/wiki/Accueil>) shows its effect: the generated snapshots look goofy.

No longer?

(In reply to Nemo from comment #6)
> It does look better.
> http://web.archive.org/web/20140711225404/http://it.wikipedia.org/wiki/Pagina_principale

No longer. :(
https://bits.wikimedia.org/robots.txt is a 404 now, what happened? Is that the cause?
Comment 8 Glaisher 2014-10-03 16:35:57 UTC
(In reply to Nemo from comment #7)
> https://bits.wikimedia.org/robots.txt is a 404 now, what happened? Is that
> the cause?

I85dcec2a168b9b2d42678d1d9a0e314793c99e21 perhaps?
Comment 9 Nemo 2014-11-04 12:41:57 UTC
And now it looks (mostly) good again. Go figure.

http://web.archive.org/web/20141104122645/http://it.wikipedia.org/wiki/Pagina_principale
