Last modified: 2010-05-15 16:02:53 UTC
Created attachment 5279 [details] The "Wait" extension from the case study A single wiki page that takes a long time to render -- due to a long-running tag extension -- blocks ALL other users of the wiki from viewing ANY pages. This bug might sound unlikely -- given that Wikipedia has scaled so massively -- but I've reproduced it now on four different MediaWiki servers, two running Windows 2003 Server, and two running Linux (CentOS, Ubuntu), all with different PHP and Apache versions & configurations. One of them was a plain vanilla MediaWiki install created just to test this bug. The problem doesn't seem to be related to sessions (e.g., when session data is locked to prevent concurrent writes, as described at http://us2.php.net/manual/en/function.session-write-close.php). Even with cookies disabled, the same behavior happens. CASE STUDY TO REPRODUCE THE PROBLEM These steps will create two web pages, one slow and one fast, and demonstrate that ALL other users cannot use the wiki at all while the slow page is rendering. 1. Start with a plain vanilla install of MediaWiki 1.13. 2. Install the "Wait" extension attached to this bug ticket, that simply sleeps for N seconds. The source code is: <?php # Wait for N seconds $wgExtensionFunctions[] = 'wfWaitSetup'; function wfWaitSetup() { global $wgParser; $wgParser->setHook('wait', 'wfWait'); } function wfWait($input) { global $wgParser; $wgParser->disableCache(); sleep($input); return "Slept for $input seconds"; } 3. Create two wiki articles: one called "Slow" that contains: <wait>30</wait> and one called "Fast" that contains the text "Hello world". (Note that saving the "Slow" page will take 30 seconds as the <wait> tag is rendered.) Now you are ready to demonstrate the problem. 4. Open web browsers on TWO DIFFERENT PCs. (Alternatively, open two different browsers -- say, Firefox and IE -- on the same PC.) Do not just open 2 windows or tabs of the same browser instance, because they might share session information, which would invalidate this test. 5. In the first browser, hit the "Slow" page. 6. While the Slow page is sleeping, in the second browser, hit the "Fast" page. Ensure that you are NOT getting a cached copy of Fast, say, by force-refreshing the browser (shift-refresh in Firefox, ctrl-F5 in IE, etc.) or appending action=purge. (Or hit Special:Version or other dynamic page instead of the "Fast" page.) 7. While the Slow page is sleeping, the Fast page WILL NOT RENDER. It will be delayed until the Slow page has finished sleeping. This is the problematic behavior: the Slow page -- in an entirely different session -- is blocking the Fast page. SOME ANALYSIS Long-running parser tags like <wait> are realistic. Imagine a parser tag that hits an external database to run SQL queries (that's how we found this problem), or a tag that requires an external resource that is slow or unavailable. This does not seem to be an Apache problem. While the "Slow" page is sleeping, you can hit another non-MediaWiki PHP page on the same webserver (say, test.php that runs "phpinfo()" and exits) and there are no delays. This also does not seem to be a PHP session problem. On the same webserver, if you create two simple PHP scripts that open a session, and hit them with separate browsers as above, the problem does NOT occur. Here are example scripts: <?php # Script number 1 session_start(); sleep(20); echo "Done"; <?php # Script number 2 session_start(); echo "Done"; If you hit script #1 and, while it's sleeping, you hit script #2, using independent browsers in different sessions, the problem does not occur. (Which is good & expected.) Why doesn't Wikipedia experience this problem? Probably because of its extreme caching architecture (with squid, etc.), and that it doesn't have any long-running extensions installed. Or maybe it has a magical PHP/Apache/MySQL configuration that prevents the problem? STACK DUMP I did the case-study above and let the "Fast" page time out and fail while "Slow" was sleeping. Here is the stack dump from that moment, produced by xdebug: Fatal error: Maximum execution time of 30 seconds exceeded in includes/db/Database.php on line 579 Stack trace: 1. {main}() index.php:0 2. MediaWiki->checkInitialQueries() index.php:60 3. Title::newMainPage() includes/Wiki.php:105 4. wfMsgForContent() includes/Title.php:293 5. wfMsgReal() includes/GlobalFunctions.php:380 6. wfMsgGetKey() includes/GlobalFunctions.php:432 7. StubObject->get() includes/GlobalFunctions.php:467 8. StubObject->__call() includes/StubObject.php:0 9. StubObject->_call() includes/StubObject.php:76 10. call_user_func_array() includes/StubObject.php:58 11. MessageCache->get() includes/StubObject.php:0 12. MessageCache->getMsgFromNamespace() includes/MessageCache.php:543 13. MessageCache->load() includes/MessageCache.php:606 14. MessageCache->saveToCaches() includes/MessageCache.php:250 15. BagOStuff->add() includes/MessageCache.php:419 16. SqlBagOStuff->set() includes/BagOStuff.php:100 17. SqlBagOStuff->delete() includes/BagOStuff.php:274 18. SqlBagOStuff->_query() includes/BagOStuff.php:288 19. MediaWikiBagOStuff->_doquery() includes/BagOStuff.php:319 20. Database->query() includes/BagOStuff.php:433 21. Database->doQuery() includes/db/Database.php:540 And the offending line 579 is: $ret = mysql_query( $sql, $this->mConn ); BRION VIBBER'S TEST Brion did a quick test with this on his local wiki and thought the problem might be session-related. He writes in mediawiki-l: "If I preview a page with <wait>20</wait>, then go load something up in another tab in the same browser, it sits there waiting on both tabs. (Confirmed with Firefox 3 and Safari 3 on Mac OS X.) If on the other hand I go load things up in another browser, there's no delay there. If I disable cookies (thus removing session affinity), then a second tab in the same browser has no slowdown. And indeed, it appears that PHP session files are by default locked to prevent multiple simultaneous accesses." I don't know why Brion had different results than I did. Perhaps his "Fast" page was cached so there was no slowdown due to the Slow page. I've asked him to try again and hit Special:SpecialPages instead of "Fast". Personally, I would LOVE to hear that my wikis are misconfigured in some way, or that I've made some mistake. This problem is killing performance for us.
I definitely do *not* get the described effect when using separate browsers. Special:Specialpages, recentchanges, a few random pages, etc all load fine in my second browser (IE in a Windows VM) while my "slow" page is previewing, saving, or rendering in Firefox on my mac.
Looks like a browser issue. In firefox you can probably tweak your network parameters (about:config) such as : network.http.max-connections network.http.max-connections-per-server network.http.max-persistent-connections-per-server
I don't see how it can be a browser issue. In the case study, we have two separate users on two separate PCs, and each is simply hitting ONE single (different) wiki page. For the record, my settings are: network.http.max-connections = 24 network.http.max-connections-per-server = 8 network.http.max-persistent-connections-per-server = 2
Ok, I *do* see this effect on a fresh 1.13 install. I'll see if there's some configuration funniness going on...
Thank you thank you thank you for looking at this. Can I provide more information that would be helpful? php.ini? httpd.conf? LocalSettings.conf (though it's plain vanilla)? mysql config?
Ok here's the skinny: 1.12 branch fresh install, all default options -- no hang. 1.13 branch fresh install, all default options -- hang. Trunk (1.14 dev) fresh install, all default options -- no hang. These are all tested on Mac OS X 10.5, MySQL 5.0.51a, default system Apache/PHP (5.2.6). So there may be some kind of regression in 1.13, which is interestingly fixed (or didn't happen) in dev trunk.
Poked around a little looking for obvious changes in Article and Title DB usage, didn't see anything clear. Assigning to Tim to investigate this as a regression in release branch.
I installed the 1.14 dev trunk, and the problem disappears for me too. (And I was pretty sure we hadn't experienced it in 1.12.) Any chance that a fix could make it into the 1.13.1 patch release? I am happy to help in any way I can.
(In reply to comment #8) > I installed the 1.14 dev trunk, and the problem disappears for me too. (And I > was pretty sure we hadn't experienced it in 1.12.) > > Any chance that a fix could make it into the 1.13.1 patch release? I am happy > to help in any way I can. > If we want to track down what caused this, someone who has time for it (which I don't, and I guess it's safe to say Brion doesn't either) should do a binary search-like revision witchhunt (see <http://en.wikipedia.org/wiki/Binary_search#Number_Guessing_Game> for a quick explanation) to see which revision introduced this bug.
That's why I *assigned* it to Tim. But if you want to do it before he wakes up, be my guest. :)
(In reply to comment #10) > That's why I *assigned* it to Tim. But if you want to do it before he wakes up, > be my guest. :) > Yeah, I only noticed that later.
Found it! (Via the binary search technique suggested.) The problem was fixed in 1.14alpha by Tim in revision 40209: $ svn log -r 40209 -v r40209 | tstarling | 2008-08-29 04:35:00 -0400 (Fri, 29 Aug 2008) | 3 lines Changed paths: M /trunk/phase3/RELEASE-NOTES M /trunk/phase3/includes/BagOStuff.php * Fixed intermittent deadlock errors involving objectcache table queries. Use a separate database connection for the objectcache table to avoid long-lasting locks on that table. I've confirmed that the bug is present in 40208 and fixed in 40209.
Is the delta between 40208 and 40209 safe to patch into 1.13?
On my own wiki, I mean. (I don't have commit access to your svn repository.)
I applied the delta to mw13 and it seems to fix the issue.
Back ported in 1.13. Thanks for bisecting this issue. ------------------------------------------------------------------------ r40391 | hashar | 2008-09-03 22:15:33 +0200 (mer, 03 sep 2008) | 4 lines Fix #15460 : One slow extension can block all wiki pages for all users Backport of trunk@40209 by Tim Starling. Pointed by Dan Barrett. ------------------------------------------------------------------------
Your patch works for me. Thanks so much Ashar and Brion!!!
Fixed summary.