Last modified: 2014-11-12 18:31:26 UTC
This is related to bug 51410, but looks like a new form of the old problem introduced by that fix.

On the mediawikiwiki master, queries experience lock-wait-timeout in what looks like an effective deadlock between transactions and co-operative locks.

  SELECT /* MessageGroupStats::forItemInternal */ GET_LOCK('MessageGroupStats:modify:page-MediaWiki-Vagrant', 1) AS lockstatus;

  UPDATE /* LinksUpdate::updateLinksTimestamp */ `page` SET page_links_updated = '20140829051059' WHERE page_id = '226112';

The queries are unrelated. The LinksUpdate query is perfectly ok until MessageGroupStats appears.

From the database end, it looks like MessageGroupStats get_lock() is called in a loop by a connection which can already have an open transaction with row locks on the page and translate_groupstats tables. When the co-op lock is not acquired quickly, MessageGroupStats transactions bottleneck and queue up, collectively holding many row locks and blocking other queries like LinksUpdate *and whichever MessageGroupStats connection already holds the co-op lock*.

We should not be combining transactions and co-operative locking in this manner.
Some of this code runs in autocommit mode when executed by runners, but at other times it still happens inside a big transaction. I've mentioned that all of this needs to be moved to the job queue. A quick workaround would be to not use GET_LOCK for web requests, or to use some MW transaction hook to push all of this post-COMMIT for web requests.
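To make the post-COMMIT idea concrete, here is a minimal sketch in Python rather than MediaWiki's PHP. Everything in it is hypothetical: `Connection`, `on_commit_or_now` and `update_group_stats` are illustrative names, not MediaWiki or Translate APIs, and the named lock is only represented by a placeholder. The point is just the ordering: work that needs the co-operative lock is queued while a transaction is open and only runs once the row locks have been released.

```python
from typing import Callable, List


class Connection:
    """Toy stand-in for a DB connection (hypothetical, not a real API)."""

    def __init__(self) -> None:
        self.in_transaction = False
        self._post_commit: List[Callable[[], None]] = []

    def begin(self) -> None:
        self.in_transaction = True

    def on_commit_or_now(self, callback: Callable[[], None]) -> None:
        # In autocommit mode there are no row locks held, so run at once.
        # Inside a transaction, defer until after COMMIT.
        if self.in_transaction:
            self._post_commit.append(callback)
        else:
            callback()

    def commit(self) -> None:
        # Row locks are released here; only then run the deferred work.
        self.in_transaction = False
        pending, self._post_commit = self._post_commit, []
        for callback in pending:
            callback()

    def rollback(self) -> None:
        # Deferred work is discarded along with the transaction.
        self.in_transaction = False
        self._post_commit.clear()


def update_group_stats(conn: Connection, log: List[str]) -> None:
    # Placeholder for "GET_LOCK, update stats, RELEASE_LOCK".
    # It must never run while the connection still holds row locks.
    assert not conn.in_transaction
    log.append("stats updated under named lock")


def demo() -> List[str]:
    log: List[str] = []
    conn = Connection()
    conn.begin()
    log.append("row update")
    conn.on_commit_or_now(lambda: update_group_stats(conn, log))
    log.append("more row updates")
    conn.commit()
    return log
```

With this ordering, a connection waiting on the named lock never holds row locks at the same time, which is the combination described in the original report.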
Change 157040 had a related patch set uploaded by Aaron Schulz: Avoid GET_LOCK in non-autocommit mode https://gerrit.wikimedia.org/r/157040
A bit off topic, sorry...

(In reply to Aaron Schulz from comment #1)
> I've mentioned that all of this needs to
> be move to the job queue.

Using the job queue however can be disruptive sometimes when it gets in the way of editing, as I think bug 69669 may show (panic for some minutes while a translation-admin-related action was being completed). It would be nice to have a write-up of what should be moved to the job queue, but also of what actions should be high priority in the job queue.
(In reply to Nemo from comment #3)
> A bit off topic, sorry...
>
> (In reply to Aaron Schulz from comment #1)
> > I've mentioned that all of this needs to
> > be move to the job queue.
>
> Using the job queue however can be disruptive sometimes when it gets in the
> way of editing, as I think bug 69669 may show (panic for some minutes while
> a translation-admin-related action was being completed). It would be nice to
> have a write-up of what should be moved to the job queue, but also of what
> actions should be high priority in the job queue.

If the problem is responsiveness, then it can always go into a small dedicated job loop in jobrunner.conf.erb.
Change 157040 merged by jenkins-bot: Avoid GET_LOCK in non-autocommit mode https://gerrit.wikimedia.org/r/157040
All patches mentioned in this report were merged or abandoned - is there more work left to do here (if yes: please reset the bug report status to NEW or ASSIGNED), or can you close this ticket as RESOLVED FIXED?