Last modified: 2014-08-25 17:16:46 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T71934, the corresponding Phabricator task for complete and up-to-date bug report information.

Bug 69934 - Web services continually restarting


Summary:	Web services continually restarting

Status:	UNCONFIRMED

Product:	Wikimedia Labs
Classification:	Unclassified
Component:	tools
Version:	unspecified
Hardware:	All All

Importance:	High normal
Target Milestone:	---
Assigned To:	Marc A. Pelletier

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:

Reported:	2014-08-23 07:04 UTC by bgwhite
Modified:	2014-08-25 17:16 UTC
CC List:	4 users

See Also:
Web browser:	---
Mobile Platform:	---
Assignee Huggle Beta Tester:	---


Description bgwhite 2014-08-23 07:04:53 UTC

Since August 17, web services have continually gone up and down.  Big Brother does restart web services.  The few times I was logged into labs when things went down, I did a 'qstat -f'.  Load averages for web-grid nodes were at or above 20.

Comment 1 metatron 2014-08-23 09:38:35 UTC

I can confirm this. Multiple restarts for no clear reason. In both examples, a running service was terminated and restarted automatically.

Bigbrother mails:
2014-08-21 05:31:16 info: Restarting job 'lighttpd-xtools'
2014-08-21 05:33:05 warn: job 'lighttpd-xtools' failed to start
2014-08-21 05:33:05 info: Restarting job 'lighttpd-xtools'

2014-08-21 20:20:31 info: Restarting job 'lighttpd-xtools'
2014-08-21 20:22:30 warn: job 'lighttpd-xtools' failed to start
2014-08-21 20:22:31 info: Restarting job 'lighttpd-xtools'

qacct reports:
jobname      lighttpd-xtools
jobnumber    3328857
taskid       undefined
account      sge
priority     0
qsub_time    Thu Aug 21 05:33:06 2014
start_time   Thu Aug 21 05:33:18 2014
end_time     Thu Aug 21 20:20:27 2014
granted_pe   NONE
slots        1
failed       0
exit_status  0

jobname      lighttpd-xtools
jobnumber    3345286
taskid       undefined
account      sge
priority     0
qsub_time    Thu Aug 21 20:22:32 2014
start_time   Thu Aug 21 20:22:33 2014
end_time     Fri Aug 22 11:37:33 2014
granted_pe   NONE
slots        1
failed       0
exit_status  0
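
For anyone comparing records like these, a minimal Python sketch of how the run time of each job could be derived from the start_time and end_time fields. The helper names (parse_qacct, run_duration) are hypothetical, the parser assumes simple "key value" lines as pasted above, and the timestamp format assumes an English/C locale.

```python
from datetime import datetime

# Timestamp layout used by the start_time/end_time fields quoted above.
QACCT_TIME_FORMAT = "%a %b %d %H:%M:%S %Y"

# Abbreviated copy of the two records from this comment.
SAMPLE = """\
jobname      lighttpd-xtools
jobnumber    3328857
start_time   Thu Aug 21 05:33:18 2014
end_time     Thu Aug 21 20:20:27 2014
exit_status  0

jobname      lighttpd-xtools
jobnumber    3345286
start_time   Thu Aug 21 20:22:33 2014
end_time     Fri Aug 22 11:37:33 2014
exit_status  0
"""


def parse_qacct(text):
    """Split qacct-style text into one dict per job record.

    Assumption: every record begins with a 'jobname' line.
    """
    records, current = [], {}
    for line in text.splitlines():
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue  # blank lines, separators, or fields without a value
        key, value = parts
        if key == "jobname" and current:
            records.append(current)
            current = {}
        current[key] = value.strip()
    if current:
        records.append(current)
    return records


def run_duration(record):
    """Return end_time - start_time for one record as a timedelta."""
    start = datetime.strptime(record["start_time"], QACCT_TIME_FORMAT)
    end = datetime.strptime(record["end_time"], QACCT_TIME_FORMAT)
    return end - start


if __name__ == "__main__":
    for rec in parse_qacct(SAMPLE):
        print(rec["jobnumber"], run_duration(rec))
    # 3328857 14:47:09
    # 3345286 15:15:00
```

On the two records above, each run lasted roughly 15 hours before the job ended with exit_status 0.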

Comment 2 metatron 2014-08-23 09:42:10 UTC

*examples

Comment 3 Marc A. Pelletier 2014-08-25 17:16:46 UTC

A quick perusal of the logs shows that this happens only to a short (~12) list of webservices, in bursts.

My current working hypothesis is that this is due to leaking fcgi combined with memory pressure (that is, the problem is always present but leads to webservices being restarted only when resource use is especially high).

Could the maintainers of the affected tools please look into their logs to see if those restarts match periods of unusual activity?
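
One possible way to do that check, as a rough Python sketch: pull the restart timestamps for a job out of Bigbrother mail lines (in the format quoted in comment 1) and group them into bursts that can then be compared against the tool's own access logs. The function names and the 10-minute gap are assumptions made for illustration, not part of Bigbrother.

```python
import re
from datetime import datetime, timedelta

# Matches lines such as:
#   2014-08-21 05:31:16 info: Restarting job 'lighttpd-xtools'
RESTART_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) info: Restarting job '([^']+)'"
)

# Lines copied from the Bigbrother mails in comment 1.
SAMPLE_LOG = """\
2014-08-21 05:31:16 info: Restarting job 'lighttpd-xtools'
2014-08-21 05:33:05 warn: job 'lighttpd-xtools' failed to start
2014-08-21 05:33:05 info: Restarting job 'lighttpd-xtools'
2014-08-21 20:20:31 info: Restarting job 'lighttpd-xtools'
2014-08-21 20:22:30 warn: job 'lighttpd-xtools' failed to start
2014-08-21 20:22:31 info: Restarting job 'lighttpd-xtools'
"""


def restart_times(log_text, job):
    """Return the sorted timestamps of 'Restarting job' lines for one job."""
    times = []
    for line in log_text.splitlines():
        match = RESTART_RE.match(line.strip())
        if match and match.group(2) == job:
            times.append(datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S"))
    return sorted(times)


def group_bursts(times, max_gap=timedelta(minutes=10)):
    """Group sorted timestamps into bursts.

    A restart joins the current burst when it follows the previous restart
    by at most max_gap (an assumed threshold; tune as needed).
    """
    bursts = []
    for t in times:
        if bursts and t - bursts[-1][-1] <= max_gap:
            bursts[-1].append(t)
        else:
            bursts.append([t])
    return bursts


if __name__ == "__main__":
    for burst in group_bursts(restart_times(SAMPLE_LOG, "lighttpd-xtools")):
        print(burst[0], "to", burst[-1], "restarts:", len(burst))
```

With the sample lines, this yields two bursts of two restarts each (around 05:31 and 20:20 on 2014-08-21); those windows are the periods to look up in the access logs.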
