Last modified: 2014-08-25 17:16:46 UTC
Since August 17, web services have repeatedly gone up and down. Big Brother does restart them. The few times I was logged into labs when things went down, I ran 'qstat -f' and saw load averages for the web-grid nodes at or above 20.
I can confirm this. Multiple restarts for no clear reason. In both examples a running service was terminated and restarted automatically.

Bigbrother mails:

2014-08-21 05:31:16 info: Restarting job 'lighttpd-xtools'
2014-08-21 05:33:05 warn: job 'lighttpd-xtools' failed to start
2014-08-21 05:33:05 info: Restarting job 'lighttpd-xtools'
2014-08-21 20:20:31 info: Restarting job 'lighttpd-xtools'
2014-08-21 20:22:30 warn: job 'lighttpd-xtools' failed to start
2014-08-21 20:22:31 info: Restarting job 'lighttpd-xtools'

qacct reports:

jobname      lighttpd-xtools
jobnumber    3328857
taskid       undefined
account      sge
priority     0
qsub_time    Thu Aug 21 05:33:06 2014
start_time   Thu Aug 21 05:33:18 2014
end_time     Thu Aug 21 20:20:27 2014
granted_pe   NONE
slots        1
failed       0
exit_status  0

jobname      lighttpd-xtools
jobnumber    3345286
taskid       undefined
account      sge
priority     0
qsub_time    Thu Aug 21 20:22:32 2014
start_time   Thu Aug 21 20:22:33 2014
end_time     Fri Aug 22 11:37:33 2014
granted_pe   NONE
slots        1
failed       0
exit_status  0
A quick perusal of the logs shows that this happens only to a short list (~12) of webservices, and in bursts. My current working hypothesis is that this is due to leaking fcgi processes combined with memory pressure (that is, the problem is always present but leads to webservices being restarted only when resource use is especially high). Could the maintainers of the affected tools please check their logs to see whether those restarts match periods of unusual activity?
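For maintainers who want to line restarts up against their own access logs, here is a rough sketch of one way to pull the restart times out of the Bigbrother mails and group them into bursts. It assumes the mail lines look exactly like the ones quoted earlier in this thread; the function names and the 10-minute burst gap are made up for illustration and are not part of Bigbrother.

```python
import re
from collections import defaultdict
from datetime import datetime

# Assumed line format, taken from the Bigbrother mails quoted in this thread.
LINE_RE = re.compile(
    r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) info: Restarting job '([^']+)'"
)

# Sample input: the mail lines quoted above.
SAMPLE = """\
2014-08-21 05:31:16 info: Restarting job 'lighttpd-xtools'
2014-08-21 05:33:05 warn: job 'lighttpd-xtools' failed to start
2014-08-21 05:33:05 info: Restarting job 'lighttpd-xtools'
2014-08-21 20:20:31 info: Restarting job 'lighttpd-xtools'
2014-08-21 20:22:30 warn: job 'lighttpd-xtools' failed to start
2014-08-21 20:22:31 info: Restarting job 'lighttpd-xtools'
"""


def restart_times(log_text):
    """Return {job_name: [datetime, ...]} for every 'Restarting job' line."""
    restarts = defaultdict(list)
    for stamp, job in LINE_RE.findall(log_text):
        restarts[job].append(datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S"))
    return dict(restarts)


def bursts(times, gap_minutes=10):
    """Group timestamps into bursts; a new burst starts after a gap
    longer than gap_minutes (hypothetical threshold, adjust to taste)."""
    groups = []
    for t in sorted(times):
        if groups and (t - groups[-1][-1]).total_seconds() <= gap_minutes * 60:
            groups[-1].append(t)
        else:
            groups.append([t])
    return groups


if __name__ == "__main__":
    for job, times in restart_times(SAMPLE).items():
        for group in bursts(times):
            print(job, group[0], "to", group[-1], f"({len(group)} restarts)")
```

The first and last timestamp of each burst give the window to compare against the tool's own access or error logs.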