Last modified: 2014-07-28 20:06:09 UTC
Looking at graphite, the values for cvn instances all appear constant (cpu, memory, time since puppet run, everything). For example:

http://graphite.wmflabs.org/render/?width=578&height=289&from=00%3A00_20140723&until=23%3A45_20140723&hideLegend=false&target=cvn.*.cpu.total.user.value

Checking the local instance (e.g. cvn-dev.eqiad.wmflabs) I see that the diamond log directory has been idle for the past 16 days:

  $ l /var/log/diamond/
  total 77M
  drwxr-xr-x  2 diamond root    4.0K Jul  7 14:04 ./
  drwxr-xr-x 16 root    root    4.0K Jul 23 06:45 ../
  -rw-r--r--  1 diamond nogroup 6.6M Jul  7 15:42 archive.log
  -rw-r--r--  1 diamond nogroup  11M Jul  2 23:59 archive.log.2014-07-02
  -rw-r--r--  1 diamond nogroup  11M Jul  3 23:59 archive.log.2014-07-03
  -rw-r--r--  1 diamond nogroup  11M Jul  4 23:59 archive.log.2014-07-04
  -rw-r--r--  1 diamond nogroup  10M Jul  5 23:59 archive.log.2014-07-05
  -rw-r--r--  1 diamond nogroup  11M Jul  6 23:59 archive.log.2014-07-06
  -rw-r--r--  1 diamond nogroup 1.4M Jul  8 17:03 diamond.log
  -rw-r--r--  1 diamond nogroup  19M Jul  7 14:03 diamond.log.2014-07-06

And there is no diamond process running:

  $ ps -u diamond f
  (empty)
  $ ps aux | grep diamond | grep -v grep
  (empty)
  $ service diamond status
  diamond stop/waiting
  $ service diamond start
  start: Rejected send message, 1 matched rules; type="method_call", sender=":1.49" (uid=2008 pid=6910 comm="start diamond ") interface="com.ubuntu.Upstart0_6.Job" member="Start" error name="(unset)" requested_reply="0" destination="com.ubuntu.Upstart" (uid=0 pid=1 comm="/sbin/init")
  $ service diamond status
  diamond stop/waiting

Graphite continues to register data from the instance (the last known value repeated). That seems like a bug in the aggregator, because the instance hasn't been producing any values for over 16 days.

And of course, aside from Graphite being lied to by the aggregator (which made it hard to monitor and to see that it was down), why won't the diamond process start?

Puppet is running fine (no errors), and the drives are fine too:

  $ df -h
  Filesystem                                     Size  Used Avail Use% Mounted on
  /dev/vda1                                      7.6G  1.3G  5.9G  19% /
  udev                                           2.0G   12K  2.0G   1% /dev
  tmpfs                                          396M  288K  396M   1% /run
  none                                           5.0M     0  5.0M   0% /run/lock
  none                                           2.0G     0  2.0G   0% /run/shm
  /dev/vda2                                      1.9G  525M  1.3G  29% /var
  labstore.svc.eqiad.wmnet:/dumps                9.1T  9.1T     0 100% /public/dumps
  labstore.svc.eqiad.wmnet:/project/cvn/project   30T   17T   14T  57% /data/project
  labstore.svc.eqiad.wmnet:/project/cvn/home      30T   17T   14T  57% /home
  labstore.svc.eqiad.wmnet:/scratch              7.3T  2.6T  4.7T  36% /data/scratch
  labstore.svc.eqiad.wmnet:/keys                 960M   39M  921M   5% /public/keys
  labstore.svc.eqiad.wmnet:/backups               20T  3.0G   20T   1% /public/backups
  /dev/mapper/vd-second--local--disk              29G  172M   27G   1% /srv

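
The symptom reported above (Graphite showing the last known value repeated while the collector is down) could be caught with a check along the following lines. This is only a sketch: it assumes the series has already been fetched from the Graphite render API with format=json and parsed, so each series is a dict with a "target" name and a "datapoints" list of [value, timestamp] pairs. The function names and the min_points threshold are made up for illustration.

```python
from typing import Optional, Sequence, Tuple

# One datapoint as found in Graphite's JSON render output: [value, timestamp].
# The value is None when nothing was recorded for that interval.
Datapoint = Tuple[Optional[float], int]


def is_flatlined(datapoints: Sequence[Datapoint], min_points: int = 10) -> bool:
    """Return True if the series has at least min_points non-null values
    and every one of them is identical."""
    values = [value for value, _timestamp in datapoints if value is not None]
    if len(values) < min_points:
        return False  # too little data to judge either way
    return len(set(values)) == 1


def flatlined_targets(series_list, min_points: int = 10):
    """Given the parsed JSON list of series, return the names of the
    targets whose values never change."""
    return [
        series["target"]
        for series in series_list
        if is_flatlined(series["datapoints"], min_points)
    ]
```

A metric that is legitimately constant (total memory, for instance) would also be flagged, so a check like this is probably only useful on metrics that are expected to fluctuate, such as cpu.total.user.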
from irc:

<chasemp> so the I haven't dug in disclaimer :)
<chasemp> but I will say
<chasemp> we whitelisted projects that diamond is enabled for in puppet
<chasemp> so if it's not in the array in manifests/role/diamond.pp
<chasemp> it's expected to be stopped
<chasemp> and also one of the undesirable features of lots of the statsd implementations
<chasemp> is this continuity idea where they keep flushing stats even without a real source
<chasemp> to keep whisper files from having invalid xfactor ratios
<chasemp> which is...insane
<chasemp> but this sounds exactly like that and is a big reason I did my own thing statsd wise in the past

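
To illustrate the "continuity" behaviour described in the IRC excerpt, here is a toy model of a gauge aggregator. It is not the code of any real statsd implementation, and whether the labs aggregator actually works this way is unconfirmed. The class name and the keep_flushing switch are hypothetical, chosen only to contrast the two behaviours.

```python
from typing import Dict


class ToyGaugeAggregator:
    """Toy model of a gauge aggregator (hypothetical, for illustration only)."""

    def __init__(self, keep_flushing: bool) -> None:
        self.keep_flushing = keep_flushing
        self._last: Dict[str, float] = {}   # last value ever seen per metric
        self._fresh: Dict[str, float] = {}  # values received since the last flush

    def receive(self, name: str, value: float) -> None:
        """Record a value sent by a collector such as diamond."""
        self._fresh[name] = value

    def flush(self) -> Dict[str, float]:
        """Return what would be sent to the backend for this interval."""
        self._last.update(self._fresh)
        if self.keep_flushing:
            # "Continuity": resend the last known value even if the source
            # has gone silent, so the backend never sees a gap.
            out = dict(self._last)
        else:
            # Only report what actually arrived during this interval.
            out = dict(self._fresh)
        self._fresh = {}
        return out
```

With keep_flushing=True, a second flush() after a single receive() still reports the old value, which matches the flat lines seen in graphite. With keep_flushing=False the second flush is empty and the gap would be visible.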
Change 148689 had a related patch set uploaded by Krinkle: diamond: Enable for 'cvn' project in labs https://gerrit.wikimedia.org/r/148689
Change 148689 merged by coren: diamond: Enable for 'cvn' project in labs https://gerrit.wikimedia.org/r/148689
Thx.