Triggit Engineering Blog

Monitoring: Circonus and Gollector


This post is authored by our Senior Operations Engineer, Erik Hollensbe.

In the end, this is what you get:

Cassandra Network Usage

This, as the name implies, is a graph of the internal network usage of a Cassandra cluster. Each host is a different color. The cluster has 5 nodes and is brand new, hence the low utilization.

Some of you will recognize the monitoring service: Circonus is a killer hacker-grade monitoring system that’s made for people who are willing to go the extra mile to do deep integration with their stacks.

So that’s what we did. We wrote Gollector, a modern, Linux-only replacement for Resmon, the preferred agent in the Circonus universe.

Why Reinvent the Wheel?

Before we get into anything about Gollector, or how it integrates with Circonus, this question should be answered. Resmon (and other agents like it) tend to shell out to gather metrics; for example, to count open network connections:

$ netstat -an | wc -l

Simplified, but nonetheless: netstat -an is a great way to make your system soil itself after a certain number of connections are registered (no matter what state they’re in). Here’s a video that visually explains the problem I’m discussing (if exaggerated a bit), although some of our traffic patterns see this regularly without leaking sockets. :)

A lot of monitors — Resmon included — shell out to handle most of their work. This has a significant performance penalty in some scenarios and makes it hard to reason about the impact it has on your system, because, after all, it’s the monitoring system that tells you this stuff.

Gollector is a new monitoring agent that relies on the proc filesystem and C POSIX calls such as sysconf to determine your machine’s profile. We have been using it for about 4 months now without any issue. (Well, save this one, which in practice is not an issue.)

It also introduces a new concept of decoupling monitoring sources — allowing calls into other monitoring systems and proxying the data over to a brokering system such as Circonus. This allows app level metrics to be exposed publicly without compromising the firewall rules or architecture of the application. It also means Gollector itself does a lot less and requires no dependencies on things you may not need — like a postgres or redis library. We’ll discuss this more in a bit.

Gollector and Circonus integration

Circonus has the excellent JSON collector which allows Gollector to work. You just point it at your authenticated URL and Circonus figures out what metrics to use.

N.B.: For those of you who are wondering about security: Gollector uses basic authentication and no SSL. Circonus Enterprise Brokers can be used behind the firewall, which is a large part of the reason SSL is currently unsupported. Patches welcome.

N.B. 2: Gollector’s build and setup instructions are here and you might want to build / set one up before reading the rest of this.

For example, pointing Circonus’s JSON check at http://gollector:gollector@my.host.org:8000/ might yield this visual:

Circonus JSON Check

(Sorry for the unsightly black bars. Because ops, I can’t show you those parts.)

You can see the checks near the bottom — they’re part of a json collection that looks something like this:

{
  "cpu_usage": [
    0.06,
    16
  ],
  "eth0_usage": {
    "Received (Bytes)": 70560,
    "Received (Packets)": 934,
    "Reception Errors": 0,
    "Transmission Errors": 0,
    "Transmitted (Bytes)": 86502,
    "Transmitted (Packets)": 808
  },
  "eth1_usage": {
    "Received (Bytes)": 37866,
    "Received (Packets)": 331,
    "Reception Errors": 0,
    "Transmission Errors": 0,
    "Transmitted (Bytes)": 42304,
    "Transmitted (Packets)": 217
  },
  "feeds": {
    "adtemplate_resize_image_go_1_images": 0.083256185482529,
    "adtemplate_resize_image_go_2_images": 0.042405453209651,
    "adtemplate_resize_image_go_3_images": 0.06190829802995,
    "adtemplate_resize_image_go_4_images": 10.194585215293,
    "feed_create_or_update_products_to_api": 1.6207301663943,
    "feed_create_or_update_products_to_cassandra": 4.8350784676489
  },
  "hive_logger_usage": [
    0,
    984268259328,
    984477601792,
    false
  ],
  "load_average": [
    0,
    0.11,
    0.22
  ],
  "mem_usage": {
    "Free": 59801563136,
    "Total": 67553263616,
    "Used": 7751700480
  },
  "root_usage": [
    13,
    839262228480,
    968348774400,
    false
  ],
  "triggit_usage": [
    7,
    914974879744,
    984476590080,
    false
  ]
}

As you can see, Circonus uses the “`” separator to walk through JSON objects and arrays. In the cpu_usage case, “cpu_usage`0” denotes the first index in the cpu_usage array up there, which according to the documentation means that is the actual CPU usage (the second one is the number of known CPUs).

There are also two app-level metrics: feeds and hive_logger_usage which are pushed to Gollector via this facility. This is one of a few ways to get metrics into Gollector from your application or other monitoring agents.

Note that because Circonus is also an alerting system, certain thresholds on these metrics can raise alerts to be sent to PagerDuty or SMS.

SOA for monitoring

One of Gollector’s goals is to keep dependencies to a minimum — either via shelling out, or in process — and it accomplishes this in a few ways which minimize what it has to do directly.

Gollector has many built-in systems-level plugins, but if you need something specialized, you may want to look at these options:

The command plugin allows you to shell out to anything that yields valid JSON output.

The record plugin, already described above, allows you to push metrics to gollector (which will then be slurped up by a monitoring system such as Circonus). Ideal for cron jobs or other similar one-off push work.

The json_poll plugin hits a web service that yields JSON and proxies that data, much like the other two plugins. Plans for a version of this which works with unix sockets are in the pipeline.

The point of these plugins is to keep your programs doing one thing and doing it well; you can write a redis monitor, for example, which spits out JSON and publishes to gollector in a multitude of ways, without exposing your redis infra to the world, or even that monitor. Gollector just needs to know how to collect from it and merge it into its other metrics. Good surface tension at the network level never hurts. :)
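As an illustration of the "spits out JSON" side of that bargain, here is a minimal sketch of a standalone monitor that something like the command plugin could shell out to. Everything specific in it is an assumption: the spool directory, the queue_depth metric name, and the encodeMetrics helper are invented for the example, and the Gollector configuration needed to run it is not shown.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// encodeMetrics renders a flat set of metrics as a single JSON object.
func encodeMetrics(metrics map[string]float64) ([]byte, error) {
	return json.Marshal(metrics)
}

func main() {
	// Hypothetical spool directory; pass a different one as the first argument.
	dir := "/var/spool/myapp"
	if len(os.Args) > 1 {
		dir = os.Args[1]
	}

	// The number of directory entries stands in for a queue depth.
	entries, err := os.ReadDir(dir)
	if err != nil {
		fmt.Fprintln(os.Stderr, "read dir:", err)
		os.Exit(1)
	}

	out, err := encodeMetrics(map[string]float64{
		"queue_depth": float64(len(entries)),
	})
	if err != nil {
		fmt.Fprintln(os.Stderr, "encode:", err)
		os.Exit(1)
	}

	// One JSON document on stdout is the program's entire interface.
	fmt.Println(string(out))
}
```

The monitor knows nothing about Gollector or Circonus. Its only contract is valid JSON on standard output, which is what keeps it small.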

An Example

We have an image processor which uses the go-metrics package to expose numerous stats about our image processing work. This lives in the daemon, and because the daemon itself uses an unauthenticated internal API, the daemon cannot be exposed to the world, or even outside its little ecosystem.

The Gollector config looks something like this:

{
  "image_processor": {
    "Type": "json_poll",
    "Params": "http://localhost:5150/metrics"
  }
}

The result is this in the gollector output:

{
  "image_processor": {
    "image.resize.time": {
      "15m.rate": 1.0738576367027,
      "1m.rate": 4.3828089058537e-7,
      "5m.rate": 0.25335430308081,
      "75%%": 701512440.75,
      "95%%": 826383205.1,
      "99%%": 944633233.17,
      "99.9%%": 1424578962.457,
      "count": 3124296,
      "max": 3027431442,
      "mean": 419020163.28018,
      "mean.rate": 4.040577834929,
      "median": 564463417,
      "min": 7637604,
      "stddev": 187876614.26454
    },
    "req.resize.all.time": {
      "15m.rate": 4.3359060585534,
      "1m.rate": 4.5146036900166e-7,
      "5m.rate": 0.53391325109054,
      "75%%": 5851858223,
      "95%%": 7952005085.2,
      "99%%": 8399359505.56,
      "99.9%%": 8775108298.584,
      "count": 33505086,
      "max": 350385336427,
      "mean": 273994559.53917,
      "mean.rate": 43.331332229999,
      "median": 2366365146.5,
      "min": 610911,
      "stddev": 1445781220.6475
    }
  }
}

There’s a lot more here, but you get the idea.

This way, we get a full devops experience; the operations engineers can focus on configuring global monitoring, and our devs can focus on what matters for them: app performance.

One format to rule them all: JSON

JSON, and by proxy HTTP, gives us a ton of flexibility with the tools and opportunities for (ab)use in surprising ways.

We’ve written a small internal dashboard for a Raspberry Pi which bubbles heavily loaded boxes to the top and colors them. The whole thing is written in static HTML and Angular and runs on the surf browser.

gstat comes with Gollector and can be used like iostat to monitor N hosts and specific metrics on a one second interval without interrupting your core monitoring — great for dev use, as it runs anywhere Go does.
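Because the output is plain JSON over HTTP with basic authentication, a quick poller of your own takes very little code. The sketch below is not gstat and makes no claim about how gstat works. The host and credentials are the placeholders from the example URL earlier in this post, and fetch and loadAverage are hypothetical helpers.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"os"
	"time"
)

// fetch retrieves one JSON document from a Gollector-style endpoint using
// HTTP basic authentication.
func fetch(client *http.Client, url, user, pass string) (map[string]interface{}, error) {
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	req.SetBasicAuth(user, pass)

	resp, err := client.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("unexpected status: %s", resp.Status)
	}

	var doc map[string]interface{}
	if err := json.NewDecoder(resp.Body).Decode(&doc); err != nil {
		return nil, err
	}
	return doc, nil
}

// loadAverage extracts the "load_average" array from a decoded document.
// It reports false if the key is missing or holds anything but numbers.
func loadAverage(doc map[string]interface{}) ([]float64, bool) {
	raw, ok := doc["load_average"].([]interface{})
	if !ok {
		return nil, false
	}
	loads := make([]float64, 0, len(raw))
	for _, v := range raw {
		f, ok := v.(float64)
		if !ok {
			return nil, false
		}
		loads = append(loads, f)
	}
	return loads, true
}

func main() {
	// Placeholder host and credentials; substitute your own.
	const url = "http://my.host.org:8000/"
	client := &http.Client{Timeout: 2 * time.Second}

	// Take five samples, one second apart.
	for i := 0; i < 5; i++ {
		doc, err := fetch(client, url, "gollector", "gollector")
		if err != nil {
			fmt.Fprintln(os.Stderr, "fetch:", err)
		} else if loads, ok := loadAverage(doc); ok {
			fmt.Println(loads)
		}
		time.Sleep(time.Second)
	}
}
```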

Thanks!

It’s been very exciting to develop this application and even more exciting to share it with you all. Have a happy new year!
