Triggit Engineering Blog

Fun With /proc


This post is authored by our Senior Operations Engineer, Erik Hollensbe.

The /proc filesystem is one of the easiest, cheapest ways to quickly get at the status of your machine. We use it extensively in Gollector, preferring it to other methods of retrieving data.

Something to get your feet wet

netstat -an is a great way to get a list of all the connections on a machine, but as mentioned in the previous article, it can be quite slow because it pulls up and formats every connection. How, for example, can we quickly get just the count of all network connections on the machine?

/proc/self/net/ has several files named tcp, udp, tcp6, etc. that contain this information. The same files actually live under each pid-named directory, but each copy shows all of the connections that process can see. /proc/self is simply a magical symlink that points at the directory for the current process ID, similar to how cd /proc/$$ would work. This allows us to interrogate these files without being root!
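To see what that symlink buys you, try a few commands from any shell (the output will of course differ per machine):

readlink /proc/self          # prints the PID of the readlink process itself
ls /proc/$$/net              # the same net/ directory, reached via the shell's own PID
head -1 /proc/self/net/tcp   # header line of the TCP table, readable without root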

Here’s some code to get that count:

count=0

for i in tcp tcp6 udp udp6 unix udplite udplite6
do
  count=$(($count + $(wc -l /proc/self/net/$i | awk '{ print $1 }')))
  # each file has a header line, so decrement the count by 1 so we're not off.
  count=$(($count - 1))
done

echo $count

Which on my dev machine returns 255. The output from netstat -an (with descriptive headers) reports 261. We’re good.

Read the friendly manual

man 5 proc has tons of information on these files. I strongly suggest you at least page through it before trying to write C to interrogate the system directly. Depending on what you need to do, reading these files can make for considerably less error-prone code.

Let’s dig deeper.

netstat -anp yields all sockets and the PIDs and executables attached to them. This is very useful when trying to find a problematic bind(2) call or a process which is not cleaning up its sockets properly.

Output looks something like this:

Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp        0      0 127.0.0.1:25            0.0.0.0:*               LISTEN      1479/master
tcp        0      0 127.0.0.1:953           0.0.0.0:*               LISTEN      1191/named
tcp        0      0 0.0.0.0:40027           0.0.0.0:*               LISTEN      653/Plex Plug-in [c
tcp        0      0 0.0.0.0:32443           0.0.0.0:*               LISTEN      14724/Plex Media Se
tcp        0      0 0.0.0.0:443             0.0.0.0:*               LISTEN      3479/vmware-hostd
tcp        0      0 0.0.0.0:902             0.0.0.0:*               LISTEN      3364/vmware-authdla
tcp        0      0 127.0.0.1:6379          0.0.0.0:*               LISTEN      13686/redis-server
tcp        0      0 0.0.0.0:1836            0.0.0.0:*               LISTEN      14873/Plex DLNA Ser

Let’s write this, using /proc.

exe and cmdline

/proc/<pid>/cmdline and /proc/<pid>/exe are two very useful ways to get at process information for any given process ID.

cmdline is just a text representation of the command line, with each argument delimited by a NUL (ASCII 0) byte, and it is readable by all users. This is the “process title” you see in ps output.

exe is a little more esoteric but extremely useful. Only accessible by the owning user (or root), it’s a symlink to the dereferenced path of the binary used to run the process.

For example:

I run /bin/zsh -l as my shell. cat /proc/self/cmdline at this point is /bin/zsh\x00-l (the escape is not literal; that byte really is an ASCII 0). However, on Ubuntu machines the zsh binary actually comes from /bin/zsh4, and that is what readlink /proc/self/exe gives you.
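If you want to poke at this yourself, something like the following works from any shell (the /bin/zsh4 path is an Ubuntu detail, so expect a different result elsewhere):

tr '\0' ' ' < /proc/$$/cmdline; echo   # render the NUL-delimited cmdline readably
readlink /proc/$$/exe                  # the dereferenced binary behind the shell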

exe is really nice for things like sidekiq or sendmail, which change their process title; the changed title is what shows up in cmdline, while exe still points at the binary that was actually executed.

OK… Get on with it.

Let’s write pgrep first, shall we? You’ll need to run this as root.

my_pgrep() {
  command=$1

  (
    for i in /proc/*/exe
    do
      # match the dereferenced binary path against the requested command
      if readlink $i | grep -q "$command"
      then
        # the PID is the directory name between /proc/ and /exe
        echo $(basename $(dirname $i))
      fi
    done
  ) | sort -n
}

# compare the stock tool against ours:
pgrep zsh

my_pgrep /bin/zsh4

Quite a bit slower than pgrep, but very effective. Can you write a version that uses /proc/*/cmdline and works exactly like pgrep -f?
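If you get stuck, here is one rough sketch; it is not a faithful pgrep -f (it will happily match the shell you run it from, among other differences), just a starting point:

my_pgrep_f() {
  pattern=$1

  (
    for i in /proc/[0-9]*/cmdline
    do
      # translate the NUL delimiters to spaces so grep sees a normal command line
      if tr '\0' ' ' < "$i" | grep -q "$pattern"
      then
        echo $(basename $(dirname $i))
      fi
    done
  ) | sort -n
}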

Sockets and readlink

readlink is a swiss army chainsaw. It’s great for locating the source of symlinks, device files, and … sockets.

Let’s look at a running copy of named. It has numerous file descriptors in /proc/1191/fd and we want to find out which ones are sockets. How do?

Well, readlink can tell us (again, you’ll need to be the same user as named or root to try this):

my_lsof() {
  pid=$1

  (
    for i in /proc/$pid/fd/*
    do
      # print "<fd number> <link target>" for every descriptor
      echo -n "$(basename $i) "
      readlink $i
    done
  ) | sort -n -k 1
}

my_lsof 1191

Which should yield output similar to this:

0 /dev/null
1 /dev/null
2 /dev/null
3 socket:[9783]
4 /dev/null
5 pipe:[1423]
7 pipe:[1423]
8 anon_inode:[eventpoll]
9 /dev/random
20 socket:[14362]
21 socket:[14367]
22 socket:[14369]
23 socket:[9795]
24 socket:[9796]
25 socket:[15925]
26 socket:[15927]
512 socket:[14361]
513 socket:[14366]
514 socket:[14368]
515 socket:[15924]
516 socket:[15926]
517 socket:[262364]

So, we see /dev/null in quite a few spots (this is a daemon, after all), and some special syntax for ephemeral file descriptors:

type:[inode]

These inodes can be used to look up values in the network files we examined earlier.
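For instance, fd 23 above is socket:[9795], and that inode shows up again in the network files we examined earlier, so a quick grep ties the descriptor back to a connection (the numbers here are from my machine and will differ on yours):

readlink /proc/1191/fd/23        # -> socket:[9795]
grep -w 9795 /proc/self/net/tcp  # the connection that socket belongs to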

Finally! We’re getting to the meat!

If we look at /proc/self/net/tcp, we see this:

  sl  local_address rem_address   st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode
   0: 0100007F:0019 00000000:0000 0A 00000000:00000000 00:00000000 00000000     0        0 1501 1 ffff8807f03b8000 100 0 0 10 0
   1: 0100007F:03B9 00000000:0000 0A 00000000:00000000 00:00000000 00000000   109        0 9795 1 ffff8807f0510000 100 0 0 10 0
   2: 00000000:9C5B 00000000:0000 0A 00000000:00000000 00:00000000 00000000   110        0 7488168 1 ffff880572db7000 100 0 0 10 0

This contains the information we need; note the inode column near the end of each line. We can also use /usr/bin/stat or an equivalent to cheat a little on what we look at.
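Two quick asides before the code, both just sketches of what is going on. The local_address and rem_address fields are hex: on a little-endian (x86) box the IP bytes read back-to-front, and the port is plain hex. And stat -L will follow an fd symlink and report the socket inode directly, which is presumably the sort of shortcut the stat remark above is alluding to:

# 0100007F:0019 from the listing above is 127.0.0.1:25 (postfix's listener)
printf '%d.%d.%d.%d:%d\n' 0x7F 0x00 0x00 0x01 0x0019

# stat -L dereferences the fd symlink; the inode matches the socket:[9783] we saw for fd 3
stat -Lc %i /proc/1191/fd/3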

So, let’s dig these sockets up! I’ve written this in Ruby instead of shell because it should be a little faster and a little terser. I’ve done my best to comment the code on a line-by-line basis so you can understand what’s happening.

Note that no attempt has been made to handle unix sockets or IPv6. An exercise for the reader, perhaps.

#!/usr/bin/env ruby

def unpack_raw(addr)
  # extract the network information from the raw ip:port in hex.
  raw_ip, port = addr.split(/:/).map(&:hex) # convert the hex values to integers
  ip = ""
  4.times do
    ip += (raw_ip & 0xFF).to_s # pull the next octet
    ip += "." # append a dot
    raw_ip = raw_ip >> 8 # shift down the octets
  end
  return ip.chop, port # String#chop is used because we'll have a trailing dot.
end

NETWORK_FILES = %w[tcp udp]

inode_hash = { } # mapping of inode -> fd file

Dir["/proc/*/fd/*"].each do |fd|
  link = (File.readlink(fd) rescue nil)
  if link and link =~ /\Asocket:\[/
    inode_hash[link.match(/\Asocket:\[(.*?)\]\z/)[1]] = fd
  end
end

NETWORK_FILES.each do |file|
  lines = File.readlines("/proc/self/net/#{file}")
  lines.shift
  network_info = lines.map { |line| line.split(/\s+/) }
  network_info.each do |info|
    begin # file descriptors can vanish at any time; just bail if we can't get the info
      local, remote = info[2..3]
      local_str     = unpack_raw(local).join(":")
      remote_str    = unpack_raw(remote).join(":")

      inode       = info[10] # get the inode from the text
      fd_file     = inode_hash[inode] # look up our fd file
      owner       = File.stat(fd_file).uid # get the owner of the fd
      pid         = fd_file.match(%r!/proc/(.*?)/!)[1] # get the pid of the fd
      executable  = File.readlink("/proc/#{pid}/exe") # get the process the fd belongs to

      # this prints our output
      puts "#{file}: local:#{local_str} remote:#{remote_str} uid:#{owner} pid:#{pid} exec:#{executable}"
    rescue
    end
  end
end

This is the kind of output it yields when run as root:

tcp: local:127.0.0.1:25 remote:0.0.0.0:0 uid:0 pid:1479 exec:/usr/lib/postfix/master
tcp: local:127.0.0.1:953 remote:0.0.0.0:0 uid:109 pid:1191 exec:/usr/sbin/named
tcp: local:0.0.0.0:40027 remote:0.0.0.0:0 uid:110 pid:653 exec:/usr/lib/plexmediaserver/Resources/Python/bin/python
tcp: local:0.0.0.0:32443 remote:0.0.0.0:0 uid:110 pid:14724 exec:/usr/lib/plexmediaserver/Plex Media Server
tcp: local:0.0.0.0:443 remote:0.0.0.0:0 uid:0 pid:3479 exec:/usr/lib/vmware/bin/vmware-hostd
tcp: local:0.0.0.0:902 remote:0.0.0.0:0 uid:0 pid:3364 exec:/usr/sbin/vmware-authdlauncher
tcp: local:127.0.0.1:6379 remote:0.0.0.0:0 uid:111 pid:13686 exec:/usr/bin/redis-server

Thanks!

Please, again, see man 5 proc for more information on this great facility!

Monitoring: Circonus and Gollector


This post is authored by our Senior Operations Engineer, Erik Hollensbe.

In the end, this is what you get:

Cassandra Network Usage

This, as the name implies, is a graph of the internal network usage of a Cassandra cluster. Each host is a different color. This Cassandra cluster is five nodes and brand new, hence the low utilization.

Some of you will recognize the monitoring service: Circonus is a killer hacker-grade monitoring system that’s made for people who are willing to go the extra mile to do deep integration with their stacks.

So that’s what we did. We wrote Gollector, which is a modern, Linux-only replacement for Resmon, the preferred agent in the Circonus universe.

Why Reinvent the Wheel?

Before we get into anything about Gollector, or how it integrates with Circonus, this question should be answered. Resmon (and other agents like it) tends to shell out to gather metrics; for example, to count open network connections:

$ netstat -an | wc -l

Simplified, but nonetheless: netstat -an is a great way to make your system soil itself after a certain number of connections are registered (no matter what state they’re in). Here’s a video that visually explains the problem I’m discussing (if exaggerated a bit), although some of our traffic patterns see this regularly without leaking sockets. :)

A lot of monitors — Resmon included — shell out to handle most of their work. This has a significant performance penalty in some scenarios and makes it hard to reason about the impact it has on your system, because, after all, it’s the monitoring system that tells you this stuff.

Gollector is a new monitoring agent that relies on the proc filesystem and C POSIX calls such as sysconf to determine your machine’s profile. We have been using it for about 4 months now without any issue. (well, save this one which in practice is not an issue)
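If you want a feel for the sysconf side without writing C, getconf exposes roughly the same values from the shell; this is only a rough analogue, since Gollector itself calls sysconf directly:

getconf _NPROCESSORS_ONLN   # online CPUs, i.e. sysconf(_SC_NPROCESSORS_ONLN)
getconf PAGE_SIZE           # memory page size in bytes
getconf CLK_TCK             # clock ticks per second, needed to interpret the times in /proc/<pid>/stat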

It also introduces a new concept of decoupling monitoring sources — allowing calls into other monitoring systems and proxying the data over to a brokering system such as Circonus. This allows app level metrics to be exposed publicly without compromising the firewall rules or architecture of the application. It also means Gollector itself does a lot less and requires no dependencies on things you may not need — like a postgres or redis library. We’ll discuss this more in a bit.

Gollector and Circonus integration

Circonus has the excellent JSON collector which allows Gollector to work. You just point it at your authenticated URL and Circonus figures out what metrics to use.

N.B.: For those of you who are wondering about security: Gollector uses basic authentication and no SSL. Circonus Enterprise Brokers can be used behind the firewall, which was a large part of the reason SSL is currently unsupported. Patches welcome.

N.B. 2: Gollector’s build and setup instructions are here and you might want to build / set one up before reading the rest of this.

For example, pointing Circonus’s JSON check at http://gollector:gollector@my.host.org:8000/ might yield this visual:

Circonus JSON Check

(Sorry for the unsightly black bars. Because ops, I can’t show you those parts.)

You can see the checks near the bottom — they’re part of a json collection that looks something like this:

{
  "cpu_usage": [
    0.06,
    16
  ],
  "eth0_usage": {
    "Received (Bytes)": 70560,
    "Received (Packets)": 934,
    "Reception Errors": 0,
    "Transmission Errors": 0,
    "Transmitted (Bytes)": 86502,
    "Transmitted (Packets)": 808
  },
  "eth1_usage": {
    "Received (Bytes)": 37866,
    "Received (Packets)": 331,
    "Reception Errors": 0,
    "Transmission Errors": 0,
    "Transmitted (Bytes)": 42304,
    "Transmitted (Packets)": 217
  },
  "feeds": {
    "adtemplate_resize_image_go_1_images": 0.083256185482529,
    "adtemplate_resize_image_go_2_images": 0.042405453209651,
    "adtemplate_resize_image_go_3_images": 0.06190829802995,
    "adtemplate_resize_image_go_4_images": 10.194585215293,
    "feed_create_or_update_products_to_api": 1.6207301663943,
    "feed_create_or_update_products_to_cassandra": 4.8350784676489
  },
  "hive_logger_usage": [
    0,
    984268259328,
    984477601792,
    false
  ],
  "load_average": [
    0,
    0.11,
    0.22
  ],
  "mem_usage": {
    "Free": 59801563136,
    "Total": 67553263616,
    "Used": 7751700480
  },
  "root_usage": [
    13,
    839262228480,
    968348774400,
    false
  ],
  "triggit_usage": [
    7,
    914974879744,
    984476590080,
    false
  ]
}

As you can see, Circonus uses the backtick (`) separator to walk through JSON objects and arrays. In the cpu_usage case, cpu_usage`0 denotes the first element of the cpu_usage array above, which according to the documentation is the actual CPU usage (the second element is the number of known CPUs).
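If you are curious what metric names that scheme produces for a blob like the one above, a jq one-liner sketches it; this assumes the Gollector output has been saved to gollector.json and that jq is installed, and it only illustrates the naming, not how Circonus does it internally:

jq -r 'paths(scalars) as $p | ($p | map(tostring) | join("`")) + " = " + (getpath($p) | tostring)' gollector.json

That prints lines like cpu_usage`0 = 0.06 and eth0_usage`Received (Bytes) = 70560.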

There are also two app-level metrics, feeds and hive_logger_usage, which are pushed to Gollector via this facility. This is one of a few ways to get metrics into Gollector from your application or other monitoring agents.

Note that because Circonus is also an alerting system, certain thresholds on these metrics can raise alerts to be sent to Pagerduty or SMS.

SOA for monitoring

One of Gollector’s goals is to keep dependencies to a minimum — either via shelling out, or in process — and it accomplishes this in a few ways which minimize what it has to do directly.

Gollector has many built-in systems-level plugins, but if you need something specialized, you may want to look at these options:

The command plugin allows you to shell out to anything that yields valid JSON output.

The record plugin, already described above, allows you to push metrics to gollector (which will then be slurped up by a monitoring system such as Circonus). Ideal for cron jobs or other similar one-off push work.

The json_poll plugin hits a web service that yields JSON and proxies that data similar to the other two plugins. Plans for a version of this which works with unix sockets are in the pipe.

The point of these plugins is to keep your programs doing one thing and doing it well; you can write a redis monitor, for example, which spits out JSON and publishes to gollector in a multitude of ways, without exposing your redis infra to the world, or even that monitor. Gollector just needs to know how to collect from it and merge it into its other metrics. Good surface tension at the network level never hurts. :)
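As a hypothetical sketch of that idea, a command-plugin style redis monitor only has to print valid JSON; something along these lines would do (the field names come straight from redis-cli INFO):

redis-cli INFO | awk -F: '
  /^connected_clients:/ { gsub(/\r/,""); clients = $2 }
  /^used_memory:/       { gsub(/\r/,""); mem = $2 }
  END { printf "{\"connected_clients\": %s, \"used_memory\": %s}\n", clients, mem }'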

An Example

We have an image processor which uses the go-metrics package to expose numerous stats about our image processing work. This lives in the daemon, and because the daemon itself uses an unauthenticated internal API, it cannot be exposed to the world, or even outside its little ecosystem.

The Gollector config looks something like this:

{
  "image_processor": {
    "Type": "json_poll",
    "Params": "http://localhost:5150/metrics"
  }
}

The result is this in the gollector output:

{
  "image_processor": {
    "image.resize.time": {
      "15m.rate": 1.0738576367027,
      "1m.rate": 4.3828089058537e-7,
      "5m.rate": 0.25335430308081,
      "75%%": 701512440.75,
      "95%%": 826383205.1,
      "99%%": 944633233.17,
      "99.9%%": 1424578962.457,
      "count": 3124296,
      "max": 3027431442,
      "mean": 419020163.28018,
      "mean.rate": 4.040577834929,
      "median": 564463417,
      "min": 7637604,
      "stddev": 187876614.26454
    },
    "req.resize.all.time": {
      "15m.rate": 4.3359060585534,
      "1m.rate": 4.5146036900166e-7,
      "5m.rate": 0.53391325109054,
      "75%%": 5851858223,
      "95%%": 7952005085.2,
      "99%%": 8399359505.56,
      "99.9%%": 8775108298.584,
      "count": 33505086,
      "max": 350385336427,
      "mean": 273994559.53917,
      "mean.rate": 43.331332229999,
      "median": 2366365146.5,
      "min": 610911,
      "stddev": 1445781220.6475
    }
  }
}

There’s a lot more here, but you get the idea.

This way, we get a full devops experience; the operations engineers can focus on configuring global monitoring, and our devs can focus on what matters for them: app performance.

One format to rule them all: JSON

JSON, and by extension HTTP, gives us a ton of flexibility in tooling, and opportunities for (ab)use in surprising ways.

We’ve written a small internal dashboard for a Raspberry Pi which bubbles heavily loaded boxes to the top and colors them; the whole thing is written in static HTML and Angular and runs in the surf browser.

gstat comes with Gollector and can be used like iostat to monitor N hosts and specific metrics on a one second interval without interrupting your core monitoring — great for dev use, as it runs anywhere Go does.

Thanks!

It’s been very exciting to develop this application and even more exciting to share it with you all. Have a happy new year!