Measure Anything, Measure Everything

Posted by Ian Malpass on February 15, 2011

If Engineering at Etsy has a religion, it’s the Church of Graphs. If it moves, we track it. Sometimes we’ll draw a graph of something that isn’t moving yet, just in case it decides to make a run for it. In general, we tend to measure at three levels: network, machine, and application. (You can read more about our graphs in Mike’s Tracking Every Release post.)

Application metrics are usually the hardest, yet most important, of the three. They’re very specific to your business, and they change as your applications change (and Etsy changes a lot). Instead of trying to plan out everything we wanted to measure and putting it in a classical configuration management system, we decided to make it ridiculously simple for any engineer to get anything they can count or time into a graph with almost no effort. (And, because we can push code anytime, anywhere, it’s easy to deploy the code too, so we can go from “how often does X happen?” to a graph of X happening in about half an hour, if we want to.)

Meet StatsD

StatsD is a simple NodeJS daemon (and by “simple” I really mean simple — NodeJS makes event-based systems like this ridiculously easy to write) that listens for messages on a UDP port. (See Flickr’s “Counting & Timing” for a previous description and implementation of this idea, and check out the open-sourced code on github to see our version.) It parses the messages, extracts metrics data, and periodically flushes the data to graphite.
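The messages themselves are tiny plain-text strings. As a rough sketch (assuming the common StatsD line format of `name:value|type`, with an optional `|@rate` suffix for sampled data), a parser looks something like this:

```python
def parse_metric(packet):
    """Parse a StatsD-style packet: "name:value|type" with optional "|@rate"."""
    name, rest = packet.split(":", 1)
    fields = rest.split("|")
    value = float(fields[0])
    metric_type = fields[1]  # "c" for counters, "ms" for timers
    sample_rate = 1.0
    if len(fields) > 2 and fields[2].startswith("@"):
        sample_rate = float(fields[2][1:])
    return name, value, metric_type, sample_rate

print(parse_metric("grue.dinners:1|c"))      # a simple counter increment
print(parse_metric("search.query:320|ms"))   # a timer value, in milliseconds
```

(The metric names here are just illustrative; the real daemon, of course, does this in NodeJS.)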

We like graphite for a number of reasons: it’s very easy to use, and has very powerful graphing and data manipulation capabilities. We can combine data from StatsD with data from our other metrics-gathering systems. Most importantly for StatsD, you can create new metrics in graphite just by sending it data for that metric. That means there’s no management overhead for engineers to start tracking something new: simply tell StatsD you want to track “grue.dinners” and it’ll automagically appear in graphite. (By the way, because we flush data to graphite every 10 seconds, our StatsD metrics are near-realtime.)

Not only is it super easy to start capturing the rate or speed of something, it’s also very easy to view, share, and brag about the results.

Why UDP?

So, why do we use UDP to send data to StatsD? Well, it’s fast — you don’t want to slow your application down in order to track its performance — but also sending a UDP packet is fire-and-forget. Either StatsD gets the data, or it doesn’t. The application doesn’t care if StatsD is up, down, or on fire; it simply trusts that things will work. If they don’t, our stats go a bit wonky, but the site stays up. Because we also worship at the Church of Uptime, this is quite alright. (The Church of Graphs makes sure we graph UDP packet receipt failures though, which the kernel usefully provides.)
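In code, fire-and-forget is just an unconnected UDP socket and a send. A minimal client sketch in Python (the address is an assumption for illustration; 8125 is the port StatsD conventionally listens on):

```python
import socket

STATSD_ADDR = ("127.0.0.1", 8125)  # assumed address, for illustration
_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

def send_metric(data):
    """Send one metric packet and move on; never block, never raise."""
    try:
        _sock.sendto(data.encode("ascii"), STATSD_ADDR)
    except OSError:
        pass  # StatsD may be up, down, or on fire; the app doesn't care

send_metric("grue.dinners:1|c")
```

Note there is no reply to wait for and no connection to fail: the call returns immediately whether or not anything is listening.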

Measure Anything

Here’s how we do it using our PHP StatsD library:

StatsD::increment("grue.dinners");

That’s it. That line of code will create a new counter on the fly and increment it every time it’s executed. You can then go look at your graph and bask in the awesomeness, or, for that matter, spot someone up to no good in the middle of the night:

Graph showing login successes and login failures over time

We can use graphite’s data-processing tools to take the data above and make a graph that highlights deviations from the norm:

Graph showing login failures per attempt over time

(We sometimes use the “rawData=true” option in graphite to get a stream of numbers that can feed into automatic monitoring systems. Graphs like this are very “monitorable.”)

We don’t just track trivial things like how many people are signing into the site — we also track really important stuff, like how much coffee is left in the kitchen:

Graph showing coffee availability over time

Time Anything Too

In addition to plain counters, we can track times too:

$start = microtime(true);
// ... the code being timed ...
StatsD::timing("grue.dinners", (microtime(true) - $start) * 1000);

StatsD automatically tracks the count, mean, maximum, minimum, and 90th percentile times (which is a good measure of “normal” maximum values, ignoring outliers). Here, we’re measuring the execution times of part of our search infrastructure:

Graph showing upper 90th percentile, mean, and lowest execution time for auto-faceting over time
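The per-interval summary statistics described above can be sketched in a few lines (this is my own approximation of the calculation, not the actual StatsD source):

```python
def timer_stats(timings, pct=90):
    """Summarize one flush interval's timer values: count, min, max, mean,
    and an upper bound that discards the slowest (100 - pct)% of values."""
    values = sorted(timings)
    count = len(values)
    cutoff = int(round(count * pct / 100.0))
    kept = values[:cutoff] if cutoff else values
    return {
        "count": count,
        "min": values[0],
        "max": values[-1],
        "mean": sum(values) / count,
        "upper_90": kept[-1],
    }

print(timer_stats([12, 8, 9, 10, 11, 9, 10, 13, 11, 250]))
```

With a sample like this, the 250ms outlier dominates the max but drops out of the 90th-percentile figure, which is exactly why it makes a good measure of “normal” maximum values.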

Sampling Your Data

One thing we found early on is that if we want to track something that happens really, really frequently, we can start to overwhelm StatsD with UDP packets. To cope with that, we added the option to sample data, i.e. to only send packets a certain percentage of the time. For very frequent events, this still gives you a statistically accurate view of activity.

To record only one in ten events:

StatsD::increment("adventurer.heartbeat", 0.1);

What’s important here is that the packet sent to StatsD includes the sample rate, and so StatsD then multiplies the numbers to give an estimate of a 100% sample rate before it sends the data on to graphite. This means we can adjust the sample rate at will without having to deal with rescaling the y-axis of the resulting graph.
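Both halves of that contract can be sketched in a few lines (the `|@0.1` suffix is the conventional way the sample rate rides along in the packet):

```python
import random

def sampled_increment(send, metric, sample_rate=0.1):
    """Client side: only send a packet sample_rate of the time, tagging
    the packet with the rate so the server knows how much it missed."""
    if random.random() < sample_rate:
        send("%s:1|c|@%g" % (metric, sample_rate))

def rescale(value, sample_rate):
    """Server side: scale a sampled count back up to a 100% estimate
    before flushing to graphite."""
    return value / sample_rate
```

Each packet received at a 0.1 sample rate stands in for roughly ten real events, so `rescale(1, 0.1)` gives 10, and the resulting graph keeps the same y-axis regardless of the rate chosen.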

Measure Everything

We’ve found that tracking everything is key to moving fast, but the only way to do it is to make tracking anything easy. Using StatsD, we enable engineers to track what they need to track, at the drop of a hat, without requiring time-sucking configuration changes or complicated processes.

Try StatsD for yourself: grab the open-sourced code from github and start measuring. We’d love to hear what you think of it.

Category: data, engineering, infrastructure


Brilliant! We use a similar approach with collectd and try to track anything relevant! The funny thing is that something can only be relevant for a specific release, so we track, and then forget!

Why not have the traditional SNMP traps?

JULIEN: Well, the good news is that these UDP pings are so lightweight that it’s generally not a problem to keep them around for a while, and it’s surprising how often you find that “just for this release” metrics turn out to have interesting information in them weeks later. But yes, it’s good to clean house every so often.

MANAS: Those would have worked, I’m sure. I think there are lots of ways to solve this particular problem. As long as a given solution has next to no management overhead, and is trivially easy for engineers to use, you’ve got something useful.

I’m not familiar with the concept of negative coffee. Also, I see you guys sit around the coffee pot at 17:00 just waiting for the fresh pot to finish brewing, and then immediately chug away 😉

Will you be releasing the StatsD PHP client library?

Eric: Yep, it’s already in with the statsd code on github –

Daniel: I see you’ve spotted that our coffee monitoring system doesn’t cope well with people leaving the pot off the scale, demonstrating the importance of tracking metrics in software development 😉

Ian, this is great stuff. I’ve already got a project this is going to get stood up for. I ported the PHP sample to Python, since that’s my environment. You can find it on my statsd fork:

(I sent a pull request just in case you guys find it useful)

Thanks again for sharing your tools!

A stand-alone Python Statsd client is now at:


[…] stumbled across this recent posting by one of the engineers. I am “like WOW!”. I am jumping on the web 3.0 […]

Steve: I applied your patches, good stuff 🙂

Someone needs to make a ruby gem or example client library for us to include!


Thanks! I noticed that a bit earlier.

I also managed to get the standalone client into pypi tonight, and got it to install via pip on my server. Now to get cairo, pixman, and pycairo working… *grumble*.

I wrote a little Ruby client (basically a port of the Python example), over here:

I get that the fire-and-forget power of UDP allows your apps to track anything/everything without compromising responsiveness. I have a question about the Why behind StatsD. Before you wrote StatsD, did you find that you were saturating Graphite’s agent (carbon-agent), or was this more of a preemptive strike? I’m curious about carbon-agent’s capacity under variable load.

Great blog, btw–it’s inspiring to see a whole crew of developers so proud of the tools they build!

[…] Measure Anything, Measure Everything seems pretty cool. Suspect you could do something in your scripts to ping the counter so you could get visualizations of your runs. […]

As I mentioned to Erik (Kastner) the other day, it would be cool if there was a wiki or other public repository of stats/graphite recipes. I know how to shove data into graphite with statsd, but I don’t feel like I have a good grasp of how to best tease out the interesting graphs.

Tim – I think the issue is in your first sentence. To do what they’re doing with carbon directly you’d have to have the additional overhead of building a tcp connection.

If carbon had the ability to receive data via UDP messages like this I think it would be fine in terms of load. But this code also abstracts some of the work. As simple as it is to get data into graphite, this lets you easily add certain kinds of graphs without the developers using it having to know much about how it works.

It also lets you force them into a given hierarchy – so they can’t clutter the root with tons of new graph paths, which is a nice touch as well.

Tim – Mark’s point is a good one, but really, the key feature of StatsD is that it aggregates metrics into time buckets (10 seconds in our case). When you send data to graphite, you say “store value N for metric M at time T”. If you have multiple, separate M events happening at time T, you need a central aggregator to sum these and then send a single value to graphite. This central aggregation also allows us to do the statistical work for the timing functions – high/low/mean/90th-percentile.
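That central aggregation is the heart of the daemon. A toy version of a counter aggregator (in Python rather than the actual Node implementation, and simplified to counters only):

```python
from collections import defaultdict

class CounterAggregator:
    """Sum counter packets between flushes, so graphite gets exactly one
    value per metric per time bucket."""

    def __init__(self):
        self.counters = defaultdict(float)

    def record(self, name, value=1, sample_rate=1.0):
        # Undo client-side sampling as packets arrive
        self.counters[name] += value / sample_rate

    def flush(self):
        # Emit one "store value N for metric M at time T" entry per metric,
        # then reset for the next bucket
        snapshot = dict(self.counters)
        self.counters.clear()
        return snapshot

agg = CounterAggregator()
agg.record("grue.dinners")
agg.record("grue.dinners")
print(agg.flush())  # {'grue.dinners': 2.0}
```

Two increments arriving in the same bucket sum to one data point, instead of the second overwriting the first as it would if each client wrote to graphite directly.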

[…] simple performance metrics without a lot of centralized processing. I could use something like StatsD from the Etsy folks but got inspired by reading about Redis at Disqus the other […]

Aaaand, once more client – in node.js this time:

Joshua Frederick (jfred on github) contributed a python implementation of the statsd server. I don’t know how it compares to the node version for speed (it’s not async) but it’s pretty cool to have another implementation of the server.

[…] This blog post by the etsy engineering team about tracking everything made me drool […]

We implemented this in my office, and found frequent 5 second delays (or delays in increments of 5 seconds, since we have multiple statsd calls per transaction). We had to turn it off.

We don’t seem to see as many (or any) delays when the (PHP) client and server are on the same host, but as soon as they’re separated by a network, the delays are awful.

Since Etsy is clearly using more than one server (ha!), presumably you either deal with this problem, or have worked around it, or, I suppose, have a better network than we do. There doesn’t seem to be any way to “fire and forget” an async StatsD::increment, for example.

Any thoughts from either Etsy or other PHP users?

How are you guys doing the coffee graph? I can’t seem to find any documentation in graphite about actual counting on the graphs. Nothing in the statsD library makes me think I can control that either.

And, I’ve added a java client–

    Andrew, you wrote about the java client 4 years ago. is it still available?

We just created an erlang implementation. It’s only been rudimentarily tested against Joshua’s / Steve’s pystatd.server.

Since you want to ignore outliers, why are you using the mean rather than the median?

Phillip: is it a DNS resolution problem? To achieve true “fire and forget”, you probably need to configure the client to send to an IP address rather than a host name.

Does this work with any database configuration, or do you recommend only using the Round Robin Database tool suggested in the Flickr post?

As I start to investigate, this is my main question. MySQL, Postgres or Mongo, whatever, just curious if it’s open to use whatever we’d like.

Donnie: We only wanted to ignore outliers in the “upper bound” metric, just to avoid the worst spikes. We could (probably should) add calculating the median to the StatsD timing code too.

Kris: Graphite uses a round-robin database to store its data, and we send the data to Graphite. The *idea* of StatsD (a central aggregator, sending data on to a storage system) could be applied to any backend storage system, if Graphite doesn’t suit you.

Adam: We cheat, and use the timers to send the number of grams on the USB scale as a “time”, and then graph the mean along with Graphite’s keepLastValue() function, and its scale() function to convert the grams to cups.

Ian: Here’s something you might consider, based on a common technique in statistics called a box plot. They plot the median, the 25th and 75th percentiles, and two bars extending 1.5x the interquartile range (the difference between the 25th and 75th percentiles); anything beyond those bars is an outlier that gets an individual dot.

To simplify things, you could just show the median and outliers, or you could add more if you really care about information at that level of detail.

Donnie: At this point, we start to reach the current limits of Graphite’s graph drawing 🙁 It would certainly be interesting to expand StatsD with more statistical analysis (standard deviations, other percentiles, etc.) but it’s all more work for StatsD to do, and more data to send to Graphite; right now we don’t really need it, and we can’t necessarily make Graphite draw it nicely. But the Node code is sufficiently simple that it could be added with very little effort by anyone who did have a use for it.

Phillip: This does sound like a networking issue. You could try doing some UDP tests between one of your remote machines and your StatsD machine to try to isolate the problem area.

Ian, this made my day :).. I have one question, though.. you say you sometimes use rawData to get CSV values and trigger some monitoring from that. Have you self-coded the monitoring or are you using anything commercial? I haven’t been able to find an integration between Graphite and nagios, monit, cacti or similar.. I found integration between Graphite and Ganglia and from there you can go to Nagios, but that seems complicated just for getting some alerts. Thank you for your time and, again, congrats for a great job.

    Guillem – We’ve used Jenkins to do some alerting based on Graphite data. I believe other stuff is assorted bits of Perl, Python, etc. But no, no formal integration.

Awesome. I’ll be standing this up soon on a project in need of monitoring like this. Thanks!

We’re actually working on something similar (which is OSS) named Saturnalia DB

which we are deploying into production now for Spinn3r and is in the process of seeing some much larger installs.

We’re also building out a new UI which is integrated as part of the system, but we think that visualization is a major competitive advantage and we want to NAIL it… 🙂

[…] Measure Anything, Measure Everything […]

I’ve written a C client for statsd, named libstatsdc

It’s available at

I’m on OSX and I get this error when trying to start statsd

Has anyone else seen this before?

[…] really defer to Etsy on this. They do it really well and they’re happy to share how they do […]

Nice post.

What do you guys do about logs? Do you monitor your application logs and post metrics (such as errors/min) back to StatsD, or do you handle logs differently?

    Kenshi – take a look at logster which we use for sending log info to Graphite.

[…] script that scrapes the flat logs periodically and passing into an included python script.  Big ups to Esty for letting me know about graphite. Check out these sample graphs from their implementation of graphite: […]

[…] For more background on Statsd, check out this blog article from Etsy:  Measure Anything, Measure Everything […]

[…] the folks at Etsy recently. Etsy is well known in DevOps circles for their Continuous Deploymet and Metrics […]

Why does Etsy care so much about automated software testing?…

At Etsy a key part of our process is that we make many deploys at a high velocity. We’ve found by experience that writing and running tests enables us to ship faster and more often. Tests help us to communicate with each other and having tests for new…

Wow! I need to get this set up and running – it sounds like it’s a near-perfect fit for my needs.


Assuming a normal distribution of the data, it’s straightforward to get an estimate of the standard deviation by tracking the number of values, the sum of the values and the sum of the square of the values.

And then the 95 percentile = mean + 2 x STDDEV.

So where does the “90th percentile” come from? Or am I missing something?


[…] that sets great sites apart from mediocre ones. There are great tools out there that allow us to Measure Anything, Measure Everything in our software products. The goal of instrumenting software is so that you can see that users are […]

[…] A few months ago I blogged abou how much I love stats. One of the things that I shared in that post was a blog post done by the etsy developer team about statsd and graphite to track everything. […]

Colin: When I wrote “normal maximum values” I meant “usual” rather than referring to a normal distribution. I haven’t actually checked the distributions of values we get, but I suspect they’re non-normal. Furthermore, I suspect that standard deviation may be less reliable for lower-frequency metrics where you don’t have lots and lots of timings in a given bucket. The 90th percentile is really just throwing away the worst of the outliers in a very simple manner.

I’m interested in logging data from client-side javascript into graphite. This data is quite frequent, so statsd seems to be right way to go but one can’t make a UDP connection from a browser. What’s the best way to go about this — make an XHR to some other service that then forwards on to statsd, or something else?

Graham: It’s a good question, and one that I was thinking about just the other day. You’re basically looking at an event beacon, such as you get with various web analytics engines. Probably the lowest-overhead/simplest set-up would be a “single-pixel GIF” type request, then picking up the hits by tracking requests in your access logs (see for how you might do that). Bear in mind that publicly-accessible endpoints like this would be vulnerable to malicious hits if someone really wanted to mess with you….

Hey guys, great work! I do have a question though, I cannot for the life of me figure out why you are generating stats_count metrics at the same time as the primary metric. While I am most definitely not a node developer, as far as I can tell it looks like the stats_count will always be nothing more than stat * flush interval seconds. Thinking I must be missing something, and curious why you’re burning this out to disk and how you use it ? Thanks again for publishing this!

I too am trying to figure out stats vs stats_count. What’s up with that?

[…] to your app. Combine it with (say) statsd from Etsy, and adding any stats you want is easy. (Read this blog post from Etsy if you want to learn more about measuring your app, and how to add support for Statsd to your […]

[…] Statsd is a simple client/server mechanism from the folks at Etsy that allows operations and development teams to easily feed a variety of metrics into a Graphite system. For more info on statsd read the seminal blog article on Statsd “Measure Anything, Measure Everything”. […]

[…] currently in the process of evaluating Yahoo Boomerang and graphite for capturing large volumes of performance […]

What are you using for the coffee graph? Is is a wifi-controlled scale or what are you using?

What USB scale does Etsy use for tracking coffee?

I’ve recently deployed statsd/graphite and spent a bit of time looking at it. Nice work guys. I think I can respond to the stats vs stats_count question. It appears to me that stats_count is a raw count of the amounts that were sent to statsd. stats on the other hand is treated like a rate. It ends up getting divided by the flush interval (in seconds). So what you have is a per-second representation of that value.

Persisting PAL monitoring stats…

The PAL memcached servers are currently used to persist PAL monitoring counters across requests. This is undesirable for a number of reasons, most prominent of which is that memcached is a cache,……

To answer Charles Henrich: they are creating two metrics, one with the total count that occurred during the flush interval (10 seconds by default) and one with the rate per second. Imagine it is a simple request counter, where each time a request comes in, you increment the metric “website.requests”. statsd will create 2 metrics: stats.website.requests will have the average number of requests per second in the flush interval, while stats_counts.website.requests will have the total requests during the interval.
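In other words (a sketch of the stats/stats_counts split, using the 10-second default flush):

```python
FLUSH_INTERVAL = 10  # seconds; statsd's default flush period

def flush_counter(total):
    """The two values emitted per counter each flush: the raw count
    in the interval, and the per-second rate."""
    return {
        "stats_counts": total,            # total events in the interval
        "stats": total / FLUSH_INTERVAL,  # events per second
    }

print(flush_counter(50))  # {'stats_counts': 50, 'stats': 5.0}
```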

I am creating a version of statsd that allows you to pass in a timestamp for the counter or timer. Right now it always uses the current time as the value to pass to graphite. The reason that I want to do this is, I want to use statsd in a post-processor that digests apache access logs, which have the timestamp of the request in the log line. It seems like it will be pretty straightforward to make this change, and it may be useful for other people too.

    Jason: Rather than do that, why not send the data straight to graphite? (See Etsy’s Logster – – for an example.) StatsD is really designed for real-time data gathering. Post-processing work doesn’t need all the cleverness.

You guys mentioned:

One thing we found early on is that if we want to track something that happens really, really frequently, we can start to overwhelm StatsD with UDP packets.

Do you have a feeling for about how many events a second an instance can manage ? Just ball park 100’s/1,000’s/10,000’s ?

    It depends on the machine statsd is running on, and the size of its UDP buffer. I think we’re hitting ours with hundreds of thousands of events per second, although I haven’t actually calculated the number. We keep an eye on the packet error graph so we know if we start swamping it and respond accordingly.

What are the advantages of using StatsD over carbon-cache consuming straight from RabbitMQ? The whole node.js + statsd feels a bit of an overkill to me…

    > What are the advantages of using StatsD over carbon-cache consuming straight from RabbitMQ? The whole node.js + statsd feels a bit of an overkill to me…

    There are two real advantages in my eyes:

    1. UDP – fire and forget
    2. the ability to do timers and counts. With timers you get basic statistics computed for the interval, which is really helpful.

    carbon-cache + AMQP would work in some cases, for sure, but you certainly don’t get (1) with that, and (2) would have to be done elsewhere–though I guess graphite could probably do it–but statsd is cheap as heck to run.

      As Andrew says, the key to using StatsD is the fire-and-forgetness. RabbitMQ (or any other messaging based system) is doing a similar decoupling, but if you don’t want to run a message queuing system then that’s no help. The other key advantage is the simplicity – anywhere in my code I can fire a StatsD call and have it just work – minimal overhead, minimal complexity.

[…] options in PHP. Zend_Log_Writer_Stream lets you send log data to a PHP stream. Alternatively, StatsD and the PHP StatsD library look pretty […]

Thanks for the info Andrew & Ian. I will give statsite a go first since I already have the whole Python environment configured nicely on the collectors, roll out node.js only if absolutely necessary. The less components, the better.

[…] need to track in real-time the rate, response code and latency of every web request. Joining the Church of Graphs has never been […]

[…] another Node.js based automation tool created by Etsy (Update: this is actually Ruby based, but Etsy uses Node.js for other DevOpsy things). And of course Joyent has been using Node.js internally for cloud […]

[…] Measure Anything, Measure Everything (Ian Malpass). We introduce you to StatsD, the open source software we built at Etsy to enable obsessive tracking of application metrics and just about anything else in your environment. The best part is you can download StatsD yourself and try it out. […]

Love the tool – thanks for the awesome contribution to the community.

Looking at your graphs, I see that you have “nice” labels in the legend for each value/series you’re graphing. How did you do that? All our legends look like stats.x.y.z

    Use the alias() function in graphite: target=alias(, “Foobar”). (In the graphite UI, click on “Graph Data”, select the metric you want, then Apply Function > Special > Set Legend Name).


Worked like a charm. Thanks again for the awesome contribution to the community!

I believe your timing example may be incorrect. Since microtime(true) returns “the current time in seconds since the Unix epoch accurate to the nearest microsecond”, the difference should be multiplied by 1000 to get the equivalent value in milliseconds (which is what StatsD seems to expect).

    Well spotted – you’re quite correct. Duly updated, thanks.


I was wondering if you had issues with network congestion eating your UDP packets. The project I work on has globally distributed resources and invariably, messages from the opposite side of the world get dropped before they make it back to our monitoring system, which wreaks all sorts of havoc.

Love the blog, by the way.

Andrew, we had all sorts of weird behaviour with cross-continent messages, and resorted to RabbitMQ in the end. Worth mentioning that if the collectors are down for whatever reason, be it TCP or UDP, you will lose all messages. RabbitMQ handles this scenario nicely. We’ve dropped aggregators such as statsd and gone straight for RabbitMQ + carbon-agent. Works a treat!

    Yep. StatsD isn’t the only solution to the problem, and its use of UDP can cause trouble in very distributed networks. It’s really designed more for cases where servers are close together in network terms, and where occasionally dropping stats isn’t a huge problem.

What do you use as storage schema retentions for the deploy graph (drawnonzero)?

Because I had used 60:2880,300:4032,600:262974 and after 2-3 days the deploy history is gone away

What’s the best way for drawing the deploy times on the graphs? I can’t work out a suitable way of doing it with statsd as increment and timing give me line plots, unlike your graphs where it’s just a binary single-line entry on the graph…

Figured out a solution – recording deploys as an arbitrary timing value when they happen then doing Apply Function -> Special -> Draw non-zero As Infinite for that graph, and I get nice neat lines.

    We actually don’t use StatsD for our deploy metrics – they’re just sent directly to graphite. Glad you worked out a solution though.

Have you considered using Pinba?
If not, why did you choose not to use it?

How do you measure script execution time?
Did you name each counter after the script? e.g. will ./search.php have a counter called “search.php”?

Counters are great, but it seems Pinba is able to provide some further details about each script’s behavior.

I am in process of evaluating statsd vs Pinba and would need help

    I hadn’t seen Pinba. Since we use Graphite for a variety of metrics storage purposes (beyond StatsD metrics) having a separate data store wouldn’t be terribly compelling for us.

    Script execution time is determined separately, using server logs. StatsD timers are used more for timing things within a request rather than the whole request.

[…] as little configuration as possible to publish new metrics. For this reason, we decided on using statsd and graphite. Getting statsd and graphite running was the easy part, but we needed a quick, […]

[…] on their first day: deploy to production. We’ve talked a lot in the past about our deployment, metrics, and testing processes. But how does the development environment facilitate someone coming in on […]

[…] technologies to solve this. From one end of the scale there’s the roll your own a la Etsy and the statsd plus graphite solution, all the way to the SaaS (software as a service) solution […]

[…] portions of our application, and to assess the impact of new code releases. With inspiration from Etsy’s statsd, we added bucket sampling to our original collector (allowing calculation of Nth percentages) and […]

[…] long time ago in web years was written a blog post “Measure Anything, Measure Everything” by the devs at Etsy. It got me thinking about this issue, and it’s been really […]

Great post!!
New to node.js and graphite here.
What do you consider a reasonable load that statsd can handle?
Do you have any kernel tuning tips to handle more udp packets? It looks like I am dropping about 50% when running the following ruby test:

require 'socket'
sock = UDPSocket.new
0.upto(10000) { sock.send("INSERT_METRIC_NAME_HERE:5000|ms", 0, "StatsDServer", 8125) }

[…] any production changes until the next scheduled release in two weeks time, you've got problems. Etsy's StatsD is a great example of how they've created something which allows their developers to start […]

[…] these advices have been followed by other companies like Etsy on their Measure Anything Measure Everything blog post or Shopify on their StatsD blog […]

This will probably get lost in all the comments here, but I’ve put up some examples of how to log MySQL innodb stats in Statsd:

Hopefully someone will find it useful


Hi, StatsD looks interesting. I wonder if it can be used to analyze historical data too, or if it is all about real-time data.

    Ah – there’s an important difference here. StatsD is about collecting real time data and putting it into Graphite. Graphite is about displaying data in graph form. You can absolutely send historical data to Graphite and draw graphs for it. You just use Graphite’s TCP interface to load the data, rather than StatsD.
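A sketch of that TCP interface (Graphite’s plaintext listener takes one “metric value timestamp” line per data point on port 2003, as mentioned elsewhere in these comments; the hostname below is an assumption for illustration):

```python
import socket

def format_points(metric, points):
    """Render (timestamp, value) pairs as Graphite plaintext-protocol lines."""
    return "".join("%s %s %d\n" % (metric, value, int(ts))
                   for ts, value in points)

def send_historical(metric, points, host="graphite.example.com", port=2003):
    """Backfill historical data straight into Graphite, bypassing StatsD."""
    with socket.create_connection((host, port)) as conn:
        conn.sendall(format_points(metric, points).encode("ascii"))

# e.g. format_points("grue.dinners", [(1297728000, 42)])
# produces the line "grue.dinners 42 1297728000"
```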


> The Church of Graphs makes sure we graph UDP packet receipt failures though, which the kernel usefully provides.

How exactly do you do this? Some kind of additional agent to buffer across connectivity issues (on the clients)? If statsd is not receiving your main UDP packets, then why would it receive UDP packets about UDP reception failures?


    We graph packets that we receive but can’t process (due to them exceeding the UDP packet buffer), rather than packets we don’t receive.
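On Linux, those kernel-side UDP counters are exposed in /proc/net/snmp as a pair of “Udp:” lines (one of column headers, one of values); a small parser sketch:

```python
def parse_udp_counters(snmp_text):
    """Turn the two 'Udp:' lines from /proc/net/snmp into a dict of counters."""
    rows = [line.split()[1:] for line in snmp_text.splitlines()
            if line.startswith("Udp:")]
    header, values = rows
    return dict(zip(header, map(int, values)))

# Abbreviated sample of the /proc/net/snmp format
sample = ("Udp: InDatagrams NoPorts InErrors OutDatagrams\n"
          "Udp: 1000 3 7 900\n")
print(parse_udp_counters(sample)["InErrors"])  # 7
```

To graph it, you would read /proc/net/snmp periodically and send the delta of a counter like InErrors on to your metrics system.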

[…] was reading a few interesting posts about graphite. When I tried to install it however, I couldn’t find anything that […]

oh okay, so udp packets which get lost on the network are not monitored. (just making an observation, it’s probably not worth implementing in most scenarios)

    It’s not implemented by design. The whole point of StatsD is that it’s completely asynchronous. If you were to implement some sort of system where the StatsD client and the StatsD server did some sort of handshake/syn-ack type of thing you’d have the client blocking on the server and slowing your front end down. Instead, you send the UDP packet to the StatsD server and completely forget about it. The cost of that is that if a packet goes missing, it goes missing. It’s a tradeoff you make, saying it’s better to lose some stats than to potentially cripple your site if the StatsD server goes down or gets swamped.

[…] fascinated by a presentation by Etsy about their approach of metrics driven engineering – see this blog […]

[…] StatsD tool will see a similarity in the way sFlow monitoring is embedded in scripts, see Measure Anything, Measure Everything. The main difference is that sFlow application measurements contain additional structure that […]

Great article guys, but a quick question on what it runs on. What kind of hardware is statsd/graphite sitting on? Are you using dedicated machines, private clouds, or EC2 instances?

Hi Ian!
Graphite and statsd run on dedicated hardware (in fact all of our production systems are dedicated hardware).
The graphite server is an 8-core Intel Xeon E5530 @ 2.4GHz.
It has 24GB RAM and 16 × 146GB SAS drives in a RAID10 configuration.

The statsd server is similar, but has E5620 CPUs, also @ 2.4GHz. On the statsd server, CPU can be a bottleneck on a single core, whereas on the graphite server the bottleneck is closer to the disks and I/O.
Does that help?

What makes a good engineering culture?…

One of my favorite interview questions for engineering candidates is to tell me about one thing they liked and one thing they disliked about the engineering culture at their previous company. Over the course of a few hundred interviews, this interview …

AVLEENETSY, yup that helps! I think we're going to try it on a small scale and see what we run like. How many requests do you push to statsd? Did you use any calculations to gauge the scale, or just buy bigger when it started slowing/filling up?

Hi Ian, I’ve found that starting with a plain carbon agent gets you much further than you think. Before adding a new component such as statsd, just go with plain TCP: echo “my.metric 1 unix_timestamp” | nc graphite-ip 2003 .

We’re pushing close to 10k metrics per minute via AMQP (carbon agent supports that natively). AMQP gives you redundancy with minimal effort, simply hook up a secondary carbon-agent running on another machine.

We have a fast SSD bare-metal server for our primary graphite store and an EC2 small instance with an EBS store acting as a hot backup. Both our carbon agents have been at 22MB RES for a good few months now.

As you well know, less is always more ; )

    Hi gerhardlazu, that sounds like an interesting setup. TCP has a bit more overhead than UDP, so we prefer UDP for performance reasons, and because if the recipient disappears it doesn’t take down the web site. StatsD also provides aggregation (i.e. we can track the same metric coming at the same time from multiple front ends and add them together rather than have them overwrite, and we can do statistical treatments for timing graphs, etc.). I don’t know offhand how many UDP packets we’re sending to StatsD right now, but we’re pushing 12-30,000 metrics every ten seconds to Graphite from StatsD so I suspect it’s rather more than that. So StatsD provides us with some advantages we like, but I’d certainly applaud any system that provides you with insight into what’s going on in your site.

[…] Measure Anything, Measure Everything – This is one of the initial posts I read that really inspired me to implement this setup. […]


[…] our internal status dashboard aka. Blinkenlights. I wanted to write a post about importance of measuring everything for a long time, but Swizec put it so eloquently that the only thing I have to add to his post are […]

[…] such contribution is statsd from etsy. Statsd is easily combined with graphite to monitor several servers on a cluster. This all works, […]

[…] written in Python and fully distributable across all our servers in various datacenters, EC2, etc. Like Etsy, we added a front-end to our tracking system that is based on Graphite. Ours is called Observatory, […]

Hi there. We're currently setting up our graphite infrastructure and are a bit unsure about some details.
When I first read about statsd I thought it was distributed – that the aggregation happens on the originating host itself, so the number of UDP packets would be lower.
For instance, we have load balancers that send a UDP packet for every incoming connection, which generates a lot of extra traffic since our traffic is at about 1000 requests per minute.
Did you consider that?

    Typically what you do there is sample your stats – you only send 10%, or 1%, or however much you want. StatsD then scales the sample back up to 100%. There’s some loss of detail, but if you get sufficient pings you still get statistically accurate results. Equally, your StatsD server may be able to handle more incoming packets than you expect – just watch for dropped packets and if they go up you’ll know you need to throttle back a bit.
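The sampling idea in that reply can be sketched like this. It is a minimal illustration of the standard StatsD sample-rate convention, not Etsy's client code; the metric name and address are assumptions:

```python
import random
import socket

# Sampled counter: send only a `rate` fraction of events, tagging the
# packet with "|@rate" so StatsD can scale the count back up to 100%.
# Metric name, host, and port are illustrative.
def increment_sampled(metric, rate=0.1, host="127.0.0.1", port=8125):
    if random.random() >= rate:
        return None  # skip this event; StatsD compensates statistically
    payload = f"{metric}:1|c|@{rate}"
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.sendto(payload.encode("ascii"), (host, port))
    sock.close()
    return payload
```

With `rate=0.1`, roughly one packet in ten is actually sent, cutting UDP traffic by ~90% at the cost of some statistical precision.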

[…] of total page load time, it's effectively free and unnoticeable. Do we graph this? Yes we do! Here's a day's worth of […]

I don’t know how to reply in a threaded way to a comment that I left last year and Ian responded to, but this is in response to his response (from Nov 2011). Ian – I want to use statsd because it does the work of collecting metrics into time buckets (e.g. 10 seconds) and doing some statistics on those before sending them to graphite. If I wanted to cut out statsd, I would have to reproduce that in whatever I wrote that talked to graphite. So, I modified statsd instead to be able to take a timestamp.

Thanks to Etsy, and thanks to the open-source community on PHP/Symfony2 I'm using that does everything for me!

[…] and displays custom metrics. To do so it harnesses the power of StatsD, a metrics collector developed by Etsy. Pup comes with StatsD built-in; no need to reinvent the […]

[…] While we do push the technology envelope in many ways, we also like to stick to tried, true and data driven methods to help us keep us […]

[…] StatsD is develop by the start-up Etsy: here is their blog post about this tool. […]

Follow-up report: PHP Unconference Hamburg 2012…

As in 2011, I again travelled to Hamburg for the weekend for this year's PHP Unconference. Here is a short overview of the sessions I attended and of everything around them. For those who want to know more: I will try, at the …


[…] and server. Of course, it can’t hurt to setup your own logging as well, using a tool like statsd. Keeping an eye on your server and application’s performance gives you real feedback as to […]

Just a random note, I thought you'd like to know that I'm using statsd in production to measure tons of different metrics from our site. Thanks to getting data in 10-second intervals, we usually learn about outages even before our monitoring alarms go off!

[…] Measurements in a web app at Etsy. This entry was posted in General by Dan Siemon. Bookmark the permalink. […]

[…] has followed the pattern of yahoo and flickr in developing a philosophy of measurement. Click here to see their […]

[…] our application and system metrics. We also decided to leverage the fantastic work done by Etsy and use statsd,  their simple node.js daemon that collects and aggregates metrics via UDP and […]

Hey Ian,
I found your blog quite useful!! Thank you!!
I want to implement statsd & graphite on my system, just to test it before using it elsewhere. I don't know how to start off 🙁
I have installed statsd from the etsy/statsd github repo. How do I configure it to monitor a metric? How do the statsd client & server work? I want to write a script in PHP and see that my statsd works 🙂


[…] main “must-monitor” ones in Ruby and Java. This article explains how we implemented statsd into our Ruby […]

[…] in simple visual graphs: Cacti for capturing metrics, MONyog for monitoring the database and statsd to put stats logging into code so we can monitor code performance in real time. By saving this […]

[…] statsd (GitHub) — Etsy’s data-gathering daemon, written up in an excellent blog post. […]

[…] “make it ridiculously simple for any engineer to get anything they can count or time into a graph with almost no effort” Measure Anything, Measure Everything […]

[…] Etsy, they graph everything, which they use to great […]

I am using a graphite server to capture my metrics data and render it as graphs. I have 4 application servers in a load-balanced setup. My aim is to capture system data such as CPU usage, memory usage, disk load, etc., for all 4 application servers. I set up a graphite environment on a separate server, and I want to push both system-level data and app-level data for all the application servers to graphite and have it displayed as graphs. I don't know what needs to be done to feed system data to graphite. My thinking was to install statsd on all application servers and feed the system data to graphite.

    System data is often best sent directly to Graphite via TCP since you’re typically not wanting/needing the buffering and aggregation of StatsD, and you’re dealing with much lower volumes of data. (Most of our system metrics are actually stored in Ganglia rather than Graphite.)
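Sending directly to Graphite, as that reply suggests, means speaking Carbon's plaintext protocol: one "metric value timestamp" line per metric over TCP, port 2003 by default. A minimal sketch (the metric name and address are illustrative):

```python
import socket
import time

# Send one metric straight to Graphite's plaintext Carbon listener.
# Format: "metric value unix_timestamp\n", TCP port 2003 by default.
def send_to_graphite(metric, value, host="127.0.0.1", port=2003):
    line = f"{metric} {value} {int(time.time())}\n"
    sock = socket.create_connection((host, port))
    sock.sendall(line.encode("ascii"))
    sock.close()
    return line
```

This is the same thing gerhardlazu's `echo "my.metric 1 unix_timestamp" | nc graphite-ip 2003` one-liner does, just from application code.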

[…] StatsD is a simple NodeJS daemon (and by “simple” I really mean simple — NodeJS makes event-based systems like this ridiculously easy to write) that listens for messages on a UDP port. (See Flickr’s “Counting & Timing” for a previous description and implementation of this idea, and check out the open-sourced code on github to see our version.) It parses the messages, extracts metrics data, and periodically flushes the data to graphite (which I previously wrote about here). Github: The Etsy blog post:… […]

[…] one of the best things I’ve seen come to monitoring is StatsD.  Etsy blogged about the philosophy around it.  The community has gotten behind it and made lots of StatsD […]

Wonderful execution. Allows a lot of room for playing with the code. I was wondering if you could point me the the exact section of the code that interfaces with the Carbon server of the Graphite stack?

Collecting page load times makes sense, but I haven’t seen any hard details on how others have gone about it especially related to including all widgets/components.

Could you elaborate on what Etsy does?

    We actually track page load times based on access log data (which we sent to Graphite using Logster rather than StatsD) because we have extra event information in the access logs which makes the collection cleaner (and it’s less stress on StatsD). StatsD timing is used more for timing parts of page loads for profiling – “how long does getting this list of people take?” etc.

    See also my colleague Mike Brittain’s talk Web Performance Culture and Tools at Etsy.
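The profiling-timer use described in that reply — "how long does getting this list of people take?" — can be sketched with the standard StatsD timing format. A minimal illustration; the metric name and address are assumptions, not Etsy's code:

```python
import socket
import time

# Time a block of work and send the elapsed milliseconds with the "|ms"
# type, so StatsD can compute means/percentiles at flush time.
# Metric name, host, and port are illustrative.
def timed(metric, fn, host="127.0.0.1", port=8125):
    start = time.time()
    result = fn()
    elapsed_ms = int((time.time() - start) * 1000)
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.sendto(f"{metric}:{elapsed_ms}|ms".encode("ascii"), (host, port))
    sock.close()
    return result
```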

[…] both give us hope and motivation to get better. On the collection side, we can look at folks like Etsy. They decided that data collection would be a central part of their DNA and invested engineering […]

[…] main “must-monitor” ones in Ruby and Java. This article explains how we implemented statsd into our Ruby […]

I wrote something almost identical, even down to the syntax of the requests! I also tracked the max, min and average. I'm going to switch to this because the back end has much more support. One thing I did was, at the client side, if I got back an ICMP error when sending UDP, I had the sender "back off" and stop sending except every 1000th stat. (This is an optimization that, frankly, is not necessary, because you will be assuming that the UDP is always sending and have bandwidth for it, so it doesn't matter if it's sending but nobody is listening. I just had a nagging instinct not to send if nobody was listening.) If it succeeds without error, it then resumes sending at full speed.

I also allowed it to batch up and send multiple stat lines in a single UDP packet (You are paying the minimum price of 1500 bytes anyway for any ethernet packet) and also implemented the same “sample” strategy to minimize overall stats being sent.
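The batching idea above — multiple stat lines in a single datagram — matches the newline-separated multi-metric form StatsD understands. A minimal sketch (the size ceiling and metric names are illustrative assumptions):

```python
import socket

# Pack several metric lines into one UDP datagram, newline-separated,
# keeping the payload comfortably under a typical Ethernet MTU.
def send_batch(metrics, host="127.0.0.1", port=8125, max_bytes=1400):
    payload = "\n".join(metrics).encode("ascii")
    if len(payload) > max_bytes:
        raise ValueError("batch too large for one datagram")
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.sendto(payload, (host, port))
    sock.close()
    return payload
```

Batching amortizes the per-packet overhead across many metrics, which is exactly the saving the commenter describes.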

    We very deliberately don’t allow the client to check anything about the results of sending the packet, because that introduces delays in processing requests. The aim of StatsD is to track as much as possible without slowing down requests. We’ll gladly sacrifice some reliability and intelligence in the client (and some data collection) to preserve performance.

UDP is very reliable on a local area network, so I would always put my statsd collector on the same physical network as the collectees.

You then forward the stats via TCP to graphite and that’s very reliable even across the world.

[…] This post is about the server component of statsd. […]

[…] blog post “Measure Anything, Measure Everything” (also quoted in Marks […]

[…] measure everything, with graphite, new relic, collectd, etc. […]


Thanks for StatsD. It is an excellent design for metrics.

A few colleagues are concerned that sending metrics over UDP is non-deterministic.

I can understand the rationale behind using UDP: performance.

Have you observed any significant packet drop on your network using UDP?

I would love to hear your opinion and thoughts on this issue. I believe UDP is the right protocol, and I want to counter the argument for using TCP.
Thanks again for sharing your thoughts.

    UDP gives you performance, as you say (sending a UDP packet is quick), but it also protects the caller from failures of the receiving service (StatsD or whatever else you have listening for your data). If StatsD goes down, the client can continue sending UDP data as if nothing was wrong. You lose all that data of course…. If you send too many UDP packets, you can overwhelm the receiving server and it’ll start dropping packets (again, losing data) but again the client doesn’t care. If you overwhelm the server with TCP traffic, the clients get slowed down too.

    If your colleagues require absolute certainty in their stats, then StatsD isn’t right for them, but TCP connections from the client are probably not right for them either! You’d be better off logging and then consuming the logs and sending them to your metrics store through some other means, or using some other technique that insulates your client from the vagaries of your metrics processing and storage systems.

    We do observe packet drop using UDP, but typically because we’re sending too much data. By sending only sampled data for high-volume metrics (e.g. sending only 10% of pings) we can reduce the number of packets being sent (again, sacrificing absolute accuracy) and avoid overloading the server. We keep the packet drop graph on our main deploy dashboard so we can see immediately if a deploy has accidentally started causing too many packets to be sent so we can correct it quickly.

[…] daemon mode is just a resource hog when you have a server with ~300 vhosts. From now on I follow Etsy advise which is, stay with simple and cron based log processing […]


Great post 🙂

I'm having an issue while running stress tests against the original StatsD, which runs on Node.js.
The CPU hits 100% because of Node.js, and the Carbon agent handles only a small number of metrics instead of thousands.
If I use a different implementation of StatsD (one written in C) with the same test, I get significantly better results: instead of 300 metrics per 10 seconds, carbon handles 17,000 metrics per second, and the CPU stays below 40%.

Is there some special configuration I need to run the original StatsD?
Please note that I'm running my StatsD services on VirtualBox – does that matter?
Is there any known issue with Node.js on virtual machines?

Note that I'm using the latest StatsD version…

Thanks, I answered him there.


I have a question regarding the Graphite retention settings; I hope you can help me.
I wrote it here:

Any idea?

    Hi Tomer – I replied on Stack Overflow, although I see you’ve solved your problem yourself 🙂

Something that maybe few people know about: Failure Modes, Effects and Criticality Analysis (FMECA).
Slightly related to "measure everything". The idea is to analyze every chunk ("component", "unit", slice it as you will) to get a sense of how important it is. Sort of like triage before the failure.

Mean Time Between Failure and Mean Time To Repair are what we used (hardware; avionics), stirred in with criticality, i.e. the consequences of error/failure.
This ends up giving you a really good idea of what you need to focus on, i.e. the worst case would be something that's likely to fail soon, a pain to replace, and devastating in effect.

–@ITGeek / @bentrem

For PHP and Composer there's one that is quite good.

I have a quick question. I have a load balancer with 16 nodes behind it running Apache and modphp. Do I install statsD on all 16 nodes or do I get all 16 nodes to point to one statsd server that then sends the data to graphite. i.e is it 16 nodes sending data to graphite or one statsd daemon.

    Ah, good question. Typically you have one StatsD daemon running and all your nodes send data to it. StatsD then does the aggregation of the data and sends it on to Graphite. I have heard of some places running a StatsD daemon on every node and sending metrics to Graphite with the node name as part of the metric name (e.g.,, etc.). This means you can typically send more data to StatsD without risking dropping UDP packets, and you have specifics on the behaviour of individual nodes, but it means that you can only aggregate on the Graphite side using Graphite’s functions which may be limiting and certainly adds effort. (Rather than sending more data, we usually use sampling for high-frequency events so that we don’t send too much data.)
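The per-node variant described in that reply embeds the host name in the metric itself so Graphite keeps one series per node. A hedged sketch of that naming convention (the "servers." prefix is an illustrative convention, not a StatsD feature):

```python
import socket

# Build a per-node metric name by embedding the short hostname,
# e.g. "servers.web01.grue.dinners". Aggregation across nodes then
# happens on the Graphite side using Graphite's own functions.
def node_metric(name):
    host = socket.gethostname().split(".")[0]
    return f"servers.{host}.{name}"
```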

I am just getting started with statsd. I am missing something important somewhere, and maybe I simply haven't read enough yet. The statsd agent or client sends messages over UDP to port 8125. Carbon accepts input over TCP on port 2003 or 2004. There must be something in the middle, which I haven't come to yet, which translates from UDP to TCP. What is that?

Thank you.


    So the client (code running in your webapp or whatever) sends data over UDP to StatsD. StatsD aggregates the data and sends it over TCP to Graphite. So StatsD itself – the daemon process that is – is the thing in the middle that you’re looking for.

      Thank you, that’s what I was looking for. I installed statsdaemon.go and did a little poking with wireshark, and it works exactly as you say it does.

      Thank you


Mr. Malpass,

I am a teacher working with a group of high school students, and we are trying to use the data from a heart-rate monitor within a simple computer game the kids are coding. I was wondering if you thought that your application would be useful for what we are trying to do. Thanks.

    Mike – no, it’s unlikely to be appropriate. You’d want the data to go directly to the game, I suspect. StatsD is more for aggregating lots and lots of data from a variety of sources, much more suited to large scale web applications than individual hardware/software interfaces.

I'm interested in logging extended stats in my wife's Etsy store. I don't think there's a way to do that, and I'm surprised there isn't, considering this post on stats and statsd. Am I missing something? I'd love to know, for example: 1) Are people scrolling through the images for a given product? 2) Are people scrolling down to read about the product, or merely looking at the pictures? 3) Did the user click the "shipping policies" tab? If I controlled my Etsy page somehow, I could easily dump in a script and start tracking those things. Is that possible, or can I use statsd from my store somehow?

    StatsD is used by Etsy for our internal metrics – nothing gets exposed to sellers, and you can’t make StatsD calls yourself. Right now, the only supported analytics we have are Shop Stats and Google Analytics:

      Thanks for mentioning Google Analytics. If I’m understanding correctly, there’s nothing extra I can gain there than you already present in my Etsy stats, right? What I’d really like to do is use Google Analytics events, e.g. when someone clicks the “right arrow” on the product image carousel, or clicks “Shipping and Policies”.

      No, there’s only limited functionality with Google Analytics – purely page view related Analytics rather than complex interaction tracking.

I'm interested in logging extended stats in my store. I don't know how to do that.
What am I missing? I'd like to know, for example: 1) Are people scrolling through the images for a given product? 2) Are people scrolling down to read about the product, or merely looking at the pictures? 3) Did the user click the "shipping policies" tab? If I controlled my Etsy page somehow, I could easily dump in a script and start tracking those things.

    Stefan – see my reply to Tyler above. Google Analytics is the only third party analytics option available.

Thanks, this looks promising, but I had a really hard time installing all the dependencies.
I finally found a github project that sets up the whole thing, so I can test using it!

[…] answer most likely lies in frameworks like Samza, Storm, and Spark Streaming. Similarly, tools like StatsD solve the problem of collecting real-time analytics. However, this discussion is meant to explore […]

Hi, nice work.
I’m currently looking at ways to efficiently implement profiling via graphite into a bunch of node apps. StatsD looks like the most well supported NodeJS project I have come across so far that performs aggregation. However, I can see that there will be an unnecessary performance overhead to communicating the data via sockets, even UDP, rather than performing the aggregation in memory within the app. Is there a way to ‘require’ some parts of statsd to incorporate them into the app for faster in-memory performance? If not, does anyone know of any reasonably well supported projects that could meet my needs? Thanks a million,

[…] collection & visualization. Works in concert with several tools, including Sensu. Read about Etsy’s experience using Graphite. Graphite runs on Django, a well regarded Python app server similar to Rails in terms of feature […]

[…] goals, use software to track progress over time. Etsy’s engineers built a tool they call StatsD. The software helps Etsy monitor everything from login failures to coffee availability. The data is […]