Four Months of statsd-jvm-profiler: A Retrospective

Posted by on May 12, 2015

It has been about four months since the initial open source release of statsd-jvm-profiler.  There has been a lot of development on it in that time, including the addition of several major new features.  Rather than just announce the exciting new things, I want to take this opportunity to reflect on what has come of the project since open-sourcing it and how these new features came to be.

External Adoption

It has been very exciting to see statsd-jvm-profiler being adopted outside of Etsy, and we’ve learned a lot from talking to these new users.  It was initially built for Scalding, and many of the people who’ve tried it out have been profiling Scalding jobs.  However, I have spoken to people who are using it to profile jobs written in other MapReduce APIs, such as Scrunch, as well as pure MapReduce jobs.  Moreover, others have used it with tools in the broader Hadoop ecosystem, such as Spark or Storm.  Most interestingly, however, there have been a few people using statsd-jvm-profiler outside of Hadoop entirely, on enterprise Java applications.  There was never anything Hadoop-specific about the profiling functionality, but it was very gratifying to see that they were able to apply it unchanged to a domain so far from the initial use case.

Contributions

One of the major benefits of open-sourcing a project is the ability to accept contributions from the community.  This has definitely been helpful for statsd-jvm-profiler.  Several pull requests have been accepted, both fixing bugs and adding new features.  There are also some active forks whose authors will hopefully decide to contribute back.  The community of contributors is small, but the contributions have been valuable.  Questions about how to contribute were common, however, so the project now has contribution guidelines.

An unexpected aspect of community involvement in the project has been the number of questions and suggestions that have come via email instead of through Github.  In hindsight, setting up a mailing list for the project would have been a good idea; at the time of the initial release I thought the utility of a mailing list was low.  I have since created one, but it would have been useful to have those original emails publicly available.  Nevertheless, the suggestions have been very helpful.  It would be amazing if everyone who suggested improvements also sent pull requests, but I recognize that not everyone is willing or able to do so.  Even so, I am grateful that people have been willing to contribute to the project in this way.

Internal Use

The use of statsd-jvm-profiler within Etsy has been less successful than its use externally.  We use Graphite as the backend for StatsD, and as we started to use the profiler more, we began to have problems with Graphite.  Someone would start to profile a job, creating a fairly large number of new metrics, and this would sometimes cause Graphite to lock up and become unresponsive.  We put in some workarounds, including rate limiting metric creation and configurable filtering of the metrics produced by CPU profiling, but these were ultimately only beneficial for smaller jobs.  Graphite is an important part of our infrastructure beyond statsd-jvm-profiler, so this was a bad situation.  Being able to profile and improve the performance of our Hadoop jobs is important, but not breaking critical pieces of infrastructure is more important.  The issues with Graphite meant that use of the profiler was heavily restricted.  This was the exact opposite of the goal of easy, accessible profiling that motivated the creation of statsd-jvm-profiler.  Finally, after breaking Graphite yet again, the profiler was disabled entirely.  The project admittedly languished for about a month; since we weren't using it internally, there was less incentive to continue improving it.

New Features

statsd-jvm-profiler was in an interesting state at this point.  There were still external users and internal interest, but it was too risky for us to actually use it.  Rather than abandon the project, I set out to bring it to a better state, one where we could use it without risk to other parts of our production infrastructure.  The contributions from the community were incredibly helpful here.  Ultimately the new features were all developed internally, but the suggestions and feedback from the community provided lots of ideas for changes that would both meet our internal needs and provide value externally.  As a result we're able to use it internally again without DDoSing our Graphite infrastructure.

Multiple Metrics Backends

The idea of supporting multiple backends for metrics collection instead of just StatsD was considered during initial development, but was discarded in order to keep the profiling data flowing through StatsD and Graphite.  We use these extensively at Etsy, and the theory was that keeping the profiling data in a familiar tool would make it more accessible.  In practice, however, the sheer volume of data produced by all the jobs we wanted to profile tended to overwhelm our production infrastructure.

Moreover, supporting different backends for metrics collection was the most commonly requested feature from the community, and there were a lot of different suggestions for which to support.  StatsD is still the default backend, but it is now configurable through the reporter argument to the profiler.  We are trying out InfluxDB as the first new backend, and there are a couple of reasons why it was selected.  First, statsd-jvm-profiler produces very bursty metrics in a very deep hierarchy.  This is fairly different from the normal use case for Graphite, and we came to realize that Graphite was not the right tool for the job.  InfluxDB was very easy to set up and had better support for such metrics without needing any configuration.  Second, InfluxDB has a much richer, SQL-like query language.  With Graphite we had been dumping all of the metrics to a file and processing that, but InfluxDB's query language allows for more complex visualization and analysis of the profiling data without the intermediate step.  So far InfluxDB has been working well.  And since it is independent from the rest of our production infrastructure, only statsd-jvm-profiler will be affected if problems do arise.
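As a rough sketch of what switching backends looks like, the profiler is attached as a Java agent and the backend is chosen through the reporter argument.  The hostnames below are placeholders, and the InfluxDB-specific option names (database, username, password) are illustrative; check the project README for the exact options your version supports:

```shell
# Default StatsD backend: point the agent at your StatsD server
java -javaagent:statsd-jvm-profiler.jar=server=statsd.example.com,port=8125 \
    -jar my-application.jar

# Switch to the InfluxDB backend via the reporter argument
# (database/username/password option names are illustrative)
java -javaagent:statsd-jvm-profiler.jar=server=influxdb.example.com,port=8086,reporter=InfluxDBReporter,database=profiler,username=profiler,password=profiler \
    -jar my-application.jar
```

The nice property of this design is that the profiled application doesn't change at all; only the agent's arguments do.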

Furthermore, the refactoring done to support InfluxDB in addition to StatsD has created a framework for supporting any number of backends.  This provides a great avenue for community contributions to support some other metric collection service.

New Dashboard

Better tooling for visualizing the data produced by profiling was another common feature request.  The initial release included a script for producing flame graphs, but it was somewhat hard to use.  We had also been relying on our internal dashboard framework to pull data from Graphite, which would no longer be possible with the move to InfluxDB.  As such we needed a better visualization tool internally as well.

To that end statsd-jvm-profiler now includes a simple dashboard.  It is a Node.js application and pulls data from InfluxDB, leveraging its powerful query language.  It expects the metric prefix configured for the profiler to follow a certain pattern, but then you can select a particular process for which to view the profiling data:

Selecting a job from the statsd-jvm-profiler dashboard

From there it will display memory usage over the course of profiling:

Memory metrics

And it will also display the count of garbage collections and the total time spent in GC:

GC metrics

It can also produce an interactive flame graph:

Example flame graph

Embedded HTTP Server

Finally, the ability to disable CPU profiling after execution had started was the other most common feature request.  There was an option to disable it from the start, but not after the profiler was already running.  Both this and the ability to inspect some of the profiler's state would have been useful for us while debugging the issues that arose with Graphite.  To support both of these features, statsd-jvm-profiler now has an embedded HTTP server.  By default it listens on port 5005 on the machine running the profiled application, but the port can be configured with the httpPort option to the profiler.  At present the server exposes some simple information about the profiler's state and allows disabling collection of CPU or memory metrics.  Adding additional features here is another great place for community contributions.
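To give a feel for how this is used, interacting with the embedded server is just plain HTTP.  The endpoint paths below are illustrative assumptions rather than the documented API, and the hostname is a placeholder; see the project README for the actual routes:

```shell
# Inspect the running profiler's state on the default port (5005)
# (endpoint path is an assumption for illustration)
curl http://profiled-host.example.com:5005/profilers

# Turn off CPU profiling on a running process without restarting it
# (endpoint path is an assumption for illustration)
curl http://profiled-host.example.com:5005/disable/cpu
```

Because it's just HTTP, this works from anywhere that can reach the host, which is handy when the profiled process is a Hadoop task on a remote machine in the cluster.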

Conclusions

Unequivocally, statsd-jvm-profiler is better for having been open-sourced.  There has been a lot of activity on the project in the months since its initial public release.  It has seen adoption in a variety of use cases, including some quite different from those for which it was initially designed.  There has been a small but helpful community of contributors, both through code and through feedback and suggestions for the project.  When we hit issues using the project internally, the feedback from the community aligned very well with what we needed to get the project back on track and gave us momentum to keep going.

Going forward, continued contributions from the community are definitely important to the success of the project.  There is now a mailing list and contribution guidelines, as well as some suggestions for how to contribute.  If you'd like to get involved or just try out statsd-jvm-profiler, it is available on Github!



11 Comments

Excellent write up. Do you know if there are plans to release your dashboard that pulls data from Graphite?

    At present there are not specific plans as it’s pretty tied up in our internal dashboards framework. However, I can definitely take a look at putting something together.

    I did file an issue on the project for this: https://github.com/etsy/statsd-jvm-profiler/issues/10. In keeping with the theme of the post, if you’re interested in helping out we’d be very happy to accept your contributions :).

      That would be great! I’ll follow along and will be sure to contribute if I come up with anything.

Does the dashboard refresh periodically? I looked through some of the code but couldn’t see anything. That’s one feature I do like about Graphite, even though the graphs themselves are lacking.

Hi Andrew,
Paul from InfluxDB here. Really interesting stuff. For the InfluxDB integration, are you using the 0.8 version? Have you had a chance yet to look at the data model for 0.9.0? Any feedback you can provide on that would be great.

We want to make sure the new tagging model works well for use cases like this.

Thanks,
Paul

    Hey Paul! I am using version 0.8.8 right now. I haven’t yet tried 0.9.0, but I’m really excited about the tagging model. I think that will offer a lot more possibilities with querying the data produced by the profiler. It’s already following the recommended schema design from http://influxdb.com/docs/v0.8/advanced_topics/schema_design.html#migration-to-0.9.0, so I think it will be in a good place to take advantage of tags.

      awesome, great to hear. We’re working hard to get that new release out soon. Definitely looking forward to seeing what people do with tags!

Have you thought about flowing the data into the ELK stack (ElasticSearch, Logstash, Kibana) via graphite to solve the visualisation challenge in Kibana?

    ELK was something I had considered, but InfluxDB was working so well for this use case that I didn’t continue to pursue other options. However, the framework to support additional backends like ELK is in place so it’s definitely something that could be revisited.

Hi Andrew, I met a problem. I use Graphite as the backend, but I can't get any profiles when I run my Hadoop job! Please help me, I don't know why. Or you can tell me where the profiles are.