mctop – a tool for analyzing memcache get traffic
Here at Etsy we (ab)use our memcache infrastructure pretty heavily as a caching layer between our applications and our database tiers. We functionally partition our memcache instances into small pools, and overall it works fabulously well. We have, however, occasionally suffered from what we call “hot keys”.
What is a “Hot” key?
A “hot key” is a single key hashed to an individual memcache instance with a very high get rate, often being called once for every page view. For the most part network bandwidth across all memcache instances within a pool is relatively balanced. These hot keys, however, contribute a significant additional amount of egress network traffic and have the potential to saturate the available network bandwidth of the interface.
The graph above is an example of a recent hot key issue. The graph y-axis represents bytes per second inbound and outbound of memcached01’s network interface.
As we hit peak traffic, memcached01’s network interface was completely saturated at approximately 960Mbps (it’s a 1Gbps NIC). This has a particularly nasty impact on get latency:
As we began to push past 800Mbps outbound, 90th percentile get request latency jumped from 5ms to 35ms. Once the NIC was saturated latency spiked to over 200ms.
Diagnosing the Issue
This wasn’t the first time a hot key had been responsible for unusually high network bandwidth utilization, so this was our first line of investigation. memcached01’s bandwidth utilization was significantly higher than that of the other servers in the pool.
Diagnosing which key was causing problems was a slow process. Our troubleshooting took the following steps:
- Take a brief 60 second packet capture of the egress network traffic from memcached01
- Using tshark (wireshark’s awesome command line cousin), extract the key and response size from the memcache VALUE responses in the captured packet data.
- Post-process the tshark output to aggregate counts, estimate requests per second and calculate the estimated bandwidth per key.
- Sort that list by bandwidth then further investigate that key.
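The post-processing and sorting steps can be sketched in a few lines of Python. This is a minimal illustration, not the script we actually used: it assumes the tshark output has already been reduced to one hypothetical "key size" pair per line, and the function name and field names are made up for the example. Bandwidth is estimated from value sizes only, so protocol and packet overhead are ignored.

```python
from collections import defaultdict


def aggregate_key_stats(lines, capture_seconds):
    """Aggregate 'key size' lines into per-key stats, sorted by bandwidth.

    `lines` is an iterable of strings such as "user:42 1500" (an assumed
    intermediate format), one per VALUE response seen in the capture.
    """
    calls = defaultdict(int)
    total_bytes = defaultdict(int)

    for line in lines:
        parts = line.split()
        if len(parts) != 2 or not parts[1].isdigit():
            continue  # skip anything that is not "key size"
        key, size = parts[0], int(parts[1])
        calls[key] += 1
        total_bytes[key] += size

    stats = []
    for key, count in calls.items():
        stats.append({
            "key": key,
            "calls": count,
            # average rate over the whole capture window
            "req_per_sec": count / capture_seconds,
            # value bytes only, converted to kilobits per second
            "kbps": total_bytes[key] * 8 / 1000 / capture_seconds,
        })

    # Highest estimated bandwidth first: these are the keys to investigate.
    stats.sort(key=lambda s: s["kbps"], reverse=True)
    return stats
```

Calling `aggregate_key_stats(open("keys.txt"), 60)` on a 60 second capture (with `keys.txt` being whatever file holds the reduced output) returns the heaviest keys at the top of the list.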
Once a potentially offending key was found, we’d repeat this process from a couple of client machines to validate it as the offending key. Once the key was confirmed, engineers would look at alternate approaches to handling the data contained in the key.
In this particular case, we were able to disable some backend code that was utilizing that key with no user facing impact and relieve the network pressure.
Overall this diagnostic process is quite manual and time intensive. 60 seconds of packet capture at 900Mbps generates close to 6GB of packet data for tshark to process, and if this process needs to be repeated on multiple machines the pain is also multiplied.
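As a back-of-the-envelope check on that capture size, the arithmetic can be written out as below. The helper name is made up for the example, and it assumes the link is sustained at the stated rate for the whole window and ignores capture file overhead, so treat the result as a rough ceiling rather than a measurement.

```python
def capture_size_bytes(rate_mbps, seconds):
    """Rough payload volume for a capture at a sustained line rate."""
    bits = rate_mbps * 1_000_000 * seconds  # megabits/s -> total bits
    return bits / 8                         # bits -> bytes


size = capture_size_bytes(900, 60)
print(f"{size / 1e9:.2f} GB ({size / 2**30:.2f} GiB)")
```

At 900Mbps for 60 seconds this works out to roughly 6.75 GB in decimal units, or a little under 6.3 GiB.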
Given this wasn’t a new issue for us, I decided to have a crack at building a small tool that would let us interactively inspect, in real time, the request rate and estimated bandwidth use by key. The end result is the tool “mctop”, which we’re open sourcing today.
Inspired by “top”, mctop passively sniffs the network traffic passing in and out of a server’s network interface and tracks the responses to memcache get commands. The output is presented on the terminal and allows sorting by total calls, requests/sec and bandwidth. This gives us an instantaneous view of our memcache get traffic.
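To give a feel for the idea, here is a small Python sketch of the tracking step. It is not mctop's actual implementation: the `KeyTracker` class and its sort names are hypothetical, and it works on payload bytes you hand it rather than sniffing an interface. The only thing it relies on is the memcached text protocol's response header, `VALUE <key> <flags> <bytes> [<cas unique>]\r\n`. It also assumes each header arrives whole within a single payload and that value data never happens to contain something that looks like a header.

```python
import re
import time

# Header of a get/gets response in the memcached text protocol.
VALUE_HEADER = re.compile(rb"VALUE (\S+) \d+ (\d+)(?: \d+)?\r\n")


class KeyTracker:
    """Track per-key call counts and value sizes from response payloads."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._started = clock()
        self._calls = {}   # key -> number of VALUE responses seen
        self._sizes = {}   # key -> most recently seen value size in bytes

    def feed(self, payload):
        """Record every VALUE header found in one chunk of payload bytes."""
        for match in VALUE_HEADER.finditer(payload):
            key = match.group(1).decode("utf-8", "replace")
            self._calls[key] = self._calls.get(key, 0) + 1
            self._sizes[key] = int(match.group(2))

    def rows(self, sort_by="bandwidth"):
        """Return per-key rows sorted by 'calls', 'reqsec' or 'bandwidth'."""
        field = {"calls": "calls", "reqsec": "req_per_sec",
                 "bandwidth": "kbps"}[sort_by]
        elapsed = max(self._clock() - self._started, 1e-9)
        rows = []
        for key, calls in self._calls.items():
            rps = calls / elapsed
            rows.append({
                "key": key,
                "calls": calls,
                "req_per_sec": rps,
                # estimated bandwidth: request rate times value size
                "kbps": rps * self._sizes[key] * 8 / 1000,
            })
        return sorted(rows, key=lambda r: r[field], reverse=True)
```

In a real tool the payloads would come from a packet capture library and `rows()` would be redrawn on the terminal every second or so; here you would simply call `tracker.feed(data)` for each chunk and print `tracker.rows("bandwidth")`.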
Patches welcome, we hope you find it useful!