Building a Better Mobile Crash Analytics Platform

Posted by on November 4, 2016

‘Crashcan’ (think trashcan, but for crashes) is Etsy’s internal application for mobile crash analytics. We use an external partner to collect and store crash and exception data from our mobile apps, which we then ingest via their API. Crashcan is a single-page web app that refines and better exposes the crash data we receive from our external partner.

Crashcan Preview

Crashcan gives us extra analysis of our crashes on top of what our partner offers. We can make less cautious assumptions about our stack traces internally, which allows us to group more of them together. It connects crashes to our user data and internal tooling. It allows us to search for crashes in a range of versions. It’s provided a good balance between building our own solution from scratch and completely relying on an external partner. We get the ability to customize our mobile crash reporting without having to maintain an entire crash reporting infrastructure.

Error Reporting – The Web vs. Mobile

Unfortunately, collecting mobile crashes and exceptions is (of necessity) quite different from most error reporting on the web, in several ways:

With these differences, it’s been important to approach the analysis of crashes and exceptions differently than we do on the web.

A Crash Analytics Wish List

Many of the issues mentioned above were handled by our external partner. But while this external partner provides a good overview of our mobile crashes, there were still some bits of functionality that would make analyzing crashes significantly easier for us. Some of the functionality we really wanted was:

It seemed like it’d be fairly straightforward to build our own application, using the data we were already collecting, to implement this functionality. We wanted to augment the data we receive from our external partner with data and couple it with our own internal tooling. We also wanted to provide any interested Etsy employees with a simple way to view the overall health of our apps. So that’s exactly what we chose to do.

Building Crashcan

Crashcan’s structure was a pretty wide-open space. All it really needed to do was provide crash ingestion from an API, process the crash a bit, and expose it via a simple user interface (it sounds a lot like many technologies, actually). So while the options for technologies and methodologies were open, we ultimately decided to keep it simple.

By using PHP, Etsy’s primary development language, we keep the barrier to entry for developers at Etsy low . We used as much modern PHP as possible, with Composer handling our dependency management. MySQL handles the data storage, with Doctrine ORM providing the interface to the database.

Data Ingestion

Ingesting the data was the first hurdle. Before handling anything else, we needed to make sure that we could actually (1) get the data we wanted and (2) keep up with the number of crashes that we wanted to ingest, without breaking down our system. After all, if you can’t get the data you want and you can’t do so reliably, there’s really no point.

After analyzing the API endpoints we had at our fingertips (yay, documentation!), we determined that we could get all the data we wanted. The architecture needed to allow us to:

In the end, we developed a schema that allowed us to fulfill all those needs while remaining quick in response to queries:

Crashcan Schema

To actually ingest the data from our external provider, we run a cron job every minute that checks for new crashes. This cron runs a simple loop – it loads new crashes from a paginated endpoint, looping through each page and each crash in turn. Each crash is added to a queue so that we can update it asynchronously.

Cron Flow

We run a series of workers that run continuously, monitoring the queue for incoming crashes. As these workers run, they each pick a crash off the queue and processes it. This includes several steps, first checking whether we have the crash already, then updating it if we have it or creating a new crash if we don’t. We also go through each crash’s occurrences to make sure that we’re recording each one and tying it to an existing user if one exists. The flowchart below demonstrates how these workers process crashes.

Workers Flowchart

Monitoring & Alerting

After building Crashcan’s initial design and getting crashes ingesting correctly, we quickly realized that we needed utilities to monitor the data ingestion and to alert us when something went wrong. Initially, we had to manually compare crash counts in Crashcan with those that our external provider offered in their user interface. Obviously, this was neither convenient nor sustainable, so we began integrating StatsD and Nagios. To check that we were still ingesting all our crashes, we also wrote a script to perform a sort of ‘spot-check’ of our data against our external provider’s – which fails if our data differs too much from theirs.

Crashcan Monitoring Graphs

We created a simple dashboard, linked to StatsD, that allows us to see at-a-glance if the ingestion is going well – or if we’re encountering errors, slowness, or hitting our API limit. While we plan to improve our alerting infrastructure over time, this has been serving us well for now – though before we got our monitoring in a good state, we hit some big snags that kept us from being able to use Crashcan for weeks at a time. There’s an important lesson there: plan for monitoring and alerting from the beginning.

Application Structure

When deciding on Crashcan’s structure, we decided to focus first on building a stable, straightforward API. This would enable us to expose our crash data to both users and other applications – with one interface for accessing the data. This meant that it was simple for us to build Crashcan’s user interface as a Single Page Application. Very few of the disadvantages of single page applications applied in Crashcan’s case, since our users would only be other Etsy employees. Building a robust API also enabled us to share the data easily with other applications inside Etsy – most especially with our internal app release platform.

App Overview View

When an Etsy engineer accesses Crashcan, we aim to present them with the most broadly applicable information first – the overall health of an app. This is presented through an overview of frequent crashes, common crash categories, and new crashes, along with a graph showing all crashes for the app split out by version. This makes it much easier to spot big new crashes or problematic releases. The engineer then has the option to narrow the scope of their search and view the details of specific crashes.

Crash Preview

Ongoing Work

While we’ve finished Crashcan v1 with much of the core functionality and gotten it in a stable enough state that we can depend on its data, there’s still quite a bit that we’d like to improve. For example, we haven’t even begun to implement a couple of the items we mentioned in our wish list, like custom alerting. Second, the user interface could do with some bugfixes and refinement. Right now, it’s in a mostly-usable state that other Etsy engineers can at least tolerate, but it’s not stable or refined enough that we’d be comfortable releasing it to a wider audience.

Additionally, our crash deduplication is still rudimentary. It only performs simple (and expensive) string comparisons to find similar crashes. We’d like to implement more advanced and more efficient crash deduplication using crash signature generation. This would give us a much more reliable way of determining when crashes are related, therefore providing a more accurate picture of how common each crash is.

Lessons Learned

Most of the pain points in Crashcan’s development weren’t new or especially unexpected, but they serve as a valuable reminder of some important considerations when building new applications.

Posted by on November 4, 2016
Category: mobile Tags: ,

Related Posts

No responses yet. You could be the first!