Google Safe Browsing Without the Browser

Posted by on March 4, 2012

At Etsy, we are constantly evaluating the security and safety of our members as they use the site.  One way we do this is by analyzing user-generated content (UGC) for possible problems.  As part of the process we integrate results from the Google Safe Browsing (GSB) service. Typically, GSB is a client-side technology used by web browsers to protect the end user from visiting dangerous websites that might serve malware or be part of a phishing scam.

The Security and Defensive Systems group here at Etsy has flipped this model around.  Rather than warn the user when a malicious link is followed, we block the link (or the whole page) from displaying in the first place.
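As a rough illustration of that flipped model, here is a minimal sketch in PHP. The function name `blockMaliciousLinks`, the URL pattern, and the `$checker` stand-in are all hypothetical and not part of any Etsy code; a real checker would wrap a GSB lookup.

```php
<?php
// Hypothetical helper, not part of gsb4ugc: replaces every URL that the
// supplied checker flags with a placeholder before the content is rendered.
function blockMaliciousLinks(string $text, callable $isMalicious): string
{
    // Deliberately simple URL pattern; real content needs stricter parsing.
    $pattern = '~https?://[^\s<>"\']+~i';
    return preg_replace_callback(
        $pattern,
        function (array $m) use ($isMalicious) {
            return $isMalicious($m[0]) ? '[link removed]' : $m[0];
        },
        $text
    );
}

// Stand-in checker (an assumption for this example only).
$checker = function (string $url): bool {
    return strpos($url, 'malware.testing.google.test') !== false;
};

echo blockMaliciousLinks(
    'See http://malware.testing.google.test/testing/malware/ and http://example.com/shop',
    $checker
), "\n";
```

Passing the checker in as a callable keeps the filtering logic independent of how URLs are actually classified.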

There are a few ways to use the Google Safe Browsing service. For lower volume queries, there is a very simple REST API.  For high volume, high performance systems, the GSB V2 protocol is more appropriate as it mirrors the entire GSB database locally. It’s designed to scale to an extremely large number of clients while minimizing network traffic.  To do so, it uses a complicated protocol involving multiple blacklists and whitelists sent as a series of distributed binary diffs.

While many implementations of the GSB protocols are available, for a variety of reasons they were not appropriate for use in Etsy’s operational environment (e.g. use of autoincrement ids, designed to run under a web server, etc), and so we created our own.  We have open sourced our version and made it available in our gsb4ugc git repository. It’s in PHP, but it should be straightforward to port to other languages, as it’s really more of a toolkit than a standalone product.

To use it, you assemble a few components into your own API. First, set up some boilerplate shared by the GSB updater and the client:

// Set up a db connection.
$dbh = new PDO('mysql:host=127.0.0.1;dbname=gsb', 'user', 'password');
$dbh->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

// Create storage; works with mysql, sqlite.
// No auto-increment IDs, so it's safe with master-master replication.
// Etsy subclasses this and adds StatsD calls. http://etsy.me/dQwVXi
$storage = new GSB_StoreDB($dbh);

// Create network access. Pass in your GSB API key. Uses PHP curl.
$network = new GSB_Request($api);

// Logger. Subclass to use your logging infrastructure (or not).
$logger = new GSB_Logger(5);

Next, set up a cron job that runs every 30 minutes to start mirroring the GSB database:

$updater = new GSB_Updater($storage, $network, $logger);
$updater->downloadData($gsblists, FALSE);

It takes about 24 hours to sync fully. After that, you can start checking URLs:

$client = new GSB_Client($storage, $network, $logger);
$url = "http://malware.testing.google.test/testing/malware/";
print_r($client->doLookup($url));

This should return something similar to:

[list_id] => 1
[add_chunk_num] => 70219
[host_key] => b2ae8c6f
[prefix] => 51864045
[match] => malware.testing.google.test/testing/malware/
[hash] => 518640453f8b2a5f0d43bc2251....
[host] => testing.google.test/
[url] => http://malware.testing.google.test/testing/malware/
[listname] => goog-malware-shavar

More details are in the bin/samples directory of our repository.
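The `prefix` and `hash` fields in the sample output hint at how the V2 protocol keeps the local mirror small: it works with SHA-256 hashes of URL expressions and stores only short prefixes of them locally. The sketch below shows that relationship under simplifying assumptions. The function `gsbHashParts` is hypothetical and not part of gsb4ugc, and it skips the URL canonicalization the protocol requires, so its output is not guaranteed to match the values shown above.

```php
<?php
// Sketch only: shows how a 4-byte prefix relates to a full SHA-256 hash.
// Canonicalization (lowercasing, unescaping, path cleanup) is omitted.
function gsbHashParts(string $expression): array
{
    $hash = hash('sha256', $expression);   // 64 hex characters
    return array(
        'hash'   => $hash,
        'prefix' => substr($hash, 0, 8),   // first 4 bytes = 8 hex characters
    );
}

// Host + path expression, taken from the "match" field in the sample output.
$parts = gsbHashParts('malware.testing.google.test/testing/malware/');
print_r($parts);
```

Because only prefixes are stored locally, a prefix hit is not conclusive by itself; confirming it against the full hash is the kind of step that can require a network call.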

We are currently scanning a few types of user-generated content in production. This is done asynchronously from the website so we don’t block the user experience, but we still care about performance. Almost all performance metrics here at Etsy track maximum and minimum times as well as the 90th percentile and mean, and this is no exception. Peak times occur when a network call is required; otherwise, a lookup typically takes about 5ms.
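To make those four metrics concrete, here is a small sketch that computes them from a list of timings. The helper `summarizeTimings` and the sample data are invented for illustration (they are not Etsy's StatsD code or real measurements), and the 90th percentile uses the nearest-rank definition, which is only one of several common choices.

```php
<?php
// Hypothetical helper: summarizes lookup timings given in milliseconds.
function summarizeTimings(array $samples): array
{
    $n = count($samples);
    if ($n === 0) {
        throw new InvalidArgumentException('Need at least one sample');
    }
    sort($samples);
    // Nearest-rank 90th percentile: rank = ceil(0.9 * n), in integer math.
    $rank = intdiv(9 * $n + 9, 10);
    return array(
        'min'  => $samples[0],
        'max'  => $samples[$n - 1],
        'mean' => array_sum($samples) / $n,
        'p90'  => $samples[$rank - 1],
    );
}

// Made-up data: mostly local lookups plus one slow call that hit the network.
$stats = summarizeTimings(array(4, 5, 5, 6, 5, 4, 5, 6, 5, 250));
print_r($stats);
```

With this made-up data, the single slow sample pulls the mean up to 29.5ms while the 90th percentile stays at 6ms, which is why it helps to track both alongside the extremes.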

[Figure: GSB performance graph]

Since this is security-related code, another goal of gsb4ugc is testability.  The protocol-parsing code is separated from the database and networking code, so it’s very easy to write unit tests. This also helps to explain how the code works. As you see below, we have some more work to do:

[Figure: Code Coverage Detail]
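To show why keeping protocol parsing apart from database and networking code makes testing easy, here is a sketch of a pure parsing function. It is a hypothetical example, not the gsb4ugc parser, and it assumes a simplified line-oriented response in which `n:` gives the seconds until the next poll, `i:` names a list, and `u:` gives a redirect URL for that list. The sample hostnames are invented.

```php
<?php
// Hypothetical parser for a simplified update response. No database or
// network access is involved, so it can be tested with plain strings.
function parseDownloadResponse(string $body): array
{
    $result = array('next' => null, 'lists' => array());
    $current = null;
    foreach (preg_split('/\r?\n/', trim($body)) as $line) {
        if ($line === '') {
            continue;
        }
        $pos = strpos($line, ':');
        if ($pos === false) {
            throw new UnexpectedValueException("Bad line: $line");
        }
        $key = substr($line, 0, $pos);
        $value = substr($line, $pos + 1);
        switch ($key) {
            case 'n':
                $result['next'] = (int) $value;
                break;
            case 'i':
                $current = $value;
                $result['lists'][$current] = array();
                break;
            case 'u':
                if ($current === null) {
                    throw new UnexpectedValueException('u: line before any i: line');
                }
                $result['lists'][$current][] = $value;
                break;
            default:
                // Other commands are ignored in this sketch.
                break;
        }
    }
    return $result;
}

$sample = "n:1800\ni:goog-malware-shavar\nu:cache.example.test/chunk1\nu:cache.example.test/chunk2\n";
print_r(parseDownloadResponse($sample));
```

A unit test for a function like this needs only an input string and an expected array, with no fixtures for storage or HTTP.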

In addition to expanding test coverage and improving performance, we’d like to add MAC support, and to use it for more content types on Etsy.  We’d also like to add the results from PhishTank for completeness and redundancy.  Comments, bug reports, patches and pull requests are all welcome, but if this type of work interests you, consider doing it full time.

Now, go forth and browse and consume content safely!

Category: engineering, security

11 Comments

This is a super cool idea. Wonder how long it will be till there are more plugins that do this?

nice, definitely gonna play around with this.

Am I missing something? Can’t a malicious user just use a tinyurl like service to create their malicious links?


I would love to see this implemented in much more software. I am now considering a way to add it in to Simple Machines Forum.

Hi ISAAC, true true, however most of the legit URL shortening services work very hard to make sure they aren’t used for malware purposes. In addition, one can resolve the links to their final destination and then pass that to GSB. Thanks for reading and writing to Code As Craft,
nickg

Hi Nick!

You’re certainly familiar with the library https://code.google.com/p/phpgsb/

I have used it in production for about 2 years. Sometimes there are problems with it (like errors on some sort urls), but most of the time it works normally. Did you try using it? And did you compare (speed, detections, correctness) your library with phpgsb?

PS: Thank you for sharing!

Hi Roman,

phpgsb is a fine project. When we first looked at it, there were a few issues that prevented it from being used as-is in Etsy’s environment.

* Our databases use a special master-master replication scheme that precludes using AUTOINCREMENT ids.

* In general, phpgsb is highly web- and browser-centric, while we needed something more batch-oriented.

* We needed more flexibility in setting up database and networking connections, and needed a richer set of statistics.

* Given how Etsy uses GSB, we needed something that has more unit tests and is more testable.

We didn’t compare performance between implementations. Most run “fast enough.” If phpgsb works for you, great!

thanks for writing in!

nickg

    I already switched to your library (a month ago). With phpgsb I had errors a few times a week.

    Gsb4ugc works ideally. Thx =)

Did you consider using WOT (Web Of Trust, a Finnish company) data for checking urls? They have an open API and seem to have a big database of site reputations…

I really doubt it. With MyWOT they would get so many false alarms.