Robots, Graphs, and Binary Search

Posted by nickg on May 24, 2012

We love our human customers. That said, we get a lot of traffic from robots too. This is great, especially when they use the Etsy API. However, they sometimes misbehave. And they misbehave frequently in the late hours, not unlike a legendary East Village night club. This time Craig Ferguson isn’t at the door to keep an eye on things. Instead, we monitor the robots attending our night club with graphs:

[graph of robot traffic]
To do this we first compiled a list of where the robots live. This includes datacenters, places that make search engines and crawlers, and companies that lease out CPUs by the slice. The full list has over 1800 records, but here’s a sample:

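The list itself is a CSV of start address, end address, and owner. The two rows below are reconstructed from the generated PHP data further down, so treat the exact column layout as illustrative:

2.16.0.0,2.23.255.255,http://akamai.com/
8.4.34.0,8.4.35.255,http://www.peakwebhosting.com/
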
So given this list, and an incoming IP address, how can one quickly tell who it belongs to, if anyone? Linear scan? No. Binary search? Yes, and we like binary search.

To do this, the raw file is converted into a static PHP file: all the dotted IPv4 addresses become their equivalent plain integers, and all the data becomes a PHP array.

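The conversion is just PHP’s built-in ip2long(). As a quick sanity check (these values match the first row of the generated array below):

<?php
// ip2long() maps a dotted IPv4 string to its 32-bit integer value.
var_dump(ip2long('2.16.0.0'));     // int(34603008) -- _ip0 of row 0 below
var_dump(ip2long('2.23.255.255')); // int(35127295) -- the matching _ip1

As part of the file, there is a find method that uses a binary search algorithm modified to work with disjoint intervals. Can you spot the difference?
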
<?php
/* Autogenerated.  Do not edit */
class IpDb {
    public static function find($ipstr) {
        $ip = ip2long($ipstr);       // dotted quad -> plain integer
        $haystack = self::$db;       // rows sorted by _ip0, non-overlapping
        $high = count($haystack) - 1;
        $low = 0;
        while ($high >= $low) {
            $probe = (int) floor(($high + $low) / 2);
            $row = $haystack[$probe];
            // The usual equality test is replaced by an interval test:
            // discard the half where the whole range lies above or below $ip.
            if ($row['_ip0'] > $ip) {
                $high = $probe - 1;
            } else if ($row['_ip1'] < $ip) {
                $low = $probe + 1;
            } else {
                return $row;         // _ip0 <= $ip <= _ip1
            }
        }
        return null;                 // $ip falls inside none of the intervals
    }
    static public $db = array (
  0 =>
  array (
    '_ip0' => 34603008,
    '_ip1' => 35127295,
    'owner' => 'http://akamai.com/',
  ),
  1 =>
  array (
    '_ip0' => 134488576,
    '_ip1' => 134489087,
    'owner' => 'http://www.peakwebhosting.com/',
  ),
  2 => ... etc...
);
}

The resulting file is pushed out the same way all of our other code is. Using this mini WHOIS database is just a PHP function call, with no database or network access. The same technique can easily be implemented in other languages, or applied to other interval problems such as date-time ranges.
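For example, assuming the generated class is saved as IpDb.php (a hypothetical filename), a lookup against the rows shown above looks like this:

<?php
require_once 'IpDb.php';  // hypothetical filename for the generated class

// 2.16.0.5 falls inside the Akamai range [2.16.0.0, 2.23.255.255].
$row = IpDb::find('2.16.0.5');
echo $row['owner'];       // prints: http://akamai.com/

// An address outside every interval returns null.
var_dump(IpDb::find('127.0.0.1'));  // NULL, at least for the rows shown here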

This is just one of many ways we monitor the site for bad behavior (robots or otherwise). Seeing how much traffic comes from where “hands aren’t on keyboard” has given us many insights into how the site is used. Some more details on how this data is used can be found in the presentation on Fraud Engineering given at the Merchant Risk Council conference in March.

Finally, let’s not forget that not all robots are bad.

Update: as soon as I posted this, my colleague Jerry Soung informed me that he had implemented the same thing for a different IP database. Binary search FTW.

Category: engineering, operations, security

7 Comments

Full list link is b0rked! It should be this: https://github.com/client9/ipcat/blob/master/datacenters.csv

Any reason you aren’t just storing this in a regular database and then caching the result?

Hi Joshua Dickerson, thanks for writing in.

There are lots of reasons not to use a database here.

* The amount of data is really small. Dealing with tables and data loading is probably just as much code as writing a pure-PHP data structure.

* SELECT * … WHERE :ip BETWEEN left AND right is a query most databases don’t handle very well. I’ve tried SQLite3 and MySQL. It’s not cute.

* Using a SQL-based R-Tree would be a natural fit, except that all the implementations I’ve seen are aimed at GIS applications with floating point numbers, not integers. So that’s out.

* When the master list gets updated, it’s not easy to translate that into SQL additions, changes, and deletions. Also, updating a disjoint-interval binary tree via SQL can cause corruption very easily (overlaps, left > right, subsets, etc.).

* Completely dropping or recreating the table is problematic in a live environment. It can certainly be done, but it’s more complicated.

* And using a database + memcache is probably 1000x slower than just using pure PHP, and introduces new failure modes.

* Using the existing “update the site” mechanism, we avoid all of these issues with no failure case (minus some crazy broken push).

Hope this helps answer your question!

thanks again!

nickg

I’m a Java programmer and would be inclined to use a TreeSet to store this data. My comparator would compare using the plain integers. The implementation of TreeSet would give me binary search for free. Do you see any holes in this solution? Is the problem just that Java isn’t a web language like PHP is?

Looks like a problem that could be solved with a radix trie [1], like it’s done in most OSes’ routing code.

[1] http://en.wikipedia.org/wiki/Radix_tree

AISPINA: a tree will work fine.

SAVETHERBTZ: you are correct, and thanks for pointing this out. For large datasets or extremely high-performance environments such as routers, a trie-like data structure is the way to go. For smaller datasets, especially in scripting languages, re-implementing a trie didn’t seem to provide enough additional performance or other benefits.

thanks all!