Turbocharging Solr Index Replication with BitTorrent

Posted on January 23, 2012

Many of you probably use BitTorrent to download your favorite ebooks, MP3s, and movies.  At Etsy, we use BitTorrent in our production systems for search replication.

Search at Etsy

Search at Etsy has grown significantly over the years. In January 2009 we started using Solr for search, with the standard master-slave replication configuration for our search servers.

All changes to the search index are written to the master server. The slaves are read-only copies of the master which serve production traffic. The search index is replicated by copying files from the master server to the slave servers. The slave servers poll the master for updates, and when the index has changed they download the changes via HTTP. Our search indexes have grown from 2 GB to over 28 GB over the past two years, and copying the index from the master to the slave nodes became a problem.

The Index Replication Issue

To keep searches on our site fast, we optimize our indexes nightly. Index optimization creates a completely new copy of the index. As we added new boxes we started to notice a disturbing trend: Solr’s HTTP replication was taking longer and longer to complete after our nightly index optimization.

After some benchmarking we determined that Solr’s HTTP replication was only allowing us to transfer between 2 MB and 8 MB of data per second. We tried various tweaks to HTTP replication, adjusting compression and chunk size, but nothing helped. This problem was only going to get worse as we scaled search. Deploying a new slave server exposed the same issue: pulling all of our indexes at once at only 8 MB per second could take over 4 hours, with our 3 large indexes consuming most of the transfer time.

Our 4 GB optimized listings index was taking over an hour to replicate to 11 search slaves. Even if we made HTTP replication go faster, we were still bound by our server’s network interface card. We tested netcat from the master to a slave server and the results were as expected: the network interface was saturated. The problem had to be related to Solr’s HTTP replication.

The fundamental limitation with HTTP replication is that replication time increases linearly with the number of slaves. The master must talk to each slave separately, instead of all at once. If 10 boxes take 4 hours, scaling to 40 boxes would take over half a day!

We started looking around for a better way to get bits across our network.

Multicast Rsync?

If we need to get the same bits to all of the boxes, why not send the index via multicast to the slaves? It sure would be nice to only send the data once. We found an implementation of rsync which used multicast UDP to transfer the bits. The mrsync tool looked very promising: we could transfer the entire index in our development environment in under 3 minutes. So we thought we would give it a shot in production.

 [15:25]  <gio> patrick: i'm gonna test multi-rsyncing some indexes
          from host1 to host2 and host3 in prod. I'll be watching the
          graphs and what not, but let me know if you see anything
          funky with the network
 [15:26]  <patrick> ok
 ....
 [15:31]  <keyur> is the site down?

Multicast rsync caused an epic failure for our network, killing the entire site for several minutes. The multicast traffic saturated the CPU on our core switches causing all of Etsy to be unreachable.

BitTorrent?

For those folks who have never heard of BitTorrent, it’s a peer-to-peer file sharing protocol used for transferring data across the Internet. BitTorrent is a very popular protocol for transferring large files. It’s been estimated that 43% to 70% of all Internet traffic is BitTorrent peer-to-peer sharing.

Our Ops team started experimenting with a BitTorrent package called herd, which sits on top of BitTornado. Using herd they transferred our largest search index in 15 minutes. They then spent 8 hours tweaking the variables, making the transfer faster and faster. Using pigz for compression and herd for transfer, they cut the replication time for the biggest index from 60 minutes to just 6 minutes!

These experiments were great for the one time each day when we need to get the full index out to all the slave servers, but the approach would require coordination with Solr’s HTTP replication. We would need to stop replication, stop indexing, and run an external process to push the index out to the boxes.

BitTorrent and Solr Together

By integrating the BitTorrent protocol into Solr we could replace HTTP replication. BitTorrent supports updating and resuming downloads, which works well for incremental index updates. When we use BitTorrent for replication, all of the slave servers seed index files, allowing us to bring up new slaves (or update stale slaves) very quickly.

Selecting a BitTorrent Library

We looked into various Java implementations of the BitTorrent protocol, but unfortunately none of them fit our needs.

Eventually we came upon ttorrent, which fit most of the requirements we had for integrating BitTorrent into the Solr stack.

We needed to make a few changes to ttorrent to handle Solr indexes. We added support for multi-file torrents, which allowed us to hash and replicate the index files in place. We also fixed some issues with large file (> 2 GB) support. All of these changes can be found in our fork of the ttorrent code; most of them have already been merged back into the main project.

How it Works

BitTorrent replication relies on Lucene to give us the names of the files that need to be replicated.

When a commit occurs the steps taken on the master server are as follows:

Once a slave server has been notified of a new version of the index, or the slave polls the master server and finds a newer version of the index, the steps taken on the slave servers are as follows:

When new files need to be downloaded, partial (“.part”) files are created. This allows us to resume downloading if replication is interrupted. After the download completes, the slave server continues to seed the index via BitTorrent. This is great for bringing on new servers, or updating servers that have been offline for a period of time.
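
To make the replication flow concrete, here is a rough Java sketch of the slave-side cycle: poll the master for the index version, download any missing files into “.part” files, promote them into place, and keep seeding. The MasterClient and TorrentDownloader interfaces and their methods are illustrative assumptions, not the actual Solr or ttorrent APIs.

    import java.io.File;

    // Hypothetical sketch of the slave-side steps described above. MasterClient and
    // TorrentDownloader are illustrative interfaces, not the real Solr or ttorrent APIs.
    public class SlaveReplicationSketch {

        interface MasterClient {
            long fetchIndexVersion() throws Exception;               // the slave polls the master
            byte[] fetchTorrentFile(long version) throws Exception;  // torrent describing the new index
        }

        interface TorrentDownloader {
            void download(byte[] torrent, File indexDir) throws Exception; // writes ".part" files
            void promotePartFiles(File indexDir) throws Exception;         // rename ".part" -> final names
            void seed(byte[] torrent, File indexDir);                      // keep seeding afterwards
        }

        private final MasterClient master;
        private final TorrentDownloader downloader;
        private long localVersion;

        SlaveReplicationSketch(MasterClient master, TorrentDownloader downloader, long localVersion) {
            this.master = master;
            this.downloader = downloader;
            this.localVersion = localVersion;
        }

        // One polling cycle: check the master, fetch the new files, then seed.
        void pollOnce(File indexDir) throws Exception {
            long masterVersion = master.fetchIndexVersion();
            if (masterVersion <= localVersion) {
                return; // index is current; keep seeding what we already have
            }
            byte[] torrent = master.fetchTorrentFile(masterVersion);
            downloader.download(torrent, indexDir);   // resumable thanks to the ".part" files
            downloader.promotePartFiles(indexDir);    // move completed files into place
            localVersion = masterVersion;
            downloader.seed(torrent, indexDir);       // help other slaves (and new ones) catch up
        }
    }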

HTTP replication doesn’t allow for the transfer of older versions of a given index. This causes issues with some of our static indexes. When we bring up new slaves, Solr creates a blank index whose version is greater than that of the static index, so we either have to optimize the static indexes or force a commit before replication will take place.

With BitTorrent replication all index files are hash verified, ensuring slave indexes are consistent with the master index. It also ensures the index version on the slave servers matches the master server, fixing the static index issue.

User Interface

The HTTP replication UI is very clunky: you must visit each slave to understand which version of the index it has. Its transfer progress display is fairly basic, and toward the end of a transfer it is misleading: the index is actually being warmed, yet the transfer rate keeps changing. Wouldn’t it be nice to look in one place and understand what’s happening with replication?

With BitTorrent replication the master server keeps a list of slaves in memory. The list of slaves is populated by the slaves polling master for the index version. By keeping this list we can create an overview of replication across all of the slaves. Not to mention the juicy BitTorrent transfer details and a fancy progress bar to keep you occupied while waiting for bits to flow through the network.
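
As a rough illustration of that bookkeeping (not Etsy’s actual code), the registry behind such an overview can be as small as a concurrent map keyed by slave host, refreshed every time a slave polls for the index version:

    import java.util.HashMap;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Minimal sketch of an in-memory slave registry on the master, populated as a side
    // effect of slaves polling for the index version. Names are illustrative only.
    public class SlaveRegistrySketch {

        // What the master remembers about each slave, refreshed on every poll.
        public static final class SlaveStatus {
            public final long indexVersion;   // index version the slave reported having
            public final long lastPollMillis; // when the slave last checked in

            SlaveStatus(long indexVersion, long lastPollMillis) {
                this.indexVersion = indexVersion;
                this.lastPollMillis = lastPollMillis;
            }
        }

        private final Map<String, SlaveStatus> slaves = new ConcurrentHashMap<>();

        // Called whenever a slave polls the master for the current index version.
        public void recordPoll(String slaveHost, long reportedVersion) {
            slaves.put(slaveHost, new SlaveStatus(reportedVersion, System.currentTimeMillis()));
        }

        // Snapshot used to render the replication overview (and the progress bars).
        public Map<String, SlaveStatus> snapshot() {
            return new HashMap<>(slaves);
        }
    }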

The Results

Pictures are worth a few thousand words. Let’s look again at the picture from the start of this post, where we had 11 slave servers pulling 4 GB of index.

Today we have 23 slave servers pulling 9 GB of indexes.

You can see it no longer takes over an hour to get the index out to the slaves despite more than doubling the number of slaves and the index size. The second largest triangle on the graph represents our incremental indexer playing catch up after the index optimization.

This shows the slaves are helping to share the index as well. The last few red blobs are indexes that haven’t been switched to BitTorrent replication.

Drawbacks

One of BitTorrent’s features is hash verification of the bits on disk. This has a side effect when dealing with large indexes. The master server must hash all of the index files to generate the Torrent file. Once the Torrent file is generated, all of the slave servers must compare the hashes against their current set of index files. Hashing 9 GB of index takes roughly 60 seconds of SHA1 calculations. Java’s SHA1 implementation is not thread safe, making it impossible to do this process in parallel. This means there is a 2 minute lag (hashing on the master, then verification on the slaves) before the BitTorrent transfer begins.

To get around this issue we created a thread-safe version of SHA1 and a DigestPool interface to allow for parallel hashing. This lets us tune the lag time before the transfer begins, at the expense of increased CPU usage. It’s possible to hash the entire 9 GB in 16 seconds when running in parallel, making the total lag before transfer around 32 seconds.
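
Here is a minimal sketch of the parallel piece-hashing idea, using a separate java.security.MessageDigest per task so no digest instance is ever shared between threads. The 5 MB piece length matches the value mentioned in the comments below; the class itself is illustrative and is not the actual DigestPool implementation.

    import java.io.RandomAccessFile;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.security.MessageDigest;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    // Sketch of hashing a large index file into fixed-size BitTorrent pieces in parallel.
    // Each task builds its own MessageDigest, so nothing is shared between threads.
    // Illustrative only; not the actual DigestPool implementation.
    public class ParallelPieceHasher {

        static final int PIECE_LENGTH = 5 * 1024 * 1024; // 5 MB pieces

        public static List<byte[]> hashPieces(Path file, int threads) throws Exception {
            long length = Files.size(file);
            int pieces = (int) ((length + PIECE_LENGTH - 1) / PIECE_LENGTH);

            ExecutorService pool = Executors.newFixedThreadPool(threads);
            try {
                List<Future<byte[]>> futures = new ArrayList<>(pieces);
                for (int i = 0; i < pieces; i++) {
                    final long offset = (long) i * PIECE_LENGTH;
                    futures.add(pool.submit((Callable<byte[]>) () -> hashPiece(file, offset, length)));
                }
                List<byte[]> hashes = new ArrayList<>(pieces);
                for (Future<byte[]> f : futures) {
                    hashes.add(f.get()); // collect in order so piece hashes stay aligned
                }
                return hashes;
            } finally {
                pool.shutdown();
            }
        }

        // Reads one piece through its own file handle and hashes it with a private digest.
        private static byte[] hashPiece(Path file, long offset, long fileLength) throws Exception {
            int size = (int) Math.min(PIECE_LENGTH, fileLength - offset);
            byte[] buffer = new byte[size];
            try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
                raf.seek(offset);
                raf.readFully(buffer);
            }
            MessageDigest sha1 = MessageDigest.getInstance("SHA-1"); // one digest per task
            return sha1.digest(buffer);
        }
    }

The thread count is the knob that trades CPU usage for lag, as described above.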

Improvements

To better deal with the transfer lag we are looking at creating a Torrent file per index segment. Lucene indexes are made up of various segments. Each commit creates an index segment. By creating a new Torrent file per segment we can reduce the lag before transfer to milliseconds, because new segments are generally small.
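
As a sketch of the idea (not actual code), the files from a new commit can be grouped by their Lucene segment name, i.e. the prefix before the first dot in names like “_4f.fdt”, and each group would then get its own small torrent (represented by a print statement here).

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;
    import java.util.TreeMap;

    // Sketch of grouping newly committed Lucene index files by segment so each segment
    // can get its own (small) torrent. Illustrative only.
    public class PerSegmentTorrentSketch {

        // Lucene segment files look like "_4f.fdt"; the segment name is "_4f".
        static String segmentOf(String fileName) {
            int dot = fileName.indexOf('.');
            return (fileName.startsWith("_") && dot > 0) ? fileName.substring(0, dot) : fileName;
        }

        static Map<String, List<String>> groupBySegment(List<String> newFiles) {
            Map<String, List<String>> bySegment = new TreeMap<>();
            for (String file : newFiles) {
                bySegment.computeIfAbsent(segmentOf(file), k -> new ArrayList<>()).add(file);
            }
            return bySegment;
        }

        public static void main(String[] args) {
            List<String> committed = List.of("_4f.fdt", "_4f.fdx", "_4f.tim", "segments_12");
            // One torrent per segment: only the new, small segment has to be hashed,
            // which keeps the pre-transfer lag in the millisecond range.
            groupBySegment(committed).forEach((segment, files) ->
                    System.out.println("would create a torrent for " + segment + ": " + files));
        }
    }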

We are also going to add support for replicating arbitrary files. We use external file fields and custom index timestamp files to keep track of incremental indexing, and it makes sense to have Solr manage replication of these files. We will follow HTTP replication’s lead on confFiles, adding dataFiles and indexFiles to handle the rest of the index-related files.

Conclusion

Our search infrastructure is mission critical at Etsy. Integrating BitTorrent into Solr allows us to scale search without adding lag, keeping our sellers happy!

Category: data, engineering, infrastructure, operations, search

42 Comments

Very cool. Any chance you’re going to release the source?

Pure awesomeness.

Excellent. I’ve been considering doing something similar for our Sphinx indexes at Craigslist. Glad to see it works as well as I might have hoped.

I just love these in-depth technical posts. Incredibly informative and a really nice way to get a look behind the scenes of a high-profile operation like yours. Please, please keep them coming. The Internet needs more ops stuff like this.

Wow, great post, very interesting read.

Fortunately we don’t have these issues yet at Fashiolista. Lots of traffic, but a nice and small search index!

Really creative solution guys, awesome write up!

What are you using to create this sort of chart: http://etsycodeascraft.files.wordpress.com/2012/01/after-bittorent.png

Also where does the search_listings and json_indextime_repl data come from?

Great post.

Thanks.

    Both search_listings-v2_indextime_repl and search_listings-v2-json_indextime_repl are generated using ganglia. When replication completes a file called “replication.properties” is updated with a time stamp:

    #Replication details
    #Mon Jan 23 23:22:30 UTC 2012
    indexReplicatedAt=1327360950251

    We have some shell scripts that compare the indexReplicatedAt to the time “now”, which gives us the index age. The index age is written to disk and picked up by Ganglia for graphing and Nagios for monitoring.
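
    The age computation itself is just the current time minus the indexReplicatedAt epoch milliseconds. The scripts above are shell, but the same check sketched in Java looks roughly like this (the file path here is made up):

        import java.io.FileInputStream;
        import java.io.IOException;
        import java.util.Properties;

        // Sketch: derive the index age (in seconds) from replication.properties.
        public class IndexAgeSketch {
            public static void main(String[] args) throws IOException {
                Properties props = new Properties();
                try (FileInputStream in = new FileInputStream("/search/data/replication.properties")) {
                    props.load(in);
                }
                long replicatedAt = Long.parseLong(props.getProperty("indexReplicatedAt"));
                long ageSeconds = (System.currentTimeMillis() - replicatedAt) / 1000L;
                // This value is what gets written out for Ganglia to graph and Nagios to alert on.
                System.out.println("index age: " + ageSeconds + "s");
            }
        }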

Can you provide more detail about “Java’s SHA1 implementation is not thread safe”? That sounds a bit worrying. Is there a bug report anywhere?

http://docs.oracle.com/javase/6/docs/api/java/security/MessageDigest.html#getInstance(java.lang.String) indicates that a new MessageDigest is returned (which one would hope might be thread-safe, but I’ve not looked at the internals).

When you say you’re doing the SHA1 in parallel – do you mean you are doing SHA1 hashes of multiple (separate) index files at the same time?

I am not familiar with how the index files are stored in this system, so I’m just trying to clarify whether you are seeing that performance increase when switching to parallel hashing on a single large file, or whether you have hashes running on multiple files?

    It could be one or many files. The hashing is defined as part of the BitTorrent specification. A “.torrent” file contains a “piece length” which is used to divide the file (or set of files) into “pieces”. Each piece is hashed to a 20-byte SHA1 string. For us the “piece length” is 5 MB, so hashing a 10 GB file requires about 2,048 pieces. Each piece needs to be read and hashed; doing this in parallel takes about 16 seconds versus about 90 seconds in serial.


    http://wiki.theory.org/BitTorrentSpecification#Metainfo_File_Structure

      Thanks for the clarification. I was wondering as I have been curious to find out if doing multiple sequential reads in parallel on the same file is faster than doing one (thinking that maybe the overhead of multiple disk reads from different points on the disk might actually make it slower overall), so that is really interesting.

      I use BT a lot to move around large files internally between servers and waiting for the hashing is always a very irritating part of the process.

      We use solid state drives for our search servers; using Java’s RandomAccessFile we are able to read the entire 10 GB in less than 5 seconds. Hashing the pieces is a much slower process for us.

      Ahh, right, of course. Out of interest, have you tested it on a non-solid state drive (ye olde spinning disk)?

It seems that this is mostly a workaround to the huge change that optimizing an index causes.

I was wondering – have you considered simply not optimizing? The lucene optimize() method is now deprecated, as optimizing the index is rarely recommended on an actively changing index (http://wiki.apache.org/solr/SolrPerformanceFactors#Optimization_Considerations).

Have you tried running without a tracker/.torrent file by using DHT (http://en.wikipedia.org/wiki/BitTorrent_%28protocol%29#Distributed_trackers) and Magnet URIs (http://en.wikipedia.org/wiki/Magnet_URI_scheme) only? I’d guess this makes the initial time before it discovers all seeds longer, though.

Is the rsync failure unfixable? I’m a big fan of BitTorrent, but I think rsync should work much faster in your use case.

You mentioned a size of about 28 GB, but that’s not that much for broadband networks. A more important measure is the number of files (hundreds of thousands); how many files are in your index?

Also, have you tried http://www.rasterbar.com/products/libtorrent/ ? It doesn’t have Java bindings, but it’s the most recent and a pretty popular library (sure, Vuze outshines it by user base, but who uses Vuze as a library?)


Great post, keep them coming!

Two more questions.

Why do you need a custom library? I read your article once again, but have not found anything that exceeds, let’s say, rtorrent’s functionality. Couldn’t you use a bash CLI or something?

Also, why don’t you just share the .torrent files via rsync? It’s much simpler to push a new version just by replacing the old file; rsync will do all the work for you.

Why not go for elasticsearch? Sooner or later that index is going to be too large for a single machine.

So now you have about a year to get the multicast working? Before you saturate your network I mean. Because this is still the wrong O.

Great article and insight.

FYI, mrsync is confusingly named since it doesn’t actually use the rsync algorithm; it performs multicast file transfer. (The whole file is sent, not just the deltas.)

I’ve always thought it would be cool if someone added rsync capabilities on top of bittorrent. I’m not sure Etsy would see any benefit but I bet Twitter with their murder server deploy would.

Nice to see this Solr work happening at Etsy.

I feel like I’ve read this same text somewhere else in 2011. Did you publish this elsewhere before?

Open-sourcing?

Is this going to be unnecessary once you switch to SolrCloud?

Interesting article, thank you. Quite surprising that you were not able to see speeds above 8 MB/s; we use plain vanilla HTTP replication and the speed is usually capped only by SSD write speed (80-90 MB/s). Also, it is not clear how “The master must talk to each slave separately, instead of all at once,” since the clients could replicate simultaneously (but you may want to throttle this to avoid network saturation).

What are the performance benefits that you see with optimize? How much faster is your search service after the optimize as opposed to using something like the TieredMergePolicy?

I am asking these questions as I am thinking of migrating my current master service away from optimize to a TieredMergePolicy.

This approach is pretty bad-ass! Thanks for sharing!

“Java’s SHA1 implementation is not thread safe making it impossible to do this process in parallel.”
Well, hmmmm… not parallel for one file, but parallel for multiple files. Let me throw an idea in: why not use the GPU as an accelerator for the calculation(s)?

With a GPU approach you might calculate multiple SHA1 hashes for different files, while the CPU keeps track of the files processed. Might be a perfect job for HPC in AWS or Nimbix.

Neat!

I’m wondering why you guys are doing optimize. Do you really see a significant performance impact over a not-fully-optimized index? In Lucene trunk, optimize was even renamed to forceMerge (see “LUCENE-3454 rename optimize to a less cool-sounding name”) to not let people fall into that trap.

Did you consider having only a few slaves replicate directly from the master and having the rest of the slaves replicate from those? I believe this is a fairly common setup to solve this exact problem. While I don’t understand all the details of your problem, it also sounds like a far simpler solution.

> Multicast rsync caused an epic failure for our network, killing the entire site for several minutes. The multicast traffic saturated the CPU on our core switches causing all of Etsy to be unreachable.

Wow, that’s a heck of an issue. What did the mrsync guys say? Was it a bug, or were you supposed to throttle the connections or something?

I’m wondering if your slow HTTP replication is due to reading off of disk for each machine. Perhaps a caching layer that keeps the last 1 GB of data read off of disk would let the next machine that asks for that file get it from memory.

Thanks for the post!

Wow! That’s neat. We haven’t run into any replication speed issues yet. We sure have one more trick in the bag in case we run into any in the future. Thanks for a great detailed post.

This was a wonderful post about a great approach. Wouldn’t setting up repeaters have helped a lot?

[...] may have noticed a theme in many of our blog posts. While we do push the technology envelope in many ways, we also like to stick to tried, true and data [...]

[...] pattern of not bothering with RAID; The SSD is perfectly fast enough on it’s own, and we have BitTorrent index distribution which means getting the indexes to the machine is super [...]

Did you consider getting a better switch or digging into its configuration instead of abandoning mrsync? Multicast is extremely popular in the finance world and they transmit a lot more data than you’re dealing with over it, but they tend to buy switches that are specifically known to have good multicast support.


[...] is what Etsy uses for search [...]

Very informative, in-depth, thought-provoking analysis. Great illustration of the results and findings. Thank you for sharing, and I hope to see more great posts!