Announcing Hound: A Lightning Fast Code Search Tool

Posted by Jonathan Klein and Kelly Norton on January 27, 2015

Today we are open sourcing a new tool to help you search large, complex codebases at lightning speed. We are calling this tool Hound. We’ve been using it internally for a few months, and it has become an indispensable tool that many engineers use every day.

The Problem Hound Solves

Before Hound, most engineers used ack or grep to search code, but with our growing codebase this started taking longer and longer. Even worse, searching across multiple repositories was so slow and cumbersome that it was becoming frustrating and error prone. Since it is easy to overlook dependencies between repositories, and hard to even know which ones you might need to search, this was a big problem. Searching multiple repositories was especially painful if you wanted to search a repo that you didn’t have cloned on your machine. Out of this frustration, Kelly started working on a side project to try out some of the ideas and code from this article by Russ Cox. The code was robust, but we wanted to tweak some of the criteria for excluding files, and create a web UI that could talk to the search engine.

The end goal was to provide a simple web front-end with linkable search results that could perform regular expression searches on all of our repositories quickly and accurately. Hound accomplishes this with a static React front-end that talks to a Go backend. The backend keeps an up-to-date index for each repository and answers searches through a minimal API.
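For a concrete feel for that API, here is a minimal sketch of a search request. The default port and the /api/v1/search endpoint with its q and repos parameters are assumptions based on the project’s defaults, so check the repository for the current details:

    # Ask the Hound backend for every line matching a regex, across all indexed repos
    # (assumes houndd is running locally on its default port)
    curl 'http://localhost:6080/api/v1/search?q=TODO&repos=*'

The backend answers with JSON, which is what the React front-end renders into linkable results.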

Why Create a New Tool?

Some people might point out that tools like this already exist – OpenGrok comes to mind immediately. Our main beef with OpenGrok is that it is difficult to deploy, and it has a number of hard requirements that are not trivial to install and configure. To run Hound you just need Go 1.3+. That’s it. A browser helps for the web UI, but with a command line version on the way and a Sublime Text plugin already live, even the browser is optional. We wanted to lower the barrier to entry so much that anyone, anywhere, could start using Hound to search their code in seconds or minutes, not hours.

Get It

Hound is easy enough to use that we recommend you just clone it and see for yourself. We have committed to using the open source version of Hound internally, so we hope to address issues and pull requests quickly. That’s enough jibber jabber – check out the quick start guide and get searching!
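For reference, the basic setup looks something like the sketch below. The exact commands and the sample config follow the quick start guide at the time of writing, so treat them as illustrative and defer to the README; the dbpath and repo URL are placeholders:

    # Fetch and build the Hound server (requires Go 1.3+)
    go get github.com/etsy/hound/cmds/houndd

    # config.json: tell Hound which repos to clone and index
    {
      "dbpath": "data",
      "repos": {
        "SomeRepo": {
          "url": "https://github.com/example/somerepo.git"
        }
      }
    }

    # Start the server, then open http://localhost:6080
    $GOPATH/bin/houndd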

 

Hound trained at Etsy by Jonathan Klein and Kelly Norton

Category: engineering, infrastructure

31 Comments

Looks very interesting! Any sense of how many repos it can scale to? We have, literally, hundreds.

    That’s an interesting question. We have a dozen or so and it works well; with hundreds you might run into memory issues (depending on how large they are). Each index takes up a decent amount of memory, so with millions and millions of lines of text to index you are going to need a pretty beefy server. Let us know how it works for you!

    I expect you’ll want to have a look at the undocumented ms-between-poll config option. By default, it looks like the Hound system polls each repo in the config file every 30 seconds for any changes that need to be pulled.
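    To illustrate, a per-repo override might look like the following in config.json (300000 ms would mean polling every five minutes; where exactly the key lives is an assumption from skimming the source, not documented behavior):

        {
          "repos": {
            "SomeRepo": {
              "url": "https://github.com/example/somerepo.git",
              "ms-between-poll": 300000
            }
          }
        }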

    It is also notable that the system will clone a copy of every repo in the config file into its own dbpath. So if you are trying to scale to hundreds of repos, you will need enough disk space on the local system to store all of them.

    Speaking for the underlying index/search system, I’ve been using Russ Cox’s codesearch tool (which is what Hound runs on) with about 180 web projects for some time now; codesearch can handle this without a problem. The total size of those 180 project directories is about 42GB (this includes lots of assets that are not indexed by codesearch), and the resulting codesearch index is about 300MB. Total time to build the index is 14 minutes on my machine.
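    For anyone who wants to try the underlying tool directly, the workflow is roughly the following; the flags come from codesearch’s own docs, so double-check them against your version (the index lands in ~/.csearchindex by default):

        # Build (or incrementally update) the trigram index for one or more trees
        cindex ~/src/project-a ~/src/project-b

        # Regex search over the indexed files, restricted here to Go sources
        csearch -f '\.go$' 'func +Serve'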

    If you are like me, you may be interested in the API and web frontend of Hound for searches, but not really the repo integration of the indexer. If I get some time, I hope to look into just this – creating an alternate indexer module for Hound that you can just point at one or more directories on your filesystem that you want to be able to search through.

      If you want to skip the repo clone step, you can always use the file:// protocol in the config url to reference a local directory on disk. With this approach you will miss the repo updates that happen via polling as you mentioned, but if you have some other way to keep the folder updated this works well.
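      For example, a config entry along those lines might look like this (the repo name and path are made up for illustration):

        "repos": {
          "LocalCode": {
            "url": "file:///home/dev/src/localcode"
          }
        }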

      re: using the file:// protocol, I haven’t set up a test for this yet to see, but based on a skim of the source code, it seems like doing this would cause Hound to periodically (every 30 seconds) try to update my working directory by doing a pull from the origin. Is that not the case?

      Yes, currently the polling loop will attempt to pull with a file:// URL. It’s a small bug, but it doesn’t cause any actual problems.

Looks like a great tool, unfortunately the Windows version seems to have compile problems:
# code.google.com/p/codesearch/index
src\code.google.com\p\codesearch\index\mmap_windows.go:24: too few values in struct initializer
src\code.google.com\p\codesearch\index\mmap_windows.go:26: cannot use f.Fd() (type uintptr) as type syscall.Handle in argument to syscall.CreateFileMapping
src\code.google.com\p\codesearch\index\mmap_windows.go:36: too few values in struct initializer
src\code.google.com\p\codesearch\index\read.go:426: undefined: syscall.Munmap
src\code.google.com\p\codesearch\index\write.go:250: undefined: syscall.Munmap

Maybe you should also mention which codebases are searchable? Only Git repositories?

    Yeah, currently Hound only supports Git repos; I can add that to the README. It wouldn’t be too difficult to add support for other version control systems though, and we would be happy to accept a PR for that.

Does Hound require my repos to be public?

I may be missing something obvious, but how does this differ from the Silver Searcher (ag)? http://geoff.greer.fm/ag/

From someone who works in an enterprise Java environment, running some code and accessing it from a browser or editor sounds amazing. No web server, virtual machine, frameworks, or whatever else Java vendors sell.

This is a pretty cool project! You can also try this online tool:
https://code-grep.com

It takes a bit of time to scan and index the source code into the database. There is also a demo link to the Linux source code on the site.

I’ll definitely check it out – and I’m sure it’s pretty awesome, but, even if it weren’t, I really wanted to applaud the spirit of the initiative: you invested effort in making something, you think it worthwhile and are sharing it with the community, free and openly.

How awesome is that!

Good, positive karma coming your way 🙂

I’m sure you didn’t intend it this way, but I think your positioning of Hound as “a new code search tool” is a bit misleading. In truth, I would say that Hound is not a new code search tool, but a new and improved interface to a very good existing code search tool – Russ Cox’s codesearch. What Hound has done (and this is significant, and good, but underplayed in the article above) is create useful wrappers around cindex and csearch. The cindex wrapper introduces 1) a config file, 2) the plumbing to pull code out of vcs repos listed in that config file, 3) the plumbing required to update the index as needed with code changes in those repos, and 4) configurable per-project exclude files. The csearch wrapper provides a web API and an implementation of a web frontend.

The Problem Hound Solves, I believe, is providing a web UI to Russ Cox’s codesearch tool, for searching code stored in the master branch of your repositories. By design, Hound appears optimized for running on a server, which is accessed over HTTP by multiple clients. This way developers can search across all repositories without having them checked out (the Hound server will check them out for the purposes of maintaining the index).

    I understand your point, but we do specifically mention Russ Cox’s code and say “The code was robust, but we wanted to tweak some of the criteria for excluding files, and create a web UI that could talk to the search engine.” I feel like this is pretty honest and transparent.

      Sorry! When I said “misleading” I didn’t mean disingenuous!

      Like I said, I’m sure you didn’t mean for it to come across this way, but the way I first understood it, and the way I saw it reported elsewhere, this really was positioned as another tool alongside silver-searcher, grep, ack, or the “find in files” pane of an IDE. But it really isn’t! Because those are all geared towards zipping through a directory tree of files and spitting out matches.

      I now understand how Hound is to codesearch what OpenGrok is to Lucene, but it really didn’t come across until I downloaded and read the code. I think anything you do to clarify that to your audience will only improve community engagement and the quality of bug reports and PR contributions to the project.

Does Hound also search repository history? What about across branches?

    By default Hound only searches HEAD, although we have a PR in progress that allows you to specify a branch or set of branches for Hound to index.

      That answers re: branches, but I’m still unclear on history. By searching HEAD, are you saying it searches only the latest commit of HEAD, or the entire history of the HEAD branch?

      Hound will only search the latest revision of master, or whatever branch you specify once that PR is merged. We don’t look at history at all, and have no plans to do so.

We built a similar tool to search a large set of custom-developed modules. It used Lucene as the core indexing library. Did you also consider Lucene for indexing the codebase?

    We didn’t look at Lucene. The driver for this project was adopting and adapting the Google Codesearch code from Russ Cox, so we started and ended with that as a base.

This is amazing – thanks! As a DevOps admin on the Ops side, deploying this is an easy way to get brownie points with the Devs (even though I will use it just as much!). Still working out the kinks of the URL for returned files not being in a Stash format though.

Hi,

I have been using Hound in my organization. Our code is in Perforce, so we have locally synced all the repos. The total size of our repos is around 14 GB, but the time to build the index is around 80 minutes. Since we want our code to be up to date, we rebuild the index every day, and 80 minutes of downtime is a long time. Is there any way to reduce the time it takes to build the index? Thanks in advance.