How Etsy Manages HTTPS and SSL Certificates for Custom Domains on Pattern

Posted on January 31, 2017

In April of 2016 Etsy launched Pattern, a new product that gives Etsy sellers the ability to create their own hosted e-commerce website. With an easy-setup experience, modern and stylish themes, and guest checkout, sellers can merchandise and manage their brand identity outside of the Etsy.com retail marketplace while leveraging all of Etsy’s e-commerce tools.

The ability to point a custom domain to a Pattern site is an especially popular feature; many Pattern sites use their own domain name, either registered directly on the Pattern dashboard, or linked to Pattern from a third-party registrar.

At launch, Pattern shops with custom domains were served over HTTP, while checkouts and other secure actions happened over secure connections with Etsy.com. This model isn’t ideal though; Google ranks pages with SSL slightly higher, and plans to increase the bump it gives to sites with SSL. That’s a big plus for our sellers. Plus, it’s the year 2017 – we don’t want to be serving pages over boring old HTTP if we can help it … and we really, really like little green lock icons.

Securify that address bar!

In this post we’ll be looking at some interesting challenges you run into when building a system designed to serve HTTPS traffic for hundreds of thousands of domains.

How to HTTPS

First, let’s talk about how HTTPS works, and how we might set it up for a single domain.

Let’s not talk that much about how HTTPS works though, because that would take a long time. Very briefly: by HTTPS we mean HTTP over SSL, and by SSL we mean TLS. All of this is a way for your website to communicate securely with clients. When a client first begins communicating with your site, you provide them with a certificate, and the client verifies that your certificate comes from a trusted certificate authority. (This is a gross over-simplification, here’s a very interesting post with more detail.)

Suffice to say, if you want HTTPS, you need a certificate, and if you want a certificate you need to get one from a certificate authority. Until fairly recently, this is the point where you had to open up your wallet. There were a handful of certificate authorities, each with an array of confusing options and pricey up-sells.

What's the _deal_ with this cert pricing?!

You pick whichever feels right to you, you get out your credit card, and you wind up with a certificate that’s valid for one to three years. Some CAs offer APIs, but three years is a relatively long time, so chances are you do this manually. Plus, who even knows which cell in which SSL pricing matrix gets you API access?

From here, if you’re managing a limited number of domains, typically what you do is upload your certificate and private key to your load balancer (or CDN) and it handles TLS termination for you. Clients communicate with your load balancer over HTTPS using your domain’s public certificate, and your load balancer makes internal requests to your application.

[Diagram: the load balancer terminates TLS and forwards requests to the application]

And that’s it! Your clients have little green padlocks in their address bars, and your application is serving payloads just like it used to.

More certificates, more problems

Manually issuing certificates works well enough if your application is served off a handful of domains. If you need certificates for lots and lots of domains, it’s not going to work. Have we mentioned that Pattern is available for the very reasonable price of $15 per month? A real baseline requirement for us is that we can’t pay more for a certificate than someone pays for a year of Pattern. And we certainly don’t have time to issue all those certificates by hand.

Until fairly recently, our only options at this point would have been 1) do a deal with one of the big certificate authorities that offers API access, or 2) become our own certificate authority. These are both expensive options (we don’t like expensive options).

Let’s Encrypt to the rescue

Luckily for us, in April of 2016, the Internet Security Research Group (ISRG), which includes founders from the Mozilla Foundation and the Electronic Frontier Foundation, launched a new certificate authority named Let’s Encrypt.

The ISRG has also designed a communication protocol, Automated Certificate Management Environment (ACME), that covers the process of issuing and renewing certificates for TLS termination. Let’s Encrypt provides its service for free, and exclusively via an API that implements the ACME protocol. (You can read about it in more detail here.)

This is great for our project: we found a certificate authority we feel really good about, and since it implements a freely published protocol, there are great open source libraries out there for interacting with it.

Building out certificate issuance

Let’s re-examine our problem now that we have a good way to get certificates:

  1. We have a large quantity of custom domains attached to existing Pattern sites; we need to generate certificates for all of those.
  2. We have a constant stream of new domain registrations for new Pattern sites; we’ll need to get new certificates on a rolling basis.
  3. Let’s Encrypt certificates last 90 days; we need to renew our SSL certificates fairly frequently.
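The third point boils down to a periodic job that picks out certificates nearing expiry. Here is a minimal sketch in Python, not CertService's actual logic: the in-memory mapping of domain to issue time and the 30-day renewal margin are assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone

CERT_LIFETIME = timedelta(days=90)  # Let's Encrypt certificate lifetime
RENEW_MARGIN = timedelta(days=30)   # assumption: renew 30 days before expiry


def domains_due_for_renewal(issued_at, now):
    """Return domains whose certificate expires within RENEW_MARGIN, soonest first."""
    due = []
    for domain, issued in issued_at.items():
        expires = issued + CERT_LIFETIME
        if expires - now <= RENEW_MARGIN:
            due.append((expires, domain))
    return [domain for _, domain in sorted(due)]


# Hypothetical data: three domains issued at different times.
now = datetime(2017, 1, 31, tzinfo=timezone.utc)
issued_at = {
    "fresh.example": now - timedelta(days=10),
    "aging.example": now - timedelta(days=65),
    "expired.example": now - timedelta(days=95),
}
print(domains_due_for_renewal(issued_at, now))
```

Sorting by expiry means that if a run is cut short, the certificates closest to lapsing have already been handled.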

All of these problems are solvable now! We chose an open source ACME client library for PHP called AcmePHP. Another exciting thing about Let’s Encrypt using an open protocol is that there are open source server implementations as well as client implementations. So we were able to spin up Boulder, the open source ACME server that Let’s Encrypt itself runs, as an internal equivalent of Let’s Encrypt, and do our development and testing inside our private network without worrying about rate limits.

[Diagram]

We built a service to handle this ACME logic, store the certificates we receive, and keep track of which domains we have certificates for, and we named it CertService. It’s a service that deals with certificates, you see. CertService communicates with Let’s Encrypt, and also exposes an API for other internal Etsy services. We’ll go into more detail on that later in this post.

TLS termination for lots and lots of domains

Now that we’ve built out a service that can have certificates issued, we need a place to put them, and we need to use them to do TLS termination for HTTPS requests to Pattern custom domains.

We handle this for *.etsy.com by putting our certificates directly onto our load balancers and giving them to our CDNs. In planning this project, we thought about hundreds of thousands of certificates, each with a 90-day lifetime. If we did a good job spacing out issuing and renewing our certificates, that would add up to thousands of daily write operations to our load balancers and CDNs. That’s not a rate we were comfortable with; load balancer and CDN configuration changes are relatively high-risk operations.
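The "thousands of daily writes" estimate is simple division. As a back-of-the-envelope sketch in Python, assuming a hypothetical 300,000 certificates spread perfectly evenly over the renewal interval:

```python
def daily_certificate_writes(num_certs, renewal_interval_days):
    """Average writes per day if renewals are spread evenly over the interval."""
    return num_certs / renewal_interval_days


# 300,000 certificates is an illustrative figure, not a measured count.
print(round(daily_certificate_writes(300_000, 90)))  # renew only at expiry
print(round(daily_certificate_writes(300_000, 60)))  # renew 30 days early
```

Renewing ahead of expiry shortens the effective interval, so the daily write count goes up rather than down.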

What we did instead is create a pool of proxy servers, and use our load balancer to distribute HTTPS traffic to them. The proxy hosts handle the client SSL termination, and proxy internal requests to our web servers, much like the load balancer does for www.etsy.com.

[Diagram: the load balancer distributes HTTPS traffic to the pool of proxy hosts, which terminate TLS]

Our proxy hosts run Apache and we leverage mod_ssl and mod_proxy to do TLS termination and proxy requests. To preserve client IP addresses, we make use of the PROXY protocol on our load balancer and mod_proxy_protocol on the proxy hosts. We’re also using mod_macro to avoid ever having to write out hundreds of thousands of virtual host declarations for hundreds of thousands of domains. All put together, it looks something like this:


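# Proxy every request to the internal web VIP. SSLProxyEngine is required
# by mod_ssl whenever the ProxyPass target is https://.
SSLProxyEngine          on
ProxyPass               /  https://internal-web-vip/
ProxyPassReverse        /  https://internal-web-vip/

# One macro (mod_macro) stands in for every custom-domain virtual host.
<Macro VHost $domain>
<VirtualHost *:443>
    # Accept the PROXY protocol header from the load balancer
    # (mod_proxy_protocol) so the original client IP is preserved.
    ProxyProtocol On
    ServerName $domain

    # Terminate TLS with this domain's certificate and key (mod_ssl).
    SSLEngine on
    SSLCertificateFile      $domain.crt
    SSLCertificateKeyFile   $domain.key
    SSLCertificateChainFile lets-encrypt-cross-signed.pem
</VirtualHost>
</Macro>

# Each Use line expands the macro into a full virtual host for one domain.
Use VHost custom-domain-1.com
...
Use VHost custom-domain-n.com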

To connect all this together, our proxy hosts periodically query CertService for a list of recently modified custom domains. Each host then 1) fetches the new certificates from CertService, 2) writes them to disk, 3) regenerates a config like the one above, and 4) does a graceful restart of Apache. These restarts are staggered across our proxy pool so that all but one of the hosts are available and receiving requests from the load balancer at any given time (fingers crossed).
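Steps 2 through 4 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the actual proxy-host tooling: the function names, the file layout, and the use of `apachectl graceful` as the reload command are all hypothetical, and the certificates and already-decrypted keys are assumed to have been fetched in step 1.

```python
import subprocess
from pathlib import Path


def render_use_lines(domains):
    """Render one `Use VHost <domain>` line per domain, sorted for stable diffs."""
    return "".join(f"Use VHost {d}\n" for d in sorted(set(domains)))


def write_atomically(path, text, mode=0o644):
    """Write to a temp file, then rename, so Apache never reads a partial file."""
    tmp = path.with_name(path.name + ".tmp")
    tmp.write_text(text)
    tmp.chmod(mode)
    tmp.replace(path)


def sync_host(cert_dir, include_file, changed, all_domains,
              reload_cmd=("apachectl", "graceful")):
    """Write changed certs, regenerate the Use lines, reload only if needed.

    `changed` maps domain -> (cert_pem, key_pem) as fetched in step 1.
    """
    for domain, (cert_pem, key_pem) in changed.items():
        write_atomically(cert_dir / f"{domain}.crt", cert_pem)
        write_atomically(cert_dir / f"{domain}.key", key_pem, mode=0o600)
    write_atomically(include_file, render_use_lines(all_domains))
    if changed:
        # Graceful restart lets in-flight requests finish on the old config.
        subprocess.run(list(reload_cmd), check=True)


print(render_use_lines(["custom-domain-2.com", "custom-domain-1.com"]), end="")
```

In this sketch the generated file holds only the `Use VHost` lines, so the hand-written macro definition can pull it in with an `Include` directive and never needs regenerating. Staggering is left to whatever schedules `sync_host` across the pool.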

How do we securely store lots of certificates?

Now that we’ve figured out how to programmatically request and renew SSL certificates via Let’s Encrypt, we need to store these certificates in a secure way. To do this, there are some guarantees we need to make:

  1. Private keys are stored in a database segmented from other types of data
  2. Private keys are encrypted at rest and never leave CertService in plaintext
  3. SSL key pair generation and Let’s Encrypt communications take place only on trusted hosts
  4. Private keys can be retrieved only by the SSL terminating hosts

Guarantee #1 is a no-brainer. If an attacker were to compromise a datastore containing thousands of SSL private keys in plaintext, they would be able to intercept critical data being sent to thousands of custom domains. Since security is about raising costs for an attacker – that is, making it harder for an attacker to succeed – we employ a number of techniques to secure our keys.

Our first layer of defense is at the infrastructure level: private keys are stored in a MySQL database segmented from the rest of the network. We use iptables to limit who can connect to the MySQL server, and given that CertService is the only client that needs access, the scope is really narrow. This vastly reduces the attack surface, especially in cases where an attacker is looking to pivot from another compromised server on the network. We also use iptables to lock down who can communicate with the CertService API; adding constraints to connectivity on top of a secure authentication scheme makes retrieving certificates that much more difficult. That addresses Guarantee #4.

Now that we’ve locked down access to the database, we need to make sure the private keys are stored encrypted. For this, we make use of a concept known as a hybrid cryptosystem. Hybrid cryptosystems, in a nutshell, combine asymmetric (public-key) and symmetric cryptosystems. If you’re familiar with SSL, much of how we handle crypto here is analogous to session keys.

At the start of this process, we have two pieces of data: the SSL private key and its corresponding public key – the certificate. We don’t particularly care about the certificate since that is public by definition. We start by generating a domain-specific AES-256 key and encrypting the SSL private key with it. This only technically addresses the issue of not having plaintext on disk; the encrypted SSL private key is stored right next to the AES key, which can be used to both encrypt and decrypt. An attacker who could steal the encrypted keys could also steal the AES key. To address this, we encrypt the AES key with a CertService public key. Now we have an encrypted SSL private key (encrypted with the AES-256 key) and an encrypted AES key (encrypted with CertService’s RSA-2048 public key). Now not only are keys truly stored encrypted on disk, they also cannot be decrypted at all on CertService. This means that if an attacker were to break CertService’s authentication scheme, the most they would receive is an encrypted SSL private key; they would still need the CertService private key – available only on the SSL terminating hosts – to decrypt it. Now we’ve fully taken care of Guarantee #2.
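The scheme described above can be sketched in Python with the third-party `cryptography` package. This is an illustration, not CertService's code: the post specifies only AES-256 and RSA-2048, so the AES-GCM mode, the OAEP padding, and the function and field names are assumptions.

```python
import os

from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import padding, rsa
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

# Assumption: RSA-OAEP with SHA-256 for wrapping the AES key.
OAEP = padding.OAEP(
    mgf=padding.MGF1(algorithm=hashes.SHA256()),
    algorithm=hashes.SHA256(),
    label=None,
)


def seal_private_key(tls_key_pem, wrapping_public_key):
    """Runs where the key is generated; returns only ciphertext for storage."""
    data_key = AESGCM.generate_key(bit_length=256)  # per-domain AES-256 key
    nonce = os.urandom(12)
    encrypted_tls_key = AESGCM(data_key).encrypt(nonce, tls_key_pem, None)
    # Wrap the AES key with the RSA public key; the storing side holds no
    # private key, so it cannot unwrap what it has just written.
    encrypted_data_key = wrapping_public_key.encrypt(data_key, OAEP)
    return {
        "nonce": nonce,
        "encrypted_tls_key": encrypted_tls_key,
        "encrypted_data_key": encrypted_data_key,
    }


def open_private_key(record, wrapping_private_key):
    """Runs on a TLS-terminating host, the only place the RSA private key lives."""
    data_key = wrapping_private_key.decrypt(record["encrypted_data_key"], OAEP)
    return AESGCM(data_key).decrypt(
        record["nonce"], record["encrypted_tls_key"], None
    )


# Demo with a throwaway RSA-2048 key pair and a placeholder payload.
wrapping_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
record = seal_private_key(b"placeholder TLS private key", wrapping_key.public_key())
assert open_private_key(record, wrapping_key) == b"placeholder TLS private key"
```

Everything in `record` is safe to store side by side, because recovering the AES key requires the RSA private key, which never touches the storage side.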

[Diagram: hybrid encryption of SSL private keys]

The only Guarantee that remains is #3. If key generation were compromised, an attacker would be able to grab private keys before they were encrypted and stored. If Let’s Encrypt communication were compromised, an attacker could use our keys to generate certificates for domains we’ve already authorized (they could technically authorize new ones, but that would be significantly more difficult) or even revoke certificates. Both of these cases would render the entire system untrustworthy. Instead, we limit this functionality to CertService and expose it as an API; that way, if the web server handling Pattern requests were broken into, the attacker would not be able to affect critical Let’s Encrypt flows.

[Diagram]

One of our stretch goals is to look into deploying hardware security modules (HSMs). If there are bugs in the underlying software, the integrity of the entire system could be compromised, voiding any guarantees we try to keep. While bugs are inevitable, moving critical cryptographic functions into secure hardware will mitigate their impact.

No cryptosystem is perfect, but we’ve reached our goal of significantly increasing attacker cost. On top of that, we’ve supplemented the cryptosystem with our usual host-based alerting and monitoring. So not only will an attacker have to jump through several hoops to get those SSL key pairs, they will also have to do it without being detected.

After the build

With all of that wrapped up, we had a system to issue large numbers of certificates, securely store this data, and terminate TLS requests with it. At this point Etsy brought in a third-party security firm to do a round of penetration testing. This process finished up without finding any substantive security weaknesses, which gives us an added level of confidence in our system.

Once we’ve gained enough confidence, we will enable HSTS. This should be the final goal of any SSL rollout as it forces browsers to use encryption for all future communication. Without it, downgrade attacks could be used to intercept traffic and hijack session cookies.
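HSTS is delivered as a single response header, `Strict-Transport-Security`. Here is a small Python sketch of building it. The helper name and the ramp-up schedule are illustrative assumptions, not a description of how Pattern will roll it out.

```python
def hsts_header(max_age_seconds, include_subdomains=False):
    """Build the (name, value) pair for a Strict-Transport-Security header."""
    value = f"max-age={max_age_seconds}"
    if include_subdomains:
        value += "; includeSubDomains"
    return ("Strict-Transport-Security", value)


# Assumed ramp: five minutes, then a day, then a year, as confidence grows.
for max_age in (300, 86_400, 31_536_000):
    print(hsts_header(max_age))
```

Starting with a short `max-age` keeps mistakes recoverable, since browsers remember the policy for that long. On a seller's custom domain, `includeSubDomains` would also cover subdomains that are not served by Pattern, so the sketch leaves it off by default.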

Every Pattern site and linked domain now has a valid certificate stored in our system and ready to facilitate secure requests. This feature is rolled out across Pattern and all Pattern traffic is now pushed over HTTPS!

(This project was a collaboration between Ram Nadella, Andy Yaco-Mink and Nick Steele on the Pattern team; Omar and Ken Lee from Security; and Will Gallego from Operations. Big thanks to Dennis Olvany, Keyur Govande and Mike Adler for all their help.)

Category: engineering, infrastructure, security

26 Comments

The way I read the article, it appears that every apache server has every customer’s certificate. Is this the case? That seems like a lot of certs per server. Is there a way to segment the solution so that the certs are divided evenly between the apache servers?

    Hi Pat, we do keep all the certificates on each server in the proxy pool. Segmenting the certificate storage would mean we’d also need some kind of routing table so that each domain’s traffic went to the appropriate server. This is difficult traffic to route as the routing needs to happen before TLS termination happens.

    I’m also going to maintain all certs on all SSL proxy hosts. IMO, the best way to achieve this is an NFS-based mount on each server, but if you find that it needlessly adds complexity, then simply rsync your certs to all proxy hosts. I think this would be much easier than creating a routing table for customer domains.

Thanks for the great article and congrats with the migration to HTTPS at such a large scale.
Two questions though: anno 2017, what’s the reason you’re still using Apache instead of Nginx as proxy? And why did you choose to generate RSA 2048 certificates instead of the faster, smaller, more efficient and more secure ECDSA (ECC) certificates?

Great post!!

Wonderful and extremely helpful post. I was looking for a solution exactly like this from last couple of weeks. Seems like you guys nailed it. I’m going to try it out now.

Is there any community or direct channel where I could ask questions that may arise during implementation?
Thanks a lot.

Very informative! Great work!

Excellent work and effort! One thing I’m still a little confused about: does the SSL proxy host pass the request internally to one of our web servers using HTTPS or plain HTTP? What if we simply use HTTP for internal requests to our web servers?

    Hi Humanyu, you can use https or http for the internal requests. Only the external requests need to present the public Let’s Encrypt certificate for the domain. Your internal requests could use https with certificates created by your own internal certificate authority, or with self-signed certificates.

Great read, even for non-sysadmins!

Great post! thanks for sharing your experiences.
I am also building similar solution for my company by using nginx. It’s great to see that I am on the right track.

Any chance that we will see CertService released as an open source project? This seems to be the missing admin layer that would be critical for a company to deploy at scale such as yours.

I’m primarily interested if you avoid the need for hosting HTTP services for validation, which I find concerning, and would rather script the “manual” validation of the LE interaction, which I would be fine with.

One more thing where I’m a bit confused. If on every SSL Proxy host side, we write,
ProxyPass / https://internal-web-vip/
ProxyPassReverse / https://internal-web-vip/

How does an SSL host know to which webserver in a webserver cluster it needs to pass the internal request? Shouldn’t we have to put another load balancer between the SSL hosts-cluster and the Webserver hosts-cluster?

    Yes, in our case here `internal-web-vip` is a virtual IP on the same load balancer, and that balances amongst the webserver hosts.

Which library should I use on SSL Proxy hosts side to decrypt the encrypted AES and SSL keys. And also how to encrypt them effectively on CertService side? Can you suggest a tutorial or documentation from where I could get help for encrypting & decrypting these keys effectively?

Isn’t your system affected by latest rate limit changes – https://letsencrypt.org/docs/rate-limits/? It seemed to me that on your scale it may be an issue.

    These rate limits are almost all at the domain level, or apply to certificates with the same set of domains, so we’re not in much danger of hitting them. For the account level rate limits around pending authorizations, we do have some monitoring internally to make sure we stay clear of them. We haven’t had a problem with this so far though.

One thing… You said the SSL private keys for each domain is encrypted in your CertService DB using AES256-key which is also encrypted using PublicKey of certservice and this SSLPrivateKey can only be decrypted by SSLProxy hosts using CertService PrivateKey, which seems logical. So how can CertService renew certificates for domains because the service itself can not decrypt the PrivateKey of a domain, yet it needs the domain’s KeyPair for renewal of certificates. That’s where I’m little confused as how to achieve this?
Would it be safe to store an un-encrypted SSLPrivateKey copy in local disk too? Wouldn’t it be a security flaw and redundancy issue later?

    So, there isn’t really a “renew” operation. Some Certificate Authorities offer renewals, but Let’s Encrypt does not, nor does the ACME protocol require it. So while clients think about a “renewal”, it’s really just the same authorization and issue steps you’d go through if you were getting a cert for a domain for the first time. The upshot is that CertService doesn’t need to decrypt the private key.

      Well, are you sure getting a fresh new certificate every three months wouldn’t cause any issues? If you guys are already doing it then it’s good, I can try that too. However, I did see certificate renewal code in the acmephp.phar file, so it does seem to do proper renewals.
      And here’s the quote from acmephp documentation…
      “Note: You only need to prove once that you own a domain (certificates renewals won’t require it), as long as you keep the same account key.”

      But again, I do see simplicity in getting a fresh new certificate and keypair for a domain rather than get the renewal.

One last thing, what’s the best practice to store “Use VHost custom-domain.com” statements in a file? I mean we can have them in DB and then slap thousands of them in a separate file which gets Included in the mod_macro template file. We can create a fresh new file and remove the old one whenever update happens.
Is that a right way to do it on sslproxy hosts? Can you suggest a better way?