Scaling User Security
Summer is ending
New security features
Sweeping in like Fall
The Etsy Security Team is extremely happy to announce the simultaneous release of three important security features: two factor authentication, full site SSL support, and viewable login history data. We believe that these protections are industry best practice, and we’re excited to offer them proactively to our members on an opt-in basis as a further commitment to account safety. A high-level overview of the features is available here, while on Code as Craft we wanted to talk a bit more about the engineering that went into the SSL and two factor authentication features.
Rolling out Full Site SSL
When we initially discussed making the site fully accessible over SSL, we thought it might be a simple change given our architecture at the time. Back then, we relied on our load balancers both to terminate SSL and to maintain the logic as to which pages were forced to HTTPS, which were forced to HTTP, and which could be either. To test our “simple change” hypothesis, we attempted to make the site fully SSL by disabling the load balancer rules that forced some pages down to HTTP. After this triggered a thrilling explosion in the error logs, we realized things weren’t going to be quite that easy.
The first step was to make our codebase HTTPS friendly. This meant cleaning up a significant number of hard-coded “http://” links, and making all our URL-generating functions HTTPS aware. In some cases this meant taking advantage of scheme-relative URLs. We also needed to verify that our image storage locations and our various CDNs could all play nicely with SSL.
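To illustrate the scheme-relative idea, here is a minimal Python sketch. It is not the code from our codebase; the function name `to_scheme_relative` and the example hostname are assumptions made purely for illustration. It drops an explicit “http” or “https” scheme so the browser reuses whatever scheme the page itself was loaded over.

```python
from urllib.parse import urlsplit, urlunsplit


def to_scheme_relative(url: str) -> str:
    """Turn an absolute http(s) URL into a scheme-relative one.

    Hypothetical helper for illustration only. URLs without an
    http/https scheme (relative paths, mailto:, etc.) are returned
    unchanged.
    """
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        return url
    # An empty scheme with a non-empty netloc renders as "//host/path".
    return urlunsplit(("", parts.netloc, parts.path, parts.query, parts.fragment))


if __name__ == "__main__":
    print(to_scheme_relative("http://img.example.com/a.png?x=1"))
```

A scheme-relative link only helps if the host behind it serves the same resource over both HTTP and HTTPS, which is why verifying the image storage locations and CDNs was part of the same step.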
Next up was moving the logic for enforcing whether a URL could be HTTP, HTTPS, or both from the load balancer to the application itself. With the logic on the load balancer, adding or changing a rule went like this:
- Engineer makes a ticket for ops
- Ops logs into the load balancer admin interface
- Ops needs to update the rule in three places (our development, preprod, and prod environments)
- Ops lets engineer know rule was updated, hopes all is well
- In case of rollback, ops has to go back into admin interface, find the rule (the admin interface did not sort them in any meaningful way), change the rule back, and hope for the best
In this flow, changes have to be tracked through a change ticket system rather than source control, and there is no way for other engineers to see what has been updated. Why is application logic in our load balancers anyway? Wat?
To address these issues, we moved all HTTPS vs HTTP logic into the web server via a combination of .htaccess rules and hooks in our controller code. This new approach provided us with far greater granularity on how to apply rules to specific URLs. Now we can specify how URLs are handled for groups of users (sellers, admins, etc.) or even individual users instead of using load balancer rules in an all-or-nothing global fashion. Finally, the move meant all of this code now lives in git, which enables transparency across the organization.
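As a rough sketch of what application-level scheme rules can look like, the Python below checks a request against an ordered rule list and returns a redirect target when the scheme is wrong. Everything here is hypothetical: the `Rule` structure, the example paths, the group names, and the `full_ssl_opt_in` flag are assumptions for illustration, not our actual .htaccess rules or controller hooks.

```python
from dataclasses import dataclass
from typing import Optional, Set

HTTP_ONLY, HTTPS_ONLY, EITHER = "http", "https", "either"


@dataclass(frozen=True)
class Rule:
    path_prefix: str
    policy: str
    group: Optional[str] = None  # None means the rule applies to everyone


# Hypothetical rules; the first matching rule wins.
RULES = [
    Rule("/signin", HTTPS_ONLY),
    Rule("/", HTTPS_ONLY, group="admin"),
    Rule("/", EITHER),
]


def redirect_target(path: str, scheme: str, host: str, groups: Set[str],
                    full_ssl_opt_in: bool = False) -> Optional[str]:
    """Return the URL to redirect to, or None if the scheme is acceptable."""
    if full_ssl_opt_in:
        # A per-user setting overrides the URL rules entirely.
        policy = HTTPS_ONLY
    else:
        policy = EITHER
        for rule in RULES:
            applies = rule.group is None or rule.group in groups
            if applies and path.startswith(rule.path_prefix):
                policy = rule.policy
                break
    if policy == EITHER or policy == scheme:
        return None
    return f"{policy}://{host}{path}"
```

Because rules like these are ordinary code and data, a change or a rollback is a normal commit that anyone can review, rather than a ticket and a trip through an admin interface.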
HSTS is a new header that instructs browsers to only connect to a specific domain over HTTPS in order to provide a defense against certain man-in-the-middle attacks. As part of our rollout we are making use of this header when a user opts in to full site SSL. Initially we are setting a low timeout value for HSTS during rollout to ensure things operate smoothly, and we’ll be increasing this value via a config push over time once we’re confident there will be no issues.
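A minimal sketch of the HSTS behavior described above, in Python. The function name, the `opted_in` flag, and the 300-second value are assumptions for illustration; the only parts taken from the HSTS specification are the `Strict-Transport-Security` header name and its `max-age` directive, which is expressed in seconds.

```python
# Hypothetical config value: a deliberately short max-age during rollout,
# meant to be raised later through configuration.
HSTS_MAX_AGE_SECONDS = 300


def hsts_headers(is_https: bool, opted_in: bool,
                 max_age: int = HSTS_MAX_AGE_SECONDS) -> dict:
    """Return the HSTS header for a response, or no headers at all.

    The header is only emitted on HTTPS responses (browsers ignore it
    over plain HTTP) and only for members who opted in to full site SSL.
    """
    if not (is_https and opted_in):
        return {}
    return {"Strict-Transport-Security": f"max-age={max_age}"}
```

Keeping `max_age` small at first limits how long a browser will insist on HTTPS if something goes wrong, since a browser remembers the policy until the max-age expires.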
Why not use full site SSL for all members and visitors?
First and foremost, rolling out SSL as the default for all site traffic is something we’re actively working on and feel is the best practice to be striving towards. As with any large scale change in site-wide traffic, capacity and performance are significant concerns. Our goal with making this functionality available on an opt-in basis at first is to provide it to those members who use riskier shared network mediums such as public WiFi. Going forward, we’re analyzing metrics around CDN performance (especially for our international members), page performance times of SSL vs non-SSL, and overall load balancer SSL capacity. When we’re confident in the performance and capacity figures, we’re excited to continue moving towards defaulting to full site SSL for all members and visitors.
Two factor authentication
Our main focus during the course of our two factor authentication project (aside from security) was how to develop and apply metrics to create the best user experience possible over the long term. Specifically, the questions we wanted to be able to answer about the voice call/SMS delivery providers we selected were:
- “Does provider A deliver codes faster than provider B?”
- “Does provider A deliver codes more reliably than provider B?”
- “Can we easily swap out provider A for provider C? or D? or F?”
From the beginning we decided that we did not want to be tied to a single provider, so abstraction was critical. We went about achieving this in two ways:
- Only relying on the provider for transmission of the code to a member. All code generation and verification is performed in our application, and the providers are simply used as “dumb” delivery mechanisms.
- Abstracting our code to keep it as generic and provider-agnostic as possible. This makes it easy to swap providers around and plug in new ones whenever we wish.
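The two points above can be sketched as follows. This is an illustrative Python outline under stated assumptions, not our production code: the `DeliveryProvider` interface, the `RecordingProvider` stand-in, the six-digit code length, and the message text are all hypothetical. Code storage and expiry are left out to keep the sketch short.

```python
import hmac
import secrets
from typing import List, Protocol, Tuple


class DeliveryProvider(Protocol):
    """Anything that can carry a message to a phone number."""
    name: str

    def send(self, phone_number: str, message: str) -> None: ...


class RecordingProvider:
    """Stand-in provider that records messages instead of sending them."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.sent: List[Tuple[str, str]] = []

    def send(self, phone_number: str, message: str) -> None:
        self.sent.append((phone_number, message))


def generate_code(digits: int = 6) -> str:
    """Generate the code in the application, never at the provider."""
    return "".join(secrets.choice("0123456789") for _ in range(digits))


def verify_code(expected: str, submitted: str) -> bool:
    """Verify in the application, using a constant-time comparison."""
    return hmac.compare_digest(expected.encode(), submitted.encode())


def send_login_code(provider: DeliveryProvider, phone_number: str) -> str:
    """Use the provider purely as a "dumb" delivery mechanism."""
    code = generate_code()
    provider.send(phone_number, f"Your verification code is {code}")
    return code
```

Since the provider only ever sees a finished message, swapping one provider for another means writing one more class with a `send` method and nothing else.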
Metrics and performance testing
There are two main provider metrics we analyze when it comes to signins with 2FA:
- Time from code generation to signin (aka: How long did the provider take to deliver the code?)
- Number of times a code is requested to be resent (aka: Was the provider reliable in delivering the code?)
These metrics allow us to analyze a provider’s quality over time, and allow us to make informed choices such as which provider we should use for members in specific geographical locations.
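One way to picture these two metrics is a small in-memory aggregator, sketched below in Python. The class name, the method names, and the choice of the median as the summary statistic are assumptions for illustration; a real deployment would more likely send these values to an existing metrics system.

```python
from collections import defaultdict
from statistics import median
from typing import Optional


class ProviderMetrics:
    """Hypothetical per-provider tally of the two signin metrics."""

    def __init__(self) -> None:
        self._seconds_to_signin = defaultdict(list)
        self._signins = defaultdict(int)
        self._resends = defaultdict(int)

    def record_signin(self, provider: str, generated_at: float,
                      signed_in_at: float, resend_count: int) -> None:
        # Metric 1: time from code generation to signin.
        self._seconds_to_signin[provider].append(signed_in_at - generated_at)
        # Metric 2: how often the member asked for the code to be resent.
        self._signins[provider] += 1
        self._resends[provider] += resend_count

    def summary(self, provider: str) -> Optional[dict]:
        signins = self._signins.get(provider, 0)
        if signins == 0:
            return None
        return {
            "median_seconds": median(self._seconds_to_signin[provider]),
            "resends_per_signin": self._resends[provider] / signins,
        }
```

Comparing the summaries for two providers over the same period gives a direct answer to the "faster" and "more reliable" questions.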
In order to collect this data from multiple providers, we make heavy use of A/B testing. This approach lets us easily balance the usage of providers A and B one week, and B and C the next. Finally, from a single point of failure (SPOF) and resiliency standpoint, this architecture also makes it painless to fail over to another provider if one goes down.
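A hedged sketch of how provider assignment and failover could fit together, in Python. The hashing scheme, the `experiment` label, and the `send` callback are assumptions for illustration and do not describe our A/B testing framework.

```python
import hashlib
from typing import Callable, Sequence


def assign_provider(member_id: int, providers: Sequence[str],
                    experiment: str) -> str:
    """Deterministically bucket a member into one of the providers.

    Hashing the experiment label with the member id keeps a member on
    the same provider for the life of one experiment, while a new label
    (for example, next week's test) reshuffles the buckets.
    """
    digest = hashlib.sha256(f"{experiment}:{member_id}".encode()).digest()
    return providers[int.from_bytes(digest[:8], "big") % len(providers)]


def send_with_failover(member_id: int, providers: Sequence[str],
                       experiment: str, send: Callable[[str], None]) -> str:
    """Try the assigned provider first, then fall back to the others.

    `send` is a hypothetical callback that raises if delivery through the
    named provider fails. Returns the provider that succeeded.
    """
    first = assign_provider(member_id, providers, experiment)
    ordered = [first] + [p for p in providers if p != first]
    for provider in ordered:
        try:
            send(provider)
            return provider
        except Exception:
            # Broad catch for the sketch only: any failure moves on
            # to the next provider.
            continue
    raise RuntimeError("all providers failed")
```

Changing which providers are under test is then a matter of changing the `providers` list and the `experiment` label, and an outage at one provider degrades to the next one in the list.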
In closing, we hope you’ll give these new features a shot by visiting your Security Settings page, and we’re excited to continue building proactive security mechanisms for our members.
This post was co-written by Kyle Barry and Zane Lackey on behalf of the Etsy Security Team.