lloyd.io - Persona Architectural Changes

2012-06-15 00:00:00 -0700

The Persona Login service lets web developers implement seamless login via email address with a trivial amount of code. It is written in NodeJS and is supported by Mozilla. The service is deployed as several distinct NodeJS processes. Recently we’ve added a new process to the service, and this short post will describe what’s changed and why.

This post is targeted at interested community members, and the people who build, test, deploy, and maintain Persona.

Previous Software Layout

The Persona deployment previously consisted of the following processes and responsibilities:

browserid - The main entry point for all web requests. It handles all read-only API requests, serves static files, and forwards API requests that need either access to our private “domain” key (certificate signing requests) or write access to the database.
keysigner - The only process in the system that requires read access to the domain key, which it uses to handle key-signing requests.
verifier - A process that performs verification of assertions.
dbwriter - A process that handles all API requests which require write access to the database.

In addition to these NodeJS processes, we have quite a bit of infrastructure for high availability in our layout as well. Roughly, here’s what Persona looked like:

-> Persona Arch Before <-

Adding router

As the description above shows, the browserid process had a large and diverse set of responsibilities. Repeatedly in load testing we have seen that process become CPU bound, and it is the current bottleneck of the system. Further, we have begun work on countermeasures to keep the service available during malicious network attacks. Both for performance and to create a place to run countermeasures, we’ve introduced a new process that subsumes some responsibilities of browserid: she’s called router.

The new process and responsibility breakdown is as follows:

router - The main entry point for all web requests, responsible for forwarding API requests.
browserid - Responsible for handling all read-only API requests, static file serving, and forwarding to the keysigner when certification is required.
keysigner - The only process in the system that requires read access to the domain key to execute key-signing requests.
verifier - A process that performs verification of assertions.
dbwriter - A process that handles all API requests which require write access to the database.
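
One reading of the breakdown above can be sketched as two small dispatch functions, one for the decision made at router and one for the decision made at browserid. This is an illustration only: the request flags (`needsDbWrite`, `needsDomainKey`) and the function names are hypothetical and are not taken from the Persona source. The verifier is left out because this post does not describe how requests reach it.

```javascript
// Hypothetical request descriptor: { needsDbWrite: boolean, needsDomainKey: boolean }.
// The flags are illustrative; a real service would derive them from the API being called.

// Decision made at router, the entry point for all web requests.
function routeAtRouter(req) {
  // Requests that need database write access go to dbwriter.
  if (req.needsDbWrite) return 'dbwriter';
  // Everything else (read-only API calls, static files) goes to browserid.
  return 'browserid';
}

// Decision made at browserid once a request arrives from router.
function routeAtBrowserid(req) {
  // Certification needs the domain key, which only keysigner can read.
  if (req.needsDomainKey) return 'keysigner';
  // Read-only API calls and static files are handled by browserid itself.
  return 'local';
}
```

Under these assumptions, a read-only request passes through router to browserid and is answered there, while a certification request takes one more hop to keysigner.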

As you can see, the responsibility shift is extremely minor in this first push to introduce router. We have not yet introduced the countermeasures mentioned above, nor have we moved all of the orthogonal responsibilities out of the browserid process. These changes are planned for future trains (with 2-week development cycles, the only way to roll is incrementally).

The new software deployment looks roughly like this:

-> Persona Arch After <-

Deployment Considerations

As you might expect, only one part of the Persona deployment is actually about NodeJS: equally important are the mechanisms that start the programs, display statistics and health, and monitor them so that we can proactively discover problems before they impact users. The next several sections run through the new deployment architecture from these different perspectives.

Monitoring

The router has no interaction with our database servers; its primary function at this point is to forward requests to the appropriate process. Monitoring should initially target non-200 HTTP responses. While 4xx responses are part of normal operation, 5xx responses should be monitored and alerted on.
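
The monitoring rule above amounts to sorting each response status into a bucket. A minimal sketch follows, assuming hypothetical bucket names (`ok`, `expected`, `alert`, `watch`) that are not part of any Persona tooling:

```javascript
// Classify an HTTP status code according to the monitoring guidance above.
// The bucket names are illustrative assumptions.
function classifyStatus(status) {
  if (status >= 500 && status <= 599) return 'alert';    // server errors: monitor and alert
  if (status >= 400 && status <= 499) return 'expected'; // client errors: normal operation
  if (status === 200) return 'ok';
  return 'watch'; // any other non-200 response is still worth tracking
}

// Count how many responses in a list of status codes warrant an alert.
function countAlerts(statuses) {
  return statuses.filter((s) => classifyStatus(s) === 'alert').length;
}
```

For example, `countAlerts([200, 404, 503, 500])` yields 2, since only the two 5xx responses are alert-worthy.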

In addition to monitoring HTTP responses, the router emits statsd events over UDP. These are all name-spaced under browserid.router., and the most interesting stats emitted are API invocation counts, under browserid.router.wsapi.*.
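
To make the naming concrete, here is a rough sketch of how one API invocation count could be formatted as a statsd counter. The `name:value|c` line is the standard statsd counter format; the helper name `counterLine` and the API name `example_api` are made up for illustration and do not come from the Persona code.

```javascript
// Prefix taken from the namespace described above.
const WSAPI_PREFIX = 'browserid.router.wsapi.';

// Build a statsd counter line for one (hypothetical) API name.
// statsd counters use the wire format "<bucket>:<value>|c".
function counterLine(apiName, count = 1) {
  return `${WSAPI_PREFIX}${apiName}:${count}|c`;
}

console.log(counterLine('example_api')); // browserid.router.wsapi.example_api:1|c
```

A line like this would then be sent as a single UDP datagram, for instance with Node's `dgram` module, to wherever statsd is listening (8125 is statsd's default port; the real deployment may differ).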

Health Checks

Load balancers need an externally accessible health check URL to verify the health of a component. The browserid process exports a ping API that can be hit via router. Because these two processes run on the same machine, they should be considered a single block. The ping API verifies the ability of router to forward requests, the ability of browserid to process API requests, and the ability of browserid to access the database.
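
The layering described above, where a single ping exercises router, browserid, and the database in turn, can be sketched roughly as follows. Everything here is an assumption made for illustration: `db.ping`, `forward`, and the 500 and 502 status codes are hypothetical stand-ins, not the actual Persona implementation.

```javascript
// browserid side: report healthy only if the database answers.
// `db.ping` is a hypothetical async method that rejects when the database is unreachable.
async function browseridPing(db) {
  try {
    await db.ping();
    return 200;
  } catch (err) {
    return 500; // browserid is up, but it cannot reach the database
  }
}

// router side: forward the ping to browserid and relay its status.
// `forward` is a hypothetical async function standing in for the HTTP hop;
// if it rejects, router could not reach browserid at all.
async function routerPing(forward) {
  try {
    return await forward();
  } catch (err) {
    return 502; // forwarding itself failed
  }
}
```

A load balancer that sees anything other than 200 from this one URL learns that some link in the chain is broken, which is why the two processes are treated as a single unit.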

Logging

What you’ll see in the router’s log file is mostly reports of forwarded requests, since forwarding is all it does. Any error-level messages are of great interest.

Expected Performance Impact

The router removes HTTP forwarding from the most heavily loaded process in the system. Now that browserid is no longer responsible for request forwarding, we expect to see a slightly higher possible ADU in load testing, along with greater total processor usage on our machines.

Upcoming Work

The static process

An immediate desire is to introduce another service for static file serving. This service would be responsible for serving all of our static content. This includes both:

views: HTML pages with no dynamic content that do not vary between deployments but change every 2 weeks, and hence have a max-age of 0 to force cache revalidation, so that caches update instantly after deployment.
static resources: non-HTML resources that are served from perma-urls with a hash embedded in the URL, having far-future expiration dates.
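
The two caching policies above could be expressed as a small helper that picks a `Cache-Control` value per content type. This is a sketch under stated assumptions: the `kind` labels and the function name are hypothetical, and one year is just one common choice for a far-future lifetime, since the post does not give an exact value.

```javascript
// One year in seconds, used here as an assumed "far-future" lifetime.
const ONE_YEAR_SECONDS = 365 * 24 * 60 * 60;

// Pick a Cache-Control header value for the two kinds of static content.
function cacheControlFor(kind) {
  switch (kind) {
    case 'view':
      // Views change with each deployment, so caches must revalidate every time.
      return 'max-age=0';
    case 'static':
      // Hashed URLs change whenever the content changes, so long caching is safe.
      return `max-age=${ONE_YEAR_SECONDS}`;
    default:
      throw new Error(`unknown content kind: ${kind}`);
  }
}
```

The design relies on the hash in the URL: a changed resource gets a new URL, so stale copies of the old one are never requested again.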

A CDN

All static content should be moved off to a Mozilla-controlled HTTPS CDN for lower latency and faster load times. The static process discussed above would then be nothing more than a CDN origin server. We will soon deploy a change that will move all static content off to a different hostname to prepare for the introduction of a CDN.

Anti abuse heuristics

Now that router is in the works, we can move forward and begin to implement anti-abuse heuristics for different types of live network attacks (the initial focus will be on (D)DoS).