Scaling a Production Application

One of the most exciting things you can see as a product owner is growth in your customer base. PassiveTotal has grown considerably in the last several months, and that has forced us to scale our services quickly. If you've used the application in the past month, chances are you noticed slower responses, so we wanted to outline a bit of what we did to scale our services to meet demand.

Prior to the RiskIQ acquisition, we ran all of our operations out of New York datacenters through some smaller VPS providers. Our web front-end talked to our replicated database, and it worked well for the client load we had at the time, but it wasn't perfect. There was no load balancing, and if the primary front-end went down, we were offline. Fortunately, we had plenty of monitors in place to detect issues, but what we really needed was another front-end server to process requests.

Fast-forward to September 2015 and we were busy getting settled into our new RiskIQ home. One of our goals for the year was to move all our services from our VPS hosts over to AWS, all while onboarding a considerable number of new customers. Moving the non-client-facing servers was quick, but the production application itself was a bit tricky.

Scaling the Front-end

In order to get our data to AWS without taking down any nodes, we had to set up mirror database nodes inside of AWS and then sync our New York-based servers over to our new California database mirrors. This added extra latency to requests and put load on the existing database servers, since they were constantly writing data across the country. From the client's point of view, this meant sluggish performance.
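For illustration only, here's roughly what that mirroring step looks like if you picture the backend as a MongoDB replica set; the database, hostnames, and settings below are made up for the sketch, not our actual setup.

```python
# Hypothetical sketch: adding AWS-hosted mirror nodes to an existing replica set.
# Assumes MongoDB/pymongo purely for illustration; hostnames are invented.
from pymongo import MongoClient

client = MongoClient("mongodb://db-primary.ny.example.com:27017")

# Fetch the current replica set configuration from the New York primary.
config = client.admin.command("replSetGetConfig")["config"]
config["version"] += 1

next_id = max(m["_id"] for m in config["members"]) + 1
for host in ("db-mirror-1.aws.example.com:27017", "db-mirror-2.aws.example.com:27017"):
    config["members"].append({
        "_id": next_id,
        "host": host,
        "priority": 0,  # mirrors only for now: keep them from being elected primary
    })
    next_id += 1

# Push the new configuration; the AWS nodes begin their initial sync across the
# country, which is the extra replication load (and latency) described above.
client.admin.command("replSetReconfig", config)
```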

With database nodes in place, we moved on to setting up additional load-balanced web front-end servers in AWS to provide high availability for the application. This exercise also revealed the need to move a lot of local resources, like website images, from the server itself to services like S3. Once the new front-ends were live, we began testing with Amazon's DNS service (Route 53), slowly routing our users from New York to our new load-balanced front-ends in California.
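For those curious what that gradual DNS cutover looks like in practice, here's a hedged sketch using Route 53 weighted records via boto3; the zone ID, hostnames, and weights are invented for illustration rather than pulled from our configuration.

```python
# Hypothetical sketch of a gradual DNS cutover using Route 53 weighted records.
# Zone ID, record names, and targets are made up for illustration.
import boto3

route53 = boto3.client("route53")

def shift_traffic(zone_id: str, record: str, ny_weight: int, aws_weight: int) -> None:
    """Upsert two weighted CNAMEs so roughly aws_weight/(ny_weight+aws_weight)
    of lookups resolve to the new load-balanced front-ends in AWS."""
    route53.change_resource_record_sets(
        HostedZoneId=zone_id,
        ChangeBatch={
            "Comment": "Gradually route users from New York to the AWS front-ends",
            "Changes": [
                {
                    "Action": "UPSERT",
                    "ResourceRecordSet": {
                        "Name": record,
                        "Type": "CNAME",
                        "SetIdentifier": "ny-frontend",
                        "Weight": ny_weight,
                        "TTL": 60,
                        "ResourceRecords": [{"Value": "frontend.ny.example.com"}],
                    },
                },
                {
                    "Action": "UPSERT",
                    "ResourceRecordSet": {
                        "Name": record,
                        "Type": "CNAME",
                        "SetIdentifier": "aws-frontend",
                        "Weight": aws_weight,
                        "TTL": 60,
                        "ResourceRecords": [{"Value": "frontend-elb.us-west-1.elb.amazonaws.com"}],
                    },
                },
            ],
        },
    )

# Start by sending ~10% of lookups west, then ratchet the weights up as testing holds.
shift_traffic("Z0000000EXAMPLE", "www.example.com.", ny_weight=90, aws_weight=10)
```

The nice property of weighted records is that rolling back is just setting the AWS weight back to zero.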

Having obtained good results, we moved all client requests over to our new web front-ends, promoted our AWS database nodes to be the primary database processors, and downgraded the New York servers to just mirrors. After a week of no issues, we severed ties to New York and saw a considerable boost in performance, since the entire application now lived in one place.
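Sticking with the hypothetical MongoDB sketch from above, that promotion and demotion step boils down to a priority reshuffle followed by stepping down the old primary; again, hostnames and values are illustrative only.

```python
# Hypothetical continuation of the sketch above: promote the AWS nodes and
# demote New York to mirror duty. Hostnames are invented.
from pymongo import MongoClient
from pymongo.errors import ConnectionFailure

client = MongoClient("mongodb://db-primary.ny.example.com:27017")
config = client.admin.command("replSetGetConfig")["config"]
config["version"] += 1

for member in config["members"]:
    if member["host"].endswith(".aws.example.com:27017"):
        member["priority"] = 2   # favored in the next election
    else:
        member["priority"] = 0   # New York stays as a non-electable mirror

client.admin.command("replSetReconfig", config)

# Ask the current (New York) primary to step down so an AWS node is elected.
try:
    client.admin.command("replSetStepDown", 60)
except ConnectionFailure:
    # The connection to the old primary can drop when it steps down; expected.
    pass
```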

Scaling the API

About the same time we switched our application to 100% AWS, a new issue began to creep in. During certain peak customer hours, our website would slow to a crawl, making it difficult to use. The issue was ultimately traced back to the API, which lived on the same hosts as the web servers. Users were pounding away at the API, forcing Apache to spawn more instances of our application into memory, which in turn dragged down web performance.

Knowing what had to be done, we spun up two new API boxes and a separate load balancer to handle just the API. At the same time, we refactored a lot of the existing API code to run completely on its own, which meant smaller memory footprints and faster response times. In roughly two days, we were able to put routing rules in place to direct all API requests to our new load-balanced nodes. The results were dramatic, and the website hasn't dealt with issues since.
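As a rough boto3 sketch (instance IDs, names, zones, and the health endpoint are invented), standing up a dedicated API balancer and registering the two new boxes looks something like this.

```python
# Hypothetical sketch of the dedicated API tier: a separate load balancer in
# front of two new API-only instances. All names and IDs are made up.
import boto3

elb = boto3.client("elb", region_name="us-west-1")

# A classic ELB that serves only API traffic, keeping it off the web front-ends.
elb.create_load_balancer(
    LoadBalancerName="api-lb",
    Listeners=[{
        "Protocol": "HTTP",
        "LoadBalancerPort": 80,
        "InstanceProtocol": "HTTP",
        "InstancePort": 80,
    }],
    AvailabilityZones=["us-west-1a", "us-west-1c"],
)

# Health-check the API directly so a sick node gets pulled out of rotation.
elb.configure_health_check(
    LoadBalancerName="api-lb",
    HealthCheck={
        "Target": "HTTP:80/api/v1/ping",  # hypothetical health endpoint
        "Interval": 30,
        "Timeout": 5,
        "UnhealthyThreshold": 2,
        "HealthyThreshold": 2,
    },
)

# Register the two new API-only boxes behind it.
elb.register_instances_with_load_balancer(
    LoadBalancerName="api-lb",
    Instances=[{"InstanceId": "i-0aaa1111"}, {"InstanceId": "i-0bbb2222"}],
)
```

From there, the routing rules are a matter of pointing the API hostname (or proxying API paths) at this balancer instead of the web servers.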

Sleeping Better

This post was a little dry, but I felt it was important to have out there. Moving your operations from one side of the country to the other without going down, all while shipping new features, fixing bugs, and onboarding clients, is difficult. It wasn't unusual to get alerts at 4AM telling me one of our front-ends was under too much load and needed tending. That alone was enough to keep anyone on edge, but harder still was knowing our customers weren't getting the best experience.

Now that our operations live inside AWS, we can easily scale to demand as we continue to grow. Additionally, we don't have to worry about node failure, as we have numerous points of backup in our operations and detailed monitors to let us know when something goes wrong. I personally appreciate the thousands of users who put up with the slower speeds for the past month and look forward to scaling out into the future.

Quick Update - Right after hitting the publish button on this blog, we got an alert that our primary database node was crossing into 90%+ CPU load. A quick look at our graphs for the past month showed a steady increase in CPU activity that now needed to be addressed. What once would have been a difficult process took literally two minutes with our new hosting. We simply removed a secondary node, doubled its settings, told our database to elect it primary, and walked away. Instant scale.
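For anyone wondering what "doubled its settings" means in practice, it's just a vertical resize of the instance; here's a hedged boto3 sketch where the instance ID and types are made up.

```python
# Hypothetical sketch of the two-minute fix: stop a secondary, double its
# instance size, and bring it back so it can be elected primary.
import boto3

ec2 = boto3.client("ec2", region_name="us-west-1")
instance_id = "i-0ccc3333"  # the secondary database node (invented ID)

# Instances must be stopped before their type can be changed.
ec2.stop_instances(InstanceIds=[instance_id])
ec2.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])

# Roughly "double its settings": bump to a type with twice the CPU and RAM.
ec2.modify_instance_attribute(
    InstanceId=instance_id,
    InstanceType={"Value": "m4.2xlarge"},
)

ec2.start_instances(InstanceIds=[instance_id])
ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])

# Once the node rejoins the replica set, a priority bump and a step-down of the
# old primary (as in the earlier sketch) hands it the primary role.
```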