This is the third blog post in our blog series where we explain the physical infrastructure for our real-time personalization engine and how we deploy, manage, and monitor our infrastructure.
In designing and building our production environment we had to take into account several requirements such as scalability, handling varying loads from CPU intensive to memory and IO intensive. To allow for scaling horizontally and easy change of machine specifications, we choose to use cloud-based hosted instances in Amazon AWS.
Our requirement for handling traffic for version 1.0 of the personalization engine is over 4000 requests per second (rps). We found that most of our microservices tend to be more compute intensive due to the nature of machine learning model evaluation and scoring computation, and not so heavy on memory usage. We also found that the network “chatter” around our microservices was very high. We ended up provisioning 7 application servers of the instance type c4.2xlarge. Each server is able to serve approximately 600 requests per second each. Using this instance type gave us sufficient breathing room for the network chatter between our microservices. We also use ElasticCache as a write-through cache. At the beginning it was saturated by network I/O. After implementing client side sharding and sharing the load across several ElasticCache instances we keep the network I/O sane on any given ElasticCache server.
As explained in our previous blogpost in this series, we have several microservices that live on and across several servers. We are constantly developing, patching, reverting, and upgrading these microservices and this means that we must be able to perform deployment tasks for one or many microservices at a time, while maintaining no downtime. Furthermore, we should be able to deploy our changes to just a subset of the nodes and all tasks should be idempotent; that is, repeating the same task multiple times should give the same result. To achieve all of these requirements, we leverage the open source Ansible-playbook for orchestrating configurations and deployment.
To use the Ansible automation engine we first define a templated service-agnostic deployment flow. Each flow is defined as a sequence of tasks. We tag each task with a functional service operation such as start, stop, restart, or deploy. We then supply variables and definitions for each service and sequentially run the deployment flow on each server.
Our deployment tasks are composed of few steps, namely, the creation of the required folders, copying of the artifacts to the target machine, and compiling and installing upstart and logging settings.
The running task stops the existing Linux service for the respective microservice. It waits for a configured amount of time for the instances to cleanly leave the cluster, then it starts the new instance. This *pausing* is actually very critical to ensure that the Akka cluster remains in a consistent state. Moreover, because we use a native library for Kamon metrics, we frequently experienced JVM crashes when two JVM processes of the same microservice starts simultaneously. Consequently, we do a service stop, pause then service start and completely avoid service restart command.
Another thing we do is bind Akka cluster roles to specific hosts. So when we specify in Jenkins to deploy CF and PPS roles, it will deploy & run only on the servers where this role is assigned.
We achieve this by adding a variable named roles in Ansible hosts file to each server. For example, to run a certain group of microservices on a specific server, we just add all the microservices roles to the roles variable associated with this server IP as follows:
220.127.116.11 roles=seed,pps,cf jvm_min_memory=0.1 jvm_max_memory=0.30
18.104.22.168 roles=seed,orchestrator,ca jvm_min_memory=0.1 jvm_max_memory=0.30
For simplicity, we assign JVM memory equally while leaving 10-20% of server memory reserved for other system processes.
We use Jenkins as our central automation server interface. There are two main Jenkins jobs 1) Build and 2) Deploy. First, the build job executes SBT to compile, test and package the artifact (packageBin) for each microservice. The SBT packageBin will copy the artifact zip file to a sub-folder in the project called deployment that contains the Ansible playbook. Second, the deploy job clones only the deployment folder from the last successful build job. The deploy job consists of two Ansible steps:
- Invoke the main Ansible playbook with deploy tag. This will create all the necessary project folders on each server, copy artifacts, unarchive into each microservice folder and generate upstart config and other configurations.
- Invoke the Ansible playbook again but with restart tag to restart the Akka cluster. To ensure that deployment has zero downtime, we do a rolling update. For the restart step Ansible fork and serial parameters both are set to 1, to ensure it acts on a single server at a time.
Another requirement we have is to minimize the risk of deployment especially when there are breaking changes. We use a green deployment job to mitigate this risk. Using this deployment job, we deploy all the roles onto exactly one server, which we call the green server. This allows us to compare every metric on the experiment green server against the remaining servers. When we have achieved satisfactory confidence in the green deployment, we can proceed with the main Jenkins deploy job which will implement a rolling deploy on the remaining servers.
Monitoring & Logging
Having transparency of our cluster and services at all times is absolutely critical. This includes understanding the physical health of the servers such as CPU usage, memory consumption, network and disk I/O, the application-level metrics, and business-level metrics, and being able to correlate anomalous events across all these levels.
In order to achieve this, we use Datadog monitoring service. We install datadog-agent on all of our application servers. This agent scrapes system and physical metrics continuously in real-time. We also use kamon in our applications to publish metrics to the same datadog agent. We try to publish these application metrics wherever possible to provide as much system context at any given time.
val cacheHits = Kamon.metrics.counter("pps.cache-hits") // define a counter
case Some(hit) => // increment counter during business events
Once we have this setup, we built very granular dashboards across all the components. Using these Dashboards we are able to correlate metric across all components. For example, we can very easily attribute a spike in request processing time to heavy writes on our Elasticsearch cluster, or request timeouts increasing due to CPU over-utilization.
We also build component specific dashboards for our subsystems to make sure we understand the performance of each component on its own and the of the entire system as a whole.
We also want to have a fast response to anomalies. We have setup some very powerful alerting from the resulting metrics. For example, we trigger a Pagerduty alert when the median response time from Elasticsearch is greater than 50ms or when the percentage of non-200 HTTP status codes to the 200 status codes is greater than 1 percent.
Our team also leverages the Elasticsearch, Logstash, and Kibana (ELK) stack for centralized logging. Once we’ve identified the problematic service/host from our dashboards, we can drill down to the application logs to debug.
Aside from the above more standard monitoring and logging, we expose an endpoint in our API layer that will return the state of the cluster. This allows us to see the total number of instances of our applications running and the roles assigned to each. We can make assertions that perceived state of the cluster is what we are expecting.
We’ve been using this workflow for deployment and monitoring for months now. It really has helped increase our ability to release quickly and understand how our infrastructure serves our loads and how it reacts to externalities. Having data points at our fingertips combined with automated deployment was really worth the effort in terms of productivity and stabilization of the system performance and helped us doing better capacity planning. Little tricks and a few lines of monitoring and logging go a very long way in making your system transparent.