The Journey of Migrating to Kubernetes

Pratik Saha
7 min read · Jul 13, 2022


In this article, I am going to discuss the process of migrating our production workload to Kubernetes at Welcome Software (now acquired by Optimizely). This includes why we chose to migrate to Kubernetes, what prerequisites we needed to figure out before starting, and how we completed the migration without causing any major disruption to our services.

Overview of the Legacy Infrastructure

Before we start talking about the migration process, let’s look into what our infrastructure looked like before the migration.

In general, we divided our services into two types -

  1. Internal tools & services
  2. Product-related microservices

Both types of services had similar architecture. Here’s a simplified overview of the legacy architecture -

Simplified overview of the legacy architecture

For internet-facing services we used “internet-facing” ALBs, while the other services used internal ALBs. We used CloudFormation to create each service's ALBs, Target Groups, and Auto Scaling Groups. The process of creating and updating the Route53 records to point to the correct ALB DNS was manual.
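To make the legacy setup concrete, here's a heavily simplified sketch of what one of those per-service CloudFormation templates might have contained. The logical names, subnets, AMI, and instance type below are illustrative placeholders, not our actual configuration.

```yaml
# Illustrative per-service CloudFormation sketch: an ALB, a Target Group,
# and an Auto Scaling Group. All identifiers are placeholders.
AWSTemplateFormatVersion: "2010-09-09"
Resources:
  ServiceLoadBalancer:
    Type: AWS::ElasticLoadBalancingV2::LoadBalancer
    Properties:
      Scheme: internal                      # "internet-facing" for public services
      Subnets: [subnet-aaaa1111, subnet-bbbb2222]
  ServiceTargetGroup:
    Type: AWS::ElasticLoadBalancingV2::TargetGroup
    Properties:
      VpcId: vpc-0123456789abcdef0
      Port: 8080
      Protocol: HTTP
      HealthCheckPath: /status
  ServiceListener:
    Type: AWS::ElasticLoadBalancingV2::Listener
    Properties:
      LoadBalancerArn: !Ref ServiceLoadBalancer
      Port: 80
      Protocol: HTTP
      DefaultActions:
        - Type: forward
          TargetGroupArn: !Ref ServiceTargetGroup
  ServiceLaunchConfiguration:
    Type: AWS::AutoScaling::LaunchConfiguration
    Properties:
      ImageId: ami-0123456789abcdef0        # placeholder AMI
      InstanceType: t3.medium
  ServiceAutoScalingGroup:
    Type: AWS::AutoScaling::AutoScalingGroup
    Properties:
      MinSize: "2"
      MaxSize: "6"
      VPCZoneIdentifier: [subnet-aaaa1111, subnet-bbbb2222]
      LaunchConfigurationName: !Ref ServiceLaunchConfiguration
      TargetGroupARNs:
        - !Ref ServiceTargetGroup
```

The Route53 alias record pointing a service's hostname at that ALB's DNS name was the part we still created and updated by hand.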

Reasons behind Migrating to Kubernetes

  1. Revamping the legacy infrastructure was the primary reason behind this endeavor. We had been planning to use containerization to reduce the number of EC2 instances and make better use of the resources allocated to the running instances. We had previously tried ECS but decided not to pursue it further for several reasons.
  2. Scalability was another thing we had in mind. In our existing infrastructure, scaling was becoming costly. It’s easier to scale in k8s (short for Kubernetes) than with plain virtual machines, since Kubernetes was designed with scalability in mind.
  3. Reliability was also a key factor in our decision-making process. k8s runs redundant copies of application containers and ensures they can be scheduled across multiple nodes and availability zones, which makes services highly available and reliable. The stability of our services is very important to us, and that’s why we chose to go with k8s.

Prerequisites for Starting The Migration Process

Before we could proceed with the first stage, we brainstormed a few prerequisites that needed to be figured out. Without them, we couldn’t start our migration journey. Here are the major prerequisites we needed to handle, along with a brief overview of each decision and why we made that choice -

  • Self-managed k8s or Managed k8s? We decided to go with Amazon Elastic Kubernetes Service (Amazon EKS). Since we had a small team, self-managed Kubernetes would have been too much to handle. We evaluated GKE as well, but since we were already using a lot of AWS-managed services, we ultimately went with EKS.
  • How do we manage the EKS clusters? We decided to go with Terraform. It allowed us to easily launch EKS clusters along with a few other essential components like the Kubernetes Dashboard, the AWS Load Balancer Controller, ExternalDNS, etc.
  • How to make services highly available & reliable? As stated above, we chose k8s for its ability to be highly available and reliable. We decided to spread the EKS nodes across multiple availability zones and schedule pods in such a way that they get distributed evenly across those availability zones. This diagram explains our EKS architecture -
EKS Architecture

We also decided to use k8s ReplicaSets, which ensure that the desired number of pods is running for each service, and Horizontal Pod Autoscaling so that pods scale based on CPU usage, memory usage, or any other desired metric.
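As an illustration, here's a minimal sketch of what that looks like for a single service: a Deployment (backed by a ReplicaSet) with a topology spread constraint to distribute pods evenly across availability zones, plus a Horizontal Pod Autoscaler scaling on CPU. The service name, image, replica counts, and threshold are placeholders rather than our real values.

```yaml
# Illustrative Deployment: spread pods across AZs and keep 3 replicas running.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-service            # placeholder service name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: example-service
  template:
    metadata:
      labels:
        app: example-service
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels:
              app: example-service
      containers:
        - name: app
          image: registry.example.com/example-service:1.0.0   # placeholder image
          resources:
            requests:
              cpu: 250m
              memory: 256Mi
---
# Illustrative HPA: scale between 3 and 10 pods based on CPU utilization.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: example-service
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: example-service
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```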

  • How to manage the k8s manifest files? Helm was our tool of choice. We also evaluated Kustomize, but Helm was better suited to our needs. It allowed us to template the manifest files and package them into a single deployable artifact called a Helm chart, which made the management and deployment process much easier.
  • What does the CI/CD pipeline look like? We were using Jenkins as our CI/CD tool and decided to continue using it. For our legacy infrastructure deployments we were using Ansible, and we decided to keep it as well. Since we opted for Helm-based deployments, we used the Ansible helm module (a sketch of the deploy task follows the diagram below). Here’s what our pipeline looked like -
CI/CD pipeline for deploying to EKS cluster
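To give a sense of the deploy step in that pipeline, here's a minimal sketch of an Ansible task using the helm module from the kubernetes.core collection to install or upgrade a service's Helm chart on the EKS cluster. The chart path, namespace, and values below are placeholders.

```yaml
# Illustrative Ansible task (triggered from Jenkins) that deploys a Helm chart.
# Chart location, namespace, and values are placeholders.
- name: Deploy example-service Helm chart to EKS
  kubernetes.core.helm:
    name: example-service
    chart_ref: ./charts/example-service
    release_namespace: example-namespace
    create_namespace: true
    wait: true
    release_values:
      image:
        tag: "{{ build_version }}"   # hypothetical variable passed in by Jenkins
      replicaCount: 3
```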
  • Service discovery & secrets management? We evaluated AWS Secrets Manager but realized that implementing it from scratch would require time and effort that might delay the EKS migration. We already had a system for service discovery and secrets management built with Ansible custom modules and filter plugins, so we decided to re-use it and revisit after the migration to evaluate a better alternative.
  • How to manage logging in containers? Since our services used to run on EC2 instances, we were using file-based logging. That isn’t feasible in k8s, since we would be running Docker containers. We decided to make logging configurable for each service during the migration period: when the application runs in EKS it logs directly to STDOUT, and when it runs on EC2 it writes logs to files as usual. In EKS, we ingest the STDOUT logs directly into Elasticsearch.
  • How to monitor resources in EKS? We were using Nagios to monitor the resources running on EC2 and decided to continue with it during the migration period. In Nagios, we added checks that hit each service’s status endpoint to verify the service was running successfully. We also added checks to ensure that each service had the expected number of pods running without restarts and that no EKS deployment was stuck in the “CrashLoopBackOff” state. Later on, we also evaluated Prometheus and Grafana but ultimately settled on Datadog, but that’s a story for another time.
  • How to use AWS Load Balancers with EKS? We wanted to continue using AWS load balancers even after migrating to EKS. We were able to achieve that through the AWS Load Balancer Controller add-on, which allowed us to provision ALBs through annotations. Initially we used one ALB for each service, but later discovered that we could share an ALB across multiple services using a shared ingress group annotation (see the sketch after this list). It came with its caveats, but we can discuss those another time.
  • How to automatically manage DNS records? As discussed earlier, DNS record creation and management was manual: when a new service was launched, we had to manually create Route53 records pointing to the DNS name of that service’s ALB. The ALBs in the legacy infrastructure were static and we controlled their lifecycle, so this didn’t matter much. But now the ALBs would be created by EKS through the AWS Load Balancer Controller add-on, so an ALB could be replaced and its DNS name would change. That’s why we needed a way to automate the management of Route53 records, and we went with the ExternalDNS plugin.
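Here's a rough sketch of how those last two pieces fit together: an Ingress whose annotations let the AWS Load Balancer Controller place multiple services behind one shared ALB (via an ingress group), with a host that ExternalDNS can sync to Route53. The group name, hostname, and service details are illustrative, not our actual configuration.

```yaml
# Illustrative Ingress: Ingresses sharing the same group.name are served by
# one shared ALB, and ExternalDNS creates a Route53 record for the host below.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: example-service
  annotations:
    alb.ingress.kubernetes.io/scheme: internet-facing     # or "internal"
    alb.ingress.kubernetes.io/target-type: ip
    alb.ingress.kubernetes.io/group.name: shared-services # placeholder group
spec:
  ingressClassName: alb
  rules:
    - host: example-service.example.com                   # placeholder hostname
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: example-service
                port:
                  number: 80
```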

There were a few other things that needed sorting out before the migration, but these were the major ones. Figuring them out was a major hurdle for us, and once we were done, we felt one step closer to our goal.

Migration Stages to Avoid Disruption

Now that we had figured out some of the burning questions, we could, at last, get started with the migration. But we had to make sure we did not cause any outages throughout the process. To complete the migration reliably and without disrupting the production environment, we divided the process into multiple stages -

Stage 1

  • Migrate product-related microservices to k8s in the staging environment
  • Figure out the steps required for the migration
  • Analyze performance and find out issues before proceeding to the next stage

Stage 2

  • Migrate internal tools & services to k8s in the production environment by applying the knowledge gathered from “Stage 1”
  • Analyze the performance, availability, and reliability of the internal tools & services running in EKS
  • Prepare a rollback plan to legacy infrastructure in case of an outage
  • Iron out any kinks before proceeding to the next stage

Stage 3

  • Migrate our product-related microservices to k8s in the production environment by applying the knowledge gathered from the previous two stages
  • Analyze the performance, availability, and reliability of the product-related microservices running in EKS
  • Prepare a rollback plan to legacy infrastructure in case of an outage

After a long and tedious process, we were able to migrate the majority of our production workload to k8s. The biggest hurdle was figuring out the prerequisites. Once those were settled, our main challenge was migrating each service reliably while making sure all of its features kept working correctly. It wasn’t an easy task and required assistance from the teams that owned each service. But ultimately, we were able to clear the hurdle without causing any major outage, which was a huge achievement for everyone involved in the process.

Although it was the end of our migration journey, it wasn’t the end of the road for us. We are continuously learning new things about k8s and improving our infrastructure. I will try to write more about the things we learned and how we adapted in future articles.
