signageOS' Server Migration to AWS Part II

How signageOS works

So you got a glimpse of signageOS’ migration and a couple of the problems we ran into. Now let’s give you a little more information on how signageOS works with AWS and why this migration was necessary.

The ecosystem of signageOS mainly consists of two parts - devices that run in various locations around the world and the cloud system that controls and monitors them.

Devices

Devices are our main focus. Everything we do aims to make the devices more reliable, more powerful, and generally better in one way or another. These are the SoC displays, media players, and Raspberry Pis. They run signageOS software and each of our customers' unique HTML5 players.

Cloud

All devices connect to the signageOS cloud system. They send various real-time information about the state of the device and its content to the cloud, where it's processed and presented to the user. Various commands can be issued from the cloud and sent back to the device in real time, and the device acts upon those commands.

Ok, let’s get technical

Until now, the signageOS cloud ran in 4 different locations: 2 data centers in Prague, Czech Republic, belonging to a closely cooperating private cloud provider, and Microsoft Azure's North Europe and West Europe data centers.

Private cloud provider

This is where the main application services and all the databases were located.

The platform architecture follows the SOA (service-oriented architecture) pattern, meaning that every service of the platform is separated and can easily be scaled horizontally, including the databases. The original infrastructure ran on several virtual servers powered by a VMware solution and distributed across 2 separate availability zones. Every server had its own purpose: some servers were dedicated to keeping data in a couple of shards with appropriate replicas.

The main databases, which were under heavier loads, ran directly on virtual machines to allow for the fastest possible performance during data manipulation. The most important of these are MongoDB, which stores the complete history log, and PostgreSQL, which holds the current system state.

Some of our secondary "read-only" databases were running in Docker to allow for easy deployment, migration, scaling, and upgrades. RethinkDB was used for this purpose and turned out to be the pain point during the migration.

For our application services, 2 dedicated servers ran in different availability zones, with the option of easily launching additional backup servers. This setup was stable and adequate for the load at the time, but any scaling, when needed, was a more or less manual process.

Microsoft Azure

We used some of the services offered by Microsoft that we felt were important and more effective than if we implemented them in-house. In particular, the most heavily used service was Azure Blob Storage, which we used for serving static files in combination with Azure CDN.

Why AWS

As you probably know, AWS is used by the largest corporations in the world and even some governments. Let’s explore some of the major upsides of running our infrastructure on AWS:

Scaling

From time to time we need to scale horizontally (add new servers) or vertically (add more CPU, RAM, disk space, etc.).

Providers such as AWS allow you to quickly create new VPSs (Virtual Private Servers) or make changes to existing ones, create disk volumes and load balancers, reroute network traffic, create policies for automatic horizontal scaling of server clusters, and so on. We can then iterate quickly, making as many changes as we need in a short time until we reach the desired outcome. That's one of the reasons why we decided to switch to AWS.
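To give a concrete idea of how scriptable this kind of provisioning is, here is a minimal boto3 sketch of the steps described above. The AMI ID, instance type, region, and volume size are hypothetical placeholders, not our actual configuration.

```python
# A minimal sketch of provisioning with boto3; all IDs and sizes are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="eu-central-1")

# Launch a new virtual server from an AMI (the "quickly create new VPSs" part)
instance = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",   # hypothetical AMI ID
    InstanceType="t3.medium",
    MinCount=1,
    MaxCount=1,
)["Instances"][0]

# Create an extra disk volume in the same availability zone and attach it
volume = ec2.create_volume(
    AvailabilityZone=instance["Placement"]["AvailabilityZone"],
    Size=100,                          # GiB
    VolumeType="gp3",
)
ec2.get_waiter("volume_available").wait(VolumeIds=[volume["VolumeId"]])
ec2.attach_volume(
    VolumeId=volume["VolumeId"],
    InstanceId=instance["InstanceId"],
    Device="/dev/sdf",
)
```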

Also, AWS allows us to ensure the highest availability on multiple levels:

Availability zones

From AWS documentation: “Amazon EC2 is hosted in multiple locations world-wide. These locations are composed of Regions and Availability Zones. Each Region is a separate geographic area. Each Region has multiple, isolated locations known as Availability Zones. Each Availability Zone is isolated, but the Availability Zones in a Region are connected through low-latency links. If you distribute your instances across multiple Availability Zones and one instance fails, you can design your application so that an instance in another Availability Zone can handle requests.”

Each availability zone is de facto an isolated data center, so if a catastrophic failure occurs, it doesn't affect other availability zones in the same region. That allows us to design our infrastructure in a way that even in the event of a catastrophic failure occurring, our system's availability won't be affected. By running multiple redundant instances of critical services across multiple availability zones, any of them can take over in case another becomes unavailable.

Placement groups

From AWS documentation: “You can launch or start instances in a placement group, which determines how instances are placed on underlying hardware. A spread placement group is a group of instances that are each placed on distinct racks, with each rack having its own network and power source.”

Placement groups allow us to protect our services from unexpected hardware failure. By running multiple redundant instances of critical services inside a "spread" placement group, we ensure that each instance runs on different underlying physical hardware. If a piece of hardware fails, it only affects a single instance, thus minimizing the impact on our system's performance.
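A hedged sketch of what that looks like with boto3: create a "spread" placement group and launch redundant instances into it. The group name, AMI ID, and instance type are hypothetical.

```python
# Launch instances into a "spread" placement group so each one lands on
# distinct underlying hardware. Names and IDs are hypothetical placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="eu-central-1")

ec2.create_placement_group(GroupName="critical-services", Strategy="spread")

# AWS places each of these instances on a different rack,
# with its own network and power source.
ec2.run_instances(
    ImageId="ami-0123456789abcdef0",   # hypothetical AMI ID
    InstanceType="t3.medium",
    MinCount=3,
    MaxCount=3,
    Placement={"GroupName": "critical-services"},
)
```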

Auto Scaling Groups

From AWS documentation: “An Auto Scaling group contains a collection of Amazon EC2 instances that are treated as a logical grouping for the purposes of automatic scaling and management. An Auto Scaling group also enables you to use Amazon EC2 Auto Scaling features such as health check replacements and scaling policies. Both maintaining the number of instances in an Auto Scaling group and automatic scaling are the core functionality of the Amazon EC2 Auto Scaling service.”

Auto scaling groups are useful when you have a logical group of redundant server instances, e.g. a Kubernetes cluster. Grouping them in an auto scaling group allows us to automate several things:

  • If an instance doesn’t pass a regular health check, it’s terminated and a new instance is created in its place
  • If the traffic suddenly increases and the existing instances get overwhelmed, additional instances will be created to share the load

As a bonus, the auto scaling group can be configured to create instances across multiple availability zones, thus reaping the benefits of both.
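Here is a minimal sketch of such a group with boto3, assuming a launch template and one subnet per availability zone already exist. The group name, template ID, subnet IDs, and thresholds are hypothetical placeholders.

```python
# An auto scaling group spread across several availability zones, with
# automatic replacement of unhealthy instances and a CPU-based scaling policy.
import boto3

autoscaling = boto3.client("autoscaling", region_name="eu-central-1")

autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="k8s-workers",
    LaunchTemplate={"LaunchTemplateId": "lt-0123456789abcdef0"},  # hypothetical
    MinSize=3,
    MaxSize=10,
    DesiredCapacity=3,
    HealthCheckType="EC2",            # failed instances are terminated and replaced
    HealthCheckGracePeriod=300,
    # Subnets in different availability zones -> instances are spread across AZs
    VPCZoneIdentifier="subnet-aaa111,subnet-bbb222,subnet-ccc333",
)

# Scale out automatically when the group's average CPU load gets too high
autoscaling.put_scaling_policy(
    AutoScalingGroupName="k8s-workers",
    PolicyName="cpu-target-tracking",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {"PredefinedMetricType": "ASGAverageCPUUtilization"},
        "TargetValue": 60.0,
    },
)
```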

Volume Snapshots

Snapshots are backups of disk volumes at a point in time. It's important to take them as often as possible in case of unexpected data loss. Therefore, we set up a Snapshot Lifecycle Policy, which is an automated process of taking snapshots at certain time intervals. We take snapshots of all important data every 2 hours, which is the most frequent interval that AWS allows for.
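For illustration, a hedged sketch of such a policy via Amazon Data Lifecycle Manager: snapshot every volume carrying a given tag every 2 hours. The IAM role ARN, tag, and retention count are hypothetical, not our actual settings.

```python
# Create a snapshot lifecycle policy: every 2 hours, snapshot all volumes
# tagged Backup=true. Role ARN and tag values are hypothetical placeholders.
import boto3

dlm = boto3.client("dlm", region_name="eu-central-1")

dlm.create_lifecycle_policy(
    ExecutionRoleArn="arn:aws:iam::123456789012:role/AWSDataLifecycleManagerDefaultRole",
    Description="Snapshot important volumes every 2 hours",
    State="ENABLED",
    PolicyDetails={
        "ResourceTypes": ["VOLUME"],
        "TargetTags": [{"Key": "Backup", "Value": "true"}],
        "Schedules": [
            {
                "Name": "every-2-hours",
                "CreateRule": {"Interval": 2, "IntervalUnit": "HOURS"},
                "RetainRule": {"Count": 24},   # keep the most recent 24 snapshots
            }
        ],
    },
)
```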

These and other tools, plus the recognized reputation of AWS led us to a decision to migrate everything to them.

Desired state

The main building blocks of our system are:

  • PostgreSQL
  • MongoDB
  • RethinkDB
  • RabbitMQ
  • Redis
  • Kubernetes
  • Static files storage and CDN

Databases

Our PostgreSQL, MongoDB and RethinkDB instances contain the state of the whole system, therefore it’s crucial to ensure their availability and safety.

It is, in theory, possible to run these databases inside the Kubernetes cluster, however at this point we’re not comfortable with doing that. It is a hot topic and there are arguments for and against it. It’s possible we will look into it in the future.

For now, we wanted each instance/replica of each database to run on its own VPS instance, spread across as many availability zones as possible, which is exactly what we did with AWS EC2.
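A minimal sketch of that "one replica per availability zone" idea, using a hypothetical MongoDB replica set as the example. The AMI ID, instance type, and tags are placeholders; real provisioning also involves volumes, security groups, and configuration management.

```python
# Launch one database instance per availability zone. All IDs are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="eu-central-1")

zones = [
    az["ZoneName"]
    for az in ec2.describe_availability_zones(
        Filters=[{"Name": "state", "Values": ["available"]}]
    )["AvailabilityZones"]
]

for index, zone in enumerate(zones[:3]):
    ec2.run_instances(
        ImageId="ami-0123456789abcdef0",   # hypothetical AMI ID
        InstanceType="r5.large",
        MinCount=1,
        MaxCount=1,
        Placement={"AvailabilityZone": zone},   # a different AZ for each replica
        TagSpecifications=[
            {
                "ResourceType": "instance",
                "Tags": [{"Key": "Name", "Value": f"mongodb-replica-{index}"}],
            }
        ],
    )
```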

Kubernetes

We run all the business logic inside Kubernetes for a variety of reasons:

  • Easy horizontal scaling by adding more nodes
  • Services can run across different physical hardware and availability zones, increasing availability
  • Automatic health checking and recovery

We use AWS EKS as the managed master (control plane), and we created an Auto Scaling Group with a scaling policy that automatically deploys worker nodes as EC2 instances from a template. You can learn more about it here.
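As a rough sketch, creating the managed control plane itself is a single API call; the worker nodes then come from an auto scaling group like the one shown earlier. The cluster name, role ARN, and subnet IDs below are hypothetical, and a real setup also needs node IAM roles, security groups, and cluster authentication configuration.

```python
# Create the managed EKS control plane; worker nodes are provided separately
# by an auto scaling group. All names and IDs are hypothetical placeholders.
import boto3

eks = boto3.client("eks", region_name="eu-central-1")

eks.create_cluster(
    name="signage-cluster",                 # hypothetical cluster name
    roleArn="arn:aws:iam::123456789012:role/eks-cluster-role",
    resourcesVpcConfig={
        # Subnets in multiple availability zones
        "subnetIds": ["subnet-aaa111", "subnet-bbb222", "subnet-ccc333"],
    },
)
```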

Static files

Right now, our static files are still located on Microsoft Azure, but we're moving them to AWS S3, which is pretty much its identical counterpart. Both services also provide an option to run a CDN on top of your data for faster delivery.
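For illustration, a minimal sketch of pushing a static file to S3 with the metadata a CDN such as CloudFront cares about. The bucket, key, and cache lifetime are hypothetical.

```python
# Upload a static file to S3 with content type and cache headers for CDN delivery.
# Bucket and file names are hypothetical placeholders.
import boto3

s3 = boto3.client("s3", region_name="eu-central-1")

s3.upload_file(
    Filename="build/player.html",
    Bucket="signageos-static-example",            # hypothetical bucket
    Key="players/customer-a/player.html",
    ExtraArgs={
        "ContentType": "text/html",
        "CacheControl": "public, max-age=3600",   # let the CDN cache it for an hour
    },
)
```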

Constraints

The main constraint while migrating was that our system's performance and availability mustn't be impaired. In other words, our customers mustn't notice it. Unfortunately, this was not achieved, for the reasons mentioned in the previous post.

Conclusion

So now that you have a better understanding of how signageOS operates with AWS and why this was a necessary step, let us know what you think about the process. What would you have done differently? We want to hear from our tech-savvy fans on their thoughts and feedback.

