As our customer base grows, and with it traffic, we must maintain the same levels of quality and performance that our global partners have come to trust. After careful and considerable evaluation, we selected AWS as our cloud provider so that we can manage scaling, availability, and security as efficiently as possible. Learn more about the migration in our two-part series on signageOS’ server migration to AWS.
However, the complexity of signageOS’ cloud infrastructure on AWS is worth examining in greater detail. signageOS runs a cloud-based system that controls and monitors thousands of digital signage devices all around the world.
The signageOS ecosystem consists mainly of two parts: devices running in various locations around the world, and the cloud system that controls and monitors them.
Devices are signageOS’ main focus; everything we do aims to make them more reliable and more powerful. By devices, we mean the SoC displays, media players, and Raspberry Pi modules that we integrate, test, and deploy. They run signageOS software together with each customer's unique HTML5 player. signageOS provides a sample HTML5 player (Applet) to CMS companies, who build their own software on top of it and can then deploy the resulting HTML5 player on any device type.
All devices connect to the signageOS cloud system. They send real-time information about the state and contents of the device to the cloud, where it is processed and presented to the user. The cloud can also issue commands, which are sent back to the device in real time, and the device then acts upon them.
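Conceptually, the exchange above is a pair of message flows: a state report going up, and a command coming down. Here is a minimal Python sketch of that idea; the message shapes and the `setVolume` command are invented for illustration, since the real signageOS protocol is not described in this post.

```python
import json

def build_state_report(device_id: str, state: dict) -> str:
    """Serialize a device's current state for the cloud (hypothetical shape)."""
    return json.dumps({"type": "state", "deviceId": device_id, "payload": state})

def apply_command(state: dict, command: dict) -> dict:
    """Return the new device state after acting on a cloud command.

    Only a made-up 'setVolume' command is handled here, purely as an example.
    """
    if command.get("type") == "setVolume":
        return {**state, "volume": command["value"]}
    return state
```

The important property is that the device is the only writer of its own state: the cloud expresses intent as commands, and the device applies them and reports back.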
The cloud infrastructure runs on several separate virtual machines on AWS. Amazon EKS, AWS's managed Kubernetes service, orchestrates all of our proprietary microservice applications. Our experience lets us write robust Infrastructure as Code (Terraform + Helm charts) that is generic enough to handle all standard cases, and many special cases, of system load.
Autoscaling using Kubernetes and AWS EC2
We utilize a combination of Kubernetes' own auto-scaling and the Auto Scaling groups provided by AWS EC2, so continual traffic is efficiently distributed across all servers based on predefined rules. In the event of a peak traffic load, the system reacts automatically and immediately begins creating new server instances. These EC2 instances join the EKS cluster, and overloaded instances are relieved long before they fail. All traffic changes are continuously monitored by AWS CloudWatch, which notifies the responsible DevOps engineers if something is not functioning correctly. Most peak-performance cases are currently handled automatically.
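To make the Kubernetes side of this concrete, the Horizontal Pod Autoscaler's documented scaling rule is roughly "desired replicas = current replicas × current utilization / target utilization", clamped to a configured range. This is a minimal Python sketch of that formula, not signageOS' actual configuration; the min/max bounds are example values.

```python
import math

def desired_replicas(current_replicas: int, current_cpu_util: float,
                     target_cpu_util: float, min_replicas: int = 2,
                     max_replicas: int = 20) -> int:
    """Replica count an HPA-style autoscaler would request, clamped to bounds."""
    desired = math.ceil(current_replicas * current_cpu_util / target_cpu_util)
    return max(min_replicas, min(max_replicas, desired))
```

For example, four replicas running at 90% CPU against a 60% target would be scaled up to six, while the same replicas at 30% would be scaled down to the configured minimum.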
Databases are separated from the EKS cluster to utilize the full performance of their EC2 instances. We combine high-availability (HA) master-master replication across all our databases, not only for HA itself but also to make the fullest use of the system for both read and write operations. In the long term, we aim to keep the utilization of all database servers under 20% CPU and 60% RAM under normal traffic load. When long-term statistics indicate these limits are being exceeded, we are prepared to scale the database cluster, mostly horizontally. For unexpected traffic peaks hitting the database instances, we rely on EC2 T3 burstable CPU credits, which absorb them within a short amount of time.
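The scale-out decision described above reduces to a simple threshold check over long-term monitoring data. This Python sketch shows the idea using the 20% CPU / 60% RAM limits from the text; the function name and sample-list inputs are invented for illustration.

```python
def needs_horizontal_scale(cpu_samples: list, ram_samples: list,
                           cpu_limit: float = 0.20,
                           ram_limit: float = 0.60) -> bool:
    """True when long-term average utilization exceeds either target limit,
    signaling that the database cluster should be scaled out."""
    average = lambda xs: sum(xs) / len(xs)
    return average(cpu_samples) > cpu_limit or average(ram_samples) > ram_limit
```

In practice the "samples" would come from a monitoring system such as CloudWatch aggregated over days or weeks, not from a handful of instantaneous readings.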
Our system has to be available from all locations around the world 24/7/365, which means no component should have any downtime during scheduled releases of new features, bug fixes, and security fixes. These requirements are satisfied by master-master replication of all databases across different availability zones. Any database maintenance is done as a rolling deployment: nodes are patched one by one, and while a node is being patched its traffic is temporarily redirected to the remaining cluster nodes, so deployment always proceeds with zero downtime. The only exception is architectural changes to the signageOS service-oriented architecture (SOA); these are always announced to partners through our status page.
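The zero-downtime invariant of a rolling deployment is easy to state in code: at every point during the patch sequence, at least one replica must still be serving. The following Python sketch models that invariant with made-up node dictionaries; it is an illustration of the technique, not signageOS' deployment tooling.

```python
def rolling_patch(nodes: list, patch) -> None:
    """Patch replicated database nodes one at a time.

    While a node is drained, its traffic is assumed to be redirected to the
    remaining replicas; the assertion encodes the zero-downtime invariant.
    """
    for node in nodes:
        node["serving"] = False                     # drain this replica
        assert any(n["serving"] for n in nodes), "all replicas down at once"
        patch(node)                                 # apply the maintenance step
        node["serving"] = True                      # rejoin the cluster
```

Note that the invariant only holds with two or more replicas, which is exactly why the master-master replication described above is a prerequisite for this style of maintenance.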
Proprietary components of the system are also run in HA mode. All Kubernetes nodes are spread across multiple availability zones (AZs), and each service always runs at least once in every AZ. So during a deployment of a new version, or an outage of any AZ, the traffic is handled by a different AZ. Kubernetes also continuously redistributes services across the AZs based on configured rules. Each AZ has its own independent electricity source, internet access, cooling systems, and failover plans, including a safety system. For more details, see the official AWS article about AZs.
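The placement rule described above ("each service runs at least once in every AZ") is essentially a round-robin assignment of replicas to zones. Here is a minimal Python sketch of that policy; in a real cluster this is expressed declaratively via Kubernetes topology spread constraints or pod anti-affinity, not imperative code.

```python
from itertools import cycle

def place_replicas(azs: list, replica_count: int) -> list:
    """Assign replicas to availability zones round-robin, guaranteeing
    every AZ hosts at least one replica of the service."""
    if replica_count < len(azs):
        raise ValueError("need at least one replica per availability zone")
    zone = cycle(azs)
    return [next(zone) for _ in range(replica_count)]
```

With three AZs and five replicas, the first pass covers all three zones and the remaining two replicas wrap around, so the loss of any single AZ still leaves the service running elsewhere.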
Even with all of the practices listed above in place to ensure that the system is always up and safe, accidents happen, and we have to be prepared for these critical scenarios. Universal best-practices experience shows that fatal failures are usually caused by hardware degradation or failure, which cannot be recovered in real time.
Another possible cause of data loss or a partial infrastructure outage is operator error or a human mistake. We strictly grant internal access to selected staff only, and only for necessary operations, most of which are read-only; deployment is done automatically through a defined and tested process with several review steps.
In addition, independent backup processes store all data and infrastructure state as system snapshots every two hours. This means everything is duplicated to separate storage and can be easily restored within minutes if needed. Restoring is not currently automated, and we are not considering automating it right now, because such a dramatic failure of the system should happen only in very rare cases, and it could have a different, unknown origin that would require a unique solution. So the first step of this failover process is identifying the cause of the problem; our engineers then dive deep into preparing a solution. In this case, we expect at least partial recovery of the system within minutes.
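The "snapshot every two hours" schedule above bounds how much data can be lost: the recovery point is never more than one interval old. This Python sketch enumerates snapshot timestamps for a given window; the function and its two-hour default are illustrative, not a description of the actual backup tooling.

```python
from datetime import datetime, timedelta

def snapshot_times(start: datetime, end: datetime,
                   interval_hours: int = 2) -> list:
    """Timestamps at which snapshots would be taken within [start, end]."""
    times, t = [], start
    while t <= end:
        times.append(t)
        t += timedelta(hours=interval_hours)
    return times
```

A 24-hour window at a two-hour interval yields 13 snapshots (both endpoints included), so a restore never has to reach back more than two hours of writes.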
signageOS’ skilled team of software engineers maintains this complex system to the highest standard in order to offer you the best service possible. signageOS continually reviews processes and performance to ensure the system functions at its greatest capacity. Additionally, signageOS maintains the highest level of transparency so our partners stay informed about the current state of signageOS’ services. Building a trustworthy relationship between signageOS and our partners is a core value of our business, one we put at the forefront of all aspects of development and maintenance.