At signageOS, we run a cloud-based system that controls and monitors thousands of digital signage devices all around the world.
As our customer base grew and traffic increased, we weren’t able to maintain the same level of quality and performance with our private cloud provider at the time. After careful evaluation, we concluded that to keep up with our growth we had to migrate our system to a provider that would let us manage scaling, availability, and security more efficiently. That provider turned out to be AWS.
We started preparing for the migration well over a month before we launched. We set up all the new database instances, the Kubernetes cluster, and all the networking rules needed to replicate the data from the databases.
When the day came, we stayed long after business hours and made the final adjustments. After a few hours everything worked, so we flipped the DNS records and traffic started flowing to AWS.
Everything worked exactly as we had planned, or so we thought at first. All the data got replicated, and as each database was copied to AWS, we moved all of its traffic over and stopped the original instance. Then a problem came up: RethinkDB.
RethinkDB Migration and Performance Hell
RethinkDB is a great database and we have used it since the beginning. It offers fast document storage, and its greatest feature (the main reason we were using it) is that you can set up listeners and receive any changes to the data in real time. We use this feature heavily in our Box UI so that any change in the system is immediately displayed to the user. That also means all our front-end services are closely tied to RethinkDB, so it’s important that it stays fast and stable.
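To make the real-time listener idea concrete, here is a minimal sketch using RethinkDB changefeeds via the official Python driver. The database/table names and the `apply_change` helper are illustrative, not our actual schema or code:

```python
# Sketch of driving a UI cache from a RethinkDB changefeed.
# The "signage"/"devices" names are hypothetical examples.

def watch_devices(host="localhost", port=28015):
    """Stream every change on the devices table (requires a live cluster)."""
    from rethinkdb import r  # deferred import; needs `pip install rethinkdb`
    conn = r.connect(host=host, port=port)
    for change in r.db("signage").table("devices").changes().run(conn):
        yield change  # each change looks like {'old_val': ..., 'new_val': ...}

def apply_change(cache, change):
    """Apply one changefeed document to an in-memory cache keyed by id."""
    old, new = change.get("old_val"), change.get("new_val")
    if new is None:              # document was deleted
        cache.pop(old["id"], None)
    else:                        # document was inserted or updated
        cache[new["id"]] = new
    return cache
```

A front-end service can keep such a cache current and push each applied change straight to connected browsers, which is what made this feature so attractive for the Box UI.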
To migrate our RethinkDB data, we created a new cluster on AWS and connected it to the original cluster to form one logical cluster. We mapped each instance in AWS to be an exact secondary replica of an instance from the original cluster. That way, all the data gets copied, and once it’s all in AWS, we can promote the AWS copies to primary and shut down the original instances.
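This kind of hand-over can be expressed with RethinkDB’s `reconfigure` command and server tags. The sketch below is a hypothetical simplification: the tag names (`legacy`, `aws`), the table name, and the replica counts are illustrative, and the function is not meant to run outside a live cluster:

```python
# Three-phase hand-over using RethinkDB `reconfigure` with server tags.
# Each phase: (replica count per server tag, which tag holds the primary).
MIGRATION_PHASES = [
    # 1. Add the AWS server as a secondary replica; primary stays on legacy.
    ({"legacy": 1, "aws": 1}, "legacy"),
    # 2. Once the data is copied, promote the AWS replica to primary.
    ({"legacy": 1, "aws": 1}, "aws"),
    # 3. Drop the legacy replica entirely.
    ({"aws": 1}, "aws"),
]

def run_migration(conn, table="devices", shards=1):
    """Apply each phase in order (requires a live cluster; not run here)."""
    from rethinkdb import r  # needs `pip install rethinkdb`
    for replicas, primary in MIGRATION_PHASES:
        r.table(table).reconfigure(
            shards=shards, replicas=replicas, primary_replica_tag=primary
        ).run(conn)
```

Phase 1 is where all the copying (and, as it turned out, all the pain) happens; phases 2 and 3 are near-instant metadata changes.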
Unfortunately, we ran into serious performance issues that we couldn’t have predicted. First of all, our largest database took over 12 hours to replicate, both because it remained active during replication and because of a previously unknown replication inefficiency within RethinkDB. That wouldn’t have been an issue on its own, as our services stayed connected to the original cluster during the process. The problem was that the replication had a devastating effect on the performance of the whole cluster: connections were throttled, some requests timed out, and others took a long time to respond. That hurt our front-end services, which rely on this database being fast. To mitigate this we had to nearly triple the resources of the AWS instances running RethinkDB, but by then the damage was already done.
In the end, we migrated everything, including RethinkDB and our system is now fully running on AWS.
As we already mentioned, the RethinkDB migration slowed down our whole system, even though the other parts of the migration went fine. That led us to several conclusions.
The Offline-First Caching Approach Pays Off
This experience reinforced signageOS’ offline-first approach to storing data on end devices. Because each device keeps everything it needs locally, the devices were not affected and no content playback was disrupted throughout the incident.
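The offline-first idea can be sketched in a few lines. This is a hypothetical simplification, not signageOS code: the names (`LocalCache`, `current_playlist`, `fetch_remote`) are illustrative. The device always plays from its local cache, which a successful cloud sync merely refreshes:

```python
# Minimal offline-first sketch: playback never depends on the cloud
# being reachable, only on the last successfully cached state.

class LocalCache:
    """Tiny stand-in for the device's persistent local storage."""
    def __init__(self):
        self._store = {}

    def put(self, key, value):
        self._store[key] = value

    def get(self, key):
        return self._store.get(key)

def current_playlist(cache, fetch_remote):
    """Try the cloud; on any failure, fall back to the cached copy."""
    try:
        playlist = fetch_remote()
        cache.put("playlist", playlist)   # refresh cache on success
    except Exception:
        playlist = cache.get("playlist")  # cloud unreachable -> keep playing
    return playlist
```

During the incident the cloud side was slow or timing out, so devices simply kept playing from their caches, exactly the failure mode this design anticipates.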
MongoDB Handles Larger Loads More Efficiently
RethinkDB is an open-source database that generated a lot of hype and brought innovative features to the market. Unfortunately, the company that created it shut down in 2016. Even though it transitioned into a community-managed project on GitHub, its future is questionable.
Since the inception of signageOS we had used RethinkDB without any problems, but traffic on our servers has now reached a point where RethinkDB can no longer keep up. With read/write operations reaching tens of thousands per second, the migration process exposed inefficiencies both in RethinkDB itself and in how we worked with it.
After recovering from the sequence of failures that happened during the migration, we took the time to evaluate the situation. We decided to drop RethinkDB and replace it with MongoDB, for several reasons:
- MongoDB also offers real-time change notifications (change streams), the feature that made us pick RethinkDB in the first place
- MongoDB has a great reputation and is actively developed by a large company with over 1,000 employees
- MongoDB is extremely fast
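For the first point, here is a hedged sketch of MongoDB change streams via pymongo. The database/collection names are illustrative, and change streams require a replica-set deployment (they are unavailable on a standalone server). The `to_changefeed_doc` translator is a hypothetical helper showing how change-stream events can be mapped to the `{'old_val', 'new_val'}` shape our RethinkDB-era consumers expect:

```python
# MongoDB change streams as a drop-in source of real-time updates.

def watch_devices(uri="mongodb://localhost:27017"):
    """Yield change events for the devices collection (not run here)."""
    from pymongo import MongoClient  # needs `pip install pymongo`
    client = MongoClient(uri)
    # full_document="updateLookup" attaches the post-update document
    # to update events instead of only the changed fields.
    with client.signage.devices.watch(full_document="updateLookup") as stream:
        for event in stream:
            yield event  # e.g. {'operationType': 'insert', 'fullDocument': ...}

def to_changefeed_doc(event):
    """Translate a change-stream event into RethinkDB's changefeed shape.

    Delete events carry only the document key, so old_val is partial.
    """
    if event["operationType"] == "delete":
        return {"old_val": {"id": event["documentKey"]["_id"]}, "new_val": None}
    return {"old_val": None, "new_val": event.get("fullDocument")}
```

A translator like this let the front-end services keep their existing consumer logic while the database underneath changed.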
MongoDB was already in use in parts of signageOS’ operations before the incident. Since replacing the RethinkDB cluster, the new MongoDB cluster has been tested under even heavier traffic than RethinkDB ever handled and has performed flawlessly. Unlike RethinkDB, it performed great before, during, and after the migration.
You Don’t Have to Replicate on the Fly
Much of the data we had stored in RethinkDB was historical by nature and not needed for day-to-day operations. Had we dumped that data to a file and copied it over manually after the main migration was finished, we would have avoided most of the performance issues.
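In practice that means using RethinkDB’s offline export/import tooling instead of live replication for cold data. The commands below are an illustrative fragment, not our actual hosts or tables:

```shell
# Hypothetical example: export only the historical tables to an archive
# on the old cluster, instead of replicating them live.
rethinkdb dump -c old-cluster.internal:28015 \
    -e signage.device_history -f history.tar.gz

# Copy the archive to AWS (e.g. via scp or S3), then import it off-peak.
rethinkdb restore history.tar.gz -c aws-cluster.internal:28015
```

A dump like this puts no sustained load on the source cluster and can be restored at any convenient time, since nothing reads the historical tables in real time.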
Even though our migration to AWS didn’t go as planned and we experienced some performance issues, it was a great lesson, and we’re already hard at work implementing what we learned. Not only did we reach some crucial turning points in signageOS' development, but we also found a bug in AWS, which earned us an AWS voucher.
Migrating to AWS was a crucial step that will allow us to deliver our services worldwide as our customer base rapidly grows.