Part of the mission I was given when I arrived at my current workplace was the complete migration of our Rails hosting from Heroku to a Kubernetes solution. This was my first experience migrating a production stack, and my first experience with Kubernetes as well. And just in case you'd think the task wasn't daunting enough, the previous sole developer decided to leave for another opportunity a few weeks after my arrival. The departure was soon forgotten when the company hired another developer who had already transitioned an app to Kubernetes. So I took a course on Udemy, read the Kubernetes documentation thoroughly, took notes, and played around with a toy cluster. Then I set out on the transition journey.
Six months later, the Heroku-to-Kubernetes transition is over. I have gained a lot of experience, and I feel it's time to share it.
Our company's flagship product is a smart lamp that detects elderly people's movements and activities and sends an alarm whenever they fall. It is mainly used in French retirement homes, but we have received quite a lot of buzz in the silver-economy field, and we are about to scale worldwide and sell our product, along with its web infrastructure, to franchised resellers. This is where Kubernetes comes in as a handy solution, as it allows for quick replication and scaling of a single stack.
So the stack consists of an MQTT broker (hosted with CloudMQTT) that receives the messages sent by the smart lamps and their accessories. A Sidekiq worker constantly polls the broker, gets the messages, and takes action based on their content: send alerts, monitor activity, or just log temperature readings. The Rails web app is in charge of an admin interface and a few API endpoints used by the mobile apps. Other connected services include Heroku Postgres for the data, Redis as the Sidekiq backend, Elasticsearch to index and search the logged MQTT payloads, Logentries to track the Rails logs, Sentry for alerting, and Sendgrid for emails.
All these services had been subscribed to as SaaS from the Heroku app console.
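To make the moving parts concrete, here is a minimal sketch of the worker's dispatch step (class name, topics and actions are hypothetical, not our actual codebase):

```ruby
require "json"

# Illustrative only: routes a message polled from the MQTT broker
# to one of the actions mentioned above.
class MqttDispatcher
  def dispatch(topic, payload)
    _data = JSON.parse(payload) # payloads arrive as JSON documents
    case topic
    when %r{\Alamps/[^/]+/fall\z}        then :send_alert
    when %r{\Alamps/[^/]+/motion\z}      then :monitor_activity
    when %r{\Alamps/[^/]+/temperature\z} then :log_measurement
    else :ignore
    end
  end
end
```

In the real worker, each branch would enqueue a Sidekiq job rather than return a symbol.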
What I did right
I asked for time
Given the context, that was the first thing I did. The company wanted a swift move, but I said I had to learn the stack, get acquainted with the codebase, and troubleshoot a few issues before I could get a good idea of what such a migration implied. They agreed and gave me a few months to transition.
I can only imagine how painful it would have been to undertake such a task with a limited knowledge of the app.
I asked for help
My previous devops-ish experience was limited to editing my Capistrano configuration and maintaining a few VPS. Pushing an update at my new workplace merely consisted of typing `git push heroku master` and maybe running a migration on the server afterwards. Tuning the stack consisted of adding or removing a dyno, and changing a SaaS billing plan to get more resources.
When I started looking at the way Kubernetes is architected, I quickly understood that I needed the help of a consultant to guide me, advise me on how to build the new stack, and help me make wise decisions among the myriad of configuration options available. I am ever so thankful to that guy, who happened to be very friendly and patient. Kudos.
I played around and took notes
This is important. I destroyed and rebuilt the cluster from scratch at least three times. Every time, I wrote things down in a wiki for reference. And every time, I improved the workflow and gained a lot of experience.
I deleted pods and services, recreated them, fiddled with PVCs, tried out many Helm charts… This gave me hands-on experience and let me crash things a few times before knowing how to get things working.
I improved the production workflow
Maybe I should say "he" improved it, because that was the Kubernetes consultant's job.
We now have a nice, fully integrated CI flow that creates Kubernetes environments for production, testing and staging. It still needs to be improved, but we are almost on par with `git push heroku master` now.
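For illustration, the flow looks roughly like this. This is a GitLab-CI-style sketch; job names, the registry URL and the chart path are placeholders, and our actual pipeline is more involved:

```yaml
stages:
  - build
  - deploy

build-image:
  stage: build
  script:
    - docker build -t registry.example.com/app:$CI_COMMIT_SHA .
    - docker push registry.example.com/app:$CI_COMMIT_SHA

deploy-staging:
  stage: deploy
  environment: staging
  script:
    # "upgrade --install" makes the deploy idempotent
    - helm upgrade --install app ./chart --namespace staging --set image.tag=$CI_COMMIT_SHA
```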
What I did wrong
Lots of things. Here are the details.
I lost time adapting the stack
When I introduced Elasticsearch to the stack, our memory consumption went ballistic, so we decided to find an alternative. Elasticsearch is not an end-user-facing feature: it is used by our after-sales department and our developers. So we decided that searching the logs could take 3 seconds instead of 300 ms, and got rid of Elasticsearch altogether.
We were wrong. Not only did I spend a lot of time researching alternatives, tuning indices, partitioning tables and refactoring code, but in the end I didn't come up with a viable alternative that my fellow workers could use on a day-to-day basis. So I decided to temporarily move back to Elasticsearch for the transition, and do the research later on. That should have been our decision from the start, because we lost plenty of man-hours for the sake of saving a few hundred bucks a year.
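For the record, the kind of Postgres-based substitute we explored relies on trigram indexing. This is a hedged sketch with hypothetical table and column names, not our actual schema:

```sql
-- Illustrative only; not our actual schema.
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- A trigram GIN index makes substring searches usable,
-- but it stays far behind Elasticsearch on tens of millions of rows.
CREATE INDEX idx_mqtt_logs_payload_trgm
  ON mqtt_logs USING gin (payload gin_trgm_ops);

SELECT *
  FROM mqtt_logs
 WHERE payload ILIKE '%low_battery%'
 ORDER BY created_at DESC
 LIMIT 50;
```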
I added features
When you are given a new toy, it is tempting to try it in every situation that pops up.
After a few weeks, the k8s branch had diverged tremendously from master. Again, it could have gone horribly wrong, and it did to some extent: I had to do a few live patches on day 2, after a few customers explained that some parts of the service were not behaving as expected.
I changed versions
Hey. Postgres 12 is available. Let’s use it.
Maybe it is a good idea to upgrade your database. But keep that for a later time, once the dust has settled. I did encounter a few bugs related to that upgrade. They were not ginormous bugs, but they could have been avoided, or at least postponed.
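If I were to redo it, I would pin the version explicitly when installing the chart. Assuming a Bitnami-style chart (value names vary between charts, and the tag below is a placeholder), something like:

```yaml
# Illustrative Helm values: pin the image to the version running on Heroku,
# and only bump it once the migration has settled.
image:
  repository: bitnami/postgresql
  tag: "11.16.0"
```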
I moved everything in one go
We decided that we wanted to make our cluster totally self-contained and stop relying on SaaS, except for Sentry and Sendgrid. So we went for Stolon as a high-availability PostgreSQL database, VerneMQ for the MQTT broker, and Bitnami Helm charts for Elasticsearch and Redis.
This is fine. But it is too much to handle in a single move. I had quite a lot of experience handling a Postgres database, and Redis is OK, but I was a complete newbie at Elasticsearch administration. On migration day, this was the service that caused the most issues, and I had to reindex the 80-million-row logs database twice to get it right without crashing the server.
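For reference, the Elasticsearch `_reindex` API can be sliced and throttled so a large reindex doesn't overwhelm the cluster; `slices`, `requests_per_second` and `wait_for_completion` are standard parameters, while the index names below are placeholders:

```
POST _reindex?slices=auto&requests_per_second=500&wait_for_completion=false
{
  "source": { "index": "mqtt-logs", "size": 1000 },
  "dest":   { "index": "mqtt-logs-v2" }
}
```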
Looking back, things could have gone extremely wrong.
I should have taken things gradually:
- First, the Rails app, Redis and Postgres. That's the core of our stack, and that's enough work for a first pass. Fix the bugs, tune the app, and let it settle for a week or two.
- Second, set up Grafana. I use Loki for logs, so it is a nice replacement for Logentries.
- Then, bring in the MQTT broker. It's easy, because you can just bridge the messages from one broker to the other. Then it's just a matter of pointing the devices to the new URL. That's the IoT dev team's problem, not mine.
- Finally, try Elasticsearch with a subset of logs. See how it behaves. Then gradually move the rest to the new infrastructure.
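As for the broker bridge mentioned above, VerneMQ's vmq_bridge plugin can forward everything from the old broker during the transition. Here is a sketch of the relevant vernemq.conf section, with the host, credentials and topic pattern as placeholders (check the VerneMQ docs for the exact option names):

```
plugins.vmq_bridge = on

# Pull every message published on the old CloudMQTT broker
# into the new cluster while the devices are repointed.
vmq_bridge.tcp.cloudmqtt = old-broker.example.com:1883
vmq_bridge.tcp.cloudmqtt.username = bridge_user
vmq_bridge.tcp.cloudmqtt.password = secret
vmq_bridge.tcp.cloudmqtt.topic.1 = # in 1
```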
TL;DR: How you should do it
Don't be too greedy; be humble, and try moving things little by little. Try to keep the stack as similar as possible to the Heroku stack to start with: same software versions, same environment. Only then should you upgrade software, if you wish to do so. Keep the connected SaaS running, and gradually bring them into your stack if that's what you want. But be sure to have proper testing and staging environments beforehand.