What if we rewrite everything?

Hugo

·Jan 2, 2023·9 min read

This title sounds like a warning, doesn't it? I am very cautious when it comes to rewriting everything from scratch, and I agree with Joel Spolsky, who described it as one of the biggest mistakes you can make.

A rewrite is often the culmination of technical debt that has become too heavy to repay: an admission that we can no longer keep developing the product because the cost and the risks are too high. But while that situation is already a problem, the rewrite itself is a trap that can kill a product.

There are many examples of companies that have wasted years rewriting a product and then shut down because the market didn't wait for them.

But why am I telling you all this? Is Malt in such a catastrophic situation that it is necessary to redo everything? Well, that's what we will discuss in this post. We will talk about the evolution of Malt's technical stack since 2012, debt management and how we plan to repay our debt iteratively by component.

A little bit of history

The first version of Malt was written in 2012. At that time, it was a monolithic Java/Spring Boot application that ran on a PaaS (Heroku) with a MongoDB database and an Elasticsearch search engine. The frontend was mostly vanilla JS with a few touches of AngularJS. We used JSPs as a templating engine, which was a pretty good choice at the time.

In 2014 the application was expanded. I was able to go full-time after a funding round. Two of the three co-founders had a technical background and worked on the site. In the meantime, we introduced Redis for session and cache management and RabbitMQ to decouple some functional parts via an observer pattern.

We had code conflicts from time to time, and we concluded that it would be useful to release parts of the site independently. This would let us, for example, ship the search engine while the payment engine was still in draft. In addition, the cost of Heroku and its add-ons (Mongo and Redis) and concerns about the robustness of certain SaaS services (Elastic in particular) meant that the SaaS/PaaS option was becoming less relevant.

We therefore decided to split our single application into several small applications, add an Nginx reverse proxy in front of them, and run everything on dedicated OVH servers.

Even in retrospect, it is difficult to judge all of these decisions. Splitting into multiple applications was questionable at that point, especially since we were already using a feature flipping strategy to disable parts of the code. But the migration to OVH, driven in particular by the cost of the add-ons, was rather pragmatic and gave us a lot of flexibility for the future.

If this part interests you, a blog post on the migration to OVH was written at the time (in French, sorry, but you can try Google Translate).

For the record, the entire application and databases ran on a single machine!

Between 2014 and 2018, the Product team grew to 30 people. The architecture became more complex to meet stronger requirements for resilience, security and scalability. In particular, we partly switched to OVH Cloud, which was an opportunity to introduce Terraform. Consul was also adopted for service discovery, and Vault was added for private key management. PostgreSQL was introduced in 2019 following many disappointments with Mongo.

On the frontend side, we introduced Vue.js and removed AngularJS (EDIT: in fact, apparently not everywhere). Our vanilla JS modules were restructured into a small, fairly lightweight in-house framework. Handlebars allowed us to share page templates between client and server and thus solve some SEO issues.

After 2018, we ran into many limitations with OVH Cloud. At the time, their implementation of the OpenStack API was unstable and we suffered from many 500 errors. The lack of bundled services forced us to build solutions ourselves even though they were not our core business (private key management, log management, Mongo replication to PG, or the creation of a data warehouse, for example). We therefore decided to move to Google Cloud Platform (GCP) in 2019 to benefit from more packaged services. More precisely, we now run on GKE (managed Kubernetes). Traefik has permanently replaced Nginx as our reverse proxy thanks to its native integration with Docker and Kubernetes.

What is technical debt?

Each decision described above can be explained. Together, they made it possible to build the Malt platform we know today, so I stand by them in hindsight. There is no guarantee that different choices would have led to the same result.

But if these decisions are good, why talk about technical debt?

Well, because debt isn't just bad code.

Let's start with a simple definition: technical debt is the sum of the choices made to save time now that will incur costs later. Put simply, debt buys time that will have to be repaid.

I really like the analogy with financial debt because:

debt is not always negative: it can be used to move faster at a given moment, so it is an investment. Ideally, we repay it once we have the means to do so.
a debt is repaid with interest, and this interest accumulates over time: this is the principle of compound interest.
when you can no longer repay, you default on the debt. This results in a complete rewrite of the application, with the associated cost. In the worst case, it is the end of the product and the company.
technical debt is subject to inflation and deflation. External events can cause it to explode (or disappear). It is therefore not entirely predictable.
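
To make the compound-interest point concrete, here is a minimal sketch in Python. The function name `debt_cost` and the numbers (a cleanup that costs 10 days today and grows 5% more expensive each month) are purely hypothetical, not measurements from Malt, and real technical debt is far less predictable than a fixed rate suggests.

```python
def debt_cost(principal_days: float, monthly_rate: float, months: int) -> float:
    """Cost in days of repaying a shortcut after `months`,
    assuming the cost compounds at a fixed monthly rate (a simplification)."""
    return principal_days * (1 + monthly_rate) ** months


# Hypothetical figures: 10 days of cleanup today, 5% more expensive every month.
for months in (0, 12, 24, 36):
    print(f"after {months:2d} months: {debt_cost(10, 0.05, months):.1f} days")
```

The point of the sketch is the shape of the curve rather than the figures: under these assumptions, each extra year of waiting adds more cost than the previous one did.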

This last point, that debt can change for external reasons, is key to answering the initial question. The choices made at Malt were all made in a given context.

But:

the constraints that led to certain choices have changed. For example, technologies that did not exist in 2012 are now standard (such as SSR for JS frontends)
the increased size of the team puts additional constraints on development standards
the increased size of the company and of the user base puts additional constraints on the technologies used (e.g. a batch job that took 10 minutes at low volume and takes 2 hours today)

And so on.

In short, some choices that are not initially seen as debt may later become debt.

Note that I am not saying we must anticipate future constraints. For many of them, I recommend waiting until the last moment, because addressing them early often adds complexity, which has its own cost and slows the team down.

Observation on Malt

At Malt, we have always had a strong principle of autonomy. Each person is encouraged to keep a critical eye and to propose improvements to the technology stack. These proposals are discussed, then adopted or rejected collectively.

This participatory approach has been very useful to us. The resulting choices have been effective and innovative while remaining pragmatic.

For example, we have:

introduced PostgreSQL to gradually replace Mongo (in French)
gradually switched from Java to Kotlin
gradually switched from JavaScript to TypeScript

But with this approach, we often decided to apply these changes only to new code, without fully migrating the old code. It was a logical choice at the time, but it increased our technical debt.

This choice has a strong consequence: the explosion of dependencies.

And these dependencies have significant impacts:

on the cognitive load of each developer, who regularly discovers different ways of working
on application startup time, as each application loads many dependencies
on workstations, which must run an increasingly complex stack
on the IDE (IntelliJ), which struggles significantly, not only because the code base is large but also because it has many libraries and languages to analyze. And every plugin has a cost!

We constantly watch many metrics: build times, availability rates and so on. But some of them degraded so gradually that the change did not draw attention.

For example:

app startup time is now over 50 seconds on some workstations
dev workstations require a very heavy configuration to run the entire stack (imagine Docker with containers for Elasticsearch, Redis, MongoDB, PostgreSQL, Traefik, etc.)
IntelliJ indexing can last for hours, with a memory footprint of 8 GB in the latest versions

And all of this leads to a significant increase in development cycle times.

However, so far so good.

But we know from experience that we have reached the point where, if we do not take strong action, within a year we will have to partially default on our debt, which would mean a code freeze of several months to repay it.

And that is what must be avoided at all costs.

KISS

In 2022, among our 5 corporate OKRs, we had a somewhat special one: KISS, Keep It Simple, Stupid.

This goal of simplification is fully in line with the subject developed in the previous section. We must simplify to reduce the cognitive load on each developer and the load placed on our tools.

To tie this back to the introduction of this post: we are of course not going to rewrite all of Malt from scratch. But we will work to simplify Malt through several projects.

One of those projects is what I want to talk about soon: the complete modernization of our frontend stack, how to go from 6 frontend technologies to only one, and how to completely get rid of Java on the frontend side (the famous JSPs).

In a somewhat poetic way, this project is called Singapore because it is the story of a trip outside the island of Java.

Stay tuned, that will be the subject of the next post.
