
Patterns and Playbooks for Making Every Commit Production Ready


About the Session

Trunk-based development, continuous deployment, and blue/green deploys are powerful techniques for increasing the velocity and safety of our releases. But they have a dark side: life becomes more complicated when new code and old code can run side by side. Much of the advice on the broader internet boils down to "Just use feature flags." Reality is a little more nuanced...

The good news is that we engineers have been wrestling with compatibility problems for years, and there are some well-tested techniques and strategies for coping.

In this talk I want to share some production-ready, battle-tested strategies for evolving live systems with zero downtime. We will look at adding new functionality, refactoring live code, evolving database schemas, and a few of the worst mistakes we made along the way.

This talk should be of interest to engineers who are new to, or interested in, working with continuously deployed services (be they micro or otherwise).

Transcript

Thank you everyone for joining me today; let's get started. My name is Simon Gerber. I'm an engineering manager at Prospection. We're a Sydney-based startup working on data analytics in healthcare, and I have a confession to make.

I don't like working after hours. I don't want to do it. I don't want to be on call.

I don't want to volunteer for the Saturday deployments. I want to get up from my desk every day at 5:00 PM and go home to my family. At the moment, that just means opening the bedroom door and walking downstairs. But you know what I mean. Now, here's where I struggle, because I also like working for startups.

I want meaningful work, building meaningful things, but during office hours only, thank you very much.

I've done my time. I've been there. I've been there at midnight, frantically fixing bugs, right on master, trying not to inhale that death breath we programmers get after too much coffee and not enough sleep. I've kicked off the final, final build at 1:00 AM and slunk off home to bed. And then I've gotten out of bed four hours later and crawled back into the office by 6:00 AM to kick off the final, final, final build, because the 1:00 AM build failed. And then, exhausted and bleeding from the eyeballs, I'd give this build to the operations team. And what would they do with it? Nothing. Not a bloody thing, because they were worried about running the data migrations.

Now, to be fair, we'd have to set aside a whole weekend to get this thing done. Kick off the migrations on Friday night, babysit them through Saturday, roll back on Sunday. One particularly notable migration was so huge that we needed to do it on the Easter long weekend. We had to set aside four full days to make sure we had enough time to run this thing and then roll back if it failed. So, we had just one shot at this, just one shot. It was absolutely crucial we get it right. And we put blood, sweat, and literal tears into this thing. I mean it. I vividly remember someone crying at their desk over this.

And do you think we made it? Do you think we got it done? Do you think this one mammoth migration that everything hinged upon and that had to be perfect, do you think it worked? Well, no, it failed. Of course, it failed. We failed, the migration failed, and we rolled back, and that client could not update again for another whole year. And in that year, they actually canceled our contract and engaged another vendor to replace us. So, after the first five years of my programming career, this was just life. It was how things were done. Welcome to the real world, buddy.

But then I joined a FinTech, a company that moves money around. Real people's real money, where you would have thought the risk profile would be higher and deployments might have been even slower and more risk averse. But their deployments ran every two weeks like clockwork. They deployed at 10:00 AM on Wednesdays, during office hours, with no downtime. So, at first, compared to my previous experience, I was shocked, but the engineers told me that they didn't like deploying after hours or on weekends. So, they treated the problem of daytime deployments as an engineering challenge and they solved it.

So, these days, continuous deployment, continuous delivery, blue-green deployment, they're all fairly common terms. Back when I joined that FinTech, they hadn't achieved widespread adoption, and for the time what that FinTech was doing was quite revolutionary. We had over a hundred engineers working on master all the time. No branches, no merge conflicts and no downtime, just a smooth, endless flow of code from keyboard to production.

This was just my second ever job and the second ever code base I'd worked on in my professional life, and after just a brief moment of appreciation, I got on with the job, because, you know, new normal, right? Now, in hindsight, it was a kind of developer paradise, but the trouble with paradise is that you start to take it for granted. Without all the high drama, eventually I got a little bored in the world of finance and moved on.

And boy, wasn't that a shock! Back to the world of branches and GitFlow and being on call, but this time the phone actually rings, and late-night deployments. And I feel like I've come down from the mountain and I'm shouting “It doesn't have to be this way! It doesn't!” But it is. For so many of us, it still is.

That's when I started to realize that just chanting “CICD! CICD! CICD!” over and over again wasn't helping. Saying “Feature flags, feature flags, feature flags!” doesn't help.

Continuous integration is good. Continuous delivery is good. We know this, but most of us for most of our careers only ever merge into master when everything is finished, ready, and working. That's just how we're used to working. How do you not break stuff? How do you stabilize the build while you work on new stuff? And what about migrations? It just doesn't seem like it will work.

The fact of the matter is that it won't work if you keep on writing code the same way you always have. It requires both a mindset shift and a fundamentally different approach to the way you order your development process. And that's fundamentally why I'm giving this talk, because I think that learning to write small, deployable commits is a new skill.

Let me compare it to learning to drive. When you first learned to drive, there's a lot to learn: how to accelerate, how to steer, how to park, how to overtake on the left while honking like a maniac, the road rules, how to anticipate what other drivers are going to do, so you can cut off that jerk who wants to overtake you on the left.

I get it. We've got a job to do. We've got features to build. If our hypothetical destination is just the corner store, and we need to learn to drive to get there, and it's a 10-minute walk, then forget it. The quickest thing to do is just walk. Learning to drive first? It's going to take us, what, six months, maybe? Why bother.

But the thing is, if you'd never, ever learned to drive, your personal concept, or your organizational concept, of how far you can travel in, say, an hour is going to be constrained. So, when I think about my first driving lesson, I was terrified.

I refused to go over 40, maybe even 20; even 20 felt fast. But now it's automatic. You just get in the car and go; all you have to think about is your destination.

So, if you currently work with, or have only ever really worked with, feature branches and offline deployments that you do every month, or every couple of months, or even longer, then someone coming in and talking about deploying daily, hourly, on every push seems totally unachievable. There's a lot of stuff you need to get done before you can get to that point.

Like any new skill, this initial knowledge gap can seem overwhelming. So, this talk, I hope will be like saying “Come hop in my Uber.” You're not going to learn to drive from one conference talk, but I hope that you're going to learn more than what you would just from standing on the sidewalk. And the payoff is freedom. It is so worth it, but I'm going to start with an honest truth.

Continuous deployment will complicate your life. That is because in order to continuously deploy with no downtime during the day, new code and old code are going to be running side by side. Messaging event handlers; RPC endpoints; your application logic; your persistence logic; old code, new code running side by side, and then also individual services and servers can go down at any time.

The way you manage that risk is with small, frequent releases. And the way we get to small, frequent releases is with small, frequent, safe commits and continuous integration. I'm going to start the technical portion of this talk with an example of how this blue-green business, old code, new code side-by-side can completely screw with your head.

Has anyone here written something that looked like this? If we were live, I would ask you to raise your hands. We're not, so I'm just going to imagine you raising your hands, and in my imagination, at least 90% of you have your hands up, because I don't think you're a real developer until you've not only written this code, but you've seen “This can never happen.” written in your logs, because I certainly have. Maybe I'm projecting, but I think this is a pretty common occurrence.

And in fact, it happened to us at this FinTech I spoke of. One morning we come in, and overnight we'd received a whole bunch of alerts and those alerts said “Our money thing worker is sad, an impossible thing has happened. The development team has been notified and we'll attend to this error between 9:00 AM and 5:00 PM on the next business day.” because you see, at this FinTech, even our error messages were designed to avoid us getting paged after hours.

So anyway, we dug in to investigate and we have our money thing worker. I'm just going to turn on my pointer. We have our money thing worker and it receives events from our money events sender, and the error message was right. There was basically no possible way this error condition could have occurred, unless we double processed a message, but that was impossible. We had a whole lot of infrastructure and libraries dedicated to giving us exactly once message passing semantics. It looked like we double processed one of these, but surely that wasn't possible. And yet, it seemed to have happened.

Sherlock Holmes said that when you've eliminated the impossible, whatever remains, however implausible, must be the truth. We needed to confront reality. This exactly-once message passing infrastructure worked so well, we tended to forget about it.

And in our heads, when we wrote code, this is what our architecture looked like. But no, this is reality. This is what our architecture really looked like. We were doing something that we called “live live”, where we had two data centers. And for every service, there was another copy, at least one other copy running on the other data center for redundancy.

When we sent this money thing message, it was actually received by two money thing workers, and they would attempt to process this thing simultaneously. So, to avoid duplicate handling, we would extract a unique identifier, an idempotency key, out of the message, and we'd write that into a database table and wrap the whole thing in an ACID transaction.

So, either this thing would finish processing first, and when the other one comes along, there are a couple of extra network hops here, so it would come along a bit later, it would see "I've already got this identifier in my database. I'm going to skip processing, this one's a duplicate." Worst case is they would both process simultaneously, one would succeed, and the other one would fail with a unique constraint violation, which we expected and handled gracefully. So, this thing was bulletproof. It never failed. And yet somehow, we seemed to have double processed a message.
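To make that pattern concrete, here is a minimal sketch of the idempotent-consumer idea described above, using plain JDBC. The table, class, and message names are hypothetical (the real system used shared messaging libraries), but the shape is the same: insert the idempotency key and do the work inside one transaction, and treat a unique constraint violation as "the other data centre already handled this".

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

// Hypothetical sketch of an idempotent message consumer.
public class MoneyThingWorker {

    private final Connection connection;

    public MoneyThingWorker(Connection connection) {
        this.connection = connection;
    }

    public void handle(MoneyThingMessage message) throws SQLException {
        connection.setAutoCommit(false);
        try (PreparedStatement insert = connection.prepareStatement(
                "INSERT INTO processed_messages (idempotency_key) VALUES (?)")) {
            // processed_messages.idempotency_key has a UNIQUE constraint, so the
            // second worker to arrive fails on this insert and treats the message
            // as a duplicate.
            insert.setString(1, message.idempotencyKey());
            insert.executeUpdate();

            process(message); // the actual business logic, inside the same transaction

            connection.commit();
        } catch (SQLException e) {
            connection.rollback();
            if (isUniqueConstraintViolation(e)) {
                return; // expected when the other instance got there first: skip quietly
            }
            throw e;
        }
    }

    private void process(MoneyThingMessage message) { /* business logic elided */ }

    private boolean isUniqueConstraintViolation(SQLException e) {
        return "23505".equals(e.getSQLState()); // PostgreSQL unique_violation, for example
    }

    record MoneyThingMessage(String idempotencyKey) { /* other fields elided */ }
}
```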

So, we dug a little bit deeper and we stared at the logs, because that is what you do, right? When something is blowing your mind and you have no idea what's happened, you just stare at the logs and hope for a moment of clarity. So, we stared at the logs and we had a moment of clarity. The timestamp on that error message indicated that the failure occurred during the deployment window. Our blue-green deployment was actually stretched over two days, so you had the old code and the new code running side by side for a whole day. And that was a bit of a clue, because we realized that what was going on is, we actually had two slightly different versions of the money thing worker in production, and they were both looking at the same message. But had anything changed in code between our blue instance and our green instance that might have caused this error?

It turned out there was. There was a very subtle change to the library code that worked out the idempotency key of a message. So, when this single message was seen by the two different instances on the two different data centers, they came up with different views on what that idempotency key was. And so, the message was double processed, so we spent a lot of time in our accounting system fixing up all the double entries and making sure this sort of thing couldn't happen again.

We learned something very important from this occurrence. We learnt that thinking all the time about backwards and forwards compatibility in everything you do is really hard. That's kind of the skill I want to impart a flavor for in this talk. So, to assist with that, I've organized my thoughts and organized this talk by breaking the problems we face into three broad categories: adding new features, refactoring live code, and migrating relational database schemas.

So, I'm going to start with adding new features.

Just remember the golden rule; if nothing else, remember the golden rule: "Additions are safe. Removals are not. Ignoring this rule will hurt you a lot." When I say additions are safe, what I'm talking about is things like: you can add an unused nullable column to a database table, and I will cover database migrations in a lot more detail later, but you can add this column, commit that code, push that code to production, and deploy it everywhere you want. In theory, it shouldn't break anything. It doesn't help you. It doesn't have business value, but at least it's there in production. So, you can do that. It is safe.

Similarly, adding an unused optional field to a DTO, to a data transfer object, that should be safe. Likewise, you can add an unused parameter to a method.

You can add an unused method to a class, and you can commit that and push it to production. So again, not helping you, but it is safe. It's not breaking anything. Likewise, adding unused classes to a module, adding unused modules to an application, and if you're doing microservices, you can add an unused service to a deployment. All these things are safe.

The one caveat is, when I say “unused”, I mean unused by production code and live users. So, unit tests, integration tests, toggling a feature flag and testing manually, all of that still applies. Now, one thing we can do is to take the fact that additions are safe and use that to build features bottom up.

Adding Features: Bottom Up

A simple worked example of this: let's have a hypothetical create user form with a really simple data model behind it. I'm just going to turn my pointer on again. First name, last name on the form. First name, last name on the entity. Simple. Now, a product manager comes along and they say "We need to capture the favorite salad green as well. We need a new field." So, building bottom up, the first thing we would do is go to our data layer and add a new nullable field to our table. If you're using Hibernate or some kind of ORM, we would probably add favorite salad green to the entity as well. This is all nullable. Nothing at any higher layer of code is populating this, but it's there and it's safe to commit and you can push that in.
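As a rough sketch of that first, purely additive step, assuming a Flyway-style SQL migration and a JPA entity (the file name, table, and column names here are all hypothetical):

```java
// V42__add_favourite_salad_green.sql (hypothetical Flyway migration):
//   ALTER TABLE users ADD COLUMN favourite_salad_green VARCHAR(255) NULL;

import jakarta.persistence.Column;
import jakarta.persistence.Entity;
import jakarta.persistence.Id;

@Entity
public class UserEntity {

    @Id
    private Long id;

    private String firstName;
    private String lastName;

    // New, nullable, and not yet populated by any higher layer.
    // Deploying this on its own should not break anything.
    @Column(name = "favourite_salad_green", nullable = true)
    private String favouriteSaladGreen;

    // getters and setters elided
}
```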

The next layer up would be the domain layer. So, let's pretend we've got really nice separation. We're using hexagonal architecture ports and adapters, something like that, so really nice separation of concerns. We also have an additional user business object, domain object. So likewise, we can add favorite salad greens to that, we can start to build some code in our domain layer that uses it. And that code, nothing is calling it. It's new code. It's unused. It should all be safe to commit. If the code was deployed at that point in time, it shouldn't break anything.

The next layer above that in a typical project is probably going to be your MVC controller. If you're like every other Java project in the world, you've probably got some kind of DTO that maps to and from JSON. You could add your favorite salad green there as an optional parameter as well. Nothing is calling this API, because I'm talking about unused by production code. However, if this was populated, it would pass through all the layers down to the database, and likewise, you could also have written code that passes it all the way up. Because this is all unused, an unused path through the code, at any point in time, as we're building this up, if it had been deployed to production, nothing should have broken.

The final step on top of that, once you've got all this backend work done, is that you can come and put the user interface on top, safe in the knowledge that all of this is done and tested and working.

So, in that way, you can build features bottom up without the complexity of putting in feature flags, if the change is simple enough. The development flow is pretty straightforward if you know what you're doing. The risk is that a backend driven design might not end up meeting the frontend requirements if your requirements are a little bit more complicated, and this is a very linear approach to work, so it's kind of hard to parallelize that across different members of the team.

Adding Features: Top Down

You can do exactly the same thing top down if you start with a feature flag. You can start by building the user interface, but this time there is no domain layer or persistence layer behind it.

You need that feature flag to stop people seeing this interface, which won't work yet. Protected by this feature flag, the next thing you can do is add the controller and then the domain layer and then the persistence layer. Once it's all done, and you've tested that, you can get rid of the feature flag and you're done.

The advantage of going top down is that you get rid of the risk that the frontend requirements are going to be misaligned with the backend implementation. The risk is exactly the converse: a frontend-driven design may make incorrect assumptions about what the backend should be like. Again, it's very linear. It's hard to parallelize that across the team.

Adding Features: Middle Out

You can work middle out. I say middle out, not top down and bottom up because otherwise this sort of thing tends to happen with any non-trivial feature. The middle of the feature is probably going to be an API contract.

If you've got a frontend/backend split, that might be a REST API, so you could use something like Swagger or OpenAPI, or Pact if you're doing consumer-driven contracts. If it's microservice to microservice, the same thing might apply, but it might be gRPC. You might have GraphQL there. If you're a modular monolith, this could be an interface or a facade or an adapter or an anticorruption layer.

You've got a contract. On the backend, you can start to build against that contract and you're safe. This is unused code because nothing is actually calling it. On the other side, be it frontend or another service, you can start to build up an API client against that contract definition.

Again, note that this client is not used by any higher layer, so this chunk can be iterated on and developed and worked upon without needing a feature flag. And at any point, that code should be able to be committed and pushed through to production. And although it doesn't help you, it doesn't break anything.
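Here is a minimal sketch of that middle-out starting point for the modular-monolith case the talk mentions, where the contract is simply an interface. All names are hypothetical, and for a REST or gRPC split the contract would be an OpenAPI or proto definition instead.

```java
// The contract in the middle.
public interface SaladGreenPreferences {
    void recordFavourite(long userId, String saladGreen);
}

// Backend side: built against the contract, but nothing calls it in production yet.
class SaladGreenPreferencesService implements SaladGreenPreferences {
    @Override
    public void recordFavourite(long userId, String saladGreen) {
        // domain and persistence work goes here, developed over several commits
    }
}

// Consumer side: a client built against the same contract,
// also not yet wired into any higher layer.
class SaladGreenPreferencesClient {
    private final SaladGreenPreferences api;

    SaladGreenPreferencesClient(SaladGreenPreferences api) {
        this.api = api;
    }

    void saveFavourite(long userId, String saladGreen) {
        api.recordFavourite(userId, saladGreen);
    }
}
```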

The next step beyond that is you would probably then introduce your feature flag and start building the UI. On the backend, you can then start working on the domain layer and the persistence layer, however you like. And this path right through the code, as per the earlier examples, is protected by that flag.

You can build that up and commit it at any point in development, and really it shouldn't break. This kind of does depend on your ability to isolate changes so that they are all protected by that flag.

Adding Features: Applying the Patterns

In terms of adding new unused code: if you can structure your features such that the bulk of the work is in adding code, that's kind of [inaudible]. To apply these patterns, my tip would be to bias for additions.

So, ask yourself for a given feature: how can I break it up into a set of discrete steps that are only adding code, and what order am I going to perform those steps in? And then after each step, ask: if the code was deployed right now, if the release train left the station, would anything break?

The next category is refactoring live code. I've said the golden rule: additions are safe; removals are not; and modifying code, therefore, is unsafe. If you think about the way Git, or your source control, treats a changed line: it's an addition and a removal in one step. We cannot do that in one step. We cannot do an atomic add-remove. We have to break that apart. First add, then remove.

So, in general, making changes will mean adding the new code and deploying that, switching over to the new code and deploying that, and then getting rid of the old code and deploying that.

At the most simplistic level, it will probably look something like this: if the flag is enabled, do something new; otherwise, do something old. This new behavior will, for any non-trivial feature, take a little while and several commits to develop, but that flag is protecting you. You can continue working on it, pushing it, deploying it to production, and it shouldn't break anything because we're not letting anybody in. It's protected by that flag.

So, to take that very generic description, new behavior, old behavior, and bring it back to our previous example of capturing the salad greens: maybe the reason we want to capture favorite salad greens is so we can upsell radishes as people sign up. So, it's that exact same example, but with a bit of domain logic. It's not very logical, but anyway: maybe if our radish campaign flag is on, then we're going to create the account and upsell. Otherwise, we just do the plain old boring create account. So, that's an example of modifying a feature in flight, protected by a flag.

However, the thing that might occur to you at this point is that the common behavior in both cases was create account. So, when I say bias for additions, and I understand this is a very contrived toy example, what you could do is pull out the create account and always do that, whether the flag is on or off, because you always want to do that, whether the flag is on or off. So, the only thing you're then protecting at this point is whether or not you upsell radishes to that user.

If you can structure your code such that you're getting rid of that else and only adding code, then that makes it a lot easier to do modifications. At this point, you might even want to take that further and ask yourself, do you even need a flag? Maybe the presence of favorite salad green in the incoming API call alone is enough to tell you whether to execute this new behavior.
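Here is a rough sketch of that refactoring, with hypothetical names throughout; the flag client is reduced to a single interface so the example stays self-contained.

```java
class SignUpService {

    interface FeatureFlags { boolean isEnabled(String name); }
    record SignUpRequest(String fullName, String favouriteSaladGreen) {}

    private final FeatureFlags flags;

    SignUpService(FeatureFlags flags) { this.flags = flags; }

    // Before: the common createAccount call is duplicated on both sides of the flag.
    void signUp(SignUpRequest request) {
        if (flags.isEnabled("radish-campaign")) {
            createAccount(request);
            upsellRadishes(request);
        } else {
            createAccount(request);
        }
    }

    // After biasing for additions: createAccount always runs and the flag
    // only guards the purely additive behaviour.
    void signUpBiasedForAdditions(SignUpRequest request) {
        createAccount(request);
        if (flags.isEnabled("radish-campaign")) {
            upsellRadishes(request);
        }
    }

    // Or, possibly without any flag at all: the presence of the new field
    // in the incoming request drives the new behaviour.
    void signUpWithoutFlag(SignUpRequest request) {
        createAccount(request);
        if (request.favouriteSaladGreen() != null) {
            upsellRadishes(request);
        }
    }

    private void createAccount(SignUpRequest request) { /* ... */ }
    private void upsellRadishes(SignUpRequest request) { /* ... */ }
}
```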

Now, the other thing I want to pose is how do you know it's safe? Because if it takes a nontrivial amount of time to build this new work, you're going to have unit tests. You're going to have integration tests. You're going to have end-to-end tests, maybe? No, I hope not. There are better ways to test behavior. But are you really sure that is okay? Have you really found all the edge cases? Are you certain? Because users do some really crazy stuff...

Refactoring Live Code: Scientist

I'd like to introduce a library and a technique pioneered or at least blogged about by GitHub. GitHub have this library they created called Scientist, and this library came about because they needed to refactor the merge functionality.

When you merge your pull request or merge a branch in GitHub, that was actually shelling out and running Git on the command line, and they needed to port that into native Ruby. GitHub can't really take itself down for maintenance for any lengthy period of time. They also can't really afford to break the merge feature.

So, they were thinking, how do we do this safely? How do we do this continually integrated with everything checked in and deployed? And how do we make sure it's absolutely safe? Now, what they did was they ran the new merge behavior and the old behavior at the same time side by side. And then they compared the two and reported on whether they were different or not.

Irrespective of whether they were the same or different, they just returned the old behavior. They proceeded with the old code path beyond that point. But this was deployed to production and running in production. So, real users with real code, real data, real behavior, doing all the crazy things that users do. The old code and the new code, old merge behavior, new merge behavior, were running side by side, and over the course of a few days, they had millions of merges performed. And when those millions of merges all reported identical behavior, they could be confident that this was working, and at that point they could chuck out the old behavior and start returning the new behavior. So, this technique is really powerful. That library has ports in many other languages, and some variation of this technique is an incredibly useful thing to have in your tool belt.
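GitHub's Scientist is a Ruby library with ports in other languages. Rather than assume any particular port's API, here is a hand-rolled sketch of the core idea: run both behaviours, compare, report any mismatch, and always return the trusted old result. The real library adds sampling, timing, and error handling on top.

```java
import java.util.Objects;
import java.util.function.Supplier;

// Hand-rolled sketch of a Scientist-style experiment.
class Experiment<T> {

    private final String name;

    Experiment(String name) { this.name = name; }

    T run(Supplier<T> control, Supplier<T> candidate) {
        T controlResult = control.get();
        try {
            T candidateResult = candidate.get();
            if (!Objects.equals(controlResult, candidateResult)) {
                report(controlResult, candidateResult);
            }
        } catch (RuntimeException e) {
            report(controlResult, e);
        }
        // Irrespective of the comparison, the old behaviour is what we return.
        return controlResult;
    }

    private void report(Object control, Object candidate) {
        System.err.printf("experiment %s mismatch: control=%s candidate=%s%n",
                name, control, candidate);
    }
}

// Usage, with hypothetical merge implementations:
//   Experiment<MergeResult> exp = new Experiment<>("native-merge");
//   MergeResult result = exp.run(() -> legacyMerge(pr), () -> newMerge(pr));
```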

Branch By Abstraction

It also leads me on to a technique called branch by abstraction, which is again, quite an old idea. If you Google for that term, you'll find a lot of blog posts written by people that explain it a lot better than I can.

The one I like the best is by Martin Fowler and these images, in fact, I confess I took them from his blog. Branch by abstraction is a technique you pull out of your belt when you need to rebuild your engine while your plane is in flight. So, the situation that you probably find yourself in is you have something you want to replace. The flawed supplier, whatever it is, you want to get rid of it.

I think when this technique was pioneered, they wanted to upgrade a version of Hibernate. So, you have something that's deeply coupled to many different other parts of your code base. We're talking about small, safe commits that at any point in time can be deployed to production, so if you were to try and upgrade this whole thing in one big bang, you'd normally think I need to put that on a branch and then you have to do all that work on the branch while simultaneously master is changing and you need to continually merge and keep your changes up to date with the other changes, and depending on the speed of change, you can get lost and you can give up and throw the whole thing away, or you can work on it for a month, and merge it back in, and there's a big problem. So, branch by abstraction is a way to avoid that whole hamster wheel.

What you want to do is put an abstraction layer in front of the thing you want to get rid of. It could be the adapter pattern, or an interface, depending on the language you're working in. Before that line on the diagram exists, you can build this up slowly over the course of several commits, and because it's unused code, that should all be safe to push out to production. Once you've got this abstraction layer working the way you want, you should be able to point some clients to it. This layer should be really thin and just sit exactly over the top of that thing. So, this is still just basically a refactor; you've just added a level of indirection and nothing should change. It should all be safe to go out at this point, and you can take your time with live code, working on master, slowly pointing your clients towards that abstraction layer. Over time, you should end up in a situation where you've pointed all the client code at the abstraction layer, and now you've decoupled your client code from the thing you want to get rid of.

This is where we get back to continuous delivery, continuous deployment and feature flagging. Inside the supplier, you can now start to branch your behavior depending on whether, for example, a flag is enabled or not, or you could use the Scientist library at this point too, or a similar technique.

At this point, when you're happy that the new behavior is working, you can just throw away the old stuff and use the new thing. Whether or not you keep that abstraction layer is a subject of debate. You could get rid of it if you wanted to tidy things up, or you could just leave it in, on the assumption that you might change that again in the future. Up to you.
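A small sketch of the shape this takes, with hypothetical names: a thin abstraction layer that initially just delegates to the old supplier, and, once all clients go through it, a place to branch between the old and new implementations.

```java
// The abstraction layer clients are gradually pointed at.
interface PaymentGateway {
    Receipt charge(Order order);
}

// Step one: a thin adapter that only delegates to the old supplier,
// so pointing clients at it changes nothing.
class LegacyGatewayAdapter implements PaymentGateway {
    private final LegacyGateway legacy = new LegacyGateway();

    @Override
    public Receipt charge(Order order) {
        return legacy.charge(order);
    }
}

// Later, once every client calls PaymentGateway, the switch can live behind it:
class SwitchingGateway implements PaymentGateway {
    private final PaymentGateway oldImpl;
    private final PaymentGateway newImpl;
    private final boolean useNewSupplier; // a feature flag, or a Scientist-style experiment

    SwitchingGateway(PaymentGateway oldImpl, PaymentGateway newImpl, boolean useNewSupplier) {
        this.oldImpl = oldImpl;
        this.newImpl = newImpl;
        this.useNewSupplier = useNewSupplier;
    }

    @Override
    public Receipt charge(Order order) {
        return useNewSupplier ? newImpl.charge(order) : oldImpl.charge(order);
    }
}

// Hypothetical supporting types to keep the sketch self-contained:
class LegacyGateway { Receipt charge(Order order) { return new Receipt(); } }
class Order {}
class Receipt {}
```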

Evolving APIs

Another thing I want to really briefly mention at this point is evolving APIs. This is not the same as versioning APIs. Versioning APIs is something you do when the consumers of the API are outside of your control, or when your product is itself an API. I'm thinking of things like the GitHub public API, where there will be consumers and integrations in the wild that you can't control, or you might have a native phone client and people never upgrade their phone apps, so every version you've ever released will be in the wild at some point.

If you cannot control the consumers of your API, then we're talking about versioning. If you do control both sides of your API, if it's a frontend that you're in control of, like a single-page app, or it's just service-to-service in a microservices world, then we're talking about evolving APIs, not versioning them.

What I just want to say here quickly is: remember the golden rule, and something like consumer-driven contracts are your friend. If you're working with something that doesn't have a schema, like REST, then looking at something like Pact or Spring Cloud Contract is a really good idea. I don't have time to dive into APIs in any more detail in this talk, but I will point you to Pact if you're using REST.

Refactoring Live Code: Applying the Patterns

So, applying these modification patterns. My first tip would be: ask yourself, can you avoid modifying that code in the first place? Can you architect your code so that you can implement the next feature by only adding code? Earlier talks mentioned some architectural styles that could help there, like the hub-and-spoke pattern or the... I confess I've forgotten the word they used, but... the plugin architecture. Microkernel, sorry, that was it. The microkernel architecture, where you can basically plug in features. An architecture like that, if it's appropriate for your domain, can help you add new features in by mostly adding code.

If modification is unavoidable, you can isolate that change and then use flags or Scientist to expose that work in progress for testing. And also, just get comfortable with nulls, because sometimes during development a null can be as good as a flag.

The last problem category is schema migrations. By schema migrations, I'm talking about relational database schemas. This doesn't necessarily apply entirely to a non-relational database or some other data store, though maybe elements of it are transferable. But I do specifically want to talk about relational schemas, because this tends to be the elephant in the room when we talk about continuous deployment.

So, to make schema migrations work, you're going to have a few prerequisites. You want a schema migration tool, something like Flyway or Liquibase in the Java world. You want a break-glass procedure, some way to safely get in and run manual alters against production, because you're going to make a mistake at some point while you're learning and you're going to have to fix that up. The other really important thing is you want a tested backup and restore strategy, which hopefully you're not in production without anyway, so hopefully you're all covered.

And then you want to think about database rollback, but what I'm going to say about database rollback is: do not think about database rollback. It is not your friend. The reason why is because of blue-green deployments, canary deployments, live live deployments, old code and new code running side by side. If you think the way to make a schema migration safe is to make sure there's a rollback strategy, it won't help you, because you still have live code running in production that's relying upon the old schema. That's one reason. The other reason is that at some point, you're going to perform some alter that is destructive. You're going to rename a column, drop a column, drop a table. You might even just add a column, which you think is non-destructive, but then data goes into that column, and so you can't roll back without losing that data. So, you cannot rely on database rollback to save you.

What you need to get to is forward and backwards code compatibility. You need comfort that your code will be forward and backwards compatible with the schema you’ve got. One really good way to do this is a schema backwards compatibility test.

This is kind of like a meta test, and the way it works at a high level is: you run up an empty database. You execute the migrations from latest, from master, against that empty database. And then you go and grab the version of code that's currently running in production, and you run the integration test suite from production against that evolved database from master.

Now, if that test suite is nice and comprehensive, and if it passes at this point, what you have just shown, what you have demonstrated, is that the schema that's in trunk, awaiting deployment, will not break the code that's currently running in production. What that also means is that if the deployment of the new code goes pear shaped, you could theoretically roll back the code to the version that was previously in production and not have to roll back the database, because it's compatible. So then from there we go back to the golden rule. Additions are safe; with removals, you have to take a lot more care.

So, adding a new column is safe if it's nullable or defaulted. Adding a new table is safe. Removing an unused column is safe, and removing unused tables is safe. How do you know that they're safe? You know they're safe to remove because when you try to do it in your local development environment or in your CI pipeline, the schema backwards compatibility test will fail if it's not safe.

So, you want to be in a position where you can decide a column is unused, remove the column, test locally or push it to CI, and the build will tell you whether or not that was actually safe. At that point, you're in a pretty good position to take this on. You can do it without those backwards compatibility tests, but you'll be paranoid about every change you make, so I strongly recommend starting by getting those compatibility tests in place.
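Here is a sketch of the automatable half of that meta test, assuming Flyway. The orchestration, checking out the tag currently in production and running its integration suite against the freshly migrated database, lives in the CI pipeline rather than in this code.

```java
import javax.sql.DataSource;
import org.flywaydb.core.Flyway;

// Migrate an empty database to the schema on the current branch (master/trunk).
// The pipeline then runs the production version's integration test suite against
// this same database; if that suite passes, the new schema is backwards compatible
// with the code already deployed.
public class SchemaBackwardsCompatibility {

    public static void migrateEmptyDatabaseToLatest(DataSource emptyDatabase) {
        Flyway flyway = Flyway.configure()
                .dataSource(emptyDatabase)
                .load();
        flyway.migrate(); // applies every migration on the current branch
    }
}
```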

If you want to do something more complex, or something that should be simple but is going to require a multistep process, I'm thinking of renaming a column, altering a column, splitting a column, something more complex than just adding a column, then it needs to become a multistep process. I'm going to talk through an example where we have discovered that names are hard. Not everybody just has a first name and a last name, and you know what? We don't even really care, so we just want to capture full name as a plain text field instead of separating out first name and last name in our schema.

This is going to be a migration where we need to combine first and last name together to get full name, and then, going forward, we just capture the full name as a string. If you were doing this in a non-backwards-compatible fashion, you're just concatenating a couple of fields and putting them into a new column. That's a pretty easy migration, a couple of SQL statements. But if you have code in production that is still assuming you have a first name and last name, it's going to break as you come to do your zero-downtime deployment. So, you cannot do this in one step.

It needs to be a multistep process. Step one will be to add that new column, the full name column. It's nullable, so every time a new user is created, you just get a null in this column. That can be deployed. In fact, it should be deployed. The next thing you want to do at this point is write code that is going to write your data to both the old data structure and the new data structure.

When Helen Dodson gets created, we populate first name and last name as per the old code we're trying to deprecate, and we also populate the full name column. Once that's been deployed out, we will not be generating any new nulls in the full name column. So, if we go and do a migration of the old data at that point, we're not going to end up with any more nulls. This is a good time to actually get rid of those nulls. We can run a code migration or some sort of migration to backfill the full name values from the first name and last name and I'll loop back around and cover code migrations in a little bit more detail in a moment.
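A sketch of the dual-write step and that backfill, using plain JDBC with hypothetical table and column names; the string concatenation syntax in the backfill assumes a PostgreSQL-style database.

```java
import java.sql.SQLException;
import javax.sql.DataSource;

class UserWriter {

    private final DataSource dataSource;

    UserWriter(DataSource dataSource) { this.dataSource = dataSource; }

    // Dual write: new users get both the old columns and the new full_name column.
    void createUser(String firstName, String lastName) throws SQLException {
        try (var connection = dataSource.getConnection();
             var insert = connection.prepareStatement(
                     "INSERT INTO users (first_name, last_name, full_name) VALUES (?, ?, ?)")) {
            insert.setString(1, firstName);                  // old structure, still read by old code
            insert.setString(2, lastName);
            insert.setString(3, firstName + " " + lastName); // new structure, written alongside
            insert.executeUpdate();
        }
    }

    // One-off backfill for rows created before the dual write was deployed.
    void backfillFullName() throws SQLException {
        try (var connection = dataSource.getConnection();
             var statement = connection.createStatement()) {
            statement.executeUpdate(
                    "UPDATE users SET full_name = first_name || ' ' || last_name "
                            + "WHERE full_name IS NULL");
        }
    }
}
```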

But now, the next step beyond this is, using a strategy like bottom up or top down, we can change the UI and the rest of the business code so that we have something more like this. We just have a form that has full name on it, not first name and last name, and we populate that into the database. We can stop populating the first name and the last name once this UI is good and working.

So finally, in the last step, we should be safe: now that there's no longer any production code using first name and last name, we can get rid of those old columns.

And so there we have it. We joined two columns in five easy steps, with each step being forwards and backwards compatible and safe to deploy. Excuse me. If you're releasing every month or so, you can do the math. That's nearly half a year to join two columns, something you could trivially have done in a single-step migration if you weren't worried about backwards compatibility. And so of course you aren't going to bother. But what if you were deploying every two weeks, or every hour, or even every push?

I think this example is a good litmus test because when you hear the steps, whether or not you think, “Yeah, that makes sense”, or you just recoil in abject horror and think, “No way!”, that's going to depend on your personal experience and what's hard or easy about your current deployment process.

Continuous delivery, DevOps, technical excellence, lean startup, lean product management: all these things are just full of feedback loops. And software engineering is riddled with local maxima. So, once you've reached a place where you're comfortable with your current release process, it feels like any change you make is going to push you out of that comfort zone and slow you down. But once you actually get there and put in the effort, then you start to feel, "How could I have ever lived without this?"

What I want to do is reassure you at this point that if this does seem ridiculously complex, perhaps it's just merely unfamiliar. You're still learning to drive, and these things become easy with practice. As you get really good, you can find ways to short circuit some of these steps by using updatable views. There are cheats once you're comfortable with the basics.

Schema Migrations: Read Old, Write New

I talked about code migrations. One approach to doing a migration in code is read old, write new. This is just like branch by abstraction, but with persistence added on top. You create a new, empty database structure and deploy that, and then a compatibility layer in code that adapts the old tables to new interfaces. You read from the new tables, and if you get back a null, then you read from the old tables instead. When you write data, you only write to the new structure.

Sorry, I zoomed through that because I want to keep on time, but here it is in diagrammatic format, which I hope is easier to understand. So, here we've got our branch-by-abstraction layer. Inside that, we're doing something like: read from new; if we get back a null (null coalescing operator), then we read from the old table instead, and then we just return the data. I haven't typed this, I'm using implicit typing, but you can presume from this code that whether you read from the new database structure or the old database structure, you have to conform it to a common interface. When you write data, though, you just write to the new tables.
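Since the slide code isn't reproduced here, this is a sketch of what that compatibility layer might look like in Java; the repository and DAO names are hypothetical.

```java
import java.util.Optional;

// Read old, write new, behind the branch-by-abstraction seam.
class FullNameRepository {

    private final NewTableDao newTable;
    private final OldTableDao oldTable;

    FullNameRepository(NewTableDao newTable, OldTableDao oldTable) {
        this.newTable = newTable;
        this.oldTable = oldTable;
    }

    // Read from the new table; if there's nothing there yet, fall back to the old one.
    // Both paths conform to the same return type, as the talk notes.
    String fullNameFor(long userId) {
        return Optional.ofNullable(newTable.findFullName(userId))
                .orElseGet(() -> oldTable.findFirstAndLastName(userId));
    }

    // Writes only ever go to the new structure, so data migrates as it is touched.
    void saveFullName(long userId, String fullName) {
        newTable.save(userId, fullName);
    }

    interface NewTableDao {
        String findFullName(long userId);          // may return null
        void save(long userId, String fullName);
    }

    interface OldTableDao {
        String findFirstAndLastName(long userId);  // adapts old columns to the new shape
    }
}
```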

The thing about this approach is the data is only updated when you attempt to write to it. So how do you know this migration is ever going to finish? Will it ever finish? Is there a way to make it finish sooner?

The answers to this are entirely contextual and you will have to answer this for yourself as you approach these migrations, but there are other techniques that you can use to try and get these migrations done.

Some examples: manual migrations, where you just run an SQL script after you've done the deployment. I would suggest avoiding that; one step better is to build the migration into the application, and you can set it off just by poking it.

Call a JMX endpoint, or put some kind of admin interface on top of it that triggers certain migrations: some way to trigger that migration instead of just running it ad hoc. (There's a sketch of this after this list.)

If you're using an event-driven or event-sourced architecture, maybe you can just send yourself the events again in order to force things to reprocess. There could be natural events, or you could just create them. In the past, at this FinTech, we did things like writing a script, which operations ran, that injected a whole bunch of messages into our RabbitMQ system. That was one way we got some of these migrations done.

The saga pattern, from the microservices world, is commonly used for long-running processes. That can be a way to get a code migration done.

Batch processing is fine. Don't be afraid of good old-fashioned batch processing. That's a very legitimate way to get some of these code migrations done.

I'm going to call out reconciliation as well, because if you've got a more complicated migration, complicated data structures you're trying to migrate between, the Scientist library I talked about for code, you can do that sort of thing for data as well. It can be quite helpful to have some process, whether it's in code or a dashboard, that looks at the data in the old tables and the new tables and tries to reconcile it, making sure the data looks the way it should.
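For the "build it into the application and poke it" option mentioned above, here is a sketch of a minimal admin endpoint, assuming Spring MVC. The path and the backfill interface are hypothetical, and a real one would be locked down to operators and probably run asynchronously.

```java
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RestController;

// An operator-only endpoint that kicks off a backfill on demand,
// instead of someone running ad-hoc SQL against production.
@RestController
public class MigrationAdminController {

    private final FullNameBackfill backfill;

    public MigrationAdminController(FullNameBackfill backfill) {
        this.backfill = backfill;
    }

    @PostMapping("/admin/migrations/backfill-full-name")
    public String triggerFullNameBackfill() {
        int rowsUpdated = backfill.run();
        return "backfilled " + rowsUpdated + " rows";
    }

    public interface FullNameBackfill {
        int run(); // returns the number of rows it touched
    }
}
```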

Schema Migrations: What Could Possibly Go Wrong

So, this is all fine. I'm sure at this point you're saying, "I'm right onto this. I'm going to start straight away. What could possibly go wrong?" I'm going to tell you one story about what could possibly go wrong, because I don't think it's fair to tell you about all the good things without telling you about some of the landmines that are out there as well.

So, we were toying with this idea of an update-free, sorry, insert-only schema. For auditing purposes, in our money thing worker, we needed to know what an account balance was, and to keep an audit trail. One way to do that is to not just track the balance, but to track the individual transactions.

So, simplistically, we have, you know, account ID and amount. If we want to find out the balance of account 1234, we just select everything where the account ID is 1234 and sum up the amounts. So that worked quite well from an audit perspective, but you know what they say about the road to hell, right?

It's paved with good intentions. So, we had a developer who thought, let's optimize our queries. If we just need to know the balance, and we write that balance into the table, keeping a running tally as we go, then we don't need to aggregate all the rows. We just do something like select balance from the table, order by ID, limit one: we just grab the latest row and there's the balance.
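The two query shapes from this story, written out; the table and column names are simplified guesses at what the talk describes.

```java
class BalanceQueries {

    // Insert-only, audit-friendly version: derive the balance from the transactions.
    static final String SUM_OF_TRANSACTIONS =
            "SELECT SUM(amount) FROM account_transactions WHERE account_id = ?";

    // "Optimised" version: a running balance cached on each row, read back by
    // grabbing the latest row. Its correctness now depends entirely on the sort order.
    static final String LATEST_CACHED_BALANCE =
            "SELECT balance FROM account_transactions "
                    + "WHERE account_id = ? ORDER BY id DESC LIMIT 1";
}
```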

So, this was pretty much a real query in our system at one point in time. And someone looked at that and thought, well, hang on, order by ID? Surely ordering by an auto-generated database ID is a little bit fragile. Maybe we shouldn't do that. And I must confess, the person that thought that, that suggested we change it, was me.

So. Oops. Sorry. How do we get rid of order by ID? The good intention was: this should be sorted by date. We have the timestamp right there. So, can we change this? Is this safe to do? We ran these queries against our data and found that whether we sort by ID or we sort by date, we get back the same rows.

So, we were thinking, yes, sure, this should be safe. Let's do this migration. Let's change the sort order from sort by ID to sort by timestamp. And we pushed that to production. Now, you know I said I don't like working after hours and late nights. This was a late night. During the deployment stage we had one version of the code sorting by ID and one version of the code sorting by timestamp.

The thing, the very thing we were trying to protect ourselves against, happened: the timestamps were out of order with respect to the database IDs. And so, depending on whether the blue server or the green server was actually processing the money thing events, they got different balances for the same account at the same time.

And, well, luckily our money handling code is paranoid, so there wasn't really any user impact from this, but all the alarms and checks and balances we had in place went off all at the same time, and we spent a lot of time unraveling this issue. We learned two hard lessons from this. We learnt, you know, maybe it's not a good idea to cache a calculated column in the database in the first place.

The other hard lesson we learned is that even something that seems innocuous, like changing the sort order of an existing query, needs to be thought about as carefully as you would think about performing a migration. So, there are two sides to data, right? There's writing the data and there's reading the data. Now, I've talked a lot about how to do migrations in terms of updating the schema so that you can write data safely; reading data you need to think about in just as much detail.

So, how could we have avoided that trap? I think the thing we would have had to do to avoid falling into this trap is to migrate our code to not use the balance field first. All of our code would have had to stop relying on that cached balance. We would have had to do the code migration first, before we attempted to get rid of the database column.

Okay, so I've hit the end of my problem categories. But we're not done, not quite done, because there's a lot of stuff I didn't talk about. I didn't talk about evolving APIs in any useful detail, and that includes messages and events. I didn't talk about graceful shutdown. I didn't talk about batch jobs or ETL pipelines or machine learning models.

I didn't talk about deployment techniques: canary builds versus blue-green versus live live, and there's a lot more. But, you know, there's good news, in that the mental muscle you need to deal with this is just a superset of the serialization compatibility problem. If you've ever done Java serialization or Python pickling, it's the same sort of thing, just on a different scale.

It's very unlikely at this point that you're going to run into situations that are entirely novel. If you do some Googling, you'll probably find an answer to the problem you're facing, or someone who's dealt with it before. You could even hire that person and get a real leg up. I really want to stress: it's just engineering, and the hardest part is just remembering to think about it.

You don't need microservices or Kubernetes or Istio or LaunchDarkly or serverless or anything fancy to get started with this. You can go a long, long way with just Spring Boot executable jars and property files. You can do all this with not very much at all. Just remember the golden rule, and remember that it gets easier with practice.

And it really is worth it. So, where do you start? Start by convincing yourself that it's worth the pain, because the benefits include happiness, satisfaction, humane working hours, and not only that, but improved financial outcomes for the company. The science is real. This book here is fantastic; I highly recommend reading it if you need further convincing.

So, there are two sides to this, right? The safe commits and the safe deployments. To get started with safe commits, it helps to realize that good agile practice helps you with good commit hygiene, and vice versa. You want to slice up your epics into stories, think about value streams and minimum viable product, plan those stories out as tasks, and then feel the flow: understand how you're going to build this thing.

Order those subtasks into a series of steps, and by the way, that really helps with estimation as well. Plan those steps for compatibility, and remember the golden rule. For safe deployments, your goal is to decouple deploying code from releasing features. So, ask yourselves: I want to deploy during office hours.

How can I make that true? Identify challenges, and there will be challenges: cultural challenges, process challenges, technical challenges, client-side challenges. Ask yourself, what are you afraid of? And ask that non-rhetorically. You will come up with your challenges. Pick a challenge and fix it, and then rinse, wash, repeat, and you'll get there.

Okay. To finish the talk, I'll just finish up with a list of really fantastic books that cover this in a lot more detail and a lot more ground. These pictures are hyperlinked, so when the slides go around later, hopefully those will be clickable as well. And there's also a website down here which gives a lot of extra assistance on minimizing long-lived feature branches, which is really great.

And so with that, I think we'll finish up and open up for questions.
