Meetup: Resilience

After spending a meetup discovering the power of a modern gaming engine (Unity), it was time to dive into a subject that we have a slightly higher chance of actually encountering at our clients. It can’t be denied that microservices have grown significantly in popularity over the last few years, and a lot of companies have adopted the architecture to solve organisational and scalability issues. Moving from, say, a monolith-based architecture to microservices is no easy feat: many of the problems you solve are simply replaced by different problems, one of them being resiliency.

It is an interesting (and broad) topic, however. How do we (elegantly) deal with service disruptions and manage the flow of traffic between several (micro)services in an attempt to stay as available as possible for our users?

IT resilience is the ability of an organisation to maintain acceptable service levels when there is a disruption of its business operations, critical processes, or IT ecosystem.

This was the last meetup of the year, and after a year filled with Corona and (intelligent, semi, full) lockdowns we were also in the mood for a fun outdoor activity where we could at least look at each other without using webcams. As a result, we had limited time reserved for this topic. I expect we will circle back to it somewhere in the near future.

We focused on one of the more popular frameworks in the Java ecosystem right now, Resilience4j. The idea was to start with something practical, but quite quickly we found ourselves in a discussion about the core problems you try to solve with frameworks like this. For instance:

  • How do you efficiently configure timeouts across service calls?
  • How do you stay aware of your users’ requirements when applying rate limiting?
  • How much room do you allow in your bulkheads?
  • How do you handle caching and default responses?
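To make the bulkhead question a bit more concrete, here is a minimal sketch of the idea in plain Java. It is not Resilience4j's API: SimpleBulkhead, BulkheadFullException and the maxConcurrentCalls parameter are hypothetical names made up for this illustration, and the sketch assumes that rejecting immediately (rather than waiting for a free slot) is the behaviour you want.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Hypothetical exception for this sketch: signals that the call was rejected
// by the bulkhead, as opposed to failing inside the protected service.
class BulkheadFullException extends RuntimeException {
    BulkheadFullException() {
        super("bulkhead is full, call rejected");
    }
}

// Hypothetical helper, not part of Resilience4j: a semaphore-based bulkhead
// that allows at most maxConcurrentCalls calls to run at the same time.
final class SimpleBulkhead {
    private final Semaphore permits;

    SimpleBulkhead(int maxConcurrentCalls) {
        this.permits = new Semaphore(maxConcurrentCalls);
    }

    // Runs the call if a permit is free; otherwise rejects immediately.
    <T> T call(Supplier<T> protectedCall) {
        if (!permits.tryAcquire()) {
            throw new BulkheadFullException();
        }
        try {
            return protectedCall.get();
        } finally {
            // Always hand the permit back, also when the call throws.
            permits.release();
        }
    }
}
```

The "how much room" question then comes down to the number passed to the constructor. Set it too low and most callers only ever see BulkheadFullException, even though the service behind it is perfectly healthy.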

A lot of organisations gravitate towards autonomy, meaning that the teams have a large degree of freedom to configure resiliency measures. Over the years, we have encountered several instances where this didn’t work out very well:

  • Tight bulkheads kept a service alive, but the majority of requests never got past them;
  • Timeouts were set so tight that a response would hardly ever be received in time;
  • Default responses obscured an issue. For example: a service calls another service for user data, an issue occurs and the circuit breaker returns an empty response. This is misleading, because the user cannot distinguish between an error and a valid functional scenario.
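The last point, an empty default response hiding an error, can be avoided by making the difference explicit in the return type. The sketch below is one possible way to do that in plain Java. LookupResult and GuardedLookup are hypothetical names for this example and not something Resilience4j provides, and the sketch assumes the remote call signals "no data" with an empty Optional and signals failure by throwing.

```java
import java.util.Optional;
import java.util.function.Supplier;

// Hypothetical result type: keeps "no data" and "service unavailable" apart.
final class LookupResult<T> {
    enum Status { FOUND, NOT_FOUND, UNAVAILABLE }

    final Status status;
    final T value; // only non-null when status is FOUND

    private LookupResult(Status status, T value) {
        this.status = status;
        this.value = value;
    }

    static <T> LookupResult<T> found(T value) {
        return new LookupResult<>(Status.FOUND, value);
    }

    static <T> LookupResult<T> notFound() {
        return new LookupResult<>(Status.NOT_FOUND, null);
    }

    static <T> LookupResult<T> unavailable() {
        return new LookupResult<>(Status.UNAVAILABLE, null);
    }
}

// Hypothetical wrapper around a remote call.
final class GuardedLookup {
    // Assumption: an empty Optional means "no such data" (a functional outcome),
    // while any RuntimeException means the call itself failed.
    static <T> LookupResult<T> call(Supplier<Optional<T>> remoteCall) {
        try {
            Optional<T> response = remoteCall.get();
            if (response.isPresent()) {
                return LookupResult.found(response.get());
            }
            return LookupResult.notFound();
        } catch (RuntimeException e) {
            // Do not fall back to an empty answer here: report the failure.
            return LookupResult.unavailable();
        }
    }
}
```

A caller can now decide per status what to show: an empty profile for NOT_FOUND, and a "try again later" message for UNAVAILABLE, instead of presenting both as "no data".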

These were some of the issues that were discussed, and near the end it became obvious that, as with many things we encounter when building complex software systems across multiple cross-functional teams, it all comes down to proper communication and applying shared standards. The tools themselves are not too difficult to grasp (and we will definitely have a closer look at them at some later point), but applying them effectively and correctly in a complex environment with many moving parts is the real challenge. By then our time was largely up, so I am guessing this will be continued somewhere in 2022.

The afternoon was reserved for some fun time. We were fortunate to have planned this before the new lockdown was announced, so we were actually able to meet up in person in the woods for a three-hour mountain bike ride. Even for the people in shape it was a nice challenge, not to mention for me. It was great fun to do, most of all because we could be around real people and enjoy ourselves together outside of our homes. Even if the enjoyment came with an extra dose of saddle pain.

We are already excited to get going again with our next meetup, planned for the end of January 2022. Until then, stay safe and happy new year!
