Context Driven Observability @ Booking.com

Mihai Balaci
11 min read · Jul 6, 2021

--

The spine of this article is context! Without proper context, our actions have no meaning, our data has no purpose, and our understanding of the environment we live in is limited.

I was wondering the other day: how can we get more value out of our observability data?

I realized that this is more a social problem than a technical problem.

Why am I saying this, you may wonder? Well, during my career I interacted with Monitoring & Observability systems in different ways: starting from the basic output of PING, traceroute, iostat, free, top, s/dtrace and tcpdump in the early days, then switching to more complex monitoring tools. I operated them, built and designed some, engaged with vendors, and used open source. All scenarios checked, believe me :) It was never the lack of data creating gaps, but improper usage of it!

The term “Observability” does not have a single definition; it seems more like a continuously moving target in the IT world. The more complex the system gets, the more visibility you need.

What I feel safe to say is that Observability is the representation layer that allows us to touch/understand/feel a world (software) we don’t see! And that’s why we need to build up these mental models, these understandings about what’s going on and why in this invisible world (code performance, CPU and memory consumption, latency, saturation, duration, processing time).

The only tangible interaction we have with the binary world is through this representation layer we call Observability, and through it we can manipulate this invisible world.

This layer enables us to see how our software really works vs how it was described in the docs, how it breaks vs how it was supposed to break considering all the circuit breakers we put in place, and what is required to keep it all working.

Why do we observe?

Many times people have asked me: why is Monitoring not enough? And what’s the difference between being monitored and making yourself observable?

There is a simple answer to this: Monitoring happens because we know what we are paying attention to! We know what and where to look. A simple example is disk utilization — we know that a disk will eventually get filled up, we monitor it so we can act before this happens :) Magic right ;)

The problem with modern stacks is that things are far from being predictable anymore. There is a lot of stuff in our stacks that we simply cannot anticipate.

The point of Observability is to prepare yourself for the inevitable: stuff will break, things will fail, and we cannot anticipate what it will be.

Story of my life: every single time I got paged in the middle of the night, it was because the thing that broke was something I never thought would happen. Literally!

Observability does not make those things transparent to you; it is not magic! But it does prepare you for the fact, so that when things break, you are more likely to find the root cause and the context in which the failure happened. When shit gets WEIRD is where Observability gets valuable, not for the things you already expected to happen. For the rest, we have Monitoring.

Systems are complex

Hello Captain Obvious :) The systems we build get more and more complicated. Not only are they complex, but they are also non-linear, meaning that the inputs you give them may result in a disproportionate output. Systems are chaotic as well; a slightly different input can result in a wildly different output.

Systems are emergent because these little human things interacting with them are constantly creating new microservices that interact with other microservices. Before you know it, you have this complicated system that seems almost alive.

As you already recognized, the above is a simplification of a request-based system. We know that a request-based system can usually be in one of a few simple states: waiting for a response or listening, processing, responding, and maybe shutting down. So four (4) states!

And humans, oh humans can be pretty magical; we look at these things, diagnose them, make mental models about how we process them, and do even more interesting things: we create robots and automation to monitor and manage the systems for us. As simple as it looks for the robot in the picture, it is much more complicated for us to define each system state so that the robot can understand it.

In cybernetics, this is called “The law of requisite variety”, which states in simple words: for a controller to properly control something, it needs to be able to take at least as many states as the system it is controlling.

Following this law, the systems we are building to monitor other systems need to be more complex.

I can say that: Variety defeats Variety! I think it is interesting that we need even more complex tools for us to run complex systems.

Observability

As an engineer, you are probably used to debugging via intuition. To get to the source of a problem, it’s likely you feel your way along a hunch or use a fleeting reminder of some outage long past to guide your investigation. However, the skills that served you well in the past are no longer applicable in this world. The intuitive approach only works as long as most of the problems you encounter are variations of the same few predictable themes you’ve encountered in the past.

Similarly, the metrics-based approach of monitoring relies on having encountered known failure modes in the past. Monitoring helps detect when systems are over or under predictable thresholds that someone has previously deemed to indicate an anomaly. But what happens when you don’t know that type of anomaly is even possible?

Historically, the majority of problems that software engineers encounter have been variants of somewhat predictable failure modes. Perhaps it wasn’t known that your software could fail quite in the manner that it did, but if you reasoned about the situation and its components, it wasn’t a logical leap to discover a novel bug or failure mode. It is a rare occasion for most software developers to encounter truly unpredictable leaps of logic because they haven’t typically had to deal with the type of complexity that makes it commonplace (until now, most of the complexity for developers has been in the app bundle).

“Every application has an inherent amount of irreducible complexity. The only question is: Who will have to deal with it — the user, the application developer, or the platform developer?”

- Larry Tesler

Modern distributed systems architectures notoriously fail in novel ways that no one is able to predict and that no one has experienced before. This condition happens often enough that an entire set of assertions has been coined about the false assumptions that programmers new to distributed computing often make.

In a modern world, debugging with metrics requires you to connect dozens of disconnected metrics that were recorded over the course of executing any one particular request, across any number of services or machines, to infer what might have occurred over the various hops needed for its fulfillment.

By contrast, debugging with observability starts with a very different substrate: a deep context of what was happening when this action occurred. Debugging with observability is about preserving as much of the context around any given request as possible, so that you can reconstruct the environment and circumstances that triggered the bug that led to a novel failure mode.

I imagine 2 completely different dimensions: Application Insight Data (or First Party Data), and Operational Data (or Third Party Data).

  • Our own software produces App Insight data, and its hidden value is in the business logic KPIs that drive all our business decisions, investments, and advertisement campaigns. I usually call this data the source for customer journey analytics. Observability data often serves many different users and different use cases. A business metric added for a Product Owner to track product adoption may feed a fraud detection model for a security engineer or provide a key insight to an incident responder.
  • Operational Data consists of telemetry and logs we collect from all platforms we use to run our services on (bare-metal servers, network equipment, Operating Systems, OpenStack and BKS) — this is our Monitoring data backbone.

Events/Logs/Metrics/Traces @ Booking

Events are flexible blobs of information with an implicit schema, and they go through multiple distinct stages during their life cycle.

Depending on the use case, it might be wise to preserve the created event object somewhere in the global context so other parts of the code can reuse it before it is sent. This is exactly what we do in our WEB handler: when we begin to serve an HTTP request, we create a global WEB event for that request, implicitly for developers, and push it into Bookings::Context.

Events exist in different forms: WEB event (HTTP query), METRICS event (set of metrics), STS event (daemon metrics). They are valuable data because out of all of them you can build the context I was describing before and reconstruct the system states.

Events are very wide structured logs composed and sent from inside the application. They are often produced once per business action: this may be per web request, or per cron job run.

An interesting attribute of events is their infinite-cardinality data, which we don’t usually pre-aggregate; this is how we build the infinite business context.

Example: in the context of databases, cardinality refers to the uniqueness of data values contained in a set. Low cardinality means that a column has a lot of duplicate values in its set. High cardinality means that the column contains a large percentage of completely unique values. A column containing a single value will always be the lowest possible cardinality.

A column containing unique IDs will always be the highest possible cardinality.

For example, if you had a collection of a hundred million user records, you can assume that userID numbers will have the highest possible cardinality. First name and last name will be high cardinality, though lower than userID because some names repeat. A field like gender would be fairly low-cardinality given the non-binary, but finite, choices it could have. A field like species would be the lowest possible cardinality, presuming all of your users are humans.

Cardinality matters for observability, because high-cardinality information is the most useful data for debugging or understanding a system. Consider the usefulness of sorting by fields like user IDs, shopping cart IDs, request IDs, or any other myriad IDs like instances, container, build number, spans, and so forth. Being able to query against unique IDs is the best way to pinpoint individual needles in any given haystack.
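To make the cardinality idea concrete, here is a minimal Python sketch that counts distinct values per field. The `users` records and the `cardinality` helper are invented for illustration; they are not Booking.com data or tooling.

```python
def cardinality(records, field):
    """Return (distinct value count, distinct/total ratio) for one field."""
    values = [record[field] for record in records]
    distinct = len(set(values))
    return distinct, distinct / len(values)


# Hypothetical sample records, made up for this example.
users = [
    {"user_id": 1, "first_name": "Ana", "species": "human"},
    {"user_id": 2, "first_name": "Bob", "species": "human"},
    {"user_id": 3, "first_name": "Ana", "species": "human"},
    {"user_id": 4, "first_name": "Chen", "species": "human"},
]

for field in ("user_id", "first_name", "species"):
    distinct, ratio = cardinality(users, field)
    print(f"{field}: {distinct} distinct values, ratio {ratio:.2f}")
```

A ratio of 1.0 means every value is unique (the user ID case), while a field holding a single repeated value gets the lowest ratio possible for that data set.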

Our business logic mainly relies on WEB events. And they are even more granular as they are generated per request. Using these events, we can perform powerful Observability.

A nice property of events is that they deliberately blur the boundaries between operational data, technical and business data, and application insight. As a result, we can associate business outcomes with technical/operational behaviors more easily than with data types that offer only a single value per period of time instead of the entire context.

Another simple example: “bookings are down because half of our workloads are returning 500s” is a pretty well founded statement of fact, not an interpretative guess, when you look at the event stream and see that half of the “complete-booking” action events no longer contain “did_the_booking=1” but have errors.
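As a rough sketch of how such a statement could be checked against an event stream, the Python snippet below computes the share of "complete-booking" events that did not end in a booking. Only `did_the_booking` and the "complete-booking" action name come from the text above; the `action` and `http_status` field names and the sample events are assumptions made for illustration.

```python
def failed_booking_share(events):
    """Share of complete-booking events that did not result in a booking."""
    attempts = [e for e in events if e.get("action") == "complete-booking"]
    if not attempts:
        return 0.0
    failed = [e for e in attempts if e.get("did_the_booking") != 1]
    return len(failed) / len(attempts)


# Hypothetical events; the real events carry many more fields.
events = [
    {"action": "complete-booking", "did_the_booking": 1, "http_status": 200},
    {"action": "complete-booking", "http_status": 500},
    {"action": "search", "http_status": 200},
    {"action": "complete-booking", "did_the_booking": 1, "http_status": 200},
    {"action": "complete-booking", "http_status": 500},
]

print(f"failed booking share: {failed_booking_share(events):.0%}")
```

Because the business outcome and the technical status live in the same event, no join between separate data sources is needed to make the claim.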

Imagine this: you are the traveler, looking at www.booking.com to book an accommodation in Amsterdam. City was an easy choice ;) Now you pick the dates and start looking for a nice fancy hotel. Once you find it, a new universe of choices appears: you need to select the room type, breakfast, parking/non-parking and so on. The more options you care about, the more possible combinations you get (simple mathematics: combinations without repetition). After 30–40 minutes of reviewing all these options and reading all the reviews, you decide to click the magic button to finish the transaction and, big surprise, something fails: the reservation cannot be completed.

In order to understand what happened, our engineers use contextual data to reconstruct the customer journey and identify where and in which specific circumstances the system failed: which combination of choices selected by the user broke it?
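One way to picture that investigation is to group events by the combination of options chosen and look at the failure rate of each combination. This is only a sketch: the field names (`room_type`, `breakfast`, `parking`, `error`) and the sample events are hypothetical, not the actual event schema.

```python
from collections import Counter


def failure_rate_by_combination(events, option_fields):
    """Map each observed combination of option values to its failure rate."""
    attempts = Counter()
    failures = Counter()
    for event in events:
        combination = tuple(event[field] for field in option_fields)
        attempts[combination] += 1
        if event.get("error"):
            failures[combination] += 1
    return {combo: failures[combo] / attempts[combo] for combo in attempts}


# Hypothetical reservation events, invented for this example.
events = [
    {"room_type": "double", "breakfast": True, "parking": False, "error": None},
    {"room_type": "suite", "breakfast": True, "parking": True, "error": "reservation failed"},
    {"room_type": "suite", "breakfast": True, "parking": True, "error": "reservation failed"},
    {"room_type": "suite", "breakfast": False, "parking": True, "error": None},
]

rates = failure_rate_by_combination(events, ("room_type", "breakfast", "parking"))
for combination, rate in rates.items():
    print(combination, f"{rate:.0%}")
```

This only works because each event keeps the full set of choices together; once the data is pre-aggregated per option, the failing combination can no longer be singled out.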

In 1988, by way of SNMPv1, the foundational substrate of monitoring was born: the metric. A metric is a single number, with tags optionally appended for grouping and searching those numbers. Metrics are, by their very nature, disposable and cheap. They have a predictable storage footprint. They’re easy to aggregate along regular time series buckets.

Metrics are a powerful tool to detect trends and anomalies. Metrics are an aggregation of data.

CPU metric example:

[{"target":"sys.app-1002_ams4_prod_booking_com.cpu.total.user","datapoints":[[1262.27,1618911780],[1241.92,1618911840],[1341.82,1618911900],[1524.78,1618911960], ….

Unfortunately, they will never tell you why a single request failed, but will help you detect that it did. The issue with metrics is that they don’t tell you if an error and an increase of latency are coming from the same HTTP request or from two different ones.
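The following Python sketch illustrates that limitation with two made-up scenarios. Both collapse into exactly the same aggregate metrics, yet only the per-request events reveal whether the slow request and the failing request are the same one. All field names and values here are assumptions chosen for the example.

```python
def to_metrics(events):
    """Collapse per-request events into two aggregate numbers."""
    return {
        "error_count": sum(1 for e in events if e["status"] >= 500),
        "max_latency_ms": max(e["latency_ms"] for e in events),
    }


def slow_and_failed(events, slow_ms=1000):
    """Request IDs that were both slow and failed (needs per-request data)."""
    return [
        e["request_id"]
        for e in events
        if e["status"] >= 500 and e["latency_ms"] >= slow_ms
    ]


# Scenario A: the slow request is also the failing one.
scenario_a = [
    {"request_id": "r1", "status": 200, "latency_ms": 40},
    {"request_id": "r2", "status": 500, "latency_ms": 9000},
    {"request_id": "r3", "status": 200, "latency_ms": 50},
]

# Scenario B: one request is slow, a different one fails.
scenario_b = [
    {"request_id": "r1", "status": 200, "latency_ms": 9000},
    {"request_id": "r2", "status": 500, "latency_ms": 40},
    {"request_id": "r3", "status": 200, "latency_ms": 50},
]

print(to_metrics(scenario_a) == to_metrics(scenario_b))  # identical metrics
print(slow_and_failed(scenario_a), slow_and_failed(scenario_b))
```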

Are metrics powerful? Absolutely yes. Are they enough? I’m afraid not!

Shared context drives understanding.

Technical and business metrics are not distinct worlds: correlations between the two help us see our systems more clearly. Preserve the capability to join telemetry across different domains of the same business event. To make this possible, we use our infinite-cardinality events: we extract and process data out of events into metrics, as in the following example:

Out of this event example (which has 854 fields):

WEB [Tue Apr 20 13:39:52 2021]
{
    "__az_name__" => "bk-eu-central1-a",
    "__count_sent_by_pid__" => 1113,
    "__created_epoch__" => "1618918790.94219",
    "__dc__" => 32,
    "__dc_name__" => "fra3",
    "__flavour__" => "app",
    "__git_tag__" => "app-20210416-104206",
    "__handler_epoch__" => "1618918790.94179",
    "__node_name__" => "app-6031",
    "__persona__" => "app",
    ------
    "wallclock" => 1426,
    "wallclock_before_psgi" => 1,
    "wallclock_firstbyte" => 1425,
    "wallclock_geoipsrv" => 2,
    "wallclock_logging" => 76,
    "wallclock_memcached" => 0,
    "wallclock_promises" => 101,
    "wallclock_promises_cleanup" => 0,
    -----

By using TuningStats, we produce metrics like this:

[{"target":"general.tuning.minutely.per_persona.app.summary.all.wallclock.99_99th_percentile","datapoints":[[8162,1618912020],[7101,1618912080],[7017,1618912140],[7329,1618912200],[7423,1618912260],[6985,1618912320]...
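To give a feel for this kind of event-to-metric extraction, here is a small Python sketch that buckets events per persona and per minute and computes a nearest-rank percentile of `wallclock`. It is not how TuningStats is implemented (its internals are not described here); the helper names and the second and third sample events are assumptions, while `__created_epoch__`, `__persona__` and `wallclock` are field names taken from the event above.

```python
import math
from collections import defaultdict


def nearest_rank_percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def minutely_percentile(events, field="wallclock", pct=99.99):
    """Group events by (persona, minute start) and compute one percentile each."""
    buckets = defaultdict(list)
    for event in events:
        minute = int(float(event["__created_epoch__"])) // 60 * 60
        buckets[(event["__persona__"], minute)].append(event[field])
    return {
        key: nearest_rank_percentile(values, pct)
        for key, values in sorted(buckets.items())
    }


# The first event reuses values from the example; the other two are invented.
events = [
    {"__created_epoch__": "1618918790.94219", "__persona__": "app", "wallclock": 1426},
    {"__created_epoch__": "1618918795.10000", "__persona__": "app", "wallclock": 300},
    {"__created_epoch__": "1618918805.00000", "__persona__": "app", "wallclock": 7000},
]

for (persona, minute), value in minutely_percentile(events).items():
    print(persona, minute, value)
```

The design point is that aggregation happens after the fact: the raw events stay available for debugging, and metrics are derived from them rather than being the only thing recorded.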

There is a final lens we work to enable for Booking.com: Traces — the system to system communication path.

Traces will bring a lot of visibility into modern systems: details of what is happening inside a single application, as well as the path of a customer’s request into the backend services. This would enable us to troubleshoot performance issues, speed up firefighting (as pinpointing the cause would be easier), improve dependency mapping, code performance, etc.

In my previous example, our robot can see that from service A to service B the request span is 2 ms, from B to C it is 10 ms, and from C to D it is 10 s => an E2E latency of 10.012 s (10 s and 12 ms).
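The arithmetic of that example can be sketched in a few lines of Python. The span representation (a list of hop name and duration pairs) is an assumption for illustration, and it treats the hops as strictly sequential, as in the example.

```python
# Hop durations in milliseconds, taken from the A -> B -> C -> D example.
spans = [("A->B", 2), ("B->C", 10), ("C->D", 10_000)]


def summarize(spans):
    """Return total end-to-end latency and the slowest hop."""
    total_ms = sum(duration for _, duration in spans)
    slowest_hop = max(spans, key=lambda span: span[1])
    return total_ms, slowest_hop


total_ms, slowest_hop = summarize(spans)
print(f"E2E latency: {total_ms} ms ({total_ms / 1000} s)")
print(f"slowest hop: {slowest_hop[0]} with {slowest_hop[1]} ms")
```

The total alone says the request was slow; the per-hop breakdown is what points at C to D as the place to look.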

Another interesting example: my teams see this behavior every week as the business recovers from COVID-19: even if we see 3x lower traffic on our edges compared to 2019, we experience higher processing latency on the underlying systems and back-ends. How does this make sense? Well, with the proper context it does. Caches being empty and systems not being warmed up contribute to this initially weird system behavior.

What is Observability after all? I like to believe it is a data lake that we can query in an interactive and real-time way, built for an iterative workflow where users explore data and home in on interesting correlations and aggregations.

Take away

We can monitor you, but you need to make yourself observable!

With proper instrumentation you will be able to understand your E2E system behavior.

Once you define “what’s expected” by adding context to the data you produce, the observability team will be able to extract and model this data, then link it to your business data.

Today’s system behavior & incidents will shape the design of tomorrow’s components, subsystems and architectures. Visibility and incidents are opportunities! Opportunities to influence tomorrow’s decisions, budgets, hiring and policies.

The delta between how we think our systems work vs how they really work is the greatest value we get from Observability. This data helps us improve MTTA/MTTR so we can keep our quality commitment to our customers.

Looking forward, the opportunities are endless: AI/ML/DL will drive our pattern discovery and accurate predictions, and improve our business proactiveness.

Do not hide your observability data :)

--

Mihai Balaci

*Nix Solution Craftsman, Passionate about Reliability & automation, addicted to open-source, truly in love with BSD's :)