Operational Excellence Journey

EXPEDIA GROUP TECHNOLOGY — SOFTWARE

Focus on instrumentation yields big improvements in reliability


The Stay Experience team at Expedia Group™ faced many challenges at the beginning of 2020. We were constrained by our legacy platform; the tech stack had seen no investment for years. Logging was deficient, so the dashboards were not great either. We relied on a Slack channel to raise the alarm in the event of an incident, but ironically, nobody could say for sure what most of those alerts meant. As a result, the channel was full of alerts that were mostly ignored. Our monitoring and alerting lagged behind and were hazy at best. We once had an incident that lasted about two days because of the lack of logging and monitoring. It was clear that significant improvement was required in instrumentation and monitoring.

We wanted to focus on operational excellence, but it was also evident that the tech stack needed to evolve first.

As the tech stack evolved, we saw the need for new services, and we soon realised that this was the time to make up for the lost investment in operational excellence. My manager asked if I would be willing to lead operational excellence (OPEX) for the team, and I agreed, knowing how important it was.

This was the beginning of our operational excellence journey.

The days and weeks that followed were spent identifying the existing dashboards, alerts and monitors. We held several brainstorming sessions and invested some time into improving the existing dashboards, but it became apparent that we didn’t understand those metrics and graphs well enough. In addition, we identified that several crucial alerts and monitors were missing.

We all knew what we wanted, but we were just not sure how to achieve it.

The company had invested in monitoring platforms for teams to leverage, but we were not well enough acquainted with these tools to use them when combating production incidents. I needed to identify the tool I could leverage to kick-start the journey: one that would give us visibility into how our applications and services were performing, help us establish a performance benchmark, and let us explore the latency stack.

In the following days, I set off to understand the Datadog dashboard for the application that powers the traveler experience.

We advanced the dashboard for our traveler-facing application to monitor customer-facing features.

We found that metrics were not available for all traveler experiences, so we defined routes for each page in the server route configuration of our traveler-facing application. We now had metrics for every page and could see how each performed. This gave us the ability to assess any production issue and determine whether it was specific to an application or service, or something at a higher level. Although this can also be attained via Splunk queries, the quickest and most reliable option is a quick scan of the dashboard. We also configured the dashboards to align with the traveler experience as we understood it.
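For illustration, a minimal sketch of the idea in an Express application instrumented with Datadog's dd-trace tracer is shown below; the service name, routes, and handlers are placeholders rather than our actual configuration.

```typescript
// Minimal sketch: an Express server instrumented with Datadog's dd-trace tracer.
// The service name, routes, and handlers below are placeholders, not our real config.
import tracer from "dd-trace";
tracer.init({ service: "traveler-web" }); // initialise Datadog APM so requests are auto-traced

import express from "express";

const app = express();

// One explicit route per traveler-facing page means the APM integration reports
// latency and error metrics per route (e.g. GET /trips/:tripId) rather than
// lumping every page under a single catch-all handler.
app.get("/trips", (_req, res) => res.send("trip list page"));
app.get("/trips/:tripId", (_req, res) => res.send("trip details page"));
app.get("/trips/:tripId/receipt", (_req, res) => res.send("receipt page"));

app.listen(3000, () => console.log("traveler app listening on :3000"));
```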

We also started looking at a dashboard for our GraphQL usage. This dashboard has the graphs for all queries and mutations consumed within our application, which is particularly helpful in pinning down an issue to a specific query or mutation.
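As a sketch of how such per-operation graphs can be fed (assuming Apollo Server and the hot-shots StatsD client reporting to a local Datadog agent, which may differ from our actual setup), a small plugin can emit one timing metric per GraphQL operation, tagged with the operation name.

```typescript
// Minimal sketch: an Apollo Server plugin emitting one timing metric per GraphQL
// operation via the hot-shots StatsD client. Metric and tag names are illustrative.
import type { ApolloServerPlugin } from "@apollo/server";
import StatsD from "hot-shots";

const statsd = new StatsD({ prefix: "traveler_web." });

export const operationMetricsPlugin: ApolloServerPlugin = {
  async requestDidStart() {
    const startedAt = Date.now();
    return {
      // Runs once per request, after the operation has been resolved,
      // so the operation name is available for tagging.
      async willSendResponse(requestContext) {
        const operation = requestContext.operationName ?? "anonymous";
        statsd.timing("graphql.operation.duration", Date.now() - startedAt, 1, {
          operation,
        });
      },
    };
  },
};

// Usage (illustrative): new ApolloServer({ typeDefs, resolvers, plugins: [operationMetricsPlugin] })
```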

While I was focusing on the Datadog dashboards, our Splunk graphs were evolving at the same time. The team started focusing on them, and we now have Splunk dashboards for almost all the services we own. Datadog dashboards help us visualise trends in latency, errors, and so on; week-on-week trends come out of the box with Datadog. Splunk dashboards provide much-needed insight into a particular metric. For instance, the 4xx errors reported in Datadog can be broken down in Splunk into detailed reporting for 401, 403, 409 errors, and so on, and detailed graphs can be drawn for all errors by category.

We established that we would not go live without making both our logging and our dashboards great.

Logging became a key focus area across all of our services. All requests and responses were logged and tracked via request markers.
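The logging code itself is not shown in this post, but its shape can be sketched as a small piece of Express middleware; the header and field names below are illustrative assumptions. Recording the numeric status code alongside the request marker is also what makes the per-status Splunk breakdowns described earlier possible.

```typescript
// Minimal sketch: Express middleware that attaches a request marker and emits
// structured JSON logs. Header and field names are illustrative assumptions.
import { randomUUID } from "crypto";
import { NextFunction, Request, Response } from "express";

export function requestLogging(req: Request, res: Response, next: NextFunction): void {
  // Reuse an upstream marker if one was propagated, otherwise mint a new one,
  // so a single traveler request can be followed across log lines and services.
  const marker = req.header("x-request-marker") ?? randomUUID();
  res.setHeader("x-request-marker", marker);

  const startedAt = Date.now();
  console.log(JSON.stringify({ event: "request", marker, method: req.method, path: req.path }));

  res.on("finish", () => {
    // Logging the numeric status code is what lets Splunk break 4xx traffic
    // down into 401, 403, 409, and so on.
    console.log(JSON.stringify({
      event: "response",
      marker,
      statusCode: res.statusCode,
      durationMs: Date.now() - startedAt,
    }));
  });

  next();
}

// Usage (illustrative): app.use(requestLogging);
```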

Significant improvements were made to all new features and services. We wanted to understand the customer experience and also monitor the performance.

When we were migrating from a legacy service to a newer one, we created a comparison dashboard covering the two services and their databases. Before going live, this dashboard was used to examine results logged by the services: the legacy service logged any differences it discovered while comparing its results with those from the new service. The aim was to build confidence in how faithfully the data was replicated between the two systems.

This dashboard confirmed a substantial 99.8% data match between the old and new systems. This was a huge win, as it gave the team the confidence to go ahead with the switch to the new service.
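The comparison logic is not shown here, but its general shape can be sketched as follows; the record type, field names, and fetch functions are hypothetical. The important part is that every comparison emits one structured log line from which the dashboard can compute the overall match percentage.

```typescript
// Minimal sketch: the record shape, field names, and fetch functions are
// hypothetical; the real comparison code is not shown in this post.
type TripRecord = { tripId: string; status: string; total: number };

export async function compareTripRecords(
  tripId: string,
  fetchLegacy: (id: string) => Promise<TripRecord>,
  fetchNew: (id: string) => Promise<TripRecord>
): Promise<void> {
  const [legacy, migrated] = await Promise.all([fetchLegacy(tripId), fetchNew(tripId)]);

  // Collect the fields that differ; an empty list means the two systems agree.
  const mismatchedFields = (Object.keys(legacy) as (keyof TripRecord)[]).filter(
    (field) => legacy[field] !== migrated[field]
  );

  // One structured log line per comparison; a dashboard can then chart match
  // versus mismatch counts and derive the overall match percentage.
  console.log(JSON.stringify({
    event: "migration-comparison",
    tripId,
    match: mismatchedFields.length === 0,
    mismatchedFields,
  }));
}
```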

We added downstream service graphs to all the Datadog dashboards for our services. Wherever applicable, this helps in pinning down an issue to a specific service that is not owned by us.
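For those graphs to exist, outbound calls need to be timed and tagged by dependency. A minimal sketch of that idea, assuming the hot-shots StatsD client and the global fetch (our real instrumentation is not shown in this post), could look like this:

```typescript
// Minimal sketch: a wrapper that times outbound calls and tags them by
// downstream service, using the hot-shots StatsD client and the global fetch.
// Names are illustrative.
import StatsD from "hot-shots";

const statsd = new StatsD({ prefix: "traveler_web." });

export async function callDownstream(
  downstream: string,
  url: string,
  init?: RequestInit
): Promise<Response> {
  const startedAt = Date.now();
  try {
    const response = await fetch(url, init);
    // Tagging by downstream name lets a dashboard draw one graph per dependency.
    statsd.timing("downstream.request.duration", Date.now() - startedAt, 1, {
      downstream,
      status: String(response.status),
    });
    return response;
  } catch (error) {
    statsd.increment("downstream.request.error", 1, 1, { downstream });
    throw error;
  }
}

// Usage (illustrative): await callDownstream("booking-service", "https://internal/bookings/123");
```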

The team actively monitors the dashboards to get detailed insight into the services, and both the services and the monitoring are continuously being improved.

By now, we had several Datadog dashboards to analyse, but unless someone was closely monitoring them, there was no way to tell whether traffic on our site was unusually low or latency was high. Even though we had some monitors in place for some downstream services, we did not have any for our application itself.

A set of monitors was created for low requests per second, high latency, and a high 4xx response rate. The threshold values were determined from a couple of months of normal ranges observed on the Datadog dashboard. We are in the process of creating monitors for each downstream service.
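As a sketch of what defining such a monitor can look like, the example below creates a single high-latency monitor through Datadog's public monitor API; the metric name, service tag, threshold, and notification handle are placeholders rather than our real values, which were derived from the observed normal ranges.

```typescript
// Minimal sketch: creating one high-latency monitor through Datadog's public
// monitor API (POST /api/v1/monitor). The metric, service tag, threshold, and
// notification handle are placeholders, not our real values.
export async function createLatencyMonitor(apiKey: string, appKey: string): Promise<void> {
  const response = await fetch("https://api.datadoghq.com/api/v1/monitor", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      "DD-API-KEY": apiKey,
      "DD-APPLICATION-KEY": appKey,
    },
    body: JSON.stringify({
      name: "Traveler app latency is high",
      type: "metric alert",
      // Alert when average request latency over the last 5 minutes crosses the threshold.
      query: "avg(last_5m):avg:trace.express.request.duration{service:traveler-web} > 0.5",
      message: "Latency is above the expected range. @pagerduty-traveler-oncall",
      options: { thresholds: { critical: 0.5 }, notify_no_data: false },
    }),
  });

  if (!response.ok) {
    throw new Error(`Monitor creation failed: ${response.status} ${await response.text()}`);
  }
}
```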

There was some relief that now we would be paged when required.

One day in November, the person on support was paged for high latency in our traveler application. The customer impact was that the trip details page was taking too long to load. Upon investigation, it was discovered that the issue was caused by a service release in production, so the release was rolled back and the issue was resolved. Although the service release was being monitored, the latency increase was not obvious in the graphs, so detection would not have been possible without the monitors we had put in place. We started seeing the benefits of the effort that had gone in so far. Below is the graph that triggered the monitor.

We met often to discuss our progress and to identify further actions.

At one of our sessions, it was suggested that we have a daily stand-up in place of occasional brainstorming.

It was a Eureka moment! Wasting no time, I set up our first ever daily OPEX stand-up.

This was a breakthrough. We started meeting every day, relentlessly looking at the dashboards, reviewing the daily and weekly trends, asking questions, and going back to find answers. Notes were taken and action items derived for the team. Gradually, the on-call engineer started to lead the stand-up. This helped our team to step up and up-skill.

We also identify areas of improvement, and tickets are raised and picked up as part of tech improvement. This has been the most impactful step that we have taken so far.

On multiple occasions, we were able to track a rise in latency and pin it down to a particular app/service release.

Below is an example of a graph where we could clearly see the impact of a release on the latency:

Graph indicating the rise in latency following a release

We improved our Quick To Fix (QTF) score. QTF is the ability to remediate the user or customer impact of an incident in under 60 minutes. Earlier in the year, it used to take a long time to fix an issue. But by the end of the year we were fixing them in considerably less time — sometimes within minutes.

First To Know (FTK) score went up by 30.7%. FTK is the ability and goal for a system to know or identify an incident before a user or customer.

Mean time to detect was down by two-thirds.

Mean time to repair was down by about half.

Avoiding degradation of the customer experience is another achievement. We are alerted every time there is a rise in latency or errors; we identify the underlying issue and take action to resolve it quickly if it is within an app or service we own. If the impact is caused by a downstream service, the respective owners are made aware so they can handle it at their end. Additionally, we review the dashboards and graphs in our daily OPEX stand-up and quickly act on any signs of degradation.

Since we set off on this journey, we have seen major improvements to the extent that we haven’t had an incident in the last five months. We have come a long way, but the journey is far from over; it is an evolving process for the team. This is what we are going to focus on for the coming months:

We hope to leap from strength to strength in the coming weeks and months!

Chart images are owned by the author
