Data EngineeringBig DataHadoop

Upgrading Large Hadoop Cluster

A detailed account of upgrading a large Telco Hadoop cluster from HDP 2.6.4 to 3.1.5, covering practice runs, planning strategies, and lessons from executing the upgrade during COVID remote work.

18 May 2020 · 6 min read

A long time back, I wrote a post about migrating a large Hadoop cluster in which I shared my experience moving between two Hadoop environments. Last weekend we did another similar activity, which I thought worth documenting.

Epilogue

We are a large Telco with multiple Hadoop environments. This post is about the upgrade story of one of those clusters.

Many Weeks Before the Weekend

For many weeks we had been working to upgrade one of our main Hadoop platforms. Since it was a major upgrade from HDP 2.6.4 to 3.1.5, it needed extensive planning and testing. Several things helped us face the day with confidence.

Practice Upgrades

We did three practice upgrades in our development environment to ensure we knew exactly how each step would work and what issues we might encounter. A comprehensive knowledge base of all known errors and solutions was built from these exercises. This document was shared with all team members involved so that someone would recognise an issue during the real upgrade. It becomes extremely challenging when you hit a problem and the clock is ticking to bring the cluster back up for Monday’s workload. We also did one full-team practice run so that everyone understood the steps, sequences, and overall feel of a real upgrade.

Code Changes

We made all the code changes required to ensure our existing applications could run on the new platform stack. Testing was done in a development environment stood up with the new stack versions, and we opened it to all application teams and use cases for validation.

Meeting the Prerequisites

One of our challenges is the massive amount of data. As a Telco, our network feeds can fill a cluster very quickly. We had to keep tight control over what data came in and what queries users ran to keep total cluster storage under 85% — a single bad user query can fill a cluster within hours. Our cluster was around 1.8 PB. Moving data to another environment when HDFS is over-utilised is a normal part of our operations.

One Week Before the Weekend

Imaginary Upgrade

We brainstormed a fictitious upgrade and walked through the mindset of what steps and sequences we would follow. We listed every detail we could think of, from raising the change request to closing it after completion. This exercise surfaced many things that had not been planned earlier and helped us arrange a precise order of execution.

Application and Use Case Teams

In a large shared cluster environment, finding all job dependencies and impacted applications is a challenge. We started sending bulk communications to all platform users one month in advance so that everyone was eventually aware of the planned downtime and could prepare accordingly.

Data Feed Redirection

Many data feeds in the Telco space are very large. We have the opportunity to capture some of them only once — if we miss them, the data is lost. To prepare for the downtime, we planned redirection to an alternative platform with a view to bringing them back post-upgrade. This exercise required careful impact analysis to determine whether feeds could be permanently lost or recaptured from the source later.

The Time Roster

A few days before the upgrade, we created a timeline view of the weekend. The goal was to bring people in and out, giving them rest as needed. We divided into groups: people who redirect and stop data feeds before the upgrade, people who perform the upgrade itself, people who resume jobs and stop feed redirection post-upgrade, and a group of beta testers to validate all user-facing items over the weekend. This structure gave everyone a clear picture of when they would enter the scene and what was expected of them.

The Weekend

Friday

We divided the upgrade into eight stages and aimed to complete the Ambari upgrade on Friday, getting as far as possible into the subsequent stages. The Ambari upgrade went smoothly with no blockers, and we finished within our planned timeframe.

Saturday and Sunday

Our original estimate for the HDP and HDF upgrade, based on past experience, was around 20 hours. However, three technical issues pushed our timeline by an additional 15 hours. Cloudera’s on-call engineers were very responsive in helping us resolve those problems. Hadoop is a massive beast — no single person can know everything — so having access to subject matter experts from Cloudera when we needed them was a massive morale booster. It was reassuring to know we had someone to call, and they jumped in to resolve every blocker. A massive thank you to the Cloudera team.

Credits

Collaboration and COVID

This upgrade was different for us. Due to COVID, like companies worldwide, we had been working remotely for weeks. Without giving credit to Microsoft Teams, it would not be fair — Teams made it possible from day one of working from home to collaborate effectively. Our core team of four people involved in the upgrade stayed connected in a single Teams meeting for three days. We used screen sharing and document sharing features to get the job done.

Kids and Families

Lastly, it is worth mentioning the patience of our families, who brought all meals to the computer so we could keep working, and who took care of the kids during long hours. With Teams broadcasting for hours on end, we could hear each other’s children (except for one team member, who is a bachelor) shouting and trying to grab attention. With this upgrade behind us, we were back to spending more time with them.

Monday After the Weekend

The upgrade was successful. Project teams and users were gradually coming back live on the platform. Users reported issues they encountered, and we fixed them incrementally. Data started flowing back in, with the flood gates for the largest feeds to be opened later in the week. Things were slowly returning to normal. Our users were excited about the new functionality this upgrade brought, and I am proud of what we achieved.

Massive planning and practice exercises delivered a good outcome. We missed planning for a few things, but we will learn from them — that is what life is about.

Until the next upgrade, goodbye.