This guide shows, step by step, how to upgrade a multi-node Hadoop/HDFS cluster from version 2.4.1 to version 2.7.2 on Ubuntu 14.04. These instructions refer to the setup I described previously. In general, this should work for any upgrade within the 2.X branch, but I cannot guarantee that.
Note: this is not a high-availability upgrade; I will shut down my Hadoop/HDFS cluster as the very first thing. If you cannot afford downtime, you may want to consider this page on the Apache Wiki instead, but I did not try it.
I will assume you have something similar to what I described in my previous article, i.e., a 3-node cluster. My test case is the following (with IP addresses and shortnames):
10.10.10.104 mynode1
10.10.10.105 mynode2
10.10.10.106 mynode3
Note: We assume the nodes in the cluster have the same hardware configuration, i.e., the same type of architecture.
I also assume that the hduser exists and is authenticated on all machines, i.e., you should run
sudo su - hduser
cd ~
From now on, all commands in this guide will be run as the `hduser`.
Make sure you have everything up to date.
sudo apt-get update
sudo apt-get upgrade
Repeat this installation procedure, up to this point, on every node you have in the cluster.
The following will be necessary only on the first node. We start a screen session so we can work remotely without fear of losing work if we get disconnected.
screen -S installing
With -S you can give your session whatever name you like.
The first thing now is to stop the cluster and check that everything is quiet; a sketch of the commands is below.
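A minimal sketch, assuming the sbin scripts of the old installation are still linked at /usr/local/hadoop as in my previous setup:

/usr/local/hadoop/sbin/stop-yarn.sh
/usr/local/hadoop/sbin/stop-dfs.sh
# then, on every node, check that no Hadoop daemons are left running
jps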
Compile the Sources
The following steps will be needed only once on the primary node.
I was upgrading to 2.7.2, and for some reason the new version required a tool called javah (note the h), which was installed alongside Java but placed in some other directory, so I needed to change one line where JAVA_HOME is defined (in my case, in the .profile from my previous setup).
To be more precise, replace the line
export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")
with the line
export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:jre/bin/java::")
so that JAVA_HOME now points to the JDK root directory (the one containing bin/javah) instead of the jre/ subdirectory.
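To see concretely what the change does, here is what the command chain resolves to on a stock Ubuntu 14.04 with OpenJDK 7 (the exact path is an assumption, check yours):

readlink -f /usr/bin/java
# -> /usr/lib/jvm/java-7-openjdk-amd64/jre/bin/java
# stripping "jre/bin/java" leaves the JDK root, which does contain javah:
ls /usr/lib/jvm/java-7-openjdk-amd64/bin/javah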
Then, download the Hadoop 2.X stable sources: navigate to the List of Mirrors, select one, and decide which version to download. With wget you can run something like the following for Hadoop 2.7.2 (here I use the Apache archive, any mirror will do):
wget https://archive.apache.org/dist/hadoop/common/hadoop-2.7.2/hadoop-2.7.2-src.tar.gz
Once it has been downloaded, unpack it, enter the directory, and compile:
tar -xvf hadoop-2.7.2-src.tar.gz
cd hadoop-2.7.2-src/
mvn package -Pdist,native -Dmaven.javadoc.skip=true -DskipTests -Dtar
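If the mvn step fails early, it may be worth checking the toolchain first; as far as I remember, the native 2.7.x build wants Maven, protobuf 2.5.0, and the javah we fixed above:

mvn -version
protoc --version          # the 2.7.x build expects libprotoc 2.5.0
ls $JAVA_HOME/bin/javah   # should exist now that JAVA_HOME points at the JDK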
The compiled files will be found in hadoop-dist/target/hadoop-2.7.2.tar.gz; just move the archive to your home directory:
mv hadoop-dist/target/hadoop-2.7.2.tar.gz ~/
Now let's copy these files to the other nodes, e.g., from mynode1:
scp ~/hadoop-2.7.2.tar.gz hduser@mynode2:~/
scp ~/hadoop-2.7.2.tar.gz hduser@mynode3:~/
Install the Compiled Code
The following steps will be needed on all the machines.
We unpack the compiled version into /usr/local, alongside the old version, and replace the symlink called /usr/local/hadoop; this effectively switches the link from the old software to the new.
sudo tar -xvf ~/hadoop-2.7.2.tar.gz -C /usr/local/
sudo rm /usr/local/hadoop
sudo ln -s /usr/local/hadoop-2.7.2 /usr/local/hadoop
sudo chown -R hduser:hadoop /usr/local/hadoop-2.7.2
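To double check that the link now points at the new version (the output below is what I would expect, not copied verbatim from my terminal):

ls -ld /usr/local/hadoop
# lrwxrwxrwx 1 root root 23 ... /usr/local/hadoop -> /usr/local/hadoop-2.7.2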
We also have to edit the hadoop-env.sh file for the same $JAVA_HOME variable, which it seems unable to set up properly, so we open the file /usr/local/hadoop/etc/hadoop/hadoop-env.sh and around line 27 we can replace the default JAVA_HOME definition with
export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:jre/bin/java::")
If you want to be sure it worked, you can print some values, like
echo $JAVA_HOME
echo $HADOOP_HOME
Set Up All the Config Files
On all the machines we want to pass the configuration from the previous version to the new one.
So we need to copy over the files hdfs-site.xml, core-site.xml, yarn-site.xml, and slaves to the current hadoop directory. Hence, assuming we are copying from a previously installed 2.4.1 version living in /usr/local/hadoop-2.4.1, we run the command
cp /usr/local/hadoop-2.4.1/etc/hadoop/hdfs-site.xml \
   /usr/local/hadoop-2.4.1/etc/hadoop/core-site.xml \
   /usr/local/hadoop-2.4.1/etc/hadoop/yarn-site.xml \
   /usr/local/hadoop-2.4.1/etc/hadoop/slaves \
   /usr/local/hadoop/etc/hadoop/
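Since this has to happen on every node, you can also push the step from the main node over ssh; a sketch, assuming passwordless ssh access for hduser to the other nodes:

for h in mynode2 mynode3; do
  ssh "$h" 'cp /usr/local/hadoop-2.4.1/etc/hadoop/hdfs-site.xml \
            /usr/local/hadoop-2.4.1/etc/hadoop/core-site.xml \
            /usr/local/hadoop-2.4.1/etc/hadoop/yarn-site.xml \
            /usr/local/hadoop-2.4.1/etc/hadoop/slaves \
            /usr/local/hadoop/etc/hadoop/'
done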
Initialize the Upgrade of HDFS
These commands will be used only on the main node, and only once.
If all went well, we should be able to run the following command
hadoop version
and obtain something like
Hadoop 2.7.2
Subversion Unknown -r Unknown
Compiled by hduser on 2016-02-19T11:03Z
Compiled with protoc 2.5.0
From source with checksum d0fda26633fa762bff87ec759ebe689c
This command was run using /usr/local/hadoop-2.7.2/share/hadoop/common/hadoop-common-2.7.2.jar
Now the first step is to initialize the upgrade, which will translate the stored data to the new layout and something else that I'm not really sure about (if you know, please let me know). So on the main node you start the cluster in upgrade mode by running:
/usr/local/hadoop/sbin/start-dfs.sh -upgrade
This will leave you with a running Hadoop cluster on the new software version, without (hopefully) compromising the data, and allowing you to roll back (if you really need it).
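As a quick sanity check, you can list the Java processes on each node; with only HDFS started I would expect NameNode (and SecondaryNameNode) on the main node and DataNode on the others, though your list may differ:

jps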
You can also check this status on the HDFS Web User Interface, opening the URL http://10.10.10.104:50070: at the top a message appears in a blue box saying that an upgrade is in progress.
Test the Data on the Cluster!
These commands will be used only on the main node.
And if the previous command didn't complain about anything, we can check the content of the HDFS root directory with
hfs -ls /
so that we know whether the data was lost (no, it is not!).
We can also try to export some data to the local disk to check it. Assume you have a file at /datastore/my_file.txt; you can get a local copy with
hfs -copyToLocal /datastore/my_file.txt ./my_file.txt
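If you kept a checksum of that file from before the upgrade, you can compare it against the fresh copy (my_file.txt is of course just a placeholder):

md5sum ./my_file.txt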
Note that I'm using my alias for hfs, which is defined in the .profile file as:
alias hfs="hdfs dfs"
You can find all of them on my previous tutorial.
Finalize the Upgrade
Once you are reasonably confident that the upgrade can be finalized, you run, only on the main node, the command
hdfs dfsadmin -finalizeUpgrade
Check the Web User Interface again, http://10.10.10.104:50070: the message at the top about the upgrade in progress has disappeared.
And you are done!