This guide shows, step by step, how to upgrade a multi-node Hadoop/HDFS cluster from version 2.4.1 to version 2.7.2 on Ubuntu 14.04. These instructions refer to the setup I described previously. In general, this should work for any upgrade within the 2.X branch, but I cannot guarantee that.
Note: this is not a high-availability upgrade; I will shut down my Hadoop/HDFS cluster as the very first thing. If you cannot afford downtime, you may want to consider this page on the Apache Wiki instead, but I did not try it.
I will assume you have something similar to what I described in my previous article, i.e., a 3-node cluster. My test case is the following (with IP addresses and shortnames):
10.10.10.104 mynode1
10.10.10.105 mynode2
10.10.10.106 mynode3
Note: We assume the nodes in the cluster have the same hardware configuration, i.e., the same type of architecture.
I also assume that the hduser exists and is authenticated on all machines, i.e., you should run
sudo su - hduser
cd ~
From now on, all commands in this guide will be run as the `hduser`.
Make sure you have everything up to date.
sudo apt-get update
sudo apt-get upgrade
Repeat this installation procedure, up to this point, on every node you have in the cluster.
The following will be necessary only on the first node. We start a screen session so we can work remotely without fear of losing work if we get disconnected.
screen -S installing
With -S you can give your session whatever name you like.
The first thing now is to stop the cluster and check that everything is quiet; a sketch of the commands is below.
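A minimal sketch, assuming the sbin scripts of the old installation are still linked at /usr/local/hadoop as in my previous setup:

/usr/local/hadoop/sbin/stop-yarn.sh
/usr/local/hadoop/sbin/stop-dfs.sh
# then, on every node, check that no Hadoop daemons are left running
jps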
Compile the Sources
The following steps will be needed only once on the primary node.
I was upgrading to 2.7.2, and for some reason the new version required a tool called javah (note the h), which was installed alongside Java but placed in some other directory, so I needed to change one line where JAVA_HOME is defined (in my case, in the .profile from my previous setup).
To be more precise, replace the line
export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")
with the line
export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:jre/bin/java::")
so that JAVA_HOME now points to the JDK root directory (the one containing bin/javah) instead of the jre/ subdirectory.
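To see concretely what the change does, here is what the command chain resolves to on a stock Ubuntu 14.04 with OpenJDK 7 (the exact path is an assumption, check yours):

readlink -f /usr/bin/java
# -> /usr/lib/jvm/java-7-openjdk-amd64/jre/bin/java
# stripping "jre/bin/java" leaves the JDK root, which does contain javah:
ls /usr/lib/jvm/java-7-openjdk-amd64/bin/javah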
Then, download the Hadoop 2.X stable sources: navigate to the List of Mirrors, select one, and decide which version to download. With wget you can run something like the following for Hadoop 2.7.2 (here I use the Apache archive, any mirror will do):
wget https://archive.apache.org/dist/hadoop/common/hadoop-2.7.2/hadoop-2.7.2-src.tar.gz
Once it has been downloaded, unpack it, enter the directory, and compile:
tar -xvf hadoop-2.7.2-src.tar.gz
cd hadoop-2.7.2-src/
mvn package -Pdist,native -Dmaven.javadoc.skip=true -DskipTests -Dtar
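If the mvn step fails early, it may be worth checking the toolchain first; as far as I remember, the native 2.7.x build wants Maven, protobuf 2.5.0, and the javah we fixed above:

mvn -version
protoc --version          # the 2.7.x build expects libprotoc 2.5.0
ls $JAVA_HOME/bin/javah   # should exist now that JAVA_HOME points at the JDK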
The compiled files will be found in hadoop-dist/target/hadoop-2.7.2.tar.gz; just move the archive to your home directory:
mv hadoop-dist/target/hadoop-2.7.2.tar.gz ~/
Now let's copy these files to the other nodes, e.g., from mynode1:
scp ~/hadoop-2.7.2.tar.gz hduser@mynode2:~/
scp ~/hadoop-2.7.2.tar.gz hduser@mynode3:~/
Install the Compiled Code
The following steps will be needed on all the machines.
We unpack the compiled version into /usr/local, alongside the old version, and replace the symlink called /usr/local/hadoop; this effectively switches the link from the old software to the new.
sudo tar -xvf ~/hadoop-2.7.2.tar.gz -C /usr/local/
sudo rm /usr/local/hadoop
sudo ln -s /usr/local/hadoop-2.7.2 /usr/local/hadoop
sudo chown -R hduser:hadoop /usr/local/hadoop-2.7.2
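To double check that the link now points at the new version (the output below is what I would expect, not copied verbatim from my terminal):

ls -ld /usr/local/hadoop
# lrwxrwxrwx 1 root root 23 ... /usr/local/hadoop -> /usr/local/hadoop-2.7.2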
We also have to edit the hadoop-env.sh file for the same $JAVA_HOME variable, which it seems unable to set up properly, so we open the file /usr/local/hadoop/etc/hadoop/hadoop-env.sh and around line 27 we can replace the default JAVA_HOME definition with
export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:jre/bin/java::")
If you want to be sure it worked, you can print some values, like
echo $JAVA_HOME
echo $HADOOP_HOME
Set Up All the Config Files
On all the machines we want to pass the configuration from the previous version to the new one.
So we need to copy over the files hdfs-site.xml, core-site.xml, yarn-site.xml, and slaves to the current hadoop directory. Hence, assuming we are copying from a previously installed 2.4.1 version living in /usr/local/hadoop-2.4.1, we run the command
cp /usr/local/hadoop-2.4.1/etc/hadoop/hdfs-site.xml \
   /usr/local/hadoop-2.4.1/etc/hadoop/core-site.xml \
   /usr/local/hadoop-2.4.1/etc/hadoop/yarn-site.xml \
   /usr/local/hadoop-2.4.1/etc/hadoop/slaves \
   /usr/local/hadoop/etc/hadoop/
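Since this has to happen on every node, you can also push the step from the main node over ssh; a sketch, assuming passwordless ssh access for hduser to the other nodes:

for h in mynode2 mynode3; do
  ssh "$h" 'cp /usr/local/hadoop-2.4.1/etc/hadoop/hdfs-site.xml \
            /usr/local/hadoop-2.4.1/etc/hadoop/core-site.xml \
            /usr/local/hadoop-2.4.1/etc/hadoop/yarn-site.xml \
            /usr/local/hadoop-2.4.1/etc/hadoop/slaves \
            /usr/local/hadoop/etc/hadoop/'
done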
Initialize the Upgrade of HDFS
These commands will be used only on the main node, and only once.
If all went well, we should be able to run the following command
hadoop version
and obtain something like
Hadoop 2.7.2
Subversion Unknown -r Unknown
Compiled by hduser on 2016-02-19T11:03Z
Compiled with protoc 2.5.0
From source with checksum d0fda26633fa762bff87ec759ebe689c
This command was run using /usr/local/hadoop-2.7.2/share/hadoop/common/hadoop-common-2.7.2.jar
Now the first step is to initialize the upgrade, which will translate the stored data to the new layout and something else that I'm not really sure about (if you know, please let me know). So on the main node you start the cluster in upgrade mode by running:
/usr/local/hadoop/sbin/start-dfs.sh -upgrade
This will leave you with a running Hadoop cluster on the new software version, without (hopefully) compromising the data, and allowing you to roll back (if you really need it).
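As a quick sanity check, you can list the Java processes on each node; with only HDFS started I would expect NameNode (and SecondaryNameNode) on the main node and DataNode on the others, though your list may differ:

jps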
You can also check this status on the HDFS Web User Interface, opening the URL http://10.10.10.104:50070: at the top a message appears in a blue box saying that an upgrade is in progress.
Test the Data on the Cluster!
These commands will be used only on the main node.
And if the previous command didn't complain about anything, we can check the content of the HDFS root directory with
hfs -ls /
so that we know whether the data was lost (no, it is not!).
We can also try to export some data to the local disk to check it. Assume you have a file at /datastore/my_file.txt; you can get a local copy with
hfs -copyToLocal /datastore/my_file.txt ./my_file.txt
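If you kept a checksum of that file from before the upgrade, you can compare it against the fresh copy (my_file.txt is of course just a placeholder):

md5sum ./my_file.txt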
Note that I'm using my alias for hfs, which is defined in the .profile file as:
alias hfs="hdfs dfs"
You can find all of them on my previous tutorial.
Finalize the Upgrade
Once you are reasonably confident that the upgrade can be finalized, you run, only on the main node, the command
hdfs dfsadmin -finalizeUpgrade
Check the Web User Interface again, http://10.10.10.104:50070: the message at the top about the upgrade in progress has disappeared.
And you are done!