Apache Hadoop and Karaf, Article 1: Karaf as HDFS client

Maybe some of you remember that, a couple of months ago, I posted some messages on the Hadoop mailing list about OSGi support in Hadoop (http://mail-archives.apache.org/mod_mbox/hadoop-common-dev/201202.mbox/%3C4F3285F1.2000704@nanthrax.net%3E).

To move forward on this topic, instead of a large refactoring, I started to work on standalone, atomic bundles that we can deploy in Karaf. The purpose is to avoid changing Hadoop core while still providing good Hadoop support directly in Karaf.

I worked on Hadoop trunk (3.0.0-SNAPSHOT) and prepared patches (https://issues.apache.org/jira/browse/HADOOP-9706).

I also deployed the bundles on my Maven repository, so that users can deploy hadoop-karaf directly in a running Karaf instance.

The purpose is to explain what you can do with it and the value it brings, and maybe you will vote to “include” it in Hadoop directly 😉

To explain exactly what you can do, I prepared a series of blog posts:

  • Article 1: Karaf as HDFS client. This is the first post (the one you are reading). We will see the hadoop-karaf bundle installation, the hadoop and hdfs Karaf shell commands, and how you can use HDFS to store bundles or features using the HDFS URL handler.
  • Article 2: Karaf as MapReduce job client. We will see how to run MapReduce jobs directly from Karaf, and the “hot-deploy-and-run” of MapReduce jobs using the Hadoop deployer.
  • Article 3: Exposing Hadoop, HDFS, Yarn, and MapReduce features as OSGi services. We will see how to use Hadoop features programmatically thanks to OSGi services.
  • Article 4: Karaf as an HDFS datanode (and possibly namenode). Here, rather than using Karaf as a simple HDFS client, Karaf will be part of HDFS, acting as a datanode and/or namenode.
  • Article 5: Karaf, Camel, Hadoop all together. In this article, we will use the Hadoop OSGi services now available in Karaf inside Camel routes (plus the camel-hdfs component).
  • Article 6: Karaf as a complete Hadoop container. I will explain here what I did in Hadoop to add complete support for OSGi and Karaf.

Karaf as HDFS client

Just a reminder about HDFS (Hadoop Distributed FileSystem).

HDFS is composed of:
– a namenode hosting the metadata of the filesystem (directories, block locations, file permissions or modes, …). There is only one namenode per HDFS, and the metadata are stored in memory by default.
– a set of datanodes hosting the file blocks. Files are composed of blocks (like in all filesystems). The blocks are located on different datanodes. The blocks can be replicated.

An HDFS client connects to the namenode to execute actions on the filesystem (ls, rm, mkdir, cat, …).
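
To make the split between metadata and blocks more concrete, here is a small toy model in plain Java. It is only an illustration and not Hadoop code: the ToyNamenode class, its method names, the tiny block size, and the round-robin placement are all assumptions of mine (the real HDFS placement policy is more elaborate). It only shows the idea that the namenode keeps, in memory, which blocks make up a file and which datanodes hold each replica.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model only: not Hadoop code, and not HDFS's real block placement policy. */
public class ToyNamenode {
    private final List<String> datanodes;
    private final long blockSize;
    private final int replication;
    // Metadata kept in memory: file path -> ordered block ids.
    private final Map<String, List<String>> fileBlocks = new LinkedHashMap<>();
    // Metadata kept in memory: block id -> datanodes holding a replica.
    private final Map<String, List<String>> blockLocations = new LinkedHashMap<>();
    private int nextNode = 0;

    public ToyNamenode(List<String> datanodes, long blockSize, int replication) {
        if (blockSize <= 0 || replication < 1 || replication > datanodes.size()) {
            throw new IllegalArgumentException("invalid block size or replication");
        }
        this.datanodes = new ArrayList<>(datanodes);
        this.blockSize = blockSize;
        this.replication = replication;
    }

    /** Registers a file of the given length and assigns its blocks to datanodes. */
    public void addFile(String path, long length) {
        long blockCount = (length + blockSize - 1) / blockSize; // ceiling division
        List<String> blocks = new ArrayList<>();
        for (long i = 0; i < blockCount; i++) {
            String blockId = path + "#" + i; // hypothetical id format
            List<String> replicas = new ArrayList<>();
            // Hypothetical round-robin placement on distinct datanodes.
            for (int r = 0; r < replication; r++) {
                replicas.add(datanodes.get((nextNode + r) % datanodes.size()));
            }
            nextNode = (nextNode + 1) % datanodes.size();
            blockLocations.put(blockId, replicas);
            blocks.add(blockId);
        }
        fileBlocks.put(path, blocks);
    }

    /** Returns the ordered block ids of a file (empty if the file is unknown). */
    public List<String> blocksOf(String path) {
        return fileBlocks.getOrDefault(path, List.of());
    }

    /** Returns the datanodes holding a replica of the block (empty if unknown). */
    public List<String> locationsOf(String blockId) {
        return blockLocations.getOrDefault(blockId, List.of());
    }

    public static void main(String[] args) {
        // One datanode and one replica per block, like a single-machine setup.
        ToyNamenode namenode = new ToyNamenode(List.of("dn1"), 128, 1);
        namenode.addFile("/bundles/demo.jar", 300);
        for (String block : namenode.blocksOf("/bundles/demo.jar")) {
            System.out.println(block + " -> " + namenode.locationsOf(block));
        }
    }
}
```

With a single datanode, the replication factor cannot be higher than 1 in this toy model, which mirrors why a single-machine HDFS uses only one replica per block.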

Preparing HDFS

The first step is to set up the HDFS filesystem.

I’m going to use a “pseudo-cluster”: an HDFS with the namenode and only one datanode on a single machine.
To do so, I configure the $HADOOP_INSTALL/etc/hadoop/core-site.xml file like this:


<configuration>

  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost/</value>
  </property>

</configuration>

For a pseudo-cluster, we set up only one replica per block (as we have only one datanode) in the $HADOOP_INSTALL/etc/hadoop/hdfs-site.xml file:

<configuration>

  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>

</configuration>

Now, we can format the namenode:


$HADOOP_INSTALL/bin/hdfs namenode -format

and start the HDFS (both namenode and datanode):


$HADOOP_INSTALL/sbin/start-dfs.sh

Now, we can connect to the HDFS and create a first folder:


$HADOOP_INSTALL/bin/hadoop fs -mkdir /bundles
$HADOOP_INSTALL/bin/hadoop fs -ls /
Found 1 items
drwxr-xr-x - jbonofre supergroup 0 2013-07-07 22:18 /bundles

Our HDFS is up and running.

Configuration and installation of hadoop-karaf

I created hadoop-karaf as a standalone bundle: it embeds many of its dependencies internally (directly in the bundle classloader).

The purpose is to:

  1. avoid altering anything in Hadoop core. Thanks to this approach, I can provide the hadoop-karaf bundle for different Hadoop versions without touching Hadoop itself.
  2. ship all dependencies in the same bundle classloader. Of course, it’s not ideal in terms of OSGi, but to provide an easy, ready-to-use bundle, I gathered most of the dependencies in the hadoop-karaf bundle.

For now, I worked directly on trunk (Hadoop 3.0.0-SNAPSHOT). If you are interested, I can provide hadoop-karaf for existing Hadoop releases.

Before deploying the hadoop-karaf bundle, we have to prepare the Hadoop configuration. To integrate with Karaf, I implemented a mechanism that creates and populates the Hadoop configuration from OSGi ConfigAdmin.
The only requirement for the user is to create an org.apache.hadoop PID in the Karaf etc folder containing the Hadoop properties. In practice, this just means creating a $KARAF_INSTALL/etc/org.apache.hadoop.cfg file containing:


fs.default.name = hdfs://localhost/
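
To give an idea of what “populating the Hadoop configuration from ConfigAdmin” can look like, here is a minimal sketch in plain Java. It is not the actual hadoop-karaf code: the HadoopConfigSketch class and its methods are hypothetical, and a plain Map stands in for the Hadoop Configuration object so that the example runs without Hadoop or OSGi on the classpath. I also assume that java.util.Properties syntax is close enough to parse this simple .cfg file. In a real bundle, ConfigAdmin would hand the Dictionary to the bundle, and each entry would be copied with Hadoop’s Configuration.set(name, value).

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.Dictionary;
import java.util.Enumeration;
import java.util.Hashtable;
import java.util.Map;
import java.util.Properties;
import java.util.TreeMap;

/** Sketch only: a Map stands in for org.apache.hadoop.conf.Configuration. */
public class HadoopConfigSketch {

    /** Parses .cfg content (assumed to follow java.util.Properties syntax) into a Dictionary. */
    public static Dictionary<String, Object> loadCfg(Reader cfg) throws IOException {
        Properties props = new Properties();
        props.load(cfg);
        Dictionary<String, Object> dictionary = new Hashtable<>();
        for (String name : props.stringPropertyNames()) {
            dictionary.put(name, props.getProperty(name));
        }
        return dictionary;
    }

    /** Copies the entries into a key/value map, skipping container bookkeeping keys. */
    public static Map<String, String> toHadoopProperties(Dictionary<String, ?> osgiConfig) {
        Map<String, String> hadoopProperties = new TreeMap<>();
        Enumeration<String> keys = osgiConfig.keys();
        while (keys.hasMoreElements()) {
            String key = keys.nextElement();
            // service.pid is added by ConfigAdmin, felix.fileinstall.* by the file installer.
            if (key.equals("service.pid") || key.startsWith("felix.fileinstall.")) {
                continue;
            }
            hadoopProperties.put(key, String.valueOf(osgiConfig.get(key)));
        }
        return hadoopProperties;
    }

    public static void main(String[] args) throws IOException {
        Dictionary<String, Object> config =
                loadCfg(new StringReader("fs.default.name = hdfs://localhost/\n"));
        config.put("service.pid", "org.apache.hadoop");
        System.out.println(toHadoopProperties(config));
    }
}
```

Filtering out service.pid and the felix.fileinstall.* keys is a design choice of this sketch: those entries describe the OSGi configuration itself rather than Hadoop settings.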

If you don’t want to compile the hadoop-karaf bundle yourself, you can use the artifact that I deployed on my Maven repository (http://maven.nanthrax.net/org/apache/hadoop/hadoop-karaf/3.0.0-SNAPSHOT/hadoop-karaf-3.0.0-20130708.050912-1.jar).

To do this, edit etc/org.ops4j.pax.url.mvn.cfg and add my repository to the org.ops4j.pax.url.mvn.repositories property:


org.ops4j.pax.url.mvn.repositories = \
  http://maven.nanthrax.net/@snapshots@id=maven, \
  http://repo1.maven.org/maven2@id=central, \
  ...

Now, we can start Karaf as usual:


$KARAF_INSTALL/bin/karaf

NB: I use Karaf 2.3.1.

We can now install the hadoop-karaf bundle:


karaf@root> osgi:install -s mvn:org.apache.hadoop/hadoop-karaf/3.0.0-SNAPSHOT
karaf@root> la|grep -i hadoop
[ 54] [Active ] [Created ] [ 80] Apache Hadoop Karaf (3.0.0.SNAPSHOT)

hadoop:* and hdfs:* commands

The hadoop-karaf bundle comes with new Karaf shell commands.

For this first blog post, we are going to use only one command: hadoop:fs.

The hadoop:fs command allows you to work with HDFS directly in Karaf (it’s a wrapper around hadoop fs):


karaf@root> hadoop:fs -ls /
Found 1 items
drwxr-xr-x - jbonofre supergroup 0 2013-07-07 22:18 /bundles
karaf@root> hadoop:fs -df
Filesystem Size Used Available Use%
hdfs://localhost 5250875392 307200 4976799744 0%

HDFS URL handler

Another thing provided by the hadoop-karaf bundle is a URL handler that directly supports hdfs URLs.

It means that you can use hdfs URLs in Karaf commands such as osgi:install, features:addurl, …

It also means that you can use HDFS to store your Karaf bundles, features, or configuration files.
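
For readers wondering how a URL handler for a custom scheme works, here is a minimal sketch using only the JDK. It is not the hadoop-karaf implementation: the HdfsUrlHandlerSketch class is hypothetical, and an in-memory Map stands in for HDFS so that the example runs on its own. In an OSGi container, the handler would typically be registered as a URLStreamHandlerService with the url.handler.protocol property set to hdfs, and the stream would come from Hadoop’s FileSystem.open(Path) instead of the Map.

```java
import java.io.ByteArrayInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLConnection;
import java.net.URLStreamHandler;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

/** Sketch only: resolves hdfs: URLs against an in-memory map instead of a real HDFS. */
public class HdfsUrlHandlerSketch extends URLStreamHandler {
    // Stand-in for HDFS: file path -> file content.
    private final Map<String, byte[]> store;

    public HdfsUrlHandlerSketch(Map<String, byte[]> store) {
        this.store = store;
    }

    @Override
    protected URLConnection openConnection(URL target) {
        return new URLConnection(target) {
            @Override
            public void connect() {
                connected = true; // nothing to connect to in this sketch
            }

            @Override
            public InputStream getInputStream() throws IOException {
                // A real handler would open the path on HDFS here.
                byte[] data = store.get(getURL().getPath());
                if (data == null) {
                    throw new FileNotFoundException(getURL().toString());
                }
                return new ByteArrayInputStream(data);
            }
        };
    }

    /** Builds a URL such as hdfs:/bundles/demo.jar that is bound to this handler. */
    public static URL hdfsUrl(String spec, Map<String, byte[]> store)
            throws MalformedURLException {
        return new URL(null, spec, new HdfsUrlHandlerSketch(store));
    }

    public static void main(String[] args) throws IOException {
        Map<String, byte[]> fakeHdfs = new HashMap<>();
        fakeHdfs.put("/bundles/demo.jar", "not a real jar".getBytes(StandardCharsets.UTF_8));
        URL url = hdfsUrl("hdfs:/bundles/demo.jar", fakeHdfs);
        try (InputStream in = url.openStream()) {
            System.out.println(url + " -> " + in.readAllBytes().length + " bytes");
        }
    }
}
```

Once the scheme is resolved by a handler like this, any code that accepts a URL can read from it without knowing where the bytes come from.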

For instance, we can copy an OSGi bundle into HDFS:


$HADOOP_INSTALL/bin/hadoop fs -copyFromLocal ~/.m2/repository/org/apache/servicemix/bundles/org.apache.servicemix.bundles.commons-lang/2.4_6/org.apache.servicemix.bundles.commons-lang-2.4_6.jar /bundles/org.apache.servicemix.bundles.commons-lang-2.4_6.jar

The commons-lang bundle is now available in HDFS. We can check that directly in Karaf using the hadoop:fs command:


karaf@root> hadoop:fs -ls /bundles
Found 1 items
-rw-r--r-- 1 jbonofre supergroup 272039 2013-07-07 22:18 /bundles/org.apache.servicemix.bundles.commons-lang-2.4_6.jar

Now, we can install the commons-lang bundle in Karaf directly from HDFS, using an hdfs URL:


karaf@root> osgi:install hdfs:/bundles/org.apache.servicemix.bundles.commons-lang-2.4_6.jar
karaf@root> la|grep -i commons-lang
[ 55] [Installed ] [ ] [ 80] Apache ServiceMix :: Bundles :: commons-lang (2.4.0.6)

If we list the bundle locations, we can see the hdfs URL support:


karaf@root> la -l
...
[ 53] [Active ] [Created ] [ 30] mvn:org.apache.karaf.management.mbeans/org.apache.karaf.management.mbeans.dev/2.3.1
[ 54] [Active ] [Created ] [ 80] mvn:org.apache.hadoop/hadoop-karaf/3.0.0-SNAPSHOT
[ 55] [Installed ] [ ] [ 80] hdfs:/bundles/org.apache.servicemix.bundles.commons-lang-2.4_6.jar

Conclusion

This first blog post shows how to use Karaf as an HDFS client. The big advantage is that the hadoop-karaf bundle doesn’t change anything in Hadoop core, and so I can provide it for Hadoop 0.20.x, 1.x, 2.x, or trunk (3.0.0-SNAPSHOT).
In Article 3, you will see how to leverage HDFS directly through OSGi services (and so use it in your bundles, Camel routes, …).

Again, if you think that this series of articles is interesting, and you would like to see Karaf support in Hadoop, feel free to post a comment, send a message to the Hadoop mailing list, or do anything else to promote it 😉
