Wednesday, November 10, 2010

Using Hadoop Streaming and bash scripts to generate an xml file

Today we are going to try to generate an xml file from an hdfs source. I looked around for a straightforward solution, but couldn't find one.
I'm probably missing something, as there should be an easier way to fetch an hdfs file and serialize it to xml.

Use cases, you ask? Sure. Whenever you need to push the hadoop job output to anything that deals (just) with xml files (I'm looking at you, Solr deletes; also sitemaps, you get the idea).

Anyway, this is what I could come up with. Fair warning: it is not very pretty.

Striving to keep things really simple, I will use just hadoop streaming and bash scripts (no python, sorry).

In order to generate an xml file, you need a sort of master script that writes the xml root node; the streaming job then fills in the child nodes as needed.

Following the Solr delete use case, we need the following format: "<delete>...nodes...</delete>". Next, we need a streaming mapper that will generate each "<id>ID12345</id>" node of the xml file.

This is what the master script looks like:
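
To make the target format concrete, here is a minimal local sketch (no hadoop involved) that wraps a list of ids in the root node. The function name build_delete_xml and the sample ids are made up for illustration; they are not part of the job itself.

```shell
#!/bin/bash
# Hypothetical helper: wrap each stdin line in an <id> node
# inside a single <delete> root node.
build_delete_xml() {
  echo '<delete>'
  while IFS= read -r id; do
    echo "<id>$id</id>"
  done
  echo '</delete>'
}

# Two sample ids, one per line; prints the root node, one <id> node per line,
# then the closing root node.
printf 'ID12345\nID67890\n' | build_delete_xml
```

The rest of the post splits this same work in two: the master script writes the root node, and the mapper fills in the lines in between.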


#!/bin/bash

# the streaming jar location (the glob is expanded where the variable is used unquoted)
HADOOP_STREAMING_JAR=$HADOOP_HOME/contrib/streaming/hadoop-*-streaming.jar

# the current timestamp, to avoid name conflicts on disk and hdfs
TIMESTAMP=$(date +%s)

# input (hdfs location)
IN=/user/hadoop-user/in/test-input

# output (hdfs location), this will be an empty folder, containing just the job's logs
OUT=/user/hadoop-user/out/$TIMESTAMP

# xml file output (disk location)
XML_OUT=/home/hadoop/test/test-out-$TIMESTAMP.xml

# the job's options, notice how we specify 0 reducers
# (an array keeps the job name, which contains a space, as a single argument)
OPTS=(-D mapred.reduce.tasks=0 -D "mapred.job.name=Streaming Test")

# write the root node, let the mappers append the child nodes, then close the root node
echo '<delete>' > "$XML_OUT"
hadoop jar $HADOOP_STREAMING_JAR "${OPTS[@]}" -mapper "$(pwd)/map.sh $XML_OUT" -input "$IN" -output "$OUT"
echo '</delete>' >> "$XML_OUT"


There are some things we need to be careful about:

- use the script's full path, otherwise hadoop will not be able to find it and you'll get a java.lang.RuntimeException: Error in configuring object .... Caused by: java.io.IOException: Cannot run program "map.sh": java.io.IOException: error=2, No such file or directory.

- I've used zero reducers here. By specifying "-D mapred.reduce.tasks=0" the output from the map is considered to be the final output of the job.

- you still need to define an output location for the job, although nothing goes there except the job logs. Hadoop refuses to run a job whose output directory already exists, which is why I used the timestamp trick (date +%s): without it I would get an error every time I re-ran the job. There are 2 places where you can have a name clash, disk and hdfs, so I've added the timestamp to both output variables (OUT and XML_OUT).

- another trick was to pass the output file name from the master script to the mapper script. By using "/home/hadoop/test/map.sh $XML_OUT" as the mapper command (the master script builds the directory part from pwd), I've managed to pass the file name as a parameter, while still reading the streaming data from stdin. This keeps map.sh really light.
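
The "file name as a parameter, data on stdin" contract can be tried locally before involving hadoop at all. Below is a rough dry run under a few assumptions: the scratch directory comes from mktemp, the mapper is a stand-in written on the fly, and the two input lines are invented sample data that already contain the id nodes.

```shell
#!/bin/bash
# Local dry run: no hadoop, just the same stdin-plus-argument contract.
WORK_DIR=$(mktemp -d)              # hypothetical scratch directory
XML_OUT="$WORK_DIR/test-out.xml"   # stands in for the real xml output path

# A stand-in for map.sh: append every stdin line to the file named by $1.
cat > "$WORK_DIR/map.sh" <<'EOF'
#!/bin/bash
while IFS= read -r line; do
  printf '%s\n' "$line" >> "$1"
done
EOF

# Same three steps as the master script, with printf playing the streaming job.
echo '<delete>' > "$XML_OUT"
printf '<id>ID12345</id>\n<id>ID67890</id>\n' | bash "$WORK_DIR/map.sh" "$XML_OUT"
echo '</delete>' >> "$XML_OUT"

cat "$XML_OUT"
```

If the file printed at the end looks right, the only thing left to debug on the cluster is the streaming invocation itself.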

Next, the mapper script (map.sh):


#!/bin/bash

# append every line read from stdin to the file passed as the first argument
while IFS= read -r in
do
  printf '%s\n' "$in" >> "$1"
done


Just take the stdin (read in) and push it to the output file (passed as a param: $1).
It's that easy :)

To be honest, I've only used a single-column file, so I didn't worry about splitting logic, although this is easy enough in bash. I'll link to a Stack Overflow question on splitting a string in bash for your reading pleasure.
Moreover, I did not worry about stripping newlines from the input; this can lead to a pretty verbose xml (each node being on a different line).
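
For a multi-column input, the splitting could look roughly like the sketch below. It assumes tab-separated lines with the id in the first column; the function names xml_escape and emit_id_nodes and the sample rows are hypothetical. It also escapes the characters that are special in xml text, which the single-column version above does not do.

```shell
#!/bin/bash
# Hypothetical mapper variant for tab-separated input: keep column 1 only.

# Escape the characters that are special inside an xml text node.
xml_escape() {
  sed -e 's/&/\&amp;/g' -e 's/</\&lt;/g' -e 's/>/\&gt;/g'
}

# Read "id<TAB>other columns" lines and print one <id> node per line.
emit_id_nodes() {
  while IFS="$(printf '\t')" read -r id rest; do
    printf '<id>%s</id>\n' "$(printf '%s' "$id" | xml_escape)"
  done
}

# Invented sample rows: the second id contains a character that needs escaping.
printf 'ID12345\tfoo\nA&B\tbar\n' | emit_id_nodes
```

Dropping the \n from the printf format would put all nodes on a single line, if the verbosity bothers you.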

Overall this feels a bit hacky. I feel that it should be easier to save as xml from hadoop, to keep the friction of interacting with other systems to a minimum.
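
One caveat: the mapper appends to a path on the local disk, so the approach relies on the map tasks running on the same machine as the master script (a single-node setup, for example). A possible alternative, sketched here and not run against a real cluster, is to let the mapper print the nodes to stdout, so they land in the part files under the job's output folder, and then wrap those files in the root node. HADOOP_BIN and collect_delete_xml are hypothetical names; the only hadoop command used is "hadoop fs -cat".

```shell
#!/bin/bash
# Sketch: wrap the job's part files (on hdfs) in the <delete> root node.
# HADOOP_BIN is a hypothetical override, handy for trying this without a cluster.
HADOOP_BIN=${HADOOP_BIN:-hadoop}

collect_delete_xml() {
  local out_dir=$1   # hdfs output folder of the streaming job
  local xml_out=$2   # local xml file to write
  {
    echo '<delete>'
    # the glob is quoted so that hadoop, not the local shell, expands it
    "$HADOOP_BIN" fs -cat "$out_dir/part-*"
    echo '</delete>'
  } > "$xml_out"
}

# Intended usage, after the streaming job has finished:
#   collect_delete_xml "$OUT" "$XML_OUT"
```

With this variant the mapper no longer needs the file name parameter, at the cost of one extra pass over the output.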
