Saturday, May 21, 2011

Dealing with JPA's lack of addScalar method

If you are looking for a way to cache your native queries using the latest and the greatest (Hibernate 3.6.x, JPA 2.0 & ehcache) you are in for a nice surprise.

There is no easy way to specify a query result type in JPA, or maybe there is and I couldn't find one :)

The problem is twofold, because if you cannot specify a return type, you cannot cache your native query with ehcache. Now, you could say that maybe this is a sign, and you shouldn't cache your queries anyway, or you could just sweat the small stuff, like I do all the time, and find a solution to this tiny issue.

So, if you've made it this far, you are pretty much stuck with writing some sucky code, to cast back to the hibernate session, and set the info.

And now for some (scala) code:
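import javax.persistence.Query
import org.hibernate.SQLQuery
import org.hibernate.ejb.HibernateQuery
// 'type' is a Scala keyword, hence the backticks
import org.hibernate.`type`.StandardBasicTypes

val sql = "select distinct story as story from ...";
val q: Query = getEntityManager().createNativeQuery(sql);
//hello nasty hack
q.asInstanceOf[HibernateQuery].getHibernateQuery().asInstanceOf[SQLQuery].addScalar("story", StandardBasicTypes.LONG);
//next, caching
q.setHint("org.hibernate.cacheable", true);
q.setHint("org.hibernate.cacheRegion", "query.getTopLinks");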

The trick is that you cannot cast directly to SQLQuery: you have to cast to HibernateQuery, then call getHibernateQuery, then cast the result to SQLQuery.

Ugly(ish) but usable, just how I like it.

Also, notice the use of StandardBasicTypes.LONG, replacing the old Hibernate.LONG which is now deprecated.

P.S. I decided to write this after I found this question on SO.

Saturday, May 7, 2011

Hibernate Search meets Scala

Today I've pushed to github a small prototype, to test how/if Hibernate Search can play nicely with Scala 2.8.1.

If you haven't heard about Hibernate Search before, I highly recommend a quick look: it brings Lucene's power to your Hibernate-enabled application.
So if you are using Hibernate and you're interested in Lucene, chances are this could be for you.

Anyway, I've spent a little time building this small project, and I ran into some issues along the way:

Getting the Annotations to work right was the biggest pain point. Otherwise everything went smoothly.

The Java version:
@Indexed(index = "indexes/test")
@AnalyzerDef(name = "html_standard_analyzer", 
    tokenizer = @TokenizerDef(factory = StandardTokenizerFactory.class), 
    charFilters = { @CharFilterDef(factory = HTMLStripCharFilterFactory.class) }, 
    filters = {
        @TokenFilterDef(factory = StandardFilterFactory.class),
        @TokenFilterDef(factory = LowerCaseFilterFactory.class),
        @TokenFilterDef(factory = StopFilterFactory.class) })
@Analyzer(definition = "html_standard_analyzer")

The Scala version:
@Indexed(index = "indexes/test")
@AnalyzerDef(name = "html_standard_analyzer",
  tokenizer = new TokenizerDef(factory = classOf[StandardTokenizerFactory]),
  charFilters = { Array(new CharFilterDef(factory = classOf[HTMLStripCharFilterFactory])) },
  filters = {
    Array(new TokenFilterDef(factory = classOf[StandardFilterFactory]),
      new TokenFilterDef(factory = classOf[LowerCaseFilterFactory]),
      new TokenFilterDef(factory = classOf[StopFilterFactory]))
  })
@Analyzer(definition = "html_standard_analyzer")

I'm using Eclipse, and there are still some weird issues: although no problems are reported on the project as a whole and the tests pass, opening the annotated class shows some 'local' errors, apparently coming from the Scala compiler, not the Eclipse IDE itself:
@Field(index = org.hibernate.search.annotations.Index.TOKENIZED, store = Store.NO)
will yield this error:
Multiple markers at this line
  - annotation argument needs to be a constant; found: org{org.type}.hibernate{org.hibernate.type}.search{org.hibernate.search.type}.annotations
   {org.hibernate.search.annotations.type}.Index{org.hibernate.search.annotations.Index.type}.TOKENIZED{org.hibernate.search.annotations.Index(value TOKENIZED)}
  - annotation argument needs to be a constant; found: Store{<null>}.NO{<null>}

JUnit's @AfterClass: not really related to Hibernate Search, but still interesting.

If you attempt (as I did :) to just add @AfterClass (or @BeforeClass) to any method on the class, you'll see the beautiful:
java.lang.Exception: Method beforeClass() should be static

@BeforeClass and @AfterClass are expected to be on static methods. Apparently in Scala this is done by putting them on a companion object.
This is what a proper JUnit test should look like:
import org.junit.{After, AfterClass, Before, BeforeClass, Test}

class SimpleJunitTest {
  @Before
  def before() {
    println("before");
  }
  @Test
  def test1() {
    println("t1");
  }
  @Test
  def test2() {
    println("t2");
  }
  @After
  def after() {
    println("after");
  }
}

object SimpleJunitTest {
  @BeforeClass
  def beforeClass() {
    println("before class");
  }
  @AfterClass
  def afterClass() {
    println("after class");
  }
}
Notice that class SimpleJunitTest holds the tests, while object SimpleJunitTest holds the static methods annotated with @BeforeClass and @AfterClass.
Output:
before class
before
t1
after
before
t2
after
after class

Friday, February 11, 2011

Scripted Java Install on Ubuntu 10, part II: remote install

Part deux follows quickly as a simple extension of Part I.

Yes, you say, it is simple enough to install java without having to click accept, but how can I do that over the wire, remotely via ssh, from java?

So, just to recap: install java via ssh from java on ubuntu 10.

To be able to do anything from java via ssh, I strongly recommend jsch. I cannot tell you enough how cool it is, and it is used in a lot of open source projects.
One minor drawback: there is hardly any documentation on it. I mean really, close to none. But don't let that get you down, this is the reason we're all into open source in the first place, isn't it?

The code is divided into 2 main parts:

- the jsch wrapper, which takes care of the ssh details and also provides some higher level abstractions, like running a remote command. It is not tied to Ubuntu: it will work on any *nix like system, provided the commands are correct.
Here is the very lightweight version of the class:


- and the actual bash scripts, that come from Part I.


I'm just trying things out with gists from GitHub right now, so if this doesn't turn out to be a good idea, don't take it personally ;)

Moreover, in an effort to improve the quality of this blog, and also to revive it, as it has been a while since I've written anything useful here, I've decided to publish everything on GitHub.
This also helps me to play around a bit with the cool kids, as it seems that SVN is considered passé these days.
So, this is the link to today's project/snippet: remote-java-install.

enjoy!

Scripted Java Install on Ubuntu 10

This is how you install java automatically via command line scripts, without having to click to accept the licence, on Ubuntu 10 (this works on both 10.04 and 10.10).

sudo echo sun-java6-jre shared/accepted-sun-dlj-v1-1 select true | sudo /usr/bin/debconf-set-selections
sudo add-apt-repository "deb http://archive.canonical.com/ lucid partner"
sudo apt-get update
sudo apt-get install --yes sun-java6-jdk sun-java6-jre

echo "export JAVA_HOME=/usr/lib/jvm/java-6-sun" >> ~/.bashrc
echo "export PATH=\$PATH:\$JAVA_HOME/bin" >> ~/.bashrc
java -version

In short, you accept the licence prior to installing, you add the partner repository, you install java, you set the JAVA_HOME variable, update the PATH, and in the end check that it all ties in perfectly.

Wednesday, November 10, 2010

Using Hadoop Streaming and bash scripts to generate an xml file

Today we are going to try to generate an xml file from a hdfs source. I looked around for a straightforward solution, but couldn't find one.
I'm definitely missing something, as there should be an easier way to fetch a hdfs file and serialize it to xml.

Use cases, you ask? Sure. Whenever you need to push the hadoop job output to anything that deals (just) with xml files (I'm looking at you, Solr deletes, also sitemaps, you get the idea).

Anyway, this is what I could come up with. A fair warning, it is not very pretty.

Striving to keep things really simple, I will use just hadoop streaming and bash scripts (no python, sorry).

In order to generate an xml file, you would need a sort of master file that will push the xml root node, then the streaming job will fill in the child nodes, as needed.

Following the Solr delete use case, we need to have the following format: "<delete>...nodes...</delete>". Next, we need a streaming mapper that will generate each "<id>ID12345</id>" node of the xml file.

This is what the master script looks like:


#!/bin/bash

# the streaming jar location (the glob is expanded where the variable is used unquoted)
HADOOP_STREAMING_JAR=$HADOOP_HOME/contrib/streaming/hadoop-*-streaming.jar

# the current timestamp, to avoid name conflicts on disk and hdfs
TIMESTAMP=`date +%s`

# input (hdfs location)
IN=/user/hadoop-user/in/test-input

# output (hdfs location), this will be an empty folder, containing just the job's logs
OUT=/user/hadoop-user/out/$TIMESTAMP

# xml file output (disk location)
XML_OUT=/home/hadoop/test/test-out-$TIMESTAMP.xml

# the job's options, notice how we specify 0 reducers
# (an array keeps the job name, which contains a space, as a single argument)
OPTS=(-D mapred.reduce.tasks=0 -D "mapred.job.name=Streaming Test")

# write the xml root node around whatever the mappers append to the file
echo '<delete>' > "$XML_OUT"
hadoop jar $HADOOP_STREAMING_JAR "${OPTS[@]}" -mapper "`pwd`/map.sh $XML_OUT" -input "$IN" -output "$OUT"
echo '</delete>' >> "$XML_OUT"


There are some things we need to be careful about:

- use the script's full path, otherwise hadoop is not going to be able to find it.
You'll get a java.lang.RuntimeException: Error in configuring object .... Caused by: java.io.IOException: Cannot run program "map.sh": java.io.IOException: error=2, No such file or directory.

- I've used zero reducers here. By specifying "-D mapred.reduce.tasks=0" the output from the map is considered to be the final output of the job.

- you still need to define an output location for the job, although nothing goes there except the job logs. This is the reason why I had to use the timestamp trick (`date +%s`), so I won't get an error every time I try to execute the job. There are 2 places where you can have a name clash, disk and hdfs, which is why I've added the timestamp to both output variables (OUT and XML_OUT).

- another trick was to pass the output file name from the master to the mapper script. By using "/home/hadoop/test/map.sh $XML_OUT" I've managed to push the file name as a parameter, while the mapper still reads the streaming data from stdin. This keeps map.sh really, really light.

Next, the mapper script(map.sh):


#!/bin/bash

# append every line read from stdin to the file passed as the first argument
while IFS= read -r in
do
  echo "$in" >> "$1"
done


Just take the stdin (read in) and push it to the output file (passed as a param: $1).
It's that easy :)

To be honest, I've only used a single column file, so I won't worry about splitting logic, although this is easy enough in bash. I'll link to a Stack Overflow question on splitting a string in bash for your reading pleasure.
Moreover, I did not worry about stripping new lines from the input, which can lead to a pretty verbose xml (each node being on a different line).

Overall this feels a bit hacky. I feel that it should be easier to save as xml from hadoop, to keep the friction of interacting with other systems to a minimum.

Thursday, July 8, 2010

Haversine formula implemented in Pig

Today I needed to compute the distance between 2 geo points using Apache Pig. A simple google search took me to this blog post: Calculate distance, bearing and more between Latitude/Longitude points, using the Haversine formula. Exactly what I needed!

Moreover, it seems that Pig provides bindings for most (maybe all?) of the java.lang.Math functions that I need to compute this distance. You only need to reference the piggybank library in your pig scripts.

To test the computation, I'm going to use a sample file containing two geo-points (id, latitude, longitude), and I'm going to CROSS it with itself, so that we keep this example to a minimum.

The 2 points I'm going to use are 2 well known locations, taken from Wikipedia, so that we can easily test the distance by comparing to outside references.
First is the Eiffel Tower having coordinates: (48.8583°N 2.2945°E) or (48.8583, 2.2945), you can check that here too.
Second is the Arc de Triomphe, coordinates: (48° 52′ 25.68″ N, 2° 17′ 42″ E) or (48.8738, 2.295).

Using the initially cited script, we get a distance of 1.724km. Let's see what that would look like in Pig.


REGISTER /home/user/pig-0.7.0/contrib/piggybank/java/piggybank.jar;

define radians org.apache.pig.piggybank.evaluation.math.toRadians();
define sin org.apache.pig.piggybank.evaluation.math.SIN(); 
define cos org.apache.pig.piggybank.evaluation.math.COS();
define sqrt org.apache.pig.piggybank.evaluation.math.SQRT();
define atan2 org.apache.pig.piggybank.evaluation.math.ATAN2();

geo = load 'haversine.csv' using PigStorage(';') as (id1: long, lat1: double, lon1: double);
geo2 = load 'haversine.csv' using PigStorage(';') as (id2: long, lat2: double, lon2: double);

geoCross = CROSS geo, geo2;

geoDist = FOREACH geoCross GENERATE id1, id2, 6371 * 2 * atan2(sqrt(sin(radians(lat2 - lat1) / 2) * sin(radians(lat2 - lat1) / 2) + cos(radians(lat1)) * cos(radians(lat2)) * sin(radians(lon2 - lon1) / 2) * sin(radians(lon2 - lon1) / 2)), sqrt(1 - (sin(radians(lat2 - lat1) / 2) * sin(radians(lat2 - lat1) / 2) + cos(radians(lat1)) * cos(radians(lat2)) * sin(radians(lon2 - lon1) / 2) * sin(radians(lon2 - lon1) / 2)))) as dist; 

dump geoDist;


Just for context, I am using pig 0.7 to run this script.
As you can see, I have to include the piggybank jar so I can use the Math bindings. Next, I define some shortcuts to keep this code readable ('radians' is actually org.apache.pig.piggybank.evaluation.math.toRadians(), 'sin' is actually org.apache.pig.piggybank.evaluation.math.SIN() and so on).

As I've previously said, I'm loading the same file twice, just to keep things simple.
This is the sample file (haversine.csv):



1;48.8583;2.2945
2;48.8738;2.295 


The actual haversine formula is defined here:



6371 * 2 * atan2(sqrt(sin(radians(lat2 - lat1) / 2) * 
sin(radians(lat2 - lat1) / 2) + cos(radians(lat1)) * 
cos(radians(lat2)) * sin(radians(lon2 - lon1) / 2) * 
sin(radians(lon2 - lon1) / 2)), sqrt(1 - (sin(radians(lat2 - lat1) / 2) * 
sin(radians(lat2 - lat1) / 2) + cos(radians(lat1)) * 
cos(radians(lat2)) * sin(radians(lon2 - lon1) / 2) * 
sin(radians(lon2 - lon1) / 2))))


'6371' is the earth's radius in kilometres so this will output the distance in kilometres. If you want miles, you should replace '6371' with '3958.7'.
Unfortunately I had to make it a one-liner, so this might get in the way of the actual readability of the script.

I know you are curious to see the output:



(2L,1L,1.7239093620868347)
(2L,2L,0.0)
(1L,1L,0.0)
(1L,2L,1.7239093620868347)


The distance is 1.7239093620868347 km, which is basically the same (1.724km). Pig FTW!
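
As a cross-check outside Pig, here is a minimal sketch of the same formula in plain Java, with the intermediate values named instead of crammed into a one-liner. The class name HaversineSketch and the method distanceKm are made up for this illustration; the radius and the two points are the ones used above.

```java
class HaversineSketch {

    // Earth's radius in kilometres, the same constant as in the Pig script.
    static final double EARTH_RADIUS_KM = 6371.0;

    // Haversine distance in kilometres between two points given in decimal degrees.
    static double distanceKm(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double sinHalfLat = Math.sin(dLat / 2);
        double sinHalfLon = Math.sin(dLon / 2);
        // 'a' is the term that the Pig one-liner has to spell out twice.
        double a = sinHalfLat * sinHalfLat
                + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                * sinHalfLon * sinHalfLon;
        double c = 2 * Math.atan2(Math.sqrt(a), Math.sqrt(1 - a));
        return EARTH_RADIUS_KM * c;
    }

    public static void main(String[] args) {
        // Eiffel Tower to Arc de Triomphe, same coordinates as haversine.csv.
        System.out.println(distanceKm(48.8583, 2.2945, 48.8738, 2.295));
    }
}
```

Running it should print a value of roughly 1.724, in line with the Pig output above.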

Using pig has turned out, once again, to be a real life saver.
I wonder if this could be a useful addition to piggybank.

Friday, July 2, 2010

The infamous disappearing single quote in property files problem

Recently I had some problems with the internationalization support in an application. Apparently some single quotes kept disappearing. Not all of them though... The faulty web page even had a mix of good and bad behaviour.
Let the digging begin...

I came across this using Spring's ReloadableResourceBundleMessageSource. But apparently this comes directly from Java's MessageFormat class.

The Docs

It seems that if you need to add proper support for internationalization to your application via property files, you have to be really careful with the specs. They are not at all obvious.

In fact, just by skimming the docs, you get to this point: "Warning: The rules for using quotes within message format patterns unfortunately have shown to be somewhat confusing. In particular, it isn't always obvious to localizers whether single quotes need to be doubled or not."
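
To see the quote rule in isolation, here is a minimal sketch that calls java.text.MessageFormat directly, with no Spring involved. The class name MessageFormatQuoteDemo and its format helper are mine, just for illustration; the patterns are the same French strings I use in the tests further down.

```java
import java.text.MessageFormat;

class MessageFormatQuoteDemo {

    // Thin wrapper around the JDK call, so that each case below reads the same way.
    static String format(String pattern, Object... args) {
        return MessageFormat.format(pattern, args);
    }

    public static void main(String[] args) {
        // A lone quote is swallowed and opens a quoted section, so {0} stays literal.
        System.out.println(format("page d'accueil est {0}", "ici"));  // page daccueil est {0}

        // A doubled quote yields one literal quote, and {0} is substituted.
        System.out.println(format("page d''accueil est {0}", "ici")); // page d'accueil est ici

        // Even without any placeholder, the lone quote still disappears.
        System.out.println(format("page d'accueil", "unused"));       // page daccueil
    }
}
```

So anything that goes through MessageFormat needs its single quotes doubled, whether or not the pattern has placeholders.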

Ok, fair warning, what about the Spring docs? Nothing on ReloadableResourceBundleMessageSource, but if we go deeper, you have a mention of this issue on AbstractMessageSource, the parent class of our reloadable bundle of joy: "Note: By default, message texts are only parsed through MessageFormat if arguments have been passed in for the message. In case of no arguments, message texts will be returned as-is. As a consequence, you should only use MessageFormat escaping for messages with actual arguments, and keep all other messages un-escaped. If you prefer to escape all messages, set the "alwaysUseMessageFormat" flag to "true". ".

...ok, whatever... I said to myself the first time I read this, and like any other good developer, I forgot I ever read that and moved on.

Back to our problem: the application was behaving inconsistently on apparently similar use cases. Then it hit me... parameters!! If you pass parameters to an internationalized string, every single quote in it has to be escaped by another single quote; if you pass none, the quotes must not be escaped - ...you said what now?
Moreover, if you don't escape the single quotes in strings that have params, the parameter substitution will not work, so it's even more crappy.

As it later occurred to me, by reading the Spring docs, if you enable the "alwaysUseMessageFormat", it will always try to format the message, so you could just default to escaping every single quote and you are done.
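
The rule quoted from the Spring docs can be modelled in a few lines of plain Java. This is only my simplified reading of the documented behaviour, not Spring's actual code: the class MessageResolutionSketch and the resolve helper are hypothetical, and the boolean parameter stands in for the "alwaysUseMessageFormat" flag.

```java
import java.text.MessageFormat;

class MessageResolutionSketch {

    // Hypothetical helper: without arguments the text is returned as-is,
    // unless the caller asks for MessageFormat to be applied every time.
    static String resolve(String text, Object[] args, boolean alwaysUseMessageFormat) {
        boolean noArgs = (args == null || args.length == 0);
        if (noArgs && !alwaysUseMessageFormat) {
            return text; // never parsed, so quotes come back untouched
        }
        return MessageFormat.format(text, noArgs ? new Object[0] : args);
    }

    public static void main(String[] args) {
        // No args, flag off: the doubled quote leaks into the output.
        System.out.println(resolve("page d''accueil", null, false));                   // page d''accueil
        // No args, flag on: the doubled quote collapses to one.
        System.out.println(resolve("page d''accueil", null, true));                    // page d'accueil
        // An unused arg is enough to trigger parsing, and the lone quote vanishes.
        System.out.println(resolve("page d'accueil", new Object[] { "ici" }, false));  // page daccueil
    }
}
```

With the flag on, every message takes the same path, which is why escaping every single quote becomes a safe default.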

The Tests

It was still not 100% clear to me, so I decided to write some unit tests, covering these cases, just to be sure.

And now for the coding part of the post, I'm going to go through each test, and explain shortly what is going on.

A little context, before beginning: I'm using JUnit 4, and Spring's ReloadableResourceBundleMessageSource, bundled together with a small test property file. I'll also attach the entire project at the end, for your testing pleasure.

The property file:
simple=page
simple_param=page {0}
quoted=page d'accueil
quoted_escaped_noparam=page d''accueil
quoted_unescaped_param=page d'accueil est {0}
quoted_escaped_param=page d''accueil est {0}

The test class. Hopefully the comments I added to each method make the code easier to understand.


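// junit.framework.Assert matches the junit.framework.ComparisonFailure traces quoted in the comments below
import static junit.framework.Assert.assertEquals;

import org.junit.Before;
import org.junit.Test;
import org.springframework.context.support.ReloadableResourceBundleMessageSource;

public class SingleQuoteTest {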
  private ReloadableResourceBundleMessageSource source;

  /**
   * initialize the bundle to make sure that we have a clean state
   */
  @Before
  public void init() {
    source = new ReloadableResourceBundleMessageSource();
    source.setBasename("test");
    source.setDefaultEncoding("UTF-8");
  }

  /**
   * no surprises here, just making sure the basic mechanism works
   */
  @Test
  public void testSimple() {
    String code = "simple";
    String expectedValue = "page";
    String actualValue = source.getMessage(code, null, null);
    assertEquals(expectedValue, actualValue);
  }

  /**
   * just making sure the basic mechanism works even with parameters
   */
  @Test
  public void testSimpleParam() {
    String code = "simple_param";
    String[] params = { "d'accueil" };
    String expectedValue = "page d'accueil";
    String actualValue = source.getMessage(code, params, null);
    assertEquals(expectedValue, actualValue);
  }

  /**
   * basic functional test on an unescaped string, with no params
   */
  @Test
  public void testQuotedNoParams() {
    String code = "quoted";
    String expectedValue = "page d'accueil";
    String actualValue = source.getMessage(code, null, null);
    assertEquals(expectedValue, actualValue);
  }

  /**
   * This will <b>fail</b> because we passed a parameter, even though it is
   * not used
   * 
   * junit.framework.ComparisonFailure: null expected:<page d[']accueil> but
   * was:<page d[]accueil>
   * 
   */
  @Test
  public void testQuotedUselessParams() {
    String code = "quoted";
    String[] params = { "d'accueil" };
    String expectedValue = "page d'accueil";
    String actualValue = source.getMessage(code, params, null);
    assertEquals(expectedValue, actualValue);
  }

  /**
   * This will <b>fail</b> by returning two single quotes, because we did not
   * pass any parameters
   * 
   * junit.framework.ComparisonFailure: null expected:<page d'[]accueil> but
   * was:<page d'[']accueil>
   */
  @Test
  public void testQuotedEscapedNoParams() {
    String code = "quoted_escaped_noparam";
    String expectedValue = "page d'accueil";
    String actualValue = source.getMessage(code, null, null);
    assertEquals(expectedValue, actualValue);
  }

  /**
   * apparently if you enable the 'alwaysUseMessageFormat' flag, this will
   * always get escaped
   * 
   */
  @Test
  public void testQuotedEscapedNoParamsAlwaysFormatOn() {
    source.setAlwaysUseMessageFormat(true);
    String code = "quoted_escaped_noparam";
    String expectedValue = "page d'accueil";
    String actualValue = source.getMessage(code, null, null);
    assertEquals(expectedValue, actualValue);
  }

  /**
   * <b>fail</b> again
   * 
   * Having params and un-escaped single quotes makes for a nasty output
   * 
   * junit.framework.ComparisonFailure: null expected:<page d['accueil est
   * ici]> but was:<page d[accueil est {0}]>
   * 
   */
  @Test
  public void testQuotedUnescapedParams() {
    String code = "quoted_unescaped_param";
    String[] params = { "ici" };
    String expectedValue = "page d'accueil est ici";
    String actualValue = source.getMessage(code, params, null);
    assertEquals(expectedValue, actualValue);
  }

  /**
   * the way to go: params and escaped single quotes
   */
  @Test
  public void testQuotedEscapedParams() {
    String code = "quoted_escaped_param";
    String[] params = { "ici" };
    String expectedValue = "page d'accueil est ici";
    String actualValue = source.getMessage(code, params, null);
    assertEquals(expectedValue, actualValue);
  }
}


If you remember the initial starting point of this shenanigan, my web page had one message that received an unused parameter, that was causing the escaping problem, thus all the weirdness.
As you can see, this was meant to document a strange (at first sight) behaviour, so that maybe future-me will save half a day of digging through the internet.

I really hope you too learned something today!



Download the test project: click here (the link to archive uploaded in google docs)