Thursday, July 8, 2010

Haversine formula implemented in Pig

Today I needed to compute the distance between two geo points using Apache Pig. A simple Google search took me to this blog post: Calculate distance, bearing and more between Latitude/Longitude points, using the Haversine formula. Exactly what I needed!

Moreover, it seems that Pig provides bindings for most (maybe all?) of the java.lang.Math functions that I need to compute this distance. You only need to reference the piggybank library in your pig scripts.

To test the computation, I'm going to use a sample file containing two geo points (id, latitude, longitude), and I'm going to CROSS it with itself, so that we keep this example to a minimum.

The 2 points I'm going to use are 2 well known locations, taken from Wikipedia, so that we can easily test the distance by comparing to outside references.
First is the Eiffel Tower having coordinates: (48.8583°N 2.2945°E) or (48.8583, 2.2945), you can check that here too.
Second is the Arc de Triomphe, coordinates: (48° 52′ 25.68″ N, 2° 17′ 42″ E) or (48.8738, 2.295).
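
The decimal values above come from the usual degrees/minutes/seconds arithmetic. Here is a minimal Java sketch of that conversion; the class and method names are made up for illustration, and it only handles N/E coordinates (no sign handling for S/W).

```java
// Illustrative helper; the class and method names are not from any library.
final class DmsToDecimal {

    /** Converts degrees, minutes and seconds to decimal degrees (N/E only). */
    static double toDecimal(int degrees, int minutes, double seconds) {
        return degrees + minutes / 60.0 + seconds / 3600.0;
    }

    public static void main(String[] args) {
        // Arc de Triomphe: 48 deg 52' 25.68" N, 2 deg 17' 42" E
        double lat = toDecimal(48, 52, 25.68);
        double lon = toDecimal(2, 17, 42.0);
        System.out.println(lat + ", " + lon);
    }
}
```

Rounded to four decimals, this should agree with the (48.8738, 2.295) pair used in the sample file.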

Using the script from the blog post cited above, we get a distance of 1.724 km. Let's see what that would look like in Pig.


REGISTER /home/user/pig-0.7.0/contrib/piggybank/java/piggybank.jar;

define radians org.apache.pig.piggybank.evaluation.math.toRadians();
define sin org.apache.pig.piggybank.evaluation.math.SIN(); 
define cos org.apache.pig.piggybank.evaluation.math.COS();
define sqrt org.apache.pig.piggybank.evaluation.math.SQRT();
define atan2 org.apache.pig.piggybank.evaluation.math.ATAN2();

geo = load 'haversine.csv' using PigStorage(';') as (id1: long, lat1: double, lon1: double);
geo2 = load 'haversine.csv' using PigStorage(';') as (id2: long, lat2: double, lon2: double);

geoCross = CROSS geo, geo2;

geoDist = FOREACH geoCross GENERATE id1, id2, 6371 * 2 * atan2(sqrt(sin(radians(lat2 - lat1) / 2) * sin(radians(lat2 - lat1) / 2) + cos(radians(lat1)) * cos(radians(lat2)) * sin(radians(lon2 - lon1) / 2) * sin(radians(lon2 - lon1) / 2)), sqrt(1 - (sin(radians(lat2 - lat1) / 2) * sin(radians(lat2 - lat1) / 2) + cos(radians(lat1)) * cos(radians(lat2)) * sin(radians(lon2 - lon1) / 2) * sin(radians(lon2 - lon1) / 2)))) as dist; 

dump geoDist;


Just for context, I am using Pig 0.7 to run this script.
As you can see, I have to include the piggybank jar so I can use the Math bindings. Next, I define some shortcuts to keep the code readable ('radians' is actually org.apache.pig.piggybank.evaluation.math.toRadians(), 'sin' is actually org.apache.pig.piggybank.evaluation.math.SIN(), and so on).

As I've previously said, I'm loading the same file twice, just to keep things simple.
This is the sample file (haversine.csv):



1;48.8583;2.2945
2;48.8738;2.295 


The actual haversine formula is defined here:



6371 * 2 * atan2(sqrt(sin(radians(lat2 - lat1) / 2) * 
sin(radians(lat2 - lat1) / 2) + cos(radians(lat1)) * 
cos(radians(lat2)) * sin(radians(lon2 - lon1) / 2) * 
sin(radians(lon2 - lon1) / 2)), sqrt(1 - (sin(radians(lat2 - lat1) / 2) * 
sin(radians(lat2 - lat1) / 2) + cos(radians(lat1)) * 
cos(radians(lat2)) * sin(radians(lon2 - lon1) / 2) * 
sin(radians(lon2 - lon1) / 2))))


'6371' is the earth's radius in kilometres so this will output the distance in kilometres. If you want miles, you should replace '6371' with '3958.7'.
Unfortunately I had to make it a one-liner, so this might get in the way of the actual readability of the script.
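
For cross-checking the numbers outside Pig, the same formula can be sketched in plain Java, where the repeated sub-expression gets a name. This is only a sanity-check sketch: the class, method and constant names are my own, and the two radii are the ones mentioned above.

```java
// Plain-Java cross-check of the formula; names here are illustrative only.
final class Haversine {
    static final double EARTH_RADIUS_KM = 6371.0;
    static final double EARTH_RADIUS_MILES = 3958.7;

    /** Great-circle distance between two points, in the unit of the given radius. */
    static double distance(double lat1, double lon1, double lat2, double lon2, double radius) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        // 'a' is the sub-expression that the Pig one-liner has to spell out twice.
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return radius * 2 * Math.atan2(Math.sqrt(a), Math.sqrt(1 - a));
    }

    public static void main(String[] args) {
        // Eiffel Tower -> Arc de Triomphe, first in km, then in miles
        System.out.println(distance(48.8583, 2.2945, 48.8738, 2.295, EARTH_RADIUS_KM));
        System.out.println(distance(48.8583, 2.2945, 48.8738, 2.295, EARTH_RADIUS_MILES));
    }
}
```

With the two sample points this should give about 1.724 for kilometres and roughly 1.07 for miles.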

I know you are curious to see the output:



(2L,1L,1.7239093620868347)
(2L,2L,0.0)
(1L,1L,0.0)
(1L,2L,1.7239093620868347)


The distance is 1.7239093620868347 km, which is basically the same (1.724km). Pig FTW!

Pig has turned out, once again, to be a real lifesaver.
I wonder if this could be a useful addition to piggybank.

Friday, July 2, 2010

The infamous disappearing single quote in property files problem

Recently I had some problems with the internationalization support in an application. Apparently some single quotes kept disappearing. Not all of them though... The faulty web page even had a mix of good and bad behaviour.
Let the digging begin...

I came across this using Spring's ReloadableResourceBundleMessageSource. But apparently this comes directly from Java's MessageFormat class.

The Docs

It seems that if you need to add proper support for internationalization to your application via property files, you have to be really careful with the specs. They are not at all obvious.

In fact, just by skimming the docs, you get to this point: "Warning: The rules for using quotes within message format patterns unfortunately have shown to be somewhat confusing. In particular, it isn't always obvious to localizers whether single quotes need to be doubled or not."

OK, fair warning. What about the Spring docs? There is nothing on ReloadableResourceBundleMessageSource, but if we go deeper, the issue is mentioned on AbstractMessageSource, the parent class of our reloadable bundle of joy: "Note: By default, message texts are only parsed through MessageFormat if arguments have been passed in for the message. In case of no arguments, message texts will be returned as-is. As a consequence, you should only use MessageFormat escaping for messages with actual arguments, and keep all other messages un-escaped. If you prefer to escape all messages, set the "alwaysUseMessageFormat" flag to "true"."

"...ok, whatever..." I said to myself the first time I read this, and like any other good developer, I forgot I ever read it and moved on.

Back to our problem: the application was behaving inconsistently on apparently similar use cases. Then it hit me... parameters!! If a message is resolved with parameters, every single quote in it has to be escaped by another single quote; if it is resolved without parameters, the quote must not be escaped - ...you said what now?
Moreover, if you do not escape single quotes in strings that have params, the parameter injection will not work, so it's even more crappy.
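
The quote rule can be seen with plain java.text.MessageFormat, no Spring involved. A minimal sketch (the QuoteDemo class and its format helper are made up for illustration; MessageFormat.format is the real JDK call):

```java
import java.text.MessageFormat;

// Illustrative wrapper around the JDK's MessageFormat; the class name is hypothetical.
final class QuoteDemo {

    static String format(String pattern, Object... args) {
        return MessageFormat.format(pattern, args);
    }

    public static void main(String[] args) {
        // A lone quote opens a quoted section: the quote vanishes and {0} is left as literal text.
        System.out.println(format("page d'accueil est {0}", "ici"));  // page daccueil est {0}
        // A doubled quote yields one literal quote, and {0} is substituted.
        System.out.println(format("page d''accueil est {0}", "ici")); // page d'accueil est ici
    }
}
```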

As it later occurred to me while reading the Spring docs, if you enable the "alwaysUseMessageFormat" flag, every message is run through MessageFormat, so you can just default to escaping every single quote and be done with it.
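
To make the rule from the Spring docs concrete, here is a simplified stand-in written against the JDK only. It is not Spring's code: the class, the resolve method and the escapeQuotes helper are assumptions of mine that just mirror the documented behaviour (format only when arguments are present, unless the flag forces it).

```java
import java.text.MessageFormat;

// Simplified stand-in for the rule quoted from the Spring docs.
// This is NOT Spring's code; class and method names are made up for illustration.
final class MessageRuleSketch {

    /** Doubles every single quote so the text survives MessageFormat parsing. */
    static String escapeQuotes(String text) {
        return text.replace("'", "''");
    }

    /** Formats only when arguments are present or the flag forces it. */
    static String resolve(String text, Object[] args, boolean alwaysUseMessageFormat) {
        boolean noArgs = (args == null || args.length == 0);
        if (noArgs && !alwaysUseMessageFormat) {
            return text; // returned as-is: doubled quotes stay doubled
        }
        return MessageFormat.format(text, noArgs ? new Object[0] : args);
    }

    public static void main(String[] unused) {
        String escaped = escapeQuotes("page d'accueil");   // page d''accueil
        System.out.println(resolve(escaped, null, false)); // page d''accueil
        System.out.println(resolve(escaped, null, true));  // page d'accueil
        // An unused argument is enough to trigger parsing, and the lone quote is eaten.
        System.out.println(resolve("page d'accueil", new Object[] {"unused"}, false)); // page daccueil
    }
}
```

With the flag on, escaping every quote is safe whether or not a message takes arguments, which is the "escape everything" default described above.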

The Tests

It was still not 100% clear to me, so I decided to write some unit tests, covering these cases, just to be sure.

And now for the coding part of the post: I'm going to go through each test and briefly explain what is going on.

A little context, before beginning: I'm using JUnit 4, and Spring's ReloadableResourceBundleMessageSource, bundled together with a small test property file. I'll also attach the entire project at the end, for your testing pleasure.

The property file:
simple=page
simple_param=page {0}
quoted=page d'accueil
quoted_escaped_noparam=page d''accueil
quoted_unescaped_param=page d'accueil est {0}
quoted_escaped_param=page d''accueil est {0}

The test class; hopefully the comments I added to each method make the code easier to understand:


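// junit.framework.Assert is assumed here because the failure messages quoted
// in the comments below are junit.framework.ComparisonFailure
import static junit.framework.Assert.assertEquals;

import org.junit.Before;
import org.junit.Test;
import org.springframework.context.support.ReloadableResourceBundleMessageSource;

public class SingleQuoteTest {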
  private ReloadableResourceBundleMessageSource source;

  /**
   * initialize the bundle to make sure that we have a clean state
   */
  @Before
  public void init() {
    source = new ReloadableResourceBundleMessageSource();
    source.setBasename("test");
    source.setDefaultEncoding("UTF-8");
  }

  /**
   * no surprises here, just making sure the basic mechanism works
   */
  @Test
  public void testSimple() {
    String code = "simple";
    String expectedValue = "page";
    String actualValue = source.getMessage(code, null, null);
    assertEquals(expectedValue, actualValue);
  }

  /**
   * just making sure the basic mechanism works even with parameters
   */
  @Test
  public void testSimpleParam() {
    String code = "simple_param";
    String[] params = { "d'accueil" };
    String expectedValue = "page d'accueil";
    String actualValue = source.getMessage(code, params, null);
    assertEquals(expectedValue, actualValue);
  }

  /**
   * basic functional test on an unescaped string, with no params
   */
  @Test
  public void testQuotedNoParams() {
    String code = "quoted";
    String expectedValue = "page d'accueil";
    String actualValue = source.getMessage(code, null, null);
    assertEquals(expectedValue, actualValue);
  }

  /**
   * This will <b>fail</b> because we passed a parameter, even though it is
   * not used
   * 
   * junit.framework.ComparisonFailure: null expected:<page d[']accueil> but
   * was:<page d[]accueil>
   * 
   */
  @Test
  public void testQuotedUselessParams() {
    String code = "quoted";
    String[] params = { "d'accueil" };
    String expectedValue = "page d'accueil";
    String actualValue = source.getMessage(code, params, null);
    assertEquals(expectedValue, actualValue);
  }

  /**
   * This will <b>fail</b> by returning two single quotes, because we did not
   * pass any parameters
   * 
   * junit.framework.ComparisonFailure: null expected:<page d'[]accueil> but
   * was:<page d'[']accueil>
   */
  @Test
  public void testQuotedEscapedNoParams() {
    String code = "quoted_escaped_noparam";
    String expectedValue = "page d'accueil";
    String actualValue = source.getMessage(code, null, null);
    assertEquals(expectedValue, actualValue);
  }

  /**
   * apparently if you enable the 'alwaysUseMessageFormat' flag, the message is
   * always parsed by MessageFormat, so the doubled quote becomes a single one
   * 
   */
  @Test
  public void testQuotedEscapedNoParamsAlwaysFormatOn() {
    source.setAlwaysUseMessageFormat(true);
    String code = "quoted_escaped_noparam";
    String expectedValue = "page d'accueil";
    String actualValue = source.getMessage(code, null, null);
    assertEquals(expectedValue, actualValue);
  }

  /**
   * <b>fail</b> again
   * 
   * Having params and un-escaped single quotes makes for a nasty output
   * 
   * junit.framework.ComparisonFailure: null expected:<page d['accueil est
   * ici]> but was:<page d[accueil est {0}]>
   * 
   */
  @Test
  public void testQuotedUnescapedParams() {
    String code = "quoted_unescaped_param";
    String[] params = { "ici" };
    String expectedValue = "page d'accueil est ici";
    String actualValue = source.getMessage(code, params, null);
    assertEquals(expectedValue, actualValue);
  }

  /**
   * the way to go: params and escaped single quotes
   */
  @Test
  public void testQuotedEscapedParams() {
    String code = "quoted_escaped_param";
    String[] params = { "ici" };
    String expectedValue = "page d'accueil est ici";
    String actualValue = source.getMessage(code, params, null);
    assertEquals(expectedValue, actualValue);
  }
}


If you remember the starting point of this whole shenanigan, my web page had one message that received an unused parameter. That parameter triggered the MessageFormat parsing and caused the escaping problem, hence all the weirdness.
As you can see, this post is meant to document a behaviour that looks strange at first sight, so that maybe future-me will be saved half a day of digging through the internet.

I really hope you too learned something today!



Download the test project: click here (the link to archive uploaded in google docs)