Moreover, Pig seems to provide bindings for most (maybe all?) of the java.lang.Math functions that I need to compute this distance. You only need to register the piggybank library in your Pig scripts.
To test the computation, I'm going to use a sample file containing two geo-points (id, latitude, longitude), and I'm going to CROSS it with itself, so that this example stays minimal.
The two points are well-known locations, with coordinates taken from Wikipedia, so that we can easily check the distance against outside references.
The first is the Eiffel Tower, with coordinates (48.8583°N 2.2945°E), or (48.8583, 2.2945) in decimal degrees; you can check that here too.
The second is the Arc de Triomphe, with coordinates (48° 52′ 25.68″ N, 2° 17′ 42″ E), or (48.8738, 2.295) in decimal degrees.
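For anyone wondering how the degrees/minutes/seconds form turns into the decimal pair, here is a minimal sketch in Java. The class and method names are made up for illustration and are not part of Pig or piggybank, and the helper only handles northern and eastern coordinates like the two used here.

```java
// Hypothetical helper for illustration only; not part of Pig or piggybank.
final class DmsToDecimal {
    private DmsToDecimal() {}

    /**
     * Converts degrees, arc-minutes and arc-seconds to decimal degrees.
     * Assumes a northern latitude or eastern longitude (positive result);
     * southern or western values would need to be negated by the caller.
     */
    static double toDecimalDegrees(int degrees, int minutes, double seconds) {
        // One degree has 60 arc-minutes and 3600 arc-seconds.
        return degrees + minutes / 60.0 + seconds / 3600.0;
    }
}
```

Calling DmsToDecimal.toDecimalDegrees(48, 52, 25.68) and DmsToDecimal.toDecimalDegrees(2, 17, 42.0) should give back the Arc de Triomphe values used in the sample file, up to floating-point rounding.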
Using the script cited at the start, we get a distance of 1.724 km. Let's see what that looks like in Pig.
REGISTER /home/user/pig-0.7.0/contrib/piggybank/java/piggybank.jar;
define radians org.apache.pig.piggybank.evaluation.math.toRadians();
define sin org.apache.pig.piggybank.evaluation.math.SIN();
define cos org.apache.pig.piggybank.evaluation.math.COS();
define sqrt org.apache.pig.piggybank.evaluation.math.SQRT();
define atan2 org.apache.pig.piggybank.evaluation.math.ATAN2();
geo = load 'haversine.csv' using PigStorage(';') as (id1: long, lat1: double, lon1: double);
geo2 = load 'haversine.csv' using PigStorage(';') as (id2: long, lat2: double, lon2: double);
geoCross = CROSS geo, geo2;
geoDist = FOREACH geoCross GENERATE id1, id2, 6371 * 2 * atan2(sqrt(sin(radians(lat2 - lat1) / 2) * sin(radians(lat2 - lat1) / 2) + cos(radians(lat1)) * cos(radians(lat2)) * sin(radians(lon2 - lon1) / 2) * sin(radians(lon2 - lon1) / 2)), sqrt(1 - (sin(radians(lat2 - lat1) / 2) * sin(radians(lat2 - lat1) / 2) + cos(radians(lat1)) * cos(radians(lat2)) * sin(radians(lon2 - lon1) / 2) * sin(radians(lon2 - lon1) / 2)))) as dist;
dump geoDist;
For context, I am using Pig 0.7 to run this script.
As you can see, I have to register the piggybank jar so that I can use the Math bindings. Next, I define some shortcuts to keep the code readable ('radians' is actually org.apache.pig.piggybank.evaluation.math.toRadians(), 'sin' is actually org.apache.pig.piggybank.evaluation.math.SIN(), and so on).
As I said earlier, I'm loading the same file twice, just to keep things simple.
This is the sample file (haversine.csv):
1;48.8583;2.2945
2;48.8738;2.295
The actual haversine formula is defined here:
6371 * 2 * atan2(sqrt(sin(radians(lat2 - lat1) / 2) *
sin(radians(lat2 - lat1) / 2) + cos(radians(lat1)) *
cos(radians(lat2)) * sin(radians(lon2 - lon1) / 2) *
sin(radians(lon2 - lon1) / 2)), sqrt(1 - (sin(radians(lat2 - lat1) / 2) *
sin(radians(lat2 - lat1) / 2) + cos(radians(lat1)) *
cos(radians(lat2)) * sin(radians(lon2 - lon1) / 2) *
sin(radians(lon2 - lon1) / 2))))
'6371' is the Earth's mean radius in kilometres, so this outputs the distance in kilometres. If you want miles, replace '6371' with '3958.7'.
Unfortunately I had to make it a one-liner in the script, which hurts the readability somewhat.
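Since the one-liner is hard to read, here is the same expression in more conventional notation. The symbols are my own labels, not names from the script: \varphi_1, \varphi_2 are the two latitudes in radians, \Delta\varphi and \Delta\lambda are the latitude and longitude differences in radians, and R is the radius (6371 for kilometres).

```latex
\begin{aligned}
a &= \sin^2\!\left(\frac{\Delta\varphi}{2}\right)
     + \cos\varphi_1 \,\cos\varphi_2 \,\sin^2\!\left(\frac{\Delta\lambda}{2}\right) \\
d &= 2R \cdot \operatorname{atan2}\!\left(\sqrt{a},\ \sqrt{1 - a}\right)
\end{aligned}
```

In the Pig expression the term a is simply written out twice, once inside each sqrt, which is why the line gets so long.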
I know you are curious to see the output:
(2L,1L,1.7239093620868347)
(2L,2L,0.0)
(1L,1L,0.0)
(1L,2L,1.7239093620868347)
The distance is 1.7239093620868347 km, which rounds to the same 1.724 km we got from the reference script. Pig FTW!
Once again, Pig has turned out to be a real lifesaver.
I wonder if this could be a useful addition to piggybank.
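If it ever did go into piggybank, the core would be a few lines of plain java.lang.Math. Below is a rough sketch of just that core, with a made-up class name and method name. It is not an actual Pig UDF (a real one would have to extend Pig's EvalFunc and unpack its arguments from a Tuple), but it mirrors the Pig expression term by term and is handy for cross-checking the output outside Pig.

```java
// Hypothetical sketch for illustration; not an existing piggybank class.
final class Haversine {
    // Same radius constant as in the Pig script (kilometres).
    static final double EARTH_RADIUS_KM = 6371.0;

    private Haversine() {}

    /** Great-circle distance in km between two points given in decimal degrees. */
    static double distanceKm(double lat1, double lon1, double lat2, double lon2) {
        // Half-angle sines of the latitude and longitude differences.
        double sinHalfDLat = Math.sin(Math.toRadians(lat2 - lat1) / 2);
        double sinHalfDLon = Math.sin(Math.toRadians(lon2 - lon1) / 2);

        // The term that the Pig one-liner spells out twice.
        double a = sinHalfDLat * sinHalfDLat
                + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                * sinHalfDLon * sinHalfDLon;

        return EARTH_RADIUS_KM * 2 * Math.atan2(Math.sqrt(a), Math.sqrt(1 - a));
    }
}
```

Calling Haversine.distanceKm(48.8583, 2.2945, 48.8738, 2.295) should land on roughly the same 1.7239 km that the Pig script printed, and passing the same point twice should give 0.0.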