Tuesday, August 20, 2013

Working with data

An important insight from http://radar.oreilly.com/2012/07/data-jujitsu.html that I want to share with you:

 It’s impossible to overstress this: 80% of the work in any data project is in cleaning the data.

Monday, August 19, 2013

Sunday, August 18, 2013

How to serve files over HTTP from the shell

This blog published a short bash script for serving files over HTTP. I like the solution so much that I am copying it here (replace <filename> with the path of the file you want to serve):
while true; do { echo -e 'HTTP/1.1 200 OK\r\n'; cat <filename>; } | nc -l 8080; done
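The one-liner sends only a status line, which lenient clients accept. As a rough sketch of a slightly stricter variant, the function below builds the whole response itself, ends the header section with a proper CRLF blank line, and adds a Content-Length header so the client knows when the body is complete. The function name http_response is my own invention and not part of the original script.

```shell
# http_response FILE
# Print a minimal HTTP/1.1 response with FILE as the body.
# (http_response is a hypothetical helper name, not from the original post.)
http_response() {
  local file=$1
  local length
  # Byte count of the body; the arithmetic expansion strips any padding
  # that some wc implementations put in front of the number.
  length=$(($(wc -c < "$file")))
  # Status line, headers, then an empty CRLF line that ends the header section.
  printf 'HTTP/1.1 200 OK\r\nContent-Length: %d\r\nConnection: close\r\n\r\n' "$length"
  cat "$file"
}
```

It can be plugged into the same loop as before: while true; do http_response <filename> | nc -l 8080; done. Some netcat variants expect the port as nc -l -p 8080 instead.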

Tuesday, August 6, 2013

How to enable Snappy compression using Cascading

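The following code enables Snappy compression for the output of a Cascading flow:
import java.util.Properties;
import cascading.flow.FlowConnector;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.property.AppProps;

Properties properties = new Properties();
// tell Cascading which jar to ship by pointing it at the application's main class
AppProps.setApplicationJarClass(properties, Main.class);

// compress the MapReduce job output
properties.put("mapred.output.compress", "true");

// use the Snappy codec for that compression
properties.put("mapred.output.compression.codec", "org.apache.hadoop.io.compress.SnappyCodec");

// every flow created by this connector inherits the properties above
FlowConnector flowConnector = new HadoopFlowConnector(properties);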