Tuesday, August 20, 2013
Working with data
A very important insight from http://radar.oreilly.com/2012/07/data-jujitsu.html I want to share with you:
"It’s impossible to overstress this: 80% of the work in any data project is in cleaning the data."
Monday, August 19, 2013
How to check the integrity of a bunch of gzip files
gunzip -t *.gz will report broken gzip archives without unpacking them.
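If you want an explicit per-file verdict and an exit status you can use in a script, the same check can be wrapped in a small bash function. This is only a sketch: the function name check_gzip_files and the OK/BROKEN labels are my own invention, and the only thing it relies on is the exit status of gzip -t.

```shell
# Hypothetical helper: test each archive given as an argument with gzip -t.
# Prints one line per file and returns 1 if at least one archive is broken.
check_gzip_files() {
  local failed=0
  local f
  for f in "$@"; do
    # gzip -t exits non-zero when the archive fails its integrity test;
    # its own error message is discarded here.
    if gzip -t "$f" 2>/dev/null; then
      echo "OK      $f"
    else
      echo "BROKEN  $f"
      failed=1
    fi
  done
  return "$failed"
}
```

Calling it as check_gzip_files *.gz lists every archive with its status, and the return code makes it easy to abort a larger script when something is damaged.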
Sunday, August 18, 2013
How to serve files over HTTP from the shell
This blog published a short bash script for serving files over HTTP. I like the solution so much that I am copying it here:
while true; do { echo -e 'HTTP/1.1 200 OK\r\n'; cat <filename>; } | nc -l 8080; done
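The one-liner sends only a status line, so a client learns that the body is complete only when the connection closes. A slightly longer variant can add a Content-Length header so the client knows the size up front. This is a sketch under my own assumptions: the function name http_response is made up, and the choice of headers is mine, not part of the original script.

```shell
# Hypothetical helper: print a minimal HTTP response for a single file.
# Emits the status line, two headers, a blank line, and then the file body.
http_response() {
  local file=$1
  local size
  # wc -c counts bytes; tr strips the padding some wc implementations add.
  size=$(wc -c < "$file" | tr -d ' ')
  printf 'HTTP/1.1 200 OK\r\n'
  printf 'Content-Length: %s\r\n' "$size"
  printf 'Connection: close\r\n'
  printf '\r\n'
  cat "$file"
}
```

It plugs into the same loop as before: while true; do http_response <filename> | nc -l 8080; done. Note that some netcat variants expect the port as nc -l -p 8080 instead.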
Tuesday, August 6, 2013
How to enable Snappy compression using Cascading
The following code enables Snappy compression for the output of a Cascading flow:
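import java.util.Properties;
import cascading.flow.FlowConnector;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.property.AppProps;

Properties properties = new Properties();
// tell Cascading which jar to ship by pointing it at the main class
AppProps.setApplicationJarClass(properties, Main.class);
// compress the MapReduce job output
properties.put("mapred.output.compress", "true");
// use the Snappy codec for that output
properties.put("mapred.output.compression.codec", "org.apache.hadoop.io.compress.SnappyCodec");
// every flow created by this connector inherits the properties above
FlowConnector flowConnector = new HadoopFlowConnector(properties);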