Tuesday, August 20, 2013
Working with data
A very important insight from http://radar.oreilly.com/2012/07/data-jujitsu.html that I want to share with you:
"It’s impossible to overstress this: 80% of the work in any data project is in cleaning the data."
Monday, August 19, 2013
How to check the integrity of a bunch of gzip files
gunzip -t *.gz will report broken gzip archives without unpacking them.
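If you only want the names of the broken archives, for example to re-download them, a small wrapper around the same test can help. This is a minimal sketch: the function name check_gzip_files and the "BROKEN:" prefix are my own choices for illustration, not part of gzip.

```shell
# Hypothetical helper: test each archive and print only the broken ones.
# Returns 0 if every archive is intact, 1 if at least one is broken.
check_gzip_files() {
  status=0
  for f in "$@"; do
    # gzip -t verifies the archive without writing any output files
    if ! gzip -t "$f" 2>/dev/null; then
      echo "BROKEN: $f"
      status=1
    fi
  done
  return $status
}
```

Call it as check_gzip_files *.gz; the exit status makes it easy to use in other scripts.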
Sunday, August 18, 2013
How to serve files over HTTP from the shell
This blog published a short bash script for serving files over HTTP. I like the solution so much that I am copying it here:
while true; do { echo -e 'HTTP/1.1 200 OK\r\n'; cat <filename>; } | nc -l 8080; done
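Some clients are happier when the response carries a Content-Length header. The following variant is a sketch of how that could look, not part of the original script: the function name http_response is hypothetical, and the headers it sends are my own minimal choice.

```shell
# Hypothetical helper: print a minimal HTTP response for one file.
http_response() {
  file=$1
  # Arithmetic expansion strips the padding some wc implementations add
  size=$(($(wc -c < "$file")))
  # Status line, headers, then the blank line that ends the header block
  printf 'HTTP/1.1 200 OK\r\nContent-Length: %s\r\nConnection: close\r\n\r\n' "$size"
  cat "$file"
}
```

It plugs into the same loop: while true; do http_response <filename> | nc -l 8080; done. Note that netcat flags differ between variants; traditional netcat expects nc -l -p 8080.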
Tuesday, August 6, 2013
How to enable Snappy compression using Cascading
The following code enables Snappy compression for the output of a Cascading flow:
import java.util.Properties;
import cascading.flow.FlowConnector;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.property.AppProps;

Properties properties = new Properties();
// set path to main class
AppProps.setApplicationJarClass(properties, Main.class);
// compress mapreduce output
properties.put("mapred.output.compress", "true");
// set compression codec
properties.put("mapred.output.compression.codec", "org.apache.hadoop.io.compress.SnappyCodec");
FlowConnector flowConnector = new HadoopFlowConnector(properties);