Monday, December 16, 2013

How to increase Xmx for the hadoop client applications

Sometimes you need more memory for hadoop client tools like hive, beeline or pig.
export HADOOP_CLIENT_OPTS=-Xmx2G
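If you only need the larger heap for a single invocation, you can also set the variable just for that command (my_query.hql is only a placeholder name):

# 2 GB client heap only for this one hive run
HADOOP_CLIENT_OPTS="-Xmx2G" hive -f my_query.hql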

Thursday, December 12, 2013

Find out the total size of directories with the 'du' command

To get a list of folder sizes in megabytes, sorted ascending, run:
du -sm * | sort -n
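If your du and sort support human-readable sizes (GNU coreutils do), the same idea works with -h:

du -sh * | sort -h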

How to access the web interface of a remote hadoop cluster over a SOCKS proxy

You want to use the web interface of a hadoop cluster but you only have ssh access to it? SOCKS is the solution:

Use ssh to open a SOCKS proxy:

ssh -f -N -D 7070 user@remote-host
The flags: -f sends ssh to the background after authentication, -N skips running a remote command, and -D 7070 opens a local SOCKS proxy on port 7070.

After that you can configure firefox to use this proxy:

  1. Go to the manual proxy settings and add localhost:7070 as SOCKS v5 host
  2. Go to about:config and set network.proxy.socks_remote_dns to true to use DNS resolution over the proxy (thanks to Aczire for this!).
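If you prefer to keep this configuration in a user.js file instead of clicking through the settings, the following prefs should be equivalent (a sketch, using the standard firefox pref names):

// manual proxy configuration with a SOCKS v5 proxy on localhost:7070
user_pref("network.proxy.type", 1);
user_pref("network.proxy.socks", "localhost");
user_pref("network.proxy.socks_port", 7070);
user_pref("network.proxy.socks_version", 5);
// resolve DNS names over the proxy
user_pref("network.proxy.socks_remote_dns", true);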
That's all!

Wednesday, December 11, 2013

How to rsync a folder over ssh

Another short shell snippet:

rsync -avze ssh source user@host:target-folder
The flags: -a enables archive mode (recursive, preserving permissions and timestamps), -v is verbose, -z compresses data during transfer, and -e ssh uses ssh as the transport.
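To preview what would be transferred before touching the target, add -n for a dry run:

# -n performs a trial run without making any changes
rsync -avzn -e ssh source user@host:target-folder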

Monday, October 28, 2013

How to write unit tests for your Hadoop MapReduce jobs

Simple answer: use MRUnit.

To include it in your maven project you need to specify a classifier matching your hadoop version:

  <dependency>
   <groupId>org.apache.mrunit</groupId>
   <artifactId>mrunit</artifactId>
   <version>1.0.0</version>
   <classifier>hadoop2</classifier>
   <scope>test</scope>
  </dependency>
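A minimal sketch of a mapper test with MRUnit. WordCountMapper is a hypothetical mapper under test that emits (word, 1) for every word of the input line:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.junit.Before;
import org.junit.Test;

public class WordCountMapperTest {

  private MapDriver<LongWritable, Text, Text, IntWritable> mapDriver;

  @Before
  public void setUp() {
    // WordCountMapper is the (hypothetical) mapper under test
    mapDriver = MapDriver.newMapDriver(new WordCountMapper());
  }

  @Test
  public void emitsOneForEachWord() throws Exception {
    mapDriver
      .withInput(new LongWritable(0), new Text("hello hadoop"))
      .withOutput(new Text("hello"), new IntWritable(1))
      .withOutput(new Text("hadoop"), new IntWritable(1))
      .runTest();
  }
}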

Time based tests with joda time

In unit tests it is helpful to set a fixed date and time. With joda time you can do this by simply calling
DateTimeUtils.setCurrentMillisFixed(new DateTime(2013, 3, 2, 0, 0).getMillis());
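A small sketch of how this can look in a junit test, including the reset so later tests see the real clock again:

import static org.junit.Assert.assertEquals;

import org.joda.time.DateTime;
import org.joda.time.DateTimeUtils;
import org.junit.After;
import org.junit.Test;

public class FixedTimeTest {

  @After
  public void tearDown() {
    // switch back to the real system clock so other tests are not affected
    DateTimeUtils.setCurrentMillisSystem();
  }

  @Test
  public void currentTimeIsFixed() {
    DateTimeUtils.setCurrentMillisFixed(new DateTime(2013, 3, 2, 0, 0).getMillis());
    // DateTime.now() returns the fixed instant
    assertEquals(new DateTime(2013, 3, 2, 0, 0), DateTime.now());
  }
}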

Tuesday, October 8, 2013

How to launch eclipse with a specific jdk on mac os

Edit /Eclipse.app/Contents/MacOS/eclipse.ini. Add the following before -vmargs:
-vm 
/path/to/java/home
The linebreak after -vm matters! To find out the path to java home run
/usr/libexec/java_home
or 
/usr/libexec/java_home -v 1.7
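The relevant part of eclipse.ini then looks roughly like this; the java home path is just an example, use the output of java_home on your machine, and keep your existing arguments below -vmargs:

-vm
/Library/Java/JavaVirtualMachines/jdk1.7.0.jdk/Contents/Home
-vmargs
-Xmx512m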

Tuesday, October 1, 2013

How to install the sun 6 / oracle 7 jdk on ubuntu

https://github.com/flexiondotorg/oab-java6 provides a script for building a java package for sun jdk 6 and oracle jdk 7 for ubuntu.
wget https://github.com/flexiondotorg/oab-java6/raw/0.3.0/oab-java.sh -O oab-java.sh
chmod +x oab-java.sh
sudo ./oab-java.sh

sudo apt-get install sun-java6-jdk

Friday, September 27, 2013

How to cache your git credentials

The following command tells git to cache your password for 5 minutes:
git config --global credential.helper 'cache --timeout=300'
See also http://git-scm.com/docs/git-credential-cache.
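The timeout is given in seconds, so caching for an hour looks like this:

git config --global credential.helper 'cache --timeout=3600'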


The following steps for OS X are taken from https://confluence.atlassian.com/display/STASH/Permanently+authenticating+with+Git+repositories.

OS X

Follow these steps to use Git with credential caching on OS X (a combined snippet follows the list):
  1. Download the binary git-credential-osxkeychain.
  2. Run the command below to ensure the binary is executable:
    chmod a+x git-credential-osxkeychain
  3. Put it in the directory /usr/local/bin.
  4. Run the command below:
    git config --global credential.helper osxkeychain
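Put together, and assuming you have already downloaded git-credential-osxkeychain into the current directory, the steps look like this:

# make the downloaded helper executable
chmod a+x git-credential-osxkeychain
# move it to a directory on the PATH (may require sudo)
sudo mv git-credential-osxkeychain /usr/local/bin/
# tell git to use the keychain helper
git config --global credential.helper osxkeychain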

Monday, September 2, 2013

Use avro-tools to inspect avro files in hdfs

Since version 1.7.5 avro-tools includes support for reading files directly from hdfs. To dump an avro file in hdfs as json use the following:
hadoop jar avro-tools-1.7.5.jar tojson hdfs://<hostname>:/path/to/file.avro | less
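The same mechanism works for the other avro-tools commands, for example getschema to print only the schema of the file:

hadoop jar avro-tools-1.7.5.jar getschema hdfs://<hostname>:/path/to/file.avro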

Tuesday, August 20, 2013

Working with data

A very important insight from http://radar.oreilly.com/2012/07/data-jujitsu.html I want to share with you:

 It’s impossible to overstress this: 80% of the work in any data project is in cleaning the data.

Sunday, August 18, 2013

How to serve files over http from the shell

Another blog published a short bash one-liner for serving a file over http. Because I like this solution so much, I am copying it here:
while true; do { echo -e 'HTTP/1.1 200 OK\r\n'; cat <filename>; } | nc -l 8080; done
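If python happens to be installed, a one-liner that serves the whole current directory is an alternative:

# python 2; with python 3 use: python3 -m http.server 8080
python -m SimpleHTTPServer 8080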

Tuesday, August 6, 2013

How to enable snappy compression using cascading

The following code enables snappy compression for the output of a cascading flow:
// imports, assuming the Cascading 2.x package layout
import java.util.Properties;

import cascading.flow.FlowConnector;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.property.AppProps;

Properties properties = new Properties();
// tell cascading which class identifies the application jar
AppProps.setApplicationJarClass(properties, Main.class);

// compress mapreduce output
properties.put("mapred.output.compress", "true");

// set compression codec
properties.put("mapred.output.compression.codec", "org.apache.hadoop.io.compress.SnappyCodec");

FlowConnector flowConnector = new HadoopFlowConnector(properties);

Thursday, June 27, 2013

How to create an executable jar file with maven

Executable jars may be created using the maven shade plugin. The shade plugin allows for the creation of 'uber-jars': self-contained jar files which contain the application code as well as all dependencies.
The shade plugin is configured in the build-section of your pom.xml:
  <build>
    <plugins>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-shade-plugin</artifactId>
        <version>2.1</version>
        <configuration>
          <!-- put your configurations here -->
        </configuration>
        <executions>
          <execution>
            <phase>package</phase>
            <goals>
              <goal>shade</goal>
            </goals>
          </execution>
        </executions>
      </plugin>
    </plugins>
  </build>

To make the jar file executable, it has to contain a MANIFEST.MF that names the class containing the main method. Just add the following lines to the plugin config:

<configuration>
  <transformers>
    <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
      <mainClass>path.to.MainClass</mainClass>
    </transformer>
  </transformers>
</configuration>
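After that a regular build produces the executable jar in the target directory (the jar name below depends on your artifactId and version):

mvn clean package
java -jar target/my-app-1.0-SNAPSHOT.jar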

See also: http://maven.apache.org/plugins/maven-shade-plugin/examples/executable-jar.html

Big Data Principles - Part I

Your Input Data Is Immutable

Data storage systems are used to store information. This information may change in the course of time. You have two basic options to reflect these changes in your data storage systems:
  • Update the information
  • Store the new information in addition to the existing information
Consider the following example:

In a social network a user has a list of followers. This list may be modified by two events:

  • follow event - A new follower is added to the list of followers
  • unfollow event - A follower chooses to unfollow the user.
One possibility to store this information is to always store and update a list of current followers for each user. Each time a new follower is added or removed you update this list in your storage system.

The second possibility is to store all follow and unfollow events. The current list of followers for a user is derived from this information.

  • Get the current list of followers for the user 'arthur'.
    Solution one: Read the stored list of current followers.
    Solution two: Derive the list of current followers from the sequence of follow and unfollow events.
  • Get the number of followers for the user 'arthur'.
    Solution one: Compute the length of the stored list of current followers.
    Solution two: Compute the length of the derived list of current followers.
  • Get all users that have been following the user 'arthur' two years ago.
    Solution one: Not possible.
    Solution two: Derive the list of followers at that point in time from the sequence of follow and unfollow events.
  • Get a list of all users the user 'ford' has been following on 2000-01-01.
    Solution one: Not possible.
    Solution two: Derive the list of followed users at that date from the sequence of follow and unfollow events.

As you can see, solution one answers the questions you had in mind while implementing it in a simple and efficient way. But there are many questions you cannot answer with this data model.

Solution two requires much more storage and additional computation to answer even simple questions. This is a high price to pay, but you get a great reward: since you never lose information through updates, you can answer questions that only arise later in time, questions you did not even think of when you were implementing your application.

Solution two provides another big advantage: since you never update your raw data, the danger of data corruption due to an application error is much lower!
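A minimal sketch of solution two in java: the raw data is an append-only list of follow and unfollow events, and the follower list for any point in time is derived from it. All class and field names here are made up for illustration:

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class FollowerEvents {

    enum Type { FOLLOW, UNFOLLOW }

    // one immutable entry of the event log
    static class Event {
        final Type type;
        final String follower;
        final String followee;
        final long timestamp;

        Event(Type type, String follower, String followee, long timestamp) {
            this.type = type;
            this.follower = follower;
            this.followee = followee;
            this.timestamp = timestamp;
        }
    }

    // derive the followers of 'user' as of 'asOf' by replaying the event log
    static Set<String> followersAt(List<Event> events, String user, long asOf) {
        Set<String> followers = new HashSet<>();
        for (Event e : events) {
            if (e.timestamp > asOf || !e.followee.equals(user)) {
                continue;
            }
            if (e.type == Type.FOLLOW) {
                followers.add(e.follower);
            } else {
                followers.remove(e.follower);
            }
        }
        return followers;
    }

    public static void main(String[] args) {
        List<Event> log = new ArrayList<>();
        log.add(new Event(Type.FOLLOW, "ford", "arthur", 100));
        log.add(new Event(Type.FOLLOW, "trillian", "arthur", 200));
        log.add(new Event(Type.UNFOLLOW, "ford", "arthur", 300));

        // current followers: only 'trillian'
        System.out.println(followersAt(log, "arthur", Long.MAX_VALUE));
        // followers as of time 250: 'ford' and 'trillian'
        System.out.println(followersAt(log, "arthur", 250));
    }
}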

Monday, April 29, 2013

List all used ports on a unix machine

netstat -a -t --numeric-ports -p

How to manage the hadoop services with CDH4

Show HDFS Service status

for x in `cd /etc/init.d ; ls hadoop-hdfs-*` ; do sudo service $x status ; done
 * Hadoop datanode is running
 * Hadoop namenode is running
 * Hadoop secondarynamenode is running

Show Mapreduce service status

for x in `cd /etc/init.d ; ls hadoop-0.20-mapreduce-*` ; do sudo service $x status ; done
hadoop-0.20-jobtracker is running
hadoop-0.20-tasktracker is running