Monday, December 16, 2013
How to increase Xmx for the hadoop client applications
export HADOOP_CLIENT_OPTS=-Xmx2G
Thursday, December 12, 2013
Find out the total size of directories with the 'du' command
du -sm * | sort -n
How to access the web interface of a remote hadoop cluster over a SOCKS proxy
Use ssh to open a SOCKS proxy:
ssh -f -N -D 7070 user@remote-host
The flags mean: -f puts ssh into the background, -N skips executing a remote command, and -D 7070 opens a SOCKS proxy on local port 7070.
After that you can configure firefox to use this proxy:
- Go to the manual proxy settings and add localhost:7070 as SOCKS v5 host
- Go to about:config and set network.proxy.socks_remote_dns to true to use DNS resolution over the proxy (thanks to Aczire for this!).
Wednesday, December 11, 2013
How to rsync a folder over ssh
rsync -avze ssh source user@host:target-folder
The flags mean: -a archive mode (preserves permissions and timestamps), -v verbose output, -z compression during transfer, and -e ssh uses ssh as the transport.
Monday, October 28, 2013
How to write unit tests for your Hadoop MapReduce jobs
Add MRUnit to your maven project; the hadoop2 classifier selects the build that matches the new MapReduce API:
<dependency>
    <groupId>org.apache.mrunit</groupId>
    <artifactId>mrunit</artifactId>
    <version>1.0.0</version>
    <classifier>hadoop2</classifier>
    <scope>test</scope>
</dependency>
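With that in place you can drive a mapper in isolation. A minimal sketch (TokenizingMapper is a hypothetical word-count style mapper emitting (word, 1) per token; adapt the types to your own job):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.junit.Before;
import org.junit.Test;

public class TokenizingMapperTest {

    private MapDriver<LongWritable, Text, Text, IntWritable> mapDriver;

    @Before
    public void setUp() {
        // TokenizingMapper is a placeholder for your own mapper class
        mapDriver = MapDriver.newMapDriver(new TokenizingMapper());
    }

    @Test
    public void emitsOneCountPerToken() throws Exception {
        // feed one input record, assert on the expected key/value pairs
        mapDriver.withInput(new LongWritable(0), new Text("hadoop hadoop"))
                 .withOutput(new Text("hadoop"), new IntWritable(1))
                 .withOutput(new Text("hadoop"), new IntWritable(1))
                 .runTest();
    }
}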
Time based tests with joda time
DateTimeUtils.setCurrentMillisFixed(new DateTime(2013, 3, 2, 0, 0).getMillis());
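This pins DateTime.now() to a fixed instant. Remember to restore the real clock afterwards so other tests are not affected, for example in an @After method:

@After
public void tearDown() {
    // switch Joda-Time back to the system clock
    DateTimeUtils.setCurrentMillisSystem();
}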
Tuesday, October 8, 2013
How to launch eclipse with a specific jdk on mac os
Add the following two lines to your eclipse.ini:
-vm
/path/to/java/home
The line break after -vm matters! To find out the path to java home run
/usr/libexec/java_home or /usr/libexec/java_home -v 1.7
Tuesday, October 1, 2013
How to install the sun 6 / oracle 7 jdk on ubuntu
wget https://github.com/flexiondotorg/oab-java6/raw/0.3.0/oab-java.sh -O oab-java.sh
chmod +x oab-java.sh
sudo ./oab-java.sh
sudo apt-get install sun-java6-jdk
Friday, September 27, 2013
How to cache your git credentials
git config --global credential.helper 'cache --timeout=300'
See also http://git-scm.com/docs/git-credential-cache.
https://confluence.atlassian.com/display/STASH/Permanently+authenticating+with+Git+repositories
OS X
Follow these steps to use Git with credential caching on OS X:
- Download the binary git-credential-osxkeychain.
- Run the command below to ensure the binary is executable:
chmod a+x git-credential-osxkeychain
- Put it in the directory /usr/local/bin.
- Run the command below:
git config --global credential.helper osxkeychain
Monday, September 2, 2013
Use avro-tools to inspect avro files in hdfs
hadoop jar avro-tools-1.7.5.jar tojson hdfs://<hostname>:/path/to/file.avro | less
Tuesday, August 20, 2013
Working with data
It’s impossible to overstress this: 80% of the work in any data project is in cleaning the data.
Monday, August 19, 2013
How to check the integrity of a bunch of gzip files
gunzip -t *.gz
This will report broken gzip archives without unpacking them.
Sunday, August 18, 2013
How to serve files over http from the shell
while true; do { echo -e 'HTTP/1.1 200 OK\r\n'; cat <filename>; } | nc -l 8080; done
Tuesday, August 6, 2013
How to enable snappy compression using cascading
Properties properties = new Properties();
// set path to main class
AppProps.setApplicationJarClass(properties, Main.class);
// compress mapreduce output
properties.put("mapred.output.compress", "true");
// set compression codec
properties.put("mapred.output.compression.codec", "org.apache.hadoop.io.compress.SnappyCodec");
FlowConnector flowConnector = new HadoopFlowConnector(properties);
Thursday, June 27, 2013
How to create an executable jar file with maven
The shade plugin is configured in the build-section of your pom.xml:
<build>
    <plugins>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-shade-plugin</artifactId>
            <version>2.1</version>
            <configuration>
                <!-- put your configurations here -->
            </configuration>
            <executions>
                <execution>
                    <phase>package</phase>
                    <goals>
                        <goal>shade</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>
To make the jar file executable, it has to contain a MANIFEST.MF that names the class containing the main function. Just add the following lines to the plugin configuration:
<configuration>
    <transformers>
        <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
            <mainClass>path.to.MainClass</mainClass>
        </transformer>
    </transformers>
</configuration>
See also: http://maven.apache.org/plugins/maven-shade-plugin/examples/executable-jar.html
Big Data Principles - Part I
Your Input Data Is Immutable
Data storage systems are used to store information. This information may change in the course of time. You have two basic options to reflect these changes in your data storage systems:
- Update the information
- Store the new information in addition to the existing information
In a social network a user has a list of followers. This list may be modified by two events:
- follow event - A new follower is added to the list of followers
- unfollow event - A follower chooses to unfollow the user.
The first possibility is to maintain a single list of followers per user and update it in place whenever one of these events occurs. The second possibility is to store all follow and unfollow events. The current list of followers for a user is then derived from this information.
scenario | solution one | solution two
---|---|---
Get the current list of followers for the user 'arthur'. | Read the list of current followers. | Derive the list of current followers from the sequence of follow and unfollow events.
Get the number of followers for the user 'arthur'. | Compute the length of the list of current followers. | Derive the list of current followers from the events, then compute its length.
Get all users that have been following the user 'arthur' two years ago. | - | Derive the list of followers at that time from the sequence of follow and unfollow events.
Get a list of all users the user 'ford' has been following on 2000-01-01. | - | Derive the list of followed users at that date from the sequence of follow and unfollow events.
As you can see, solution one offers a simple and efficient way to answer the questions you had in mind while implementing it. But there are many possible questions you cannot answer with this data model.
Solution two requires much more storage and additional computation effort to answer even simple questions. This is a high price to pay, but you get a great reward: since you never lose information to updates, you can answer questions that arise later in time, questions you did not even think of when you were implementing your application.
Solution two provides another big advantage: since you never update your raw data, the danger of data corruption due to an application error is much lower!
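To make solution two concrete, here is a minimal sketch in Java (the FollowEvent class and all names are made up for illustration): the follower list at any point in time is a pure function of the immutable event log.

import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class FollowerDerivation {

    // hypothetical immutable event record: who (un)followed whom, and when
    static class FollowEvent {
        final String follower;
        final String followee;
        final long timestamp;
        final boolean isFollow; // true = follow, false = unfollow

        FollowEvent(String follower, String followee, long timestamp, boolean isFollow) {
            this.follower = follower;
            this.followee = followee;
            this.timestamp = timestamp;
            this.isFollow = isFollow;
        }
    }

    // Replays the event log and returns everyone following 'user' at the
    // given point in time. Events are assumed to be ordered by timestamp.
    static Set<String> followersAt(List<FollowEvent> log, String user, long asOf) {
        Set<String> followers = new HashSet<>();
        for (FollowEvent e : log) {
            if (e.timestamp > asOf || !e.followee.equals(user)) {
                continue;
            }
            if (e.isFollow) {
                followers.add(e.follower);
            } else {
                followers.remove(e.follower);
            }
        }
        return followers;
    }
}

Because the log is never mutated, the same replay answers both "who follows arthur now?" and "who was following arthur two years ago?"; only the asOf parameter changes.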
Monday, April 29, 2013
List all used ports on a unix machine
netstat -a -t --numeric-ports -p
How to manage the hadoop services with CDH4
Show HDFS Service status
for x in `cd /etc/init.d ; ls hadoop-hdfs-*` ; do sudo service $x status ; done
 * Hadoop datanode is running
 * Hadoop namenode is running
 * Hadoop secondarynamenode is running
Show Mapreduce service status
for x in `cd /etc/init.d ; ls hadoop-0.20-mapreduce-*` ; do sudo service $x status ; done
hadoop-0.20-jobtracker is running
hadoop-0.20-tasktracker is running