Thursday, June 27, 2013

How to create an executable jar file with Maven

Executable jars may be created using the Maven Shade Plugin. The shade plugin allows for the creation of 'uber-jars'. These are self-contained jar files which contain the application code as well as all of its dependencies.
The shade plugin is configured in the build section of your pom.xml:
  <build>
    <plugins>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-shade-plugin</artifactId>
        <version>2.1</version>
        <configuration>
          <!-- put your configurations here -->
        </configuration>
        <executions>
          <execution>
            <phase>package</phase>
            <goals>
              <goal>shade</goal>
            </goals>
          </execution>
        </executions>
      </plugin>
    </plugins>
  </build>

To make the jar file executable, it has to contain a MANIFEST.MF which names the class containing the main method. Just add the following lines to the plugin config:

<configuration>
  <transformers>
    <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
      <mainClass>path.to.MainClass</mainClass>
    </transformer>
  </transformers>
</configuration>
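
To make it clearer what ends up in the jar, here is a small Java sketch that renders a manifest with a Main-Class entry using the JDK's java.util.jar classes. ManifestSketch is a hypothetical helper written only for this illustration; it is not part of Maven or the shade plugin, and the manifest the plugin actually writes may contain additional entries.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.jar.Attributes;
import java.util.jar.Manifest;

// Illustrative only: ManifestSketch is a hypothetical helper, not part of Maven or the shade plugin.
class ManifestSketch {

    // Renders the text of a manifest that names mainClass as the entry point of the jar.
    static String render(String mainClass) {
        Manifest manifest = new Manifest();
        Attributes main = manifest.getMainAttributes();
        // Manifest-Version has to be set, otherwise the main section is not written.
        main.put(Attributes.Name.MANIFEST_VERSION, "1.0");
        main.put(Attributes.Name.MAIN_CLASS, mainClass);

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            manifest.write(out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // path.to.MainClass is the placeholder class name used in the plugin config above.
        System.out.print(render("path.to.MainClass"));
    }
}
```

When a jar with such a manifest is started with java -jar, the JVM looks up the Main-Class entry and calls the main method of that class.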

See also: http://maven.apache.org/plugins/maven-shade-plugin/examples/executable-jar.html

Big Data Principles - Part I

Your Input Data Is Immutable

Data storage systems are used to store information. This information may change in the course of time. You have two basic options to reflect these changes in your data storage systems:
  • Update the information
  • Store the new information in addition to the existing information
Consider the following example:

In a social network a user has a list of followers. This list may be modified by two events:

  • follow event - A new follower is added to the list of followers
  • unfollow event - A follower chooses to unfollow the user.
One possibility to store this information is to always store and update a list of current followers for each user. Each time a new follower is added or removed you update this list in your storage system.

The second possibility is to store all follow and unfollow events. The current list of followers for a user is derived from this information.

scenario | solution one | solution two
Get the current list of followers for the user 'arthur'. | Read the list of current followers. | Derive the list of current followers from the sequence of follow and unfollow events.
Get the number of followers for the user 'arthur'. | Compute the length of the list of current followers. | Compute the length of the list of current followers.
Get all users that were following the user 'arthur' two years ago. | - | Derive the list of followers from the sequence of follow and unfollow events.
Get a list of all users the user 'ford' was following on 2000-01-01. | - | Derive the list of followed users from the sequence of follow and unfollow events.
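
A minimal Java sketch of solution two may help to see how the derivation works. It assumes that every event carries a timestamp, an event type, the follower and the followed user. The names FollowEvent, EventType and FollowerLog are made up for this illustration, and a real system would read the events from a storage system instead of an in-memory list.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// All types below are hypothetical and only illustrate solution two.
enum EventType { FOLLOW, UNFOLLOW }

// One immutable fact: at 'timestamp', 'follower' started or stopped following 'followed'.
final class FollowEvent {
    final Instant timestamp;
    final EventType type;
    final String follower;
    final String followed;

    FollowEvent(Instant timestamp, EventType type, String follower, String followed) {
        this.timestamp = timestamp;
        this.type = type;
        this.follower = follower;
        this.followed = followed;
    }
}

final class FollowerLog {

    // Users following 'user' at time 'asOf', derived by replaying the events in time order.
    static Set<String> followersOf(List<FollowEvent> events, String user, Instant asOf) {
        Set<String> followers = new LinkedHashSet<>();
        for (FollowEvent e : sortedUpTo(events, asOf)) {
            if (!e.followed.equals(user)) continue;
            if (e.type == EventType.FOLLOW) followers.add(e.follower);
            else followers.remove(e.follower);
        }
        return followers;
    }

    // Users that 'user' was following at time 'asOf', derived from the same events.
    static Set<String> followedBy(List<FollowEvent> events, String user, Instant asOf) {
        Set<String> followed = new LinkedHashSet<>();
        for (FollowEvent e : sortedUpTo(events, asOf)) {
            if (!e.follower.equals(user)) continue;
            if (e.type == EventType.FOLLOW) followed.add(e.followed);
            else followed.remove(e.followed);
        }
        return followed;
    }

    // Copies the events that happened at or before 'asOf' and sorts the copy by timestamp.
    // The input list is never modified.
    private static List<FollowEvent> sortedUpTo(List<FollowEvent> events, Instant asOf) {
        List<FollowEvent> result = new ArrayList<>();
        for (FollowEvent e : events) {
            if (!e.timestamp.isAfter(asOf)) result.add(e);
        }
        result.sort(Comparator.comparing((FollowEvent e) -> e.timestamp));
        return result;
    }

    public static void main(String[] args) {
        // Sample data invented for this example.
        List<FollowEvent> log = Arrays.asList(
            new FollowEvent(Instant.parse("1999-06-01T00:00:00Z"), EventType.FOLLOW, "ford", "arthur"),
            new FollowEvent(Instant.parse("2001-01-01T00:00:00Z"), EventType.UNFOLLOW, "ford", "arthur"),
            new FollowEvent(Instant.parse("2002-01-01T00:00:00Z"), EventType.FOLLOW, "zaphod", "arthur"));

        Instant newYear2000 = Instant.parse("2000-01-01T00:00:00Z");
        System.out.println(followersOf(log, "arthur", newYear2000));
        System.out.println(followedBy(log, "ford", newYear2000));
        System.out.println(followersOf(log, "arthur", Instant.now()));
    }
}
```

The current list of followers is just the special case where asOf is the current time. The historical questions from the table use exactly the same code with an earlier point in time.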

As you can see, solution one offers a simple and efficient way to answer the questions you had in mind while implementing it. But there are many possible questions you cannot answer with this data model.

Solution two requires much more storage and additional computation to answer simple questions. This is a high price to pay, but you get a great reward: since you do not lose information due to data updates, you can answer questions that arise later in time, questions that you did not even think of when you were implementing your application.

Solution two provides another big advantage: since you never update your raw data, the danger of data corruption due to an application error is much lower!