Hadoop, Java MapReduce: starting a job from an arbitrary web / EE container

There are plenty of examples on the Internet showing how to start a MapReduce job from a standalone Java application.
But for someone just getting started with Hadoop, it can be hard to figure out how to launch a job from inside a Java container.

For example, this tutorial, kindly provided by ikrumping, contains the following code:

        Job job = new Job(config, "grep");
        /*
         * To run the program from a jar file, you need to specify
         * any class from your application.
         */
        job.setJarByClass(Grep.class);


This code will work if you run it as a standalone application.

But if you run the same code from JBoss AS, WebSphere AS, GlassFish AS, etc., it will not work.
Why? Because the container unpacks your JAR file into its own caches and runs the classes from there.

If you are curious why the setJarByClass method does not work inside a container, read on.
To begin with, let's take a look at the implementation of setJarByClass:

public void setJarByClass(Class cls) {
  String jar = findContainingJar(cls);
  if (jar != null)
    setJar(jar);
}

private static String findContainingJar(Class my_class) {
  ClassLoader loader = my_class.getClassLoader();
  String class_file = my_class.getName().replaceAll("\\.", "/") + ".class";
  try {
    Enumeration itr = loader.getResources(class_file);
    while (itr.hasMoreElements()) {
      URL url = (URL) itr.nextElement();
      if ("jar".equals(url.getProtocol())) {
        String toReturn = url.getPath();
        if (toReturn.startsWith("file:")) {
          toReturn = toReturn.substring("file:".length());
        }
        toReturn = toReturn.replaceAll("\\+", "%2B");
        toReturn = URLDecoder.decode(toReturn, "UTF-8");
        return toReturn.replaceAll("!.*$", "");
      }
    }
  } catch (IOException e) {
    throw new RuntimeException(e);
  }
  return null;
}


As you can see, findContainingJar expects the URL's protocol to be "jar",
while each container uses its own protocol.
As a result, setJarByClass really only works for standalone applications.
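
You can see this for yourself with a tiny diagnostic snippet like the one below: run it standalone and then from inside your container and compare the output. This is only a sketch: Grep stands in for any class of your application, and the exact protocol you get back depends on the container.

import java.net.URL;

public class ProtocolCheck {
  public static void main(String[] args) {
    // Grep is just the example class from the tutorial above;
    // any class of your application will do.
    String classFile = Grep.class.getName().replaceAll("\\.", "/") + ".class";
    URL url = Grep.class.getClassLoader().getResource(classFile);

    // Standalone from a JAR this prints something like
    //   jar -> jar:file:/opt/app/test.jar!/Grep.class
    // while inside a container the protocol is typically not "jar",
    // so the check in findContainingJar never matches and setJar() is never called.
    System.out.println(url.getProtocol() + " -> " + url);
  }
}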



So how do you start a MapReduce job in a universal way that does not depend on a specific application container?

To do this, do the following:
  1. build a separate JAR containing all the classes the job uses
  2. upload it to the HDFS file system of the Hadoop cluster where you are going to run MapReduce (see the sketch after this list)
  3. add that JAR to the classpath of the job you launch
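
Step 2 can be done with the standard org.apache.hadoop.fs.FileSystem API. The sketch below assumes the freshly built JAR sits at /opt/myapp/lib/test.jar on the application server; both paths are placeholders for illustration.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class UploadJobJar {
  public static void main(String[] args) throws Exception {
    Configuration config = new Configuration();
    FileSystem fs = FileSystem.get(config);

    // Copy the job JAR from the local file system of the application server
    // into HDFS, where the job will pick it up later.
    fs.copyFromLocalFile(new Path("/opt/myapp/lib/test.jar"),
                         new Path("/user/UserName/test.jar"));
  }
}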


In the above example, you need to replace:

       job.setJarByClass(Grep.class);

with:
        DistributedCache.addFileToClassPath("/user/UserName/test.jar", config);


Here the first parameter of addFileToClassPath is the path to the JAR file inside the HDFS distributed file system,
and the second is the Hadoop configuration (org.apache.hadoop.conf.Configuration).
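
Putting it all together, a launch method inside the container could look roughly like the sketch below. The class and method names are made up, the JAR is assumed to already be at /user/UserName/test.jar in HDFS (step 2), and the mapper/reducer/input/output setup from the original grep tutorial is omitted.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;

public class ContainerGrepLauncher {

  // Hypothetical entry point, called from a servlet, EJB, etc.
  public void launchGrepJob() throws Exception {
    Configuration config = new Configuration();

    // Add the JAR previously uploaded to HDFS to the job's classpath.
    // Do this before creating the Job, because Job takes a copy
    // of the configuration at construction time.
    DistributedCache.addFileToClassPath(new Path("/user/UserName/test.jar"), config);

    Job job = new Job(config, "grep");
    // ... setMapperClass / setReducerClass / input and output paths
    //     from the original tutorial go here ...

    job.waitForCompletion(true);
  }
}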

There used to be two more ways to hand your JAR over to Hadoop, but they are outdated by now: blog.cloudera.com/blog/2011/01/how-to-include-third-party-libraries-in-your-map-reduce-job
