Improving data processing speed with data locality in Hadoop

    Author: Andrey Lazarev

    One of the main bottlenecks in processing large amounts of data is network traffic passing through the switches. Fortunately, executing map code on the node where the data resides makes this problem much less serious. This technique, called "data locality," is one of the key advantages of the Hadoop MapReduce model.

    In this article, we will look at the requirements for data locality, how an OpenStack virtualized environment affects the topology of a Hadoop cluster, and how to ensure data locality when using Hadoop with Savanna.

    Hadoop data locality requirements


    To take full advantage of data locality, you need to make sure that your system architecture meets a number of conditions.

    First, the cluster must have an appropriate topology. Hadoop map code must be able to read its data locally. Some popular solutions, such as network-attached storage (NAS) and storage area networks (SAN), always generate network traffic, so in a strict sense they cannot be considered "local"; in practice, though, it depends on your point of view. Depending on your situation, you can define "local" as "located within the same data center" or "anything on the same rack."

    Second, the Hadoop cluster must know the topology of the nodes on which jobs run. TaskTracker nodes execute map tasks, so the Hadoop scheduler needs network topology information to distribute tasks properly.
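    Hadoop learns this topology from an administrator-supplied script (configured via the `topology.script.file.name` property in Hadoop 1.x): it is invoked with one or more node addresses and must print one rack path per address. The sketch below is a minimal hypothetical example; the addresses and rack names are made up for illustration.

```python
#!/usr/bin/env python
# Minimal rack-awareness script sketch. Hadoop invokes it with node
# addresses as arguments and reads one rack path per address from stdout.
import sys

# Assumed inventory; in a real cluster this mapping would come from your
# infrastructure database or a config file.
RACK_MAP = {
    "10.0.0.11": "/dc1/rack1",
    "10.0.0.12": "/dc1/rack1",
    "10.0.0.21": "/dc1/rack2",
}
DEFAULT_RACK = "/dc1/default-rack"

def resolve(addresses):
    """Return the rack path for each address, falling back to a default."""
    return [RACK_MAP.get(addr, DEFAULT_RACK) for addr in addresses]

if __name__ == "__main__":
    for rack in resolve(sys.argv[1:]):
        print(rack)
```

    With this script in place, the scheduler can prefer TaskTrackers on the same rack as the data blocks.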

    Last but not least, the Hadoop cluster needs to know where the data is located. This can be a bit more complicated because Hadoop supports various storage systems: HDFS provides data locality information out of the box, while other storage drivers (such as Swift) require an extension to supply topology information to Hadoop.

    Hadoop Cluster Topology in Virtualized Infrastructure


    Hadoop traditionally uses a 3-level network topology. Originally these three levels were data center, rack, and node, although the cross-data-center case is uncommon, so the top level is often used to denote top-level switches instead.

    [Image: 3-level Hadoop network topology]

    Such a topology suits traditional bare-metal Hadoop deployments well, but a virtualized environment is hard to map onto these three levels, because they leave no room for the hypervisor. Two virtual machines running on the same host can communicate much faster than virtual machines on separate hosts, since no physical network is involved. Therefore, starting with version 1.2.0, Hadoop supports a 4-level topology.
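    Enabling the fourth level requires switching Hadoop to a node-group-aware topology implementation. The fragment below is a sketch of the relevant core-site.xml settings; the property names follow the Hadoop Virtualization Extensions work and may vary between Hadoop versions, so treat them as assumptions to verify against your distribution.

```xml
<!-- Sketch: enable node-group (4-level) topology awareness.
     Property names are assumptions based on the Hadoop Virtualization
     Extensions; check your Hadoop version's documentation. -->
<property>
  <name>net.topology.impl</name>
  <value>org.apache.hadoop.net.NetworkTopologyWithNodeGroup</value>
</property>
<property>
  <name>net.topology.nodegroup.aware</name>
  <value>true</value>
</property>
<property>
  <name>dfs.block.replicator.classname</name>
  <value>org.apache.hadoop.hdfs.server.namenode.BlockPlacementPolicyWithNodeGroup</value>
</property>
```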

    [Image: 4-level Hadoop network topology with the host group level]

    The new level, called the "host group," corresponds to the hypervisor on which the virtual machines run. Several virtual machines serving as cluster nodes can reside on the same host machine under a single hypervisor, which lets them exchange data without sending it over the network.
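    In a 4-level setup, the topology script returns paths that include the host group between the rack and the node. The sketch below is a hypothetical variant of the earlier script for a virtualized cluster; the VM addresses, rack names, and hypervisor IDs are invented for illustration.

```python
#!/usr/bin/env python
# Hypothetical 4-level topology script: each VM address maps to a
# /rack/host-group path, where the host group identifies the hypervisor
# hosting that VM.
import sys

# Assumed placement inventory: VM address -> (rack, hypervisor host).
VM_PLACEMENT = {
    "10.0.1.5": ("rack1", "compute-host-a"),
    "10.0.1.6": ("rack1", "compute-host-a"),  # same hypervisor as .5
    "10.0.1.7": ("rack1", "compute-host-b"),
}
DEFAULT = ("default-rack", "default-group")

def resolve(addresses):
    """Return /rack/host-group paths; unknown VMs get a default location."""
    paths = []
    for addr in addresses:
        rack, host = VM_PLACEMENT.get(addr, DEFAULT)
        paths.append("/%s/%s" % (rack, host))
    return paths

if __name__ == "__main__":
    for path in resolve(sys.argv[1:]):
        print(path)
```

    Two VMs that resolve to the same /rack/host-group path share a hypervisor, so Hadoop can schedule communication between them without touching the physical network.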

    Data Locality in Savanna and Hadoop


    So, knowing all of the above, how do you put it into practice? One option is to use Savanna to deploy and configure your Hadoop clusters.

    Savanna is an OpenStack project that lets you deploy Hadoop clusters on top of the OpenStack platform and run jobs on them. The latest release, Savanna 0.3, added data locality support for the Vanilla plugin (for the Hortonworks plugin, data locality support is planned for the Icehouse release). With this enhancement, Savanna can push the cluster topology configuration into Hadoop and thereby ensure data locality. Savanna supports both 3-level and 4-level network topologies; for the 4-level topology, it uses the host ID of the compute node as the identifier for the node group of the Hadoop cluster. (Note: be careful not to confuse Hadoop cluster node groups with Savanna node groups, which serve different purposes.)
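    On the Savanna side, data locality is driven by a few options in savanna.conf. The fragment below is a sketch based on the Savanna 0.3 release; the option names and file paths are assumptions to verify against the Savanna documentation for your version.

```ini
# Sketch of savanna.conf options for data locality (assumed names,
# based on Savanna 0.3 -- verify against your release's docs).
enable_data_locality=true
compute_topology_file=/etc/savanna/compute.topology
swift_topology_file=/etc/savanna/swift.topology
```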

    Savanna can also provide data locality for Swift input streams in Hadoop, but this requires adding a dedicated Swift driver to Hadoop, because the Vanilla plugin uses Hadoop 1.2.1, which has no native Swift support. The Swift driver was developed by the Savanna project team and is already partially integrated into Hadoop 2.4.0. The plan is to integrate it fully into the 2.x branch and then backport it to the older 1.x branch.

    Using data locality with Swift involves enabling the data locality feature in Savanna and then specifying both the Compute topology and the Swift topology. The video below shows how to start a Hadoop cluster in Savanna and configure data locality on it:
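    The two topology files are simple mappings from an identifier to its place in the topology. The sketch below shows an assumed "identifier path" one-pair-per-line format with invented host IDs and addresses; check the Savanna documentation for the exact format your release expects.

```
# compute.topology (assumed format): compute host ID -> topology path
compute-host-a /rack1
compute-host-b /rack2

# swift.topology (assumed format): Swift node address -> topology path
192.168.1.10 /rack1
192.168.1.11 /rack2
```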

    http://youtu.be/XayHkbmjK9g

    Conclusion


    Processing data in a virtualized Hadoop environment may well be the next step in the evolution of big data. As clusters grow, optimizing resource consumption becomes extremely important. Techniques such as data locality can significantly reduce network utilization and let you work with large distributed clusters without losing the benefits of smaller, more local ones. This makes the scalability of a Hadoop cluster virtually limitless.

    Original article in English.
