Pentaho Data Integration (PDI), Python and Deep Learning

Published on February 07, 2019

    Hi, Habr! Here is a translation of the article "Pentaho Data Integration (PDI), Python and Deep Learning."

    Deep Learning (DL): why is there so much hype around it?


    According to Zion Market Research, the deep learning (DL) market will grow from $2.3 billion in 2017 to more than $23.6 billion by 2024. With a compound annual growth rate of almost 40%, DL has become one of the hottest areas in which analytics experts build models. Before turning to how Pentaho can help put your organization's DL models into production, let's take a step back and consider why DL is such a breakthrough technology. Here is some general background:

    • It uses artificial neural networks with several hidden layers that can perform accurate image recognition, computer vision / object detection, video stream processing, natural language processing, and much more. Improvements in DL capabilities and in computing resources, such as GPUs and cloud storage, have significantly accelerated the already rapid growth of DL over the past few years;
    • By trying to imitate the activity of the human brain through layers of neurons, DL learns to recognize patterns in digital representations of sounds, video streams, images and other data;
    • It reduces the need for feature engineering before the model is run: with several hidden layers, feature extraction is performed on the fly while the model runs;
    • It improves performance and accuracy compared with traditional machine learning algorithms, thanks to updated frameworks, the availability of very large data sets (i.e., big data) and a significant jump in computing power, such as GPUs;
    • It is supported by frameworks and libraries such as TensorFlow, Keras, Caffe, PyTorch, and others that make DL more accessible to analytics experts.

    Why use PDI to develop and implement deep learning models using Python?


    Today, data scientists and data engineers collaborate on hundreds of data science projects built in PDI. Thanks to Pentaho, they have been able to move complex data science models into production at lower cost than with traditional data preparation tools. We are pleased to announce that Pentaho can now bring this ease of use to DL frameworks, furthering Hitachi Vantara's goal of enabling organizations to innovate with all their data. With PDI and the new Python Executor step, Pentaho can do the following:

    • Integrate with popular DL frameworks in a transformation step, expanding Pentaho's already extensive data science capabilities;
    • Easily run DL Python script files received from data scientists in the new PDI Python Executor step;
    • Run a DL model on any CPU / GPU hardware, allowing organizations to use GPU acceleration to improve the performance of their DL models;
    • Pass data from previous PDI steps through the data stream, as a Python pandas DataFrame or NumPy array, into the Python Executor step for DL processing;
    • Integrate with the Hitachi Content Platform (or HDFS, local storage, S3, Google Storage, etc.), allowing you to move and stage unstructured data files in one location (for example, a data lake), thereby reducing DL storage and processing costs.

    Benefits:

    • PDI supports the most widely used DL frameworks, i.e. TensorFlow, Keras, PyTorch, and others that have a Python API, allowing data scientists to work in their favorite libraries;
    • PDI allows data engineers and data scientists to collaborate on DL implementation;
    • PDI makes effective use of the skills and time of data scientists (who create, evaluate, and run DL models) and data engineers (who build the data pipelines in PDI that feed DL).

    How does PDI put deep learning into production?


    Components Used:

    • Pentaho 8.2, PDI Python Executor Step, Hitachi Content Platform (HCP) VFS
    • Python.org 2.7.x or Python 3.5.x
    • TensorFlow 1.10
    • Keras 2.2.0

    For a list of dependencies, see the Python Executor page in the Pentaho 8.2 online documentation.

    Main process:

    1. Choose an HCP VFS file location in a PDI step. Copy and stage the unstructured data files for use by the DL framework in the PDI Python Executor step.

    image

    Additional information:
    https://help.pentaho.com/Documentation/8.2/Products/Data_Integration/Data_Integration_Perspective/Virtual_File_System


    2. Use a new transformation that implements the workflows for processing the DL frameworks and their associated data sets. Inject hyperparameters (values used to configure and run models) to find the best-performing model. Below is an example that implements four DL framework workflows, three using TensorFlow and one using Keras, with the Python Executor step.

    image

    image
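
    The idea behind injecting hyperparameters can be sketched in plain Python: each candidate set is passed to the same training routine, and the set with the best score is kept. The names below (`candidates`, `train_and_score`, `best_hyperparameters`) and the scoring formula are hypothetical stand-ins, not part of PDI or any DL framework; a real script would train a TensorFlow or Keras model instead.

```python
# Hypothetical hyperparameter sets, one per run of the workflow.
candidates = [
    {"learning_rate": 0.1, "hidden_units": [10, 10], "steps": 500},
    {"learning_rate": 0.01, "hidden_units": [10, 20, 10], "steps": 1000},
    {"learning_rate": 0.001, "hidden_units": [32, 16], "steps": 2000},
]


def train_and_score(params):
    """Stand-in for real training: returns a made-up score in (0, 1].

    Placeholder only, so the selection logic can run without a DL framework.
    """
    distance = abs(params["learning_rate"] - 0.01)
    return 1.0 / (1.0 + 100.0 * distance)


def best_hyperparameters(param_sets, score_fn):
    """Score every set and return (best_params, best_score)."""
    scored = [(score_fn(p), p) for p in param_sets]
    best_score, best_params = max(scored, key=lambda pair: pair[0])
    return best_params, best_score


best, score = best_hyperparameters(candidates, train_and_score)
print(best["hidden_units"], round(score, 3))
```

    Keeping the scoring function as a parameter means the selection logic stays the same when the placeholder is swapped for real model training.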

    3. Focusing on the TensorFlow DNN Classifier workflow (which implements hyperparameter injection), use a PDI Data Grid step, here named Injected Hyperparameters, to hold the values used by the corresponding Python Script Executor steps.

    image

    4. In the Python Script Executor step, use a pandas DataFrame and map the injected hyperparameters and their values to variables on the Input tab.

    image
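
    Steps 3 and 4 amount to turning a row of named values into typed Python variables. Here is a minimal sketch, assuming the Data Grid fields arrive as strings. The field names (`learning_rate`, `hidden_units`, `steps`) and the helper `parse_hyperparameters` are hypothetical, and a plain dictionary stands in for one row of the pandas DataFrame that the step passes to the script.

```python
# One row of the "Injected Hyperparameters" grid, as field name -> string value.
# Field names and values are illustrative only.
row = {"learning_rate": "0.01", "hidden_units": "10,20,10", "steps": "1000"}


def parse_hyperparameters(raw):
    """Convert string fields to the types a model constructor would expect."""
    return {
        "learning_rate": float(raw["learning_rate"]),
        # "10,20,10" -> [10, 20, 10], one integer per hidden layer
        "hidden_units": [int(n) for n in raw["hidden_units"].split(",")],
        "steps": int(raw["steps"]),
    }


params = parse_hyperparameters(row)
print(params)
```

    With a real DataFrame `df`, `df.iloc[0].to_dict()` would yield a comparable dictionary for the first row.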

    5. Run the DL-related Python script (either embedded in the step via “Embed” or referenced via “Link from file”), which references the DL framework and the injected hyperparameters. You can also point the step at a Python virtual environment other than the default one.

    image
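
    To confirm which interpreter and environment a script actually runs under, the script itself can report them. This sketch uses only the standard library; the function name `describe_interpreter` is ours, and the virtual-environment check assumes an environment created with the `venv` module (it falls back to `sys.prefix` where `sys.base_prefix` does not exist).

```python
import sys


def describe_interpreter():
    """Report the running interpreter and whether it is a virtual environment."""
    base_prefix = getattr(sys, "base_prefix", sys.prefix)
    return {
        "executable": sys.executable,
        "version": "%d.%d.%d" % sys.version_info[:3],
        "prefix": sys.prefix,
        # In a venv, sys.prefix points at the environment, not the base install.
        "in_virtualenv": sys.prefix != base_prefix,
    }


info = describe_interpreter()
for key in sorted(info):
    print(key, "=", info[key])
```

    Printing this at the top of a script makes it easy to spot when the step is using a different Python installation than the one where the DL libraries were installed.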

    6. Ensure that TensorFlow is installed, configured and correctly imported into the Python shell.

    image
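
    A quick way to perform this check without starting a training run is to ask Python whether the packages can be located. This is a minimal sketch, assuming Python 3.4 or later (for `importlib.util.find_spec`); the helper name `missing_packages` and the package list are illustrative.

```python
import importlib.util


def missing_packages(names):
    """Return the subset of top-level package names that cannot be located."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Packages the article's example relies on; adjust to your own script.
required = ["tensorflow", "keras", "pandas", "numpy"]
missing = missing_packages(required)
if missing:
    print("Not installed in this environment:", ", ".join(missing))
else:
    print("All required packages were found.")
```

    Note that `find_spec` only locates a package; running `import tensorflow` in the Python shell is still needed to confirm that it loads correctly.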

    7. Back in the Python Executor step, click the Output tab and then click “Get Fields”. PDI pre-checks the script file for errors and determines its outputs and other parameters.

    image

    8. This completes the setup; the transformation is ready to run.

    Hitachi Vantara offers a proprietary GPU solution to accelerate deep learning


    DL frameworks can gain significantly in performance when run on a GPU rather than a CPU, so most DL frameworks support some type of GPU. In 2018, Hitachi Vantara developed and delivered the DS225 Advanced server with NVIDIA Tesla V100 GPUs. This is Hitachi Vantara's first GPU server designed specifically for DL workloads.

    More information about this offer can be found on the Hitachi Vantara website.

    Why do organizations need to use PDI and Python for Deep Learning?


    • Intuitive drag-and-drop tools: PDI simplifies the implementation and execution of DL frameworks with a graphical development environment for DL-related pipelines and workflows;
    • Productive collaboration: data engineers and data scientists can work on a shared workflow and use their skills and time efficiently;
    • Efficient allocation of valuable resources: a data engineer can use PDI to create workflows, move and stage unstructured data files from / to HCP, and configure injected hyperparameters in preparation for a Python script received from a data scientist;
    • Best-in-class GPU processing: Hitachi Vantara offers the DS225 Advanced server with NVIDIA Tesla V100 GPUs, which let DL frameworks take advantage of GPU performance.