ML on Scala with a smile, for those who are not afraid of experimentation



    Hello! Today we will talk about implementing machine learning in Scala. I'll start by explaining how we got here. For a long time our team used Python for all of its machine learning needs. It is convenient, there are many useful libraries for data preparation, and there is good development infrastructure, by which I mean Jupyter Notebook. Everything would have been fine, but we ran into the problem of parallelizing computations in production and decided to use Scala there. Why not, we thought: there are plenty of libraries, and even Apache Spark is written in Scala! So today we develop models in Python and then repeat the training in Scala for serialization and use in production. But, as they say, the devil is in the details.

    Let me clarify right away, dear reader: this article was not written to undermine Python's reputation in machine learning. No, the main goal is to open the door to the world of machine learning in Scala, give a short overview of an alternative approach based on our experience, and describe the difficulties we encountered.

    In practice, it turned out not to be all that joyful: there are not many libraries that implement classical machine learning algorithms, and those that exist are often open-source projects without the backing of large vendors. Yes, of course, there is Spark MLlib, but it is strongly tied to the Apache Hadoop ecosystem, and we really did not want to drag it into a microservice architecture.

    What was needed was a solution that would save the world and bring back a restful sleep, and it was found!

    What do you need?


    When we chose a tool for machine learning, we proceeded from the following criteria:

    • it should be simple;
    • despite its simplicity, it should still offer broad functionality;
    • we really wanted to be able to develop models in a web interpreter, rather than through the console or constant builds and compilations;
    • the availability of documentation plays an important role;
    • ideally, there would be support that at least answers GitHub issues.

    What did we see?


    • Apache Spark MLlib : did not suit us. As mentioned above, this set of libraries is strongly tied to the Apache Hadoop stack and to Spark Core itself, which is far too heavy to build microservices on.
    • Apache PredictionIO : an interesting project with many contributors and documentation with examples. In essence, it is a REST server that serves models. There are ready-made models, for example for text classification, whose launch is described in the documentation; it also explains how to add and train your own models. It did not suit us either, since Spark is used under the hood, and this is closer to a monolithic solution than to a microservice architecture.
    • Apache MXNet : an interesting framework for working with neural networks, with support for both Scala and Python. This is convenient: you can train a neural network in Python and then load the saved result from Scala when building a production solution. We use it in production solutions; there is a separate article about this here.
    • Smile : very similar to Python's scikit-learn package. There are many implementations of classical machine learning algorithms, good documentation with examples, support on GitHub, a built-in visualizer (powered by Swing), and you can use Jupyter Notebook to develop models. This is just what we needed!

    Environment preparation


    So, we chose Smile. I'll show how to run it in Jupyter Notebook, using the k-means clustering algorithm as an example. The first thing we need to do is install Jupyter Notebook with Scala support. This can be done via pip, or by using an already built and configured Docker image. I am for the simpler, second option.

    To make Jupyter work with Scala, I wanted to use BeakerX, which is part of the Docker image available in the official BeakerX repository. This image is recommended in the Smile documentation, and you can run it like this:

    # The official BeakerX image
    docker run -p 8888:8888 beakerx/beakerx

    But here the first trouble was waiting: at the time of writing, BeakerX 1.0.0 was installed inside the beakerx/beakerx image, while version 1.4.1 was already available in the project's official GitHub repository (more precisely, the latest release is 1.3.0, but master contains 1.4.1, and it works :-)).

    It’s clear that I want to work with the latest version, so I put together my own image based on BeakerX 1.4.1. I will not bore you with the contents of the Dockerfile, here is a link to it.

    # Run the image and mount the working directory into it
    mkdir -p /tmp/my_code
    docker run -it \
        -p 8888:8888 \
        -v /tmp/my_code:/workspace/my_code \
        entony/jupyter-scala:1.4.1

    By the way, those who use my image get a small bonus: the examples directory contains a k-means example for a random sequence, with plotting (which is not an entirely trivial task in Scala notebooks).

    Download Smile in Jupyter Notebook


    Excellent, the environment is ready! We create a new Scala notebook in a folder in our directory; then we need to download the libraries from Maven for Smile to work.

    %%classpath add mvn
    com.github.haifengl smile-scala_2.12 1.5.2

    After executing the code, a list of downloaded jar files will appear in its output block.
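    Outside the notebook, the same artifact would go into a regular sbt build like this (my equivalent sbt line, assuming a Scala 2.12 project, which matches the `smile-scala_2.12` coordinate above):

```scala
// build.sbt: the same Maven dependency as in the %%classpath magic
libraryDependencies += "com.github.haifengl" %% "smile-scala" % "1.5.2"
```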

    Next step: importing the necessary packages for the example to work.

    import java.awt.image.BufferedImage
    import java.awt.Color
    import javax.imageio.ImageIO
    import java.io.File
    import smile.clustering._

    Preparing data for clustering


    Now we will solve the following problem: generate an image consisting of zones of three primary colors, red, green and blue (R, G, B), with one of the colors prevailing. We will cluster the pixels of the image, take the cluster that contains the most pixels, change the color of its pixels to gray, and build a new image from all the pixels. Expected result: the zone of the predominant color will turn gray, while the other zones will keep their color.

    // The image will be 640 x 360
    val width = 640
    val hight = 360
    // Create an empty image of the required size
    val testImage = new BufferedImage(width, hight, BufferedImage.TYPE_INT_RGB)
    // Fill the image with pixels. Blue will be the predominant color.
    for {
        x <- (0 until width)
        y <- (0 until hight)
        color = if (y <= hight / 3 && (x <= width / 3 || x > width / 3 * 2)) Color.RED
        else if (y > hight / 3 * 2 && (x <= width / 3 || x > width / 3 * 2)) Color.GREEN
        else Color.BLUE
    } testImage.setRGB(x, y, color.getRGB)
    // Display the created image
    testImage
    

    As a result of executing this code, the following picture is displayed:



    Next step: convert the picture into a set of pixels. By a pixel we mean an entity with the following properties:

    • the coordinate along the wide side (x);
    • the coordinate along the narrow side (y);
    • a color value;
    • an optional class/cluster number (empty until clustering is completed).

    A case class is a convenient way to model this entity:

    case class Pixel(x: Int, y: Int, rgbArray: Array[Double], clusterNumber: Option[Int] = None)

    Here an array rgbArray of three values, red, green, and blue, is used for the color (for example, red is Array(255.0, 0, 0)).
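    For clarity, the conversion between java.awt.Color and this array representation can be sketched with two small helpers (hypothetical names of mine, not part of the notebook):

```scala
import java.awt.Color

// Hypothetical helpers illustrating the rgbArray convention used below:
// one Double per channel, in the range 0.0 to 255.0.
def colorToRgbArray(c: Color): Array[Double] =
  Array(c.getRed.toDouble, c.getGreen.toDouble, c.getBlue.toDouble)

def rgbArrayToColor(a: Array[Double]): Color =
  new Color(a(0).toInt, a(1).toInt, a(2).toInt)
```

    For example, colorToRgbArray(Color.RED) yields Array(255.0, 0.0, 0.0), and rgbArrayToColor inverts it.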

    // Convert the image into a collection of pixels (Pixel)
    val pixels = for {
        x <- (0 until testImage.getWidth).toArray
        y <- (0 until testImage.getHeight)
        color = new Color(testImage.getRGB(x, y))
    } yield Pixel(x, y, Array(color.getRed.toDouble, color.getGreen.toDouble, color.getBlue.toDouble))
    // Print the first 10 elements of the collection
    pixels.take(10)

    This completes the data preparation.

    Pixel color clustering


    We have a collection of pixels in three primary colors, so we will cluster the pixels into three classes.

    // Number of clusters
    val countColors = 3
    // Run the clustering
    val clusters = kmeans(pixels.map(_.rgbArray), k = countColors, runs = 20)

    The documentation recommends setting the runs parameter in the range from 10 to 20.
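    To make it clear why runs matters: k-means depends on random initialization, so repeating the clustering and keeping the result with the lowest distortion guards against a bad start. Below is a toy sketch of that idea on one-dimensional data (my own illustration, not Smile's implementation):

```scala
import scala.util.Random

// Toy 1-D k-means with restarts: an illustration of why Smile's `runs`
// parameter exists, NOT Smile's actual implementation.
def kmeans1d(data: Seq[Double], k: Int, rng: Random, iters: Int = 20): (Seq[Double], Double) = {
  // Random initialization: pick k distinct points as starting centers
  var centers: Seq[Double] = rng.shuffle(data).take(k)
  for (_ <- 0 until iters) {
    // Assign each point to its nearest center, then move centers to the group means
    val groups = data.groupBy(x => centers.minBy(c => math.abs(c - x)))
    centers = centers.map(c => groups.get(c).map(g => g.sum / g.size).getOrElse(c))
  }
  // Distortion: sum of squared distances to the nearest center
  val distortion = data.map(x => math.pow(x - centers.minBy(c => math.abs(c - x)), 2)).sum
  (centers, distortion)
}

// `runs` repetitions, keeping the result with the lowest distortion
def kmeansWithRuns(data: Seq[Double], k: Int, runs: Int): (Seq[Double], Double) = {
  val rng = new Random(42)
  (1 to runs).map(_ => kmeans1d(data, k, rng)).minBy(_._2)
}
```

    On data with two obvious groups, say Seq(0.0, 0.1, 10.0, 10.1) with k = 2, every restart converges to centers near 0.05 and 10.05; on harder data different starts yield different distortions, which is exactly what runs smooths over.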

    When this code is executed, an object of type KMeans will be created. The output block will contain information about the clustering results:

    K-Means distortion: 0.00000
    Clusters of 230400 data points of dimension 3:
      0    50813 (22.1%)
      1    51667 (22.4%)
      2    127920 (55.5%)

    One cluster does contain more pixels than the rest. Now we need to mark our collection of pixels with classes from 0 to 2.

    // Label the pixel collection
    val clusteredPixels = (pixels zip clusters.getClusterLabel()).map { case (pixel, cluster) => pixel.copy(clusterNumber = Some(cluster)) }
    // Print 10 labeled pixels
    clusteredPixels.take(10)

    Repaint image


    All that is left is to select the cluster with the largest number of pixels and repaint all the pixels belonging to this cluster gray (that is, change the value of their rgbArray).

    // Gray color
    val grayColor = Array(127.0, 127.0, 127.0)
    // Find the cluster with the largest number of pixels
    val blueClusterNumber = clusteredPixels.groupBy(pixel => pixel.clusterNumber)
        .map { case (clusterNumber, pixels) => (clusterNumber, pixels.size) }
        .maxBy(_._2)._1
    // Repaint all pixels of this cluster gray
    val modifiedPixels = clusteredPixels.map {
        case p: Pixel if p.clusterNumber == blueClusterNumber => p.copy(rgbArray = grayColor)
        case p: Pixel => p
    }
    // Print 10 elements of the new pixel collection
    modifiedPixels.take(10)

    There is nothing complicated here: we simply group by the cluster number (which is our Option[Int]), count the number of elements in each group, and pull out the cluster with the maximum number of elements. Then we change the color to gray only for those pixels that belong to the found cluster.
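    The same step can be seen in isolation on a tiny hand-made list of labels (a minimal standalone sketch, detached from the pixel collection):

```scala
// Minimal standalone illustration of the step above: group the labels,
// count each group, and keep the label of the largest group.
val labels = Seq(Some(0), Some(2), Some(2), Some(1), Some(2))
val biggest = labels.groupBy(identity)
  .map { case (label, members) => (label, members.size) }
  .maxBy(_._2)._1
// biggest is Some(2): it occurs three times, more than any other label
```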

    Create a new image and save the results.


    Gathering a new image from the pixel collection:

    // Create an empty image of the same size
    val modifiedImage = new BufferedImage(width, hight, BufferedImage.TYPE_INT_RGB)
    // Fill it with the repainted pixels
    modifiedPixels.foreach {
        case Pixel(x, y, rgbArray, _) =>
            val r = rgbArray(0).toInt
            val g = rgbArray(1).toInt
            val b = rgbArray(2).toInt
            modifiedImage.setRGB(x, y, new Color(r, g, b).getRGB)
    }
    // Display the new image
    modifiedImage

    Here is what we ended up with.



    We save both images.

    ImageIO.write(testImage, "png", new File("testImage.png"))
    ImageIO.write(modifiedImage, "png", new File("modifiedImage.png"))
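    As an optional sanity check (my addition, not in the original notebook), a saved PNG can be read back with the same javax.imageio API; since PNG is lossless, the pixels survive the round trip:

```scala
import java.awt.image.BufferedImage
import java.io.File
import javax.imageio.ImageIO

// Write a tiny image and read it back to confirm the round trip
val img = new BufferedImage(4, 2, BufferedImage.TYPE_INT_RGB)
img.setRGB(0, 0, 0x7F7F7F)  // one gray pixel, same gray as in the example above
val file = new File("roundtrip.png")
ImageIO.write(img, "png", file)
val loaded = ImageIO.read(file)
// loaded has the same size, and the gray pixel is preserved
```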

    Conclusion


    Machine learning in Scala exists. Implementing basic algorithms does not require dragging in a huge library. The example above shows that development does not force you to give up familiar tools: the same Jupyter Notebook can easily be made to work with Scala.

    Of course, one article is not enough for a complete overview of all of Smile's features, nor was that the plan. The main task, opening the door to the world of machine learning in Scala, is, I think, complete. Whether to use these tools, let alone drag them into production, is up to you!
