
Microsoft Azure ❤ Big Data
About six months ago, I published a retrospective of what was happening in the Microsoft Azure cloud that might interest researchers.
I will continue that topic here, shifting the focus slightly toward the area that has remained the most interesting to me in IT for the last couple of years: Big Data, machine learning, and their symbiosis with cloud technologies.
Below we will mainly discuss the October announcements of Microsoft Azure services, which provide batch and real-time processing of large data sets, high-performance clusters on demand, and broad support for machine learning algorithms.

Real time
Apache Storm in HDInsight
At the Strata + Hadoop World conference in October of this year (2014), support for Apache Storm in HDInsight (a PaaS service that provides Hadoop on demand) was announced.
Apache Storm is a high-performance, scalable, fault-tolerant framework for distributed computation in both near-real-time and batch modes.
As a data source in HDInsight Storm, you can use the Service Bus Queues or Event Hubs cloud services.
At the TechEd Europe 2014 conference, held in late October, the availability of Cloudera Enterprise and the Hortonworks Data Platform as preconfigured Azure VMs (an IaaS offering) was announced.
Azure Event Hubs
Event Hubs, a highly scalable service capable of ingesting millions of events per second in near real time, has reached general availability.
The main features of the service (beyond those already mentioned in the definition):
- incoming event volume: > 1 GB per second;
- number of event producers: > 1 million;
- support for HTTP(S) and AMQP;
- elastic scaling up/down without downtime;
- time-based event buffering with support for ordering;
- throughput limits depending on the pricing tier.
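Since the service accepts plain HTTP(S), a single event can be published with an ordinary POST once a Shared Access Signature token has been computed. Below is a minimal Python sketch of the token computation; the namespace, hub, policy name, and key are made up for illustration:

```python
import base64
import hashlib
import hmac
import time
import urllib.parse

def make_sas_token(uri, key_name, key, ttl_seconds=3600):
    """Build a Service Bus SAS token for the given resource URI."""
    expiry = str(int(time.time()) + ttl_seconds)
    encoded_uri = urllib.parse.quote_plus(uri)
    # Signature is HMAC-SHA256 over "<url-encoded-uri>\n<expiry>"
    string_to_sign = (encoded_uri + "\n" + expiry).encode("utf-8")
    signature = base64.b64encode(
        hmac.new(key.encode("utf-8"), string_to_sign, hashlib.sha256).digest()
    )
    return ("SharedAccessSignature sr={}&sig={}&se={}&skn={}"
            .format(encoded_uri, urllib.parse.quote_plus(signature), expiry, key_name))

# Hypothetical namespace, hub, and key -- substitute your own.
uri = "https://mynamespace.servicebus.windows.net/myhub"
token = make_sas_token(uri, "SendPolicy", "your-key-here")
# An event is then a plain HTTPS POST:
#   POST {uri}/messages   with header   Authorization: {token}
```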
Azure Stream Analytics
Stream Analytics is an event-processing engine for handling large numbers of events in real time. As befits a decent cloud service, Stream Analytics handles loads of more than 1M requests per second, scales elastically, and supports several data sources: Azure Blob Storage and Event Hubs. Transformation rules in Stream Analytics are written in a SQL-like language (yes, yet another SQL-like query language).
Working with the service takes steps similar to those in Azure Data Factory (more on it a little later): create the service itself, define the input data sources and output streams via the web interface, and write a query for the data transformation.
Here is what a query that builds one-second candles from buy/sell orders arriving from an abstract stock exchange might look like:
SELECT DateAdd(second, -1, System.TimeStamp) AS OpenTime,
       System.TimeStamp AS CloseTime,
       Security,
       Max(Price) AS High,
       Min(Price) AS Low,
       Sum(Volume) AS TotalVolume
FROM trades
GROUP BY TumblingWindow(second, 1), Security
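To make the TumblingWindow semantics concrete, here is a rough Python model of what such a query computes: trades are bucketed into non-overlapping one-second windows per security, and each window is collapsed into one candle with its high, low, and total traded volume. This only illustrates the windowing logic; it is not how Stream Analytics is implemented:

```python
from collections import defaultdict

def one_second_candles(trades):
    """trades: iterable of (timestamp_seconds, security, price, volume)."""
    windows = defaultdict(list)  # (window_end, security) -> trades in that window
    for ts, security, price, volume in trades:
        window_end = int(ts) + 1  # non-overlapping ("tumbling") 1-second buckets
        windows[(window_end, security)].append((price, volume))
    candles = []
    for (close_t, security), rows in sorted(windows.items()):
        prices = [p for p, _ in rows]
        candles.append({
            "OpenTime": close_t - 1,
            "CloseTime": close_t,
            "Security": security,
            "High": max(prices),
            "Low": min(prices),
            "TotalVolume": sum(v for _, v in rows),
        })
    return candles
```

Each event falls into exactly one window, which is what distinguishes a tumbling window from a sliding or hopping one.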
In both concept and query language, Stream Analytics is reminiscent of the StreamInsight product. There is not much documentation for the service yet, but a "Get started..." guide has already been written.
HPC
Azure Batch
Azure Batch is an on-demand cluster service. It lets you write highly scalable applications that use high-performance nodes to run many instances of the same task.
The use cases for Azure Batch are traditional for the Big Data world: tasks that work with more data than fits into the RAM of a single node. Applications include genetic engineering, banking, financial exchanges, retail, telecom, healthcare, government agencies, and commercial web services that accumulate large amounts of data.
In addition, Azure Batch suits tasks that need a lot of processing power (calculations that would take too long on the CPUs of a single node), such as rendering, image processing, and video encoding/decoding.
Below is the skeleton that must be implemented to run an application in Azure Batch (I qualify classes with their namespaces so their origin is obvious).
The server-side code will look something like this:
public class ApplicationDefinition
{
    /// <summary>
    /// Register the application
    /// </summary>
    public static readonly Microsoft.Azure.Batch.Apps.Cloud.CloudApplication Application = new Microsoft.Azure.Batch.Apps.Cloud.ParallelCloudApplication
    {
        ApplicationName = "StockTradingAnalyzer",
        JobType = "StockTradingAnalyzer",
        JobSplitterType = typeof(MyApp.TradesJobSplitter),
        TaskProcessorType = typeof(MyApp.TradesTaskProcessor)
    };
}

public class TradesJobSplitter : Microsoft.Azure.Batch.Apps.Cloud.JobSplitter
{
    /// <summary>
    /// Split the job into tasks for parallel processing
    /// </summary>
    /// <returns>The sequence of tasks to run on the compute nodes</returns>
    protected override IEnumerable<TaskSpecifier> Split(IJob job, JobSplitSettings settings)
    {
        /* split the job here */
    }
}

public class TradesTaskProcessor : Microsoft.Azure.Batch.Apps.Cloud.ParallelTaskProcessor
{
    /// <summary>
    /// Run an external process for a single task
    /// </summary>
    /// <param name="task">The task being executed</param>
    /// <param name="settings">Information about the execution environment</param>
    /// <returns>The result of executing the task</returns>
    protected override TaskProcessResult RunExternalTaskProcess(ITask task, TaskExecutionSettings settings)
    {
        /* some magic */
    }

    /// <summary>
    /// Merge the results of the granular tasks into the result of the job
    /// </summary>
    /// <param name="mergeTask">The merge task</param>
    /// <param name="settings">Information about the execution environment</param>
    /// <returns>The result of the job</returns>
    protected override JobResult RunExternalMergeProcess(ITask mergeTask, TaskExecutionSettings settings)
    {
        /* yet another bit of magic */
    }
}
On the client, we call it like this:
Microsoft.WindowsAzure.TokenCloudCredentials token = GetAuthenticationToken();
string endpoint = ConfigurationManager.AppSettings["BatchAppsServiceUrl"];

// create the client
using (var client = new Microsoft.Azure.Batch.Apps.BatchAppsClient(endpoint, token))
{
    // submit the job
    var jobSubmission = new Microsoft.Azure.Batch.Apps.JobSubmission()
    {
        Name = "StockTradingAnalyzer",
        Type = "StockTradingAnalyzer",
        Parameters = parameters,
        RequiredFiles = userInputFilePaths,
        InstanceCount = userInputFilePaths.Count
    };
    Microsoft.Azure.Batch.Apps.IJob job = await client.Jobs.SubmitAsync(jobSubmission);

    // monitor the job
    await MonitorJob(job, outputDirectory); // plenty of code tracking job status and handling errors, omitted for brevity
}
Subjectively, of course, I would like higher levels of abstraction and exotica such as acyclic execution graphs and something like distributed LINQ, as was done in the Naiad and Dryad projects from Microsoft Research.
A short overview of the service is already available on azure.microsoft.com. Those who want to count words can take a look at the tutorial.
Examples of working code (and not the "abstractions" I wrote above) are already available on code.msdn.com (C#) and GitHub (Python).
Azure VM D-Series
Speaking of HPC, I will note the recent announcement of the D-series Azure VMs. As always with such announcements, some nn% was added to what was already there. But to get back to the hard facts: compute nodes with 16 CPU cores, 112 GB RAM, and an 800 GB SSD became available for ~$2.5 per hour (price current as of the end of October 2014).
And more... (I could not come up with a category)
Azure Data Factory
Azure Data Factory is a service that provides tools for orchestrating and transforming data streams and for monitoring data sources such as MS SQL Server, Azure SQL Database, Azure Blobs, and Azure Tables.
The idea is intuitively simple (which counts as a plus): create the service; bind the input and output data sources; create a pipeline (one or more activities that take input data, manipulate it, and write to the output stream); and watch. What simplifies working with the service is that every step except creating the pipeline can be done through the web interface.
Examples of working with the service are already on GitHub, and a "Step by step walkthrough" has been written.
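For a feel of what a pipeline definition looks like, here is a rough sketch of a copy pipeline in the preview's JSON format. All names here are invented, and the exact schema may differ from what the preview accepts; consult the walkthrough for the current one:

```json
{
  "name": "CopyTradesPipeline",
  "properties": {
    "description": "Hypothetical pipeline: copy trades from a blob into Azure SQL",
    "activities": [
      {
        "name": "CopyFromBlobToSql",
        "type": "CopyActivity",
        "inputs": [ { "name": "TradesBlobTable" } ],
        "outputs": [ { "name": "TradesSqlTable" } ],
        "transformation": {
          "source": { "type": "BlobSource" },
          "sink": { "type": "SqlSink" }
        }
      }
    ]
  }
}
```

The inputs and outputs refer to dataset definitions registered separately, which is what lets the portal draw the pipeline as a graph of sources and sinks.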
Machine learning
What I will write in this section is no longer "hot" news, but it is worth repeating anyway.
Most importantly, the first step has been taken: a public preview of Azure ML is available in Azure. Support is declared for R and more than 400 (maybe not exactly 400, but certainly a lot of) pre-installed R packages. At the moment there is also support for Vowpal Wabbit modules.
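A model published from Azure ML is exposed as a REST scoring endpoint that accepts JSON authorized by an API key. Here is a hedged Python sketch of building such a request; the endpoint URL, key, and payload shape are illustrative, and the exact schema for a given experiment is shown on the web service's API help page:

```python
import json
import urllib.request

def build_scoring_request(endpoint_url, api_key, inputs):
    """Build (but do not send) an HTTP request for a published scoring endpoint.

    The payload shape is illustrative -- check the service's API help page
    for the exact schema of your experiment.
    """
    body = json.dumps({"Inputs": inputs}).encode("utf-8")
    return urllib.request.Request(
        endpoint_url,
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer " + api_key,  # key from the Azure ML dashboard
        },
    )

# req = build_scoring_request("https://example.net/score", "my-key", {"price": 42.0})
# resp = urllib.request.urlopen(req)  # actually sends the POST
```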
In addition, the Azure for Research Award Program was launched for research projects; its main goal is to make the world a better place, and its secondary goal is to popularize the Azure platform in general, and the Azure ML service in particular, among the academic community. I quite like both goals (especially since no one forbids Microsoft's cloud competitors from doing the same).
Also on the subject of machine learning, Microsoft has hosted a number of interesting events.
Events
The train has left (these events are over, but recordings may remain)
In September, the Microsoft Machine Learning Hackathon 2014 took place.
In mid-October, the TechEd Europe 2014 conference was held, which XaocCPS has already written about on Habr.
In late October, the Microsoft Azure Machine Learning Jump Start took place.
The train is still moving (and you can still jump aboard!)
On November 15-16, a Big Data hackathon will be held in Moscow, which ahriman has already written about on Habr.
Value your data. Explore. Discover!
Poll: I am interested in what's new in real-time and batch processing of large data sets and machine learning in the clouds of:
- Amazon Web Services: 55.5% (35 votes)
- Google App Engine: 41.2% (26 votes)
- Microsoft Azure: 69.8% (44 votes)
- Another vendor (I will write in the comments): 4.7% (3 votes)
- Not interested; I'm a sleepwalker who accidentally clicked a link and scrolled to the end of the article: 14.2% (9 votes)