Real-time data flow analysis with Azure Stream Analytics

Original author: Anton Staykov


Recently, Microsoft announced a preview of a new service, Azure Stream Analytics, designed for processing streaming data in near real time.

The current version of Azure Stream Analytics connects to Azure Event Hubs and Azure Blob Storage to receive the data stream (called Inputs), and to Event Hubs, Blob Storage, and Azure SQL Database to write results (Outputs). The stream processor is defined in a SQL-like language that lets you specify how the stream data is processed and transformed into reliable, real-time information.
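To give an idea of what such a query looks like, here is a minimal sketch in the service's SQL-like query language. The input and output aliases (inputhub, outputsql) and the payload fields are hypothetical and not taken from the tutorial; they only show the shape of a query that aggregates a stream over a tumbling window:

-- Hypothetical query: count events per device over 10-second tumbling windows
SELECT
    DeviceId,
    COUNT(*) AS EventCount,
    System.Timestamp AS WindowEnd
INTO
    [outputsql]      -- output alias, e.g. an Azure SQL Database table
FROM
    [inputhub]       -- input alias, e.g. an Event Hub
GROUP BY
    DeviceId,
    TumblingWindow(second, 10)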

And here the power of the cloud really shows. In just a few steps and a couple of hours, you can spin up a reliable infrastructure capable of handling tens of thousands of events or messages per second.

I was very curious to find out how much could be achieved with this service, so I put together a test case. The basis for my experiment was the tutorial, which can be found at this link.

The tutorial has a small inaccuracy in the “Start the Job” step: it says you should go to the “Configure” section of your job to set the start time of the job output. This setting is not in the Configure section, however; it is set in the dialog that appears when you start the job.

To make the test more interesting, I changed a few things:
  • Set the Event Hub scale to 10 throughput units, which makes it potentially possible to reach 10,000 events per second.
  • Changed the Event Hub demo code to increase the number of messages sent.
  • Created a small PowerShell script to run N simultaneous instances of the command-line application.
  • Ran all of this from a virtual machine in the same Azure data center (West Europe) where the Event Hub and the Stream Analytics job run.

Changes to the source code for the Service Bus Event Hub demo

I removed all the code I did not need (for example, the code that creates the Event Hub). As a result, my Program.cs looks like this:
static void Main(string[] args)
{
    // Allow many concurrent outbound connections to the Event Hub.
    System.Net.ServicePointManager.DefaultConnectionLimit = 1024;
    eventHubName = "salhub";
    Console.WriteLine("Start sending ...");
    Stopwatch sw = new Stopwatch();
    sw.Start();
    Parallelize();
    sw.Stop();
    Console.WriteLine("Completed in {0} ms", sw.ElapsedMilliseconds);
    Console.WriteLine("Press enter key to stop worker.");
    Console.ReadLine();
}

static void Parallelize()
{
    // 25 tasks, each sending 2,000 events => 50,000 events per run.
    Task[] tasks = new Task[25];
    for (int i = 0; i < 25; i++)
    {
        tasks[i] = new Task(() => Send(2000));
    }
    Parallel.ForEach(tasks, (t) => { t.Start(); });
    Task.WaitAll(tasks);
}

public static void Send(int eventCount)
{
    Sender s = new Sender(eventHubName, eventCount);
    s.SendEvents();
}

With this command-line application I send 25 x 2,000, or 50,000, messages in parallel per run. To make things even more fun, I also run the application itself in parallel, simply starting it 20 times with the following PowerShell script:
for($i=1; $i -le 20; $i++)
{
    start .\BasicEventHubSample.exe 
}

This starts the processes almost simultaneously, and then I wait until they finish, that is, until every process has sent its messages. Twenty runs of 50,000 messages each generate 1,000,000 messages in total, and the overall time is simply that of the slowest process. These numbers are a little rough, of course, but they are enough to give me an idea of what is possible, without having to invest in expensive hardware or develop a complex solution.

Another point: I started my Stream Analytics job before running the command-line applications that push the data, just to make sure the stream processor was already running before I flooded it with data.

A few things to keep in mind. First of all, the Stream Analytics service is still in preview, so there may be hiccups. But the end result is still simply amazing.

Look at the Event Hubs and Stream Analytics charts; they are just awesome. Along the way, I was also convinced that the new performance tiers of Azure SQL Database are equally impressive.

Even with this volume of data flowing through Stream Analytics, the service had no trouble writing its results to a single Basic-tier database (5 DTUs)! Results started appearing in my SQL database table as soon as I switched from the running programs to SQL Server Management Studio, and I could watch them arrive in real time.

And finally, I pumped 1,000,000 events into the Event Hub in just 75 seconds! That works out to more than 13,000 events per second (1,000,000 / 75 ≈ 13,333), all with just a couple of lines of code.

It is great to look at charts like this:

[Stream Analytics dashboard chart]

And it is great to look at charts like this from Azure Event Hubs:

[Azure Event Hubs dashboard chart]

Azure Event Hubs, millions of messages. Just think how long it would take to build an on-premises test lab capable of processing that volume of data.

Below are some of the most important limitations and known issues of the Stream Analytics preview:
  • Geographic availability of the preview service (Central US and West Europe regions only)
  • Streaming unit quota (12 streaming units per region per subscription)
  • UTF-8 is the only supported encoding for CSV and JSON input data sources
  • Some performance counters, such as latency, are not yet available in the preview

Looking at these results, I am convinced that Azure Event Hubs really can deliver a throughput of millions of events per second, and that Stream Analytics really can process that amount of data.
