FPGA accelerators go to the clouds

The arrival on the market of FPGA accelerators that can be reprogrammed as many times as necessary, and in a high-level language such as C, was a real breakthrough in high-performance computing. No less of a breakthrough was the opportunity to use FPGA technology without buying these very expensive adapters (prices in Russia start at 250 thousand rubles), simply by renting a dedicated server with an accelerator in a provider's cloud.

Introduction in 3 paragraphs
The basic differences between FPGA and CPU, GPU
FPGA Applications
The concept of a configuration image (IP core, image)
Difficulties in accessing technology for customers
FPGA Cloud Computing
FPGA Configuration Image Store Concept

Introduction, or FPGA chips in 3 paragraphs

An FPGA (field-programmable gate array) is an integrated circuit (IC) that can be reconfigured for almost any complex computational task. Industry needs application-specific integrated circuits (ASICs) for everything from spacecraft control to financial model calculations. However, before the advent of FPGAs, the strength of such specialized circuits was also their weakness: the functionality was rigidly built into the chip, and both the design complexity and the cost of starting production were high. If the functionality later had to change even slightly, or errors were made at the design stage, an essentially new IC had to be created.

FPGA Accelerator with Intel Altera Arria 10 Chip for PCI Express Bus

FPGA accelerator with Intel Altera Arria 10 chip and 10GE ports

The arrival of FPGA accelerators that can be reprogrammed as many times as necessary, and in a high-level language such as C, was a real breakthrough in high-performance computing. It shortened both development time and time to market. It also opened up completely new opportunities for hardware developers, including work on the design of specialized integrated circuits such as ASICs.

FPGAs have already passed through two stages in terms of the availability of the technology, and today they are actively entering a third. The first FPGAs appeared in 1985, but programming them required knowledge of low-level hardware description languages, roughly the hardware counterpart of assembler. In the second stage, which began around 2013 thanks to the efforts of Altera, it became possible to program them in a high-level C-like language. This dramatically expanded the applicability of FPGAs, but the high cost of the chips still limited the circle of customers who could afford the technology.

Traditionally, the FPGA design and verification flow is extremely time-consuming and requires a high degree of specialization; in complexity it approaches ASIC design. This limits the use of FPGAs by developers. It is especially true for computing applications, where the participants (a programmer, a mathematician, an algorithm designer) want to focus on their task and not on its hardware implementation. To solve this problem, in 2013 Altera introduced support on its FPGAs for OpenCL, the programming standard for heterogeneous computing platforms. This opened FPGAs to developers of computing applications who are not familiar with FPGAs, HDL languages, or design and verification flows. But one problem remained: expensive equipment and design tools.

Finally, from around 2016 we can speak of a third stage, marked by the availability to a wide range of customers of ready-to-use servers (physical and virtual) with FPGAs in the clouds of the largest data center operators: Amazon Web Services (AWS), Alibaba Cloud and Huawei Cloud. In Russia, dedicated servers with FPGAs first became available in the Selectel data center in 2017.

Why might FPGA accelerators be needed? On the one hand, data streams keep growing; on the other, it is getting harder to increase computing power without increasing the size and power consumption of the computing system. An application typically has control tasks and data-intensive processing tasks. It makes sense to leave the control tasks to the CPU and send the processing tasks to a specialized resource. Configuration on the fly for a given task also looks like a very useful property. Synthesizing a computing resource on an FPGA for a specific task should bring gains both in performance and in power consumption. In addition, an FPGA has fast internal memory and a well-developed (and reconfigurable) communication fabric that can implement almost any known input-output protocol, along with resources for organizing cache memory, hardware DSP blocks, memory controllers and so on. In other words, it is a full-fledged system on a chip with the ability to synthesize a specific computing core for each task.

The basic differences between FPGA and CPU, GPU

What types of accelerators are available today? Multi-core Xeon processors (CPUs), GPUs and FPGAs; we consider them below.

Each type of processor, general-purpose (CPU), graphics (GPU) or FPGA, has its own advantages; otherwise it would not be produced. CPUs provide good performance with the highest degree of versatility. About 99% of all existing programs are written to run on a CPU. GPUs have a larger number of cores and a vector architecture, plus high-speed memory and I/O. FPGAs have the highest performance per watt thanks to the nature of the hardware, but require very careful and time-consuming programming.

These differences in a little more detail:

General-purpose CPUs are essentially the workhorses of the IT industry. They can be used for a wide variety of tasks, but because of their architecture they are not very effective for parallel computing. In recent years this problem has been partially solved by putting multiple cores on the processor die. However, even in the most powerful CPUs the number of cores is still measured in a few dozen.
Graphics processors (GPUs) worked for many years only in the niche of displaying information on the screen. Only relatively recently did GPUs begin to be used for high-performance computing tasks, including cryptocurrency mining. Treating graphics as vector tasks drove the GPU architecture in a direction that turned out to be well suited to parallel computing. As a result, a modern GPU can push vectorized data through its pipelines much faster than a CPU, where the same data would have to pass through many other logic blocks with a corresponding loss in performance. Modern GPUs contain several thousand processor cores on a chip.
FPGAs, in contrast to general-purpose and graphics processors, can be reprogrammed to match the features of the computing problem being solved. In effect, a specialized processor is synthesized for a specific task. Other important distinctions of FPGAs are lower power consumption per unit of computing power and an architecture that executes many vector operations in parallel, the so-called massively parallel fine-grained architecture. The number of cores in an FPGA chip can reach one million or more.

An FPGA accelerator is, as a rule, a board in one of several form factors (VPX, COM Express, PCIe, etc.) which, in addition to one or more FPGA chips, carries SRAM and DRAM memory, including the very recent HBM (high-bandwidth DRAM), and high-speed I/O interfaces such as the popular 10/40/100 GE and PCI Express. FPGA accelerators are also available in the SOM form factor (system-on-module, akin to a single-board computer) for embedded systems, which is popular in video analytics and industrial applications.

SOM FPGA Accelerator

Each FPGA chip contains an array of up to 5 million logic elements (a lookup table plus flip-flops), which can be reprogrammed for different functional tasks. In addition, there are dedicated hardware resources: cache memory, signal processors, digital processing units and interface units.
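
To make the idea of a reprogrammable logic element more concrete, here is a minimal Python sketch of the lookup table found in a typical logic element. It assumes a 4-input table; the names `make_lut` and `configure` are invented for this illustration and do not correspond to any vendor tool. Loading a different table turns the same element into a different logic function, which is the essence of reconfiguration.

```python
from itertools import product


def configure(func, k):
    """Build the 2**k configuration bits (truth table) for a k-input boolean function."""
    return [int(func(*bits)) for bits in product((0, 1), repeat=k)]


def make_lut(truth_table):
    """Return a k-input logic function backed by a 2**k-entry lookup table."""
    size = len(truth_table)
    k = size.bit_length() - 1
    if size < 1 or size != 1 << k:
        raise ValueError("truth table length must be a power of two")

    def lut(*bits):
        if len(bits) != k:
            raise ValueError(f"expected {k} input bits")
        index = 0
        for bit in bits:  # the first argument is the most significant address bit
            index = (index << 1) | (bit & 1)
        return truth_table[index]

    return lut


# The same "hardware" (a 16-entry table) configured as two different gates.
and4 = make_lut(configure(lambda a, b, c, d: a & b & c & d, 4))
xor4 = make_lut(configure(lambda a, b, c, d: a ^ b ^ c ^ d, 4))

print(and4(1, 1, 1, 1), and4(1, 0, 1, 1))
print(xor4(1, 0, 1, 1), xor4(1, 1, 0, 0))
```

A real device adds flip-flops for state and programmable routing between elements; this sketch covers only the combinational table.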

Why can an FPGA outperform an ASIC? The answer is simple: thanks to more advanced manufacturing processes. FPGAs are made on 20 nm and even 14 nm processes, while ASIC dies are often made on older processes at the 60 nm level. Accordingly, on the same die area an FPGA can have many times more logic cells than an ASIC, which yields a performance gain.
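
As a rough back-of-the-envelope check of "many times more", the sketch below assumes idealized scaling, where the number of elements per unit area grows with the square of the linear shrink. This is a simplification: process node names are not literal feature sizes, and real density gains are smaller. The function name is invented for this illustration.

```python
def ideal_density_gain(old_nm, new_nm):
    """Idealized gain in elements per unit area when the linear feature size shrinks."""
    return (old_nm / new_nm) ** 2


for node in (20, 14):
    gain = ideal_density_gain(60, node)
    print(f"60 nm -> {node} nm: about {gain:.1f}x more elements per unit area")
```

Under this assumption the move from 60 nm to 20 nm gives about a 9-fold gain, and the move to 14 nm about an 18-fold gain.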

FPGA Applications

From its invention to the present day, one of the main uses of FPGAs has been, and remains, the prototyping of chips for small and medium-volume products, where producing an ASIC is not economically feasible.

At the beginning of 2018, according to the Russian company Almaz-SP, the scope of application of FPGA accelerators was as follows:

50% - special applications in military electronics,
20% - telecommunications (equipment of GSM base stations, etc.),
10% - processing of video streams (video studios, video analytics),
10% - industrial use,
10% - prototyping and more (including scientific calculations).

However, despite the predominantly military use in the past, civilian use of FPGA accelerators is now growing much faster. In 2015 Intel acquired Altera, one of the largest FPGA manufacturers. Altera's designs are now produced in silicon under the Intel brand, and a new line of FPGA chips, Intel Cyclone 10, was not long in coming. The Cyclone 10 GX models show very high performance (up to 134 GFLOPS) and have advanced I/O capabilities. They connect to other devices through a 10GE network port or over a PCI Express x4 bus. These FPGA chips are designed for machine vision, surveillance, video broadcasting and robotics. The junior model, Cyclone 10 LP, serves as a computing core for engineering systems, for example the control of sensor complexes.

In addition to the Cyclone line, Intel's product range also includes other FPGA series inherited from Altera: MAX, Arria and Stratix. The last two are the most powerful FPGA chips on the market; in 2018 they are expected to be upgraded to Arria 10 and Stratix 10. Stratix 10 will be built on the HyperFlex architecture and deliver 10 teraflops, which is almost two orders of magnitude more than Cyclone 10 (roughly 75 times its 134 GFLOPS).

The Cyclone, MAX, Arria and Stratix series partially overlap in performance, but Intel positions each series separately. For Arria, this is signal processing for instrumentation; for Stratix, high-performance computing in data centers and telecommunications. We have already covered the applications of the Cyclone series, which was the only one to receive updates in 2017. One more application for Cyclone is definitely worth mentioning: the Internet of Things (IoT).

More than 50% of cases of using FPGA accelerators are in military and industrial electronics, but the sphere of civilian tasks and scientific calculations is growing rapidly.

The concept of image in FPGA technology

Above we listed Intel's currently popular FPGA chip series, but to use them in servers you need to purchase FPGA accelerator cards and program the chip logic on the adapter for a specific application. Adapter cards are available from Intel partners in the FPGA Design Solutions Network. In Russia one such partner is Almaz-SP LLC (also a participant in the Euler project), which supplies both original Intel adapters and boards of its own design with latest-generation FPGAs.

Demonstration of a server with an FPGA accelerator at the SelectelTechDay #2 conference; in the center, Anton Vista, representative of Almaz-SP LLC

Demonstration of a server with an Almaz-SP FPGA accelerator at SelectelTechDay #2

Hardware innovation demo at SelectelTechDay #2. First on the left: the FPGA server from Almaz-SP

Demo zone of hardware innovations at SelectelTechDay #2 (FPGA is the first stand on the left)

If you want to skip the hardware design flow and focus on the computational task, you can use OpenCL and the Intel FPGA SDK for OpenCL. This requires a board support package (BSP) that hides the complexities of building a system on a chip (memory controllers, PCIe, interfaces, clock domains, timing constraints, partial reconfiguration, etc.). Such a package is provided if the board has OpenCL support (an OpenCL BSP). With it you get a software developer's environment: a platform model, a function to accelerate, a runtime support library, a memory model, and special extensions to increase throughput. From there the work is writing code, profiling and optimization.

The SDK and BSP produce a single configuration file (bitstream) with which the FPGA is configured, yielding a complete system on a chip for a specific computing task. The result of programming is firmware that solves a specific application problem (for example, calculating a matrix of equations, converting video formats, etc.). Such firmware is called an FPGA image. Quite often the term "IP core" is used instead of "image".

An FPGA image is the control firmware for an FPGA chip, developed and debugged to perform a specialized computing task.

Difficulties in accessing FPGA technology for customers

Despite the attractive concept of the highest performance for a specific computing task, two objective factors stand in the way of widespread FPGA adoption: the high cost of an adapter with an FPGA chip, and a shortage of developers with practical experience in programming and debugging FPGA cores.

In addition to the accelerator, you must also buy a license for the Intel FPGA SDK for OpenCL; without it you can only run already compiled kernels, not compile them. The requirements for the developer's computer are also high: the recommended RAM capacity is 18-48 GB. On a machine with an 8-core CPU and 32 GB of memory, compiling a kernel that calculates the Mandelbrot set takes about 2 hours. If processor utilization exceeds 90%, compilation may take a day or more. With less than 16 GB of memory, compilation may not be possible at all.

As a result, potential customers are actively interested in the technology but are in no hurry to buy FPGA accelerators. They mainly fear that the cost of the accelerators will weigh heavily on their IT budget and that the in-house team will not master the programming and debugging of FPGA images at the required level.

FPGA Cloud Computing

FPGA cloud services emerged as a response to the high cost of accelerator boards with an FPGA chip. Customers are offered physical and/or virtual servers for rent with FPGA accelerators installed in them. As a rule, this is a joint product of a manufacturer (for example, Intel) and a data center acting as an IaaS provider.

An FPGA server with an accelerator from Almaz-SP can be tested for free in the Selectel data center

Renting FPGA-based computing power looks like one solution to the problem of making the technology accessible to a mass audience. At Selectel, the service gives access to a server with an installed Euler accelerator, manufactured by the Euler Project and based on the Intel Arria 10 FPGA. The server comes with the SDK and BSP needed for developing, debugging and compiling OpenCL kernels, and with development tools for writing host applications (Visual Studio). As an introductory demonstration, the Mandelbrot set example mentioned earlier is offered: the project is provided as source code and is configured for compilation.
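
The demo project's source is not reproduced in this article. As a rough idea of the arithmetic such a kernel performs, here is a minimal plain-Python sketch of the escape-time computation. The function names, the viewport and the iteration limit are illustrative assumptions and are not taken from the Selectel demo. In an OpenCL C kernel the two loops in `render` would typically disappear, with each work-item deriving its own pixel from `get_global_id`.

```python
def mandelbrot_iterations(cx, cy, max_iter=100):
    """Escape-time count for the point c = cx + i*cy (max_iter means 'did not escape')."""
    x = y = 0.0
    for i in range(max_iter):
        if x * x + y * y > 4.0:  # |z| > 2: the orbit is guaranteed to diverge
            return i
        x, y = x * x - y * y + cx, 2.0 * x * y + cy  # z = z*z + c
    return max_iter


def render(width, height, max_iter=100):
    """Evaluate every pixel of an assumed viewport [-2, 1] x [-1, 1]; needs width, height >= 2."""
    rows = []
    for py in range(height):  # on an accelerator, each (px, py) is an independent work-item
        row = []
        for px in range(width):
            cx = -2.0 + 3.0 * px / (width - 1)
            cy = -1.0 + 2.0 * py / (height - 1)
            row.append(mandelbrot_iterations(cx, cy, max_iter))
        rows.append(row)
    return rows


if __name__ == "__main__":
    for row in render(60, 20, max_iter=50):
        print("".join("#" if n == 50 else "." for n in row))
```

Every pixel is computed independently of the others, which is why this example is a common first test for parallel hardware.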

The Euler Project offers an OpenCL programming course for FPGAs that is open to everyone. The course is designed specifically for a Russian audience: engineers, researchers and students of technical universities. It incorporates the material of the official Intel training and makes it possible to study the technology step by step, from building the simplest application to applying specific optimization methods that are sometimes absolutely necessary to achieve good performance.

In this form FPGA technology becomes more attractive to customers, since they no longer need to purchase the hardware themselves, and capital costs are replaced by operating costs. Accordingly, the range of companies that can afford FPGA-accelerated computing for their projects expands significantly.

The cloud model of using servers with FPGA accelerators opens the technology to many new customers who would like to see how it works on their specific projects and computing tasks.
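
The trade-off between buying and renting can be estimated with simple arithmetic. The sketch below uses the purchase price of 250 thousand rubles quoted at the start of the article; the monthly rent is a purely hypothetical figure chosen for illustration and is not a Selectel price. Costs such as power, hosting and the SDK license are ignored.

```python
import math


def break_even_months(purchase_price, monthly_rent):
    """Number of whole months after which total rent reaches the purchase price."""
    if monthly_rent <= 0:
        raise ValueError("monthly_rent must be positive")
    return math.ceil(purchase_price / monthly_rent)


PURCHASE_PRICE_RUB = 250_000        # starting price quoted in the article
HYPOTHETICAL_RENT_RUB = 25_000      # assumed monthly rent, for illustration only

months = break_even_months(PURCHASE_PRICE_RUB, HYPOTHETICAL_RENT_RUB)
print(f"Renting costs less than buying for projects shorter than about {months} months")
```

For a short pilot project the rented server is likely to be cheaper under these assumptions; for long-running workloads the comparison should be repeated with real prices.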

FPGA Image Store Concept

Creating an efficient FPGA image for a specific application is a rather laborious and time-consuming task. Even a well-coordinated team may need up to a couple of months to program an image, and less experienced customers will spend much more time, or may not cope with the task at all.

The concept of an image store therefore suggests itself, by analogy with the existing application stores for platforms such as MacOS, Windows or Android. Developers could submit working images they have created for various tasks, and customers could purchase them to load onto their servers with FPGA accelerators, provided the images match the computing tasks in their projects.

In 2018 Selectel began work on such a store of FPGA images that could be used on rented Selectel servers with this technology. This would significantly speed up the development cycle of new projects for customers, while the programmers themselves (the author teams) would earn some income from work already done and would also be protected from pirated distribution of their images without their consent.

Useful link:

Free testing of a dedicated server with an FPGA adapter in Selectel Labs
