AI architecture optimization: the race has begun

Original author: George Lawton
As AI architectures improve and costs fall, experts say that more and more enterprises will adopt these technologies, spurring innovation and paying significant dividends both for companies and for AI vendors.

AI applications often run on completely different architectures than traditional enterprise applications, and vendors are going to great lengths to supply the new components that are in growing demand.

“The computing technology industry is undergoing major changes. Enterprise interest in AI is driving innovation that will help develop and deploy AI at any scale,” said Keith Streer, an AI expert and consultant at EY. “Investors are pouring huge sums into startups that optimize AI, and major manufacturers are starting to offer not only chips and storage but also the networking and cloud services required for deployment.”
According to him, the main task for CIOs now is to select an AI architecture that fits the company's needs.

Streer says that because AI is math at an unprecedented scale, implementing the technology requires completely different technical specifications and security tools than typical corporate workloads. To capture the full benefits of AI, vendors will need to provide the technical infrastructure, cloud offerings and other services without which such complex calculations would be impossible.

But we are on our way there, and even more advanced AI architectures will appear in the future. Streer believes that the flexibility, power and speed of these computing architectures will be delivered not only by small high-performance computing firms but by the broader HPC industry, including chip and cloud-service startups striving to set a high bar for AI.

As more AI specialists and developers appear, the technology will become more accessible, which will spur innovation and bring significant dividends for companies and vendors alike.

In the meantime, CIOs should familiarize themselves with the challenges of building an AI architecture for corporate use so they are prepared to solve them.

Chip development


The development of graphics processors (GPUs), field-programmable gate arrays (FPGAs) and specialized AI chips has become the most important condition for the transition from traditional computing architectures to AI. The spread of GPU- and FPGA-based architectures will improve the performance and flexibility of computing and storage systems, allowing solution providers to offer a range of advanced services for AI and machine learning applications.

“These are chip architectures that offload many advanced functions [for example, AI training] and help implement an improved compute and storage stack with unsurpassed performance and efficiency,” says Surya Varanasi, founder and technical director of Vexata Inc., a data management solutions company.

But so far the new chips cannot do anything more sophisticated on their own. Finding the optimal architecture for AI workloads means running large-scale computations that demand high bandwidth and are sensitive to latency. High-speed networking is the key to success here, but many AI algorithms must wait until the next batch of data is collected, so latency cannot be overlooked.

In addition, when data crosses server boundaries or moves between servers and storage, it passes through several protocols. To simplify these processes, data specialists may try to keep data local so that one server can process large chunks without waiting for the others. Tighter integration between GPUs and storage also helps save money. Other vendors are looking for ways to simplify AI server designs to ensure compatibility, so the same servers can be used for different workloads.

Non-volatile memory express (NVMe) for AI workloads


Many GPU-based solutions rely on direct-attached storage (DAS), which greatly complicates distributed training and inference for AI. As a result, setting up and managing these data pipelines for deep learning becomes a complex and time-consuming task.

Non-volatile memory express (NVMe), an interface originally designed to provide fast connectivity between solid-state drives (SSDs) and traditional enterprise servers, can solve this problem. This type of interface is now often extended across the I/O fabric to optimize AI workloads.

The idea is that NVMe over Fabrics (NVMeF), as these interfaces are called, will help cut the cost of converting between network protocols and manage the characteristics of each type of SSD. That will allow CIOs to justify the cost of AI applications that work with large data sets.

NVMeF interfaces carry their own risks, including the high cost of such advanced technology. In addition, the industry still depends heavily on NVMeF suppliers, so when choosing a product, CIOs should try to avoid vendor lock-in.
But implementing NVMeF is another step toward optimizing an enterprise AI architecture, Varanasi said.

“Although industrial-scale adoption of the NVMe over Fabrics architecture may take another year or a year and a half, the main components are already in place, and early adopters are reporting promising results,” says Varanasi.


CIOs who want to develop AI applications can consider building a shared NVMeF storage pool optimized for AI if, in the short term, it can successfully replace existing storage networks. But waiting until NVMeF is backward compatible could mean losing a lot.

Reducing data movement


When planning the different stages of an AI deployment, pay special attention to the cost of data movement. AI projects require huge amounts of data for processing and transformation as well as for training algorithms.

The hardware and human resources needed for these tasks, as well as the time spent moving the data itself, can make AI projects too costly. If CIOs manage to avoid moving data between stages, they are likely to build a robust AI infrastructure that meets these needs, says Haris Pozidis, PhD, a manager and specialist in storage acceleration at IBM Research. Manufacturers are already working on the problem.

For example, at its Zurich laboratories IBM is experimenting with various hardware and software optimizations to reduce data movement for large-scale AI applications. These optimizations yielded a 46-fold performance improvement in a test of a popular click-analysis tool. Pozidis says the work relies on distributed training and GPU acceleration with improved support for sparse data structures.

Parallelism is another important ingredient in speeding up AI workloads. Distributed training requires changes at both the hardware and software levels so that algorithms run efficiently across parallel GPUs. IBM researchers have built a prototype data-parallel platform that can scale to and train on data sets larger than the memory of a single machine, which is essential for large-scale applications. The new platform, optimized for communication-efficient training and data locality, has helped reduce data movement.
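To illustrate the data-parallel idea, here is a minimal sketch using PyTorch's DistributedDataParallel. The toy model, data set and launch settings are assumptions made for the example and have nothing to do with IBM's prototype platform; the point is that each worker trains on its own shard, so the full data set never has to fit into a single machine's memory.

```python
# Minimal data-parallel training sketch (PyTorch DistributedDataParallel).
# The model, data set and hyperparameters are placeholders for illustration only.
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset, distributed

def main():
    # One process per GPU, launched e.g. with `torchrun --nproc_per_node=4 train.py`
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(rank)

    # Toy data; in practice each worker would stream its own shard from storage.
    data = TensorDataset(torch.randn(10_000, 128), torch.randint(0, 2, (10_000,)))
    sampler = distributed.DistributedSampler(data)   # gives each rank a distinct shard
    loader = DataLoader(data, batch_size=256, sampler=sampler)

    model = DDP(torch.nn.Linear(128, 2).cuda(rank), device_ids=[rank])
    opt = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = torch.nn.CrossEntropyLoss()

    for epoch in range(3):
        sampler.set_epoch(epoch)                      # reshuffle shards each epoch
        for x, y in loader:
            opt.zero_grad()
            loss = loss_fn(model(x.cuda(rank)), y.cuda(rank))
            loss.backward()                           # gradients are all-reduced across GPUs
            opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```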

At the hardware level, IBM researchers used NVMeF to improve the interconnectivity of the graphics processor, CPU, and memory components on servers, as well as between servers and storage.

“The performance of different AI workloads can be limited by network bottlenecks, memory bandwidth, and the bandwidth between the CPU and the GPU. But by implementing more efficient interconnect algorithms and protocols in all parts of the system, you can take a big step toward faster AI applications,” said Pozidis.


Composable computing

Today, most workloads use a pre-configured database optimized for a specific hardware architecture.


Chad Miles, vice president of analytics products and solutions at Teradata, says the market is moving toward software-driven hardware, which will allow organizations to intelligently distribute processing across graphics processors and CPUs, depending on the current task.
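A minimal sketch of what such task-based routing could look like in software is shown below; it assumes PyTorch, and the task names and thresholds are hypothetical illustrations rather than anything Teradata offers.

```python
# Hypothetical sketch: pick a device per task, in the spirit of software-defined
# distribution of work across CPUs and GPUs. Task names and thresholds are made up.
import torch

def pick_device(task: str, data_rows: int) -> torch.device:
    """Route heavy, parallelizable tasks to a GPU when one is available."""
    gpu_friendly = {"model_training", "deep_learning_inference", "matrix_scoring"}
    if task in gpu_friendly and data_rows > 100_000 and torch.cuda.is_available():
        return torch.device("cuda")
    return torch.device("cpu")          # small or I/O-bound jobs stay on the CPU

def run(task: str, features: torch.Tensor) -> torch.Tensor:
    device = pick_device(task, features.shape[0])
    weights = torch.randn(features.shape[1], 1, device=device)  # placeholder model
    return (features.to(device) @ weights).cpu()

scores = run("matrix_scoring", torch.randn(500_000, 64))
print(scores.shape)   # torch.Size([500000, 1])
```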


The difficulty is that enterprises use different compute engines to access different storage options. Large corporations prefer to keep valuable data that needs regular access, such as customer, financial, supply-chain and product information, in high-performance I/O environments, while rarely used data sets, such as sensor readings, web content and multimedia, go to low-cost cloud storage.

One of the goals of composable computing is to use containers to optimize the performance of instances such as SQL engines, graph engines, and machine learning and deep learning engines that access data distributed across these different stores. Deploying several analytical compute engines makes it possible to build models that draw on data from different engines and, as a rule, produce better results.

IT vendors such as Dell Technologies, Hewlett Packard Enterprise and Liquid are gradually moving away from traditional architectures that assign workloads at the compute level. Instead, they aim to assign AI workloads to an entire system consisting of CPUs, GPUs, memory and storage devices. Such a transition requires mastering new network components that increase speed and reduce latency between the system's various components.

For example, many cloud data centers use Ethernet to connect compute and storage, with latency of about 15 microseconds. InfiniBand, a high-speed switched fabric used in many converged infrastructures, can reduce latency to 1.5 microseconds. Liquid has created a set of tools for connecting nodes over PCI Express (PCIe), which cuts latency to 150 nanoseconds.

In addition, some experts suggest giving the GPUs that handle large workloads more memory with fast connections. For example, DDR4 RAM is often used, which reduces latency to about 14 nanoseconds, but it only works over short distances of a few inches.

Little Marrek, founder and developer of the ClusterOne AI management service, believes even more work will be needed to ensure compatibility of AI workloads at the software level. Although some enterprises are already achieving this with Docker and Kubernetes, it is too early to apply the same approach to GPUs.

"In general, doing the workloads of the GPU and tracking them is not so easy," says Marrek. “There is no universal solution yet that will allow monitoring of all systems.”



Storage and graphics processors


Another approach is to use the GPU to preprocess data, reducing the volume needed for a particular type of analysis and helping to organize and tag the data. This makes it possible to prepare a suitable data set for the several GPUs involved in processing, so the algorithm can work in memory instead of pulling data from storage over slow networks.
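Below is a minimal sketch of that kind of GPU-side reduction, assuming CuPy and an NVIDIA GPU; the random data, the derived "signal" feature and the filter threshold are made-up illustrations.

```python
# Hypothetical sketch: filter and tag raw records on the GPU so that only the
# reduced working set stays in GPU memory (assumes CuPy and an NVIDIA GPU).
import cupy as cp
import numpy as np

# Pretend this chunk was read from storage; in practice it would arrive in batches.
raw = np.random.rand(5_000_000, 4).astype(np.float32)       # ~80 MB of float32 rows

chunk = cp.asarray(raw)                                      # copy the chunk to the GPU once
signal = chunk[:, 0] * 2.0 + chunk[:, 1]                     # derived feature, computed on GPU
mask = signal > 1.5                                          # keep only "interesting" rows
reduced = chunk[mask]                                        # typically a small fraction of rows
labels = (reduced[:, 2] > 0.5).astype(cp.int8)               # cheap tagging pass on the GPU

print(f"kept {reduced.shape[0]:,} of {raw.shape[0]:,} rows "
      f"({reduced.shape[0] / raw.shape[0]:.1%}), {int(labels.sum()):,} tagged positive")
```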

“We see storage, computing and memory as separate components of a solution, a separation that developed historically, and that is why we are trying to increase processing volumes,” says Alex St. John, technical director and founder of Nyriad Ltd., a storage software company that grew out of research for the world's largest radio telescope, the Square Kilometre Array (SKA). The larger the data volumes, the harder it is to move them somewhere else for processing.

The SKA telescope requires enormous computing power to process 160 TB of radio data in real time, which became the researchers' main obstacle. As a result, they decided to abandon the RAID storage most often used in data centers and deploy a parallel cluster file system such as BeeGFS, which simplifies preparing data for AI workloads.

CIOs working out the optimal strategy for an AI architecture should pay particular attention to usability. If developers, data specialists and DevOps teams can quickly master a new technology, they can invest their time and energy in building successful business logic instead of solving deployment and data pipeline problems.

In addition, organizations need to carefully calculate how much time and effort it will take to build a new AI architecture into an existing ecosystem.

“Before implementing new infrastructure and planning large workloads, CIOs need to assess how much of their scarce resources this will require,” said Asaf Somekh, founder and CEO of Iguazio.
