What is automated machine learning (AutoML), and is it going to deprive data scientists of work?
Ever since automated machine learning (AutoML) tools such as Google AutoML first appeared, experts have debated whether they are ready for full enterprise integration and use. The tools' descriptions state that anyone can take on the role of a “data scientist” and build production-ready machine learning models without the technical background traditionally required.
Automated machine learning is certainly changing how enterprises perform data analysis, but the technology is not yet ready to put data specialists out of work. One of its main claims is that automatically created models match the quality of equivalent models built by a team of data scientists and are produced far more quickly.
Although AutoML models are faster to create, they are effective only when the problem they address is stable and recurring. Most AutoML models work well and achieve consistent quality under these conditions, but the more complex the data problem, the more specialist intervention is needed to understand what the AutoML system has produced and turn it into something useful. To understand some of these limitations, let's look at the AutoML process in more detail.
AutoML tools simplify data processing by doing as much as possible with the information available. The process consists of three main stages:
The first stage is feature “extraction”, which improves the performance of the generated models by creating additional information for them to learn from. Done by hand, this takes a lot of time: a data specialist has to identify the relationships between data elements almost manually, work out how to present that information as additional data fields the machine can train on, and decide whether the data is complete enough to build a model.
This is an important step, because these additional fields very often make the difference between an unsuitable model and an excellent one. AutoML is programmed to use a limited range of data discovery methods, usually tuned to the “average” data problem. This limits the final performance of the model, because the tool cannot draw on the knowledge of a subject-matter expert (SME), which can be critical to success and which a data specialist can use in their work.
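As a rough illustration of what such additional data fields can look like, the sketch below derives a few features from a raw payment record and the customer's earlier payments. The field names (`amount`, `country`, `timestamp`) and the derived features are hypothetical, chosen only for the example; a real project would pick them together with a domain expert.

```python
from datetime import datetime
from statistics import mean


def engineer_features(txn, history):
    """Derive extra fields from one transaction and the customer's past ones.

    All field names here are hypothetical and exist only for this sketch.
    """
    past_amounts = [h["amount"] for h in history]
    # With no history, fall back to the transaction's own amount (ratio 1.0).
    avg_amount = mean(past_amounts) if past_amounts else txn["amount"]
    ts = datetime.fromisoformat(txn["timestamp"])
    return {
        "amount": txn["amount"],
        # How unusual is this amount for this particular customer?
        "amount_vs_customer_avg": txn["amount"] / avg_amount if avg_amount else 0.0,
        # Expert knowledge encoded as a rule: night-time payments are treated as riskier.
        "is_night": ts.hour < 6,
        # Has this customer never paid from this country before?
        "new_country": txn["country"] not in {h["country"] for h in history},
    }


history = [
    {"amount": 20.0, "country": "DE", "timestamp": "2024-01-01T12:00:00"},
    {"amount": 30.0, "country": "DE", "timestamp": "2024-01-02T13:00:00"},
]
txn = {"amount": 250.0, "country": "BR", "timestamp": "2024-01-03T03:15:00"}
features = engineer_features(txn, history)
```

None of the three derived fields is present in the raw record. Each one encodes an assumption about what matters, which is the kind of knowledge a generic tool has no way of supplying on its own.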
Many data problems begin with significant mental effort to select which data to present to the algorithm. Feeding all the data you have into the system can lead to a poorly fitting model, because the data usually contains many different, often conflicting signals that must be targeted and modeled individually.
This is especially true of fraud, where different geographical regions, payment channels, and so on show very different types of fraud. Discovering these patterns and designing appropriate datasets for accurate detection is still largely manual work. A general-purpose automated approach to this problem is currently not practical because of its enormous complexity.
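To make the idea of conflicting signals concrete, here is a minimal sketch that splits transactions into segments by region and payment channel so that each segment can be examined, and later modeled, on its own. The records, field names, and choice of segment keys are invented for illustration.

```python
from collections import defaultdict


def split_by_segment(transactions, keys=("region", "channel")):
    """Group transactions so that each segment can get its own dataset."""
    segments = defaultdict(list)
    for txn in transactions:
        segments[tuple(txn[k] for k in keys)].append(txn)
    return dict(segments)


def fraud_rate(rows):
    """Share of rows in a segment that are labeled as fraud."""
    return sum(r["is_fraud"] for r in rows) / len(rows)


# Hypothetical records: large amounts are fraudulent online in the EU,
# harmless in EU stores, while in the US it is a tiny online payment that is fraud.
transactions = [
    {"region": "EU", "channel": "online", "amount": 900.0, "is_fraud": 1},
    {"region": "EU", "channel": "online", "amount": 40.0, "is_fraud": 0},
    {"region": "EU", "channel": "in_store", "amount": 950.0, "is_fraud": 0},
    {"region": "EU", "channel": "in_store", "amount": 30.0, "is_fraud": 0},
    {"region": "US", "channel": "online", "amount": 5.0, "is_fraud": 1},
    {"region": "US", "channel": "online", "amount": 700.0, "is_fraud": 0},
]

segments = split_by_segment(transactions)
rates = {segment: fraud_rate(rows) for segment, rows in segments.items()}
```

In this toy data, a single rule such as "large amounts are suspicious" would be right in one segment and wrong in the other two. Deciding which keys to segment on is exactly the judgment that still falls to a person.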
The next stage is model generation. Models with different configurations are created and trained on the data from the previous stage. This matters because a single default configuration will almost never give the best results for every problem.
At this point, AutoML systems have an edge over data experts because they can create a huge number of candidate models in a very short time. Many AutoML systems strive to be universal and produce only deep neural networks, which can be overkill for tasks where a simpler model, such as logistic regression or a decision tree, would be more suitable and would benefit from hyperparameter optimization.
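The sketch below shows the basic shape of this stage: enumerate a grid of configurations, evaluate each one on held-out data, and record the scores. To keep it self-contained it uses a tiny k-nearest-neighbours classifier written in plain Python rather than the logistic regression or decision trees mentioned above, and the data points are made up. Real AutoML systems search far larger spaces with more sophisticated strategies.

```python
from itertools import product


def minkowski(a, b, p):
    """Minkowski distance of order p (p=1 Manhattan, p=2 Euclidean)."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)


def knn_predict(train, point, k, p):
    """Majority vote among the k training points nearest to `point`."""
    nearest = sorted(train, key=lambda row: minkowski(row[0], point, p))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if votes * 2 > k else 0


def accuracy(train, valid, k, p):
    """Fraction of validation points classified correctly."""
    hits = sum(knn_predict(train, x, k, p) == y for x, y in valid)
    return hits / len(valid)


# Made-up 2-D data; the point (0.2, 0.25) is deliberately mislabeled noise.
train = [
    ((0.1, 0.2), 0), ((0.2, 0.1), 0), ((0.3, 0.3), 0), ((0.2, 0.25), 1),
    ((0.8, 0.9), 1), ((0.9, 0.8), 1), ((0.7, 0.7), 1),
]
valid = [((0.15, 0.15), 0), ((0.25, 0.2), 0), ((0.85, 0.85), 1), ((0.75, 0.8), 1)]

# Every combination of neighbour count k and distance order p is one candidate.
grid = list(product([1, 3, 5], [1, 2]))
results = {(k, p): accuracy(train, valid, k, p) for k, p in grid}
best_config = max(results, key=results.get)
```

With this data, `k=1` is misled by the noisy training point while larger values of `k` are not, which is why trying several configurations instead of relying on one default pays off even for a very simple model.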
The final stage is mass performance testing and selection of the best performer. Some manual work is still required here, not least because the user must select the right model for the task. A fraud risk model that identifies 100% of fraud cases is useless if it also casts doubt on every authorization.
In the current manual process, data specialists work with SMEs to understand the data and develop effective descriptive features. This important link between the SME and the data specialist is missing from general-purpose AutoML. As described earlier, the process attempts to generate models automatically from whatever the tool can detect in the data, which may be inappropriate and lead to ineffective models. Future AutoML systems must be designed with this and other limitations in mind if they are to create models that meet the standards set by experts.
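The fraud example can be put into numbers with precision (how many flagged cases are really fraud) and recall (how many fraud cases are flagged). The sketch below compares two made-up candidates on ten hypothetical labels and picks one by F1 score. F1 is only one possible selection criterion, used here as an assumption; the right metric depends on the business cost of false alarms versus missed fraud.

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall) for binary labels where 1 means fraud."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # fraud, flagged
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # legitimate, flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # fraud, missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# Ten hypothetical authorizations, two of which are fraudulent.
y_true = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]
candidates = {
    "flag_everything": [1] * 10,                    # doubts every authorization
    "selective": [1, 0, 0, 1, 0, 0, 0, 0, 0, 1],    # one false alarm
}

scores = {name: precision_recall(y_true, pred) for name, pred in candidates.items()}
best = max(scores, key=lambda name: f1(*scores[name]))
```

Both candidates catch every fraud case, so a leaderboard sorted by recall alone would rank them equally. Only a metric that also penalizes false alarms separates them, and choosing that metric is a decision for the user.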
The Future of AutoML
AutoML continues to evolve, and major current AutoML vendors (Google and Microsoft) have made significant improvements. These developments have focused mainly on generating off-the-shelf models faster, rather than on improving the technology to solve more complex problems (for example, detecting fraud and network intrusions) where AutoML could go further than a data specialist.
As AutoML solutions continue to evolve and expand, more complex manual processes can be automated. Modern AutoML systems work well with images and speech because they have built-in domain knowledge for these tasks. Future AutoML systems will give business users a way to contribute their own knowledge and help the machine automatically create very accurate models.
On top of that, complex data pipelines will become increasingly streamlined, and the addition of a wide range of optimization algorithms will further expand the set of problems that citizen data scientists can solve.
Although many data processing tasks will become automated, this will free data scientists to perform custom tasks for the business, further stimulating innovation and enabling businesses to focus on the more important areas of revenue generation and business growth.