Machine Learning at Top Speed: Four-Month Predictive Maintenance
- Tutorial
Posted by Lyudmila Dezhkina, Solution Architect, DataArt
For about half a year, our team has been working on a Predictive Maintenance Platform: a system that predicts possible errors and equipment failures. This area sits at the intersection of IoT and Machine Learning, so you have to deal with both hardware and software. In this article I will describe how we built serverless ML with the Scikit-learn library on AWS, the difficulties we encountered, and the tools I used to save time.
Just in case, a little about myself.
I have been programming for more than 12 years, and during this time I have taken part in a variety of projects, including gaming, e-commerce, high-load systems, and Big Data. For about three years I have worked on projects involving Machine Learning and Deep Learning.
This is what the requirements put forward by the customer looked like from the very beginning.
The interview with the client was difficult: we mostly talked about machine learning, and we were asked a lot about algorithms and concrete personal experience. I will not be modest, this is a part we were already very good at. The first stumbling block was the hardware piece of the system; after all, my own experience with hardware is not that varied.
The customer explained to us: "Look, we have a conveyor." I immediately pictured a conveyor belt at a supermarket checkout. What could there possibly be to learn there? But it quickly became clear that the word "conveyor" hides a sorting center of 300-400 square meters, and in fact there are many conveyors there. That is, many pieces of equipment need to be connected together: sensors, robots. A classic illustration of the "Industrial Revolution 4.0" concept, in which IoT and ML come together.
The Predictive Maintenance theme will certainly keep growing for at least another two or three years. Each conveyor breaks down into elements: from the robot or motor moving a conveyor belt to an individual bearing. Moreover, if any of these parts fails, the whole system stops, and in some cases an hour of conveyor downtime can cost a million and a half dollars (this is not an exaggeration!).
One of our customers is in cargo transportation and logistics: at one of its facilities, robots unload 40 trucks in 8 minutes. There can be no delays; trucks must arrive and leave on a very strict schedule, and nobody repairs anything during unloading. Overall, there are only two or three people with tablets at the facility. But there is a slightly different world where everything looks less fashionable, and where mechanics in gloves and without computers work right at the site.
Our first small prototype consisted of approximately 90 sensors, and everything went fine until the project had to be scaled. Equipping even the smallest separate section of a real sorting center already requires about 550 sensors.
PLC and sensors
A programmable logic controller (PLC) is a small computer with a built-in cyclic program, most often used for process automation. With the help of the PLC we take readings from the sensors: for example, acceleration and speed, voltage level, vibration along the axes, and temperature (17 indicators in our case). Sensors often give wrong readings. Although the project is more than 8 months old, we still keep our own laboratory, where we experiment with sensors and select the most suitable models. Right now, for example, we are considering ultrasonic sensors.
Personally, I saw a PLC for the first time only when I arrived at the customer's site. As a developer, I had never dealt with them before, and it was rather unpleasant: as soon as a conversation went deeper than two-, three-, and four-phase motors, I started losing the thread. About 80% of the words were still intelligible, but the overall meaning stubbornly slipped away. In general, this is a serious problem, rooted in the fairly high barrier to entry into PLC programming: a microcomputer on which you can really do something costs at least $200-300. The programming itself is not complicated; problems begin only when the sensor is attached to a real conveyor or motor.
37-in-1 Standard Sensor Set
Sensors, as you know, come in many varieties. The simplest ones we managed to find cost from $18. The main characteristic is bandwidth and resolution: how much data the sensor transmits per minute. From my own experience I can say that if the manufacturer claims, say, 30 data points per minute, in reality you are unlikely to get more than 15. And this is another serious problem: the topic is fashionable, and some companies try to cash in on the hype. We tested sensors worth $158 whose bandwidth, in theory, would have let us simply throw away part of our code. In fact, they turned out to be an exact analogue of those same $18 devices.
The first stage: we attach sensors, collect data
The first phase of the project was the installation of the hardware, and the installation itself is a long and tedious process. It is also a whole science: the data a sensor collects may depend on how you attach it to a motor or a box. We had a case where one of two identical sensors was mounted inside a box and the other outside. Logic suggests the temperature inside should be higher, but the collected data said otherwise. It looked like the system had failed, but when the developer arrived at the plant, he saw that the sensor was not just inside the box but right on the fan located there.
This illustration shows how the first data entered the system. There is a gateway, with PLCs and sensors connected to it. Then, of course, a cache: the equipment usually runs on SIM cards, and all data is transmitted over the mobile Internet. Since one of the customer's sorting centers is located in an area where hurricanes are frequent and the connection may drop, we accumulate data on the gateway until the connection is restored.
Next, we use the AWS IoT Greengrass service from Amazon, which sends the data into the cloud (AWS).
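To give an idea of what that looks like in code, here is a minimal sketch of a Greengrass Lambda forwarding buffered readings to AWS IoT; the topic name and payload fields are made up for illustration and are not our production schema.
import json
import greengrasssdk

iot = greengrasssdk.client('iot-data')

def publish_reading(reading):
    # reading: dict with a sensor id, timestamp and the measured values
    iot.publish(topic='plant/conveyor/sensors', payload=json.dumps(reading))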
As soon as the data is inside the cloud, a bunch of events are triggered. For example, there is an event for raw data that saves it to the file system. There is a "heartbeat" to indicate that the system is running normally. And there is "downsampling", used for display in the UI and for processing: the average value per minute, say, is taken for a certain indicator. That is, in addition to the raw data, we have downsampled data that lands on the screens of the users who monitor the system.
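As a rough illustration of the downsampling step (not our production code), per-minute averages can be computed with pandas; the column names are made up:
import pandas as pd

def downsample(raw):
    # raw: DataFrame with a 'timestamp' column plus numeric columns such as 'temperature'
    raw = raw.assign(timestamp=pd.to_datetime(raw['timestamp'])).set_index('timestamp')
    return raw.resample('1min').mean()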
Raw data is stored in Parquet format. At first we chose JSON, then we tried CSV, but in the end both the analytics team and the development team were happy with Parquet.
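A minimal sketch of how a batch of raw readings can be written as Parquet to S3; the bucket and path layout are illustrative, and pandas needs pyarrow (or fastparquet) plus s3fs for this:
import pandas as pd

def store_raw(batch, device_id, date):
    # batch: DataFrame with one raw reading per row
    path = 's3://raw-data-bucket/device={}/date={}/readings.parquet'.format(device_id, date)
    batch.to_parquet(path, index=False)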
The first version of the system was built on DynamoDB, and I do not want to say anything bad about this database. It is just that as soon as we got analysts, mathematicians who had to work with the collected data, it turned out that the DynamoDB query language was too complicated for them, and data had to be specially prepared for ML and analytics. So we settled on Athena, the query service in AWS. For us, its advantages are that it reads Parquet data, lets you write SQL, and collects the results into a CSV file: just what the analytics team needs.
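For illustration, an analyst-style query against the Parquet data might be started like this through the Athena API; the database, table and bucket names are hypothetical:
import boto3

athena = boto3.client('athena')
response = athena.start_query_execution(
    QueryString="""
        SELECT sensor_id, avg(temperature) AS avg_temp
        FROM raw_readings
        WHERE date = '2018-10-01'
        GROUP BY sensor_id
    """,
    QueryExecutionContext={'Database': 'predictive_maintenance'},
    ResultConfiguration={'OutputLocation': 's3://athena-results-bucket/'})

# Athena writes the result as a CSV file under the output location
print(response['QueryExecutionId'])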
Second stage: what do we analyze?
From one small site we collected about 3 GB of raw data, so now we know a lot about temperature, vibration, and axial acceleration. It was time for our mathematicians to get together and understand how and, in fact, what we are trying to predict based on this information.
The goal is to minimize equipment downtime.
People enter this Coca-Cola plant only when they receive a signal about a breakdown, an oil leak, or, say, a puddle on the floor. The cost of a single robot starts at $30,000, but almost all production is built on them.
About 10,000 people work across six Tesla factories, which is very few for production of such a scale. Interestingly, Mercedes factories are even more automated. Clearly, all the robots involved need constant monitoring.
The more expensive the robot, the less its working part vibrates. For simple actions this may not be decisive, but more delicate operations, say with the neck of a bottle, require vibration to be minimized. So the vibration level of expensive machines must be constantly monitored.
Services that save time
We launched the first installation in just over three months, which I think is fast.
These are the five main points that allowed us to save development effort.
The first thing that reduced our time is that most of the system is built on AWS, which scales by itself. As soon as the number of users exceeds a certain threshold, autoscaling kicks in, and nobody on the team has to spend time on it.
I would like to point out two nuances. First, we work with large volumes of data, and in the first version of the system we had pipelines for making backups. After a while there was too much data, and keeping copies of it became too costly. So we simply left the raw data lying in the bucket read-only, forbade deleting it, and dropped the backups.
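The effect can be achieved with an ordinary bucket policy; a sketch, assuming an illustrative bucket name:
import json
import boto3

policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "DenyRawDataDeletion",
        "Effect": "Deny",
        "Principal": "*",
        "Action": ["s3:DeleteObject", "s3:DeleteObjectVersion"],
        "Resource": "arn:aws:s3:::raw-data-bucket/*"
    }]
}
boto3.client('s3').put_bucket_policy(Bucket='raw-data-bucket', Policy=json.dumps(policy))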
Our system relies on continuous integration, so supporting a new site and a new installation does not take much time.
Obviously, the real-time part is built on events. Difficulties do arise, of course, because some events fire twice or the system loses connectivity, for example, due to weather conditions.
Data encryption, as required by the customer, is done automatically in AWS. Each client has its own bucket, and we do not do any encryption ourselves at all.
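Default server-side encryption on a per-client bucket can be switched on with a single call; a sketch with an illustrative bucket name:
import boto3

boto3.client('s3').put_bucket_encryption(
    Bucket='client-a-data',
    ServerSideEncryptionConfiguration={
        'Rules': [{'ApplyServerSideEncryptionByDefault': {'SSEAlgorithm': 'aws:kms'}}]
    })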
Meeting with analysts
We received the very first code in PDF format, together with a request to implement one model or another. It was alarming until we started receiving code as .ipynb files, but the fact is that analysts are mathematicians who are far from programming. All our operations take place in the cloud, and we do not allow data to be downloaded. Together, these points pushed us to try the SageMaker platform.
SageMaker gives you about 80 algorithms out of the box and includes frameworks such as Caffe2, MXNet, Gluon, TensorFlow, PyTorch, and the Microsoft Cognitive Toolkit. At the moment we use Keras + TensorFlow, but we have managed to try all of them except the Microsoft Cognitive Toolkit. Such wide coverage lets us avoid limiting our own analytics team.
For the first three or four months, people did all the work with simple mathematics; there really was no ML. Part of the system is based on purely mathematical rules and is designed for statistical data: we monitor the average temperature level, and if we see it going off the scale, alerts are triggered.
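A toy version of that pre-ML logic might look like this; the threshold and window are illustrative, not our real values:
from statistics import mean

TEMP_LIMIT_C = 80.0  # hypothetical limit for a healthy motor

def temperature_alert(recent_temperatures):
    # recent_temperatures: the last few minutes of samples for one motor
    return mean(recent_temperatures) > TEMP_LIMIT_C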
Then comes training the model. Everything looks easy and simple, and so it seems right up until implementation begins.
Build, train, deploy ...
I will briefly describe how we got out of the situation. Look at the second column: we collect data, process it, clean it, and use an S3 bucket and Glue to launch events and create partitions. All the data is arranged into partitions for Athena, and this is an important nuance, because Athena is built on top of S3. Athena itself is very cheap, but we pay for reading data out of S3, so each request can be expensive. That is why we have an extensive system of partitions.
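For example, registering a new S3 prefix as a partition so that queries scan only what they need can be done with plain Athena DDL; the table, columns and paths below are hypothetical:
import boto3

athena = boto3.client('athena')
athena.start_query_execution(
    QueryString="""
        ALTER TABLE raw_readings
        ADD IF NOT EXISTS PARTITION (device_id = 'motor-17', date = '2018-10-01')
        LOCATION 's3://raw-data-bucket/device=motor-17/date=2018-10-01/'
    """,
    QueryExecutionContext={'Database': 'predictive_maintenance'},
    ResultConfiguration={'OutputLocation': 's3://athena-results-bucket/'})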
We also have the downsampling component and Amazon EMR, which lets us collect data quickly. For feature engineering, a Jupyter Notebook is spun up in our cloud for each analyst as their own instance, and they analyze everything directly in the cloud.
Thanks to SageMaker, we were primarily able to skip the training-clusters phase. Without this platform we would have had to stand up clusters in Amazon ourselves, and one of the DevOps engineers would have had to look after them. SageMaker lets you spin up the cluster from the method's parameters and a Docker image; you only have to specify the number of instances you want to use.
Furthermore, we do not have to deal with scaling. If we want to run some large algorithm or urgently need to compute something, we enable autoscaling (it all depends on whether you want to use CPUs or GPUs).
In addition, all our models are encrypted: this also comes out of the box in SageMaker for the binaries stored in S3.
Model deployment
So we come to the first model deployed in an environment. SageMaker can store model artifacts, but at this stage we had a lot of debate, because SageMaker has its own model format. We wanted to get away from it and its restrictions, so our models are stored in pickle format, so that we could use Keras, TensorFlow, or something else if desired. Although we did use the first model from SageMaker as-is, through the native API.
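A minimal sketch of that approach with scikit-learn; the data, bucket and key are placeholders:
import pickle
import boto3
from sklearn.ensemble import RandomForestClassifier

# Placeholder features/labels; in reality they come from the Athena exports
features = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
labels = [0, 1]
model = RandomForestClassifier(n_estimators=10).fit(features, labels)

boto3.client('s3').put_object(
    Bucket='ml-models-bucket',
    Key='motor-anomaly/model.pkl',
    Body=pickle.dumps(model))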
SageMaker simplifies the work into three stages. Every time you try to predict something, you have to start a certain process, feed it data, and get prediction values back. Everything went well with this until custom algorithms were needed.
The analysts know that they have CI and a repository. There is a folder in the repository where they put three files. serve.py lets SageMaker raise the Flask service and communicate with SageMaker itself. train.py is a class with a train method, into which they put everything the model needs. Finally, predict.py raises that class, which contains a predict method. With their access rights they pull all kinds of resources from S3 from there; inside SageMaker we have an image that lets them run anything from the interface and programmatically (we do not limit them).
From SageMaker we get access to predict.py: inside the image there is simply a Flask application that lets you call predict or train with certain parameters. All of this is tied to S3, and in addition the analysts can save models from the Jupyter Notebook. That is, in the Jupyter Notebook they have access to all the data and can run their experiments.
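To make the idea more concrete, here is a stripped-down sketch of such a Flask service, following the usual SageMaker container convention of a /ping health check and an /invocations route; our real serve.py and predict.py contain much more, and the model path is illustrative:
import pickle
import flask

app = flask.Flask(__name__)
model = pickle.load(open('/opt/ml/model/model.pkl', 'rb'))  # illustrative path

@app.route('/ping', methods=['GET'])
def ping():
    # Health check SageMaker uses to decide whether the container is alive
    return flask.Response(status=200)

@app.route('/invocations', methods=['POST'])
def invocations():
    payload = flask.request.get_data().decode('utf-8')
    rows = [[float(x) for x in line.split(',')] for line in payload.splitlines()]
    predictions = model.predict(rows)
    return flask.jsonify([float(p) for p in predictions])

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8080)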
In production it all works as follows. We have users, and there is an endpoint serving predicted values. The data lies in S3 and goes to Athena. Every two hours an algorithm is launched that calculates a prediction for the next two hours. This time step is because, in our case, about 6 hours of data is enough to say that something is wrong with a motor. Even at switch-on, a motor takes 5-10 minutes to heat up, and sharp jumps do not occur.
In critical systems, say, when Air France checks aircraft turbines, predictions are made every 10 minutes, and the accuracy there is 96.5%.
If we see something going wrong, the notification system kicks in. Then one of the many users on duty receives a notification on a watch or another device that a particular motor is behaving abnormally, and goes to check its condition.
Manage notebook instances
In fact, everything is very simple. Coming to work, an analyst launches a Jupyter Notebook instance. They get their own role and session, so two people cannot edit the same file. We now have an instance for each analyst.
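Spinning up such a per-analyst instance is one API call; a sketch in which the name, role and size are illustrative:
import boto3

sm = boto3.client('sagemaker')
sm.create_notebook_instance(
    NotebookInstanceName='analyst-01',
    InstanceType='ml.t2.medium',
    RoleArn='arn:aws:iam::123456789012:role/AnalystNotebookRole',  # controls which buckets are visible
    VolumeSizeInGB=20)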
Create training job
SageMaker has the notion of a training job. If you use the bare API, its result is a binary stored on S3: from the parameters you provide, your model is produced.
import boto3
from time import gmtime, strftime
from sagemaker.amazon.amazon_estimator import get_image_uri

# A unique job name, the built-in K-Means image and the S3 location for artifacts
job_name = 'kmeans-lowlevel-' + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
image = get_image_uri(boto3.Session().region_name, 'kmeans')
output_location = 's3://{}/kmeans_example/output'.format(bucket)
print('training artifacts will be uploaded to: {}'.format(output_location))

sagemaker = boto3.client('sagemaker')
sagemaker.create_training_job(**create_training_params)

status = sagemaker.describe_training_job(TrainingJobName=job_name)['TrainingJobStatus']
print(status)
try:
    # Block until the job either completes or is stopped
    sagemaker.get_waiter('training_job_completed_or_stopped').wait(TrainingJobName=job_name)
finally:
    status = sagemaker.describe_training_job(TrainingJobName=job_name)['TrainingJobStatus']
    print("Training job ended with status: " + status)
    if status == 'Failed':
        message = sagemaker.describe_training_job(TrainingJobName=job_name)['FailureReason']
        print('Training failed with the following error: {}'.format(message))
        raise Exception('Training job failed')
Training Params Example
create_training_params = {
    "AlgorithmSpecification": {
        "TrainingImage": image,
        "TrainingInputMode": "File"
    },
    "RoleArn": role,
    "OutputDataConfig": {
        "S3OutputPath": output_location
    },
    "ResourceConfig": {
        "InstanceCount": 2,
        "InstanceType": "ml.c4.8xlarge",
        "VolumeSizeInGB": 50
    },
    "TrainingJobName": job_name,
    "HyperParameters": {
        "k": "10",
        "feature_dim": "784",
        "mini_batch_size": "500",
        "force_dense": "True"
    },
    "StoppingCondition": {
        "MaxRuntimeInSeconds": 60 * 60
    },
    "InputDataConfig": [
        {
            "ChannelName": "train",
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": data_location,
                    "S3DataDistributionType": "FullyReplicated"
                }
            },
            "CompressionType": "None",
            "RecordWrapperType": "None"
        }
    ]
}
Parameters. The first is the role: you must indicate what your SageMaker instance has access to. In our case, if an analyst works with two different production sites, he should see one bucket and not see the other. The output config is where all the model metadata is saved.
We skip autoscaling here and can simply specify the number of instances on which to run this training job. At first we generally used mid-sized instances without TensorFlow or Keras, and that was enough.
Hyperparameters. You specify the Docker image in which you want to start. As a rule, Amazon provides a list of algorithms and their images, and you must specify hyperparameters: the parameters of the algorithm itself.
Create model
%%time
import boto3

# The model is assembled from the artifacts produced by the training job above
model_name = job_name
print(model_name)

sagemaker = boto3.client('sagemaker')
info = sagemaker.describe_training_job(TrainingJobName=job_name)
model_data = info['ModelArtifacts']['S3ModelArtifacts']
print(info['ModelArtifacts'])

# The container pairs the algorithm image with the trained model binaries on S3
primary_container = {
    'Image': image,
    'ModelDataUrl': model_data
}

create_model_response = sagemaker.create_model(
    ModelName=model_name,
    ExecutionRoleArn=role,
    PrimaryContainer=primary_container)

print(create_model_response['ModelArn'])
Creating a model is the result of a training job: once the job has finished and you have monitored it, the artifact is saved on S3 and you can use it.
This is how it looks from the analysts' point of view. They go to the models and say: in this image I want to launch this model. They simply point to the S3 folder and the image and enter the parameters into the graphical interface. But there are nuances and difficulties, so we moved on to custom algorithms.
Create endpoint
%%time
from time import gmtime, strftime

# Endpoint config: which model to serve and on what hardware
# (the instance type and count here are illustrative)
endpoint_config_name = 'KMeansEndpointConfig-' + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
sagemaker.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[{
        'VariantName': 'AllTraffic',
        'ModelName': model_name,
        'InstanceType': 'ml.m4.xlarge',
        'InitialInstanceCount': 1
    }])

endpoint_name = 'KMeansEndpoint-' + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
print(endpoint_name)

create_endpoint_response = sagemaker.create_endpoint(
    EndpointName=endpoint_name,
    EndpointConfigName=endpoint_config_name)
print(create_endpoint_response['EndpointArn'])

resp = sagemaker.describe_endpoint(EndpointName=endpoint_name)
status = resp['EndpointStatus']
print("Status: " + status)

try:
    # Endpoint creation takes a few minutes; block until it is in service
    sagemaker.get_waiter('endpoint_in_service').wait(EndpointName=endpoint_name)
finally:
    resp = sagemaker.describe_endpoint(EndpointName=endpoint_name)
    status = resp['EndpointStatus']
    print("Arn: " + resp['EndpointArn'])
    print("Create endpoint ended with status: " + status)
    if status != 'InService':
        message = sagemaker.describe_endpoint(EndpointName=endpoint_name)['FailureReason']
        print('Endpoint creation failed with the following error: {}'.format(message))
        raise Exception('Endpoint creation did not succeed')
This is how much code is needed to create an endpoint that can be invoked from any Lambda function or from the outside. Every two hours an event fires that calls the endpoint.
Endpoint view
This is how the analysts see it. They simply select the algorithm and time and can invoke it manually from the interface.
Invoke endpoint
import json
import boto3

runtime = boto3.client('sagemaker-runtime')  # client for the deployed endpoint

# np2csv and train_set come from the AWS K-Means example notebook
payload = np2csv(train_set[0][30:31])
response = runtime.invoke_endpoint(EndpointName=endpoint_name,
                                   ContentType='text/csv',
                                   Body=payload)
result = json.loads(response['Body'].read().decode())
print(result)
And this is how it is done from Lambda: we have an endpoint, and every two hours we send it a payload to get a prediction.
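A sketch of what the handler behind that two-hour schedule might look like (a CloudWatch Events / EventBridge rule triggers it); the endpoint name and the two helpers are hypothetical:
import boto3

runtime = boto3.client('sagemaker-runtime')

def lambda_handler(event, context):
    payload = load_recent_features_as_csv()        # hypothetical helper reading from S3/Athena
    response = runtime.invoke_endpoint(
        EndpointName='motor-anomaly-endpoint',     # illustrative endpoint name
        ContentType='text/csv',
        Body=payload)
    prediction = response['Body'].read().decode()
    publish_alert_if_needed(prediction)            # hypothetical helper behind the notifications
    return prediction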
Useful SageMaker links (GitHub)
These links are very important. Honestly, once we started using the standard SageMaker GUI, everyone understood that sooner or later we would get to custom algorithms and would have to assemble all of this by hand. These links cover not only the use of the algorithms but also the building of custom images:
github.com/awslabs/amazon-sagemaker-examples
github.com/aws-samples/aws-ml-vision-end2end
github.com/juliensimon
github.com/aws/sagemaker-spark
What's next?
We are approaching the fourth production installation, and now, in addition to the analytics, we have two development paths. First, we are trying to get logs from the mechanics, that is, to move toward learning from labelled failure data. The first maintenance logs we received look like this: something broke on Monday, I got there on Wednesday, and started fixing it on Friday. We are now trying to supply the customer with a CMS, a content management system that will allow failure events to be logged.
How does it work? Typically, as soon as a breakdown happens, the mechanic arrives and replaces the part very quickly, but he may fill in all the paper forms, say, a week later. By then the person has simply forgotten what exactly happened to the part. A CMS, of course, takes our interaction with the mechanics to a new level.
Second, we are going to install ultrasonic sensors on the motors that capture sound for spectral analysis.
It is possible that we will abandon Athena, because on big data working through S3 is expensive. Meanwhile, Microsoft recently announced its own services, and one of our customers wants to try to do roughly the same thing on Azure. In fact, one of the advantages of our system is that it can be taken apart and reassembled somewhere else, like building blocks.