"Computer, how is my build doing?" And other magic spells

    Baruch Sadogursky explains how, with the help of the Alexa voice service, you can add a voice interface to completely unexpected things, such as IntelliJ IDEA and Jenkins, and then manage it all while leaning back in your chair with a glass of your favorite drink.


    The article is based on Baruch's talk at the JPoint 2017 conference in Moscow.

    Baruch does Developer Relations at JFrog. He is also an enthusiast of Groovy, DevOps, IoT, and home automation.

    What is Alexa and why is it in our lives?


    In this article I will be talking to Alexa. It is an Amazon product - a voice assistant, in the same family as Siri, Cortana (God forbid, if someone couldn't manage to remove it), Google Home, and so on. Alexa is the market leader today, and we will talk about why, although the reason is fairly obvious.

    There are three types of Alexa devices.



    I will be talking to an Amazon Tap. It originally could not respond to its name, so you had to press a button - hence the name Tap. It now responds to its name, but the name stuck. The full-size Alexa (the Echo) is big and has a good speaker. There is also a small puck-shaped device (the Echo Dot) with a weak speaker, designed to be connected to external receivers and speakers. All of them are quite affordable, at 170, 130 and 50 dollars. I have nine of them in total at my place; why I need all of this, we will discuss in a moment.

    I mentioned that Alexa is now the undisputed market leader. She leads thanks to the open Alexa Skills API for scripting her. Just recently it was announced that the number of skills in the main catalog has reached 10 thousand. Most of them are silly, but thanks to the many useful ones, nothing comes close to Alexa in usefulness and popularity. How they got to 10 thousand is easy to understand: there is a template from which you can make a skill by changing a couple of lines, publish it, get a T-shirt - and that is how they got to 10 thousand. But today we will learn how to write somewhat more interesting skills - and you will see that it is quite simple.

    Why am I interested in this, and how do I use my army of Alexas at home?



    I get up and first of all ask her to turn off the alarm ("Alexa, turn off the alarm"). Then I somehow make it to the kitchen and ask her to turn on the lights ("Alexa, turn on morning lights"). I greet her ("Alexa, good morning"). By the way, she responds to "Good morning" with interesting tidbits - for example, that a large snake was found in Malaysia last year. Useful information to have with your morning coffee.

    Then I ask about the news ("Alexa, what's my news flash?"), and she reads me the news from various sources. After that I ask what is on my calendar ("Alexa, what's on my agenda?"), so I know what my day looks like. And then I ask about my drive to work, whether there are traffic jams and so on ("Alexa, how's my commute?"). This is purely home-oriented use; there are no custom skills involved, all of this is built in.

    But there are many more cool applications.



    • Smart home is the coolest use. As you already saw, I turn on the lights through Alexa. But it is not just lights: locks, cameras, burglar alarms, all kinds of sensors - everything that connects to a smart home.
    • Music. Alexa can play the music I love. The same goes for audiobooks: adults usually have no time for them, but my son listens to fairy tales through Alexa.
    • Questions and answers - the kind of thing you would normally look up on Wikipedia.
    • News and weather - I already talked about this.
    • Ordering food - it is, of course, very important to be able to order pizza, sushi and so on without getting up from the couch.

    But you are here, of course, not to listen to a consumer pitch about Alexa - you are here to hear about the voice interfaces of the future. Today they are hyped, and the hype came somewhat unexpectedly. But it matters, because voice interfaces are what we have been waiting for. We grew up waiting for voice interfaces to finally work, and now it is happening before our eyes.

    Analysts say that the past 30 months have seen more progress in voice interfaces than the previous 30 years. The bottom line is that at some point the voice interface will replace the graphical one. This is logical, because the graphical interface is a crutch that exists only because voice interfaces were not developed enough.

    Of course, some things are better shown on a screen; nevertheless, a lot of what we do on screen could, and indeed should, be done by voice.

    This API, which sits on top of the voice recognition and the particular artificial intelligence Alexa has, is what guarantees explosive growth in this field and lets many people write many useful skills.

    Today we will look at two skills. The first opens an application - my IntelliJ IDEA - which is, of course, cool, but not very useful (opening an application by voice and then writing the code by hand does not make much sense). The second is about Jenkins, and it should be much more useful.

Writing a skill for Alexa


    Writing a skill is simple and consists of three stages:

    • Define the interaction model - the voice API itself;
    • Write a handler for the commands that come in;
    • Pass Amazon's review (we will not cover this stage).



    The voice interaction model: what is it and why is it needed?


    It is based on a very simple idea:



    We can extract variables (slots) from the text and pass them to the handler: for example, I ask to open IDEA, where the tool name is a slot, but I could just as well ask to open Rider or CLion.

    I declare the commands in JSON, and then I provide sample texts of how these commands can sound. This is where the magic happens, because the same thing can be said in many different ways, and the artificial intelligence behind Alexa's voice recognition can pick out the meaning not only of the exact phrases I wrote down, but of anything equivalent. In addition, there is a set of built-in commands, such as Stop, Start, Help, Yes, No, and so on, for which you do not need to write any examples at all.

    Here is the JSON IntentSchema that declares which intents I want. This is our example, which opens tools from the JetBrains Toolbox.

    {
      "intents": [
        {
          "intent": "OpenIntent",
          "slots": [
            {
              "name": "Tool",
              "type": "LIST_OF_TOOLS"
            }
          ]
        },
        {
          "intent": "AMAZON.HelpIntent"
        }
      ]
    }

    As you can see, I have an intent called OpenIntent with one slot, Tool, whose type is a list of tools. In addition, there is also the built-in help intent.

    Slot types


    Here are the types of slots:

    • Built-in, such as AMAZON.DATE, DURATION, FOUR_DIGIT_NUMBER, NUMBER, TIME;
    • Types covering the huge amount of information Alexa already knows, which we can build skills around - for example, lists of actors, ratings, cities in Europe and the USA, famous people, films, drinks, etc. (AMAZON.ACTOR, AGGREGATE_RATING, AIRLINE, EUROPE_CITY, US_CITY, PERSON, MOVIE, DRINK). All of these and their possible values are already stored inside Alexa.
    • Custom types.

    Be sure to remember that a custom type is not an enum: the list of values only serves as a hint. If Alexa recognizes some other word, it will still be passed to my skill.

    Here is the very List of Tools - the options that I gave her.



    The spelling here differs from the official spelling of the JetBrains products simply because Alexa works with natural words. If I write IntelliJ IDEA as one word, she will not be able to recognize what it is.
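    A custom slot type is simply a list of values, one per line in the skill console; as an illustrative sketch (not my exact list), LIST_OF_TOOLS could contain something like:

        idea
        app code
        web storm
        py charm
        rider
        sea lion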

    Sample utterances for the commands


    Here are examples of how people might ask to open a tool:

    OpenIntent open {Tool}
    OpenIntent start {Tool}
    OpenIntent startup {Tool}
    OpenIntent {Tool}

    There are other variants too. When Alexa sees this set, she knows that synonyms of these words also map to this intent.

    Command handler


    We have described the voice interface; next comes the command handler. It works very simply: Alexa turns what we say into a REST request in JSON format.

    The request can go either to an AWS Lambda function or to an arbitrary HTTPS server. The advantage of a Lambda function is that you do not need to run your own server: it is a platform as a service where we can write our handler without standing up any services ourselves.

    Benefits of AWS Lambda functions:

    • Serverless compute - it just runs on its own
    • No-ops!
    • Node.js - the most elegant of the implementations: we write JavaScript functions, and they are executed whenever the service is invoked.
    • Python support (we write a script with some functions, and it all works great)
    • Java 8.

    With Java 8 it is more complicated. In Java we have no top-level functions we could just write and call - everything has to be wrapped in classes. Our friend Sergey Yegorov is on friendly terms with the folks who build Lambda, and he is working on making Groovy usable in Lambda not the way it is now (where we build a jar file and deploy it like Java), but directly as Groovy scripts with callbacks that get invoked.
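    To make this concrete: with the Java (or Groovy) kit, the Lambda entry point is a tiny handler class that hands every request over to the speechlet (more on speechlets in a moment). A minimal sketch, assuming the com.amazon.speech Alexa Skills Kit library and a placeholder skill ID:

        import com.amazon.speech.speechlet.lambda.SpeechletRequestStreamHandler

        // Lambda entry point: routes incoming Alexa requests to our speechlet.
        class JenkinsSpeechletRequestStreamHandler extends SpeechletRequestStreamHandler {
            // The skill's application ID from the Alexa Skill Kit console (placeholder value).
            private static final Set<String> SUPPORTED_IDS =
                    ['amzn1.ask.skill.REPLACE-WITH-YOUR-SKILL-ID'] as Set

            JenkinsSpeechletRequestStreamHandler() {
                super(new JenkinsSpeechlet(), SUPPORTED_IDS)
            }
        }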

    Speechlet


    The class that handles Alexa requests in Java is called a Speechlet. When you see the word speechlet, you immediately think of applets, midlets and servlets, and you already know what to expect: a managed lifecycle - roughly speaking, an interface that we, as developers, have to implement for the different phases of the life of our "-let", in this case the speechlet.

    And you would not be mistaken, because here is the Speechlet interface with the four methods we need to implement:

    public interface Speechlet {
        void onSessionStarted(SessionStartedRequest request, Session session);
        SpeechletResponse onLaunch(LaunchRequest request, Session session);
        SpeechletResponse onIntent(IntentRequest request, Session session);
        void onSessionEnded(SessionEndedRequest request, Session session);
    }

    First comes onSessionStarted - this is when the session starts and Alexa realizes she has a speechlet. onLaunch is when we invoke the skill by its name. onIntent is when the user has said something and we receive what they said as JSON, together with the command (intent) they invoked. onSessionEnded is where we do the usual cleanup.

    In general, it is very similar to any other "-let", and now we will see what it all looks like in code.

    Where to put the Speechlet


    There are two places in Amazon that we need to work with:



    • The Alexa Skill Kit, where we describe a new skill (the interaction model, metadata, name, and where to send requests).
    • Lambda or any other service hosting our speechlet, the request handler. Roughly speaking, we end up with something like this:



    A user says: "Alexa, ask Jenkins how's my build?" This reaches the device, in our case the Amazon Tap. From there it goes to the skill, which turns the voice into JSON, calls the Lambda function, which in turn pulls the Jenkins API.

    Code Example: JbToolBoxActivator Speechlet


    Now it's time to look at the code
    ( https://github.com/jbaruch/jb-toolbox-alexa-skill/blob/master/src/main/groovy/ru/jug/jpoint2017/alexa/jbtoolboxactivator/JbToolBoxActivatorSpeechlet.groovy )

    Here is our speechlet. We put the help text, the default question and so on into constants. We have an HTTPBuilder - remember, this code runs inside AWS Lambda, so it has to reach out to something on the Internet. This is actually Groovy, but add some boilerplate and you get Java.

    In onSessionStarted we configure our HTTP client, telling it to talk to the Toolbox on a given host and port. Next, we read the list of supported tools from a file, which you have also seen (List of tools.txt).

    From onLaunch we issue a HelpResponse - the one you heard about: I can open the tools from the list.



    The most interesting part happens in onIntent. We do a switch on the intent name - that is, on whichever of our commands has arrived. In this case, if you remember, we had two intents: one is OpenIntent, our custom intent, and the other is help. There may also be stop, cancel, and others.



    And here is the fun part, OpenIntent. We extract the Tool slot from it (IDEA and so on), and then use the HTTPBuilder to call our URI plus this tool name. In other words, we are calling the JetBrains Toolbox API, which understands this format.

    Then we return the answer. It is either "Opening $toolName" or, if the tool is not in the list, "Sorry, I can't find a tool named $toolName in the toolbox. Goodbye."

    Help works as help, stop and cancel say goodbye. If an intent arrives that we do not handle, we throw an "invalid intent" error.
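    Put together, the onIntent dispatch looks roughly like the sketch below (Groovy; http and tools are the fields set up in onSessionStarted, and the Toolbox endpoint path is my assumption - the exact call is in the repository):

        import com.amazon.speech.slu.Intent
        import com.amazon.speech.speechlet.*
        import com.amazon.speech.ui.PlainTextOutputSpeech

        SpeechletResponse onIntent(IntentRequest request, Session session) {
            Intent intent = request.intent
            switch (intent.name) {
                case 'OpenIntent':
                    // The value Alexa recognized for the {Tool} slot, e.g. "idea".
                    String toolName = intent.getSlot('Tool')?.value
                    if (toolName in tools) {
                        // Assumed Toolbox URL scheme - the real one lives in the repository.
                        http.get(path: "/open/$toolName")
                        return tell("Opening $toolName")
                    }
                    return tell("Sorry, I can't find a tool named $toolName in the toolbox. Goodbye.")
                case 'AMAZON.HelpIntent':
                    return tell('You can ask me to open any tool from the JetBrains Toolbox.')
                case 'AMAZON.StopIntent':
                case 'AMAZON.CancelIntent':
                    return tell('Goodbye.')
                default:
                    throw new SpeechletException("Invalid intent: $intent.name")
            }
        }

        // Wrap plain text into the response object Alexa expects.
        private static SpeechletResponse tell(String text) {
            SpeechletResponse.newTellResponse(new PlainTextOutputSpeech(text: text))
        }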

    Everything is extremely simple - I deliberately wrote it this way so you can see just how simple it is.

    In onSessionEnded I do nothing. I also have a method newAskResponse, which is boilerplate in its purest form: ten lines of code creating three objects - one object that takes two others in its constructor, into which some texts have to be passed. All it really does is create a SpeechletResponse containing the OutputSpeech text and the repromptText. Why does that take ten lines of code? Well, that is just how it historically turned out; we will come back to this. I hope everything else, apart from this boilerplate, is clear and simple.
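    For reference, the boilerplate it hides looks roughly like this sketch (the real helper is in the repository):

        import com.amazon.speech.speechlet.SpeechletResponse
        import com.amazon.speech.ui.PlainTextOutputSpeech
        import com.amazon.speech.ui.Reprompt

        // Three objects just to say something and then wait for the user's answer.
        private static SpeechletResponse newAskResponse(String outputText, String repromptText) {
            PlainTextOutputSpeech outputSpeech = new PlainTextOutputSpeech(text: outputText)
            PlainTextOutputSpeech repromptSpeech = new PlainTextOutputSpeech(text: repromptText)
            Reprompt reprompt = new Reprompt(outputSpeech: repromptSpeech)
            SpeechletResponse.newAskResponse(outputSpeech, reprompt)
        }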

    Code Example: Jenkins Speechlet


    Let's look at another skill, and this time go the other way: from the code to the skill, and at the end we will try to run it. Here is the Jenkins speechlet - the handler for the skill that controls Jenkins ( https://github.com/jbaruch/jenkins-alexa-skill/blob/master/src/main/groovy/ru/jug/jpoint2017/alexa/jenkins/JenkinsSpeechlet.groovy )

    It all starts very similarly: in onSessionStarted we initialize the HTTPBuilder, which will access Jenkins via the REST API at JENKINS_HOST and log in with a username and password. Taking these from environment variables is, of course, not quite proper - Alexa has a whole account-linking mechanism that lets a skill ask for a username and password, so that when you install the skill on your Alexa, a login window opens. But for simplicity we take the username and password from environment variables here.

    In onLaunch we again rattle off a welcome text along the lines of "with this skill you can control your build server", and then the interesting part begins in onIntent.

    Here we have more intents, so it makes sense to look at our model: https://github.com/jbaruch/jenkins-alexa-skill/blob/master/src/main/resources/speechAssets/IntentSchema.json
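    A sketch of what that schema boils down to (the exact intent names are in the file linked above; these are approximations):

        {
          "intents": [
            { "intent": "LastBuildIntent" },
            { "intent": "GetCodeCoverageIntent" },
            { "intent": "FailBuildIntent" },
            { "intent": "AMAZON.HelpIntent" },
            { "intent": "AMAZON.StopIntent" },
            { "intent": "AMAZON.CancelIntent" },
            { "intent": "AMAZON.YesIntent" },
            { "intent": "AMAZON.NoIntent" }
          ]
        }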

    We have LastBuild, which obviously returns information about the last build; GetCodeCoverage, which returns the code coverage; and FailBuildIntent, which fails the build. In addition, there are a bunch of built-in ones, such as help, stop, cancel, and even yes and no. We will see in a moment what we actually do with yes and no.

    Let's get going. An intent arrives and we dispatch on its name. If we were asked about the last build, we go to the Jenkins API, fetch the list of builds with their names and colors (red or blue meaning failed or passed), take the last one, and report whether the build with that name passed or failed.

    For GetCodeCoverage we again turn to the Jenkins API, this time to a plugin called JaCoCo. Like any good plugin, it exposes a lot of metrics; we take one of them - lineCoverage - and report it.
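    A sketch of how those two lookups might look with HTTPBuilder (the job name is the one from the demo; the endpoint paths follow the standard Jenkins and JaCoCo REST layout and are my assumption - the real calls are in the repository):

        import groovyx.net.http.HTTPBuilder

        def http = new HTTPBuilder(System.getenv('JENKINS_HOST'))

        // LastBuild: list the jobs with their name and color (blue = passing, red = failing).
        def jobs = http.get(path: '/api/json', query: [tree: 'jobs[name,color]']).jobs
        def lastJob = jobs.last()
        boolean passing = lastJob.color.toString().startsWith('blue')

        // GetCodeCoverage: the JaCoCo plugin exposes its metrics per build; we take lineCoverage.
        def coverage = http.get(path: '/job/jb-toolbox-alexa-skill/lastBuild/jacoco/api/json')
        def linePercentage = coverage.lineCoverage.percentage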

    FailBuild is a request to turn the current build into a failed one. I would not want to act on it immediately: Alexa sometimes triggers by mistake, so there is a chance of accidentally failing a build. So we ask for confirmation - we respond with "I understand you want to fail the build, is that really what you meant?" - and we store a flag saying that a fail was requested in the session, which is exactly the thing that persists across invocations.

    And this is where yes and no come in. If the answer was yes, we need to check whether there actually was a pending question, or whether the user just said yes out of the blue. If there was a question, we make a POST request to the Jenkins API and fail the job. If the job really turned red, we say that we failed it; if not, nothing happened. And on stop we say goodbye and exit.
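    The confirmation dance is just a flag in the session. A sketch of the two relevant branches of the onIntent switch (the attribute name and the Jenkins call used to fail the job are my assumptions; tell and ask are helpers like newAskResponse above):

        case 'FailBuildIntent':
            // Don't touch Jenkins yet - just remember that a confirmation is pending.
            session.setAttribute('failRequested', true)
            return ask('I understand you want to fail the latest successful build. Are you sure?',
                       'Do you really want to fail the build?')

        case 'AMAZON.YesIntent':
            if (session.getAttribute('failRequested')) {
                // Confirmed: hit the Jenkins API to fail the job (the exact endpoint is an assumption).
                http.post(path: '/job/jb-toolbox-alexa-skill/lastBuild/stop')
                return tell('Successfully changed the build status to failed. Thank you and goodbye.')
            }
            return ask('There is nothing to confirm. What do you want to do next?',
                       'What do you want to do next?')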

    Quite interesting functionality, and the code fits on one page. I tried to make the code more sophisticated, but there is nothing to complicate - it really is that simple.

    After that we build it all with Gradle, and the Gradle part is the simplest of all. I have a handful of dependencies: Groovy, which I naturally need, plus this API and three more - a logger, commons-io and commons-lang. That's it. In testCompile - naturally, I have tests for all of this. Then I build a ZIP with my jar in the root directory and a lib directory with all the dependencies. It could not be simpler.
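    A sketch of that build script (the versions and the Spock test dependency are placeholders of mine; the real build.gradle is in the repository):

        apply plugin: 'groovy'

        repositories { mavenCentral() }

        dependencies {
            compile 'org.codehaus.groovy:groovy-all:2.4.11'
            compile 'com.amazon.alexa:alexa-skills-kit:1.3.0'
            compile 'org.slf4j:slf4j-simple:1.7.25'
            compile 'commons-io:commons-io:2.5'
            compile 'org.apache.commons:commons-lang3:3.5'
            testCompile 'org.spockframework:spock-core:1.1-groovy-2.4'
        }

        // Package for AWS Lambda: the jar in the root of the ZIP, dependencies under lib/.
        task buildZip(type: Zip) {
            from jar
            into('lib') {
                from configurations.runtime
            }
        }

        build.dependsOn buildZip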



    Using the Alexa Skill Kit


    Now let's see what we do with this build. There are two places we actually work in. The first is the Alexa Skill Kit, which lists all the skills I have written. Let's take a look at the Jenkins skill.



    As I said, here we have the metadata: the name and the invocation name - the very word (Jenkins) I use when I say "Alexa, ask Jenkins to do this".



    Then you can specify whether the skill needs the audio player (if, for example, I want to stream sound: play music, play news and so on).



    Next we have our interaction model - the same JSON that describes the intents.



    There is no slot here, but in the JetBrains skill we would have a custom slot with its values, plus the sample utterances that feed my intents.



    The Configuration tab - what endpoint I call: Lambda or HTTPS.



    There is also Account Linking - whether we let the user log in when setting up the skill (and in theory it would be nice to do this for Jenkins).



    Then come Permissions for all sorts of purchases and the like, but that does not interest us much here.



    The Testing tab - here I can type what I would say and see how she would respond.



    Next is the Publishing Information tab. The skill goes through all sorts of checks at Amazon, and I have to tell them what it is good for, how it was tested and so on.

    Using AWS Lambda


    The second part of working with the skill is my AWS Lambda. I have nothing there except these three skills.



    Let's take a look at the Jenkins one. Here I upload the same artifact that Gradle built. There are also the environment variables (host, password and user).



    In the configuration I specify which runtime I need. As I said, Node, Python and Java are supported, and C# support was added recently. Next is the handler - the class my speechlet lives in. My role is called lambda_basic_execution. And there is a description. That's all.



    The trigger comes from Alexa: there is a special trigger type called Alexa Skills Kit, which means the call will come from there.



    Here you can also test, provided you know which JSON Alexa sends to Lambda. And I do know which one, because when I test in the Skill Kit I can see it. So let's look at the Jenkins skill: I sent the request "how is my build".



    An intent was sent - the intent called LastBuild - and Alexa understood what I wanted.



    It is all wrapped in JSON, which I can use to test in Lambda: I say "send this JSON", and that way I test only this piece.



    So I can do end-to-end testing through the whole voice recognition pipeline (text to JSON, to Lambda and back), or I can test only at the Lambda level, if I know the JSON I need.
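    Such a test event is just a regular Alexa request. A trimmed sketch of what an IntentRequest for the last-build intent looks like (IDs abbreviated, the intent name approximate):

        {
          "version": "1.0",
          "session": {
            "new": true,
            "sessionId": "amzn1.echo-api.session.example",
            "application": { "applicationId": "amzn1.ask.skill.example" },
            "user": { "userId": "amzn1.ask.account.example" }
          },
          "request": {
            "type": "IntentRequest",
            "requestId": "amzn1.echo-api.request.example",
            "timestamp": "2017-04-07T12:00:00Z",
            "intent": { "name": "LastBuildIntent", "slots": {} }
          }
        }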

    Jenkins Speechlet Demonstration


    Now it is finally time to run everything. We have Jenkins, and I have a build of the previous skill.



    Then I have this kind of dialogue with the system:

    Baruch: Alexa, open Jenkins
    Alexa: With this skill you can control your build server
    Baruch: Alexa, ask Jenkins, how is my build?
    Alexa: Your last build jb-toolbox-alexa-skill is passing. What do you want to do next?
    Baruch: What's the code coverage?
    Alexa: The code coverage for the last jb-toolbox-alexa-skill build is 30%. What do you want to do next?
    Baruch: Fail the build.
    Alexa: I understand you want to fail the latest successful build. Are you sure?
    Baruch: Yes
    Alexa: Successfully changed the build status to failed. Thank you and goodbye.




    It is very cool and very simple - almost embarrassingly so.

    Alexa's visual companion


    In addition to the voice interface, Alexa has a visual companion. It is an application where, besides setting up all your devices, the internet connection and so on, there are also cards - a stream of information auxiliary to the voice interface. This is a sound idea, because not everything can be conveyed by voice. If, for example, I have just asked about code coverage and she gave me a single metric - while JaCoCo actually returns six metrics: coverage by branches, by methods, by lines of code, etc. - it naturally makes sense to display all of that visually in the application. This can be done in one line: there is a command to send text to the application, and you can send plain text, a picture or HTML, which will be displayed there. For example, if you ask about the weather, the assistant gives a short answer, while in the application you can see the weather for the whole week. In the app I can also see that my lock is locked and that my son is currently playing the tale of Little Red Riding Hood. Here is what it looks like:
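    Attaching such a card really is about one extra line on top of the spoken response. A sketch with the Java kit's SimpleCard (the card text is illustrative):

        import com.amazon.speech.speechlet.SpeechletResponse
        import com.amazon.speech.ui.PlainTextOutputSpeech
        import com.amazon.speech.ui.SimpleCard

        PlainTextOutputSpeech speech = new PlainTextOutputSpeech(text: 'The line coverage is 30 percent.')
        SimpleCard card = new SimpleCard(title: 'Code coverage',
                content: 'Lines: 30%. See Jenkins for the full JaCoCo report.')
        // The card shows up in the companion app next to the spoken answer.
        SpeechletResponse.newTellResponse(speech, card)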



    Alexa's flaws: imaginary and real


    Now let's talk about two kinds of flaws: imaginary and real. The first, imaginary one concerns speech recognition: a user may speak with an accent or not pronounce words very clearly. I do not know how they do it, but Alexa understands even my little son perfectly, although my wife and I do not always understand him.

    The second alleged flaw is shown in the illustration below.



    In fact, this problem is solved very simply with security: Alexa will never place an order or open a door until you say a predefined PIN code.

    There is also the concern: "Oh no, they are listening to us all the time." This was reverse-engineered long ago, and everyone knows exactly what is being listened to. The only thing listened for continuously is the trigger word - that is, when I call her by name. Sometimes the trigger word fires by accident; after that, about ten seconds of what we say is recorded and sent to the main skill for intent recognition, and if nothing is recognized there, it is thrown away. So all this paranoia is unjustified.

    And now we’ll talk about real flaws. They can be divided into several categories.

    There are drawbacks inherent to voice user interfaces. They stem from the fact that different things are called by the same words. Try asking for Helloween music and not getting Halloween music instead. I think that, given the pace of progress, context should solve this problem - Alexa should already know that, between Halloween music and the band Helloween, I prefer the band. Everything is heading that way, but it is not there yet, especially when the context is unclear.

    And one more problem: the lack of support for unusual names and titles. If Alexa does not know a name or title, she will not be able to pronounce it.

    In addition, Alexa itself, as a consumer device, also has disadvantages:

    • She does not understand compound commands. I can build an interaction model, as we did with Jenkins, where I do not have to say her name before every follow-up command. But she currently cannot execute something like "Alexa, turn on TV and set living room lights to 20%". And when you have many commands and use them constantly, this gets annoying.
    • They do not work as a cluster. Even though I have seven of these devices at home, each one thinks it is the only one. So if I am standing where three of them can hear me, all three answer. I also cannot use one of them to start music on another, because they do not know there is more than one of them. This is a known problem, it is being worked on, and I hope it will be solved soon.
    • She does not know where she is or where I am. Because of this, Alexa cannot respond to a plain "Turn on the lights" command - I have to say "Turn on the lights in the bedroom", which is silly.
    • The companion application itself is built in HTML5; it is slow and clunky, but that is being fixed too.
    • She understands only three languages: British English, American English and German. Russian has not been taught to her yet.

    The biggest and nastiest flaws come out when you try to write an Alexa skill, like the one we just did. The Java API is a nightmare, the voice model has to be copied onto the page by hand - the documentation even says so - and there is no bootstrapping: no Maven archetype, no Lazybones template.

    There is no local test infrastructure. If I change a single line of the model, I have to go to the skill site and edit the JSON in the text area by hand, and then go to Lambda and upload a new jar (to be fair, Lambda does have a REST API - it was written for developers, so you can do continuous deployment there, and everything is fine on that side). But on the Skill Kit side, with any change I have to go, upload and test everything only on the server. There is no local infrastructure, and that is, of course, also a very big minus.

    Conclusion


    That's all. The skills we discussed can be taken from my GitHub - jb-toolbox-alexa-skill and jenkins-alexa-skill. I have a big request for you: let's come up with and write useful skills, and then we can return to this topic at future conferences.

    So that you do not miss a single detail, below we have included the answers to questions from the audience.

    A question about security: is brute force a possibility, where the PIN code could be guessed (for example, to open the door)?

    Since the PIN code is entered by voice and Alexa has to respond to every attempt, enumerating all the values could take a whole month (unless, of course, I show up at home before then).

    A smartphone or a laptop can require authentication when you turn it on. How does that work with Alexa?

    Alexa has no PIN code for turning it on: you could steal it and switch the lights in my house on and off. All the dangerous things, though, are supposed to be protected by a PIN code. There is no speaker recognition at the moment, although it is being worked on and will eventually be wired into authentication.

    How does Alexa compare with Google Home?

    Google Home is much younger than Alexa and still has a lot of problems. The main one is that writing skills for Google Home is much harder: the barrier to entry is much higher, which is not great for the adoption of the technology. However, Google Home has one big advantage - its own search engine. Alexa does not have one, so you cannot get by without a custom Google search skill.

    You showed both big and small Alexa devices. Is there any point in installing many of the big ones?

    The only difference between the big and small devices is the quality of the built-in speaker. In the smallest one the speaker is intended only for talking, not for music. So you can put small ones everywhere, and a big one wherever you want to listen to music.



    We hope you have enjoyed Baruch's experience. And if you like digging into all the details of Java development as much as we do, you will probably be interested in the talks at our April JPoint 2018 conference.

