Using the Google Speech API to control your computer

Good afternoon to all Habr residents. Several articles about the Google Speech API have already been published on Habr, including its use for building a Smart Home. In this article I want to show how you can write a small program for voice control of a computer. If you are interested, welcome under the cut.

For development I use Embarcadero RAD Studio XE and several free auxiliary components (JEDI Core, JEDI VCL, New Audio Components for Delphi, Synapse, uJSON, CoolTrayIcon). The article "Using Google Voice Search in our .NET Application" already described how the Google Speech API works and what its subtleties are.

I will describe the algorithm of my program and some of the nuances of using auxiliary components.

1. Recording audio in FLAC format.

For this I use the New Audio Components for Delphi package. Sound is recorded in FLAC format at a sample rate of 8 kHz and saved to a file.

The VCL component DXAudioIn1 is responsible for recording; it also holds the recording settings (1 channel, 8 kHz sample rate).

Next, the data from DXAudioIn1 goes to FastGainIndicator1, whose OnGainData event processes the signal level: if the level falls below the set value (the red indicator) N times in a row, recording is stopped and the data is sent to Google.
I also made it possible to start recording automatically once the level exceeds a threshold M times in a row (the blue indicator).

Of course, such an algorithm is not very reliable, but it removes the need to press start and stop buttons. With suitable settings for the levels and the trigger counts, the program detects when a useful signal is actually coming from the microphone.

Finally, the data from FastGainIndicator1 goes to the FLACOut1 component, which writes it straight to a FLAC file.

The StartRecord procedure is responsible for starting recording.
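
Below is a minimal sketch of what StartRecord and the OnGainData handler might look like. This is not the actual mSpeech source: the chain wiring follows the usual NewAC pattern (output components expose Input, Run and Stop), but the exact property names of DXAudioIn1 and FastGainIndicator1, the event signature, and the LevelValue / SilenceThreshold / SilenceCount / LowLevelCount names are assumptions of mine and should be checked against your version of the components.

procedure TMainForm.StartRecord;
begin
  // Recording settings: 1 channel, 8 kHz (property names assumed, see above)
  DXAudioIn1.InChannels := 1;
  DXAudioIn1.InSampleRate := 8000;

  // Chain: DXAudioIn1 -> FastGainIndicator1 -> FLACOut1 -> file
  FastGainIndicator1.Input := DXAudioIn1;
  FLACOut1.Input := FastGainIndicator1;
  FLACOut1.FileName := OutFileName;   // the *.flac file that will be sent to Google

  LowLevelCount := 0;                 // counts consecutive "silent" OnGainData calls
  FLACOut1.Run;                       // start recording asynchronously
end;

// Level processing: stop after the level has stayed below the threshold N times in a row.
procedure TMainForm.FastGainIndicator1GainData(Sender: TComponent);
begin
  if FastGainIndicator1.LevelValue < SilenceThreshold then   // LevelValue is illustrative
    Inc(LowLevelCount)
  else
    LowLevelCount := 0;

  if LowLevelCount >= SilenceCount then                      // SilenceCount = N from the text
  begin
    FLACOut1.Stop;   // finish the FLAC file; it is then sent to Google in a separate thread
    LowLevelCount := 0;
  end;
end;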

2. Sending the file to Google for recognition and receiving a response

The recorded file is sent to Google for recognition using the Synapse library.

Since the data has to be sent over HTTPS, there are two subtleties when working with Synapse:

a) the libeay32.dll and ssleay32.dll libraries are required;
b) the ssl_openssl unit has to be added to the uses clause, roughly as shown below.
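
For clarity, the relevant part of the uses clause looks roughly like this (HTTPSend and ssl_openssl are the unit names as shipped with Synapse; the rest of the clause is, of course, project specific):

uses
  Classes, SysUtils,
  HTTPSend,      // THTTPSend
  ssl_openssl;   // plugs OpenSSL into Synapse; needs libeay32.dll and ssleay32.dll at run time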

The HTTPPostFile function is responsible for sending the file.

It is called simply:
HTTPPostFile('https://www.google.com/speech-api/v1/recognize?xjerr=1&client=chromium&lang=ru-RU', 'userfile', ExtractFileName(OutFileName), Stream, StrList);

where
Stream is a TFileStream opened on the recorded FLAC file,
StrList is a TStringList that receives the response from Google.
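
For context, here is a rough sketch of the surrounding code. The SendToGoogle wrapper is my own name, not something from the mSpeech sources; it simply opens the recorded file and hands it to HTTPPostFile:

// Hypothetical wrapper: opens the recorded FLAC file and posts it to Google.
function TMainForm.SendToGoogle(const AFileName: string): string;
var
  Stream: TFileStream;
  StrList: TStringList;
begin
  Result := '';
  Stream := TFileStream.Create(AFileName, fmOpenRead or fmShareDenyWrite);
  StrList := TStringList.Create;
  try
    if HTTPPostFile('https://www.google.com/speech-api/v1/recognize?xjerr=1&client=chromium&lang=ru-RU',
      'userfile', ExtractFileName(AFileName), Stream, StrList) then
      Result := StrList.Text;   // the JSON answer, parsed in step 3
  finally
    StrList.Free;
    Stream.Free;
  end;
end;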

The HTTPPostFile function itself is quite simple, but there are subtleties in it:

function TMainForm.HTTPPostFile(Const URL, FieldName, FileName: String; Const Data: TStream; Const ResultData: TStrings): Boolean;
const
  CRLF = #$0D + #$0A;
var
  HTTP: THTTPSend;
  // Single-byte strings: the multipart header must be written to the stream as raw bytes.
  // In a Unicode Delphi (XE), a plain String here would make Pointer/Length write UTF-16 data
  // and break the request body.
  Bound, Str: AnsiString;
begin
  // Random boundary for the multipart/form-data body
  Bound := IntToHex(Random(MaxInt), 8) + '_Synapse_boundary';
  HTTP := THTTPSend.Create;
  try
    // Part header: field name, file name and the content type of the FLAC data
    Str := '--' + Bound + CRLF;
    Str := Str + 'content-disposition: form-data; name="' + FieldName + '";';
    Str := Str + ' filename="' + FileName + '"' + CRLF;
    Str := Str + 'Content-Type: audio/x-flac; rate=8000' + CRLF + CRLF;
    HTTP.Document.Write(Pointer(Str)^, Length(Str));
    // The FLAC file itself, followed by the closing boundary
    HTTP.Document.CopyFrom(Data, 0);
    Str := CRLF + '--' + Bound + '--' + CRLF;
    HTTP.Document.Write(Pointer(Str)^, Length(Str));
    // The request content type also carries the sample rate and the boundary
    HTTP.MimeType := 'audio/x-flac; rate=8000, boundary=' + Bound;
    Result := HTTP.HTTPMethod('POST', URL);
    // Google's answer comes back in the same Document stream
    ResultData.LoadFromStream(HTTP.Document);
  finally
    HTTP.Free;
  end;
end;


3. Parsing the response line from Google and executing the command

The response line from Google comes in JSON form, for example:

{"status": 0, "id": "5e34348f2887c7a3cc27dc3695ab4575-1", "hypotheses": [{"utterance": "notepad", "confidence": 0.7581704}]}

For parsing, I use the uJSON library (a parsing sketch is given after the list of fields below).

What the response fields mean:
status = 0 - the recording was successfully recognized
status = 5 - the recording was not recognized
id - a unique identifier of the request
hypotheses - the recognition result, which contains two subfields:
utterance - the recognized phrase
confidence - the recognition confidence
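
Here is a rough parsing sketch. I assume uJSON's org.json-style accessors (TJSONObject.getInt / getJSONArray, TJSONArray.getJSONObject / length, getString); ParseGoogleResponse and ExecuteCommand are my own helper names, not taken from the project:

// Hypothetical helper: extracts the recognized phrase from Google's JSON answer.
// Requires uJSON in the uses clause.
procedure TMainForm.ParseGoogleResponse(const Answer: string);
var
  Response: TJSONObject;
  Hypotheses: TJSONArray;
begin
  Response := TJSONObject.Create(Answer);
  try
    if Response.getInt('status') = 0 then                 // 0 = successfully recognized
    begin
      Hypotheses := Response.getJSONArray('hypotheses');
      if Hypotheses.length > 0 then
        ExecuteCommand(Hypotheses.getJSONObject(0).getString('utterance'));
    end;
  finally
    Response.Free;
  end;
end;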

Sending the file, parsing the response, and finding and executing the command are all done in a separate thread, JvThreadRecognize.
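
Assuming JvThreadRecognize is a TJvThread from JEDI VCL, its OnExecute handler can look roughly like this (SendToGoogle and ParseGoogleResponse are the hypothetical helpers from the sketches above):

// Runs outside the main thread, so the UI stays responsive while Google answers.
procedure TMainForm.JvThreadRecognizeExecute(Sender: TObject; Params: Pointer);
begin
  ParseGoogleResponse(SendToGoogle(OutFileName));   // steps 2 and 3; the command itself is run in step 4
end;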

The list of commands is stored in the MSpeechCommand.ini file, one command per line in the form "spoken phrase;program or script to run" (the phrases are in Russian because recognition is requested with lang=ru-RU). An example file:

блокнот;notepad.exe
свернуть все программы;script\Show_Desktop.scf
заблокировать компьютер;script\Lock_Workstation.cmd
выключить компьютер;script\Halt_Workstation.cmd
перезагрузить компьютер;script\Reboot_Workstation.cmd
завершить сеанс;script\Logoff_Workstation.cmd
запустить qip;C:\Program Files\QIP Infium\infium.exe
интернет;firefox.exe
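
A rough sketch of what executing a command from this file can look like. ExecuteCommand is again my own name; the real project may do this differently, but the idea is a simple "phrase -> command" lookup followed by ShellExecute (from the ShellAPI unit):

// Hypothetical helper: looks up the recognized phrase and launches the matching command.
procedure TMainForm.ExecuteCommand(const Utterance: string);
var
  Commands: TStringList;
  Cmd: string;
begin
  Commands := TStringList.Create;
  try
    Commands.NameValueSeparator := ';';   // lines have the form "phrase;command"
    Commands.LoadFromFile(ExtractFilePath(ParamStr(0)) + 'MSpeechCommand.ini');
    Cmd := Commands.Values[AnsiLowerCase(Trim(Utterance))];
    if Cmd <> '' then
      ShellExecute(0, 'open', PChar(Cmd), nil, nil, SW_SHOWNORMAL);
  finally
    Commands.Free;
  end;
end;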

Results:

This program does not claim to be a finished product; it is just an example of using the Google Speech API to execute commands on a computer (so far it only launches applications and runs system commands). But nothing prevents you from extending it and teaching it to move the mouse, type text in a text editor, and so on. A ready-made build and the sources (GPLv3) are available at code.google.com/p/mspeech. I will be glad to hear constructive criticism and suggestions. Thanks.