Speech technology. Voice biometrics for dummies using the example of a contact center


Hello.
I recently wrote an article here about continuous speech recognition, and now I would like to cover voice biometrics, i.e. confirming a person's identity by voice and recognizing a person by voice.

Since my work is connected with contact centers (CC), I will again talk about them. Contact centers are also actively interested in voice biometrics right now, which is not surprising, because the telephone channel is an ideal application for it:
- you cannot see the subscriber on the other end of the line;
- you cannot use other modalities to confirm identity, such as the face, the retina, or a fingerprint;
- no additional scanning devices are needed, such as those you put your finger on or show your eye to;
- it is the cheapest kind of biometrics, although slightly less reliable than other methods. But since other modalities are not technically practical for mass use over the telephone, there is in fact no choice.
Of course, you could argue for confirming the subscriber's identity “based on knowledge” (passwords, secret words, TPIN codes at banks, passport data, etc.), but none of this is reliable from a security point of view, and it requires the subscriber to memorize the information or always keep it at hand, which is inconvenient for the subscriber and inefficient (costly) for the CC.

To begin with, let's define what is included in the concept of voice biometrics:
- Identification, i.e. recognizing a person by voice. This is when an old friend calls you from an unknown number and says: “Guess who this is?” and you try to find the best match among all the voices you know. When the memory scan is over and you have found a more or less suitable match, you can say: “Yeah, this is my classmate Seryoga, whom I have not talked to for 10 years.” But you have no guarantee that it is really him, and that is where verification comes in.
- Verification, i.e. confirming identity by voice: an unambiguous proof of who the person is. For this we can ask the caller to prove that he is the Seryoga he claims to be. We can ask him: “Tell me where we were at 6 am at the graduation party.” This information lets us confirm Seryoga's identity, since only he can know it (similar to the password I wrote about above).

If you want a more formal definition:
Identification checks one voice sample against many samples in the voice database. As a result, the system shows a list of people with similar voices, with the similarity given as a percentage. A 100% match means that the sample voice completely matches the voice from the database and the identity is established reliably.
Verification compares two voice samples: the voice of the person whose identity must be verified and the voice stored in the system's database, whose owner has already been authenticated. As a result, the system shows the degree to which one voice matches the other, as a percentage.
There is also such a thing as authentication. It is hard to say how it differs from verification. Some of our employees think it is the process of confirming a biological (!) identity in cases where it is difficult to separate identification from verification, i.e. it is a generalized process.
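
To make the difference between the two definitions concrete, here is a minimal sketch. It assumes a voice model is just a short list of numbers and uses cosine similarity as a stand-in for the real comparison. The names `similarity_percent`, `identify` and `verify` and the toy data are invented for illustration and are not part of any real product.

```python
import math

def similarity_percent(a, b):
    """Cosine similarity of two feature vectors, mapped to 0..100 (assumed metric)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    cosine = dot / norm if norm else 0.0
    return max(0.0, cosine) * 100.0

def identify(sample, database):
    """Identification (1:N): rank every enrolled person by similarity to the sample."""
    ranked = [(name, similarity_percent(sample, model)) for name, model in database.items()]
    return sorted(ranked, key=lambda item: item[1], reverse=True)

def verify(sample, claimed_name, database):
    """Verification (1:1): compare the sample only with the claimed person's model."""
    return similarity_percent(sample, database[claimed_name])

# Toy 3-number "voice models"; real models are far larger.
database = {
    "Seryoga": [0.9, 0.1, 0.3],
    "Yuri": [0.1, 0.8, 0.5],
}
caller = [0.85, 0.15, 0.35]

ranking = identify(caller, database)        # list of (name, percent), best match first
score = verify(caller, "Seryoga", database)  # a single percent for the claimed identity
```

Identification returns a ranked list over the whole database, while verification returns one number for one claimed identity, which is why verification is cheaper and easier to turn into a yes/no decision.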

Voice Verification.
I will tell you about verification, because it is more interesting for a real contact center application than identification.

What kinds of verification are there?

- Text-independent
Identity is confirmed from the subscriber's spontaneous speech, i.e. we do not care what the person says. This is the slowest method: at least 6-8 seconds of the subscriber's clean speech must be accumulated. It is typically used during the subscriber's conversation with the CC operator, when the operator needs to be sure that the subscriber is exactly who he claims to be. The most interesting thing is that this method can be used without the subscriber noticing. The tool at the CC operator's workstation looks like this (Figure 1).

Figure 1. Part of the interface of the workplace of the CC operator for the verification of the client.

- Text-dependent, with a static passphrase
Identity is confirmed using the passphrase that the subscriber came up with at registration. The passphrase must last at least 3 seconds. We usually suggest saying your name and company name. The passphrase is always the same.
- Text-dependent, with a dynamic passphrase
Identity is confirmed using a passphrase that the system itself proposes at the time of the call, i.e. the passphrase is different every time! We usually build the dynamic passphrase from a sequence of numbers. The subscriber repeats numbers after the system until it makes an unambiguous “friend / foe” decision. This can be a single number such as “32” or a whole set such as “32 58 64 25”. Interestingly, different numbers give different amounts of information for comparison: the most “useful” digit is “eight”, which carries the most speech information, and the most useless is “two”.
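
The dynamic-passphrase loop can be sketched as follows. This is only an illustration under assumptions: the accept and reject limits, the maximum number of rounds, and the `score_fn` callback (which stands in for the real biometric comparison) are all hypothetical.

```python
import random

def make_prompt(rng, pairs=4):
    """Build a prompt such as '32 58 64 25' from random two-digit numbers."""
    return " ".join(str(rng.randint(10, 99)) for _ in range(pairs))

def run_challenges(rng, score_fn, accept=70.0, reject=30.0, max_rounds=3):
    """Keep prompting until the score is clearly high or clearly low.

    score_fn(prompt) is a placeholder for "play the prompt, record the
    subscriber, return the match percentage". The limits are assumed values.
    """
    for _ in range(max_rounds):
        prompt = make_prompt(rng, pairs=1)
        score = score_fn(prompt)
        if score >= accept:
            return "friend"
        if score <= reject:
            return "foe"
    return "undecided"

rng = random.Random(0)          # seeded only to make the sketch repeatable
example_prompt = make_prompt(rng)
```

Because the prompt changes on every call, a recording of an earlier session does not contain the right words, which is the main reason to prefer a dynamic passphrase over a static one.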

How does voice verification work?

Step 1.
To verify by voice, we must already have in our database a voice sample (voice cast) whose owner is reliably known. So the first step is to build up a database of voice casts, and for this we ask subscribers (customers) to go through registration in the system.
Registering a subscriber in the system means that he voluntarily leaves his voice cast, which we will then use for verification. We usually ask for 3 voice casts in a row, so that there is some variability: say your password three times. Later, each time verification succeeds, we replace the oldest voice cast with the new one, so the casts are constantly refreshed if the subscriber uses the system often. This is how we solve the problem of voice aging.
If we use verification with a dynamic passphrase, we ask the subscriber to pronounce the numbers from 0 to 9 three times. As a result, we have 30 voice samples.
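
The "replace the oldest cast after each successful verification" rule is easy to sketch with a fixed-size queue. The class name `VoiceCasts` and its methods are invented for this example, and the casts are represented by plain strings instead of real voice data.

```python
from collections import deque

class VoiceCasts:
    """Keeps the most recent voice casts of one subscriber (3 by default)."""

    def __init__(self, initial_casts, size=3):
        if len(initial_casts) != size:
            raise ValueError(f"registration needs exactly {size} voice casts")
        # A deque with maxlen silently drops the oldest item when a new one is appended.
        self._casts = deque(initial_casts, maxlen=size)

    def on_successful_verification(self, new_cast):
        """Store the fresh cast; the oldest one falls out automatically."""
        self._casts.append(new_cast)

    def casts(self):
        """Return the stored casts, oldest first."""
        return list(self._casts)

subscriber = VoiceCasts(["cast_1", "cast_2", "cast_3"])
subscriber.on_successful_verification("cast_4")
# subscriber.casts() is now ["cast_2", "cast_3", "cast_4"]
```

Updating only after a successful verification matters: otherwise an impostor's failed attempt would pollute the stored casts.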

It is desirable that the client leaves his voice cast (registers) over the same communication channel through which he will later be verified, otherwise the probability of errors increases. There are cases when people register from a headset on Skype and are then verified from a home phone, and here the communication channel plays a big role in the reliability of the service. When building a service, you can take into account that the channels may differ: this is worked out and tested separately for each specific case, and the effect of the channel can be leveled out almost completely. But if you deploy the service straight away without thinking this through, there will be difficulties.

When should a client be offered registration? When we have already confirmed his identity in other ways, for example during a visit to the company's office, or after the CC operator has asked a zillion different questions about his mother's maiden name.
We have a real working demo service (stand) on the phone; how the registration mechanism for bank customers is implemented in practice is described in this document.

It is important that the client registers independently and consciously (knowing why it is needed and how it will help him later), because only a loyal subscriber who wants the result and accepts the “rules of the game” can pass verification.
If the client is forced to undergo verification whether it is appropriate or not, he may subconsciously change his voice or fool around (being unfriendly to the service). This leads to errors and customer loyalty falls, although he himself is indirectly to blame for it.

How is a subscriber registered in the system? (static passphrase)

Figure 2. Scheme of registration of a person in a biometric system.

1. The subscriber calls the biometric system, which invites him to come up with a passphrase and say it 3 times.
2. The voice is processed by the biometrics server, and at the output we get 3 voice models, one for each spoken password.
3. On the server we have a customer card (Yuri Gagarin), to which we attach the 3 voice models.

What is a voice model?
It is the set of unique characteristics of a person's voice expressed as a matrix of numbers, i.e. a file about 18 KB in size (for a static passphrase). It is like a fingerprint. These voice models are what we then compare. In total, a voice model captures 74 (!) different voice parameters.

How are voice models obtained?
We use 4 independent methods:
- analysis of fundamental frequency (pitch) statistics;
- Gaussian mixture models (GMM) with SVM;
- the spectral-formant method;
- the total variability method.
I will not try to describe them in detail here: it is difficult even for me, and it definitely does not belong in a course “for dummies”. All of this is taught at our RIS department at ITMO (St. Petersburg).

Step 2.
This is verification itself. We have a subscriber on the other end of the line who claims to be Yuri Gagarin. In our database we have the customer card of Yuri Gagarin, where casts of his voice are stored, so all we need to do is compare the voice of the person who claims to be Yuri Gagarin with the voice of the real Yuri Gagarin.

How is the subscriber verified in the system? (static passphrase)

Figure 3. Scheme of verification of a person in a biometric system.

1. First we act as during registration: we take the password spoken by the client, send it to the biometrics server and build the voice model of the “alleged” Yuri Gagarin.
2. Then we take the 3 voice models of the real Yuri Gagarin, combine them into an average model in a tricky way and also send it to the biometrics server.
3. We simply compare the 2 models. At the output we get the percentage to which one model matches the other.
4. Next we need to do something with this number (92% in the figure). Is it a lot or a little? Can we say for certain that this is Yuri Gagarin, or is it an impostor?

Figure 4. Threshold of trust "friend / foe".

In the system we have a parameter called the “threshold of trust”, which is a certain match percentage. Suppose we set it at 60%. If the match percentage for the voice model of the “alleged” Yuri Gagarin does not reach the threshold of trust, an impostor has called us. If it is above the threshold, the real Yuri Gagarin has called. We set the threshold of trust ourselves, usually between 50 and 70%, depending on the verification task.
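
The comparison and the threshold decision can be sketched like this. Note the assumptions: the real "average model" is built in a way the article does not describe, so a plain element-wise mean is used here as a stand-in, cosine similarity replaces the real comparison, and a score exactly equal to the threshold is treated as a pass. All function names are invented for the example.

```python
import math

def average_model(models):
    """Element-wise mean of several voice models (a stand-in for the real averaging)."""
    return [sum(values) / len(models) for values in zip(*models)]

def match_percent(a, b):
    """Cosine similarity mapped to 0..100 (assumed comparison metric)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return max(0.0, dot / norm if norm else 0.0) * 100.0

def decide(score, threshold=60.0):
    """Apply the threshold of trust; equality counts as a pass (an assumption)."""
    return "friend" if score >= threshold else "foe"

# Three enrolled toy models of the real subscriber and one model of the caller.
enrolled = [[1.0, 0.0], [0.8, 0.2], [0.9, 0.1]]
caller = [0.7, 0.3]

reference = average_model(enrolled)
score = match_percent(caller, reference)
verdict = decide(score, threshold=60.0)
```

Raising `threshold` makes the system stricter with impostors but also more likely to turn away the genuine subscriber, which is why the value depends on the verification task.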

Here I ought to tell you about type I errors (FR, false rejection) and type II errors (FA, false acceptance), as well as the equal error rate (EER), but I will not: it would make the text much longer and more complicated. If there is interest, I will try to persuade someone from the scientific department to describe this in plain language and post it here separately.

I will simply say that, depending on the verification task, it is sometimes better for us to let a “stranger” through than to turn away one of “our own”. And vice versa, sometimes it is more important to keep the “stranger” out than to let the “friend” in.
I am sure none of you understood these 2 sentences the first time, and you had to reread them carefully to grasp the meaning.
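
A few lines of code may make this trade-off clearer than two sentences. The scores below are made-up toy values, not measurements, and `error_rates` is a name invented for the sketch. It counts how often genuine callers are rejected (FR) and impostors are accepted (FA) at a given threshold of trust.

```python
def error_rates(genuine_scores, impostor_scores, threshold):
    """Return (FR rate, FA rate) for one threshold.

    FR: a genuine caller scored below the threshold and was rejected.
    FA: an impostor scored at or above the threshold and was accepted.
    """
    false_rejects = sum(1 for s in genuine_scores if s < threshold)
    false_accepts = sum(1 for s in impostor_scores if s >= threshold)
    return false_rejects / len(genuine_scores), false_accepts / len(impostor_scores)

# Toy match percentages, invented purely for illustration.
genuine = [92, 81, 74, 66, 55]
impostor = [12, 35, 48, 58, 67]

for threshold in (50, 60, 70):
    fr, fa = error_rates(genuine, impostor, threshold)
    print(f"threshold {threshold}%: FR = {fr:.0%}, FA = {fa:.0%}")
```

With these toy numbers a low threshold rejects no genuine callers but lets some impostors in, a high threshold does the opposite, and somewhere in between the two error rates are equal, which is the idea behind the EER.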

Integration of a biometrics server into a contact center.

Figure 5. A block diagram of a VoiceKey product.

Honestly, everything is very simple here: we send the voice in WAV or PCM format over HTTP and get the comparison result back. I do not want to dwell on this in more detail.
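
For readers who want to picture the exchange, here is a rough sketch of preparing such a request with the Python standard library. Everything product-specific is an assumption: the URL `http://biometrics.example/verify`, the `X-Claimed-Id` header, and the 8 kHz 16-bit mono audio format are hypothetical placeholders and do not describe the actual VoiceKey interface. The request is only built, not sent.

```python
import io
import urllib.request
import wave

def make_wav(pcm_bytes, sample_rate=8000):
    """Wrap raw 16-bit mono PCM in a WAV container (sample rate is an assumption)."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)       # mono
        wav.setsampwidth(2)       # 16-bit samples
        wav.setframerate(sample_rate)
        wav.writeframes(pcm_bytes)
    return buffer.getvalue()

def build_verify_request(wav_bytes, claimed_id, url="http://biometrics.example/verify"):
    """Build (but do not send) an HTTP POST carrying the audio. URL and header are hypothetical."""
    return urllib.request.Request(
        url,
        data=wav_bytes,
        method="POST",
        headers={"Content-Type": "audio/wav", "X-Claimed-Id": claimed_id},
    )

silence = bytes(16000)  # one second of silence at 8 kHz, 16-bit
request = build_verify_request(make_wav(silence), "yuri.gagarin")
# urllib.request.urlopen(request) would send it; the response format is not specified here.
```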

The verification process takes an average of 0.8 seconds. It is possible to work simultaneously with many threads.

All of this is described in detail on our site, and most importantly there are well-developed use cases for contact centers. Over the past years I have talked quite a lot with various large CCs in Russia, primarily in the financial sector, and have formed an understanding of their goals and objectives.

Now let us raise the following question: how suitable is voice biometrics technology for mass use in general? Is it reliable?

In short, YES, it really does work well. Our company has telephone demonstration stands. If you are interested, each of you can call and personally try out how it works. I give out the telephone number and testing instructions on request from this page, simply to gather statistics on interest in the topic and to estimate server load.

For reference: the work of Russian scientists in voice biometrics ranks, if not first in the world, then certainly shares first place with others. This is confirmed by independent evaluations, for example by NIST (National Institute of Standards and Technology, USA), where our company placed in the top three among commercial companies in all five tests. Another confirmation is that our VoiceKey product won the “Best Product of the Year for CC” nomination in 2013 in the international “Crystal Headset” competition.
It is also worth noting that our company implemented what is today the largest voice biometrics project in the world over a telephone channel.

In short, that is the crash course. I am ready to answer questions in the comments.
