How a captcha revealed a Yandex vulnerability
Sit back comfortably and brew yourself some tea, because this write-up is a bit long and rambling. Ready? Great, let's get going.
ATTENTION! The information below is published solely for research purposes and is not intended for malicious use!
Let me start, perhaps, with some backstory. Namely, with the development of a network drive whose sectors live in the cloud. The idea was to combine many accounts (say, 100 or 1,000 Yandex.Disk accounts at 10 GB each — the quota may have changed by the time you read this) into one big virtual disk of, say, 10 TB. So far so good...
I started developing this program a couple of years ago (I did finish writing it, but that's a story for another article), and the question arose of how to get past the Yandex captcha.
So I looked at the captcha, googled a bit, and figured I would need TensorFlow for training, and then port the model to FANN. A bit discouraging, of course, but nothing to be done. I decided to start by downloading a batch of images (~100k) to train the network on, and in the meantime brush up on U-Net segmentation. I sketched a few lines of code in Delphi + Synapse, launched it, and went off to read about neural networks. While I was searching, the downloader collected... well, a lot, in short. And here the most interesting part begins.
I opened the folder with the images and saw nothing but repetitions! There were a huge number of duplicate pictures. That clearly would not do for training, so I downloaded the first file-deduplication tool I could find (CloneSpy, if memory serves), launched it, and went to bed. In the morning the statistics made me stop and think: out of 100k images, 76k duplicates had been deleted — and not merely similar images, but 100% byte-for-byte identical ones! What does that mean, you ask? Let me explain.
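The deduplication step described above can be sketched in a few lines: group files by a hash of their contents and report every group with more than one member. This is a minimal illustration, not the CloneSpy tool the text mentions; the function names are my own.

```python
import hashlib
import os

def file_hash(path):
    """Return the SHA-256 hex digest of a file's contents."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def find_duplicates(folder):
    """Group files by content hash; return only groups with duplicates."""
    groups = {}
    for name in os.listdir(folder):
        path = os.path.join(folder, name)
        if os.path.isfile(path):
            groups.setdefault(file_hash(path), []).append(path)
    return {h: paths for h, paths in groups.items() if len(paths) > 1}
```

Because the duplicates are byte-for-byte identical, an exact content hash is enough — no fuzzy image comparison is needed.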
With a neural network you can get roughly ~18% recognition, as one author reports (though it seems to me you could squeeze out 45-50% if you tried). But even so, consider how much hassle such methods involve (datasets need to be built, a lot of captchas manually labeled, then everything systematized while you wait for it all to come together), how much space they take, and how slow the resulting program would be.
"Is there another way?" you ask. Let's count: we have 100k images, of which 76k are repetitions. If we build a database of these images (keyed, for example, by a hash of the file contents), we get as much as 76% recognition — higher than the neural network — and the database weighs about the same as (if not less than) a TensorFlow weights file. Moreover, this method works everywhere and needs no pile of libraries.
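The lookup-table approach can be sketched as follows: solve each unique image once, store its answer under the image's content hash, and from then on recognize any repeated captcha instantly. This is a minimal sketch of the idea, assuming an in-memory dictionary; the class name and API are my own illustration, not the author's Delphi program.

```python
import hashlib

class CaptchaDB:
    """Maps an image's content hash to its known answer.

    Since ~76% of captcha images repeat exactly, a captcha solved once
    can be recognized later by hash alone, with no neural network.
    """

    def __init__(self):
        self._answers = {}  # hex digest -> answer text

    @staticmethod
    def _key(image_bytes):
        return hashlib.sha256(image_bytes).hexdigest()

    def add(self, image_bytes, answer):
        """Store the answer for a manually solved captcha image."""
        self._answers[self._key(image_bytes)] = answer

    def solve(self, image_bytes):
        """Return the stored answer, or None if the image is unseen."""
        return self._answers.get(self._key(image_bytes))
```

With 76% of incoming images already in the database, `solve` succeeds about three times out of four; the remaining unseen images fall back to manual entry (and get added for next time).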
It turns out Yandex misbehaved by generating so few captcha variations, and thanks to that you can write the program I described above. Isn't that an obvious system vulnerability? Or do you think over 100 TB of disk space comes cheap?
Thanks for reading!