Exploring OpenCL with Password Cracking
Introduction
Recently, after reading various articles and presentations about GPGPU, I decided to try programming for video cards myself. The choice of technologies in this area is actually small: only CUDA (nVidia's proprietary standard) and OpenCL (an open standard that works on GPUs from ATI and nVidia, as well as on CPUs) are still alive and developing. Since my laptop has an ATI graphics card (Mobility Radeon HD 5650), the choice came down to a single option: OpenCL. This article describes the process of learning OpenCL from scratch, as well as what came of it.
Overview of OpenCL and PyOpenCl
At first glance everything seemed very confusing to me, both the host code in C and the code of the so-called kernels. With the C API, even launching a simple program takes a large number of lines, especially with at least some error handling, so I wanted to find something more convenient and humane. The choice fell on the PyOpenCL library, whose name makes it clear that the host code is written in Python. Everything looks much clearer in it, even to someone seeing OpenCL code for the first time (of course, this applies only to simple examples). However, the kernels themselves are still written in a slightly modified C, so that has to be studied anyway. Full documentation on it is available on the website of the standard's developer (Khronos), and information on specific implementations on the ATI and nVidia sites, respectively.
You can get a first impression of the language using the simplest example (adding two arrays):
__kernel void sum(__global const float *a,
__global const float *b, __global float *c)
{
int gid = get_global_id(0);
c[gid] = a[gid] + b[gid];
}
And here is the complete code needed to run this example and check the result (taken from the PyOpenCL documentation):
import pyopencl as cl
import numpy
import numpy.linalg as la
a = numpy.random.rand(50000).astype(numpy.float32)
b = numpy.random.rand(50000).astype(numpy.float32)
ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)
mf = cl.mem_flags
a_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a)
b_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=b)
dest_buf = cl.Buffer(ctx, mf.WRITE_ONLY, b.nbytes)
prg = cl.Program(ctx, """
__kernel void sum(__global const float *a,
__global const float *b, __global float *c)
{
int gid = get_global_id(0);
c[gid] = a[gid] + b[gid];
}
""").build()
prg.sum(queue, a.shape, None, a_buf, b_buf, dest_buf)
a_plus_b = numpy.empty_like(a)
cl.enqueue_copy(queue, a_plus_b, dest_buf)
print la.norm(a_plus_b - (a+b))
The characteristic steps are immediately visible: creating a context and a command queue, creating buffers and copying them to the device, and compiling and launching the kernel itself. You can read more about contexts and queues in the OpenCL documentation; for relatively simple programs, one queue and one context are enough, and they are created by lines very similar to those in the example. In general, the structure of computation in OpenCL programs often looks something like this:
- creating the context and queues, compiling the program
- copying data (buffers) that do not change during execution to the device
- a loop:
  - copying data specific to this iteration to the device
  - executing the kernel
  - copying the computed data back to main memory, possibly with some processing
SHA1 hashing
It is time to go down a level and figure out how kernel code works. For a function to be callable from the outside, it must be marked with the __kernel attribute, have a void return type, and take a certain number of arguments, which can be either plain values (int, float4, ...) or pointers to memory regions (__global, __constant, __local). For convenience, other functions callable from a kernel can also be declared in the program, and this does not affect performance: all such functions are automatically inlined (as with the inline directive). The reason for this is that recursion is not supported in OpenCL at all.
Taking advantage of the fact that OpenCL is a modified C, you can take a ready-made implementation of a hash function, for example SHA1, and use it with minor modifications:
#define K1 0x5A827999
#define K2 0x6ED9EBA1
#define K3 0x8F1BBCDC
#define K4 0xCA62C1D6
#define a0 0x67452301
#define b0 0xEFCDAB89
#define c0 0x98BADCFE
#define d0 0x10325476
#define e0 0xC3D2E1F0
#define f1(x,y,z) ( z ^ ( x & ( y ^ z ) ) ) /* Rounds 0-19 */
#define f2(x,y,z) ( x ^ y ^ z ) /* Rounds 20-39 */
#define f3(x,y,z) ( ( x & y ) | ( z & ( x | y ) ) ) /* Rounds 40-59 */
#define f4(x,y,z) ( x ^ y ^ z ) /* Rounds 60-79 */
#define ROTL(n,X) ( ( ( X ) << n ) | ( ( X ) >> ( 32 - n ) ) )
#define expand(W,i) ( W[ i & 15 ] = ROTL( 1, ( W[ i & 15 ] ^ W[ (i - 14) & 15 ] ^ \
W[ (i - 8) & 15 ] ^ W[ (i - 3) & 15 ] ) ) )
#define subRound(a, b, c, d, e, f, k, data) \
( e += ROTL( 5, a ) + f( b, c, d ) + k + data, b = ROTL( 30, b ) )
#define REVERSE(value) value = ((value & 0xFF000000) >> 24) | ((value & 0x00FF0000) >> 8) | ((value & 0x0000FF00) << 8) | ((value & 0x000000FF) << 24)
long sha1(uint *eData, const int length)
{
unsigned int A = a0;
unsigned int B = b0;
unsigned int C = c0;
unsigned int D = d0;
unsigned int E = e0;
((char *)eData)[length] = 0x80;
for (int i = 0; i <= length / 4; i++)
{
REVERSE(eData[i]);
}
eData[14] = 0;
eData[15] = length * 8;
subRound( A, B, C, D, E, f1, K1, eData[ 0 ] );
subRound( E, A, B, C, D, f1, K1, eData[ 1 ] );
subRound( D, E, A, B, C, f1, K1, eData[ 2 ] );
subRound( C, D, E, A, B, f1, K1, eData[ 3 ] );
subRound( B, C, D, E, A, f1, K1, eData[ 4 ] );
subRound( A, B, C, D, E, f1, K1, eData[ 5 ] );
subRound( E, A, B, C, D, f1, K1, eData[ 6 ] );
subRound( D, E, A, B, C, f1, K1, eData[ 7 ] );
subRound( C, D, E, A, B, f1, K1, eData[ 8 ] );
subRound( B, C, D, E, A, f1, K1, eData[ 9 ] );
subRound( A, B, C, D, E, f1, K1, eData[ 10 ] );
subRound( E, A, B, C, D, f1, K1, eData[ 11 ] );
subRound( D, E, A, B, C, f1, K1, eData[ 12 ] );
subRound( C, D, E, A, B, f1, K1, eData[ 13 ] );
subRound( B, C, D, E, A, f1, K1, eData[ 14 ] );
subRound( A, B, C, D, E, f1, K1, eData[ 15 ] );
subRound( E, A, B, C, D, f1, K1, expand( eData, 16 ) );
subRound( D, E, A, B, C, f1, K1, expand( eData, 17 ) );
subRound( C, D, E, A, B, f1, K1, expand( eData, 18 ) );
subRound( B, C, D, E, A, f1, K1, expand( eData, 19 ) );
subRound( A, B, C, D, E, f2, K2, expand( eData, 20 ) );
subRound( E, A, B, C, D, f2, K2, expand( eData, 21 ) );
subRound( D, E, A, B, C, f2, K2, expand( eData, 22 ) );
subRound( C, D, E, A, B, f2, K2, expand( eData, 23 ) );
subRound( B, C, D, E, A, f2, K2, expand( eData, 24 ) );
subRound( A, B, C, D, E, f2, K2, expand( eData, 25 ) );
subRound( E, A, B, C, D, f2, K2, expand( eData, 26 ) );
subRound( D, E, A, B, C, f2, K2, expand( eData, 27 ) );
subRound( C, D, E, A, B, f2, K2, expand( eData, 28 ) );
subRound( B, C, D, E, A, f2, K2, expand( eData, 29 ) );
subRound( A, B, C, D, E, f2, K2, expand( eData, 30 ) );
subRound( E, A, B, C, D, f2, K2, expand( eData, 31 ) );
subRound( D, E, A, B, C, f2, K2, expand( eData, 32 ) );
subRound( C, D, E, A, B, f2, K2, expand( eData, 33 ) );
subRound( B, C, D, E, A, f2, K2, expand( eData, 34 ) );
subRound( A, B, C, D, E, f2, K2, expand( eData, 35 ) );
subRound( E, A, B, C, D, f2, K2, expand( eData, 36 ) );
subRound( D, E, A, B, C, f2, K2, expand( eData, 37 ) );
subRound( C, D, E, A, B, f2, K2, expand( eData, 38 ) );
subRound( B, C, D, E, A, f2, K2, expand( eData, 39 ) );
subRound( A, B, C, D, E, f3, K3, expand( eData, 40 ) );
subRound( E, A, B, C, D, f3, K3, expand( eData, 41 ) );
subRound( D, E, A, B, C, f3, K3, expand( eData, 42 ) );
subRound( C, D, E, A, B, f3, K3, expand( eData, 43 ) );
subRound( B, C, D, E, A, f3, K3, expand( eData, 44 ) );
subRound( A, B, C, D, E, f3, K3, expand( eData, 45 ) );
subRound( E, A, B, C, D, f3, K3, expand( eData, 46 ) );
subRound( D, E, A, B, C, f3, K3, expand( eData, 47 ) );
subRound( C, D, E, A, B, f3, K3, expand( eData, 48 ) );
subRound( B, C, D, E, A, f3, K3, expand( eData, 49 ) );
subRound( A, B, C, D, E, f3, K3, expand( eData, 50 ) );
subRound( E, A, B, C, D, f3, K3, expand( eData, 51 ) );
subRound( D, E, A, B, C, f3, K3, expand( eData, 52 ) );
subRound( C, D, E, A, B, f3, K3, expand( eData, 53 ) );
subRound( B, C, D, E, A, f3, K3, expand( eData, 54 ) );
subRound( A, B, C, D, E, f3, K3, expand( eData, 55 ) );
subRound( E, A, B, C, D, f3, K3, expand( eData, 56 ) );
subRound( D, E, A, B, C, f3, K3, expand( eData, 57 ) );
subRound( C, D, E, A, B, f3, K3, expand( eData, 58 ) );
subRound( B, C, D, E, A, f3, K3, expand( eData, 59 ) );
subRound( A, B, C, D, E, f4, K4, expand( eData, 60 ) );
subRound( E, A, B, C, D, f4, K4, expand( eData, 61 ) );
subRound( D, E, A, B, C, f4, K4, expand( eData, 62 ) );
subRound( C, D, E, A, B, f4, K4, expand( eData, 63 ) );
subRound( B, C, D, E, A, f4, K4, expand( eData, 64 ) );
subRound( A, B, C, D, E, f4, K4, expand( eData, 65 ) );
subRound( E, A, B, C, D, f4, K4, expand( eData, 66 ) );
subRound( D, E, A, B, C, f4, K4, expand( eData, 67 ) );
subRound( C, D, E, A, B, f4, K4, expand( eData, 68 ) );
subRound( B, C, D, E, A, f4, K4, expand( eData, 69 ) );
subRound( A, B, C, D, E, f4, K4, expand( eData, 70 ) );
subRound( E, A, B, C, D, f4, K4, expand( eData, 71 ) );
subRound( D, E, A, B, C, f4, K4, expand( eData, 72 ) );
subRound( C, D, E, A, B, f4, K4, expand( eData, 73 ) );
subRound( B, C, D, E, A, f4, K4, expand( eData, 74 ) );
subRound( A, B, C, D, E, f4, K4, expand( eData, 75 ) );
subRound( E, A, B, C, D, f4, K4, expand( eData, 76 ) );
subRound( D, E, A, B, C, f4, K4, expand( eData, 77 ) );
subRound( C, D, E, A, B, f4, K4, expand( eData, 78 ) );
subRound( B, C, D, E, A, f4, K4, expand( eData, 79 ) );
A += a0;
B += b0;
C += c0;
D += d0;
E += e0;
return as_ulong((uint2)(D, E));
}
Some clarification is needed here. As the "experimental" hashes to crack, I took the leaked LinkedIn hashes, of which there are almost 6 million (unique ones). There are several options for a reasonably fast membership check against such a list; I used hash tables (more on that later). To reduce memory consumption and speed things up, the idea came up not to store the full 20 bytes of each SHA1, but only the last 8 bytes, i.e. one long/ulong value. Of course, this increases the probability of a false match, but it remains very small: among all the passwords found, I had only 6 such cases, which is not critical at all. Therefore, the truncated value (the last 8 bytes) is returned directly from the function above. Otherwise everything is standard: the SHA1 algorithm is implemented directly according to the specification, for the case of strings shorter than 56 bytes.
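The truncation is easy to reproduce on the host side. Here is a minimal sketch using Python's hashlib (my own illustration; note that the kernel packs the last two state words with as_ulong, so the exact byte order on the device may differ):

```python
import hashlib
import struct

def sha1_tail(password):
    """Return the last 8 bytes of SHA1(password) as one 64-bit integer,
    i.e. the truncated value that is actually stored and compared."""
    digest = hashlib.sha1(password).digest()    # full 20-byte SHA1
    return struct.unpack('>Q', digest[-8:])[0]  # last 8 bytes as a ulong

# SHA1("password") is 5baa61e4c9b93f3f0682250b6cf8331b7ee68fd8,
# so only 0x6cf8331b7ee68fd8 would be kept in the table
print(hex(sha1_tail(b"password")))
```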
Organizing the search
Next, you need to organize the search itself. The simplest option is brute force with the same character set for every position, and it can be implemented directly, for example, like this:
__kernel void do_brute(
__global const long *table,
const ulong start_num,
__global ulong *result,
__global uint *res_ind)
{
char s[64];
uint *s_l = (uint *)s;
int i, j;
ulong _n, n;
ulong start = start_num + get_global_id(0) * HASHES_PER_WORKITEM;
for (j = 0; j < HASHES_PER_WORKITEM; j++)
{
n = _n = j + start;
for (i = 15; i >= 0; i--)
{
s_l[i] = 0;
}
for (i = COMB_LEN - 1; i >= 0; i--)
{
s[i] = charset[n % CHARS_CNT];
n /= CHARS_CNT;
}
if (check_in_table(table, sha1(s_l, COMB_LEN)))
{
result[atomic_inc(res_ind)] = _n;
}
}
}
Here HASHES_PER_WORKITEM is the number of hashes processed by one work item (thread) per launch, COMB_LEN is the length of a combination, charset is the character array, and CHARS_CNT is the number of characters in it. As you can see, the kernel receives a pointer to the hash table, the number of the password from which the search starts, a pointer to an array for outputting results, and an index into that array.
In OpenCL, it is not a single thread that is launched but a certain number of them, called the global work size, and all of them receive the same arguments. So that each thread enumerates its own part of the key space, the line
ulong start = start_num + get_global_id(0) * HASHES_PER_WORKITEM;
calculates the starting number for a particular thread (get_global_id(0) is a standard function that returns the thread index, from 0 up to the current global work size). Each thread then enumerates HASHES_PER_WORKITEM passwords, each of which is hashed by the sha1 function and checked for presence in the table by the check_in_table function, described below. In this form, with the simplest hash table implementation, I got about 20 million passwords per second, which is frankly not enough compared to, say, oclHashcat, which manages more than 300 million on my laptop (even the fact that the check runs against a large list of hashes does not excuse this). Looking ahead, I will say that for simple brute force I eventually reached 160 million per second, which is more than half the speed of oclHashcat (with a single hash).
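The mod/div loop that turns a password number into a string can be sketched in Python (the names are mine, but the logic mirrors the kernel's inner loop):

```python
def index_to_password(n, charset, length):
    """Interpret n as a number in base len(charset) and
    convert it to a password string, as the kernel does."""
    chars = [''] * length
    for i in range(length - 1, -1, -1):   # fill from the last position
        chars[i] = charset[n % len(charset)]
        n //= len(charset)
    return ''.join(chars)

charset = "abc"
# each work item would start at start_num + global_id * HASHES_PER_WORKITEM
print(index_to_password(5, charset, 2))   # -> "bc"
```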
Hash table
So, checking for the presence of a hash. The first option implemented was a simple hash table with open addressing. To keep things simple, it was filled on the CPU rather than on the video card; in OpenCL there were only lookups into it. That looked something like this:
bool check_in_table(
__global const long *table,
const long value)
{
uint index = calc_index(value);
uint step = calc_step(value);
for (uint a = 1; ; a++)
{
index %= TABLE_SIZE;
if (table[index] == 0)
{
return false;
}
if (table[index] == value)
{
return true;
}
index += a * step + 1;
}
}
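For completeness, a host-side filler following the same probe sequence could look like this in Python (a sketch; calc_index and calc_step are placeholder hash functions of my own, not the ones from the program, and 0 marks an empty slot, so stored values must be nonzero):

```python
TABLE_SIZE = 16

def calc_index(value):              # placeholder first hash function
    return value % TABLE_SIZE

def calc_step(value):               # placeholder second hash function
    return (value >> 4) | 1

def fill_table(values):
    table = [0] * TABLE_SIZE        # 0 marks an empty slot, as in the kernel
    for v in values:
        index, step = calc_index(v), calc_step(v)
        a = 1
        while True:
            index %= TABLE_SIZE
            if table[index] == 0:
                table[index] = v
                break
            index += a * step + 1   # same probe sequence as check_in_table
            a += 1
    return table

def check_in_table(table, value):   # host-side mirror of the kernel lookup
    index, step = calc_index(value), calc_step(value)
    a = 1
    while True:
        index %= TABLE_SIZE
        if table[index] == 0:
            return False
        if table[index] == value:
            return True
        index += a * step + 1
        a += 1

table = fill_table([17, 33, 5])
print(check_in_table(table, 33), check_in_table(table, 7))   # -> True False
```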
I tried various table sizes and probing methods, but the speed did not noticeably improve. Searching for material on GPU hash tables, I came across the article "Building an Efficient Hash Table on the GPU" by Vasily Volkov, which mentions the cuckoo hash table, which I had not heard of before. Briefly, its essence is to use several hash functions instead of one, together with a special way of filling the table, after which any element is found in at most k memory accesses, where k is the number of hash functions. Since speed matters more to me than memory, I used k = 2. This table, too, is filled on the CPU.
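The scheme can be illustrated with a small Python sketch (my own simplification with placeholder hash functions; the real table stores the truncated hashes in a flat buffer on the GPU):

```python
class CuckooTable:
    """Minimal cuckoo hash table sketch with k = 2 hash functions:
    lookup touches at most two slots; insertion evicts and re-places."""

    def __init__(self, size):
        self.size = size
        self.slots = [None] * size

    def _h1(self, value):                 # placeholder hash functions,
        return value % self.size          # not the ones from the program

    def _h2(self, value):
        return (value // self.size + 1) % self.size

    def insert(self, value, max_kicks=100):
        pos = self._h1(value)
        for _ in range(max_kicks):
            if self.slots[pos] is None:
                self.slots[pos] = value
                return True
            # evict the occupant and move it to its alternate slot
            value, self.slots[pos] = self.slots[pos], value
            pos = self._h2(value) if pos == self._h1(value) else self._h1(value)
        return False                      # would require a rehash in practice

    def __contains__(self, value):        # at most two memory accesses
        return (self.slots[self._h1(value)] == value
                or self.slots[self._h2(value)] == value)

t = CuckooTable(11)
for v in (3, 14, 25, 7):
    t.insert(v)
print(14 in t, 99 in t)   # -> True False
```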
Optimizations
Optimization, of course, also touched another part of the code, namely password generation. Several suboptimal places are immediately visible in the version above; for example, each next password is generated from scratch, although the previous one could simply be modified. There are other places to optimize, in particular ones specific to OpenCL: the character array lives in global or constant memory instead of the much faster local memory (it is better to read about the memory regions directly from the vendor of a specific implementation). The various kernel optimizations deserve a separate article, but here I will just say that when programming for the GPU it is worth trying different options and measuring their speed, because it is not always possible to tell by eye which will be faster. Sometimes even deleting an instruction slows execution down.
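The idea of modifying the previous password instead of rebuilding it can be sketched as an odometer-style increment (my own illustration in Python, not code from the program):

```python
def next_password(chars, charset):
    """Advance a password (as a list of charset indices) by one,
    like an odometer: bump the last position and propagate the carry."""
    for i in range(len(chars) - 1, -1, -1):
        chars[i] += 1
        if chars[i] < len(charset):
            return True          # no carry: done in O(1) on average
        chars[i] = 0             # carry into the next position
    return False                 # wrapped around: key space exhausted

charset = "abc"
chars = [0, 2]                   # "ac"
next_password(chars, charset)
print(''.join(charset[c] for c in chars))   # -> "ba"
```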
Later I added support for different alphabets at different positions and, more significantly, alphabets made of words rather than individual characters. For convenience and flexibility, the kernel code is preprocessed with the Mako template engine. All of this is in the archive (see below).
Conclusions
So, what I eventually acquired:
- knowledge of OpenCL with many of its subtleties (from scratch, and quite successfully, I think)
- a higher level of Python programming (not from scratch, but from a fairly low level)
- experience with related technologies: the Cython compiler, the Mako template engine, and the git VCS (which I had hardly used before)
Tangible results:
- the program enumerates passwords at about 160 million per second for simple brute force (with alphabets of words rather than characters, the speed is lower, starting from 50 million per second) on a laptop video card; compare this, on the one hand, with 300 million for oclHashcat working with a single hash, and on the other with 30 million for the fastest CPU brute forcer (I have an i5 at 2.5 GHz), written with assembler and SSE instructions
- using it, without any special tricks, about 2.5 million passwords from the LinkedIn hashes were recovered (not that many, but I was not aiming for a record here)
- you can download it here (the code is pretty scary in places!); to run it you need Python 2.7, numpy, PyOpenCL, Cython, Mako, and a file with the LinkedIn hashes
Impressions:
- Python is a very slow language (but quite easy to write in); processing large lists with loops is very slow, as are various comprehensions, which is why Cython is used for some parts of the program (a really convenient thing, by the way)
- with the proper approach to profiling and subsequent optimization, OpenCL can work very fast, and you do not need to rewrite anything in assembler for that
P.S.
Various comments and recommendations are welcome.