V2008n October 13, 2014 at 15:56

New solutions to the old problem

Tutorial

Or transfer the bike to jet propulsion

There is one very old task whose age is equal to the age of the American Standard Code for the Exchange of Information. More specifically, it is the task of converting an integer to its hexadecimal representation of an ASCII string.
In this publication, we will consider the conversion of an unsigned sixty-four-bit integer into a string of fixed length without truncating the leading zeros.
The task at first glance seems elementary. It would be so if the ASCII table was different. But we have what we have.
All solutions will only be for IA-32 and Intel 64 architecture.

Consider the input-output data and the algorithm for solving this problem.
Entrance:
A 64-bit unsigned integer that occupies 8 bytes at addresses from low to high. The digits of a number are arranged in the same order. Each bit takes 4 bits, each byte has two.
Output:
A string of ASCII characters, each occupying one byte. Each byte represents one digit of the original number. The order of the characters is the first byte (lowest address) - the most significant bit of the original number.

Algorithm:
1) Go from older notebooks to younger ones
2) Take a notebook and convert it to a byte with an extension of zero.
3) Add 30h
4) If the result is more than 39h then
4.1) Add another 17 (decimal)
4.2) Go to 5
4.3) Otherwise go to 5
5) Save the received byte to the string
6) While not all notebooks have been processed, go to 2

Solution No. 1

Head-on

        mov cx,8
        mov si,value
        mov di,hexstr
        add si,cx ;highest byte of value
        dec si
next_tetrade:
        std
        lodsb
        mov bl,al
        and al,0fh
        call digit
        cld
        stosb
        mov al,bl
        shr al,4
        call digit
        loop next_tetrade
       ret
digit:
        add al,30h
        cmp al,39h
        jnb _zero-nine
;digit greater than 9
        add al,11h
 _zero-nine:
         ret

Like most decisions in the forehead, it is not distinguished by either beauty or speed. Home ugliness is a conditional transition that bypasses just one command. There are two ways to bring beauty.
The first is to use ancient “magic” as a sequence of AAA AAD 17 commands.
Solution # 2

AAA AAD 17

mov cx,8
        mov si,value
        mov di,hexstr
        add si,cx ;highest byte of value
        dec si
next_tetrade:
        std
        lodsb
        mov bl,al
        and al,0fh
        call digit
        cld
        stosb
        mov al,bl
        shr al,4
        call digit
        loop next_tetrade
        ret
digit:
        mov ah,30h
        aaa
        aad 11h
        ret

Another way is to use the XLATB command.
Decision No. 3

Xlatb

         mov cx,8
         mov si,value
         mov di,hexstr
         mov	bx,hextable
         add si,cx 
         dec si
m3:
         std
         lodsb
         mov	ah,al
         and	al,0fh
         xlatb
         cld
         stosb
         mov	al,ah
         shr	al,4
         xlatb
         stosb
         loop	m3
         ret
hextable db "0123456789ABCDEF"

Both solutions are almost identical. But Solution No. 2 will not work in 64-bit processor mode, due to the elimination of support for AAA and AAD commands in it.
But is it really possible to process 8 bytes at a time, will we accept processing only 4 bits at a time?
Are there any ways to turn 9 (1001) into 39h (00111001) and A (1010) into 41h (01000001)?
Let's try to reveal the essence of a pair of AAA AAD teams, and pick up a replacement for them.

AAA AAD Replacement

         mov bl,0ah
         xor ah,ah
         div bl             ; al - частное, ah -остаток. Понятно, что для цифр больше 9 частное будет равно 1.
         mov bh,al          ;
         shl al,4           ; и из этой единицы мы формируем 11h
         add al,bh 
         add al,30h         ; voila!

This code, of course, is not suitable for our purposes, but it provides valuable information. It shows that you can get a transfer unit for numbers greater than 9. And it shows how to use this unit later. Now, if it were possible to turn each notebook into a byte, then these bytes could be processed in parallel. Eight digits at a time!
Let rax contain the original number. To isolate the younger notebooks, you just need to perform conjunction with the mask. And so as not to lose the older notebooks, copy them to rbx.

Unpuck

         mov rdx,0f0f0f0f0f0f0f0fh
         mov rbx,rax
         and rax,rdx
         shr rbx,4
         and rbx,rdx

Unfortunately, it will not work to obtain the desired transfer by division. Are there any other methods?
Actually, elementary: 10h-0ah = 6. It is enough to add 6 to each byte and we will get the necessary unit in the senior notebook.

Carry

         mov rdx,0f0f0f0f0f0f0f0fh
         mov rcx,0606060606060606h
         mov rbx,rax
         and rax,rdx
         mov rdi,rax ;копия пригодится позже
         shr rbx,4
         and rbx,rdx
         add rax,rcx
         shr rax,4
         and rax,rdx ; теперь в тех байтах, которые соответствуют цифрам больше 9, находятся единицы. В других - нули.

Unlike the previous method of obtaining units of transfer using division, instead of the remainder of division, we still have the original figure. That is, where the number A was - there it will remain, and not turn into 0. Therefore, you need to add not 11h, but 41h-3ah = 7.
And now the next task arises, how to make seven out of one? Yes, so as not to affect neighboring bytes. After all, 7 = 0111b, which means you can get what you need with two left shifts and two disjunctions.

One to seven

         mov rsi,rax
         shl rsi,1
         or  rax,rsi ;11b
         shl rsi,1
         or rax,rsi ; 111b

Now put the pieces together and see what happens

Unsigned to hex ver 0.1

        mov rax,[value]
        mov rdx,0f0f0f0f0f0f0f0fh
        mov rcx,0606060606060606h
                      mov rbx,rax
        mov rbp,3030303030303030h     
                      shr rbx,4                 
        and rax,rdx
                      and rbx,rdx
        mov rdi,rax 
                      mov r9,rbx
        add rax,rcx
                      add rbx,rcx
        shr rax,4
                      shr rbx,4
        and rax,rdx
                      and rbx,rdx 
        mov rsi,rax
                      mov r8,rbx
        shl rsi,1
                      shl r8,1
        or rax,rsi
                      or rbx,r8
        shl rsi,1
                      shl r8,1
        or rax,rsi
                      or rbx,r8
        add rax,rbp
                      add rbx,rbp 
        add rax,rdi
                      add rbx,r9
        mov [hexstr],rax
                       mov [hexstr+8],rbx

If you compile and run this code, one trouble will be revealed - the order of the numbers in the line is not in order. Firstly, the numbers go back and forth, and secondly, even and odd numbers are grouped together. You can, of course, give it to conclusion, and so, but we are honest and still put things in order.

Deinterleaving

        mov rcx,4
highpart:
        rol	rbx,8
        shrd	[hexstr],rbx,8
        rol	rax,8
        shrd	[hexstr],rax,8	
        loop	highpart	
        mov rcx,4
lowpart:
        rol	rbx,8
        shrd	[hexstr+8],rbx,8
        rol	rax,8
        shrd	[hexstr+8],rax,8	
        loop	lowpart
        ret

From what they wanted to leave, they came to that. Again byte processing in 64 mode. In order to flip the bytes, put them backwards, Intel has long made the bswap command. But for deinterlacing you have to look towards MMX, SSE and their development. And there is such a team and its name is punpcklbw. We use our finds.

Deinterleaving ver. 2

        bswap rax
                      bswap rbx
        mov [hexstr],rax
                      mov [hexstr+8],rbx
        movdqu xmm0,[hexstr]
                      movdqu xmm1,[hexstr+8]
        punpcklbw xmm1,xmm0
        movdqu	[hexstr],xmm1

Stop stop stop. If we started using SSE, maybe there is something else useful? It could be to rewrite our code entirely on SSE.

Unsigned to hex ver 1.1

        movdqu	xmm0,[value]
        pxor	xmm1,xmm1
        punpcklbw	xmm0,xmm1
        movdqa	xmm1,xmm0
        pand	xmm1,[efes]
        psllq	xmm0,4
        pand	xmm0,[efes]
        por	xmm0,xmm1
        movdqa	xmm1,xmm0
        paddb	xmm1,[sixes]
        psrlq	xmm1,4
        pand	xmm1,[efes]
        pxor	xmm9,xmm9
        psubb	xmm9,xmm1
        pand	xmm9,[sevens]
        paddb	xmm0,xmm9
        paddb	xmm0,[zeroes]
        movdqu	[hexstr],xmm0
        mov	rax,[hexstr]
        mov	rbx,[hexstr+8]
        bswap	rax
        bswap	rbx
        mov	[hexstr],rbx
        mov	[hexstr+8],rax
        ret
efes:	dq	0f0f0f0f0f0f0f0fh
        dq	0f0f0f0f0f0f0f0fh
zeroes:	dq	3030303030303030h
        dq	3030303030303030h
sixes:	dq	0606060606060606h
        dq	0606060606060606h
sevens:	dq	0707070707070707h
        dq	0707070707070707h

Here we simplified the unpacking, and organized the obtaining of the sevens in a different way, using one subtraction and one conjunction.

But what did we win at all? Compare the performance of each of the methods.

CPU	1	2	3	4	5	6	7	8
Core 2 Quad Q8200	670	600	150	77	70	77	76	1170
AMD C-60	290	195	290	120	105	120	140	290

Lodsb stosb with jnb
Lodsb stosb with xlatb
General purpose registers with shrd
General purpose registers with punpck
SSE 64 bit
SSE 64 bit unaligned
SSE 128 bit
Lodsb stosb with xlatb 128 bit

Values in parrots are the average number of processor ticks.
If the numbers on Intel fit well in theory, then on AMD they are somewhat mysterious. A pleasant surprise was the high speed SSE code on the processor from Intel. You can safely increase the bit depth of the converted numbers to 256 bits with a slight increase in the required time, since there are still many free xmm registers in 64-bit mode. In general, initially it would seem a sequential task, it became possible to solve a very parallel method.
The inverse problem of converting a hexadecimal string to a number is solved no less entertainingly.

For a snack:

SSE 128 bit

      movdqa 	xmm0,[value]
      pxor    xmm1,xmm1
		movdqa	xmm2,xmm0
      punpcklbw       xmm0,xmm1
      movdqa  xmm1,xmm0
		punpckhbw	xmm2,xmm1
      pand    xmm1,[efes]
		movdqa	xmm3,xmm2
      psllq   xmm0,4
		pand	xmm3,[efes]
      pand    xmm0,[efes]
		psllq	xmm3,4
      por     xmm0,xmm1
		pand	xmm2,[efes]
      movdqa  xmm1,xmm0
		por	xmm2,xmm3
      paddb   xmm1,[sixes]
		movdqa	xmm3,xmm2
      psrlq   xmm1,4
		paddb	xmm3,[sixes]
      pand    xmm1,[efes]
		psrlq	xmm3,4
      pxor    xmm9,xmm9
		pand	xmm3,[efes]
      psubb   xmm9,xmm1
		pxor	xmm10,xmm10
      pand    xmm9,[sevens]
		psubb	xmm10,xmm3
      paddb   xmm0,xmm9
		pand	xmm10,[sevens]
      paddb   xmm0,[zeroes]
		paddb	xmm2,xmm10
      movdqa  [hexstr],xmm0
		paddb	xmm2,[zeroes]
      mov     rax,[hexstr]
		movdqa	[hexstr+16],xmm2
      mov     rbx,[hexstr+8]
		mov	rcx,[hexstr+16]
      bswap   rax
		mov	rdx,[hexstr+16+8]
      bswap   rbx
		bswap	rcx
      mov     [hexstr],rbx
		bswap	rdx
      mov     [hexstr+8+16],rax
	      mov	[hexstr+8],rcx
	      mov	[hexstr+16],rdx
      ret

Tags:

New solutions to the old problem

Or transfer the bike to jet propulsion

Also popular now: