Fast integer multiplication using tables

I want to tell readers about a programming trick that I met in some kind of translation book that contains a selection of such tricks in those distant times when it wasn’t yet invented not only the byte, but it’s scary to say the stack, and the great Dijkstra has not yet cursed the operator GOTO (sic, in uppercase).

I liked the trick so much by its simplicity and grace that already in this millennium I was happy to tell students about it in the form of the following task.

Imagine that in the process of returning from the Moon in 2030, you suddenly discovered that your on-board computer is not performing integer multiplication operations correctly, and this will certainly lead to an accident during landing.

In this story there is nothing particularly fantastic. Recall, for example, what problems happened once with Pentium processors, and by the time you were sent to the moon, you had not yet reached full import substitution. In general, it is necessary to check whether the processors were specially drilled.

But to the point. You urgently need to implement multiplication programmatically so that it works quickly in real time and fits into an available resource.

From the school course of arithmetic, we recall that multivalued numbers can be multiplied by a column, and the result of the multiplication of individual numbers can be taken from the multiplication table.

But only if the numbers are chosen short (for example, 0 and 1), the multiplication table will be short and the columns long, and their calculation will take a lot of time.

If, on the contrary, we take long numbers (for example, from 0 to 65535), then for 16-bit arithmetic
the result is taken directly from the table, and the columns are missing. However, the size of the classical Pythagorean table turns out to be about 17GB (4 * 65536 * 65536), if we take into account the symmetry with respect to the diagonal, then half is about 8.5GB.

It may be a bit too much.

Tighten and remember algebra.

$ (a + b) ^ 2 = a ^ 2 + b ^ 2 + 2ab $(1)

$ (a - b) ^ 2 = a ^ 2 + b ^ 2 - 2ab $(2)

Subtract (2) from (1)

$ (a + b) ^ 2 - (a - b) ^ 2 = 4ab $

and further

$ ab = ((a + b) ^ 2 - (a - b) ^ 2) / 4 $

Thus, having a table of squares in the sqr array, we get

a * b = (sqr [a + b] - sqr [a - b]) / 4 (*) The

size of the 8 * table (65535 + 65535) is about 8.4MB, which is already may be acceptable.

The size of the table element of 8 bytes is due to the fact that for the maximum a and b, the square of their sum of 4 bytes does not fit - 2 bits are not enough.

Next, I will describe some improvement, which was not in the book. I came up with it myself when I was already writing this note.

Note that the two least significant bits of an even square are always 00, and the odd square is always 01. On the other hand, for any pair of numbers, their sum and difference have the same parity.
Therefore, in our formula (*), the process of subtraction in parentheses cannot have hyphens,
associated with the two least significant bits. Therefore, the contents of the elements of the table of squares
can be shifted in advance by two bits to the right and thereby save half the memory.

Finally, we have

a * b = sqr4 [a + b] - sqr4 [a - b] (**)

where sqr4 is the modified table of squares.

In our example, its size is about 4.2 MB.

Below to illustrate this approach, the text of the Lua program is included.

function create_multiplier(N)  -- N это число разных цифр
 local sqr4 = {}		-- создаём таблицу
 for i = 1, 2 * N - 1 do
  local temp = 0
  for j = 1, i - 1 do		-- для вычисления квадрата
   temp = temp + i - 1	-- используем многократное сложение
  end					-- т.к. умножение "неисправно"
  sqr4[i] = math.floor(temp / 4)  -- экономим 2 бита
 end
 return 			-- возвращаем мультипликатор (функцию)
  function (i, j)
   if i < j then i, j = j, i end
   return sqr4[1 + i + j] - sqr4[1 + i - j]
  end
end
N = 256
mpy = create_multiplier(N)
for i = 0, N - 1 do
 for j = 0, N - 1 do
  if i * j ~= mpy(i,j) then
   print("Ошибка", i, j, i * j, mpy(i,j))
  end
 end
end
print("Всё работает.")

For modern processors, it seems reasonable to have a digit size that is a multiple of the byte size for easy access to them. With 1-byte digits, the size of the table is only 1022 bytes, which may allow this trick to be used in 8-bit processors that do not have hardware multiplication.

I will be grateful to all readers of this note for corrections and comments.

Also popular now: