
Programming the coprocessor in C#? Yes!
Probably everyone knows that the FPU coprocessor exists. Read on to learn how to write code for it. The FPU (floating-point unit) is the part of the central processor designed specifically for working with floating-point numbers, in other words with the float and double types. It has been built into the processor itself since the Intel 486DX (thanks to the reader who corrected me), which was a long time ago. Ever since then it has been evaluating all kinds of mathematical expressions, or rather, the assembly code they are compiled into. In other words, when a calculation involves double values, the compiler translates the program not only into standard instructions such as mov, sub and the like, but also into fld, fstp, fsub, fadd... As you can see, FPU instructions start with the letter “f”, so code intended for the FPU is easy to recognize. You can find plenty of information about the FPU on the Internet by searching for it by name; I also recommend the “Processors” section of wasm.ru. The coprocessor is a very interesting thing, and programming it is interesting too, I would even say exciting. I don’t know what you will feel, but I was delighted when I managed to “conjure” code that gives commands directly to the processor, without intermediaries such as compilers, the CLR environment, etc. Why “conjure”? More about this later.
I borrowed the term “conjure” from the author of some wonderful articles on the site: a series about “code spells”, which I recommend reading after this article.
Now I will show how to write a simple code spell for the FPU. I should warn you right away: C# does take part in the end, but the spell itself requires C++.
Suppose we need to calculate the following expression: result = arg1 - arg2 + arg3.
There are several ways to write this code. To keep things easy to follow, I will show one of them first and the other a little later.
So, the first option looks like this:
fld [arg1]
fld [arg2]
fsubp
fld [arg3]
faddp
fstp [result]
ret
Now for the explanation. The square brackets contain the addresses of the variables arg1, arg2, arg3 and result.
The fld instruction pushes the double value stored at the address that follows it onto the top of the stack (the FPU works with a register stack, which has its own peculiarities). fsubp subtracts the value on top of the stack, ST(0), from the value one position below it, ST(1), stores the difference in ST(1) and pops the stack, so the result is now on top. faddp works like fsubp, except that it adds the values instead of subtracting them. fstp stores the double on top of the stack into the memory cell at the address that follows it and pops it off the stack. Finally, the ret instruction, as its name suggests, ends the function and returns control to the caller. To make this clearer, I will show how our code works in pictures:
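The stack evolution described above can also be traced with a small C++ model of the FPU register stack. This is an illustrative sketch only: the names `FpuStackModel` and `run_first_variant` are hypothetical, a `std::vector<double>` stands in for the stack with its last element playing the role of ST(0), and the model ignores details of the real x87 such as its fixed eight registers and 80-bit internal precision.

```cpp
#include <cassert>
#include <vector>

// Toy model of the x87 register stack: st.back() plays the role of ST(0).
// It mimics only the four instructions used in this article; it is not an emulator.
struct FpuStackModel {
    std::vector<double> st;

    void fld(double value) { st.push_back(value); }  // push a value, it becomes ST(0)

    void fsubp() {  // ST(1) = ST(1) - ST(0), then pop
        assert(st.size() >= 2);
        double top = st.back();
        st.pop_back();
        st.back() -= top;
    }

    void faddp() {  // ST(1) = ST(1) + ST(0), then pop
        assert(st.size() >= 2);
        double top = st.back();
        st.pop_back();
        st.back() += top;
    }

    double fstp() {  // return ST(0) (as if stored to memory), then pop
        assert(!st.empty());
        double top = st.back();
        st.pop_back();
        return top;
    }
};

// Replays the first variant of the program: result = arg1 - arg2 + arg3.
double run_first_variant(double arg1, double arg2, double arg3) {
    FpuStackModel fpu;
    fpu.fld(arg1);               // ST(0) = arg1
    fpu.fld(arg2);               // ST(0) = arg2, ST(1) = arg1
    fpu.fsubp();                 // ST(0) = arg1 - arg2
    fpu.fld(arg3);               // ST(0) = arg3, ST(1) = arg1 - arg2
    fpu.faddp();                 // ST(0) = arg1 - arg2 + arg3
    double result = fpu.fstp();  // result leaves the stack
    assert(fpu.st.empty());      // the stack is empty again, as before the call
    return result;
}
```

With the article's test values, `run_first_variant(1.0, -2.0, 2.0)` follows 1 - (-2) + 2 and returns 5.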

The result is stored in a memory cell, from which we can read it. I hope the instructions are clear now. Let's see how to create such code from a C++ program.
#include <windows.h>
#include <new>

// Works only in a 32-bit (x86) build: the generated code uses 32-bit absolute addresses.
double ExecuteMagic(double arg1, double arg2, double arg3)
{
    SYSTEM_INFO si;
    GetSystemInfo(&si);
    DWORD region_size = si.dwAllocationGranularity;
    // First half of the block holds code, second half holds data; the memory must be executable.
    short* code = (short*)VirtualAlloc(NULL, region_size * 2, MEM_COMMIT | MEM_RESERVE, PAGE_EXECUTE_READWRITE);
    if (code == NULL)
        throw std::bad_alloc();
    short* code_cursor = code;
    double* data_cursor = (double*)((char*)code + region_size);

    *data_cursor = arg1;
    *code_cursor++ = (short)0x05DDu; // fld qword ptr [arg1] (bytes DD 05)
    *(int*)code_cursor = (int)(INT_PTR)data_cursor++; // 32-bit address of arg1, e.g. 1.0
    code_cursor = (short*)((char*)code_cursor + sizeof(int)); // skip the 4-byte address

    *data_cursor = arg2;
    *code_cursor++ = (short)0x05DDu; // fld qword ptr [arg2], e.g. -2.0
    *(int*)code_cursor = (int)(INT_PTR)data_cursor++;
    code_cursor = (short*)((char*)code_cursor + sizeof(int));

    *code_cursor++ = (short)0xE9DEu; // fsubp st(1), st(0): arg1 - arg2

    *data_cursor = arg3;
    *code_cursor++ = (short)0x05DDu; // fld qword ptr [arg3], e.g. 2.0
    *(int*)code_cursor = (int)(INT_PTR)data_cursor++;
    code_cursor = (short*)((char*)code_cursor + sizeof(int));

    *code_cursor++ = (short)0xC1DEu; // faddp st(1), st(0): (arg1 - arg2) + arg3

    double* result = data_cursor;
    *code_cursor++ = (short)0x1DDDu; // fstp qword ptr [result] (bytes DD 1D)
    *(int*)code_cursor = (int)(INT_PTR)data_cursor++; // address of result
    code_cursor = (short*)((char*)code_cursor + sizeof(int));

    *code_cursor++ = (short)0x90C3u; // ret (C3) padded with nop (90)

    void (*function)() = (void (*)())code;
    function(); // 1 - (-2) + 2 = 5
    double value = *result;
    VirtualFree(code, 0, MEM_RELEASE); // release the block before returning
    return value;
}
Now for the most interesting part. Using the VirtualAlloc function, we allocate memory for our code; its size is based on SYSTEM_INFO.dwAllocationGranularity, the system's memory allocation granularity. Pay attention to the arguments passed to the function, in particular PAGE_EXECUTE_READWRITE: this flag makes the newly allocated memory not only readable and writable but also executable, i.e. we can transfer control to this memory area and the processor will fetch the next instructions from it.
We use the first half of this block for code and the second half for data, something like a code segment and a data segment. All that remains is to fill them with the necessary values. To fill the code part, we simply write the opcodes (the machine encodings of processor instructions) into the array in hexadecimal form. Let's go through it step by step.
The FLD instruction (load a double from memory) has the opcode DD /0. By the way, you can look up opcodes and their mnemonics in the processor architecture documentation. Next, FSTP (store a double to memory and pop) also has the opcode DD, but with /3: this is the opcode extension, which is encoded in the ModR/M byte. Here is a table of ModR/M values [http://www.sandpile.org/ia32/opc_rm32.htm] (believe me, curious minds will be able to figure it all out). FLD and FSTP can work with different kinds of operands, such as memory cells or processor registers, and the ModR/M byte specifies which one is meant. We need a memory operand given by the address of a double, so in the table we look up the value for [sdword] (a 32-bit address). For FLD this value is 05h, for FSTP it is 1Dh. Appending these to the opcodes gives FLD = DD 05h and FSTP = DD 1Dh. FSUBP has the opcode DE /5; again we turn to the table, this time for register operand 1 (mod = 11, r/m = 001), which for FPU instructions means ST(1), the second element of the FPU stack, and find E9h, i.e. FSUBP = DE E9h. FADDP, like FSUBP, has the opcode DE, but with /0, which gives C1h for ST(1), i.e. FADDP = DE C1h. The RET instruction is the single byte C3h; since the code writes 2 bytes at a time, it is padded with a NOP (90h), giving C3 90h.
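The byte values looked up in the table can also be computed, because a ModR/M byte is just three bit fields: mod (2 bits), reg (3 bits, which holds the /digit extension for these instructions) and r/m (3 bits). A minimal sketch; the helper `modrm` and the constant names are chosen for illustration:

```cpp
#include <cassert>
#include <cstdint>

// Packs the three ModR/M fields: mod (bits 7-6), reg or /digit (bits 5-3), r/m (bits 2-0).
constexpr std::uint8_t modrm(unsigned mod, unsigned reg, unsigned rm) {
    assert(mod < 4 && reg < 8 && rm < 8);  // each field must fit its bit width
    return static_cast<std::uint8_t>((mod << 6) | (reg << 3) | rm);
}

// mod = 00, r/m = 101: memory operand given by a 32-bit address ([sdword]) in 32-bit mode.
constexpr std::uint8_t kFldM64  = modrm(0b00, 0, 0b101);   // DD /0
constexpr std::uint8_t kFstpM64 = modrm(0b00, 3, 0b101);   // DD /3
// mod = 11: register operand; r/m = 001 selects ST(1) for x87 instructions.
constexpr std::uint8_t kFsubpSt1 = modrm(0b11, 5, 0b001);  // DE /5
constexpr std::uint8_t kFaddpSt1 = modrm(0b11, 0, 0b001);  // DE /0

static_assert(kFldM64 == 0x05, "fld qword ptr [addr] = DD 05");
static_assert(kFstpM64 == 0x1D, "fstp qword ptr [addr] = DD 1D");
static_assert(kFsubpSt1 == 0xE9, "fsubp st(1), st(0) = DE E9");
static_assert(kFaddpSt1 == 0xC1, "faddp st(1), st(0) = DE C1");
```

The `static_assert` lines check at compile time that the computed bytes match the values taken from the table.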
Note that x86 stores multi-byte values in little-endian order, with the least significant byte first. Since the code writes each instruction as a 2-byte short value, its bytes must be swapped: to get the bytes DD 05 in memory for FLD, we write not 0xDD05 but 0x05DD. This is important!
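The byte swap is easy to check: copying a 16-bit value into a byte array shows the order in which its bytes land in memory. This sketch assumes a little-endian host such as x86; `pack_le16` and `bytes_of_u16` are hypothetical helper names.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Copies the in-memory representation of a 16-bit value, in address order.
// On a little-endian CPU such as x86 the low byte comes first.
void bytes_of_u16(std::uint16_t value, std::uint8_t out[2]) {
    std::memcpy(out, &value, sizeof value);
}

// Given two instruction bytes in the order they must appear in memory,
// returns the 16-bit value to write on a little-endian machine.
constexpr std::uint16_t pack_le16(std::uint8_t first, std::uint8_t second) {
    return static_cast<std::uint16_t>(first | (second << 8));
}

static_assert(pack_le16(0xDD, 0x05) == 0x05DD, "FLD word written by the article's code");
```

For example, `pack_le16(0xC3, 0x90)` gives 0x90C3, the padded RET written at the end of the generated code.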
That's basically all about opcodes. The C++ code above shows how to fill the array with instructions: first write the instruction, then, if needed, the address of the memory cell. Note that an address is 4 bytes (32 bits) long on a 32-bit system, so after writing an address into the code array you must move the pointer 4 bytes forward, rather than 2 bytes as after an instruction. This also means the code works only as a 32-bit (x86) program: in 64-bit mode the same ModR/M byte 05h denotes RIP-relative addressing, and 64-bit pointers do not fit into 4 bytes.
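The same cursor arithmetic can be expressed with a growing byte buffer: an opcode adds 2 bytes and an address adds 4, just as in the code above, but copying the bytes of each value avoids the unaligned writes through `short*`/`int*` casts. This is a sketch under stated assumptions: `CodeBuffer` and `emit_fld_m64` are hypothetical names, the host is assumed to be little-endian, and the buffer is only assembled here, not executed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Byte buffer that plays the role of code_cursor in the article's code.
struct CodeBuffer {
    std::vector<std::uint8_t> bytes;

    // Appends n raw bytes starting at src (byte-wise copy, no alignment requirements).
    void append(const void* src, std::size_t n) {
        const auto* p = static_cast<const std::uint8_t*>(src);
        bytes.insert(bytes.end(), p, p + n);
    }
    void emit16(std::uint16_t word) { append(&word, sizeof word); }     // opcode: advances 2 bytes
    void emit32(std::uint32_t value) { append(&value, sizeof value); }  // address: advances 4 bytes
};

// Emits "fld qword ptr [addr]": opcode bytes DD 05 followed by the 32-bit address.
void emit_fld_m64(CodeBuffer& buf, std::uint32_t addr) {
    const std::size_t before = buf.bytes.size();
    buf.emit16(0x05DD);  // stored as DD 05 on a little-endian host
    buf.emit32(addr);    // absolute address of the double
    assert(buf.bytes.size() - before == 6);  // 2-byte opcode + 4-byte address
}
```

For the address 0x12345678 the buffer holds DD 05 78 56 34 12: the opcode first, then the address, low byte first.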
The culmination of this miracle is executing the code we wrote into memory. How do we run code from our array? A function pointer comes to the rescue, and here C++ helps us out. We create a pointer to a function that returns void and takes no parameters, and assign it the address of the beginning of the code array. That's it! We call the function through the pointer and get the result directly in memory: the processor did exactly what we told it to in our code array.
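Here is a minimal sketch of the same allocate, copy and call-through-a-function-pointer pattern. It deliberately does not reuse the article's FPU bytes, because they depend on 32-bit absolute addresses; instead it runs the 6-byte sequence `mov eax, 42; ret`, which is encoded the same way in 32- and 64-bit mode and returns its result in EAX. On non-Windows systems it uses the POSIX `mmap`/`mprotect` calls in place of `VirtualAlloc`. The function name `run_generated_int_function` is hypothetical.

```cpp
#include <cassert>
#include <cstring>

#if defined(_WIN32)
#include <windows.h>
#else
#include <sys/mman.h>
#endif

// Copies machine code into executable memory, calls it through a function pointer
// and returns its int result; returns -1 on non-x86 builds or if allocation fails.
int run_generated_int_function() {
#if defined(__i386__) || defined(__x86_64__) || defined(_M_IX86) || defined(_M_X64)
    // mov eax, 42 ; ret
    const unsigned char code[] = {0xB8, 0x2A, 0x00, 0x00, 0x00, 0xC3};
#if defined(_WIN32)
    void* mem = VirtualAlloc(NULL, sizeof code, MEM_COMMIT | MEM_RESERVE, PAGE_EXECUTE_READWRITE);
    if (mem == NULL) return -1;
    std::memcpy(mem, code, sizeof code);
    int result = reinterpret_cast<int (*)()>(mem)();  // jump into the generated code
    VirtualFree(mem, 0, MEM_RELEASE);
#else
    // Write first, then switch the page to read+execute (many systems forbid W+X pages).
    void* mem = mmap(nullptr, sizeof code, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED) return -1;
    std::memcpy(mem, code, sizeof code);
    if (mprotect(mem, sizeof code, PROT_READ | PROT_EXEC) != 0) {
        munmap(mem, sizeof code);
        return -1;
    }
    int result = reinterpret_cast<int (*)()>(mem)();  // jump into the generated code
    munmap(mem, sizeof code);
#endif
    return result;
#else
    return -1;  // the bytes above are x86 machine code
#endif
}
```

On an x86 or x86-64 machine that permits executable pages the call returns 42; where the page cannot be made executable, or on another architecture, the sketch returns -1 instead.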
Recall that this is only one way to pass parameters and return the result. The second way is to use a pointer to a function of type double (void), so that instead of fetching the result from memory ourselves, we get it as the return value of our dynamically created function. To do this, simply change the code as follows:
fld [arg1]
fld [arg2]
fsubp
fld [arg3]
faddp
// fstp [result]
ret
That is, we simply leave the result on top of the FPU stack, and the call through the function pointer returns it from there: in 32-bit x86 calling conventions, a function that returns a double leaves its result in ST(0). Everything is simple.
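For the second variant, the generated code is the first one without the 6-byte `fstp [result]`. The sketch below only assembles the bytes; the function name `encode_second_variant` is hypothetical, it assumes the three 32-bit data addresses are already known, and it assumes a little-endian host. In a 32-bit process the buffer would then be copied into executable memory and called through a `double (*)()` pointer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Encodes: fld [a1]; fld [a2]; fsubp; fld [a3]; faddp; ret  (result left in ST(0)).
std::vector<std::uint8_t> encode_second_variant(std::uint32_t a1, std::uint32_t a2, std::uint32_t a3) {
    std::vector<std::uint8_t> code;
    auto put = [&code](const void* src, std::size_t n) {  // append raw bytes
        const auto* p = static_cast<const std::uint8_t*>(src);
        code.insert(code.end(), p, p + n);
    };
    auto fld_m64 = [&](std::uint32_t addr) {
        const std::uint8_t op[2] = {0xDD, 0x05};  // fld qword ptr [addr]
        put(op, 2);
        put(&addr, 4);  // 32-bit absolute address, low byte first
    };
    const std::uint8_t fsubp[2] = {0xDE, 0xE9};  // fsubp st(1), st(0)
    const std::uint8_t faddp[2] = {0xDE, 0xC1};  // faddp st(1), st(0)
    const std::uint8_t ret = 0xC3;               // no fstp: the caller reads ST(0)

    fld_m64(a1);
    fld_m64(a2);
    put(fsubp, 2);
    fld_m64(a3);
    put(faddp, 2);
    put(&ret, 1);
    assert(code.size() == 23);  // 3 * 6 + 2 + 2 + 1 bytes
    return code;
}
```

Writing the opcodes as byte arrays in memory order sidesteps the byte swap needed when they are stored as 16-bit values.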
By the middle of the article the reader is already asking: “What about C#??? Nothing but C++, assembler and incomprehensible numbers...”. Fair enough, but you have to be patient :).
As we all know, C# can call functions written in C++, Delphi, etc. This is done with the extern keyword and the [DllImport("*.dll")] attribute.
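For the DllImport route, the native library has to export a plain C function. A minimal sketch, where `EvalExpression` and the `SMALLCODE_API` macro are hypothetical names and the body simply lets the compiler evaluate the expression; a real library would run the generated code there instead:

```cpp
#include <cassert>

// Export macro: dllexport on Windows, default symbol visibility elsewhere.
#if defined(_WIN32)
#define SMALLCODE_API extern "C" __declspec(dllexport)
#else
#define SMALLCODE_API extern "C" __attribute__((visibility("default")))
#endif

// Plain C entry point (no C++ name mangling) that C# can bind to with [DllImport].
SMALLCODE_API double EvalExpression(double arg1, double arg2, double arg3) {
    return arg1 - arg2 + arg3;
}
```

On the C# side the matching declaration would look roughly like `[DllImport("smallcode.dll", CallingConvention = CallingConvention.Cdecl)] static extern double EvalExpression(double arg1, double arg2, double arg3);`, where `smallcode.dll` is again a hypothetical name. The explicit `Cdecl` matters in 32-bit builds, because `extern "C"` functions use cdecl there by default, while DllImport defaults to the platform's Winapi (stdcall) convention.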
There is an even easier option. The .NET platform developers made managed and unmanaged code work together. So we simply create a new C++ class that implements the code generation technique described above, our code spell. Then we reference this library from a project written in managed C# and use it without any obstacles. That's exactly what I did. How glad I was when the result came so quickly! :)
Here is what I did:
#pragma once
#include <windows.h>

using namespace System;

namespace smallcodelib
{
    public ref class CodeMagics
    {
    public:
        // Same code generator as above; works only in a 32-bit (x86) build.
        static double ExecuteMagic(double arg1, double arg2, double arg3)
        {
            SYSTEM_INFO si;
            GetSystemInfo(&si);
            DWORD region_size = si.dwAllocationGranularity;
            // First half of the block holds code, second half holds data; the memory must be executable.
            short* code = (short*)VirtualAlloc(NULL, region_size * 2, MEM_COMMIT | MEM_RESERVE, PAGE_EXECUTE_READWRITE);
            if (code == NULL)
                throw gcnew OutOfMemoryException("VirtualAlloc failed");
            short* code_cursor = code;
            double* data_cursor = (double*)((char*)code + region_size);

            *data_cursor = arg1;
            *code_cursor++ = (short)0x05DDu; // fld qword ptr [arg1] (bytes DD 05)
            *(int*)code_cursor = (int)(INT_PTR)data_cursor++; // 32-bit address of arg1, e.g. 1.0
            code_cursor = (short*)((char*)code_cursor + sizeof(int)); // skip the 4-byte address

            *data_cursor = arg2;
            *code_cursor++ = (short)0x05DDu; // fld qword ptr [arg2], e.g. -2.0
            *(int*)code_cursor = (int)(INT_PTR)data_cursor++;
            code_cursor = (short*)((char*)code_cursor + sizeof(int));

            *code_cursor++ = (short)0xE9DEu; // fsubp st(1), st(0): arg1 - arg2

            *data_cursor = arg3;
            *code_cursor++ = (short)0x05DDu; // fld qword ptr [arg3], e.g. 2.0
            *(int*)code_cursor = (int)(INT_PTR)data_cursor++;
            code_cursor = (short*)((char*)code_cursor + sizeof(int));

            *code_cursor++ = (short)0xC1DEu; // faddp st(1), st(0): (arg1 - arg2) + arg3

            double* result = data_cursor;
            *code_cursor++ = (short)0x1DDDu; // fstp qword ptr [result] (bytes DD 1D)
            *(int*)code_cursor = (int)(INT_PTR)data_cursor++; // address of result
            code_cursor = (short*)((char*)code_cursor + sizeof(int));

            *code_cursor++ = (short)0x90C3u; // ret (C3) padded with nop (90)

            void (*function)() = (void (*)())code;
            function(); // 1 - (-2) + 2 = 5
            double value = *result;
            VirtualFree(code, 0, MEM_RELEASE); // release the block before returning
            return value;
        }
    };
}
This is the code of the C++ class.
And:
using System;
using smallcodelib;

namespace test_smallcodelib
{
    // Build this project for x86: CodeMagics generates 32-bit machine code.
    class Program
    {
        static void Main(string[] args)
        {
            Console.WriteLine("Code spell! (* to exit)");
            string line;
            // Press Enter to compute an expression, or type * to quit.
            while ((line = Console.ReadLine()) != null && line != "*")
            {
                Console.Write("arg1?: "); double arg1 = Convert.ToDouble(Console.ReadLine());
                Console.Write("arg2?: "); double arg2 = Convert.ToDouble(Console.ReadLine());
                Console.Write("arg3?: "); double arg3 = Convert.ToDouble(Console.ReadLine());
                double result = CodeMagics.ExecuteMagic(arg1, arg2, arg3);
                Console.WriteLine("Result of arg1 - arg2 + arg3 = {0}", result);
            }
        }
    }
}
This part is already in C#!
Check it out! Everything works!
Of course, there is more C++ code here than C#. However, anyone with the talent and the desire to dig into this area can write a C++ wrapper that generates such code dynamically, and then use that wrapper from C#, passing it the necessary variables, parameters and so on. The result could be a rather interesting tool.
A couple of final remarks.
This article is about coprocessor programming, but in fact you can generate any code you like; for that you need to study the memory architecture, the processor and its instructions. Advanced programmers who know what SSE is (there are already almost five versions of it) can write code that uses all the latest processor features and, most pleasantly, use it from C#. The only limit is your imagination =). Good luck in your endeavors!
I want to express my deep gratitude to my friend Peter Kankowski, who helped me figure all this out at the time! He runs his own wiki site, where he, his colleagues and friends discuss various ways to optimize code [http://www.strchr.com/].
UPD: Here is a simple example of the same principle of generating native code, but written entirely in C#. Thanks to lastmsu for the tip about Marshal.GetDelegateForFunctionPointer().
Thank you for your attention! Good luck!