Basic API sets for implementing transparent proxy services
One of the most important parts of any system that protects corporate data from leaks is the module that analyzes outgoing network traffic. Most often this module is implemented as a transparent proxy service, i.e. a service that "transparently" sits between a network application and the target server and intercepts the data flow between them.


This article covers the transparent proxy service itself and ways to implement traffic proxying. It does not address how network traffic is redirected to the transparent proxy service, although that is an interesting technical problem in its own right.
Since target applications may use any ports, including non-standard ones, all traffic has to be processed. Busy network applications can create more than 100 connections per second, so a transparent proxy service must be as efficient as possible. The general algorithm of the service is as follows:
1. Accept a redirected connection.
2. Determine where the "proxied" connection should be established.
3. Create a connection to the server (from step 2).
4. Receive data from the application and transfer it to the server.
5. Receive data from the server and transfer it to the application.
6. Repeat steps 4 and 5 until either the server or the application closes the connection.
7. Close the pair connection.
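Independent of any particular Windows API, the steps above can be sketched as a data pump between two endpoints. This is a toy, single-threaded sketch; Endpoint and the helper names are hypothetical stand-ins for real sockets:

```cpp
#include <deque>
#include <string>

// Hypothetical stand-in for one side of a proxied connection: `outgoing`
// holds data the endpoint wants to send, `received` is what was delivered
// to it, and `closed` marks step 7 of the algorithm.
struct Endpoint {
    std::deque<std::string> outgoing;
    std::string received;
    bool closed = false;
};

// Steps 4/5: move one chunk from `src` to `dst`; a real proxy would
// inspect the data here before forwarding it.
bool pump_once(Endpoint& src, Endpoint& dst) {
    if (src.outgoing.empty()) return false;
    dst.received += src.outgoing.front();
    src.outgoing.pop_front();
    return true;
}

// Steps 4-7: shuttle data in both directions until both sides are
// drained, then close the pair connection.
void proxy_session(Endpoint& app, Endpoint& server) {
    while (pump_once(app, server) || pump_once(server, app)) {
    }
    app.closed = server.closed = true;
}
```

The rest of the article is about implementing this pump efficiently when the reads and writes are real, blocking or asynchronous, socket operations.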
What APIs in the Microsoft Windows operating system can help solve this problem?
Sockets + WSA events
To organize proxying using this API, you need to do the following:
1. Create a socket
SOCKET sock = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
2. Create an event that will be used to track socket state changes
WSAEVENT sock_event = WSACreateEvent();
3. Associate the socket with the event, specifying which state changes we are interested in. When relaying traffic, we care about completed sends and completed receives. The moment the connection closes is also of interest, since it signals that traffic processing should end
WSAEventSelect(sock, sock_event, FD_READ|FD_WRITE|FD_CLOSE);
4. Initiate reading data from the socket
int res = recv(sock, buf, buf_len, 0);
5. Arrange to wait for socket state-change events. Most often this is done by launching a separate thread that calls the wait function
DWORD res = WSAWaitForMultipleEvents(1, &sock_event, FALSE, WSA_INFINITE, FALSE);
The event associated with the socket becomes signaled when data is received, data is sent, or the connection is closed. I/O errors also signal the event.
6. Check how the socket state has changed and perform the appropriate processing
WSANETWORKEVENTS wsaNetworkEvents;
WSAEnumNetworkEvents(sock, sock_event, &wsaNetworkEvents);
if( wsaNetworkEvents.lNetworkEvents & FD_READ ) {
    //Data received: process it and pass it to our "pair"
    ProcessReceivedData();
}
if( wsaNetworkEvents.lNetworkEvents & FD_WRITE ) {
    //Data sent: request the next portion from the "pair"
    InitiateRead();
}
if( wsaNetworkEvents.lNetworkEvents & FD_CLOSE ) {
    //The connection has closed;
    //close our pair (but only after all data has been delivered)
    ClosePeer();
}

Pros and cons
What pitfalls await us when using this API:
- Programs can establish dozens of connections at once, and the transparent proxy service creates twice as many sockets: two per proxied connection. The WSAWaitForMultipleEvents function cannot wait on more than 64 objects at a time, so several waiting threads must be launched and the sockets distributed among them somehow.
- Lengthy data processing in one of the waiting threads delays the handling of events from the other sockets assigned to that thread. To solve this, separate data-processing threads must be launched and their load monitored.
- Getting data from a socket requires three calls: recv, WSAWaitForMultipleEvents, and WSAEnumNetworkEvents. Each of them potentially transitions into kernel mode, which is a rather expensive operation.
- If the pools of threads for event waiting and data processing are implemented inefficiently, adding computing resources (processor cores) will not increase proxying throughput, and for terminal servers this scaling matters a great deal.
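The first pitfall can be quantified. With the wait limit at 64 objects (WSA_MAXIMUM_WAIT_EVENTS on Windows) and two sockets, hence two events, per proxied connection, the number of waiting threads grows linearly with the connection count. A platform-neutral sketch of the bookkeeping (the constant and helper names are illustrative):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// WSAWaitForMultipleEvents cannot wait on more than 64 objects at once
// (WSA_MAXIMUM_WAIT_EVENTS on Windows; the constant here is illustrative).
constexpr std::size_t kMaxWaitObjects = 64;

// Each proxied connection contributes two sockets, hence two events.
std::size_t waiting_threads_needed(std::size_t connections) {
    std::size_t events = connections * 2;
    return (events + kMaxWaitObjects - 1) / kMaxWaitObjects;
}

// Partition event handles into groups of at most 64, one group per
// waiting thread. `Handle` stands in for WSAEVENT.
template <typename Handle>
std::vector<std::vector<Handle>> partition_events(const std::vector<Handle>& events) {
    std::vector<std::vector<Handle>> groups;
    for (std::size_t i = 0; i < events.size(); i += kMaxWaitObjects) {
        std::size_t end = std::min(events.size(), i + kMaxWaitObjects);
        groups.emplace_back(events.begin() + i, events.begin() + end);
    }
    return groups;
}
```

Only 32 proxied connections fit into a single waiting thread; a few hundred connections already require several threads plus logic to rebalance sockets between them as connections come and go.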
Thus, this API is not very suitable for implementing an efficient transparent proxying method. Consider a different set of APIs.
Overlapped I/O + Thread Pool + Completion Ports
1. Create a socket. This time, asynchronous operations require a context structure that describes each operation. The key feature of this structure is that its first member is the standard OVERLAPPED type: this layout lets the callback functions recover the full context from the OVERLAPPED pointer.
struct AsyncOperationContext
{
    //It is important that this member comes first
    OVERLAPPED ov;
    //Callback invoked when the operation completes
    //(defined by the user)
    CALLBACK_FUNC pfFunc;
    //Arbitrary operation context
    PVOID pContext;
};

SOCKET sock =
    ::WSASocket(AF_INET, SOCK_STREAM, IPPROTO_TCP, NULL, 0, WSA_FLAG_OVERLAPPED);
2. Bind the socket to an I/O completion port whose events are processed by the system thread pool. Since a pointer to an OVERLAPPED structure is used to initiate each asynchronous operation, nothing prevents us from allocating extra memory alongside that structure, and it is the address of that enlarged structure that arrives in the completion-port callback.
BindIoCompletionCallback((HANDLE)sock, IoCompletionRoutine, 0);

VOID CALLBACK IoCompletionRoutine(
    DWORD error,
    DWORD bytes,
    LPOVERLAPPED ov)
{
    AsyncOperationContext* actx = reinterpret_cast<AsyncOperationContext*>(ov);
    actx->pfFunc(actx->pContext, error, bytes);
}
3. Initiate the asynchronous read from the socket. Keep in mind that if the operation completes immediately, either successfully or with an error other than ERROR_IO_PENDING, the processing must be finished in the thread that initiated the read; in that case the completion-port callback is not invoked. The asynchronous operation context should be stored in the structure that describes the intercepted connection, since its lifetime matches that of the connection context; moreover, the structure can be reused for subsequent reads from the socket.
AsyncOperationContext receive_ov;
//Initialize the system part of the structure
memset(&receive_ov, 0, sizeof(OVERLAPPED));
//Set up the callback function and its context
receive_ov.pfFunc = ReceiveDoneCallback;
receive_ov.pContext = this;
//Initiate the read from the socket
BOOL res = ReadFile((HANDLE)sock, buf, buf_len, &received, (LPOVERLAPPED)&receive_ov);
if(res)
{
    //The operation completed synchronously.
    //The I/O completion port will not be used
    if(received > 0)
    {
        //Data arrived immediately: process it
        ProcessReceivedData();
        //Initiate the next read
        InitiateRead();
    }
    else
    {
        //Nothing received: assume the remote end
        //has closed the socket.
        ProcessConnectionClose();
    }
}
else
{
    DWORD error = GetLastError();
    if(error != ERROR_IO_PENDING)
    {
        //Failed to receive data: close the connection
        ProcessConnectionClose();
    }
}
The implementation of ReceiveDoneCallback is similar to the synchronous case.
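As a reference point, ReceiveDoneCallback boils down to the same three outcomes as the synchronous branches above. A hedged, platform-neutral sketch (ConnectionStub and its counters stand in for the real connection object and its helpers):

```cpp
#include <cstdint>

// Counters standing in for the real connection object and its helpers
// (ProcessReceivedData, InitiateRead, ProcessConnectionClose).
struct ConnectionStub {
    int processed = 0;
    int reads = 0;
    int closes = 0;
};

// Mirrors the synchronous branches of the ReadFile handling above:
// an I/O error or zero received bytes closes the connection; otherwise
// the data is processed and the next read is initiated.
void ReceiveDoneCallback(ConnectionStub* conn, uint32_t error, uint32_t bytes) {
    if (error != 0 || bytes == 0) {
        conn->closes++;          // ProcessConnectionClose()
        return;
    }
    conn->processed++;           // ProcessReceivedData()
    conn->reads++;               // InitiateRead()
}
```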
4. Process the received data. Since the system thread pool already handles I/O, it should handle data processing as well. One thing must be kept in mind: data must be processed and forwarded to the pair socket in exactly the order it was received, so a queue of data awaiting processing and transmission has to be organized, and the pool work item must operate on that queue. It is important that only one pool thread drains the queue at a time; the queue itself can be implemented in any way.
//Append the received data to the queue of unprocessed, unsent data
AddReceivedDataToQueue(buf, buf_len);
//Check whether queue processing is already running and
//mark it as started if a new run is required
if(!IsQueueProcessingAndMark())
{
    QueueUserWorkItem(WorkRoutine, this, 0);
}

DWORD WINAPI WorkRoutine(LPVOID param)
{
    DataItem* dataItem;
    while( (dataItem = GetQueueProcessingItem()) != NULL )
    {
        ProcessDataItem(dataItem);
        //Forward the processed data
        InitiateWrite();
    }
    MarkQueueProcessing(FALSE);
    return 0;
}
Access to the queue of items, as well as to the processing-state flag, must be synchronized. Asynchronous transmission of data to our "pair" is organized the same way, except that WriteFile is used instead of ReadFile.
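The ordering and single-consumer requirements can be sketched portably. Here std::mutex stands in for whatever synchronization the real service uses, and the method names echo the helpers in the snippet above:

```cpp
#include <deque>
#include <mutex>
#include <string>
#include <vector>

// FIFO of received-but-unprocessed chunks plus a flag that guarantees
// at most one consumer drains it at a time, preserving data order.
class OrderedWorkQueue {
public:
    // Returns true if the caller should schedule a drain (the flag was clear);
    // plays the role of AddReceivedDataToQueue + IsQueueProcessingAndMark.
    bool AddAndMark(std::string data) {
        std::lock_guard<std::mutex> lock(m_);
        queue_.push_back(std::move(data));
        if (draining_) return false;   // a worker is already on it
        draining_ = true;
        return true;
    }

    // Body of WorkRoutine: drain in FIFO order, then clear the flag.
    void Drain(std::vector<std::string>& out) {
        for (;;) {
            std::string item;
            {
                std::lock_guard<std::mutex> lock(m_);
                if (queue_.empty()) { draining_ = false; return; }
                item = std::move(queue_.front());
                queue_.pop_front();
            }
            out.push_back(std::move(item));  // ProcessDataItem + InitiateWrite
        }
    }

private:
    std::mutex m_;
    std::deque<std::string> queue_;
    bool draining_ = false;
};
```

The lock is released while each item is processed, so I/O callbacks can keep appending data without waiting for the business logic.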

Pros and cons
What we got when we started using this set of APIs:
- We no longer need our own implementation of the thread pool - we use the thread pool, which is implemented by the operating system.
- There are no restrictions that are related to the number of processed connections.
- The data received on the socket is immediately passed to the callback function. Accordingly, you just need to initiate the operation and process the result. No additional API calls are required.
This API set allows the number of processed connections to grow with the number of processor cores, i.e. the scheme will work on a terminal server.
But this API still has disadvantages:
- The API does not let us manage the pool, i.e. we cannot limit the number of threads in it.
- We cannot reliably separate the threads that handle I/O from the threads that do the business-level processing of intercepted data.
- Waiting for "stuck" I/O operations has to be arranged in a special way.
These problems can be solved using a different set of APIs.
Using the Vista Thread Pool API
This set of functions allows you to create separate pools of threads, and configure each of them. Consider the steps you need to take to organize proxying of network connections using this API.
1. Create and configure the environment in which the thread pool will work. This environment allows you to correctly wait for the completion of all tasks that were transferred to a given pool
TP_CALLBACK_ENVIRON io_pool_env;
InitializeThreadpoolEnvironment(&io_pool_env);
PTP_CLEANUP_GROUP io_pool_cleanup = CreateThreadpoolCleanupGroup();
SetThreadpoolCallbackCleanupGroup(&io_pool_env, io_pool_cleanup, NULL);
2. Create and configure a thread pool
PTP_POOL io_pool = CreateThreadpool(NULL);
SetThreadpoolThreadMaximum(io_pool, 10);
SetThreadpoolThreadMinimum(io_pool, 2);
SetThreadpoolCallbackPool(&io_pool_env, io_pool);
We now have a dedicated thread pool with no fewer than two and no more than ten threads. In addition, the io_pool_cleanup variable lets us wait for the completion of all operations initiated in this pool. A thread pool for processing intercepted data (processing_pool) is configured the same way.
3. We create a socket and structures which are necessary for initiation of asynchronous operations
SOCKET sock =
    WSASocket(AF_INET, SOCK_STREAM, IPPROTO_TCP, NULL, 0, WSA_FLAG_OVERLAPPED);
PTP_IO io_item =
    CreateThreadpoolIo((HANDLE)sock, IoDoneCallback, this, &io_pool_env);
PTP_WORK process_item =
    CreateThreadpoolWork(WorkRoutine, this, &processing_env);
The implementations of IoDoneCallback (ReceiveDoneCallback) and WorkRoutine are similar to those described for the previous set of APIs, i.e. the existing business logic for processing intercepted data can be reused.
4. We initiate the asynchronous operation of reading data from the socket
//Indicate that the next I/O operation will go through our pool
StartThreadpoolIo(io_item);
//Initiate the I/O operation
BOOL res = ReadFile((HANDLE)sock, buf, buf_len, &received, (LPOVERLAPPED)&receive_ov);
Processing the operation's results is similar to the I/O completion port variant, with one peculiarity: if we do not want a callback delivered to the pool when the operation completes synchronously (which is what happens by default), the socket must be specially marked right after creation:
SetFileCompletionNotificationModes((HANDLE)sock, FILE_SKIP_COMPLETION_PORT_ON_SUCCESS);
In addition, it is important to remember that every initiated I/O operation must be either completed or cancelled, i.e. if the operation completed synchronously, with or without an error, you need to call:
CancelThreadpoolIo(io_item);
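Keeping StartThreadpoolIo and CancelThreadpoolIo balanced by hand across every code path is error-prone. One possible way to enforce the pairing is a small RAII guard (a portable sketch: the two callables stand in for the real Windows calls on a PTP_IO object):

```cpp
#include <functional>
#include <utility>

// Guarantees that a started I/O notification is cancelled unless the
// operation actually went pending. The two callables stand in for
// StartThreadpoolIo / CancelThreadpoolIo on a real PTP_IO item.
class PendingIoGuard {
public:
    PendingIoGuard(std::function<void()> start, std::function<void()> cancel)
        : cancel_(std::move(cancel)) {
        start();                     // StartThreadpoolIo(io_item)
    }
    // Call when ReadFile/WriteFile returned ERROR_IO_PENDING: the pool
    // callback will fire, so the notification must NOT be cancelled.
    void release() { armed_ = false; }
    ~PendingIoGuard() {
        if (armed_) cancel_();       // CancelThreadpoolIo(io_item)
    }
private:
    std::function<void()> cancel_;
    bool armed_ = true;
};
```

With such a guard, the synchronous-completion path cancels automatically, and only the pending path has to call release() explicitly.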
5. Initiate processing of the received data. The processing function is the same as in the QueueUserWorkItem variant
//Append the received data to the queue of unsent data
AddReceivedDataToQueue(buf, buf_len);
//Check whether queue processing is already running and
//mark it as started if a new run is required
if(!IsQueueProcessingAndMark())
{
    SubmitThreadpoolWork(process_item);
}

Pros and cons
The API set described above has a lot going for it, but it exists only in operating system versions starting with Windows Vista. On Windows XP and Windows Server 2003, I/O completion ports and the old system pool must be used instead. Nevertheless, both variants expose similar enough interfaces that intercepted data can be processed the same way, so there is a single code base, built for the different operating systems.
Conclusions
Any high-quality software product should use the most effective solutions the operating system provides for its technical problems. The transparent proxy service in our product has come a long way, and at this point it is implemented, it seems to me, about as efficiently as possible. I hope the lessons from the path we have travelled will help others get up to speed with these technologies and make the right decisions.