Large transaction graph generator with patterns of criminal activity
A couple of years ago, our team (compliance at a Swiss bank) faced a very interesting task: we had to generate a large graph of transactions between customers, companies and ATMs, plant patterns in it that resemble money laundering and other criminal activity, and attach a minimum of information to the nodes of the graph: names, addresses, times and so on. Of course, all the data had to be generated from scratch, without using existing customer data.
To solve this problem we wrote a generator, which I would like to share with you. Below you will find the story of why we needed it and a description of how the generator works. For the impatient: here lies the code. I would be glad if someone benefits from our experience.
Why are we doing such nonsense?
Our team decided to sponsor the LauzHack hackathon. One of the conditions for sponsors was to provide a real business task for the participants. At that time we had a very interesting project on automating the search for financial crime and money laundering among our customers' transactions, and without hesitation we decided to offer the same task to the hackathon participants.
For obvious reasons we could not use real data, so we had to create it. To make the task as close to reality as possible, we looked at the statistics of the real data and tried, as best we could, to bring the generated data close to the real distributions. We also did not skimp on the amount and complexity of the data: we did not need a solution that works on a graph of 100 nodes and 200 connections. We were looking for one capable of processing graphs with millions of nodes and billions of connections, taking into account all available information about nodes and connections.
What we got
We got a fairly fast (given the amount of data), interesting and configurable generator! Let's go through it in detail.
We want a graph of financial transactions, so the possible entities in this graph are:
- Client - essentially the account of an abstract bank client. Described by name, email, age, job, political views, nationality, education and home address.
- Company - a business entity in the financial system. Described by company type, name and country.
- ATM - roughly speaking, a point where money leaves the graph we control. Described by geographic coordinates.
- Transaction - the fact of transferring money from one node of the graph to another. Described by start and end node, amount, currency and time.
To create this data, we use Mimesis, a great library for generating fake data.
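As a rough illustration, the four entities above could be modelled as plain records like these. This is only a sketch: the class and field names, and the string `id` fields in particular, are assumptions made for the example and are not taken from the generator's code.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Client:
    id: str                # hypothetical identifier, e.g. "client-42"
    name: str
    email: str
    age: int
    job: str
    political_views: str
    nationality: str
    education: str
    address: str


@dataclass
class Company:
    id: str                # hypothetical identifier
    type: str
    name: str
    country: str


@dataclass
class ATM:
    id: str                # hypothetical identifier
    latitude: float
    longitude: float


@dataclass
class Transaction:
    source: str            # id of the node the money leaves
    target: str            # id of the node the money arrives at
    amount: float
    currency: str
    time: datetime


# Example: a client withdrawing cash at an ATM.
withdrawal = Transaction("client-42", "atm-7", 200.0, "CHF", datetime(2018, 11, 24, 10, 30))
```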
Creating a graph: basic entities
First you need to create all the basic entities: customers, companies and ATMs. The script takes the number of customers you want to create and derives the number of companies and ATMs from it. According to our data, the number of companies that have a significant number of transactions with customers is approximately 2.5% of the number of customers, and the number of ATMs is 0.05% of the number of customers. These values are very rough and not configurable (they are hard-coded in the generator).
All information is saved in .csv files. Writing to these files happens in batches of k lines at a time; k is set through a script argument. The three types of nodes are generated in parallel.
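The sizing rule and the batched writes might look roughly like the sketch below. The function names, the `max(1, ...)` floor for tiny graphs and the ATM row layout are assumptions for illustration; only the 2.5% and 0.05% ratios come from the text above. Parallel generation is left out to keep the example short.

```python
import csv
import random
from itertools import islice

COMPANY_RATIO = 0.025   # companies are about 2.5% of clients (from the article)
ATM_RATIO = 0.0005      # ATMs are about 0.05% of clients (from the article)


def node_counts(n_clients):
    """Derive the number of companies and ATMs from the number of clients."""
    return {
        "clients": n_clients,
        # max(1, ...) is an assumption so that tiny graphs still get one node
        "companies": max(1, round(n_clients * COMPANY_RATIO)),
        "atms": max(1, round(n_clients * ATM_RATIO)),
    }


def write_in_batches(path, header, rows, batch_size):
    """Write an iterable of rows to a CSV file, batch_size rows at a time."""
    rows = iter(rows)
    written = 0
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        while True:
            batch = list(islice(rows, batch_size))
            if not batch:
                break
            writer.writerows(batch)
            written += len(batch)
    return written


def atm_rows(count, rng):
    """Lazily produce ATM rows: id plus random geographic coordinates."""
    for i in range(count):
        yield (f"atm-{i}", rng.uniform(-90, 90), rng.uniform(-180, 180))
```

For 100,000 clients this gives 2,500 companies and 50 ATMs. Because `atm_rows` is a generator, only one batch of rows is held in memory at a time.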
Creating a graph: connections between entities
After creating the basic entities, we begin to connect them. At this stage we do not yet generate the transactions themselves, only the fact that a connection exists between two nodes. This speeds up the generation of the whole graph and works roughly as follows: if two nodes are connected, we later generate a certain number of transactions between them, scattered in time. If they are not connected, no transactions exist between them.
The probability of a connection between two nodes is configured through arguments; the default values are listed below.
Possible connection types:
- Client -> Client (p = 0.4%)
- Client -> Company (p = 1%)
- Client -> ATM (p = 3%)
- Company -> Client (p = 0.5%)
Like nodes, all types of connections are generated in parallel and written to their files in batches.
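A minimal way to picture this step is an independent coin flip for every ordered pair of nodes, as sketched below. The function name and the exclusion of self-loops are assumptions, and the real generator may sample connections differently: this naive double loop is quadratic in the number of nodes and only suitable for small graphs.

```python
import random

# Default probabilities from the list above, written as fractions.
P_CLIENT_CLIENT = 0.004
P_CLIENT_COMPANY = 0.01
P_CLIENT_ATM = 0.03
P_COMPANY_CLIENT = 0.005


def connect(sources, targets, p, rng):
    """Yield (source, target) pairs, each kept independently with probability p."""
    for s in sources:
        for t in targets:
            # Skipping self-loops is an assumption of this sketch.
            if s != t and rng.random() < p:
                yield (s, t)


rng = random.Random(42)
clients = [f"client-{i}" for i in range(200)]
atms = [f"atm-{i}" for i in range(3)]
client_atm_edges = list(connect(clients, atms, P_CLIENT_ATM, rng))
```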
Creating a graph: transactions
With the nodes of the graph and the connections between them following the desired distribution, we can start generating transactions. The process itself is quite simple, but parallelizing it is hard. Therefore, at this stage there are only two independent streams: transactions originating from clients and transactions originating from companies.
Nothing particularly interesting happens here: the script walks through the list of connections and generates a random number of transactions for each one. Everything is written the same way as before, to .csv files in batches.
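Expanding one connection into transactions could be sketched as follows. The bounds on the count and amount, the single `CHF` currency and the function name are all placeholder assumptions; the article only says that the number of transactions is random and that they are scattered in time.

```python
import random
from datetime import datetime, timedelta


def transactions_for(connection, rng, start, end, max_count=10, max_amount=1000.0):
    """Yield a random number of transactions for one (source, target) connection."""
    source, target = connection
    span_seconds = (end - start).total_seconds()
    for _ in range(rng.randint(1, max_count)):
        # Scatter each transaction uniformly over the [start, end] period.
        when = start + timedelta(seconds=rng.uniform(0, span_seconds))
        amount = round(rng.uniform(1.0, max_amount), 2)
        yield (source, target, amount, "CHF", when.isoformat())


rng = random.Random(1)
rows = list(transactions_for(("client-1", "company-9"), rng,
                             datetime(2018, 1, 1), datetime(2019, 1, 1)))
```

The resulting rows can be fed straight into a batched CSV writer.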
Creating a graph: patterns
This is where it gets interesting. The types of behavior patterns we wanted to have in the final graph:
- Flow - a large amount goes from one node to m other nodes, each of these m nodes passes the money on to the next level of n nodes, and so on, until the last level sends all the money to a single recipient.
- Circular - the amount of money goes in a circle and returns to the source.
- Time - a certain amount of money goes from one node to another with some fixed frequency.
Let's look at each of these patterns in more detail.
In the Flow pattern, the number of levels the money has to pass through is chosen first. In our implementation this is a random number between 2 and 6; it is not configurable and is hard-coded. Next, two nodes of the graph are selected: the sender and the recipient. A random amount that the sender will send to the recipient is also chosen, according to the tricky formula 50000 * random() + 50000 * random().
Each member of this network takes a fee for its services. In our implementation, the total cost of passing money through the network is at most 10% of the amount transferred by the sender.
Generated transactions are shifted in time relative to the transactions of the previous network level: money first arrives at level n-1 and only then moves on to level n. The delays are chosen randomly within 4-5 days. The transaction amounts are pseudo-random as well, limited by the initial amount and by the fees taken at each node.
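Putting the Flow description together, one possible reading is sketched below. The 2 to 6 levels, the amount formula, the 10% fee cap and the 4 to 5 day delays come from the text. Everything else is an assumption of this sketch: the function names, the width of each level (`max_width`), the way each node splits its money, and the way the fee is shared evenly between levels.

```python
import random
from datetime import datetime, timedelta


def split(total, parts, rng):
    """Split total into `parts` positive pseudo-random shares that sum to total."""
    weights = [rng.random() + 0.1 for _ in range(parts)]
    scale = total / sum(weights)
    return [w * scale for w in weights]


def flow_pattern(nodes, rng, start, max_width=4):
    """Return (sender, recipient, amount, transactions) for one Flow pattern.

    `nodes` must be a list with at least 2 + 6 * max_width entries.
    """
    levels = rng.randint(2, 6)
    amount = 50000 * rng.random() + 50000 * rng.random()
    # Total fees are capped at 10% of the amount and shared evenly by level.
    per_level_fee = amount * rng.uniform(0.0, 0.10) / levels
    widths = [rng.randint(1, max_width) for _ in range(levels)]
    picked = rng.sample(nodes, 2 + sum(widths))
    sender, recipient = picked[0], picked[1]
    layers, pos = [[sender]], 2
    for w in widths:
        layers.append(picked[pos:pos + w])
        pos += w
    layers.append([recipient])

    balances = {sender: amount}
    when, txs = start, []
    for depth, (prev, nxt) in enumerate(zip(layers, layers[1:])):
        incoming = dict.fromkeys(nxt, 0.0)
        for src in prev:
            # Each node forwards its whole balance, split across the next level.
            for dst, share in zip(nxt, split(balances[src], len(nxt), rng)):
                txs.append((src, dst, round(share, 2), when))
                incoming[dst] += share
        total_in = sum(incoming.values())
        # Intermediate levels keep their fee; the final recipient does not.
        fee = 0.0 if depth == len(layers) - 2 else per_level_fee
        balances = {n: v - fee * v / total_in for n, v in incoming.items()}
        when += timedelta(days=rng.uniform(4, 5))   # delay before the next level
    return sender, recipient, amount, txs
```

With this scheme the recipient always ends up with between 90% and 100% of what the sender paid out, and every level moves its money 4 to 5 days after the previous one.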
The Circular pattern is generated on a similar principle to Flow, but instead of a distinct sender and recipient with several levels between them, the money goes around in a circle and returns to the original node. All intermediate nodes charge a fee, as in Flow, and the transactions are likewise offset in time.
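A circular pattern can be sketched as a single chain that ends where it started. The article does not give the ring length, the amount or the fee rules for this pattern, so this example simply reuses the Flow parameters (2 to 6 intermediaries, the same amount formula, a 10% fee cap, 4 to 5 day delays) as assumptions; the function name is hypothetical too.

```python
import random
from datetime import datetime, timedelta


def circular_pattern(nodes, rng, start, max_fee_rate=0.10):
    """Return the transactions of one cycle that starts and ends at the same node."""
    hops = rng.randint(2, 6)                  # number of intermediate nodes (assumed)
    ring = rng.sample(nodes, hops + 1)        # ring[0] is the origin
    amount = 50000 * rng.random() + 50000 * rng.random()
    fee = amount * rng.uniform(0.0, max_fee_rate) / hops   # kept by each intermediary
    path = ring + [ring[0]]                   # close the circle
    txs, when, current = [], start, amount
    for i, (src, dst) in enumerate(zip(path, path[1:])):
        if i > 0:
            current -= fee                    # the intermediary keeps its fee
        txs.append((src, dst, round(current, 2), when))
        when += timedelta(days=rng.uniform(4, 5))
    return txs


cycle = circular_pattern(list(range(50)), random.Random(7), datetime(2020, 1, 1))
```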
The Time pattern is the simplest one. A certain amount is sent from the sender to the recipient a random number of times (from 5 to 50, not configurable) with pseudo-random time shifts.
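One way to read "fixed frequency with pseudo-random shifts" is a regular period plus a small jitter, as in this sketch. The 5 to 50 repetitions come from the text; the weekly period, the 12-hour jitter, the amount range and the function name are assumptions.

```python
import random
from datetime import datetime, timedelta


def time_pattern(sender, recipient, rng, start, period_days=7.0, jitter_hours=12.0):
    """Return repeated transfers of one fixed amount at a roughly regular interval."""
    amount = round(rng.uniform(100.0, 10000.0), 2)   # assumed amount range
    repeats = rng.randint(5, 50)
    txs = []
    for i in range(repeats):
        jitter = timedelta(hours=rng.uniform(-jitter_hours, jitter_hours))
        when = start + timedelta(days=period_days * i) + jitter
        txs.append((sender, recipient, amount, when))
    return txs


periodic = time_pattern("client-3", "client-8", random.Random(5), datetime(2020, 1, 1))
```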
All new transactions are written in the same way to .csv files in batches.
Graph randomization and collecting all transactions in one file
At this stage, we have several .csv files:
- 3 files with nodes (clients, companies and ATMs)
- 4 transaction files: one for regular transactions and 3 containing patterns.
An additional script mixes the pattern transactions in with the regular ones, so that the patterns cannot be spotted from the order in which transactions are recorded in a file.
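The mixing step can be pictured as below. This is an in-memory sketch with a hypothetical function name, and it assumes all transaction files share the same header. For files with billions of rows an external shuffle would be needed, which this example does not attempt.

```python
import csv
import random


def mix_transactions(input_paths, output_path, rng):
    """Merge several transaction CSVs into one file with the rows shuffled."""
    header, rows = None, []
    for path in input_paths:
        with open(path, newline="") as f:
            reader = csv.reader(f)
            file_header = next(reader, None)   # None for an empty file
            if header is None:
                header = file_header
            rows.extend(reader)
    rng.shuffle(rows)                          # hides which file a row came from
    with open(output_path, "w", newline="") as f:
        writer = csv.writer(f)
        if header is not None:
            writer.writerow(header)
        writer.writerows(rows)
    return len(rows)
```

Calling it with the regular transaction file and the three pattern files produces the single transaction file used in the final data set.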
And what to do with all this?
In the end, we have 4 beautiful files with the graph nodes and the transactions between them. You can import them into Neo4J, serve them over REST, or do whatever your heart desires with them.
As for us, we received very positive feedback from the hackathon participants, along with some very interesting solutions for finding patterns in massive graphs.