How to generate a huge financial graph with money laundering patterns?
A couple of years ago, my team (compliance at a Swiss bank) and I had an interesting task: we had to generate a huge random graph of financial transactions between clients, companies and ATMs. We also wanted this graph to contain money-laundering and other financial crime patterns, along with node descriptions such as names, addresses and currencies. All data had to be randomly generated from scratch, since we could not use any real data for obvious reasons.
As a solution, we wrote a generator that I’d love to share with you. This article explains why we needed it and how it works, but if you would rather skip the reading and try it yourself, here is the code: https://github.com/MGrin/transactions-graph-generator. I hope that our experience will be helpful to some of you.
Why were we interested in such a generator?
Our team decided to sponsor the LauzHack hackathon. One of the conditions for sponsors was to provide a real business task for participants, and by chance we had an interesting project at the time on detecting money laundering in a transaction graph. Naturally, we decided to give the same task to the hackathon participants.
As I said earlier, we could not use real data, so we needed to create it. To make the task as close to the real world as possible, we extracted some statistics from our data and tried to make the generated data follow similar distributions. We also didn’t want a small graph: on a daily basis we work with billions of transactions between millions of nodes, and we wanted to give participants the ability to try their ideas at the same scale.
What did we get as a result?
We built a fast, interesting and configurable graph generator! Let’s look at it in more detail.
Participants of our generated financial system:
- Client: an account of an abstract bank client. Contains fields like name, email, age, profession, education, nationality and address.
- Company: a business entity in our financial system. Contains fields like type, name and country.
- ATM: an exit point where money leaves our financial graph and can no longer be tracked. Contains GPS coordinates.
- Transaction: a record of a money transfer between 2 graph nodes. Contains pointers to the source and target nodes, the amount, the currency, and the date and time of execution.
To generate these data we used Mimesis, a great library for fake data generation.
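To make the four record types concrete, here is a minimal sketch of how they could be modelled as Python dataclasses. The field lists follow the descriptions above; the `id` fields, the class layout and the sample values are my assumptions for illustration and are not taken from the generator's source.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Client:
    id: str            # assumed unique node identifier
    name: str
    email: str
    age: int
    profession: str
    education: str
    nationality: str
    address: str


@dataclass
class Company:
    id: str            # assumed unique node identifier
    type: str
    name: str
    country: str


@dataclass
class ATM:
    id: str            # assumed unique node identifier
    latitude: float    # GPS coordinates
    longitude: float


@dataclass
class Transaction:
    source: str        # id of the node sending the money
    target: str        # id of the node receiving the money
    amount: float
    currency: str
    time: datetime     # date and time of execution


# Hand-written sample values; the real generator fills them with fake data.
example = Transaction("client-1", "atm-3", 250.0, "CHF", datetime(2018, 11, 24, 10, 30))
```

In the real generator the field values come from Mimesis rather than being written by hand.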
Graph generation: basic entities
The generator starts by creating the basic entities: clients, companies and ATMs. The script takes the desired number of clients as input and, based on this number, computes the number of companies and ATMs to generate. We set the number of companies to 2.5% of the total number of clients, and the number of ATMs to 0.05%. These ratios are rough averages and are hardcoded inside the generator itself.
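As a quick illustration of these ratios, the sketch below derives the node counts from the number of clients. The function name and the choice to round down are assumptions; only the 2.5% and 0.05% figures come from the text above.

```python
def node_counts(n_clients: int) -> dict:
    """Derive how many companies and ATMs to generate for n_clients clients."""
    return {
        "clients": n_clients,
        # 2.5% of clients, in integer arithmetic (rounding down is an assumption)
        "companies": n_clients * 25 // 1000,
        # 0.05% of clients, in integer arithmetic (rounding down is an assumption)
        "atms": n_clients * 5 // 10_000,
    }


counts = node_counts(1_000_000)
```

For one million clients this gives 25,000 companies and 500 ATMs.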
All generated information is saved into .csv files. Writing to these files is done in batches of k rows, k being a configurable parameter. Also, every type of node is generated in parallel to speed up the whole process.
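Batched writing can be sketched with the standard `csv` module as follows. This is an illustration of the idea, not the generator's actual code: `write_in_batches` and its parameters are hypothetical names, and `batch_size` plays the role of k.

```python
import csv
import io
from typing import Iterable, Sequence, TextIO


def write_in_batches(out: TextIO, header: Sequence[str],
                     rows: Iterable[Sequence], batch_size: int = 1000) -> int:
    """Write rows to an open text file, flushing every batch_size rows."""
    writer = csv.writer(out)
    writer.writerow(header)
    written = 0
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) >= batch_size:
            writer.writerows(batch)   # one write call per full batch
            written += len(batch)
            batch.clear()
    if batch:                         # write the last, partially filled batch
        writer.writerows(batch)
        written += len(batch)
    return written


# Usage: an in-memory buffer stands in for a real .csv file.
buffer = io.StringIO()
rows = ((i, f"client-{i}") for i in range(2500))
count = write_in_batches(buffer, ["id", "name"], rows, batch_size=1000)
```

Because the rows come from a generator expression, only one batch is held in memory at a time, which matters once the graph has millions of nodes.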
Graph generation: edges between entities
After creating the basic entities, we start connecting them to each other. At this stage we are not generating transactions yet, only the fact that two nodes are connected. We did it this way to speed up the process, and it works as follows: if there is an edge between two nodes, there will be some number of transactions between them, separated in time. No edge means no transactions at all.
The possible edge types and the probability of an edge between two nodes of each type are:
- Client -> Client, p = 0.4%
- Client -> Company, p = 1%
- Client -> ATM, p = 3%
- Company -> Client, p = 0.5%
These probabilities can be configured using generator parameters.
Just like nodes, all edge types are generated in parallel and are written to files in batches.
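One simple way to realise such per-type probabilities is an independent random trial for every ordered pair of nodes. The sketch below shows that idea; the function name, the dictionary layout and the decision to skip self-loops are assumptions, and the real generator may sample edges differently.

```python
import random
from typing import Iterator, Sequence, Tuple

# Edge probabilities per (source type, target type), taken from the list above.
EDGE_PROBABILITIES = {
    ("client", "client"): 0.004,
    ("client", "company"): 0.01,
    ("client", "atm"): 0.03,
    ("company", "client"): 0.005,
}


def generate_edges(sources: Sequence[str], targets: Sequence[str],
                   p: float, rng: random.Random) -> Iterator[Tuple[str, str]]:
    """Yield (source, target) pairs, keeping each candidate pair with probability p."""
    for source in sources:
        for target in targets:
            # Assumption: a node never gets an edge to itself.
            if source != target and rng.random() < p:
                yield (source, target)


rng = random.Random(42)  # fixed seed so the run is reproducible
clients = [f"client-{i}" for i in range(200)]
edges = list(generate_edges(clients, clients, EDGE_PROBABILITIES[("client", "client")], rng))
```

Note that this checks every pair, so its cost grows with the product of the two node counts. It is fine as an illustration, but a graph with millions of nodes would need a cheaper sampling scheme.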
Graph generation: transactions
With nodes and edges following the desired distribution, we can start generating transactions. The process is straightforward to implement but quite difficult to parallelise. Because of this, only two threads work at this stage: one generates transactions whose source is a client, the other generates transactions whose source is a company. The number of transactions per edge is random, and the transactions are randomly separated in time.
As with everything above, generated transactions are written to .csv files in batches.
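Here is a minimal sketch of how one edge could be turned into a random number of transactions spread over time. All concrete values (at most 10 transactions, gaps of up to 30 days, amounts between 1 and 5000, the CHF currency) are placeholders I picked for the example, not the generator's actual settings.

```python
import random
from datetime import datetime, timedelta
from typing import List, Tuple

Transfer = Tuple[str, str, float, str, datetime]  # source, target, amount, currency, time


def transactions_for_edge(source: str, target: str, rng: random.Random,
                          start: datetime, max_count: int = 10,
                          max_gap_days: float = 30.0) -> List[Transfer]:
    """Create a random number of transactions along one edge, in time order."""
    count = rng.randint(1, max_count)            # assumed range
    time = start
    result = []
    for _ in range(count):
        # Each transaction happens a random interval after the previous one.
        time += timedelta(days=rng.uniform(0, max_gap_days))
        amount = round(rng.uniform(1, 5000), 2)  # assumed amount range
        result.append((source, target, amount, "CHF", time))
    return result


txs = transactions_for_edge("client-1", "company-7", random.Random(1), datetime(2018, 1, 1))
```

Running this once per edge from the previous step yields the bulk of the "usual" transactions.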
Graph generation: patterns
This is where things get interesting. We defined three patterns that we wanted to have in our graph:
- Flow: a large amount of money is sent from a source node to N nodes, then each of these N nodes sends money to another K nodes, and so on, until the last layer of this network sends all the money it received to a target node.
- Circular: a large amount of money passes through different nodes and comes back to the source node.
- Time: some amount is transferred from node A to node B multiple times, separated by pseudo-random time intervals.
Let’s go through these patterns one by one.
For the Flow pattern, we start by selecting the number of layers in the network. In our implementation it is a random number between 2 and 6, and these bounds are not configurable. Then we randomly choose 2 nodes: the source and the target. The amount issued by the source node is also randomly generated, as 50000 * random() + 50000 * random().
Every participant of this network takes a payment for its service. In our implementation, the total payment for using the whole network will not exceed 10% of the source amount.
All generated transactions are delayed in time from one layer to the next. The delays are random and do not exceed 5 days.
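The Flow pattern described above can be sketched as follows. The number of layers, the amount formula, the 10% fee cap and the 5-day delay limit come from the text; everything else is an assumption made to keep the example short: layer widths of 1 to `max_width`, disjoint layers, equal splitting between receivers, a single shared fee rate and one delay per layer.

```python
import random
from datetime import datetime, timedelta
from typing import List, Sequence, Tuple

Transfer = Tuple[str, str, float, datetime]  # source, target, amount, time


def flow_pattern(nodes: Sequence[str], rng: random.Random, start: datetime,
                 max_width: int = 4) -> Tuple[str, str, float, List[Transfer]]:
    """Sketch of a Flow pattern; needs at least 2 + 6 * max_width nodes."""
    n_layers = rng.randint(2, 6)
    source, target = rng.sample(list(nodes), 2)
    amount = 50000 * rng.random() + 50000 * rng.random()

    # Assumption: every intermediary keeps the same fraction of what it holds.
    # With fee_rate <= 1 - 0.9 ** (1 / n_layers), at least 90% reaches the target.
    fee_rate = rng.uniform(0, 1 - 0.9 ** (1 / n_layers))

    # Assumption: layers are disjoint and never contain the source or the target.
    others = [n for n in nodes if n not in (source, target)]
    rng.shuffle(others)
    layers = []
    for _ in range(n_layers):
        width = rng.randint(1, max_width)
        layers.append([others.pop() for _ in range(width)])

    transfers = []
    time = start
    holders = {source: amount}              # node -> money it currently holds
    for depth, layer in enumerate(layers + [[target]]):
        time += timedelta(days=rng.uniform(0, 5))   # delay of at most 5 days
        keep = 0.0 if depth == 0 else fee_rate      # the source itself takes no fee
        received = dict.fromkeys(layer, 0.0)
        for sender, balance in holders.items():
            share = balance * (1 - keep) / len(layer)  # equal split (assumption)
            for receiver in layer:
                transfers.append((sender, receiver, share, time))
                received[receiver] += share
        holders = received
    return source, target, amount, transfers


start = datetime(2018, 1, 1)
nodes = [f"node-{i}" for i in range(50)]
source, target, amount, transfers = flow_pattern(nodes, random.Random(7), start)
```

The fee bound works because the money passes through `n_layers` groups of intermediaries, so the fraction that arrives is `(1 - fee_rate) ** n_layers`, which stays at or above 0.9 for any fee rate in the sampled range.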
The Circular pattern is similar to the Flow pattern, except that the source and target nodes are the same and there is only one node per intermediate layer.
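A Circular pattern can be sketched as a single chain that returns to its starting node. The sketch assumes the same 2 to 6 layers, amount formula, 10% fee cap and 5-day delay limit as the Flow pattern; the text only says the two patterns are similar, so treat those details as assumptions.

```python
import random
from datetime import datetime, timedelta
from typing import List, Sequence, Tuple

Transfer = Tuple[str, str, float, datetime]  # source, target, amount, time


def circular_pattern(nodes: Sequence[str], rng: random.Random,
                     start: datetime) -> List[Transfer]:
    """Sketch of a Circular pattern; needs at least 7 distinct nodes."""
    n_layers = rng.randint(2, 6)                       # one node per layer
    source, *intermediaries = rng.sample(list(nodes), n_layers + 1)
    amount = 50000 * rng.random() + 50000 * rng.random()
    # Assumed fee model: same rate for every intermediary, at most 10% in total.
    fee_rate = rng.uniform(0, 1 - 0.9 ** (1 / n_layers))

    path = [source] + intermediaries + [source]        # the money comes back
    transfers = []
    time = start
    for depth, (sender, receiver) in enumerate(zip(path, path[1:])):
        if depth > 0:
            amount *= 1 - fee_rate                     # intermediary keeps its fee
        time += timedelta(days=rng.uniform(0, 5))      # delay of at most 5 days
        transfers.append((sender, receiver, amount, time))
    return transfers


cycle = circular_pattern([f"node-{i}" for i in range(20)], random.Random(3),
                         datetime(2018, 1, 1))
```

Each transfer's receiver is the next transfer's sender, and the last receiver is the original source, which is exactly the loop a detection algorithm has to find.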
The Time pattern is the easiest one. A random amount is sent from node A to node B multiple times (a random number between 5 and 50, not configurable), with pseudo-random delays between transactions.
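A Time pattern is short enough to sketch in a few lines. The 5 to 50 repetitions come from the text; the amount range, the maximum gap between transfers and the choice to reuse the same amount every time are assumptions.

```python
import random
from datetime import datetime, timedelta
from typing import List, Sequence, Tuple

Transfer = Tuple[str, str, float, datetime]  # source, target, amount, time


def time_pattern(nodes: Sequence[str], rng: random.Random, start: datetime,
                 max_gap_days: float = 10.0) -> List[Transfer]:
    """Sketch of a Time pattern: the same transfer repeated 5 to 50 times."""
    sender, receiver = rng.sample(list(nodes), 2)
    amount = round(rng.uniform(100, 10_000), 2)   # assumed amount range
    repetitions = rng.randint(5, 50)
    transfers = []
    time = start
    for _ in range(repetitions):
        # Pseudo-random delay since the previous transfer (assumed upper bound).
        time += timedelta(days=rng.uniform(0, max_gap_days))
        transfers.append((sender, receiver, amount, time))
    return transfers


repeated = time_pattern([f"node-{i}" for i in range(20)], random.Random(5),
                        datetime(2018, 1, 1))
```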
Final transactions randomisation
At this stage we have a number of .csv files:
- 3 files with node information (clients, companies, ATMs)
- 4 files with transactions: one for regular transactions and 3 for transactions from the generated patterns.
Another script shuffles all transactions and concatenates them into a single transactions file.
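The merge step can be sketched like this. The sketch assumes that all transaction files share the same header and that all rows fit in memory, which will not hold for the largest graphs; `merge_and_shuffle` is a hypothetical name, not the name of the actual script.

```python
import csv
import io
import random
from typing import Iterable, TextIO


def merge_and_shuffle(inputs: Iterable[TextIO], output: TextIO,
                      rng: random.Random) -> int:
    """Concatenate CSV files with identical headers and shuffle their rows."""
    header = None
    rows = []
    for handle in inputs:
        reader = csv.reader(handle)
        file_header = next(reader, None)
        if file_header is None:       # skip empty files
            continue
        if header is None:
            header = file_header      # keep the header of the first file
        rows.extend(reader)
    rng.shuffle(rows)                 # in-memory shuffle (assumption: rows fit in RAM)
    writer = csv.writer(output)
    if header is not None:
        writer.writerow(header)
    writer.writerows(rows)
    return len(rows)


# Usage: in-memory buffers stand in for the four transaction files.
usual = io.StringIO("source,target,amount\nc1,c2,10\nc2,c3,20\n")
flow = io.StringIO("source,target,amount\nc4,c5,90000\n")
merged = io.StringIO()
total = merge_and_shuffle([usual, flow], merged, random.Random(0))
```

Shuffling hides the fact that pattern transactions were generated separately, so participants cannot find them simply by looking at their position in the file.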
What to do with the output of this generator
At the end of the day we have 4 beautiful .csv files with graph nodes and transactions. We can import this graph into Neo4j or serve it over a REST API; basically, we can do anything with it! The graph can be as big as you want (we ended up generating a graph with 2 billion clients, which took around 8 hours on my 2014 MacBook).
We got a lot of positive feedback from the hackathon participants, as well as a couple of really good solutions to the problem of pattern detection in big graphs.
Thank you for your time, and here is the link to the generator: https://github.com/MGrin/transactions-graph-generator