# Agglomeration analysis of associated systems

Suppose you are in molecular modeling. On your hands you have bulky path files that would be nice to analyze somehow. Consider (and below we will discuss mainly this system), for example, methanol. We can, for example, construct a radial distribution function (RDF), as is done in almost every similar article. But, recalling that there is a specific agglomeration in methanol - hydrogen bonds - we may

First, it doesn’t matter by what method you got the file paths - at least QM, at least MM, at least some kind of hybrid. The important thing is that you need to know the exact coordinates of the atoms. It doesn’t even matter if it was a trajectory - you can use one snapshot of the trajectory, but statistics will suffer from this and, as a result, the results can be incorrect.

Now we turn to the question related to the determination of the hydrogen bond. It is proposed to determine the presence (or absence) of a hydrogen bond in several possible ways (as well as their combinations), for example:

As you know, modern resources do not allow to simulate bulky systems (a la unit cell containing 1 mole of molecules), therefore it is necessary to use

The second question we may face is what if we have a molecule with two functional groups? For example, for hydrogen bonds - ethanediol. Or even more fun - water. I suggest taking this opportunity into account by introducing an additional label - the number of the molecule. That is, to operate not with separate groups, but with molecules immediately.

The third problem is that similar agglomeration can be observed not only in the system with OH groups, but also, for example, in chlorobenzene. I will not go into details that, how and why, I only notice that such an association is similar to micelle formation in colloidal systems. In this case, we can analyze the chlorine-chlorine distances to determine whether the molecules are bound to associate.

I will immediately make a couple of comments. The programmer from me is a little more than complete curve, because something, for sure, can be done better. At first I thought of bringing a block diagram, but it turned out to be too cumbersome.

Initial data:

Algorithm for searching for bound molecules:

Output Algorithm:

Then we write out arrays of agglomerates in any convenient form. At the end we add a block with general statistics (averaged over all points of the trajectory). You can add statistics of bonds between, say, oxygen atoms.

The first is a cross-analysis for the presence or absence of cycles in the associate. Without going into evidence, I note that. if we assign a weight of 0 to a molecule associated with one molecule, with two -1, etc., and then calculate the total weight of the graph (associate), then we can determine the number of cycles in the graph:

The second is much more fun. There are original articles here and here . The point is that if we have two graphs, we can compare whether they are isomorphic (in the sense of the same ± permutations) and determine the permutation by which one graph from the other is obtained. True, if my memory serves me, in these articles there were typos in the algorithms themselves, be careful.

As applied to our system, we can create a database of the main graphs and check the resulting associates for compliance with a certain graph. You can also conduct statistics.

Unfortunately, I did not implement this block. Only a program is available that checks two graphs for isomorphism.

Well, the third. The analysis of one molecule is the dynamics of its presence in various agglomerates throughout the entire trajectory.

Source codes of programs can be taken from the link . Everything is implemented in C. Unfortunately, there are no comments in the code.

The STAT_GEN program works with a file format similar to TINKER’s trajectory files, with one exception - there is no block corresponding to the connection of different atoms with each other, and a column indicating the number of the molecule has been added.

The AGL program pulls out individual agglomerates from the path files and writes them to the PDB format.

The graph program analyzes graph isomorphism.

Unfortunately, the operability of these programs on Windows after transferring to LInux (it was originally written under Windows, because there is, for example, the conio.h library) was not tested, because I don’t know what it is.

At the output, you can get something like this picture (from one of my presentations):

Or another structure, in the form of a graph:

Thank you for your attention!

*suddenly*want to see how it really looks like there. See how agglomerates look (maybe even compare their topology), how they are distributed by size, or whatever else you want. Actually, below I propose one of the options for implementing such a program.#### Bit of theory

First, it doesn’t matter by what method you got the file paths - at least QM, at least MM, at least some kind of hybrid. The important thing is that you need to know the exact coordinates of the atoms. It doesn’t even matter if it was a trajectory - you can use one snapshot of the trajectory, but statistics will suffer from this and, as a result, the results can be incorrect.

Now we turn to the question related to the determination of the hydrogen bond. It is proposed to determine the presence (or absence) of a hydrogen bond in several possible ways (as well as their combinations), for example:

- The distances between the oxygen atoms of different molecules and the hydrogen atom and the oxygen atom of another molecule should be less than the reference values
- The distances between the oxygen atoms of different molecules and the HOO angle should be less than the reference values
- Energy criterion

#### What do we need to consider?

As you know, modern resources do not allow to simulate bulky systems (a la unit cell containing 1 mole of molecules), therefore it is necessary to use

*periodic boundary conditions*. A real infinite system is replaced by a set of identical finite systems, each of which is located in a spatial cell surrounded by identical cells on all sides, and molecules in one cell interact with molecules in neighboring cells. In reality, for such an analysis, we do not have to translate the cell in all directions, increasing the estimated space by 27 times (instead of one cell we get 3 ^ 3 = 27). You can limit yourself to broadcasting half a rib in each direction. It is easy to show that in this case we will increase the estimated space by only 8 times. At the same time, we can explicitly translate, scoring new coordinates into variables, or simply “keep in mind” the possibility of translation in the analysis. My program implements the first option. In this case, by the way, it should be borne in mind that*the bond with the molecule in the translated cell must be transferred to the molecule in the original cell*.The second question we may face is what if we have a molecule with two functional groups? For example, for hydrogen bonds - ethanediol. Or even more fun - water. I suggest taking this opportunity into account by introducing an additional label - the number of the molecule. That is, to operate not with separate groups, but with molecules immediately.

The third problem is that similar agglomeration can be observed not only in the system with OH groups, but also, for example, in chlorobenzene. I will not go into details that, how and why, I only notice that such an association is similar to micelle formation in colloidal systems. In this case, we can analyze the chlorine-chlorine distances to determine whether the molecules are bound to associate.

#### Implementation

I will immediately make a couple of comments. The programmer from me is a little more than complete curve, because something, for sure, can be done better. At first I thought of bringing a block diagram, but it turned out to be too cumbersome.

Initial data:

- The coordinates of the atoms of interest to us (including those translated), in the example with methanol - the oxygen atom and the hydrogen associated with it
- An array of labels indicating which molecule certain functional groups belong to.
- 2 parameters - reference intermolecular distances OH and OO

Algorithm for searching for bound molecules:

- Two nested loops iterating over all molecules (including translated ones) - the second loop, naturally, starts with the next molecule (why do we need to take the same thing into account twice?).
- We consider the distance between the oxygen atoms. For all groups in the molecules, we find among them the minimum. We check with the reference. If less, then move on.
- The implementation of the previous step, but already for oxygen and hydrogen atoms. That is, we consider all possible (if on one oxygen, for example, two hydrogens - like in water) the distance between the oxygen atoms of the first molecule (defined in the previous step) and the hydrogen of the second molecule (associated with oxygen determined in the previous step), we find minimum. We check with the reference. If less, then move on. If it’s more, then we’ll first check the opposite - the distance between the hydrogen atoms of the first molecule and the oxygen of the second molecule. If less, then move on, more - go to the next molecules.
- We write that these molecules are connected (if necessary, and bond lengths too), we proceed to the following. In this case, all communications with the translated molecules are recorded on the molecules located in the original cell.

Output Algorithm:

- We put a binary label for each molecule in the original cell (i.e. without translated ones) in True. A loop iterated over the molecules in the original cell.
- We take the first molecule labeled True, set it to False. We write it into an array of agglomerates, then we write down there all the molecules that have the label True and are associated with it, attributing them False. Continue until
~~blue~~until there are no new connections. - We repeat the previous paragraph until all molecules have a False label.

Then we write out arrays of agglomerates in any convenient form. At the end we add a block with general statistics (averaged over all points of the trajectory). You can add statistics of bonds between, say, oxygen atoms.

#### What else to add?

The first is a cross-analysis for the presence or absence of cycles in the associate. Without going into evidence, I note that. if we assign a weight of 0 to a molecule associated with one molecule, with two -1, etc., and then calculate the total weight of the graph (associate), then we can determine the number of cycles in the graph:

- If the weight P of the graph is N-2, where N is the number of graph vertices (molecules in the agglomerate), then the graph has no cycles and has either a branched or linear structure
- If the weight P of the graph is greater than N-2, then the graph has cycles in the amount of (PN) / 2 + 1

The second is much more fun. There are original articles here and here . The point is that if we have two graphs, we can compare whether they are isomorphic (in the sense of the same ± permutations) and determine the permutation by which one graph from the other is obtained. True, if my memory serves me, in these articles there were typos in the algorithms themselves, be careful.

As applied to our system, we can create a database of the main graphs and check the resulting associates for compliance with a certain graph. You can also conduct statistics.

Unfortunately, I did not implement this block. Only a program is available that checks two graphs for isomorphism.

Well, the third. The analysis of one molecule is the dynamics of its presence in various agglomerates throughout the entire trajectory.

#### In addition

Source codes of programs can be taken from the link . Everything is implemented in C. Unfortunately, there are no comments in the code.

The STAT_GEN program works with a file format similar to TINKER’s trajectory files, with one exception - there is no block corresponding to the connection of different atoms with each other, and a column indicating the number of the molecule has been added.

The AGL program pulls out individual agglomerates from the path files and writes them to the PDB format.

The graph program analyzes graph isomorphism.

Unfortunately, the operability of these programs on Windows after transferring to LInux (it was originally written under Windows, because there is, for example, the conio.h library) was not tested, because I don’t know what it is.

At the output, you can get something like this picture (from one of my presentations):

Or another structure, in the form of a graph:

Thank you for your attention!