Haskell in the real world
A lot has already been written on this blog about the Haskell language itself, and there have been several articles about its practical application. Now I want to talk about yet another real use of the language in production.
Data Description
I work in telecom: fixed telephony, Internet, IP telephony. My duties include processing traffic from the PBXs and the IP telephony servers, supporting the current billing, writing software, and administering the networks and the domain (yes, a "jack of all trades"). Recently new equipment was installed in my organization that works on completely different principles, and the traffic changed accordingly. It now comes in two flavors: from the old exchanges and from the new ones. The new traffic is duplicated in binary and text formats. The binaries do not suit us at all (although some people naively said: "What's the problem? You just take them and they drop into billing by themselves!"), and the text format takes an order of magnitude more space, but at least we can do something with it. Traffic from the old equipment keeps coming, and my colleague and I push it through well-established schemes built on local DBMS services. We could configure the same services for the traffic from the new exchanges, but a few peculiarities came to light.

The new exchanges write traffic to a separate file every 15 minutes; that comes to about 3,000 files per month (ideally 2976 for a 31-day month and 2880 for a 30-day one). You are not going to import each file separately, that would be insane. They can, and should, be merged into one, since they are all text and the call records are laid out line by line. Manual merging looks like this: you select only last month's files and feed them to the simplest merge script on the command line. The file name format is fixed, so the merging can be automated; you only have to parse out the year and month. On Linux I would use Bash, Perl or Python, cheap and cheerful, but you are not going to install them on a Windows machine for a single operation, and the same goes for PowerShell. And cmd is a perversion, as we all know. ;) Finally, there were surprises in the traffic itself, because of which, even after merging and importing with DBMS tools, a lot of manual SQL work was still required. One way or another, the factors lined up into a task for Haskell, which I had just started learning at the time (April-May 2011).
So, roughly 3,000 15-minute files per month. The interval can be changed on the equipment from 15 minutes to any other value: 5, 10, 30, 45... The number of files and their sizes change accordingly. An example file name (2011-05-09 09:30:00):
999992011050909300081.txt

99999 - identifier baked into the equipment (for obvious reasons I replaced it with nines)
2011  - year
05    - month
09    - day
09    - hours
30    - minutes
00    - seconds
81    - some random number, perhaps tenths of a second
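Since the name format is fixed, pulling the year and month out of it is trivial. Here is a minimal sketch of such a parse (hypothetical code, not taken from the original programs):

import Data.Char (isDigit)

-- 5 identifier digits, then yyyy, mm, dd, hh, mi, ss and a trailing pair of digits.
yearMonthOfFile :: FilePath -> Maybe (Int, Int)
yearMonthOfFile name
  | length yyyy == 4 && length mm == 2 && all isDigit (yyyy ++ mm) = Just (read yyyy, read mm)
  | otherwise                                                      = Nothing
  where
    yyyy = take 4 (drop 5 name)
    mm   = take 2 (drop 9 name)

For "999992011050909300081.txt" this gives Just (2011, 5), which is all the merger needs to know.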
The number of subscribers keeps growing, and each file steadily grows in size. Right now we average about 240 lines per file, but there was the usual seasonal summer dip: people went on vacation and called less. For September we expect activity to rise by a factor of one and a half to two.

Call records are varied. There are several record types with different numbers of fields. Records like R210 are very rare, and we never found out what they mean (here and below the traffic data is replaced with random values):
|R210|2011-06-24 21:43:53|2011-06-24 01:43:52|1|
As you can see, there are only 4 fields: the record type identifier, a start date, an end date (ISO 8601 / SQL format) and, for some reason, a one. The fields are separated by a vertical bar, which also appears at the beginning and at the end of the record, so in fact there is one field more, that is, 5. It is convenient to assume that the field with index 0 is the empty one sitting before the first vertical bar; the significant fields are then counted from 1.

Normal calls are recorded in records of type R200. These already have 152 fields, and this can be reconfigured on the equipment: some fields can be added, some removed, and the format of others changed.
|R200|99999|111111|CR,CS,AM|1|1|3022|222222|333333|||2011-06-23 11:33:58|C|2011-06-23 11:34:22|S|0|16|1||||||1|1||||||3|162|17|1|12|24|||||||||||16|0||||||192.168.1.172||192.168.1.12||||||8|8|20|20|64|64|20|0|0|OS|7777|8888|555555|666666|0|8|9||||OS|19|||30|10|42|43|||||||||||||1||||||1|1|0|3||222222|||||||2|1||333333|||||||||||||||||||||||||||||||||
We are interested in the fields with indices [7, 8, 9, 12, 14, 36, 112, 122], and in the end we would like to filter out everything else so that nothing superfluous gets imported into the DBMS. Picking only what we need out of the raw record, we get this line:
Record:  3022|222222|333333|2011-06-23 11:33:58|2011-06-23 11:34:22|24|222222|333333
Indices:    7|     8|     9|                 12|                 14|36|   112|   122

Index   | Explanation
--------+------------------------------------------
7       | area code of Chita
8, 112  | outgoing number
9, 122  | incoming number
12      | date and time of the start of the call
14      | date and time of the end of the call
36      | call duration in seconds
None of the other fields are really needed. Some, as you can see, are simply empty, and the meaning of others is unknown. Except that the IP addresses (changed here) belong to two boards in the telephone network infrastructure between which the RTP traffic flows; by importing those fields we could study the load on the boards. Perhaps it will come in handy some day.

Traffic records are packed densely, one per line. There may be some other record types, but they are of no interest to us: for billing, the R200 records alone are enough. However, while eyeballing the traffic, another interesting fact turned up. Sometimes there were calls from the same number, starting at the same time, but with different durations. At first, with incomplete information, I thought it was some kind of glitch, since a person cannot make parallel calls from the same number. Then a pattern started to emerge, and finally I realized what was going on. Here is an example of such records for one phone number; I threw away all the extra fields for clarity:
|7   |8     |9     |12                 |14                 |36  |
|3022|222222|333333|2011-05-23 13:07:54|2011-05-23 13:37:54|1800|
|3022|222222|333333|2011-05-23 13:07:54|2011-05-23 13:59:40|3106|
|3022|444444|555555|2011-05-23 14:53:52|2011-05-23 15:23:52|1800|
|3022|444444|555555|2011-05-23 14:53:52|2011-05-23 15:53:52|3600|
|3022|444444|555555|2011-05-23 14:53:52|2011-05-23 16:00:50|4018|
|3022|666666|777777|2011-05-23 19:15:55|2011-05-23 19:45:54|1800|
|3022|666666|777777|2011-05-23 19:15:55|2011-05-23 20:15:54|3600|
|3022|666666|777777|2011-05-23 19:15:55|2011-05-23 20:45:54|5400|
|3022|666666|777777|2011-05-23 19:15:55|2011-05-23 20:47:17|5483|
Now the catch is easy to see, but back then it was not easy to spot these records among thousands of similar ones, among a heap of extra fields, letters and numbers. In short, it was either luck, or intuition, or magic. :) The answer turned out to be simple: every half hour (1800 seconds) the equipment records a "milestone" of the conversation, just in case. Even if the last 29 minutes of a call were somehow lost, the whole preceding three-hour conversation has already been recorded several times over, for reliability. The last record holds the actual data. The milestone interval can probably be changed on the equipment, but for now the series looks like this: [1800, 3600, 5400, 7200, 9000, over 9000..]. It is also worth noting that a milestone record differs from the final record in several "unimportant" fields; it may be worth using that later to filter the garbage out more precisely. For now I decided simply to discard all records whose duration belongs to this series. In theory some small percentage of normal calls will be lost this way, but for that a person has to talk for an exact multiple of half an hour, to the second. The probability of that is very, very small, and our volumes are not so large that the law of large numbers would affect the sample. We gave the phenomenon a creative name: the "1800 seconds problem".
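A sketch of the check this implies (hypothetical code, not from the original programs; the real filter later in the post uses an explicit list of durations): a record is treated as a milestone when its duration field is a positive multiple of 1800 seconds.

isMilestoneDuration :: Int -> Bool
isMilestoneDuration secs = secs > 0 && secs `mod` 1800 == 0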
Briefly about the programs
In total I wrote four programs that merged the necessary files and filtered out the useful information. At first there were parser and merger, two programs written separately, which were then combined into a third, also called merger. All of them worked extremely slowly and consumed a lot of memory, although they did the job. Processing one month of data (3000 files of 240 lines = 720,000 lines) took at least 10 minutes and around 500 MB of memory (or even more, because swapping kicked in). A very, very scary result. And although the task only had to be done once a month, my colleague looked at Haskell with disdain. In truth, Haskell had nothing to do with it: in these programs I made a number of mistakes typical of a novice functional programmer, and the crooked algorithms performed accordingly. But they worked! More than that: the programs (the largest of them is only about 150 useful lines) could be configured from the command line. Here is what they could do:
1. Operation without parameters: the default fields are taken, and the files of the previous month are processed.
2. Operation with parameters (a sketch of parsing these arguments follows the list):
   - Fields [<list of field indices>] - which fields to take (e.g. parser Fields [1, 24, 55]);
   - (yyyy, mm) - which month to process (e.g. merger (2011, 5));
   - -W - do not close the console window after finishing (from the word "wait").
3. Three files were produced:
   - yyyy.mm.txt - all the raw traffic files for the month merged into one;
   - processed.txt - a file with only the required fields for the month;
   - yyyy.mm.txt.log - a log file listing the raw input files plus summary information (number of lines, number of files, date range).
4. The programs print statistics and samples of the processed traffic.
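Arguments like these are simple enough to recognize with plain reads. A minimal sketch (hypothetical, not the original option parser; it assumes the words of a single option have already been joined into one string, e.g. with unwords, the way the old merger joined its date range arguments):

import Data.List (stripPrefix)

data Opt = OptFields [Int] | OptMonth (Int, Int) | OptWait
  deriving Show

parseOpt :: String -> Maybe Opt
parseOpt "-W" = Just OptWait
parseOpt arg =
  case stripPrefix "Fields" arg of
    Just rest | [(is, "")] <- reads rest -> Just (OptFields is)
    _ -> case reads arg of
           [((y, m), "")] -> Just (OptMonth (y, m))
           _              -> Nothing

Here parseOpt "Fields [1, 24, 55]" gives Just (OptFields [1,24,55]) and parseOpt "(2011, 5)" gives Just (OptMonth (2011,5)).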
We used it as it was a couple of times, but then, of course, I rewrote the program from scratch. The old code was full of crooked constructs, reinvented wheels, silly algorithms and strange decisions. As a result the fourth program, NgnParser, with the same functionality and on the same data set, runs not in 10 minutes but in 10 seconds, consuming only 10 MB of memory. That is almost two orders of magnitude in speed and at least one in memory! What could slow a program down that badly? I suspect there are people who have stepped on the same rake as I did and who believed in the slowness of the language more than in their own crooked hands; it is not for nothing that the Internet is full of such complaints, justified and not... Haskell is a wonderful language. Writing these programs was easy for me: I spent no more than two working days on each, and every time I got a lot of pleasure out of it. I cannot imagine how much torment the same thing would have been in C.
First program
I started the first program with something simple. For the time being I merged the necessary files into one (merged.txt) with command line tools, and the parser's job was to parse the traffic and keep only the records with the R200 identifier. For processing large amounts of text it is better to use the specialized ByteString type, but it is not as convenient to work with as the ordinary String. Where ByteString is an optimized string implementation, String is just a list of characters, that is, [Char]. String is very convenient by its nature, but on big data the performance of any list drops sharply. In the first versions of the program it was String, together with a few silly decisions, that caused the heavy slowdown and the large memory consumption. At the time, though, I cared about development speed, not performance, and I wrote the prototype very quickly. Here is what it looked like (revision 2):
import Data.List (intercalate)

replaceChars :: Char -> Char -> Char -> Char
replaceChars whatC withC c = if c == whatC then withC else c

interestFields :: [String] -> [Int] -> [String]
interestFields s takeWhat = undefined  -- stub

isR200 :: [String] -> Bool
isR200 s = (head s) == "R200"

processLine :: String -> String
processLine s = if isR200 sInWords then unwords (interestFields sInWords [1,2,3]) else []  -- [1,2,3] - test fields
  where sInWords = words (map (replaceChars '|' ' ') s)

processString :: String -> [String]
processString s = map processLine (lines $ s)

main :: IO ()
main = do
  str <- readFile "merged.txt"
  putStrLn (intercalate "\r\n" (processString $ str))
The strange replaceChars function with the type Char -> Char -> Char -> Char catches the eye immediately. The idea was this: take a record line, replace the vertical bar '|' with a space, and then break the string into words with the words function:

sInWords = words (map (replaceChars '|' ' ') s)
As a result of executing sInWords, we get the following conversion:
"| R200 | 99999 | 111111 | CR, CS, AM | 1 | 1 | 3022 | 222222 | 333333 ||| 2011-06-23 11: 33: 58 |" ->
"R200 99999 111111 CR, CS, AM 1 1 3022 222222 333333 2011-06-23 11:33:58" ->
["R200", "99999", "111111", "CR, CS, AM "," 1 "," 1 "," 3022 "," 222222 "," 333333 "," 2011-06-23 "," 11:33:58 "]
Unfortunately, the date-time field gets split into two separate fields as well. To avoid this I later complicated the construction even further, first replacing the space in that field with an asterisk and then putting it back. As a result a record line went through even more conversions (a sketch of the whole pipeline follows the list):
1. Replace ' ' -> '*';
2. Replace '|' -> ' ';
3. Break the line into words with the words function;
4. Process the list of fields, taking the fields with the desired indices;
5. Merge the fields back with the unwords function;
6. Replace ' ' -> '|';
7. Replace '*' -> ' '.
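Put together, that early pipeline looked roughly like this (a sketch; takeByIndices stands in for the real index-picking helper, which the old code called interestFields / takeInterest):

processLineOld :: [Int] -> String -> String
processLineOld idxs =
      map (replaceChars '*' ' ')     -- 7. '*' -> ' '  (restore the space inside the date-time field)
    . map (replaceChars ' ' '|')     -- 6. ' ' -> '|'  (turn the word separators back into bars)
    . unwords                        -- 5. glue the selected fields back together
    . takeByIndices                  -- 4. keep only the fields with the wanted indices
    . words                          -- 3. break the line into "words" (fields)
    . map (replaceChars '|' ' ')     -- 2. '|' -> ' '
    . map (replaceChars ' ' '*')     -- 1. ' ' -> '*'  (protect the space inside the date-time field)
  where
    replaceChars whatC withC c = if c == whatC then withC else c
    takeByIndices ss = map (ss !!) idxs   -- naive, unsafe indexing, just like the old code

Seven passes over every record line, for what is essentially a single split.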
On top of these terrible transformations, the number of resulting fields did not match the number of source fields, because empty fields disappeared entirely (see the conversion example above). It is lucky that in all the records the empty and filled fields sat in the same places, otherwise I would have gotten unpleasant artifacts. The code, as you can see, is not only redundant but also ugly. I will not even try to estimate its asymptotic complexity; I suspect it was far beyond O(n^2). Moreover, closer to revision 12 I realized that something had to be done about the fields that get lost because of double vertical bars, and added yet another conversion:
-- So that empty fields written as "||" are processed correctly, a space is inserted between the bars.
refieldDoubles :: String -> String
refieldDoubles [] = []
refieldDoubles ('|':[]) = "|"
refieldDoubles ('|':'|':ss) = "| |" ++ (refieldDoubles ('|':ss))
refieldDoubles (s:[]) = [s]
refieldDoubles (s:ss) = s : (refieldDoubles ss)
That is, yet another full pass over every line! Honestly, monkey work. All that was really needed was very little: use the split function from the Data.String.Utils module instead, or write an equivalent by hand. Then a single pass over the record line would give the correct breakdown into fields:
split "|" "| R200 | 99999 | 111111 | CR, CS, AM | 1 | 1 | 3022 | 222222 | 333333 ||| 2011-06-23 11: 33: 58 |" ->
["", "R200", "99999", "111111", "CR, CS, AM", "1", "1", "3022", "222222", "333333", "", " "," 2011-06-23 11:33:58 "," "]
What can I say... *facepalm* No words, only emotions.
What improvements are possible
Experienced Haskellers will already have noticed that the code does not use point-free style (I had no feel for it at the time), and there are almost no case expressions, pattern matches or guards. As a result even such simple code is hard to read in places. Here is how a few of the functions are better written:
replaceSymbols s = map (replaceChar '|' ' ') (map (replaceChar ' ' '*') s)
-- ->
replaceSymbols = map (replaceChar '|' ' ' . replaceChar ' ' '*')

isR200 s = (head s) == "R200"
-- ->
isR200 ("R200":_) = True
isR200 _          = False

replaceChars whatC withC c = if c == whatC then withC else c
-- ->
replaceChars whatC withC c | c == whatC = withC
                           | otherwise  = c

processLine s = if isR200 sInWords then unwords (interestFields sInWords [1,2,3]) else []
  where sInWords = words (map (replaceChars '|' ' ') s)
-- ->
processLine s | isR200 sInWords = unwords (interestFields sInWords [1,2,3])
              | otherwise       = []
  where sInWords = words . map (replaceChars '|' ' ') $ s

processString s = map processLine (lines $ s)
-- ->
processString = map processLine . lines
The intercalate "\r\n" call should simply have been replaced with unlines. It would have been shorter and clearer, and in tests unlines also showed better performance, by at least 30%:
ItemsCnt   testUnlines (ns)   testIntercalate (ns)   Percent
10         23.84              34.05                  29.9
100        22.70              34.62                  34.4
1000       23.28              35.48                  34.3
10000      22.17              35.48                  37.5
50000      22.13              33.26                  33.4
100000     21.06              35.47                  40.6
200000     22.70              34.05                  33.3
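The post does not say how these numbers were measured; a comparison like this could be sketched with the criterion library along the following lines (hypothetical benchmark code, with made-up test data):

import Criterion.Main (bench, bgroup, defaultMain, nf)
import Data.List (intercalate)

main :: IO ()
main = defaultMain
  [ bgroup "join lines"
      [ bench "unlines"     $ nf unlines              testLines
      , bench "intercalate" $ nf (intercalate "\r\n") testLines
      ]
  ]
  where
    testLines = replicate 1000 "R200|3022|222222|333333"  -- stand-in data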
But, still green, I did not know the standard functions even from the Prelude, which is why I kept reinventing wheels. Although even now I do not quite see how to pick elements with the desired indices out of a list with minimal effort. Compare the code from the old program and the new one:
-- Old code:
-- Argument 1 - list of raw fields
-- Argument 2 - list of desired indices
takeInterest :: [String] -> [Int] -> [String]
takeInterest _ [] = []
takeInterest ss (n:ns) = [ss !! n] ++ takeInterest ss ns

-- New code:
-- Argument 1 - accumulator (the current index)
-- Argument 2 - list of required indices
-- Argument 3 - list of raw fields
collectFields :: Int -> [Int] -> [String] -> [String]
collectFields _ _ [] = []
collectFields idx fis (s:ss) | idx `elem` fis = s : collectFields (idx + 1) fis ss
collectFields idx fis (s:ss) | otherwise      = collectFields (idx + 1) fis ss
In the first case we iterate over the list of indices and pull out each element with the unsafe (!!) operator. In the second we carry an index accumulator while walking the list of fields, and if the current index is found in the index list we keep the current field. Both are plain recursive traversals. According to the tests, though, the new code is about 40% slower; perhaps it is worth bringing the old code back to speed the new program up even more.
ItemsCnt   takeInterest (ns)   collectFields (ns)   Percent
10         17.33               36.84                52.9
100        20.58               36.84                44.1
1000       21.67               37.92                42.8
10000      21.13               36.84                42.6
50000      21.67               37.92                42.8
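For what it is worth, one more idiomatic way to pick fields by index (a sketch, not code from either program) is to pair every field with its position and filter:

selectByIndices :: [Int] -> [String] -> [String]
selectByIndices wanted fields = [ f | (i, f) <- zip [0 ..] fields, i `elem` wanted ]

It makes the intent obvious, although for long index lists a Data.IntSet lookup would scale better than elem.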
Moving on
The next program, merger, was supposed to combine all the text files of the previous month into one. (I did not want to do that by hand every time, and laziness, as we know, is the engine of progress.) The question arose: how do we know which month was the previous one? Simple, in fact: take the current date and subtract one month. The program was meant to be run only in the first days of a month, so no problems were expected there. The current date, and date-time operations in general, live in the Time module; operations on the file system structure live in System.Directory. Both work mostly in the IO monad, and at that time I did not yet know how to tidy up monadic code, so it ended up looking creepy (merger, revision 14):
main :: IO ()
main = do
  args <- getArgs
  curDir <- getCurrentDirectory
  dirContents <- getDirectoryContents curDir
  curTime <- T.getClockTime
  monthAgoTime <- return $ T.addToClockTime (T.TimeDiff 0 (-1) 0 0 0 0 0) curTime
  calendarMonthAgoTime <- T.toCalendarTime monthAgoTime
  let maybeDateRange = case args of
        (a:b:_) -> readDateRange (unwords [a, b])
        _       -> Just $ defaultDateRange calendarMonthAgoTime
  case maybeDateRange of
    Just dr -> do
      let fsToMerge = filesToMerge dirContents dr
      let fsToMergeCountStr = show $ length fsToMerge
      let mergeLog = newFileName dr ++ ".log"
      let dateRangeMsg = "DateRange: " ++ show dr
      fsContents <- merge fsToMerge
      writeFile (newFileName dr) (unlines fsContents)
      writeFile mergeLog (unlines fsToMerge ++ printf "\n%s\nTotal files: %s" dateRangeMsg fsToMergeCountStr)
      putStrLn (unlines fsContents)
      putStrLn dateRangeMsg
      --putStrLn ("Files to merge: " ++ unlines fsToMerge)
      putStrLn (printf "Count of files: %s. See %s for file list." fsToMergeCountStr mergeLog)
    Nothing -> putStrLn ("Invalid date range.")
I do not even recommend trying to work out what this code does... But it was here that the seed of a future huge mistake crept in, the one that made the last version of the old program devour memory. Consider the merge function called from this code:
merge :: [String] -> IO [String]
merge fsToMerge = mapM readFile fsToMerge
It takes the list of files to merge, reads them, and returns the list of their contents. And then there are these two lines:
do
...
fsContents <- merge fsToMerge
writeFile (newFileName dr) (unlines fsContents)
...
The key point is unlines fsContents: the raw contents of all the files were glued together in one go and stuffed into the result file. Later, when parser and merger were combined, it was this huge blob of data that got handed to the parser part, where, as you remember, there is a whole pile of replacements, extra passes and other overhead. Here is what that piece of code looks like in the old program assembled from parser and merger:
do
...
fsContents <- readFilesToMerge fsToMerge
let mergedContents = unlines fsContents
writeFile (newFileName dr) mergedContents
let processedContentStr = unlines $ processData nedeedFields mergedContents
...
And that is not just *facepalm* bad, it is a gross violation of the dataflow idea. It should have been like this:
Scheme 1

  A1 -> F(A1) -> B1 -> G(B1) -> C1 -> H(C1) -> RESULT 1 -> SAVE
  A2 -> F(A2) -> B2 -> G(B2) -> C2 -> H(C2) -> RESULT 2 -> SAVE
  A3 -> F(A3) -> B3 -> G(B3) -> C3 -> H(C3) -> RESULT 3 -> SAVE
  ... -> ...  -> ... -> ...  -> ... -> ...  -> RESULT n -> SAVE
And it turned out like this:
Scheme 2

  A -> F(A) -> B -> G(B) -> C -> H(C) -> RESULT -> SAVE
I felt the difference between processing the data piece by piece and processing everything at once immediately. Now I know: do not grab the whole volume of work in one go; it is better to build a mechanism that hands out the data in batches, which is both easier on memory and faster. And, on top of everything else, a program built according to the second scheme cannot be parallelized. A horror, in short...

Later I tried to optimize the old program and replaced String with ByteString almost everywhere. The gain, of course, was small: no string type can rescue a fundamentally broken design.
Needless to say, the new program, NgnTraffic, was written with a decent stock of knowledge already in hand. A lot was taken into account in it. The program is split into modules: Main, Constants, DataProcess, FileListProces, Options, Tools, Types. Naturally, scheme 1 is used. Instead of String, the ByteString type (the strict variant) was chosen from the start. The code is tidier, with point-free style where it helps. And, most importantly, the principle of operation changed. First the list of files to process is collected; this part is similar to the old program. Then, however, the files are not read all at once into one large variable: each is read separately, its contents are processed immediately, the lines are parsed, the records are filtered, the needed fields are extracted, and the result is appended to the output file. So we get the cycle "read from disk (F(An)) - process the lines (G(Bn)) - write the result to disk (H(Cn))", repeated as many times as there are files. Plus, of course, there are no more character-by-character replacements; there is simply the split function from Data.ByteString.Char8, which breaks a record line into fields in one pass and by itself wins back a huge share of the performance. Here are the functions implementing scheme 1:
process' :: ResFilePath -> FieldIndexes -> FilePath -> IO ()
process' resFile fis targetFile = do
  fileContents <- C.readFile targetFile
  let processResult = processData fis predicates fileContents
  C.appendFile resFile processResult

process :: ResFilePath -> [FilePath] -> FieldIndexes -> IO String
process _ [] _ = return "No files to process."
process resFile fs fieldIndexes = do
  C.writeFile resFile C.empty
  mapM_ (process' resFile fieldIndexes) fs
  return "All ok."
Here process takes the name of the result file, the list of files to process, and the indices of the required fields. The indices can come from the command line, which is why they are passed to this function as an argument. process takes each file name in turn and applies process' to it (the line mapM_ (process' resFile fieldIndexes) fs); process' does the main work on a file, and processData takes care of processing its contents. Interestingly, besides the R200 filtering, the new program got a predicate mechanism: any record can be checked against the available predicates, and the predicates themselves are easy to extend. So far there are two: "the field with index x is in a given list of values" and "is not in it". This is what the predicate type looks like:
data Predicate = NotInList [C.ByteString]
               | InList    [C.ByteString]

type PredicateMap = [(FieldIndex, Predicate)]
And these are the predicates defined as constants:
predicates :: PredicateMap
predicates = [ (1,  InList [C.pack "R200"])
             , (36, NotInList (map C.pack ["1800", "3600", "5400", "7200", "9000",
                                           "10800", "12600", "14400", "16200", "18000",
                                           "19800", "21600", "23400"]))
             ]
It is easy to guess that only those records whose field 1 equals "R200" and whose field 36 is not in the "1800 seconds problem" list satisfy the predicates. The filtering happens in the checkPredicate and examineFields functions:
checkPredicate :: Predicate -> C.ByteString -> Bool
checkPredicate (NotInList l) str = (not . elem str) l
checkPredicate (InList l)    str = elem str l

examineFields :: Int -> PredicateMap -> Fields -> Bool
examineFields _ _ [] = True
examineFields idx preds (s:ss) = case L.lookup idx preds of
  Just pred -> (checkPredicate pred s) && (examineFields (idx + 1) preds ss)
  Nothing   -> examineFields (idx + 1) preds ss
Now I can see that examineFields could have been written with a fold, but that is a minor detail.
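For what it is worth, a foldr version might look roughly like this (a sketch reusing the checkPredicate, Fields and PredicateMap definitions above, plus the qualified Data.List import L used there):

examineFields' :: Int -> PredicateMap -> Fields -> Bool
examineFields' idx0 preds fields = foldr check True (zip [idx0 ..] fields)
  where
    check (idx, s) acc =
      case L.lookup idx preds of
        Just p  -> checkPredicate p s && acc
        Nothing -> acc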
All in all, I am pleased with the work done. It is a rare thing, but I even like the NgnTraffic code. And I especially like the progress that becomes visible when you compare the same program written at different times with different knowledge. What pleases me even more is that this is a real Haskell program used in real production.

Whatever anyone says, Haskell is a terrific language, the best I have ever worked with.
Parser sources by revision: [2], [6], [12]
Merger sources (including the combined merger program): [13], [14], [16], [23]
Sources of NgnTraffic
P.S. The translation of the next part of Yet Another Monad Tutorial is also coming soon.
P.P.S. Fixed the links to the programs, thanks to comrade steck.
P.P.P.S. Fixed a couple of errors, thanks to comrades goder and jack128.