Command line pipelines
I invite knowledgeable people to share their ways of building pipelines on Unix systems. Maybe a small reference will come out of it :-)
I will start with a few of the most basic combinations, useful for processing web server logs.
Inventory
tail - as everyone knows, prints a specified number of lines from the end of a file; with the -f switch it also prints new lines in real time as they are appended to the file
cat - just one way to print the contents of a file or of several files
cut - selects the fragment of each line with a given index, where the line is split into fragments by a specified delimiter character
sort - sorts lines into the required order according to the given rules
uniq - removes adjacent duplicate lines; with the -c switch it prefixes each line with the number of repetitions
egrep - selects lines from a file or stream according to various conditions, including regular expressions
xargs - man xargs (in short: it builds command lines from its standard input and runs them)
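To make the inventory a little more concrete, here is a minimal sketch that runs sort, uniq and cut on a few invented lines (the sample data and variable names are made up for illustration). It also shows why sort usually comes before uniq: uniq only collapses duplicates that are next to each other.

```shell
# Made-up sample data: one "name:value" pair per line.
sample='b:2
a:1
b:2
a:1
b:2'

# Without sort, uniq removes nothing here: no two equal lines are adjacent.
unsorted=$(printf '%s\n' "$sample" | uniq | wc -l | tr -d ' ')

# With sort first, equal lines become neighbours and uniq -c can count them.
counted=$(printf '%s\n' "$sample" | sort | uniq -c)

# cut splits each line on ':' and keeps only the first fragment.
names=$(printf '%s\n' "$sample" | cut -d ':' -f1 | sort | uniq)

echo "lines left after plain uniq: $unsorted"
echo "$counted"
echo "$names"
```

The exact width of the count column printed by uniq -c differs between implementations, so it is safer not to rely on it.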
What can be done with this
The examples assume an ordinary web server access_log in the standard (combined) format:
10.10.0.1 - - [11/Aug/2008:02:40:15 +0400] "GET / HTTP/1.0" 403 529 "http://referring.site.com" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; Hotbar 4.3.1.0)"
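Since the examples rely on cut with a space as the delimiter, it helps to see which field number holds what. The sketch below uses a cleaned-up copy of the sample line; the variable names are only for illustration.

```shell
# A cleaned-up copy of the sample log line, stored in a variable for experiments.
line='10.10.0.1 - - [11/Aug/2008:02:40:15 +0400] "GET / HTTP/1.0" 403 529 "http://referring.site.com" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; Hotbar 4.3.1.0)"'

# Splitting on single spaces, fields are numbered from 1.
client=$(printf '%s\n' "$line" | cut -d ' ' -f1)   # client address
method=$(printf '%s\n' "$line" | cut -d ' ' -f6)   # request method, still with the opening quote
url=$(printf '%s\n' "$line" | cut -d ' ' -f7)      # requested document
status=$(printf '%s\n' "$line" | cut -d ' ' -f9)   # HTTP status code

echo "client=$client method=$method url=$url status=$status"
```

Note that the date takes up two fields (4 and 5) because of the space before the time zone, which is why the requested document ends up in field 7.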
To watch what is happening on the site, whether there are any requests and what they are:
tail -f access_log
View only requested documents:
tail -f access_log | cut -d ' ' -f7
View all requested unique documents:
cat access_log | cut -d ' ' -f7 | sort | uniq
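As a side note, sort has a -u switch that does the job of sort | uniq in one step. A small sketch on an invented three-line log (make_log is a hypothetical helper that stands in for the real access_log):

```shell
# Hypothetical miniature log; field 7 is the requested document.
make_log() {
  echo '10.0.0.1 - - [11/Aug/2008:02:40:15 +0400] "GET /index.html HTTP/1.0" 200 100'
  echo '10.0.0.2 - - [11/Aug/2008:02:40:16 +0400] "GET /about.html HTTP/1.0" 200 100'
  echo '10.0.0.3 - - [11/Aug/2008:02:40:17 +0400] "GET /index.html HTTP/1.0" 200 100'
}

# Two ways to get the list of unique documents.
with_uniq=$(make_log | cut -d ' ' -f7 | sort | uniq)
with_sort_u=$(make_log | cut -d ' ' -f7 | sort -u)

echo "$with_sort_u"
```

The longer form is still worth knowing, because only uniq can add the repetition counts with -c.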
If you want to quickly find out who made how many requests, sorted in descending order:
cat access_log | cut -d ' ' -f1 | sort | uniq -c | sort -rn
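On a busy site the full list is long, so a common variation keeps only the top few addresses. The sketch below uses sort -rn, so that the counts are compared as numbers, and head to cut the list; the miniature log is generated by a hypothetical make_log helper instead of being read from a real access_log.

```shell
# Hypothetical miniature log, generated on the fly: 10, 9 and 1 requests
# from three invented addresses. Only the first field matters here.
make_log() {
  i=0
  while [ "$i" -lt 10 ]; do echo '10.0.0.1 - - [..] "GET / HTTP/1.0" 200 1'; i=$((i + 1)); done
  i=0
  while [ "$i" -lt 9 ]; do echo '10.0.0.2 - - [..] "GET / HTTP/1.0" 200 1'; i=$((i + 1)); done
  echo '10.0.0.3 - - [..] "GET / HTTP/1.0" 200 1'
}

# Count requests per address, sort by count (numeric, descending), keep the top 2.
top=$(make_log | cut -d ' ' -f1 | sort | uniq -c | sort -rn | head -n 2)

echo "$top"
```

With a real log, make_log would simply be replaced by cat access_log, or the file name could be passed straight to cut.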
A funny example from real practice. A small botnet is attacking the site, and it makes a mistake in the requested URL: there are two slashes at the end of the URL instead of one, for example '/rss/tag/CSS//'. With a small set of console programs it is quite trivial to block its access (this is by no means the most effective way, since each bot will still get to make at least one request). It is assumed that the ipfw firewall has a table 1 and that every address in it is denied access to the resource. So:
cat ./access_log | egrep 'GET [^ ]+// ' | cut -d ' ' -f1 | sort -u | xargs -n 1 ipfw table 1 add
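Before feeding anything to the firewall, it is worth doing a dry run. A minimal sketch, assuming the same log layout: putting echo in front of the command makes xargs print what it would execute instead of executing it. The log lines and the make_log helper are invented for illustration, and grep -E is used as the equivalent of egrep.

```shell
# Hypothetical miniature log: two bots (double slash) and one normal visitor.
make_log() {
  echo '10.0.0.5 - - [11/Aug/2008:02:40:15 +0400] "GET /rss/tag/CSS// HTTP/1.0" 404 0'
  echo '10.0.0.6 - - [11/Aug/2008:02:40:16 +0400] "GET /rss/tag/CSS// HTTP/1.0" 404 0'
  echo '10.0.0.5 - - [11/Aug/2008:02:40:17 +0400] "GET /rss/tag/CSS// HTTP/1.0" 404 0'
  echo '10.0.0.7 - - [11/Aug/2008:02:40:18 +0400] "GET /rss/tag/CSS/ HTTP/1.0" 200 512'
}

# Dry run: echo prints each command instead of executing it.
# sort -u drops repeated addresses; xargs -n 1 passes one address per command.
commands=$(make_log | grep -E 'GET [^ ]+// ' | cut -d ' ' -f1 | sort -u | xargs -n 1 echo ipfw table 1 add)

echo "$commands"
```

Once the printed commands look right, removing echo (and running as root on a machine that actually has ipfw) applies them for real.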