"Regular expressions" or "Just about ugly"
"Regular expressions" or "Just about ugly"
I'll start by explaining what prompted me to write this article. An article on regular expressions was published on Habr a little earlier, and regular readers have probably already seen it. Honestly, I did not like it: what it offered as a usage example was rather complicated, and it came with a suggestion to buy a 600+ page book, which, it seems to me, will only scare away people who could be using regular expressions.
I deliberately will not look into any manuals or stuff you with information that I do not remember myself. I am sure that what I know and use myself is enough to get you interested and started.
All the examples use grep, which is available for both Windows and Linux (where it is, of course, more at home).
I work under Linux myself, so I will show the examples there. In some cases I will use "|" (a pipe), which feeds the output of one command to the input of the next, but if you prefer you can do without it by saving the intermediate result in a temporary file. I will stick to standard regular expression syntax, so the examples should work wherever regular expressions are fully supported.
A note for those who know nothing about grep: do not be put off by the unfamiliar tool. It is used only as an example; regular expressions have a standardized syntax that is supported by many text editors.
To whet your appetite: have you ever had to write a short parser to pull certain words or phrases out of a file because an ordinary search was no longer enough?
1. Suppose we have a text file containing some logins, with the following contents:
..... many words
login="Figaro"
..... many letters
login="Tolik"
........
We need to get a list of the logins. What should the user do in this case, especially if the file weighs ... MB or even GB? Or have you never faced such a task? We will learn how to do this.
2. Next task: say we have the source code of a C++ project and we need to see which classes (leaving template ones aside for convenience) and structures it has, given that classes and structures can be declared like this (without the apostrophes, of course):
'class Foo'
or like this
'class Bar'
or like this
' class Any'
' struct A'
We will learn to do this too.
3. Finally, there is a file in which every line has the following format:
user="Login" Passwd="anypassword"
and we need to find in it:
3.1 users whose password consists only of digits,
3.2 users whose password consists only of small letters,
3.3 users whose password consists only of capital letters,
3.4 users whose password is shorter than 5 characters and consists only of small or only of capital letters,
3.5 users whose login looks like an IP address; these are very strange users, rather like bots :)) (for example 243.11.22.03 or 243-11-22-03 or 243_11_22_03).
Those who are already familiar with regular expressions can solve these problems as a warm-up and compare their results with the ones we get below.
I think this is enough for now.
And now the grammar of regular expressions. Alas, we cannot do without it, but a short list is enough to get started. The descriptions are my own, so they may differ noticeably from the canonical ones:
Symbols:
. - dot: an arbitrary character. If you do not know whether there will be a letter, a digit or some other symbol, write a dot
\w - a word character, that is, a letter (most engines also count digits and the underscore); unfortunately, few tools treat Russian letters as such, though KDevelop 3.5.10, for example, does
\d - a digit
\s - a whitespace character, i.e. a space or a tab
\l - a lowercase letter; not supported by my version of grep
\u - an uppercase letter; not supported by my version of grep
[abcde] - any one of the characters listed in the set
^ - start of line
$ - end of line
() - parentheses are used to group expressions
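The symbols listed above can be tried out without grep. Below is a minimal sketch in Python's `re` module, which I assume handles these particular symbols the same way as `grep -P` for plain ASCII text; the helper name `matches` and the sample strings are made up for illustration.

```python
import re

def matches(pattern, text):
    """Return every non-overlapping match of `pattern` in `text`, left to right."""
    return re.findall(pattern, text)

# (pattern, sample text) pairs, one per symbol from the list
samples = [
    (r".", "a1 "),           # any character
    (r"\w", "a1 _-"),        # word characters: letters, digits, underscore
    (r"\d", "a1 b22"),       # digits
    (r"\s", "a b\tc"),       # space or tab
    (r"[abcde]", "badge"),   # one character from the set
    (r"^\w", "first word"),  # word character at the start of the line
    (r"\w$", "last word"),   # word character at the end of the line
]

for pattern, text in samples:
    print(f"{pattern!r:12} on {text!r:14} -> {matches(pattern, text)}")
```

Note that `\w` picks up the digit and the underscore as well as the letter, which is why it is safer to think of it as a "word character" than as a "letter".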
Repetitions:
Written as "(expression)(repetition)", without the parentheses; it will become clearer below.
* - the expression may repeat any number of times starting from 0, i.e. it may be absent
+ - the expression may repeat 1 or more times, i.e. it must be present
{3,6} - the universal way to specify repetitions, in this case from 3 to 6 repetitions
{3,} - three or more repetitions
? - the expression may occur 0 or 1 times; can also be written as {0,1}
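To get a feel for the repetition marks, here is a small sketch, again in Python's `re` as a stand-in for grep. The helper `accepts` is a hypothetical name; it uses `re.fullmatch`, so the whole string has to fit the pattern, not just a part of it.

```python
import re

def accepts(pattern, text):
    """True when the whole of `text` fits `pattern`."""
    return re.fullmatch(pattern, text) is not None

# (pattern, text) pairs chosen to show where each repetition mark says yes or no
checks = [
    (r"ab*", "a"),            # * allows zero repetitions of b
    (r"ab+", "a"),            # + needs at least one b
    (r"ab+", "abbb"),
    (r"\d{3,6}", "12"),       # too short for {3,6}
    (r"\d{3,6}", "123"),
    (r"\d{3,6}", "1234567"),  # too long for {3,6}
    (r"\d{3,}", "1234567"),   # {3,} has no upper limit
    (r"colou?r", "color"),    # ? makes the u optional
    (r"colou{0,1}r", "colour"),
]

for pattern, text in checks:
    print(f"{pattern!r:14} {text!r:10} {accepts(pattern, text)}")
```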
Some characters must be escaped in a regular expression when you mean the character itself rather than its special meaning:
\?
\+
\|
\(
\)
\.
\[
\]
\-
\~
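Why escaping matters is easiest to see on the dot. The sketch below uses Python's `re`; the sample text is invented, and `re.escape` is a Python convenience that adds the backslashes for you (grep has no such helper, there you type them by hand).

```python
import re

text = "version 1.5 or 1x5?"

# Unescaped dot: "any character", so both 1.5 and 1x5 are found
loose = re.findall(r"1.5", text)

# Escaped dot: only a literal period fits
strict = re.findall(r"1\.5", text)

# re.escape builds a pattern that matches its argument literally
literal_pattern = re.escape("1x5?")
literal = re.findall(literal_pattern, text)

print(loose)            # ['1.5', '1x5']
print(strict)           # ['1.5']
print(literal_pattern)  # 1x5\?
print(literal)          # ['1x5?']
```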
In fact the syntax of regular expressions is larger, but this is the core of it. Still, for general development you can get acquainted with the other features in order to write more concise and less understandable expressions :))
A little about grep: it is a console utility that takes its parameters from the command line (GUI versions of the tool also exist). The keys:
-P treat the pattern as a Perl-compatible regular expression, the syntax used in this article
-o print only the part that matched the regular expression, not the whole line
-r descend recursively into subdirectories
* process all files (unless --include is given)
--include "*.h" process only files with the .h extension
Let's move from words to deeds.
Task 1
So, let's start solving our problems. Recall the first condition:
Suppose we have a text file logins.txt containing some logins, with the following contents:
..... many words
login="Figaro"
..... many letters
login="Tolik"
........
and what we need is the list of logins.
The text of the logins.txt file I tested on:
..... many words
login="Figaro"
..... many letters
login="Tolik" login="Petya"

We have an example of the expression we are looking for:
login="Figaro"
Pick out its changing part: it is the word written in quotation marks. Looking at the syntax, the character to use is "\w" and the repetition is "+", so to find all the pairs we get an expression of the form:
login="\w+"
That is, given the logins.txt file, the command that produces the list of login="..." pairs looks like this:
grep -Po 'login="\w+"' logins.txt
But the condition says we need a list of logins, not of pairs. Look at a pair such as login="Figaro": to extract only the login from it, we obviously need to pick up the part in quotation marks. That regular expression is a part of the one we already had:
"\w+"
Then, to pull the list of logins out of the previous result, we get the following construction:
grep -Po 'login="\w+"' logins.txt | grep -Po '"\w+"'
(or use an intermediate file instead of '|' and apply the second command to that file)
That is the first task done: we get the list of logins. You can strip the quotes if you wish, but I leave that to you.
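For those without grep at hand, the same two-step idea can be sketched in Python's `re`. This is only an illustration under my own assumptions: the text is held in a string instead of a file, and the function names are made up. The second function shows one possible way to drop the quotes, using the parentheses from the grammar list as a capturing group (with a single group, `re.findall` returns what the group matched).

```python
import re

# Contents of the test file from Task 1, kept in a string for the sketch
LOGINS_TXT = '''..... many words
login="Figaro"
..... many letters
login="Tolik" login="Petya"
'''

def extract_logins_two_steps(text):
    """Mimic the two chained grep calls: first the pairs, then the quoted part."""
    pairs = re.findall(r'login="\w+"', text)
    return [re.search(r'"\w+"', pair).group() for pair in pairs]

def extract_logins(text):
    """One pass: the parentheses capture the login without its quotes."""
    return re.findall(r'login="(\w+)"', text)

print(extract_logins_two_steps(LOGINS_TXT))  # ['"Figaro"', '"Tolik"', '"Petya"']
print(extract_logins(LOGINS_TXT))            # ['Figaro', 'Tolik', 'Petya']
```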
Task 2
Recall the condition:
Suppose we have the sources of some C++ project and we need to see which classes (leaving template ones aside for convenience) and structures this project has, given that classes and structures can be declared like this (without the apostrophes, of course):
'class Foo'
or like this
'class Bar'
or like this
' class Any'
' struct A'
The text of the class.h file I tested on:
class Foo
or like this
class Bar
or like this
class Any
struct A

So, what do we have? Let's assume that a class declaration starts at the beginning of a line (possibly after some whitespace), otherwise the expression gets too monstrous for me. Take an example declaration line:
' class Any'
and try to turn it into a template that fits all the other declarations. Generalizing step by step:
the beginning of the line => '^'
then a space or tab => '\s', which may or may not be there => '*'
then the word class => 'class', or => '|', the word struct => 'struct'
then a space => '\s', at least one => '+'
then the name => '\w', which must contain at least one letter => '+'
Put together:
^\s*(class|struct)\s+\w+
So the command that finds the declarations is:
grep -Pr "^\s*(class|struct)\s+\w+" class.h
or, with an asterisk, to apply it to all files:
grep -Pr "^\s*(class|struct)\s+\w+" *
Done, that is the second task sorted out.
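The same expression can be checked in Python's `re`. This is a sketch under assumptions of my own: the header text is an invented sample with varied indentation, and `find_declarations` is a hypothetical helper. grep works line by line, so in Python the `re.MULTILINE` flag is needed to make `^` match at the start of every line rather than only at the start of the text.

```python
import re

# Invented header text; the last two lines must NOT be reported
CLASS_H = (
    "class Foo\n"
    "\tclass Bar {\n"
    "    class Any\n"
    "  struct A;\n"
    "int classify(int x);\n"
    "// class Hidden\n"
)

DECLARATION = re.compile(r"^\s*(class|struct)\s+\w+", re.MULTILINE)

def find_declarations(source):
    """Return each matched declaration with the leading whitespace stripped."""
    return [m.group().strip() for m in DECLARATION.finditer(source)]

print(find_declarations(CLASS_H))
# ['class Foo', 'class Bar', 'class Any', 'struct A']
```

The function `classify` is skipped because the line does not begin with the word class, and the commented-out class is skipped because `//` is not whitespace. That is exactly the effect of the "declaration starts the line" assumption.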
Task 3
The condition:
3. There is a file in which every line has the format user="Login" Passwd="anypassword", and we need to find in it:
3.1 users whose password consists only of digits,
3.2 users whose password consists only of small letters,
3.3 users whose password consists only of capital letters,
3.4 users whose password is shorter than 5 characters and consists only of small or only of capital letters,
3.5 users whose login looks like an IP address; these are very strange users, rather like bots :))
(for example 243.11.22.03 or 243-11-22-03 or 243_11_22_03)
The text of the users.txt file I tested on:
user="Login" Passwd="12login"
user="Anya" Passwd="12341234"
user="Masha" Passwd="2345234524"
user="Pasha" Passwd="4657467"
user="234.255.252.21" Passwd="2342346354"
user="Petya" Passwd="0099"
user="Misha" Passwd="victor"
user="Lena" Passwd="VASYA"
user="Sveta" Passwd="PUPKIN"
user="Ira" Passwd="PETR"
user="Lera" Passwd="%^&&@&&@*****"
user="Sasha" Passwd=")(#@*)($#K$@LKJLKJLK"
user="Dima" Passwd="K:LSDKL:FS:LFD"
user="Serega" Passwd=")(*#@$(*#@()$"
user="212_2_3_3" Passwd="JDK"
user="225-234-234-22" Passwd="123"
user="192.116.166.13" Passwd="466"
user="234.255.252.22" Passwd="111"

3.1 To find passwords made only of digits, it is enough to require that there is nothing but digits between the quotation marks of the password. This is very simple, and I think you will understand it without pictures:
user="[\w\d\._\-]+"\s+Passwd="\d+"
grep -P 'user="[\w\d\._\-]+"\s+Passwd="\d+"' users.txt
[\w\d\._\-] - the characters allowed in a login: letters, digits, periods, underscores and dashes.

3.2 The condition is similar, only with small letters instead of digits. Writing the small letters as an explicit set, we get:
user="[\w\d\._\-]+"\s+Passwd="[a-z]+"
grep -P 'user="[\w\d\._\-]+"\s+Passwd="[a-z]+"' users.txt

3.3 Here it is similar:
user="[\w\d\._\-]+"\s+Passwd="[A-Z]+"
grep -P 'user="[\w\d\._\-]+"\s+Passwd="[A-Z]+"' users.txt
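Items 3.1 to 3.3 differ only in what is allowed between the password quotes, so they can be checked with one small function. This is a sketch in Python's `re` rather than grep; it uses only a few lines from users.txt, and the name `users_with_password` is made up. The login is wrapped in parentheses so that the function can return just the login instead of the whole line.

```python
import re

# A few lines taken from users.txt, enough to cover every case
USERS_TXT = '''user="Login" Passwd="12login"
user="Anya" Passwd="12341234"
user="Petya" Passwd="0099"
user="Misha" Passwd="victor"
user="Lena" Passwd="VASYA"
user="Ira" Passwd="PETR"
user="Lera" Passwd="%^&&@&&@*****"
user="212_2_3_3" Passwd="JDK"
user="225-234-234-22" Passwd="123"
'''

def users_with_password(password_pattern, text):
    """Logins of the lines whose quoted password fits `password_pattern`."""
    line_re = re.compile(
        r'user="([\w\d\._\-]+)"\s+Passwd="' + password_pattern + '"'
    )
    found = []
    for line in text.splitlines():
        match = line_re.search(line)
        if match:
            found.append(match.group(1))
    return found

print(users_with_password(r"\d+", USERS_TXT))     # 3.1 digits only
print(users_with_password(r"[a-z]+", USERS_TXT))  # 3.2 small letters only
print(users_with_password(r"[A-Z]+", USERS_TXT))  # 3.3 capital letters only
```

The password 12login is rejected in all three cases: the closing quote in the pattern has to come right after the digits or letters, so a mixed password never fits.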

3.4 Here you can think a little and recall the similar case where we were looking for class or struct; only a length restriction is added.
I suggest pausing here and writing it yourself :)). Seriously, reread item 3.4 and the solution to Task 2, and write the expression yourself. Come back after you have tried. =))
It is not much more complicated. The repetition goes inside each alternative, because ([a-z]|[A-Z]){1,4} would also accept mixed-case passwords such as "aB":
user="[\w\d\._\-]+"\s+Passwd="([a-z]{1,4}|[A-Z]{1,4})"
grep -P 'user="[\w\d\._\-]+"\s+Passwd="([a-z]{1,4}|[A-Z]{1,4})"' users.txt
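Item 3.4 asks for passwords made only of small or only of capital letters, and where the {1,4} is placed decides whether mixed case slips through. The sketch below compares the two placements on the password part alone, using Python's `re`; the names `MIXED` and `SAME_CASE` and the sample passwords "abc" and "aB" are mine, the others come from users.txt.

```python
import re

# Repetition outside the group: each character may be small OR capital
MIXED = re.compile(r"([a-z]|[A-Z]){1,4}")

# Repetition inside each alternative: all small OR all capital
SAME_CASE = re.compile(r"([a-z]{1,4}|[A-Z]{1,4})")

def fits(pattern, password):
    """True when the whole password fits the compiled pattern."""
    return pattern.fullmatch(password) is not None

for password in ["JDK", "PETR", "abc", "aB", "VASYA", "123"]:
    print(f"{password:6} mixed={fits(MIXED, password)} "
          f"same_case={fits(SAME_CASE, password)}")
```

On the test file both variants give the same lines, because it contains no short mixed-case password; the difference only shows on input like "aB".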

3.5 Here I will just give the expression without analysing it; that is, I suggest you work through it yourself, although of course you could also write it yourself.
Those who feel up to it should write the solution to item 3.5 on their own and then check it against the test data. For those who do not want to, it remains to work through the expression below. I understand that it looks scary, but if I spell everything out for you, most likely nothing will stick.
Here is the expression itself:
user="(\d{1,3}[\._\-]){3}\d{1,3}"\s+Passwd=".*"
grep -P 'user="(\d{1,3}[\._\-]){3}\d{1,3}"\s+Passwd=".*"' users.txt
A hint:
the block (\d{1,3}[\._\-]) stands for something like '251.'. It is repeated 3 times, which gives something like '251.243.243.', and is followed by one more number, \d{1,3}.
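For checking your own solution to 3.5, here is a sketch in Python's `re`. The first five lines are from users.txt; the last one, with mixed separators, is a hypothetical line I added to show that the expression checks only the shape of the login. An extra pair of parentheses around the whole login (my addition) lets the function return it without the rest of the line.

```python
import re

IP_LIKE_LOGIN = re.compile(
    r'user="((\d{1,3}[\._\-]){3}\d{1,3})"\s+Passwd=".*"'
)

LINES = [
    'user="234.255.252.21" Passwd="2342346354"',
    'user="Petya" Passwd="0099"',
    'user="212_2_3_3" Passwd="JDK"',
    'user="225-234-234-22" Passwd="123"',
    'user="192.116.166.13" Passwd="466"',
    'user="1.2-3_4" Passwd="x"',  # hypothetical: separators mixed in one login
]

def ip_like_logins(lines):
    """Logins that consist of four 1-3 digit numbers joined by . _ or -"""
    found = []
    for line in lines:
        match = IP_LIKE_LOGIN.search(line)
        if match:
            found.append(match.group(1))
    return found

print(ip_like_logins(LINES))
```

The mixed-separator login is accepted too, and so would be 999.999.999.999: the expression says "looks like an IP address", not "is a valid one", which is all the task asks for.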

Finally, I would like to note that the sets can be written out explicitly: instead of \d you can write [1234567890] or [0-9]. Russian letters can be listed explicitly in the same way, for example [а-я], though such a construction may not always be understood correctly. Only small English letters is [a-z], only capital ones is [A-Z], small and capital letters plus digits is [a-zA-Z0-9], and so on.
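How explicit sets and the shorthand classes compare can be seen in a short sketch. It uses Python 3's `re`, where, as far as I know, `\w` covers Cyrillic letters by default and the `re.ASCII` flag switches that off; other tools may behave differently, so treat the results as specific to Python. The sample text is invented.

```python
import re

text = "Толик tolik 42"

digits_set = re.findall(r"[0-9]+", text)        # explicit set of digits
digits_short = re.findall(r"\d+", text)         # shorthand for the same here
latin = re.findall(r"[a-zA-Z0-9]+", text)       # English letters and digits
cyrillic = re.findall(r"[а-яА-Я]+", text)       # explicit Russian ranges
words = re.findall(r"\w+", text)                # Unicode-aware in Python 3
ascii_words = re.findall(r"\w+", text, flags=re.ASCII)

print(digits_set, digits_short)
print(latin)
print(cyrillic)
print(words)
print(ascii_words)
```

One reason an explicit Russian range may not do what you expect: the letter ё sits outside the а-я range in Unicode, so it has to be added to the set separately.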
Thank you for your attention. My advice: spend a little time figuring this out; it really does make life (especially a programmer's) easier.