Notifications about new topics on Habrahabr using Python
I like it when a program is taken apart completely, so that you understand the purpose of every character and why the solution is exactly what it is. In this post I want to share my parser for Habrahabr topics, written in Python without third-party libraries.
When a new topic appears, a pop-up window announces it.
The current version is for Linux with GNOME, but changing one line adapts it to your system.
In words, it works like this:
1) Download the site's root page to a local file under the desired name
2) Open the file
3) Read it line by line until the line with the blog name appears, and extract the name
4) Continue reading until the line with the topic title appears, and extract the title
5) Do the same for the topic date
6) Compare each topic title with the last one seen earlier
7) If the title does not match, show a pop-up window with the Blog / Topic / Date of the new topic
8) If they match, exit the program
We will implement it in Python (the code fragments follow in order, with nothing cut out).
We specify the interpreter, the encoding, the required modules and the user's home folder, where the temporary files will be stored (edit the last line for your system):
#!/usr/bin/python
# -*- coding: utf-8 -*-
import os, sys
HOME_DIR = "/home/user"
Specify the variables needed for the script to work:
LAST_DIR = HOME_DIR + "/.habralast"  # file with the title of the last checked topic
HTML_DIR = HOME_DIR + "/.habr.html"  # text of Habrahabr's root page
SHOW_FIRST_TIME = 5                  # how many topics to show when the script runs for the first time
n = 1         # number of the page being read
new_addr = 0  # becomes 1 once the newest title has been saved
count = 0     # number of notifications shown so far
Check whether the script is running for the first time: if the .habralast file exists, the script has been run before; otherwise we create the file empty. The variable topic1 receives the title of the last topic seen, or an empty string if the script is running for the first time:
if os.path.isfile(LAST_DIR):
    fp = open(LAST_DIR, "r")
    topic1 = fp.readline()
    fp.close()
    last_existed = 1
else:
    fp = open(LAST_DIR, "w")
    topic1 = ""
    fp.close()
    last_existed = 0
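The same first-run check could also be written with context managers, which close the file even if an error occurs. This is only a sketch, assuming Python 3; the helper names read_last_topic and save_last_topic are hypothetical and do not appear in the script.

```python
import os


def read_last_topic(path):
    """Return (title, existed): the saved title and whether the file was there."""
    if os.path.isfile(path):
        with open(path, "r", encoding="utf-8") as fp:
            return fp.readline().rstrip("\n"), True
    # First run: create an empty state file, as the script above does.
    with open(path, "w", encoding="utf-8"):
        pass
    return "", False


def save_last_topic(path, title):
    """Overwrite the state file with the newest topic title."""
    with open(path, "w", encoding="utf-8") as fp:
        fp.write(title)
```

Returning the "existed" flag together with the title keeps the two facts the script tracks (topic1 and last_existed) in one call.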
Download Habrahabr's root page (the main page shows 10 topics; if you missed more, the script opens habrahabr.ru/pageN/, where N is the page number):
while True:
    if n == 1:
        url = "habrahabr.ru"
    else:
        url = "habrahabr.ru/page" + str(n) + "/"
    wget = "wget " + url + " -O " + HTML_DIR
    # os.system does not raise if wget fails; it returns the exit status
    if os.system(wget) != 0:
        print("Cannot connect to server")
        sys.exit()
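For readers without wget, the same step could probably be done with the standard library alone. This is a rough sketch for Python 3 and not the method used in this post: the function names page_url and download are hypothetical, and the http:// scheme is an assumption, since the script passes a bare host name to wget.

```python
import urllib.error
import urllib.request


def page_url(n):
    """Build the address of page n, mirroring the URL pattern used in the script."""
    if n == 1:
        return "http://habrahabr.ru"
    return "http://habrahabr.ru/page" + str(n) + "/"


def download(url, dest, timeout=30):
    """Save the page at url to dest; return False if the connection fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            data = response.read()
    except (urllib.error.URLError, OSError):
        return False
    with open(dest, "wb") as fp:
        fp.write(data)
    return True
```

A call such as download(page_url(n), HTML_DIR) would then replace the wget line, with a False result handled the same way as a failed wget.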
Open the downloaded page:
    index = open(HTML_DIR, "r")
The following strings are the key: later we read the file line by line until we find exactly these fragments:
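    s = ' '    # marker of the line with the blog name (HTML literal missing here)
    ss = ' '   # marker of the line with the topic title (HTML literal missing here)
    sss = ' '  # marker of the line with the date the topic was written (HTML literal missing here)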
I have checked many times that these fragments do not occur anywhere else on the page, so we can rely on them without any confusion.
We check each line in turn for the blog-name marker (2000 was chosen experimentally from the number of lines the HTML file devotes to topics), extract the name and assign it to the variable blog:
    for i in range(2000):
        line = index.readline()
        if s in line:
            blog_s = line.find('">')
            blog_e = line.find("")
            blog = line[blog_s+2:blog_e]
We found the blog; now we look for the topic-title marker (the page source shows that the topic is no more than 50 lines after the blog), extract the title and assign it to the variable topic. If the title differs from the last one seen (topic != topic1), we write it to the .habralast file. This write happens only once, so that an older topic is never recorded: the newest topics come first.
            for j in range(50):
                line = index.readline()
                if ss in line:
                    topic_s = line.find('">')
                    topic_e = line.find("")
                    topic = line[topic_s+2:topic_e]
                    if topic.find("") != -1:
                        topic = topic[topic.find("")+7:]
                    if topic != topic1:
                        if new_addr == 0:
                            fp = open(LAST_DIR, "w")
                            fp.write(topic)
                            fp.close()
                            new_addr = 1
                        print("Blog:\t" + blog)
                        print("Topic:\t" + topic)
I noticed that tags are sometimes inserted at the beginning of the topic title; we do not need them, so we filter them out. We print the blog and topic names (all the print lines can be commented out if desired).
Then we read on line by line until the marker of the topic date appears; no more than 100 lines are needed:
                for k in range(100):
                    line = index.readline()
                    if sss in line:
                        line = index.readline()
                        time_s = line.find("")
                        time_e = line.find("")
                        date = line[time_s+6:time_e]
                        print("Date:\t" + date + "\n")
                        notify = "notify-send 'Habrahabr.ru: " + blog + "' '" + topic + "\n" + date + "'"
                        os.system(notify)
                        count += 1
                        if count == SHOW_FIRST_TIME and last_existed == 0:
                            os.system("rm -f " + HTML_DIR)
                            sys.exit()
                        break
                break
The os.system(notify) line creates the pop-up window with information about the new topic; its content is assembled on the line above it. On the first run, once SHOW_FIRST_TIME topics have been shown, we delete the downloaded HTML file as no longer needed and exit the program.
As soon as the last topic seen earlier is found, we delete the HTML file and exit:
                    else:
                        os.system("rm -f " + HTML_DIR)
                        sys.exit()
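The find-and-slice pattern used above for the blog name, the topic title and the date can be isolated into a small helper. This is a sketch under assumptions: the function name text_between and the sample HTML line are made up for illustration and are not taken from Habrahabr's real markup.

```python
def text_between(line, start, end):
    """Return the text between the first `start` and the next `end`, or None."""
    begin = line.find(start)
    if begin == -1:
        return None
    begin += len(start)
    finish = line.find(end, begin)
    if finish == -1:
        return None
    return line[begin:finish]


# Hypothetical line, only to show the call; the real page markup differs.
sample = '<a href="/blogs/python/" class="blog">Python</a>'
print(text_between(sample, '">', "</a>"))  # Python
```

Compared with the bare find calls, the helper uses len(start) instead of hard-coded offsets such as +2, and it reports a missing marker instead of slicing with -1.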
That was one iteration of the main loop. If you missed more topics than fit on the first page, the next page is opened and everything repeats:
    n += 1
    index.close()
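Stripped of the HTML parsing, the decision logic of the loop is: walk the topics from newest to oldest, collect them until the last seen title appears, and on a first run stop after SHOW_FIRST_TIME items. A minimal sketch in Python 3, assuming the titles have already been extracted into a list and ignoring the paging; the function name new_topics and the sample titles are hypothetical.

```python
def new_topics(titles, last_seen, first_run=False, limit=5):
    """Return the titles that are newer than last_seen, newest first.

    titles is ordered newest first, as on the site's main page.
    On a first run at most `limit` titles are returned.
    """
    fresh = []
    for title in titles:
        if title == last_seen:
            break  # everything from here on has been seen already
        fresh.append(title)
        if first_run and len(fresh) == limit:
            break  # first run: do not flood the user with pop-ups
    return fresh


print(new_topics(["C", "B", "A"], "B"))  # ['C']
```

The first element of the result, if there is one, is the title that the script would write to .habralast.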
In GNOME, the notify-send command is responsible for pop-ups. On your system it may be different; in that case, adapt the line notify = "notify-send 'Habrahabr.ru: " + blog + "' '" + topic + "\n" + date + "'" to your command and its syntax.
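One caveat with building the command as a single shell string: a title containing an apostrophe ends the quoted argument early. A possible alternative, sketched here for Python 3 and not part of the script in this post, passes the arguments as a list so that no shell quoting is involved. The function names are hypothetical; the order of arguments (summary, then body) follows the way the script calls notify-send.

```python
import subprocess


def build_notify_command(blog, topic, date):
    """Return the argument list for notify-send: summary first, then body."""
    summary = "Habrahabr.ru: " + blog
    body = topic + "\n" + date
    return ["notify-send", summary, body]


def notify(blog, topic, date):
    """Show the pop-up; return False if the command is missing or fails."""
    try:
        result = subprocess.run(build_notify_command(blog, topic, date))
    except FileNotFoundError:
        return False
    return result.returncode == 0
```

To adapt this to another desktop, only the list returned by build_notify_command would need to change.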
Here I deliberately did not align the fragments to the left margin, so that it is clearer what follows what and what depends on what. I only had to shift the third fragment from the end two tabs to the left, otherwise it did not look good. So, to avoid confusion, here is the whole script:
#!/usr/bin/python
# -*- coding: utf-8 -*-
import os, sys

HOME_DIR = "/home/user"
LAST_DIR = HOME_DIR + "/.habralast"
HTML_DIR = HOME_DIR + "/.habr.html"
SHOW_FIRST_TIME = 5
n = 1
new_addr = 0
count = 0

if os.path.isfile(LAST_DIR):
    fp = open(LAST_DIR, "r")
    topic1 = fp.readline()
    fp.close()
    last_existed = 1
else:
    fp = open(LAST_DIR, "w")
    topic1 = ""
    fp.close()
    last_existed = 0

while True:
    if n == 1:
        url = "habrahabr.ru"
    else:
        url = "habrahabr.ru/page" + str(n) + "/"
    wget = "wget " + url + " -O " + HTML_DIR
    # os.system does not raise if wget fails; it returns the exit status
    if os.system(wget) != 0:
        print("Cannot connect to server")
        sys.exit()
    index = open(HTML_DIR, "r")
    s = ' '    # blog-name marker (HTML literal missing here)
    ss = ' '   # topic-title marker (HTML literal missing here)
    sss = ' '  # topic-date marker (HTML literal missing here)
    for i in range(2000):
        line = index.readline()
        if s in line:
            blog_s = line.find('">')
            blog_e = line.find("")
            blog = line[blog_s+2:blog_e]
            for j in range(50):
                line = index.readline()
                if ss in line:
                    topic_s = line.find('">')
                    topic_e = line.find("")
                    topic = line[topic_s+2:topic_e]
                    if topic.find("") != -1:
                        topic = topic[topic.find("")+7:]
                    if topic != topic1:
                        if new_addr == 0:
                            fp = open(LAST_DIR, "w")
                            fp.write(topic)
                            fp.close()
                            new_addr = 1
                        print("Blog:\t" + blog)
                        print("Topic:\t" + topic)
                        for k in range(100):
                            line = index.readline()
                            if sss in line:
                                line = index.readline()
                                time_s = line.find("")
                                time_e = line.find("")
                                date = line[time_s+6:time_e]
                                print("Date:\t" + date + "\n")
                                notify = "notify-send 'Habrahabr.ru: " + blog + "' '" + topic + "\n" + date + "'"
                                os.system(notify)
                                count += 1
                                if count == SHOW_FIRST_TIME and last_existed == 0:
                                    os.system("rm -f " + HTML_DIR)
                                    sys.exit()
                                break
                        break
                    else:
                        os.system("rm -f " + HTML_DIR)
                        sys.exit()
    n += 1
    index.close()
I ran this script in a terminal, and both its output and the pop-up windows appeared. If you use the script only in a terminal and do not need pop-ups, comment out the os.system(notify) line together with the one before it. Otherwise, you can put the command python path_to_script/HabraParser.py into crontab and run it, say, every 30 minutes. Or run it by hand, it is up to you! The main thing is to adjust the directory where the files are saved.
Here is how it looks at the bottom of the screen when a new topic is found:
That's all... I find it convenient!