Updating DoubleGIS with Linux console tools
Introduction
Users quite often ask to have the DoubleGIS directory installed (this is not an advertisement), especially those who go on business trips or deal with people from other cities.
And, like any system administrator, I wanted to update DoubleGIS for all cities automatically and from one central place.
For a number of reasons I decided to do this with Linux.
One reason was the lack of a centralized update solution for this operating system.
Another was that the site offers no single archive containing all the databases and the shell for Linux users.
In this article I will show how to update DoubleGIS for all cities using Linux console tools.
What do you need?
- A Linux server (mine runs Fedora 15)
- wget
- sed, grep
- unzip
- Your favorite text editor
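All of these are standard packages; on Fedora any that are missing can be installed with yum (a quick sketch, assuming the package names match the tool names, as they do on Fedora):
yum install -y wget sed grep unzip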
Writing a script
Here is the script I ended up with, step by step.
Download the web page with the links to the cities.
wget --no-proxy --html-extension -P/root/2gis 'http://www.2gis.ru/how-get/linux/'
From all the HTML files we downloaded, we extract the lines containing links, sort them, drop duplicates and write the result to a temporary file index.tmp.
cat /root/2gis/*.html | grep 'http://' | sort | uniq > /root/2gis/index.tmp
Delete the web page - it is no longer needed.
rm -f /root/2gis/*.html
With this scary-looking command we process index.tmp, pull out all the links containing how-get, and immediately download the web pages they point to.
cat /root/2gis/index.tmp | grep -o "['\" ]*http://[^\"' >]*['\" >]" | sed "s/[\"' >]//g" | grep how-get | xargs wget --no-proxy -r -p -np -l1 -P/root/2gis --tries=10 --html-extension --no-directories --span-hosts --dot-style=mega
Delete index.tmp - it just gets in the way.
rm -f /root/2gis/index.tmp
Concatenate all the files with the .html extension into a single index2.tmp.
cat /root/2gis/*.html >/root/2gis/index2.tmp
Delete downloaded web pages.
rm -f /root/2gis/*.html
Now the most interesting part: we need to pull out the links to the updates and download the files they point to.
We process index2.tmp for links containing the string "/last/linux/", sort them, drop duplicates and immediately download only the new files into the 2gis.arch folder.
cat /root/2gis/index2.tmp | grep -o "['\" ]*http://[^\"' >]*['\" >]" | sed "s/[\"' >]//g" | grep "/last/linux/" | sort | uniq | xargs wget --no-proxy -nc -P/root/2gis.arch --tries=3 --html-extension --no-directories --span-hosts --dot-style=mega
Delete all the temporary files.
rm -fr /root/2gis/index*
Unpack all the zip files from the archive folder into our target folder /root/2gis/.
unzip -o /root/2gis.arch/\*.zip -d /root/2gis/
Delete archives older than 20 days so that stale copies do not pile up.
find /root/2gis.arch/ -type f -mtime +20 | xargs rm -f
Now the /root/2gis folder contains the unpacked DoubleGIS for all cities, and the /root/2gis.arch folder contains the archives for Linux users downloaded from the site.
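As a quick sanity check you can list the unpacked data and see how much space it takes (the paths are the same ones used above):
ls /root/2gis/
du -sh /root/2gis/ /root/2gis.arch/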
We schedule the script to run from cron.
I run it every day; the script does not re-download files it already has.
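A crontab entry for a daily run might look like this (a sketch: /root/update-2gis.sh is just a name I chose here for a file holding the commands above, and the log path is an assumption):
0 4 * * * /bin/bash /root/update-2gis.sh >/var/log/2gis-update.log 2>&1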
Conclusion
The DoubleGIS site structure changes from time to time, so the script may stop picking up updates. I recommend checking on it periodically.
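One simple way to do that is to check whether any fresh archives have actually arrived and send a warning if not (a minimal sketch: the /root/2gis.arch path is the one used above, while the mail utility and the address are assumptions):
# Warn if no archive newer than 35 days is present
if [ -z "$(find /root/2gis.arch/ -type f -name '*.zip' -mtime -35)" ]; then
    echo "DoubleGIS update looks stale, check the site structure" | mail -s "DoubleGIS update warning" admin@example.com
fi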
UPDATED 12/31/2011
I have reworked the script and removed everything unnecessary.
Here is the new version. P.S. Thanks to kriomant for the constructive criticism. Happy New Year, everyone!
wget -O - 'http://www.2gis.ru/how-get/linux/' 2>/dev/null | sed "s/^.*\(http:\/\/[^\"\'\ ]*\/how-get\/linux\/\).*$/\1/g" |\
grep "how-get\/linux"|sort|uniq|xargs wget -p -O - 2>/dev/null |sed "s/^.*\(http:\/\/[^\"\'\ ]*\/last\/linux\/\).*$/\1/g"|grep "last\/linux"| sort|uniq|\
xargs wget -N -P/root/2gis.arch
unzip -o /root/2gis.arch/\*.zip -d /root/2gis/