Updating DoubleGIS with Linux console tools

    Introduction


    Users very often ask to have the DoubleGIS directory installed (don't take this as an advertisement), especially users who go on business trips or communicate with people from other cities.

    And, like any system administrator, I had the idea of updating DoubleGIS for all cities automatically and centrally.

    For a number of reasons, it was decided to do this with Linux.
    One reason was the lack of any existing centralized-update solution for this operating system.
    Another was that the site offers Linux users no single archive containing all the city databases together with the shell.

    In this article I will show how to update DoubleGIS for all cities using Linux console tools.

    What do you need?


    • A Linux server (mine runs Fedora 15)
    • wget
    • sed, grep
    • unzip (a one-line install for all of these tools follows just after this list)
    • Your favorite text editor
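
    On Fedora, the required tools can be installed in one line (a sketch; these are the standard Fedora package names, and sed and grep are normally present already):
    yum install -y wget sed grep unzip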


    Writing a script


    Here is the script I ended up with, step by step.

    Download the web page that lists the links for each city.
    wget --no-proxy --html-extension -P/root/2gis 'http://www.2gis.ru/how-get/linux/'

    From all the HTML files we downloaded, select every line containing a link, sort, remove duplicates, and write the result to a temporary file index.tmp.
    cat /root/2gis/*.html | grep 'http://' | sort | uniq >/root/2gis/index.tmp

    Delete the downloaded page - it is no longer needed.
    rm -f /root/2gis/*.html

    With this scary-looking command we process index.tmp, extracting every link that contains the how-get string, and immediately download the web pages behind those links.
    cat /root/2gis/index.tmp \
        | grep -o [\'\"\ ]*http:\/\/[^\"\'\ \>]*[\'\"\ \>] \
        | sed s/[\"\'\ \>]//g \
        | grep how-get \
        | xargs wget --no-proxy -r -p -np -l1 -P/root/2gis --tries=10 --html-extension --no-directories --span-hosts --dot-style=mega
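
    If the backslash escaping is hard to read, here is a roughly equivalent form of the same extraction step (my sketch, not the original: it uses grep -E extended regexes with quoted patterns instead of escaping, and omits the progress option):
    grep -oE "http://[^\"' >]+" /root/2gis/index.tmp \
        | grep how-get | sort | uniq \
        | xargs wget --no-proxy -r -p -np -l1 -P/root/2gis --tries=10 \
            --html-extension --no-directories --span-hosts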

    Delete index.tmp - it just gets in the way.
    rm -f /root/2gis/index.tmp

    Concatenate all files with the html extension into one file index2.tmp.
    cat /root/2gis/*.html >/root/2gis/index2.tmp

    Delete the downloaded web pages.
    rm -f /root/2gis/*.html

    Now the most interesting part: we need to pull out the links to the updates and download the files they point to.

    Process index2.tmp for links containing the string "/last/linux/", sort, remove duplicates, and immediately download only the new files into the /root/2gis.arch folder (the -nc flag makes wget skip files that already exist locally).
    cat /root/2gis/index2.tmp \
        | grep -o [\'\"\ ]*http:\/\/[^\"\'\ \>]*[\'\"\ \>] \
        | sed s/[\"\'\ \>]//g \
        | grep "/last/linux/" | sort | uniq \
        | xargs wget --no-proxy -nc -P/root/2gis.arch --tries=3 --html-extension --no-directories --span-hosts --dot-style=mega

    Delete all temporary files.
    rm -fr /root/2gis/index*

    Unpack all the zip files from the archive folder into our target folder /root/2gis/.
    unzip -o /root/2gis.arch/\*.zip -d /root/2gis/
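    An interrupted download can leave a truncated zip that makes the batch extraction fail; a defensive variant (my addition, not part of the original script) tests each archive before unpacking it:
    for f in /root/2gis.arch/*.zip; do
        unzip -tq "$f" >/dev/null 2>&1 && unzip -o "$f" -d /root/2gis/ \
            || echo "skipping broken archive: $f" >&2
    done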
    Delete archives older than 20 days so that duplicates do not accumulate (matching only regular files keeps find from feeding the archive directory itself to rm).
    find /root/2gis.arch/ -type f -mtime +20 | xargs -r rm -f

    Now the /root/2gis folder contains the unpacked DoubleGIS for all cities, and the /root/2gis.arch folder holds the archives for Linux users downloaded from the site.
    We schedule the script with cron.
    I run it every day; the script does not re-download files it already has.
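
    For example, assuming the steps above are saved as /root/bin/update-2gis.sh (a hypothetical path), a daily crontab entry could look like this:
    # run the 2GIS update every night at 04:15 (path and time are assumptions)
    15 4 * * * /root/bin/update-2gis.sh >/var/log/2gis-update.log 2>&1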

    Conclusion


    The structure of the DoubleGIS site changes constantly, so it is quite possible the script will stop picking up updates at some point. I recommend checking on it periodically.
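
    A simple staleness check along these lines could help (a sketch; the two-week threshold and the mail notification are my assumptions, and a mail command must be available):
    # complain if no archive has been refreshed in the last 14 days
    find /root/2gis.arch -name '*.zip' -mtime -14 | grep -q . \
        || echo "2GIS update looks stale, check the script" | mail -s "2gis update" root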

    UPDATED 12/31/2011

    Edited the script and removed everything unnecessary.

    Here is the new version. P.S. Thanks to kriomant for the constructive criticism. Happy New Year, everyone!
    wget -O - 'http://www.2gis.ru/how-get/linux/' 2>/dev/null | sed "s/^.*\(http:\/\/[^\"\'\ ]*\/how-get\/linux\/\).*$/\1/g" |\
    grep "how-get\/linux"|sort|uniq|xargs wget -p -O - 2>/dev/null |sed "s/^.*\(http:\/\/[^\"\'\ ]*\/last\/linux\/\).*$/\1/g"|grep "last\/linux"| sort|uniq|\
    xargs wget -N -P/root/2gis.arch
    unzip -o /root/2gis.arch/\*.zip -d /root/2gis/



