miriarder September 11, 2014 at 22:41

Convenient download from Books.ru or add to the case WWW :: Mechanize

If there is a book, it’s good,
And when on the contrary, it’s bad
Instead of an epigraph

As everyone knows, there has recently been an action with the possibility of acquiring a large number of e-books on books.ru at a fair price . User icoz made a script for batch downloading , but the script is not very convenient, since books are stored under uncomfortable names and you need to download them by hand.
In general, I told myself that everything should be convenient and automatic, as is known “said-done”, which is especially important in the light of the upcoming sale tomorrow.

Step 1 . We connect the necessary modules.
We will need

use WWW::Mechanize;
use HTTP::Request::Common;
use LWP;
use LWP::UserAgent;

The module itself and several service modules on which it depends. If you, like I use Ubuntu, then downloading WWW :: Mechanize from CPAN is contraindicated, but instead it's better to say

sudo apt-get install libwww-mechanize-perl

Step 2 . We create a mechanization object and pick up the script parameters from the command line: login and password.

my $mech = WWW::Mechanize->new();
$booklog =  $ARGV[0];
$bookpsw =  $ARGV[1];

Step 3 . Log in to the site

my $resp = $mech->get('http://www.books.ru/member/login.php');
$mech->cookie_jar->set_cookie(0, 'cookie_first_timestamp',DateTime->now->epoch, '/', 'www.books.ru');
$mech->cookie_jar->set_cookie(0, 'cookie_pages', '1', '/', 'www.books.ru');
$resp = $mech->post('http://www.books.ru/member/login.php',[
    'login'  => $mail,
    'password' => $password,
    'go' => 'login',
    'x' => rand_from_to(40, 50), 'y' => rand_from_to(1, 20), 
    'token' => ''
  ]);

I draw your attention to lines 2 and 3. In the original code, these cookies are generated using JavaScript, but just for the sake of calculating two parameters, connecting JavaScript is not rational and it is easier to rewrite it in pearl.
Step 4 . We get the general list of our orders and create an iterator on it:

$resp = $mech->get('http://www.books.ru/member/orders/');
my @order_list = mkGunz($resp->content) =~ /\/gi;
foreach my $order_id (@order_list) {...}

I draw attention to the mkGunz function, which automatically decompresses the data if the server packs them using gzip.
Step 5 . Now we need to extract the authors of the book and its title from the page. Since we use the HTML :: TokeParser module to parse the page, it’s easiest to stream the data we need using the URL.

  my $fname =  '';
  my $authors = '';
  while (my $token = $stream->get_token) 
  {
    if ($token->[0] eq 'S' && $token->[1] eq 'a') 
    {
      my $href = $token->[2]{'href'};
      $authors .= $stream->get_trimmed_text('/a').',' if ($href =~ /\/author\//);
      if ($href =~ /show\=1/)
      {
        $fname = $stream->get_trimmed_text('/a');
        $fname =~ s/\(файл\sPDF\)//gi;
      }
      if ($href =~ /download\/\?file_type\=pdf/)
      {
        chop($authors);
        $fname = trim($authors.','.$fname);
        $fname =~ tr/\//_/;
        $fname .= '.pdf';
        ....
       }
     }

Step 6 . Get and save PDF. There are several interesting points at once: if you do not make clone, then only one book will be downloaded, apparently a bug on books.ru. You cannot use the IO :: File module to save files with Russian letters, a bug in the module for pearl version v5.14.2. Well, the binmode call, so as not to break the PDF files.

        my $gbm = $mech->clone();
        $resp = $gbm->get($href);
        $resp = $gbm->submit_form(with_fields => {'agreed' => 'Y', 'go' => 1});
        my $pdfFile = $resp->content;
        $pdfFile = mkGunz($resp->content) unless ($resp->content =~ /^\%PDF/);
        print "Saving ".$fname." as ".length($pdfFile)." bytes.\n" ;
        open(my $fh, ">", $fname);
        if (defined $fh) 
        {
          binmode($fh);
          print $fh $pdfFile; 
          close($fh);
        }

And finally, everything is complete.

#!/usr/bin/perl
use WWW::Mechanize;
use HTTP::Request::Common;
use LWP;
use LWP::UserAgent;
use URI::Escape;
use HTML::TokeParser;
use DateTime;
use Compress::Raw::Zlib;
use Encode qw(decode encode);
use warnings;
sub trim($);
my $mech = WWW::Mechanize->new();
$booklog =  $ARGV[0];
$bookpsw =  $ARGV[1];
#die "Usage: books.su.pl  \n" if (scalar @ARGV < 2);
$mail = $booklog;
$password = $bookpsw;
$mech->agent_alias("Linux Mozilla");
#$mech->proxy('https', 'http://127.0.0.1:8888/');
#$mech->proxy('http', 'http://127.0.0.1:8888/');
my $resp = $mech->get('http://www.books.ru/member/login.php');
$mech->cookie_jar->set_cookie(0, 'cookie_first_timestamp',DateTime->now->epoch, '/', 'www.books.ru');
$mech->cookie_jar->set_cookie(0, 'cookie_pages', '1', '/', 'www.books.ru');
#print mkGunz($resp->content)."\n";
$resp = $mech->post('http://www.books.ru/member/login.php',[
    'login'  => $mail,
    'password' => $password,
    'go' => 'login',
    'x' => rand_from_to(40, 50), 'y' => rand_from_to(1, 20), 
    'token' => ''
  ]);
#print mkGunz($resp->content)."\n";
$resp = $mech->get('http://www.books.ru/member/orders/');
my @order_list = mkGunz($resp->content) =~ /\/gi;
foreach my $order_id (@order_list)
{
  $resp = $mech->get('http://www.books.ru/order.php?order='.$order_id);
  my $hcont = mkGunz($resp->content);
  my $stream = HTML::TokeParser->new(\$hcont);
  $stream->empty_element_tags(1);
  my $fname =  '';
  my $authors = '';
  while (my $token = $stream->get_token) 
  {
    if ($authors eq '' && $fname ne "" && $token->[0] eq 'S' && $token->[1] eq 'br') 
    {
      $authors .= cnv($stream->get_trimmed_text('/p')).','; 
    }
    if ($token->[0] eq 'S' && $token->[1] eq 'a') 
    {
      my $href = $token->[2]{'href'};
      if ($href =~ /show\=1/)
      {
        $fname = cnv($stream->get_trimmed_text('/a'));
        $fname =~ s/\(файл\sPDF\)//gi;
      }
      if ($href =~ /download\/\?file_type\=pdf/)
      {
        chop($authors);
        $fname = trim($authors.','.$fname);
        $fname =~ tr/\//_/;
        $fname .= '.pdf';
        my $gbm = $mech->clone();
        $resp = $gbm->get($href);
        $resp = $gbm->submit_form(with_fields => {'agreed' => 'Y', 'go' => 1});
        my $pdfFile = $resp->content;
        $pdfFile = mkGunz($resp->content) unless ($resp->content =~ /^\%PDF/);
        print "Saving ".$fname." as ".length($pdfFile)." bytes.\n" ;
        open(my $fh, ">", $fname);
        if (defined $fh) 
        {
          binmode($fh);
          print $fh $pdfFile; 
          close($fh);
        }
        else
        { 
          die "Unable to open:".$fname."\n";
        }
        $authors = '';
        $fname =  '';
      }
    }
  }
}
sub cnv {return shift;}#encode('cp1251', decode('UTF-8', shift));}
sub rand_from_to
{
  my($from, $to) = @_;
  return int(rand($to - $from)) + $from;
}
sub mkGunz
{
  my ($ind) = @_;
  return $ind if($ind =~ /html/);
  my $gun = new Compress::Raw::Zlib::Inflate(WindowBits => WANT_GZIP);
  {
    my $out;
    my $status = $gun->inflate($ind, $out);
    if ($status == Z_OK || $status == Z_STREAM_END)
    {
        return $out;
    }
    else
    {
        die $status.":".$ind;
    }
  };
}
sub trim($)
{
	my $string = shift;
	$string =~ s/^\s+//;
	$string =~ s/\s+$//;
	return $string;
}

Note for Windows lovers :
Most likely, you need to change the line $ fname = ~ tr / \ // _ /; on $ fname = ~ tr / \ / \: \ * \? \\ / _ /; since NTFS has more forbidden characters than ext4 and tinker with the encoding, for which the cnv function is provided.

Mandatory wishes to the reader : I wish you not to miss the sale , buy many books, download them to your tablet and read calmly on weekends at the cottage without the Internet.

Legal disclaimer : Since under a license agreement it is forbidden to rename files after downloading , you must download them immediately under the correct and convenient name, which this script does!

Tags:

Convenient download from Books.ru or add to the case WWW :: Mechanize

Also popular now: