Saving Google Reader data. PHP version

    A heads-up: this piece does not pretend to be a full-fledged article. It is yet another "pump all your data out of Google Reader" post, a note from the "I'll just leave this here" category.



    Introduction


    Several articles have already been written on the well-known occasion of the shutdown of this excellent RSS reader.
    I was "inspired" to write this small note by this post, in which an interim solution was also given: saving your data from GReader with a Python script.
    The whole point of my article is really just one thing: I wrote a similar script for myself in PHP (I don't know Python), and I figured it would be nice to share it.
    On the one hand it might help someone else; on the other, maybe someone will help me by pointing out mistakes, or something I forgot, left unfinished, or got wrong.

    What is it?


    A single-file script that pulls all of your subscriptions out of Google Reader and saves them to your hard drive, history included: absolutely every available post, even posts from long-dead sites that are still stored in Reader. The link above in the introduction points to a post with a Python script which, as I understand it, can do the same.
    The script uses practically no third-party libraries, so it needs no extra setup and no "go over there and download that thing" quests.

    Where to get it?


    The script itself is available here on GitHub (although presumably in July this will all become irrelevant).
    It is one script file plus one batch file (.bat). The script was written under Windows, but it should work everywhere.
    As for PHP under Windows, there is a simplified option, which I already described here.
    The gist: take the standalone PHP archive and unpack it into any folder (it doesn't matter which, though a short path without spaces is preferable), for example c:\php. Then either add that directory to the PATH environment variable, or edit the first line of the batch file that launches the script, or, if you unpacked to c:\php, do nothing at all (that is the path already written in the attached batch file). Alternatively, download a fresh build from php.net; many people already have everything installed anyway.

    All that remains is to enter your Google credentials at the beginning of the PHP script, set the options you want, run the batch file, and wait while it downloads everything.

    How does it work?


    Now a description of the script and what it can do.
    To start with, it is probably worth noting which non-core pieces it relies on: the cURL library and the json_decode() function.
    cURL, I assume, is enabled by default for most people; and although the JSON functions only appeared somewhere around the fifth version of PHP, the script also works on earlier versions, because by default it replaces json_decode() with simple regular expressions (a sketch of such a fallback is shown below). So the only real requirement left is cURL.
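    To illustrate, this is roughly what such a regex-based fallback can look like; it is a hypothetical sketch with my own function names, not the script's actual code; the idea is simply to pull out the few fields that matter instead of parsing the whole document:

    <?php
    // Hypothetical sketch of a regex-based json_decode() fallback: instead of
    // parsing the whole JSON document, extract only the fields that are needed.
    function fallback_get_continuation($json) {
        // e.g. {"continuation":"CKb...","items":[...]}  ->  CKb...
        if (preg_match('/"continuation"\s*:\s*"([^"]*)"/', $json, $m)) {
            return $m[1];
        }
        return null;
    }

    function fallback_get_feed_ids($json) {
        // In the subscription list, feed ids look like "feed/http://example.com/rss"
        preg_match_all('/"id"\s*:\s*"(feed\/[^"]+)"/', $json, $m);
        return $m[1];
    }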
    Also, for the sake of full disclosure, it is probably worth mentioning that the authorization code for the service is taken from this small class. In fact, only a couple of functions for obtaining a token were taken from it; the rest was reworked and built into the end of the script.
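    For reference, obtaining the token via Google's old ClientLogin endpoint looked roughly like this; this is a sketch of the mechanism with an illustrative function name, not a verbatim excerpt from that class:

    <?php
    // Sketch: request an auth token from Google ClientLogin for the "reader" service.
    function greader_get_auth_token($email, $password) {
        $ch = curl_init('https://www.google.com/accounts/ClientLogin');
        curl_setopt_array($ch, array(
            CURLOPT_POST           => true,
            CURLOPT_RETURNTRANSFER => true,
            CURLOPT_POSTFIELDS     => http_build_query(array(
                'accountType' => 'GOOGLE',
                'Email'       => $email,
                'Passwd'      => $password,
                'service'     => 'reader',                 // Google Reader service name
                'source'      => 'greader-backup-script',  // arbitrary client name
            )),
        ));
        $response = curl_exec($ch);
        curl_close($ch);
        // The response contains SID=..., LSID=..., Auth=... lines; only Auth is needed.
        return preg_match('/Auth=(\S+)/', $response, $m) ? $m[1] : false;
    }

    Every subsequent request to the Reader API then carries this token in a header of the form Authorization: GoogleLogin auth=<token>.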

    Now the settings. They are at the beginning of the script and look like this:
    $GLOBALS['account_user']='googleuser@gmail.com';
    $GLOBALS['account_password']='qwerty';
    $GLOBALS['is_atom']=true;
    $GLOBALS['try_consolidate']=true;
    $GLOBALS['fetch_count']=1000;
    $GLOBALS['fetch_special_feeds']=true;
    $GLOBALS['fetch_regular_feeds']=true;
    $GLOBALS['atom_ext']="atom.xml.txt";
    $GLOBALS['json_ext']="json.txt";
    $GLOBALS['save_dir']="./feeds/";
    $GLOBALS['log_file']=$GLOBALS['save_dir']."log.txt";
    $GLOBALS['use_json_decode']=false;//function_exists('json_decode');
    /* !!!!!!!!!! */
    $GLOBALS['need_readinglist']=false;
    /* !!!!!!!!!!
     important!
     this will fetch a very full feed list, mixed from all subscriptions and ordered by post date.
     in most cases this data is not useful, and this option will double the script's run time and disk space requirements.
     so you probably don't need to set this to true.
    !!!!!!!!!! */
    


    Where to enter the login and password is, I think, clear)
    For me, with two-step authentication enabled on my Google account, an "application password" works fine in the script.

    The rest:
    $GLOBALS['is_atom'] - whether to fetch the data in JSON or XML (Atom) format; if true, the XML version is created.

    $GLOBALS['try_consolidate'] - if true, the script tries to write each subscription into a single continuous file.
    The thing is that Google does not allow more than a thousand records to be pulled in one request, so the script downloads in chunks of $GLOBALS['fetch_count'] records (1000 is the maximum allowed value of this parameter). It can then either put each such batch into separate numbered "per-thousand" files, or keep appending to the same file without breaking its structure (JSON or XML). Since properly parsing the incoming data while the script runs would be too expensive, it has a rather crude mechanism for merging files with simple regular expressions, which nevertheless works. In general, you can play with the parameters and see what comes out; a rough sketch of the download loop is shown below.
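    Here is a simplified sketch of that chunked loop, assuming the historical Reader Atom endpoint; the function and variable names are illustrative, not the script's actual ones:

    <?php
    // Sketch: download an entire feed history in chunks of up to 1000 entries,
    // following the continuation token that Reader returns with each chunk.
    function fetch_whole_feed($feed_url, $auth_token) {
        $chunks = array();
        $continuation = null;
        do {
            $url = 'https://www.google.com/reader/atom/feed/' . urlencode($feed_url)
                 . '?n=' . $GLOBALS['fetch_count'];       // at most 1000 per request
            if ($continuation !== null) {
                $url .= '&c=' . urlencode($continuation); // resume where the last chunk ended
            }
            $ch = curl_init($url);
            curl_setopt_array($ch, array(
                CURLOPT_RETURNTRANSFER => true,
                CURLOPT_HTTPHEADER     => array('Authorization: GoogleLogin auth=' . $auth_token),
            ));
            $chunk = curl_exec($ch);
            curl_close($ch);
            $chunks[] = $chunk;
            // The Atom response carries the token for the next chunk, if there is one.
            $continuation = preg_match('/<gr:continuation>([^<]+)<\/gr:continuation>/', $chunk, $m)
                ? $m[1] : null;
        } while ($continuation !== null);
        return $chunks;
    }

    The consolidation trick is then essentially string surgery: strip the closing </feed> tag from the file already on disk, strip everything before the first <entry> from the new chunk, and glue the two together so the result stays a single well-formed Atom document (the same idea works for JSON and its items array).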

    $GLOBALS['fetch_special_feeds'] = true; - whether to pull the special feeds such as "notes", "starred entries", and so on. Some people may not need them.
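    Presumably these special feeds correspond to Reader's own "state" streams; as an illustration (the script's actual list may differ), the best-known ones in the old API were:

    <?php
    // Illustrative list of well-known "state" stream ids from the old Reader API.
    $special_streams = array(
        'user/-/state/com.google/starred',    // starred entries
        'user/-/state/com.google/broadcast',  // shared ("broadcast") entries
    );
    // Each one is fetched from https://www.google.com/reader/atom/<stream id>,
    // the same way as an ordinary feed/... stream.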

    $GLOBALS['fetch_regular_feeds'] = true; - whether to pull each of the regular feeds from your subscription list individually. You can turn this off, for example, if for some reason you only want the main combined stream where everything is mixed together (the $GLOBALS['need_readinglist'] parameter).

    $GLOBALS['atom_ext'] = "atom.xml.txt";
    $GLOBALS['json_ext'] = "json.txt";
    These are the file extensions the script appends to everything it downloads; depending on $GLOBALS['is_atom'] it picks one or the other.

    $GLOBALS['save_dir'] = "./feeds/"; - the directory to download into. By default the script creates a feeds directory next to itself, as you might guess from this parameter)

    $GLOBALS['log_file'] - by default the log file ends up in the feeds subdirectory, and everything is duplicated into it.

    $GLOBALS['use_json_decode'] - whether to use the json_decode() function or make do with the simplified version. If you set it like this:
    $GLOBALS['use_json_decode'] = function_exists('json_decode'); then it will automatically use the built-in function whenever your PHP version supports it. In theory it should work, but I have nothing to test it on in practice.

    And the last setting, $GLOBALS['need_readinglist'] = false; highlighted with a pile of exclamation marks and a comment: whether to pull Reader's main combined stream. It holds a lot of posts; in theory it is every post from every subscription lumped together and sorted by date, but in practice, for me, it contains only a bit more than half of the posts from my subscriptions. In any case it will be a huge file, it will take a long time to download, and it is unclear what it is for. Or, put another way: I do not know why anyone might need it. If you enlighten me in the comments, thanks in advance; maybe then it will make sense for me to download it too))
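    For reference, this "main stream" presumably maps to the reading-list state stream of the old API, requested like any other stream (a hypothetical example, not the script's exact code):

    <?php
    // Hypothetical: the combined reading list is just another state stream.
    $url = 'https://www.google.com/reader/atom/user/-/state/com.google/reading-list'
         . '?n=' . $GLOBALS['fetch_count'];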

    Conclusion


    Well, that's all; good luck to everyone, and I hope this little creation helps someone. And make some room on your hard drive - it pulled out about a gigabyte of data for me. For example, the subscription to Habr's main feed currently holds almost 80 thousand entries, the oldest of which are no longer available on Habr itself.

    P.S. I cannot answer the question of how to then import this saved data into some RSS reader. I suspect that not every reader will, in principle, support importing subscription content from external sources. For myself the question does not arise, since I am writing a reader for my own use under OS X; I do not yet know whether I will polish it up for everyone or keep it to myself. But since there are authors of some online readers here on Habr, they may well implement importing this data into their services later on. Or perhaps they will see how the whole history can be pulled out and implement it on their side - it is not for nothing that users complain almost everywhere that a reader, even if it supports import from GReader, somehow only pulls the most recent 500-1000 entries per subscription and that's it.
