The best posts from social networks

    My Telegram channel: https://t.me/winc0de
    Hello. In my free time I work on social-media projects. My friends and I run a fair number of "public pages" on different social networks, which lets us run various experiments. The pressing issue is finding relevant content and news to publish. So the idea came up to write a service that collects posts from the most popular pages and serves them according to a specified filter. For the initial test I chose the social networks VKontakte and Twitter.

    Technology


    First of all, I had to choose the data store (by the way, the number of stored records is already over 2 million, and this figure grows every day). The requirements were as follows: very frequent insertion of large amounts of data and fast queries over it.

    I had already heard about NoSQL databases and wanted to try them. I will not describe in this article the database comparisons I ran (MySQL vs SQLite vs MongoDB).
    For caching I chose memcached; later I will explain why and in which cases.
    The data collector is a Python daemon that updates all groups from the database concurrently.

    MongoDB and the daemon


    First of all, I wrote a prototype collector of posts from groups. Two problems stood out:
    • Storage size
    • API limitations

    One post with all its metadata takes about 5-6 KB, and an average group has about 20,000-30,000 posts, which works out to roughly 175 MB per group, and there are a lot of these groups. That is why I had to take on the task of filtering out uninteresting and advertising posts.

    I did not have to invent much: there are only 2 "tables", groups and posts. The first stores the groups that need to be parsed and updated, the second holds the whole body of posts from all groups. Now this seems to me an unnecessary and even bad decision: it would have been better to create a table per group, which would make selecting and sorting records easier, although even with 2 million records the query speed does not suffer. On the other hand, a single table simplifies querying across all groups.
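    For illustration, documents in the two collections might look roughly like this; the field names are inferred from the queries used later in the article (gid, created, likes, ...), and everything else is an assumption:

    import datetime

    # Hypothetical shape of a document in the groups collection
    group_doc = {
        'id'   : 42,              # VK group id
        'name' : 'some_public',   # assumed display field
    }

    # Hypothetical shape of a document in the posts collection
    post_doc = {
        'gid'         : 42,                             # id of the owning group
        'created'     : datetime.datetime(2014, 1, 1),  # stored as a MongoDB date
        'likes'       : 1500,
        'reposts'     : 120,
        'comments'    : 35,
        'text'        : 'Post text...',
        'attachments' : [],                             # raw attachment objects from the VK API
    }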

    API


    When you need server-side processing of data from the VKontakte social network, you create a standalone application that can issue an access token for any action. For such cases I keep a note with the following address (replace APP_ID with the ID of your standalone application):

    http://oauth.vk.com/authorize?client_id=APP_ID&redirect_uri=https://oauth.vk.com/blank.html&response_type=token&scope=groups,offline,photos,friends,wall

    The generated token gives access to the listed scopes at any time. The parser works as follows: take a group id, fetch all of its posts in a loop, filter out the "bad" posts at each iteration, and save the rest to the database. The main problem is speed: the VKontakte API allows 3 requests per second, and one request returns at most 100 posts, i.e. 300 posts per second.
    For the initial parse this is not so bad: an entire group can be pulled in about a minute. Updating, however, becomes a problem: the more groups there are, the longer the update cycle takes and the less fresh the output becomes.

    The solution was the execute method, which lets you batch several API calls into a single request. This way, in one request I do 5 iterations and get 500 posts, i.e. 1,500 posts per second, which lets an entire group be pulled in about 13 seconds.

    Here is the file with the code that is passed to execute:
    // VKScript passed to the execute method; the placeholders are substituted before the call
    var groupId = -|replace_group_id|;        // a group's wall has a negative owner_id
    var startOffset = |replace_start_offset|;
    var it = 0;
    var offset = 0;
    var walls = [];
    while(it < 5)
    {
        var count = 100;
        offset = startOffset + it * count;
        walls = walls + [API.wall.get({"owner_id": groupId, "count" : count, "offset" : offset})];
        it = it + 1;
    }
    return
    {
        "offset" : offset,
        "walls"  : walls
    };
    


    The code is read into memory, and the replace_group_id and replace_start_offset tokens are substituted. The result is an array of posts whose format is described on the official VK API page: vk.com/dev/wall.get
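    For illustration, the substitution and the execute call might look roughly like this (a sketch: the file name, token handling and API version are assumptions, not the author's exact code):

    import requests

    ACCESS_TOKEN = '...'  # token obtained via the OAuth URL above

    def fetch_wall_chunk(group_id, start_offset):
        # Read the VKScript template and substitute the placeholder tokens.
        with open('wall_chunk.vkscript') as f:          # hypothetical file name
            code = f.read()
        code = code.replace('|replace_group_id|', str(group_id))
        code = code.replace('|replace_start_offset|', str(start_offset))

        # One execute call runs all 5 wall.get iterations on the VK side.
        resp = requests.post('https://api.vk.com/method/execute', data={
            'code'         : code,
            'access_token' : ACCESS_TOKEN,
            'v'            : '5.131',
        })
        return resp.json()['response']                  # {'offset': ..., 'walls': [...]}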

    The next step is filtering. I looked through posts from different groups and came up with possible screening rules. First, I decided to drop every post that contains a link to an external page: it is almost always an advertisement.

    import re

    urls1 = re.findall('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', text)
    urls2 = re.findall(ur"[-a-zA-Z0-9@:%._\+~#=]{2,256}\.[a-z]{2,6}\b([-a-zA-Z0-9@:%_\+.~#?&//=]*)", text)
    if urls1 or urls2:
        return False  # ignore this post: it links to an external page
    


    Then I decided to exclude reposts entirely: they are 99% advertising. Few people simply repost someone else's page. The repost check is very simple:

    if item['post_type'] == 'copy':
        return False


    item is the next element of the walls collection returned by the execute method.

    I also noticed that many very old posts are empty: they have no attachments and no text. For this filter it is enough to check that both item['attachments'] and item['text'] are empty.
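    A minimal sketch of that check, assuming item is a post dict straight from the API response:

    def is_empty_post(item):
        # Very old posts often have neither attachments nor text; skip them.
        return not item.get('attachments') and not item.get('text')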

    And the last filter, which I arrived at empirically over time:
    yearAgo = datetime.datetime.now() - datetime.timedelta(days=200)
    createTime = datetime.datetime.fromtimestamp(int(item['date']))
    if createTime <= yearAgo and not attachments and len(text) < 75:
        return False  # ignore this post


    As in the previous case, many old posts still had text (a caption for the attached picture), but the pictures themselves were no longer available.

    The next step was to clean out the posts that simply "didn't take off":
    db.posts.aggregate([
        {
            $match : { gid : GROUP_ID }
        },
        {
            $group : { _id : "$gid", average : {$avg : "$likes"} }
        }
    ])
    


    This query runs against the posts collection, which has a likes field (the number of likes on a post), and returns the arithmetic mean of likes for the group.
    Now you can simply delete all posts older than 3 days that have fewer likes than the average:
    db.posts.remove(
    	{
    		'gid' : groupId, 
    		'created' : { '$lt' : removeTime }, 
    		'likes': { '$lt' : avg }
    	}
    )


    removeTime = datetime.datetime.now() - datetime.timedelta(days=3)
    avg = the result of the previous query, divided by two (tuned by trial and error)
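    Put together in the daemon, this cleanup step might look roughly like this with pymongo (a sketch: the function name is mine, and delete_many is the driver-side equivalent of the shell remove above):

    import datetime

    def clean_failed_posts(db, group_id):
        # Average number of likes in the group (same pipeline as the shell query above).
        result = list(db.posts.aggregate([
            {'$match' : {'gid': group_id}},
            {'$group' : {'_id': '$gid', 'average': {'$avg': '$likes'}}},
        ]))
        if not result:
            return
        avg = result[0]['average'] / 2                      # halved, tuned by trial and error
        removeTime = datetime.datetime.now() - datetime.timedelta(days=3)

        # Drop posts older than 3 days that never reached the (halved) average.
        db.posts.delete_many({
            'gid'     : group_id,
            'created' : {'$lt': removeTime},
            'likes'   : {'$lt': avg},
        })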
    


    The filtered posts are inserted into the database, and that completes the parsing. The only difference between the initial parse and an update is that the update is run exactly once per group, i.e. I fetch only the last 500 posts (5 × 100 via execute). That is usually enough, given that VKontakte introduced a limit of 200 posts per group per day.
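    A rough sketch of that difference, reusing the hypothetical fetch_wall_chunk helper from the earlier sketch (save_filtered_posts is likewise a stand-in for the filtering and insertion described above):

    def update_group(group_id):
        # One execute batch = the last 500 posts; enough for a daily update.
        chunk = fetch_wall_chunk(group_id, start_offset=0)
        for wall in chunk['walls']:
            save_filtered_posts(wall)       # hypothetical: apply the filters above and insert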

    Front end

    I won't go into much detail: JavaScript + jQuery + Isotope + inview + Mustache.
    • Isotope lays the posts out as a modern tile grid.
    • inview makes it easy to react when a particular element enters the viewport (in my case I remember which posts have already been seen and highlight new ones in a special color).
    • Mustache builds DOM objects from templates.


    Filtering posts by group

    A simple PHP script was written to output the data by group.
    Here is a helper function that turns the time-filter type into an object that can be used directly in a query:
        function filterToTime($timeFilter)
        {
            $mongotime = null;
            if ($timeFilter == 'year')
                $mongotime = new MongoDate(strtotime("-1 year", time()));
            else if ($timeFilter == 'month')
                $mongotime = new MongoDate(strtotime("-1 month", time()));
            else if ($timeFilter == 'week')
                $mongotime = new MongoDate(strtotime("-1 week", time()));
            else if ($timeFilter == 'day')
                $mongotime = new MongoDate(strtotime("midnight"));
            else if ($timeFilter == 'hour')
                $mongotime = new MongoDate(strtotime("-1 hour"));
            return $mongotime;
        }
    


    And the following code retrieves the 15 best posts of the week:
    $groupId   = 42;   // some group id
    $numPosts  = 15;   // posts per page
    $offset    = 0;    // first page
    $mongotime = filterToTime('week');
    $findCondition = array('gid' => $groupId, 'created' => array('$gt' => $mongotime));
    $posts = $mongoHandle->posts->find($findCondition)
        ->sort(array('likes' => -1))
        ->limit($numPosts)
        ->skip($offset * $numPosts);
    


    Index page logic

    Per-group statistics are interesting, but it is far more interesting to build an overall rating of absolutely all groups and their posts. If you think about it, the task is genuinely hard:
    a rating can only be built from 3 factors: the number of likes, reposts and subscribers. The more subscribers, the more likes and reposts, but that says nothing about the quality of the content.

    Groups with millions of subscribers often publish junk that has been circulating around the Internet for years, and among a million subscribers there will always be someone to like and repost it.
    It is easy to build a rating from raw numbers, but the result cannot be called a rating of posts by quality and uniqueness.
    I had ideas for deriving a quality factor for each group: building a timeline, tracking user activity over each period of time, and so on.
    Unfortunately, I did not come up with an adequate solution. If you have any ideas, I will be glad to hear them.

    The first thing I realized was that the contents of the index page have to be computed ahead of time and cached for all users, because building the page is a very slow operation. This is where memcached comes to the rescue. For the simplest logic, the following algorithm was chosen:
    1. Loop over all groups
    2. Take all the posts of the i-th group and select the 2 best ones for the specified period of time


    As a result, there are at most 2 posts from any single group. Of course, this is not the most rigorous ranking, but in practice it produces good statistics and relevant content.

    Here is the thread code that regenerates the index page every 15 minutes:

        # timeDelta  - datetime lower bound for the time filter
        # filterTime - name of the time filter (hour, day, week, year, alltime)
        # filterType - likes, reposts, comments
        # deep       - page number: 0, 1, ...
        def _get(self, timeDelta, filterTime, filterType='likes', deep=0):
            groupList = groups.find({}, {'_id' : 0})
            allPosts  = []
            allGroups = []
            for group in groupList:
                allGroups.append(group)
                postList = db['posts'].find({'gid' : group['id'], 'created' : {'$gt' : timeDelta}}) \
                    .sort(filterType, -1).skip(deep * 2).limit(2)
                for post in postList:
                    allPosts.append(post)
            result = {
                'posts'  : allPosts[:50],
                'groups' : allGroups
            }
            # Converts MongoDB dates to timestamps when serializing to JSON
            dthandler = lambda obj: (time.mktime(obj.timetuple()) if isinstance(obj, (datetime.datetime, datetime.date)) else None)
            jsonResult = json.dumps(result, default=dthandler)
            key = 'index_' + filterTime + '_' + filterType + '_' + str(deep)
            print 'Setting key:', key
            self.memcacheHandle.set(key, jsonResult)
    


    I will describe the filters that affect the output:
    Time: hour, day, week, month, year, all time
    Type: likes, reposts, comments

    Datetime objects were created for all the time spans:
            hourAgo  = datetime.datetime.now() - datetime.timedelta(hours=3)
            midnight = datetime.datetime.now().replace(hour=0, minute=0, second=0, microsecond=0)
            weekAgo  = datetime.datetime.now() - datetime.timedelta(weeks=1)
            monthAgo = datetime.datetime.now() + dateutil.relativedelta.relativedelta(months=-1)
            yearAgo  = datetime.datetime.now() + dateutil.relativedelta.relativedelta(years=-1)
            alltimeAgo = datetime.datetime.now() + dateutil.relativedelta.relativedelta(years=-10)
    


    Each of them is passed in turn to the _get function together with every filter type (likes, reposts, comments). On top of that, 5 pages have to be generated for each filter combination. As a result, keys like the following end up in memcached:

    Setting key: index_hour_likes_0
    Setting key: index_hour_reposts_0
    Setting key: index_hour_comments_0
    Setting key: index_hour_common_0
    Setting key: index_hour_likes_1
    Setting key: index_hour_reposts_1
    Setting key: index_hour_comments_1
    Setting key: index_hour_common_1
    Setting key: index_hour_likes_2
    Setting key: index_hour_reposts_2
    Setting key: index_hour_comments_2
    Setting key: index_hour_common_2
    Setting key: index_hour_likes_3
    Setting key: index_hour_reposts_3
    Setting key: index_hour_comments_3
    Setting key: index_hour_common_3
    Setting key: index_hour_likes_4
    Setting key: index_hour_reposts_4
    Setting key: index_hour_comments_4
    Setting key: index_hour_common_4
    Setting key: index_day_likes_1
    Setting key: index_day_reposts_1
    Setting key: index_day_comments_1
    Setting key: index_day_common_1
    Setting key: index_day_likes_2
    Setting key: index_day_reposts_2
    Setting key: index_day_comments_2
    Setting key: index_day_common_2
    Setting key: index_day_likes_3
    Setting key: index_day_reposts_3
    ...
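    For completeness, the loop inside the thread that produces all these keys might look roughly like this (a sketch: it assumes the datetime variables defined above are in scope and _get is the method shown earlier):

        timeFilters = {
            'hour'    : hourAgo,
            'day'     : midnight,
            'week'    : weekAgo,
            'month'   : monthAgo,
            'year'    : yearAgo,
            'alltime' : alltimeAgo,
        }
        for filterTime, timeDelta in timeFilters.items():
            for filterType in ('likes', 'reposts', 'comments'):
                for deep in range(5):               # 5 pages per filter combination
                    self._get(timeDelta, filterTime, filterType, deep)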


    On the client side, the required key is simply constructed and the JSON string is pulled from memcached.
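    The lookup itself is trivial; in Python it would look roughly like this (the PHP front end does the equivalent with its memcached client; the host, port and helper name here are assumptions):

    import json
    import memcache   # python-memcached

    mc = memcache.Client(['127.0.0.1:11211'])

    def load_index_page(filterTime='day', filterType='likes', deep=0):
        key = 'index_%s_%s_%d' % (filterTime, filterType, deep)
        cached = mc.get(key)
        return json.loads(cached) if cached else None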

    Twitter

    The next interesting task was to surface popular tweets from the CIS countries. It is not easy either: I wanted relevant rather than "trashy" content. I was quite surprised by Twitter's limitations: you cannot simply take a set of users and pull all of their tweets. The API severely limits the number of requests, so the VKontakte approach of keeping a list of popular accounts and constantly parsing their tweets does not work.

    A day later the solution came: create a Twitter account and follow all the important people whose topics interest us. The trick is that in roughly 80% of cases one of these people will retweet whatever is popular. That is, we do not need to keep a list of all accounts in the database; it is enough to gather 500-600 active people who are always on trend and retweet genuinely interesting and popular tweets.
    The Twitter API has a method that returns the home timeline, i.e. the tweets of those we follow together with their retweets. All we need is to read our feed as deeply as possible every 10 minutes and save the tweets, applying the same filters and processing as for VKontakte.

    So another thread was written inside the daemon, which runs code like this once every 10 minutes:
        from twython import Twython

        def __init__(self):
            self.twitter = Twython(APP_KEY, APP_SECRET, TOKEN, TOKEN_SECRET)

        def logic(self):
            lastTweetId = 0
            for i in xrange(15):                    # the number 15 was found by trial and error
                self.getLimits()                    # author's helper: check the remaining API limits
                tweetList = []
                if i == 0:
                    tweetList = self.twitter.get_home_timeline(count=200)
                else:
                    tweetList = self.twitter.get_home_timeline(count=200, max_id=lastTweetId)
                if len(tweetList) <= 1:
                    print 'Only 1 tweet left, breaking'   # the API will not give us any more tweets
                    break
                # ... process the tweets here ...
                lastTweetId = tweetList[len(tweetList) - 1]['id']
    


    Then comes the usual, boring code: we take tweetList, loop through it and process each tweet. The list of fields is in the official documentation. The only thing I want to highlight:

        for tweet in tweetList:
            localData = None
            if 'retweeted_status' in tweet:
                localData = tweet['retweeted_status']
            else:
                localData = tweet
    


    In the case of a retweet we want to save not the tweet of the person we follow but the original one. If the current record is a retweet, its 'retweeted_status' key contains exactly the same kind of tweet object, only the original.

    Conclusion

    The site still has problems with design and layout (I have never been a web developer), but I hope someone finds the information described here useful. I have been working with social networks and their APIs for a long time and know plenty of tricks. If anyone has questions, I will be happy to help.

    Well, a few pictures:

    Index Page:



    The page of one of the groups that I constantly monitor:



    Twitter per day:



    Thanks for your attention.
