Download chat history with all VK users using Python

For a linguistic research project, I needed a corpus of direct speech produced by a single person. I decided that the most convenient place to start was my own correspondence in VK. This article shows how to download every message you have ever sent to your friends, using a Python program and the VKontakte API. To work with the API we will use the vk library.

To work with the site, you need to create an application and authorize with a token. The process is not complicated and is described here and here.

So, the token has been obtained. We import the necessary libraries (time and re will be needed later), connect to our application and get started.

import vk
import time
import re
session = vk.Session(access_token='your_token')
vkapi = vk.API(session)

Since we want the correspondence with all our friends, we start by getting the list of friends. Processing the complete list can take quite a while, so for testing you can enter the ids of a few friends by hand.

friends = vkapi('friends.get') # get the full friends list for the user
# friends = [1111111, 2222222, 33333333] # set the friends manually

With the list of friends in hand, we could start downloading the dialogs right away, but I want to process only dialogs that contain more than 200 messages, since short conversations with people I barely know are not very interesting to me. So we will write a function that returns the “headers” of the dialogs.

def get_dialogs(user_id):
	dialogs = vkapi('messages.getDialogs', user_id=user_id)
	return dialogs

This function returns the “header” of the dialog with the user whose id equals the given user_id. Its result looks something like this:

[96, {'title': ' ... ', 'body': '', 'mid': 333333, 'read_state': 1, 'uid': 111111, 'date': 1490182267, 'fwd_messages': [{'date': 1490173134, 'body': 'Не, ну все это и так понятно, но нам же там жить.', 'uid': 222222}], 'out': 0}]

The list contains the number of messages (96) and the data of the last message in the dialog. Now we have everything we need to download the necessary dialogs.
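A minimal sketch of how such a header could be checked against the 200-message threshold. The helper names dialog_length and is_long_dialog are hypothetical, and the sketch assumes the header always has the shape shown above: a list whose first element is the message count.

```python
def dialog_length(header):
    """Return the message count from a header shaped like [count, last_message]."""
    # Assumption: the first element of the list is the total number of messages.
    return header[0] if header else 0


def is_long_dialog(header, min_messages=200):
    """True if the dialog contains more than `min_messages` messages."""
    return dialog_length(header) > min_messages


# Shortened version of the sample header shown above.
sample_header = [96, {'mid': 333333, 'uid': 111111, 'out': 0, 'body': ''}]
print(dialog_length(sample_header), is_long_dialog(sample_header))
```

With the sample header, the dialog has 96 messages and would be skipped.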

The main drawback is that VKontakte allows at most about three requests per second, so after each request you have to wait a little. This is what the time library is for. The shortest delay I managed to set without requests being rejected after a few calls is 0.3 seconds.

Another difficulty is that a single request returns at most 200 messages. We will have to work around that as well. Let's write a function.
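The waiting can also be wrapped in a small helper. This is only a sketch and not what the program in this article does: the Throttle class is a hypothetical name, and instead of a fixed pause after every request it sleeps only for whatever is left of the interval since the previous call.

```python
import time


class Throttle:
    """Make sure at least `min_interval` seconds pass between consecutive calls."""

    def __init__(self, min_interval=0.3):
        self.min_interval = min_interval
        self._last = None  # time of the previous call, None before the first one

    def wait(self):
        now = time.monotonic()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                time.sleep(remaining)
        self._last = time.monotonic()


throttle = Throttle(0.3)
throttle.wait()  # call this right before each API request
```

The first call returns immediately; later calls block only when they come too soon after the previous one.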

def get_history(friends, sleep_time=0.3):
	all_history = []
	i = 0
	for friend in friends:
		friend_dialog = get_dialogs(friend)
		time.sleep(sleep_time)
		dialog_len = friend_dialog[0]
		friend_history = []
		if dialog_len > 200:
			resid = dialog_len
			offset = 0
			while resid > 0:
				friend_history += vkapi('messages.getHistory', 
					user_id=friend, 
					count=200, 
					offset=offset)
				time.sleep(sleep_time)
				resid -= 200
				offset += 200
				if resid > 0:
					print('--processing', friend, ':', resid, 
						'of', dialog_len, 'messages left')
			all_history += friend_history
		i +=1
		print('processed', i, 'friends of', len(friends))
	return all_history

Let’s see what happens here.

We go through the list of friends and request the dialog header for each of them, then look at the length of the dialog. If the dialog has 200 messages or fewer, we simply move on to the next friend. If it is longer, we download the first 200 messages (the count argument), add them to the message history for this friend and calculate how many messages are still left to download (resid). As long as the remainder is greater than 0, each iteration increases the offset argument by 200; offset sets how many messages to skip, counting from the end of the dialog.

Because of the pause after each request, the program runs for quite a long time, so I added a short progress report at each step to show what is being processed and how much is left.
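The sequence of offset values that the loop goes through can be sketched separately. The function page_offsets is a hypothetical helper and is not part of the program.

```python
def page_offsets(total, page_size=200):
    """Offsets needed to cover `total` messages in pages of `page_size`."""
    return list(range(0, total, page_size))


print(page_offsets(450))  # [0, 200, 400]
```

A dialog of 450 messages therefore takes three requests, the last of which returns the remaining 50 messages.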

NB: the messages.get method has an out argument that asks the server to return only outgoing messages. I decided not to use it and to select the messages I need after downloading, for two reasons: a) the file has to be cleaned anyway, because the server returns each message as a dictionary with a lot of technical information, and b) the messages from my interlocutors may also be of interest for my research.
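Selecting outgoing messages after the download could look roughly like this. It is a sketch under one assumption: the out field seen in the sample header above equals 1 for messages sent by the owner of the token. The function name outgoing_only is hypothetical, and the program in this article filters by uid and from_id instead.

```python
def outgoing_only(messages):
    """Keep message dictionaries whose 'out' flag is 1."""
    # Non-dict items (for example, message counters) are skipped.
    return [m for m in messages if isinstance(m, dict) and m.get('out') == 1]


sample = [
    2,  # a counter mixed in with the messages
    {'mid': 1, 'body': 'hi', 'out': 1},
    {'mid': 2, 'body': 'hello', 'out': 0},
]
print(outgoing_only(sample))
```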

Each downloaded message is a dictionary and looks something like this:
{'read_state': 1, 'date': 1354794668, 'body': 'Вот так!<br>Потому что тут модель вышла довольно непонятная.', 'uid': 111111, 'mid': 222222, 'from_id': 111111, 'out': 1}


All that remains is to clean the result and save it to a file. This part of the work no longer involves the VK API, so I will not go into detail. There is not much to say anyway: we select the necessary field (body) for the desired user and use re to remove the line breaks, which are marked with the <br> tag. Then we save everything to a file.
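A minimal sketch of this cleaning step, assuming that line breaks inside body arrive as <br> tags. The names clean_body and save_lines are hypothetical, and writing one message per line is my choice for the sketch; the full program below saves the whole list in a single print call.

```python
import re


def clean_body(body):
    """Replace <br> tags with spaces and collapse repeated whitespace."""
    text = re.sub(r'<br\s*/?>', ' ', body)
    return re.sub(r'\s+', ' ', text).strip()


def save_lines(texts, file_name='corpus.txt'):
    """Write each text on its own line in UTF-8."""
    with open(file_name, 'w', encoding='utf-8') as f:
        for text in texts:
            f.write(text + '\n')


print(clean_body('First line<br>second line'))  # First line second line
```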

The complete program code looks like this:

import vk
import time
import re
session = vk.Session(access_token='your_token')
vkapi = vk.API(session)
SELF_ID = 111111
SLEEP_TIME = 0.3
friends = vkapi('friends.get') # get the full friends list for the current user
def get_dialogs(user_id):
	dialogs = vkapi('messages.getDialogs', user_id=user_id)
	return dialogs
def get_history(friends, sleep_time=0.3):
	all_history = []
	i = 0
	for friend in friends:
		friend_dialog = get_dialogs(friend)
		time.sleep(sleep_time)
		dialog_len = friend_dialog[0]
		friend_history = []
		if dialog_len > 200:
			resid = dialog_len
			offset = 0
			while resid > 0:
				friend_history += vkapi('messages.getHistory', 
					user_id=friend, 
					count=200, 
					offset=offset)
				time.sleep(sleep_time)
				resid -= 200
				offset += 200
				if resid > 0:
					print('--processing', friend, ':', resid, 
						'of', dialog_len, 'messages left')
			all_history += friend_history
		i +=1
		print('processed', i, 'friends of', len(friends))
	return all_history
def get_messages_for_user(data, user_id):
	self_messages = []
	for dialog in data:
		if type(dialog) == dict:
			if dialog['uid'] == user_id and dialog['from_id'] == user_id:
				m_text = re.sub("<br>", " ", dialog['body'])
				self_messages.append(m_text)
	print('Extracted', len(self_messages), 'messages in total')
	return self_messages
def save_to_file(data, file_name='output.txt'):
	with open(file_name, 'w', encoding='utf-8') as f:
		print(data, file=f)
if __name__ == '__main__':
	all_history = get_history(friends, SLEEP_TIME)
	save_to_file(all_history, 'raw.txt')
	self_messages = get_messages_for_user(all_history, SELF_ID)
	save_to_file(self_messages, 'sm_corpus.txt')

When I ran the program, I had 879 friends in VK. Processing them took about 25 minutes. The raw result file was 74 MB in size; after extracting only the text of my own messages, 15 MB remained. The resulting corpus contains about 150,000 messages, and their text fills 3,707 pages in a Word document.

I hope this article will be useful to someone. All the methods available through the VK API are described in detail in the section for VKontakte developers.
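The running time is dominated by the pauses, so a lower bound can be estimated from the number of requests. This is a rough sketch with hypothetical function names; it counts only sleep time, not network time, and assumes the same logic as get_history: one header request per friend plus one request per page of 200 messages for dialogs longer than 200 messages.

```python
def count_requests(dialog_lengths, page_size=200, min_messages=200):
    """Number of API requests get_history would make for the given dialog lengths."""
    requests = len(dialog_lengths)  # one header request per friend
    for n in dialog_lengths:
        if n > min_messages:
            requests += -(-n // page_size)  # ceiling division: pages to download
    return requests


def min_sleep_seconds(dialog_lengths, sleep_time=0.3):
    """Total time spent sleeping, one pause per request."""
    return count_requests(dialog_lengths) * sleep_time


# Three friends: only the 450-message dialog is downloaded, in three pages.
print(count_requests([96, 450, 200]))  # 6
```

For 879 friends, the header requests alone account for 879 × 0.3 ≈ 264 seconds of waiting, a little over four minutes.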
