Using the Twitter API to build a "tree" of users
Good morning, Habr. In this post I would like to share a little experience with the Twitter API, in particular with parsing a large number of users and retrieving information about each one (account creation date, username, screen_name, user web page, number of tweets, number of friends, number of followers, location). This is my first post, so please don't judge it too harshly, though I have nothing against constructive criticism.
The task: there is a set of about 100 active and well-respected Twitter users (T0). For each of them I needed to get the list of friends (T1) and retrieve personal data for every user. In the same way we build T2 (the friends of users in T1) and T3.
As a result, we have a user base T = T0 + T1 + T2 + T3. Since the average Twitter user has about 1,286 friends (statistics based on data from roughly 80 million accounts), the number of users in each group grows very quickly:
- T0: 100 users
- T1: ~42,000 unique users
- T2: ~5,200,000 unique users
- T3: ~80 million unique users
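The growth of each layer follows directly from the average friend count. A quick back-of-the-envelope sketch (the 1,286 average comes from the statistics above; the rest is simple arithmetic, and the drop from the raw bound to the unique counts reflects deduplication of overlapping friend lists):

```python
# Back-of-the-envelope estimate of layer sizes in the friend "tree".
# The ~1,286 average friend count comes from the article; the unique
# counts in the list above are smaller because friend lists overlap.

AVG_FRIENDS = 1286

def raw_layer_size(prev_unique: int) -> int:
    """Upper bound on the next layer, before deduplication."""
    return prev_unique * AVG_FRIENDS

print(raw_layer_size(100))      # 128600 raw candidates for T1 (~42,000 unique)
print(raw_layer_size(42_000))   # 54012000 raw candidates for T2 (~5.2M unique)
```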
When parsing this many accounts, the first problem you run into is the API request limit. You can make 150 requests per hour unauthenticated and 350 per hour authenticated. On top of that, these 150/350 requests are split into two 30-minute windows, so each user can execute 75/175 requests per 30 minutes. This is clearly not enough to collect so much data. To get around it, I used a pool of about 3,000 accounts (bots) from a botnet that I had developed for the same customer (if anyone is interested, I can describe the botnet's functionality and some "pitfalls" in a separate post). That gave me a budget of almost 0.5 million requests per 30 minutes, and at that point everything came down to the speed of processing API responses and writing the data to the database.
To communicate with the API, I did not reinvent the wheel and used abraham's oauth library, widely known in narrow circles. I only modified it slightly so that it could use multi_curl (remember, we need to make a lot of requests).
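The point of the multi_curl patch is simply to run many HTTP requests in parallel instead of one at a time. A rough Python analogue of the same idea, where `fetch` is a hypothetical stand-in for one OAuth-signed API call:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch(url: str) -> str:
    # Placeholder: in the real script this would be an OAuth-signed
    # HTTP request to the Twitter API.
    return f"response for {url}"

def fetch_many(urls, workers=50):
    """Run many API calls concurrently, as curl_multi does in PHP."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves the order of the input URLs in the results.
        return list(pool.map(fetch, urls))

results = fetch_many(
    [f"https://api.twitter.com/1/friends/ids.json?user_id={i}" for i in range(3)]
)
```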
To get a user's friend list, I used the friends/ids API method. It returns the list of IDs of the accounts the user follows. If the number of friends exceeds 5,000, the result is split into pages (I took at most 5,000 friends per user and did not make additional requests when there were more).
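A sketch of that logic, with `get_friends_page` as a hypothetical stand-in for the actual friends/ids call (the real response carries the ID list plus a cursor for the next page):

```python
MAX_FRIENDS = 5000  # friends/ids returns at most 5,000 IDs per page

def get_friends_page(user_id, cursor=-1):
    # Placeholder for GET friends/ids; the real JSON response contains
    # an "ids" array and a "next_cursor" field (0 means no more pages).
    return {"ids": [101, 102, 103], "next_cursor": 0}

def get_friends(user_id, follow_pages=False):
    ids, cursor = [], -1
    while True:
        page = get_friends_page(user_id, cursor)
        ids.extend(page["ids"])
        cursor = page["next_cursor"]
        # The article's shortcut: stop after the first page,
        # i.e. take at most 5,000 friends per user.
        if not follow_pages or cursor == 0 or len(ids) >= MAX_FRIENDS:
            break
    return ids[:MAX_FRIENDS]
```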
Once we have the friends of all users, we need to fetch the data about each of them. For this there is a wonderful method, users/lookup. We take IDs from the database in batches of 100 and parse the data.
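The batching step can be sketched as follows; `lookup_users` here is a hypothetical wrapper around the real users/lookup call, which accepts up to 100 IDs per request:

```python
def lookup_users(chunk):
    # Placeholder for GET users/lookup?user_id=<comma-separated ids>;
    # the real response is a list of full user objects.
    return [{"id": i} for i in chunk]

def batches(ids, size=100):
    """Split a list of user IDs into lookup-sized chunks of 100."""
    for i in range(0, len(ids), size):
        yield ids[i:i + size]

def lookup_all(ids):
    profiles = []
    for chunk in batches(ids):
        # One users/lookup request per 100 IDs.
        profiles.extend(lookup_users(chunk))
    return profiles
```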
As a result, we end up with a fairly large user database. Here are some statistics:
- average number of tweets: ~4,317
- average number of friends: ~1,286
- average number of followers: ~35,045