Analyze Twitter data in the cloud with Apache Hadoop and Hive
- Translation
- Tutorial
This guide describes how to query, explore, and analyze Twitter data using the Apache Hadoop-based services for Windows Azure, as well as a Hive query in Excel. Social networks are a major source of big data, so the public APIs of social media services such as Twitter are a useful source of information for understanding network trends.
This guide consists of the following sections.
- Search, Download, Install, and Use Microsoft Analytics for Twitter
- Getting Twitter feeds using cURL and the Twitter Streaming API
- Request and configure a new Hadoop on a Windows Azure cluster
- Processing Twitter data with Hive on Hadoop in a Windows cluster
- Configure Hive ODBC and Hive dashboards in Excel to get Hive data
Search, Download, Install, and Use Microsoft Analytics for Twitter
Microsoft Analytics for Twitter is available for download at the following address: http://www.microsoft.com/download/en/details.aspx?id=26213 .
It requires the Excel 2010 application and the PowerPivot add-in, which can be downloaded at http://www.microsoft.com/download/en/details.aspx?id=29074 .
Paste the following movie tags and accounts into the query window:
#moneyball, @MoneyballMovie, @helpmovie, @BridesmaidsSay, @CONTAGION_movie
Click the button and follow the on-screen instructions.
Note: Hadoop is not used in this section. It only demonstrates working with the Twitter Search API and self-service business intelligence in Excel and PowerPivot.
Getting Twitter feeds using cURL and the Twitter Streaming API
At this point, curl.exe is required. Download the curl binary for your OS (for example, the SSL-enabled binary for 64-bit Windows) from http://curl.haxx.se/download.html
and unzip curl.exe into a suitable folder (for example, C:\twitterdata).
Copy the two files, get_twitter_stream.cmd and twitter_params.txt, from the Step2GetTwitterFeedUsingCURLAndTwitterStreamingAPI folder to the folder containing curl.exe:
Edit the twitter_params.txt file as follows to track tweets about the movies:
track=moneyball,MoneyballMovie,helpmovie,BridesmaidsSay,CONTAGION_movie
Edit the get_twitter_stream.cmd command script, inserting your Twitter username instead of USER and your password instead of PASSWORD in the following line:
curl -d @twitter_params.txt -k stream.twitter.com/1/statuses/filter.json -uUSER:PASSWORD >> twitter_stream_seq.txt
Run the get_twitter_stream.cmd script from the command line as follows:
You will see output similar to the following:
To stop the script, press Ctrl+C. After that, you can rename the output file and re-run the script.
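The Streaming API writes one JSON tweet object per line to twitter_stream_seq.txt. As a rough illustration of what can be done with that file later, the short Python sketch below counts how often each tracked keyword appears. The sample tweets are made up; only the one-JSON-object-per-line format follows the Streaming API.

```python
import json
from collections import Counter

# Hypothetical sample lines in the format the Streaming API writes to
# twitter_stream_seq.txt: one JSON tweet object per line. The tweets
# themselves are made up for illustration.
sample_lines = [
    '{"text": "Loved #moneyball tonight!", "user": {"screen_name": "alice"}}',
    '{"text": "@MoneyballMovie was great", "user": {"screen_name": "bob"}}',
    '{"text": "#moneyball again", "user": {"screen_name": "carol"}}',
]

keywords = ["moneyball", "MoneyballMovie", "helpmovie"]
counts = Counter()
for line in sample_lines:
    tweet = json.loads(line)
    text = tweet.get("text", "").lower()
    for kw in keywords:
        # Naive substring match, so "moneyball" also matches "@MoneyballMovie".
        if kw.lower() in text:
            counts[kw] += 1

print(dict(counts))
```

In the rest of this guide the same kind of keyword matching is done at scale with Hive rather than locally.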
Request and configure a new Hadoop on a Windows Azure cluster
At this point, CTP access to the Apache Hadoop-based service on Windows Azure is required. Go to https://www.hadooponazure.com/ and click the invitation link. If you already have access, click the Sign in button.
Request a new cluster. The following example requests a large cluster named mailboxpeak. Enter a username and password, and then click Request cluster. For any questions, see the Apache Hadoop-based Services for Windows Azure How To and FAQ guide.
Open the FTPS and ODBC ports to access the server.
Click the Interactive Console icon.
Create a directory for the Twitter text file on HDFS using the following JavaScript command:
js> #mkdir /example/data
To upload the test text files, run the following commands:
js> #put
Source: C:\hadoop\example\data\Sample.txt
Destination: /example/data
To upload large (uncompressed) text files directly to HDFS, you need curl.exe. If you do not have it, download it according to the instructions in step 2 and unzip it to a suitable folder, for example C:\hadoop\example\data. Then open PowerShell, go to C:\hadoop\example\data, and paste the following PowerShell script to upload the SampleData text file (SampleData.txt) over FTPS:
C:\hadoop\example\data>
#----- begin curl ftps to hadoop on azure powershell example -
#----- Replace XXXXXXX with the appropriate servername/username/password
$serverName = "XXX.cloudapp.net"; $userName = "XXXX";
$password = "XXXXXXXX";
$fileToUpload = "SampleData.txt"; $destination = "/example/data/";
$Md5Hasher = [System.Security.Cryptography.MD5]::Create();
$hashBytes = $Md5Hasher.ComputeHash($([Char[]] $password))
foreach ($byte in $hashBytes) { $passwordHash += "{0:x2}" -f $byte }
$curlCmd = ".\curl -k --ftp-create-dirs -T $fileToUpload -u $userName"
$curlCmd += ":$passwordHash ftps://$serverName" + ":2226$destination"
# run the assembled curl command
invoke-expression $curlCmd
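Note that the script authenticates over FTPS with an MD5 hex digest of the password rather than the password itself. For checking what value the loop above builds, here is a hedged Python equivalent (assumption: the password is ASCII, so the PowerShell [Char[]] cast yields the same bytes as an ASCII encoding):

```python
import hashlib

def password_hash(password: str) -> str:
    # MD5 over the password's characters, formatted as lowercase hex,
    # mirroring the "{0:x2}" per-byte formatting in the PowerShell loop.
    # Assumption: the password is ASCII, so the PowerShell [Char[]] cast
    # produces the same bytes as an ASCII encoding.
    return hashlib.md5(password.encode("ascii")).hexdigest()

print(password_hash("password"))  # 5f4dcc3b5aa765d61d8327deb882cf99
```

The resulting 32-character hex string is what gets embedded in the curl command after the username and colon.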
Very large files should be compressed before uploading. A compressed file (with a .gz or similar extension) can be uploaded to your Windows Azure storage account. Using the CloudXplorer program (http://clumsyleaf.com/products/cloudxplorer), the file is uploaded as follows:
After setting up your Windows Azure storage account and installing CloudXplorer, go to the Windows Azure portal and copy the primary access key of your storage account by clicking the View button in the right column.
Then open CloudXplorer and select File -> Manage Accounts. A new dialog box opens. Click New and select Windows Azure account.
In the next dialog box, paste the name of the storage account that you specified when setting it up (for example, hadoopdemo) and the access key you copied.
In the new storage account, create a container (in Windows Azure, directories are called containers).
Upload (copy) the ZIP archive to the container (in our case, the container is called data).
Set up your Windows Azure Blob Storage account by clicking the Manage Data icon
next to Set up ASV.
Now you need the Windows Azure storage account name (in our case, hadoopdemo) and the primary access key.
Enter the name of the Windows Azure storage account and the primary access key, and then click Save settings.
Processing Twitter data with Hive on Hadoop in a Windows cluster
Go to https://www.hadooponazure.com/ . Connect to the Hadoop head node by clicking Remote Desktop .
Click the Open button.
Log in to the remote server with the name and password that you used when creating the cluster in step 3.
Create a directory (for example, c:\Apps\dist\example\data) on the remote Hadoop head node server (on the NTFS side) using Explorer or the command line, and then go to it.
Copy the entire contents of the CopyToHeadnode folder to the new directory. This includes the HiveUDFs.jar file (user-defined functions for Hive queries), gzip, and the Hive query text files. Also copy the file All steps to run from the Hadoop Command Shell.txt to simplify the last part of this step.
RDP supports copying files between the local and remote desktops. Sometimes Hadoop decompresses a gzip file while it is being copied to HDFS.
Open the Hadoop Command Shell on the remote desktop.
Go to c:\Apps\dist\example\data.
Copy the twitter_stream_seq8.txt.gz file from Windows Azure storage to the c:\Apps\dist\example\data folder (on the NTFS side). The location of the file in the storage account depends on the Windows Azure storage associations specified in step 3. In our case, the container is called data and is displayed in the line under asv://:
c:\Apps\dist\example\data> hadoop fs -copyToLocal asv://data/twitter_stream_seq8.txt.gz twitter_stream_seq8.txt.gz
Unzip the twitter_stream_seq8.txt.gz archive into the c:\Apps\dist\example\data folder, as shown below (you will need the gzip.exe program, which you can download from http://www.gzip.org/ and put in the directory the command is executed from):
c:\Apps\dist\example\data> gzip -d -N twitter_stream_seq8.txt.gz
Note: Sometimes Hadoop unpacks a file while copying it to HDFS, but this only works for .bz2 (bzip2, http://bzip.org/) archives:
hadoop fs -copyFromLocal twitter_stream_seq8.txt.gz /example/data/twitter_stream_seq8.txt
Copy twitter_stream_seq8.txt from the c:\Apps\dist\example\data folder to HDFS with the following command:
c:\Apps\dist\example\data> hadoop fs -copyFromLocal twitter_stream_seq8.txt /example/data/twitter_stream_seq8.txt
Make sure the file is now on HDFS. To do this, open the interactive console
and go to the /example/data folder.
The following few steps are contained in the All steps to run from the Hadoop Command Shell.txt file that you copied to the head node.
Create and load the twitter_raw table with the following command:
c:\apps\dist\example\data> hive -v -f load_twitter_raw.txt
The table will be created in the /hive/warehouse directory on the HDFS side:
You can verify this in Hive by typing c:\Apps\dist\example\data> hive and then hive> show tables; as shown below.
To exit Hive, use the hive> quit; command. Create and load the twitter_temp table as follows:
c: \ apps \ dist \ example \ data> hive -v -f create_twitter_temp.txt
With 4 nodes, this operation takes more than 20 minutes; with 8 nodes, it takes 8 minutes 55 seconds. Check the progress in the following window:
Click a task to view details and progress. The operation may take more than 20 minutes.
You can also monitor the progress of the task using the Hadoop Command Shell:
You can verify this in Hive by typing c:\Apps\dist\example\data> hive and then hive> show tables;:
Create and download twitter_stream as follows:
c: \ apps \ dist \ example \ data> hive -v -f create_twitter_stream.txt
With 4 nodes, this operation takes more than 60 minutes; with 8 nodes, it takes 31 minutes 54 seconds. Track the progress as described above. Create and load the twitter_stream_sample table with the following command:
c: \ apps \ dist \ example \ data> hive -v -f create_twitter_stream_sample.txt
Track the progress as described above. Create and load the twitter_movies table as follows:
c: \ apps \ dist \ example \ data> hive -v -f create_twitter_movies.txt
Track the progress as described above. Create and load the twitter_movies_vw view with the command:
c: \ apps \ dist \ example \ data> hive -v -f create_twitter_movies_vw.txt
Track the progress as described above.
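The Hive scripts themselves are not reproduced in this guide, but conceptually the pipeline extracts movie mentions from the raw tweets and aggregates them per movie. A rough, self-contained Python sketch of that aggregation step follows; the tweet records, field names, and movie-to-tag mapping are made up for illustration and do not come from the actual scripts.

```python
from collections import Counter

# Made-up tweet records standing in for rows of the parsed Twitter tables;
# the real Hive scripts and their schemas are not reproduced in this guide.
tweets = [
    {"text": "Moneyball is great #moneyball"},
    {"text": "Watching @CONTAGION_movie tonight"},
    {"text": "#moneyball was a hit"},
]

# Hypothetical mapping from movie to the tags tracked in step 2.
movies = {
    "moneyball": ["moneyball", "moneyballmovie"],
    "contagion": ["contagion_movie"],
}

mentions = Counter()
for tweet in tweets:
    text = tweet["text"].lower()
    for movie, tags in movies.items():
        if any(tag in text for tag in tags):
            mentions[movie] += 1

# Roughly what a "SELECT movie, COUNT(*) ... GROUP BY movie" over the
# per-tweet table would return.
for movie, n in sorted(mentions.items()):
    print(movie, n)
```

Hive distributes exactly this kind of scan-and-group work across the cluster's nodes, which is why the table-creation steps above take minutes even on 8 nodes.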
Configure Hive ODBC and Hive dashboards in Excel to get Hive data
This section is taken from the Apache Hadoop-based Services for Windows Azure How To and FAQ document, which is available from the Hadoop download tile on the Windows Azure portal.
From there, you can download HiveODBCSetup for 64-bit and 32-bit versions of Excel.
How to connect the Hive add-in for Excel to Hadoop on Windows Azure using HiveODBC
A key feature of Microsoft's big data solution is the integration of Hadoop with Microsoft Business Intelligence components. A good example is the connection of Excel to the Hive data warehouse framework in a Hadoop cluster. This section shows how to use Excel through the Hive ODBC driver.
Installing the Hive ODBC Driver
To start the installation, download the 64-bit version of the Hive ODBC driver (MSI file) from Hadoop in the Windows Azure portal. Double-click HiveODBCSetupx64.msi to start the installation. Read the license agreement. If you agree to its terms, click I accept and then Install .
After the installation is complete, click Finish to exit the wizard.
Install Hive add-in for Excel
Installing this add-in requires the 64-bit versions of the Hive ODBC driver and Excel 2010. Run the 64-bit version of Excel 2010. The system will prompt you to install the HiveExcel extension. Click Install. Once the extension is installed, click the Data tab in Microsoft Excel 2010. The Hive panel opens, as shown in the following screenshot:
Creating a Hive ODBC Data Source for Excel
Select Start -> Control Panel to launch the Microsoft Windows Control Panel. In the Control Panel window, select System and Security -> Administrative Tools -> Data Sources (ODBC). The ODBC Data Source Administrator dialog box appears.
In the ODBC Data Source Administrator dialog box, select the System DSN tab. Click Add to create a data source. Select the HIVE driver in the ODBC driver list.
Click Finish. The ODBC Hive Setup dialog box opens, as shown in the screenshot below.
Enter a name in the Data Source Name field, for example MyHiveData. In the Host field, enter the node name of the cluster created on the portal, for example myhadoopcluster.cloudapp.net. Specify a username for authentication on the portal.
Click OK to save the Hive data source. Click OK again to close the ODBC Data Source Administrator dialog box.
Getting Hive data in Excel
Run the 64-bit version of Excel 2010. Then, click the Data tab . Click Hive Panel to open the Hive panel in Excel. In the Select or Enter Hive Connection drop-down list, specify the name of the data source you created earlier.
The system will prompt you for credentials to authenticate to the cluster on the portal. Enter your username and password. From the Select the Hive Object to Query drop-down list, select hivesampletable [Table]. Check all the columns in the table. The Hive Query panel should look something like this:
Click Execute Query .
To process the data from our example, run a query of the following form:
select * from twitter_movies_vw limit 20
Conclusion
In this guide, we looked at how to query, explore, and analyze Twitter data using Hadoop on Windows Azure and a Hive query in Excel.