Dropbox: An Inside Look

In this article I will talk about the internals of the popular Dropbox cloud storage service. In particular, I will cover how the Dropbox protocol works and show statistics on its use in several European countries. I will also compare it with other services such as iCloud, Google Drive and SkyDrive.

The article is purely technical. There will be no summary tables of cost per GB, and no analysis of how much extra space you can earn by inviting “friends”.

The text is based on the research paper “Inside Dropbox: Understanding Personal Cloud Storage Services” (PDF).

Over the past few years, cloud storage services have surged in popularity. All the major players and several young startups have joined the arms race. For the most part, information about how these services work internally, and the real numbers on their use, is a closely guarded secret. We are fed only data that has passed through the marketing department, which of course differs somewhat from reality. So let's dig a little deeper together with Idilio Drago, Anna Sperotto, Marco Mellia, Ramin Sadre, Maurizio M. Munafò and Aiko Pras, the authors of the study.

Introduction


The Dropbox client is written mostly in Python and uses third-party libraries such as librsync. It supports all major operating systems: Windows, Mac and Linux. The choice of Python suggests that the client was designed to be easy to port across platforms.

The basic unit of the system is a block (chunk) of up to 4 MB. A larger file is split into several blocks, and the system treats each block independently of the others. A SHA256 hash is calculated for each block, and these hashes are part of the file's metadata. Dropbox reduces the amount of data transferred by sending only the differences for blocks that have changed. In addition, the client keeps all file metadata locally, synchronizes it with the server, and transfers only the changes since the previous version (incremental updates).
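The block-and-hash idea can be sketched in a few lines of Python. This is only an illustration, not the Dropbox client's code: the function names `chunk_hashes` and `changed_blocks` are made up for this example, and the comparison is done per whole block, whereas the real client additionally uses librsync to send differences within a block.

```python
import hashlib
from typing import List

CHUNK_SIZE = 4 * 1024 * 1024  # 4 MB, the maximum block size mentioned above


def chunk_hashes(data: bytes, chunk_size: int = CHUNK_SIZE) -> List[str]:
    """Split data into blocks of at most chunk_size bytes and hash each block."""
    return [
        hashlib.sha256(data[offset:offset + chunk_size]).hexdigest()
        for offset in range(0, len(data), chunk_size)
    ]


def changed_blocks(old: List[str], new: List[str]) -> List[int]:
    """Return indices of blocks in `new` that are absent from or differ in `old`."""
    return [
        index
        for index, block_hash in enumerate(new)
        if index >= len(old) or old[index] != block_hash
    ]
```

Comparing the hash lists of two versions of a file tells the client which blocks need to be considered for upload; unchanged blocks are skipped entirely.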

Dropbox uses two types of servers: control (management) servers and data storage servers. The management servers are operated by Dropbox itself, while the data servers are hosted at Amazon (Amazon S3, EC2). In all cases HTTPS is used to communicate with the servers.

The domain names used by Dropbox always end with dropbox.com. The table below shows the subdomains for the management and data servers.

Subdomain              Hosting   Description
client-lb / clientX    Dropbox   Metadata
notifyX                Dropbox   Notifications
api                    Dropbox   API control
www                    Dropbox   Web servers
d                      Dropbox   Event logs
dl                     Amazon    Direct links
dl-clientX             Amazon    Client storage
dl-debugX              Amazon    Back traces
dl-web                 Amazon    Web storage
api-content            Amazon    Storage API


Dropbox: Inside


Because Dropbox uses HTTPS to encrypt all traffic between the client and the servers, simply capturing packets reveals nothing useful. For the study, the authors installed Squid and routed all traffic from a Linux machine through this proxy. They also enabled SSL-bump on the proxy so that the SSL traffic could be decrypted. The final step was to install a self-signed certificate on Squid and replace the certificate inside the Dropbox application. This configuration makes it possible to decrypt and inspect Dropbox traffic.




The illustration shows the protocol Dropbox uses to upload locally modified blocks to its servers. After the client registers with the clientX.dropbox.com management servers, the list command retrieves metadata changes that describe the difference between the local copy and what is on the server. As soon as a local file changes, the client issues the commit_batch command (client-lb.dropbox.com) and sends the changed metadata to the server. The server then replies with need_blocks, listing the blocks it needs, and the client sends these blocks to Amazon (dl-clientX.dropbox.com). The storage of each block is acknowledged with an OK response.

The client then sends commit_batch to the server once more and receives confirmation that all blocks have arrived. Storage transactions can run in parallel.
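The exchange described above can be mimicked with a toy in-memory model. Everything here is hypothetical: `ToyServer` merges the metadata and storage servers into one object, and the method names simply borrow the command names from the text (commit_batch, need_blocks, OK). The real wire format, parameters and authentication are not shown in this article and are not reproduced here.

```python
import hashlib


class ToyServer:
    """In-memory stand-in for the metadata and storage servers (hypothetical)."""

    def __init__(self):
        self.blocks = {}  # block hash -> bytes; plays the role of Amazon storage
        self.files = {}   # path -> list of block hashes; plays the role of metadata

    def commit_batch(self, path, hashes):
        """Commit the file if all blocks are stored; else return the missing hashes."""
        need_blocks = [h for h in hashes if h not in self.blocks]
        if not need_blocks:
            self.files[path] = list(hashes)
        return need_blocks

    def store(self, block_hash, block):
        """Store one block and acknowledge it."""
        self.blocks[block_hash] = block
        return "ok"


def upload(server, path, blocks):
    """Run the commit / need_blocks / store / commit sequence for one file."""
    hashes = [hashlib.sha256(block).hexdigest() for block in blocks]
    by_hash = dict(zip(hashes, blocks))
    needed = server.commit_batch(path, hashes)      # first commit: metadata only
    for block_hash in needed:
        if server.store(block_hash, by_hash[block_hash]) != "ok":
            raise RuntimeError("block was not acknowledged")
    remaining = server.commit_batch(path, hashes)   # second commit: should succeed
    return needed, remaining
```

In this model, uploading a second file that shares a block with an earlier one makes the server ask only for the blocks it has not seen yet, which is the point of announcing the hashes before sending any data.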

Control protocol

Dropbox uses the following groups of management servers:

  • Notifications
    Dropbox keeps a TCP connection to the notification servers (notifyX.dropbox.com) open at all times. This is how the client learns about file changes made on other clients. Unlike the rest of the traffic, this information is not encrypted. Clients are notified quickly by means of a delayed HTTP response (a push mechanism): the client sends a request, and the server holds the response for about 60 seconds. After 60 seconds the client immediately sends the next request. If there is something to report earlier, the server responds right away.
  • Metadata management
    The metadata management servers are responsible not only for reporting changes to blocks and files, but also for client authentication. They use the following domain names: client-lb.dropbox.com, clientX.dropbox.com. In addition, the management servers can control client behavior. During the experiment it was observed that the servers can tell the client the maximum number of blocks it may send, which is used to limit the traffic the client generates.
  • System logs
    Back traces are sent to servers hosted at Amazon (dl-debug.dropbox.com); all other log messages go directly to Dropbox (d.dropbox.com).
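The delayed-response (long polling) mechanism used for notifications can be illustrated without any networking. The sketch below is an assumption-laden model: the class `NotifyChannel` and its methods are invented for this example, a `threading.Event` stands in for the pending HTTP response, and the 60-second default mirrors the delay mentioned above.

```python
import threading


class NotifyChannel:
    """Toy model of a long-polling notification server (hypothetical names)."""

    def __init__(self):
        self._changed = threading.Event()

    def publish(self):
        """Called when another client changes a file."""
        self._changed.set()

    def poll(self, timeout=60.0):
        """Hold the 'request' until a change is published or the timeout expires.

        Returns True if a change arrived, False if the wait timed out.
        """
        changed = self._changed.wait(timeout)
        # Reset for the next poll; a production design would need to avoid
        # losing a change published between wait() and clear().
        self._changed.clear()
        return changed
```

A client would call `poll()` in a loop: a `False` result means "nothing new, ask again", while `True` means it should fetch the updated metadata from the management servers.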


Data sets and popularity

The authors chose to monitor Dropbox passively. Traffic was collected with the open source tool Tstat, which records more than a hundred different parameters for each TCP connection. Analyzing Dropbox required a few extra steps.

Since Dropbox uses HTTPS, the traffic was identified by its certificates: the name in all certificates used by Dropbox is *.dropbox.com. This was important for classifying the traffic correctly.

The connection data was then supplemented with records of the DNS lookups made by the clients, which made it possible to link IP addresses to server names.
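Once a connection has been linked to a server name, it can be assigned a role using the subdomain table given earlier. The sketch below shows one way to do that; it is not Tstat code. The function `classify` is hypothetical, and the assumption that the "X" in names such as clientX stands for a number is mine.

```python
import re

# (pattern, hosting, description), taken from the subdomain table in this article.
# Assumption: the "X" suffix is a numeric server index.
_TABLE = [
    (r"client-lb|client\d+", "Dropbox", "Metadata"),
    (r"notify\d+", "Dropbox", "Notifications"),
    (r"api", "Dropbox", "API control"),
    (r"www", "Dropbox", "Web servers"),
    (r"d", "Dropbox", "Event logs"),
    (r"dl", "Amazon", "Direct links"),
    (r"dl-client\d+", "Amazon", "Client storage"),
    (r"dl-debug\d*", "Amazon", "Back traces"),
    (r"dl-web", "Amazon", "Web storage"),
    (r"api-content", "Amazon", "Storage API"),
]
_RULES = [(re.compile(pattern), hosting, desc) for pattern, hosting, desc in _TABLE]
_SUFFIX = ".dropbox.com"


def classify(hostname):
    """Return (hosting, description) for a Dropbox hostname, or None if unknown."""
    name = hostname.lower().rstrip(".")
    if not name.endswith(_SUFFIX):
        return None
    subdomain = name[: -len(_SUFFIX)]
    for pattern, hosting, desc in _RULES:
        if pattern.fullmatch(subdomain):
            return hosting, desc
    return None
```

For example, `classify("dl-client42.dropbox.com")` yields `("Amazon", "Client storage")`, while a name outside dropbox.com yields `None`.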

Tstat also extracted the unencrypted information exchanged between the client and the notification server: data identifying the device and the names of its directories.

The data was collected by Tstat installations at four locations in Europe. The records from the points labeled Home 1 and Home 2 cover customers of a well-known Internet service provider (ISP) that offers access over ADSL and fiber. The data labeled Campus 1 and Campus 2 was collected at universities. The measurements ran from March 24, 2012 to May 5, 2012.

Name       Access type        IP addresses   Data volume (GB)
Campus 1   Wired                       400              5,320
Campus 2   Wired / wireless          2,528             55,054
Home 1     FTTH / ADSL              18,785            509,909
Home 2     ADSL                     13,723            301,448

The graph below shows how many distinct IP addresses contacted each cloud storage service at least once per day.



The second graph shows how much data was transferred to each cloud storage service per day.



I would like to draw attention to the following:
  • Despite the large number of devices using iCloud, the amount of data transferred to this service is comparable to that of the other services.
  • When Google Drive launched, the traffic sent to it jumped sharply and approached that of iCloud, even though the number of installations remained minimal.

For comparison, here is data on the use of YouTube and Dropbox in Campus 2.

The table shows the total Dropbox traffic observed during the measurements.

              Campus 1    Campus 2      Home 1    Home 2       Total
Requests       167,189   1,902,824   1,438,369   693,086   4,204,666
Volume (GB)        146       1,814       1,153       506       3,624
Devices            283       6,609       3,350     1,313      11,561

Traffic analysis

The graphs show the cumulative distribution function (CDF) of the number of blocks per storage operation.



It turned out that in more than 80% of cases, a storage operation involves no more than 10 blocks. The curve for Home 2 differs noticeably from the others because of a single client there that kept resending the same blocks for several days. The analysis suggests that the main Dropbox usage scenario is ongoing work with small, frequently changing files.
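For readers who want to reproduce this kind of curve from their own measurements, an empirical CDF is easy to compute. The helper names below are invented for this sketch, and any sample values you feed it are your own; no numbers from the study are embedded in the code.

```python
def empirical_cdf(samples):
    """Return a sorted list of (value, fraction of samples <= value)."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    total = len(ordered)
    cdf = []
    for rank, value in enumerate(ordered, start=1):
        if cdf and cdf[-1][0] == value:
            cdf[-1] = (value, rank / total)  # repeated value: keep the highest rank
        else:
            cdf.append((value, rank / total))
    return cdf


def fraction_at_most(samples, threshold):
    """Fraction of samples that do not exceed threshold."""
    if not samples:
        raise ValueError("samples must not be empty")
    return sum(1 for sample in samples if sample <= threshold) / len(samples)
```

With a list holding the number of blocks in each observed storage operation, `fraction_at_most(blocks_per_operation, 10)` gives the share of operations with at most 10 blocks, which is the quantity quoted above.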

As discussed above, Dropbox stores data on centralized servers. This immediately raises the question of how fast the service is for users who are geographically far from those servers.

The maximum speed observed was close to 10 Mbit/s, and it was seen on files larger than 1 MB. The average speed for Campus 2 was 462 kbit/s for writes and 797 kbit/s for reads; for Campus 1 it was 359 kbit/s for writes and 783 kbit/s for reads.



It can also be seen from the graphs that the speed substantially depends on the number of blocks: the more blocks, the lower the speed.

Changes in Dropbox 1.4.0

Starting with version 1.4.0, Dropbox added two new commands, store_batch and retrieve_batch, which allow the client to work with several blocks at once. This change should significantly increase the throughput of the service.
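Why would batching help? One plausible explanation, offered here only as a simplified model and not as a measured result, is that each separately acknowledged block costs a network round trip. The function names and the parameters `rtt_s` and `bandwidth_bps` are assumptions introduced for this sketch; the model ignores TCP slow start, server processing time and parallel connections.

```python
import math


def transfer_time(n_blocks, block_bytes, rtt_s, bandwidth_bps, batch_size=1):
    """Estimated seconds to send n_blocks when each batch costs one round trip."""
    round_trips = math.ceil(n_blocks / batch_size)
    payload_bits = n_blocks * block_bytes * 8
    return round_trips * rtt_s + payload_bits / bandwidth_bps


def throughput_bps(n_blocks, block_bytes, rtt_s, bandwidth_bps, batch_size=1):
    """Effective throughput in bit/s under the same simplified model."""
    payload_bits = n_blocks * block_bytes * 8
    elapsed = transfer_time(n_blocks, block_bytes, rtt_s, bandwidth_bps, batch_size)
    return payload_bits / elapsed
```

Under this model, many small blocks sent one at a time spend most of their time waiting for acknowledgments, so the effective speed drops as the block count grows; grouping blocks into a batch reduces the number of round trips and brings the throughput closer to the link bandwidth.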

Number of devices

The graph shows the number of devices with Dropbox installed among home users. About 60% of home users have only one device with Dropbox, and 25% have two.


Average usage time

The graph shows how long Dropbox is typically in use. To measure this, the authors looked at how long the client stayed connected to the notification server. Since the client always keeps this connection open, or reopens it, this is a good way to estimate usage time.


The graph shows that in most cases Dropbox is in use for less than 4 hours. The exception is Campus 1, which has many workstations and computers that are always on.

Source data


The source data used in this article is available for download for further analysis. (Source data)

Note that the original paper contains more information and may answer questions that come up after reading this article.
