Microsoft Research datasets are now available to everyone

Original Author: Microsoft Research Blog
  • Translation
We are pleased to announce that our colleagues at Microsoft Research have published data gathered over years of curating and studying information from scientific papers. Datasets are now available in engineering, computer science, mathematics, physics, biology, and the social and natural sciences.



For the past few years, the Microsoft Research Outreach team has worked closely with the scientific community, helping researchers run their projects on cloud infrastructure. Throughout this time we have seen, everywhere, the relevance of the fourth paradigm of scientific discovery proposed by Jim Gray, which is based on the study of large amounts of data and makes datasets a key component of almost every research program. We also saw clearly that handling such a vast flow of information requires curated and analyzed datasets at the scale of the research community, and that this need is not limited to computer science: it extends to interdisciplinary and domain sciences as well.

Today we are pleased to present Microsoft Research Open Data, a new open-source cloud repository designed to facilitate collaboration among researchers around the world. This unified cloud repository provides convenient access to datasets that result from Microsoft's many years of curating and studying information from published scientific papers.

Why we invest in this project


The goal of the project is to give Microsoft researchers and employees a convenient platform for sharing datasets, together with the technologies and tools needed to work with them. The Microsoft Research Open Data repository is designed to simplify access to data, make it easier for researchers to collaborate using cloud resources, and support the reproducibility of experiments. We will continue to develop the repository and add new features, guided by community feedback.

We know that dozens of data repositories are available to researchers today, and we expect that the capabilities of Microsoft Research Open Data will complement the functionality of existing repositories.


Fig. 1. A dataset in the Microsoft Research Open Data repository

“It is a turning point in the world of big data. Initiatives such as Microsoft Research Open Data reduce barriers to sharing information and support the reproducibility of experiments through the use of cloud platforms,” said Sam Madden, a professor at the Massachusetts Institute of Technology.

Data is growing exponentially, and its volume is expected to reach 150 ZB (zettabytes) by 2025. This means that today we must focus on processing data rather than on transferring it over Internet connections, which are improving much more slowly. We believe that the ability to process data is what brings real benefits. Users can therefore not only download datasets but also copy them directly to an Azure-based Data Science Virtual Machine (see Figure 2).


Fig. 2. Data is copied from microsoftopendata.com to a Linux virtual machine in the Azure cloud.
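
To make the argument about transfer speed concrete, here is a minimal back-of-the-envelope sketch. The dataset size and link speeds are hypothetical values chosen only for illustration; they are not figures from Microsoft Research, and the calculation assumes an ideal link running at full speed the whole time.

```python
def transfer_hours(size_bytes: float, link_bits_per_second: float) -> float:
    """Hours needed to move size_bytes over a link at full, sustained speed."""
    seconds = size_bytes * 8 / link_bits_per_second
    return seconds / 3600


TB = 10**12  # one decimal terabyte, in bytes

# Hypothetical dataset size and link speeds, for illustration only.
dataset_size = 1 * TB
links = [("100 Mbit/s", 100e6), ("1 Gbit/s", 1e9), ("10 Gbit/s", 10e9)]

for label, bits_per_second in links:
    hours = transfer_hours(dataset_size, bits_per_second)
    print(f"{label}: {hours:.1f} h")
```

Under these assumptions, a 1 TB dataset takes roughly 22 hours to download over a 100 Mbit/s connection, which is why copying data to a virtual machine in the same cloud can be more practical than downloading it.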

The Data Science Virtual Machine comes preinstalled with development tools that are popular with researchers and practitioners (see Figure 3).


Fig. 3. Data Science Virtual Machine on Linux

“I am often asked to share my experimental data, and I used to share it myself. That was the most common way to do it. Coordinating and cataloging datasets in one place with Azure will be useful for both internal and external researchers. They will get easy access, interoperability, and convenient use of extensive open data in the Microsoft Research cloud,” comments John Krumm, lead researcher at Microsoft Research AI.

The datasets in Microsoft Research Open Data are classified by their main research area (see Figure 4). From a dataset, you can find links to the related research projects and publications. Available datasets can be viewed, downloaded, and copied directly to an Azure subscription using an automated workflow. The repository meets the highest standards for information sharing and ensures that datasets are accessible, interoperable, and reusable; the corpus contains no personal information. The site will continue to be developed as we gather user feedback.


Fig. 4. Categories of data sets

The Microsoft Research Open Data repository grew out of the Microsoft Research Outreach data research program. It was made possible by close cooperation among many departments and researchers at Microsoft, our industry partners, and educational consultants.

We will be glad to receive your comments and feedback! Send us a message using the feedback form on the site and share your thoughts.
