How the Data Scientist profession works
In addition to stories about our own experience optimizing the services of our IaaS provider, we also examine Western experience: from project management to the technology cases that other IT companies share.
Today we decided to take a look at a profession built around working directly with data, and turned to a note by Philip Guo, who works at the University of Rochester as a "data scientist".
/ Photo by Jer Thorp / CC
Philip developed a number of relevant tools while working on his Ph.D. thesis on "Data Tools" back in 2012.
Since then, "data science" has become the generally accepted name for the profession, and higher education institutions around the world have added the discipline to their curricula.
Philip's experience lets us talk about the difficulties that await anyone who wants to get seriously involved in this field.
How it works: data collection
To try on the role of a "data scientist," you can draw on a number of publicly available sources: open statistics published by governments and companies, an open API that lets you experiment with downloading data from your favorite social network, or even a data set you generate yourself with specialized software.
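For instance, here is a minimal sketch of downloading records from an open API. It uses the public GitHub REST API purely as a stand-in for whatever service you actually want to study, and the output file name is arbitrary.

```python
# A minimal sketch of pulling a data set from an open API.
# The GitHub REST API stands in here for "your favorite social
# network"; substitute the endpoint of the service you study.
import json
import requests

resp = requests.get(
    "https://api.github.com/users/octocat/repos",
    headers={"Accept": "application/vnd.github+json"},
    timeout=30,
)
resp.raise_for_status()  # fail loudly if the source is unavailable

repos = resp.json()
# Persist the raw response unchanged: later cleaning steps should
# always be able to start over from the original data.
with open("raw_repos.json", "w", encoding="utf-8") as f:
    json.dump(repos, f, indent=2)

print(f"Collected {len(repos)} records")
```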
Working with data is a multi-step process that demands careful adherence to method. Even the most basic stage, data collection, where everything starts, is full of unobvious difficulties and potential errors that can make further analysis impossible because the collected data is of poor quality. Here you need to verify the quality of the data on the source's own side and understand how it was originally obtained and organized.
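A first-pass quality check might look like the sketch below. The input file and its columns are hypothetical, and the checks shown (completeness, duplicates, value ranges) are just common starting points.

```python
# A hedged sketch of first-pass quality checks on freshly
# collected data; the file name and columns are hypothetical.
import pandas as pd

df = pd.read_csv("collected.csv")

# How complete is the data? Columns with many gaps may make
# later analysis unreliable.
print(df.isna().mean().sort_values(ascending=False))

# Exact duplicates often indicate a problem on the source side
# (double exports, repeated API pages).
print(f"Duplicate rows: {df.duplicated().sum()}")

# Sanity-check ranges against what the source claims to publish.
print(df.describe(include="all"))
```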
The next stage follows directly from this one: data storage. The problem here, of course, is not which version of Excel to choose, but how to group and organize the thousands of files of related data that will later be analyzed in detail.
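One possible convention is sketched below: group raw files by source and collection date, and never edit them in place. The directory layout and the source name are assumptions for illustration.

```python
# One way to organize thousands of related files: raw data is
# grouped by source and collection date and treated as read-only.
# Paths and the source name are illustrative.
from datetime import date
from pathlib import Path
import shutil

RAW_ROOT = Path("data/raw")

def store_raw(src_file: Path, source_name: str) -> Path:
    """Copy a freshly collected file into data/raw/<source>/<YYYY-MM-DD>/."""
    target_dir = RAW_ROOT / source_name / date.today().isoformat()
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / src_file.name
    shutil.copy2(src_file, target)  # copy, so the original download survives
    return target

# Example: store_raw(Path("raw_repos.json"), "github")
```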
For large volumes of data, it makes sense to consider cloud IT infrastructure, especially for individual experiments with a modest budget: it would be strange to spend those funds on buying your own hardware, which you would later have to sell off anyway.
Data processing
Different data analysis tasks require information in a specific form and format. As a rule, you will not be handed a ready-made data set that can be analyzed immediately, without any additional processing.
At this stage you will run into the need to fix semantic errors and normalize formatting. Specialized software that automates a number of these routine tasks is useful here.
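As an illustration, the sketch below strings together a few such routine cleaning steps with pandas; the column names and the replacement rules are invented for the example.

```python
# A sketch of routine cleaning steps; columns and replacement
# rules are assumptions for illustration.
import pandas as pd

df = pd.read_csv("collected.csv")

# Normalize formatting: consistent column names and string case.
df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")
df["country"] = df["country"].str.strip().str.title()

# Fix semantic errors: map known bad values to canonical ones.
df["country"] = df["country"].replace(
    {"Usa": "United States", "Uk": "United Kingdom"}
)

# Coerce types so downstream scripts do not choke on strings.
df["collected_at"] = pd.to_datetime(df["collected_at"], errors="coerce")
df["value"] = pd.to_numeric(df["value"], errors="coerce")

df.to_csv("cleaned.csv", index=False)
```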
While bringing the data into working form, you can take another look at its structure and pick up additional insights about which hypotheses it makes sense to put forward in your research.
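A quick structural pass of that kind might look like this; again, the file and column names are hypothetical.

```python
# A quick structural pass over the cleaned data can suggest which
# hypotheses are worth testing; columns here are hypothetical.
import pandas as pd

df = pd.read_csv("cleaned.csv")

print(df.dtypes)                     # what kinds of fields do we have?
print(df["country"].value_counts())  # is any group dominating the sample?

# A simple aggregate often hints at a pattern worth a real hypothesis.
print(df.groupby("country")["value"].agg(["mean", "count"]))
```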
Of course, at this step your overall productivity will drop, but the work should be treated as mandatory: without it, the data will be much harder to analyze and its quality much easier to criticize.
Data analysis
Here we are talking about direct work on the algorithms and programs responsible for interpreting your data set. For convenience we will call them scripts; they are written in languages such as Python, Perl, R, and MATLAB.
You need to understand the full cycle of data analysis: preparing and editing scripts, getting first results, interpreting them, and then adjusting the scripts accordingly.
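One way to keep that loop cheap is to isolate everything that changes between runs, as in the hedged skeleton below; the aggregation it performs and all file, column, and parameter names are illustrative.

```python
# A minimal skeleton for the edit-run-interpret loop. Everything
# likely to change between runs sits in one PARAMS dict, and each
# run writes results under its own label so iterations can be
# compared. Names are illustrative.
import pandas as pd

PARAMS = {
    "input": "cleaned.csv",
    "min_count": 10,       # drop groups too small to interpret
    "run_label": "run_001",
}

def analyze(params: dict) -> pd.DataFrame:
    df = pd.read_csv(params["input"])
    grouped = df.groupby("country")["value"].agg(["mean", "count"])
    return grouped[grouped["count"] >= params["min_count"]]

if __name__ == "__main__":
    result = analyze(PARAMS)
    result.to_csv(f"results_{PARAMS['run_label']}.csv")
    print(result.head())
```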
Among the things that may not go as planned, time costs and outright failures are worth noting. You can lose a huge amount of time to a large volume of processed data and inefficient use of computing resources, for example by relying solely on a home computer, whose resources are hard to scale.
On top of that, the analysis algorithm embedded in your script can itself be slow, so it pays to make trial runs, monitor the progress of the process, and promptly make adjustments. In the same way, you should plan for possible failures.
Try running the analysis with different parameters and properties of the input data in mind. This may require a series of experiments varying those parameters, plus additional iterations adjusting the processing algorithm itself.
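A sketch of such a sweep is below: each parameter setting writes its own checkpoint file, so a crash or interruption does not cost the runs that already finished. The aggregation repeats the previous sketch, and all names and values are illustrative.

```python
# A hedged sketch of sweeping one parameter with per-run
# checkpoints and logging; names and values are illustrative.
import logging
from pathlib import Path

import pandas as pd

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")

def analyze(input_path: str, min_count: int) -> pd.DataFrame:
    # Same aggregation as the previous sketch, inlined for completeness.
    df = pd.read_csv(input_path)
    grouped = df.groupby("country")["value"].agg(["mean", "count"])
    return grouped[grouped["count"] >= min_count]

for min_count in [5, 10, 50]:
    out = Path(f"results_min{min_count}.csv")
    if out.exists():  # checkpoint: skip runs that already finished
        logging.info("skipping %s, already computed", out)
        continue
    try:
        analyze("cleaned.csv", min_count).to_csv(out)
        logging.info("wrote %s", out)
    except Exception:
        # Log and keep going so one bad run does not kill the sweep.
        logging.exception("run failed for min_count=%d", min_count)
```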
Conclusions
The first three steps yield results that are no longer raw and can support conclusions. To draw them, it is recommended to keep detailed notes and present them to colleagues.
This approach helps you relate the result to what you planned to obtain at the very start of working on the topic. Such reflection lets you trace the evolution of your hypothesis and may lead to additional experiments with the data; presenting the results visually to colleagues can help here as well.
Comparing your results with those obtained in similar work by other scientists will help you hunt down potential errors, return to one of the previous steps if needed, and then move on to writing up the results of the study.
Presentation
Beyond an oral talk, infographics, and a classic slide deck that brings these elements together in front of an audience, there are other ways to wrap up a research project. The product of extensive data-analysis work is often the programs and algorithms themselves, complete with documentation and explanatory notes.
This form lets colleagues in the profession reproduce your results quickly and moves the field of data analysis forward. For that you need to be reasonably well versed in software development, so you do not put the expert community in the awkward position of working with a script that has no clear documentation.
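For example, a script meant to be rerun by others might carry a usage docstring and a command-line interface instead of hard-coded paths, as in this sketch; all names and defaults are illustrative.

```python
# One way to leave colleagues a script they can actually rerun:
# a docstring stating what it does and a CLI instead of
# hard-coded paths. Details are illustrative.
"""Aggregate per-country statistics from a cleaned data set.

Usage:
    python aggregate.py cleaned.csv --min-count 10 -o results.csv
"""
import argparse

import pandas as pd

def main() -> None:
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument("input", help="path to the cleaned CSV file")
    parser.add_argument("--min-count", type=int, default=10,
                        help="drop groups with fewer rows than this")
    parser.add_argument("-o", "--output", default="results.csv")
    args = parser.parse_args()

    df = pd.read_csv(args.input)
    grouped = df.groupby("country")["value"].agg(["mean", "count"])
    grouped[grouped["count"] >= args.min_count].to_csv(args.output)

if __name__ == "__main__":
    main()
```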
P.S. Additional reading recommended by Philip Guo.