Series: Big Data as a dream. Part 1
Big Data as seen by different industries is yet another dream of a Holy Grail that will decide everything, save and protect! In real life, it is exactly the opposite: Big Data means completely new tasks, the winding down of stagnant projects and the dismissal of specialists who fail to adapt. We offer a series of articles on the real-world practice of applying unstructured Big Data in various industries and on the emergence of new professions whose names are only now being coined: Big Data analyst and sociologist, HiLoad linguist, trend journalist (from the word TREND, not TREND). We also hope for fruitful discussions about where this new highway should lead.
Rosy dreams about BD (Big Data), like ideas about it, differ for everyone: vendors dream of lots of hardware, software developers of lots of new software, telecoms of clouds, and customers of a magic wand: “I pressed a button, and it did everything for me!” There is no worse letdown than an unfulfilled dream. Meanwhile, the vendors, software developers, telecoms and the rest will fulfill their own dreams and fly off to collect pollen from the new dreams of customers who have become disillusioned with BD. Knowledge is power; it is time to use that power and take a sober look at BD through the eyes and expectations of customers and industries.
For several years we have been working on the most “delicious” kind of BD: unstructured rtBD & A (real-time Big Data & Analytics). In the rtBD & A segment, new industries are emerging and existing ones are being transformed at a rapid pace, and they need the “right” specialists in large numbers: Gartner estimates the US market alone at 190 thousand BD analysts by 2018. As practitioners who have already faced the new challenges, we understand that it falls to us to tell, explain and help. Otherwise it will turn out as usual: the dream of a “pink elephant” will turn into a “big pig”, with all the consequences.
The term Big Data, a new concept with only a five-year history, is beginning to spread and be used actively in various fields and industries: video, RTB, sociology, medicine, space, finance and everywhere else. Wherever you look, there are people who will proudly tell you how bravely they wrestle with terabytes and trillions of records to improve the CURRENT work of specific industries.
Unfortunately, this approach is probably the biggest mistake a client makes when treating Big Data as a dream of a bright future. Let's try to figure out what the problem is. Below we set out our vision, shaped by 20 years of experience building various Internet projects “in the field of Big Data” (it went by other names before), with an emphasis on rtBD & A.
In some respects our vision may differ, even significantly, from the usual 3V technology template for Big Data (volume, variety, velocity), because:
1) Only the result (the whale, the periodic table) should be visible to the client, not the ocean of data itself;
2) Variety applies not only to the data, but also to the sources and to the attitudes towards those sources;
3) Super-complex “systems” such as a person, a group of people or an entire nation, each with its own worldview, history, relationships, phraseology and vocabulary, can act as sources, or “sensors”, of BD;
4) Life is always wider than any template.
So, firstly, let's forget that "BD is a lot of data." Analysts (researchers, inventors, other “scientists” and clients) need just enough data to make it possible to “explode” the OLD way an industry is organized. A wonderful example: we do not know how much data Mendeleev had, but it was enough for him to produce, as output, the “Periodic Table of Chemical Elements” with fewer than 100 cells. No further comment is needed: everyone now studies chemistry at school.
Secondly, it is necessary to separate:
A) personalized “lots of data per object”,
B) the information field of data across an industry and around objects.
An example of type A: RTB data used to show “targeted” ads to a particular browser on a particular device. Are you still haunted by unwanted ads for high-interest bank deposits because your other half clicked on a pretty ad with a handbag? That is a type A system: your browser’s “travels” on the laptop are stored by the petabyte to remind you of all the sins of your youth, even if you have since changed sex.
Type B examples: which problems with the iPhone contributed to its falling sales in Russia? Will Le Pen manage to overtake Sarkozy in the regional elections?
Type A is often called the "Dossier" type: there is a specific, known object (for example a person, a wallet account or a phone), and whenever that object "stirs", another entry about it is added to the Dossier. For type B, the specific object (say, that big fish in the ocean) is not important; the data is analyzed for the ocean as a whole, with all its fish, algae and plankton.
“Winwood Reade is good upon the subject,” said Holmes. “He remarks that, while the individual man is an insoluble puzzle, in the aggregate he becomes a mathematical certainty. You can, for example, never foretell what any one man will do, but you can say with precision what an average number will be up to. Individuals vary, but percentages remain constant.” (Arthur Conan Doyle, “The Sign of the Four”)
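The difference between the "Dossier" type A and the "field" type B can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the event list, the identifiers (`browser-1` and so on) and the function names are invented here and do not describe any real RTB system.

```python
from collections import Counter, defaultdict

# Hypothetical event stream of (object_id, topic) pairs, invented for illustration.
events = [
    ("browser-1", "handbags"),
    ("browser-2", "deposits"),
    ("browser-1", "deposits"),
    ("browser-3", "handbags"),
    ("browser-1", "handbags"),
]


def build_dossiers(events):
    """Type A: each event is appended to the dossier of a known object."""
    dossiers = defaultdict(list)
    for object_id, topic in events:
        dossiers[object_id].append(topic)
    return dict(dossiers)


def field_shares(events):
    """Type B: ignore who produced an event and measure the field as a whole."""
    counts = Counter(topic for _, topic in events)
    total = sum(counts.values())
    return {topic: count / total for topic, count in counts.items()}


dossiers = build_dossiers(events)  # per-object history
shares = field_shares(events)      # share of each topic across all events
```

In the type A view the question is "what has this browser done?"; in the type B view the identifiers are discarded and only the proportions across the whole "ocean" remain.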
Thirdly, it is necessary to distinguish between structured data (for example, a receipt for a store purchase) and unstructured data (even this very article on MegaMozg). Of course, there will always be someone who considers the text of an article “structured”, at least as a set of 33 letters of the alphabet, 10 digits and a few punctuation marks. Such nitpickers can be sent back to school to relearn that same chemistry (why a molecule of water, liquid or ice, is made of atoms of “H” and “O”, elements that on their own are gases, one of them flammable).
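A small Python sketch may make the distinction concrete. It assumes a made-up `Receipt` record and a made-up sentence of free text; neither comes from a real dataset, and the tokenization shown is deliberately naive.

```python
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class Receipt:
    """Structured record: every field has a fixed name and type."""
    store: str
    item: str
    price: float


# Hypothetical receipts, invented for illustration.
receipts = [
    Receipt("Store A", "handbag", 120.0),
    Receipt("Store B", "phone", 650.0),
    Receipt("Store A", "phone", 700.0),
]

# Structured data answers a question with a direct field lookup.
phone_revenue = sum(r.price for r in receipts if r.item == "phone")

# Unstructured data: free text with no schema (the sentence is made up).
text = "The phone is great, but the phone battery dies fast. Bought a handbag too."

# Even a trivial question ("what is mentioned most?") first requires turning
# the text into tokens; real linguistic analysis goes far beyond this step.
tokens = re.findall(r"[a-z]+", text.lower())
mentions = Counter(t for t in tokens if t in {"phone", "handbag"})
```

The receipt can be queried as it is, while the text has to be interpreted before any counting can start, which is where most of the work in unstructured BD lies.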
Fourth, and closer to the technology side, BD can be divided into real-time and ... non-real-time. Again, no fanaticism, please. About two years ago, in a conversation with colleagues from Cloudera, when they were shown some examples of rtBD & A, one of their specialists said wistfully that Hadoop is, of course, cool, and processing a brain tomography scan in a day or two is just the job for it, but real time calls for completely different solutions. But more on that another time.
Summary of Part 1: Big Data is the amount of data needed for a revolution, not an evolution. The data can relate to a single object or cover the whole information field, can be structured or unstructured, and some tasks require processing in close to real time.
In the following parts: Who are Big Data analysts? Why is IBM ready to train 10,000 employees in Twitter data analysis? Some unique case studies of unstructured BD analytics. Which industries are already going "under the chandelier"? What technologies are required to process Big Data? Why did such successful companies as Motorola, Nokia and HTC "die", and will Samsung survive its fight with Apple? Where are ideas being born now, and who comes up with them?..
But, as often happens in rtBigData & A, all the plans above may fade into the background, and the next part will be devoted to the questions and tasks raised in the comments on this introductory material :-)
Part 2: Big Data, negative or positive?