What is Big Data, Part 1
- Translation
Big data is big news, big importance, and big business, but what is it really? What is big data? To those who live and breathe it the answer seems obvious, and I feel like a fool for even asking. Then again, those insiders tend to assume most people are clueless anyway, right? So I want to start by talking to readers who, like me, are not deep in the subject. What is this all about? This week I plan to dig into the question thoroughly, most likely in three long articles (translator's note: translations of the next two parts will be published in the coming days).
My PBS series "Triumph of the Nerds" told the story of the personal computer and its rise from 1975 to 1995. "Nerds 2.01: A Brief History of the Internet" told the story of the Internet's rise from 1966 to 1998. But both programs were really about the influence of Moore's Law on technology and society. Personal computers became possible only when the cost of microprocessors fell to a level an ordinary person could afford. Until that tipping point was reached, the market was not ready for explosive growth.
The commercial Internet, in turn, became possible only when the price of servers fell two more orders of magnitude by the mid-1990s, making dial-up access economically viable and leading to the next tipping point. In the same terms, big data is what happened when the cost of computing fell yet another two orders of magnitude by 2005, producing the most recent tipping point. We like to pretend this happened earlier, in 1998, but it did not (that is part of the story). 2005 marked the arrival of mobile and cloud computing, and the beginning of the era of big data. And just as in my two documentaries, we find ourselves on the threshold of a new era with almost no understanding of how we got here or what it all means.
Personal computers changed America and the Internet changed the world, but Big Data is transforming the world itself. It will drive the development of technology for the next hundred years.
Wherever you are, computers are watching you and recording data about your activities, noting above all what you watch, read, look at, or buy. Step onto the streets of almost any city and video surveillance is added to the mix: where are you, what are you doing, who or what is nearby? Your communications are partly monitored, and sometimes even recorded. Everything you do on the Internet, from commenting on tweets to simply browsing, never disappears from history. Part of this is about national security. But the main purpose of this technology is simply to make you buy more things, to make you a more efficient consumer. The technology that makes this collection and analysis of data possible was invented mostly in Silicon Valley, by a variety of technology startups.
How did we get here, and what happens next? Technology will of course keep pushing boundaries, but this time, instead of inventing the future, the geeks will be in the same boat as everyone else: breakthroughs like self-driving cars, universal translators, and even computers that design other computers will emerge not from human minds but from the machines themselves. And the reason for all of it is Big Data.
Big data is the accumulation and analysis of information to extract value.
Data is information about the state of something: the who, what, when, where, and why of spy games, how a disease spreads, or how a pop group's popularity changes. Data can be collected, stored, and analyzed to understand what is happening: whether social media really launched the Arab Spring, whether DNA sequencing can prevent disease, or who will win an election.
Although data has always surrounded us, in the past we made little use of it, mainly because storage and analysis were so expensive. For the first 190,000 years of Homo sapiens' existence as hunter-gatherers, we collected no data at all, since there was no way to store it or even to record it. Writing appeared about 8,000 years ago, primarily as a way of storing data as cultures took shape: first we wanted to write down our stories, and later we needed to keep lists of people, taxes, and deaths.
Lists, as a rule, are binary: you are either on one or you are not, alive or dead, paying taxes or not. Lists are for counting, not computing. They can carry meaning, but rarely do. It was the need to find meaning in higher powers and phenomena that led us from counting to computing.
Thousands of years ago, recording and analyzing data was so expensive that only religion could afford it. Trying to explain a mystical world, our ancestors began watching the sky, noticing the movements of the stars and planets, and recording that information for the first time.
Religion, which had already given rise to writing, then led to astronomy, and astronomy led to mathematics, all in search of mystical meaning in celestial motion. Calendars, for example, were not so much invented as derived from accumulated data.
Throughout history, data has been used for tax rolls, censuses, and general accounting. Take, for example, the Domesday Book of 1086, which was essentially Britain's master tax register. The idea of counting is hidden everywhere: most of the data collected throughout history was gathered by counting. When a lot of information had to be handled (more than the handful of observations needed for a scientific experiment), it was almost always tied to money or some other expression of power (how many soldiers, how many taxpayers, how many male infants under two years old in Bethlehem?). Counting produces a number, and numbers are easy to store by writing them down.
As soon as we began accumulating knowledge and writing it down, we felt a natural need to hide it from others. This gave rise to codes, ciphers, and statistical methods for breaking them. In the 9th century the scholar Abu Yusuf al-Kindi wrote the "Manuscript on Deciphering Cryptographic Messages", laying the foundations of statistics: extracting meaning from data, cryptanalysis, and code breaking.
In his book al-Kindi described a method, now called frequency analysis, for cracking codes without knowing the key. Most codes of the time were substitution ciphers, in which each letter is replaced by another. If you know which letter maps to which, the message is easily decrypted. Al-Kindi's insight was that if you know how often each letter appears in ordinary writing, those frequencies carry over unchanged into the encoded message.
In English, the most common letters are E, T, A, and O (in that order). In a long enough encoded message, the most common symbol should stand for E, and so on. Once you have found Q, the letter after it will almost always be U, and so forth. Unless, of course, the underlying language is not English.
The key to any frequency attack on a substitution cipher is knowing the relative letter frequencies of the language, and that means counting letters in thousands of ordinary documents. This is data collection and analysis, circa 900 AD.
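To make al-Kindi's method concrete, here is a minimal sketch in Python (my illustration, not anything taken from his manuscript): count how often each symbol appears in the ciphertext and pair the most frequent ones with the most frequent English letters.

```python
from collections import Counter

# English letters ordered from most to least frequent, the ranking
# al-Kindi's method relies on.
ENGLISH_BY_FREQUENCY = "ETAOINSHRDLCUMWFGYPBVKJXQZ"

def guess_substitution_key(ciphertext):
    """Guess a substitution-cipher mapping by pairing the most frequent
    ciphertext letters with the most frequent English letters."""
    counts = Counter(c for c in ciphertext.upper() if c.isalpha())
    ranked = [letter for letter, _ in counts.most_common()]
    return dict(zip(ranked, ENGLISH_BY_FREQUENCY))

def decode(ciphertext, key):
    return "".join(key.get(c, c) for c in ciphertext.upper())

# On a long ciphertext the guess gets most letters right; on a short one
# it fails, which is why the size of the message matters.
# print(decode(ciphertext, guess_substitution_key(ciphertext)))
```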
It took another 800 years after al-Kindi for collected data to come into broad public use. Starting in 1603, London published weekly "Bills of Mortality" to track every reported death in the city (bring out your dead!). These weekly reports were later compiled into annual volumes, and that is where things get interesting. Although the lists were compiled simply as public records, their real value emerged when those pages were analyzed after the plague of 1664-65. By combining the records with a map of the city's primitive water supply and sewage system, analysts could chart how the disease spread from infected areas across London. From these data came the sources of infection (fleas and rats) and how to avoid them (be rich, not poor). Thus began the study of public health.
The chief value of the mortality bills lay not in the raw counts of the dead (just numbers), but in the metadata (data about the data) showing where the victims lived, where they died, their age, and their occupation. The course of the 1664 plague can be traced simply by plotting that metadata on a map.
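In modern terms, tracing the outbreak is just grouping deaths by their location metadata. A toy sketch, with records invented purely for illustration:

```python
from collections import defaultdict

# Invented records in the spirit of a Bill of Mortality entry: the
# "metadata" is everything beyond the bare fact of a death.
deaths = [
    {"parish": "St Giles", "age": 34, "occupation": "weaver"},
    {"parish": "St Giles", "age": 51, "occupation": "porter"},
    {"parish": "Cripplegate", "age": 8, "occupation": None},
]

# Plotting the outbreak amounts to counting deaths per location.
by_parish = defaultdict(int)
for record in deaths:
    by_parish[record["parish"]] += 1

for parish, count in sorted(by_parish.items(), key=lambda kv: -kv[1]):
    print(parish, count)  # the hotspots rise to the top
```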
Although the mortality bills were treated as complete accounts of the plague, I doubt they were accurate. Many deaths probably went unrecorded or were attributed to the wrong causes. But the statistics pointed to one conclusion: the overall pattern is clearly visible even when the data is incomplete. When statistics began to develop as a discipline in its own right in 18th-century France, it became clear that nearly as much insight could be extracted from a random sample as from the full data set. We see this today when pollsters predict election results from small random samples of voters. And the occasional spectacular misses show that sampling is far from a perfect method.
Sampling and polling give results we believe, but a 100 percent sample, like a census or an election, gives results we know.
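Here is a small numerical illustration of that claim (the electorate and the 52 percent figure are invented): a random sample of a thousand voters lands within a couple of points of the full count.

```python
import random

# A made-up "electorate" of one million voters, 52% favoring candidate A.
random.seed(1)
population = [1 if random.random() < 0.52 else 0 for _ in range(1_000_000)]

# The census asks everyone; the poll asks a random sample of 1,000.
census_share = sum(population) / len(population)
sample = random.sample(population, 1_000)
poll_share = sum(sample) / len(sample)

print(f"census (what we know):  {census_share:.3f}")
print(f"poll (what we believe): {poll_share:.3f}")
```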
Data processing. Storing data is not at all the same as processing it. Libraries store data perfectly well, but getting at it is hard. You still have to find the book, open it, and read it, and even then the level of detail you can work with is limited by your own memory.
At the end of the 19th century the American statistician Herman Hollerith devised a system that recorded data automatically as holes punched in paper cards, cards that could then be sorted mechanically to extract meaning from the data. Hollerith called the process tabulating. He patented the technology, and his Tabulating Machine Company, based in Washington, eventually became today's International Business Machines (IBM).
For decades, the primary function of IBM's machines was sorting. Imagine that each card is one customer account at an electric company. The machines made it easy to put the customer base in alphabetical order by last name, or to sort it by billing date, by amount owed, by whether any debt existed at all, and so on. In those days data processing meant sorting, and punch cards were excellent at it. People could do the same job, of course, but machines saved time, so they were used above all to make sure every bill went out before the end of the month.
The first databases were simply decks of such cards. And it was easy for the electric company to decide what belonged on a card: name and address, electricity consumption for the current billing period, the date the bill should go out, and current payment status: do you pay your bills?
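Here is what "data processing meant sorting" looks like restated in a few lines of Python; the card fields and values are invented for illustration.

```python
# Each dict stands in for one punch card in the electric company's deck.
cards = [
    {"name": "Baker, A.", "kwh": 310, "bill_date": "1952-06-05", "balance_due": 12.40},
    {"name": "Adams, C.", "kwh": 150, "bill_date": "1952-06-01", "balance_due": 0.00},
    {"name": "Clark, B.", "kwh": 420, "bill_date": "1952-06-03", "balance_due": 7.15},
]

# The same deck, re-sorted for different jobs, which is all the early
# machines really did.
alphabetical = sorted(cards, key=lambda c: c["name"])
by_bill_date = sorted(cards, key=lambda c: c["bill_date"])
debtors_first = sorted(cards, key=lambda c: c["balance_due"], reverse=True)

print([c["name"] for c in alphabetical])
print([c["name"] for c in debtors_first])
```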
But what if a new product or service had to be added? A new data field would have to be added to every card, including every card already punched. Changes like that made mechanical sorting a curse. In the search for greater flexibility, a new kind of database emerged in the 1950s, and it changed the worlds of business and travel.
Transaction processing. American Airlines' SABER reservation system was the world's first automated real-time system. It was not just the first reservation system but the first computer system anywhere in which operators interacted with it in real time and everything happened entirely inside the computer. A prelude to Big Data. And all of this worked back when we were still tracking Russian bombers by hand.
Before SABER, data processing always happened after the fact. Accounting systems looked back at the previous quarter or month and worked out how to present what had already happened, and the process took as long as it took. But SABER sold tickets for future flights, based on seat information that existed only inside a computer.
Imagine SABER as a shoebox holding the tickets for every seat on flight AA 99. Selling tickets out of one shoebox protects you from selling the same seat twice, but what if you need to sell seats simultaneously through agents in offices all over the country? That takes a computer system and terminals, which at the time did not exist. American Airlines founder C.R. Smith had to end up sitting next to IBM's Tom Watson on an airplane for the whole thing to get started.
A key point in the SABER story: the task was so demanding that IBM had no computer suitable for running such a system. So American Airlines became the first customer for the largest computers built in those years. The airline did not write programs for individual tasks. Instead, the computers in the world's first corporate data center, in Tulsa, Oklahoma (it is still there), were devoted exclusively to selling airline tickets and could do nothing else. Software as we know it came later.
American Airlines and SABER dragged IBM into the mainframe business, and AA and IBM designed those first systems together.
SABER set the general direction for data-driven computing applications from the 1950s through the 1980s. Bank tellers eventually got computer terminals, but just as with airline reservations, those terminals could do only one job, banking, and each bank customer's data was typically stored on an 80-column punch card.
Moore's Law. Once computers were applied to data processing, their speed made it possible to dig into the data and extract more meaning. The high cost of computers confined their use to highly profitable applications like selling airline tickets. But the arrival of solid-state computers in the 1960s began a steady rise in computing power and fall in cost that continues to this day. That is Moore's Law. What cost American Airlines $10 in 1955 cost ten cents by 1965, a tenth of a cent by 1975, and about a billionth of a cent today.
The power of the entire SABER computing system in 1955 was less than the power of a modern mobile phone.
The effect of Moore's Law, and above all the ability to predict reliably where the cost and capability of computers will be in ten or more years, lets us apply computing power to ever cheaper, less lucrative tasks. That is what turned data processing into Big Data.
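The cost figures above are just repeated division by 100, roughly two orders of magnitude per decade. A quick check of the arithmetic (the $10 starting point is the article's illustration, not measured data):

```python
# Two orders of magnitude (100x) cheaper per decade, starting from the
# article's $10 figure for 1955.
cost = 10.0
year = 1955
for _ in range(7):
    print(f"{year}: ${cost:g}")
    cost /= 100
    year += 10
# By 2015 the cost is $1e-11, i.e. about a billionth of a cent.
```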
But for that to happen, we had to get away from building a new computer every time we needed a new database. Hardware had to give way to software, and the software in turn had to become easier to modify as the needs of government and industry changed. The solution was the relational database management system. The concept was developed at IBM, but it was brought to the world by a Silicon Valley startup, Oracle Systems, led by Larry Ellison.
Ellison launched Oracle in 1977 with $1,200. Today he is (depending on when you read this) the third-richest man in the world and the model for the hero of the movie "Iron Man".
Before Oracle, data meant tables: rows and columns. They lived in the computer's memory if there was room, or, as was usually the case in the 1970s, they were written to magnetic tape and read back from it. Such flat-file databases were fast, but the logical relationships between the data often could not be changed. If a record had to be deleted or a field changed, everything had to be reworked and an entirely new database designed and written back to tape.
With flat-file databases, both changing the data and finding meaning in it was hard.
Ted Codd of IBM, an English-born mathematician working in San Jose, California, wanted something better than the flat file. In 1970 he published a paper describing a new relational database model in which data could be added and removed, and the important relationships within the data could be redefined on the fly. Where before Codd's model the payroll system was the payroll system and the inventory system was the inventory system, the relational approach separated the data from the applications that processed it. Codd envisioned a common database that had attributes of both the payroll and the inventory systems at once, and that could change as needed.
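A minimal modern sketch of Codd's idea, using SQLite from Python; the tables, data, and queries are invented for illustration. One shared database serves both "payroll" and "inventory", and a relationship nobody planned for can be defined on the fly in the query itself.

```python
import sqlite3

# One shared database instead of one rigid file per application.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, dept TEXT, salary REAL);
    CREATE TABLE inventory (id INTEGER PRIMARY KEY, item TEXT, dept TEXT, quantity INTEGER);
    INSERT INTO employees VALUES (1, 'Adams', 'warehouse', 900.0), (2, 'Baker', 'office', 1100.0);
    INSERT INTO inventory VALUES (1, 'transformers', 'warehouse', 40), (2, 'meters', 'warehouse', 250);
""")

# The payroll view of the data...
print(db.execute("SELECT name, salary FROM employees").fetchall())

# ...and a payroll-to-inventory relationship defined on the fly, something
# a flat file laid out for one application could not do.
print(db.execute("""
    SELECT e.name, i.item, i.quantity
    FROM employees e JOIN inventory i ON e.dept = i.dept
""").fetchall())
```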
The relational model was a huge step forward for databases, but IBM was making a lot of money from its older technology and did not rush to turn the new idea into a product, leaving that opportunity to Ellison and Oracle.
Oracle implemented almost all of Codd's ideas and then went a step further, making its software run on different kinds of computers and operating systems, which broadened the market for it. Other relational database vendors, including IBM and Microsoft, followed, but Oracle remains the largest player. Relational databases not only made business applications flexible, they opened the door to new classes of applications: recruiting, CRM, and above all what is called business intelligence. Business intelligence takes the information you already have and reveals the value hidden in it. It is one of the key uses of Big Data.
The Internet and the World Wide Web. The computers running relational databases like Oracle were, at that time, mainframes and minicomputers, so-called "big iron." They lived on corporate networks, and consumers never touched them. That changed with the arrival of the commercial Internet in 1987 and then the web in 1991. Although the typical access point in those early years was a personal computer, it was essentially a client. The server, where the data physically lived, was as a rule a much larger machine, easily capable of running Oracle or another relational database. And they all relied on the same Structured Query Language (SQL) to query the data. So web servers relied on databases almost from birth.
For the most part, those databases did not keep state between queries. In other words, if you query the database and then modify the query slightly, the modified query is still an entirely new job. You could ask, for example, "How many gadgets did we sell last month?" and get an answer, but if you followed up with "How many of them were blue?", the computer had to treat it as a completely new request: "How many blue gadgets did we sell last month?"
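The same point in code, with SQLite standing in for a far bigger database (the sales table is invented): the second question cannot refer back to "those" gadgets, it has to be restated as a brand-new query.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sales (item TEXT, color TEXT, sold_on TEXT)")
db.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
    ("gadget", "blue", "2016-05-02"),
    ("gadget", "red", "2016-05-10"),
    ("gadget", "blue", "2016-05-21"),
])

# Question 1: how many gadgets did we sell last month?
q1 = "SELECT COUNT(*) FROM sales WHERE sold_on LIKE '2016-05-%'"
print(db.execute(q1).fetchone()[0])

# Question 2 cannot say "how many of those were blue": the database has
# already forgotten question 1, so the query is restated from scratch.
q2 = "SELECT COUNT(*) FROM sales WHERE sold_on LIKE '2016-05-%' AND color = 'blue'"
print(db.execute(q2).fetchone()[0])
```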
You might ask, why does any of this matter? Who cares? Amazon.com founder Jeff Bezos did, and his interest in the problem changed the world of commerce forever.
Amazon.com was built on top of the World Wide Web, which its creator, Tim Berners-Lee, defined as a stateless system. Why did Tim do that? Because Larry Tesler of Xerox PARC hated modes. Larry's license plate reads NO MODES. As an interface specialist he was opposed to operating modes of any kind. An example of a mode: hold down the Ctrl key and everything that happens next is handled differently by the computer. Modes create state, and state is bad, so there was no state inside the Xerox Alto. And since Tim Berners-Lee was largely busy connecting Alto computers to the network at CERN to build His Majesty the World Wide Web, the web got no modes either. The web was "stateless."
But a stateless web was a serious problem for Amazon, because Jeff Bezos dreamed of ridding the world of middlemen in the form of physical stores. That is hard to do if you have to start from scratch every time. If you used Amazon in its early days, you may remember that when you logged out, all the data about your session was thrown away. The next time you logged in (if the system recognized you at all, which it usually did not), you could probably see your previous orders, but not the items you had merely looked at.
Amazon's obsession with customer service, as shown in this figure from the company's initial business plan, was an integral part of its unique business model.
Bezos, a former Wall Street IT guy who knew all the business intelligence tools of the day, needed a system that, the next time you logged in, could ask, "Are you still looking for long underwear?" Perhaps it is still in your basket, that pair you looked at last time but decided not to buy. This simple trick of keeping a history of recent activity was the real beginning of Big Data.
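A toy sketch of the kind of per-customer state being described: a server-side record keyed by a cookie that survives between visits. This is purely illustrative and not Amazon's actual design.

```python
# Server-side session state keyed by a cookie, surviving between visits.
sessions = {}

def handle_visit(cookie_id, viewed_item=None):
    session = sessions.setdefault(cookie_id, {"viewed": [], "cart": []})
    if viewed_item:
        session["viewed"].append(viewed_item)  # remember every item looked at
    if session["viewed"]:
        return f"Welcome back! Still looking for {session['viewed'][-1]}?"
    return "Welcome!"

print(handle_visit("cookie-123", viewed_item="long underwear"))
print(handle_visit("cookie-123"))  # next visit: the history is still there
```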
Amazon built its e-commerce system on Oracle and spent $150 million developing the features described here, features that seem trivial now but were impossible then. Bezos and Amazon started by tracking what you bought, then what you intended to buy, then what you merely viewed, then every click and keystroke. They still do all of this on the site today, whether you are logged in or not.
Keep in mind that this was 1996, when an Internet startup raised $3-5 million of venture capital at most, yet Amazon spent $150 million on big-data customer service, something nobody had ever done before. How many standard deviations is that from what venture investors considered acceptable? Almost from the company's very beginning, Bezos bet everything on Big Data.
The gamble paid off, which is why Amazon.com is worth $347 billion today, $59 billion of which belongs to Jeff Bezos.
That Jeff Bezos was inspired to build such functionality, and that he and his team managed to do it on Oracle, a SQL-based relational database management system never intended for such tasks, is the first miracle of Big Data.
(Translation by Natalia Bass)