What is Big Data, Part 2

Original author: Robert X. Cringely
  • Translation


In the first part of this series you learned about data and how computers can be used to extract meaning from large blocks of it. You even saw something resembling big data at Amazon.com in the mid-nineties, when the company launched technology to monitor and record, in real time, everything that thousands of simultaneous customers did on its website. Impressive, but calling it big data is a stretch; bulky data is more like it. Organizations such as the US National Security Agency (NSA) and the UK's Government Communications Headquarters (GCHQ) were already collecting big data at the time as part of their espionage operations, recording digital messages even though they had no easy way to decrypt them or find meaning in them.


What Amazon.com did was simpler. Customer satisfaction could be measured easily enough, even across tens of thousands of products and millions of consumers, because there are only so many things a customer can do in a store, real or virtual: see what's available, ask for more information, compare products, put something in the cart, buy, or leave. All of that fit within the capabilities of relational databases, in which the relationships between all types of actions can be defined in advance. And they must be defined in advance, which is the problem with relational databases: they are not easily extended.


Knowing the structure of such a database in advance is like drawing up a list of all the potential friends of your unborn child... for life. The list has to include friends who haven't been born yet, because once it is finalized, adding a new entry requires serious surgery.


The search for relationships and patterns in data requires more flexible technologies.


The first major technical challenge of the 1990s Internet was dealing with unstructured data: the data that surrounds us every day and had never before been treated as something to store in a database. The second challenge was processing that data very cheaply, because its volume was high and its information yield was low.


If you need to listen to a million telephone conversations in the hope of detecting at least one mention of al-Qaeda, you will need either a substantial budget or a new, very cheap way to process all this data.


The commercial Internet of the day faced two very similar problems: finding all sorts of things on the World Wide Web, and paying for the ability to find them through advertising.


The search problem. By 1998 the total number of websites had reached 30 million (today there are more than two billion). Thirty million sites, each containing many web pages. Pbs.org, for example, is a single site with more than 30,000 pages. Each page holds hundreds or thousands of words, images, and blocks of information. To find anything on the web, you had to index the entire Internet. Now that is big data!


To build the index, you first had to read the whole web: all 30 million hosts in 1998, or two billion today. This was done with so-called spiders, or web crawlers: computer programs that methodically scour the Internet for new web pages, read them, and copy their contents back into the index. Every search engine uses crawlers, and they have to run continuously, keeping the index current as web pages appear, change, or disappear. Most search engines keep an index not only of the current web but of all its earlier versions too, so a search for older revisions can reach back into the past.
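
To make the crawling step concrete, here is a minimal Python sketch of what a spider does. The seed URL, the page limit, and the link handling are illustrative assumptions, not how any real search engine's crawler is built.

```python
# A toy web crawler ("spider"): fetch a page, keep its HTML for indexing,
# collect the links it contains, and repeat breadth-first.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkParser(HTMLParser):
    """Collects the href targets of <a> tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed, max_pages=10):
    """Breadth-first crawl starting from `seed`; returns {url: html}."""
    seen, pages = set(), {}
    queue = deque([seed])
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        if url in seen:
            continue
        seen.add(url)
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "ignore")
        except Exception:
            continue  # unreachable or non-HTML pages are simply skipped
        pages[url] = html
        parser = LinkParser()
        parser.feed(html)
        queue.extend(urljoin(url, link) for link in parser.links)
    return pages

# pages = crawl("https://www.pbs.org")  # pbs.org is the example site used in the text
```

A real crawler adds politeness rules, deduplication, and scheduling, but the loop of fetch, record, and follow links is the same.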


Indexing means recording the metadata, the data about the data: the words, images, links, and other content such as video or audio embedded in a page. Now multiply that by five hundred million. We index because the index takes up only about one percent of the content it represents, the equivalent of 300,000 pages' worth of data for the 30 million of 1998. But indexing only records metadata; it is not yet search. Pulling useful information out of an index is harder still.


In the Internet's first decade there were dozens of search engines, but four mattered most, and each took its own technical approach to extracting meaning from all those pages. Alta Vista was the first true search engine. It came out of the Digital Equipment Corporation lab in Palo Alto, which was essentially the computer lab of XEROX PARC moved almost intact two miles down the road by Bob Taylor, who built both labs and hired most of the senior staff.


Alta Vista took a linguistic approach to searching its web index. It indexed every word in a document, such as a web page. If you asked it to "search for gold doubloons," Alta Vista scanned its index for documents containing the words "search," "gold," and "doubloons," and returned a list of pages sorted by how many times the query words appeared.
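
To illustrate that kind of word-counting search, here is a minimal Python sketch of an inverted index ranked by term frequency. The three "pages" are invented examples; this shows the general technique, not Alta Vista's actual code.

```python
# Alta Vista-style ranking in miniature: build an inverted index of word
# counts, then sort pages by how often the query words appear in them.
from collections import Counter, defaultdict

docs = {
    "page1": "gold doubloons and more gold found in the search",
    "page2": "search tips for searching the web",
    "page3": "doubloons doubloons doubloons gold",
}

# Inverted index: word -> {page: number of occurrences}
index = defaultdict(Counter)
for page, text in docs.items():
    for word in text.lower().split():
        index[word][page] += 1

def search(query):
    """Rank pages by the total occurrences of the query words."""
    scores = Counter()
    for word in query.lower().split():
        scores.update(index.get(word, {}))
    return scores.most_common()

print(search("search gold doubloons"))
# page1 and page3 come out on top simply because the query words occur
# most often there; nothing distinguishes a good page from a stuffed one.
```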


But even then there was plenty of crap on the Internet, which meant Alta Vista indexed all that crap and had no way to tell good from bad. They were just words, after all. Bad documents often floated to the top, and the system was easy to game by stuffing pages with hidden words that skewed the results. Alta Vista could not tell real words from hidden ones.


While Alta Vista's advantage was powerful DEC hardware (no small thing, since DEC was a leading computer maker), Yahoo!'s advantage was people. The company hired workers to literally browse the web all day, index pages by hand (and not very carefully), and flag the most interesting ones on each topic. With a thousand human indexers, each able to index 100 pages a day, Yahoo could index 100,000 pages per day, or about 30 million a year: the entire Internet universe of 1998. This worked beautifully until the web grew to intergalactic scale and became too much for Yahoo. Yahoo's early, people-powered system simply did not scale.


Then came Excite, which was built on a linguistic trick: instead of searching for exactly what a person typed, it tried to find what they actually meant, since not everyone can phrase a query precisely. And again, the problem had to be solved with scarce computing power (that is the key point).


Excite used the same kind of index as Alta Vista, but instead of just counting how often the words "gold" or "doubloon" appeared, the six Excite founders took a vector geometry approach in which each query was defined as a vector made up of the query terms and their frequencies. A vector is just an arrow in space, with a starting point, a direction, and a length. In the Excite universe, the starting point was the complete absence of the search words (zero "search," zero "gold," zero "doubloons"). The query vector began at that zero-zero-zero point for the three search terms and then extended to, say, two units of "search," because that is how many times the word appeared in the target document, thirteen units of "gold," and maybe five of "doubloons." This was a new way of characterizing the index and a better way of describing the stored data, since it sometimes produced results that did not contain any of the search words directly, something Alta Vista could not do.


The Excite web index was not just a list of words and their frequencies; it was a multidimensional vector space in which a search was a direction. Each query was one spine on the data hedgehog, and Excite's clever strategy (the genius of Graham Spencer) was to grab not one spine but all the spines in its neighborhood. By covering not only the documents that matched the query exactly (as Alta Vista did) but also those nearby in the multidimensional vector space, Excite was a more useful search tool. It worked from the index, used vector math to process queries, and, more importantly, needed almost no computation to produce a result, because the computation had already been done at indexing time. Excite delivered better and faster results on primitive hardware.
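
Here is a minimal Python sketch of that vector-space idea. The toy documents and the cosine-similarity measure are illustrative assumptions in the spirit of the approach described above, not Excite's actual algorithm.

```python
# Documents and queries become term-frequency vectors; results are ranked by
# how closely a document's direction matches the query's direction (cosine),
# not just by raw counts of matching words.
import math
from collections import Counter

docs = {
    "page1": "gold doubloons and more gold found in the search",
    "page2": "search tips for searching the web",
    "page3": "pirate treasure with gold coins and doubloons",
}

def vectorize(text):
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine of the angle between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

doc_vectors = {page: vectorize(text) for page, text in docs.items()}

def search(query):
    q = vectorize(query)
    return sorted(((cosine(q, v), page) for page, v in doc_vectors.items()),
                  reverse=True)

print(search("gold doubloons"))
# page1 and page3 both rank well because their directions lean toward "gold"
# and "doubloons"; page2 shares no query terms and scores zero.
```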


But Google was even better.


Google brought two improvements to search: PageRank and cheap hardware.


Excite's refined vector approach helped surface the desired results, but even its results were often useless. Google's Larry Page came up with a way to measure usefulness, a trust-based idea that produced greater accuracy. An early Google search started with linguistic methods much like Alta Vista's, but then applied an extra filter, PageRank (named after Larry Page, did you notice?), which took the initial results and ranked them by the number of pages linking to them. The idea was that the more page authors had bothered to link to a page, the more useful (or at least interesting, even in a bad way) that page must be. And they were right. Other approaches began to die off, and Google quickly pulled ahead with its PageRank patent.
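
A minimal Python sketch of the PageRank idea follows. The tiny link graph, the damping factor, and the fixed number of iterations are illustrative assumptions; this is the textbook power-iteration form of the algorithm, not Google's production system.

```python
# A page's score is fed by the scores of the pages that link to it, so links
# from well-linked pages count for more than links from obscure ones.
def pagerank(links, damping=0.85, iterations=50):
    """links: {page: [pages it links to]} -> {page: rank}, ranks sum to 1."""
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1 - damping) / len(pages) for p in pages}
        for page, outgoing in links.items():
            if not outgoing:  # a dead-end page spreads its rank evenly
                for p in pages:
                    new_rank[p] += damping * rank[page] / len(pages)
            else:
                for target in outgoing:
                    new_rank[target] += damping * rank[page] / len(outgoing)
        rank = new_rank
    return rank

links = {
    "home": ["about", "news"],
    "about": ["home"],
    "news": ["home", "about"],
    "orphan": ["home"],  # nothing links here, so it ends up ranked lowest
}
print(pagerank(links))  # "home" wins: the most pages vouch for it with links
```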


But there was another piece that Google did differently. Alta Vista came out of Digital Equipment and ran on a huge cluster of DEC VAX minicomputers. Excite ran on similarly powerful UNIX hardware from Sun Microsystems. Google launched on nothing but free, open source software running on machines that were little more than PCs. Actually, they were less than PCs, because Google's homemade computers had no cases and no power supplies (they literally ran on car batteries charged by car chargers). The first units were bolted to walls; later ones were slid into racks like trays of fresh pastry in industrial ovens.


Amazon created a business case for big data and a clumsy way to implement it on hardware and software not yet suited to the job. The search companies greatly expanded the size of practical data sets as they mastered indexing. But real big data could not work from an index; it needed the actual data, and that required either very large and expensive computers, as at Amazon, or a way to make cheap PCs behave like one giant computer, as at Google.


The dot-com bubble. Picture the euphoria and naivety of the Internet in the late 1990s, the era of the so-called dot-com bubble. Everyone from Bill Gates on down agreed that the future of personal computing, and perhaps of business itself, was the Internet. So venture capitalists poured billions of dollars into Internet startups without thinking much about how those companies would actually make money.


The Internet was seen as a vast territory to be claimed, where the point was to build companies as big as possible, as fast as possible, and to seize and hold market share regardless of whether the company made a profit. For the first time in history, companies went public without having earned a penny of profit in their entire existence. But that was treated as normal; the profits would come along the way.


The result of all this irrational exuberance was a flowering of ideas, most of which would never have been funded at any other time. Broadcast.com, for example, was meant to broadcast television over dial-up connections to a huge audience. The idea didn't work, but Yahoo! bought it anyway for $5.7 billion in 1999, which made Mark Cuban the billionaire he remains today.


We like to think Silicon Valley was built on Moore's Law, with computers constantly getting cheaper and more powerful, but the dot-com era only pretended to run on Moore's Law. It actually ran on hype.


Hype and Moore's Law. For many of those 1990s Internet schemes to succeed, the cost of processing data had to fall well below what Moore's Law actually allowed at the time. That is because the business model of most dot-com startups was advertising, and there was a hard ceiling on what advertisers were willing to pay.


For a while it didn't matter, because venture capitalists and then Wall Street investors were willing to cover the difference, but eventually it became clear that Alta Vista, with its huge data centers, could never turn a profit on search alone. Neither could Excite, nor any other search engine of the era.


The dot-coms collapsed in 2001 because the startups ran out of money from the gullible investors who had been funding their Super Bowl ad campaigns. By the time the last dollar from the last fool was spent on the last Herman Miller office chair, nearly all the investors had already sold their shares and moved on. Thousands of companies folded, some of them overnight. Amazon, Google, and a handful of others survived because they figured out how to actually make money online.


Amazon.com was different because Jeff Bezos's business was e-commerce, a new kind of retail that was supposed to replace bricks with electrons. For Amazon, savings on real estate and payroll mattered a great deal, and its profit was measured in dollars per transaction. For a search engine, the first real application of big data and a genuinely Internet-native tool, the advertising market paid less than a cent per transaction. The only way to make that work was to beat Moore's Law, cutting the cost of data processing even further, while also tying search to advertising and raising revenue. Google did both.


It was time for the Second Glorious Coming of Big Data, which goes a long way toward explaining why Google is worth $479 billion today while most of the other search companies are long dead.



GFS, MapReduce, and BigTable. Because Page and Brin understood before anyone else that building their own ultra-cheap servers was the key to the company's survival, Google had to build a new data processing infrastructure to make thousands of cheap PCs look and work like a single supercomputer.


While other companies seemed resigned to losing money in the hope that Moore's Law would eventually kick in and make them profitable, Google found a way to make its search engine profitable back in the late 1990s. Doing so meant inventing new hardware, software, and advertising technology, and Google's work in those areas led directly to the world of big data taking shape today.


First, consider the scale of Google today. When you run a search, you first interact with roughly three million web servers in hundreds of data centers around the world. All those servers do is send page images to your screen, an average of 12 billion pages per day. The web index is held on another two million servers, and a further three million servers store the actual documents in the system. That is eight million servers in all, not counting YouTube.


There are three key components in Google's dirt-cheap architecture. The first is the Google File System, or GFS, which lets all those millions of servers treat shared storage as if it were their own ordinary memory. It is not one memory, of course, but replicated pieces of files, called chunks; the whole point is that they are shared. If a file changes, it has to change on all the servers at once, even ones thousands of kilometers apart.


It turns out that a huge problem for Google is the speed of light.


The second component, MapReduce, spreads a large job across hundreds or thousands of servers. It maps the task out to many machines, then reduces their many answers back into one.
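
Here is a minimal single-machine sketch of the MapReduce pattern in Python, counting words across documents. A process pool stands in for the server farm, and the documents are invented; this shows the map-then-reduce shape of the computation, not Google's or Hadoop's actual API.

```python
# Map step: each worker counts words in its own chunk of the data.
# Reduce step: the partial counts are merged into one final answer.
from collections import Counter
from functools import reduce
from multiprocessing import Pool

def map_word_counts(document):
    """Map: count the words in one document."""
    return Counter(document.lower().split())

def reduce_counts(total, partial):
    """Reduce: fold a partial count into the running total."""
    total.update(partial)
    return total

if __name__ == "__main__":
    documents = [
        "big data needs cheap servers",
        "cheap servers make big data practical",
        "data data data",
    ]
    with Pool() as workers:  # stand-in for thousands of cheap servers
        partial_counts = workers.map(map_word_counts, documents)
    totals = reduce(reduce_counts, partial_counts, Counter())
    print(totals.most_common(3))  # e.g. [('data', 5), ('big', 2), ('cheap', 2)]
```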


The third, BigTable, is Google's database, and it holds all the data. It is not relational, because relational databases cannot work at that scale. It is an old-fashioned flat database that, like GFS, has to be kept coherent.
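
For contrast with a relational schema, here is a toy Python sketch of the flat, schema-less shape such a table takes: rows looked up by key, each row a loose bag of named cells. This is an in-memory stand-in for illustration, not BigTable's real API or storage format.

```python
# A flat table: row key -> {column name: value}. New "columns" can appear in
# any row at any time, with no schema declared in advance.
flat_table = {}

def put(row_key, column, value):
    flat_table.setdefault(row_key, {})[column] = value

def get(row_key, column=None):
    row = flat_table.get(row_key, {})
    return row if column is None else row.get(column)

put("com.pbs.www/index.html", "contents", "<html>...</html>")
put("com.pbs.www/index.html", "anchor:example.org", "PBS homepage")  # new cell, no schema change
put("user:42", "last_query", "gold doubloons")

print(get("com.pbs.www/index.html", "anchor:example.org"))
```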


Before these technologies existed, computers worked the way people do: on one task at a time, with a limited amount of information in front of them. Getting thousands of computers to work together on enormous amounts of data was a powerful breakthrough.


But even that was not enough for Google to reach its financial goals.


Big Brother started as an advertiser. Making data processing cheaper only got Google part of the way toward Amazon.com's margins. The remaining gap between a cent and a dollar per transaction could be closed only by finding a more lucrative way to sell online advertising. Google did that by effectively indexing its users, just as it had earlier indexed the Internet.


By studying our behavior and anticipating our needs as consumers, Google could show us ads we were 10 or 100 times more likely to click, which raised Google's likely revenue from each ad by 10 or 100 times.


Now we are finally talking about big data.


Whether Google's tools were pointed at its internal world or the outside one made no difference; they worked the same way either way. And unlike the SABRE system, for example, these were general-purpose tools that could be applied to almost any kind of task and almost any kind of data.


GFS and MapReduce were the first tools to place no limits on database size or search scalability. All they needed was ever more commodity hardware, eventually reaching millions of cheap servers sharing the work. Google constantly adds servers to its network, but it does so shrewdly: unless an entire data center is being taken offline, servers that fail are not replaced. That is too much trouble. They are simply left dead in their racks while MapReduce routes the work around them, using the servers that still respond.


Google published a paper on GFS in 2003 and one on MapReduce in 2004. One of the wonders of this business is that Google did not even try to keep its methods secret, though the others would probably have arrived at similar solutions on their own eventually.


Yahoo!, Facebook, and others quickly built an open source version of MapReduce called Hadoop (named after a toy elephant; elephants never forget). Hadoop is what made possible what we now call cloud computing: a paid service that spreads your job across dozens or hundreds of rented computers, sometimes rented for only a few seconds, and then combines their many answers into one coherent result.


Big data made cloud computing a necessity. Today it is hard to draw a line between the two.


MapReduce and Hadoop made possible not only big data but also social networks, because they made it economical for a billion Facebook users to build their own dynamic web pages for free while the company made its money from advertising alone.


Even Amazon switched to Hadoop, and today there is virtually no limit to how far its network can grow.


Amazon, Facebook, Google, and the NSA could not function today without MapReduce or Hadoop, which, incidentally, did away forever with the need for an index. Today's searches run not against an index but against the raw data, which changes from minute to minute. Or, more precisely, the index is updated minute by minute. Either way, it hardly matters.


With these tools, Amazon and other companies offer cloud computing as a service. Armed with nothing but a credit card, clever programmers can, within a few minutes, harness the power of one, a thousand, or ten thousand computers and put them to work on a problem. That is why Internet startups no longer buy servers. If you want, briefly, to command more computing power than all of Russia, all you need is a card to pay with.


And if Russia wants more computing power than Russia has, it can use a credit card too.


One big question remains unanswered: why did Google share its secrets with competitors and publish its research? Was it naive arrogance on the part of founders who at the time were still working on their doctorates at Stanford? No. Google shared its secrets to create an industry. It needed not to look like a monopoly in its competitors' eyes. More importantly, by letting thousands of other flowers bloom, Google boosted the whole Internet industry, growing its revenue by 30 to 40 percent.


By sharing its secrets, Google took a smaller slice of a much larger pie.


And that, in a nutshell, is how big data came about. Google tracks your every mouse click, along with a billion or more clicks from other people. Facebook and Amazon do the same whenever you are on their sites, or on any other site that uses Amazon Web Services. Between them, they account for about a third of the data processing on the entire Internet.


Think for a moment about what this means for society. Where businesses once relied on market research and guesswork about how to sell products to consumers, they can now use big data to know what you want and how to sell it to you. That is why, for a long time, I kept seeing Internet ads for expensive espresso machines. And when I finally bought one, the ads stopped almost instantly, because the system learned of my purchase. It moved on to trying to sell me coffee beans and, for some reason, adult diapers.


Someday Google will have a server for every Internet user. It and other companies will collect ever more kinds of data about us and predict our behavior ever better. Where that leads depends on who uses the data. It could turn us into perfect consumers, or into captured terrorists (or perhaps more successful terrorists, but that is an argument for another time).


No task is insurmountably large anymore.


And for the first time, thanks to Google, the NSA and GCHQ finally have the search tools for the intelligence data they have been storing, and they can find all the bad guys. Or perhaps enslave us forever.


(Translated by Natalia Bass)

