Cloud Bottlenecks: Pokemon Go and Trivia Crack Stories

Original author: Matthew Rothenberg

Lesson: "A system that works with two million users may not be able to handle ten million."

After its US release in July 2016, Pokémon Go became the most popular augmented reality game of its time. The product grew out of a long-standing collaboration between game developer Niantic and Google (Niantic was an internal Google startup until it got on its feet). As a result, the infrastructure of Pokémon Go depended heavily on Google's cloud platform and application services. (Nintendo and Pokémon also contributed to turning the gameplay of raising little monsters into an engaging mobile experience.)

This was not Niantic's first augmented reality game. The company had previously created Ingress, an alien invasion game released in 2013 for Android devices. But Pokémon Go was on a completely different level: Pokémon has long been a cultural phenomenon, and the game appealed to an audience that had waited years for a mobile title. The number of installations therefore grew rapidly. Within half a day, the game took first place in revenue on the iPhone. To some extent, it was the world's largest mobile game release.

But its success increased the load on the platform: two days after the release, Niantic CEO John Hanke announced that the company was postponing the worldwide release of Pokémon Go because of overloaded servers. At the same time, privacy issues surfaced because of the way Niantic worked with Google's identity and location services. The company had to fix many bugs while also solving its server capacity problems.

In theory, the cloud should absorb peak loads and simplify application management, and the services offered by cloud providers should make it easier to develop all kinds of mobile applications (not just games). The cloud really has made it easier to use new features that require more computing power, such as augmented reality.

But, as with other network platforms of the past, developers found that all this capacity matters little if there is no way to connect to it. The more interactive mobile applications become, the harder it is to exchange data between mobile devices and the cloud infrastructure. Add the differences in data transfer rates between telecom operators around the world, and you get a system in which many parameters must be taken into account to deliver the speed users need.

Few applications take off the way the worldwide viral hit Pokémon Go did. Developers who want to scale games and applications will find it helpful to study how Niantic and other game developers coped with their unexpected success. If hit mobile games can overcome the obstacles that arise when testing and debugging the performance of device-to-cloud interaction, then corporations can also handle unexpected peaks in user activity.

We will process them all


To find out what lessons Pokémon Go has to teach, Ars Technica recently talked to Niantic CTO Phil Keslin. We discussed the complex interactions between the publicly accessible parts of Google Cloud and internal data.

Pokémon Go uses Google Compute Engine, Google's cloud-based data store, and a full stack of network technologies, including data and query infrastructures. Of course, the game uses Google Maps to determine the location of the player. According to Keslin, every change in game state requires the mobile client to make calls to the Niantic data store. “With every change in the state of the game, whether it is throwing a Poké Ball, catching a Pokémon, or some other action, there is an interaction with the data store.”

When the first big peak arrived, according to Keslin, “Google didn't even notice it, even though the game at least doubled the amount of data being processed.” This did not overload the system, however. “The easiest way to say this: we had a forecast for the worst-case scenario, but the game exceeded even that.” On the day of release there was a real explosion. “We found bottlenecks that slowed performance. After eliminating them, we ran into a new bottleneck.”

Some of the bottlenecks were in the Niantic code, “but we also had problems with a couple of open source libraries, which we never expected, and those were the hardest ones to solve.” Altogether, Niantic found five or six bottlenecks, each of which took one to two days to resolve.

Problems also arose on Google's side. Pokémon Go ran into cloud infrastructure issues: the container engine contained subsystems that had never been tested under such a load. There were also a couple of problems with the network stack.

Eliminating the bottlenecks required a lot of work from a team of five people: Keslin, the team leader, and three service engineers. “In the first two weeks, we hardly slept,” says Keslin. “The guys from Google gave it their all, too.”

Another factor that affected performance in different regions, according to Keslin, was the difference between mobile operators in different parts of the world. “We designed Pokémon Go so that the game can work on low-bandwidth mobile devices. The problems that arose had more to do with the marketing programs of telecom operators.” For example, a major mobile operator in the Philippines gave all its subscribers free access to Pokémon Go, so Niantic had to make sure that access for those users was switched on and then off again when the promotion ended.

Despite the initial chaos, Keslin stated that Niantic did not have to change the architecture of Pokémon Go after the release of the game. (The company continues to improve the application steadily and is preparing to release the second generation of Pokémon Go, which, it hopes, will give the game a second wind now that the wave of Pokémania has subsided.) “The infrastructure was designed with the full range of Pokémon in mind,” he says. “The core of the system will remain the same, we just add new gameplay. We were lucky to have built a scalable system, because we had not tested it. Fortunately, the architecture we created scales reliably.”

What advice does Keslin have for other developers hoping to create the next augmented reality phenomenon? “Think about scaling from the start. Our game's development team focused on performance. Thanks to this, we were able to maximize performance at low cost and to scale the system.”

Addiction to Trivia Crack




Other companies have also taken their gaming infrastructure to the public cloud, with mixed results. Two years before Pokémon Go hit the jackpot, Argentina-based Etermax created its own mobile gaming hit: Trivia Crack, one of a series of competitive games running on Amazon Web Services.

According to Etermax technical director Gonzalo Garcia, the huge success of Trivia Crack came in two waves. The first, in March 2014, marked the game's success in parts of South America; traffic grew from 100 thousand to 10 million daily active users. The second wave came when the game became popular in the United States in October of that year, raising the number of daily active users to 25 million.

“We did not foresee this,” Garcia says. “According to our estimates and tests, we knew that we could handle one million players, and we planned for two million. But we did not expect such growth. We had not even invested that much in advertising! Without cloud infrastructure, we would never have managed it.”

“We thought we were releasing just another company game,” adds Etermax IT Director Martin Dominguez. “Previously, our limit was one million users, but what suits two million is not always enough for ten million.”

The load also put stress on the Agile development process. “We thought we weren't tied to Scrum, and we always worked in an Agile style,” says Dominguez. “The problem was that with such a jump in the number of users, sprints could not be completed in two weeks, and we had to work day by day.”

To cope with its popularity, Etermax first had to drop some features. It also had to modify some databases and adapt its processes to improve efficiency. In particular, Etermax changed the way it used the Amazon Relational Database Service (RDS): it introduced sharding, secondary (replica) servers, and data exchange between the secondary servers. The company also worked with AWS to solve a packets-per-second problem.
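The article does not say how Etermax split its data, so the following is only a minimal sketch of one common way to combine sharding with secondary servers: hash the user ID to pick a shard, send writes to that shard's primary, and spread reads across its replicas. The class name `ShardRouter`, the host names, and the hash-modulo scheme are all assumptions made for illustration, not details from Etermax or AWS.

```python
import hashlib
import random


class ShardRouter:
    """Maps a user to a database shard (hypothetical layout, not Etermax's)."""

    def __init__(self, shards):
        # shards: list of {"primary": host, "replicas": [host, ...]}
        if not shards:
            raise ValueError("at least one shard is required")
        self.shards = shards

    def shard_index(self, user_id: str) -> int:
        # A stable hash keeps a given user on the same shard in every process.
        digest = hashlib.sha256(user_id.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % len(self.shards)

    def for_write(self, user_id: str) -> str:
        # All writes for a user go to the primary of that user's shard.
        return self.shards[self.shard_index(user_id)]["primary"]

    def for_read(self, user_id: str) -> str:
        # Reads are spread over the replicas; fall back to the primary
        # when a shard has no replicas configured.
        shard = self.shards[self.shard_index(user_id)]
        return random.choice(shard["replicas"] or [shard["primary"]])


# Illustrative host names only.
router = ShardRouter([
    {"primary": "db0", "replicas": ["db0-r1", "db0-r2"]},
    {"primary": "db1", "replicas": ["db1-r1"]},
])
print(router.for_write("player-123"), router.for_read("player-123"))
```

One trade-off of the modulo scheme is that changing the number of shards remaps most users, so a growing system would more likely use a lookup table or consistent hashing; the sketch keeps the simpler form to show the routing idea.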

To get support for advanced networking capabilities, Etermax had to migrate from Amazon's public network to a virtual private cloud. “It was pretty complicated,” recalls Dominguez. “At the second peak, when the number of users rose to 25 million, AWS employees said they had never seen a company use RDS at that level.”

Garcia said that one feature helped Etermax cope with the explosive growth in popularity of Trivia Crack. “In Trivia Crack, the interface was almost entirely on the mobile device, so only small amounts of information were transmitted.” Because of this, synchronous connections were less of a problem for Trivia Crack than for other Etermax games, such as the mobile bingo game Bingo Crack: “We constantly had to transmit information about the bingo balls in order to know who would be the first, second, or third to shout ‘Bingo!’”

What did the success of Trivia Crack teach Etermax? According to Dominguez, when it comes to infrastructure, you need to think fast, be proactive in making changes, and anticipate what problems will arise. “Pokémon Go had the same problem: every day, you had to survive.”

Success management


Patric Palm is the CTO and founder of the Swedish company Hansoft, which created the Favro collaboration software used by many game developers, including Ubisoft and id Software. Palm is in a position to observe how the company's customers tackle their problems in the mobile gaming industry.

Looking at the challenges Pokémon Go faced, Palm highlights the differences in connection speeds between regions of the world. However, he emphasizes that, thanks to cloud computing, scalability is now less of an obstacle.

“A few years ago, scalability was a much more serious problem, and many more people were needed to solve it.” Because Niantic was able to offload some of these problems to Google Cloud, it could focus on implementation and registration in different countries. “Cloud computing solves one of the biggest business challenges,” Palm said.

With a solution to the scalability problem in place, attention turned to other, smaller problems. In the case of Pokémon Go, one of them was rapid battery drain, which forced players to run around the city with external battery packs. “Niantic's game drained the batteries,” Palm says. “Cloud game developers are now thinking a lot about power consumption.”

Another related issue is the data caps in different mobile plans. “Not every user has a generous mobile data plan.” This is another incentive to move most of the load to the server side. Games running in the background consume not only energy but also data.


“We need a bigger cloud.”

Rules of the game


If Pokémon Go and Trivia Crack were able to cope with the problems of scaling cloud computing for mobile platforms, then so can you. Here are some tips for evaluating your own needs:

  1. Consider the “worse than worst” scenario. Of course, when making estimates, you need to settle on some figures. But you also need to plan how to scale beyond your maximum. The main advantage of cloud computing is its flexibility, so think about what will happen if the load grows faster and further than anything you tested.
  2. Discuss emergency situations with your service provider. As traffic grew, Niantic and Etermax had to work closely with Google and Amazon. When choosing a cloud service provider, specifically discuss what services it can provide if your needs turn out to be dramatically greater than expected.
  3. Explore carriers and mobile devices. If your cloud technology will run on users' own mobile devices (possibly in different parts of the world), consider how much data will be transferred between the device and the cloud, and evaluate the data plans and the slowest connections of the most problematic operators your product will encounter.
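The first and third tips can be turned into back-of-the-envelope arithmetic. The sketch below is only an illustration: the overshoot figures use the Etermax numbers quoted earlier (two million planned users, then 10 and 25 million daily active users), while the session counts, request sizes, and link speed are hypothetical placeholders to replace with your own measurements. All function names are invented for this example.

```python
def overshoot_factor(actual_users: float, planned_users: float) -> float:
    """How many times the actual load exceeded the planned maximum."""
    if planned_users <= 0:
        raise ValueError("planned_users must be positive")
    return actual_users / planned_users


def monthly_data_mb(sessions_per_day: int, requests_per_session: int,
                    bytes_per_request: int, days: int = 30) -> float:
    """Approximate data one player transfers per month, in megabytes."""
    total_bytes = sessions_per_day * requests_per_session * bytes_per_request * days
    return total_bytes / 1_000_000


def seconds_per_request(bytes_per_request: int, link_kbps: float) -> float:
    """Transfer time for one request on a link of link_kbps kilobits per second."""
    if link_kbps <= 0:
        raise ValueError("link_kbps must be positive")
    return bytes_per_request * 8 / (link_kbps * 1000)


# Tip 1: Etermax planned for 2 million users (figures from the article).
print(overshoot_factor(10_000_000, 2_000_000))  # 5.0
print(overshoot_factor(25_000_000, 2_000_000))  # 12.5

# Tip 3: hypothetical player with 5 sessions a day, 40 requests per
# session, 2000 bytes per request, on a 64 kbit/s connection.
print(monthly_data_mb(5, 40, 2000))   # 12.0
print(seconds_per_request(2000, 64))  # 0.25
```

Running the numbers this way makes it easier to ask a provider concrete questions, for example what happens at five or twelve times the planned load, and to spot early whether a chatty client would strain a small data plan.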
