How we built cloud infrastructure in Azure

    Case study: building a cloud for a large company


    I have long wanted to talk about how we built a cloud solution for one of our customers.


    So, our Customer is a large international company with hundreds of offices around the world. The core infrastructure is concentrated in two high-end data centers in Europe, and there are no complaints about those. But the local components in the regional offices are managed by a multitude of regional service providers, and this creates a nightmare at the management level, both in resolving IT issues directly and in controlling how the IT budget is spent. The Customer decided that moving most of the non-critical regional services to Microsoft Azure would allow it to save on infrastructure maintenance, concentrate control over spending in the central office and, at the same time, carry out several modernization projects. We had already implemented a hybrid Exchange solution for this Customer, based on Office 365 with on-premises components in several countries.

    All of this took place in late 2015 and early 2016; by now the platform has been built and we have already migrated about 500 servers to it. Clouds have been one of the most popular topics of recent years, and there is plenty of documentation and material describing what a given service can do and how you can use it. So instead we will talk about the other side of the cloud: the problems you can run into when moving your on-premises infrastructure there.

    Update rate


    As you read this article, you may get the false impression that I am bashing Azure. That is not the case. Part of the problem is simply that this cloud service is developing very actively and rapidly. This is not even specific to Azure; it is a common property of clouds. You cannot learn to do something once and then rely on that knowledge for years. You have to keep learning and evolving, and the solutions you sell to Customers must evolve as well. It is hard to blame Microsoft for this, but it does create a fair amount of complexity.

    Beyond the new services described below, a very striking example is PowerShell. Working with the cloud involves a high degree of automation, and the larger your environment, the more relevant this becomes. Besides, some operations simply cannot be done through the portal GUI at all. Updates for Azure PowerShell come out almost every quarter, and very often they significantly expand or change the behavior of cmdlets (new parameters are added, existing ones change, the types of returned objects change, and so on). This means you have to keep track of the announcements, re-check your scripts after every update, and look for opportunities to do something more simply or better.
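
    As an illustration of the kind of defensive habit this forces on you, here is a minimal sketch of a version guard one can put at the top of an automation script, pinning it to the AzureRM module version it was tested against (the version number below is just a placeholder):

        # Minimal sketch: fail fast if the script runs against an untested AzureRM version.
        # The version below is a placeholder; use whatever your scripts were validated with.
        $testedVersion = [Version]'1.2.0'

        $installed = Get-Module -Name AzureRM -ListAvailable |
            Sort-Object Version -Descending |
            Select-Object -First 1

        if (-not $installed) {
            throw "AzureRM module is not installed."
        }

        if ($installed.Version -ne $testedVersion) {
            Write-Warning "AzureRM $($installed.Version) found; this script was tested with $testedVersion. Re-check cmdlet parameters before running it in production."
        }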

    We had an amusing story related to a PowerShell update. One of our engineers wrote a fairly large piece of code to add missing functionality to a virtual disk cmdlet. At the beginning of the following week an update was released in which that very cmdlet gained a new parameter doing exactly the same thing. It was gratifying (our view of what was missing matched the vendor's) and a little sad for the time spent.

    Azure Versions


    The reason for many of the difficulties at the end of 2015 was that Microsoft Azure was actively switching from the Classic (Azure Service Management) model to ARM (Azure Resource Manager). In a sense, this can be called a transition from version 1 to version 2. Since all of Azure's innovations are focused primarily on ARM, the Customer wanted ARM components to be used everywhere except in cases of exceptional necessity. Not to mention that only ARM makes it possible to properly configure access rights to the various components in Azure in accordance with the standards for delivering IT services. The problem was that ARM at that time lacked some of the functionality already available in Classic. On top of that, having ARM and Classic components work together was far from always possible and often incomplete.

    This may seem insignificant; after all, different versions of a server operating system also differ in functionality, and that is normal. The difference here was the much higher pace at which cloud services develop: the architects working on this project on our side were used to discussing Azure-based designs in terms of Classic functionality, assuming that the new versions of the components could, at the very least, do the same things. Moreover, as it turned out, Microsoft's own architects were running into the same difficulties.

    Network


    Extending the customer’s IT infrastructure to the cloud begins with the network. Your first task is to create networks in the cloud and connect them to your existing infrastructure.


    It was here that the first surprise awaited us. It turned out that the virtual network topology proposed by the Microsoft architect at the initial stage of the project was based on the idea that a single Azure Virtual Network could have two Virtual Network Gateways: one for the ExpressRoute connection and another for a VNet-to-VNet VPN. The idea was to provide additional isolation of the Customer's internal networks from DMZ traffic.

    As it turned out, ARM did not allow this. On the fly, we had to switch to connecting all VNets to a single ExpressRoute circuit to get routing between them, and to User Defined Routes (UDR) to enforce security.
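
    For a sense of what that looks like in practice, here is a minimal sketch of a User Defined Route in the Azure PowerShell of that period, steering traffic bound for the DMZ through a firewall appliance; the resource group, names, address prefixes and appliance IP are hypothetical placeholders:

        # Minimal sketch: force traffic towards the DMZ through a virtual appliance.
        # All names, prefixes and the appliance IP are hypothetical examples.
        $rt = New-AzureRmRouteTable -Name "rt-internal" `
                -ResourceGroupName "rg-network" -Location "westeurope"

        # Send everything destined for the DMZ address space to the firewall.
        Add-AzureRmRouteConfig -RouteTable $rt -Name "to-dmz-via-fw" `
            -AddressPrefix "10.10.0.0/16" `
            -NextHopType VirtualAppliance -NextHopIpAddress "10.0.0.4" | Out-Null

        Set-AzureRmRouteTable -RouteTable $rt

    The route table is then associated with the relevant subnets, so that traffic between networks still passes through an inspection point even though everything now shares one ExpressRoute connection.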

    Another unpleasant aspect of working with virtual networks was the limit on the number of rules in a single Network Security Group (NSG). Several technical details of how networking operates in Azure are worth noting here; each of them, taken individually, was only a slight inconvenience, but together they became a problem:

    • You cannot create more than 500 rules in one NSG.
    • A large part of the functionality of virtual machines in Azure requires access to the IP addresses of Microsoft's public services on ports 80 and 443. Microsoft regularly updates and publishes this list. At the moment, for some regions it already contains several hundred addresses.
    • NSG rules can be created for a contiguous range of addresses or ports, but not for an arbitrary list. That is, you can open traffic to ports 80-443 with a single rule, but if you want exactly 80 and 443, without everything in between, you need two rules (the same goes for IP addresses); see the sketch after this list.
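
    To make the last point concrete, here is a hedged sketch of what opening exactly ports 80 and 443 looks like: because one rule takes only a single contiguous port range, two rules and two priorities are required. The NSG name, resource group and address prefixes are hypothetical:

        # Minimal sketch: ports 80 and 443 cannot share one rule, so two rules are needed.
        # NSG name, resource group and address prefixes are hypothetical examples.
        $nsg = Get-AzureRmNetworkSecurityGroup -Name "nsg-frontend" -ResourceGroupName "rg-network"

        Add-AzureRmNetworkSecurityRuleConfig -NetworkSecurityGroup $nsg -Name "allow-http" `
            -Access Allow -Protocol Tcp -Direction Outbound -Priority 100 `
            -SourceAddressPrefix "VirtualNetwork" -SourcePortRange "*" `
            -DestinationAddressPrefix "Internet" -DestinationPortRange 80 | Out-Null

        Add-AzureRmNetworkSecurityRuleConfig -NetworkSecurityGroup $nsg -Name "allow-https" `
            -Access Allow -Protocol Tcp -Direction Outbound -Priority 110 `
            -SourceAddressPrefix "VirtualNetwork" -SourcePortRange "*" `
            -DestinationAddressPrefix "Internet" -DestinationPortRange 443 | Out-Null

        Set-AzureRmNetworkSecurityGroup -NetworkSecurityGroup $nsg

    Multiply this by the several hundred Microsoft service addresses per region and the 500-rule ceiling, and it becomes clear how quickly the limit is reached.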

    As a result, we not only had to prepare scripts that update our NSG rules automatically (we could hardly redo them by hand every week), but, worse, we were left with few rules for their intended purpose: controlling traffic between our own networks.

    Fortunately, this problem will soon be a thing of the past: Microsoft has announced changes to NSGs that will allow more flexible work with rules.

    Limitations


    Since we have touched on the subject of quotas in Azure (500 rules per NSG), it is worth noting that they are a headache in their own right on a large project. The range of services in Azure keeps expanding, and it is only logical that each comes with its own limits. The problem is that there is no single console where you can see all the restrictions in one place. Instead, you have to rely on a hodgepodge of individual commands that report this information and on several web pages listing the current limits. This is not very convenient, especially when some limit you never thought about suddenly pops up.
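
    By way of illustration, the best one could do at the time was to poll the individual services, roughly like this (a sketch, assuming the AzureRM module of that period; the region is just an example):

        # Minimal sketch: there was no single view, so quotas had to be polled per service.
        $location = "westeurope"

        # Compute quotas (cores, VMs, availability sets, ...) for one region.
        Get-AzureRmVMUsage -Location $location

        # Storage account quota for the whole subscription.
        Get-AzureRmStorageUsage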

    Data storage


    One example of a rather sneaky quota that not everyone thinks about is the performance of a resource such as a Storage Account. A VHD disk on Standard Storage, for most virtual machine sizes, has a maximum performance of 500 IOPS, while the Storage Account itself is limited to 20,000 IOPS. At the same time, the maximum disk size is 1023 GB and the maximum storage account capacity is 500 TB. Do you see the catch yet? As soon as you place the 41st disk in a single Storage Account, you can in theory end up in a situation where, with all disks under maximum load, their performance starts to be artificially throttled. At that point you have not yet used even 10% of the maximum capacity, and every additional disk only makes matters worse.

    The most unpleasant part is that the system will not warn you about this in any way. You will only find out if you think about such things in advance and either avoid placing more than 40 disks in a single Storage Account, or monitor throttling on its side and, when it kicks in, move the most actively used disks elsewhere.

    Considering that your server deployment is most likely automated, you need to think about how your automation tools will choose the placement of virtual disks, especially if the simultaneous deployment of several servers is theoretically possible; a sketch of such a check follows below.
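
    Here is a hedged sketch of the kind of check we mean, assuming unmanaged disks kept in the conventional "vhds" container (all names are illustrative): count the VHD blobs per storage account and flag anything above roughly 40, since 20,000 IOPS per account divided by 500 IOPS per Standard disk gives 40 disks at full load.

        # Minimal sketch: flag storage accounts that hold more than ~40 VHDs
        # (20,000 IOPS per account / 500 IOPS per Standard disk = 40 disks at full load).
        # Assumes unmanaged disks kept in the conventional "vhds" container.
        $maxDisks = 40

        Get-AzureRmStorageAccount | ForEach-Object {
            $vhds = Get-AzureStorageBlob -Container "vhds" -Context $_.Context -ErrorAction SilentlyContinue |
                    Where-Object { $_.Name -like "*.vhd" }

            if ($vhds.Count -gt $maxDisks) {
                Write-Warning "$($_.StorageAccountName): $($vhds.Count) VHDs - IOPS throttling is possible."
            }
        }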

    Marketplace


    It is funny, but one of the difficulties of working with Azure is also its main advantage: the extensive Marketplace. The large number of services and the constantly growing list of offerings are great. The problem is that, with such variety, developers are physically unable to verify how their product interacts with every other one, and if you start using a service right after release, you may well be the first to try it in the particular combination you happen to have.


    Here are a couple of interesting examples:

    • Immediately after the release of Azure Site Recovery (a service that protects your servers by replicating them to the cloud), it required that all traffic to the Internet on ports 80 and 443 be opened for a failed-over server, because the address list for this service had not yet been added to the Azure whitelist (they fixed it very quickly, of course, but it cost us some head-scratching at the time).
    • Many of the virtual machine features in Azure are tied to VM Extensions, for example encryption and backup. There are many operations that wipe the set of Extensions from a virtual machine, and these are quite commonly used operations, such as redeploying a server from its VHD (the main method of solving many server problems and an obligatory step when moving a server between Resource Groups) or even restoring a server from Azure VM Backup. Despite this, there is no convenient tool for saving the list of these Extensions, and you have to do it yourself; a sketch of such a snapshot follows this list.
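
    As a hedged sketch of what such a snapshot can look like (the resource group, VM name and output path are hypothetical, and the exact shape of the extension objects may vary between module versions):

        # Minimal sketch: dump a VM's extension list to JSON before a destructive operation,
        # so the extensions can be reinstalled afterwards. Names and the path are examples.
        $vm = Get-AzureRmVM -ResourceGroupName "rg-prod" -Name "srv-app-01"

        # Persist whatever the module reports for each extension.
        $vm.Extensions | ConvertTo-Json -Depth 5 | Out-File "C:\backup\srv-app-01-extensions.json"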

    Conclusion


    What thought would I like to convey with this article? Personally, I am far from believing that clouds will completely replace on-premises infrastructure in the foreseeable future, but hoping that you will be able to hide from them is rather naive. And there is no need to! Working with Azure is very interesting. If you enjoy constantly learning new things and following the release of new functionality, thinking about what you could use to improve your solutions, you will not be disappointed.

    P.S. Those of you who work with Azure may notice that most of the problems described in this article are no longer relevant. Microsoft follows community feedback very closely and keeps refining its services (although the NSG story has not been fixed yet!).
