Cloud Site Acceptance Test Planning

    IT CITY Suite on SDN

    On September 24th, we (IT-GRAD) opened a new public cloud platform in the SDN (Stack Data Network) data center. Before putting the first client into commercial operation, I am planning tests to show that all components work as intended and that redundancy and hardware-failure handling behave as expected. Here I will describe the tests I have already planned and ask Habr readers to share their additions and recommendations.

    A little about the equipment at the new site:

    At the first stage, a NetApp FAS8040 storage system was installed in the new data center (as a NetApp Gold partner, we remain faithful to the vendor). For now the system has two FAS8040 controllers, joined into a cluster through redundant 10 Gbit/s switches (Cluster Interconnect), which allows the storage cluster to grow to 24 controllers. The storage controllers, in turn, are connected over 10 Gbit/s optical links to the network core, which is formed by two Cisco Nexus 5548UP switches with L3 support.

    The VMware vSphere ESXi hypervisor servers (Dell R620 / R810) connect to the network through two 10 Gbit/s interfaces over a converged network that carries both disk array traffic and data network traffic. The pool of ESXi servers forms a cluster with VMware vSphere High Availability (HA) enabled. The iDRAC management interfaces of the servers and the management interfaces of the storage controllers are connected to a separate dedicated Cisco switch.
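For HA to restart every VM after a host failure, the surviving hosts need enough spare capacity. The sketch below is a deliberately simplified aggregate memory check, not the admission control logic that vSphere HA actually uses; the host names and memory figures are hypothetical and are not taken from the real site.

```python
from typing import Dict


def survives_any_single_host_failure(
    capacity_gb: Dict[str, int], used_gb: Dict[str, int]
) -> bool:
    """Return True if, for every host, the memory used by its VMs fits into
    the total free memory of the remaining hosts (aggregate check only)."""
    for failed in capacity_gb:
        free_elsewhere = sum(
            capacity_gb[h] - used_gb[h] for h in capacity_gb if h != failed
        )
        if used_gb[failed] > free_elsewhere:
            return False
    return True


# Hypothetical figures for illustration only.
capacity = {"esxi-01": 256, "esxi-02": 256, "esxi-03": 512}
used = {"esxi-01": 120, "esxi-02": 100, "esxi-03": 200}
print(survives_any_single_host_failure(capacity, used))
```

The check ignores CPU, per-VM reservations and fragmentation across hosts, so a passing result is only a first sanity check before the hard shutdown tests.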

    Once the basic configuration of the infrastructure is complete, it is time to stop and look back: has anything been forgotten? Does everything work? Is it reliable? Our experienced engineers already give us a good chance of success, but for the “foundation” to stay solid we of course need to stress-test the infrastructure properly. Successful completion of the tests will mark the end of the first stage and the passing of the acceptance tests (PSI) for the new cloud platform.

    So, below are the initial data and the test plan. Attentive readers are welcome to offer suggestions, recommendations and corrections for anything we may have failed to foresee. I will be glad to hear them.

    Initial data:
    • Dual-controller FAS8040 running Data ONTAP Release 8.2.1 Cluster-Mode
    • NetApp DS2246 disk shelves (24 x 900 GB SAS) - 5 pcs.
    • NetApp Flash Cache 512 GB - 2 pcs.
    • NetApp CN1610 cluster interconnect switches for clustered Data ONTAP - 2 pcs.
    • Cisco Nexus 5548 unified network core switches - 2 pcs.
    • Juniper MX80 border router (only one for now; the second has not arrived yet)
    • Cisco SG200-26 management switch
    • Dell PowerEdge R620 / R810 Servers with VMware vSphere ESXi 5.5

    The connection diagram is as follows:

    Wiring diagram

    I deliberately did not draw the management switch and the Juniper MX80, because we will test Internet connectivity only after the uplinks are made redundant; the second Juniper MX80 is still missing (we expect it by the end of the month).
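Before pulling cables, the wiring can be checked on paper for single points of failure. The following is a minimal sketch that models the topology as a graph and removes one device at a time; the device names and the link list are assumptions reconstructed from the description above, and the real cabling may differ.

```python
from collections import deque

# Hypothetical link list; device names are placeholders.
LINKS = [
    ("esxi", "nexus-1"), ("esxi", "nexus-2"),
    ("nexus-1", "nexus-2"),
    ("fas-a", "nexus-1"), ("fas-a", "nexus-2"),
    ("fas-b", "nexus-1"), ("fas-b", "nexus-2"),
]


def reachable(links, src, dst, failed):
    """Breadth-first search from src to dst, ignoring the failed device."""
    if failed in (src, dst):
        return False
    adjacency = {}
    for a, b in links:
        if failed in (a, b):
            continue
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    seen, queue = {src}, deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return True
        for neighbor in adjacency.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return False


def single_points_of_failure(links, src, targets):
    """Intermediate devices whose loss cuts src off from every target."""
    devices = {d for link in links for d in link} - {src} - set(targets)
    return sorted(
        d for d in devices
        if not any(reachable(links, src, t, d) for t in targets)
    )


print(single_points_of_failure(LINKS, "esxi", ["fas-a", "fas-b"]))  # prints []
```

Reaching either controller counts as success here, on the assumption that a takeover makes the data available through the surviving node.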

    So, our “crash tests” can be roughly divided into 3 types:

    • Testing the FAS8040 Disk Array
    • Network Infrastructure Testing
    • Virtual infrastructure testing

    Network infrastructure testing is performed in a reduced form in our case, for the reason mentioned above (not all network equipment is installed yet).

    Before the tests, we plan to back up the configurations of the network equipment and the array once more, and to check the disk array configuration with Config Advisor.

    Now I’ll tell you more about the test plan.

    I. Remote Testing

    1. Turn off the FAS8040 controllers one at a time.
      Expected result: automatic takeover by the surviving node; all SVM resources remain available on ESXi; access to the datastores is not lost.
    2. Disable all cluster links of each node in turn.
      Expected result: automatic takeover by the surviving node, or migration of the SVM to available network ports on the second node; all SVM resources remain available on ESXi; access to the datastores is not lost.
    3. Disable all Inter-Switch Links between the CN1610 switches.
      Expected result: we assume the cluster nodes will remain reachable to each other through the cluster links of one of the Cluster Interconnect switches (thanks to the cross-connection between the NetApp nodes and the Cluster Interconnect).
    4. Reboot one of the Nexus switches.
      Expected result: one of the ports on each node remains reachable; in the interface group (ifgrp) on each node, one of the 10 GbE interfaces stays up; all SVM resources remain available on ESXi; access to the datastores is not lost.
    5. Shut down each of the vPCs (vPC-1 or vPC-2) on the Nexus in turn.
      Expected result: migration of the SVM to available network ports on the second node; all SVM resources remain available on ESXi; access to the datastores is not lost.
    6. Disconnect the Inter-Switch Links between the Cisco Nexus 5548 switches one by one.
      Expected result: the Port Channel stays active on one link; connectivity between the switches is not lost.
    7. Hard power-off of each ESXi host in turn.
      Expected result: HA is triggered and the VMs start automatically on a neighboring host.
    8. Watch the monitoring system during the tests.
      Expected result: notifications about the resulting problems arrive from the equipment and the virtual infrastructure.
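“Access should not be lost” is easier to judge with a number. The sketch below probes a TCP port at a fixed interval while a test runs and reports the longest outage seen. The probe target is an assumption: the host and port are placeholders, since the plan does not say which protocol the datastores use, and a successful TCP connect only approximates real datastore availability.

```python
import socket
import time
from typing import Iterable, List, Tuple


def probe(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def longest_outage(samples: Iterable[Tuple[float, bool]]) -> float:
    """Longest span in seconds from the first failed probe to the next
    successful one. Samples are (timestamp, ok) pairs in chronological
    order; an outage still open at the end is counted up to the last sample."""
    longest = 0.0
    down_since = None
    last_ts = None
    for ts, ok in samples:
        last_ts = ts
        if not ok and down_since is None:
            down_since = ts
        elif ok and down_since is not None:
            longest = max(longest, ts - down_since)
            down_since = None
    if down_since is not None and last_ts is not None:
        longest = max(longest, last_ts - down_since)
    return longest


def watch(host: str, port: int, duration_s: float,
          interval_s: float = 1.0) -> List[Tuple[float, bool]]:
    """Probe host:port every interval_s seconds for duration_s seconds."""
    samples = []
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        samples.append((time.monotonic(), probe(host, port)))
        time.sleep(interval_s)
    return samples


# Hand-written samples for illustration: down from t=1 until t=3.
demo = [(0.0, True), (1.0, False), (2.0, False), (3.0, True)]
print(longest_outage(demo))  # prints 2.0
```

During a real test one would call something like `longest_outage(watch("192.0.2.10", 2049, 300))`, where the address and port are placeholders, and compare the result with the timeout the hypervisor tolerates for its datastores.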

    II. On-Site Testing, Directly on the Hardware

    1. Disconnect the power cables one at a time (on every piece of equipment).
      Expected result: the equipment keeps running on the second power supply; switching between the power supplies causes no problems.
      Note: the Cisco SG200-26 management switch has no power redundancy.

    2. Disconnect the network links from the ESXi hosts (Dell R620 / R810) one by one.
      Expected result: ESXi remains reachable over the second link.

    Well, that's all. I look forward to your comments.
