
VMware Virtual SAN 6.5 Theory - Tutorial
In this article I try to explain the purpose of VMware Virtual SAN, the principles of its operation, the requirements, capabilities and limitations of the system, and the main recommendations for its design.
Virtual SAN Concept
VMware Virtual SAN (hereinafter vSAN) is a distributed software-defined storage (SDS) system for building a hyper-converged infrastructure (HCI) on top of vSphere. vSAN is built into the ESXi hypervisor and does not require the deployment of additional services or service VMs. vSAN combines the local media of the hosts into a single storage pool that provides a given level of fault tolerance and makes its space available to all hosts and VMs of the cluster. We thus obtain the centralized storage needed to unlock the full capabilities of virtualization (vSphere technology), without having to implement and maintain a dedicated (traditional) storage system.
vSAN is enabled at the vSphere cluster level: it is enough to turn on the corresponding service in the cluster settings and provide it with the local media of the cluster hosts, either manually or automatically. vSAN is built into vSphere and tightly coupled to it; it does not exist as a standalone SDS product that can run outside the VMware infrastructure.
Analogs and competitors of vSAN include Nutanix DSF (Distributed Storage Fabric), Dell EMC ScaleIO, DataCore Hyperconverged Virtual SAN, and HPE StoreVirtual. All of these solutions are compatible not only with VMware vSphere but also with Microsoft Hyper-V (some of them with KVM as well); they require a service VM to be installed and running on each HCI host for the SDS to operate; they are not embedded in the hypervisor.
An important property of any HCI, including VMware vSphere + vSAN, is horizontal scalability and a modular architecture. An HCI is built and expanded from identical server blocks (hosts or nodes) that combine compute resources, storage resources (local media) and network interfaces. These can be commodity x86 servers (including brand-name ones) purchased separately, or ready-made appliances (for example, Nutanix, VxRail, SimpliVity). Integrated software (for example, vSphere + vSAN + management and orchestration tools) allows a software-defined data center (SDDC) to be built from these blocks, including the virtualization layer, SDN (software-defined networking), SDS, and centralized management and automation / orchestration tools. Of course, dedicated physical network equipment (switches) is still required, without which the HCI hosts cannot communicate; for the data center network it is advisable to use a leaf-spine architecture.
vSAN runs at the level of an ESXi host cluster managed by vCenter and provides distributed centralized storage for the hosts of that cluster. A vSAN cluster can be implemented in two variants:
• Hybrid - flash drives are used as a cache (cache layer), HDDs provide the main storage capacity (capacity layer).
• All-flash - flash drives are used at both levels: cache and capacity.
A vSAN cluster can be extended by adding media to existing hosts or by adding new hosts to the cluster. Keep in mind that the cluster should remain balanced, which ideally means the media composition of the hosts (and the hosts themselves) should be identical. It is acceptable, though not recommended, to include hosts that contribute no media to the vSAN capacity; such hosts can still place their VMs on the shared storage of the vSAN cluster.
Comparing vSAN with traditional external storage systems, it should be noted that:
• vSAN does not require the organization of external storage and a dedicated storage network;
• vSAN does not require carving out LUNs and file shares, presenting them to hosts, or configuring the associated network access; once activated, vSAN immediately becomes available to all cluster hosts;
• vSAN transfers data using its own proprietary protocol; standard SAN / NAS protocols (FC, iSCSI, NFS) are not needed to set up or operate vSAN;
• vSAN is administered entirely from the vSphere console; no separate administration tools or additional vSphere plug-ins are needed;
• no dedicated storage manager is needed; vSAN configuration and maintenance is performed by the vSphere administrator.
Media and disk groups
Each vSAN cluster host must have at least one cache device and one capacity device (data disk). Within each host these devices are combined into one or more disk groups. Each disk group contains exactly one cache device and one or more capacity devices for permanent storage.
A device claimed by vSAN and added to a disk group is used by vSAN exclusively; it cannot be used for other tasks or belong to several disk groups. This applies to both cache devices and capacity devices.
vSAN supports only local media or drives connected via DAS. Including storage connected via SAN / NAS in the vSAN storage pool is not supported.
Object storage
vSAN is an object store: data in it is kept in the form of "flexible" containers called objects. An object stores data or metadata distributed across the cluster. The main types of vSAN objects (created separately for each VM) are:
• VM Home Namespace - the VM home directory with its configuration files, logs and other metadata;
• VMDK - the virtual disk of the VM;
• VM swap - the swap object created when the VM is powered on;
• snapshot delta disks - created when a VM snapshot is taken;
• memory snapshot - created when a snapshot with memory is taken.
Thus, the data of each VM is stored on the vSAN as a set of the objects listed above. In turn, each object includes many components distributed across the cluster depending on the selected storage policy and the amount of data.
Storage management in vSAN is implemented through the SPBM (Storage Policy Based Management) mechanism, which governs all vSphere storage starting with version 6. A storage policy defines the number of failures to tolerate, the method used to provide fault tolerance (replication or erasure coding), and the number of stripes per object (Number of disk stripes per object). The number of stripes specified in the policy determines the number of capacity devices across which the object will be spread.
The binding of a policy to a specific VM or its disk determines the number of components of the object and their distribution across the cluster.
vSAN allows the storage policy of an object to be changed on the fly, without stopping the VM; the object reconfiguration processes run in the background.
When distributing objects across a cluster, vSAN ensures that components belonging to different replicas of an object, as well as witness components, are placed on different nodes or in different failure domains (a server rack, chassis, or site).
Witness and quorum
A witness is a service component that contains no user data (only metadata); its size is 2-4 MB. It acts as a tiebreaker when determining which components of an object are live after a failure.
The quorum mechanism in vSAN works as follows. Each component of an object receives a certain number of votes, one or more. A quorum is reached and the object is considered "alive" if a full replica of it is available and more than half of its votes are reachable.
This voting scheme allows the components of an object to be distributed across the cluster in such a way that in some cases no witness is needed at all, for example when using RAID-5/6 or striping on top of RAID-1.
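To make the voting logic more tangible, here is a minimal sketch of such a quorum check; it is not vSAN's actual implementation, and the component layout, vote counts and helper names are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class Component:
    replica: str      # replica this component belongs to ("A", "B", or "witness")
    votes: int        # votes assigned to the component
    reachable: bool   # whether the component is currently accessible

def has_quorum(components: list[Component], replicas: set[str]) -> bool:
    """Object is alive if one full replica is reachable and >50% of all votes are reachable."""
    total_votes = sum(c.votes for c in components)
    reachable_votes = sum(c.votes for c in components if c.reachable)
    full_replica_alive = any(
        all(c.reachable for c in components if c.replica == r) for r in replicas
    )
    return full_replica_alive and reachable_votes * 2 > total_votes

# FTT=1 mirror: two replicas plus a witness; the host holding replica "B" has failed.
layout = [
    Component("A", 1, True),
    Component("B", 1, False),
    Component("witness", 1, True),
]
print(has_quorum(layout, {"A", "B"}))  # True: replica A and the witness hold 2 of 3 votes
```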
Virtual SAN datastore
After the vSAN service is enabled in a vSphere cluster, a datastore of the same name appears; there can be only one such datastore per vSAN cluster. Thanks to the SPBM mechanism described above, within this single vSAN datastore each VM or individual disk can receive its required level of service (fault tolerance and performance) by binding it to the appropriate storage policy.
The vSAN datastore is available to all hosts in the cluster, regardless of whether a host contributes local media to vSAN. At the same time, cluster hosts may have access to datastores of other types: VVol, VMFS, NFS.
The vSAN datastore supports Storage vMotion (hot migration of VM disks) to and from VMFS / NFS datastores.
Multiple vSAN clusters can be created within a single vCenter Server, and Storage vMotion between vSAN clusters is supported. However, each host can be a member of only one vSAN cluster.
vSAN Compatibility with Other VMware Technologies
vSAN is compatible with and supports most VMware technologies, including those that require shared storage, in particular: vMotion, Storage vMotion, HA, DRS, SRM, vSphere Replication, snapshots, clones, VADP, Horizon View.
vSAN does not support: DPM, SIOC, SCSI reservations, RDM, VMFS.
vSAN hardware requirements
Storage Requirements
Media, controllers, and their drivers and firmware must be certified for vSAN and listed in the VMware Compatibility Guide (Virtual SAN section).
A SAS / SATA HBA or a RAID controller can be used as the storage controller; it must operate either in pass-through mode (disks are presented by the controller as is, without creating a RAID array) or in RAID-0 mode.
SAS, SATA and PCIe SSDs as well as NVMe devices can be used as cache media.
SAS / SATA HDDs can be used as capacity media in a hybrid configuration; in an all-flash configuration, any of the flash types listed above (SAS / SATA / PCIe SSD and NVMe) can be used.
RAM and CPU requirements
The amount of host memory is determined by the number and size of disk groups.
The minimum amount of host RAM to participate in the vSAN cluster is 8GB.
The minimum amount of host RAM required to support the maximum disk group configuration (5 disk groups with 7 capacity devices each) is 32GB.
vSAN consumes about 10% of the host's CPU resources.
Network requirements
• Dedicated 1Gbps adapter for a hybrid configuration.
• Dedicated or shared 10Gbps adapter for an all-flash configuration.
• Multicast traffic must be allowed on the vSAN subnet.
Bootable media
vSAN hosts can boot from local USB or SD media, as well as from SATADOM devices. The first two types do not preserve logs and traces across reboots, since these are written to a RAM disk; SATADOM does preserve them, so it is recommended to use SLC-class SATADOM devices with higher endurance and performance.
vSAN 6.5 Configuration Maximums
• Maximum 64 hosts per vSAN cluster (both hybrid and all-flash).
• Maximum 5 disk groups per host.
• Maximum 7 capacity devices per disk group.
• Maximum 200 VMs per host and 6000 VMs per cluster.
• Maximum 9000 components per host.
• Maximum VM disk size: 62TB.
• Maximum number of devices in a stripe per object: 12.
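Purely as an illustration, the sketch below checks a planned cluster design against these 6.5 maximums; the design figures and the function are hypothetical, not part of any VMware tooling.

```python
VSAN_65_MAXIMUMS = {
    "hosts_per_cluster": 64,
    "disk_groups_per_host": 5,
    "capacity_devices_per_disk_group": 7,
    "vms_per_host": 200,
    "vms_per_cluster": 6000,
    "components_per_host": 9000,
    "vm_disk_size_tb": 62,
    "stripes_per_object": 12,
}

def check_design(design: dict) -> list[str]:
    """Return a list of vSAN 6.5 limits that a planned design exceeds."""
    return [
        f"{key}: planned {design[key]} exceeds maximum {limit}"
        for key, limit in VSAN_65_MAXIMUMS.items()
        if key in design and design[key] > limit
    ]

# Hypothetical design: 8 hosts, 2 disk groups of 6 capacity devices, 150 VMs per host.
planned = {"hosts_per_cluster": 8, "disk_groups_per_host": 2,
           "capacity_devices_per_disk_group": 6, "vms_per_host": 150}
print(check_design(planned) or "design is within vSAN 6.5 maximums")
```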
Technological Features of VMware Virtual SAN
Planning for a vSAN Cluster
The minimum number of hosts in a vSAN cluster depends on the number of failures to be tolerated (the Number of failures to tolerate parameter of the storage policy) and is given by the formula: 2 * number_of_failures_to_tolerate + 1.
To tolerate a single failure, vSAN allows 2- and 3-node clusters: the object and its replica are placed on two hosts and a witness on a third. The following restrictions then apply:
• if one host fails, there is nowhere to rebuild the data to protect against the next failure;
• when one host is placed into maintenance mode, its data cannot be migrated, and the data remaining in the cluster is unprotected during that time.
This is simply because there is nowhere to rebuild or migrate the data to - there is no additional free host. It is therefore best to build a vSAN cluster from at least 4 hosts.
The rule 2 * number_of_failures_to_tolerate + 1 applies only when Mirroring is used for fault tolerance. It does not hold for Erasure Coding, which is described in detail below in the "Resiliency" section.
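A minimal sketch of this sizing rule, covering both mirroring and the erasure-coding minimums discussed later in the "Resiliency" section; the function name and structure are mine, not part of any VMware API.

```python
def min_hosts(failures_to_tolerate: int, method: str = "mirroring") -> int:
    """Minimum number of vSAN hosts for a given FTT and fault tolerance method."""
    if method == "mirroring":          # RAID-1: 2*FTT + 1 (witnesses included)
        return 2 * failures_to_tolerate + 1
    if method == "erasure_coding":     # RAID-5 needs 4 hosts, RAID-6 needs 6
        return {1: 4, 2: 6}[failures_to_tolerate]
    raise ValueError("unknown fault tolerance method")

for ftt in (1, 2, 3):
    print(f"FTT={ftt}, mirroring: minimum {min_hosts(ftt)} hosts")
print(f"FTT=1, erasure coding (RAID-5): minimum {min_hosts(1, 'erasure_coding')} hosts")
print(f"FTT=2, erasure coding (RAID-6): minimum {min_hosts(2, 'erasure_coding')} hosts")
# VMware recommends one host more than the minimum so that a rebuild has somewhere to go.
```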
In order for the vSAN cluster to be balanced, the hardware configuration of the hosts, primarily for media and storage controllers, must be identical.
An unbalanced cluster (hosts with differing disk group configurations) is supported, but it comes with the following drawbacks:
• non-optimal cluster performance;
• uneven utilization of host capacity;
• differences in host maintenance procedures.
It is allowed to place the vCenter Server VM on a vSAN datastore, but this creates risks for infrastructure management if problems arise with vSAN.
vSAN Cache Planning
It is recommended to plan the cache size with a margin, so that the capacity tier can be expanded later.
In a hybrid configuration, 30% of the cache is allocated to writes and 70% to reads. An all-flash vSAN configuration uses the entire cache capacity for writes; no read cache is provided.
The recommended cache size is at least 10% of the capacity actually consumed by the VMs before replication, i.e. usable consumed space is counted, not the space actually occupied once replication is taken into account.
In a hybrid configuration, a disk group uses the entire capacity of the cache flash device installed in it, and its maximum size is not limited.
In an all-flash configuration, a disk group cannot use more than 600 GB of the installed cache flash device; the remaining space is not wasted, however, since cached data is written cyclically across the entire device. In all-flash vSAN it is advisable to use cache devices with higher speed and endurance than the capacity devices. Using a cache device larger than 600 GB will not improve performance, but it will extend the life of the device.
This approach to the cache in all-flash vSAN exists because the flash capacity devices are already fast, so there is no point in caching reads. Dedicating the entire cache to writes not only accelerates them but also extends the life of the capacity tier and reduces the overall cost of the system: cheaper devices can be used for permanent storage, while a single more expensive, faster and more durable flash cache shields them from excessive write operations.
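A back-of-the-envelope sketch of the 10% cache sizing rule described above; the workload figures are made up purely for illustration.

```python
def recommended_cache_gb(consumed_vm_capacity_gb: float, hosts: int,
                         disk_groups_per_host: int) -> tuple[float, float]:
    """Total cache (10% of consumed capacity before replication) and the per-disk-group share."""
    total_cache = 0.10 * consumed_vm_capacity_gb
    per_disk_group = total_cache / (hosts * disk_groups_per_host)
    return total_cache, per_disk_group

total, per_dg = recommended_cache_gb(consumed_vm_capacity_gb=20_000,
                                     hosts=4, disk_groups_per_host=2)
print(f"total cache: {total:.0f} GB, per disk group: {per_dg:.0f} GB")
# 20 TB of consumed VM data -> 2 TB of cache overall, i.e. 250 GB per disk group here.
```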
Resiliency
VM fault tolerance and the ratio of usable to actually occupied vSAN storage space are determined by two storage policy parameters:
• Number of failures to tolerate - the number of failures (of hosts, disks or the network) that the VM's objects must be able to survive.
• Failure tolerance method - fault tolerance method.
vSAN offers 2 fault tolerance methods:
• RAID-1 (Mirroring) - full replication (mirroring) of an object, with replicas on different hosts; an analog of a network RAID-1. Allows the cluster to survive up to 3 failures (of hosts, disks, disk groups, or the network). With Number of failures to tolerate = 1, one replica is created (2 instances of the object), so the space actually occupied by a VM or its disk in the cluster is twice its usable capacity. With Number of failures to tolerate = 2 there are 2 replicas (3 instances) and the occupied space is three times the usable capacity; with Number of failures to tolerate = 3 there are 3 replicas (4 instances) and the occupied space is four times the usable capacity. This method uses space inefficiently but provides maximum performance. It can be used in both hybrid and all-flash cluster configurations.
• RAID-5/6 (Erasure Coding) - parity blocks are calculated when objects are placed; an analog of a network RAID-5 or RAID-6. Supported only in all-flash cluster configurations. Allows the cluster to tolerate 1 failure (RAID-5 analog) or 2 failures (RAID-6 analog). The minimum number of hosts is 4 to tolerate 1 failure and 6 to tolerate 2 failures; the recommended minimums are 5 and 7 respectively, to leave room for a rebuild. This method achieves significant space savings compared to RAID-1 at the cost of some performance, which may be perfectly acceptable for many workloads given the speed of all-flash. With 4 hosts and a tolerance of 1 failure, the usable share of the space occupied by an object under Erasure Coding is 75% of the total (versus 50% for RAID-1). With 6 hosts and a tolerance of 2 failures, the usable share is 67% of the total (versus 33% for RAID-1). In these examples RAID-5/6 therefore uses cluster capacity 1.5 and 2 times more efficiently than RAID-1, respectively.
Below is the data distribution at the component level of the vSAN cluster using RAID-5 and RAID-6. C1-C6 (first line) - the components of the object, A1-D4 (color blocks) - data blocks, P1-P4 and Q1-Q4 (gray blocks) - parity blocks.

vSAN can protect different VMs, or even their individual disks, in different ways. Within a single datastore you can bind a mirroring policy to performance-critical VMs and configure Erasure Coding for less critical VMs to save space, maintaining a balance between performance and efficient capacity utilization.
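To illustrate the overhead figures above, here is a small calculation sketch that converts a usable object size into the raw space it consumes under each policy; it simply reproduces the ratios discussed in this section, and the function is not part of vSAN.

```python
def raw_space_gb(usable_gb: float, ftt: int, method: str) -> float:
    """Raw vSAN space consumed by an object of a given usable size."""
    if method == "mirroring":               # RAID-1: FTT+1 full copies of the data
        return usable_gb * (ftt + 1)
    if method == "erasure_coding":          # RAID-5: 4/3 overhead, RAID-6: 3/2 overhead
        return usable_gb * {1: 4 / 3, 2: 3 / 2}[ftt]
    raise ValueError("unknown fault tolerance method")

for ftt, method in [(1, "mirroring"), (2, "mirroring"),
                    (1, "erasure_coding"), (2, "erasure_coding")]:
    print(f"FTT={ftt}, {method}: 100 GB usable -> {raw_space_gb(100, ftt, method):.0f} GB raw")
# Mirroring doubles or triples the footprint, while RAID-5/6 add only 33% / 50% overhead.
```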
The table below shows the minimum and recommended number of hosts or failure domains for the various FTM / FTT combinations:

FTM                     | FTT | Minimum hosts / domains | Recommended hosts / domains
RAID-1 (Mirroring)      | 1   | 3                       | 4
RAID-1 (Mirroring)      | 2   | 5                       | 6
RAID-1 (Mirroring)      | 3   | 7                       | 8
RAID-5 (Erasure Coding) | 1   | 4                       | 5
RAID-6 (Erasure Coding) | 2   | 6                       | 7
Failure domains
vSAN introduces the concept of failure domains to protect the cluster against failures at the level of server racks or chassis, whose hosts are logically grouped into these domains. Enabling this mechanism distributes data for fault tolerance not at the level of individual nodes but at the domain level, allowing the cluster to survive the loss of an entire domain - all the nodes grouped in it (for example, a server rack) - since replicas of an object are always placed on nodes from different failure domains.
The number of failure domains is calculated by the same kind of formula: number_of_fault_domains = 2 * number_of_failures_to_tolerate + 1. vSAN requires a minimum of 2 failure domains, each containing 1 or more hosts, but the recommended number is 4, since that leaves room for a rebuild after a failure (with 2-3 domains there is nowhere to rebuild). The way failure domains are counted thus mirrors the way hosts are counted for the desired number of tolerated failures.
Ideally, each failure domain should contain the same number of identically configured hosts, and it is recommended to keep the equivalent of one domain's capacity free for rebuilds (for example, 4 domains with FTT = 1).
The failure domain mechanism works not only with Mirroring (RAID-1) but also with Erasure Coding; in that case each component of an object must reside in a different failure domain, and the required number of failure domains changes: at least 4 domains for RAID-5 and 6 domains for RAID-6 (the same as the host counts for Erasure Coding).
Deduplication and compression
Deduplication and compression (hereinafter DiS) are supported only in all-flash configurations and can be enabled only for the vSAN cluster as a whole; selective enablement for individual VMs or disks via policies is not supported. It is also not possible to use only one of these technologies: they are enabled only together.
Enabling DiS causes objects to be automatically striped across all disks of a disk group; this avoids rebalancing of components and lets matching data blocks from different components be found on every disk of the group. It is still possible to set striping for objects manually via storage policies, including beyond a single disk group. With DiS enabled it makes little sense to reserve space for objects at the policy level (the Object Space Reservation parameter, i.e. thick disks), since this gives no performance gain and hurts space savings.
DiS is performed after the write operation has been acknowledged. Deduplication happens as data is destaged from the cache, over identical 4K blocks within each disk group; deduplication does not span disk groups. After deduplication, and before the data leaves the cache, it is compressed: if a 4K block actually compresses to 2K or less, compression is applied; otherwise the block is left uncompressed to avoid unjustified overhead.
Data exists in deduplicated and compressed form only at the capacity tier, which holds roughly 90% of the cluster's data. The DiS overhead is about 5% of total cluster capacity (metadata and hashes). In the cache the data remains in its normal form: it is rewritten there much more often than in permanent storage, so the overhead and performance penalty of applying DiS in the cache would far outweigh the benefit of optimizing its relatively small capacity.
Note the trade-off between a few large disk groups and many small ones. Large disk groups give a stronger DiS effect (it works within a group, not across groups). Many small disk groups make the cache more effective (total cache space grows with the number of cache devices) and create more independent failure units, which speeds up the rebuild when a single disk group fails.
The space occupied by snapshot chains is also optimized by DiS.
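For illustration, the sketch below estimates the effective usable capacity of the capacity tier from an assumed deduplication/compression ratio, using the ~5% metadata overhead mentioned above; the ratio itself is a hypothetical, workload-dependent assumption.

```python
def effective_capacity_tb(raw_capacity_tb: float, dedup_compression_ratio: float) -> float:
    """Rough effective capacity of an all-flash vSAN capacity tier with DiS enabled."""
    metadata_overhead = 0.05 * raw_capacity_tb       # ~5% for DiS metadata and hashes
    usable_raw = raw_capacity_tb - metadata_overhead
    return usable_raw * dedup_compression_ratio      # logical data that fits after DiS

# 40 TB raw capacity tier, assuming 2x savings from deduplication and compression.
print(f"{effective_capacity_tb(40, 2.0):.1f} TB of logical data fits before FTT overhead")
```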
Striping objects and the number of components
The Number of disk stripes per object storage policy parameter sets the number of separate capacity devices across which the components of one replica of an object (VM disk) are distributed. The maximum value of this parameter, i.e. the maximum stripe width vSAN supports, is 12; in that case the object replica is spread across 12 devices. If the specified stripe width exceeds the number of devices in a disk group, the replica is stretched across several disk groups (most likely within one host). If it exceeds the number of devices in a host, the replica is stretched across several hosts (for example, all devices of one host and some devices of another).
By default the stripe width is 1, meaning no striping is performed and the replica (up to 255GB in size) is placed on a single device as a single component.
In theory, striping can improve performance by parallelizing I/O, provided the devices the object is striped onto are not already overloaded. Striping an object across several disk groups parallelizes the load not only across capacity devices but also across the cache devices of the disk groups involved. VMware recommends leaving the Number of disk stripes per object parameter at its default value of 1 and not striping objects, unless striping is really needed and will genuinely improve performance. In general, striping is not expected to give a tangible performance boost. In hybrid configurations the effect can be positive, especially for read-intensive workloads that suffer from cache misses. Sequential writes can also be accelerated by striping, including in all-flash configurations, since several cache devices are engaged and the destaging of data to the capacity tier is parallelized.
Note that striping significantly increases the number of components in the cluster. This matters for clusters with many VMs and objects, where the component limit per host (9000) can be exhausted.
Also note that the maximum size of a single component is 255GB, so the replica of a large object is automatically split into as many components as its size requires (roughly one per 255GB). In this case, regardless of the striping policy, the replica's components are distributed across several devices; if there are very many of them (more than there are devices in the host or in the cluster, for example when creating a 62TB disk), several components of the same object may end up on the same device.
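A small sketch of how the component count grows with object size, stripe width and FTT under mirroring; it just applies the 255GB rule described above, and the helper names are mine rather than anything from vSAN.

```python
import math

MAX_COMPONENT_GB = 255

def components_per_object(size_gb: float, stripe_width: int = 1, ftt: int = 1) -> int:
    """Approximate component count for a mirrored object (witnesses not counted)."""
    per_replica = max(stripe_width, math.ceil(size_gb / MAX_COMPONENT_GB))
    replicas = ftt + 1                     # RAID-1 keeps FTT+1 full copies
    return per_replica * replicas

print(components_per_object(200))                  # 200 GB, defaults -> 2 components
print(components_per_object(500))                  # 500 GB -> 2 components per replica, 4 total
print(components_per_object(500, stripe_width=4))  # explicit striping raises the count to 8
```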
Capacity Planning for a vSAN Cluster
When planning the size of vSAN cluster storage, keep in mind that the actually occupied space - given the fault tolerance method used (mirroring or EC) and the number of tolerated failures (1 to 3) - can significantly exceed the usable cluster capacity. The effect of the space optimization techniques (EC and DiS) must also be taken into account.
Space should also be set aside for swap files (equal to the RAM size of each VM) and for snapshots.
When vSAN capacity reaches 80% full, cluster rebalancing starts - a background process that redistributes data across the cluster and creates a significant load; it is better not to let this happen. About 1% of space is consumed by formatting the cluster devices with the vSAN file system (VSAN-FS), and a small share (about 5%) goes to DiS metadata. VMware therefore recommends designing a vSAN cluster with a 30% capacity margin, so that rebalancing is never triggered.
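The sketch below pulls the rules of thumb from this section into a single rough sizing estimate (FTT/FTM multiplier, VM swap, ~1% filesystem overhead, ~5% DiS metadata, 30% free space). All workload figures are illustrative assumptions; this is not a VMware sizing tool.

```python
def required_raw_capacity_gb(vm_data_gb: float, vm_ram_gb: float,
                             ftt: int, method: str) -> float:
    """Rough raw vSAN capacity for a workload, including the recommended 30% slack."""
    multiplier = (ftt + 1) if method == "mirroring" else {1: 4 / 3, 2: 3 / 2}[ftt]
    protected = (vm_data_gb + vm_ram_gb) * multiplier   # data plus swap, after FTT overhead
    with_overheads = protected / (1 - 0.01 - 0.05)      # ~1% VSAN-FS, ~5% DiS metadata
    return with_overheads / 0.70                        # keep 30% free to avoid rebalancing

# 100 VMs with 100 GB of data and 8 GB of RAM each, FTT=1 with RAID-5 erasure coding.
print(f"{required_raw_capacity_gb(100 * 100, 100 * 8, 1, 'erasure_coding'):.0f} GB raw capacity")
```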
Choosing a storage controller
vSAN supports, and VMware recommends, the use of multiple storage controllers in the same host: this increases performance, capacity and fault tolerance at the level of individual nodes. That said, no vSAN Ready Node configuration contains more than one storage controller.
It is recommended to choose controllers with the largest possible queue depth (at least 256). Controllers should be used in pass-through mode, with disks presented directly to the hypervisor. vSAN also supports controllers in RAID-0 mode, but this requires extra manipulations during maintenance (for example, when replacing a device). It is recommended to disable the controller's internal cache or, if that is not possible, to set it to 100% read; proprietary acceleration modes of the controllers should also be disabled.
Failover
If a capacity device fails, its data can be rebuilt inside the same disk group or into another disk group (on the same host or on another one), depending on where free space is available.
A cache device failure causes the entire disk group to be rebuilt. The rebuild can take place on the same host (if it has other disk groups with free space) or on other hosts.
To handle a host failure, it is better to have at least one spare host's worth of capacity for the rebuild; to tolerate several failures, correspondingly more spare capacity is needed.
If a component (disk or disk controller) has degraded (component failure without recovery), then vSAN begins to rebuild it immediately.
If a component is absent (network loss, NIC failure, host disconnection, a disk being unplugged - a temporary outage with a chance of recovery), vSAN defers the rebuild (by default, for 60 minutes).
Naturally, any rebuild requires free space in the cluster.
After a failure (of a device, disk group, host, or the network), vSAN pauses I/O for 5-7 seconds while it assesses the availability of the affected object. If the object is found, I/O resumes.
If, more than 60 minutes after a host failure or network loss (i.e. after the rebuild has already started), the lost host returns to service (the network is restored or the host is repaired), vSAN decides which is better (faster): to finish the rebuild or to resynchronize the returned host.
Checksums
By default, vSAN calculates checksums to verify the integrity of objects; this behavior is controlled at the storage policy level. Checksums are computed for each 4KB data block and are verified on read operations and, by a background scrubber, for data that has stayed cold (by default once a year). This mechanism detects data corruption caused by software or hardware problems, for example at the memory or disk level.
If a checksum mismatch is detected, vSAN automatically restores the damaged data by overwriting it.
Checksumming can be disabled at the storage policy level. The scrubber frequency (how often blocks that have not been read are verified) can be changed in the advanced settings (the VSAN.ObjectScrubsPerYear parameter), so the check can run more often than once a year (up to roughly once a week, at the cost of additional load).
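A trivial illustration of what a given VSAN.ObjectScrubsPerYear value means in practice; the values chosen are just examples.

```python
def scrub_interval_days(object_scrubs_per_year: int) -> float:
    """Approximate interval between scrubber passes for a given ObjectScrubsPerYear value."""
    return 365 / object_scrubs_per_year

for scrubs in (1, 12, 52):
    print(f"ObjectScrubsPerYear={scrubs}: a pass roughly every {scrub_interval_days(scrubs):.0f} days")
# 1 -> yearly (the default), 12 -> monthly, 52 -> approximately weekly.
```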
vSAN Network Planning
vSAN supports NIC teaming with port aggregation and load balancing.
Up to and including version 6.5, vSAN requires multicast support on its network. If several vSAN clusters share the same subnet, different multicast addresses must be assigned to their hosts to separate their traffic. Starting with version 6.6, vSAN no longer uses multicast.
When designing the network for vSAN, a leaf-spine architecture is recommended.
vSAN supports NIOC (Network I/O Control) to guarantee bandwidth for its traffic. NIOC runs only on distributed vSwitches; vSAN makes them available in any edition of vSphere (they are included in the vSAN license).
vSAN supports jumbo frames, but VMware considers the performance gain from them insignificant; the recommendation is therefore: if the network already supports them, use them, and if not, they are entirely optional for vSAN and can be omitted.
An example of placing objects in a vSAN cluster
The composition, structure and principles of placing objects and components in a vSAN cluster, methods for ensuring fault tolerance, and the use of storage policies were described above.
Now let's see how this works with a simple example. We have a vSAN cluster of 4 hosts in an all-flash configuration. The figure below schematically shows how three VM disks (Disk-1, Disk-2 and Disk-3) are placed in this cluster.

Disk-1 is bound to a storage policy with Failures To Tolerate (FTT) = 1 and Erasure Coding (Fault Tolerance Method, FTM = EC). The Disk-1 object is therefore distributed across the cluster as 4 components, one per host. The disk's data is written inside these components together with the calculated parity values; in effect this is a network RAID-5.
Disk-2 and Disk-3 are bound to storage policies with FTT = 1 and mirroring (FTM = Mirror). Disk-2 has a usable size of less than 255GB and the default stripe width (Number of disk stripes per object = 1). The Disk-2 object is therefore placed in the cluster as 3 components on different nodes: two mirror replicas and a witness.
Disk-3 has a usable size of 500GB and the default stripe width. Since 500GB is larger than 255GB, vSAN automatically splits one replica of the Disk-3 object into 2 components (Component1-1 and Component1-2) and places them on Host-1. Their replicas (Component2-1 and Component4-2) are placed on hosts 2 and 4 respectively. There is no witness in this case, because the vote-based quorum algorithm makes it unnecessary. Note that vSAN placed Disk-3 across the cluster this way automatically, at its own discretion; such a layout cannot be arranged by hand. It could just as well have placed these components differently, for example one replica (Component1-1 and Component1-2) on Host-4 and the second one on Host-1 (Component2-1) and Host-3 (Component4-2).
Of course, the automatic placement of objects is not arbitrary: vSAN follows its internal algorithms, trying to use space evenly and, where possible, to keep the number of components low.
The placement of Disk-2 could also differ; the general rule is that components of different replicas, and the witness (if any), must reside on different hosts - this is the fault tolerance condition. For example, if Disk-2 were slightly smaller than 1.9TB, each of its replicas would consist of 8 components (a single component is at most 255GB). Such an object could be placed on the same 3 hosts (the 8 components of the first replica on Host-1, the 8 components of the second replica on Host-2, and the witness on Host-3), or vSAN could place it without a witness, distributing the 16 components of both replicas across all 4 hosts (without mixing different replicas on the same host).
Space Efficiency Guidelines
A summary table from the VMware recommendations:

Stretched Cluster Support
vSAN supports a Stretched Cluster mode spanning two geographically separated sites, with the shared vSAN storage pool distributed between them. Both sites are active; if one site fails, the cluster uses the storage and compute resources of the surviving site to restart the failed services.
A detailed discussion of the features of Stretched Cluster is beyond the scope of this publication.
List of references (useful links)
• vSAN 6.5 documentation in the VMware vSphere 6.5 Documentation Center
• Virtual SAN 6.2 Design and Sizing Guide
• Virtual SAN 6.2 Network Design Guide
• Virtual SAN 6.2 capacity optimization technologies
• Striping in Virtual SAN
• The VMware blog on Virtual SAN