How and why we changed the configuration of shards in the Evernote architecture

    In last year's overview post on the Evernote architecture, we gave a general description of the servers, or “shards”, that we use for both data storage and application logic. Since Evernote is a more personal service than, say, a social network, we can easily spread individual users' data across separate shards, which gives us fairly simple linear scalability. Each of these shard pairs runs two virtual machines:

    [Image: previous-generation shard pair]

    Each of these virtual machines stores transactional “metadata” in a MySQL database on a RAID-1 array of two 300 GB, 15,000 rpm Cheetah drives. A separate RAID-10 array of 3 TB Constellation drives (7,200 rpm) is partitioned to hold the large Lucene full-text search index files for each user. The paired virtual machines replicate each of these partitions from the current primary to the current secondary using synchronous DRBD.

    These shards have enough disk space and I/O capacity to comfortably handle the data of 100,000 registered Evernote users for at least four years, and their 4U cases have spare drive bays so that we can expand them later if necessary. With dual L5630 processors and 48 GB of RAM, each unit costs up to $10,000 and draws about 374 watts. That works out to roughly $0.10 in hardware costs and 3.7 milliwatts of power per registered user.
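    For reference, here is the arithmetic behind those per-user figures: simple division over a shard's 100,000-user capacity.

\[
\frac{\$10{,}000}{100{,}000\ \text{users}} \approx \$0.10\ \text{per user},
\qquad
\frac{374\ \text{W}}{100{,}000\ \text{users}} \approx 3.7\ \text{mW per user}.
\]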

    Opportunities for improvement


    The generation of shards described above has given us a good price-performance ratio with the very high level of data redundancy that we need. However, we found several areas where this configuration was not ideal for our purposes. For instance:
    1. The 15,000 rpm disks that hold the MySQL database sit idle about 95% of the time, since InnoDB does a good job of caching and serializing I/O. However, we ran into occasional bottlenecks when users with large accounts start an initial data synchronization on a new device: if their metadata is not already in the buffer pool, the resulting burst of I/O can become very expensive.
    2. The Lucene search indexes for our users generate far more I/O than we expected; Lucene accounts for roughly twice as many read/write operations as MySQL. This is largely a consequence of our usage model: every time a note is created or edited, we have to update its owner’s index and flush the changes to disk so that they take effect immediately (see the sketch after this list).
    3. DRBD works well for replicating one or two small partitions, but it becomes very awkward once each server has a significant number of large partitions. Every partition must be configured, managed, and monitored independently, and various failures can occasionally require a full resynchronization of all resources, which can take many hours even over a dedicated 1 Gbps crossover cable.
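    To illustrate point 2, here is a minimal sketch of the kind of per-note index update that produces this I/O pattern. It is written against a recent Lucene API; the index path, field names, and the NoteIndexer class are illustrative assumptions, not Evernote's actual code.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.FSDirectory;

import java.nio.file.Paths;

public class NoteIndexer {

    /**
     * Update the owner's per-user index for a single note and flush the
     * change to disk so that it is searchable immediately. Opening a new
     * IndexWriter per call keeps the sketch short; a real service would
     * hold a long-lived writer per user index.
     */
    public void indexNote(String userId, String noteGuid, String noteText) throws Exception {
        IndexWriterConfig cfg = new IndexWriterConfig(new StandardAnalyzer());
        // Hypothetical per-user index directory layout.
        try (FSDirectory dir = FSDirectory.open(Paths.get("/srv/lucene/" + userId));
             IndexWriter writer = new IndexWriter(dir, cfg)) {
            Document doc = new Document();
            doc.add(new StringField("guid", noteGuid, Field.Store.YES));
            doc.add(new TextField("content", noteText, Field.Store.NO));
            // Replace any previous version of this note in the index.
            writer.updateDocument(new Term("guid", noteGuid), doc);
            // The per-edit commit is what pushes changes to disk right away --
            // and what drives the read/write load described above.
            writer.commit();
        }
    }
}
```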

    These limitations were the main factor capping the number of users we could assign to each shard. Better manageability and better metadata I/O performance would let us safely increase the density of user accounts. Our new generation of shards addresses these problems by moving metadata storage to solid-state drives and by moving the redundant file-storage logic out of the operating system and into our application.

    New configuration


    Our new configuration replaces a rack of a dozen 4U servers with a rack that combines fourteen 1U shards for metadata and application logic with four 4U file-storage servers.

    [Image: new-generation rack layout]

    Each 1U shard runs a pair of simpler virtual machines, each using a single partition on its own RAID-5 array of 300 GB Intel SSDs. The two partitions are replicated with DRBD, and each virtual machine image runs on only one server at a time. We use no more than 80% of the SSDs' capacity, which noticeably improves write reliability and I/O throughput. We chose a hot-spare SSD per box instead of RAID-6, avoiding an additional performance penalty of up to 15%: rebuild times are short, and DRBD replication keeps us covered against a hypothetical multi-disk failure.

    File storage has been transferred from local disks on the main servers to pools of separate WebDAV servers that manage huge file systems on RAID-6 arrays.
    Each time a resource file is added to Evernote, our application synchronously writes a copy of it to two different file servers in the same rack before the metadata transaction is committed. Off-site redundancy is also handled by the application, which replicates each new file to a remote WebDAV server asynchronously in the background.
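    A rough sketch of this write path, assuming Java 11's built-in HttpClient and plain WebDAV-style PUTs; the server URLs, class names, and error handling are illustrative assumptions rather than Evernote's actual implementation:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.List;
import java.util.concurrent.CompletableFuture;

public class ResourceStore {
    private final HttpClient http = HttpClient.newHttpClient();

    // Hypothetical endpoints; the real server pools and paths are not described in the post.
    private final List<URI> localReplicas = List.of(
            URI.create("https://dav1.rack1.example/files/"),
            URI.create("https://dav2.rack1.example/files/"));
    private final URI remoteReplica = URI.create("https://dav1.remote.example/files/");

    /** Synchronously PUT the file to two WebDAV servers in the same rack. */
    public void storeResource(String key, byte[] data) throws Exception {
        for (URI base : localReplicas) {
            HttpRequest put = HttpRequest.newBuilder(base.resolve(key))
                    .PUT(HttpRequest.BodyPublishers.ofByteArray(data))
                    .build();
            HttpResponse<Void> resp = http.send(put, HttpResponse.BodyHandlers.discarding());
            if (resp.statusCode() / 100 != 2) {
                // Fail before the metadata transaction is allowed to commit.
                throw new IllegalStateException("replica write failed: " + base);
            }
        }
        // Off-site redundancy is handled later, in the background.
        CompletableFuture.runAsync(() -> replicateRemote(key, data));
    }

    private void replicateRemote(String key, byte[] data) {
        try {
            HttpRequest put = HttpRequest.newBuilder(remoteReplica.resolve(key))
                    .PUT(HttpRequest.BodyPublishers.ofByteArray(data))
                    .build();
            http.send(put, HttpResponse.BodyHandlers.discarding());
        } catch (Exception e) {
            // A real implementation would queue and retry; omitted in this sketch.
        }
    }
}
```

    In practice the background replication would go through a persistent queue with retries rather than a fire-and-forget task; the sketch only illustrates the ordering guarantee: both local copies must exist before the metadata commit, and the off-site copy follows asynchronously.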

    Results


    This new configuration has enough I/O and memory capacity to handle up to 200,000 users per shard for at least four years. A rack of 14 shards and 4 file servers costs about $135,000 and draws 3,900 watts, which works out to roughly $0.05 and 1.4 milliwatts per user.
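    The per-user figures again follow from dividing by the rack's capacity of 14 × 200,000 = 2.8 million users, and they are what the reductions below refer to:

\[
\frac{\$135{,}000}{2.8\times 10^{6}\ \text{users}} \approx \$0.05\ \text{per user},
\qquad
\frac{3{,}900\ \text{W}}{2.8\times 10^{6}\ \text{users}} \approx 1.4\ \text{mW per user}.
\]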

    Thus, the per-user hardware and power footprint of the new servers has dropped by about 60%. Per-user power consumption across the rest of our supporting equipment (switches, routers, load balancers, image text-recognition servers, and so on) has fallen by a total of about 50% compared to our previous architecture. All of these changes reduce our hosting costs in the long run.

    We don't want to make grand environmental claims, but it is worth noting that this 50% reduction in energy consumption proportionally reduces the carbon footprint of our equipment.

    Beyond the obvious savings, the process of evaluating and testing these solutions gave us a much better understanding of the components and technologies we use. We plan to write several more posts on the details of benchmarking and tuning SSD RAID arrays, comparing Xen and KVM in terms of I/O throughput, managing DRBD, and so on. We hope this information will be useful to colleagues building high-load services.
