Zach Jones, Systems & Performance Architect
Improving the performance of our technology for the benefit of our customers and their audiences is an ongoing course of action at Verizon Media. For example, over the past two years, our performance and kernel engineers have eliminated virtually all packet drops (over 98% removed), improved performance health checks on our edge servers by 50%, and increased server capacity by up to 40%.
We've also coupled the above with network automation and organic network expansion — currently approaching 75 Tbps — to improve the user experience. Carefully tuning our performance has played a major role in our ability to support rapidly changing, and sometimes unpredictable, network surges that come as we deliver software updates for millions of game consoles, live video streams for major sporting events, and when multi-CDN load balancers move load to our network.
Maintaining quality at scale involves optimizing performance across every part of the Verizon Media Platform tech stack: from its lower layers, at the CPU and NIC, all the way up to the OS and the applications. Ultimately, our goal is always the same: great performance. To get there, we perform data-driven analysis, relying on measurable performance changes to validate our decision making.
We run 20,000 servers around the world, largely Broadwell and Haswell chipsets, typically with 32 to 40 cores. We added 12,000 servers in the last three years alone. However, most servers are not optimized to scale for our workloads out of the box. Simply adding more servers doesn't make for more efficient operation, and indeed can create additional challenges. Effective scaling requires careful optimization of existing components. Being able to optimize one server so that it is capable of processing two or three times (or more) requests than with the default configuration, can make a powerful difference to the network's overall performance.
To guarantee that the local CPU cache is consistent with memory, modern CPUs employ a snoop protocol. This lets caches listen for modification to variables on any CPU and update their versions of these variables accordingly. Not surprisingly, the particular technique used can have a significant impact on memory performance.
By default, our hardware vendors use a protocol called Early Snoop. It has a lower latency to resolve cache coherency, as all cores can make cache coherency requests simultaneously and send out broadcasts. We have found that during peak load scenarios, our systems generate heavy amounts of simultaneous disk and NIC activity. These activities result in a high number of snoop broadcasts, leading to communication bottlenecks. This causes IO device slowing and can eventually lead to processing stopping entirely.
By switching to Home Snoop mode, an approach which coalesces snoop requests, we have seen a significant reduction in broadcast traffic. The processor's Quick Path Interconnect (QPI) is no longer starved during periods of simultaneous heavy disk and network IO operations; furthermore, packet drops that we saw with Early Snoop significantly reduced in number.
Changing the snoop protocol depends simply on changing a BIOS setting. However, rebooting 20,000 servers without disrupting customers requires automation. We can make this kind of large scale deployment change work in production partly thanks to our StackStorm-based IT automation platform, Crayfish.
During testing of the switch to Home Snoop, a failover event occurred: one of our largest media customers, which has a multi-CDN deployment, experienced a problem with another vendor and moved a significant portion of their traffic to our CDN. This provided an opportunity to test the Home Snoop improvements at scale, and they were extremely impactful.
The figure above shows the effect of the change. The group that was still using Early Snoop saw an increase in drops by 13.75x (55K packet drops per server per day) while the group that had switched to Home Snoop saw an increase of only 4.23x (27K packet drops per machine per day). Home Snoop immediately proved its value during the failover event.
Another set of important performance tuning involved the network interface and the driver. Here, we focused on bringing down packet drops usually occurring with bursty traffic. During large events, inbound traffic was so heavy, the NIC was unable to keep up, and we saw packet drops sooner than expected. As we dug into the reasons why, we found that several parameters on the NIC itself needed adjusting, including the number of queues, the queue size, and interrupt scheduling. To optimize these specifications for our particular workload and hardware configuration, we concentrated on tuning the Receive Side Scaling (RSS) by making the inbound queues longer, reducing their overall number, and balancing the interrupts across NUMA nodes.
The graph above shows a test we ran in North America, in which each PoP is divided into a control (i.e., untuned) group and a test (i.e., tuned) group. Here, we present the number of drops summed daily over one week.
Following the tunings, our test group saw approximately 95% fewer packet drops than the control group, allowing significantly more requests to be processed. This also means less action is required to manually manage the health of the network during surges, leaving our engineers free to focus on other areas.
While the NIC and driver level tuning concentrated on improving the total capacity we can deliver, the CPU scheduling tunings were focused instead on enhancing how consistently we can deliver content.
Without these tunings, inbound and outbound messages have to compete for resources. When we began to investigate the root cause, we found that the contention over resources was a result of how the kernel was scheduling the handling of these messages. This meant that during peak traffic, the load wasn't migrated away until after the CPUs in question were saturated. To fix this, we set the CPU affinity of our web server processes to exclude CPUs dedicated to processing incoming network traffic.
The graphs above show the impact of enabling the CPU scheduling tunings globally across the CDN on March 21–22. We assess the impact based on the 95th percentile and median values of a performance health check metric, a composite metric which demonstrates the relative response time of a server. As expected, low traffic valleys were not significantly reduced; however, the peaks reveal significantly reduced contention between incoming and outgoing traffic. This translates to a major improvement on both the outliers and medians, particularly during peak loads. We can now better handle surges in traffic and iron out problems related to high outlier behavior, such as rebuffers in video streams or the overall responsiveness of a server for all users.
Optimizing the upper layers of our tech stack is equally as important as tuning the lower layer elements. In the process of recently upgrading our OS, we also upgraded our Linux kernels to take advantage of upstream engineering work from the Linux kernel community. The new kernel had around four years of development beyond the previous version deployed, including improvements to the memory management system, which reduces blocking page allocations, and improves load distribution and performance when using the epoll API and socket sharding.
In the graph above, you can see the effect of the upgrading process from late November to early January as a decline in the 99th percentile performance health checks. The underlying kernel improvements all led to a more even load distribution across all our web server request processors. This resulted in a substantial drop to these outliers, making requests for all our customers more reliable.
Over the past two years, the far-reaching system tunings that performance and kernel engineering have deployed have eliminated virtually all packet drops (over 98% removed) and halved our performance health checks on our edge servers. Our server capacity has gone up by 10–40% (the exact amount varies according to the customer profile and event), allowing us to deliver more traffic, faster. Outlier behavior has improved significantly, making for a more consistent experience and we have seen good improvement on the medians, in particular during peak load. In summary, the performance tuning to the entire tech stack have allowed us to better handle any unexpected traffic spikes (whether from a highly anticipated gaming console update or a popular streaming video live event) and deliver more consistent performance for all our users.