How Netflix Uses Python
As many of us prepare for the PyCon conference, we wanted to talk a bit about how Python is used at Netflix. We use Python throughout the entire life cycle: from deciding which series to finance all the way to operating the CDN that ships video to 148 million members. We contribute to many open source Python packages, some of which are mentioned below. If something here interests you, check out our jobs site or look for us at PyCon.
Open Connect is the Netflix content delivery network (CDN). A simple, albeit inaccurate, way to picture the Netflix infrastructure is this: everything that happens before you press the Play button on the remote (for example, logging in, choosing a plan, the recommendation system, selecting titles) runs in Amazon Web Services (AWS), and everything that happens afterwards (i.e., streaming video) runs through Open Connect. Content is placed on the Open Connect CDN servers as close to the end user as possible, both to improve viewing quality for customers and to reduce costs for Netflix and our partners, the Internet service providers.
Various software systems are needed to design, build, and operate a CDN, and many of them are written in Python. The network devices that underlie the CDN are mostly managed by Python applications. Such applications track the inventory of network equipment: which devices are in use, of which models, with which hardware components, and where they are located. The configuration of these devices is controlled by several other systems, including the "source of truth", device configuration applications, and backups. Device interactions for collecting health and other operational data are handled by yet another Python application. Python has long been a popular language in the networking space because it is intuitive and lets engineers solve network problems quickly.
Demand Engineering is responsible for handling regional failovers, traffic distribution, bandwidth operations, and server efficiency in the Netflix cloud. We can proudly say that our tools are built primarily in Python. The failover service uses numpy and scipy for numerical analysis, boto3 for making changes to the AWS infrastructure, and rq for running asynchronous workloads, all wrapped in a thin layer of Flask APIs. The ability to drop into a bpython shell and improvise has saved the day more than once.
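To make the numerical side of this concrete, here is a minimal sketch (not Netflix's actual code; the function and numbers are invented) of the kind of analysis a failover service can do with numpy: when one region fails, its traffic is redistributed to the surviving regions in proportion to their spare capacity.

```python
import numpy as np

def redistribute(traffic, capacity, failed):
    """Return per-region traffic after evacuating the failed region."""
    traffic = np.asarray(traffic, dtype=float)
    capacity = np.asarray(capacity, dtype=float)
    evacuated = traffic[failed]
    survivors = np.ones(len(traffic), dtype=bool)
    survivors[failed] = False
    # Spare capacity (headroom) in each surviving region.
    headroom = capacity[survivors] - traffic[survivors]
    shifted = traffic.copy()
    shifted[failed] = 0.0
    # Spread the evacuated load proportionally to available headroom.
    shifted[survivors] += evacuated * headroom / headroom.sum()
    return shifted

# Three regions; the middle one fails and its 30 units of traffic
# move to regions 0 and 2 in proportion to their headroom.
after = redistribute(traffic=[40, 30, 30], capacity=[100, 60, 50], failed=1)
print(after)
```

A real failover service has to account for many more constraints (scaling lead times, latency, cost), but proportional-to-headroom allocation is the one-line numpy idea underneath.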
We actively use Jupyter Notebooks and nteract to analyze operational data and to prototype visualization tools that help us detect capacity regressions.
The CORE team uses Python for statistical analysis and alerting. We rely on many statistical and mathematical libraries (numpy, scipy, ruptures, pandas) to help automate the analysis of the roughly 1,000 related signals we see when our alerting systems indicate a problem. We have developed a time-series correlation system used both inside and outside the team, as well as a distributed worker system for parallelizing large amounts of analytical work to deliver results quickly.
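As an illustration of what such correlation analysis looks like (a toy sketch with fabricated signals, not the actual Netflix tooling), pandas makes it a one-liner to rank which of many related signals track an alerting metric most closely:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
t = np.arange(200)
# The metric that triggered an alert (synthetic data for illustration).
alert_metric = np.sin(t / 10) + rng.normal(0, 0.1, t.size)

# Candidate signals to screen against it.
signals = pd.DataFrame({
    "cpu": alert_metric * 0.9 + rng.normal(0, 0.2, t.size),  # related
    "rps": np.cos(t / 10) + rng.normal(0, 0.1, t.size),      # phase-shifted
    "noise": rng.normal(0, 1, t.size),                       # unrelated
})

# Correlate every signal with the alerting metric and rank by strength.
corr = signals.corrwith(pd.Series(alert_metric)).abs().sort_values(ascending=False)
print(corr)  # "cpu" ranks first
```

A production system does this across hundreds of signals in parallel and with more robust statistics, but `DataFrame.corrwith` captures the core screening step.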
We also commonly use Python for automation, research, and data filtering tasks and as a convenient visualization tool.
Monitoring, alerting and automatic recovery
The Insight Engineering team builds and operates the tools for rapid problem detection, alerting, diagnostics, and automated remediation. With the growing popularity of Python, the team now supports Python clients for most of its services. One example is the Spectator client library, used by code that records dimensional time-series metrics. We build Python libraries to interact with other services on the Netflix platform. Beyond libraries, the Winston and Bolt products are built with Python frameworks (Gunicorn + Flask + Flask-RESTPlus).
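The service shape mentioned above (a thin Flask app served by Gunicorn) looks roughly like this minimal sketch; the endpoint, payload, and runbook store are hypothetical, not Winston's or Bolt's real API:

```python
from flask import Flask, jsonify

app = Flask(__name__)

# Hypothetical in-memory store of remediation runbooks.
RUNBOOKS = {"disk-full": "clean /tmp and rotate logs"}

@app.route("/runbooks/<name>")
def get_runbook(name):
    """Look up the remediation action for a named runbook."""
    if name not in RUNBOOKS:
        return jsonify(error="unknown runbook"), 404
    return jsonify(name=name, action=RUNBOOKS[name])

if __name__ == "__main__":
    # In production this would be served by Gunicorn, e.g.:
    #   gunicorn app:app --workers 4
    app.run(port=8080)
```

Flask keeps the request-handling layer thin, while Gunicorn provides the process model for serving it at scale.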
The information security team uses Python for a number of important tasks, including security automation, risk classification, and identifying and automatically fixing vulnerabilities. We have published the source code of several successful products, including Security Monkey (our most active open source project). Python is used to protect our SSH resources with Bless. The infrastructure security team uses Python to configure IAM permissions with Repokid, and Python scripts help generate TLS certificates in Lemur.
Among our more recent projects is Prism, a batch framework that helps security engineers analyze the state of the infrastructure and identify risk factors and vulnerabilities in source code. We currently provide Python and Ruby libraries for Prism. The Diffy forensics tool is written entirely in Python, and we also use Python to detect sensitive data with Lanius.
We make extensive use of Python in our machine learning framework for personalization. The models trained here power key aspects of the Netflix experience: from the recommendation algorithms to cover art selection and marketing algorithms. For example, some algorithms use TensorFlow, Keras, and PyTorch to train deep neural networks, XGBoost and LightGBM to train gradient-boosted decision trees, or the broader scientific Python stack (e.g., numpy, scipy, sklearn, matplotlib, pandas, cvxpy). Because we constantly try new approaches, we use Jupyter notebooks for many experiments. We have also developed a number of higher-level libraries that integrate notebooks with the rest of our ecosystem (for example, data access, fact logging and feature extraction, model evaluation, and publishing).
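As a toy, hedged illustration of this Python ML stack (numpy plus scikit-learn; the features, labels, and threshold below are fabricated, not Netflix data or models), a personalization-style classifier can be prototyped in a few lines:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)
# Hypothetical features: [similarity to watch history, title popularity].
X = rng.random((500, 2))
# Synthetic label: the member "plays" titles scoring well on a weighted mix.
y = (0.7 * X[:, 0] + 0.3 * X[:, 1] > 0.5).astype(int)

# Fit a simple play-probability model on the synthetic data.
model = LogisticRegression().fit(X, y)
accuracy = model.score(X, y)
print(accuracy)  # high on this linearly separable toy data
```

Real personalization models are deep networks or gradient-boosted trees over far richer features, but the workflow (assemble a feature matrix, fit, evaluate, iterate in a notebook) is the same.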
Machine Learning Infrastructure
In addition to personalization, Netflix applies machine learning to hundreds of other tasks across the company. Many of these applications run on Metaflow, a Python framework that makes it easy to take ML projects from prototype all the way to production.
Metaflow pushes the boundaries of Python: we use well-parallelized and optimized Python code to fetch data at 10 Gbps, process hundreds of millions of data points in memory, and orchestrate computation over tens of thousands of CPU cores.
We at Netflix are avid users of Jupyter notebooks, and we have written before about the reasons for and nature of this investment.
But Python plays a huge role in how we provide those services. It is the primary language for developing, debugging, researching, and prototyping interactions with the Jupyter ecosystem. We use Python to build custom extensions to the Jupyter server, which let us manage tasks such as logging, archiving, publishing, and cloning notebooks on behalf of users. We offer our users many flavors of Python through various Jupyter kernels, and manage the deployment of those kernel specifications with Python as well.
The Big Data Orchestration team is responsible for providing all the services and tools for scheduling and executing ETL and ad-hoc pipelines.
Many orchestration components are written in Python, starting with our scheduler, which uses Jupyter notebooks and papermill to provide templated job types (Spark, Presto, ...). This gives users a standardized and easy way to express the work that needs to be done. You can read more about it here. We have also used notebooks as real production runbooks in situations that require human intervention, for example, restarting everything that failed in the last hour.
For internal use, we built an event-driven platform, written entirely in Python, that receives event streams from a number of systems and unifies them in a single tool. It lets users define conditions for filtering events and actions for reacting to or routing them. As a result, we have been able to decouple microservices and provide visibility into everything that happens on the data platform.
Our team also developed the pygenie client, which interacts with Genie, our federated job execution service. Internally, we have additional extensions to this library that apply business conventions and integrate with the Netflix platform. These libraries are the primary way users interact programmatically with the big data platform.
Finally, our team has contributed to the papermill and scrapbook open source projects, adding code for both our own and external use cases. Our efforts are being well received in the open source community, which makes us very happy.
The scientific computing team builds a platform for experimentation: A/B tests and beyond. Scientists and engineers can innovate in three areas: data, statistics, and visualization.
Our metrics repository is a PyPika-based Python framework that lets you write reusable, parameterized SQL queries. It is the entry point for any new analysis.
Our causal models library, built on Python and R, gives scientists the ability to explore new models of causal effects. It uses PyArrow and RPy2, so statistics can be computed easily in either language.
Our visualization library is based on Plotly. Since Plotly is a widely adopted specification for visualizations, there are many tools that can output to our platforms.
The partner ecosystem team uses Python to test Netflix applications on devices. Python forms the core of our new continuous integration infrastructure, including managing our orchestration servers, driving Spinnaker, querying and filtering test cases, and scheduling test runs on devices and in containers. Additional post-run analysis is done in Python using TensorFlow to determine which tests are most likely to surface problems on which devices.
Video coding and media cloud development
Our team handles encoding (and re-encoding) the Netflix catalog, and also applies machine learning to gain insight into that catalog.
We use Python in about 50 projects, such as vmaf and mezzfs; we build computer vision solutions on a map-reduce platform called Archer; and we use Python for many internal projects.
We have also open-sourced several tools that ease the development and distribution of Python projects, such as setupmeta and pickley.
Netflix Animation and NVFX
Python is the industry standard for all the major applications we use to create animated and VFX content, so it goes without saying that we use it heavily. All our integrations with Maya and Nuke are in Python, as is the bulk of our Shotgun tooling. We are just getting started with our tooling in the cloud, and anticipate deploying many custom Python AMIs/containers there.
Machine learning in content, science and analytics
The content machine learning team makes extensive use of Python to develop the machine learning models that are at the core of predicting audience size, viewership, and other demand metrics for all content.