JupyterHub, or how to manage hundreds of Python users. Yandex lecture

    The Jupyter platform lets novice developers, data analysts, and students start programming in Python quickly. Suppose your team grows: it now includes not only programmers but also managers, analysts, and researchers. Sooner or later the lack of a shared working environment and the overhead of configuration will start to slow everyone down. JupyterHub, a multi-user server that lets each user launch Jupyter with one click, helps solve this problem. It is great for people who teach Python, and for analysts. The user needs only a browser: no problems with installing software on a laptop, with compatibility, or with packages. The Jupyter maintainers are actively developing JupyterHub alongside JupyterLab and nteract.

    My name is Andrey Petrin, and I head the growth analytics group at Yandex. In this talk from Moscow Python Meetup I recall the advantages of Jupyter, cover the architecture and principles of JupyterHub, and share our experience of using these systems at Yandex. By the end you will know how to bring up JupyterHub on any computer.


    - To begin with, who are the analysts at Yandex? There is an analogy that an analyst is a multi-armed Shiva who can do many different things at once and combines many roles.

    Hello! My name is Andrey Petrin, I am the head of the growth analytics group at Yandex. I will tell you about the JupyterHub library, which at one point greatly simplified life for Yandex analytics: we literally felt a productivity boost across a large number of teams.




    For example, analysts at Yandex are part manager. Analysts always know the timeline of a process and what needs to be done at which point.

    They are also part developer, familiar with various approaches to data processing. The slide shows, in Shiva's hands, the Python libraries that come to my mind; it is not a complete list, just what we use daily. Naturally, our development is not done only in Python, but I will talk primarily about Python.

    Analysts are also part mathematician: we need to make decisions carefully, look at the real data, and search for some kind of truth and understanding rather than settling for the managerial point of view.

    The Jupyter ecosystem helps us a great deal in all of this.



    Jupyter is a platform for interactive programming and report notebooks. Its key entity is the notebook: a document that lives in the browser you use every day and is built out of small cells of code that you write and run. Cell output can be anything: plain printed text, images, or interactive HTML-based elements and widgets that you can manipulate.

    The Jupyter system has been developing for a long time; it supports various programming languages and versions, Python 2 and 3, Go, and so on. It lets us solve our daily tasks nicely.
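
    To make the cell model concrete, here is a minimal sketch of a notebook cell that produces both a rich table and an HTML fragment; the data is invented purely for illustration:

    # A typical notebook cell: compute something and render rich output
    import pandas as pd
    from IPython.display import display, HTML

    df = pd.DataFrame({"site": ["a.example", "b.example"], "visits": [120, 87]})
    display(df)  # rendered as an HTML table in the notebook
    display(HTML("<b>Total visits:</b> %d" % df["visits"].sum()))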

    So what do we do in analytics, and how does Jupyter help us with it?



    The first task is website classification. For a company as big as Yandex, which knows about the entire internet, looking at individual sites is far too time-consuming. There are so many sites, each with its own specifics, that we have to aggregate them into topics: groups of sites, not too large, that behave broadly similarly.

    For this task we build an adjacency graph of all internet hosts, a graph of how similar any two sites are to each other. Manual labeling of hosts gives us an initial dataset describing what kinds of sites exist, and we then extrapolate that manual labeling to the entire internet. We use Jupyter in literally every one of these steps: it lets us continually run MapReduce operations, build such graphs, and carry out this kind of data analysis.

    We automated the manual labeling in Jupyter using input widgets. For each host we display the predicted topic, which is most likely correct; we guess the topic almost every time, but a human is still needed to confirm the label.
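
    As an illustration, here is a minimal sketch of such a labeling tool built with ipywidgets; the hosts, the topic list, and the confirmation handler are made-up placeholders, not Yandex's actual tooling:

    import ipywidgets as widgets
    from IPython.display import display

    hosts = ["news.example.com", "shop.example.org"]   # hypothetical labeling queue
    topics = ["news", "e-commerce", "sports", "other"]

    host_label = widgets.Label(value=hosts[0])
    topic_picker = widgets.Dropdown(options=topics, value="news", description="Topic:")
    confirm = widgets.Button(description="Confirm")

    def on_confirm(_):
        # In a real tool this would persist the confirmed label somewhere
        print("labeled %s as %s" % (host_label.value, topic_picker.value))

    confirm.on_click(on_confirm)
    display(host_label, topic_picker, confirm)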

    And we get all sorts of interesting pictures out of it.



    For example, shown here are sports-related sites and the search queries that occur within the sports topic.



    The encyclopedia topic. It has fewer sites and, in general, fewer unique queries, but more of the core queries.



    The homework topic: ready-made homework solutions. It is interesting because it contains two independent clusters of sites, similar within each cluster but unlike the rest. This is a good example of a topic that we would like to split in two: one half of the sites clearly solves one kind of problem within homework, the other half a different one.



    A completely different and quite interesting task was building a bid optimizer. Yandex buys installs for a number of its mobile applications, including paid acquisition, and we already know how to predict a user's lifetime value: how much we can earn from a given user installing a given application. Unfortunately, it turns out this knowledge is hard to hand over to the marketer, the person who will actually buy the traffic. There is usually some budget and quite a large number of constraints, so it becomes a multidimensional optimization problem, interesting from the analytics point of view, but you need to turn it into a tool for the manager.



    Jupyter helps a lot here. This is the interface we built in Jupyter so that a manager with no Python knowledge can log in and get the result of our forecast. You can choose Android or iOS, the countries, the application. There are fairly complex controls and knobs to adjust: for example, sliders for the budget size and for risk tolerance. These tasks were solved with Jupyter, and we are very pleased that the analyst, being that multi-armed Shiva, can solve such problems alone.

    About five years ago we concluded that the platform has some limitations and problems we wanted to deal with. The first problem is that we have many different analysts, each on different versions, operating systems, and so on. Code that works for one person often does not run for another.

    An even bigger problem is package versions. I hardly need to tell you how hard it is to maintain a consistent environment in which everything starts out of the box.

    In general, we came to understand that providing a new analyst who has just joined the team with a pre-configured environment, where everything is set up and all packages are installed at current versions and maintained in a single place, is just as valuable for analytical work as it is for development. For a developer this idea is natural, but it is not always applied to analysts, precisely because of the constant churn that analytics goes through.

    This is where the JupyterHub library came to our aid.



    It is a very simple application consisting of four cleanly separated components.

    The first component is responsible for authentication: we check the login and password to decide whether to let the person in.

    The second is spawning Jupyter servers: for each user, the same familiar Jupyter server that runs notebooks is launched. It is exactly what happens on your own computer, only in the cloud, if it is a cloud deployment, or as separate processes spawned on a single machine.

    The third is proxying. There is a single entry point to the whole system, and JupyterHub determines which user should be routed to which port; it is completely transparent to the user. And fourth, naturally, managing the database and the system as a whole.



    To describe JupyterHub at a high level: the user's browser arrives at the JupyterHub system, and if the user is not yet authorized or does not have a server running, JupyterHub enters the game and starts asking questions, creating servers, and preparing the environment.

    If the user is already logged in, they are proxied directly to their own server, and the Jupyter notebook server then simply communicates with the person directly, occasionally asking the Hub whether this user is still allowed to access this notebook, and so on.



    The interface is quite simple and convenient. By default, the deployment uses the usernames and passwords of the computer it is deployed on. If you have a server with several users, the login and password are simply the system login and password, and each user sees their own /home directory as their home directory. Very convenient: no need to think about managing a separate set of users.





    The rest of the interface should be quite familiar to you: the standard Jupyter notebooks you have all seen, with a view of the currently running notebooks.



    This, you have probably not seen. It is the JupyterHub control panel: you can stop your server, start it, or, for example, obtain a token for communicating with JupyterHub on your behalf, say, to run microservices inside JupyterHub.
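
    For illustration, here is a minimal sketch of using such a token against JupyterHub's REST API; the token value and hostname are placeholders:

    import requests

    token = "abc123..."   # API token issued from the control panel (placeholder)
    api_url = "https://hub.example.com/hub/api"

    # List users; listing all users requires an admin token
    r = requests.get(api_url + "/users",
                     headers={"Authorization": "token %s" % token})
    r.raise_for_status()
    print([u["name"] for u in r.json()])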



    Finally, an administrator can manage every user: launch individual Jupyter servers, stop them, add and delete users, shut down all servers, and stop or start the Hub. All of this is done in the browser, without any extra configuration, and it is quite convenient.

    The system as a whole is developing very actively.



    The picture shows a course at UC Berkeley that ended this December. It was, in my opinion, the largest data science course in the world: 1,200 students who did not know how to program came to learn programming. It ran on the JupyterHub platform, so students did not need to install any Python on their computers; they could simply open this server in a browser.

    Naturally, at later stages of the course the need to install packages appeared, but this neatly solves the problem of first contact. When you teach Python to someone completely unfamiliar with it, you quite often realize that the routines of installing packages, maintaining a system, and the like are superfluous at that point. You want to inspire the person and show them what this world is like, without drowning in details they will be able to master later.

    Installation:

    python3 -m pip install jupyterhub
    sudo apt-get install npm nodejs-legacy 
    npm install -g configurable-http-proxy 

    Only Python 3 is supported: inside JupyterHub you can run cells under Python 2, but JupyterHub itself runs only on Python 3. The only non-Python dependency is configurable-http-proxy, a Node.js-based proxy that JupyterHub uses for routing.

    Configuration:

    jupyterhub --generate-config

    The first thing you will want to do is generate a config. Everything works even without any settings: by default a local server comes up on port 8000, your users get access with their system login and password, it runs only as root, and it works literally out of the box. But generate-config creates a JupyterHub config file for you in which, essentially in the form of documentation, you can read about absolutely all of its settings. This is very convenient: even without opening the documentation you can see which lines to enable; everything is commented out, and all the default values are visible.
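
    Here are a few typical lines from the generated jupyterhub_config.py, uncommented and adjusted; the values are just examples (the c object is predefined in the config file):

    # jupyterhub_config.py -- produced by `jupyterhub --generate-config`
    c.JupyterHub.ip = '0.0.0.0'      # listen on all interfaces
    c.JupyterHub.port = 8000         # the default port
    c.Spawner.default_url = '/tree'  # land users on the notebook file list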



    I would like to pause here for a caveat. When you deploy this on your server, if you make no special effort, namely, if you do not enable HTTPS, the server will come up over plain HTTP, and your users' logins and passwords will travel in the clear whenever they talk to JupyterHub. This is a very dangerous situation that can cause an incredible number of problems, so do not ignore HTTPS. If you do not have your own HTTPS certificate, you can create one, or use the wonderful letsencrypt.org service, which issues certificates for free; you can set it up on your domain without problems and without money. It is quite convenient; do not skip it.
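
    Enabling HTTPS amounts to two lines in the config; the certificate paths below follow the usual Let's Encrypt layout and are assumptions for illustration:

    # Point JupyterHub at your TLS certificate and private key (paths are examples)
    c.JupyterHub.ssl_cert = '/etc/letsencrypt/live/hub.example.com/fullchain.pem'
    c.JupyterHub.ssl_key = '/etc/letsencrypt/live/hub.example.com/privkey.pem'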

    By default the Hub runs as root and, obviously, spawns each notebook server under the corresponding user. This can be changed, but that is the default. And all users are local; each user's own home directory is picked up. Now I will go into what else can be done.

    The cool thing about JupyterHub is that it is a construction kit. Into literally every element of the diagram I showed, you can plug your own components that simplify your work. For example, suppose you do not want your users typing their system login and password, because it is not very secure, or simply inconvenient, and you want a different login system. This can be done with OAuth and, for example, GitHub.



    Instead of forcing the user to type a username and password, you enable GitHub authorization with a couple of lines of configuration; the user automatically logs in through GitHub and is mapped to a local account by their GitHub username.
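
    A minimal sketch using the oauthenticator package; the client ID, secret, and callback URL are placeholders you receive when registering an OAuth application on GitHub:

    # jupyterhub_config.py -- GitHub OAuth login via the oauthenticator package
    from oauthenticator.github import GitHubOAuthenticator
    c.JupyterHub.authenticator_class = GitHubOAuthenticator
    c.GitHubOAuthenticator.oauth_callback_url = 'https://hub.example.com/hub/oauth_callback'
    c.GitHubOAuthenticator.client_id = 'your-client-id'          # placeholder
    c.GitHubOAuthenticator.client_secret = 'your-client-secret'  # placeholder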



    Other methods of user authentication are supported out of the box. If you have LDAP, that works. Any OAuth provider can be used, and there is a REMOTE_USER authenticator that allows a remote server to vouch for access to your local one. Whatever your heart desires.
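
    For example, LDAP via the ldapauthenticator package; the server address and DN template below are placeholders:

    # jupyterhub_config.py -- LDAP login via the ldapauthenticator package
    c.JupyterHub.authenticator_class = 'ldapauthenticator.LDAPAuthenticator'
    c.LDAPAuthenticator.server_address = 'ldap.example.com'  # placeholder
    c.LDAPAuthenticator.bind_dn_template = [
        'uid={username},ou=people,dc=example,dc=com',
    ]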



    Suppose you have several types of workloads. For example, one uses a GPU, needing its own technology stack and a specific set of packages, and you want to separate it from a CPU workload with a different use case. For this you can write your own spawner, the component that creates the users' Jupyter servers. Shown here is the setup using Docker: you can build a Docker image that is deployed for each user, so the user is not a local account but runs inside their own container.
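
    A minimal sketch using the dockerspawner package; the image name is a placeholder for whatever environment you build:

    # jupyterhub_config.py -- spawn each user's server in its own Docker container
    c.JupyterHub.spawner_class = 'dockerspawner.DockerSpawner'
    c.DockerSpawner.image = 'example/datascience-notebook:latest'  # placeholder image
    c.JupyterHub.hub_ip = '0.0.0.0'  # the Hub must be reachable from inside containers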

    JupyterHub has a number of other convenient features, such as services.



    Suppose you are running on a machine with a limited amount of memory, and you want to conserve resources by disconnecting a user who is not using them, whose server keeps occupying memory for some time after the user has left the system. Or, for example, you have a cloud deployment and can save money on virtual machines by turning off unused ones at night and starting them only when needed.

    There is a ready-made service, cull_idle_servers, which shuts down any user's server after a period of inactivity. All data is preserved; the resources simply stop being consumed, so you can save a little.
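
    Hooking it up is a service entry in the config; cull_idle_servers.py is the script from the JupyterHub examples, and the one-hour timeout is just an example value:

    # jupyterhub_config.py -- cull idle single-user servers after an hour
    import sys
    c.JupyterHub.services = [
        {
            'name': 'cull-idle',
            'admin': True,  # the culler needs admin rights to stop servers
            'command': [sys.executable, 'cull_idle_servers.py', '--timeout=3600'],
        }
    ]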



    As I said, for literally every piece of this scheme you can plug in something of your own. You can write an add-on to the proxy to route users in your own way. You can write your own authenticator, communicate directly with the database using services, and create your own spawners.

    I also want to recommend the Zero to JupyterHub project, a system on top of Kubernetes that lets you deploy everything I have described to any cloud supporting Kubernetes, literally without any bespoke configuration. This is very convenient if you do not want to bother with your own server, devops, and support. Everything works out of the box, and there is a very nice, detailed guide.



    You need JupyterHub when you have several people using Jupyter, and not necessarily using it for the same thing. It is a convenient system that brings these people together and avoids problems down the road. And if they are, moreover, working on the same task, they will most likely need a more or less consistent set of packages.

    The same applies if you get complaints along the lines of: I have a wonderful model, some analyst Vasechkin is trying to reproduce it, and it does not work. At one time this was a constant problem for us, and of course a consistent server environment helps a lot.

    It is also very cool to use this for teaching Python. There is the nbgrader service, which on top of JupyterHub gives you a convenient pipeline for sending homework to students. They fill in the solutions themselves and send them back; automatic tests check the notebook cells and let you assign grades immediately. A very convenient system, I recommend it.
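
    As a taste of how this looks, an nbgrader assignment pairs a student stub with autograded tests; the solution and hidden-test markers below are nbgrader's standard conventions, while the exercise itself is made up:

    # Assignment cell: nbgrader strips the region between the markers
    # from the copy released to students
    def mean(xs):
        ### BEGIN SOLUTION
        return sum(xs) / len(xs)
        ### END SOLUTION

    # Autograded test cell (a separate cell in a real assignment)
    assert mean([1, 2, 3]) == 2
    ### BEGIN HIDDEN TESTS
    assert mean([10]) == 10  # hidden from students, run at grading time
    ### END HIDDEN TESTS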

    Imagine you have come to a seminar where you want to show something in Python, and you do not want to spend the first three hours getting everyone set up from your how-to. You want to start doing something interesting right away.

    You can bring up such a system on your server, give people the address where they can log in, and start using it, without wasting time on unnecessary routine. That's all, thanks.
