Kubernetes intensive: how the support team works

    On February 1-3 we will hold Slurm-3, a Kubernetes intensive. The announcement and program are here.

    Today I will tell you a bit about what goes on behind the scenes: how we help students cope with the practical exercises and what comes of it. Future participants will also learn what to expect from support.

    I take paid courses two or three times a year myself. I always choose the options with practice, and I very rarely finish it. For me it is like ordering a one-kilogram steak in a restaurant: I eat as much as I can and leave the rest on the plate. But I want those who come to Slurm to get through the whole portion.

    At the first Slurm, we took a relaxed view of the practice: we hand out the assignments, and the participants manage as best they can. That would have led to catastrophe had there not been proactive, talented people in the audience: “I wrote to the chat about a problem 15 minutes ago; since then I have solved it myself and helped five other people.”

    So at the second Slurm, in addition to the three speakers, a dozen support engineers worked with the students: system administrators from the Southbridge team.

    Where do the problems with practice come from?

    The “Do It Yourself” approach. We could have made a walkthrough: “copy the config, start the playbook, voila, your cluster is ready”. It would have been very fast, very simple, and completely pointless. We took the hard way: to complete a task, you need to understand the topic and edit the configs and settings by hand.

    Snowball. All topics and tasks build on each other. If you did not deploy a cluster on the first day, you will not be able to roll out an application on the second. The most important and difficult topic was Ceph.

    Rough patches and screw-ups

    Ceph is a key and complex topic, and it is impossible to move on without it, so when students got stuck on Ceph en masse, the effect was devastating. Here the support team gave it everything they had.

    An error on a slide. We are all human, speakers included. There were mistakes on the slides, and each one meant that all 87 students would immediately write to the chat that nothing worked for them.

    Broadcast glitches. We bought a dedicated channel from the provider and kept a backup channel from MegaFon, but as luck would have it, that did not save us. On the first day of Slurm, a major backbone provider carrying our channel to the Facecast service went down. We launched a broadcast on YouTube, but in the meantime the speakers and the in-person students had moved ahead, and the online students who fell behind were outraged, some even dropping out of the classes. The next day, Facecast changed how it connected to its providers, but the system did not work well for all users right away. And the whole wave of indignation landed on our support team.

    (The problem caused by the failed provider was solved: we stopped the classes, waited until the broadcast was fully working again, and repeated all the missed material. The lag on the second day had to be endured.)

    So, a student asks for help.

    The support engineer must choose a course of action:
    - let the student troubleshoot independently;
    - find the student's mistake and explain it;
    - do the practice stage for the student.

    Some errors are nearly impossible to spot: a wrong login, an uppercase I instead of a lowercase l, and the like.

    When many students get stuck at once, a queue forms for each support engineer. It is impossible to help five people thoughtfully at the same time under time pressure.

    And the time pressure was serious: several thousand messages a day flowed through the internal tech-support chat. The support engineers signed off at midnight and started again at 6 in the morning (fortunately, the support team, like the students, was spread across different time zones).

    Therefore, instead of a walkthrough of the mistake, participants got the answer: “I have fixed everything, your cluster now works as it should, move on.” Yes, “Do It Yourself” took a hit, but we managed to avoid the snowball.

    Small simple joys

    The support team collected questions from the chat and from a special form, sorted them, answered them, and passed the complex ones on to the speakers. As a result, no questions were left unanswered.

    It turned out that online participants found it inconvenient to switch between the broadcast and the console, and we had no text file with the commands, only the presentation on the speaker’s laptop. So one of the support engineers sitting in the hall typed up the commands from the slides and sent them to Telegram.

    In short, behind the brilliant speakers stand a dozen hard workers, thanks to whom the overwhelming majority of participants made it to the end of the practice. Fortunately, Southbridge does infrastructure support for a living, so we have plenty of people who can help.

    Slurm 3 will be better than Slurm 2

    What we did on the fly at Slurm-2, we are now systematizing and optimizing:
    - we assign each support engineer their own group, so that students know their support engineer by sight;
    - we are compiling a knowledge base of typical mistakes and solutions;
    - we are preparing shortcuts for “If you did not manage to finish the practice but want to move on”;
    - we are preparing a participant’s guide with instructions on setting up the workplace and interacting with support.

    Slurm 3: start Kubernetes cluster