“The Little Engine That Could!” or the “Machine Learning and Data Analysis” specialization through the eyes of a Data Science novice

    In my previous article on learning Data Science from scratch, I promised to sign up for the Machine Learning and Data Analysis specialization on Coursera and share my impressions of how accessible this material is to an almost absolute beginner in data science. No sooner said than done! This and similar specializations have already been mentioned on Habr, of course, but I think my two cents will not hurt.

    The quote from the famous film in the title and the picture were not chosen by chance: at times this specialization caused me almost physical pain and a colossal desire to quit everything, but interest eventually prevailed. So if you want to know how I got through this series of courses at the lowest possible financial cost, welcome under the cut.



    Part 1. “Total Recall” - a little about skills


    I think it is appropriate to recall at the very beginning how it all started, so that readers can measure my experience against their own.

    This article is the last in a spontaneously arising series about how I mastered the basics of Data Science from scratch (the articles are listed below in order of appearance):


    I began each of those articles with a brief description of my skills. Working through the materials above took about one week in total (not counting the time spent writing the articles), so I can’t say I had made great progress. By the time I started on Coursera, my background was as follows:

    • The most basic ideas about Data Science (why it is needed, what it includes, a little about working methods)
    • Almost zero knowledge of calculus and statistics (I multiplied matrices by hand with errors, and the principle of testing statistical hypotheses was rejected almost at the genetic level)
    • Almost zero knowledge of Python and some minimal programming skills in other languages (for example, C#), which seemed only to get in the way of picking up Python's logic
    • A restless streak that made me burn an entire month on an over-intensive course

    This was the base with which I started the training. The description of the specialization honestly says: “Intermediate Specialization. Some related experience required.” I admit this worried me, but since the specialization was developed by MIPT and Yandex, I decided to take a chance.

    By the way, I have to note that the course really did make me recall everything: in particularly difficult moments, knowledge that had long ago gone in one ear and out the other, seemingly forgotten as unnecessary, unexpectedly began to surface in my memory. That said, it seems to me that in statistics and calculus this specialization left more in my head than the corresponding subjects of my specialist and master's programs at university combined.

    Well, I won’t torment you anymore; let’s get down to business.

    Part 2. “Inception” - introducing the course


    “What is the most resilient parasite? Bacteria? A virus? An intestinal worm? An idea. Resilient, highly contagious. Once an idea has taken hold of the brain, it's almost impossible to eradicate. An idea that is fully formed, fully understood, that sticks.” - Inception

    While writing this article and recalling the course, I thought that the only sensible reason I became interested in Data Science at all might be that someone planted the idea in me in a dream, or even in a dream within a dream within a dream ...

    And it was not just an idea to take a Data Science course on Coursera; it was an idea to take it as quickly as possible, because there was no spare money and no time to stretch it out over six months.

    For those not familiar with the new Coursera policy: this specialization now works on a subscription basis, namely a 7-day free trial followed by a monthly payment.
    The specialization is designed for about 6 months. One month cost me 4,576 rubles (it costs a little more now).

    So the system gave me 1 month + 1 week, and I decided that the specialization had to be completed within exactly that window. Looking ahead, I will say that the task is quite feasible.

    Now for the specialization program. It consists of 6 courses: five are theoretical, and the sixth is a course project (Capstone Project), which opens only after you pass the first five. It is advisable to take the courses in order; nobody forces you to, of course, but it is highly recommended. If you decide to complete the specialization in a short time, it sometimes makes sense to take courses slightly out of order (more on that later), but this will most likely come back to bite you, and you will have to return to material you supposedly already covered.

    The five courses lead you smoothly toward applying the knowledge on your own. They are most valuable taken together, but in principle each can be useful individually. Some courses (or rather, parts of them) seem somewhat detached from the main context, but the general line can still be traced, and the demands on your skill level grow gradually from course to course.

    You start with the basics of Python, calculus and probability theory, then move on to supervised and unsupervised learning (from basic scikit-learn models to neural networks), then statistics, then practical applications. This seems to be a common approach to teaching Data Science.

    It may be critical for some that the course is built around Python 2. I would not even advise importing things “from the future”, because in some tasks the grader is very sensitive, and problems can arise from, for example, differences between library versions, including when using Python 3 (at least judging by the reviews on the forums).

    In my opinion, the most convenient option is to set up Anaconda. If you already have Anaconda installed with a Python 3 environment, do not worry: setting up a second environment with Python 2 is quite simple (I did it with conda from the console, following this instruction). It installs on both Windows and Linux; I have not tried it on Mac OS X, but I think it installs there without problems too.

    By the way, judging by the specialization forums, many people took this course on Windows. I recommend installing Linux as a second system just in case; this is certainly not required, but it can be useful.

    I installed Linux Mint as a second system purely for this specialization and did not regret it. Subjectively, some computations seemed to run faster under it, and I had fewer problems installing some of the libraries required during the course.

    The first course looks quite friendly to a beginner: the guys from MIPT and Yandex, charismatic in their own way, explain why all this is needed and at first do not scare you with ferocious tasks. After that, the level of frustration depends on your preparation. I, and some people on the forum, had cases where we could not solve a problem or pass a test for days. On the other hand, if you have the aptitude and a good base, I think everything will be simple and clear.

    Each course runs in sessions (about a month each). A course consists of weeks, and a week usually consists of 2-4 lessons. Each lesson, as a rule, contains optional material (lectures, practice quizzes) and graded material: tests, programming assignments, peer-reviewed assignments, and so on. Passing the graded material is required to complete the course.

    If you do not submit something on time, you are not penalized, but if the task depends on other people, as peer-reviewed assignments do, difficulties may arise (everyone else moves on, and nobody is left to check your work). If you do not fit into one session, you can always switch to another, and your results are saved.

    A separate word should be said about the course lecturers and assignments. A large team worked on the course, which has its pluses and minuses. Most of the material is taught by 4 key specialists, each apparently with their own area of expertise. It is clear that the lecturers are experienced and intelligent people, but some of them take getting used to. I will not name names so as not to offend anyone. I will just note that there are lecturers you want to bow down to, because they try to break the material down even for beginners, while others can be a little nervous and at first provoke a burning desire to commit violent acts of an aggressive nature. This is certainly my subjective reaction, caused by poor basic knowledge and personal perception.

    The graded materials are also tied directly to the lecturers. You will notice that the tasks in some lecturers' lessons are (generally) more ferocious: in some places the tests are dead simple, while the tasks on statistics and probability theory make you sweat.
    The programming assignments (and/or peer-reviewed ones) were also written by different people, so in some cases the wording can cause feelings of complete incomprehension and hopeless panic in an unprepared person.

    As for the invited lecturers, they are knowledgeable people, and even if you dislike something about them, they do not have time to bore you much, since there are not many such lectures. As a rule, invited lecturers cover material that is slightly detached from the main context but certainly useful for general development.

    I do not want to go into the details of each course; I think you will figure everything out as you go, so I will move on to useful tips. Let me repeat that there are other articles on Habr about this specialization, for example, from MIPT.

    Part 3. “The Hitchhiker's Guide to the Galaxy” - what to do so that it isn't excruciatingly painful


    “The galaxy is a harsh thing. In order to survive in it, you need to know where your towel is” - The Hitchhiker's Guide to the Galaxy

    Below I will try to describe a few points that cost me hair shed onto the keyboard and sleepless nights. I hope this saves you at least a little trouble; let it be your “towel”.

    1. My big mistake was the lack of a structured approach to recording what I learned. In some respects I am quite old-school and do not use popular good practices. Closer to the 4th course of the specialization, I realized that I should have kept something like a mind map (or any analogue) from the very beginning. The main problem starts when the course stops leading you by the hand and requires you to go back to previously covered material and dig out the implementation of a function or a piece of theory discussed earlier. Do not rely on memory; it will most likely let you down. Thankfully, there are ways to compensate for the lack of a mind map, but I still recommend that you somehow structure what you learn.

    2. Despite the main message of the article, I do not recommend galloping through the specialization the way I did. Yes, 6 months may objectively be a lot, but I think three months is a comfortable pace for absorbing the knowledge steadily. Studying for a month + one week means sleepless nights and no normal weekends, and on top of that your brain may simply fail to digest what you have learned. For example, I noticed a funny effect by the time I was on the 4th course: suddenly, while doing completely unrelated things, I began to understand some points from the first courses. And by the time of the final project, for no apparent reason, the very basics of statistics from the 4th course started to make sense; apparently the brain needs time. To partially compensate for the lack of time, I recommend that, a couple of courses into the training, you start reading a textbook on the topic in parallel. I chose A. Müller, S. Guido, “Introduction to Machine Learning with Python: A Guide for Data Scientists”, 2017. There is little theory in it, but the material closely follows the techniques covered in the course.
    Another option is “Python Machine Learning” by Sebastian Raschka (suggested by Metsur).

    3. Use the course forums and Slack; you will be surprised how many people run into the same problems as you. Since I was in a hurry, I started almost every task by reading the forum threads about the difficulties that come up while solving it. It is not uncommon to find pieces of code on the forum, or the answer format that the grader expects, and in especially difficult cases direct explanations from users who have worked out what the author of the task wanted (the author apparently having difficulty communicating with non-specialists). Slack helped me at the very last stage, when I needed to cooperate with people on peer review. There were few people on the sixth course, so to avoid a long wait it is useful to look for people who have already passed this stage and ask them to evaluate your work, or conversely to help (within the rules) the people catching up with you, so that they reach you sooner and can evaluate your work. Another small life hack: if you do not get enough submissions to review and do not want to wait, you can always look for forum posts where people ask for their work to be checked (even if they were written a couple of months ago). Out of solidarity, though, I advise you not to stop at the three required reviews but to help check more people. Besides the forum, a simple Internet search helps in the first courses; you can easily find hints for your tasks there (for example, one of the tasks in the first courses is based on a scientific article that you can find and peek at for pieces of code).

    4. Keep in mind the following point, which can become a stumbling block for some. In tasks checked by the grader, I recommend producing the answer files with Python's file-writing functions rather than by hand in a text editor. This protects you from “invisible” characters that make the system treat a correct answer as wrong.

    5. Session enrollment. Estimate your progress carefully. If you want to finish quickly, you cannot afford to wait idly. Some tasks cannot be submitted until the session starts. For example, if you finish the 2nd course on the 14th and the session of the third course starts only on the 21st, then for 7 days you will not be able to submit some of the tasks (usually the peer-reviewed ones). It therefore makes sense to sign up for a session a little before you complete the previous course.

    Let me give an example. Say a session has already begun, but its first 3 weeks contain no peer-reviewed tasks. Then it makes sense to join this session and catch up rather than wait for a new session to begin and for your classmates to reach the third week. A second example: I had to enroll in one of the courses ahead of schedule. I finished the second course and immediately enrolled in the fifth, quickly passed the peer-reviewed assignment in its very first week, and then calmly returned to the third and fourth courses in order. This way I did not miss the moment when people were available to evaluate the work, and I made up the lost time afterwards. The downside: I later had to relearn the first week of the fifth course, because everything had flown out of my head.

    6. Not everyone knows this, and it does not seem to be written down anywhere, so just in case: Coursera, at least for now, gives you six months for the Capstone project. My subscription (a month + the free week) expired on 08.08.17, but according to support, access to the Capstone project lasts for six months from the start of the 6th course; in my case, until the end of January, because I started in late July. Knowing this can save you some nerves.

    7. The Capstone project is divided into 4 branches, and completing one of them is enough to finish the specialization. The grading is not always fair across branches. For example, in the 5th task of the 1st project (identification of Internet users) it is very hard to get a high mark, because you need to reach the top of a Kaggle competition. In the 5th task of the sentiment analysis project, on the other hand, you are asked to write a primitive site parser; it can be done in half an hour without even looking at the previous tasks of that branch, and a good grade is easier to get (the best score is the one counted in the end). So wherever your skills are stronger, you can do assignments from other branches in addition to your main one and combine business with pleasure.

    8. Do not be lazy about writing decent code and laying out your notebook well. I was in a hurry (and lacked knowledge), so my code was creepy: I can hardly parse it myself, and it is hard for other people too (this sometimes affects peer ratings). I see nothing shameful in looking at how others do it and correcting yourself a little, without crossing the line into plagiarism, of course. I also recommend copying the text of the task description into the notebook. Now I do not remember what I did in each cell, and access to the assignments is closed once the subscription expires, so you cannot look it up. In fact, though, many of the tasks are on GitHub, so this is not very critical.

    9. It may be trite, but measure your strength. Seriously: over that month I sometimes slept 2 hours a night, did not see friends and relatives, forgot to eat, spent entire weekends solving problems, and much more. So if you really want to master the course in a month without special preparation, think about whether you are ready for it.
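    As an illustration of tip 4, here is a minimal sketch of writing an answer file from Python. The function name write_answer, the file name and the space-separated answer format are assumptions made for this example, not something the course prescribes, and the snippet targets Python 3 rather than the Python 2 used in the course.

```python
import os
import tempfile


def write_answer(path, values):
    """Write values to path as one space-separated line, with no trailing newline."""
    text = " ".join(str(v) for v in values)
    # newline="" disables newline translation, so the bytes on disk match text exactly
    with open(path, "w", encoding="utf-8", newline="") as f:
        f.write(text)
    return text


# Usage: write a two-number answer into a temporary directory and read it back
answer_path = os.path.join(tempfile.mkdtemp(), "answer_1.txt")
write_answer(answer_path, [0.25, 3])
with open(answer_path, "rb") as f:
    raw = f.read()
print(repr(raw))  # the raw bytes make any stray character visible at once
```

    Reading the file back in binary mode is a cheap way to check that nothing except the intended characters ended up in it.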

    By the way, what brought The Hitchhiker's Guide to the Galaxy to mind is that the course periodically suggests setting the random seed to 42.
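    For anyone who has not met seeds before, here is a small sketch of why a fixed seed matters, using only Python's standard random module (Python 3). The variable names are mine; in the libraries used in the course the same idea appears as numpy.random.seed(42) or as the random_state=42 argument of many scikit-learn functions.

```python
import random

SEED = 42  # the value the course tasks tend to suggest

random.seed(SEED)
first_run = [random.randint(0, 100) for _ in range(5)]

random.seed(SEED)  # resetting the seed restarts the same pseudo-random sequence
second_run = [random.randint(0, 100) for _ in range(5)]

print(first_run == second_run)  # True: same seed, same numbers
```

    This reproducibility is what lets an automatic grader compare your numbers with its own.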

    Well, I think it makes sense to summarize.

    Part 4. “I'll be back” - the conclusion


    Let's answer the questions in order:

    1. Were the skills I had gained from my earlier training useful (see the first three articles of the series)?

    - Yes, but not very. On the one hand, it is good to have an idea of what awaits you (the Cognitive Class courses). The “Data Science from scratch” tutorial also came in handy in places; I re-read its section on probability theory (there is not much of it, but what is written there is easier to understand). The experience with Kaggle will be useful when you do the Capstone project. Overall, though, the practical skills from all three earlier articles are not closely tied to passing the specialization, so if you have already made up your mind, you can start “without foreplay”.

    2. Have I suffered from a lack of basic knowledge?

    - Yes, very much so in places, especially when I could not write a simple function for 2.5 days or simply could not grasp certain points of statistics and probability theory. Fortunately, there are the forum and Slack, with many people in the same position, and you can find help there; the course mentors, and sometimes the developers themselves, try to help as well. If things get really bad, you can hire a personal tutor, but I think anyone can cope on their own.

    3. Have I learned anything new?

    - Yes. Firstly, for the first time in my life I wrote a program that ran for 9.5 hours straight and then crashed with a memory error (I fixed it all afterwards, of course). My computer is weak, but even games with decent graphics could not compete with my creation in devouring resources. It was a very good experience: I will now forever remember the importance and benefits of sparse matrices. Secondly, there are other useful points: the course teaches you some Python (I still know it poorly and have not mastered the “Pythonic way”, but it is much better than nothing), and it explains the basic principles of higher mathematics and statistics well (without going into details); in fact, I rediscovered them for myself. The course really does show many interesting techniques, some of which can, if you wish, be carried over into your daily life.

    4. Can everyone learn this course?

    - I think so, as long as the desire is there. Perhaps not in a month, but everyone who decides that they really want it will definitely manage. The people taking the course were very different: young men and women as well as older people, some with a good grasp of the material and some without.
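    Since sparse matrices came up in answer 3, here is a rough sketch of why they matter for memory. The matrix size and the number of nonzero cells are made up for illustration; this is not the data from the course task. It assumes NumPy and SciPy are installed (Python 3).

```python
import numpy as np
from scipy import sparse

# A 1000 x 1000 matrix in which at most 100 cells are nonzero (made-up data)
dense = np.zeros((1000, 1000))
rng = np.random.RandomState(42)
rows = rng.randint(0, 1000, size=100)
cols = rng.randint(0, 1000, size=100)
dense[rows, cols] = 1.0

# CSR format stores only the nonzero values plus their positions
csr = sparse.csr_matrix(dense)

dense_bytes = dense.nbytes
sparse_bytes = csr.data.nbytes + csr.indices.nbytes + csr.indptr.nbytes
print("dense:", dense_bytes, "bytes; sparse:", sparse_bytes, "bytes")
```

    The dense array pays for every zero, while the CSR version pays only for the stored entries and a row index, which is the difference between a program that finishes and one that dies with a memory error.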

    As a bonus, the authors write on the specialization website about possible help with finding a job after completing it. I have not tried it yet, but the opportunity itself is pleasing.

    To summarize: I definitely recommend the specialization. Many aspects of it are still rough, but in terms of price-to-quality ratio I think it is a more than acceptable option.

    What next? Well, maybe I will apply the acquired skills to my hobby and then post the material on Habr, or maybe I will look at how things stand with machine learning on .NET and write about that too. But all this will be much later.

    So I wish everyone good luck in mastering this interesting field of knowledge!

    And so that the article does not seem too serious, here is a “bonus”:

    As a bonus


    Another cool advantage of this specialization is that I learned the word correlation and will now shove it in everywhere, whether it fits or not.

    Your letters and comments on past articles led me to understand that the earlier articles in the series are more or less easy to read and contain a bit of humor (well, I hope that is true). This article, judging by my own feeling, is harder to read, and I wrote it staring at the monitor with a serious face.

    If you think about it, you can look for some correlation between how easy it was to learn from the materials behind each article in the series and the number of conditional “jokes” in that article.

    Let's see if there is any CORRELATION?!

    For each article, let's calculate the ratio of the number of “jokes” to the number of words, and compare it with the difficulty of the material (days spent on learning).

    The articles are numbered in the order I listed them at the beginning, so this one is the fourth. The bonus section was not included when counting jokes and words.

    A joke here means at least some hint of humor (including the pictures at the beginning of an article); quotes in the headings were not counted as jokes. So:

    1. Article No. 1: words = 2575, jokes = 5, days of training = 2
    2. Article No. 2: words = 2098, jokes = 3, days of training = 3
    3. Article No. 3: words = 2667, jokes = 4, days of training = 2
    4. Article No. 4: words = 3051, jokes = 2, days of training = 37

    Below is the code for Python 3. For Python 2, drop the parentheses after print and make sure the division is done in floats; you can also remove list() around zip().

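    import pandas as pd

    # Jokes per word for each of the four articles
    humor_rate = [5 / 2575, 3 / 2098, 4 / 2667, 2 / 3051]
    # Days spent studying the material behind each article
    days = [2, 3, 2, 37]
    df = pd.DataFrame(list(zip(humor_rate, days)), columns=['Humor rate', 'Days of study'])
    print('Data table: \n', df)
    # Series.corr computes the Pearson coefficient by default
    print('Correlation between humor rate and days of study = ', df['Humor rate'].corr(df['Days of study']))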
    

    Conclusion: so, what do we end up with? We have a pronounced negative CORRELATION (the Pearson CORRELATION coefficient), which tells us that, as a rule, the fewer days spent on learning, the more humor in the article. Of course, this is a comic example of CORRELATION. There is certainly too little data, and I had difficulty counting the jokes in each article unambiguously, but let's consider it a small example of how the skills gained in the specialization can be put into practice, including for calculating CORRELATION. P.S. How many times have I mentioned this word in the bonus fragment of the article? That's right: eight, counting the print() output.

    Data table:
        Humor rate  Days of study
    0     0.001942              2
    1     0.001430              3
    2     0.001500              2
    3     0.000656             37

    Correlation between humor rate and days of study = -0.912343823382
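    For the curious, here is a minimal sketch of what pandas computes in the snippet above: the Pearson coefficient r, written out by hand as the sum of products of deviations from the means divided by the product of the square roots of the summed squared deviations. The function name pearson_r is my own; the data are the same four articles (Python 3).

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means (numerator)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Square roots of the summed squared deviations (denominator)
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


humor_rate = [5 / 2575, 3 / 2098, 4 / 2667, 2 / 3051]
days = [2, 3, 2, 37]
r = pearson_r(humor_rate, days)
print(round(r, 4))
```

    With only four points, one of them an obvious outlier (37 days), the value says more about that single article than about any real trend, which is exactly why the example is a joke.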








