“Now He Counts You,” or Data Science from Scratch

    Not so long ago, I wrote about how I accidentally got acquainted with Data Science through the courses from Cognitive Class. To summarize that article briefly: I didn’t really learn anything from the courses, but my curiosity was piqued, so after a while I went to the store and bought the book that this article is about.

    I don’t know how appropriate it is on Habr to review a printed textbook as a way of learning, but this hub is, after all, about the educational process in IT. So if you want to know what this book can teach a complete beginner in Data Science, and whether it is worth the time and money at this stage, welcome under the cut.



    Part 1. "One Is Me" - a little about skills


    I must say that before reading this book, my idea of the benefits of Data Science did not stray far from the title picture, borrowed from my favorite cartoon.

    So that readers who have dropped by can map my experience onto their own, I have to say a little about my starting skills. As last time, the dossier remains virtually unchanged:

    1. Not known to have any ties to calculus or statistics;
    2. No programming skills in Python;
    3. Aware that Data Science exists; no practical skills;
    4. Character: Nordic, steadfast. Unmarried.

    So why did I decide to study this book and share my impressions of it?
    Right after the Cognitive Class courses, I decided to look at Kaggle and realized that even in the tutorial for the Titanic problem I did not understand the essence of almost any of the techniques and definitions.

    This book required no starting skills and promised a pleasant immersion in the world of data science. Am I now confident that, having read it, I can solve the Titanic problem? The answer is at the end of the article.

    Part 2. “Two is a Calf” - general information about the book


    The book “Data Science. The science of data from scratch” seems to have appeared on the domestic market quite recently; at least, I could not download or buy an electronic version. The original was released in 2015. Of course, a lot changes in the IT world in 2 years: for example, new versions of the Python data analysis libraries come out. Here we must give credit to the author (Joel Grus) and to the localizers of the book. The book was originally written for Python 2, but the author did not abandon his brainchild and adapted the source code for Python 3 (and, by the way, posted it on GitHub), and the translator, thankfully, put the already adapted code into the book (apparently with minor adjustments).

    Thanks also to the translators for the brief instructions on installing Anaconda and/or setting up the environment in case you do not want to install Anaconda.

    And so we begin the story of the book itself. On the back cover there is a quote that describes its contents quite accurately: “Joel will give you a tour of data science. As a result, you will move from simple curiosity to a deep understanding of the vital algorithms that any data analyst should know.” - Rohit Sivaprasad. Well, at least the first part of this quote is 100% correct: the book really does resemble a tour in which you have to see the Hermitage in 2 hours, and all you can do is run after the guide, catching a brief note about each masterpiece. Oddly enough, I can’t say that this is bad: at least you manage to finish the book before it gets boring.

    Note that someone like me can be misled by the book’s title.
    Here “from scratch” does not mean going from zero knowledge to some practical level; it means that all the functions for analysis and visualization used in the examples are written from the ground up as the material is presented. The closest analogy is the book “Linux From Scratch”, which is not about quickly starting to use some Linux distribution “with interesting wallpapers”, but about methodically assembling your own system from scratch (even if you never use it afterwards).

    This approach has its advantages and disadvantages. On the one hand, you are unlikely to ever use the functions you borrow from the book; on the other hand, you may come to understand the general principles (I did not get there on the first pass).

    So, as law enforcement officers write in their reports: “On the merits of the matter, I report the following:”

    Part 3. “Three is Python” - content and general approach.


    I must say that the book really does handle the “excursion” format well. It summarizes probably almost all the basic concepts found in other Data Science courses (for example, those on Coursera). Brevity has its advantages and disadvantages: on the one hand, you can read the book in 2-3 days if you want, before you have time to tire of it; on the other hand, there really is little material on each topic, and it is easy to catch yourself skimming, so that you have to go back and re-read the chapter.

    The author showed imagination and tied the material to your job at an imaginary social network for data scientists, “DataSciencester”. It is a pleasant approach: from a distance, the tasks look like everyday ones. The complexity of the “work” tasks you solve grows gradually from chapter to chapter.

    In the first chapter, training starts right off the bat: the author shows how to use Python to solve several toy problems on a small amount of data, for example, to build a chart of the number of friends in our imaginary social network, or to identify and plot the relationship between work experience and salary for data scientists.
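
    To give a feel for the first of these tasks, here is a minimal sketch (mine, not the book’s code). The friendship pairs are simply read off the adjacency matrix in the linear algebra fragment further down; the names friendship_pairs, friend_counts and histogram are my own, and a plain-text bar chart stands in for a real plot so that the snippet needs nothing beyond the standard library.

```python
from collections import Counter

# Undirected friendships as (user, user) pairs, read off the
# adjacency matrix shown later in the article.
friendship_pairs = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 4),
                    (4, 5), (5, 6), (5, 7), (6, 8), (7, 8), (8, 9)]

# Count how many friends each user has: every pair adds one
# friend to both of its members.
friend_counts = Counter()
for i, j in friendship_pairs:
    friend_counts[i] += 1
    friend_counts[j] += 1

# Histogram: how many users have exactly k friends.
histogram = Counter(friend_counts.values())

# Text stand-in for a bar chart, one row per friend count.
for num_friends in sorted(histogram):
    print(f"{num_friends} friend(s): {'#' * histogram[num_friends]}")
```

    On this data, one user has a single friend, four users have two, and five have three.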

    As for the Python crash course in Chapter 2, it can hardly be called excessive, but to the author’s credit he rarely goes beyond it later in the text. So if you work through the basic data types and other concepts once, then, in theory, the code for the problems in the book should not cause difficulties (although for me it did).

    After the introductory part and the basics of Python, the rest of the book can be divided into 3 parts:

    1. The very basics of calculus and statistics;
    2. Collection, processing, and storage of data;
    3. Machine learning (mathematical models and algorithms for data processing and prediction).

    A fragment of the book and its table of contents can be viewed on Ozon (this is not an advertisement); the preview contains exactly the contents and the first chapter.

    Let’s move on from the text to the practical part. Above there was a link to the author’s page on GitHub, where the code presented in the book and the necessary data are located.

    The localized version of the book contains a link to an archive with an adapted (Russified) version of the code; so as not to violate anyone’s rights, I will refrain from posting it.
    All the code is provided as source files for Python 2 and 3, as well as in the form of Jupyter notebooks. I owe this book many thanks, because through it I discovered Anaconda (a convenient thing). In my opinion, the most convenient way to experiment with the book’s code is the Jupyter notebook version (Jupyter is installed by default with Anaconda). On the other hand, in each notebook all the code is crammed into a single cell, with no breakdown and no separate text inserts, so this is more a matter of taste than a clear advantage. By the way, if the root directory from which Jupyter “sees” files does not suit you, here is a tip that really works (with options for both Windows and Linux).

    Note that the notebooks come with pre-computed results, so you can see them without running the code. After re-running the calculations, though, some places may need a little “dancing with a tambourine”: installing libraries or other small things (for example, connecting to the APIs of external services).

    I do not want to make unsupported claims, so I hope the author will not be offended if I show a piece of code from his book.

    Here, for example, is a code fragment on linear algebra (so as not to violate the translator’s rights, it is taken from the original on GitHub). In the book this code is interleaved with the explanation of the material; in the notebook and the source file it comes as one continuous block.

    # -*- coding: utf-8 -*-
    # linear_algebra.py
    import re, math, random # regexes, math functions, random numbers
    import matplotlib.pyplot as plt # pyplot
    from collections import defaultdict, Counter
    from functools import partial, reduce
    #
    # functions for working with vectors
    #
    def vector_add(v, w):
        """adds two vectors componentwise"""
        return [v_i + w_i for v_i, w_i in zip(v,w)]
    def vector_subtract(v, w):
        """subtracts two vectors componentwise"""
        return [v_i - w_i for v_i, w_i in zip(v,w)]
    def vector_sum(vectors):
        return reduce(vector_add, vectors)
    def scalar_multiply(c, v):
        return [c * v_i for v_i in v]
    def vector_mean(vectors):
        """compute the vector whose i-th element is the mean of the
        i-th elements of the input vectors"""
        n = len(vectors)
        return scalar_multiply(1/n, vector_sum(vectors))
    def dot(v, w):
        """v_1 * w_1 + ... + v_n * w_n"""
        return sum(v_i * w_i for v_i, w_i in zip(v, w))
    def sum_of_squares(v):
        """v_1 * v_1 + ... + v_n * v_n"""
        return dot(v, v)
    def magnitude(v):
        return math.sqrt(sum_of_squares(v))
    def squared_distance(v, w):
        return sum_of_squares(vector_subtract(v, w))
    def distance(v, w):
        return math.sqrt(squared_distance(v, w))
    #
    # functions for working with matrices
    #
    def shape(A):
        num_rows = len(A)
        num_cols = len(A[0]) if A else 0
        return num_rows, num_cols
    def get_row(A, i):
        return A[i]
    def get_column(A, j):
        return [A_i[j] for A_i in A]
    def make_matrix(num_rows, num_cols, entry_fn):
        """returns a num_rows x num_cols matrix
        whose (i,j)-th entry is entry_fn(i, j)"""
        return [[entry_fn(i, j) for j in range(num_cols)]
                for i in range(num_rows)]
    def is_diagonal(i, j):
        """1's on the 'diagonal', 0's everywhere else"""
        return 1 if i == j else 0
    identity_matrix = make_matrix(5, 5, is_diagonal)
    #          user 0  1  2  3  4  5  6  7  8  9
    #
    friendships = [[0, 1, 1, 0, 0, 0, 0, 0, 0, 0], # user 0
                   [1, 0, 1, 1, 0, 0, 0, 0, 0, 0], # user 1
                   [1, 1, 0, 1, 0, 0, 0, 0, 0, 0], # user 2
                   [0, 1, 1, 0, 1, 0, 0, 0, 0, 0], # user 3
                   [0, 0, 0, 1, 0, 1, 0, 0, 0, 0], # user 4
                   [0, 0, 0, 0, 1, 0, 1, 1, 0, 0], # user 5
                   [0, 0, 0, 0, 0, 1, 0, 0, 1, 0], # user 6
                   [0, 0, 0, 0, 0, 1, 0, 0, 1, 0], # user 7
                   [0, 0, 0, 0, 0, 0, 1, 1, 0, 1], # user 8
                   [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]] # user 9
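
    As a quick illustration of what these helpers are good for (my own sketch, not code from the book): if each row of the friendships matrix is treated as a vector, dot counts the friends two users have in common, and squared_distance counts the users who are friends with exactly one of them. The two rows are copied from the matrix above; the variable names are mine, and squared_distance is condensed slightly so that the snippet runs on its own.

```python
def vector_subtract(v, w):
    """subtracts two vectors componentwise"""
    return [v_i - w_i for v_i, w_i in zip(v, w)]

def dot(v, w):
    """v_1 * w_1 + ... + v_n * w_n"""
    return sum(v_i * w_i for v_i, w_i in zip(v, w))

def squared_distance(v, w):
    """sum of the squared componentwise differences"""
    diff = vector_subtract(v, w)
    return dot(diff, diff)

# Rows for user 0 and user 3, copied from the friendships matrix.
user_0_row = [0, 1, 1, 0, 0, 0, 0, 0, 0, 0]
user_3_row = [0, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# Entries are 0 or 1, so a row dotted with itself counts its 1's.
num_friends_of_0 = dot(user_0_row, user_0_row)

# A product is 1 only where both rows hold a 1: shared friends.
mutual_friends = dot(user_0_row, user_3_row)

# A squared difference is 1 only where the rows disagree.
differing = squared_distance(user_0_row, user_3_row)

print(num_friends_of_0, mutual_friends, differing)
```

    Here users 0 and 3 share two friends (users 1 and 2), and their rows differ in a single position (user 4).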
    

    The author, as promised, writes all the functions from scratch and tries to explain how they work, and then at the end of each chapter honestly tells you that there are libraries where all this is implemented better.
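
    A small illustration of that last point (my own example; the book recommends dedicated data analysis libraries, which I am not reproducing here). Even Python’s standard library already has a ready-made counterpart to the hand-written distance function: math.dist, available since Python 3.8. The vectors v and w are made up for the demonstration, and distance is condensed from the fragment above.

```python
import math

def distance(v, w):
    """Euclidean distance written out by hand, condensed from the fragment above"""
    return math.sqrt(sum((v_i - w_i) ** 2 for v_i, w_i in zip(v, w)))

# Made-up sample vectors, chosen so the differences are 3, 4 and 0.
v = [1, 2, 3]
w = [4, 6, 3]

by_hand = distance(v, w)
from_library = math.dist(v, w)  # standard library, Python 3.8+

print(by_hand, from_library)
print(math.isclose(by_hand, from_library))
```

    Both calls compute the same quantity; the library version simply saves you from maintaining the code yourself.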

    Toward the end of the book, the pace at which functions are developed starts to feel frantic to an unprepared reader (me, for example), and somewhere past the midpoint I no longer followed all of the code’s logic (I think I will have to re-read it one day). Still, along the way you and the author get to reinvent a whole series of wheels yourselves, building basic solutions in a heap of areas: a primitive analogue of a DBMS, all the basic analysis functions and models, neural networks, decision trees, text generators, CAPTCHA recognizers. Even a cursory acquaintance with this whole set may well spark your interest in the subject.

    Part 4. "Hooray kid!" - Conclusion.


    So, what is the bottom line?

    Since at the moment all my knowledge of Data Science is limited to this book and the courses from Cognitive Class (CC), I will start by comparing the two.

    I don’t know whether it is the native language, or the fact that, unlike the CC courses, the author at least labeled the axes on his graphs properly, but in terms of overall presentation the book gave me much more in the same amount of time (about 2 full days), despite the lack of videos, labs, exams, and so on. Even the fact that the book comes with no “certificates” or “badges” gives CC no advantage at all (because they are worthless).

    Can a complete beginner understand something about the basic approaches in data science? More likely yes than no. Will they be able to do something worthwhile right after finishing the book? More likely no than yes. In any case, using the book’s examples in day-to-day work would probably be bad practice, which means you need to learn the main data analysis libraries (the author himself says so throughout the book). And I suspect it will be useful to return to the from-scratch examples someday, once you have gotten the hang of the ready-made libraries.

    Is the book useful for a beginner? I think so. If you imagine your brain arguing with itself, you get something like an “Overton window”: at first the very idea that you need to dig into some concept, such as variance, regression, or neural networks, seems unacceptable, but each time you gradually come to the conclusion that it is not so scary.

    So, as an excursion into the world of Data Science, the book is quite suitable. At the very least, interest in the subject only grows as you read, and I think that when taking more thorough courses it will be much easier to study in depth the concepts first met in this book.

    Whether a book of about 300 small pages printed on newsprint is worth 550 rubles is up to you. I can say one thing: this book gave me confidence that I can now somehow solve the Titanic problem on Kaggle. I think that is exactly what my next article will be about.
