Another Github 2: Machine Learning, Datasets, and Jupyter Notebooks
Although free machine learning software is available from many sources on the Internet, GitHub remains an important clearinghouse for all kinds of open source tools used by the machine learning and data analysis community.
This collection covers machine learning repositories, datasets, and Jupyter Notebooks, ranked by number of stars. In the previous part, we looked at popular repositories for studying data visualization and deep learning.
Machine learning
Awesome Machine Learning
38 809, 9 615
An impressive list of frameworks, libraries, and software, classified by language and category (computer vision, natural language processing, etc.). The repository also lists free machine learning books, (mostly) free machine learning courses, and data science blogs.
Scikit-learn
34 067, 16 698
A Python module for machine learning that has been in development since 2007 and is built on the SciPy, NumPy, and Matplotlib libraries. It is distributed under the BSD 3-Clause license. Scikit-learn is a general-purpose tool that contains classification, regression, and clustering algorithms, as well as methods for preparing data and evaluating models.
PredictionIO
11 703, 1 903
An open source machine learning framework that supports event collection, algorithm deployment, evaluation, and templates for well-known tasks such as classification and recommendation. It connects to existing applications through a REST API or SDK. PredictionIO is built on scalable open source services such as Hadoop, HBase (and other databases), Elasticsearch, and Spark.
Dive Into Machine Learning
9 163, 1 673
Material for beginners. The repository contains a collection of IPython tutorials for the Scikit-learn library, which implements a large number of machine learning algorithms, along with links to Python-related machine learning topics, more general data analysis information, and many other tutorials.
Pattern
6 845, 1 353
A web mining module for Python with tools for data collection, natural language processing (part-of-speech tagging, n-gram search, sentiment analysis, WordNet), machine learning, network analysis, and visualization. The well-documented module was created at the Computational Linguistics and Psycholinguistics research center (CLiPS) of the University of Antwerp (Belgium). The repository includes more than 50 usage examples.
GoLearn
6 374, 867
An actively developed machine learning library for Go. It gives developers a full-featured, easy-to-use, and highly customizable package. GoLearn implements the familiar Scikit-learn fit/predict interface.
Vowpal Wabbit
6 189, 1 519
Vowpal Wabbit pushes the boundaries of machine learning with techniques such as feature hashing, allreduce, learning2search, and active and interactive learning. It aims to model massive datasets quickly and supports parallel learning. Particular attention is paid to reinforcement learning, with several contextual bandit algorithms implemented.
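The feature hashing mentioned above can be illustrated with a minimal sketch. This is not Vowpal Wabbit's own code or API: the function name `hash_features`, the `num_bits` parameter, and the use of MD5 as the hash are assumptions made for illustration only. The idea is that each token is hashed straight into a slot of a fixed-size vector, so no vocabulary dictionary has to be kept in memory.

```python
import hashlib


def hash_features(tokens, num_bits=18):
    """Map tokens into a sparse vector with 2**num_bits slots.

    Hypothetical helper for illustration; MD5 is only a stand-in for
    whatever hash function a real system uses.
    """
    size = 1 << num_bits
    vector = {}  # sparse representation: slot index -> accumulated count
    for token in tokens:
        digest = hashlib.md5(token.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "little") % size
        # Tokens that collide simply share a slot and add up.
        vector[index] = vector.get(index, 0.0) + 1.0
    return vector


features = hash_features("the quick brown fox jumps over the lazy dog".split())
```

The memory footprint depends only on `num_bits`, not on how many distinct tokens appear in the data, which is why the technique suits very large datasets.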
NuPIC (Numenta Platform for Intelligent Computing)
5 852, 1 570
NuPIC implements Hierarchical Temporal Memory (HTM) machine learning algorithms. Broadly, HTM is an attempt to model the computations of the neocortex of the human brain, with a focus on storing and recalling spatial and temporal patterns. HTM is a memory system: it is not programmed with a separate algorithm for each task, but learns how to solve a problem. NuPIC suits many kinds of tasks, in particular anomaly detection.
aerosolve
4 522, 570
aerosolve sets itself apart from other libraries with human-friendly debugging tools, Scala code for training, image content analysis suited to ranking, and flexibility and control over features. The library is meant to be used with sparse, interpretable features such as those that commonly occur in search (search keywords, filters) or pricing (number of rooms, location, price).
Code for Machine Learning for Hackers
3 467, 2 220
A companion repository for the book Machine Learning for Hackers. All the code is written in R, a language for statistical computing and graphics (in effect, the standard among statistical software). Numerous R packages are used. Topics covered include general classification, ranking, and regression tasks, as well as statistical procedures such as principal component analysis and multidimensional scaling.
GitHub datasets
Awesome Public Datasets
31 852, 5 361
Another repository of impressive size: a list divided into 30 topics, including biology, sports, museums, and natural language. It covers several hundred datasets, most of which are free, and also links to other big data collections.
OpenAddresses
1 644, 745
The official repository of OpenAddresses.io, a free and open global collection of street addresses. The data includes street names, house numbers, postal codes, and geographic coordinates.
Open Exoplanet Catalog
583, 176
A catalog of all known planets outside the solar system. The database used to be updated within 24 hours of a new planet's discovery, but unfortunately the project is now barely being developed.
CitySDK
510, 149
US Census Bureau data adapted for integration with other open datasets, with convenient functions for working with the Census API and building your own custom datasets: statistics, cartographic GeoJSON, latitude/longitude lookups, etc.
openFDA
353, 84
openFDA is a U.S. Food and Drug Administration (FDA) project that provides researchers and developers with a collection of public datasets through an API, along with documentation and examples of how to use the data. It includes information about side effects of medications, drug labeling, reports on drug withdrawal from the market, and changes to the prescription formula.
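As a rough sketch of what access through the API might look like, the snippet below only builds a query URL and does not send a request. The helper name `build_query` is hypothetical, and the endpoint path and the `search` field used here are assumptions that should be checked against the openFDA documentation before use.

```python
from urllib.parse import urlencode

# Assumed endpoint for drug adverse event reports; verify against the docs.
BASE_URL = "https://api.fda.gov/drug/event.json"


def build_query(search, limit=5):
    """Return a query URL with URL-encoded search and limit parameters."""
    return f"{BASE_URL}?{urlencode({'search': search, 'limit': limit})}"


# The field name below is an assumption used for illustration.
url = build_query("patient.drug.medicinalproduct:aspirin", limit=3)
print(url)
```

Passing the resulting URL to any HTTP client should return JSON, provided the assumed endpoint and field name are correct.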
CERN Open Data Portal
247, 88
The source code for the open data portal of CERN, the European Organization for Nuclear Research. The portal is described as "an access point to a growing range of data from CERN research."
IPython (Jupyter) Notebooks
A list of useful GitHub repositories made up of IPython (Jupyter) notebooks focused on data manipulation and machine learning.
Python Machine Learning Book
9 655, 3 674
The companion repository for the first edition of the book Python Machine Learning (the second edition has its own repository). It covers handling missing values, converting categorical variables into formats suitable for machine learning, selecting informative features, and compressing data by projecting it onto lower-dimensional subspaces.
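Two of the preparation steps listed above, filling in missing values and encoding categorical variables, can be sketched in plain Python. This is an illustration with made-up data and hypothetical helper names (`impute_mean`, `one_hot`), not code from the book or its repository, which relies on dedicated libraries for these steps.

```python
from statistics import mean

# Made-up example rows; None marks a missing value.
rows = [
    {"size": "M", "price": 10.0},
    {"size": "L", "price": None},
    {"size": "S", "price": 14.0},
]


def impute_mean(rows, column):
    """Replace missing values in a numeric column with the column mean."""
    observed = [r[column] for r in rows if r[column] is not None]
    fill = mean(observed)
    return [{**r, column: fill if r[column] is None else r[column]} for r in rows]


def one_hot(rows, column):
    """Replace a categorical column with one 0/1 column per category."""
    categories = sorted({r[column] for r in rows})
    encoded = []
    for r in rows:
        new_row = {k: v for k, v in r.items() if k != column}
        for category in categories:
            new_row[f"{column}_{category}"] = 1 if r[column] == category else 0
        encoded.append(new_row)
    return encoded


prepared = one_hot(impute_mean(rows, "price"), "size")
```

After both steps every row holds only numbers, which is the form most learning algorithms expect.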
Example Data Science Notebook
4 156, 1 463
A repository of teaching materials, code, and data for various data analysis and machine learning projects. The notebook walks through the basic principles of data analysis using the Iris dataset as an example and illustrates how to build a data science workflow. The main guidelines followed in the repository are drawn from the book "The Elements of Data Analytic Style" (Jeff Leek, 2015).
Learn Data Science
2 197, 1 228
A collection of notebooks and datasets covering four algorithmic topics: linear regression, logistic regression, random forests, and k-means clustering. Learn Data Science is based on materials created for the Open Data Science Training project.
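For readers who have not met the last of these topics, here is a minimal sketch of k-means clustering (Lloyd's algorithm) in plain Python. It is not taken from the Learn Data Science notebooks; the data points and starting centroids are made up, and the starting centroids are passed in explicitly to keep the run deterministic.

```python
import math


def kmeans(points, centroids, iterations=10):
    """Alternate between assigning points and recomputing centroids."""
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: math.dist(p, centroids[i]),
            )
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster,
        # keeping the old centroid if the cluster is empty.
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster)) if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
    return centroids, clusters


points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.0), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, clusters = kmeans(points, centroids=[points[0], points[3]])
```

A fixed number of iterations keeps the sketch short; practical implementations usually stop once the assignments no longer change and try several random initializations.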
IPython Notebooks
2 106, 1 226
The repository contains a variety of IPython notebooks, ranging from an overview of the language and IPython's functionality to examples of using popular data analysis libraries. It offers a comprehensive collection of machine learning, deep learning, and big data processing materials based on the courses Machine Learning by Andrew Ng (Coursera), Intro to TensorFlow for Deep Learning (Udacity), and Spark (edX).
Scikit-learn Tutorial
963, 573
A repository for learning the Scikit-learn library, which implements a large number of machine learning algorithms. The library provides implementations of both supervised and unsupervised learning algorithms. Scikit-learn is built on top of SciPy (Scientific Python).
Machine Learning
543, 336
A series of very detailed tutorials in IPython Notebook form, based on Andrew Ng's Machine Learning course (Stanford University), Tom Mitchell's course (Carnegie Mellon University), and Christopher M. Bishop's book "Pattern Recognition and Machine Learning."
This list is not exhaustive, so we welcome comments with your favorite (or your own) repositories.