randall March 27, 2019 at 15:06

Another Github 2: Machine Learning, Datasets, and Jupyter Notebooks

Despite the many sources of free machine learning software available on the Internet, Github remains an important clearinghouse for all types of open source tools used by the machine learning and data analysis community.

This collection contains machine learning repositories, datasets, and Jupyter Notebooks, ranked by star count. The two numbers under each title are the repository's stars and forks. In the previous part, we covered popular repositories for studying data visualization and deep learning.

Machine learning

Awesome Machine Learning
38 809, 9 615

An impressive list of frameworks, libraries, and software organized by language and category (computer vision, natural language processing, etc.). The repository also lists free machine learning books, (mostly) free machine learning courses, and data science blogs.

Scikit-learn
34 067, 16 698

Scikit-learn is a Python machine learning module under development since 2007, built on the SciPy, NumPy, and Matplotlib libraries and distributed under the BSD 3-Clause license. It is a general-purpose tool that contains classification, regression, and clustering algorithms, as well as methods for preparing data and evaluating models.

PredictionIO
11 703, 1 903

An open source machine learning framework that supports event collection, algorithm deployment, evaluation, and templates for well-known tasks such as classification and recommendation. It connects to existing applications through a REST API or an SDK. PredictionIO is built on scalable open source services such as Hadoop, HBase (and other databases), Elasticsearch, and Spark.

Dive Into Machine Learning
9 163, 1 673

Material for beginners in the subject. The repository contains a collection of IPython tutorials for the Scikit-learn library, which implements a large number of machine learning algorithms, as well as several links to Python-related machine learning topics and more general data analysis information. The author provides links to many other tutorials covering the topic.

Pattern
6 845, 1 353

A web mining module for Python with tools for data analysis, natural language processing (part-of-speech tagging, n-gram search, sentiment analysis, WordNet), machine learning, network analysis, and visualization. The well-documented module was created at the Computational Linguistics and Psycholinguistics research center (CLiPS) of the University of Antwerp (Belgium). The repository includes more than 50 examples of its use.

GoLearn
6 374, 867

An actively developed machine learning library for Go. It aims to give developers a full-featured, easy-to-use, highly customizable package. GoLearn implements the familiar Scikit-learn-style learning interface.

Vowpal Wabbit
6 189, 1 519

Vowpal Wabbit pushes the boundaries of machine learning with techniques such as hashing, allreduce, learning2search, and active and interactive learning. It aims to model massive datasets quickly and supports parallel learning. Particular attention is paid to reinforcement learning, with several contextual bandit algorithms implemented.
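
The hashing technique mentioned above can be illustrated with a minimal sketch. This is not Vowpal Wabbit's code: the function name `hash_features`, the `num_bits` parameter, and the use of CRC32 as the hash function are assumptions made for the example. The idea is that each token is hashed straight into a fixed-size index space, so no vocabulary dictionary has to be kept in memory.

```python
import zlib


def hash_features(tokens, num_bits=10):
    """Map string tokens to a sparse count vector with 2**num_bits slots.

    Hypothetical helper for illustration; CRC32 stands in for whatever
    hash function a real system would use.
    """
    size = 1 << num_bits
    vector = {}
    for token in tokens:
        # A deterministic hash, reduced modulo the table size, is the index.
        index = zlib.crc32(token.encode("utf-8")) % size
        # Colliding tokens simply share a slot and their counts add up.
        vector[index] = vector.get(index, 0.0) + 1.0
    return vector


sentence = "the quick brown fox jumps over the lazy dog".split()
print(hash_features(sentence))
```

A smaller `num_bits` saves memory at the cost of more collisions; that trade-off is the main design decision when using this trick.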

NuPIC (Numenta Platform for Intelligent Computing)
5 852, 1 570

NuPIC implements Hierarchical Temporal Memory (HTM) machine learning algorithms. HTM is an attempt to model the computational operations of the neocortex of the human brain, and it focuses on storing and recalling spatial and temporal patterns. HTM is a memory system: it is not programmed and does not learn to execute different algorithms for different tasks; it learns how to solve the problem. NuPIC suits a wide range of tasks, in particular detecting anomalies in patterns.

aerosolve
4 522, 570

aerosolve tries to distinguish itself from other libraries by focusing on user-friendly debugging tools, Scala code for training, an image content analysis engine for simple image-based ranking, and flexibility and control over features. The library is meant to be used with sparse, interpretable features such as those common in search (search keywords, filters) or pricing (number of rooms, location, price).

Code for Machine Learning for Hackers
3 467, 2 220

The companion repository for the book Machine Learning for Hackers. All the code is written in R, a language for statistical computing and graphics that is the de facto standard for statistical software. It relies on numerous R packages. Topics covered include common classification, ranking, and regression tasks, as well as statistical procedures such as principal component analysis and multidimensional scaling.

Github datasets

Awesome Public Datasets
31 852, 5 361

Another repository impressive in size: a list divided into 30 topics, including biology, sports, museums, and natural language. It links to several hundred datasets, most of which are free, and to other Big Data collections.

OpenAddresses
1 644, 745

The official repository of OpenAddresses.io, a free and open global collection of street addresses. The data includes street names, house numbers, postal codes, and geographic coordinates.

Open Exoplanet Catalog
583, 176

A catalog of all known planets outside the Solar System. The database used to be updated within 24 hours of a new planet's discovery, but the project is now barely maintained.

CitySDK
510, 149

A toolkit around US Census Bureau data, adapted for integration with other open datasets. It provides convenience functions for working with the Census API and building your own custom dataset: statistics, cartographic GeoJSON, latitude/longitude lookups, and so on.

openFDA
353, 84

openFDA is a U.S. Food and Drug Administration (FDA) project that gives researchers and developers API access to a collection of public datasets, along with documentation and examples of how to use the data. It covers side effects of medications, drug labeling, drug recall reports, and changes to the prescription formula.

CERN Open Data Portal
247, 88

The source code of the open data portal of CERN, the European Organization for Nuclear Research. The portal is described as "an access point to a growing range of data from CERN research."

IPython (Jupyter) Notebooks

A list of useful Github repositories consisting of IPython (Jupyter) notebooks focused on data manipulation and machine learning.

Python Machine Learning Book
9 655, 3 674

The companion repository for the first edition of the book Python Machine Learning (a separate repository covers the second edition). The book covers handling missing values, converting categorical variables into formats suitable for machine learning, selecting informative features, and compressing data by projecting it onto lower-dimensional subspaces.
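
Two of the preprocessing steps named above, filling in missing values and converting categorical variables, can be sketched with the standard library alone. This is a rough illustration and not code from the book's repository: the helpers `impute_mean` and `one_hot` and the sample columns are hypothetical, and mean imputation is only one of several possible strategies.

```python
from statistics import mean


def impute_mean(column):
    """Replace None entries with the mean of the observed values."""
    observed = [value for value in column if value is not None]
    fill = mean(observed)
    return [fill if value is None else value for value in column]


def one_hot(column):
    """Encode a categorical column as rows of 0/1 indicators.

    Returns the sorted category names and one indicator row per entry.
    """
    categories = sorted(set(column))
    rows = [[1 if value == cat else 0 for cat in categories] for value in column]
    return categories, rows


prices = [10.0, None, 14.0]          # hypothetical numeric column with a gap
colors = ["red", "green", "red"]     # hypothetical categorical column

print(impute_mean(prices))  # [10.0, 12.0, 14.0]
print(one_hot(colors))      # (['green', 'red'], [[0, 1], [1, 0], [0, 1]])
```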

Example Data Science Notebook
4 156, 1 463

A repository of tutorials, code, and data for various data analysis and machine learning projects. The notebook walks through the basic principles of data analysis using the Iris dataset as an example and illustrates how to build a data science workflow. Its main guidelines are drawn from the book "The Elements of Data Analytic Style" (Jeff Leek, 2015).

Learn Data Science
2 197, 1 228

A collection of notebooks and datasets covering four algorithmic topics: linear regression, logistic regression, random forests, and k-means clustering. Learn Data Science is based on materials created for the Open Data Science Training project.
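
As a taste of the last of those four topics, here is a minimal sketch of k-means clustering (Lloyd's algorithm) on 2-D points. It is not taken from the repository: the function name `kmeans`, the fixed iteration count, and the caller-supplied starting centroids are assumptions chosen to keep the example short and deterministic.

```python
def kmeans(points, centroids, iterations=10):
    """Run Lloyd's algorithm on 2-D points from given starting centroids.

    Returns the final centroids and the clusters from the last
    assignment step.
    """
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for x, y in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: (x - centroids[i][0]) ** 2 + (y - centroids[i][1]) ** 2,
            )
            clusters[nearest].append((x, y))
        # Update step: move each centroid to the mean of its cluster;
        # an empty cluster keeps its previous centroid.
        centroids = [
            (sum(px for px, _ in c) / len(c), sum(py for _, py in c) / len(c)) if c else old
            for c, old in zip(clusters, centroids)
        ]
    return centroids, clusters


points = [(1.0, 1.0), (1.0, 2.0), (9.0, 9.0), (10.0, 8.0)]
centroids, clusters = kmeans(points, [(0.0, 0.0), (10.0, 10.0)])
print(centroids)  # [(1.0, 1.5), (9.5, 8.5)]
```

In practice the starting centroids are usually chosen at random, and the result can depend on that choice.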

IPython Notebooks
2 106, 1 226

The repository contains a variety of IPython notebooks, from an overview of the language and IPython functionality to examples of using popular data analysis libraries. It includes an extensive collection of machine learning, deep learning, and big data materials from the courses Machine Learning by Andrew Ng (Coursera), Intro to TensorFlow for Deep Learning (Udacity), and Spark (edX).

Scikit-learn Tutorial
963, 573

A repository for learning the Scikit-learn library, which implements a large number of machine learning algorithms, both supervised and unsupervised. Scikit-learn is built on top of SciPy (Scientific Python).

Machine Learning
543, 336

A series of very detailed tutorials in IPython Notebook form, based on Andrew Ng's Machine Learning course (Stanford University), Tom Mitchell's course (Carnegie Mellon University), and Christopher M. Bishop's book "Pattern Recognition and Machine Learning."

This list is not exhaustive, so we welcome comments listing your favorite (or your own) repositories.
