Top Github Scraper
Scrape top GitHub repositories and users based on keywords.
I used this tool to analyze the top 1k machine learning users in this article.

Setup
Installation
pip install top-github-scraper
Add Credentials
To make sure you can scrape many repositories and users, add your GitHub credentials to a .env file.
touch .env
Add your username and token to the .env file:
GITHUB_USERNAME=yourusername
GITHUB_TOKEN=yourtoken
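How the package itself loads these values is not shown here. As a rough sketch, a file in this KEY=VALUE format can be checked with the standard library before running a long scrape. `parse_env` is a hypothetical helper written for this example, not part of top-github-scraper.

```python
def parse_env(text: str) -> dict:
    """Parse KEY=VALUE lines, skipping blank lines and comments (hypothetical helper)."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


# Sample content mirroring the .env layout above; placeholder values only.
sample = "GITHUB_USERNAME=yourusername\nGITHUB_TOKEN=yourtoken\n"
creds = parse_env(sample)

# Report which of the two expected keys are absent or empty.
missing = [k for k in ("GITHUB_USERNAME", "GITHUB_TOKEN") if not creds.get(k)]
print("missing keys:", missing)
```

To check a real file, pass the contents of .env to `parse_env` instead of the sample string.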
Usage
Get Top Github Repositories' URLs
from top_github_scraper import get_top_repo_urls
get_top_repo_urls(keyword="machine learning", stop_page=10)
Output at top_repo_urls_<keyword>_<sort_by>_<start_page>_<end_page>.json:
[
"/josephmisiti/awesome-machine-learning",
"/wepe/MachineLearning",
"/udacity/machine-learning",
"/Jack-Cherish/Machine-Learning",
"/ZuzooVn/machine-learning-for-software-engineers",
"/rasbt/python-machine-learning-book",
"/lawlite19/MachineLearning_Python",
"/lazyprogrammer/machine_learning_examples",
"/trekhleb/homemade-machine-learning",
"/ujjwalkarn/Machine-Learning-Tutorials"
]
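The entries are paths relative to github.com rather than full URLs. A minimal sketch of turning them into full URLs and owner/repository pairs follows; it uses an inline sample instead of the real output file, and the variable names are illustrative only.

```python
import json

# Two entries copied from the sample output above.
sample_output = '["/josephmisiti/awesome-machine-learning", "/wepe/MachineLearning"]'
paths = json.loads(sample_output)

# Prefix each path with the GitHub host to get a browsable URL.
full_urls = ["https://github.com" + p for p in paths]

# Split "/owner/repo" into an (owner, repo) tuple.
owners_and_repos = [tuple(p.strip("/").split("/", 1)) for p in paths]

for url, (owner, repo) in zip(full_urls, owners_and_repos):
    print(owner, repo, url)
```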
Get Top Github Repositories' Information
from top_github_scraper import get_top_repos
get_top_repos("machine learning", stop_page=10)
Output for one repository at top_repo_info_<keyword>_<sort_by>_<start_page>_<end_page>.json:
{
"stargazers_count": 48620,
"forks_count": 12155,
"contributors": {
"login": [
"josephmisiti",
"josephmmisiti",
"hslatman",
"0asa",
"ajkl",
"ipcenas",
"cogmission",
"spekulatius",
"basickarl",
"NathanEpstein"
],
"url": [
"https://api.github.com/users/josephmisiti",
"https://api.github.com/users/josephmmisiti",
"https://api.github.com/users/hslatman",
"https://api.github.com/users/0asa",
"https://api.github.com/users/ajkl",
"https://api.github.com/users/ipcenas",
"https://api.github.com/users/cogmission",
"https://api.github.com/users/spekulatius",
"https://api.github.com/users/basickarl",
"https://api.github.com/users/NathanEpstein"
],
"contributions": [
671,
105,
21,
12,
11,
9,
8,
7,
7,
7
]
}
}
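In this output the contributors are stored as three parallel lists (login, url, contributions). One possible way to work with them is to zip the lists into one record per contributor, as sketched below. The data is a shortened copy of the sample above, and the variable names are assumptions made for this example.

```python
# Shortened copy of the sample repository record shown above.
repo_info = {
    "stargazers_count": 48620,
    "forks_count": 12155,
    "contributors": {
        "login": ["josephmisiti", "josephmmisiti", "hslatman"],
        "url": [
            "https://api.github.com/users/josephmisiti",
            "https://api.github.com/users/josephmmisiti",
            "https://api.github.com/users/hslatman",
        ],
        "contributions": [671, 105, 21],
    },
}

contributors = repo_info["contributors"]

# Combine the parallel lists into one dict per contributor.
records = [
    {"login": login, "url": url, "contributions": count}
    for login, url, count in zip(
        contributors["login"], contributors["url"], contributors["contributions"]
    )
]

# Contributor with the most contributions in this record.
top = max(records, key=lambda r: r["contributions"])
print(top["login"], top["contributions"])
```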
Get Top Github Contributors' Profiles
from top_github_scraper import get_top_contributors
get_top_contributors("machine learning", stop_page=10)
Output at top_contributor_info_<keyword>_<sort_by>_<start_page>_<end_page>.csv
Get Top Github Users' Profiles
from top_github_scraper import get_top_users
get_top_users("machine learning", stop_page=10)
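Output at top_user_info_<keyword>_<start_page>_<end_page>.csv:
 | login | url | type | name | company | location | email | hireable | bio | public_repos | public_gists | followers | following |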
0 | rasbt | https://api.github.com/users/rasbt | User | Sebastian Raschka | UW-Madison | "Madison, WI" | | | "Machine Learning researcher & open source contributor. Author of ""Python Machine Learning."" Asst. Prof. of Statistics @ UW-Madison." | 71 | 5 | 13888 | 35 |
1 | tqchen | https://api.github.com/users/tqchen | User | Tianqi Chen | "CMU, OctoML" | | | | Large scale Machine Learning | 28 | 1 | 8611 | 126 |
2 | halfrost | https://api.github.com/users/halfrost | User | halfrost | @Alibaba | Shanghai China | i@halfrost.com | | 💪天道酬勤,勤能补拙。博观而约取,厚积而薄发。Gopher / Rustacean / iOS Dev. / Machine Learning / Retired acmer / Math / Philosophy / Technical Writer. | 22 | 0 | 8566 | 314 |
3 | ageron | https://api.github.com/users/ageron | User | Aurélien Geron | | Paris | | | Author of the book Hands-On Machine Learning with Scikit-Learn and TensorFlow. Former PM of YouTube video classification and founder & CTO of a telco operator. | 43 | 16 | 8383 | 2 |
4 | chiphuyen | https://api.github.com/users/chiphuyen | User | Chip Huyen | https://snorkel.ai | "Mountain View, CA" | | True | Developing tools and best practices for machine learning production. | 19 | 1 | 7839 | 15 |
5 | rhiever | https://api.github.com/users/rhiever | User | Randy Olson | FOXO BioScience | "Vancouver, WA" | rso@randalolson.com | | "Chief Data Scientist, @FOXOBioScience. AI, Machine Learning, and Data Visualization specialist. Community leader for /r/DataIsBeautiful." | 77 | 17 | 5363 | 13 |
6 | lexfridman | https://api.github.com/users/lexfridman | User | Lex Fridman | MIT | "Cambridge, MA" | | | "AI researcher working on autonomous vehicles, human-robot interaction, and machine learning at MIT and beyond." | 2 | 0 | 5031 | 0 |
7 | eriklindernoren | https://api.github.com/users/eriklindernoren | User | Erik Linder-Norén | | "Stockholm, Sweden" | eriklindernoren@gmail.com | | "ML engineer at Apple. Excited about machine learning, basketball and building things." | 24 | 0 | 3764 | 11 |
8 | roboticcam | https://api.github.com/users/roboticcam | User | A/Prof Richard Xu 徐亦达教授 | University of Technology Sydney | Sydney Australia | | | "I am an A/Professor in Machine Learning at UTS. manage a large research team of postdoc, PhD students close to 30 people" | 10 | 0 | 3561 | 0 |
9 | ogrisel | https://api.github.com/users/ogrisel | User | Olivier Grisel | Inria | "Paris, France" | olivier.grisel@ensta.org | | Machine Learning Engineer a Inria Saclay (Parietal team). | 174 | 93 | 3237 | 116 |
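Once the CSV is written, it can be loaded and ranked with the standard library. The sketch below uses an abbreviated two-row inline sample rather than the real file. It assumes the CSV header uses the field names documented for each user and that the first, unnamed column is a row index; check both against your own output.

```python
import csv
import io

# Abbreviated sample (bios shortened); header names are an assumption.
sample_csv = (
    ",login,url,type,name,company,location,email,hireable,bio,"
    "public_repos,public_gists,followers,following\n"
    '0,rasbt,https://api.github.com/users/rasbt,User,Sebastian Raschka,'
    'UW-Madison,"Madison, WI",,,Machine Learning researcher,71,5,13888,35\n'
    '1,tqchen,https://api.github.com/users/tqchen,User,Tianqi Chen,'
    '"CMU, OctoML",,,,Large scale Machine Learning,28,1,8611,126\n'
)

rows = list(csv.DictReader(io.StringIO(sample_csv)))

# Numeric columns arrive as strings, so convert before sorting.
ranked = sorted(rows, key=lambda r: int(r["followers"]), reverse=True)

for row in ranked:
    print(row["login"], row["followers"], row["company"])
```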
Parameters
View a full list of parameters here.
How the Data is Scraped
top-github-scraper scrapes the owners and contributors of the top repositories that appear in GitHub's search results for a given keyword.

For each user, top-github-scraper scrapes 16 data points:
login: Username
url: URL of the user
type: Whether this account is a user or an organization
name: Name of the user
company: User's company
location: User's location
email: User's email
hireable: Whether the user is hireable
bio: Short description of the user
public_repos: Number of public repositories the user has (including forked repositories)
public_gists: Number of public gists the user has (including forked gists)
followers: Number of followers the user has
following: Number of people the user is following