git-pull is a web scraper for GitHub. You can use it to scrape (or, if you will, pull) data from a GitHub profile, repo, or file. It's parallelized and designed for anyone who wants to avoid the GitHub API (e.g. because of its rate limit). Using it is very simple:
```python
from git_pull import GithubProfile

gh = GithubProfile("shobrook")
gh.scrape_follower_count() # >>> 168
```
Note that git-pull is not a perfect replacement for the GitHub API. There are some things it can't scrape (yet), like a repo's commit history or release count.
You can install git-pull with pip:

```bash
$ pip install git-pull
```
git-pull provides three objects, `GithubProfile`, `Repo`, and `File`, each with methods for scraping data. Below are descriptions and usage examples for each object.
`GithubProfile(username, num_threads=cpu_count(), scrape_everything=False)`
This is the master object for scraping data from a GitHub profile. All it requires is the username of the GitHub user, and from there you can scrape social info for that user and their repos.
Parameters:

- `username` (str): GitHub username
- `num_threads` (int, optional, default=multiprocessing.cpu_count()): Number of threads to allocate for splitting up scraping work; defaults to the number of cores in your machine's CPU
- `scrape_everything` (bool, optional, default=False): If `True`, does a "deep scrape" and scrapes all social info and repo data for the user (i.e. it calls all the scraper methods listed below and stores the results in properties of the object); if `False`, you have to call individual scraper methods to get the data you want

Methods:
- `scrape_name() -> str`: Returns the name of the GitHub user
- `scrape_avatar() -> str`: Returns a URL for the user's profile picture
- `scrape_follower_count() -> int`: Returns the number of followers the user has
- `scrape_contribution_graph() -> dict`: Returns the contribution history for the user as a map of dates (as strings) to commit counts
- `scrape_location() -> str`: Returns the user's location, if available
- `scrape_personal_site() -> str`: Returns the URL of the user's website, if available
- `scrape_workplace() -> str`: Returns the name of the user's workplace, if available
- `scrape_repos(scrape_everything=False) -> list`: Returns a list of `Repo` objects for each of the user's repos (both source and forked); if `scrape_everything=True`, a "deep scrape" is performed for each repo
- `scrape_repo(repo_name, scrape_everything=False) -> Repo`: Returns a single `Repo` object for a given repo that the user owns

Example:
```python
from git_pull import GithubProfile

# If scrape_everything=True, then all scraped data is stored in object
# properties
gh = GithubProfile("shobrook", scrape_everything=True)
gh.name # >>> "Jonathan Shobrook"
gh.avatar # >>> "https://avatars1.githubusercontent.com/u/18684735?s=460&u=60f797085eb69d8bba4aba80078ad29bce78551a&v=4"
gh.repos # >>> [Repo("git-pull"), Repo("saplings"), ...]

# If scrape_everything=False, individual scraper methods have to be called, each
# of which both returns the scraped data and stores it in the object properties
gh = GithubProfile("shobrook", scrape_everything=False)
gh.name # >>> ''
gh.scrape_name() # >>> "Jonathan Shobrook"
gh.name # >>> "Jonathan Shobrook"
```
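The individual scraper methods combine naturally with the `num_threads` parameter documented above. Here's a minimal sketch that summarizes a user's contribution graph; it assumes the dict returned by `scrape_contribution_graph()` uses date strings as keys, as described above (the exact date format is an assumption):

```python
from git_pull import GithubProfile

# Allocate a fixed thread count instead of the cpu_count() default
gh = GithubProfile("shobrook", num_threads=4)

# scrape_contribution_graph() maps date strings to commit counts;
# the exact date-string format is an assumption here
graph = gh.scrape_contribution_graph()
total_commits = sum(graph.values())
busiest_day = max(graph, key=graph.get) if graph else None
print(f"{total_commits} commits; busiest day: {busiest_day}")
```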
`Repo(name, owner, num_threads=cpu_count(), scrape_everything=False)`
Use this object for scraping data from a GitHub repo.
Parameters:

- `name` (str): Name of the repo to be scraped
- `owner` (str): Username of the owner of the repo
- `num_threads` (int, optional, default=multiprocessing.cpu_count()): Number of threads to allocate for splitting up scraping work; defaults to the number of cores in your machine's CPU
- `scrape_everything` (bool, optional, default=False): If `True`, scrapes all metadata for the repo and scrapes its files; if `False`, you have to call individual scraper methods to get the data you want

Methods:
- `scrape_topics() -> list`: Returns a list of topics/tags for the repo
- `scrape_star_count() -> int`: Returns the number of stars the repo has
- `scrape_fork_count() -> int`: Returns the number of times the repo has been forked
- `scrape_fork_status() -> bool`: Returns whether or not the repo is a fork of another repo
- `scrape_files(scrape_everything=False) -> list`: Returns a list of `File` objects, each representing a file in the repo; files that aren't programs or documentation (e.g. boilerplate) are not scraped
- `scrape_file(file_path, file_type=None, scrape_everything=False) -> File`: Returns a `File` object given a file path

Example:
```python
from git_pull import Repo

repo = Repo("git-pull", "shobrook", scrape_everything=True)
repo.topics # >>> ["web-scraper", "github", "github-api", "parallel", "scraper"]
repo.fork_status # >>> False
```
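To go from a repo to its files, use `scrape_files()` or `scrape_file()` as documented above. A sketch of walking a repo's scraped files; assuming each `File` object exposes the `path` and `type` properties shown in the `File` example below (that they're populated after a deep scrape is an assumption):

```python
from git_pull import Repo

repo = Repo("git-pull", "shobrook")

# scrape_files() skips boilerplate and returns File objects for
# program and documentation files
for f in repo.scrape_files(scrape_everything=True):
    # path and type mirror the File properties shown below; assuming
    # they're populated after a deep scrape
    print(f.path, f.type)

# Or scrape a single file by its path inside the repo
readme = repo.scrape_file("README.md")
```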
`File(path, repo, owner, scrape_everything=False)`
Use this object for scraping data from a single file inside a GitHub repo.
Parameters:

- `path` (str): Absolute path of the file inside the repo
- `repo` (str): Name of the repo containing the file
- `owner` (str): Username of the repo's owner
- `scrape_everything` (bool, optional, default=False): If `True`, scrapes the blame history for the file and the file type (i.e. calls the methods listed below)

Methods:
- `scrape_blames() -> dict`: Returns the blame history for the file as a map of usernames (i.e. contributors) to `{"line_nums": [1, 2, ...], "committers": [...]}` dictionaries, where `"line_nums"` is a list of line numbers the user wrote and `"committers"` is a list of usernames of contributors the user pair-programmed with, if any

Example:
```python
from git_pull import File

file = File("git_pull/git_pull.py", "git-pull", "shobrook", scrape_everything=True)
file.blames # >>> {"shobrook": {"line_nums": [1, 2, ...], "committers": []}}
file.raw_url # >>> "https://raw.githubusercontent.com/shobrook/git-pull/master/git_pull/git_pull.py"
file.type # >>> "Python"
```
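If you skip the deep scrape, you can call `scrape_blames()` yourself and aggregate the result. A minimal sketch that counts lines written per contributor, based on the blame-dict shape documented above:

```python
from git_pull import File

file = File("git_pull/git_pull.py", "git-pull", "shobrook")

# scrape_blames() maps each contributor to {"line_nums": [...],
# "committers": [...]} per the structure documented above
blames = file.scrape_blames()
for username, blame in blames.items():
    print(f"{username} wrote {len(blame['line_nums'])} lines")
```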