clarku-youtube-crawler
Clark University YouTube crawler and JSON decoder for YouTube JSON. Please read the documentation in DOCS.
PyPI page: "https://pypi.org/project/clarku-youtube-crawler/"
To install:
pip install clarku-youtube-crawler
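To confirm that the install succeeded, the Python standard library can report the installed version. This is a minimal sketch; the helper name `installed_version` is illustrative and is not part of the package.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Prints a version string, or None if the package is not installed
    print(installed_version("clarku-youtube-crawler"))
```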
The crawler depends on several other packages.
All dependencies are declared by the package, so they should install automatically. If any are missing, download requirements.txt, navigate to the folder that contains it, and run:
pip install -r requirements.txt
To upgrade:
pip install clarku-youtube-crawler --upgrade
Go to the project folder and delete config.ini if it already exists.
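The config.ini cleanup can also be scripted. The sketch below assumes config.ini sits directly in the project folder; the helper name `remove_stale_config` is hypothetical and is not provided by the package.

```python
from pathlib import Path


def remove_stale_config(project_dir="."):
    """Delete config.ini in project_dir if present; return True if a file was removed."""
    config_path = Path(project_dir) / "config.ini"
    if config_path.is_file():
        config_path.unlink()
        return True
    return False


if __name__ == "__main__":
    # Run from the project folder, or pass its path explicitly
    print(remove_stale_config("."))
```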
Case 1: Crawl videos by keywords.
import clarku_youtube_crawler as cu
# Crawl all JSONs
crawler = cu.RawCrawler()
crawler.build("low visibility")
crawler.crawl("low visibility", start_date=14, start_month=12, start_year=2020, day_count=5)
crawler.crawl("blind", start_date=14, start_month=12, start_year=2020, day_count=5)
crawler.merge_to_workfile()
crawler.crawl_videos_in_list(comment_page_count=1)
crawler.merge_all(save_to='low visibility/all_videos.json')
# Convert JSON to CSV
decoder = cu.JSONDecoder()
decoder.json_to_csv(data_file='low visibility/all_videos.json')
# Crawl subtitles from CSV
# If you don't need subtitles, delete the following lines
subtitleCrawler = cu.SubtitleCrawler()
subtitleCrawler.build("low visibility")
subtitleCrawler.crawl_csv(
    videos_to_collect="low visibility/videos_to_collect.csv",
    video_id="videoId",
    sub_title_dir="low visibility/subtitles/"
)
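The crawl() calls in Case 1 describe a date range through start_date, start_month, start_year, and day_count. The sketch below shows one plausible reading of those parameters, assuming day_count means that many consecutive days beginning on the start date. That interpretation is an assumption that should be checked against DOCS, and `crawl_window` is a hypothetical helper, not part of the package.

```python
from datetime import date, timedelta


def crawl_window(start_date, start_month, start_year, day_count):
    """List the days covered, ASSUMING day_count consecutive days from the start date."""
    first_day = date(start_year, start_month, start_date)
    return [first_day + timedelta(days=offset) for offset in range(day_count)]


days = crawl_window(start_date=14, start_month=12, start_year=2020, day_count=5)
print(days[0].isoformat(), "to", days[-1].isoformat())  # 2020-12-14 to 2020-12-18
```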
Case 2: Crawl videos from a list of IDs specified by the videoId column in an input CSV.
import clarku_youtube_crawler as cu
crawler = cu.RawCrawler()
work_dir = "blind"
crawler.build(work_dir)
# Point video_list_workfile at your own CSV file and name its video id column with video_id.
# Each video id must be ":" followed by the YouTube video id, e.g. ":wl4m1Rqmq-Y".
crawler.crawl_videos_in_list(
    video_list_workfile="videos_to_collect.csv",
    comment_page_count=1,
    search_key="blind",
    video_id="videoId"
)
crawler.merge_all(save_to='all_raw_data.json')
decoder = cu.JSONDecoder()
decoder.json_to_csv(data_file='all_raw_data.json')
# Crawl subtitles from CSV
# If you don't need subtitles, delete the following lines
subtitleCrawler = cu.SubtitleCrawler()
subtitleCrawler.build(work_dir)
subtitleCrawler.crawl_csv(
    videos_to_collect="videos_to_collect.csv",
    video_id="videoId",
    sub_title_dir="YouTube_CSV/subtitles/"
)
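Case 2 expects an input CSV whose videoId column holds ":" followed by each YouTube video id. A minimal sketch for producing such a file with the standard csv module follows; `write_video_list` is a hypothetical helper, and the single-column layout is an assumption (the crawler may accept or expect additional columns).

```python
import csv


def write_video_list(video_ids, csv_path="videos_to_collect.csv", column="videoId"):
    """Write YouTube video ids to a CSV, prefixing each with ':' as Case 2 requires."""
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow([column])
        for video_id in video_ids:
            # Avoid double-prefixing ids that already start with ':'
            prefixed = video_id if video_id.startswith(":") else ":" + video_id
            writer.writerow([prefixed])


if __name__ == "__main__":
    write_video_list(["wl4m1Rqmq-Y"])
```

The resulting file can then be passed as video_list_workfile with video_id="videoId", as in the example above.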
Case 3: Search a list of channels by search keys, then crawl all videos belonging to those channels.
import clarku_youtube_crawler as cu
chCrawler = cu.ChannelCrawler()
work_dir = "low visibility"
chCrawler.build(work_dir)
# You can search different channels. All results will be merged
chCrawler.search_channel("low visibility")
chCrawler.search_channel("blind")
chCrawler.merge_to_workfile()
chCrawler.crawl()
# Crawl videos posted by the selected channels. The channels_to_collect.csv file records which search keys found each channel.
crawler = cu.RawCrawler()
crawler.build(work_dir)
crawler.merge_to_workfile(file_dir=work_dir + "/video_search_list/")
crawler.crawl_videos_in_list(comment_page_count=1)
crawler.merge_all()
# Convert JSON to CSV
decoder = cu.JSONDecoder()
decoder.json_to_csv(data_file=work_dir + '/all_videos_visibility.json')
# Crawl subtitles from CSV
# If you don't need subtitles, delete the following lines
subtitleCrawler = cu.SubtitleCrawler()
subtitleCrawler.build(work_dir)
subtitleCrawler.crawl_csv(
    videos_to_collect=work_dir + "/videos_to_collect.csv",
    video_id="videoId",
    sub_title_dir=work_dir + "/subtitles/"
)
Case 4: Crawl all videos from an existing list of channels.
import clarku_youtube_crawler as cu
work_dir = 'disability'
chCrawler = cu.ChannelCrawler()
chCrawler.build(work_dir)
chCrawler.crawl(filename='mturk_test.csv', channel_header="Input.channelId")
# Crawl videos posted by selected channels
crawler = cu.RawCrawler()
crawler.build(work_dir)
crawler.merge_to_workfile(file_dir=work_dir + "/video_search_list/")
crawler.crawl_videos_in_list(comment_page_count=10)  # 100 comments per page, so 10 pages crawl up to 1000 comments
crawler.merge_all()
# Convert JSON to CSV
decoder = cu.JSONDecoder()
decoder.json_to_csv(data_file=work_dir + '/all_videos.json')
# Crawl subtitles from CSV
subtitleCrawler = cu.SubtitleCrawler()
subtitleCrawler.build(work_dir)
subtitleCrawler.crawl_csv(
    videos_to_collect=work_dir + "/videos_to_collect.csv",
    video_id="videoId",
    sub_title_dir=work_dir + "/subtitles/"
)
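The comment in Case 4 states that each comment page holds 100 comments. Based on that figure, a value for comment_page_count can be derived from a target number of comments. This is only a sketch: `pages_for` is a hypothetical helper, and videos with fewer comments than the target will simply return fewer.

```python
import math

COMMENTS_PER_PAGE = 100  # figure taken from the comment in Case 4


def pages_for(target_comments, per_page=COMMENTS_PER_PAGE):
    """Smallest comment_page_count that covers target_comments."""
    if target_comments <= 0:
        return 0
    return math.ceil(target_comments / per_page)


print(pages_for(1000))  # 10
print(pages_for(250))   # 3
```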
FAQs
Clark University, Package for YouTube crawler and cleaning data
We found that clarku-youtube-crawler demonstrated a healthy version release cadence and project activity because the last version was released less than a year ago. It has 2 open source maintainers collaborating on the project.