stimson-web-scraper scrapes and crawls websites for textual data and URLs in any supported ISO language.
In a terminal window:
ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"
xcode-select --install
brew update
brew upgrade
git --version
git version 2.24.1 (Apple Git-126)
brew install python3
python3 --version
Python 3.7.7
pip3 install -U pytest
py.test --version
This is pytest version 5.4.1, imported from /usr/local/lib/python3.7/site-packages/pytest/__init__.py
open https://desktop.github.com
open https://www.jetbrains.com/pycharm/download
open https://help.github.com/articles/generating-a-new-ssh-key-and-adding-it-to-the-ssh-agent/
Check to make sure your GitHub key has been added to the ssh-agent list. Here's an example ~/.ssh/config file:
Host github.com github
  IdentityFile ~/.ssh/id_rsa
  IdentitiesOnly yes
  UseKeychain yes
  AddKeysToAgent yes
cd ~/.ssh
ssh-keygen -o
ssh-add -K ~/.ssh/id_rsa
ssh-add -L
cd ~
git clone https://github.com/Stimson-Center/stimson-web-scraper.git
cd ~/stimson-web-scraper
./run_tests.sh
cd ~/stimson-web-scraper/scraper
./start.sh
./cli.py -u https://www.yahoo.com -l en
>>> import datetime
>>> from scraper import Article
>>> url = 'http://fox13now.com/2013/12/30/new-year-new-laws-obamacare-pot-guns-and-drones/'
>>> article = Article(url)
>>> article.build()
>>> # Access data scraped from this web page
>>> article.authors
['Leigh Ann Caldwell', 'John Honway']
>>> article.publish_date
datetime.datetime(2013, 12, 30, 0, 0)
>>> article.text
"Washington (CNN) -- Not everyone subscribes to a New Year's resolution..."
>>> article.top_image
'http://someCDN.com/blah/blah/blah/file.png'
>>> article.movies
['http://youtube.com/path/to/link.com', ...]
>>> article.keywords
['New Years', 'resolution', ...]
>>> article.summary
'The study shows that 93% of people ...'
>>> article.html
'<!DOCTYPE HTML><html itemscope itemtype="http://...'
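When scraping many pages, network and parse failures are routine, so it helps to wrap the Article API shown above defensively. Below is a minimal sketch; the helper name fetch_article is hypothetical, and it assumes build() raises an exception on failure, which is an assumption rather than documented behavior of this package.
from scraper import Article

def fetch_article(url, language=None):
    # Hypothetical helper: build an Article, returning None on failure.
    # Assumes Article.build() raises on download/parse errors (an
    # assumption, not documented behavior of this package).
    try:
        article = Article(url, language=language) if language else Article(url)
        article.build()
        return article
    except Exception as exc:
        print(f"failed to scrape {url}: {exc}")
        return None

article = fetch_article('http://fox13now.com/2013/12/30/new-year-new-laws-obamacare-pot-guns-and-drones/')
if article is not None:
    print(article.title)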
scraper can extract and detect languages seamlessly. If no language is specified, scraper will attempt to auto-detect one. If you are certain of an article's language, you can specify it by its two-letter ISO code.
To see the list of supported ISO languages:
import scraper
scraper.get_languages()
Your available languages are:
input code    full name
af Afrikaans
ar Arabic
be Belarusian
bg Bulgarian
bn Bengali
br Portuguese, Brazil
ca Catalan
cs Czech
da Danish
de German
el Greek
en English
eo Esperanto
es Spanish
et Estonian
eu Basque
fa Persian
fi Finnish
fr French
ga Irish
gl Galician
gu Gujarati
ha Hausa
he Hebrew
hi Hindi
hr Croatian
hu Hungarian
hy Armenian
id Indonesian
it Italian
ja Japanese
ka Georgian
ko Korean
ku Kurdish
la Latin
lt Lithuanian
lv Latvian
mk Macedonian
mr Marathi
ms Malay
nb Norwegian (Bokmål)
nl Dutch
no Norwegian
np Nepali
pl Polish
pt Portuguese
ro Romanian
ru Russian
sk Slovak
sl Slovenian
so Somali
sr Serbian
st Sotho, Southern
sv Swedish
sw Swahili
ta Tamil
th Thai
tl Tagalog
tr Turkish
uk Ukrainian
ur Urdu
vi Vietnamese
yo Yoruba
zh Chinese
zu Zulu
get_languages() returns this mapping as a dictionary of ISO codes to full language names:
import scraper
scraper.get_languages()
{'ar': 'Arabic', 'af': 'Afrikaans', 'be': 'Belarusian', 'bg': 'Bulgarian', 'bn': 'Bengali', 'br': 'Portuguese, Brazil', 'ca': 'Catalan', 'cs': 'Czech', 'da': 'Danish', 'de': 'German', 'el': 'Greek', 'en': 'English', 'eo': 'Esperanto', 'es': 'Spanish', 'et': 'Estonian', 'eu': 'Basque', 'fa': 'Persian', 'fi': 'Finnish', 'fr': 'French', 'ga': 'Irish', 'gl': 'Galician', 'gu': 'Gujarati', 'ha': 'Hausa', 'he': 'Hebrew', 'hi': 'Hindi', 'hr': 'Croatian', 'hu': 'Hungarian', 'hy': 'Armenian', 'id': 'Indonesian', 'it': 'Italian', 'ja': 'Japanese', 'ka': 'Georgian', 'ko': 'Korean', 'ku': 'Kurdish', 'la': 'Latin', 'lt': 'Lithuanian', 'lv': 'Latvian', 'mk': 'Macedonian', 'mr': 'Marathi', 'ms': 'Malay', 'nb': 'Norwegian (Bokmål)', 'nl': 'Dutch', 'no': 'Norwegian', 'np': 'Nepali', 'pl': 'Polish', 'pt': 'Portuguese', 'ro': 'Romanian', 'ru': 'Russian', 'sk': 'Slovak', 'sl': 'Slovenian', 'so': 'Somali', 'sr': 'Serbian', 'st': 'Sotho, Southern', 'sv': 'Swedish', 'sw': 'Swahili', 'ta': 'Tamil', 'th': 'Thai', 'tl': 'Tagalog', 'tr': 'Turkish', 'uk': 'Ukrainian', 'ur': 'Urdu', 'vi': 'Vietnamese', 'yo': 'Yoruba', 'zh': 'Chinese', 'zu': 'Zulu'}
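Since get_languages() gives you a code-to-name mapping, one natural use is validating a language code before constructing an Article. A minimal sketch, assuming the dict return shown above; the helper name article_for is hypothetical:
import scraper
from scraper import Article

def article_for(url, code):
    # Hypothetical helper: validate the ISO code against the mapping
    # returned by scraper.get_languages() (the dict shown above).
    languages = scraper.get_languages()
    if code not in languages:
        raise ValueError(f"unsupported language code: {code}")
    return Article(url, language=code)

article = article_for('http://www.bbc.co.uk/zhongwen/simp/chinese_news/2012/12/121210_hongkong_politics.shtml', 'zh')
article.build()
print(article.title)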
To download an article in a supported ISO language:
>>> from scraper import Article
>>> url = 'http://www.bbc.co.uk/zhongwen/simp/chinese_news/2012/12/121210_hongkong_politics.shtml'
>>> article = Article(url, language='zh')  # Chinese
>>> article.build()
>>> print(article.text[:150])
香港行政长官梁振英在各方压力下就其大宅的违章建
筑(僭建)问题到立法会接受质询,并向香港民众道歉。
梁振英在星期二(12月10日)的答问大会开始之际
在其演说中道歉,但强调他在违章建筑问题上没有隐瞒的
意图和动机。 一些亲北京阵营议员欢迎梁振英道歉,
且认为应能获得香港民众接受,但这些议员也质问梁振英有
>>> print(article.title)
港特首梁振英就住宅违建事件道歉
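If you omit the language argument, scraper will try to auto-detect it, as noted above; passing language='zh' simply skips detection. A small sketch comparing the two, assuming auto-detection succeeds on this page:
from scraper import Article

url = 'http://www.bbc.co.uk/zhongwen/simp/chinese_news/2012/12/121210_hongkong_politics.shtml'

# Explicit language: detection is skipped entirely.
explicit = Article(url, language='zh')
explicit.build()

# No language argument: scraper attempts auto-detection (see note above).
detected = Article(url)
detected.build()

print(explicit.title == detected.title)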
scraper can also extract text from Adobe PDF files, for example a Thai-language document:
from scraper import Article
url = "http://tpch-th.listedcompany.com/misc/ShareholderMTG/egm201701/20170914-tpch-egm201701-enc02-th.pdf"
article = Article(url=url, language='th')
article.build()
print(article.text)
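PDF extraction yields plain text like any other article, so you can persist it the usual way. A minimal sketch; the output filename is arbitrary:
from pathlib import Path
from scraper import Article

url = "http://tpch-th.listedcompany.com/misc/ShareholderMTG/egm201701/20170914-tpch-egm201701-enc02-th.pdf"
article = Article(url=url, language='th')
article.build()

# Write the extracted text to disk; UTF-8 covers the Thai characters.
Path("egm201701.txt").write_text(article.text, encoding="utf-8")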
scraper can also extract HTML tables from a web page:
from scraper import Article
url = "https://en.wikipedia.org/wiki/International_Phonetic_Alphabet_chart_for_English_dialects"
article = Article(url=url, language='en')
article.build()
print(article.text)
print(article.tables)
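The exact structure of each entry in article.tables isn't documented here, so the sketch below only inspects the collection generically; treat the per-table handling as an assumption:
from scraper import Article

url = "https://en.wikipedia.org/wiki/International_Phonetic_Alphabet_chart_for_English_dialects"
article = Article(url=url, language='en')
article.build()

# article.tables is printable (see above); its element type is an
# assumption, so just report how many tables were found and preview each.
print(f"found {len(article.tables)} tables")
for i, table in enumerate(article.tables):
    print(i, repr(table)[:120])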
To run everything inside Docker instead:
brew install docker
docker --version
cd ~/stimson-web-scraper
./run_docker.sh
You will be dropped into a shell inside the container:
(venv) tf-docker /app >
./run_tests.sh
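run_tests.sh drives pytest (installed earlier), so new checks are ordinary pytest functions. A hypothetical test sketch, assuming the Article API behaves as in the examples above; the file name and URL choice are illustrative only:
# test_article_smoke.py -- hypothetical, network-dependent smoke test.
# Run with: pytest test_article_smoke.py
from scraper import Article

def test_build_populates_text():
    # Assumes build() fills in .title and .text, as in the examples above.
    article = Article('http://fox13now.com/2013/12/30/new-year-new-laws-obamacare-pot-guns-and-drones/',
                      language='en')
    article.build()
    assert article.title
    assert len(article.text) > 0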
To contribute, create a feature branch, commit your changes, and push:
git checkout -b your_github_name-feature
git commit -am 'Added some feature'
git push origin your_github_name-feature
Then open a Pull Request on GitHub.