
SecretScraper is a highly configurable web scraping tool that crawls links from target websites and extracts sensitive data via regular expressions.
pip install secretscraper
pip install --upgrade secretscraper
Note that SecretScraper generates a default configuration in the working directory if settings.yml is absent, so remember to update your settings.yml to the latest version (just copy it from Customize Configuration).
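For example, the first run in a fresh directory creates the default config there (~/scans is a hypothetical directory):
cd ~/scans # any empty working directory
secretscraper -u https://scrapeme.live/shop/ # writes a default settings.yml here if one is absent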
Start with a single target:
secretscraper -u https://scrapeme.live/shop/
Start with multiple targets:
secretscraper -f urls
# urls
http://scrapeme.live/1
http://scrapeme.live/2
http://scrapeme.live/3
http://scrapeme.live/4
All supported options:
> secretscraper --help
Usage: secretscraper [OPTIONS]
Main commands
Options:
-V, --version Show version and exit.
--debug Enable debug.
-a, --ua TEXT Set User-Agent
-c, --cookie TEXT Set cookie
-d, --allow-domains TEXT Domain white list, wildcard(*) is supported,
separated by commas, e.g. *.example.com,
example*
-D, --disallow-domains TEXT Domain black list, wildcard(*) is supported,
separated by commas, e.g. *.example.com,
example*
-f, --url-file FILE Target urls file, separated by line break
-i, --config FILE Set config file, defaults to settings.yml
-m, --mode [1|2] Set crawl mode, 1(normal) for max_depth=1,
2(thorough) for max_depth=2, default 1
--max-page INTEGER Max page number to crawl, default 100000
--max-depth INTEGER Max depth to crawl, default 1
-o, --outfile FILE Output result to specified file in csv format
-s, --status TEXT Filter response status to display, separated by
commas, e.g. 200,300-400
-x, --proxy TEXT Set proxy, e.g. http://127.0.0.1:8080,
socks5://127.0.0.1:7890
-H, --hide-regex Hide regex search result
-F, --follow-redirects Follow redirects
-u, --url TEXT Target url
--detail Show detailed result
--validate Validate the status of found urls
-l, --local PATH Local file or directory, scan local
file/directory recursively
--help Show this message and exit.
Use the --validate option to check the status of found links; this helps reduce invalid links in the result.
secretscraper -u https://scrapeme.live/shop/ --validate --max-page=10
The max depth defaults to 1, which means only the start urls will be crawled. To change that, specify --max-depth <number>, or, more simply, use -m 2 to run the crawler in thorough mode, which is equivalent to --max-depth 2 (the default normal mode -m 1 corresponds to max depth 1).
secretscraper -u https://scrapeme.live/shop/ -m 2
Output results to a csv file:
secretscraper -u https://scrapeme.live/shop/ -o result.csv
Domain white list (wildcard * is supported):
secretscraper -u https://scrapeme.live/shop/ -d *scrapeme*
Domain black list:
secretscraper -u https://scrapeme.live/shop/ -D *.gov
Use the -H option to hide regex-matching results; only found links will be displayed.
secretscraper -u https://scrapeme.live/shop/ -H
Scan a local file or directory recursively:
secretscraper -l <dir or file>
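For example, to scan a downloaded JavaScript bundle (./dist is a hypothetical path):
secretscraper -l ./dist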
The regex matching functionality is implemented with both the hyperscan and re modules. The re module is used by default; if you pursue higher performance, you can switch to hyperscan by setting handler_type to hyperscan in settings.yml.
Note that hyperscan has some pitfalls you should take into account: its pattern support differs from the re module, so you had better write your regexes separately for the two engines.
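Switching the engine is a one-line change in settings.yml (handler_type accepts re or hyperscan, per the built-in config below):
handler_type: hyperscan # default is re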
You can supply a custom configuration via -i settings.yml; the built-in config is shown below.
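For instance (my-settings.yml is a hypothetical file name):
secretscraper -u https://scrapeme.live/shop/ -i my-settings.yml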
verbose: false
debug: false
loglevel: critical
logpath: log
handler_type: re

proxy: "" # http://127.0.0.1:7890
max_depth: 1 # 0 for no limit
max_page_num: 1000 # 0 for no limit
timeout: 5
follow_redirects: true
workers_num: 1000
headers:
  Accept: "*/*"
  Cookie: ""
  User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.87 Safari/537.36 SE 2.X MetaSr 1.0

urlFind:
  - "[\"'‘“`]\\s{0,6}(https{0,1}:[-a-zA-Z0-9()@:%_\\+.~#?&//={}]{2,250}?)\\s{0,6}[\"'‘“`]"
  - "=\\s{0,6}(https{0,1}:[-a-zA-Z0-9()@:%_\\+.~#?&//={}]{2,250})"
  - "[\"'‘“`]\\s{0,6}([#,.]{0,2}/[-a-zA-Z0-9()@:%_\\+.~#?&//={}]{2,250}?)\\s{0,6}[\"'‘“`]"
  - "\"([-a-zA-Z0-9()@:%_\\+.~#?&//={}]+?[/]{1}[-a-zA-Z0-9()@:%_\\+.~#?&//={}]+?)\""
  - "href\\s{0,6}=\\s{0,6}[\"'‘“`]{0,1}\\s{0,6}([-a-zA-Z0-9()@:%_\\+.~#?&//={}]{2,250})|action\\s{0,6}=\\s{0,6}[\"'‘“`]{0,1}\\s{0,6}([-a-zA-Z0-9()@:%_\\+.~#?&//={}]{2,250})"
jsFind:
  - (https{0,1}:[-a-zA-Z0-9()@:%_\+.~#?&//=]{2,100}?[-a-zA-Z0-9()@:%_\+.~#?&//=]{3}[.]js)
  - '["''‘“`]\s{0,6}(/{0,1}[-a-zA-Z0-9()@:%_\+.~#?&//=]{2,100}?[-a-zA-Z0-9()@:%_\+.~#?&//=]{3}[.]js)'
  - =\s{0,6}[",',’,”]{0,1}\s{0,6}(/{0,1}[-a-zA-Z0-9()@:%_\+.~#?&//=]{2,100}?[-a-zA-Z0-9()@:%_\+.~#?&//=]{3}[.]js)
dangerousPath:
  - logout
  - update
  - remove
  - insert
  - delete
rules:
  - name: Swagger
    regex: \b[\w/]+?((swagger-ui.html)|(\"swagger\":)|(Swagger UI)|(swaggerUi)|(swaggerVersion))\b
    loaded: true
  - name: ID Card
    regex: \b((\d{8}(0\d|10|11|12)([0-2]\d|30|31)\d{3})|(\d{6}(18|19|20)\d{2}(0[1-9]|10|11|12)([0-2]\d|30|31)\d{3}(\d|X|x)))\b
    loaded: true
  - name: Phone
    regex: "['\"](1(3([0-35-9]\\d|4[1-8])|4[14-9]\\d|5([\\d]\\d|7[1-79])|66\\d|7[2-35-8]\\d|8\\d{2}|9[89]\\d)\\d{7})['\"]"
    loaded: true
  - name: JS Map
    regex: \b([\w/]+?\.js\.map)
    loaded: true
  - name: URL as a Value
    regex: (\b\w+?=(https?)(://|%3a%2f%2f))
    loaded: false
  - name: Email
    regex: "['\"]([\\w]+(?:\\.[\\w]+)*@(?:[\\w](?:[\\w-]*[\\w])?\\.)+[\\w](?:[\\w-]*[\\w])?)['\"]"
    loaded: true
  - name: Internal IP
    regex: '[^0-9]((127\.0\.0\.1)|(10\.\d{1,3}\.\d{1,3}\.\d{1,3})|(172\.((1[6-9])|(2\d)|(3[01]))\.\d{1,3}\.\d{1,3})|(192\.168\.\d{1,3}\.\d{1,3}))'
    loaded: true
  - name: Cloud Key
    regex: \b((accesskeyid)|(accesskeysecret)|\b(LTAI[a-z0-9]{12,20}))\b
    loaded: true
  - name: Shiro
    regex: (=deleteMe|rememberMe=)
    loaded: true
  - name: Suspicious API Key
    regex: "[\"'][0-9a-zA-Z]{32}['\"]"
    loaded: true
  - name: Jwt
    regex: "['\"](ey[A-Za-z0-9_-]{10,}\\.[A-Za-z0-9._-]{10,}|ey[A-Za-z0-9_\\/+-]{10,}\\.[A-Za-z0-9._\\/+-]{10,})['\"]"
    loaded: true
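To add a rule of your own, append an entry to the rules list. A minimal sketch (the name and pattern below are illustrative and not part of the built-in set; AKIA followed by 16 uppercase alphanumerics is the usual AWS access key ID shape):
rules:
  - name: AWS Access Key
    regex: \b(AKIA[0-9A-Z]{16})\b
    loaded: true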
Recent changes:
- Added the --detail option to show detailed results
- The re module is now the regex engine by default; the engine can be selected via handler_type
- Added the --validate option: validate urls after the crawler finishes, which helps reduce useless links
- Added the --debug option