simple-website-scraper
A Contentstack-specific framework for scraping data from a website and importing it into Contentstack.
This tool works only with the Contentstack headless CMS. Provide the URLs and specify what to extract, and you are good to go.
Install:
npm install simple-website-scraper
You will need to create three files, config.json, urls.json and a schema file, as follows:
config.json
{
"api_key" : "stack api key",
"email" : "xyz@raweng.com",
"password" : "xyz",
"parentUid" : "assets folder uid",
"contentUid": "contenttype uid",
"baseUrl" : "https://xyz.com",
"schemaFile": "authors.json",
"ssr" : false,
"locale" : "en-us",
"import" : false
}
You can also use authtoken instead of email and password here.
ssr: set to true to turn on server-side rendering.
import: set to false to dump the scraped entries on the local system without uploading them to Contentstack.
schemaFile: tells the framework what to scrape from the provided URLs, using jQuery expressions.
authors.json (schemaFile): maps each entry field to the page element that needs to be scraped.
{
"title": "$('title')",
"url": "getRelativeUrl()",
"name": "$('.author_name').text()",
"profile_description": "rteHandler($('.author_description'))",
"seo": "seoHandler()"
}
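To illustrate how a schema like this could be applied, here is a minimal sketch that evaluates each expression string with the page helpers in scope. It is an assumption about the mechanism, not the package's actual implementation: `applySchema` is a hypothetical name, and the `$` in the demonstration is a tiny stand-in for a real jQuery-style DOM.

```javascript
// Hypothetical sketch, not the package's implementation.
// Evaluates every schema expression with the given helpers in scope.
function applySchema(schema, helpers) {
  const names = Object.keys(helpers);
  const values = names.map(name => helpers[name]);
  const entry = {};
  for (const [field, expression] of Object.entries(schema)) {
    // Build a function whose parameters are the helper names,
    // e.g. ($, getRelativeUrl), and whose body returns the expression.
    const evaluate = new Function(...names, `return (${expression});`);
    entry[field] = evaluate(...values);
  }
  return entry;
}

// Stand-in helpers for demonstration only; a real run would pass a jQuery-style $.
const fakePage = { '.author_name': 'Lucy' };
const demoHelpers = {
  $: selector => ({ text: () => fakePage[selector] || '' }),
  getRelativeUrl: () => '/blog/authors/lucy',
};

const demoEntry = applySchema(
  { url: 'getRelativeUrl()', name: "$('.author_name').text()" },
  demoHelpers
);
console.log(demoEntry); // { url: '/blog/authors/lucy', name: 'Lucy' }
```

Because the expressions are executed as code in this sketch, a schema file handled this way should come only from a trusted source.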
urls.json
{
"urls": ["https://example.com/blog/authors/lucy", "https://example.com/blog/authors/shern", "https://example.com/blog/authors/kety"]
}
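Before starting a run, the files described above can be sanity-checked. The following is a minimal sketch, not part of the package: `validateConfig` and `findInvalidUrls` are hypothetical helpers, the key name `authtoken` is assumed from the note that an authtoken can replace email and password, and treating `api_key`, `contentUid`, `baseUrl` and `schemaFile` as required is also an assumption.

```javascript
// Hypothetical helpers, not part of simple-website-scraper.

// Returns a list of problems found in a parsed config.json object.
function validateConfig(config) {
  const problems = [];

  // Keys from the sample config.json that are assumed to be required.
  for (const key of ['api_key', 'contentUid', 'baseUrl', 'schemaFile']) {
    if (!config[key]) problems.push(`missing "${key}"`);
  }

  // Either an authtoken (assumed key name) or email plus password.
  const hasToken = Boolean(config.authtoken);
  const hasLogin = Boolean(config.email && config.password);
  if (!hasToken && !hasLogin) {
    problems.push('provide "authtoken" or both "email" and "password"');
  }

  // ssr and import are booleans in the sample config.
  for (const flag of ['ssr', 'import']) {
    if (flag in config && typeof config[flag] !== 'boolean') {
      problems.push(`"${flag}" must be true or false`);
    }
  }
  return problems;
}

// Returns the entries of a parsed urls.json that are not absolute http(s) URLs.
function findInvalidUrls(urlsFile) {
  const urls = Array.isArray(urlsFile.urls) ? urlsFile.urls : [];
  return urls.filter(url => {
    try {
      const { protocol } = new URL(url);
      return protocol !== 'http:' && protocol !== 'https:';
    } catch (err) {
      return true; // not parseable as an absolute URL
    }
  });
}
```

Both helpers take already-parsed objects, so they can be called with the result of `JSON.parse` on each file's contents.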
You have access to the following internal variables:
1. relativePageUrl // /blog/authors/shern
2. currentUrl // https://example.com/blog/authors/shern
3. $ - DOM of the current page
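The relationship between the two URL variables can be reproduced with Node's built-in `URL` class. This is a sketch of the presumable derivation, not the package's own code, and `toRelativePageUrl` is a hypothetical name.

```javascript
// Hypothetical helper: returns the path portion of an absolute URL.
function toRelativePageUrl(currentUrl) {
  return new URL(currentUrl).pathname;
}

console.log(toRelativePageUrl('https://example.com/blog/authors/shern'));
// prints "/blog/authors/shern"
```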
You have access to the following internal functions:
seoHandler: returns the meta title, description and keywords of the current page in the following format:
{
"title": "current page meta title",
"description": "current page meta description",
"keywords": "current page meta keywords"
}
getRelativeUrl: returns relativePageUrl
getUrl: returns the full URL of the current page
imageHandler: takes the src of an image and returns the uid of the image uploaded to Contentstack
rteHandler: takes a DOM, uploads all of its assets/images to Contentstack, updates the srcs and links to point to the uploaded assets/images, and returns the updated DOM
Start scraping
const scrap = require('simple-website-scraper').scrap

// scrap() returns a promise; failures are logged to the console.
scrap()
  .then(response => response)
  .catch(err => console.log(err))