url-metadata
Request a URL and scrape the metadata from its HTML using Node.js or the browser. An optional mode also lets you pass in a string of HTML or a Response object instead (see the Options & Defaults section below).
Includes:
More details in the Returns section below.
v5.1.0+ Protects against:
To report a bug or request a feature, please open an issue or pull request on GitHub. Please read the Troubleshooting section below before filing a bug.
Install
Works with Node.js versions >=6.0.0 or in the browser when bundled with Webpack (see /example-typescript) or Vite (see /example-vite). For Next.js, see /example-nextjs. If node-fetch or window.fetch is not available in your target environment, use the earlier version 2.5.0, which relies on the now-deprecated request module.
npm install url-metadata --save
Usage
In your project file:
const urlMetadata = require('url-metadata');
(async function () {
  try {
    const url = 'https://www.npmjs.com/package/url-metadata';
    const metadata = await urlMetadata(url);
    console.log(metadata);
  } catch (err) {
    console.log(err);
  }
})();
Options & Defaults
To override the default options, pass in a second options argument. The default options are the values below.
const options = {
  requestHeaders: {
    'User-Agent': 'url-metadata (+https://www.npmjs.com/package/url-metadata)',
    From: 'example@example.com'
  },
  requestFilteringAgentOptions: undefined,
  agent: undefined,
  cache: 'no-cache',
  mode: 'cors',
  maxRedirects: 10,
  timeout: 10000,
  size: 0,
  compress: true,
  decode: 'auto',
  descriptionLength: 750,
  ensureSecureImageRequest: true,
  includeResponseBody: false,
  parseResponseObject: undefined
};
// Basic usage with options (run inside an async function, as in Usage above)
try {
  const url = 'https://www.npmjs.com/package/url-metadata';
  const metadata = await urlMetadata(url, options);
  console.log(metadata);
} catch (err) {
  console.log(err);
}
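If only a few settings need to change, a partial options object is usually enough. The sketch below assumes the library shallow-merges your object over its defaults (an assumption, not something this README confirms). `defaults` is a trimmed copy of the values above, and `mergeOptions` is a hypothetical helper, not part of url-metadata.

```javascript
// Trimmed copy of the documented defaults, for illustration only.
const defaults = {
  requestHeaders: {
    'User-Agent': 'url-metadata (+https://www.npmjs.com/package/url-metadata)',
    From: 'example@example.com'
  },
  maxRedirects: 10,
  timeout: 10000,
  descriptionLength: 750
};

// Hypothetical helper: shallow merge where the caller's keys win.
// ASSUMPTION: the library's real merging logic may differ.
function mergeOptions(overrides) {
  return Object.assign({}, defaults, overrides);
}

const merged = mergeOptions({
  timeout: 5000,
  requestHeaders: { 'User-Agent': 'my-crawler/1.0' }
});

console.log(merged.timeout);
console.log(merged.maxRedirects);
console.log(merged.requestHeaders);
```

Under a shallow merge, a nested object such as requestHeaders is replaced wholesale, so the From header is dropped in this sketch. If that matters, restate every header you still want to send.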
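// Alternate use: parse a Response object from your own fetch call.
// Pass null as the url and supply the response via parseResponseObject.
try {
  const response = await fetch('https://www.npmjs.com/package/url-metadata');
  const metadata = await urlMetadata(null, {
    parseResponseObject: response
  });
  console.log(metadata);
} catch (err) {
  console.log(err);
}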
// Alternate use: parse a string of HTML by wrapping it in a Response object
const html = `
<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="utf-8">
    <title>Metadata page</title>
    <meta name="author" content="foobar">
    <meta name="keywords" content="HTML, CSS, JavaScript">
  </head>
  <body>
    <h1>Metadata page</h1>
  </body>
</html>
`;
const response = new Response(html, {
  headers: {
    'Content-Type': 'text/html'
  }
});
const metadata = await urlMetadata(null, {
  parseResponseObject: response
});
console.log(metadata);
Returns
Returns a promise resolved with a JSON object. Note that the url field returned will be the last hop in the request chain. If you pass in a url from a url shortener you'll get back the final destination as the url.
A basic template for the returned metadata object can be found in lib/metadata-fields.js. Any additional meta tags found on the page are appended as new fields to the object.
The returned metadata object consists of key/value pairs as strings, with a few exceptions:
- favicons is an array of objects containing key/value pairs of strings
- jsonld is an array of objects
- responseHeaders is an object containing key/value pairs of strings
- all meta tags that begin with citation_ (ex: citation_author) return with keys as strings and values that are arrays of strings, to conform to the Google Scholar spec, which allows multiple citation meta tags with different content values. So if the HTML contains:
<meta name="citation_author" content="Arlitsch, Kenning">
<meta name="citation_author" content="OBrien, Patrick">
... it will return as:
'citation_author': ["Arlitsch, Kenning", "OBrien, Patrick"],
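The grouping rule for citation_ tags can be sketched in a few lines. This is a simplified illustration, not the library's parser: `collectMeta` is a hypothetical helper that uses a naive regular expression and only recognizes meta tags written with name before content.

```javascript
// Hypothetical helper illustrating the citation_ grouping rule.
// ASSUMPTION: only handles <meta name="..." content="..."> in that attribute order.
function collectMeta(html) {
  const result = {};
  const metaTag = /<meta\s+name="([^"]+)"\s+content="([^"]*)"\s*\/?>/gi;
  let match;
  while ((match = metaTag.exec(html)) !== null) {
    const name = match[1];
    const content = match[2];
    if (name.startsWith('citation_')) {
      // Repeated citation_ tags accumulate into an array of strings.
      if (!result[name]) result[name] = [];
      result[name].push(content);
    } else {
      // Other tags stay plain strings (in this sketch the last one wins).
      result[name] = content;
    }
  }
  return result;
}

const sample = `
  <meta name="author" content="foobar">
  <meta name="citation_author" content="Arlitsch, Kenning">
  <meta name="citation_author" content="OBrien, Patrick">
`;

console.log(collectMeta(sample));
```

When consuming real results, it is safest to treat every citation_ field as an array, even when the page contains only one such tag.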
Troubleshooting
Issue: DNS Lookup errors. By default, this package's SSRF filtering agent blocks requests to private IP addresses, link-local addresses and reserved IP addresses. To change or disable this behavior, pass custom requestFilteringAgentOptions. More info here.
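As a rough sketch of what such an override might look like: the option name `allowPrivateIPAddress` comes from the request-filtering-agent package, which is assumed here to be the agent these options are forwarded to. Verify the name against that package's documentation before relying on it. `ssrfOptions` is just an illustrative variable name.

```javascript
// ASSUMPTION: option names follow the request-filtering-agent package.
const ssrfOptions = {
  requestFilteringAgentOptions: {
    // Turns off private-address blocking; only do this for URLs you trust.
    allowPrivateIPAddress: true
  }
};

// Then, inside an async function:
// const metadata = await urlMetadata(url, ssrfOptions);
console.log(ssrfOptions);
```

Relaxing the filter removes a safeguard against server-side request forgery, so it is worth keeping the default wherever user-supplied URLs are involved. That package also documents allow and deny lists, which may suit narrower exceptions.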
Issue: No fetch implementation found. You're in either an older browser that doesn't have the native fetch API or a Node.js environment that doesn't support node-fetch (Node.js < v6). File a GitHub issue or try downgrading to url-metadata version 2.5.0, which uses the now-deprecated request module.
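A quick pre-flight check can tell you whether the current environment exposes a global fetch at all. This is a minimal sketch: `hasFetch` is a hypothetical helper, and the way the library itself detects fetch may differ.

```javascript
// Hypothetical pre-flight check; not part of url-metadata.
// Returns true when the given scope exposes a callable fetch.
function hasFetch(scope) {
  return typeof scope.fetch === 'function';
}

if (!hasFetch(globalThis)) {
  console.warn('No global fetch found in this environment.');
}
```

Taking the scope as a parameter keeps the check easy to exercise with a stub object in tests.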
Issue: Response status code 0 or CORS errors. The fetch request failed at either the network or protocol level. Possible causes:
- CORS errors. Try changing the mode option (ex: cors, same-origin, etc) or setting the Access-Control-Allow-Origin header on the server response from the url you are requesting if you have access to it.
- Trying to access an https resource that has an invalid certificate, or trying to access an http resource from a page with an https origin.
- A browser plugin such as an ad-blocker or privacy protector.
Issue: Request returns 404, 403 errors or a CAPTCHA form. Your request may have been blocked by the server because it suspects you are a bot or scraper. Check this list to ensure you're not triggering a block.