url-inspector
Get metadata about any URL.
Limited memory and network usage.
This is a node.js module.
It returns and normalizes information found in HTTP headers or in the resource
itself, using exiftool (which knows almost everything about files except HTML)
or a SAX parser to read oEmbed, Open Graph, Twitter Cards, schema.org attributes,
or standard HTML tags.
Both tools stop inspecting once they have gathered enough tags, or once a maximum
number of bytes (depending on media type) has been downloaded.
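The early-stop behaviour described above can be sketched as follows. This is a minimal illustration, not the module's actual code: `readUntil`, `maxBytes`, `isDone` and the `reason` strings are hypothetical names chosen for the example.

```javascript
'use strict';

// Hypothetical helper, not part of url-inspector.
// Consumes chunks from any iterable or async iterable (for example a
// Node.js Readable stream) and stops as soon as one of two conditions holds:
//   - isDone(data) reports that enough metadata has been seen
//   - maxBytes bytes have been received
async function readUntil(source, { maxBytes, isDone }) {
  let received = 0;
  const chunks = [];
  for await (const chunk of source) {
    const buf = Buffer.from(chunk);
    const room = maxBytes - received;
    // Keep only the part of the chunk that fits under the byte cap.
    const kept = buf.length > room ? buf.subarray(0, room) : buf;
    chunks.push(kept);
    received += kept.length;
    // Re-concatenating on every chunk is simple but not optimal;
    // it is acceptable here because the cap keeps the data small.
    const data = Buffer.concat(chunks);
    // Returning from inside the loop ends the iteration early.
    if (isDone(data)) return { data, reason: 'enough-metadata' };
    if (received >= maxBytes) return { data, reason: 'max-bytes' };
  }
  return { data: Buffer.concat(chunks), reason: 'end-of-stream' };
}
```

With a real HTTP response as `source`, returning early from the loop means the rest of the body is never read, which is what keeps network usage low.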
A demo using this module is available with url-inspector-daemon.
-
url
url of the inspected resource
-
title
title of the resource, or filename, or last component of pathname with query
-
description
optional longer description, without title in it
-
site
the name of the site, or the domain name
-
mime
RFC 7231 mime type of the resource (defaults to Content-Type)
The inspected mime type could be more accurate than the http header.
-
ext
the extension matching the mime type (not the file extension)
-
type
what the resource represents, one of:
image, video, audio, link, file, embed, archive
-
html
a canonical html representation of the full resource;
depending on the type and mime, it could be an img, a, video, audio, or iframe tag.
-
size
optional Content-Length; discarded when type is embed
-
icon
optional link to the favicon of the site
-
width, height
optional dimensions
-
duration
optional
-
thumbnail
optional; a URL to a thumbnail, which could be a data URI for embedded images
-
source
optional; a URL that can go in a 'src' attribute. For example, a resource can
be an html page representing an image type; the URL of the image itself would
be stored here. The same applies to audio, video, and embed types.
-
error
optional; an HTTP error code, or a string
-
all
an object with all additional metadata that was found
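To make the fields above more concrete, here is a small sketch of a result object and of how a consumer might turn it into a tag. The values in `example` are made up, and `toTag` and `escapeAttr` are hypothetical helpers. This is not how the module builds its own `html` field; it only illustrates how `type`, `source` and `url` relate to each other.

```javascript
'use strict';

// Illustrative result for an html page that represents an image.
// All values are invented; only the field names come from the list above.
const example = {
  url: 'https://example.com/photos/42',
  title: 'Sunset',
  site: 'example.com',
  type: 'image',
  source: 'https://example.com/photos/42.jpg',
  width: 800,
  height: 600
};

// Minimal escaping so values are safe inside a double-quoted attribute.
function escapeAttr(value) {
  return String(value)
    .replace(/&/g, '&amp;')
    .replace(/"/g, '&quot;')
    .replace(/</g, '&lt;');
}

// Hypothetical helper: choose a tag from `type`, preferring `source`
// (the URL meant for a 'src' attribute) and falling back to `url`.
function toTag(obj) {
  const src = escapeAttr(obj.source || obj.url);
  const title = escapeAttr(obj.title || '');
  switch (obj.type) {
    case 'image': return `<img src="${src}" alt="${title}">`;
    case 'video': return `<video src="${src}" controls></video>`;
    case 'audio': return `<audio src="${src}" controls></audio>`;
    case 'embed': return `<iframe src="${src}"></iframe>`;
    default: return `<a href="${escapeAttr(obj.url)}">${title}</a>`;
  }
}
```

In practice the `html` field returned by the module already covers this case; a helper like `toTag` would only be useful when custom markup is needed.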
Installation
npm install url-inspector
Add the -g switch to install the executable.
The exiftool executable must be available.
API
var inspector = require('url-inspector');
var opts = {
  all: false, // return all available non-normalized metadata
  ua: "Mozilla/5.0", // some oembed providers might not answer otherwise
  nofavicon: false, // disable any favicon-related additional request
  nosource: false, // disable any sub-source inspection for audio, video, image types
  providers: [{ // an array of custom OEmbed providers, or path to a module exporting such an array
    provider_name: "Custom OEmbed provider",
    endpoints: [{
      schemes: ["http:\/\/video\.com\/*"],
      builder: function(urlObj, obj) {
        // can see current obj and override arbitrary props
        obj.embed = "custom embed url";
      }
    }]
  }],
  // new in version 2.3.0: allow file:// urls (false by default)
  file: true
};
inspector(url, opts, function(err, obj) {
  // obj holds the normalized metadata described above
});
// or simply
inspector(url, function(err, obj) {
  // same callback, default options
});
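The callback API shown above can be wrapped in a promise when that style is more convenient. The sketch below assumes only the `inspector(url, opts, callback)` signature from this section. `inspectAsync` is a hypothetical name, and the inspector function is passed in as an argument so that the example does not depend on how the module is loaded.

```javascript
'use strict';

// Hypothetical wrapper, not part of url-inspector.
// `inspector` is expected to follow the (url, opts, callback) signature.
function inspectAsync(inspector, url, opts = {}) {
  return new Promise((resolve, reject) => {
    inspector(url, opts, (err, obj) => {
      if (err) reject(err);
      else resolve(obj);
    });
  });
}

// Possible usage, assuming the module is installed:
//
//   const inspector = require('url-inspector');
//   inspectAsync(inspector, 'https://example.com/', { nofavicon: true })
//     .then((obj) => console.log(obj.title, obj.type))
//     .catch((err) => console.error(err));
```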
Command-line client
inspector-url <url>
inspector-url <filepath>
Low resource usage
network:
- at most a few hundred kilobytes (depending on resource type) are downloaded,
and usually much less, depending on connection speed.
- inspection stops as soon as enough metadata is gathered
memory:
- html is inspected using a sax parser, without building a full DOM.
exiftool:
- runs using the streat module, which keeps exiftool always open for performance
Since version 2.3.0, the file:// protocol is supported by default through the cli,
or through the api by setting the "file" option to true (false by default).
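An application that accepts user-supplied URLs may want to apply the same kind of gate before calling the module. The sketch below is an assumption about how such a check could look on the caller's side, not a description of the module's internals. `isAllowed` is a hypothetical name; only the `file` option comes from this document.

```javascript
'use strict';

// Hypothetical guard, not part of url-inspector.
// http: and https: are always accepted; file: only when opts.file is true.
// Throws a TypeError if `input` is not an absolute URL.
function isAllowed(input, opts = {}) {
  const { protocol } = new URL(input);
  if (protocol === 'http:' || protocol === 'https:') return true;
  if (protocol === 'file:') return opts.file === true;
  return false;
}
```

Keeping the flag off by default in server-side code avoids exposing local files to whoever controls the inspected URL.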
License
See LICENSE.
See also
https://github.com/kapouer/url-inspector-daemon
https://github.com/kapouer/node-streat