UrlParser
Extended URI capabilities built on top of Addressable::URI. Parse URIs into granular components, unescape encoded characters, extract embedded URIs, normalize URIs, generate canonical URLs, and validate domains. Inspired by PostRank-URI and URI.js.
Installation
Add this line to your application's Gemfile:
gem 'url_parser'
And then execute:
$ bundle
Or install it yourself as:
$ gem install url_parser
Example
uri = UrlParser.parse('foo://username:password@ww2.foo.bar.example.com:123/hello/world/there.html?name=ferret#foo')
uri.class
uri.scheme
uri.username
uri.user
uri.password
uri.userinfo
uri.hostname
uri.naked_hostname
uri.port
uri.host
uri.www
uri.tld
uri.top_level_domain
uri.extension
uri.sld
uri.second_level_domain
uri.domain_name
uri.trd
uri.third_level_domain
uri.subdomains
uri.naked_trd
uri.naked_subdomain
uri.domain
uri.subdomain
uri.origin
uri.authority
uri.site
uri.path
uri.segment
uri.directory
uri.filename
uri.suffix
uri.query
uri.query_values
uri.fragment
uri.resource
uri.location
Usage
Parse
Parse takes the provided URI and breaks it down into its component parts. For a full list of the components provided, see URI Data Model. If you provide an instance of Addressable::URI, it is treated as already parsed.
uri = UrlParser.parse('http://example.org/foo?bar=baz')
uri.class
#=> UrlParser::URI
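To give a feel for the kind of breakdown parsing produces, here is a minimal sketch that uses only Ruby's standard library, not UrlParser itself. The `split_host` helper is hypothetical and assumes a single-label TLD, so it would mis-split hosts such as `example.co.uk`; it is not a description of how the gem derives its domain parts.

```ruby
require 'uri'

# Hypothetical helper: naive host breakdown that assumes the last label is
# the TLD and the label before it is the second-level domain.
def split_host(host)
  labels = host.split('.')
  {
    tld: labels[-1],
    sld: labels[-2],
    trd: labels[0...-2].join('.')
  }
end

uri = URI.parse('http://ww2.foo.bar.example.com:123/hello/world/there.html?name=ferret#foo')
parts = split_host(uri.host)

puts uri.scheme   # http
puts uri.port     # 123
puts uri.path     # /hello/world/there.html
puts uri.query    # name=ferret
puts uri.fragment # foo
puts parts[:tld]  # com
puts parts[:sld]  # example
puts parts[:trd]  # ww2.foo.bar
```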
Unembed, canonicalize, normalize, and clean all rely on parse.
### Unembed
Unembed searches the provided URI's query values for redirection URLs. By default, it searches the `u` and `url` params, but you can configure custom params to search.
```ruby
uri = UrlParser.unembed('http://energy.gov/exit?url=https%3A//twitter.com/energy')
uri.to_s
#=> "https://twitter.com/energy"
```
With custom embedded param keys:
uri = UrlParser.unembed('https://www.upwork.com/leaving?ref=https%3A%2F%2Fwww.example.com', embedded_params: [ 'u', 'url', 'ref' ])
uri.to_s
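The idea behind unembedding can be sketched with the standard library alone. This is an illustration under assumptions, not the gem's implementation: `unembed_sketch` is a hypothetical helper, and it only treats values that start with `http://` or `https://` as embedded URLs.

```ruby
require 'uri'

# Hypothetical helper: return the first embedded URL found under one of the
# given query keys, or the original URL when none is present.
def unembed_sketch(url, embedded_params: %w[u url])
  query = URI.parse(url).query
  return url if query.nil?

  # decode_www_form percent-decodes every key and value in the query string.
  pairs = URI.decode_www_form(query)
  match = pairs.find do |key, value|
    embedded_params.include?(key) && value =~ %r{\Ahttps?://}
  end
  match ? match[1] : url
end

puts unembed_sketch('http://energy.gov/exit?url=https%3A//twitter.com/energy')
# https://twitter.com/energy
```

Passing `embedded_params: %w[u url ref]` to the sketch widens the search in the same way the `embedded_params` option does in the example above.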
Canonicalize
Canonicalize applies filters on param keys to remove common tracking params, attempting to make it easier to identify duplicate URIs. For a full list of params, see db.yml.
uri = UrlParser.canonicalize('https://en.wikipedia.org/wiki/Ruby_(programming_language)?source=ABCD&utm_source=EFGH')
uri.to_s
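As a rough illustration of key-based filtering, the sketch below drops query params whose key matches a pattern. The `TRACKING_KEYS` pattern and the `canonicalize_sketch` helper are assumptions made for this example; the gem's actual filter list is the one in db.yml.

```ruby
require 'uri'

# Assumed, illustrative filter; the gem's real list lives in db.yml.
TRACKING_KEYS = /\A(utm_\w+|source)\z/

# Hypothetical helper: remove query params whose key matches the filter.
def canonicalize_sketch(url)
  uri = URI.parse(url)
  return url if uri.query.nil?

  kept = URI.decode_www_form(uri.query).reject { |key, _| key =~ TRACKING_KEYS }
  # Drop the query entirely when nothing survives, so no bare "?" remains.
  uri.query = kept.empty? ? nil : URI.encode_www_form(kept)
  uri.to_s
end

puts canonicalize_sketch('https://en.wikipedia.org/wiki/Ruby_(programming_language)?source=ABCD&utm_source=EFGH')
# https://en.wikipedia.org/wiki/Ruby_(programming_language)
```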
Normalize
Normalize standardizes paths, query strings, anchors, whitespace, hostnames, and trailing slashes.
uri = UrlParser.normalize('http://example.com/a/b/../../')
uri.to_s
uri = UrlParser.normalize('http://example.com/?')
uri.to_s
uri = UrlParser.normalize('http://example.com/#test')
uri.to_s
uri = UrlParser.normalize('http://example.com/a/../? #test')
uri.to_s
uri = UrlParser.normalize("💩.la")
uri.to_s
uri = UrlParser.normalize('http://example.com/a/b/')
uri.to_s
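Two of the steps named above, path standardization and empty query removal, can be sketched with the standard library. This is a simplified illustration, not the gem's code: `remove_dot_segments` and `normalize_sketch` are hypothetical helpers, the path is assumed to be absolute, and whitespace, internationalized hostnames, and trailing slashes are not handled.

```ruby
require 'uri'

# Hypothetical helper: resolve "." and ".." segments of an absolute path,
# following the approach described in RFC 3986, section 5.2.4.
def remove_dot_segments(path)
  segments = path.split('/', -1)
  output = []
  segments.each_with_index do |segment, index|
    last = index == segments.length - 1
    case segment
    when '.'
      output << '' if last            # "/a/." becomes "/a/"
    when '..'
      output.pop if output.length > 1 # never pop the leading root marker
      output << '' if last            # "/a/b/.." becomes "/a/"
    else
      output << segment
    end
  end
  output.join('/')
end

# Hypothetical helper: lowercase scheme and host, clean the path, and drop
# an empty query string.
def normalize_sketch(url)
  uri = URI.parse(url).normalize
  uri.path = remove_dot_segments(uri.path)
  uri.query = nil if uri.query.to_s.empty?
  uri.to_s
end

puts normalize_sketch('http://example.com/a/b/../../') # http://example.com/
puts normalize_sketch('http://example.com/?')          # http://example.com/
```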
Clean
Clean combines parsing, unembedding, canonicalization, and normalization into a single call. It is intended as a way to cross-reference URLs that point to the same resource.
uri = UrlParser.clean('http://example.com/a/../?url=https%3A//💩.la/&utm_source=google')
uri.to_s
uri = UrlParser.clean('https://en.wikipedia.org/wiki/Ruby_(programming_language)?source=ABCD&utm_source%3Danalytics')
uri.to_s
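Conceptually, cleaning is a pipeline in which each step feeds the next. The sketch below shows that composition with deliberately simplified stand-ins for each step. The lambdas, the step order, and the `clean_sketch` helper are all assumptions for illustration and do not mirror the gem's internals.

```ruby
require 'uri'

# Hypothetical stand-ins for the individual steps. Each one takes a URL
# string and returns a URL string.
UNEMBED = lambda do |url|
  query = URI.parse(url).query.to_s
  URI.decode_www_form(query).to_h['url'] || url
end

CANONICALIZE = lambda do |url|
  uri = URI.parse(url)
  kept = URI.decode_www_form(uri.query.to_s).reject { |key, _| key.start_with?('utm_') }
  uri.query = kept.empty? ? nil : URI.encode_www_form(kept)
  uri.to_s
end

NORMALIZE = ->(url) { URI.parse(url).normalize.to_s }

# Run the URL through every step in order (assumed order for this sketch).
def clean_sketch(url, steps = [UNEMBED, CANONICALIZE, NORMALIZE])
  steps.reduce(url) { |current, step| step.call(current) }
end

puts clean_sketch('http://example.com/exit?url=https%3A%2F%2FExample.org%2Fpage%3Futm_source%3Dgoogle%26id%3D7')
# https://example.org/page?id=7
```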
UrlParser::URI
Parsing a URI with UrlParser returns an instance of UrlParser::URI, with the following methods available:
URI Data Model
* :scheme
* :username
* :user
* :password
* :userinfo
* :hostname
* :naked_hostname
* :port
* :host
* :www
* :tld
* :top_level_domain
* :extension
* :sld
* :second_level_domain
* :domain_name
* :trd
* :third_level_domain
* :subdomains
* :naked_trd
* :naked_subdomain
* :domain
* :subdomain
* :origin
* :authority
* :site
* :path
* :segment
* :directory
* :filename
* :suffix
* :query
* :query_values
* :fragment
* :resource
* :location
Additional URI Methods
uri = UrlParser.clean('#')
uri.unescaped?
uri.parsed?
uri.unembedded?
uri.canonicalized?
uri.normalized?
uri.cleaned?
uri.localhost?
uri.ip_address?
uri.ipv4?
uri.ipv6?
uri.ipv4
uri.ipv6
uri = UrlParser.parse('/')
uri.relative?
uri = UrlParser.parse('http://example.com/')
uri.absolute?
uri = UrlParser.parse('http://example.com/?utm_source=google')
uri.clean
uri = UrlParser.parse('http://example.com/?utm_source%3Danalytics')
uri.canonical
uri = UrlParser.parse('http://foo.com/zee/zaw/zoom.html')
joined_uri = uri + '/bar#id'
joined_uri.to_s
uri = UrlParser.parse('http://example.com/')
uri.raw
uri = UrlParser.parse('http://example.com/a/../?')
uri == 'http://example.com/'
uri == 'https://example.com/'
uri =~ 'https://example.com/'
uri = UrlParser.parse('http://example.qqq/')
uri.valid?
Configuration
embedded_params
Set the params the unembed parser uses to search for embedded URIs. Default is [ 'u', 'url' ]. Set to an empty array to disable unembedding.
UrlParser.configure do |config|
config.embedded_params = [ 'ref' ]
end
uri = UrlParser.unembed('https://www.upwork.com/leaving?ref=https%3A%2F%2Fwww.example.com')
uri.to_s
default_scheme
Set a default scheme to use when one is not present. Set to nil if there should be no default scheme. Default is 'http'.
UrlParser.configure do |config|
config.default_scheme = 'https'
end
uri = UrlParser.parse('example.com')
uri.to_s
scheme_map
Replace any scheme that appears as a key in the map with the corresponding value. Useful for replacing invalid or outdated schemes. Default is an empty hash.
UrlParser.configure do |config|
config.scheme_map = { 'feed' => 'http' }
end
uri = UrlParser.parse('feed://feeds.feedburner.com/YourBlog')
uri.to_s
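For readers unfamiliar with block-style configuration, the sketch below shows one common way such a `configure` method can be built in Ruby, and how `default_scheme` and `scheme_map` settings might then be applied. `SketchParser` and `apply_scheme` are hypothetical names invented for this example, and the order in which the two settings are applied is an assumption.

```ruby
# Hypothetical module illustrating a block-style configuration pattern.
module SketchParser
  Config = Struct.new(:default_scheme, :scheme_map)

  # Lazily build the settings object with assumed defaults.
  def self.config
    @config ||= Config.new('http', {})
  end

  # Yield the settings object so callers can assign options in a block.
  def self.configure
    yield config
  end

  # Map the scheme when one is present; otherwise prepend the default
  # scheme, unless the default has been set to nil.
  def self.apply_scheme(url)
    scheme, rest = url.split('://', 2)
    if rest.nil?
      return url if config.default_scheme.nil?
      return "#{config.default_scheme}://#{url}"
    end
    "#{config.scheme_map.fetch(scheme, scheme)}://#{rest}"
  end
end

SketchParser.configure do |config|
  config.default_scheme = 'https'
  config.scheme_map = { 'feed' => 'http' }
end

puts SketchParser.apply_scheme('example.com')
# https://example.com
puts SketchParser.apply_scheme('feed://feeds.feedburner.com/YourBlog')
# http://feeds.feedburner.com/YourBlog
```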
TODO
- Extract URIs from text
- Enable custom rules for normalization, canonicalization, escaping, and extraction
Contributing
- Fork it ( https://github.com/[my-github-username]/url_parser/fork )
- Create your feature branch (`git checkout -b my-new-feature`)
- Commit your changes (`git commit -am 'Add some feature'`)
- Push to the branch (`git push origin my-new-feature`)
- Create a new Pull Request