
Security News
CISA’s 2025 SBOM Guidance Adds Hashes, Licenses, Tool Metadata, and Context
CISA’s 2025 draft SBOM guidance adds new fields like hashes, licenses, and tool metadata to make software inventories more actionable.
We are all in the gutter, but some of us are looking at the stars.
-- Oscar Wilde
WordsCounted is a Ruby NLP (natural language processor). WordsCounted lets you implement powerful tokensation strategies with a very flexible tokeniser class.
Are you using WordsCounted to do something interesting? Please tell me about it.
Visit this website for one example of what you can do with WordsCounted.
["Bayrūt"]
and not ["Bayr", "ū", "t"]
, for example.Add this line to your application's Gemfile:
gem 'words_counted'
And then execute:
$ bundle
Or install it yourself as:
$ gem install words_counted
Pass in a string or a file path, and an optional filter and/or regexp.
counter = WordsCounted.count(
"We are all in the gutter, but some of us are looking at the stars."
)
# Using a file
counter = WordsCounted.from_file("path/or/url/to/my/file.txt")
.count
and .from_file
are convenience methods that take an input, tokenise it, and return an instance of WordsCounted::Counter
initialized with the tokens. The WordsCounted::Tokeniser
and WordsCounted::Counter
classes can be used alone, however.
WordsCounted.count(input, options = {})
Tokenises input and initializes a WordsCounted::Counter
object with the resulting tokens.
counter = WordsCounted.count("Hello Beirut!")
Accepts two options: exclude
and regexp
. See Excluding tokens from the analyser and Passing in a custom regexp respectively.
WordsCounted.from_file(path, options = {})
Reads and tokenises a file, and initializes a WordsCounted::Counter
object with the resulting tokens.
counter = WordsCounted.from_file("hello_beirut.txt")
Accepts the same options as .count
.
The tokeniser allows you to tokenise text in a variety of ways. You can pass in your own rules for tokenisation, and apply a powerful filter with any combination of rules as long as they can boil down into a lambda.
Out of the box the tokeniser includes only alpha chars. Hyphenated tokens and tokens with apostrophes are considered a single token.
#tokenise([pattern: TOKEN_REGEXP, exclude: nil])
tokeniser = WordsCounted::Tokeniser.new("Hello Beirut!").tokenise
# With `exclude`
tokeniser = WordsCounted::Tokeniser.new("Hello Beirut!").tokenise(exclude: "hello")
# With `pattern`
tokeniser = WordsCounted::Tokeniser.new("I <3 Beirut!").tokenise(pattern: /[a-z]/i)
See Excluding tokens from the analyser and Passing in a custom regexp for more information.
The WordsCounted::Counter
class allows you to collect various statistics from an array of tokens.
#token_count
Returns the token count of a given string.
counter.token_count #=> 15
#token_frequency
Returns a sorted (unstable) two-dimensional array where each element is a token and its frequency. The array is sorted by frequency in descending order.
counter.token_frequency
[
["the", 2],
["are", 2],
["we", 1],
# ...
["all", 1]
]
#most_frequent_tokens
Returns a hash where each key-value pair is a token and its frequency.
counter.most_frequent_tokens
{ "are" => 2, "the" => 2 }
#token_lengths
Returns a sorted (unstable) two-dimentional array where each element contains a token and its length. The array is sorted by length in descending order.
counter.token_lengths
[
["looking", 7],
["gutter", 6],
["stars", 5],
# ...
["in", 2]
]
#longest_tokens
Returns a hash where each key-value pair is a token and its length.
counter.longest_tokens
{ "looking" => 7 }
#token_density([ precision: 2 ])
Returns a sorted (unstable) two-dimentional array where each element contains a token and its density as a float, rounded to a precision of two. The array is sorted by density in descending order. It accepts a precision
argument, which must be a float.
counter.token_density
[
["are", 0.13],
["the", 0.13],
["but", 0.07 ],
# ...
["we", 0.07 ]
]
#char_count
Returns the char count of tokens.
counter.char_count #=> 76
#average_chars_per_token([ precision: 2 ])
Returns the average char count per token rounded to two decimal places. Accepts a precision argument which defaults to two. Precision must be a float.
counter.average_chars_per_token #=> 4
#uniq_token_count
Returns the number of unique tokens.
counter.uniq_token_count #=> 13
You can exclude anything you want from the input by passing the exclude
option. The exclude option accepts a variety of filters and is extremely flexible.
:odd?
.tokeniser =
WordsCounted::Tokeniser.new(
"Magnificent! That was magnificent, Trevor."
)
# Using a string
tokeniser.tokenise(exclude: "was magnificent")
# => ["that", "trevor"]
# Using a regular expression
tokeniser.tokenise(exclude: /trevor/)
# => ["magnificent", "that", "was", "magnificent"]
# Using a lambda
tokeniser.tokenise(exclude: ->(t) { t.length < 4 })
# => ["magnificent", "that", "magnificent", "trevor"]
# Using symbol
tokeniser = WordsCounted::Tokeniser.new("Hello! محمد")
tokeniser.tokenise(exclude: :ascii_only?)
# => ["محمد"]
# Using an array
tokeniser = WordsCounted::Tokeniser.new(
"Hello! اسماءنا هي محمد، كارولينا، سامي، وداني"
)
tokeniser.tokenise(
exclude: [:ascii_only?, /محمد/, ->(t) { t.length > 6}, "و"]
)
# => ["هي", "سامي", "وداني"]
The default regexp accounts for letters, hyphenated tokens, and apostrophes. This means twenty-one is treated as one token. So is Mohamad's.
/[\p{Alpha}\-']+/
You can pass your own criteria as a Ruby regular expression to split your string as desired.
For example, if you wanted to include numbers, you can override the regular expression:
counter = WordsCounted.count("Numbers 1, 2, and 3", pattern: /[\p{Alnum}\-']+/)
counter.tokens
#=> ["numbers", "1", "2", "and", "3"]
Use the from_file
method to open files. from_file
accepts the same options as .count
. The file path can be a URL.
counter = WordsCounted.from_file("url/or/path/to/file.text")
A hyphen used in leu of an em or en dash will form part of the token. This affects the tokeniser algorithm.
counter = WordsCounted.count("How do you do?-you are well, I see.")
counter.token_frequency
[
["do", 2],
["how", 1],
["you", 1],
["-you", 1], # WTF, mate!
["are", 1],
# ...
]
In this example -you
and you
are separate tokens. Also, the tokeniser does not include numbers by default. Remember that you can pass your own regular expression if the default behaviour does not fit your needs.
The program will normalise (downcase) all incoming strings for consistency and filters.
def self.from_url
# open url and send string here after removing html
end
See contributors. Not listed there is [Dave Yarwood][1].
git checkout -b my-new-feature
)git commit -am 'Add some feature'
)git push origin my-new-feature
)FAQs
Unknown package
We found that words_counted demonstrated a not healthy version release cadence and project activity because the last version was released a year ago. It has 1 open source maintainer collaborating on the project.
Did you know?
Socket for GitHub automatically highlights issues in each pull request and monitors the health of all your open source dependencies. Discover the contents of your packages and block harmful activity before you install or update your dependencies.
Security News
CISA’s 2025 draft SBOM guidance adds new fields like hashes, licenses, and tool metadata to make software inventories more actionable.
Security News
A clarification on our recent research investigating 60 malicious Ruby gems.
Security News
ESLint now supports parallel linting with a new --concurrency flag, delivering major speed gains and closing a 10-year-old feature request.