
EncodingSampler helps when a file's character encoding is unknown, for example when a user uploads a file without knowing its encoding (or, typically, even what "character encoding" means). EncodingSampler extracts a concise set of samples from the selected file for display so the user can choose wisely.
For a given file, some encodings may be dismissed out of hand because they would result in invalid characters or sequences. However, in the general case you have to let the user see the differences and choose. For example, it's easy to determine that an 8-bit character is not encoded as US_ASCII because it is simply invalid, but it's impossible to tell whether the character 0xA4 should be displayed as a generic currency symbol (¤) using ISO-8859-1 or as a Euro symbol (€) using ISO-8859-15 without asking the user.
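The ambiguity described above can be reproduced with core Ruby alone. This is a minimal sketch that uses only `String#force_encoding`, `String#valid_encoding?`, and `String#encode`; it does not call EncodingSampler itself.

```ruby
# One byte, three interpretations. The byte 0xA4 is the only input here.
byte = "\xA4".b # a binary (ASCII-8BIT) string holding the single byte 0xA4

# US-ASCII only defines 7-bit values, so the byte is simply invalid.
ascii_ok = byte.dup.force_encoding('US-ASCII').valid_encoding?

# Both Latin encodings accept the byte but map it to different characters.
latin1 = byte.dup.force_encoding('ISO-8859-1').encode('UTF-8')
latin9 = byte.dup.force_encoding('ISO-8859-15').encode('UTF-8')

puts "US-ASCII valid? #{ascii_ok}"
puts "ISO-8859-1:  #{latin1}"
puts "ISO-8859-15: #{latin9}"
```

Nothing in the byte itself says which of the two Latin readings is intended, which is why the choice has to go to the user.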
EncodingSampler solves the problem by collecting a reasonably (but not rigorously) minimal sample by reading the file line-by-line. Lines that demonstrate the difference between any pair of encodings are noted, and when a line is encountered that cannot be "decoded" with a specific encoding, that encoding is considered invalid and removed from the running. When the sampling is complete, each encoding is grouped with other encoding(s) that yield identical decoding results.
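The elimination and grouping steps can be sketched roughly as follows. The helper names `decode` and `group_encodings` are hypothetical and are not the gem's internals; the sketch also works on an in-memory array of binary lines rather than a file, and it does not collect the per-pair sample lines.

```ruby
# Hypothetical helpers for illustration; these are not the gem's internals.

# Decode raw bytes as `name` and return UTF-8 text, or nil if that fails.
def decode(bytes, name)
  text = bytes.dup.force_encoding(name)
  return nil unless text.valid_encoding?

  text.encode('UTF-8')
rescue EncodingError
  nil
end

# Return groups of encoding names whose decodings of every line are identical.
def group_encodings(binary_lines, encodings)
  # still-valid encoding name => lines decoded so far
  decoded = encodings.map { |name| [name, []] }.to_h

  binary_lines.each do |line|
    decoded.keys.each do |name|
      text = decode(line, name)
      if text
        decoded[name] << text
      else
        decoded.delete(name) # one undecodable line removes the encoding
      end
    end
  end

  decoded.keys.group_by { |name| decoded[name] }.values
end

lines = ["price\n".b, "\xA4 10\n".b]
groups = group_encodings(lines, %w[US-ASCII UTF-8 ISO-8859-1 ISO-8859-2 ISO-8859-15])
p groups
```

With these two lines, US-ASCII and UTF-8 should drop out on the second line, while ISO-8859-1 and ISO-8859-2 end up grouped together because both map 0xA4 to the same currency sign.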
There are three possible results:
There may be no valid encodings. This could mean that none of the proposed encodings match the file, but often it means the file is either malformed, or is not a text file. This is generally what you will see if you try to determine the encoding of a non-text binary file.
There may be only one group of valid encodings, all of which yield the same decoded data. In this case there are no samples to look at because there are no differences to show. A straight ASCII file may yield this result for many encodings.
There may be more than one group of valid encodings, each of which yields different decoded data. This is the interesting case! Samples will then be available so a user can visually determine which is the correct interpretation. The "diff-lcs" gem is used to diff the samples, providing a simple way to highlight the (usually few) differences.
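The three outcomes above map directly onto the number of encoding groups. As a small sketch, a caller might branch like this; `describe_outcome` is a hypothetical helper, and `groups` stands for an array of groups such as the one `unique_valid_encoding_groups` returns in the usage example.

```ruby
# Hypothetical helper: name the outcome for a list of encoding groups.
def describe_outcome(groups)
  case groups.size
  when 0 then :no_valid_encoding # malformed, binary, or no candidate fits
  when 1 then :no_differences    # every valid encoding decodes identically
  else        :needs_user_choice # show samples and let the user pick
  end
end

puts describe_outcome([['ISO-8859-1', 'ISO-8859-2'], ['ISO-8859-15']])
```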
Because this method works by reading file lines and "decoding" each line with all the remaining valid encodings, it can be slow. For most files, the number of line "decodings" will equal the number of lines in the file times the number of encodings tested, and at this writing, Ruby 1.9.3 supports 168 encodings! Testing a much smaller set of likely encodings is recommended.
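To get a feel for the cost, the lines-times-encodings estimate can be computed directly. This is only a back-of-the-envelope sketch: `decoding_count` is a hypothetical helper, the shortlist is an arbitrary example, and `Encoding.name_list` reports whatever the running Ruby supports (aliases included), so its size will differ from the figure quoted above.

```ruby
# Rough cost model from the paragraph above: one decoding per line per encoding.
def decoding_count(line_count, encodings)
  line_count * encodings.size
end

all_encodings = Encoding.name_list # every name this Ruby knows, aliases included
shortlist     = %w[UTF-8 ISO-8859-1 ISO-8859-15 Windows-1252]

puts "all encodings: #{decoding_count(100_000, all_encodings)} decodings"
puts "shortlist:     #{decoding_count(100_000, shortlist)} decodings"
```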
Add this line to your application's Gemfile:
gem 'encoding_sampler'
And then execute:
$ bundle
Or install it yourself as:
$ gem install encoding_sampler
Creating a new EncodingSampler::Sampler runs the complete file analysis.
EncodingSampler::Sampler.new(file_name, encodings, options = {})
# options:
# :difference_start => inserted into the diffed samples to mark the start of a "different" section
# :difference_end => inserted into the diffed samples to mark the end of a "different" section
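To illustrate what the two marker options do, here is a deliberately naive sketch. The gem uses diff-lcs for the real comparison; the hypothetical `mark_differences` helper below only compares equal-length strings position by position, and the marker values are taken from the sample output shown in this README.

```ruby
# Hypothetical stand-in for the diff step: the gem uses diff-lcs, while this
# sketch only compares equal-length strings position by position.
def mark_differences(sample, reference, difference_start:, difference_end:)
  sample.chars.zip(reference.chars).map do |char, other|
    char == other ? char : "#{difference_start}#{char}#{difference_end}"
  end.join
end

marked = mark_differences('€ABC€', '¤ABC¤',
                          difference_start: '<span class="difference">',
                          difference_end: '</span>')
puts marked
```

Only the characters that differ from the reference get wrapped, so the markers can be anything the display layer understands, not just HTML.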
Once you have an instance of an EncodingSampler, you can use the object's instance methods to determine which encodings are valid, which are unique (that is, which yield unique results), and get samples to compare the differences visually. For example, imagining you have a file that turns out to be ISO-8859-15 (which includes the Euro sign), you might get these results:
sampler = EncodingSampler::Sampler.new(
'some/file/name.csv',
['ASCII-8BIT', 'UTF-8', 'ISO-8859-1', 'ISO-8859-2', 'ISO-8859-15'])
sampler.valid_encodings
# ["ASCII-8BIT", "ISO-8859-1", "ISO-8859-2", "ISO-8859-15"]
sampler.unique_valid_encoding_groups
# [["ASCII-8BIT"], ["ISO-8859-1", "ISO-8859-2"], ["ISO-8859-15"]]
sampler.sample('ASCII-8BIT')
# ["?ABCDEFabcdef0123456789?ABCDEFabcdef0123456789?"]
sampler.sample('ISO-8859-1')
# ["¤ABCDEFabcdef0123456789¤ABCDEFabcdef0123456789¤"]
sampler.sample('ISO-8859-15')
# ["€ABCDEFabcdef0123456789€ABCDEFabcdef0123456789€"]
sampler.samples(["ASCII-8BIT", "ISO-8859-1", "ISO-8859-15"])
# {"ASCII-8BIT"=>["?ABCDEFabcdef0123456789?ABCDEFabcdef0123456789?"],
# "ISO-8859-1"=>["¤ABCDEFabcdef0123456789¤ABCDEFabcdef0123456789¤"],
# "ISO-8859-15"=>["€ABCDEFabcdef0123456789€ABCDEFabcdef0123456789€"]}
sampler.diffed_samples(["ASCII-8BIT", "ISO-8859-1", "ISO-8859-15"])
# {"ASCII-8BIT"=>["<span class=\"difference\">?</span>ABCDEFabcdef0123456789<span class=\"difference\">?</span>ABCDEFabcdef0123456789<span class=\"difference\">?</span>"],
# "ISO-8859-1"=>["<span class=\"difference\">¤</span>ABCDEFabcdef0123456789<span class=\"difference\">¤</span>ABCDEFabcdef0123456789<span class=\"difference\">¤</span>"],
# "ISO-8859-15"=>["<span class=\"difference\">€</span>ABCDEFabcdef0123456789<span class=\"difference\">€</span>ABCDEFabcdef0123456789<span class=\"difference\">€</span>"]}
Notes:
In raw form the diffed_samples don't seem impressive, but they can be used to display the results via HTML, for example, to highlight and clarify the differences.
Encoding | Sample |
---|---|
ASCII-8BIT | ?ABCDEFabcdef0123456789?ABCDEFabcdef0123456789? |
ISO-8859-1 | ¤ABCDEFabcdef0123456789¤ABCDEFabcdef0123456789¤ |
ISO-8859-15 | €ABCDEFabcdef0123456789€ABCDEFabcdef0123456789€ |
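A table like the one above could be produced from the hash that `diffed_samples` returns. The sketch below assumes that hash shape (encoding name mapped to an array of sample strings), and `samples_to_html` is a hypothetical view helper, not part of the gem.

```ruby
# Hypothetical view helper: turns {encoding name => [sample lines]} into a table.
# Sample strings are inserted as-is because diffed samples already carry markup.
def samples_to_html(samples)
  rows = samples.map do |encoding, lines|
    "<tr><th>#{encoding}</th><td>#{lines.join('<br>')}</td></tr>"
  end
  "<table>\n#{rows.join("\n")}\n</table>"
end

html = samples_to_html(
  'ISO-8859-1'  => ['<span class="difference">¤</span>ABC'],
  'ISO-8859-15' => ['<span class="difference">€</span>ABC']
)
puts html
```

Because the samples come from an uploaded file, a real application would also need to HTML-escape the text outside the difference markers before rendering it.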
EncodingSampler provides a functional but not-so-elegant solution. I'd love to see improvements or alternate ideas in regard to the concept, the algorithms, the interface, etc.
git checkout -b my-new-feature
git commit -am 'Added some feature'
git push origin my-new-feature