
feature-scaler
Normalize arbitrary lists of js objects into something you can feed to a machine learning algorithm.
feature-scaler is a utility that transforms a list of arbitrary JavaScript objects into a normalized format suitable for feeding into machine learning algorithms. It can also decode encoded data back into its original format.
Motivation: I use Andrej Karpathy's excellent convnetjs library to experiment with neural networks in JavaScript and often have to preprocess my data before training a network. This utility makes it easy to encode data in a format usable by convnetjs.
"Why JavaScript?" is a fair question - Python's scikit-learn has most of the data preprocessing features you may need. I wrote this mainly because I wanted an easy way to use convnetjs without communicating across languages. If your data is big enough that convnetjs or the performance of the V8 engine in node.js is the limiting factor in your workflow, don't use JavaScript!
Field types currently supported: ints, floats, bools, and strings.
Check out tests/main.spec.js for a demo of this library in action.
In the following documentation, I'll use planetList as the example data set we're transforming. It looks like this:
const planetList = [
{ planet: 'mars', isGasGiant: false, value: 10 },
{ planet: 'saturn', isGasGiant: true, value: 20 },
{ planet: 'jupiter', isGasGiant: true, value: 30 }
]
The independent variables are planet and isGasGiant.
The dependent variable is value.
encode(data, opts = { dataKeys, labelKeys })
data: list of raw data you need encoded. Assumptions: all entries in this list have the same structure as the first entry in the list. If the first element in data has a key called isGasGiant, and data[0].isGasGiant === true, isGasGiant should be a boolean for all objects in the list.
opts:
opts.labelKeys - list of keys you are predicting values for (value).
opts.dataKeys optional - list of independent keys (planet, isGasGiant). If not provided, defaults to all keys minus opts.labelKeys.
Example usage:
const dataKeys = ['planet', 'isGasGiant'];
const labelKeys = ['value'];
const encodedInfo = encode(planetList, { dataKeys, labelKeys });
// encodedInfo.data
[ [ 1, 0, 0, 0, -1 ], [ 0, 1, 0, 1, 0 ], [ 0, 0, 1, 1, 1 ] ]
// Note: as is the norm with machine learning algorithms,
// "label" data is at the end of each row.
// encodedInfo.data[0][4] === -1; the scaled label value for Mars.
// encodedInfo.decoders - can be treated as a black box
[
{ key: 'planet', type: 'string', offset: 3, lookupTable: ['mars','saturn','jupiter'] },
{ key: 'isGasGiant', type: 'boolean' },
{ key: 'value', type: 'number', mean: 20, std: 10 }
]
Each entry in the "decoders" list is metadata from the original dataset. It contains information on how to transform an encoded row back into the original { key: value } pairs. Your code should not modify this list. The only thing you should do with it is feed it back into decode, described below.
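Two details of the encode parameters described above, the same-structure assumption on data and the default for opts.dataKeys, can be sketched in plain JavaScript. The helper names defaultDataKeys and hasConsistentStructure are hypothetical and are not part of feature-scaler; this is only a rough illustration of the documented behavior, not the library's implementation.

```javascript
// Hypothetical helpers for illustration; not part of feature-scaler's API.

// Default dataKeys: every key of the first row that is not a label key.
function defaultDataKeys(data, labelKeys) {
  return Object.keys(data[0]).filter((key) => !labelKeys.includes(key));
}

// Check that each row has the same keys and value types as the first row.
function hasConsistentStructure(data) {
  const first = data[0];
  const keys = Object.keys(first);
  return data.every((row) =>
    Object.keys(row).length === keys.length &&
    keys.every((key) => key in row && typeof row[key] === typeof first[key])
  );
}

const planetList = [
  { planet: 'mars', isGasGiant: false, value: 10 },
  { planet: 'saturn', isGasGiant: true, value: 20 },
  { planet: 'jupiter', isGasGiant: true, value: 30 }
];

console.log(defaultDataKeys(planetList, ['value'])); // [ 'planet', 'isGasGiant' ]
console.log(hasConsistentStructure(planetList));     // true
```

Running a check like this before encoding is one way to catch a row whose field changes type partway through the list.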
Note: encodedInfo can safely be serialized to JSON and saved for later use with JSON.stringify(encodedInfo).
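As a small illustration of the note above, the following round-trips an encodedInfo-shaped object through JSON. The object literal is copied from the example output rather than produced by calling encode, so the snippet runs without the library installed.

```javascript
// Literal copied from the example output above (not generated by encode here).
const encodedInfo = {
  data: [[1, 0, 0, 0, -1], [0, 1, 0, 1, 0], [0, 0, 1, 1, 1]],
  decoders: [
    { key: 'planet', type: 'string', offset: 3, lookupTable: ['mars', 'saturn', 'jupiter'] },
    { key: 'isGasGiant', type: 'boolean' },
    { key: 'value', type: 'number', mean: 20, std: 10 }
  ]
};

// Serialize for storage, then restore later.
const serialized = JSON.stringify(encodedInfo);
const restored = JSON.parse(serialized);

console.log(restored.decoders[2].mean); // 20
console.log(restored.data[0][4]);       // -1
```

Because the structure holds only numbers, strings, booleans, arrays, and plain objects, nothing is lost in the round trip.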
decode(encodedData, decoders)
encodedData - the data from encode output
decoders - the decoders from encode output
It returns the list of data in its original format.
decodeRow(encodedRow, decoders)
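Similar to decode, but operates on a single row. For example, the two sides below hold equal contents (deep equality; they are separate objects, so a literal === comparison in JavaScript would be false):
decodeRow(encodedData[0], decoders) === decode(encodedData, decoders)[0]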
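To make the relationship between decode and decodeRow concrete, here is a minimal sketch of how the decoders metadata from the example could be used to rebuild a row. The functions sketchDecodeRow and sketchDecode are hypothetical re-implementations written for this illustration, and the reading of offset as the number of one-hot columns a string field occupies is an assumption based on the example output, not confirmed library behavior.

```javascript
// Hypothetical re-implementation for illustration; not feature-scaler's source.
// Assumption: a string decoder's `offset` is the number of one-hot columns it occupies.
function sketchDecodeRow(encodedRow, decoders) {
  const decoded = {};
  let column = 0;
  for (const decoder of decoders) {
    if (decoder.type === 'string') {
      const oneHot = encodedRow.slice(column, column + decoder.offset);
      decoded[decoder.key] = decoder.lookupTable[oneHot.indexOf(1)];
      column += decoder.offset;
    } else if (decoder.type === 'boolean') {
      decoded[decoder.key] = encodedRow[column] === 1;
      column += 1;
    } else {
      // number: invert (n - mean) / std
      decoded[decoder.key] = encodedRow[column] * decoder.std + decoder.mean;
      column += 1;
    }
  }
  return decoded;
}

// Decoding a whole list is decoding each row in turn.
function sketchDecode(encodedData, decoders) {
  return encodedData.map((row) => sketchDecodeRow(row, decoders));
}

const decoders = [
  { key: 'planet', type: 'string', offset: 3, lookupTable: ['mars', 'saturn', 'jupiter'] },
  { key: 'isGasGiant', type: 'boolean' },
  { key: 'value', type: 'number', mean: 20, std: 10 }
];
const encodedData = [[1, 0, 0, 0, -1], [0, 1, 0, 1, 0], [0, 0, 1, 1, 1]];

console.log(sketchDecodeRow(encodedData[0], decoders));
// { planet: 'mars', isGasGiant: false, value: 10 }
```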
The short version is this library encodes data in the following ways:
numbers: (n - mean) / stddev
booleans: n ? 1 : 0
strings: one binary column per distinct value, as explained below
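The number and boolean rules above can be sketched as follows. The helper names are hypothetical, and the use of the sample standard deviation (dividing by n - 1) is an assumption: it is the choice that reproduces std: 10 and the scaled values -1, 0, 1 for the example values 10, 20, 30.

```javascript
// Hypothetical helpers for illustration; not feature-scaler's source.
function mean(values) {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Sample standard deviation (divides by n - 1). This matches std: 10 in the
// example decoders; whether the library computes it this way is an assumption.
function sampleStd(values) {
  const m = mean(values);
  const sumSq = values.reduce((sum, v) => sum + (v - m) ** 2, 0);
  return Math.sqrt(sumSq / (values.length - 1));
}

// Apply (n - mean) / stddev to every value in a column.
function standardize(values) {
  const m = mean(values);
  const s = sampleStd(values);
  return values.map((n) => (n - m) / s);
}

// Booleans become 1 or 0.
const encodeBoolean = (b) => (b ? 1 : 0);

console.log(standardize([10, 20, 30]));              // [ -1, 0, 1 ]
console.log([false, true, true].map(encodeBoolean)); // [ 0, 1, 1 ]
```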
Standardizing numbers and booleans is easy, but categorical string data is a little trickier. In the example above, transforming ['mars', 'jupiter', 'saturn'] into a single number value falsely implies* there is an ordering to the underlying value. Suppose you had a variable that represented the weather; there is no logical ordering to ['rain', 'sun', 'overcast']. If we naively had a single numeric "weather" column where rain=0, sun=1, overcast=2, some machine learning algorithms would treat that field as "ordered".
Instead, we need to map each string to a list of binary values in which exactly one entry is 1 (a one-hot encoding). In the planets example, we see the following encodings:
mars == [1, 0, 0]
saturn == [0, 1, 0]
jupiter == [0, 0, 1]
We can feed this into an arbitrary machine learning algorithm without the possibility of it (incorrectly) inferring an ordering to our data.
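A minimal sketch of this kind of encoding, using the weather example from above. The helpers buildLookupTable and oneHot are hypothetical names for this illustration, and building the lookup table in first-seen order is an assumption; the position of the 1 depends entirely on that order.

```javascript
// Hypothetical helpers for illustration; not feature-scaler's source.

// Collect the distinct strings in first-seen order.
function buildLookupTable(values) {
  return [...new Set(values)];
}

// One binary column per lookup entry; exactly one column is 1.
function oneHot(value, lookupTable) {
  return lookupTable.map((entry) => (entry === value ? 1 : 0));
}

const weather = ['rain', 'sun', 'overcast', 'sun'];
const lookupTable = buildLookupTable(weather);

console.log(lookupTable); // [ 'rain', 'sun', 'overcast' ]
console.log(weather.map((w) => oneHot(w, lookupTable)));
// [ [ 1, 0, 0 ], [ 0, 1, 0 ], [ 0, 0, 1 ], [ 0, 1, 0 ] ]
```

No column is numerically "larger" than another here, which is the property that keeps an algorithm from inferring an order among the categories.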
* In our example, there is indeed an ordering to the planets! If that ordering is important, add a calculated field to the data before encoding; for example, you could add a numberOfPlanetFromSun integer field to each record.
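One way to add such a calculated field is sketched below. The positionFromSun lookup is a hypothetical helper written for this example and covers only the three planets in planetList.

```javascript
// Position from the sun for the example planets (Mars 4th, Jupiter 5th, Saturn 6th).
// Hypothetical lookup for illustration only.
const positionFromSun = { mars: 4, jupiter: 5, saturn: 6 };

const planetList = [
  { planet: 'mars', isGasGiant: false, value: 10 },
  { planet: 'saturn', isGasGiant: true, value: 20 },
  { planet: 'jupiter', isGasGiant: true, value: 30 }
];

// Add the ordering as an explicit integer field; the original objects are not mutated.
const withOrdering = planetList.map((record) => ({
  ...record,
  numberOfPlanetFromSun: positionFromSun[record.planet]
}));

console.log(withOrdering[0]);
// { planet: 'mars', isGasGiant: false, value: 10, numberOfPlanetFromSun: 4 }
```

The augmented list can then be encoded with numberOfPlanetFromSun listed among the data keys, so the ordering travels as an ordinary numeric column.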
Contributions welcome! Please include unit tests, and ensure both npm run test and npm run lint pass without warnings.