d3-scale-cluster - npm Package Compare versions

Comparing version 1.0.1 to 1.1.0

CHANGELOG.md

package.json

		{
		"name": "d3-scale-cluster",
		"version": "1.0.1",
		"version": "1.1.0",
		"description": "D3 scale that clusters data into discrete groups",
		@@ -5,0 +5,0 @@ "repository": "schnerd/d3-scale-cluster",

README.md

		# d3-scale-cluster
		D3 scale that clusters data into discrete groups

		###Usage
		D3 scale that clusters data into discrete groups. Similar to [quantile scales](https://github.com/d3/d3-scale/blob/master/README.md#scaleQuantile), the cluster scale maps a sampled input domain to a discrete range. The number of values in the output range determines the number of clusters that will be computed from the domain. The graphic below demonstrates how cluster compares to d3's quantile and quantize scales:

		<img width="420" alt="d3 scale cluster example" src="https://cloud.githubusercontent.com/assets/875591/18608070/0213d7ce-7cdf-11e6-89aa-1b0e18e63cc8.png">

		Clusters are computed using a 1-dimensional clustering algorithm with an `O(kn log(n))` runtime (where `k` is the number of clusters desired). This should be fast enough for the majority of projects, but it's worth doing your own performance testing when working with large data sets. More details about the algorithm can be found later in this document.

		To use this scale, you may install via npm:
		```
		npm install d3-cluster-scale
		```

		Or include a `<script>` tag on your page after d3
		```
		<script src="https://unpkg.com/d3-scale-cluster@1.0.1/dist/d3-scale-cluster.min.js"></script>
		```

		###Example Usage

		This scale largely has the same API as [d3.scaleQuantile](https://github.com/d3/d3-scale/blob/master/README.md#scaleQuantile) (however we use `clusters()` instead of `quantiles()`)
		@@ -18,2 +33,42 @@

		###API

		d3.scaleCluster()

		Constructs a new cluster scale with an empty domain and an empty range. The cluster scale is invalid until both a domain and range are specified.

		_cluster_(_value_)

		Given a value in the input domain, returns the corresponding value in the output range.

		_cluster_.invertExtent(_value_)

		Returns the extent of values in the domain [x0, x1] for the corresponding value in the range: the inverse of cluster. This method is useful for interaction, say to determine the value in the domain that corresponds to the pixel location under the mouse.

		_cluster_.domain(_[domain]_)

		If domain is specified, sets the domain of the quantile scale to the specified set of discrete numeric values. The array must not be empty, and must contain at least one numeric value; NaN, null and undefined values are ignored and not considered part of the sample population. If the elements in the given array are not numbers, they will be coerced to numbers. A copy of the input array is sorted and stored internally. If domain is not specified, returns the scale’s current domain.

		_cluster_.range(_[range]_)

		If range is specified, sets the discrete values in the range. The array must not be empty, and may contain any type of value. The number of values in (the cardinality, or length, of) the range array determines the number of clusters that are computed. If range is not specified, returns the current range.

		_cluster_.clusters()

		Returns the cluster thresholds. If the range contains n discrete values, the returned array will contain n - 1 thresholds. Values less than the first threshold are considered in the first cluster; values greater than or equal to the first threshold but less than the second threshold are in the second cluster, and so on.

		_cluster_.copy()

		Returns an exact copy of this scale. Changes to this scale will not affect the returned scale, and vice versa.

		###The Algorithm / History

		For clustering, this project uses the [Ckmeans](http://simplestatistics.org/docs/#ckmeans) algorithm implementation from the [simple-statistics](https://github.com/simple-statistics/simple-statistics) library (the original algorithm was designed by [Haizhou Wang and Mingzhou Song](https://cran.r-project.org/web/packages/Ckmeans.1d.dp/)). This algorithm is a major improvement on the [Jenks Natural Breaks](https://en.wikipedia.org/wiki/Jenks_natural_breaks_optimization) algorithm developed by cartographer George Jenks. Tom MacWright wrote previously about [using Jenks Natural Breaks for d3 choropleths](http://www.macwright.org/2013/02/18/literate-jenks.html), but to my knowledge it was never turned into a reusable d3 scale.

		As noted previously in this document, the Ckmeans algorithm has a runtime complexity of `O(kn log(n))`, where `k` is the number of clusters. Here's a chart of the observed runtime on Chrome 53 for a 2011 Macbook Pro 2GHz Intel Core i7:

		![screen shot 2016-10-13 at 2 14 41 pm](https://cloud.githubusercontent.com/assets/875591/19367754/5159b53e-9151-11e6-9fee-52ce88cdf696.png)

		>Fun fact: In May 2016, Haizhou Wang and Mingzhou Song implemented a new core Ckmeans algorithm (3.4.6) that improved runtime from `O(kn^2)` to `O(kn log(n))`, leading to a roughly [10x performance boost](https://cloud.githubusercontent.com/assets/875591/19367940/67688548-9152-11e6-9c2e-01e3e800bb65.png). d3-scale-cluster is one of the first javascript projects to utilize and benefit from this new algorithm.

		###Contributing
		@@ -26,5 +81,1 @@
		```

		###Thanks

		This project uses the [Ckmeans](http://simplestatistics.org/docs/#ckmeans) implementation from the [simple-statistics](https://github.com/simple-statistics/simple-statistics) library–with the original algorithm written by [Haizhou Wang and Mingzhou Song](http://journal.r-project.org/archive/2011-2/RJournal_2011-2_Wang+Song.pdf)

187

src/ckmeans.js

		/**
		* Ckmeans algorithm
		*
		* The code that lies within was taken from the wonderful simple-statistics library
		* Much of the code that lies within was taken from the simple-statistics library,
		* which offers a javascript implementation of the ckmeans algorithm originally
		* designed by Haizhou Wang and Mingzhou Song
		*
		* https://cran.r-project.org/web/packages/Ckmeans.1d.dp/
		* https://github.com/simple-statistics/simple-statistics
		@@ -62,2 +66,111 @@ *

		function ssq (j, i, sumX, sumXsq) {
		var sji; // s(j, i)
		if (j > 0) {
		var muji = (sumX[i] - sumX[j - 1]) / (i - j + 1); // mu(j, i)
		sji = sumXsq[i] - sumXsq[j - 1] - (i - j + 1) * muji * muji;
		} else {
		sji = sumXsq[i] - sumX[i] * sumX[i] / (i + 1);
		}
		return sji < 0 ? 0 : sji;
		}

		function fillMatrixColumn (imin, imax, column, matrix, backtrackMatrix, sumX, sumXsq) {
		if (imin > imax) {
		return;
		}

		// Start at midpoint between imin and imax
		var i = Math.floor((imin + imax) / 2);

		// Initialization of S[k][i]:
		matrix[column][i] = matrix[column - 1][i - 1];
		backtrackMatrix[column][i] = i;

		var jlow = column; // the lower end for j

		if (imin > column) {
		jlow = Math.max(jlow, backtrackMatrix[column][imin - 1] \|\| 0);
		}
		jlow = Math.max(jlow, backtrackMatrix[column - 1][i] \|\| 0);

		var jhigh = i - 1; // the upper end for j
		if (imax < matrix.length - 1) {
		jhigh = Math.min(jhigh, backtrackMatrix[column][imax + 1] \|\| 0);
		}

		var sji;
		var sjlowi;
		var ssqjlow;
		var ssqj;
		for (var j = jhigh; j >= jlow; --j) {
		// compute s(j,i)
		sji = ssq(j, i, sumX, sumXsq);

		// MS May 11, 2016 Added:
		if (sji + matrix[column - 1][jlow - 1] >= matrix[column][i]) {
		break;
		}

		// Examine the lower bound of the cluster border
		// compute s(jlow, i)
		sjlowi = ssq(jlow, i, sumX, sumXsq);

		ssqjlow = sjlowi + matrix[column - 1][jlow - 1];

		if (ssqjlow < matrix[column][i]) {
		// shrink the lower bound
		matrix[column][i] = ssqjlow;
		backtrackMatrix[column][i] = jlow;
		}
		jlow++;

		ssqj = sji + matrix[column - 1][j - 1];
		if (ssqj < matrix[column][i]) {
		matrix[column][i] = ssqj;
		backtrackMatrix[column][i] = j;
		}
		}

		fillMatrixColumn(imin, i - 1, column, matrix, backtrackMatrix, sumX, sumXsq);
		fillMatrixColumn(i + 1, imax, column, matrix, backtrackMatrix, sumX, sumXsq);
		}

		function fillMatrices (data, matrix, backtrackMatrix) {
		var nValues = matrix[0].length;
		var sumX = new Array(nValues);
		var sumXsq = new Array(nValues);

		// Use the median to shift values of x to improve numerical stability
		var shift = data[Math.floor(nValues / 2)];

		// Initialize first row in matrix & backtrackMatrix
		for (var i = 0; i < nValues; ++i) {
		if (i === 0) {
		sumX[0] = data[0] - shift;
		sumXsq[0] = (data[0] - shift) * (data[0] - shift);
		} else {
		sumX[i] = sumX[i - 1] + data[i] - shift;
		sumXsq[i] = sumXsq[i - 1] + (data[i] - shift) * (data[i] - shift);
		}

		// Initialize for k = 0
		matrix[0][i] = ssq(0, i, sumX, sumXsq);
		backtrackMatrix[0][i] = 0;
		}

		// Initialize the rest of the columns
		var imin;
		for (var k = 1; k < matrix.length; ++k) {
		if (k < matrix.length - 1) {
		imin = k;
		} else {
		// No need to compute matrix[K-1][0] ... matrix[K-1][N-2]
		imin = nValues - 1;
		}

		fillMatrixColumn(imin, nValues - 1, k, matrix, backtrackMatrix, sumX, sumXsq);
		}
		}

		/**
		@@ -106,2 +219,4 @@ * Ckmeans clustering is an improvement on heuristic-based clustering

		var nValues = data.length;

		var sorted = numericSort(data);
		@@ -116,7 +231,8 @@ // we'll use this as the maximum number of clusters
		}
		nClusters = Math.min(uniqueCount, nClusters);

		// named 'D' originally
		var matrix = makeMatrix(nClusters, sorted.length);
		// named 'B' originally
		var backtrackMatrix = makeMatrix(nClusters, sorted.length);
		// named 'S' originally
		var matrix = makeMatrix(nClusters, nValues);
		// named 'J' originally
		var backtrackMatrix = makeMatrix(nClusters, nValues);

		@@ -127,61 +243,4 @@ // This is a dynamic programming way to solve the problem of minimizing
		// sum of squares that are later read.
		fillMatrices(sorted, matrix, backtrackMatrix);

		// The outer loop iterates through clusters, from 0 to nClusters.
		for (var cluster = 0; cluster < nClusters; cluster++) {
		// At the start of each loop, the mean starts as the first element
		var firstClusterMean = sorted[0];

		for (var sortedIdx = Math.max(cluster, 1);
		sortedIdx < sorted.length;
		sortedIdx++) {
		if (cluster === 0) {
		// Increase the running sum of squares calculation by this
		// new value
		var squaredDifference = Math.pow(
		sorted[sortedIdx] - firstClusterMean,
		2
		);
		matrix[cluster][sortedIdx] = matrix[cluster][sortedIdx - 1] +
		(sortedIdx / (sortedIdx + 1)) * squaredDifference;

		// We're computing a running mean by taking the previous
		// mean value, multiplying it by the number of elements
		// seen so far, and then dividing it by the number of
		// elements total.
		var newSum = sortedIdx * firstClusterMean + sorted[sortedIdx];
		firstClusterMean = newSum / (sortedIdx + 1);
		} else {
		var sumSquaredDistances = 0;
		var meanXJ = 0;

		for (var j = sortedIdx; j >= cluster; j--) {
		sumSquaredDistances += (sortedIdx - j) /
		(sortedIdx - j + 1) *
		Math.pow(sorted[j] - meanXJ, 2);

		meanXJ = (sorted[j] + (sortedIdx - j) * meanXJ) /
		(sortedIdx - j + 1);

		if (j === sortedIdx) {
		matrix[cluster][sortedIdx] = sumSquaredDistances;
		backtrackMatrix[cluster][sortedIdx] = j;
		if (j > 0) {
		matrix[cluster][sortedIdx] += matrix[cluster - 1][j - 1];
		}
		} else {
		if (j === 0) {
		if (sumSquaredDistances <= matrix[cluster][sortedIdx]) {
		matrix[cluster][sortedIdx] = sumSquaredDistances;
		backtrackMatrix[cluster][sortedIdx] = j;
		}
		} else if (sumSquaredDistances + matrix[cluster - 1][j - 1] < matrix[cluster][sortedIdx]) {
		matrix[cluster][sortedIdx] = sumSquaredDistances + matrix[cluster - 1][j - 1];
		backtrackMatrix[cluster][sortedIdx] = j;
		}
		}
		}
		}
		}
		}

		// The real work of Ckmeans clustering happens in the matrix generation:
		@@ -197,3 +256,3 @@ // the generated matrices encode all possible clustering combinations, and
		// and moves the cluster target with the loop.
		for (cluster = backtrackMatrix.length - 1; cluster >= 0; cluster--) {
		for (var cluster = backtrackMatrix.length - 1; cluster >= 0; cluster--) {
		var clusterLeft = backtrackMatrix[cluster][clusterRight];
		@@ -200,0 +259,0 @@

d3-scale-cluster - npm Package Compare versions

New alerts

Fixed alerts

Improved metrics