Readme
This package contains some tools to integrate the `Spark computing framework <https://spark.apache.org/>`_ with the popular `scikit-learn machine learning library <https://scikit-learn.org/stable/>`_. Among other things, it can:
- train and evaluate multiple scikit-learn models in parallel, as a distributed analog to the `multicore implementation <https://pythonhosted.org/joblib/parallel.html>`_ included by default in scikit-learn
- convert Spark DataFrames seamlessly into numpy ``ndarray`` or sparse matrices

It focuses on problems that have a small amount of data and that can be run in parallel.
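The multicore implementation mentioned above is scikit-learn's built-in joblib parallelism, exposed through the ``n_jobs`` parameter; spark-sklearn replaces that single-machine worker pool with a Spark cluster. As a point of reference, a minimal single-machine sketch (the dataset and estimator choices are illustrative):

.. code:: python

    from sklearn import svm, datasets
    from sklearn.model_selection import cross_val_score

    # n_jobs=2 runs the cross-validation folds on two local cores via joblib
    iris = datasets.load_iris()
    scores = cross_val_score(svm.SVC(gamma='auto'), iris.data, iris.target,
                             cv=3, n_jobs=2)
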
For small datasets, it distributes the search for estimator parameters (``GridSearchCV`` in scikit-learn), using Spark. For datasets that do not fit in memory, we recommend using the distributed implementation in `Spark MLlib <https://spark.apache.org/docs/latest/api/python/pyspark.mllib.html>`_.
This package distributes simple tasks like grid-search cross-validation. It does not distribute individual learning algorithms (unlike Spark MLlib).
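Grid search distributes naturally because every parameter combination is an independent fit. scikit-learn's ``ParameterGrid`` (used here purely for illustration) makes those independent units explicit:

.. code:: python

    from sklearn.model_selection import ParameterGrid

    # The same grid used in the example below: 2 kernels x 2 values of C
    parameters = {'kernel': ('linear', 'rbf'), 'C': [1, 10]}
    combos = list(ParameterGrid(parameters))
    # Each of the 4 dicts is a self-contained task a worker can fit on its own
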
This package is available on PyPI::

    pip install spark-sklearn
This project is also available as a `Spark package <https://spark-packages.org/package/databricks/spark-sklearn>`_.
The developer version has the following requirements:

- Spark, which may be downloaded from the `Spark website <https://spark.apache.org/>`_. In order to use this package, you need to use the pyspark interpreter or another Spark-compliant python interpreter. See the `Spark guide <https://spark.apache.org/docs/latest/programming-guide.html#overview>`_ for more details.
- `nose <https://nose.readthedocs.org>`_ (testing dependency only)

If you want to use a developer version, you just need to make sure the ``python/`` subdirectory is in the ``PYTHONPATH`` when launching the pyspark interpreter::

    PYTHONPATH=$PYTHONPATH:./python:$SPARK_HOME/bin/pyspark
You can directly run tests::

    cd python && ./run-tests.sh

This requires the environment variable ``SPARK_HOME`` to point to your local copy of Spark.
Here is a simple example that runs a grid search with Spark. See the `Installation <#installation>`_ section on how to install the package.
.. code:: python

    from sklearn import svm, datasets
    from spark_sklearn import GridSearchCV

    # sc is the SparkContext, provided automatically by the pyspark interpreter
    iris = datasets.load_iris()
    parameters = {'kernel': ('linear', 'rbf'), 'C': [1, 10]}
    svr = svm.SVC(gamma='auto')
    clf = GridSearchCV(sc, svr, parameters)
    clf.fit(iris.data, iris.target)
This classifier can be used as a drop-in replacement for any scikit-learn classifier, with the same API.
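Since the API mirrors scikit-learn's, the equivalent single-machine search (no Spark, no ``sc`` argument) is scikit-learn's own ``GridSearchCV``; a minimal sketch for comparison:

.. code:: python

    from sklearn import svm, datasets
    from sklearn.model_selection import GridSearchCV

    iris = datasets.load_iris()
    parameters = {'kernel': ('linear', 'rbf'), 'C': [1, 10]}
    svr = svm.SVC(gamma='auto')
    clf = GridSearchCV(svr, parameters)  # note: no SparkContext argument
    clf.fit(iris.data, iris.target)

After fitting, attributes such as ``best_params_`` and methods such as ``predict`` behave the same way in both versions.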
`API documentation <http://databricks.github.io/spark-sklearn-docs>`_ is currently hosted on GitHub pages. To build the docs yourself, see the instructions in ``docs/``.
.. image:: https://travis-ci.org/databricks/spark-sklearn.svg?branch=master
   :target: https://travis-ci.org/databricks/spark-sklearn