snowflake-ml-python
The machine learning client library used for interacting with Snowflake to build machine learning solutions.
Snowpark ML is a set of tools including SDKs and underlying infrastructure to build and deploy machine learning models. With Snowpark ML, you can pre-process data, train, manage and deploy ML models all within Snowflake, using a single SDK, and benefit from Snowflake’s proven performance, scalability, stability and governance at every stage of the Machine Learning workflow.
The Snowpark ML Python SDK provides a number of APIs to support each stage of an end-to-end Machine Learning development and deployment process, and includes two key components.
Snowpark ML Development provides a collection of Python APIs enabling efficient ML model development directly in Snowflake:
Modeling API (snowflake.ml.modeling) for data preprocessing, feature engineering and model training in Snowflake. This includes the snowflake.ml.modeling.preprocessing module for scalable data transformations on large data sets, utilizing the compute resources of underlying Snowpark Optimized High Memory Warehouses, and a large collection of ML model development classes based on sklearn, xgboost, and lightgbm. A minimal usage sketch follows this list.
Framework Connectors: Optimized, secure and performant data provisioning for PyTorch and TensorFlow frameworks in their native data loader formats.
FileSet API: FileSet provides a Python fsspec-compliant API for materializing data into a Snowflake internal stage from a query or Snowpark DataFrame, along with a number of convenience APIs.
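As a minimal, hedged sketch of the Modeling API (an existing Snowpark session is assumed; the table and column names here are hypothetical, not taken from the documentation):

from snowflake.ml.modeling.preprocessing import StandardScaler
from snowflake.ml.modeling.xgboost import XGBClassifier

# Read training data as a Snowpark DataFrame (table name is hypothetical).
df = session.table("MY_DB.MY_SCHEMA.TRAINING_DATA")

# Scale the numeric features inside Snowflake.
scaler = StandardScaler(
    input_cols=["FEATURE1", "FEATURE2"],
    output_cols=["FEATURE1_SCALED", "FEATURE2_SCALED"],
)
df_scaled = scaler.fit(df).transform(df)

# Train and score an XGBoost classifier using warehouse compute.
clf = XGBClassifier(
    input_cols=["FEATURE1_SCALED", "FEATURE2_SCALED"],
    label_cols=["LABEL"],
    output_cols=["PREDICTION"],
)
clf.fit(df_scaled)
predictions = clf.predict(df_scaled)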
Snowflake MLOps contains a suite of tools and objects to support the ML development cycle. It complements the Snowpark ML Development API and provides an end-to-end path from development to deployment within Snowflake. Currently, the API consists of the model Registry for managing and deploying models within Snowflake.
If you don't have a Snowflake account yet, you can sign up for a 30-day free trial account.
Follow the installation instructions in the Snowflake documentation.
Python versions 3.9 to 3.11 are supported. You can use miniconda or anaconda to create a Conda environment (recommended), or virtualenv to create a virtual environment.
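For example, a Conda environment for a supported Python version can be created and activated as follows (the environment name is arbitrary):
conda create --name snowpark-ml python=3.10
conda activate snowpark-ml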
The Snowflake Conda Channel contains the official snowpark ML package releases.
The recommended approach is to install snowflake-ml-python from this conda channel:
conda install \
-c https://repo.anaconda.com/pkgs/snowflake \
--override-channels \
snowflake-ml-python
See the developer guide for installation instructions.
The latest version of the snowflake-ml-python package is also published in a conda channel in this repository. Package versions in this channel may not yet be present in the official Snowflake conda channel.
Install snowflake-ml-python from this channel with the following command, being sure to replace <version_specifier> with the desired version, e.g. 1.0.10:
conda install \
-c https://raw.githubusercontent.com/snowflakedb/snowflake-ml-python/conda/releases/ \
-c https://repo.anaconda.com/pkgs/snowflake \
--override-channels \
snowflake-ml-python==<version_specifier>
Note that until a snowflake-ml-python package version is available in the official Snowflake conda channel, there may be compatibility issues. Server-side functionality that snowflake-ml-python depends on may not yet be released.
Install cosign. This example uses the Go-based installation (installing-cosign-with-go).
Download the package file from a repository such as PyPI.
Download the signature files from the release tag.
Verify the signature on projects signed using the Jenkins job:
cosign verify-blob snowflake_ml_python-1.7.0.tar.gz --key snowflake-ml-python-1.7.0.pub --signature resources.linux.snowflake_ml_python-1.7.0.tar.gz.sig
NOTE: Version 1.7.0 is used as an example here. Please choose the latest version.
Changelog
block option in ModelVersion.create_service() set to True by default.
Pandas extension dtypes (pandas.StringDType(), pandas.BooleanDType(), etc.) are now supported in model signature inference.
snowflake.ml.data.* module exports in wheel.
snowflake.ml.dataset.* module exports in wheel.
tf_keras.Model is not recognized as a keras model when logging.
enable_monitoring set to False by default. This will gate access to preview features of Model Monitoring.
show_model_monitors Registry method. This feature is still in Private Preview.
pd.Series in input and output data.
add_monitor Registry method. This feature is still in Private Preview.
resume and suspend ModelMonitor. This feature is still in Private Preview.
get_monitor Registry method. This feature is still in Private Preview.
delete_monitor Registry method. This feature is still in Private Preview.
to_torch_dataset and to_torch_datapipe add a dimension for scalar data. This allows for more seamless integration with PyTorch DataLoader, which creates batches by stacking inputs of each batch.
Example:
ds = connector.to_torch_dataset(shuffle=False, batch_size=3)
Input: "col1": [10, 11, 12]
Input: "col2": [[0, 100], [1, 110], [2, 200]]
Model Registry: External access integrations are optional when creating a model inference service in Snowflake >= 8.40.0.
Model Registry: Deprecate build_external_access_integration with build_external_access_integrations in ModelVersion.create_service().
log_model API to accept both signature and sample_input_data parameters.
to_torch_dataset and to_torch_datapipe.

import json
import pandas as pd
from snowflake.ml.model import custom_model

# model1 is assumed to be a previously trained model object exposing predict().
mc = custom_model.ModelContext(
    config='local_model_dir/config.json',
    m1=model1,
)

class ExamplePipelineModel(custom_model.CustomModel):
    def __init__(self, context: custom_model.ModelContext) -> None:
        super().__init__(context)
        # Read the bias term from the config file referenced by the context.
        v = open(self.context['config']).read()
        self.bias = json.loads(v)['bias']

    @custom_model.inference_api
    def predict(self, input: pd.DataFrame) -> pd.DataFrame:
        # Delegate to the wrapped model, then apply the bias correction.
        model_output = self.context['m1'].predict(input)
        return pd.DataFrame({'output': model_output + self.bias})
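For illustration only (not from the original release notes), the custom model defined above could be exercised locally roughly as follows; model1 is assumed to be any fitted estimator exposing predict():

# Hypothetical local usage of the ExamplePipelineModel defined above.
pipeline_model = ExamplePipelineModel(mc)
preds = pipeline_model.predict(pd.DataFrame({'feature': [1.0, 2.0]}))
print(preds)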
eps argument is now ignored.
None sized batch to to_torch_dataset for better interoperability with PyTorch DataLoader.
signatures and sample_input_data at the same time to capture background data for explainability and data lineage.
ModelVersion.run with service.
"not a valid remote uri" error when logging mlflow models.
ModelVersion.run is called in a nested way.
log_model failure when local package version contains parts other than base version.
sample_weights were not being applied to search estimators.
DataConnector.to_pandas() performance when loading from Snowpark DataFrames.
log_model.
Modeling: Support XGBoost version that is larger than 2.
Data: Fix multiple epoch iteration over DataConnector.to_torch_datapipe() DataPipes.
Generic: Fix a bug where an invalid name provided to an argument expecting a fully qualified name was parsed incorrectly; an exception is now raised correctly.
Model Explainability: Handle explanations for multiclass XGBoost classification models.
Model Explainability: Workarounds and better error handling for XGB>2.1.0 not working with SHAP==0.42.1.
DataConnector and DataSource to snowflake.ml.data.
batch_size and drop_last_batch arguments to DataConnector.to_torch_dataset().
run when function_name is not mentioned and model has multiple target methods.
DataFrame ingestion with ArrowIngestor.
set_params to set the parameters of the underlying sklearn estimator, if the snowflake-ml model has been fit.
snowflake.ml.data.ingestor_utils module with utility functions helpful for DataIngestor implementations.
to_torch_dataset() connector to DataConnector to replace deprecated DataPipe.
enable_explainability set to True by default for XGBoost, LightGBM and CatBoost as PuPr feature.
enable_explainability when registering SHAP supported sklearn models.
SimpleImputer can impute integer columns with integer values.
ModelVersion.run.
ExampleHelper to help with loading source data to simplify public notebooks.
enable_explainability when registering XGBoost models as a pre-PuPr feature.
update_entity().
enable_explainability when registering Catboost models as a pre-PuPr feature.

from snowflake.ml.modeling._internal.snowpark_implementations import (
    distributed_hpo_trainer,
)
distributed_hpo_trainer.ENABLE_EFFICIENT_MEMORY_USAGE = False

enable_explainability when registering LightGBM models as a pre-PuPr feature.
snowflake.ml.data preview module which contains data reading utilities like DataConnector.
DataConnector provides efficient connectors from Snowpark DataFrame and Snowpark ML Dataset to external frameworks like PyTorch, TensorFlow, and Pandas. Create DataConnector instances using the classmethod constructors DataConnector.from_dataset() and DataConnector.from_dataframe().
DataConnector.from_sources() classmethod constructor for constructing from DataSource objects.
ingestor_class arg to DataConnector classmethod constructors for easier DataIngestor injection.
DatasetReader now subclasses new DataConnector class.
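As a hedged sketch of the DataConnector workflow described above (an existing Snowpark session is assumed; the table name and argument values are hypothetical):

from snowflake.ml.data import DataConnector

# Build a connector from a Snowpark DataFrame (table name is hypothetical).
df = session.table("MY_DB.MY_SCHEMA.TRAINING_DATA")
dc = DataConnector.from_dataframe(df)

# Materialize into the framework of choice.
pdf = dc.to_pandas()                                         # pandas DataFrame
torch_ds = dc.to_torch_dataset(batch_size=32, shuffle=True)  # PyTorch-style dataset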
limit arg to DatasetReader.to_pandas().
OrdinalEncoder with categories as a dictionary and a pandas DataFrame.
OneHotEncoder with categories as a dictionary and a pandas DataFrame.
device_map and device when loading huggingface pipeline models.
set_alias method to ModelVersion instance to set an alias to model version.
unset_alias method to ModelVersion instance to unset an alias to model version.
partitioned_inference_api allowing users to create partitioned inference functions in registered models. Enable model inference methods with table functions with vectorized process methods in registered models.
list_feature_views().
update_feature_view() supports updating description.
refresh_feature_view().
get_refresh_history().
generate_training_set() API for generating table-backed feature snapshots.
DeprecationWarning for generate_dataset(..., output_type="table").
update_feature_view() supports updating description.
refresh_feature_view().
get_refresh_history().
categories argument.
categories argument.
Pipeline, GridSearchCV, SimpleImputer, and RandomizedSearchCV.
ModelVersion.run method in Stored Procedure.
DatasetVersion.label_cols and DatasetVersion.exclude_cols properties.
import snowflake.ml.modeling.parameters.enable_anonymous_sproc cannot be imported due to package dependency error.
snowflake.connector.errors.DataError: Query Result did not match expected number of rows when accessing DatasetVersion properties when case insensitive SHOW VERSIONS IN DATASET check matches multiple version names.
snowpark.DataFrame transformations.
snowflake.ml.feature_store.setup_feature_store() API to assist Feature Store RBAC setup.
output_type argument to FeatureStore.generate_dataset() to allow generating data snapshots as Datasets or Tables.
log_model, get_model, delete_model now support fully qualified names.
import snowflake.ml.modeling.parameters.enable_anonymous_sproc  # noqa: F401
fit_transform for all estimators is changed. First, it covers every estimator that contains this function; second, the output is a union of pandas DataFrame and Snowpark DataFrame.
snowflake.ml.registry.artifact and related snowflake.ml.model_registry.ModelRegistry APIs have been removed.
snowflake.ml.registry.artifact module.
ModelRegistry.log_artifact(), ModelRegistry.list_artifacts(), ModelRegistry.get_artifact().
artifacts argument from ModelRegistry.log_model().
snowflake.ml.dataset.Dataset has been redesigned to be backed by Snowflake Dataset entities.
Datasets can be created with Dataset.create() and existing Datasets may be loaded with Dataset.load().
Datasets now maintain an immutable selected_version state. The Dataset.create_version() and Dataset.load_version() APIs return new Dataset objects with the requested selected_version state.
dataset.create_from_dataframe() and dataset.load_dataset() convenience APIs as a shortcut to creating and loading Datasets with a pre-selected version.
Dataset.materialized_table and Dataset.snapshot_table no longer exist, with Dataset.fully_qualified_name as the closest equivalent.
Dataset.df no longer exists. Instead, use DatasetReader.read.to_snowpark_dataframe().
Dataset.owner has been moved to Dataset.selected_version.owner.
Dataset.desc has been moved to DatasetVersion.selected_version.comment.
Dataset.timestamp_col, Dataset.label_cols, Dataset.feature_store_metadata, and Dataset.schema_version have been removed.
FeatureStore.generate_dataset argument list has been changed to match the new snowflake.ml.dataset.Dataset definition:
materialized_table has been removed and replaced with name and version.
name moved to first positional argument.
save_mode has been removed as merge behavior is no longer supported. The new behavior is always errorifexists.
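A hedged sketch of the redesigned Dataset API described above (an existing Snowpark session is assumed; names are hypothetical and exact signatures may differ between versions):

from snowflake.ml import dataset

# Load an existing Dataset version and read it back; argument order is an assumption.
ds = dataset.load_dataset(session, "MY_DB.MY_SCHEMA.MY_DATASET", "v1")
sp_df = ds.read.to_snowpark_dataframe()   # Snowpark DataFrame view of the version
pdf = ds.read.to_pandas()                 # pandas DataFrame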
Change feature view version type from str to FeatureViewVersion. It is a restricted string literal.
Remove as_dataframe arg from FeatureStore.list_feature_views(); it now always returns the result as a DataFrame.
Combines a few metadata tags into a new tag: SNOWML_FEATURE_VIEW_METADATA. This makes previously created feature views unreadable by the new SDK.
export method to ModelVersion instance to export model files.
load method to ModelVersion instance to load the underlying object from the model.
Model.rename method to Model instance to rename or move a model.
Dataset.read.to_snowpark_dataframe().
Dataset.read.to_pandas().
Dataset.read.to_torch_datapipe() and Dataset.read.to_tf_dataset() respectively.
fsspec style file integration using Dataset.read.files() and Dataset.read.filesystem().
catboost model (catboost.CatBoostClassifier, catboost.CatBoostRegressor).
lightgbm model (lightgbm.Booster, lightgbm.LightGBMClassifier, lightgbm.LightGBMRegressor).
save_model would be added to model signature in SnowML models.
apply method is no longer logged by default when logging an xgboost model. If that is required, it can be specified manually when logging the model with log_model(..., options={"target_methods": ["apply", ...]}).
block=true becomes default.
sentence-transformers model (sentence_transformers.SentenceTransformer).
snowflake.ml.fileset.sfcfs.SFFileSystem can now be used in UDFs and stored procedures.
code_paths when log_model cannot be correctly imported.
strict_input_validation=True when calling run.
relax_version=True when logging a model instead of using the specific local dependency versions. This improves dependency versioning by using versions available in Snowflake. To switch back to the previous behavior and use specific local dependency versions, specify relax_version=False when calling log_model.
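A minimal, hedged illustration of passing relax_version through the log_model options (the registry, model and names here are hypothetical):

# Hypothetical usage; `registry` is a snowflake.ml.registry.Registry instance.
mv = registry.log_model(
    model,
    model_name="my_model",
    version_name="v1",
    options={"relax_version": False},  # keep exact local dependency versions
)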
fit_predict for all estimators is changed. First, it covers every estimator that contains this function; second, the output is a union of pandas DataFrame and Snowpark DataFrame.
snowflake.ml.fileset.sfcfs.SFFileSystem can now be serialized with pickle.
pip_requirements argument.
average="samples".
FeatureStore.suspend_feature_view and FeatureStore.resume_feature_view no longer mutate the input feature view argument. The updated status is only reflected in the returned feature view object.
score_samples method for all the classes, including Pipeline, GridSearchCV, RandomizedSearchCV, PCA, IsolationForest, ...
snowflake.ml.model.models.huggingface_pipeline.HuggingFacePipelineModel object, the following endpoints are required to be allowed: huggingface.com:80, huggingface.com:443, huggingface.co:80, huggingface.co:443.
relax_version option is available in the options argument when logging the model.
snowflake.ml.registry.Registry providing similar APIs as the old one but working with the new MODEL object in Snowflake SQL. Also, we are providing snowflake.ml.model.Model and snowflake.ml.model.ModelVersion to represent a model and a specific version of a model.
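A hedged sketch of the Registry, Model and ModelVersion objects mentioned above (session, database, schema, model names and the test DataFrame are hypothetical):

from snowflake.ml.registry import Registry

registry = Registry(session=session, database_name="ML_DB", schema_name="MODELS")

m = registry.get_model("my_model")          # snowflake.ml.model.Model
mv = m.version("v1")                        # snowflake.ml.model.ModelVersion
predictions = mv.run(test_df, function_name="predict")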
fit_predict method in AgglomerativeClustering, DBSCAN, and OPTICS classes; fit_transform method in MDS, SpectralEmbedding and TSNE class.
snowflake.ml.registry.model_registry.ModelRegistry has been deprecated starting from version 1.2.0. It will stay in the Private Preview phase. For future implementations, kindly utilize snowflake.ml.registry.Registry, except when specifically required. The old model registry will be removed once all its primary functionalities are fully integrated into the new registry.
predict with Snowpark DataFrame, both inferred and normalized column names are accepted.
precision_score metric.
predict target method on registered models is now compatible with unsupervised estimators.
batch_size error when deploying a model other than Hugging Face Pipeline and LLM with GPU on SPCS.
conda-forge channel is now automatically added to channel lists when deploying to SPCS.
relax_version will not strip all version specifiers; instead it will relax ==x.y.z specifiers to >=x.y,<(x+1).
snowflake.ml.model.models.huggingface_pipeline.HuggingFacePipelineModel object, versions of locally installed libraries won't be picked up as dependencies of models; instead some pre-defined dependencies are picked up to improve user experience.
kneighbors.
deploy will now return Deployment for deployment information.
Deployment details will contain image_name, service_spec and service_function_sql.
SnowparkSQLUnexpectedAliasException in inference.
zipimport.
platform argument in the deploy function.
np.nan.
token argument when using snowflake.ml.model.models.huggingface_pipeline.HuggingFacePipelineModel with transformers < 4.32.0 is not effective.
transformers.pipeline that does not have a tokenizer.
pandas.io.json.json_normalize.
create_if_not_exists parameter in constructor.
XGBoost (xgboost.XGBModel and xgboost.Booster), PyTorch (torch.nn.Module and torch.jit.ScriptModule) and TensorFlow (tensorflow.Module and tensorflow.keras.Model) models to Snowpark Container Services.
Sequence of built-in types, Sequence of numpy.ndarray, Sequence of torch.Tensor, Sequence of tensorflow.Tensor and Sequence of tensorflow.Tensor can be used instead of only List of them.
get_training_dataset API.
transformers.Pipeline and our wrapper (snowflake.ml.model.models.huggingface_pipeline.HuggingFacePipelineModel) to it. Using the wrapper to specify configurations, the model for the pipeline will be loaded dynamically when deploying. Currently, the following tasks are supported to log without manually specifying model signatures.
log_model() now returns a ModelReference object instead of a model ID.
target method only, the target_method argument can be omitted.
embed_local_ml_library option will be set as True automatically if not.
keep_order and output_with_input_features in the deploy options have been removed. Now the behavior is controlled by the type of the input when calling model.predict(). If the input is a pandas.DataFrame, the behavior will be the same as keep_order=True and output_with_input_features=False before. If the input is a snowpark.DataFrame, the behavior will be the same as keep_order=False and output_with_input_features=True before.
For PyTorch (torch.nn.Module and torch.jit.ScriptModule) and TensorFlow (tensorflow.Module and tensorflow.keras.Model) models, we no longer accept models whose input is a list of tensors and output is a list of tensors. Instead, we now accept models whose input is one or more tensors as positional arguments, and whose output is a tensor or a tuple of tensors. The input and output dataframes when predicting stay the same as before; that is, every column is an array feature and contains a tensor.
create_model_registry().
private_key_path.
tensorflow.Module.
mlflow.pyfunc.PyFuncModel.
torch.nn.Module and torch.jit.ScriptModule.
create_model_registry contains special characters, the model registry cannot be created.
get_model_description returns with additional quotes.
predict method that contains a column with Snowflake NUMBER(precision, scale) data type where scale = 0 will not lead to error, and will now be correctly recognized as INT64 data type in model signature.
_use_local_snowml parameter in options of deploy() has been removed.
An embed_local_ml_library parameter, defaulting to False, has been added to the options of log_model(). With this set to False (default), the version of the local snowflake-ml-python library will be recorded and used when deploying the model. With this set to True, the local snowflake-ml-python library will be embedded into the logged model and will be used when you load or deploy the model.
code_paths has been added to the arguments of log_model() for users to specify additional code paths to be imported when loading and deploying the model.
options has been added to the arguments of log_model() to specify any additional options when saving the model.
accuracy_score() now works when given label column names are lists of a single value.
accuracy_score(), confusion_matrix(), precision_recall_fscore_support(), precision_score() methods move from respective modules to metrics.classification.
get_model_history() method has been enhanced to include the history of model deployment.
A False flag named replace_udf has been added to the options of deploy(). Setting this to True allows overwriting an existing UDF with the same name when deploying.
permanent has been added to the arguments of deploy(). Setting this to True allows the creation of a permanent deployment without needing to specify the UDF location.
list_deployments() has been added to enumerate all permanent deployments originating from a specific model.
get_deployment() has been added to fetch a deployment by its deployment name.
delete_deployment() has been added to remove an existing permanent deployment.
predict() method moves from Registry to ModelReference.
_snowml_wheel_path parameter in options of deploy() is replaced with _use_local_snowml with a default value of False. Setting this to True will have the same effect of uploading local SnowML code when executing the model in the warehouse.
id field from ModelReference constructor.
snowflake.ml.modeling.preprocessing and snowflake.ml.modeling.metrics.
get_sklearn_object() method is renamed to to_sklearn(), to_xgboost(), and to_lightgbm() for respective native models.
deploy() & predict() methods now correctly escape identifiers.

FAQs
We found that snowflake-ml-python demonstrated a healthy version release cadence and project activity because the last version was released less than a year ago. It has 3 open source maintainers collaborating on the project.