vineyard-ml: Accelerating Data Science Pipelines
Vineyard has been tightly integrated with the data preprocessing pipelines in
widely-adopted machine learning frameworks like PyTorch, TensorFlow, and MXNet.
Shared objects in vineyard, e.g., vineyard::Tensor, vineyard::DataFrame,
vineyard::Table, etc., can be directly used as the inputs of the training
and inference tasks in these frameworks.
Examples
Datasets
The following example shows how a DataFrame in vineyard can be used as the input
of a Dataset for PyTorch:
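import os

import numpy as np
import pandas as pd
import torch
import vineyard

# connect to vineyard through the IPC socket given in the environment
client = vineyard.connect(os.environ['VINEYARD_IPC_SOCKET'])

# build a dummy dataframe and put it into vineyard
df = pd.DataFrame({
    # a multi-dimensional array stored as a single column
    'data': vineyard.data.dataframe.NDArrayArray(np.random.rand(1000, 10)),
    'label': np.random.rand(1000)
})
object_id = client.put(df)

# get the dataframe back as a torch dataset
from vineyard.contrib.ml.torch import torch_context
with torch_context():
    ds = client.get(object_id)

# wrap the dataset as a datapipe
from vineyard.contrib.ml.torch import datapipe
pipe = datapipe(ds)

# iterate over the datapipe, e.g., in a training loop
for data, label in pipe:
    pass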
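A dataset obtained this way can also be batched with a standard DataLoader. The sketch below is only an illustration: it assumes that ds behaves like a regular torch.utils.data.Dataset yielding (data, label) pairs, and it builds a local TensorDataset with the same shapes as a stand-in for client.get(object_id), so it runs without a vineyard server.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in for `ds = client.get(object_id)`: same shapes as the dataframe above.
features = torch.rand(1000, 10)
labels = torch.rand(1000)
ds = TensorDataset(features, labels)

# Batch and shuffle the samples as in an ordinary training loop.
loader = DataLoader(ds, batch_size=64, shuffle=True)

num_batches = 0
num_samples = 0
for data, label in loader:
    # data has shape (batch, 10) and label has shape (batch,)
    num_batches += 1
    num_samples += data.shape[0]
```

With 1000 samples and a batch size of 64, the last batch is smaller than the others; pass drop_last=True to DataLoader if the training step needs a fixed batch size.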
PyTorch Modules
The following example shows how to use vineyard to share PyTorch modules between processes:
import os

import torch
import torch.nn.functional as F
from torch import nn
import vineyard

# connect to vineyard through the IPC socket given in the environment
client = vineyard.connect(os.environ['VINEYARD_IPC_SOCKET'])

# define a dummy model
class Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 20, 5)
        self.conv2 = nn.Conv2d(20, 20, 5)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        return F.relu(self.conv2(x))

model = Model()

# put the model into vineyard
from vineyard.contrib.ml.torch import torch_context
with torch_context():
    object_id = client.put(model)

# get the state dict from vineyard and load it into a new model
model = Model()
with torch_context():
    state_dict = client.get(object_id)
model.load_state_dict(state_dict, assign=True)
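One way to check that a module restored on the receiving side matches the original is to compare the two state dicts. The sketch below is an illustration only: it uses a local state_dict() call as a stand-in for client.get(object_id), so it runs without a vineyard server, and weights_match is a hypothetical helper that is not part of vineyard or PyTorch.

```python
import torch
import torch.nn.functional as F
from torch import nn


class Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 20, 5)
        self.conv2 = nn.Conv2d(20, 20, 5)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        return F.relu(self.conv2(x))


def weights_match(a: nn.Module, b: nn.Module) -> bool:
    """Hypothetical helper: True if both modules have identical state dicts."""
    sa, sb = a.state_dict(), b.state_dict()
    return sa.keys() == sb.keys() and all(torch.equal(sa[k], sb[k]) for k in sa)


source = Model()
restored = Model()

# Stand-in for `state_dict = client.get(object_id)`.
state_dict = source.state_dict()
restored.load_state_dict(state_dict)

# Both models should now hold the same weights and produce the same output.
x = torch.rand(1, 1, 28, 28)
with torch.no_grad():
    same_output = torch.allclose(source(x), restored(x))
same_weights = weights_match(source, restored)
```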
Compression is enabled by default in the vineyard client. It may not be effective for torch modules, in which case you can disable it by passing the client to torch_context, as follows:
from vineyard.contrib.ml.torch import torch_context

with torch_context(client):
    object_id = client.put(model)

with torch_context(client):
    state_dict = client.get(object_id)
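To see why compression may bring little benefit for model weights, the following sketch compresses a buffer of random float32 values and an all-zero buffer of the same size. It is an illustration under stated assumptions, not a measurement of vineyard: zlib from the standard library is used as a stand-in compressor, the Gaussian values only imitate freshly initialised weights, and compression_ratio is a hypothetical helper.

```python
import array
import random
import zlib


def compression_ratio(payload: bytes) -> float:
    """Compressed size divided by original size (lower means better compression)."""
    return len(zlib.compress(payload)) / len(payload)


random.seed(0)
n = 100_000

# float32 values imitating freshly initialised weights: the mantissa bits are
# close to random, which leaves little redundancy for a compressor to exploit.
weights = array.array("f", (random.gauss(0.0, 0.05) for _ in range(n))).tobytes()

# An all-zero buffer of the same size, for contrast.
zeros = bytes(len(weights))

weight_ratio = compression_ratio(weights)
zero_ratio = compression_ratio(zeros)
print(f"weights: {weight_ratio:.2f}, zeros: {zero_ratio:.4f}")
```

When the ratio for the weights stays close to 1, the time spent compressing buys almost no reduction in transferred bytes, which is the situation where disabling compression is worth trying.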
In addition, if you want to distribute the torch module across all vineyard workers so that the network bandwidth of every worker is used, you can enable the spread option as follows:
from vineyard.contrib.ml.torch import torch_context

with torch_context(client, spread=True):
    object_id = client.put(model)

with torch_context(client):
    state_dict = client.get(object_id)
Reference and Implementation
For more details about vineyard itself, please refer to the Vineyard project.