The gcp-pal library provides a set of utilities for interacting with Google Cloud Platform (GCP) services, streamlining the process of implementing GCP functionalities within your Python applications.
The utilities are designed to work with the google-cloud Python libraries, providing a more user-friendly and intuitive interface for common tasks.
| Module | Python Class |
|---|---|
| Firestore | gcp_pal.Firestore |
| BigQuery | gcp_pal.BigQuery |
| Storage | gcp_pal.Storage |
| Cloud Functions | gcp_pal.CloudFunctions |
| Cloud Run | gcp_pal.CloudRun |
| Docker | gcp_pal.Docker |
| Logging | gcp_pal.Logging |
| Secret Manager | gcp_pal.SecretManager |
| Cloud Scheduler | gcp_pal.CloudScheduler |
| Project | gcp_pal.Project |
| Dataplex | gcp_pal.Dataplex |
| Artifact Registry | gcp_pal.ArtifactRegistry |
| PubSub | gcp_pal.PubSub |
| Request | gcp_pal.Request |
| Schema | gcp_pal.Schema |
| Parquet | gcp_pal.Parquet |
The package is available on PyPI as gcp-pal. To install with pip:
pip install gcp-pal
The library has module-specific dependencies. These can be installed via pip install gcp-pal[ModuleName], e.g.:
pip install gcp-pal[BigQuery]
# Installing 'google-cloud-bigquery'
pip install gcp-pal[CloudRun]
# Installing 'google-cloud-run' and 'docker'
To install all optional dependencies:
pip install gcp-pal[all]
The modules are also set up to notify the user if any required libraries are missing. For example, when attempting to use the Firestore module:
from gcp_pal import Firestore
Firestore()
# ImportError: Module 'Firestore' requires 'google.cloud.firestore' (PyPI: 'google-cloud-firestore') to be installed.
This lets the user know that the google-cloud-firestore package is required to use the Firestore module.
Before you can start using the gcp-pal library with Firestore or any other GCP services, make sure you either have set up your GCP credentials properly or have the necessary permissions to access the services you want to use:
gcloud auth application-default login
And specify the project ID to be used as the default for all API requests:
gcloud config set project PROJECT_ID
You can also specify the default variables such as project ID and location using environmental variables. The reserved variables are GCP_PROJECT_ID and GCP_LOCATION:
export GCP_PROJECT_ID=project-id
export GCP_LOCATION=us-central1
The order of precedence is as follows (illustrated by the sketch after the list):
1. Keyword arguments (e.g. BigQuery(project="project-id"))
2. Environmental variables (e.g. export GCP_PROJECT_ID=project-id)
3. Default project set in gcloud (e.g. gcloud config set project project-id)
4. None
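As a minimal sketch of this precedence (the project IDs here are placeholders, not real projects), a keyword argument overrides whatever the environmental variable or gcloud default would otherwise supply:
import os
os.environ["GCP_PROJECT_ID"] = "env-project"

from gcp_pal import BigQuery

# The keyword argument takes precedence over GCP_PROJECT_ID:
bq = BigQuery(project="kwarg-project", dataset="dataset1")
# Without the project keyword, "env-project" would be used as the default project.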
The Firestore module in the gcp-pal library allows you to perform read and write operations on Firestore documents and collections.
First, import the Firestore class from the gcp_pal module:
from gcp_pal import Firestore
To write data to a Firestore document, create a dictionary with your data, specify the path to your document, and use the write method:
data = {
"field1": "value1",
"field2": "value2"
}
path = "collection/document"
Firestore(path).write(data)
To read a single document from Firestore, specify the document's path and use the read method:
path = "collection/document"
document = Firestore(path).read()
print(document)
# Output: {'field1': 'value1', 'field2': 'value2'}
To read all documents within a specific collection, specify the collection's path and use the read method:
path = "collection"
documents = Firestore(path).read()
print(documents)
# Output: {'document': {'field1': 'value1', 'field2': 'value2'}}
The Firestore module also supports writing and reading Pandas DataFrames, preserving their structure and data types:
import pandas as pd
# Example DataFrame
df = pd.DataFrame({
"field1": ["value1"],
"field2": ["value2"]
})
path = "collection/document"
Firestore(path).write(df)
read_df = Firestore(path).read()
print(read_df)
# Output:
# field1 field2
# 0 value1 value2
To list all documents and collections within a Firestore database, use the ls method, similar to bash:
colls = Firestore().ls()
print(colls)
# Output: ['collection']
docs = Firestore("collection").ls()
print(docs)
# Output: ['document1', 'document2']
The BigQuery module in the gcp-pal library allows you to perform read and write operations on BigQuery datasets and tables.
Import the BigQuery class from the gcp_pal module:
from gcp_pal import BigQuery
To list all objects (datasets and tables) within a BigQuery project, use the ls method, similar to bash:
datasets = BigQuery().ls()
print(datasets)
# Output: ['dataset1', 'dataset2']
tables = BigQuery(dataset="dataset1").ls()
print(tables)
# Output: ['table1', 'table2']
To create an object (dataset or table) within a BigQuery project, initialize the BigQuery class with the object's path and use the create method:
BigQuery(dataset="new-dataset").create()
# Output: Dataset "new-dataset" created
BigQuery("new-dataset2.new-table").create(schema=schema)
# Output: Dataset "new-dataset2" created, table "new-dataset2.new-table" created
To create a table from a Pandas DataFrame, pass the DataFrame to the create method:
df = pd.DataFrame({
"field1": ["value1"],
"field2": ["value2"]
})
BigQuery("new-dataset3.new-table").create(data=df)
# Output: Dataset "new-dataset3" created, table "new-dataset3.new-table" created, data inserted
Deleting objects is similar to creating them, but you use the delete method instead:
BigQuery(dataset="dataset").delete()
# Output: Dataset "dataset" and all its tables deleted
BigQuery("dataset.table").delete()
# Output: Table "dataset.table" deleted
To read data from a BigQuery table, use the query method:
query = "SELECT * FROM dataset.table"
data = BigQuery().query(query)
print(data)
# Output: [{'field1': 'value1', 'field2': 'value2'}]
Alternatively, there is a simple read method to read the data from a table with the given columns, filters and limit:
data = BigQuery("dataset.table").read(
columns=["field1"],
filters=[("field1", "=", "value1")],
limit=1,
to_dataframe=True,
)
print(data)
# Output: pd.DataFrame({'field1': ['value1']})
By default, the read method returns a Pandas DataFrame, but you can also get the data as a list of dictionaries by setting the to_dataframe parameter to False.
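For example, a minimal sketch of the same read call returning a list of dictionaries instead (assuming the dataset.table from the examples above exists):
rows = BigQuery("dataset.table").read(
    columns=["field1"],
    filters=[("field1", "=", "value1")],
    limit=1,
    to_dataframe=False,
)
print(rows)
# Expected output: [{'field1': 'value1'}]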
To insert data into a BigQuery table, use the insert method:
data = {
"field1": "value1",
"field2": "value2"
}
BigQuery("dataset.table").insert(data)
# Output: Data inserted
One can also create BigQuery external tables by specifying the file path:
file_path = "gs://bucket/file.parquet"
BigQuery("dataset.external_table").create(file_path)
# Output: Dataset "dataset" created, external table "dataset.external_table" created
The allowed file formats are CSV, JSON, Avro, Parquet (single and partitioned), and ORC.
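The same pattern should apply to the other supported formats; for instance, a sketch for a CSV source (assuming the file exists and the format is inferred from the file extension):
file_path = "gs://bucket/file.csv"
BigQuery("dataset.external_csv_table").create(file_path)
# Expected: external table "dataset.external_csv_table" created over the CSV file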
The Storage module in the gcp-pal library allows you to perform read and write operations on Google Cloud Storage buckets and objects.
Import the Storage class from the gcp_pal module:
from gcp_pal import Storage
Similar to the other modules, listing objects in a bucket is done using the ls method:
buckets = Storage().ls()
print(buckets)
# Output: ['bucket1', 'bucket2']
objects = Storage("bucket1").ls()
print(objects)
# Output: ['object1', 'object2']
To create a bucket, use the create method:
Storage("new-bucket").create()
# Output: Bucket "new-bucket" created
Deleting objects is similar to creating them, but you use the delete method instead:
Storage("bucket").delete()
# Output: Bucket "bucket" and all its objects deleted
Storage("bucket/object").delete()
# Output: Object "object" in bucket "bucket" deleted
To upload an object to a bucket, use the upload method:
Storage("bucket/uploaded_file.txt").upload("local_file.txt")
# Output: File "local_file.txt" uploaded to "bucket/uploaded_file.txt"
To download an object from a bucket, use the download method:
Storage("bucket/uploaded_file.txt").download("downloaded_file.txt")
# Output: File "bucket/uploaded_file.txt" downloaded to "downloaded_file.txt"
The Cloud Functions module in the gcp-pal library allows you to deploy and manage Cloud Functions.
Import the CloudFunctions class from the gcp_pal module:
from gcp_pal import CloudFunctions
To deploy a Cloud Function, specify the function's name, the source codebase, the entry point and any other parameters that are to be passed to BuildConfig, ServiceConfig and Function (see docs):
CloudFunctions("function-name").deploy(
path="path/to/function_codebase",
entry_point="main",
environment=2,
)
Deploying a Cloud Function from a local source depends on the gcp_pal.Storage module. By default, the source codebase is uploaded to the gcf-v2-sources-{PROJECT_NUMBER}-{REGION} bucket and is deployed from there. An alternative bucket can be specified via the source_bucket parameter:
CloudFunctions("function-name").deploy(
path="path/to/function_codebase",
entry_point="main",
environment=2,
source_bucket="bucket-name",
)
To list all Cloud Functions within a project, use the ls method:
functions = CloudFunctions().ls()
print(functions)
# Output: ['function1', 'function2']
To delete a Cloud Function, use the delete method:
CloudFunctions("function-name").delete()
# Output: Cloud Function "function-name" deleted
To invoke a Cloud Function, use the invoke (or call) method:
response = CloudFunctions("function-name").invoke({"key": "value"})
print(response)
# Output: {'output_key': 'output_value'}
To get the details of a Cloud Function, use the get method:
details = CloudFunctions("function-name").get()
print(details)
# Output: {'name': 'projects/project-id/locations/region/functions/function-name',
# 'build_config': {...}, 'service_config': {...}, 'state': {...}, ... }
Service account email can be specified either within the constructor or via the service_account parameter:
CloudFunctions("function-name", service_account="account@email.com").deploy(**kwargs)
# or
CloudFunctions("function-name").deploy(service_account="account@email.com", **kwargs)
The Cloud Run module in the gcp-pal library allows you to deploy and manage Cloud Run services.
Import the CloudRun class from the gcp_pal module:
from gcp_pal import CloudRun
CloudRun("test-app").deploy(path="samples/cloud_run")
# Output:
# - Docker image "test-app" built based on "samples/cloud_run" codebase and "samples/cloud_run/Dockerfile".
# - Docker image "test-app" pushed to Google Container Registry as "gcr.io/{PROJECT_ID}/test-app:random_tag".
# - Cloud Run service "test-app" deployed from "gcr.io/{PROJECT_ID}/test-app:random_tag".
The default tag is a random string but can be specified via the image_tag parameter:
CloudRun("test-app").deploy(path="samples/cloud_run", image_tag="5fbd72c")
# Output: Cloud Run service deployed
To list all Cloud Run services within a project, use the ls method:
services = CloudRun().ls()
print(services)
# Output: ['service1', 'service2']
To list jobs instead, set the job parameter to True:
jobs = CloudRun(job=True).ls()
print(jobs)
# Output: ['job1', 'job2']
To delete a Cloud Run service, use the delete method:
CloudRun("service-name").delete()
# Output: Cloud Run service "service-name" deleted
Similarly, to delete a job, set the job parameter to True:
CloudRun("job-name", job=True).delete()
To invoke a Cloud Run service, use the invoke/call method:
response = CloudRun("service-name").invoke({"key": "value"})
print(response)
# Output: {'output_key': 'output_value'}
To get the details of a Cloud Run service, use the get method:
details = CloudRun("service-name").get()
print(details)
# Output: ...
To get the status of a Cloud Run service, use the status/state method:
service_status = CloudRun("service-name").status()
print(service_status)
# Output: Active
job_status = CloudRun("job-name", job=True).status()
print(job_status)
# Output: Active
Service account email can be specified either within the constructor or via the service_account parameter:
CloudRun("run-name", service_account="account@email.com").deploy(**kwargs)
# or
CloudRun("run-name").deploy(service_account="account@email.com", **kwargs)
The Docker module in the gcp-pal library allows you to build and push Docker images to Google Container Registry.
Import the Docker class from the gcp_pal module:
from gcp_pal import Docker
Docker("image-name").build(path="path/to/context", dockerfile="Dockerfile")
# Output: Docker image "image-name:latest" built based on "path/to/context" codebase and "path/to/context/Dockerfile".
The default tag is "latest" but can be specified via the tag parameter:
Docker("image-name", tag="5fbd72c").build(path="path/to/context", dockerfile="Dockerfile")
# Output: Docker image "image-name:5fbd72c" built based on "path/to/context" codebase and "path/to/context/Dockerfile".
Docker("image-name").push()
# Output: Docker image "image-name" pushed to Google Container Registry.
The default destination is "gcr.io/{PROJECT_ID}/{IMAGE_NAME}:{TAG}" but can be specified via the destination parameter:
Docker("image-name").push(destination="gcr.io/my-project/image-name:5fbd72c")
# Output: Docker image "image-name" pushed to "gcr.io/my-project/image-name:5fbd72c".
The Logging module in the gcp-pal library allows you to access and manage logs from Google Cloud Logging.
Import the Logging class from the gcp_pal module:
from gcp_pal import Logging
To list all logs within a project, use the ls method:
logs = Logging().ls(limit=2)
for log in logs:
    print(log)
# Output: LogEntry - [2024-04-16 17:30:04.308 UTC] {Message payload}
Each entry is a LogEntry object with the following attributes: project, log_name, resource, severity, message, timestamp, time_zone, timestamp_str.
The message attribute is the main payload of the log entry.
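For example, a minimal sketch of inspecting those attributes on the listed entries (assuming they behave as plain attributes, as described above):
logs = Logging().ls(limit=2)
for log in logs:
    # Severity, formatted timestamp and the main payload of each entry:
    print(log.severity, log.timestamp_str, log.message)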
To filter logs based on a query, use the filter parameter of the ls method:
logs = Logging().ls(filter="severity=ERROR")
# Output: [LogEntry - [2024-04-16 17:30:04.308 UTC] {Message payload}, ...]
Some common filters are also supported natively: severity (str), time_start (str), time_end (str), time_range (int: hours). For example, the following are equivalent:
# Time now: 2024-04-16 17:30:04.308 UTC
logs = Logging().ls(filter="severity=ERROR AND time_start=2024-04-16T16:30:04.308Z AND time_end=2024-04-16T17:30:04.308Z")
logs = Logging().ls(severity="ERROR", time_start="2024-04-16T16:30:04.308Z", time_end="2024-04-16T17:30:04.308Z")
logs = Logging().ls(severity="ERROR", time_range=1)
To stream logs in real-time, use the stream method:
Logging().stream()
# LogEntry - [2024-04-16 17:30:04.308 UTC] {Message payload}
# LogEntry - [2024-04-16 17:30:05.308 UTC] {Message payload}
# ...
The Secret Manager module in the gcp-pal library allows you to access and manage secrets from Google Cloud Secret Manager.
Import the SecretManager class from the gcp_pal module:
from gcp_pal import SecretManager
To create a secret, specify the secret's name and value:
SecretManager("secret1").create("value1", labels={"env": "dev"})
# Output: Secret 'secret1' created
To list all secrets within a project, use the ls method:
secrets = SecretManager().ls()
print(secrets)
# Output: ['secret1', 'secret2']
The ls method also supports filtering secrets based on the filter or label parameters:
secrets = SecretManager().ls(filter="name:secret1")
print(secrets)
# Output: ['secret1']
secrets = SecretManager().ls(label="env:*")
print(secrets)
# Output: ['secret1', 'secret2']
To access the value of a secret, use the value method:
value = SecretManager("secret1").value()
print(value)
# Output: 'value1'
To delete a secret, use the delete method:
SecretManager("secret1").delete()
# Output: Secret 'secret1' deleted
The Cloud Scheduler module in the gcp-pal library allows you to create and manage Cloud Scheduler jobs.
Import the CloudScheduler class from the gcp_pal module:
from gcp_pal import CloudScheduler
To create a Cloud Scheduler job, specify the job's name in the constructor, and use the create method to set the schedule and target:
CloudScheduler("job-name").create(
schedule="* * * * *",
time_zone="UTC",
target="https://example.com/api",
payload={"key": "value"},
)
# Output: Cloud Scheduler job "job-name" created with HTTP target "https://example.com/api"
If the target is not an HTTP endpoint, it will be treated as a PubSub topic:
CloudScheduler("job-name").create(
schedule="* * * * *",
time_zone="UTC",
target="pubsub-topic-name",
payload={"key": "value"},
)
# Output: Cloud Scheduler job "job-name" created with PubSub target "pubsub-topic-name"
Additionally, service_account can be specified to add the OAuth and OIDC tokens to the request:
CloudScheduler("job-name").create(
schedule="* * * * *",
time_zone="UTC",
target="https://example.com/api",
payload={"key": "value"},
service_account="PROJECT@PROJECT.iam.gserviceaccount.com",
)
# Output: Cloud Scheduler job "job-name" created with HTTP target "https://example.com/api" and OAuth+OIDC tokens
To list all Cloud Scheduler jobs within a project, use the ls method:
jobs = CloudScheduler().ls()
print(jobs)
# Output: ['job1', 'job2']
To delete a Cloud Scheduler job, use the delete method:
CloudScheduler("job-name").delete()
# Output: Cloud Scheduler job "job-name" deleted
To pause or resume a Cloud Scheduler job, use the pause or resume methods:
CloudScheduler("job-name").pause()
# Output: Cloud Scheduler job "job-name" paused
CloudScheduler("job-name").resume()
# Output: Cloud Scheduler job "job-name" resumed
To run a Cloud Scheduler job immediately, use the run method:
CloudScheduler("job-name").run()
# Output: Cloud Scheduler job "job-name" run
If the job is paused, it will be resumed before running. To prevent this, set the force parameter to False:
CloudScheduler("job-name").run(force=False)
# Output: Cloud Scheduler job "job-name" not run if it is paused
Service account email can be specified either within the constructor or via the service_account parameter:
CloudScheduler("job-name", service_account="account@email.com").create(**kwargs)
# or
CloudScheduler("job-name").create(service_account="account@email.com", **kwargs)
The Project module in the gcp-pal library allows you to access and manage Google Cloud projects.
Import the Project class from the gcp_pal module:
from gcp_pal import Project
To list all projects available to the authenticated user, use the ls method:
projects = Project().ls()
print(projects)
# Output: ['project1', 'project2']
To create a new project, use the create method:
Project("new-project").create()
# Output: Project "new-project" created
To delete a project, use the delete method:
Project("project-name").delete()
# Output: Project "project-name" deleted
Google Cloud will delete the project after 30 days. During this time, to undelete a project, use the undelete method:
Project("project-name").undelete()
# Output: Project "project-name" undeleted
To get the details of a project, use the get method:
details = Project("project-name").get()
print(details)
# Output: {'name': 'projects/project-id', 'project_id': 'project-id', ...}
To obtain the project number, use the number method:
project_number = Project("project-name").number()
print(project_number)
# Output: "1234567890"
The Dataplex module in the gcp-pal library allows you to interact with Dataplex services.
Import the Dataplex class from the gcp_pal module:
from gcp_pal import Dataplex
The Dataplex module supports listing all lakes, zones, and assets within a Dataplex instance:
lakes = Dataplex().ls()
print(lakes)
# Output: ['lake1', 'lake2']
zones = Dataplex("lake1").ls()
print(zones)
# Output: ['zone1', 'zone2']
assets = Dataplex("lake1/zone1").ls()
print(assets)
# Output: ['asset1', 'asset2']
To create a lake, zone, or asset within a Dataplex instance, use the create_lake, create_zone, and create_asset methods.
To create a lake:
Dataplex("lake1").create_lake()
# Output: Lake "lake1" created
To create a zone (zone type and location type are required):
Dataplex("lake1/zone1").create_zone(zone_type="raw", location_type="single-region")
# Output: Zone "zone1" created in Lake "lake1"
To create an asset (asset source and asset type are required):
Dataplex("lake1/zone1").create_asset(asset_source="dataset_name", asset_type="bigquery")
# Output: Asset "asset1" created in Zone "zone1" of Lake "lake1"
Deleting objects can be done using a single delete method:
Dataplex("lake1/zone1/asset1").delete()
# Output: Asset "asset1" deleted
Dataplax("lake1/zone1").delete()
# Output: Zone "zone1" and all its assets deleted
Dataplex("lake1").delete()
# Output: Lake "lake1" and all its zones and assets deleted
The Artifact Registry module in the gcp-pal library allows you to interact with Artifact Registry services.
Import the ArtifactRegistry class from the gcp_pal module:
from gcp_pal import ArtifactRegistry
The objects within Artifact Registry module follow the hierarchy: repositories > packages > versions > tags.
To list all repositories within a project, use the ls method:
repositories = ArtifactRegistry().ls()
print(repositories)
# Output: ['repo1', 'repo2']
To list all packages (or "images") within a repository, use the ls method with the repository name:
images = ArtifactRegistry("repo1").ls()
print(images)
# Output: ['image1', 'image2']
To list all versions of a package, use the ls method with the repository and package names:
versions = ArtifactRegistry("repo1/image1").ls()
print(versions)
# Output: ['repo1/image1/sha256:version1', 'repo1/image1/sha256:version2']
To list all tags of a version, use the ls method with the repository, package, and version names:
tags = ArtifactRegistry("repo1/image1/sha256:version1").ls()
print(tags)
# Output: ['repo1/image1/tag1', 'repo1/image1/tag2']
To create a repository, use the create_repository method with the repository name:
ArtifactRegistry("repo1").create_repository()
# Output: Repository "repo1" created
Some additional parameters can be specified within the method, such as the format ("docker" or "maven") and the mode ("standard", "remote" or "virtual").
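For example, a minimal sketch of passing those parameters (the repository name here is a placeholder, and the parameter names format and mode are taken from the description above):
ArtifactRegistry("docker-repo").create_repository(format="docker", mode="standard")
# Expected: Repository "docker-repo" created as a standard Docker repository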
To create a tag, use the create_tag method with the repository, package, version, and tag names:
ArtifactRegistry("repo1/image1/sha256:version1").create_tag("tag1")
# Output: Tag "tag1" created for version "version1" of package "image1" in repository "repo1"
Deleting objects can be done using a single delete method:
ArtifactRegistry("repo1/image1:tag1").delete()
# Output: Tag "tag1" deleted for package "image1" in repository "repo1"
ArtifactRegistry("repo1/image1/sha256:version1").delete()
# Output: Version "version1" deleted for package "image1" in repository "repo1"
ArtifactRegistry("repo1/image1").delete()
# Output: Package "image1" deleted in repository "repo1"
ArtifactRegistry("repo1").delete()
# Output: Repository "repo1" deleted
The PubSub module in the gcp-pal library allows you to publish and subscribe to PubSub topics.
First, import the PubSub class from the gcp_pal module:
from gcp_pal import PubSub
The PubSub class prefers to take the path argument in the format project/topic/subscription:
PubSub("my-project/my-topic/my-subscription")
Alternatively, you can specify the project and topic/subscription separately:
PubSub(project="my-project", topic="my-topic", subscription="my-subscription")
To list all topics within a project or all subscriptions within a topic, use the ls method:
topics = PubSub("my-project").ls()
# Output: ['topic1', 'topic2']
subscriptions = PubSub("my-project/topic1").ls()
# Output: ['subscription1', 'subscription2']
Or to list all subscriptions within a project:
subscriptions = PubSub("my-project").ls_subscriptions()
# Output: ['subscription1', 'subscription2', ...]
To create a PubSub topic, use the create method:
PubSub("my-project/new-topic").create()
# Output: PubSub topic "new-topic" created
To create a PubSub subscription, specify the topic (in the path or via the topic parameter) and use the create method:
PubSub("my-project/my-topic/new-subscription").create()
To delete a PubSub topic or subscription, use the delete method:
PubSub("my-project/topic/subscription").delete()
# Output: PubSub subscription "subscription" deleted
PubSub("my-project/topic").delete()
# Output: PubSub topic "topic" deleted
To delete a subscription without specifying the topic, use the subscription parameter:
PubSub(subscription="my-project/subscription").delete()
# Output: PubSub subscription "subscription" deleted
To publish a message to a PubSub topic, specify the topic's name and the message you want to publish:
topic = "topic-name"
message = "Hello, PubSub!"
PubSub(topic).publish(message)
The Request module in the gcp-pal library allows you to make authorized HTTP requests.
Import the Request class from the gcp_pal module:
from gcp_pal import Request
To make an authorized request, specify the URL you want to access and use the relevant method:
url = "https://example.com/api"
get_response = Request(url).get()
print(get_response)
# Output: <Response [200]>
post_response = Request(url).post(data={"key": "value"})
print(post_response)
# Output: <Response [201]>
put_response = Request(url).put(data={"key": "value"})
print(put_response)
# Output: <Response [200]>
Specify the service account email to make requests on behalf of a service account within the constructor:
Request(url, service_account="account@email.com").get()
The Schema module is not strictly GCP-related, but it is a useful utility. It allows one to translate schemas between different formats, such as Python, PyArrow, BigQuery, and Pandas.
Import the Schema class from the gcp_pal.schema module:
from gcp_pal.schema import Schema
To translate a schema from one format to another, use the respective methods:
str_schema = {
"a": "int",
"b": "str",
"c": "float",
"d": {
"d1": "datetime",
"d2": "timestamp",
},
}
python_schema = Schema(str_schema).str()
# {
# "a": int,
# "b": str,
# "c": float,
# "d": {
# "d1": datetime,
# "d2": datetime,
# },
# }
pyarrow_schema = Schema(str_schema).pyarrow()
# pa.schema(
# [
# pa.field("a", pa.int64()),
# pa.field("b", pa.string()),
# pa.field("c", pa.float64()),
# pa.field("d", pa.struct([
# pa.field("d1", pa.timestamp("ns")),
# pa.field("d2", pa.timestamp("ns")),
# ])),
# ]
# )
bigquery_schema = Schema(str_schema).bigquery()
# [
# bigquery.SchemaField("a", "INTEGER"),
# bigquery.SchemaField("b", "STRING"),
# bigquery.SchemaField("c", "FLOAT"),
# bigquery.SchemaField("d", "RECORD", fields=[
# bigquery.SchemaField("d1", "DATETIME"),
# bigquery.SchemaField("d2", "TIMESTAMP"),
# ]),
# ]
pandas_schema = Schema(str_schema).pandas()
# {
# "a": "int64",
# "b": "object",
# "c": "float64",
# "d": {
# "d1": "datetime64[ns]",
# "d2": "datetime64[ns]",
# },
# }
To infer and translate a schema from a dictionary of data or a Pandas DataFrame, use the is_data parameter:
import datetime
import pandas as pd

df = pd.DataFrame(
{
"a": [1, 2, 3],
"b": ["a", "b", "c"],
"c": [1.0, 2.0, 3.0],
"date": [datetime.datetime.now() for _ in range(3)],
}
)
inferred_schema = Schema(df, is_data=True).schema
# {
# "a": "int",
# "b": "str",
# "c": "float",
# "date": "datetime",
# }
pyarrow_schema = Schema(df, is_data=True).pyarrow()
# pa.schema(
# [
# pa.field("a", pa.int64()),
# pa.field("b", pa.string()),
# pa.field("c", pa.float64()),
# pa.field("date", pa.timestamp("ns")),
# ]
# )
The Parquet module in the gcp-pal library allows you to read and write Parquet files in Google Cloud Storage. The gcp_pal.Storage module uses this module to read and write Parquet files to and from Google Cloud Storage.
Import the Parquet class from the gcp_pal module:
from gcp_pal import Parquet
To read a Parquet file from Google Cloud Storage, use the read method:
data = Parquet("bucket/file.parquet").read()
print(data)
# Output: pd.DataFrame({'field1': ['value1'], 'field2': ['value2']})
To write a Pandas DataFrame to a Parquet file in Google Cloud Storage, use the write method:
df = pd.DataFrame({
"field1": ["value1"],
"field2": ["value2"]
})
Parquet("bucket/file.parquet").write(df)
# Output: Parquet file "file.parquet" created in "bucket"
Partitioning can be specified via the partition_cols parameter:
Parquet("bucket/file.parquet").write(df, partition_cols=["field1"])
# Output: Parquet file "file.parquet" created in "bucket" partitioned by "field1"