hyperleaup

Create and publish Tableau Hyper files from Apache Spark DataFrames and Spark SQL.

0.1.2

PyPI

Maintainers: 1

hyperleaup

Pronounced "hyper-loop". Create and publish Tableau Hyper files from Apache Spark DataFrames or Spark SQL.

Why are data extracts so slow?

Tableau Data Extracts can take hours to create and publish to a Tableau Server. Sometimes this means waiting around most of the day for the data extract to complete. What a waste of time! In addition, the Tableau Backgrounder (the Tableau Server job scheduler) becomes a single point of failure as more refresh jobs are scheduled and long-running jobs exhaust the server's resources.

[Figure: Data Extract Current Workflow]

How hyperleaup helps

Rather than pulling data from the source over an ODBC connection, hyperleaup can write data directly to a Hyper file and publish final Hyper files to a Tableau Server. Best of all, you can take advantage of all the benefits of Apache Spark + Tableau Hyper API:

perform efficient CDC upserts
perform distributed reads, writes, and transformations across multiple sources
execute SQL directly

hyperleaup allows you to create repeatable data extracts that can be scheduled to run at a regular frequency or incorporated as the final step in an ETL pipeline, e.g. refreshing a data extract with the latest CDC changes.

Getting Started

A list of usage examples is available in the demo folder of this repo as a Databricks Notebook Archive (DBC) or IPython Notebook (demo/Hyperleaup-Demo.ipynb).

Example usage

The following code snippet creates a Tableau Hyper file from a Spark SQL statement and publishes it as a datasource to a Tableau Server.

from hyperleaup import HyperFile

# Step 1: Create a Hyper File from Spark SQL
query = """
select *
  from transaction_history
 where action_date > '2015-01-01'
"""

hf = HyperFile(name="transaction_history", sql=query, is_dbfs_enabled=True)

# Step 2: Publish Hyper File to a Tableau Server
# (the arguments below are placeholders for your own connection details)
hf.publish(tableau_server_url,
           username,
           password,
           site_name,
           project_name,
           datasource_name)

# Step 3: Append new data
new_data = """
select *
  from transaction_history
 where action_date > last_publish_date
"""
hf.append(sql=new_data)
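
The `last_publish_date` in Step 3 is a placeholder. Here is a minimal sketch of one way to fill it in, assuming your pipeline tracks the date of the previous publish itself. The helper `build_incremental_query` and the sample date are hypothetical and not part of hyperleaup.

```python
from datetime import date


def build_incremental_query(table: str, date_column: str, since: date) -> str:
    """Build a Spark SQL query selecting rows newer than `since`.

    Hypothetical helper, not part of hyperleaup. `table` and `date_column`
    are assumed to be trusted identifiers; the date is rendered in ISO format.
    """
    return (
        f"select *\n"
        f"  from {table}\n"
        f" where {date_column} > '{since.isoformat()}'"
    )


# Assumed value: the date of the previous publish, tracked by your own pipeline.
last_publish_date = date(2015, 6, 30)
new_data = build_incremental_query("transaction_history", "action_date", last_publish_date)
print(new_data)

# With the HyperFile instance from Step 1, the result would be passed as:
# hf.append(sql=new_data)
```

Rendering the date through `isoformat()` keeps the literal in the `YYYY-MM-DD` form used in the Step 1 query.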

Creation Mode

There are several options for how the Hyper file is created. Select one by passing the creation_mode argument when initializing a HyperFile instance. The default is PARQUET.

Mode	Description	Data Size
PARQUET	Saves data to a single Parquet file then copies to Hyper file.	MEDIUM
COPY	Saves data to CSV format then copies to Hyper file.	MEDIUM
INSERT	Reads data into memory; more forgiving for null values.	SMALL
LARGEFILE	Saves data to multiple Parquet files then copies to Hyper file.	LARGE

Example of setting creation mode:
hf = HyperFile(name="transaction_history", sql=query, is_dbfs_enabled=True, creation_mode="PARQUET")
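
The Data Size column above can be turned into a simple selection rule. The sketch below picks a mode name from an estimated row count. The function `pick_creation_mode` and both thresholds are illustrative assumptions, not values defined by hyperleaup, so tune them to your own data.

```python
def pick_creation_mode(row_count: int,
                       small_max: int = 100_000,
                       medium_max: int = 50_000_000) -> str:
    """Return a creation mode name for an estimated row count.

    Hypothetical helper: the thresholds are assumptions for illustration,
    not limits documented by hyperleaup.
    """
    if row_count < 0:
        raise ValueError("row_count must be non-negative")
    if row_count <= small_max:
        return "INSERT"      # SMALL: data is read into memory
    if row_count <= medium_max:
        return "PARQUET"     # MEDIUM: single Parquet file (the default mode)
    return "LARGEFILE"       # LARGE: multiple Parquet files


print(pick_creation_mode(5_000))
print(pick_creation_mode(2_000_000))
print(pick_creation_mode(900_000_000))

# The result would then be passed along as:
# hf = HyperFile(name="transaction_history", sql=query,
#                is_dbfs_enabled=True, creation_mode=pick_creation_mode(n))
```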

Hyper File Options

There is an optional HyperFileConfig that can be used to change default behaviors.

timestamp_with_timezone:
- If True, use the timestamptz datatype in the Hyper file. Recommended when using timestamp values with the PARQUET creation mode. (default=False)
allow_nulls:
- If True, skip the default behavior of replacing null numeric and string values with non-null values. (default=False)
convert_decimal_precision:
- If True, automatically convert decimals with precision over 18 down to 18. This carries a risk of data truncation. (default=False)

Example using configs

from hyperleaup import HyperFile, HyperFileConfig

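hf_config = HyperFileConfig(
              timestamp_with_timezone=True,
              allow_nulls=False,
              convert_decimal_precision=False)

# Pass the config when creating the Hyper file
# (assumes the HyperFile keyword argument is named `config`)
hf = HyperFile(name="transaction_history", sql=query, is_dbfs_enabled=True, config=hf_config)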
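
Because convert_decimal_precision can truncate data, it may help to see which columns it would affect before enabling it. The sketch below scans `(column, type)` pairs for decimals wider than 18 digits. It assumes the type strings have the form `decimal(p,s)`, as PySpark's `DataFrame.dtypes` reports them. The helper `wide_decimal_columns` is hypothetical and not part of hyperleaup.

```python
import re

# Matches type strings such as "decimal(38,10)".
_DECIMAL = re.compile(r"decimal\((\d+),\s*(\d+)\)")


def wide_decimal_columns(dtypes, max_precision=18):
    """Return (name, precision, scale) for decimal columns over max_precision.

    Hypothetical helper: `dtypes` is a list of (column name, type string)
    pairs, e.g. the value of `df.dtypes` on a Spark DataFrame.
    """
    wide = []
    for name, type_str in dtypes:
        match = _DECIMAL.fullmatch(type_str)
        if match and int(match.group(1)) > max_precision:
            wide.append((name, int(match.group(1)), int(match.group(2))))
    return wide


# Sample schema for illustration only.
dtypes = [("id", "bigint"), ("amount", "decimal(38,10)"), ("fee", "decimal(10,2)")]
print(wide_decimal_columns(dtypes))  # → [('amount', 38, 10)]
```

An empty result suggests the option would have no effect on that DataFrame.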

Legal Information

This software is provided as-is and is not officially supported by Databricks through customer technical support channels. Support, questions, and feature requests can be submitted through the Issues page of this repo. Please understand that issues with the use of this code will not be answered or investigated by Databricks Support.

Core Contribution team

Lead Developer: Will Girten, Lead SSA @Databricks
Puru Shrestha, Sr. BI Developer

Project Support

Please note that all projects in the /databrickslabs GitHub account are provided for your exploration only, and are not formally supported by Databricks with Service Level Agreements (SLAs).
They are provided AS-IS and we do not make any guarantees of any kind.
Please do not submit a support ticket relating to any issues arising from the use of these projects.

Any issues discovered through the use of this project should be filed as GitHub Issues on the Repo.
They will be reviewed as time permits, but there are no formal SLAs for support.

Building the Project

To build the project:

python3 -m build

Running Pytests

To run tests on the project:

cd tests
python test_hyper_file.py
python test_creator.py
