spl-transpiler

Convert Splunk SPL queries into PySpark code


Overview

spl_transpiler is a Rust + Python port of Databricks Labs' Transpiler. The goal is to provide a high-performance, highly portable, convenient tool for adapting common SPL code into PySpark code when possible, making it easy to migrate from Splunk to other data platforms for log processing.

Installation

pip install spl_transpiler

Usage

from spl_transpiler import convert_spl_to_pyspark

print(convert_spl_to_pyspark(r"""multisearch
[index=regionA | fields +country, orders]
[index=regionB | fields +country, orders]"""))

# spark.table("regionA").select(
#     F.col("country"),
#     F.col("orders"),
# ).unionByName(
#     spark.table("regionB").select(
#         F.col("country"),
#         F.col("orders"),
#     ),
#     allowMissingColumns=True,
# )

Interactive CLI

For demonstration purposes and ease of use, an interactive CLI is also provided.

pip install spl_transpiler[cli]
python -m spl_transpiler

(Screenshot: cli_screenshot.png)

This provides an in-terminal user interface (using textual) where you can type an SPL query and see the converted PySpark code in real time, alongside a visual representation of how the transpiler is understanding your query.

Runtime

The Runtime is a library, currently provided as part of the SPL Transpiler, which offers more robust implementations of SPL commands as well as an SPL-like authoring experience when writing PySpark code directly.

For example, the following code snippets are equivalent:

In SPL (which can be transpiled and run on Spark):

sourcetype="cisco" | eval x=len(raw) | stats max(x) AS longest BY source

In PySpark:

from pyspark.sql import functions as F
spark.table(...).where(
    (F.col("sourcetype") == F.lit("cisco")),
).withColumn(
    "x",
    F.length(F.col("raw")),
).groupBy(
    [
        "source",
    ]
).agg(
    F.max(F.col("x")).alias("longest"),
)

In the SPL Runtime:

from pyspark.sql import functions as F
from spl_transpiler.runtime import commands, functions
df_1 = commands.search(None, sourcetype=F.lit("cisco"))
df_2 = commands.eval(df_1, x=functions.eval.len(F.col("raw")))
df_3 = commands.stats(
    df_2, by=[F.col("source")], longest=functions.stats.max(F.col("x"))
)
df_3

This runtime is a collection of helper functions on top of PySpark and can be intermingled with other PySpark code, so you can leverage an SPL-like experience where it's convenient while still using regular PySpark code everywhere else.
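
For example (a minimal sketch based on the runtime calls shown above; the column names are only illustrative), a pipeline can start with the runtime helpers, drop into plain PySpark for a step, and then switch back:

from pyspark.sql import functions as F
from spl_transpiler.runtime import commands, functions

# Start with an SPL-style search via the runtime...
df = commands.search(None, sourcetype=F.lit("cisco"))
# ...use ordinary PySpark for an intermediate step...
df = df.withColumn("x", F.length(F.col("raw"))).where(F.col("x") > 0)
# ...and return to the runtime helpers for the aggregation.
df = commands.stats(df, by=[F.col("source")], longest=functions.stats.max(F.col("x")))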

In addition to these helper functions, the runtime also provides UDFs (user-defined functions) to provide data processing functions that aren't natively available in Spark. For example, in eval is_local=cidrmatch("10.0.0.0/8", ip), the cidrmatch function has no direct equivalent in Spark. The runtime provides a UDF that can be used as follows, either directly:

from pyspark.sql import functions as F
from spl_transpiler.runtime import udfs

df = df.withColumn("is_local", udfs.cidr_match("10.0.0.0/8", F.col("ip")))

Or via the runtime:

from pyspark.sql import functions as F
from spl_transpiler.runtime import commands, functions
df_1 = commands.eval(None, is_local=functions.eval.cidrmatch("10.0.0.0/8", F.col("ip")))
df_1

The transpiler, by default, will not assume the presence of the runtime. You need to explicitly allow the runtime to enable these features (it is enabled by default in the TUI):

from spl_transpiler import convert_spl_to_pyspark
convert_spl_to_pyspark(
    '''eval is_local=cidrmatch("10.0.0.0/8", ip)''',
    allow_runtime=True
)

Why?

Why transpile SPL into Spark? Because a huge amount of domain knowledge is locked up in the Splunk ecosystem, but Splunk is not always the optimal place to store and analyze data. Transpiling existing queries can make it easier for analysts and their analytics to migrate iteratively onto other platforms. SPL is also a laser-focused language for certain kinds of analytics, and in most cases it's far more concise than other languages (PySpark or SQL) at log processing tasks. Therefore, it may be preferable to continue writing queries in SPL and use a transpiler layer to make that syntax viable on various platforms.

Why rewrite the Databricks Labs transpiler? A few reasons:

  1. The original transpiler is written in Scala and assumes access to a Spark environment. That requires a JVM to execute and possibly a whole ecosystem of software (maybe even a running Spark cluster) to be available. This transpiler stands alone and compiles natively to any platform.
  2. While Scala is a common language in the Spark ecosystem, Spark isn't the only ecosystem that would benefit from having an SPL transpiler. By providing a transpiler that's both easy to use in Python and directly linkable at a system level, it becomes easy to embed and adapt the transpiler for any other use case too.
  3. Speed. Scala's plenty fast, to be honest, but Rust is mind-numbingly fast. This transpiler can parse SPL queries and generate equivalent Python code in a fraction of a millisecond. This makes it viable to treat the transpiler as a realtime component, for example embedding it in a UI and re-computing the converted results after every keystroke (see the quick timing sketch after this list).
  4. Maintainability. Rust's type system helps keep things unambiguous as data passes through parsers and converters, and built-in unit testing makes it easy to adapt and grow the transpiler without risk of breaking existing features. While Rust is undoubtedly a language with a learning curve, the resulting code is very hard to break without noticing. This makes it much easier to maintain than a similarly complicated system would be in Python.
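
As a rough illustration of the speed claim above, here is a minimal, unscientific timing sketch (not from the project's benchmarks; numbers will vary by machine):

import timeit

from spl_transpiler import convert_spl_to_pyspark

# Re-use the example query from the Runtime section above.
query = r'sourcetype="cisco" | eval x=len(raw) | stats max(x) AS longest BY source'
runs = 1_000
total = timeit.timeit(lambda: convert_spl_to_pyspark(query), number=runs)
print(f"{total / runs * 1000:.3f} ms per conversion")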

Contributing

This project is in early development. While it parses most common SPL queries and can convert a non-trivial variety of queries to PySpark, it's extremely limited and not yet ready for any serious usage. However, it lays a solid foundation for the whole process and is modular enough to easily add incremental features to.

Ways to contribute:

  • Add SPL queries and what the equivalent PySpark code would be. These test cases can drive development and prioritize the most commonly used features (see the sketch below for the kind of test such a pair can drive).
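
For instance, here is a minimal sketch of such a test case (the assertion style and file layout are hypothetical; see the repository's existing test suites for the real conventions), using the multisearch example from earlier in this README:

from spl_transpiler import convert_spl_to_pyspark

SPL_QUERY = r"""multisearch
[index=regionA | fields +country, orders]
[index=regionB | fields +country, orders]"""

def test_multisearch_unions_by_name():
    # The transpiled code for this query should union the two searches by name.
    assert "unionByName" in convert_spl_to_pyspark(SPL_QUERY)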

Support Matrix

Search Commands

A complete list of built-in Splunk commands is provided on the Splunk documentation page.

It is not the goal of this transpiler to support every feature of every SPL command. Splunk and Spark are two different platforms and not everything that is built in to Splunk makes sense to trivially convert to Spark. What's far more important is supporting most queries (say, 90% of all queries), which only requires supporting a reasonable number of the most commonly used commands.

For reference, however, here is a complete table of Splunk commands and the current and planned support status in this transpiler.

Support status can be one of the following:

  • None: This command will result in a syntax error in the SPL parser, and is completely unrecognized.
  • Parser: This command can be parsed from SPL into a syntax tree, but cannot currently be rendered back as PySpark code.
  • Partial: This command can be parsed and rendered back to functional PySpark code. Not all features may be supported.
  • Complete: This command can be parsed and rendered back to functional PySpark code. All intended features are supported. This library is still early in its development, and commands might get marked as Complete while still having unknown bugs or limitations.

Commands that don't yet support the Runtime can still be converted to raw PySpark code. The only difference will be that the PySpark code will be verbose and native, rather than using the SPL-like interface the runtime provides.

| Command | Support | Runtime | Target |
|---|---|---|---|
| **High Priority** | | | |
| bin (bucket) | Partial | | Yes |
| convert | Yes | | Yes |
| dedup | Parser | | Yes |
| eval | Partial | Yes | Yes |
| eventstats | Partial | | Yes |
| fields | Yes | | Yes |
| fillnull | Partial | Yes | Yes |
| head | Partial | | Yes |
| inputlookup | Parser | | Yes |
| iplocation | None | | Yes |
| join | Partial | | Yes |
| lookup | Partial | | Yes |
| mstats | None | | Yes |
| multisearch | Partial | | Yes |
| mvexpand | Parser | | Yes |
| outputlookup | None | | Yes |
| rare | Yes | | Yes |
| regex | Yes | | Yes |
| rename | Partial | | Yes |
| rex | Partial | | Yes |
| search | Partial | Yes | Yes |
| sort | Partial | | Yes |
| spath | Partial | | Yes |
| stats | Partial | | Yes |
| streamstats | Parser | | Yes |
| table | Partial | | Yes |
| tail | Yes | | Yes |
| top | Yes | | Yes |
| tstats | Partial | Yes | Yes |
| where | Partial | | Yes |
| **Planned/Supported** | | | |
| addtotals | Partial | | Yes |
| anomalydetection | None | | Maybe |
| append | None | | Maybe |
| appendpipe | None | | Maybe |
| chart | None | | Maybe |
| collect | Parser | | Maybe |
| extract (kv) | None | | Maybe |
| foreach | None | | Yes |
| format | Parser | | Yes |
| from | None | | Yes |
| makecontinuous | None | | Maybe |
| makemv | None | | Maybe |
| makeresults | Parser | | Yes |
| map | Parser | | Yes |
| multikv | None | | Maybe |
| replace | None | | Maybe |
| return | Parser | | Yes |
| transaction | None | | Yes |
| xmlkv | None | | Yes |
| **Unsupported** | | | |
| abstract | None | | |
| accum | None | | |
| addcoltotals | None | | |
| addinfo | None | | |
| analyzefields | None | | |
| anomalies | None | | |
| anomalousvalue | None | | |
| appendcols | None | | |
| arules | None | | |
| associate | None | | |
| autoregress | None | | |
| bucketdir | None | | |
| cluster | None | | |
| cofilter | None | | |
| concurrency | None | | |
| contingency | None | | |
| correlate | None | | |
| datamodel | None | | |
| dbinspect | None | | |
| delete | None | | |
| delta | None | | |
| diff | None | | |
| erex | None | | |
| eventcount | None | | |
| fieldformat | None | | |
| fieldsummary | None | | |
| filldown | None | | |
| findtypes | None | | |
| folderize | None | | |
| gauge | None | | |
| gentimes | None | | |
| geom | None | | |
| geomfilter | None | | |
| geostats | None | | |
| highlight | None | | |
| history | None | | |
| iconify | None | | |
| inputcsv | None | | |
| kmeans | None | | |
| kvform | None | | |
| loadjob | None | | |
| localize | None | | |
| localop | None | | |
| mcollect | None | | |
| metadata | None | | |
| metasearch | None | | |
| meventcollect | None | | |
| mpreview | None | | |
| msearch | None | | |
| mvcombine | Parser | | |
| nomv | None | | |
| outlier | None | | Maybe |
| outputcsv | None | | |
| outputtext | None | | |
| overlap | None | | |
| pivot | None | | |
| predict | None | | |
| rangemap | None | | |
| redistribute | None | | |
| reltime | None | | |
| require | None | | |
| rest | None | | |
| reverse | None | | |
| rtorder | None | | |
| savedsearch | None | | |
| script (run) | None | | |
| scrub | None | | |
| searchtxn | None | | |
| selfjoin | None | | |
| sendalert | None | | |
| sendemail | None | | |
| set | None | | |
| setfields | None | | |
| sichart | None | | |
| sirare | None | | |
| sistats | None | | |
| sitimechart | None | | |
| sitop | None | | |
| strcat | None | | |
| tags | None | | |
| timechart | None | | Maybe |
| timewrap | None | | |
| tojson | None | | |
| transpose | None | | |
| trendline | None | | |
| tscollect | None | | |
| typeahead | None | | |
| typelearner | None | | |
| typer | None | | |
| union | None | | |
| uniq | None | | |
| untable | None | | |
| walklex | None | | |
| x11 | None | | |
| xmlunescape | None | | |
| xpath | None | | |
| xyseries | None | | |

Functions

There are two primary kinds of functions: Evaluation functions (primarily for use in eval) and Statistical and Charting functions (primarily for use in stats).

As with commands, there are a lot of built-in functions and not all of them map cleanly to Spark. This transpiler intends to support most queries and will thus support the most common functions. However, there is no goal at this time to support all Splunk functions.

| Category | Subcategory | Function | Support | Runtime | Target |
|---|---|---|---|---|---|
| Eval | Bitwise | bit_and | Yes | Yes | Yes |
| Eval | Bitwise | bit_or | Yes | Yes | Yes |
| Eval | Bitwise | bit_not | Yes | Yes | Yes |
| Eval | Bitwise | bit_xor | Yes | Yes | Yes |
| Eval | Bitwise | bit_shift_left | Yes | Yes | Yes |
| Eval | Bitwise | bit_shift_right | Yes | Yes | Yes |
| Eval | Comparison and Conditional | case | Yes | Yes | Yes |
| Eval | Comparison and Conditional | cidrmatch | Yes* | Yes (UDF) | Yes |
| Eval | Comparison and Conditional | coalesce | Yes | Yes | Yes |
| Eval | Comparison and Conditional | false | Yes | Yes | Yes |
| Eval | Comparison and Conditional | if | Yes | Yes | Yes |
| Eval | Comparison and Conditional | in | Yes | Yes | Yes |
| Eval | Comparison and Conditional | like | Yes | Yes | Yes |
| Eval | Comparison and Conditional | lookup | No | No | |
| Eval | Comparison and Conditional | match | Yes | Yes | Yes |
| Eval | Comparison and Conditional | null | Yes | Yes | Yes |
| Eval | Comparison and Conditional | nullif | Yes | Yes | Yes |
| Eval | Comparison and Conditional | searchmatch | No | No | |
| Eval | Comparison and Conditional | true | Yes | Yes | Yes |
| Eval | Comparison and Conditional | validate | Yes | Yes | Yes |
| Eval | Conversion | ipmask | No | Yes (UDF) | |
| Eval | Conversion | printf | No | Yes (UDF) | |
| Eval | Conversion | tonumber | Partial | Yes (UDF) | Yes |
| Eval | Conversion | tostring | Partial | Yes (UDF) | Yes |
| Eval | Cryptographic | md5 | Yes | Yes | Yes |
| Eval | Cryptographic | sha1 | Yes | Yes | Yes |
| Eval | Cryptographic | sha256 | Yes | Yes | Yes |
| Eval | Cryptographic | sha512 | Yes | Yes | Yes |
| Eval | Date and Time | now | Yes | Yes | Yes |
| Eval | Date and Time | relative_time | Yes | Yes | Yes |
| Eval | Date and Time | strftime | Partial | Partial | Yes |
| Eval | Date and Time | strptime | Partial | Partial | Yes |
| Eval | Date and Time | time | Yes | Yes | Yes |
| Eval | Informational | isbool | No | No | No |
| Eval | Informational | isint | No | No | No |
| Eval | Informational | isnotnull | Yes | Yes | Yes |
| Eval | Informational | isnull | Yes | Yes | Yes |
| Eval | Informational | isnum | No | No | No |
| Eval | Informational | isstr | No | No | No |
| Eval | Informational | typeof | No | No | No |
| Eval | JSON | json_object | No | No | |
| Eval | JSON | json_append | No | No | |
| Eval | JSON | json_array | No | No | |
| Eval | JSON | json_array_to_mv | No | No | |
| Eval | JSON | json_extend | No | No | |
| Eval | JSON | json_extract | No | No | |
| Eval | JSON | json_extract_exact | No | No | |
| Eval | JSON | json_keys | Yes | Yes | |
| Eval | JSON | json_set | No | No | |
| Eval | JSON | json_set_exact | No | No | |
| Eval | JSON | json_valid | Yes | Yes | |
| Eval | Mathematical | abs | Yes | Yes | Yes |
| Eval | Mathematical | ceiling (ceil) | Yes | Yes | Yes |
| Eval | Mathematical | exact | Yes* | Yes* | No |
| Eval | Mathematical | exp | Yes | Yes | Yes |
| Eval | Mathematical | floor | Yes | Yes | Yes |
| Eval | Mathematical | ln | Yes | Yes | Yes |
| Eval | Mathematical | log | Yes | Yes | Yes |
| Eval | Mathematical | pi | Yes | Yes | Yes |
| Eval | Mathematical | pow | Yes | Yes | Yes |
| Eval | Mathematical | round | Yes | Yes | Yes |
| Eval | Mathematical | sigfig | No | Yes | No |
| Eval | Mathematical | sqrt | Yes | Yes | Yes |
| Eval | Mathematical | sum | Yes | Yes | Yes |
| Eval | Multivalue | commands | No | No | |
| Eval | Multivalue | mvappend | Yes | Yes | Yes |
| Eval | Multivalue | mvcount | Yes | Yes | Yes |
| Eval | Multivalue | mvdedup | No | No | |
| Eval | Multivalue | mvfilter | No | No | Yes |
| Eval | Multivalue | mvfind | No | No | |
| Eval | Multivalue | mvindex | Yes | Yes | Yes |
| Eval | Multivalue | mvjoin | No | No | Yes |
| Eval | Multivalue | mvmap | No | No | |
| Eval | Multivalue | mvrange | No | No | |
| Eval | Multivalue | mvsort | No | No | |
| Eval | Multivalue | mvzip | Yes | Yes | |
| Eval | Multivalue | mv_to_json_array | No | No | |
| Eval | Multivalue | split | Yes | Yes | Yes |
| Eval | Statistical | avg | Yes | Yes | Yes |
| Eval | Statistical | max | Yes | Yes | Yes |
| Eval | Statistical | min | Yes | Yes | Yes |
| Eval | Statistical | random | Yes | Yes | Yes |
| Eval | Text | len | Yes | Yes | Yes |
| Eval | Text | lower | Yes | Yes | Yes |
| Eval | Text | ltrim | Yes | Yes | Yes |
| Eval | Text | replace | Yes | Yes | Yes |
| Eval | Text | rtrim | Yes | Yes | Yes |
| Eval | Text | spath | No | No | |
| Eval | Text | substr | Yes | Yes | Yes |
| Eval | Text | trim | Yes | Yes | Yes |
| Eval | Text | upper | Yes | Yes | Yes |
| Eval | Text | urldecode | Yes | Yes | Yes |
| Eval | Trigonometry and Hyperbolic | acos | Yes | Yes | Yes |
| Eval | Trigonometry and Hyperbolic | acosh | Yes | Yes | Yes |
| Eval | Trigonometry and Hyperbolic | asin | Yes | Yes | Yes |
| Eval | Trigonometry and Hyperbolic | asinh | Yes | Yes | Yes |
| Eval | Trigonometry and Hyperbolic | atan | Yes | Yes | Yes |
| Eval | Trigonometry and Hyperbolic | atan2 | Yes | Yes | Yes |
| Eval | Trigonometry and Hyperbolic | atanh | Yes | Yes | Yes |
| Eval | Trigonometry and Hyperbolic | cos | Yes | Yes | Yes |
| Eval | Trigonometry and Hyperbolic | cosh | Yes | Yes | Yes |
| Eval | Trigonometry and Hyperbolic | hypot | Yes | Yes | Yes |
| Eval | Trigonometry and Hyperbolic | sin | Yes | Yes | Yes |
| Eval | Trigonometry and Hyperbolic | sinh | Yes | Yes | Yes |
| Eval | Trigonometry and Hyperbolic | tan | Yes | Yes | Yes |
| Eval | Trigonometry and Hyperbolic | tanh | Yes | Yes | Yes |
| Stats | Aggregate | avg | Yes | Yes | Yes |
| Stats | Aggregate | count | Yes | Yes | Yes |
| Stats | Aggregate | distinct_count (dc) | Yes | Yes | Yes |
| Stats | Aggregate | estdc | Yes | Yes | |
| Stats | Aggregate | estdc_error | No | No | |
| Stats | Aggregate | exactperc | Yes | Yes | |
| Stats | Aggregate | max | Yes | Yes | Yes |
| Stats | Aggregate | mean | Yes | Yes | Yes |
| Stats | Aggregate | median | Yes | Yes | Yes |
| Stats | Aggregate | min | Yes | Yes | Yes |
| Stats | Aggregate | mode | Yes | Yes | Yes |
| Stats | Aggregate | percentile | Yes | Yes | Yes |
| Stats | Aggregate | range | Yes | Yes | Yes |
| Stats | Aggregate | stdev | Yes | Yes | Yes |
| Stats | Aggregate | stdevp | Yes | Yes | Yes |
| Stats | Aggregate | sum | Yes | Yes | Yes |
| Stats | Aggregate | sumsq | Yes | Yes | Yes |
| Stats | Aggregate | upperperc | Yes | Yes | Yes |
| Stats | Aggregate | var | Yes | Yes | Yes |
| Stats | Aggregate | varp | Yes | Yes | Yes |
| Stats | Event order | first | Yes | Yes | |
| Stats | Event order | last | Yes | Yes | |
| Stats | Multivalue stats and chart | list | Yes | Yes | |
| Stats | Multivalue stats and chart | values | Yes | Yes | Yes |
| Stats | Time | earliest | Yes | Yes | Yes |
| Stats | Time | earliest_time | Yes? | Yes? | |
| Stats | Time | latest | Yes | Yes | Yes |
| Stats | Time | latest_time | Yes? | Yes? | |
| Stats | Time | per_day | No | No | |
| Stats | Time | per_hour | No | No | |
| Stats | Time | per_minute | No | No | |
| Stats | Time | per_second | No | No | |
| Stats | Time | rate | Yes? | Yes? | |
| Stats | Time | rate_avg | No | Yes? | |
| Stats | Time | rate_sum | No | Yes? | |

* PySpark output depends on custom UDFs instead of native Spark functions. Some of these may be provided by this package; some may be provided by Databricks Sirens.

Prioritized TODO list

  • Support macro syntax (separate pre-processing function?)
  • Use sample queries to create prioritized list of remaining commands
  • ~~Incorporate standard macros that come with CIM~~
  • Support re-using intermediate results (saving off as tables or variables, .cache())
  • Migrate existing commands into runtime
  • Migrate eval, stats, and collect functions into runtime
  • Support custom Python UDFs
  • Finish supporting desired but not directly mappable eval functions using UDFs
  • Support {} and @ in field names
  • Support Scala UDFs
  • Support SQL output

Contributing

Installation

You'll need cargo (Rust) and python installed. I recommend using uv for managing the Python environment, dependencies, and tools needed for this project.

Note that PySpark is currently only compatible with Python 3.11 and older; 3.12 and 3.13 are not yet supported. For example, you can use uv venv --python 3.11 to create a .venv virtual environment with the appropriate Python interpreter. spl_transpiler is developed against Python 3.10 and likely requires at least that.

This project uses maturin and pyo3 for the Rust <-> Python interfacing. You'll need to install maturin, e.g. by using uvx maturin commands, which will auto-install the tool on first use.

This project uses pre-commit to automate linting and formatting. E.g. you can use uvx pre-commit install to install pre-commit and set up its git hooks.

You can then build and install spl_transpiler and all dependencies. First, make sure you have your virtual environment activated (uv commands will detect the venv by default if you use that, else follow activation instructions for your virtual environment tool), then run uv pip install -e .[cli,test,runtime].

Running Tests

You can test the core transpiler using cargo test. The Rust test suites include full end-to-end tests of query conversions, ensuring that the transpiler compiles and converts a wide range of known inputs into expected outputs.

The Python-side tests can be run with pytest and primarily ensure that the Rust <-> Python interface is behaving as expected. They also include runtime tests, which validate that hand-written and transpiled runtime code does what is expected, using known input/output data pairs run in an ephemeral Spark cluster.
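
For a sense of what such a runtime test checks, here is a minimal sketch (the fixture and test names are hypothetical, and the real suite's structure may differ) that round-trips a small DataFrame through the runtime's eval command on a local Spark session:

import pytest
from pyspark.sql import SparkSession, functions as F

from spl_transpiler.runtime import commands, functions

@pytest.fixture(scope="module")
def spark():
    # A tiny local Spark session standing in for the ephemeral cluster.
    session = SparkSession.builder.master("local[1]").appName("runtime-tests").getOrCreate()
    yield session
    session.stop()

def test_eval_len(spark):
    df = spark.createDataFrame([("abc",), ("hello",)], ["raw"])
    out = commands.eval(df, x=functions.eval.len(F.col("raw")))
    # "abc" -> 3, "hello" -> 5
    assert [row["x"] for row in out.orderBy("raw").collect()] == [3, 5]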

There is also a large-scale Python test that can be run using pytest tests/test_sample_files_parse.py. By default, this test is ignored because it is slow and currently does not pass. It runs the transpiler on >1,800 sample SPL queries and ensures that the transpiler doesn't crash, generating detailed logs and error summaries along the way. This test is useful when expanding the transpiler to support new syntax, commands, functions, etc., to see if the changes cause more commands/queries to transpile successfully. It's also useful for identifying which elements of SPL should be prioritized next to support more real-world use cases.

Acknowledgements

This project is deeply indebted to several other projects:

  • Databricks Labs' Transpiler provided most of the starting point for this parser, including an unambiguous grammar definition and numerous test cases which have been copied mostly verbatim. The license for that transpiler can be found here. Copyright 2021-2022 Databricks, Inc.
  • Numerous real-world SPL queries have been provided by Splunk Security Content under Apache 2.0 License. Copyright 2022 Splunk Inc.
