
benchstats is a Python 3.10+ package that lets you compare two sets of benchmark results* using proper statistical tests and produce a readable report on them. This removes guesswork and wild guesses from the interpretation of benchmarking results and gives a solid** answer to the question "is algorithm implementation A faster than algorithm implementation B?".
The code that reads input data, performs statistical testing, and visualizes the results is coupled only by shared data types, so each part can easily be used separately if needed (for example, to build a performance regression testing pipeline).
* - while benchstats currently has a built-in parser only for Google Benchmark-produced .json files, it's very simple to write and use a parser for any other data source as long as you know how to interpret it. See the "Custom data sources" paragraph and the --files_parser CLI argument below.
** - the solidity of the answer still depends heavily on how well the underlying hardware and OS are quiesced, and on how inherently non-deterministic the algorithms under benchmark are. How to achieve these benchmarking basics isn't covered here; however, benchstats does have a mode that helps to test whether the benchmarking machine is stable enough, see the --expect_same CLI argument below.
- Really, why bother? Can't you just measure and compare the run time of code variant A against the run time of code variant B?
Well, you can, but to see that this is meaningless, just run your benchmark several times. The numbers will never be the same; they'll fluctuate. Which particular number should you use for the comparison then?
- OK, then why not take the average over a certain number of benchmark executions and compare these averages?
If you repeat that experiment several times, you'll see that it suffers from the very same problem as above: the number you get still fluctuates, even though it usually fluctuates less. You'll obtain two such numbers for the benchmark alternatives and one will be bigger than the other. How would you know that the difference is large enough to be attributed to real differences in the code rather than to random non-determinism?
- Ah, but that's simple! In addition to the average value, also compute a standard deviation, and if the averages are within one, two, or however many standard deviations of each other, declare that the results aren't different; otherwise one is significantly less or greater than the other. If you want to be extra sure, go with a "6 sigma" gold standard, just like they do in physics! Google Benchmark even computes the standard deviation on its own when you run a benchmark with repetitions! Easy!
... and even more wrong than if you had just compared two individual measurements directly. Reasoning in terms of "so many standard deviations apart" is applicable (read: makes sense) only for random variables that follow a Normal/Gaussian distribution. Measurements you get when benchmarking are anything but normally distributed, so a value computed as their "standard deviation" just doesn't describe what it is expected to describe: the real spread of the distribution of measurements.
None of the above is a good idea. But what is? Let's step back and think about what we want to learn from comparing benchmark results. We want to learn whether code variant A is faster, i.e. has a smaller run time, than code variant B. But we can't just measure their run times and compare them directly, because each measurement is essentially a random variable: even if the algorithms are perfectly deterministic, the hardware inherently isn't, so instead of measuring a run time precisely, we always sample a random value from some unknown distribution. Hence, the only correct way to go is to collect many such samples for distributions A and B, and then compare the two sets of samples against each other. If it's more likely to find a smaller value in set A than in set B, then A is faster than B.
But how do you do that? It's a well-known problem in statistics, and several solutions exist, such as the Mann–Whitney U test and the newer, more general Brunner–Munzel test. benchstats is essentially a convenience wrapper around these tests, with fully customizable inputs and results visualization/exporting, letting you get the answers effortlessly.
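For illustration only (made-up numbers, not the benchstats API): this is roughly what such a comparison looks like if you call these tests directly via SciPy on two sets of per-repetition timings. benchstats adds data parsing, multi-metric handling, and report rendering on top of this kind of test.

```python
import numpy as np
from scipy.stats import brunnermunzel, mannwhitneyu

# Two made-up sets of per-repetition timings (seconds) for code variants A and B.
rng = np.random.default_rng(0)
timings_a = rng.lognormal(mean=-7.55, sigma=0.05, size=10)
timings_b = rng.lognormal(mean=-7.60, sigma=0.05, size=10)

alpha = 0.001  # significance level (the same default benchstats uses)

# Brunner-Munzel: is there a significant difference between the two samples?
res = brunnermunzel(timings_a, timings_b)
print(f"Brunner-Munzel p-value: {res.pvalue:.4g}",
      "-> different" if res.pvalue < alpha else "-> no significant difference")

# Mann-Whitney U, one-sided: are A's timings stochastically smaller than B's
# (i.e. is A faster)?
res = mannwhitneyu(timings_a, timings_b, alternative="less")
print(f"Mann-Whitney U p-value (A < B): {res.pvalue:.4g}")
```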
pip install benchstats
For command line use, see the "Examples" and "Command Line Interface Reference" sections below. For an API overview, see the corresponding section below.
(I'll use a Google Benchmark-derived binary as a data source for the examples, but remember that any data source can easily be plugged in and used instead. For conciseness, I'll shorten "Google Benchmark" to simply "GBench".)
Note that in all cases you should run at least 10 repetitions of a benchmark (--benchmark_repetitions=10). Even 2 repetitions would work, but the more data you have, the more reliable the results are, and at least 10 is a reasonable minimum. The --benchmark_enable_random_interleaving flag is highly recommended too.
Let's consider two use cases to illustrate how to use benchstats.
For example, you already have a benchmark in the main code branch and want to ensure that a feature branch, which modifies some underlying code, doesn't introduce unexpected performance regressions. How to do that?
First, run the benchmark from the main branch with the --benchmark_out=./my_main.json argument to save the results to the ./my_main.json file:
benchmark_binary --benchmark_repetitions=10 \
    --benchmark_enable_random_interleaving \
    --benchmark_out=./my_main.json
Then check out the feature code branch and similarly run the same benchmark, saving the results to the ./my_feature.json file.
Let benchstats analyze the results:
benchstats ./my_main.json ./my_feature.json real_time
The first two positional arguments of the benchstats module are just file paths, and the third argument is a JSON field name that contains the metric value used for comparison (real_time is the default field/metric, so it doesn't even have to be specified, by the way). GBench typically saves two metrics: real_time (shown as just Time in reports), which is the average time of one iteration, and cpu_time (shown as CPU in reports), which is the average time the CPU spent executing user-mode code. GBench also supports custom timers, and you can use them here as well. Typically, a custom timer has the same field name as the name that was given to the counter (in case of doubt, just peek into the .json file).
If the benchmark_binary had only a single benchmark, the result of the command could look something like this:
benchstats presents results as a table: the first column corresponds to a benchmark name and all other columns show the results of applying a statistical test to the corresponding metric (yes, you can supply many metrics at once and all of them will be analyzed; try appending cpu_time to the command line above).
Here's what a data cell of the real_time column means: there were 10 samples of alternative A (from the first positional argument, my_main.json) and 10 samples of alternative B (from the second positional argument, my_feature.json), hence the (10 vs 10) text at the end. The mean value (hence (means) in the column title) for alternative A over its 10 samples was 516.5 microseconds, while the mean for B was 502.8 microseconds. The test used significance level 0.001 (hence the string alpha=0.00100 in the table title), and at that level it declared that the two sets aren't different from each other (hence the ~ symbol between the mean values). The actual p-value for tests that didn't find a difference is not shown by default; only blank space is allocated for it (so no text shift happens when some benchmarks show differences and others don't), but you can pass the --always_show_pvalues flag to show it.
To show how a benchstats report could look when a significant difference is found, I'll rerun the same analysis with a less restrictive significance level (and use the opportunity to show the effects of some of the above-mentioned flags):
benchstats ./my_main.json ./my_feature.json real_time cpu_time \
--always_show_pvalues --alpha=0.03
The main change is that now the test reports that, according to the real_time metric, the benchmark that generated ./my_main.json ran significantly slower than the benchmark that made ./my_feature.json. Because of that, the script exited with exit code 1, letting one detect that human intervention might be needed. But what are these "main metrics" that the warning string refers to? Main metrics are the metrics in which any detected difference makes the script exit with error code 1. By default, the metric mentioned first in the command-line list of metrics is the main metric, but this can be changed with the --main_metrics argument, which accepts a list of indices into the metrics list. Let's make cpu_time the main metric and see what changes:
benchstats ./my_main.json ./my_feature.json real_time cpu_time \
--always_show_pvalues --alpha=0.03 --main_metrics 1
Now cpu_time is the main metric and the difference in it isn't significant. real_time is now a secondary metric, which is only reported but doesn't influence the script's exit code.
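Because a significant difference in a main metric makes the script exit with code 1, the comparison slots naturally into a regression-testing pipeline. Here is a minimal sketch, assuming the benchstats command is on PATH and reusing the file names from the example above:

```python
import subprocess
import sys

# Compare the two result files; cpu_time (index 1) is the main metric, so only
# a significant cpu_time difference makes benchstats exit with a non-zero code.
result = subprocess.run(
    [
        "benchstats", "./my_main.json", "./my_feature.json",
        "real_time", "cpu_time", "--main_metrics", "1",
    ]
)

if result.returncode != 0:
    print("Possible performance regression detected, please review the report above.")
    sys.exit(result.returncode)
```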
For another example, say we are optimizing a function and want to compare how a new implementation performs against the old one. We've created an individual benchmark for each implementation and, for convenience, placed them in the same binary. That implies the benchmark functions must have different names. Now, if we run the benchmark binary to obtain a ./my.json results file and then follow the approach above blindly, we won't get what we need: benchstats compares a benchmark from source 1 to the benchmark with the same name in source 2, so we'd end up comparing old vs old in one row and new vs new in the other, just like this:
benchstats ./my.json ./my.json real_time
To solve that, benchstats has two features based on Python's standard regular expression engine:
- benchmark name filtering, controlled per source by the --filter1 or --filter2 flags: only benchmarks whose names match the given regular expression are loaded from the corresponding source;
- benchmark name transformation: the --from flag (also --from1 for source 1 only and --from2 for source 2 only) sets a regular expression pattern to find in each benchmark name, and the optional --to flag (also --to1 and --to2) sets a replacement string to substitute for the pattern found.
This allows us to make a plan:
- load source 1 from ./my.json and take only benchmarks whose names don't contain the Old substring, using the regexp ^(?!.*?Old).*$;
- load source 2 from the same ./my.json and take only benchmarks whose names contain the Old substring; the regexp is the trivial literal string Old;
- strip the Old prefix from the source 2 benchmark names so they match the names in source 1.
benchstats ./my.json ./my.json real_time \
--filter1='^(?!.*?Old).*$' --filter2=Old \
--from2='(?:Old)(.*?)' --to2=\\1
Note the single quotes (') around the --filter1 and --from2 regexps. They contain special characters that bash would otherwise try to interpret, producing an error. Single quotes prevent bash expansion (double quotes " don't!) and pass the strings to the program as-is. Also note that the replacement string for --to2 has its backslash escaped for that very reason: \\1 instead of \1. Single quotes are an option there too.
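Since the name transformation uses Python-flavor regular expressions, you can preview what a --from2/--to2 pair will do to your benchmark names with re.sub before running the comparison. The benchmark names below are made up for illustration:

```python
import re

# Same pattern and replacement as --from2='(?:Old)(.*?)' --to2=\1 above.
pattern = r"(?:Old)(.*?)"
replacement = r"\1"

# Hypothetical benchmark names that --filter2=Old would have selected.
for name in ["OldBM_Foo/64", "OldBM_Foo/1024"]:
    print(name, "->", re.sub(pattern, replacement, name))
    # OldBM_Foo/64 -> BM_Foo/64
    # OldBM_Foo/1024 -> BM_Foo/1024
```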
Now we get a proper result:
While benchstats today supports only Google Benchmark-produced .json files out of the box, it's intended to be able to run statistical tests on any data source. If you know how to parse your custom file format, connect to a remote database, generate data on the fly, or obtain data any other way, you can easily let benchstats run statistical tests on it by writing a custom parser.
Any Python class (a parser) named X, saved to a file X.py, can be used as a custom data source for the benchstats CLI, provided that it:
- is derived from the benchstats.common.ParserBase class;
- has a constructor accepting the following arguments:
  - source_id - a string identifier describing a data source within the parser's controlled namespace. It is obtained from the first or the second positional argument of the benchstats module CLI. For the built-in parser_GbenchJson this is a source file name.
  - filter - a string identifier of a filter used to restrict the set of benchmarks to load. It is obtained from the value of the --filter1 or --filter2 CLI flags. For parser_GbenchJson it's a string with a regular expression to use for filtering benchmark names.
  - metrics - a non-empty list or tuple of string identifiers corresponding to the metrics to read for the benchmarks. It is obtained from the third and subsequent CLI positional arguments.
  - debug_log - an instance of the benchstats.common.LoggingConsole class that provides a console logging interface to report errors, if any (built-in parsers should also support None and bool values and instantiate a LoggingConsole on their own, but support for this isn't needed for a custom parser);
- has a def getStats(self) method returning a dictionary that maps strings describing benchmark names to dictionaries linking metric names with np.ndarray() of metric values.
Just pass a path to X.py as the value of the --files_parser argument (or --file1_parser or --file2_parser if you want to use different parsers for source 1 and source 2).
Imagine you have a one-column CSV file with data on which you want to run a statistical significance test. Save the following to ./myCSV.py:
```python
import numpy as np

from benchstats.common import ParserBase


class myCSV(ParserBase):
    def __init__(self, fpath, filter, metrics, debug_log=None) -> None:
        # Load the whole one-column CSV as an array of float64 values.
        self.stats = np.loadtxt(fpath, dtype=np.float64)

    def getStats(self) -> dict[str, dict[str, np.ndarray]]:
        # A single hardcoded benchmark name ("bm") with a single metric ("real_time").
        return {"bm": {"real_time": self.stats}}
```
There's only one hardcoded benchmark name, bm, and only one metric, real_time, which is read from the CSV, but you get the idea. It's that simple.
Now just run benchstats, passing a path to the parser via the --files_parser flag:
benchstats ./src1.csv ./src2.csv --files_parser ./myCSV.py
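If you want to try this end to end without real data, you can generate two synthetic one-column CSV files first (purely made-up numbers, with the second set deliberately about 5% slower on average):

```python
import numpy as np

rng = np.random.default_rng(42)

# Two synthetic sets of "measurements" for ./src1.csv and ./src2.csv.
np.savetxt("./src1.csv", rng.normal(loc=1.00, scale=0.02, size=30))
np.savetxt("./src2.csv", rng.normal(loc=1.05, scale=0.02, size=30))
```

Running the benchstats command above on these files should then report a significant difference for the real_time metric.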
If you write a parser that might be usable by other people, please consider adding it to the project's set of built-in parsers by opening a suggestion at https://github.com/Arech/benchstats/issues or by making a PR into the repo.
You can always get the actual CLI help output by passing the --help flag to a benchstats invocation. Note that the --help output takes precedence over the description below.
usage: __main__.py [-h] [--show_debug | --no-show_debug]
[--files_parser <files parser class or path>]
[--file1_parser <file1 parser class or path>]
[--filter1 <reg expr>]
[--file2_parser <file2 parser class or path>]
[--filter2 <reg expr>]
[--from <reg expr>] [--to <replacement string>]
[--from1 <reg expr>] [--to1 <replacement string>]
[--from2 <reg expr>] [--to2 <replacement string>]
[--main_metrics <metric idx> [<metric idx> ...]]
[--method <method id>]
[--alpha <positive float less than 0.5>]
[--bonferroni]
[--always_show_pvalues]
[--sample_sizes | --no-sample_sizes]
[--sample_stats <stat ids> [<stat ids> ...]]
[--metric_precision <int>=3>]
[--expect_same]
[--colors | --no-colors]
[--multiline | --no-multiline]
[--export_to <path/to/export_file>]
[--export_fmt <format id>]
[--export_light]
<path/to/file1> <path/to/file2> [metrics ...]
-h, --help
show this help message and exit.
--show_debug, --no-show_debug
Shows some additional debugging info. Default: True.

Arguments describing how the source data sets are obtained and optionally transformed:

<path/to/file1>
Path to the first data file with benchmark results. See also the --file1_parser and --filter1 arguments.
<path/to/file2>
Path to the second data file with benchmark results. See also the --file2_parser and --filter2 arguments.
--files_parser <files parser class or path>
Sets the files parser class identifier if a built-in parser is used (options are: GbenchJson), or sets a path to a .py file defining a custom parser inherited from the benchstats.common.ParserBase class. The parser class name must be the same as the file name.
--file1_parser <file1 parser class or path>
Same as --files_parser, but only applies to file1.
--filter1 <reg expr>
If specified, sets a Python regular expression to select benchmarks by name from <file1>.
--file2_parser <file2 parser class or path>
Same as --files_parser, but only applies to file2.
--filter2 <reg expr>
Same as --filter1, but for <file2>.
--from <reg expr>
The --from and --to arguments work as a pair and describe a benchmark name transformation. --from sets a regular expression (Python re module flavor) defining a pattern to replace with the --to replacement string. Typical use cases include removing unnecessary parts of benchmark names (for example, Google Benchmark adds some suffixes that might not convey useful information) or "gluing" results of different benchmarks so they become comparable by the tool (for example, you have two benchmarks, the first is 'old_foo' with the old algorithm implementation and the other is 'foo' with a new competing implementation; to compare their performance against each other, you need to remove the 'old_' prefix from the first benchmark with --from old_ and apply --filter1 and --filter2 to restrict loading to only the corresponding data for the old or new variation). Remember to escape characters that have special meaning for the shell.
--to <replacement string>
See help for --from.
--from1 <reg expr>
Like --from, but for <file1> only.
--to1 <replacement string>
Like --to, but for <file1> only.
--from2 <reg expr>
Like --from, but for <file2> only.
--to2 <replacement string>
Like --to, but for <file2> only.
metrics
List of metric identifiers to use in tests for each benchmark. Default: real_time. For deterministic algorithms it's highly recommended to measure and use the minimum latency per repetition.
--main_metrics <metric idx> [<metric idx> ...]
Indices into the metrics list specifying the main metrics. These are displayed differently, and detected differences cause the script to exit(1). Default: 0.
--method <method id>
Selects the method of statistical testing. Possible values are: brunnermunzel for the Brunner-Munzel test and mannwhitneyu for the Mann-Whitney U test. Default is brunnermunzel.
--alpha <positive float less than 0.5>
Sets the statistical significance level (the desired probability of mistakenly reporting a difference). Default: 0.001.
--bonferroni
If set, applies a Bonferroni multiple comparisons correction.

Controls how results are rendered:

--always_show_pvalues
If set, always show p-values. By default, p-values are shown only for significant differences. Note that when two sets compare as stochastically the same (~), or more precisely, as not stochastically less and not stochastically greater (see "Stochastic ordering"), the p-value shown is the minimum of the two p-values for the 'less' and 'greater' comparisons.
--sample_sizes, --no-sample_sizes
Controls whether to show the sizes of the data sets used in a test. Default: True.
--sample_stats <stat ids> [<stat ids> ...]
Sets which additional statistics about the compared sample sets to show. Could be any sequence of floats (each in the range [0,100], designating a percentile to calculate) or the aliases extremums, median, iqr (interquartile range), std (standard deviation). Shortened forms are accepted. Using std isn't recommended, though, as standard deviation doesn't really mean much for most distributions beyond the Normal one.
--metric_precision <int>=3>
Total number of digits in a reported metric value. Minimum is 3. Default is 4.
--expect_same
If set, assumes that the distributions are the same (i.e. the H0 hypothesis is true) and shows some additional statistics useful for checking that the benchmark code is stable enough, or that the machine is quiesced enough. A good example is when the two sources <file1> and <file2> were produced by exactly the same binary running on exactly the same machine in the same state: on a properly set up machine, the tests shouldn't find a difference.
--colors, --no-colors
Controls whether the output should be colored. Default: True.
--multiline, --no-multiline
Controls whether benchmark results may span several lines. Default: False.

Controls how to export results:

--export_to <path/to/export_file>
Path to a file to store the comparison results in.
--export_fmt <format id>
Format of the export file. Options are: txt, svg, html. If not set, it is inferred from the --export_to file extension.
--export_light
If set, uses a light theme instead of the dark one.

The code is pretty trivial, so I'll post only a high-level overview: there are 3 major components interacting in benchstats when it runs in CLI mode:
1. Data source parsers, selected with the --files_parser, --file1_parser or --file2_parser CLI flags. Built-in parser classes have a parser_ prefix; what goes after the prefix is the parser identifier for the CLI --file?_parser flag. A parser reads its data source and produces a dict[str, dict[str, np.ndarray]] mapping from benchmark names to metric names to metric values. It's a class only for the simplicity of writing the actual code and of validating user-supplied parsers.
2. The statistical testing function compareStats() from benchstats.compare. It takes two dict[str, dict[str, np.ndarray]] objects with the source data plus some control arguments, runs statistical tests on all metric values of the dictionaries that share the same benchmark name (key of the outermost dictionary) and metric name (key of the innermost dictionary), and then returns a benchstats.compare.CompareStatsResult namedtuple object with comparison results for the benchmarks+metrics common to both source data sets.
3. The rendering function renderComparisonResults() from benchstats.render. It takes the benchstats.compare.CompareStatsResult namedtuple produced by compareStats(), plus various control parameters, and renders it according to the requested rules and styles.

The project uses the semantic versioning scheme. Depending on feedback, the project or its individual components might get breaking changes, so prefer version pinning to prevent unexpected breakages. See CHANGELOG.md for details.