Security News
Research
Data Theft Repackaged: A Case Study in Malicious Wrapper Packages on npm
The Socket Research Team breaks down a malicious wrapper package that uses obfuscation to harvest credentials and exfiltrate sensitive data.
Suppose we want to specify a continuous n
-dimensional joint probability distribution by giving:
n
arbitrary marginal distributionsCopulas are a good way to achieve this separate specification of marginal distribution and dependence.
This is a wrapper around the CopulaDistribution
class from the statsmodels
package, providing an interface that I prefer. (The rationale is disused in more detail below).
pip install copula_wrapper # or `poetry add copula_wrapper`
Dictionary arguments and Kendall's tau:
from copula_wrapper import CopulaJoint
from scipy import stats
import seaborn as sns
marginals = {
"consumption elasticity": stats.uniform(0.75, 3),
"market return": stats.lognorm(0.05, 0.05),
"risk of war": stats.beta(1, 50)
}
tau = {
# Missing pairs are assumed to be independent
("risk of war", "market return"): -0.5,
}
# Instantiate the (frozen) joint distribution
dist = CopulaJoint(marginals, kendall_tau=tau)
# Query the CDF or PDF
p = dist.cdf({
"consumption elasticity": 1.5,
"market return": 1.5,
"risk of war": 0.5
})
# Sample randomly
sample = dist.rvs(100_000)
# Plot
sns.pairplot(sample, plot_kws={"s": 1})
Array arguments and Spearman's rho:
import pandas as pd
import numpy as np
from copula_wrapper import CopulaJoint
from scipy import stats
import seaborn as sns
marginals = [
stats.uniform(0.75, 3),
stats.lognorm(0.05, 0.05),
stats.beta(1, 50)
]
spearman = np.eye(3)
spearman[1, 2] = spearman[2, 1] = -0.5
# Instantiate the (frozen) joint distribution
dist = CopulaJoint(marginals, spearman_rho=spearman)
# Query the CDF or PDF
p = dist.cdf([1.5, 1.5, 0.5])
# Sample randomly
sample = dist.rvs(10_000)
# Plot
sns.pairplot(pd.DataFrame(sample), plot_kws={"s": 1})
CopulaJoint
is a wrapper around CopulaDistribution
from statsmodels
.
The context that motivated this wrapper was:
statsmodels
).In statsmodels
, you'd do something like the following
from statsmodels.distributions.copula.api import CopulaDistribution, GaussianCopula
from scipy import stats
corr = [
[1.0, 0.5],
[0.5, 1.0],
]
joint = CopulaDistribution(GaussianCopula(corr=corr), marginals=[stats.beta(2,2), stats.norm()])
where corr
is the matrix of Pearson's rho correlations (the standard linear correlation coefficient most people are familiar with). IMO, this surfaces implementation details and requires the user to engage with the internals of their joint distribution, which they may not understand or care about. It also invites the misunderstanding that the joint distribution will have a Pearson's rho matching the corr
we pass in. This is not the case, as the following example shows:
import numpy as np
from scipy import stats
from statsmodels.distributions.copula.api import CopulaDistribution, GaussianCopula
c = 0.8
corr = [
[1.0, c],
[c, 1.0],
]
joint = CopulaDistribution(GaussianCopula(corr=corr),
marginals=[stats.norm(2, 2), stats.lognorm(loc=0, s=1)])
samples = joint.rvs(int(1e6))
joint_r = stats.pearsonr(samples[:, 0], samples[:, 1])[0] # about 0.6
assert not np.isclose(joint_r, c)
Therefore, this wrapper instead takes rank correlations as input, and these are satisfied by the resulting joint distribution. It converts these to the appropriate correlation measure for the underlying copula. You must explicitly provide kendall_tau
or spearman_rho
.
(For now, only the Gaussian copula is supported. But I've written the class to be extensible easily and even put some hints as code comments. PRs welcome. If more copulas are supported, there will be a parameter family
with values like "Gaussian"
, "Clayton"
...)
if dimensions are to be named, marginals
must be a dict, and rank_corr
must be a dict
of tuples, e.g.
from scipy import stats
marginals = {
"a": stats.uniform(0, 1),
"b": stats.lognorm(2, 3),
"c": stats.norm(4, 5),
}
rank_corr = {
# Missing pairs are assumed to be independent
("a", "b"): 0.5,
}
and the first argument to .pdf
, .cdf
, and .logpdf
must be a dict with the same keys as marginals
.
(I originally had these methods support dimensions as keyword arguments, but I thought it was better not to
modify the API even more, plus dictionary keys can be any hashable object, not just strings.).
If dimensions are named,
.rvs
returns a Pandas DataFrame with dimension names as column names.
Otherwise, marginals
must be a list and rank_corr
must be a square matrix, and .rvs
returns a numpy array, just like in statsmodels.
For background, the Scipy pattern is that there is e.g. a norm_gen
subclass of rv_continuous
that is used to create the
norm
instance, which can then optionally be 'frozen'.
The statsmodels CopulaDistribution
does not follow this pattern. It acts 'frozen' in sofar
as the underlying copula (e.g. Gaussian) and its correlation matrix are specified at
initialization. However, it acts 'unfrozen' in that cop_args
and marg_args
can
be passed to .pdf
, .cdf
, and .logpdf
. As a result, CopulaDistribution
does not subclass
any of the scipy.stats
classes.
I find this a bit inconsistent, and don't find cop_args
and marg_args
useful, so I
drop them in this wrapper. The benefit is that the wrapper now satisfies the interface of
a SciPy multivariate frozen distribution.
This means that the marginals provided to CopulaJoint
must be frozen as well.
It also means I use size
(SciPy) instead of nobs
(statsmodels) for the call to rvs
.
While the copula tests in statsmodels
are likely to be good for their intended purpose, I couldn't quickly decipher
exactly what behaviour these tests were checking for. In order to make sure this wrapper behaves in the way that its
interface promises, I created a set of end-to-end tests. These tests that check that random samples from the
joint distribution have:
The tests are parametrized (run for many inputs).
As they involve large numbers of random samples, they are slow.
There are also a few simple unit tests.
FAQs
Top-level package for statsmodels copula wrapper.
We found that copula-wrapper demonstrated a healthy version release cadence and project activity because the last version was released less than a year ago. It has 1 open source maintainer collaborating on the project.
Did you know?
Socket for GitHub automatically highlights issues in each pull request and monitors the health of all your open source dependencies. Discover the contents of your packages and block harmful activity before you install or update your dependencies.
Security News
Research
The Socket Research Team breaks down a malicious wrapper package that uses obfuscation to harvest credentials and exfiltrate sensitive data.
Research
Security News
Attackers used a malicious npm package typosquatting a popular ESLint plugin to steal sensitive data, execute commands, and exploit developer systems.
Security News
The Ultralytics' PyPI Package was compromised four times in one weekend through GitHub Actions cache poisoning and failure to rotate previously compromised API tokens.