Scors

Binary classification scores in Rust (version 0.1.8, on PyPI and crates.io)

This package is a Rust re-implementation, with Python bindings, of some of the classification scores from scikit-learn (sklearn), restricted to binary classification. Scores generally take three input parameters for labels, predictions, and weights, named slightly differently than in sklearn:

sklearn          scors
y_true           labels
y_score          predictions
sample_weight    weights

Functions in scors have an additional parameter order that can be:

  • None (the default) to indicate unsorted data,
  • Order.ASCENDING to indicate that the input data is sorted in ascending order with respect to the predictions, or
  • Order.DESCENDING to indicate that the input data is sorted in descending order with respect to the predictions.
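The point of the order parameter can be sketched in plain Python (a hypothetical helper, not the actual scors internals or its Rust implementation): normalize the data to descending-prediction order, sorting only when the caller gave no ordering guarantee.

```python
from enum import Enum

class Order(Enum):
    ASCENDING = 1
    DESCENDING = 2

def sort_by_predictions(labels, predictions, order=None):
    """Return (labels, predictions) in descending-prediction order,
    sorting only when the caller gave no ordering guarantee."""
    if order is Order.DESCENDING:
        # Already in the order the rank-based metrics need.
        return labels, predictions
    if order is Order.ASCENDING:
        # Just reverse; no O(n log n) sort required.
        return labels[::-1], predictions[::-1]
    # order is None: fall back to a full descending sort by prediction.
    idx = sorted(range(len(predictions)),
                 key=lambda i: predictions[i], reverse=True)
    return [labels[i] for i in idx], [predictions[i] for i in idx]
```

For presorted input the sorted/reversed branches are cheap, which is exactly the cost the order parameter lets the caller avoid.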

Other parameters that may be present (e.g. max_fprs in roc_auc) follow the naming and meaning of their respective sklearn counterparts.

Why?

I want to improve the runtime performance of scores for my use case. I have a single large background sample that I combine and score with each of many small foreground samples. For the rank-based metrics (e.g. average_precision_score), the data is sorted, which has O(n log n) complexity. Exploiting the structure of my data lets me avoid this cost and boost performance. But even without assumptions about structure in the data, I found ways to improve performance. This is a summary of all the optimizations I implemented (or plan to):

  • Add option to assume the data is sorted. This allows the caller to exploit structure in data that is already sorted/mostly sorted.
  • Remove checks on labels. When stepping through the debugger, I noticed that the sklearn implementation uses safeguards like np.unique to check the validity of the data. This can be helpful to ensure that assumptions are always met, especially in a library with a huge audience and general scope like sklearn. But it also carries a performance penalty. I decided to place the responsibility for data validation completely on the caller, who can add or leave out validation as appropriate.
  • Minimize data allocation:
    • All current scores are implemented as a single pass over the data (a double pass in the case of ROC AUC with max_fprs). For ordered input data, no allocations are made.
    • If the optional weights parameter is not provided, no extra constant array filled with 1 is created. Instead, the Rust implementation uses a constant value iterator.
    • TODO: For unordered input data, I currently create sorted copies of all of the input data. That is a total of 4 (3 if weights are not provided) extra allocations for the index array (for sorting), labels, predictions, and weights. Instead of creating copies of the input arrays, I consider creating index views that simply index into the original arrays through the sorted index array. This may provide another performance benefit, but I still have to benchmark this.
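The index-view idea from the last bullet, together with the constant-weight iterator, can be sketched in Python (hypothetical helper names, not the scors API or its Rust implementation): argsort the predictions once, then iterate through the index array instead of materializing sorted copies of labels, predictions, and weights.

```python
def argsort_view(labels, predictions, weights=None):
    """One O(n log n) argsort of the predictions; afterwards every pass
    walks the index array instead of allocating sorted copies."""
    idx = sorted(range(len(predictions)),
                 key=lambda i: predictions[i], reverse=True)

    def rows():
        # Yields (label, prediction, weight) in descending-prediction order.
        # When no weights are given, a constant 1.0 stands in for an
        # allocated all-ones array (in Rust, a constant-value iterator).
        for i in idx:
            w = 1.0 if weights is None else weights[i]
            yield labels[i], predictions[i], w
    return rows

rows = argsort_view([0, 1, 1], [0.2, 0.8, 0.4])
# list(rows()) -> [(1, 0.8, 1.0), (1, 0.4, 1.0), (0, 0.2, 1.0)]
```

Whether indirect indexing through idx actually beats contiguous sorted copies depends on cache behavior, which is why the bullet above defers the decision to benchmarks.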

Currently Implemented Scores

sklearn                    scors
average_precision_score    average_precision
roc_auc_score              roc_auc
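As a reference for what average_precision computes (a plain-Python sketch of the standard sklearn-style definition, not the Rust code; it assumes distinct prediction values), weighted average precision reduces to a single pass over data sorted by descending prediction:

```python
def average_precision(labels, predictions, weights=None):
    """Weighted average precision over binary labels (0/1),
    assuming distinct prediction values."""
    n = len(labels)
    w = weights if weights is not None else [1.0] * n
    order = sorted(range(n), key=lambda i: predictions[i], reverse=True)
    total_pos = sum(w[i] for i in range(n) if labels[i] == 1)
    tp = fp = ap = 0.0
    for i in order:
        if labels[i] == 1:
            tp += w[i]
            # recall increment at this threshold times precision there
            ap += (w[i] / total_pos) * (tp / (tp + fp))
        else:
            fp += w[i]
    return ap

# Example from the sklearn docs:
# average_precision([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]) -> 0.8333...
```

With a presorted input the argsort line disappears entirely, which is the n log n saving the order parameter is designed for.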

Is It Actually Faster?

TODO: Benchmarks
