mRMR, which stands for "minimum Redundancy - Maximum Relevance", is a feature selection algorithm.
The peculiarity of mRMR is that it is a minimal-optimal feature selection algorithm.
This means it is designed to find the smallest relevant subset of features for a given Machine Learning task.
Selecting the minimum number of useful features is desirable for many reasons, which is why a minimal-optimal method such as mRMR is often preferable.
By contrast, most other methods (for instance, Boruta or Positive-Feature-Importance) are classified as all-relevant, since they identify every feature that has some kind of relationship with the target variable.
Due to its efficiency, mRMR is ideal for practical ML applications, where it is necessary to perform feature selection frequently and automatically, in a relatively small amount of time.
For instance, in 2019, Uber engineers published a paper describing how they implemented mRMR in their marketing machine learning platform: “Maximum Relevance and Minimum Redundancy Feature Selection Methods for a Marketing Machine Learning Platform”.
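To make the "minimum redundancy, maximum relevance" idea concrete, here is a minimal sketch of a greedy selection loop. It is not this package's implementation: the function name greedy_mrmr, the use of absolute Pearson correlation for both relevance and redundancy, and the "relevance minus redundancy" score are assumptions chosen to keep the example short.

```python
import numpy as np


def greedy_mrmr(X, y, k):
    """Rank up to k columns of X with a simplified mRMR score (illustrative only).

    relevance  = |Pearson correlation| between a feature and the target
    redundancy = mean |Pearson correlation| with the features already selected
    score      = relevance - redundancy
    """
    n_features = X.shape[1]
    relevance = np.array(
        [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(n_features)]
    )
    feature_corr = np.abs(np.corrcoef(X, rowvar=False))

    selected = []
    remaining = list(range(n_features))
    for _ in range(min(k, n_features)):
        if selected:
            # Average similarity of each candidate to what has been picked so far
            redundancy = feature_corr[np.ix_(remaining, selected)].mean(axis=1)
        else:
            # First pick: nothing selected yet, so only relevance matters
            redundancy = np.zeros(len(remaining))
        scores = relevance[remaining] - redundancy
        best = remaining[int(np.argmax(scores))]
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data: two independent signals, a near copy of the first one, and noise
rng = np.random.default_rng(0)
signal_a = rng.normal(size=500)
signal_b = rng.normal(size=500)
X = np.column_stack([
    signal_a,                                 # column 0: informative
    signal_a + 0.01 * rng.normal(size=500),   # column 1: near copy of column 0
    signal_b,                                 # column 2: informative, independent
    rng.normal(size=500),                     # column 3: pure noise
])
y = signal_a + signal_b

ranking = greedy_mrmr(X, y, k=2)
print(ranking)  # column 2 plus exactly one of columns 0 and 1
```

Columns 0 and 1 are equally relevant, but once one of them is selected the other is penalised as redundant and never makes it into the top 2. An all-relevant method would keep both.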
You can install this package in your environment via pip:
pip install mrmr_selection
And then import it in Python through:
import mrmr
This package is designed to do mRMR selection through different tools, depending on your needs and constraints.
Currently, the following tools are supported (others will be added): Pandas, Polars, Spark, and Google BigQuery.
The package has a module for each supported tool. Each module has at least these 2 functions:
mrmr_classif, for feature selection when the target variable is categorical (binary or multiclass).
mrmr_regression, for feature selection when the target variable is numeric.
Let's see some examples.
You have a Pandas DataFrame (X) and a Series which is your target variable (y).
You want to select the best K features to make predictions on y.
# create some pandas data
import pandas as pd
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=1000, n_features=50, n_informative=10, n_redundant=40)
X = pd.DataFrame(X)
y = pd.Series(y)
# select top 10 features using mRMR
from mrmr import mrmr_classif
selected_features = mrmr_classif(X=X, y=y, K=10)
Note: the output of mrmr_classif is a list containing the K selected features. The list is a ranking, so if you want to narrow the selection further, take its first elements.
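As a small illustration of narrowing a ranked selection, the sketch below keeps only the first 3 entries of a list. The list literal is a made-up stand-in for the output of mrmr_classif, and the toy DataFrame only mimics the integer column labels produced by pd.DataFrame(X) above.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the list returned by mrmr_classif(X=X, y=y, K=10)
selected_features = [7, 2, 9, 0, 4, 11, 5, 3, 8, 1]

# Toy frame with integer column labels 0..11
X = pd.DataFrame(np.zeros((4, 12)))

# The list is ordered best-first, so a prefix is itself a smaller selection
top_3 = selected_features[:3]
X_top_3 = X[top_3]

print(X_top_3.columns.tolist())  # [7, 2, 9]
```

Because the order is preserved, there is no need to rerun the selection with a smaller K just to shrink the feature set.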
# create some polars data
import polars
data = [(1.0, 1.0, 1.0, 7.0, 1.5, -2.3),
(2.0, None, 2.0, 7.0, 8.5, 6.7),
(2.0, None, 3.0, 7.0, -2.3, 4.4),
(3.0, 4.0, 3.0, 7.0, 0.0, 0.0),
(4.0, 5.0, 4.0, 7.0, 12.1, -5.2)]
columns = ["target", "some_null", "feature", "constant", "other_feature", "another_feature"]
# each tuple is one row, so state the orientation explicitly
df_polars = polars.DataFrame(data=data, schema=columns, orient="row")
# select top 2 features using mRMR
import mrmr
selected_features = mrmr.polars.mrmr_regression(df=df_polars, target_column="target", K=2)
# create some spark data
import pyspark
session = pyspark.sql.SparkSession(pyspark.context.SparkContext())
data = [(1.0, 1.0, 1.0, 7.0, 1.5, -2.3),
(2.0, float('NaN'), 2.0, 7.0, 8.5, 6.7),
(2.0, float('NaN'), 3.0, 7.0, -2.3, 4.4),
(3.0, 4.0, 3.0, 7.0, 0.0, 0.0),
(4.0, 5.0, 4.0, 7.0, 12.1, -5.2)]
columns = ["target", "some_null", "feature", "constant", "other_feature", "another_feature"]
df_spark = session.createDataFrame(data=data, schema=columns)
# select top 2 features using mRMR
import mrmr
selected_features = mrmr.spark.mrmr_regression(df=df_spark, target_column="target", K=2)
# initialize BigQuery client
from google.cloud.bigquery import Client
bq_client = Client(credentials=your_credentials)
# select top 20 features using mRMR
import mrmr
selected_features = mrmr.bigquery.mrmr_regression(
    bq_client=bq_client,
    table_id='bigquery-public-data.covid19_open_data.covid19_open_data',
    target_column='new_deceased',
    K=20
)
For an easy-going introduction to mRMR, read my article on Towards Data Science: “MRMR” Explained Exactly How You Wished Someone Explained to You.
Also, this article describes an example of mRMR used on the world-famous MNIST dataset: Feature Selection: How To Throw Away 95% of Your Data and Get 95% Accuracy.
mRMR was introduced in 2003. This is the original paper: Minimum Redundancy Feature Selection From Microarray Gene Expression Data.
Since then, it has been used in many practical applications, due to its simplicity and effectiveness.