
topn

This package speeds up group-wise nlargest sorting.

0.0.7
PyPI

Maintainers: 1

topn

Cython utility functions to use in place of pandas' SeriesGroupBy.nlargest(), which pandas performs slowly.

Contains 3 functions:

awesome_topn(),
awesome_hstack_topn(),
awesome_hstack() (for CSR matrices only; at least twice as fast as scipy.sparse.hstack in scipy version 1.6.1)

See Short Description for details.

This is how a group-wise top-n selection may be done with pandas:

import pandas as pd
import numpy as np

r = np.array([0, 1, 2, 1, 2, 3, 2]) 
c = np.array([1, 1, 0, 3, 1, 2, 3]) 
d = np.array([0.3, 0.2, 0.1, 1.0, 0.9, 0.4, 0.6]) 
rcd = pd.DataFrame({'r': r, 'c': c, 'd': d})
rcd

	r	c	d
0	0	1	0.3
1	1	1	0.2
2	2	0	0.1
3	1	3	1.0
4	2	1	0.9
5	3	2	0.4
6	2	3	0.6

ntop = 2

rcd.set_index('c').groupby('r')['d'].nlargest(ntop).reset_index().sort_values(['r', 'd'], ascending = [True, False])

	r	c	d
0	0	1	0.3
1	1	3	1.0
2	1	1	0.2
3	2	1	0.9
4	2	3	0.6
5	3	2	0.4

Usage

from topn import awesome_topn

o_r, o_c, o_d = awesome_topn(r, c, d, ntop, n_jobs=7)
pd.DataFrame({'r': o_r, 'c': o_c, 'd': o_d})

	r	c	d
0	0	1	0.3
1	1	3	1.0
2	1	1	0.2
3	2	1	0.9
4	2	3	0.6
5	3	2	0.4
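
For cross-checking results on small inputs, the same selection can be sketched in plain NumPy. The function name topn_reference is hypothetical and is not part of topn; this is only an illustration of the expected output under the assumption that rows are ordered by r ascending and, within each r, by d descending, as in the tables above.

```python
import numpy as np


def topn_reference(r, c, d, ntop):
    """Hypothetical NumPy reference: keep the ntop largest d per value of r."""
    # Sort by r ascending, then by d descending within each r.
    order = np.lexsort((-d, r))
    r_s, c_s, d_s = r[order], c[order], d[order]
    # Position at which each group of equal r values starts.
    starts = np.r_[0, np.flatnonzero(np.diff(r_s)) + 1]
    sizes = np.diff(np.r_[starts, len(r_s)])
    # Rank of each element inside its own group (0 = largest d).
    rank = np.arange(len(r_s)) - np.repeat(starts, sizes)
    keep = rank < ntop
    return r_s[keep], c_s[keep], d_s[keep]


r = np.array([0, 1, 2, 1, 2, 3, 2])
c = np.array([1, 1, 0, 3, 1, 2, 3])
d = np.array([0.3, 0.2, 0.1, 1.0, 0.9, 0.4, 0.6])

ref_r, ref_c, ref_d = topn_reference(r, c, d, ntop=2)
```

On this data the three returned arrays match the columns of the table above, so they can be compared element-wise with o_r, o_c and o_d.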

Alternatively, if one had a matrix encoding the above data:

from scipy.sparse import csr_matrix 

csr = csr_matrix((d, (r, c)), shape=(4, 4))

then one could use the function awesome_hstack_topn() instead:

from topn import awesome_hstack_topn 

topn_matrix = awesome_hstack_topn([csr], ntop=ntop)
o_r, o_c = topn_matrix.nonzero()
o_d = topn_matrix.data
pd.DataFrame({'r': o_r, 'c': o_c, 'd': o_d})

	r	c	d
0	0	1	0.3
1	1	3	1.0
2	1	1	0.2
3	2	1	0.9
4	2	3	0.6
5	3	2	0.4

Short Description

Contains 3 functions:

awesome_topn(),
awesome_hstack_topn(),
awesome_hstack()

def awesome_topn(r, c, d, ntop, n_rows=-1, n_jobs=1):
    """
    r, c, and d are 1D numpy arrays, all of the same length N.
    This function returns arrays rn, cn, and dn of length n <= N such
    that the set of triples {(rn[i], cn[i], dn[i]) : 0 <= i < n} is a subset of
    {(r[j], c[j], d[j]) : 0 <= j < N} and, for every distinct value
    x = rn[i], dn[i] is among the ntop largest d[j]'s whose r[j] = x
    (or among all of them, if fewer than ntop exist).

    Input:
        r and c: two 1D integer arrays of the same length
        d: 1D array of single or double precision floating point type of the
            same length as r and c
        ntop: maximum number of largest d's returned for each distinct
            value of r
        n_rows: an int. If > -1, output rn is replaced with Rn, the
            index pointer array for the compressed sparse row (CSR) matrix
            whose elements are {C[rn[i], cn[i]] = dn[i] : 0 <= i < n}.  This
            matrix will have n_rows rows, so the length of Rn is
            n_rows + 1
        n_jobs: number of threads, must be >= 1

    Output:
        (rn, cn, dn) where rn, cn, dn are all arrays as described above, or
        (Rn, cn, dn) where Rn is described above
        
    """
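
The relation between rn and the index pointer array Rn described in the docstring can be illustrated with NumPy and SciPy alone. This is a sketch of the standard CSR layout, not the library's implementation, and it assumes rn is already grouped by row in ascending order; the values are the top-2 triples from the Usage example.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Top-2 triples from the Usage example, grouped by row.
rn = np.array([0, 1, 1, 2, 2, 3])
cn = np.array([1, 3, 1, 1, 3, 2])
dn = np.array([0.3, 1.0, 0.2, 0.9, 0.6, 0.4])
n_rows = 4

# Number of stored elements in each row, including rows with none.
counts = np.bincount(rn, minlength=n_rows)
# Index pointer: row i occupies positions indptr[i]:indptr[i + 1] of cn and dn.
indptr = np.concatenate(([0], np.cumsum(counts)))

# The (data, indices, indptr) form builds the CSR matrix directly.
matrix = csr_matrix((dn, cn, indptr), shape=(n_rows, 4))
```

Here indptr has n_rows + 1 entries, which is the length the docstring gives for Rn.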


def awesome_hstack_topn(blocks, ntop, sort=True, use_threads=False, n_jobs=1):
    """
    Returns, in CSR format, the matrix formed by horizontally stacking the
    sequence of CSR matrices in parameter 'blocks', with only the largest ntop
    elements of each row returned.  Also, each row will be sorted in
    descending order only when 
        ntop < total number of columns in blocks or sort=True,
    otherwise the rows will be unsorted.
    
    :param blocks: list of CSR matrices to be stacked horizontally.
    :param ntop: int. The maximum number of elements to be returned for
        each row.
    :param sort: bool. Each row of the returned matrix will be sorted in
        descending order only when ntop < total number of columns in blocks
        or sort=True, otherwise the rows will be unsorted.
    :param use_threads: bool. If True, the multi-threaded version of this
        routine is used; otherwise the single-threaded version is used.
        On multi-core systems, setting this to True can speed things up.
    :param n_jobs: int. When use_threads=True, the number of threads
        to be spawned by the multi-threaded routine. The recommended
        value is the number of cores minus one.

    Output:
        (scipy.sparse.csr_matrix) matrix in CSR format 
    """
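
What awesome_hstack_topn() computes can be sketched with scipy.sparse.hstack and a per-row selection. The function hstack_topn_reference below is hypothetical and not part of topn; it always sorts each row in descending order, ignores the sort, use_threads and n_jobs parameters, and makes no attempt at speed. The two blocks are the left and right halves of the 4 x 4 matrix from the Usage example.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack


def hstack_topn_reference(blocks, ntop):
    """Hypothetical reference: stack blocks, keep the ntop largest per row."""
    stacked = hstack(blocks, format="csr")
    indptr, indices, data = [0], [], []
    for i in range(stacked.shape[0]):
        start, end = stacked.indptr[i], stacked.indptr[i + 1]
        row_cols = stacked.indices[start:end]
        row_vals = stacked.data[start:end]
        # Positions of the ntop largest values, in descending order.
        order = np.argsort(-row_vals, kind="stable")[:ntop]
        indices.extend(row_cols[order])
        data.extend(row_vals[order])
        indptr.append(len(indices))
    return csr_matrix((data, indices, indptr), shape=stacked.shape)


# Columns 0-1 and columns 2-3 of the example matrix, as separate blocks.
left = csr_matrix(([0.3, 0.2, 0.1, 0.9], ([0, 1, 2, 2], [1, 1, 0, 1])), shape=(4, 2))
right = csr_matrix(([1.0, 0.4, 0.6], ([1, 3, 2], [1, 0, 1])), shape=(4, 2))

reference = hstack_topn_reference([left, right], ntop=2)
```

The data and indices arrays of reference then hold the same values and columns as the top-2 table in the Usage section.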


def awesome_hstack(blocks, use_threads=False, n_jobs=1):
    """
    Returns, in CSR format, the matrix formed by horizontally stacking the
    sequence of CSR matrices in parameter blocks.
    
    :param blocks: list of CSR matrices to be stacked horizontally.
    :param use_threads: bool. If True, the multi-threaded version of this
        routine is used; otherwise the single-threaded version is used.
        On multi-core systems, setting this to True can speed things up.
    :param n_jobs: int. When use_threads=True, the number of threads
        to be spawned by the multi-threaded routine. The recommended
        value is the number of cores minus one.

    Output:
        (scipy.sparse.csr_matrix) matrix in CSR format 
    """

Keywords

nlargest hstack csr csc scipy cython
