sidetable

sidetable builds simple but useful summary tables of your data

  • 0.9.1
  • PyPI

sidetable started as a supercharged combination of pandas value_counts plus crosstab that builds simple but useful summary tables of your pandas DataFrame. It has since expanded to provide support for common and useful pandas tasks such as adding subtotals to your DataFrame or flattening hierarchical columns.

Usage is straightforward. Install and import sidetable. Then access it through the new .stb accessor on your DataFrame.

For the Titanic data: df.stb.freq(['class']) will build a frequency table like this:

    class  count  percent  cumulative_count  cumulative_percent
0   Third    491  55.1066               491             55.1066
1   First    216  24.2424               707              79.349
2  Second    184   20.651               891                 100

You can also summarize missing values with df.stb.missing():

             missing  total   percent
deck             688    891   77.2166
age              177    891   19.8653
embarked           2    891  0.224467
embark_town        2    891  0.224467
survived           0    891         0
pclass             0    891         0
sex                0    891         0
sibsp              0    891         0
parch              0    891         0
fare               0    891         0
class              0    891         0
who                0    891         0
adult_male         0    891         0
alive              0    891         0
alone              0    891         0

You can group the data and add subtotals and grand totals with stb.subtotal():

df.groupby(['sex', 'class']).agg({'fare': ['sum']}).stb.subtotal()
                          fare
                           sum
sex     class
female  First        9975.8250
        Second       1669.7292
        Third        2321.1086
female - subtotal   13966.6628
male    First        8201.5875
        Second       2132.1125
        Third        4393.5865
male - subtotal     14727.2865
grand_total         28693.9493

You can also turn a hierarchical column structure into this:

titanic.groupby(['embark_town', 'class', 'sex']).agg({'fare': ['sum'], 'age': ['mean']}).unstack().stb.flatten()
   embark_town   class  fare_sum_female  fare_sum_male  age_mean_female  age_mean_male
0    Cherbourg   First          4972.53        3928.54          36.0526        40.1111
1    Cherbourg  Second          176.879         54.212          19.1429        25.9375
2    Cherbourg   Third          337.983        402.146          14.0625        25.0168
3   Queenstown   First               90             90               33             44
4   Queenstown  Second             24.7          12.35               30             57
5   Queenstown   Third          340.159        465.046            22.85        28.1429
6  Southampton   First          4753.29        4183.05          32.7045        41.8972
7  Southampton  Second          1468.15        1865.55          29.7197        30.8759
8  Southampton   Third          1642.97        3526.39          23.2237        26.5748

sidetable has several useful features:

  • See total counts and their relative percentages in one table. This is roughly equivalent to combining the output of value_counts() and value_counts(normalize=True) into one table.
  • Include cumulative totals and percentages to better understand your thresholds. The Pareto principle applies to many different scenarios and this function makes it easy to see how your data is cumulatively distributed.
  • Aggregate multiple columns together to see frequency counts for grouped data.
  • Provide a threshold point above which all data is grouped into a single bucket. This is useful for quickly identifying the areas to focus your analysis.
  • Get a count of the missing values in your data.
  • Count the number of unique values for each column.
  • Add grand totals on any DataFrame and subtotals to any grouped DataFrame.
  • Pretty print columns.
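Several of these features can be sketched in plain pandas. For instance, the cumulative columns that freq() produces are roughly the following (a sketch for illustration only, not sidetable's actual code):

```python
import pandas as pd

# Rebuild the Titanic "class" distribution used in the examples below.
s = pd.Series(["Third"] * 491 + ["First"] * 216 + ["Second"] * 184, name="class")

# Counts, percentages, and their cumulative versions, as freq() reports them.
freq = s.value_counts().rename("count").to_frame()
freq["percent"] = freq["count"] / freq["count"].sum() * 100
freq["cumulative_count"] = freq["count"].cumsum()
freq["cumulative_percent"] = freq["percent"].cumsum()
print(freq)
```

sidetable wraps this (and more, such as grouping and thresholds) behind a single call.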


Quickstart

For the impatient:

$ python -m pip install sidetable
import sidetable
import pandas as pd

# Create your DataFrame
df = pd.read_csv('myfile.csv')

# Build a frequency table for one or more columns
df.stb.freq(['column1', 'column2'])

# See what data is missing
df.stb.missing()

# Group data and add a subtotal
df.groupby(['column1', 'column2'])['col3'].sum().stb.subtotal()

That's it.

Read on for more details and more examples of what you can do with sidetable.

Rationale

The idea behind sidetable is that there are a handful of useful data analysis tasks that you might run on any data set early in the data analysis process. While each of these tasks can be done in a handful of lines of pandas code, it is a lot of typing and difficult to remember.

In addition to providing useful functionality, this project is also a test of how to build custom accessors using pandas' relatively new extension API. I am hopeful it can serve as a model for other projects, whether open source or just for your own usage. Please check out the release announcement for more information about the usage and how to use this as a model for your own projects.
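For the curious, the accessor pattern itself is small. A minimal sketch (the accessor name "demo" and its method are hypothetical, chosen only for illustration) looks like:

```python
import pandas as pd

# Registering "demo" makes df.demo available on every DataFrame --
# the same mechanism sidetable uses to provide .stb.
@pd.api.extensions.register_dataframe_accessor("demo")
class DemoAccessor:
    def __init__(self, pandas_obj):
        self._obj = pandas_obj

    def ncols(self):
        # Any method here can operate on the wrapped DataFrame.
        return len(self._obj.columns)

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
print(df.demo.ncols())  # → 2
```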

The solutions in sidetable are heavily based on three sources:

  • This tweet thread by Peter Baumgartner
  • An excellent article by Steve Miller that lays out many of the code concepts incorporated into sidetable.
  • Ted Petrou's post on finding the percentage of missing values in a DataFrame.

I very much appreciate the work that all three authors did to point me in this direction.

Installation


$ python -m pip install -U sidetable

This is the preferred method to install sidetable, as it will always install the most recent stable release. sidetable requires pandas 1.0 or higher and no additional dependencies. It should run anywhere that pandas runs.

If you prefer to use conda, sidetable is available on conda-forge:

$ conda install -c conda-forge sidetable

Usage

import pandas as pd
import sidetable
import seaborn as sns

df = sns.load_dataset('titanic')

sidetable uses the pandas DataFrame accessor API to add a .stb accessor to all of your DataFrames. Once you import sidetable, you are ready to go. In these examples, I will be using seaborn's Titanic dataset, but seaborn is not a direct dependency.

freq

If you have used value_counts() before, you have probably wished it were easier to combine the values with percentage distribution.

df['class'].value_counts()

Third     491
First     216
Second    184
Name: class, dtype: int64

df['class'].value_counts(normalize=True)

Third     0.551066
First     0.242424
Second    0.206510
Name: class, dtype: float64

This can be done, but it is messy and requires a lot of typing and remembering:

pd.concat([df['class'].value_counts().rename('count'),
           df['class'].value_counts(normalize=True).mul(100).rename('percentage')], axis=1)

        count  percentage
Third     491     55.1066
First     216     24.2424
Second    184      20.651

Using sidetable is much simpler, and you get cumulative totals, percentages and more flexibility:

df.stb.freq(['class'])
    class  count  percent  cumulative_count  cumulative_percent
0   Third    491  55.1066               491             55.1066
1   First    216  24.2424               707              79.349
2  Second    184   20.651               891                 100

If you want to style the results so percentages and large numbers are easier to read, use style=True:

df.stb.freq(['class'], style=True)
    class  count  percent  cumulative_count  cumulative_percent
0   Third    491   55.11%               491              55.11%
1   First    216   24.24%               707              79.35%
2  Second    184   20.65%               891             100.00%

In addition, you can group columns together. If we want to see the breakdown by class and sex:

df.stb.freq(['sex', 'class'])
      sex   class  count  percent  cumulative_count  cumulative_percent
0    male   Third    347   38.945               347              38.945
1  female   Third    144  16.1616               491             55.1066
2    male   First    122  13.6925               613             68.7991
3    male  Second    108  12.1212               721             80.9203
4  female   First     94  10.5499               815             91.4703
5  female  Second     76  8.52974               891                 100

You can use as many groupings as you would like.

By default, sidetable counts the data. However, you can specify a value argument to indicate that the data should be summed based on another column. For this data set, we can see how the fares are distributed by class:

df.stb.freq(['class'], value='fare')
    class     fare  percent  cumulative_fare  cumulative_percent
0   First  18177.4  63.3493          18177.4             63.3493
1   Third   6714.7  23.4011          24892.1             86.7504
2  Second  3801.84  13.2496          28693.9                 100

Another feature of sidetable is that you can specify a threshold. For many analyses, you may want to focus on the largest groupings and lump everything else together. Use the thresh argument to define a cumulative percentage threshold; all entries above it are grouped into a single "others" bucket:

df.stb.freq(['class', 'who'], value='fare', thresh=80)
    class     who     fare  percent  cumulative_fare  cumulative_percent
0   First   woman  9492.94  33.0834          9492.94             33.0834
1   First     man  7848.18  27.3513          17341.1             60.4348
2   Third     man  3617.53  12.6073          20958.6              73.042
3  Second     man  1886.36  6.57406            22845             79.6161
4  others  others  5848.95  20.3839          28693.9                 100

You can further customize by specifying the label to use for all the others:

df.stb.freq(['class', 'who'], value='fare', thresh=80, other_label='All others')
        class         who     fare  percent  cumulative_fare  cumulative_percent
0       First       woman  9492.94  33.0834          9492.94             33.0834
1       First         man  7848.18  27.3513          17341.1             60.4348
2       Third         man  3617.53  12.6073          20958.6              73.042
3      Second         man  1886.36  6.57406            22845             79.6161
4  All others  All others  5848.95  20.3839          28693.9                 100

counts

The counts() function shows how many unique values are in each column, as well as the most and least frequent values and their total counts. This summary view can help you determine if you need to convert data to a categorical type. It can also help you understand the high-level structure of your data.
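Roughly speaking, counts() gathers per-column statistics you could compute yourself. A plain-pandas sketch of the equivalent computation (not sidetable's actual implementation):

```python
import pandas as pd

df = pd.DataFrame({"sex": ["male", "male", "female"], "who": ["man", "child", "man"]})

# For each column: non-null count, unique values, and the most/least
# frequent value with its count, mirroring the counts() columns.
rows = {}
for col in df.columns:
    vc = df[col].value_counts()
    rows[col] = {
        "count": int(df[col].count()),
        "unique": df[col].nunique(),
        "most_freq": vc.index[0],
        "most_freq_count": int(vc.iloc[0]),
        "least_freq": vc.index[-1],
        "least_freq_count": int(vc.iloc[-1]),
    }
summary = pd.DataFrame.from_dict(rows, orient="index")
print(summary)
```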

df.stb.counts()
             count  unique    most_freq  most_freq_count  least_freq  least_freq_count
survived       891       2            0              549           1               342
sex            891       2         male              577      female               314
adult_male     891       2         True              537       False               354
alive          891       2           no              549         yes               342
alone          891       2         True              537       False               354
pclass         891       3            3              491           2               184
embarked       889       3            S              644           Q                77
class          891       3        Third              491      Second               184
who            891       3          man              537       child                83
embark_town    889       3  Southampton              644  Queenstown                77
sibsp          891       7            0              608           5                 5
parch          891       7            0              678           6                 1
deck           203       7            C               59           G                 4
age            714      88         24.0               30        20.5                 1
fare           891     248         8.05               43     63.3583                 1

By default, all data types are included, but you may use the exclude and include parameters to select specific types of columns. The syntax is the same as pandas select_dtypes.

For example,

df.stb.counts(exclude='number')
             count  unique    most_freq  most_freq_count  least_freq  least_freq_count
sex            891       2         male              577      female               314
adult_male     891       2         True              537       False               354
alive          891       2           no              549         yes               342
alone          891       2         True              537       False               354
embarked       889       3            S              644           Q                77
class          891       3        Third              491      Second               184
who            891       3          man              537       child                83
embark_town    889       3  Southampton              644  Queenstown                77
deck           203       7            C               59           G                 4

missing

sidetable also includes a summary table that shows the missing values in your data, by count and as a percentage of the total values in each column.
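Under the hood this is a combination of simple pandas calls. A sketch of the equivalent computation (for illustration; not sidetable's code, and the sample data here is made up):

```python
import pandas as pd

df = pd.DataFrame({
    "deck": [None, "C", None, "B"],
    "age": [22.0, None, 35.0, 54.0],
    "fare": [7.25, 71.28, 8.05, 51.86],
})

# Missing count, total rows, and percent missing per column,
# sorted the way stb.missing() presents them.
missing = pd.DataFrame({"missing": df.isna().sum(), "total": len(df)})
missing["percent"] = missing["missing"] / missing["total"] * 100
missing = missing.sort_values("missing", ascending=False)
print(missing)
```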

df.stb.missing()
             missing  total   percent
deck             688    891   77.2166
age              177    891   19.8653
embarked           2    891  0.224467
embark_town        2    891  0.224467
survived           0    891         0
pclass             0    891         0
sex                0    891         0
sibsp              0    891         0
parch              0    891         0
fare               0    891         0
class              0    891         0
who                0    891         0
adult_male         0    891         0
alive              0    891         0
alone              0    891         0

If you wish to see the results with styles applied to the percent and total columns, use:

df.stb.missing(style=True)
             missing  total  percent
deck             688    891   77.22%
age              177    891   19.87%
embarked           2    891    0.22%
embark_town        2    891    0.22%
survived           0    891        0
pclass             0    891        0
sex                0    891        0
sibsp              0    891        0
parch              0    891        0
fare               0    891        0
class              0    891        0
who                0    891        0
adult_male         0    891        0
alive              0    891        0
alone              0    891        0

Finally, you can exclude the columns that have 0 missing values using the clip_0=True parameter:

df.stb.missing(clip_0=True, style=True)
             missing  total  percent
deck             688    891   77.22%
age              177    891   19.87%
embarked           2    891    0.22%
embark_town        2    891    0.22%

subtotal

Another useful function is subtotal. Trying to add a subtotal to grouped pandas data is not easy, so sidetable provides a subtotal() function that adds a subtotal at one or more levels of a DataFrame.

The subtotal function can be applied to a simple DataFrame in order to add a Grand Total label:

df.stb.subtotal()
             survived  pclass     sex      age  sibsp  parch     fare embarked  class    who  adult_male deck  embark_town alive  alone
887                 1       1  female       19      0      0       30        S  First  woman           0    B  Southampton   yes      1
888                 0       3  female      nan      1      2    23.45        S  Third  woman           0  nan  Southampton    no      0
889                 1       1    male       26      0      0       30        C  First    man           1    C    Cherbourg   yes      1
890                 0       3    male       32      0      0     7.75        Q  Third    man           1  nan   Queenstown    no      1
grand_total       342    2057     nan  21205.2    466    340  28693.9      nan    nan    nan         537  nan          nan   nan    537

The real power of subtotal is being able to add it to one or more levels of your grouped data. For example, you can group the data and add a subtotal at each level:

df.groupby(['sex', 'class', 'embark_town']).agg({'fare': ['sum']}).stb.subtotal()

Which yields this view (truncated for simplicity):

                                          fare
                                           sum
sex     class   embark_town
female  First   Cherbourg            4972.5333
                Queenstown             90.0000
                Southampton          4753.2917
        female | First - subtotal    9815.8250
        Second  Cherbourg             176.8792
                Queenstown             24.7000
                Southampton          1468.1500
        female | Second - subtotal   1669.7292
        Third   Cherbourg             337.9833
                Queenstown            340.1585
                Southampton          1642.9668
        female | Third - subtotal    2321.1086
female - subtotal                   13806.6628
male    First   Cherbourg            3928.5417
                Queenstown             90.0000

By default, every level in the DataFrame will be subtotaled, but you can control this behavior with the sub_level argument. For instance, you can subtotal on sex and class by passing sub_level=[1, 2]:

summary_table = df.groupby(['sex', 'class', 'embark_town']).agg({'fare': ['sum']})
summary_table.stb.subtotal(sub_level=[1, 2])

The subtotal function also allows the user to configure the labels and separators used in the subtotal and Grand Total by using the grand_label, sub_label, show_sep and sep arguments.
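To make the mechanics concrete, here is a plain-pandas sketch of what subtotal() automates: appending labelled subtotal and grand total rows to a grouped result. The labels and data here are illustrative, not sidetable's exact defaults or output formatting.

```python
import pandas as pd

df = pd.DataFrame({
    "sex": ["female", "female", "male", "male"],
    "cls": ["First", "Second", "First", "Second"],
    "fare": [100.0, 50.0, 80.0, 40.0],
})
grouped = df.groupby(["sex", "cls"])["fare"].sum()

# Append a subtotal row after each top-level group, then a grand total.
pieces = []
for sex, block in grouped.groupby(level=0):
    pieces.append(block)
    label = pd.MultiIndex.from_tuples([(sex, "subtotal")])
    pieces.append(pd.Series([block.sum()], index=label))
result = pd.concat(pieces)
result.loc[("grand_total", "")] = grouped.sum()
print(result)
```

sidetable handles the label formatting, separators and multi-level subtotaling for you.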

flatten

When grouping and pivoting data, you can end up with a DataFrame that has a MultiIndex. Often, you want a simple, flat representation of the data.

For example, we can build a table using a groupby() plus unstack() that looks like this:

df.groupby(['embark_town', 'class', 'sex']).agg({'fare': ['sum'], 'age': ['mean']}).unstack()
                          fare                  age
                           sum                 mean
sex                     female       male    female       male
embark_town  class
Cherbourg    First   4972.5333  3928.5417  36.052632  40.111111
             Second   176.8792   254.2125  19.142857  25.937500
             Third    337.9833   402.1462  14.062500  25.016800
Queenstown   First     90.0000    90.0000  33.000000  44.000000
             Second    24.7000    12.3500  30.000000  57.000000
             Third    340.1585   465.0458  22.850000  28.142857
Southampton  First   4753.2917  4183.0458  32.704545  41.897188
             Second  1468.1500  1865.5500  29.719697  30.875889
             Third   1642.9668  3526.3945  23.223684  26.574766

If you wish to flatten it, use stb.flatten():

df.groupby(['embark_town', 'class', 'sex']).agg({'fare': ['sum'], 'age': ['mean']}).unstack().stb.flatten()
   embark_town   class  fare_sum_female  fare_sum_male  age_mean_female  age_mean_male
0    Cherbourg   First          4972.53        3928.54          36.0526        40.1111
1    Cherbourg  Second          176.879         54.212          19.1429        25.9375
2    Cherbourg   Third          337.983        402.146          14.0625        25.0168
3   Queenstown   First               90             90               33             44
4   Queenstown  Second             24.7          12.35               30             57
5   Queenstown   Third          340.159        465.046            22.85        28.1429
6  Southampton   First          4753.29        4183.05          32.7045        41.8972
7  Southampton  Second          1468.15        1865.55          29.7197        30.8759
8  Southampton   Third          1642.97        3526.39          23.2237        26.5748

flatten will also take additional arguments:

  • Add a custom separator using the sep argument: stb.flatten(sep='|')
  • Control whether to reset the index using the reset argument: stb.flatten(reset=False)
  • Reorganize the output levels using the levels argument: levels=2
    • levels can also take a list of valid levels if you want to reorganize the display: levels=[0,2]
fares = df.groupby(['embark_town', 'class', 'sex']).agg({'fare': ['sum', 'mean'], 'age': ['mean']}).unstack()
fares.stb.flatten(sep='|', reset=False, levels=[0,2])
                     fare|female  fare|male  fare|female  fare|male  age|female   age|male
embark_town  class
Cherbourg    First     4972.5333  3928.5417   115.640309  93.536707   36.052632  40.111111
             Second     176.8792   254.2125    25.268457  25.421250   19.142857  25.937500
             Third      337.9833   402.1462    14.694926   9.352237   14.062500  25.016800
Queenstown   First       90.0000    90.0000    90.000000  90.000000   33.000000  44.000000
             Second      24.7000    12.3500    12.350000  12.350000   30.000000  57.000000
             Third      340.1585   465.0458    10.307833  11.924251   22.850000  28.142857
Southampton  First     4753.2917  4183.0458    99.026910  52.949947   32.704545  41.897188
             Second    1468.1500  1865.5500    21.912687  19.232474   29.719697  30.875889
             Third     1642.9668  3526.3945    18.670077  13.307149   23.223684  26.574766

prettyprint

This function interprets the magnitude of your numeric results and returns a nicely formatted version of all the numbers. This can be used on a full DataFrame or during your analysis of aggregated data.

For instance, if you are summarizing data, you may get something that looks like this:

df.groupby(['pclass', 'sex']).agg({'fare': 'sum'})
                     fare
pclass  sex
1       female  9975.8250
        male    8201.5875
2       female  1669.7292
        male    2132.1125
3       female  2321.1086
        male    4393.5865

Use stb.pretty() to format it nicely so all numbers share the same order of magnitude:

df.groupby(['pclass', 'sex']).agg({'fare': 'sum'}).stb.pretty()

                 fare
pclass  sex
1       female  9.98k
        male    8.20k
2       female  1.67k
        male    2.13k
3       female  2.32k
        male    4.39k

Here's an example of a percentage format:

df.groupby(['pclass', 'sex']).agg({'fare': 'sum'}).div(df['fare'].sum()).stb.pretty(precision=0, caption="Fare Percentage")
Fare Percentage

                fare
pclass  sex
1       female   35%
        male     29%
2       female    6%
        male      7%
3       female    8%
        male     15%

Behind the scenes, pretty will attempt to normalize the values. You can control the precision and add a caption.

Caveats

sidetable supports grouping on any data type in a pandas DataFrame. This means that you could try something like:

df.stb.freq(['fare'])

In cases where there are a fairly small number of discrete values, this may be useful. However, if you have a lot of unique values, you should bin the data first. In the example above, the output would include 248 rows and not be terribly useful.

One alternative could be:

df['fare_bin'] = pd.qcut(df['fare'], q=4, labels=['low', 'medium', 'high', 'x-high'])
df.stb.freq(['fare_bin'])
  fare_bin  count  percent  cumulative_count  cumulative_percent
0   medium    224  25.1403               224             25.1403
1      low    223  25.0281               447             50.1684
2   x-high    222  24.9158               669             75.0842
3     high    222  24.9158               891                 100

The other caveat is that null or missing values can cause data to drop out while aggregating. For instance, if we look at the deck variable, there are a lot of missing values.

df.stb.freq(['deck'])
  deck  count  percent  cumulative_count  cumulative_percent
0    C     59   29.064                59              29.064
1    B     47  23.1527               106             52.2167
2    D     33  16.2562               139             68.4729
3    E     32  15.7635               171             84.2365
4    A     15  7.38916               186             91.6256
5    F     13  6.40394               199             98.0296
6    G      4  1.97044               203                 100

The total cumulative count only goes up to 203, not the 891 we have seen in other examples. Future versions of sidetable may handle this differently. For now, it is up to you to decide how best to handle unknowns. For example, this version of the Titanic data set uses a categorical dtype for deck, so using fillna requires an extra step:

df['deck_fillna'] = df['deck'].cat.add_categories('UNK').fillna('UNK')
df.stb.freq(['deck_fillna'])
  deck_fillna  count   percent  cumulative_count  cumulative_percent
0         UNK    688   77.2166               688             77.2166
1           C     59   6.62177               747             83.8384
2           B     47   5.27497               794             89.1134
3           D     33    3.7037               827             92.8171
4           E     32   3.59147               859             96.4085
5           A     15    1.6835               874              98.092
6           F     13   1.45903               887             99.5511
7           G      4  0.448934               891                 100

Another wrinkle is that there may be certain groupings with no valid counts.

For instance, if we look at the deck and class:

df.stb.freq(['deck', 'class'])
   deck   class  count  percent  cumulative_count  cumulative_percent
0     C   First     59   29.064                59              29.064
1     B   First     47  23.1527               106             52.2167
2     D   First     29  14.2857               135             66.5025
3     E   First     25  12.3153               160             78.8177
4     A   First     15  7.38916               175             86.2069
5     F  Second      8  3.94089               183             90.1478
6     F   Third      5  2.46305               188             92.6108
7     G   Third      4  1.97044               192             94.5813
8     E  Second      4  1.97044               196             96.5517
9     D  Second      4  1.97044               200             98.5222
10    E   Third      3  1.47783               203                 100

There are only 11 combinations. If we want to see all of them, even those with no passengers fitting the criteria, use clip_0=False:

df.stb.freq(['deck', 'class'], clip_0=False)
   deck   class  count  percent  cumulative_count  cumulative_percent
0     C   First     59   29.064                59              29.064
1     B   First     47  23.1527               106             52.2167
2     D   First     29  14.2857               135             66.5025
3     E   First     25  12.3153               160             78.8177
4     A   First     15  7.38916               175             86.2069
5     F  Second      8  3.94089               183             90.1478
6     F   Third      5  2.46305               188             92.6108
7     G   Third      4  1.97044               192             94.5813
8     E  Second      4  1.97044               196             96.5517
9     D  Second      4  1.97044               200             98.5222
10    E   Third      3  1.47783               203                 100
11    G  Second      0        0               203                 100
12    G   First      0        0               203                 100
13    F   First      0        0               203                 100
14    D   Third      0        0               203                 100
15    C   Third      0        0               203                 100
16    C  Second      0        0               203                 100
17    B   Third      0        0               203                 100
18    B  Second      0        0               203                 100
19    A   Third      0        0               203                 100
20    A  Second      0        0               203                 100

In many cases this might be too much data, but sometimes the fact that a combination is missing could be insightful.

The final caveat relates to subtotal. When working with the subtotal function, sidetable converts a Categorical MultiIndex to a plain index in order to easily add the subtotal labels.

TODO

  • Handle NaN values more effectively
  • Offer binning options for continuous variables
  • Offer more options, maybe plotting?

Contributing

Contributions are welcome, and they are greatly appreciated! Every little bit helps, and credit will always be given. If you have a new idea for a simple table that we should add, please submit a ticket.

For more information, please see the contributing guidelines.

Credits

This package was created with Cookiecutter and the oldani/cookiecutter-simple-pypackage project template. The code used in this package is heavily based on the posts from Peter Baumgartner, Steve Miller and Ted Petrou. Thank you!
