UpSetPlot documentation


This is another Python implementation of UpSet plots by Lex et al. [Lex2014]. UpSet plots are used to visualise set overlaps; like Venn diagrams but more readable. Documentation is at https://upsetplot.readthedocs.io.

This upsetplot library tries to provide a simple interface backed by an extensible, object-oriented design.

There are many ways to represent the categorisation of data, as covered in our Data Format Guide.

Our internal input format uses a pandas.Series containing counts corresponding to subset sizes, where each subset is an intersection of named categories. The index of the Series indicates which rows pertain to which categories by having one boolean index level per category, as in the following example:

>>> from upsetplot import generate_counts
>>> example = generate_counts()
>>> example
cat0   cat1   cat2
False  False  False      56
              True      283
       True   False    1279
              True     5882
True   False  False      24
              True       90
       True   False     429
              True     1957
Name: value, dtype: int64

Then:

>>> from upsetplot import plot
>>> plot(example)  
>>> from matplotlib import pyplot
>>> pyplot.show()  

makes:

http://upsetplot.readthedocs.io/en/latest/_images/sphx_glr_plot_generated_001.png

This plot shows the cardinality of every category combination seen in our data. The leftmost column counts items absent from every category. The next three columns count items only in cat0, cat1 and cat2 respectively, with the following columns showing cardinalities for items in each combination of exactly two named sets. The rightmost column counts items in all three sets.
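The column order described above can be reproduced in a few lines of plain Python. This is only a sketch of the idea, not the library's implementation: the subset sizes are copied from the example output, and the tie-break within a degree (cat0 before cat1 before cat2) is an assumption chosen for illustration.

```python
# Subset sizes copied from the example output, keyed by
# (cat0, cat1, cat2) membership.
counts = {
    (False, False, False): 56,
    (False, False, True): 283,
    (False, True, False): 1279,
    (False, True, True): 5882,
    (True, False, False): 24,
    (True, False, True): 90,
    (True, True, False): 429,
    (True, True, True): 1957,
}


def degree(membership):
    """Number of categories a subset belongs to."""
    return sum(membership)


# Order by degree; the tie-break (earlier categories first) is an
# assumption made for this sketch.
columns = sorted(counts, key=lambda m: (degree(m), tuple(not x for x in m)))

for membership in columns:
    names = [f"cat{i}" for i, present in enumerate(membership) if present]
    print(degree(membership), names or ["(none)"], counts[membership])
```

The first printed row is the subset outside every category, and the last is the subset inside all three.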

Rotation

We call the above plot style “horizontal” because the category intersections are presented from left to right. Vertical plots are also supported!

http://upsetplot.readthedocs.io/en/latest/_images/sphx_glr_plot_vertical_001.png

Distributions

Providing a DataFrame rather than a Series as input allows us to expressively plot the distribution of variables in each subset.

http://upsetplot.readthedocs.io/en/latest/_images/sphx_glr_plot_boston_001.png

Loading datasets

While the dataset above is randomly generated, you can prepare your own dataset for input to upsetplot. A helpful tool is from_memberships, which allows us to reconstruct the example above by indicating each data point’s category membership:

>>> from upsetplot import from_memberships
>>> example = from_memberships(
...     [[],
...      ['cat2'],
...      ['cat1'],
...      ['cat1', 'cat2'],
...      ['cat0'],
...      ['cat0', 'cat2'],
...      ['cat0', 'cat1'],
...      ['cat0', 'cat1', 'cat2'],
...      ],
...      data=[56, 283, 1279, 5882, 24, 90, 429, 1957]
... )
>>> example
cat0   cat1   cat2
False  False  False      56
              True      283
       True   False    1279
              True     5882
True   False  False      24
              True       90
       True   False     429
              True     1957
dtype: int64

See also from_contents, another way to describe categorised data, and from_indicators which allows each category to be indicated by a column in the data frame (or a function of the column’s data such as whether it is a missing value).

Installation

To install the library, you can use pip:

$ pip install upsetplot

Installation requires:

It should then be possible to:

>>> import upsetplot

in Python.

Why an alternative to py-upset?

Probably for petty reasons. It appeared py-upset was not being maintained. Its input format was undocumented, inefficient and, IMO, inappropriate. It did not facilitate showing plots of each subset’s distribution as in Lex et al’s work introducing UpSet plots. Nor did it include the horizontal bar plots illustrated there. It did not support Python 2. I decided it would be easier to construct a cleaner version than to fix it.

References

[Lex2014] Alexander Lex, Nils Gehlenborg, Hendrik Strobelt, Romain Vuillemot, Hanspeter Pfister, UpSet: Visualization of Intersecting Sets, IEEE Transactions on Visualization and Computer Graphics (InfoVis '14), vol. 20, no. 12, pp. 1983–1992, 2014. doi: doi.org/10.1109/TVCG.2014.2346248

Examples

Introductory examples for upsetplot.

Plot the distribution of missing values

UpSet plots are often used to show which variables are missing together.

Passing a callable indicators=pd.isna to from_indicators() is an easy way to categorise a record by the variables that are missing in it.

_images/sphx_glr_plot_missingness_001.png
from matplotlib import pyplot as plt
import pandas as pd
from upsetplot import plot, from_indicators

TITANIC_URL = 'https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv'  # noqa
data = pd.read_csv(TITANIC_URL)

plot(from_indicators(indicators=pd.isna, data=data), show_counts=True)
plt.show()
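To see what this categorisation amounts to, here is a pandas-only sketch on a small made-up table (the column names and values are hypothetical, not the Titanic data). It counts how many records share each combination of missing variables, which is roughly the quantity an UpSet plot of missingness displays.

```python
import pandas as pd

# Hypothetical records with some missing values.
records = pd.DataFrame({
    "age": [22.0, None, 35.0, None],
    "cabin": [None, None, "C85", "E46"],
    "fare": [7.25, 8.05, 53.1, 13.0],
})

# One boolean indicator per variable: True where the value is missing.
missing = records.isna()

# Count the records for each observed combination of missing variables.
combo_counts = missing.groupby(list(missing.columns)).size()
print(combo_counts)
```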

Total running time of the script: ( 0 minutes 0.383 seconds)

Gallery generated by Sphinx-Gallery

Vertical orientation

This illustrates the effect of orientation='vertical'.

from matplotlib import pyplot as plt
from upsetplot import generate_counts, plot

example = generate_counts()
plot(example, orientation='vertical')
plt.suptitle('A vertical plot')
plt.show()
_images/sphx_glr_plot_vertical_001.png
plot(example, orientation='vertical', show_counts='%d')
plt.suptitle('A vertical plot with counts shown')
plt.show()
_images/sphx_glr_plot_vertical_002.png
plot(example, orientation='vertical', show_counts='%d', show_percentages=True)
plt.suptitle('With counts and percentages shown')
plt.show()
_images/sphx_glr_plot_vertical_003.png

Total running time of the script: ( 0 minutes 0.819 seconds)


Plotting with generated data

This example illustrates basic plotting functionality using generated data.

from matplotlib import pyplot as plt
from upsetplot import generate_counts, plot

example = generate_counts()
print(example)

Out:

cat0   cat1   cat2
False  False  False      56
              True      283
       True   False    1279
              True     5882
True   False  False      24
              True       90
       True   False     429
              True     1957
Name: value, dtype: int64
plot(example)
plt.suptitle('Ordered by degree')
plt.show()
_images/sphx_glr_plot_generated_001.png
plot(example, sort_by='cardinality')
plt.suptitle('Ordered by cardinality')
plt.show()
_images/sphx_glr_plot_generated_002.png
plot(example, show_counts='%d')
plt.suptitle('With counts shown')
plt.show()
_images/sphx_glr_plot_generated_003.png
plot(example, show_counts='%d', show_percentages=True)
plt.suptitle('With counts and % shown')
plt.show()
_images/sphx_glr_plot_generated_004.png

Total running time of the script: ( 0 minutes 1.080 seconds)


Hiding subsets based on size or degree

This illustrates the use of min_subset_size, max_subset_size, min_degree or max_degree.

from matplotlib import pyplot as plt
from upsetplot import generate_counts, plot

example = generate_counts()

plot(example, show_counts=True)
plt.suptitle('Nothing hidden')
plt.show()
_images/sphx_glr_plot_hide_001.png
plot(example, show_counts=True, min_subset_size=100)
plt.suptitle('Small subsets hidden')
plt.show()
_images/sphx_glr_plot_hide_002.png
plot(example, show_counts=True, max_subset_size=500)
plt.suptitle('Large subsets hidden')
plt.show()
_images/sphx_glr_plot_hide_003.png
plot(example, show_counts=True, min_degree=2)
plt.suptitle('Degree <2 hidden')
plt.show()
_images/sphx_glr_plot_hide_004.png
plot(example, show_counts=True, max_degree=2)
plt.suptitle('Degree >2 hidden')
plt.show()
_images/sphx_glr_plot_hide_005.png
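The effect of these thresholds can be pictured as a filter over the subset counts. The following is a pandas sketch of that idea under the assumption that a subset's degree is the number of categories it belongs to; it rebuilds the example counts by hand and is not the library's own code.

```python
import numpy as np
import pandas as pd

# Rebuild the example counts with a boolean MultiIndex.
index = pd.MultiIndex.from_product(
    [[False, True]] * 3, names=["cat0", "cat1", "cat2"]
)
counts = pd.Series(
    [56, 283, 1279, 5882, 24, 90, 429, 1957], index=index, name="value"
)

# Degree: how many categories each subset belongs to.
degree = np.array([sum(membership) for membership in counts.index])
size = counts.to_numpy()

# Keep subsets with at least 100 items and degree of at least 2.
keep = (size >= 100) & (degree >= 2)
kept = counts[keep]
print(kept)
```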

Total running time of the script: ( 0 minutes 1.327 seconds)


Customising element size and figure size

This example illustrates controlling sizing within an UpSet plot.

from matplotlib import pyplot as plt
from upsetplot import generate_counts, plot

example = generate_counts()
print(example)

plot(example)
plt.suptitle('Defaults')
plt.show()
_images/sphx_glr_plot_sizing_001.png

Out:

cat0   cat1   cat2
False  False  False      56
              True      283
       True   False    1279
              True     5882
True   False  False      24
              True       90
       True   False     429
              True     1957
Name: value, dtype: int64

upsetplot uses a grid of square “elements” to display. Controlling the size of these elements affects all components of the plot.

plot(example, element_size=40)
plt.suptitle('Increased element_size')
plt.show()
_images/sphx_glr_plot_sizing_002.png

When setting figsize explicitly, you then need to pass the figure to plot, and use element_size=None for optimal sizing.

fig = plt.figure(figsize=(10, 3))
plot(example, fig=fig, element_size=None)
plt.suptitle('Setting figsize explicitly')
plt.show()
_images/sphx_glr_plot_sizing_003.png

Components in the plot can be resized by indicating how many elements they should equate to.

plot(example, intersection_plot_elements=3)
plt.suptitle('Decreased intersection_plot_elements')
plt.show()
_images/sphx_glr_plot_sizing_004.png
plot(example, totals_plot_elements=5)
plt.suptitle('Increased totals_plot_elements')
plt.show()
_images/sphx_glr_plot_sizing_005.png

Total running time of the script: ( 0 minutes 1.323 seconds)


Plotting discrete variables as stacked bar charts

Currently, a somewhat contrived example of add_stacked_bars.

  • _images/sphx_glr_plot_discrete_001.png
  • _images/sphx_glr_plot_discrete_002.png
import pandas as pd
from upsetplot import UpSet
from matplotlib import pyplot as plt
from matplotlib import cm

TITANIC_URL = 'https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv'  # noqa
df = pd.read_csv(TITANIC_URL)
# Show UpSet on survival and first class
df = df.set_index(df.Survived == 1).set_index(df.Pclass == 1, append=True)

upset = UpSet(df,
              intersection_plot_elements=0)  # disable the default bar chart
upset.add_stacked_bars(by="Sex", colors=cm.Pastel1,
                       title="Count by gender", elements=10)
upset.plot()
plt.suptitle("Gender for first class and survival on Titanic")
plt.show()


upset = UpSet(df, show_counts=True, orientation="vertical",
              intersection_plot_elements=0)
upset.add_stacked_bars(by="Sex", colors=cm.Pastel1,
                       title="Count by gender", elements=10)
upset.plot()
plt.suptitle("Same, but vertical, with counts shown")
plt.show()

Total running time of the script: ( 0 minutes 0.621 seconds)


Changing Plot Colors

This example illustrates use of matplotlib and upsetplot color settings, aside from matplotlib style sheets, which can control colors as well as grid lines, fonts and tick display.

Upsetplot provides some color settings:

  • facecolor: sets the color for intersection size bars, and for active matrix dots. Defaults to white on a dark background, otherwise black.
  • other_dots_color: sets the color for other (inactive) dots. Specify as a color, or a float specifying opacity relative to facecolor.
  • shading_color: sets the color used to shade odd rows. Specify as a color, or a float specifying opacity relative to facecolor.

For an introduction to matplotlib theming see:

from matplotlib import pyplot as plt
from upsetplot import generate_counts, plot

example = generate_counts()

plot(example, facecolor="darkblue")
plt.suptitle('facecolor="darkblue"')
plt.show()
_images/sphx_glr_plot_theming_001.png
plot(example, facecolor="darkblue", shading_color="lightgray")
plt.suptitle('facecolor="darkblue", shading_color="lightgray"')
plt.show()
_images/sphx_glr_plot_theming_002.png
with plt.style.context('Solarize_Light2'):
    plot(example)
    plt.suptitle('matplotlib Solarize_Light2 stylesheet')
    plt.show()
_images/sphx_glr_plot_theming_003.png
with plt.style.context('dark_background'):
    plot(example, show_counts=True)
    plt.suptitle('matplotlib dark_background stylesheet')
    plt.show()
_images/sphx_glr_plot_theming_004.png
with plt.style.context('dark_background'):
    plot(example, show_counts=True, shading_color=.15)
    plt.suptitle('matplotlib dark_background stylesheet, shading_color=.15')
    plt.show()
_images/sphx_glr_plot_theming_005.png
with plt.style.context('dark_background'):
    plot(example, show_counts=True, facecolor="red")
    plt.suptitle('matplotlib dark_background, facecolor="red"')
    plt.show()
_images/sphx_glr_plot_theming_006.png
with plt.style.context('dark_background'):
    plot(example, show_counts=True, facecolor="red", other_dots_color=.4,
         shading_color=.2)
    plt.suptitle('dark_background, red face, stronger other colors')
    plt.show()
_images/sphx_glr_plot_theming_007.png

Total running time of the script: ( 0 minutes 1.860 seconds)


Highlighting selected subsets

Demonstrates use of the style_subsets method to mark some subsets as different.

from matplotlib import pyplot as plt
from upsetplot import generate_counts, UpSet

example = generate_counts()

Subsets can be styled by the categories present in them, and a legend can be optionally generated.

upset = UpSet(example)
upset.style_subsets(present=["cat1", "cat2"],
                    facecolor="blue",
                    label="special")
upset.plot()
plt.suptitle("Paint blue subsets including both cat1 and cat2; show a legend")
plt.show()
_images/sphx_glr_plot_highlight_001.png

… or styling can be applied by the categories absent in a subset.

upset = UpSet(example, orientation="vertical")
upset.style_subsets(present="cat2", absent="cat1", edgecolor="red",
                    linewidth=2)
upset.plot()
plt.suptitle("Border for subsets including cat2 but not cat1")
plt.show()
_images/sphx_glr_plot_highlight_002.png
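Which subsets a present/absent criterion matches can be checked directly on the index. This pandas sketch rebuilds the example counts by hand and selects the subsets that include cat2 but not cat1. It illustrates the selection logic only and makes no claim about how style_subsets is implemented.

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [[False, True]] * 3, names=["cat0", "cat1", "cat2"]
)
counts = pd.Series(
    [56, 283, 1279, 5882, 24, 90, 429, 1957], index=index, name="value"
)

# Boolean arrays saying whether each subset includes a given category.
has_cat1 = counts.index.get_level_values("cat1").to_numpy(dtype=bool)
has_cat2 = counts.index.get_level_values("cat2").to_numpy(dtype=bool)

# Subsets where cat2 is present and cat1 is absent.
selected = counts[has_cat2 & ~has_cat1]
print(selected)
```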

… or their size or degree.

upset = UpSet(example)
upset.style_subsets(min_subset_size=1000,
                    facecolor="lightblue", hatch="xx",
                    label="big")
upset.plot()
plt.suptitle("Hatch subsets with size >=1000")
plt.show()
_images/sphx_glr_plot_highlight_003.png

Multiple stylings can be applied with different criteria in the same plot.

upset = UpSet(example, facecolor="gray")
upset.style_subsets(present="cat0", label="Contains cat0", facecolor="blue")
upset.style_subsets(present="cat1", label="Contains cat1", hatch="xx")
upset.style_subsets(present="cat2", label="Contains cat2", edgecolor="red")

# reduce legend size:
params = {'legend.fontsize': 8}
with plt.rc_context(params):
    upset.plot()
plt.suptitle("Styles for every category!")
plt.show()
_images/sphx_glr_plot_highlight_004.png

Total running time of the script: ( 0 minutes 1.117 seconds)


Above-average features in Boston

Explore above-average neighborhood characteristics in the Boston dataset.

Here we take some features correlated with house price, and look at the distribution of median house price when each of these features is above average.

The most correlated features are:

ZN
proportion of residential land zoned for lots over 25,000 sq.ft.
CHAS
Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
RM
average number of rooms per dwelling
DIS
weighted distances to five Boston employment centres
B
1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town

This kind of dataset analysis may not be a practical use of UpSet, but helps to illustrate the UpSet.add_catplot() feature.

import pandas as pd
from sklearn.datasets import load_boston
from matplotlib import pyplot as plt
from upsetplot import UpSet

# Load the dataset into a DataFrame
# (load_boston was removed in scikit-learn 1.2, so this needs an older release)
boston = load_boston()
boston_df = pd.DataFrame(boston.data, columns=boston.feature_names)

# Get five features most correlated with median house value
correls = boston_df.corrwith(pd.Series(boston.target),
                             method='spearman').sort_values()
top_features = correls.index[-5:]

# Get a binary indicator of whether each top feature is above its median
boston_above_avg = boston_df > boston_df.median(axis=0)
boston_above_avg = boston_above_avg[top_features]
boston_above_avg = boston_above_avg.rename(columns=lambda x: x + '>')

# Make this indicator mask an index of boston_df
boston_df = pd.concat([boston_df, boston_above_avg],
                      axis=1)
boston_df = boston_df.set_index(list(boston_above_avg.columns))

# Also give us access to the target (median house value)
boston_df = boston_df.assign(median_value=boston.target)
# UpSet plot it!
upset = UpSet(boston_df, subset_size='count', intersection_plot_elements=3)
upset.add_catplot(value='median_value', kind='strip', color='blue')
upset.add_catplot(value='AGE', kind='strip', color='black')
upset.plot()
plt.title("UpSet with catplots, for orientation='horizontal'")
plt.show()
_images/sphx_glr_plot_boston_001.png
# And again in vertical orientation

upset = UpSet(boston_df, subset_size='count', intersection_plot_elements=3,
              orientation='vertical')
upset.add_catplot(value='median_value', kind='strip', color='blue')
upset.add_catplot(value='AGE', kind='strip', color='black')
upset.plot()
plt.title("UpSet with catplots, for orientation='vertical'")
plt.show()
_images/sphx_glr_plot_boston_002.png
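The data preparation used in this example (threshold each feature at its median, then move the resulting boolean columns into the index) works for any numeric table. Here is a minimal sketch on a made-up frame; the column names and values are hypothetical and stand in for the Boston features.

```python
import pandas as pd

# Hypothetical data standing in for the Boston features and target.
homes = pd.DataFrame({
    "rooms": [4, 6, 7, 5],
    "distance": [1.0, 3.5, 2.0, 8.0],
    "price": [20, 31, 40, 25],
})
features = ["rooms", "distance"]

# Boolean indicator: is each feature above its median?
above = homes[features] > homes[features].median(axis=0)
above = above.rename(columns=lambda name: name + ">")

# Attach the indicators and make them the index, leaving the
# original columns available as data.
indexed = pd.concat([homes, above], axis=1).set_index(list(above.columns))
print(indexed)
```

The resulting frame has one boolean index level per thresholded feature, which is the layout described in the Data Format Guide.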

Total running time of the script: ( 0 minutes 3.087 seconds)


Data Format Guide

UpSetPlot is fundamentally about visualizing datapoints (or data aggregates) that are each assigned to one or more categories. Curiously, there are many ways to represent categories as data structures. If object 1 belongs to categories A and B, and object 2 belongs to category B only, this information can be represented by:

  • listing the memberships for each object, i.e.
        [["A", "B"],  # object 1
         ["B"]]       # object 2
  • listing the contents of each category, i.e.
        {"A": [1], "B": [1, 2]}
  • using a boolean-valued indicator matrix (perhaps columns in a larger DataFrame), i.e.
        #  A      B
        [[ True,  True ],  # object 1
         [ False, True ]]  # object 2
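
These three representations carry the same information, so each can be derived from another. The following plain-Python sketch converts the memberships form into the other two. The object identifiers 1 and 2 follow the example above, and the variable names are chosen for illustration.

```python
# Memberships: for each object, the categories it belongs to.
memberships = {1: ["A", "B"], 2: ["B"]}

# Contents: for each category, the objects it contains.
contents = {}
for obj, cats in memberships.items():
    for cat in cats:
        contents.setdefault(cat, []).append(obj)

# Indicator matrix: one row per object, one boolean per category.
categories = sorted(contents)
indicators = {
    obj: [cat in cats for cat in categories]
    for obj, cats in memberships.items()
}

print(contents)    # {'A': [1], 'B': [1, 2]}
print(indicators)  # {1: [True, True], 2: [False, True]}
```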

Moreover, UpSetPlot aims to handle both of the following cases:

  • where only aggregates (e.g. counts) of the values in each category subset are given; and
  • where there are data points with several attributes in each category subset, and these attributes can be visualized as well as aggregates.

This guide reviews the internal data format and alternative representations, but we recommend using the helper functions from_memberships, from_contents or from_indicators, depending on how it's most convenient to express your data.

Internal data format

UpSetPlot internally works with data based on pandas data structures: a Series when all you care about is counts, or a DataFrame when you're interested in visualising additional properties of the data, such as with the UpSet.add_catplot method.

UpSetPlot expects the Series or DataFrame to have a MultiIndex as input, with this index being an indicator matrix. Specifically, each category is a level in the pandas.MultiIndex with boolean values.

Note: This internal data format may change in a future version since it is not efficient. Using the from_* methods will provide more stable compatibility with future releases.
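
For illustration, a Series in this layout can be assembled by hand with pandas alone. The category names A and B and the counts below are made up, and as the note above says, the from_* helpers are the recommended route for real data.

```python
import pandas as pd

# One boolean index level per category; the names are hypothetical.
index = pd.MultiIndex.from_product(
    [[False, True], [False, True]], names=["A", "B"]
)

# One count per combination of category memberships, in index order:
# (not A, not B), (not A, B), (A, not B), (A, B).
counts = pd.Series([5, 3, 2, 7], index=index, name="value")
print(counts)
```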

Use Series as input

Below is a minimal example using Series as input:

[1]:
from upsetplot import generate_counts
example_counts = generate_counts()
example_counts
[1]:
cat0   cat1   cat2
False  False  False      56
              True      283
       True   False    1279
              True     5882
True   False  False      24
              True       90
       True   False     429
              True     1957
Name: value, dtype: int64

This is a pandas.Series with a 3-level MultiIndex. Each level is a set (category): cat0, cat1 and cat2. Each row is a unique subset, with the boolean index values indicating which sets that subset belongs to. The value in each row is the number of observations in that subset. upsetplot will simply plot these numbers when supplied with a Series:

[2]:
from upsetplot import UpSet
plt = UpSet(example_counts).plot()
_images/formats_8_0.png

Alternatively, we can supply a Series with each observation in a row:

[3]:
from upsetplot import generate_samples
example_values = generate_samples().value
example_values
[3]:
cat0   cat1   cat2
False  True   True    1.652317
              True    1.510447
       False  True    1.584646
              True    1.279395
       True   True    2.338243
                        ...
              True    1.701618
              True    1.577837
True   True   True    1.757554
False  True   True    1.407799
True   True   True    1.709067
Name: value, Length: 10000, dtype: float64

In this case, we can use subset_size='count' to have upsetplot count the number of observations in each unique subset and plot them:

[4]:
from upsetplot import UpSet
plt = UpSet(example_values, subset_size='count').plot()
_images/formats_12_0.png

Or, we can weight each subset’s size by the series value:

[5]:
from upsetplot import UpSet
plt = UpSet(example_values, subset_size='sum', show_counts=True).plot()
_images/formats_14_0.png
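
The difference between counting observations and summing their values can be previewed with a pandas groupby over the index levels. This sketch uses a tiny made-up Series rather than generate_samples(), and only approximates what the two subset_size options compute.

```python
import pandas as pd

# Four hypothetical observations; two fall in the same subset.
index = pd.MultiIndex.from_tuples(
    [(True, False), (True, False), (False, True), (True, True)],
    names=["cat0", "cat1"],
)
values = pd.Series([1.5, 2.0, 0.5, 1.0], index=index, name="value")

grouped = values.groupby(level=["cat0", "cat1"])
by_count = grouped.size()  # number of observations per subset
by_sum = grouped.sum()     # total of the values per subset

print(by_count)
print(by_sum)
```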

Use DataFrame as input

A DataFrame can also be used as input to carry additional information.

[6]:
from upsetplot import generate_samples
example_samples_df = generate_samples()
example_samples_df.head()
[6]:
                   index     value
cat0  cat1  cat2
False True  True       0  1.652317
            True       1  1.510447
      False True       2  1.584646
            True       3  1.279395
      True  True       4  2.338243

In this data frame, each observation has two variables: index and value. If we simply want to count the number of observations in each unique subset, we can use subset_size='count':

[7]:
from upsetplot import UpSet
plt = UpSet(example_samples_df, subset_size='count').plot()
_images/formats_19_0.png

If, for some reason, we want to plot the sum of a variable in each subset (e.g. index), we can use sum_over='index'. This makes upsetplot sum the given variable within each unique subset and plot that number:

[8]:
from upsetplot import UpSet
plt = UpSet(example_samples_df, sum_over='index', subset_size='sum').plot()
_images/formats_21_0.png

Convert Data to UpSet-compatible format

We can convert data from common formats to be compatible with upsetplot.

Suppose we have three categories (the data is not scientifically true!):

[9]:
mammals = ['Cat', 'Dog', 'Horse', 'Sheep', 'Pig', 'Cattle', 'Rhinoceros', 'Moose']
herbivores = ['Horse', 'Sheep', 'Cattle', 'Moose', 'Rhinoceros']
domesticated = ['Dog', 'Chicken', 'Horse', 'Sheep', 'Pig', 'Cattle', 'Duck']
(mammals, herbivores, domesticated)
[9]:
(['Cat', 'Dog', 'Horse', 'Sheep', 'Pig', 'Cattle', 'Rhinoceros', 'Moose'],
 ['Horse', 'Sheep', 'Cattle', 'Moose', 'Rhinoceros'],
 ['Dog', 'Chicken', 'Horse', 'Sheep', 'Pig', 'Cattle', 'Duck'])

Since this format lists the entries in each category, we can use from_contents to construct a data frame ready for plotting.

from_contents takes a dictionary as input. The input dictionary should have category names as keys and a list or set of category members as values:

[10]:
from upsetplot import from_contents
animals = from_contents({'mammal': mammals, 'herbivore': herbivores, 'domesticated': domesticated})
animals
[10]:
                                        id
mammal herbivore domesticated
True   False     False                 Cat
                 True                  Dog
       True      True                Horse
                 True                Sheep
       False     True                  Pig
       True      True               Cattle
                 False          Rhinoceros
                 False               Moose
False  False     True              Chicken
                 True                 Duck

Now we can plot:

[11]:
from upsetplot import UpSet
plt = UpSet(animals, subset_size='count').plot()
_images/formats_28_0.png

Alternatively, our input data may have been structured by species, allowing us to use from_memberships:

[12]:
from upsetplot import from_memberships

animal_memberships = {
    "Cat": "Mammal",
    "Dog": "Mammal,Domesticated",
    "Horse": "Mammal,Herbivore,Domesticated",
    "Sheep": "Mammal,Herbivore,Domesticated",
    "Pig": "Mammal,Domesticated",
    "Cattle": "Mammal,Herbivore,Domesticated",
    "Rhinoceros": "Mammal,Herbivore",
    "Moose": "Mammal,Herbivore",
    "Chicken": "Domesticated",
    "Duck": "Domesticated",
}

# Turn this into a list of lists:
animal_membership_lists = [categories.split(",") for categories in animal_memberships.values()]

animals = from_memberships(animal_membership_lists)
animals
[12]:
Domesticated  Herbivore  Mammal
False         False      True      1
True          False      True      1
              True       True      1
                         True      1
              False      True      1
              True       True      1
False         True       True      1
                         True      1
True          False      False     1
                         False     1
Name: ones, dtype: int64

This should produce the same plot:

[13]:
from upsetplot import UpSet
plt = UpSet(animals, subset_size='count').plot()
_images/formats_32_0.png

When category membership is indicated in DataFrame columns

Let's take a look at a movies dataset like that used in the original publication by Alexander Lex et al.

[14]:
import pandas as pd

movies = pd.read_csv("https://raw.githubusercontent.com/peetck/IMDB-Top1000-Movies/master/IMDB-Movie-Data.csv")
movies.head()
[14]:
Rank Title Genre Description Director Actors Year Runtime (Minutes) Rating Votes Revenue (Millions) Metascore
0 1 Guardians of the Galaxy Action,Adventure,Sci-Fi A group of intergalactic criminals are forced ... James Gunn Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S... 2014 121 8.1 757074 333.13 76
1 2 Prometheus Adventure,Mystery,Sci-Fi Following clues to the origin of mankind, a te... Ridley Scott Noomi Rapace, Logan Marshall-Green, Michael Fa... 2012 124 7.0 485820 126.46 65
2 3 Split Horror,Thriller Three girls are kidnapped by a man with a diag... M. Night Shyamalan James McAvoy, Anya Taylor-Joy, Haley Lu Richar... 2016 117 7.3 157606 138.12 62
3 4 Sing Animation,Comedy,Family In a city of humanoid animals, a hustling thea... Christophe Lourdelet Matthew McConaughey,Reese Witherspoon, Seth Ma... 2016 108 7.2 60545 270.32 59
4 5 Suicide Squad Action,Adventure,Fantasy A secret government agency recruits some of th... David Ayer Will Smith, Jared Leto, Margot Robbie, Viola D... 2016 123 6.2 393727 325.02 40

Here genre membership is represented by a comma-separated Genre column.

from_memberships is our best option:

[15]:
movies_by_genre = from_memberships(movies.Genre.str.split(','), data=movies)
movies_by_genre
[15]:
Rank Title Genre Description Director Actors Year Runtime (Minutes) Rating Votes Revenue (Millions) Metascore
Action Adventure Animation Biography Comedy Crime Drama Family Fantasy History Horror Music Musical Mystery Romance Sci-Fi Sport Thriller War Western
True True False False False False False False False False False False False False False True False False False False 1 Guardians of the Galaxy Action,Adventure,Sci-Fi A group of intergalactic criminals are forced ... James Gunn Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S... 2014 121 8.1 757074 333.13 76
False True False False False False False False False False False False False True False True False False False False 2 Prometheus Adventure,Mystery,Sci-Fi Following clues to the origin of mankind, a te... Ridley Scott Noomi Rapace, Logan Marshall-Green, Michael Fa... 2012 124 7.0 485820 126.46 65
False False False False False False False False False True False False False False False False True False False 3 Split Horror,Thriller Three girls are kidnapped by a man with a diag... M. Night Shyamalan James McAvoy, Anya Taylor-Joy, Haley Lu Richar... 2016 117 7.3 157606 138.12 62
True False True False False True False False False False False False False False False False False False 4 Sing Animation,Comedy,Family In a city of humanoid animals, a hustling thea... Christophe Lourdelet Matthew McConaughey,Reese Witherspoon, Seth Ma... 2016 108 7.2 60545 270.32 59
True True False False False False False False True False False False False False False False False False False False 5 Suicide Squad Action,Adventure,Fantasy A secret government agency recruits some of th... David Ayer Will Smith, Jared Leto, Margot Robbie, Viola D... 2016 123 6.2 393727 325.02 40
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
False False False False False True True False False False False False False True False False False False False False 996 Secret in Their Eyes Crime,Drama,Mystery A tight-knit team of rising investigators, alo... Billy Ray Chiwetel Ejiofor, Nicole Kidman, Julia Roberts... 2015 111 6.2 27585 0.00 45
False False False False False True False False False False False False False False False 997 Hostel: Part II Horror Three American college students studying abroa... Eli Roth Lauren German, Heather Matarazzo, Bijou Philli... 2007 94 5.5 73152 17.54 46
True False False False False True False False True False False False False False 998 Step Up 2: The Streets Drama,Music,Romance Romantic sparks occur between two dance studen... Jon M. Chu Robert Hoffman, Briana Evigan, Cassie Ventura,... 2008 98 6.2 70699 58.01 50
True False False True False False False False False False False False False False False False False False False 999 Search Party Adventure,Comedy A pair of friends embark on a mission to reuni... Scot Armstrong Adam Pally, T.J. Miller, Thomas Middleditch,Sh... 2014 93 5.6 4881 0.00 22
False False False True False False True True False False False False False False False False False False False 1000 Nine Lives Comedy,Family,Fantasy A stuffy businessman finds himself trapped ins... Barry Sonnenfeld Kevin Spacey, Jennifer Garner, Robbie Amell,Ch... 2016 87 5.3 12435 19.64 11

1000 rows × 12 columns

[16]:
UpSet(movies_by_genre)
[16]:
<upsetplot.plotting.UpSet at 0x7faa985332e8>
_images/formats_38_1.png

Given the size of this plot, we limit ourselves to frequent genres:

[17]:
UpSet(movies_by_genre, min_subset_size=15, show_counts=True).plot()
[17]:
{'matrix': <matplotlib.axes._subplots.AxesSubplot at 0x7faaa87e8ef0>,
 'shading': <matplotlib.axes._subplots.AxesSubplot at 0x7faad876a7b8>,
 'totals': <matplotlib.axes._subplots.AxesSubplot at 0x7faac8b93978>,
 'intersections': <matplotlib.axes._subplots.AxesSubplot at 0x7faaf845f978>}
_images/formats_40_1.png

If the genres were instead presented as a series of boolean columns, we could use from_indicators.

[18]:
genre_indicators = pd.DataFrame([{cat: True
                                  for cat in cats}
                                 for cats in movies.Genre.str.split(',').values]).fillna(False)
genre_indicators
[18]:
Action Adventure Sci-Fi Mystery Horror Thriller Animation Comedy Family Fantasy Drama Music Biography Romance History Crime Western War Musical Sport
0 True True True False False False False False False False False False False False False False False False False False
1 False True True True False False False False False False False False False False False False False False False False
2 False False False False True True False False False False False False False False False False False False False False
3 False False False False False False True True True False False False False False False False False False False False
4 True True False False False False False False False True False False False False False False False False False False
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
995 False False False True False False False False False False True False False False False True False False False False
996 False False False False True False False False False False False False False False False False False False False False
997 False False False False False False False False False False True True False True False False False False False False
998 False True False False False False False True False False False False False False False False False False False False
999 False False False False False False False True True True False False False False False False False False False False

1000 rows × 20 columns

[19]:
from upsetplot import from_indicators
# this produces the same result as from_memberships above
movies_by_genre = from_indicators(genre_indicators, data=movies)
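
To make the relationship between the two formats concrete, here is a plain-Python sketch (standard library only, and not the library's implementation) that turns comma-separated genre strings into membership lists and then into boolean indicator rows. The three genre strings are copied from the first rows of the movies table; the helper name `to_indicator_rows` is hypothetical.

```python
def to_indicator_rows(memberships):
    """Convert per-record category lists into per-record boolean indicator dicts."""
    # Collect category names in order of first appearance.
    categories = []
    for cats in memberships:
        for cat in cats:
            if cat not in categories:
                categories.append(cat)
    # One dict per record: True where the record belongs to the category.
    return [{cat: cat in cats for cat in categories} for cats in memberships]


genres = ["Action,Adventure,Sci-Fi", "Adventure,Mystery,Sci-Fi", "Horror,Thriller"]
memberships = [g.split(",") for g in genres]  # the from_memberships-style input
rows = to_indicator_rows(memberships)         # the from_indicators-style input
print(rows[1])
```

Both representations carry the same information, which is why the two loaders can describe the same data.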

These columns could also be part of the original matrix. For this case, from_indicators allows the indicators to be specified as a list of column names, or as a function of the data frame.

[20]:
movies_with_indicators = pd.concat([movies, genre_indicators], axis=1)
movies_with_indicators
[20]:
Rank Title Genre Description Director Actors Year Runtime (Minutes) Rating Votes ... Drama Music Biography Romance History Crime Western War Musical Sport
0 1 Guardians of the Galaxy Action,Adventure,Sci-Fi A group of intergalactic criminals are forced ... James Gunn Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S... 2014 121 8.1 757074 ... False False False False False False False False False False
1 2 Prometheus Adventure,Mystery,Sci-Fi Following clues to the origin of mankind, a te... Ridley Scott Noomi Rapace, Logan Marshall-Green, Michael Fa... 2012 124 7.0 485820 ... False False False False False False False False False False
2 3 Split Horror,Thriller Three girls are kidnapped by a man with a diag... M. Night Shyamalan James McAvoy, Anya Taylor-Joy, Haley Lu Richar... 2016 117 7.3 157606 ... False False False False False False False False False False
3 4 Sing Animation,Comedy,Family In a city of humanoid animals, a hustling thea... Christophe Lourdelet Matthew McConaughey,Reese Witherspoon, Seth Ma... 2016 108 7.2 60545 ... False False False False False False False False False False
4 5 Suicide Squad Action,Adventure,Fantasy A secret government agency recruits some of th... David Ayer Will Smith, Jared Leto, Margot Robbie, Viola D... 2016 123 6.2 393727 ... False False False False False False False False False False
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
995 996 Secret in Their Eyes Crime,Drama,Mystery A tight-knit team of rising investigators, alo... Billy Ray Chiwetel Ejiofor, Nicole Kidman, Julia Roberts... 2015 111 6.2 27585 ... True False False False False True False False False False
996 997 Hostel: Part II Horror Three American college students studying abroa... Eli Roth Lauren German, Heather Matarazzo, Bijou Philli... 2007 94 5.5 73152 ... False False False False False False False False False False
997 998 Step Up 2: The Streets Drama,Music,Romance Romantic sparks occur between two dance studen... Jon M. Chu Robert Hoffman, Briana Evigan, Cassie Ventura,... 2008 98 6.2 70699 ... True True False True False False False False False False
998 999 Search Party Adventure,Comedy A pair of friends embark on a mission to reuni... Scot Armstrong Adam Pally, T.J. Miller, Thomas Middleditch,Sh... 2014 93 5.6 4881 ... False False False False False False False False False False
999 1000 Nine Lives Comedy,Family,Fantasy A stuffy businessman finds himself trapped ins... Barry Sonnenfeld Kevin Spacey, Jennifer Garner, Robbie Amell,Ch... 2016 87 5.3 12435 ... False False False False False False False False False False

1000 rows × 32 columns

We can now specify some or all category column names instead of passing a separate indicator matrix:

[21]:
UpSet(from_indicators(["Drama", "Action", "Comedy", "Adventure"],
                      data=movies_with_indicators))
[21]:
<upsetplot.plotting.UpSet at 0x7faae8a30a20>
_images/formats_47_1.png

Or we can use DataFrame.select_dtypes to extract all boolean columns:

[22]:
UpSet(from_indicators(lambda df: df.select_dtypes(bool),
                      data=movies_with_indicators),
      min_subset_size=15, show_counts=True)
[22]:
<upsetplot.plotting.UpSet at 0x7faab010a5c0>
_images/formats_49_1.png

API Reference

Plotting
upsetplot.plot(data, fig=None, **kwargs)[source]

Make an UpSet plot of data on fig

Parameters:
data : pandas.Series or pandas.DataFrame

Values for each set to plot. Should have multi-index where each level is binary, corresponding to set membership. If a DataFrame, sum_over must be a string or False.

fig : matplotlib.figure.Figure, optional

Defaults to a new figure.

kwargs

Other arguments for UpSet

Returns:
subplots : dict of matplotlib.axes.Axes

Keys are ‘matrix’, ‘intersections’, ‘totals’, ‘shading’

class upsetplot.UpSet(data, orientation='horizontal', sort_by='degree', sort_categories_by='cardinality', subset_size='auto', sum_over=None, min_subset_size=None, max_subset_size=None, min_degree=None, max_degree=None, facecolor='auto', other_dots_color=0.18, shading_color=0.05, with_lines=True, element_size=32, intersection_plot_elements=6, totals_plot_elements=2, show_counts='', show_percentages=False)[source]

Manage the data and drawing for a basic UpSet plot

Primary public method is plot().

Parameters:
data : pandas.Series or pandas.DataFrame

Elements associated with categories (a DataFrame), or the size of each subset of categories (a Series). Should have MultiIndex where each level is binary, corresponding to category membership. If a DataFrame, sum_over must be a string or False.

orientation : {‘horizontal’ (default), ‘vertical’}

If horizontal, intersections are listed from left to right.

sort_by : {‘cardinality’, ‘degree’, None}

If ‘cardinality’, subsets are listed from largest to smallest. If ‘degree’, they are listed in order of the number of categories intersected. If None, the order they appear in the data input is used.

Changed in version 0.5: Setting None was added.

sort_categories_by : {‘cardinality’, None}

Whether to sort the categories by total cardinality, or leave them in the provided order.

New in version 0.3.

subset_size : {‘auto’, ‘count’, ‘sum’}

Configures how to calculate the size of a subset. Choices are:

‘auto’ (default)

If data is a DataFrame, count the number of rows in each group, unless sum_over is specified. If data is a Series with at most one row for each group, use the value of the Series. If data is a Series with more than one row per group, raise a ValueError.

‘count’

Count the number of rows in each group.

‘sum’

Sum the value of the data Series, or the DataFrame field specified by sum_over.

sum_over : str or None

If subset_size='sum' or 'auto', then the intersection size is the sum of the specified field in the data DataFrame. If a Series, only None is supported and its value is summed.

min_subset_size : int, optional

Minimum size of a subset to be shown in the plot. All subsets with a size smaller than this threshold will be omitted from plotting. Size may be a sum of values, see subset_size.

New in version 0.5.

max_subset_size : int, optional

Maximum size of a subset to be shown in the plot. All subsets with a size greater than this threshold will be omitted from plotting.

New in version 0.5.

min_degree : int, optional

Minimum degree of a subset to be shown in the plot.

New in version 0.5.

max_degree : int, optional

Maximum degree of a subset to be shown in the plot.

New in version 0.5.

facecolor : ‘auto’ or matplotlib color or float

Color for bar charts and active dots. Defaults to black if axes.facecolor is a light color, otherwise white.

Changed in version 0.6: Before 0.6, the default was ‘black’

other_dots_color : matplotlib color or float

Color for shading of inactive dots, or opacity (between 0 and 1) applied to facecolor.

New in version 0.6.

shading_color : matplotlib color or float

Color for shading of odd rows in matrix and totals, or opacity (between 0 and 1) applied to facecolor.

New in version 0.6.

with_lines : bool

Whether to show lines joining dots in the matrix, to mark multiple categories being intersected.

element_size : float or None

Side length in pt. If None, size is estimated to fit figure

intersection_plot_elements : int

The intersections plot should be large enough to fit this many matrix elements. Set to 0 to disable intersection size bars.

Changed in version 0.4: Setting to 0 is handled.

totals_plot_elements : int

The totals plot should be large enough to fit this many matrix elements.

show_counts : bool or str, default=False

Whether to label the intersection size bars with the cardinality of the intersection. When a string, this formats the number. For example, ‘%d’ is equivalent to True.

show_percentages : bool, default=False

Whether to label the intersection size bars with the percentage of the intersection relative to the total dataset. This may be applied with or without show_counts.

New in version 0.4.
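
As a rough illustration of what sort_by, min_subset_size, max_degree and show_percentages mean, the following plain-Python sketch applies the documented rules to the counts shown for generate_counts() in the introduction. It is not the library's code: the tie-breaking within equal degree (input order is kept here) and the helper name `degree` are assumptions.

```python
# Subset sizes keyed by membership in (cat0, cat1, cat2).
counts = {
    (False, False, False): 56,
    (False, False, True): 283,
    (False, True, False): 1279,
    (False, True, True): 5882,
    (True, False, False): 24,
    (True, False, True): 90,
    (True, True, False): 429,
    (True, True, True): 1957,
}


def degree(subset):
    """Number of categories intersected by a subset."""
    return sum(subset)


# sort_by='cardinality': largest subsets first.
by_cardinality = sorted(counts, key=counts.get, reverse=True)

# sort_by='degree': fewest intersected categories first (stable sort).
by_degree = sorted(counts, key=degree)

# min_subset_size=100 with max_degree=2: omit smaller subsets and higher degrees.
kept = {s: n for s, n in counts.items() if n >= 100 and degree(s) <= 2}

# show_percentages: size of each subset relative to the total dataset.
total = sum(counts.values())
percentages = {s: 100 * n / total for s, n in counts.items()}

print(by_cardinality[0], [degree(s) for s in by_degree], sorted(kept.values()))
```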

Methods

add_catplot(self, kind[, value, elements]) Add a seaborn catplot over subsets when plot() is called.
add_stacked_bars(self, by[, sum_over, …]) Add a stacked bar chart over subsets when plot() is called.
make_grid(self[, fig]) Get a SubplotSpec for each Axes, accounting for label text width
plot(self[, fig]) Draw all parts of the plot onto fig or a new figure
plot_intersections(self, ax) Plot bars indicating intersection size
plot_matrix(self, ax) Plot the matrix of intersection indicators onto ax
plot_totals(self, ax) Plot bars indicating total set size
style_subsets(self[, present, absent, …]) Updates the style of selected subsets’ bars and matrix dots
plot_shading  
add_catplot(self, kind, value=None, elements=3, **kw)[source]

Add a seaborn catplot over subsets when plot() is called.

Parameters:
kind : str

One of {“point”, “bar”, “strip”, “swarm”, “box”, “violin”, “boxen”}

value : str, optional

Column name for the value to plot (i.e. y if orientation=’horizontal’), required if data is a DataFrame.

elements : int, default=3

Size of the axes counted in number of matrix elements.

**kw : dict

Additional keywords to pass to seaborn.catplot().

Our implementation automatically determines ‘ax’, ‘data’, ‘x’, ‘y’ and ‘orient’, so these are prohibited keys in kw.

Returns:
None
add_stacked_bars(self, by, sum_over=None, colors=None, elements=3, title=None)[source]

Add a stacked bar chart over subsets when plot() is called.

Used to plot categorical variable distributions within each subset.

New in version 0.6.

Parameters:
by : str

Column name within the dataframe for color coding the stacked bars, containing discrete or categorical values.

sum_over : str, optional

Ordinarily the bars will chart the size of each group. sum_over may specify a column which will be summed to determine the size of each bar.

colors : Mapping, list-like, str or callable, optional

The facecolors to use for bars corresponding to each discrete label, specified as one of:

Mapping

Maps from label to matplotlib-compatible color specification.

list-like

A list of matplotlib colors to apply to labels in order.

str

The name of a matplotlib colormap.

callable

When called with the number of labels, this should return a list-like of that many colors. Matplotlib colormaps satisfy this callable API.

None

Uses the matplotlib default colormap.

elements : int, default=3

Size of the axes counted in number of matrix elements.

title : str, optional

The axis title labelling bar length.

Returns:
None
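
For the callable form of colors, a minimal sketch of a function satisfying the described contract (called with the number of labels, returning that many colors) might look as follows. The name `grey_ramp` and the grey range are illustrative choices, not part of the library; the only assumption about matplotlib is that RGB tuples of floats between 0 and 1 are valid color specifications.

```python
def grey_ramp(n_labels):
    """Return n_labels RGB tuples running from dark grey to light grey."""
    if n_labels == 1:
        return [(0.5, 0.5, 0.5)]
    step = 0.6 / (n_labels - 1)
    # Each color repeats one grey level across the R, G and B channels.
    return [(0.2 + i * step,) * 3 for i in range(n_labels)]


palette = grey_ramp(4)
print(palette)
```

Such a function could then be passed as colors=grey_ramp instead of a colormap name.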
make_grid(self, fig=None)[source]

Get a SubplotSpec for each Axes, accounting for label text width

plot(self, fig=None)[source]

Draw all parts of the plot onto fig or a new figure

Parameters:
fig : matplotlib.figure.Figure, optional

Defaults to a new figure.

Returns:
subplots : dict of matplotlib.axes.Axes

Keys are ‘matrix’, ‘intersections’, ‘totals’, ‘shading’

plot_intersections(self, ax)[source]

Plot bars indicating intersection size

plot_matrix(self, ax)[source]

Plot the matrix of intersection indicators onto ax

plot_totals(self, ax)[source]

Plot bars indicating total set size

style_subsets(self, present=None, absent=None, min_subset_size=None, max_subset_size=None, min_degree=None, max_degree=None, facecolor=None, edgecolor=None, hatch=None, linewidth=None, linestyle=None, label=None)[source]

Updates the style of selected subsets’ bars and matrix dots

Parameters are either used to select subsets, or to style them with attributes of matplotlib.patches.Patch, apart from label, which adds a legend entry.

Parameters:
present : str or list of str, optional

Category or categories that must be present in subsets for styling.

absent : str or list of str, optional

Category or categories that must not be present in subsets for styling.

min_subset_size : int, optional

Minimum size of a subset to be styled.

max_subset_size : int, optional

Maximum size of a subset to be styled.

min_degree : int, optional

Minimum degree of a subset to be styled.

max_degree : int, optional

Maximum degree of a subset to be styled.

facecolor : str or matplotlib color, optional

Override the default UpSet facecolor for selected subsets.

edgecolor : str or matplotlib color, optional

Set the edgecolor for bars, dots, and the line between dots.

hatch : str, optional

Set the hatch. This will apply to intersection size bars, but not to matrix dots.

linewidth : int, optional

Line width in points for edges.

linestyle : str, optional

Line style for edges.

label : str, optional

If provided, a legend will be added
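
The selection parameters can be read as a query over subsets. The sketch below mimics that reading in plain Python, assuming the criteria are combined with a logical AND; the function `select_subsets` is hypothetical and does not reproduce the library's internals.

```python
def select_subsets(subsets, present=None, absent=None,
                   min_degree=None, max_degree=None):
    """Return the subsets (frozensets of category names) matching the query."""
    def as_set(value):
        # Accept a single category name or a list of names.
        if value is None:
            return set()
        return {value} if isinstance(value, str) else set(value)

    present, absent = as_set(present), as_set(absent)
    selected = []
    for subset in subsets:
        if not present <= subset:
            continue  # a required category is missing
        if absent & subset:
            continue  # an excluded category is present
        if min_degree is not None and len(subset) < min_degree:
            continue
        if max_degree is not None and len(subset) > max_degree:
            continue
        selected.append(subset)
    return selected


subsets = [frozenset(s) for s in ([], ["cat0"], ["cat1"], ["cat0", "cat1"],
                                  ["cat1", "cat2"], ["cat0", "cat1", "cat2"])]
matched = select_subsets(subsets, present="cat1", absent="cat2")
print(matched)
```

Here only the subsets containing cat1 but not cat2 are selected for styling.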

Dataset loading and generation
upsetplot.from_contents(contents, data=None, id_column='id')[source]

Build data from category listings

Parameters:
contents : Mapping (or iterable over pairs) of strings to sets

Keys are category names, values are sets of identifiers (int or string).

data : DataFrame, optional

If provided, this should be indexed by the identifiers used in contents.

id_column : str, default=’id’

The column name to use for the identifiers in the output.

Returns:
DataFrame

data is returned with its index indicating category membership, including a column named according to id_column. If data is not given, the order of rows is not assured.

Notes

The order of categories in the output DataFrame is determined from contents, which may have non-deterministic iteration order.

Examples

>>> from upsetplot import from_contents
>>> contents = {'cat1': ['a', 'b', 'c'],
...             'cat2': ['b', 'd'],
...             'cat3': ['e']}
>>> from_contents(contents)
                  id
cat1  cat2  cat3
True  False False  a
      True  False  b
      False False  c
False True  False  d
      False True   e
>>> import pandas as pd
>>> contents = {'cat1': [0, 1, 2],
...             'cat2': [1, 3],
...             'cat3': [4]}
>>> data = pd.DataFrame({'favourite': ['green', 'red', 'red',
...                                    'yellow', 'blue']})
>>> from_contents(contents, data=data)
                   id favourite
cat1  cat2  cat3
True  False False   0     green
      True  False   1       red
      False False   2       red
False True  False   3    yellow
      False True    4      blue
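
Conceptually, from_contents inverts a mapping from categories to identifiers into a mapping from identifiers to category membership. A standard-library sketch of that inversion, using the contents of the first example above (this is an illustration, not the library's implementation):

```python
contents = {"cat1": ["a", "b", "c"], "cat2": ["b", "d"], "cat3": ["e"]}

# Map each identifier to the set of categories that list it.
membership = {}
for category, ids in contents.items():
    for identifier in ids:
        membership.setdefault(identifier, set()).add(category)

# One boolean tuple per identifier, with one flag per category.
categories = list(contents)
rows = {
    identifier: tuple(cat in cats for cat in categories)
    for identifier, cats in sorted(membership.items())
}
for identifier, flags in rows.items():
    print(identifier, flags)
```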
upsetplot.from_indicators(indicators, data=None)[source]

Load category membership indicated by a boolean indicator matrix

This loader also supports the case where the indicator columns can be derived from data.

New in version 0.6.

Parameters:
indicators : DataFrame-like of booleans, Sequence of str, or callable

Specifies the category indicators (boolean mask arrays) within data, i.e. which records in data belong to which categories.

If a list of strings, these should be column names found in data whose values are boolean mask arrays.

If a DataFrame, its columns should correspond to categories, and its index should be a subset of those in data, values should be True where a data record is in that category, and False or NA otherwise.

If callable, it will be applied to data after the latter is converted to a Series or DataFrame.

data : Series-like or DataFrame-like, optional

If given, the index of category membership is attached to this data. It must have the same length as indicators. If not given, the series will contain the value 1.

Returns:
DataFrame or Series

data is returned with its index indicating category membership. It will be a Series if data is a Series or 1d numeric array or None.

Notes

Categories with indicators that are all False will be removed.

Examples

>>> import pandas as pd
>>> from upsetplot import from_indicators

Just indicators

>>> indicators = {"cat1": [True, False, True, False],
...               "cat2": [False, True, False, False],
...               "cat3": [True, True, False, False]}
>>> from_indicators(indicators)
cat1   cat2   cat3
True   False  True     1.0
False  True   True     1.0
True   False  False    1.0
False  False  False    1.0
Name: ones, dtype: float64

Where indicators are included within data, specifying columns by name

>>> data = pd.DataFrame({"value": [5, 4, 6, 4], **indicators})
>>> from_indicators(["cat1", "cat3"], data=data)
             value   cat1   cat2   cat3
cat1  cat3
True  True       5   True  False   True
False True       4  False   True   True
True  False      6   True  False  False
False False      4  False  False  False

Making indicators out of all boolean columns

>>> from_indicators(lambda data: data.select_dtypes(bool), data=data)
                   value   cat1   cat2   cat3
cat1  cat2  cat3
True  False True       5   True  False   True
False True  True       4  False   True   True
True  False False      6   True  False  False
False False False      4  False  False  False

Using a dataset with missing data, we can use missingness as an indicator

>>> data = pd.DataFrame({"val1": [pd.NA, .7, pd.NA, .9],
...                      "val2": ["male", pd.NA, "female", "female"],
...                      "val3": [pd.NA, pd.NA, 23000, 78000]})
>>> from_indicators(pd.isna, data=data)
                   val1    val2   val3
val1  val2  val3
True  False True   <NA>    male   <NA>
False True  True    0.7    <NA>   <NA>
True  False False  <NA>  female  23000
False False False   0.9  female  78000

upsetplot.from_memberships(memberships, data=None)[source]

Load data where each sample has a collection of category names

The output should be suitable for passing to UpSet or plot.

Parameters:
memberships : sequence of collections of strings

Each element corresponds to a data point, indicating the sets it is a member of. Each category is named by a string.

data : Series-like or DataFrame-like, optional

If given, the index of category memberships is attached to this data. It must have the same length as memberships. If not given, the series will contain the value 1.

Returns:
DataFrame or Series

data is returned with its index indicating category membership. It will be a Series if data is a Series or 1d numeric array. The index will have levels ordered by category names.

Examples

>>> from upsetplot import from_memberships
>>> from_memberships([
...     ['cat1', 'cat3'],
...     ['cat2', 'cat3'],
...     ['cat1'],
...     []
... ])
cat1   cat2   cat3
True   False  True     1
False  True   True     1
True   False  False    1
False  False  False    1
Name: ones, dtype: ...
>>> # now with data:
>>> import numpy as np
>>> from_memberships([
...     ['cat1', 'cat3'],
...     ['cat2', 'cat3'],
...     ['cat1'],
...     []
... ], data=np.arange(12).reshape(4, 3))
                   0   1   2
cat1  cat2  cat3
True  False True   0   1   2
False True  True   3   4   5
True  False False  6   7   8
False False False  9  10  11
upsetplot.generate_counts(seed=0, n_samples=10000, n_categories=3)[source]

Generate artificial counts corresponding to set intersections

Parameters:
seed : int

A seed for randomisation

n_samples : int

Number of samples to generate statistics over

n_categories : int

Number of categories (named “cat0”, “cat1”, …) to generate

Returns:
Series

Counts indexed by boolean indicator mask for each category.

See also

generate_samples
Generates a DataFrame of samples that these counts are derived from.
upsetplot.generate_samples(seed=0, n_samples=10000, n_categories=3)[source]

Generate artificial samples assigned to set intersections

Parameters:
seed : int

A seed for randomisation

n_samples : int

Number of samples to generate

n_categories : int

Number of categories (named “cat0”, “cat1”, …) to generate

Returns:
DataFrame

Field ‘value’ is a weight or score for each element. Field ‘index’ is a unique id for each element. Index includes a boolean indicator mask for each category.

Note: Further fields may be added in future versions.

See also

generate_counts
Generates the counts for each subset of categories corresponding to these samples.
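
The relationship between the two generators can be pictured as counting samples per membership combination. The sketch below shows that idea with the standard library only; the random model (each sample joins each category with probability 0.5) is a stand-in assumption and is not how upsetplot generates its data.

```python
import random
from collections import Counter

rng = random.Random(0)  # seeded for reproducibility
n_samples, n_categories = 1000, 3

# Stand-in sample generator: one boolean membership tuple per sample.
samples = [
    tuple(rng.random() < 0.5 for _ in range(n_categories))
    for _ in range(n_samples)
]

# Counting samples per membership combination yields the subset sizes.
counts = Counter(samples)

for subset, size in sorted(counts.items()):
    print(subset, size)
```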

Changelog

What’s new in version 0.6
  • Added add_stacked_bars, similar to add_catplot but to add stacked bar charts to show discrete variable distributions within each subset. (#137)
  • Improved ability to control colors, and added a new example of same. Parameters other_dots_color and shading_color were added. facecolor will now default to white if matplotlib.rcParams['axes.facecolor'] is dark. (#138)
  • Added style_subsets to colour intersection size bars and matrix dots in the plot according to a specified query. (#152)
  • Added from_indicators to allow yet another data input format. This allows category membership to be easily derived from a DataFrame, such as when plotting missing values in the columns of a DataFrame. (#143)
What’s new in version 0.5
  • Support using input intersection order with sort_by=None (#133 with thanks to Brandon B).
  • Add parameters for filtering by subset size (with thanks to Sichong Peng) and degree. (#134)
  • Fixed an issue where tick labels were not given enough space and overlapped category totals. (#132)
  • Fixed an issue where our implementation of sort_by='degree' apparently gave incorrect results for some inputs and versions of Pandas. (#134)
What’s new in version 0.4.4
  • Fixed a regression which caused the first column to be hidden (#125)
What’s new in version 0.4.3
  • Fixed issue with the order of catplots being reversed for vertical plots (#122 with thanks to Enrique Fernandez-Blanco)
  • Fixed issue with the x limits of vertical plots (#121).
What’s new in version 0.4.2
  • Fixed large x-axis plot margins with high number of unique intersections (#106 with thanks to Yidi Huang)
What’s new in version 0.4.1
  • Fixed the calculation of percentage which was broken in 0.4.0. (#101)
What’s new in version 0.4
  • Added option to display both the absolute frequency and the percentage of the total for each intersection and category. (#89 with thanks to Carlos Melus and Aaron Rosenfeld)
  • Improved efficiency where there are many categories, but valid combinations are sparse, if sort_by='degree'. (#82)
  • Permit truthy (not necessarily bool) values in index. (#74 with thanks to @ZaxR)
  • intersection_plot_elements can now be set to 0 to hide the intersection size plot when add_catplot is used. (#80)
What’s new in version 0.3
  • Added from_contents to provide an alternative, intuitive way of specifying category membership of elements.
  • To improve code legibility and intuitiveness, sum_over=False was deprecated and a subset_size parameter was added. It will have better default handling of DataFrames after a short deprecation period.
  • generate_data has been replaced with generate_counts and generate_samples.
  • Fixed the display of the “intersection size” label on plots, which had been missing.
  • Trying to improve nomenclature, upsetplot now avoids “set” to refer to the top-level sets, which are now to be known as “categories”. This matches the intuition that categories are named, logical groupings, as opposed to “subsets”. To this end:
    • generate_counts (formerly generate_data) now names its categories “cat1”, “cat2” etc. rather than “set1”, “set2”, etc.
    • the sort_sets_by parameter has been renamed to sort_categories_by; the old name will be removed in version 0.4.
What’s new in version 0.2.1
  • Return a Series (not a DataFrame) from from_memberships if data is 1-dimensional.
What’s new in version 0.2
  • Added from_memberships to allow a more convenient data input format.
  • plot and UpSet now accept a pandas.DataFrame as input, if the sum_over parameter is also given.
  • Added an add_catplot method to UpSet which adds Seaborn plots of set intersection data to show more than just set size or total.
  • Shading of subset matrix is continued through to totals.
  • Added a show_counts option to show counts at the ends of bar plots. (#5)
  • Defined _repr_html_ so that an UpSet object will render in Jupyter notebooks. (#36)
  • Fix a bug where an error was raised if an input set was empty.