API Reference

Plotting

upsetplot.plot(data, fig=None, **kwargs)

Make an UpSet plot of data on fig

Parameters:

data : pandas.Series or pandas.DataFrame

Values for each set to plot. Should have a MultiIndex where each level is binary, corresponding to set membership. If a DataFrame, `sum_over` must be a string or False.

fig : matplotlib.figure.Figure, optional

Defaults to a new figure.

**kwargs

Other arguments for `UpSet`.

Returns:

subplots : dict of matplotlib.axes.Axes

Keys are ‘matrix’, ‘intersections’, ‘totals’, ‘shading’.

class upsetplot.UpSet(data, orientation='horizontal', sort_by='degree', sort_categories_by='cardinality', subset_size='auto', sum_over=None, min_subset_size=None, max_subset_size=None, max_subset_rank=None, min_degree=None, max_degree=None, facecolor='auto', other_dots_color=0.18, shading_color=0.05, with_lines=True, element_size=32, intersection_plot_elements=6, totals_plot_elements=2, show_counts='', show_percentages=False, include_empty_subsets=False)

Manage the data and drawing for a basic UpSet plot

The primary public method is plot().

Parameters:

data : pandas.Series or pandas.DataFrame

Elements associated with categories (a DataFrame), or the size of each subset of categories (a Series). Should have a MultiIndex where each level is binary, corresponding to category membership. If a DataFrame, sum_over must be a string or False.

orientation : {‘horizontal’ (default), ‘vertical’}

If horizontal, intersections are listed from left to right.

sort_by : {‘cardinality’, ‘degree’, ‘-cardinality’, ‘-degree’, ‘input’, ‘-input’}

If ‘cardinality’, subsets are listed from largest to smallest. If ‘degree’, they are listed in order of the number of categories intersected. If ‘input’, the order in which they appear in the input data is used. Prefix with ‘-’ to reverse the ordering.

Note this affects subset_sizes but not data.
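The orderings described for sort_by can be sketched in plain Python. This is only an illustration of the documented semantics, not upsetplot's implementation: the helper name `sort_subsets` is hypothetical, the sizes are taken from the query() example in this reference, and ties are assumed to keep their input order.

```python
# Subset sizes borrowed from the query() example in this reference.
# Keys are the categories intersected; values are subset sizes.
subset_sizes = {
    frozenset({"cat1"}): 11,
    frozenset({"cat2"}): 1,
    frozenset({"cat1", "cat2"}): 3,
}


def sort_subsets(sizes, sort_by):
    """Hypothetical helper: order subset keys as described for sort_by."""
    reverse = sort_by.startswith("-")
    mode = sort_by.lstrip("-")
    items = list(sizes.items())  # dicts keep insertion ("input") order
    if mode == "cardinality":
        items.sort(key=lambda kv: -kv[1])  # largest subset first
    elif mode == "degree":
        items.sort(key=lambda kv: len(kv[0]))  # fewest categories first
    elif mode != "input":
        raise ValueError(f"unknown sort_by: {sort_by!r}")
    keys = [key for key, _ in items]
    # Assumption: a leading '-' simply reverses the resulting order.
    return keys[::-1] if reverse else keys


by_degree = [subset_sizes[k] for k in sort_subsets(subset_sizes, "degree")]
by_cardinality = [subset_sizes[k] for k in sort_subsets(subset_sizes, "cardinality")]
print(by_degree)       # sizes in degree order
print(by_cardinality)  # sizes in cardinality order
```

Both sorts are stable, so subsets of equal degree or equal size stay in input order here; the library's own tie-breaking may differ.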

sort_categories_by : {‘cardinality’, ‘-cardinality’, ‘input’, ‘-input’}

Whether to sort the categories by total cardinality, or leave them in the input data’s provided order (order of index levels). Prefix with ‘-’ to reverse the ordering.

subset_size : {‘auto’, ‘count’, ‘sum’}

Configures how to calculate the size of a subset. Choices are:

‘auto’ (default): If data is a DataFrame, count the number of rows in each group, unless sum_over is specified. If data is a Series with at most one row for each group, use the value of the Series. If data is a Series with more than one row per group, raise a ValueError.
‘count’: Count the number of rows in each group.
‘sum’: Sum the value of the data Series, or the DataFrame field specified by sum_over.
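The difference between ‘count’ and ‘sum’ can be illustrated without pandas. The sketch below is an assumption-laden stand-in, not the library's code: rows are modelled as (membership, value) pairs, and the helper name `compute_subset_sizes` is hypothetical.

```python
from collections import defaultdict

# Each row: (categories the element belongs to, value of some numeric field).
rows = [
    (frozenset({"cat1"}), 2.0),
    (frozenset({"cat1"}), 3.0),
    (frozenset({"cat1", "cat2"}), 5.0),
]


def compute_subset_sizes(rows, how):
    """Hypothetical helper: size of each subset by row count or by summed value."""
    if how not in ("count", "sum"):
        raise ValueError(f"unsupported mode in this sketch: {how!r}")
    sizes = defaultdict(float)
    for membership, value in rows:
        sizes[membership] += 1 if how == "count" else value
    return dict(sizes)


counts = compute_subset_sizes(rows, "count")
sums = compute_subset_sizes(rows, "sum")
print(counts[frozenset({"cat1"})], sums[frozenset({"cat1"})])
```

With these rows the {cat1} subset has two elements but a summed value of 5.0, which is why the two modes can rank subsets differently.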

sum_over : str or None

If subset_size='sum' or 'auto', then the intersection size is the sum of the specified field in the data DataFrame. If data is a Series, only None is supported and the values of the Series are summed.

min_subset_size : int or “number%”, optional

Minimum size of a subset to be shown in the plot. All subsets with a size smaller than this threshold will be omitted from plotting. This may be specified as a percentage using a string, like “50%”. Size may be a sum of values, see subset_size.

New in version 0.5.

Changed in version 0.9: Support percentages

max_subset_size : int or “number%”, optional

Maximum size of a subset to be shown in the plot. All subsets with a size greater than this threshold will be omitted from plotting. This may be specified as a percentage using a string, like “50%”.

New in version 0.5.

Changed in version 0.9: Support percentages
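A rough sketch of how a size threshold given as an int or a “number%” string might be resolved and applied follows. It rests on an assumption this reference does not spell out: that the percentage is taken relative to the total size over all subsets. The helper name `resolve_threshold` is hypothetical.

```python
def resolve_threshold(threshold, total):
    """Hypothetical helper: turn an int or a 'number%' string into an absolute size.

    Assumption: percentages are relative to the total size of all subsets.
    """
    if isinstance(threshold, str):
        if not threshold.endswith("%"):
            raise ValueError(f"expected a string like '50%', got {threshold!r}")
        return float(threshold[:-1]) * total / 100
    return threshold


sizes = {"A": 12, "B": 3, "A&B": 5}
total = sum(sizes.values())

# min_subset_size: subsets smaller than the threshold are omitted.
min_size = resolve_threshold("25%", total)
shown_min = {name: size for name, size in sizes.items() if size >= min_size}

# max_subset_size: subsets greater than the threshold are omitted.
max_size = resolve_threshold(5, total)
shown_max = {name: size for name, size in sizes.items() if size <= max_size}

print(shown_min)
print(shown_max)
```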

max_subset_rank : int, optional

Limit to the top N ranked subsets in descending order of size. All tied subsets are included.

New in version 0.9.
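The tie-handling rule for max_subset_rank can be sketched as follows. This is a plain-Python illustration under the stated rule that all tied subsets are included, not the library's implementation; `top_ranked` is a hypothetical name.

```python
def top_ranked(sizes, max_rank):
    """Hypothetical helper: keep the top max_rank subsets by size, plus ties."""
    if max_rank < 1:
        raise ValueError("max_rank must be at least 1")
    ordered = sorted(sizes.values(), reverse=True)
    if max_rank >= len(ordered):
        return dict(sizes)
    cutoff = ordered[max_rank - 1]  # size of the N-th largest subset
    # Every subset at least as large as the cutoff survives, so ties are kept.
    return {name: size for name, size in sizes.items() if size >= cutoff}


sizes = {"A": 11, "B": 3, "A&B": 3, "C": 1}
kept = top_ranked(sizes, 2)
print(kept)
```

Here asking for the top 2 returns three subsets, because "B" and "A&B" tie for second place.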

min_degree : int, optional

Minimum degree of a subset to be shown in the plot.

New in version 0.5.

max_degree : int, optional

Maximum degree of a subset to be shown in the plot.

New in version 0.5.

facecolor : ‘auto’ or matplotlib color or float

Color for bar charts and active dots. Defaults to black if axes.facecolor is a light color, otherwise white.

Changed in version 0.6: Before 0.6, the default was ‘black’

other_dots_color : matplotlib color or float

Color for shading of inactive dots, or opacity (between 0 and 1) applied to facecolor.

New in version 0.6.

shading_color : matplotlib color or float

Color for shading of odd rows in matrix and totals, or opacity (between 0 and 1) applied to facecolor.

New in version 0.6.

with_lines : bool

Whether to show lines joining dots in the matrix, to mark multiple categories being intersected.

element_size : float or None

Side length of each matrix element in points. If None, the size is estimated to fit the figure.

intersection_plot_elements : int

The intersections plot should be large enough to fit this many matrix elements. Set to 0 to disable intersection size bars.

Changed in version 0.4: Setting to 0 is handled.

totals_plot_elements : int

The totals plot should be large enough to fit this many matrix elements. Set to 0 to disable the totals plot.

Changed in version 0.9: Setting to 0 is handled.

show_counts : bool or str, default=False

Whether to label the intersection size bars with the cardinality of the intersection. When a string, this formats the number. For example, ‘{:d}’ is equivalent to True. Note that, for legacy reasons, if the string does not contain ‘{‘, it will be interpreted as a C-style format string, such as ‘%d’.

show_percentages : bool or str, default=False

Whether to label the intersection size bars with the percentage of the intersection relative to the total dataset. When a string, this formats the number representing a fraction of samples. For example, ‘{:.1%}’ is the default, formatting .123 as 12.3%. This may be applied with or without show_counts.

New in version 0.4.
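The two format-string styles accepted by show_counts and show_percentages can be illustrated with standard Python formatting. The dispatch below mirrors the rule stated above (a string without ‘{’ is treated as C-style); the helper name `format_label` is hypothetical and this is not the library's own code.

```python
def format_label(fmt, number):
    """Hypothetical helper: apply a str.format template or a C-style template."""
    if "{" in fmt:
        return fmt.format(number)
    return fmt % number  # legacy C-style, e.g. '%d'


print(format_label("{:d}", 42))       # str.format style
print(format_label("%d", 42))         # legacy C-style
print(format_label("{:,d}", 1234567)) # thousands separator
print(format_label("{:.1%}", 0.123))  # fraction rendered as a percentage
```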

include_empty_subsets : bool (default=False)

If True, all possible category combinations will be shown as subsets, even when some are not present in data.
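To give a sense of what “all possible category combinations” means, the sketch below enumerates them with itertools. It assumes that combinations absent from the data get a size of 0 and that the combination with no categories is counted too; neither detail is stated in this reference.

```python
from itertools import product

categories = ["cat0", "cat1", "cat2"]

# Sizes of the subsets actually observed, keyed by membership flags
# in the same order as `categories`.
observed = {
    (True, False, False): 11,
    (True, True, False): 3,
}

# Every True/False combination over the categories: 2 ** len(categories).
all_combinations = list(product([True, False], repeat=len(categories)))

# Assumption: combinations missing from the data are reported with size 0.
filled = {combo: observed.get(combo, 0) for combo in all_combinations}

print(len(filled))
```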

Methods

`add_catplot`(kind[, value, elements])	Add a seaborn catplot over subsets when `plot()` is called.
`add_stacked_bars`(by[, sum_over, colors, …])	Add a stacked bar chart over subsets when `plot()` is called.
`make_grid`([fig])	Get a SubplotSpec for each Axes, accounting for label text width
`plot`([fig])	Draw all parts of the plot onto fig or a new figure
`plot_intersections`(ax)	Plot bars indicating intersection size
`plot_matrix`(ax)	Plot the matrix of intersection indicators onto ax
`plot_totals`(ax)	Plot bars indicating total set size
`style_categories`(categories, *[, …])	Updates the style of the categories.
`style_subsets`([present, absent, …])	Updates the style of selected subsets’ bars and matrix dots

plot_shading

add_catplot(kind, value=None, elements=3, **kw)

Add a seaborn catplot over subsets when plot() is called.

Parameters:

kind : str

One of {“point”, “bar”, “strip”, “swarm”, “box”, “violin”, “boxen”}

value : str, optional

Column name for the value to plot (i.e. y if orientation=’horizontal’), required if data is a DataFrame.

elements : int, default=3

Size of the axes counted in number of matrix elements.

**kw : dict

Additional keywords to pass to seaborn.catplot().

Our implementation automatically determines ‘ax’, ‘data’, ‘x’, ‘y’ and ‘orient’, so these are prohibited keys in kw.

Returns:

None

add_stacked_bars(by, sum_over=None, colors=None, elements=3, title=None)

Add a stacked bar chart over subsets when plot() is called.

Used to plot categorical variable distributions within each subset.

New in version 0.6.

Parameters:

by : str

Column name within the dataframe for color coding the stacked bars, containing discrete or categorical values.

sum_over : str, optional

Ordinarily the bars will chart the size of each group. sum_over may specify a column which will be summed to determine the size of each bar.

colors : Mapping, list-like, str or callable, optional

The facecolors to use for bars corresponding to each discrete label, specified as one of:

Mapping: Maps from label to matplotlib-compatible color specification.
list-like: A list of matplotlib colors to apply to labels in order.
str: The name of a matplotlib colormap.
callable: When called with the number of labels, this should return a list-like of that many colors. Matplotlib colormaps satisfy this callable API.
None: Uses the matplotlib default colormap.

elements : int, default=3

Size of the axes counted in number of matrix elements.

title : str, optional

The axis title labelling bar length.

Returns:

None

make_grid(fig=None)

Get a SubplotSpec for each Axes, accounting for label text width

plot(fig=None)

Draw all parts of the plot onto fig or a new figure

Parameters:

fig : matplotlib.figure.Figure, optional

Defaults to a new figure.

Returns:

subplots : dict of matplotlib.axes.Axes

Keys are ‘matrix’, ‘intersections’, ‘totals’, ‘shading’.

plot_intersections(ax)

Plot bars indicating intersection size

plot_matrix(ax)

Plot the matrix of intersection indicators onto ax

plot_totals(ax)

Plot bars indicating total set size

style_categories(categories, *, bar_facecolor=None, bar_hatch=None, bar_edgecolor=None, bar_linewidth=None, bar_linestyle=None, shading_facecolor=None, shading_edgecolor=None, shading_linewidth=None, shading_linestyle=None)

Updates the style of the categories.

Select a category by name, and style either its total bar or its shading.

New in version 0.9.

Parameters:

categories : str or list[str]: Category names where the changed style applies.
bar_facecolor : str or RGBA matplotlib color tuple, optional: Override the default facecolor in the totals plot.
bar_hatch : str, optional: Set a hatch for the totals plot.
bar_edgecolor : str or matplotlib color, optional: Set the edgecolor for total bars.
bar_linewidth : int, optional: Line width in points for total bar edges.
bar_linestyle : str, optional: Line style for edges.
shading_facecolor : str or RGBA matplotlib color tuple, optional: Override the default alternating shading for specified categories.
shading_edgecolor : str or matplotlib color, optional: Set the edgecolor for bars, dots, and the line between dots.
shading_linewidth : int, optional: Line width in points for edges.
shading_linestyle : str, optional: Line style for edges.

style_subsets(present=None, absent=None, min_subset_size=None, max_subset_size=None, max_subset_rank=None, min_degree=None, max_degree=None, facecolor=None, edgecolor=None, hatch=None, linewidth=None, linestyle=None, label=None)

Updates the style of selected subsets’ bars and matrix dots

Parameters are either used to select subsets, or to style them with attributes of matplotlib.patches.Patch, apart from label, which adds a legend entry.

Parameters:

present : str or list of str, optional: Category or categories that must be present in subsets for styling.
absent : str or list of str, optional: Category or categories that must not be present in subsets for styling.
min_subset_size : int or “number%”, optional: Minimum size of a subset to be styled. This may be specified as a percentage using a string, like “50%”.

Changed in version 0.9: Support percentages
max_subset_size : int or “number%”, optional: Maximum size of a subset to be styled. This may be specified as a percentage using a string, like “50%”.

Changed in version 0.9: Support percentages
max_subset_rank : int, optional: Limit to the top N ranked subsets in descending order of size. All tied subsets are included.

New in version 0.9.
min_degree : int, optional: Minimum degree of a subset to be styled.
max_degree : int, optional: Maximum degree of a subset to be styled.
facecolor : str or matplotlib color, optional: Override the default UpSet facecolor for selected subsets.
edgecolor : str or matplotlib color, optional: Set the edgecolor for bars, dots, and the line between dots.
hatch : str, optional: Set the hatch. This will apply to intersection size bars, but not to matrix dots.
linewidth : int, optional: Line width in points for edges.
linestyle : str, optional: Line style for edges.
label : str, optional: If provided, a legend will be added
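The selection parameters of style_subsets can be read as a predicate over each subset. The sketch below covers only present, absent, min_degree and max_degree, and assumes that all given criteria must hold at once. It is a plain-Python illustration with hypothetical helper names, not the library's selection code.

```python
def _as_set(value):
    """Accept a single category name, a list of names, or None."""
    if value is None:
        return set()
    return {value} if isinstance(value, str) else set(value)


def subset_selected(subset, present=None, absent=None,
                    min_degree=None, max_degree=None):
    """Hypothetical predicate: does this subset match every given criterion?"""
    subset = set(subset)
    if not _as_set(present) <= subset:  # all `present` categories required
        return False
    if _as_set(absent) & subset:        # no `absent` category allowed
        return False
    degree = len(subset)                # number of categories intersected
    if min_degree is not None and degree < min_degree:
        return False
    if max_degree is not None and degree > max_degree:
        return False
    return True


subsets = [{"cat1"}, {"cat2"}, {"cat1", "cat2"}, set()]
only_cat1 = [s for s in subsets if subset_selected(s, present="cat1", absent="cat2")]
pairs = [s for s in subsets if subset_selected(s, min_degree=2)]
print(only_cat1, pairs)
```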

Dataset loading and generation

upsetplot.from_contents(contents, data=None, id_column='id')

Build data from category listings

Parameters:

contents : Mapping (or iterable over pairs) of strings to sets

Keys are category names, values are sets of identifiers (int or string).

data : DataFrame, optional

If provided, this should be indexed by the identifiers used in `contents`.

id_column : str, default=’id’

The column name to use for the identifiers in the output.

Returns:

DataFrame

`data` is returned with its index indicating category membership, including a column named according to id_column. If data is not given, the order of rows is not assured.

Notes

The order of categories in the output DataFrame is determined from contents, which may have non-deterministic iteration order.

Examples

>>> from upsetplot import from_contents
>>> contents = {'cat1': ['a', 'b', 'c'],
...             'cat2': ['b', 'd'],
...             'cat3': ['e']}
>>> from_contents(contents)
                  id
cat1  cat2  cat3
True  False False  a
      True  False  b
      False False  c
False True  False  d
      False True   e
>>> import pandas as pd
>>> contents = {'cat1': [0, 1, 2],
...             'cat2': [1, 3],
...             'cat3': [4]}
>>> data = pd.DataFrame({'favourite': ['green', 'red', 'red',
...                                    'yellow', 'blue']})
>>> from_contents(contents, data=data)
                   id favourite
cat1  cat2  cat3
True  False False   0     green
      True  False   1       red
      False False   2       red
False True  False   3    yellow
      False True    4      blue
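Conceptually, from_contents inverts a category-to-identifiers mapping into per-identifier membership flags. The following plain-Python sketch reproduces the membership pattern of the first example above. It does not build a DataFrame and is not the library's implementation.

```python
contents = {
    "cat1": ["a", "b", "c"],
    "cat2": ["b", "d"],
    "cat3": ["e"],
}

# Category order follows the iteration order of `contents`.
categories = list(contents)

# Collect identifiers in order of first appearance.
ids = []
for members in contents.values():
    for identifier in members:
        if identifier not in ids:
            ids.append(identifier)

# One tuple of booleans per identifier, aligned with `categories`.
membership = {
    identifier: tuple(identifier in contents[cat] for cat in categories)
    for identifier in ids
}

for identifier, flags in membership.items():
    print(identifier, flags)
```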

upsetplot.from_indicators(indicators, data=None)

Load category membership indicated by a boolean indicator matrix

This loader also supports the case where the indicator columns can be derived from data.

New in version 0.6.

Parameters:

indicators : DataFrame-like of booleans, Sequence of str, or callable

Specifies the category indicators (boolean mask arrays) within data, i.e. which records in data belong to which categories.

If a list of strings, these should be column names found in data whose values are boolean mask arrays.

If a DataFrame, its columns should correspond to categories, and its index should be a subset of those in data, values should be True where a data record is in that category, and False or NA otherwise.

If callable, it will be applied to data after the latter is converted to a Series or DataFrame.

data : Series-like or DataFrame-like, optional

If given, the index of category membership is attached to this data. It must have the same length as indicators. If not given, the series will contain the value 1.

Returns:

DataFrame or Series: data is returned with its index indicating category membership. It will be a Series if data is a Series or 1d numeric array or None.

Notes

Categories with indicators that are all False will be removed.

Examples

>>> import pandas as pd
>>> from upsetplot import from_indicators
>>>
>>> # Just indicators:
>>> indicators = {"cat1": [True, False, True, False],
...               "cat2": [False, True, False, False],
...               "cat3": [True, True, False, False]}
>>> from_indicators(indicators)
cat1   cat2   cat3
True   False  True     1.0
False  True   True     1.0
True   False  False    1.0
False  False  False    1.0
Name: ones, dtype: float64
>>>
>>> # Where indicators are included within data, specifying
>>> # columns by name:
>>> data = pd.DataFrame({"value": [5, 4, 6, 4], **indicators})
>>> from_indicators(["cat1", "cat3"], data=data)
             value   cat1   cat2   cat3
cat1  cat3
True  True       5   True  False   True
False True       4  False   True   True
True  False      6   True  False  False
False False      4  False  False  False
>>>
>>> # Making indicators out of all boolean columns:
>>> from_indicators(lambda data: data.select_dtypes(bool), data=data)
                   value   cat1   cat2   cat3
cat1  cat2  cat3
True  False True       5   True  False   True
False True  True       4  False   True   True
True  False False      6   True  False  False
False False False      4  False  False  False
>>>
>>> # Using a dataset with missing data, we can use missingness as
>>> # an indicator:
>>> data = pd.DataFrame({"val1": [pd.NA, .7, pd.NA, .9],
...                      "val2": ["male", pd.NA, "female", "female"],
...                      "val3": [pd.NA, pd.NA, 23000, 78000]})
>>> from_indicators(pd.isna, data=data)
                   val1    val2   val3
val1  val2  val3
True  False True   <NA>    male   <NA>
False True  True    0.7    <NA>   <NA>
True  False False  <NA>  female  23000
False False False   0.9  female  78000

upsetplot.from_memberships(memberships, data=None)

Load data where each sample has a collection of category names

The output should be suitable for passing to UpSet or plot.

Parameters:

memberships : sequence of collections of strings

Each element corresponds to a data point, indicating the sets it is a member of. Each category is named by a string.

data : Series-like or DataFrame-like, optional

If given, the index of category memberships is attached to this data. It must have the same length as `memberships`. If not given, the series will contain the value 1.

Returns:

DataFrame or Series

`data` is returned with its index indicating category membership. It will be a Series if `data` is a Series or 1d numeric array. The index will have levels ordered by category names.

Examples

>>> from upsetplot import from_memberships
>>> from_memberships([
...     ['cat1', 'cat3'],
...     ['cat2', 'cat3'],
...     ['cat1'],
...     []
... ])
cat1   cat2   cat3
True   False  True     1
False  True   True     1
True   False  False    1
False  False  False    1
Name: ones, dtype: ...
>>> # now with data:
>>> import numpy as np
>>> from_memberships([
...     ['cat1', 'cat3'],
...     ['cat2', 'cat3'],
...     ['cat1'],
...     []
... ], data=np.arange(12).reshape(4, 3))
                   0   1   2
cat1  cat2  cat3
True  False True   0   1   2
False True  True   3   4   5
True  False False  6   7   8
False False False  9  10  11
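The index produced in the examples above can be understood as one boolean tuple per sample, with categories ordered by name. A minimal plain-Python sketch of that mapping follows. It does not build a pandas index and is not the library's implementation.

```python
memberships = [
    ["cat1", "cat3"],
    ["cat2", "cat3"],
    ["cat1"],
    [],
]

# Levels are ordered by category name.
categories = sorted(set().union(*memberships))

# One row of membership flags per sample, aligned with `categories`.
rows = [tuple(cat in sample for cat in categories) for sample in memberships]

print(categories)
for row in rows:
    print(row)
```

A sample with no categories, like the last one, maps to a row of all False values.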

upsetplot.generate_counts(seed=0, n_samples=10000, n_categories=3)

Generate artificial counts corresponding to set intersections

Parameters:

seed : int

A seed for randomisation.

n_samples : int

Number of samples to generate statistics over.

n_categories : int

Number of categories (named “cat0”, “cat1”, …) to generate.

Returns:

Series

Counts indexed by boolean indicator mask for each category.

Data querying and transformation

upsetplot.query(data, present=None, absent=None, min_subset_size=None, max_subset_size=None, max_subset_rank=None, min_degree=None, max_degree=None, sort_by='degree', sort_categories_by='cardinality', subset_size='auto', sum_over=None, include_empty_subsets=False)

Transform and filter a categorised dataset

Retrieve the set of items and totals corresponding to subsets of interest.

Parameters:

data : pandas.Series or pandas.DataFrame

Elements associated with categories (a DataFrame), or the size of each subset of categories (a Series). Should have a MultiIndex where each level is binary, corresponding to category membership. If a DataFrame, sum_over must be a string or False.

present : str or list of str, optional

Category or categories that must be present in subsets to be reported.

absent : str or list of str, optional

Category or categories that must not be present in subsets to be reported.

min_subset_size : int or “number%”, optional

Minimum size of a subset to be reported. All subsets with a size smaller than this threshold will be omitted from category_totals and data. This may be specified as a percentage using a string, like “50%”. Size may be a sum of values, see subset_size.

Changed in version 0.9: Support percentages

max_subset_size : int or “number%”, optional

Maximum size of a subset to be reported.

Changed in version 0.9: Support percentages

max_subset_rank : int, optional

Limit to the top N ranked subsets in descending order of size. All tied subsets are included.

New in version 0.9.

min_degree : int, optional

Minimum degree of a subset to be reported.

max_degree : int, optional

Maximum degree of a subset to be reported.

sort_by : {‘cardinality’, ‘degree’, ‘-cardinality’, ‘-degree’, ‘input’, ‘-input’}

If ‘cardinality’, subsets are listed from largest to smallest. If ‘degree’, they are listed in order of the number of categories intersected. If ‘input’, the order in which they appear in the input data is used. Prefix with ‘-’ to reverse the ordering.

Note this affects subset_sizes but not data.

sort_categories_by : {‘cardinality’, ‘-cardinality’, ‘input’, ‘-input’}

Whether to sort the categories by total cardinality, or leave them in the input data’s provided order (order of index levels). Prefix with ‘-’ to reverse the ordering.

subset_size : {‘auto’, ‘count’, ‘sum’}

Configures how to calculate the size of a subset. Choices are:

‘auto’ (default): If data is a DataFrame, count the number of rows in each group, unless sum_over is specified. If data is a Series with at most one row for each group, use the value of the Series. If data is a Series with more than one row per group, raise a ValueError.
‘count’: Count the number of rows in each group.
‘sum’: Sum the value of the data Series, or the DataFrame field specified by sum_over.

sum_over : str or None

If subset_size='sum' or 'auto', then the intersection size is the sum of the specified field in the data DataFrame. If data is a Series, only None is supported and the values of the Series are summed.

include_empty_subsets : bool (default=False)

If True, all possible category combinations will be returned in subset_sizes, even when some are not present in data.

Returns:

QueryResult: Including filtered data, filtered and sorted subset_sizes and overall category_totals and total.

Examples

>>> from upsetplot import query, generate_samples
>>> data = generate_samples(n_samples=20)
>>> result = query(data, present="cat1", max_subset_size=4)
>>> result.category_totals
cat1    14
cat2     4
cat0     0
dtype: int64
>>> result.subset_sizes
cat1  cat2  cat0
True  True  False    3
Name: size, dtype: int64
>>> result.data
                 index     value
cat1 cat2 cat0
True True False      0  2.04...
          False      2  2.05...
          False     10  2.55...
>>>
>>> # Sorting:
>>> query(data, min_degree=1, sort_by="degree").subset_sizes
cat1   cat2   cat0
True   False  False    11
False  True   False     1
True   True   False     3
Name: size, dtype: int64
>>> query(data, min_degree=1, sort_by="cardinality").subset_sizes
cat1   cat2   cat0
True   False  False    11
       True   False     3
False  True   False     1
Name: size, dtype: int64
>>>
>>> # Getting each subset's data
>>> result = query(data)
>>> result.subsets[frozenset({"cat1", "cat2"})]
            index     value
cat1  cat2 cat0
False True False      3  1.333795
>>> result.subsets[frozenset({"cat1"})]
                    index     value
cat1  cat2  cat0
False False False      5  0.918174
            False      8  1.948521
            False      9  1.086599
            False     13  1.105696
            False     19  1.339895