API Reference
Plotting
upsetplot.plot(data, fig=None, **kwargs)
Make an UpSet plot of data on fig.
Parameters: - data : pandas.Series or pandas.DataFrame
Values for each set to plot. Should have a MultiIndex where each level is binary, corresponding to set membership. If a DataFrame, sum_over must be a string or False.
- fig : matplotlib.figure.Figure, optional
Defaults to a new figure.
- kwargs
Other arguments for UpSet.
Returns: - subplots : dict of matplotlib.axes.Axes
Keys are ‘matrix’, ‘intersections’, ‘totals’, ‘shading’
class upsetplot.UpSet(data, orientation='horizontal', sort_by='degree', sort_categories_by='cardinality', subset_size='auto', sum_over=None, min_subset_size=None, max_subset_size=None, max_subset_rank=None, min_degree=None, max_degree=None, facecolor='auto', other_dots_color=0.18, shading_color=0.05, with_lines=True, element_size=32, intersection_plot_elements=6, totals_plot_elements=2, show_counts='', show_percentages=False, include_empty_subsets=False)
Manage the data and drawing for a basic UpSet plot.
Primary public method is plot().
Parameters: - data : pandas.Series or pandas.DataFrame
Elements associated with categories (a DataFrame), or the size of each subset of categories (a Series). Should have a MultiIndex where each level is binary, corresponding to category membership. If a DataFrame, sum_over must be a string or False.
- orientation : {‘horizontal’ (default), ‘vertical’}
If horizontal, intersections are listed from left to right.
- sort_by : {‘cardinality’, ‘degree’, ‘-cardinality’, ‘-degree’, ‘input’, ‘-input’}
If ‘cardinality’, subsets are listed from largest to smallest. If ‘degree’, they are listed in order of the number of categories intersected. If ‘input’, the order in which they appear in the input data is used. Prefix with ‘-’ to reverse the ordering.
Note this affects subset_sizes but not data.
- sort_categories_by : {‘cardinality’, ‘-cardinality’, ‘input’, ‘-input’}
Whether to sort the categories by total cardinality, or leave them in the input data’s provided order (order of index levels). Prefix with ‘-’ to reverse the ordering.
- subset_size : {‘auto’, ‘count’, ‘sum’}
Configures how to calculate the size of a subset. Choices are:
- ‘auto’ (default)
If data is a DataFrame, count the number of rows in each group, unless sum_over is specified. If data is a Series with at most one row for each group, use the value of the Series. If data is a Series with more than one row per group, raise a ValueError.
- ‘count’
Count the number of rows in each group.
- ‘sum’
Sum the value of the data Series, or the DataFrame field specified by sum_over.
- sum_over : str or None
If subset_size='sum' or 'auto', then the intersection size is the sum of the specified field in the data DataFrame. If a Series, only None is supported and its value is summed.
- min_subset_size : int or “number%”, optional
Minimum size of a subset to be shown in the plot. All subsets with a size smaller than this threshold will be omitted from plotting. This may be specified as a percentage using a string, like “50%”. Size may be a sum of values, see subset_size.
New in version 0.5.
Changed in version 0.9: Support percentages
- max_subset_size : int or “number%”, optional
Maximum size of a subset to be shown in the plot. All subsets with a size greater than this threshold will be omitted from plotting. This may be specified as a percentage using a string, like “50%”.
New in version 0.5.
Changed in version 0.9: Support percentages
- max_subset_rank : int, optional
Limit to the top N ranked subsets in descending order of size. All tied subsets are included.
New in version 0.9.
- min_degree : int, optional
Minimum degree of a subset to be shown in the plot.
New in version 0.5.
- max_degree : int, optional
Maximum degree of a subset to be shown in the plot.
New in version 0.5.
- facecolor : ‘auto’ or matplotlib color or float
Color for bar charts and active dots. Defaults to black if axes.facecolor is a light color, otherwise white.
Changed in version 0.6: Before 0.6, the default was ‘black’
- other_dots_color : matplotlib color or float
Color for shading of inactive dots, or opacity (between 0 and 1) applied to facecolor.
New in version 0.6.
- shading_color : matplotlib color or float
Color for shading of odd rows in matrix and totals, or opacity (between 0 and 1) applied to facecolor.
New in version 0.6.
- with_lines : bool
Whether to show lines joining dots in the matrix, to mark multiple categories being intersected.
- element_size : float or None
Side length in pt. If None, size is estimated to fit figure
- intersection_plot_elements : int
The intersections plot should be large enough to fit this many matrix elements. Set to 0 to disable intersection size bars.
Changed in version 0.4: Setting to 0 is handled.
- totals_plot_elements : int
The totals plot should be large enough to fit this many matrix elements. Set to 0 to disable the totals plot.
Changed in version 0.9: Setting to 0 is handled.
- show_counts : bool or str, default=False
Whether to label the intersection size bars with the cardinality of the intersection. When a string, this formats the number. For example, ‘{:d}’ is equivalent to True. Note that, for legacy reasons, if the string does not contain ‘{‘, it will be interpreted as a C-style format string, such as ‘%d’.
- show_percentages : bool or str, default=False
Whether to label the intersection size bars with the percentage of the intersection relative to the total dataset. When a string, this formats the number representing a fraction of samples. For example, ‘{:.1%}’ is the default, formatting .123 as 12.3%. This may be applied with or without show_counts.
New in version 0.4.
- include_empty_subsets : bool (default=False)
If True, all possible category combinations will be shown as subsets, even when some are not present in data.
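The formatting rule described for show_counts and show_percentages can be sketched with plain Python string formatting. The helper format_count below is hypothetical and only mirrors the rule stated above (a string containing ‘{’ is treated as a str.format template, anything else as a C-style format string); it is not the library's own code.

```python
def format_count(count, fmt):
    """Format a count with a str.format-style or a C-style format string."""
    if "{" in fmt:
        return fmt.format(count)
    return fmt % count  # legacy C-style string such as "%d"


print(format_count(42, "{:d}"))     # str.format style
print(format_count(42, "%d"))       # legacy C style
print(format_count(1234, "{:,d}"))  # thousands separator

# show_percentages formats a fraction of the total, not a number out of 100.
print("{:.1%}".format(0.123))
```

Both styles give the same label for a plain integer; the str.format style additionally allows options such as a thousands separator.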
Methods
- add_catplot(kind[, value, elements]): Add a seaborn catplot over subsets when plot() is called.
- add_stacked_bars(by[, sum_over, colors, …]): Add a stacked bar chart over subsets when plot() is called.
- make_grid([fig]): Get a SubplotSpec for each Axes, accounting for label text width.
- plot([fig]): Draw all parts of the plot onto fig or a new figure.
- plot_intersections(ax): Plot bars indicating intersection size.
- plot_matrix(ax): Plot the matrix of intersection indicators onto ax.
- plot_totals(ax): Plot bars indicating total set size.
- style_categories(categories, *[, …]): Updates the style of the categories.
- style_subsets([present, absent, …]): Updates the style of selected subsets’ bars and matrix dots.
- plot_shading
add_catplot(kind, value=None, elements=3, **kw)
Add a seaborn catplot over subsets when plot() is called.
Parameters: - kind : str
One of {“point”, “bar”, “strip”, “swarm”, “box”, “violin”, “boxen”}
- value : str, optional
Column name for the value to plot (i.e. y if orientation=’horizontal’), required if data is a DataFrame.
- elements : int, default=3
Size of the axes counted in number of matrix elements.
- **kw : dict
Additional keywords to pass to seaborn.catplot().
Our implementation automatically determines ‘ax’, ‘data’, ‘x’, ‘y’ and ‘orient’, so these are prohibited keys in kw.
Returns: - None
add_stacked_bars(by, sum_over=None, colors=None, elements=3, title=None)
Add a stacked bar chart over subsets when plot() is called.
Used to plot categorical variable distributions within each subset.
New in version 0.6.
Parameters: - by : str
Column name within the dataframe for color coding the stacked bars, containing discrete or categorical values.
- sum_over : str, optional
Ordinarily the bars will chart the size of each group. sum_over may specify a column which will be summed to determine the size of each bar.
- colors : Mapping, list-like, str or callable, optional
The facecolors to use for bars corresponding to each discrete label, specified as one of:
- Mapping
Maps from label to matplotlib-compatible color specification.
- list-like
A list of matplotlib colors to apply to labels in order.
- str
The name of a matplotlib colormap.
- callable
When called with the number of labels, this should return a list-like of that many colors. Matplotlib colormaps satisfy this callable API.
- None
Uses the matplotlib default colormap.
- elements : int, default=3
Size of the axes counted in number of matrix elements.
- title : str, optional
The axis title labelling bar length.
Returns: - None
plot(fig=None)
Draw all parts of the plot onto fig or a new figure.
Parameters: - fig : matplotlib.figure.Figure, optional
Defaults to a new figure.
Returns: - subplots : dict of matplotlib.axes.Axes
Keys are ‘matrix’, ‘intersections’, ‘totals’, ‘shading’
style_categories(categories, *, bar_facecolor=None, bar_hatch=None, bar_edgecolor=None, bar_linewidth=None, bar_linestyle=None, shading_facecolor=None, shading_edgecolor=None, shading_linewidth=None, shading_linestyle=None)
Updates the style of the categories.
Select a category by name, and style either its total bar or its shading.
New in version 0.9.
Parameters: - categories : str or list[str]
Category names where the changed style applies.
- bar_facecolor : str or RGBA matplotlib color tuple, optional.
Override the default facecolor in the totals plot.
- bar_hatch : str, optional
Set a hatch for the totals plot.
- bar_edgecolor : str or matplotlib color, optional
Set the edgecolor for total bars.
- bar_linewidth : int, optional
Line width in points for total bar edges.
- bar_linestyle : str, optional
Line style for edges.
- shading_facecolor : str or RGBA matplotlib color tuple, optional.
Override the default alternating shading for specified categories.
- shading_edgecolor : str or matplotlib color, optional
Set the edgecolor for bars, dots, and the line between dots.
- shading_linewidth : int, optional
Line width in points for edges.
- shading_linestyle : str, optional
Line style for edges.
style_subsets(present=None, absent=None, min_subset_size=None, max_subset_size=None, max_subset_rank=None, min_degree=None, max_degree=None, facecolor=None, edgecolor=None, hatch=None, linewidth=None, linestyle=None, label=None)
Updates the style of selected subsets’ bars and matrix dots.
Parameters are either used to select subsets, or to style them with attributes of matplotlib.patches.Patch, apart from label, which adds a legend entry.
Parameters: - present : str or list of str, optional
Category or categories that must be present in subsets for styling.
- absent : str or list of str, optional
Category or categories that must not be present in subsets for styling.
- min_subset_size : int or “number%”, optional
Minimum size of a subset to be styled. This may be specified as a percentage using a string, like “50%”.
Changed in version 0.9: Support percentages
- max_subset_size : int or “number%”, optional
Maximum size of a subset to be styled. This may be specified as a percentage using a string, like “50%”.
Changed in version 0.9: Support percentages
- max_subset_rank : int, optional
Limit to the top N ranked subsets in descending order of size. All tied subsets are included.
New in version 0.9.
- min_degree : int, optional
Minimum degree of a subset to be styled.
- max_degree : int, optional
Maximum degree of a subset to be styled.
- facecolor : str or matplotlib color, optional
Override the default UpSet facecolor for selected subsets.
- edgecolor : str or matplotlib color, optional
Set the edgecolor for bars, dots, and the line between dots.
- hatch : str, optional
Set the hatch. This will apply to intersection size bars, but not to matrix dots.
- linewidth : int, optional
Line width in points for edges.
- linestyle : str, optional
Line style for edges.
- label : str, optional
If provided, a legend will be added
Dataset loading and generation
upsetplot.from_contents(contents, data=None, id_column='id')
Build data from category listings.
Parameters: - contents : Mapping (or iterable over pairs) of strings to sets
Keys are category names, values are sets of identifiers (int or string).
- data : DataFrame, optional
If provided, this should be indexed by the identifiers used in contents.
- id_column : str, default=’id’
The column name to use for the identifiers in the output.
Returns: - DataFrame
data is returned with its index indicating category membership, including a column named according to id_column. If data is not given, the order of rows is not assured.
Notes
The order of categories in the output DataFrame is determined from contents, which may have non-deterministic iteration order.
Examples
>>> from upsetplot import from_contents
>>> contents = {'cat1': ['a', 'b', 'c'],
...             'cat2': ['b', 'd'],
...             'cat3': ['e']}
>>> from_contents(contents)
                  id
cat1  cat2  cat3
True  False False  a
      True  False  b
      False False  c
False True  False  d
      False True   e
>>> import pandas as pd
>>> contents = {'cat1': [0, 1, 2],
...             'cat2': [1, 3],
...             'cat3': [4]}
>>> data = pd.DataFrame({'favourite': ['green', 'red', 'red',
...                                    'yellow', 'blue']})
>>> from_contents(contents, data=data)
                   id favourite
cat1  cat2  cat3
True  False False   0     green
      True  False   1       red
      False False   2       red
False True  False   3    yellow
      False True    4      blue
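Conceptually, from_contents turns a category-to-identifiers mapping into per-identifier memberships. The standard-library sketch below shows that inversion for the first example's input; it illustrates the idea only and is not the library's implementation.

```python
from collections import defaultdict

contents = {"cat1": ["a", "b", "c"], "cat2": ["b", "d"], "cat3": ["e"]}

# Invert the mapping: for each identifier, collect its categories.
memberships = defaultdict(set)
for category, identifiers in contents.items():
    for identifier in identifiers:
        memberships[identifier].add(category)

for identifier in sorted(memberships):
    print(identifier, sorted(memberships[identifier]))
```

Identifier b ends up in both cat1 and cat2, which corresponds to the (True, True, False) row in the output shown in the first example.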
upsetplot.from_indicators(indicators, data=None)
Load category membership indicated by a boolean indicator matrix.
This loader also supports the case where the indicator columns can be derived from data.
New in version 0.6.
Parameters: - indicators : DataFrame-like of booleans, Sequence of str, or callable
Specifies the category indicators (boolean mask arrays) within data, i.e. which records in data belong to which categories.
If a list of strings, these should be column names found in data whose values are boolean mask arrays.
If a DataFrame, its columns should correspond to categories, and its index should be a subset of those in data; values should be True where a data record is in that category, and False or NA otherwise.
If callable, it will be applied to data after the latter is converted to a Series or DataFrame.
- data : Series-like or DataFrame-like, optional
If given, the index of category membership is attached to this data. It must have the same length as indicators. If not given, the series will contain the value 1.
Returns: - DataFrame or Series
data is returned with its index indicating category membership. It will be a Series if data is a Series or 1d numeric array or None.
Notes
Categories with indicators that are all False will be removed.
Examples
>>> import pandas as pd
>>> from upsetplot import from_indicators
>>>
>>> # Just indicators:
>>> indicators = {"cat1": [True, False, True, False],
...               "cat2": [False, True, False, False],
...               "cat3": [True, True, False, False]}
>>> from_indicators(indicators)
cat1   cat2   cat3
True   False  True     1.0
False  True   True     1.0
True   False  False    1.0
False  False  False    1.0
Name: ones, dtype: float64
>>>
>>> # Where indicators are included within data, specifying
>>> # columns by name:
>>> data = pd.DataFrame({"value": [5, 4, 6, 4], **indicators})
>>> from_indicators(["cat1", "cat3"], data=data)
             value   cat1   cat2   cat3
cat1  cat3
True  True       5   True  False   True
False True       4  False   True   True
True  False      6   True  False  False
False False      4  False  False  False
>>>
>>> # Making indicators out of all boolean columns:
>>> from_indicators(lambda data: data.select_dtypes(bool), data=data)
                   value   cat1   cat2   cat3
cat1  cat2  cat3
True  False True       5   True  False   True
False True  True       4  False   True   True
True  False False      6   True  False  False
False False False      4  False  False  False
>>>
>>> # Using a dataset with missing data, we can use missingness as
>>> # an indicator:
>>> data = pd.DataFrame({"val1": [pd.NA, .7, pd.NA, .9],
...                      "val2": ["male", pd.NA, "female", "female"],
...                      "val3": [pd.NA, pd.NA, 23000, 78000]})
>>> from_indicators(pd.isna, data=data)
                   val1    val2   val3
val1  val2  val3
True  False True   <NA>    male   <NA>
False True  True    0.7    <NA>   <NA>
True  False False  <NA>  female  23000
False False False   0.9  female  78000
upsetplot.from_memberships(memberships, data=None)
Load data where each sample has a collection of category names.
The output should be suitable for passing to UpSet or plot.
Parameters: - memberships : sequence of collections of strings
Each element corresponds to a data point, indicating the sets it is a member of. Each category is named by a string.
- data : Series-like or DataFrame-like, optional
If given, the index of category memberships is attached to this data. It must have the same length as memberships. If not given, the series will contain the value 1.
Returns: - DataFrame or Series
data is returned with its index indicating category membership. It will be a Series if data is a Series or 1d numeric array. The index will have levels ordered by category names.
Examples
>>> from upsetplot import from_memberships
>>> from_memberships([
...     ['cat1', 'cat3'],
...     ['cat2', 'cat3'],
...     ['cat1'],
...     []
... ])
cat1   cat2   cat3
True   False  True     1
False  True   True     1
True   False  False    1
False  False  False    1
Name: ones, dtype: ...
>>> # now with data:
>>> import numpy as np
>>> from_memberships([
...     ['cat1', 'cat3'],
...     ['cat2', 'cat3'],
...     ['cat1'],
...     []
... ], data=np.arange(12).reshape(4, 3))
                   0   1   2
cat1  cat2  cat3
True  False True   0   1   2
False True  True   3   4   5
True  False False  6   7   8
False False False  9  10  11
upsetplot.generate_counts(seed=0, n_samples=10000, n_categories=3)
Generate artificial counts corresponding to set intersections.
Parameters: - seed : int
A seed for randomisation
- n_samples : int
Number of samples to generate statistics over
- n_categories : int
Number of categories (named “cat0”, “cat1”, …) to generate
Returns: - Series
Counts indexed by boolean indicator mask for each category.
See also
generate_samples : Generates a DataFrame of samples that these counts are derived from.
upsetplot.generate_samples(seed=0, n_samples=10000, n_categories=3)
Generate artificial samples assigned to set intersections.
Parameters: - seed : int
A seed for randomisation
- n_samples : int
Number of samples to generate
- n_categories : int
Number of categories (named “cat0”, “cat1”, …) to generate
Returns: - DataFrame
Field ‘value’ is a weight or score for each element. Field ‘index’ is a unique id for each element. Index includes a boolean indicator mask for each category.
Note: Further fields may be added in future versions.
See also
generate_counts : Generates the counts for each subset of categories corresponding to these samples.
Data querying and transformation
upsetplot.query(data, present=None, absent=None, min_subset_size=None, max_subset_size=None, max_subset_rank=None, min_degree=None, max_degree=None, sort_by='degree', sort_categories_by='cardinality', subset_size='auto', sum_over=None, include_empty_subsets=False)
Transform and filter a categorised dataset.
Retrieve the set of items and totals corresponding to subsets of interest.
Parameters: - data : pandas.Series or pandas.DataFrame
Elements associated with categories (a DataFrame), or the size of each subset of categories (a Series). Should have a MultiIndex where each level is binary, corresponding to category membership. If a DataFrame, sum_over must be a string or False.
- present : str or list of str, optional
Category or categories that must be present in the subsets to be reported.
- absent : str or list of str, optional
Category or categories that must not be present in the subsets to be reported.
- min_subset_size : int or “number%”, optional
Minimum size of a subset to be reported. All subsets with a size smaller than this threshold will be omitted from category_totals and data. This may be specified as a percentage using a string, like “50%”. Size may be a sum of values, see subset_size.
Changed in version 0.9: Support percentages
- max_subset_size : int or “number%”, optional
Maximum size of a subset to be reported.
Changed in version 0.9: Support percentages
- max_subset_rank : int, optional
Limit to the top N ranked subsets in descending order of size. All tied subsets are included.
New in version 0.9.
- min_degree : int, optional
Minimum degree of a subset to be reported.
- max_degree : int, optional
Maximum degree of a subset to be reported.
- sort_by : {‘cardinality’, ‘degree’, ‘-cardinality’, ‘-degree’, ‘input’, ‘-input’}
If ‘cardinality’, subsets are listed from largest to smallest. If ‘degree’, they are listed in order of the number of categories intersected. If ‘input’, the order in which they appear in the input data is used. Prefix with ‘-’ to reverse the ordering.
Note this affects subset_sizes but not data.
- sort_categories_by : {‘cardinality’, ‘-cardinality’, ‘input’, ‘-input’}
Whether to sort the categories by total cardinality, or leave them in the input data’s provided order (order of index levels). Prefix with ‘-’ to reverse the ordering.
- subset_size : {‘auto’, ‘count’, ‘sum’}
Configures how to calculate the size of a subset. Choices are:
- ‘auto’ (default)
If data is a DataFrame, count the number of rows in each group, unless sum_over is specified. If data is a Series with at most one row for each group, use the value of the Series. If data is a Series with more than one row per group, raise a ValueError.
- ‘count’
Count the number of rows in each group.
- ‘sum’
Sum the value of the data Series, or the DataFrame field specified by sum_over.
- sum_over : str or None
If subset_size='sum' or 'auto', then the intersection size is the sum of the specified field in the data DataFrame. If a Series, only None is supported and its value is summed.
- include_empty_subsets : bool (default=False)
If True, all possible category combinations will be returned in subset_sizes, even when some are not present in data.
Returns: - QueryResult
Including filtered data, filtered and sorted subset_sizes and overall category_totals and total.
Examples
>>> from upsetplot import query, generate_samples
>>> data = generate_samples(n_samples=20)
>>> result = query(data, present="cat1", max_subset_size=4)
>>> result.category_totals
cat1    14
cat2     4
cat0     0
dtype: int64
>>> result.subset_sizes
cat1  cat2  cat0
True  True  False    3
Name: size, dtype: int64
>>> result.data
                 index     value
cat1 cat2 cat0
True True False      0   2.04...
          False      2   2.05...
          False     10   2.55...
>>>
>>> # Sorting:
>>> query(data, min_degree=1, sort_by="degree").subset_sizes
cat1   cat2   cat0
True   False  False    11
False  True   False     1
True   True   False     3
Name: size, dtype: int64
>>> query(data, min_degree=1, sort_by="cardinality").subset_sizes
cat1   cat2   cat0
True   False  False    11
       True   False     3
False  True   False     1
Name: size, dtype: int64
>>>
>>> # Getting each subset's data
>>> result = query(data)
>>> result.subsets[frozenset({"cat1", "cat2"})]
                   index     value
cat1  cat2  cat0
False True  False      3  1.333795
>>> result.subsets[frozenset({"cat1"})]
                   index     value
cat1  cat2  cat0
False False False      5  0.918174
            False      8  1.948521
            False      9  1.086599
            False     13  1.105696
            False     19  1.339895
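The notions of degree and cardinality used by sort_by, min_degree and max_degree can be sketched without the library. The dictionary below copies the three subset sizes from the min_degree=1 queries in the examples; the helper degree is hypothetical, and the library's handling of ties may differ from Python's stable sort.

```python
# Subset sizes keyed by membership in (cat1, cat2, cat0).
subset_sizes = {
    (True, False, False): 11,
    (False, True, False): 1,
    (True, True, False): 3,
}


def degree(memberships):
    """Number of categories a subset belongs to."""
    return sum(memberships)


# 'cardinality': largest subsets first.
by_cardinality = sorted(subset_sizes, key=subset_sizes.get, reverse=True)

# 'degree': fewest intersected categories first.
by_degree = sorted(subset_sizes, key=degree)

# A min_degree=2 style filter keeps only true intersections.
at_least_two = [key for key in subset_sizes if degree(key) >= 2]

print(by_cardinality)
print(by_degree)
print(at_least_two)
```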