Lab session 7

Data visualization in Python

The matplotlib library

Matplotlib is a Python library for visualizing data with two-dimensional and three-dimensional graphics.

Matplotlib was modeled on MATLAB's plotting interface, but unlike MATLAB it is a more flexible, easily configurable package.

The library supports many kinds of plots and charts:

Line plots
Scatter plots
Bar charts and histograms
Pie charts
Stem plots
Contour plots
Quiver plots (vector fields)
Spectrograms

The library can be installed with the command:

In [ ]:

!pip install matplotlib

Requirement already satisfied: matplotlib in c:\users\medve\anaconda\lib\site-packages (3.5.2)
Requirement already satisfied: python-dateutil>=2.7 in c:\users\medve\anaconda\lib\site-packages (from matplotlib) (2.8.2)
Requirement already satisfied: cycler>=0.10 in c:\users\medve\anaconda\lib\site-packages (from matplotlib) (0.11.0)
Requirement already satisfied: kiwisolver>=1.0.1 in c:\users\medve\anaconda\lib\site-packages (from matplotlib) (1.4.2)
Requirement already satisfied: fonttools>=4.22.0 in c:\users\medve\anaconda\lib\site-packages (from matplotlib) (4.25.0)
Requirement already satisfied: numpy>=1.17 in c:\users\medve\appdata\roaming\python\python39\site-packages (from matplotlib) (1.21.6)
Requirement already satisfied: pillow>=6.2.0 in c:\users\medve\anaconda\lib\site-packages (from matplotlib) (10.0.0)
Requirement already satisfied: pyparsing>=2.2.1 in c:\users\medve\anaconda\lib\site-packages (from matplotlib) (3.0.9)
Requirement already satisfied: packaging>=20.0 in c:\users\medve\anaconda\lib\site-packages (from matplotlib) (23.1)
Requirement already satisfied: six>=1.5 in c:\users\medve\anaconda\lib\site-packages (from python-dateutil>=2.7->matplotlib) (1.16.0)

In [ ]:

import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

In [ ]:

%matplotlib inline

Line plots

In [ ]:

y = [1, 2, 3, 5, 8, 13, 21]

In [ ]:

plt.plot(y)
plt.title('Plot of the dependent variable y')
plt.xlabel('x')
plt.ylabel('y')
plt.show()

Line plots are built with the plot() function:

```
plot([x], y, [fmt], *, data=None, **kwargs)
plot([x], y, [fmt], [x2], y2, [fmt2], …, **kwargs)
```

The arguments of plot() are:

x, x2, …: array - the x-axis (abscissa) data for the first, second, etc. plot.
y, y2, …: array - the y-axis (ordinate) data for the first, second, etc. plot.
fmt: str - the plot format, given as a string of the form '[marker][line][color]'.
**kwargs - properties of the Line2D class, which give access to a large number of settings for the appearance of the plot.

See the documentation for details: https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.plot.html
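The second call form listed above, with several (x, y, fmt) groups in one call, can be sketched as follows. The data is made up for illustration. The sketch uses the Axes method ax.plot rather than plt.plot, which accepts the same arguments.

```python
import matplotlib.pyplot as plt

# Illustrative data (not from the lab): two series over the same x values
x = [0, 1, 2, 3, 4]
y_a = [0, 1, 4, 9, 16]
y_b = [16, 9, 4, 1, 0]

fig, ax = plt.subplots(figsize=(6, 4))

# One call, two (x, y, fmt) groups: plot() returns one Line2D per group
lines = ax.plot(x, y_a, 'o-r', x, y_b, 's--b')

# Each fmt string is parsed into ordinary Line2D properties
first, second = lines
print(first.get_marker(), first.get_linestyle(), first.get_color())
print(second.get_marker(), second.get_linestyle(), second.get_color())
```

Each returned Line2D object can be adjusted later, for example with first.set_linewidth(3).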

In [ ]:

x = [1, 5, 10, 15, 20]
y1 = [1, 7, 3, 5, 11]
y2 = [4, 3, 1, 8, 12]
y3 = [12, 5, 10, 6, 11]

plt.figure(figsize=(12, 7))
plt.plot(x, y1, 'X-y', alpha=0.4, label="first", lw=6, mec='b', mew=2, ms=10)
plt.plot(x, y2, 'v-.g', label="second", mec='r', lw=2, mew=2, ms=12)
plt.plot(x, y3, 'o--g', label="third", mec='g', lw=2, mew=2, ms=10)
plt.legend()
plt.grid(True)

In [ ]:

x = [1, 6, 3, 9, 4, 16, 2]
# y's copied from above for easier reading
y = [1, 2, 3, 5, 8, 13, 21]

In [ ]:

plt.plot(x, y);

In [ ]:

plt.plot(x, y, 'go--');

In [ ]:

plt.plot(x, y, 'rx--');

Another way to specify exactly the same plot:

In [ ]:

plt.plot(x, y, color='red', marker='x', linestyle='dashed');
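As a rough check that the two spellings really describe the same line, the properties of both Line2D objects can be compared. This is only a sketch on made-up data. Colors are compared through matplotlib.colors.to_rgba, because 'r' and 'red' are stored as different strings.

```python
import matplotlib.pyplot as plt
from matplotlib.colors import to_rgba

# Made-up data, only to have something to draw
x = [1, 2, 3]
y = [2, 4, 8]

fig, ax = plt.subplots()
(line_fmt,) = ax.plot(x, y, 'rx--')
(line_kw,) = ax.plot(x, y, color='red', marker='x', linestyle='dashed')

# Compare the resolved properties of the two lines
same_color = to_rgba(line_fmt.get_color()) == to_rgba(line_kw.get_color())
same_marker = line_fmt.get_marker() == line_kw.get_marker()
same_style = line_fmt.get_linestyle() == line_kw.get_linestyle()
print(same_color, same_marker, same_style)
```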

In [ ]:

import numpy as np

# y = x^2 in the range of -10 to 10
x = np.linspace(-10, 10, 100)
y = x**2

plt.plot(x, y, color='#c26603', marker='*', markersize=16, markeredgecolor='green',
         markerfacecolor='#3333fc', linewidth=1, linestyle=':', alpha=0.8, markevery=10);

Scatter plots

In [ ]:

x = np.array([5,7,8,7,2,17,2,9,4,11,12,9,6])
y = np.array([99,86,87,88,111,86,103,87,94,78,77,85,86])

plt.scatter(x, y)
plt.show()

Now let's show two groups of points on the same plot in different colors.

In [ ]:

x = np.array([5,7,8,7,2,17,2,9,4,11,12,9,6])
y = np.array([99,86,87,88,111,86,103,87,94,78,77,85,86])
plt.scatter(x, y)

x = np.array([2,2,8,1,15,8,12,9,7,3,11,4,7,14,12])
y = np.array([100,105,84,105,90,99,90,95,94,100,79,112,91,80,85])
plt.scatter(x, y)

plt.show()
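When two groups share one plot, a legend helps tell them apart. A minimal sketch, assuming hypothetical group names ('group A', 'group B') and a shortened, illustrative data set:

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical measurements for two groups (values are illustrative only)
x_a = np.array([5, 7, 8, 7, 2])
y_a = np.array([99, 86, 87, 88, 111])
x_b = np.array([2, 8, 1, 15, 12])
y_b = np.array([100, 84, 105, 90, 90])

fig, ax = plt.subplots()
# label= gives each group its legend entry; marker= adds a second visual cue
ax.scatter(x_a, y_a, color='tab:blue', label='group A')
ax.scatter(x_b, y_b, color='tab:orange', marker='^', label='group B')
ax.set_xlabel('x')
ax.set_ylabel('y')
legend = ax.legend()
```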

Marker size can vary with a value: the s argument accepts an array with one size per point.

In [ ]:

price = np.asarray([2.50, 1.23, 4.02, 3.25, 5.00, 4.40])
sales_per_day = np.asarray([34, 62, 49, 22, 13, 19])
profit_margin = np.asarray([20, 35, 40, 20, 27.5, 15])

plt.scatter(x=price, y=sales_per_day, s=profit_margin * 10)
plt.show()

The points can also be split into several groups, each shown in its own color.

In [ ]:

low = (0, 1, 0)
medium = (1, 1, 0)
high = (1, 0, 0)

sugar_content = [low, high, medium, medium, high, low]

plt.scatter(
    x=price,
    y=sales_per_day,
    s=profit_margin * 10,
    c=sugar_content,
)
plt.show()
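Instead of explicit RGB tuples, the c argument can also take plain numbers that are mapped to colors through a colormap. The sketch below assumes a hypothetical numeric column sugar_grams, which is not part of the original data, and adds a colorbar as a key.

```python
import numpy as np
import matplotlib.pyplot as plt

price = np.asarray([2.50, 1.23, 4.02, 3.25, 5.00, 4.40])
sales_per_day = np.asarray([34, 62, 49, 22, 13, 19])
# Hypothetical sugar content in grams, one value per product
sugar_grams = np.asarray([5, 30, 18, 16, 28, 4])

fig, ax = plt.subplots()
# Numeric c values are mapped to colors through the chosen colormap
points = ax.scatter(price, sales_per_day, c=sugar_grams, cmap='viridis')
# The colorbar shows which color corresponds to which value
fig.colorbar(points, ax=ax, label='sugar (g)')
```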

Pie charts

Pie charts are suited to showing parts of a whole, i.e. what share of the total each category makes up.

In [ ]:

plt.figure(figsize=(6, 6))

plt.pie([4, 1, 6, 9], explode=[0, 0.2, 0, 0], labels=["Cherry Pie", "Apple Pie", "Blueberry Pie", "Rhubarb Pie"],
        colors=["red", "green", "blue", "orange"], startangle=45, textprops={'fontsize': 20});
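Since a pie chart is about shares, it can help to print the percentage on each wedge. A minimal sketch using the autopct argument of pie(), with the same values as in the cell above:

```python
import matplotlib.pyplot as plt

sizes = [4, 1, 6, 9]
labels = ["Cherry Pie", "Apple Pie", "Blueberry Pie", "Rhubarb Pie"]

fig, ax = plt.subplots(figsize=(6, 6))
# With autopct set, pie() returns a third list holding the percentage labels
wedges, texts, autotexts = ax.pie(sizes, labels=labels, autopct='%1.1f%%',
                                  startangle=45)
shares = [t.get_text() for t in autotexts]
print(shares)
```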

Bar charts

Bar charts show the value of each category as the height of a bar.

In [ ]:

data = [23, 45, 56, 78, 213]
plt.bar([1,2,3,4,5], data)
plt.show()

Let's add horizontal grid lines to make the values easier to read.

In [ ]:

data = [23, 45, 56, 78, 213]

plt.bar(range(len(data)), data, color='royalblue', alpha=0.7)
plt.grid(color='#95a5a6', linestyle='--', linewidth=2, axis='y', alpha=0.7)
plt.show()

Several bar series can be drawn side by side as follows:

In [ ]:

data1 = [23, 85, 72, 43, 52]
data2 = [42, 35, 21, 16, 9]
width = 0.3
plt.bar(np.arange(len(data1)), data1, width=width)
plt.bar(np.arange(len(data2)) + width, data2, width=width)
plt.show()
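With grouped bars the default ticks sit under the first series only. One possible way to center a tick under each pair and give it a name is sketched below. The category names 'A' to 'E' and the series labels are hypothetical.

```python
import numpy as np
import matplotlib.pyplot as plt

data1 = [23, 85, 72, 43, 52]
data2 = [42, 35, 21, 16, 9]
# Hypothetical category names for the five groups
categories = ['A', 'B', 'C', 'D', 'E']

width = 0.3
positions = np.arange(len(categories))

fig, ax = plt.subplots()
ax.bar(positions, data1, width=width, label='series 1')
ax.bar(positions + width, data2, width=width, label='series 2')

# Put each tick midway between the two bars of its group
ax.set_xticks(positions + width / 2)
ax.set_xticklabels(categories)
ax.legend()
```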

In [ ]:

data1 = [23, 85, 72, 43, 52]
data2 = [42, 35, 21, 16, 9]
plt.bar(range(len(data1)), data1)
# bottom=data1 starts each bar of the second series on top of the first, stacking them
plt.bar(range(len(data2)), data2, bottom=data1)
plt.show()

Examples of different plot types can be found in the Matplotlib gallery.

In [ ]:

from matplotlib.backends.backend_agg import FigureCanvasAgg as FigureCanvas
from matplotlib.figure import Figure
from matplotlib.lines import Line2D
from IPython.display import display

# Create a new figure
fig = Figure(figsize=(5, 8))

# Attach a canvas to the figure
FigureCanvas(fig)

# Add a subplot (i.e., an Axes object)
ax = fig.add_subplot(1, 1, 1)

# Create some lines and add them to the plot.
# We could do this more easily with pyplot's .plot(),
# but this is an exercise in masochism.

# intercepts and slopes for the three data series
lines = [(2, -0.8), (1, 0.3), (-1, 1.2)]

# Other visual parameters that vary by line
colors = ['navy', 'g', 'k']
markers = ['*', 'o', 'v']
labels = ['blue', 'green', 'black']

# Loop over data series, construct Line2D for each one,
# and add it to the subplot
for i in range(3):
    intercept, slope = lines[i]
    # Create some data
    x = np.random.normal(size=30)
    y = x * slope + np.random.normal(intercept, 0.6, size=30)
    # Make the line
    line = Line2D(x, y, marker=markers[i], linestyle='', color=colors[i],
                  markersize=12, fillstyle='none', markeredgewidth=1.5,
                  label=labels[i])
    # Add line to plot
    ax.add_line(line)

# Set a gray background
ax.set_facecolor('#eeeef2')

# Set up grid lines
ax.grid(which='both')
major_ticks = np.arange(-10, 10, 1)
minor_ticks = np.arange(-10, 10, 0.2)
ax.set_xticks(major_ticks)
ax.set_xticks(minor_ticks, minor=True)
ax.set_yticks(major_ticks)
ax.set_yticks(minor_ticks, minor=True)
ax.grid(which='minor', alpha=0.7, color='#ddddff')
ax.grid(which='major', alpha=0.9)

# Pick sane x and y-axis limits
ax.set(xlim=(-3, 3), ylim=(-4, 4))

# title and axis labels
ax.set_title('Distribution of colored shapes', fontsize=22,
             fontname='Arial')
ax.set_xlabel('X', fontsize=18, labelpad=10)
ax.set_ylabel('Y', fontsize=18)

# Add legend in lower right with semi-transparent frame
ax.legend(loc='lower right', fontsize=14, framealpha=0.5, edgecolor='k')

# Show the plot in the notebook
display(fig)

Annotating plots

In [ ]:

x = np.linspace(-10, 10, 100)
y = x**2

# Basic plot...
plt.plot(x, y, color='#c26603', marker='*', markersize=16, markeredgecolor='black',
         markerfacecolor='#3333fc', linewidth=3, linestyle=':', alpha=0.8, markevery=10)

### Customize the plot ###

# Change the aspect
plt.gcf().set_size_inches((8, 4))

# Set title and axis labels
plt.title("Matplotlib impressions", fontsize=20, fontname="Comic Sans MS")
plt.xlabel("Time spent working in matplotlib", fontsize=16, labelpad=15)
plt.ylabel("Desire to work in matplotlib", fontsize=16, labelpad=15)

# Replace tick labels with something more... descriptive
plt.xticks([-9.5, 10], ['newbie', 'master'], fontsize=12)
plt.yticks([18, 110], ['PLZ NO MOAR', 'Mmmm'], rotation=90, ha='right', fontsize=12)

# Hide ticks
plt.gca().tick_params(axis='both', which='both', length=0)

# Annotate the second star
plt.annotate("a lucky blue star!", (x[10] + 0.5, y[10]), xytext=(x[20] + 3, y[20] + 40),
             arrowprops=dict(facecolor='black', width=2), fontsize=14);

In [ ]:

iris = sns.load_dataset('iris')

In [ ]:

iris.head(5)

Out[ ]:

	sepal_length	sepal_width	petal_length	petal_width	species
0	5.1	3.5	1.4	0.2	setosa
1	4.9	3.0	1.4	0.2	setosa
2	4.7	3.2	1.3	0.2	setosa
3	4.6	3.1	1.5	0.2	setosa
4	5.0	3.6	1.4	0.2	setosa

In [ ]:

# Set up the figure and axes: a 4 x 3 grid.
# Axes are independent here; pass sharex=True / sharey=True
# to plt.subplots if the panels should share axis limits.
fig, axes = plt.subplots(4, 3, figsize=(6, 6))

# We'll plot each species of iris in a different column
species = ['setosa', 'versicolor', 'virginica']

# ..and a histogram of each attribute in a separate row
attrs = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']

for i in range(3):
    sp = species[i]
    sp_data = iris.query('species == @sp')
    # Show column names
    axes[0, i].set_title(sp, fontsize=16)
    for j in range(4):
        attr = attrs[j]
        values = sp_data[attr]
        # plot separately on each Axes
        axes[j, i].hist(values)

        # Only plot y-axis label for first column
        if i == 0:
            axes[j, i].set_ylabel(attr, fontsize=12)

# A fairly magical layout manager that tends to clean up
# figures well and prevent overlap between elements.
plt.tight_layout()
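The grid returned by plt.subplots is a NumPy array of Axes indexed as axes[row, column]. The sketch below, on synthetic random data, also turns on sharex and sharey to show how shared limits make panels directly comparable.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# 2 rows x 3 columns; sharex/sharey give every panel the same axis limits
fig, axes = plt.subplots(2, 3, figsize=(6, 4), sharex=True, sharey=True)
print(axes.shape)  # (2, 3)

for row in range(2):
    for col in range(3):
        # Synthetic data, only to fill the panels
        axes[row, col].hist(rng.normal(loc=col, size=200), bins=15)
        axes[row, col].set_title(f'row {row}, col {col}', fontsize=9)

fig.tight_layout()
```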

Pandas

In [ ]:

# Line plot of the numeric iris attributes against the row index
iris.plot();

In [ ]:

# KDE plot of all iris attributes, collapsing over species
iris.plot(kind='kde');

In [ ]:

# Separate boxplot of iris attributes for each species
iris.boxplot(by='species', figsize=(16, 4), layout=(1, 4));

In [ ]:

fig, axes = plt.subplots(1, 4, figsize=(16, 3.5), sharey=False)

subplots = iris.boxplot(by='species', ax=axes, return_type='both', notch=True,
                        bootstrap=10000, patch_artist=True, fontsize=12);

varnames = ['Petal length', 'Petal width', 'Sepal length', 'Sepal width']
ylabels = ["length (cm)", "width (cm)", "length (cm)", "width (cm)"]
colors = ['#ffddff', '#ddffdd', '#ddffff', '#ffdddd']

# Stuff we need to do separately for each subplot
for i, sp in enumerate(subplots):
    # We asked for both the Axes and the boxplot's dict
    ax, box = sp
    # Embiggen title
    ax.set_title(varnames[i], fontsize=16)
    # Only show gridlines along y (turn the x gridlines off explicitly)
    ax.grid(False, axis='x')
    # y-axis label
    ax.set_ylabel(ylabels[i], fontsize=16)
    # background
    ax.set_facecolor('#eeeef2')
    # Hide xlabel
    ax.set_xlabel('')

    # Set all boxes in the current subplot to the same color
    for patch in box['boxes']:
        patch.set(facecolor=colors[i], edgecolor='k', linewidth=1)
    for whisk in box['whiskers']:
        whisk.set(color='k')

# Change figure-level options
fig = plt.gcf()
fig.suptitle('Iris attributes by species', size=22, y=1.1, fontname='Arial')

# Fix spacing issues
plt.subplots_adjust(wspace=0.3)

C:\Users\medve\AppData\Local\Temp\ipykernel_15480\592480191.py:3: UserWarning: When passing multiple axes, sharex and sharey are ignored. These settings must be specified when creating axes.
  subplots = iris.boxplot(by='species', ax=axes, return_type='both', notch=True,

Seaborn

See the seaborn documentation and tutorials.

In [ ]:

sns.set_style('darkgrid')

sns.boxplot(data=iris)

Out[ ]:

<AxesSubplot:>

In [ ]:

fig, axes = plt.subplots(1, 4, figsize=(16, 4), sharey=False)

# Explicitly list the variables to map onto subplots
variables = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']

# For each of the variables, use a different subplot
for i, var in enumerate(variables):
    species_data = iris[[var, 'species']]
    sns.boxplot(x='species', y=var, data=species_data, ax=axes[i], palette='Set2',
                notch=True, bootstrap=10000)

In [ ]:

# seaborn expects data in tidy, long format, so we need
# to first "melt" our dataset so that each row is a single
# observation/variable combination.
iris_melted = iris.melt('species')

# Create the FacetGrid, indicating that we want a separate
# column for each variable.
g = sns.FacetGrid(iris_melted, col='variable', sharey=False)

# Apply the boxplot plotting function to each cell of the FacetGrid.
# Here, the first argument gives the plotting function, and subsequent
# arguments are passed through to the plotting function.
g.map_dataframe(sns.boxplot, x='species', y='value', hue='species', notch=True, bootstrap=10000);
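To see what the "melt" step in the cell above does, here is a small sketch on a hypothetical two-row frame (not the real iris data). It converts a wide table into the long format with 'variable' and 'value' columns that the FacetGrid call relies on.

```python
import pandas as pd

# A tiny hypothetical frame in "wide" format: one column per measurement
wide = pd.DataFrame({
    'species': ['setosa', 'virginica'],
    'sepal_length': [5.1, 6.3],
    'petal_length': [1.4, 6.0],
})

# melt() keeps 'species' as an identifier and stacks the other columns
# into (variable, value) pairs: one row per observation/variable combination
long_df = wide.melt('species')
print(long_df)
```

Two rows with two measurements each become four rows, one per measurement.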

In [ ]:

# Note: if working in JupyterLab, replace with %matplotlib widget
# You may have to wrestle with dependencies to get this to work.
%matplotlib notebook

IPython widgets

In [ ]:

x = np.linspace(0, 10, 1000)
y = np.sin(x)
plt.plot(x, y);

In [ ]:

def f(freq=1):
    x = np.linspace(0, 10, 1000)
    y = np.sin(x * freq)
    plt.plot(x, y);

Now all we need to do is bind the freq argument in our function f() to a slider in the Jupyter notebook. The interact function makes this trivial:

In [ ]:

# Revert to ordinary inline plotting;
# notebook mode will interfere.
%matplotlib inline

from ipywidgets import interact

interact(f, freq=(-10, 10));

Data analysis case study

Let's load the planets dataset and see what insights plotting can reveal.

In [ ]:

df = sns.load_dataset('planets')
df.head()

Out[ ]:

	method	number	orbital_period	mass	distance	year
0	Radial Velocity	1	269.300	7.10	77.40	2006
1	Radial Velocity	1	874.774	2.21	56.95	2008
2	Radial Velocity	1	763.000	2.60	19.84	2011
3	Radial Velocity	1	326.030	19.40	110.62	2007
4	Radial Velocity	1	516.220	10.50	119.47	2009

In [ ]:

df.sample(10)

Out[ ]:

	method	number	orbital_period	mass	distance	year
357	Radial Velocity	1	3383.000000	3.1500	51.36	2002
194	Transit	1	2.691548	NaN	322.00	2014
1030	Transit	1	3.941507	NaN	172.00	2006
552	Radial Velocity	1	6.838000	0.9400	47.37	2006
288	Radial Velocity	3	51.284000	0.0498	38.01	2011
674	Transit	1	228.776000	NaN	61.00	2011
394	Radial Velocity	1	361.100000	0.9000	132.80	2011
896	Transit	1	17.833648	NaN	132.00	2013
153	Transit	1	2.899736	NaN	138.00	2007
631	Radial Velocity	1	41.397000	0.2980	11.03	2010

In [ ]:

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1035 entries, 0 to 1034
Data columns (total 6 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   method          1035 non-null   object 
 1   number          1035 non-null   int64  
 2   orbital_period  992 non-null    float64
 3   mass            513 non-null    float64
 4   distance        808 non-null    float64
 5   year            1035 non-null   int64  
dtypes: float64(3), int64(2), object(1)
memory usage: 48.6+ KB

In [ ]:

lst_col = list(df.columns)

In [ ]:

for col in lst_col[1:]:
    # Drop missing values explicitly before binning
    plt.hist(df[col].dropna(), color='royalblue', alpha=0.7)
    plt.title(f'Histogram of {col}')
    plt.grid(color='#95a5a6', linestyle='--', linewidth=2, axis='y', alpha=0.7)
    plt.show()

In [ ]:

df_year = df.groupby('year')['number'].count()
df_year

Out[ ]:

year
1989      1
1992      2
1994      1
1995      1
1996      6
1997      1
1998      5
1999     15
2000     16
2001     12
2002     32
2003     25
2004     26
2005     39
2006     31
2007     53
2008     74
2009     98
2010    102
2011    185
2012    140
2013    118
2014     52
Name: number, dtype: int64

In [ ]:

df_year.isnull().sum()

Out[ ]:

In [ ]:

plt.plot(df_year, 'o--g');
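The grouping step above counts rows per year, not the values in the 'number' column. A sketch on a hypothetical mini-sample with the same two column names makes the difference visible and adds axis labels to the resulting plot.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical mini-sample with the same 'year' and 'number' columns
sample = pd.DataFrame({
    'year': [2009, 2010, 2010, 2011, 2011, 2011],
    'number': [1, 1, 2, 1, 3, 1],
})

# count() tallies rows per year; sum() would add up the 'number' values instead
per_year = sample.groupby('year')['number'].count()
per_year_sum = sample.groupby('year')['number'].sum()

fig, ax = plt.subplots()
ax.plot(per_year.index, per_year.values, 'o--g')
ax.set_xlabel('year')
ax.set_ylabel('rows per year')
```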

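Let's analyze the correlations between the features. To do this, we build a correlation matrix and visualize it with the seaborn library.

In [ ]:

# Keep only numeric columns: 'method' holds strings, which corr() cannot use
sns.heatmap(df.select_dtypes('number').corr(), annot=True);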

The number of digits shown in each cell can be limited with the fmt parameter; '.1g' keeps one significant digit:

In [ ]:

sns.heatmap(df.select_dtypes('number').corr(), annot=True, fmt='.1g')

Out[ ]:

<AxesSubplot:>

In [ ]:

sns.heatmap(df.select_dtypes('number').corr(), annot=True, vmin=-1, vmax=1, center=0);
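The fmt argument used with annot=True in the earlier heatmap is an ordinary Python format specification. Its effect can be tried out without drawing anything; the values below are arbitrary examples.

```python
# Arbitrary example values, formatted with a few common specifications
values = [0.12345, -0.0368, 1.0]

formatted = {spec: [format(v, spec) for v in values]
             for spec in ('.1g', '.2g', '.2f')}

for spec, texts in formatted.items():
    print(spec, texts)
# .1g ['0.1', '-0.04', '1']
# .2g ['0.12', '-0.037', '1']
# .2f ['0.12', '-0.04', '1.00']
```

The 'g' specifications limit significant digits, while 'f' fixes the number of digits after the decimal point.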

Let's change the colors using the cmap argument:

In [ ]:

sns.heatmap(df.select_dtypes('number').corr(), annot=True, vmin=-1, vmax=1, center=0, cmap='coolwarm');

In [ ]:

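# Hide the upper triangle (including the diagonal): cells where mask is True are not drawn
corr = df.select_dtypes('number').corr()
mask = np.triu(np.ones_like(corr, dtype=bool))
sns.heatmap(corr, annot=True, mask=mask);

In [ ]:

# Hide the lower triangle (including the diagonal) instead
corr = df.select_dtypes('number').corr()
mask = np.tril(np.ones_like(corr, dtype=bool))
sns.heatmap(corr, annot=True, mask=mask);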
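How a triangular mask selects cells can be inspected without plotting. The sketch below uses a hypothetical three-column frame and builds a boolean mask from np.ones_like, so that cells are hidden by position rather than by the correlation values themselves.

```python
import numpy as np
import pandas as pd

# Hypothetical data: three numeric columns
data = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0],
    'b': [2.0, 1.0, 4.0, 3.0],
    'c': [4.0, 3.0, 2.0, 1.0],
})
corr = data.corr()

# True marks the cells to hide: the diagonal and everything above it
mask = np.triu(np.ones_like(corr, dtype=bool))
print(mask)

# The same mask applied with pandas: hidden cells become NaN
visible = corr.mask(mask)
print(visible)
```

A mask of this shape can be passed to sns.heatmap through its mask argument.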