Лекция 14¶

Написаное ниже может содержать ошибки, не стоит верить всему на 100%.

Деревья решений¶

Деревья решений, или решающие деревья, распространенный алгоритм машинного обучения.

Рассмотрим типичную задачу студента во время семестра: пора ли делать задание или же можно подождать.

На картинке выше представлено два почти одинаковых дерева. Отличаются они лишь порядком вопросов.

Какое из деревьев подходит лучше для решения задачи? Попробуем рассмотреть данный вопрос в рамках решения тренировочной задачи.

Описание тренировочной задачи¶

Имеется набор данных об ирисах, содержащий petal length/width (длина/ширина лепестка) и sepal length/width (длина/ширина чашелистка). По этим параметрам ирисы определяются в три разных класса: setosa,vesicilor и virginica.

Загрузим данные из sklearn. Если это не удалось, то обратитесь напрямую к данным по ссылке.

In [45]:

from sklearn.datasets import load_iris

dataset = load_iris(as_frame=True)
df = dataset.frame

print(dataset.DESCR)

df.loc[::30]

.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

:Number of Instances: 150 (50 in each of three classes)
:Number of Attributes: 4 numeric, predictive attributes and the class
:Attribute Information:
    - sepal length in cm
    - sepal width in cm
    - petal length in cm
    - petal width in cm
    - class:
            - Iris-Setosa
            - Iris-Versicolour
            - Iris-Virginica

:Summary Statistics:

============== ==== ==== ======= ===== ====================
                Min  Max   Mean    SD   Class Correlation
============== ==== ==== ======= ===== ====================
sepal length:   4.3  7.9   5.84   0.83    0.7826
sepal width:    2.0  4.4   3.05   0.43   -0.4194
petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)
============== ==== ==== ======= ===== ====================

:Missing Attribute Values: None
:Class Distribution: 33.3% for each of 3 classes.
:Creator: R.A. Fisher
:Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
:Date: July, 1988

The famous Iris database, first used by Sir R.A. Fisher. The dataset is taken
from Fisher's paper. Note that it's the same as in R, but not as in the UCI
Machine Learning Repository, which has two wrong data points.

This is perhaps the best known database to be found in the
pattern recognition literature.  Fisher's paper is a classic in the field and
is referenced frequently to this day.  (See Duda & Hart, for example.)  The
data set contains 3 classes of 50 instances each, where each class refers to a
type of iris plant.  One class is linearly separable from the other 2; the
latter are NOT linearly separable from each other.

.. dropdown:: References

  - Fisher, R.A. "The use of multiple measurements in taxonomic problems"
    Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to
    Mathematical Statistics" (John Wiley, NY, 1950).
  - Duda, R.O., & Hart, P.E. (1973) Pattern Classification and Scene Analysis.
    (Q327.D83) John Wiley & Sons.  ISBN 0-471-22361-1.  See page 218.
  - Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System
    Structure and Classification Rule for Recognition in Partially Exposed
    Environments".  IEEE Transactions on Pattern Analysis and Machine
    Intelligence, Vol. PAMI-2, No. 1, 67-71.
  - Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule".  IEEE Transactions
    on Information Theory, May 1972, 431-433.
  - See also: 1988 MLC Proceedings, 54-64.  Cheeseman et al"s AUTOCLASS II
    conceptual clustering system finds 3 classes in the data.
  - Many, many more ...

Out[45]:

	sepal length (cm)	sepal width (cm)	petal length (cm)	petal width (cm)	target
0	5.1	3.5	1.4	0.2	0
30	4.8	3.1	1.6	0.2	0
60	5.0	2.0	3.5	1.0	1
90	5.5	2.6	4.4	1.2	1
120	6.9	3.2	5.7	2.3	2

Рассмотрим совместное распределение данных.

In [46]:

from seaborn import pairplot

pairplot(df, hue="target");

Разберем принцип построения дерева на основе двух конечных классов. Для этого объединим классы с нулевой и единичной метками.

In [47]:

df.loc[df["target"] == 0] = 1

pairplot(df, hue="target");

Рассмотрим два признака: petal width b sepal width. Построим распределение меток и попробуем добавить разделяющую прямую между двумя классами. Пока что положение прямой поставим наугад.

In [48]:

from matplotlib import pyplot as plt

feature_name0 = "petal width (cm)"
feature_name1 = "sepal width (cm)"
feature0 = df[feature_name0]
feature1 = df[feature_name1]
actual_mask = df["target"] == 2
plt.scatter(feature0[actual_mask], feature1[actual_mask], label="target=1")
plt.scatter(feature0[~actual_mask], feature1[~actual_mask], label="target=2")
plt.axvline(1.7, color="black", linestyle="--", label="decision margin")
plt.xlabel(feature_name0)
plt.ylabel(feature_name1)
plt.legend(loc="lower right")
plt.tight_layout();

Критерий информативности¶

В первом приближении качество модели можно оценить, если посчитатить долю ошибочно определенных объектов: $$ F(X) = \dfrac{1}{|X|} \sum[y_k \neq y_k^{\prime}],$$ где $y_k$ - истинная метка, а $y_k^{\prime}$ - предсказание.

In [49]:

def F(y_actual, y_pred):
    if len(y_actual):
        return sum(y_actual != y_pred) / len(y_actual)
    return 1

Посчитаем зависимость такой ошибки от значения разделяющего критерия.

$$ Q(X) = F(X) - w_l F(X_l) - w_r F(X_r) $$$$ Q(X) = F(X) - \dfrac{|X_l|}{|X|} F(X_l) - \dfrac{|X_r|}{|X|} F(X_r) $$

In [50]:

result = {}
feature_name0 = "petal width (cm)"
feature0 = df[feature_name0]
unique_values = feature0.unique()
for idx, decision_value in enumerate(unique_values):
    target_pred = feature0.apply(lambda x: 2 if x < decision_value else 1)
    mask = feature0 < decision_value
    df_l = df[mask]
    df_r = df[~mask]
    w_l = df_l.shape[0] / df.shape[0]
    w_r = 1.0 - w_l
    result[decision_value] = {
        "Q_F": F(df["target"], 1) - w_l * F(df_l["target"][mask], 1) - w_r * F(df_r["target"][~mask], 2)
    }

Подойдем к вопросу с вероятностной стороны.

Какова вероятность получить тот или иной класс? Давайте посчитаем.

In [51]:

p1 = sum(df["target"] == 1) / df.shape[0]
p2 = sum(df["target"] == 2) / df.shape[0]  # p2 = 1 - p1
print(f"Probability(target=1)={p1:1.3f}", f"Prbability(target=2)={p2:1.3f}", sep="\n")

Probability(target=1)=0.667
Prbability(target=2)=0.333

Посчитаем величину $$ G = \sum p_k (1 - p_k) = p_1 (1 - p_1) + p_2 (1 - p_2). $$ Данную величину называют критерием Джини.

In [52]:

def G(y_actual, y_pred):
    p1 = F(y_actual, y_pred)
    p2 = 1.0 - p1
    return 1 - p1 * p1 - p2 * p2

Посчитаем зависимость такой ошибки от значения разделяющего критерия.

In [53]:

feature_name0 = "petal width (cm)"
feature0 = df[feature_name0]
unique_values = feature0.unique()
for idx, decision_value in enumerate(unique_values):
    target_pred = feature0.apply(lambda x: 2 if x < decision_value else 1)
    mask = feature0 < decision_value
    df_l = df[mask]
    df_r = df[~mask]
    w_l = df_l.shape[0] / df.shape[0]
    w_r = 1.0 - w_l
    result[decision_value].update({
        "Q_G": G(df["target"], 2) - w_l * G(df_l["target"], 2) - w_r * G(df_r["target"], 2)
    })

Посчитаем для полученой системы энтропию Шенона: $$ H(x) = -\sum\limits_{i} p_{i} \log_2 p_i = -p_1\log_2 p_1 - p_2 \log_2 p_2.$$

In [54]:

import numpy as np

def H(y_actual, y_pred):
    p1 = F(y_actual, y_pred)
    p2 = 1.0 - p1
    if p1 == 1.0 or p2 == 1.0:  # 1 * log2(1) + 0 * log2(0) == 0
        return 0
    return - p1 * np.log2(p1) - p2 * np.log2(p2)

У хорошей системы должна быть энтропия равна нулю. Ведь тогда только один из $p_i$ равен 1. Стоит оговориться, что если $p_j = 0$, то такое элемент считаем за 0.

Что же вносит введение энтропии в нашу задачу? Возможность получения функции выбора разделения на основе следующего критерия: $$Q(X) = H(X_m) - \dfrac{|X_l|}{|X_m|} H(X_l) - \dfrac{|X_r|}{|X_m|} H(X_r).$$ Здесь $X_m$ - количество объектов на входе критерия, $X_l/X_r$ - количество объектов в левой/правой ветке решения. Попробуем оценить изменение энтропии для нескольких вариантов расположения решающего критерия.

In [55]:

import pandas as pd

feature_name0 = "petal width (cm)"
feature0 = df[feature_name0]
unique_values = feature0.unique()
for idx, decision_value in enumerate(unique_values):
    mask = feature0 > decision_value
    df_l = df[mask]
    df_r = df[~mask]
    w_l = df_l.shape[0] / df.shape[0]
    w_r = df_r.shape[0] / df.shape[0]

    result[decision_value].update({"Q_H": H(df["target"], 2) - w_l * H(df_l["target"], 2) - w_r * H(df_r["target"], 2)})

result = pd.DataFrame([{"decision value": key, **value} for key, value in result.items()]).sort_values(by="decision value")

result

Out[55]:

	decision value	Q_F	Q_G	Q_H
0	1.0	-0.333333	0.000000	0.300832
5	1.1	0.046667	0.136201	0.323650
7	1.2	0.066667	0.148148	0.364426
3	1.3	0.100000	0.169935	0.492067
1	1.4	0.186667	0.240741	0.530720
2	1.5	0.226667	0.278141	0.630976
4	1.6	0.280000	0.345413	0.676028
8	1.7	0.293333	0.367647	0.679094
6	1.8	0.293333	0.367939	0.470695
10	1.9	0.226667	0.260536	0.381241
13	2.0	0.193333	0.213039	0.286327
11	2.1	0.153333	0.160980	0.201616
12	2.2	0.113333	0.113617	0.162349
15	2.3	0.093333	0.091503	0.065839
14	2.4	0.040000	0.037037	0.032292
9	2.5	0.020000	0.018141	0.000000

In [56]:

from matplotlib import pyplot as plt

criteria = result["decision value"]

plt.scatter(criteria, result["Q_F"], s=25, label="Error")
plt.scatter(criteria, result["Q_G"], s=15, label="Gini")
plt.scatter(criteria, result["Q_H"], s=10, label="Entropy")
plt.legend()

Out[56]:

<matplotlib.legend.Legend at 0x73fbdce0d6a0>

Найдем такое значение критерия, которое минимизирует изменение начальной энтропии после разделения.

In [57]:

best_row0 = result.loc[result["Q_H"].idxmax()]
feature0_best = best_row0["decision value"]

best_row0

Out[57]:

decision value    1.700000
Q_F               0.293333
Q_G               0.367647
Q_H               0.679094
Name: 8, dtype: float64

Есть ли у нас гарантии, что полученое значение дает лучший результат? Стоит проверить изменение энтропии с использованием альтернативного признака.

In [58]:

import pandas as pd

result_alt = []
feature_name1 = "sepal width (cm)"
feature1 = df[feature_name1]
unique_values = feature1.unique()
for idx, decision_value in enumerate(unique_values):
    mask = feature1 > decision_value
    df_l = df[mask]
    df_r = df[~mask]
    w_l = df_l.shape[0] / df.shape[0]
    w_r = df_r.shape[0] / df.shape[0]

    result_alt.append({"decision value": decision_value, "Q_H": H(df["target"], 2) - w_l * H(df_l["target"], 2) - w_r * H(df_r["target"], 2)})

result_alt = pd.DataFrame(result_alt).sort_values(by="decision value").reset_index(drop=["index"])
print(f"Initial entropy: {H(df['target'], 2)}")
result_alt

Initial entropy: 0.9182958340544896

Out[58]:

	decision value	Q_H
0	1.0	0.251629
1	2.0	0.258344
2	2.2	0.230602
3	2.3	0.251170
4	2.4	0.272803
5	2.5	0.203594
6	2.6	0.188195
7	2.7	0.161170
8	2.8	0.104662
9	2.9	0.131267
10	3.0	0.073630
11	3.1	0.062847
12	3.2	0.047629
13	3.3	0.032098
14	3.4	0.032292
15	3.6	0.021394
16	3.8	0.000000

In [59]:

best_row0_alt = result_alt.loc[result_alt["Q_H"].idxmax()]
feature_best_alt = best_row0_alt["decision value"]

print(f"{best_row0['Q_H'] > best_row0_alt['Q_H']}")

best_row0_alt

True

Out[59]:

decision value    2.400000
Q_H               0.272803
Name: 4, dtype: float64

In [60]:

best_row0, best_row0_alt

Out[60]:

(decision value    1.700000
 Q_F               0.293333
 Q_G               0.367647
 Q_H               0.679094
 Name: 8, dtype: float64,
 decision value    2.400000
 Q_H               0.272803
 Name: 4, dtype: float64)

Использование альтернативного признака приводит к большему уменьшению энтропию. Соотвественно, мы выбираем первый признак для разделения (согласно жадному выбору).

In [61]:

from matplotlib import pyplot as plt

feature_name0 = "petal width (cm)"
feature_name1 = "sepal width (cm)"
feature0 = df[feature_name0]
feature1 = df[feature_name1]
actual_mask = df["target"] == 2
plt.scatter(feature0[actual_mask], feature1[actual_mask], label="target=1")
plt.scatter(feature0[~actual_mask], feature1[~actual_mask], label="target=2")
plt.axvline(feature0_best, color="black", linestyle="--", label="decision margin")
plt.xlabel(feature_name0)
plt.ylabel(feature_name1)
plt.legend(loc="lower right")
plt.tight_layout();

Возьмем правую половину и уже внутри него повторим операцию минимизации потери энтропии. Но теперь пойдем не по оптимальному пути, а уже с использованием альтернативного признака.

In [62]:

import pandas as pd
from matplotlib import pyplot as plt

result2 = []
feature_name0 = "petal width (cm)"
feature0 = df[feature_name0]

mask = feature0 > feature0_best
df_selected = df[mask]
feature1_selected = df_selected[feature_name1]
I_selected = H(df_selected["target"], 2)

unique_values = feature1_selected.unique()
for idx, decision_value in enumerate(unique_values):
    mask = feature1_selected > decision_value
    df_l = df_selected[mask]
    df_r = df_selected[~mask]
    w_l = df_l.shape[0] / df_selected.shape[0]
    w_r = df_r.shape[0] / df_selected.shape[0]

    result2.append({"decision value": decision_value, "Q_H": I_selected - w_l * H(df_l["target"], 2) - w_r * H(df_r["target"], 2)})

result2 = pd.DataFrame(result2).sort_values(by="decision value").reset_index(drop=["index"])
result2

Out[62]:

	decision value	Q_H
0	2.5	0.002139
1	2.6	0.002886
2	2.7	0.006065
3	2.8	0.012545
4	2.9	0.014673
5	3.0	0.029971
6	3.1	0.038113
7	3.2	0.006065
8	3.3	0.003651
9	3.4	0.002139
10	3.6	0.001410
11	3.8	0.000000

In [63]:

plt.scatter(result2["decision value"], result2["Q_H"])

Out[63]:

<matplotlib.collections.PathCollection at 0x73fbd99d8890>

In [64]:

best_row1 = result2.loc[result2["Q_H"].idxmax()]
feature1_best = best_row1["decision value"]

best_row1

Out[64]:

decision value    3.100000
Q_H               0.038113
Name: 6, dtype: float64

In [65]:

from matplotlib import pyplot as plt

feature_name0 = "petal width (cm)"
feature_name1 = "sepal width (cm)"
feature0 = df[feature_name0]
feature1 = df[feature_name1]
actual_mask = df["target"] == 2
plt.scatter(feature0[actual_mask], feature1[actual_mask], label="target=1")
plt.scatter(feature0[~actual_mask], feature1[~actual_mask], label="target=2")
plt.axvline(feature0_best, color="black", linestyle="--", label="decision margin 0")
plt.axhline(feature1_best, 0.475, 1, color="black", linestyle="--", label="decision margin 0-1")
plt.xlabel(feature_name0)
plt.ylabel(feature_name1)
plt.legend(loc="lower right")
plt.tight_layout();

Класс в sklearn¶

Воспользуемся уже готовым алгоритмом из библиотеки sklearn. Класс содержит несколько параметров. В частности:

max_depth максимально возможная глубина нашего дерева;
criterion критерий для выбора разбиения;
random_state способ зафиксировать случайность внутри алгоритма.

Более подробно параметры можно изучить в соответствующей документации (eng|ru).

Не будем заморачиваться с разбиением на обучающую и тестовую выборки. Обучим алгоритм сразу на всех объектах.

In [66]:

from sklearn.tree import DecisionTreeClassifier

X = df[[feature_name0, feature_name1]].to_numpy()
y = df["target"].to_numpy()

tree = DecisionTreeClassifier(max_depth=3, criterion="entropy", random_state=1)
tree.fit(X, y)

Out[66]:

DecisionTreeClassifier(criterion='entropy', max_depth=3, random_state=1)

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

Попробуем отобразить дерево несколькими способами.

Первый из: нарисовать граф.

In [67]:

from sklearn.tree import plot_tree

plot_tree(tree, filled=True);

Второй: визуализировать плоскость и границы критериев.

In [68]:

from sklearn.inspection import DecisionBoundaryDisplay
from sklearn.metrics import accuracy_score
from matplotlib import pyplot as plt

disp = DecisionBoundaryDisplay.from_estimator(
    tree, 
    X,
    response_method="predict",
    xlabel=feature_name0,
    ylabel=feature_name1,
    alpha=0.5, 
    cmap=plt.cm.coolwarm
)

accuracy_tree = accuracy_score(y, tree.predict(X))
print(f'Tree Accuracy: {accuracy_tree:.2f}')

disp.ax_.scatter(
    X[:, 0],
    X[:, 1],
    c=y,
    edgecolor="k",
    cmap=plt.cm.coolwarm
)

plt.title(f"Decision surface for tree trained on {feature_name0} and {feature_name1}")
plt.tight_layout()

Tree Accuracy: 0.97

In [69]:

from sklearn.tree import DecisionTreeClassifier

X = df[[feature_name0, feature_name1]].to_numpy()
y = df["target"].to_numpy()

tree = DecisionTreeClassifier(max_depth=10, criterion="entropy", random_state=1)
tree.fit(X, y)

Out[69]:

DecisionTreeClassifier(criterion='entropy', max_depth=10, random_state=1)

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

In [70]:

from sklearn.inspection import DecisionBoundaryDisplay
from sklearn.metrics import accuracy_score
from matplotlib import pyplot as plt

disp = DecisionBoundaryDisplay.from_estimator(
    tree, 
    X,
    response_method="predict",
    xlabel=feature_name0,
    ylabel=feature_name1,
    alpha=0.5, 
    cmap=plt.cm.coolwarm
)

accuracy_tree = accuracy_score(y, tree.predict(X))
print(f'Tree Accuracy: {accuracy_tree:.2f}')

disp.ax_.scatter(
    X[:, 0],
    X[:, 1],
    c=y,
    edgecolor="k",
    cmap=plt.cm.coolwarm
)

plt.title(f"Decision surface for tree trained on {feature_name0} and {feature_name1}")
plt.tight_layout()

Tree Accuracy: 0.98

In [71]:

from sklearn.tree import plot_tree

plot_tree(tree, filled=True);

Вернемся к изначальной задаче, будем предсказывать три финальных класса.

Попробуем переобучить дерево, добавим ему глубины.

In [72]:

from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris

dataset = load_iris(as_frame=True)
df = dataset.frame

X = df[[feature_name0, feature_name1]].to_numpy()
y = df["target"].to_numpy()

tree = DecisionTreeClassifier(max_depth=4, criterion="entropy", random_state=1)
tree.fit(X, y)

Out[72]:

DecisionTreeClassifier(criterion='entropy', max_depth=4, random_state=1)

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

In [73]:

from sklearn.tree import plot_tree

plot_tree(tree, filled=True);

In [74]:

from sklearn.inspection import DecisionBoundaryDisplay
from sklearn.metrics import accuracy_score
from matplotlib import pyplot as plt

disp = DecisionBoundaryDisplay.from_estimator(
    tree,
    X,
    response_method="predict",
    xlabel=feature_name0,
    ylabel=feature_name1,
    alpha=0.5,
    cmap=plt.cm.coolwarm
)

accuracy_tree = accuracy_score(y, tree.predict(X))
print(f'Tree Accuracy: {accuracy_tree:.2f}')


disp.ax_.scatter(
    X[:, 0],
    X[:, 1],
    c=y,
    edgecolor="k",
    cmap=plt.cm.coolwarm
)

plt.title(f"Decision surface for tree trained on {feature_name0} and {feature_name1}")
plt.tight_layout()

Tree Accuracy: 0.97

Деревья можно применить и в задаче регрессии.

In [75]:

from sklearn.tree import DecisionTreeRegressor
from seaborn import load_dataset

df_diamonds = load_dataset("diamonds")

df_diamonds.head()

Out[75]:

	carat	cut	color	clarity	depth	table	price	x	y	z
0	0.23	Ideal	E	SI2	61.5	55.0	326	3.95	3.98	2.43
1	0.21	Premium	E	SI1	59.8	61.0	326	3.89	3.84	2.31
2	0.23	Good	E	VS1	56.9	65.0	327	4.05	4.07	2.31
3	0.29	Premium	I	VS2	62.4	58.0	334	4.20	4.23	2.63
4	0.31	Good	J	SI2	63.3	58.0	335	4.34	4.35	2.75

In [77]:

X_diamonds = df_diamonds[["carat"]].to_numpy()
y_diamonds = df_diamonds["price"].to_numpy()

tree = DecisionTreeRegressor(max_depth=3, random_state=1)

tree.fit(X_diamonds, y_diamonds)

X_copy = X_diamonds.copy()

plt.scatter(X_diamonds[:, 0], y_diamonds)
plt.scatter(X_copy[:, 0], tree.predict(X_copy), color="C1")

Out[77]:

<matplotlib.collections.PathCollection at 0x73fbdf91ac60>

In [78]:

from sklearn.tree import plot_tree

plot_tree(tree, filled=True);

In [79]:

from sklearn.ensemble import RandomForestClassifier

random_forest = RandomForestClassifier(random_state=1, max_depth=3)

print(X.shape)

random_forest.fit(X, y)

(150, 2)

Out[79]:

RandomForestClassifier(max_depth=3, random_state=1)

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

In [83]:

from sklearn.inspection import DecisionBoundaryDisplay
from sklearn.metrics import accuracy_score
from matplotlib import pyplot as plt

disp = DecisionBoundaryDisplay.from_estimator(
    random_forest.estimators_[10],
    X,
    response_method="predict",
    xlabel=feature_name0,
    ylabel=feature_name1,
    alpha=0.5,
    cmap=plt.cm.coolwarm
)

accuracy_bagging = accuracy_score(y, random_forest.predict(X))
print(f'Random Forest Accuracy: {accuracy_bagging:.2f}')

disp.ax_.scatter(
    X[:, 0],
    X[:, 1],
    c=y,
    edgecolor="k",
    cmap=plt.cm.coolwarm
)

plt.title(f"Decision surface for tree trained on {feature_name0} and {feature_name1}")
plt.tight_layout()

Random Forest Accuracy: 0.96

In [86]:

random_forest.estimator_params

Out[86]:

('criterion',
 'max_depth',
 'min_samples_split',
 'min_samples_leaf',
 'min_weight_fraction_leaf',
 'max_features',
 'max_leaf_nodes',
 'min_impurity_decrease',
 'random_state',
 'ccp_alpha',
 'monotonic_cst')

In [87]:

from sklearn.ensemble import RandomForestClassifier

random_forest = RandomForestClassifier(random_state=1, max_depth=10)

random_forest.fit(X, y)

Out[87]:

RandomForestClassifier(max_depth=10, random_state=1)

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

In [88]:

from sklearn.inspection import DecisionBoundaryDisplay
from sklearn.metrics import accuracy_score
from matplotlib import pyplot as plt

disp = DecisionBoundaryDisplay.from_estimator(
    random_forest,
    X,
    response_method="predict",
    xlabel=feature_name0,
    ylabel=feature_name1,
    alpha=0.5,
    cmap=plt.cm.coolwarm
)

accuracy_bagging = accuracy_score(y, random_forest.predict(X))
print(f'Random Forest Accuracy: {accuracy_bagging:.2f}')

disp.ax_.scatter(
    X[:, 0],
    X[:, 1],
    c=y,
    edgecolor="k",
    cmap=plt.cm.coolwarm
)

plt.title(f"Decision surface for tree trained on {feature_name0} and {feature_name1}")
plt.tight_layout()

Random Forest Accuracy: 0.98

In [96]:

from sklearn.ensemble import BaggingClassifier, BaggingRegressor
from sklearn.tree import DecisionTreeClassifier

base_model = DecisionTreeClassifier(max_depth=10)

bagging_model = BaggingClassifier(estimator=base_model, n_estimators=10000, random_state=42)
bagging_model.fit(X, y)

accuracy_bagging = accuracy_score(y, bagging_model.predict(X))
print(f'Bagging Test Accuracy: {accuracy_bagging:.2f}')

Bagging Test Accuracy: 0.98

In [97]:

from sklearn.inspection import DecisionBoundaryDisplay
from sklearn.metrics import accuracy_score
from matplotlib import pyplot as plt

disp = DecisionBoundaryDisplay.from_estimator(
    bagging_model,
    X,
    response_method="predict",
    xlabel=feature_name0,
    ylabel=feature_name1,
    alpha=0.5,
    cmap=plt.cm.coolwarm
)

accuracy_bagging = accuracy_score(y, random_forest.predict(X))
print(f'Random Forest Accuracy: {accuracy_bagging:.2f}')

disp.ax_.scatter(
    X[:, 0],
    X[:, 1],
    c=y,
    edgecolor="k",
    cmap=plt.cm.coolwarm
)

plt.title(f"Decision surface for tree trained on {feature_name0} and {feature_name1}")
plt.tight_layout()

Random Forest Accuracy: 0.98

In [98]:

from sklearn.ensemble import BaggingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

base_model = LogisticRegression(max_iter=100)

bagging_model = BaggingClassifier(estimator=base_model, n_estimators=100, random_state=42)
bagging_model.fit(X, y)

accuracy_bagging = accuracy_score(y, bagging_model.predict(X))
print(f'Bagging Test Accuracy: {accuracy_bagging:.2f}')

Bagging Test Accuracy: 0.96

In [105]:

from sklearn.inspection import DecisionBoundaryDisplay
from sklearn.metrics import accuracy_score
from matplotlib import pyplot as plt

disp = DecisionBoundaryDisplay.from_estimator(
    bagging_model.estimators_[2],
    X,
    response_method="predict",
    xlabel=feature_name0,
    ylabel=feature_name1,
    alpha=0.5,
    cmap=plt.cm.coolwarm
)

accuracy_bagging = accuracy_score(y, random_forest.predict(X))
print(f'Random Forest Accuracy: {accuracy_bagging:.2f}')

disp.ax_.scatter(
    X[:, 0],
    X[:, 1],
    c=y,
    edgecolor="k",
    cmap=plt.cm.coolwarm
)

plt.title(f"Decision surface for tree trained on {feature_name0} and {feature_name1}")
plt.tight_layout()

Random Forest Accuracy: 0.98

In [111]:

from sklearn.ensemble import AdaBoostRegressor
from sklearn.tree import DecisionTreeRegressor

regressor = AdaBoostRegressor(
    DecisionTreeRegressor(max_depth=2), n_estimators=10000,
)

regressor.fit(X_diamonds, y_diamonds)

Out[111]:

AdaBoostRegressor(estimator=DecisionTreeRegressor(max_depth=2),
                  n_estimators=10000)

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

In [112]:

X_diamonds = df_diamonds[["carat"]].to_numpy()
y_diamonds = df_diamonds["price"].to_numpy()

X_copy = X_diamonds.copy()

plt.scatter(X_diamonds[:, 0], y_diamonds)
plt.scatter(X_copy[:, 0], regressor.predict(X_copy), color="C1")

Out[112]:

<matplotlib.collections.PathCollection at 0x73fbe1e60890>

	sepal length (cm)	sepal width (cm)	petal length (cm)	petal width (cm)	target
0	5.1	3.5	1.4	0.2	0
30	4.8	3.1	1.6	0.2	0
60	5.0	2.0	3.5	1.0	1
90	5.5	2.6	4.4	1.2	1
120	6.9	3.2	5.7	2.3	2

	criterion	'entropy'
	splitter	'best'
	max_depth	3
	min_samples_split	2
	min_samples_leaf	1
	min_weight_fraction_leaf	0.0
	max_features	None
	random_state	1
	max_leaf_nodes	None
	min_impurity_decrease	0.0
	class_weight	None
	ccp_alpha	0.0
	monotonic_cst	None

	n_estimators	100
	criterion	'gini'
	max_depth	3
	min_samples_split	2
	min_samples_leaf	1
	min_weight_fraction_leaf	0.0
	max_features	'sqrt'
	max_leaf_nodes	None
	min_impurity_decrease	0.0
	bootstrap	True
	oob_score	False
	n_jobs	None
	random_state	1
	verbose	0
	warm_start	False
	class_weight	None
	ccp_alpha	0.0
	max_samples	None
	monotonic_cst	None

	estimator	DecisionTreeR...r(max_depth=2)
	n_estimators	10000
	learning_rate	1.0
	loss	'linear'
	random_state	None

	criterion	'squared_error'
	splitter	'best'
	max_depth	2
	min_samples_split	2
	min_samples_leaf	1
	min_weight_fraction_leaf	0.0
	max_features	None
	random_state	None
	max_leaf_nodes	None
	min_impurity_decrease	0.0
	ccp_alpha	0.0
	monotonic_cst	None

	sepal length (cm)	sepal width (cm)	petal length (cm)	petal width (cm)	target
0	5.1	3.5	1.4	0.2	0
30	4.8	3.1	1.6	0.2	0
60	5.0	2.0	3.5	1.0	1
90	5.5	2.6	4.4	1.2	1
120	6.9	3.2	5.7	2.3	2

	sepal length (cm)	sepal width (cm)	petal length (cm)	petal width (cm)	target
0	5.1	3.5	1.4	0.2	0
30	4.8	3.1	1.6	0.2	0
60	5.0	2.0	3.5	1.0	1
90	5.5	2.6	4.4	1.2	1
120	6.9	3.2	5.7	2.3	2