from IPython.display import HTML
HTML('''<script>
code_show=true;
function code_toggle() {
if (code_show){
$('div.input').hide();
} else {
$('div.input').show();
}
code_show = !code_show
}
$( document ).ready(code_toggle);
</script>
<form action="javascript:code_toggle()"><input type="submit" value="Click here to toggle on/off the raw code."></form>''')
import numpy as np
import pandas as pd
import seaborn as sns
from collections import Counter
from matplotlib import pyplot as plt
from sklearn.neighbors import KNeighborsClassifier
from sklearn import datasets
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
knn = KNeighborsClassifier(n_neighbors=1)
%matplotlib inline
Source: https://thesoundtrackonline.com/2018/03/10/ranked-all-seven-pokemon-generations/
We want to explore whether the anime and game versions of Pokemon are consistent with each other. Specifically, we seek to explore the following questions using the k-NN classification algorithm:
- Can we classify all Pokemon by their primary type (Type 1) using their basic stats alone?
- Can pairs of Pokemon types be classified accurately based on their stats?
- Can legendary and non-legendary Pokemon be classified based on their stats alone?
In brief, we find that it is difficult to classify ALL Pokemon by type based on their stats alone, that some pairs of Pokemon types can be classified accurately based on their stats, and that legendaries and non-legendaries can be classified based on their stats alone.
The Pokemon data set used in this study was taken from Kaggle. The attributes in this data set were sourced from various websites such as pokemon.com, pokemondb, and bulbapedia. It should be noted that the basic stats (HP, attack, defense, etc.) here are based on the Pokemon games, not the show.
The data set has 800 data points, with the following columns.
- #: ID number of each Pokemon
- Name: name of each Pokemon
- Type 1: each Pokemon has a type that determines their strengths and weaknesses
- Type 2: some Pokemon have two types
- Total: sum of the basic stats
- HP: health of a Pokemon, which determines the damage it can withstand
- Attack: the base modifier for normal attacks (e.g. scratch, punch)
- Defense: the base damage resistance against normal attacks
- Sp. Atk: special attack, the base modifier for special attacks (e.g. fire blast, bubble beam)
- Sp. Def: the base damage resistance against special attacks
- Speed: determines which Pokemon attacks first each round
- Generation: grouping of Pokemon
- Legendary: extremely rare, powerful and mythical Pokemon
Throughout this notebook, we will use the term basic stats to refer collectively to the features HP, Attack, Defense, Sp. Atk, Sp. Def, and Speed.
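For convenience, these six columns can be collected in a single Python list and reused for every feature selection below. This is a small illustrative snippet of our own; the variable name basic_stats is not part of the original data set, but the column names match the CSV.
# The six "basic stats" columns of the Kaggle Pokemon data set,
# collected once so each model below can reuse the same feature list.
basic_stats = ['HP', 'Attack', 'Defense', 'Sp. Atk', 'Sp. Def', 'Speed']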
To compare how consistent these game stats are with the anime version, we referred to this website.
df = pd.read_csv('pokemon_dataset.csv')
df.head()
The target column in this data set is the primary type (Type 1) of the Pokemon; that is, we will attempt to classify the primary type of each Pokemon based only on an analysis of their basic stats. Listed below are the primary types and the corresponding counts.
df['Type 1'].value_counts().reset_index()
First, we implement the k-NN classification algorithm to classify all Pokemon according to their primary type using their basic stats.
df_features = df[['HP', 'Attack', 'Defense', 'Sp. Atk', 'Sp. Def', 'Speed']]
df_target = df['Type 1']

training_accuracy = []
test_accuracy = []
trials = range(50)
for trial in trials:
    (X_train, X_test,
     y_train, y_test) = train_test_split(df_features,
                                         df_target,
                                         test_size=0.25,
                                         random_state=trial)
    # Set n_neighbors from 1 to 39
    neighbors_settings = range(1, 40)
    for n_neighbors in neighbors_settings:
        # Build the model
        clf = KNeighborsClassifier(n_neighbors=n_neighbors)
        clf.fit(X_train, y_train)
        # Record training accuracy for this trial
        training_accuracy.append(clf.score(X_train, y_train))
        # Record generalization (test) accuracy for this trial
        test_accuracy.append(clf.score(X_test, y_test))

# Reshape accuracies so that one row corresponds to one trial
training_accuracy = (np.array(training_accuracy)
                     .reshape(len(trials), len(neighbors_settings)))
# Calculate mean and standard deviation per column
training_err = np.std(training_accuracy, axis=0)
training_accuracy = np.mean(training_accuracy, axis=0)

# Reshape accuracies so that one row corresponds to one trial
test_accuracy = (np.array(test_accuracy)
                 .reshape(len(trials), len(neighbors_settings)))
# Calculate mean and standard deviation per column
test_err = np.std(test_accuracy, axis=0)
test_accuracy = np.mean(test_accuracy, axis=0)

# Graph results
plt.figure(figsize=(10, 5))
plt.errorbar(neighbors_settings, training_accuracy, yerr=training_err,
             label="training accuracy")
plt.errorbar(neighbors_settings, test_accuracy, yerr=test_err,
             label="test accuracy")
plt.title("Classifying Pokemon (ALL types)", fontsize=18)
plt.ylabel("Accuracy")
plt.xlabel("n_neighbors")
plt.xticks(range(1, len(neighbors_settings) + 1, 5))
plt.legend();
In the Pokemon story, it is known that the effect of each Pokemon type's attack and defense varies depending on the type of its opponent (click here for more details). Hence, in this section, some Pokemon types are paired up to check whether they can be accurately classified using the k-NN classification algorithm.
pairs = [['Electric', 'Ground'], ['Fighting', 'Steel'], ['Psychic', 'Fighting']]
fig, axes = plt.subplots(1, 3, figsize=(20, 4))
for pair, ax in zip(pairs, axes):
    df_rows = df.loc[df['Type 1'].isin(pair)]
    df_features = df_rows[['HP', 'Attack', 'Defense', 'Sp. Atk', 'Sp. Def',
                           'Speed']]
    df_target = df_rows['Type 1']

    training_accuracy = []
    test_accuracy = []
    trials = range(10)
    for trial in trials:
        (X_train, X_test,
         y_train, y_test) = train_test_split(df_features,
                                             df_target,
                                             test_size=0.25,
                                             random_state=trial)
        # Set n_neighbors from 1 to 39
        neighbors_settings = range(1, 40)
        for n_neighbors in neighbors_settings:
            # Build the model
            clf = KNeighborsClassifier(n_neighbors=n_neighbors)
            clf.fit(X_train, y_train)
            # Record training accuracy for this trial
            training_accuracy.append(clf.score(X_train, y_train))
            # Record generalization (test) accuracy for this trial
            test_accuracy.append(clf.score(X_test, y_test))

    # Reshape accuracies so that one row corresponds to one trial
    training_accuracy = (np.array(training_accuracy)
                         .reshape(len(trials), len(neighbors_settings)))
    # Calculate mean and standard deviation per column
    training_err = np.std(training_accuracy, axis=0)
    training_accuracy = np.mean(training_accuracy, axis=0)

    # Reshape accuracies so that one row corresponds to one trial
    test_accuracy = (np.array(test_accuracy)
                     .reshape(len(trials), len(neighbors_settings)))
    # Calculate mean and standard deviation per column
    test_err = np.std(test_accuracy, axis=0)
    test_accuracy = np.mean(test_accuracy, axis=0)

    # Graph results
    ax.errorbar(neighbors_settings, training_accuracy, yerr=training_err,
                label="training accuracy")
    ax.errorbar(neighbors_settings, test_accuracy, yerr=test_err,
                label="test accuracy")
    ax.set_ylabel("Accuracy")
    ax.set_xlabel("n_neighbors")
    ax.set_xticks(range(1, len(neighbors_settings) + 1, 10))
    ax.set_title("{} vs {}".format(pair[0], pair[1]), fontsize=18)
    ax.legend()
Maybe you once asked the question, "Are legendary Pokemon really way stronger than non-legendary ones?" This question pushed us to check whether we can accurately distinguish legendary Pokemon just by looking at their basic stats.
df.groupby('Legendary')['Generation'].count()
fig, ax = plt.subplots(1, 2, figsize=(20, 5.8))

# -------------- LEGENDARY VS NON-LEGENDARY (UNEQUAL SIZES) --------------
df_features = df[['HP', 'Attack', 'Defense', 'Sp. Atk', 'Sp. Def',
                  'Speed']]
df_target = df['Legendary']

training_accuracy = []
test_accuracy = []
trials = range(10)
for trial in trials:
    (X_train, X_test,
     y_train, y_test) = train_test_split(df_features,
                                         df_target,
                                         test_size=0.25,
                                         random_state=trial)
    # Set n_neighbors from 1 to 39
    neighbors_settings = range(1, 40)
    for n_neighbors in neighbors_settings:
        # Build the model
        clf = KNeighborsClassifier(n_neighbors=n_neighbors)
        clf.fit(X_train, y_train)
        # Record training accuracy for this trial
        training_accuracy.append(clf.score(X_train, y_train))
        # Record generalization (test) accuracy for this trial
        test_accuracy.append(clf.score(X_test, y_test))

# Reshape accuracies so that one row corresponds to one trial
training_accuracy = (np.array(training_accuracy)
                     .reshape(len(trials), len(neighbors_settings)))
# Calculate mean and standard deviation per column
training_err = np.std(training_accuracy, axis=0)
training_accuracy = np.mean(training_accuracy, axis=0)

# Reshape accuracies so that one row corresponds to one trial
test_accuracy = (np.array(test_accuracy)
                 .reshape(len(trials), len(neighbors_settings)))
# Calculate mean and standard deviation per column
test_err = np.std(test_accuracy, axis=0)
test_accuracy = np.mean(test_accuracy, axis=0)

# Graph results
ax[0].errorbar(neighbors_settings, training_accuracy, yerr=training_err,
               label="training accuracy")
ax[0].errorbar(neighbors_settings, test_accuracy, yerr=test_err,
               label="test accuracy")
ax[0].set_ylabel("Accuracy")
ax[0].set_xlabel("n_neighbors")
ax[0].set_xticks(range(1, len(neighbors_settings) + 1, 10))
ax[0].set_title("Legendary vs Non-legendary (Unequal Sizes)", fontsize=18)
ax[0].legend()

# -------------- LEGENDARY VS NON-LEGENDARY (EQUAL SIZES) --------------
# Down-sample the non-legendaries to match the 65 legendaries
df_rows = df[df['Legendary'] == False].sample(n=65)
df_rows = pd.concat([df_rows, df[df['Legendary'] == True]])
df_features = df_rows[['HP', 'Attack', 'Defense', 'Sp. Atk', 'Sp. Def',
                       'Speed']]
df_target = df_rows['Legendary']

training_accuracy = []
test_accuracy = []
trials = range(10)
for trial in trials:
    (X_train, X_test,
     y_train, y_test) = train_test_split(df_features,
                                         df_target,
                                         test_size=0.25,
                                         random_state=trial)
    # Set n_neighbors from 1 to 39
    neighbors_settings = range(1, 40)
    for n_neighbors in neighbors_settings:
        # Build the model
        clf = KNeighborsClassifier(n_neighbors=n_neighbors)
        clf.fit(X_train, y_train)
        # Record training accuracy for this trial
        training_accuracy.append(clf.score(X_train, y_train))
        # Record generalization (test) accuracy for this trial
        test_accuracy.append(clf.score(X_test, y_test))

# Reshape accuracies so that one row corresponds to one trial
training_accuracy = (np.array(training_accuracy)
                     .reshape(len(trials), len(neighbors_settings)))
# Calculate mean and standard deviation per column
training_err = np.std(training_accuracy, axis=0)
training_accuracy = np.mean(training_accuracy, axis=0)

# Reshape accuracies so that one row corresponds to one trial
test_accuracy = (np.array(test_accuracy)
                 .reshape(len(trials), len(neighbors_settings)))
# Calculate mean and standard deviation per column
test_err = np.std(test_accuracy, axis=0)
test_accuracy = np.mean(test_accuracy, axis=0)

# Graph results
ax[1].errorbar(neighbors_settings, training_accuracy, yerr=training_err,
               label="training accuracy")
ax[1].errorbar(neighbors_settings, test_accuracy, yerr=test_err,
               label="test accuracy")
ax[1].set_ylabel("Accuracy")
ax[1].set_xlabel("n_neighbors")
ax[1].set_xticks(range(1, len(neighbors_settings) + 1, 10))
ax[1].set_title("Legendary vs Non-legendary (Equal Sizes)", fontsize=18)
ax[1].legend()
The result of the k-NN classification algorithm in 4.1, which attempted to classify all Pokemon according to their primary type, shows an accuracy of only around 20-25%. This suggests that the primary type of a Pokemon cannot be determined from its basic stats alone. One possible reason for this is the dispersion of the basic stats within the same type: the stats of a given Pokemon may not be distinguishable from those of Pokemon of other types when all Pokemon are considered at once. One can argue that within the same type there are strong Pokemon and weak ones, like Charizard and Charmander, respectively.
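As a quick illustration of this dispersion, one could inspect the data directly. This is a sketch of our own using the df loaded above; it is not part of the original analysis, and the exact figures depend on the data set version.
# Compare two Fire-type Pokemon of very different strength,
# then look at the spread of the Total stat within each primary type.
print(df[df['Name'].isin(['Charmander', 'Charizard'])]
      [['Name', 'Type 1', 'Total', 'HP', 'Attack', 'Defense']])
print(df.groupby('Type 1')['Total'].agg(['mean', 'std']))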
In addition, there are too many classes, and one set of basic stats cannot be distinguished from another when all types are compared at once. Some pairs of types, though, are easily differentiated and classified, while other pairs yield low classification accuracy.
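For context, a rough majority-class baseline helps judge the 20-25% figure against the number of classes involved. This is our own addition, not part of the original analysis.
# Number of primary types, and the accuracy one would get by
# always predicting the most common primary type.
print(df['Type 1'].nunique())
print(df['Type 1'].value_counts(normalize=True).max())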
Even if the model has problems classifying multiple types based on basic stats alone, there are pairs of Pokemon types that yield high accuracy using k-NN classification. In other words, if only two Pokemon types are considered, the k-NN classification algorithm can classify them accurately based on their basic stats. This makes sense, as some types of Pokemon are known to be stronger in certain stats.
To expound on this, consider the pairs of Pokemon types used in the k-NN classification implementation in 4.2.
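As a compact cross-check of one such pair, one could use cross_val_score (imported earlier but otherwise unused). This is a sketch of our own; k is fixed at an illustrative value rather than tuned.
# Cross-validated accuracy for Electric vs Ground using basic stats only.
pair_rows = df[df['Type 1'].isin(['Electric', 'Ground'])]
pair_X = pair_rows[['HP', 'Attack', 'Defense', 'Sp. Atk', 'Sp. Def', 'Speed']]
pair_y = pair_rows['Type 1']
scores = cross_val_score(KNeighborsClassifier(n_neighbors=5), pair_X, pair_y, cv=5)
print(scores.mean(), scores.std())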
Note, however, that high accuracy is not achieved for all pairs of Pokemon types. Not all pairs of types that seem highly contrasted in the show are accurately classified by the k-NN classification algorithm. This may suggest that, despite these types being contrasted in the show, the differences in their basic stats may not be that significant.
In classifying the legendary and non-legendary Pokemon, we implemented the k-NN classification algorithm twice with different setups.
It turns out that the accuracy is relatively high in both setups. This result suggests that the basic stats of legendary Pokemon are distinguishable from those of non-legendary Pokemon. In particular, we may safely assume that legendary Pokemon are significantly superior in basic stats to non-legendary Pokemon.
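To make this claim concrete, one can compare the mean basic stats of the two groups. This is a simple sketch of our own using the same data frame.
# Mean of each basic stat for legendary vs non-legendary Pokemon;
# if the conclusion holds, the legendary row should dominate every column.
print(df.groupby('Legendary')[['HP', 'Attack', 'Defense',
                               'Sp. Atk', 'Sp. Def', 'Speed']].mean())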