Self Organizing Map
A self-organizing map (SOM) is an unsupervised clustering technique that is useful for grouping the records of large datasets. Its neurons are arranged on a two-dimensional grid. Like other algorithms such as PCA, it produces a dimensionality reduction. The image below (source: Wikipedia) shows how the algorithm evolves toward its final representation.
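To make the idea of neurons on a 2D grid concrete, here is a minimal NumPy sketch of the classic SOM training loop: pick a sample, find the best matching neuron, and pull it and its grid neighbours toward the sample. It is an illustration under simple assumptions (Gaussian neighbourhood, linear decay), not the MiniSom implementation used later; the function names `train_som` and `winner` are hypothetical.

```python
import numpy as np

def train_som(data, rows=10, cols=10, sigma=1.0, learning_rate=0.5,
              num_iteration=100, seed=0):
    """Train a tiny SOM; returns weights of shape (rows, cols, n_features)."""
    rng = np.random.default_rng(seed)
    n_features = data.shape[1]
    # Initialise every neuron with a randomly picked sample.
    picks = rng.integers(0, len(data), size=rows * cols)
    weights = data[picks].astype(float).reshape(rows, cols, n_features)
    # Grid coordinates of every neuron, used by the neighbourhood function.
    grid_r, grid_c = np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij")
    for t in range(num_iteration):
        x = data[rng.integers(0, len(data))]
        # Best matching unit: the neuron whose weights are closest to x.
        dists = np.linalg.norm(weights - x, axis=2)
        bmu = np.unravel_index(np.argmin(dists), dists.shape)
        # Assumed schedule: learning rate and radius decay linearly.
        decay = 1.0 - t / num_iteration
        lr = learning_rate * decay
        sig = max(sigma * decay, 1e-3)
        # Gaussian neighbourhood centred on the best matching unit.
        grid_d2 = (grid_r - bmu[0]) ** 2 + (grid_c - bmu[1]) ** 2
        h = np.exp(-grid_d2 / (2 * sig ** 2))
        # Move each neuron toward x, weighted by its neighbourhood value.
        weights += lr * h[:, :, None] * (x - weights)
    return weights

def winner(weights, x):
    """Grid coordinates of the neuron closest to sample x."""
    dists = np.linalg.norm(weights - x, axis=2)
    return np.unravel_index(np.argmin(dists), dists.shape)
```

Each sample is then represented by the two grid coordinates returned by `winner`, which is where the dimensionality reduction comes from.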




Below is a Python example that clusters bank customers with similar characteristics.
The data is related to direct marketing campaigns of a Portuguese banking institution.
The marketing campaigns were based on phone calls. Often, more than one contact with the same client was required in order to assess whether the product (a bank term deposit) would or would not be subscribed.
The bank dataset, with all examples from May 2008 to November 2010, has 45211 instances.
The classification goal is to predict whether the client will subscribe to a term deposit (variable y).
The dataset has 16 input attributes plus the output attribute, which is our dependent variable.
The input attributes are our independent variables:
# bank client data:
1 - age (numeric)
2 - job : type of job (categorical: "admin.","unknown","unemployed","management","housemaid","entrepreneur","student", "blue-collar","self-employed","retired","technician","services")
3 - marital : marital status (categorical: "married","divorced","single"; note: "divorced" means divorced or widowed)
4 - education (categorical: "unknown","secondary","primary","tertiary")
5 - default: has credit in default? (binary: "yes","no")
6 - balance: average yearly balance, in euros (numeric)
7 - housing: has housing loan? (binary: "yes","no")
8 - loan: has personal loan? (binary: "yes","no")
# related with the last contact of the current campaign:
9 - contact: contact communication type (categorical: "unknown","telephone","cellular")
10 - day: last contact day of the month (numeric)
11 - month: last contact month of year (categorical: "jan", "feb", "mar", ..., "nov", "dec")
12 - duration: last contact duration, in seconds (numeric)
13 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
14 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric, -1 means client was not previously contacted)
15 - previous: number of contacts performed before this campaign and for this client (numeric)
16 - poutcome: outcome of the previous marketing campaign (categorical: "unknown","other","failure","success")
Output variable (desired target):
17 - y - has the client subscribed a term deposit? (binary: "yes","no")
For the project we use the following file, provided by the UCI Machine Learning Repository:
[Moro et al., 2011] S. Moro, R. Laureano and P. Cortez. Using Data Mining for Bank Direct Marketing: An Application of the CRISP-DM Methodology. In P. Novais et al. (Eds.), Proceedings of the European Simulation and Modelling Conference - ESM'2011, pp. 117-121, Guimarães, Portugal, October, 2011. EUROSIS. Available at: [pdf] http://hdl.handle.net/1822/14838 [bib] http://www3.dsi.uminho.pt/pcortez/bib/2011-esm-1.txt
import numpy as np
import pandas as pd
from minisom import MiniSom
from sklearn.preprocessing import MinMaxScaler
from pylab import bone, pcolor, colorbar, plot, show
Load the DataFrame with the bank's customer data
filename = "bank.csv"
bank = pd.read_csv(filename, sep=',')



Insert an index column into the DataFrame and look at the result after converting the categories to numeric format
numero = range(1,len(bank)+1)
num = pd.Series(numero)
bank.insert(0, 'num' , num)
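The conversion of the categories to numbers is not shown here. One common way to do it, sketched below, is to use pandas category codes. This is an assumption, not necessarily the method behind the author's data; the three-row `sample` DataFrame and the name `encoded` are hypothetical.

```python
import pandas as pd

# Hypothetical three-row sample using column names from the dataset description.
sample = pd.DataFrame({
    "age": [30, 45, 52],
    "marital": ["married", "single", "married"],
    "housing": ["yes", "no", "yes"],
})

encoded = sample.copy()
for col in encoded.columns:
    # Leave numeric columns untouched; encode text columns as integer codes.
    if not pd.api.types.is_numeric_dtype(encoded[col]):
        # Codes follow the sorted order of the labels, e.g. "no" -> 0, "yes" -> 1.
        encoded[col] = encoded[col].astype("category").cat.codes

print(encoded)
```

Integer codes impose an arbitrary order on labels such as job or month, so one-hot encoding may be a better fit for some columns when distances between rows matter, as they do for a SOM.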



Convert the labels to an array and scale the data to the range 0 to 1
# y holds the target labels; bank2 is the DataFrame with numeric features
y = np.array(y)

sc = MinMaxScaler(feature_range=(0, 1))
X = bank2.iloc[:, :].values
X = sc.fit_transform(X)
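As a worked illustration of what scaling to the range 0 to 1 does, the sketch below applies the min-max formula (x - min) / (max - min) column by column in plain NumPy and then inverts it. The helper names and the demo values are hypothetical; the post itself relies on `MinMaxScaler`.

```python
import numpy as np

def min_max_scale(X):
    """Scale each column to [0, 1]; also return what is needed to invert."""
    X = np.asarray(X, dtype=float)
    col_min = X.min(axis=0)
    col_range = X.max(axis=0) - col_min
    # Guard against division by zero for constant columns.
    col_range[col_range == 0] = 1.0
    return (X - col_min) / col_range, col_min, col_range

def min_max_inverse(X_scaled, col_min, col_range):
    """Restore the original units from scaled values."""
    return X_scaled * col_range + col_min

# Hypothetical rows: (age, balance)
X_demo = np.array([[18.0, -100.0],
                   [40.0, 0.0],
                   [62.0, 300.0]])
X_scaled, col_min, col_range = min_max_scale(X_demo)
X_back = min_max_inverse(X_scaled, col_min, col_range)
```

Without this step a column with a large range, such as balance, would dominate the distances the SOM uses to pick a winning neuron.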
The SOM model is then trained to group the data
# 10x10 grid; input_len must equal the number of columns in X
som = MiniSom(x=10, y=10, input_len=9, sigma=1.0, learning_rate=0.5)
som.random_weights_init(X)
som.train_random(data=X, num_iteration=100)
Plot highlighting the groups, used to obtain the ids of the different classes
bone()
pcolor(som.distance_map().T)  # distance map as the background
colorbar()
markers = ['o', 's']
colors = ['r', 'g']

for i, x in enumerate(X):
    w = som.winner(x)  # grid coordinates of the winning neuron
    # Draw the marker at the centre of the winning cell, styled by label
    plot(w[0] + 0.5,
         w[1] + 0.5,
         markers[y[i]],
         markeredgecolor=colors[y[i]],
         markerfacecolor='None',
         markersize=10,
         markeredgewidth=2)
show()



In the plot, the white cells, which have the highest values on the distance map, are taken as the ones with maximum probability. We take the coordinates of these cells and map them back to the samples to find the ids. After concatenating the data, the values are restored to their original scale (the inverse of the scaling).
mappings = som.win_map(X)

cluster = np.concatenate((mappings[(8,2)], mappings[(8,9)]), axis = 0)
cluster = sc.inverse_transform(cluster)
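The coordinates (8,2) and (8,9) above were read off the plot by eye. A possible way to pick such cells programmatically is to threshold the distance map, as in the sketch below. It assumes a map normalised to the range 0 to 1; the function name `cells_above`, the 0.9 threshold, and the 3x3 demo array are hypothetical.

```python
import numpy as np

def cells_above(distance_map, threshold=0.9):
    """Return grid coordinates whose distance-map value exceeds threshold."""
    coords = np.argwhere(np.asarray(distance_map) > threshold)
    return [tuple(int(v) for v in c) for c in coords]

# Hypothetical 3x3 distance map standing in for som.distance_map().
demo_map = np.array([[0.1, 0.2, 0.95],
                     [0.3, 1.0, 0.4],
                     [0.2, 0.1, 0.3]])
white_cells = cells_above(demo_map)
print(white_cells)
```

If the array indices line up with the keys of `som.win_map(X)`, which is worth checking against the plot, the returned tuples could be used to look up the samples mapped to each cell.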
Now we display the ids that belong to the identified cluster
print('Cluster id')
for i in cluster[:, 0]:
    print(int(i))

26, 47, 61, 117, 185, 193, 201, 202, 275, 287, 290, 316, 353, 394, ......., 44429, 44499, 44523, 44706, 44975, 45060