TRANSFORMER - MULTI-HEAD ATTENTION - SEQtoSEQ Translation
For Sequence-to-Sequence tasks a new model called the Transformer was introduced in the paper Attention Is All You Need. Here we reproduce it with PyTorch and walk through an example of building a language-translation model: given a sequence of Italian words, it produces the English translation.
The stages are:
Input data creation
Input Embedding # convert the input tokens into vectors
Positional Encoding # information about the position of each token in the sentence
Attention # a function mapping q and a set of k-v pairs, where q stands for query, k for key and v for values
Encoder # 6 identical layers
Decoder # 6 identical layers
Multi-Head Attention
Normalization # make the data more homogeneous to ease the computations
Feed Forward
Output Embedding
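The Attention step listed above can be sketched as a small stand-alone PyTorch function (a minimal version of the scaled dot-product formula softmax(q k^T / sqrt(d_k)) v; the function name and toy tensor shapes are illustrative, not part of the model below):

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    # q, k, v: (batch, seq_len, d_k)
    d_k = q.size(-1)
    # similarity of every query with every key, scaled by sqrt(d_k)
    scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(d_k)
    weights = F.softmax(scores, dim=-1)  # each row sums to 1
    return torch.matmul(weights, v)      # weighted sum of the values

q = torch.randn(1, 3, 4)
out = scaled_dot_product_attention(q, q, q)  # self-attention on a toy tensor
print(out.shape)  # torch.Size([1, 3, 4])
```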

The figures show the sequence of stages. On the left is the Transformer. Note that the data produced by the Encoder is passed as input to the Decoder together with the Decoder's own input; after further layers, the output with the highest probability is obtained.
The figure in the center shows the operations that make up Multi-Head Attention. In the example below, 8 attention heads are created, as reported in the paper cited above.
Finally, on the right are the operations performed in the Scaled Dot-Product Attention step.





[Figure: left, the Transformer architecture; center, the Multi-Head Attention layer; right, Scaled Dot-Product Attention]






Input

To import the data we use the torchtext library, which lets us load the data and build dictionaries listing the words of the two languages. Note that the words are turned into tensors, which are then flipped from horizontal to vertical with the transpose method, e.g. [0, 1] becomes
[0,
1]
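On a toy tensor, the transpose just described looks like this:

```python
import torch

t = torch.tensor([[0, 1]])   # shape (1, 2): one horizontal row
col = t.transpose(0, 1)      # shape (2, 1): one vertical column
print(col.tolist())          # [[0], [1]]
```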
With spacy we import an object for handling the words of the two languages we want to use.
en = spacy.load('en')
it = it_core_news_sm.load()

IT = Field(tokenize=tokenizerIT,init_token = "<sos>", eos_token = "<eos>")
EN = Field(tokenize=tokenizerEN, init_token = "<sos>", eos_token = "<eos>")
To import the data we use torchtext's TabularDataset, which creates blocks of data split into batches. First, however, we split the DataFrame df into one part for the training phase and one part for the test phase of the model.
trainItaEn, testItaEn = train_test_split(df, test_size=0.1) # split the input file into 2 parts

train,test = TabularDataset.splits(path='./', train='train.csv', validation='test.csv', format='csv', fields=[('IT', IT), ('EN', EN)])
train.fields
{'EN': <torchtext.data.field.Field at 0x7f527d131668>,
'IT': <torchtext.data.field.Field at 0x7f5291452048>}
Creating the two vocabularies.
IT.build_vocab(train, test)
EN.build_vocab(train, test)
IT.vocab.itos
this command shows the list that was created, with the initial special symbols followed by all the words appearing in the Italian sentences.
['<unk>','<pad>','<sos>','<eos>','.','Tom','?','è','di','a','non','che', .... ]

IT.vocab.stoi
this command shows the dictionary that was created, with the initial special symbols and the words appearing in the Italian sentences.
defaultdict(<function torchtext.vocab._default_unk_index>)
{'<unk>': 0,'<pad>': 1,'<sos>': 2,'<eos>': 3,'.': 4,'Tom': 5,'?': 6,'è': 7, 'di': 8,'a': 9,'non': 10,'che': 11 .... }
EN.vocab.itos
this command shows the list that was created, with the initial special symbols followed by all the words appearing in the English sentences.
['<unk>', '<pad>', '<sos>', '<eos>', '.', 'I', 'Tom', 'you', 'to', '?', "n't", 'the', ...]
EN.vocab.stoi
this command shows the dictionary that was created, with the initial special symbols and the words appearing in the English sentences.
defaultdict(<function torchtext.vocab._default_unk_index>,

{'<unk>': 0,'<pad>': 1,'<sos>': 2,'<eos>': 3,'.': 4,'I': 5,'Tom': 6,'you': 7,'to': 8,'?': 9, "n't": 10,'the': 11, .... }
Creating an iterator with torchtext's BucketIterator, with a batch size of 32 records.
trainI = BucketIterator(train, batch_size=32,shuffle= True)


Now the complete program, which imports a file made of Italian sentences and their English translations.

!python -m spacy download it_core_news_sm
import torch
import torch.nn as nn
import torch.nn.functional as F
import pandas as pd
import spacy
import torchtext
from torchtext.data import Field, BucketIterator, TabularDataset
from torch.autograd import Variable
import math, copy
import it_core_news_sm
from sklearn.model_selection import train_test_split
import numpy as np
class Embedding(nn.Module):
    def __init__(self, vocab_size, dModel):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, dModel)
    def forward(self, x):
        embedding = self.embedding.to(device)
        return embedding(x)
class PositionalEncoder(nn.Module):
    def __init__(self, dModel, max_seq_len = 180):
        super().__init__()
        self.dModel = dModel
        posEnc = torch.zeros(max_seq_len, dModel, device = device)
        for posit in range(max_seq_len):
            for i in range(0, dModel, 2):
                posEnc[posit, i] = math.sin(posit / (10000 ** ((2 * i)/dModel)))
                posEnc[posit, i + 1] = math.cos(posit / (10000 ** ((2 * (i + 1))/dModel)))
        posEnc = posEnc.unsqueeze(0)
        self.register_buffer('posEnc', posEnc)
    def forward(self, x):
        x = x.to(device)
        x = x * math.sqrt(self.dModel)
        seqLength = x.size(1)
        x = x + Variable(self.posEnc[:,:seqLength], requires_grad=False).to(device)
        return x
class MultiHeadAttention(nn.Module):
    def __init__(self, numHead, dModel, dropout = 0.1):
        super().__init__()
        self.dModel = dModel
        self.d_k = dModel // numHead
        self.h = numHead
        self.q = nn.Linear(dModel, dModel)
        self.v = nn.Linear(dModel, dModel)
        self.k = nn.Linear(dModel, dModel)
        self.dropout = nn.Dropout(dropout)
        self.output = nn.Linear(dModel, dModel)
    def forward(self, q, k, v, mask=None):
        bs = q.size(0)
        # perform the linear projections and split into h heads
        k = self.k(k).view(bs, -1, self.h, self.d_k)
        k = k.to(device)
        q = self.q(q).view(bs, -1, self.h, self.d_k)
        q = q.to(device)
        v = self.v(v).view(bs, -1, self.h, self.d_k)
        v = v.to(device)
        # transpose to get dimensions bs * h * sl * d_k
        k = k.transpose(1,2)
        q = q.transpose(1,2)
        v = v.transpose(1,2)
        scores = attention(q, k, v, self.d_k, mask, self.dropout)
        concat = scores.transpose(1,2).contiguous().view(bs, -1, self.dModel)
        output = self.output(concat)
        return output
class FeedForward(nn.Module):
    def __init__(self, dModel, d_ff=2048, dropout = 0.1):
        super().__init__()
        self.linear1 = nn.Linear(dModel, d_ff)
        self.dropout = nn.Dropout(dropout)
        self.linear2 = nn.Linear(d_ff, dModel)
    def forward(self, x):
        x = self.dropout(F.relu(self.linear1(x)))
        x = self.linear2(x)
        x = x.to(device)
        return x
class Normalisation(nn.Module):
    def __init__(self, dModel, eps = 1e-6):
        super().__init__()
        self.size = dModel
        self.alpha = nn.Parameter(torch.ones(self.size))
        self.bias = nn.Parameter(torch.zeros(self.size))
        self.eps = eps
    def forward(self, x):
        norm = self.alpha * (x - x.mean(dim=-1, keepdim=True)) / (x.std(dim=-1, keepdim=True) + self.eps) + self.bias
        return norm.to(device)
class EncoderLayer(nn.Module):
    def __init__(self, dModel, numHead, dropout = 0.1):
        super().__init__()
        self.norm1 = Normalisation(dModel)
        self.norm2 = Normalisation(dModel)
        self.mhattention = MultiHeadAttention(numHead, dModel)
        self.ff = FeedForward(dModel)
        self.dropout1 = nn.Dropout(dropout)
    def forward(self, x, mask):
        xNorm = self.norm1(x)
        x = x + self.dropout1(self.mhattention(xNorm, xNorm, xNorm, mask))
        xNorm = self.norm2(x)
        x = x + self.dropout1(self.ff(xNorm))
        return x.to(device)
class DecoderLayer(nn.Module):
    def __init__(self, dModel, numHead, dropout=0.1):
        super().__init__()
        self.norm_1 = Normalisation(dModel)
        self.norm_2 = Normalisation(dModel)
        self.norm_3 = Normalisation(dModel)
        self.dropout_1 = nn.Dropout(dropout)
        self.dropout_2 = nn.Dropout(dropout)
        self.dropout_3 = nn.Dropout(dropout)
        self.attn_1 = MultiHeadAttention(numHead, dModel)
        self.attn_2 = MultiHeadAttention(numHead, dModel)
        self.ff = FeedForward(dModel).to(device)
    def forward(self, x, encOut, sourceMask, targetMask):
        x2 = self.norm_1(x)
        x2 = x2.to(device)
        x = x + self.dropout_1(self.attn_1(x2, x2, x2, targetMask))
        x = x.to(device)
        x2 = self.norm_2(x)
        x = x + self.dropout_2(self.attn_2(x2, encOut, encOut, sourceMask))
        x2 = self.norm_3(x)
        x = x + self.dropout_3(self.ff(x2))
        return x
class Encoder(nn.Module):
    def __init__(self, vocab_size, dModel, N, numHead):
        super().__init__()
        self.N = N
        self.embed = Embedding(vocab_size, dModel)
        self.pe = PositionalEncoder(dModel)
        self.layers = cloneModule(EncoderLayer(dModel, numHead), N)
        self.norm = Normalisation(dModel)
    def forward(self, source, mask):
        x = self.embed(source)
        x = self.pe(x)
        for i in range(self.N):
            x = self.layers[i](x, mask)
        return self.norm(x)
class Decoder(nn.Module):
    def __init__(self, vocab_size, dModel, N, numHead):
        super().__init__()
        self.N = N
        self.embed = Embedding(vocab_size, dModel)
        self.pe = PositionalEncoder(dModel)
        self.layers = cloneModule(DecoderLayer(dModel, numHead), N)
        self.norm = Normalisation(dModel)
    def forward(self, trg, encOut, sourceMask, targetMask):
        x = self.embed(trg)
        x = self.pe(x)
        for i in range(self.N):
            x = self.layers[i](x, encOut, sourceMask, targetMask)
        return self.norm(x)
class Transformer(nn.Module):
    def __init__(self, len_src_vocab, len_trg_vocab, dModel, N, numHead):
        super().__init__()
        self.encoder = Encoder(len_src_vocab, dModel, N, numHead).to(device)
        self.decoder = Decoder(len_trg_vocab, dModel, N, numHead).to(device)
        self.out = nn.Linear(dModel, len_trg_vocab)
    def forward(self, source, target, sourceMask, targetMask):
        source = source.to(device)
        target = target.to(device)
        sourceMask = sourceMask.to(device)
        targetMask = targetMask.to(device)
        encOut = self.encoder(source, sourceMask)
        decOutput = self.decoder(target, encOut, sourceMask, targetMask)
        return self.out(decOutput).to(device)
def tokenizerIT(frasi):
    return [token.text for token in it.tokenizer(frasi)]
def tokenizerEN(frasi):
    return [token.text for token in en.tokenizer(frasi)]
def cloneModule(module, N):
    return nn.ModuleList([copy.deepcopy(module) for i in range(N)])
def attention(q, k, v, d_k, mask=None, dropout=None):
    scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        mask = mask.unsqueeze(1).to(device)
        scores = scores.masked_fill(mask == 0, -1e9).to(device)
    scores = F.softmax(scores, dim=-1)
    if dropout is not None:
        scores = dropout(scores)
    return torch.matmul(scores, v).to(device)
def creaMask(src, trg):
    inputSeq = src
    inputPad = IT.vocab.stoi['<pad>']
    # where there is padding, insert 0
    inputMask = (inputSeq != inputPad).unsqueeze(1)
    targetSeq = trg
    targetPad = EN.vocab.stoi['<pad>']
    targetMask = (targetSeq != targetPad).unsqueeze(1)
    targetMask = targetMask.to(device)
    mask = np.triu(np.ones((1, targetSeq.size(1), targetSeq.size(1))), k=1).astype('uint8')
    mask = Variable(torch.from_numpy(mask) == 0).to(device)
    targetMask = targetMask & mask
    return inputMask, targetMask, targetPad
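As a stand-alone check of what creaMask builds, here are the same two masks on toy data (the pad index 1 and the token values are illustrative):

```python
import numpy as np
import torch

pad = 1
src = torch.tensor([[5, 6, 7, pad]])      # one sentence, last position is padding
srcMask = (src != pad).unsqueeze(1)       # True only where a real token is present
print(srcMask.int().tolist())             # [[[1, 1, 1, 0]]]

size = 3
nopeek = np.triu(np.ones((1, size, size)), k=1).astype('uint8')
trgMask = torch.from_numpy(nopeek) == 0   # lower-triangular "no-peek" mask
print(trgMask.int().tolist())             # [[[1, 0, 0], [1, 1, 0], [1, 1, 1]]]
```

Position i of the target can thus attend only to positions up to i, so the decoder cannot look ahead at the words it still has to predict.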
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
filename = "....../ita.txt"
data = pd.read_csv(filename, sep='\t')
eng = data.iloc[:,0]
ita = data.iloc[:,1]
df = pd.DataFrame({'IT':ita,'EN':eng })

with open('ita-eng.txt', 'w') as f:
    f.write(df.to_csv(header = False, index = False))

en = spacy.load('en')

it = it_core_news_sm.load()

IT = Field(tokenize=tokenizerIT, init_token = "<sos>", eos_token = "<eos>")
EN = Field(tokenize=tokenizerEN, init_token = "<sos>", eos_token = "<eos>")
trainItaEn, testItaEn = train_test_split(df, test_size=0.1)

trainItaEn.to_csv("train.csv", index=False)
testItaEn.to_csv("test.csv", index=False)

train,test = TabularDataset.splits(path='./', train='train.csv', validation='test.csv', format='csv',
fields=[('IT', IT), ('EN', EN)])

IT.build_vocab(train, test)
EN.build_vocab(train, test)

trainI = BucketIterator(train, batch_size=32,shuffle= True)
EPOCHS = 4
max_sequence_length = 180
dModel = 512 # input and output dimension
d_ff = 2048 # inner-layer dimension
numHead = 8 # number of heads the input is split into in multi-head attention
N = 6 # number of identical layers in the encoder and in the decoder
len_src_vocab = len(IT.vocab)
len_trg_vocab = len(EN.vocab)
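A quick shape check of the head split performed in MultiHeadAttention.forward with these hyperparameters (the batch size 32 and sequence length 10 here are illustrative):

```python
import torch

dModel, numHead = 512, 8
d_k = dModel // numHead          # 64 dimensions per head
x = torch.randn(32, 10, dModel)  # (batch, seq_len, dModel)
heads = x.view(32, -1, numHead, d_k).transpose(1, 2)
print(d_k, heads.shape)          # 64 torch.Size([32, 8, 10, 64])
```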

model = Transformer(len_src_vocab, len_trg_vocab, dModel, N, numHead).to(device)

for param in model.parameters():
    if param.dim() > 1:
        nn.init.xavier_uniform_(param)
# Glorot initialization: keeps the parameter scale in check so training is not too costly

optim = torch.optim.Adam(model.parameters(), lr=0.005, betas=(0.9, 0.9998), eps=1e-9 )

model.train()

total_loss = 0
check_iter = 100
lossModel = nn.CrossEntropyLoss()
for epoch in range(EPOCHS):
    for i, train in enumerate(trainI):
        source = train.IT.transpose(0,1)
        target = train.EN.transpose(0,1)
        # in the target language the last word is excluded
        targetToDecoder = target[:, :-1]
        targets = target[:, 1:].contiguous().view(-1)
        targets = targets.to(device)
        # create masks so the matrices have matching dimensions
        sourceMask, targetMask, targetPad = creaMask(source, targetToDecoder)
        source = source.to(device)
        targetToDecoder = targetToDecoder.to(device)
        sourceMask = sourceMask.to(device)
        targetMask = targetMask.to(device)
        prediction = model(source, targetToDecoder, sourceMask, targetMask).to(device)
        optim.zero_grad()
        loss = lossModel(prediction.view(-1, prediction.size(-1)), targets)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 0.5)
        optim.step()
        total_loss += loss.item()
        if (i + 1) % check_iter == 0:
            loss_avg = total_loss / check_iter
            print(f"epoch {epoch + 1}, iter = {i + 1}, loss = {loss_avg:.3f}")
            total_loss = 0
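The target shift at the top of the loop can be seen on a toy index sequence (indices 2 and 3 stand for <sos> and <eos>, as in the vocabularies built above; the word indices are illustrative):

```python
import torch

target = torch.tensor([[2, 14, 7, 3]])        # <sos> word1 word2 <eos>
toDecoder = target[:, :-1]                    # decoder input: drop the last token
labels = target[:, 1:].contiguous().view(-1)  # loss targets: drop the first token
print(toDecoder.tolist(), labels.tolist())    # [[2, 14, 7]] [14, 7, 3]
```

At every position the model therefore learns to predict the next token of the target sentence.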
Printing the model shows the structure that was created (layers (1)-(5) are identical to (0) and are abbreviated here):
Transformer(
  (encoder): Encoder(
    (embed): Embedding(
      (embedding): Embedding(30669, 512)
    )
    (pe): PositionalEncoder()
    (layers): ModuleList(
      (0): EncoderLayer(
        (norm1): Normalisation()
        (norm2): Normalisation()
        (mhattention): MultiHeadAttention(
          (q): Linear(in_features=512, out_features=512, bias=True)
          (v): Linear(in_features=512, out_features=512, bias=True)
          (k): Linear(in_features=512, out_features=512, bias=True)
          (dropout): Dropout(p=0.1, inplace=False)
          (output): Linear(in_features=512, out_features=512, bias=True)
        )
        (ff): FeedForward(
          (linear1): Linear(in_features=512, out_features=2048, bias=True)
          (dropout): Dropout(p=0.1, inplace=False)
          (linear2): Linear(in_features=2048, out_features=512, bias=True)
        )
        (dropout1): Dropout(p=0.1, inplace=False)
      )
      (1)-(5): 5 more identical EncoderLayer blocks
    )
    (norm): Normalisation()
  )
  (decoder): Decoder(
    (embed): Embedding(
      (embedding): Embedding(14938, 512)
    )
    (pe): PositionalEncoder()
    (layers): ModuleList(
      (0): DecoderLayer(
        (norm_1): Normalisation()
        (norm_2): Normalisation()
        (norm_3): Normalisation()
        (dropout_1): Dropout(p=0.1, inplace=False)
        (dropout_2): Dropout(p=0.1, inplace=False)
        (dropout_3): Dropout(p=0.1, inplace=False)
        (attn_1): MultiHeadAttention(
          (q): Linear(in_features=512, out_features=512, bias=True)
          (v): Linear(in_features=512, out_features=512, bias=True)
          (k): Linear(in_features=512, out_features=512, bias=True)
          (dropout): Dropout(p=0.1, inplace=False)
          (output): Linear(in_features=512, out_features=512, bias=True)
        )
        (attn_2): MultiHeadAttention(
          (q): Linear(in_features=512, out_features=512, bias=True)
          (v): Linear(in_features=512, out_features=512, bias=True)
          (k): Linear(in_features=512, out_features=512, bias=True)
          (dropout): Dropout(p=0.1, inplace=False)
          (output): Linear(in_features=512, out_features=512, bias=True)
        )
        (ff): FeedForward(
          (linear1): Linear(in_features=512, out_features=2048, bias=True)
          (dropout): Dropout(p=0.1, inplace=False)
          (linear2): Linear(in_features=2048, out_features=512, bias=True)
        )
      )
      (1)-(5): 5 more identical DecoderLayer blocks
    )
    (norm): Normalisation()
  )
  (out): Linear(in_features=512, out_features=14938, bias=True)
)
model.eval()
test_iter = BucketIterator(test, batch_size=32, device=device)
for i, test in enumerate(test_iter):
    source = test.IT.transpose(0,1)

sss = source[1]
src = torch.ones(max_sequence_length).type_as(sss.data)
src2 = [x for x in sss if x > 3]

for i in range(len(src2)):
    src[i] = src2[i]
print(' '.join([IT.vocab.itos[i] for i in src if i > 3]))
input_pad = IT.vocab.stoi["<pad>"]
sourceMask = (src != input_pad).unsqueeze(-2)
encOut = model.encoder(src.unsqueeze(0), sourceMask)
outputs = torch.zeros(max_sequence_length).type_as(src.data)
outputs[0] = EN.vocab.stoi['<sos>']  # start decoding from <sos>
for i in range(1, max_sequence_length):
    targetMask = np.triu(np.ones((1, i, i)), k=1).astype('uint8')
    targetMask = Variable(torch.from_numpy(targetMask) == 0).to(device)
    modelOut = model.out(model.decoder(outputs[:i].unsqueeze(0), encOut, sourceMask, targetMask))
    ris = F.softmax(modelOut, dim=-1)
    _, wordtoidx = ris[:, -1].data.topk(1)
    outputs[i] = wordtoidx[0][0]
    if wordtoidx[0][0] == 3:  # <eos>
        break
print(' '.join([EN.vocab.itos[i] for i in outputs[:i] if i > 3]))



Training output (GPU):

epoch 5, iter = 8900, loss = 0.189
epoch 5, iter = 9000, loss = 0.197
epoch 5, iter = 9100, loss = 0.194
epoch 5, iter = 9200, loss = 0.196
epoch 5, iter = 9300, loss = 0.209
epoch 5, iter = 9400, loss = 0.210
epoch 5, iter = 9500, loss = 0.202
epoch 5, iter = 9600, loss = 0.203



test

input : Era uno sciatore molto bravo quand' era piccolo .


output : It was a very good skier when she was small .