Introduction Language Translation using Deep Learning
In this article we are going to develop a model that performs language translation using deep learning, automatically translating from German to English in Python with Keras, step by step.
Machine translation is a challenging task that traditionally involves large statistical models developed using highly sophisticated linguistic knowledge.
In this article, you will learn how to develop a machine translation system for translating German into English.
Machine translation is the task of converting text from one language into another. Generally, it involves statistical models. Once such a model is built, it produces results quickly, which is the power of machine learning and statistical models. Here, we are going to use deep neural networks for the problem of machine translation, and we will discover how to develop a neural machine translation model.
Pre-Processing the Text Data
Pre-processing the text is an important step before modeling in Natural Language Processing.
A look at the raw data shows several things to handle when cleaning it:
- There is punctuation that should be removed.
- The text contains uppercase and lowercase.
- The German text contains special characters.
- The file is ordered by sentence length with very long sentences toward the end of the file.
In this article, the basic steps for data preparation are divided into two sections: cleaning the text and splitting it.
First, load the data in a way that preserves the Unicode German characters. The load_doc() function below loads the file as a blob of text.
# load doc into memory
def load_doc(filename):
    # open the file as read only; the encoding argument makes sure the
    # German characters are read correctly as UTF-8
    file = open(filename, mode='rt', encoding='utf-8')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text
Every line contains a single pair of phrases, first English and then German, separated by a tab character.
We must split the loaded text by line and then by phrase. The to_pairs() function below splits the loaded text.
# split a loaded document into sentences
def to_pairs(doc):
    lines = doc.strip().split('\n')
    pairs = [line.split('\t') for line in lines]
    return pairs
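As a quick check of the splitting logic, here is a minimal sketch that runs it on a tiny in-memory sample instead of the real file. The two sample lines are made up for illustration and only assume the tab-separated, English-first layout described above.

```python
# split a tab-separated document into [english, german] pairs
def to_pairs(doc):
    lines = doc.strip().split('\n')
    return [line.split('\t') for line in lines]

# hypothetical two-line sample in the same layout as the data file
sample = "Hi.\tHallo!\nRun!\tLauf!\n"
pairs = to_pairs(sample)
print(pairs)  # [['Hi.', 'Hallo!'], ['Run!', 'Lauf!']]
```

Each inner list holds the English phrase first and the German phrase second, which is the order the later code relies on.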
Now we are ready to clean each sentence. The specific cleaning operations we will perform are as follows:
- Remove all non-printable characters.
- Remove all punctuation characters.
- Normalize all Unicode characters to ASCII (e.g. Latin characters).
- Normalize the case to lowercase.
- Remove any remaining tokens that are not alphabetic.
We will perform these operations on each phrase for each pair in the loaded dataset.
The clean_pairs() function below implements these operations.
import re
import string
from unicodedata import normalize
from numpy import array

# clean a list of lines
def clean_pairs(lines):
    cleaned = list()
    # prepare regex for char filtering
    re_print = re.compile('[^%s]' % re.escape(string.printable))
    # prepare translation table for removing punctuation
    table = str.maketrans('', '', string.punctuation)
    for pair in lines:
        clean_pair = list()
        for line in pair:
            # normalize unicode characters
            line = normalize('NFD', line).encode('ascii', 'ignore')
            line = line.decode('UTF-8')
            # tokenize on white space
            line = line.split()
            # convert to lowercase
            line = [word.lower() for word in line]
            # remove punctuation from each token
            line = [word.translate(table) for word in line]
            # remove non-printable chars from each token
            line = [re_print.sub('', w) for w in line]
            # remove tokens with numbers in them
            line = [word for word in line if word.isalpha()]
            # store as string
            clean_pair.append(' '.join(line))
        cleaned.append(clean_pair)
    return array(cleaned)
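To see what these steps do to a single phrase, the sketch below applies the same operations to one string at a time. The helper name clean_phrase() is hypothetical and only uses the Python standard library; the sample phrases are illustrative.

```python
import re
import string
from unicodedata import normalize

def clean_phrase(phrase):
    """Apply the cleaning steps listed above to one phrase."""
    re_print = re.compile('[^%s]' % re.escape(string.printable))
    table = str.maketrans('', '', string.punctuation)
    # decompose accented characters, then drop everything that is not ASCII
    phrase = normalize('NFD', phrase).encode('ascii', 'ignore').decode('utf-8')
    # lowercase each token and strip punctuation from it
    words = [w.lower().translate(table) for w in phrase.split()]
    # remove non-printable characters
    words = [re_print.sub('', w) for w in words]
    # keep only purely alphabetic tokens
    return ' '.join(w for w in words if w.isalpha())

print(clean_phrase('Grüß Gott!'))  # gru gott
print(clean_phrase('Zu Hülf!'))    # zu hulf
```

Note that the ASCII step is lossy: the umlaut is reduced to its base letter and the character ß is dropped entirely, which is why the cleaned German no longer looks like correct German.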
Finally, now that the data has been cleaned, we can save the list of phrase pairs to a file ready for use. The save_clean_data() function below uses the pickle API to do this.
from pickle import dump

# save a list of clean sentences to file
def save_clean_data(sentences, filename):
    dump(sentences, open(filename, 'wb'))
    print('Saved: %s' % filename)

# load dataset
filename = 'deu.txt'
doc = load_doc(filename)
# split into english-german pairs
pairs = to_pairs(doc)
# clean sentences
clean_pairs = clean_pairs(pairs)
# save clean pairs to file
save_clean_data(clean_pairs, 'english-german.pkl')
# spot check
for i in range(100):
    print('[%s] => [%s]' % (clean_pairs[i,0], clean_pairs[i,1]))

Output (first lines shown):
[hi] => [hallo]
[hi] => [gru gott]
[run] => [lauf]
[wow] => [potzdonner]
[wow] => [donnerwetter]
[fire] => [feuer]
[help] => [hilfe]
[help] => [zu hulf]
[stop] => [stopp]
[wait] => [warte]
The clean data contains a little over 150,000 phrase pairs and some of the pairs toward the end of the file are very long.
This is a good number of examples for developing a small translation model. The complexity of the model increases with the number of examples, length of phrases, and size of the vocabulary. More examples are good for creating a large model.
Although we have a good dataset for modeling translation, we will simplify the problem slightly to dramatically reduce the size of the model required, and in turn the training time required to fit the model.
We will simplify the problem by reducing the dataset to the first 10,000 examples in the file; these will be the shortest phrases in the dataset.
Further, we will then take the first 9,000 of those as examples for training and the remaining 1,000 examples to test the fit model.
The complete example below loads the clean data, splits it, and saves the split portions to new files.
from pickle import load
from pickle import dump
from numpy.random import shuffle

# load a clean dataset
def load_clean_sentences(filename):
    return load(open(filename, 'rb'))

# save a list of clean sentences to file
def save_clean_data(sentences, filename):
    dump(sentences, open(filename, 'wb'))
    print('Saved: %s' % filename)

# load dataset
raw_dataset = load_clean_sentences('english-german.pkl')
# reduce dataset size
n_sentences = 10000
dataset = raw_dataset[:n_sentences, :]
# random shuffle
shuffle(dataset)
# split into train/test
train, test = dataset[:9000], dataset[9000:]
# save
save_clean_data(dataset, 'english-german-both.pkl')
save_clean_data(train, 'english-german-train.pkl')
save_clean_data(test, 'english-german-test.pkl')
Running the example creates three new files:
- english-german-both.pkl contains all of the train and test examples. These are used to define the parameters of the problem, such as the vocabulary.
- english-german-train.pkl contains the training dataset.
- english-german-test.pkl contains the test dataset.
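The shuffle-then-split step can also be sketched with the standard library on plain Python lists. This is only an illustration of the idea, not the code used in this article: shuffle_and_split() is a hypothetical helper, and the fixed seed is an assumption added so that repeated runs give the same split.

```python
import random

def shuffle_and_split(pairs, n_train, seed=1):
    """Shuffle a copy of pairs and return (train, test)."""
    shuffled = list(pairs)                 # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)  # seeded for repeatable splits
    return shuffled[:n_train], shuffled[n_train:]

# a few cleaned pairs, in the same [english, german] order as the dataset
pairs = [("hi", "hallo"), ("run", "lauf"), ("fire", "feuer"),
         ("help", "hilfe"), ("stop", "stopp")]
train, test = shuffle_and_split(pairs, n_train=4)
print(len(train), len(test))  # 4 1
```

Shuffling before splitting matters here because the file is ordered by sentence length; without it, the test set would contain only the longest of the selected phrases.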
Train the Language Translation Model
In this section, we load and prepare the clean text data for modeling, then define and train the model on the prepared data.
The load_clean_sentences() function can be used to load the train, test, and both datasets.
from pickle import load

# load a clean dataset
def load_clean_sentences(filename):
    return load(open(filename, 'rb'))

# load datasets
dataset = load_clean_sentences('english-german-both.pkl')
train = load_clean_sentences('english-german-train.pkl')
test = load_clean_sentences('english-german-test.pkl')
We can use the Keras Tokenizer class to map words to integers. We will use a separate tokenizer for the English sequences and the German sequences. The function below, named create_tokenizer(), will train a tokenizer on a list of phrases.
from keras.preprocessing.text import Tokenizer

# fit a tokenizer
def create_tokenizer(lines):
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(lines)
    return tokenizer

# max sentence length
def max_length(lines):
    return max(len(line.split()) for line in lines)
The code above also defines a max_length() function, which finds the length of the longest sequence in a list of phrases.
We can call these functions with the combined dataset to prepare tokenizers, vocabulary sizes, and maximum lengths for both the English and German phrases.
# prepare english tokenizer
eng_tokenizer = create_tokenizer(dataset[:, 0])
eng_vocab_size = len(eng_tokenizer.word_index) + 1
eng_length = max_length(dataset[:, 0])
print('English Vocabulary Size: %d' % eng_vocab_size)
print('English Max Length: %d' % (eng_length))
# prepare german tokenizer
ger_tokenizer = create_tokenizer(dataset[:, 1])
ger_vocab_size = len(ger_tokenizer.word_index) + 1
ger_length = max_length(dataset[:, 1])
print('German Vocabulary Size: %d' % ger_vocab_size)
print('German Max Length: %d' % (ger_length))
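To make the word-to-integer mapping concrete, here is a minimal pure-Python sketch of the idea. It is a simplified stand-in and not the Keras Tokenizer itself: build_word_index() is a hypothetical helper, and the sample phrases are only for illustration. Like the vocabulary-size code above, it assumes that index 0 is reserved and never assigned to a word.

```python
from collections import Counter

def build_word_index(lines):
    """Map each word to an integer, most frequent word first, starting at 1."""
    counts = Counter(word for line in lines for word in line.split())
    # most_common() sorts by count; words with equal counts keep first-seen order
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def max_length(lines):
    """Length, in words, of the longest phrase."""
    return max(len(line.split()) for line in lines)

lines = ["i can see smoke", "i can help you", "am i wrong"]
word_index = build_word_index(lines)
vocab_size = len(word_index) + 1   # +1 because index 0 is not used for any word
print(word_index["i"], word_index["can"])  # 1 2
print(vocab_size, max_length(lines))       # 9 4
```

The extra slot in the vocabulary size is what lets 0 be used later as the padding value without colliding with a real word.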
We are now ready to prepare the training data.
Each input and output sequence must be encoded to integers and padded to the maximum phrase length, because we will use a word embedding for the input sequences and one-hot encode the output sequences. The encode_sequences() function below performs these operations and returns the result.
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
from numpy import array

# encode and pad sequences
def encode_sequences(tokenizer, length, lines):
    # integer encode sequences
    X = tokenizer.texts_to_sequences(lines)
    # pad sequences with 0 values
    X = pad_sequences(X, maxlen=length, padding='post')
    return X

The output sequence needs to be one-hot encoded. This is because the model will predict the probability of each word in the vocabulary as output. The function encode_output() below will one-hot encode English output sequences.

# one hot encode target sequence
def encode_output(sequences, vocab_size):
    ylist = list()
    for sequence in sequences:
        encoded = to_categorical(sequence, num_classes=vocab_size)
        ylist.append(encoded)
    y = array(ylist)
    # reshape to (number of phrases, phrase length, vocabulary size)
    y = y.reshape(sequences.shape[0], sequences.shape[1], vocab_size)
    return y

We can now use these two functions to prepare both the train and test datasets for training the model.

# prepare training data
trainX = encode_sequences(ger_tokenizer, ger_length, train[:, 1])
trainY = encode_sequences(eng_tokenizer, eng_length, train[:, 0])
trainY = encode_output(trainY, eng_vocab_size)
# prepare validation data
testX = encode_sequences(ger_tokenizer, ger_length, test[:, 1])
testY = encode_sequences(eng_tokenizer, eng_length, test[:, 0])
testY = encode_output(testY, eng_vocab_size)

We are now ready to define the model.
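Before moving on, the padding and one-hot steps can be illustrated without Keras. The sketch below uses two hypothetical helpers, pad_post() and one_hot(), on a made-up integer sequence; it assumes the sequence is never longer than the target length, which holds here because the target length is the maximum phrase length.

```python
def pad_post(seq, length):
    """Right-pad an integer sequence with zeros up to a fixed length.
    Sequences already at or beyond that length are returned unchanged."""
    return seq + [0] * (length - len(seq))

def one_hot(seq, vocab_size):
    """Turn each integer into a vector with a single 1 at that index."""
    return [[1 if k == idx else 0 for k in range(vocab_size)] for idx in seq]

encoded = [2, 3]                 # a two-word phrase as integers
padded = pad_post(encoded, 4)    # [2, 3, 0, 0]
target = one_hot(padded, 5)      # 4 rows, each of length 5
print(padded)
print(target[0])                 # [0, 0, 1, 0, 0]
```

The result has one row per time step and one column per vocabulary word, which is the same per-phrase layout that encode_output() produces for the whole dataset.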
We will use an encoder-decoder LSTM model on this problem. In this model, the input sequence is encoded by a front-end model called the encoder then decoded word by word by a backend model called the decoder.
The define_model() function below defines the model and takes a number of arguments used to configure the model, such as the size of the input and output vocabularies, the maximum length of input and output phrases, and the number of memory units used to configure the model.
The model is trained using the efficient Adam approach to stochastic gradient descent and minimizes the categorical cross-entropy loss function because we have framed the prediction problem as multi-class classification.
from keras.models import Sequential
from keras.layers import LSTM, Dense, Embedding, RepeatVector, TimeDistributed
from keras.callbacks import ModelCheckpoint
from keras.utils import plot_model

# define NMT model
def define_model(src_vocab, tar_vocab, src_timesteps, tar_timesteps, n_units):
    model = Sequential()
    model.add(Embedding(src_vocab, n_units, input_length=src_timesteps, mask_zero=True))
    model.add(LSTM(n_units))
    model.add(RepeatVector(tar_timesteps))
    model.add(LSTM(n_units, return_sequences=True))
    model.add(TimeDistributed(Dense(tar_vocab, activation='softmax')))
    return model

# define model
model = define_model(ger_vocab_size, eng_vocab_size, ger_length, eng_length, 256)
model.compile(optimizer='adam', loss='categorical_crossentropy')
# summarize defined model
print(model.summary())
plot_model(model, to_file='model.png', show_shapes=True)

Finally, we can train the model for 30 epochs with a batch size of 64 examples. We use checkpointing to ensure that each time the model skill on the test set improves, the model is saved to file.

# fit model
filename = 'model.h5'
checkpoint = ModelCheckpoint(filename, monitor='val_loss', verbose=1, save_best_only=True, mode='min')
model.fit(trainX, trainY, epochs=30, batch_size=64, validation_data=(testX, testY), callbacks=[checkpoint], verbose=2)

Running the example first prints a summary of the parameters of the dataset, such as vocabulary size and maximum phrase lengths:

English Vocabulary Size: 2404
English Max Length: 5
German Vocabulary Size: 3856
German Max Length: 10

It then prints a summary of the defined model:

Layer (type)                     Output Shape          Param #
=================================================================
embedding_1 (Embedding)          (None, 10, 256)       987136
lstm_1 (LSTM)                    (None, 256)           525312
repeat_vector_1 (RepeatVecto)    (None, 5, 256)        0
lstm_2 (LSTM)                    (None, 5, 256)        525312
time_distributed_1 (TimeDist)    (None, 5, 2404)       617828
=================================================================
Total params: 2,655,588
Trainable params: 2,655,588
Non-trainable params: 0
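The parameter counts in the summary can be checked by hand. The sketch below recomputes them from the vocabulary sizes and the 256 memory units reported above, using the standard formulas for embedding, LSTM, and dense layers; the helper function names are hypothetical.

```python
def embedding_params(vocab_size, n_units):
    # one vector of n_units weights per word in the vocabulary
    return vocab_size * n_units

def lstm_params(input_dim, n_units):
    # four gates, each with input weights, recurrent weights and a bias
    return 4 * (n_units * (input_dim + n_units) + n_units)

def dense_params(input_dim, output_dim):
    # one weight per input plus a bias, for every output
    return (input_dim + 1) * output_dim

ger_vocab_size, eng_vocab_size, n_units = 3856, 2404, 256
counts = [
    embedding_params(ger_vocab_size, n_units),  # embedding_1
    lstm_params(n_units, n_units),              # lstm_1 (encoder)
    lstm_params(n_units, n_units),              # lstm_2 (decoder)
    dense_params(n_units, eng_vocab_size),      # time_distributed_1
]
total = sum(counts)
print(counts)  # [987136, 525312, 525312, 617828]
print(total)   # 2655588
```

The RepeatVector layer has no weights, which is why it contributes 0 to the total.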
Next, the model is trained.
Each epoch takes about 30 seconds on modern CPU hardware; no GPU is required.
During the run, the model will be saved to the file model.h5, ready for inference in the next step.
Epoch 00025: val_loss improved from 2.20048 to 2.19976, saving model to model.h5
17s – loss: 0.7114 – val_loss: 2.1998
Epoch 00026: val_loss improved from 2.19976 to 2.18255, saving model to model.h5
17s – loss: 0.6532 – val_loss: 2.1826
Epoch 00027: val_loss did not improve
17s – loss: 0.5970 – val_loss: 2.1970
Epoch 00028: val_loss improved from 2.18255 to 2.17872, saving model to model.h5
17s – loss: 0.5474 – val_loss: 2.1787
Epoch 00029: val_loss did not improve
17s – loss: 0.5023 – val_loss: 2.1823
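The checkpointing behaviour seen in this log, where the model is saved only when the validation loss beats the best value so far, can be sketched in a few lines. This illustrates the rule and is not the Keras implementation; epochs_saved() is a hypothetical helper, and the loss values are the ones reported in the log above.

```python
def epochs_saved(val_losses, best=float("inf")):
    """Return one flag per epoch: True when val_loss beat the best so far."""
    flags = []
    for loss in val_losses:
        improved = loss < best   # strictly lower, as with mode='min'
        if improved:
            best = loss          # a save would happen at this point
        flags.append(improved)
    return flags

# validation losses for epochs 25 to 29, starting from the previous best
val_losses = [2.19976, 2.18255, 2.1970, 2.17872, 2.1823]
flags = epochs_saved(val_losses, best=2.20048)
print(flags)  # [True, True, False, True, False]
```

The flags line up with the "improved" and "did not improve" messages in the log, so the file model.h5 always holds the weights from the epoch with the lowest validation loss.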
Evaluate the Translation Model
Now, evaluate the model on the train and test data.
Ideally, we would use a separate validation dataset to help with model selection during training instead of the test set. You can try this as an extension.
from keras.models import load_model

# load datasets
dataset = load_clean_sentences('english-german-both.pkl')
train = load_clean_sentences('english-german-train.pkl')
test = load_clean_sentences('english-german-test.pkl')
# prepare english tokenizer
eng_tokenizer = create_tokenizer(dataset[:, 0])
eng_vocab_size = len(eng_tokenizer.word_index) + 1
eng_length = max_length(dataset[:, 0])
# prepare german tokenizer
ger_tokenizer = create_tokenizer(dataset[:, 1])
ger_vocab_size = len(ger_tokenizer.word_index) + 1
ger_length = max_length(dataset[:, 1])
# prepare data
trainX = encode_sequences(ger_tokenizer, ger_length, train[:, 1])
testX = encode_sequences(ger_tokenizer, ger_length, test[:, 1])
# load the model saved during training
model = load_model('model.h5')
Evaluation involves two steps: first generating a translated output sequence, and then repeating this process for many input examples and summarizing the skill of the model across multiple cases.
translation = model.predict(source, verbose=0)
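The prediction contains one probability vector per output word. Taking the argmax of each vector gives a sequence of integers, which must be mapped back to words using the tokenizer.
The word_for_id() function below performs this reverse mapping.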
from numpy import argmax

# map an integer to a word
def word_for_id(integer, tokenizer):
    for word, index in tokenizer.word_index.items():
        if index == integer:
            return word
    return None

The predict_sequence() function below generates a translation for a single encoded source phrase.

# generate target given source sequence
def predict_sequence(model, tokenizer, source):
    # predict returns one result per input phrase; take the first (and only) one
    prediction = model.predict(source, verbose=0)[0]
    integers = [argmax(vector) for vector in prediction]
    target = list()
    for i in integers:
        word = word_for_id(i, tokenizer)
        if word is None:
            break
        target.append(word)
    return ' '.join(target)
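The decoding step can be tried without a trained model. The sketch below decodes a hand-written list of probability vectors; the vocabulary and the probabilities are made up for illustration, and decode() is a hypothetical helper. As a design variation, it inverts the word index once into a dictionary rather than scanning it for every word.

```python
def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=lambda i: values[i])

def decode(prediction, word_index):
    # invert the word -> integer mapping once
    index_word = {index: word for word, index in word_index.items()}
    words = []
    for vector in prediction:
        word = index_word.get(argmax(vector))
        if word is None:   # index 0 is padding and has no word
            break
        words.append(word)
    return ' '.join(words)

word_index = {'i': 1, 'missed': 2, 'you': 3}
prediction = [
    [0.1, 0.7, 0.1, 0.1],      # most likely word: index 1
    [0.1, 0.1, 0.6, 0.2],      # index 2
    [0.0, 0.1, 0.2, 0.7],      # index 3
    [0.9, 0.05, 0.03, 0.02],   # index 0, treated as padding
]
print(decode(prediction, word_index))  # i missed you
```

Stopping at the first unmapped index is what trims the padded positions from the end of the translated phrase.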
Next, we can repeat this step for each source phrase in a dataset and compare the predicted result to the expected target phrase in English.
We can print some of these comparisons to the screen to get an idea of how the model performs in practice.
For testing, we will also calculate the BLEU scores to get a quantitative idea of how well the model has performed.
from nltk.translate.bleu_score import corpus_bleu

# evaluate the skill of the model
def evaluate_model(model, tokenizer, sources, raw_dataset):
    actual, predicted = list(), list()
    for i, source in enumerate(sources):
        # translate encoded source text
        source = source.reshape((1, source.shape[0]))
        translation = predict_sequence(model, tokenizer, source)
        raw_target, raw_src = raw_dataset[i]
        # print the first few comparisons
        if i < 10:
            print('src=[%s], target=[%s], predicted=[%s]' % (raw_src, raw_target, translation))
        actual.append([raw_target.split()])
        predicted.append(translation.split())
    # calculate BLEU score
    print('BLEU-1: %f' % corpus_bleu(actual, predicted, weights=(1.0, 0, 0, 0)))
    print('BLEU-2: %f' % corpus_bleu(actual, predicted, weights=(0.5, 0.5, 0, 0)))
    print('BLEU-3: %f' % corpus_bleu(actual, predicted, weights=(0.3, 0.3, 0.3, 0)))
    print('BLEU-4: %f' % corpus_bleu(actual, predicted, weights=(0.25, 0.25, 0.25, 0.25)))
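To give a feel for what the BLEU-1 number measures, here is a simplified pure-Python sketch of the idea: clipped unigram precision over the whole corpus, multiplied by a brevity penalty. It is not NLTK's implementation, bleu1() is a hypothetical helper, and it assumes exactly one reference per predicted phrase.

```python
import math
from collections import Counter

def bleu1(references, hypotheses):
    """Corpus-level clipped unigram precision with a brevity penalty."""
    matches = hyp_len = ref_len = 0
    for ref, hyp in zip(references, hypotheses):
        ref_counts, hyp_counts = Counter(ref), Counter(hyp)
        # a predicted word only counts as often as it occurs in the reference
        matches += sum(min(n, ref_counts[w]) for w, n in hyp_counts.items())
        hyp_len += len(hyp)
        ref_len += len(ref)
    if hyp_len == 0:
        return 0.0
    # penalize predictions that are shorter overall than the references
    brevity = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity * matches / hyp_len

refs = ['tom raised me'.split(), 'i missed you'.split(), 'i count on tom'.split()]
hyps = ['tom tricked me'.split(), 'i missed you'.split(), 'ill call tom tom'.split()]
score = bleu1(refs, hyps)
print(round(score, 4))  # 0.6
```

In the third pair the word "tom" is predicted twice but appears once in the reference, so it is counted only once. That clipping is what stops a model from scoring well by repeating a common word.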
Now we evaluate the loaded model on both the training and test datasets. A sample of the comparisons printed during evaluation is shown below.
src=[er ist ein blodmann], target=[hes a jerk], predicted=[hes a jerk]
src=[ich bin brillentrager], target=[i wear glasses], predicted=[i wear glasses]
src=[tom hat mich aufgezogen], target=[tom raised me], predicted=[tom tricked me]
src=[ich zahle auf tom], target=[i count on tom], predicted=[ill call tom tom]
src=[ich kann rauch sehen], target=[i can see smoke], predicted=[i can help you]
src=[tom fuhlte sich einsam], target=[tom felt lonely], predicted=[tom felt uneasy]
src=[hab ich nicht recht], target=[am i wrong], predicted=[am i fat]
src=[gestatten sie mir zu gehen], target=[allow me to go], predicted=[do me to go]
src=[du hast mir gefehlt], target=[i missed you], predicted=[i missed you]
src=[es ist zu spat], target=[it is too late], predicted=[its too late]
This article covers a fairly large amount of advanced code. It walks through the whole basic process: data cleaning, tokenization, and machine translation with an LSTM model. The resulting code translates German into English.