How to Build a Text Generator using Keras in Python

Recurrent Neural Networks (RNNs) are very powerful sequence models for classification problems. In this tutorial, however, we will use RNNs as generative models, which means they can learn the sequences of a problem and then generate entirely new sequences for the problem domain.

After reading this tutorial, you will know how to build an LSTM model that generates text (character by character) using Keras in Python.

In text generation, we show the model many training examples so that it can learn a pattern between the input and the output. Each input is a sequence of characters and the output is the next single character. For example, say we want to train on the sentence 'python is great': the input is 'python is grea' and the output is 't'. To make reasonable predictions, we need to show the model as many examples as our memory can handle.
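As a minimal sketch of that idea, the snippet below first splits the example sentence into one input and one target, then slides a small fixed-size window over it to produce several input/target pairs. The helper name make_pairs and the window size of 5 are illustrative assumptions; the tutorial itself uses windows of 100 characters later on.

```python
def make_pairs(text, window):
    """Return (input_window, next_char) pairs for every position in text."""
    pairs = []
    for i in range(len(text) - window):
        pairs.append((text[i:i + window], text[i + window]))
    return pairs


sentence = "python is great"

# the whole sentence minus its last character is the input, the last character is the target
full_input, full_target = sentence[:-1], sentence[-1]
print(repr(full_input), "->", repr(full_target))

# a shorter window yields many more training examples from the same text
pairs = make_pairs(sentence, window=5)
for inp, out in pairs[:3]:
    print(repr(inp), "->", repr(out))
```

Each window overlaps the previous one by all but one character, which is why even a single book produces a large number of training examples.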

Getting Started

Let's install the required dependencies for this tutorial:

pip3 install tensorflow==1.13.1 keras numpy requests tqdm

Importing everything:

import numpy as np
import os
import pickle
from keras.models import Sequential
from keras.layers import Dense, LSTM
from keras.callbacks import ModelCheckpoint
from string import punctuation

Preparing the Dataset

We are going to use a free downloadable book as the dataset: Alice’s Adventures in Wonderland by Lewis Carroll.

These lines of code will download it and save it in a text file:

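import requests
# download the book from Project Gutenberg
content = requests.get("http://www.gutenberg.org/cache/epub/11/pg11.txt").text
# write it to disk; the with-block closes the file when done
with open("data/wonderland.txt", "w", encoding="utf-8") as f:
    f.write(content)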

Just make sure a folder called "data" exists in your current directory.

Now let's try to clean this dataset:

# read the textbook
text = open("data/wonderland.txt", encoding="utf-8").read()
# remove caps and replace two new lines with one new line
text = text.lower().replace("\n\n", "\n")
# remove all punctuations
text = text.translate(str.maketrans("", "", punctuation))

The above code reduces our vocabulary for better and faster training by removing uppercase characters and punctuation, as well as replacing two consecutive newlines with just one.
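To see the effect of these cleaning steps on something small, here is a sketch that applies the same calls to a made-up sample string (the sample is an assumption for illustration, not taken from the book file):

```python
from string import punctuation

sample = "Alice's Adventures,\n\nIN Wonderland!"

# lowercase everything, then collapse a double newline into a single one
cleaned = sample.lower().replace("\n\n", "\n")
# delete every character listed in string.punctuation
cleaned = cleaned.translate(str.maketrans("", "", punctuation))
print(repr(cleaned))

# string.punctuation only contains ASCII punctuation
print("’" in punctuation)
```

Because string.punctuation is ASCII-only, typographic characters such as curly quotes are not removed by this step and would remain in the vocabulary if the source text contains them.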

Let's print some statistics about the dataset:

n_chars = len(text)
unique_chars = ''.join(sorted(set(text)))
print("unique_chars:", unique_chars)
n_unique_chars = len(unique_chars)
print("Number of characters:", n_chars)
print("Number of unique characters:", n_unique_chars)

Output:

unique_chars:
 0123456789abcdefghijklmnopqrstuvwxyz
Number of characters: 154207
Number of unique characters: 39

Now that we have loaded and cleaned the dataset successfully, we need a way to convert these characters into integers. There are plenty of Keras and scikit-learn utilities for that, but we are going to do it manually in Python.

Since unique_chars is our vocabulary and contains all the unique characters of our dataset, we can make two dictionaries that map each character to an integer and vice versa:

# dictionary that converts characters to integers
char2int = {c: i for i, c in enumerate(unique_chars)}
# dictionary that converts integers to characters
int2char = {i: c for i, c in enumerate(unique_chars)}
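A quick round trip shows how the two dictionaries work together. This sketch uses a small stand-in vocabulary rather than the real unique_chars, and the encode and decode helpers are hypothetical names introduced only for the example:

```python
# toy vocabulary standing in for unique_chars from the tutorial
unique_chars = "\n abcdefghijklmnopqrstuvwxyz"
char2int = {c: i for i, c in enumerate(unique_chars)}
int2char = {i: c for i, c in enumerate(unique_chars)}


def encode(s):
    """Convert a string into a list of integer indices."""
    return [char2int[c] for c in s]


def decode(ids):
    """Convert a list of integer indices back into a string."""
    return "".join(int2char[i] for i in ids)


ids = encode("alice")
print(ids)
print(decode(ids))
```

Any character that is missing from the vocabulary raises a KeyError in encode, which is one reason the seed text used for generation should be cleaned the same way as the training text.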

Let's save them to a file (to retrieve them later in text generation):

# save these dictionaries for later generation
with open("char2int.pickle", "wb") as f:
    pickle.dump(char2int, f)
with open("int2char.pickle", "wb") as f:
    pickle.dump(int2char, f)

Now we need to split the text into subsequences with a fixed size of 100 characters. As discussed earlier, the input is a sequence of 100 characters (converted to integers, of course) and the output is the next character (one-hot encoded). Let's do it:

# hyper parameters
sequence_length = 100
step = 1
batch_size = 128
epochs = 40
sentences = []
y_train = []
for i in range(0, len(text) - sequence_length, step):
    sentences.append(text[i: i + sequence_length])
    y_train.append(text[i+sequence_length])
print("Number of sentences:", len(sentences))

Output:

Number of sentences: 154107

I chose 40 epochs for this problem, which will take a few hours to train. You can train for more epochs to get better performance.

The above code creates two new lists that contain all the sentences (fixed-length sequences of 100 characters) and their corresponding outputs (the next character).

Now we need to transform the list of input sequences into the form (number_of_sentences, sequence_length, n_unique_chars).

n_unique_chars is the total vocabulary size; in this case, 39 unique characters.

# vectorization
X = np.zeros((len(sentences), sequence_length, n_unique_chars))
y = np.zeros((len(sentences), n_unique_chars))

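for i, sentence in enumerate(sentences):
    # one-hot encode every character of the input sequence
    for t, char in enumerate(sentence):
        X[i, t, char2int[char]] = 1
    # one-hot encode the target character (once per sentence)
    y[i, char2int[y_train[i]]] = 1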
print("X.shape:", X.shape)
print("y.shape:", y.shape)

Output:

X.shape: (154107, 100, 39)
y.shape: (154107, 39)
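An array of this shape is large, because np.zeros defaults to float64. The back-of-the-envelope sketch below estimates the size of X for a few dtypes. Passing a smaller dtype to np.zeros is a suggestion on my part, not something the code above does, so treat it as an assumption to verify on your setup.

```python
import numpy as np

n_sentences, sequence_length, n_unique_chars = 154107, 100, 39
n_elements = n_sentences * sequence_length * n_unique_chars

# size of X in gigabytes for each candidate dtype
sizes_gb = {}
for dtype in (np.float64, np.float32, np.bool_):
    name = np.dtype(dtype).name
    sizes_gb[name] = n_elements * np.dtype(dtype).itemsize / 1e9
    print(f"{name:>8}: {sizes_gb[name]:.2f} GB")
```

If memory is tight, increasing step in the windowing loop also shrinks X, at the cost of fewer training examples.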

As expected, each character (in the input sequences or the output) is represented as a vector of 39 numbers, filled with zeros except for a 1 in the column of the character's index. For example, 'a' (index 12) is one-hot encoded like this:

[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]
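The vector above can be reproduced with a tiny helper. This is a sketch only; one_hot is a hypothetical function name, and the index 12 and size 39 are taken from the example in the text.

```python
import numpy as np


def one_hot(index, size):
    """Return a vector of zeros with a single 1 at the given index."""
    vec = np.zeros(size)
    vec[index] = 1
    return vec


vec = one_hot(12, 39)
print(vec)

# argmax recovers the original index, which is how predictions are decoded later
recovered = int(np.argmax(vec))
print(recovered)
```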

Building the Model

Now let's build the model. It has a single LSTM layer with 128 units, an arbitrary choice (stacking more layers may give better results).

The output layer is a fully connected layer with 39 units, where each neuron corresponds to a character and outputs the probability of that character occurring next.

# building the model
model = Sequential([
    LSTM(128, input_shape=(sequence_length, n_unique_chars)),
    Dense(n_unique_chars, activation="softmax"),
])
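For a sense of scale, the number of trainable parameters can be worked out by hand. The sketch below uses the standard LSTM parameter formula, 4 * (units * (units + input_dim) + units), plus weights and biases for the Dense layer. It assumes 39 unique characters, and the totals should agree with what model.summary() prints for this architecture.

```python
units, input_dim, n_classes = 128, 39, 39

# four gates, each with input weights, recurrent weights and a bias
lstm_params = 4 * (units * (units + input_dim) + units)
# one weight per (unit, class) pair plus one bias per class
dense_params = units * n_classes + n_classes
total_params = lstm_params + dense_params

print("LSTM: ", lstm_params)
print("Dense:", dense_params)
print("Total:", total_params)
```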

Training the Model

Let's train the model now:

model.summary()
model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])
# make results folder if does not exist yet
if not os.path.isdir("results"):
    os.mkdir("results")
# save the model in each epoch
checkpoint = ModelCheckpoint("results/wonderland-v1-{loss:.2f}.h5", verbose=1)
model.fit(X, y, batch_size=batch_size, epochs=epochs, callbacks=[checkpoint])

This will start training, which will look something like this:

Epoch 00026: saving model to results/wonderland-v1-1.10.h5
Epoch 27/40
154107/154107 [==============================] - 314s 2ms/step - loss: 1.0901 - acc: 0.6632

Epoch 00027: saving model to results/wonderland-v1-1.09.h5
Epoch 28/40
 80384/154107 [==============>...............] - ETA: 2:24 - loss: 1.0770 - acc: 0.6694

This will take a few hours, depending on your hardware. Try increasing batch_size to 256 for faster training.

After each epoch, the checkpoint callback saves the model to the results folder.
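To make the effect of the batch size concrete, this small sketch counts how many weight updates one epoch needs for the 154107 training sequences. It only does the arithmetic; the actual speed-up depends on your hardware and is not guaranteed to be proportional.

```python
import math

n_samples = 154107

steps_per_epoch = {}
for batch_size in (128, 256):
    # the last batch may be smaller than batch_size, hence the ceiling
    steps_per_epoch[batch_size] = math.ceil(n_samples / batch_size)
    print(f"batch_size={batch_size}: {steps_per_epoch[batch_size]} steps per epoch")
```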

Generating New Text

Now that we have trained the model, how can we generate new text?

Open up a new file (I will call it generate.py) and import:

import numpy as np
import pickle
import tqdm
from keras.models import Sequential
from keras.layers import Dense, LSTM

We need a sample text to start generating with. You can take sentences from the training data, which will perform better, but I'll try to produce a new chapter:

seed = "chapter xiii"

Let's load the dictionaries that map each character to an integer and vice versa, which we saved earlier during training:

char2int = pickle.load(open("char2int.pickle", "rb"))
int2char = pickle.load(open("int2char.pickle", "rb"))

Building the model again:

sequence_length = 100
n_unique_chars = len(char2int)

# building the model
model = Sequential([
    LSTM(128, input_shape=(sequence_length, n_unique_chars)),
    Dense(n_unique_chars, activation="softmax"),
])

Now we need to load the best set of model weights. Choose the checkpoint with the lowest loss in your results folder:

model.load_weights("results/wonderland-v1-1.10.h5")
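Since the loss is embedded in each checkpoint's filename, picking the best file can be automated. The following is a sketch under the assumption that filenames follow the wonderland-v1-{loss}.h5 pattern used during training. best_checkpoint is a hypothetical helper, and the list of files is made up for the example (in practice it could come from glob.glob("results/*.h5")).

```python
import re


def best_checkpoint(filenames):
    """Return the filename with the lowest loss encoded in its name, or None."""
    pattern = re.compile(r"wonderland-v1-(\d+\.\d+)\.h5$")
    scored = []
    for name in filenames:
        match = pattern.search(name)
        if match:
            scored.append((float(match.group(1)), name))
    # tuples compare by loss first, so min() picks the lowest loss
    return min(scored)[1] if scored else None


# hypothetical directory listing for illustration
files = [
    "results/wonderland-v1-2.31.h5",
    "results/wonderland-v1-1.10.h5",
    "results/wonderland-v1-1.09.h5",
]
best = best_checkpoint(files)
print(best)
```

Note that the loss in the filename is rounded to two decimals, so two epochs with nearly identical loss share a filename and the later one overwrites the earlier one.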

Let's start generating:

# generate 400 characters
generated = ""
for i in tqdm.tqdm(range(400), "Generating text"):
    # make the input sequence
    X = np.zeros((1, sequence_length, n_unique_chars))
    for t, char in enumerate(seed):
        X[0, (sequence_length - len(seed)) + t, char2int[char]] = 1
    # predict the next character
    predicted = model.predict(X, verbose=0)[0]
    # converting the vector to an integer
    next_index = np.argmax(predicted)
    # converting the integer to a character
    next_char = int2char[next_index]
    # add the character to results
    generated += next_char
    # shift seed and the predicted character
    seed = seed[1:] + next_char
print("Generated text:")
print(generated)

All we are doing here is starting with a seed text, constructing the input sequence, and then predicting the next character. After that, we shift the seed by removing its first character and appending the predicted character. This gives us a slightly changed input; because the seed is written into the end of a zero-filled array, the model input always has a length equal to our sequence length.

We then feed this updated input sequence into the model to predict another character. Repeating this process N times generates a text with N characters.
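The predict-and-shift loop can be isolated from Keras to make the mechanics easier to follow. In this sketch, toy_predictor is a hypothetical stand-in for the trained model: it simply returns the letter that follows the last character of its input, so the output is easy to check by hand.

```python
def generate(seed, n_chars, predict_next):
    """Generate n_chars characters by repeatedly predicting and shifting the seed."""
    generated = ""
    for _ in range(n_chars):
        next_char = predict_next(seed)
        generated += next_char
        # drop the oldest character and append the prediction
        seed = seed[1:] + next_char
    return generated


def toy_predictor(context):
    """Stand-in for the model: return the letter after the last one, wrapping at 'z'."""
    last = context[-1]
    return "a" if last == "z" else chr(ord(last) + 1)


result = generate("abc", 5, toy_predictor)
print(result)
```

With the real model, predict_next would build the one-hot input from the seed, call model.predict, and map the argmax back to a character through int2char.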

Here is an interesting piece of generated text:

Generated Text:
ded of and alice as it go on and the court
well you wont you wouldncopy thing
there was not a long to growing anxiously any only a low every cant
go on a litter which was proves of any only here and the things and the mort meding and the mort and alice was the things said to herself i cant remeran as if i can repeat eften to alice any of great offf its archive of and alice and a cancur as the mo

That is clearly English! However, most of the sentences don't make sense, because this is a character-level model.

Note, though, that this is not limited to English text; you can use whatever type of text you want. In fact, you can even generate Python code once you have enough lines of code.

In order to further improve the model, you can:

Reduce the vocabulary size by removing rarely occurring characters.
Train the model on padded sequences.
Add more LSTM and Dropout layers with more LSTM units.
Tweak the batch size and see which works best.
Train for more epochs.

I suggest you grab your own text, just make sure it is long enough (more than 100K characters) and train on it!

Mohammad Mostofa Zaman
