To train a GPT-2 neural network, we first need to pre-process the data to obtain a single .txt file with a machine-learning compatible structure.

2.1 Google Colab

For simplicity, and because the machine learning model we will use requires a GPU, we’re going to use Google Colab for the next step.

If you don’t know what Google Colab is, check this other article here.

2.2 Start the notebook

Open this Colab notebook and follow these steps:

  1. Run the first block of cells, under the “0️⃣ Init” chapter
  2. Press “Run Anyway” on the pop-up
  3. Make sure that the output of the first command, !nvidia-smi, shows that a GPU is connected (a P100 is suggested)
  4. If no GPU is connected, go to Runtime > Change Runtime type > Hardware accelerator > GPU
Example output when a Tesla T4 GPU is properly connected. | Image by Author
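The same GPU check can also be done from a Python cell. The following is a minimal sketch, not part of the original notebook: the helper name `connected_gpu_name` is hypothetical, and it assumes that `nvidia-smi` is on the PATH whenever a GPU runtime is attached.

```python
import shutil
import subprocess


def connected_gpu_name():
    """Return the name of the first visible NVIDIA GPU, or None if there is none."""
    # nvidia-smi is missing when the runtime has no NVIDIA driver installed.
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
        capture_output=True,
        text=True,
    )
    # A non-zero exit code usually means that no GPU is attached to the runtime.
    if result.returncode != 0:
        return None
    names = [line.strip() for line in result.stdout.splitlines() if line.strip()]
    return names[0] if names else None


print(connected_gpu_name() or "No GPU detected: change the runtime type")
```

If the cell prints the fallback message, switch the hardware accelerator as described in step 4 and run it again.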

2.3 Load the data

To work with the data, we need to upload it to Colab, into the right folders.

WhatsApp chats
Select all your .txt files and upload everything into the following notebook folder:
./messaging-chat-parser/data/chat_raw/whatsapp/

Telegram JSON
Get the file telegram_dump.json and upload it into the following notebook folder:
./messaging-chat-parser/data/chat_raw/telegram/

Example of the notebook files after the chats are uploaded | Image by Author
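To confirm that the uploads landed in the right place, a quick check like the one below can help. It is only a sketch: the folder paths and the file name telegram_dump.json come from the steps above, while the helper `find_uploaded_chats` is a hypothetical name that does not exist in the notebook.

```python
from pathlib import Path

# Folder used in the steps above, relative to the notebook working directory.
RAW_DIR = Path("./messaging-chat-parser/data/chat_raw")


def find_uploaded_chats(raw_dir=RAW_DIR):
    """Return the uploaded WhatsApp .txt files and the Telegram dump, if present."""
    raw_dir = Path(raw_dir)
    whatsapp_files = sorted((raw_dir / "whatsapp").glob("*.txt"))
    telegram_dump = raw_dir / "telegram" / "telegram_dump.json"
    return {
        "whatsapp": whatsapp_files,
        "telegram": telegram_dump if telegram_dump.is_file() else None,
    }


found = find_uploaded_chats()
print(f"WhatsApp files: {len(found['whatsapp'])}")
print(f"Telegram dump found: {found['telegram'] is not None}")
```

An empty WhatsApp list or a missing Telegram dump is fine if you only exported chats from one of the two apps.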

2.4 Parse the data

Now, run all the cells up until the block “2️⃣ Parse the data”.

Here we need to replace the value of the variable “whatsapp_user_name” with your WhatsApp name, called <YourName> in chapter 1.1.

You can also change the date parsing format if some of the exported data uses a different format due to local time formatting.

Cells used to set the user name. | Image by Author

So, for example, if my name is “Bob” and I’m from America, the code I should use is the following:
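As a minimal sketch of such a setting, assuming the notebook takes a strptime-style date format: only the variable `whatsapp_user_name` is named in the text above, while `whatsapp_datetime_format`, its value, and the sample line are hypothetical and should be adapted to your own export.

```python
from datetime import datetime

# Variable named in the notebook: your WhatsApp display name.
whatsapp_user_name = "Bob"

# Hypothetical name and value: a month-first (US-style) timestamp format.
# Adjust it to match the timestamps in your own export.
whatsapp_datetime_format = "%m/%d/%y, %H:%M"

# Made-up export line, used only to check that the format matches.
sample_line = "12/31/20, 21:45 - Bob: Happy new year!"
timestamp_text, _, rest = sample_line.partition(" - ")
author, _, message = rest.partition(": ")

# strptime raises ValueError if the format does not match the timestamp.
parsed = datetime.strptime(timestamp_text, whatsapp_datetime_format)
print(parsed.isoformat(), author == whatsapp_user_name, message)
```

Testing the format on one line of your own export before running the parser is a cheap way to catch a day-first versus month-first mismatch.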
