- To study a wide variety of natural language processing techniques and compare their performance.
- To tune any one model to reach 90% test accuracy.
The source code hosted in this repository is shared under the MIT license.
DataDisca Pty Ltd, Melbourne, Australia
The latest tested version of each library is given in brackets after its name for reference.
- Python (3.9.7)
- Jupyter Notebook (6.4.6) with IPython (7.29.0)
- Numba (0.54.0rc1)
- Numpy (1.22.0)
- Pandas (1.3.4)
- Matplotlib (3.5.0)
- Tensorflow (2.9.0.dev20220102) including Keras (2.9.0.dev2022010308) and Tensorboard (2.8.0a20220102)
- Scikit-learn (1.0.1)
- Plotly (5.4.0)
- Pydot (1.4.2) - Dependency for tf.keras.utils.plot_model
- Pydotplus (2.0.2) - Dependency for tf.keras.utils.plot_model
- GraphViz (2.50.0) - Dependency for tf.keras.utils.plot_model
The five authors listed below were selected as the sample authors for this project. To generate the dataset, multiple books by each author were taken in plain text format from Project Gutenberg. Each source file is covered by its respective Project Gutenberg license and is used strictly for research purposes. Raw files are downloaded from the mirrors using a script, as per the instructions in the terms of use.
Download the text (Plain Text UTF-8) of at least 5 books for the train, test, and validation splits and 4 books for the separate validation dataset from each author in the following table.
Author | URL |
---|---|
Charles Dickens | https://www.gutenberg.org/ebooks/author/37 |
Jane Austen | https://www.gutenberg.org/ebooks/author/68 |
Sir Arthur Conan Doyle | https://www.gutenberg.org/ebooks/author/69 |
George Eliot | https://www.gutenberg.org/ebooks/author/90 |
Jules Verne | https://www.gutenberg.org/ebooks/author/60 |
Extract mutually exclusive (non-overlapping) records of L words each from the text of each book for the train, test, and validation splits, and save them as "dataset.csv".
- Script: Dataset_Prepare\extract.ipynb
- L = 50 # length of records to be extracted
- N = 1000 # number of records for a book
- Extracted 25000 records in total (1000 records per book (N) * 5 authors * 5 books per author)
- Pre-processing:
  - Break lines (sentences) at "." (period)
  - Replace "\r\n" with " " (space)
  - Replace double spaces with single spaces
  - Remove all punctuation, keeping only alphanumeric characters and spaces.
  - Remove sentences shorter than 20 characters to filter out gibberish.
  - Remove the first 100 sentences to drop the table of contents, preface, etc.
  - Remove the last 250 sentences to drop Project Gutenberg material such as the license.
  - Convert all text to lowercase
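
The pre-processing and extraction steps above can be sketched roughly as follows. This is a minimal illustration and not the code in the notebook: the function names `preprocess` and `extract_records` are hypothetical, the order of the steps is an assumption, bare "\n" line endings are handled in addition to "\r\n", and records are taken consecutively from the start of the cleaned text (the notebook may select them differently).

```python
import re


def preprocess(raw_text, skip_head=100, skip_tail=250, min_chars=20):
    """Return a list of cleaned sentences from one book's raw text."""
    # Replace line breaks with spaces (also covers bare "\n", an assumption
    # for files opened in text mode).
    text = re.sub(r"\r?\n", " ", raw_text)
    sentences = []
    for sentence in text.split("."):  # break sentences at periods
        # Keep only alphanumeric characters and spaces.
        sentence = re.sub(r"[^A-Za-z0-9 ]", "", sentence)
        # Collapse repeated spaces, trim, and lowercase.
        sentence = re.sub(r" {2,}", " ", sentence).strip().lower()
        if len(sentence) >= min_chars:  # drop very short fragments
            sentences.append(sentence)
    # Drop front matter (contents, preface) and trailing license text.
    end = max(len(sentences) - skip_tail, 0)
    return sentences[skip_head:end]


def extract_records(sentences, record_len=50, max_records=1000):
    """Cut the cleaned text into non-overlapping records of record_len words."""
    words = " ".join(sentences).split()
    records = []
    for start in range(0, len(words) - record_len + 1, record_len):
        records.append(" ".join(words[start:start + record_len]))
        if len(records) == max_records:
            break
    return records
```

With the values listed above, a call would look like `extract_records(preprocess(raw_text), record_len=50, max_records=1000)` for each book, with the author name attached to every record as its label.
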
Extract a separate validation dataset from another set of books as "validation_dataset.csv".
- Script: Dataset_Prepare\seperate_validation.ipynb
- L = 50 # length of records to be extracted
- N = 200 # number of records for a book
- Extracted 4000 records in total (200 records per book (N) * 5 authors * 4 books per author)
- Uses all of the pre-processing steps above
Classifier | Bag of Words + TF-IDF (TfidfVectorizer()) | Averaged Test Accuracy | Averaged Validation Accuracy |
---|---|---|---|
Logistic Regression | train_LR.ipynb | 89.03 | 89.02 |
Support Vector Machines | train_SVM.ipynb | 86.25 | 86.65 |
Random Forest | train_RFC.ipynb | 72.67 | 72.29 |
Naive Bayes | train_NB.ipynb | 89.34 | 89.92 |
XGBoost | train_XGB.ipynb | 78.23 | 77.73 |
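
For orientation, a bag-of-words + TF-IDF classifier of the kind listed in the table can be sketched with scikit-learn as below. This is only an assumed outline, not the contents of the notebooks: the helper name `build_tfidf_classifier`, the `max_iter` setting, and the toy records are hypothetical, and real training would use the records and author labels from "dataset.csv".

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


def build_tfidf_classifier():
    """TF-IDF features followed by logistic regression (illustrative settings)."""
    return make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))


# Toy stand-in data (hypothetical); the project uses 50-word records per author.
texts = [
    "the fog lay thick upon the london streets that evening",
    "the old clerk counted his coins in the dim counting house",
    "she was persuaded that a walk would improve her spirits",
    "her sister thought the gentleman agreeable and well bred",
]
labels = ["Charles Dickens", "Charles Dickens", "Jane Austen", "Jane Austen"]

model = build_tfidf_classifier()
model.fit(texts, labels)
predicted = model.predict(["the fog was thick in the london streets"])
```

Swapping `LogisticRegression` for another scikit-learn estimator in the same pipeline gives the other rows of the comparison in the same way.
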
Classifier Model | Source Code | Validation Accuracy (same books) | Validation Accuracy (separate books) |
---|---|---|---|
GRU + GloVe | keras_glove_gru.ipynb | 82.70 | 59.45 |
LSTM + GloVe | keras_glove_lstm.ipynb | 80.32 | 58.25 |
Bidirectional LSTM + GloVe | keras_glove_bi_lstm.ipynb | 78.38 | 59.23 |
GRU + Word2vec | keras_word2vec_gru.ipynb | 76.68 | 59.63 |
LSTM + Word2vec | keras_word2vec_lstm.ipynb | 70.62 | 55.58 |
Bidirectional LSTM + Word2vec | keras_word2vec_bi_lstm.ipynb | 69.76 | 54.80 |
Classifier Model | Source Code | Validation Accuracy (same books) | Validation Accuracy (separate books) |
---|---|---|---|
BERT (AdamW / seq_length = 300) | keras_bert_v4_18_colab.ipynb | 92.70 | 69.63 |
Classifier Model | Source Code | Validation Accuracy (same books) | Validation Accuracy (separate books) |
---|---|---|---|
BERT (Adam / seq_length = 100) | keras_hf_bert_3060.ipynb | 90.40 | 67.73 |
DistilBERT (Adam / seq_length = 100) | keras_hf_DistilBERT.ipynb | 89.42 | 65.45 |
RoBERTa (Adam / seq_length = 100) | keras_hf_RoBERTa.ipynb | 89.48 | 67.40 |
XLNet (Adam / seq_length = 100) | keras_hf_XLNet.ipynb | 88.06 | 67.88 |