PatternRecognition

Project files for the pattern recognition group assignment (team 22) at Utrecht University. Our project classifies Wikipedia articles into one of the 11 top-level categories of the Wikipedia Vital Articles list, level 4.

The Vital Articles list was downloaded using Scrapy; the scraper code can be found in the WikiVitalArticles repository. The raw data is included in this repository.

Given the different skill sets in our group, we use a mix of R, Python, Keras and PyTorch to build our models. However, we make sure that each model uses the same train/test splits.
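One simple way to keep splits identical across languages is to generate the split once with a fixed seed and write the document indices to disk for every model to read. The sketch below illustrates the idea; the seed, corpus size, split ratio, and file names are assumptions, not the repository's actual values.

```python
# Minimal sketch: fix one train/test split and share it across frameworks.
# The seed, corpus size, split ratio, and file names are illustrative.
import numpy as np

rng = np.random.RandomState(42)        # fixed seed so the split is reproducible
n_docs = 10_000                        # hypothetical number of articles
indices = rng.permutation(n_docs)
n_train = int(0.9 * n_docs)            # hypothetical 90/10 split

np.savetxt("train_idx.txt", indices[:n_train], fmt="%d")
np.savetxt("test_idx.txt", indices[n_train:], fmt="%d")
```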

Our presentation can be found here. Our final paper can be found here.

Group members

Files

The repository currently contains the following files (only the most important are listed here):

  1. data/raw/WikiEssentials_L4.7z: output file of the WikiVitalArticles program. Each document is included in its entirety (but split by paragraph).
  2. preprocess_utils.py: preprocessing functions for the Wiki data.
  3. model_utils.py: various utility functions used for modeling (e.g. loading embeddings).
  4. 1_preprocess_raw_data.py: preprocessing of the raw input data. Currently shortens each article to its first 8 sentences.
  5. 2_baseline_model.py: PyTorch implementation of the baseline model (a one-layer NN with a softmax classifier; see the sketch after this list).
  6. 3_cnn_model.R: Keras implementation of a 1D convolutional neural network.
  7. 4_lstm_model.py: PyTorch implementation of a Long Short-Term Memory (LSTM) network.
  8. 5_han_model.py: PyTorch implementation of a Hierarchical Attention Network (HAN).
  9. 6_statistical_test.R: R code that performs the Stuart-Maxwell test on the classification outcomes.
  10. HAN.py: the PyTorch module implementation of the HAN.
  11. LSTM.py: the PyTorch module implementation of the LSTM.
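To make item 5 concrete, here is a minimal sketch of a baseline of that shape: a single linear layer over fixed-size document vectors, trained with a softmax classifier. The dimensions and names are illustrative and not taken from 2_baseline_model.py.

```python
# Sketch of a one-layer baseline with a softmax classifier in PyTorch.
# Input size (300) matches the FastText embedding dimension; 11 classes
# mirror the Vital Articles top-level categories. Names are illustrative.
import torch
import torch.nn as nn

class Baseline(nn.Module):
    def __init__(self, embedding_dim: int = 300, num_classes: int = 11):
        super().__init__()
        self.linear = nn.Linear(embedding_dim, num_classes)

    def forward(self, doc_vectors: torch.Tensor) -> torch.Tensor:
        # Return raw logits; nn.CrossEntropyLoss applies log-softmax itself.
        return self.linear(doc_vectors)

model = Baseline()
criterion = nn.CrossEntropyLoss()
logits = model(torch.randn(8, 300))                    # batch of 8 documents
loss = criterion(logits, torch.randint(0, 11, (8,)))   # random labels
loss.backward()
```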

The repository also contains the following folders:

  1. data: contains the raw and pre-processed data used by the models. To understand the pipeline from raw to preprocessed data, see the preprocess_utils.py file.
  2. embeddings: folder into which the FastText embeddings should be downloaded and unzipped.
  3. img: contains images.
  4. model_cnn: final model for the convolutional neural network after hyperparameter optimization.
  5. models: final PyTorch model weights for the baseline, HAN and LSTM.
  6. predictions: CSV files containing the predictions and ground-truth labels for each model (see the scoring sketch after this list).
  7. results: CSV files containing the results of the hyperparameter search we conducted using Hyperopt.
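As referenced in item 6, a prediction CSV of this kind can be scored in a few lines. The file name and the column names below are assumptions about the CSV layout, not the repository's documented schema.

```python
# Hypothetical scoring of one of the files in predictions/; the file name
# and the "prediction" / "ground_truth" column names are assumptions.
import pandas as pd

preds = pd.read_csv("predictions/han_predictions.csv")
accuracy = (preds["prediction"] == preds["ground_truth"]).mean()
print(f"Accuracy: {accuracy:.3f}")
```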

To run our code, we recommend using PyCharm or VS Code. The latter can be installed from the Anaconda Navigator.

[Image: installing VS Code from the Anaconda Navigator]

Setup

  1. Download and install Anaconda (Python 3).
  2. Download the latest version of RStudio.
  3. If you want to re-run our data pre-processing steps, download the FastText word embeddings from here. You need the wiki-news-300d-1M.vec.zip file. Save the file in the embeddings folder and unzip it there. If you do not want to re-run the preprocessing steps, you do not need the FastText embeddings to re-run our models; they are included in the preprocessed data files. (A sketch of loading the .vec file follows at the end of this section.)
  4. In a terminal, go to this repository's folder and set up the Conda environment. If you are on Windows, execute:
conda env create -f environment_windows.yml

If you are on Linux, execute:

conda env create -f environment_linux.yml

Note that this installs both the Python and the R requirements. We use a separate R library location, which is set in the .Renviron file.

  5. In R, install the following libraries:
install.packages(c("yardstick", "rBayesianOptimization", "DescTools", "ggExtra"))
  6. Check the .Renviron file to ensure that the path to the Anaconda environment is set correctly. The path should look something like "PATH-TO-ANACONDA-INSTALL/envs/VitalWikiClassifier/lib/R/library". Usually, it is located in one of the following places:
  • "~/Anaconda3/envs/VitalWikiClassifier/lib/R/library"
  • "~/anaconda3/envs/VitalWikiClassifier/lib/R/library"
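For reference, the wiki-news-300d-1M.vec file uses the plain-text word2vec format: a header line with the vocabulary size and dimensionality, followed by one word and 300 floats per line. The repository's own loading code lives in model_utils.py; the standalone loader below is a minimal sketch, and its `limit` parameter is a convenience for quick experiments, not a repo feature.

```python
# Minimal loader for a plain-text .vec embedding file.
import numpy as np

def load_vec(path, limit=None):
    vectors = {}
    with open(path, encoding="utf-8") as f:
        next(f)                            # skip the "<count> <dim>" header
        for i, line in enumerate(f):
            if limit is not None and i >= limit:
                break
            word, *values = line.rstrip().split(" ")
            vectors[word] = np.asarray(values, dtype=np.float32)
    return vectors

# vectors = load_vec("embeddings/wiki-news-300d-1M.vec", limit=50_000)
```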

Hyperparameter optimization

We use the Hyperopt Python module to search for good hyperparameters, using the settings shown below.

[Image: hyperparameter search settings]
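For illustration, a minimal Hyperopt search has the shape below. The search space and the objective are placeholders, not the settings from the table above; in the real scripts the objective trains a model and returns its validation loss.

```python
# Skeleton of a Hyperopt TPE search with a placeholder objective.
from hyperopt import Trials, fmin, hp, tpe

space = {
    "learning_rate": hp.loguniform("learning_rate", -9, -3),
    "dropout": hp.uniform("dropout", 0.0, 0.5),
}

def objective(params):
    # Stand-in for training a model with `params` and returning its loss.
    return params["dropout"]

trials = Trials()
best = fmin(objective, space, algo=tpe.suggest, max_evals=50, trials=trials)
print(best)
```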

You can use the evaluate_baseline_results.R and evaluate_HAN_results.R scripts to evaluate the results of the hyperparameter optimization process. They produce plots like the following:

[Image: example hyperparameter evaluation plot]

Over time, your results should improve as the algorithm learns which settings work well:

[Image: Hyperopt results over successive trials]

Results

We obtain the following results using our models:

[Image: results per model]

The results of a Stuart-Maxwell test show that all models perform better than the baseline model. It is less clear which of the remaining models performs best.

[Image: Stuart-Maxwell test results]
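Our test itself runs in R (6_statistical_test.R, using DescTools). For readers who prefer Python, statsmodels offers an equivalent marginal-homogeneity test; the contingency table below is fabricated example data, not our results.

```python
# Stuart-Maxwell (marginal homogeneity) test via statsmodels. Cell (i, j)
# counts documents that model A assigned to class i and model B to class j.
# The 3x3 table is made-up example data, not our actual outcomes.
import numpy as np
from statsmodels.stats.contingency_tables import SquareTable

table = np.array([
    [50,  5,  2],
    [ 4, 60,  6],
    [ 3,  7, 45],
])

result = SquareTable(table).homogeneity(method="stuart_maxwell")
print(result.statistic, result.pvalue)
```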

Shiny application

We created a small Shiny application that lets you input a document and visualize the HAN's attention weights and prediction scores. The application can be found in the shiny folder.

[Image: Shiny application screenshot]

References

The following papers were instrumental to our work.

[1] Yang, Z., Yang, D., Dyer, C., He, X., Smola, A., & Hovy, E. (2016, June). Hierarchical attention networks for document classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (pp. 1480-1489).

[2] Bojanowski, P., Grave, E., Joulin, A., & Mikolov, T. (2017). Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5, 135-146.

[3] Cho, K., Van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., & Bengio, Y. (2014). Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.

[4] Bergstra, J., Yamins, D., & Cox, D. D. (2013, June). Hyperopt: A Python library for optimizing the hyperparameters of machine learning algorithms. In Proceedings of the 12th Python in Science Conference (pp. 13-20). Citeseer.