
The UN General Debate Corpus (UNGDC) is a dataset of all speeches given at the high-level UN forum usually held in September of each year.


nicolasdz/UNGDC


UNGDC

The UN General Debate (UNGD) is the annual high-level event at which each UN member state can address all the others. The UNGD Corpus (UNGDC) provides the English-language text of speeches from 200 countries between 1970 and 2018: 8,093 speeches in total.

The UNGDC dataset was created by Slava Jankin Mikhaylov, Alexander Baturo, and Niheer Dasandi in 2017. See their GitHub repository for the latest version. You can also find all of their replication materials on this webpage. Their 2017 article in the journal Research & Politics, which draws on the dataset, can be found here (paywalled).

In this repository, I provide a simple Jupyter notebook demonstrating how the dataset can be used to analyze and visualize trends in global diplomacy. It uses the pandas and spaCy packages to do some simple NLP on the dataset, with three main applications: tracking topic mentions across countries, tracking topic mentions over time, and performing semantic similarity analysis. At a later date, I hope to add a demonstration of how named entity recognition can be used to visualize the UNGDC as a network.
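
As a rough illustration of the first two applications (topic mentions across countries and over time), here is a minimal sketch using pandas only. The column names (`country`, `year`, `text`), the toy rows, and the `count_mentions` helper are all assumptions made up for this example; they are not taken from the notebook or from the corpus files, and the matching is plain case-insensitive substring counting rather than anything spaCy-based.

```python
import re

import pandas as pd

# Hypothetical toy rows standing in for the corpus.
# The column names are assumptions, not the dataset's actual schema.
speeches = pd.DataFrame(
    {
        "country": ["USA", "USA", "FRA", "FRA"],
        "year": [1990, 2015, 1990, 2015],
        "text": [
            "We must pursue peace and disarmament.",
            "Climate change threatens us all. Climate action cannot wait.",
            "Peace in Europe depends on cooperation.",
            "The climate agreement is a historic step.",
        ],
    }
)


def count_mentions(df, term):
    """Return a copy of df with a 'mentions' column counting
    case-insensitive occurrences of term in each speech (hypothetical helper)."""
    out = df.copy()
    pattern = re.escape(term.lower())  # treat the term as literal text
    out["mentions"] = out["text"].str.lower().str.count(pattern)
    return out


climate = count_mentions(speeches, "climate")

# Topic mentions across countries (summed over all years).
by_country = climate.groupby("country")["mentions"].sum()

# Topic mentions over time (summed over all countries).
by_year = climate.groupby("year")["mentions"].sum()

print(by_country)
print(by_year)
```

With the real corpus, the same two `groupby` calls would apply once the speeches are loaded into a DataFrame with equivalent columns. Substring counting is a deliberately crude choice: it will also match longer words that contain the term, so a tokenizer-based count may be preferable for serious analysis.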
