Skip to content

Turkish writings dataset that promotes creativity, content, composition, grammar, spelling and punctuation.

Notifications You must be signed in to change notification settings

selimfirat/bilkent-turkish-writings-dataset

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

6 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Bilkent Turkish Writings Dataset

This dataset contains the turkish creative writings of Turkish 101 and Turkish 102 courses between 2014-2018. It contains 4 publicly published writings of students 2 for each course. The writings in this dataset promotes creativity, content, composition, grammar, spelling and punctuation.

The writings can be found here as bunch of PDFs.

The dataset is continuously growing since each semester new texts are published publicly.

Currently, there are 6,844 writings in this dataset which is 33.1MB of data in a csv file.

Description of Turkish 101 & 102 in Bilkent University

This course is the first of a sequence of two courses designed to develop creative writing skills of the students through their own writings in Turkish. It is an active learning course. Students write their own blogs and instructors comment and send feedback about the creativity, content, composition, grammar, spelling and punctuation of the writing regularly.

Downloading the dataset

The data can be found in ./data/texts.csv.

git clone https://github.com/selimfirat/bilkent-turkish-writings-dataset.git
mv ./bilkent-turkish-writings-dataset/data/texts.csv <TARGET_PATH>

How to scrape from scratch

git clone https://github.com/selimfirat/bilkent-turkish-writings-dataset.git
pip install -r requirements.txt
cd bilkent-turkish-writings-dataset/scraper
scrapy crawl bilkent_turkish_writings
cd ../
python convert_to_text.py

In the end, there will be ~2GB of PDFs(it worth to continuous crawling & preprocessing) which can be deleted after the converting to text is done. The last two line suggested to executed using this notebook.

About

Turkish writings dataset that promotes creativity, content, composition, grammar, spelling and punctuation.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published