embeddings

Continuous word representations for Coptic

Introduction

This repository contains work in progress on word embeddings for Sahidic Coptic. The initial release contains proof-of-concept 50-dimensional embeddings trained with word2vec on approximately 1 million words, with a vocabulary of about 10K items. The vocabulary assumes segmented forms based on Coptic Scriptorium guidelines (no bound-group embeddings are included).
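As a rough illustration of how embeddings like these might be queried, the sketch below parses vectors from the plain-text word2vec layout (an optional `count dim` header, then one `token v1 ... vN` row per line) and ranks neighbours by cosine similarity. The file format inside `coptic_50d.zip` is not documented here, so the text layout is an assumption, and the member name passed to `load_from_zip` is hypothetical. Query tokens would need to be segmented forms, since no bound-group vectors are included.

```python
import io
import math
import zipfile
from typing import Dict, Iterable, List


def parse_word2vec_text(lines: Iterable[str]) -> Dict[str, List[float]]:
    """Parse rows of 'token v1 ... vN'; skip an optional 'count dim' header.

    Assumption: the vectors are stored in word2vec *text* format.
    """
    vectors: Dict[str, List[float]] = {}
    for i, line in enumerate(lines):
        parts = line.split()
        if not parts:
            continue  # ignore blank lines
        if i == 0 and len(parts) == 2 and all(p.isdigit() for p in parts):
            continue  # header line: vocabulary size and dimensionality
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def load_from_zip(zip_path: str, member: str) -> Dict[str, List[float]]:
    """Read one text member of a zip archive; 'member' is a hypothetical name."""
    with zipfile.ZipFile(zip_path) as archive:
        with archive.open(member) as raw:
            return parse_word2vec_text(io.TextIOWrapper(raw, encoding="utf-8"))


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest(vectors: Dict[str, List[float]], query: str, k: int = 5) -> List[str]:
    """Return the k tokens most similar to 'query', excluding the query itself."""
    q = vectors[query]
    scored = [(cosine(q, v), w) for w, v in vectors.items() if w != query]
    return [w for _, w in sorted(scored, reverse=True)[:k]]
```

With about 10K vocabulary items and 50 dimensions, a brute-force scan like `nearest` should be fast enough for interactive use; a library such as gensim could be swapped in if the file turns out to be in a format it reads directly.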

Data

The current embeddings are based on the entire available Sahidic text of the Old Testament (CoptOT project text), the New Testament (Sahidica version), and all Coptic Scriptorium dev data as of Spring 2020. The data is mostly segmented automatically using CopticScriptorium/Coptic-NLP, though gold segmentation was used for Scriptorium data where available.

Files in this repository (CopticScriptorium/embeddings): README.md and coptic_50d.zip.
