Hi, I would be happy to contribute data and insights that would help develop a tokenizer for Medieval/Premodern Slavic.
Currently I am using tokconfig-rus on this data, and there is room for improvement; for example, the resulting sentences are either very short or very long (please see below for some examples).
Some of the data characteristics:
- the character set of this data is nonstandard, including the punctuation
- sentence delimiters are typically nonstandard or nonexistent (· or ∙ are often used between words, but are typically not true sentence delimiters)
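To illustrate the second point, here is a minimal sketch (the transcription fragment is a hypothetical example, not from the actual data) showing why naively splitting on the interpunct characters produces word groups rather than sentences:

```python
import re

# Hypothetical premodern Slavic fragment; "·" appears between words/phrases,
# not at sentence boundaries.
text = "въ оно врѣмѧ · изиде иисоусъ · и оученици его"

# Splitting on the interpunct variants U+00B7 and U+2219:
segments = [s.strip() for s in re.split(r"[·∙]", text) if s.strip()]
print(segments)
# Each resulting segment is a short word group, not a full sentence,
# so these characters cannot be treated as sentence delimiters.
```

A tokenizer config for this material would therefore need to treat · and ∙ as word-level separators while finding sentence boundaries by other means.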
There are no real gold standards of orthography in this period, and I suspect there is also no strongly gold-labeled data.
I looked into the Stanza and UDPipe sentence splitters, but they performed suboptimally.
Would you be interested in creating a Premodern Slavic config? Or would you suggest another approach?
