Develop a tokenizer for Premodern Slavic #91

@pirolen

Description

Hi, I would be happy to contribute data and insights to help develop a tokenizer for Medieval/Premodern Slavic.
Currently I am using tokconfig-rus on this data, but there is room for improvement: for example, the resulting sentences come out either very short or very long (see the examples below).

Some of the data characteristics:

  • the character set of this data is nonstandard, including the punctuation
  • sentence delimiters are typically nonstandard or nonexistent (marks such as · are often used between words, but they are typically not true sentence delimiters)
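To illustrate the second point, here is a minimal sketch (the function name and mark list are my own, purely hypothetical choices) of what naively treating interpunct-like marks as boundaries would do. Because these marks usually separate words or clauses rather than sentences, splitting on them over-segments the text, which is exactly why a dedicated config would be needed:

```python
import re

# Hypothetical illustration: interpunct-like marks sometimes seen in
# premodern Slavic manuscripts. This list is an assumption, not a
# documented inventory.
INTERPUNCT_MARKS = r"[\u00B7\u0387]"  # MIDDLE DOT, GREEK ANO TELEIA

def split_on_interpunct(text: str) -> list[str]:
    """Split on interpunct-like marks and drop empty segments.

    This over-segments, since the marks are word/clause separators,
    not true sentence delimiters.
    """
    parts = re.split(INTERPUNCT_MARKS, text)
    return [p.strip() for p in parts if p.strip()]

# Every word-level · becomes a bogus "sentence" boundary here.
print(split_on_interpunct("слово · слово · слово"))
```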

There is no real gold standard of orthography for this period, and I suspect there is also no strongly gold-labeled data.
I looked into the Stanza and UDPipe sentence splitters, but they performed suboptimally.

Would you be interested in creating a Premodern Slavic config? Or would you suggest another approach?

[Screenshot (2023-07-17): examples of the current segmentation output]
