
6.0.0

github-actions released this 03 Sep 16:22

The 6.0 release does not introduce any major new features but changes the behavior of multiple components and introduces non-backward-compatible API changes, necessitating a major release.

Backward-incompatible changes

Tag parsing has changed, affecting not only the internal data structures of the container classes but also the user-facing command line interface. The mapping of line tags to recognition models in kraken ocr's -m argument now always uses the resolved type of the line. For ALTO files, the resolved type is determined by any tag reference pointing to a tag element that either has a TYPE attribute with the value type or no TYPE attribute at all. For PageXML files it is determined by the custom string structure {type: $value;}.

These changes are in preparation for the eventual removal of per-tag-recognition as it prevents optimizing recognition throughput with batching.

New features

The model repository has seen a major upgrade with a new metadata schema called HTRMoPo that allows uploading more model types (segmentation, recognition, reading order, ...) and includes support for informative huggingface-style model cards. The new implementation also caches the model repository state for faster querying, has support for versioned models, and allows filtering of output based on various metadata fields. Interaction with the repository using the command line drivers is documented here.

The API and command line driver for reading order model training (ketos rotrain) now support the same filtering and merging options as the segmentation training tools, which makes it easier to train RO models when the corresponding segmentation model has been trained using these options.

Testing recognition models with ketos test now also computes a case-insensitive character error rate (thanks Weslley Oliveira!).

Per-step and average epoch training loss is now printed on the progress bars of all training tools (ketos pretrain, ketos rotrain, ketos segtrain, ketos train).

The contrib/repolygonize.py script now allows setting the scale of the polygonization input with the --scale option (thanks Weslley Oliveira!).

contrib/set_seg_options.py can now also set the segmentation model's line location option to centerline.

A new contrib/add_neural_ro.py script can be used to add a new reading order generated by a neural reading order model to an existing XML facsimile.

A softmax temperature option has been added to smooth out the confidence distribution of the character confidences of text recognition output. The option is available as an argument to TorchSeqRecognizer and the --temperature setting on the kraken ocr subcommand.
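The underlying operation can be sketched in plain Python (a minimal illustration of temperature scaling, not kraken's internal implementation):

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax. Temperatures above 1.0 flatten the
    confidence distribution, temperatures below 1.0 sharpen it; a
    temperature of 1.0 leaves the standard softmax unchanged."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# A confident prediction becomes less peaked at temperature 2.0:
probs = softmax([4.0, 1.0, 0.5], temperature=2.0)
```

Smoothing the distribution this way is useful when downstream consumers treat the raw character confidences as overconfident.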

Removed features

The synthetic line generation tools were removed as they were only useful for training legacy line-strip recognition models. The recommended alternative that is compatible with baseline-style models is the new pangoline tool. A short description of how to prepare kraken training data with it is available here in the docs.

Likewise, the legacy HTML file-based transcription environment was removed as it never supported transcription of baseline segmentation data. eScriptorium is the suggested replacement.

Installation through anaconda has been removed. Because coreml is not maintained in conda-forge, a pure conda installation without side-loading packages through pip has not been possible for a long while.

Misc. Changes

All valid floating point precision values known to pytorch lightning can now be used with the --precision option of ketos.

scripts.json has been updated to include the new scripts encoded by Unicode 16.

The reading order training code has been refactored.

Region filtering now supports types containing $.

contrib/extract_lines.py now always writes output as RGB images.

The pytorch pin has been relaxed to accept versions between 2.4.0 and 2.7.x.

API changes

The XML parsing, container classes, and tagging have been revamped, introducing a number of changes.

Tags

Tags on the container classes (Region, BaselineLine, BboxLine) were previously a simple dictionary of string keys and values, which was less expressive than Transkribus-style custom strings mapping an identifier to one or more dictionaries, e.g. language {id: eng; name: English} language {id: heb; name: Hebrew}. With the current release all tags are in a dict-of-list-of-dicts format, regardless of their source (PageXML or ALTO files); the example above becomes {'language': [{'id': 'eng', 'name': 'English'}, {'id': 'heb', 'name': 'Hebrew'}]}. Tags parsed from ALTO's tag reference system, which only allows serialization of key-value pairs, are expanded by introducing a dummy key 'type' in the value dicts, i.e.

```xml
<Tags>
  <OtherTag ID="foo" LABEL="heb" TYPE="language"/>
  ...
</Tags>
...
<TextLine ... TAGREFS="foo">...
```

will result in a tags property on the parsed line with the value {'language': [{'type': 'heb'}]}. When multiple tags with the same TYPE are referenced, the value dicts are aggregated into a list (PageXML custom strings are treated analogously):

```xml
<Tags>
  <OtherTag ID="foo" LABEL="heb" TYPE="language"/>
  <OtherTag ID="bar" LABEL="eng" TYPE="language"/>
  ...
</Tags>
...
<TextLine ... TAGREFS="foo bar">...
```

will be parsed as {'language': [{'type': 'heb'}, {'type': 'eng'}]}. The TYPE attribute is not obligatory in ALTO files; if it is missing, the tag is treated as having the TYPE value type.
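The aggregation behaviour can be sketched in plain Python (the helper name and the input shape are illustrative, not part of the kraken API):

```python
from collections import defaultdict

def merge_tag_refs(tag_elements, tagrefs):
    """Aggregate referenced ALTO tag elements into the dict-of-list-of-dicts
    tag format. `tag_elements` maps each Tags element ID to its
    (TYPE, LABEL) attribute pair; `tagrefs` is the whitespace-separated
    TAGREFS value of a line."""
    tags = defaultdict(list)
    for ref in tagrefs.split():
        tag_type, label = tag_elements[ref]
        # A missing TYPE attribute is treated as having the value 'type'.
        key = tag_type if tag_type is not None else 'type'
        # The LABEL is expanded into a value dict under the dummy key 'type'.
        tags[key].append({'type': label})
    return dict(tags)

elements = {'foo': ('language', 'heb'), 'bar': ('language', 'eng')}
tags = merge_tag_refs(elements, 'foo bar')
# {'language': [{'type': 'heb'}, {'type': 'eng'}]}
```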

Baseline and Bbox XML parsing

The XMLPage class is now able to parse input facsimile files as containing either bounding boxes or baselines by changing the value of the linetype argument:

```python
>>> from kraken.lib.xml import XMLPage
>>> doc = XMLPage('alto.xml', linetype='baselines').to_container()
>>> print(doc.type)
baselines
>>> doc.lines[0]
BaselineLine(id='eSc_line_192895', baseline=[(848, 682), (934, 678), (1027, 689), (1214, 696), (2731, 700)], boundary=[(844, 678), (851, 635), (1038, 649), (1053, 635), (1110, 635), (1182, 664), (1311, 656), (1351, 635), (1365, 649), (1469, 635), (1505, 664), (1552, 646), (1570, 660), (1599, 635), (1685, 667), (1746, 653), (1786, 664), (1822, 639), (1947, 667), (2199, 667), (2289, 639), (2346, 667), (2386, 649), (2422, 667), (2497, 667), (2526, 642), (2619, 664), (2637, 649), (2670, 667), (2716, 656), (2727, 696), (2716, 761), (2673, 761), (2645, 735), (2555, 739), (2537, 753), (2508, 743), (2490, 761), (2458, 735), (2393, 757), (2364, 739), (2267, 761), (2163, 743), (2080, 761), (2005, 739), (1969, 761), (1929, 739), (1865, 757), (1807, 739), (1764, 761), (1732, 739), (1602, 761), (1530, 743), (1509, 753), (1484, 735), (1459, 757), (1405, 743), (1351, 757), (1304, 735), (1283, 757), (1232, 757), (1193, 732), (1168, 757), (1124, 757), (1067, 732), (1045, 746), (999, 732), (848, 732)], text="בשאול וגו' ˙ אם יחבאו בראש הכרמל וגו' אם ילכו בשבי וגו' אין חשך ואין [צל']", base_dir='L', type='baselines', imagename=None, tags=None, split=None, regions=['eSc_textblock_10523'], language=['iai'])
>>> doc = XMLPage('alto.xml', linetype='bbox').to_container()
>>> print(doc.type)
bbox
>>> doc.lines[0]
BBoxLine(id='eSc_line_192895', bbox=(844, 635, 2727, 761), text="בשאול וגו' ˙ אם יחבאו בראש הכרמל וגו' אם ילכו בשבי וגו' אין חשך ואין [צל']", base_dir='L', type='bbox', imagename=None, tags=None, split=None, regions=['eSc_textblock_10523'], text_direction='horizontal-lr', language=['iai'])
```

This simplifies using text recognition models trained on bounding box data with input data in XML format. Instead of manually creating the appropriate Segmentation object, it is now possible to run the parser with linetype set and hand the container to rpred.rpred().

When the source files are PageXML, the bounding boxes around lines are computed from the maximum extent of the line bounding polygon. For ALTO files the bounding boxes are taken from the HPOS, VPOS, HEIGHT, and WIDTH attributes, which means that no bounding polygons need to be defined in a Shape element.
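The PageXML case, a bounding box as the maximum extent of the bounding polygon, amounts to the following (a minimal sketch; polygon_to_bbox is an illustrative name, not kraken's internal function):

```python
def polygon_to_bbox(polygon):
    """Reduce a line bounding polygon, a list of (x, y) points, to its
    maximum extent as an (x0, y0, x1, y1) bounding box."""
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return (min(xs), min(ys), max(xs), max(ys))

# A simplified four-point polygon reduces to its axis-aligned extent:
bbox = polygon_to_bbox([(844, 678), (2731, 700), (848, 732), (2727, 761)])
# (844, 678, 2731, 761)
```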

Language parsing

In addition, the parser now extracts language information from source files, the Region/BBoxLine/BaselineLine classes have a new language property containing a list of language identifiers, and the standard output format templates serialize the field correctly. For PageXML files these identifiers are validated against the ISO 639-3 standard; for ALTO files the values are taken as-is. Inheritance from the page and region level is handled correctly, but the distinction between the primaryLanguage and secondaryLanguage attributes is lost during parsing as they are merged with any language identifiers in the custom string. The current uses of this system are limited but are in preparation for integration of the new party recognizer. For ALTO files language information is taken from the LANG attribute and any references to tags that have a type of language.
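The inheritance and merging described above can be sketched as follows (an illustration of the general notion only; the helper name, argument order, and precedence are assumptions, not kraken's actual code):

```python
def resolve_languages(page_langs, region_langs, line_langs):
    """Merge language identifiers for a line with those inherited from
    its enclosing region and page, dropping duplicates while keeping
    the first occurrence of each identifier."""
    merged = []
    for lang in (line_langs or []) + (region_langs or []) + (page_langs or []):
        if lang not in merged:
            merged.append(lang)
    return merged

# A line tagged 'heb' and 'iai' inside a 'heb' region on an 'eng' page:
langs = resolve_languages(['eng'], ['heb'], ['heb', 'iai'])
# ['heb', 'iai', 'eng']
```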

Hyperparameter register

lib/register.py is a new module that contains valid values for hyperparameters like optimizers, schedulers, precision, and stoppers.

Bugfixes

  • 0053402: Correct return value for image load error in extract line & line path (rlskoeser) #665
  • d356587: Add a test for image error handling (rlskoeser) #665
  • bbf4336: Fix Augmentation Issues (Weslley Oliveira) #673
  • b435c77: Bug fix for class determination in RO dataset
  • 8a13475: Fix a situation where unicodedata.category is not covering up enough (Thibault Clérice) #692
  • 9a218ce: Prefix uuids with _ to make them valid xml:ids

Among many others.