The 6.0 release does not introduce any major new features but changes the behavior of multiple components and introduces non-backward-compatible API changes, necessitating a major release.
Backward-incompatible changes
Tag parsing has changed, which affects not only the internal data structures of the container classes but also the user-facing command line interface. The mapping of line tags to recognition models in `kraken ocr`'s `-m` argument now always uses the resolved type of the line. For ALTO files, the resolved type is determined by any tag reference pointing to a tag element that either has a `TYPE` attribute with the value `type` or no `TYPE` attribute at all. For PageXML files, it is determined by the custom string structure `{type: $value;}`.
These changes are in preparation for the eventual removal of per-tag-recognition as it prevents optimizing recognition throughput with batching.
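The PageXML custom string resolution mentioned above can be sketched as follows. `parse_custom_string` is a hypothetical helper for illustration only, not kraken's internal parser:

```python
import re

def parse_custom_string(custom):
    """Parse a Transkribus-style custom string such as
    'structure {type: heading;} language {id: eng;}' into a dict
    mapping each identifier to a list of attribute dicts.
    Illustrative sketch; kraken's actual parser may differ."""
    tags = {}
    for ident, body in re.findall(r'(\w+)\s*\{([^}]*)\}', custom):
        attrs = {}
        for pair in body.split(';'):
            if ':' in pair:
                key, value = pair.split(':', 1)
                attrs[key.strip()] = value.strip()
        tags.setdefault(ident, []).append(attrs)
    return tags

print(parse_custom_string('structure {type: heading;}'))
# {'structure': [{'type': 'heading'}]}
```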
New features
The model repository has seen a major upgrade with a new metadata schema called HTRMoPo that allows uploading more model types (segmentation, recognition, reading order, ...) and includes support for informative huggingface-style model cards. The new implementation also caches the model repository state for faster querying, has support for versioned models, and allows filtering of output based on various metadata fields. Interaction with the repository using the command line drivers is documented here.
The API and command line driver for reading order model training (`ketos rotrain`) now support the same filtering and merging options as the segmentation training tools, which makes it easier to train RO models when the corresponding segmentation model has been trained using these options.
Testing recognition models with `ketos test` now also computes a case-insensitive character error rate. (Thanks Weslley Oliveira!)
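A character error rate is the edit distance between reference and hypothesis divided by the reference length; the case-insensitive variant simply lowercases both strings first. The sketch below illustrates the metric and is not kraken's implementation:

```python
def levenshtein(a, b):
    """Edit distance between two strings via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def cer(reference, hypothesis, case_insensitive=False):
    """Character error rate: edit distance / reference length."""
    if case_insensitive:
        reference, hypothesis = reference.lower(), hypothesis.lower()
    return levenshtein(reference, hypothesis) / max(len(reference), 1)

print(cer('Kraken', 'kraken'))                         # 1/6, casing counts
print(cer('Kraken', 'kraken', case_insensitive=True))  # 0.0
```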
Per-step and average epoch training losses are now printed on the progress bars of all training tools (`ketos pretrain`, `ketos rotrain`, `ketos segtrain`, `ketos train`).
`contrib/repolygonize.py` now allows setting the scale of the polygonization input with the `--scale` option. (Thanks Weslley Oliveira!)
`contrib/set_seg_options.py` can now also set the segmentation model's line location option to `centerline`.
A new `contrib/add_neural_ro.py` script can be used to add a reading order generated by a neural reading order model to an existing XML facsimile.
A softmax temperature option has been added to smooth out the confidence distribution of the character confidences of text recognition output. The option is available as an argument to `TorchSeqRecognizer` and as the `--temperature` setting on the `kraken ocr` subcommand.
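Temperature scaling divides the logits by a constant before the softmax: a temperature above 1 flattens the distribution (lower top confidence), below 1 sharpens it. A minimal illustration of the effect, not kraken's code:

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax with temperature scaling. T > 1 flattens the
    distribution, T < 1 sharpens it."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                       # subtract max for stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

sharp = softmax([4.0, 1.0, 0.5])
flat = softmax([4.0, 1.0, 0.5], temperature=2.0)
# the top-1 confidence drops as the temperature increases
print(max(sharp), max(flat))
```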
Removed features
The synthetic line generation tools were removed as they were only useful for training legacy line-strip recognition models. The recommended alternative, which is compatible with baseline-style models, is the new pangoline tool. A short description of how to prepare kraken training data with it is available in the docs.
Likewise, the legacy HTML file-based transcription environment was removed as it never supported transcription of baseline segmentation data. eScriptorium is the suggested replacement.
Installation through Anaconda has been removed. Because coreml is not maintained in conda-forge, a pure conda installation without side-loading packages through pip has not been possible for a long while.
Misc. Changes
All valid floating point precision values known to PyTorch Lightning can now be used with the `--precision` option of `ketos`.
`scripts.json` has been updated to include the new scripts encoded in Unicode 16.
The reading order training code has been refactored.
Region filtering now supports types containing `$`.
`contrib/extract_lines.py` now always writes output as RGB images.
The PyTorch pin has been relaxed to accept versions between 2.4.0 and 2.7.x.
API changes
The XML parsing, container classes, and tagging have been revamped, introducing a number of changes.
Tags
Tags on the container classes (`Region`, `BaselineLine`, `BboxLine`) were previously a simple dictionary of string keys and values, which was less expressive than the Transkribus-style custom strings mapping an identifier to one or more dictionaries, e.g. `language {id: eng; name: English} language {id: heb; name: Hebrew}`. With the current release all tags are in dict-of-list-of-dicts format, no matter their source (PageXML or ALTO files); the example above becomes `{'language': [{'id': 'eng', 'name': 'English'}, {'id': 'heb', 'name': 'Hebrew'}]}`. Tags parsed from ALTO's tag reference system, which only allows serialization of key-value pairs, are expanded by introducing a dummy key `'type'` in the value dicts, i.e.
```xml
<Tags>
  <OtherTag ID="foo" LABEL="heb" TYPE="language"/>
  ...
</Tags>
...
<TextLine ... TAGREFS="foo">...
```

will yield a `tags` property on the parsed line of `{'language': [{'type': 'heb'}]}`.
When multiple tags with the same `TYPE` are referenced, the value dicts are aggregated into a list (PageXML custom strings are treated analogously):
```xml
<Tags>
  <OtherTag ID="foo" LABEL="heb" TYPE="language"/>
  <OtherTag ID="bar" LABEL="eng" TYPE="language"/>
  ...
</Tags>
...
<TextLine ... TAGREFS="foo bar">...
```

will be parsed as `{'language': [{'type': 'heb'}, {'type': 'eng'}]}`.
The `TYPE` field is not obligatory in ALTO files; if it is missing, the `TYPE` will be treated as having the value `type`.
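For code written against the pre-6.0 API, the change from flat string-to-string tag dicts to the new format can be bridged with a small shim. `upgrade_tags` is a hypothetical migration helper, not part of kraken; it uses the same dummy `'type'` key as the ALTO expansion described above:

```python
def upgrade_tags(old_tags):
    """Convert a pre-6.0 flat tag dict ({'type': 'heading'}) into the
    6.0 dict-of-list-of-dicts form ({'type': [{'type': 'heading'}]}).
    Hypothetical migration helper, not part of kraken itself."""
    return {key: [{'type': value}] for key, value in (old_tags or {}).items()}

print(upgrade_tags({'type': 'heading'}))
# {'type': [{'type': 'heading'}]}
```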
Baseline and Bbox XML parsing
The `XMLPage` class can now parse input facsimile files as containing either bounding boxes or baselines by changing the value of the `linetype` argument:
```python
>>> from kraken.lib.xml import XMLPage
>>> doc = XMLPage('alto.xml', linetype='baselines').to_container()
>>> print(doc.type)
baselines
>>> doc.lines[0]
BaselineLine(id='eSc_line_192895', baseline=[(848, 682), (934, 678), (1027, 689), (1214, 696), (2731, 700)], boundary=[(844, 678), (851, 635), (1038, 649), (1053, 635), (1110, 635), (1182, 664), (1311, 656), (1351, 635), (1365, 649), (1469, 635), (1505, 664), (1552, 646), (1570, 660), (1599, 635), (1685, 667), (1746, 653), (1786, 664), (1822, 639), (1947, 667), (2199, 667), (2289, 639), (2346, 667), (2386, 649), (2422, 667), (2497, 667), (2526, 642), (2619, 664), (2637, 649), (2670, 667), (2716, 656), (2727, 696), (2716, 761), (2673, 761), (2645, 735), (2555, 739), (2537, 753), (2508, 743), (2490, 761), (2458, 735), (2393, 757), (2364, 739), (2267, 761), (2163, 743), (2080, 761), (2005, 739), (1969, 761), (1929, 739), (1865, 757), (1807, 739), (1764, 761), (1732, 739), (1602, 761), (1530, 743), (1509, 753), (1484, 735), (1459, 757), (1405, 743), (1351, 757), (1304, 735), (1283, 757), (1232, 757), (1193, 732), (1168, 757), (1124, 757), (1067, 732), (1045, 746), (999, 732), (848, 732)], text="בשאול וגו' ˙ אם יחבאו בראש הכרמל וגו' אם ילכו בשבי וגו' אין חשך ואין [צל']", base_dir='L', type='baselines', imagename=None, tags=None, split=None, regions=['eSc_textblock_10523'], language=['iai'])
>>> doc = XMLPage('alto.xml', linetype='bbox').to_container()
>>> print(doc.type)
bbox
>>> doc.lines[0]
BBoxLine(id='eSc_line_192895', bbox=(844, 635, 2727, 761), text="בשאול וגו' ˙ אם יחבאו בראש הכרמל וגו' אם ילכו בשבי וגו' אין חשך ואין [צל']", base_dir='L', type='bbox', imagename=None, tags=None, split=None, regions=['eSc_textblock_10523'], text_direction='horizontal-lr', language=['iai'])
```
This simplifies using text recognition models trained on bounding box data with input data in XML format. Instead of manually creating the appropriate `Segmentation` object, it is now possible to just run the parser with `linetype` set and hand the container to `rpred.rpred()`.
When the source files are PageXML, the bounding boxes around lines are computed from the maximum extent of the line bounding polygon. For ALTO files, the bounding boxes are taken from the `HPOS`, `VPOS`, `HEIGHT`, and `WIDTH` attributes, which means that no bounding polygons need to be defined in a `Shape` element.
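Deriving a box from a polygon's maximum extent is just the min/max of its coordinates. A sketch of that computation, mirroring the PageXML behaviour described above (not kraken's own code):

```python
def polygon_to_bbox(boundary):
    """Axis-aligned bounding box (x0, y0, x1, y1) covering the maximum
    extent of a line's bounding polygon, given as (x, y) tuples."""
    xs, ys = zip(*boundary)
    return (min(xs), min(ys), max(xs), max(ys))

print(polygon_to_bbox([(844, 678), (2731, 700), (848, 732), (2727, 761)]))
# (844, 678, 2731, 761)
```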
Language parsing
In addition, the parser now extracts language information from source files: the `Region`, `BBoxLine`, and `BaselineLine` classes have a new `language` property containing a list of language identifiers, and the standard output format templates serialize the field correctly. For PageXML files these identifiers are validated against the ISO 639-3 standard; for ALTO files the values are taken as-is. Inheritance from the page and region level is handled correctly, but the distinction between the `primaryLanguage` and `secondaryLanguage` attributes is lost during parsing, as they are merged with any language identifiers in the custom string. For ALTO files, language information is taken from the `LANG` attribute and from any references to tags that have a type of `language`. The current uses of this system are limited but are in preparation for the integration of the new party recognizer.
Hyperparameter register
`lib/register.py` is a new module containing the valid values for hyperparameters such as optimizers, schedulers, precision, and stopping criteria.
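The point of such a register is that a CLI can fail early with a readable error instead of deep inside a training run. The names and values below are purely illustrative, not the actual contents of `lib/register.py`:

```python
# Illustrative registers; kraken's actual names and values may differ.
OPTIMIZERS = frozenset({'Adam', 'AdamW', 'SGD'})
SCHEDULERS = frozenset({'constant', 'cosine', '1cycle'})

def validate(name, register, kind):
    """Check a user-supplied hyperparameter value against a register
    and raise a readable error for unknown values."""
    if name not in register:
        raise ValueError(f'unknown {kind} {name!r}; valid: {sorted(register)}')
    return name

print(validate('Adam', OPTIMIZERS, 'optimizer'))
# Adam
```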
Bugfixes
- 0053402: Correct return value for image load error in extract line & line path (rlskoeser) #665
- d356587: Add a test for image error handling (rlskoeser) #665
- bbf4336: Fix Augmentation Issues (Weslley Oliveira) #673
- b435c77: Bug fix for class determination in RO dataset
- 8a13475: Fix a situation where unicodedata.category is not covering up enough (Thibault Clérice) #692
- 9a218ce: Prefix uuids with `_` to make them valid xml:ids
Among many others.