destructive behaviour in edge-cases #83

@aflueckiger

As of v0.3.3, pySBD behaves destructively in some edge cases even when the option clean is set to False.
On OCR text, pySBD drops the whitespace that follows certain runs of periods (two or four in the examples below, but not three).

To reproduce

import pysbd

splitter = pysbd.Segmenter(language="fr", clean=False)

text = "Maissen se chargea du reste .. Logiquement,"
print(splitter.segment(text))

text = "Maissen se chargea du reste ... Logiquement,"
print(splitter.segment(text))

text = "Maissen se chargea du reste .... Logiquement,"
print(splitter.segment(text))

Actual output
Please note the missing whitespace after the final period in the examples with ".." and "....".

['Maissen se chargea du reste .', '.', 'Logiquement,']
['Maissen se chargea du reste ... ', 'Logiquement,']
['Maissen se chargea du reste .', '...', 'Logiquement,']

Expected output

['Maissen se chargea du reste .', '. ', 'Logiquement,']
['Maissen se chargea du reste ... ', 'Logiquement,']
['Maissen se chargea du reste .', '... ', 'Logiquement,']
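
A minimal sketch of how the loss could be checked without running pySBD, assuming that a non-destructive segmenter should return segments that concatenate back to the input. The helper name first_lost_index is hypothetical and not part of pySBD; the segment lists are copied from the actual and expected outputs for the ".." example.

```python
def first_lost_index(text, segments):
    """Return None if the segments concatenate back to text,
    otherwise the index in text where they first diverge.
    Hypothetical helper, not part of pySBD."""
    joined = "".join(segments)
    if joined == text:
        return None
    for i, (want, got) in enumerate(zip(text, joined)):
        if want != got:
            return i
    # One string is a strict prefix of the other.
    return min(len(text), len(joined))


text = "Maissen se chargea du reste .. Logiquement,"
actual = ['Maissen se chargea du reste .', '.', 'Logiquement,']
expected = ['Maissen se chargea du reste .', '. ', 'Logiquement,']

lost_at = first_lost_index(text, actual)
print(lost_at, repr(text[lost_at]))      # position and character that went missing
print(first_lost_index(text, expected))  # None when nothing is lost
```

The same check could be applied to the real output of splitter.segment(text) to catch other inputs where characters are dropped.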

In general, pySBD works well. Many thanks @nipunsadvilkar. I can also look into this as soon as I find some time and open a pull request.

Labels: bug, edge-cases (update rules to account for the edge cases)