Description
As of v0.3.3, pySBD shows destructive behavior in some edge cases even when the Segmenter is created with clean=False.
When dealing with OCR text, pySBD drops the whitespace that follows a run of two or four periods; a run of three periods is handled correctly.
To reproduce
import pysbd
splitter = pysbd.Segmenter(language="fr", clean=False)
text = "Maissen se chargea du reste .. Logiquement,"
print(splitter.segment(text))
text = "Maissen se chargea du reste ... Logiquement,"
print(splitter.segment(text))
text = "Maissen se chargea du reste .... Logiquement,"
print(splitter.segment(text))
Actual output
Please note the missing whitespace after the final period in the examples with two periods (..) and four periods (....).
['Maissen se chargea du reste .', '.', 'Logiquement,']
['Maissen se chargea du reste ... ', 'Logiquement,']
['Maissen se chargea du reste .', '...', 'Logiquement,']
Expected output
['Maissen se chargea du reste .', '. ', 'Logiquement,']
['Maissen se chargea du reste ... ', 'Logiquement,']
['Maissen se chargea du reste .', '... ', 'Logiquement,']
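The difference between the actual and expected output can be checked mechanically. The following is a minimal sketch, assuming that with clean=False the segments are supposed to concatenate back to the input string (which is what the expected output above implies). The helper name first_mismatch is hypothetical and not part of pySBD; the segment lists are copied from the outputs in this report.

```python
def first_mismatch(text, segments):
    """Return the index where the joined segments first differ from text,
    or None if the segments reproduce the text exactly.

    Hypothetical helper for this report, not a pySBD function.
    """
    joined = "".join(segments)
    if joined == text:
        return None
    for i, (a, b) in enumerate(zip(text, joined)):
        if a != b:
            return i
    # One string is a prefix of the other: they diverge at the shorter length.
    return min(len(text), len(joined))


text = "Maissen se chargea du reste .. Logiquement,"

# Actual output reported above: the space after ".." is gone.
actual = ['Maissen se chargea du reste .', '.', 'Logiquement,']
# Expected output: the space is kept in the second segment.
expected = ['Maissen se chargea du reste .', '. ', 'Logiquement,']

print(first_mismatch(text, actual))
print(first_mismatch(text, expected))
```

The same check could be run directly on the result of splitter.segment(text) to catch other inputs where whitespace is lost.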
In general, pySBD works well. Many thanks @nipunsadvilkar. I can look into this and open a pull request as soon as I find some time.