Open
Description
Here is something that would be really useful for PG development going forward: a count of how often each Unicode code point occurs in PG texts. A recent issue, #271, notes that the 2em dash is often used but missing from many typefaces.
Applications:
- we could automatically add an embedded, subsetted "polyfill" font to epubs to improve the appearance of characters with spotty coverage in typefaces, or we could recommend that users choose fonts with comprehensive coverage.
- we could produce a test epub that producers can use to check coverage of non-ASCII code points.
- data science!
How:
- a zipped tarball of all the texts in PG is available at https://gutenberg.org/cache/epub/feeds/txt-files.tar.zip
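A rough sketch of how the count could be produced, using only the Python standard library. The function names are made up for illustration, and two things are assumptions rather than facts about the feed: that the zip wraps a single `.tar` whose members end in `.txt`, and that the texts decode as UTF-8. Bytes that do not decode are replaced with U+FFFD, which then shows up in the counts as a signal of encoding problems.

```python
import collections
import tarfile
import unicodedata
import zipfile


def count_code_points(texts):
    """Count every code point across an iterable of strings."""
    counts = collections.Counter()
    for text in texts:
        counts.update(text)  # a str iterates one code point at a time
    return counts


def iter_texts_from_tar(tar_fileobj, encoding="utf-8"):
    """Yield the decoded text of each .txt member of a tar stream.

    Assumption: members are UTF-8; undecodable bytes become U+FFFD.
    """
    # "r|*" reads the tar sequentially, so the archive is never
    # unpacked to disk and no seeking is required.
    with tarfile.open(fileobj=tar_fileobj, mode="r|*") as tar:
        for member in tar:
            if member.isfile() and member.name.endswith(".txt"):
                extracted = tar.extractfile(member)
                if extracted is not None:
                    yield extracted.read().decode(encoding, errors="replace")


def iter_texts_from_zipped_tar(zip_path):
    """Yield texts from a zip file that wraps one or more .tar files.

    Assumption: this is the layout of txt-files.tar.zip.
    """
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            if name.endswith(".tar"):
                with archive.open(name) as tar_stream:
                    yield from iter_texts_from_tar(tar_stream)


def non_ascii_report(counts, limit=20):
    """Return (code point, Unicode name, count) for the most common non-ASCII characters."""
    rows = []
    for char, n in counts.most_common():
        if ord(char) > 127:
            name = unicodedata.name(char, "<unnamed>")
            rows.append((f"U+{ord(char):04X}", name, n))
            if len(rows) == limit:
                break
    return rows
```

With a local copy of the archive, `non_ascii_report(count_code_points(iter_texts_from_zipped_tar("txt-files.tar.zip")))` would list the most frequent non-ASCII code points. A `Counter` keyed by character keeps memory small, since only one text is held at a time and the number of distinct code points is bounded.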