Releases: CUNY-CL/wikipron
v1.3.3
Release [1.3.2] - 2024-07-17
Under data/
Changed
- Rescrapes dialect data after #548. (#551)
- Fixes dialect XPath selector. (#548)
- Fixes table alignment. (#539)
- Repeats big scrape after #523. (#536)
- Fixes excessive line wrapping. (#529)
- Big scrape for 2024. (#514)
- Upstream cleaning for Bengali data. (#547)
Under src/ and elsewhere
- Upgrades `requests` for Dependabot. (#541, #544)
- Upgrades `black` for Dependabot. (#530)
- Removes Min Nan (`nan`) custom selector. (#529)
Added
- Remove the case-folding attributes for the big scrape. (#469)
Changed
- Removed the case-folding test for the big scrape. (#469)
[1.3.1] - 2024-03-02
Under data/
- Updated Maltese (`mlt`) phonelist. (#517)
- Fixed path bug in `generate_summary.py`. (#517)
- Fixed CLI arg bug in `list_phones.py`. (#516)
- Big scrape for 2023. (#512)
- Moved IPAs of words with tildes to multiple lines. (#379)
- Caught `iso639.language.LanguageNotFoundError` error in `codes.py`. (#498)
- Added KPI computation to `generate_summary.py`. (#465)
- Added "ː"-suffixed characters to list of valid IPAs. (#497)
- Renamed the two TSV summaries to `summary.tsv`. (#494)
- Renamed `generate_tsv_summary.py` to `generate_summary.py`. (#492)
- Upstream cleaning wrt English tie bar. (#491)
- Upstream cleaning wrt English high vowel and schwa. (#493)
- Fixed Georgian (`kat`) phones and rescrapes. (#488)
Under wikipron/ and elsewhere
[1.3.0] - 2022-11-28
Under data/
Added
- Big scrape for 2022. (#464)
- Added the `--fresh` flag to `data/scrape/scrape.py` to facilitate running the big scrape in batches. (#464)
- Added the `--exclude` flag for excluding one or more languages in `data/scrape/scrape.py`. (#460)
- Added `data/src/normalize.py`. (#356)
- Updated `README.md`. (#360)
- Added `data/cg/tsv/geo.tsv`. (#367)
- Added `data/morphology`. (#369)
- Added SIGMORPHON 2021 morphology data. (#375)
- Added `data/cg/tsv/jpn_hira.tsv`. (#384)
- Enforced final newlines. (#387)
- Adds all UniMorph languages to morphology. (#393)
- Added `data/covering_grammar/tsv/fre_latn_phonemic.tsv` (#398)
- Added `data/covering_grammar/lib/make_test_file.py` (#396, #399)
- Added Komi-Zyrian (`kpv`). (#400)
- Added Makasar (`mak`). (#415, #419)
- Added Zou (`zom`). (#421)
- Added Wiyot (`wiy`). (#422)
- Added Sidamo (`sid`). (#423)
- Added Central Atlas Tamazight (`tzm`). (#429)
- Added Chibcha (`chb`). (#430)
- Added Kashmiri (`kas`). (#431)
- Added Malayalam (`mal`). (#434)
- Added Dhivehi (`div`). (#437)
- Added Akkadian (`akk`). (#441)
- Added Central Nahuatl (`nhn`). (#443)
- Added Etruscan (`ett`). (#444)
- Added Gujarati (`guj`). (#445)
- Added Kannada (`kan`). (#446)
- Added Karelian (`krl`). (#447)
- Added Romagnol (`rgn`). (#448)
- Added Southern Yukaghir (`yux`). (#449)
- Added Urak Lawoi' (`urk`). (#451)
- Added Hausa (`ha`). (#452)
- Added Kashubian (`csb`). (#453)
- Added Tabaru (`tby`). (#455)
- Added West Makian (`mqs`). (#457)
- Added Amharic (`amh`). (#458)
- Added Livvi (`olo`). (#459)
- Added Kalmyk (`xal`). (#472)
- Added Ternate (`tft`). (#473)
- Added Abkhaz (`abk`). (#474)
- Added Farefare (`gur`). (#475)
- Added Iban (`iba`). (#476)
- Added Laz (`lzz`). (#477)
Changed
- Switched to ISO 639-3 language codes. (#468)
- Updated scraped data in preparation for the SIGMORPHON 2022 shared task: `swe`, `nno`, `ger`, `dut`, `ita`, `rum`, `ukr`, `bel`, `tgl`, `ceb`, `ben`, `asm`, `per`, `pus`, `tha`, `lwl`. (#461)
- Made scripts under `data/frequencies/` and `data/morphology/` more flexible, especially for the purposes of preparing data for a shared task. (#461)
- Fixed the `--restriction` flag for specifying multiple languages in `data/scrape/scrape.py`. (#460)
- Added covering grammar coverage error log and specified error_type in error_analysis.py. (#424)
- Added error log writing in error_analysis.py. (#420)
- Added new columns in summary tables. (#365)
- Fixed broken paths in `data/src/generate_phones_summary.py` and in `data/phones/HOWTO.md`. (#352)
- Added Atong (India) (`aot`). (#353)
- Added Egyptian Arabic (`arz`). (#354)
- Added Lolopo (`ycl`). (#355)
- Fixed Unicode normalization in `data/phones/slv_phonemic.phones` and re-scraped Slovenian data. (#356)
- Updated `data/phones/HOWTO.md` to include instructions on applying the NFC Unicode normalization (#357)
- Updated `data/src/normalize.py` to be more efficient. (#358)
- Fixed inaccuracies in `data/phones/geo_phonemic.phones`. (#367)
- Fixed typo in `data/cg/tsv/geo.tsv` and added missing character. (#370)
- Morphology URLs are now provided as a list. (#376)
- Configured and scraped Yamphu (`ybi`). (#380)
- Configured and scraped Khumi Chin (`cnk`). (#381)
- Made summary generation in `common_characters.py` optional. (#382)
- Fixed phone counting in `data/src/generate_phones_summary.py` (#390, #392)
- Reorganizes scraping scripts under `data/scrape` (#394)
- Reorganizes `.phones` files and related scripts under `data/phones` (#395)
- Reorganizes CG files and related scripts under `data/covering_grammar` (#395)
- Reorganized `data/phones/phones/fre_phonemic.phones` (#398)
- Removed `data/src/` (#401)
- Renamed TSV files and phonelists to use the terms "broad"/"narrow" instead of "phonemic"/"phonetic" (#389, #402, #405)
- Fixed typo in `README.md` (#407)
- Fixed column ordering of the test file read by the script in `data/covering_grammar/lib/error_analysis.py` (#411)
- Fixed Common character collection in `common_characters.py` (#419)
- Scraping test fixed for `blt`. (#436)
- Changed URLs to point at CUNY-CL repo, where applicable. (#438)
Under wikipron/ and elsewhere
Added
- Added `ckb` in `languagecodes.py`. (#464)
- Added support for Python 3.10. (#462)
- Added test of phones list generation in `test_data/test_summary.py` (#363)
- Added Min Nan extraction function. (#397)
- Added Tai Dam extraction function, configuration and initial scrape. (#435)
- Added test of `casefold` value for languages in `data/scrape/lib/languages.json` (#442)
- Added support for Python 3.11. (#479)
- Added checks for the Python source distribution and wheel on CI. (#479)
- Turned on tests for Windows on CI. (#479)
Removed
Changed
[1.2.0] - 2021-01-30
Under data/
Added
- Added Afrikaans whitelists, filtered TSV file, rescraped phonemic and phonetic TSV files. (#311)
- Added German whitelists and filtered TSV file. (#285)
- Added whitelisting capabilities to `postprocess`. (#152)
- Added whitelists for Dutch, English, Greek, Latin, Korean, and Spanish. (#158, etc.)
- Logged dialect configuration if specified. (#133)
- Added typing to big scrape code. (#140)
- Added argparse to allow limiting 'big scrape' to individual languages with `--restriction` flag. (#154)
- Added Manchu (`mnc`). (#185)
- Added Polabian (`pox`). (#186)
- Added `aar`, `bdq`, `jje`, and `lsi`. (#202)
- Added `tyv` to `languagecodes.py` (#203, #205)
- Added `bcl`, `egl`, `izh`, `ltg`, `azg`, `kir` and `mga` to `languagecodes.py`. (#205)
- Added `nep` to `languagecodes.py`. (#206)
- Added Ingrian (`izh`). (#215)
- Added French phoneme list and filtered TSV file. (#213, #217)
- Added Corsican (`cos`). (#222)
- Added Middle Korean (`okm`). (#223)
- Added Middle Irish (`mga`). (#224)
- Added Old Portuguese (`opt`). (#225)
- Added Serbo-Croatian phoneme list and filtered TSV files. (#227)
- Added Tuvan (`tyv`). (#228)
- Added Shan (`shn`) with custom extraction. (#229)
- Added Northern Kurdish (`kmr`). (#243)
- Added a script to facilitate the creation of a `.phones` file. (#246)
- Added IPA validity checks for phonemes. (#248)
- Split multiple pronunciations joined by tilde in `eng_us_phonetic`.
- Added Italian phoneme list and filtered TSV file. (#260, #261)
- Added Adyghe phone list and filtered TSV file. (#262, #263)
- Added Bulgarian phoneme list and filtered TSV file. (#264, #267)
- Added Icelandic phoneme list and filtered TSV file. (#269, #270)
- Added Slovenian phoneme list and filtered TSV file. (#271, #273)
- Added normalization to `list_phones.py`. Corrected errors relating to `ipapy` (#275)
- Added Welsh .phones lists and filtered TSV files. (#274, #276)
- Added draft of covering grammar script. (#297)
- Updated `data/phones/README.md` with instructions to re-scrape. (#279, #281)
- Added Vietnamese `.phones` files and re-scraped and filtered `.tsv` files. (#278, #283)
- Added Hindi `.phones` files and the re-scraped and filtered `.tsv` files. (#282, #284)
- Added Old Frisian (`ofs`). (#294)
- Added Dungan (`dng`). (#293)
- Added Latgalian (`ltg`). (#296)
- Added Portuguese `.phones` files and re-scraped data. (#290, #304)
- Added Japanese `.phones` files and re-scraped data. (#230, #307)
- Added Moksha (`mdf`). (#295)
- Added Azerbaijani `.phones` files and re-scraped data. (#306, #312)
- Added Turkish `.phones` file and re-scraped data. (#313, #314)
- Added Maltese `.phones` file and re-scraped data. (#317, #318)
- Added Latvian `.phones` file and re-scraped data. (#321, #322)
- Added Khmer `.phones` file and re-scraped data. (#324, #327)
- Added Østnorsk (Bokmål) `.phones` file and re-scraped data. (#324, #327)
- Several languages added to `languagecodes.py`. (#334)
Changed
- Edited the arm_e_phonetic.phones and arm_w_phonetic.phones files. (#298)
- Improved printing in the README table. (#145)
- Renamed data directory `data`. (#147)
- Split `may` into Latin and Arabic files. (#164)
- Split `pan` into Gurmukhi and Shahmukhī. (#169)
- Split `uig` into Perso-Arabic and Cyrillic. (#173)
- Only allowed Latin spellings in Maltese lexicon. (#166).
- Split `mon` into Cyrillic and Mongol Bichig (#179).
- Merged whitelist.py into 'big scrape' script. src scrape.py now checks for existence of whitelist file during scrape to create second filtered TSV. New TSV placed under `tsv/*_filtered.tsv`. (#154).
- Updated `generate_summary.py` to reflect presence of 'filtered' tsv. (#154)
- Imperial Aramaic (`arc`) split into three scripts properly. (#187)
- Flattened data directory structure. (#194)
- Updated Georgian (`geo`) to take advantage of upstream bot-based consistency fixes. (#138)
- Split `arm` into Eastern and Western dialects. (#197)
- Rescraped files with new whitelists. (#199)
- Updated logging statements for consistency. (#196)
- Renamed `.whitelist` file extension name as `.phones`. (#207)
- Split `ban` into Latin and Balinese scripts. (#214)
- Split `kir` into Cyrillic and Arabic. (#216)
- Split Latin (`lat`) into its dialects. (#233)
- Added MyPy coverage for `wikipron`, `tests` and `data` directories. (#247)
- Modified paths in `codes.py`, `scrape.py`, and `split.py`. (#251, #256)
- Modified config flags in `languages.json` and `scrape.py`. (#258)
- Edited Serbo-Croatian `.phones` file to list all vowel/pitch accent combinations. Re-scraped Serbo-Croatian data. (#288)
- Moved `list_phones.py` to parent directory. (#265, #266)
- Moved `list_phones.py` to `src` directory. (#297)
- Frequencies code no longer overwrites TSV files. (#320)
- Updated `data/phones/README.md` to specify that `.phones` files should be in NFC normalization form. (#333)
- Kurdish (`kur`) and Opata (`opt`) removed from `languages.json`. (#334)
- Re-scraped Armenian data. Fixed an error in West Armenian phone list. (#338)
Fixed
- Fixed path issue with phonetic whitelisted files. (#195)
Under wikipron/ and elsewhere
Added
- Added positive flags for stress, syllable boundaries, tones, segment to `cli.py`. (#141)
- Added positive flags for space skipping to `cli.py`. (#257)
- Added two Vietnamese dialects to `languages.json`. (#139)
- Handled additional language codes. (#132, #148)
- Added `--no-skip-spaces-word` and `--no-skip-spaces-pron` flags. (#135)
- Allowed ASCII apostrophes (0x27) in spellings. (#172).
- Added Vietnamese extraction function. (#181).
- Modified pron selector in Latin extraction function. (#183).
- Added `--no-tone` flag. (#188)
- Customized extractor and new scraped prons for `khb`. (#219)
- Added `tests/test_data` directory containing two tests. (#226, #251)
- Added HTTP User-Agent header to API calls to Wiktionary. (#234)
- Added support for Python 3.9 (#240)
- Added black style formatting to `.circleci/config.yml`. (#242)
- Added logging for scraping a language with `--dialect` specified that requires its custom extraction logic. (#245)
- Improved CircleCI workflow with orbs. (#249)
- Added `test_split.py` to `tests/test_data`. (#256)
- Handled Cantonese for scraping. (#277)
- Added exclusion for reconstructions. (#302)
- Added Vietnamese contour tone grouping test in `tests/test_config.py` (#308)
- Added restart functionality. (#340)
Changed
- Renamed arguments to positive statements in `wikipron/config.py` and edited `_get_process_pron` function accordingly. (#141, #257)
- Changed testing values used in `tests/test_config.py` in order to accommodate the addition of positive flags. (#141)
- Specified UTF-8 encoding in handling text files. (#221)
- Moved previous contents of `tests` into `tests/test_wikipron` (#226)
- Updated the package version numbers in requirements.txt to their latest according to PyPI (#239)
- Updated the default pron selector to also look for IPA strings under paragraphs in addition to list items. (#295)
- Updated segments package version to 2.2.0 (#308)
Removed
- Moved Wiktionary querying functions from `test_languagecodes.py` to `codes.py` (#205)
v1.1.0
[1.1.0] - 2020-03-03
Added
- Added the extraction function for Mandarin Chinese and its scraped data. (#124)
- Integrated Wortschatz frequencies. (#122)
Changed
- Updated the Japanese extraction function and Japanese data. (#129)
- Updated all scraped Wiktionary data and frequency data. (#127, #128)
- Generalized the splitting script in the big scrape. (#123)
- Moved small file removal to `generate_summary.py`. (#119)
- Updated Russian data. (#115)
Fixed
- Avoided and logged error in case of pron processing failure. (#130)