refactor(locale): filter and cleanup PersonEntryDefintions data #3266
base: refactor/person/sex
| Codecov Report: All modified and coverable lines are covered by tests ✅ 
 Additional details and impacted files
@@                   Coverage Diff                   @@
##           refactor/person/sex    #3266      +/-   ##
=======================================================
- Coverage                99.97%   99.97%   -0.01%     
=======================================================
  Files                     2811     2811              
  Lines                   217025   183684   -33341     
  Branches                   941      940       -1     
=======================================================
- Hits                    216973   183632   -33341     
  Misses                      52       52              
 | 
| Perhaps it would be possible to summarise the length of each locale definition file before and after the changes, to get a feel for which methods and locales are actually impacted without having to review the giant diff, e.g. | 
| I sorted by entry first, and removed  
 | 
| 
 previously 
 This suggests to me that rather than a fixed 80% gendered and 20% generic result, if, say, female is requested, it should pick randomly from the female definitions concatenated with the generic definitions, so that locales with only a small number of generic definitions don't keep picking the same few generic names. | 
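A minimal sketch of the concatenation idea from the comment above, assuming a hypothetical `NameEntries` shape (not the actual Faker types) and an injectable `random` function so the pick is testable. With a uniform pick over the combined pool, a locale with one generic name and many female names would return that generic name only rarely.

```typescript
// Hypothetical shape of one locale's name data; an assumption for illustration.
interface NameEntries {
  generic: string[];
  female: string[];
  male: string[];
}

type Sex = 'female' | 'male';

// Build the pool for a uniform pick: sex-specific names followed by generic ones.
function buildPool(entries: NameEntries, sex: Sex): string[] {
  return entries[sex].concat(entries.generic);
}

// Pick uniformly from the combined pool.
// `random` must return a number in [0, 1); it defaults to Math.random.
function pickName(
  entries: NameEntries,
  sex: Sex,
  random: () => number = Math.random
): string {
  const pool = buildPool(entries, sex);
  const name = pool[Math.floor(random() * pool.length)];
  if (name === undefined) {
    throw new Error('No names available for the requested sex');
  }
  return name;
}
```

Each generic name then has the same chance as each specific name, so the share of generic results scales with the size of the generic list instead of being fixed at 20%.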
| 
 We also considered this. We also considered weighting them: 
 In summary, we haven't found the perfect solution yet. | 
| I tend towards just proceeding with a non-optimal weight distribution and tweaking it in subsequent PRs. | 
| What if the percentage of generic names was something that could be set for each locale definition separately? Then you could have, say, a 10% chance of getting a gender-neutral English prefix, but a 50% chance of getting a gender-neutral Chinese first_name. | 
| 
 Is finding the right percentage for each distribution a (merge or release) blocking issue for you or is that something we can adjust in later PRs? | 
| I'd say yes. Having 20 percent of all Japanese first names come out the same because there is only one generic name feels like a bug/regression. | 
| 
 I'm not sure what the best way to handle this is. We don't want to leave them all in generic, otherwise female names would be returned when you asked for male. So we would have to go through and split the generic names into male and female. There are some (free and paid) APIs which might be able to help with that, like https://genderize.io/ | 
| My plan for this - if you are fine with it - looks like this: 
 Please let me know what you think of this and what your suggestions are. | 
| In general that sounds fine. There's no great hurry for this, and we are less likely to accidentally break things if we spread this over a few releases. However, I think we should try to figure out what we will do about the problematic locales so we don't get stuck in future. Even if we truncate en generic first names to 1000 first, that's a lot to go through by hand. | 
| 
 IMO we can either check the existing list, which can be a lot, or we could search for a new list. Whatever is easier for us. | 
| Would we allow 1000 male and 1000 female names? Or 1000 total across all genders? | 
| I think the current script limits it to up to 1000 each. | 
…or/person/sex-localeData
| 
 How about using a ratio of: 
 Percentage of choosing specific
 Percentage of choosing generic
 | 
| 'Živana', | ||
| 'Žofie', | ||
| ], | ||
| generic: ['Nikola', 'René'], | 
Is it safe to move them to female and male respectively?
| Team Decision 
 
 We believe that these values represent the use case best while leaning towards specific values, if specific has been requested. | 
| Just wanted to check I understood this right. So for example, if there were 9 generic first names, 25 female first names and 36 male first names, then if I request firstName("female") I'll get a name from the female list versus the generic list in a ratio of 3*sqrt(25) : sqrt(9) = 15:3, i.e. I get a name from the female list 15/18 of the time, or 83.3 percent. | 
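The arithmetic in that example can be checked with a short sketch. The factor of 3 and the square roots are taken from the commenter's reading of the ratio, not from a confirmed implementation; `specificProbability` is a hypothetical helper name.

```typescript
// Probability of drawing from the sex-specific list, assuming the ratio
// specificFactor * sqrt(specificCount) : sqrt(genericCount) from the example.
// The default factor of 3 is an assumption based on that comment.
function specificProbability(
  specificCount: number,
  genericCount: number,
  specificFactor: number = 3
): number {
  const specificWeight = specificFactor * Math.sqrt(specificCount);
  const genericWeight = Math.sqrt(genericCount);
  const total = specificWeight + genericWeight;
  // With no entries in either list there is nothing to choose from.
  return total === 0 ? 0 : specificWeight / total;
}

// 25 female and 9 generic names: 15 : 3, so 15/18 of picks are female.
const femaleShare = specificProbability(25, 9);
// 36 male and 9 generic names: 18 : 3, so 18/21 of picks are male.
const maleShare = specificProbability(36, 9);
```

Under this reading, a locale with a single generic name and 100 specific names would draw the generic name in 1 out of 31 requests, rather than 1 in 5 as with a fixed 80/20 split.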







Second part of #3058
generic prefixes #3058
Extension of #3259
This PR cleans up the PersonEntryDefintions locale data.
- `generic` values are checked for whether they exist exclusively in either `female` or `male`; if so, they are removed from `generic`. This solves the issue where `generic = merge(female, male)`.
- `female` values are checked for whether they are in `generic`; if so, they are removed from `female`.
- `female` values are checked for whether they are in `male`; if so, they are added to `generic` and removed from `female`.
- `male` values are checked for whether they are in `generic`; if so, they are removed from `male`.

I haven't run the script yet, because there is a large diff, due to the person data not being sorted.

Summary (changes only)
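The four cleanup rules from the PR description can be sketched roughly as follows. This is not the script used in the PR; `SexEntries` and `cleanupEntries` are hypothetical names, and the first rule is read here as "the value appears in exactly one of `female` or `male`", which is an assumption.

```typescript
// Hypothetical shape of one person entry definition; an assumption for illustration.
interface SexEntries {
  generic: string[];
  female: string[];
  male: string[];
}

function cleanupEntries(entries: SexEntries): SexEntries {
  const generic = new Set(entries.generic);
  const female = new Set(entries.female);
  const male = new Set(entries.male);

  // Rule 1: a generic value found in exactly one of female/male is not
  // really generic, so it is removed from generic.
  for (const value of Array.from(generic)) {
    if (female.has(value) !== male.has(value)) generic.delete(value);
  }
  // Rule 2: female values that are also generic are removed from female.
  for (const value of Array.from(female)) {
    if (generic.has(value)) female.delete(value);
  }
  // Rule 3: female values that are also male move to generic.
  for (const value of Array.from(female)) {
    if (male.has(value)) {
      generic.add(value);
      female.delete(value);
    }
  }
  // Rule 4: male values that are also generic are removed from male.
  for (const value of Array.from(male)) {
    if (generic.has(value)) male.delete(value);
  }

  return {
    generic: Array.from(generic),
    female: Array.from(female),
    male: Array.from(male),
  };
}
```

After these steps no value appears in more than one list, so a request for a specific sex cannot return a name that belongs only to the other one.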