Commit cb03e74

optimise is_cjk(character) (#139)
The call to `ord()` and the list comprehension both showed up in my profiler. Some observations:

- The call to `ord()` doesn't need to happen every iteration of that list comprehension.
- A list isn't necessary, we can stop searching once we've found a hit.
- The list is sorted, so we can be sure that if a char is lower than the upper bound of a group, we don't need to evaluate any of the higher ranges.
- Technically we could also do a binary search instead of a linear one, but I'm assuming that, overall, most chars will be in the lowest ranges (ascii) and the loop will abort on its first iteration.
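For comparison, the binary search mentioned in the last observation could look roughly like the sketch below. It is not part of this commit. The names `CJK_RANGES`, `_RANGE_STARTS` and `is_cjk_bisect` are hypothetical, the range values are copied from the literal list the diff removes, and it assumes the ranges are sorted and non-overlapping, with inclusive bounds as in the removed code.

```python
from bisect import bisect_right

# Range values copied from the literal list removed from is_cjk() in this
# commit. Assumption: they are sorted and do not overlap.
CJK_RANGES = [
    (4352, 4607),
    (11904, 42191),
    (43072, 43135),
    (44032, 55215),
    (63744, 64255),
    (65072, 65103),
    (65381, 65500),
    (94208, 101119),
    (110592, 110895),
    (110960, 111359),
    (131072, 196607),
]

# Hypothetical helper: the sorted lower bounds, precomputed once for bisect.
_RANGE_STARTS = [start for start, _ in CJK_RANGES]


def is_cjk_bisect(character):
    """Binary-search variant of the range check (inclusive bounds)."""
    code_point = ord(character)
    # Index of the last range whose lower bound is <= code_point.
    index = bisect_right(_RANGE_STARTS, code_point) - 1
    if index < 0:
        # Below the first range, e.g. any ASCII character.
        return False
    return code_point <= CJK_RANGES[index][1]
```

For ASCII input this still costs a full `bisect_right` call, whereas the linear scan returns on its first iteration, which is the trade-off the commit message describes.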
1 parent 1b161ea commit cb03e74

1 file changed, +8 -18 lines changed

sacremoses/util.py

Lines changed: 8 additions & 18 deletions
@@ -91,6 +91,9 @@ class CJKChars(object):
     ]
 
 
+_CJKChars_ranges = CJKChars().ranges
+
+
 def is_cjk(character):
     """
     This checks for CJK character.
@@ -106,24 +109,11 @@ def is_cjk(character):
     :type character: char
     :return: bool
     """
-    return any(
-        [
-            start <= ord(character) <= end
-            for start, end in [
-                (4352, 4607),
-                (11904, 42191),
-                (43072, 43135),
-                (44032, 55215),
-                (63744, 64255),
-                (65072, 65103),
-                (65381, 65500),
-                (94208, 101119),
-                (110592, 110895),
-                (110960, 111359),
-                (131072, 196607),
-            ]
-        ]
-    )
+    char = ord(character)
+    for start, end in _CJKChars_ranges:
+        if char < end:
+            return char > start
+    return False
 
 
 def xml_escape(text):
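
One thing worth checking in this diff: the removed code tests `start <= ord(character) <= end`, inclusive on both sides, while the added loop uses the strict comparisons `char < end` and `char > start`. The sketch below puts the two side by side. It assumes `_CJKChars_ranges` holds the same sorted pairs as the removed literal list, which the diff itself does not show. `CJK_RANGES`, `is_cjk_before`, `is_cjk_after`, `is_cjk_inclusive` and `endpoint_disagreements` are hypothetical names used only for illustration.

```python
# Range values copied from the literal list removed by this commit.
# Assumption: _CJKChars_ranges contains the same pairs in the same order.
CJK_RANGES = [
    (4352, 4607),
    (11904, 42191),
    (43072, 43135),
    (44032, 55215),
    (63744, 64255),
    (65072, 65103),
    (65381, 65500),
    (94208, 101119),
    (110592, 110895),
    (110960, 111359),
    (131072, 196607),
]


def is_cjk_before(character):
    # Logic removed by the commit: both bounds are inclusive.
    return any([start <= ord(character) <= end for start, end in CJK_RANGES])


def is_cjk_after(character):
    # Logic added by the commit, with CJK_RANGES standing in for
    # _CJKChars_ranges: both comparisons are strict.
    char = ord(character)
    for start, end in CJK_RANGES:
        if char < end:
            return char > start
    return False


def is_cjk_inclusive(character):
    # Hypothetical variant: keeps the early exit, restores inclusive bounds.
    char = ord(character)
    for start, end in CJK_RANGES:
        if char <= end:
            return char >= start
    return False


def endpoint_disagreements():
    # Code points at range boundaries where the two versions differ.
    endpoints = [point for pair in CJK_RANGES for point in pair]
    return [
        point
        for point in endpoints
        if is_cjk_before(chr(point)) != is_cjk_after(chr(point))
    ]
```

Under that assumption, the two versions agree on interior code points and disagree on the first and last code point of each range, for example `chr(4352)`. Using `<=` and `>=` in the loop, as in `is_cjk_inclusive`, would keep the early exit while matching the original inclusive behaviour.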
