better plots #76

bt2901 · 2020-08-04T22:13:12Z

No description provided.

Alvant · 2020-08-20T06:24:13Z

topnum/search_methods/optimize_scores_method.py

            save_experiment: bool = False,
-            experiment_directory: str = DEFAULT_EXPERIMENT_DIR):
+            experiment_directory: str = DEFAULT_EXPERIMENT_DIR,
+            nums_topics_list: int = None):


Oh, cool, I wanted to add this too!
Though it would probably be better like this:

nums_topics: List[int] = None

Plus, it's worth writing a docstring and noting there that this parameter takes priority over min_num_topics, max_num_topics, num_topics_interval.

P.S. All in all, this is a bit slapdash, of course) Ideally, OptimizeScoresMethod should accept a topic interval specified in some way. A topic interval can be defined either by borders and a step or by a manually specified list of topic numbers (and maybe in some other way too; for example, by "centers" (presumed optima) around which topic numbers should be searched, plus the parameters of the search around those centers)

def __init__(..., search_interval, ...)
search_interval = SearchInterval.from_borders_and_step(min_num_topics, max_num_topics, num_topics_interval)
...
search_interval = SearchInterval.from_list(nums_topics)
...
search_interval = SearchInterval.from_whatever_else(...)

Or we should drop min_num_topics, max_num_topics, num_topics_interval altogether and keep only the list 😅
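To make the suggestion above concrete, here is a minimal sketch of what such a class could look like. `SearchInterval` does not exist in the diff under review; the class, its attribute `nums_topics`, and the choice to treat both borders as inclusive are assumptions made for illustration only.

```python
from typing import Iterator, List


class SearchInterval:
    """Hypothetical container for the numbers of topics to try (not part of topnum)."""

    def __init__(self, nums_topics: List[int]):
        if not nums_topics:
            raise ValueError("nums_topics must not be empty")
        # Deduplicate and sort so that every constructor yields a clean grid.
        self.nums_topics = sorted(set(nums_topics))

    @classmethod
    def from_borders_and_step(cls, min_num_topics: int, max_num_topics: int,
                              num_topics_interval: int) -> "SearchInterval":
        # Assumption: the upper border is included when the step lands on it.
        return cls(list(range(min_num_topics, max_num_topics + 1, num_topics_interval)))

    @classmethod
    def from_list(cls, nums_topics: List[int]) -> "SearchInterval":
        return cls(list(nums_topics))

    def __iter__(self) -> Iterator[int]:
        return iter(self.nums_topics)
```

With something like this, the method would only need a single `search_interval` argument, and the question of which parameter takes priority would disappear.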

Alvant · 2020-08-20T06:33:43Z

topnum/utils.py

+    '[email protected]_contrast': max,
+    '[email protected]_purity': max,
+    '[email protected]_size': None,
+    'perp': None,


The thing is that the perplexities are duplicated (extra scores created at model initialization). To avoid drawing redundant plots that clutter the figure, I set them to None (this replaces the USELESS_KEYS we had before)

Slapdash, yes.

Ah, ok. Maybe we keep perp after all?) As in, there is holdout_perp, and there will be perp?

Alvant · 2020-08-20T06:33:50Z

topnum/utils.py

+    '[email protected]_purity': max,
+    '[email protected]_size': None,
+    'perp': None,
+    'sparsity_phi': None,


Alvant · 2020-08-20T06:35:34Z

topnum/utils.py



+SCORES_DIRECTION = {
+    'PerplexityScore@all': min,


Nice!

P.S. I would create an enum with the possible direction values:

class ScoreDirection(Enum):
    NONE = auto()
    MIN = auto()
    MAX = auto()
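A minimal sketch of how the enum proposed above could replace the `min` / `max` / `None` values in `SCORES_DIRECTION`. The helper `find_optimum` is hypothetical, and the mapping below is an illustrative subset of the keys visible in the diff, not the full table.

```python
from enum import Enum, auto
from typing import Optional, Sequence


class ScoreDirection(Enum):
    NONE = auto()  # the score is neither plotted nor classified
    MIN = auto()   # lower is better
    MAX = auto()   # higher is better


# Illustrative subset of the mapping from the diff.
SCORES_DIRECTION = {
    'PerplexityScore@all': ScoreDirection.MIN,
    'SparsityThetaScore': ScoreDirection.MAX,
    'perp': ScoreDirection.NONE,
}


def find_optimum(values: Sequence[float], direction: ScoreDirection) -> Optional[float]:
    """Hypothetical helper: return the best value for the given direction."""
    if direction is ScoreDirection.NONE:
        return None
    return min(values) if direction is ScoreDirection.MIN else max(values)
```

Compared with storing the built-in `min` and `max` functions, the enum makes the "skip this score" case explicit instead of overloading `None`.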

Alvant · 2020-08-20T06:38:10Z

topnum/utils.py

+SCORES_DIRECTION = {
+    'PerplexityScore@all': min,
+    'SparsityThetaScore': max,
+    'SparsityPhiScore@word': max,


Having specific modalities hardcoded here is not great... (new dataset, new modality: fail). But we can live with it)
...
And the score names are hardcoded too)) In short, this is a very currently-conducted-experiments-related thing

Score classes are not saved, are they? If they are, we could put types here instead of names.

A regular model definitely has the classes, but Dummy... most likely does not

Alvant · 2020-08-20T06:44:33Z

topnum/utils.py

+    'toptok1': max
+}
+
+def classify_curve(my_data, FRAC_THRESHOLD, score_name):


The parameter name is in all caps

Alvant · 2020-08-20T06:47:11Z

topnum/utils.py

+        colored_values[colored_values > threshold] = np.nan
+
+    intervals = colored_values[colored_values.notna()]
+    minx, maxx = min(intervals.index), max(intervals.index)


minx, maxx?

Alvant · 2020-08-20T06:48:56Z

topnum/utils.py

+    'toptok1': max
+}
+
+def classify_curve(my_data, FRAC_THRESHOLD, score_name):


Could you please specify the type of my_data? 🙂 (by the way, a different name would be better)
I ask because I realized I don't know what intervals.index is

Alvant · 2020-08-20T06:54:56Z

topnum/utils.py

+            # and abs(intervals.loc[maxx] - optimum_val) <= :
+            curve_type = "outside"
+    else:
+        curve_type = "jumping"


Well, this is also clearly asking for constants or enums

Alvant · 2020-08-20T06:58:04Z

topnum/utils.py

+
+    intervals = colored_values[colored_values.notna()]
+    minx, maxx = min(intervals.index), max(intervals.index)
+    optimum_idx = set(intervals.index)


optimum_idx: not quite a fitting name? Or what is meant to be computed here?)
minx, maxx = min(intervals.index), max(intervals.index) -> is the check if minx == maxx equivalent to the check on line 349?

No, it is not equivalent.

Suppose we iterate over the numbers of topics with step 2. intervals.index == [10, 12, 14, 22] (two maxima), slice_idx = [10, 12, ..., 20, 22]

One could compare len() here instead, but that seemed less intuitive to me.

Ok, the example made it clearer)
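The example from the reply above can be reproduced with a small pandas sketch. How `slice_idx` is built is not shown in the diff; here it is assumed to be every grid point between `minx` and `maxx`, and the series values are made up.

```python
import numpy as np
import pandas as pd

# Topic numbers searched with step 2: 10, 12, ..., 22.
grid = list(range(10, 24, 2))

# Near-optimal values are kept, everything else is masked with NaN.
colored_values = pd.Series([1.0, 1.0, 1.0, np.nan, np.nan, np.nan, 1.0], index=grid)

intervals = colored_values[colored_values.notna()]
minx, maxx = min(intervals.index), max(intervals.index)

optimum_idx = set(intervals.index)                    # {10, 12, 14, 22}
slice_idx = set(colored_values.loc[minx:maxx].index)  # assumed: all grid points in [minx, maxx]

# minx == maxx only detects a single-point optimum;
# the set comparison detects any gap inside the near-optimal region.
is_single_point = minx == maxx
is_contiguous = optimum_idx == slice_idx
```

Here `is_single_point` and `is_contiguous` are both false, which corresponds to the "jumping" case.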

Alvant · 2020-08-20T07:00:38Z

topnum/utils.py

+    if (optimum_idx == slice_idx):
+        curve_type = f"interval {len(intervals)}"
+        if len(intervals) == 1:
+            curve_type = "sharp"


So curve_type can no longer change to outside in the code below?..

Yes, but that is probably a mistake; I'll fix it

Alvant · 2020-08-20T07:03:34Z

topnum/utils.py

+
+        minx, maxx = min(colored_values.index), max(colored_values.index)
+        if minx in optimum_idx:
+            curve_type = "outside"


Why is curve_type = outside if minx in optimum_idx?) Doesn't the score direction affect whether minx or maxx should be checked?

No (imagine a U-shaped curve)
minx, maxx are the bounds of the interval along x. The score direction determines whether this is the interval containing the minimum or the maximum of the plot

Ah, so if a boundary falls within the neighborhood of the optimum, we consider the true optimum to be "outside"?

...Maybe it is a plateau?)

Alvant · 2020-08-20T07:05:11Z

topnum/utils.py

+            *name_base, param_id, seed = experiment_name.split("_")
+
+            seed = int(seed) if seed != 'None' else 0
+            if seed > 3:


Alvant · 2020-08-20T07:06:20Z

topnum/utils.py

-            my_ax.plot(data.T.mean(axis=0), linestyle=style, label=experiment_name)
+            *name_base, param_id, seed = experiment_name.split("_")
+
+            seed = int(seed) if seed != 'None' else 0


None should not appear anywhere, right? This is left over from the old code, when the seed was not set at all?

Yes, this is backward compatibility with legacy results
seed > 3 is also left over from that time, when seeds were set via RandomState and were large

Maybe scrap the legacy support?) Why maintain it? Supporting old behavior means there is a chance of accidentally hitting some bug?

Alvant · 2020-08-20T07:08:00Z

topnum/utils.py

+
+            my_data = data.T.mean(axis=0)
+
+            if maxval is not None:


When might maxval be needed?

For example, when I plot perplexity and the first value is something like 100500, which breaks the scale of the plot

For symmetry we apparently need minval as well

and apparently a docstring too :)
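A minimal sketch of the symmetric `minval` / `maxval` handling discussed above. The function name `mask_outside` is hypothetical, and replacing out-of-range points with NaN (so that matplotlib simply skips them) is an assumption about the intended behavior, not the code from the diff.

```python
from typing import Optional

import numpy as np
import pandas as pd


def mask_outside(values: pd.Series,
                 minval: Optional[float] = None,
                 maxval: Optional[float] = None) -> pd.Series:
    """Return a copy of `values` where points outside [minval, maxval] are NaN.

    A bound set to None is not applied.
    """
    masked = values.astype(float).copy()
    if maxval is not None:
        masked[masked > maxval] = np.nan
    if minval is not None:
        masked[masked < minval] = np.nan
    return masked


# Perplexity with a huge first value that would otherwise ruin the y-axis scale.
perplexity = pd.Series([100500.0, 900.0, 850.0], index=[1, 5, 10])
masked = mask_outside(perplexity, maxval=1000)
```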

Alvant · 2020-08-20T07:09:37Z

topnum/utils.py

+
+            if FRAC_THRESHOLD:
+                colored_values, curve_type = classify_curve(my_data, FRAC_THRESHOLD, score)
+                if curve_type == "jumping":


P.S.
Modifying classify_curve will require changing the code here as well (adding, removing, or changing if branches)
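One possible way to reduce this coupling, sketched under assumptions: give the curve types an enum (as suggested in an earlier comment) and keep the plotting choices in a lookup table. Only the `"^"` marker for `"outside"` appears in the diff; `CurveType`, `MARKERS`, `marker_for`, and the other marker values are hypothetical.

```python
from enum import Enum


class CurveType(Enum):
    SHARP = "sharp"
    INTERVAL = "interval"  # the interval length would be stored separately
    OUTSIDE = "outside"
    JUMPING = "jumping"


# One table instead of a chain of `if curve_type == ...` branches.
MARKERS = {
    CurveType.SHARP: "o",
    CurveType.INTERVAL: "s",
    CurveType.OUTSIDE: "^",  # the only marker taken from the diff
    CurveType.JUMPING: "x",
}


def marker_for(curve_type: CurveType) -> str:
    """Look up the plot marker for a curve type."""
    return MARKERS[curve_type]
```

Adding a new curve type then means adding an enum member and a table entry, and a check such as `set(MARKERS) == set(CurveType)` can catch a missing entry.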

Alvant · 2020-08-20T07:13:16Z

topnum/utils.py

+                if curve_type == "outside":
+                    marker = "^"
+                my_ax.plot(colored_values, linestyle=style, color=color, alpha=1.0)
+                my_ax.plot(colored_values, marker=marker, linestyle='', color='black', alpha=1.0)


Why plot twice? 😅 linestyle=''? Does that mean... no line?

Ohhh, this is to draw the points? Why so complicated?)) Why can't it all be done in one plot call? If it really cannot be done in one call, maybe use scatter instead of the second plot?
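For reference, a single `plot` call can draw a colored line with black markers by setting the marker face and edge colors separately. This is a sketch with made-up data, not the code from the diff.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
x = [10, 12, 14]
y = [0.5, 0.7, 0.6]

# One call: the line uses `color`, the markers use their own face/edge colors.
lines = ax.plot(
    x, y,
    linestyle="-", color="tab:blue", alpha=1.0,
    marker="^", markerfacecolor="black", markeredgecolor="black",
)
```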

Alvant · 2020-08-20T09:08:41Z

@bt2901 Viktor, one thing: I would like to have a function that simply takes X and Y values plus an indicator of what we are looking for (max or min). Someone might just want to process an arbitrary plot (not tied to my_data, score_names_stuff, etc.). Can that be done? For example, I might want to use your tool to analyze stability and renormalization.

P.S. Ideally, the input should be just X and Y. And the tool (ideally) should say where the maximum is, where the minimum is, where the plateau is, and with "what confidence" it considers each of them to be a max/min/plateau

I also have a question: sometimes there are plots where the first point is (as a rule) "extra" (roughly speaking, the score is computed for a model with a single topic and the value is wild; after that the "normal" dependence begins). Is it possible to cut off such obvious boundary outliers and exclude them from the analysis? Though I suppose one should do that oneself before passing the range of values to the analyzer...
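A rough sketch of what such a standalone function could look like. It is not the `classify_curve` from this pull request: the name `classify_xy`, the tolerance rule (a fraction of the observed value range), and the order of the checks are all assumptions, and it does not estimate any confidence.

```python
from typing import Callable, Sequence, Tuple

import pandas as pd


def classify_xy(x: Sequence[float], y: Sequence[float],
                direction: Callable = max,
                frac_threshold: float = 0.1) -> Tuple[str, pd.Series]:
    """Hypothetical helper: classify the shape of y(x) around its optimum.

    `direction` is the built-in `min` or `max`.
    Returns the curve type and the near-optimal points.
    """
    values = pd.Series(list(y), index=list(x), dtype=float).sort_index()
    optimum = direction(values)
    # Assumption: "near-optimal" means within a fraction of the value range.
    tolerance = frac_threshold * (values.max() - values.min())
    near = values[(values - optimum).abs() <= tolerance]

    first, last = near.index.min(), near.index.max()
    contiguous = set(near.index) == set(values.loc[first:last].index)

    if not contiguous:
        return "jumping", near   # several separated near-optimal regions
    if first == values.index[0] or last == values.index[-1]:
        return "outside", near   # near-optimal region touches the border of the grid
    if len(near) == 1:
        return "sharp", near
    return f"interval {len(near)}", near
```

A boundary outlier such as the single-topic model can be dropped by the caller before classification, for example `classify_xy(x[1:], y[1:], direction=min)`.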

Alvant · 2020-09-04T11:39:07Z

demos/AutoTable.ipynb

@@ -0,0 +1,2164 @@
+{


@bt2901 Viktor, there may be a conflict here: this file is present in two pull requests. As I understand it, this one contains the older version of the notebook, so it can be removed from this request entirely?

better plots

1728830

bt2901 requested a review from Alvant as a code owner August 4, 2020 22:13

bt2901 added 4 commits August 18, 2020 23:02

add classify_curve method

337000d

try to revert notebook changes

dd87397

whoops

a022fc3

more bad hardcoded stuff

0ed2c36

Alvant reviewed Aug 20, 2020

topnum/utils.py

'[email protected]_purity': max,

'[email protected]_size': None,

'perp': None,

'sparsity_phi': None,

Alvant Aug 20, 2020

max?

Alvant reviewed Aug 20, 2020

Alvant linked an issue Aug 20, 2020 that may be closed by this pull request

Search interval v2: simple list as input #88

Closed

bt2901 added 2 commits August 20, 2020 21:17

some style fixes

75ac3ef

pivot tables and proofs of concept

f7e06dc

Alvant reviewed Sep 4, 2020

Alvant approved these changes Sep 8, 2020

Alvant merged commit 590242e into master Sep 8, 2020

Alvant mentioned this pull request Sep 8, 2020

SearchInterval: standardize topic grid #97

Open
