Hi,
First I'd like to say thanks for publishing this repo! It's very helpful.
My question specifically refers to this description in the README:
_After the specified Optuna trials are complete, a 3-step KMeans clustering method is used to select the optimal parameter(s):
1. Each trial is placed in its nearest neighbor cluster based on its distance correlation to the target. The optimal number of clusters is determined using the elbow method. The cluster with the highest average correlation is selected with respect to its membership. In other words, a weighted score is used to select the cluster with the highest correlation but also with the most trials.
2. After the best correlation cluster is selected, the parameters of the trials within the cluster are also clustered. Again, the best cluster of indicator parameter(s) is selected with respect to its membership.
3. Finally, the centered best trial is selected from the best parameter cluster._
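To make sure I'm reading step 1 correctly, here is a minimal sketch of how I understand it. It is not the repo's code: the scores are made up, `k` is fixed instead of chosen by the elbow method, `kmeans_1d` is a hypothetical plain-Python stand-in for KMeans, and the weighting (mean correlation times membership share) is my assumption, since the README does not give the formula.

```python
from statistics import mean

def kmeans_1d(values, k, iters=50):
    """Plain Lloyd's algorithm on scalars; returns one cluster label per value."""
    ordered = sorted(values)
    # Deterministic init: spread the starting centers across the sorted values.
    centers = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    labels = []
    for _ in range(iters):
        # Assign each value to its nearest center.
        labels = [min(range(k), key=lambda c: abs(v - centers[c])) for v in values]
        new_centers = []
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            new_centers.append(mean(members) if members else centers[c])
        if new_centers == centers:
            break
        centers = new_centers
    return labels

# Hypothetical per-trial correlations to the target (not real results).
scores = [0.10, 0.12, 0.15, 0.50, 0.52, 0.53, 0.55, 0.54, 0.70]
labels = kmeans_1d(scores, k=3)

clusters = {}
for score, label in zip(scores, labels):
    clusters.setdefault(label, []).append(score)

def weighted_score(members, total):
    # Assumed weighting: mean correlation scaled by the cluster's share of trials.
    return mean(members) * len(members) / total

best_cluster = max(clusters.values(), key=lambda m: weighted_score(m, len(scores)))
print(sorted(best_cluster))
```

With this weighting the lone 0.70 trial loses to the larger group around 0.53, so the membership weight is the only thing that separates this step from simply taking the top-scoring trial.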
Since you are clustering by the correlation and then picking the cluster with the best mean correlation to the target, I'm not sure what this achieves. Why not just use the parameters from the single trial with the highest correlation?
I can see how this would be useful if you were clustering by the parameters instead of the correlations. That way you would avoid outlier or overfit parameters, because the chosen trial would come from a cluster of similar parameters that all have a high correlation. But the description and the implementation don't seem to use the parameter values in the clustering at all; they only cluster the scores.
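A rough sketch of what I mean by grouping on the parameters first. Everything here is hypothetical: the `(window, correlation)` pairs are invented, and fixed-width bucketing of a single parameter stands in for a real KMeans on the parameter vectors.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (window_length, correlation) pairs, not real results from the repo.
trials = [
    (5, 0.31), (6, 0.33), (7, 0.30),
    (20, 0.52), (21, 0.55), (22, 0.54), (23, 0.51),
    (60, 0.61),  # isolated high score, possibly overfit
]

def select_by_parameter_group(trials, bin_width=10, min_members=2):
    """Group trials by parameter value, then pick the central trial of the best group."""
    groups = defaultdict(list)
    for param, score in trials:
        # Crude stand-in for clustering: bucket the parameter into fixed-width bins.
        groups[param // bin_width].append((param, score))
    # Ignore groups too small to say anything about the neighbourhood.
    candidates = [g for g in groups.values() if len(g) >= min_members]
    best = max(candidates, key=lambda g: mean(s for _, s in g))
    center = mean(p for p, _ in best)
    # Return the trial whose parameter is closest to the group's center.
    return min(best, key=lambda t: abs(t[0] - center))

best_single = max(trials, key=lambda t: t[1])
robust = select_by_parameter_group(trials)
print(best_single, robust)
```

Here the single best trial is the isolated `window=60`, while the grouped selection returns a trial from the 20 to 23 neighbourhood, where nearby parameters all score well.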
Alternatively, a k-fold optimization could help control for overfitting as well, although I guess users can implement that themselves if they want to.
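For the k-fold idea, something along these lines is what I had in mind. It is only a sketch under assumptions: the data is made up, the helper names are hypothetical, contiguous folds are used because the data is a time series, and plain Pearson correlation stands in for the distance correlation the repo uses.

```python
from statistics import mean

def pearson(x, y):
    """Pearson correlation; a simple stand-in for distance correlation."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def contiguous_folds(n, k):
    """Yield (start, stop) index pairs for k contiguous folds covering range(n)."""
    size, extra = divmod(n, k)
    start = 0
    for i in range(k):
        stop = start + size + (1 if i < extra else 0)
        yield start, stop
        start = stop

def kfold_score(feature, target, k=3):
    """Score a feature on each fold; return (mean, worst) absolute correlation."""
    scores = [
        abs(pearson(feature[a:b], target[a:b]))
        for a, b in contiguous_folds(len(target), k)
    ]
    return mean(scores), min(scores)

# Hypothetical target and two candidate indicator outputs.
target = [1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4]
stable = [2, 4, 6, 8, 2, 4, 6, 8, 2, 4, 6, 8]    # tracks the target in every fold
fragile = [1, 2, 3, 4, 4, 1, 3, 2, 2, 4, 1, 3]   # tracks the target in the first fold only

stable_mean, stable_min = kfold_score(stable, target)
fragile_mean, fragile_min = kfold_score(fragile, target)
print(stable_mean, stable_min, fragile_mean, fragile_min)
```

An Optuna objective could return the mean (or the worst-fold value) instead of a single full-sample correlation, so parameters that only work on one stretch of the data score lower.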
Thanks again!
-Aakash