Skip to content

Explore dropping raw_time from Message dataclass #212

@dlqqq

Description

@dlqqq

Context

Show context (click to expand)

When a client creates a new message, it sends a new message with a timestamp & raw_time set to True:

// from LabChatModel.sendMessage()

    const msg: IYmessage = {
      type: 'msg',
      id: UUID.uuid4(),
      body: message.body,
      time: Date.now() / 1000,
      sender: this._user.username,
      raw_time: true
    };

Then, the server extension calls self._set_timestamp() concurrently when a new message with raw_time=True is received:

# from YChat._on_messages_change()

        for idx in range(index, index + inserted_count):
            message_dict = self._ymessages[idx]
            if message_dict and message_dict.get("raw_time", True):  # type:ignore[attr-defined]
                self.create_task(self._set_timestamp(idx, timestamp))

This then deletes the message and re-adds it with raw_time set to False, while ensuring that the new message is inserted at the correct position:

    async def _set_timestamp(self, msg_idx: int, timestamp: float):
        with self._ydoc.transaction():
            # Remove the message from the list and modify the timestamp
            try:
                message_dict = self._ymessages[msg_idx]
            except IndexError:
                return

            message_dict["time"] = timestamp  # type:ignore[index]
            message_dict["raw_time"] = False  # type:ignore[index]
            self._ymessages[msg_idx] = message_dict

            new_idx = len(self._ymessages) - next((i for i, v in enumerate(self._get_messages()[::-1]) if v["time"] < timestamp), len(self._ymessages))
            if msg_idx != new_idx:
                message_dict = self._ymessages.pop(msg_idx)
                self._ymessages.insert(new_idx, message_dict)

Description

I think that this logic surrounding the raw_time property is not necessary, and is doubling memory & disk use.

Why raw_time may not be necessary

Each Yjs list is guaranteed to be eventually consistent. Each new item in a Yjs list is represented as a doubly-linked list node, that stores a reference to the previous & next items.

If some user A sends a message, it sends a SyncUpdate message to the server extension, which then broadcasts that SyncUpdate to other clients. Therefore, it will take another user B some latency L to receive that update, e.g. 500 ms.

After L time passes, user B will receive the new message from A. Since new items in a list store references to the previous & next items in the list, any new message B sends is guaranteed to follow the previous message from A.

An edge case does occur when B sends a new message within that latency L, e.g. B sends a new message just 5 ms after A sends a new message. When this occurs, Yjs will decide how the new messages are ordered based on their client ID & client clock, and the order is effectively arbitrary.

However, even in this case, I do not think we should re-order the messages, because it's impossible to know which client sent their message first. Different clients have different latencies, and we can't trust client-side timestamps unless we somehow implement NTP over WebSockets.

  • Example: user A may live in the US and have a latency of 50 ms while user B may live in Australia and have a latency of 5000 ms. If user B sends a message at t = 0 ms, user A can send a message at t = 4000 ms, and the server will receive A's message first, even though B sent their message first.

Why raw_time may be doubling memory & disk use

All Yjs documents are append-only. When we "delete" an item from a list (e.g. deleting a previous message), Yjs is just marking that item as "deleted" but keeping it in the YDoc. This is necessary to resolve edge cases where user A deletes a message at the same time user B adds a new one.

Therefore, deleting & re-adding every new user message will double memory & disk use, since every new message now produces 2 Yjs items: one with raw_time=True (that gets quickly marked as deleted), and another with raw_time=False (that persists and shows in the chat UI).

Also relates to #203.

Other quirks caused by raw_time

This also impacts the event listener since 2 new message insert events are created on every new message (one with raw_time=True and another with raw_time=False. See this review comment: https://github.com/jupyterlab/jupyter-ai/pull/1324/files/49aaa3908d8f755c2921944f28f3a3566035d960#r2046884004

Proposed Solution

  • Drop raw_time from Message.
  • Update Message.time to be optional, e.g. time: Optional[float] = None.
  • Update _set_timestamp() to just set Message.time as soon as it receives a message, while not deleting & re-ordering messages.

After this change, timestamps may be out-of-order in rare cases. I don't think we have verified whether this will happen, so we should remove the existing logic to see what happens. If Yjs handles this well and timestamps are consistent 99.99% of the time, then we don't need to do anything. Otherwise, we can fix this later in a more performant way that doesn't double our memory & disk use.

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions