Skip to content

Conversation

@lhecker
Copy link
Member

@lhecker lhecker commented Dec 10, 2025

Together with #19640 this bumps the performance on my HW by around 60%.
You can now stream 4K images at >60 FPS on a high-end CPU.

Validation Steps Performed

  • Used imagemagick to dump large images ✅

@lhecker lhecker added the Area-Performance Performance-related issue label Dec 10, 2025
auto repeatAspectRatio = _pixelAspectRatio;
do
{
std::fill_n(imageBufferPtr, repeatCount, foreground);
Copy link
Member Author

@lhecker lhecker Dec 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the same code as before, just with extra indentation. Suppress whitespace changes: https://github.com/microsoft/terminal/pull/19639/files?w=1

Comment on lines +765 to +766
const auto foreground = std::bit_cast<int16_t>(_foregroundPixel);
auto imageBufferPtr = reinterpret_cast<int16_t*>(_imageBuffer.data() + targetOffset);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Initially, I did this int16_t cast, because std::fill contains optimizations for primitive types (those that can be memset'd). To my dismay, I found out today that they added this check for "are all bits zero? if so, use memset": microsoft/STL@a27d894

Comment on lines +773 to +784
// This gets unrolled by MSVC. It's written this way to use CMOV instructions.
// Modern CPUs have fat caches and deep pipelines. It's better to do pointless reads
// from memory than causing branch misprediction, as sixelValue is highly random.
for (int i = 0; i < 6; i++)
{
std::fill_n(imageBufferPtr, repeatCount, _foregroundPixel);
std::advance(imageBufferPtr, _imageMaxWidth);
} while (--repeatAspectRatio > 0);
}
else
const auto test = sixelValue & (1 << i);
// Possibly pointless read from memory, but...
const auto before = imageBufferPtr[i * _imageMaxWidth];
// ...it allows the compiler to turn this into a CMOV (= no branch misprediction).
const auto after = test ? foreground : before;
imageBufferPtr[i * _imageMaxWidth] = after;
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added some comments, but the idea here is that pixelValue is highly random leading to branch mispredictions when bit-testing it. We can avoid that by always loading the old bitmap value and then CMOVing between the old and new value before storing it again. With modern CPUs and their deep pipelines this runs faster.

Comment on lines +800 to +809
// If this used std::fill_n or just a primitive loop, MSVC would compile
// it to a `rep stosw` instruction, which has a high startup cost.
// This is not ideal when our repeatCount is almost always small.
// The way this does ptr++ and len-- is also ideal for optimization.
auto ptr = imageBufferPtr;
auto remaining = repeatCount;
do
{
__iso_volatile_store16(ptr++, foreground);
} while (--remaining != 0);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To work around std::fill being bloated with memset and to work around the compiler inserting a pointless rep stos here, we can use volatile writes. This then compiles down to a very small compact loop = makes CPUs happy.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Area-Performance Performance-related issue

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants