-
Notifications
You must be signed in to change notification settings - Fork 9k
Improve sixel performance #19639
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Improve sixel performance #19639
Conversation
src/terminal/adapter/SixelParser.cpp
Outdated
| auto repeatAspectRatio = _pixelAspectRatio; | ||
| do | ||
| { | ||
| std::fill_n(imageBufferPtr, repeatCount, foreground); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the same code as before, just with extra indentation. Suppress whitespace changes: https://github.com/microsoft/terminal/pull/19639/files?w=1
| const auto foreground = std::bit_cast<int16_t>(_foregroundPixel); | ||
| auto imageBufferPtr = reinterpret_cast<int16_t*>(_imageBuffer.data() + targetOffset); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Initially, I did this int16_t cast, because std::fill contains optimizations for primitive types (those that can be memset'd). To my dismay, I found out today that they added this check for "are all bits zero? if so, use memset": microsoft/STL@a27d894
| // This gets unrolled by MSVC. It's written this way to use CMOV instructions. | ||
| // Modern CPUs have fat caches and deep pipelines. It's better to do pointless reads | ||
| // from memory than causing branch misprediction, as sixelValue is highly random. | ||
| for (int i = 0; i < 6; i++) | ||
| { | ||
| std::fill_n(imageBufferPtr, repeatCount, _foregroundPixel); | ||
| std::advance(imageBufferPtr, _imageMaxWidth); | ||
| } while (--repeatAspectRatio > 0); | ||
| } | ||
| else | ||
| const auto test = sixelValue & (1 << i); | ||
| // Possibly pointless read from memory, but... | ||
| const auto before = imageBufferPtr[i * _imageMaxWidth]; | ||
| // ...it allows the compiler to turn this into a CMOV (= no branch misprediction). | ||
| const auto after = test ? foreground : before; | ||
| imageBufferPtr[i * _imageMaxWidth] = after; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added some comments, but the idea here is that pixelValue is highly random leading to branch mispredictions when bit-testing it. We can avoid that by always loading the old bitmap value and then CMOVing between the old and new value before storing it again. With modern CPUs and their deep pipelines this runs faster.
| // If this used std::fill_n or just a primitive loop, MSVC would compile | ||
| // it to a `rep stosw` instruction, which has a high startup cost. | ||
| // This is not ideal when our repeatCount is almost always small. | ||
| // The way this does ptr++ and len-- is also ideal for optimization. | ||
| auto ptr = imageBufferPtr; | ||
| auto remaining = repeatCount; | ||
| do | ||
| { | ||
| __iso_volatile_store16(ptr++, foreground); | ||
| } while (--remaining != 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To work around std::fill being bloated with memset and to work around the compiler inserting a pointless rep stos here, we can use volatile writes. This then compiles down to a very small compact loop = makes CPUs happy.
Together with #19640 this bumps the performance on my HW by around 60%.
You can now stream 4K images at >60 FPS on a high-end CPU.
Validation Steps Performed