Utf16 bom support #4326

joeleonjr · 2025-07-21T18:28:57Z

Description:

Secrets in UTF-16-encoded files are not always detected, because the extractSubstrings() function in the UTF-8 decoder modifies the chunk data.

In engine.go, TruffleHog loops over each decoder, passing the Chunk Data in for processing. The UTF-8 decoder runs first. If the data chunk is invalid UTF-8, the UTF-8 decoder executes the function extractSubstrings(). The result of that function is written back to the Chunk's Data field, which is then passed into all subsequent decoders. Part of that function alters the byte structure of valid UTF-16 data, which makes some secrets impossible to detect.
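The effect can be reproduced in isolation. The sketch below is not TruffleHog code: chunk, printableRuns, and seenBySecondDecoder are hypothetical names, and printableRuns assumes the filtering keeps only runs of printable ASCII, which may differ from what extractSubstrings() really does. It only illustrates how a rewrite of a shared Data field changes what the next decoder receives.

```go
package main

import (
	"bytes"
	"fmt"
)

// chunk is a hypothetical stand-in for the engine's chunk type.
type chunk struct{ Data []byte }

// printableRuns is an assumed approximation of a substring filter:
// it splits on every non-printable byte and joins the pieces with newlines.
func printableRuns(b []byte) []byte {
	fields := bytes.FieldsFunc(b, func(r rune) bool {
		return r < 32 || r > 126
	})
	return bytes.Join(fields, []byte("\n"))
}

// seenBySecondDecoder runs the lossy first pass on the shared chunk and
// returns the bytes the next decoder in the loop would receive.
func seenBySecondDecoder(c *chunk) []byte {
	c.Data = printableRuns(c.Data) // first decoder overwrites the shared field
	return c.Data                  // second decoder gets the rewritten bytes
}

func main() {
	// "key" encoded as UTF-16LE, preceded by the FF FE byte order mark.
	original := []byte{0xFF, 0xFE, 'k', 0x00, 'e', 0x00, 'y', 0x00}
	got := seenBySecondDecoder(&chunk{Data: original})
	fmt.Printf("before: % x\n", original)
	fmt.Printf("after:  % x\n", got)
}
```

In this sketch the BOM and the interleaved 0x00 bytes are gone after the first pass, so a UTF-16 decoder running second no longer has valid UTF-16 input.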

Here's an example to test out:

echo <VALID_DETECTABLE_SECRET> > secret.txt
printf '\xFF\xFE' > utf16le.txt && iconv -f UTF-8 -t UTF-16LE secret.txt >> utf16le.txt
printf '\xFE\xFF' > utf16be.txt && iconv -f UTF-8 -t UTF-16BE secret.txt >> utf16be.txt
trufflehog filesystem utf*

Originally, I thought the problem was that we did not handle the UTF-16 Byte Order Marks (BOM) 0xFEFF and 0xFFFE. However, the existing logic in the utf16ToUTF8 function in utf16.go already takes care of those. I added two test cases to prove that.
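For reference, BOM-aware decoding can be sketched with the standard library alone. decodeUTF16WithBOM below is a hypothetical helper written for illustration; it is not the utf16ToUTF8 function from utf16.go, and its fallback to little-endian for input without a BOM is an assumption.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"unicode/utf16"
)

// decodeUTF16WithBOM is a hypothetical helper, not TruffleHog's utf16ToUTF8.
// It uses the BOM to choose a byte order, skips it, and decodes the rest.
// Input without a BOM is treated as little-endian (an assumption).
func decodeUTF16WithBOM(b []byte) string {
	var order binary.ByteOrder = binary.LittleEndian
	if len(b) >= 2 {
		switch {
		case b[0] == 0xFF && b[1] == 0xFE: // little-endian BOM
			b = b[2:]
		case b[0] == 0xFE && b[1] == 0xFF: // big-endian BOM
			order = binary.BigEndian
			b = b[2:]
		}
	}
	// Read two bytes at a time; a trailing odd byte is ignored.
	units := make([]uint16, 0, len(b)/2)
	for i := 0; i+1 < len(b); i += 2 {
		units = append(units, order.Uint16(b[i:i+2]))
	}
	return string(utf16.Decode(units))
}

func main() {
	le := []byte{0xFF, 0xFE, 'k', 0x00, 'e', 0x00, 'y', 0x00}
	be := []byte{0xFE, 0xFF, 0x00, 'k', 0x00, 'e', 0x00, 'y'}
	fmt.Println(decodeUTF16WithBOM(le))
	fmt.Println(decodeUTF16WithBOM(be))
}
```

Both inputs decode to the same string, which matches the observation that the BOM itself is not what prevents detection.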

The only change needed is creating a copy of the chunk prior to processing each decoder.
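A minimal sketch of that idea, assuming a simplified chunk type and decoder signature (both hypothetical, not the types used in engine.go): each decoder gets its own copy of the bytes, so a rewrite by one decoder is never visible to the next.

```go
package main

import "fmt"

// chunk is a hypothetical stand-in for the engine's chunk type.
type chunk struct{ Data []byte }

// decoder is an assumed signature: a decoder may rewrite the chunk it is given.
type decoder func(c *chunk)

// runDecoders hands every decoder a private copy of the original chunk.
func runDecoders(original chunk, decoders []decoder) {
	for _, d := range decoders {
		c := original                                // copies the struct fields
		c.Data = append([]byte(nil), original.Data...) // copies the bytes as well
		d(&c)
	}
}

func main() {
	original := chunk{Data: []byte{0xFF, 0xFE, 'k', 0x00}}
	var seen []byte

	// First decoder scribbles over its input in place.
	lossy := func(c *chunk) {
		for i := range c.Data {
			c.Data[i] = 'x'
		}
	}
	// Second decoder records what it was given.
	recorder := func(c *chunk) { seen = append([]byte(nil), c.Data...) }

	runDecoders(original, []decoder{lossy, recorder})
	fmt.Printf("second decoder saw: % x\n", seen)
}
```

The cost is one allocation and copy per decoder per chunk, which is the expense the alternatives below try to avoid.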

If that change is too expensive, I have 2 other ideas:

1. Move extractSubstrings out of the UTF-8 decoder and invoke it directly in engine.go, before running FindDetectorMatches, when the UTF-8 decode fails.
2. Store the results of that function in a separate variable for later processing in FindDetectorMatches.

joeleonjr added 2 commits July 21, 2025 13:58

added UTF-16 BOM support

dc23e71

removed BOM removal; doesn't make a difference

3e14971
