Float Table Finder #8431
PennRobotics started this conversation in Ideas
Would a floating-point data table finder be worth developing?

In the minimal case, here are values I often encounter and their mappings:

| Hex word | Float value |
| --- | --- |
| `3f800000` | 1.0 |
| `41200000` | 10.0 |
| `447a0000` | 1000.0 |
| `3eaaaaab` | 0.33333334 (1/3) |
| `3a83126f` | 0.001 |
| `472c4400` | 44100.0 |
| `40490fdb` | 3.1415927 (π) |
| `4b371b00` | 12000000.0 |
| `4bb71b00` | 24000000.0 |
| `4c371b00` | 48000000.0 |
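Decoding these words is mechanical; here's a quick sketch in Python (a production analyzer would presumably be Java, given Ghidra's API, but the arithmetic is identical):

```python
import struct

def hex_to_float(word: str) -> float:
    """Reinterpret a big-endian 32-bit hex word as an IEEE-754 single."""
    return struct.unpack(">f", bytes.fromhex(word))[0]

# A few of the frequently encountered constants from above
for w in ["3f800000", "41200000", "3eaaaaab", "472c4400", "4b371b00"]:
    print(w, "->", hex_to_float(w))
```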
Sure, it's possible there are instructions or another data type referred to by `472c4400` (it could even be the string `G,D` plus a NUL byte), but if the next value is `473b8000` (as was the case in the last three audio interfaces, all from different manufacturers, whose firmware I've disassembled) then I know I'm looking at a table of sampling rates that is likely to also contain 88200, 96000, and sometimes 192000 or 32000.

I see how it would be easy enough to create a new AbstractAnalyzer and look through a list of pairs (or even just a list of ints), but I don't know enough about Ghidra's data types, Java, or who this would even serve to decide to do it. (It also isn't a showstopper for me. I can usually spot an obvious float table in the hex dump, or I load the binary into Trace32, open a Draw window for a memory region with the %Float specifier, and zoom in and out scanning for anything that looks like a decay, impulse, sine table, or other identifiable sequence. That covers 95% of cases.)
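The sampling-rate case could be sketched as a tiny membership heuristic. Everything here (the rate set, the run-length threshold, the function names) is an illustrative assumption, not a proposed design:

```python
import struct

# Assumed set of audio sampling rates (Hz) worth flagging
KNOWN_RATES = {8000.0, 11025.0, 16000.0, 22050.0, 32000.0, 44100.0,
               48000.0, 88200.0, 96000.0, 176400.0, 192000.0}

def decode_f32(word: int) -> float:
    """Reinterpret a 32-bit big-endian word as an IEEE-754 single."""
    return struct.unpack(">f", word.to_bytes(4, "big"))[0]

def looks_like_rate_table(words, min_run=2):
    """True if min_run consecutive words decode to known sampling rates."""
    run = best = 0
    for w in words:
        run = run + 1 if decode_f32(w) in KNOWN_RATES else 0
        best = max(best, run)
    return best >= min_run

# 44100.0 followed by 48000.0, as in the firmware described above
print(looks_like_rate_table([0x472C4400, 0x473B8000]))
```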
Here are discussion points I propose for a hypothetical Float Table Finder Analyzer:
Depending on the architecture, doing float analysis before code analysis avoids trouble, e.g. on the 8051, where nearly any arbitrary byte sequence forms a valid opcode and many of those opcodes decompile into plausible-looking C. Running Aggressive Instruction Finder before identifying float tables can produce blocks of code that almost look like they make sense, until you are lucky enough to hit a jump into the middle of another mnemonic, get an error, and click back to find that the source of the error is a misidentified float table.
Also, if a float table is already identified before code analysis, it is more likely (at least on some architectures) that a function will have the appropriate argument types defined and will then display call arguments correctly in the Decompiler window.
To illustrate the heuristic question of whether a value converted from ulong to float should be accepted as a valid float: should 0.216 or 0.0000216 be accepted as readily? What about 1/216000000? 1/216? 1/360? Why? Why not? Most importantly, how??
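One crude way to operationalize such an acceptance test; the magnitude bounds here are arbitrary placeholders for whatever heuristic the analyzer would actually use:

```python
import math
import struct

def is_plausible_float(word: int, min_mag=1e-9, max_mag=1e12) -> bool:
    """Reject NaN/infinity and magnitudes outside an assumed
    'plausible engineering value' range; zero passes unconditionally."""
    f = struct.unpack(">f", word.to_bytes(4, "big"))[0]
    if f == 0.0:
        return True
    if not math.isfinite(f):
        return False
    return min_mag <= abs(f) <= max_mag

print(is_plausible_float(0x3F800000))  # 1.0: plausible
print(is_plausible_float(0xFF800000))  # -infinity: rejected
```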
As far as pattern matching, consider you have `3f800000` and then `3f792a30`, `3f728241`, `3f6c01a3`, ... and on the 256th entry, `3a83126f`:
If the next value starts with anything other than `3a`, the table size is a "predictable" 0x100.
This would be the outcome of an exponential decay starting at 1.0 at f[0] and ending at 0.001 at f[255]. The downside is that you wouldn't be able to predict/tabulate the decay rate for every imaginable sequence. A non-exponential (e.g. polynomial) decay can still start at 1.0 and end at 0.001. And what if the user wanted to end at a different value (e.g. 5 time constants) or choose a start value such that all the values collectively sum to 1.0?
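For concreteness, a table with exactly those endpoints is easy to generate; `0.001 ** (i / 255)` is just one of many curves that fit, per the caveat above, so the intermediate words will differ slightly if the original table used a different formula or rounding:

```python
import struct

# 256-entry exponential decay: f[0] = 1.0, f[255] = 0.001
table = [struct.pack(">f", 0.001 ** (i / 255)).hex() for i in range(256)]

print(table[0], table[255])  # endpoints of the hypothetical table
```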
You can see that the first byte of each successive value changes monotonically and rarely: starting from `3f` and eventually reaching `3a`. The second byte is generally within a few counts of its neighbor and usually decreasing (except when the first byte changes), while the entropy of the third and fourth bytes is all over the place.
Because the mantissa isn't byte-aligned, this obviously isn't a perfect method, but you could analyze the exponent and mantissa separately and identify patterns in how much they change and in which direction.
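A sketch of that exponent/mantissa separation; the "steps down by at most one" tolerance is an assumed threshold for illustration, not a tuned rule:

```python
def split_fields(word: int):
    """Split a 32-bit IEEE-754 word into (sign, biased exponent, mantissa)."""
    return (word >> 31) & 1, (word >> 23) & 0xFF, word & 0x7FFFFF

def exponent_decays_smoothly(words) -> bool:
    """True if the biased exponent never increases and never drops
    by more than one between neighbouring entries."""
    exps = [split_fields(w)[1] for w in words]
    return all(0 <= a - b <= 1 for a, b in zip(exps, exps[1:]))

# The example sequence from above: 1.0 followed by slowly decaying values
print(exponent_decays_smoothly([0x3F800000, 0x3F792A30,
                                0x3F728241, 0x3F6C01A3]))
```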
Visually, you'd immediately recognize a decay lookup table if a large selection of the data/code containing the table were graphed as floats, because our visual system does excellent pattern matching, and patterns are just regions of low entropy. In the hex dump/memory viewer, you could also spot the table if you knew what to look for and grouped by 4 bytes. As a list of ?? bytes in the Listing window, it's easy to overlook. If the lookup table is already defined as something else (especially code), it gets harder to realize what it actually is, and it might take a while before you realize something is wrong.
The beauty here is that this method should also pick up things like sine tables, or even a large enough table of clock frequencies (which I've seen from time to time in CMSIS/vendor microcontroller code), because they will show the same low-entropy changes as a decay table.
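The low-entropy claim can be checked cheaply. One proxy is how many distinct leading bytes a window contains; the function name and the contrast with random noise are illustrative assumptions:

```python
import math
import os
import struct

def leading_byte_diversity(words) -> float:
    """Fraction of distinct first bytes in a window of 32-bit words.
    Float tables cluster into a few sign/exponent bytes; code and
    random data spread across many more."""
    return len({(w >> 24) & 0xFF for w in words}) / len(words)

# Quarter-wave sine table (a classic lookup-table shape) vs random noise
sine = [struct.unpack(">I", struct.pack(">f", math.sin(math.pi / 2 * i / 255)))[0]
        for i in range(256)]
noise = [int.from_bytes(os.urandom(4), "big") for _ in range(256)]

print(leading_byte_diversity(sine), leading_byte_diversity(noise))
```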
Then… if such a thing can be successfully implemented and is well received… we do it all again for doubles!