The WFP algorithm can be configured through two main parameters: ‘Gram’ and ‘Window’. The values chosen directly affect output uniformity and footprint, and therefore the performance and quality of results.
To find a suitable configuration, we ran tests with different values across different programming languages and applications. Some of the results are presented below.
The smaller the gram value, the lower the output uniformity and the higher the probability of data collisions. For example, with a gram value of 4, the fingerprint for the word ‘else’ would become very common, since that word appears frequently in many programming languages. The larger the gram value, however, the less likely it is that matches will be found in modified code.
The larger the window, the smaller the output footprint, but also the lower the chance of finding matches in modified code.
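The interaction between the two parameters can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the actual WFP implementation: it assumes a winnowing-style scheme in which every gram of the input is hashed and the minimum hash of each sliding window is kept as a fingerprint. The function names gram_hashes and winnow are hypothetical, CRC32 is used merely as a convenient stand-in hash, and any input normalization is omitted.

```python
import zlib


def gram_hashes(text: str, gram: int) -> list[int]:
    """Hash every run of `gram` consecutive bytes (CRC32 is a stand-in hash)."""
    data = text.encode("utf-8")
    return [zlib.crc32(data[i:i + gram]) for i in range(len(data) - gram + 1)]


def winnow(text: str, gram: int, window: int) -> list[int]:
    """Keep the minimum gram hash of each sliding window of `window` hashes.

    Assumption: ties are broken by taking the rightmost minimum, and a hash
    is emitted only once while it remains the selected one.
    """
    hashes = gram_hashes(text, gram)
    fingerprints = []
    last_pos = -1
    for start in range(len(hashes) - window + 1):
        win = hashes[start:start + window]
        smallest = min(win)
        # Position of the rightmost occurrence of the minimum in this window.
        pos = start + (window - 1 - win[::-1].index(smallest))
        if pos != last_pos:
            fingerprints.append(smallest)
            last_pos = pos
    return fingerprints
```

In this sketch, a window of 1 keeps every gram hash, while a larger window keeps only a subset of them, which is why the footprint shrinks as the window grows. Inputs shorter than the gram or window size produce no fingerprints at all.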
Uniformity and footprint
Uniformity and footprint are the two resulting factors evaluated when testing different configurations for gram and window.
To evaluate footprint, we simply count the number of fingerprints generated in the output. The graphs below illustrate how footprint is affected by different combinations of gram and window:
To evaluate uniformity, we establish a uniformity index: a factor indicating how many times the most common fingerprint repeats relative to the least common one. For example, if the least frequent fingerprint appears two times while a given fingerprint appears 10 times, that fingerprint has a uniformity factor of 5 for the exercise. Therefore, the lower the uniformity index, the greater the output uniformity.
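Both measurements are simple to compute once the fingerprints are available. The following is a minimal sketch, assuming the output is given as a plain list of fingerprint values; the function names footprint and uniformity_index are hypothetical and chosen only for this illustration.

```python
from collections import Counter


def footprint(fingerprints: list[int]) -> int:
    """Footprint: the number of fingerprints generated in the output."""
    return len(fingerprints)


def uniformity_index(fingerprints: list[int]) -> float:
    """Ratio between the counts of the most and the least common fingerprint."""
    counts = Counter(fingerprints)
    if not counts:
        raise ValueError("at least one fingerprint is required")
    return max(counts.values()) / min(counts.values())


# One fingerprint appearing 10 times and another appearing twice,
# as in the example in the text.
sample = [0xA1] * 10 + [0xB2] * 2
print(footprint(sample))         # 12
print(uniformity_index(sample))  # 5.0
```

An index of 1.0 means that every fingerprint appears equally often, which is the most uniform output possible under this definition.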
The graphs below illustrate how different combinations of gram and window affect uniformity:
Based on these exercises and comparison tests, we concluded that gram=30 and window=64 provide a good balance between footprint and uniformity, and this configuration has so far proven to provide good matching capabilities.
Questions? Suggestions for a different algorithm? Concerns? Please do not hesitate to contact us.