Benford's Law: Information Analysis and BPM, Published in B-Eye-Network,
April 2005
Sometimes what people perceive to be the truth is less than consistent with reality. In the business intelligence world, situations like these often present opportunities for discovery that lead directly to actionable knowledge. One example is a curious observation (in the 1920s) by a General Electric physicist named Frank Benford that led to his description of a counter-intuitive law of logarithmic sizes associated with numeric distributions. This law, now referred to as Benford's Law, states that in data value sets with certain properties, there is a predictable, albeit uneven, distribution of the initial digits of the numbers within the set. In other words, in some number distributions, if you analyze the frequency of the leftmost digit of all the numbers, you are much more likely to find the digit 1 than any other digit, followed by 2, then 3, etc.
Benford's Law applies to data sets that meet certain criteria; for example, the data set should describe sizes of similar items or phenomena, such as populations, lengths and durations.
Ultimately, Benford observed a phenomenon that had been noticed earlier by others, notably 19th-century astronomer Simon Newcomb: for data sets meeting those criteria, the frequencies of the initial digit generally correspond to the probability function P(d) = log10(1 + 1/d), where d represents the initial digit(s). The probabilities for the first digit are shown in Table 1.
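As a minimal sketch, the first-digit probabilities can be computed directly from the probability function above. The function name benford_probability is an illustrative choice, and the logarithm is assumed to be base 10, which is what makes the nine probabilities sum to 1.

```python
import math


def benford_probability(d: int) -> float:
    """Benford probability that a number starts with the digit(s) d.

    Assumes a base-10 logarithm: P(d) = log10(1 + 1/d).
    """
    if d < 1:
        raise ValueError("d must be a positive integer")
    return math.log10(1 + 1 / d)


# Print the probability of each possible first digit, 1 through 9.
for digit in range(1, 10):
    print(f"{digit}: {benford_probability(digit):.1%}")
```

Running this prints about 30.1% for the digit 1, falling to about 4.6% for the digit 9.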
Table 1: Benford's Law Probabilities for Frequency Distribution
This implies that in a Benford data set, a 1 has about a 30 percent chance of being the initial digit, while a 9 has a less than 5 percent chance, as can be seen in Figure 1. This law works due to the logarithmic nature of increasing numbers. For example, take stock prices: there are as many dollar increments in a stock's price between $10 and $20 as there are between $80 and $90, yet to get from $10 to $20 the price of the stock must double, while to go from $80 to $90 the price must increase by only 12.5 percent. Assuming the price grows at a roughly steady percentage rate, it takes longer for the price to double than to increase by a smaller percentage, and consequently the first digit stays at 1 for a longer time than it stays at 8.
The non-randomness of these digit frequencies has led to some interesting uses, most notably in the areas of auditing and fraud detection. For example, a person intending to commit fraud by cooking the books might assume numeric randomness and pepper the incorrect entries with numbers that reflect an equal distribution of digits. But since the numbers in the books meet our specified criteria, even a relatively small number of invalid entries will skew the distribution away from the Benford curve and highlight areas for further exploration.
Are there potential information quality and business intelligence uses in there? Interestingly, Benford analysis is consistent with the concept of data profiling, introducing an alternate dimension across which numeric data can be subjected to frequency analysis. However, the implications of Benford's Law open up other curious doors. Certain kinds of time durations meet the Benford criteria, allowing for both information quality analysis and business process improvement opportunities.
On a simple level, consider that time durations associated with inbound call center operations, such as customer hold times, should conform to Benford's Law. The time between product purchase and initial customer contact should also be consistent with the Benford curve. We might expect that a larger number of people will require help at an earlier stage, while those who have survived without making the call are more likely to go longer without calling. So any variation from the Benford curve might indicate a problem that is occurring with unexpected frequency or a product failure occurring faster than expected. Yet again, variance from the natural expectations may indicate a process or product failure requiring more focused attention.
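The kind of first-digit frequency analysis discussed above, whether applied to ledger amounts or to call center durations, could be sketched as follows. The chi-square goodness-of-fit statistic is one common way to measure deviation from the Benford curve; it is an assumption of this sketch, not a method prescribed by the article. The threshold 15.507 is the standard 5 percent critical value for 8 degrees of freedom, and both sample data sets are hypothetical.

```python
import math
from collections import Counter

# 5% critical value of the chi-square distribution with 8 degrees of freedom.
CHI2_CRITICAL_5PCT_8DF = 15.507


def leading_digit(x: float) -> int:
    """Return the first significant digit of a nonzero number."""
    return int(f"{abs(x):.16e}"[0])


def benford_chi_square(values) -> float:
    """Chi-square distance between observed first digits and Benford's Law."""
    counts = Counter(leading_digit(v) for v in values if v != 0)
    n = sum(counts.values())
    if n == 0:
        raise ValueError("need at least one nonzero value")
    statistic = 0.0
    for d in range(1, 10):
        expected = n * math.log10(1 + 1 / d)
        statistic += (counts[d] - expected) ** 2 / expected
    return statistic


# Hypothetical data: powers of two are known to follow the Benford curve,
# while the integers 100..999 have evenly distributed first digits.
benford_like = [2 ** n for n in range(1, 1001)]
evenly_spread = list(range(100, 1000))

for name, data in [("benford_like", benford_like),
                   ("evenly_spread", evenly_spread)]:
    stat = benford_chi_square(data)
    flag = "investigate" if stat > CHI2_CRITICAL_5PCT_8DF else "consistent"
    print(f"{name}: chi-square = {stat:.2f} ({flag})")
```

A statistic above the threshold does not prove fraud or a process failure; it only marks the data set as one that deserves a closer look.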
There are multiple ways that a natural logarithmic size law like Benford's Law can be applied, as well as alternate mathematical functions and laws that can (and should) be integrated into analytical profiling tools, and I will be exploring these in future articles. There is also definite value in collectively exploring how techniques and methods developed for different industries and applications can be abstracted and integrated into the information integration and semantic convergence process.