ITS 4450 - Fraud Risk Assessment Tools and Investigation

Chapter 6, Data-Driven Fraud Detection

This lesson presents material from chapter 6. Objectives important to this lesson:

  1. Anomalies and fraud
  2. Data analysis
  3. Analysis software
  4. Data access
  5. Analysis techniques
  6. Outliers
  7. Stratification and summarization
  8. Financial statements
Chapter 6

The chapter opens with a story, this time a short one. There are some gaps in the logic of the story, but we can take it as an introduction to using databases and reports to find instances of fraud. It seems to me that it would be more useful to detect false social security numbers when applications are made for employment and driver licenses (yes, that is the plural). Verification of names and their associated social security numbers is available to registered users, which means you have to register before you can use the system. Note that this does not prevent someone from using someone else's identity; it only confirms that the name matches an active number.

The text tells us that data analysis often finds accounting anomalies that are not part of a fraud. An anomaly may simply be due to an error that was not caught earlier. Anomalies can be intentional (probably part of a fraud) or unintentional (probably due to a system or human error). If the anomaly is due to a system problem, such as the printer error discussed in the text, it will occur with some regularity. Instances of such an anomaly will appear throughout the data set. The text advises us that such patterns usually indicate a problem, not fraud. If such anomalies are rare in the data set, that may indicate a crime that has not been covered up in every case. There are exceptions to each of these possibilities, so investigation is called for in any case. A standard logical aphorism applies: if you hear hoof beats, think horses, not zebras. The more common explanation is usually the right one. Except when it isn't.

The text continues with the concept of sampling. It would be nice if we had all the time in the world to do analysis, and all the processing power we might need to do it. Neither of those things is true for anyone I know, so when there is a huge body of data to examine, we need to decide whether sampling, examining random cases from the data, is appropriate for the task. It is not always a good choice. Our author knows that. The text points out that if we pull a 5% sample from a body of data, we are accepting a 95% chance of missing any particular record we are looking for. This may be okay when looking for a system problem that happens a lot. It is not acceptable when looking for fraud, which is typically rare. Fraud investigation should be done with a full-population analysis (all of the data).
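The arithmetic behind that warning can be sketched with the hypergeometric formula: the chance that a simple random sample contains none of the fraudulent records. The population and sample sizes below are hypothetical, chosen only to illustrate the point.

```python
from math import comb

def miss_probability(population, fraud_count, sample_size):
    """Probability that a simple random sample contains NONE of the fraudulent records."""
    return comb(population - fraud_count, sample_size) / comb(population, sample_size)

# 10,000 transactions, one of them fraudulent, examined with a 5% sample:
p1 = miss_probability(10_000, 1, 500)
print(round(p1, 2))  # 0.95 - exactly the 95% chance of missing it

# Even with five fraudulent transactions, the sample still misses all of them
# roughly three times out of four.
p5 = miss_probability(10_000, 5, 500)
```

This is why the text insists on full-population analysis for fraud: rare events and small samples do not mix.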

Given the nature of data analysis, the text points out that fraud detection can be much more proactive when these techniques are used. Figure 6.1 shows us a process with six (actually eight) steps in three phases.

Analytical steps
1. Understand the business - what do they do, how do they do it, and how do they control their processes?
2. Identify possible frauds - where are controls weak? what kinds of fraud could be attempted?
3. Identify symptoms of possible frauds - how would each of the possible frauds appear in the data? what would we see in employee behavior data? (are we tracking that?)
Technology steps
4. Gather data with technology - what are the expected anomalies? what can we find that is unexpected? run queries on databases to extract relevant data
5. Analyze data - find indicators of fraud and other problems, then investigate to make sure
Investigative steps
6. Investigate symptoms - determine whether fraud has actually occurred
7. Create new controls to detect and correct symptoms - the path when no fraud is found
8. Follow up - the path when fraud is found

The process should be considered a cycle that should be repeated on a regular schedule and again whenever needed.

The text briefly discusses three software packages used in data audits. It mentions two others. This is a moving target, in that software changes regularly, and products come and go in every category. When choosing a product, you should look at reviews of currently available software, and you should consult reputable sources, like business partners and trade associations, to find out what products are recommended by people you respect.

The text continues with a discussion of access to data. It includes the idea that an analyst should be happy to have read-only access to data: analysis can still be done, and it prevents accusations that the analyst changed the data.

[Graphic: Benford's Law distribution]
In the section on analysis, the text introduces a theory called Benford's Law. Frank Benford, a physicist, examined a theory proposed by Simon Newcomb, an astronomer. The theory says that the first digits of actual numbers in a set will most often follow a nonrandom distribution. This distribution, shown in the graphic on the right, says that there is about a 30% chance that the first digit of a number will be a 1, about an 18% chance that it will be a 2, and so on down the indicated curve. This is counterintuitive: you might think that the first digits of real, naturally occurring numbers would be more evenly distributed. You can follow the link above to a Wikipedia article about the subject.

The text tells us that financial data often follow Benford's Law, and that examination of financial data sets usually shows this to be true, except where numbers are assigned or where fraud is taking place.
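The expected frequencies come from the formula P(d) = log10(1 + 1/d), and comparing a data set's first digits against them is a common screening step. A minimal sketch, with the function and variable names my own:

```python
import math
from collections import Counter

def benford_expected():
    """Expected first-digit frequencies under Benford's Law: P(d) = log10(1 + 1/d)."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}

def first_digit(n):
    """Leading digit of a positive integer."""
    while n >= 10:
        n //= 10
    return n

def first_digit_frequencies(amounts):
    """Observed first-digit frequencies for a list of positive integer amounts."""
    counts = Counter(first_digit(a) for a in amounts)
    return {d: counts[d] / len(amounts) for d in range(1, 10)}

# Digit 1 is expected about 30.1% of the time, digit 2 about 17.6%.
print(round(benford_expected()[1], 3), round(benford_expected()[2], 3))  # 0.301 0.176
```

A large gap between the observed frequencies of real transaction amounts and the expected curve is a symptom worth investigating, not proof of fraud.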

Benford Analysis

In the image above, we see seven data sets examined for the applicability of Benford's Law. Benford's Law is shown as the black curve. All of the data sets in this example seem to fit, with the exception of lottery numbers, which are meant to be completely random. Lottery numbers fit the horizontal line predicting the same probability for each digit. You can click the image above to follow a link to another article about this concept.

The text discusses outliers, which are values that do not seem to belong to their data set. The text proposes increasingly higher prices for a broom ($10, $25, $100, $1500), asking at what point we would call for an investigation of such charges. To answer the question, we should ask what the average cost of a broom is in the locality where we buy them. If we had the prices for brooms from a couple of dozen vendors, we could run a z-score analysis of each price to show whether a specific price is well outside the data set it is supposed to belong in. This type of analysis assumes that most elements in the data set will fit a bell curve, one in which there are more data points in the middle of a range, and fewer at each end. A z-score tells us how many standard deviations away from the mean a value is. Obviously, this only applies to data expected to fit under a bell curve.
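The broom example can be sketched in a few lines. The prices below are hypothetical, standing in for the "couple of dozen vendors" the text imagines, and the three-standard-deviation cutoff is a common convention, not a rule from the textbook:

```python
from statistics import mean, stdev

def z_scores(values):
    """How many standard deviations each value sits from the mean of the set."""
    mu, sigma = mean(values), stdev(values)
    return [(v - mu) / sigma for v in values]

# Hypothetical broom prices from about two dozen vendors, plus one suspicious charge.
broom_prices = [9.50, 10.25, 11.00, 10.75, 9.95, 10.40] * 4 + [1500.00]

outliers = [p for p, z in zip(broom_prices, z_scores(broom_prices)) if abs(z) > 3]
print(outliers)  # [1500.0]
```

Note that with only a handful of comparison prices the extreme value inflates the standard deviation enough to hide itself; the analysis works better when the stratum is reasonably large.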

The next concept is stratification, which means splitting a data set into separate tables of like cases. The text explains that we could not get a meaningful z-score for brooms if the table in question also included the cost of uniforms, buildings, and parties. We need to examine just the data for like cases, in this example the prices of brooms. The text warns that this method creates lots of tables from a data set. In the same part of the chapter, the author explains summarization, which is easier to understand. Summarization computes a representative statistic for each of the tables formed by stratification, which is why the author explained stratification first. It could, for instance, generate an average price for each product we buy, regardless of the number we buy or the number of vendors we use.
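The two ideas fit together naturally in code: stratification is a group-by, and summarization is an aggregate over each group. A small sketch with made-up purchase records:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical purchase records: (product, unit price) pairs in one flat data set.
purchases = [
    ("broom", 10.25), ("broom", 9.75), ("broom", 10.00),
    ("uniform", 45.00), ("uniform", 47.50),
    ("party", 800.00),
]

# Stratification: split the flat data set into one table (stratum) per product.
strata = defaultdict(list)
for product, price in purchases:
    strata[product].append(price)

# Summarization: one representative statistic per stratum, here the average price.
averages = {product: mean(prices) for product, prices in strata.items()}
print(averages["broom"])  # 10.0
```

With the data stratified this way, a z-score test on `strata["broom"]` compares brooms only to other brooms, which is exactly the point the text makes.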

In the same section, the author introduces time trend analysis. Look at the graph on page 185, which shows how much a specific employee spent every two weeks. If a company has huge peaks in production at certain times of the year (candy companies often do), then there may be higher spending in those periods. The graph in the text shows a growing number of dollars spent, increasing from late October through March. Are we making jelly beans? In this case, there was no good reason for the purchases other than to steal from the company.
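A simple way to quantify a trend like the one on page 185 is a least-squares slope over the periods: a large positive slope with no seasonal explanation is a symptom worth investigating. The spending figures below are invented for illustration, not taken from the text's graph:

```python
# Hypothetical biweekly purchase-card totals for one employee, in dollars.
spend = [220, 240, 310, 390, 480, 620, 790, 950, 1180, 1400]

def trend_slope(values):
    """Least-squares slope of the values against their period index."""
    n = len(values)
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    num = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(values))
    den = sum((x - x_mean) ** 2 for x in range(n))
    return num / den

slope = trend_slope(spend)
print(slope > 0)  # True: spending grows steadily period over period
```

Here the slope works out to well over a hundred dollars of growth per period, the kind of steady climb that deserves a question like "are we making jelly beans?"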

Let's move ahead to financial statements, starting on page 187. This section lists several common statements that companies compile about their operations, as well as some analysis methods commonly used to examine such statements.

  • Comparing the same kinds of statements across time gives us a history of the company from the point of view of that kind of statement.
  • Financial statements can be used to calculate numbers based on them (key ratios), which gives more meaning than the statements themselves. Some of these ratios are shown on page 189.
  • Vertical analysis examines a statement by showing each line as a percentage of the statement's base total, as in the examples on page 190. Two analyses are shown to indicate how each line's percentage increased or decreased across time.
  • Horizontal analysis is done by showing line items for two or more years, and computing the differences in actual numbers, not in percentages of the whole. Both the dollar amounts and the percentages of change are calculated.
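The vertical and horizontal methods above reduce to a few lines of arithmetic. The two-year income statement below is entirely hypothetical, and "Revenue" as the base line for vertical analysis is the usual convention, not something this chapter dictates:

```python
# Hypothetical income-statement lines for two years, in thousands of dollars.
year1 = {"Revenue": 1000, "Cost of goods sold": 600, "Operating expenses": 250}
year2 = {"Revenue": 1100, "Cost of goods sold": 700, "Operating expenses": 260}

def vertical(statement, base="Revenue"):
    """Vertical analysis: each line as a percentage of the base line."""
    total = statement[base]
    return {line: 100 * amount / total for line, amount in statement.items()}

def horizontal(old, new):
    """Horizontal analysis: (dollar change, percent change) per line across years."""
    return {line: (new[line] - old[line], 100 * (new[line] - old[line]) / old[line])
            for line in old}

print(vertical(year1)["Cost of goods sold"])  # 60.0  (60% of revenue)
print(horizontal(year1, year2)["Revenue"])    # (100, 10.0)  ($100K, up 10%)
```

Comparing the vertical percentages across the two years would then show, for instance, that cost of goods sold grew faster than revenue, which is the sort of shift these analyses are meant to surface.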


  1. Continue the reading assignments for the course.
  2. Complete the assignments and class discussion made in this module.