Where Does Dirty Data Originate?

The importance of data quality cannot be overstated. For marketers, it starts with educating ourselves and our team as to what dirty data looks like and where it originates. Here’s a close look at where to direct your lens.

Bad List Buys

Looking to boost your marketing list in a snap? Buying those contacts from a list-purchasing company might appear too good to be true. That’s because it is.

SiriusDecisions reports that, on average, purchased contact lists are 15 months old at the point of sale, at best – and they continue to deteriorate at a rate of about 25% per year. While buying email lists is tempting, the long-term consequences aren’t worth the short-term fix. One bad apple is all it takes to contaminate your list. As HubSpot blogger Corey Eridon writes, “One customer’s ill-gotten email address list can poison the deliverability of the other customers on that shared IP address.”
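
To put that decay rate in perspective, here is a rough sketch of how much of a list might still be accurate over time. It assumes the 25% annual loss compounds at a constant rate, which is an assumption of this example and not something the SiriusDecisions figure states. The function name `fraction_still_valid` is hypothetical.

```python
def fraction_still_valid(months_elapsed, annual_decay=0.25):
    """Estimate the share of records still accurate after some months.

    Assumes a constant, compounding annual decay rate (an assumption of
    this sketch, not a figure taken from the article's sources).
    """
    if months_elapsed < 0:
        raise ValueError("months_elapsed must be non-negative")
    return (1.0 - annual_decay) ** (months_elapsed / 12.0)


if __name__ == "__main__":
    for months in (12, 24, 36):
        share = fraction_still_valid(months)
        print(f"After {months} months: about {share:.0%} of records still valid")
```

Under these assumptions, a list loses close to half of its accurate records within two years of purchase, before counting the age it already had at the point of sale.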

Something else to consider when buying from a list broker: the source. Data that comes straight from the original source, the actual contact, is ideal, but very few data providers offer self-reported contact data and contact attributes. These list companies almost always compile data – they take one source, butt it up against another, then another, and at some point decide which data points are most accurate and merge it all together. If it isn’t compiled data, it’s probably crowdsourced, meaning the data comes from other individuals who claim that a record is accurate or inaccurate based on information that may or may not be closer to the original source (the contact). Typically, rewards and incentives are put in place to encourage customers to correct records when they are wrong. The accuracy of data is NOT a democracy – just because someone says that a record is accurate does not mean it is.

Natural Decay: Expired Records

In a workforce that is more transient than ever, a database will naturally lose contacts to decay because people change jobs, companies go out of business, and mergers happen. To compound the issue, it’s happening faster than ever, with the average position held for less than 2 years. In 2015, it is totally acceptable to change jobs after 3 years, but think about the numbers – if everyone in a database of 100k+ changes jobs within 2-5 years, then it is expiring fast…almost hourly.
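
The "almost hourly" claim can be checked with some quick arithmetic. The sketch below assumes job changes are spread evenly across the year, which is a simplification, and uses a hypothetical helper named `job_changes_per_hour`.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def job_changes_per_hour(database_size, years_between_changes):
    """Average number of contacts changing jobs each hour.

    Assumes every contact changes jobs once per interval and that the
    changes are spread evenly over time (both simplifications).
    """
    if years_between_changes <= 0:
        raise ValueError("years_between_changes must be positive")
    changes_per_year = database_size / years_between_changes
    return changes_per_year / HOURS_PER_YEAR


if __name__ == "__main__":
    for years in (2, 5):
        rate = job_changes_per_hour(100_000, years)
        print(f"Job change every {years} years: {rate:.1f} records expire per hour")
```

For a database of 100,000 contacts, this works out to somewhere between 2 and 6 expiring records every hour, depending on whether people move on after 5 years or after 2.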

Keeping tabs on data before it goes bad is paramount to success. “The problem of data decay is that it’s faster than it’s ever been,” says Sam Zales, president of ZoomInfo, adding that 71 percent of business cards you collect have at least one change within 12 months.

The bottom line is that there is no avoiding it: databases everywhere are expiring every day. The best solution is to first accept that and then understand why it happens, so that data cleansing can be more comprehensive and more frequent.

Individuals in the U.S. hold an average of 11.7 jobs from ages 18 – 48

Bureau of Labor Statistics, U.S. Department of Labor (2015)

Single Source & Multi Source Problems

Dirty data can occur within a single set of records or between multiple sets of data that rely on each other and need to be merged. These two major categories of data quality problems are referred to as single source problems and multi source problems, and both can be resolved by cleansing the data.

Single Source Problems

Schema Level Problems: Problems that stem from the design and standardization of the database itself; schema level problems will be reflected in instance level problems. Typically these problems occur when a database is not properly engineered or has poor integrity constraints.

Examples:

  • Illegal values – values are outside of the domain range
  • Uniqueness violations – duplication of unique fields, like a Social Security number
  • Referential integrity issues – a field references a record that is not defined
  • Violated attribute dependencies – one field depending on another field incorrectly, like city and zipcode
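
Several of the schema level problems above can be prevented by declaring integrity constraints up front. Here is a minimal sketch using Python's built-in `sqlite3` module. The table names, the use of email as the unique field, and the 16 to 120 age range are all hypothetical choices made for illustration. Attribute dependencies such as city and zip code are not covered, since they usually need a lookup table.

```python
import sqlite3


def create_database():
    """Build an in-memory database whose schema guards against bad records."""
    conn = sqlite3.connect(":memory:")
    # SQLite only enforces foreign keys when this pragma is switched on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(
        """
        CREATE TABLE company (
            id   INTEGER PRIMARY KEY,
            name TEXT NOT NULL
        );
        CREATE TABLE contact (
            id         INTEGER PRIMARY KEY,
            email      TEXT NOT NULL UNIQUE,                    -- uniqueness
            age        INTEGER CHECK (age BETWEEN 16 AND 120),  -- legal value range
            company_id INTEGER NOT NULL REFERENCES company(id)  -- referential integrity
        );
        INSERT INTO company (id, name) VALUES (1, 'Example Co');
        INSERT INTO contact (email, age, company_id)
            VALUES ('ann@example.com', 34, 1);
        """
    )
    return conn


def try_insert(conn, email, age, company_id):
    """Attempt to add a contact and report whether the schema allowed it."""
    try:
        conn.execute(
            "INSERT INTO contact (email, age, company_id) VALUES (?, ?, ?)",
            (email, age, company_id),
        )
        conn.commit()
        return "accepted"
    except sqlite3.IntegrityError as exc:
        return f"rejected: {exc}"


if __name__ == "__main__":
    db = create_database()
    print(try_insert(db, "ann@example.com", 40, 1))   # duplicate email
    print(try_insert(db, "bob@example.com", 200, 1))  # age outside the allowed range
    print(try_insert(db, "cat@example.com", 30, 99))  # company 99 does not exist
    print(try_insert(db, "dan@example.com", 30, 1))   # valid record
```

The point of the design is that the database itself refuses the bad record, so the problem never reaches the instance level.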

Instance Level Problems:

Errors, inconsistencies and inaccuracies in the actual contents of each record; these problems are not visible at the schema level. Instance level errors encompass a wider range of inconsistencies that reflect conflicts in the data itself rather than in the database structure.

Examples:

  • Misspellings – typos or phonetic errors
  • Missing values – unavailable values during data entry
  • Embedded values – multiple values entered into one attribute field, like including
  • Misfielded values – value is in the incorrect field, e.g. city = Virginia
  • Violated attribute dependencies – one field depending on another field incorrectly, like city and zipcode
  • Duplicated records – same contact shown twice due to data entry or merging errors
  • Contradicting records – duplicate contact with different values
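
A few of these instance level problems can be flagged with a simple audit script. The sketch below checks for missing values, a state name sitting in the city field, duplicated records, and contradicting records. It assumes records are plain dictionaries with hypothetical `name`, `email` and `city` fields, and it treats a shared email address as a sign that two rows describe the same contact. Misspellings and embedded values need fuzzier matching and are left out.

```python
from collections import defaultdict

FIELDS = ("name", "email", "city")
STATE_NAMES = {"virginia", "georgia", "texas"}  # tiny illustrative subset


def clean(value):
    """Normalize a value so trivial differences do not hide a match."""
    return (value or "").strip().lower()


def audit(records):
    """Return (row, issue) pairs for a list of contact dictionaries."""
    issues = []
    rows_by_email = defaultdict(list)
    for row, record in enumerate(records):
        for field in FIELDS:
            if not clean(record.get(field)):
                issues.append((row, f"missing {field}"))
        if clean(record.get("city")) in STATE_NAMES:
            issues.append((row, "misfielded city"))
        email = clean(record.get("email"))
        if email:
            rows_by_email[email].append(row)
    # Rows sharing an email are either exact duplicates or contradictions.
    for rows in rows_by_email.values():
        first = rows[0]
        for row in rows[1:]:
            same = all(
                clean(records[first].get(f)) == clean(records[row].get(f))
                for f in FIELDS
            )
            label = "duplicate of" if same else "contradicts"
            issues.append((row, f"{label} row {first}"))
    return issues


if __name__ == "__main__":
    contacts = [
        {"name": "Ann Lee", "email": "ann@example.com", "city": "Atlanta"},
        {"name": "Ann Lee", "email": "ANN@example.com ", "city": "Atlanta"},
        {"name": "Ann Lee", "email": "ann@example.com", "city": "Boston"},
        {"name": "Bob Ray", "email": "", "city": "Virginia"},
    ]
    for row, issue in audit(contacts):
        print(f"row {row}: {issue}")
```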

Multi Source Problems

When multiple data sources need to be merged in a data warehouse, the need for data cleansing increases significantly. This is because the sources often contain the same data in different representations or they overlap or contradict each other.
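
As a small illustration of what "the same data in different representations" can look like, the sketch below maps two hypothetical sources onto one common layout and then merges them, flagging any fields where the sources disagree. The source layouts (`full_name`, `mail`, `first_name` and so on) and the function names are invented for this example, and matching on email address alone is an assumption.

```python
def from_crm(row):
    """Hypothetical source A: names stored as 'Last, First'."""
    last, first = [part.strip() for part in row["full_name"].split(",", 1)]
    return {
        "first": first,
        "last": last,
        "email": row["mail"].strip().lower(),
        "title": row.get("job_title", "").strip(),
    }


def from_events(row):
    """Hypothetical source B: separate name fields, differently named columns."""
    return {
        "first": row["first_name"].strip(),
        "last": row["last_name"].strip(),
        "email": row["Email"].strip().lower(),
        "title": row.get("title", "").strip(),
    }


def merge_sources(primary, secondary):
    """Merge two lists of normalized records, keyed on email address.

    Blank fields in the primary record are filled from the secondary one.
    Fields where both sources hold different values are reported as
    conflicts instead of being silently overwritten.
    """
    merged = {record["email"]: dict(record) for record in primary}
    conflicts = []
    for record in secondary:
        current = merged.get(record["email"])
        if current is None:
            merged[record["email"]] = dict(record)
            continue
        for field, value in record.items():
            if not current[field]:
                current[field] = value
            elif value and value.lower() != current[field].lower():
                conflicts.append((record["email"], field, current[field], value))
    return merged, conflicts


if __name__ == "__main__":
    crm = [from_crm({"full_name": "Lee, Ann", "mail": "Ann@Example.com",
                     "job_title": "CMO"})]
    events = [
        from_events({"first_name": "Ann", "last_name": "Lee",
                     "Email": "ann@example.com ", "title": "VP Marketing"}),
        from_events({"first_name": "Bob", "last_name": "Ray",
                     "Email": "bob@example.com", "title": ""}),
    ]
    merged, conflicts = merge_sources(crm, events)
    print(f"{len(merged)} contacts after merge")
    for email, field, ours, theirs in conflicts:
        print(f"{email}: {field} is '{ours}' in one source and '{theirs}' in the other")
```

Reporting conflicts rather than picking a winner automatically leaves the decision about which source to trust with a person, which matters when neither source is the original contact.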

Schema Level & Instance Level

When a database has both instance and schema issues, it means that not only are there inaccuracies within some of the records, but there are also structural problems with the database itself. Schema problems almost always create instance problems, so if both are occurring, the flawed structure of the database is creating inaccuracies in certain records. But it also means that some instance inaccuracies are not caused by schema issues at all; they are caused by data entry errors, naturally expired data, bad list buys, etc.

Data Cleansing Solutions by Social123

Social123 offers a wide selection of data quality solutions. View our data quality products here or contact us for a complimentary data health check.