Beware: Garbage In Means Garbage Out

September 15, 2015

This is the Age of Information: we have more data about more things – and people – than at any time in human history. And, with technology, we can access that data … which should make our marketing and our businesses better, because we know a lot more about our customers. Right?

Wrong. We are in danger of drowning in data because we cannot interpret it and use it to make good business decisions.

Let’s look at the Internet and other forms of digital marketing, for instance. Proponents of digital claim that the medium is the one which offers the best feedback, because the data is available, instantly. But precisely what data is available, how clean is it, and how much does it really tell you about your consumer? Big data is usually also "dirty data".

Data is very simple: you either have it or you don’t; it is either accurate or it is not.

When you don't have the data, or the data is highly inaccurate or "dirty", no statistical technique in the world can generate an analysis you can trust for decision-making.

We have a word for it: GIGO – Garbage In, Garbage Out.

Data is a profoundly human artefact – you can’t step out into a field and simply gather data as you would pick a bunch of grapes.

Instead, you have to determine what you want measured, how it should be measured and then put in procedures to measure it, prioritising, of course, the data that is most important for your business.

There is a surprising amount of inaccurate or “dirty” data out there, and companies must be sure they are making business decisions on accurate or cleaned data.

Dirty data is often caused by human error. Sometimes it is a design fault: the system does not verify the data it accepts well enough.

Some error is almost unavoidable: for instance, humans will make spelling mistakes – which is why Google provides a drop-down list surmising what correctly spelled word or words you are searching on as you type.

Other errors relate to the fact that humans will always cheat or take shortcuts when they can. Sales clerks, spurred by the desire for commission, rapidly learn they can get around the need for a client ID by entering their own ID, over and over and over again, if the system enables duplication without verification. This generates dirty data that is almost impossible to clean. Sometimes, it is more efficient and effective to discard dirty data than even to attempt to clean it … just because you have data doesn’t mean you should use it.
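Reused IDs of this kind can at least be flagged before any analysis. The following is a minimal sketch, not a definitive check: the field names `clerk_id` and `client_id`, the function `suspect_records` and the 50% threshold are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

def suspect_records(sales, max_share=0.5):
    """Return the positions of sales whose client ID looks reused.

    A sale is flagged when the client ID equals the clerk's own ID, or when
    one client ID accounts for more than max_share of that clerk's sales.
    Both rules are assumed heuristics, not an established standard.
    """
    # Count how often each clerk entered each client ID.
    per_clerk = defaultdict(Counter)
    for sale in sales:
        per_clerk[sale["clerk_id"]][sale["client_id"]] += 1

    flagged = []
    for index, sale in enumerate(sales):
        counts = per_clerk[sale["clerk_id"]]
        share = counts[sale["client_id"]] / sum(counts.values())
        if sale["client_id"] == sale["clerk_id"] or share > max_share:
            flagged.append(index)
    return flagged

# Invented records for illustration only.
sales = [
    {"clerk_id": "C01", "client_id": "C01"},
    {"clerk_id": "C01", "client_id": "C01"},
    {"clerk_id": "C01", "client_id": "K17"},
    {"clerk_id": "C02", "client_id": "K18"},
    {"clerk_id": "C02", "client_id": "K19"},
]
print(suspect_records(sales))
```

Flagging only tells you which records to distrust; it cannot recover the true client behind them, and a clerk with very few sales will trip the share rule by chance. That is why discarding may still be the better option.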

Bad data inevitably leads to bad decisions, while acknowledging you don’t have enough of the data you need goes a long way to fixing the problem.

Sometimes the data is accurate, but it doesn’t tell you what you think it does.

Take, for instance, something as simple as web traffic. It’s easy to measure, but what are you measuring? Traffic emanates from both humans and computers (bots or spiders), so you need a way of splitting this data into groups: human traffic this side, computer-generated traffic that side.

The behaviour of humans and bots differs (bots typically spend only a fraction of the time on a site that humans do, for instance), so it is possible to filter out the non-human “users”.

The technical term is "disaggregation": just as you segment consumers into groups of interest, so you must categorise your data into groups of interest.
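As a rough illustration of splitting traffic into human and computer-generated groups, here is a minimal sketch. The field names `user_agent` and `duration_seconds`, the marker list and the two-second cut-off are assumptions; a real analytics package will have its own, more thorough, bot filters.

```python
# Assumed substrings that self-declared bots often carry; not an exhaustive list.
BOT_MARKERS = ("bot", "spider", "crawler")

def looks_like_bot(session, min_seconds=2.0):
    """Guess whether a session is computer-generated.

    Two assumed signals: the user agent declares itself a bot, or the
    visit was far shorter than a human visit would normally be.
    """
    agent = session["user_agent"].lower()
    declared = any(marker in agent for marker in BOT_MARKERS)
    too_brief = session["duration_seconds"] < min_seconds
    return declared or too_brief

def split_traffic(sessions, min_seconds=2.0):
    """Disaggregate sessions into (humans, bots)."""
    humans, bots = [], []
    for session in sessions:
        group = bots if looks_like_bot(session, min_seconds) else humans
        group.append(session)
    return humans, bots

# Invented sessions for illustration only.
sessions = [
    {"user_agent": "Mozilla/5.0 (Windows NT 10.0)", "duration_seconds": 95.0},
    {"user_agent": "Googlebot/2.1", "duration_seconds": 0.4},
    {"user_agent": "Mozilla/5.0 (Macintosh)", "duration_seconds": 0.3},
]
humans, bots = split_traffic(sessions)
print(len(humans), "human,", len(bots), "bot")
```

A duration cut-off will also catch genuine visitors who leave immediately, so the threshold is a judgment call that should be checked against your own traffic.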

Another issue with traffic is that of simple geography. If you are a local company, traffic from outside your country or region is generally not useful to you. Do make sure these numbers are separated out before you undertake any analyses.
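Separating out traffic by geography can be sketched in the same way. The `country` field, the two helper functions and the country codes below are hypothetical and chosen purely for illustration.

```python
from collections import Counter

def sessions_by_country(sessions):
    """Count sessions per country code."""
    return Counter(session["country"] for session in sessions)

def local_sessions(sessions, home_countries):
    """Keep only the sessions that originate in the countries you serve."""
    return [s for s in sessions if s["country"] in home_countries]

# Invented sessions for illustration only.
sessions = [
    {"country": "ZA", "pages_viewed": 5},
    {"country": "ZA", "pages_viewed": 2},
    {"country": "US", "pages_viewed": 1},
    {"country": "DE", "pages_viewed": 1},
]
print(sessions_by_country(sessions))
local = local_sessions(sessions, home_countries={"ZA"})
print(len(local), "of", len(sessions), "sessions are local")
```

Running any further analysis on `local` rather than on all sessions keeps out-of-region visits from inflating the numbers.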

A web analytics package, like the one offered by Google, provides a veritable cornucopia of data and analytics.

But to understand it fully, you have to work with the data, transforming it into something that is useful to your business and answers your business questions. You must use its tools to separate your data into the categories of interest.

If you don’t do this, then don’t expect the data to help you make good business decisions.

Web data has a number of limits in terms of consumer information.

One limit is the problem of online privacy. There are a number of online identification and tracking technologies available, from cookies to tracking pixels in emails, which help you gain insight into your potential or actual clients.

However, collecting some types of information passively without explicit, prior, informed consent violates consumer research ethics and may be illegal in many EU countries.

Both marketing and marketing research require consumer confidence, and the cost of client distress, should clients discover what they may consider “spyware”, is simply not worth any potential benefit.

In some parts of the world, even Internet addresses are considered personally identifiable information.

Consumers are also becoming more savvy at protecting their data. There are a number of techniques that stop companies harvesting it, from deleting browsing history regularly, to blocking specific cookies while enabling others, to ensuring images are not automatically downloaded in emails so that tracking pixels never load.

This means that if a large enough number of consumers avoid being tracked, the sample showing up in your analytics is not representative. You might lose out on information about the demographics and interests of the people you wish to target, and end up making decisions on unrepresentative data.
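A small worked example shows how quickly tracking avoidance can skew what you see. The segment names, visitor counts and blocking rates below are invented for illustration, and `observed_shares` is a hypothetical helper, not part of any analytics package.

```python
def observed_shares(segments):
    """Map each segment to its share of the traffic analytics can still see.

    segments maps a name to (visitors, blocking_rate), where blocking_rate
    is the assumed fraction of that segment that avoids being tracked.
    """
    tracked = {
        name: visitors * (1.0 - blocking_rate)
        for name, (visitors, blocking_rate) in segments.items()
    }
    total = sum(tracked.values())
    return {name: count / total for name, count in tracked.items()}

# Invented numbers: segment_a is 40% of real visitors, segment_b is 60%.
segments = {"segment_a": (400, 0.5), "segment_b": (600, 0.1)}
for name, share in observed_shares(segments).items():
    print(f"{name}: {share:.0%} of tracked traffic")
```

With these assumed rates, the segment that makes up 40% of real visitors appears as only about 27% of tracked traffic, so any decision based on the tracked mix would under-weight it.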

Despite the wealth of data and the apparent ease with which companies can get hold of it, the very nature of data requires that you interrogate it. Interpreting lots of dirty data on the complex, multivariate creatures that are humans requires a specialist skill set.

Finally, you must also relate the data to the outcome of interest. Facts are stubborn; no amount of data manipulation can conjure up sales and customers, but it’s amazing how bedazzling other, meaningless metrics can be.

Kathryn Kure undertakes independent, third-party analysis of web data and analytics. She is an independent member of SAMRA. www.datamyna.com

This article was first published on page 17 of the Saturday Star of 12 September 2015, and is republished with permission.
