We’ve all heard of “garbage in, garbage out”, so even though the primary focus of any statistical analysis will be the outcome, that outcome will be next to useless without the proper raw materials. Indeed, many analysts might claim that the process itself follows the Pareto Principle: 80% of your time should be spent getting your various ducks in a row, and only 20% counting them.

There is an even more problematic phrase sharing the acronym GIGO: “garbage in, gospel out”. This refers to the common trap of assuming that anything produced by a computer must be correct and reliable. Certainly, a professionally presented spreadsheet crammed with impressive graphs, tables, and statistics, or a beautifully designed PowerPoint, can be a compelling way to sell the fruits of your labours, but all the pretty colours in the world won’t prevent the conclusions from being complete rubbish if they are drawn from dodgy data.

Assumption is the mother of mistakes

There are less polite versions of the same phrase, of course, but this is a key idea to get your head around. We had a very good example of it in the news during the Covid-19 pandemic, when nearly 16,000 positive test results went unreported in England because case data was transferred using a legacy Excel file format with a fixed row limit, and every row beyond that limit was silently dropped.

This was a really basic error: someone assumed that the data had been moved successfully from A to B, but failed to perform even the simplest check, such as asking, “How many rows did we start with, and do we still have the same number?”
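That sort of check is trivial to automate. Here’s a minimal sketch in Python, with hypothetical file names, assuming the data travels as CSV files with a header row:

```python
import csv

def count_data_rows(path):
    """Count the data rows in a CSV file, excluding the header."""
    with open(path, newline="") as f:
        return sum(1 for _ in csv.reader(f)) - 1

# Hypothetical file names for the original extract and the transferred copy.
before = count_data_rows("source_extract.csv")
after = count_data_rows("transferred_copy.csv")

if before != after:
    raise ValueError(f"Row count mismatch: started with {before}, ended with {after}")
```

A row count won’t catch every transfer problem, but it costs seconds to run and would have caught the one above.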

Spreadsheets are always fun – very commonly used to transfer information, but prone to user error on a spectacular scale, and a prime example of people falling for the “garbage in, gospel out” syndrome I mentioned above.

I’m old enough to remember a time before we all had calculators, and still to this day I often do a little mental sum when I use a calculator, so that I have a rough idea of the answer I’m expecting. This can save the day if you miss out or add a zero, or something simple like that. If you are expecting an answer in the thousands, and end up with one in the hundreds, you know immediately that you have gone wrong somewhere.
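The same habit translates directly into code. A minimal sketch of an order-of-magnitude check in Python, with entirely hypothetical numbers, assuming the values involved are positive:

```python
def sanity_check(actual, rough_estimate, factor=10):
    """Flag a result that differs from a mental estimate by more than `factor`."""
    if not (rough_estimate / factor <= actual <= rough_estimate * factor):
        raise ValueError(
            f"Expected something near {rough_estimate}, got {actual}: check your inputs"
        )

# If you were expecting an answer in the thousands and got one in the hundreds,
# this fails immediately instead of letting the error slip through.
sanity_check(actual=8_750, rough_estimate=9_000)   # passes
sanity_check(actual=875, rough_estimate=9_000)     # raises ValueError
```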

The trouble with assuming that the computer has got it right is that it’s a fairly safe assumption – the computer will normally have done exactly what it was told. It’s whatever you asked of it that may not be quite so reliable!

Know who owns what

Whatever place you hold in the hierarchy within your business, the chances are that you will uncover something incorrect or missing that you can’t correct yourself. You need to make sure that everyone in the business knows who is responsible for what so that mistakes or omissions can be rectified quickly and accurately.

Too often, problems work their way down the chain, because people are not comfortable going to “the boss” and saying that there has been a mistake. They might try to correct it themselves, or pass it to someone else who is no better qualified or informed, so the problem still exists.

It is in everyone’s interests for the organisation to have the best information available, so make sure that you encourage a culture of openness around your business data, and make it clear that uncovering a problem is something to be welcomed, not something to feel embarrassed or uncomfortable about raising.

Document methods and approaches

Consistency is important: if you want your data to give you reliable intelligence, you need to treat the same data in the same way, year after year.

Staff change and people move on, so documentation is a valuable resource: what data is being used, how it needs to be prepared, how you structure and analyse it, and what software or applications you rely on.

It’s also important to clearly document individual one-off projects, so that people looking back in years to come will be able to tell exactly what was analysed and what wasn’t. Include information about how the data was collected and who by, along with how it was entered or collated.
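One lightweight way to capture that information is a simple provenance record filed alongside the analysis. A sketch in Python, with entirely hypothetical project details:

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class DatasetRecord:
    """Minimal provenance notes to store alongside a one-off analysis."""
    name: str
    collected_by: str
    collected_on: date
    collection_method: str
    entered_by: str
    notes: str = ""

# Hypothetical example record.
record = DatasetRecord(
    name="Customer satisfaction survey",
    collected_by="Field research team",
    collected_on=date(2023, 6, 30),
    collection_method="Online questionnaire",
    entered_by="Data admin",
    notes="Question 7 was added mid-survey, so early responses lack it.",
)
```

The format matters far less than the habit: a plain text file answering the same questions would do just as well.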

Deal with missing information

This is not about a complete dataset being lost, but rather individual gaps and bits of missing data. The data may be missing for a number of reasons: some element may never have been collected, or a particular question never asked (a problem only apparent in hindsight), or it may simply be an error. If you have conducted a survey, respondents may have chosen not to answer certain questions.

The problem is that it may compromise the extent to which you can analyse the dataset you have, or limit the information it can generate for you. Alternatively, it might mean that your conclusions are drawn from a smaller sample than you would like, so your level of confidence is lower.

There is no obvious answer to how to deal with missing data, as the situation will be different every time, but the important thing is to be aware of it, and to make an assessment as to how serious it is. Remedies might include trying to fill in the gaps by returning to the source if possible, or changing the design of your analysis. It’s important not just to ignore it.
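A good first step in that assessment is simply to quantify the gaps before deciding what to do about them. A minimal sketch using pandas, with a hypothetical survey file and column names:

```python
import pandas as pd

# Hypothetical dataset of survey responses.
df = pd.read_csv("survey_responses.csv")

# How much is missing, column by column, as a fraction of all records?
missing_share = df.isna().mean().sort_values(ascending=False)
print(missing_share)

# One possible remedy: drop rows only where a critical field is absent,
# rather than silently discarding every incomplete record.
usable = df.dropna(subset=["respondent_id"])
print(f"Kept {len(usable)} of {len(df)} records")
```

Even this crude summary tells you whether you are dealing with a few stray blanks or a column so patchy that any conclusion drawn from it is suspect.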

Just keep your wits about you

That’s all you can do, really, and it’s sadly more than a lot of people seem to manage! Just be aware – take nothing at face value, and don’t assume that somebody else has got it right. Check and double-check, as it’s far quicker to spend a bit of time doing that than it is to revisit the whole thing later when it turns out to be flawed.

Data will always be dirty, to a greater or lesser extent, but to derive useful information from it, it’s critical that you know how dirty it is, or at least have a pretty good idea.

Be systematic in how you handle it, make sure you understand it, and don’t take anything for granted!
