25.05.2016 by Marisa Krystian
As a journalist, marketer, or business professional, chances are your world is full of data. This data needs to be visualized and analyzed in order to share quality stories or make important company decisions. The problem is that not all datasets are created equal. Some data is widely referred to as ‘bad data.’
This is data that is missing crucial information, wasn’t entered correctly, is in the wrong format, or is simply inaccurate. Some bad data needs to be addressed by third-party experts or programmers – but some bad data can be fixed by you!
We’d like to share a few key takeaways with you. Here are 5 easy ways to fix bad data:
1) Data is in a PDF
A tremendous amount of data—especially government data—is only available in PDF format. If you have real, textual data inside the PDF then there are several good options for extracting it.
SOLUTION: One excellent, free tool is Tabula. However, if you have Adobe Creative Cloud then you also have access to Acrobat Pro, which has an excellent feature for exporting tables in PDFs to Excel. Either solution should be able to extract most tabular data from a PDF.
2) Data is too granular
This is the opposite of data that is too coarse. In this case, you’ve got counties but you want states, or you’ve got months but you want years. Fortunately, this is usually pretty straightforward to fix.
SOLUTION: Data can be aggregated by using the Pivot Table feature of Excel or Google Sheets, by using an SQL database, or by writing custom code. Pivot Tables are a fabulous tool that every reporter should learn, but they do have their limits. Click here for 6 Excel add-ins that help you find, process, and analyze your data like a pro.
For exceptionally large datasets or aggregations to unusual groups, you should ask a programmer and they can craft a solution that’s easier to verify and reuse.
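If you’re curious what the "custom code" route might look like, here is a minimal Python sketch that rolls county-by-month numbers up to state-by-year totals. The rows, the column order, and the function name `aggregate_by_state_and_year` are made up for illustration; real data would normally be read from a CSV or database.

```python
from collections import defaultdict

# Hypothetical rows: (state, county, month as "YYYY-MM", count).
rows = [
    ("Ohio", "Franklin", "2015-01", 120),
    ("Ohio", "Franklin", "2015-02", 95),
    ("Ohio", "Cuyahoga", "2015-01", 210),
    ("Texas", "Travis", "2015-01", 80),
    ("Texas", "Travis", "2016-01", 70),
]


def aggregate_by_state_and_year(rows):
    """Sum counts from county/month granularity up to state/year."""
    totals = defaultdict(int)
    for state, _county, month, count in rows:
        year = month[:4]  # "2015-02" -> "2015"
        totals[(state, year)] += count
    return dict(totals)


totals = aggregate_by_state_and_year(rows)
for (state, year), total in sorted(totals.items()):
    print(state, year, total)
```

Because the grouping rule lives in one small function, it is easy to check by hand against a few rows and to rerun when the data is updated.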
3) Human Error – Entry and Manual Editing
Human data entry is a very common source of problems. There is no worse way to screw up data than to let a single human type it in. Manual editing causes almost the same problems as human data entry, except that it happens after the fact and often with good intentions.
In fact, data are often manually edited in an attempt to fix data that was originally entered by humans. Problems start to creep in when the person doing the editing doesn’t have complete knowledge of the original data.
SOLUTION: Issues with manual editing are one reason why you always want to ensure your data has well-documented provenance. A lack of provenance can be a good indication that someone may have toyed with it.
Academics often get data from the government, monkey with it and then redistribute it to journalists. Without any record of their changes, it’s impossible to know if the changes they made were justified. Whenever feasible, try to get the primary source, or at least the earliest version you can, and then do your own analysis from that.
4) Margin-of-error is unknown or too large
Sometimes the problem is that nobody ever bothered to figure out the margin of error (MOE). This is one problem with unscientific polls. Without computing the MOE, it is impossible to know how accurate the results are.
Another major problem is the usage of numbers with very large margins-of-error. MOE is usually associated with survey data. The most likely place a reporter encounters it is when using polling data or the US Census Bureau’s American Community Survey data.
SOLUTION: As a general rule, anytime you have data that’s from a survey you should ask for the MOE. If the source can’t tell you, that data probably isn’t worth using for any serious analysis.
When the MOE is too large, there is no one rule about when a number is not accurate enough to use, but as a rule of thumb, you should be cautious about using any number with an MOE over 10%.
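To get a feel for how sample size drives the MOE, here is a rough Python sketch using the standard textbook formula for a proportion from a simple random sample at a 95% confidence level. This is an assumption for illustration only: published surveys often use more complex sampling and different confidence levels, so always prefer the MOE the source reports. The function names and the reading of "10%" as 10 percentage points are also our assumptions.

```python
import math


def margin_of_error(proportion, sample_size, z=1.96):
    """Approximate MOE for a proportion from a simple random sample.

    z=1.96 corresponds to a 95% confidence level (an assumption here).
    """
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    if not 0 <= proportion <= 1:
        raise ValueError("proportion must be between 0 and 1")
    return z * math.sqrt(proportion * (1 - proportion) / sample_size)


def is_too_uncertain(proportion, sample_size, threshold=0.10):
    """Flag results whose MOE exceeds the rule-of-thumb threshold."""
    return margin_of_error(proportion, sample_size) > threshold


# A hypothetical poll: 50% of respondents agree.
for n in (1000, 50):
    moe = margin_of_error(0.5, n)
    print(f"n={n}: MOE is about {moe:.1%}, flag={is_too_uncertain(0.5, n)}")
```

With 1,000 respondents the MOE comes out at roughly 3 percentage points, while with only 50 respondents it is nearly 14 points, which is past the rule of thumb above.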
5) Timeframe or frame of reference has been manipulated
A source can accidentally or intentionally misrepresent the world by giving you data that stops or starts at a specific time. Or, you may have data that manipulates the frame of reference.
Crime statistics are often manipulated for political purposes by comparing to a year when crime was very high. This can be expressed either as a change (down 60% since 2004) or via an index (40, where 2004 = 100). In either of these cases, 2004 may or may not be an appropriate year for comparison. It could have been an unusually high crime year.
This also happens when comparing places. If someone wants to make one country look bad, they can simply express the data about it relative to whichever country is doing best.
SOLUTION: If you have data that covers a limited timeframe, try to avoid starting your calculations with the very first time period you have data for. If you start a few years (or months or days) into the data, you can be confident that you aren’t making a comparison that would be invalidated by a single additional data point.
Timeframe manipulation tends to crop up in subjects where people have a strong confirmation bias. Whenever possible, try comparing rates from several different starting points to see how the numbers shift. And whatever you do, don’t use this technique yourself to make a point you think is important.
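Here is a small Python sketch of that check: it computes the percent change and the index value for the latest year against every possible starting year. The crime counts are invented for illustration, with 2004 deliberately set as an unusually high year, and the function names are our own.

```python
# Hypothetical annual crime counts; 2004 is an unusually high year.
counts = {2002: 520, 2003: 540, 2004: 1000, 2005: 600, 2006: 480, 2016: 400}


def percent_change(counts, start_year, end_year):
    """Percent change from start_year to end_year."""
    start = counts[start_year]
    return (counts[end_year] - start) * 100 / start


def index_value(counts, base_year, year):
    """Value for `year` on an index where base_year = 100."""
    return counts[year] * 100 / counts[base_year]


end_year = max(counts)
for start_year in sorted(counts):
    if start_year == end_year:
        continue
    change = percent_change(counts, start_year, end_year)
    index = index_value(counts, start_year, end_year)
    print(f"since {start_year}: {change:+.1f}% (index {index:.1f})")
```

In this made-up series, crime is "down 60%" (index 40) when measured from 2004, but down only about 26% when measured from 2003. If one starting point produces a far more dramatic number than its neighbors, treat the comparison with suspicion.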
Ready to fix your bad data and create quality data visualizations? Infogram allows you to import your data in multiple formats, sync with Google Drive, add a JSON feed, search our global databases or manually enter your data yourself.
Do you want to know more about data? Read our blog post about 17 Essential Tips and Tricks for Google Sheets You Need To Know.