In the last blog post, we explored good data structures for performing advanced analytics. But what about the quality of the data itself? Here is a list of five signs that you may have bad data, along with quick ways to improve it.
1. No data uniformity
Any reader can recognize that “123 Anywhere St” is the same as “123 anywhere street” or “123 Anywhere STR.” However, a computer struggles to make sense of these differences. Recording data uniformly allows data scientists to dive into work immediately, reducing the time spent up front on cleaning and processing.
For data that relies on extensive abbreviations or is not obvious to an outsider, uniformity is even more important. Otherwise, analysts might treat typos as different categories within a variable and produce incorrect results. This becomes a bigger issue when fundamental changes exist, such as a change in salary structure mid-year or a change in medical coding practices. Scientists can make use of data dictionaries or variable mapping to cope with these real-world discrepancies.
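As a rough illustration of variable mapping, the sketch below collapses the three address spellings above into one canonical form. The `STREET_SUFFIXES` dictionary and the `normalize_address` helper are hypothetical names invented for this example; a real data dictionary would cover far more variants.

```python
import re

# Hypothetical mapping from street-type variants to one canonical spelling.
STREET_SUFFIXES = {"st": "street", "str": "street", "street": "street"}


def normalize_address(raw: str) -> str:
    """Lowercase, drop punctuation, and map suffix variants to one spelling."""
    tokens = re.sub(r"[^\w\s]", "", raw.lower()).split()
    return " ".join(STREET_SUFFIXES.get(token, token) for token in tokens)


variants = ["123 Anywhere St", "123 anywhere street", "123 Anywhere STR."]
normalized = {normalize_address(v) for v in variants}
print(normalized)  # {'123 anywhere street'}
```

After normalization, the three spellings count as a single category instead of three.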
2. High levels of missingness
Missing data is tricky for any scientist. Is it missing because there is no real value for this field? Or is it missing because someone failed to enter it for some reason? If the latter, did they fail to enter it in a systematic way, or was it a random failure?
A classic example is self-reported income in survey research. People often hesitate to report their income to strangers. We don’t know why they withhold this information (are they sensitive about how much they earn, or how little?). Therefore, we cannot assume that their income is similar to that reported by other survey respondents. Their data are not missing completely at random, and particular caution needs to be taken with this variable.
In other situations, missing data might simply occur because no meaningful response to the variable exists. In a data set from an HR department, employees’ overtime hours might be missing if they are salaried employees and therefore ineligible for overtime. This is very different from someone deliberately choosing not to report their overtime.
Both figures here (representing data stored in columns and rows) have the same amount of missing data. The figure on the left is missing data completely at random. The figure on the right appears to have a pattern to the missing data.
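One simple first check for a pattern in missing data is to compare the missing rate across groups. The sketch below uses made-up HR records and a hypothetical `missing_rate_by_group` helper. It is only a starting point for exploration, not a formal test of whether data are missing completely at random.

```python
from collections import defaultdict

# Made-up HR records for illustration; None marks a missing overtime value.
records = [
    {"pay_type": "salaried", "overtime_hours": None},
    {"pay_type": "salaried", "overtime_hours": None},
    {"pay_type": "hourly", "overtime_hours": 5.0},
    {"pay_type": "hourly", "overtime_hours": None},
    {"pay_type": "hourly", "overtime_hours": 0.0},
    {"pay_type": "hourly", "overtime_hours": 2.5},
]


def missing_rate_by_group(rows, group_key, value_key):
    """Return the share of rows with a missing value, per group."""
    totals = defaultdict(int)
    missing = defaultdict(int)
    for row in rows:
        group = row[group_key]
        totals[group] += 1
        if row[value_key] is None:
            missing[group] += 1
    return {group: missing[group] / totals[group] for group in totals}


rates = missing_rate_by_group(records, "pay_type", "overtime_hours")
print(rates)  # {'salaried': 1.0, 'hourly': 0.25}
```

A large gap between groups, as in this toy data, suggests the missingness is structural (salaried employees have no overtime to report) and not random.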
3. Numerous duplicate records
When we think of “big data,” we often imagine sets with hundreds of thousands of records or rows. However, if these records contain many duplicates, we might be surprised to see just how small our data set actually is.
For instance, we might have daily attendance records for students across the entire school district. We might think that 365 days of attendance records times tens of thousands of students could give us a nice big data set. The majority of those records, however, are likely to be repeated. Student absences are likely to be relatively limited.
In this case, it is probably better to distill the separate daily row values into a one-row “count” value (perhaps stored in a variable called “number of absences”). With this simple change, our set of millions of observations has suddenly shrunk to tens of thousands, making it more difficult to find statistically significant relationships between variables.
Multiple rows containing repeated information can exaggerate the size of the data set.
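The condensing step described above can be sketched as follows. The row layout (student ID, date, status) and the `count_absences` helper are assumptions made for this example, not a description of any particular district's data.

```python
# Made-up daily attendance rows: (student_id, date, status).
daily_rows = [
    ("s001", "2024-09-02", "present"),
    ("s001", "2024-09-03", "absent"),
    ("s001", "2024-09-04", "present"),
    ("s002", "2024-09-02", "present"),
    ("s002", "2024-09-03", "present"),
    ("s002", "2024-09-04", "present"),
]


def count_absences(rows):
    """Collapse daily rows into one absence count per student."""
    absences = {}
    for student_id, _date, status in rows:
        # Make sure students with perfect attendance still get a row.
        absences.setdefault(student_id, 0)
        if status == "absent":
            absences[student_id] += 1
    return absences


number_of_absences = count_absences(daily_rows)
print(number_of_absences)  # {'s001': 1, 's002': 0}
```

Six daily rows become two rows, one per student, which is the true number of independent observations here.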
4. Numerous repeated fields
In contrast to point #3 above, we also sometimes see data with numerous repeated columns, or fields. Again, it is ultimately better for modeling purposes to condense these fields into one aggregated variable.
For instance, when looking at electronic medical records we may see columns representing a patient’s vitals taken at numerous times throughout their hospital stay. Instead of attempting to model on “Blood Pressure, Time 1” and “Blood Pressure, Time 2”, we might find that creating a new variable for the average blood pressure of a patient throughout their stay better fits our model. This would increase model parsimony without sacrificing accuracy, even if the concept of “average blood pressure” makes little sense from a clinician’s perspective.
Multiple columns containing repeated information make a data set unwieldy and provide excess noise.
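A minimal sketch of condensing repeated fields is shown below. The column names (`bp_time_1`, `bp_mean`), the sample values, and the `condense_repeated_fields` helper are all hypothetical. Whether a simple mean is the right aggregate depends on the model.

```python
from statistics import mean

# Made-up patient records with one blood pressure column per reading.
patients = [
    {"patient_id": "p1", "bp_time_1": 120, "bp_time_2": 130, "bp_time_3": None},
    {"patient_id": "p2", "bp_time_1": 110, "bp_time_2": 114, "bp_time_3": 112},
]


def condense_repeated_fields(row, prefix, new_name):
    """Replace all columns starting with `prefix` by their mean, skipping missing values."""
    values = [v for k, v in row.items() if k.startswith(prefix) and v is not None]
    condensed = {k: v for k, v in row.items() if not k.startswith(prefix)}
    condensed[new_name] = mean(values) if values else None
    return condensed


condensed = [condense_repeated_fields(p, "bp_time_", "bp_mean") for p in patients]
print(condensed)
```

Each patient now has a single `bp_mean` field in place of three time-stamped columns.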
5. Data lacking heterogeneity
Good data should give us a snapshot of the problem at hand. However, the best data should also be relatively heterogeneous. It should show a mixture of both categorical and numeric variables. Statistics is grounded in numbers, but including categorical variables in a model can help scientists discover different effects across groups.
For instance, we might have HR data on the number of hours per week worked by an employee. Our model could show that for each additional hour worked, productivity decreases by a set proportion. But it might be more meaningful and simpler to interpret if we show that full-time employees are most productive until they work their 40 hours and that productivity drops off once overtime begins.
A variety of data, numeric and categorical, can help us find natural groupings that will ultimately benefit clients as they move to implement new strategies from our findings.
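As a sketch of how a numeric variable can be paired with a categorical one, the example below bins weekly hours at the 40-hour mark and compares average productivity per group. The records, the productivity scores, and the `hours_category` helper are invented for illustration only.

```python
from collections import defaultdict
from statistics import mean

# Made-up records: weekly hours and an invented productivity score.
employees = [
    {"hours": 38, "productivity": 10},
    {"hours": 40, "productivity": 8},
    {"hours": 45, "productivity": 6},
    {"hours": 50, "productivity": 4},
]


def hours_category(hours, full_time_limit=40):
    """Turn numeric weekly hours into a two-level categorical variable."""
    return "overtime" if hours > full_time_limit else "up to 40 hours"


scores_by_group = defaultdict(list)
for employee in employees:
    scores_by_group[hours_category(employee["hours"])].append(employee["productivity"])

mean_productivity = {group: mean(scores) for group, scores in scores_by_group.items()}
print(mean_productivity)  # {'up to 40 hours': 9, 'overtime': 5}
```

The grouped summary makes the drop-off after 40 hours visible at a glance, which a single per-hour slope would hide.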
These are some of the biggest issues we experience regularly with data quality. We will address some of the tools data scientists have to work around these problems in future blog posts. But ensuring that your data is already in a consistent format, with low levels of missing data, no repeated rows or columns, and a healthy mix of categorical and numeric data, can save valuable time. Good data allows the EdjAnalytics team to invest more resources in producing high-quality predictive models for your bottom line.
Previous blogs by Cara Davies: What is Good Data?