Data science – it’s so hot right now. But before you go racing to build predictive models, it’s essential to assess the quality of your data.
At EdjAnalytics, we recognize that most of the data companies collect serve multiple purposes. Data siloed by department are rarely in an ideal format for statistical analysis. While your record-keeping may be excellent for your needs in accounting or human resources, how can you tell if the data you have collected will be useful for modeling?
The first thing we must understand about data is what it all means. While internal users might appreciate the shorthand and abbreviations sprinkled throughout your dataset, obscure abbreviations and data coding, such as ICD and CPT in the medical field, may stump external data analysts.
Even something that seems straightforward, such as a field of dates labeled “admission date,” might need further explanation once we dive into the data. Some individuals may have multiple admission dates throughout the dataset. Does this date refer to the time the patient first entered the hospital or the date of admission into each new ward or department?
Supplying a data dictionary, a list of specific variable definitions and keys, helps our analysts avoid this kind of confusion. The data dictionary should explain what each variable means and how it was calculated or derived from various inputs.
Data dictionaries also describe the correct format for the data, which allows us to easily differentiate a typo from a new category, and they document the data’s origin and proper usage. When our analysts possess all of this metadata, they more easily recognize any inconsistencies in the data and begin to form testable hypotheses about relationships between various factors.
This table is an example of a data dictionary. It shows variable name, data type, description of the data, and examples of how the data should look.
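As a rough sketch of how a data dictionary can be put to work, the Python snippet below stores one as a plain structure and checks a record against it. The variable names, allowed ward values, and the `validate_record` helper are all hypothetical, invented for illustration rather than taken from any real client dataset.

```python
from datetime import datetime

# Hypothetical data dictionary. Names, types, and allowed values are
# illustrative assumptions, not a real schema.
DATA_DICTIONARY = {
    "patient_id": {
        "type": int,
        "description": "Unique patient identifier",
    },
    "admission_date": {
        "type": "date",
        "format": "%Y-%m-%d",
        "description": "Date the patient first entered the hospital",
    },
    "ward": {
        "type": str,
        "allowed": {"cardiology", "oncology", "pediatrics"},
        "description": "Ward the patient was admitted to",
    },
}


def validate_record(record, dictionary):
    """Return a list of human-readable problems found in one record."""
    problems = []
    for name, spec in dictionary.items():
        if name not in record:
            problems.append(f"{name}: missing")
            continue
        value = record[name]
        if spec["type"] == "date":
            # Dates arrive as text and must match the documented format.
            try:
                datetime.strptime(value, spec["format"])
            except (TypeError, ValueError):
                problems.append(f"{name}: {value!r} does not match {spec['format']}")
        elif not isinstance(value, spec["type"]):
            problems.append(f"{name}: expected {spec['type'].__name__}, got {value!r}")
        elif "allowed" in spec and value not in spec["allowed"]:
            # Could be a typo or a genuinely new category; flag it for review.
            problems.append(f"{name}: {value!r} is not a documented category")
    return problems


good = {"patient_id": 1, "admission_date": "2020-03-01", "ward": "cardiology"}
suspect = {"patient_id": 2, "admission_date": "2020-03-02", "ward": "cardiolgy"}
print(validate_record(good, DATA_DICTIONARY))
print(validate_record(suspect, DATA_DICTIONARY))
```

The misspelled ward in the second record is flagged because it does not appear in the documented categories. Without the dictionary, an analyst would have to guess whether it was a typo or a new ward.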
Deep History of Data
Another critical aspect of data for predictive modeling is a deep history across all variables. While it might be possible to find relationships between variables without a deep history, we may miss critical factors for prediction that would be instantly obvious with a more extended history.
If we wanted to predict ice cream sales for a local shop, sales data from the last few months would not be nearly as useful as sales data from the whole year; with only a few months, we would miss the seasonal variability in ice cream sales, which peak during the summer months.
Additionally, having ice cream sales data from the past three years could be even more useful to determine if that seasonal peak was a fluke – perhaps coinciding with a significant promotion that drove sales – or if that seasonal peak can be seen throughout the data to varying degrees every year.
While it might not take a data scientist to tell you that you are likely to sell more ice cream in the summer, numerous cyclical trends exist in the real world. Ensuring that you have a long, detailed history represented in your data can help uncover these trends.
Looking at data only in the shaded region, we might anticipate slow, steady growth in the next few months. We would miss the rapid peak as the weather warms up!
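To make the ice cream example concrete, here is a small Python sketch. The sales figures are synthetic, made up purely for illustration, and the `naive_projection` helper is a deliberately simple straight-line extrapolation, not a forecasting method we would recommend.

```python
# Synthetic monthly sales (January..December) for three hypothetical years.
# These numbers are invented for illustration only.
SALES = {
    2017: [20, 22, 30, 45, 70, 95, 110, 105, 75, 45, 28, 21],
    2018: [22, 24, 33, 48, 74, 99, 118, 109, 80, 47, 30, 23],
    2019: [21, 25, 35, 50, 78, 104, 121, 115, 82, 50, 31, 24],
}


def monthly_average(sales_by_year):
    """Average each calendar month across all available years."""
    years = list(sales_by_year.values())
    return [sum(year[m] for year in years) / len(years) for m in range(12)]


def peak_month(monthly_values):
    """Return the 1-based month number with the highest value."""
    return max(range(len(monthly_values)), key=lambda m: monthly_values[m]) + 1


def naive_projection(window, months_ahead):
    """Extend the average month-over-month change in a short window."""
    step = (window[-1] - window[0]) / (len(window) - 1)
    return window[-1] + step * months_ahead


# With only January-March of one year, a straight-line guess for July
# falls far short of what actually happened in this synthetic series.
short_window = SALES[2017][:3]
july_guess = naive_projection(short_window, months_ahead=4)
july_actual = SALES[2017][6]
print(july_guess, july_actual)

# With three full years, the summer peak shows up every year.
print([peak_month(year) for year in SALES.values()])
print(peak_month(monthly_average(SALES)))
```

With one quarter of data, the projection sees only gentle growth. With three years, the same peak month recurs in every year, which is the kind of evidence that separates a true seasonal pattern from a one-off promotion.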
Finally, the structure of the data can help facilitate model-building. At EdjAnalytics, we work primarily with structured data in clearly defined tables and databases: data stored in multiple tables that share one or more ID variables and can be queried with SQL. This structuring helps data analysts and scientists quickly find relevant information and pull it into one comprehensive analysis.
A client might have human resources data on individual employees (such as age, job title, and years of employment) that they want to link to accounts data (number of orders filled, the dollar amount of revenue generated, and so on) to find patterns in employee efficiency.
By using a relational database, an analyst can quickly perform the necessary analysis, connecting the relevant information from the employee table to the relevant outcomes in the accounts table. Without a relational database, data analysts would need to search through multiple lines of repeated information in a flat file that contains information about both employees and accounts.
With relationships between variables clear and easy to understand, and unnecessary data filtered out, storing data relationally speeds up the preliminary steps of data cleaning and processing.
Linking these tables quickly shows that the 35-year-old salesman is responsible for over $15,000 in orders, and the 49-year-old district manager had over $57,000 in sales.
In contrast, this table includes repeated information for the salesman. It unnecessarily provides information about the receptionist, who does not sell.
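The linking step described above can be sketched with Python's built-in `sqlite3` module. The table names, column names, and dollar amounts below are assumptions chosen to echo the employee and accounts example; they are not real client data.

```python
import sqlite3

# In-memory database with two hypothetical tables that share employee_id.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employees (
    employee_id INTEGER PRIMARY KEY,
    age         INTEGER,
    job_title   TEXT
);
CREATE TABLE accounts (
    order_id    INTEGER PRIMARY KEY,
    employee_id INTEGER REFERENCES employees(employee_id),
    revenue     REAL
);
-- Illustrative rows only.
INSERT INTO employees VALUES
    (1, 35, 'Salesman'),
    (2, 49, 'District Manager'),
    (3, 28, 'Receptionist');
INSERT INTO accounts VALUES
    (101, 1, 9000.0),
    (102, 1, 6500.0),
    (103, 2, 57500.0);
""")


def employee_revenue(connection):
    """Join employees to their orders and total the revenue per employee."""
    query = """
        SELECT e.age, e.job_title,
               COUNT(a.order_id) AS orders,
               SUM(a.revenue)    AS total_revenue
        FROM employees AS e
        JOIN accounts AS a ON a.employee_id = e.employee_id
        GROUP BY e.employee_id
        ORDER BY e.employee_id
    """
    return connection.execute(query).fetchall()


rows = employee_revenue(conn)
for row in rows:
    print(row)
```

Because this is an inner join, the receptionist, who has no orders, does not appear in the result, and each employee's details are stored once rather than repeated on every order line.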
In summary, good data should have a data dictionary to explain the origin and meaning of all the included variables, and it should be stored relationally for ease of use. Additionally, when building complex predictive models, a deep history is necessary to uncover any cyclical trends that might be lurking in the data.
Of course, there are many more factors to consider when preparing data for predictive analytics, which we will address in future blog posts. Keeping these three things in mind, however, can help your data science projects and improve the collaboration between your company and our specialists at EdjAnalytics.