I recently read a good article on the difference between structured and unstructured data. The author defines structured data as data that can be easily organized; as a result, these types of data are easy to analyze. Unstructured data refers to information that either does not have a predefined data model or is not organized in a predefined manner. Unstructured data are not easy to analyze. A primary goal of a data scientist is to extract structure from unstructured data. Natural language processing, for example, is a process of extracting something useful (e.g., sentiment, topics) from something that is essentially useless (e.g., text).
While I like the definitions she offers, she included an infographic that is confusing. It equates the structural nature of the data with the source of the data, suggesting that structured data are generated solely from internal/enterprise systems while unstructured data are generated solely from social media sources. I think it would be useful to separate the format (structured vs. unstructured) of the data from the source (internal vs. external) of the data.
Sources of Data: Internal and External
Generally speaking, business data can come from either internal sources or external sources. Internal sources of data are those that are under the control of the business. These data are housed in financial reporting systems, operational systems, HR systems and CRM systems, to name a few. Business leaders have a large say in the quality of internal data; these data are essentially a byproduct of the processes and systems the leaders use to run the business and to generate and store the data.
External sources of data, on the other hand, are any data generated outside the walls of the business. These data sources include social media, online communities, open data sources and more. Because they originate outside the business, external data are under less control by the business than internal data. These data are collected by other companies, each using its own systems and processes.
Data Definition Framework
This 2×2 data framework is a way to think about your business data (see Figure 1). This model distinguishes the format of data from the source of data. The 2 columns represent the format of the data, either structured or unstructured. The 2 rows represent the source of the data, either internal or external. Data can fall into one of the four quadrants.
Using this framework, we see that unstructured data can come from both internal sources (e.g., open-ended survey questions, call center transcripts) and external sources (e.g., Twitter comments, Pinterest images). Unstructured data are primarily human-generated. Human-generated data are those that are input by people.
Structured data can also come from both inside (e.g., survey ratings, Web logs, process control measures) and outside (e.g., GPS for tweets, Yelp ratings) the business. Structured data include both human-generated and machine-generated data. Machine-generated data are those that are calculated/collected automatically and without human intervention (e.g., metadata).
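For readers who like to see a framework as code, here is a minimal sketch of the 2×2 grid in Python. It tags each of the examples mentioned above with a format and a source, then groups them into the four quadrants. The names Format, Source, DATA_ASSETS, group_by_quadrant and find_gaps are my own illustrative choices, not part of any library, and the tags are manual judgment calls based on the examples in this article.

```python
from enum import Enum


class Format(Enum):
    STRUCTURED = "structured"
    UNSTRUCTURED = "unstructured"


class Source(Enum):
    INTERNAL = "internal"
    EXTERNAL = "external"


# Hypothetical inventory: the example data types named in the article,
# each tagged by hand with (format, source).
DATA_ASSETS = {
    "open-ended survey questions": (Format.UNSTRUCTURED, Source.INTERNAL),
    "call center transcripts": (Format.UNSTRUCTURED, Source.INTERNAL),
    "Twitter comments": (Format.UNSTRUCTURED, Source.EXTERNAL),
    "Pinterest images": (Format.UNSTRUCTURED, Source.EXTERNAL),
    "survey ratings": (Format.STRUCTURED, Source.INTERNAL),
    "Web logs": (Format.STRUCTURED, Source.INTERNAL),
    "process control measures": (Format.STRUCTURED, Source.INTERNAL),
    "GPS for tweets": (Format.STRUCTURED, Source.EXTERNAL),
    "Yelp ratings": (Format.STRUCTURED, Source.EXTERNAL),
}


def group_by_quadrant(assets):
    """Return a dict mapping each (format, source) quadrant to its asset names."""
    # Start with all four quadrants so that empty ones remain visible.
    quadrants = {(fmt, src): [] for src in Source for fmt in Format}
    for name, (fmt, src) in assets.items():
        quadrants[(fmt, src)].append(name)
    return quadrants


def find_gaps(quadrants):
    """Return the quadrants that contain no data assets."""
    return [key for key, names in quadrants.items() if not names]


if __name__ == "__main__":
    grid = group_by_quadrant(DATA_ASSETS)
    for (fmt, src), names in grid.items():
        print(f"{src.value} / {fmt.value}: {', '.join(names) or '(none)'}")
```

Swapping in your own inventory for DATA_ASSETS and calling find_gaps on the result is one simple way to spot quadrants where you collect nothing at all.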
The quality of any analysis is dependent on the quality of the data. You are more likely to uncover something useful in your analysis if your data are reliable and valid. When measuring customers’ attitudes, we can use customer ratings or customer comments as our data source. Customer satisfaction ratings, due to the nature of the data (structured / internal), might be more reliable and valid than customer sentiment metrics from social media content (unstructured / external); as a result, the use of structured data might lead to a better understanding of your data.
Data format is not the same as data source. I offer this data framework as a way for businesses to organize and understand their data assets. Identify strengths and gaps in your own data collection efforts. Organize your data to help you assess your Big Data analytic needs. Understanding the data you have is a good first step in knowing what you can do with it.
What kind of data do you have?