Unstructured data. As a writer, I hate that term. I remember the first time I heard it: I was sitting in a meeting where technical people were talking about all the unstructured content that publishers produce. Huh? How could they be speaking about articles as unstructured? Anyone who has made it through kindergarten has learned that you must follow a linguistic pattern, or “structure,” in order to communicate effectively. But in the lexicon of geekdom, any article, picture, PowerPoint, video, song or piece of user-generated content (your kids’ text messages included) is unstructured. Any “data” that does not fit nicely into a column or a row is considered unstructured.
Now, as much as I find the word ironic, there is a logic to calling content that doesn’t fall neatly into a table unstructured rather than structured. It was evident when classifieds first went online. The ads were dumped from mainframes, and because customers had paid by the character, people had created their own shorthand to say things like 4BR House 4 Sale. It turns out, though, that all that unstructure (which I prefer to call free-form) makes the content very hard to search. Don’t believe me? Go to Craigslist, which tries to impose structure on advertisements by putting them under broad categories and locations. Other than that, it is pretty freestyle. Nannies are caregivers are sitters (baby or otherwise), plural or not. Search on one of those terms at the peril of missing the ads listed under the others.
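To make the synonym problem concrete, here is a minimal Python sketch. The ads and the `keyword_search` helper are invented for illustration; this is not how Craigslist or any real site searches, just a plain literal match over free-form text.

```python
# Hypothetical free-form classified ads, phrased the way advertisers write them.
ads = [
    "Experienced nanny available weekdays",
    "Caregiver for elderly parent, references on request",
    "Babysitters wanted for weekend evenings",
    "4BR House 4 Sale",
]


def keyword_search(term, listings):
    """Return the listings that contain the literal term, ignoring case."""
    needle = term.lower()
    return [ad for ad in listings if needle in ad.lower()]


# Each term finds only the ads that happen to use that exact word.
nanny_hits = keyword_search("nanny", ads)
sitter_hits = keyword_search("sitter", ads)
caregiver_hits = keyword_search("caregiver", ads)

print(nanny_hits)
print(sitter_hits)
print(caregiver_hits)
```

All three ads describe much the same kind of work, yet each search returns only one of them, because nothing in the text says that the three words are related.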
On the flip side are the sites that allow only structure: information fits neatly in a row or a table and has descriptive titles like Type of residence, # of Bedrooms, Siding, Price, MLS # etc. Having that structure makes it easy to query or search that information. You have probably heard of SQL, which stands for Structured Query Language. Relational databases use SQL to find content by looking in the appropriate fields. That is all well and good for content like financial information and inventory systems.
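For the curious, a structured lookup might look something like the sketch below. It uses Python's built-in `sqlite3` module with an in-memory database; the `listings` table, its column names and the sample rows are all made up for illustration, loosely following the field titles mentioned above.

```python
import sqlite3

# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Hypothetical table whose columns mirror the descriptive field titles.
conn.execute(
    """
    CREATE TABLE listings (
        mls_number     TEXT PRIMARY KEY,
        residence_type TEXT,
        bedrooms       INTEGER,
        siding         TEXT,
        price          INTEGER
    )
    """
)

# Invented sample rows.
conn.executemany(
    "INSERT INTO listings VALUES (?, ?, ?, ?, ?)",
    [
        ("A100", "House", 4, "Vinyl", 350000),
        ("A101", "Condo", 2, "Brick", 210000),
        ("A102", "House", 3, "Wood", 275000),
        ("A103", "House", 5, "Brick", 480000),
    ],
)

# Because every value sits in a named field, the question
# "houses with at least four bedrooms, cheapest first" is one query.
rows = conn.execute(
    "SELECT mls_number, price FROM listings "
    "WHERE residence_type = ? AND bedrooms >= ? "
    "ORDER BY price",
    ("House", 4),
).fetchall()

print(rows)
conn.close()
```

The query never has to guess what an advertiser meant by "4BR"; the bedroom count is simply a number in a column.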
Think about all the content that flows in and around a publishing company: The articles, books (chapters of books), images, slideshows, Web sites, tweets, user comments … the stuff we refer to simply as: Content. Now think about how tough it is to find it – so we can repurpose it.
Let’s look briefly at what all this means. You have tons of content that doesn’t fit into tables and rows. Storing content in its original containers (PowerPoints, Word docs and PDFs) makes it extraordinarily hard to share and repurpose: the content is hard to search, so cutting and pasting becomes the only alternative. And when we are talking about sharing and repurposing, remember all the great mash-up apps that your content might be perfect for, if only it were in a neutral format.
Look, I don’t like the term Unstructured Content, but the acronym CTDFWICAR (Content that doesn’t fit well in columns and rows) is hardly memorable. So while it may be “unstructured,” it’s not unimportant. In fact, it is the lifeblood of any publisher, marketer and knowledge worker, and it deserves an Enterprise NoSQL database to help keep it valuable.