We’ve all read the stats about how much of the content currently being generated and managed within organizations is unstructured, and we’ve worried about the opportunities that might be missed if we are not able to leverage the knowledge this content contains. Indexing the content with your favorite search engine is a start, but it doesn’t go nearly far enough.
The content also needs to be tagged with metadata, which enables things like context-based personalized delivery, dynamic boosting of results, and what I like to call “search and refine” – a usage pattern that is ubiquitous on internet sites but often lacking in intranet search experiences. The challenge, of course, is that the users who create the content will rarely, if ever, tag content consistently.
To solve this problem, an automated approach is needed. Auto-tagging capabilities fall into one of two categories: rules-based or Machine Learning / AI (statistics-based). Taking a rules-based approach has the advantage of being easy to understand and implement, but it requires human input to work. The ML / AI approach, on the other hand, largely automates the process and can “learn” over time, but it can be difficult to understand and debug when things don’t go smoothly. So the question is, which approach is better?
Neither.
Both methods have their place, and at BA Insight we advocate combining the two approaches to auto-tagging and applying each in ways that take advantage of their strengths. Let’s take a closer look at each method:
Rules-based
When a clear set of defined metadata tags exists, in my experience a rules-based approach yields the clearest, easiest-to-understand mechanism for having those tags automatically applied to your content. There are several concepts that apply here:
- Taxonomy creation. The first step is to determine which metadata would be of the most help to users in finding the content they need. This is a good place to apply a statistical methodology to analyze your content and suggest tags based on that analysis. Rather than having your most knowledgeable users pore through thousands of documents looking for terms which appear frequently, BA Insight offers an automated analysis capability that does this more quickly and accurately than any human could. Take advantage of your users to verify that the suggested terms make sense, but automate the heavy lifting.
- Rules. Now that you have a taxonomy of terms, the most accurate way to control assignment of these terms to content is via Rules. Sure, there is some work here, but in the long run using Rules to assign known terms ensures that it’s done right. Applying your understanding of the content to the creation of Rules enables accurate identification of the concepts or categories which apply to each piece of unstructured content, which in turn becomes invaluable when users search for that content. It is critical that the rules be sufficiently flexible to allow you to apply only the tags that accurately describe what a document is truly about rather than just what it may mention. In other words, you need to be able to evaluate potential tags to decide if they truly reflect what the document is about. In addition to a widely extensible rules development interface, our products incorporate mechanisms to enable you to test your rules before using them to actually tag any content.
- Patterns. There may be cases where you need to apply concepts like security classification to your content. This is an example of where the human touch brings a lot of value, because it requires an understanding of what “sensitive” means in your organization. Some examples are obvious: documents that contain Social Security Numbers or birth dates come to mind. But there are also things like part numbers, which may appear innocuous but, in the context of your organization, represent potential risk if they are widely disseminated. Identifying documents that may contain such information lets you determine how to handle them, for example by kicking off a review workflow or excluding them from search results. Once you’ve identified potential problems, you can decide what to do.
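To make the Rules and Patterns ideas above a little more concrete, here is a minimal Python sketch. It is not how our products implement rules; the rule names, term lists, the `min_mentions` threshold, and the `PN-12345` part number format are all assumptions made up for illustration. The threshold is one simple way to separate what a document is about from what it merely mentions.

```python
import re
from dataclasses import dataclass


@dataclass
class Rule:
    tag: str
    terms: tuple
    # Assumed threshold: fewer hits than this counts as a passing mention only.
    min_mentions: int = 3


# Hypothetical taxonomy rules, invented for this example.
RULES = [
    Rule("Benefits", ("401k", "health plan", "enrollment")),
    Rule("Travel", ("itinerary", "airfare", "per diem")),
]

# Hypothetical sensitivity patterns. The part number format is an assumption.
SENSITIVE_PATTERNS = {
    "Possible SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "Possible part number": re.compile(r"\bPN-\d{5}\b"),
}


def count_mentions(text, terms):
    """Count whole-word, case-insensitive occurrences of any of the terms."""
    lowered = text.lower()
    return sum(
        len(re.findall(r"\b" + re.escape(term.lower()) + r"\b", lowered))
        for term in terms
    )


def apply_rules(text, rules=RULES):
    """Return the tags whose terms appear often enough to describe the document."""
    return [r.tag for r in rules if count_mentions(text, r.terms) >= r.min_mentions]


def find_sensitive(text, patterns=SENSITIVE_PATTERNS):
    """Return the labels of any sensitivity patterns found in the text."""
    return [label for label, pattern in patterns.items() if pattern.search(text)]
```

A document that talks about enrollment and the health plan several times would get the Benefits tag, while a single passing reference to airfare would not earn it the Travel tag. A hit from `find_sensitive` only flags the document for a decision; what happens next (review, exclusion from results) is up to you.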
Entity Extraction
Entity Extraction is a tagging technique which involves identifying “entities”, which are typically names of something, and tagging the document with the entities it contains. This can also include things like product names or project names – anything that can be identified by context and pattern is a candidate for Entity Extraction. There are several ways to do this:
- List. When you have a finite number of potential values such as product or customer names, this technique makes sense. You just provide a list of potential values to our software, and any entities found within documents will be automatically extracted and applied as tags.
- Entity Recognition. This is a technique whereby Machine Learning models are “trained” to be able to identify certain types of entities such as names of people, places, projects, or products based on context and other factors, without a defined list as a reference. Then, when entity names that match what the models have been trained to identify appear within the content, they are extracted as tags. It’s important to be able to differentiate between entities with similar names and correctly identify Newton the scientist vs. Newton the city in Massachusetts. We also provide the ability to train our software to identify new entities which are important to your business so that you are not limited to ones we define.
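The List technique above is simple enough to sketch in a few lines of Python. This is only an illustration under stated assumptions, not our implementation: the function names and the sample product and customer names are hypothetical, and matching is plain case-insensitive, whole-word lookup with longer names preferred over shorter ones.

```python
import re


def build_matcher(entities):
    """Compile one regex for all names, trying longer names first."""
    # Longest first, so "Acme Widget Pro" wins over "Acme Widget".
    ordered = sorted(entities, key=len, reverse=True)
    pattern = r"\b(?:" + "|".join(re.escape(name) for name in ordered) + r")\b"
    return re.compile(pattern, re.IGNORECASE)


def extract_entities(text, entities):
    """Return the listed entities found in the text, in their canonical spelling."""
    if not entities:
        return []
    canonical = {name.lower(): name for name in entities}
    matcher = build_matcher(entities)
    found = {canonical[m.group(0).lower()] for m in matcher.finditer(text)}
    return sorted(found)
```

For example, given the hypothetical list `["Acme Widget", "Acme Widget Pro", "Contoso", "Fabrikam"]`, a document mentioning "ACME Widget Pro" and "Contoso" would be tagged with those two names. Entity Recognition, by contrast, relies on trained models rather than a list, so it is not shown here.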
Cognitive Analysis
“Cognitive” is the hottest buzzword in the search market, and for good reason. The technologies that fall under this umbrella have great promise and can revolutionize the way people find information. At BA Insight, we are taking a very different approach than many other players in the market: open technology. Rather than try to create our own set of cognitive services and force customers to do this “our way”, we are integrating with the cognitive suites of products offered by companies like Google, Microsoft, and Amazon. In the same way that we enable our customers to choose the search platform which best suits their needs and use our portfolio of products to enhance the capabilities of that search platform, we enable our customers to decide which vendor of cognitive services is best for them.
The names are different, but several capabilities are common across cognitive suites, and we have incorporated or will incorporate them into our software portfolio:
- Image and Video Analysis. Up to this point I’ve talked about unstructured content as if it’s all just text, but that’s not the case. There is valuable content in binary formats such as images and video which should be leveraged as well. This is another area where ML can shine. Cognitive Search suites can analyze an image and give you back text which describes what is in the picture, and they can also extract any text which is visible in the image. If the image appears within a document, then the text which describes the image can be added to the other text within the document. When that document is analyzed using the methods described above, all content within the document can be considered, improving the accuracy of the tags that are applied.
- Sentiment Analysis. Machine learning models can go beyond the content or meaning of a document to get to the “tone” and determine whether the document is written in a positive or negative manner. This kind of understanding can be used to determine what content best meets a request. For example, you may want to search for content with a negative tone to edit and improve, or for documents with a positive tone to deliver to customers. This technique is applicable to certain specific use cases such as Customer Service, but it’s also broadly useful for Enterprise Search.
- Natural Language Understanding. The objective of applying Natural Language Understanding is to get to the “intent” within analyzed text. It can be applied both as part of content classification and search. For example, understanding of the intent of the content of a document can enable things like assigning a document type such as Documentation, Procedure, Policy, etc. When users are searching for documents using natural language, an understanding of the way in which they ask the question can increase the relevance of search results by ensuring the right types of documents are returned first.
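The image analysis flow described above, where text derived from an image is added to the rest of the document before tagging, can be sketched as follows. Everything here is an assumption for illustration: `ImageAnalyzer`, `ImageAnalysis`, and `enrich_text` are hypothetical names, and `StubAnalyzer` returns canned text in place of a call to a real vendor service, whose actual API will differ.

```python
from dataclasses import dataclass
from typing import List, Protocol


@dataclass
class ImageAnalysis:
    description: str   # text describing what is in the picture
    visible_text: str  # any text visible in the image


class ImageAnalyzer(Protocol):
    """Hypothetical vendor-neutral interface; any provider could sit behind it."""

    def analyze(self, image_bytes: bytes) -> ImageAnalysis: ...


class StubAnalyzer:
    """Stand-in for a real cognitive service; returns a canned result."""

    def analyze(self, image_bytes: bytes) -> ImageAnalysis:
        return ImageAnalysis("a forklift in a warehouse", "SAFETY FIRST")


def enrich_text(body_text: str, images: List[bytes], analyzer: ImageAnalyzer) -> str:
    """Append image-derived text to the document text before it is tagged."""
    parts = [body_text]
    for image in images:
        result = analyzer.analyze(image)
        # Skip empty fields so the enriched text has no blank lines.
        parts.extend(p for p in (result.description, result.visible_text) if p)
    return "\n".join(parts)
```

Keeping the analyzer behind a small interface mirrors the idea of letting customers pick their cognitive services vendor: the tagging step only sees the enriched text, whichever provider produced it.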
Bottom line:
Analyzing and tagging unstructured content for findability is not a one-size-fits-all task. It is best to apply a combination of human understanding and automated capabilities to apply metadata to the ever-growing body of content so that it can be easily found and utilized by the people who need it most within your organization. It’s also important to note that there may be cases where multiple mechanisms are needed to correctly classify your content. For example, a rule may look for a combination of an extracted entity such as a person’s name and a date in close proximity and deem that combination to be sensitive.
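As a rough sketch of that last example, the Python below flags a document when a known person’s name and a date appear close together. The assumptions are loud ones: names come from a supplied list (standing in for whatever entity extraction produced), dates are matched only in a simple `MM/DD/YYYY` style, and “close proximity” is taken to mean within `max_gap` characters. All of these are illustrative choices, not product behavior.

```python
import re

# Assumed date format for illustration: 1-2 digits / 1-2 digits / 4 digits.
DATE_PATTERN = re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b")


def name_spans(text, known_names):
    """Return (start, end) character offsets for each known name in the text."""
    spans = []
    for name in known_names:
        for m in re.finditer(r"\b" + re.escape(name) + r"\b", text):
            spans.append((m.start(), m.end()))
    return spans


def is_sensitive(text, known_names, max_gap=40):
    """True if any known name sits within max_gap characters of a date."""
    dates = [(m.start(), m.end()) for m in DATE_PATTERN.finditer(text)]
    for n_start, n_end in name_spans(text, known_names):
        for d_start, d_end in dates:
            # Distance between the two spans, whichever comes first.
            gap = max(d_start - n_end, n_start - d_end, 0)
            if gap <= max_gap:
                return True
    return False
```

A sentence such as "Jane Doe, born 04/12/1985" would be flagged, while a document that names Jane Doe as the author and mentions a release date many paragraphs later would not.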