Enterprise search engines aren’t inherently ‘smart’ about the data they deal with, because it’s largely unstructured. ECM has that problem to a degree, since documents aren’t self-describing unless they’re created with rich markup to begin with, which historically has only happened in limited areas. BI, on the other hand, has the historical advantage of operating on data that’s already structured. But BI hasn’t been that useful for text, because text doesn’t lend itself to dimensional representation and doesn’t fit into neat classes of objects that can be reported on.
These application areas are clearly converging, though. At the user-experience level, search is borrowing from BI the ability to generate views into structured data keyed off search terms, and BI is borrowing techniques such as keyword search and faceted navigation from the enterprise search world. Below the UI itself, though, all of these areas are trying to get smarter about metadata extraction – automatically deriving structure from what's in the text – so that you can improve the end user's search experience and open up the vast world of textual data sources to your BI environment.
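To make the metadata-extraction idea concrete, here is a minimal sketch of deriving structure from raw text and rolling it up into BI-style facet counts. The patterns and field names are illustrative assumptions only; production extractors use trained entity recognizers rather than regexes.

```python
import re
from collections import Counter

# Illustrative patterns only -- real extractors use trained models.
PATTERNS = {
    "money": re.compile(r"\$\d[\d,]*(?:\.\d+)?"),
    "year": re.compile(r"\b(?:19|20)\d{2}\b"),
}

def extract_metadata(text):
    """Derive simple structured fields from unstructured text."""
    return {field: pat.findall(text) for field, pat in PATTERNS.items()}

def facet_counts(documents, field):
    """Aggregate one extracted field across documents, BI-style."""
    counts = Counter()
    for doc in documents:
        counts.update(extract_metadata(doc)[field])
    return counts

docs = ["Revenue grew to $1.2M in 2007.", "The 2006 deal was worth $800,000."]
print(facet_counts(docs, "year"))  # Counter({'2007': 1, '2006': 1})
```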
One of the biggest challenges of implementing organization-wide search is the indexing strategy. Do you use a pure indexing engine and keep one large index (physical or virtual), or do you take advantage of the fact that you already have a multitude of search engines inside your organization and use a federation engine to fan queries out to each of them?
Most organizations use a hybrid approach. If you don't have content already indexed somewhere, then you need an indexing engine to make it searchable, but indexing in and of itself is commodity technology. Whether – or, more accurately, when – to start federating versus indexing is usually driven by the project hitting the proverbial wall, and this can happen for a variety of reasons. One is the desire to incorporate external subscription or 'deep web' content – content that either is not allowed to be crawled or is only served up dynamically in response to a query. Another is that most pure indexing engines have trouble scaling to very large document collections without very expensive hardware and potentially steep license fees. A third (and usually the driving factor) is the challenge of harmonizing different security models, a plethora of crawling methods, differing rates of change in the data, and the organizational and geographic distribution of content sites.
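As a sketch of the federated half of that hybrid approach, the Python below fans one query out to several engines in parallel and merges what comes back. The SearchEngine wrapper and the score-based merge are assumptions for illustration; a real federator would also have to normalize relevance scores and enforce each source's security model, which is exactly where the hard work lies.

```python
from concurrent.futures import ThreadPoolExecutor

class SearchEngine:
    """Hypothetical wrapper around one existing departmental engine."""
    def __init__(self, name, backend):
        self.name, self.backend = name, backend

    def search(self, query):
        # backend is assumed to return [(doc_id, score), ...]
        return [(self.name, doc, score) for doc, score in self.backend(query)]

def federated_search(query, engines, limit=10):
    """Fan the query out to every engine in parallel, then merge.

    Sorting on raw scores assumes they are comparable across engines;
    real federators must normalize relevance before merging.
    """
    with ThreadPoolExecutor(max_workers=len(engines)) as pool:
        result_lists = pool.map(lambda e: e.search(query), engines)
    merged = [hit for hits in result_lists for hit in hits]
    merged.sort(key=lambda hit: hit[2], reverse=True)
    return merged[:limit]
```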
Content intelligence is a key technology for giving organizations a complete picture of what's going on within their business. Traditional BI has focused solely on the 10-15 percent of enterprise data that is already structured – and therefore amenable to analysis, query and reporting – while the vast majority of content in email repositories, content management systems, databases and CRM systems can't be analyzed because it isn't described in any way. Search technology has made that content findable, but the onus is still on the user to read through all the results, so search by itself is a poor analytical tool.
The content intelligence engine should be independent of any one application, so that its results can be shared across many applications – search, content management, storage, BI – without having to define separate systems and methodologies for each of those. And it should leverage existing knowledge within the organization so that you get a common layer of metadata that unifies your structured and unstructured data into a full-spectrum view of what’s going on within your business.
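One way to picture that application-independent layer, as a rough sketch: a single annotation pass publishes metadata to a shared store, and search, content management, storage and BI all read from the same records instead of each re-deriving structure on their own. The store and annotator interfaces here are hypothetical, shown only to illustrate the decoupling.

```python
class MetadataStore:
    """Hypothetical shared store: one metadata layer, many consumers."""
    def __init__(self):
        self._records = {}

    def publish(self, doc_id, metadata):
        self._records.setdefault(doc_id, {}).update(metadata)

    def lookup(self, doc_id):
        return self._records.get(doc_id, {})

def annotate(doc_id, text, annotators, store):
    """Run every registered annotator once over the content; every
    application then shares the result via the store."""
    for annotator in annotators:
        store.publish(doc_id, annotator(text))
```

A search indexer, an ECM classifier and a BI report would all call lookup() on the same store, which is the point: one methodology for deriving metadata rather than one per application.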
I believe search will be integrated into more and more applications as an expected feature, and end users' expectations of how good their search engine should be keep rising. The increasing use of collaboration technologies – blogs, wikis, social networking sites – is creating for each user a rich, analyzable record of areas of interest and expertise, and of connections to other people with similar interests and expertise. In one sense this is nothing new – bulletin boards, listservs, chatrooms and the like have existed for some time – but the widespread adoption of collaboration across very large user communities and the openness of the tools are what really differ in the Web 2.0 world. Search needs to tap into this inherent knowledge about preferences and relevance in order to serve up more targeted content to end users.
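A minimal sketch of what tapping into those collaboration signals might look like: mine a term profile from a user's blog posts, wiki edits and tags, then boost results whose terms overlap it. The profile format and the boost weight are assumptions for illustration, not a description of any shipping engine.

```python
from collections import Counter

def interest_profile(user_texts):
    """Hypothetical profile: term frequencies mined from a user's
    blog posts, wiki edits and tags."""
    profile = Counter()
    for text in user_texts:
        profile.update(text.lower().split())
    return profile

def personalized_score(base_score, doc_terms, profile, weight=0.1):
    """Boost the engine's base relevance score by the overlap between
    a document's terms and the user's interest profile."""
    overlap = sum(profile[term] for term in doc_terms if term in profile)
    return base_score + weight * overlap
```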
I also see the content intelligence layer leveraging more of the Web 2.0 experience, though you won't see the proliferation of purely manual tagging of content the way you do on photo-sharing sites and the like. Experience with mandatory tagging in the content management sector suggests a 'garbage in, garbage out' effect: data quality is usually not very high, since users are neither professional librarians nor keen to do something that doesn't directly relate to their job. I see instead an approach that combines some automation technology with a fun, interactive end-user experience as well as – and this is critically important – an incentive structure that rewards and encourages employees to collaborate.
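In code, combining automation with a light-touch user experience might look like the system proposing candidate tags that a user simply confirms or rejects rather than free-typing, which is one way to sidestep the garbage-in problem. The taxonomy-matching helper below is a hypothetical stand-in for real classification technology.

```python
def suggest_tags(text, taxonomy):
    """Propose tags automatically: taxonomy terms found in the text.
    (A stand-in for genuine classification technology.)"""
    words = set(text.lower().split())
    return [term for term in taxonomy if term in words]

def confirm_tags(suggestions, accept):
    """Users confirm or reject suggestions instead of typing tags
    from scratch, keeping data quality high with minimal effort."""
    return [tag for tag in suggestions if accept(tag)]
```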
About the author
Ian Hersey is VP for Technology Development and Strategy at Business Objects. He came to Business Objects through its acquisition of Inxight Software, Inc., where he was co-founder and SVP of Corporate Development and Strategy.