This is a common question from large corporations when they are evaluating enterprise systems for search, preservation and collection. The answer is neither simple nor definitive. First, let us look at what conceptual clustering is and what it can and cannot do for us. Conceptual engines use a variety of methods to group common sets of words, phrases and even properties into associated lists that vendors call concepts, topics, facets and more names than can be listed here. Each engine has its own secret sauce for ranking and grouping these related words and properties to extract analytical relationships among the vast number of items in a typical collection.
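To make the idea of grouping by shared words concrete, here is a toy sketch in Python. It is not how any commercial engine works; real products use far more sophisticated and proprietary ranking. The function names, the stopword list and the 0.25 similarity threshold are all assumptions made for illustration. It simply puts items that share enough vocabulary into the same group and names each group after its most frequent terms.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list for the illustration.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "on", "for", "is"}


def terms(text):
    """Return the set of lowercase alphabetic words in text, minus stopwords."""
    return {t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS}


def jaccard(a, b):
    """Share of terms two sets have in common (0.0 = none, 1.0 = identical)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def cluster(docs, threshold=0.25):
    """Greedy grouping: each item joins the first group whose first member
    shares at least `threshold` of its vocabulary, else it starts a new group.
    Returns a list of groups, each a list of indexes into docs."""
    groups = []
    for i, doc in enumerate(docs):
        doc_terms = terms(doc)
        for members in groups:
            if jaccard(doc_terms, terms(docs[members[0]])) >= threshold:
                members.append(i)
                break
        else:
            groups.append([i])
    return groups


def label(docs, members, size=2):
    """Name a group after its most frequent terms (ties broken alphabetically)."""
    counts = Counter()
    for i in members:
        counts.update(terms(docs[i]))
    ranked = sorted(counts, key=lambda t: (-counts[t], t))
    return ranked[:size]
```

Even in this toy version, the group name is just a by-product of word counts, which hints at why explaining a real engine's cluster names to a non-technical audience can be difficult.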
A conceptual engine essentially lets the electronically stored information (ESI) speak for itself. Rather than a person creating a search from their preconceptions of what criteria will retrieve all items related to a given request, the system analyzes the items and presents them back as folders, dot clusters and other visual diagrams to help the user make sense of the complex relationships. All of this sounds like just what the attorney asked for: “Give me everything relating to this deal.”
But one should keep in mind that concepts are just groups of words or phrases that express similar ideas. They do not directly answer the typical request for production demanding ‘any and all documents or communications that constitute, contain, embody, comprise, reflect, identify, state, refer to, deal with, comment on, respond to, describe, involve, mention, discuss, record, support or are in any way pertinent to’ the topic at hand. That means that you cannot just plug the text of the demand into a conceptual search and expect to retrieve everything required.
To make matters even more complicated, the secret sauce of conceptual engines is a lot like black magic to those who do not spend all of their time figuring out how to build a better search engine, which means most of us. Trying to explain to the court how such an engine determined that all these documents were related, and why the cluster was named that way, can challenge the bravest of testifying experts and bury the credibility of your efforts in a sea of technical fog.
“So what good are concepts if I cannot search on them?” asks counsel. No one doubts the worth of conceptual analysis for finding the key items produced by the opposing party, but how does that help the corporation find all of its relevant email? The first step of incident response in the EDRM model is the Identification phase. The traditional approach is to interview the primary custodians associated with the matter and try to generate the preservation instructions. When dealing with enterprise archives like Autonomy’s EAS, that means coming up with search criteria to place your legal holds. How do you know what criteria are related to the matter? Up to this point, counsel inside and out have been doing their best to define the criteria, but few have leveraged a defensible, explainable process to support their choices.
Maybe you have already collected the initial, key ESI that the custodians thought relevant to the matter. Or you have run a search using the truly unique terms, names and dates to perform your risk assessment. If you have a collection that you know is relevant, conceptual search engines like Autonomy’s IDOL can show you the concepts within that collection so that you can define or expand your actual hard Boolean search criteria. Now you can point to the tools you used as evidence of your due diligence, and you can catch the variations and jargon people use to talk about your topic.
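The workflow above, starting from a known-relevant collection and using it to expand hard Boolean criteria, can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not a description of how IDOL or any other product works: the function names, the stopword list, the smoothing constants and the sample emails are all hypothetical. It ranks terms that show up in the relevant set much more often than in a background sample, then folds the best ones into an OR query that a person can read, review and defend.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list for the illustration.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "on", "for",
             "is", "we", "at", "with", "our"}


def tokenize(text):
    """Lowercase the text and return its alphabetic words, minus stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def document_frequency(docs):
    """Count, for each term, how many documents contain it at least once."""
    df = Counter()
    for doc in docs:
        df.update(set(tokenize(doc)))
    return df


def suggest_terms(seed_docs, background_docs, seed_terms, top_n=3):
    """Rank new terms that are more common in the known-relevant (seed) set
    than in the background set. Terms already in seed_terms are skipped, as
    are terms found in fewer than two seed documents."""
    seed_df = document_frequency(seed_docs)
    back_df = document_frequency(background_docs)
    known = {t.lower() for t in seed_terms}
    scores = {}
    for term, count in seed_df.items():
        if term in known or count < 2:
            continue
        # Add-one smoothing so unseen background terms do not divide by zero.
        seed_rate = (count + 1) / (len(seed_docs) + 2)
        back_rate = (back_df[term] + 1) / (len(background_docs) + 2)
        scores[term] = math.log(seed_rate / back_rate)
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:top_n]


def boolean_query(terms):
    """Join terms into a flat OR query, sorted so the output is repeatable."""
    return " OR ".join(sorted(terms))


if __name__ == "__main__":
    # Made-up sample emails; "falcon" stands in for a unique project name.
    relevant = [
        "Project Falcon term sheet attached for the acquisition",
        "Falcon acquisition: the term sheet needs review before closing",
        "Closing checklist for Falcon, see term sheet",
    ]
    background = [
        "Lunch on Friday at the usual place",
        "Quarterly sales review meeting moved to Monday",
        "Please review the attached expense report",
    ]
    extra = suggest_terms(relevant, background, ["falcon"])
    print(boolean_query(["falcon"] + extra))
```

The point of the design is that the output is an ordinary keyword query: the statistical step only proposes candidates, and counsel still decides which terms go into the legal hold criteria.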
So I do not feel that counsel and the courts are ready to rely on conceptual search as the sole means to retrieve ESI, but it can enable, defend and control the cost of corporate preservation and collection efforts when used in a structured process. In the end, it is not concept vs. keyword. Instead, the products and solutions that integrate the two will provide the most bang for the buck.