Searching across unstructured data stores and understanding who owns that data are emerging as higher priorities in today’s Big Data era. However, archiving software can vary greatly in how it performs search and assigns data ownership. In this fourth blog entry in my interview series with C2C Systems’ CTO Ken Hughes, he examines how C2C performs search across distributed email and file systems and what techniques it employs to establish data ownership.

Charles:  How does C2C do search? It can set up processes, policy definitions, and implementations based on certain time frames. From an architecture standpoint, is this a distributed type of system that’s doing this? Or is it all done from a central location?

Ken:  The answer is both. C2C offers software that runs on a server, which is typically in your data center, close to or next to your Exchange server.

All of the mailbox data, or data that we have previously pulled into the archive, is easy to process. Most vendors can do that because they can guarantee that the source of the data is online. So the real question becomes, “How do we access the remote workstations with PSTs on them or the file servers with PSTs?”

To do that, C2C has some client technology that runs only in memory, so no install is required. Typically C2C makes a one-line edit to a user’s login script that runs the client from a network share.

In many cases users do not even know that it is running in the background. It will basically sit in memory and wait to be told what to do by the central server. So every five minutes it will wake up and ask if there is anything to do.
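The wake-up-and-ask behavior described here can be pictured as a simple polling loop. The following is only a minimal sketch, not C2C's actual client: the names `run_agent`, `fetch_tasks`, and `handle_task` are hypothetical, and the only detail taken from the interview is the five-minute interval.

```python
import time
from typing import Callable, Iterable, Optional

# "Every five minutes", per the interview.
POLL_INTERVAL_SECONDS = 5 * 60


def run_agent(
    fetch_tasks: Callable[[], Iterable[str]],   # hypothetical: asks the central server for work
    handle_task: Callable[[str], None],         # hypothetical: carries out one instruction
    sleep: Callable[[float], None] = time.sleep,
    max_polls: Optional[int] = None,            # None means poll forever
) -> int:
    """Poll the server at a fixed interval and process whatever it returns.

    Returns the number of tasks handled (only reachable when max_polls is set).
    """
    handled = 0
    polls = 0
    while max_polls is None or polls < max_polls:
        for task in fetch_tasks():
            handle_task(task)
            handled += 1
        polls += 1
        # Sleep between polls, but not after the final one.
        if max_polls is None or polls < max_polls:
            sleep(POLL_INTERVAL_SECONDS)
    return handled
```

Passing `sleep` and `max_polls` as parameters is purely a convenience for trying the loop out without waiting; a real background client would presumably run until the user logs off.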

It basically works in two modes. One is Outlook mode, whereby it will wait for the user to open Outlook and then examine the PSTs that are open within the Outlook session, and do whatever processing is needed. It searches through all the messages, finds everything that matches the criteria, and then carries out “X” actions on them.

The other mode is file system mode. A user tells it either to search all the drives or to search specific paths. It then does not care whether Outlook is open or not. It goes through the file system, finds any PSTs, loads them into its own version of Outlook, checks through all the data, and carries out the actions on them.
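The discovery half of file system mode amounts to walking a set of paths and picking out PST files. Here is a rough sketch of that step alone, using only the Python standard library. The function name `find_pst_files` is hypothetical, and opening the PSTs and acting on their contents is out of scope.

```python
import os
from typing import Iterable, Iterator


def find_pst_files(roots: Iterable[str]) -> Iterator[str]:
    """Yield the full path of every .pst file found under the given roots."""
    for root in roots:
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                # Match the extension case-insensitively (.pst, .PST, ...).
                if name.lower().endswith(".pst"):
                    yield os.path.join(dirpath, name)
```

Calling `list(find_pst_files(["C:\\Users"]))` would, under these assumptions, return every PST path beneath that folder; searching "all the drives" is the same call with each drive root passed in.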

Then it has to send the data off for archiving, which it does over web services. An organization can therefore run it inside corporate HQ, or individuals can run it from home or really anywhere, since web services usually traverse firewalls. So whether C2C is archiving, moving, or copying the data, the data is sent up to the server and processed from there.

The precursor to all of this obviously is that when an organization starts today it does not know what PSTs are out there, which ones it wants to migrate, or even which ones it wants to process. Organizations do not know the scale of the problem they have. Is it 1 TB? Is it 100 TB? Who knows?

Organizations can make some guesstimates with some storage tools. But until they actually have the data and understand it, they do not really know. So what C2C offers is a couple of nodes that tell organizations about the PSTs that C2C has found.

As C2C searches for these, the first thing it does is say, “Hey, we found a PST! Here is the kind of data that is in it, and here is the metadata about that PST: who owns it, where it is on disk, what client or workstation it was on, and some information about its size and the size of the data inside it.”

C2C will tell the organization the mailbox name. It will give it the title of that PST. It will reveal the size on disk as well as the size of the data inside it. There can be vast differences between the two if there is a lot of white space, or if it was a big PST and an individual has deleted a lot of data.

C2C has seen organizations with over 100,000 PSTs that, after using a storage tool, think they have hundreds of GB or even TBs.  But when they actually analyze the volume of data inside the PSTs, the actual sizes of the messages reveal a completely different amount of data. It can be 50 percent or more wasted space.

This also tells you some things about mailbox quotas. You obviously do not want to go ahead and ingest the PST if it is going to blow your mailbox quota.
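To make the arithmetic concrete, the gap between size on disk and the size of the data inside a PST, and the quota check that follows from it, can be sketched as below. Both function names are hypothetical, and the sizes are assumed to be in the same unit (bytes, MB, or whatever the reporting tool provides).

```python
def wasted_fraction(size_on_disk: float, data_size: float) -> float:
    """Fraction of the PST file that holds no message data (0.0 to 1.0)."""
    if size_on_disk <= 0:
        raise ValueError("size_on_disk must be positive")
    # Clamp at zero in case the reported data size exceeds the file size.
    return max(0.0, (size_on_disk - data_size) / size_on_disk)


def fits_quota(mailbox_used: float, data_size: float, quota: float) -> bool:
    """True if ingesting the PST's data would keep the mailbox within quota."""
    return mailbox_used + data_size <= quota
```

For example, a 2,000 MB PST containing 1,000 MB of messages gives `wasted_fraction(2000, 1000) == 0.5`, the 50 percent case mentioned above. The quota check deliberately uses the data size rather than the size on disk, since only the messages themselves would land in the mailbox.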

The other problem is when you have just found a PST on a disk somewhere. This is a different kettle of fish. What C2C has to do with these is examine all the data contained within it, and then make a calculation as to who we believe the owner is.

C2C does that with four separate algorithms, and each one comes with a different confidence factor. The best case is obviously a PST that is loaded in Outlook. If we cannot do that, then we are just looking at uncoupled PSTs. Then the first thing we do is look in the sent items folder. If everything has been sent from one user, we can pretty much be sure that that user is the owner of the data, which gives us about a 95 percent confidence factor.

From there we fall back to some methodologies such as counting recipients and tallying up who is copied on the cc fields and bcc fields. Then we look for the most prominent user. The ratio of that most prominent user to the next users determines the confidence factor.
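The two fallbacks for uncoupled PSTs can be illustrated with a small sketch. This is not C2C's algorithm: the `Message` type, the `guess_owner` function, and the fallback scoring (the top recipient's share against the runner-up, capped at 0.9) are assumptions made for illustration. Only the sent-items rule and its roughly 95 percent confidence come from the interview.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Message:
    folder: str                      # e.g. "Sent Items" or "Inbox"
    sender: str
    recipients: List[str] = field(default_factory=list)  # to + cc + bcc combined


def guess_owner(messages: List[Message]) -> Tuple[Optional[str], float]:
    """Return (owner, confidence) for a PST that is not loaded in Outlook."""
    # Step 1: if every sent item comes from one user, that user is the owner.
    senders = {m.sender for m in messages if m.folder == "Sent Items"}
    if len(senders) == 1:
        return senders.pop(), 0.95   # figure quoted in the interview

    # Step 2: tally recipients and look for the most prominent user.
    tally = Counter(r for m in messages for r in m.recipients)
    if not tally:
        return None, 0.0             # nothing to go on; leave unassigned
    ranked = tally.most_common(2)
    top_user, top_count = ranked[0]
    runner_up_count = ranked[1][1] if len(ranked) > 1 else 0

    # Hypothetical scoring: the wider the lead over the next user, the
    # higher the confidence, but never above the sent-items case.
    share = top_count / (top_count + runner_up_count)
    return top_user, min(0.9, share)
```

A PST whose received mail lists one address three times and another once would come out at 0.75 confidence for the first address under this scoring. A low score is the kind of result an administrator might choose to review by hand.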

That allows us to work out in an automated fashion who the owner of the data is, because at some point, if an organization is doing a discovery or wants to migrate data into a mailbox, it needs to know which mailbox to migrate into.

This is automated, though an organization can manually override it if, for instance, it does not have a high level of confidence in the accuracy of the results. In those cases C2C can leave it unassigned, and the organization can go ahead and manually assign ownership if needed.

In Part I of this interview series, we discuss C2C’s focus on Microsoft Exchange and which size environments C2C’s products are best positioned to handle.

In Part II of this interview series, Ken explains why eDiscovery and retention management are becoming the new driving forces behind archiving and why C2C’s ArchiveOne is so well positioned to respond to that trend.

In Part III of this interview series, Ken discusses C2C’s policy management features and the granular ways in which users may manage deletion in their data stores.

In Part V of this interview series, Ken explains how C2C manages archival data stores.