“Early Case Assessment” is a broad term in e-discovery today, and its definition runs the gamut depending on whom you ask. It could be anything from a report on how many emails were collected to a summary report that greatly aids in understanding the actual narrative of a case. Because the price of “ECA” varies widely and your particular needs may vary, it is important to ensure that you are getting what you pay for and to understand the different categories of data and evidence analysis.
Capabilities and Definitions (from simple to complex analysis):
- Data Size Analysis
This is simply providing reports of data size. Almost any collection, processing or review platform can provide some measure of this kind of analysis. Analysis of data size is often key to e-discovery cost prediction.
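As a rough illustration of a data size report, the sketch below totals bytes per file type. The input shape (pairs of path and size) and the function name `size_by_extension` are assumptions made for this example, not the output of any particular collection or processing platform.

```python
from collections import defaultdict
from pathlib import PurePath


def size_by_extension(files):
    """Sum byte sizes per lowercase file extension.

    `files` is an iterable of (path, size_in_bytes) pairs (a hypothetical
    input shape chosen for this sketch).
    """
    totals = defaultdict(int)
    for path, size in files:
        # Files without an extension are grouped under "(none)".
        ext = PurePath(path).suffix.lower() or "(none)"
        totals[ext] += size
    return dict(totals)


sample = [
    ("a/report.PDF", 2_000_000),
    ("b/budget.xls", 500_000),
    ("c/memo.pdf", 1_000_000),
]
totals = size_by_extension(sample)
```

Totals like these are the usual starting point for a per-gigabyte cost estimate.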
- Data and Metadata Statistical Analysis
The vast majority of “Early Case Assessment” tools essentially provide graphs, grids and tables of statistics about discovered data. The goal is to give counsel choices about how to cull and identify relevant documents. This information can be tremendously helpful, but it is something of a misnomer to consider a graph of the dates of emails, or the percentage of documents that are PDF or XLS files, to be an assessment of the case. What we’re really looking at here is Early Data Analysis. How early and how detailed the information is depends upon the tool used and the manner in which the specialist uses it. Vendors vary in the skill and flexibility with which they provide these reports and allow them to be combined or pivoted.
- Standard Date Analysis
This subtype of metadata statistical analysis is so commonly displayed in histograms that I feel it deserves its own section. This is the practice of graphing the date sent, received, modified, or another pertinent date in a histogram to look for spikes in activity. Because the volume of someone’s email or documents only shows activity level, it is more useful when cross-referenced with other types of culling or searching. Date analysis combined with search terms can provide better insight into the dates during which a certain topic was discussed. For instance, if we know that the number of emails containing “let go” or “fire” has increased during a particular stretch of the data, we may be able to zero in on the period of time when an employment dispute was brewing. Search terms, even when a thesaurus is applied, are precise and easily understood, but they are still static instruments.
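A minimal sketch of date analysis combined with search terms might look like the following. The sample emails, the function name `monthly_hits`, and the plain substring matching are all assumptions for illustration; a real review platform would use an index and proper tokenization.

```python
from collections import Counter
from datetime import date


def monthly_hits(emails, terms):
    """Count emails per (year, month) whose body contains any search term.

    Matching is a plain case-insensitive substring test, so "fire" would
    also match "fired" or "firewall". That is a simplification.
    """
    lowered = [t.lower() for t in terms]
    counts = Counter()
    for sent, body in emails:
        text = body.lower()
        if any(t in text for t in lowered):
            counts[(sent.year, sent.month)] += 1
    return counts


# Made-up sample data: (date sent, body text).
emails = [
    (date(2012, 3, 5), "Lunch on Friday?"),
    (date(2012, 4, 2), "We may have to let go of two people."),
    (date(2012, 4, 9), "HR says we can fire him next week."),
    (date(2012, 4, 20), "Quarterly numbers attached."),
]
hits = monthly_hits(emails, ["let go", "fire"])
peak_month, peak_count = hits.most_common(1)[0]
```

Plotting `hits` as a histogram shows when the topic was discussed, which raw message volume alone does not.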
- Advanced Date Recognition
Dates come in many forms, and not just month/day/year versus year/month/day. In reality, people communicate time in a variety of ways that rely upon context. “I’ll see you tomorrow” means the day after the message was written. “The first week in June” means a range of dates that depends upon the year. Simply filtering on the dates of emails can ignore important context in the text of the emails themselves. Newer analytic tools do, and must, take this into account.
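To make the idea concrete, here is a deliberately tiny sketch that resolves the two phrases above against the date an email was sent. The function name `resolve_relative_date` and its handful of hard-coded rules are hypothetical. Commercial tools rely on far richer language models, and the assumption that “June” means June of the send year will not always hold.

```python
from datetime import date, timedelta


def resolve_relative_date(phrase, sent_on):
    """Map a few relative phrases to a (start, end) date range.

    The range is anchored on the date the message was sent. Returns None
    when the phrase is not one of the hard-coded patterns.
    """
    p = phrase.lower()
    if "tomorrow" in p:
        day = sent_on + timedelta(days=1)
        return day, day
    if "yesterday" in p:
        day = sent_on - timedelta(days=1)
        return day, day
    if "first week in june" in p:
        # Assumption: June of the year in which the message was sent.
        return date(sent_on.year, 6, 1), date(sent_on.year, 6, 7)
    return None


tomorrow = resolve_relative_date("I'll see you tomorrow", date(2012, 12, 31))
june = resolve_relative_date("The first week in June", date(2012, 4, 9))
```

The first call rolls over into the next year, which is exactly the kind of context a filter on the email's own date field would miss.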
- Textual Analysis
Whether pulled directly from the files themselves or lifted from images using Optical Character Recognition, text in a discovery collection can be subjected to many types of analysis. The most common type of textual analysis examines the relationships between words, other words, and the documents in which they are found. A text analysis engine that maps these relationships can then score the similarity of two documents using an algorithm that takes into account, in aggregate, the number of connections between the two documents and the degrees of separation between them. By applying this scoring, one can calculate the similarity of a group of documents to all other documents in a set. The end result is that you can drop, say, a Wikipedia article on your subject matter into a search box, with no formatting or extra work necessary, and instantly get a list of the highest-scoring documents. These will tend to share the same text, but will also include documents that perhaps use completely different words yet are about the same thing. This kind of analysis is often the engine behind so-called predictive coding, which categorizes documents as highly similar to example documents that humans have already categorized a certain way.
Many times, textual analysis tools have brought me and my clients quickly to documents we may never have found with simple search terms.
There are several analysis engines on the market that use similar technology, but with different algorithms and approaches. The best tools are those that require the fewest rounds of quality-control review.
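The simplest form of similarity scoring can be sketched with term counts and cosine similarity. This is a swapped-in, much cruder technique than the concept-based engines described above: it only rewards shared words, so it will not find documents that use completely different words for the same subject. The document names and texts are invented for the example.

```python
import math
import re
from collections import Counter


def term_counts(text):
    """Lowercase the text and count its word tokens."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two term-count vectors (0.0 to 1.0)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_by_similarity(query, docs):
    """Return (score, name) pairs, highest score first."""
    q = term_counts(query)
    scored = [(cosine_similarity(q, term_counts(text)), name)
              for name, text in docs.items()]
    return sorted(scored, reverse=True)


docs = {
    "memo": "The pump seal failed during the pressure test",
    "lunch": "Pizza order for the team lunch on Friday",
    "report": "Pressure test results show seal failure in pump two",
}
ranked = rank_by_similarity("pump seal failure pressure test", docs)
ranked_names = [name for _, name in ranked]
```

Note that “failed” in the memo earns no credit against “failure” in the query. Closing that gap is what the commercial engines are selling.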
- Key Personnel and Relationship Analysis
Many tools in the e-discovery marketplace offer some degree of analysis of ‘who talks to whom’ to allow a visualization of relationships. How deeply these relationships are understood depends upon the tool. Be sure that the software is not just giving you a visualization of who sends email to whom most frequently, because what they talk about and how they talk to each other is going to be more useful in reconstructing an organizational chart, seeing who manages a department, or determining whose email might contain the most important evidence. Relationship analysis can be a large time-saver in constructing players’ lists and organizational charts and in determining who the key custodians really are. What can be fascinating is seeing who the key operational person is, or who is really calling the shots.
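The difference between raw frequency and topic-aware relationships can be sketched as follows. The message tuples, names, and the `pair_counts` function are hypothetical, and a substring test on the body stands in for real topic analysis.

```python
from collections import Counter


def pair_counts(messages, topic=None):
    """Count directed (sender, recipient) pairs.

    If `topic` is given, only messages whose body contains it
    (case-insensitive substring match) are counted.
    """
    counts = Counter()
    for sender, recipients, body in messages:
        if topic and topic.lower() not in body.lower():
            continue
        for recipient in recipients:
            counts[(sender, recipient)] += 1
    return counts


# Made-up sample data: (sender, recipients, body).
messages = [
    ("ann", ["bob"], "Budget review tomorrow"),
    ("ann", ["bob", "cho"], "Re: layoffs plan"),
    ("bob", ["ann"], "Layoffs list attached"),
    ("cho", ["ann"], "Lunch?"),
]
all_pairs = pair_counts(messages)
layoff_pairs = pair_counts(messages, topic="layoffs")
```

Here `all_pairs` is the frequency-only view, while `layoff_pairs` shows who was actually part of the conversation that matters.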
- Entity Recognition and Resolution
People have different personas and different email addresses. Corporations have multiple DBAs and subsidiaries. Email accounts have multiple aliases. Do we really want to rely upon reading emails and scanning indexes to compile a list in order to develop search terms? Some software can determine whether IRS and Internal Revenue Service are the same thing, or whether email@example.com is the same person or entity as firstname.lastname@example.org. This is referred to as Named Entity Resolution. If your spam filter did this perfectly every time, email spammers would be out of business.
On the other hand, how can we determine whether the Kennedy referenced is the person, the highway, or the university? Some software can recognize these differences, and this is referred to as Named Entity Recognition. If the human mind did this perfectly every time, there might never have been the Abbott and Costello skit “Who’s on First?”
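At its most basic, entity resolution is a lookup from many surface forms to one canonical name. The sketch below uses a hand-built alias table, which is an assumption for illustration only. The software described above infers such mappings from the data rather than relying on a manual list.

```python
# Hypothetical, hand-maintained alias table: normalized mention -> canonical name.
ALIASES = {
    "irs": "Internal Revenue Service",
    "i.r.s.": "Internal Revenue Service",
    "internal revenue service": "Internal Revenue Service",
}


def resolve_entity(mention):
    """Return the canonical name for a mention, or the mention itself if unknown."""
    key = " ".join(mention.lower().split())  # lowercase and collapse whitespace
    return ALIASES.get(key, mention)


def same_entity(a, b):
    """True when both mentions resolve to the same canonical name."""
    return resolve_entity(a) == resolve_entity(b)


match = same_entity("IRS", "Internal  Revenue Service")
mismatch = same_entity("IRS", "FBI")
```

Once mentions are resolved this way, a single search on the canonical name covers every alias.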
- Communication Timing Analysis
Whether a communication was sent during standard work hours, or was so important as to be sent outside the sender’s or the recipient’s normal work hours, can be a useful indicator of whether it merits closer scrutiny. Of course, we all know people who send useless communications at off hours or overestimate the importance of their messages.
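A simple off-hours flag can be sketched like this. The 9-to-5 weekday window and the function name `is_off_hours` are assumptions, and the timestamp is assumed to already be in the sender's local time zone. In practice, normal hours would be learned per person.

```python
from datetime import datetime, time


def is_off_hours(sent, start=time(9, 0), end=time(17, 0)):
    """True if `sent` falls on a weekend or outside the start-end window.

    `sent` is assumed to be in the sender's local time zone.
    """
    if sent.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        return True
    return not (start <= sent.time() < end)


late_monday = is_off_hours(datetime(2012, 4, 9, 23, 15))
mid_morning = is_off_hours(datetime(2012, 4, 9, 10, 0))
saturday = is_off_hours(datetime(2012, 4, 14, 10, 0))
```

A flag like this is only a hint for prioritizing review, for the reason given above: some people send everything at midnight.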
- Communication Sentiment Analysis
Sentiment analysis is now light years ahead of dropping expletives into a search box. The most advanced ECA tools can analyze the language in messages, determine whether a conversation has turned positive or negative using a variety of linguistic comparisons, and identify euphemistic, even passive-aggressive, language. How many times do our elders have to remind us that it’s not what you say, but how you say it?
- Pattern Analysis – Combining All of the Aforementioned Forms of Analysis
What it all boils down to is understanding the narrative of the case: what happened, when, to whom, and why. Litigation is reassembling Humpty Dumpty, and this time we have more than a bunch of horses and feudal sycophants to reassemble him. Few existing tools can successfully blend most of the preceding kinds of analysis to tell you very quickly who talks to whom about what, how they talk, and what they did. When we are able to analyze many dimensions at once (senders, recipients, time of day, topics, and sentiment), we can catch instances where people break their patterns. When they cause anomalies, they are giving us a behavioral signal that the time period may be significant.
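One very simple way to spot a broken pattern is to flag periods whose activity sits far from the average. The sketch below applies a z-score to weekly message counts for one custodian. The counts, the threshold of 2.0, and the function name `anomalous_periods` are assumptions, and it looks at only one dimension rather than the blended analysis described above.

```python
from statistics import mean, pstdev


def anomalous_periods(counts, threshold=2.0):
    """Return the periods whose count is more than `threshold` standard
    deviations from the mean of all periods."""
    values = list(counts.values())
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []  # perfectly uniform activity, nothing stands out
    return [period for period, value in counts.items()
            if abs(value - mu) / sigma > threshold]


# Made-up weekly message counts for a single custodian.
weekly = {
    "wk1": 10, "wk2": 12, "wk3": 11, "wk4": 9,
    "wk5": 10, "wk6": 11, "wk7": 40, "wk8": 10,
}
flagged = anomalous_periods(weekly)
```

The same test could be run on off-hours counts, topic hits, or sentiment scores, and weeks flagged on several dimensions at once are the ones worth reading first.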
Source of graphic: http://www.imdb.com/media/rm2133579264/tt0122933?ref_=tt_ov_i