19/09/23 4:10 PM

Identifying duplicate emails just got easier, thanks to EDRM

Picture of Jo Sherman, CEO and Founder, EDT

Identifying and managing duplicate emails within one evidence management platform is already complicated. Do you use case-wide, per-custodian, custodian ranking, or family duplicate identification? The answer is often ‘all of the above’.

But what happens when you receive a production of emails from someone who used different software from yours – or, the horror, as TIFF files – and you want to know which ones you’ve already seen? Until recently, it was double trouble.

The main challenge has been that each discovery software vendor uses its own formula to generate the MD5 hashes that uniquely identify each email message. As a result, the same email message would generate a different hash in EDT than it would in, say, Relativity. So even if you received a load file that included hashes for email messages, your platform wouldn’t be able to match up the duplicates. Instead, you’d have to reprocess the production in your own software to generate hashes using its formula. Even then, that could be problematic if you’re not working with native versions of the emails.

[Image: multiple copies of movable type letters in a wooden frame.] It’s a struggle to identify multiple identical emails when they’re handled by different software. Photo by Natalia Y.

In February 2023, industry association EDRM released the Cross Platform Email Duplicate Identification Specification. This was the result of a diverse team of experts – including industry luminaries Craig Ball and Beth Patterson; EDT’s Phil Haselden, Paul Hunter, and yours truly; and representatives from Nuix, Relativity, and Reveal – collaborating for the greater good.

EDRM’s simple and elegant solution

EDRM’s specification delivers a simple and elegant solution. It relies on the fact that a Message-ID header – a unique identifier for each email message – has been part of the internet email message format since RFC 822 was published in 1982, and remains part of the current standard, RFC 5322: Internet Message Format. All RFC-compliant email servers generate these Message-IDs.

If you look at the headers (usually hidden by your email client) of every RFC-compliant email, you’ll see a header that looks something like this (an illustrative placeholder, not a real message):

Message-ID: <unique-string@mail.example.com>

The less-than and greater-than signs (< and >) are delimiters that show the start and end of the Message-ID.
Everything to the right of the @ is typically the domain name of the email server that sent the message.
Everything to the left is a string of characters that the email server generates and that it guarantees is unique to that email message.
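
To make those three parts concrete, here is a minimal Python sketch that splits a Message-ID value into the server-generated string and the domain. The function name split_message_id and the sample value in the usage note are made up for illustration, and real-world headers can be messier than this handles.

```python
from typing import Tuple


def split_message_id(header_value: str) -> Tuple[str, str]:
    """Split '<left@right>' into (left, right).

    Illustrative only: assumes a single, well-formed Message-ID value.
    """
    value = header_value.strip()
    # The angle brackets mark the start and end of the Message-ID.
    if not (value.startswith("<") and value.endswith(">")):
        raise ValueError("Message-ID is not wrapped in angle brackets")
    inner = value[1:-1]
    # Split on the last '@': left is the unique string, right is the domain.
    left, separator, right = inner.rpartition("@")
    if not separator or not left or not right:
        raise ValueError("Message-ID does not have the form left@right")
    return left, right
```

Calling split_message_id("<1234abcd@mail.example.com>") returns the pair ("1234abcd", "mail.example.com").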

EDRM’s specification requires generating an EDRM Message Identification Hash (MIH) – which is “the MD5 hash value of the ASCII string comprised of the Message-ID header field of RFC-compliant email messages” – and adding the MIH to the software platform’s record for that email.
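
As a rough illustration of that definition, the sketch below reads the Message-ID header from a raw email and takes the MD5 hash of its ASCII bytes. It is not EDRM's reference implementation. The function name message_id_hash is hypothetical, and whether the surrounding angle brackets and whitespace are removed before hashing is an assumption here (exposed as a parameter), so check the specification itself for the exact normalization rules.

```python
import hashlib
from email import message_from_string
from typing import Optional


def message_id_hash(raw_message: str, strip_brackets: bool = True) -> Optional[str]:
    """Return an MD5 hex digest of the Message-ID header, or None if unusable.

    Assumption: the hashed string is the Message-ID with surrounding
    whitespace (and, optionally, the angle brackets) removed.
    """
    message = message_from_string(raw_message)
    message_id = message.get("Message-ID")
    if message_id is None:
        return None  # no Message-ID header, so no hash can be produced
    value = str(message_id).strip()
    if strip_brackets and value.startswith("<") and value.endswith(">"):
        value = value[1:-1]
    if not value:
        return None
    try:
        data = value.encode("ascii")
    except UnicodeEncodeError:
        return None  # non-ASCII Message-IDs are outside this sketch
    return hashlib.md5(data).hexdigest()
```

Because the input is just the header string, two platforms that apply the same rule to the same email should arrive at the same 32-character value, whatever else they do differently during processing.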

That way, no matter what format a data set is in, your discovery platform can import the list of MIHs from a load file and compare it against the MIHs in your existing data as part of a process for identifying duplicates.
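
The comparison step can be sketched in a few lines of Python. This assumes, purely for illustration, that the load file is comma-separated and carries the hashes in a column named MIH; real load files vary in delimiter and field names, and the function names here are hypothetical.

```python
import csv
from typing import Iterable, Set


def read_mihs(load_file_lines: Iterable[str], column: str = "MIH") -> Set[str]:
    """Collect the MIH values from a delimited load file.

    Assumption: comma-separated, with a header row containing `column`.
    """
    reader = csv.DictReader(load_file_lines)
    mihs = set()
    for row in reader:
        value = (row.get(column) or "").strip().lower()
        if value:  # skip records that have no MIH
            mihs.add(value)
    return mihs


def find_already_seen(incoming: Set[str], existing: Set[str]) -> Set[str]:
    """Return the MIHs that appear in both the production and existing data."""
    return incoming & existing
```

Records whose MIH falls in the returned set are candidates for duplicates; records with no MIH at all would still need one of the traditional methods.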

There are, of course, caveats and complications. The specification lists a range of circumstances where email messages don’t have compliant Message-IDs and, as a consequence, you can’t use MIHs to identify duplicates. But these are edge cases to keep in mind rather than deal-breakers, since the specification will work just fine for most emails and situations.

Coming soon to EDT

EDT is working towards implementing the specification in our SaaS solution by the end of 2023. (And in the meantime, the EDRM Duplicate Identification Project web page provides a workaround using a Microsoft Excel spreadsheet to generate MIH values.)

Simplicity is elegant. One of the problems with traditional approaches to email duplicate detection is that the algorithms have become over-engineered. Vendors have done this with the aim of increasing the accuracy with which they detect duplicates, but in reality it narrows the net and often delivers suboptimal results.

The elegant approach that underpins the new specification is much more useful. And at EDT we are already starting to think about how it can be applied to many more scenarios beyond the main use case for which it was developed. Watch this space!