SigParser uses ML.NET to detect "non-human" emails

Customer
SigParser

Products & services
ML.NET
Office 365

Industry
Software / Telecommunications

Organization Size
Small (<100 employees)

Country/region
United States

SigParser is an API and service that automates the tedious (and often expensive) process of adding to and maintaining customer relationship management (CRM) systems. SigParser extracts contact information, such as names, email addresses, and phone numbers, from email signatures and feeds all that information as contacts into CRM systems or databases.

The SigParser application lets you provide a sample email and preview the metadata it is able to determine about the email.

Business problem

When SigParser processes emails for a company, many of the emails are non-human (for example, newsletters, payment notifications, password resets, and so on). The sender's information from these types of emails should not show up in contact lists or be pushed into a CRM system. Thus, SigParser decided to use machine learning to predict if email messages are "spammy looking."

Take the following notification email from a forum as an example. The sender of this email isn't a contact that should show up in a CRM, so a machine learning model predicts that "isSpammyLookingEmailMessage" is true:

The sample email comes from a noreply email address and contains generated information, such as a summary of unread notifications.

SigParser classifies the sample email as a 'spammy looking email message', using their ML.NET model

Why ML.NET?

When the team at SigParser decided to utilize machine learning, they originally tried using R; however, they found it was very difficult to maintain and integrate with their API, which is built with .NET Core.

Paul Mendoza, CEO and founder of SigParser, said that R "was just too disconnected from the development process. With R, we were generating all the constants and then we would copy and paste those into .NET and then try the model out for real and learn it didn't quite work and have to repeat. This was too slow."

Thus, they turned to ML.NET to bring everything into one application.

"With ML.NET, we're able to train the model and then immediately test it inside of our code. This makes shipping new changes faster because all the tooling was together in one place."

Paul Mendoza, CEO and founder of SigParser

Impact of ML.NET

Moving from R to ML.NET has brought a 10x productivity improvement. Additionally, until SigParser moved to ML.NET, they used only one machine learning model. Since the conversion, they have six machine learning models for various aspects of email parsing. This increase came about because ML.NET makes it possible to experiment quickly with new machine learning ideas and show the results in the application.

Solution architecture

Data processing

SigParser first used the well-known Enron dataset to train their model. When they realized that it was quite outdated, they instead labeled a couple thousand emails from their own email accounts (in keeping with GDPR compliance) as either human or non-human and used these as the training dataset.

Machine learning features

SigParser's ML.NET model has two Features (used to make the prediction "IsHumanEmail"):

  • HasUnsubscribes: True if an email has an "unsubscribe" or "opt out" in the email body
  • EmailBodyCleaned: The HTML email body, normalized to make the email language agnostic and to remove any personally identifiable information
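
The two Features above could be computed along the following lines. This is only a minimal sketch and not SigParser's actual implementation: the EmailFeatures class, its method names, the regular expressions, and the TEXT placeholder are all assumptions made for illustration. It assumes that "language agnostic" means keeping the HTML tag structure while replacing every run of visible text (and dropping all attributes) so that no words or personal details remain.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper; names and rules are illustrative assumptions.
public static class EmailFeatures
{
    // Assumed wording variants: "unsubscribe", "opt out", "opt-out", "optout".
    private static readonly Regex UnsubscribePattern =
        new Regex(@"unsubscribe|opt[\s-]?out", RegexOptions.IgnoreCase);

    // Matches an opening or closing tag; group 1 is "/" for closing tags,
    // group 2 is the tag name. Attributes are matched but discarded.
    private static readonly Regex TagPattern =
        new Regex(@"<(/?)([a-zA-Z][a-zA-Z0-9]*)\b[^>]*>");

    // True if the raw body mentions unsubscribing or opting out.
    public static bool HasUnsubscribes(string body) =>
        !string.IsNullOrEmpty(body) && UnsubscribePattern.IsMatch(body);

    // Keeps tag names only and replaces each run of text with "TEXT".
    // Simplification: comments and script contents are also treated as text.
    public static string CleanBody(string htmlBody)
    {
        if (string.IsNullOrEmpty(htmlBody)) return "";

        var tokens = new List<string>();
        int position = 0;
        foreach (Match tag in TagPattern.Matches(htmlBody))
        {
            AddTextPlaceholder(tokens, htmlBody.Substring(position, tag.Index - position));
            tokens.Add("<" + tag.Groups[1].Value + tag.Groups[2].Value.ToLowerInvariant() + ">");
            position = tag.Index + tag.Length;
        }
        AddTextPlaceholder(tokens, htmlBody.Substring(position));
        return string.Join(" ", tokens);
    }

    private static void AddTextPlaceholder(List<string> tokens, string text)
    {
        if (!string.IsNullOrWhiteSpace(text)) tokens.Add("TEXT");
    }
}
```

Under these assumptions, the body `<p>Hi Bob</p><a href="mailto:bob@example.com">Unsubscribe</a>` yields HasUnsubscribes = true and a cleaned body of `<p> TEXT </p> <a> TEXT </a>`. The cleaned string carries only the layout of the message, which is one plausible way to make the feature independent of the language the email was written in.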

Machine learning algorithm

These two Features are fed into a Binary FastTree algorithm, which is an algorithm for classification scenarios, and the output is the prediction of whether the email was sent from a "real human" or from an automated source. Currently, SigParser is processing millions of emails per month with this ML.NET model.


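// Note: this snippet is written against a preview (pre-1.0) release of ML.NET;
// several of these method and parameter names changed in later releases.
using System;
using System.IO;
using Microsoft.ML;

var mlContext = new MLContext();

// Hold out 20% of the labeled emails for evaluation.
var (trainData, testData) = mlContext.BinaryClassification.TrainTestSplit(mlContext.CreateStreamingDataView(totalSampleSet), testFraction: 0.2);

// Featurize the cleaned body text into EmailHTMLFeaturized, combine it with
// HasUnsubscribes into a single Features column, and train a FastTree classifier.
var pipeline = mlContext.Transforms.Text.FeaturizeText("EmailBodyCleaned", "EmailHTMLFeaturized")
    .Append(mlContext.Transforms.Concatenate("Features", "HasUnsubscribes", "EmailHTMLFeaturized"))
    .Append(mlContext.BinaryClassification.Trainers.FastTree(labelColumn: "IsHumanEmail", featureColumn: "Features"));

Console.WriteLine("Fitting data");
var fitResult = pipeline.Fit(trainData);

// Score the held-out emails and report accuracy.
Console.WriteLine("Evaluating metrics");
var metrics = mlContext.BinaryClassification.Evaluate(fitResult.Transform(testData), label: "IsHumanEmail");
Console.WriteLine("Accuracy: " + metrics.Accuracy);

// Save the trained model so the application can load it.
using (var stream = File.Create(emailParsingPath + "EmailHTMLTypeClassifier.zip"))
{
    mlContext.Model.Save(fitResult, stream);
}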

SigParser uses ML.NET's data transformations and algorithms for multiple machine learning solutions, including the spam detection model described above. These models have enabled SigParser to automatically export the correct contact information from email signatures to customer databases, bypassing the need for time-consuming and error-prone manual contact data entry.
