Content Examination Definitions: Word / Phrase Match List Examples

Document created by user.oxriBaJeN4 Employee on Sep 21, 2015
Last modified by user.oxriBaJeN4 Employee on Apr 2, 2019
Version 52

A content expression is the combination of functions, text, and phrases used in a Content Examination definition to assist with Data Leak Prevention (DLP). A content expression can be a word, a phrase, a regular expression, or a hash that matches a specific document. The content expression is entered into the Content Examination definition, which scans messages for anything that matches the search criteria.


These search terms are used to prevent accidental or malicious data loss through company email. Depending on your requirements, some or all of the examples below may help make your email system comply with your company's security policy.

Using Content Expressions


Content expressions are entered into the "Word / Phrase Match List" field in a Content Examination definition. The activation score is a required field. It is the total score that the matches found in a message must reach for the definition to trigger.

Test Content Examination definitions before relying on them, to confirm that they produce the required results.

Search Parameters (*) | Example
weight [:maxscore] <search text> | 4 “Company Confidential”
weight [:maxscore] required <search text> | 1 required “Project X”
weight [:maxscore] exclude <search text> | 1 exclude “Tax exemption”
weight [:maxscore] regex <regular expression> | 10 regex 4[0-9]{12}(?:[0-9]{3})?
weight [:maxscore] regex,cardnumber <regular expression> | 1 regex,cardnumber 6(?<=\b6)(767|334)(?!\n\t)(\d{12,15}|[\d- ]{16,19})\b
weight [:maxscore] hash <MD5#> | 1 hash 9EBD30E761ED4FF770A90DDBD5CB4190 Confidential.PDF


Search Parameter Notation


Search Parameter Type | Example Parameters | Description
Mandatory | weight required exclude regex hash | One or more of these parameters must be included in the search parameters, according to the type of content expression that is being created. At a minimum, the weight must be defined.
[Optional] | [:maxscore] cardnumber | Parameters in square brackets [ ] are optional. "cardnumber" runs the matched digits through a Luhn check to determine whether the sequence is a valid credit card number.
<To Be Modified> | <search text> <regular expression> <MD5#> | Parameters in angle brackets < > are placeholders and must be replaced with your own values to build the required content expression.


Usage Examples


Words and Phrases


Word / Phrase Combination


Activation Score = 3

Word / Phrase Match List:
1 “financial affairs”
4 confidential
2 “company confidential information”

Email Content Required to Trigger Definition:
3 occurrences of ‘financial affairs’ OR
1 occurrence of ‘confidential’ OR
2 occurrences of ‘company confidential information’ OR
Any combination of the words and phrases, so that the cumulative score matches or exceeds 3 points.

If you use the "regex" prefix followed by a search word or phrase wrapped in quotation marks ( " " ), the quotation marks are treated as part of the search. Matches are only found where the word or phrase appears in the message together with the quotation marks.
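
The cumulative scoring described above can be modelled in a few lines of Python. This is only an illustrative sketch, not Mimecast's matching engine: the `score_message` function, the case-insensitive substring counting, and the sample messages are all assumptions made for the example.

```python
# Illustrative model of weighted scoring; not Mimecast's actual engine.
# Assumption: every occurrence of a term adds that term's weight to the total.

def score_message(text, weighted_terms):
    """Return the cumulative score of all term occurrences found in text."""
    lowered = text.lower()
    return sum(weight * lowered.count(term.lower())
               for term, weight in weighted_terms.items())

ACTIVATION_SCORE = 3
TERMS = {
    "financial affairs": 1,
    "confidential": 4,
    # Note: this naive model would also count "confidential" inside this phrase.
    "company confidential information": 2,
}

one_hit = "Please review his financial affairs."
strong_hit = "This report is confidential."

print(score_message(one_hit, TERMS) >= ACTIVATION_SCORE)     # one weight-1 hit is below 3
print(score_message(strong_hit, TERMS) >= ACTIVATION_SCORE)  # one weight-4 hit reaches 3
```

A single weight-4 term is enough on its own, while a weight-1 phrase has to appear three times (or be combined with other terms) to reach the activation score.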




Words / Phrases Combination With Comment


Activation Score = 10

Word / Phrase Match List:
# this is a comment, and is ignored
1 “social security”
1 private
8 “personal security code”

Email Content Required to Trigger Definition:
10 occurrences of ‘social security’ OR
10 occurrences of ‘private’ OR
2 occurrences of ‘personal security code’ OR
Any combination of the search terms, so that the total cumulative score is 10 or more.

A search for "personal security code" only triggers the policy if these words appear in the message in exactly this order.

Multiple Words


Activation score = 1

Word / Phrase Match List:
1 “Product A” “Product B”
1 “Product C” “Product D”

Email Content Required to Trigger Definition:
1 occurrence of ‘Product A’ OR ‘Product B’ OR
1 occurrence of ‘Product C’ OR ‘Product D’

Word / Phrase Match List:
1:10 notification

Email Content Required to Trigger Definition:
As long as the 'match multiple words' option is checked in the content definition, the definition matches a maximum of 10 occurrences of the word 'notification' in the email.
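
One way to read the `weight:maxscore` notation is as a cap on how much a single line can contribute to the total. The sketch below assumes exactly that reading; the `line_score` helper is hypothetical and is not part of any Mimecast API.

```python
# Hypothetical helper: models "weight:maxscore" as a cap on a line's contribution.

def line_score(occurrences, weight, maxscore=None):
    """Score contributed by one match-list line, capped at maxscore if given."""
    score = occurrences * weight
    return score if maxscore is None else min(score, maxscore)

# "1:10 notification" with 25 occurrences of 'notification' in the email
print(line_score(25, weight=1, maxscore=10))
# The same line without a cap
print(line_score(25, weight=1))
```

Under this reading, occurrences beyond the tenth add nothing to the score, so a frequently repeated word cannot trigger the definition on its own when the activation score is higher than the cap.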


Wildcard Characters


Activation score = 6

Word / Phrase Match List:
3 car*
3 capitali?e

Email Content Required to Trigger Definition:
2 occurrences of words beginning with ‘car’, e.g. cars, carlton, carter, carbuncle, etc. OR
2 occurrences of ‘capitalize’ or ‘capitalise’
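
To make the wildcard behaviour concrete, the following sketch translates the two terms above into regular expressions. It assumes that `*` stands for any run of word characters and `?` for exactly one character, and that matching is case-insensitive; the `wildcard_to_regex` function is a hypothetical helper, not Mimecast's implementation.

```python
import re

def wildcard_to_regex(term):
    """Translate a wildcard term into a compiled, case-insensitive regex.

    Assumption: '*' matches any run of word characters and '?' matches
    exactly one word character; the term must sit on word boundaries.
    """
    parts = []
    for ch in term:
        if ch == "*":
            parts.append(r"\w*")
        elif ch == "?":
            parts.append(r"\w")
        else:
            parts.append(re.escape(ch))
    return re.compile(r"\b" + "".join(parts) + r"\b", re.IGNORECASE)

car = wildcard_to_regex("car*")
capitalise = wildcard_to_regex("capitali?e")

print(car.findall("The cars were parked by Carter"))
print(capitalise.findall("capitalize or capitalise"))
```

With a weight of 3 per line and an activation score of 6, two hits from either pattern (or one from each) would be enough to trigger the definition.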


Boolean Operators


You can use Boolean operators (AND / OR) to reduce the number of false positives. With AND, both search terms must exist in a message before a match is made. With OR, either search term is enough.


Text Strings and Regular Expression


The following search terms require the phrase 'Credit Card' along with a valid credit card number before a content hit is found. If only one of the specified search terms is found, no content hit is logged.

Word / Phrase Match List:
1 ("Credit Card") AND (regex,cardnumber 4(?<=\b(?<!\.)4)\d{3}[\W\s]?\d{4}[\W\s]?\d{4}[\W\s]?\d{4}\b)

Email Content Required to Trigger Definition:
Credit Card 4111 1111 1111 1111

Each search term must be wrapped in brackets ( ) and followed by the AND operator and your next search term. Phrases must be wrapped in brackets and quotation marks, for example ("Credit Card"). Only one AND operator can be used per line in the Word / Phrase Match List.

Multiple Entities


The following triggers a match when both entities are found:

1 (detect SIN) AND (detect Names)


Either One of Two Entities


The following triggers a match when either one of the two entities is found:

1 (detect drivers_license_us) OR (detect drivers_license_uk)




Required / Exclude Operators


Activation score = 4

Word / Phrase Match List:
0 required “Project Alpha”
5 “budget overrun”
4 urgent
2 update

Email Content Required to Trigger Definition:
The phrase ‘Project Alpha’ MUST be present AND
1 occurrence of ‘budget overrun’ OR
1 occurrence of ‘urgent’ OR
2 occurrences of ‘update’

Word / Phrase Match List:
0 exclude “Project Delta”
5 “budget overrun”
4 urgent
2 update

Email Content Required to Trigger Definition:
The phrase ‘Project Delta’ MUST NOT be present AND
1 occurrence of ‘budget overrun’ OR
1 occurrence of ‘urgent’ OR
2 occurrences of ‘update’

The Required / Exclude operators must be used in the first line of the Content Examination definition.
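
The way a required or excluded phrase gates the score can be sketched as follows. This is an illustrative model under stated assumptions (case-insensitive substring matching, a hypothetical `triggers` function), not a description of how Mimecast evaluates definitions internally.

```python
# Illustrative model: a required/excluded phrase acts as a gate before scoring.

def triggers(text, activation_score, weighted_terms, required=None, exclude=None):
    """Return True if the definition would trigger under this simplified model."""
    lowered = text.lower()
    if required is not None and required.lower() not in lowered:
        return False  # the required phrase is missing
    if exclude is not None and exclude.lower() in lowered:
        return False  # the excluded phrase is present
    score = sum(weight * lowered.count(term.lower())
                for term, weight in weighted_terms.items())
    return score >= activation_score

TERMS = {"budget overrun": 5, "urgent": 4, "update": 2}

print(triggers("Project Alpha: urgent review", 4, TERMS, required="Project Alpha"))
print(triggers("urgent review", 4, TERMS, required="Project Alpha"))
print(triggers("Project Delta update, second update", 4, TERMS, exclude="Project Delta"))
```

The required and excluded phrases carry a weight of 0 in the examples above, so they never add to the score; they only decide whether scoring is considered at all.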

Proximity Operators


The proximity operator allows administrators to specify that a policy triggers only when a search term is found within a specified number of characters of another search term. If no proximity value is specified, the default value for a match is 300 characters. If a match is found outside of the specified proximity distance, no content match is triggered.


When using the proximity operator, consider the following:

  • There is no limit on the distance that can be specified by the proximity operator.
  • Blank spaces count as characters.
  • Mimecast Managed Reference Dictionaries (MMRDs) and Reference Dictionaries can be used in conjunction with the proximity operator.
  • The proximity value is calculated by the number of characters from the end of the first search word / phrase, until the first character of the second search word / phrase (including blank spaces and special characters).


Proximity Example


Take the following proximity example:

Here is my credit card number for you to use 1234 5678 1234 5678.

The proximity value is 17, as there are 17 characters from the end of the phrase "credit card number" up to and including the first character of the credit card number "1234 5678 1234 5678".


Word / Phrase Match List:
1 ("Credit Card Number") Proximity:17 (regex,cardnumber 4(?<=\b(?<!\.)4)\d{3}[\W\s]?\d{4}[\W\s]?\d{4}[\W\s]?\d{4}\b)

Email Content Required to Trigger Definition:

The following text would cause the policy to trigger as both Term 1 and Term 2 meet the configured requirements:

Here is my credit card number for you to use 4111 1111 1111 1111.

A content match also occurs if Term 1 is fewer than 17 characters from Term 2, because the proximity value is a maximum distance rather than an exact one. The following text would not match, as Term 1 and Term 2 are more than 17 characters apart:

Here is my credit card number for you to use to book the hotel  4111 1111 1111 1111.
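
The distance calculation can be checked with a short script. This sketch assumes the count runs from the end of the first term up to and including the first character of the second term, which is the reading that gives 17 for the example above. The `proximity` function is hypothetical, and the card pattern is a simplified stand-in for the article's regular expression.

```python
import re

# Simplified Visa-style pattern used only for this sketch.
CARD = re.compile(r"\b4\d{3}[ -]?\d{4}[ -]?\d{4}[ -]?\d{4}\b")

def proximity(text, phrase, pattern):
    """Characters from the end of phrase up to and including the first
    character of the pattern match, or None if either term is missing."""
    start = text.lower().find(phrase.lower())
    if start == -1:
        return None
    phrase_end = start + len(phrase)
    match = pattern.search(text, phrase_end)
    if match is None:
        return None
    return match.start() - phrase_end + 1

near = "Here is my credit card number for you to use 4111 1111 1111 1111."
far = "Here is my credit card number for you to use to book the hotel 4111 1111 1111 1111."

LIMIT = 17
print(proximity(near, "Credit Card Number", CARD) <= LIMIT)  # within range
print(proximity(far, "Credit Card Number", CARD) <= LIMIT)   # too far apart
```

Spaces count towards the distance, so even small changes in the wording between the two terms can move a message in or out of range.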


Content Reference Dictionaries


Content reference dictionaries are added from the Insert menu inside a Content Examination Definition. For more information on Reference Dictionaries, view the full page.


Activation score = 3

Word / Phrase Match List:

#ref 545 Social Security Number

#ref 276 Common Medical Terms

Email Content Required to Trigger Definition:

Words and phrases contained in the ‘Social Security Number’ OR ‘Common Medical Terms’ reference dictionaries must be present, in a combination so that their aggregate scores add up to 3 or more. Entries in the dictionaries are individually weighted, or have a default weighting of 1.


These reference dictionaries must be pre-created. The Word / Phrase match list does not auto-populate the entire list of criteria in the dictionary. You have to refer to the original reference dictionary to examine its contents. Mimecast provides Managed Reference Dictionaries for only credit card numbers and profanity lists by default.


Ignoring Terms in a Custom Reference Dictionary


It's possible to use custom reference dictionaries with the ignore operator. This allows you to create a single list of terms that you do not want to find content matches for, rather than entering each term separately. Here's an example syntax:

1 (detect Names) IGNORE (ref 879 Ignored Names)


MD5 Hash Checksum


Checksums are generated from the Insert menu inside a Content Examination Definition. 


Activation score = 1

Word / Phrase Match List:
#1 hash 00C9443961BC9131FE96D580AB29CE59 mimecast.xls - (30208  Bytes)

Email Content Required to Trigger Definition:
The ‘mimecast.xls’ file must be present in the email.


Regular Expressions (Text Matches)


Activation score = 4

Word / Phrase Match List:
1 regex [bcr]at
10 regex D(?:OB|ate of birth)(?:[\W\D\S])?\s?\d{1,2}(?:\/|\.)\d{1,2}(?:\/|\.)\d{2,4}\b

Email Content Required to Trigger Definition:
4 occurrences of words ending in ‘at’, such as ‘bat’, ‘rat’, ‘cat’ etc. OR
1 occurrence of a reference to a date of birth in the email with the format: Date of birth: MM|DD/MM|DD/YY|YYYY

For more Date Formats:
Month Day Year: (([0-1][0-2])|([0][1-9]|[1][1-2]))(\.| |/|-)(([0-2][0-9])|([3][0-1]))(\.| |/|-)(([1][9][0-9][0-9])|([2][0][0-9][0-9])|(\d{2}))
Day Month Year: (([0-2][0-9])|([3][0-1]))(\.| |/|-)(([0-1][0-2])|([0][1-9]|[1][1-2]))(\.| |/|-)(([1][9][0-9][0-9])|([2][0][0-9][0-9])|(\d{2}))
Year Month Day: (([1][9][0-9][0-9])|([2][0][0-9][0-9])|(\d{2}))(\.| |/|-)(([0-1][0-2])|([0][1-9]|[1][1-2]))(\.| |/|-)(([0-2][0-9])|([3][0-1]))
Year Day Month: (([1][9][0-9][0-9])|([2][0][0-9][0-9])|(\d{2}))(\.| |/|-)(([0-2][0-9])|([3][0-1]))(\.| |/|-)(([0-1][0-2])|([0][1-9]|[1][1-2]))

Word / Phrase Match List:
5 regex (pre-|post-)operative

Email Content Required to Trigger Definition:
1 occurrence of ‘pre-operative’ OR ‘post-operative’
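
Before adding a regular expression to a definition, it can help to try it against sample text. The sketch below runs the first two expressions from this section through Python's `re` module. Python's regex engine is used here as a stand-in and is an assumption; Mimecast's engine may differ in details, so treat the result as a rough check only. The sample text is made up.

```python
import re

# Patterns copied from the match list above.
AT_WORDS = re.compile(r"[bcr]at")
DOB = re.compile(
    r"D(?:OB|ate of birth)(?:[\W\D\S])?\s?\d{1,2}(?:\/|\.)\d{1,2}(?:\/|\.)\d{2,4}\b"
)

sample = "The cat chased a rat. Date of birth: 01/02/1990"

at_hits = AT_WORDS.findall(sample)
dob_hit = DOB.search(sample)

# Weights from the example: 1 per '[bcr]at' hit, 10 per date-of-birth hit.
score = 1 * len(at_hits) + (10 if dob_hit else 0)
print(at_hits, dob_hit.group(0) if dob_hit else None, score >= 4)
```

Note that `[bcr]at` is not anchored to word boundaries, so in this engine it also matches inside longer words such as "category". Adding `\b` on either side would restrict it to whole words.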




Validators


Validators are used to identify whether specific pieces of content found in messages or attachments are legitimate (e.g. credit cards, social security numbers, etc.). These numbers are validated using checksum algorithms, such as the Luhn algorithm, that are defined by the issuer for confirmation purposes.

Using validators greatly reduces the number of false positives that can occur when matching on regular expressions alone, because a candidate number only counts as a match if it also passes the validator's check.

Below is a list of all of the validators currently supported by Content Examination policies.


Activation Score = 1

ABA Number (American Banking Association Number)
  • Word / Phrase Match List: 1 regex,aba (\d){9}
  • Email Content Required to Trigger Definition: A valid ABA number (e.g. 021000322)

CHI Number (Community Health Index Number)
  • Word / Phrase Match List: 1 regex,chinumber (([^\w\t]?\s)?\d){10}
  • Email Content Required to Trigger Definition: A valid CHI Number (e.g. 3110130017)

CardNumber (Credit Card Numbers)
Includes the following card types: Visa, Mastercard, American Express (Amex), Diners Club, Discover Card, JCB Cards
  • Word / Phrase Match List: 1 regex,cardnumber 4(?<=\b(?<!\.)4)\d{3}[\W\s]?\d{4}[\W\s]?\d{4}[\W\s]?\d{4}\b
  • Email Content Required to Trigger Definition: A valid Credit Card Number (e.g. a valid Visa Credit Card number of 4444333322221111)
  • The regular expression example provided above only works for Visa cards. The validator portion works for the other card types, but the regex does not.

Email Address
  • Word / Phrase Match List: 1 regex,email (\w+[@\.]\s*\w+\.*\w+)
  • Email Content Required to Trigger Definition: A valid Email Address

IBAN Number (International Bank Account Number)
  • Word / Phrase Match List: 1 regex,iban GB\s(\d){2}\sBARC(\s*\d){14}
  • Email Content Required to Trigger Definition: A valid IBAN number (e.g. GB 37 BARC 2004 1538 2900 08)

NI Number (UK National Insurance Number)
  • Word / Phrase Match List: 1 regex,nin \s*[a-zA-Z]{2}(?:\s*\d\s*){6}[a-zA-Z]?\s*
  • Email Content Required to Trigger Definition: A valid National Insurance Number (e.g. BN102966C)

NHS Number (National Health Service Number)
  • Word / Phrase Match List: 1 regex,nhsnumber (([^\w\t]?\s)?(-)?\d){10}
  • Email Content Required to Trigger Definition: A valid NHS Number (Hyphen Separated), e.g. 499-999-9994
  • Word / Phrase Match List: 1 regex,nhsnumber (([^\w\t]?\s)?(_)?\d){10}
  • Email Content Required to Trigger Definition: A valid NHS Number, e.g. 499 999 9994

NPI Number (Health Identification Card Number)
  • Word / Phrase Match List: 1 regex,npi (?<!\d)\d{10}(?!\d)|80840\d{10}(?!\d)
  • Email Content Required to Trigger Definition: A valid Standard Health Identification Card Number (e.g. 808401234567893 or 1234567893)

MOD10 (Modulus 10, used to validate Canadian Health and US Postal Service PIC numbers)
  • Word / Phrase Match List: 1 regex,mod10 (\d){9}
  • Email Content Required to Trigger Definition: A valid Canada Health Number or US Postal Service PIC number consisting of 9 digits.
  • Canada Health Number 9876543217
  • US Postal Service PIC Number 12345678

SIN Number (Canadian Social Insurance)
  • Word / Phrase Match List: 1 regex,sin (\d){9}
  • Email Content Required to Trigger Definition: A valid number (e.g. 046 454 286)

SSN (US Social Security Number)
  • Word / Phrase Match List: 1 regex,ssn ([^0-9-]|^)([0-9]{3}-[0-9]{2}-[0-9]{4})([^0-9-]|$)
  • Email Content Required to Trigger Definition: A valid US Social Security Number (e.g. 078-05-1125)

PhoneNumber+Region
Can be used to minimise false positives or to target specific regions. Includes support for the following regions: AU, UK, US
  • Word / Phrase Match List: 1 regex,phoneNumberAU (\+?)\d{1,3}(\s)?(\(\d{1}\))?[\s\d-]+
  • Email Content Required to Trigger Definition: A valid AU based telephone number (e.g. 1300 307 318)
  • Word / Phrase Match List: 1 regex,phoneNumber
  • Email Content Required to Trigger Definition: A valid telephone number, including country code (e.g. +44 (0)20 7847 8700)

PostCode+Region (postcodeau)
Includes support for the following regions: AU, CA, UK, US, ZA
  • Word / Phrase Match List: 1 regex,postalcodeau [0-9]{4}
  • Email Content Required to Trigger Definition: A valid AU Post Code (e.g. NSW 2060)

VIN Number (Vehicle Identification Number)
  • Word / Phrase Match List: 1 regex,vin [0-9A-HJ-NPR-Z]{17}
  • Email Content Required to Trigger Definition: A valid VIN number (e.g. 1M8GDM9AXKP042788)

Regular expressions used in the examples above are for illustration purposes only and may not catch all possible examples.
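
For readers unfamiliar with the Luhn check mentioned above, here is a minimal sketch of the standard algorithm, applied to two of the example numbers from the validator list. The `luhn_valid` function is written for this illustration only. Whether each Mimecast validator uses exactly this check is not confirmed here; some identifier types are known to use other checksum schemes.

```python
def luhn_valid(number):
    """Return True if the digits in number pass the standard Luhn check.

    Non-digit characters such as spaces and hyphens are ignored.
    """
    digits = [int(ch) for ch in number if ch.isdigit()]
    if not digits:
        return False
    total = 0
    # Walk from the rightmost digit; double every second digit.
    for position, digit in enumerate(reversed(digits)):
        if position % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0

print(luhn_valid("4444333322221111"))  # Visa example from the list above
print(luhn_valid("046 454 286"))       # SIN example from the list above
print(luhn_valid("4444333322221112"))  # last digit changed
```

A random 16-digit sequence passes this check only about one time in ten, which is why pairing a regular expression with a validator cuts down false positives.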

Fuzzy Hashes


Fuzzy Hashing can be used to limit the flow of sensitive information out of your organization by matching similarities in text content between a Control Document and email attachments passing through your Mimecast service.


Activation score = 1


Fuzzy Hashes are generated via the Generate Fuzzy Hash button found on the Content Definitions page, and inserted using the Insert menu in Content Examination definitions.


There are two types of Fuzzy Hashes that can be used within Content Examination policies:

  • SSDEEP - This hash type uses the binary information of the document to generate the hash.
  • MFH - This hash type uses the text contained within the control file to generate the hash.


Word / Phrase Match List:
1 mfh 1.WBF37rr18toZBKcQ1nxmaM9wBWlAPVtUpWOTL5FLLT+cEmnKw....

Email Content Required to Trigger Definition:
An email attachment containing similar content.
The content must be at least 75% similar for a match to be found.

Further information relating to the configuration and usage of Fuzzy Hashes can be found in Using Fuzzy Hashing with a Content Examination Definition.

Entity Examples

The use of certain entities will require the Healthcare Dictionaries package to be enabled on your Mimecast account. The following entities are exceptions to this, and no longer require the Healthcare pack to be used:

  • National Insurance Number (NIN)
  • Community Health Index Number (CHI)
  • Vehicle Identification Number (VIN)
  • UK Electoral Roll Number
  • Credit Cards
  • National Health Service Number (NHS)
  • UK Driver’s License
  • Date of Birth
  • IP Address
  • URL
  • Telephone Numbers
  • Email Address
  • Passports
  • IBAN
  • Date

Single Entity


You wish to hold all messages containing references to credit card numbers. The "creditcard" entity finds all credit card numbers, regardless of the credit card type. For example, the following would match any credit card number found in the specified areas of an email (header, body, attachment).

1 detect creditcard


Multiple Entities with Operators


You want to hold messages that contain a piece of PII (Personally Identifiable Information) and a date of birth that are within a specific distance of each other.

1 (detect aba) Proximity:50 (detect date_dob)

This criterion matches when an ABA number and a Date of Birth (DOB) are found within 50 characters of each other.

The "Proximity" operator has a default distance of 300 characters. Specifying a number after Proximity overrides the default distance.

Ignoring Terms from Entities


Occasionally specific terms in an entity can cause too many false positives to occur. To prevent this you can ignore terms. This allows you to continue using an entity to look for content matches. Take the following example:


You're checking for FDA drug names in proximity to a person's name, but you wish to exclude the name "Susan" as it also matches the name of one of your customers. Here is the syntax you'd use to ignore the name "Susan":

1 (detect fdadrugs) PROXIMITY (detect names) IGNORE (Susan)

If you wish to ignore the term "Aspirin" from the FDA Drugs entity:

1 (detect Names) PROXIMITY (detect fdadrugs) IGNORE (Aspirin)

The 'ignore' only applies to the second entity specified on a line. No other operators will be allowed once the first operator has been specified.

To ignore multiple terms, you'll need to separate each term with a space as below:
1 (detect fdadrugs) PROXIMITY (detect names) IGNORE (Susan Bob James)
If you're using multiple entities in a search, the 'Ignore' operator only applies to the entity specified at the end of the search. For example:
1 (detect fdadrugs) PROXIMITY (detect names) IGNORE (Susan)  -  correct syntax
1 (detect fdadrugs) IGNORE (Aspirin) PROXIMITY (detect names)  -  incorrect syntax

Excluding Entities


An administrator is using the PII entity group. They are seeing a high number of false positives with the "Phone Number" entity, and would like to check if excluding this resolves the problem. To do this they can:

  1. Remove the entry for the PII entity group from the Content Examination definition.
  2. Enter all the individual entities that they wish to use.


Original Word / Phrase Match List:

1 detect PII

New Word / Phrase Match List:

1 detect Names
1 detect date_dob
1 detect SSN
1 detect medicare_id
1 detect VIN
1 detect IP
1 detect Email
1 detect URL


Negative Score


Using the same example of the PII entity group, where the administrator wants to exclude the "Phone Number" entity, a negative score can be applied to that entity instead. If a match is found for the search term, the negative score is added to the running total. This reduces the overall score, and with it the chance of the Content Examination policy being applied. To do this, the administrator can:

  1. Leave the PII entity group in the Content Examination definition.
  2. Apply a negative score for the "Phone Number" entity.


Original Word / Phrase Match List:

1 detect PII

New Word / Phrase Match List:

1 detect PII
-1 detect PhoneNumber
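
The arithmetic behind the negative score can be illustrated as follows. The sketch assumes that a phone number is counted once by the PII group line and once by the PhoneNumber line, so the two cancel out. The `total_score` function and the hit counts are hypothetical.

```python
# Hypothetical model: each line's weight is multiplied by its number of hits.

def total_score(hits_by_line, weights):
    """Sum weight * hits over every match-list line that has a weight."""
    return sum(weights[name] * count
               for name, count in hits_by_line.items()
               if name in weights)

WEIGHTS = {"PII": 1, "PhoneNumber": -1}

# A message containing two phone numbers and one other PII item:
# the PII group sees 3 hits, the PhoneNumber line sees 2.
print(total_score({"PII": 3, "PhoneNumber": 2}, WEIGHTS))

# A message containing only two phone numbers nets out to zero.
print(total_score({"PII": 2, "PhoneNumber": 2}, WEIGHTS))
```

Under this assumption, phone numbers alone can no longer reach the activation score, while other PII items still count in full.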


No Keywords


This operator disables the context keyword matching associated with many of the entities we support. Using this operator increases the likelihood of false positives occurring, but simplifies whether or not a match is likely to be found. For example, using the NKW operator causes the checks for terms associated with social security numbers to be ignored, and the check will only look for a regular expression match:

1 detect SSN_NKW


Adding Entities


There are reports of Canadian Social Insurance Numbers (SIN) being present in messages sent externally. On investigation, it is discovered that the SIN entity is not present in the policy. To resolve this, the administrator can:

  1. Add the SIN entity to the Content Examination definition.
1 detect SIN

Combining Entities and Phrases


You want to search for the term "Admission Date" followed by a date in Month / Day / Year format. This can be achieved by using the following policy syntax.

1 ("Admission Date") Proximity (detect date_mdy)

  • 1 is the line score applied when a match is found.
  • ("Admission Date") is the first check performed. This must be in brackets to mark the boundaries of the search text.
  • Proximity is the operator. In this case the phrase "Admission Date" needs to be within 300 characters of a date in Month/Day/Year format.
  • (detect date_mdy) is the second check performed. Again this must be in brackets to mark the boundaries of the search term.

ICD10 Entity

ICD10 Category codes have been split from the ICD10 entity to give more flexibility:

1 (detect icd10cm_categories) IGNORE (r10)


Incorrect Content Matching When Using Spreadsheets


Occasionally an attachment is blocked because of a match (usually numeric), although the matched content does not appear to be in the spreadsheet. This happens because some document types (e.g. Microsoft Excel) have internal formatting features that can cause Mimecast to match incorrectly. Examples include:

  • Spreadsheet columns that have an internal numbering scheme (invisible to the user) which when analyzed appear to be numeric content.
  • Dates that are stored internally in a long integer format.


This internal numbering can be mistaken for Social Security Numbers, Credit Card numbers, etc. by the Mimecast content text analyzer. Unfortunately, there is no automatic way to avoid this, and the incorrectly held attachment must be released manually.

