Content Examination Definitions: Word / Phrase Match List Examples

Document created by user.oxriBaJeN4 Employee on Sep 21, 2015. Last modified by user.Yo2IBgvWqr on Oct 6, 2017.
Version 32

A content expression is the combination of functions, text, and phrases used in a Content Examination definition to assist with Data Leak Prevention (DLP). A content expression could be a word, phrase, regular expression, or hash algorithm matching a specific document. The content expression is entered into the Content Examination definition, which scans messages looking for any of the search criteria to match.

 

These search terms are used to prevent accidental or malicious data loss through company email. Depending on your requirements, some or all of the examples below may help your email system comply with your company's security policy. Mimecast can validate regular expressions using both the Java and Perl regex engines. The engine is selected with the invocation keyword perlregex or javaregex.

 

The table below shows the engine that each usage keyword selects by default:

Usage | Engine
regex | Defaults to perlregex. If it fails, it uses javaregex.
perlregex | Defaults to jregex.
javaregex | Defaults to javaregex.

 

Using Content Expressions

 

Content expressions are entered into the "Word / Phrase Match List" field in a Content Examination definition. The activation score is a required field, and is the total score used to trigger the definition should a match be found in a message.

It is highly recommended to test Content Examination definitions to ensure that the correct results are achieved as required.
Search Parameters (*) | Example
weight [:maxscore] <search text> | 4 “Company Confidential”
weight [:maxscore] required <search text> | 1 required “Project X”
weight [:maxscore] exclude <search text> | 1 exclude “Tax exemption”
weight [:maxscore] regex <regular expression> | 10 regex 4[0-9]{12}(?:[0-9]{3})?
weight [:maxscore] regex,cardnumber <regular expression> | 1 regex,cardnumber 6(?<=\b6)(767|334)(?!\n\t)(\d{12,15}|[\d- ]{16,19})\b
weight [:maxscore] hash <MD5#> | 1 hash 9EBD30E761ED4FF770A90DDBD5CB4190 Confidential.PDF

 

Search Parameter Notation

 

Search Parameter Type | Example Parameters | Description
Mandatory | weight required exclude regex hash | One or more of these parameters must be included in the search parameters, according to the type of content expression that is being created. At a minimum, the weight must be defined.
[Optional] | [:maxscore] cardnumber | Parameters in square brackets [ ] are optional. "cardnumber" runs the matched number through a Luhn algorithm check to determine whether the sequence of digits is a valid credit card number.
<To Be Modified> | <search text> <regular expression> <MD5#> | Parameters in angle brackets < > must be replaced with your own values to build the required content expression.

 

 

Usage Examples

 

Words and Phrases

 

Word / Phrase Combination

 

Activation Score = 3

 

Word / Phrase Match List:
1 “financial affairs”
4 confidential
2 “company confidential information”

Email Content Required to Trigger Definition:
3 occurrences of ‘financial affairs’ OR
1 occurrence of ‘confidential’ OR
2 occurrences of ‘company confidential information’ OR
Any combination of the words and phrases, so that the cumulative score matches or exceeds 3 points.
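
The arithmetic behind this example can be sketched in a few lines of Python. This is only an illustration of the scoring rule described above, not Mimecast's implementation: the function name `definition_score`, the case-insensitive substring counting, and the sample message are all assumptions made for the sketch.

```python
def definition_score(text, match_list):
    """Sum weight * occurrences over every (weight, phrase) line.

    Assumption: matching is a case-insensitive substring count.
    """
    lowered = text.lower()
    return sum(weight * lowered.count(phrase.lower())
               for weight, phrase in match_list)


# Match list from the example above (activation score = 3).
MATCH_LIST = [
    (1, "financial affairs"),
    (4, "confidential"),
    (2, "company confidential information"),
]
ACTIVATION_SCORE = 3

message = "Please keep our financial affairs private."  # hypothetical message
score = definition_score(message, MATCH_LIST)
triggered = score >= ACTIVATION_SCORE  # one 1-point hit is below the threshold
```

In this sketch a phrase that contains another search term (such as ‘confidential’ inside ‘company confidential information’) is counted by both lines; how the product treats such overlaps is not stated here.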

Using the "regex" prefix followed by a search word or phrase wrapped in quotation marks ( " " ) does not match the bare term. The quotation marks are treated as part of the pattern, so matches are found only where the word or phrase appears in the message with the quotation marks around it.

Words / Phrases Combination With Comment

 

Activation Score = 10

 

Word / Phrase Match List:
# this is a comment, and is ignored
1 “social security”
1 private
8 “personal security code”

Email Content Required to Trigger Definition:
10 occurrences of ‘social security’ OR
10 occurrences of ‘private’ OR
2 occurrences of ‘personal security code’ OR
Any combination of the search terms, so that the total cumulative score is 10 or more.

A search for "personal security code" only triggers the policy if these words appear in the message in exactly this order.

Multiple Words

 

Activation score = 1

 

Word / Phrase Match List:
1 “Product A” “Product B”
1 “Product C” “Product D”

Email Content Required to Trigger Definition:
1 occurrence of ‘Product A’ OR ‘Product B’ OR 1 occurrence of ‘Product C’ OR ‘Product D’

Word / Phrase Match List:
1:10 notification

Email Content Required to Trigger Definition:
As long as the 'match multiple words' option is checked in the content definition, the definition will only match a maximum of 10 occurrences of the word 'notification' in the email.
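
The effect of the optional maxscore value can be pictured as a cap on what a single line contributes. The sketch below assumes exactly that reading (the line's score is weight times occurrences, limited to maxscore); the function name `line_score` is hypothetical.

```python
def line_score(weight, occurrences, maxscore=None):
    """Score contributed by one match-list line, optionally capped.

    Assumption: maxscore limits the line's total contribution.
    """
    score = weight * occurrences
    if maxscore is not None:
        score = min(score, maxscore)
    return score


# "1:10 notification": weight 1, maxscore 10.
capped = line_score(1, 25, maxscore=10)    # 25 occurrences still count as 10
uncapped = line_score(1, 4, maxscore=10)   # below the cap, nothing changes
```

With a weight of 1, a cap of 10 points corresponds to the "maximum of 10 occurrences" described above.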

 

Wildcard Characters

 

Activation score = 6

 

Word / Phrase Match List:
3 car*
3 capitali?e

Email Content Required to Trigger Definition:
2 occurrences of words beginning with ‘car’, e.g. cars, carlton, carter, carbuncle, etc. OR
2 occurrences of ‘capitalize’ or ‘capitalise’
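
One way to reason about the wildcard characters is to translate them into an equivalent regular expression. The sketch below assumes that `*` stands for zero or more word characters and `?` for exactly one, and that matching is case-insensitive on whole words; these semantics and the helper name `wildcard_to_regex` are assumptions, not documented product behaviour.

```python
import re


def wildcard_to_regex(term):
    """Translate a wildcard term into a whole-word regular expression.

    Assumptions: '*' = zero or more word characters, '?' = exactly one.
    """
    parts = []
    for ch in term:
        if ch == "*":
            parts.append(r"\w*")
        elif ch == "?":
            parts.append(r"\w")
        else:
            parts.append(re.escape(ch))
    return re.compile(r"\b" + "".join(parts) + r"\b", re.IGNORECASE)


text = "Carter parked two cars; please capitalise or capitalize the title."
car_hits = wildcard_to_regex("car*").findall(text)
cap_hits = wildcard_to_regex("capitali?e").findall(text)

# Weights from the example above: 3 points per hit, activation score 6.
score = 3 * len(car_hits) + 3 * len(cap_hits)
```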

 

Boolean Operators

 

You can use boolean operators (i.e. AND / OR) to help minimize the number of false positives experienced. This is achieved by specifying two search terms that must exist in a message.

 

Text Strings and Regular Expression

 

The following search terms require the phrase 'Credit Card' along with a valid credit card number before a content hit is found. If only one of the specified search terms is found, no content hit is logged.

 

 

Word / Phrase Match List:
1 ("Credit Card") AND (regex,cardnumber 4(?<=\b(?<!\.)4)\d{3}[\W\s]?\d{4}[\W\s]?\d{4}[\W\s]?\d{4}\b)

Email Content Required to Trigger Definition:
Credit Card 4111 1111 1111 1111

Each search term must be wrapped in brackets ( ) and followed by the AND operator and the next search term. Phrases must be wrapped in both brackets and quotation marks, for example ("Credit Card"). Only one AND operator can be used per line in the Word / Phrase Match List.

Multiple Entities

 

The following triggers a match when the presence of two entities is found:

1 (detect SIN) AND (detect Names)

 

Either One of Two Entities

 

The following triggers a match when the presence of one of two entities is found:

1 (detect drivers_license_us) OR (detect drivers_license_uk)

 

Combinations

 

Activation score = 4

 

Word / Phrase Match List:
0 required “Project Alpha”
5 “budget overrun”
4 urgent
2 update

Email Content Required to Trigger Definition:
The phrase ‘Project Alpha’ MUST be present AND
1 occurrence of ‘budget overrun’ OR
1 occurrence of ‘urgent’ OR
2 occurrences of ‘update’

Word / Phrase Match List:
0 exclude “Project Delta”
5 “budget overrun”
4 urgent
2 update

Email Content Required to Trigger Definition:
The phrase ‘Project Delta’ MUST NOT be present AND
1 occurrence of ‘budget overrun’ OR
1 occurrence of ‘urgent’ OR
2 occurrences of ‘update’

The Required / Exclude operators must be used in the first line of the Content Examination definition.
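
The way the required and exclude modifiers gate the score can be sketched as below. This is an illustrative model only: the function name `triggers`, the tuple layout, and the case-insensitive substring matching are assumptions made for the example.

```python
def triggers(text, lines, activation_score):
    """Evaluate (weight, modifier, phrase) lines against a message.

    modifier is None, "required" or "exclude".
    Assumption: matching is a case-insensitive substring count.
    """
    lowered = text.lower()
    score = 0
    for weight, modifier, phrase in lines:
        count = lowered.count(phrase.lower())
        if modifier == "required" and count == 0:
            return False  # a required phrase is missing
        if modifier == "exclude" and count > 0:
            return False  # an excluded phrase is present
        score += weight * count
    return score >= activation_score


ALPHA = [(0, "required", "Project Alpha"), (5, None, "budget overrun"),
         (4, None, "urgent"), (2, None, "update")]
DELTA = [(0, "exclude", "Project Delta"), (5, None, "budget overrun"),
         (4, None, "urgent"), (2, None, "update")]

alpha_hit = triggers("Project Alpha: urgent review needed", ALPHA, 4)
alpha_miss = triggers("Urgent review needed", ALPHA, 4)   # required phrase absent
delta_miss = triggers("Project Delta is urgent", DELTA, 4)  # excluded phrase present
```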

Proximity Operators

 

The Proximity operator lets administrators trigger a Content Policy only when one search term is found within a specified number of characters of another search term. If the terms are further apart than the specified distance, no content match is triggered.

 

Considerations:

  • There is no limit on the distance specified by the Proximity operator.
  • Mimecast Managed Reference Dictionaries (MMRDs) and Reference Dictionaries can be used in conjunction with the 'Proximity' operator.
  • The Proximity value is calculated as the number of characters from the end of the first search word / phrase to the first character of the second search word / phrase (including blank spaces and special characters). Take the following example:

 

"Here is my credit card number for you to use 4111 1111 1111 1111". The proximity value would be 17, as there are 17 characters between the end of "credit card number" and the first character of the credit card number specified (4111 1111 1111 1111). Remember that blank spaces count as characters, so you'll need to include them in your proximity value.

Proximity Example

 

Word / Phrase Match ListEmail Content Required to Trigger Definition
1 ("Credit Card Number") Proximity:17 (regex,cardnumber 4(?<=\b(?<!\.)4)\d{3}[\W\s]?\d{4}[\W\s]?\d{4}[\W\s]?\d{4}\b)

For example the following text would cause the Content policy to trigger as both Term 1 and Term 2 meet the configured requirements.

"Here is my credit card number for you to use 4111 1111 1111 1111"

If Term 1 is fewer than 17 characters from Term 2, a content match still occurs, because the Proximity value defines a maximum range rather than an exact distance.

 

The following text would not match, as Term 1 and Term 2 are more than 17 characters apart. They are in fact 34 characters apart.

"Here is my credit card number for you to use to book the hotel 4111 1111 1111 1111"

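The range check described in this section can be sketched as follows. The helper names, the simplified card pattern (a stand-in for the longer regex,cardnumber expression shown above), and the convention of counting only the characters strictly between the two terms are assumptions; the product's own count may differ by a character at the boundaries.

```python
import re


def proximity_gap(text, phrase, pattern):
    """Characters between the end of `phrase` and the start of the first
    later match of `pattern`, or None if either term is missing."""
    first = re.search(re.escape(phrase), text, re.IGNORECASE)
    if first is None:
        return None
    second = re.compile(pattern).search(text, first.end())
    if second is None:
        return None
    return second.start() - first.end()


def within_proximity(text, phrase, pattern, limit):
    """True when both terms are present and no more than `limit` apart."""
    gap = proximity_gap(text, phrase, pattern)
    return gap is not None and gap <= limit


# Simplified stand-in for the card-number expression (assumption).
CARD = r"\b\d{4}(?: \d{4}){3}\b"

near = "Here is my credit card number for you to use 4111 1111 1111 1111"
far = ("Here is my credit card number for you to use to book the hotel "
       "4111 1111 1111 1111")

near_match = within_proximity(near, "Credit Card Number", CARD, 17)
far_match = within_proximity(far, "Credit Card Number", CARD, 17)
```

With a limit of 17, the first sentence is within range and the second is not, which mirrors the two outcomes described above.
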
 

Content Reference Dictionaries

 

Content reference dictionaries are added from the Insert menu inside a Content Examination Definition. For more information, see the Reference Dictionaries page.

 

Activation score = 3

 

Word / Phrase Match List:
#ref 545 Social Security Number
#ref 276 Common Medical Terms

Email Content Required to Trigger Definition:
Words and phrases contained in the ‘Social Security Number’ OR ‘Common Medical Terms’ reference dictionaries must be present, in a combination so that their aggregate scores add up to 3 or more. Entries in the dictionaries are individually weighted, or have a default weighting of 1.

 

These reference dictionaries must be created in advance. The Word / Phrase Match List does not display the full list of criteria held in a dictionary, so refer to the original reference dictionary to examine its contents. By default, Mimecast provides Managed Reference Dictionaries only for credit card numbers and profanity lists.

 

MD5 Hash Checksum

 

Checksums are generated from the Insert menu inside a Content Examination Definition. 

 

Activation score = 1

 

Word / Phrase Match List:
#1 hash 00C9443961BC9131FE96D580AB29CE59 mimecast.xls - (30208 Bytes)

Email Content Required to Trigger Definition:
The ‘mimecast.xls’ file must be present in the email.
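
To cross-check a checksum outside the console, an MD5 digest of a file's bytes can be computed with Python's standard `hashlib` module. This is a minimal sketch: the helper names are hypothetical, and the upper-case hexadecimal formatting is chosen only to resemble the entry shown above.

```python
import hashlib


def md5_checksum(data: bytes) -> str:
    """Return the MD5 digest of `data` as upper-case hexadecimal."""
    return hashlib.md5(data).hexdigest().upper()


def matches_hash(data: bytes, expected_hex: str) -> bool:
    """Compare the digest of `data` with a hash string, ignoring case."""
    return md5_checksum(data) == expected_hex.strip().upper()


sample = b"hello"  # stand-in for the bytes of an attachment
digest = md5_checksum(sample)
same_file = matches_hash(sample, "5d41402abc4b2a76b9719d911017c592")
```

Because the digest covers the exact bytes of the file, any change to the file's contents produces a different checksum and the hash line no longer matches.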

 

Regular Expressions (Text Matches)

 

Activation score = 4

 

Word / Phrase Match List:
1 regex [bcr]at
10 regex D(?:OB|ate of birth)(?:[\W\D\S])?\s?\d{1,2}(?:\/|\.)\d{1,2}(?:\/|\.)\d{2,4}\b

Email Content Required to Trigger Definition:
4 occurrences of ‘bat’, ‘cat’ or ‘rat’ (any text matching [bcr]at) OR
1 occurrence of a reference to a Date of birth in the email with the format: Date of birth: MM|DD/MM|DD/YY|YYYY
For more Date Formats:
Month Day Year: (([0-1][0-2])|([0][1-9]|[1][1-2]))(\.| |/|-)(([0-2][0-9])|([3][0-1]))(\.| |/|-)(([1][9][0-9][0-9])|([2][0][0-9][0-9])|(\d{2}))
Day Month Year: (([0-2][0-9])|([3][0-1]))(\.| |/|-)(([0-1][0-2])|([0][1-9]|[1][1-2]))(\.| |/|-)(([1][9][0-9][0-9])|([2][0][0-9][0-9])|(\d{2}))
Year Month Day: (([1][9][0-9][0-9])|([2][0][0-9][0-9])|(\d{2}))(\.| |/|-)(([0-1][0-2])|([0][1-9]|[1][1-2]))(\.| |/|-)(([0-2][0-9])|([3][0-1]))
Year Day Month: (([1][9][0-9][0-9])|([2][0][0-9][0-9])|(\d{2}))(\.| |/|-)(([0-2][0-9])|([3][0-1]))(\.| |/|-)(([0-1][0-2])|([0][1-9]|[1][1-2]))

Word / Phrase Match List:
5 regex (pre-|post-)operative

Email Content Required to Trigger Definition:
1 occurrence of ‘pre-operative’ OR ‘post-operative’
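
Before adding an expression to a definition, it can help to try it against sample text. The sketch below does this with Python's `re` module, which is neither of the engines named in this article, so behaviour may differ for some syntax; the weights mirror the first example above, and the function name `regex_score` and the sample sentences are hypothetical.

```python
import re

# Patterns from the first example above.
TEXT_PATTERN = re.compile(r"[bcr]at")
DOB_PATTERN = re.compile(
    r"D(?:OB|ate of birth)(?:[\W\D\S])?\s?"
    r"\d{1,2}(?:\/|\.)\d{1,2}(?:\/|\.)\d{2,4}\b"
)


def regex_score(text):
    """1 point per [bcr]at hit plus 10 points per date-of-birth hit."""
    return (1 * len(TEXT_PATTERN.findall(text))
            + 10 * len(DOB_PATTERN.findall(text)))


animals = regex_score("The cat chased a rat.")        # two 1-point hits
dob_long = regex_score("Date of birth: 12/31/1980")   # one 10-point hit
dob_short = regex_score("DOB 01.02.85")               # one 10-point hit
```

With an activation score of 4, the first sample stays below the threshold while either date-of-birth sample exceeds it on its own.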

 

Validators

 

Validators are used to check whether specific pieces of content found in messages or attachments that appear to be identification numbers (e.g. credit cards, social security numbers, etc.) are genuinely valid. These numbers are validated using check algorithms, such as the Luhn algorithm, that are defined by the issuer for confirmation purposes.

Using validators greatly reduces the number of false positives that can occur when attempting to find matches based upon regular expressions alone, as the Luhn check must pass before any regular expression matches are attempted.
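
For readers unfamiliar with it, the Luhn check can be sketched as follows. This is the standard public algorithm written as a small Python function (the name `luhn_valid` is hypothetical) and is not taken from Mimecast's implementation; spaces and other separators are simply ignored.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn check."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if len(digits) < 2:
        return False
    total = 0
    # Walk from the rightmost digit, doubling every second digit.
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9  # same as adding the two digits of the product
        total += digit
    return total % 10 == 0


visa_ok = luhn_valid("4111 1111 1111 1111")
visa_bad = luhn_valid("4111 1111 1111 1112")  # last digit altered
```

A random 16-digit sequence passes this check only about one time in ten, which is why pairing it with a regular expression cuts down false positives.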

Below is a list of all of the Validators currently supported by Content Examination policies.

Activation Score = 1

 

Validator Type: ABA Number (American Bankers Association Number)
Word / Phrase Match List: 1 regex,aba (\d){9}
Email Content Required to Trigger Definition: A valid ABA number (e.g. 021000322)

Validator Type: CHI Number (Community Health Index Number)
Word / Phrase Match List: 1 regex,chinumber (([^\w\t]?\s)?\d){10}
Email Content Required to Trigger Definition: A valid CHI Number (e.g. 3110130017)

Validator Type: CardNumber (Credit Card Numbers). Includes the following card types: Visa, Mastercard, American Express (Amex), Diners Club, Discover Card, JCB Cards
Word / Phrase Match List: 1 regex,cardnumber 4(?<=\b(?<!\.)4)\d{3}[\W\s]?\d{4}[\W\s]?\d{4}[\W\s]?\d{4}\b
Email Content Required to Trigger Definition: A valid Credit Card Number (e.g. a valid Visa Credit Card number of 4444333322221111)

Validator Type: Email Address
Word / Phrase Match List: 1 regex,email (\w+[@\.]\s*\w+\.*\w+)
Email Content Required to Trigger Definition: A valid Email Address (e.g. name@domain.com, name_2@domain.com, or name@domain2.com)

Validator Type: IBAN Number (International Bank Account Number)
Word / Phrase Match List: 1 regex,iban GB\s(\d){2}\sBARC(\s*\d){14}
Email Content Required to Trigger Definition: A valid IBAN number (e.g. GB 37 BARC 2004 1538 2900 08)

Validator Type: NI Number (UK National Insurance Number)
Word / Phrase Match List: 1 regex,nin \s*[a-zA-Z]{2}(?:\s*\d\s*){6}[a-zA-Z]?\s*
Email Content Required to Trigger Definition: A valid National Insurance Number (e.g. BN102966C)

Validator Type: NHS Number (National Health Service Number)
Word / Phrase Match List:
1 regex,nhsnumber (([^\w\t]?\s)?(-)?\d){10}
1 regex,nhsnumber (([^\w\t]?\s)?(_)?\d){10}
Email Content Required to Trigger Definition:
A valid NHS Number (Hyphen Separated), e.g. 499-999-9994
A valid NHS Number, e.g. 499 999 9994

Validator Type: NPI Number (Health Identification Card Number)
Word / Phrase Match List: 1 regex,npi (?<!\d)\d{10}(?!\d)|80840\d{10}(?!\d)
Email Content Required to Trigger Definition: A valid Standard Health Identification Card Number (e.g. 808401234567893 or 1234567893)

Validator Type: MOD10 (Modulus 10, used to validate Canadian Health and US Postal Service PIC numbers)
Word / Phrase Match List: 1 regex,mod10 (\d){9}
Email Content Required to Trigger Definition: A valid Canada Health Number or US Postal Service PIC number consisting of 9 digits.
  • Canada Health Number 9876543217
  • US Postal Service PIC Number 12345678

Validator Type: SIN Number (Canadian Social Insurance)
Word / Phrase Match List: 1 regex,sin (\d){9}
Email Content Required to Trigger Definition: A valid number (e.g. 046 454 286)

Validator Type: SSN (US Social Security Number)
Word / Phrase Match List: 1 regex,ssn ([^0-9-]|^)([0-9]{3}-[0-9]{2}-[0-9]{4})([^0-9-]|$)
Email Content Required to Trigger Definition: A valid US Social Security Number (e.g. 078-05-1125)

Validator Type: PhoneNumber or PhoneNumber+Region. The region form can be used to minimise false positives or to target specific regions. Includes support for the following regions: AU, UK, US
Word / Phrase Match List:
1 regex,phoneNumberAU (\+?)\d{1,3}(\s)?(\(\d{1}\))?[\s\d-]+
1 regex,phoneNumber (\+?)\d{1,3}(\s)?(\(\d{1}\))?[\s\d-]+
Email Content Required to Trigger Definition:
A valid AU-based telephone number (e.g. 1300 307 318) OR
A valid telephone number including country code (e.g. +44 (0)20 7847 8700)

Validator Type: PostCode+Region (postcodeau). Includes support for the following regions: AU, CA, UK, US, ZA
Word / Phrase Match List: 1 regex,postalcodeau [0-9]{4}
Email Content Required to Trigger Definition: A valid AU Post Code (e.g. NSW 2060)

Validator Type: VIN Number (Vehicle Identification Number)
Word / Phrase Match List: 1 regex,vin [0-9A-HJ-NPR-Z]{17}
Email Content Required to Trigger Definition: A valid VIN number (e.g. 1M8GDM9AXKP042788)

Regular expressions used in the examples above are for illustration purposes only and may not catch all possible examples.

Fuzzy Hashes

 

Fuzzy Hashing can be used to limit the flow of sensitive information out of your organization, by matching text content similarities between a Control Document and email attachments passing through your Mimecast service.

 

Activation score = 1

 

Fuzzy Hashes are generated via the Generate Fuzzy Hash button found on the Content Definitions page, and inserted using the Insert menu in Content Examination definitions.

 

There are two types of Fuzzy Hashes that can be used within Content Examination policies:

  • SSDEEP - This hash type uses the binary information of the document to generate the hash.
  • MFH - This hash type uses the text contained within the control file to generate the hash.

 

Word / Phrase Match List:
1 mfh 1.WBF37rr18toZBKcQ1nxmaM9wBWlAPVtUpWOTL5FLLT+cEmnKw....

Email Content Required to Trigger Definition:
An email attachment containing similar content. The content must be at least 75% similar for a match to be found.

Further information on configuring and using Fuzzy Hashes can be found in Using Fuzzy Hashing with a Content Examination Definition.

Entity Examples

The use of entities requires the healthcare dictionaries package to be enabled on your account.

Single Entity

 

You wish to hold all messages containing references to credit card numbers. The "creditcard" entity finds all credit card numbers, regardless of the credit card type. For example, the following would match any credit card number found in the specified areas of an email (header, body, attachment).

1 detect creditcard

 

Multiple Entities with Operators

 

You want to hold messages that contain a piece of PII (Personally Identifiable Information) and a date of birth that are within a specific distance of each other.

1 (detect aba) Proximity:50 (detect date_dob)

This expression makes a match only when an ABA number and a Date of Birth (DOB) are found within 50 characters of each other.

The "Proximity" operator has a default distance of 300 characters. Specifying a number after Proximity overrides the default distance.

 

Excluding Entities

 

An administrator is using the PII entity group. They are seeing a high number of false positives with the "Phone Number" entity, and would like to check if excluding this resolves the problem. To do this they can:

  1. Remove the entry for the PII entity group from the Content Examination definition.
  2. Enter all the individual entities that they wish to use.

 

Original Word / Phrase Match List:
1 detect PII

New Word / Phrase Match List:
1 detect Name
1 detect DOB
1 detect SSN
1 detect MedicareID
1 detect FAX
1 detect VIN
1 detect IP
1 detect EmailAddress
1 detect URL

 

Negative Score

 

Using the same example of the PII entity group, the administrator can also exclude the "Phone Number" entity by applying a negative score to it. If a match is found for the search term, the negative score is added to the running total of hits. This reduces the overall score, and with it the chance of the Content Examination policy being applied. To do this, the administrator can:

  1. Leave the PII entity group in the Content Examination definition.
  2. Apply a negative score for the "Phone Number" entity.

 

Original Word / Phrase Match List:
1 detect PII

New Word / Phrase Match List:
1 detect PII
-1 detect PhoneNumber
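
The effect of the negative line can be illustrated with simple arithmetic. The sketch below assumes that the PII group counts phone numbers among its hits, so each phone number adds one point through the PII line and loses one through the PhoneNumber line; the hit counts and the function name `total_score` are hypothetical.

```python
def total_score(lines, hits):
    """Sum weight * hit count for each (weight, entity) line."""
    return sum(weight * hits.get(entity, 0) for weight, entity in lines)


ORIGINAL = [(1, "PII")]
NEW = [(1, "PII"), (-1, "PhoneNumber")]

# Hypothetical message whose only PII hits are two phone numbers.
hits = {"PII": 2, "PhoneNumber": 2}

before = total_score(ORIGINAL, hits)  # phone numbers alone reach a score of 2
after = total_score(NEW, hits)        # the negative line cancels them out
```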

 

Adding Entities

 

There are reports of Canadian Social Insurance Numbers (SIN) being present in messages sent externally. On investigation, it is discovered that the SIN entity is not present in the policy. To resolve this, the administrator can:

  1. Add the SIN entity to the Content Examination definition:
1 detect SIN

Combining Entities and Phrases

 

You want to search for the term "Admission Date" followed by a date in Month / Day / Year format. This can be achieved by using the following policy syntax.

1 ("Admission Date") Proximity (detect date_mdy)

  • 1 is the line score applied when a match is found.
  • ("Admission Date") is the first check performed. This must be in brackets to mark the boundaries of the search text.
  • Proximity is the operator. In this case the phrase "Admission Date" needs to be within 300 characters of a date in Month/Day/Year format.
  • (detect date_mdy) is the second check performed. Again this must be in brackets to mark the boundaries of the search term.

 

Incorrect Content Matching When Using Spreadsheets

 

Occasionally an attachment is blocked because of a match (usually numeric) although the matched content does not appear to be in the spreadsheet. This happens because some document types (e.g. Microsoft Excel) have internal formatting features that can cause Mimecast to match incorrectly. Examples include:

  • Spreadsheet columns that have an internal numbering scheme (invisible to the user) which when analyzed appear to be numeric content.
  • Dates that are stored internally in a long integer format.

 

This internal numbering can be mistaken for Social Security Numbers, Credit Card numbers, etc. by the Mimecast content text analyzer. Unfortunately, there is no automatic way to avoid this, and the incorrectly held attachment must be released manually.

 
