How to find your critical data – fast, easy, and (almost) painless

One of the most important superpowers in the Purview toolkit is being able to find where your sensitive data is hiding. In Norway, that usually means personnummer (national identity numbers) are top of mind. In this post, I’ll show you how to hunt for critical data (using personnummer as our running example) without drowning in false positives. And yes, I’ll steer you right so you don’t break stuff.

Sensitive info types: your detection engines

In Microsoft Purview, you have built-in Sensitive information types (SITs). These are templates the system uses to scan e-mails, documents, Teams messages, etc., and detect “something that looks like personal info.”

For example, there is a built-in Norway Identity Number type. In theory, it should flag Norwegian personal numbers. In practice, spoiler, it tends to generate lots of noise (false positives).

So: the trick is to customize or extend those types with smarter logic (hello, RegEx) so you get useful results, not a bucket of trash.


Step 1: Create your custom sensitive info types (with RegEx magic)

Here’s how to build your own, more precise detectors:

  1. Go to purview.microsoft.com → Information Protection → Classifiers → Sensitive information types → New sensitive information type.
  2. Give it a name & description.
  3. Create one or more patterns (rules) using RegEx. For example:
    • Pattern A: 11 consecutive digits, or “6 digits + space + 5 digits”
\d{11}|\d{6} \d{5}
    • Pattern B: same as above, plus validation of the first two digits as a day (01–31, adjusted for month length and for 29 February in leap years) and the next two as a month (01–12).
(?:(?:31(?!02|04|06|09|11))|(?:30(?!02))|(?:29(?!02))|(?:29(?=02(?:[02468][048]|[13579][26])))|(?:0[1-9]|1\d|2[0-8]))(0[1-9]|1[0-2])(\d{2}) ?\d{5}

(Yes, that’s gnarly, but you only need to paste and test once.)

  4. Choose confidence levels (Low/Medium/High) for each pattern, and whether to make it the primary pattern.
  5. Save & finish the wizard.
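
If you want a quick sanity check before pasting anything into the wizard, here is a minimal Python sketch of the two patterns. A few assumptions: Python’s re module is not the RegEx engine Purview runs, so this is only a local smoke test; the sample strings are made up; and the Pattern B string is meant to implement the day/month rule described above, so compare it character by character with whatever you actually paste into Purview.

```python
import re

# Pattern A: 11 consecutive digits, or 6 digits + space + 5 digits.
PATTERN_A = re.compile(r"\d{11}|\d{6} \d{5}")

# Pattern B: same shape, but the first six digits must form a plausible DDMMYY date.
PATTERN_B = re.compile(
    r"(?:(?:31(?!02|04|06|09|11))|(?:30(?!02))|(?:29(?!02))"
    r"|(?:29(?=02(?:[02468][048]|[13579][26])))|(?:0[1-9]|1\d|2[0-8]))"
    r"(0[1-9]|1[0-2])(\d{2}) ?\d{5}"
)

# Made-up sample strings (hypothetical, not real identity numbers).
SAMPLES = [
    "01019012345",   # 1 January: plausible date
    "310290 12345",  # 31 February: impossible date
    "290200 12345",  # 29 February, two-digit year 00 (treated as a leap year)
    "290201 12345",  # 29 February, two-digit year 01 (not a leap year)
    "991399 12345",  # day 99, month 13
]


def classify(text):
    """Report which patterns match the whole string."""
    return {
        "A": PATTERN_A.fullmatch(text) is not None,
        "B": PATTERN_B.fullmatch(text) is not None,
    }


for sample in SAMPLES:
    print(sample, classify(sample))
```

Pattern A should accept every sample, while Pattern B should only accept the ones with a believable date. That gap is exactly the noise you are trying to remove.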

Update note (2025): As of now, Purview allows you to combine built-in detectors + custom ones and weight them. Also, be aware that for large tenants or high volume scans, overly complex RegEx can slow performance. Always test on small sample sets first.

Step 1.5: Test your RegEx before you unleash it

Before you point it at your entire tenant:

  • Use the “Test” feature in the sensitive info type definition.
  • Upload a few test text files: one with valid personnummer, another with random numbers.
  • See how many hits (true vs false) you get.
  • Adjust your RegEx as needed.
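
You probably don’t want real people’s personnummer in your test files, so here is a rough Python sketch for generating the two files from the list above. Everything in it is my own assumption rather than something Purview provides: the file names, the helper names, and the use of the two mod-11 check digits to make the “valid” numbers look realistic. A checksum-valid number can still happen to coincide with a real one, so treat the output as test data only.

```python
import random

# Weights for the two mod-11 check digits of an 11-digit number.
# Not part of the RegEx patterns; used here only to build realistic test data.
K1_WEIGHTS = (3, 7, 6, 1, 8, 9, 4, 5, 2)
K2_WEIGHTS = (5, 4, 3, 2, 7, 6, 5, 4, 3, 2)


def check_digit(digits, weights):
    """Return the mod-11 check digit, or None when no valid digit exists."""
    remainder = sum(d * w for d, w in zip(digits, weights)) % 11
    digit = (11 - remainder) % 11
    return None if digit == 10 else digit


def synthetic_personnummer(rng):
    """Build one made-up, checksum-valid number with a DDMMYY prefix."""
    while True:
        day, month, year = rng.randint(1, 28), rng.randint(1, 12), rng.randint(0, 99)
        individual = rng.randint(0, 999)
        digits = [int(c) for c in f"{day:02d}{month:02d}{year:02d}{individual:03d}"]
        k1 = check_digit(digits, K1_WEIGHTS)
        if k1 is None:
            continue  # this combination has no valid first check digit
        k2 = check_digit(digits + [k1], K2_WEIGHTS)
        if k2 is None:
            continue  # no valid second check digit either, try again
        return "".join(str(d) for d in digits + [k1, k2])


def write_samples(valid_path, noise_path, count=20, seed=1):
    """Write one file of synthetic numbers and one file of random 11-digit noise."""
    rng = random.Random(seed)
    with open(valid_path, "w", encoding="utf-8") as handle:
        for _ in range(count):
            handle.write(f"Personnummer: {synthetic_personnummer(rng)}\n")
    with open(noise_path, "w", encoding="utf-8") as handle:
        for _ in range(count):
            handle.write(f"Order id: {rng.randint(0, 10**11 - 1):011d}\n")


if __name__ == "__main__":
    write_samples("valid_personnummer.txt", "random_numbers.txt")
```

Upload both files in the Test feature and compare: ideally every line in the first file is a hit and very few lines in the second are.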

Potential drawbacks: a simple pattern (like Pattern A) won’t reject impossible dates such as “3102” (31 Feb), and no pattern copes with every weird formatting, so be honest about what “good enough” means for your use case.

Step 2: Grant yourself permission to see the results

Creating a smart SIT is pointless if you can’t see where matches happened. By default, you can’t see detailed results (for privacy/security). You’ll need one or both of these roles:

  • Content Explorer List Viewer: lets you see which mailboxes, OneDrives, SharePoint sites, etc., produced matches.
  • Content Explorer Content Viewer: lets you dig into the actual files or messages that matched (even if you normally wouldn’t have direct access).

With this power comes responsibility. Talk to your data protection officer/legal counsel before giving access, especially to the content viewer.

One more tip: once you assign the role, it can take a few hours to propagate. Sometimes logging out/back in helps.

Step 3: Explore in Content Explorer (a.k.a. find the loot)

Once the roles are set:

  1. In the Purview portal, navigate to Information Protection → Content Explorer.
  2. Filter/search by your custom or built-in sensitive info type(s).
  3. Review results at different levels:
    • Top level: which containers (mailboxes, SharePoint sites) have matches
    • Mid level: number of hits per site/user
    • Deep level (if you have the content viewer role): see the exact files/messages
  4. Compare built-in vs your custom detectors: often you’ll see built-in giving tons of noise, and your custom ones narrowing it down dramatically.

From my own tests:

  • Built-in “Norway Identity number” gave 122,218 hits
  • Custom RegEx A: 4,209 matches (≈96 % false hits eliminated)
  • Custom RegEx B (with date filter): ~2,588 hits (further refinement)

Yeah, those numbers are real, just less embarrassing now.
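
If you want to redo that comparison with your own tenant’s numbers, the arithmetic is simple enough to sketch in Python. The hit counts below are the ones from my tests (the Pattern B figure is approximate); the function name is just something I made up for the example.

```python
# Hit counts reported above (the Pattern B figure is approximate).
HITS = {
    "Built-in Norway Identity Number": 122_218,
    "Custom RegEx A": 4_209,
    "Custom RegEx B": 2_588,
}


def reduction(baseline, hits):
    """Percentage of the baseline hits that no longer show up."""
    return 100 * (1 - hits / baseline)


baseline = HITS["Built-in Norway Identity Number"]
for name, hits in HITS.items():
    print(f"{name}: {hits} hits, {reduction(baseline, hits):.1f} % fewer than built-in")
```

Keep in mind that this only measures how many hits disappeared. Whether the remaining ones are true matches is something you still have to spot-check in Content Explorer.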

Why this matters (and what your leaders should hear)

  • Risk reduction: If Copilot is reading everything, you want to know exactly where your sensitive data is.
  • Fewer false positives: You don’t want your compliance team chasing ghosts.
  • Governance visibility: When you can point to “these 27 files on SharePoint contain personnummer,” it’s a compelling story for audits.
  • Remediation planning: Once you have a map, you can prioritize “fix, secure, delete, archive” actions.

Pro tips/pitfalls from the trenches

  • Don’t get carried away with overly clever RegEx. It can backfire.
  • Start small – test in a dev or pilot environment first.
  • Be careful with encoding/file type quirks (e.g. PDFs, images): some detectors don’t scan non-text content.
  • Build in a feedback loop: data owners flag false matches, you iterate your detectors.
  • Schedule periodic scans: data moves and changes, and new users can upload things later.

Author

  • Åsne Holtklimpen

    Åsne is a Microsoft MVP within Microsoft Copilot, an MCT and works as a Cloud Solutions Architect at Crayon. She was recently named one of Norway’s 50 foremost women in technology (2022) by Abelia and the Oda network. She has over 20 years of experience as an IT consultant and she works with Microsoft 365 – with a special focus on Teams and SharePoint, and the data flow security in Microsoft Purview.
