How to find your critical data – fast, easy, and (almost) painless
One of the most important superpowers in the Purview toolkit is being able to find where your sensitive data is hiding. In Norway, that usually means personnummer (national identity numbers) are top of mind. In this post, I’ll show you how to hunt for critical data (using personnummer as our running example) without drowning in false positives. And yes, I’ll steer you right so you don’t break stuff.
Sensitive info types: your detection engines
In Microsoft Purview, you have built-in Sensitive information types (SITs). These are templates the system uses to scan e-mails, documents, Teams messages, etc., and detect “something that looks like personal info.”
For example, there is a built-in Norway Identity Number type. In theory, it should flag Norwegian personal numbers. In practice, spoiler, it tends to generate lots of noise (false positives).
So: the trick is to customize or extend those types with smarter logic (hello, RegEx) so you get useful results, not a bucket of trash.

Step 1: Create your custom sensitive info types (with RegEx magic)
Here’s how to build your own, more precise detectors:
- Go to purview.microsoft.com → Information Protection → Classifiers → Sensitive information types → New sensitive information type.
- Give it a name & description.
- Create one or more patterns (rules) using RegEx. For example:
- Pattern A: 11 consecutive digits, or “6 digits + space + 5 digits”
\d{11}|\d{6} \d{5}
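- Pattern B: same as above, plus date validation: the first two digits must be a valid day (01–31) for the month given by the next two (01–12), and 29 February is accepted only when the two-digit year is divisible by 4.
(?:(?:31(?!02|04|06|09|11))|(?:30(?!02))|(?:29(?!02))|(?:29(?=02(?:[02468][048]|[13579][26])))|(?:0[1-9]|1\d|2[0-8]))(0[1-9]|1[0-2])(\d{2}) ?\d{5}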
(Yes, that’s gnarly, but you only need to paste and test once.)
- Choose confidence levels (Low/Medium/High) for each pattern, and whether to make it the primary pattern.
- Save & finish the wizard.
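If you want a quick sanity check of the two patterns before pasting them into the wizard, a small Python sketch like the one below can help. Treat it as an approximation: Python's re module is not the engine Purview runs, the matches helper is a hypothetical name, and every digit string is invented for illustration. The Pattern B variant used here accepts the 29th in any month except February, and 29 February only when the two-digit year is divisible by 4.

```python
import re

# Pattern A from the post: 11 digits, or 6 digits + space + 5 digits.
PATTERN_A = re.compile(r"\d{11}|\d{6} \d{5}")

# Pattern B variant: DDMMYY must be a plausible date, followed by 5 digits.
PATTERN_B = re.compile(
    r"(?:31(?!02|04|06|09|11)"                # 31st: not in Feb, Apr, Jun, Sep, Nov
    r"|30(?!02)"                              # 30th: not in February
    r"|29(?!02)"                              # 29th: any month except February
    r"|29(?=02(?:[02468][048]|[13579][26]))"  # 29 Feb: only if YY is divisible by 4
    r"|0[1-9]|1\d|2[0-8])"                    # 01-28: valid in every month
    r"(?:0[1-9]|1[0-2])"                      # month 01-12
    r"\d{2} ?\d{5}"                           # two-digit year, optional space, 5 digits
)


def matches(pattern, text):
    """Hypothetical helper: True if the whole string fits the pattern."""
    return pattern.fullmatch(text) is not None


# Invented digit strings, not real identity numbers.
for sample in ["010190 12345", "29029612345", "29029712345",
               "31029012345", "99999999999"]:
    print(sample, matches(PATTERN_A, sample), matches(PATTERN_B, sample))
```

Pattern A accepts all five strings, while Pattern B rejects the ones with an impossible date (29 February 97, 31 February, day 99). That gap is the noise you are trying to remove.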

Update note (2025): As of now, Purview allows you to combine built-in detectors + custom ones and weight them. Also, be aware that for large tenants or high volume scans, overly complex RegEx can slow performance. Always test on small sample sets first.
Step 1.5: Test your RegEx before you unleash it
Before you point it at your entire tenant:
- Use the “Test” feature in the sensitive info type definition.
- Upload a few test text files: one with valid personnummer, another with random numbers.
- See how many hits (true vs false) you get.
- Adjust your RegEx as needed.
Potential drawbacks: a simple pattern such as Pattern A still accepts impossible dates like “3102” (31 Feb), and no pattern will catch every odd formatting, so be honest about what “good enough” means for your use case.
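The same true-versus-false counting can be rehearsed offline before you upload anything. The sketch below is a rough stand-in for the portal's test feature, not a replacement for it: count_hits, SAMPLES and DATE_AWARE are hypothetical names, the sample lines are invented, and DATE_AWARE is a deliberately simplified pattern that checks day 01–31 and month 01–12 but not month length.

```python
import re


def count_hits(pattern, labeled_samples):
    """Return (true_hits, false_hits, missed) for (text, should_match) pairs."""
    true_hits = false_hits = missed = 0
    for text, should_match in labeled_samples:
        found = re.search(pattern, text) is not None
        if found and should_match:
            true_hits += 1
        elif found:
            false_hits += 1
        elif should_match:
            missed += 1
    return true_hits, false_hits, missed


# Invented test lines: True means "a personnummer-like value is really there".
SAMPLES = [
    ("Personnummer: 010190 12345", True),
    ("Fnr 15067812345 registered", True),
    ("Invoice no. 99887766554", False),    # random 11-digit number
    ("Order 31029012345 shipped", False),  # starts with 31 Feb
    ("Phone: 22 33 44 55", False),
]

PATTERN_A = r"\d{11}|\d{6} \d{5}"
# Simplified assumption: valid day and month ranges only, no month-length check.
DATE_AWARE = r"(?:0[1-9]|[12]\d|3[01])(?:0[1-9]|1[0-2])\d{2} ?\d{5}"

print("Pattern A: ", count_hits(PATTERN_A, SAMPLES))
print("Date-aware:", count_hits(DATE_AWARE, SAMPLES))
```

On these samples the date-aware pattern drops the random invoice number but still flags the 31 February string, which is exactly the kind of trade-off to decide on before rolling out.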
Step 2: Grant yourself permission to see the results
Creating a smart SIT is pointless if you can’t see where matches happened. By default, you can’t see detailed results (for privacy/security). You’ll need one or both of these roles:
- Content Explorer List Viewer: lets you see which mailboxes, OneDrives, SharePoint sites, etc., produced matches.
- Content Explorer Content Viewer: lets you dig into the actual files or messages that matched (even if you normally wouldn’t have direct access).
With this power comes responsibility. Talk to your data protection officer/legal counsel before giving access, especially to the content viewer.
One more tip: once you assign the role, it can take a few hours to propagate. Sometimes logging out/back in helps.
Step 3: Explore in Content Explorer (a.k.a. find the loot)
Once the roles are set:
- In the Purview portal, navigate to Information Protection → Content Explorer.
- Filter/search by your custom or built-in sensitive info type(s).
- Review results at different levels:
- Top level: which containers (mailboxes, SharePoint sites) have matches
- Mid level: number of hits per site/user
- Deep level (if you have the content viewer role): see the exact files/messages
- Compare built-in vs your custom detectors: often you’ll see built-in giving tons of noise, and your custom ones narrowing it down dramatically.
From my own tests:
- Built-in “Norway Identity number” gave 122,218 hits
- Custom RegEx A: 4,209 hits (≈96.6 % fewer than the built-in type)
- Custom RegEx B (with date filter): ~2,588 hits (further refinement)
Yeah, those numbers are real, just less embarrassing now.
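For anyone who wants to redo the percentages with their own tenant's numbers, the arithmetic is a one-liner. This sketch uses the hit counts quoted above (the Pattern B figure is approximate in the post); reduction_pct is a hypothetical helper name.

```python
def reduction_pct(before, after):
    """Percentage drop in hit count going from `before` to `after`."""
    return (before - after) / before * 100


built_in = 122_218  # built-in "Norway Identity number"
regex_a = 4_209     # custom RegEx A
regex_b = 2_588     # custom RegEx B (approximate figure)

print(f"Built-in -> A: {reduction_pct(built_in, regex_a):.1f} % fewer hits")
print(f"A -> B:        {reduction_pct(regex_a, regex_b):.1f} % fewer hits")
print(f"Built-in -> B: {reduction_pct(built_in, regex_b):.1f} % fewer hits")
```

Keep in mind that a drop in hits only equals a drop in false positives if the discarded matches really were noise, so spot-check a sample of them.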

Why this matters (and what your leaders should hear)
- Risk reduction: If Copilot is reading everything, you want to know exactly where your sensitive data is.
- Fewer false positives: You don’t want your compliance team chasing ghosts.
- Governance visibility: When you can point to “these 27 files on SharePoint contain personnummer,” it’s a compelling story for audits.
- Remediation planning: Once you have a map, you can prioritize “fix, secure, delete, archive” actions.
Pro tips/pitfalls from the trenches
- Don’t get carried away with overly clever RegEx. It can backfire.
- Start small – test in a dev or pilot environment first.
- Be careful with encoding/file type quirks (e.g. PDFs, images): some detectors don’t scan non-text content.
- Build in a feedback loop: data owners flag false matches, you iterate your detectors.
- Schedule periodic scans: data moves and changes, and new users can upload things later.

