The new website for MarkLogic is Visit it.

How to Find and Control Access to PII

Back to blog
6 minute read
Back to blog
6 minute read

How do you find and control access to Personally Identifiable Information (PII) and sensitive information in a MarkLogic database? You may be complying with regulation, internal policies, or simply trying to reduce risk.

In many situations, it may be apparent which elements of the document or JSON properties contain PII. For instance, an element called SSN or credit card number clearly indicates that this is sensitive information. However, what if you have too many data sources and unknown element names to describe an SSN? For example, you are implementing an Employee 360 application that ingests data from multiple systems. One system describes SSN as id, another uses SSN as primary key, another has SSN within text notes, etc. Therefore, our first step is finding the path to information that we want to protect.

Finding PII

Leverage the Indexes

MarkLogic has a powerful Universal Index that can help us find information in documents. If you can grab a list of known SSNs in the database, for instance a list from HR of some SSNs, you can easily use the number as input to word queries. Here is one example of a query searching for some SSNs.

let $pii := ("567-90-6071", "390-32-7638", "728-48-1630")
let $docs:= cts:search(fn:doc(), cts:word-query($pii))
return fn:distinct-values(xdmp:path($docs//text()[fn:matches(fn:string(.), $pii )]/..))

This search works across XML and JSON, but the output is slightly different. Let’s say for instance that the function returns the following /employee/user/text("ID"), /id/number/text(), /employee_table/key/text("primary"). This tells me that I have Social Security numbers across my documents under the following paths:/employee/user/ID, /id/number, and /employee_table/key/primary.

This is a fast and reliable way of finding the PII. However, what if I don’t know which values are classified as PII up-front?

Pattern Matching: Brute Force Method

For those cases, we will use two different techniques: pattern matching and sampling. Most PII and sensitive information follow a pattern such as a well-structured Social Security number. Therefore, you can use MarkLogic’s built-in capabilities to search for patterns. In the function below we are searching for an SSN pattern using fn:matches and returning all the paths using xdmp:path that match the pattern.

let $pattern :="^(d{3})([-.s]?)(d{2})2(d{4})$" 
xdmp:path(fn:collection("Employees")//*[fn:matches(fn:string(.), $pattern])

CAUTION: STOP, don’t try it! Notice that we are scanning all documents in the collection and the entire document structure using *//. This pattern matching will not leverage the indexes and is a brute force search. Consequently, this would take a long time if you do in a large database. This certainly would be okay if you are running against a small staging or QA database or if you need to guarantee that you find all occurrences of Social Security numbers in the production database in order to protect them.

Scoping and Sampling: Elegant and Better Than Brute Force

Ok, we don’t want to scan the entire database! Let’s apply scoping and sampling techniques to reduce the number of documents that we search. First, begin by scoping the documents using collections. If you have multiple types of documents that represent logical entities in the database, focus on the ones that you know may contain PII. Second, use sampling.

The code below randomly samples 15% of the documents in the “Employee” collection and then searches for the SSN pattern, returning distinct paths

let $x := .15  (: 15% :)
let $results := cts:search(fn:collection("Employee"), cts:true-query(), "score-random")
let $estimate := xdmp:estimate(cts:search(fn:collection("Employee"), cts:true-query()))
let $docs := $results[1 to fn:ceiling($estimate * $x)]
let $pattern :="^(d{3})([-.s]?)(d{2})2(d{4})$"
return fn:distinct-values(xdmp:path($docs//text()[fn:matches(fn:string(.), $pattern )]))

Mechanisms to Secure PII

Now that we have found the paths to the PII, you can use MarkLogic’s document level security, Element Level security and Redaction to control access to it.

Document-Level Security

Document level security controls access to entire documents based on user roles. If you haven’t set document level permissions during ingestion, which is the preferred method, you can use the following function to do it on existing documents. In this example, only Employees can read the documents that match the collection search.

xdmp.documentAddPermissions(fn.collection(“Employee”), [xdmp.permission("Employee_role", "read")])

Element-Level Security

Element Level Security allows you to control access to paths in the document on real-time search, queries and updates based on user roles. The following Element Level security functions will ensure that only users with the HR_read role will be able to search or read Social Security numbers:

var sec = require("/MarkLogic/security.xqy");
const perm = xdmp.permission("HR_read", "read", "element");
let paths = ["//employee/user/ID", "//id/number","//employee_table/key/primary"];

for (const path of paths) {
  sec.protectPath(path, null, perm);

Note: it is important to highlight that the XPath expression (//employee/user/ID) allows some flexibility in the document structure because of the // at the beginning. If your document structure changes to /data/employeedata/employee/user/ID the XPath expression will still match and protect the SSN.


Redaction allows you to hide or mask information while you export it. The following rule, once applied using Redaction, will generate a fully random SSN when you export data, for instance to a QA or development instance. (one rule and path shown for brevity)

<rdt:rule xml:lang="zxx"
  <rdt:description>Mask SSNs</rdt:description>

Redaction rules are simply documents that are added to a collection in the Schemas database. You call mlcp (MarkLogic Content Pump) to export data and use mlcp –redact CollectionName to apply all rules in a collection to the data as it is exported. Redaction also leverages XPath and has the same flexibility in the document structure.

Verify It’s All Done

Once you set document and element level security, you can test to ensure that only the correct users have access to the PII. The following code will search for a particular SSN using different users. In our scenario, only HR_reader (who has HR-read role) will find a match.

'use strict';
/** Invokes any function as a specific user */
const asUser = (fn, user) =>
  xdmp.invokeFunction(fn, { userId: xdmp.user(user) });

/** The actual function that does the work. */
function _readDocs(pii) {

  let docs=;
  var paths=[];

  for (var doc of docs) {
    var matched = doc.xpath("//text()");
    for (var node of matched) {
      if (fn.matches(fn.string(node), pii)) {

/** A new synthetic function that combines the actual function and the user switching */
const readDocs = (pii, user) =>
    (/* zero params*/) => 
    _readDocs(pii), user

const users = ['Employee_reader', 'HR_reader', 'Contractor_reader'];
const pii ="898-31-4408";
const out = [];
let i = 0;
for (const result of => readDocs(pii, user))) {
  out.push(`------------ ${users[i++]} ------------`);

A Better Database to Find and Control PII

We learned how MarkLogic not only helps you find PII in your documents but also allows real-time access control access to PII, both at the document level and sub-document level, using element level security. Besides real-time control, MarkLogic also helps you to create data exports that are safe for developers or data scientists to use, as you can easily hide, mask or create sample data to replace PII.

Additional Resources

Caio Milani

Caio Milani is Director of Product Management at MarkLogic responsible for various aspects of the product including infrastructure, operations, security, cloud and performance. Prior to joining MarkLogic, he held product management roles at EMC and Symantec where he was responsible for storage, high availability and management products.
Caio holds a BSEE from the University of Sao Paulo and a full-time MBA Degree from the University of California, Berkeley.

Read more by this author

Share this article

Read More

Related Posts

Like what you just read, here are a few more articles for you to check out or you can visit our blog overview page to see more.


Poker Fun with XQuery

In this post, we dive into building a full five-card draw poker game with a configurable number of players. Written in XQuery 1.0, along with MarkLogic extensions to the language, this game provides examples of some great programming capabilities, including usage of maps, recursions, random numbers, and side effects. Hopefully, we will show those new to XQuery a look at the language that they may not get to see in other tutorials or examples.

All Blog Articles

Protecting passwords in ml-gradle projects

If you are getting involved in a project using ml-gradle, this tip should come in handy if you are not allowed to put passwords (especially the admin password!) in plain text. Without this restriction, you may have multiple passwords in your file if there are multiple MarkLogic users that you need to configure. Instead of storing these passwords in, you can retrieve them from a location where they’re encrypted using a Gradle credentials plugin.

All Blog Articles

Getting Started with Apache Nifi: Migrating from Relational to MarkLogic

Apache NiFi introduces a code-free approach of migrating content directly from a relational database system into MarkLogic. Here we walk you through getting started with migrating data from a relational database into MarkLogic

All Blog Articles

Sign up for a Demo

Don’t waste time stitching together components. MarkLogic combines the power of a multi-model database, search, and semantic AI technology in a single platform with mastering, metadata management, government-grade security and more.

Request a Demo