How do you find and control access to Personally Identifiable Information (PII) and sensitive information in a MarkLogic database? You may be complying with regulation, internal policies, or simply trying to reduce risk.
In many situations, it may be apparent which elements of the document or JSON properties contain PII. For instance, an element called SSN or credit card number clearly indicates that this is sensitive information. However, what if you have too many data sources and unknown element names to describe an SSN? For example, you are implementing an Employee 360 application that ingests data from multiple systems. One system describes SSN as id, another uses SSN as primary key, another has SSN within text notes, etc. Therefore, our first step is finding the path to information that we want to protect.
MarkLogic has a powerful Universal Index that can help us find information in documents. If you can grab a list of known SSNs in the database, for instance a list from HR of some SSNs, you can easily use the number as input to word queries. Here is one example of a query searching for some SSNs.
let $pii := ("567-90-6071", "390-32-7638", "728-48-1630") let $docs:= cts:search(fn:doc(), cts:word-query($pii)) return fn:distinct-values(xdmp:path($docs//text()[fn:matches(fn:string(.), $pii )]/..))
This search works across XML and JSON, but the output is slightly different. Let’s say for instance that the function returns the following /employee/user/text("ID"), /id/number/text()
, /employee_table/key/text("primary")
. This tells me that I have Social Security numbers across my documents under the following paths:/employee/user/ID
, /id/number
, and /employee_table/key/primary
.
This is a fast and reliable way of finding the PII. However, what if I don’t know which values are classified as PII up-front?
For those cases, we will use two different techniques: pattern matching and sampling. Most PII and sensitive information follow a pattern such as a well-structured Social Security number. Therefore, you can use MarkLogic’s built-in capabilities to search for patterns. In the function below we are searching for an SSN pattern using fn:matches
and returning all the paths using xdmp:path
that match the pattern.
let $pattern :="^(d{3})([-.s]?)(d{2})2(d{4})$" xdmp:path(fn:collection("Employees")//*[fn:matches(fn:string(.), $pattern])
CAUTION: STOP, don’t try it! Notice that we are scanning all documents in the collection and the entire document structure using *//. This pattern matching will not leverage the indexes and is a brute force search. Consequently, this would take a long time if you do in a large database. This certainly would be okay if you are running against a small staging or QA database or if you need to guarantee that you find all occurrences of Social Security numbers in the production database in order to protect them.
Ok, we don’t want to scan the entire database! Let’s apply scoping and sampling techniques to reduce the number of documents that we search. First, begin by scoping the documents using collections. If you have multiple types of documents that represent logical entities in the database, focus on the ones that you know may contain PII. Second, use sampling.
The code below randomly samples 15% of the documents in the “Employee” collection and then searches for the SSN pattern, returning distinct paths
let $x := .15 (: 15% :) let $results := cts:search(fn:collection("Employee"), cts:true-query(), "score-random") let $estimate := xdmp:estimate(cts:search(fn:collection("Employee"), cts:true-query())) let $docs := $results[1 to fn:ceiling($estimate * $x)] let $pattern :="^(d{3})([-.s]?)(d{2})2(d{4})$" return fn:distinct-values(xdmp:path($docs//text()[fn:matches(fn:string(.), $pattern )]))
Now that we have found the paths to the PII, you can use MarkLogic’s document level security, Element Level security and Redaction to control access to it.
Document level security controls access to entire documents based on user roles. If you haven’t set document level permissions during ingestion, which is the preferred method, you can use the following function to do it on existing documents. In this example, only Employees can read the documents that match the collection search.
xdmp.documentAddPermissions(fn.collection(“Employee”), [xdmp.permission("Employee_role", "read")])
Element Level Security allows you to control access to paths in the document on real-time search, queries and updates based on user roles. The following Element Level security functions will ensure that only users with the HR_read role will be able to search or read Social Security numbers:
declareUpdate(); var sec = require("/MarkLogic/security.xqy"); const perm = xdmp.permission("HR_read", "read", "element"); let paths = ["//employee/user/ID", "//id/number","//employee_table/key/primary"]; for (const path of paths) { sec.protectPath(path, null, perm); }
Note: it is important to highlight that the XPath expression (//employee/user/ID
) allows some flexibility in the document structure because of the //
at the beginning. If your document structure changes to /data/employeedata/employee/user/ID
the XPath expression will still match and protect the SSN.
Redaction allows you to hide or mask information while you export it. The following rule, once applied using Redaction, will generate a fully random SSN when you export data, for instance to a QA or development instance. (one rule and path shown for brevity)
<rdt:rule xml:lang="zxx" xmlns:rdt="http://marklogic.com/xdmp/redaction"> <rdt:description>Mask SSNs</rdt:description> <rdt:path>//employee/user/ID</rdt:path> <rdt:method> <rdt:function>redact-us-ssn</rdt:function> </rdt:method> <rdt:options> <rdt:level>full-random</rdt:level> </rdt:options> </rdt:rule>
Redaction rules are simply documents that are added to a collection in the Schemas database. You call mlcp (MarkLogic Content Pump) to export data and use
CollectionName to apply all rules in a collection to the data as it is exported. Redaction also leverages XPath and has the same flexibility in the document structure.mlcp –redact
Once you set document and element level security, you can test to ensure that only the correct users have access to the PII. The following code will search for a particular SSN using different users. In our scenario, only HR_reader
(who has HR-read
role) will find a match.
'use strict'; /** Invokes any function as a specific user */ const asUser = (fn, user) => xdmp.invokeFunction(fn, { userId: xdmp.user(user) }); /** The actual function that does the work. */ function _readDocs(pii) { let docs= cts.search(cts.wordQuery(pii)); var paths=[]; for (var doc of docs) { var matched = doc.xpath("//text()"); for (var node of matched) { if (fn.matches(fn.string(node), pii)) { paths.push(xdmp.path(node)); } } } fn.distinctValues(xdmp.arrayValues(paths)); } /** A new synthetic function that combines the actual function and the user switching */ const readDocs = (pii, user) => asUser( (/* zero params*/) => _readDocs(pii), user ); const users = ['Employee_reader', 'HR_reader', 'Contractor_reader']; const pii ="898-31-4408"; const out = []; let i = 0; for (const result of users.map(user => readDocs(pii, user))) { out.push(`------------ ${users[i++]} ------------`); out.push(result); } out;
We learned how MarkLogic not only helps you find PII in your documents but also allows real-time access control access to PII, both at the document level and sub-document level, using element level security. Besides real-time control, MarkLogic also helps you to create data exports that are safe for developers or data scientists to use, as you can easily hide, mask or create sample data to replace PII.
Like what you just read, here are a few more articles for you to check out or you can visit our blog overview page to see more.
In this post, we dive into building a full five-card draw poker game with a configurable number of players. Written in XQuery 1.0, along with MarkLogic extensions to the language, this game provides examples of some great programming capabilities, including usage of maps, recursions, random numbers, and side effects. Hopefully, we will show those new to XQuery a look at the language that they may not get to see in other tutorials or examples.
If you are getting involved in a project using ml-gradle, this tip should come in handy if you are not allowed to put passwords (especially the admin password!) in plain text. Without this restriction, you may have multiple passwords in your gradle.properties file if there are multiple MarkLogic users that you need to configure. Instead of storing these passwords in gradle.properties, you can retrieve them from a location where they’re encrypted using a Gradle credentials plugin.
Apache NiFi introduces a code-free approach of migrating content directly from a relational database system into MarkLogic. Here we walk you through getting started with migrating data from a relational database into MarkLogic
Don’t waste time stitching together components. MarkLogic combines the power of a multi-model database, search, and semantic AI technology in a single platform with mastering, metadata management, government-grade security and more.
Request a Demo