I am frequently asked about using MLCP to load documents with whitespace in their filenames or paths, since well-formed database document URIs may not have whitespaces. Let’s discuss the impact of loading documents with whitespace and ways to handle the whitespaces.
Let’s suppose we have a directory on the filesystem called “white space dir” and we want to load files in that directory into MarkLogic. This is the example setup:
/
  tmp/
    blog/
      white space dir/
        sample.json
I can load that with MLCP:
$ ~/software/mlcp-8.0-4/bin/mlcp.sh import -username admin -password admin -host localhost -port 8000 -input_file_path "/tmp/blog"
Loading the example document using MLCP will result in the following document URI: “/tmp/blog/white%20space%20dir/sample.json”. Why? According to the section on Character Encoding of URIs in the MLCP User Guide, well-formed database document URIs may not have whitespaces and therefore MLCP automatically encodes the illegal whitespaces using %20.
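To see ahead of time which URI a given file will end up with, the space handling described above can be mimicked in the shell. This is only a sketch: `predict_uri` is a hypothetical helper, and it assumes spaces are the only characters that need encoding. The MLCP User Guide remains the authority on the full encoding rules.

```shell
#!/bin/sh
# Hypothetical helper (not part of MLCP): predict the database URI for a file path.
# Assumption: only spaces are rewritten; any other character MLCP may encode is ignored.
predict_uri() {
  printf '%s\n' "$1" | sed 's/ /%20/g'
}

predict_uri "/tmp/blog/white space dir/sample.json"
```

For the example setup this prints the same URI that MLCP produced.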
Let’s think about this issue some more. Suppose we load the example document as is and then build an application that expects to find documents based on their paths on the filesystem, using the following call:
fn:doc('/tmp/blog/white space dir/sample.json')
The URI based on the filesystem doesn’t match any URI in the database, so we won’t get any results.
What if we build an application that tries to access our document through the REST API? The following call will return a 404 response:
http://localhost:8000/v1/documents?uri=/tmp/blog/white space dir/sample.json

<error-response xmlns="http://marklogic.com/xdmp/error">
  <status-code>404</status-code>
  <status>Not Found</status>
  <message-code>RESTAPI-NODOCUMENT</message-code>
  <message>RESTAPI-NODOCUMENT: (err:FOER0000) Resource or document does not exist: category: content message: /tmp/blog/white space dir/sample.json</message>
</error-response>
There are several things to consider when deciding how to handle the whitespaces. Are you able to change the paths on the filesystem? Is it better for your application to use another character like a dash or to use the encoded whitespace? If you leave the whitespaces encoded, is there a way to work around the encoding? Read on for the various methods.
If changing the filesystem is an option, the easiest solution is simply to rename the paths so that they contain no whitespace. MLCP then has nothing to encode, and the in-database URIs match the filesystem paths.
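If there are many files, the renaming can be scripted. The following is a minimal sketch rather than a hardened tool: `dash_rename` is a hypothetical helper, the choice of a dash is an assumption, and it assumes no filename contains a newline. It skips a rename when the target name already exists instead of overwriting it.

```shell
#!/bin/sh
# Hypothetical helper: under the given root, rename every file and directory whose
# name contains a space, replacing each run of spaces with a single dash.
dash_rename() {
  # -depth lists children before their parents, so renaming a directory never
  # invalidates a path that find has not handed to the loop yet.
  find "$1" -depth -name '* *' | while IFS= read -r path; do
    dir=$(dirname "$path")
    base=$(basename "$path")
    new=$(printf '%s' "$base" | sed 's/  */-/g')
    if [ -e "$dir/$new" ]; then
      # Do not clobber an existing file or directory with the dashed name.
      echo "skipping $path: $dir/$new already exists" >&2
      continue
    fi
    mv -- "$path" "$dir/$new"
  done
}
```

Running `dash_rename /tmp/blog` before the import would turn “white space dir” into “white-space-dir”. Note that the root directory itself is also renamed if its own name contains a space.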
Another way to adapt to having whitespaces in the filesystem paths is to transform the whitespaces during the load. MLCP lets us execute a write transform. We’ll start by writing an MLCP transform that converts the spaces into dashes. Note that by the time the transform runs, the spaces have already been encoded as “%20”.
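xquery version "1.0-ml";

module namespace space = "http://marklogic.com/transform/space-to-dash";

(: Rewrite the document URI before insert: each run of encoded spaces becomes one dash.
   The pattern is "(%20)+" so the quantifier applies to the whole "%20" token;
   "%20+" would only repeat the trailing "0". :)
declare function space:transform(
  $content as map:map,
  $context as map:map
) as map:map*
{
  map:put(
    $content,
    "uri",
    fn:replace(map:get($content, "uri"), "(%20)+", "-")),
  $content
};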
There are multiple ways to deploy our transform. We will do it through the REST API.
$ curl --anyauth --user admin:admin -X PUT -i --data-binary @"./space-to-dash.xqy" -H "Content-type: application/xquery" 'http://localhost:8000/v1/ext/mlcp/space-to-dash.xqy'
Now we can call the transform as we load our data:
$ ~/software/mlcp-8.0-4/bin/mlcp.sh import -username admin -password admin -host localhost -port 8000 -input_file_path "/tmp/blog" -transform_module /ext/mlcp/space-to-dash.xqy -transform_namespace http://marklogic.com/transform/space-to-dash
The result is a URI that is accessible without encoding: “/tmp/blog/white-space-dir/sample.json”.
Application development is simpler if you avoid encoded URIs, but there may be cases where we need to leave the URIs with whitespaces encoded. If using encoded URIs is a requirement, apply the necessary encoding when requesting a document:
fn:doc(xdmp:url-encode('/tmp/blog/white space dir/sample.json', fn:true()))
We can take the same approach when using the REST API, but an extra step is required. If we ask for “/v1/documents?uri=/tmp/blog/white%20space%20dir/sample.json”, normal URL decoding turns that back into “/tmp/blog/white space dir/sample.json” (resulting in a 404 response). However, if we encode the % signs themselves (as %25), the decoded value is the stored URI and the request works:
http://localhost:8000/v1/documents?uri=/tmp/blog/white%2520space%2520dir/sample.json
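The two layers of encoding can be applied mechanically. The sketch below builds the request URL from a filesystem path. `rest_url_for` is a hypothetical helper; the host and port are the ones used in this post, and it assumes spaces are the only characters MLCP encoded in the stored URI.

```shell
#!/bin/sh
# Hypothetical helper: build a /v1/documents request URL for a filesystem path
# whose spaces are stored as a literal "%20" in the database URI.
rest_url_for() {
  # First expression: space -> %20, which reproduces the stored URI.
  # Second expression: % -> %25, so the literal percent sign survives URL decoding.
  encoded=$(printf '%s' "$1" | sed -e 's/ /%20/g' -e 's/%/%25/g')
  printf 'http://localhost:8000/v1/documents?uri=%s\n' "$encoded"
}

rest_url_for "/tmp/blog/white space dir/sample.json"
```

The printed URL can be passed to curl with the same `--anyauth --user` options used earlier. As an alternative, curl's `-G` and `--data-urlencode` options can add the second layer of encoding for you when given the stored URI as the `uri` value.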
Encoding means that the URI change has effects throughout the application, which is not ideal. For cases where external systems need to address documents directly with predicted URIs, it is necessary either to adjust the predicted URIs (by changing the filesystem before the load) or to encode in the application. For other applications, however, an alternative is to rely on search functionality instead of the original URIs. When you search in MarkLogic, you are given access to the URI of each matching document. If your application can rely on discovery instead, the problem goes away.
When searching with cts:search(), you can call xdmp:node-uri() on any result to get the document URI. With the Search API or REST API, the response includes the URI of each search result. This method guarantees the correct document URIs, regardless of the original filesystem path or filename.