Ingesting Delimited Text with MLCP


Customers often run into the exception “invalid char between encapsulated token and delimiter” when they ingest delimited text into MarkLogic Server using the MarkLogic Content Pump (MLCP). The message sounds technical and hard to understand, so what exactly is wrong with the data? Luckily, it’s not too hard to figure out how the error happens and how to solve it.

What does the exception mean?

“Invalid char between encapsulated token and delimiter” means that there are invalid characters between an encapsulator and a delimiter. So what is an encapsulator? Put simply, it is the character used to wrap a CSV field (column) that may contain special characters, such as line breaks. In most cases, people use double quotes as the encapsulator.
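To make the idea concrete, here is a minimal sketch using Python's standard `csv` module rather than MLCP itself. The sample data is made up for illustration: the second field of the data row contains a line break, so it is wrapped in double quotes.

```python
import csv
import io

# Hypothetical sample: the second field of the data row contains a line break,
# so it is wrapped in double quotes (the encapsulator).
sample = 'id|note\n1|"line one\nline two"\n'

# newline="" leaves line endings untouched so the reader sees them as written.
reader = csv.reader(io.StringIO(sample, newline=""), delimiter="|", quotechar='"')
rows = list(reader)

for row in rows:
    print(row)
```

The reader returns two records, not three: the quoted line break stays inside the field instead of ending the record.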

For more details about encapsulators, refer to the Internet Engineering Task Force (IETF) specification for CSV files, RFC 4180.

What makes characters invalid?

Let’s explain with a delimited text example. Consider the following scenarios:

  1. "foo"| "bar" | "foo"
  2. "foo"| X "bar" | "foo"
  3. "foo" |X "bar" Y | "foo"
  4. "foo" | "bar" Y |"foo"

Here, the delimiter is the pipe character (|) and the encapsulator is a double quote. In rows 2, 3 and 4, there are stray characters (X and Y) between a delimiter and an encapsulator, and this is where things go wrong. According to the IETF standard, those fields are not in a valid format for delimited text, which can result in errors:

“Each field may or may not be enclosed in double quotes. If fields are not enclosed with double quotes, then double quotes may not appear inside the fields.”

Which brings up two points:

  1. The double quotes are only used to enclose the whole column, not several characters in the column.
  2. If you don’t enclose the column (field) with double quotes, then double quote(s) should not be present inside the field.
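These two points can be turned into a small pre-flight check. The sketch below is not part of MLCP; the function names are hypothetical, and it assumes that whitespace around a field is tolerated, which is a lenient reading of the rule. It reports which fields in a row break the rule.

```python
def split_fields(line, delimiter="|", quote='"'):
    """Split a line on delimiters that fall outside quoted runs."""
    fields, current, in_quotes = [], [], False
    for ch in line:
        if ch == quote:
            # A doubled quote toggles twice, so the state ends up unchanged.
            in_quotes = not in_quotes
            current.append(ch)
        elif ch == delimiter and not in_quotes:
            fields.append("".join(current))
            current = []
        else:
            current.append(ch)
    fields.append("".join(current))
    return fields


def is_valid_field(field, quote='"'):
    """Check one field: either no quotes at all, or fully enclosed in quotes."""
    text = field.strip()  # assumption: surrounding whitespace is tolerated
    if quote not in text:
        return True
    if len(text) < 2 or not (text.startswith(quote) and text.endswith(quote)):
        return False
    inner = text[1:-1]
    # Inside an enclosed field, every quote must be doubled.
    return quote not in inner.replace(quote * 2, "")


def invalid_field_indexes(line, delimiter="|", quote='"'):
    """Return the zero-based indexes of fields that break the rule."""
    fields = split_fields(line, delimiter, quote)
    return [i for i, field in enumerate(fields) if not is_valid_field(field, quote)]


rows = [
    '"foo"| "bar" | "foo"',
    '"foo"| X "bar" | "foo"',
    '"foo" |X "bar" Y | "foo"',
    '"foo" | "bar" Y |"foo"',
]
for number, row in enumerate(rows, start=1):
    print(number, invalid_field_indexes(row))
```

Under these assumptions, row 1 passes and rows 2, 3 and 4 each report their second field (index 1) as invalid.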

Strictly speaking, rows 2, 3 and 4 should all be rejected by CSVParser as invalid CSV records. In practice, the CSVParser that MLCP currently uses can handle cases 2 and 3 and parses them without any issue. It is not able to deal with case 4, however, and throws an exception with the message “invalid char between encapsulated token and delimiter”.

How to work around the exception?

The best way to get around this exception is to avoid having malformed CSV data in the first place. If that is not possible and you really want the double quotes to be part of the string, enclose the whole field in double quotes and escape each double quote inside it. But remember, in CSV you must escape a double quote with another double quote! Your CSV data will look something like this:

  1. "foo" | """bar"" Y" |"foo"
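If you generate the delimited file yourself, one way to avoid escaping by hand is to let a CSV library do the quoting. The sketch below uses Python's standard `csv` module, which is an assumption about your tooling and not something MLCP requires. The `write_row` helper is hypothetical. It encloses every field in double quotes and doubles any double quote inside a field.

```python
import csv
import io


def write_row(fields, delimiter="|"):
    """Serialize one record, quoting every field and doubling inner quotes."""
    buffer = io.StringIO()
    writer = csv.writer(
        buffer,
        delimiter=delimiter,
        quotechar='"',
        quoting=csv.QUOTE_ALL,  # enclose every field in double quotes
        lineterminator="\n",
    )
    writer.writerow(fields)
    return buffer.getvalue()


# The middle value really contains double quotes: "bar" Y
line = write_row(["foo", '"bar" Y', "foo"])
print(line, end="")

# Reading the line back recovers the original values.
parsed = next(csv.reader([line], delimiter="|"))
print(parsed)
```

The written line is `"foo"|"""bar"" Y"|"foo"`, and reading it back yields the original three values, including the quotes around bar.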