Tokenizing a String

Over the past while I've been working a lot with text strings. Some of these have been short in length like tweets from Twitter, or longer pieces of text like product reviews. Plus others of various lengths.

In all these scenarios I have to break up the data into individual works or Tokens.

The examples given below illustrate how you can take a string and break it into its individual tokens. In addition to tokenising the string I've also included some code to remove any special characters that might be included with the string.

These include ? # $ . ; : &

This list of special characters to ignore are just an example and is not an exhaustive list. You can add whatever characters to the list yourself. To remove these special characters I've used regular expressions as this seemed to be the easiest way to do this.

Using PL/SQL

The following example shows a simple PL/SQL unit that will tokenise a string.

DECLARE

vDelimiter VARCHAR2(5) := ' ';

vString VARCHAR2(32767) := 'Hello Brendan How are you today?'||vDelimiter;

vPosition PLS_INTEGER;

vToken VARCHAR2(32767);

vRemove VARCHAR2(100) := '\?|\#|\$|\.|\,|\;|\:|\&';

vReplace VARCHAR2(100) := '';

BEGIN

dbms_output.put_line('String = '||vString);

dbms_output.put_line('');

dbms_output.put_line('Tokens');

dbms_output.put_line('------------------------');

vPosition := INSTR(vString, vDelimiter);

WHILE vPosition > 0 LOOP

vToken := LTRIM(RTRIM(SUBSTR(vString, 1, vPosition-1)));

vToken := regexp_replace(vToken, vRemove, vReplace);

vString := SUBSTR(vString, vPosition + LENGTH(vDelimiter));

dbms_output.put_line(vPosition||': '||vToken);

vPosition := INSTR(vString, vDelimiter);

END LOOP;

END;

When we run this (with Serveroutput On) we get the following output.

A slight adjustment is needed to the output of this code to remove the numbers or positions of the token separator/delimiter.

Tokenizer using a Function

To make this more usable we will really need to convert this into an iterative function. The following code illustrates this, how to call the function and what the output looks like.

CREATE OR replace TYPE token_list

AS TABLE OF VARCHAR2(32767);

CREATE OR replace FUNCTION TOKENIZER(pString IN VARCHAR2,

pDelimiter IN VARCHAR2)

RETURN token_list pipelined

vPosition INTEGER;

vPrevPosition INTEGER := 1;

vRemove VARCHAR2(100) := '\?|\#|\$|\.|\,|\;|\:|\&';

vReplace VARCHAR2(100) := '';

vString VARCHAR2(32767) := regexp_replace(pString, vRemove, vReplace);

BEGIN

LOOP

vPosition := INSTR (vString, pDelimiter, vPrevPosition);

IF vPosition = 0 THEN

pipe ROW (SUBSTR(vString, vPrevPosition ));

EXIT;

ELSE

pipe ROW (SUBSTR(vString, vPrevPosition, vPosition - vPrevPosition ));

vPrevPosition := vPosition + 1;

END IF;

END LOOP;

END TOKENIZER;

Here are a couple of examples to show how it works and returns the Tokens.

SELECT column_value TOKEN

FROM TABLE(tokenizer('It is a hot and sunny day in Ireland.', ' '))

, dual;

How if we add in some of the special characters we should see a cleaned up set of tokens.

SELECT column_value TOKEN

FROM TABLE(tokenizer('$$$It is a hot and sunny day in #Ireland.', ' '))

, dual;

Brendan Tierney - Oralytics Blog

Pages

Thursday, July 24, 2014

Tokenizing a String

No comments:

Post a Comment