Thursday, July 24, 2014

Tokenizing a String

Over the past while I've been working a lot with text strings. Some of these have been short in length like tweets from Twitter, or longer pieces of text like product reviews. Plus others of various lengths.

In all these scenarios I have to break up the data into individual works or Tokens.

The examples given below illustrate how you can take a string and break it into its individual tokens. In addition to tokenising the string I've also included some code to remove any special characters that might be included with the string.

These include ? # $ . ; : &

This list of special characters to ignore are just an example and is not an exhaustive list. You can add whatever characters to the list yourself. To remove these special characters I've used regular expressions as this seemed to be the easiest way to do this.

Using PL/SQL

The following example shows a simple PL/SQL unit that will tokenise a string.

DECLARE

    vDelimiter VARCHAR2(5) := ' ';

    vString    VARCHAR2(32767) := 'Hello Brendan How are you today?'||vDelimiter;

    vPosition   PLS_INTEGER;

    vToken     VARCHAR2(32767);

    vRemove    VARCHAR2(100) := '\?|\#|\$|\.|\,|\;|\:|\&';

    vReplace   VARCHAR2(100) := '';

BEGIN

    dbms_output.put_line('String = '||vString);

    dbms_output.put_line('');

    dbms_output.put_line('Tokens');

    dbms_output.put_line('------------------------');

   

    vPosition := INSTR(vString, vDelimiter);

   

    WHILE vPosition > 0 LOOP

   

       vToken := LTRIM(RTRIM(SUBSTR(vString, 1, vPosition-1)));

       vToken := regexp_replace(vToken, vRemove, vReplace);

   

       vString := SUBSTR(vString, vPosition + LENGTH(vDelimiter));

       dbms_output.put_line(vPosition||': '||vToken);

       vPosition := INSTR(vString, vDelimiter);

   

    END LOOP;

END;

/

When we run this (with Serveroutput On) we get the following output.

Toke PLSQL

A slight adjustment is needed to the output of this code to remove the numbers or positions of the token separator/delimiter.

Tokenizer using a Function

To make this more usable we will really need to convert this into an iterative function. The following code illustrates this, how to call the function and what the output looks like.

CREATE OR replace TYPE token_list

AS TABLE OF VARCHAR2(32767);

/

 

CREATE OR replace FUNCTION TOKENIZER(pString IN VARCHAR2,

                                     pDelimiter IN VARCHAR2)

RETURN token_list pipelined

AS

    vPosition       INTEGER;

    vPrevPosition   INTEGER := 1;

    vRemove         VARCHAR2(100) := '\?|\#|\$|\.|\,|\;|\:|\&';

    vReplace        VARCHAR2(100) := '';

    vString         VARCHAR2(32767) := regexp_replace(pString, vRemove, vReplace);

BEGIN

    LOOP

       vPosition := INSTR (vString, pDelimiter, vPrevPosition);

       IF vPosition = 0 THEN

          pipe ROW (SUBSTR(vString, vPrevPosition ));

          EXIT;

       ELSE

          pipe ROW (SUBSTR(vString, vPrevPosition, vPosition - vPrevPosition ));

          vPrevPosition := vPosition + 1;

       END IF;

    END LOOP;

END TOKENIZER;

/

Here are a couple of examples to show how it works and returns the Tokens.

SELECT column_value TOKEN

FROM TABLE(tokenizer('It is a hot and sunny day in Ireland.', ' '))

, dual;

Token fun 1

How if we add in some of the special characters we should see a cleaned up set of tokens.

SELECT column_value TOKEN

FROM TABLE(tokenizer('$$$It is a hot and sunny day in #Ireland.', ' '))

, dual;

Token fun 2

No comments:

Post a Comment