Wednesday, July 30, 2014

Tokenizing a String : Using Regular Expressions

In my previous blog post I gave some PL/SQL that performed the tokenising of a string. Check out this blog post here.

Thanks also to the people who sent me links examples of how to tokenise a string using the MODEL clause. Yes there are lots of examples of this out there on the interest.

While performing the various searches on the internet I did come across some examples of using Regular Expressions to extract the tokens. The following example is thanks to a blog post by Tanel Poder

I've made some minor changes to it to remove any of the special characters we want to remove.

column token format a40

define separator=" "

define mystring="$My OTN LA Tour (2014?) will consist of Panama, CostRica and Mexico."

define myremove="\?|\#|\$|\.|\,|\;|\:|\&|\(|\)|\-";

 

SELECT regexp_replace(REGEXP_REPLACE(

                 REGEXP_SUBSTR( '&mystring'||'&separator', '(.*?)&separator', 1, LEVEL )

                           , '&separator$', ''), '&myremove', '') TOKEN

FROM

    DUAL

CONNECT BY

    REGEXP_INSTR( '&mystring'||'&separator', '(.*?)&separator', 1, LEVEL ) > 0

ORDER BY

LEVEL ASC

/

When we run this code we get the following output.

Token fun 3

So we have a number of options open to use to tokenise strings using SQL and PL/SQL, using a number of approaches including substring-ing, using pipelined functions, using the Model clause and also using Regular Expressions.

Monday, July 28, 2014

BUCKET_WIDTH: Calculating the size of the bucket

Some time ago I had some blog posts introducing some of the basic Statistical function available in Oracle. Here are the links to these.

Most people do not realise that Oracle has over 250+ statistical functions that are available (no addition cost) in all the database versions.

I've had a query about one of the functions BUCKET_WIDTH. The question was wondering if it was possible to get the width of the bucket in each case. There does not seem to be a build in feature to get this value, so we have to calculate this ourselves.

Here is an example of how to calculate the bucket width, as on the example I used in my previous blog post.

SELECT bucket, max(age)-min(age) BUCKET_WIDTH, count(*)

FROM (SELECT cust_id,

                       age,

                      width_bucket(age,

                         (SELECT min(age) from mining_data_build_v),

                         (select max(age)+1 from mining_data_build_v),

                      10) bucket

          FROM mining_data_build_v

          GROUP BY cust_id, age )

GROUP BY bucket

ORDER BY bucket;

Bucket width

What this query gives is an approximate value of the size of the Bucket Width based on the values/records that are in a bucket. The actual values used cannot be determined exactly as there is not function/value in SQL that tells us the actual value.

Thursday, July 24, 2014

Tokenizing a String

Over the past while I've been working a lot with text strings. Some of these have been short in length like tweets from Twitter, or longer pieces of text like product reviews. Plus others of various lengths.

In all these scenarios I have to break up the data into individual works or Tokens.

The examples given below illustrate how you can take a string and break it into its individual tokens. In addition to tokenising the string I've also included some code to remove any special characters that might be included with the string.

These include ? # $ . ; : &

This list of special characters to ignore are just an example and is not an exhaustive list. You can add whatever characters to the list yourself. To remove these special characters I've used regular expressions as this seemed to be the easiest way to do this.

Using PL/SQL

The following example shows a simple PL/SQL unit that will tokenise a string.

DECLARE

    vDelimiter VARCHAR2(5) := ' ';

    vString    VARCHAR2(32767) := 'Hello Brendan How are you today?'||vDelimiter;

    vPosition   PLS_INTEGER;

    vToken     VARCHAR2(32767);

    vRemove    VARCHAR2(100) := '\?|\#|\$|\.|\,|\;|\:|\&';

    vReplace   VARCHAR2(100) := '';

BEGIN

    dbms_output.put_line('String = '||vString);

    dbms_output.put_line('');

    dbms_output.put_line('Tokens');

    dbms_output.put_line('------------------------');

   

    vPosition := INSTR(vString, vDelimiter);

   

    WHILE vPosition > 0 LOOP

   

       vToken := LTRIM(RTRIM(SUBSTR(vString, 1, vPosition-1)));

       vToken := regexp_replace(vToken, vRemove, vReplace);

   

       vString := SUBSTR(vString, vPosition + LENGTH(vDelimiter));

       dbms_output.put_line(vPosition||': '||vToken);

       vPosition := INSTR(vString, vDelimiter);

   

    END LOOP;

END;

/

When we run this (with Serveroutput On) we get the following output.

Toke PLSQL

A slight adjustment is needed to the output of this code to remove the numbers or positions of the token separator/delimiter.

Tokenizer using a Function

To make this more usable we will really need to convert this into an iterative function. The following code illustrates this, how to call the function and what the output looks like.

CREATE OR replace TYPE token_list

AS TABLE OF VARCHAR2(32767);

/

 

CREATE OR replace FUNCTION TOKENIZER(pString IN VARCHAR2,

                                     pDelimiter IN VARCHAR2)

RETURN token_list pipelined

AS

    vPosition       INTEGER;

    vPrevPosition   INTEGER := 1;

    vRemove         VARCHAR2(100) := '\?|\#|\$|\.|\,|\;|\:|\&';

    vReplace        VARCHAR2(100) := '';

    vString         VARCHAR2(32767) := regexp_replace(pString, vRemove, vReplace);

BEGIN

    LOOP

       vPosition := INSTR (vString, pDelimiter, vPrevPosition);

       IF vPosition = 0 THEN

          pipe ROW (SUBSTR(vString, vPrevPosition ));

          EXIT;

       ELSE

          pipe ROW (SUBSTR(vString, vPrevPosition, vPosition - vPrevPosition ));

          vPrevPosition := vPosition + 1;

       END IF;

    END LOOP;

END TOKENIZER;

/

Here are a couple of examples to show how it works and returns the Tokens.

SELECT column_value TOKEN

FROM TABLE(tokenizer('It is a hot and sunny day in Ireland.', ' '))

, dual;

Token fun 1

How if we add in some of the special characters we should see a cleaned up set of tokens.

SELECT column_value TOKEN

FROM TABLE(tokenizer('$$$It is a hot and sunny day in #Ireland.', ' '))

, dual;

Token fun 2

Thursday, July 17, 2014

OTN Latin America (North) Tour 2014

For a few years now I (and I'm sure you have too) have heard about and followed the various Oracle User Group tours that OTN arranges/facilitates. A tour consists of a number of Oracle User Groups in a region coordinating together to have their conferences organised so that they can get speakers from across the world to come and present.

For most presenters it involves lots of travel. So instead of them doing all that travelling to present at one conference, they can now extend their travels a little and present in a number of countries. Most of the speakers are Oracle ACE Directors and OTN is very generous with their support in that they pay for all the flights, transportation and hotels. Without the generous support of OTN these tours and perhaps many of the conference would not take place.

With envy I used to follow the various speakers on tweeter as they talked about their travels from country to country and their experiences of meeting the people and exploring the various countries. Yes their time in each country seemed to be limited but they always got to see and do so much.

Earlier this year there was an call for presentations for the various OTN Tours in 2014. I submitted 3 presentations that coverd Oracle Advanced Analytics Option (Oracle Data Mining and Oracle R Enterprise). I thought I didn't stand a chance given the speakers that have participated in previous years.

A couple of weeks ago I received an email saying that I had been accepted onto the OTN Latin America (North) Tour. So you can imagine my excitement. The full OTN Tour North leg covers a number of countries across central and south America and is over a 2 week period. Unfortunately I'm not able to be away for that long, so I was accepted for the conferences on the first week of the tour. This will include Panama, Costa Rica and Mexico :-)

Some of you might think this is a bit of a golly and a holiday. What I've discovered over the past week or more is that it will be far from that. There is a lot of work in preparing the presentations, giving the presentation, setting up live demos between presentation, various meetings with people at the conferences etc etc etc. Then there is all the travel, all the airports, all the airport transfers, all the overnights in hotels. Over the course of 7 days I will be staying 6 different hotels.

I have spent the last week just trying to arrange my flights and hotels. This also involved trying to coordinate with other speakers so that we can travel together as much as possible.

Here are the dates and the presentations that I will be giving at these conferences:

4th August : Panama (in Panama City)

     10:00-11:00 : Getting Started with Oracle Data Miner & Predictive Analytics

     11:00-12:00 : Combining the Power of R and in-Datbaase Data Mining. Running R in the Database. Seriously!

     13:00-13:40 : Sentiment Analysis Using Oracle Data Mining

6th August : Costa Rica (in San Carlos)

     10:00-11:00 : Getting Started with Oracle Data Miner & Predictive Analytics

     13:00-14:00 : Sentiment Analysis Using Oracle Data Mining

     16:00-17:00 : Combining the Power of R and in-Datbaase Data Mining. Running R in the Database. Seriously!

8th August : Mexico (in Mexico City)

     14:00-15:00 : Getting Started with Oracle Data Miner & Predictive Analytics

     15:00-16:00 : Combining the Power of R and in-Datbaase Data Mining. Running R in the Database. Seriously!

When the agenda for the conferences are available I will have another blog post with their details.

If you are at one of these conference do please say hello :-)

I've finally booked all my flights and hotels. Many thanks to my fellow ACE Director presenters for your research and sharing of travel plans. It looks like there will be a groups of us all travelling together.

Now the next challenge is to prepare the presentations and live demos (yes live demos).

I hope to blog about each of the conferences and my travels to/from each country. It really depends on what time I will have and access to the internet. Perhaps this is something I will try to do on my various plane flights or waiting at the airports. So watch out for these :-)


Updated with some stats on my travels

My travel plans for the OTN Latin America tour of user group conferences involves

  • 12,200 flying miles,
  • 29.75 of flying time,
  • way too many hours hanging around in airports
  • over 8 days
  • staying in 6 hotels
  • plus 1 over night flight,
  • giving 8 hours of presentations in 3 countries

Why do we do this? Because we love sharing with the Oracle User Groups around the world. I'm only doing 1 week of the tour. Some people are doing 2 weeks :-(