A new Oracle Data Mining algorithm in Oracle Database 12c Release 2 (12.2) is called Explicit Semantic Analysis (ESA).
[The following examples are built using Oracle Data Miner 4.2 (SQL Developer 4.2) and the Oracle 12.2 Database cloud service (extreme edition) ]
The Explicit Semantic Analysis algorithm is an unsupervised algorithm used for feature extraction. ESA does not discover latent features but instead uses explicit features based on an existing knowledge base. There is no setup or install necessary to use this algorithm. All you need is a licence for the Advanced Analytics Option for the database.
When you compare texts using the ESA model (and the training data set it was built on), the output is a similarity measure that indicates how similar or dissimilar the input texts are.
Let us look at an example.
Set up training data for ESA Algorithm
Oracle Data Miner 4.2 (that comes with SQL Developer 4.2) includes a Wiki data set from 2005. This contains over 200,000 features. To locate the file, go to:
...\sqldeveloper\dataminer\scripts\instWikiSampleData.sql
This file contains the DDL and the insert statements for the Wiki data set.
After you run this script, a new table called WIKISAMPLE exists and is populated with records.
This gives us the base/seed data set to feed into the ESA algorithm.
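A quick sanity check after running the script is to confirm that the table is populated. This is a minimal sketch that assumes the script was run in the schema you are currently connected to; it relies only on the table name, not on any column names.

```sql
-- Confirm that the install script created and populated the table.
-- Assumes WIKISAMPLE lives in the current schema.
SELECT COUNT(*) AS wiki_rows
FROM   wikisample;
```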
Create the ESA Model using ODMr
There are two ways to create the ESA model. In this blog post I'll show you the easiest way, using the Oracle Data Miner (ODMr) tool. I'll have another blog post that will show you the SQL needed to create the model.
In an ODMr workflow create a new Data Source node. Then set this node to have the WIKISAMPLE table as its data source.
Next you need to create the ESA node on the workflow. This node can be found in the Models section of the Workflow Editor. The node is called Explicit Feature Extraction. Click on this node in the Models section, then move your mouse to your workflow and click again. The ESA node will be created.
Join the Data node to the ESA node by right-clicking on the Data node and then clicking on the ESA node.
Double-click on the ESA node to edit the properties of the node and the algorithm. When you are happy with the settings, run the node to build the model.
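For comparison with the ODMr steps above, the SQL route might look roughly like the following. This is only a sketch under assumptions, not the code ODMr generates: the names ESA_SETTINGS_SKETCH, ESA_POLICY_SKETCH and ESA_MODEL_SKETCH are hypothetical, and I am assuming that WIKISAMPLE has a unique TITLE column to act as the case ID and a TEXT column holding the document text. It also assumes your user has been granted execute rights on CTX_DDL.

```sql
-- Sketch only. ESA_SETTINGS_SKETCH, ESA_POLICY_SKETCH and ESA_MODEL_SKETCH
-- are hypothetical names; the TITLE and TEXT columns of WIKISAMPLE are assumed.

-- 1. Settings table that selects the ESA algorithm.
CREATE TABLE esa_settings_sketch (
  setting_name  VARCHAR2(30),
  setting_value VARCHAR2(4000)
);

BEGIN
  INSERT INTO esa_settings_sketch (setting_name, setting_value)
  VALUES (dbms_data_mining.algo_name,
          dbms_data_mining.algo_explicit_semantic_analys);
  INSERT INTO esa_settings_sketch (setting_name, setting_value)
  VALUES (dbms_data_mining.prep_auto,
          dbms_data_mining.prep_auto_on);
  COMMIT;
END;
/

-- 2. Oracle Text policy used to tokenise the text column.
BEGIN
  ctx_ddl.create_policy('ESA_POLICY_SKETCH');
END;
/

-- 3. Build the feature extraction model, marking TEXT as a text attribute.
DECLARE
  v_xlst dbms_data_mining_transform.transform_list;
BEGIN
  dbms_data_mining_transform.set_transform(
    v_xlst, 'TEXT', NULL, 'TEXT', 'TEXT',
    'TEXT(POLICY_NAME:ESA_POLICY_SKETCH)');

  dbms_data_mining.create_model(
    model_name          => 'ESA_MODEL_SKETCH',
    mining_function     => dbms_data_mining.feature_extraction,
    data_table_name     => 'WIKISAMPLE',
    case_id_column_name => 'TITLE',
    settings_table_name => 'ESA_SETTINGS_SKETCH',
    xform_list          => v_xlst);
END;
/
```

The ODMr node takes care of the equivalent of all three steps for you, which is why it is the easier route.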
Explore the ESA Model and ESA Model Features
After the model node has finished running, you can explore the results generated by the ESA model. Right-click on the model node and select 'View Model'. The model properties window opens and it has two main tabs. The first of these is the Coefficients tab. Here you can select a particular topic (click on the search icon beside the Feature ID) and select it from the list. The attributes and their coefficient values will be displayed.
Next you can examine the second tab, which is labeled Features. In this tab you can select a particular record and have a tag cloud and coefficients displayed. The tag cloud is a great way to see visually which words are important.
How to use the ESA model to Compare new data using SQL
Now that we have the ESA model created, we can use the model to compare other similar sets of documents.
You will need to use the FEATURE_COMPARE SQL function to evaluate the input texts, using the ESA model to compare them for similarity. For example:
SELECT FEATURE_COMPARE(feat_esa_1_1
         USING 'Oracle Database is the best available for managing your data' text
         AND USING 'The SQL language is the one language that all databases have in common' text) similarity
FROM   DUAL;
The result we get is 0.7629.
The result generated by the query is a similarity measure. The FEATURE_COMPARE function returns a number in the range 0 to 1, where 0 indicates that the texts are not similar or related, and 1 indicates that the texts are very similar or very related.
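The same function can be applied to a column rather than to two literals, which lets you rank stored documents against a reference text. The following is a sketch only: the table MY_DOCUMENTS and its columns DOC_ID and DOC_TEXT are hypothetical, while feat_esa_1_1 is the model name used in the query above.

```sql
-- MY_DOCUMENTS, DOC_ID and DOC_TEXT are hypothetical names for illustration.
-- Each row's text is compared with the reference sentence using the ESA model,
-- and the most similar documents are listed first.
SELECT d.doc_id,
       FEATURE_COMPARE(feat_esa_1_1
         USING 'Oracle Database is the best available for managing your data' text
         AND USING d.doc_text text) similarity
FROM   my_documents d
ORDER  BY similarity DESC;
```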
You can use this returned value to decide what the next step should be in your workflow, and you can easily write application logic to manage this.
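One simple way to turn the score into a decision is a CASE expression. This is a minimal sketch: the cut-off values 0.75 and 0.40 and the labels are arbitrary assumptions chosen for illustration, and you would need to tune them for your own data.

```sql
-- Thresholds (0.75, 0.40) and labels are hypothetical; tune them for your data.
SELECT s.similarity,
       CASE
         WHEN s.similarity >= 0.75 THEN 'RELATED'
         WHEN s.similarity >= 0.40 THEN 'REVIEW'
         ELSE 'UNRELATED'
       END AS next_step
FROM  (SELECT FEATURE_COMPARE(feat_esa_1_1
                USING 'Oracle Database is the best available for managing your data' text
                AND USING 'The SQL language is the one language that all databases have in common' text) similarity
       FROM   DUAL) s;
```

With the score of 0.7629 reported earlier, the first branch of this CASE expression would apply.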
The examples given here are for general text. In the real world you would probably need a bigger data set. If you were to use this approach in other domains, such as legal, banking, insurance, etc., then you would need to create a training data set based on the typical language that is used in each of those domains. This would then allow you to compare documents within each domain with greater accuracy.