Brendan Tierney - Oralytics Blog: Big Data

Showing posts with label Big Data. Show all posts

Friday, April 19, 2019

Data Sets for Analytics

When working with analytics, in whatever flavor, one of the key things you need is some data. But data comes in many different shapes and sizes, but where can you get some useful data, be it transactional, time-series, meta-data, analytical, master, categorical, numeric, regression, clustering, etc.

Many of the popular analytics languages have some data sets built into them. For example the R language comes pre-loaded with data sets and these can be accessed using

data()

but many of the R packages also come with data sets.

Similarly if you are using Python, it comes with some pre-loaded data sets and similarly many of the Python libraries have data sets build into them. For example scikit learn.

from sklearn import datasets

But where else can you get data sets. There are lots and lots of website available with data sets and the list could be very long. The following is a list of, what I consider, the websites with the best data sets.

Kaggle
Amazon Open Data
UCI Machine Learning Repository
Google Search Engine
Google Open Images Data
Google Fiance
Microsoft Open Data
Awesome Public Datasets Collection
EU Open Data
US Government Data
US Census Bureau
Ireland Open Data
Northern Ireland Public Open Data
UK Open Data
Image Processing Data
Carnegie Mellon University Data Sets
World Bank Open Data
IMF Open Data
Movie Reviews Data Set
Amazon Reviews
Amazon public data sets
IMDb Datasets

Monday, July 23, 2018

Lessor known Apache Machine Learning languages

Machine learning is a very popular topic in recent times, and we keep hearing about languages such as R, Python and Spark. In addition to these we have commercially available machine learning languages and tools from SAS, IBM, Microsoft, Oracle, Google, Amazon, etc., etc. Everyone want a slice of the machine learning market!

The Apache Foundation supports the development of new open source projects in a number of areas. One such area is machine learning. If you have read anything about machine learning you will have come across Spark, and maybe you might believe that everyone is using it. Sadly this isn't true for lots of reasons, but it is very popular. Spark is one of the project support by the Apache Foundation.

But are there any other machine learning projects being supported under the Apache Foundation that are an alternative to Spark? The follow lists the alternatives and lessor know projects: (most of these are incubator/retired/graduated Apache projects)

Flink	Flink is an open source system for expressive, declarative, fast, and efficient data analysis. Stratosphere combines the scalability and programming flexibility of distributed MapReduce-like platforms with the efficiency, out-of-core execution, and query optimization capabilities found in parallel databases. Flink was originally known as Stratosphere when it entered the Incubator. Documentation (graduated)
HORN	HORN is a neuron-centric programming APIs and execution framework for large-scale deep learning, built on top of Apache Hama. Wiki Page (Retired)
HiveMail	Hivemall is a library for machine learning implemented as Hive UDFs/UDAFs/UDTFs Apache Hivemall offers a variety of functionalities: regression, classification, recommendation, anomaly detection, k-nearest neighbor, and feature engineering. It also supports state-of-the-art machine learning algorithms such as Soft Confidence Weighted, Adaptive Regularization of Weight Vectors, Factorization Machines, and AdaDelta. Apache Hivemall offers a variety of functionalities: regression, classification, recommendation, anomaly detection, k-nearest neighbor, and feature engineering. It also supports state-of-the-art machine learning algorithms such as Soft Confidence Weighted, Adaptive Regularization of Weight Vectors, Factorization Machines, and AdaDelta. Documentation (incubator)
MADlib	Apache MADlib is an open-source library for scalable in-database analytics. It provides data-parallel implementations of mathematical, statistical and machine learning methods for structured and unstructured data. Key features include: Operate on the data locally in-database. Do not move data between multiple runtime environments unnecessarily; Utilize best of breed database engines, but separate the machine learning logic from database specific implementation details; Leverage MPP shared nothing technology, such as the Greenplum Database and Apache HAWQ (incubating), to provide parallelism and scalability. Documentation (graduated)
MXNet	A Flexible and Efficient Library for Deep Learning . MXNet provides optimized numerical computation for GPUs and distributed ecosystems, from the comfort of high-level environments like Python and R MXNet automates common workflows, so standard neural networks can be expressed concisely in just a few lines of code. Webpage (incubator)
OpenNLP	OpenNLP is a machine learning based toolkit for the processing of natural language text. OpenNLP supports the most common NLP tasks, such as tokenization, sentence segmentation, part-of-speech tagging, named entity extraction, chunking, parsing, language detection and coreference resolution. Documentation (graduated)
PredictionIO	PredictionIO is an open source Machine Learning Server built on top of state-of-the-art open source stack, that enables developers to manage and deploy production-ready predictive services for various kinds of machine learning tasks. Documentation (graduated)
SAMOA	SAMOA provides a collection of distributed streaming algorithms for the most common data mining and machine learning tasks such as classification, clustering, and regression, as well as programming abstractions to develop new algorithms that run on top of distributed stream processing engines (DSPEs). It features a pluggable architecture that allows it to run on several DSPEs such as Apache Storm, Apache S4, and Apache Samza. Documentation (incubator)
SINGA	SINGA is a distributed deep learning platform. An intuitive programming model based on the layer abstraction is provided, which supports a variety of popular deep learning models. SINGA architecture supports both synchronous and asynchronous training frameworks. Hybrid training frameworks can also be customized to achieve good scalability. SINGA provides different neural net partitioning schemes for training large models. Documentation (incubator)
Storm	Storm is a distributed, fault-tolerant, and high-performance realtime computation system that provides strong guarantees on the processing of data. Storm makes it easy to reliably process unbounded streams of data, doing for realtime processing what Hadoop did for batch processing. Storm is simple, can be used with any programming language. Documentation (graduated)
SystemML	SystemML provides declarative large-scale machine learning (ML) that aims at flexible specification of ML algorithms and automatic generation of hybrid runtime plans ranging from single node, in-memory computations, to distributed computations such as Apache Hadoop MapReduce and Apache Spark. Documentation (graduated)

Big data ml

I will have a closer look that the following SQL based machine learning languages in a lager blog post:

- MADlib

- Storm

Tuesday, March 22, 2016

Configuring RStudio Server for Oracle R Enterprise

In this blog post I will show you the configurations that are necessary for RStudio Server to work with Oracle R Enterprise on your Oracle Database server. In theory if you have just installed ORE and then RStudio Server, everything should work, but if you encounter any issues then check out the following.

Before I get started make sure to check out my previous blog posts on installing R Studio Server. The first blog post was installing and configuring RStudio Server on the Oracle BigDataLite VM. This is an automated install. The second blog post was a step by step guide to installing RStudio Server on your (Oracle) Linux Database Server and how to open the port on the VM using VirtualBox.

Right. Let's get back to configuring to work with Oracle R Enterprise. The following assumes you have complete the second blog post mentioned above.

1. Edit the rserver.conf files

Add in the values and locations for RHOME and ORACLE_HOME

sudo vi /etc/rstudio/rserver.conf
    rsession-ld-library-path=RHOME/lib:ORACLE_HOME/lib

2. Edit the .Renviron file.

Add in the values for ORACLE_HOME, ORACLE_HOSTNAME and ORACLE_SID

cd /home/oracle
sudo vi .Renviron
    ORACLE_HOME=ORACLE_HOME
    ORACLE_HOSTNAME=ORACLE_HOSTNAME
    ORACLE_SID=ORACLE_SID
 
export ORACLE_HOME
export ORACLE_HOSTNAME
export ORACLE_SID

3. To access the Oracle R Distribution

Add the following to the usr/lib/rstudio-server/R/modules/SessionHelp.R file for the version of Oracle R Distribution you installed prior to installing Oracle R Enterprise.

.rs.addFunction( "httpdPortIsFunction", function() {
   getRversion() >= "3.2"
})

You are all done now with all the installations and configurations.

Thursday, March 17, 2016

Installing RStudio Server on an (Oracle) Linux server

In a previous blog post I showed how you can install and get started with using RStudio on a server by using RStudio Server. My previous post showed how you could do that on the Oracle BigDataLite VM. On this VM everything was nicely scripted and set up for you. But when it comes to installing it on a different server, well things can be a bit different.

The purpose of this blog post is to go through the install steps you need to follow on your own server or Oracle Database server. The following is based on a server that is setup with Oracle Linux. (I'm actually using the Oracle DB Developer VM).

1. Download the latest version of RStudio Server.

Use the following link to download RStudio Server. But do a quick check on the RStudio server to get the current version number.

wget https://download2.rstudio.org/rstudio-server-rhel-0.99.892-x86_64.rpm

The following shows you what you will see when you run this command.

--2016-03-16 06:22:30--  https://download2.rstudio.org/rstudio-server-rhel-0.99.892-x86_64.rpm
Resolving download2.rstudio.org (download2.rstudio.org)... 54.192.28.107, 54.192.28.54, 54.192.28.12, ...
Connecting to download2.rstudio.org (download2.rstudio.org)|54.192.28.107|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 38814908 (37M) [application/x-redhat-package-manager]
Saving to: ‘rstudio-server-rhel-0.99.892-x86_64.rpm’

100%[============================================================>] 38,814,908  6.54MB/s   in 6.0s  

2016-03-16 06:22:37 (6.17 MB/s) - ‘rstudio-server-rhel-0.99.892-x86_64.rpm’ saved [38814908/38814908]

2. Install RStudio Server

sudo yum install --nogpgcheck rstudio-server-rhel-0.99.892-x86_64.rpm

when prompted if it is OK to install, enter y (highlighted in bold below)

Loaded plugins: langpacks
Examining rstudio-server-rhel-0.99.892-x86_64.rpm: rstudio-server-0.99.892-1.x86_64
Marking rstudio-server-rhel-0.99.892-x86_64.rpm to be installed
Resolving Dependencies
--> Running transaction check
---> Package rstudio-server.x86_64 0:0.99.892-1 will be installed
--> Finished Dependency Resolution
ol7_UEKR3/x86_64                                                                    | 1.2 kB  00:00:00    
ol7_addons/x86_64                                                                   | 1.2 kB  00:00:00    
ol7_latest/x86_64                                                                   | 1.4 kB  00:00:00    
ol7_optional_latest/x86_64                                                          | 1.2 kB  00:00:00    

Dependencies Resolved

===========================================================================================================
 Package               Arch          Version             Repository                                   Size
===========================================================================================================
Installing:
 rstudio-server        x86_64        0.99.892-1          /rstudio-server-rhel-0.99.892-x86_64        280 M

Transaction Summary
===========================================================================================================
Install  1 Package

Total size: 280 M
Installed size: 280 M

Is this ok [y/d/N]: y

Downloading packages:
Running transaction check
Running transaction test
Transaction test succeeded
Running transaction
  Installing : rstudio-server-0.99.892-1.x86_64                                                        1/1
groupadd: group 'rstudio-server' already exists
rsession: no process found
ln -s '/etc/systemd/system/rstudio-server.service' '/etc/systemd/system/multi-user.target.wants/rstudio-server.service'
rstudio-server.service - RStudio Server
   Loaded: loaded (/etc/systemd/system/rstudio-server.service; enabled)
   Active: active (running) since Wed 2016-03-16 10:46:00 PDT; 1s ago
  Process: 3191 ExecStart=/usr/lib/rstudio-server/bin/rserver (code=exited, status=0/SUCCESS)
 Main PID: 3192 (rserver)
   CGroup: /system.slice/rstudio-server.service
           ├─3192 /usr/lib/rstudio-server/bin/rserver
           └─3205 /usr/lib64/R/bin/exec/R --slave --vanilla -e cat(R.Version()$major,R.Version()$minor,~+~sep=".")

Mar 16 10:46:00 localhost.localdomain systemd[1]: Started RStudio Server.
  Verifying  : rstudio-server-0.99.892-1.x86_64                                                        1/1

Installed:
  rstudio-server.x86_64 0:0.99.892-1                                                                      

Complete!

3. Open RStudio using a web browser.

Open your favourite web browser and put in the host name or the IP address of your server. In my example I'm using the Oracle DB Developer VM to demonstrate the install, so I can use localhost, followed by the port number for RStudio Server.

4. Use and Enjoy

If you get logged into RStudio Server then you will see a screen something like the following!

Job Done and Enjoy!

5. An Extra Step is using the Oracle DB Developer VM

If you want to use RStudio on the Oracle DB Developer VM from your local OS, then you will need to open the port 8787 on the VM. To do this power down the VM, if you have it open. The open the Network section of the VM settings. I'm using VirtualBox. And then click on the Port Forwarding.

NewImage

Click on OK to save your Port Forwarding setting and then click on the OK button again to close the Network settings for the VM.

Now start up the VM. When it has loaded and you have the desktop displayed in the VM window, you should now be able to connect to RStudio in the VM, from your local machine.

To do this open your web browser on your local machine and type in

http://localhost:8787

You should now get the RStudio login in screen that is shown in point 3 above. Go ahead, login and enjoy.

6. A little warning

Make sure to log out of RStudio when you are finished using it. If you don't then your R environment may not have been saved and you will get a message when you log in next. Now we don't want that happenings, so just log out of RStudio. You can do that by looking at the top right hand corner of the RStudio Server application.

I will have one more blog post on how you can configure RStudion Server to work with an Oracle Database server that has Oracle R Enterprise installed.

Monday, March 14, 2016

Installing RStudio Server on Oracle BigDataLite VM

A very popular tool for data scientists is RStudio. This tool allows you to interactively work with your R code, view the R console, the graphs and charts you create, manage the various objects and data frames you create, as well shaving easy access to the R help documentation. Basically it is a core everyday tool.

The typical approach is to have RStudio installed on your desktop or laptop. What this really means is that the data is pulled to your desktop or laptop and all analytics is performed there. In most cases this is fine but as your data volumes goes does does the limitations of using R on your local machine.

An alternative is to install a version called RStudio Server on an analytics server or on the database server. You can now use the computing capabilities of this server to overcome some of the limitations of using R or RStudio locally. Now you will use your web browser to access RStudio Server on your database server.

In this blog post I will walk you through how to install and get connected to RStudio Server on the Oracle BigDataLite VM.

NewImage

After starting up the Oracle BigDataLite VM and logging into the Oracle user (password=welcome1) you will see the Start Here icon on the desktop. You will need to double click on this.

NewImage

This will open a webpage on the VM that contains details of all the various tools that are installed on the VM or are ready for you to install and configure. This information contains all the http addresses and ports you need to access each of these tools via a web browser or some other way, along with the usernames and passwords you need to use them.

NewImage

One of the tools lists is for RStudio Server. This product is not installed on the VM but Oracle has provided a script that you can run to perform the install in an automated way. This script is located in:

[oracle@bigdatalite ~]$ cd  /home/oracle/scripts/

Use the following command to run the RStudio Server install script.

[oracle@bigdatalite scripts]$ ./install_rstudio.sh

The following is the output from running this script and it will be displayed in your terminal window. You can use this to monitor the progress of the installation.

Retrieving RStudio
--2016-03-12 02:06:15--  https://download2.rstudio.org/rstudio-server-rhel-0.99.489-x86_64.rpm
Resolving download2.rstudio.org... 54.192.28.12, 54.192.28.54, 54.192.28.98, ...
Connecting to download2.rstudio.org|54.192.28.12|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 34993428 (33M) [application/x-redhat-package-manager]
Saving to: `rstudio-server-rhel-0.99.489-x86_64.rpm'

100%[======================================>] 34,993,428  5.24M/s   in 10s     

2016-03-12 02:06:26 (3.35 MB/s) - `rstudio-server-rhel-0.99.489-x86_64.rpm' saved [34993428/34993428]

Installing RStudio
Loaded plugins: refresh-packagekit, security, ulninfo
Setting up Install Process
Examining rstudio-server-rhel-0.99.489-x86_64.rpm: rstudio-server-0.99.489-1.x86_64
Marking rstudio-server-rhel-0.99.489-x86_64.rpm to be installed
public_ol6_UEKR3_latest                                  | 1.2 kB     00:00     
public_ol6_UEKR3_latest/primary                          |  22 MB     00:03     
public_ol6_UEKR3_latest                                                 568/568
public_ol6_latest                                        | 1.4 kB     00:00     
public_ol6_latest/primary                                |  55 MB     00:12     
public_ol6_latest                                                   33328/33328
Resolving Dependencies
--> Running transaction check
---> Package rstudio-server.x86_64 0:0.99.489-1 will be installed
--> Finished Dependency Resolution

Dependencies Resolved

================================================================================
 Package        Arch   Version       Repository                            Size
================================================================================
Installing:
 rstudio-server x86_64 0.99.489-1    /rstudio-server-rhel-0.99.489-x86_64 251 M

Transaction Summary
================================================================================
Install       1 Package(s)

Total size: 251 M
Installed size: 251 M
Downloading Packages:
Running rpm_check_debug
Running Transaction Test
Transaction Test Succeeded
Running Transaction
  Installing : rstudio-server-0.99.489-1.x86_64                             1/1 
useradd: user 'rstudio-server' already exists
groupadd: group 'rstudio-server' already exists
rsession: no process killed
rstudio-server start/running, process 5037
  Verifying  : rstudio-server-0.99.489-1.x86_64                             1/1 

Installed:
  rstudio-server.x86_64 0:0.99.489-1                                            

Complete!
Restarting RStudio
rstudio-server stop/waiting
rsession: no process killed
rstudio-server start/running, process 5066

When the installation is finished you are now ready to connect to the RStudio Server. So open your web browser and enter the following into the address bar.

http://localhost:8787/

NewImage

The initial screen you are presented with is a login screen. Enter your Linux username and password. In the case of the BigDataLite VM this will be oracle/welcome1.

NewImage

Then you will be presented with the RStudio Server application in your web browser, as shown below. As you can see it is very similar to using RStudio on your desktop. Happy Days! You are now setup and able to run RStudio on the database server.

NewImage

Make sure to log out of RStudio Server before closing down the window.

If you don't log out of RStudio Server then the next time you open RStudio Server your session will automatically open. Perhaps this is not the best for security, so try to remember to log out each time.

By now using RStudio Server on the Oracle Database server I can not get some of the benefits of computing capabilities of this server. Although there are still the typical limitations with of using R. But now I access RStudio on the database server and process the data on the database server, all from my local PC or laptop.

Everything is nicely setup and ready for you to install on the BigDataLite VM (thank you Oracle). But what about when we want to install RStudion Server on a different server. What are the steps necessary to install, configure and log in. Yes they should be similar but I will give a complete list of steps in my next blog post.

Friday, February 12, 2016

Spark versus Flink

Spark is an open source Apache project that provides a framework for multi stage in-memory analytics. Spark is based on the Hadoop platform and can interface with Cassandra OpenStack Swift, Amazon S3, Kudu and HDFS. Spark comes with a suite of analytic and machine learning algorithm allowing you to perform a wide variety of analytics on you distribute Hadoop platform. This allows you to generate data insights, data enrichment and data aggregations for storage on Hadoop and to be used on other more main stream analytics as part of your traditional infrastructure. Spark is primarily aimed at batch type analytics but it does come with a capabilities for streaming data. When data needs to be analysed it is loaded into memory and the results are then written back to Hadoop.

NewImage

Flink is another open source Apache project that provides a platform for analyzing and processing data that is in a distributed stream and/or batch data processing. Similarly to Spark, Flink comes with a set of APIs that allows for each integration in with Java, Scala and Python. The machine learning algorithms have been specifically tuned to work with streaming data specifically but can also work in batch oriented data. As Flink is focused on being able to process streaming data, it run on Yarn, works with HDFS, can be easily integrated with Kafka and can connect to various other data storage systems.

NewImage

Although both Spark and Flink can process streaming data, when you examine the underlying architecture of these tools you will find that Flink is more specifically focused for streaming data and can process this data in a more efficient manner.

There has been some suggestions in recent weeks and months that Spark is now long the tool of choice for analytics on Hadoop. Instead everyone should be using Flink or something else. Perhaps it is too early to say this. You need to consider the number of companies that have invested significant amount of time and resources building and releasing products on top of Spark. These two products provide similar-ish functionality but each product are designed to process this data in a different manner. So it really depends on what kind of data you need to process, if it is bulk or streaming will determine which of these products you should use. In some environments it may be suitable to use both.

Will these tool replace the more traditional advanced analytics tools in organisations? the simple answer is No they won't replace them. Instead they will complement each other and if you have a Hadoop environment you will will probably end up using Spark to process the data on Hadoop. All other advanced analytics that are part of your more traditional environments you will use the traditional advanced analytics tools from the more main stream vendors.

Thursday, July 9, 2015

Oracle Architect's Guides to Big Data

Over the past couple of years we have had a lot of information about Big Data presented to us. But one of the things that still stands out is that there is still a bit of confusion on what Big Data is. Depending on who you are talking to you will get a different definition and interpretation of what Big Data is and what you can do with it.

For example there is one company I know of who are talking about their Big Data project. For them this involves processing approx. 1 million records. That is Big for them. For others that is tiny.

Oracle has recently put together a series of articles that talk about what architectural changes are needed to your technical infrastructure to support Big Data. In this case it is more about the volume of data rather than different types of data. Although this is covered by the architecture that Oracle gives.

As part of the Oracle Enterprise Architecture section of the Oracle website, they have put together a series of articles on how you can include Big Data within your Enterprise Information Architecture.

These are a good read and a great place to get a better understanding of what you need to be considering as you move to an architecture that includes Big Data.

Pages

Friday, April 19, 2019

Monday, July 23, 2018

Tuesday, March 22, 2016

Thursday, March 17, 2016

Monday, March 14, 2016

Friday, February 12, 2016

Thursday, July 9, 2015