Monday, July 23, 2018

Lessor known Apache Machine Learning languages

Machine learning is a very popular topic in recent times, and we keep hearing about languages such as R, Python and Spark. In addition to these we have commercially available machine learning languages and tools from SAS, IBM, Microsoft, Oracle, Google, Amazon, etc., etc. Everyone want a slice of the machine learning market!

The Apache Foundation supports the development of new open source projects in a number of areas. One such area is machine learning. If you have read anything about machine learning you will have come across Spark, and maybe you might believe that everyone is using it. Sadly this isn't true for lots of reasons, but it is very popular. Spark is one of the project support by the Apache Foundation.

But are there any other machine learning projects being supported under the Apache Foundation that are an alternative to Spark? The follow lists the alternatives and lessor know projects: (most of these are incubator/retired/graduated Apache projects)

Flink Flink is an open source system for expressive, declarative, fast, and efficient data analysis. Stratosphere combines the scalability and programming flexibility of distributed MapReduce-like platforms with the efficiency, out-of-core execution, and query optimization capabilities found in parallel databases. Flink was originally known as Stratosphere when it entered the Incubator.

Documentation

(graduated)

HORN HORN is a neuron-centric programming APIs and execution framework for large-scale deep learning, built on top of Apache Hama.

Wiki Page

(Retired)

HiveMail Hivemall is a library for machine learning implemented as Hive UDFs/UDAFs/UDTFs

Apache Hivemall offers a variety of functionalities: regression, classification, recommendation, anomaly detection, k-nearest neighbor, and feature engineering. It also supports state-of-the-art machine learning algorithms such as Soft Confidence Weighted, Adaptive Regularization of Weight Vectors, Factorization Machines, and AdaDelta. Apache Hivemall offers a variety of functionalities: regression, classification, recommendation, anomaly detection, k-nearest neighbor, and feature engineering. It also supports state-of-the-art machine learning algorithms such as Soft Confidence Weighted, Adaptive Regularization of Weight Vectors, Factorization Machines, and AdaDelta.

Documentation

(incubator)

MADlib Apache MADlib is an open-source library for scalable in-database analytics. It provides data-parallel implementations of mathematical, statistical and machine learning methods for structured and unstructured data. Key features include: Operate on the data locally in-database. Do not move data between multiple runtime environments unnecessarily; Utilize best of breed database engines, but separate the machine learning logic from database specific implementation details; Leverage MPP shared nothing technology, such as the Greenplum Database and Apache HAWQ (incubating), to provide parallelism and scalability.

Documentation

(graduated)

MXNet A Flexible and Efficient Library for Deep Learning . MXNet provides optimized numerical computation for GPUs and distributed ecosystems, from the comfort of high-level environments like Python and R MXNet automates common workflows, so standard neural networks can be expressed concisely in just a few lines of code.

Webpage

(incubator)

OpenNLP OpenNLP is a machine learning based toolkit for the processing of natural language text. OpenNLP supports the most common NLP tasks, such as tokenization, sentence segmentation, part-of-speech tagging, named entity extraction, chunking, parsing, language detection and coreference resolution.

Documentation

(graduated)

PredictionIO PredictionIO is an open source Machine Learning Server built on top of state-of-the-art open source stack, that enables developers to manage and deploy production-ready predictive services for various kinds of machine learning tasks.

Documentation

(graduated)

SAMOA SAMOA provides a collection of distributed streaming algorithms for the most common data mining and machine learning tasks such as classification, clustering, and regression, as well as programming abstractions to develop new algorithms that run on top of distributed stream processing engines (DSPEs). It features a pluggable architecture that allows it to run on several DSPEs such as Apache Storm, Apache S4, and Apache Samza.

Documentation

(incubator)

SINGA SINGA is a distributed deep learning platform. An intuitive programming model based on the layer abstraction is provided, which supports a variety of popular deep learning models. SINGA architecture supports both synchronous and asynchronous training frameworks. Hybrid training frameworks can also be customized to achieve good scalability. SINGA provides different neural net partitioning schemes for training large models.

Documentation

(incubator)

Storm Storm is a distributed, fault-tolerant, and high-performance realtime computation system that provides strong guarantees on the processing of data. Storm makes it easy to reliably process unbounded streams of data, doing for realtime processing what Hadoop did for batch processing. Storm is simple, can be used with any programming language.

Documentation

(graduated)

SystemML SystemML provides declarative large-scale machine learning (ML) that aims at flexible specification of ML algorithms and automatic generation of hybrid runtime plans ranging from single node, in-memory computations, to distributed computations such as Apache Hadoop MapReduce and Apache Spark.

Documentation

(graduated)

Big data ml

I will have a closer look that the following SQL based machine learning languages in a lager blog post:

- MADlib

- Storm

Thursday, July 12, 2018

Oracle Developer Champion

Yesterday evening I received an email titled 'Invitation to Developer Champion Program'.

What a surprise!
Oracle dev champion
The Oracle Developer Champion program was setup just a year ago and is aimed at people who are active in generating content and sharing their knowledge on new technologies including cloud, micro services, containers, Java, open source technologies, machine learning and various types of databases.
For me, I fit into the machine learning, cloud, open source technologies, a bit on chatbots and various types of databases areas. Well I think I do!

This made me look back over my activities for the past 12-18 months. As an Oracle ACE Director, we have to record all our activities. I'd been aware that the past 12-18 months had been a bit quieter than previous years. But when I looked back at all the blog posts, articles for numerous publications, books, and code contributions, etc. Even I was impressed with what I had achieved, even though it was a quiet period for me.

Membership of Oracle Developer Champion program is for one year, and the good people in Oracle Developer Community (ODC) will re-evaluate what I, and the others in the program, have been up to and will determine if you can continue for another year.

In addition to writing, contributing to projects, presenting, etc Oracle Developer Champions typically have leadership roles in user groups, answering questions on forums and providing feedback to product managers.

The list of existing Oracle Developer Champions is very impressive. I'm honoured to be joining this people.

Click on the image to go to the Oracle Developer Champion website to find out more.
Screen Shot 2018 07 12 at 17 21 32

And check out the list of existing Oracle Developer Champions.
 Oracle dev champion O ACEDirectorLogo clr