Wednesday, November 22, 2017
Make SQL Great Again baseball cap
Let me know if you would like to order one.
They cost €15 + P&P
Friday, September 29, 2017
Irish people presenting at OOW
Here is a list of presentations at Oracle Open World and JavaOne in 2017, that will be given by people and partners based in Ireland.
(I'll update this list if I find additional presentations)
Day | Time | Presentation | Location |
---|---|---|---|
Sunday | 13:45-14:30 | SQL: One Language to Rule All Your Data [OOW SUN1238]
Brendan Tierney, Oralytics SQL is a very powerful language that has been in use for almost 40 years. SQL comes with many powerful techniques for analyzing your data, and you can analyze data outside the database using SQL as well. Using the new Oracle Big Data SQL it is now possible to analyze data that is stored in a database, in Hadoop, and in NoSQL all at the same time. This session explores the capabilities in Oracle Database that allow you to work with all your data. Discover how SQL really is the unified language for processing all your data, allowing you to analyze, process, run machine learning, and protect all your data. Hopefully this presentation will be a bit of Fun! For those who have been working with the database for a long time, we can sometimes forget what we can really do. For those starting out in the career may not realise what the database can do. The presentation delivers an important message while having a laugh or two (probably at me). | Marriott Marquis (Golden Gate Level) - Golden Gate C1/C2 |
Monday | 16:30-17:15 | ESB Networks Automates Core IT Infrastructure and Grid Operations [CON7878]
Simon Holt, DBA / Technical Architect, ESB Networks Andrew Walsh, OMS Application Support, ESB In this session learn how ESB Networks deployed Oracle Utilities Network Management System Release 1.12 on a complete Oracle SuperCluster. Hear about the collaboration between multiple Oracle business units and the in-house expertise that delivered an end-to-end solution. This upgrade is an important step toward expanding ESB Network’s future network operations vision. Her about the challenges, the process of choosing a COTS solution, cybersecurity, and implementation. The session also explores the benefits the new system delivered when managing the effects of large-scale weather events, as well as the technical challenges of deploying a combined hardware and software solution. | Park Central (Floor 2) - Metropolitan I |
Monday | 16:45-17:30 | Automation and Innovation for Application Management and Support [CON7862]
Raja Roy, Associate Partner, IBM Ireland Automation and innovation are transforming the way application support and development projects are being executed. Market trends show three fundamental shifts: innovation to improve quality of service delivery, the emergence of knowledge-based systems with capabilities for self-service and self-heal, and leveraging the power of the cloud to move capital expenditures to operating expenditures for enhanced functionality. In this session see how IBM introduced innovation in deployments globally to help customers achieve employee and business productivity and enhanced quality of services. | Moscone West - Room 3022 |
Tuesday | 12:15-13:00 | DMigrating Oracle E-Business Suite to Oracle IaaS: A Customer Journey [CON1848]
Ken MacMahon, IT, Version 1 Ken Lynch, Head of IT, Irish Life Simon Joyce, Consultant / Contractor, Version 1 Software In this session hear about a leading global insurance provider’s experience of migrating Oracle E-Business Suite to Oracle Cloud. This session includes a discussion of the considerations for Oracle Iaas/PaaS vs. alternatives, the total cost of ownership for Oracle IaaS vs. on-premises solutions, the key project and support issues, the benefits of IaaS, and tips and tricks. Gain insights that can help others on their journey with Oracle IaaS generally and with Oracle E-Business Suite specifically. | Moscone West - Room 2001 |
Wednesday | 14:00-14:45 | Ireland's An Post: Customer Analytics Using Oracle Analytics Cloud [CON7176]
Tony Cassidy, CEO, Vertice John Hagerty, Oracle An Post, the Republic of Ireland’s state-owned provider of postal services, is an organization in transformation. It has used data and analytics to create innovations that led to cost savings and better sustainability. The current focus—customer analytics for a new line of business called Parcels and Packets—utilizes Oracle Analytics Cloud to externalize pertinent data to clients through a portal in a secure, effective, and easy-to-manage environment. In this session hear from An Post and its partner, Vertice, as they discuss the architecture and solution, along with recommendations for ensuring success using Oracle Analytics Cloud. | Moscone West - Room 3009 |
Thursday | 13:45-14:30 | Is SQL the Best Language for Statistics and Machine Learning? [OOW and JavaOne CON7350]
Brendan Tierney, Oralytics Did you know that Oracle Database comes with more than 300 statistical functions? And most of these statistical functions are available in all versions of Oracle Database? Most people do not seem to know this. When we hear about people performing statistical analytics, we hear them talking about Excel and R, but what if we could do statistical analysis in the database without having to extract any data onto client machines? This presentation explores the various statistical areas available in Oracle Database and gives several demonstrations. We can also greatly expand our statistical capabilities by using Oracle R Enterprise with the embedded capabilities in SQL. This presentation is just one of the 14 presentations that are scheduled for the Thursday! I believe this session is already fully booked, but you can still add yourself to the wait list. | Marriott Marquis (Golden Gate Level) - Golden Gate B |
Tuesday, September 19, 2017
My Oracle Open World 2017 Presentations
Oracle Open World 2017 will be happening very soon (1st-5th October). Still lots to do before I can get on that plane to San Francisco.
This year I'll be giving 2 presentations (see table below). One on the Sunday during the User Groups Sunday sessions. I've been accepted on the EMEA track. I then get a few days off to enjoy and experience OOW until Thursday when I have my second presentation that is part of JavaOne (I think!)
My OOW kicks off on Friday 29th September with the ACE Director briefing at Oracle HQ, after flying to SFO on Thursday 28th. This year it is only for one day instead of two days. I really enjoy this event as we get to learn and see what Oracle will be announcing at OOW as well as some things that will be coming out during the following few months.
Day | Time | Presentation | Location |
---|---|---|---|
Sunday | 13:45-14:30 | SQL: One Language to Rule All Your Data [OOW SUN1238]
SQL is a very powerful language that has been in use for almost 40 years. SQL comes with many powerful techniques for analyzing your data, and you can analyze data outside the database using SQL as well. Using the new Oracle Big Data SQL it is now possible to analyze data that is stored in a database, in Hadoop, and in NoSQL all at the same time. This session explores the capabilities in Oracle Database that allow you to work with all your data. Discover how SQL really is the unified language for processing all your data, allowing you to analyze, process, run machine learning, and protect all your data. Hopefully this presentation will be a bit of Fun! For those who have been working with the database for a long time, we can sometimes forget what we can really do. For those starting out in the career may not realise what the database can do. The presentation delivers an important message while having a laugh or two (probably at me). | Marriott Marquis (Golden Gate Level) - Golden Gate C1/C2 |
Thursday | 13:45-14:30 | Is SQL the Best Language for Statistics and Machine Learning? [OOW and JavaOne CON7350] Did you know that Oracle Database comes with more than 300 statistical functions? And most of these statistical functions are available in all versions of Oracle Database? Most people do not seem to know this. When we hear about people performing statistical analytics, we hear them talking about Excel and R, but what if we could do statistical analysis in the database without having to extract any data onto client machines? This presentation explores the various statistical areas available in Oracle Database and gives several demonstrations. We can also greatly expand our statistical capabilities by using Oracle R Enterprise with the embedded capabilities in SQL. This presentation is just one of the 14 presentations that are scheduled for the Thursday! I believe this session is already fully booked, but you can still add yourself to the wait list. | Marriott Marquis (Golden Gate Level) - Golden Gate B |
My flights and hotel have been paid by OTN as part of the Oracle ACE Director program. Yes this costs a lot of money and there is no way I'd be able to pay these costs. Thank you.
My diary for OOW is really full. No it is completely over booked. It is just mental. Between attending conference session, meeting with various product teams (we only get to meet at OOW), attending various community meet-ups, this year I get to attend some events for OUG leaders (representing UKOUG), spending some time on the EMEA User Group booth, various meetings with people to discuss how they can help or contribute to the UKOUG, then there is Oak Table World, trying to check out the exhibition hall, spend some time at the OTN/ODC hangout area, getting a few OTN t-shirts, doing some book promotions at the Oracle Press shop, etc., etc., etc. I'm exhausted just thinking about it. Mosts days start at 7am and then finish around 10pm.
I'll need a holiday when I get home! but it will be straight back to work :-(
If you are at OOW and want to chat then contact me via DM on Twitter or WhatsApp (these two are best) or via email (this will be the slowest way).
I'll have another blog post listing the presentations from various people and partners from the Republic of Ireland who are speaking at OOW.
Tuesday, September 5, 2017
Monday, August 28, 2017
How to speed up your Oracle Data Mining with in-memory and parallel
Have you have found running a workflow in Oracle Data Miner slow or running the scripts in the database slow ?
No. Good, because I haven't found it slow.
But (there is always a but) it really depends on the volume of data your are dealing with. For the vast majority of us who aren't of the size of google, amazon, etc have data volumes that are not that large really and a basic server can process many millions of records extremely quickly using Oracle Data Mining.
But what if we have a large volume of data. In one recent project I had a data set containing over 3.5 billion records. Now that is big data. All of this data sitting in an Oracle Database.
So how can we process over 3.5 billion records in a couple of seconds, building 4 machine learning models in that time? Is that really possible with just using an Oracle Database? Yes is the answer and very easily. (Surely I needed Hadoop and Spark to process this data? Nope!)
The Oracle Data Miner (ODMr) tool comes with a new feature in SQL Developer 4 (and higer) that allows you to manage using Parallel execution and the in-memory DB features. These can be accessed on the ODMr Worksheet tool bar.
The best time to look at these setting is when you have created your workflow and are ready to run it for the first time. When you click on the 'Performance Options' link, you will get the following window. It will display the list of nodes you have in the workflow and will then indicate if the Degree of Parallel and the In-Memory options can be set for each of the nodes.
The default values are shown and you can changes these. For example, in a lot of scenarios you might prefer to leave the Degree of Parallel as System Determined. This will then use whatever the the default is for the database and controlled by the DBA, but if you want to specify a particular value then you can, for example setting the degree of parallel to 4 for the 'Class Build' node, in the above image. Similarly for the in-memory option, this will only be available for nodes where the in-memory option would be applicable. This will be where there is a lot of data processing (preparing data, transforming data, performing specific statistics, etc) and for storing any data that is generated by Oracle Data Mining.
But what if you want to change the default values. You can change these at a global level within the SQL Developer Preferences. Here you can set the default to be used for each of the different types of Oracle Data Mining nodes.
I mentioned at the start that I've been able to build 4 machine learning models using Oracle Data Mining on a data set of over 3.5 billion records, all in a couple of seconds. In my scenario Parallel was set to 16 and we didn't use in-memory as we didn't have the licence for it. You can see that machine learning at lighting speed (ish) is possible. This timing is only for building the models, which is the step that consumes the most about of resources and time. When it comes to scoring the data, that is lighting fast. In may scenario, scoring over 300,000 was less than a second, and I didn't use parallel or anything else to speed things up. Because we didn't need to.
Go give it a try!
Monday, August 21, 2017
Scheduling ODMr Workflows in SQL Developer 4.2+
Monday, August 7, 2017
Auto enabling APPROX_* function in the Oracle Database
With the releases of 12.1 and 12.2 of Oracle Database we have seen some new functions that perform approximate calculations. These include:
- APPROX_COUNT_DISTINCT
- APPROX_COUNT_DISTINCT_DETAIL
- APPROX_COUNT_DISTINCT_AGG
- APPROX_MEDIAN
- APPROX_PERCENTILE
- APPROX_PERCENTILE_DETAIL
- APPROX_PERCENTILE_AGG
These functions can be used when approximate answers can be used instead of the exact answer. Yes can have many scenarios for these and particularly as we move into the big data world, the ability to process our data quickly is slightly more important and exact numbers. For example, is there really a difference between 40% of our customers being of type X versus 41%. The real answer to this is, 'It Depends!', but for a lot of analytical and advanced analytical methods this difference doesn't really make a difference.
There are various reports of performance improvement of anything from 6x to 50x with the response times of the queries that are using these functions, instead of using the more traditional functions.
If you are a BI or big data analyst and you have build lots of code and queries using the more traditional functions. But what if you now want to use the newer functions. Does this mean you have go and modify all the code you have written over the years? you can imagine getting approval to do this!
The simple answer to this question is 'No'. No you don't have to change any code, but with some parameter changes for the DB or your session you can tell the database to automatically switch from using the traditional functions (count, etc) to the newer more optimised and significantly faster APPROX_* functions.
So how can you do this magic?
First let us see what the current settings values are:
SELECT name, value FROM v$ses_optimizer_env WHERE sid = sys_context('USERENV','SID') AND name like '%approx%';
Now let us run a query to test what happens using the default settings (on a table I have with 10,500 records).
set auto trace on select count(distinct cust_id) from test_inmemory; COUNT(DISTINCTCUST_ID) ---------------------- 1500 Execution Plan ---------------------------------------------------------- Plan hash value: 2131129625 -------------------------------------------------------------------------------------- | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | -------------------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 1 | 13 | 70 (2)| 00:00:01 | | 1 | SORT AGGREGATE | | 1 | 13 | | | | 2 | VIEW | VW_DAG_0 | 1500 | 19500 | 70 (2)| 00:00:01 | | 3 | HASH GROUP BY | | 1500 | 7500 | 70 (2)| 00:00:01 | | 4 | TABLE ACCESS FULL| TEST_INMEMORY | 10500 | 52500 | 69 (0)| 00:00:01 | --------------------------------------------------------------------------------------
Let us now set the automatic usage of the APPROX_* function.
alter session set approx_for_aggregation = TRUE; SQL> select count(distinct cust_id) from test_inmemory; COUNT(DISTINCTCUST_ID) ---------------------- 1495 Execution Plan ---------------------------------------------------------- Plan hash value: 1029766195 --------------------------------------------------------------------------------------- | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | --------------------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 1 | 5 | 69 (0)| 00:00:01 | | 1 | SORT AGGREGATE APPROX| | 1 | 5 | | | | 2 | TABLE ACCESS FULL | TEST_INMEMORY | 10500 | 52500 | 69 (0)| 00:00:01 | ---------------------------------------------------------------------------------------
We can see above that the APPROX_* equivalent function was used, and slightly less work. But we only used this on a very small table.
The full list of session level settings is:alter session set approx_for_aggregation = TRUE; alter session set approx_for_aggregation = FALSE; alter session set approx_for_count_distinct = TRUE; alter session set approx_for_count_distinct = FALSE; alter session set approx_for_percentile = 'PERCENTILE_CONT DETERMINISTIC'; alter session set approx_for_percentile = PERCENTILE_DISC; alter session set approx_for_percentile = NONE;
Or at a system wide level:
alter system set approx_for_aggregation = TRUE; alter system set approx_for_aggregation = FALSE; alter system set approx_for_count_distinct = TRUE; alter system set approx_for_count_distinct = FALSE; alter system set approx_for_percentile = 'PERCENTILE_CONT DETERMINISTIC'; alter system set approx_for_percentile = PERCENTILE_DISC; alter system set approx_for_percentile = NONE;
And to reset back to the default settings:
alter system reset approx_for_aggregation; alter system reset approx_for_count_distinct; alter system reset approx_for_percentile;
Monday, July 31, 2017
Part 5 - The right to be forgotten (EU GDPR)s
This is the fifth part of series of blog posts on 'How the EU GDPR will affect the use of Machine Learning'
Article 17 is titled Right of Erasure (right to be forgotten) allows a person to obtain their data and for the data controller to ensure that the personal data is erased without any any delay.
This does not mean that their data can be flagged for non-contact, as I've seen done in many companies, only for the odd time when one of these people have been contacted.
It will also allow for people to choose to not take part in data profiling. Meaning that these people cannot be included in any of the input data sets. And should not be the scenario where they are included but they are flagged as not to be contacted in any post ML process where the consumers are contacted, just like I've seen in lots of places.
Click back to 'How the EU GDPR will affect the use of Machine Learning - Part 1' for links to all the blog posts in this series.
Monday, July 24, 2017
Part 4b - (Article 22: Profiling) Why me? and how Oracle 12c saves the day
This is the fourth part of series of blog posts on 'How the EU GDPR will affect the use of Machine Learning'
In this blog post (Part4b) I will examine some of the more technical aspects and how the in-database machine learning functions saves the day!
Probably in most cases where machine learning has been used and/or deployed in your company to analyse, profile and predict customers, it is more than likely that some sort of black box machine learning has been used.
Typical black box machine learning will include using algorithms like Neural Networks, but these can extended to other algorithms, within the context of the EU GDPR requirements, such as SVMs, GLM, etc. Additionally most companies don't just use one algorithm to make a decision on a customer. Many algorithms and rules based decision make can be used together, using some sort of voting system, to determine if a customer is targeted in a certain way.
Basically all of these do not really support the requirements of the EU GDPRs.
In most cases we need to go back to basics. Back to more simpler approaches of machine learning for customer profiling and prediction. This means no more, for now, ensemble models, unless you can explain why a customer was selected. This means having to use simple algorithms like Decision Trees, at a push Naive Bayes, and using some well defined rules based methods. All of these approaches allows us to see and understand why a customer was selected and based on Article 22 being able to explain why.
But there is some hope. Some of the commercial machine learning vendors already for some prediction insights built into their software. Very few if any open source solutions have this capability.
For example, Oracle introduced a new function called PREDICTION_DETAILS in Oracle 12.1c and this was expanded in Oracle 12.2c to cover all their in-database machine learning algorithms.
The following is an example of using this function for an SVM model. When you examine the boxes in the following image you an see that a slightly different set of attributes and the values of these attributes are listed. Each box corresponds to a different customer. This means we can give an explanation of why a customer was selected. Oracle 12c saves the day.
select cust_id, prediction(clas_svm_1_27 using *) pred_value, prediction_probability(clas_svm_1_27 using *) pred_prob, prediction_details(clas_svm_1_27 using *) pred_details from mining_data_apply_v;
If you have a look at other commercial machine learning solutions, you will find some give similar functionality or it will be available soon. Can we get the same level of detail from open source solutions. Not really unless you are using Decision Tress and maybe Naive Bayes. This means that companies that have gone done the pure open source for their machine learning may have to look at using alternative software and may have to folk out some hard earned dollars/euros to make sure that they are complainant with Article 22 of the EU GDPRs.
Click back to 'How the EU GDPR will affect the use of Machine Learning - Part 1' for links to all the blog posts in this series.
Monday, July 17, 2017
Part 4a - (Article 22: Profiling) Why me? and how Oracle 12c saves the day
This is the fourth part of series of blog posts on 'How the EU GDPR will affect the use of Machine Learning'
In this blog post (Part4a) I will discuss the specific issues relating to the use of machine learning algorithms and models. In the next blog post (Part 4a) I will examine some of the more technical aspects and how the in-database machine learning functions saves the day!
The EU GDPR has some rules that will affect the use of machine learning models for predicting customers.
As with all the other section of the EU GDPR, the use of machine learning and profiling of individuals does not affect organisations based in within Europe but affects all organisations around the globe who will be using these methods and associated data.
Article 22 of the EU GDPR deals with the “Automated individual decision-making, including profiling” and effectively creates a “right to explanation”. This means that an individual is entitled to an explanation of the decisions made by automated decision making models or profiling that has resulted in a decision being made about them. These new regulations present many challenges for organisations and their teams of data scientists.
To be able to give an explanation of the decision made by the machine learning models or by profile, requires the ability of the underlying models and their associated algorithms to be able to gives details of the model processing and how the decision about the individual has been obtained. For most machine learning models and algorithms this is generally not possible. For a limited set of algorithms, for example with decision trees, this is possible, but with other algorithms such as support vector machines, some regression models, and in particular neural networks, the ability to give these explanations is not possible. Some of these can be considered black box modelling (for neural networks) and grey box modelling for the others. But these algorithms are in widespread use in many organisations and are core to their predictive analytics solutions. This presents many challenges for organisations as they will need to look at alternative algorithms that many not have the same degree of predictive accuracy. With the recent rise of deep learning using neural networks, is extremely difficult to explain the multilayer neural net with various learned weights between each of the nodes at each layer.
Ensemble machine learning methods like Random Forests are also a challenge. Although the underlying machine learning algorithm is explainable, the ensemble approach of Random Forest, and other similar methods, result from an aggregation, averaging or voting process. Additionally, scenarios when machine learning models are combine with multiple other models, along with rules based solutions, where the predicted outcome is based on the aggregation or voting of all methods may no longer be useable. The ability to explain a predicted outcome using ensemble methods may not be possible and this will affect their continued use for predictive analytics.
In addition to the requirements of Article 22, Articles 13 and 14 state that the a person has a right to the meaningful information about the logic involved in profiling the person.
Over the past few years many of the commercially available machine learning solutions have been preparing for changes required to meet the EU GDPR. Some vendors have been able to add in greater model explanation features as well as specific explanations for each of the individual predictions. Many other vendors are will working on adding the required level of explanations and some of these many not be available in time for when the EU GDPR goes live in April 2018. This will present many challenges for organisations around the world who will be using data gathered within the EU region.
For machine learning based on open source languages and tools the EU GDPR present a very different challenge. While a small number of these come with some simple explanations for some of the more basic machine learning algorithms, there seems to be little information available on what work is currently being done to update these languages and tools. The limiting factor with making the required updates in the open source community lies with there being no commercial push to so. As a result of these limitation, many organisations may be forced into using commercial machine learning products, but for many other organisation the cost of doing so will be prohibitive.
It is clear that the tasks of building machine learning models have become significantly more complex with the introduction of the new EU GDPR. This complexity applies to the selection of what data can be used, ensuring there is no inherent discrimination in the machine learning models and the ability of these models to give an explanation of how the predicted outcome was determined. Companies around the World need to address these issues and in doing so may limit what software and algorithms that can be used for the customer profiling and predictive analytics. Although some of the commercially available machine learning languages and products can give the required insights, more product enhancements are required. Many challenges are facing machine learning open source community, with many research group only starting in recent months to look at how their languages, packages and tools can be enhanced to facilitate the requirements of the EU GDPR.
Click back to 'How the EU GDPR will affect the use of Machine Learning - Part 1' for links to all the blog posts in this series.
Wednesday, July 12, 2017
PRECIS R package
If you use R then you are very familiar with the SUMMARY function.
If you use R then you are very familiar with the name Hadley Wickham. He has produced some really cool packages for R.
He has produced a new R package and function that complements the commonly used SUMMARY R function.
The following outlines how you can install this new R package from GitHub (Hadley's GitHub is https://github.com/hadley/).
Install the R devtools package. This will allow you to download the package code from GitHub.
install.packages("devtools")
Install the package from Hadley's GitHub repository.
devtools::install_github("hadley/precis")
Load the library.
library(precis)
The following displays information produced by the SUMMARY and the PRECIS function.
> summary(mtcars) mpg cyl disp hp drat wt Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0 Min. :2.760 Min. :1.513 1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5 1st Qu.:3.080 1st Qu.:2.581 Median :19.20 Median :6.000 Median :196.3 Median :123.0 Median :3.695 Median :3.325 Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7 Mean :3.597 Mean :3.217 3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0 3rd Qu.:3.920 3rd Qu.:3.610 Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0 Max. :4.930 Max. :5.424 qsec vs am gear carb Min. :14.50 Min. :0.0000 Min. :0.0000 Min. :3.000 Min. :1.000 1st Qu.:16.89 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:2.000 Median :17.71 Median :0.0000 Median :0.0000 Median :4.000 Median :2.000 Mean :17.85 Mean :0.4375 Mean :0.4062 Mean :3.688 Mean :2.812 3rd Qu.:18.90 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:4.000 Max. :22.90 Max. :1.0000 Max. :1.0000 Max. :5.000 Max. :8.000 > precis(mtcars) # data.frame [32 x 11] name type precis1 mpg dbl 10.4 [ 15.4 ( 19.2) 22.8] 33.9 2 cyl dbl 4 (11) 6 (7) 8 (14) 3 disp dbl 71.1 [121.0 (196.0) 334.0] 472.0 4 hp dbl 52 [ 96 ( 123) 180] 335 5 drat dbl 2.76 [ 3.08 ( 3.70) 3.92] 4.93 6 wt dbl 1.51 [ 2.54 ( 3.32) 3.65] 5.42 7 qsec dbl 14.5 [ 16.9 ( 17.7) 18.9] 22.9 8 vs dbl 0 (18) 1 (14) 9 am dbl 0 (19) 1 (13) 10 gear dbl 3 (15) 4 (12) 5 (5) 11 carb dbl 1 [ 2 ( 2) 4] 8 > precis(mtcars, histogram=TRUE) # data.frame [32 x 11] name type precis 1 mpg dbl 10.4 ▂▁▇▃▅▅▂▂▁▁▂▂ 33.9 2 cyl dbl 4 ▅▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁▇ 8 3 disp dbl 71.1 ▅▁▁▃▇▂▁▁▁▁▃▁▃▁▅▁▁▁▁▁▁ 472.0 4 hp dbl 52 ▁▅▅▇▂▂▇▁▂▁▂▁▁▁▁ 335 5 drat dbl 2.76 ▂▂▇▂▁▅▇▃▂▁▁▁ 4.93 6 wt dbl 1.51 ▁▁▂▂▁▁▂▁▂▁▇▂▂▁▁▁▁▁▁▂▁ 5.42 7 qsec dbl 14.5 ▂▂▁▁▃▇▅▁▇▂▂▂▁▁▁▁▁ 22.9 8 vs dbl 0 ▇▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▅ 1 9 am dbl 0 ▇▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▅ 1 10 gear dbl 3 ▇▁▁▁▁▁▁▁▁▅▁▁▁▁▁▁▁▁▁▂ 5 11 carb dbl 1 ▅▇▁▂▁▇▁▁▁▁▁▁▁▁ 8
Monday, July 10, 2017
Part 3 - Ensuring there is no Discrimination in the Data and machine learning models
This is the third part of series of blog posts on 'How the EU GDPR will affect the use of Machine Learning'
The new EU GDPR has some new requirements that will affect what data can be used to ensure there is no discrimination. Additionally, the machine learning models needs to ensure that there is no discrimination with the predictions it will make. There is an underlying assumption that the organisation has the right to use the data about individuals and that this data has been legitimately obtained. The following outlines the areas relating to discrimination:- Discrimination based on unfair treatment of an individual based on using certain variables that may be inherently discriminatory. For example, race, gender, etc., and any decisions based on machine learning methods or not, that are based on an individual being part of one or more of these variables. This is particularly challenging for data scientists and it can limit some of the data points that can be included in their data sets.
- All data mining models need to tested to ensure that there is no discrimination built into them. Although the data scientist has removed any obvious variables that may cause discrimination, the machine learning models may have been able to discover some bias or discrimination based on the patterns it has discovered in the data.
- In the text preceding the EU GDPR (paragraph 71), details the requirements for data controllers to “implement appropriate technical and organizational measures” that “prevent, inter alia, discriminatory effects” based on sensitive data. Paragraph 71 and Article 22 paragraph 4 addresses discrimination based on profiling (using machine learning and other methods) that uses sensitive data. Care is needed to remove any associated correlated data.
- If one group of people are under represented in a training data set then, depending on the type of prediction being used, may unknowingly discriminate this group when it comes to making a prediction. The training data sets will need to be carefully partitioned and separate machine learning models built on each partition to ensure that such discrimination does not occur.
In the next blog post I will look at addressing the issues relating to Article 22 on the right to an explanation on outcomes automated individual decision-making, including profiling using machine learning and other methods.
Click back to 'How the EU GDPR will affect the use of Machine Learning - Part 1' for links to all the blog posts in this series.