Data Mining – Beyond Algorithms
by Dr Akeel Al-Attar, XpertRule Software
The case for data mining
Most organizations can currently be labelled ‘data rich’, since they are collecting increasing volumes of data about business processes and resources. Typically, these data mountains are used to provide endless ‘facts and figures’ such as ‘there are 60 categories of occupation’, ‘2000 mortgage accounts are in arrears’ etc. Such ‘facts and figures’ do not represent knowledge and, if anything, can lead to ‘information overload’. However, patterns in the data do represent knowledge, and most organizations nowadays can be labelled ‘knowledge poor’. Our definition of data mining is the process of discovering knowledge from data.
Data mining enables complex business processes to be understood and re-engineered. This can be achieved through the discovery of patterns in data relating to the past behaviour of a business process. Such patterns can be used to improve the performance of a process by exploiting favourable patterns and avoiding problematic patterns.
Examples of business processes where data mining can be useful are customer response to mailing, lapsed insurance policies and energy consumption. In each of these examples, data mining can reveal what factors affect the outcome of the business event or process and the patterns relating the outcome to these factors. Such patterns increase our understanding of these processes and therefore our ability to predict and affect the outcome.
Data Mining Technologies
There is a high degree of confusion among potential users of data mining as to what data mining technologies are. This confusion has been compounded by vendors of complementary technologies positioning their tools as data mining tools. So we have many vendors of query and reporting tools and OLAP (On-Line Analytical Processing) tools claiming that their products can be used for data mining. While it is true that one can discover useful patterns in the data using these tools, there is a question mark as to who is doing the discovery: the user or the tool! For example, query and reporting tools will interrogate the data and report on any pattern (query) requested by the user. This is a ‘manual’ and ‘validation driven’ method of discovery in the sense that unless the user suspects a pattern they will never find it! A marginally better situation is encountered with the OLAP tools, which can be termed ‘visualisation driven’ since they assist the user in the process of pattern discovery by displaying multi-dimensional data graphically. The class of tools that can genuinely be termed ‘data mining tools’ are those that support the automatic discovery of patterns in data.
We are going to make one more assertion regarding the difference between data mining and data modelling. Data mining is about discovering understandable patterns (trees, rules or associations) in data. Data modelling is about discovering a model that fits the data, regardless of whether the model is understandable (e.g. a tree or rules) or a black box (e.g. a neural network). Based on this assertion, we restrict the main data mining technologies to induction and the discovery of associations and clusters.
Rule or decision tree induction is the most established and effective data mining technology in use today. It is what can be termed ‘goal driven’ data mining in that a business goal is defined and rule induction is used to generate patterns that relate to that business goal. The business goal can be the occurrence of an event such as ‘response to mail shots’ or ‘mortgage arrears’ or the magnitude of an event such as ‘energy use’ or ‘efficiency’. Rule induction will generate patterns relating the business goal to other data fields (attributes). The resulting patterns are typically generated as a tree with splits on data fields and terminal points (leaves) showing the propensity or magnitude of the business event of interest.
As an example of tree induction data mining, consider this data table, which represents captured data on the process of loan authorisation. The table captures a number of data items relating to each loan applicant (sex, age, time at address, residence status, occupation, time in employment, time with the bank and monthly house expenses) as well as the decision made by the underwriters (accept or reject).
The objective of applying rule induction data mining to this table is to discover patterns relating the decisions made by the loan underwriters to the details of the application.
Such patterns can reveal the decision making process of the underwriters and their consistency, as shown in this tree. It reveals that the time with the bank is the attribute (data field) considered most important, with a critical threshold of 3 years. For applicants that have been with the bank over 3 years, the time in employment is considered the next most important factor, and so on. The tree reveals 5 patterns (leaves), each with an outcome (accept or reject) and a probability (0 – 1). High probability figures represent consistent decision making.
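To make the idea of a leaf with an outcome and a probability concrete, here is a minimal Python sketch. It is not the tool's own algorithm: the two-level tree, the field names (years_with_bank, years_employed, decision), the employment threshold and the sample records are all assumptions made up for illustration. Each record is routed to a leaf, and the leaf reports its majority outcome together with the share of its records that agree with it.

```python
from collections import Counter, defaultdict

def leaf_for(applicant):
    """Route an applicant to a leaf of a hypothetical two-level tree.

    The 3-year split on time with the bank echoes the text; the
    2-year employment threshold is invented for this example.
    """
    if applicant["years_with_bank"] <= 3:
        return "bank<=3"
    if applicant["years_employed"] <= 2:
        return "bank>3, employed<=2"
    return "bank>3, employed>2"

def summarise_leaves(records):
    """Map each leaf to (majority outcome, share of records with that outcome)."""
    by_leaf = defaultdict(list)
    for record in records:
        by_leaf[leaf_for(record)].append(record["decision"])
    summary = {}
    for leaf, decisions in by_leaf.items():
        outcome, count = Counter(decisions).most_common(1)[0]
        summary[leaf] = (outcome, count / len(decisions))
    return summary

# Made-up loan decisions, not data from the article.
records = [
    {"years_with_bank": 1, "years_employed": 5, "decision": "reject"},
    {"years_with_bank": 2, "years_employed": 1, "decision": "reject"},
    {"years_with_bank": 3, "years_employed": 4, "decision": "accept"},
    {"years_with_bank": 5, "years_employed": 1, "decision": "reject"},
    {"years_with_bank": 6, "years_employed": 2, "decision": "reject"},
    {"years_with_bank": 4, "years_employed": 3, "decision": "accept"},
    {"years_with_bank": 7, "years_employed": 10, "decision": "accept"},
    {"years_with_bank": 9, "years_employed": 6, "decision": "accept"},
    {"years_with_bank": 5, "years_employed": 4, "decision": "reject"},
]

summary = summarise_leaves(records)
for leaf, (outcome, probability) in sorted(summary.items()):
    print(f"{leaf}: {outcome} ({probability:.2f})")
```

A leaf whose probability is close to 1 corresponds to consistent decisions; a leaf nearer 0.5 suggests the underwriters were deciding on factors the tree has not captured.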
The majority of data miners who use tree induction will most probably use an automatic algorithm which can generate a tree once the outcome and the attributes are defined. Whilst this is a reasonable first cut for generating patterns from data, the real power of tree induction can be gained using the interactive (incremental) tree induction mode. This mode allows the user to impart his/her knowledge of the business process to the induction algorithm. In interactive induction, the algorithm stops at every split in the tree (starting at the root) and displays to the user the list of attributes available for activating a split, with these attributes being ranked by the criteria of the induction engine for selecting attributes (significance, entropy or a combination of both). The user is also presented with the best split of attribute values (threshold or groups of values) according to the algorithm. The user is then free to select the top ranking attribute (and value split) according to the algorithm or select any other attribute in the ranked list. This allows the user to override the automatic selection of attributes based on the user’s background knowledge of the process. For example, the user may feel that the relationship between the outcome and the best attribute is a spurious one, or that the best attribute is one that the user has no control over and should be replaced by one that can be controlled.
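The ranking step of interactive induction can be sketched as follows. This is an illustrative sketch only: it assumes entropy-based information gain as the ranking criterion (the text also mentions significance, which is not shown), and the function names, field names and sample records are hypothetical. The ranked list is what would be presented to the user, who may accept the top attribute or override it.

```python
from collections import Counter, defaultdict
from math import log2

def entropy(labels):
    """Shannon entropy, in bits, of a list of outcome labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(records, attribute, outcome):
    """Entropy of the outcome minus its weighted entropy after splitting on attribute."""
    groups = defaultdict(list)
    for record in records:
        groups[record[attribute]].append(record[outcome])
    remainder = sum(len(g) / len(records) * entropy(g) for g in groups.values())
    return entropy([record[outcome] for record in records]) - remainder

def rank_attributes(records, attributes, outcome):
    """Return (gain, attribute) pairs, best candidate split first."""
    scored = [(information_gain(records, a, outcome), a) for a in attributes]
    return sorted(scored, reverse=True)

def choose_split(ranked, user_choice=None):
    """Use the top-ranked attribute unless the user picks another from the list."""
    names = [name for _, name in ranked]
    if user_choice is None:
        return names[0]
    if user_choice not in names:
        raise ValueError(f"{user_choice!r} is not in the ranked list")
    return user_choice

# Made-up case data for illustration.
records = [
    {"residence": "owner", "sex": "M", "decision": "accept"},
    {"residence": "owner", "sex": "F", "decision": "accept"},
    {"residence": "owner", "sex": "M", "decision": "accept"},
    {"residence": "tenant", "sex": "F", "decision": "reject"},
    {"residence": "tenant", "sex": "M", "decision": "reject"},
    {"residence": "tenant", "sex": "F", "decision": "accept"},
]

ranked = rank_attributes(records, ["residence", "sex"], "decision")
for gain, name in ranked:
    print(f"{name}: gain {gain:.3f}")

automatic = choose_split(ranked)            # algorithm's own choice
overridden = choose_split(ranked, "sex")    # user overrides the ranking
```

Calling the same ranking again on each resulting subset of records gives the stop-at-every-split behaviour described above.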
Interactive induction can also be seen as bridging the gap between OLAP based manual data segmentation / exploration and algorithm assisted segmentation.
The Discovery of associations
This is the second most common data mining technology and involves the discovery of associations between the various data fields. One popular application of this technology is the discovery of associations between business events or transactions: for example, discovering that 90% of customers that buy product A will also buy product B (basket analysis), or that in 80% of cases when fault 1 is encountered fault 7 is also encountered. If the sequence of events is important then another data mining technology for discovering sequences can be used.
A second application of associations discovery data mining is the discovery of associations between the fields of case data. Case data is data that can be structured as a flat table of cases. Records of mortgage applications are an example of case data. In such data, associations can be found between data fields; for example that 75% of all applicants that are over 45 and in managerial occupations are also earning over £40,000. Such associations can be used as a way of discovering clusters in the data. Note that this differs from rule induction on case data in that no outcome needs to be defined for the discovery process.
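A statement such as "X% of customers that buy product A also buy product B" is what is usually called the confidence of the rule A -> B. The sketch below computes it, together with the rule's support (the share of all transactions containing both items), over a list of baskets. The function name and the basket data are assumptions for illustration, chosen so that the confidence comes out at 80%.

```python
def rule_stats(transactions, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent -> consequent.

    support    = share of all transactions containing both item sets
    confidence = share of transactions with the antecedent that also
                 contain the consequent
    """
    antecedent, consequent = set(antecedent), set(consequent)
    with_antecedent = [t for t in transactions if antecedent <= set(t)]
    with_both = [t for t in with_antecedent if consequent <= set(t)]
    support = len(with_both) / len(transactions) if transactions else 0.0
    confidence = len(with_both) / len(with_antecedent) if with_antecedent else 0.0
    return support, confidence

# Made-up shopping baskets for illustration.
baskets = [
    {"A", "B"}, {"A", "B", "C"}, {"A", "B"}, {"A", "B", "D"}, {"A", "C"},
    {"B", "C"}, {"C"}, {"D"}, {"B", "D"}, {"C", "D"},
]

support, confidence = rule_stats(baskets, {"A"}, {"B"})
print(f"A -> B: support {support:.0%}, confidence {confidence:.0%}")
```

An association discovery tool would evaluate many candidate rules in this way and report only those above chosen support and confidence levels.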
A number of case studies are described in the following sections which detail the background to each case study, the data mining approach used and the benefits gained by the organizations concerned.
Case Study 1: Mortgage Lending
This case study comes from a UK Mortgage Lender which had a mortgage portfolio in which 9.8% of all accounts were in arrears (over 3 months in arrears) and 4.1% of all accounts were in severe arrears (over 6 months in arrears). The objectives of the data mining project were to discover patterns relating the propensity of arrears to the mortgage application data. Such patterns can be applied at the front end applications processing to reduce the level of arrears and can result in better management of accounts that go into arrears.
Rule induction data mining was used with the outcome of the analysis being the arrears status (healthy, moderate arrears or severe arrears). The attributes of the data mining analysis were the mortgage application data such as age, income, occupation, term, loan amount, region etc. Two separate data mining analyses were carried out: one to discover the patterns of arrears and the second to discover the patterns of severe arrears. Rule induction generated the following tree for arrears with splits on the attributes %Adv (% of loan to property value), SecondCh (second charge on property), Owned (is the property already owned), sub/Inc (subscription to income), AppSource (application source) and Region. The tree reveals 10 profiles with a propensity for arrears ranging from 0.02 to 0.49.
The trees discovered from the arrears data were used in three ways:
Case Study 2: Life Underwriting
This case study is from Hibernian in Ireland which, like other life insurers, was facing the challenges of reducing costs, maintaining market share and meeting market demands. In order to meet these challenges, Hibernian decided to re-engineer its Life Underwriting Process in order to speed up the process and reduce its costs.
The first phase of re-engineering involved the implementation of a rule based underwriting system which was used to automate the processing of Life Proposals at the point of application. The system involved capturing underwriting knowledge which resulted in 51% of proposals being underwritten automatically with the remaining cases being referred to head office for manual underwriting.
While the automated underwriting system proved very successful, Hibernian looked for ways of increasing the percentage of cases that could be processed by the system. Attempts were made to capture more advanced underwriting rules; however, this proved to be very difficult. Data mining was then considered as an alternative for generating additional knowledge. The basic premise was that out of the 49% of cases being referred to Head Office a significant number were underwritten with no or a very small additional premium (less than the cost of the manual underwriting!). It was therefore decided to apply rule induction analysis to cases being referred to Head Office with the amount of additional underwriting premium being used as an outcome. Rule induction analysis generated patterns of low additional premiums of the following format:
If AGE > 30 & AGE < 41 and HEIGHT-WEIGHT = NORMAL Then PREMIUM LOADING = 1%
The generated patterns for low additional premiums were qualified and checked for risk by the actuaries at Hibernian before being added as additional underwriting rules to the automated underwriting system. The result was to increase the rate of automated underwriting to 78%.
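As an illustration of how an induced pattern of this format might be added to a rule based system, the sketch below encodes the example rule as a Python function. It is a hypothetical sketch, not Hibernian's system: the field names (age, height_weight), the function names and the sample proposals are invented, and returning None stands in for referral to manual underwriting.

```python
def premium_loading(proposal):
    """Return the loading in percent if the example rule fires, else None.

    Encodes: If AGE > 30 & AGE < 41 and HEIGHT-WEIGHT = NORMAL
             Then PREMIUM LOADING = 1%
    None means no rule applies and the case is referred for manual review.
    """
    if 30 < proposal["age"] < 41 and proposal["height_weight"] == "NORMAL":
        return 1.0
    return None

def automated_rate(proposals):
    """Share of proposals that a rule could decide without referral."""
    decided = sum(1 for p in proposals if premium_loading(p) is not None)
    return decided / len(proposals)

# Made-up proposals for illustration.
proposals = [
    {"age": 35, "height_weight": "NORMAL"},   # rule fires
    {"age": 40, "height_weight": "NORMAL"},   # rule fires
    {"age": 41, "height_weight": "NORMAL"},   # outside the age band
    {"age": 33, "height_weight": "HIGH"},     # fails the height-weight test
]

rate = automated_rate(proposals)
print(f"Decided automatically: {rate:.0%}")
```

Each rule approved by the actuaries would be added in the same way, raising the share of proposals decided without referral.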
This case study illustrates how data mining helped Hibernian in Ireland develop new ways of processing life proposals and these methods now underpin a cost effective new business process.
Case Study 3 : Gas Processing Plant
This project was carried out for an oil company and was based in a remote US oil field location. The process investigated was a very large gas processing plant which produces two useful products from the gas from the wells: natural gas liquids (NGL) and miscible injectant (MI). NGL is mixed with crude oil and transported for refining, and MI is used to improve the viscosity of oil in the fields to improve crude oil recovery.
The aim of the study was to use data mining techniques to analyse historical process data to find opportunities to increase the production rates, and hence increase the revenue generated by the process. Approximately 2000 data measurements for the process are captured every minute.
Rule induction data mining was used to discover patterns in the data. The business goal for data mining was the revenue from the Gas Process Plant, while the attributes of the analysis fell into two categories:
An important part of the process is where the incoming feed gas is pre-cooled with heat exchangers in two parallel process streams. The oil company had always believed that there was an opportunity to improve process performance by altering the split of flows; however, it was not sure in which way to split the flow and what the impact would be on the revenue. Therefore flow split was put forward as an attribute for data mining.
This is the tree generated by rule induction. It reveals patterns relating the revenue from the process to the disturbances and control settings of the process. In particular, the impact of the flow_split is revealed, with a critical ratio of 1.25 : 1.
The benefits derived from the generated patterns include the identification of opportunities to improve process revenue considerably (by up to 4%). Mostly these involve altering control set points, such as altering the flow splits. Some of the discovered knowledge can be implemented without any further work and for no extra cost (e.g. altering flow splits). In other cases, it is necessary to provide the operators with timely advice about the best combination of settings for a given circumstance. This can be achieved cost effectively by delivering the rules generated as part of an expert system.
The company is in the process of carrying out another rule induction analysis with product quality as an outcome. The discovered patterns will allow the plant to be operated at the maximum revenue possible with acceptable product quality.
Case Study 4 – Energy Usage in a Power Station
ICI Thornton Power Station produces steam for a range of processes on the site, and generates electricity in a mix of primary pass out and secondary condensing turbines. Total power output is approximately 50 MW (i.e. a small power station). Fuel and water costs amount to about £5 million a year (depending on site steam demands).
The objective of the data mining project was to identify opportunities to reduce power station operating costs. Costs include the cost of fuel (gas and oil to fire the boilers) and water (to make up for losses). Electricity and steam are sold and represent revenue.
Rule induction data mining was used for the project with the outcome (goal) of the analysis being the net cost of steam per unit of steam supplied to the site (i.e. the cost of the product). The attributes fall into two categories: disturbances, such as ambient temperature and the site steam demand, over which the operators have no control, and control settings, such as pressure and bled steam rates.
Here is a section of the induced tree revealing the variation of steam cost with attribute values. It identifies the main contributors to efficient operations as manifold pressure (i.e. pressure at primary turbine inlet), steam flow to the secondary turbines and the total site steam flow.
The benefits derived from the generated patterns include the identification of opportunities to improve process revenue considerably (by up to 5%). Mostly these involve altering control set points, which can be implemented without any further work and for no extra cost. Implementing some of the opportunities identified needed additional controls and instrumentation. The payback period for the additional controls would be a few months.
The Current Issues in Data Mining
With real case studies of organizations deploying data mining as a catalyst for enhancing and re-engineering their business processes, data mining is now entering mainstream IT as a mature and tested technology. With this phase of evolution, data mining has moved beyond the debate on algorithms and into the debate on usability. There are three main issues which should be considered by any organization considering the introduction of data mining: methodology, ease of use and performance / scalability.
For data mining to gain wide acceptance, it is important to have a step by step methodology for a data mining project. This ensures that the benefits reported by seasoned data miners are repeatable by other people in various business sectors. This can help dispel the belief that data mining is a kind of ‘black art’ which can only be practised by specialists. Such a methodology is beginning to emerge and there is certainly wide agreement on the main steps of such a methodology. These are:
Ease of Use
Data mining tools are increasingly used by computer literate business users. This requires these tools to be no more difficult to use than a spreadsheet program. Furthermore the data mining tool needs to support all the steps of a data mining methodology. Finally, because of the nature of data mining, the tool has to support extensive data and patterns reporting and visualisation.
Performance and Scalability
With the decreasing costs of data processing and storage comes the data rich organization. It is now commonplace for small and medium sized organizations to hold gigabytes of data relating to a business process. It is therefore essential that data mining tools can deliver acceptable performance on large volumes of data regardless of the computing platform / architecture being used. There are a number of computing architectures for data mining.
Client based data mining
In this architecture the data to be mined is downloaded (extracted) and stored on the client machine (Windows 95 or NT). All the data preparation and mining is carried out on the client.
Until recently this approach was limited to mining tens of thousands of records (in acceptable times of under an hour). Recent advances have made it possible for millions of records to be mined on a client in tens of minutes, and XpertRule Software’s Profiler is an example of a data mining tool with such a capability.
Two tier client server data mining
In this architecture the data is extracted and stored on a server but is mined from the client machine(s). There are two distinct flavours of this architecture:
Three tier client server data mining
This architecture typically involves a dedicated high performance server (such as Teradata from NCR), a standard platform (Unix but increasingly NT) middle tier and a number of data mining clients. Again there are two distinct scenarios: