
PARADIGMS FOR SPATIAL AND SPATIO-TEMPORAL DATA MINING

JOHN F. RODDICK
School of Informatics and Engineering, Flinders University of South Australia, PO Box 2100, Adelaide 5001, South Australia

BRIAN G. LEES
Department of Resource Management and Environmental Science, Australian National University, Canberra, ACT 0200, Australia

1. Introduction
With some significant exceptions, current applications for data mining are either in those areas for which there is little accepted discovery methodology or are being used within a knowledge discovery process that does not expect authoritative results but finds the discovered rules useful nonetheless. This is in contrast to its application in the fields applicable to spatial or spatio-temporal discovery, which possess a rich history of methodological discovery and result evaluation. Examples of the former include market basket analysis which, in its simplest form (q.v. (Agrawal, Imielinski and Swami 1993)), provides insight into the correspondence between items purchased in a retail trade environment, and web log analysis (qq.v. (Cooley, Mobasher and Srivastava 1997; Viveros, Wright, Elo-Dean and Duri 1997; Madria, Bhowmick, Ng and Lim 1999)), which attempts to derive a broad understanding of sequences of user activity on the internet. Examples of the latter include time series analysis and signal processing (Weigend and Gershenfeld 1993; Guralnik and Srivastava 1999; Han, Dong and Yin 1999). The rules resulting from investigations in both of these areas may or may not be the result of behavioural or structural conditions but, significantly, it is the rule [1] itself, rather than the underlying reasons behind the rule, which is generally the focus of interest. An alternative approach is employed in the field of medical knowledge discovery, in which the results of data mining are embedded within a process that interprets them as merely hints towards further, properly structured investigation into the reasons behind the rules (Lavrac 1999). This latter approach may also be usefully employed by knowledge discovery processes over geographic data. A third, and in some cases more useful, approach may be appropriate for many of those areas for which spatial and spatio-temporal rules might be mined.
This last approach accepts a (null) hypothesis and attempts to refine it (or disprove it) through the modification of the hypothesis as a result of knowledge discovery. This latter approach is carried out according to the principles of scientific experimentation and induction and has resulted in theories being developed and refined according to repeatable and accepted conventions.

The promises inherent in the development of data mining techniques and knowledge discovery processes are manifold and include an ability to suggest rich areas of future research in a manner which could yield unexpected correlations and causal relationships. However, the nature of such techniques is that they can also yield spurious and logically and statistically erroneous conjectures. Regardless of the process of discovery, the form of the input and the nature and allowable interpretation of the resulting rules can also vary significantly for knowledge discovery from geographic/spatio-temporal data, as opposed to that produced by conventional data mining algorithms. For example, the complexity of the rule space requires significant constraints to be placed on the rules that can be generated to avoid either excessive or useless findings. To this end, some structuring of the data (Lees 1996) will often enhance the generation of more relevant rules.

This chapter presents a discussion of the issues that make the discovery of spatial and spatio-temporal knowledge different, with an emphasis on geographical data. We discuss how the new opportunities of data mining can be integrated into a cohesive and, importantly, scientifically credible knowledge discovery process. This is particularly necessary for spatial and spatio-temporal discovery as the opportunity for meaningless and expensive diversions is high. We discuss the concepts of spatio-temporal knowledge discovery from geographic and other spatial data, the need to re-code temporal data to a more meaningful metric, the ideas behind higher order or meta-mining as well as scientific theory formation processes and, briefly, the need to acknowledge the second hand nature of much collected data.

Appeared as Roddick, J. F. and Lees, B. G. (2001). Paradigms for Spatial and Spatio-Temporal Data Mining. Geographic Data Mining and Knowledge Discovery. Taylor and Francis. Research Monographs in Geographic Information Systems. Miller, H. and Han, J., Eds.

[1] Although each form of data mining algorithm provides results with different semantics, we will use the term "rule" to describe all forms of mining output.

2. Mining from Spatial and Spatio-Temporal Data


Current approaches to spatial and spatio-temporal knowledge discovery exhibit a number of important characteristics that will be discussed in order to compare and contrast them with possible future directions. However, space precludes a full survey of the manner in which spatial and spatio-temporal knowledge discovery is currently undertaken and readers are directed to a number of other papers with reviews of the area (Bell, Anand and Shapcott 1994; Koperski, Adhikary and Han 1996; Abraham and Roddick 1998). In addition, a survey of temporal data mining research is available (Roddick and Spiliopoulou 2001) and a bibliography of temporal, spatial and spatio-temporal data mining research is currently being maintained (Roddick, Hornsby and Spiliopoulou 2001).

2.1. Rule Types
As discussed by Abraham and Roddick (1998), the forms that spatio-temporal rules may take are extensions of their static counterparts and at the same time are uniquely different from them. Five main types can be identified:

Spatio-Temporal Associations. These are similar in concept to their static counterparts as described by Agrawal et al. (Agrawal, Imielinski and Swami 1993). Association rules are of the form X → Y (c%, s%), where the occurrence of X is accompanied by the occurrence of Y in c% of cases (while X and Y occur together in a transaction in s% of cases) [2]. Spatio-temporal extensions to this form of rule require the use of spatial and temporal predicates (Koperski and Han 1995; Estivill-Castro and Murray 1998). Moreover, it should be noted that for temporal association rules, the emphasis moves from the data itself to changes in the data (Chen, Petrounias and Heathfield 1998; Ye and Keane 1998; Rainsford and Roddick 1999).

[2] Note that while support and confidence were introduced in (Agrawal, Imielinski and Swami 1993), considerable research has been undertaken into the nature of "interestingness" in mining rules - see for example (Silberschatz and Tuzhilin 1996; Dong and Li 1998; Bayardo Jr and Agrawal 1999; Freitas 1999; Sahar 1999).

Spatio-Temporal Generalisation. This is a process whereby concept hierarchies are used to aggregate data, thus allowing stronger rules to be located at the expense of specificity. Two types are discussed in the literature (Lu, Han and Ooi 1993): spatial-data-dominant generalisation proceeds by first ascending spatial hierarchies and then generalising attribute data by region, while nonspatial-data-dominant generalisation proceeds by first ascending the aspatial attribute hierarchies. Each may yield different rules. For example, the former may give a rule such as "South Australian summers are commonly hot and dry", while the latter may give "Hot, dry summers are often experienced by areas close to large desert systems".

Spatio-Temporal Clustering. While the complexity is far higher than that of its static, non-spatial counterpart, the ideas behind spatio-temporal clustering are similar - that is, either characteristic features of objects in a spatio-temporal region or the spatio-temporal characteristics of a set of objects are sought (Ng and Han 1994; Ng 1996).

Evolution Rules. This form of rule has an explicit temporal and spatial context and describes the manner in which spatial entities change over time. Due to the exponential number of rules that can be generated, it requires the explicit adoption of sets of predicates that are usable and understandable. Example predicates might include the following [3]:

follows: One cluster of objects traces the same (or similar) spatial route as another cluster at a later time. (I.e. spatial coordinates are fixed, time is varying). Other relationships in this class might include the temporal relationships discussed by Allen and Freksa (Allen 1983; Freksa 1992).

coincides: One cluster of objects traces the same (or similar) spatial path whenever a second cluster undergoes specified activity. (I.e. temporal coordinates are fixed, spatial activity varies). This may also include a causal relationship in which one cluster of objects undergoes some transformation or movement immediately after a second set undergoes some transformation or movement.

parallels: One cluster of objects traces the same (or a similar) spatial pattern but offset in space. (I.e. temporal coordinates are fixed, spatial activity varies). This class may include a number of common spatial translations (such as rotation, reflection, etc.).

mutates: One cluster of objects transforms itself into a second cluster. See the work of Hornsby, Egenhofer and others who examine change in geographic information (Hornsby and Egenhofer 1998).

This is by no means exhaustive but gives some idea as to what useful predicates may resemble.
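To make the predicate descriptions above more concrete, the following is a minimal sketch of how two of them might be tested. It is not the authors' formalisation: the representation of a cluster as a mapping from time step to centroid, the function names `follows` and `parallels`, and the parameters `lag` and `tol` are all assumptions introduced for illustration, and `parallels` handles pure translation only (not rotation or reflection).

```python
from math import hypot

# Assumed representation: time step -> (x, y) centroid of a cluster.
Trajectory = dict[int, tuple[float, float]]


def follows(leader: Trajectory, follower: Trajectory, lag: int, tol: float) -> bool:
    """True if `follower` retraces the route of `leader` exactly `lag` steps later.

    Every leader position at time t must be matched by a follower position
    at time t + lag that lies within distance `tol` of it.
    """
    if lag <= 0 or not leader:
        return False
    for t, (x, y) in leader.items():
        later = follower.get(t + lag)
        if later is None or hypot(later[0] - x, later[1] - y) > tol:
            return False
    return True


def parallels(a: Trajectory, b: Trajectory, tol: float) -> bool:
    """True if `b` traces the same pattern as `a` at the same time steps,
    displaced by a constant spatial offset (translation only)."""
    shared = sorted(set(a) & set(b))
    if len(shared) < 2:
        return False
    t0 = shared[0]
    dx, dy = b[t0][0] - a[t0][0], b[t0][1] - a[t0][1]
    return all(
        hypot(b[t][0] - a[t][0] - dx, b[t][1] - a[t][1] - dy) <= tol
        for t in shared
    )


# Hypothetical example data: two clusters on nearly the same route, three steps apart.
cluster_a = {0: (0.0, 0.0), 1: (1.0, 0.5), 2: (2.0, 1.0)}
cluster_b = {3: (0.1, 0.0), 4: (1.0, 0.6), 5: (2.1, 1.0)}
shifted_a = {0: (5.0, 5.0), 1: (6.0, 5.5), 2: (7.0, 6.0)}

print(follows(cluster_a, cluster_b, lag=3, tol=0.2))
print(parallels(cluster_a, shifted_a, tol=1e-9))
```

Fixing the tolerance and the lag in advance is one way of keeping the set of candidate evolution rules small, in the spirit of the constraint on predicates noted above.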

Meta-Rules. These are created when rulesets rather than datasets are inspected for trends and coincidental behaviour. They describe observations discovered amongst sets of rules, for example, "the support for suggestion X is increasing". This form of rule is particularly useful for temporal and spatio-temporal knowledge discovery and is discussed in more detail in Section 3.
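As an illustration of the support and confidence measures attached to an association rule, and of a simple meta-rule of the "support is increasing" kind, the following sketch may help. The item names, the two hypothetical observation periods and the function names are assumptions made for this example only; they do not come from any dataset discussed in this chapter.

```python
from typing import Iterable

Transaction = frozenset[str]


def support(transactions: list[Transaction], items: Iterable[str]) -> float:
    """Fraction of transactions that contain every item in `items`."""
    wanted = set(items)
    if not transactions:
        return 0.0
    return sum(wanted <= t for t in transactions) / len(transactions)


def confidence(transactions: list[Transaction], x: Iterable[str], y: Iterable[str]) -> float:
    """Fraction of the transactions containing X that also contain Y."""
    x, y = set(x), set(y)
    support_x = support(transactions, x)
    if support_x == 0.0:
        return 0.0
    return support(transactions, x | y) / support_x


def support_is_increasing(periods: list[list[Transaction]], x: Iterable[str], y: Iterable[str]) -> bool:
    """Meta-rule check: does the support of X -> Y rise strictly from period to period?"""
    both = set(x) | set(y)
    values = [support(p, both) for p in periods]
    return len(values) > 1 and all(a < b for a, b in zip(values, values[1:]))


# Hypothetical observations for two successive periods.
period_1 = [
    frozenset({"summer", "hot", "dry"}),
    frozenset({"summer", "hot"}),
    frozenset({"winter", "wet"}),
    frozenset({"summer", "dry"}),
]
period_2 = [
    frozenset({"summer", "hot"}),
    frozenset({"summer", "hot", "dry"}),
    frozenset({"summer", "hot"}),
    frozenset({"winter", "wet"}),
]

s = support(period_1, {"summer", "hot"})       # the s% of the rule summer -> hot
c = confidence(period_1, {"summer"}, {"hot"})  # the c% of the rule summer -> hot
print(s, c, support_is_increasing([period_1, period_2], {"summer"}, {"hot"}))
```

The trend check operates on per-period support values rather than on the raw data, which is the sense in which a meta-rule is an observation about rules.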

2.2. Spatial versus Spatio-Temporal Data
The dimensioning-up of the spatial dimension to include time was originally seen as a useful way of accommodating spatio-temporal data. However, the nature of time results in the semantics of time in discovered rules needing to be coded according to the relevant process aspect of time in order to make them useful. In most systems development, time is generally considered to be unidirectional and linear. Thus the relational concepts (before, during, etc.) are easily understood, communicated and accommodated. Conversely, space is perceived as bi-directional and, particularly in spatial/geographic applications, commonly non-linear.

Although both time and space are continuous phenomena, it is common to encode time as discrete and isomorphic with the integers, and a larger granularity is often selected (days, hours, etc.). Space, on the other hand, while a specific granularity is sometimes adopted, is often considered as isomorphic with the real numbers, and the granularity relative to the domain is generally smaller. Consider, for example, a land titles system in which spatial accuracy to a metre or less is required across cities or states that can be hundreds of kilometres or more wide (an area ratio typically of the order of 1:10^12). Time, on the other hand, is commonly recorded to the day over possibly three centuries (a granularity of 1:10^6). A counter example might be AVHRR data, captured within a couple of seconds, on a grid of 1.1 km, and commonly reported with a spatial resolution of about 4 km.

There is often agreement that recent events and/or the current state of the system are of more interest than past events. While one user may focus on a particular location, it is unlikely that all users of a system will focus on a particular geographic region. Indexing schemes are thus able to be oriented to the "now" point in time but not the "here" point in space.
Thus, when one is trying to extract new relationships from a database, simple dimensioning-up strategies work poorly. There have been numerous attempts to deal with time in the context of spatio-temporal data (see (Egenhofer and Golledge 1994) for a recent review) and the importance of recognising the differences between the spatial and temporal dimensions cannot be overstated, even when examining apparently static phenomena. Consideration of the temporal characteristics of some typical datasets used in data mining will highlight this. For example, spectral data represents an instant in time. The time slice is a very narrow and constrained sample of the phenomenon being observed. These data are often included in databases with environmental data of various sorts. In contrast to the spectral data, environmental data typically represents long-term estimates of mean and variance of very dynamic environmental variables. The time scale of this data is quite different to that of the spectral data. Spatial data (in geographic space) has characteristics that differ from both of these. It is reasonable to question whether this gross scale difference means that our accepted data mining procedures are flawed. This is not the case, however, as the time scale differences between the data types generally match the characteristics we wish to include in most analyses of land cover. For example, the spectral time slice provides discrimination between vegetation types while the environmental data provides long term conditions which match the time scale of germination, growth and development of the largest plants. When we are concerned with forecasting, say, crop production, then shorter time scales would be necessary, as is common practice. Very often, too little consideration is given to the appropriate temporal scales necessary. 
An example might be the monitoring of wetlands in the dry tropics. The extent of these land-cover elements varies considerably through time, both on a seasonal basis and from year to year. In many years, the inter-annual variability in extent is greater than the average annual variability. This means that a spectral image of wetland extent, without a precise annual and seasonal labelling in the database, and without monthly rainfall and evaporation figures, is a meaningless measurement. The temporal scales used in conjunction with spatial data are often inconsistent and need to be chosen more carefully.

As discussed above, time, as normally implemented in process models, is a simple, progressive step function. This fails to capture the essence of temporal change in both environmental and many cultural processes and is scale dependent. Whilst our normal indices of time are either categorical or linear, process time is essentially spatial in character. The mismatch in the data model for time may well underlie the difficulties that many data miners are experiencing in trying to incorporate spatial and temporal attributes into their investigations. For many spatio-temporal data mining exercises considerably better results will be achieved by a considered recoding of the temporal data.

The work of palaeo-climate reconstruction demonstrates this. In order to make sense of deep-ocean cores and ice cores, the results of down-core analyses are usually analysed using Fourier, or Spectral, Analysis to decompose the time series data into a series of repeating cycles. Most, if not all, of the cycles thus identified can be associated with the relative positioning of the Earth, the Sun, and the major planets in space. Consideration of this makes it clear that our useful assumption that geographic space is static and time invariant is flawed. The Cartesian description of location defined by latitude, longitude and elevation is not only an inaccurate representation of reality, it is an encumbrance to understanding the relationship between time and space. Time is a spatial phenomenon.
A fuller understanding of this leads to a resolution of some of the conceptual problems that bedevil the literature on spatio-temporal GIS modelling (Egenhofer and Golledge 1994). Properties of time concepts such as continuous/linear, discrete, monotonic and cyclic time tend to deal only with limited aspects of time and, as such, have limited application. In data mining, the process aspects of time are particularly important.

In order to progress this discussion, it is worth first returning to consider the Cartesian representation of space using latitude, longitude and elevation. A point on the Earth's surface defined by this schema is not static in space. It is moving, in a complicated but predictable way, through a complex energy environment. This movement, and the dynamics of the energy environment itself, is time. There are three main components to the environmental energy field: gravity, radiation and magnetism. These fluctuate in amplitude and effectiveness. The effectiveness of gravity, radiation and magnetism is almost entirely due to the relationships between bodies in space. Interaction between these forces and the internal dynamics of a body such as the Sun can alter the amplitude of its radiation and magnetism. These feedback relationships sound complex, but are predictable. The most important relationships have already been indexed as clock and calendar time. These are:

- The relative positions of a point on the surface of the Earth and the Sun (the diurnal cycle). This is a function of the rotation of the Earth, and the tilt of the Earth's axis relative to the plane of the ecliptic (the declination of the Sun).
- The orbit of the Moon around the Earth.
- The orbit of the Earth around the Sun.

Each of these relationships has a very significant relationship with the dynamics of both our natural, cultural, and even economic, environments. These dynamic spatial relationships are the basis of the index we call time, but do not include all the important phenomena we now understand to influence our local process environment. Others include:

- The Solar Day, which sweeps a pattern of four solar magnetic sectors past the Earth in about 27 days. Alternating sectors have reverse polarity and the passage of a sector boundary only takes a few hours. This correlates with a fluctuation in the generation of low-pressure systems.
- The lunar cycle. The lunar cycle is a 27.3-day period in the declination of the Moon during which it moves north for 13.65 days and south for 13.65 days. This too correlates with certain movements of pressure systems on the Earth.
- The Solar year. The Sun is not the centre of the Solar System. Instead, it orbits the barycentre of the Solar System, which at times passes through the Sun. The orbit is determined by the numerous gravitational forces within the Solar System, but tends to be dominated by the orbits of the larger planets, Jupiter and Saturn, at about 22-23 years. This orbit appears to affect solar emissions (the sunspot cycle). Notoriously, this cycle correlates with long term variation in a large number of natural, cultural and economic indices, from cricket scores and pig belly futures to a host of other, more serious, areas.
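As a small illustration of what coding a time stamp as position might look like, the sketch below converts a calendar date into an approximate declination of the Sun and into a point on the annual cycle. The function names are hypothetical, and the declination uses a widely quoted cosine approximation that is only rough; it is offered as an assumption-laden example of recoding, not as the procedure used by the authors.

```python
import math
from datetime import date


def solar_declination_deg(d: date) -> float:
    """Approximate declination of the Sun, in degrees, for a calendar date.

    Uses the common approximation -23.44 * cos(2*pi/365 * (N + 10)),
    where N is the day of the year. This is a rough estimate only.
    """
    n = d.timetuple().tm_yday
    return -23.44 * math.cos(2.0 * math.pi / 365.0 * (n + 10))


def annual_phase(d: date) -> tuple[float, float]:
    """Position in the Earth's annual orbit as a point on the unit circle.

    Encoding the day of the year as (sin, cos) keeps 31 December and
    1 January close together, unlike a linear day index.
    """
    n = d.timetuple().tm_yday
    angle = 2.0 * math.pi * (n - 1) / 365.0
    return math.sin(angle), math.cos(angle)


# Recode a few hypothetical observation dates.
for obs in (date(2001, 3, 21), date(2001, 6, 21), date(2001, 12, 21)):
    print(obs.isoformat(), round(solar_declination_deg(obs), 2), annual_phase(obs))
```

A mining algorithm given the declination column sees two dates with similar illumination conditions as similar values, even when their day-of-year indices are far apart.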

There are much longer periods which can be discussed, but the cycles listed above relate to both the Earth's energy environment and time on the sorts of scales we are most concerned with in data mining. These have been reviewed by Lees (1999). Time coded as position using these well understood astrophysical relationships is not an abstract concept. Such a coding correlates with energy variability which both drives our natural systems and influences many of our cultural systems. This coding also links directly to variations in spectral space. Illumination is a function of season (apparent declination of the Sun) and time of day (diurnal cycle), modified by latitude. The simple process of recoding the time stamp on data to a relevant continuous variable, such as Solar Declination or time of the Solar Year, rather than indices such as Julian Day, gives most intelligent data mining software a considerably better chance of identifying important relationships in spatio-temporal data.

2.3. Handling Second Hand Data
A significant issue for many systems, and one that is particularly applicable to geographical data, is the need to reuse data collected for other purposes. While few data collection methodologies are able to take into account the non-deterministic nature of data mining, the expense and in many cases the difficulty in performing data collection specifically for the knowledge discovery process results in heterogeneous data sources, each possibly collected for different purposes, commonly being brought together. This requires that the interpretation of such data be carefully considered. Possible errors that could result might include:

- The rules reflecting the heterogeneity of the data sets rather than any differences in the observed phenomena.
- The datasets being temporally incompatible. For instance, the data collection points may render useful comparison impossible. This is also an issue in time series mining, in which the scales of the different data sets must first be reconciled (q.v. (Berndt and Clifford 1995)).
- The collection methods being incompatible. For example, the granularities adopted or the aggregation methods of observations may differ. More severely, the implicit semantics of the observations may be different.

This puts particular emphasis on either or both of the quality of the data cleaning and the need for the mining process to take account of the allowable interpretations.

3. Meta-Mining as a Discovery Process Paradigm


The target of many mining operations has traditionally been the data itself. With the increase in data and the polynomial complexity of many mining algorithms, the direct extraction of useful rules from data becomes difficult. One solution to this, first suggested in the realm of temporal data mining (Abraham and Roddick 1997, 1999; Spiliopoulou and Roddick 2000) is to mine from either summaries of the data or from the results of previous mining exercises as shown in Figure 1.
[Figure 1: diagram relating the databases DB(a1) to DB(a5), the rulesets R(DB(a1)) to R(DB(a5)) and the summary S(R1...5).]

Figure 1 - Mining from Data and from Rulesets

Consider the following results (possibly amongst hundreds of others) of a mining run on UK weather regions [4]:
SeaArea(Hebrides), Windspeed(High), Humidity(Medium/High) → Forecast(Rain), LandArea(North Scotland)
SeaArea(Hebrides), Windspeed(Medium), Humidity(Medium/High) → Forecast(Rain), LandArea(North Scotland)
SeaArea(Hebrides), Windspeed(Low), Humidity(Medium/High) → Forecast(Fog), LandArea(North Scotland)
SeaArea(Hebrides), Windspeed(High), Humidity(Low) → Forecast(Windy), LandArea(North Scotland)
SeaArea(Hebrides), Windspeed(Medium), Humidity(Low) → Forecast(Light Winds), LandArea(North Scotland)
SeaArea(Malin), Windspeed(High), Humidity(Medium/High) → Forecast(Rain), LandArea(South Scotland)
SeaArea(Malin), Windspeed(Medium), Humidity(Medium/High) → Forecast(Rain), LandArea(South Scotland)
SeaArea(Malin), Windspeed(Low), Humidity(Medium/High) → Forecast(Fog), LandArea(South Scotland)
SeaArea(Malin), Windspeed(High), Humidity(Low) → Forecast(Windy), LandArea(South Scotland)
SeaArea(Malin), Windspeed(Medium), Humidity(Low) → Forecast(Light Winds), LandArea(South Scotland)
SeaArea(Rockall), Windspeed(High), Humidity(Medium/High) → Forecast(Rain), LandArea(Scotland)
SeaArea(Rockall), Windspeed(Medium), Humidity(Medium/High) → Forecast(Rain), LandArea(Scotland)
SeaArea(Rockall), Windspeed(Low), Humidity(Medium/High) → Forecast(Fog), LandArea(Scotland)
SeaArea(Rockall), Windspeed(High), Humidity(Low) → Forecast(Windy), LandArea(Scotland)
SeaArea(Rockall), Windspeed(Medium), Humidity(Low) → Forecast(Light Winds), LandArea(Scotland)

[4] Hebrides, Malin and Rockall are geographic "shipping" regions to the west of Scotland.

These rules may be inspected to create higher level rules such as:

SeaArea(West of Scotland), Windspeed(Medium/High), Humidity(Medium/High) → Forecast(Rain), LandArea(Scotland)

or even:

SeaArea(West of LandArea), Windspeed(Medium/High), Humidity(Medium/High) → Forecast(Rain)

These higher level rules can also be produced directly from the source data in a manner similar to the concept ascension algorithms of Cai, Han, Cercone (Cai, Cercone and Han 1991) and others. However, the source data is not always either available or tractable. Note that the semantics of meta mining must be carefully considered (qv. (Spiliopoulou and Roddick 2000)). Each rule generated from data is generated according to an algorithm that, to some extent, removes irrelevant data. Association rules, for example, provide a support and confidence rating which must be taken into account when meta-rules are constructed. Similarly, clusters may use criteria such as lowest entropy to group observations that may mask important outlying facts.
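The step from the individual weather rules to a higher level rule can be sketched as a small program that works on the rules alone, without access to the source data. The concept hierarchy `PARENT`, the tuple layout of a rule and the function names are assumptions made for this illustration. The sketch also ignores the support and confidence of each input rule which, as noted above, would need to be taken into account in any serious construction of meta-rules.

```python
from collections import defaultdict

# Hypothetical concept hierarchy: value -> more general concept.
PARENT = {
    "Hebrides": "West of Scotland",
    "Malin": "West of Scotland",
    "Rockall": "West of Scotland",
    "North Scotland": "Scotland",
    "South Scotland": "Scotland",
}

# Assumed rule layout: (sea area, wind speed, humidity, forecast, land area).
Rule = tuple[str, str, str, str, str]

AREAS = [("Hebrides", "North Scotland"), ("Malin", "South Scotland"), ("Rockall", "Scotland")]
PATTERNS = [
    ("High", "Medium/High", "Rain"),
    ("Medium", "Medium/High", "Rain"),
    ("Low", "Medium/High", "Fog"),
    ("High", "Low", "Windy"),
    ("Medium", "Low", "Light Winds"),
]
# The fifteen mined rules listed in the text.
RULES: list[Rule] = [
    (sea, wind, humidity, forecast, land)
    for sea, land in AREAS
    for wind, humidity, forecast in PATTERNS
]


def ascend(value: str) -> str:
    """Replace a value by its parent concept, if the hierarchy defines one."""
    return PARENT.get(value, value)


def generalise(rules: list[Rule]) -> dict[tuple[str, str, str, str], set[str]]:
    """Ascend the area hierarchies, then merge rules that differ only in wind speed.

    Returns a mapping from (sea area, humidity, forecast, land area) to the
    set of wind speeds under which that generalised rule was observed.
    """
    merged: dict[tuple[str, str, str, str], set[str]] = defaultdict(set)
    for sea, wind, humidity, forecast, land in rules:
        merged[(ascend(sea), humidity, forecast, ascend(land))].add(wind)
    return dict(merged)


for (sea, humidity, forecast, land), winds in generalise(RULES).items():
    wind_label = "/".join(sorted(winds))
    print(f"SeaArea({sea}), Windspeed({wind_label}), Humidity({humidity})"
          f" -> Forecast({forecast}), LandArea({land})")
```

With this input the fifteen rules collapse to four, one of which corresponds to the "West of Scotland" rain rule given in the text.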

4. Processes for Theory/Hypothesis Management


Analyses into geographic, geo-social, socio-political and environmental issues commonly require a more formal and in many cases strongly ethically driven approach. For example, environmental science uses a formal scientific experimentation process requiring the formulation and refutation of a credible null hypothesis. The development of data mining over the past few years has been largely oriented towards the discovery of previously unknown but potentially useful rules that are in some way interesting in themselves. To this end, a large number of algorithms have been proposed to generate rules of various types (association, classification, characterisation, etc.) according to the source, structure and dimensionality of the data and the knowledge sought. In addition, a number of different types of interestingness metric have also been proposed (qv. (Silberschatz and Tuzhilin 1996)) that strive to keep the exponential target rule space to within tractable limits. More recently, in many research forums, a holistic process-centred view of knowledge discovery has been discussed and the interaction between tool and user (and implicitly, between the rules discovered and the possibly tacit conceptual model) has been stressed. To a great extent the ambition of totally autonomous data mining has now been abandoned (Roddick 1999). This shift has resulted in the realisation that the algorithms used for mining and rule selection need to be put into a process-oriented context. This in turn raises the question of which processes might benefit from data mining research. One of the motivations for data mining has been the inability of conventional analytical tools to handle, within reasonable time limits, the quantities of data that are now being stored. Data mining is thus being seen as a useful method of providing some measure of automated insight into the data being collected. 
However, it has become apparent that while some useful rules can be mined and the discipline has had a number of notable successes, the potential for either logical or statistical error is extremely high and the results of much data mining are at best a set of suggested topics for further investigation.

4.1. The Process of Scientific Induction
The process of investigation (or knowledge discovery) can be considered as having two distinct forms: the process modelling approach, in which the real world is modelled in a mathematical manner and from which predictions can be in some way computed, and the pattern matching or inductive approach, in which prediction is made based on past experience. It is important to note that the process of rule generation through data mining is wholly the latter (in Figure 2, the right hand side), while scientific induction starts with the latter and aims to translate the process into one which is predominantly the former.

Another view of the scientific induction process is the following. Given a set of observations and an infinitely large hypothesis space, rules (i.e. trends, correlations, clusters, etc.) extracted from the observations constrain the hypothesis space until the space is such that a sufficiently restrictive description of that space can be formed. Experiments are commonly constructed to explore the boundaries between the known regions (i.e. those parts definitely in or out of the solution space). Of course, the intuition and experience of the scientist play a large part in designing adequate experiments that unambiguously determine whether a region is in or out of the final hypothesis.

The Fallacy of Induction comes into play when the hypothesis developed from the observations (or data) resides in a different part of the space from the true solution and yet it is not contradicted by the available data. The commonly accepted method to reduce the likelihood of a false assumption is to develop alternative hypotheses and prove these are false, and in so doing, to constrain the hypothesis space. An alternative process, which we outline briefly in this chapter, aims to support the development of scientific hypotheses through the accepted scientific methodology of null hypothesis creation and refutation.
To continue with the visual metaphor provided by the hypothesis space described above, data mining can be considered as a process for finding those parts of the hypothesis space that fit the observations and to return the results in the form of rules. A number of data mining systems have been developed which aim to describe and search the hypothesis space in a variety of ways.

[Figure 2: flow diagram with the boxes "Conceptualisation of Current Knowledge (i.e. Conceptual Model or Model of Behaviour)", "Hypothesis formed / revised to explain observed facts", "Process Modelling (Mathematical-based computation)", "Observations", "Pattern Matching / Visualisation" and "Prediction".]

Figure 2 - Investigation Paths in Discovery

Unfortunately, the complexity of such a task commonly results in either less than useful answers or high computational overhead. One of the reasons for this is that the search space is exponential in the number of data items, which is itself large. A common constraint therefore is to limit the structural complexity of a solution, for example, by restricting the number of predicates or terms in a rule. Data mining also commonly starts with a "clean sheet" approach and while restartable or iterative methods are being researched, little progress has been made to date. Another significant problem is that the data has often been collected at different times with different schemata, sometimes by different agents and commonly for an alternative purpose to data mining. Data mining is often only vaguely considered (if at all) and thus drawing accurate inferences from the data is often problematic.

4.2. Using Data Mining to Support Scientific Induction
An alternative solution is to develop (sets of) hypotheses that will constrain the search space by defining areas within which the search is to take place. Significantly, the hypotheses themselves are not examined; rather (sets of) null hypotheses are developed which are used instead. Figure 3 shows a schematic view of such a data mining process. In this model, a user supplied conceptual model (or an initial hypothesis) provides the starting point from which hypotheses are generated and tested. Generated hypotheses are first tested against known constraints and directed data mining routines then validate (to a greater or lesser extent) the revised theory. In cases where the hypothesis is generally supported, weight is added to the confidence of the conceptual model in accordance with the usual notion of scientific induction. In cases where the hypothesis is not supported, either a change to the conceptual model or a need for external input is indicated.

10

[Figure 3 diagram, box labels: Conceptual Model; Null Hypothesis; Data Mining Routines; Database; Rules conforming to Null Hypothesis; Mining "Success" (i.e. data was found to support the null hypothesis); Mining "Failure" (i.e. data did not support the null hypothesis); Conceptual Model supported; Alternative Hypotheses; Conceptual Model must be changed]

Figure 3 - Data Mining for Null Hypotheses

Note that the process provides three aspects of interest. Firstly, the procedure is able to accept a number of alternative conceptual models and provide a ranking between them based on the available observations. It also allows for modifications to a conceptual model in cases where the rules for such modification are codified. Secondly, the hypothesis generation component may yield new, hitherto unexplored insights into accepted conceptual models. Finally, the process employs directed mining algorithms and thus represents a reasonably efficient way of exploring large quantities of data, which is essential when mining high-dimensional datasets such as those used in geographic systems.

As this model relies heavily on the accepted process of scientific induction, the process is more acceptable to the general science community.
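The first aspect noted above, ranking alternative conceptual models against the available observations, might be sketched as follows. The chapter does not say how such a ranking is computed, so the scoring used here (the mean fraction of records consistent with each model's hypotheses) is purely an assumption, as are the names `support`, `rank_models` and the example attributes.

```python
from typing import Callable, Dict, List, Tuple

Record = Dict[str, float]
Hypothesis = Callable[[Record], bool]


def support(hypothesis: Hypothesis, records: List[Record]) -> float:
    """Fraction of records consistent with one hypothesis."""
    return sum(1 for r in records if hypothesis(r)) / len(records)


def rank_models(
    models: Dict[str, List[Hypothesis]], records: List[Record]
) -> List[Tuple[str, float]]:
    """Order conceptual models by the mean support of their hypotheses.
    This scoring rule is an illustrative assumption, not the authors' method."""
    scores = {
        name: sum(support(h, records) for h in hyps) / len(hyps)
        for name, hyps in models.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical observations of vegetation cover at four sites.
    records = [
        {"elevation": 200, "rainfall": 900, "cover": 0.8},
        {"elevation": 300, "rainfall": 500, "cover": 0.3},
        {"elevation": 700, "rainfall": 1000, "cover": 0.7},
        {"elevation": 900, "rainfall": 400, "cover": 0.2},
    ]
    # Two hypothetical conceptual models, each reduced to one testable hypothesis.
    models = {
        "elevation-driven": [lambda r: (r["cover"] > 0.5) == (r["elevation"] < 500)],
        "rainfall-driven": [lambda r: (r["cover"] > 0.5) == (r["rainfall"] > 800)],
    }
    for name, score in rank_models(models, records):
        print(name, score)
```

A fuller treatment would replace the raw support fraction with a proper significance test per hypothesis; the sketch shows only the shape of the comparison.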

5. Conclusion
The ideas outlined in this chapter differ, to some extent, from conventional research directions in spatio-temporal data mining and emphasise that the process within which the mining algorithm is used can significantly alter the interpretation of the results. Moreover, the knowledge discovery process must take account of this to avoid problems. Johnson-Laird (Johnson-Laird 1993) suggested that induction should come with a government warning; this is particularly true of spatio-temporal mining, as the scope for error is large. It should be noted that many of the ideas discussed in this chapter need further examination. For example, while the use of data mining for hypothesis refutation has been discussed in a few fora, to our knowledge there has been little serious investigation of the idea in a scientific setting. Before this can occur, a strong framework needs to be established and a credible method of hypothesis evolution needs to be defined.

Acknowledgements
We are particularly grateful to the NCGIA Varenius Project for organising the Seattle Workshop on Discovering Knowledge from Geographical Data, which enabled some of these ideas to be explored. We would also particularly like to thank Tamas Abraham, DSTO, Australia and Jonathan Raper, City University, London, UK for discussions and comments.

References
ABRAHAM, T. and RODDICK, J.F. (1997): Discovering meta-rules in mining temporal and spatiotemporal data. Proc. Eighth International Database Workshop, Data Mining, Data Warehousing and Client/Server Databases (IDW'97), Hong Kong, 30-41, FONG, J. (ed). Springer-Verlag.
ABRAHAM, T. and RODDICK, J.F. (1998): Opportunities for knowledge discovery in spatio-temporal information systems. Australian Journal of Information Systems 5(2):3-12.
ABRAHAM, T. and RODDICK, J.F. (1999): Incremental meta-mining from large temporal data sets. In Advances in Database Technologies, Proc. First International Workshop on Data Warehousing and Data Mining, DWDM'98. Lecture Notes in Computer Science, 1552:41-54. KAMBAYASHI, Y., LEE, D.K., LIM, E.-P., MOHANIA, M. and MASUNAGA, Y. (eds). Berlin, Springer-Verlag.
AGRAWAL, R., IMIELINSKI, T. and SWAMI, A. (1993): Mining Association Rules Between Sets of Items in Large Databases. Proc. ACM SIGMOD International Conference on Management of Data, Washington DC, USA, 22:207-216, ACM Press.
ALLEN, J.F. (1983): Maintaining knowledge about temporal intervals. Communications of the ACM 26(11):832-843.
BAYARDO JR, R.J. and AGRAWAL, R. (1999): Mining the most interesting rules. Proc. Fifth International Conference on Knowledge Discovery and Data Mining, San Diego, CA, USA, 145-154, CHAUDHURI, S. and MADIGAN, D. (eds). ACM Press.
BELL, D.A., ANAND, S.S. and SHAPCOTT, C.M. (1994): Data Mining in Spatial Databases. Proc. International Workshop on Spatio-Temporal Databases, Benicassim, Spain.
BERNDT, D.J. and CLIFFORD, J. (1995): Finding patterns in time series: a dynamic programming approach. In Advances in Knowledge Discovery and Data Mining. 229-248. FAYYAD, U.M., PIATETSKY-SHAPIRO, G., SMYTH, P. and UTHURUSAMY, R. (eds). AAAI Press/MIT Press.
CAI, Y., CERCONE, N. and HAN, J. (1991): Attribute-oriented induction in relational databases. In Knowledge Discovery in Databases. 213-228, (Ch. 12). PIATETSKY-SHAPIRO, G. and FRAWLEY, W.J. (eds). Cambridge, MA, AAAI Press/MIT Press.
CHEN, X., PETROUNIAS, I. and HEATHFIELD, H. (1998): Discovering temporal association rules in temporal databases. Proc. International Workshop on Issues and Applications of Database Technology (IADT'98), 312-319.
COOLEY, R., MOBASHER, B. and SRIVASTAVA, J. (1997): Web mining: information and pattern discovery on the World Wide Web. Proc. Ninth IEEE International Conference on Tools with Artificial Intelligence, 558-567, IEEE Comput. Soc, Los Alamitos, CA.
DONG, G. and LI, J. (1998): Interestingness of discovered association rules in terms of neighbourhood-based unexpectedness. Proc. Second Pacific-Asia Conference on Knowledge Discovery and Data Mining: Research and Development, Melbourne, Australia, 72-86, WU, X., KOTAGIRI, R. and KORB, K.B. (eds). Springer-Verlag.
EGENHOFER, M.J. and GOLLEDGE, R.J. (1994): Time in Geographic Space. Report on the Specialist Meeting of Research Initiative 10 94-9. National Centre for Geographic Information and Analysis, University of California, Santa Barbara.
ESTIVILL-CASTRO, V. and MURRAY, A.T. (1998): Discovering associations in spatial data - an efficient medoid based approach. Proc. Second Pacific-Asia Conference on Research and Development in Knowledge Discovery and Data Mining, PAKDD-98, 110-121, Springer-Verlag, Berlin.
FREITAS, A.A. (1999): On rule interestingness measures. Knowledge Based Systems 12(5-6):309-315.
FREKSA, C. (1992): Temporal reasoning based on semi-intervals. Artificial Intelligence 54:199-227.
GURALNIK, V. and SRIVASTAVA, J. (1999): Event Detection from Time Series Data. Proc. Fifth International Conference on Knowledge Discovery and Data Mining, San Diego, CA, USA, 33-42, CHAUDHURI, S. and MADIGAN, D. (eds). ACM Press.
HAN, J., DONG, G. and YIN, Y. (1999): Efficient Mining of Partial Periodic Patterns in Time Series Database. Proc. Fifteenth International Conference on Data Engineering, Sydney, Australia, 106-115, IEEE Computer Society.
HORNSBY, K. and EGENHOFER, M. (1998): Identity-Based Change Operations for Composite Objects. Proc. Eighth International Symposium on Spatial Data Handling, Vancouver, Canada, 202-213, POIKER, T. and CHRISMAN, N. (eds).
JOHNSON-LAIRD, P. (1993): The computer and the mind. London, 2nd Edn, Fontana Press.
KOPERSKI, K., ADHIKARY, J. and HAN, J. (1996): Knowledge Discovery in Spatial Databases: Progress and Challenges. Proc. ACM SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery, Montreal, Canada, 55-70.
KOPERSKI, K. and HAN, J. (1995): Discovery of Spatial Association Rules in Geographic Information Databases. Proc. Fourth International Symposium on Large Spatial Databases, Maine, 47-66.
LAVRAC, N. (1999): Selected techniques for data mining in medicine. Artificial Intelligence in Medicine 16:3-23.
LEES, B.G. (1996): Sampling strategies for machine learning using GIS. In GIS and Environmental Modelling: Progress and Research Issues. GOODCHILD, M.F., STEYAERT, L., PARKS, B. et al. (eds). Fort Collins, CO, GIS World Inc.
LEES, B.G. (1999): Cycles, Climatic. In Encyclopedia of Environmental Science. 105-107. ALEXANDER, D.E. and FAIRBRIDGE, R.W. (eds). New York, Van Nostrand Reinhold.
LU, W., HAN, J. and OOI, B.C. (1993): Discovery of General Knowledge in Large Spatial Databases. Proc. 1993 Far East Workshop on GIS (IEGIS 93), Singapore, 275-289.
MADRIA, S.K., BHOWMICK, S.S., NG, W.K. and LIM, E.-P. (1999): Research Issues in Web Data Mining. Proc. First International Conference on Data Warehousing and Knowledge Discovery, DaWaK '99, Florence, Italy, Lecture Notes in Computer Science, 1676:303-312, MOHANIA, M.K. and TJOA, A.M. (eds). Springer.
NG, R.T. (1996): Spatial Data Mining: Discovering Knowledge of Clusters from Maps. Proc. ACM SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery, Montreal, Canada.
NG, R.T. and HAN, J. (1994): Efficient and effective clustering methods for spatial data mining. Proc. Twentieth International Conference on Very Large Data Bases, Santiago, Chile, 144-155, BOCCA, J.B., JARKE, M. and ZANIOLO, C. (eds). Morgan Kaufmann.
RAINSFORD, C.P. and RODDICK, J.F. (1999): Adding Temporal Semantics to Association Rules. Proc. 3rd European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD'99), Prague, Lecture Notes in Artificial Intelligence, 1704:504-509, ZYTKOW, J.M. and RAUCH, J. (eds). Springer.
RODDICK, J.F. (1999): Data Warehousing and Data Mining: Are we working on the right things? In Advances in Database Technologies. Lecture Notes in Computer Science, 1552:141-144. KAMBAYASHI, Y., LEE, D.K., LIM, E.-P., MASUNAGA, Y. and MOHANIA, M. (eds). Berlin, Springer-Verlag.
RODDICK, J.F., HORNSBY, K. and SPILIOPOULOU, M. (2001): An Updated Bibliography of Temporal, Spatial and Spatio-Temporal Data Mining Research. In Post-Workshop Proceedings of the International Workshop on Temporal, Spatial and Spatio-Temporal Data Mining, TSDM2000. Lecture Notes in Artificial Intelligence, 2007. RODDICK, J.F. and HORNSBY, K. (eds). Berlin, Springer.
RODDICK, J.F. and SPILIOPOULOU, M. (2001): A Survey of Temporal Knowledge Discovery Paradigms and Methods. IEEE Transactions on Knowledge and Data Engineering.
SAHAR, S. (1999): Interestingness via what is not interesting. Proc. Fifth International Conference on Knowledge Discovery and Data Mining, San Diego, CA, USA, 332-336, CHAUDHURI, S. and MADIGAN, D. (eds). ACM Press.
SILBERSCHATZ, A. and TUZHILIN, A. (1996): What makes patterns interesting in knowledge discovery systems? IEEE Transactions on Knowledge and Data Engineering 8(6):970-974.
SPILIOPOULOU, M. and RODDICK, J.F. (2000): Higher Order Mining: Modelling and Mining the Results of Knowledge Discovery. In Data Mining II - Proc. Second International Conference on Data Mining Methods and Databases. 309-320. EBECKEN, N. and BREBBIA, C.A. (eds). Cambridge, UK, WIT Press.
VIVEROS, M.S., WRIGHT, M.A., ELO-DEAN, S. and DURI, S.S. (1997): Visitors' behavior: mining web servers. Proc. First International Conference on the Practical Application of Knowledge Discovery and Data Mining, 257-269, Practical Application Co, Blackpool, UK.
WEIGEND, A.S. and GERSHENFELD, N.A. (eds) (1993): Time Series Prediction: Forecasting the Future and Understanding the Past. Proc. NATO Advanced Research Workshop on Comparative Time Series Analysis XV. Santa Fe, New Mexico, Addison-Wesley.
YE, S. and KEANE, J.A. (1998): Mining association rules in temporal databases. Proc. International Conference on Systems, Man and Cybernetics, 2803-2808, IEEE, New York.
