Agenda
Extensions
Overview
http://crime.dnsalias.com:9090/crime/
Project Context
Design and implement a Crime Spatial Database Management System and Application to support:
General public (awareness, safety)
Police officers (optimize resources)
Crime analysts (trending, patterns)
Project Objectives
Configure project infrastructure, including the Spatial DBMS
Implement physical model for crime data
Load Fairfax County Police Department data into the database
Develop required queries (Window and K-Nearest Neighbor)
Optimize query performance
Data Structure
The spatial database and application require that we store some additional information:
Crime ID
Latitude/longitude location of crime
Spatial object
Geocode accuracy
SDBMS
Several Challenges
All crime reports were in .pdf or .doc format, which are difficult to access without an API
No standard format/structure exists, so each crime record has a slightly different format
Finding the date/time attribute requires literally reading the report text
Reports contain some errors, such as incorrect crime types in some cases
Non-specific locations and times
Designed a web-based data input interface
Required human assistance to manually read the reports and enter the crime info
My wife entered about 1251 records into my input interface
8 of 95 .doc files were reviewed and input
Many assumptions were made to correct the data
Performance Planning
Indexes
Normal primary key indexes implemented
Implemented R-tree index for spatial objects
Hints for the spatial index are used in my queries, based on Oracle Spatial best practice documentation
Query Processing
Web app builds dynamic SQL based on user input
Web app interfaces with the database (CDE) and passes the SQL
CDE retrieves data and passes the data set back
Web app binds data to visualizations
Query Type (window, k-near., range)
Entity Type Selection (corners, dist., crime type)
Entity Value Selection (set or range)
Time Range
CDE
Data Set
Query Support
Window Query
Search inside a dragged box
System Platform
Software
Windows XP Home
IIS 5
Availability
Hardware
Server: Dell Dimension 8400
Client: Dell Inspiron 2650 Notebook
Oracle 10g Express Edition with Oracle Locator (reduced, free version of Oracle Spatial)
Familiarity
Licensing
Just enough features
Development
Visual Studio 2005
ASP.NET in Visual Basic
Oracle Data Provider for .NET (ODP.NET)
Demo Architecture
Straightforward approach makes for timely integration into the project
Density-based approach is good for crime hotspots
Resistance to noise makes DBSCAN a good choice for this application
DBSCAN Review
DBSCAN Implementation
To support incremental application of the DBSCAN algorithm, an additional relation was defined:
TCLUSTERS (Crime_ID, Density, Eps, MinPts, Category, Cluster_Label)
For each point this stores the density score for a given Eps, a categorization based on MinPts, and a cluster label
DBSCAN Implementation
The first step is to calculate a density score for all points, based on Eps
This is implemented by calling the sdo_within_distance function for each point and counting the results (grouped)
Points that have no neighbors within Eps are added directly to the clusters table with a Density score of zero
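The density step can be sketched outside the database as follows. This is a minimal illustration only, not the stored procedure from the project: it assumes planar coordinates and Euclidean distance in place of sdo_within_distance, and the names `density_scores` and `points` are hypothetical. A point is not counted as its own neighbor, which matches the zero score given to isolated points.

```python
from math import hypot


def density_scores(points, eps):
    """Return {crime_id: number of OTHER points within eps}.

    Assumption: `points` maps a crime ID to planar (x, y) coordinates
    and plain Euclidean distance stands in for sdo_within_distance.
    """
    scores = {}
    for cid, (x, y) in points.items():
        scores[cid] = sum(
            1
            for other, (ox, oy) in points.items()
            if other != cid and hypot(x - ox, y - oy) <= eps
        )
    return scores


# Three nearby points and one isolated point (made-up coordinates).
points = {1: (0.0, 0.0), 2: (0.1, 0.0), 3: (0.0, 0.1), 4: (5.0, 5.0)}
scores = density_scores(points, eps=0.25)
```

The brute-force loop compares every pair of points; in the database the spatial index is what keeps the equivalent query from doing that.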
DBSCAN Implementation
Now that we have Density scores for all points, we need to categorize them
Update the cluster table to mark each point as core, border, or noise
If the Density score < MinPts but the point is within Eps distance of a core point, mark it as a border point
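A rough sketch of the categorization, under assumptions: the slide only spells out the border rule, so the core rule (Density score >= MinPts) and the noise rule (everything else) are taken here from the standard DBSCAN definitions. The function name `categorize` and the sample data are hypothetical, and Euclidean distance again stands in for the spatial operator.

```python
from math import hypot


def categorize(points, density, eps, min_pts):
    """Label each point 'core', 'border', or 'noise'.

    Assumptions: core means density >= min_pts (standard DBSCAN rule,
    not stated on the slide); border means below min_pts but within
    eps of some core point; anything else is noise.
    """
    core = {cid for cid, d in density.items() if d >= min_pts}
    category = {}
    for cid, (x, y) in points.items():
        if cid in core:
            category[cid] = "core"
        elif any(hypot(x - points[c][0], y - points[c][1]) <= eps for c in core):
            category[cid] = "border"
        else:
            category[cid] = "noise"
    return category


# Made-up coordinates with density scores precomputed for eps = 0.25.
points = {1: (0.0, 0.0), 2: (0.1, 0.0), 3: (0.0, 0.1), 4: (0.3, 0.0), 5: (5.0, 5.0)}
density = {1: 2, 2: 3, 3: 2, 4: 1, 5: 0}
category = categorize(points, density, eps=0.25, min_pts=2)
```

Here point 4 has only one neighbor, but that neighbor (point 2) is a core point, so it becomes a border point rather than noise.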
DBSCAN Implementation
Now that we have categorized each point as core, border, or noise we can get our first visual look at the results
An Early Insight
This example visually confirms that there are many noise points, so DBSCAN looks like a smart choice of clustering algorithm
DBSCAN is fairly resistant to noise
Note: the level of noise is similar with other Eps and MinPts values
DBSCAN Implementation
Next we need to actually identify and label the clusters
But first we eliminate noise points
So for each core point:
If the core point is not in a cluster already, label it as part of a new cluster
Find all of its neighbors within distance Eps, and if they are not in a cluster already, label them as part of the same cluster too
Then move on to the next unlabeled core point (start of the next cluster)
Using sdo_within_distance and loops, these labels are updated in the clusters table
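The labeling loop might look roughly like the sketch below. It is an in-memory illustration, not the project's stored procedure. One assumption goes beyond the slide text: the neighbor search is repeated for every newly labeled core point, as in standard DBSCAN, so that a cluster can grow past the first core point's neighborhood. The name `label_clusters` and the sample data are hypothetical.

```python
from collections import deque
from math import hypot


def label_clusters(points, category, eps):
    """Assign a cluster label to every core and border point.

    Assumptions: noise points are skipped entirely; expansion continues
    through newly labeled core points (standard DBSCAN behaviour).
    """

    def neighbors(cid):
        x, y = points[cid]
        return [
            o
            for o, (ox, oy) in points.items()
            if o != cid and category[o] != "noise" and hypot(x - ox, y - oy) <= eps
        ]

    labels = {}
    next_label = 0
    for cid in points:
        if category[cid] != "core" or cid in labels:
            continue
        next_label += 1  # an unlabeled core point starts a new cluster
        labels[cid] = next_label
        queue = deque([cid])
        while queue:
            current = queue.popleft()
            for n in neighbors(current):
                if n not in labels:
                    labels[n] = next_label
                    if category[n] == "core":
                        queue.append(n)  # only core points keep expanding
    return labels


# Made-up coordinates and categories for illustration.
points = {1: (0.0, 0.0), 2: (0.2, 0.0), 3: (0.4, 0.0),
          4: (3.0, 3.0), 5: (3.2, 3.0), 6: (9.0, 9.0)}
category = {1: "core", 2: "core", 3: "border", 4: "core", 5: "border", 6: "noise"}
labels = label_clusters(points, category, eps=0.25)
```

A border point within Eps of core points from two different clusters is claimed by whichever cluster is processed first, so the result can depend on the order in which core points are visited.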
DBSCAN Implementation
Now that we have labeled the clusters, we can get a visual look at the final results
This example uses Eps=0.25, MinPts=20
Notice Tysons, Rt 1 corridor, Springfield Mall
DBSCAN Implementation
Implemented DBSCAN as a stored procedure to be easily called with varying Eps and MinPts (and time range, crime types, etc.)
Recalculates all clusters relatively quickly
DBSCAN Optimization
Uses the concept that, within a cluster, the distances to the kth nearest neighbor are roughly similar
So we pick a value of k and sort all points by their distance to that neighbor
Example from class:
k=4 only
DBSCAN Optimization
DBSCAN: Determining Eps and MinPts
[Figure: k-distance plot. x-axis: Points Sorted According to Distance of kth Nearest Neighbor]
DBSCAN Optimization
It is not so easy to choose a good point on that curve
The original DBSCAN authors suggest in their paper that we eyeball the plot for the first/last valley
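The values behind such a k-distance plot can be computed with a few lines. This is a brute-force sketch under assumptions (planar coordinates, Euclidean distance, ascending sort order); the function name `sorted_k_distances` and the sample points are hypothetical.

```python
from math import hypot


def sorted_k_distances(points, k):
    """Return each point's distance to its kth nearest neighbor, sorted ascending.

    Assumption: `points` is a list of planar (x, y) tuples with more
    than k entries; a point is not its own neighbor.
    """
    k_dist = []
    for i, (x, y) in enumerate(points):
        dists = sorted(
            hypot(x - ox, y - oy)
            for j, (ox, oy) in enumerate(points)
            if j != i
        )
        k_dist.append(dists[k - 1])
    return sorted(k_dist)


# Three points close together and one far away (made-up data).
sample = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (10.0, 0.0)]
curve = sorted_k_distances(sample, k=2)
```

Plotting `curve` against its index gives the k-distance graph; a sharp rise, like the jump to the last value here, is the kind of feature one looks for when picking Eps.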
DBSCAN Observations
DBSCAN does not work well with varying densities, which is also a characteristic of this crime data
This is observable when we change the order of the core point cluster labeling
1. Create a graph whose nodes are the points to be clustered
2. For each core point c, create an edge from c to every point p in the Eps-neighborhood of c
3. Set N to the nodes of the graph
4. If N does not contain any core points, terminate
5. Pick a core point c in N
6. Let X be the set of nodes that can be reached from c by going forward:
   1. create a cluster containing X ∪ {c}
   2. N = N \ (X ∪ {c})
7. Continue with step 4
Remarks: points that are not assigned to any cluster are outliers.
http://www2.cs.uh.edu/~ceick/7363/Papers/dbscan.pdf gives a more efficient implementation by performing steps 2 and 6 in parallel
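The graph formulation above can be sketched directly in a few lines. This is an unoptimized illustration under assumptions: Euclidean distance on planar coordinates, a core point defined as one with at least MinPts other points in its Eps-neighborhood, and the next core point picked in input order. The name `graph_dbscan` and the sample data are hypothetical, and this is not the more efficient variant from the linked paper.

```python
from math import hypot


def graph_dbscan(points, eps, min_pts):
    """Return (clusters, outliers) following the graph-based steps."""
    ids = list(points)  # step 1: the nodes of the graph

    def near(a, b):
        (ax, ay), (bx, by) = points[a], points[b]
        return hypot(ax - bx, ay - by) <= eps

    nbrs = {a: [b for b in ids if b != a and near(a, b)] for a in ids}
    core = {a for a in ids if len(nbrs[a]) >= min_pts}
    # Step 2: edges leave core points only.
    edges = {a: (nbrs[a] if a in core else []) for a in ids}

    remaining = set(ids)  # step 3
    clusters = []
    while True:
        candidates = [a for a in ids if a in remaining and a in core]
        if not candidates:  # step 4
            break
        c = candidates[0]  # step 5
        reached, stack = {c}, [c]  # step 6: follow edges forward from c
        while stack:
            for n in edges[stack.pop()]:
                if n in remaining and n not in reached:
                    reached.add(n)
                    stack.append(n)
        clusters.append(reached)  # step 6.1
        remaining -= reached  # step 6.2, then back to step 4
    return clusters, remaining


# Made-up coordinates for illustration.
points = {1: (0.0, 0.0), 2: (0.2, 0.0), 3: (0.4, 0.0),
          4: (3.0, 3.0), 5: (3.2, 3.0), 7: (3.1, 3.1), 6: (9.0, 9.0)}
clusters, outliers = graph_dbscan(points, eps=0.25, min_pts=2)
```

In this sample, point 2 is the only core point of the first group, yet points 1 and 3 still join its cluster because they are reachable from it by one forward edge.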
Tan, Steinbach, Kumar, Introduction to Data Mining, 4/18/2004
Future Work
To be useful for a real law enforcement organization, much more work needs to be done
System must handle more than one crime at a single location (map currently shows 1 clickable marker), partially due to non-specific crime addresses
Crime addresses written as intersections (corner of abc street and xyz street) must be considered
Need to support individual layers (enable/disable) for crime types
Most upgrades rely on officers or someone else inputting more detailed attributes and data into the system
Only then can intelligence and case-building be really cool
Demonstration
Questions?