Sunteți pe pagina 1din 40

Crime Data Visualization and Spatial Database Management System

Michael Wyland CS 6604 Virginia Tech NVC 24 April 2008

Agenda

Overview Project Context & Objectives Design


Database design Data Loading Optimization Application Interface

Extensions

Future Work Demonstration

Hotspot Detection & DBSCAN

Overview

Implemented Crime Spatial Database Management System components


Crime Data Engine Mapping and Visualization Capabilities

http://crime.dnsalias.com:9090/crime/

Project Context

Design and implement a Crime Spatial Database Management System and Application to support
General Public (awareness, safety) Police officer (optimize resources) Crime analyst (trending, patterns)

Project Objectives
Configure project infrastructure including Spatial DBMS Implement physical model for crime data Load Fairfax County Police Department data into the database Develop required queries (Window and KNearest Neighbor) Optimize query performance

Project Objectives (contd)


Implement Hotspot Identification (Cluster) Integrate mapping and visualization interfaces Create a publishable quality site

Data Structure

Crime records need to have some consistent attributes


Date/time of crime Type of crime Narrative or description of the crime Address of the crime

Data Structure (contd)

Spatial database and application requires that we store some additional info
Crime ID Latitude/longitude location of crime Spatial object Geocode accuracy

A single relation was defined:


Crime_type Geocode_lat Crime_address Geocode_lon Crime_dt Crime_geo_location Crime_city Geocode_accuracy
8

TCRIMES Crime_ID Narrative

Loading the Data

Wanted to pursue the Fairfax County Police Department data


Seemed like everyone else was doing City of Falls Church Police Department data FFX County is also a larger dataset both in number of records and space to play with

?????

SDBMS

Loading the Data (contd)

Several Challenges
All crime reports were in .pdf or .doc format which are difficult to access without an API No standard format/structure exists, so each crime record has a slightly different format To find date/time attribute requires literally reading the report text Reports actually contain some errors, like incorrect crime types in some cases Non-specific locations and times
10

Data Loading Strategy

Did not design a parser


Given the challenges above, short timeline, and purpose of this class (Not Parsers 101)

Designed web-based data input interface Required human assistance to manually read the reports and enter the crime info

My wife entered about 1251 records into my input interface 8 of 95 .doc files were reviewed and input Many assumptions made to correct the data
11

Data Loading Strategy (contd)


Enter the block address Address is GeoCoded using Google Maps API via Geocoding object and JSON response object Enter date/time, crime type, narrative Record saved in crimes table along with GeoCode results

No need to GeoCode on-the-fly later


12

Performance Planning

Indexes
Normal primary key indexes implemented Implemented R-tree index for spatial objects Hints for spatial index are used in my queries, based on Oracle Spatial best practice documentation

Examined Oracle Query Plans


Confirmed that the required queries are utilizing the normal and spatial indices and working efficiently

System responds in a timely fashion to multiple concurrent, large queries


13

Query Processing
Web app builds Dynamic SQL based on user input Web app interfaces with database (CDE) and passes SQL CDE retrieves data and passes data set back Web app binds data to visualizations

Query Type (window, k-near., range) Entity Type Selection (corners, dist., crime type) Entity Value Selection (set or range) Time Range
14

Charts Data Grids Maps

Web App (UI)

CDE

Data Set

Web App (Visualization)

Query Support

Window Query
Search inside a dragged box

K-Nearest Neighbor Query


Click a spot and choose # of results desired

Range Query or Circle Search


Click a spot and choose a distance

All queries can be limited by a date/time range and crime type


15

System Platform

Software
Windows XP Home IIS 5
Availability

Hardware
Server: Dell Dimension 8400 Client: Dell Inspiron 2650 Notebook

Oracle 10g Express Edition with Oracle Locator (reduced, free version of Oracle Spatial)
Familiarity Licensing Just enough features

Development
Visual Studio 2005 ASP.NET in Visual Basic Oracle Data Provider for .NET (ODP.NET)

Google Maps API Dundas Chart for ASP.NET Professional Evaluation

16

Demo Architecture

17

Crime Hotspot Identification

Chose to implement Crime Hotspot Identification using the DBSCAN algorithm


DBSCAN is a density-based spatial clustering algorithm driven by two parameters
Eps epsilon radius from each point MinPts minimum neighbors to categorize points

Straightforward approach makes for timely integration into the project Density-based approach is good for crime hotspots Resistance to noise makes DBSCAN a good choice for this application

18

DBSCAN Review

Density = # of points within a specified radius


Density is a score for each point based on Eps

Each point in the database is categorized


A core point has more than a specified number of points (MinPts) within Eps A border point has fewer than MinPts within Eps but is in the neighborhood of a core point A noise point is any point that is not a core or border

19

DBSCAN Implementation

To support incremental application of the DBSCAN algorithm, an additional relation was defined:
Density Eps MinPts Category Cluster_Label

TCLUSTERS Crime_ID

For each point this stores the density score for a given Eps, a categorization based on MinPts, and a cluster label
20

DBSCAN Implementation
First step is to calculate a density score for all points, based on Eps This is implemented using the sdo_within_distance function for each point, and counting results (grouped)

Results are inserted into the clusters table

For points which have no neighbors within Eps, they are directly added to the clusters table with a zero Density score
21

DBSCAN Implementation

Now that we have Density scores for all points, we need to categorize them
Update the cluster table to mark each point as core, border, or noise

If Density score >= MinPts


Mark as a core point

If Density score < MinPts but the point is within Eps distance of a core point
Mark as a border point

If not marked as core or border


Mark as a noise point
22

DBSCAN Implementation

Now that we have categorized each point as core, border, or noise we can get our first visual look at the results

23

DBSCAN Point Categorization

Core, Border, Noise


This example uses Eps=0.5, MinPts=20

24

An Early Insight

This example visually confirms that there are many noise points, so DBSCAN looks like a smart choice of clustering algorithm
DBSCAN is fairly resistant to noise Note: level of noise is similar with other Eps and MinPts values

25

DBSCAN Implementation
Next we need to actually identify and label the clusters But first we eliminate noise points

Simply delete these from the clusters table

Now lets review the labeling algorithm

26

DBSCAN Cluster Labeling

27

DBSCAN Cluster Labeling

So for each core point If the core point is not in a cluster already, label it as part of a new cluster Find all of its neighbors within distance Eps, and if they are not in a cluster already, label them as part of the same cluster too Then move on to the next unlabeled core point (start of next cluster) Using sdo_within_distance and loops, these labels are updated in the clusters table
28

DBSCAN Implementation

Now that we have labeled the clusters, we can get a visual look at the final results
This example uses Eps=0.25, MinPts=20 Notice Tysons, Rt 1 corridor, Springfield Mall

29

DBSCAN Implementation

Implemented DBSCAN as a stored procedure to be easily called with varying Eps and MinPts (and time range, crime types, etc.)
Recalculates all clusters relatively quickly

Lets review how we optimize Eps and MinPts


30

DBSCAN Optimization
Uses concept that kth nearest neighbors are at some similar distance in a cluster So we pick a k-value and sort all the points distance to that neighbor Example from class:

k=4 only

31

DBSCAN Optimization
DBSCAN: Determining Eps and MinPts
3 2.5

kth Nearest Neighbor Distance

2 k=1 k=4 k=9 k=10 k=15 k=20

1.5

0.5

0 1 51 101 151 201 251 301 351 401 451 501 551 601 651 701 751 801 851 901 951 100110511101115112011251 Points Sorted According to Distance of kth Nearest Neighbor

32

DBSCAN Optimization
Not so easy to choose a good point on that curve Original DBSCAN algorithm authors discussed in their paper that we should eyeball the plot for the first/last valley

33

DBSCAN Observations
DBSCAN does not work well with varying densities which is also a characteristic of this crime data This is observable when we change the order of the core point cluster labeling

Ex: Start w/ high vs. low density score cores?

34

DBSCAN Observations (contd)


The algorithm labels neighbors of the current core point regardless of whether we started a new cluster or not Initialize the cluster label to zero, not one Many different presentations of the DBSCAN algorithm

35

DBSCAN Algorithm Presentations


DBSCAN Algorithm

Eliminate noise points Perform clustering on the remaining points


Connect all core points with an edge that are less than Eps from each other. Make each group of connected core points into a separate cluster. Assign each border point to one of its associated clusters.

Tan,Steinbach, Kumar

Introduction to Data Mining

4/18/2004

28

DBSCAN Algorithm (simplified view for teaching)


1. 2.

3. 4. 5. 6.

Create a graph whose nodes are the points to be clustered For each core-point c create an edge from c to every point p in the -neighborhood of c Set N to the nodes of the graph; If N does not contain any core points terminate Pick a core point c in N Let X be the set of nodes that can be reached from c by going forward; 1. create a cluster containing X{c} 2. N=N/(X{c}) Continue with step 4

7.

Remarks: points that are not assigned to any cluster are outliers;
http://www2.cs.uh.edu/~ceick/7363/Papers/dbscan.pdf gives a more efficient implementation by performing steps 2 and 6 in parallel
Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 16

36

Publishable Quality (?)

37

Future Work

To be useful for a real law enforcement organization much more work needs done
System must handle more than one crime at a single location (map currently shows 1 clickable marker) partially due to non-specific crime addresses Crime addresses written as intersections (corner of abc street and xyz street) must be considered Need to support individual layers (enable/disable) for crime types Most upgrades rely on officers or someone inputting more details attributes and data into system
Only then can intelligence and case-building be really cool
38

Demonstration

Queries to demo CDE


Window Query K-Nearest Neighbor Query Range Query

Visualizations integrated with query demos


Mapping & charting, info windows, etc.

DBSCAN categorization and cluster detection demo


39

Questions?

40

S-ar putea să vă placă și