Wyland Crime Data Base Management System.

Crime Data Visualization and Spatial Database Management System
Michael Wyland CS 6604 Virginia Tech NVC 24 April 2008
Agenda

Overview
Project Context & Objectives
Design
  Database design
  Data Loading
  Optimization
  Application Interface
Extensions
  Hotspot Detection & DBSCAN
Future Work
Demonstration
Overview
Implemented Crime Spatial Database Management System components

Crime Data Engine
Mapping and Visualization Capabilities
http://crime.dnsalias.com:9090/crime/
Project Context
Design and implement a Crime Spatial Database Management System and Application to support:
General Public (awareness, safety)
Police officer (optimize resources)
Crime analyst (trending, patterns)
Project Objectives
Configure project infrastructure including Spatial DBMS
Implement physical model for crime data
Load Fairfax County Police Department data into the database
Develop required queries (Window and K-Nearest Neighbor)
Optimize query performance
Project Objectives (contd)

Implement Hotspot Identification (Cluster)
Integrate mapping and visualization interfaces
Create a publishable quality site
Data Structure
Crime records need to have some consistent attributes

Date/time of crime
Type of crime
Narrative or description of the crime
Address of the crime
Data Structure (contd)
Spatial database and application require that we store some additional info
Crime ID
Latitude/longitude location of crime
Spatial object
Geocode accuracy
A single relation was defined:

TCRIMES (Crime_ID, Crime_type, Crime_address, Crime_dt, Crime_city, Narrative, Geocode_lat, Geocode_lon, Crime_geo_location, Geocode_accuracy)
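As a rough illustration, the TCRIMES relation can be mirrored as a record type. This is a sketch only: the field names follow the slide, but the Python types and the `has_location` helper are assumptions, since the slides do not give the Oracle column types.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class CrimeRecord:
    """One row of TCRIMES; field names follow the slide, types are assumed."""
    crime_id: int
    crime_type: str
    crime_dt: datetime
    crime_address: str
    crime_city: str
    narrative: str
    geocode_lat: Optional[float] = None
    geocode_lon: Optional[float] = None
    geocode_accuracy: Optional[int] = None
    # Crime_geo_location (the spatial object) lives in the database and is
    # not modeled here.

    def has_location(self) -> bool:
        """Hypothetical helper: True when a coordinate pair is present."""
        return self.geocode_lat is not None and self.geocode_lon is not None
```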
Loading the Data
Wanted to pursue the Fairfax County Police Department data

Seemed like everyone else was doing City of Falls Church Police Department data
FFX County is also a larger dataset, both in number of records and in space to play with
[Figure: how to load the data into the SDBMS?]
Loading the Data (contd)
Several Challenges
All crime reports were in .pdf or .doc format, which are difficult to access without an API
No standard format/structure exists, so each crime record has a slightly different format
Finding the date/time attribute requires literally reading the report text
Reports actually contain some errors, like incorrect crime types in some cases
Non-specific locations and times
Data Loading Strategy
Did not design a parser

Given the challenges above, short timeline, and purpose of this class (Not Parsers 101)
Designed web-based data input interface
  Required human assistance to manually read the reports and enter the crime info
My wife entered about 1251 records into my input interface
  8 of 95 .doc files were reviewed and input
  Many assumptions made to correct the data
Data Loading Strategy (contd)

Enter the block address
Address is GeoCoded using Google Maps API via Geocoding object and JSON response object
Enter date/time, crime type, narrative
Record saved in crimes table along with GeoCode results
No need to GeoCode on-the-fly later
Performance Planning
Indexes
Normal primary key indexes implemented
Implemented R-tree index for spatial objects
Hints for spatial index are used in my queries, based on Oracle Spatial best practice documentation
Examined Oracle Query Plans

Confirmed that the required queries are utilizing the normal and spatial indices and working efficiently
System responds in a timely fashion to multiple concurrent, large queries

Query Processing
Web app builds Dynamic SQL based on user input
Web app interfaces with database (CDE) and passes SQL
CDE retrieves data and passes data set back
Web app binds data to visualizations
[Figure: query flow. Web App (UI) inputs: Query Type (window, k-near., range), Entity Type Selection (corners, dist., crime type), Entity Value Selection (set or range), Time Range -> CDE -> Data Set -> Web App (Visualization) outputs: Charts, Data Grids, Maps]
Query Support

Window Query
Search inside a dragged box
K-Nearest Neighbor Query

Click a spot and choose # of results desired
Range Query or Circle Search

Click a spot and choose a distance
All queries can be limited by a date/time range and crime type
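The three query types can be sketched in memory as follows. This is not the project's implementation, which runs as SQL against Oracle Locator with an R-tree index; here a linear scan stands in for the index, and the dictionary keys `lat`/`lon`, the function names, and the use of miles are all assumptions made for illustration.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius; miles are an assumed unit


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/lon points, in miles."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def window_query(crimes, south, west, north, east):
    """Crimes inside the dragged box given by its corner coordinates."""
    return [c for c in crimes
            if south <= c["lat"] <= north and west <= c["lon"] <= east]


def knn_query(crimes, lat, lon, k):
    """The k crimes closest to the clicked spot, nearest first."""
    by_distance = sorted(
        crimes, key=lambda c: haversine_miles(lat, lon, c["lat"], c["lon"]))
    return by_distance[:k]


def range_query(crimes, lat, lon, miles):
    """Crimes within the chosen distance of the clicked spot (circle search)."""
    return [c for c in crimes
            if haversine_miles(lat, lon, c["lat"], c["lon"]) <= miles]
```

A date/time or crime-type restriction would be one more filter applied to the same list before or after the spatial test.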

System Platform
Software
  Windows XP Home
  IIS 5
    Availability
  Oracle 10g Express Edition with Oracle Locator (reduced, free version of Oracle Spatial)
    Familiarity
    Licensing
    Just enough features
Hardware
  Server: Dell Dimension 8400
  Client: Dell Inspiron 2650 Notebook
Development
  Visual Studio 2005
  ASP.NET in Visual Basic
  Oracle Data Provider for .NET (ODP.NET)
  Google Maps API
  Dundas Chart for ASP.NET Professional Evaluation
Demo Architecture
Crime Hotspot Identification
Chose to implement Crime Hotspot Identification using the DBSCAN algorithm

DBSCAN is a density-based spatial clustering algorithm driven by two parameters
Eps: epsilon radius from each point
MinPts: minimum neighbors to categorize points
Straightforward approach makes for timely integration into the project
Density-based approach is good for crime hotspots
Resistance to noise makes DBSCAN a good choice for this application
DBSCAN Review

Density = # of points within a specified radius

Density is a score for each point based on Eps
Each point in the database is categorized

A core point has more than a specified number of points (MinPts) within Eps
A border point has fewer than MinPts within Eps but is in the neighborhood of a core point
A noise point is any point that is not a core or border
DBSCAN Implementation
To support incremental application of the DBSCAN algorithm, an additional relation was defined:
TCLUSTERS (Crime_ID, Density, Eps, MinPts, Category, Cluster_Label)
For each point this stores the density score for a given Eps, a categorization based on MinPts, and a cluster label
First step is to calculate a density score for all points, based on Eps
This is implemented using the sdo_within_distance function for each point, and counting results (grouped)
Results are inserted into the clusters table
Points which have no neighbors within Eps are added directly to the clusters table with a zero Density score
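The density step can be sketched outside the database as a neighbor count per point. This is an assumption-laden illustration, not the stored procedure: it uses planar coordinates with `math.dist` in place of sdo_within_distance, and it excludes the point itself from its own count, which is inferred from the zero score given to points with no neighbors.

```python
import math


def density_scores(points, eps):
    """Return {point_id: number of other points within eps of that point}.

    `points` maps an id to an (x, y) tuple; planar distance is an assumption.
    """
    scores = {}
    for pid, p in points.items():
        scores[pid] = sum(
            1 for qid, q in points.items()
            if qid != pid and math.dist(p, q) <= eps
        )
    return scores
```

Isolated points come out with a score of zero here without any special handling, because the count is computed per point rather than from a grouped join.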
Now that we have Density scores for all points, we need to categorize them
Update the cluster table to mark each point as core, border, or noise
If Density score >= MinPts

Mark as a core point
If Density score < MinPts but the point is within Eps distance of a core point
Mark as a border point
If not marked as core or border

Mark as a noise point
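The three rules above can be sketched as a single function over precomputed density scores. The function name, the dictionary layout, and the planar distance are assumptions for illustration; the project itself does this with updates on the clusters table.

```python
import math


def categorize(points, density, eps, min_pts):
    """Label each point 'core', 'border', or 'noise'.

    `points` maps id -> (x, y); `density` maps id -> density score for eps.
    """
    core = {pid for pid, score in density.items() if score >= min_pts}
    category = {}
    for pid, p in points.items():
        if pid in core:
            category[pid] = "core"
        elif any(math.dist(p, points[c]) <= eps for c in core):
            # Below MinPts, but within Eps of at least one core point.
            category[pid] = "border"
        else:
            category[pid] = "noise"
    return category
```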
Now that we have categorized each point as core, border, or noise we can get our first visual look at the results
DBSCAN Point Categorization
Core, Border, Noise

This example uses Eps=0.5, MinPts=20
An Early Insight
This example visually confirms that there are many noise points, so DBSCAN looks like a smart choice of clustering algorithm
DBSCAN is fairly resistant to noise
Note: level of noise is similar with other Eps and MinPts values
Next we need to actually identify and label the clusters
But first we eliminate noise points
Simply delete these from the clusters table
Now let's review the labeling algorithm
DBSCAN Cluster Labeling
DBSCAN Cluster Labeling
So for each core point
  If the core point is not in a cluster already, label it as part of a new cluster
  Find all of its neighbors within distance Eps, and if they are not in a cluster already, label them as part of the same cluster too
  Then move on to the next unlabeled core point (start of next cluster)
Using sdo_within_distance and loops, these labels are updated in the clusters table
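A minimal sketch of the labeling pass as described above, assuming planar coordinates and in-memory dictionaries instead of sdo_within_distance and the clusters table. The function and variable names are hypothetical. Noise points are skipped rather than deleted, which has the same effect on the labels.

```python
import math


def label_clusters(points, category, eps):
    """Single pass over the core points; returns {point_id: cluster_label}.

    `points` maps id -> (x, y); `category` maps id -> 'core'/'border'/'noise'.
    """
    labels = {}
    cluster = 0  # starts at zero and is incremented before each new cluster
    for pid, p in points.items():  # visiting order affects the result
        if category[pid] != "core":
            continue
        if pid not in labels:
            cluster += 1
            labels[pid] = cluster
        # Label unlabeled neighbors with this core point's cluster, whether
        # or not a new cluster was just started.
        for qid, q in points.items():
            if qid in labels or category[qid] == "noise":
                continue
            if math.dist(p, q) <= eps:
                labels[qid] = labels[pid]
    return labels
```

Because each core point is visited once and neighbors are not expanded recursively, the outcome can depend on the order in which core points are visited.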
Now that we have labeled the clusters, we can get a visual look at the final results
This example uses Eps=0.25, MinPts=20
Notice Tysons, Rt 1 corridor, Springfield Mall
Implemented DBSCAN as a stored procedure to be easily called with varying Eps and MinPts (and time range, crime types, etc.)
Recalculates all clusters relatively quickly
Let's review how we optimize Eps and MinPts
DBSCAN Optimization
Uses the concept that, within a cluster, the kth nearest neighbors are at roughly the same distance
So we pick a k-value and sort all the points by distance to that kth neighbor
Example from class: k=4 only
DBSCAN Optimization
[Figure: "DBSCAN: Determining Eps and MinPts". y-axis: kth Nearest Neighbor Distance; x-axis: Points Sorted According to Distance of kth Nearest Neighbor; one curve each for k=1, 4, 9, 10, 15, 20]
DBSCAN Optimization
Not so easy to choose a good point on that curve
The original DBSCAN authors suggest in their paper that we eyeball the plot for the first/last valley
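The curve being eyeballed can be computed with a few lines. This is a brute-force sketch with assumed planar coordinates and a hypothetical function name; it requires more than k points and makes no attempt to pick the valley automatically.

```python
import math


def k_distance_curve(points, k):
    """Distance from each point to its kth nearest neighbor, sorted ascending.

    `points` is a list of (x, y) tuples with more than k entries.
    """
    curve = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        curve.append(dists[k - 1])
    return sorted(curve)
```

Plotting the returned list against its index gives one curve per k; a candidate Eps is read off where the curve bends sharply, with MinPts set to the k used.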
DBSCAN Observations
DBSCAN does not work well with varying densities, which is also a characteristic of this crime data
This is observable when we change the order of the core point cluster labeling
Ex: Start w/ high vs. low density score cores?
DBSCAN Observations (contd)

The algorithm labels neighbors of the current core point regardless of whether we started a new cluster or not
Initialize the cluster label to zero, not one
Many different presentations of the DBSCAN algorithm
DBSCAN Algorithm Presentations

DBSCAN Algorithm

Eliminate noise points
Perform clustering on the remaining points
  Connect all core points with an edge that are less than Eps from each other.
  Make each group of connected core points into a separate cluster.
  Assign each border point to one of its associated clusters.
Tan, Steinbach, Kumar, Introduction to Data Mining, 4/18/2004
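The connected-components presentation quoted above can be sketched as follows, assuming planar coordinates and points that are already categorized. The function name and data layout are hypothetical, and a border point is simply assigned to the first core point found within Eps, since the slide leaves that choice open.

```python
import math


def dbscan_components(points, category, eps):
    """Cluster core points by connectivity, then attach border points.

    `points` maps id -> (x, y); `category` maps id -> 'core'/'border'/'noise'.
    """
    core = [pid for pid in points if category[pid] == "core"]
    labels = {}
    cluster = 0
    for start in core:
        if start in labels:
            continue
        cluster += 1
        labels[start] = cluster
        stack = [start]
        while stack:  # depth-first walk over core points linked by an edge
            cur = stack.pop()
            for other in core:
                if other not in labels and math.dist(points[cur], points[other]) < eps:
                    labels[other] = cluster
                    stack.append(other)
    for pid in points:
        if category[pid] == "border":
            near = [c for c in core if math.dist(points[pid], points[c]) < eps]
            if near:
                labels[pid] = labels[near[0]]
    return labels
```

Because core points are expanded transitively, the grouping of core points does not depend on the order in which they are visited; only the assignment of a border point that touches two clusters does.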
DBSCAN Algorithm (simplified view for teaching)

1. Create a graph whose nodes are the points to be clustered
2. For each core-point c create an edge from c to every point p in the ε-neighborhood of c
3. Set N to the nodes of the graph
4. If N does not contain any core points, terminate
5. Pick a core point c in N
6. Let X be the set of nodes that can be reached from c by going forward
   1. Create a cluster containing X ∪ {c}
   2. N = N \ (X ∪ {c})
7. Continue with step 4
Remarks: points that are not assigned to any cluster are outliers;
http://www2.cs.uh.edu/~ceick/7363/Papers/dbscan.pdf gives a more efficient implementation by performing steps 2 and 6 in parallel
Tan, Steinbach, Kumar, Introduction to Data Mining, 4/18/2004
Publishable Quality (?)
Future Work
To be useful for a real law enforcement organization, much more work needs to be done
System must handle more than one crime at a single location (map currently shows 1 clickable marker), partially due to non-specific crime addresses
Crime addresses written as intersections (corner of abc street and xyz street) must be considered
Need to support individual layers (enable/disable) for crime types
Most upgrades rely on officers or someone inputting more detailed attributes and data into the system
Only then can intelligence and case-building be really cool
Demonstration
Queries to demo CDE

Window Query
K-Nearest Neighbor Query
Range Query
Visualizations integrated with query demos

Mapping & charting, info windows, etc.
DBSCAN categorization and cluster detection demo

Questions?