
Developing a Database Plan

The first step in creating a database is developing a plan that serves both as a guide when implementing the database and as a functional specification for the database after it has been implemented. The complexity and detail of a database design are dictated by the complexity and size of the database application, as well as by the user population. The nature and complexity of a database application, and also the process of planning it, can vary significantly. A database can be relatively simple and designed for use by a single person, or it can be large and complex and designed, for example, to handle all the banking transactions for thousands of clients. In the first case, the database design may be little more than a few notes on scratch paper. In the latter case, the design may be a formal document hundreds of pages long that contains every possible detail about the database. In planning the database, regardless of its size and complexity, use the following basic steps:

- Gather information.
- Identify the objects.
- Model the objects.
- Identify the types of information for each object.
- Identify the relationships between objects.

Gathering Information
Before creating a database, you must have a good understanding of the job the database is expected to perform. If the database is to replace a paper-based or manually performed information system, the existing system will give you most of the information you need. You should interview everyone involved in the system to determine what they do and what they need from the database. It is also important to identify what they want the new system to do, and to pinpoint the problems, limitations, and bottlenecks of any existing system. Collect copies of customer statements, inventory lists, management reports, and any other documents that are part of the existing system, because these will be useful to you in designing the database and its interfaces.

Identifying the Objects


During the process of gathering information, you must identify the key objects or entities that will be managed by the database. An object can be a tangible thing, such as a person or a product, or it can be a more intangible item, such as a business transaction, a department in a company, or a payroll period. There are generally a few primary objects, and after these are identified, the related items become visible. Each distinct object in your database should have a corresponding table. The primary object in the AdventureWorks2008R2 sample database included with SQL Server is a bicycle. The objects related to bicycles within this company's business are the employees who manufacture the bicycles, the vendors that sell components used to manufacture the bicycles, the customers who buy them, and the sales transactions performed with the customers. Each of these objects is a table in the database.

Modeling the Objects


As the objects in the system are identified, you should record them in a way that represents the system visually. You can use your database model as a reference during implementation of the database. For this purpose, database developers use tools that range in technical complexity from pencils and scratch paper to word processing and spreadsheet programs, and even to software programs created specifically for the job of data modeling for database designs. Whatever tool you decide to use, it is important that you keep it up to date.

Identifying the Types of Information for Each Object


After the primary objects in the database have been identified as candidates for tables, the next step is to identify the types of information that must be stored for each object. These become the columns in the object's table. The columns in a database table contain a few common types of information:

- Raw data columns: These columns store tangible pieces of information, such as names, determined by a source external to the database.
- Categorical columns: These columns classify or group the data and store a limited selection of values, such as true/false, married/single, and VP/Director/Group Manager.
- Identifier columns: These columns provide a mechanism to identify each item stored in the table. These columns frequently have ID or number in their name, for example, employee_id, invoice_number, and publisher_id. The identifier column is the primary component used by both users and internal database processing for gaining access to a row of data in the table. Sometimes the object has a tangible form of ID used in the table, for example, a social security number, but in most situations you can define the table so that a reliable, artificial ID can be created for the row.
- Relational or referential columns: These columns establish a link between information in one table and related information in another table. For example, a table that tracks sales transactions will generally have a link to the customers table, so that the complete customer information can be associated with the sales transaction.
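The four column kinds can be sketched in a small schema. The following is a minimal illustration using Python's built-in sqlite3 module; the table and column names here are hypothetical, not taken from the AdventureWorks sample:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Parent table for the referential column below.
cur.execute("CREATE TABLE departments (department_id INTEGER PRIMARY KEY, dept_name TEXT)")

# Hypothetical employees table illustrating the four column kinds.
cur.execute("""
    CREATE TABLE employees (
        employee_id    INTEGER PRIMARY KEY,  -- identifier column
        name           TEXT NOT NULL,        -- raw data column
        marital_status TEXT CHECK (marital_status IN ('married', 'single')),  -- categorical column
        department_id  INTEGER REFERENCES departments(department_id)          -- referential column
    )
""")

cur.execute("INSERT INTO departments VALUES (1, 'Manufacturing')")
cur.execute("INSERT INTO employees VALUES (10, 'Ana Pop', 'single', 1)")

# The referential column links each employee row back to a department row.
row = cur.execute("""
    SELECT e.name, d.dept_name
    FROM employees e JOIN departments d ON e.department_id = d.department_id
""").fetchone()
print(row)  # ('Ana Pop', 'Manufacturing')
```

The CHECK constraint is one simple way to restrict a categorical column to its limited selection of values; a lookup table is another common choice.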

Identifying the Relationships Between Objects


One of the strengths of a relational database is the ability to relate or associate information about various items in the database. Isolated types of information can be stored separately, but the database engine can combine the data when it is required. Identifying the relationships between objects in the design process requires looking at the tables, determining how they are logically related, and adding relational columns that establish a link from one table to another. For example, the designer of the AdventureWorks2008R2 database has created tables for products and product models. The Production.Product table contains information for each product, including an identifier column named ProductID and data columns for the product's name, price, color, size, and weight. The table also contains categorical columns, such as Class and Style, that let products be grouped by these attributes. Each product also has a product model, but that information is stored in another table. Therefore, the Production.Product table has a ProductModelID column that stores just the ID of the product model. When a row of data is added for a product, the value for ProductModelID must already exist in the Production.ProductModel table.
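The rule that a ProductModelID value must already exist in the referenced table is exactly what a foreign key constraint enforces. A minimal sketch with Python's sqlite3 module (simplified tables, not the actual AdventureWorks schema; note that SQLite only enforces foreign keys when the pragma is enabled):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK enforcement off by default

conn.execute("CREATE TABLE ProductModel (ProductModelID INTEGER PRIMARY KEY, Name TEXT)")
conn.execute("""
    CREATE TABLE Product (
        ProductID      INTEGER PRIMARY KEY,
        Name           TEXT,
        ProductModelID INTEGER REFERENCES ProductModel(ProductModelID)
    )
""")

conn.execute("INSERT INTO ProductModel VALUES (1, 'Mountain-100')")
conn.execute("INSERT INTO Product VALUES (100, 'Mountain Bike', 1)")  # OK: model 1 exists

# A product whose ProductModelID has no matching row is rejected by the engine.
try:
    conn.execute("INSERT INTO Product VALUES (101, 'Road Bike', 99)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

print(rejected)  # True
```

In SQL Server the same guarantee would come from a FOREIGN KEY constraint declared on the column; the point is that the relationship identified during design becomes a constraint the engine enforces at insert time.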

Question: I've been using database snapshots since SQL Server 2005 was released, but recently I've been having problems where the snapshot becomes unavailable because there isn't enough space. I've read that a database snapshot reserves space when it is created, so how can it run out of space?

Answer: Unfortunately, what you've read is incorrect. A database snapshot does not reserve space when it is created, so it is quite possible for it to run out of space (and hence become unusable). Database snapshots use NTFS sparse files: you can have an arbitrarily large file that takes up very minimal space on disk. For instance, a 100GB sparse file that contains 2MB of data at offset zero and 3MB of data at offset 64MB will occupy only 5MB of disk space instead of 100GB. NTFS keeps track of which offsets in the file contain data and stores the data in a compacted form so that minimal space is required. This is the same concept as sparse arrays in programming languages.
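The sparse-array analogy the answer mentions can be made concrete with a toy model: only the extents that were actually written consume storage, while the file's nominal size can be far larger. This sketch models the bookkeeping only, not real NTFS behaviour:

```python
class SparseFile:
    """Toy model of a sparse file: only written extents consume storage."""

    def __init__(self, nominal_size):
        self.nominal_size = nominal_size
        self.extents = {}  # offset -> number of bytes actually written there

    def write(self, offset, data_len):
        self.extents[offset] = data_len

    def disk_usage(self):
        # Space actually consumed is just the sum of the written extents.
        return sum(self.extents.values())


MB = 1024 ** 2
GB = 1024 ** 3

f = SparseFile(nominal_size=100 * GB)  # the "100GB" file from the answer
f.write(0, 2 * MB)                     # 2MB of data at offset zero
f.write(64 * MB, 3 * MB)               # 3MB of data at offset 64MB

print(f.nominal_size // GB, "GB nominal")  # 100 GB nominal
print(f.disk_usage() // MB, "MB on disk")  # 5 MB on disk
```

This mirrors the example in the answer: the file reports 100GB, but only 5MB of real storage is consumed, which is also why a snapshot can unexpectedly run out of space as more of it fills in.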

Things to consider when planning a database


An introduction to cluster planning
Clusters are fairly common, so you might be wondering why they merit so much planning. The reason lies in the very nature of server clusters. As I mentioned earlier, server clusters are best suited to servers that are hosting database applications. When a server is hosting a database application, transactions are typically made against the database very frequently, and these transactions tend to be problematic for a cluster. Imagine, for example, that you have three servers that are all hosting the same database application. In a situation like this, you never know which server a user is going to connect to. If the user were to add or modify a database record, the update would be applied to the copy of the database running on the server that the user connected to. The other two servers hosting the same application would remain blissfully unaware of the update. As you can see, even if each server were to start out with an identical copy of the database, it would not take long for the three servers to each have very different versions of the data.

Continuity of data is therefore a major concern for clustered database application servers. Clustering a database application server simply will not work unless there is a way of guaranteeing that each server in the cluster always has access to the same data set. When Microsoft created server clusters, there were a couple of different ways that they could have taken care of the data continuity issue. One option would have been to immediately replicate any changes to a database to all the other copies of the database. Ultimately, though, Microsoft chose not to use this approach because it has several problems. First, latency has the potential to affect the database's integrity. For example, what would happen if someone were to update a database record, and then another user were to make a different change to the same record, but on a different server, before the first change could be replicated? Another issue is bandwidth. In an environment in which changes are being made to a data set on a frequent basis, replicating those changes between multiple servers can consume a tremendous amount of bandwidth. Because of these and other issues, Microsoft chose not to use the replication method for server clusters (although other Microsoft products use this method to maintain consistency between multiple copies of a nonclustered database). Instead, Microsoft chose to solve the data continuity problem by having all cluster nodes share a single copy of the database. Typically, this means that each node in the cluster has a direct connection to a centralized storage system that is shared by all of the cluster nodes. It is this shared storage that makes designing a cluster as complex as it is expensive.

One of the reasons why so much planning is in order is that clusters are designed to be fault tolerant, and it is difficult for a cluster to be truly fault tolerant if all of the nodes share a common storage system. Of course, you can mitigate the effects of a hard disk failure by implementing a RAID array (which is also a good idea from a performance standpoint). However, implementing a RAID array alone does not guarantee true fault tolerance. For example, suppose that the RAID array's power supply were to fail. This failure would effectively undermine the cluster's fault tolerant capabilities. Even if the storage system contained fully redundant hardware and was on a backup generator, there are still things that could happen that would cause the storage system to fail. One example is that the facility containing the storage system could be destroyed by a hurricane, fire, or other disaster. A less dramatic, but equally disruptive, situation involves database corruption: if the database were to suddenly become corrupt for some reason, the cluster would come to a grinding halt until you restored the database from backup.

The point is that no matter how much work you put into planning a cluster, there are always going to be situations beyond your control that could theoretically cause the cluster to fail. Therefore, one of the most important considerations in planning the cluster should be whether the cost of downtime warrants the cost of the cluster. I have been in IT long enough to have heard, many times, that no amount of downtime is acceptable, even though that statement is sometimes unrealistic. That being the case, what I'm about to tell you probably won't be very popular with most of you who are reading this. Even so, I firmly believe that it is very important to look at cluster implementation cost and the cost of downtime from a business standpoint, not just from the standpoint of some idiot manager who tells you that downtime is completely unacceptable.
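The update-ordering hazard that rules out simple replication can be sketched in a few lines: if two replicas receive the same pair of conflicting writes in different orders, their copies silently diverge. A toy illustration of the concept (not how any Microsoft product actually replicates):

```python
def apply_updates(initial, updates):
    """Apply a sequence of (field, value) updates to one node's copy of a record."""
    record = dict(initial)
    for field, value in updates:
        record[field] = value  # last write wins on this node
    return record


initial = {"customer": "Smith", "credit_limit": 1000}

# Two users change the same record on different servers; replication latency
# means each node sees the two writes in a different order.
update_a = ("credit_limit", 2000)
update_b = ("credit_limit", 500)

node1 = apply_updates(initial, [update_a, update_b])
node2 = apply_updates(initial, [update_b, update_a])

print(node1["credit_limit"])  # 500
print(node2["credit_limit"])  # 2000  -- the two copies have diverged
```

Because each node applies the last write it happens to receive, the replicas disagree afterwards, which is exactly the integrity problem that pushed the design toward a single shared copy of the data.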

Facility considerations
To see why this is the case, let's go back to my example in which a hurricane destroys the facility containing the cluster storage system. For enough money, you could create a backup storage system in another city (preferably away from the coast). You could also devise a method of keeping that backup storage system reasonably current, and of automatically rerouting the cluster to use the backup storage system in the event that the primary storage system fails. Obviously, this type of failover system would cost a huge amount of money. Suppose, however, that the database application the cluster is hosting is not accessible over the Internet. Instead, it hosts a business-critical database that is used only internally by employees of your company. If all of your employees work in the same city where the primary storage system resides, then having a backup storage system in another city is probably pointless. If the city were devastated by a hurricane, what do you think the chances are of the main office even having electricity? Even if you were to bring in generators and some of the computers were still functional, what do you think the odds are that employees would actually come into the office? If you have never been through a hurricane, I can tell you that the chances of employees coming to work are pretty much zero, because of mandatory evacuations. My point is that it is possible to spend a fortune on the underlying infrastructure that would allow a cluster to continue to function during times of crisis, but doing so is not always a wise investment. It does little good to spend good money ensuring that a cluster never fails if no one is even able to benefit from the cluster's availability in times of crisis.

Other cost considerations


Having said that, let's look at the cost issue from a different perspective. Suppose for a moment that a server at your company's headquarters hosts a database application that is considered mission critical for the company. Because the database is so important, the company decides that it would be wise to cluster the server in an effort to increase its performance and to provide a degree of fault tolerance. Let's also assume that users from offices in other cities also use the database, and that these users place a considerable strain on your WAN link by accessing the database application from a remote site. If that were the case, then it might seem logical to create a geographically dispersed cluster. By doing so, you could allow users to access the application from servers located within their own facility, thus improving the end-user experience and relieving the congestion on your WAN link.

Setting aside fault tolerance issues, the biggest problem with this particular deployment scenario would be the cost. First, you would have the expense of purchasing server hardware and software for each cluster node. The cost of server hardware and software would pale in comparison to the connectivity cost, though. Even in a geographically dispersed cluster, all of the cluster nodes must maintain reliable, high-speed connectivity to the central data store. This means that you would have to construct a Storage Area Network (SAN) that would be used to connect the various sites to each other. In addition, you would still need to keep your existing WAN links so that users at remote sites would be able to access other types of data from servers located at the corporate headquarters.

What all this boils down to is that you are going to have to make some big decisions, weighing the cost of the cluster against its benefits. On one hand, creating this type of cluster would accomplish the goal of relieving the congestion on your WAN link and improving the end-user experience. It would not, however, provide true fault tolerance to users in the remote locations. You could use redundant hardware to improve the fault tolerance of cluster nodes in remote locations. That way, if a cluster node were to fail, another server in the remote location would continue to function, and the users would probably be none the wiser that a failure had occurred. The problem is that if the link between the remote site and the corporate headquarters were to fail, then the cluster itself might as well have failed (at least from the perspective of users at the remote location). To provide true fault tolerance, you would need redundant SAN connectivity, which would increase the cost of the project exponentially.

Keep an eye on the big picture


With this in mind, let's take another look at our original goals. Originally, our goals for this facility were to relieve some of the congestion on the WAN link and to improve the speed of the end-user experience. A geographically dispersed cluster could definitely achieve these goals if constructed properly. At the same time, though, these goals can also be achieved at a much lower cost by investing in a higher-speed WAN link. If you wanted to implement a degree of fault tolerance for the users at the remote locations, you could even invest in redundant WAN links, which would still probably cost a lot less than creating a geographically dispersed cluster.

After reading this article, you might have gotten the idea that I am against the idea of constructing clusters. In reality, nothing could be further from the truth. I just don't believe that a cluster is necessarily the best solution for every project. If you're contemplating building a cluster, then it is very important that you weigh the cluster's benefits against its cost. You may find that constructing a cluster is cost prohibitive, or that a less expensive solution could just as easily meet your goals. Of course, there are plenty of situations in which a cluster is the ideal solution to an IT problem. In Part 2 of this article series, I will continue the discussion of planning a server cluster.

Database Problems and Solutions
This week was a hard programming week at work, and I had to reface the hardships of database programming in Java with JDBC. On Monday morning I decided to solve the quotes problem when inserting into and updating the database. As in most programming languages, SQL gets quite upset when certain special characters are used incorrectly. In SQL's case the single quote (') is the troublemaker, because it has a special use: enclosing a character or string value. If a user types a single quote into a field, the quote must first be "escaped" by inserting another quote before it is added to the insert statement. So, for example, insert into remarks value ('I don't want to drive'); must become insert into remarks value ('I don''t want to drive');. My first attempt at solving this problem was to use the replaceAll() function in Java to replace every ' with '', for each field. This method is cumbersome and not very elegant, so I had to look for a better solution. The answer was to use the PreparedStatement object instead of the Statement object to update values in the database. Apart from solving the quotes problem, a PreparedStatement precompiles the SQL statement passed as a parameter, making it faster than a normal Statement. For more information on the usage of the PreparedStatement object, see Sun's Java JDBC basic tutorial.

While I was learning how to use the PreparedStatement object, I stumbled across a new connection leakage error. The actual error was: ORA-01000: maximum open cursors exceeded. To help me analyse the fault better, I found two SQL statements to document the number of open cursors in the database.

SQL Statement 1: List of all open cursors, users, and the SQL executed.

select user_name, status, osuser, machine, a.sql_text
from v$session b, v$open_cursor a
where a.sid = b.sid;

SQL Statement 2: The number of truly open cursors.

select a.value, b.name
from v$mystat a, v$statname b
where a.statistic# = b.statistic# and a.statistic# = 3;

The culprit for the error was that I had an unclosed PreparedStatement well hidden inside a loop. Although it was not so difficult to locate the error, the problem was not fixed immediately by changing the code, so I started going around in circles looking for problems that didn't exist. The reason was that Oracle frees the inactive cursors only every 15 minutes or so (this might depend on a database parameter [INITSID.ORA]), so the open cursors had to be freed before the corrected code could work.

This week I had very little time to experiment with common applications, but I found a quick way to load Adobe Acrobat faster, i.e. without loading the plug-ins. To do this, all you've got to do is hold [Shift] while the program is loading. Apparently this trick also works with other programs that load plug-ins.
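The two fixes described above, parameterized statements for the quotes problem and closing each statement inside the loop for the cursor leak, generalize to any database API. A sketch of both points using Python's built-in sqlite3 module (the original fix was in Java/JDBC; this merely mirrors the idea):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE remarks (text TEXT)")

# Naive string concatenation: the embedded single quote breaks the statement.
remark = "I don't want to drive"
try:
    conn.execute("INSERT INTO remarks VALUES ('" + remark + "')")
    failed = False
except sqlite3.OperationalError:
    failed = True
print("concatenation failed:", failed)  # True

# Parameterized statement: the driver handles quoting, like a PreparedStatement.
conn.execute("INSERT INTO remarks VALUES (?)", (remark,))

# Closing each cursor inside the loop avoids the open-cursor leak; in JDBC the
# equivalent is calling stmt.close() (or using try-with-resources).
for i in range(3):
    cur = conn.cursor()
    try:
        cur.execute("INSERT INTO remarks VALUES (?)", ("remark %d" % i,))
    finally:
        cur.close()

count = conn.execute("SELECT COUNT(*) FROM remarks").fetchone()[0]
print(count)  # 4
```

Parameterized statements also close the door on SQL injection, which the manual replaceAll() approach does not reliably do.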
