This is an old revision of this page, as edited by Beland (talk | contribs) at 16:08, 14 March 2013 (→Database storage: merge subsection to Computer_data_storage#Volatility; other minor cleanups). The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.
Revision as of 16:08, 14 March 2013 by Beland (talk | contribs) (→Database storage: merge subsection to Computer_data_storage#Volatility; other minor cleanups)(diff) ← Previous revision | Latest revision (diff) | Newer revision → (diff)This article needs additional citations for verification. Please help improve this article by adding citations to reliable sources. Unsourced material may be challenged and removed. Find sources: "Database management system" – news · newspapers · books · scholar · JSTOR (February 2013) (Learn how and when to remove this message) |
A database management system (DBMS) is a set of programs that enables storing, modifying, and extracting information from a database. It also provides users with tools to add, delete, access, modify, and analyze data stored in one location. A group can access the data by using query and reporting tools that are part of the DBMS or by using application programs specifically written to access the data. DBMS’s also provide the method for maintaining the integrity of stored data, running security and users access, and recovering information if the system fails. The information from a database can be presented in a variety of formats. Most DBMSs include a report writer program that enables you to output data in the form of a report. Many DBMSs also include a graphics component that enables you to output information in the form of graphs and charts. Database and database management system are essential to all areas of business, they must be carefully managed. There are many different types of DBMSs, ranging from small systems that run on personal computers to huge systems that run on mainframes. The following are examples of database applications: computerized library systems, flight reservation systems, and computerized parts inventory systems.
A DBMS typically supports query languages, which are in effect high-level programming languages, dedicated database languages that considerably simplify writing database application programs. Database languages also simplify the database organization as well as retrieving and presenting information from it. A DBMS provides facilities for controlling data access, enforcing data integrity, managing concurrency control, and recovering the database after failures and restoring it from backup files, as well as maintaining database security.
DBMSs can be categorized according to the database model(s) that they support, such as relational or XML, the type(s) of computer they support, such as a server cluster or a mobile phone, the query language(s) that access the database, such as SQL or XQuery, performance trade-offs, such as maximum scale or maximum speed or others. Some DBMSs cover more than one entry in these categories, e.g., supporting multiple query languages. Database software typically support the Open Database Connectivity (ODBC) standard which allows the database to integrate (to some extent) with other databases.
Overview and terminology
A database is an organised pool of logically-related data. Data is stored within the data structures of the database. A DBMS is a suite of computer software providing the interface between users and a database or databases. A DBMS is a shell which surrounds a database or series of databases and through which all interactions take place with the database. The interactions catered for by most existing DBMS fall into four main groups:
- Data definition. Defining new data structures for a database, removing data structures from the database, modifying the structure of existing data.
- Data maintenance. Inserting new data into existing data structures, updating data in existing data structures, deleting data from existing data structures.
- Data retrieval. Querying existing data by end-users and extracting data for use by application programs.
- Data control. Creating and monitoring users of the database, restricting access to data in the database and monitoring the performance of databases.
Both a database and its DBMS conform to the principles of a particular database model.
Database system refers collectively to the database model, database management system, and database.
Physically, database servers are dedicated computers that hold the actual databases and run only the DBMS and related software. Database servers are usually multiprocessor computers, with generous memory and RAID disk arrays used for stable storage. Hardware database accelerators, connected to one or more servers via a high-speed channel, are also used in large volume transaction processing environments. DBMSs are found at the heart of most database applications. DBMSs may be built around a custom multitasking kernel with built-in networking support, but modern DBMSs typically rely on a standard operating system to provide these functions.
Many databases have application software that accesses the database on behalf of end-users, without exposing the DBMS interface directly. Database designers and database administrators interact with the DBMS through dedicated interfaces to build and maintain the applications' databases, and thus need some more knowledge and understanding about how DBMSs operate and the DBMSs' external interfaces and tuning parameters.
The development of a mature general-purpose DBMS typically takes several years and many man-years. Developers of DBMS typically update their product to follow and take advantage of progress in computer and storage technologies. Several DBMS products have been in on-going development since the 1970s-1980s. Since DBMSs comprise a significant economical market, computer and storage vendors often take into account DBMS requirements in their own development plans.
Functionality provided
Features commonly offered by database management systems include:
- Query ability
- Querying is the process of requesting attribute information from various perspectives and combination of factors. Example: "How many 2-door cars in Texas are green?" A database query language and report writer allow users to interactively interrogate the database, analyze its data and update it according to the users privileges on data.
- Data security
- For security reasons, it is often desirable to limit who can see or change specific attributes or groups of attributes. This may be managed directly on an individual basis, or by the assignment of individuals and privileges to groups, or (in the most elaborate models) through the assignment of individuals and groups to roles which are then granted entitlements. Data security prevents unauthorized users from viewing or updating the database. Using passwords, users are allowed access to the entire database or subsets of it called "subschemas". For example, an employee database can contain all the data about an individual employee, but one group of users may be authorized to view only payroll data, while others are allowed access to only work history and medical data. If the DBMS provides a way to interactively enter and update the database, as well as interrogate it, this capability allows for managing personal databases.
- Backup and replication
- Copies of attributes need to be made regularly in case primary disks or other equipment fails. A periodic copy of attributes may also be created for a distant organization that cannot readily access the original. DBMS usually provide utilities to facilitate the process of extracting and disseminating attribute sets. When data is replicated between database servers, so that the information remains consistent throughout the database system and users cannot tell or even know which server in the DBMS they are using, the system is said to exhibit replication transparency.
- Rule enforcement
- Often one wants to apply rules to attributes so that the attributes are clean and reliable. For example, we may have a rule that says each car can have only one engine associated with it (identified by Engine Number). If somebody tries to associate a second engine with a given car, we want the DBMS to deny such a request and display an error message. However, with changes in the model specification such as, in this example, hybrid gas-electric cars, rules may need to change. Ideally such rules should be able to be added and removed as needed without significant data layout redesign.
- Computation
- Common computations requested on attributes are counting, summing, averaging, sorting, grouping, cross-referencing, and so on. Rather than have each computer application implement these from scratch, they can rely on the DBMS to supply such calculations.
- Change and access logging
- This describes who accessed which attributes, what was changed, and when it was changed. Logging services allow for a database audit by keeping a record of access occurrences and changes.database audit later. Sometimes application-level code is used to record changes rather than leaving this to the database.
- Automated optimization
- For frequently occurring usage patterns or requests, some DBMS can adjust themselves to improve the speed of those interactions. In some cases the DBMS will merely provide tools to monitor performance, allowing a human expert to make the necessary adjustments after reviewing the statistics collected.
- Development and monitoring support
- A DBMS typically intends to provide convenient environment to develop and later maintain an application built around its respective database type. A DBMS either provides such tools, or allows integration with such external tools. Examples for tools relate to database design, application programming, application program maintenance, database performance analysis and monitoring, database configuration monitoring, DBMS hardware configuration (a DBMS and related database may span computers, networks, and storage units) and related database mapping (especially for a distributed DBMS), storage allocation and database layout monitoring, storage migration, etc.
Components
This section does not cite any sources. Please help improve this section by adding citations to reliable sources. Unsourced material may be challenged and removed. (March 2013) (Learn how and when to remove this message) |
DBMS architecture specifies its components (including descriptions of their functions) and their interfaces. DBMS architecture is distinct from database architecture. The following are major DBMS components:
- DBMS external interfaces - They are the means to communicate with the DBMS (both ways, to and from the DBMS) to perform all the operations needed for the DBMS. These can be operations on a database, or operations to operate and manage the DBMS. For example:
- - Direct database operations: defining data types, assigning security levels, updating data, querying the database, etc.
- - Operations related to DBMS operation and management: backup and restore, database recovery, security monitoring, database storage allocation and database layout configuration monitoring, performance monitoring and tuning, etc.
- An external interface can be either a user interface (e.g., typically for a database administrator), or an application programming interface (API) used for communication between an application program and the DBMS.
- Database language engines (or processors) - Most operations upon databases are performed through expression in Database languages (see above). Languages exist for data definition, data manipulation and queries (e.g., SQL), as well as for specifying various aspects of security, and more. Language expressions are fed into a DBMS through proper interfaces. A language engine processes the language expressions (by a compiler or language interpreter) to extract the intended database operations from the expression in a way that they can be executed by the DBMS.
- Query optimizer - Performs query optimization on every query to choose for it the most efficient query plan (a partial order (tree) of operations) to be executed to compute the query result.
- Database engine - Performs the received database operations on the database objects, typically at their higher-level representation.
- Storage engine - translates the operations to low-level operations on the storage bits. In some references the Storage engine is viewed as part of the database engine.
- Transaction engine - for correctness and reliability purposes most DBMS internal operations are performed encapsulated in transactions (see below). Transactions can also be specified externally to the DBMS to encapsulate a group of operations. The transaction engine tracks all the transactions and manages their execution according to the transaction rules (e.g., proper concurrency control, and proper commit or abort for each).
- DBMS management and operation component - Comprises many components that deal with all the DBMS management and operational aspects like performance monitoring and tuning, backup and restore, recovery from failure, security management and monitoring, database storage allocation and database storage layout monitoring, etc.
This article may need to be cleaned up. It has been merged from (link). |
- Data definition subsystem helps the user create and maintain the data dictionary and define the structure of the file in a database.
- Data manipulation subsystem helps the user to add, change, and delete information in a database and query it for valuable information. Software tools within the data manipulation subsystem are most often the primary interface between user and the information contained in a database. It allows the user to specify its logical information requirements.
- Application generation subsystem contains facilities to help users develop transaction-intensive applications. It usually requires that the user perform a detailed series of tasks to process a transaction. It facilitates easy-to-use data entry screens, programming languages, and interfaces.
- Data administration subsystem helps users manage the overall database environment by providing facilities for backup and recovery, security management, query optimization, concurrency control, and change management.
Database transactions
Main article: Database transactionAs with every software system, a DBMS that operates in a faulty computing environment is prone to failures of many kinds. A failure can corrupt the respective database unless special measures are taken to prevent this. A DBMS achieves certain levels of fault tolerance by encapsulating operations within transactions. The concept of a database transaction (or atomic transaction) has evolved in order to ensure data integrity: the data should be in a coherent state after recovery from a crash. A database transaction is a unit of work, typically encapsulating a number of operations over a database (e.g., reading a database object, writing, acquiring lock, etc.), an abstraction supported in database and also other systems. Each transaction has well defined boundaries in terms of which program/code executions are included in that transaction (determined by the transaction's programmer via special transaction commands).
The acronym ACID describes some ideal properties of a database transaction: Atomicity, Consistency, Isolation, and Durability.
Further information: Concurrency controlData structures (models)
Main article: Database modelA database model is a type of data model that determines the logical structure of a database and fundamentally determines in which manner data can be stored, organized, and manipulated. The most popular example of a database model is the relational model.
Common logical data models for databases include:
- Hierarchical database model
- Network model
- Relational model
- Entity–relationship model
- Object model
- Document model
- Multidimensional model
- Multivalue model
An object-relational database combines the two related structures. Various physical data models can implement any given logical model, and different models have different, application-specific performance characteristics. Most database management systems are built around one particular data model, although it is possible for products to offer support for more than one model.
In practice
This article appears to contradict the article Database model. Please discuss at the talk page and do not remove this message until the contradictions are resolved. |
The dominant model in use today is the ad hoc one embedded in SQL, despite the objections of purists who believe this model is a corruption of the relational model since it violates several fundamental principles for the sake of practicality and performance. Many DBMSs also support the Open Database Connectivity API that supports a standard way for programmers to access the DBMS.
Before the database management approach, organizations relied on file processing systems to organize, store, and process data files. End users criticized file processing because the data is stored in many different files and each organized in a different way. Each file was specialized to be used with a specific application. File processing was bulky, costly and inflexible when it came to supplying needed data accurately and promptly. Data redundancy is an issue with the file processing system because the independent data files produce duplicate data so when updates were needed each separate file would need to be updated. Another issue is the lack of data integration. The data is dependent on other data to organize and store it. Lastly, there was not any consistency or standardization of the data in a file processing system which makes maintenance difficult. For these reasons, the database management approach was produced.
History
See Database#History.
Database storage
Main article: Computer data storageDatabase storage is the container of the physical materialization of a database. It comprises the internal (physical) level in the database architecture. It also contains all the information needed (e.g., metadata, "data about the data", and internal data structures) to reconstruct the conceptual level and external level from the internal level when needed. It is generally the responsibility of the storage engine. Though typically accessed by a DBMS through the underlying operating system (and often utilizing the operating systems' file systems as intermediates for storage layout), storage properties and configuration setting are extremely important for the efficient operation of the DBMS, and thus are closely maintained by database administrators. A DBMS, while in operation, always has its database residing in several types of storage (e.g., memory and external storage). The database data and the additional needed information, possibly in very large amounts, are coded into bits. Data typically reside in the storage in structures that look completely different from the way the data look in the conceptual and external levels, but in ways that attempt to optimize (the best possible) these levels' reconstruction when needed by users and programs, as well as for computing additional types of needed information from the data (e.g., when querying the database).
In principle the database storage can be viewed as a linear address space, where every bit of data has its unique address in this address space. In practice, only a very small percentage of addresses are kept as initial reference points (which also requires storage); most data is accessed by indirection using displacement calculations (distance in bits from the reference points) and data structures which define access paths (using pointers) to all needed data in an effective manner, optimized for the needed data access operations.
General considerations
Data is encoded by assigning a bit pattern to each character, digit, or multimedia object. Many standards exist for encoding (e.g., character encodings like ASCII, image encodings like JPEG, video encodings like MPEG-4). By adding bits to each encoded unit, the redundancy allows both to detect errors in coded data and to correct them based on mathematical algorithms. Errors occur regularly in low probabilities due to random bit value flipping, or "physical bit fatigue," loss of the physical bit in storage its ability to maintain distinguishable value (0 or 1), or due to errors in inter or intra-computer communication. A random bit flip (e.g., due to random radiation) is typically corrected upon detection. A bit, or a group of malfunctioning physical bits (not always the specific defective bit is known; group definition depends on specific storage device) is typically automatically fenced-out, taken out of use by the device, and replaced with another functioning equivalent group in the device, where the corrected bit values are restored (if possible). The cyclic redundancy check (CRC) method is typically used in storage for error detection and correction. Some DBMS support specifying which character encoding was used to store data, so multiple encodings can be used in the same database.
Data compression methods allow in many cases to represent a string of bits by a shorter bit string ("compress") and reconstruct the original string ("decompress") when needed. This allows to utilize substantially less storage (tens of percents) for many types of data at the cost of more computation (compress and decompress when needed). Analysis of trade-off between storage cost saving and costs of related computations and possible delays in data availability is done before deciding whether to keep certain data in a database compressed or not. Data compression is typically controlled through the DBMS's data definition interface, but in some cases may be a default and automatic.
For security reasons certain types of data (e.g., credit-card information) may be kept encrypted in storage to prevent the possibility of unauthorized information reconstruction from chunks of storage snapshots (taken either via unforeseen vulnerabilities in a DBMS, or more likely, by bypassing it). Data encryption is typically controlled through the DBMS's data definition interface, but in some cases may be a default and automatic.
Storage metrics
This section needs expansion. You can help by adding to it. (July 2011) |
Databases always use several types of storage when operational (and implied several when idle). Different types may significantly differ in their properties, and the optimal mix of storage types is determined by the types and quantities of operations that each storage type needs to perform, as well as considerations like physical space and energy consumption and dissipation (which may become critical for a large database). Storage types can be categorized by the following attributes:
- Volatile/nonvolatile.
- Cost of the medium (e.g., per megabyte), cost to operate (cost of energy consumed per unit time).
- Access speed (e.g., bytes per second).
- Granularity — from fine to coarse (e.g., size in bytes of access operation).
- Reliability (the probability of spontaneous bit value change under various conditions).
- Maximal possible number of writes (of any specific bit or specific group of bits; could be constrained by the technology used (e.g., "write once" or "write twice"), or due to "physical bit fatigue," loss of ability to distinguish between the 0, 1 states due to many state changes (e.g., in Flash memory)).
- Power needed to operate (Energy per time; energy per byte accessed), Energy efficiency, Heat to dissipate.
- Packaging density (e.g., realistic number of bytes per volume unit)
Protecting storage device content: Device mirroring (replication) and RAID
Main articles: Disk mirroring and RAID- See also Disk storage replication
While a group of bits malfunction may be resolved by error detection and correction mechanisms (see above), storage device malfunction requires different solutions. The following solutions are commonly used and valid for most storage devices:
- Device mirroring (replication) - A common solution to the problem is constantly maintaining an identical copy of device content on another device (typically of a same type). The downside is that this doubles the storage, and both devices (copies) need to be updated simultaneously with some overhead and possibly some delays. The upside is possible concurrent read of a same data group by two independent processes, which increases performance. When one of the replicated devices is detected to be defective, the other copy is still operational, and is being utilized to generate a new copy on another device (usually available operational in a pool of stand-by devices for this purpose).
- Redundant array of independent disks (RAID) - This method generalizes the device mirroring above by allowing one device in a group of N devices to fail and be replaced with content restored (Device mirroring is RAID with N=2). RAID groups of N=5 or N=6 are common. N>2 saves storage, when comparing with N=2, at the cost of more processing during both regular operation (with often reduced performance) and defective device replacement.
Device mirroring and typical RAID are designed to handle a single device failure in the RAID group of devices. However, if a second failure occurs before the RAID group is completely repaired from the first failure, then data can be lost. The probability of a single failure is typically small. Thus the probability of two failures in a same RAID group in time proximity is much smaller (approximately the probability squared, i.e., multiplied by itself). If a database cannot tolerate even such smaller probability of data loss, then the RAID group itself is replicated (mirrored). In many cases such mirroring is done geographically remotely, in a different storage array, to handle also recovery from disasters (see disaster recovery above).
Database storage layout
Database bits are laid-out in storage in data-structures and grouping that can take advantage of both known effective algorithms to retrieve and manipulate them and the storage own properties. Typically the storage itself is design to meet requirements of various areas that extensively utilize storage, including databases. A DBMS in operation always simultaneously utilizes several storage types (e.g., memory, and external storage), with respective layout methods.
Database storage hierarchy
A database, while in operation, resides simultaneously in several types of storage. By the nature of contemporary computers most of the database part inside a computer that hosts the DBMS resides (partially replicated) in volatile storage. Data (pieces of the database) that are being processed/manipulated reside inside a processor, possibly in processor's caches. These data are being read from/written to memory, typically through a computer bus (so far typically volatile storage components). Computer memory is communicating data (transferred to/from) external storage, typically through standard storage interfaces or networks (e.g., fibre channel, iSCSI). A storage array, a common external storage unit, typically has storage hierarchy of it own, from a fast cache, typically consisting of (volatile and fast) DRAM, which is connected (again via standard interfaces) to drives, possibly with different speeds, like flash drives and magnetic disk drives (non-volatile). The drives may be connected to magnetic tapes, on which typically the least active parts of a large database may reside, or database backup generations.
Typically a correlation exists currently between storage speed and price, while the faster storage is typically volatile.
Data structures
Main article: Database storage structuresThis section needs expansion. You can help by adding to it. (June 2011) |
A data structure is an abstract construct that embeds data in a well defined manner. An efficient data structure allows to manipulate the data in efficient ways. The data manipulation may include data insertion, deletion, updating and retrieval in various modes. A certain data structure type may be very effective in certain operations, and very ineffective in others. A data structure type is selected upon DBMS development to best meet the operations needed for the types of data it contains. Type of data structure selected for a certain task typically also takes into consideration the type of storage it resides in (e.g., speed of access, minimal size of storage chunk accessed, etc.). In some DBMSs database administrators have the flexibility to select among options of data structures to contain user data for performance reasons. Sometimes the data structures have selectable parameters to tune the database performance.
Databases may store data in many data structure types. Common examples are the following:
- ordered/unordered flat files
- hash tables
- B+ trees
- ISAM
- heaps
In contrast to conventional row-orientation, relational databases can also be column-oriented or correlational in the way they store data in these structures.
Application data and DBMS data
A typical DBMS cannot store the data of the application it serves alone. In order to handle the application data the DBMS need to store this data in data structures that comprise specific data by themselves. In addition the DBMS needs its own data structures and many types of bookkeeping data like indexes and logs. The DBMS data is an integral part of the database and may comprise a substantial portion of it.
Database indexing
Main article: Index (database)Indexing is a technique for improving database performance. The many types of indexes share the common property that they reduce the need to examine every entry when running a query. In large databases, this can reduce query time/cost by orders of magnitude. The simplest form of index is a sorted list of values that can be searched using a binary search with an adjacent reference to the location of the entry, analogous to the index in the back of a book. The same data can have multiple indexes (an employee database could be indexed by last name and hire date.)
Indexes affect performance, but not results. Database designers can add or remove indexes without changing application logic, reducing maintenance costs as the database grows and database usage evolves.
Given a particular query, the DBMS' query optimizer is responsible for devising the most efficient strategy for finding matching data.
Indexes can speed up data access, but they consume space in the database, and must be updated each time the data is altered. Indexes therefore can speed data access but slow data maintenance. These two properties determine whether a given index is worth the cost.
Database data clustering
In many cases substantial performance improvement is gained if different types of database objects that are usually utilized together are laid in storage in proximity, being clustered. This usually allows to retrieve needed related objects from storage in minimum number of input operations (each sometimes substantially time consuming). Even for in-memory databases clustering provides performance advantage due to common utilization of large caches for input-output operations in memory, with similar resulting behavior.
For example it may be beneficial to cluster a record of an item in stock with all its respective order records. The decision of whether to cluster certain objects or not depends on the objects' utilization statistics, object sizes, caches sizes, storage types, etc.
Database materialized views
Main article: Materialized viewOften storage redundancy is employed to increase performance. A common example is storing materialized views, which consist of frequently needed external views or query results. Storing such views saves the expensive computing of them each time they are needed. The downsides of materialized views are the overhead incurred when updating them to keep them synchronized with their original updated database data, and the cost of storage redundancy.
Database and database object replication
Main article: Database replication- See also Replication below
Occasionally a database employs storage redundancy by database objects replication (with one or more copies) to increase data availability (both to improve performance of simultaneous multiple end-user accesses to a same database object, and to provide resiliency in a case of partial failure of a distributed database). Updates of a replicated object need to be synchronized across the object copies. In many cases the entire database is replicated.
Distributed DBMS
A distributed database management system (DDBMS) is a collection of data which logically belong to the same system but are spread out over the sites of the computer network. The two aspects of a distributed database are distribution and logical correlation:
- Distribution: The fact that the data are not resident at the same site, so that we can distinguish a distributed database from a single, centralized database.
- Logical Correlation: The fact that the data have some properties which tie them together, so that we can distinguish a distributed database from a set of local databases or files which are resident at different sites of a computer network.
See also
- Database
- Column-oriented DBMS
- Data warehouse
- Database-centric architecture
- Database testing
- Comparison of relational database management systems
- Comparison of object database management systems
- Comparison of object-relational database management systems
- Metadata
References
- Tsitchizris, D. C. and F. H. Lochovsky (1982). Data Models. Englewood-Cliffs, Prentice-Hall.
- Beynon-Davies P. (2004). Database Systems 3rd Edition. Palgrave, Basingstoke, UK. ISBN 1-4039-1601-2
- Lightstone, Teorey & Nadeau 2007 harvnb error: no target: CITEREFLightstoneTeoreyNadeau2007 (help)
Further reading
- Abraham Silberschatz, Henry F. Korth, S. Sudarshan, Database System Concepts
- Raghu Ramakrishnan and Johannes Gehrke, Database Management Systems
Database management systems | |
---|---|
Types | |
Concepts | |
Objects | |
Components | |
Functions | |
Related topics | |
Database | |
---|---|
Main | |
Languages | |
Security | |
Design |
|
Programming | |
Management | |
Lists of | |
See also | |
WikiProject Category |
Database models | |
---|---|
Common models | |
Other models | |
Implementations |