
XML DTD Pros and Cons

Pros
The main advantage of the DTD is that it provides validating parsers with a map of how to validate the XML
document. The map describes how elements relate to each other within the document. It also specifies how elements
nest within each other and the number of times they can occur, and it loosely defines data types.
Because XML first used DTDs to validate documents, they're the de facto standard protocol for describing XML
documents. Nearly every XML package supports validation with DTDs.

Cons
As useful as DTDs are, they aren't really designed for handling the XML domain. The major concern most developers
have with DTDs is the lack of good type-checking. Also, DTDs are created in a strange and seemingly archaic format.
They have only a limited capability in describing the document structure in terms of how many elements can nest
within other elements.

2 Document Type Declaration


To make an XML document valid it can be associated with a Document Type Definition or DTD.
The document type declaration is where this happens. Every valid XML document must have (at
least) one, which must come after the XML declaration (if there is one) but before the first (root)
element in the document.
The simplest (but not usually the best) approach is to embed the DTD into the XML file. In this
case the document type declaration actually contains the DTD. Thus we could expand the
'employees' XML document presented above to the following:
<?xml version="1.0"?>

<!DOCTYPE employees [
<!ELEMENT employees (employee*)>
<!ELEMENT employee (firstname, familyname, comment)>
<!ATTLIST employee id ID #REQUIRED
                   manager IDREF #IMPLIED>
<!ELEMENT firstname (#PCDATA)>
<!ELEMENT familyname (#PCDATA)>
<!ELEMENT comment (#PCDATA)>
]>

<employees>
<employee manager="e3257">
<firstname>Joe</firstname><familyname>Bloggs</familyname>
<comment>Due to retire soon.</comment>
</employee>
<employee id="e3257">
<familyname>Patel</familyname>
<comment/>
</employee>
</employees>

Although this document is well-formed, it is now invalid, since it does not conform to the DTD,
which specifies that:

The employee element has a list of attributes, of which the first, id, is required while the
second, manager, is optional; however, the first employee element does not have an id attribute.

The employee element must always contain the three subelements firstname, familyname and comment, in exactly
that order; however, the second employee element does not contain a firstname element.
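
The difference between well-formed and valid can be sketched in Python. The standard library's ElementTree parser does not validate, so it accepts the document above without complaint; the two rules discussed above are then checked by hand. The helper name problems is hypothetical, and the checks cover only these two rules, not the whole DTD.

```python
import xml.etree.ElementTree as ET

DOC = """<?xml version="1.0"?>
<!DOCTYPE employees [
<!ELEMENT employees (employee*)>
<!ELEMENT employee (firstname, familyname, comment)>
<!ATTLIST employee id ID #REQUIRED
                   manager IDREF #IMPLIED>
<!ELEMENT firstname (#PCDATA)>
<!ELEMENT familyname (#PCDATA)>
<!ELEMENT comment (#PCDATA)>
]>
<employees>
<employee manager="e3257">
<firstname>Joe</firstname><familyname>Bloggs</familyname>
<comment>Due to retire soon.</comment>
</employee>
<employee id="e3257">
<familyname>Patel</familyname>
<comment/>
</employee>
</employees>"""

# Parsing succeeds because the document is well-formed;
# ElementTree makes no attempt to check it against the DTD.
root = ET.fromstring(DOC)


def problems(employee):
    """Hand-written checks for the two rules above (hypothetical helper)."""
    found = []
    if "id" not in employee.attrib:
        found.append("missing required attribute id")
    children = [child.tag for child in employee]
    if children != ["firstname", "familyname", "comment"]:
        found.append(f"children are {children}")
    return found


# One list of problems per employee element, in document order.
report = [problems(e) for e in root.findall("employee")]
```

Each employee element produces exactly one problem here: the first lacks the id attribute and the second lacks the firstname child.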
We will now look at the components of DTDs in turn.

3 Element Declarations
Element type declarations are of the form:
<!ELEMENT elementname contentrule>

The elementname is, of course, just the name of the element that the rule applies to, while
the contentrule specifies the constraints that apply to it. There are a number of different forms
for the contentrule, called content models.
3.1 The Any Content Model
Here the contentrule has the value ANY, and the corresponding
element in the XML document can have any mix of character data and elements in any order. If no
DTD is provided for an XML document, this is the default assumed for every element. Clearly this
content model should be avoided if at all possible, since it effectively avoids specifying the format
of the element. The following declaration says that every a element may have any content:
<!ELEMENT a ANY>

3.2 The Empty Content Model
Here the contentrule has the value EMPTY, and the
corresponding element in the XML document must contain no content, neither sub-elements nor
character data. Thus the following declaration says that every a element must have no content:
<!ELEMENT a EMPTY>

3.3 Element Only Content Models
Here an expression involving element names and some
special characters (representing operators) is used to describe patterns of child elements that can
occur in the content of the target element. The expressions are built up recursively from a number
of components. Expressions must be fully parenthesised in order to exclude any possibility of
ambiguity or the need to specify the order of precedence of pattern operators.

Element Name: an element name means that an element of that name is required at this
position. The following declaration says that an element named a must have precisely one child
element whose name is b:
<!ELEMENT a (b)>

Technically, an element name can only appear inside sequences or choices, both of which must be
enclosed in parentheses. Hence the parentheses around the b in the example above are because it is
actually the only entry in a sequence of length 1. Thus the following is invalid:
<!ELEMENT a b>

(The reason for this restriction is so that there can be elements named ANY and EMPTY without
conflicting with the corresponding content models.)
Sequence: a comma is used to separate expressions which must appear in the order given.
The following declaration says that an element named a must have precisely three child elements
whose names are b, c and d, in that order:

<!ELEMENT a (b, c, d)>

Recall that every sequence must be enclosed in parentheses.


Choice: a vertical bar | is used to separate expressions of which any one must appear. The
following says that an element named a must have precisely one child element whose name is
either b, c or d:
<!ELEMENT a (b | c | d)>

Zero or More Repeats: an asterisk * is used to follow a sequence, a choice or an element
name which must be repeated 0 or more times. The following says that an element named a must
have 0 or more b child elements (and nothing else):
<!ELEMENT a (b*)>

Note that this is exactly the same as writing:

<!ELEMENT a (b)*>

One or More Repeats: a plus + is used to follow a sequence, a choice or an element name
which must be repeated 1 or more times. The following says that an element named a must have at
least one b child element, but possibly many (and nothing else):
<!ELEMENT a (b+)>

Optionality: the question mark ? is used to follow a sequence, a choice or an element name
which is optional, i.e. may or may not occur. The following says that an element named a may or
may not have a single b child element, but must contain no other elements:
<!ELEMENT a (b?)>

By combining these descriptions, very complex constraints can be constructed:

<!ELEMENT cv
          (preface, (qualification | experience)+, hobbies?, referee*)>

Here the cv element must start with a preface element. This must be followed by a repeating
group that has to occur at least once. Each cycle of this group can contain either
a qualification element or an experience element; i.e. there can be a sequence
of qualification and experience elements in any order and of any length greater than 0. Next
there might or might not be a single hobbies element. Finally there can be a sequence of any
number (including zero) of referee elements.
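
An element-only content model is essentially a regular expression over child element names. As a rough sketch (not how any particular XML processor implements it), the cv model can be written as a Python regular expression and matched against the list of child names; the function name matches_cv is hypothetical.

```python
import re

# Content model: (preface, (qualification | experience)+, hobbies?, referee*)
# Every child name is followed by one space, so names cannot run together.
CV_MODEL = re.compile(
    r"preface (?:(?:qualification|experience) )+(?:hobbies )?(?:referee )*"
)


def matches_cv(children):
    """Return True if the list of child element names fits the cv content model."""
    encoded = "".join(name + " " for name in children)
    return CV_MODEL.fullmatch(encoded) is not None
```

For instance, matches_cv(["preface", "experience", "qualification", "hobbies", "referee"]) holds, whereas a list that puts referee before hobbies, or that has no qualification or experience at all, is rejected.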
3.4 Mixed Content Models
Here we can finally specify that an element contains character data,
and how it can be mixed with child elements. There are two forms:

If element a contains character data but no child elements at all, it is declared by:
<!ELEMENT a (#PCDATA)>
#PCDATA historically comes from 'Parsed Character Data', but is not a good name since it
corresponds to what in XML is called just 'character data'.
If element a contains any number of the elements b, c, d, ... in any order, interleaved with
any amount of character data, it is declared by:
<!ELEMENT a (#PCDATA | b | c | d | ...)*>

These are the only kinds of mixed content models allowed. There's no way, for example, of
specifying that an element starts with character data but then ends with a given element. For this
reason, the second form of mixed content model should be avoided unless it really is the case that
the child elements can occur anywhere within character data.
Neither form provides any way of specifying what kind of character data can occur, e.g. that it
must be an integer or consist of exactly three words.

4 Attribute Declarations
Attribute Type Declarations are of the form:
<!ATTLIST elemname attname1 atttype1 attstatus1
                   attname2 atttype2 attstatus2
                   ... ... ...>

The elemname identifies the element to which this list of attribute declarations applies. The
attname identifies which particular attribute of the element is being declared. The values of
atttype and attstatus then specify the form of the attribute.
The attstatus part of the declaration is the easiest to understand. It specifies whether the
attribute is required or optional and, if optional, whether a default value should be assumed if not
given in the XML document.

#REQUIRED means that the attribute must be present in every target element in the XML
document; hence there is no need for a default value.

#IMPLIED means that the attribute is optional in every target element in the XML document
(i.e. may or may not be present), but that no default value is provided if the attribute is omitted. (I
find IMPLIED to be an odd name for this case; OPTIONAL would surely have been better.)

"attvalue" means that the attribute is optional in every target element in the XML
document, but that if it is not present then it must be added by a validating XML processor and
given the value of attvalue (without the quotes).

#FIXED "attvalue" means that if the attribute is present in a target element in the XML
document then it must have precisely the given value; if it is not present then it must be added by a
validating XML processor and given the value of attvalue (without the quotes).

The last two forms create a problem. A non-validating XML processor is only required to check
that an XML document is well-formed; it is not required to process the DTD fully in order to
check whether the document is valid. However, if it doesn't fully process the DTD then it may not
be able to add default or fixed attributes, so that if the processor constructs a tree to represent the
XML document (see e.g. "Introduction to XML", 9), a validating XML processor may create a
different tree to that created by a non-validating XML processor.
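
How much of the DTD a non-validating processor reads varies. As an illustration, Python's ElementTree is built on the non-validating Expat parser, which does read declarations in the internal DTD subset, so in the sketch below the default and fixed values are supplied. It does not fetch an external DTD, so the same declarations placed in an external file would leave the attributes absent from the tree. The staff document and its attribute names are made up for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: status has a default value, company a fixed value.
DOC = """<!DOCTYPE staff [
<!ELEMENT staff (employee*)>
<!ELEMENT employee EMPTY>
<!ATTLIST employee name CDATA #REQUIRED
                   status CDATA "fulltime"
                   company CDATA #FIXED "ACME">
]>
<staff>
<employee name="Bloggs"/>
<employee name="Patel" status="parttime"/>
</staff>"""

employees = ET.fromstring(DOC).findall("employee")

# Attributes as they appear in the constructed tree, including any
# that the parser added from the declarations in the internal subset.
attributes = [dict(e.attrib) for e in employees]
```

The first employee element ends up with status="fulltime" even though the document never says so, which is exactly the kind of difference between trees described above.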
Returning to the atttype part of the attribute declaration, there are ten different forms, of which
seven fall within the scope of this module. Each will be described in turn.
4.1 Enumerated List Attribute Type
This form is used to specify that the value of the attribute
can only be one of a fixed set of values given in a vertical bar separated list of names. The list
must be enclosed in parentheses. Note that a value must be a valid name (see e.g. "Introduction to
XML", 1). Some examples:
<!ATTLIST employee status (fulltime | parttime) "fulltime">

This specifies that employee elements can have the attribute status, which can only have one of
the two values fulltime or parttime. If the attribute is omitted, then a validating XML processor
must add it with the value fulltime. If the overwhelming majority of employees are full time,
then an attribute declaration like this avoids the need to include status="fulltime" in most of
the employee elements. However, it usually makes the eventual code written to process the XML
document more complex, since it will be safer to determine whether an employee is full time or
not by testing for either the absence of the status attribute or its having the value fulltime.
<!ATTLIST menu cuisine (French | Italian | Indian | Chinese) #REQUIRED>

This specifies that menu elements must have the attribute cuisine, which can only have one of the
values French, Italian, Indian or Chinese. If the attribute is omitted, then a validating XML
processor must report an error.
4.2 Name Token Attribute Type
The keyword NMTOKEN is used to indicate this type, which is
used to specify that the value of the attribute can be any arbitrary 'name token'. A name token is
like a name, but can in addition start with a digit. In the example above, if it had been decided to
use a single word to refer to a type of cuisine, but it was required to be able to extend the types of
cuisine to more than a fixed list, the appropriate declaration would be:
<!ATTLIST menu cuisine NMTOKEN #REQUIRED>

This would allow the following XML to be valid:

<menu cuisine="Greek">...</menu>

but not:
<menu cuisine="Greek and Italian">...</menu>

4.3 Multi-Name Token Attribute Type
The keyword NMTOKENS is used to indicate this type,
which is used to allow an attribute to have a white space separated list of values, each of
type NMTOKEN. Thus if we declare the cuisine attribute of menu via:
<!ATTLIST menu cuisine NMTOKENS #REQUIRED>

then the following XML is valid:

<menu cuisine="Greek Italian">...</menu>

(Actually the value "Greek and Italian" would also be valid, but the spirit of this attribute type
is to have a list of name tokens each with its own meaning.)
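
The distinction between NMTOKEN and NMTOKENS can be sketched in Python. The pattern below is a deliberately simplified, ASCII-only approximation of a name token (the real rule also admits many non-ASCII characters), and the function names are hypothetical.

```python
import re

# Simplified, ASCII-only approximation of an XML name token:
# letters, digits, '.', '-', '_' and ':' in any order, at least one character.
NMTOKEN = re.compile(r"[A-Za-z0-9._:-]+")


def is_nmtoken(value):
    """True if value is a single name token (no white space allowed)."""
    return NMTOKEN.fullmatch(value) is not None


def is_nmtokens(value):
    """True if value is a non-empty, white space separated list of name tokens."""
    tokens = value.split()
    return bool(tokens) and all(is_nmtoken(token) for token in tokens)
```

Under this sketch "Greek" passes both checks, while "Greek and Italian" fails is_nmtoken because of the white space but passes is_nmtokens as a list of three tokens.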
4.4 String Attribute Type
The keyword CDATA is used to indicate this type. As with the
keyword #PCDATA for elements, the name is historic, but misleading. Attributes cannot have
general character data values. They cannot contain CDATA sections, so it's particularly odd to
use CDATA as the keyword. CDATA means that the attribute can have any string value. For example:
<!ATTLIST purchase ordernumber CDATA #REQUIRED
                   customerid CDATA #IMPLIED
                   seller CDATA #FIXED "ACME INC."
                   priority CDATA "normal">

All of the four attributes of a purchase element can have string values (and so, for example, can
contain white space). The attribute ordernumber must be present, whereas customerid can be
omitted, but has no default value if it is. The value of seller is always the string 'ACME INC.'. The
attribute priority may be omitted but if so defaults to normal.
It should be clear that CDATA offers only a very weak specification for an attribute value, and so
should be avoided if possible. The order number and customer ID are likely to be numbers in a
real application. DTDs don't provide any way of specifying this. However NMTOKEN is likely to be
a better choice than CDATA, since it rules out embedded white space. If the attribute priority is to
be allowed to have values like rush order then an enumerated list can't be used (because its
values can only be names) nor can NMTOKEN, and CDATA is the only choice. However, it would
probably be better instead to use values like rushOrder or rush_order, which allow the more
restricted NMTOKEN as the attribute type. Only the attribute seller above really needs to be CDATA.
4.5 Identifier Attribute Type
The keyword ID is used to indicate this type. It means that the
attribute is an identifier for the target element and so can be used in references (see
e.g. "Introduction to XML", 5). Note that:

No default values can be specified for identifier attributes. Thus the only attstatus
declarations that can be used for these types are #REQUIRED and #IMPLIED.

Identifier attribute values must be names, not arbitrary strings; in particular, they cannot be
numbers, so we have to use something like C245892 rather than 245892 to make a customer
number into an identifier.

No element can have more than one identifier attribute declared for it.
No two different elements in an XML document can have the same identifier attribute
value assigned to them, even if the elements and attributes have different names.
Example:
<!ATTLIST book bookid ID #REQUIRED
               publicationdate CDATA #IMPLIED>

4.6 Identifier Reference Attribute Type
The keyword IDREF is used to indicate this type. It
means that the attribute is used to refer to an element in the current XML document which has an
attribute of type ID, i.e. that the attribute is used to make a reference. Note that:

Any of the four attstatus declarations can be used with this attribute type.

Just as with ID attribute types, the values for IDREF attributes must be names, not arbitrary
strings, since the values must be those used for an attribute of type ID.

An element can have multiple different attributes of type IDREF (which will usually be
used to refer to different kinds of element, but don't have to).

For every attribute of type IDREF in a valid XML document, there must exist an element in
the document with an attribute of type ID whose attribute value matches that of
the IDREF attribute. This makes sure that every attribute of type IDREF which has a value actually
refers to an element.
For example:
<!ATTLIST division divisionID ID #REQUIRED>

<!ATTLIST employee employeeID ID #REQUIRED
                   division IDREF #REQUIRED
                   manager IDREF #IMPLIED>

These two declarations ensure that division and employee elements have unique identifiers
(IDs), divisionID and employeeID respectively. An employee has to have an ID reference value
in a division attribute (e.g. to say that he or she is employed in that division), and may have an
ID reference value in a manager attribute (e.g. to say that the employee with that employeeID is
his or her manager). Notice that the only requirement enforced by validity is that the value of an
attribute of type IDREF is the value of an attribute of type ID in the same XML document. Validity
checks don't prevent us from erroneously putting an employee ID value in the division attribute,
or a division ID value in the manager attribute.
Attributes of type IDREF cannot be used to make references to elements in other XML documents,
since the validity check will fail. Generally, the NMTOKEN type must be used for this purpose.
4.7 Multi-Identifier Reference Attribute Type
The keyword IDREFS is used to indicate this type.
The value of the attribute is a list of white space separated IDREF values, each with the same
constraints as a single identifier reference. For example:
<!ATTLIST employee employeeID ID #REQUIRED
                   manager IDREF #IMPLIED
                   subordinates IDREFS #IMPLIED>

Every employee may have a single manager and may have multiple subordinates. A corresponding
valid XML document might be:
<?xml version="1.0"?>

<!DOCTYPE employees [
<!ELEMENT employees (employee*)>
<!ELEMENT employee EMPTY>
<!ATTLIST employee employeeID ID #REQUIRED
                   manager IDREF #IMPLIED
                   subordinates IDREFS #IMPLIED>
]>

<employees>
<employee employeeID="E3247" manager="E8012"/>
<employee employeeID="E3248" manager="E8012"/>
<employee employeeID="E3249" manager="E8012"/>
<employee employeeID="E8012" subordinates="E3247 E3248 E3249"/>
</employees>
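
The check that a validating processor makes for IDREF and IDREFS values can be approximated by hand. The Python sketch below assumes we already know which attributes are of type ID, IDREF and IDREFS (it hard-codes the names from the declaration above rather than reading the DTD), and the function name dangling_references is hypothetical. It does not check that ID values are unique.

```python
import xml.etree.ElementTree as ET

DOC = """<employees>
<employee employeeID="E3247" manager="E8012"/>
<employee employeeID="E3248" manager="E8012"/>
<employee employeeID="E3249" manager="E8012"/>
<employee employeeID="E8012" subordinates="E3247 E3248 E3249"/>
</employees>"""


def dangling_references(root):
    """Return the referenced values that are not the ID of any employee element."""
    employees = root.findall("employee")
    ids = {e.get("employeeID") for e in employees}
    referenced = set()
    for e in employees:
        if e.get("manager") is not None:                     # IDREF: a single value
            referenced.add(e.get("manager"))
        referenced.update(e.get("subordinates", "").split())  # IDREFS: a list
    return referenced - ids


# The document above: every reference resolves.
ok = dangling_references(ET.fromstring(DOC))

# Same document with the first manager changed to an ID that does not exist.
broken = dangling_references(
    ET.fromstring(DOC.replace('manager="E8012"', 'manager="E9999"', 1))
)
```

As with real validity checking, nothing here notices whether a reference points at the right kind of element, only that it points at some element.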

5 Entities
Entities are a somewhat complicated part of the XML standard, but we will only cover a very
small part of them in this module. Essentially, entities allow pieces of text, and even fragments of
XML or DTDs, to be associated with entity names. Then whenever an entity reference appears in
the XML document, the XML processor substitutes the associated text.
The simplest kind of entity ('an internal general entity') is declared using the form:
<!ENTITY entityname "entityvalue">

The entity is then referred to by writing &entityname; in character data or in the value of an
attribute.
Two common uses are:

To avoid numeric character references. For example, the Unicode character Å can be
inserted into an XML document by using the character reference &#xC5;. More memorable is to
define an entity:
<!ENTITY Aring "&#xC5;">
Then &Aring; will be expanded by a validating XML processor into the value of &#xC5;, namely
the character Å.

Recall that five entities are predefined in XML (see e.g. "Introduction to XML", 8), with the
names lt, gt, amp, quot and apos.
To simplify and standardize entering 'boilerplate' text. For example, an XML document
might contain frequent uses of the address of a company. Rather than entering it in full each time,
an entity could be defined, e.g.:
<!ENTITY companyAddress "Acme Inc., Lookout Hill, Never Never Land">
The reference &companyAddress; will then be expanded to the specified string (provided that the
XML processor reads the DTD in which this entity is defined).
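
Entity expansion can be observed with Python's ElementTree, which reads entity declarations from the internal DTD subset even though it does not validate. This is only a sketch: the letter element and its text are invented for the example, and the entities are the two declared above.

```python
import xml.etree.ElementTree as ET

# Hypothetical document using the two entities declared above.
DOC = """<!DOCTYPE letter [
<!ELEMENT letter (#PCDATA)>
<!ENTITY Aring "&#xC5;">
<!ENTITY companyAddress "Acme Inc., Lookout Hill, Never Never Land">
]>
<letter>&Aring;sa writes to &companyAddress;.</letter>"""

# By the time the tree is built, the entity references have been
# replaced with their associated text.
text = ET.fromstring(DOC).text
```

The resulting text contains the character Å and the full address, with no trace of the entity references themselves.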


Advanced use of entities, together with another similar advanced feature (conditionals) and the
overriding of entity and attribute declarations, allow sophisticated parameterisation of XML
documents and DTDs, which can help to make them more robust, flexible and easier to maintain,
although at some significant cost in complexity to the reader of the raw XML files.

Other Schema Languages


DTDs are a relatively simple, yet powerful way of providing a 'schema' for XML documents. They
have some drawbacks; in particular:

DTDs are very limited in the data types they can specify. Character data and attribute
values cannot be constrained to be integers, reals, booleans, valid dates, etc., yet such constraints
are often required.

DTDs do not allow conditional constraints. It's not possible to specify relationships among
attributes, for example, such as saying that if an employee element has the value manager for
the status attribute then there must be a subordinates attribute.

DTDs are not expressed in XML, but in their own unique notation. This makes it awkward
for an XML processor which generates new XML to generate an appropriate DTD. DTDs require
their own separate parsers and are not easily manipulated.
Many alternative XML schema languages have been proposed, but as of now, none has come to
dominate the others and thus replace DTDs. Three seem to have achieved sufficient momentum to
be widely used and available and are actively supported.

W3C XML Schema Language is very powerful but also very complex. Its particular
strength is its support for the typing of attribute values and character data; it contains a large
number of primitive types that can be used (including the usual integer, date, etc.) as well as
allowing the user to define new, bespoke data types. It also has excellent support for the validation
of parent-child relationships. However, its complexity has constrained the development of
supporting tools so that defining a schema in this language requires considerable expertise. Thus it
can be considered to be a heavyweight solution that is appropriate for highly engineered XML
application design, but is somewhat too complex and labour intensive for smaller programming
situations.

RELAX NG is a relatively recent entry into the schema language wars. It is far simpler and
cleaner than the W3C XML Schema language. Its special strength is that it is much easier to read
and write. It can specify anything that the W3C Schema language can, with the exception that it
cannot define new, bespoke, complex data types for attribute values and character data.

Schematron uses the XPath language to refer to components of an XML document. XPath
allows navigation through an XML document, using 'trails' of parents, children, siblings, etc.
Schematron can be used to specify complex constraints that are not specifiable with the other
schema languages. However, since it treats as valid anything that doesn't conflict with the
constraints it specifies (in contrast with the other languages which only allow what they specify),
many constraints usually have to be written to match a few lines of a DTD. Furthermore, like
DTDs, Schematron provides no data type support for character data or attribute values.

Schemas
Provide greater power for validating the contents of an XML document
Can specify the limitations of the data contained within an element
Are XML documents
Should be well-formed and valid
Limitations of DTDs
DTDs:
Do not specify the values that are valid for an element
Cannot specify an exact number of times an element can occur (only zero, one, or any number)
Are not written in well-formed XML
Advantages (disadvantages) of Schemas
Allow you to limit the values of an element

Support namespaces, so that you can include different XML vocabularies wi


Are represented using XML syntax
Support many data types
(Multiple standards)
(Do not support entities)
Handle mixed content
Allow user-defined data types to be created
(Not supported by all validating parsers)

Schemas are stored as external files and are linked to from the XML document by including the
following attributes in the root element:

<rootElementName
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:noNamespaceSchemaLocation="schemaName.xsd">
Example:
<customers xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:noNamespaceSchemaLocation="schema27.xsd">
The xsi name may change. The latest may be found
at: http://www.w3.org/TR/xmlschema-1/
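
A program that wants to find the schema for a document can read this attribute from the root element. The following Python sketch uses the standard library's ElementTree, which only reads the attribute and does not itself load or apply the schema; the customers element is self-closed here purely to keep the example short.

```python
import xml.etree.ElementTree as ET

XSI = "http://www.w3.org/2001/XMLSchema-instance"

DOC = """<customers xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
           xsi:noNamespaceSchemaLocation="schema27.xsd"/>"""

root = ET.fromstring(DOC)

# ElementTree spells a namespaced attribute name as {namespace-URI}localname,
# so the xsi prefix itself never appears in the lookup.
schema_file = root.get("{%s}noNamespaceSchemaLocation" % XSI)
```

Because the lookup uses the namespace URI rather than the prefix, it would work equally well if the document had bound a prefix other than xsi to the same namespace.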
The Schema Prolog
The schema must start with the standard xml header text:
<?xml version="1.0"?>
The schema declaration must be included in the prolog, before any schema definitions
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
xsd (or xs, as used in the text) is a namespace prefix that is part of the schema syntax; it stands for
the namespace that identifies the type of schema. In this course we will use xsd.
Note: this namespace URI may change. The latest may be found
at: http://www.w3.org/TR/xmlschema-1/
Comments
You may include comments using the standard HTML or XML comment notation:
<!-- Comment Line -->
You may also include comments by using annotations with the following format:
<xsd:annotation>

<xsd:documentation>
Comments here
</xsd:documentation>
</xsd:annotation>

Complex and Simple Types


Schema elements are divided into two types:
Complex types are used for elements that contain child elements or attributes
Simple types are for elements that can contain only text and no attributes
Most schemas contain a combination of complex and simple types
List of schema types

Schema Type                  Description
XML-Schema                   Created by W3C Schema working group
XDR                          Microsoft schema definition, XML Data Reduced
DDML (previously XSchema)    Document Definition Markup Language
RELAX NG                     Relax plus NG specifications
Schematron                   Uses a tree structure
TREX                         Tree regular expressions, XML documents that match the pattern

Difference Between XML Schema and DTD

XML Schema vs. DTD


DTD, or Document Type Definition, and XML Schema, which is also
known as XSD, are two ways of describing the structure and content of
an XML document. DTD is the older of the two, and as such, it has
limitations that XML Schema has tried to improve on. The first difference
between DTD and XML Schema is namespace awareness: XML
Schema is namespace aware, while DTD is not. Namespace awareness removes the
ambiguity that can result from having certain elements and attributes
from multiple XML vocabularies, by giving them namespaces that put
the element or attribute into context.
Part of the reason why XML Schema is namespace aware while DTD is
not is the fact that XML Schema is written in XML, and DTD is not.
Therefore, XML Schemas can be programmatically processed just like
any XML document. XML Schema also eliminates the need to learn
another language, as it is written in XML, unlike DTD.


Another key advantage of XML Schema is its ability to implement
strong typing. An XML Schema can define the data type of certain
elements, and even constrain it to within specific lengths or values. This
ability helps to ensure that the data stored in the XML document is accurate.
DTD lacks strong typing capabilities, and has no way of validating the
content against data types. XML Schema has a wealth of derived and built-in
data types to validate content. This provides the advantage stated
above. It also has uniform data types, but as all processors and
validators need to support these data types, it often causes older XML
parsers to fail.
A characteristic of DTD that people often consider both an advantage
and a disadvantage is the ability to define DTDs inline, which XML
Schema lacks. This is good when working with small files, as it allows
you to contain both the content and the schema within the same
document, but when it comes to larger documents, this can be a
disadvantage, as you pull content every time you retrieve the schema.
This can lead to serious overhead that can degrade performance.
Summary:
1. XML Schema is namespace aware, while DTD is not.
2. XML Schemas are written in XML, while DTDs are not.
3. XML Schema is strongly typed, while DTD is not.

4. XML Schema has a wealth of derived and built-in data types that are
not available in DTD.
5. XML Schema does not allow inline definitions, while DTD does.


XML Schema XSD vs DTD


An XML schema is a road map for the XML document similar to a Document Type Definition (DTD). Created by the
World Wide Web Consortium (W3C), schemas describe the elements and map out the presentation and nesting of
XML documents.
Essentially, the schema enables all applications to understand the flow of the page and validate the elements.

Why not write a DTD instead?


You could write a DTD for an XML page and accomplish some of the same goals. However, because a schema is
written in XML, there is no new syntax or rules to understand. If you can write a page in XML, you can write an XML
schema.
A DTD will do the same thing. While there are many differences between a DTD and the schema, both serve to
provide instruction for XML. Writing a DTD means learning a new set of rules and syntax. However, a schema is
written in XML. From a developer's standpoint, because schemas are XML documents, they can be parsed just like
any XML file. One other significant difference between these two document formats is the data type. In a DTD, a zip
code is text. That is the only way to define it. In a schema, an author could establish a different definition for the
element zip by setting up a data type. This means that you tell the parser that data strings under the element zip must
follow a set pattern.
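
What "must follow a set pattern" amounts to can be sketched in Python. The pattern below (five digits, optionally followed by a hyphen and four more) is an assumption made for illustration, not something taken from a particular schema, and the function name valid_zip is hypothetical.

```python
import re

# Hypothetical zip code pattern: five digits, optionally followed
# by a hyphen and four more digits.
ZIP_PATTERN = re.compile(r"[0-9]{5}(-[0-9]{4})?")


def valid_zip(text):
    """True if the whole string fits the pattern, not just part of it."""
    return ZIP_PATTERN.fullmatch(text) is not None
```

With a DTD, every one of "12345", "1234" and "hello" would be acceptable content for zip; with a pattern of this kind, only the first would be.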

Why Use An XML Schema?


Step away from websites for a minute and consider house building. If the plumber, electrician and contractor all do
their own thing, the result is a building with an uneven line, sinks in the bedroom and cable hookups in the bathroom.
That does not happen because the architect draws a diagram that maps out the basic structure. With a blueprint, there
is no guesswork. The plumber understands where the sinks go, and the electrician knows what rooms need what type
of wiring. A schema is the blueprint of an XML document. Since XML works to move data, it is essential that the
sender and the receiver of this data both understand the content.
