Pros
The main advantage of the DTD is that it provides validating parsers with a map of the
document. The map describes how elements relate to each other within the document: how they
nest within each other and the number of times they can occur.
Because XML first used DTDs to validate documents, they're the de facto standard for validating XML
documents. Nearly every XML package supports validation with DTDs.
Cons
As useful as DTDs are, they aren't really designed for handling the XML domain. The major concern most developers
have with DTDs is the lack of good type-checking. Also, DTDs are written in a strange and seemingly archaic format.
They have only a limited capability for describing the document structure in terms of how many elements can nest
within other elements.
<!DOCTYPE employees [
<!ELEMENT employees (employee*)>
<!ELEMENT employee (firstname, familyname, comment)>
<!ATTLIST employee id ID #REQUIRED
                   manager IDREF #IMPLIED>
<!ELEMENT firstname (#PCDATA)>
<!ELEMENT familyname (#PCDATA)>
<!ELEMENT comment (#PCDATA)>
]>
<employees>
  <employee manager="e3257">
    <firstname>Joe</firstname><familyname>Bloggs</familyname>
    <comment>Due to retire soon.</comment>
  </employee>
  <employee id="e3257">
    <familyname>Patel</familyname>
    <comment/>
  </employee>
</employees>
Although this document is well-formed, it is now invalid, since it does not conform to the DTD,
which specifies that:
The employee element has a list of attributes, of which the first, id, is required while the
second, manager, is optional; however, the first employee element does not have an id attribute.
The employee element must always contain the three subelements firstname, familyname and comment, in exactly that order; however, the
second employee element does not contain a firstname element.
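The two violations can also be found by hand, which may help to make the rules concrete. The following Python sketch is only an illustration and not a validating parser: it re-types the document above with normal spacing, leaves out the DOCTYPE, and hard-codes the two rules. The names find_violations and REQUIRED_CHILDREN are ours.

```python
import xml.etree.ElementTree as ET

# The example document, re-typed with normal spacing (DOCTYPE omitted).
DOC = """<employees>
  <employee manager="e3257">
    <firstname>Joe</firstname><familyname>Bloggs</familyname>
    <comment>Due to retire soon.</comment>
  </employee>
  <employee id="e3257">
    <familyname>Patel</familyname>
    <comment/>
  </employee>
</employees>"""

# Hard-coded copy of the content model (firstname, familyname, comment).
REQUIRED_CHILDREN = ["firstname", "familyname", "comment"]


def find_violations(xml_text):
    """Return a list of messages for the two rules discussed above."""
    root = ET.fromstring(xml_text)  # succeeds because the text is well-formed
    problems = []
    for position, employee in enumerate(root.findall("employee"), start=1):
        if "id" not in employee.attrib:
            problems.append(f"employee {position}: required attribute id is missing")
        children = [child.tag for child in employee]
        if children != REQUIRED_CHILDREN:
            problems.append(f"employee {position}: children {children} do not match the DTD")
    return problems


violations = find_violations(DOC)
for message in violations:
    print(message)
```

The parse step succeeds because the document is well-formed; the problems only appear once the rules taken from the DTD are applied.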
We will now look at the components of DTDs in turn.
3 Element Declarations
Element type declarations are of the form:
<!ELEMENT elementname contentrule>
The elementname is, of course, just the name of the element that the rule applies to, while
the contentrule specifies the constraints that apply to it. There are a number of different forms
for the contentrule, called content models.
3.1 The Any Content Model Here the contentrule has the value ANY, and the corresponding
element in the XML document can have any mix of character data and elements in any order. If no
DTD is provided for an XML document, this is the default assumed for every element. Clearly this
content model should be avoided if at all possible, since it effectively avoids specifying the format
of the element. The following declaration says that every a element may have any content:
<!ELEMENT a ANY>
3.2 The Empty Content Model Here the contentrule has the value EMPTY, and the
corresponding element in the XML document must contain no content, neither sub-elements nor
character data. Thus the following declaration says that every a element must have no content:
<!ELEMENT a EMPTY>
3.3 Element Only Content Models Here an expression involving element names and some
special characters (representing operators) is used to describe patterns of child elements that can
occur in the content of the target element. The expressions are built up recursively from a number
of components. Expressions must be fully parenthesised in order to exclude any possibility of
ambiguity or the need to specify the order of precedence of pattern operators.
Element Name: an element name means that an element of that name is required at this
position. The following declaration says that an element named a must have precisely one child
element whose name is b:
<!ELEMENT a (b)>
Technically, an element name can only appear inside sequences or choices, both of which must be
enclosed in parentheses. Hence the parentheses around the b in the example above are because it is
actually the only entry in a sequence of length 1. Thus the following is invalid:
<!ELEMENT a b>
(The reason for this restriction is so that there can be elements named ANY and EMPTY without
conflicting with the corresponding content models.)
Sequence: a comma is used to separate expressions which must appear in the order given.
The following declaration says that an element named a must have precisely three child elements
whose names are b, c and d, in that order:
<!ELEMENT a (b, c, d)>
One or More Repeats: a plus + is used to follow a sequence, a choice or an element name
which must be repeated 1 or more times. The following says that an element named a must have at
least one b child element, but possibly many (and nothing else):
<!ELEMENT a (b+)>
Optionality: the question mark ? is used to follow a sequence, a choice or an element name
which is optional, i.e. may or may not occur. The following says that an element named a may or
may not have a single b child element, but must contain no other elements:
<!ELEMENT a (b?)>
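Choice and Zero or More Repeats: a vertical bar | is used to separate alternatives, exactly one of
which must appear, and an asterisk * is used to follow a sequence, a choice or an element name
which may be repeated 0 or more times. These operators can be combined, as in the following
declaration:
<!ELEMENT cv (preface, (qualification | experience)+, hobbies?, referee*)>
Here the cv element must start with a preface element. This must be followed by a repeating
group that has to occur at least once. Each cycle of this group can contain either
a qualification element or an experience element; i.e. there can be a sequence
of qualification and experience elements in any order and of any length greater than 0. Next
there might or might not be a single hobbies element. Finally there can be a sequence of any
number (including zero) of referee elements.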
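Element-only content models are close relatives of regular expressions, and one way to get a feel for them is to translate one into the other. The Python sketch below is an assumption-laden illustration rather than how any real parser works: it handles element-only models (no #PCDATA, ANY or EMPTY), and it assumes the cv declaration reads (preface, (qualification | experience)+, hobbies?, referee*), which is our reconstruction from the description above. The function names are ours.

```python
import re

# A token is either an element name or one of the operator characters.
TOKEN = re.compile(r"[A-Za-z_][\w.\-]*|[()|,+?*]")


def content_model_to_regex(model):
    """Translate an element-only content model into a compiled regex."""
    parts = []
    for token in TOKEN.findall(model):
        if token == ",":
            continue  # a sequence is plain concatenation in a regex
        if token in "()|+?*":
            parts.append(token)  # same meaning in both notations
        else:
            # One child element, written as "name;" in the text to match.
            parts.append("(?:" + re.escape(token) + ";)")
    return re.compile("".join(parts))


def children_match(model, child_names):
    """True if the list of child element names fits the content model."""
    text = "".join(name + ";" for name in child_names)
    return content_model_to_regex(model).fullmatch(text) is not None


CV_MODEL = "(preface, (qualification | experience)+, hobbies?, referee*)"

print(children_match(CV_MODEL, ["preface", "experience", "qualification", "referee"]))
print(children_match(CV_MODEL, ["preface", "hobbies"]))
```

The first call fits the model; the second does not, because the repeating group of qualification and experience elements has to occur at least once.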
3.4 Mixed Content Models Here we can finally specify that an element contains character data,
and how it can be mixed with child elements. There are two forms:
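If element a contains character data but no child elements at all, it is declared by:
<!ELEMENT a (#PCDATA)>
#PCDATA historically comes from SGML and stands for 'parsed character data'.
If element a contains character data mixed with child elements, say b and c, occurring in any
order and any number of times, it is declared by:
<!ELEMENT a (#PCDATA | b | c)*>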
These are the only kinds of mixed content models allowed. There's no way, for example, of
specifying that an element starts with character data but then ends with a given element. For this
reason, the second form of mixed content model should be avoided unless it really is the case that
the child elements can occur anywhere within character data.
Neither form provides any way of specifying what kind of character data can occur, e.g. that it
must be an integer or consist of exactly three words.
4 Attribute Declarations
Attribute Type Declarations are of the form:
<!ATTLIST elemname attname1 atttype1 attstatus1
                   attname2 atttype2 attstatus2
                   ... ... ...>
The elemname identifies the element to which this list of attribute declarations applies. The
attname identifies which particular attribute of the element is being declared. The values of
atttype and attstatus then specify the form of the attribute.
The attstatus part of the declaration is the easiest to understand. It specifies whether the
attribute is required or optional and, if optional, whether a default value should be assumed if not
given in the XML document.
#REQUIRED means that the attribute must be present in every target element in the XML
document; hence there is no need for a default value.
#IMPLIED means that the attribute is optional in every target element in the XML document
(i.e. may or may not be present), but that no default value is provided if the attribute is omitted. (I
find IMPLIED to be an odd name for this case; OPTIONAL would surely have been better.)
"attvalue" means that the attribute is optional in every target element in the XML
document, but that if it is not present then it must be added by a validating XML processor and
given the value of attvalue (without the quotes).
#FIXED "attvalue" means that if the attribute is present in a target element in the XML
document then it must have precisely the given value; if it is not present then it must be added by a
validating XML processor and given the value of attvalue (without the quotes).
The last two forms create a problem. A non-validating XML processor is only required to check
that an XML document is well-formed; it is not required to process the DTD fully in order to
check whether the document is valid. However, if it doesn't fully process the DTD then it may not
be able to add default or fixed attributes, so that if the processor constructs a tree to represent the
XML document (see e.g. "Introduction to XML", 9), a validating XML processor may create a
different tree to that created by a non-validating XML processor.
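The difference can be imitated with Python's built-in expat bindings. This is a sketch under stated assumptions and not a comparison of two real products: expat is a non-validating parser that nevertheless reads the internal DTD subset, and its specified_attributes switch is used here to mimic a processor that ignores attribute defaults. The helper name attributes_seen and the one-element document are ours.

```python
from xml.parsers import expat

DOC = """<!DOCTYPE employee [
  <!ELEMENT employee EMPTY>
  <!ATTLIST employee status (fulltime|parttime) "fulltime">
]>
<employee/>"""


def attributes_seen(xml_text, apply_defaults):
    """Return {element name: attributes} as reported by the parser."""
    seen = {}
    parser = expat.ParserCreate()
    # With specified_attributes switched on, expat reports only the
    # attributes written in the document itself, not declared defaults.
    parser.specified_attributes = not apply_defaults

    def start_element(name, attributes):
        seen[name] = dict(attributes)

    parser.StartElementHandler = start_element
    parser.Parse(xml_text, True)
    return seen


with_defaults = attributes_seen(DOC, apply_defaults=True)
without_defaults = attributes_seen(DOC, apply_defaults=False)
print(with_defaults)
print(without_defaults)
```

Code that walks the resulting tree therefore sees a status attribute in one case and none in the other, which is exactly the portability problem described above.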
Returning to the atttype part of the attribute declaration, there are ten different forms, of which
seven fall within the scope of this module. Each will be described in turn.
4.1 Enumerated List Attribute Type This form is used to specify that the value of the attribute
can only be one of a fixed set of values given in a vertical bar separated list of names. The list
must be enclosed in parentheses. Note that a value must be a valid name (see e.g. "Introduction to
XML", 1). Some examples:
<!ATTLIST employee status (fulltime|parttime) "fulltime">
This specifies that employee elements have the attribute status, which can only have one of
the two values fulltime or parttime. If the attribute is omitted, then a validating XML processor
must add it with the value fulltime. If the overwhelming majority of employees are full time,
then an attribute declaration like this avoids the need to include status="fulltime" in most of
the employee elements. However, it usually makes the eventual code written to process the XML
document more complex, since it will be safer to determine whether an employee is full time or
not by testing for either the absence of the status attribute or its having the value fulltime.
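A minimal sketch of that defensive test in Python, assuming the document has been read by a processor that may or may not have applied the default. The helper is_full_time and the name attribute are hypothetical and only serve the illustration.

```python
import xml.etree.ElementTree as ET


def is_full_time(employee):
    # A missing status attribute is treated like the DTD default "fulltime".
    return employee.get("status", "fulltime") == "fulltime"


staff = ET.fromstring(
    "<employees>"
    '<employee name="A"/>'
    '<employee name="B" status="parttime"/>'
    '<employee name="C" status="fulltime"/>'
    "</employees>"
)

full_time_names = [e.get("name") for e in staff if is_full_time(e)]
print(full_time_names)
```

Written this way, the code gives the same answer whether or not the default was added by the processor.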
<!ATTLIST menu cuisine (French|Italian|Indian|Chinese) #REQUIRED>
This specifies that menu elements must have the attribute cuisine, which can only have one of the
values French, Italian, Indian or Chinese. If the attribute is omitted, then a validating XML
processor must report an error.
4.2 Name Token Attribute Type The keyword NMTOKEN is used to indicate this type, which is
used to specify that the value of the attribute can be any arbitrary 'name token'. A name token is
like a name, but can in addition start with a digit. In the example above, if it had been decided to
use a single word to refer to a type of cuisine, but it was required to be able to extend the types of
cuisine to more than a fixed list, the appropriate declaration would be:
<!ATTLIST menu cuisine NMTOKEN #REQUIRED>
Any single name token is then accepted as a value, but not a value containing white space, such as:
<menu cuisine="Greek and Italian">...</menu>
4.3 Multi-Name Token Attribute Type The keyword NMTOKENS is used to indicate this type,
which is used to allow an attribute to have a white space separated list of values, each of
type NMTOKEN. Thus if we declare the cuisine attribute of menu via:
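<!ATTLIST menu cuisine NMTOKENS #REQUIRED>
then a menu element can list several cuisines in one attribute value, separated by white space.
(Actually the value "Greek and Italian" would also be valid, but the spirit of this attribute type
is to have a list of name tokens each with its own meaning.)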
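The difference between the two types can be sketched with a regular expression. This is a simplification, assuming ASCII values only (the real XML name-character set is much larger); the function names are ours.

```python
import re

# ASCII-only approximation of the characters allowed in a name token.
NMTOKEN = re.compile(r"[A-Za-z0-9._:\-]+")


def is_nmtoken(value):
    """True if value is a single name token (no white space)."""
    return NMTOKEN.fullmatch(value) is not None


def is_nmtokens(value):
    """True if value is a white space separated list of name tokens."""
    tokens = value.split()
    return bool(tokens) and all(is_nmtoken(token) for token in tokens)


for candidate in ["Greek", "2nd", "Greek and Italian"]:
    print(candidate, is_nmtoken(candidate), is_nmtokens(candidate))
```

A value with embedded white space fails the NMTOKEN test but passes the NMTOKENS test, where it is simply read as a list of three tokens.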
4.4 String Attribute Type The keyword CDATA is used to indicate this type. As with the
keyword #PCDATA for elements, the name is historic, but misleading. Attributes cannot have
general character data values. They cannot contain CDATA sections, so it's particularly odd to
use CDATA as the keyword. CDATA means that the attribute can have any string value. For example:
<!ATTLIST purchase ordernumber CDATA #REQUIRED
                   customerid CDATA #IMPLIED
                   seller CDATA #FIXED "ACME INC."
                   priority CDATA "normal">
All of the four attributes of a purchase element can have string values (and so, for example, can
contain white space). The attribute ordernumber must be present, whereas customerid can be
omitted, but has no default value if it is. The value of seller is always the string 'ACME INC.'. The
attribute priority may be omitted but if so defaults to normal.
It should be clear that CDATA offers only a very weak specification for an attribute value, and so
should be avoided if possible. The order number and customer ID are likely to be numbers in a
real application. DTDs don't provide any way of specifying this. However NMTOKEN is likely to be
a better choice than CDATA, since it rules out embedded white space. If the attribute priority is to
be allowed to have values like rush order then an enumerated list can't be used (because its
values can only be names) nor can NMTOKEN, and CDATA is the only choice. However, it would
probably be better instead to use values like rushOrder or rush_order, which allow the more
restricted NMTOKEN as the attribute type. Only the attribute seller above really needs to be CDATA.
4.5 Identifier Attribute Type The keyword ID is used to indicate this type. It means that the
attribute is an identifier for the target element and so can be used in references (see
e.g. "Introduction to XML", 5). Note that:
No default values can be specified for identifier attributes. Thus the only attstatus
declarations that can be used for these types are #REQUIRED and #IMPLIED.
Identifier attribute values must be names, not arbitrary strings; in particular, they cannot be
numbers, so we have to use something like C245892 rather than 245892 to make a customer
number into an identifier.
No element can have more than one identifier attribute declared for it.
No two different elements in an XML document can have the same identifier attribute
value assigned to them, even if the elements and attributes have different names.
Example:
<!ATTLIST book bookid ID #REQUIRED
               publicationdate CDATA #IMPLIED>
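Two of the rules above (identifier values must be names, and no value may be used twice in a document) can be sketched in Python as follows. This is an illustration only: the name pattern is an ASCII simplification, the mapping from element names to their ID attributes is supplied by hand instead of being read from a DTD, and the catalogue document with its author element is hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# ASCII-only approximation of an XML name: it must not start with a digit.
NAME = re.compile(r"[A-Za-z_:][A-Za-z0-9._:\-]*")


def check_ids(root, id_attributes):
    """id_attributes maps an element name to the name of its ID attribute."""
    problems = []
    seen = set()
    for element in root.iter():
        attribute = id_attributes.get(element.tag)
        value = element.get(attribute) if attribute else None
        if value is None:
            continue
        if not NAME.fullmatch(value):
            problems.append(f"{value!r} is not a name")
        if value in seen:
            problems.append(f"{value!r} is used more than once")
        seen.add(value)
    return problems


catalogue = ET.fromstring(
    "<catalogue>"
    '<book bookid="B1"/>'
    '<book bookid="245892"/>'
    '<author authorid="B1"/>'
    "</catalogue>"
)
problems = check_ids(catalogue, {"book": "bookid", "author": "authorid"})
for problem in problems:
    print(problem)
```

Note that the clash between the book and the author is reported even though the elements and attributes have different names, in line with the last rule above.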
4.6 Identifier Reference Attribute Type The keyword IDREF is used to indicate this type. It
means that the attribute is used to refer to an element in the current XML document which has an
attribute of type ID, i.e. that the attribute is used to make a reference. Note that:
Any of the four attstatus declarations can be used with this attribute type.
Just as with ID attribute types, the values for IDREF attributes must be names, not arbitrary
strings, since the values must be those used for an attribute of type ID.
An element can have multiple different attributes of type IDREF (which will usually be
used to refer to different kinds of element, but don't have to).
For every attribute of type IDREF in a valid XML document, there must exist an element in
the document with an attribute of type ID whose attribute value matches that of
the IDREF attribute. This makes sure that every attribute of type IDREF which has a value actually
refers to an element.
For example:
<!ATTLIST division divisionID ID #REQUIRED>
<!ATTLIST employee employeeID ID #REQUIRED
                   division IDREF #REQUIRED
                   manager IDREF #IMPLIED>
These two declarations ensure that division and employee elements have unique identifiers
(IDs), divisionID and employeeID respectively. An employee has to have an ID reference value
in a division attribute (e.g. to say that he or she is employed in that division), and may have an
ID reference value in a manager attribute (e.g. to say that the employee with that employeeID is
his or her manager). Notice that the only requirement enforced by validity is that the value of an
attribute of type IDREF is the value of an attribute of type ID in the same XML document. Validity
checks don't prevent us from erroneously putting an employee ID value in the division attribute,
or a division ID value in the manager attribute.
Attributes of type IDREF cannot be used to make references to elements in other XML documents,
since the validity check will fail. Generally, the NMTOKEN type must be used for this purpose.
4.7 Multi-Identifier Reference Attribute Type The keyword IDREFS is used to indicate this type.
The value of the attribute is a list of white space separated IDREF values, each with the same
constraints as a single identifier reference. For example:
<!ATTLIST employee employeeID ID #REQUIRED
                   manager IDREF #IMPLIED
                   subordinates IDREFS #IMPLIED>
Every employee may have a single manager and may have multiple subordinates. A corresponding
valid XML document might be:
<?xml version="1.0"?>
<!DOCTYPE employees [
<!ELEMENT employees (employee*)>
<!ELEMENT employee EMPTY>
<!ATTLIST employee employeeID ID #REQUIRED
                   manager IDREF #IMPLIED
                   subordinates IDREFS #IMPLIED>
]>
<employees>
  <employee employeeID="E3247" manager="E8012"/>
  <employee employeeID="E3248" manager="E8012"/>
  <employee employeeID="E3249" manager="E8012"/>
  <employee employeeID="E8012" subordinates="E3247 E3248 E3249"/>
</employees>
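The rule that every IDREF and IDREFS value must match some ID in the same document can be sketched in Python. This is an illustration and not a validator: the attribute names are hard-coded from the declaration above, the document is re-typed with normal spacing and without its DOCTYPE, and the function name dangling_references is ours.

```python
import xml.etree.ElementTree as ET

DOC = """<employees>
  <employee employeeID="E3247" manager="E8012"/>
  <employee employeeID="E3248" manager="E8012"/>
  <employee employeeID="E3249" manager="E8012"/>
  <employee employeeID="E8012" subordinates="E3247 E3248 E3249"/>
</employees>"""


def dangling_references(root):
    """Return every manager or subordinates value that matches no employeeID."""
    known_ids = {e.get("employeeID") for e in root.iter("employee")}
    missing = []
    for employee in root.iter("employee"):
        references = []
        if employee.get("manager") is not None:
            references.append(employee.get("manager"))  # a single IDREF
        # An IDREFS value is a white space separated list of references.
        references.extend(employee.get("subordinates", "").split())
        missing.extend(ref for ref in references if ref not in known_ids)
    return missing


print(dangling_references(ET.fromstring(DOC)))

# Breaking one reference on purpose shows what a failure looks like.
broken = DOC.replace('manager="E8012"', 'manager="E9999"', 1)
print(dangling_references(ET.fromstring(broken)))
```

As in the text, the check only establishes that each reference points at some identifier; it says nothing about whether the referenced element is of the intended kind.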
5 Entities
Entities are a somewhat complicated part of the XML standard, but we will only cover a very
small part of them in this module. Essentially, entities allow pieces of text, and even fragments of
XML or DTDs, to be associated with entity names. Then whenever an entity reference appears in
the XML document, the XML processor substitutes the associated text.
The simplest kind of entity ('an internal general entity') is declared using the form:
<!ENTITY entityname "entityvalue">
The entity is then referred to by writing &entityname; in character data or in the value of an
attribute.
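The substitution can be observed with Python's standard ElementTree module, which reads entity declarations from the internal DTD subset. This is a small sketch; the letter element, the sender attribute and the entity name siteName are hypothetical and chosen only for the illustration.

```python
import xml.etree.ElementTree as ET

DOC = """<!DOCTYPE letter [
  <!ENTITY siteName "Lookout Hill">
]>
<letter sender="&siteName;">Write to us at &siteName;.</letter>"""

letter = ET.fromstring(DOC)

# By the time the application sees the tree, both references have
# already been replaced by the entity's value.
print(letter.text)
print(letter.get("sender"))
```

The application never sees the reference itself, only the substituted text, both in the character data and in the attribute value.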
Two common uses are:
To avoid numeric character references. For example, the Unicode character Å can be
inserted into an XML document by using the character reference &#197;. More memorable is to
define an entity:
<!ENTITY Aring "&#197;">
Then &Aring; will be expanded to
the character Å.
Recall that five entities are predefined in XML (see e.g. "Introduction to XML", 8), with the
names lt, gt, amp, quot and apos.
To simplify and standardize entering 'boilerplate' text. For example, an XML document
might contain frequent uses of the address of a company. Rather than entering it in full each time,
an entity could be defined, e.g.:
<!ENTITY companyAddress "Acme Inc., Lookout Hill, Never Never Land">
The reference &companyAddress; will then be expanded to the specified string.
DTDs are very limited in the data types they can specify. Character data and attribute
values cannot be constrained to be integers, reals, booleans, valid dates, etc., yet such constraints
are often required.
DTDs do not allow conditional constraints. It's not possible to specify relationships among
attributes, for example, such as saying that if an employee element has the value manager for
the status attribute then there must be a subordinates attribute.
DTDs are not expressed in XML, but in their own unique notation. This makes it awkward
for an XML processor which generates new XML to generate an appropriate DTD. DTDs require
their own separate parsers and are not easily manipulated.
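In practice, constraints of the first two kinds end up in application code. The Python sketch below shows what such hand-written checks might look like; it is an assumption-laden illustration, with a hypothetical salary attribute and status value staff, and the function name check_employees is ours.

```python
import re
import xml.etree.ElementTree as ET

DOC = """<employees>
  <employee employeeID="E8012" status="manager" subordinates="E3247"/>
  <employee employeeID="E8013" status="manager"/>
  <employee employeeID="E3247" status="staff" salary="abc"/>
</employees>"""


def check_employees(root):
    """Apply two checks that a DTD cannot express."""
    problems = []
    for employee in root.iter("employee"):
        who = employee.get("employeeID")
        # Conditional constraint: a manager must list subordinates.
        if employee.get("status") == "manager" and employee.get("subordinates") is None:
            problems.append(f"{who}: manager without subordinates")
        # Data type constraint: salary, if present, must be an integer.
        salary = employee.get("salary")
        if salary is not None and not re.fullmatch(r"-?[0-9]+", salary):
            problems.append(f"{who}: salary {salary!r} is not an integer")
    return problems


problems = check_employees(ET.fromstring(DOC))
for problem in problems:
    print(problem)
```

Every application that reads the document has to repeat checks like these, which is one reason the alternative schema languages discussed next were developed.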
Many alternative XML schema languages have been proposed, but as of now, none has come to
dominate the others and thus replace DTDs. Three seem to have achieved sufficient momentum to
be widely used and available and are actively supported.
W3C XML Schema Language is very powerful but also very complex. Its particular
strength is its support for the typing of attribute values and character data; it contains a large
number of primitive types that can be used (including the usual integer, date, etc.) as well as
allowing the user to define new, bespoke data types. It also has excellent support for the validation
of parent-child relationships. However, its complexity has constrained the development of
supporting tools so that defining a schema in this language requires considerable expertise. Thus it
can be considered to be a heavyweight solution that is appropriate for highly engineered XML
application design, but is somewhat too complex and labour intensive for smaller programming
situations.
RELAX NG is a relatively recent entry into the schema language wars. It is far simpler and
cleaner than the W3C XML Schema language. Its special strength is that it is much easier to read
and write. It can specify anything that the W3C Schema language can, with the exception that it
cannot define new, bespoke, complex data types for attribute values and character data.
Schematron uses the XPath language to refer to components of an XML document. XPath
allows navigation through an XML document, using 'trails' of parents, children, siblings, etc.
Schematron can be used to specify complex constraints that are not specifiable with the other
schema languages. However, since it treats as valid anything that doesn't conflict with the
constraints it specifies (in contrast with the other languages which only allow what they specify),
many constraints usually have to be written to match a few lines of a DTD. Furthermore, like
DTDs, Schematron provides no data type support for character data or attribute values.
Schemas:
Provide greater power for validating the contents of an XML document
Can specify the limitations of the data contained within an element
Are XML documents
Should be well-formed and valid
Limitations of DTDs
DTDs:
Do not specify the values that are valid for an element
Offer only coarse control (?, * and +) over the number of times an element can occur
Are not written in well-formed XML
Advantages (disadvantages) of Schemas
Allow you to limit the values of an element
<xsd:annotation>
<xsd:documentation>
Comments here
</xsd:documentation>
</xsd:annotation>
Schema languages:
XML-Schema
XDR (Microsoft schema definition, XML Data Reduced)
DDML (previously XSchema)
RELAX NG
Schematron
TREX
4. XML Schema has a wealth of derived and built-in data types that are
not available in DTD.
5. XML Schema does not allow inline definitions, while DTD does.