How does XML handle white-space in my documents?
All white-space, including linebreaks, TAB characters,
and normal spaces, even between ‘structural’ elements
where no text can ever appear, is passed by the parser
unchanged to the application (browser, formatter,
viewer, converter, etc), identifying the context in
which the white-space was found (element content, data
content, or mixed content, if this information is
available to the parser, eg from a DTD or Schema). This
means it is the application's responsibility to decide
what to do with such space, not the parser's:
* insignificant white-space between structural elements
(space which occurs where only element content is
allowed, ie between other elements, where text data
never occurs) will get passed to the application (in
SGML this white-space gets suppressed, which is why you
can put all that extra space in HTML documents and not
worry about it)
* significant white-space (space which occurs within
elements which can contain text and markup mixed
together, usually mixed content or PCDATA) will still
get passed to the application exactly as under SGML. It
is the application's responsibility to handle it
The parser must inform the application that white-space
has occurred in element content, if it can detect it.
(Users of SGML will recognize that this information is
not in the ESIS, but it is in the Grove.)
<chapter>
  <title>
    My title for
    Chapter 1.
  </title>
  <para>
    text
  </para>
</chapter>
In the example above, the application will receive all
the pretty-printing linebreaks, TABs, and spaces between
the elements as well as those embedded in the chapter
title. It is the function of the application, not the
parser, to decide which type of white-space to discard
and which to retain. Many XML applications have
configurable options to allow programmers or users to
control how such white-space is handled.
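As a rough illustration of that division of labour, the Python sketch below parses a small hypothetical chapter with the standard library's xml.dom.minidom and shows that the white-space-only text between elements reaches the application. The function strip_layout_whitespace and the sample text are assumptions made for this example; they show one possible application policy, not a rule of XML.

```python
from xml.dom import minidom

# Hypothetical pretty-printed sample (not taken from any real document).
SAMPLE = """<chapter>
  <title>
    My title for
    Chapter 1.
  </title>
  <para>text</para>
</chapter>"""

chapter = minidom.parseString(SAMPLE).documentElement

# The parser hands over every text node, including the white-space-only
# ones that sit between the child elements.
raw_children = [
    node.data if node.nodeType == node.TEXT_NODE else node.tagName
    for node in chapter.childNodes
]


def strip_layout_whitespace(element):
    """Application policy: drop white-space-only text between elements."""
    for node in list(element.childNodes):
        if node.nodeType == node.TEXT_NODE and not node.data.strip():
            element.removeChild(node)
        elif node.nodeType == node.ELEMENT_NODE:
            strip_layout_whitespace(node)


strip_layout_whitespace(chapter)
kept_children = [node.tagName for node in chapter.childNodes]

# White-space inside the title is data; here the application chooses
# to collapse it into single spaces.
title_text = chapter.getElementsByTagName("title")[0].firstChild.data
normalized_title = " ".join(title_text.split())
```

Whether to discard, collapse, or preserve each kind of space remains the application's decision; a formatter for verse or program listings would choose differently.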
Which parts of an XML document are case-sensitive?
All of it, both markup and text. This is significantly
different from HTML and most other SGML applications. It
was done to allow markup in non-Latin-alphabet
languages, and to obviate problems with case-folding in
writing systems which are caseless.
* Element type names are case-sensitive: you must follow
whatever combination of upper- or lower-case you use to
define them (either by first usage or in a DTD or
Schema). So you can't say <BODY>…</body>: upper- and
lower-case must match; thus <Img/>, <IMG/>, and <img/>
are three different element types;
* For well-formed XML documents with no DTD, the first
occurrence of an element type name defines the casing;
* Attribute names are also case-sensitive, for example
the two width attributes in <PIC width="7in"/> and <PIC
WIDTH="6in"/> (if they occurred in the same file) are
separate attributes, because of the different case of
width and WIDTH;
* Attribute values are also case-sensitive. CDATA values
(eg Url="MyFile.SGML") always have been, but NAME types
(ID and IDREF attributes, and token list attributes) are
now case-sensitive as well;
* All general and parameter entity names (eg &Aacute;), and
your data content (text), are case-sensitive as always.
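These rules can be checked with any XML parser. The sketch below uses Python's standard xml.etree.ElementTree; the helper is_well_formed and the wrapper element doc are names invented for this illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Start-tag and end-tag must match in case.
mismatched = is_well_formed("<BODY>text</body>")
matched = is_well_formed("<body>text</body>")

# width and WIDTH are two different attribute names, so both can
# appear on the same element without clashing.
pic = ET.fromstring('<PIC width="7in" WIDTH="6in"/>')
attrs = dict(pic.attrib)

# Names differing only in case are three distinct element types.
doc = ET.fromstring("<doc><Img/><IMG/><img/></doc>")
tags = [child.tag for child in doc]
lower_only = doc.findall("img")
```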
How can I make my existing HTML files work in XML?
Either convert them to conform to some new document type
(with or without a DTD or Schema) and write a stylesheet
to go with them; or edit them to conform to XHTML.
It is necessary to convert existing HTML files because
XML does not permit end-tag minimisation (missing
</p>, etc), unquoted attribute values, and a number of other
SGML shortcuts which have been normal in most HTML DTDs.
However, many HTML authoring tools already produce
almost (but not quite) well-formed XML.
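A quick way to see which shortcuts stand in the way is to feed fragments to an XML parser. The following Python sketch, using the standard xml.etree.ElementTree, is only an illustration: the helper well_formedness_error and the three sample fragments are invented for it.

```python
import xml.etree.ElementTree as ET


def well_formedness_error(text):
    """Return the parser's error message, or None if the text is well-formed."""
    try:
        ET.fromstring(text)
        return None
    except ET.ParseError as exc:
        return str(exc)


# HTML-style shortcuts that an XML parser rejects.
omitted_end_tags = "<div><p>First<p>Second</div>"
unquoted_value = "<table border=1></table>"

# The same content with every element closed and every value quoted.
xml_style = (
    '<div><p>First</p><p>Second</p><br/>'
    '<table border="1"></table></div>'
)

results = {
    "omitted_end_tags": well_formedness_error(omitted_end_tags),
    "unquoted_value": well_formedness_error(unquoted_value),
    "xml_style": well_formedness_error(xml_style),
}
```

Only the last fragment parses; the messages returned for the first two point at the position where the parser gave up.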
You may be able to convert HTML to XHTML using Dave
Raggett's HTML Tidy program, which can clean up some of
the formatting mess left behind by inadequate HTML
editors, and even separate out some of the formatting to
a stylesheet, but there is usually still some
hand-editing to do.
Is there an XML version of HTML?
Yes, the W3C recommends using XHTML, which is ‘a
reformulation of HTML 4 in XML 1.0’. This specification
defines HTML as an XML application, and provides three
DTDs corresponding to the ones defined by HTML 4.*
(Strict, Transitional, and Frameset).
The semantics of the elements and their attributes are
as defined in the W3C Recommendation for HTML 4. These
semantics provide the foundation for future
extensibility of XHTML. Compatibility with existing HTML
browsers is possible by following a small set of
guidelines (see the W3C site).
If XML is just a subset of SGML, can I use XML files
directly with existing SGML tools?
Yes, provided you use up-to-date SGML software which
knows about the WebSGML Adaptations TC to ISO 8879 (the
features needed to support XML, such as the variant form
for EMPTY elements; some aspects of the SGML Declaration
such as NAMECASE GENERAL NO; multiple attribute token
list declarations, etc).
An alternative is to use an SGML DTD to let you create a
fully-normalised SGML file, but one which does not use
empty elements; and then remove the DocType Declaration
so it becomes a well-formed DTDless XML file. Most SGML
tools now handle XML files well, and provide an option
to switch between the two standards.
Can XML use non-Latin characters?
Yes, the XML Specification explicitly says XML uses ISO
10646, the international standard character repertoire
which covers most known languages. Unicode is an
identical repertoire, and the two standards track each
other. The spec says (2.2): ‘All XML processors must
accept the UTF-8 and UTF-16 encodings of ISO 10646…’.
There is a Unicode FAQ at http://www.unicode.org/faq/FAQ.
UTF-8 is an encoding of Unicode into 8-bit bytes:
the first 128 are the same as ASCII, and bytes with the
high bit set are used to encode anything else from Unicode
into sequences of between 2 and 4 bytes (the original design
allowed up to 6, but RFC 3629 limits UTF-8 to 4). UTF-8 in its
single-octet form is therefore the same as ISO 646 IRV
(ASCII), so you can continue to use ASCII for English or
other languages using the Latin alphabet without
diacritics. Note that UTF-8 is incompatible with ISO
8859-1 (ISO Latin-1) after code point 127 decimal (the
end of ASCII).
UTF-16 is an encoding of Unicode into 16-bit code units,
which lets it represent all 17 planes (the Basic Multilingual
Plane directly, the other 16 through surrogate pairs). UTF-16 is
incompatible with ASCII because it uses two 8-bit bytes
per character (four bytes above U+FFFF).
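The byte counts are easy to confirm. The Python sketch below encodes four sample characters (chosen for this illustration) in both encodings, using the big-endian form of UTF-16 so that no byte-order mark is added to the count.

```python
# Sample characters chosen for illustration, written as escapes so the
# example does not depend on the encoding of this file.
SAMPLES = {
    "A": "A",                  # U+0041, ASCII
    "e-acute": "\u00e9",       # U+00E9, in ISO 8859-1
    "euro": "\u20ac",          # U+20AC, in the Basic Multilingual Plane
    "emoji": "\U0001F600",     # U+1F600, above U+FFFF
}


def encoded_lengths(ch):
    """Return (UTF-8 byte count, UTF-16 byte count) for one character."""
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-be"))


lengths = {name: encoded_lengths(ch) for name, ch in SAMPLES.items()}

# Pure ASCII text is byte-for-byte identical in UTF-8.
ascii_same = "XML".encode("utf-8") == "XML".encode("ascii")

# Above code point 127 the two encodings part company.
latin1_bytes = "\u00e9".encode("latin-1")
utf8_bytes = "\u00e9".encode("utf-8")
```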
What's a Document Type Definition (DTD) and where do I get one?
A DTD is a description in XML Declaration Syntax of a
particular type or class of document. It sets out what
names are to be used for the different types of element,
where they may occur, and how they all fit together. (A
Schema, see question C.16, does the same thing in XML
Document Syntax, and allows more extensive validation.)
For example, if you want a document type to be able to
describe Lists which contain Items, the relevant part of
your DTD might contain something like this:
<!ELEMENT List (Item)+>
<!ELEMENT Item (#PCDATA)>
This defines a list as an element type containing one or
more items (that's the plus sign); and it defines items
as element types containing just plain text (Parsed
Character Data or PCDATA). Validators read the DTD
before they read your document so that they can identify
where every element type ought to come and how each
relates to the other, so that applications which need to
know this in advance (most editors, search engines,
navigators, and databases) can set themselves up
correctly. The example above lets you create lists like:
(The indentation in the example is just for legibility
while editing: it is not required by XML.)
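To make the two declarations concrete, here is a hand-written Python check that mirrors them. It is emphatically not a validator: Python's standard library does not validate against DTDs, so the function fits_list_declarations simply restates (Item)+ and (#PCDATA) as code, and the sample list with its item text is invented for this illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical list; the item text is made up for the example.
SAMPLE = """<List>
  <Item>Chocolate</Item>
  <Item>Music</Item>
  <Item>Surfing</Item>
</List>"""


def fits_list_declarations(text):
    """Stand-in for the two declarations, written out by hand."""
    root = ET.fromstring(text)
    # <!ELEMENT List (Item)+> : a List holds one or more Items...
    if root.tag != "List" or len(root) == 0:
        return False
    # ...and nothing but white-space directly inside the List itself.
    if (root.text or "").strip():
        return False
    for child in root:
        # <!ELEMENT Item (#PCDATA)> : an Item holds text only.
        if child.tag != "Item" or len(child) != 0:
            return False
        if (child.tail or "").strip():
            return False
    return True


checks = {
    "sample": fits_list_declarations(SAMPLE),
    "empty_list": fits_list_declarations("<List/>"),
    "nested_markup": fits_list_declarations("<List><Item>a<b/></Item></List>"),
    "stray_text": fits_list_declarations("<List>stray<Item>a</Item></List>"),
}
```

A real validating parser derives such checks from the DTD itself, which is the point of having one.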
A DTD provides applications with advance notice of what
names and structures can be used in a particular
document type. Using a DTD and a validating editor means
you can be certain that all documents of that particular
type will be constructed and named in a consistent manner.
DTDs are not required for processing well-formed documents,
but they are needed if you want
to take advantage of XML's special attribute types like
the built-in ID/IDREF cross-reference mechanism; or the
use of default attribute values; or references to
external non-XML files (‘Notations’); or if you simply
want a check on document validity before processing.
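Default attribute values are a convenient case to try out. The Python sketch below relies on the expat parser behind xml.etree.ElementTree, which reads declarations in the internal DTD subset and supplies declared defaults without validating. The style attribute and its value are hypothetical, added to the List example purely for illustration.

```python
import xml.etree.ElementTree as ET

# The same document with and without a DTD. The style attribute is a
# hypothetical addition, not part of the declarations shown earlier.
WITH_DTD = """<!DOCTYPE List [
  <!ELEMENT List (Item)+>
  <!ELEMENT Item (#PCDATA)>
  <!ATTLIST List style CDATA "bulleted">
]>
<List><Item>Chocolate</Item></List>"""

WITHOUT_DTD = "<List><Item>Chocolate</Item></List>"

# With the declaration in scope, the parser reports the default value
# even though the start-tag does not mention the attribute.
default_style = ET.fromstring(WITH_DTD).get("style")

# Without a DTD there is nothing to supply the value.
missing_style = ET.fromstring(WITHOUT_DTD).get("style")
```

Both documents are well-formed and both parse; only the one carrying the declaration delivers the attribute to the application.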
There are thousands of DTDs already in existence in all
kinds of areas (see the SGML/XML Web pages for
pointers). Many of them can be downloaded and used
freely; or you can write your own (see the question on
creating your own DTD). Old SGML DTDs need to be
converted to XML for use with XML systems: read the
question on converting SGML DTDs to XML, but most
popular SGML DTDs are already available in XML form.
The alternatives to a DTD are various forms of Schema
(see question C.16). These provide more extensive validation
features than DTDs, including character data content