When should I use a CDATA Marked Section?
You should almost never need to use CDATA Sections. The
CDATA mechanism was designed to let an author quote
fragments of text containing markup characters (the
open-angle-bracket and the ampersand), for example when
documenting XML (this FAQ uses CDATA Sections quite a
lot, for obvious reasons). A CDATA Section turns off
markup recognition for the duration of the section (it
gets turned on again only by the closing sequence of
double end-square-brackets and a close-angle-bracket).
Consequently, nothing in a CDATA section can ever be
recognised as anything to do with markup: it's just a
string of opaque characters, and if you use an XML
transformation language like XSLT, any markup characters
in it will get turned into their character entity
references.
If you try, for example, to use:
some text with <![CDATA[<em>markup</em>]]> in it.
in the expectation that the embedded markup would remain
untouched, it won't: it will just output
some text with &lt;em&gt;markup&lt;/em&gt; in it.
In other words, CDATA Sections cannot preserve the
embedded markup as markup. Normally this is exactly what
you want because this technique was designed to let
people do things like write documentation about markup.
It was not designed to allow the passing of little
chunks of (possibly invalid) unparsed HTML embedded
inside your own XML through to a subsequent
process—because that would risk invalidating the output.
As a result you cannot expect to keep markup untouched
simply because it looked as if it was safely ‘hidden’
inside a CDATA section: it can't be used as a magic
shield to preserve HTML markup for future use as markup,
only as characters.
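The behaviour described above can be checked with any XML parser. The following is a minimal sketch using Python's standard xml.etree.ElementTree module (chosen here only for illustration; the FAQ itself does not prescribe a tool). The wrapping p element is an assumption added so that the fragment is a complete document.

```python
import xml.etree.ElementTree as ET

# The example sentence, wrapped in a <p> element (an assumption for this
# sketch) so that it forms a well-formed document.
source = "<p>some text with <![CDATA[<em>markup</em>]]> in it.</p>"
root = ET.fromstring(source)

# The parser hands back the CDATA content as plain characters...
print(root.text)

# ...and no <em> child element exists, because nothing inside the
# CDATA Section was recognised as markup.
print(len(root))

# On output, the markup characters are written as entity references.
serialised = ET.tostring(root, encoding="unicode")
print(serialised)
```

The point of the sketch is the middle check: the parsed element has no children, so there is no em element left for a later process to find.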
How can I handle embedded HTML in my XML?
Apart from using CDATA Sections, there are two common
occasions when people want to handle embedded HTML
inside an XML element:
1. when they have received (possibly poorly-designed)
XML from somewhere else which they must find a way to
handle;
2. when they have an application which has been
explicitly designed to store a string of characters
containing &lt; and &amp; character entity references
with the objective of turning them back into markup in a
later process (eg FreeMind, Atom).
Generally, you want to avoid this kind of trick, as it
usually indicates that the document structure and design
have been insufficiently thought out. However, there are
occasions when it becomes unavoidable, so if you really
need or want to use embedded HTML markup inside XML, and
have it processable later as markup, there are a couple
of techniques you may be able to use:
* Provide templates for the handling of that markup in
your XSLT transformation or whatever software you use,
which simply replicate what was there.
* Use XSLT's ‘deep copy’ instruction (xsl:copy-of), which
outputs nested well-formed markup verbatim.
* As a last resort, use the disable-output-escaping
attribute on the xsl:text element of XSL[T], which is
available in some processors.
* Some processors (eg JX) are now providing their own
equivalents for disabling output escaping. Their
proponents claim it is ‘highly desirable’ or ‘what most
people want’, but it still needs to be treated with care
to prevent unwanted (possibly dangerous) arbitrary code
from being passed untouched through your system. It also
adds another dependency to your software.
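The two situations above can be illustrated outside XSLT. The following is a rough Python analogue, not an XSLT implementation: copying nested well-formed markup verbatim (the idea behind a deep copy), and turning an escaped string back into markup by re-parsing it. The function name reparse_escaped and the div wrapper are hypothetical, introduced only for this sketch.

```python
import xml.etree.ElementTree as ET

def reparse_escaped(text):
    """Try to turn a string of escaped HTML back into elements.

    Returns None when the string is not well-formed, which is exactly
    the risk with passing unparsed HTML through an XML system.
    """
    try:
        # The <div> wrapper is an assumption: a fragment needs one root.
        return ET.fromstring("<div>" + text + "</div>")
    except ET.ParseError:
        return None

# Case 1: the embedded markup is real, well-formed markup, so it can
# simply be copied through as markup.
nested = ET.fromstring("<item><p>Some <em>real</em> markup</p></item>")
copied = ET.tostring(nested.find("p"), encoding="unicode")

# Case 2: the markup was stored as escaped characters and has to be
# parsed a second time before it becomes markup again.
entry = ET.fromstring("<content>Some &lt;em&gt;escaped&lt;/em&gt; markup</content>")
fragment = reparse_escaped(entry.text)

# Typical HTML that is not well-formed XML cannot be recovered this way.
bad = reparse_escaped("an unclosed <br> tag")
```

The failure in the last line is the reason for treating this kind of pass-through with care: the stored string carries no guarantee that it is well-formed, let alone safe.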
For more details of using these techniques in XSL[T],
see the relevant question in the XSL FAQ.
What are the special characters in XML?
For normal text (not markup), there are no special
characters: just make sure your document refers to the
correct encoding scheme for the language and/or writing
system you want to use, and that your computer correctly
stores the file using that encoding scheme. See the
question on non-Latin characters for a longer
explanation.
If your keyboard will not allow you to type the
characters you want, or if you want to use characters
outside the limits of the encoding scheme you have
chosen, you can use a symbolic notation called ‘entity
referencing’. Entity references can either be numeric,
using the decimal or hexadecimal Unicode code point for
the character (eg if your keyboard has no Euro symbol
(€) you can type &#8364; or &#x20AC;); or they can be
character entity references, using an established name
which you declare in your DTD (eg
<!ENTITY euro "&#8364;">) and then use as &euro; in your
document. If you are using a Schema, you must use the
numeric form for all except the five below because
Schemas have no way to make character entity
declarations.
If you use XML with no DTD, then these five character
entities are assumed to be predeclared, and you can use
them without declaring them:
* &lt; The less-than character (<) starts element markup
(the first character of a start-tag or an end-tag).
* &amp; The ampersand character (&) starts entity markup
(the first character of a character entity reference).
* &gt; The greater-than character (>) ends a start-tag
or an end-tag.
* &quot; The double-quote character (") can be
symbolised with this character entity reference when you
need to embed a double-quote inside a string which is
already double-quoted.
* &apos; The apostrophe or single-quote character (') can
be symbolised with this character entity reference when
you need to embed a single-quote or apostrophe inside a
string which is already single-quoted.
If you are using a DTD then you must declare all the
other character entities you need to use (if any). The
five above are still recognised by every XML processor
whether or not they are declared, although the XML
Specification recommends declaring them in a DTD for
interoperability. If you are using a Schema, you must use
the numeric form for all except the five above because
Schemas have no way to make character entity
declarations.
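As an illustration of the three kinds of reference discussed here, the following sketch uses Python's standard library to resolve numeric references, the five predeclared entities, and a name declared in an internal DTD subset. The entity name euro and the p wrapper element are assumptions chosen for the example.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Numeric references (decimal and hexadecimal) and the five predeclared
# entities need no DTD at all.
doc = ET.fromstring("<p>&#8364; &#x20AC; &lt; &amp; &gt; &quot; &apos;</p>")
decoded = doc.text
print(decoded)

# A named character entity must be declared before use; here it is
# declared in an internal DTD subset (the name "euro" is an assumption).
named = ET.fromstring(
    '<!DOCTYPE p [<!ENTITY euro "&#8364;">]><p>&euro;</p>'
)
print(named.text)

# Going the other way: escape() replaces &, < and > when writing text.
# The extra mapping adds the double-quote for use inside a
# double-quoted attribute value.
escaped_text = escape("AT&T <tag>")
escaped_attr = escape('say "hi"', {'"': "&quot;"})
```

Using an undeclared name such as &euro; without the DOCTYPE line would make the parser reject the document, which is why the numeric form is the safe choice when no DTD is available.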
Do I have to change any of my server software to work
with XML?
The only changes needed are to make sure your server
serves up .xml, .css, .dtd, .xsl, and whatever other
file types you will use as the correct MIME content
types.
The details of the settings are specified in RFC 3023
(since obsoleted by RFC 7303).
Most new versions of Web server software come preset.
If not, all that is needed is to edit the mime-types
file (or its equivalent: as a server operator you
already know where to do this, right?) and add or edit
the relevant lines for the right media types. In some
servers (eg Apache), individual content providers or
directory owners may also be able to change the MIME
types for specific file types from within their own
directories by using directives in a .htaccess file. The
media types required are:
* text/xml for XML documents which are ‘readable by
casual users’;
* application/xml for XML documents which are
‘unreadable by casual users’;
* text/xml-external-parsed-entity for external parsed
entities such as document fragments (eg separate
chapters which make up a book) subject to the
readability distinction of text/xml;
* application/xml-external-parsed-entity for external
parsed entities subject to the readability distinction
of application/xml;
* application/xml-dtd for DTD files and modules,
including character entity sets.
The RFC has further suggestions for the use of the +xml
media type suffix for identifying ancillary files such
as XSLT (application/xslt+xml).
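For anyone serving files from their own code rather than from a configured Web server, the mapping can be sketched with Python's standard mimetypes module. The table XML_MEDIA_TYPES and both function names are hypothetical, and the choice of application/xml rather than text/xml for .xml files is an assumption; pick whichever fits the readability distinction above.

```python
import mimetypes

# Extension-to-media-type table for the file types mentioned above.
# The .xml entry is an assumption: text/xml is the alternative.
XML_MEDIA_TYPES = {
    ".xml": "application/xml",
    ".dtd": "application/xml-dtd",
    ".xsl": "application/xslt+xml",
    ".css": "text/css",
}

def build_type_map():
    """Return a MimeTypes database with the XML-related types registered."""
    types = mimetypes.MimeTypes()
    for extension, media_type in XML_MEDIA_TYPES.items():
        # add_type overrides any existing entry for the extension.
        types.add_type(media_type, extension)
    return types

def content_type_for(filename, types):
    """Pick the Content-Type value for a file, with a generic fallback."""
    guessed, _encoding = types.guess_type(filename)
    return guessed or "application/octet-stream"

types = build_type_map()
print(content_type_for("book.dtd", types))
print(content_type_for("style.xsl", types))
```

Registering the types explicitly avoids depending on whatever the host system's own mime-types file happens to contain.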
If you run scripts generating XHTML which you wish to be
treated as XML rather than HTML, they may need to be
modified to produce the relevant Document Type
Declaration as well as the right media type if your
application requires them to be validated.
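A script of the kind described might look like the following sketch. It assumes XHTML 1.0 Strict as the document type and application/xhtml+xml as the media type; the function name render_xhtml and the sample content are hypothetical, and a real application would send the header through whatever mechanism its server framework provides.

```python
import xml.etree.ElementTree as ET

XHTML_MEDIA_TYPE = "application/xhtml+xml"

# Document Type Declaration for XHTML 1.0 Strict (an assumption: use
# the declaration for whichever XHTML version you actually produce).
XHTML_DOCTYPE = (
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" '
    '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">'
)

def render_xhtml(title, paragraph):
    """Build a small XHTML page and the header that should accompany it."""
    html = ET.Element("html", {"xmlns": "http://www.w3.org/1999/xhtml"})
    head = ET.SubElement(html, "head")
    ET.SubElement(head, "title").text = title
    body = ET.SubElement(html, "body")
    ET.SubElement(body, "p").text = paragraph

    # Serialising through the XML library escapes markup characters.
    markup = ET.tostring(html, encoding="unicode")
    document = (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        + XHTML_DOCTYPE + "\n" + markup
    )
    headers = {"Content-Type": XHTML_MEDIA_TYPE + "; charset=utf-8"}
    return headers, document

headers, document = render_xhtml("Demo", "Fish & chips")
print(headers["Content-Type"])
print(document)
```

Building the page with an XML library rather than by string concatenation means the output stays well-formed even when the content contains an ampersand or an angle bracket.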