XML

Submitted by Alan on Sat, 04/25/2009 - 20:16

Extensible Markup Language (XML) is a human-readable, machine-understandable, general syntax for describing hierarchical data, applicable to a wide range of applications (databases, e-commerce, Java, web development, searching, etc.). XML provides a flexible way to create common information formats and share both the format and the data on the WWW, intranets, and elsewhere. Custom tags enable the definition, transmission, validation, and interpretation of data between applications and between organizations.

Imagine that widget makers agree on a standard way to describe the information about a widget product (shape, color, weight, size, etc.), and then describe that product information format with XML. Such a standard way of describing data would enable a user to send an intelligent agent (a program) to each widget maker's Web site, gather data, and then make a valid comparison. XML can be used by any individual, group, or organisation that wants to share information in a consistent way.
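As a rough sketch of the widget scenario, the Python below compares two product descriptions that share one format. The element names (widget, name, weight) and the two makers' data are invented for illustration; no real widget standard is implied.

```python
import xml.etree.ElementTree as ET

# Hypothetical product descriptions from two widget makers using the same format.
ACME = '<widget><name>Sprocket</name><color>red</color><weight unit="g">120</weight></widget>'
BOLT = '<widget><name>Cog</name><color>blue</color><weight unit="g">95</weight></widget>'

def name_and_weight(xml_text):
    """Return (name, weight in grams) from one widget description."""
    widget = ET.fromstring(xml_text)
    return widget.findtext("name"), float(widget.findtext("weight"))

# The "agent" gathers every description and picks the lightest product.
lightest = min((name_and_weight(doc) for doc in (ACME, BOLT)), key=lambda item: item[1])
print(lightest)
```

Because both makers use the same element names, the comparison needs no maker-specific code.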

Currently a formal recommendation from the W3C, XML is similar to the language of today's Web pages, HTML. Both XML and HTML contain markup symbols to describe the contents of a page or file. HTML, however, describes the content of a Web page (mainly text and graphic images) primarily in terms of how it is to be displayed and interacted with. For example, a <p> starts a new paragraph. XML describes the content in terms of what data is being described. For example, an <email> tag could indicate that the data that follows it is an email address. This means that an XML file can be processed purely as data by a program, stored with similar data on another computer, or, like an HTML file, displayed. Depending on how the application on the receiving computer wants to handle the email address, it could be stored, displayed, or used to send a message.

XML is not a single, predefined markup language: it is a language for describing other languages, which lets you design your own markup. A predefined markup language like HTML defines a way to describe information in one specific class of documents; XML lets you define your own customized markup languages for different classes of document.

XML is a simplified subset of SGML, the Standard Generalised Markup Language. SGML is a language for describing markup languages, particularly those used in electronic document exchange, document management, and document publishing. HTML is a language defined in SGML. SGML is an international standard for the definition of device-independent, system-independent methods of representing texts in electronic form. More precisely, SGML is a metalanguage, i.e. a method of formally describing a language, in this case, a markup language.

XML was conceived as a means of regaining the power and flexibility of SGML without most of its complexity. Although a restricted form of SGML, XML preserves most of SGML's power and richness and retains all of its commonly used features, while removing many of the more complex features that make the authoring and design of suitable software both difficult and costly. The purpose of XML is to provide an easy-to-use subset of SGML in which custom tags can be processed. Custom tags enable the definition, transmission, and interpretation of data structures between cooperating processes.

XML is extensible because it is a metalanguage; the markup symbols are unlimited and self-defining. Like SGML, XML enables one to write a Document Type Definition (DTD) that defines the rules of the language so the document can be interpreted by the document receiver. It is expected that HTML and XML will be used together in many Web applications.

XML is a system for defining, validating, and sharing document formats. XML uses tags (for example, <em>emphasis</em> for emphasis) to distinguish document structures, and attributes (for example, in <A HREF="http://www.xml.com/">, HREF is the attribute name and http://www.xml.com/ is the attribute value) to encode extra document information.
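To make the tag and attribute vocabulary concrete, here is a small sketch that reads the link element from the paragraph above with Python's standard xml.etree.ElementTree module. The link text "XML.com" is an assumption added so that the element has some content.

```python
import xml.etree.ElementTree as ET

# The link text "XML.com" is illustrative; the tag and attribute come from the text.
link = ET.fromstring('<A HREF="http://www.xml.com/">XML.com</A>')

tag_name = link.tag                 # the element name
href = link.get("HREF")             # the attribute value
lowercase_href = link.get("href")   # None: names are case-sensitive in XML
content = link.text                 # the text between start-tag and end-tag

print(tag_name, href, lowercase_href, content)
```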

An XML processor can read clean HTML, provided that the HTML is also well-formed in the XML sense (as XHTML is), and with a few small changes an HTML browser like Netscape Navigator or Microsoft Internet Explorer would be able to read XML. The biggest difference between XML and HTML is that in XML you can define your own tags for your own purposes and, if you want, share those tags with other users.

The concept of a well-formed document is something that is really new in XML. A document that is well-formed is easy for a computer program to read, and ready for network delivery. Specifically, in a well-formed document:

  • All the begin-tags and end-tags match up
  • Empty tags use the special XML syntax (e.g. <empty/>)
  • All the attribute values are nicely quoted (e.g. <a href="http://www.textuality.com/xml.html">)
  • All the entities are declared (entities are re-usable chunks of data, much like macros, part of XML's inheritance from SGML).
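The rules above can be tried out with any XML parser. The sketch below uses Python's standard xml.etree.ElementTree, which reports a ParseError for input that is not well-formed; the helper name is_well_formed and the sample strings are made up for this illustration.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True

checks = {
    "matched tags": is_well_formed("<a><b></b></a>"),
    "crossed tags": is_well_formed("<a><b></a></b>"),
    "empty-tag syntax": is_well_formed("<empty/>"),
    "unclosed tag": is_well_formed("<br>"),
    "quoted attribute": is_well_formed('<a href="http://www.textuality.com/xml.html"/>'),
    "unquoted attribute": is_well_formed("<a href=xml.html/>"),
    "undeclared entity": is_well_formed("<a>&undeclared;</a>"),
}

for label, ok in checks.items():
    print(label, ok)
```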

A valid document must have a document type declaration, which contains or points to a grammar (the DTD): a set of rules that define what tags can appear in the document and how they must nest within each other. The document type declaration is also used to declare entities, re-usable chunks of text that can appear many times but only have to be transmitted once. A document is valid when it conforms to the rules in the document type declaration. Validity is useful because an XML-savvy editor can use the type declaration to help (and in fact require) users to create documents that are valid; such documents are much easier to use and (especially) re-use than those which can contain any old set of tags in any old order.
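As an illustration of an entity declared in the document type declaration, the sketch below declares one entity and uses it twice. The document, the entity name company, and its value are invented. Note that Python's standard xml.etree.ElementTree parser expands internal entities but does not validate, so this shows entity re-use only, not validity checking.

```python
import xml.etree.ElementTree as ET

# Hypothetical document with an internal DTD subset declaring one entity.
DOC = """<!DOCTYPE letter [
  <!ELEMENT letter (#PCDATA)>
  <!ENTITY company "Widget Works Ltd">
]>
<letter>Thank you for choosing &company;. &company; values your custom.</letter>"""

letter = ET.fromstring(DOC)
print(letter.text)
```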

XHTML is a markup language written in XML: an "XML application", that is, a specific document type of the meta-language XML.

XML namespaces

Namespaces provide a simple method for qualifying element and attribute names used in XML documents by associating them with namespaces identified by URI references. This allows element names from different documents to be combined in one document without confusion in cases where names happen to be the same. Qualified names prevent potential conflicts between identically named XML elements: each name carries a prefix that is associated with a URI identifying the intended namespace. XHTML 1.0 uses a single XML namespace, http://www.w3.org/1999/xhtml, and defines three DTDs corresponding to the three HTML 4 DTDs: Strict, Transitional, and Frameset.
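A minimal sketch of two identically named elements kept apart by namespaces, again using Python's xml.etree.ElementTree. The furniture namespace URI, the prefixes, and the element names are assumptions made up for the example; only the XHTML namespace URI is real.

```python
import xml.etree.ElementTree as ET

# Two different "table" elements in one document, distinguished by namespace.
doc = ET.fromstring(
    '<order xmlns:h="http://www.w3.org/1999/xhtml"'
    ' xmlns:f="http://example.com/furniture">'
    "<h:table><h:tr><h:td>cell</h:td></h:tr></h:table>"
    "<f:table><f:legs>4</f:legs></f:table>"
    "</order>"
)

# ElementTree expands each prefix to its URI, written as {uri}localname.
tags = [child.tag for child in doc]

# Prefixes used in a search are local to the query; only the URI matters.
ns = {"furn": "http://example.com/furniture"}
legs = doc.findtext("furn:table/furn:legs", namespaces=ns)

print(tags, legs)
```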


XML has been developed by a working group under the auspices of the W3C. Another good source is Robin Cover's The SGML/XML Web Page.

IBM provides an XML Developer Web site that includes free online XML courses, articles, and frequently-asked questions. Here is IBM's Introduction to XML, an online tutorial.

Microsoft provides XML info.

Basic XML

A Very Simple Example

In Microsoft Internet Explorer 5 or later, the example should look like a normal web page. In other web browsers, you will probably just see the XML markup. There is a CSS style sheet which IE5 uses to render the document.

XML documents may, and should, begin with an XML declaration which specifies the version of XML being used.

An XML document must contain one or more elements. There is exactly one element, called the root, or document element, no part of which appears in the content of any other element.
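The single-root rule can be seen in a short sketch with Python's xml.etree.ElementTree. The element names library and book are hypothetical.

```python
import xml.etree.ElementTree as ET

# One root element containing two children, preceded by an XML declaration.
good = '<?xml version="1.0"?><library><book/><book/></library>'
# Two top-level elements: there is no single root.
bad = '<?xml version="1.0"?><book/><book/>'

root = ET.fromstring(good)

try:
    ET.fromstring(bad)
    second_root_error = None
except ET.ParseError as err:
    second_root_error = str(err)

print(root.tag, len(root), second_root_error)
```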

Names are case-sensitive. Element names can contain letters, digits, hyphens, underscores, colons, or full stops, but must begin with a letter or an underscore. A colon may be used only in the special case where it separates a so-called namespace prefix from the rest of the name. Element names starting with xml, XML, or any other combination of cases of this string are reserved for the standard.
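A few naming rules can be checked by handing sample names to a parser. This sketch uses Python's xml.etree.ElementTree; the helper parses and the sample names are invented, and the third case relies on the XML specification's rule that a name must not begin with a digit.

```python
import xml.etree.ElementTree as ET

def parses(text):
    """Return True if the parser accepts the text."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True

results = {
    # Letters, digits, hyphen, full stop, and underscore inside a name.
    "<part-no.1_a/>": parses("<part-no.1_a/>"),
    # Case matters: Part and part are different names, so the tags do not match.
    "<Part></part>": parses("<Part></part>"),
    # A name must not start with a digit.
    "<1part/>": parses("<1part/>"),
}

for sample, ok in results.items():
    print(sample, ok)
```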

An element can have no, one, or several attributes. The permitted characters in attribute names are the same as for element names. The name of an attribute is separated from its value by =. The attribute value must be enclosed in apostrophes '...' or double quotes "...". If an apostrophe or double quote is used in the attribute value, the opposite delimiter must be used, or the character must be written as &apos; or &quot;.
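When attribute values are generated by a program, choosing the delimiter by hand is error-prone. As one possible approach, Python's standard xml.sax.saxutils.quoteattr picks a suitable quote character for a value. The element name note and attribute name title are hypothetical.

```python
from xml.sax.saxutils import quoteattr
import xml.etree.ElementTree as ET

with_apostrophe = quoteattr("5 o'clock")   # wrapped in double quotes
with_quotes = quoteattr('say "hi"')        # wrapped in apostrophes

# Build an element with the second value and read it back.
element = ET.fromstring("<note title=%s/>" % with_quotes)
round_trip = element.get("title")

print(with_apostrophe, with_quotes, round_trip)
```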

The name in an element's end-tag must match the element type in the start-tag. If the start-tag is in the content of another element, the end-tag is in the content of the same element. I.e., the elements, delimited by start- and end-tags, nest properly within each other.

The text between the start-tag and end-tag is called the element's content. An element without content can take a special form: <name/>. The slash before > takes the place of the end-tag.
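A quick sketch, using Python's xml.etree.ElementTree, suggesting that the short form and the start-tag/end-tag pair describe the same empty element:

```python
import xml.etree.ElementTree as ET

short_form = ET.fromstring("<name/>")
long_form = ET.fromstring("<name></name>")

# Both parse to an element with the same name, no text, and no children.
same = (
    short_form.tag == long_form.tag
    and short_form.text == long_form.text
    and len(short_form) == len(long_form) == 0
)

# Serialising either one gives identical output.
same_output = ET.tostring(short_form) == ET.tostring(long_form)

print(same, same_output)
```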

The characters < and & cannot be used in text because they are used in markup. If these characters are needed, &lt; must be used instead of < and &amp; instead of &. The characters >, ", and ' can also be replaced by &gt;, &quot;, and &apos;, respectively.
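The substitutions above do not have to be done by hand. As a sketch, Python's standard xml.sax.saxutils.escape replaces &, <, and > and accepts an extra mapping for the two quote characters; the sample string and the element name code are made up.

```python
from xml.sax.saxutils import escape, unescape
import xml.etree.ElementTree as ET

raw = 'if a < b && b > c: print("ok")'

# escape() handles &, < and > itself; the quote characters are passed as extras.
escaped = escape(raw, {'"': "&quot;", "'": "&apos;"})

# A parser turns the references back into the original characters.
parsed_back = ET.fromstring("<code>" + escaped + "</code>").text

# unescape() reverses the substitution without a parser.
unescaped = unescape(escaped, {"&quot;": '"', "&apos;": "'"})

print(escaped)
```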

Comments may appear anywhere in a document outside other markup. An XML processor may, but need not, make it possible for an application to retrieve the text of comments. The string "--" (double-hyphen) must not occur within comments.
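Python's xml.etree.ElementTree illustrates the "may, but need not" wording: by default it discards comments, and it keeps them only when asked. The sketch assumes Python 3.8 or later, where TreeBuilder accepts insert_comments; the memo document is invented.

```python
import xml.etree.ElementTree as ET

DOC = "<memo><!-- draft only --><body>Hi</body></memo>"

# Default parser: the comment is dropped, leaving one child element.
plain = ET.fromstring(DOC)

# Opt in to keeping comments (TreeBuilder option, Python 3.8+).
keeping = ET.XMLParser(target=ET.TreeBuilder(insert_comments=True))
kept = ET.fromstring(DOC, parser=keeping)
comment_text = kept[0].text if kept[0].tag is ET.Comment else None

def double_hyphen_rejected():
    """Return True if a comment containing '--' is refused by the parser."""
    try:
        ET.fromstring("<memo><!-- a -- b --></memo>")
    except ET.ParseError:
        return True
    return False

print(len(plain), len(kept), comment_text, double_hyphen_rejected())
```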

CDATA sections are used to escape blocks of text containing characters which would otherwise be recognized as markup. CDATA sections begin with the string "<![CDATA[" and end with the string "]]>". The string "]]>" must not occur inside a CDATA section.
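A small sketch of a CDATA section read with Python's xml.etree.ElementTree: the markup-like characters inside the section come through as plain text. The element names example and greeting are hypothetical.

```python
import xml.etree.ElementTree as ET

# Inside the CDATA section, < and & are ordinary characters, not markup.
doc = "<example><![CDATA[<greeting>Hello & welcome</greeting>]]></example>"

example = ET.fromstring(doc)
text = example.text
child_count = len(example)   # 0: <greeting> was never parsed as an element

print(text, child_count)
```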

Processing instructions (PIs) allow documents to contain instructions for applications.
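One common processing instruction is xml-stylesheet, which tells a browser which style sheet to apply. The sketch below reads such an instruction with Python's standard xml.dom.minidom; the file name style.css and the element name page are assumptions for illustration.

```python
from xml.dom import minidom

# A processing instruction has a target (the application it addresses) and data.
doc = minidom.parseString(
    '<?xml-stylesheet type="text/css" href="style.css"?><page/>'
)

pi = doc.firstChild
is_pi = pi.nodeType == pi.PROCESSING_INSTRUCTION_NODE
target = pi.target
data = pi.data

print(is_pi, target, data)
```

The parser passes the data through untouched; interpreting it is left to the application named by the target.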
