
Parsing with a DTD

After the XML declaration, the document prolog can include a DTD, or reference an external DTD, or both. In this section, you'll see the effect of the DTD on the data that the parser delivers to your application.

DTD's Effect on the Nonvalidating Parser

In this section, you'll use the Echo program to see how the data appears to the SAX parser when the data file references a DTD.


Note: The XML file used in this section is slideSample05.xml, which references slideshow1a.dtd, as described in Parsing with a DTD. The output is shown in Echo07-05.txt. (The browsable versions are slideshow1a-dtd.html, slideSample05-xml.html, and Echo07-05.html.)


Running the Echo program on your latest version of slideSample.xml shows that many of the superfluous calls to the characters method have now disappeared.

Where before you saw:

  ...
>
PROCESS: ...
CHARS:
  ELEMENT:        <slide
    ATTR: ...
  >
      ELEMENT:        <title>
      CHARS:        Wake up to ...
      END_ELM:        </title>
  END_ELM:        </slide>
CHARS:
  ELEMENT:        <slide
    ATTR: ...
  >
  ... 

Now you see:

  ...
>
PROCESS: ...
  ELEMENT:        <slide
    ATTR: ...
  >
      ELEMENT:        <title>
      CHARS:        Wake up to ...
      END_ELM:        </title>
  END_ELM:        </slide>
  ELEMENT:        <slide
    ATTR: ...
  >
  ... 

The whitespace characters that were formerly echoed around the slide elements are no longer being delivered to the characters method, because the DTD declares that slideshow consists solely of slide elements:

  <!ELEMENT slideshow (slide+)> 
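To see the effect in isolation, here is a minimal sketch that parses the same small document twice, once without and once with a DTD, and counts how much whitespace reaches each callback. The class name WhitespaceCount, the cut-down document, and the internal DTD subset (standing in for slideshow1a.dtd) are assumptions made for illustration; they are not part of the Echo program. The behavior shown is what the parser bundled with the JDK is expected to do, and other parsers may differ.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class WhitespaceCount {
    // Hypothetical, cut-down slide show; not the tutorial's slideSample05.xml.
    static final String BODY =
        "<slideshow>\n" +
        "  <slide>\n" +
        "    <title>Wake up to WonderWidgets!</title>\n" +
        "  </slide>\n" +
        "</slideshow>\n";

    // Internal DTD subset used here in place of an external DTD file.
    static final String DTD =
        "<!DOCTYPE slideshow [\n" +
        "  <!ELEMENT slideshow (slide+)>\n" +
        "  <!ELEMENT slide (title)>\n" +
        "  <!ELEMENT title (#PCDATA)>\n" +
        "]>\n";

    /**
     * Returns a two-element array:
     * [0] = whitespace-only characters passed to characters(),
     * [1] = characters passed to ignorableWhitespace().
     */
    static int[] count(String xml) {
        final int[] counts = new int[2];
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] buf, int offset, int len) {
                // Count only chunks that consist entirely of whitespace.
                if (new String(buf, offset, len).trim().isEmpty()) {
                    counts[0] += len;
                }
            }

            @Override
            public void ignorableWhitespace(char[] buf, int offset, int len) {
                counts[1] += len;
            }
        };
        try {
            // Default factory settings: a nonvalidating parser.
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
        return counts;
    }

    public static void main(String[] args) {
        int[] noDtd = count(BODY);
        int[] withDtd = count(DTD + BODY);
        System.out.println("no DTD:   characters=" + noDtd[0] + " ignorable=" + noDtd[1]);
        System.out.println("with DTD: characters=" + withDtd[0] + " ignorable=" + withDtd[1]);
    }
}
```

Without the DTD, the parser has no way to know that the indentation is insignificant, so it all arrives through characters. With the DTD, the same whitespace should move to the ignorableWhitespace count.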

Tracking Ignorable Whitespace

Now that the DTD is present, the parser is no longer calling the characters method with whitespace that it knows to be irrelevant. From the standpoint of an application that is only interested in processing the XML data, that is great. The application is never bothered with whitespace that exists purely to make the XML file readable.

On the other hand, if you were writing an application that was filtering an XML data file, and you wanted to output an equally readable version of the file, then that whitespace would no longer be irrelevant--it would be essential. To get those characters, you need to add the ignorableWhitespace method to your application. You'll do that next.


Note: The code written in this section is contained in Echo08.java. The output is in Echo08-05.txt. (The browsable version is Echo08-05.html.)


To process the (generally) ignorable whitespace that the parser is seeing, add the ignorableWhitespace event handler shown below to your version of the Echo program, between the characters and processingInstruction methods:

public void characters (char buf[], int offset, int len)
... 
} 
public void ignorableWhitespace(char buf[], int offset, int len)
throws SAXException
{
  nl(); 
  emit("IGNORABLE");
} 
public void processingInstruction(String target, String data)
... 

This code simply generates a message to let you know that ignorable whitespace was seen.


Note: Again, not all parsers are created equal. The SAX specification does not require this method to be invoked. The Java XML implementation does so whenever the DTD makes it possible.


When you run the Echo application now, your output looks like this:

ELEMENT: <slideshow
  ATTR: ...
>
IGNORABLE
IGNORABLE
PROCESS: ...
IGNORABLE
IGNORABLE
  ELEMENT: <slide
    ATTR: ...
  >
  IGNORABLE
    ELEMENT: <title>
    CHARS:   Wake up to ...
    END_ELM: </title>
  IGNORABLE
  END_ELM: </slide>
IGNORABLE
IGNORABLE
  ELEMENT: <slide
    ATTR: ...
  >
  ... 

Here, it is apparent that the ignorableWhitespace method is being invoked before and after comments and slide elements, in the places where the characters method was being invoked before there was a DTD.
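For the filtering scenario mentioned earlier, where the output should stay as readable as the input, a handler would copy the whitespace through rather than just announce it. The following is a minimal sketch under that assumption. The class name WhitespacePreservingEcho and its echo method are hypothetical, and the sketch drops attributes and does not re-escape special characters, so it is not a complete XML writer.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class WhitespacePreservingEcho extends DefaultHandler {
    // Hypothetical sample input with an internal DTD subset.
    static final String SAMPLE =
        "<!DOCTYPE slideshow [\n" +
        "  <!ELEMENT slideshow (slide+)>\n" +
        "  <!ELEMENT slide (title)>\n" +
        "  <!ELEMENT title (#PCDATA)>\n" +
        "]>\n" +
        "<slideshow>\n" +
        "  <slide>\n" +
        "    <title>Wake up to WonderWidgets!</title>\n" +
        "  </slide>\n" +
        "</slideshow>";

    final StringBuilder out = new StringBuilder();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        out.append('<').append(qName).append('>');   // attributes omitted in this sketch
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        out.append("</").append(qName).append('>');
    }

    @Override
    public void characters(char[] buf, int offset, int len) {
        out.append(buf, offset, len);
    }

    @Override
    public void ignorableWhitespace(char[] buf, int offset, int len) {
        // Treat ignorable whitespace exactly like ordinary text,
        // so the original indentation survives in the output.
        out.append(buf, offset, len);
    }

    /** Parses the given XML and returns the echoed elements and text. */
    static String echo(String xml) {
        WhitespacePreservingEcho handler = new WhitespacePreservingEcho();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
        return handler.out.toString();
    }

    public static void main(String[] args) {
        System.out.println(echo(SAMPLE));
    }
}
```

Because both callbacks append to the same buffer, the echoed layout does not depend on which of the two methods a particular parser chooses for a given run of whitespace.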

Cleanup

Now that you have seen ignorable whitespace echoed, remove that code from your version of the Echo program--you won't be needing it any more in the exercises ahead.


Note: That change has been made in Echo09.java.


Empty Elements, Revisited

Now that you understand how certain instances of whitespace can be ignorable, it is time to revise the definition of an "empty" element. That definition can now be expanded to include

  <foo>   </foo> 

where there is whitespace between the tags and the DTD says that the whitespace is ignorable.
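One way an application could apply that expanded definition is sketched below. The handler treats an element as empty when it receives no characters call and no child element between the start and end tags, and it deliberately ignores ignorableWhitespace. The class name EmptyElementFinder and the DTD, which gives foo element content through a made-up bar child, are assumptions for illustration. The result in the DTD case relies on the parser reporting the whitespace as ignorable, which the JDK's bundled parser is expected to do.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class EmptyElementFinder extends DefaultHandler {
    // One flag per open element: has it received any real content yet?
    private final Deque<Boolean> sawContent = new ArrayDeque<>();
    final List<String> emptyElements = new ArrayList<>();

    // Marks the innermost open element as having content.
    private void markCurrent() {
        if (!sawContent.isEmpty()) {
            sawContent.pop();
            sawContent.push(Boolean.TRUE);
        }
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        markCurrent();                   // a child element counts as content of its parent
        sawContent.push(Boolean.FALSE);  // the new element starts out empty
    }

    @Override
    public void characters(char[] buf, int offset, int len) {
        if (len > 0) {
            markCurrent();
        }
    }

    // ignorableWhitespace is not overridden: DefaultHandler discards it,
    // so ignorable whitespace never counts as content.

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (!sawContent.pop()) {
            emptyElements.add(qName);
        }
    }

    /** Returns the names of the elements in the document that count as empty. */
    static List<String> findEmpty(String xml) {
        EmptyElementFinder handler = new EmptyElementFinder();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
        return handler.emptyElements;
    }

    public static void main(String[] args) {
        String doc = "<foo>   </foo>";
        // Hypothetical DTD: foo has element content, so whitespace in it is ignorable.
        String dtd = "<!DOCTYPE foo [<!ELEMENT foo (bar*)><!ELEMENT bar EMPTY>]>";
        System.out.println("without DTD: " + findEmpty(doc));
        System.out.println("with DTD:    " + findEmpty(dtd + doc));
    }
}
```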

Echoing Entity References

When you wrote slideSample06.xml, you defined entities for the product name. Now it's time to see how they're echoed when you process them with the SAX parser.


Note: The XML used here is contained in slideSample06.xml, which references slideshow1b.dtd, as described in Defining Attributes and Entities in the DTD. The output is shown in Echo09-06.txt. (The browsable versions are slideSample06-xml.html, slideshow1b-dtd.html and Echo09-06.html.)


When you run the Echo program on slideSample06.xml, here is the kind of thing you see:

ELEMENT:        <title>
CHARS:        Wake up to WonderWidgets!
END_ELM:        </title> 

Note that the product name has been substituted for the entity reference.
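The substitution can be reproduced with a few lines of stand-alone code. This is a minimal sketch: the class name EntityEcho, the entity name product, and the inline DTD are assumptions for illustration and are not copied from slideSample06.xml or slideshow1b.dtd.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class EntityEcho {
    // Hypothetical document that declares and references an internal entity.
    static final String XML =
        "<!DOCTYPE title [\n" +
        "  <!ENTITY product \"WonderWidgets\">\n" +
        "]>\n" +
        "<title>Wake up to &product;!</title>";

    /** Returns all text that the parser delivers through characters(). */
    static String text(String xml) {
        final StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] buf, int offset, int len) {
                // The parser may split the text into several chunks
                // (for example at entity boundaries), so accumulate them.
                text.append(buf, offset, len);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        System.out.println(text(XML));
    }
}
```

The handler never sees the reference itself. By the time characters is called, the parser has already replaced the reference with the entity's value.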

Echoing the External Entity

In slideSample07.xml, you defined an external entity to reference a copyright file.


Note: The XML used here is contained in slideSample07.xml and in copyright.xml. The output is shown in Echo09-07.txt. (The browsable versions are slideSample07-xml.html, copyright-xml.html and Echo09-07.html.)


When you run the Echo program on that version of the slide presentation, here is what you see:

...
END_ELM: </slide>
ELEMENT: <slide
  ATTR: type        "all"
>
  ELEMENT: <item>
  CHARS: 
This is the standard copyright message that our lawyers
make us put everywhere so we don't have to shell out a
million bucks every time someone spills hot coffee in their
lap...
  END_ELM: </item>
END_ELM: </slide>
... 

Note that the newline which follows the comment in the file is echoed as a character, but that the comment itself is ignored. That is the reason that the copyright message appears to start on the next line after the CHARS: label, instead of immediately after the label--the first character echoed is actually the newline that follows the comment.
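The same behavior can be sketched without any files on disk by supplying the external entity's content from an entity resolver. Everything in this sketch is an assumption for illustration: the class name ExternalEntityEcho, the shortened copyright text, and the use of resolveEntity to stand in for a real copyright.xml file.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class ExternalEntityEcho extends DefaultHandler {
    // Hypothetical document that references an external parsed entity.
    static final String XML =
        "<!DOCTYPE item [\n" +
        "  <!ENTITY copyright SYSTEM \"copyright.xml\">\n" +
        "]>\n" +
        "<item>&copyright;</item>";

    // Stand-in for the content of copyright.xml: a comment, a newline, then text.
    static final String COPYRIGHT =
        "<!-- A sample copyright notice -->\n" +
        "This is the standard copyright message.";

    final StringBuilder text = new StringBuilder();

    @Override
    public InputSource resolveEntity(String publicId, String systemId) {
        // Serve the entity from memory instead of reading a file.
        if (systemId != null && systemId.endsWith("copyright.xml")) {
            return new InputSource(new StringReader(COPYRIGHT));
        }
        return null;  // fall back to the parser's default resolution
    }

    @Override
    public void characters(char[] buf, int offset, int len) {
        text.append(buf, offset, len);
    }

    /** Parses the given XML and returns the text delivered through characters(). */
    static String text(String xml) {
        ExternalEntityEcho handler = new ExternalEntityEcho();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
        return handler.text.toString();
    }

    public static void main(String[] args) {
        // Show the newline explicitly so that it is visible in the output.
        System.out.println("CHARS: " + text(XML).replace("\n", "\\n"));
    }
}
```

The comment inside the entity never reaches characters, but the newline after it does, which is why the echoed text begins with a line break.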

Summarizing Entities

An entity that is referenced in the document content, whether internal or external, is termed a general entity. An entity that contains DTD specifications and is referenced from within the DTD is termed a parameter entity. (More on that later.)

An entity which contains XML (text and markup), and which is therefore parsed, is known as a parsed entity. An entity which contains binary data (like images) is known as an unparsed entity. (By its very nature, it must be external.) We'll be discussing references to unparsed entities in the next section of this tutorial.
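These categories can be observed from the declarations themselves. The sketch below goes beyond the Echo program: it assumes the SAX2 extension interface DeclHandler (through DefaultHandler2) and the DTDHandler callback unparsedEntityDecl, and it uses a made-up DTD. The class name EntityKinds, the entity names, and the notation are all hypothetical. DeclHandler is an optional SAX extension, so a given parser may not support it; the JDK's bundled parser is expected to.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

class EntityKinds extends DefaultHandler2 {
    // Hypothetical DTD with one entity of each kind; none of them is referenced,
    // so no external file is ever loaded.
    static final String XML =
        "<!DOCTYPE slideshow [\n" +
        "  <!ENTITY % inline \"em | b\">\n" +
        "  <!ENTITY product \"WonderWidgets\">\n" +
        "  <!ENTITY copyright SYSTEM \"copyright.xml\">\n" +
        "  <!NOTATION gif SYSTEM \"image/gif\">\n" +
        "  <!ENTITY logo SYSTEM \"logo.gif\" NDATA gif>\n" +
        "]>\n" +
        "<slideshow/>";

    final List<String> kinds = new ArrayList<>();

    // DeclHandler reports parameter entity names with a leading '%'.
    private static String scope(String name) {
        return name.startsWith("%") ? "parameter" : "general";
    }

    @Override
    public void internalEntityDecl(String name, String value) {
        kinds.add(scope(name) + ", internal, parsed: " + name);
    }

    @Override
    public void externalEntityDecl(String name, String publicId, String systemId) {
        kinds.add(scope(name) + ", external, parsed: " + name);
    }

    @Override
    public void unparsedEntityDecl(String name, String publicId,
                                   String systemId, String notationName) {
        kinds.add("general, external, unparsed: " + name);
    }

    /** Parses the given XML and returns one description per entity declaration. */
    static List<String> classify(String xml) {
        EntityKinds handler = new EntityKinds();
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setDTDHandler(handler);
            reader.setProperty("http://xml.org/sax/properties/declaration-handler", handler);
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
        return handler.kinds;
    }

    public static void main(String[] args) {
        for (String kind : classify(XML)) {
            System.out.println(kind);
        }
    }
}
```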


All of the material in The Java(TM) Web Services Tutorial is copyright-protected and may not be published in other works without express written permission from Sun Microsystems.