Sunday, May 3, 2015

Printed output when processing HTML/XML data with jsoup

Recently, I ran into a problem while retrieving XML data from a website. In my case, suppose the original XML document looks like this:
<result>
<device />
<name>Allen's device</name>
</result>

If I call Jsoup.parse(File, "UTF-8") without additional options, the returned document object looks like this:
<result>
<device>
<name>Allen's device</name>
</result>

The weird part of this result is that <device> appears only as an open tag, with no close tag. However, if you process a <size /> tag in the same example, the program outputs the tag unchanged, as <size />.

That's because jsoup predefines a set of tags that are treated as an open tag with no close tag attached. In this case, <device /> is first read as an open tag <device>, and jsoup creates an empty element at line 204 of the org.jsoup.parser.HtmlTreeBuilder.insertEmpty(Token.StartTag) method. Since <device /> is a self-closing tag, execution then goes through lines 205 ~ 210 of insertEmpty.

The program enters line 205 because <device /> is one of the known tags. The known tags are registered starting at line 257 of org.jsoup.parser.Tag (the full list of known tag definitions is at lines 221 ~ 253), and this initialization is triggered from line 29 of org.jsoup.nodes.Document. Because insertEmpty then reaches line 206, it performs a boolean operation there and returns the element <device>. Notice that insertEmpty is called from org.jsoup.parser.HtmlTreeBuilderState.process(Token, HtmlTreeBuilder).
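The known-tag branch described above can be illustrated with a small stand-alone model. This is only a sketch of the idea, not jsoup's source: the class SelfClosingModel, its Tag type, the render method, and the truncated KNOWN_EMPTY_TAGS list are all hypothetical names made up for this example, and it assumes that a known empty tag is printed as a bare open tag while an unknown self-closing tag is remembered and printed as self-closing.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class SelfClosingModel {
    // Hypothetical, truncated stand-in for the table of known tags.
    // This sketch only lists known tags that are also "empty" (no children).
    static final Set<String> KNOWN_EMPTY_TAGS =
            new HashSet<>(Arrays.asList("br", "img", "hr", "input", "device"));

    /** Minimal stand-in for a tag definition. */
    static final class Tag {
        final String name;
        final boolean known;   // found in the known-tag table
        final boolean empty;   // known tag that never has children
        boolean selfClosing;   // remembered only for unknown tags

        Tag(String name) {
            this.name = name;
            this.known = KNOWN_EMPTY_TAGS.contains(name);
            this.empty = this.known;
        }
    }

    /** Models the decision taken when a start tag is inserted as an empty element. */
    static Tag insertEmpty(String name, boolean selfClosingInSource) {
        Tag tag = new Tag(name);
        if (selfClosingInSource && !tag.known) {
            // Unknown tag: remember that the source wrote it as self-closing.
            tag.selfClosing = true;
        }
        // Known tag: nothing is remembered; the predefined definition wins.
        return tag;
    }

    /** Models HTML-style output for an element without children. */
    static String render(Tag tag) {
        if (tag.empty) {
            return "<" + tag.name + ">";      // open tag only, no close tag
        }
        if (tag.selfClosing) {
            return "<" + tag.name + " />";    // printed as written in the source
        }
        return "<" + tag.name + "></" + tag.name + ">";
    }

    public static void main(String[] args) {
        System.out.println(render(insertEmpty("device", true))); // <device>
        System.out.println(render(insertEmpty("size", true)));   // <size />
    }
}
```

Under these assumptions, the model gives the same contrast as the example in this post: device, being predefined, loses its self-closing form, while size, being unknown, keeps it.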

After that, when the close tag </device> is read, the procedure calls anyOtherEndTag(t, tb) at line 746 of HtmlTreeBuilderState. That method then calls HtmlTreeBuilder.generateImpliedEndTags(String) at line 765. Notably, it uses popToStack() here to pop the device element off a stack.

However, this does not perform a true tag-closing operation. In other words, HtmlTreeBuilder leaves an open tag <device> in its HTML tree after executing line 206 of the insertEmpty method, but it does not handle the </device> tag in anyOtherEndTag in a way that completes the closing.
