Parsing XML

class jxmlease.Parser(**kwargs)
Creates Python data structures from raw XML.
This class creates a callable object used to parse XML into Python data structures. You can provide optional parameters when you create the object; they modify the default behavior of the parser. When you invoke the callable object to parse a document, you can supply additional parameters to override the values specified when the Parser object was created.

General usage is:

>>> myparser = Parser()
>>> root = myparser("<a>foo</a>")
Calling a Parser object returns an XMLDictNode containing the parsed XML tree.

In this example, root is an XMLDictNode which contains a representation of the parsed XML:

>>> isinstance(root, XMLDictNode)
True
>>> root.prettyprint()
{u'a': u'foo'}
>>> print(root.emit_xml())
<?xml version="1.0" encoding="utf-8"?>
<a>foo</a>
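
The create-time defaults and call-time overrides described above follow a common pattern for callable objects. The sketch below illustrates that pattern only. DefaultsCallable is a hypothetical stand-in written for this example; it is not taken from the jxmlease source and does not parse anything.

```python
class DefaultsCallable(object):
    """Hypothetical stand-in for a parser-like callable (not jxmlease code)."""

    def __init__(self, **kwargs):
        # Options given at creation time become this instance's defaults.
        self._defaults = dict(kwargs)

    def __call__(self, xml_input, **kwargs):
        # Options given at call time take precedence over the stored defaults.
        options = dict(self._defaults)
        options.update(kwargs)
        # A real parser would parse xml_input here; the sketch only reports
        # which options would be in effect for this call.
        return xml_input, options


worker = DefaultsCallable(strip_whitespace=False)
_, opts_default = worker("<a>foo</a>")
_, opts_override = worker("<a>foo</a>", strip_whitespace=True)
```

The second call overrides strip_whitespace for that invocation only; the defaults stored on the object are left unchanged.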
If you only need to parse once, you can use the parse() function, a shortcut that creates a Parser object and calls it in a single step. parse() accepts the same arguments as the Parser class.

For example:

>>> root = jxmlease.parse('<a x="y"><b>1</b><b>2</b><b>3</b></a>')
>>> root.prettyprint()
{u'a': {u'b': [u'1', u'2', u'3']}}
It is possible to call a Parser object as a generator by specifying the generator parameter. The generator parameter contains a list of paths to match. If paths are provided in this parameter, the behavior of the parser changes: instead of returning the root node of a parsed XML hierarchy, the parser returns a generator object. Each time the generator is advanced, it returns the next node that matches one of the provided paths.

Paths are provided in a format similar to XPath expressions. For example, /a/b will match node <b> in this XML:

<a>
  <b/>
</a>
If a path begins with a /, it must exactly match the full path to a node. If a path does not begin with a /, it must exactly match the "right side" of the path to a node. For example, consider this XML:

<a>
  <b>
    <c/>
  </b>
</a>

In this example, /a/b/c, c, b/c, and a/b/c all match the <c> node.

For each match, the generator returns a tuple of (path, match_string, xml_node), where path is the calculated absolute path to the matching node, match_string is the user-supplied match string that triggered the match, and xml_node is the object representing that node (an instance of an XMLNodeBase subclass).

For example:
>>> xml = '<a x="y"><b>1</b><b>2</b><b>3</b></a>'
>>> myparser = Parser(generator=["/a/b"])
>>> for (path, match, value) in myparser(xml):
...     print("%s: %s" % (path, value))
...
/a/b: 1
/a/b: 2
/a/b: 3
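
The matching rules and the (path, match_string, xml_node) tuples can be sketched with the standard library alone. This is an illustration under stated assumptions, not jxmlease's implementation: path_matches and iter_matches are hypothetical names, matching is assumed to compare whole path components, and ElementTree elements stand in for XMLNodeBase objects.

```python
import io
import xml.etree.ElementTree as ET


def path_matches(pattern, node_path):
    """Return True if pattern selects node_path (an absolute path such as /a/b/c)."""
    if pattern.startswith("/"):
        # Anchored pattern: it must equal the full path to the node.
        return pattern == node_path
    # Unanchored pattern: it must equal the trailing ("right side") components.
    want = pattern.split("/")
    have = node_path.strip("/").split("/")
    return len(want) <= len(have) and have[-len(want):] == want


def iter_matches(xml_text, patterns):
    """Yield (path, match_string, element) for each element matching a pattern."""
    stack = []  # tags of the elements that are currently open
    source = io.BytesIO(xml_text.encode("utf-8"))
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            stack.append(elem.tag)
            continue
        # On "end" the element is complete and its tag is on top of the stack.
        path = "/" + "/".join(stack)
        for pattern in patterns:
            if path_matches(pattern, path):
                yield path, pattern, elem
                break
        stack.pop()


xml = '<a x="y"><b>1</b><b>2</b><b>3</b></a>'
results = [(path, elem.text) for path, _, elem in iter_matches(xml, ["/a/b"])]
```

Because iter_matches is a generator function, each matching node is produced as soon as its end tag is seen, rather than after the whole document has been parsed.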
When calling the parser, you can specify all of these parameters. When creating a parser instance, you can specify all of these parameters except xml_input:

Parameters:
- xml_input (string or file-like object) – Contains the XML to parse.
- encoding (string or None) – The input's encoding. If not provided, this defaults to 'utf-8'.
- expat (An expat, or equivalent, parser class) – Used for parsing the XML input. If not provided, defaults to the expat parser in xml.parsers.
- process_namespaces (bool) – If True, namespaces in tags and attributes are converted to their full URL value. If False (the default), the namespaces in tags and attributes are left unchanged.
- namespace_separator (string) – If process_namespaces is True, this specifies the separator that expat should use between namespaces and identifiers in tags and attributes.
- xml_attribs (bool) – If True (the default), include XML attributes. If False, ignore them.
- strip_whitespace (bool) – If True (the default), strip whitespace at the start and end of CDATA. If False, keep all whitespace.
- namespaces (dict) – A remapping for namespaces. If supplied, identifiers with a namespace prefix will have their namespace prefix rewritten based on the dictionary. The code will look for namespaces[current_namespace]. If found, current_namespace will be replaced with the result of the lookup.
- strip_namespace (bool) – If True, the namespace prefix will be removed from all identifiers. If False (the default), the namespace prefix will be retained.
- cdata_separator (string) – When encountering "semi-structured" XML (where the XML has CDATA and tags intermixed at the same level), the cdata_separator will be placed between the different groups of CDATA. By default, the cdata_separator parameter is '', which results in the CDATA groups being concatenated without a separator.
- generator (list of strings) – A list of paths to match. If paths are provided here, the behavior of the parser changes: instead of returning the root node of a parsed XML hierarchy, the parser returns a generator object. Each time the generator is advanced, it returns the next node that matches one of the provided paths.
Returns: A callable instance of the Parser class.

Calling a Parser object returns an XMLDictNode containing the parsed XML tree. Alternatively, if the generator parameter is specified, a generator object is returned.
- __delattr__ – x.__delattr__('name') <==> del x.name
- __format__() – default object formatter
- __getattribute__ – x.__getattribute__('name') <==> x.name
- __hash__
- __reduce__() – helper for pickle
- __reduce_ex__() – helper for pickle
- __repr__
- __setattr__ – x.__setattr__('name', value) <==> x.name = value
- __sizeof__() → int – size of object in memory, in bytes
- __str__

jxmlease.parse(xml_input, **kwargs)
Create Python data structures from raw XML.

See the Parser class documentation.

class jxmlease.EtreeParser(**kwargs)
Creates Python data structures from an ElementTree object.

This class returns a callable object. You can provide parameters when you create the object; they modify the default parameters for the parser. When you call the callable object to parse a document, you can supply additional parameters to override the default values.

General usage is like this:

>>> myparser = EtreeParser()
>>> root = myparser(etree_root)
For detailed usage information, please see the Parser class. Other than the differences noted below, the behavior of the two classes should be the same.

Namespace Identifiers:
In certain versions of ElementTree, the original namespace identifiers are not maintained. In these cases, the class will recreate namespace identifiers to represent the original namespaces. It will add appropriate xmlns attributes to maintain the original namespace mapping. However, the actual identifier will be lost. As best I can tell, this is a bug with ElementTree, rather than this code. To avoid this problem, use lxml.

Single-invocation Parsing:
If you only need to parse once, you can use the parse_etree() function, a shortcut that creates an EtreeParser object and calls it in a single step. parse_etree() accepts the same arguments as the EtreeParser class.

Parameters: etree_root (ElementTree) – An ElementTree object representing the tree you wish to parse.

Also accepts most of the same arguments as the Parser class. However, it does not accept the xml_input, expat, or encoding parameters.
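
The namespace identifier behavior noted for ElementTree can be observed with the standard library's xml.etree.ElementTree. The snippet below is a small demonstration of that behavior only; the sample document and the urn:example namespace are made up for illustration, and no jxmlease call is involved.

```python
import xml.etree.ElementTree as ET

# Sample input that declares the prefix "p" for a made-up namespace URI.
source = '<p:a xmlns:p="urn:example"><p:b>1</p:b></p:a>'
root = ET.fromstring(source)

# The parsed tag keeps the namespace URI in {uri}name form;
# the original prefix "p" is not stored on the element.
tag = root.tag

# Serializing again has to generate a prefix for that URI, so the
# round-tripped text no longer uses the identifier from the input.
round_trip = ET.tostring(root).decode("ascii")
```

The namespace mapping itself survives the round trip through a generated xmlns declaration, which matches the description above: the mapping is maintained but the original identifier is lost.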
- __delattr__ – x.__delattr__('name') <==> del x.name
- __format__() – default object formatter
- __getattribute__ – x.__getattribute__('name') <==> x.name
- __hash__
- __reduce__() – helper for pickle
- __reduce_ex__() – helper for pickle
- __repr__
- __setattr__ – x.__setattr__('name', value) <==> x.name = value
- __sizeof__() → int – size of object in memory, in bytes
- __str__

jxmlease.parse_etree(etree_root, **kwargs)
Create Python data structures from an ElementTree object.

See the EtreeParser class documentation.