Be careful with the formatOutput property.
Creating an empty node like this:
createElement('foo','')
instead of
createElement('foo')
can break the indentation that formatOutput produces.
The DOMDocument class
Introduction
Represents an entire HTML or XML document; serves as the root of the document tree.
Class synopsis
Properties
- actualEncoding
Deprecated. Actual encoding of the document; read-only, equivalent to encoding.
- config
Deprecated. Configuration used when DOMDocument::normalizeDocument() is invoked.
- doctype
The doctype associated with the document.
- documentElement
A convenience attribute that gives direct access to the child node that is the document element of the document.
- documentURI
The location of the document, or NULL if undefined.
- encoding
Encoding of the document, as specified by the XML declaration. This attribute is not present in the final DOM Level 3 specification, but it is the only way of manipulating the XML document encoding in this implementation.
- formatOutput
Formats the output nicely with indentation and extra whitespace.
- implementation
The DOMImplementation object that handles this document.
- preserveWhiteSpace
Do not remove redundant whitespace. Defaults to TRUE.
- recover
Proprietary. Enables recovery mode, i.e. tries to parse documents that are not well formed. This attribute is not part of the DOM specification and is specific to libxml.
- resolveExternals
Set it to TRUE to load external entities from a doctype declaration. This is useful for including entities in your XML documents.
- standalone
Deprecated. Whether or not the document is standalone, as specified by the XML declaration; corresponds to xmlStandalone.
- strictErrorChecking
Throws a DOMException on errors. Defaults to TRUE.
- substituteEntities
Proprietary. Whether or not to substitute entities. This attribute is not part of the DOM specification and is specific to libxml.
- validateOnParse
Loads and validates against the DTD. Defaults to FALSE.
- version
Deprecated. Version of XML; corresponds to xmlVersion.
- xmlEncoding
An attribute specifying the encoding of the document. It is NULL when the encoding is unspecified or unknown, such as when the document was created in memory.
- xmlStandalone
An attribute specifying whether the document is standalone. It is FALSE when unspecified.
- xmlVersion
An attribute specifying the version number of the document. If there is no declaration and the document supports the "XML" feature, the value is "1.0".
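As a rough illustration of a few of the properties listed above, the following sketch builds a small document in memory and reads some of them back. The element names `catalog` and `item` are made up for the example; only the DOMDocument properties and methods are real API.

```php
<?php
// Hypothetical sample document built in memory.
$doc = new DOMDocument('1.0', 'UTF-8');
$doc->formatOutput = true; // indent the output of saveXML()

$root = $doc->appendChild($doc->createElement('catalog'));
$root->appendChild($doc->createElement('item', 'one'));

$rootName = $doc->documentElement->nodeName; // the document element
$version  = $doc->xmlVersion;                // version given to the constructor
$encoding = $doc->encoding;                  // encoding given to the constructor

$xml = $doc->saveXML();
echo $rootName, ' ', $version, ' ', $encoding, "\n";
echo $xml;
```

Note that formatOutput only affects serialization; it does not change the tree itself.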
Table of Contents
- DOMDocument::__construct — Creates a new DOMDocument object
- DOMDocument::createAttribute — Creates a new attribute
- DOMDocument::createAttributeNS — Creates a new attribute node with an associated namespace
- DOMDocument::createCDATASection — Creates a new cdata node
- DOMDocument::createComment — Creates a new comment node
- DOMDocument::createDocumentFragment — Creates a new document fragment
- DOMDocument::createElement — Creates a new element node
- DOMDocument::createElementNS — Creates a new element node with an associated namespace
- DOMDocument::createEntityReference — Creates a new entity reference node
- DOMDocument::createProcessingInstruction — Creates a new PI node
- DOMDocument::createTextNode — Creates a new text node
- DOMDocument::getElementById — Searches for an element with a certain id
- DOMDocument::getElementsByTagName — Searches for all elements with the given tag name
- DOMDocument::getElementsByTagNameNS — Searches for all elements with the given tag name in the specified namespace
- DOMDocument::importNode — Imports a node into the current document
- DOMDocument::load — Loads XML from a file
- DOMDocument::loadHTML — Loads HTML from a string
- DOMDocument::loadHTMLFile — Loads HTML from a file
- DOMDocument::loadXML — Loads XML from a string
- DOMDocument::normalizeDocument — Normalizes the document
- DOMDocument::registerNodeClass — Registers the extended class used to create a base node type
- DOMDocument::relaxNGValidate — Performs relaxNG validation on the document
- DOMDocument::relaxNGValidateSource — Performs relaxNG validation on the document
- DOMDocument::save — Dumps the internal XML tree into a file
- DOMDocument::saveHTML — Dumps the internal document into a string using HTML formatting
- DOMDocument::saveHTMLFile — Dumps the internal document into a file using HTML formatting
- DOMDocument::saveXML — Dumps the internal XML tree into a string
- DOMDocument::schemaValidate — Validates a document against a schema
- DOMDocument::schemaValidateSource — Validates a document against a schema
- DOMDocument::validate — Validates the document based on its DTD
- DOMDocument::xinclude — Substitutes XIncludes in a DOMDocument object
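To give a feel for how some of the methods listed above fit together, here is a minimal sketch that loads XML from a string, searches it, adds an element and serializes the result. The `library` and `book` element names and their contents are assumptions made up for the example.

```php
<?php
// Hypothetical input document.
$doc = new DOMDocument();
$doc->loadXML('<library><book>1984</book><book>Frankenstein</book></library>');

// getElementsByTagName() returns a DOMNodeList of matching elements.
$titles = [];
foreach ($doc->getElementsByTagName('book') as $book) {
    $titles[] = $book->textContent;
}

// createElement() only creates the node; it has to be appended
// somewhere before it becomes part of the tree.
$doc->documentElement->appendChild($doc->createElement('book', 'Dracula'));

$count = $doc->getElementsByTagName('book')->length;

// Passing a node to saveXML() serializes just that node, without the XML declaration.
$out = $doc->saveXML($doc->documentElement);
echo implode(', ', $titles), "\n", $count, "\n", $out, "\n";
```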
DOMDocument
31-Oct-2009 09:30
06-Oct-2009 12:08
A child class of DOMDocument that adds a toArray() method. Enjoy and/or improve.
<?php
class MyDOMDocument extends DOMDocument
{
public function toArray(DOMNode $oDomNode = null)
{
// return empty array if dom is blank
if (is_null($oDomNode) && !$this->hasChildNodes()) {
return array();
}
$oDomNode = (is_null($oDomNode)) ? $this->documentElement : $oDomNode;
if (!$oDomNode->hasChildNodes()) {
$mResult = $oDomNode->nodeValue;
} else {
$mResult = array();
foreach ($oDomNode->childNodes as $oChildNode) {
// how many of these child nodes do we have?
// this will give us a clue as to what the result structure should be
$oChildNodeList = $oDomNode->getElementsByTagName($oChildNode->nodeName);
$iChildCount = 0;
// there are x number of childs in this node that have the same tag name
// however, we are only interested in the # of siblings with the same tag name
foreach ($oChildNodeList as $oNode) {
if ($oNode->parentNode->isSameNode($oChildNode->parentNode)) {
$iChildCount++;
}
}
$mValue = $this->toArray($oChildNode);
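// text, comment and similar nodes have names starting with '#'; store them under key 0
// (square brackets are used here because the {0} string offset syntax was removed in PHP 8)
$sKey = ($oChildNode->nodeName[0] == '#') ? 0 : $oChildNode->nodeName;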
$mValue = is_array($mValue) ? $mValue[$oChildNode->nodeName] : $mValue;
// how many of thse child nodes do we have?
if ($iChildCount > 1) { // more than 1 child - make numeric array
$mResult[$sKey][] = $mValue;
} else {
$mResult[$sKey] = $mValue;
}
}
// if the child is <foo>bar</foo>, the result will be array(bar)
// make the result just 'bar'
if (count($mResult) == 1 && isset($mResult[0]) && !is_array($mResult[0])) {
$mResult = $mResult[0];
}
}
// get our attributes if we have any
$arAttributes = array();
if ($oDomNode->hasAttributes()) {
foreach ($oDomNode->attributes as $sAttrName=>$oAttrNode) {
// retain namespace prefixes
$arAttributes["@{$oAttrNode->nodeName}"] = $oAttrNode->nodeValue;
}
}
// check for namespace attribute - Namespaces will not show up in the attributes list
if ($oDomNode instanceof DOMElement && $oDomNode->getAttribute('xmlns')) {
$arAttributes["@xmlns"] = $oDomNode->getAttribute('xmlns');
}
if (count($arAttributes)) {
if (!is_array($mResult)) {
$mResult = (trim($mResult)) ? array($mResult) : array();
}
$mResult = array_merge($mResult, $arAttributes);
}
$arResult = array($oDomNode->nodeName=>$mResult);
return $arResult;
}
}
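$sXml = <<<XML
<nodes>
<node>text</node>
<node>
<field>hello</field>
<field>world</field>
</node>
</nodes>
XML;
$dom = new MyDOMDocument;
// drop whitespace-only text nodes so they do not show up in the result
$dom->preserveWhiteSpace = false;
$dom->loadXML($sXml);
var_dump($dom->toArray());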
?>
Output:
array (
"nodes" => array (
"node" => array (
0 => "text",
1 => array (
"field" => array (
0 => "hello",
1 => "world"
)
)
)
)
)
15-Aug-2009 08:32
If you want to use DOMDocument to create XHTML documents, here is a simple class.
Note that it is designed for creating XHTML documents from scratch, but it could easily be extended to work with existing XHTML documents. Also, this is for XHTML, not generic XML.
<?php
class Document
{
public $doctype;
public $head;
public $title = 'Sensei Ninja';
public $body;
private $styles;
private $metas;
private $scripts;
private $document;
function __construct ( )
{
$this->document = new DOMDocument( );
$this->head = $this->document->createElement( 'head', ' ' );
$this->body = $this->document->createElement( 'body', ' ' );
}
public function addStyleSheet ( $url, $media='all' )
{
$element = $this->document->createElement( 'link' );
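// without rel="stylesheet" browsers will not apply the linked file
$element->setAttribute( 'rel', 'stylesheet' );
$element->setAttribute( 'type', 'text/css' );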
$element->setAttribute( 'href', $url );
$element->setAttribute( 'media', $media );
$this->styles[] = $element;
}
public function addScript ( $url )
{
$element = $this->document->createElement( 'script', ' ' );
$element->setAttribute( 'type', 'text/javascript' );
$element->setAttribute( 'src', $url );
$this->scripts[] = $element;
}
public function addMetaTag ( $name, $content )
{
$element = $this->document->createElement( 'meta' );
$element->setAttribute( 'name', $name );
$element->setAttribute( 'content', $content );
$this->metas[] = $element;
}
public function setDescription ( $dec )
{
$this->addMetaTag( 'description', $dec );
}
public function setKeywords ( $keywords )
{
$this->addMetaTag( 'keywords', $keywords );
}
public function createElement ( $nodeName, $nodeValue=null )
{
return $this->document->createElement( $nodeName, $nodeValue );
}
public function assemble ( )
{
// Doctype creation
$doctype = '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">';
// Create the head element
$title = $this->document->createElement( 'title', $this->title );
// Add stylesheets if needed
if ( is_array( $this->styles ))
foreach ( $this->styles as $element )
$this->head->appendChild( $element );
// Add scripts if needed
if( is_array( $this->scripts ))
foreach ( $this->scripts as $element )
$this->head->appendChild( $element );
// Add meta tags if needed
if ( is_array( $this->metas ))
foreach ( $this->metas as $element )
$this->head->appendChild( $element );
$this->head->appendChild( $title );
// Create the document
$html = $this->document->createElement( 'html' );
$html->setAttribute( 'xmlns', 'http://www.w3.org/1999/xhtml' );
$html->setAttribute( 'xml:lang', 'en' );
$html->setAttribute( 'lang', 'en' );
$html->appendChild( $this->head );
$html->appendChild( $this->body );
$this->document->appendChild( $html );
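// Serialize only the <html> element, so that no XML declaration ends up after the doctype
return $doctype . "\n" . $this->document->saveXML( $html );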
}
}
?>
Small example
<?php
$document = new Document( );
$document->title = 'Hello';
$document->addStyleSheet( 'StyleSheets/main.css' );
$div = $document->createElement( 'div' );
$div->nodeValue = 'Hello, world!';
$div->setAttribute( 'style', 'color: red;' );
$document->body->appendChild( $div );
printf( '%s', $document->assemble( ) );
?>
19-Jun-2009 04:19
It should be pointed out that DOMDocument extends DOMNode, which means that all DOMNode properties and methods are available on it (even though the documentation here does not list them as inherited).
I used to run an XPath query to access nodes of a DOMDocument (when getElementById or getElementsByTagName weren't usable), as I believed this to be the only way. However, since DOMDocument extends DOMNode, you can use, for example, DOMDocument->firstChild to get the first child node.
This simplifies things quite a bit when an XPath query would be excessive for something as simple as reaching the child nodes.
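The point above can be sketched as follows, comparing the inherited firstChild property with an equivalent XPath lookup. The sample XML is made up for the illustration.

```php
<?php
// Hypothetical input document.
$doc = new DOMDocument();
$doc->loadXML('<library><book>1984</book></library>');

// Inherited DOMNode property, used directly on the document object.
$viaProperty = $doc->firstChild;

// The equivalent XPath lookup, for comparison.
$xpath    = new DOMXPath($doc);
$viaXPath = $xpath->query('/*')->item(0);

var_dump($viaProperty->isSameNode($viaXPath));
echo $viaProperty->firstChild->nodeName, "\n";
```

Keep in mind that firstChild returns whatever node comes first, which can be a doctype or a comment if the document has one; documentElement always points at the root element.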
23-May-2009 07:31
This function may help when debugging a DOM object (document, element, attribute or node list):
<?php
function dom_dump($obj) {
// get_class() throws a TypeError on non-objects in PHP 8, so check first
if (is_object($obj) && ($classname = get_class($obj))) {
$retval = "Instance of $classname, node list: \n";
switch (true) {
case ($obj instanceof DOMDocument):
$retval .= "XPath: {$obj->getNodePath()}\n".$obj->saveXML($obj);
break;
case ($obj instanceof DOMElement):
$retval .= "XPath: {$obj->getNodePath()}\n".$obj->ownerDocument->saveXML($obj);
break;
case ($obj instanceof DOMAttr):
$retval .= "XPath: {$obj->getNodePath()}\n".$obj->ownerDocument->saveXML($obj);
//$retval .= $obj->ownerDocument->saveXML($obj);
break;
case ($obj instanceof DOMNodeList):
for ($i = 0; $i < $obj->length; $i++) {
$retval .= "Item #$i, XPath: {$obj->item($i)->getNodePath()}\n".
"{$obj->item($i)->ownerDocument->saveXML($obj->item($i))}\n";
}
break;
default:
return "Instance of unknown class";
}
} else {
return 'no elements...';
}
return htmlspecialchars($retval);
}
?>
Example usage:
<?php
$dom = new DOMDocument();
$dom->load('test.xml');
$body = $dom->documentElement->getElementsByTagName('book');
echo '<pre>'.dom_dump($body).'</pre>';
?>
Output:
Instance of DOMNodeList, node list:
Item #0, XPath: /library/book[1]
<book isbn="0345342968">
<title>Fahrenheit 451</title>
<author>R. Bradbury</author>
<publisher>Del Rey</publisher>
</book>
Item #1, XPath: /library/book[2]
<book isbn="0048231398">
<title>The Silmarillion</title>
<author>J.R.R. Tolkien</author>
<publisher>G. Allen & Unwin</publisher>
</book>
Item #2, XPath: /library/book[3]
<book isbn="0451524934">
<title>1984</title>
<author>G. Orwell</author>
<publisher>Signet</publisher>
</book>
Item #3, XPath: /library/book[4]
<book isbn="031219126X">
<title>Frankenstein</title>
<author>M. Shelley</author>
<publisher>Bedford</publisher>
</book>
Item #4, XPath: /library/book[5]
<book isbn="0312863551">
<title>The Moon Is a Harsh Mistress</title>
<author>R. A. Heinlein</author>
<publisher>Orb</publisher>
</book>
30-Nov-2008 07:59
Here is a simple web scraping example using the PHP DOM that tries to extract the largest body of text from an HTML document. I needed it for a spider that had to show a short description of a page. It assumes that the descriptive text is the largest <div>, <td> or <p> element in the page.
The example also shows a workaround for a bug in the DOM extension: it sometimes fails to recognize the HTML encoding. It seems to work if you put a charset meta tag right after the opening head tag of the document.
<?php
$ch= curl_init();
curl_setopt ($ch, CURLOPT_URL, '...put url here...' );
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch,CURLOPT_VERBOSE,1);
curl_setopt($ch, CURLOPT_USERAGENT, 'set sth...');
curl_setopt ($ch, CURLOPT_REFERER, '...set sth...'); //just a fake referer
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch,CURLOPT_POST,0);
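curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects (this option is a boolean)
curl_setopt($ch, CURLOPT_MAXREDIRS, 20); // but give up after 20 of them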
$html= curl_exec($ch);
$html1= curl_getinfo($ch);
//try to get page encoding as it was sent from server
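// the header may be missing or may not carry a charset at all
if (!empty($html1['content_type']) && strpos($html1['content_type'],'charset=')!==false){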
$arr= explode('charset=',$html1['content_type']);
$csethdr= strtolower(trim($arr[1]));
} else {
$csethdr= false;
}
$cset= false;
$arr= array();
//This has to replace page meta tags for charset with utf-8, but it doesn't actually help(see the bug info).
if (preg_match_all(
'/(<meta\s*http-equiv="Content-Type"\s*content="[^;]*;\s*charset=([^"]*?)(?:"|\;)[^>]*>)/'
,$html,$arr,PREG_PATTERN_ORDER)){
$cset= strtolower(trim($arr[2][0]));
if ($cset!='utf-8'||$cset!=$csethdr){
$new= str_replace($arr[2][0],'utf-8',$arr[1][0]);
$html= str_replace($arr[1][0],$new,$html);
$cset= $csethdr;
} else {
$cset= false;
}
if ($cset=='utf-8'){
$cset= false;
}
}
unset($arr);
if ($cset){
$html= iconv($cset,'utf-8',$html);
}
unset($cset);
//solve dom bug
$html=preg_replace('/<head[^>]*>/','<head>
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=utf-8">
',$html);
$dom= new DOMDocument();
$dom->preserveWhiteSpace = false; // set before loading; it cannot change an already parsed document
$dom->loadHTML($html);
function getMaxTextBody($dom){
$content = $dom->getElementsByTagname('div');
$content2= $dom->getElementsByTagname('td');
$content3= $dom->getElementsByTagname('p');
$new= array();
foreach ($content as $value) {
$new[]= $value;
unset($value);
}
unset($content);
foreach ($content2 as $value) {
$new[]= $value;
unset($value);
}
unset($content2);
foreach ($content3 as $value) {
$new[]= $value;
unset($value);
}
unset($content3);
$maxlen= 0;
$result= '';
foreach ($new as $item)
{
$str= $item->nodeValue;
if (strlen($str)>$maxlen){
$content1= $item->getElementsByTagName('div');
$content2= $item->getElementsByTagname('td');
$content3= $item->getElementsByTagname('p');
$contentnew= array();
foreach ($content1 as $value) {
$contentnew[]= $value;
unset($value);
}
unset($content1);
foreach ($content2 as $value) {
$contentnew[]= $value;
unset($value);
}
unset($content2);
foreach ($content3 as $value) {
$contentnew[]= $value;
unset($value);
}
unset($content3);
if (count($contentnew)==0){
$result= $str;
} else {
foreach ($contentnew as $value) {
$str1= getMaxTextBody($value);
$str2= $value->nodeValue;
//let's say largest body has more than 50% of the text in its parent
if (strlen($str1)*2<strlen($str2)){
$str1= $str2;
}
if (strlen($str1)*2>strlen($str)&&strlen($str1)>$maxlen){
$result= $str1;
} elseif (strlen($str1)>$maxlen){
$result= $str1;
}
$maxlen= strlen($result);
}
}
$maxlen= strlen($result);
unset($contentnew);
}
}
unset($new);
return $result;
}
print getMaxTextBody($dom);
?>
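The encoding workaround from the note above can be isolated into a much smaller sketch. It assumes the input is already UTF-8 and simply wraps it in a page whose head starts with a charset meta tag; the sample fragment is made up for the illustration.

```php
<?php
// Hypothetical UTF-8 fragment ("café") with no charset information of its own.
$fragment = "<p>caf\xC3\xA9</p>";

// Assumption: the input is UTF-8. The meta tag tells the HTML parser so.
$html = '<html><head>'
      . '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
      . '</head><body>' . $fragment . '</body></html>';

libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$text = $dom->getElementsByTagName('p')->item(0)->textContent;
echo $text, "\n";
```

Without the meta tag the parser may fall back to a single-byte encoding and the accented character can come out garbled.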
15-May-2008 01:58
To pretty-print an XML document, I use:
<?php
$sXML = '<root><element><key>a</key><value>b</value></element></root>';
$doc = new DOMDocument();
$doc->preserveWhiteSpace = false;
$doc->formatOutput = true;
$doc->loadXML($sXML);
echo $doc->saveXML();
?>
11-Apr-2008 07:48
Here is a quick example of how to use this class, so that new users can get started without having to figure it all out by themselves. (At the time of posting, this documentation had just been added and lacked examples.)
<?php
// Set the content type to be XML, so that the browser will recognise it as XML.
header( "content-type: application/xml; charset=ISO-8859-15" );
// "Create" the document.
$xml = new DOMDocument( "1.0", "ISO-8859-15" );
// Create some elements.
$xml_album = $xml->createElement( "Album" );
$xml_track = $xml->createElement( "Track", "The ninth symphony" );
// Set the attributes.
$xml_track->setAttribute( "length", "0:01:15" );
$xml_track->setAttribute( "bitrate", "64kb/s" );
$xml_track->setAttribute( "channels", "2" );
// Create another element, to show that elements can be nested to any reasonable depth.
$xml_note = $xml->createElement( "Note", "The last symphony composed by Ludwig van Beethoven." );
// Append the whole bunch.
$xml_track->appendChild( $xml_note );
$xml_album->appendChild( $xml_track );
// Repeat the above with some different values..
$xml_track = $xml->createElement( "Track", "Highway Blues" );
$xml_track->setAttribute( "length", "0:01:33" );
$xml_track->setAttribute( "bitrate", "64kb/s" );
$xml_track->setAttribute( "channels", "2" );
$xml_album->appendChild( $xml_track );
$xml->appendChild( $xml_album );
// Serialize the document and print the XML.
print $xml->saveXML();
?>
Output (shown indented for readability, with the XML declaration omitted):
<Album>
<Track length="0:01:15" bitrate="64kb/s" channels="2">
The ninth symphony
<Note>
The last symphony composed by Ludwig van Beethoven.
</Note>
</Track>
<Track length="0:01:33" bitrate="64kb/s" channels="2">Highway Blues</Track>
</Album>
If you want your PHP DOM code to run under the .xml extension, set up your web server to run .xml files through PHP (refer to the PHP installation/configuration documentation on how to do this).
Note that this:
<?php
$xml = new DOMDocument( "1.0", "ISO-8859-15" );
$xml_album = $xml->createElement( "Album" );
$xml_track = $xml->createElement( "Track" );
$xml_album->appendChild( $xml_track );
$xml->appendChild( $xml_album );
?>
is NOT the same as this:
<?php
// Will NOT work.
$xml = new DOMDocument( "1.0", "ISO-8859-15" );
$xml_album = new DOMElement( "Album" );
$xml_track = new DOMElement( "Track" );
$xml_album->appendChild( $xml_track );
$xml->appendChild( $xml_album );
?>
although this will work:
<?php
$xml = new DOMDocument( "1.0", "ISO-8859-15" );
$xml_album = new DOMElement( "Album" );
$xml->appendChild( $xml_album );
?>
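The difference comes from the fact that, according to the PHP manual, a DOMElement created with new is read-only until it is associated with a document, so nothing can be appended to it before that. As a minimal sketch of one way around this (reusing the element names from the examples above), attach the element to the document first and add children afterwards:

```php
<?php
$xml = new DOMDocument('1.0', 'UTF-8');

// Attach the standalone element first; appendChild() returns the node
// as it now lives in the document.
$album = $xml->appendChild(new DOMElement('Album'));

// Now that the element belongs to a document, children can be added to it.
$album->appendChild($xml->createElement('Track'));

$out = $xml->saveXML($album);
echo $out, "\n";
```

Using DOMDocument::createElement() throughout avoids the issue altogether, since the nodes it returns are writable from the start.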
