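The Complete XPath Syntax Guide: A Full Walkthrough with Examples

Introduction

XPath (XML Path Language) is a query language for navigating XML documents and selecting nodes from them. Although it was designed for XML, HTML documents can be parsed into a similar tree structure, so XPath is also widely used for extracting data from HTML and in web automation testing. This guide covers XPath path expressions, predicates, axes, functions, and operators, with examples for each.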

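XPath Basic Concepts

Before looking at the syntax, it helps to understand the data model. XPath treats an XML or HTML document as a tree of nodes, and every part of the document is a node in that tree. XPath defines seven node types:

Element nodes: for example <html>, <body>, <p>.
Attribute nodes: for example href in <a href="#">.
Text nodes: the character data inside an element, for example Hello in <p>Hello</p>.
Namespace nodes: the XML namespace declarations in scope for an element.
Processing instruction nodes: for example <?xml-stylesheet type="text/xsl" href="foo.xsl"?>.
Comment nodes: for example <!-- This is a comment -->.
Document (root) node: the root of the whole document. It is the parent of the root element and an ancestor of every other node.

XPath selects these nodes with path expressions, which describe how to navigate from one node to another.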
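The node types listed above can be observed directly with lxml. The following is a minimal sketch on a small hypothetical document (the `<root>` wrapper and its content are made up for illustration, not taken from a real file):

```python
from lxml import etree

# Hypothetical mini-document that contains several node types.
doc = etree.fromstring(
    "<root>"
    '<?xml-stylesheet type="text/xsl" href="foo.xsl"?>'
    "<!-- a comment -->"
    '<a href="#">Hello</a>'
    "</root>"
)

elements = doc.xpath("//a")                        # element nodes
attributes = doc.xpath("//a/@href")                # attribute nodes (returned as strings)
texts = doc.xpath("//a/text()")                    # text nodes (returned as strings)
comments = doc.xpath("//comment()")                # comment nodes
pis = doc.xpath("//processing-instruction()")      # processing instruction nodes
document_children = doc.xpath("/*")                # element child of the document node

print([e.tag for e in elements], attributes, texts)
print(comments[0].text, pis[0].target, document_children[0].tag)
```

lxml returns attribute and text nodes as Python strings, while elements, comments, and processing instructions come back as objects.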

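XPath Path Expressions

Path expressions are the core of XPath. They locate nodes in the node tree. The most commonly used ones are:

Expression	Description	Example	Result
`nodename`	Selects the child elements of the context node that are named `nodename`.	`bookstore`	Selects the `<bookstore>` child elements of the context node.
`/`	Selects starting from the root node.	`/bookstore`	Selects the root element `<bookstore>`.
`//`	Selects matching nodes at any depth. A leading `//` searches the whole document; `a//b` searches all descendants of `a`.	`//book`	Selects every `<book>` element in the document.
`.`	Selects the current node.	`.`	Selects the current node.
`..`	Selects the parent of the current node.	`book/..`	Selects the parent of the `<book>` element.
`@attribute`	Selects the attribute named `attribute`.	`//book/@category`	Selects the `category` attribute of every `<book>` element.

Sample XML document (books.xml):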

<?xml version="1.0" encoding="UTF-8"?>
<bookstore>
  <book category="COOKING">
    <title lang="en">Everyday Italian</title>
    <author>Giada De Laurentiis</author>
    <year>2005</year>
    <price>30.00</price>
  </book>
  <book category="CHILDREN">
    <title lang="en">Harry Potter</title>
    <author>J K. Rowling</author>
    <year>2005</year>
    <price>29.99</price>
  </book>
  <book category="WEB">
    <title lang="en">XQuery Kick Start</title>
    <author>James McGovern</author>
    <author>Per Bothner</author>
    <author>Simon St. Laurent</author>
    <author>Linda Burroughes</author>
    <year>2006</year>
    <price>49.99</price>
  </book>
  <book category="WEB">
    <title lang="en">Learning XML</title>
    <author>Erik T. Ray</author>
    <year>2003</year>
    <price>39.95</price>
  </book>
</bookstore>

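Path expression examples:

/bookstore/book: selects all <book> child elements of the root <bookstore> element.
//title: selects every <title> element in the document.
bookstore//author: selects all <author> descendants of the <bookstore> element (a relative path, so the context node must be the parent of <bookstore>).
//book[@category]: selects all <book> elements that have a category attribute.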
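The path expressions above can be tried out with lxml. This is a sketch that assumes a condensed copy of books.xml embedded as a string (the XML declaration is left out because lxml rejects Python `str` input that carries an encoding declaration):

```python
from lxml import etree

# Condensed copy of the sample books.xml.
BOOKS_XML = """
<bookstore>
  <book category="COOKING"><title lang="en">Everyday Italian</title>
    <author>Giada De Laurentiis</author><year>2005</year><price>30.00</price></book>
  <book category="CHILDREN"><title lang="en">Harry Potter</title>
    <author>J K. Rowling</author><year>2005</year><price>29.99</price></book>
  <book category="WEB"><title lang="en">XQuery Kick Start</title>
    <author>James McGovern</author><author>Per Bothner</author>
    <author>Simon St. Laurent</author><author>Linda Burroughes</author>
    <year>2006</year><price>49.99</price></book>
  <book category="WEB"><title lang="en">Learning XML</title>
    <author>Erik T. Ray</author><year>2003</year><price>39.95</price></book>
</bookstore>
""".strip()

root = etree.fromstring(BOOKS_XML)  # root is the <bookstore> element

books = root.xpath("/bookstore/book")                 # absolute path from the document root
titles = [t.text for t in root.xpath("//title")]      # every <title>, at any depth
authors = root.xpath("/bookstore//author")            # all <author> descendants
relative_titles = root.xpath("book/title")            # relative to <bookstore>
categories = root.xpath("//book/@category")           # attribute values as strings
with_category = root.xpath("//book[@category]")       # books that have the attribute

print(len(books), titles, len(authors), categories)
```

Here the relative expression `book/title` is evaluated with `<bookstore>` as the context node, which is why it does not start with `bookstore/`.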

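XPath Predicates

Predicates find a specific node or the nodes that contain a specific value. A predicate is written in square brackets [] and holds a condition that filters the selected nodes.

Expression	Description	Example	Result
`[N]`	Selects the node at position N (shorthand for `[position()=N]`; positions start at 1).	`/bookstore/book[1]`	Selects the first `<book>` child of `<bookstore>`.
`[last()]`	Selects the last node.	`/bookstore/book[last()]`	Selects the last `<book>` child of `<bookstore>`.
`[last()-1]`	Selects the second-to-last node.	`/bookstore/book[last()-1]`	Selects the second-to-last `<book>` child of `<bookstore>`.
`[position()<N]`	Selects the nodes whose position is less than N.	`/bookstore/book[position()<3]`	Selects the `<book>` children of `<bookstore>` at positions below 3 (the first two).
`[@attribute='value']`	Selects nodes whose attribute has the given value.	`//book[@category='COOKING']`	Selects all `<book>` elements whose `category` attribute is 'COOKING'.
`[element > value]`	Selects nodes whose child element value satisfies a comparison (`=`, `>`, `<` and so on).	`//book[price>35.00]`	Selects all `<book>` elements whose `<price>` is greater than 35.00.
`[contains(@attribute, 'substring')]`	Selects nodes whose attribute value contains the substring.	`//title[contains(@lang, 'en')]`	Selects all `<title>` elements whose `lang` attribute contains 'en'.
`[starts-with(string, 'prefix')]`	Selects nodes for which the string (an attribute value or, with `.`, the node's own string value) starts with the prefix.	`//author[starts-with(., 'J K.')]`	Selects all `<author>` elements whose text starts with 'J K.'.
`[ends-with(string, 'suffix')]`	Selects nodes for which the string ends with the suffix (XPath 2.0+ only; not available in XPath 1.0 engines such as lxml).	`//title[ends-with(., 'XML')]`	Selects all `<title>` elements whose text ends with 'XML'.
`[text()='value']`	Selects nodes that have a text node child equal to the value.	`//author[text()='J K. Rowling']`	Selects the `<author>` element whose text is 'J K. Rowling'.
`[not(condition)]`	Selects nodes that do not satisfy the condition.	`//book[not(@category='COOKING')]`	Selects all `<book>` elements whose `category` is not 'COOKING', including any without a `category` attribute.
`[count(element) > N]`	Selects nodes that have more than N child elements of the given name.	`//book[count(author) > 1]`	Selects the `<book>` elements that have more than one `<author>`.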
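Because `ends-with()` only exists from XPath 2.0 on, an XPath 1.0 engine such as the one behind lxml needs a workaround. One common approach, sketched here on a hypothetical titles-only document, compares the tail of the string obtained with `substring()` and `string-length()`:

```python
from lxml import etree

# Hypothetical titles-only document using the titles from books.xml.
doc = etree.fromstring(
    "<titles>"
    "<title>Everyday Italian</title><title>Harry Potter</title>"
    "<title>XQuery Kick Start</title><title>Learning XML</title>"
    "</titles>"
)

# ends-with() is not part of XPath 1.0, so the call fails in lxml.
try:
    doc.xpath("//title[ends-with(., 'XML')]")
    ends_with_supported = True
except etree.XPathError:
    ends_with_supported = False

# XPath 1.0 equivalent: take the last 3 characters and compare them with the suffix.
suffix_expr = (
    "//title[substring(., string-length(.) - string-length('XML') + 1) = 'XML']/text()"
)
titles_ending_in_xml = doc.xpath(suffix_expr)

print(ends_with_supported, titles_ending_in_xml)
```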

Predicate examples:

//book[price>35.00]/title: selects the <title> of every <book> whose <price> is greater than 35.00.
//book[year=2005 and @category='CHILDREN']: selects the <book> elements whose year is 2005 and whose category is 'CHILDREN'.
//book[position()=1 or position()=last()]: selects the first and the last <book> element among their siblings.
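To check the predicate examples against the sample data, the following sketch runs them with lxml on a condensed copy of books.xml (embedded here as a string so the snippet is self-contained):

```python
from lxml import etree

# Condensed copy of the sample books.xml.
BOOKS_XML = """
<bookstore>
  <book category="COOKING"><title lang="en">Everyday Italian</title>
    <author>Giada De Laurentiis</author><year>2005</year><price>30.00</price></book>
  <book category="CHILDREN"><title lang="en">Harry Potter</title>
    <author>J K. Rowling</author><year>2005</year><price>29.99</price></book>
  <book category="WEB"><title lang="en">XQuery Kick Start</title>
    <author>James McGovern</author><author>Per Bothner</author>
    <author>Simon St. Laurent</author><author>Linda Burroughes</author>
    <year>2006</year><price>49.99</price></book>
  <book category="WEB"><title lang="en">Learning XML</title>
    <author>Erik T. Ray</author><year>2003</year><price>39.95</price></book>
</bookstore>
""".strip()

root = etree.fromstring(BOOKS_XML)

def titles(book_expr):
    """Return the title text of every book selected by book_expr."""
    return root.xpath(book_expr + "/title/text()")

first = titles("/bookstore/book[1]")
last = titles("/bookstore/book[last()]")
first_two = titles("/bookstore/book[position()<3]")
expensive = titles("//book[price>35.00]")
children_2005 = titles("//book[year=2005 and @category='CHILDREN']")
multi_author = titles("//book[count(author) > 1]")
first_and_last = titles("//book[position()=1 or position()=last()]")

print(first, last, first_two, expensive, children_2005, multi_author, first_and_last)
```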

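XPath Axes

An axis defines a set of nodes relative to the current node. XPath 1.0 defines 13 axes, which let you move from the current node in different directions (parent, child, sibling, ancestor, descendant, and so on).

Axis name	Result	Example
`ancestor`	Selects all ancestors of the current node (parent, grandparent, and so on).	`//title/ancestor::book`
`ancestor-or-self`	Selects all ancestors of the current node and the current node itself.	`//title/ancestor-or-self::bookstore`
`attribute`	Selects all attributes of the current node.	`//book/attribute::category` or `//book/@category`
`child`	Selects all children of the current node.	`//bookstore/child::book` or `//bookstore/book`
`descendant`	Selects all descendants of the current node (children, grandchildren, and so on).	`//bookstore/descendant::author`
`descendant-or-self`	Selects all descendants of the current node and the current node itself.	`//bookstore/descendant-or-self::book`
`following`	Selects everything in the document after the current node, excluding its descendants, attribute nodes, and namespace nodes.	`//book[1]/following::book`
`following-sibling`	Selects all siblings after the current node.	`//book[1]/following-sibling::book`
`namespace`	Selects all namespace nodes of the current node.	`//book/namespace::*`
`parent`	Selects the parent of the current node.	`//title/parent::book` or `//title/..`
`preceding`	Selects everything in the document before the current node, excluding its ancestors, attribute nodes, and namespace nodes.	`//book[last()]/preceding::book`
`preceding-sibling`	Selects all siblings before the current node.	`//book[last()]/preceding-sibling::book`
`self`	Selects the current node itself.	`//book/self::book` or `.`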

Axis examples:

//book[author='J K. Rowling']/following-sibling::book: selects all sibling <book> elements after the <book> whose author is 'J K. Rowling'.
//price/ancestor::book[@category='WEB']: selects the <book> ancestors of <price> elements whose category attribute is 'WEB'.
//book[./year = 2005]/child::title: selects the <title> children of the <book> elements whose year is 2005.
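The axis examples can be verified the same way. The sketch below again assumes a condensed copy of books.xml and adds one extra query to show that positions on a reverse axis such as `preceding-sibling` count backwards from the context node:

```python
from lxml import etree

# Condensed copy of the sample books.xml.
BOOKS_XML = """
<bookstore>
  <book category="COOKING"><title lang="en">Everyday Italian</title>
    <author>Giada De Laurentiis</author><year>2005</year><price>30.00</price></book>
  <book category="CHILDREN"><title lang="en">Harry Potter</title>
    <author>J K. Rowling</author><year>2005</year><price>29.99</price></book>
  <book category="WEB"><title lang="en">XQuery Kick Start</title>
    <author>James McGovern</author><author>Per Bothner</author>
    <author>Simon St. Laurent</author><author>Linda Burroughes</author>
    <year>2006</year><price>49.99</price></book>
  <book category="WEB"><title lang="en">Learning XML</title>
    <author>Erik T. Ray</author><year>2003</year><price>39.95</price></book>
</bookstore>
""".strip()

root = etree.fromstring(BOOKS_XML)

# Siblings that come after the Rowling book.
after_rowling = root.xpath(
    "//book[author='J K. Rowling']/following-sibling::book/title/text()"
)
# Walk up from <price> to its <book> ancestor, keeping only WEB books.
web_books = root.xpath("//price/ancestor::book[@category='WEB']/title/text()")
# child:: is the default axis, so child::title is the same as title.
titles_2005 = root.xpath("//book[./year = 2005]/child::title/text()")
# All siblings before the last book (returned in document order).
before_last = root.xpath("//book[last()]/preceding-sibling::book/title/text()")
# On a reverse axis, [1] is the nearest node, i.e. the immediately preceding sibling.
nearest_before_last = root.xpath(
    "//book[last()]/preceding-sibling::book[1]/title/text()"
)

print(after_rowling, web_books, titles_2005, before_last, nearest_before_last)
```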

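XPath Functions

XPath provides built-in functions for working with node sets, strings, numbers, and booleans. They make more complex extraction queries possible.

Node-set functions

Function	Description	Example
`last()`	Returns the size of the context node set, which is the position of its last node.	`//book[last()]`
`position()`	Returns the position of the node within the context node set.	`//book[position() < 3]`
`count(node-set)`	Returns the number of nodes in the node set.	`count(//book)`
`id(id)`	Selects the element with the given unique ID (the ID attribute is normally declared in a DTD).	`id("myid")`
`name()`	Returns the qualified name of the node.	`//book[name()='book']`
`local-name()`	Returns the local part of the node name (without the namespace prefix).	`//book[local-name()='book']`
`namespace-uri()`	Returns the namespace URI of the node.	`//book[namespace-uri()='http://www.example.com/books']`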
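The difference between `name()`, `local-name()`, and `namespace-uri()` only shows up in documents that use namespaces, which books.xml does not. The following sketch therefore uses a small hypothetical namespaced document (the prefix `b` and the element names are assumptions chosen for illustration; the URI is the one from the table):

```python
from lxml import etree

# Hypothetical namespaced document; <note> is deliberately outside the namespace.
doc = etree.fromstring(
    '<b:books xmlns:b="http://www.example.com/books">'
    "<b:book>A</b:book><note>n</note>"
    "</b:books>"
)

NS_URI = "http://www.example.com/books"

# Match by local name, ignoring prefix and namespace.
by_local_name = doc.xpath("//*[local-name()='book']")
# Match every element that belongs to the namespace.
in_namespace = doc.xpath(f"//*[namespace-uri()='{NS_URI}']")
# name() returns the qualified name, including the prefix used in the document.
qualified_name = doc.xpath("name(//*[local-name()='book'])")
# The usual alternative: bind a prefix of your own via the namespaces argument.
by_prefix = doc.xpath("//bk:book", namespaces={"bk": NS_URI})
element_count = doc.xpath("count(//*)")

print(len(by_local_name), len(in_namespace), qualified_name, len(by_prefix), element_count)
```

The prefix passed in `namespaces` (`bk` here) does not have to match the prefix used inside the document; only the URI matters.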

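String functions

Function	Description	Example
`string(object)`	Converts the object to a string.	`string(//book[1]/title)`
`concat(string1, string2, ...)`	Concatenates two or more strings.	`concat("Hello", " ", "World")`
`starts-with(string, substring)`	Tests whether the string starts with the substring.	`//title[starts-with(., 'Everyday')]`
`contains(string, substring)`	Tests whether the string contains the substring.	`//author[contains(., 'Rowling')]`
`substring(string, start, length)`	Extracts part of the string; `start` is 1-based and `length` is optional.	`substring("Hello World", 1, 5)`
`substring-before(string, substring)`	Returns the part before the first occurrence of the substring.	`substring-before("Hello World", " ")`
`substring-after(string, substring)`	Returns the part after the first occurrence of the substring.	`substring-after("Hello World", " ")`
`string-length(string)`	Returns the length of the string.	`string-length("Hello")`
`normalize-space(string)`	Strips leading and trailing whitespace and collapses runs of whitespace into a single space.	`normalize-space(" Hello World ")`
`translate(string, from, to)`	Replaces characters one for one: each character in `from` is replaced by the character at the same position in `to`.	`translate("abc", "a", "A")`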
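Since these functions also work on string literals, they are easy to experiment with. A minimal sketch, assuming any element as the context node (the empty `<r/>` element is just a placeholder):

```python
from lxml import etree

# The expressions below only use literals, so any context node will do.
ctx = etree.fromstring("<r/>")

results = {
    "concat": ctx.xpath("concat('Hello', ' ', 'World')"),
    "substring": ctx.xpath("substring('Hello World', 1, 5)"),
    "substring-before": ctx.xpath("substring-before('Hello World', ' ')"),
    "substring-after": ctx.xpath("substring-after('Hello World', ' ')"),
    "string-length": ctx.xpath("string-length('Hello')"),
    "normalize-space": ctx.xpath("normalize-space('  Hello   World  ')"),
    "translate": ctx.xpath("translate('abc', 'a', 'A')"),
    # Characters in `from` without a counterpart in `to` are deleted.
    "translate-delete": ctx.xpath("translate('2025-07-01', '-', '')"),
    "starts-with": ctx.xpath("starts-with('Everyday Italian', 'Everyday')"),
    "contains": ctx.xpath("contains('J K. Rowling', 'Rowling')"),
}

for name, value in results.items():
    print(f"{name}: {value!r}")
```

lxml maps XPath strings to Python `str`, numbers to `float`, and booleans to `bool`, so `string-length('Hello')` comes back as `5.0` rather than `5`.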

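Numeric functions

Function	Description	Example
`number(object)`	Converts the object to a number.	`number(//book[1]/price)`
`sum(node-set)`	Returns the sum of the numeric values of all nodes in the node set.	`sum(//price)`
`floor(number)`	Returns the largest integer that is not greater than the number.	`floor(3.7)`
`ceiling(number)`	Returns the smallest integer that is not less than the number.	`ceiling(3.2)`
`round(number)`	Rounds to the nearest integer; a value exactly halfway rounds up.	`round(3.5)`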
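A short sketch of the numeric functions in lxml follows. The `<p>`/`<v>` document is a hypothetical stand-in used only to give `sum()` something to add up:

```python
import math

from lxml import etree

# Hypothetical document with two numeric values.
ctx = etree.fromstring("<p><v>1.5</v><v>2.5</v></p>")

number_value = ctx.xpath("number('30.00')")     # string converted to a number
total = ctx.xpath("sum(//v)")                   # 1.5 + 2.5
floor_value = ctx.xpath("floor(3.7)")
ceiling_value = ctx.xpath("ceiling(3.2)")
round_up = ctx.xpath("round(3.5)")
round_half = ctx.xpath("round(2.5)")            # halves round towards positive infinity
round_negative_half = ctx.xpath("round(-2.5)")
not_a_number = ctx.xpath("number('abc')")       # non-numeric strings become NaN

print(number_value, total, floor_value, ceiling_value)
print(round_up, round_half, round_negative_half, math.isnan(not_a_number))
```

Note that XPath rounds 2.5 to 3, whereas Python's built-in `round(2.5)` returns 2, so the two should not be mixed when comparing results.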

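Boolean functions

Function	Description	Example
`boolean(object)`	Converts the object to a boolean (a non-empty node set is true).	`boolean(//book)`
`not(boolean)`	Logical negation.	`not(//book[@category='COOKING'])`
`true()`	Returns true.	`true()`
`false()`	Returns false.	`false()`
`lang(string)`	Tests whether the language of the current node, as given by the `xml:lang` attribute in scope, matches the given language. A plain `lang` attribute, as used in books.xml, is not consulted.	`//title[lang('en')]`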
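The behaviour of `lang()` is easy to get wrong, so here is a small sketch. It assumes a hypothetical two-title document in which one title uses a plain `lang` attribute (as books.xml does) and the other uses `xml:lang`:

```python
from lxml import etree

# Hypothetical document: title A has a plain lang attribute, title B has xml:lang.
doc = etree.fromstring(
    '<r><title lang="en">A</title><title xml:lang="en-US">B</title></r>'
)

# lang() only looks at xml:lang and also matches sublanguages such as en-US.
matched_by_lang = doc.xpath("//title[lang('en')]/text()")
# A plain lang attribute has to be tested like any other attribute.
matched_by_attribute = doc.xpath("//title[@lang='en']/text()")

has_titles = doc.xpath("boolean(//title)")      # non-empty node set -> True
no_missing = doc.xpath("not(//missing)")        # empty node set -> not(...) is True
always_false = doc.xpath("false()")

print(matched_by_lang, matched_by_attribute, has_titles, no_missing, always_false)
```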

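Function examples:

//book[contains(title, 'Harry') and year = 2005]: selects the <book> elements whose title contains 'Harry' and whose year is 2005.
sum(//book/price): computes the total of all book prices.
//book[substring-after(price, '.') = '00']: selects the books whose price ends in '.00'.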
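The three function examples can be run against the sample data as follows. The snippet assumes a condensed copy of books.xml and rounds the sum in Python, because XPath numbers are floating-point values:

```python
from lxml import etree

# Condensed copy of the sample books.xml.
BOOKS_XML = """
<bookstore>
  <book category="COOKING"><title lang="en">Everyday Italian</title>
    <author>Giada De Laurentiis</author><year>2005</year><price>30.00</price></book>
  <book category="CHILDREN"><title lang="en">Harry Potter</title>
    <author>J K. Rowling</author><year>2005</year><price>29.99</price></book>
  <book category="WEB"><title lang="en">XQuery Kick Start</title>
    <author>James McGovern</author><author>Per Bothner</author>
    <author>Simon St. Laurent</author><author>Linda Burroughes</author>
    <year>2006</year><price>49.99</price></book>
  <book category="WEB"><title lang="en">Learning XML</title>
    <author>Erik T. Ray</author><year>2003</year><price>39.95</price></book>
</bookstore>
""".strip()

root = etree.fromstring(BOOKS_XML)

harry_2005 = root.xpath("//book[contains(title, 'Harry') and year = 2005]/title/text()")
price_total = round(root.xpath("sum(//book/price)"), 2)   # sum() returns a float
round_prices = root.xpath("//book[substring-after(price, '.') = '00']/title/text()")
book_count = int(root.xpath("count(//book)"))             # count() also returns a float

print(harry_2005, price_total, round_prices, book_count)
```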

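XPath Operators

XPath supports operators for arithmetic, comparison, and logical operations, as well as for combining node sets.

Arithmetic operators

Operator	Description	Example	Result
`+`	Addition	`6 + 4`	`10`
`-`	Subtraction	`6 - 4`	`2`
`*`	Multiplication	`6 * 4`	`24`
`div`	Division	`8 div 4`	`2`
`mod`	Modulus (remainder)	`5 mod 2`	`1`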

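Comparison operators

Operator	Description	Example	Result
`=`	Equal	`price = 30.00`	True if the value of `price` equals `30.00`.
`!=`	Not equal	`price != 30.00`	True if the value of `price` does not equal `30.00`.
`<`	Less than	`price < 30.00`	True if the value of `price` is less than `30.00`.
`<=`	Less than or equal	`price <= 30.00`	True if the value of `price` is less than or equal to `30.00`.
`>`	Greater than	`price > 30.00`	True if the value of `price` is greater than `30.00`.
`>=`	Greater than or equal	`price >= 30.00`	True if the value of `price` is greater than or equal to `30.00`.

Logical operators

Operator	Description	Example	Result
`and`	Logical AND	`year = 2005 and price > 20.00`	True if `year` equals `2005` and `price` is greater than `20.00`.
`or`	Logical OR	`year = 2005 or price > 20.00`	True if `year` equals `2005` or `price` is greater than `20.00`.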

Union operator

Operator	Description	Example
`|`	Computes the union of two node sets.	`//title | //price`

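Operator examples:

//book[price > 30 and year < 2006]: selects the <book> elements whose price is greater than 30 and whose year is less than 2006.
//book[author = 'J K. Rowling' or author = 'Erik T. Ray']: selects the <book> elements whose author is 'J K. Rowling' or 'Erik T. Ray'.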
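The operators can be exercised with lxml as well. The sketch below assumes a condensed copy of books.xml and includes a union query, since the `|` operator is the one most often overlooked:

```python
from lxml import etree

# Condensed copy of the sample books.xml.
BOOKS_XML = """
<bookstore>
  <book category="COOKING"><title lang="en">Everyday Italian</title>
    <author>Giada De Laurentiis</author><year>2005</year><price>30.00</price></book>
  <book category="CHILDREN"><title lang="en">Harry Potter</title>
    <author>J K. Rowling</author><year>2005</year><price>29.99</price></book>
  <book category="WEB"><title lang="en">XQuery Kick Start</title>
    <author>James McGovern</author><author>Per Bothner</author>
    <author>Simon St. Laurent</author><author>Linda Burroughes</author>
    <year>2006</year><price>49.99</price></book>
  <book category="WEB"><title lang="en">Learning XML</title>
    <author>Erik T. Ray</author><year>2003</year><price>39.95</price></book>
</bookstore>
""".strip()

root = etree.fromstring(BOOKS_XML)

# Arithmetic: results are always floating-point numbers.
arithmetic = [root.xpath(e) for e in ("6 + 4", "6 - 4", "6 * 4", "8 div 4", "5 mod 2")]

# Comparison and logical operators inside predicates.
pricey_and_old = root.xpath("//book[price > 30 and year < 2006]/title/text()")
either_author = root.xpath(
    "//book[author = 'J K. Rowling' or author = 'Erik T. Ray']/title/text()"
)

# Union: title and price of the first book, returned in document order.
union_tags = [n.tag for n in root.xpath("//book[1]/title | //book[1]/price")]

print(arithmetic, pricey_and_old, either_author, union_tags)
```

Everyday Italian is not in `pricey_and_old` because its price is exactly 30.00 and the comparison is strict.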

XPath in Practice

To show XPath in practice, this section uses Python's lxml library to extract data from an HTML document. Assume the following HTML structure:

<!DOCTYPE html>
<html>
<head>
    <meta charset="utf-8">
    <title>新闻列表页</title>
</head>
<body>
    <div id="header">
        <h1>最新新闻</h1>
    </div>
    <div id="main-content">
        <ul class="news-list">
            <li class="news-item" data-id="001">
                <a href="/news/001.html" class="news-title">新闻标题一：科技巨头发布创新产品</a>
                <p class="news-summary">摘要：某科技公司今日发布了一款颠覆性智能设备，引发市场广泛关注。</p>
                <span class="news-date">2025-07-01</span>
                <div class="tags">
                    <span>科技</span>
                    <span>新品</span>
                </div>
            </li>
            <li class="news-item" data-id="002">
                <a href="/news/002.html" class="news-title">新闻标题二：全球经济形势分析</a>
                <p class="news-summary">摘要：专家指出，全球经济正面临多重挑战，需警惕潜在风险。</p>
                <span class="news-date">2025-06-30</span>
                <div class="tags">
                    <span>经济</span>
                    <span>分析</span>
                </div>
            </li>
            <li class="news-item" data-id="003">
                <a href="/news/003.html" class="news-title">新闻标题三：文化遗产保护新进展</a>
                <p class="news-summary">摘要：一项新的文化遗产保护计划在全国范围内启动，旨在传承和弘扬传统文化。</p>
                <span class="news-date">2025-06-29</span>
                <div class="tags">
                    <span>文化</span>
                    <span>遗产</span>
                </div>
            </li>
            <li class="news-item" data-id="004">
                <a href="/news/004.html" class="news-title">新闻标题四：健康生活方式指南</a>
                <p class="news-summary">摘要：健康专家分享了保持身心健康的实用建议，助您拥有活力人生。</p>
                <span class="news-date">2025-07-01</span>
                <div class="tags">
                    <span>健康</span>
                    <span>生活</span>
                </div>
            </li>
        </ul>
    </div>
    <div id="footer">
        <p>&copy; 2025 新闻网</p>
    </div>
</body>
</html>

The script below parses this HTML with lxml and runs a series of XPath queries against it.

from lxml import etree

html_doc_news = r"""
<!DOCTYPE html>
<html>
<head>
    <meta charset="utf-8">
    <title>新闻列表页</title>
</head>
<body>
    <div id="header">
        <h1>最新新闻</h1>
    </div>
    <div id="main-content">
        <ul class="news-list">
            <li class="news-item" data-id="001">
                <a href="/news/001.html" class="news-title">新闻标题一：科技巨头发布创新产品</a>
                <p class="news-summary">摘要：某科技公司今日发布了一款颠覆性智能设备，引发市场广泛关注。</p>
                <span class="news-date">2025-07-01</span>
                <div class="tags">
                    <span>科技</span>
                    <span>新品</span>
                </div>
            </li>
            <li class="news-item" data-id="002">
                <a href="/news/002.html" class="news-title">新闻标题二：全球经济形势分析</a>
                <p class="news-summary">摘要：专家指出，全球经济正面临多重挑战，需警惕潜在风险。</p>
                <span class="news-date">2025-06-30</span>
                <div class="tags">
                    <span>经济</span>
                    <span>分析</span>
                </div>
            </li>
            <li class="news-item" data-id="003">
                <a href="/news/003.html" class="news-title">新闻标题三：文化遗产保护新进展</a>
                <p class="news-summary">摘要：一项新的文化遗产保护计划在全国范围内启动，旨在传承和弘扬传统文化。</p>
                <span class="news-date">2025-06-29</span>
                <div class="tags">
                    <span>文化</span>
                    <span>遗产</span>
                </div>
            </li>
            <li class="news-item" data-id="004">
                <a href="/news/004.html" class="news-title">新闻标题四：健康生活方式指南</a>
                <p class="news-summary">摘要：健康专家分享了保持身心健康的实用建议，助您拥有活力人生。</p>
                <span class="news-date">2025-07-01</span>
                <div class="tags">
                    <span>健康</span>
                    <span>生活</span>
                </div>
            </li>
        </ul>
    </div>
    <div id="footer">
        <p>&copy; 2025 新闻网</p>
    </div>
</body>
</html>
"""

html_tree = etree.HTML(html_doc_news)

# 1. Extract every news title together with its link
news_items = html_tree.xpath("//li[@class='news-item']")
for item in news_items:
    title = item.xpath(".//a[@class='news-title']/text()")[0]
    link = item.xpath(".//a[@class='news-title']/@href")[0]
    print(f"Title: {title}, link: {link}")

# 2. Extract the summary of every news item
summaries = html_tree.xpath("//p[@class='news-summary']/text()")
print("\nAll news summaries:")
for s in summaries:
    print(s)

# 3. Extract the titles of the news items published on 2025-07-01
daily_news_titles = html_tree.xpath("//li[@class='news-item'][.//span[@class='news-date' and text()='2025-07-01']]/a[@class='news-title']/text()")
print("\nNews titles published on 2025-07-01:", daily_news_titles)

# 4. Extract the data-id of the news items tagged '科技'
tech_news_ids = html_tree.xpath("//li[@class='news-item'][.//div[@class='tags']/span[text()='科技']]/@data-id")
print("\nIDs of news items tagged '科技':", tech_news_ids)

# 5. Extract the first tag of every news item
first_tags = html_tree.xpath("//li[@class='news-item']/div[@class='tags']/span[1]/text()")
print("\nFirst tag of every news item:", first_tags)

# 6. Extract the second tag of every news item (if present)
second_tags = html_tree.xpath("//li[@class='news-item']/div[@class='tags']/span[2]/text()")
print("\nSecond tag of every news item:", second_tags)

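# 7. Extract all publication dates and sort them newest first
# XPath 1.0 cannot sort, so select the dates with XPath and sort them in Python.
all_dates = html_tree.xpath("//span[@class='news-date']/text()")
print("\nAll news dates (unsorted):")
print(all_dates)
print("All news dates (newest first):")
print(sorted(all_dates, reverse=True))  # ISO dates sort correctly as strings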

# 8. Extract the links whose title contains '经济' or '文化'
eco_culture_links = html_tree.xpath("//a[contains(text(), '经济') or contains(text(), '文化')]/@href")
print("\nLinks of news titles containing '经济' or '文化':", eco_culture_links)

# 9. Select all direct child elements of the div with id main-content
main_content_children = html_tree.xpath("//div[@id='main-content']/*")
print("\nTag names of the direct children of main-content:")
for child in main_content_children:
    print(child.tag)

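# 10. Select the sibling <li> items that follow the first news item
# (following-sibling::li returns all of them; append [1] to get only the next one)
following_siblings_of_first_news = html_tree.xpath("//li[@class='news-item'][1]/following-sibling::li")
print("\nTitles of the siblings after the first news item:")
for item in following_siblings_of_first_news:
    print(item.xpath(".//a[@class='news-title']/text()")[0])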
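The script above indexes results with `[0]`, which raises `IndexError` as soon as one item lacks the expected element. One possible way to make the extraction more tolerant is sketched below. The shortened HTML, the helper `first_or_none`, and the record layout are all assumptions made for this illustration, not part of the original script:

```python
from lxml import etree

# Hypothetical, shortened news list; the second item has no title link.
html = """
<ul class="news-list">
  <li class="news-item" data-id="001">
    <a href="/news/001.html" class="news-title">Headline one</a>
    <span class="news-date">2025-06-30</span>
  </li>
  <li class="news-item" data-id="002">
    <span class="news-date">2025-07-01</span>
  </li>
</ul>
"""
tree = etree.HTML(html)

def first_or_none(node, expr):
    """Return the first XPath result relative to node, or None if there is none."""
    result = node.xpath(expr)
    return result[0] if result else None

# Build one record per news item using XPath expressions relative to the <li>.
records = [
    {
        "id": first_or_none(item, "@data-id"),
        "title": first_or_none(item, "a[@class='news-title']/text()"),
        "link": first_or_none(item, "a[@class='news-title']/@href"),
        "date": first_or_none(item, "span[@class='news-date']/text()"),
    }
    for item in tree.xpath("//li[@class='news-item']")
]

# Sorting is done in Python, newest first.
records.sort(key=lambda r: r["date"], reverse=True)

for record in records:
    print(record)
```

Missing fields show up as `None` in the records instead of aborting the whole run, which tends to be easier to handle when scraping pages with inconsistent markup.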

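Conclusion

XPath gives you a flexible and precise way to navigate XML and HTML documents and extract data from them. Once you are comfortable with its path expressions, predicates, axes, functions, and operators, you can locate and extract the information you need from complex document structures efficiently. It is a useful tool for web scraping, automated testing, and data analysis alike.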
