scrape-cli is a command-line tool to extract HTML elements using an XPath query or a CSS3 selector.
It is based on the great and simple scraping tool written by Jeroen Janssens.
You can install scrape-cli using several methods:
# Install as a global CLI tool with pipx
pipx install scrape-cli
# Or install as a global CLI tool with uv (recommended)
uv tool install scrape-cli
# Or install with uv pip
uv pip install scrape-cli
# Or run temporarily without installing
uvx scrape-cli --help
# Or install with pip
pip install scrape-cli
Or install from source:
git clone https://github.com/aborruso/scrape-cli
cd scrape-cli
pip install -e .
In the resources directory you’ll find a test.html file that you can use to test various scraping scenarios.
Note: You can also test directly from the URL without cloning the repository:
scrape -e "h1" https://raw.githubusercontent.com/aborruso/scrape-cli/refs/heads/master/resources/test.html
Here are some examples:
# CSS: table cells
scrape -e "table.data-table td" resources/test.html
# XPath: table cells
scrape -e "//table[contains(@class, 'data-table')]//td" resources/test.html
# CSS: list items
scrape -e "ul.items-list li" resources/test.html
# XPath: list items
scrape -e "//ul[contains(@class, 'items-list')]/li" resources/test.html
# CSS: href attribute of links (via -a)
scrape -e "a.external-link" -a href resources/test.html
# XPath: href attribute of links
scrape -e "//a[contains(@class, 'external-link')]/@href" resources/test.html
# CSS: check whether an element exists
scrape -e "#main-title" --check-existence resources/test.html
# XPath: check whether an element exists
scrape -e "//h1[@id='main-title']" --check-existence resources/test.html
# CSS: nested elements
scrape -e ".nested-elements p" resources/test.html
# XPath: nested elements
scrape -e "//div[contains(@class, 'nested-elements')]//p" resources/test.html
# CSS: elements with a data-test attribute
scrape -e "[data-test]" resources/test.html
# XPath: elements with a data-test attribute
scrape -e "//*[@data-test]" resources/test.html
# Get all links with href attribute
scrape -e "//a[@href]" resources/test.html
# Get checked input elements
scrape -e "//input[@checked]" resources/test.html
# Get elements with multiple classes
scrape -e "//div[contains(@class, 'class1') and contains(@class, 'class2')]" resources/test.html
# Get text content of specific element
scrape -e "//h1[@id='main-title']/text()" resources/test.html
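Because scrape writes the matched elements to standard output, it composes with ordinary Unix tools. A minimal sketch (assuming, as in the runs above, that each matched element is printed on its own line):

```shell
# Count how many cells the selector matches
# (assumes one matched element per output line)
scrape -e "table.data-table td" resources/test.html | wc -l
```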
A CSS selector query like this one:
curl -L 'https://en.wikipedia.org/wiki/List_of_sovereign_states' -s \
| scrape -be 'table.wikitable > tbody > tr > td > b > a'
Note: When using both -b and -e options together, they must be specified in the order -be (body first, then expression). Using -eb will not work correctly.
or an XPath query like this one:
curl -L 'https://en.wikipedia.org/wiki/List_of_sovereign_states' -s \
| scrape -be "//table[contains(@class, 'wikitable')]/tbody/tr/td/b/a"
gives you back:
<html>
<head>
</head>
<body>
<a href="/wiki/Afghanistan" title="Afghanistan">
Afghanistan
</a>
<a href="/wiki/Albania" title="Albania">
Albania
</a>
<a href="/wiki/Algeria" title="Algeria">
Algeria
</a>
<a href="/wiki/Andorra" title="Andorra">
Andorra
</a>
<a href="/wiki/Angola" title="Angola">
Angola
</a>
<a href="/wiki/Antigua_and_Barbuda" title="Antigua and Barbuda">
Antigua and Barbuda
</a>
<a href="/wiki/Argentina" title="Argentina">
Argentina
</a>
<a href="/wiki/Armenia" title="Armenia">
Armenia
</a>
...
...
</body>
</html>
You can extract only the text content (without HTML tags) using the -t option, which is particularly useful for LLMs and text processing:
# Extract all text content from a page
curl -L 'https://en.wikipedia.org/wiki/List_of_sovereign_states' -s \
| scrape -t
# Extract text from specific elements
curl -L 'https://en.wikipedia.org/wiki/List_of_sovereign_states' -s \
| scrape -te 'table.wikitable td'
# Extract text from headings only
scrape -te 'h1, h2, h3' resources/test.html
The -t option automatically excludes text from <script> and <style> tags and cleans up whitespace for better readability.
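As a sketch of such a text-processing pipeline (the choice of paragraphs and the 4000-byte cap are arbitrary illustrations):

```shell
# Fetch a page, keep only the text of its paragraphs,
# and cap the size before feeding it to an LLM prompt
curl -sL 'https://en.wikipedia.org/wiki/List_of_sovereign_states' \
  | scrape -te 'p' \
  | head -c 4000 > prompt.txt
```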
You can integrate scrape-cli with xq (part of yq) to convert HTML output to structured JSON:
# Extract and convert to JSON (requires -b for complete HTML)
scrape -be "a.external-link" resources/test.html | xq .
Output:
{
"html": {
"body": {
"a": {
"@href": "https://example.com",
"@class": "external-link",
"#text": "Example Link"
}
}
}
}
Table extraction example:
scrape -be "table.data-table td" resources/test.html | xq .
Output:
{
"html": {
"body": {
"td": [
"1",
"John Doe",
"john@example.com",
"2",
"Jane Smith",
"jane@example.com"
]
}
}
}
Note: The -b flag is mandatory to produce valid HTML with <html>, <head> and <body> tags.
Useful for JSON-based pipelines, APIs, databases, and processing with jq/DuckDB.
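Since xq accepts jq filters, you can keep drilling into the JSON in the same pipeline; a sketch based on the td array shown above (assuming the output keeps that .html.body.td shape):

```shell
# Print one table cell per line from the JSON produced above
scrape -be "table.data-table td" resources/test.html \
  | xq -r '.html.body.td[]'
```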
Some notes on the commands:
-e to set the query
-a to extract the value of an attribute (e.g. -a href)
-b to add <html>, <head> and <body> tags to the HTML output
-t to extract only text content (useful for LLMs and text processing)
--check-existence to check whether the selected element exists