scrape-cli

Extract HTML elements from the command line using CSS or XPath

Pipe-friendly. Simple. Powerful.

Why scrape-cli?

Built for the terminal. Designed for pipelines.

Simple

One flag to extract, one to wrap. No boilerplate. No config files.

🎯

CSS & XPath

Use the selector language you already know. Switch anytime, same result.

🔗

Pipeline-friendly

Reads stdin, writes stdout. Composes naturally with curl, jq, xq.

🤖

LLM-ready

The -t flag extracts clean plain text, perfect for AI pipelines.

How it works

CSS selectors and XPath — same result, your choice

01 Extract sovereign states from Wikipedia
# CSS selector
curl -L 'https://en.wikipedia.org/wiki/List_of_sovereign_states' -s \
| scrape -be 'table.wikitable > tbody > tr > td > b > a'

# XPath expression
curl -L 'https://en.wikipedia.org/wiki/List_of_sovereign_states' -s \
| scrape -be "//table[contains(@class, 'wikitable')]/tbody/tr/td/b/a"
02 Extract table data
# CSS selector
scrape -e "table.data-table td" resources/test.html

# XPath expression
scrape -e "//table[contains(@class, 'data-table')]//td" resources/test.html
03 Extract link hrefs
# CSS selector
scrape -e "a.external-link" -a href resources/test.html

# XPath expression
scrape -e "//a[contains(@class, 'external-link')]/@href" resources/test.html

Key flags

-e CSS selector or XPath expression
-b Wrap output in html/head/body
-t Extract plain text only
-a ATTR Extract attribute value
--check-existence Exit 0 if found, 1 if not

Installation

Get started in seconds

$ pipx install scrape-cli
$ uv tool install scrape-cli
$ pip install scrape-cli

Python ≥ 3.6  ·  requires: requests, lxml, cssselect

Practical examples

Real-world use cases straight from the terminal

Extract & convert to JSON

Pipe to xq for structured output

scrape -be "a.external-link" resources/test.html | xq .

Requires xq (kislyuk/yq) for XML/HTML-to-JSON conversion.

{
  "html": {
    "body": {
      "a": {
        "@href": "https://example.com",
        "@class": "external-link",
        "#text": "Example Link"
      }
    }
  }
}

Extract text for LLMs

Clean plain text, no HTML tags

curl -L 'https://en.wikipedia.org/wiki/List_of_sovereign_states' -s \
| scrape -te 'table.wikitable td'

The -t flag strips HTML tags, excludes <script> and <style>, and cleans up whitespace — ideal for feeding content into an LLM or text pipeline.
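That behavior can be approximated in a few lines of lxml (a sketch of the idea, not scrape-cli's actual code): drop the script/style subtrees, take the remaining text content, and collapse runs of whitespace.

```python
import re
from lxml import html

raw = """
<div>
  <style>td { color: red }</style>
  <script>console.log("noise")</script>
  <p>Afghanistan</p> <p>Albania</p>
</div>
"""

doc = html.fromstring(raw)

# Exclude <script> and <style> subtrees, as -t does.
for node in doc.xpath("//script | //style"):
    node.getparent().remove(node)

# Strip tags and collapse whitespace into single spaces.
text = re.sub(r"\s+", " ", doc.text_content()).strip()
print(text)  # Afghanistan Albania
```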

Check element existence

Scriptable exit codes for automation

scrape -e "#main-title" --check-existence resources/test.html
0 Element found
1 Not found

Pipeline with curl

Scrape live web content instantly

curl -L 'https://en.wikipedia.org/wiki/List_of_sovereign_states' -s \
| scrape -be 'table.wikitable > tbody > tr > td > b > a'
<a href="/wiki/Afghanistan">Afghanistan</a>
<a href="/wiki/Albania">Albania</a>
<a href="/wiki/Algeria">Algeria</a>
...