scrape cli

It’s a command-line tool to extract HTML elements using an XPath query or CSS3 selector.

It’s based on the great and simple scraping tool written by Jeroen Janssens.

Installation

You can install scrape-cli using pipx:

pipx install scrape-cli

Or using pip:

pip install scrape-cli

Or install from source:

git clone https://github.com/aborruso/scrape-cli
cd scrape-cli
pip install -e .

Requirements

How does it work?

A CSS selector query like this

curl -L 'https://en.wikipedia.org/wiki/List_of_sovereign_states' -s \
| scrape -be 'table.wikitable > tbody > tr > td > b > a'

or an XPath query like this one:

curl -L 'https://en.wikipedia.org/wiki/List_of_sovereign_states' -s \
| scrape -be '//table[contains(@class, "wikitable")]/tbody/tr/td/b/a'

gives you back:

<html>
 <head>
 </head>
 <body>
  <a href="/wiki/Afghanistan" title="Afghanistan">
   Afghanistan
  </a>
  <a href="/wiki/Albania" title="Albania">
   Albania
  </a>
  <a href="/wiki/Algeria" title="Algeria">
   Algeria
  </a>
  <a href="/wiki/Andorra" title="Andorra">
   Andorra
  </a>
  <a href="/wiki/Angola" title="Angola">
   Angola
  </a>
  <a href="/wiki/Antigua_and_Barbuda" title="Antigua and Barbuda">
   Antigua and Barbuda
  </a>
  <a href="/wiki/Argentina" title="Argentina">
   Argentina
  </a>
  <a href="/wiki/Armenia" title="Armenia">
   Armenia
  </a>
...
...
 </body>
</html>
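
A quoting detail matters for XPath queries: the shell ends a single-quoted argument at the next single quote, so string literals inside the expression need double quotes. The sketch below uses only shell variables and `printf` to show what the shell would actually hand to a program in each case. The variable names `broken` and `fixed` are illustrative, and no scrape options are assumed.

```shell
# Nested single quotes: the shell closes the string early, so the inner
# quotes disappear and `wikitable` is no longer an XPath string literal.
broken='//table[contains(@class, 'wikitable')]'

# Double quotes inside single quotes are passed through untouched.
fixed='//table[contains(@class, "wikitable")]'

printf '%s\n' "$broken"   # prints: //table[contains(@class, wikitable)]
printf '%s\n' "$fixed"    # prints: //table[contains(@class, "wikitable")]
```

With the first form, an XPath engine would most likely read `wikitable` as a child element name rather than a string, so the class test would not match what was intended.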

Some notes on the commands:

Linux 64-bit precompiled binary

If you are looking for precompiled executables for Linux, please refer to the Releases page on GitHub where you can find the latest precompiled binary file.

I built the scrape-linux-x86_64 precompiled binary with pyinstaller, using this command: pyinstaller --onefile scrape.py.

The result is a single executable file that can be used in a 64-bit Linux environment.

License

MIT