List Extraction & Scrapers

Grab Data From a List and Create Your First Scraper

When a Web page includes one or more HTML lists, as in the example below, use the lists view in the data section (refresh if the data doesn't appear) to extract the data, then move it to 'the Catch' or export it in MS Excel, HTML, SQL, or CSV format.

(Please scroll down for a more advanced mode of extraction.)

TIP: In the 'lists' view, like in most other views, right-clicking on selected rows gives you access to a wealth of features to edit and clean the data.


 

Largest metropolitan areas by continent

  • Asia
    1.  Shanghai, China (31.23°N, 121.47°E): 15M inhab.
    2.  Bombay, India (18.96°N, 72.82°E): 12.9M inhab.
    3.  Karachi, Pakistan (24.86°N, 67.01°E): 12M inhab.
    4.  Delhi, India (28.67°N, 77.21°E): 11.2M inhab.
    5.  Manila, Philippines (14.62°N, 120.97°E): 10.5M inhab.
    6.  Seoul, South Korea (37.56°N, 126.99°E): 10.4M inhab.
    7.  Jakarta, Indonesia (6.18°S, 106.83°E): 8.6M inhab.
    8.  Tokyo, Japan (35.67°N, 139.77°E): 8.4M inhab.
  • South America
    1.  Buenos Aires, Argentina (34.61°S, 58.37°W): 11.6M inhab.
    2.  São Paulo, Brazil (23.53°S, 46.63°W): 10M inhab.
    3.  Mexico City, Mexico (19.43°N, 99.14°W): 8.7M inhab.
  • Europe
    1.  Moscow, Russia (55.75°N, 37.62°E): 10.5M inhab.
    2.  Istanbul, Turkey (41.10°N, 29.00°E): 10M inhab.
  • Africa
    1.  Lagos, Nigeria (6.45°N, 3.47°E): 9M inhab.
  • North America
    1.  New York, United States (40.67°N, 73.94°W): 8.1M inhab.
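
For readers curious about what a list extraction involves, here is a minimal Python sketch that collects the items of a nested HTML list like the one above and writes them out as CSV. It is only an illustration using the Python standard library, not the program's own implementation; the names `ListExtractor`, `extract_list_rows`, `rows_to_csv` and the `sample_html` markup are assumptions made for this example.

```python
import csv
import io
from html.parser import HTMLParser


class ListExtractor(HTMLParser):
    """Collect the text of every <li>, with its nesting depth (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # how many <ul>/<ol> elements are currently open
        self.rows = []    # one [depth, text] entry per <li>
        self._open = []   # indexes into self.rows for the <li> elements still open

    def handle_starttag(self, tag, attrs):
        if tag in ("ul", "ol"):
            self.depth += 1
        elif tag == "li":
            self.rows.append([self.depth, ""])
            self._open.append(len(self.rows) - 1)

    def handle_endtag(self, tag):
        if tag in ("ul", "ol"):
            self.depth -= 1
        elif tag == "li" and self._open:
            self._open.pop()

    def handle_data(self, data):
        # Text belongs to the innermost <li> that is still open.
        if self._open:
            self.rows[self._open[-1]][1] += data


def extract_list_rows(html):
    parser = ListExtractor()
    parser.feed(html)
    parser.close()
    return [(depth, text.strip()) for depth, text in parser.rows]


def rows_to_csv(rows):
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["depth", "item"])
    writer.writerows(rows)
    return buffer.getvalue()


# Assumed markup: a shortened version of the example list.
sample_html = """
<ul>
  <li>Europe
    <ol>
      <li>Moscow, Russia (55.75°N, 37.62°E): 10.5M inhab.</li>
      <li>Istanbul, Turkey (41.10°N, 29.00°E): 10M inhab.</li>
    </ol>
  </li>
  <li>Africa
    <ol>
      <li>Lagos, Nigeria (6.45°N, 3.47°E): 9M inhab.</li>
    </ol>
  </li>
</ul>
"""

rows = extract_list_rows(sample_html)
csv_text = rows_to_csv(rows)
print(csv_text)
```

Each row keeps its nesting depth, so continents (depth 1) can be told apart from cities (depth 2) after export.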
 
     
Quick Start — Example 3 (Advanced) Create Your First Scraper

In many cases, the automatic data extraction methods (tables, lists, guess) will be enough, and you will manage to extract and export the data in just a few clicks. If, however, the page is too complex, or if your needs are more specific, you can extract data manually by creating a scraper. Scrapers are saved to your personal database, and you can re-apply them to the same URL or to other URLs, for instance those starting with the same domain name. A scraper can even be applied to whole lists of URLs. You can also export your scrapers and share them with other users.

In our present example, if the data, as extracted in the list widget, is not structured enough for your needs, you will have to create a specific scraper for this page. The Scraper Editor is rather easy to use. Go to the scrapers view and you will see the colorized HTML source of the page:

The text in black is what is actually displayed in the page. This colorization makes it very easy to identify the data you are interested in. Building a scraper is simply telling the program what comes immediately before and/or after the data you want to extract and/or what its format is.
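
The "marker before / marker after" idea can be sketched in a few lines of Python. This is only an illustration of the principle, not how the Scraper Editor is implemented; the function name `scrape_between` and the sample markup are assumptions made for this example.

```python
import re


def scrape_between(source, before, after):
    """Return every piece of text found between two literal markers."""
    # re.escape makes the markers match literally; (.*?) captures as little as possible.
    pattern = re.escape(before) + r"(.*?)" + re.escape(after)
    return [found.strip() for found in re.findall(pattern, source, flags=re.DOTALL)]


# Assumed page source: two items from the example list.
sample_html = (
    "<ol>"
    "<li>Moscow, Russia (55.75°N, 37.62°E): 10.5M inhab.</li>"
    "<li>Istanbul, Turkey (41.10°N, 29.00°E): 10M inhab.</li>"
    "</ol>"
)

# City: whatever sits between "<li>" and the first comma.
cities = scrape_between(sample_html, "<li>", ",")
# Population: whatever sits between "): " and " inhab."
populations = scrape_between(sample_html, "): ", " inhab.")

print(cities)
print(populations)
```

Each pair of markers plays the role of one line in a scraper: a description of the field, then what comes before and after it.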
If this is your very first scraper, you are taken directly to the scraper editor. Otherwise, you are in the scraper manager and see the list of your other scrapers; in that case, hit the "New" button and type in a name for your new scraper. Once in the scraper editor, just fill in the description and marker cells (double-click on a cell to edit it). Your first version should look like this:

Scraper screenshot

Hit Execute, and... that's it! You are running your first scraper.

Brava! or Bravo! You did it.
Just go to the scraped view to see your result:


OK, the present example is not all that exciting and the figures are already out of date. It would almost be faster to do the 15 rows manually. But what if the data filled 20 pages and we updated the population figures tomorrow? Better: what if the data changed every morning, like job ads, sports results or stock market indices? No problem; you would simply re-apply your new scraper.


TIP: the online Help is pretty complete. You will find a good description of Scrapers and Regular Expressions there. Please refer to it and try to create more advanced scrapers. (In the Pro version, if you master regular expressions and the Replace, Separator and Labels fields, you can even separate all the data, including continent, city, country, latitude and longitude, into different columns.)
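
To give an idea of what a regular expression can do with rows like those in the example, here is a small Python sketch that splits each row into continent, city, country, latitude, longitude and population. It uses Python's standard `re` module rather than the program's Replace, Separator and Labels fields, and the pattern `ROW`, the function `split_rows` and the sample input are assumptions made for this illustration.

```python
import re

# One named group per column; assumes rows shaped like
# "City, Country (12.34°N, 56.78°E): 9.9M inhab."
ROW = re.compile(
    r"(?P<city>[^,]+),\s*(?P<country>[^(]+?)\s*"
    r"\((?P<latitude>[\d.]+°[NS]),\s*(?P<longitude>[\d.]+°[EW])\):\s*"
    r"(?P<population>[\d.]+M) inhab\."
)


def split_rows(lines):
    """Turn continent headings and city rows into one record per city."""
    continent = None
    records = []
    for line in lines:
        text = line.strip()
        match = ROW.match(text)
        if match:
            records.append({"continent": continent, **match.groupdict()})
        elif text:
            # Any non-empty line that is not a city row is treated as a continent heading.
            continent = text
    return records


sample_lines = [
    "Asia",
    "Shanghai, China (31.23°N, 121.47°E): 15M inhab.",
    "North America",
    "New York, United States (40.67°N, 73.94°W): 8.1M inhab.",
]

records = split_rows(sample_lines)
for record in records:
    print(record)
```

Rows that do not fit the assumed shape are treated as headings, so a real page would probably need a stricter check.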