Quick Start —

List Extraction & Scrapers

Grab Data From a List

When a Web page includes one or more HTML lists, as in the example below, use lists in the data tab (refresh if the data doesn't appear) to extract the data and move it to 'the Catch' or export it to MS Excel.

(Please scroll down for a more advanced mode of extraction.)

TIP: In the 'lists' tab, like in most other tabs, right-clicking on selected rows gives you access to a wealth of features to edit and clean the data.


 

Largest metropolitan areas by continent

  • Asia
    1. Shanghai, China (31.23°N, 121.47°E): 15M inhab.
    2. Bombay, India (18.96°N, 72.82°E): 12.9M inhab.
    3. Karachi, Pakistan (24.86°N, 67.01°E): 12M inhab.
    4. Delhi, India (28.67°N, 77.21°E): 11.2M inhab.
    5. Manila, Philippines (14.62°N, 120.97°E): 10.5M inhab.
    6. Seoul, South Korea (37.56°N, 126.99°E): 10.4M inhab.
    7. Jakarta, Indonesia (6.18°S, 106.83°E): 8.6M inhab.
    8. Tokyo, Japan (35.67°N, 139.77°E): 8.4M inhab.
  • South America
    1. Buenos Aires, Argentina (34.61°S, 58.37°W): 11.6M inhab.
    2. São Paulo, Brazil (23.53°S, 46.63°W): 10M inhab.
    3. Mexico City, Mexico (19.43°N, 99.14°W): 8.7M inhab.
  • Europe
    1. Moscow, Russia (55.75°N, 37.62°E): 10.5M inhab.
    2. Istanbul, Turkey (41.10°N, 29.00°E): 10M inhab.
  • Africa
    1. Lagos, Nigeria (6.45°N, 3.47°E): 9M inhab.
  • North America
    1. New York, United States (40.67°N, 73.94°W): 8.1M inhab.
 
     
Quick Start — Example 3 (Advanced) Create Your First Scraper

In many cases, the automatic data extraction methods (tables, lists, guess) will be enough, and you will manage to extract and export the data in just a few clicks. If, however, the page is too complex, or if your needs are more specific, there is a way to extract data manually, by creating a scraper. Scrapers are saved to your personal database, and you can re-apply them to the same URL or to other URLs starting, for instance, with the same domain name. A scraper can even be applied to whole lists of URLs. You can also export your scrapers and share them with other users.

In our present example, if the data, as extracted in the list widget, is not structured enough for your needs, you will have to create a specific scraper for this page. The Scraper Editor is very far from final. With a little practice, however, it can already be of great help. Go to the source tab, where you will see the colorized HTML source of the page:

The text in black is what is actually displayed in the page. This colorization makes it very easy to identify the data you are interested in. Building a scraper is simply telling the program what comes immediately before and/or after the data you want to extract and/or what its format is. Click on new, type in the URL of the page and a name for your new scraper. Your first version should logically look like this:
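As an aside, the marker principle described above can be illustrated outside the program with a few lines of Python. This is only a sketch of the idea, not the program's own implementation: the function name extract_between and the two-line HTML sample are assumptions made for the illustration.

```python
def extract_between(text, before, after):
    """Return every substring found between `before` and the next `after`."""
    results = []
    pos = 0
    while True:
        start = text.find(before, pos)
        if start == -1:
            break
        start += len(before)
        end = text.find(after, start)
        if end == -1:
            break
        results.append(text[start:end])
        pos = end + len(after)
    return results


# Hypothetical source lines, modeled on the list in the example page.
sample = (
    "<li>Shanghai, China (31.23°N, 121.47°E): 15M inhab.</li>\n"
    "<li>Bombay, India (18.96°N, 72.82°E): 12.9M inhab.</li>\n"
)

# Coordinates: whatever sits between "(" and ")".
coordinates = extract_between(sample, "(", ")")
# Population: whatever sits between "): " and " inhab.".
populations = extract_between(sample, "): ", " inhab.")

print(coordinates)
print(populations)
```

Each field is defined only by what comes before and after it, which is why a marker that also occurs elsewhere in the page can capture unwanted text.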


Hit save, and that's it! You are ready to run your first scraper. If you now go to the scraper tab (under the data tab) and hit refresh, the results are there.
Oops! They are not bad... but not totally satisfying:


The first row contains a large bunch of text instead of the Coordinates, and the City is missing. Another look at the source code explains it. The parenthesis (, which is used as the Marker Before Coordinates, appears in the introduction text:


You must therefore be a little more precise and define at least the format of the first character that must be found after the marker. A good way to do this is to use Regular Expression syntax in the Format field. "RegExps" can become pretty tricky if you need to find complex patterns, but here what you want to say is simple: "a string that starts with a digit". For this, you need to type \d.+ (a digit, \d, followed by a series of one or more characters, .+):
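To see the effect of such a format outside the program, here is a small Python sketch using the standard re module. The helper extract_with_format and the sample text (including its introduction line containing a parenthesis) are assumptions made for the illustration. The sketch also uses the lazy form \d.+? so that each match stops at the first closing marker; that is a choice of this sketch, not something the Format field is known to require.

```python
import re


def extract_with_format(text, before, fmt, after):
    """Find text between two literal markers that also matches `fmt`."""
    # Markers are escaped so "(" and ")" are read literally, not as regex syntax.
    pattern = re.escape(before) + "(" + fmt + ")" + re.escape(after)
    return re.findall(pattern, text)


# Hypothetical page text: the introduction also contains a parenthesis.
sample = (
    "<p>Largest metropolitan areas (by continent)</p>\n"
    "<li>Shanghai, China (31.23°N, 121.47°E): 15M inhab.</li>\n"
    "<li>Jakarta, Indonesia (6.18°S, 106.83°E): 8.6M inhab.</li>\n"
)

# Without a format constraint, the introduction text is captured too.
any_text = extract_with_format(sample, "(", r".+?", ")")
# Requiring a leading digit keeps only the coordinates.
starts_with_digit = extract_with_format(sample, "(", r"\d.+?", ")")

print(any_text)
print(starts_with_digit)
```

The first call returns three items, the first being the unwanted "by continent"; the second returns only the two coordinate pairs.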


Hit save. Back in the scraper tab, the new result is not bad (hit refresh). One last problem, though: the first city took its continent along...


Let's have a look at the source code one last time:


<li>, our Marker Before City, also appears before the continent. A simple way, here, is to select all the characters between the beginning of the line and the city name, and copy them into the scraper editor. It makes the marker more specific, and it will keep working because all cities are at the same indentation level:
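The same problem, and the fix, can be reproduced in a short Python sketch. The nested list markup and its indentation below are assumptions (the real page source may be laid out differently), and extract_between is a hypothetical helper, not part of the program.

```python
def extract_between(text, before, after):
    """Return every substring found between `before` and the next `after`."""
    results = []
    pos = 0
    while True:
        start = text.find(before, pos)
        if start == -1:
            break
        start += len(before)
        end = text.find(after, start)
        if end == -1:
            break
        results.append(text[start:end])
        pos = end + len(after)
    return results


# Hypothetical source: continents and cities both sit in <li> elements,
# but the cities are indented more deeply.
sample = (
    "<ul>\n"
    "  <li>Asia\n"
    "    <ol>\n"
    "      <li>Shanghai, China (31.23°N, 121.47°E): 15M inhab.</li>\n"
    "      <li>Bombay, India (18.96°N, 72.82°E): 12.9M inhab.</li>\n"
    "    </ol>\n"
    "  </li>\n"
    "</ul>\n"
)

# Loose marker: the continent's <li> matches first and drags along the first city.
loose = extract_between(sample, "<li>", ",")
# Specific marker: the indentation plus <li> only matches the city lines.
specific = extract_between(sample, "      <li>", ",")

print(loose)
print(specific)
```

With the loose marker, the first result starts with "Asia" and runs on to "Shanghai"; with the indented marker, only the city names come back. The fix depends on the indentation being identical for every city line.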


Our final scraper looks like this (don't forget to hit save):


And indeed, it works:


OK, the present example is not all that exciting, and the figures are already out of date. It would almost be faster to do the 15 rows manually. But what if the data filled 20 pages and the population figures were updated tomorrow? Better: what if the data changed every morning, like job ads, sports results or stock market indices? No problem: you would simply re-apply your new scraper.


TIP: The online Help is very limited, but you will find a few tips there concerning Scrapers and Regular Expressions. Please refer to it and try to create more advanced scrapers.