The Scraper Editor (OutWit Scrapers Syntax Reference)
In the scrapers view, the bottom part of the window shows either the Scraper Manager or the Scraper Editor. In the Scraper Manager, you can see and organize your scrapers; when you double-click on one of them or
create a new one using the New button, the Scraper Editor opens and you can create or alter your scraper lines.
At the bottom of the editor, the following series of buttons gives you access to the general management functions:
- Save: Saves the current scraper to your profile.
- New: Opens a new blank scraper.
- Duplicate: Duplicates the selected lines.
- Delete: Deletes the current scraper from your profile.
- Export: Saves the scraper as an XML file to your hard disk.
- Import: Loads a previously exported scraper.
- Revert: Restores the last saved version of the scraper.
- Properties: Displays the Scraper Properties dialog, where information about the current scraper can be found and edited (name, author, comments, etc.).
- Close: Closes the Scraper Editor and displays the Scraper Manager.
The Editor itself allows you to define the following information:
Apply if URL contains...: The URL to scrape (or a part thereof). This is the condition for applying the scraper to a page. The string you enter in this field can be a whole URL, part of a URL, or a regular expression starting and ending with a '/'. (In the latter case, the string will be displayed in red if the syntax is invalid.) If you try to scrape a page with the 'Execute' button when this
field doesn't match the URL, an error message will be displayed. If two or more scrapers match the URL of the page to be scraped, priority is given to the most recent one with the most significant condition (longest match).
Note: If you keep getting the message "This scraper is not destined to the current URL", this is the field that must be changed. A frequent mistake is to put a whole URL in this field when the scraper is meant to apply to
several pages. Try to enter only the part of the URL which is common to all the pages you wish to scrape, but specific enough not to match unwanted pages.
You may also get this error message if you are trying to apply a disabled
scraper to a valid URL. In this case, just check the OK checkbox in the scraper manager.
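Example (with purely illustrative URLs): to cover both http://www.mySite.com/products/1.html and http://www.mySite.com/products/2.html, you could enter the common part of the URLs:
Apply if URL contains...:
mySite.com/products/
or an equivalent regular expression:
/mySite\.com\/products\/\d+\.html/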
Source Type (Pro, Expert & Enterprise Editions): You can set your scraper to be applied either to the original source code that was loaded by the browser when opening the page or to the code as it was dynamically
altered by scripts after the page was loaded.
In the Expert & Enterprise editions you can also select the dynamic source as a concatenation of all the frames
composing the page.
In the scraper definition itself, each line corresponds to a field of data to be extracted. (If you leave the scraper empty, its default behavior will be to scrape the whole text of the page, removing HTML tags and styles
and keeping line breaks and basic layout when possible.)
To edit or enter a value in a cell, double-click on the cell. To simplify the creation of scrapers and avoid typos, the best approach is often to select the string you want in the source code and copy or drag it to the
cell. You can then edit it as you like.
- Description (name of the field): can either contain a simple label like "Phone" or "First Name", or a directive (see below).
- A) Marker Before (optional): a string or a regular expression marking the beginning of the data to extract.
- B) Marker After (optional): a string or a regular expression marking the end of the data to extract.
- C) Format (optional): a regular expression describing the format of the data to extract.
- Replace (pro version, optional): replacement pattern (or value to which this field must be set).
- Separator (pro version, optional): delimiter used to split the extracted result into several fields.
- List of Labels (pro version, optional): the list of labels to be used if the result is split into several fields with a separator.
Important Notes:
1) In a scraper, a line doesn't have to include Marker Before (A), Marker After (B) and Format (C). One or two of these fields can be empty. The authorized combinations are: ABC, AC, BC, AB, A, C.
2) When creating a scraper you can right-click on a marker or format field and choose Find in Source to find and highlight the occurrences of a string or a pattern in the source code of the current page. If
you right-click on the description field it will allow you to find the whole scraper line in the source code. This is very useful for troubleshooting.
3) When right-clicking on a marker or format field and choosing Insert From Library>Directive, the program will display a list of the main directives available for your license level. Choose the one you
want and it will insert it with sample parameters before the selected line.
4) Record Separator: By default, the first line of the scraper (which is not a #directive#) will be considered by OutWit Hub as the field that starts a new record. This means that each time this scraper line matches data in
the page, a new record will be created.
Usually, the best way is to follow the order of appearance of the fields in the source document. Note that, if needed, the record separator can also be forced to an arbitrary marker, using the #newRecord# directive
(see below).
In the Format pattern, use regular expression syntax and do not forget to escape reserved characters.
Note: If you right-click on the text you have entered in a cell, an option will allow you to escape a literal string easily. In the Format field, the content will always be interpreted as a regular expression, even if it is not surrounded by '/ /'.
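For example (illustrative values), to match a price like $19.99 in the Format field, the dollar sign and the dot must be escaped, as both are reserved characters in regular expressions:
Format:
\$\d+\.\d\d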
In the Replace string, use \0 to insert the whole string extracted by this line of the scraper, or \1, \2, etc. to include the data captured by parentheses --if any-- in the
Format regular expression.
For instance, say you extract the string "0987654321" with a given scraper line. Adding a replacement pattern can help you rebuild a whole URL from the extracted data:
If you enter:
http://www.mySite.com/play?id=\0&autostart=true
as replacement string, the scraper line will return
http://www.mySite.com/play?id=0987654321&autostart=true
In the Separator, use either a literal string like ',' or ';', or a regular expression like /[,;_\-\/]/.
For technical reasons, all regular expressions used in scrapers are interpreted as case-insensitive patterns by default: [A-Z], [A-Za-z] and [a-z] have the same result. This can be changed using the #caseSensitive# directive. This means
that 'Marker Before', 'Marker After' and 'Format', which are always converted to regular expressions by the program, are case insensitive by default. Conversely, the 'Separator', which is used as is by the program, is case
sensitive by default if it is a literal string, and case insensitive if it was entered as a regular expression.
When splitting the result with a separator, use the List of Labels to assign a field name to each part of the data. Separate the labels with a comma. If there are fewer labels than split elements, or if the Labels field is empty,
default labels will be assigned using the description and an index.
Example: the string you want to extract doesn't have distinctive markers and you do not know how to separate the different elements of the data. Say the source code looks like this:
<li>Dimensions:35x40x70</li>
If you want the three dimensions in three separate columns, you can reconstitute the structure by entering the following:
Marker Before:
<li>Dimensions:
Marker After:
</li>
Separator:
x
Labels:
Height,Width,Depth
Regular Expressions in Your Scraper
To learn about Regular Expressions, please visit the RegExp Quick Start Guide
Regular expressions can be used in several parts of the scraper:
- Apply if URL contains
- Marker Before
- Marker After
- Format
- Separator
They allow you to keep the scraper short if you are comfortable with them. In many cases, however, it is also possible to do without them. For instance, if you need an OR, just create two scraper lines with the same field name (Description).
Example: If you want your scraper to match both 'sedan-4 doors' and 'coupe-2 doors'
The simple way is to do it in two separate lines:
Description:
car
Format:
sedan-4 doors
Description:
car
Format:
coupe-2 doors
Or you can use a regular expression:
Description:
car
Format:
/(sedan\-4|coupe\-2) doors/
Directives (Pro, Expert & Enterprise Editions)
Directives alter the normal behavior of the scraper. They must be entered in the Description column.
Directives are not applied if the scraper doesn't contain at least one line that extracts data (i.e. a line which is not a directive: the extraction of the page title, for instance). This non-directive line must exist even if it doesn't grab anything in the page. (Note: if a scraper is empty, it will simply return the page as simple text.)
Lines with a directive can be located at any position in the scraper, as they are re-sorted by the program when the scraper is applied. Pre-processing directives are interpreted before all other scraper lines; post-processing directives are interpreted once the extractions have been executed. Directives are identified (prefixed and suffixed) by # characters in the description field:
Alphabetical list of available directives:
- abortAfter aborts the current extraction after the current page, if the scraper line matches.
- abortAfterNPages (Expert & Enterprise) aborts the current extraction after the Nth explored page.
- abortAfterNResults (Expert & Enterprise) aborts the current extraction when the scraped view contains N
extracted records.
- abortIf aborts the current extraction immediately if the scraper line matches.
- abortIfNot aborts the current extraction immediately if the scraper line doesn't match.
- addToQueue stores the data scraped by the line in the queue.
- addURLToSource (Expert & Enterprise) adds the URL at the beginning of the source code.
- alertOnStop displays an alert at the end of the exploration process.
- allFrames (Expert & Enterprise) scrapes the concatenation of all frames in the current page.
- allowCrossDomain (Expert & Enterprise) removes javascript restrictions and reloads the page.
- autoCatch (Expert & Enterprise) forces the position of the Empty/Catch option of the scraped view to On Demand or Auto.
- autoEmpty, emptyOnDemand (Expert & Enterprise) set the value of the 'Empty' checkbox.
- caseSensitive makes the whole scraper case sensitive.
- catchEvery, catchAndDeleteEvery (Expert & Enterprise) send scraped data to the Catch every n collected rows.
- catchOnStop (Expert & Enterprise) sends scraped data to the Catch at the end of the exploration.
- checkIf and checkIfNot condition the activation of the passed lines on page content.
- checkIfURL and checkIfNotURL condition the activation of the passed lines on the URL.
- cleanData and originalData clean the extracted data or not.
- cleanHTML normalizes the HTML tags before the scrape.
- clearAllHistory (Expert & Enterprise) clears all history, forms, cookies, cache from the browser memory.
- clearBrowsingHistory (Expert & Enterprise) clears browsing history.
- clearCookie (Expert & Enterprise) clears cookies stored for the current URL.
- clearCookieEvery (Expert & Enterprise) clears cookies for current URL every n pages.
- clearCookieIf (Expert & Enterprise) clears cookies stored for the current URL if the scraper line matches.
- clearCookiesEvery (Expert & Enterprise) clears all cookies every n pages.
- clearCookiesIf (Expert & Enterprise) clears all cookies if the scraper line matches.
- clearCookiesIfNot (Expert & Enterprise) clears all cookies if the scraper line does not match.
- clearForms (Expert & Enterprise) clears form data from the browser memory.
- clickOnNodes instructs the scraper to click on page elements matching a css selector.
- coalesceOnStop (Expert & Enterprise) coalesces extracted data keeping the first value in each column.
- concatSeparator sets the delimiter to use for concatenation.
- decodeEntities (Expert & Enterprise) Decodes HTML entities (like &amp; or &gt;) to their plain text equivalent.
- decodeJSCharcodes Converts hexadecimal and decimal codes into characters.
- deduplicate removes duplicates from extracted data.
- deduplicateOnStop (Expert & Enterprise) removes duplicates from extracted data after exploration.
- deduplicateWithinPage (Expert & Enterprise) does a row by row deduplication for each scraped page.
- default (Expert & Enterprise) sets default value for myFieldName.
- deleteCookies, deleteCookiesIf, deleteCookiesIfNot (Expert & Enterprise) deletes
cookies.
- download (Expert & Enterprise) downloads file at URL grabbed by scraper line.
- downloadReferer (Expert & Enterprise) sets the default referer for the next URL(s) downloaded.
- emptyDirectory (Expert & Enterprise) Empties the first directory of the queries view matching the passed name.
- emptyOnDemand forces the position of the Empty option of the scraped view to On Demand or Auto.
- enableNodes and disableNodes allow you to directly change the state of page elements matching a css selector.
- exclude ignores a field if it contains this value.
- excludeFromQueueIf (Expert & Enterprise) excludes urls that match a string or a regular expression.
- excludeFromQueueIfNot (Expert & Enterprise) excludes urls that do not match a string or a regular
expression.
- exportEvery,exportAndDeleteEvery (Expert & Enterprise) exports, every n collected rows, the data stored in the scraped
view.
- extractContacts (Expert & Enterprise) applies the contact extractor to the current URL loaded in the browser.
- fieldGroup (Expert & Enterprise) makes sure field indexes of a same group are incremented together.
- getJSON (Expert & Enterprise) ignores all other lines and tries to interpret JSON blocks.
- getLocalIP gets the public IP address of the machine from the network.
- hideNodes (Expert & Enterprise) Makes the nodes matching the passed css selector invisible.
- holdResults, holdResultsIf and holdResultsIfNot (Expert & Enterprise) hold (to later
merge) data scraped from different pages
- ifURLContains, ifURLDoesNotContain allow you to execute a scraper line or not, depending on the URL being scraped.
- ignoreIfField instructs the scraper to ignore this page or record if a field has a certain value.
- ignoreErrors leaves cells empty where a function returned an error.
- indentedText indents the document/page before scraping.
- insertIf inserts the content of the replace cell if this scraper line matches.
- insertIfNoResults adds a line with the content of the replacement column if no data was extracted in the page.
- insertIfNot inserts the content of the replace cell if this scraper line doesn't match.
- insertRow (Expert & Enterprise) inserts a row with the value of the replacement column for every record of the final
output.
- keepForms does not remove forms when cleaning the extracted text.
- keepOrder has the same effect as checking the 'keep order' checkbox.
- limit (Expert & Enterprise) only keeps the n first rows of extracted data.
- maxColumn (Expert & Enterprise) sets the maximum number of columns in the extracted datasheet.
- maxIndex (Expert & Enterprise) sets the maximum number of columns for a given field.
- mergeResults, mergeResultsIf and mergeResultsIfNot (Expert & Enterprise) concatenate
previously stored data with the data scraped in this page.
- minIndex (Expert & Enterprise) sets the minimum number of columns with the same name.
- newRecord creates a new record (new row) for each match.
- nextPage sets the link of the next page to use in an automatic browse process.
- nextPageMax (Expert & Enterprise) sets the maximum number of pages to be explored.
- nextPageReferer (Expert & Enterprise) sets the referer for the next page query that will be sent to the server.
- normalizeToK and normalizeToUnits normalize the numerical value in the field to decimal units or k units.
- oneRow presents all page data as a single row.
- originalData does not clean the scraper result cells.
- originalHTML scrapes the original HTML source without any alteration.
- outline alters the source code before scraping, keeping only the document/page outline.
- pauseAfter waits, after the page is processed, for the passed number of seconds.
- pauseBefore waits, before the page is processed, for the passed number of seconds.
- pressKey allows the scraper to simulate a key press in certain cases.
- processPatterns (Expert & Enterprise) interprets generation patterns and adds generated strings to the queue.
- queueMax (Expert & Enterprise) sets the maximum number of URLs to be added to the queue.
- readFromQueries (Expert & Enterprise) Reads the next active string from the passed query directory and stores its
value in the passed variable, then unchecks the line in the query directory.
- reapply (Expert & Enterprise) reapplies the scraper without moving to the next page. (now accepts parameters for the
number of reapplies and the delay between them.)
- removeScripts (Expert & Enterprise) cleans source code of the page before scraper is applied.
- removeTags (Expert & Enterprise) cleans source before scraper is applied, leaving only simple text.
- rename (Expert & Enterprise) renames all files matching a pattern in a disk folder using the replacement string.
- repeat adds the extracted field content to each extracted record.
- replace Pre-processing replacement
- replaceInField (Expert & Enterprise) replaces value in a given field.
- replaceUsingQueries (Expert & Enterprise) replaces matches of lists of patterns with lists of replacement
strings.
- resetAll, resetVisited, resetQueue (Expert & Enterprise) reset the lists of visited URLs and URLs to visit.
- resetAllIf, resetVisitedIf, resetQueueIf (Expert & Enterprise) conditionally reset the lists of visited URLs and URLs to visit.
- resetPrefOnStop (Expert & Enterprise) Reset the passed preference to its default value at the end of the scrape
process.
- resetQueueIf (Expert & Enterprise) resets queue of URLs to visit if scraper line matches.
- resetQueueIfNot (Expert & Enterprise) resets queue of URLs to visit if scraper line doesn't match.
- resetVisitedIf (Expert & Enterprise) resets list of visited URLs if scraper line matches.
- resetVisitedIfNot (Expert & Enterprise) resets list of visited URLs if scraper line doesn't match.
- resetVisitedJSLinksIf (Expert & Enterprise) resets list of visited Javascript links if scraper line matches.
- restartEvery (Expert & Enterprise) Sets 'auto-explore on startup' flag to true and restarts the application, every n
pages or seconds.
- save (Expert & Enterprise) Saves the string extracted by the scraper line to a separate text file.
- saveQueueAsQueries (Expert & Enterprise) Sends the current queue of URLs left to visit to a query directory.
- scope (Expert & Enterprise) limits Fast mode explorations to a domain and/or a depth.
- scrapeIf conditions data extraction to a match in the source code.
- scrapeIfIn (Expert & Enterprise) proceeds with the extraction if the URL is in the query directory.
- scrapeIfNot conditions data extraction to no match in the source code.
- scrapeIfNotIn (Expert & Enterprise) proceeds with the extraction if the URL is not in the query directory.
- screenshot (Expert & Enterprise) Saves a screenshot of the current page into a file using the passed file name.
- scrollBy (Expert & Enterprise) Scrolls the page loaded in the OutWit Hub browser by the passed number of pixels.
- scrollToBottom (Expert & Enterprise) scrolls (once) to the bottom of the currently loaded page.
- scrollToEnd scrolls down to the end of the page before scraping.
- scrollToString (Expert & Enterprise) scrolls down to a string in the page.
- select adds elements matching a css selector to the selection in the current page.
- sendToQueries (Expert & Enterprise) sends extracted URL to the 'queries' view.
- setAnchorRow stores the row number for later reference.
- setCookie (Expert & Enterprise) Sets values of the cookies for the current URL if the scraper line matches. Separate values with ';'.
- setValue (Expert & Enterprise) Sets the value of the <select> or <input> HTML block matching the format
column, to the value passed in the replace column.
- setVariable (Expert & Enterprise) declares and sets the value of variable before the extraction process.
- showAlert displays an alert with the data scraped by the line.
- showMatches (or logMatches) displays alert (or message in console) with strings matching patterns.
- showNextPage (or logNextPage) displays alert (or message in console) with next page URL.
- showNextPageCandidates (or logNextPageCandidates) displays alert (or message in console) with list of next page candidates.
- showOriginalSource (or logOriginalSource) displays alert (or message in console) with source code before alterations.
- showQueue (or logQueue) displays alert (or message in console) with the queue.
- showRecordDelimiter (or logRecordDelimiter) displays alert (or message in console) with record delimiter.
- showResults (or logResults) displays alert (or message in console) with grabbed data.
- showScraper (or logScraper) displays alert (or message in console) with scraper content.
- showScraperErrors (or logScraperErrors) displays an alert (or a message in the error console) if an error occurs.
- showServerErrors creates a separate column with error messages.
- showSource (or logSource) displays alert (or message in console) with source code.
- showVariables (or logVariables) displays alert (or message in console) with values of variables.
- showVisited (or logVisited) displays alert (or message in console) with list of visited URLs.
- simulate processes the scraper without actually applying it.
- skipIfIn (Expert & Enterprise) does not visit page if the URL is present in query directory.
- splitField (Expert & Enterprise) Splits the passed field as a post-process, using the values in the separator and
labels columns. (Can allow consecutive splits.)
- start starts extraction for the part of the source code following the match of this scraper line.
- stop stops extracting after the match of this scraper line
- suspend, suspendIf, suspendIfNot, suspendEvery (Expert & Enterprise)
suspends all operations.
- switchTo (Expert & Enterprise) changes the current view to the value set in the replace column.
- uncheckItemInQuery (Expert & Enterprise) Unchecks the 'OK' checkbox of the first line containing the string
extracted by the scraper line in the passed query directory.
- uncheckURLInQuery (Expert & Enterprise) Unchecks the 'OK' checkbox of the first line containing the current URL
in the passed query directory.
- uniqueField (Expert & Enterprise) Makes sure that no duplicate values are extracted for the specified field(s) during
the same exploration. (An alternative to deduplication while scraping, in case volumes are too large to post-process it.)
- unzip (Expert & Enterprise) Unzips a file or the content of a directory.
- variable declares and sets the value of a variable.
- zapGremlins removes unwanted control or invisible characters and tries to correct badly encoded characters.
Directive Details:
Pre-Processing:
- #abortAfter# Aborts the current extraction after the current page, if the scraper line matches.
- #abortAfterNPages#n# (Expert & Enterprise) Aborts the current extraction after the nth explored page.
- #abortAfterNResults#n# (Expert & Enterprise) Aborts the current extraction when the scraped view contains n extracted
records or more.
- #abortIf# and #abortIfNot# abort the scraping and interrupt the current automatic exploration if the scraper line matches (or doesn't match) a string within the page.
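Example (hypothetical string): to interrupt an automatic exploration as soon as a results page announces that nothing was found:
Description:
#abortIf#
Format:
No results found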
- #addURLToSource# (Expert & Enterprise) adds the URL at the beginning of the source code between
<pageurl></pageurl> tags, so that you can scrape the URL content or use it for conditional extractions. Note that you will not see the added URL in the source code of the page displayed in the scraper editor or in the source
view because the addition is only done when the scraper is applied. If you want to verify that it is there, you can use the #showSource# directive.
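For instance (with a hypothetical field name), once #addURLToSource# is active you can capture the page URL in a column of its own by scraping the inserted tags:
Description:
PageURL
Marker Before:
<pageurl>
Marker After:
</pageurl>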
- #allFrames# (Expert & Enterprise) Scrapes the concatenation of all frames in the current page.
- #allowCrossDomain# (Expert & Enterprise) Removes javascript same-origin policy restrictions and reloads the page.
(Restrictions will be restored at end of process or at restart.)
- #autoCatch# (Expert & Enterprise) Takes a true/false parameter and forces the position of the Empty/Catch option of the scraped view to On
Demand or Auto. (Empty On Demand and Auto-Catch will not be activated together.)
- #autoEmpty# & #emptyOnDemand# (Expert & Enterprise) directives set the value of the 'Empty' checkbox in
the scraped view from within a scraper.
- #caseSensitive# makes the whole scraper case sensitive. Note that, as all the regular expressions and literals of a scraper are combined into a single regular expression at application time, it is
not possible to define case sensitivity line by line or field by field. The whole scraper must be conceived with this in mind.
- #catchEvery#n# (or #catchAndDeleteEvery#n#) (Expert & Enterprise) Sends, every n collected rows, the data stored in the scraped
view (or in any other view specified in the Format column) to the Catch, then empties the datasheet if Delete is requested.
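Example (illustrative value): to send the collected rows to the Catch and empty the datasheet every 500 rows:
Description:
#catchAndDeleteEvery#500#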
- #checkIf# and #checkIfNot# If the scraper line matches at least one string in the page (#checkIf#), or does not match anything (#checkIfNot#), or in any case, without condition (#check#), the
content of the 'Replace' field will alter the OK column of your scraper. A string of 0s and 1s in the replace field will set the OK checkboxes of the scraper in the same order. Note that the right-click menu on the replace field of a #check
directive line will allow you to copy the values from the OK column to the cell, or copy the cell string to the OK column.
Example:
You want to turn off line 5 of your scraper if the page doesn't contain "breaking news":
Description:
#checkIfNot#
Format:
breaking news
Replace:
11110111
- #checkIfURL# and #checkIfNotURL# allow you to include URL-based conditions in the scraper: If the scraper line matches (or does not match) the current URL, the content of the 'replace' field
will alter the OK column of your scraper. A string of 0s and 1s in the replace field will set the OK checkboxes of the scraper in the same order. Note that the right-click menu on the replace field of a #check directive line will allow you
to copy the values from the OK column to the cell or copy the cell string to the OK column.
- #cleanHTML# normalizes the HTML tags before the scrape, placing all attributes in alphabetical order. This can prove useful on occasions when a page was typed manually (with little rigor) instead
of being generated automatically.
- #clearAllHistory# (Expert & Enterprise) Clears all history, forms, cookies, cache from the browser memory.
- #clearBrowsingHistory# (Expert & Enterprise) Clears browsing history.
- #clearCookie# (Expert & Enterprise) Clears cookies stored for the current URL.
- #clearCookieEvery#n# (Expert & Enterprise) Clears cookies for current URL every n pages.
- #clearCookieIf# (Expert & Enterprise) Clears cookies stored for the current URL if the scraper line matches.
- #clearCookiesEvery#n# (Expert & Enterprise) Clears all cookies every n pages.
- #clearCookiesIf# (Expert & Enterprise) Clears all cookies if the scraper line matches.
- #clearCookiesIfNot# (Expert & Enterprise) Clears all cookies if the scraper line does not match.
- #setCookie# (Expert & Enterprise) Sets the values of the cookie(s) for the current URL -if the scraper line matches- to the value(s) of the replace column. Separate distinct values with ';'. (Ex.: id=456;ref=332;)
- #clearForms# (Expert & Enterprise) Clears form data from the browser memory.
- #clickOnNodes# (Expert & Enterprise) instructs the scraper to click on all page elements that match the css selector passed in
the Format column.
- #coalesceOnStop#criterionColumnName# (Expert & Enterprise) Once the current automatic exploration is completed or
interrupted, data rows sharing the same value in the passed field name are coalesced, keeping the first value found in each column.
- #concatSeparator#separator# allows you to set the character or string to be used as a delimiter in concatenation functions like #CONCAT#, #DISTINCT#, etc.
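For instance (illustrative separator), to use a pipe character as the delimiter in concatenation functions:
Description:
#concatSeparator#|#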
- #deduplicate# changes the Deduplicate checkbox of the scraped view from within a scraper.
- #deduplicateOnStop#criterionColumnName# (Expert & Enterprise) removes duplicates from the extracted data (row by row)
in the datasheet, only once the current automatic exploration is completed or interrupted. (This prevents the deduplication from slowing down the whole extraction process.)
- #decodeEntities# Decodes HTML entities (like &amp; or &gt;) to their plain text equivalent in the whole source code before the scrape.
- #decodeJSCharcodes# Converts hexadecimal and decimal codes (\uXXXX and \xXX) into characters.
- #deleteCookies#, #deleteCookiesIf#, #deleteCookiesIfNot# (Expert
& Enterprise) This directive will instruct the program to delete ALL cookies (respectively: in all cases, when the scraper line matches, when the scraper line doesn't match). Note that it will not only delete the
cookies for the current page or domain, but all cookies for the application. (This means that you may, for instance, lose current session connections on other websites when using this directive.)
- #downloadReferer# (Expert & Enterprise) Sets the default referer for the next URL(s) downloaded in the current exploration.
- #enableNodes# (Expert & Enterprise) and #disableNodes# changes the state of page elements matching a css
selector. Can sometimes be used to enable a button before simulating a click.
- #emptyOnDemand# Forces the position of the Empty option of the scraped view to On Demand or Auto. If On Demand is set, Auto-Catch is deactivated.
- #emptyDirectory#queryDirectoryName# (Expert & Enterprise) Empties the first directory of the queries view matching the passed
name [queryDirectoryName]. (Attention: no warning dialog is displayed.)
- #excludeFromQueueIf# (Expert & Enterprise) Excludes URLs that match a string or a regular expression from the queue.
- #excludeFromQueueIfNot# (Expert & Enterprise) Excludes URLs that do not match a string or a regular expression from the queue.
- #exportEvery#n# or #exportAndDeleteEvery#n# (Expert & Enterprise) Exports, every n collected rows, the data stored in the scraped
view (or in any other view specified in the Format column) to the file (defined by a filename, a renaming pattern or a path in the form file://...) specified in the Replace column, then
empties the datasheet if requested.
Note: In the Enterprise edition, this directive can define an SQLite database as the destination (using a filename with the .sqlite extension), which is the most efficient way to manage very
large volumes of data.
- #fieldGroup# (Expert & Enterprise) Makes sure that the field indexes in a same group are incremented together, even if some of the
fields are empty.
- #getLocalIP# (Expert & Enterprise) Gets the public IP address of the machine from the network.
- #getJSON# (Expert & Enterprise) If this directive is used, the rest of the scraper will be ignored and the program will try to
interpret JSON blocks found in the page. Note that converting complex recursive JSON objects into a two-dimensional table is only possible to a certain point. Do not expect to grab the data from complex objects this way. This function can
however be very useful in simple cases.
- #ifURLContains#keyString#, #ifURLDoesNotContain#keyString# (Expert & Enterprise) allow you to execute a scraper line (or not)
depending on whether the URL of the current page matches the string keyString.
- #ignoreErrors# When this directive is used, cells where a function returned an error will be empty instead of containing an ##Error message.
- #indentedText# alters the source code before scraping, reorganizing the document/page layout into an outline with indented text.
- #insertIfNot#myFieldName# If the scraper line does not match anything in the page, the content of the 'replace' field will be added once to each row scraped for this page. It is the only way to insert
information into your extracted data when the page does not contain something.
- #insertIf#myFieldName# The data extracted by this scraper line will be added once to each record scraped in this page, if the scraper line matches one or more strings in the page. It is mostly here as the
corollary of the previous directive, but it is a good way to get rid of duplicate columns in certain cases.
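Example (hypothetical strings): to fill an 'Availability' column with 'unknown' for every row scraped from pages that do not contain "In stock":
Description:
#insertIfNot#Availability#
Format:
In stock
Replace:
unknown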
- #keepOrder# has the same effect as checking the 'keep order' checkbox in the scraped view or in a macro, i.e. ensuring that the columns will appear in the result datasheet in the same order as the scraper
lines. Setting it directly in the scraper allows you to make sure to always have this behavior with this scraper. Note that this directive is without effect on fields that are split using a separator.
- #keepForms# Instructs the program not to remove forms when cleaning the extracted text.
- #limit# (Expert & Enterprise) Only keeps the n first rows of extracted data.
- #maxColumn# (Expert & Enterprise) Sets the maximum number of columns in the extracted datasheet. Additional fields will be ignored.
- #maxIndex#fieldName# (Expert & Enterprise) Sets the maximum number of columns with the passed field name. Additional such columns
will be ignored.
- #minIndex#fieldName# (Expert & Enterprise) Sets the minimum number of columns with the same name. Missing columns will be created
empty.
- #nextPageMax# (Expert & Enterprise) Sets the maximum number of pages to be explored within the automatic Browsing of a series of
pages.
- #nextPageReferer# (Expert & Enterprise) Sets the referer for the next page query that will be sent to the server.
- #oneRow# All data extracted in this page will be presented as a single row in the datasheet.
- #originalData# Does not clean the scraper result cells. Leaves all html tags and special characters.
- #originalHTML# Scrapes the original HTML source without any alteration, preparation or special character decoding.
- #outline# alters the source code before scraping, keeping only the document/page outline.
- #pauseBefore# instructs the scraper to wait, before the page is processed, for the number of seconds set in the Replace field.
- #pauseAfter# instructs the scraper to wait, after the page is processed, for the number of seconds set in the Replace field.
- #processPatterns# (Expert & Enterprise) instructs the scraper to check if URLs passed to the #addToQueue# directives
are generation patterns. If they are, the patterns will be interpreted and all generated strings will be added to the queue.
- #queueMax# (Expert & Enterprise) Sets the maximum number of URLs to be added to the queue.
- #readFromQueries#directoryName# or #readAndUncheckFromQueries#directoryName# (Expert & Enterprise) Reads the
next active string from the passed query directory and stores its value in the variable passed in the replace column (#myVariable#), then unchecks the line in the query directory.
- #removeScripts# (Expert & Enterprise) Cleans the source code of the page before the scraper is applied, by removing scripts and
comments.
- #removeTags# (Expert & Enterprise) Cleans the source before the scraper is applied, removing html tags and unwanted characters,
and leaving only simple text.
- #rename# (Expert & Enterprise) Renames all files matching the regExp pattern of the Format column in the directory passed in the
description column (e.g. #rename#file:///Users/userName/path/outwit/unzipped#) to the value of the replacement string. If no directory is provided, (...)/downloads/outwit/unzipped is used.
- #replaceUsingQueries#myDirectory# (Expert & Enterprise) Replaces matches of the string (or regExp pattern) in the 'query
string' column of the directory with the content of the 'notes' column. (If the Replace column of the scraper line contains #RELOAD#, the page will be reloaded displaying the replacements.)
- #replace# Pre-processing replacement: the string (or regular expression) entered in the 'format' field will be replaced by the content of the 'Replace' field throughout the whole source code of the page, before
the scraper is applied.
Example: The page you wish to scrape contains both "USD" and "US$". You wish to normalize it before scraping:
Description:
#replace#
Format:
US$
Replace:
USD
- #resetAll#, #resetVisited#, #resetQueue# (Expert & Enterprise) reset the list(s) of visited URLs and/or URLs to visit (the
queue).
- #resetAllIf#, #resetVisitedIf#, #resetQueueIf# (Expert & Enterprise) reset the list(s) of visited URLs and/or URLs to visit
(the queue), when the scraper line matches a string in the page.
- #resetQueueIf# (Expert & Enterprise) Resets queue of URLs to visit if scraper line matches.
- #resetQueueIfNot# (Expert & Enterprise) Resets queue of URLs to visit if scraper line doesn't match.
- #resetVisitedIf# (Expert & Enterprise) Resets list of visited URLs if scraper line matches.
- #resetVisitedIfNot# (Expert & Enterprise) Resets list of visited URLs if scraper line doesn't match.
- #resetVisitedJSLinksIf# (Expert & Enterprise) Resets list of visited Javascript links if scraper line matches.
- #restartEvery# (Expert & Enterprise) Sets 'auto-explore on startup' flag to true and restarts the application, every n pages or
seconds.
- #select# (Expert & Enterprise) adds elements matching a css selector to the selection in the current page (or creates a new
selection).
- #scope# (Expert & Enterprise) limits Fast mode explorations to a domain and/or a depth defined in the Replace column. Values of
the desired exploration scope parameter can be: within_0, within_1, within_2, within_all, outside_0, outside_1, outside_2.
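For instance (assuming within_1 limits the exploration to the current domain and a depth of 1):
Description:
#scope#
Replace:
within_1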
- #scrapeIf# Data will only be extracted from the page if this scraper line matches something in the page source code.
- #scrapeIfNot# Data will only be extracted from the page if this scraper line doesn't match anything.
Example: You want to scrape only pages that contain "breaking news":
Description:
#scrapeIf#
Format:
breaking news
- #scrapeIfIn#queryDirectoryName# (Expert & Enterprise) Scrapes the page if the URL is present in the passed query directory.
(Doesn't work in fast scrape mode from a right-click on a list of URLs.)
- #scrapeIfNotIn#queryDirectoryName# or #skipIfIn#queryDirectoryName# (Expert & Enterprise)
Proceed with the extraction if the URL is not found in the passed query directory.
- #scrollBy# (Expert & Enterprise) Scrolls the currently loaded page by the number of pixels set in the Replace column.
- #scrollToBottom# (Expert & Enterprise) Scrolls (once) to the bottom of the currently loaded page.
- #scrollToEnd# instructs the scraper to scroll down to the end of the page and wait for the number of seconds set in the
replace field (usually for AJAX pages, in order to leave time for the page to be refreshed). You can also add a step parameter #scrollToEnd#n# to instruct the program to scroll down by steps of n pixels. (This is often useful
in AJAX pages when the data is only loaded when the user scrolls to a certain part of the page.) You can finally use a css selector as a parameter (#scrollToEnd#cssSelector#), in order to address a specific HTML element and scroll down within
this element. (This is very useful for recent AJAX interfaces.)
This function often requires some fine-tuning of the time parameters: the number of seconds in the replace column of the scraper line, the three top sliders of Tools>Preferences>Timeouts... to define the max delays (they must allow for more
time than the number of seconds set in the scraper), and the AJAX settings in Tools>Preferences>Advanced... to define the behavior when new data is dynamically added to the page.
- #scrollToString# (Expert & Enterprise) instructs the scraper to scroll down until the scraper line matches a string in
the page (useful for AJAX pages, where new data is added when you scroll).
- #setValue# (Expert & Enterprise) Sets the value of the <select> or <input> HTML block matching the format column ("id="
or "class="), to the value passed in the replace column.
- #setVariable#variableName# (Expert & Enterprise) Declares and sets the value of the variable (#variableName#). Occurrences of
#variableName# are replaced, before the extraction process, by the scraped value in all other lines of the scraper. Contrary to #variable#, #setVariable# sets the value of the variable before the source code is processed, so that even if this
line matches at the end of the page only, the value will already be available when scraping the beginning of the page. Variables can only be used in the Description and Replace columns, and only within the scope of one scraper execution. (They
cannot be used to transfer information between two scrapers.)
- #switchTo# (Expert & Enterprise) Changes the current view to the value set in the replace column.
- #uniqueField# (Expert & Enterprise) Makes sure that no duplicate values are extracted for the specified field(s) during the same
exploration. (An alternative to deduplication while scraping, in case volumes are too large to post-process it.)
- #unzip#file:///.../myFile.zip# (Expert & Enterprise) Unzips the file at the passed path (or all zip files in the first passed
directory), into the directory passed in the replace column. If no directory is provided, (...)/downloads/outwit/unzipped is used.
- #zapGremlins# Removes unwanted control or invisible characters and tries to correct badly encoded characters.
Processing:
- #addToQueue# stores the data scraped by the line in a global variable. The queue can then be accessed with the #nextToVisit()# function. (See below for more info.)
Note that in the Expert &
Enterprise editions, when scraping the original source (white background), #addToQueue# is enough to instruct the program to visit grabbed URLs in Fast Scrape mode. In this case, #nextPage# and #nextToVisit()# are not needed.
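Example (hypothetical markup): grabbing the links of detail pages and queuing them for exploration:
Description:
#addToQueue#
Marker Before:
<a class="detail" href="
Marker After:
"
Replace:
#BASEURL#\0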
- #default# or #default#myFieldName# (Expert & Enterprise) If this directive is used, the content of the
'Replace' field of the scraper line will be used as the default value for myFieldName. If myFieldName is not specified, the default will apply to all fields. This value will be added if the line doesn't match or is empty.
(This is incidentally a good way to solve the limitation of #keepOrder# with a separator.) A scraper line using the #default# directive is not destined to directly grab data but to set the default value of a given field (or all
fields) used in the scraper.
- #download# or #download#renamePattern# (Expert & Enterprise) downloads the file at the URL grabbed by the
scraper line (setting the renaming pattern if specified).
Example:
Description:
#download#
Marker Before:
Product:<br /><img src="
Marker After:
"
- #extractContacts# (Expert & Enterprise) Applies the contact extractor to the current URL loaded in the browser and returns
all fields. The optional parameter sets the filter level.
- #exclude#myFieldName# If this directive is used, the content of the 'Format' field of the scraper line will not be accepted as a value
for myFieldName. If the line matches a string corresponding to the excluded value, the match will be ignored. A scraper line using the #exclude# directive is not destined to grab data but to set an unwanted value for a field name
used elsewhere in the scraper.
- #hideNodes# (Expert & Enterprise) Makes the nodes matching the passed css selector invisible.
- #newRecord# Each time the pattern of this scraper line matches a string in the page source, a new record (new row) is created in the result datasheet. The pattern to match can be entered either in the 'marker
before' field or in the 'format' field. Note that if this directive is not used, the record separator is the first (top) line of your scraper. In cases where there are no clear field markers, or if the field you want as a unique key for the
record is not first, or if fields are not always populated, #newRecord# is very useful.
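Example (hypothetical markup): each listing starts with the same tag, but no single field is reliably present in every record:
Description:
#newRecord#
Marker Before:
<div class="listing">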
- #reapply# (Expert & Enterprise) Reapplies the scraper without moving to the next page.
- #repeat#myFieldName# The matching or replacement value will be added in a column named myFieldName to all rows following the match of this scraper line. A scraper line using the #repeat# directive is not
destined to directly grab data. It instructs the program to add the field to each new extracted record; if no record is found by the scraper, the repeated field will not be inserted. (See the #insertIfNot# directive, if you want to return a
value when a scraper doesn't match.)
Example: Say you have a page where the data to scrape is divided by continent between the following tags: <h3>Continent: XXXXX</h3>.
You can set the scraper to add the continent in a column for every row by adding:
Description:
#repeat#Continent#
Marker Before:
<h3>Continent:
Marker After:
</h3>
The repeat directive can be used to set a fixed value in a column by only entering a string in the Replace field:
Example: For inputting data directly into your database without any touchup in the process, you need to add the field "location" with a set value:
Description:
#repeat#Location#
Replace:
New Delhi
Note: if a variable is entered in the Replace field, all its values will be concatenated in the repeated output.
- #save# (Expert & Enterprise) Saves the string extracted by the scraper line to a separate text file using the filename or renaming pattern set in the replace column.
- #saveQueueAsQueries# (Expert & Enterprise) Sends the current queue of URLs left to visit to a query directory. A
name can be added to the directive if you want to specify how the directory should be named, or in order to replace an existing directory with the current queue URLs: #saveQueueAsQueries#directoryName#. This directive can be extremely useful if
you fear that a process may be interrupted, as it will allow you to resume an automatic exploration where it stopped. Note, however, that the directive only stores URLs left to explore, not URLs which have already been explored. If your scraper adds
URLs to the queue, already visited URLs may be added again, so resuming an exploration this way may in some cases lead to re-visiting URLs that had been seen before the interruption.
- #screenshot# (Expert & Enterprise) Creates an image file with a screenshot of the current Webpage in the download folder using the
filename or renaming pattern set in the replace column.
- #sendToQueries# (Expert & Enterprise) sends to the 'queries' view the URL extracted by the scraper line. A name can be
added to the directive to specify how the directory should be called or to add the extracted URL to an existing directory: #sendToQueries#directoryName#.
- #start# switches scraping on. Data will start being extracted in the part of the source code following the match of this scraper line. (Directives are not limited by #start# and #stop#. For instance, if the
#scrapeIf# directive matches a string outside of the start/stop zones, it will still be executed.)
Example: You only want to start scraping after a given title, say <h2>Synopsis:</h2>.
You simply need to type the string in the Format field of your scraper line:
Description:
#start#
Format:
<h2>Synopsis:</h2>
- #stop# switches scraping off. Data extraction will stop after the match of this scraper line in the source code. (But the code analysis continues and scraping will start again if a #start# line matches.) Note
that if the #stop# line matches before a #start# line (or if there is no #start# line), a #start# directive is implied at the beginning. In other words, in order to be able to stop, the scraping needs to start. Directives are not limited by
#start# and #stop#. For instance, if the #scrapeIf# directive matches a string outside of the start/stop zones, it will still be executed.
- #suspend#n#, #suspendIf#n#, #suspendIfNot#n#, #suspendEvery#n# (Expert & Enterprise) This directive will instruct the program to suspend all operations (respectively: in all cases, when the scraper line matches, when the scraper
line doesn't match, or every nth page). If you add a parameter to the first three functions, the program will wait for n seconds before resuming when the OK button is clicked. This is useful to give the user time to interact with the page,
solve a captcha, etc.
- #variable#myVariableName# declares and sets the value of the variable (#myVariableName#). The occurrences of the variable are then replaced, at application time, by the scraped value in all other lines of the
scraper. Variables can only be used in the Description and Replace columns and only within the scope of one scraper execution. They cannot be used to transfer information between two scrapers.
Example: Setting and using the variable 'trend'.
line 1:
Description:
#variable#trend#
Marker Before:
Dow Jones:
Marker After:
<br />
Format:
/[-+\d,.]+/
line 2:
Description:
#showAlert#
Replace:
#if(#trend#<0,Bear,Bull)#
- Anchor Functions: (The need for these functions is relatively rare; they will help you solve difficult cases when the data is presented in columns in the HTML page, using blocks with left or right 'float'
tags.) #setAnchorRow# stores the row number where this scraper line matches, so that data found later in the page source code can be added to the result table as additional columns,
starting at this row number. Thus, when the directive #useAnchorRow# is encountered --and if an anchor row has been previously set-- the following fields of data are added starting at the anchor row, until the
#useCurrentRow# directive reverts to the normal behavior, adding a new row at the bottom of the result table each time a record separator is found.
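A minimal sketch (hypothetical markup, assuming the left column appears first in the source code):
line 1:
Description:
#setAnchorRow#
Marker Before:
<div class="left-column">
line 2:
Description:
#useAnchorRow#
Marker Before:
<div class="right-column">
Fields matched after line 2 are then written starting at the stored anchor row, until a #useCurrentRow# line reverts to the normal row-by-row behavior.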
Post-Processing:
- #alertOnStop# Displays an alert at the end of the automatic exploration process (when no more next page link is found), with the list of visited URLs and discarded next page link candidates, if any.
- #catchOnStop# (Expert & Enterprise) Sends the data stored in the scraped view (or in any other view specified in the Format
column) to the Catch at the end of the exploration.
- #cleanData# and #originalData# override the 'Clean Text' checkbox in the scraped view. When original data is set, the data is left
as is (including HTML tags and entities); when clean data is used, HTML tags are removed from the scraped data.
- #deduplicateWithinPage# (Expert & Enterprise) Does a smart deduplication of the extracted data (row by row) for each
scraped page, before sending the results to the datasheet. (This prevents the deduplication of tens of thousands of rows from slowing down the whole process.)
- #holdResults#, #holdResultsIf# and #holdResultsIfNot# (Expert &
Enterprise) are used to merge data scraped from different pages. They instruct the program not to return the extracted data in the present scrape but to store it in memory until a #mergeResults# directive is used when scraping
another page. The conditional versions of the directive allow you to hold the set of results only if the scraper line matches or doesn't match a string in the page.
- #ignoreIfField#fieldName# (Expert & Enterprise) After the page has been scraped, this directive instructs the scraper to ignore
the current page or record if the value of the field is equal to the string passed in the replace column.
- #insertIfNoResults# Adds a line with the content of the Replace column if no data was extracted in the page.
- #insertRow# (Expert & Enterprise) Inserts a row with the value of the replacement column before (or after) every record of the
final output (ignoring first and last).
- #mergeResults#, #mergeResultsIf# and #mergeResultsIfNot# (Expert &
Enterprise) instruct the program to concatenate the previously stored data with the data scraped in this page and return the combined set of results. The conditional versions of the directive allow you to merge the results
only if the scraper line matches or doesn't match a string in the page.
- #nextPage# allows you to tell OutWit Hub how to find the link to the next page to use in an automatic browse process. Use this when the Hub doesn't find the next page link automatically, or when you wish to
manually set a specific course for the exploration.
NOTE: As with any feature in scrapers, the next page directive is only applied when the scraped view is active (which means that the view's bottom panel has non-default settings and
the view name is in bold in the side panel).
Example: A typical next page scraper line.
Description:
#nextPage#
Marker Before:
<a href="
Marker After:
">Next page</a>
Format:
/[^"]+/
Replace:
#BASEURL#\0
- #nextPage#x# You can add a positive integer rating in the next page directive: if several nextPage directives are used, the first matching line of the highest rating will be chosen. Use #nextPage#0#, the lowest, for the
default value. If #nextPage# is used without a rating parameter, it will be considered as the highest rated.
Example: You want to go to the link of the text "Next Page", if found, or go back to the previous page otherwise:
line 1:
Description:
#nextPage#0#
Replace:
#BACK#
line 2:
Description:
#nextPage#3#
Marker Before:
<a href="
Marker After:
">Next page</a>
Replace:
#BASEURL#\0
- #normalizeToK#myFieldName# and #normalizeToUnits#myFieldName# normalize the numerical value
in the field myFieldName: they convert it to decimal units (m, m2, m3, g...) or k units (km, km2, kg...), remove thousand separators and use the dot as a decimal separator. A scraper line using a
normalize directive is not destined to grab data but to instruct the program to normalize a field used elsewhere in the scraper.
- #pressKey# (Expert & Enterprise) allows the scraper to simulate the event of a key being pressed by the user.
- #replaceInField#aFieldName# (Expert & Enterprise) replaces a value (literal or RegExp) in a given field at the end of the
process.
- #resetPrefOnStop# (Expert & Enterprise) Resets the passed preference to its default value at the end of the scrape process.
- #splitField# (Expert & Enterprise) Splits the passed field as a post-process, using the values in the separator and labels
columns. (Can allow consecutive splits.)
- #uncheckURLInQuery# (Expert & Enterprise) Unchecks the 'OK' checkbox of the first line containing the current URL in the
passed query directory.
- #uncheckItemInQuery# (Expert & Enterprise) Unchecks the 'OK' checkbox of the first line containing the string extracted by
the scraper line in the passed query directory.
- Debug directives:
- #showAlert# displays an alert with the data scraped by the directive line. If only the 'Replace' field is filled, the alert will be shown at the end of the scraping.
- #showMatches# (or #logMatches#) displays an alert (or a message in the error console) with all the strings that match the scraper patterns.
- #showNextPage# (or #logNextPage#) displays alert (or message in console) with value of the selected next page URL.
- #showNextPageCandidates# (or #logNextPageCandidates#) displays alert (or message in console) with list of possible next page URLs found.
- #showQueue# (or #logQueue#) displays alert (or message in console) with content of the queue (the list of URLs to explore).
- #showRecordDelimiter# (or #logRecordDelimiter#) displays alert (or message in console) with name of the field selected as the record delimiter for this
scraper.
- #showResults# (or #logResults#) displays alert (or message in console) with data grabbed by the scraper.
- #showScraper# (or #logScraper#) displays alert (or message in console) with content of the scraper as interpreted by the program.
- #showScraperErrors# (or #logScraperErrors#) displays an alert (or a message in the error console) if an error occurs. (Most of the time alerts are not welcome as
they would block the execution of automatic tasks.)
- #showServerErrors# creates a separate column in the result datasheet with error messages returned by the server.
- #showSource# (or #logSource#) displays alert (or message in console) with source code to which the scraper is applied (after replacements made by the #replace# directive).
- #showOriginalSource# (or #logOriginalSource#) displays alert (or message in console) with original source code that was sent to the scraper (before
alterations).
- #showVariables# (or #logVariables#) displays alert (or message in console) with values of all variables.
- #showVisited# (or #logVisited#) displays alert (or message in console) with list of the URLs visited since the beginning of the browse process.
- #simulate# instructs the program to process the scraper without actually applying it. The interpretation is performed and some directives will still work, allowing you to display information for debug. This
can be helpful if the scraper application fails --in particular in case of freezes during the application of scrapers with too complex or faulty regular expressions-- in order to seek the cause of the problem.
Time Variables (Pro version)
The following variables can be used in the 'Replace' field to complement or replace the scraped content.
Use #YEAR#, #MONTH#, #WEEK#, #DAY#, #HOURS#, #MINUTES#, #SECONDS#,
#MILLISECONDS#, #DATE#, #TIME#, #DATETIME# in the 'Replace' field to insert the respective values in your replacement string.
Example:
You can add a collection time to the scraper using both a directive and a time variable:
Description:
#repeat#Collected On
Replace:
#DATETIME#
(Expert & Enterprise) #DATE#mm/dd/yy# (or another pattern between the last two sharp signs) can be added to specify the date format
to be used.
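Example (hypothetical, assuming a yyyy-mm-dd pattern is supported):
Description:
#repeat#Collected On
Replace:
#DATE#yyyy-mm-dd#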
Navigation Variables (Pro version)
The following variables can be used in the 'Replace' field to complement or replace the scraped content.
- Use #URL#, #BASEURL#, #DOMAIN#, #BACK#, #FORWARD#... in the 'Replace' field to insert the respective values in your replacement string.
- #URL#: current page URL (http://www.example.com/whatever/page.htm)
- #BASEURL#: current page path (http://www.example.com/whatever/)
Example:
You grab a relative link (incomplete) from the source and want to absolutize it:
Description:
Link
Before:
href="
After:
"
Replace:
#DOMAIN#\0
- #DOMAIN#: current domain (http://www.example.com)
- (Expert & Enterprise) #DOMAIN-NAME#: current domain name (example)
- (Expert & Enterprise) #WITHIN-DOMAIN#: equivalent to #DOMAIN#, but only yields a result if the link is within the current domain
- (Expert & Enterprise) #OUTSIDE-DOMAIN#: only yields a result if the link is not within the current domain (http://blog.example.com is accepted when scraping http://www.example.com)
- (Expert & Enterprise) #OUTSIDE-DOMAIN-NAME#: equivalent to #OUTSIDE-DOMAIN#, but only yields a result if the link is not within the current domain name (http://blog.example.com is not accepted when scraping http://www.example.com)
- #BACK#: previous page in history
- #FORWARD#: next page in history
- (Expert & Enterprise) #LAST-POST-QUERY# returns the value of the passed parameter in the last POST query sent.
Example:
You just want the source domain in a column 'Source':
Description:
#repeat#Source
Replace:
Collected on #DOMAIN#
- (Expert & Enterprise) Use #RELOAD# for conditional reloading of the current page.
- Redirections:
- #REQUESTED-URL# gives the URL that was queried or clicked on.
- #REDIRECTED-URL# returns the URL the browser eventually landed on after a redirection, if any, and returns nothing if there was no redirection.
- #TARGET-URL# returns the URL the browser eventually landed on after a redirection, if any, and returns the requested (current) URL if there was no redirection.
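Example (hypothetical column name):
You want a 'Final URL' column recording the page actually reached, whether or not a redirection occurred:
Description:
#repeat#Final URL
Replace:
#TARGET-URL#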
- (Expert & Enterprise) Click Generation:
- #CLICK-ID#id# When used in the Replace column on a #nextPage# line, the program will simulate a click on the HTML node with the provided ID.
- #CLICK-CLASS#classname# When used in the Replace column on a #nextPage# line, the program will simulate a click on the first HTML node with the provided Class name.
- #CLICK-CLASS-FIRST-NODE#classname# When used in the Replace column on a #nextPage# line, the program will simulate a click on the first node of the provided Class.
- #CLICK-CLASS-LAST-NODE#classname# When used in the Replace column on a #nextPage# line, the program will simulate a click on the last node of the provided Class.
- #CLICK-CLASS-NODEn#classname# When used in the Replace column on a #nextPage# line, the program will simulate a click on the nth node of the provided Class.
- #CLICK-CLASS-FIRST-LINK#classname# When used in the Replace column on a #nextPage# line, the program will simulate a click on the first hyperlink with the provided Class name.
- #CLICK-CLASS-LAST-LINK#classname# When used in the Replace column on a #nextPage# line, the program will simulate a click on the last hyperlink with the provided Class name.
- #CLICK-CLASS-LINKn#classname# When used in the Replace column on a #nextPage# line, the program will simulate a click on the nth hyperlink with the provided Class name.
- #CLICK-CLASS-ALL#classname# When used in the Replace column on a #nextPage# line, the program will simulate a click on all nodes with the provided Class name.
- #CLICK-CLASS-NEXT-LINK#classname# When used in the Replace column on a #nextPage# line, the program will successively simulate a click on all links with the provided Class name, and run a scrape
on each resulting page.
- #CLICK-CLASS-NEXT-NODE#classname# When used in the Replace column on a #nextPage# line, the program will successively simulate a click on all nodes with the provided Class name, and run a scrape
on each resulting page.
- #CLICK-SELECTOR#cssSelector# When used in the Replace column on a #nextPage# line, the program will simulate a click on the first hyperlink matching the provided CSS selector.
- #CLICK-SELECTOR-FIRST-NODE#cssSelector# When used in the Replace column on a #nextPage# line, the program will simulate a click on the first node matching the provided CSS selector.
- #CLICK-SELECTOR-LAST-NODE#cssSelector# When used in the Replace column on a #nextPage# line, the program will simulate a click on the last node matching the provided CSS selector.
- #CLICK-SELECTOR-NODEn#cssSelector# When used in the Replace column on a #nextPage# line, the program will simulate a click on the nth node matching the provided CSS selector.
- #CLICK-SELECTOR-FIRST-LINK#cssSelector# When used in the Replace column on a #nextPage# line, the program will simulate a click on the first hyperlink matching the provided CSS selector.
- #CLICK-SELECTOR-LAST-LINK#cssSelector# When used in the Replace column on a #nextPage# line, the program will simulate a click on the last hyperlink matching the provided CSS selector.
- #CLICK-SELECTOR-LINKn#cssSelector# When used in the Replace column on a #nextPage# line, the program will simulate a click on the nth hyperlink matching the provided CSS selector.
- #CLICK-SELECTOR-ALL#cssSelector# When used in the Replace column on a #nextPage# line, the program will simulate a click on all nodes matching the provided CSS selector.
- #CLICK-SELECTOR-NEXT-NODE#cssSelector# When used in the Replace column on a #nextPage# line, the program will successively simulate a click on all nodes matching the provided CSS selector, and run a scrape on each resulting page.
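Example (a hypothetical sketch, assuming the site's next-page link carries the class 'next'):
Description:
#nextPage#
Replace:
#CLICK-CLASS#next#
or, using a CSS selector instead:
Replace:
#CLICK-SELECTOR#a.next#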
- (Expert & Enterprise) Sounds:
- #BEEP#, #TICK#, #CHIMES#, #WOOSH# produce a sound when the scraper line matches a string in the page.
- (Expert & Enterprise) Host Info:
- #HOSTNAME# returns the most probable name of the organization hosting the current Web page.
- #HOSTCOUNTRY# returns the most probable country of the current Web page.
- #ORDINAL# returns the ordinal number of the page being scraped in an automatic exploration. (Note that this is different from the Ordinal ID column in datasheets. The number returned by #ORDINAL# is the first group of digits that constitute the Ordinal ID.)
- #COOKIE# returns the content of the cookie(s) that have been set in your browser by the current website, if any.
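Example (hypothetical column name):
You want each record to carry the position of its source page in the exploration:
Description:
#repeat#Page Number
Replace:
#ORDINAL#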
Data Cleaning
Several features allow you to decide how the data should be cleaned of HTML tags. The 'Clean Text' checkbox in the bottom panel of the scraped view is the most obvious: it sets whether the application removes or keeps the HTML tags in all scrapers. The #cleanData# directive in a scraper sets the behavior for that scraper. Finally, the Description field (first column) can be used to set the cleaning behavior for a single field:
<MyFieldName> keeps all HTML tags, *MyFieldName* returns the scraped data as streamlined HTML, keeping only formatting tags, and MyFieldName returns plain text without any HTML or style.
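Example (hypothetical field name): with identical markers, Description: <Article> keeps the raw HTML of the matched string, *Article* keeps only basic formatting tags, and Article returns plain text.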
Required Fields (Expert & Enterprise)
The ! and ? suffixes in the description column allow you to specify which field(s) must not be empty in order to return a result. ! means "required field" (the record will not be returned if this field is empty) and ? means "ignored if the other fields are empty" (this field will not be returned if the other fields are empty). Example: MyFieldName?
If there are multiple required fields (descriptions ending with !), the #onlyOneRequired# directive determines whether AND or OR is used when interpreting the conditions: #onlyOneRequired# means the OR operator will be used.
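Example (hypothetical sketch): in a contact scraper, Email! drops any record in which no email address was found (with several such fields, #onlyOneRequired# switches the test from AND to OR), while Fax? prevents a record containing nothing but a fax number from being returned.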
Indexed Columns (Expert & Enterprise)
As an alternative to the #maxIndex# directive, the syntax myFieldName<n in the description column allows you to specify the maximum number of columns with the same name (subsequent matches for the field are ignored).
myFieldName= in the description column prevents duplicate values for this field in the same record.
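Example (hypothetical sketch): Author<3 keeps at most three Author columns per record, and Tag= ignores duplicate Tag values within the same record.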
Replacement Functions (Pro, Expert & Enterprise Editions)
The following functions can be used in the 'Replace' field to alter the scraped content.
These are executed when the scraper line (markers and/or format) matches a string in the source code.
NOTE: these functions are still subject to evolution. At this point they can only be used alone in the replace field or in a variable declaration.
- Put #AVERAGE#, #SUM#, #MAX#, #MIN#, #CONCAT#, #HAPAX#, #UNIQUE#, #STRICTLY-UNIQUE#, #DISTINCT#, #STRICTLY-DISTINCT#, #FIRST# to #FIFTH#, #LAST#, #SHORTEST# or
#LONGEST# in the 'Replace' field to replace the scraped values by the corresponding total calculation. (Note that totals cannot serve as record separator. They will only work if not located on the first line of a
scraper.)
- #AVERAGE#: if scraped values are numerical, the result is replaced by the arithmetic mean of these values
- #SUM#: if scraped values are numerical, the result is replaced by the sum of these values
- #MIN#: if scraped values are numerical, the result is replaced by the minimum value, otherwise by the first in alphabetical order
- #MAX#: if scraped values are numerical, the result is replaced by the maximum value, otherwise by the last in alphabetical order
- #CONCAT#: all values are concatenated, using semicolons as separators
- #COUNT#: the number of occurrences
- #HAPAX#: if only one occurrence is found, it is returned, otherwise the field does not return anything
- #UNIQUE#: if only one value is found (whatever the number of occurrences), the value is returned, otherwise the field does not return anything
- #STRICTLY-UNIQUE#: (case sensitive) if only one value is found (whatever the number of occurrences), the value is returned, otherwise the field does not return anything
- #DISTINCT#: all distinct values are concatenated, using semicolons as separators; duplicate values are ignored (even if in different cases)
- #STRICTLY-DISTINCT#: (case sensitive) all distinct values are concatenated, using semicolons as separators; exact duplicates are ignored
- #DISTINCT-COUNT#: creates two columns (fields). The first one with the COUNT, the second with the DISTINCT concatenation.
- #STRICTLY-DISTINCT-COUNT#: creates two columns (fields). The first one with the COUNT, the second with the STRICTLY-DISTINCT concatenation.
- #FIRST#,#SECOND#,#THIRD#,#FOURTH#,#FIFTH#: only the corresponding occurrence is returned
- #LAST#: only the last occurrence is returned
- #EARLIEST#: returns the earliest date matching the scraper line
- #LATEST#: returns the latest date matching the scraper line
- #SHORTEST#: only the shortest matching occurrence is returned
- #LONGEST#: only the longest matching occurrence is returned
- #MAXLENGTH#: the result is replaced by the maximum length (number of characters) of matching strings
- #MINLENGTH#: the result is replaced by the minimum length (number of characters) of matching strings
- #PAGESTATUS#: returns JSON info on the current page (errors, title...)
Totals and Concat functions are now allowed in variable declarations and in replacement functions or operations.
(Expert & Enterprise) Adding -INREC to the above functions (#MAX-INREC#, #FIRST-INREC#, #DISTINCT-INREC#, etc.) returns a total or concatenation for each record: the function is calculated for each block of source code between two occurrences of the record separator (the first field of the scraper, in general).
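Example (a hypothetical sketch, placed on any line but the first of the scraper):
You want to count the links of each page rather than list them:
Description:
Link Count
Marker Before:
href="
Marker After:
"
Replace:
#COUNT#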
- Operations: #(term1 operator term2)# works with the following operators:
- + : addition of integers (1+3=4), concatenation of strings (out+wit=outwit), incrementing of characters (c+3=f)
- - : subtraction of integers (5-2=3), decrementing of characters (e-3=b)
- * (multiplication), / (division), ^ (power)
- <, >, =, ==, !=, !==... (comparison operators): a=A (case-insensitive comparison), a==a (case-sensitive comparison), a!=b (not equal, case-insensitive), a!==b (not equal, case-sensitive)
The terms can be literals, variables or functions. When using equality operators on strings (=, !=, ==, !==), you can now use the wildcard % in the second term to replace any string (e.g. these three statements are true: headstart = Head% ; homeland == h%d ; lighthouse = %HOUSE).
- Conditions: #if(condition,valueIfTrue,valueIfFalse)# or #if(condition;valueIfTrue;valueIfFalse)# for conditional replacements. The separator used between the parameters (comma or semicolon) must not be present in the parameters themselves.
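Example (a hypothetical sketch, assuming the wildcard comparison syntax described under Operations is accepted as the condition):
Description:
Availability
Replace:
#if(\0=In stock%,yes,no)#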
- (Expert & Enterprise) Dates:
- #formatDate(string)# tries to convert the scraped string into a date and formats it in the standard, easily usable format yyyy-mm-dd hh.mm.ss
- #parseDate(string)# tries to convert the scraped string into a number which can be compared to another, for a conditional extraction, for instance. "In the past" can be expressed in the replace column by:
#parseDate(\0)#<#parseDate(#DATE#)#.
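Example (hypothetical field name):
You want the scraped dates in a uniform, sortable form:
Description:
Published On
Replace:
#formatDate(\0)#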
- (Expert & Enterprise) Names:
The following features are statistics-based recognition functions. They are extremely useful for enhancing large quantities of data, but they cannot be 100% accurate. A good idea is to store the recognized data in additional fields and also extract the original full-name field, so that you can compare the results.
- #firstName(string)# tries to find the most likely first name in the passed full name string.
If the scraper line extracts "Dr Peter de Witt, M.D. 1998", the replace expression #firstName(\0)#
will return "Peter".
- #lastName(string)# tries to find the most likely last name in the passed full name string.
If the scraper line extracts "Dr Peter de Witt, M.D. 1998", the replace expression #lastName(\0)# will
return "de Witt".
- #firstLastName(string)# tries to find the most likely first and last name in the passed full name string.
If the scraper line extracts "Dr Peter de Witt, M.D. 1998", the replace expression
#firstLastName(\0)# will return "Peter de Witt".
- #gender(string)# tries to find the most likely gender for the passed full name string.
If the scraper line extracts "Dr Peter de Witt, M.D. 1998", the expression #gender(\0)# will return
"M".
If the gender cannot be determined, the function returns "-". (Gender is returned for recognized first names with distribution statistics higher than 65% or 70%.)
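Example (a hypothetical sketch): with identical markers on three lines, use Description: Full Name with an empty Replace cell to keep the original string, Description: First Name with Replace: #firstName(\0)#, and Description: Last Name with Replace: #lastName(\0)#, so the recognized values can be compared with the source.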
- (Expert & Enterprise) Transformation:
- #adler32(string)# generates a short hash from the string extracted by the scraper line. (This can be useful for deduplication, although it is not 100% reliable because, in relatively unlikely cases, two different strings can result in the same hash.)
- #base64(url)# can be used to convert a small image into a self-contained data element using the data: URI scheme. Inline images let you embed images within a web page or a document as immediate (offline) data that the browser can render without querying the Web.
- #encodeBase64(string)#, #decodeBase64(string)# Converts the string extracted by the scraper line into a base64 encoded string or decodes it into plain text.
- #encodeURL(string)# converts some special characters to their hexadecimal equivalent so that they can be used in URL parameters.
- #decodeURL(string)# decodes URL encoded characters (like %20) to their plain text equivalent.
- #decode(string)# uses a series of different algorithms to try to decode a string (decimal, hexadecimal, base64, etc.) to plain text.
- #lowerCase(string)#, #upperCase(string)#, #properCase(string)#, #sentenceCase(string)# changes the case of the string (lowercase string, UPPERCASE STRING, Proper Case String, Sentence case string).
- #unique(string)# only returns the string extracted by the scraper line if the value is unique during the same exploration. (An alternative to deduplicating after the scrape, in case volumes are too large to post-process.)
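Example (hypothetical field name):
You want a short fingerprint of each extracted string for later deduplication:
Description:
Hash
Replace:
#adler32(\0)#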
- Lookup lists: #lookUp(originalString,listOfValuesToFind,listOfReplacementValues)# or #lookUp(originalString;listOfValuesToFind;listOfReplacementValues)# or #lookUp(originalString|listOfValuesToFind|listOfReplacementValues)# for replacing lists of values. The parameters listOfValuesToFind and listOfReplacementValues must include the same number of items, separated by commas or semicolons. The elements of the first list will be respectively replaced by those of the second. The separator used between the parameters must not be present in the parameters themselves. (originalString can typically be the result of the scraper line \0 or a variable #myVariable#, etc.)
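Example (hypothetical values, using | as the parameter separator because the lists themselves contain commas):
Description:
Country
Replace:
#lookUp(\0|FR,DE,UK|France,Germany,United Kingdom)#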
- Replace function (not to be confused with the replace directive) #replace(originalString,stringToFind,replacementString)# or #replace(originalString;stringToFind;replacementString)# or #replace(originalString|stringToFind|replacementString)# replaces the first occurrence of stringToFind by replacementString in originalString. (originalString can typically
be the result of the scraper line \0 or a variable #myVariable# etc.)
- URL alteration functions: #getParam(URL,parameterName)# returns the value of a parameter in the passed URL and #setParam(URL,parameterName,parameterValue)#, to assign a new value to a parameter. When used in conjunction with #URL# in the #nextPage# directive line, this function allows you to
easily set the value of the next page URL in many cases. (URL can typically be the result of the scraper line \0 or a variable #myURL#, or a replacement function #URL#, #REDIRECTED-URL# etc.)
- Alert: #alert(Your Message)# displays an alert with the message passed as a parameter (blocking the scraping process while it is displayed).
Example:
This scraper line will generate the next URL to explore, incrementing the parameter 'page' in the current URL.
Description:
#nextPage#
Replace:
#setParam(#URL#,page,#(#getParam(#URL#,page)#+1)#)#
Automatic Exploration and Hierarchical Scraping (Pro, Expert & Enterprise Editions)
It is now possible for a scraper to set the URL of the next page to explore in a browse process (see #nextPage# directive above). Together with this feature comes a replacement function which allows advanced users to develop powerful scraping
agents:
#nextToVisit(#myURL#)#, in the 'Replace' field, instructs the Hub to give the variable #myURL# the next value which is not found in the list of visited URLs. If you set #variable#myURL# in a
scraper line, and if this line matches say 10 strings within the source code of the page, this variable will contain an array of 10 values. The #nextToVisit# directive will give #myURL# the value of the first URL which hasn't been explored in the
current Browse process. This means that, used in conjunction with #nextPage# and #BACK#, you can create complex scraping workflows. You can, in particular, create multi-level scraping processes.
#addToQueue# and #nextToVisit()#: This follows exactly the same principle, but without declaring a variable. It is simpler to use, but offers a little less control, as it only allows a single stack of URLs to explore. Unlike variables, the queue can be accessed by any scraper during an exploration: you can put URLs in the queue with one scraper and refer to them with another.
Example 1: Two-level scraping using #addToQueue# and #nextToVisit()#
Say you have a page named 'Widget List' with a list of URLs leading to the 'Widget Detail' pages where the interesting information is. You just need to create two scrapers:
Scraper #1:
Apply if URL contains:
widget-list
Line 1:
Description:
#addToQueue#
Marker Before:
<a href="
Marker After:
">See Widget Description</a>
Replace:
#BASEURL#\0
Line 2:
Description:
#nextPage#
Replace:
#nextToVisit()#
Scraper #2:
Apply if URL contains:
widget-detail
Line 1:
Description:
#nextPage#
Replace:
#BACK#
Line 2...:
... scrape the data here.
Example 2: Two-level scraping using a variable #nextToVisit(#extractedURLs#)#
Same scenario, but this time, using a variable (for instance because you wish to keep two different kinds of URLs in separate piles):
Scraper #1:
Apply if URL contains:
widget-list
Line 1:
Description:
#variable#extractedURLs#
Marker Before:
<a href="
Marker After:
">See Widget Description</a>
Replace:
#BASEURL#\0
Line 2:
Description:
#nextPage#
Replace:
#nextToVisit(#extractedURLs#)#
Scraper #2:
Apply if URL contains:
widget-detail
Line 1:
Description:
#nextPage#
Replace:
#BACK#
Line 2...:
... scrape the data here.
Note: This may look confusing, but it's not all that bad once you have grasped the principle.
The idea is that you often have a list L1 that links to another list L2 (n times), which in turn links to the pages P where you want to scrape your data.
Think of it from the end:
- You have to make a page scraper (#2 in the example above) for the data in P with #nextPage# set to #BACK# (It's the "leaf" at the end of the branch, so the program will backtrack once the page is scraped.)
- You also have to make one or several list scrapers where you will extract the links from L1, L2... into a variable like #extractedURLs#.
- In the list scraper, you also need to set #nextPage#1# (higher priority) to #nextToVisit(#extractedURLs#)# to explore all the pages one after the other,
- and, finally (still in the list scraper), set #nextPage#0# (default value) to #BACK#, to backtrack to the higher level once all #extractedURLs# of the level have been visited.
One of the tricky things is to make sure that each scraper will apply to the right kind of page using the "URL contains" field. This may require a regular expression.
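For instance (hypothetical patterns), if list and detail pages differ only by their file names, the list scraper could use /widget-list\.php/ in the 'Apply if URL contains...' field while the page scraper uses /widget-detail\.php\?id=\d+/, assuming detail pages carry a numeric id parameter.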
In the Expert & Enterprise editions, when scraping the original source (white background), #addToQueue# is enough to instruct the program to visit grabbed URLs in Fast Scrape mode. In this case, #nextPage# and #nextToVisit()# are not
needed.