Job Function Email Database

Resolving Extraction Issues & Controlling the Crawl

On the surface, everything looks good in my example. What you’ll likely notice, however, is that other URLs are listed without any extraction text. This can happen when the code is slightly different on certain pages, or when SF moves on to other sections of the site. I have a few options to resolve this issue:

Crawl other batches of pages separately, walking through this same process but with adjusted XPath code taken from one of the other URLs.
Switch to regex or another option besides XPath to broaden the parameters and potentially capture the information I’m after on other pages.
Ignore the pages altogether and exclude them from the crawl.
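To give a feel for the regex option, here is a minimal Python sketch. It is not SF's own extraction engine, and the markup and the `price` class name are hypothetical stand-ins for whatever the pages really contain. The idea is that one loose pattern can still find the text when the surrounding tag or class list differs from page to page, where a strict XPath might miss it.

```python
import re

# Hypothetical markup: the target text sits in an element whose class
# attribute contains the word "price", but the tag and extra classes vary.
PATTERN = re.compile(
    r'<(\w+)[^>]*\bclass="[^"]*\bprice\b[^"]*"[^>]*>(.*?)</\1>',
    re.DOTALL,
)

def extract(html):
    """Return the text of the first matching element, or None if absent."""
    match = PATTERN.search(html)
    return match.group(2).strip() if match else None

page_a = '<span class="price">$10</span>'
page_b = '<div class="price sale">$8</div>'
page_c = '<p>No price here</p>'

results = [extract(page) for page in (page_a, page_b, page_c)]
print(results)
```

The trade-off is the usual one: a looser pattern catches more page variants, but it can also catch things you did not intend, so it is worth spot-checking the output.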

In this situation, I’m going to exclude the pages I can’t pull information from with my current settings and lock SF onto the content I want. This may be another point of experimentation, but it doesn’t take much experience to get a feel for which direction to go when the problem arises.

To lock SF to the URLs I want data from, I’ll use the “Include” and “Exclude” options under the “Configuration” menu item. I’ll start with the include options.

Here, I can configure SF to crawl only specific URLs on the site using regex.
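As a rough illustration of how an include rule narrows a crawl, the Python sketch below filters a list of URLs against one regex. The domain and the `/products/` folder are placeholders, and the sketch assumes a pattern has to match the whole URL; check the tool's documentation for how it actually applies patterns.

```python
import re

# Hypothetical include pattern; example.com and /products/ are placeholders.
INCLUDE_PATTERNS = [r"https://www\.example\.com/products/.*"]

def is_included(url, patterns=INCLUDE_PATTERNS):
    """Return True if the URL fully matches at least one include pattern."""
    return any(re.fullmatch(pattern, url) for pattern in patterns)

urls = [
    "https://www.example.com/products/widget-1",
    "https://www.example.com/blog/news",
]

# Only URLs matching an include pattern stay in the crawl list.
included = [url for url in urls if is_included(url)]
print(included)
```

Testing a pattern against a handful of known URLs like this before starting a crawl is a cheap way to catch a regex that is too tight or too loose.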

The “excludes” Are Where Things Get Slightly (but Only Slightly) Trickier

During the initial crawl, I took note of a number of URLs that SF was not extracting information from. This makes exclusion easy, as long as I can find them and define them appropriately.

In order to cut these folders out, I’ll add the following lines to the exclude filter:
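The exact lines depend on the folders spotted during the first crawl. Purely as an illustration, with hypothetical folder names on a placeholder domain, exclude rules of this kind behave roughly like the Python sketch below. It assumes a pattern has to match the whole URL, which is worth confirming against the tool's documentation.

```python
import re

# Hypothetical exclude patterns; the domain and folder names are placeholders.
EXCLUDE_PATTERNS = [
    r"https://www\.example\.com/products/reviews/.*",
    r"https://www\.example\.com/products/compare/.*",
]

def is_excluded(url, patterns=EXCLUDE_PATTERNS):
    """Return True if the URL fully matches any exclude pattern."""
    return any(re.fullmatch(pattern, url) for pattern in patterns)

candidates = [
    "https://www.example.com/products/widget-1",
    "https://www.example.com/products/reviews/widget-1",
    "https://www.example.com/products/compare/a-vs-b",
]

# Drop every URL that falls inside an excluded folder.
to_crawl = [url for url in candidates if not is_excluded(url)]
print(to_crawl)
```

Each folder gets its own line ending in `.*`, so everything beneath that folder is skipped while the rest of the start folder is still crawled.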

It’s worth noting that you don’t have to work through this part of configuring SF to get the data you want. If SF is let loose, it will crawl everything within the start folder, which would also include the data I want. The refinements above are far more efficient from a crawl perspective, and they also lessen the chance I’ll be a pest to the site. It’s good to play nice.
