(Perl) Avoiding Outbound Links Matching Patterns

The spider accumulates outbound links when crawling. Your program may specify any number of "avoid patterns": a link that matches at least one of these wildcard patterns is not added to the outbound links.
use chilkat();

$spider = chilkat::CkSpider->new();

# --------------------------------------------------------------------
# Note: The URLs in this example are no longer valid.
# You should replace the URLs with URLs from a site of your
# own choosing -- preferably your own site if testing.
# (Google's Directory no longer exists.)
# --------------------------------------------------------------------

# First, we'll get the outbound links for a page in the
# Google directory. Then we'll add some avoid patterns
# and then re-fetch, to see it work...
$spider->Initialize("directory.google.com");
$spider->AddUnspidered("http://directory.google.com/Top/Recreation/Food/Cheese/");

$success = $spider->CrawlNext();

# Display the outbound links
for ($i = 0; $i <= $spider->get_NumOutboundLinks() - 1; $i++) {
    print $spider->getOutboundLink($i) . "\r\n";
}

# The output:

# http://www.cheese.com/
# http://www.cheesediaries.com/
# http://www.WisDairy.com/
# http://www.newenglandcheese.com
# http://www.ilovecheese.com
# http://www.cheesefromspain.com
# http://www.realcaliforniacheese.com/
# http://www.frencheese.co.uk/
# http://www.cheesesociety.org/
# http://www.specialcheese.com/queso.htm
# http://www.franceway.com/cheese/intro.htm
# http://www.foodsubs.com/Chesfirm.html
# http://www.cheeseboard.co.uk/
# http://www.thecheeseweb.com/
# http://www.vtcheese.com/
# http://www.coldbacon.com/cheese.html
# http://www.norwegiancheeses.co.uk/
# http://www.reluctantgourmet.com/cheese.htm
# http://www.lancewood.co.za/
# http://www.switzerlandcheese.ca
# http://www.frenchcheese.dk/
# http://www.dolcevita.com/cuisine/cheese/cheese.htm
# http://cheeseisland.net/
# http://www.cheestrings.ca/
# http://www.dreamcheese.co.uk
# http://hgic.clemson.edu/factsheets/HGIC3506.htm
# http://www.epicurious.com/cooking/how_to/food_dictionary/entry?id=1815
# http://www.mousetrapcheese.co.uk
# http://taquitos.net/yum/gc.shtml
# http://www.greek-recipe.com/static/greek-cheese
# http://www.park.org/Netherlands/pavilions/food_and_markets/cheese/introduction.html
# http://www.dairyfarmers.org/engl/recipes/4_1.asp
# http://www.prairieridgecheese.com/wischeesguid.html
# http://dmoz.org/cgi-bin/add.cgi?where=Recreation/Food/Cheese
# http://dmoz.org/about.html
# http://dmoz.org/cgi-bin/apply.cgi?where=Recreation/Food/Cheese

# Do it again, but this time with avoid patterns.
$spider->Initialize("directory.google.com");
$spider->AddUnspidered("http://directory.google.com/Top/Recreation/Food/Cheese/");

# Add some avoid patterns:
$spider->AddAvoidOutboundLinkPattern("*dmoz.org*");
$spider->AddAvoidOutboundLinkPattern("*?id=*");
$spider->AddAvoidOutboundLinkPattern("*.co.uk*");

$success = $spider->CrawlNext();

print "-----------------------" . "\r\n";

# Display the outbound links
for ($i = 0; $i <= $spider->get_NumOutboundLinks() - 1; $i++) {
    print $spider->getOutboundLink($i) . "\r\n";
}

# Output:

# http://www.cheese.com/
# http://www.cheesediaries.com/
# http://www.WisDairy.com/
# http://www.newenglandcheese.com
# http://www.ilovecheese.com
# http://www.cheesefromspain.com
# http://www.realcaliforniacheese.com/
# http://www.cheesesociety.org/
# http://www.specialcheese.com/queso.htm
# http://www.franceway.com/cheese/intro.htm
# http://www.foodsubs.com/Chesfirm.html
# http://www.thecheeseweb.com/
# http://www.vtcheese.com/
# http://www.coldbacon.com/cheese.html
# http://www.reluctantgourmet.com/cheese.htm
# http://www.lancewood.co.za/
# http://www.switzerlandcheese.ca
# http://www.frenchcheese.dk/
# http://www.dolcevita.com/cuisine/cheese/cheese.htm
# http://cheeseisland.net/
# http://www.cheestrings.ca/
# http://hgic.clemson.edu/factsheets/HGIC3506.htm
# http://taquitos.net/yum/gc.shtml
# http://www.greek-recipe.com/static/greek-cheese
# http://www.park.org/Netherlands/pavilions/food_and_markets/cheese/introduction.html
# http://www.dairyfarmers.org/engl/recipes/4_1.asp
# http://www.prairieridgecheese.com/wischeesguid.html
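Because the example URLs no longer resolve, the effect of the three avoid patterns can also be approximated offline in plain Perl. The following is only a sketch of the "matches at least one wildcard pattern" rule, not Chilkat's implementation: the helpers `wildcard_to_regex` and `is_avoided` are hypothetical names introduced for illustration, and the assumptions that `*` matches any run of characters and that matching is case-sensitive are mine rather than something the page states.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Chilkat): convert a wildcard pattern
# into an anchored regex. Assumption: '*' matches any run of characters
# and every other character, including '?', is matched literally.
sub wildcard_to_regex {
    my ($pattern) = @_;
    my $body = join '.*', map { quotemeta($_) } split /\*/, $pattern, -1;
    return qr/\A$body\z/;
}

# Hypothetical helper: true if the URL matches at least one pattern.
sub is_avoided {
    my ($url, @patterns) = @_;
    for my $pattern (@patterns) {
        return 1 if $url =~ wildcard_to_regex($pattern);
    }
    return 0;
}

# The same three patterns passed to AddAvoidOutboundLinkPattern above.
my @avoid = ('*dmoz.org*', '*?id=*', '*.co.uk*');

# A few links taken from the first output listing.
my @links = (
    'http://www.cheese.com/',
    'http://www.frencheese.co.uk/',
    'http://www.epicurious.com/cooking/how_to/food_dictionary/entry?id=1815',
    'http://dmoz.org/about.html',
    'http://www.lancewood.co.za/',
);

# Keep only the links that match none of the avoid patterns.
my @kept = grep { !is_avoided($_, @avoid) } @links;
print "$_\n" for @kept;

# Prints:
# http://www.cheese.com/
# http://www.lancewood.co.za/
```

This agrees with the second listing above: the `.co.uk`, `?id=` and `dmoz.org` links are dropped, while `http://www.lancewood.co.za/` is kept because `*.co.uk*` does not match `.co.za`.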
© 2000-2025 Chilkat Software, Inc. All Rights Reserved.