(DataFlex) A Simple Web Crawler

This demonstrates a very simple web crawler using the Chilkat Spider component.
Use ChilkatAx-win32.pkg

Procedure Test
    Handle hoSpider
    Handle hoSeenDomains
    Handle hoSeedUrls
    Boolean iSuccess
    String sUrl
    String sDomain
    String sBaseDomain
    String sTemp1
    Integer i
    Integer iTemp1
    Integer iSeedCount
    Boolean bTemp1

    Get Create (RefClass(cComChilkatSpider)) To hoSpider
    If (Not(IsComObjectCreated(hoSpider))) Begin
        Send CreateComObject of hoSpider
    End

    Get Create (RefClass(cComCkStringArray)) To hoSeenDomains
    If (Not(IsComObjectCreated(hoSeenDomains))) Begin
        Send CreateComObject of hoSeenDomains
    End

    Get Create (RefClass(cComCkStringArray)) To hoSeedUrls
    If (Not(IsComObjectCreated(hoSeedUrls))) Begin
        Send CreateComObject of hoSeedUrls
    End

    // Both arrays reject duplicate entries.
    Set ComUnique Of hoSeenDomains To True
    Set ComUnique Of hoSeedUrls To True

    // You will need to change the start URL to something else...
    Get ComAppend Of hoSeedUrls "http://something.whateverYouWant.com/" To iSuccess

    // Set outbound URL exclude patterns.
    // URLs matching any of these patterns will not be added to the
    // collection of outbound links.
    Send ComAddAvoidOutboundLinkPattern To hoSpider "*?id=*"
    Send ComAddAvoidOutboundLinkPattern To hoSpider "*.mypages.*"
    Send ComAddAvoidOutboundLinkPattern To hoSpider "*.personal.*"
    Send ComAddAvoidOutboundLinkPattern To hoSpider "*.comcast.*"
    Send ComAddAvoidOutboundLinkPattern To hoSpider "*.aol.*"
    Send ComAddAvoidOutboundLinkPattern To hoSpider "*~*"

    // Use a cache so we don't have to re-fetch URLs previously fetched.
    Set ComCacheDir Of hoSpider To "c:/spiderCache/"
    Set ComFetchFromCache Of hoSpider To True
    Set ComUpdateCache Of hoSpider To True

    While ((ComCount(hoSeedUrls)) > 0)

        Get ComPop Of hoSeedUrls To sUrl
        Send ComInitialize To hoSpider sUrl

        // Spider 5 URLs of this domain,
        // but first, save the base domain in seenDomains.
        Get ComGetUrlDomain Of hoSpider sUrl To sDomain
        Get ComGetBaseDomain Of hoSpider sDomain To sTemp1
        Get ComAppend Of hoSeenDomains sTemp1 To iSuccess

        For i From 0 To 4
            Get ComCrawlNext Of hoSpider To iSuccess
            If (iSuccess = True) Begin
                // Display the URL we just crawled.
                Get ComLastUrl Of hoSpider To sTemp1
                Showln sTemp1

                // If the last URL was retrieved from cache,
                // we won't wait. Otherwise we'll wait 1 second
                // before fetching the next URL.
                Get ComLastFromCache Of hoSpider To bTemp1
                If (bTemp1 <> True) Begin
                    Send ComSleepMs To hoSpider 1000
                End
            End
            Else Begin
                // Nothing left to crawl: push i past the loop bound so the loop exits.
                Move 999 To i
            End
        Loop

        // Add the outbound links to seedUrls, except
        // for the domains we've already seen.
        Get ComNumOutboundLinks Of hoSpider To iTemp1
        For i From 0 To (iTemp1 - 1)
            Get ComGetOutboundLink Of hoSpider i To sUrl
            Get ComGetUrlDomain Of hoSpider sUrl To sDomain
            Get ComGetBaseDomain Of hoSpider sDomain To sBaseDomain
            Get ComContains Of hoSeenDomains sBaseDomain To bTemp1
            If (bTemp1 = False) Begin
                // Don't let our list of seedUrls grow too large.
                // A separate counter is used so the loop bound in iTemp1 is not overwritten.
                Get ComCount Of hoSeedUrls To iSeedCount
                If (iSeedCount < 1000) Begin
                    Get ComAppend Of hoSeedUrls sUrl To iSuccess
                End
            End
        Loop

    Loop

End_Procedure
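
The seed-list bookkeeping in the example above (skip avoided patterns, skip base domains already seen, keep entries unique, cap the list at 1000) can be tried in isolation. The following is only a sketch, written in Python rather than DataFlex so it runs without the Chilkat ActiveX component. The helper names `wildcard_match`, `naive_base_domain`, and `enqueue_outbound` are hypothetical and not part of Chilkat. It assumes that `*` is the only wildcard character in an avoid pattern, and it approximates the base domain as the last two host labels, which is cruder than what `GetBaseDomain` may do (for example, for `.co.uk` hosts).

```python
import re
from urllib.parse import urlparse


def wildcard_match(pattern: str, text: str) -> bool:
    """Return True if text matches pattern, treating only '*' as a wildcard.

    Assumption: every other character, including '?', is literal.
    """
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.fullmatch(regex, text) is not None


def naive_base_domain(url: str) -> str:
    """Hypothetical stand-in for a base-domain lookup: last two host labels."""
    host = urlparse(url).hostname or ""
    return ".".join(host.split(".")[-2:])


def enqueue_outbound(seeds, seen_domains, outbound, avoid_patterns, max_seeds=1000):
    """Append outbound links to seeds, mirroring the filtering in the example.

    seeds          -- list of seed URLs, kept free of duplicates
    seen_domains   -- set of base domains that were already crawled
    outbound       -- outbound links collected from the pages just crawled
    avoid_patterns -- wildcard patterns for links that should be dropped
    """
    for url in outbound:
        if any(wildcard_match(p, url) for p in avoid_patterns):
            continue  # link matches an avoid pattern
        if naive_base_domain(url) in seen_domains:
            continue  # this base domain was already crawled
        if len(seeds) < max_seeds and url not in seeds:
            seeds.append(url)


if __name__ == "__main__":
    seeds = []
    seen = {"example.com"}
    links = [
        "http://www.example.com/a",
        "http://other.org/x?id=3",
        "http://news.other.org/",
        "http://news.other.org/",
        "http://home.aol.com/",
    ]
    enqueue_outbound(seeds, seen, links, ["*?id=*", "*.aol.*"])
    print(seeds)
```

In the DataFlex example the avoid patterns are applied by the spider itself while it collects outbound links; the sketch folds that step into one function only to keep it self-contained.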
© 2000-2024 Chilkat Software, Inc. All Rights Reserved.