I would like to know whether web crawlers can be written in the C++ programming language to search the web for specific information. If it is possible, how do I begin? What essential things must I consider when developing a web crawler in C++? Will such a web crawler be efficient?
How to Code Web Crawlers in C++
Check out this sample program, which uses the free Chilkat Spider library.
#include <CkSpider.h>
#include <CkStringArray.h>

void ChilkatSample(void)
{
    // The Chilkat Spider component/library is free.
    CkSpider spider;
    CkStringArray seenDomains;
    CkStringArray seedUrls;

    seenDomains.put_Unique(true);
    seedUrls.put_Unique(true);

    // You will need to change the start URL to something else...
    seedUrls.Append("http://something.whateverYouWant.com/");

    // Set outbound URL exclude patterns.
    // URLs matching any of these patterns will not be added to the
    // collection of outbound links.
    spider.AddAvoidOutboundLinkPattern("*?id=*");
    spider.AddAvoidOutboundLinkPattern("*.mypages.*");
    spider.AddAvoidOutboundLinkPattern("*.personal.*");
    spider.AddAvoidOutboundLinkPattern("*.comcast.*");
    spider.AddAvoidOutboundLinkPattern("*.aol.*");
    spider.AddAvoidOutboundLinkPattern("*~*");

    // Use a cache so we don't have to re-fetch URLs previously fetched.
    spider.put_CacheDir("c:/spiderCache/");
    spider.put_FetchFromCache(true);
    spider.put_UpdateCache(true);

    while (seedUrls.get_Count() > 0) {
        const char *url = seedUrls.pop();
        spider.Initialize(url);

        // Spider 5 URLs of this domain,
        // but first, save the base domain in seenDomains.
        const char *domain = spider.getUrlDomain(url);
        seenDomains.Append(spider.getBaseDomain(domain));

        long i;
        bool success;
        for (i = 0; i <= 4; i++) {
            success = spider.CrawlNext();
            if (success != true) {
                break;
            }

            // Display the URL we just crawled.
            printf("%s\n", spider.lastUrl());

            // If the last URL was retrieved from cache,
            // we won't wait. Otherwise we'll wait 1 second
            // before fetching the next URL.
            if (spider.get_LastFromCache() != true) {
                spider.SleepMs(1000);
            }
        }

        // Add the outbound links to seedUrls, except
        // for the domains we've already seen.
        for (i = 0; i <= spider.get_NumOutboundLinks() - 1; i++) {
            url = spider.getOutboundLink(i);
            domain = spider.getUrlDomain(url);
            const char *baseDomain = spider.getBaseDomain(domain);
            if (!seenDomains.Contains(baseDomain)) {
                seedUrls.Append(url);
            }
            // Don't let our list of seedUrls grow too large.
            if (seedUrls.get_Count() > 1000) {
                break;
            }
        }
    }
}
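Note that the sample above is only a function; to build it as a standalone program you would call it from a small main(), along these lines:

// Minimal entry point so the sample links as a standalone program.
int main()
{
    ChilkatSample();
    return 0;
}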
SOURCE: http://www.example-code.com/vcpp/spider_simplecrawler.asp
I guess you are looking for a sample web spider written in C++. You can easily find very simple web-crawler code on the web, but to make it work you also need to know how to write and compile C++, because not all samples will build as-is. You may run into sample code with errors that need correcting, and when that happens your knowledge of C++ will come in handy.
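As a quick sanity check that your toolchain works, building a sample such as the Chilkat one above would typically look something like the following on Linux. The paths, library name, and extra flags here are assumptions; they vary by Chilkat version and platform, so adjust them to your own install:

# Hypothetical paths and library name -- adjust to your Chilkat install.
g++ spider.cpp -I/path/to/chilkat/include \
    -L/path/to/chilkat/lib -lchilkat -lpthread -lresolv -o spider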
You can try Whalebot. It is an open-source web spider designed to be fast, simple, and memory-efficient. It was designed as a targeted spider, but you can also use it as a general-purpose one. Visit Whalebot 0.02 to download it. It supports starting, stopping, and resuming fetch sessions, and is configured through simple text files and command-line options.
Another sample web crawler is the Gist web crawler. It uses boost_regex and boost_algorithm. I can’t post the entire code here because of its length; if you want to see or download it, visit Gist Simplest Web Crawler in C++.
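To give you the flavor of it, here is a minimal sketch of the link-extraction step such a crawler performs. This is not the Gist code itself, only an illustration of how boost_regex can pull href values out of a page you have already fetched; the hard-coded HTML string and URLs are stand-ins for a real HTTP response:

// Minimal sketch of link extraction with boost_regex.
// Build with: g++ extract_links.cpp -lboost_regex -o extract_links
#include <boost/regex.hpp>
#include <iostream>
#include <string>

int main()
{
    // In a real crawler this string would come from an HTTP fetch.
    std::string html =
        "<a href=\"http://example.com/a\">A</a>"
        "<a href='http://example.com/b'>B</a>";

    // Match href attributes quoted with single or double quotes.
    boost::regex linkRe("href\\s*=\\s*[\"']([^\"']+)[\"']",
                        boost::regex::icase);

    boost::sregex_iterator it(html.begin(), html.end(), linkRe);
    boost::sregex_iterator end;
    for (; it != end; ++it) {
        // Capture group 1 holds the URL itself; a crawler would
        // queue it for fetching instead of printing it.
        std::cout << (*it)[1] << std::endl;
    }
    return 0;
}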