Used Java design patterns, Jackson and Guice, Streams API to implement parallelism for web crawler

persinammon/parallel-web-crawler

Adding Multi-Threading to a Single-Threaded Web Crawler

I was given a Java implementation of a single-threaded web crawler along with unit tests, and I implemented a multi-threaded version of the crawler. Credit for the original, very well-planned and dense project goes here.
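As a rough illustration of what a multi-threaded, depth-limited crawl can look like in Java, here is a minimal sketch built on ForkJoinPool and RecursiveAction. It is not the code in this repository: the class names, the in-memory LINKS map standing in for real pages, and the crawl helper are all hypothetical, and no network access or HTML parsing is done.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;
import java.util.stream.Collectors;

class ParallelCrawlSketch {
  // Hypothetical in-memory "web": each URL maps to the links found on that page.
  static final Map<String, List<String>> LINKS = Map.of(
      "a", List.of("b", "c"),
      "b", List.of("c", "d"),
      "c", List.of("a"),
      "d", List.of("e"),
      "e", List.of());

  // One task per page; child pages become subtasks that the pool runs in parallel.
  static final class CrawlTask extends RecursiveAction {
    private final String url;
    private final int depthLeft;
    private final Set<String> visited;

    CrawlTask(String url, int depthLeft, Set<String> visited) {
      this.url = url;
      this.depthLeft = depthLeft;
      this.visited = visited;
    }

    @Override
    protected void compute() {
      // Stop at the depth limit; add() returns false if another task already claimed this URL.
      if (depthLeft <= 0 || !visited.add(url)) {
        return;
      }
      List<CrawlTask> subtasks = LINKS.getOrDefault(url, List.of()).stream()
          .map(link -> new CrawlTask(link, depthLeft - 1, visited))
          .collect(Collectors.toList());
      invokeAll(subtasks); // run all child tasks and wait for them to finish
    }
  }

  // Crawls from start, following links at most maxDepth levels, on a pool of the given size.
  static Set<String> crawl(String start, int maxDepth, int parallelism) {
    Set<String> visited = ConcurrentHashMap.newKeySet(); // thread-safe visited set
    ForkJoinPool pool = new ForkJoinPool(parallelism);
    try {
      pool.invoke(new CrawlTask(start, maxDepth, visited));
    } finally {
      pool.shutdown();
    }
    return visited;
  }

  public static void main(String[] args) {
    System.out.println(crawl("a", 2, 4));
  }
}
```

The thread-safe visited set is what keeps two tasks from processing the same page twice; a real crawler would also need to merge word counts from each page in a thread-safe way.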

Creational Patterns and Libraries Used, Bugs Squashed

How to Run

Clone the repository, then build and run it with:

mvn package
java -classpath target/udacity-webcrawler-1.0.jar com.udacity.webcrawler.main.WebCrawlerMain src/main/config/sample_config.json

Configuration File

This is a sample configuration JSON given to the web crawler.

{
  "startPages": ["http://example.com", "http://example.com/foo"],
  "ignoredUrls": ["http://example\\.com/.*"],
  "ignoredWords": ["^.{1,3}$"],
  "parallelism": 4,
  "implementationOverride": "com.udacity.webcrawler.SequentialWebCrawler",
  "maxDepth": 10,
  "timeoutSeconds": 7,
  "popularWordCount": 3,
  "profileOutputPath": "profileData.txt",
  "resultPath": "crawlResults.json"
}


/**
 * Notes:
 * ignoredUrls and ignoredWords are regular expressions, which Java represents as instances of the Pattern class.
 * parallelism is the desired number of threads; if it is not set, it defaults to the number of available CPU cores.
 * implementationOverride takes precedence over parallelism (which otherwise selects the parallel web crawler
 * when it is greater than 1). It can be either SequentialWebCrawler or ParallelWebCrawler.
 * maxDepth is the maximum depth of the link tree to search; the crawler does not follow links beyond that depth.
 * profileOutputPath and resultPath are where to write performance data and the results. If unset, these are
 * printed to standard output.
 */
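To make the regex and parallelism notes concrete, here is a small sketch that checks the two sample patterns against a few inputs and resolves a thread count. The helper names matchesAny and resolveParallelism are hypothetical, and the sketch assumes full-string matching through Matcher.matches(); the crawler itself may apply the patterns differently.

```java
import java.util.List;
import java.util.regex.Pattern;

class ConfigNotesSketch {
  // Returns true when at least one pattern matches the entire input string.
  static boolean matchesAny(List<Pattern> patterns, String input) {
    return patterns.stream().anyMatch(p -> p.matcher(input).matches());
  }

  // Hypothetical helper: use the requested thread count when positive, else the CPU core count.
  static int resolveParallelism(int requested) {
    return requested > 0 ? requested : Runtime.getRuntime().availableProcessors();
  }

  public static void main(String[] args) {
    // Same patterns as the sample configuration (the JSON "\\." decodes to the regex "\.").
    List<Pattern> ignoredUrls = List.of(Pattern.compile("http://example\\.com/.*"));
    List<Pattern> ignoredWords = List.of(Pattern.compile("^.{1,3}$"));

    System.out.println(matchesAny(ignoredUrls, "http://example.com/foo")); // true
    System.out.println(matchesAny(ignoredUrls, "http://example.com"));     // false: no "/" after the host
    System.out.println(matchesAny(ignoredWords, "the"));                   // true: 1 to 3 characters
    System.out.println(matchesAny(ignoredWords, "crawler"));               // false: longer than 3 characters
    System.out.println(resolveParallelism(4));                             // 4
  }
}
```

Under that full-match assumption, the sample ignoredWords pattern drops any word of one to three characters, and the sample ignoredUrls pattern skips every page under http://example.com/ but not the bare host URL.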

Open-Source Third-Party Java Libraries and Tools

  • jsoup
  • Jackson Project
  • Guice
  • Maven
  • JUnit 5
  • Truth

Takeaway

Overall, this was a fun use case for practicing more complex Java patterns and doing some debugging.
