nutch.git
2 weeks agoMerge pull request #748 from sebastian-nagel/NUTCH-2883-docker master
Sebastian Nagel [Sun, 11 Sep 2022 10:06:02 +0000 (12:06 +0200)] 
Merge pull request #748 from sebastian-nagel/NUTCH-2883-docker

NUTCH-2883 Provide means to run server and webapp as persistent services in Docker container

2 weeks agoNUTCH-2883 Provide means to run server and webapp as persistent services in Docker... 748/head
Sebastian Nagel [Sun, 21 Aug 2022 09:21:26 +0000 (11:21 +0200)] 
NUTCH-2883 Provide means to run server and webapp as persistent services in Docker container
- install Nutch WebApp from separate repository (see NUTCH-2886) and run
  it via `mvn jetty:run -Djetty.port=<WEBAPP_PORT>`
- sync log paths in supervisord config files

2 weeks agoNUTCH-2883 Provide means to run server and webapp as persistent services in Docker...
Sebastian Nagel [Wed, 22 Sep 2021 20:20:05 +0000 (22:20 +0200)] 
NUTCH-2883 Provide means to run server and webapp as persistent services in Docker container
- move ARG instructions into FROM block they're used in (duplicate if
  necessary)

2 weeks agoNUTCH-2883 Provide means to run server and webapp as persistent services in Docker...
Lewis John McGibbney [Mon, 28 Jun 2021 03:24:47 +0000 (20:24 -0700)] 
NUTCH-2883 Provide means to run server and webapp as persistent services in Docker container

2 weeks agoPrepare for new development after release of 1.19
Sebastian Nagel [Thu, 8 Sep 2022 14:28:27 +0000 (16:28 +0200)] 
Prepare for new development after release of 1.19
- bump version number (-> 1.20-SNAPSHOT)

2 weeks agoNutch 1.19 release
Sebastian Nagel [Mon, 22 Aug 2022 13:57:41 +0000 (15:57 +0200)] 
Nutch 1.19 release
- update current year in API docs etc.
- update version number
- add changes / release notes
- update links to Hadoop API docs

5 weeks agoNUTCH-2969 Javadoc: Javascript search is not working when built on JDK 11
Sebastian Nagel [Mon, 22 Aug 2022 13:18:50 +0000 (15:18 +0200)] 
NUTCH-2969 Javadoc: Javascript search is not working when built on JDK 11
- pass --no-module-directories to javadoc target when building on JDK 11
- remove obsolete condition to fail javadoc builds on JDK 7u25 and earlier

5 weeks agoMerge pull request #747 from sebastian-nagel/NUTCH-2963-upgrade-dependencies
Sebastian Nagel [Sun, 21 Aug 2022 10:39:06 +0000 (12:39 +0200)] 
Merge pull request #747 from sebastian-nagel/NUTCH-2963-upgrade-dependencies

NUTCH-2963 Upgrade dependencies before release of 1.19

5 weeks agoNUTCH-2795 CrawlDbReader: compress CrawlDb dumps if configured
Sebastian Nagel [Fri, 19 Aug 2022 09:46:25 +0000 (11:46 +0200)] 
NUTCH-2795 CrawlDbReader: compress CrawlDb dumps if configured
- configure CSV and JSON LineRecordWriters to compress the output
  files according to the configuration

5 weeks agoNUTCH-2863 Injector to parse command-line flags case-insensitive
Sebastian Nagel [Wed, 17 Aug 2022 13:59:24 +0000 (15:59 +0200)] 
NUTCH-2863 Injector to parse command-line flags case-insensitive

5 weeks agoNUTCH-2963 Upgrade dependencies before release of 1.19 747/head
Sebastian Nagel [Fri, 19 Aug 2022 19:59:30 +0000 (21:59 +0200)] 
NUTCH-2963 Upgrade dependencies before release of 1.19
- upgrade Nutch core dependencies
   httpcore-nio 4.4.9 -> 4.4.14
   cxf 2.9.0 -> 2.9.1
   commons-jexl3 3.2.1 -> 3.3
   log4j 2.17.2 -> 2.18.0
   t-digest 3.2 -> 3.3
- update / complete LICENSE-binary

5 weeks agoNUTCH-2843 Duplicate declaration of dependencies in ivy.xml
Sebastian Nagel [Fri, 19 Aug 2022 19:24:37 +0000 (21:24 +0200)] 
NUTCH-2843 Duplicate declaration of dependencies in ivy.xml
- remove duplicated dependencies: commons-collections4 and httpclient
- move Maven POM creation into separate target to reproduce issue

5 weeks agoNUTCH-2963 Upgrade dependencies before release of 1.19
Sebastian Nagel [Fri, 19 Aug 2022 16:16:06 +0000 (18:16 +0200)] 
NUTCH-2963 Upgrade dependencies before release of 1.19
- update / complete LICENSE-binary

5 weeks agoNUTCH-2963 Upgrade dependencies before release of 1.19
Sebastian Nagel [Fri, 19 Aug 2022 15:37:29 +0000 (17:37 +0200)] 
NUTCH-2963 Upgrade dependencies before release of 1.19
- upgrade Hadoop 3.3.3 -> 3.3.4
- adapt ivy retrieve pattern to optionally include the `classifier`
  (used in Hadoop deps to differentiate between architecture:
   x86_64 vs. aarch_64)

5 weeks agoNUTCH-2963 Upgrade dependencies before release of 1.19
Sebastian Nagel [Fri, 19 Aug 2022 14:52:59 +0000 (16:52 +0200)] 
NUTCH-2963 Upgrade dependencies before release of 1.19
- upgrade indexer-solr dependencies:
  - Solr 8.5.1 -> 8.11.2
  - httpmime 4.5.10 -> 4.5.13
  - httpcore 4.4.12 -> 4.4.15

5 weeks agoNUTCH-2963 Upgrade dependencies before release of 1.19
Sebastian Nagel [Fri, 19 Aug 2022 14:38:00 +0000 (16:38 +0200)] 
NUTCH-2963 Upgrade dependencies before release of 1.19
- upgrade urlfilter-automaton to depend on dk.brics automaton 1.12-4

5 weeks agoNUTCH-2963 Upgrade dependencies before release of 1.19
Sebastian Nagel [Fri, 19 Aug 2022 14:18:51 +0000 (16:18 +0200)] 
NUTCH-2963 Upgrade dependencies before release of 1.19
- upgrade dependency-check ant plugin

5 weeks agoNUTCH-2962 Update and complete package info of protocol plugins
Sebastian Nagel [Mon, 15 Aug 2022 14:29:34 +0000 (16:29 +0200)] 
NUTCH-2962 Update and complete package info of protocol plugins

5 weeks agoNUTCH-2930 Protocol-okhttp: implement IP filter (#736)
Sebastian Nagel [Fri, 19 Aug 2022 13:26:07 +0000 (15:26 +0200)] 
NUTCH-2930 Protocol-okhttp: implement IP filter (#736)

- add include/exclude rules as list of IP address, CIDR notation
  or predefined IP ranges (localhost, loopback, sitelocal)
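
A minimal sketch of how such include/exclude rules could be evaluated. This is not the protocol-okhttp implementation: the class `IpRuleSketch` and its methods are hypothetical, and mapping the predefined names onto `java.net.InetAddress` checks is an assumption.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

/** Hypothetical helper, not the actual protocol-okhttp code. */
final class IpRuleSketch {

  /** Parses an IP literal. Passing a host name would trigger a DNS lookup. */
  static InetAddress parse(String ipLiteral) {
    try {
      return InetAddress.getByName(ipLiteral);
    } catch (UnknownHostException e) {
      throw new IllegalArgumentException("Not an IP address: " + ipLiteral, e);
    }
  }

  /** True if the address lies within the CIDR block, e.g. "192.168.0.0/16". */
  static boolean inCidr(String ipLiteral, String cidr) {
    String[] parts = cidr.split("/");
    byte[] addr = parse(ipLiteral).getAddress();
    byte[] net = parse(parts[0]).getAddress();
    int prefix = Integer.parseInt(parts[1]);
    if (addr.length != net.length || prefix < 0 || prefix > net.length * 8) {
      return false; // IPv4 vs. IPv6 mismatch or invalid prefix length
    }
    // Compare the leading `prefix` bits of address and network.
    for (int bit = 0; bit < prefix; bit++) {
      int mask = 0x80 >> (bit % 8);
      if ((addr[bit / 8] & mask) != (net[bit / 8] & mask)) {
        return false;
      }
    }
    return true;
  }

  /** Predefined ranges mapped onto InetAddress checks (an assumption). */
  static boolean inPredefined(String ipLiteral, String rangeName) {
    InetAddress a = parse(ipLiteral);
    switch (rangeName) {
      case "loopback":
        return a.isLoopbackAddress();
      case "sitelocal":
        return a.isSiteLocalAddress();
      default:
        return false;
    }
  }

  public static void main(String[] args) {
    System.out.println(inCidr("192.168.1.10", "192.168.0.0/16"));
    System.out.println(inPredefined("127.0.0.1", "loopback"));
  }
}
```

An exclude rule for private networks would then reject a fetch whenever `inPredefined(ip, "sitelocal")` holds for the resolved address.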

5 weeks agoMerge pull request #743 from sebastian-nagel/NUTCH-2290-update-licenses
Sebastian Nagel [Fri, 19 Aug 2022 12:56:06 +0000 (14:56 +0200)] 
Merge pull request #743 from sebastian-nagel/NUTCH-2290-update-licenses

NUTCH-2290 update licenses

5 weeks agoNUTCH-2957 indexer-solr / Solr schema.xml
Sebastian Nagel [Sat, 6 Aug 2022 13:25:27 +0000 (15:25 +0200)] 
NUTCH-2957 indexer-solr / Solr schema.xml
- add fall-back field definitions for unknown index fields
- update comments and descriptions
- fix indentation

5 weeks agoNUTCH-2955 indexer-solr: replace deprecated/removed field type solr.LatLonType
Sebastian Nagel [Sat, 6 Aug 2022 12:18:20 +0000 (14:18 +0200)] 
NUTCH-2955 indexer-solr: replace deprecated/removed field type solr.LatLonType

6 weeks agoMerge pull request #729 from sebastian-nagel/NUTCH-2947-keep-stateful-fetch-queues
Sebastian Nagel [Mon, 15 Aug 2022 14:40:52 +0000 (16:40 +0200)] 
Merge pull request #729 from sebastian-nagel/NUTCH-2947-keep-stateful-fetch-queues

NUTCH-2947 Fetcher: keep state of empty fetch queues unless queue feeder is finished

6 weeks agoMerge pull request #697 from sebastian-nagel/NUTCH-2896-okhttp-connection-pool
Sebastian Nagel [Mon, 15 Aug 2022 14:38:22 +0000 (16:38 +0200)] 
Merge pull request #697 from sebastian-nagel/NUTCH-2896-okhttp-connection-pool

NUTCH-2896 Protocol-okhttp: make connection pool configurable

6 weeks agoNUTCH-2958 Upgrade to crawler-commons 1.3 (#740)
Sebastian Nagel [Fri, 12 Aug 2022 18:42:45 +0000 (20:42 +0200)] 
NUTCH-2958 Upgrade to crawler-commons 1.3 (#740)

6 weeks agoNUTCH-2290 Update licenses of bundled libraries 743/head
Sebastian Nagel [Fri, 12 Aug 2022 18:25:13 +0000 (20:25 +0200)] 
NUTCH-2290 Update licenses of bundled libraries
- update the pull-request template and add updating licenses as a potential to-do

6 weeks agoNUTCH-2290 Update licenses of bundled libraries
Sebastian Nagel [Fri, 12 Aug 2022 18:17:30 +0000 (20:17 +0200)] 
NUTCH-2290 Update licenses of bundled libraries
NUTCH-2822 Split the LICENSE.txt file into two files for source resp. binary releases
- ensure the binary license and notice files are shipped with the source
  and binary packages

6 weeks agoNUTCH-2290 Update licenses of bundled libraries
Sebastian Nagel [Wed, 10 Aug 2022 16:34:52 +0000 (18:34 +0200)] 
NUTCH-2290 Update licenses of bundled libraries
- NOTICE-binary: add Apache projects and links to
  the projects' NOTICE files
- NOTICE-binary: add other software projects
  with links to the project homepage and
  the used license
- add all licenses (different from the Apache 2.0 license)
  used by dependencies shipped in the binary package

6 weeks agoNUTCH-2290 Update licenses of bundled libraries
Sebastian Nagel [Wed, 10 Aug 2022 16:30:46 +0000 (18:30 +0200)] 
NUTCH-2290 Update licenses of bundled libraries
NUTCH-2821 Deduplicate licenses in LICENSE.txt file
- LICENSE-binary: list dependencies by license
  (this also deduplicates licenses)

6 weeks agoNUTCH-2290 Update licenses of bundled libraries
Sebastian Nagel [Wed, 10 Aug 2022 13:59:52 +0000 (15:59 +0200)] 
NUTCH-2290 Update licenses of bundled libraries
- ivy license report: add homepage URL of dependencies

6 weeks agoNUTCH-2290 Update licenses of bundled libraries
Sebastian Nagel [Wed, 10 Aug 2022 11:32:04 +0000 (13:32 +0200)] 
NUTCH-2290 Update licenses of bundled libraries
- move "export control notice" from README to NOTICE files
  (following the schema used by Hadoop and Spark)
- update "export control notice" following the scheme
  used by Apache Tika

6 weeks agoNUTCH-2290 Update licenses of bundled libraries
Sebastian Nagel [Wed, 10 Aug 2022 11:14:49 +0000 (13:14 +0200)] 
NUTCH-2290 Update licenses of bundled libraries
- update year in NOTICE files: follow schema used by Hadoop and Spark
  projects ("<start-of-project-year> and onwards")
- change "developed by The ASF" -> "developed at The ASF"
  following https://infra.apache.org/licensing-howto.html#bundle-asf-product

6 weeks agoNUTCH-2822 Split the LICENSE.txt file into two files for source resp. binary releases
Sebastian Nagel [Wed, 10 Aug 2022 09:06:18 +0000 (11:06 +0200)] 
NUTCH-2822 Split the LICENSE.txt file into two files for source resp. binary releases

6 weeks agoUpgrade to Apache Rat 0.14
Sebastian Nagel [Wed, 10 Aug 2022 08:57:46 +0000 (10:57 +0200)] 
Upgrade to Apache Rat 0.14
(download of Rat 0.13 failed)

7 weeks agoNUTCH-2861 Remove parse-swf 742/head
Sebastian Nagel [Tue, 9 Aug 2022 10:29:03 +0000 (12:29 +0200)] 
NUTCH-2861 Remove parse-swf

7 weeks agoNUTCH-2956 index-geoip: dependency upgrades and improvements
Sebastian Nagel [Sat, 6 Aug 2022 13:04:10 +0000 (15:04 +0200)] 
NUTCH-2956 index-geoip: dependency upgrades and improvements
- upgrade to geoip2 3.0.1
- exclude transitive dependencies (Jackson) provided as Nutch core deps
- read also GeoLite2-*.mmdb files
- review index field names in plugin and Nutch Solr schema:
  - fix typos in field names
  - remove unused fields from schema

7 weeks agoNUTCH-2953 Indexer Elastic to ignore SSL issues
Sebastian Nagel [Mon, 8 Aug 2022 14:19:24 +0000 (16:19 +0200)] 
NUTCH-2953 Indexer Elastic to ignore SSL issues
- apply patch contributed by Markus Jelsma
- fix class imports

7 weeks agoNUTCH-2952 Upgrade core dependencies
Sebastian Nagel [Wed, 15 Jun 2022 15:07:07 +0000 (17:07 +0200)] 
NUTCH-2952 Upgrade core dependencies
- Hadoop 3.1.3 -> 3.3.3
- log4j 2.17.0 -> 2.17.2
- and some more

7 weeks agoNUTCH-2936 Early registration of URL stream handlers provided by plugins may fail...
Sebastian Nagel [Wed, 15 Jun 2022 11:08:00 +0000 (13:08 +0200)] 
NUTCH-2936 Early registration of URL stream handlers provided by plugins may fail Hadoop jobs
           running in distributed mode if protocol-okhttp is used
NUTCH-2949 Tasks of a multi-threaded map runner may fail because of slow creation of URL stream handlers

- cache URLStreamHandlers for each protocol to avoid that handlers are
  created anew

- utilize the cache to route standard protocols (http, https, file, jar)
  to handlers implemented by the JVM: this fixes NUTCH-2936
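
The caching idea can be sketched roughly as follows. The class name `CachingHandlerFactorySketch` and the `pluginLookup` function are hypothetical stand-ins for the Nutch plugin system. Returning `null` for the standard protocols, so that the JVM falls back to its built-in handlers, is a simplification of the routing described above.

```java
import java.net.URLStreamHandler;
import java.net.URLStreamHandlerFactory;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Hypothetical sketch, not the Nutch plugin system code. */
final class CachingHandlerFactorySketch implements URLStreamHandlerFactory {

  /** Protocols left to the handlers implemented by the JVM. */
  private static final Set<String> JVM_PROTOCOLS =
      Set.of("http", "https", "file", "jar");

  /** One handler per protocol, created at most once. */
  private final Map<String, URLStreamHandler> cache = new ConcurrentHashMap<>();

  /** Stand-in for the (possibly slow) lookup of a plugin-provided handler. */
  private final Function<String, URLStreamHandler> pluginLookup;

  CachingHandlerFactorySketch(Function<String, URLStreamHandler> pluginLookup) {
    this.pluginLookup = pluginLookup;
  }

  @Override
  public URLStreamHandler createURLStreamHandler(String protocol) {
    if (JVM_PROTOCOLS.contains(protocol)) {
      return null; // a null result lets java.net.URL use its default handler
    }
    // computeIfAbsent runs the lookup only on the first request per protocol
    return cache.computeIfAbsent(protocol, pluginLookup);
  }
}
```

Such a factory would be registered once per JVM via `URL.setURLStreamHandlerFactory(...)`. With the cache in place, concurrent map tasks asking for the same protocol share a single handler instance.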

7 weeks agoNUTCH-2936 Early registration of URL stream handlers provided by plugins may fail...
Sebastian Nagel [Thu, 19 May 2022 13:26:46 +0000 (15:26 +0200)] 
NUTCH-2936 Early registration of URL stream handlers provided by plugins may fail Hadoop jobs running in distributed mode if protocol-okhttp is used
- code improvements Nutch plugin system:
  - use `Class<?>` and remove suppressions of warnings
  - javadocs: fix typos
  - remove superfluous white space
  - autoformat using code style template

7 weeks agoNUTCH-2936 Early registration of URL stream handlers provided by plugins may fail...
Sebastian Nagel [Tue, 14 Jun 2022 09:00:31 +0000 (11:00 +0200)] 
NUTCH-2936 Early registration of URL stream handlers provided by plugins may fail Hadoop jobs running in distributed mode if protocol-okhttp is used
- protocol-okhttp: initialize SSLContext used to ignore SSL/TLS certificate verification
  not in a static code block

3 months agoNUTCH-2951 Crawl datum with metadata WRITABLE_GENERATE_TIME_KEY awaits fetching forever
Sebastian Nagel [Sun, 5 Jun 2022 10:45:00 +0000 (12:45 +0200)] 
NUTCH-2951 Crawl datum with metadata WRITABLE_GENERATE_TIME_KEY awaits fetching forever
- bug fix: add missing braces
  (bug introduced with NUTCH-2737, solution contributed by Lapadula Alessandro)

3 months agoNUTCH-2896 Protocol-okhttp: make connection pool configurable 697/head
Sebastian Nagel [Tue, 21 Sep 2021 18:23:02 +0000 (20:23 +0200)] 
NUTCH-2896 Protocol-okhttp: make connection pool configurable
- fix javadoc error

3 months agoNUTCH-2896 Protocol-okhttp: make connection pool configurable
Sebastian Nagel [Tue, 21 Sep 2021 13:29:33 +0000 (15:29 +0200)] 
NUTCH-2896 Protocol-okhttp: make connection pool configurable
- add configuration property `http.connection.pool.okhttp` to configure
  the number of connection pools, their size and the keep-alive time
  of the pooled connections
- create as many clients as pools are configured, each client holding
  one pool. Distribute connections by target host name over clients
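
One way to distribute connections by target host name over several clients is to hash the host name. The commit message does not state the actual mapping, so the hashing below and the name `ClientSelectorSketch` are assumptions for illustration only.

```java
import java.util.Locale;

/** Hypothetical sketch: picks one of N clients (each holding one pool) per host. */
final class ClientSelectorSketch {

  /** Returns an index in [0, numClients) that is stable for a given host name. */
  static int clientIndex(String host, int numClients) {
    // floorMod keeps the result non-negative even for negative hash codes
    return Math.floorMod(host.toLowerCase(Locale.ROOT).hashCode(), numClients);
  }

  public static void main(String[] args) {
    int numClients = 4; // assumed number of configured pools
    for (String host : new String[] {"example.org", "nutch.apache.org"}) {
      System.out.println(host + " -> client " + clientIndex(host, numClients));
    }
  }
}
```

Because the index depends only on the host name, all connections to one host end up in the same pool and can be reused there.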

4 months agoMerge pull request #731 from sebastian-nagel/NUTCH-2950-update-hostdb-performance
Sebastian Nagel [Tue, 24 May 2022 12:55:19 +0000 (14:55 +0200)] 
Merge pull request #731 from sebastian-nagel/NUTCH-2950-update-hostdb-performance

NUTCH-2950 UpdateHostDb: performance improvements

4 months agoNUTCH-2936 Early registration of URL stream handlers provided by plugins may fail...
Lewis John McGibbney [Fri, 20 May 2022 18:04:22 +0000 (11:04 -0700)] 
NUTCH-2936 Early registration of URL stream handlers provided by plugins may fail Hadoop jobs running in distributed mode (#726)

* NUTCH-2936 Early registration of URL stream handlers provided by plugins may fail Hadoop jobs running in distributed mode

4 months agoNUTCH-2950 Improve performance of UpdateHostDb 731/head
Sebastian Nagel [Fri, 20 May 2022 14:14:34 +0000 (16:14 +0200)] 
NUTCH-2950 Improve performance of UpdateHostDb
- fix Javadoc errors / warnings

4 months agoFail javadoc build on all kinds of javadoc errors and warnings
Sebastian Nagel [Fri, 20 May 2022 13:50:53 +0000 (15:50 +0200)] 
Fail javadoc build on all kinds of javadoc errors and warnings
independent from system settings

4 months agoNUTCH-2950 Improve performance of UpdateHostDb
Sebastian Nagel [Fri, 13 May 2022 16:42:42 +0000 (18:42 +0200)] 
NUTCH-2950 Improve performance of UpdateHostDb
- only create the homepage string if needed,
  rely on the parsed URL to select a URL as homepage candidate

4 months agoImprove performance of UpdateHostDb
Sebastian Nagel [Fri, 13 May 2022 16:29:31 +0000 (18:29 +0200)] 
Improve performance of UpdateHostDb
- parameterize logging
- set logging level of information which is later found in the HostDb itself
  to DEBUG (avoid that frequent log messages flood the log files)
- if DNS look-ups are not enabled (no -check* options passed):
  - do not count and log hosts skipped because they are not yet eligible for DNS look-ups
  - do not create DNS resolver threads

4 months agoNUTCH-2950 Improve performance of UpdateHostDb
Sebastian Nagel [Fri, 13 May 2022 15:18:09 +0000 (17:18 +0200)] 
NUTCH-2950 Improve performance of UpdateHostDb
- simplify map function:
  - remove instanceof conditions for key (it's an instance of Text
    by method signature)
  - avoid parsing the URL string multiple times

4 months agoNUTCH-2950 Improve performance of UpdateHostDb
Sebastian Nagel [Fri, 13 May 2022 12:44:47 +0000 (14:44 +0200)] 
NUTCH-2950 Improve performance of UpdateHostDb
- be lazy creating HostDatum metaData objects:
  - do not create MapWritable object unless needed
  - use clear() instead of constructing new object
    when reading metadata from sequence file
  - use a static serialization of the empty metaData MapWritable
    as empty HostDatum metadata is the most common case
    (stay backward compatible by keeping metadata mandatory)

4 months agoNUTCH-2950 Improve performance of UpdateHostDb
Sebastian Nagel [Fri, 13 May 2022 11:26:29 +0000 (13:26 +0200)] 
NUTCH-2950 Improve performance of UpdateHostDb
- avoid needless conversion between host name and URL and back
  if -filter and -normalize are off
- URLUtil: use ROOT locale when converting host name / URL to lowercase
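
A short illustration of why the ROOT locale matters when lowercasing host names. The helper `HostLowercaseSketch.normalizeHost` is hypothetical and not the URLUtil method itself.

```java
import java.util.Locale;

/** Hypothetical sketch of locale-independent host name lowercasing. */
final class HostLowercaseSketch {

  /** Lowercases independently of the JVM's default locale. */
  static String normalizeHost(String host) {
    return host.toLowerCase(Locale.ROOT);
  }

  public static void main(String[] args) {
    String host = "WIKI.EXAMPLE.ORG";
    // With a Turkish locale, 'I' is lowercased to dotless 'ı' (U+0131),
    // which would yield a different host name.
    System.out.println(host.toLowerCase(Locale.forLanguageTag("tr")));
    System.out.println(normalizeHost(host));
  }
}
```

With `Locale.ROOT` the result does not depend on the default locale of the machine running the job.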

4 months agoNUTCH-2947 Fetcher: keep state of empty but stateful fetch queues 729/head
Sebastian Nagel [Thu, 12 May 2022 13:53:22 +0000 (15:53 +0200)] 
NUTCH-2947 Fetcher: keep state of empty but stateful fetch queues
- also keep state if `fetcher.exceptions.per.queue.delay` > 0.0

4 months agoNUTCH-2947 Fetcher: keep state of empty but stateful fetch queues
Sebastian Nagel [Thu, 27 Jan 2022 15:57:38 +0000 (16:57 +0100)] 
NUTCH-2947 Fetcher: keep state of empty but stateful fetch queues
unless queue feeder is finished in order to ensure politeness
- next fetch time not yet reached
- non-zero exception counter and queue feeder still
  adding new fetch items to queues
Only if the queue feeder is finished and no more new
fetch items are added, these queues can finally be removed.

4 months agoNUTCH-2946 Fetcher: optionally slow down fetching from hosts with repeated exceptions
Sebastian Nagel [Tue, 3 May 2022 15:14:03 +0000 (17:14 +0200)] 
NUTCH-2946 Fetcher: optionally slow down fetching from hosts with repeated exceptions
- configure the delay in seconds as a float instead of milliseconds
- use the value of fetcher.server.delay as default
- double the delay with every observed exception (exponential backoff)
  but cap the growth at 2**31 to avoid overflows
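
The backoff described above can be sketched as follows. This is an illustration under assumptions, not the Fetcher code: the name `ExceptionBackoffSketch` is hypothetical, and whether the first exception already doubles the base delay is not stated in the commit message (here it does not).

```java
/** Hypothetical sketch of an exponential backoff with a capped growth factor. */
final class ExceptionBackoffSketch {

  /** The growth factor is capped at 2^31, as named in the commit message. */
  static final int MAX_DOUBLINGS = 31;

  /**
   * @param baseDelaySeconds delay in seconds, configured as a float
   * @param exceptions number of exceptions observed so far for the queue
   * @return delay in milliseconds before the next fetch from this queue
   */
  static long delayMillis(float baseDelaySeconds, int exceptions) {
    if (exceptions <= 0) {
      return 0L; // no exception observed, no extra delay
    }
    long baseMillis = (long) (baseDelaySeconds * 1000);
    // Assumption: first exception -> base delay, each further one doubles it.
    int doublings = Math.min(exceptions - 1, MAX_DOUBLINGS);
    return baseMillis * (1L << doublings);
  }

  public static void main(String[] args) {
    for (int n = 1; n <= 5; n++) {
      System.out.println(n + " exception(s): " + delayMillis(5.0f, n) + " ms");
    }
  }
}
```

Using `long` arithmetic together with the cap keeps the product within range for base delays of a few seconds.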

4 months agoNUTCH-2946 Fetcher: slow down fetching from hosts where requests fail repeatedly
Sebastian Nagel [Fri, 14 Jan 2022 17:31:31 +0000 (18:31 +0100)] 
NUTCH-2946 Fetcher: slow down fetching from hosts where requests fail repeatedly
with exceptions or HTTP status codes mapped to ProtocolStatus.EXCEPTION
(HTTP 403 Forbidden, 429 Too many requests, 5xx server errors, etc.)

4 months agoNUTCH-2948 Upgrade dependencies to Any23 2.7 and Tika 2.3.0
Sebastian Nagel [Thu, 5 May 2022 16:59:16 +0000 (18:59 +0200)] 
NUTCH-2948 Upgrade dependencies to Any23 2.7 and Tika 2.3.0

7 months agoNUTCH-2923: Added JobId in Job Failure logs (#721)
Prakhar Chaube [Thu, 27 Jan 2022 16:03:51 +0000 (21:33 +0530)] 
NUTCH-2923: Added JobId in Job Failure logs (#721)

* added JobId in Job Failure logs
* moved job failure log message logic to NutchJob.java
* added description for throws in JavaDoc
* logging only state from Job Status and Simplified Job name for SitemapProcessor

8 months agoNUTCH-2573 Suspend crawling if robots.txt fails to fetch with 5xx status (#724)
Sebastian Nagel [Tue, 18 Jan 2022 07:22:36 +0000 (08:22 +0100)] 
NUTCH-2573 Suspend crawling if robots.txt fails to fetch with 5xx status (#724)

NUTCH-2573 Suspend crawling if robots.txt fails to fetch with 5xx status
- add properties
  http.robots.503.defer.visits :
    enable/disable the feature (default: enabled)
  http.robots.503.defer.visits.delay :
    delay to wait before the next trial to fetch the deferred URL
    and the corresponding robots.txt
    (default: wait 5 minutes)
  http.robots.503.defer.visits.retries :
    max. number of retries before giving up and dropping all URLs from
    the given host / queue
    (default: give up after the 3rd retry, i.e. after 4 attempts)
- handle HTTP 5xx in robots.txt parser
- handle delay, retries and dropping queues in Fetcher
- count dropped fetch items in `robots_defer_visits_dropped`
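
The retry accounting ("give up after the 3rd retry, i.e. after 4 attempts") can be sketched as below. The class `RobotsDeferSketch` and its methods are hypothetical; only the default of 3 retries is taken from the commit message.

```java
/** Hypothetical sketch of the defer-or-drop decision, not the Fetcher code. */
final class RobotsDeferSketch {

  enum Action { DEFER, DROP_QUEUE }

  /**
   * Decides what to do after fetching robots.txt failed with HTTP 5xx.
   *
   * @param retriesSoFar retries already performed (0 after the first attempt)
   * @param maxRetries value of http.robots.503.defer.visits.retries
   */
  static Action onRobotsServerError(int retriesSoFar, int maxRetries) {
    return retriesSoFar < maxRetries ? Action.DEFER : Action.DROP_QUEUE;
  }

  /** Counts the attempts made if every attempt fails with 5xx. */
  static int attemptsUntilDrop(int maxRetries) {
    int attempts = 1; // the initial attempt
    int retries = 0;
    while (onRobotsServerError(retries, maxRetries) == Action.DEFER) {
      retries++;
      attempts++;
    }
    return attempts;
  }

  public static void main(String[] args) {
    System.out.println("attempts with 3 retries: " + attemptsUntilDrop(3));
  }
}
```

Between two attempts the queue would wait for the configured delay (5 minutes by default) before the deferred URL and its robots.txt are tried again.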

8 months agoNUTCH-2935 DeduplicationJob: failure on URLs with invalid percent encoding
Sebastian Nagel [Fri, 14 Jan 2022 09:34:22 +0000 (10:34 +0100)] 
NUTCH-2935 DeduplicationJob: failure on URLs with invalid percent encoding
- catch IllegalArgumentException when unescaping percent-encoding in URLs
- if one URL of two compared URLs is valid, keep it as non-duplicate
- add unit tests for DeduplicationJob
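
A rough sketch of the described handling. It uses `java.net.URLDecoder` as a stand-in for whatever unescaping DeduplicationJob actually performs, and the tie-breaker for two valid (or two invalid) URLs is a placeholder. The class and method names are hypothetical.

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch, not the DeduplicationJob code. */
final class DedupUrlSketch {

  /** Returns the unescaped URL, or null if the percent-encoding is invalid. */
  static String decodeOrNull(String url) {
    try {
      // Note: URLDecoder also turns '+' into a space, fine for this sketch.
      return URLDecoder.decode(url, StandardCharsets.UTF_8);
    } catch (IllegalArgumentException e) {
      return null; // e.g. "%zz" or a truncated escape sequence
    }
  }

  /** Picks the URL to keep as non-duplicate. */
  static String keep(String a, String b) {
    String decodedA = decodeOrNull(a);
    String decodedB = decodeOrNull(b);
    if (decodedA != null && decodedB == null) {
      return a; // only a is valid
    }
    if (decodedA == null && decodedB != null) {
      return b; // only b is valid
    }
    // Placeholder tie-breaker (an assumption): prefer the shorter URL.
    return a.length() <= b.length() ? a : b;
  }

  public static void main(String[] args) {
    System.out.println(keep("http://example.org/a%zz", "http://example.org/a%20b"));
  }
}
```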

8 months agoNUTCH-2919 Upgrade to Tika 2.2.1 and Any23 2.6 (#717)
Lewis John McGibbney [Sat, 15 Jan 2022 23:24:21 +0000 (15:24 -0800)] 
NUTCH-2919 Upgrade to Tika 2.2.1 and Any23 2.6 (#717)

* NUTCH-2919 Upgrade to Tika 2.2.1 and Any23 2.6

8 months agoMerge pull request #722 from sebastian-nagel/NUTCH-2929-fetcher-threads-slow-start
Sebastian Nagel [Fri, 14 Jan 2022 09:41:20 +0000 (10:41 +0100)] 
Merge pull request #722 from sebastian-nagel/NUTCH-2929-fetcher-threads-slow-start

NUTCH-2929 Fetcher: start threads slowly to avoid that resources are temporarily exhausted

8 months agoNUTCH-2929 Fetcher: start threads slowly to avoid that resources are temporarily... 722/head
Sebastian Nagel [Tue, 11 Jan 2022 12:43:55 +0000 (13:43 +0100)] 
NUTCH-2929 Fetcher: start threads slowly to avoid that resources are temporarily exhausted
- sleep for a configurable delay (fetcher.threads.start.delay) before starting the next
  Fetcher thread to avoid that resources (DNS, Tika XML parser pools) are
  temporarily exhausted when Fetcher threads fetch the first pages simultaneously
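
The slow start can be pictured as in the following sketch. `SlowStartSketch` is a hypothetical helper, and the delay value passed in stands for the value of `fetcher.threads.start.delay`.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of starting worker threads with a pause in between. */
final class SlowStartSketch {

  /** Starts numThreads threads, sleeping startDelayMillis before each further start. */
  static List<Thread> startSlowly(int numThreads, long startDelayMillis, Runnable task) {
    List<Thread> started = new ArrayList<>();
    for (int i = 0; i < numThreads; i++) {
      if (i > 0) {
        try {
          Thread.sleep(startDelayMillis);
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt(); // preserve the interrupt status
          break; // stop starting further threads
        }
      }
      Thread t = new Thread(task, "FetcherThread-" + i);
      t.start();
      started.add(t);
    }
    return started;
  }

  public static void main(String[] args) {
    startSlowly(3, 10L, () ->
        System.out.println(Thread.currentThread().getName() + " started"));
  }
}
```

Spreading the thread starts means the first DNS look-ups and parser initializations do not all happen at the same instant.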

8 months agoMerge pull request #703 from sebastian-nagel/NUTCH-2903-indexer-elastic-https
Sebastian Nagel [Sun, 9 Jan 2022 09:45:53 +0000 (10:45 +0100)] 
Merge pull request #703 from sebastian-nagel/NUTCH-2903-indexer-elastic-https

NUTCH-2903 indexer-elastic: allow to connect to Elastic server via HTTPS

8 months agoNUTCH-2429 Fix Plugin System to allow protocol plugins to bundle their URLStreamHandl...
Lewis John McGibbney [Sat, 8 Jan 2022 04:07:54 +0000 (20:07 -0800)] 
NUTCH-2429 Fix Plugin System to allow protocol plugins to bundle their URLStreamHandlers (#720)

* NUTCH-2429 Fix Plugin System to allow protocol plugins to bundle their URLStreamHandlers

Co-authored-by: Hiran Chaudhuri <hiran.chaudhuri@mail.de>

9 months agoUpgrade to log4j 2.17.0 (#719)
Sebastian Nagel [Wed, 22 Dec 2021 09:16:04 +0000 (10:16 +0100)] 
Upgrade to log4j 2.17.0 (#719)

9 months agoNUTCH-2917 Remove transitive dependency to log4j 1.x (#718)
Sebastian Nagel [Wed, 22 Dec 2021 09:13:12 +0000 (10:13 +0100)] 
NUTCH-2917 Remove transitive dependency to log4j 1.x (#718)

9 months agoNUTCH-2449 Replace Tika LanguageIdentifier in language-identifier (#716)
Lewis John McGibbney [Sat, 18 Dec 2021 04:11:01 +0000 (20:11 -0800)] 
NUTCH-2449 Replace Tika LanguageIdentifier in language-identifier (#716)

9 months agoNUTCH-2914 nutch-default.xml: remove obsolete and unused properties (#709)
Sebastian Nagel [Fri, 17 Dec 2021 09:00:58 +0000 (10:00 +0100)] 
NUTCH-2914 nutch-default.xml: remove obsolete and unused properties (#709)

9 months agoNUTCH-2807 SitemapProcessor to warn that ignoring robots.txt affects detection of...
Sebastian Nagel [Fri, 17 Dec 2021 08:59:19 +0000 (09:59 +0100)] 
NUTCH-2807 SitemapProcessor to warn that ignoring robots.txt affects detection of sitemaps (#710)

9 months agoMerge pull request #711 from sebastian-nagel/NUTCH-2808
Sebastian Nagel [Fri, 17 Dec 2021 08:56:02 +0000 (09:56 +0100)] 
Merge pull request #711 from sebastian-nagel/NUTCH-2808

NUTCH-2808 Document side effects of ignoring robots.txt

9 months agoNUTCH-2918 Upgrade to log4j 2.16.0 (#715)
Sebastian Nagel [Fri, 17 Dec 2021 08:53:18 +0000 (09:53 +0100)] 
NUTCH-2918 Upgrade to log4j 2.16.0 (#715)

9 months agoNUTCH-2916 Fix log file rotation / rename default log file (#714)
Sebastian Nagel [Tue, 14 Dec 2021 16:34:42 +0000 (17:34 +0100)] 
NUTCH-2916 Fix log file rotation / rename default log file (#714)

- `nutch.log` replaces `hadoop.log` as default log file name
- file rotation patterns only match `nutch.log` as log file name
- checking on start-up whether the log file(s) need to be rotated

9 months agoMerge pull request #713 from sebastian-nagel/NUTCH-2915
Sebastian Nagel [Mon, 13 Dec 2021 23:25:20 +0000 (00:25 +0100)] 
Merge pull request #713 from sebastian-nagel/NUTCH-2915

NUTCH-2915 Upgrade to log4j 2.15.0

9 months agoNUTCH-2915 Upgrade to log4j 2.15.0 713/head
Sebastian Nagel [Sun, 12 Dec 2021 21:29:55 +0000 (22:29 +0100)] 
NUTCH-2915 Upgrade to log4j 2.15.0

9 months agoUpdate documentation of protocol-related properties in 711/head
Sebastian Nagel [Tue, 18 Jun 2019 11:48:55 +0000 (13:48 +0200)] 
Update documentation of protocol-related properties in
nutch-default.xml

9 months agoNUTCH-2808 Document side effects of ignoring robots.txt
Sebastian Nagel [Fri, 3 Dec 2021 14:11:27 +0000 (15:11 +0100)] 
NUTCH-2808 Document side effects of ignoring robots.txt

9 months agoMerge pull request #539 from lewismc/NUTCH-2803
Sebastian Nagel [Fri, 3 Dec 2021 14:52:24 +0000 (15:52 +0100)] 
Merge pull request #539 from lewismc/NUTCH-2803

NUTCH-2803 Rename property http.robot.rules.whitelist

9 months agoMerge branch 'master' into NUTCH-2803 539/head
Sebastian Nagel [Fri, 3 Dec 2021 14:24:57 +0000 (15:24 +0100)] 
Merge branch 'master' into NUTCH-2803

9 months agoMerge pull request #708 from prakharchaube/NUTCH-2911
Sebastian Nagel [Fri, 3 Dec 2021 08:17:56 +0000 (09:17 +0100)] 
Merge pull request #708 from prakharchaube/NUTCH-2911

fix for NUTCH-2911 contributed by prakharchaube

9 months agoNUTCH-2911: Added InterruptedException to throws to allow propagation 708/head
prakharchaube [Wed, 1 Dec 2021 10:04:49 +0000 (15:34 +0530)] 
NUTCH-2911: Added InterruptedException to throws to allow propagation

9 months agoMerge pull request #704 from sebastian-nagel/NUTCH-2905-index-writers-logging-mask...
Sebastian Nagel [Wed, 1 Dec 2021 09:38:09 +0000 (10:38 +0100)] 
Merge pull request #704 from sebastian-nagel/NUTCH-2905-index-writers-logging-mask-credentials

NUTCH-2905 Mask sensitive strings in log output of index writers

9 months agoMerge pull request #707 from sebastian-nagel/NUTCH-2908
Sebastian Nagel [Wed, 1 Dec 2021 09:31:32 +0000 (10:31 +0100)] 
Merge pull request #707 from sebastian-nagel/NUTCH-2908

NUTCH-2908 Log mapreduce job messages and counters in local mode (Log4j2)

9 months agoMerge pull request #700 from sebastian-nagel/NUTCH-2891-tika-2.1
Sebastian Nagel [Wed, 1 Dec 2021 09:11:02 +0000 (10:11 +0100)] 
Merge pull request #700 from sebastian-nagel/NUTCH-2891-tika-2.1

NUTCH-2891 Upgrade to Tika 2.1.0

9 months agoNUTCH-2911: Caught and added log for InterruptedException
prakharchaube [Wed, 1 Dec 2021 06:10:07 +0000 (11:40 +0530)] 
NUTCH-2911: Caught and added log for InterruptedException

9 months agoNUTCH-2891 Upgrade to Tika 2.1.0 700/head
Sebastian Nagel [Tue, 30 Nov 2021 16:06:40 +0000 (17:06 +0100)] 
NUTCH-2891 Upgrade to Tika 2.1.0
- remove commons-codec and commons-compress from exclusions
  to enable parsing of application/x-7z-compressed files

9 months agofix for NUTCH-2911 contributed by prakharchaube
prakharchaube [Tue, 30 Nov 2021 13:52:02 +0000 (19:22 +0530)] 
fix for NUTCH-2911 contributed by prakharchaube

10 months agoNUTCH-2908 Log mapreduce job messages and counters in local mode (Log4j2) 707/head
Sebastian Nagel [Mon, 22 Nov 2021 16:33:59 +0000 (17:33 +0100)] 
NUTCH-2908 Log mapreduce job messages and counters in local mode (Log4j2)

10 months agoMerge pull request #705 from sebastian-nagel/NUTCH-2867
Sebastian Nagel [Mon, 22 Nov 2021 14:12:36 +0000 (15:12 +0100)] 
Merge pull request #705 from sebastian-nagel/NUTCH-2867

NUTCH-2867 Support for custom HostDb aggregators

10 months agoNUTCH-2867 Support for custom HostDb aggregators 705/head
Sebastian Nagel [Mon, 22 Nov 2021 13:53:49 +0000 (14:53 +0100)] 
NUTCH-2867 Support for custom HostDb aggregators
- complete Javadoc

10 months agoMerge pull request #706 from sebastian-nagel/NUTCH-2865
Sebastian Nagel [Mon, 22 Nov 2021 13:52:43 +0000 (14:52 +0100)] 
Merge pull request #706 from sebastian-nagel/NUTCH-2865

NUTCH-2865 WARC exporter support for metadata and dropping empty responses

10 months agoMerge pull request #695 from lewismc/NUTCH-2892
Sebastian Nagel [Mon, 22 Nov 2021 13:32:30 +0000 (14:32 +0100)] 
Merge pull request #695 from lewismc/NUTCH-2892

NUTCH-2892 Upgrade to Any23 2.5

10 months agoNUTCH-2867 Support for custom HostDb aggregators
Sebastian Nagel [Mon, 22 Nov 2021 13:23:06 +0000 (14:23 +0100)] 
NUTCH-2867 Support for custom HostDb aggregators
- rename aggregator interface in documentation

10 months agoNUTCH-2892 Upgrade to Any23 2.5 695/head
Sebastian Nagel [Mon, 22 Nov 2021 13:16:16 +0000 (14:16 +0100)] 
NUTCH-2892 Upgrade to Any23 2.5

- update instructions how to upgrade (s/tika/any23/g)

10 months agoMerge pull request #702 from sebastian-nagel/NUTCH-2904-crawler-commons-1.2
Sebastian Nagel [Mon, 22 Nov 2021 12:48:35 +0000 (13:48 +0100)] 
Merge pull request #702 from sebastian-nagel/NUTCH-2904-crawler-commons-1.2

NUTCH-2904 Upgrade to crawler-commons 1.2

10 months agoNUTCH-2865 WARC exporter support for metadata and dropping empty responses 706/head
Sebastian Nagel [Fri, 19 Nov 2021 16:17:40 +0000 (17:17 +0100)] 
NUTCH-2865 WARC exporter support for metadata and dropping empty responses
(patch contributed by Markus Jelsma)

10 months agoNUTCH-2867 Support for custom HostDb aggregators
Sebastian Nagel [Fri, 19 Nov 2021 21:34:00 +0000 (22:34 +0100)] 
NUTCH-2867 Support for custom HostDb aggregators
- rename aggregator interface

10 months agoNUTCH-2867 Support for custom HostDb aggregators
Sebastian Nagel [Fri, 19 Nov 2021 21:32:27 +0000 (22:32 +0100)] 
NUTCH-2867 Support for custom HostDb aggregators
- apply code formatting
- complete documentation

10 months agoNUTCH-2867 Support for custom HostDb aggregators
Sebastian Nagel [Fri, 19 Nov 2021 21:20:38 +0000 (22:20 +0100)] 
NUTCH-2867 Support for custom HostDb aggregators
(patch contributed by markus)