afa42f5942ec96b846b1fac53807fe2c02006ff1
[samza.git] / docs / learn / documentation / versioned / jobs / configuration-table.html
1 <!DOCTYPE html>
2 <!--
3 Licensed to the Apache Software Foundation (ASF) under one or more
4 contributor license agreements. See the NOTICE file distributed with
5 this work for additional information regarding copyright ownership.
6 The ASF licenses this file to You under the Apache License, Version 2.0
7 (the "License"); you may not use this file except in compliance with
8 the License. You may obtain a copy of the License at
9
10 http://www.apache.org/licenses/LICENSE-2.0
11
12 Unless required by applicable law or agreed to in writing, software
13 distributed under the License is distributed on an "AS IS" BASIS,
14 WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
15 See the License for the specific language governing permissions and
16 limitations under the License.
17 -->
18 <html>
19 <head>
20 <meta charset="utf-8">
21 <title>Samza Configuration Reference</title>
22 <style type="text/css">
23 body {
24 font-family: "Helvetica Neue",Helvetica,Arial,sans-serif;
25 font-size: 14px;
26 line-height: 22px;
27 color: #333;
28 background-color: #fff;
29 }
30
31 table {
32 border-collapse: collapse;
33 margin: 1em 0;
34 }
35
36 table th, table td {
37 text-align: left;
38 vertical-align: top;
39 padding: 12px;
40 border-bottom: 1px solid #ccc;
41 border-top: 1px solid #ccc;
42 border-left: 0;
43 border-right: 0;
44 }
45
46 table td.property, table td.default {
47 white-space: nowrap;
48 }
49
50 table th.section {
51 background-color: #eee;
52 }
53
54 table th.section .subtitle {
55 font-weight: normal;
56 }
57
58 code, a.property {
59 font-family: monospace;
60 }
61
62 span.system, span.stream, span.store, span.serde, span.rewriter, span.listener, span.reporter {
63 padding: 1px;
64 margin: 1px;
65 border-width: 1px;
66 border-style: solid;
67 border-radius: 4px;
68 }
69
70 span.system {
71 background-color: #ddf;
72 border-color: #bbd;
73 }
74
75 span.stream {
76 background-color: #dfd;
77 border-color: #bdb;
78 }
79
80 span.store {
81 background-color: #fdf;
82 border-color: #dbd;
83 }
84
85 span.serde {
86 background-color: #fdd;
87 border-color: #dbb;
88 }
89
90 span.rewriter {
91 background-color: #eee;
92 border-color: #ccc;
93 }
94
95 span.listener {
96 background-color: #ffd;
97 border-color: #ddb;
98 }
99
100 span.reporter {
101 background-color: #dff;
102 border-color: #bdd;
103 }
104 </style>
105 </head>
106
107 <body>
108 <h1>Samza Configuration Reference</h1>
109 <p>The following table lists all the standard properties that can be included in a Samza job configuration file.</p>
110 <p>Words highlighted like <span class="system">this</span> are placeholders for your own variable names.</p>
111 <table>
112 <tbody>
113 <tr><th>Name</th><th>Default</th><th>Description</th></tr>
114 <tr>
115 <th colspan="3" class="section" id="job"><a href="configuration.html">Samza job configuration</a></th>
116 </tr>
117
118 <tr>
119 <td class="property" id="job-factory-class">job.factory.class</td>
120 <td class="default"></td>
121 <td class="description">
122 <strong>Required:</strong> The <a href="job-runner.html">job factory</a> to use for running this job.
123 The value is a fully-qualified Java classname, which must implement
124 <a href="../api/javadocs/org/apache/samza/job/StreamJobFactory.html">StreamJobFactory</a>.
125 Samza ships with three implementations:
126 <dl>
127 <dt><code>org.apache.samza.job.local.ThreadJobFactory</code></dt>
128 <dd>Runs your job on your local machine using threads. This is intended only for
129 development, not for production deployments.</dd>
130 <dt><code>org.apache.samza.job.local.ProcessJobFactory</code></dt>
131 <dd>Runs your job on your local machine as a subprocess. An optional command builder
132 property can also be specified (see <a href="#task-command-class" class="property">
133 task.command.class</a> for details). This is intended only for development,
134 not for production deployments.</dd>
135 <dt><code>org.apache.samza.job.yarn.YarnJobFactory</code></dt>
136 <dd>Runs your job on a YARN grid. See <a href="#yarn">below</a> for YARN-specific configuration.</dd>
137 </dl>
138 </td>
139 </tr>
140
141 <tr>
142 <td class="property" id="job-name">job.name</td>
143 <td class="default"></td>
144 <td class="description">
145 <strong>Required:</strong> The name of your job. This name appears on the Samza dashboard, and it
146 is used to tell apart this job's checkpoints from other jobs' checkpoints.
147 </td>
148 </tr>
149
150 <tr>
151 <td class="property" id="job-id">job.id</td>
152 <td class="default">1</td>
153 <td class="description">
154 If you run several instances of your job at the same time, you need to give each execution a
155 different <code>job.id</code>. This is important, since otherwise the jobs will overwrite each
156 others' checkpoints, and perhaps interfere with each other in other ways.
157 </td>
158 </tr>
159 <td class="property" id="job-coordinator-system">job.coordinator.system</td>
160 <td class="default"></td>
161 <td class="description">
162 <strong>Required:</strong> The <span class="system">system-name</span> to use for creating and maintaining the <a href="../container/coordinator-stream.html">Coordinator Stream</a>.
163 </td>
164 </tr>
165 <tr>
166 <td class="property" id="job-default-system">job.default.system</td>
167 <td class="default"></td>
168 <td class="description">
169 The <span class="system">system-name</span> to access any input or output streams for which the system is not explicitly configured.
170 This property is for input and output streams whereas job.coordinator.system is for samza metadata streams.</a>.
171 </td>
172 </tr>
173
174 <tr>
175 <td class="property" id="job-coordinator-replication-factor">job.coordinator.<br />replication.factor</td>
176 <td class="default">3</td>
177 <td class="description">
178 If you are using Kafka for coordinator stream, this is the number of Kafka nodes to which you want the
179 coordinator topic replicated for durability.
180 </td>
181 </tr>
182
183 <tr>
184 <td class="property" id="job-coordinator-segment-bytes">job.coordinator.<br />segment.bytes</td>
185 <td class="default">26214400</td>
186 <td class="description">
187 If you are using a Kafka system for coordinator stream, this is the segment size to be used for the coordinator
188 topic's log segments. Keeping this number small is useful because it increases the frequency
189 that Kafka will garbage collect old messages.
190 </td>
191 </tr>
192
193 <tr>
194 <td class="property" id="job-coordinator-monitor-partition-change">job.coordinator.<br />monitor-partition-change</td>
195 <td class="default">false</td>
196 <td class="description">
197 If you are using Kafka for coordinator stream, this configuration enables the Job Coordinator to
198 detect partition count difference in Kafka input topics. On detection, it updates a Gauge
199 metric of format <span class="system">system-name</span>.<span class="stream">stream-name</span>.partitionCount,
200 which indicates the difference in the partition count from the initial state. Please note that currently this
201 feature only works for Kafka-based systems.
202 </td>
203 </tr>
204
205 <tr>
206 <td class="property" id="job-coordinator-monitor-partition-change-frequency-ms">job.coordinator.<br />monitor-partition-change.frequency.ms</td>
207 <td class="default">300000</td>
208 <td class="description">
209 The frequency at which the input streams' partition count change should be detected. This check
210 can be tuned to be pretty low as partition increase is not a common event.
211 </td>
212 </tr>
213
214 <tr>
215 <td class="property" id="job-config-rewriter-class">job.config.rewriter.<br><span class="rewriter">rewriter-name</span>.class</td>
216 <td class="default"></td>
217 <td class="description">
218 You can optionally define configuration rewriters, which have the opportunity to dynamically
219 modify the job configuration before the job is started. For example, this can be useful for
220 pulling configuration from an external configuration management system, or for determining
221 the set of input streams dynamically at runtime. The value of this property is a
222 fully-qualified Java classname which must implement
223 <a href="../api/javadocs/org/apache/samza/config/ConfigRewriter.html">ConfigRewriter</a>.
224 Samza ships with these rewriters by default:
225 <dl>
226 <dt><code>org.apache.samza.config.RegExTopicGenerator</code></dt>
227 <dd>When consuming from Kafka, this allows you to consume all Kafka topics that match
228 some regular expression (rather than having to list each topic explicitly).
229 This rewriter has <a href="#regex-rewriter">additional configuration</a>.</dd>
230 <dt><code>org.apache.samza.config.EnvironmentConfigRewriter</code></dt>
231 <dd>This rewriter takes environment variables that are prefixed with <i>SAMZA_</i>
232 and adds them to the configuration, overriding previous values where they
233 exist. The keys are lowercased and underscores are converted to dots.</dd>
234 </dl>
235 </td>
236 </tr>
237
238 <tr>
239 <td class="property" id="job-config-rewriters">job.config.rewriters</td>
240 <td class="default"></td>
241 <td class="description">
242 If you have defined configuration rewriters, you need to list them here, in the order in
243 which they should be applied. The value of this property is a comma-separated list of
244 <span class="rewriter">rewriter-name</span> tokens.
245 </td>
246 </tr>
247
248 <tr>
249 <td class="property" id="job-systemstreampartition-grouper-factory">job.systemstreampartition.<br>grouper.factory</td>
250 <td class="default">org.apache.samza.<br>container.grouper.stream.<br>GroupByPartitionFactory</td>
251 <td class="description">
252 A factory class that is used to determine how input SystemStreamPartitions are grouped together for processing in individual StreamTask instances. The factory must implement the SystemStreamPartitionGrouperFactory interface. Once this configuration is set, it can't be changed, since doing so could violate state semantics, and lead to a loss of data.
253
254 <dl>
255 <dt><code>org.apache.samza.container.grouper.stream.GroupByPartitionFactory</code></dt>
256 <dd>Groups input stream partitions according to their partition number. This grouping leads to a single StreamTask processing all messages for a single partition (e.g. partition 0) across all input streams that have a partition 0. Therefore, the default is that you get one StreamTask for all input partitions with the same partition number. Using this strategy, if two input streams have a partition 0, then messages from both partitions will be routed to a single StreamTask. This partitioning strategy is useful for joining and aggregating streams.</dt>
257 <dt><code>org.apache.samza.container.grouper.stream.GroupBySystemStreamPartitionFactory</code></dt>
258 <dd>Assigns each SystemStreamPartition to its own unique StreamTask. The GroupBySystemStreamPartitionFactory is useful in cases where you want increased parallelism (more containers), and don't care about co-locating partitions for grouping or joins, since it allows for a greater number of StreamTasks to be divided up amongst Samza containers.</dd>
259 </dl>
260 </td>
261 </tr>
262
263 <tr>
264 <td class="property" id="job-systemstreampartition-matcher-class">job.systemstreampartition.<br>matcher.class</td>
265 <td class="default"></td>
266 <td class="description">
267 If you want to enable static partition assignment, then this is a <strong>required</strong> configuration.
268 The value of this property is a fully-qualified Java class name that implements the interface
269 <code>org.apache.samza.system.SystemStreamPartitionMatcher</code>.
270 Samza ships with two matcher classes:
271 <dl>
272 <dt><code>org.apache.samza.system.RangeSystemStreamPartitionMatcher</code></dt>
273 <dd>This classes uses a comma separated list of range(s) to determine which partition matches,
274 and thus statically assigned to the Job. For example "2,3,1-2", statically assigns partition
275 1, 2, and 3 for all the specified system and streams (topics in case of Kafka) to the job.
276 For config validation each element in the comma separated list much conform to one of the
277 following regex:
278 <ul>
279 <li><code>"(\\d+)"</code> or </li>
280 <li><code>"(\\d+-\\d+)"</code> </li>
281 </ul>
282 <code>JobConfig.SSP_MATCHER_CLASS_RANGE</code> constant has the canonical name of this class.
283 </dd>
284 </dl>
285 <dl>
286 <dt><code>org.apache.samza.system.RegexSystemStreamPartitionMatcher</code></dt>
287 <dd>This classes uses a standard Java supported regex to determine which partition matches,
288 and thus statically assigned to the Job. For example "[1-2]", statically assigns partition 1 and 2
289 for all the specified system and streams (topics in case of Kafka) to the job.
290 <code>JobConfig.SSP_MATCHER_CLASS_REGEX</code> constant has the canonical name of this class.</dd>
291 </dl>
292 </td>
293 </tr>
294
295 <tr>
296 <td class="property" id="job_systemstreampartition_matcher_config_range">job.systemstreampartition.<br>matcher.config.range</td>
297 <td class="default"></td>
298 <td class="description">
299 If <code>job.systemstreampartition.matcher.class</code> is specified, and the value of this property is
300 <code>org.apache.samza.system.RangeSystemStreamPartitionMatcher</code>, then this property is a
301 <strong>required</strong> configuration. Specify a comma separated list of range(s) to determine which
302 partition matches, and thus statically assigned to the Job. For example "2,3,11-20", statically assigns
303 partition 2, 3, and 11 to 20 for all the specified system and streams (topics in case of Kafka) to the job.
304 A singel configuration value like "19" is valid as well. This statically assigns partition 19.
305 For config validation each element in the comma separated list much conform to one of the
306 following regex:
307 <ul>
308 <li><code>"(\\d+)"</code> or </li>
309 <li><code>"(\\d+-\\d+)"</code> </li>
310 </ul>
311 </td>
312 </tr>
313
314 <tr>
315 <td class="property" id="job_systemstreampartition_matcher_config_regex">job.systemstreampartition.<br>matcher.config.regex</td>
316 <td class="default"></td>
317 <td class="description">
318 If <code>job.systemstreampartition.matcher.class</code> is specified, and the value of this property is
319 <code>org.apache.samza.system.RegexSystemStreamPartitionMatcher</code>, then this property is a
320 <strong>required</strong> configuration. The value should be a valid Java supported regex. For example "[1-2]",
321 statically assigns partition 1 and 2 for all the specified system and streams (topics in case of Kakfa) to the job.
322 </td>
323 </tr>
324
325 <tr>
326 <td class="property" id="job_systemstreampartition_matcher_config_job_factory_regex">job.systemstreampartition.<br>matcher.config.job.factory.regex</td>
327 <td class="default"></td>
328 <td class="description">
329 This configuration can be used to specify the Java supported regex to match the <code>StreamJobFactory</code>
330 for which the static partition assignment should be enabled. This configuration enables the partition
331 assignment feature to be used for custom <code>StreamJobFactory</code>(ies) as well.
332 <p>
333 This config defaults to the following value:
334 <code>"org\\.apache\\.samza\\.job\\.local(.*ProcessJobFactory|.*ThreadJobFactory)"</code>,
335 which enables static partition assignment when <code>job.factory.class</code> is set to
336 <code>org.apache.samza.job.local.ProcessJobFactory</code> or <code>org.apache.samza.job.local.ThreadJobFactory.</code>
337 </p>
338 </td>
339 </tr>
340
341
342 <tr>
343 <td class="property" id="job-checkpoint-validation-enabled">job.checkpoint.<br>validation.enabled</td>
344 <td class="default">true</td>
345 <td class="description">
346 This setting controls if the job should fail(true) or just warn(false) in case the validation of checkpoint partition number fails. <br/> <b>CAUTION</b>: this configuration needs to be used w/ care. It should only be used as a work-around after the checkpoint has been auto-created with wrong number of partitions by mistake.
347 </td>
348 </tr>
349 <tr>
350 <td class="property" id="job-security-manager-factory">job.security.manager.factory</td>
351 <td class="default"></td>
352 <td class="description">
353 This is the factory class used to create the proper <a href="../api/javadocs/org/apache/samza/container/SecurityManager.html">SecurityManager</a> to handle security for Samza containers when running in a secure environment, such as Yarn with Kerberos eanbled.
354 Samza ships with one security manager by default:
355 <dl>
356 <dt><code>org.apache.samza.job.yarn.SamzaYarnSecurityManagerFactory</code></dt>
357 <dd>Supports Samza containers to run properly in a Kerberos enabled Yarn cluster. Each Samza container, once started, will create a <a href="../api/javadocs/org/apache/samza/job/yarn/SamzaContainerSecurityManager.html">SamzaContainerSecurityManager</a>. SamzaContainerSecurityManager runs on its separate thread and update user's delegation tokens at the interval specified by <a href="#yarn-token-renewal-interval-seconds" class="property">yarn.token.renewal.interval.seconds</a>. See <a href="../yarn/yarn-security.html">Yarn Security</a> for details.</dd>
358 </dl>
359 </td>
360 </tr>
361
362 <tr>
363 <td class="property" id="job-container-count">job.container.count</td>
364 <td class="default">1</td>
365 <td class="description">
366 The number of YARN containers to request for running your job. This is the main parameter
367 for controlling the scale (allocated computing resources) of your job: to increase the
368 parallelism of processing, you need to increase the number of containers. The minimum is one
369 container, and the maximum number of containers is the number of task instances (usually the
370 <a href="../container/samza-container.html#tasks-and-partitions">number of input stream partitions</a>).
371 Task instances are evenly distributed across the number of containers that you specify.
372 </td>
373 </tr>
374
375 <tr>
376 <td class="property" id="job-container-single-thread-mode">job.container.single.thread.mode</td>
377 <td class="default">false</td>
378 <td class="description">
379 If set to true, samza will fallback to legacy single-threaded event loop. Default is false, which enables the <a href="../container/event-loop.html">multithreading execution</a>.
380 </td>
381 </tr>
382
383 <tr>
384 <td class="property" id="job-container-thread-pool-size">job.container.thread.pool.size</td>
385 <td class="default"></td>
386 <td class="description">
387 If configured, the container thread pool will be used to run synchronous operations of each task in parallel. The operations include StreamTask.process(), WindowableTask.window(), and internally Task.commit(). Note that the thread pool is not applicable to AsyncStremTask.processAsync(). The size should always be greater than zero. If not configured, all task operations will run in a single thread.
388 </td>
389 </tr>
390
391 <tr>
392 <td class="property" id="job-host_affinity-enabled">job.host-affinity.enabled</td>
393 <td class="default">false</td>
394 <td class="description">
395 This property indicates whether host-affinity is enabled or not. Host-affinity refers to the ability of Samza to request and allocate a container on the same host every time the job is deployed.
396 When host-affinity is enabled, Samza makes a "best-effort" to honor the host-affinity constraint.
397 The property <a href="#cluster-manager-container-request-timeout-ms" class="property">cluster-manager.container.request.timeout.ms</a> determines how long to wait before de-prioritizing the host-affinity constraint and assigning the container to any available resource.
398 <b>Please Note</b>: This feature is tested to work with the FairScheduler in Yarn when continuous-scheduling is enabled.
399 </td>
400 </tr>
401
402 <tr>
403 <td class="property" id="job-changelog-system">job.changelog.system</td>
404 <td class="default"></td>
405 <td class="description">
406 This property specifies a default system for changelog, which will be used with the stream specified in
407 <a href="#stores-changelog" class="property">stores.store-name.changelog</a> config.
408 You can override this system by specifying both the system and the stream in
409 <a href="#stores-changelog" class="property">stores.store-name.changelog</a>.
410 </td>
411 </tr>
412 <tr>
413 <th colspan="3" class="section" id="task"><a href="../api/overview.html">Task configuration</a></th>
414 </tr>
415
416 <tr>
417 <td class="property" id="task-class">task.class</td>
418 <td class="default"></td>
419 <td class="description">
420 <strong>Required:</strong> The fully-qualified name of the Java class which processes
421 incoming messages from input streams. The class must implement
422 <a href="../api/javadocs/org/apache/samza/task/StreamTask.html">StreamTask</a> or
423 <a href="../api/javadocs/org/apache/samza/task/AsyncStreamTask.html">AsyncStreamTask</a>,
424 and may optionally implement
425 <a href="../api/javadocs/org/apache/samza/task/InitableTask.html">InitableTask</a>,
426 <a href="../api/javadocs/org/apache/samza/task/ClosableTask.html">ClosableTask</a> and/or
427 <a href="../api/javadocs/org/apache/samza/task/WindowableTask.html">WindowableTask</a>.
428 The class will be instantiated several times, once for every
429 <a href="../container/samza-container.html#tasks-and-partitions">input stream partition</a>.
430 </td>
431 </tr>
432
433 <tr>
434 <td class="property" id="task-inputs">task.inputs</td>
435 <td class="default"></td>
436 <td class="description">
437 <strong>Required:</strong> A comma-separated list of streams that are consumed by this job.
438 Each stream is given in the format
439 <span class="system">system-name</span>.<span class="stream">stream-name</span>.
440 For example, if you have one input system called <code>my-kafka</code>, and want to consume two
441 Kafka topics called <code>PageViewEvent</code> and <code>UserActivityEvent</code>, then you would set
442 <code>task.inputs=my-kafka.PageViewEvent, my-kafka.UserActivityEvent</code>.
443 </td>
444 </tr>
445
446 <tr>
447 <td class="property" id="task-window-ms">task.window.ms</td>
448 <td class="default">-1</td>
449 <td class="description">
450 If <a href="#task-class" class="property">task.class</a> implements
451 <a href="../api/javadocs/org/apache/samza/task/WindowableTask.html">WindowableTask</a>, it can
452 receive a <a href="../container/windowing.html">windowing callback</a> in regular intervals.
453 This property specifies the time between window() calls, in milliseconds. If the number is
454 negative (the default), window() is never called. Note that Samza is
455 <a href="../container/event-loop.html">single-threaded</a>, so a window() call will never
456 occur concurrently with the processing of a message. If a message is being processed at the
457 time when a window() call is due, the window() call occurs after the processing of the current
458 message has completed.
459 </td>
460 </tr>
461
462 <tr>
463 <td class="property" id="task-checkpoint-factory">task.checkpoint.factory</td>
464 <td class="default"></td>
465 <td class="description">
466 To enable <a href="../container/checkpointing.html">checkpointing</a>, you must set
467 this property to the fully-qualified name of a Java class that implements
468 <a href="../api/javadocs/org/apache/samza/checkpoint/CheckpointManagerFactory.html">CheckpointManagerFactory</a>.
469 This is not required, but recommended for most jobs. If you don't configure checkpointing,
470 and a job or container restarts, it does not remember which messages it has already processed.
471 Without checkpointing, consumer behavior is determined by the
472 <a href="#systems-samza-offset-default" class="property">...samza.offset.default</a>
473 setting, which by default skips any messages that were published while the container was
474 restarting. Checkpointing allows a job to start up where it previously left off.
475 Samza ships with two checkpoint managers by default:
476 <dl>
477 <dt><code>org.apache.samza.checkpoint.file.FileSystemCheckpointManagerFactory</code></dt>
478 <dd>Writes checkpoints to files on the local filesystem. You can configure the file path
479 with the <a href="#task-checkpoint-path" class="property">task.checkpoint.path</a>
480 property. This is a simple option if your job always runs on the same machine.
481 On a multi-machine cluster, this would require a network filesystem mount.</dd>
482 <dt><code>org.apache.samza.checkpoint.kafka.KafkaCheckpointManagerFactory</code></dt>
483 <dd>Writes checkpoints to a dedicated topic on a Kafka cluster. This is the recommended
484 option if you are already using Kafka for input or output streams. Use the
485 <a href="#task-checkpoint-system" class="property">task.checkpoint.system</a>
486 property to configure which Kafka cluster to use for checkpoints.</dd>
487 </dl>
488 </td>
489 </tr>
490
491 <tr>
492 <td class="property" id="task-commit-ms">task.commit.ms</td>
493 <td class="default">60000</td>
494 <td class="description">
495 If <a href="#task-checkpoint-factory" class="property">task.checkpoint.factory</a> is
496 configured, this property determines how often a checkpoint is written. The value is
497 the time between checkpoints, in milliseconds. The frequency of checkpointing affects
498 failure recovery: if a container fails unexpectedly (e.g. due to crash or machine failure)
499 and is restarted, it resumes processing at the last checkpoint. Any messages processed
500 since the last checkpoint on the failed container are processed again. Checkpointing
501 more frequently reduces the number of messages that may be processed twice, but also
502 uses more resources.
503 </td>
504 </tr>
505
506 <tr>
507 <td class="property" id="task-command-class">task.command.class</td>
508 <td class="default">org.apache.samza.job.<br>ShellCommandBuilder</td>
509 <td class="description">
510 The fully-qualified name of the Java class which determines the command line and environment
511 variables for a <a href="../container/samza-container.html">container</a>. It must be a subclass of
512 <a href="../api/javadocs/org/apache/samza/job/CommandBuilder.html">CommandBuilder</a>.
513 This defaults to <code>task.command.class=org.apache.samza.job.ShellCommandBuilder</code>.
514 </td>
515 </tr>
516
517 <tr>
518 <td class="property" id="task-opts">task.opts</td>
519 <td class="default"></td>
520 <td class="description">
521 Any JVM options to include in the command line when executing Samza containers. For example,
522 this can be used to set the JVM heap size, to tune the garbage collector, or to enable
523 <a href="/learn/tutorials/{{site.version}}/remote-debugging-samza.html">remote debugging</a>.
524 This cannot be used when running with <code>ThreadJobFactory</code>. Anything you put in
525 <code>task.opts</code> gets forwarded directly to the commandline as part of the JVM invocation.
526 <dl>
527 <dt>Example: <code>task.opts=-XX:+HeapDumpOnOutOfMemoryError -XX:+UseConcMarkSweepGC</code></dt>
528 </dl>
529 </td>
530 </tr>
531
532 <tr>
533 <td class="property" id="task-java-home">task.java.home</td>
534 <td class="default"></td>
535 <td class="description">
536 The JAVA_HOME path for Samza containers. By setting this property, you can use a java version that is
537 different from your cluster's java version. Remember to set the <code>yarn.am.java.home</code> as well.
538 <dl>
539 <dt>Example: <code>task.java.home=/usr/java/jdk1.8.0_05</code></dt>
540 </dl>
541 </td>
542 </tr>
543
544 <tr>
545 <td class="property" id="task-execute">task.execute</td>
546 <td class="default">bin/run-container.sh</td>
547 <td class="description">
548 The command that starts a Samza container. The script must be included in the
549 <a href="packaging.html">job package</a>. There is usually no need to customize this.
550 </td>
551 </tr>
552
553 <tr>
554 <td class="property" id="task-chooser-class">task.chooser.class</td>
555 <td class="default">org.apache.samza.<br>system.chooser.<br>RoundRobinChooserFactory</td>
556 <td class="description">
557 This property can be optionally set to override the default
558 <a href="../container/streams.html#messagechooser">message chooser</a>, which determines the
559 order in which messages from multiple input streams are processed. The value of this
560 property is the fully-qualified name of a Java class that implements
561 <a href="../api/javadocs/org/apache/samza/system/chooser/MessageChooserFactory.html">MessageChooserFactory</a>.
562 </td>
563 </tr>
564
565 <tr>
566 <td class="property" id="task-drop-deserialization-errors">task.drop.deserialization.errors</td>
567 <td class="default"></td>
568 <td class="description">
569 This property is to define how the system deals with deserialization failure situation. If set to true, the system will
570 skip the error messages and keep running. If set to false, the system with throw exceptions and fail the container. Default
571 is false.
572 </td>
573 </tr>
574
575 <tr>
576 <td class="property" id="task-drop-serialization-errors">task.drop.serialization.errors</td>
577 <td class="default"></td>
578 <td class="description">
579 This property is to define how the system deals with serialization failure situation. If set to true, the system will
580 drop the error messages and keep running. If set to false, the system with throw exceptions and fail the container. Default
581 is false.
582 </td>
583 </tr>
584
585 <tr>
586 <td class="property" id="task-log4j-system">task.log4j.system</td>
587 <td class="default"></td>
588 <td class="description">
589 Specify the system name for the StreamAppender. If this property is not specified in the config,
590 Samza throws exception. (See
591 <a href="logging.html#stream-log4j-appender">Stream Log4j Appender</a>)
592 <dl>
593 <dt>Example: <code>task.log4j.system=kafka</code></dt>
594 </dl>
595 </td>
596 </tr>
597
598 <tr>
599 <td class="property" id="task-log4j-location-info-enabled">task.log4j.location.info.enabled</td>
600 <td class="default">false</td>
601 <td class="description">
602 Defines whether or not to include log4j's LocationInfo data in Log4j StreamAppender messages. LocationInfo includes
603 information such as the file, class, and line that wrote a log message. This setting is only active if the Log4j
604 stream appender is being used. (See <a href="logging.html#stream-log4j-appender">Stream Log4j Appender</a>)
605 <dl>
606 <dt>Example: <code>task.log4j.location.info.enabled=true</code></dt>
607 </dl>
608 </td>
609 </tr>
610
611 <tr>
612 <td class="property" id="task-poll-interval-ms">task.poll.interval.ms</td>
613 <td class="default"></td>
614 <td class="description">
615 Samza's container polls for more messages under two conditions. The first condition arises when there are simply no remaining
616 buffered messages to process for any input SystemStreamPartition. The second condition arises when some input
617 SystemStreamPartitions have empty buffers, but some do not. In the latter case, a polling interval is defined to determine how
618 often to refresh the empty SystemStreamPartition buffers. By default, this interval is 50ms, which means that any empty
619 SystemStreamPartition buffer will be refreshed at least every 50ms. A higher value here means that empty SystemStreamPartitions
620 will be refreshed less often, which means more latency is introduced, but less CPU and network will be used. Decreasing this
621 value means that empty SystemStreamPartitions are refreshed more frequently, thereby introducing less latency, but increasing
622 CPU and network utilization.
623 </td>
624 </tr>
625
626 <tr>
627 <td class="property" id="task-ignored-exceptions">task.ignored.exceptions</td>
628 <td class="default"></td>
629 <td class="description">
630 This property specifies which exceptions should be ignored if thrown in a task's <code>process</code> or <code>window</code>
631 methods. The exceptions to be ignored should be a comma-separated list of fully-qualified class names of the exceptions or
632 <code>*</code> to ignore all exceptions.
633 </td>
634 </tr>
635
636 <tr>
637 <td class="property" id="task-shutdown-ms">task.shutdown.ms</td>
638 <td class="default">5000</td>
639 <td class="description">
640 This property controls how long the Samza container will wait for an orderly shutdown of task instances.
641 </td>
642 </tr>
643 <tr>
644 <td class="property" id="task-name-grouper-factory">task.name.grouper.factory</td>
645 <td class="default">org.apache.samza.<br>container.grouper.task.<br>GroupByContainerCountFactory</td>
646 <td class="description">
647 The fully-qualified name of the Java class which determines the factory class which will build the TaskNameGrouper.
648 The default configuration value if the property is not present is <code>task.name.grouper.factory=org.apache.samza.container.grouper.task.GroupByContainerCountFactory</code>.<br>
649 The user can specify a custom implementation of the TaskNameGrouperFactory where a custom logic is implemented for grouping the tasks.
650 </td>
651 </tr>
652
653 <tr>
654 <td class="property" id="task-broadcast-inputs">task.broadcast.inputs</td>
655 <td class="default"></td>
656 <td class="description">
657 This property specifies the partitions that all tasks should consume. The systemStreamPartitions you put
658 here will be sent to all the tasks.
659 <dl>
660 <dt>Format: <span class="system">system-name</span>.<span class="stream">stream-name</span>#<i>partitionId</i>
661 or <span class="system">system-name</span>.<span class="stream">stream-name</span>#[<i>startingPartitionId</i>-<i>endingPartitionId</i>]</dt>
662 </dl>
663 <dl>
664 <dt>Example: <code>task.broadcast.inputs=mySystem.broadcastStream#[0-2], mySystem.broadcastStream#0</code></dt>
665 </dl>
666 </td>
667 </tr>
668
669 <tr>
670 <td class="property" id="task-max-concurrency">task.max.concurrency</td>
671 <td class="default">1</td>
672 <td class="description">
673 Max number of outstanding messages being processed per task at a time, and it’s applicable to both StreamTask and AsyncStreamTask. The values can be:
674 <dl>
675 <dt><code>1</code></dt>
676 <dd>Each task processes one message at a time. Next message will wait until the current message process completes. This ensures strict in-order processing.</dd>
677 <dt><code>&gt1</code></dt>
678 <dd>Multiple outstanding messages are allowed to be processed per task at a time. The completion can be out of order. This option increases the parallelism within a task, but may result in out-of-order processing.</dd>
679 </dl>
680 </td>
681 </tr>
682
683 <tr>
684 <td class="property" id="task-callback-timeout-ms">task.callback.timeout.ms</td>
685 <td class="default"></td>
686 <td class="description">
687 This property is for AsyncStreamTask only. It defines the max time interval from processAsync() to callback is fired. When the timeout happens, it will throw a TaskCallbackTimeoutException and shut down the container. Default is no timeout.
688 </td>
689 </tr>
690
691 <tr>
692 <td class="property" id="task-consumer-batch-size">task.consumer.batch.size</td>
693 <td class="default">1</td>
694 <td class="description">
695 If set to a positive integer, the task will try to consume
696 <a href="../container/streams.html#batching">batches</a> with the given number of messages
697 from each input stream, rather than consuming round-robin from all the input streams on
698 each individual message. Setting this property can improve performance in some cases.
699 </td>
700 </tr>
701
702 <tr>
703 <th colspan="3" class="section" id="systems">Systems</th>
704 </tr>
705
706 <tr>
707 <td class="property" id="systems-samza-factory">systems.<span class="system">system-name</span>.<br>samza.factory</td>
708 <td class="default"></td>
709 <td class="description">
710 <strong>Required:</strong> The fully-qualified name of a Java class which provides a
711 <em>system</em>. A system can provide input streams which you can consume in your Samza job,
712 or output streams to which you can write, or both. The requirements on a system are very
713 flexible &mdash; it may connect to a message broker, or read and write files, or use a database,
714 or anything else. The class must implement
715 <a href="../api/javadocs/org/apache/samza/system/SystemFactory.html">SystemFactory</a>.
716 Samza ships with the following implementations:
717 <dl>
718 <dt><code>org.apache.samza.system.kafka.KafkaSystemFactory</code></dt>
719 <dd>Connects to a cluster of <a href="http://kafka.apache.org/">Kafka</a> brokers, allows
720 Kafka topics to be consumed as streams in Samza, allows messages to be published to
721 Kafka topics, and allows Kafka to be used for checkpointing (see
722 <a href="#task-checkpoint-factory" class="property">task.checkpoint.factory</a>).
723 See also <a href="#kafka">configuration of a Kafka system</a>.</dd>
724 <dt><code>org.apache.samza.system.filereader.FileReaderSystemFactory</code></dt>
725 <dd>Reads data from a file on the local filesystem (the stream name is the path of the
726 file to read). The file is read as ASCII, and treated as a stream of messages separated
727 by newline (<code>\n</code>) characters. A task can consume each line of the file as
728 a <code>java.lang.String</code> object. This system does not provide output streams.</dd>
729 </dl>
730 </td>
731 </tr>
732
733 <tr>
734 <td class="property" id="systems-default-stream">systems.<span class="system">system-name</span>.<br>default.stream.*</td>
735 <td class="default"></td>
736 <td class="description">
737 A set of default properties for any stream associated with the system. For example, if
738 "systems.kafka-system.default.stream.replication.factor"=2 was configured, then every Kafka stream
739 created on the kafka-system will have a replication factor of 2 unless the property is explicitly
740 overridden at the stream scope using <a href="#streams-properties">streams properties</a>.
741 </td>
742 </tr>
743
744 <tr>
745 <td class="property" id="systems-samza-key-serde">systems.<span class="system">system-name</span>.<br>default.stream.samza.key.serde</td>
746 <td class="default"></td>
747 <td class="description">
748 The <a href="../container/serialization.html">serde</a> which will be used to deserialize the
749 <em>key</em> of messages on input streams, and to serialize the <em>key</em> of messages on
750 output streams. This property defines the serde for an for all streams in the system. See the
751 <a href="#streams-samza-key-serde">stream-scoped property</a> to define the serde for an
752 individual stream. If both are defined, the stream-level definition takes precedence.
753 The value of this property must be a <span class="serde">serde-name</span> that is registered
754 with <a href="#serializers-registry-class" class="property">serializers.registry.*.class</a>.
755 If this property is not set, messages are passed unmodified between the input stream consumer,
756 the task and the output stream producer.
757 </td>
758 </tr>
759
760 <tr>
761 <td class="property" id="systems-samza-msg-serde">systems.<span class="system">system-name</span>.<br>default.stream.samza.msg.serde</td>
762 <td class="default"></td>
763 <td class="description">
764 The <a href="../container/serialization.html">serde</a> which will be used to deserialize the
765 <em>value</em> of messages on input streams, and to serialize the <em>value</em> of messages on
766 output streams. This property defines the serde for an for all streams in the system. See the
767 <a href="#streams-samza-msg-serde">stream-scoped property</a> to define the serde for an
768 individual stream. If both are defined, the stream-level definition takes precedence.
769 The value of this property must be a <span class="serde">serde-name</span> that is registered
770 with <a href="#serializers-registry-class" class="property">serializers.registry.*.class</a>.
771 If this property is not set, messages are passed unmodified between the input stream consumer,
772 the task and the output stream producer.
773 </td>
774 </tr>
775
776 <tr>
777 <td class="property" id="systems-samza-offset-default">systems.<span class="system">system-name</span>.<br>default.stream.samza.offset.default</td>
778 <td class="default">upcoming</td>
779 <td class="description">
780 If a container starts up without a <a href="../container/checkpointing.html">checkpoint</a>,
781 this property determines where in the input stream we should start consuming. The value must be an
782 <a href="../api/javadocs/org/apache/samza/system/SystemStreamMetadata.OffsetType.html">OffsetType</a>,
783 one of the following:
784 <dl>
785 <dt><code>upcoming</code></dt>
786 <dd>Start processing messages that are published after the job starts. Any messages published while
787 the job was not running are not processed.</dd>
788 <dt><code>oldest</code></dt>
789 <dd>Start processing at the oldest available message in the system, and
790 <a href="reprocessing.html">reprocess</a> the entire available message history.</dd>
791 </dl>
792 This property is for all streams within a system. To set it for an individual stream, see
793 <a href="#streams-samza-offset-default">streams.<span class="stream">stream-id</span>.<br>samza.offset.default</a>
794 If both are defined, the stream-level definition takes precedence.
795 </td>
796 </tr>
797
798 <tr>
799 <td class="property" id="systems-samza-key-serde-legacy">systems.<span class="system">system-name</span>.<br>samza.key.serde</td>
800 <td class="default" rowspan="2"></td>
801 <td class="description">
802 This is deprecated in favor of <a href="#systems-samza-key-serde" class="property">
803 systems.<span class="system">system-name</span>.default.stream.samza.key.serde</a>.
804 </td>
805 </tr>
806 <tr>
807 <td class="property">systems.<span class="system">system-name</span>.<br>streams.<span class="stream">stream-name</span>.<br>samza.key.serde</td>
808 <td class="description">
809 This is deprecated in favor of <a href="#streams-samza-key-serde" class="property">
810 streams.<span class="stream">stream-id</span>.samza.key.serde</a>.
811 </td>
812 </tr>
813
814 <tr>
815 <td class="property" id="systems-samza-msg-serde-legacy">systems.<span class="system">system-name</span>.<br>samza.msg.serde</td>
816 <td class="default" rowspan="2"></td>
817 <td class="description">
818 This is deprecated in favor of <a href="#systems-samza-msg-serde" class="property">
819 systems.<span class="system">system-name</span>.default.stream.samza.msg.serde</a>.
820 </td>
821 </tr>
822 <tr>
823 <td class="property">systems.<span class="system">system-name</span>.<br>streams.<span class="stream">stream-name</span>.<br>samza.msg.serde</td>
824 <td class="description">
825 This is deprecated in favor of <a href="#streams-samza-msg-serde" class="property">
826 streams.<span class="stream">stream-id</span>.samza.msg.serde</a>.
827 </td>
828 </tr>
829
830 <tr>
831 <td class="property" id="systems-samza-offset-default-legacy">systems.<span class="system">system-name</span>.<br>samza.offset.default</td>
832 <td class="default" rowspan="2">upcoming</td>
833 <td class="description">
834 This is deprecated in favor of <a href="#systems-samza-offset-default" class="property">
835 systems.<span class="system">system-name</span>.default.stream.samza.offset.default</a>.
836 </td>
837 </tr>
838 <tr>
839 <td class="property">systems.<span class="system">system-name</span>.<br>streams.<span class="stream">stream-name</span>.<br>samza.offset.default</td>
840 <td class="description">
841 This is deprecated in favor of <a href="#streams-samza-offset-default" class="property">
842 streams.<span class="stream">stream-id</span>.samza.offset.default</a>.
843 </td>
844 </tr>
845
846 <tr>
847 <td class="property" id="systems-streams-samza-reset-offset-legacy">systems.<span class="system">system-name</span>.<br>streams.<span class="stream">stream-name</span>.<br>samza.reset.offset</td>
848 <td>false</td>
849 <td>
850 This is deprecated in favor of <a href="#streams-samza-reset-offset" class="property">
851 streams.<span class="stream">stream-id</span>.samza.reset.offset</a>.
852 </td>
853 </tr>
854
855 <tr>
856 <td class="property" id="systems-streams-samza-priority-legacy">systems.<span class="system">system-name</span>.<br>streams.<span class="stream">stream-name</span>.<br>samza.priority</td>
857 <td>-1</td>
858 <td>
859 This is deprecated in favor of <a href="#streams-samza-priority" class="property">
860 streams.<span class="stream">stream-id</span>.samza.priority</a>.
861 </td>
862 </tr>
863
864 <tr>
865 <td class="property" id="systems-streams-samza-bootstrap-legacy">systems.<span class="system">system-name</span>.<br>streams.<span class="stream">stream-name</span>.<br>samza.bootstrap</td>
866 <td>false</td>
867 <td>
868 This is deprecated in favor of <a href="#streams-samza-bootstrap" class="property">
869 streams.<span class="stream">stream-id</span>.samza.bootstrap</a>.
870 </td>
871 </tr>
872
873 <tr>
874 <th colspan="3" class="section" id="streams"><a href="../container/streams.html">Streams</a></th>
875 </tr>
876
877 <tr>
878 <td class="property" id="streams-system">streams.<span class="stream">stream-id</span>.<br>samza.system</td>
879 <td class="default"></td>
880 <td class="description">
881 The <span class="system">system-name</span> of the system on which this stream will be accessed.
882 This property binds the stream to one of the systems defined with the property
883 systems.<span class="system">system-name</span>.samza.factory. <br>
884 If this property isn't specified, it is inherited from job.default.system.
885 </td>
886 </tr>
887
888 <tr>
889 <td class="property" id="streams-physical-name">streams.<span class="stream">stream-id</span>.<br>samza.physical.name</td>
890 <td class="default"></td>
891 <td class="description">
892 The physical name of the stream on the system on which this stream will be accessed.
893 This is opposed to the stream-id which is the logical name that Samza uses to identify the stream.
894 A physical name could be a Kafka topic name, an HDFS file URN or any other system-specific identifier.
895 </td>
896 </tr>
897
898 <tr>
899 <td class="property" id="streams-samza-key-serde">streams.<span class="stream">stream-id</span>.<br>samza.key.serde</td>
900 <td class="default"></td>
901 <td class="description">
902 The <a href="../container/serialization.html">serde</a> which will be used to deserialize the
903 <em>key</em> of messages on input streams, and to serialize the <em>key</em> of messages on
904 output streams. This property defines the serde for an individual stream. See the
905 <a href="#systems-samza-key-serde">system-scoped property</a> to define the serde for all
906 streams within a system. If both are defined, the stream-level definition takes precedence.
907 The value of this property must be a <span class="serde">serde-name</span> that is registered
908 with <a href="#serializers-registry-class" class="property">serializers.registry.*.class</a>.
909 If this property is not set, messages are passed unmodified between the input stream consumer,
910 the task and the output stream producer.
911 </td>
912 </tr>
913
914 <tr>
915 <td class="property" id="streams-samza-msg-serde">streams.<span class="stream">stream-id</span>.<br>samza.msg.serde</td>
916 <td class="default"></td>
917 <td class="description">
918 The <a href="../container/serialization.html">serde</a> which will be used to deserialize the
919 <em>value</em> of messages on input streams, and to serialize the <em>value</em> of messages on
920 output streams. This property defines the serde for an individual stream. See the
921 <a href="#systems-samza-msg-serde">system-scoped property</a> to define the serde for all
922 streams within a system. If both are defined, the stream-level definition takes precedence.
923 The value of this property must be a <span class="serde">serde-name</span> that is registered
924 with <a href="#serializers-registry-class" class="property">serializers.registry.*.class</a>.
925 If this property is not set, messages are passed unmodified between the input stream consumer,
926 the task and the output stream producer.
927 </td>
928 </tr>
929
930 <tr>
931 <td class="property" id="streams-samza-offset-default">streams.<span class="stream">stream-id</span>.<br>samza.offset.default</td>
932 <td class="default">upcoming</td>
933 <td class="description">
934 If a container starts up without a <a href="../container/checkpointing.html">checkpoint</a>,
935 this property determines where in the input stream we should start consuming. The value must be an
936 <a href="../api/javadocs/org/apache/samza/system/SystemStreamMetadata.OffsetType.html">OffsetType</a>,
937 one of the following:
938 <dl>
939 <dt><code>upcoming</code></dt>
940 <dd>Start processing messages that are published after the job starts. Any messages published while
941 the job was not running are not processed.</dd>
942 <dt><code>oldest</code></dt>
943 <dd>Start processing at the oldest available message in the system, and
944 <a href="reprocessing.html">reprocess</a> the entire available message history.</dd>
945 </dl>
946 This property is for an individual stream. To set it for all streams within a system, see
947 <a href="#systems-samza-offset-default">systems.<span class="system">system-name</span>.<br>samza.offset.default</a>
948 If both are defined, the stream-level definition takes precedence.
949 </td>
950 </tr>
951
952 <tr>
953 <td class="property" id="streams-samza-reset-offset">streams.<span class="stream">stream-id</span>.<br>samza.reset.offset</td>
954 <td class="default">false</td>
955 <td class="description">
956 If set to <code>true</code>, when a Samza container starts up, it ignores any
957 <a href="../container/checkpointing.html">checkpointed offset</a> for this particular input
958 stream. Its behavior is thus determined by the <code>samza.offset.default</code> setting.
959 Note that the reset takes effect <em>every time a container is started</em>, which may be
960 every time you restart your job, or more frequently if a container fails and is restarted
961 by the framework.
962 </td>
963 </tr>
964
965 <tr>
966 <td class="property" id="streams-samza-priority">streams.<span class="stream">stream-id</span>.<br>samza.priority</td>
967 <td class="default">-1</td>
968 <td class="description">
969 If one or more streams have a priority set (any positive integer), they will be processed
970 with <a href="../container/streams.html#prioritizing-input-streams">higher priority</a> than the other streams.
971 You can set several streams to the same priority, or define multiple priority levels by
972 assigning a higher number to the higher-priority streams. If a higher-priority stream has
973 any messages available, they will always be processed first; messages from lower-priority
974 streams are only processed when there are no new messages on higher-priority inputs.
975 </td>
976 </tr>
977
978 <tr>
979 <td class="property" id="streams-samza-bootstrap">streams.<span class="stream">stream-id</span>.<br>samza.bootstrap</td>
980 <td class="default">false</td>
981 <td class="description">
982 If set to <code>true</code>, this stream will be processed as a
983 <a href="../container/streams.html#bootstrapping">bootstrap stream</a>. This means that every time
984 a Samza container starts up, this stream will be fully consumed before messages from any
985 other stream are processed.
986 </td>
987 </tr>
988
989 <tr>
990 <td class="property" id="streams-properties">streams.<span class="stream">stream-id</span>.*</td>
991 <td class="default"></td>
992 <td class="description">
993 Any properties of the stream. These are typically system-specific and can be used by the system
994 for stream creation or validation. Note that the other properties are prefixed with <em>samza.</em>
995 which distinguishes them as Samza properties that are not system-specific.
996 </td>
997 </tr>
998
999 <tr>
1000 <th colspan="3" class="section" id="serdes"><a href="../container/serialization.html">Serializers/Deserializers (Serdes)</a></th>
1001 </tr>
1002
1003 <tr>
1004 <td class="property" id="serializers-registry-class">serializers.registry.<br><span class="serde">serde-name</span>.class</td>
1005 <td class="default"></td>
1006 <td class="description">
1007 Use this property to register a <a href="../container/serialization.html">serializer/deserializer</a>,
1008 which defines a way of encoding application objects as an array of bytes (used for messages
1009 in streams, and for data in persistent storage). You can give a serde any
1010 <span class="serde">serde-name</span> you want, and reference that name in properties like
1011 <a href="#systems-samza-key-serde" class="property">systems.*.samza.key.serde</a>,
1012 <a href="#systems-samza-msg-serde" class="property">systems.*.samza.msg.serde</a>,
1013 <a href="#streams-samza-key-serde" class="property">streams.*.samza.key.serde</a>,
1014 <a href="#streams-samza-msg-serde" class="property">streams.*.samza.msg.serde</a>,
1015 <a href="#stores-key-serde" class="property">stores.*.key.serde</a> and
1016 <a href="#stores-msg-serde" class="property">stores.*.msg.serde</a>.
1017 The value of this property is the fully-qualified name of a Java class that implements
1018 <a href="../api/javadocs/org/apache/samza/serializers/SerdeFactory.html">SerdeFactory</a>.
1019 Samza ships with several serdes:
1020 <dl>
1021 <dt><code>org.apache.samza.serializers.ByteSerdeFactory</code></dt>
1022 <dd>A no-op serde which passes through the undecoded byte array.</dd>
1023 <dt><code>org.apache.samza.serializers.IntegerSerdeFactory</code></dt>
1024 <dd>Encodes <code>java.lang.Integer</code> objects as binary (4 bytes fixed-length big-endian encoding).</dd>
1025 <dt><code>org.apache.samza.serializers.StringSerdeFactory</code></dt>
1026 <dd>Encodes <code>java.lang.String</code> objects as UTF-8.</dd>
1027 <dt><code>org.apache.samza.serializers.JsonSerdeFactory</code></dt>
1028 <dd>Encodes nested structures of <code>java.util.Map</code>, <code>java.util.List</code> etc. as JSON.</dd>
1029 <dt><code>org.apache.samza.serializers.LongSerdeFactory</code></dt>
1030 <dd>Encodes <code>java.lang.Long</code> as binary (8 bytes fixed-length big-endian encoding).</dd>
1031 <dt><code>org.apache.samza.serializers.DoubleSerdeFactory</code></dt>
1032 <dd>Encodes <code>java.lang.Double</code> as binray (8 bytes double-precision float point).</dd>
1033 <dt><code>org.apache.samza.serializers.MetricsSnapshotSerdeFactory</code></dt>
1034 <dd>Encodes <code>org.apache.samza.metrics.reporter.MetricsSnapshot</code> objects (which are
1035 used for <a href="../container/metrics.html">reporting metrics</a>) as JSON.</dd>
1036 <dt><code>org.apache.samza.serializers.KafkaSerdeFactory</code></dt>
1037 <dd>Adapter which allows existing <code>kafka.serializer.Encoder</code> and
1038 <code>kafka.serializer.Decoder</code> implementations to be used as Samza serdes.
1039 Set serializers.registry.<span class="serde">serde-name</span>.encoder and
1040 serializers.registry.<span class="serde">serde-name</span>.decoder to the appropriate
1041 class names.</dd>
1042 </dl>
1043 </td>
1044 </tr>
1045
1046 <tr>
1047 <th colspan="3" class="section" id="filesystem-checkpoints">
1048 Using the filesystem for checkpoints<br>
1049 <span class="subtitle">
1050 (This section applies if you have set
1051 <a href="#task-checkpoint-factory" class="property">task.checkpoint.factory</a>
1052 <code>= org.apache.samza.checkpoint.file.FileSystemCheckpointManagerFactory</code>)
1053 </span>
1054 </th>
1055 </tr>
1056
1057 <tr>
1058 <td class="property" id="task-checkpoint-path">task.checkpoint.path</td>
1059 <td class="default"></td>
1060 <td class="description">
1061 Required if you are using the filesystem for checkpoints. Set this to the path on your local filesystem
1062 where checkpoint files should be stored.
1063 </td>
1064 </tr>
1065
1066 <tr>
1067 <th colspan="3" class="section" id="elasticsearch">
1068 Using <a href="https://github.com/elastic/elasticsearch">Elasticsearch</a> for output streams<br>
1069 <span class="subtitle">
1070 (This section applies if you have set
1071 <a href="#systems-samza-factory" class="property">systems.*.samza.factory</a>
1072 <code>= org.apache.samza.system.elasticsearch.ElasticsearchSystemFactory</code>)
1073 </span>
1074 </th>
1075 </tr>
1076
1077 <tr>
1078 <td class="property" id="systems-samza-client-factory-class">systems.<span class="system">system-name</span>.<br>client.factory</td>
1079 <td class="default"></td>
1080 <td class="description">
1081 <strong>Required:</strong> The elasticsearch client factory used for connecting
1082 to the Elasticsearch cluster. Samza ships with the following implementations:
1083 <dl>
1084 <dt><code>org.apache.samza.system.elasticsearch.client.TransportClientFactory</code></dt>
1085 <dd>Creates a TransportClient that connects to the cluster remotely without
1086 joining it. This requires the transport host and port properties to be set.</dd>
1087 <dt><code>org.apache.samza.system.elasticsearch.client.NodeClientFactory</code></dt>
1088 <dd>Creates a Node client that connects to the cluster by joining it. By default
1089 this uses <a href="http://www.elastic.co/guide/en/elasticsearch/reference/current/modules-discovery-zen.html">zen discovery</a> to find the cluster but other methods can be configured.</dd>
1090 </dl>
1091 </td>
1092 </tr>
1093
1094 <tr>
1095 <td class="property" id="systems-samza-index-request-factory-class">systems.<span class="system">system-name</span>.<br>index.request.factory</td>
1096 <td class="default">org.apache.samza.system</br>.elasticsearch.indexrequest.</br>DefaultIndexRequestFactory</td>
1097 <td class="description">
1098 The index request factory that converts the Samza OutgoingMessageEnvelope into the IndexRequest
1099 to be send to elasticsearch. The default IndexRequestFactory behaves as follows:
1100 <dl>
1101 <dt><code>Stream name</code></dt>
1102 <dd>The stream name is of the format {index-name}/{type-name} which
1103 map on to the elasticsearch index and type.</dd>
1104 <dt><code>Message id</code></dt>
1105 <dd>If the message has a key this is set as the document id, otherwise Elasticsearch will generate one for each document.</dd>
1106 <dt><code>Partition id</code></dt>
1107 <dd>If the partition key is set then this is used as the Elasticsearch routing key.</dd>
1108 <dt><code>Message</code></dt>
1109 <dd>The message must be either a byte[] which is passed directly on to Elasticsearch, or a Map which is passed on to the
1110 Elasticsearch client which serialises it into a JSON String. Samza serdes are not currently supported.</dd>
1111 </dl>
1112 </td>
1113 </tr>
1114
1115 <tr>
1116 <td class="property" id="systems-samza-client-host">systems.<span class="system">system-name</span>.<br>client.transport.host</td>
1117 <td class="default"></td>
1118 <td class="description">
1119 <strong>Required</strong> for <code>TransportClientFactory</code>
1120 <p>The hostname that the TransportClientFactory connects to.</p>
1121 </td>
1122 </tr>
1123
1124 <tr>
1125 <td class="property" id="systems-samza-client-port">systems.<span class="system">system-name</span>.<br>client.transport.port</td>
1126 <td class="default"></td>
1127 <td class="description">
1128 <strong>Required</strong> for <code>TransportClientFactory</code>
1129 <p>The port that the TransportClientFactory connects to.</p>
1130 </td>
1131 </tr>
1132
1133 <tr>
1134 <td class="property" id="systems-samza-client-settings">systems.<span class="system">system-name</span>.<br>client.elasticsearch.*</td>
1135 <td class="default"></td>
1136 <td class="description">
1137 Any <a href="http://www.elastic.co/guide/en/elasticsearch/client/java-api/1.x/client.html">Elasticsearch client settings</a> can be used here. They will all be passed to both the transport and node clients.
1138 Some of the common settings you will want to provide are.
1139 <dl>
1140 <dt><code>systems.<span class="system">system-name</span>.client.elasticsearch.cluster.name</code></dt>
1141 <dd>The name of the Elasticsearch cluster the client is connecting to.</dd>
1142 <dt><code>systems.<span class="system">system-name</span>.client.elasticsearch.client.transport.sniff</code></dt>
1143 <dd>If set to <code>true</code> then the transport client will discover and keep
1144 up to date all cluster nodes. This is used for load balancing and fail-over on retries.</dd>
1145 </dl>
1146 </td>
1147 </tr>
1148
1149 <tr>
1150 <td class="property" id="systems-samza-bulk-flush-max-actions">systems.<span class="system">system-name</span>.<br>bulk.flush.max.actions</td>
1151 <td class="default">1000</td>
1152 <td class="description">
1153 The maximum number of messages to be buffered before flushing.
1154 </td>
1155 </tr>
1156
1157 <tr>
1158 <td class="property" id="systems-samza-bulk-flush-max-size-mb">systems.<span class="system">system-name</span>.<br>bulk.flush.max.size.mb</td>
1159 <td class="default">5</td>
1160 <td class="description">
1161 The maximum aggregate size of messages in the buffered before flushing.
1162 </td>
1163 </tr>
1164
1165 <tr>
1166 <td class="property" id="systems-samza-bulk-flush-interval-ms">systems.<span class="system">system-name</span>.<br>bulk.flush.interval.ms</td>
1167 <td class="default">never</td>
1168 <td class="description">
1169 How often buffered messages should be flushed.
1170 </td>
1171 </tr>
1172
1173 <tr>
1174 <th colspan="3" class="section" id="kafka">
1175 Using <a href="http://kafka.apache.org/">Kafka</a> for input streams, output streams and checkpoints<br>
1176 <span class="subtitle">
1177 (This section applies if you have set
1178 <a href="#systems-samza-factory" class="property">systems.*.samza.factory</a>
1179 <code>= org.apache.samza.system.kafka.KafkaSystemFactory</code>)
1180 </span>
1181 </th>
1182 </tr>
1183
1184 <tr>
1185 <td class="property" id="systems-samza-consumer-zookeeper-connect">systems.<span class="system">system-name</span>.<br>consumer.zookeeper.connect</td>
1186 <td class="default"></td>
1187 <td class="description">
1188 The hostname and port of one or more Zookeeper nodes where information about the
1189 Kafka cluster can be found. This is given as a comma-separated list of
1190 <code>hostname:port</code> pairs, such as
1191 <code>zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181</code>.
1192 If the cluster information is at some sub-path of the Zookeeper namespace, you need to
1193 include the path at the end of the list of hostnames, for example:
1194 <code>zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181/clusters/my-kafka</code>
1195 </td>
1196 </tr>
1197
1198 <tr>
1199 <td class="property" id="systems-samza-consumer-auto-offset-reset">systems.<span class="system">system-name</span>.<br>consumer.auto.offset.reset</td>
1200 <td class="default">largest</td>
1201 <td class="description">
1202 This setting determines what happens if a consumer attempts to read an offset that is
1203 outside of the current valid range. This could happen if the topic does not exist, or
1204 if a checkpoint is older than the maximum message history retained by the brokers.
1205 This property is not to be confused with
1206 <a href="#systems-samza-offset-default">systems.*.samza.offset.default</a>,
1207 which determines what happens if there is no checkpoint. The following are valid
1208 values for <code>auto.offset.reset</code>:
1209 <dl>
1210 <dt><code>smallest</code></dt>
1211 <dd>Start consuming at the smallest (oldest) offset available on the broker
1212 (process as much message history as available).</dd>
1213 <dt><code>largest</code></dt>
1214 <dd>Start consuming at the largest (newest) offset available on the broker
1215 (skip any messages published while the job was not running).</dd>
1216 <dt>anything else</dt>
1217 <dd>Throw an exception and refuse to start up the job.</dd>
1218 </dl>
1219 </td>
1220 </tr>
1221
1222 <tr>
1223 <td class="property" id="systems-samza-consumer">systems.<span class="system">system-name</span>.<br>consumer.*</td>
1224 <td class="default"></td>
1225 <td class="description">
1226 Any <a href="http://kafka.apache.org/documentation.html#consumerconfigs">Kafka consumer configuration</a>
1227 can be included here. For example, to change the socket timeout, you can set
1228 systems.<span class="system">system-name</span>.consumer.socket.timeout.ms.
1229 (There is no need to configure <code>group.id</code> or <code>client.id</code>,
1230 as they are automatically configured by Samza. Also, there is no need to set
1231 <code>auto.commit.enable</code> because Samza has its own checkpointing mechanism.)
1232 </td>
1233 </tr>
1234
1235 <tr>
1236 <td class="property" id="systems-samza-producer-bootstrap-servers">systems.<span class="system">system-name</span>.<br>producer.bootstrap.servers</td>
1237 <td class="default"></td>
1238 <td class="description">
1239 <b>Note</b>:
1240 <i>This variable was previously defined as "producer.metadata.broker.list", which has been deprecated with this version.</i>
1241 <br />
1242 A list of network endpoints where the Kafka brokers are running. This is given as
1243 a comma-separated list of <code>hostname:port</code> pairs, for example
1244 <code>kafka1.example.com:9092,kafka2.example.com:9092,kafka3.example.com:9092</code>.
1245 It's not necessary to list every single Kafka node in the cluster: Samza uses this
1246 property in order to discover which topics and partitions are hosted on which broker.
1247 This property is needed even if you are only consuming from Kafka, and not writing
1248 to it, because Samza uses it to discover metadata about streams being consumed.
1249 </td>
1250 </tr>
1251
1252 <tr>
1253 <td class="property" id="systems-samza-producer">systems.<span class="system">system-name</span>.<br>producer.*</td>
1254 <td class="default"></td>
1255 <td class="description">
1256 Any <a href="http://kafka.apache.org/documentation.html#newproducerconfigs">Kafka producer configuration</a>
1257 can be included here. For example, to change the request timeout, you can set
1258 systems.<span class="system">system-name</span>.producer.timeout.ms.
1259 (There is no need to configure <code>client.id</code> as it is automatically
1260 configured by Samza.)
1261 </td>
1262 </tr>
1263
1264 <tr>
1265 <td class="property" id="systems-samza-fetch-threshold">systems.<span class="system">system-name</span>.<br>samza.fetch.threshold</td>
1266 <td class="default">50000</td>
1267 <td class="description">
1268 When consuming streams from Kafka, a Samza container maintains an in-memory buffer
1269 for incoming messages in order to increase throughput (the stream task can continue
1270 processing buffered messages while new messages are fetched from Kafka). This
1271 parameter determines the number of messages we aim to buffer across all stream
1272 partitions consumed by a container. For example, if a container consumes 50 partitions,
1273 it will try to buffer 1000 messages per partition by default. When the number of
1274 buffered messages falls below that threshold, Samza fetches more messages from the
1275 Kafka broker to replenish the buffer. Increasing this parameter can increase a job's
1276 processing throughput, but also increases the amount of memory used.
1277 </td>
1278 </tr>
1279
1280 <tr>
1281 <td class="property" id="systems-samza-fetch-threshold-bytes">systems.<span class="system">system-name</span>.<br>samza.fetch.threshold.bytes</td>
1282 <td class="default">-1</td>
1283 <td class="description">
1284 When consuming streams from Kafka, a Samza container maintains an in-memory buffer
1285 for incoming messages in order to increase throughput (the stream task can continue
1286 processing buffered messages while new messages are fetched from Kafka). This
1287 parameter determines the total size of messages we aim to buffer across all stream
1288 partitions consumed by a container based on bytes. Defines how many bytes to use for the buffered
1289 prefetch messages for job as a whole. The bytes for a single system/stream/partition are computed based on this.
1290 This fetches the entire messages, hence this bytes limit is a soft one, and the actual usage can be the bytes
1291 limit + size of max message in the partition for a given stream. If the value of this property is > 0
1292 then this takes precedence over systems.<span class="system">system-name</span>.samza.fetch.threshold.<br>
1293 For example, if fetchThresholdBytes is set to 100000 bytes, and there are 50 SystemStreamPartitions registered,
1294 then the per-partition threshold is (100000 / 2) / 50 = 1000 bytes. As this is a soft limit, the actual usage
1295 can be 1000 bytes + size of max message. As soon as a SystemStreamPartition's buffered messages bytes drops
1296 below 1000, a fetch request will be executed to get more data for it.
1297
1298 Increasing this parameter will decrease the latency between when a queue is drained of messages and when new
1299 messages are enqueued, but also leads to an increase in memory usage since more messages will be held in memory.
1300
1301 The default value is -1, which means this is not used.
1302 </td>
1303 </tr>
1304
1305 <tr>
1306 <td class="property" id="task-checkpoint-system">task.checkpoint.system</td>
1307 <td class="default"></td>
1308 <td class="description">
1309 This property is required if you are using Kafka for checkpoints
1310 (<a href="#task-checkpoint-factory" class="property">task.checkpoint.factory</a>
1311 <code>= org.apache.samza.checkpoint.kafka.KafkaCheckpointManagerFactory</code>).
1312 You must set it to the <span class="system">system-name</span> of a Kafka system. The stream
1313 name (topic name) within that system is automatically determined from the job name and ID:
1314 <code>__samza_checkpoint_${<a href="#job-name" class="property">job.name</a>}_${<a href="#job-id" class="property">job.id</a>}</code>
1315 (with underscores in the job name and ID replaced by hyphens).
1316 </td>
1317 </tr>
1318
1319 <tr>
1320 <td class="property" id="task-checkpoint-replication-factor">task.checkpoint.<br>replication.factor</td>
1321 <td class="default">3</td>
1322 <td class="description">
1323 If you are using Kafka for checkpoints, this is the number of Kafka nodes to which you want the
1324 checkpoint topic replicated for durability.
1325 </td>
1326 </tr>
1327
1328 <tr>
1329 <td class="property" id="task-checkpoint-segment-bytes">task.checkpoint.<br>segment.bytes</td>
1330 <td class="default">26214400</td>
1331 <td class="description">
1332 If you are using Kafka for checkpoints, this is the segment size to be used for the checkpoint
1333 topic's log segments. Keeping this number small is useful because it increases the frequency
1334 that Kafka will garbage collect old checkpoints.
1335 </td>
1336 </tr>
1337
1338 <tr>
1339 <td class="property" id="store-changelog-replication-factor">stores.<span class="store">store-name</span>.changelog.<br>replication.factor</td>
1340 <td class="default">stores.default.changelog.replication.factor</td>
1341 <td class="description">
1342 The property defines the number of replicas to use for the change log stream.
1343 </td>
1344 </tr>
1345
1346 <tr>
1347 <td class="property" id="store-default-changelog-replication-factor">stores.default.changelog.replication.factor</td>
1348 <td class="default">2</td>
1349 <td class="description">
1350 This property defines the default number of replicas to use for the change log stream.
1351 </td>
1352 </tr>
1353
1354 <tr>
1355 <td class="property" id="store-changelog-partitions">stores.<span class="store">store-name</span>.changelog.<br>kafka.topic-level-property</td>
1356 <td class="default"></td>
1357 <td class="description">
1358 The property allows you to specify topic level settings for the changelog topic to be created.
1359 For e.g., you can specify the clean up policy as "stores.mystore.changelog.cleanup.policy=delete".
1360 Please refer to http://kafka.apache.org/documentation.html#configuration for more topic level configurations.
1361 </td>
1362 </tr>
1363
1364 <tr>
1365 <th colspan="3" class="section" id="regex-rewriter">
1366 Consuming all Kafka topics matching a regular expression<br>
1367 <span class="subtitle">
1368 (This section applies if you have set
1369 <a href="#job-config-rewriter-class" class="property">job.config.rewriter.*.class</a>
1370 <code>= org.apache.samza.config.RegExTopicGenerator</code>)
1371 </span>
1372 </th>
1373 </tr>
1374
1375 <tr>
1376 <td class="property" id="job-config-rewriter-system">job.config.rewriter.<br><span class="rewriter">rewriter-name</span>.system</td>
1377 <td class="default"></td>
1378 <td class="description">
1379 Set this property to the <span class="system">system-name</span> of the Kafka system
1380 from which you want to consume all matching topics.
1381 </td>
1382 </tr>
1383
1384 <tr>
1385 <td class="property" id="job-config-rewriter-regex">job.config.rewriter.<br><span class="rewriter">rewriter-name</span>.regex</td>
1386 <td class="default"></td>
1387 <td class="description">
1388 A regular expression specifying which topics you want to consume within the Kafka system
1389 <a href="#job-config-rewriter-system" class="property">job.config.rewriter.*.system</a>.
1390 Any topics matched by this regular expression will be consumed <em>in addition to</em> any
1391 topics you specify with <a href="#task-inputs" class="property">task.inputs</a>.
1392 </td>
1393 </tr>
1394
1395 <tr>
1396 <td class="property" id="job-config-rewriter-config">job.config.rewriter.<br><span class="rewriter">rewriter-name</span>.config.*</td>
1397 <td class="default"></td>
1398 <td class="description">
1399 Any properties specified within this namespace are applied to the configuration of streams
1400 that match the regex in
1401 <a href="#job-config-rewriter-regex" class="property">job.config.rewriter.*.regex</a>.
1402 For example, you can set <code>job.config.rewriter.*.config.samza.msg.serde</code> to configure
1403 the deserializer for messages in the matching streams, which is equivalent to setting
1404 <a href="#systems-samza-msg-serde" class="property">systems.*.streams.*.samza.msg.serde</a>
1405 for each topic that matches the regex.
1406 </td>
1407 </tr>
1408
1409 <tr>
1410 <th colspan="3" class="section" id="state"><a href="../container/state-management.html">Storage and State Management</a></th>
1411 </tr>
1412
1413 <tr>
1414 <td class="property" id="stores-factory">stores.<span class="store">store-name</span>.factory</td>
1415 <td class="default"></td>
1416 <td class="description">
1417 This property defines a store, Samza's mechanism for efficient
1418 <a href="../container/state-management.html">stateful stream processing</a>. You can give a
1419 store any <span class="store">store-name</span> except <em>default</em> (the <span class="store">store-name</span>
1420 <em>default</em> is reserved for defining default store parameters), and use that name to get a
1421 reference to the store in your stream task (call
1422 <a href="../api/javadocs/org/apache/samza/task/TaskContext.html#getStore(java.lang.String)">TaskContext.getStore()</a>
1423 in your task's
1424 <a href="../api/javadocs/org/apache/samza/task/InitableTask.html#init(org.apache.samza.config.Config, org.apache.samza.task.TaskContext)">init()</a>
1425 method). The value of this property is the fully-qualified name of a Java class that implements
1426 <a href="../api/javadocs/org/apache/samza/storage/StorageEngineFactory.html">StorageEngineFactory</a>.
1427 Samza currently ships with one storage engine implementation:
1428 <dl>
1429 <dt><code>org.apache.samza.storage.kv.RocksDbKeyValueStorageEngineFactory</code></dt>
1430 <dd>An on-disk storage engine with a key-value interface, implemented using
1431 <a href="http://rocksdb.org/">RocksDB</a>. It supports fast random-access
1432 reads and writes, as well as range queries on keys. RocksDB can be configured with
1433 various <a href="#keyvalue-rocksdb">additional tuning parameters</a>.</dd>
1434 </dl>
1435 </td>
1436 </tr>
1437
1438 <tr>
1439 <td class="property" id="stores-key-serde">stores.<span class="store">store-name</span>.key.serde</td>
1440 <td class="default"></td>
1441 <td class="description">
1442 If the storage engine expects keys in the store to be simple byte arrays, this
1443 <a href="../container/serialization.html">serde</a> allows the stream task to access the
1444 store using another object type as key. The value of this property must be a
1445 <span class="serde">serde-name</span> that is registered with
1446 <a href="#serializers-registry-class" class="property">serializers.registry.*.class</a>.
1447 If this property is not set, keys are passed unmodified to the storage engine
1448 (and the <a href="#stores-changelog">changelog stream</a>, if appropriate).
1449 </td>
1450 </tr>
1451
1452 <tr>
1453 <td class="property" id="stores-msg-serde">stores.<span class="store">store-name</span>.msg.serde</td>
1454 <td class="default"></td>
1455 <td class="description">
1456 If the storage engine expects values in the store to be simple byte arrays, this
1457 <a href="../container/serialization.html">serde</a> allows the stream task to access the
1458 store using another object type as value. The value of this property must be a
1459 <span class="serde">serde-name</span> that is registered with
1460 <a href="#serializers-registry-class" class="property">serializers.registry.*.class</a>.
1461 If this property is not set, values are passed unmodified to the storage engine
1462 (and the <a href="#stores-changelog">changelog stream</a>, if appropriate).
1463 </td>
1464 </tr>
1465
1466 <tr>
1467 <td class="property" id="stores-changelog">stores.<span class="store">store-name</span>.changelog</td>
1468 <td class="default"></td>
1469 <td class="description">
1470 Samza stores are local to a container. If the container fails, the contents of the
1471 store are lost. To prevent loss of data, you need to set this property to configure
1472 a changelog stream: Samza then ensures that writes to the store are replicated to
1473 this stream, and the store is restored from this stream after a failure. The value
1474 of this property is given in the form
1475 <span class="system">system-name</span>.<span class="stream">stream-name</span>.
1476 The "system-name" part is optional. If it is omitted you must specify the system in
1477 <a href="#job-changelog-system" class="property">job.changelog.system</a> config.
1478 Any output stream can be used as changelog, but you must ensure that only one job
1479 ever writes to a given changelog stream (each instance of a job and each store
1480 needs its own changelog stream).
1481 </td>
1482 </tr>
1483
1484 <tr>
1485 <th colspan="3" class="section" id="keyvalue-rocksdb">
1486 Using RocksDB for key-value storage<br>
1487 <span class="subtitle">
1488 (This section applies if you have set
1489 <a href="#stores-factory" class="property">stores.*.factory</a>
1490 <code>= org.apache.samza.storage.kv.RocksDbKeyValueStorageEngineFactory</code>)
1491 </span>
1492 </th>
1493 </tr>
1494
1495 <tr>
1496 <td class="property" id="stores-rocksdb-write-batch-size">stores.<span class="store">store-name</span>.<br>write.batch.size</td>
1497 <td class="default">500</td>
1498 <td class="description">
1499 For better write performance, the storage engine buffers writes and applies them
1500 to the underlying store in a batch. If the same key is written multiple times
1501 in quick succession, this buffer also deduplicates writes to the same key. This
1502 property is set to the number of key/value pairs that should be kept in this
1503 in-memory buffer, per task instance. The number cannot be greater than
1504 <a href="#stores-rocksdb-object-cache-size" class="property">stores.*.object.cache.size</a>.
1505 </td>
1506 </tr>
1507
1508 <tr>
1509 <td class="property" id="stores-rocksdb-object-cache-size">stores.<span class="store">store-name</span>.<br>object.cache.size</td>
1510 <td class="default">1000</td>
1511 <td class="description">
1512 Samza maintains an additional cache in front of RocksDB for frequently-accessed
1513 objects. This cache contains deserialized objects (avoiding the deserialization
1514 overhead on cache hits), in contrast to the RocksDB block cache
1515 (<a href="#stores-rocksdb-container-cache-size-bytes" class="property">stores.*.container.cache.size.bytes</a>),
1516 which caches serialized objects. This property determines the number of objects
1517 to keep in Samza's cache, per task instance. This same cache is also used for
1518 write buffering (see <a href="#stores-rocksdb-write-batch-size" class="property">stores.*.write.batch.size</a>).
1519 A value of 0 disables all caching and batching.
1520 </td>
1521 </tr>
1522
1523 <tr>
1524 <td class="property" id="stores-rocksdb-container-cache-size-bytes">stores.<span class="store">store-name</span>.container.<br>cache.size.bytes</td>
1525 <td class="default">104857600</td>
1526 <td class="description">
1527 The size of RocksDB's block cache in bytes, per container. If there are several
1528 task instances within one container, each is given a proportional share of this cache.
1529 Note that this is an off-heap memory allocation, so the container's total memory use
1530 is the maximum JVM heap size <em>plus</em> the size of this cache.
1531 </td>
1532 </tr>
1533
1534 <tr>
1535 <td class="property" id="stores-rocksdb-container-write-buffer-size-bytes">stores.<span class="store">store-name</span>.container.<br>write.buffer.size.bytes</td>
1536 <td class="default">33554432</td>
1537 <td class="description">
1538 The amount of memory (in bytes) that RocksDB uses for buffering writes before they are
1539 written to disk, per container. If there are several task instances within one
1540 container, each is given a proportional share of this buffer. This setting also
1541 determines the size of RocksDB's segment files.
1542 </td>
1543 </tr>
1544
1545 <tr>
1546 <td class="property" id="stores-rocksdb-compression">stores.<span class="store">store-name</span>.<br>rocksdb.compression</td>
1547 <td class="default">snappy</td>
1548 <td class="description">
1549 This property controls whether RocksDB should compress data on disk and in the
1550 block cache. The following values are valid:
1551 <dl>
1552 <dt><code>snappy</code></dt>
1553 <dd>Compress data using the <a href="https://code.google.com/p/snappy/">Snappy</a> codec.</dd>
1554 <dt><code>bzip2</code></dt>
1555 <dd>Compress data using the <a href="http://en.wikipedia.org/wiki/Bzip2">bzip2</a> codec.</dd>
1556 <dt><code>zlib</code></dt>
1557 <dd>Compress data using the <a href="http://en.wikipedia.org/wiki/Zlib">zlib</a> codec.</dd>
1558 <dt><code>lz4</code></dt>
1559 <dd>Compress data using the <a href="https://code.google.com/p/lz4/">lz4</a> codec.</dd>
1560 <dt><code>lz4hc</code></dt>
1561 <dd>Compress data using the <a href="https://code.google.com/p/lz4/">lz4hc</a> (high compression) codec.</dd>
1562 <dt><code>none</code></dt>
1563 <dd>Do not compress data.</dd>
1564 </dl>
1565 </td>
1566 </tr>
1567
1568 <tr>
1569 <td class="property" id="stores-rocksdb-block-size-bytes">stores.<span class="store">store-name</span>.<br>rocksdb.block.size.bytes</td>
1570 <td class="default">4096</td>
1571 <td class="description">
1572 If compression is enabled, RocksDB groups approximately this many uncompressed bytes
1573 into one compressed block. You probably don't need to change this property.
1574 </td>
1575 </tr>
1576
1577 <tr>
1578 <td class="property" id="stores-rocksdb-ttl">stores.<span class="store">store-name</span>.<br>rocksdb.ttl.ms</td>
1579 <td class="default"></td>
1580 <td class="description">
1581 The time-to-live of the store. Please note it's not a strict TTL limit (removed only
1582 after compaction). Please use caution opening a database with and without TTL, as it might corrupt the
1583 database. Please make sure to read the <a href="https://github.com/facebook/rocksdb/wiki/Time-to-Live">constraints</a> before using.
1584 </td>
1585 </tr>
1586
1587 <tr>
1588 <td class="property" id="stores-rocksdb-compaction-style">stores.<span class="store">store-name</span>.<br>rocksdb.compaction.style</td>
1589 <td class="default">universal</td>
1590 <td class="description">
1591 This property controls the compaction style that RocksDB will employ when compacting its levels. The following values are valid:
1592 <dl>
1593 <dt><code>universal</code></dt>
1594 <dd>Use <a href="https://github.com/facebook/rocksdb/wiki/Universal-Compaction">universal</a> compaction.</dd>
1595 <dt><code>fifo</code></dt>
1596 <dd>Use <a href="https://github.com/facebook/rocksdb/wiki/FIFO-compaction-style">FIFO</a> compaction.</dd>
1597 <dt><code>level</code></dt>
1598 <dd>Use RocksDB's standard leveled compaction.</dd>
1599 </dl>
1600 </td>
1601 </tr>
1602
1603 <tr>
1604 <td class="property" id="stores-rocksdb-num-write-buffers">stores.<span class="store">store-name</span>.<br>rocksdb.num.write.buffers</td>
1605 <td class="default">3</td>
1606 <td class="description">
1607 Configures the <a href="https://github.com/facebook/rocksdb/wiki/Basic-Operations#write-buffer">number of write buffers</a> that a RocksDB store uses. This allows RocksDB to continue taking writes to other buffers even while a given write buffer is being flushed to disk.
1608 </td>
1609 </tr>
1610
1611 <tr>
1612 <td class="property" id="stores-rocksdb-log-file-size">stores.<span class="store">store-name</span>.<br>rocksdb.max.log.file.size.bytes</td>
1613 <td class="default">67108864</td>
1614 <td class="description">
1615 The maximum size in bytes of the RocksDB LOG file before it is rotated.
1616 </td>
1617 </tr>
1618
1619 <tr>
1620 <td class="property" id="stores-rocksdb-num-log-files">stores.<span class="store">store-name</span>.<br>rocksdb.keep.log.file.num</td>
1621 <td class="default">2</td>
1622 <td class="description">
1623 The number of RocksDB LOG files (including rotated LOG.old.* files) to keep.
1624 </td>
1625 </tr>
1626
1627 <tr>
1628 <th colspan="3" class="section" id="cluster-manager">
1629 Running Samza with a cluster manager<br>
1630 </th>
1631 </tr>
1632
1633 <tr>
1634 <td class="property" id="cluster-manager-container-memory-mb">cluster-manager.container.<br>memory.mb</td>
1635 <td class="default">1024</td>
1636 <td class="description">
1637 How much memory, in megabytes, to request from the cluster manager per container of your job. Along with
1638 <a href="#cluster-manager-container-cpu-cores" class="property">cluster-manager.container.cpu.cores</a>, this
1639 property determines how many containers the cluster manager will run on one machine. If the container
1640 exceeds this limit, it will be killed, so it is important that the container's actual
1641 memory use remains below the limit. The amount of memory used is normally the JVM heap
1642 size (configured with <a href="#task-opts" class="property">task.opts</a>), plus the
1643 size of any off-heap memory allocation (for example
1644 <a href="#stores-rocksdb-container-cache-size-bytes" class="property">stores.*.container.cache.size.bytes</a>),
1645 plus a safety margin to allow for JVM overheads.
1646 </td>
1647 </tr>
1648
1649 <tr>
1650 <td class="property" id="cluster-manager-container-cpu-cores">cluster-manager.container.<br>cpu.cores</td>
1651 <td class="default">1</td>
1652 <td class="description">
1653 The number of CPU cores to request per container of your job. Each node in the
1654 cluster has a certain number of CPU cores available, so this number (along with
1655 <a href="#cluster-manager-container-memory-mb" class="property">cluster-manager.container.memory.mb</a>)
1656 determines how many containers can be run on one machine.
1657 </td>
1658 </tr>
1659
1660 <tr>
1661 <td class="property" id="cluster-manager-container-retry-count">cluster-manager.container.<br>retry.count</td>
1662 <td class="default">8</td>
1663 <td class="description">
1664 If a container fails, it is automatically restarted by Samza. However, if a container keeps
1665 failing shortly after startup, that indicates a deeper problem, so we should kill the job
1666 rather than retrying indefinitely. This property determines the maximum number of times we are
1667 willing to restart a failed container in quick succession (the time period is configured with
1668 <a href="#cluster-manager-container-retry-window-ms" class="property">cluster-manager.container.retry.window.ms</a>).
1669 Each container in the job is counted separately. If this property is set to 0, any failed
1670 container immediately causes the whole job to fail. If it is set to a negative number, there
1671 is no limit on the number of retries.
1672 </td>
1673 </tr>
1674
1675 <tr>
1676 <td class="property" id="cluster-manager-container-retry-window-ms">cluster-manager.container.<br>retry.window.ms</td>
1677 <td class="default">300000</td>
1678 <td class="description">
1679 This property determines how frequently a container is allowed to fail before we give up and
1680 fail the job. If the same container has failed more than
1681 <a href="#cluster-manager-container-retry-count" class="property">cluster-manager.container.retry.count</a>
1682 times, and the time between failures was less than this property
1683 <code>cluster-manager.container.retry.window.ms</code> (in milliseconds), then we fail the job.
1684 There is no limit to the number of times we will restart a container if the time between
1685 failures is greater than <code>cluster-manager.container.retry.window.ms</code>.
1686 </td>
1687 </tr>
1688
1689 <tr>
1690 <td class="property" id="cluster-manager-jmx-enabled">cluster-manager.jobcoordinator.<br>jmx.enabled</td>
1691 <td class="default">true</td>
1692 <td class="description">
1693 Determines whether a JMX server should be started on the job's JobCoordinator.
1694 (<code>true</code> or <code>false</code>).
1695 </td>
1696 </tr>
1697
1698 <tr>
1699 <td class="property" id="cluster-manager-allocator-sleep-ms">cluster-manager.allocator.<br>sleep.ms</td>
1700 <td class="default">3600</td>
1701 <td class="description">
1702 The container allocator thread is responsible for matching requests to allocated containers.
1703 The sleep interval for this thread is configured using this property.
1704 </td>
1705 </tr>
1706
1707 <tr>
1708 <td class="property" id="cluster-manager-container-request-timeout-ms">cluster-manager.container.<br>request.timeout.ms</td>
1709 <td class="default">5000</td>
1710 <td class="description">
1711 The allocator thread periodically checks the state of the container requests and allocated containers to determine the assignment of a container to an allocated resource.
1712 This property determines the number of milliseconds before a container request is considered to have expired / timed-out.
1713 When a request expires, it gets allocated to any available container that was returned by the cluster manager.
1714 </td>
1715 </tr>
1716
1717 <tr>
1718 <th colspan="3" class="section" id="yarn">
1719 Running your job on a <a href="../jobs/yarn-jobs.html">YARN</a> cluster<br>
1720 <span class="subtitle">
1721 (This section applies if you have set
1722 <a href="#job-factory-class" class="property">job.factory.class</a>
1723 <code>= org.apache.samza.job.yarn.YarnJobFactory</code>)
1724 </span>
1725 </th>
1726 </tr>
1727
1728 <tr>
1729 <td class="property" id="yarn-package-path">yarn.package.path</td>
1730 <td class="default"></td>
1731 <td class="description">
1732 <strong>Required for YARN jobs:</strong> The URL from which the job package can
1733 be downloaded, for example a <code>http://</code> or <code>hdfs://</code> URL.
1734 The job package is a .tar.gz file with a
1735 <a href="../jobs/packaging.html">specific directory structure</a>.
1736 </td>
1737 </tr>
1738
1739 <tr>
1740 <td class="property" id="yarn-container-memory-mb">yarn.container.memory.mb</td>
1741 <td class="default">1024</td>
1742 <td class="description">
1743 This is deprecated in favor of
1744 <a href="#cluster-manager-container-memory-mb" class="property">cluster-manager.container.memory.mb</a>
1745 </td>
1746 </tr>
1747
1748 <tr>
1749 <td class="property" id="yarn-container-cpu-cores">yarn.container.cpu.cores</td>
1750 <td class="default">1</td>
1751 <td class="description">
1752 This is deprecated in favor of
1753 <a href="#cluster-manager-container-cpu-cores" class="property">cluster-manager.container.cpu.cores</a>
1754 </td>
1755 </tr>
1756
1757 <tr>
1758 <td class="property" id="yarn-container-retry-count">yarn.container.<br>retry.count</td>
1759 <td class="default">8</td>
1760 <td class="description">
1761 This is deprecated in favor of
1762 <a href="#cluster-manager-container-retry-count" class="property">cluster-manager.container.retry.count</a>
1763 </td>
1764 </tr>
1765
1766 <tr>
1767 <td class="property" id="yarn-container-retry-window-ms">yarn.container.<br>retry.window.ms</td>
1768 <td class="default">300000</td>
1769 <td class="description">
1770 This is deprecated in favor of
1771 <a href="#cluster-manager-container-retry-window-ms" class="property">cluster-manager.container.retry.window.ms</a>
1772 </td>
1773 </tr>
1774
1775 <tr>
1776 <td class="property" id="yarn-am-container-memory-mb">yarn.am.container.<br>memory.mb</td>
1777 <td class="default">1024</td>
1778 <td class="description">
1779 Each Samza job when running in Yarn has one special container, the
1780 <a href="../yarn/application-master.html">ApplicationMaster</a> (AM), which manages the
1781 execution of the job. This property determines how much memory, in megabytes, to request
1782 from YARN for running the ApplicationMaster.
1783 </td>
1784 </tr>
1785
1786 <tr>
1787 <td class="property" id="yarn-am-opts">yarn.am.opts</td>
1788 <td class="default"></td>
1789 <td class="description">
1790 Any JVM options to include in the command line when executing the Samza
1791 <a href="../yarn/application-master.html">ApplicationMaster</a>. For example, this can be
1792 used to set the JVM heap size, to tune the garbage collector, or to enable remote debugging.
1793 </td>
1794 </tr>
1795
1796 <tr>
1797 <td class="property" id="yarn-am-java-home">yarn.am.java.home</td>
1798 <td class="default"></td>
1799 <td class="description">
1800 The JAVA_HOME path for Samza AM. By setting this property, you can use a java version that is
1801 different from your cluster's java version. Remember to set the <code>task.java.home</code> as well.
1802 <dl>
1803 <dt>Example: <code>yarn.am.java.home=/usr/java/jdk1.8.0_05</code></dt>
1804 </dl>
1805 </td>
1806 </tr>
1807
1808 <tr>
1809 <td class="property" id="yarn-am-poll-interval-ms">yarn.am.poll.interval.ms</td>
1810 <td class="default">1000</td>
1811 <td class="description">
1812 The Samza ApplicationMaster sends regular heartbeats to the YARN ResourceManager
1813 to confirm that it is alive. This property determines the time (in milliseconds)
1814 between heartbeats.
1815 </td>
1816 </tr>
1817
1818 <tr>
1819 <td class="property" id="yarn-am-jmx-enabled">yarn.am.jmx.enabled</td>
1820 <td class="default">true</td>
1821 <td class="description">
1822 This is deprecated in favor of
1823 <a href="#cluster-manager-jmx-enabled" class="property">cluster-manager.jobcoordinator.jmx.enabled</a>
1824 </td>
1825 </tr>
1826
1827 <tr>
1828 <td class="property" id="yarn-allocator-sleep-ms">yarn.allocator.sleep.ms</td>
1829 <td class="default">3600</td>
1830 <td class="description">
1831 This is deprecated in favor of
1832 <a href="#cluster-manager-allocator-sleep-ms" class="property">cluster-manager.allocator.sleep.ms</a>
1833 </td>
1834 </tr>
1835
1836 <tr>
1837 <td class="property" id="yarn-samza-host_affinity-enabled">yarn.samza.host-affinity.enabled</td>
1838 <td class="default">false</td>
1839 <td class="description">
1840 This is deprecated in favor of
1841 <a href="#job-host_affinity-enabled" class="property">job.host-affinity.enabled</a>
1842
1843 </td>
1844 </tr>
1845
1846 <tr>
1847 <td class="property" id="yarn-container-request-timeout-ms">yarn.container.request.timeout.ms</td>
1848 <td class="default">5000</td>
1849 <td class="description">
1850 This is deprecated in favor of
1851 <a href="#cluster-manager-container-request-timeout-ms" class="property">cluster-manager.container.request.timeout.ms</a>
1852 </td>
1853 </tr>
1854
1855 <tr>
1856 <td class="property" id="yarn-queue">yarn.queue</td>
1857 <td class="default"></td>
1858 <td class="description">
1859 Determines which YARN queue will be used for Samza job.
1860 </td>
1861 </tr>
1862
1863 <tr>
1864 <td class="property" id="yarn-kerberos-principal">yarn.kerberos.principal</td>
1865 <td class="default"></td>
1866 <td class="description">
1867 Principal the Samza job uses to authenticate itself into KDC, when running on a Kerberos enabled YARN cluster.
1868 </td>
1869 </tr>
1870
1871 <tr>
1872 <td class="property" id="yarn-kerberos-keytab">yarn.kerberos.keytab</td>
1873 <td class="default"></td>
1874 <td class="description">
1875 The full path to the file containing keytab for the principal, specified by <a href="#yarn-kerberos-principal" class="property">yarn.kerberos.principal</a>.
1876 The keytab file is uploaded to the staging directory unique to each application on HDFS and the Application Master then uses the keytab and principal to
1877 periodically logs in to recreate the delegation tokens.
1878 </td>
1879 </tr>
1880
1881 <tr>
1882 <td class="property" id="yarn-token-renewal-interval-seconds">yarn.token.renewal.interval.seconds</td>
1883 <td class="default"></td>
1884 <td class="description">
1885 The time interval by which the Application Master re-authenticates and renew the delegation tokens. This value should be smaller than the length of time a delegation token is valid on hadoop namenodes before expiration.
1886 </td>
1887 </tr>
1888
1889 <tr>
1890 <th colspan="3" class="section" id="metrics"><a href="../container/metrics.html">Metrics</a></th>
1891 </tr>
1892
1893 <tr>
1894 <td class="property" id="metrics-reporter-class">metrics.reporter.<br><span class="reporter">reporter-name</span>.class</td>
1895 <td class="default"></td>
1896 <td class="description">
1897 Samza automatically tracks various metrics which are useful for monitoring the health
1898 of a job, and you can also track <a href="../container/metrics.html">your own metrics</a>.
1899 With this property, you can define any number of <em>metrics reporters</em> which send
1900 the metrics to a system of your choice (for graphing, alerting etc). You give each reporter
1901 an arbitrary <span class="reporter">reporter-name</span>. To enable the reporter, you need
1902 to reference the <span class="reporter">reporter-name</span> in
1903 <a href="#metrics-reporters" class="property">metrics.reporters</a>.
1904 The value of this property is the fully-qualified name of a Java class that implements
1905 <a href="../api/javadocs/org/apache/samza/metrics/MetricsReporterFactory.html">MetricsReporterFactory</a>.
1906 Samza ships with these implementations by default:
1907 <dl>
1908 <dt><code>org.apache.samza.metrics.reporter.JmxReporterFactory</code></dt>
1909 <dd>With this reporter, every container exposes its own metrics as JMX MBeans. The JMX
1910 server is started on a <a href="../container/jmx.html">random port</a> to avoid
1911 collisions between containers running on the same machine.</dd>
1912 <dt><code>org.apache.samza.metrics.reporter.MetricsSnapshotReporterFactory</code></dt>
1913 <dd>This reporter sends the latest values of all metrics as messages to an output
1914 stream once per minute. The output stream is configured with
1915 <a href="#metrics-reporter-stream" class="property">metrics.reporter.*.stream</a>
1916 and it can use any system supported by Samza.</dd>
1917 </dl>
1918 </td>
1919 </tr>
1920
1921 <tr>
1922 <td class="property" id="metrics-reporters">metrics.reporters</td>
1923 <td class="default"></td>
1924 <td class="description">
1925 If you have defined any metrics reporters with
1926 <a href="#metrics-reporter-class" class="property">metrics.reporter.*.class</a>, you
1927 need to list them here in order to enable them. The value of this property is a
1928 comma-separated list of <span class="reporter">reporter-name</span> tokens.
1929 </td>
1930 </tr>
1931
1932 <tr>
1933 <td class="property" id="metrics-reporter-stream">metrics.reporter.<br><span class="reporter">reporter-name</span>.stream</td>
1934 <td class="default"></td>
1935 <td class="description">
1936 If you have registered the metrics reporter
1937 <a href="#metrics-reporter-class" class="property">metrics.reporter.*.class</a>
1938 <code>= org.apache.samza.metrics.reporter.MetricsSnapshotReporterFactory</code>,
1939 you need to set this property to configure the output stream to which the metrics data
1940 should be sent. The stream is given in the form
1941 <span class="system">system-name</span>.<span class="stream">stream-name</span>,
1942 and the system must be defined in the job configuration. It's fine for many different jobs
1943 to publish their metrics to the same metrics stream. Samza defines a simple
1944 <a href="../container/metrics.html">JSON encoding</a> for metrics; in order to use this
1945 encoding, you also need to configure a serde for the metrics stream:
1946 <ul>
1947 <li><a href="#systems-samza-msg-serde" class="property">streams.*.samza.msg.serde</a>
1948 <code>= metrics-serde</code> (replacing the asterisk with the
1949 <span class="stream">stream-name</span> of the metrics stream)</li>
1950 <li><a href="#serializers-registry-class" class="property">serializers.registry.metrics-serde.class</a>
1951 <code>= org.apache.samza.serializers.MetricsSnapshotSerdeFactory</code>
1952 (registering the serde under a <span class="serde">serde-name</span> of
1953 <code>metrics-serde</code>)</li>
1954 </ul>
1955 </td>
1956 </tr>
1957 <tr>
1958 <td class="property" id="metrics-reporter-polling-interval">metrics.reporter.<br><span class="reporter">reporter-name</span>.interval</td>
1959 <td class="default"></td>
1960 <td class="description">
1961 If you have registered the metrics reporter
1962 <a href="#metrics-reporter-class" class="property">metrics.reporter.*.class</a>
1963 <code>= org.apache.samza.metrics.reporter.MetricsSnapshotReporterFactory</code>,
1964 you can use this property to configure how frequently the reporter will report the metrics
1965 registered with it. The value for this property should be length of the interval between
1966 consecutive metric reporting. This value is in seconds, and should be a positive integer value.
1967 This property is optional and set to 60 by default, which means metrics will be reported every
1968 60 seconds.
1969 </td>
1970 </tr>
1971
1972 <tr>
1973 <th colspan="3" class="section" id="hdfs-system-producer"><a href="../hdfs/producer.html">Writing to HDFS</a></th>
1974 </tr>
1975
1976 <tr>
1977 <td class="property" id="hdfs-writer-class">systems.<span class="system">system-name</span>.<br>.producer.hdfs.writer.class</td>
1978 <td class="default">org.apache.samza.system.hdfs.<br>writer.BinarySequenceFileHdfsWriter</td>
1979 <td class="description">Fully-qualified class name of the HdfsWriter implementation this HDFS Producer system should use</td>
1980 </tr>
1981 <tr>
1982 <td class="property" id="hdfs-compression-type">systems.<span class="system">system-name</span>.<br>.producer.hdfs.compression.type</td>
1983 <td class="default">none</td>
1984 <td class="description">A human-readable label for the compression type to use, such as "gzip" "snappy" etc. This label will be interpreted differently (or ignored) depending on the nature of the HdfsWriter implementation.</td>
1985 </tr>
1986 <tr>
1987 <td class="property" id="hdfs-base-output-dir">systems.<span class="system">system-name</span>.<br>.producer.hdfs.base.output.dir</td>
1988 <td class="default">/user/USERNAME/SYSTEMNAME</td>
1989 <td class="description">The base output directory for HDFS writes. Defaults to the home directory of the user who ran the job, followed by the systemName for this HdfsSystemProducer as defined in the job.properties file.</td>
1990 </tr>
1991 <tr>
1992 <td class="property" id="hdfs-bucketer-class">systems.<span class="system">system-name</span>.<br>.producer.hdfs.bucketer.class</td>
1993 <td class="default">org.apache.samza.system.hdfs.<br>writer.JobNameDateTimeBucketer</td>
1994 <td class="description">Fully-qualified class name of the Bucketer implementation that will manage HDFS paths and file names. Used to batch writes by time, or other similar partitioning methods.</td>
1995 </tr>
1996 <tr>
1997 <td class="property" id="hdfs-bucketer-date-path-format">systems.<span class="system">system-name</span>.<br>.producer.hdfs.bucketer.date.path.format</td>
1998 <td class="default"yyyy_MM_dd></td>
1999 <td class="description">A date format (using Java's SimpleDataFormat syntax) appropriate for use in an HDFS Path, which can configure time-based bucketing of output files.</td>
2000 </tr>
2001 <tr>
2002 <td class="property" id="hdfs-write-batch-size-bytes">systems.<span class="system">system-name</span>.<br>.producer.hdfs.write.batch.size.bytes</td>
2003 <td class="default">268435456</td>
2004 <td class="description">The number of bytes of outgoing messages to write to each HDFS output file before cutting a new file. Defaults to 256MB if not set.</td>
2005 </tr>
2006 <tr>
2007 <td class="property" id="hdfs-write-batch-size-records">systems.<span class="system">system-name</span>.<br>.producer.hdfs.write.batch.size.records</td>
2008 <td class="default">262144</td>
2009 <td class="description">The number of outgoing messages to write to each HDFS output file before cutting a new file. Defaults to 262144 if not set.</td>
2010 </tr>
2011
2012 <tr>
2013 <th colspan="3" class="section" id="hdfs-system-consumer"><a href="../hdfs/consumer.html">Reading from HDFS</a></th>
2014 </tr>
2015
2016 <tr>
2017 <td class="property" id="hdfs-consumer-buffer-capacity">systems.<span class="system">system-name</span>.<br>.consumer.bufferCapacity</td>
2018 <td class="default">10</td>
2019 <td class="description">Capacity of the hdfs consumer buffer - the blocking queue used for storing messages. Larger buffer capacity typically leads to better throughput but consumes more memory.</td>
2020 </tr>
2021 <tr>
2022 <td class="property" id="hdfs-consumer-numMaxRetries">systems.<span class="system">system-name</span>.<br>.consumer.numMaxRetries</td>
2023 <td class="default">10</td>
2024 <td class="description">The number of retry attempts when there is a failure to fetch messages from HDFS, before the container fails.</td>
2025 </tr>
2026 <tr>
2027 <td class="property" id="hdfs-partitioner-whitelist">systems.<span class="system">system-name</span>.<br>.partitioner.defaultPartitioner.whitelist</td>
2028 <td class="default">.*</td>
2029 <td class="description">White list used by directory partitioner to select files in a hdfs directory, in Java Pattern style.</td>
2030 </tr>
2031 <tr>
2032 <td class="property" id="hdfs-partitioner-blacklist">systems.<span class="system">system-name</span>.<br>.partitioner.defaultPartitioner.blacklist</td>
2033 <td class="default"></td>
2034 <td class="description">Black list used by directory partitioner to filter out unwanted files in a hdfs directory, in Java Pattern style.</td>
2035 </tr>
2036 <tr>
2037 <td class="property" id="hdfs-partitioner-group-pattern">systems.<span class="system">system-name</span>.<br>.partitioner.defaultPartitioner.groupPattern</td>
2038 <td class="default"></td>
2039 <td class="description">Group pattern used by directory partitioner for advanced partitioning. The advanced partitioning goes beyond the basic assumption that each file is a partition. With advanced partitioning you can group files into partitions arbitrarily. For example, if you have a set of files as [part-01-a.avro, part-01-b.avro, part-02-a.avro, part-02-b.avro, part-03-a.avro], and you want to organize the partitions as (part-01-a.avro, part-01-b.avro), (part-02-a.avro, part-02-b.avro), (part-03-a.avro), where the numbers in the middle act as a "group identifier", you can then set this property to be "part-[id]-.*" (note that "[id]" is a reserved term here, i.e. you have to literally put it as "[id]"). The partitioner will apply this pattern to all file names and extract the "group identifier" ("[id]" in the pattern), then use the "group identifier" to group files into partitions. See more details in <a href="https://issues.apache.org/jira/secure/attachment/12827670/HDFSSystemConsumer.pdf">HdfsSystemConsumer design doc</a> </td>
2040 </tr>
2041 <tr>
2042 <td class="property" id="hdfs-consumer-reader-type">systems.<span class="system">system-name</span>.<br>.consumer.reader</td>
2043 <td class="default">avro</td>
2044 <td class="description">Type of the file reader for different event formats (avro, plain, json, etc.). "avro" is only type supported for now.</td>
2045 </tr>
2046 <tr>
2047 <td class="property" id="hdfs-staging-directory">systems.<span class="system">system-name</span>.<br>.stagingDirectory</td>
2048 <td class="default"></td>
2049 <td class="description">Staging directory for storing partition description. By default (if not set by users) the value is inherited from "yarn.job.staging.directory" internally. The default value is typically good enough unless you want explicitly use a separate location.</td>
2050 </tr>
2051 </tbody>
2052 </table>
2053 </body>
2054 </html>