aurora.git
5 weeks ago Updating .auroraversion to release version 0.18.0. rel/0.18.0
Santhosh Kumar [Thu, 15 Jun 2017 22:45:25 +0000 (15:45 -0700)] 
Updating .auroraversion to release version 0.18.0.

6 weeks ago Updating .auroraversion to 0.18.0-rc0. rel/0.18.0-rc0
Santhosh Kumar [Sat, 10 Jun 2017 00:11:52 +0000 (17:11 -0700)] 
Updating .auroraversion to 0.18.0-rc0.

6 weeks ago Incrementing snapshot version to 0.19.0-SNAPSHOT.
Santhosh Kumar [Sat, 10 Jun 2017 00:11:52 +0000 (17:11 -0700)] 
Incrementing snapshot version to 0.19.0-SNAPSHOT.

6 weeks ago Updating CHANGELOG for 0.18.0 release.
Santhosh Kumar [Sat, 10 Jun 2017 00:11:52 +0000 (17:11 -0700)] 
Updating CHANGELOG for 0.18.0 release.

6 weeks ago Prepare release notes for 0.18.0
Santhosh Kumar [Fri, 9 Jun 2017 23:44:06 +0000 (16:44 -0700)] 
Prepare release notes for 0.18.0

6 weeks ago Adding gpg key for santhk
Santhosh Kumar Shanmugham [Fri, 9 Jun 2017 22:09:47 +0000 (15:09 -0700)] 
Adding gpg key for santhk

Reviewed at https://reviews.apache.org/r/59495/

7 weeks ago Process rescinds in the same thread pool as offers.
Zameer Manji [Tue, 6 Jun 2017 21:21:22 +0000 (14:21 -0700)] 
Process rescinds in the same thread pool as offers.

In a production environment I was able to observe the following:
```
I0606 00:31:32.510 [Thread-77638, MesosCallbackHandler$MesosCallbackHandlerImpl:229] Offer rescinded: 81e04cbd-9bce-41cf-bd94-38c911f255e4-O142359552
I0606 00:31:32.903 [SchedulerImpl-0, MesosCallbackHandler$MesosCallbackHandlerImpl:211] Received offer: 81e04cbd-9bce-41cf-bd94-38c911f255e4-O142359552
I0606 00:31:34.815 [TaskGroupBatchWorker, VersionedSchedulerDriverService:123] Accepting offer 81e04cbd-9bce-41cf-bd94-38c911f255e4-O142359552 with ops [LAUNCH]
```

Notice that the offer rescind was processed before the actual offer. This is
possible because there is a race in the `MesosCallbackHandlerImpl`. The offer is
processed in the executor (to prevent blocking) and the rescind is handled
directly. This means the offer processing thread (`SchedulerImpl-0`) is racing
against the callback thread (`Thread-77638`).

In normal operation, there will be seconds to minutes between a rescind and an
offer, but in some cases an offer can be rescinded very quickly in clusters that
use oversubscription modules.

To fix this, we move the rescind processing into the same executor as the offer
processing to ensure they are processed in the order they are received. Without
fixing this, the rescinded offer exists in the offer manager and can be used
later to launch a task. This task will immediately fail to launch because the
offer is invalid.
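
A minimal sketch of the ordering guarantee described above, in Python rather than the scheduler's Java. `OfferManager`, `CallbackHandler` and `replay` are hypothetical names for illustration only; the point is that a single-worker executor runs tasks in submission order, so a rescind queued after its offer cannot overtake it.

```python
from concurrent.futures import ThreadPoolExecutor


class OfferManager:
    """Hypothetical stand-in for the scheduler's offer bookkeeping."""

    def __init__(self):
        self.offers = set()
        self.failed_removals = 0  # rescinds that found no matching offer

    def add(self, offer_id):
        self.offers.add(offer_id)

    def remove(self, offer_id):
        if offer_id in self.offers:
            self.offers.remove(offer_id)
        else:
            self.failed_removals += 1


class CallbackHandler:
    def __init__(self, manager):
        self.manager = manager
        # One worker thread: queued tasks run in the order they were submitted.
        self.executor = ThreadPoolExecutor(max_workers=1)

    def on_offer(self, offer_id):
        self.executor.submit(self.manager.add, offer_id)

    def on_rescind(self, offer_id):
        # Queued behind any pending offer instead of running on the caller's thread.
        self.executor.submit(self.manager.remove, offer_id)


def replay(events):
    """Feed (kind, offer_id) pairs through the handler and return the manager."""
    manager = OfferManager()
    handler = CallbackHandler(manager)
    for kind, offer_id in events:
        if kind == "offer":
            handler.on_offer(offer_id)
        else:
            handler.on_rescind(offer_id)
    handler.executor.shutdown(wait=True)  # drain the queue before inspecting state
    return manager
```

With this arrangement, an offer followed by its rescind always leaves the manager empty, whichever thread delivered the callbacks.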

In this patch, I have also added a metric and logging to record when we fail to
remove an offer from the offer manager, and cleaned up the logging to allow
operators to see when an offer was received. With this logging, an operator can
grep for the offer id and see the entire lifecycle of the offer in the
scheduler.

Bugs closed: AURORA-1933

Reviewed at https://reviews.apache.org/r/59853/

7 weeks ago Prioritize adding instances over updating instances during an update
Jordan Ly [Fri, 2 Jun 2017 20:58:42 +0000 (13:58 -0700)] 
Prioritize adding instances over updating instances during an update

Bugs closed: AURORA-1928

Reviewed at https://reviews.apache.org/r/59640/

7 weeks ago Allow custom OfferManager ordering to be injected via Guice modules
David McLaughlin [Fri, 2 Jun 2017 20:43:43 +0000 (13:43 -0700)] 
Allow custom OfferManager ordering to be injected via Guice modules

Reviewed at https://reviews.apache.org/r/59698/

7 weeks ago Improve task history pruning by batch deleting tasks.
Kai Huang [Fri, 2 Jun 2017 20:32:23 +0000 (13:32 -0700)] 
Improve task history pruning by batch deleting tasks.

Bugs closed: AURORA-1929

Reviewed at https://reviews.apache.org/r/59699/

7 weeks ago Enables scalable, high-performance bin-packing approximation by sorting offers. Can...
David McLaughlin [Wed, 31 May 2017 23:30:49 +0000 (16:30 -0700)] 
Enables scalable, high-performance bin-packing approximation by sorting offers. Can be controlled via Scheduler flags.
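
A rough sketch of why sorting offers can approximate bin-packing, written in Python for brevity. The commit message does not say which ordering the scheduler uses; sorting by free CPU is an assumption here, and `first_fit` and `best_fit_by_sorting` are illustrative names, not scheduler APIs.

```python
def first_fit(task_cpus, offers):
    """Return the id of the first offer with enough CPU, or None."""
    for offer_id, free_cpus in offers:
        if free_cpus >= task_cpus:
            return offer_id
    return None


def best_fit_by_sorting(task_cpus, offers):
    # Assumption: order by free CPU, smallest first. First-fit over that
    # ordering then picks the tightest offer that still fits the task.
    ordered = sorted(offers, key=lambda offer: offer[1])
    return first_fit(task_cpus, ordered)
```

Given offers `[("a", 8.0), ("b", 2.0), ("c", 4.0)]` and a 3-CPU task, plain first-fit takes `a`, while the sorted variant takes `c` and leaves the large offer free for bigger tasks.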

Reviewed at https://reviews.apache.org/r/59480/

8 weeks ago Bump logback to 1.2.3 and SLF4J to 1.7.25
Stephan Erb [Tue, 30 May 2017 20:40:31 +0000 (22:40 +0200)] 
Bump logback to 1.2.3 and SLF4J to 1.7.25

The changelog entries that stand out the most:

* The ReentrantLock in OutputStreamAppender is now "unfair". In previous
  versions of logback, a fair lock was used. Fair locks are much slower.
  Just as importantly, logback has no mandate to influence thread scheduling.
* In PatternLayoutBase the same StringBuilder is used over and over to
  reduce memory allocation.

I am unable to observe any improvements in our micro-benchmarks. In any
case, I still think it is worth staying up to date.

Full changelogs:

* https://www.slf4j.org/news.html
* https://logback.qos.ch/news.html

Reviewed at https://reviews.apache.org/r/59417/

2 months ago Normalize state endpoint to reduce API payload size.
David McLaughlin [Thu, 25 May 2017 17:37:59 +0000 (10:37 -0700)] 
Normalize state endpoint to reduce API payload size.

Reviewed at https://reviews.apache.org/r/59565/

2 months ago Add the ability to customize scheduling logic.
David McLaughlin [Thu, 25 May 2017 05:51:54 +0000 (22:51 -0700)] 
Add the ability to customize scheduling logic.

Uses Guice module injection to enable replacing the first-fit scheduling algorithm and associated first-fit preemption logic.

See design/proposal document here: https://docs.google.com/document/d/1fVHLt9AF-YbOCVCDMQmi5DATVusn-tqY8DldKbjVEm0/edit?usp=sharing

Bugs closed: AURORA-1920

Reviewed at https://reviews.apache.org/r/59039/

2 months ago Add cluster state debug endpoint to Scheduler HTTP servlet.
David McLaughlin [Wed, 24 May 2017 22:07:33 +0000 (15:07 -0700)] 
Add cluster state debug endpoint to Scheduler HTTP servlet.

Reviewed at https://reviews.apache.org/r/59539/

2 months ago Fix SchedulingBenchmarks broken in Mesos Maintenance and Update Affinity patches.
David McLaughlin [Tue, 23 May 2017 21:49:16 +0000 (14:49 -0700)] 
Fix SchedulingBenchmarks broken in Mesos Maintenance and Update Affinity patches.

Reviewed at https://reviews.apache.org/r/59478/

2 months ago Added 'aurora task scp' command for copying/retrieving files to the sandbox of a...
Jordan Ly [Thu, 18 May 2017 21:49:23 +0000 (14:49 -0700)] 
Added 'aurora task scp' command for copying/retrieving files to the sandbox of a task instance.

This command essentially mimics scp but expands task instances into their respective user@host:path.
For 'aurora task scp' the sandbox is the relative root. However, you can still use absolute paths
(e.g. /tmp or /var/log). Tilde expansion is not supported (e.g. paths like ~/some/dir will not work)
because such paths would try to access home directories; in this case, the command returns an error.

Example usage:
From host to task sandbox folder: `aurora task scp ~/test.txt cluster/role/env/job/instance:`
From task sandbox folder to host: `aurora task scp cluster/role/env/job/instance:test.txt .`
From task tmp folder to host: `aurora task scp cluster/role/env/job/instance:/tmp/test.txt .`
From one task to another task: `aurora task scp cluster/role/env/job/instance:test.txt cluster/role/env/job/instance:some/dir/`
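
The path rules above can be sketched as follows. This is not the client's implementation: `expand_task_path` and the `lookup` callable (mapping a task key to user, host and sandbox directory) are hypothetical names used only to illustrate the expansion.

```python
import posixpath


def expand_task_path(arg, lookup):
    """Expand 'cluster/role/env/job/instance:path' into 'user@host:path'.

    `lookup` is a hypothetical callable returning (user, host, sandbox_dir)
    for a task key. Arguments without a task key are treated as local paths.
    """
    key, sep, path = arg.partition(":")
    if not sep or key.count("/") != 4:
        return arg  # plain local path, left untouched
    if path.startswith("~"):
        raise ValueError("tilde expansion is not supported: %s" % path)
    user, host, sandbox = lookup(key)
    # Relative paths are rooted at the sandbox; absolute paths are kept as-is.
    remote = path if path.startswith("/") else posixpath.join(sandbox, path)
    return "%s@%s:%s" % (user, host, remote)
```

A relative path such as `test.txt` lands inside the sandbox, `/tmp/test.txt` stays absolute, and `~/some/dir` is rejected with an error.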

Testing Done:
`./pants test src/test/python/apache/aurora/client/cli:cli`
```
23:07:10 00:03       [run]
                     ============== test session starts ===============
                     platform darwin -- Python 2.7.10 -- py-1.4.33 -- pytest-2.6.4
                     plugins: cov, timeout
                     collected 179 items

                     src/test/python/apache/aurora/client/cli/test_config_noun.py ...
                     src/test/python/apache/aurora/client/cli/test_context.py ........
                     src/test/python/apache/aurora/client/cli/test_version.py .
                     src/test/python/apache/aurora/client/cli/test_quota.py .....
                     src/test/python/apache/aurora/client/cli/test_plugins.py .
                     src/test/python/apache/aurora/client/cli/test_client.py ..
                     src/test/python/apache/aurora/client/cli/test_sla.py .....
                     src/test/python/apache/aurora/client/cli/test_open.py .....
                     src/test/python/apache/aurora/client/cli/test_supdate.py .......................................
                     src/test/python/apache/aurora/client/cli/test_restart.py ..........
                     src/test/python/apache/aurora/client/cli/test_status.py .............
                     src/test/python/apache/aurora/client/cli/test_add.py ....
                     src/test/python/apache/aurora/client/cli/test_diff.py ..
                     src/test/python/apache/aurora/client/cli/test_cron.py ..........
                     src/test/python/apache/aurora/client/cli/test_command_hooks.py ..
                     src/test/python/apache/aurora/client/cli/test_options.py ......
                     src/test/python/apache/aurora/client/cli/test_task.py ...............
                     src/test/python/apache/aurora/client/cli/test_create.py ..............
                     src/test/python/apache/aurora/client/cli/test_kill.py ......................
                     src/test/python/apache/aurora/client/cli/test_inspect.py ....
                     src/test/python/apache/aurora/client/cli/test_api_from_cli.py ..
                     src/test/python/apache/aurora/client/cli/test_diff_formatter.py ......

                     ========== 179 passed in 24.88 seconds ===========

23:07:37 00:30   [complete]
               SUCCESS
```

I've also compiled it within the local cluster with Vagrant and used the command to transfer a text file between the scheduler machine and a job I created.

Bugs closed: AURORA-1925

Reviewed at https://reviews.apache.org/r/59163/

2 months ago Fix update affinity cache name
Reza Motamedi [Sun, 14 May 2017 19:13:51 +0000 (21:13 +0200)] 
Fix update affinity cache name

In a previous review (https://reviews.apache.org/r/58636/) I introduced metrics
for BiCache explicit removals and expirations. There I changed the contract so
that callers pass just the _cache name_ instead of the _cache size metric name_.
That change has already been applied to all usages of BiCache. The update affinity
patch had a merge conflict and did not pick up this change, which leads to metric
names such as `update_affinity_cache_size_cache_expiration_removals`,
`update_affinity_cache_size_cache_explicit_removals`,
`update_affinity_cache_size_cache_removals`, `update_affinity_cache_size_cache_size`.
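
To illustrate how the doubled names arise, here is a small Python sketch. The suffixes are read off the metric names listed above; `metric_names` and the corrected cache name `update_affinity` are assumptions for illustration, not the actual BiCache code.

```python
# Suffixes inferred from the metric names quoted in this commit message.
SUFFIXES = (
    "cache_expiration_removals",
    "cache_explicit_removals",
    "cache_removals",
    "cache_size",
)


def metric_names(cache_name):
    """Derive all metric names from a single cache name."""
    return ["%s_%s" % (cache_name, suffix) for suffix in SUFFIXES]
```

Passing the old-style size metric name `update_affinity_cache_size` yields `update_affinity_cache_size_cache_size`, while a plain cache name such as `update_affinity` yields `update_affinity_cache_size`.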

Reviewed at https://reviews.apache.org/r/59231/

2 months ago Adding metrics for removals from BiCache
Reza Motamedi [Mon, 8 May 2017 23:37:09 +0000 (16:37 -0700)] 
Adding metrics for removals from BiCache

Reviewed at https://reviews.apache.org/r/58636/

2 months ago Add best-effort update affinity into the Scheduler.
David McLaughlin [Sat, 6 May 2017 02:11:57 +0000 (19:11 -0700)] 
Add best-effort update affinity into the Scheduler.

Reviewed at https://reviews.apache.org/r/58259/

2 months ago AURORA-1915 Add automatic browser tab open feature for aurora update start
Takuya Kuwahara [Wed, 3 May 2017 04:54:27 +0000 (21:54 -0700)] 
AURORA-1915 Add automatic browser tab open feature for aurora update start

The Aurora client automatically opens a browser tab following `aurora job create` and `aurora cron schedule` commands. This patch provides similar functionality for `aurora update start`.

Reviewed at https://reviews.apache.org/r/58265/

2 months ago AURORA-1922 Expose stats on the number of jobs stored in MemCronJobStore
Mehrdad Nurolahzade [Wed, 3 May 2017 04:25:03 +0000 (21:25 -0700)] 
AURORA-1922 Expose stats on the number of jobs stored in MemCronJobStore

This patch exposes stats on the size of the `jobs` map in `MemCronJobStore`.

Reviewed at https://reviews.apache.org/r/58863/

2 months ago Make sure we track scheduling penalty when no tasks are scheduled.
David McLaughlin [Tue, 2 May 2017 17:24:59 +0000 (10:24 -0700)] 
Make sure we track scheduling penalty when no tasks are scheduled.

Reviewed at https://reviews.apache.org/r/58922/

2 months ago AURORA-1923 Aurora client should not automatically retry non-idempotent operations
Mehrdad Nurolahzade [Tue, 2 May 2017 16:35:36 +0000 (09:35 -0700)] 
AURORA-1923 Aurora client should not automatically retry non-idempotent operations

The Aurora client has a built-in mechanism to automatically retry thrift API operations if the connection with the scheduler times out, experiences a transport exception, or encounters a transient exception on the scheduler side.

Retrying thrift calls due to scheduler connection timeout and transient exceptions (see AURORA-187) is safe. However, as Aurora has no concept of idempotency, its client can retry non-idempotent operations upon encountering transport exceptions, which can lead to nondeterministic situations.

For example, if client requests go through a proxy to reach the scheduler, the client might consider a non-idempotent request failed and automatically retry it even though the original request has been received and processed by the scheduler.

This patch changes Aurora client invocation semantics from "at least once" to "at most once" for non-idempotent operations.
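
A minimal sketch of the two invocation semantics, assuming each operation is tagged as idempotent or not. `TransportError` and `call_with_retries` are hypothetical names, not the client's API, and the sketch covers only transport failures, where the request may already have reached the server.

```python
class TransportError(Exception):
    """Connection-level failure: the request may or may not have been processed."""


def call_with_retries(operation, idempotent, max_attempts=3):
    # Non-idempotent operations get exactly one attempt ("at most once");
    # idempotent ones may be repeated ("at least once").
    attempts = max_attempts if idempotent else 1
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except TransportError:
            if attempt == attempts:
                raise  # out of attempts: surface the failure to the caller
```

With this shape, a dropped connection during a non-idempotent call is reported to the user instead of silently issuing the request a second time.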

Reviewed at https://reviews.apache.org/r/58850/

2 months ago Fix for unnecessary object serializations
Mehrdad Nurolahzade [Fri, 28 Apr 2017 22:19:36 +0000 (15:19 -0700)] 
Fix for unnecessary object serializations

This patch provides a fix for some unnecessary object serializations that happen on high frequency execution paths and contribute to the scheduler's high object creation rate.

Reviewed at https://reviews.apache.org/r/56935/

3 months ago Extend operator documentation
Stephan Erb [Tue, 25 Apr 2017 21:26:43 +0000 (23:26 +0200)] 
Extend operator documentation

Included changes:

* new cluster upgrade instructions
* docs for several best practices collected on the mailing list
* extracted and extended troubleshooting guide for new cluster operators
* several minor formatting fixes

Reviewed at https://reviews.apache.org/r/58651/

3 months ago Bump initial_task_kill_retry_interval to 15s.
Stephan Erb [Tue, 25 Apr 2017 21:18:30 +0000 (23:18 +0200)] 
Bump initial_task_kill_retry_interval to 15s.

It is not very common that kills are dropped by Mesos and have to be retried
by Aurora. It therefore makes sense to slightly increase the retry timeout
so that we don't retry needlessly when Thermos is still busy executing
the lifecycle methods.

By default, Thermos uses the following kill escalation sequence:

  * /quitquitquit
  * wait 5s
  * /abortabortabort
  * wait 5s
  * SIGTERM
  * wait up to 1 minute
  * SIGKILL
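
The timing of that sequence can be worked out with a short sketch. The waits are taken from the list above, assuming each wait runs its full length; `escalation_offsets` and `actions_before` are illustrative helpers, not Thermos code.

```python
# (action, seconds to wait after it), taken from the sequence above.
ESCALATION = [
    ("/quitquitquit", 5),
    ("/abortabortabort", 5),
    ("SIGTERM", 60),  # "up to 1 minute", so this is the worst case
    ("SIGKILL", 0),
]


def escalation_offsets(steps=ESCALATION):
    """Return (action, seconds since the kill started) for each step."""
    offsets, elapsed = [], 0
    for action, wait_after in steps:
        offsets.append((action, elapsed))
        elapsed += wait_after
    return offsets


def actions_before(seconds, steps=ESCALATION):
    """List the actions Thermos has already started `seconds` into the kill."""
    return [action for action, offset in escalation_offsets(steps) if offset < seconds]
```

Under these assumptions SIGTERM is sent at 10s and SIGKILL no later than 70s, so a first retry at 15s arrives after both HTTP lifecycle endpoints have been called.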

Reviewed at https://reviews.apache.org/r/58611/

3 months ago Improve cleanup hints in release and release-candidate scripts
Stephan Erb [Fri, 21 Apr 2017 16:39:48 +0000 (18:39 +0200)] 
Improve cleanup hints in release and release-candidate scripts

Reviewed at https://reviews.apache.org/r/58612/

3 months ago Update to Mesos 1.2.0
Stephan Erb [Mon, 17 Apr 2017 20:31:17 +0000 (22:31 +0200)] 
Update to Mesos 1.2.0

Changelog: https://github.com/apache/mesos/blob/1.2.0/CHANGELOG

Reviewed at https://reviews.apache.org/r/58467/

3 months ago Fix schema to allow multiple task volumes per task.
Zameer Manji [Fri, 7 Apr 2017 12:18:33 +0000 (14:18 +0200)] 
Fix schema to allow multiple task volumes per task.

The original commit adding this feature added an artificial constraint to the
schema that prevented more than one task volume per task. This is because there
was a `UNIQUE` constraint between the volumes table and the task config table,
preventing a task config from being associated with more than one volume.

This patch removes that constraint. As a result some of the MyBatis mappers had
to change and a new migration was added.

Bugs closed: AURORA-1914

Reviewed at https://reviews.apache.org/r/58066/

3 months ago Reliably subscribe to Mesos in the HTTP Driver.
Zameer Manji [Thu, 6 Apr 2017 08:40:41 +0000 (10:40 +0200)] 
Reliably subscribe to Mesos in the HTTP Driver.

As noted in AURORA-1911 the `V1Mesos` driver doesn't retry `SUBSCRIBE` calls if
they fail. This means that after a leader subscribes and disconnects, it is
possible for it to never resubscribe if the Mesos Master is unhealthy.

To fix this, I have moved the subscription into the dedicated
`SchedulerExecutor`, where it continues to attempt to subscribe using truncated
binary backoff. It only stops if we are disconnected or if we successfully
connect.
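
A Python sketch of a retry loop with truncated binary backoff. The base delay, the cap and the random jitter are assumptions, since the commit message does not give the parameters; `backoff_delays` and `subscribe_with_backoff` are hypothetical names.

```python
import random


def backoff_delays(base=1.0, cap=60.0, rng=random.random):
    """Yield delays whose ceiling doubles each attempt and is truncated at `cap`.

    Each delay is a random fraction of the current ceiling (an assumed jitter).
    """
    ceiling = base
    while True:
        yield rng() * ceiling
        ceiling = min(cap, ceiling * 2)


def subscribe_with_backoff(try_subscribe, is_connected, sleep, delays):
    """Retry until subscribed (True) or disconnected (False)."""
    for delay in delays:
        if not is_connected():
            return False
        if try_subscribe():
            return True
        sleep(delay)
```

Truncating the ceiling keeps the retry interval bounded while the master is unhealthy, so the scheduler resubscribes soon after the master recovers.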

Bugs closed: AURORA-1911

Reviewed at https://reviews.apache.org/r/58053/

3 months ago Fix Thermos Health Check for MesosContainerizer with `--nosetuid-health-checks`
Charles Raimbert [Wed, 5 Apr 2017 09:25:03 +0000 (11:25 +0200)] 
Fix Thermos Health Check for MesosContainerizer with `--nosetuid-health-checks`

With MesosContainerizer, the health check is performed using a "mesos-containerizer
launch" process, but there is a bug in how the code determines the user
under which to run the health check process:
https://github.com/apache/aurora/blob/master/src/main/python/apache/aurora/executor/common/health_checker.py#L370
```
health_check_user = (os.getusername() if self._nosetuid_health_checks
            else assigned_task.task.job.role)
```

If the scheduler is configured with `--nosetuid-health-checks` then "os.getusername()"
is executed, but the "os" Python module does not provide a "getusername()" function,
which causes the Thermos execution to abort as follows:
```
D0323 01:08:15.453372 16 aurora_executor.py:159] Task started.
E0323 01:08:15.571124 16 aurora_executor.py:121] Traceback (most recent call last):
File "apache/aurora/executor/aurora_executor.py", line 119, in _run
self._start_status_manager(driver, assigned_task)
File "apache/aurora/executor/aurora_executor.py", line 168, in _start_status_manager
status_checker = status_provider.from_assigned_task(assigned_task, self._sandbox)
File "apache/aurora/executor/common/health_checker.py", line 370, in from_assigned_task
health_check_user = (os.getusername() if self._nosetuid_health_checks
AttributeError: 'module' object has no attribute 'getusername'
```

Following the existing unit testing pattern from test_health_checker.py, a test case
was added to cover the `--nosetuid-health-checks` case for MesosContainerizer.
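
For illustration, one standard-library way to get the current user is `getpass.getuser()`. The sketch below assumes that replacement; the commit message does not show the actual fix. `pick_health_check_user` is a hypothetical helper, and the injectable `current_user` parameter exists only to make it easy to test.

```python
import getpass
import os


def pick_health_check_user(nosetuid_health_checks, role, current_user=getpass.getuser):
    """Choose the user to run the health check process as.

    The os module has no getusername(); getpass.getuser() is used here as an
    assumed replacement. `current_user` can be swapped out in tests.
    """
    return current_user() if nosetuid_health_checks else role
```

Without `--nosetuid-health-checks` the job role is used; with it, the health check runs as whichever user the executor itself runs as.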

Bugs closed: AURORA-1909

Reviewed at https://reviews.apache.org/r/58167/

3 months ago Remove use of deprecated fields in tests
Nicolás Donatucci [Mon, 3 Apr 2017 22:00:14 +0000 (00:00 +0200)] 
Remove use of deprecated fields in tests

Removed the usage of numCpus, ramMb and diskMb from tests and replaced them with
the Resource set when necessary. Also modified the thrift backfill so that it won't
backfill those resource fields anymore.

Related Issue: AURORA-1707

Reviewed at https://reviews.apache.org/r/57881/

3 months ago Ensure enum tables are complete after a snapshot restore.
Zameer Manji [Thu, 30 Mar 2017 18:38:36 +0000 (11:38 -0700)] 
Ensure enum tables are complete after a snapshot restore.

In our in-memory database, we model enums as two-column tables. The two columns
are `id`, which corresponds to the integer value in the thrift enum, and
`name`, which is the all-caps string name of the enum. For example, to model the
`JobUpdateStatus` enum we have a table called `job_update_statuses`. In there
the `ROLLING_FORWARD` enum is modeled as a row `(0, "ROLLING_FORWARD")`. Other
tables reference the enum table via the id.

When we prepare storage on startup the `DbStorage` starts up. It does two
things:
1. Load in the schema.
2. Populate the enum tables.

This ensures that when we insert values into the database, the enum references
will be valid.

However, before we restore from a Snapshot with the `dbScript` field, we blow
all of that data away and restore what was in the snapshot:
````
try (Connection c = ((DataSource) store.getUnsafeStoreAccess()).getConnection()) {
  LOG.info("Dropping all tables");
  try (PreparedStatement drop = c.prepareStatement("DROP ALL OBJECTS")) {
    drop.executeUpdate();
  }
````

This means that if we add a new enum value, and then restore from a snapshot,
that enum value will not exist in the table any more. We could address this by
saying that every enum value addition requires a migration. Instead, I propose
preserving the work done by `DbStorage` by re-hydrating the enum tables after
the restore.

To do this I extracted the logic into a new class `EnumBackfill`. Restoring from
a snapshot calls this after the migrations are done. The underlying SQL was
changed from `INSERT` to `MERGE` to make this work.
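
The idea of an idempotent backfill can be sketched with Python's `sqlite3` module. SQLite has no `MERGE` statement, so `INSERT OR REPLACE` stands in for it here; `backfill_enum`, `rows` and the enum value with id 99 are hypothetical and not part of `EnumBackfill`.

```python
import sqlite3

# Row 0 comes from the example above; row 99 is a made-up "newly added" value.
JOB_UPDATE_STATUSES = {0: "ROLLING_FORWARD", 99: "HYPOTHETICAL_NEW_STATUS"}


def backfill_enum(conn, table, values):
    """Insert or update every enum row; safe to run any number of times."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS %s (id INTEGER PRIMARY KEY, name TEXT NOT NULL)" % table
    )
    # INSERT OR REPLACE plays the role of MERGE: existing rows are overwritten
    # instead of raising a primary key violation.
    conn.executemany(
        "INSERT OR REPLACE INTO %s (id, name) VALUES (?, ?)" % table,
        sorted(values.items()),
    )
    conn.commit()


def rows(conn, table):
    return conn.execute("SELECT id, name FROM %s ORDER BY id" % table).fetchall()
```

Running the backfill against a table restored from an older snapshot adds the missing row and leaves existing rows intact, which is why a plain `INSERT` is not enough.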

Testing Done:
existing tests and e2e tests

I also added a new enum value to `JobUpdateStatus` and observed it was correctly
loaded in.

Bugs closed: AURORA-1912

Reviewed at https://reviews.apache.org/r/58036/

3 months ago Sort the set objects inside TaskConfig during Job diff.
Santhosh Kumar Shanmugham [Thu, 30 Mar 2017 05:28:26 +0000 (22:28 -0700)] 
Sort the set objects inside TaskConfig during Job diff.

Sort the entries in `set` fields inside `TaskConfig` as strings before
shelling out to diff so that the output is consistent and meaningful.
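
A small sketch of the normalization, assuming the config is available as nested Python containers. `normalize` is an illustrative helper, not the client's code.

```python
def normalize(value):
    """Recursively replace sets with lists sorted by string representation."""
    if isinstance(value, (set, frozenset)):
        # Set iteration order is arbitrary; sorting makes the output stable.
        return sorted((normalize(item) for item in value), key=str)
    if isinstance(value, dict):
        return {key: normalize(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [normalize(item) for item in value]
    return value
```

Because the sort key is the string form, numbers order lexically (`10` before `2`), which is enough for diff to line up identical entries.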

Testing Done:
./build-support/jenkins/build.sh

Bugs closed: AURORA-1913

Reviewed at https://reviews.apache.org/r/58054/

3 months ago Reset `framework_registered` metric on disconnection.
Zameer Manji [Wed, 29 Mar 2017 20:35:21 +0000 (13:35 -0700)] 
Reset `framework_registered` metric on disconnection.

Previously the `framework_registered` metric only transitioned from 0 to 1 on
the first registration. On disconnection and registration loss, the metric was
not updated to reflect the loss of registration.

To make this metric more useful, I have moved this metric from the
`SchedulerLifecycle`, where it was tied to the boolean controlling the
LEADER_AWAITING_REGISTRATION -> ACTIVE transition, to `MesosCallbackHandler`. In
`MesosCallbackHandler` it can easily be updated to reflect the current state
of registration.
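
As a rough illustration of the new behavior, the metric can be thought of as a gauge driven by the callbacks. `RegistrationGauge` and its method names are hypothetical and only mirror the description above.

```python
class RegistrationGauge:
    """Tracks whether the framework is currently registered (1) or not (0)."""

    def __init__(self):
        self.framework_registered = 0

    def on_registered(self):
        self.framework_registered = 1

    def on_disconnected(self):
        # Previously nothing happened here, so the metric stayed at 1.
        self.framework_registered = 0
```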

Bugs closed: AURORA-1910

Reviewed at https://reviews.apache.org/r/58017/

4 months ago Support Mesos Maintenance
Zameer Manji [Thu, 23 Mar 2017 21:17:40 +0000 (14:17 -0700)] 
Support Mesos Maintenance

This adds support for Mesos Maintenance per the design doc[1].

Per the design the scheduler gains another parameter,
`unavailability_threshold`. With this threshold the scheduler does the
following:

1. Accept all inverse offers from Mesos.
2. Drain when accepting an inverse offer if the unavailability starts within the
   threshold.
3. Veto any offers with unavailability starting within the threshold.
4. Penalize offers that have unavailability information.

For readability and safety the time based code uses the new `java.time` package
in Java 8, primarily relying on the `Instant` class.
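
The four rules above can be sketched as follows, in Python with `datetime` instead of `java.time`. The function names and return values are hypothetical; only the threshold comparison is taken from the description.

```python
from datetime import datetime, timedelta, timezone


def starts_within(unavailability_start, now, threshold):
    """True if the unavailability begins no later than `threshold` from now."""
    return unavailability_start - now <= threshold


def handle_inverse_offer(unavailability_start, now, threshold):
    # Rule 1: always accept. Rule 2: drain only when the start is close.
    return {"accept": True, "drain": starts_within(unavailability_start, now, threshold)}


def classify_offer(unavailability_start, now, threshold):
    if unavailability_start is None:
        return "ok"  # no maintenance information attached
    if starts_within(unavailability_start, now, threshold):
        return "veto"  # rule 3
    return "penalize"  # rule 4
```

An unavailability that has already started yields a negative difference and is therefore treated as within the threshold.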

[1]: https://docs.google.com/document/d/1Z7dFAm6I1nrBE9S5WHw0D0LApBumkIbHrk0-ceoD2YI/edit#heading=h.n5tvzjaj9llx

Testing Done:
e2e tests

Bugs closed: AURORA-1904

Reviewed at https://reviews.apache.org/r/57717/

4 months ago Make Thermos observer resource collection intervals configurable
Stephan Erb [Tue, 21 Mar 2017 08:29:41 +0000 (09:29 +0100)] 
Make Thermos observer resource collection intervals configurable

We have noticed that on hosts with lots of active tasks (~100) the observer UI
is not usable. Thermos fully utilizes one core but does not render any requests.

Dumping `/threads` indicates the observer might be backlogged by the hundred
concurrent `TaskResourceMonitor` threads. Due to the Python GIL only one can
make progress at a time though.

This patch adds options to control the resource collection interval,
giving operators a way to reduce the CPU pressure.

Testing Done:
./pants test.pytest src/{test,main}/python:: -- -v

Bugs closed: AURORA-1907

Reviewed at https://reviews.apache.org/r/57757/

4 months ago Use Process.oneshot() in latest psutil for faster stats retrieval.
Stephan Erb [Sun, 19 Mar 2017 15:01:50 +0000 (16:01 +0100)] 
Use Process.oneshot() in latest psutil for faster stats retrieval.

Without the Process.oneshot() context manager, stats retrieval can lead to
multiple reads of the same `/proc` filesystem values. The oneshot
context manager enables caching to speed this up. It has been added in
psutil 5.0.

Oneshot docs: https://pythonhosted.org/psutil/#psutil.Process.oneshot
Changelog: https://github.com/giampaolo/psutil/blob/master/HISTORY.rst#520

Bugs closed: AURORA-1907

Reviewed at https://reviews.apache.org/r/57732/

4 months ago Populate `host` and `webURL` fields of FrameworkInfo.
Zameer Manji [Sat, 18 Mar 2017 00:06:15 +0000 (17:06 -0700)] 
Populate `host` and `webURL` fields of FrameworkInfo.

This patch extracts out `FrameworkInfo` construction from the `DriverSettings`
data class to a factory class. This factory class combines the base info
constructed via CLI arguments with the HTTP server's host and port information.
This allows us to populate the `host` and `weburl` fields of framework info,
which enhance the Mesos UI.

This is necessary for users of the `V1_DRIVER` as the new driver does not
automatically populate the `host` field. Further, by using our own host and port
information, we ensure the information in ZooKeeper, the information used for
HTTP redirects and the information in the Mesos UI are all in sync.

Note that in vagrant, the hostname and URL are `aurora.local` because we set the
`--hostname` argument of the scheduler. By default Java will set it to the FQDN
or IP address of the host.

Testing Done:
e2e tests.

Bugs closed: AURORA-1905

Reviewed at https://reviews.apache.org/r/57708/

4 months ago Use --launch_info when invoking MesosContainerizer.
Santhosh Kumar Shanmugham [Tue, 14 Mar 2017 02:20:33 +0000 (19:20 -0700)] 
Use --launch_info when invoking MesosContainerizer.

MesosContainerizer has updated the command line parameters in 1.2.0 and
consolidated the individual arguments into a single ContainerLaunchInfo
protobuf message. Update ThermosExecutor to use the new `--launch_info`
parameter to be compatible with MesosContainerizer. Also check the
containerizer binary interface to decide which form to use, so that the
executor stays backward-compatible.

Bugs closed: AURORA-1882

Reviewed at https://reviews.apache.org/r/55951/

4 months ago Change Resource Validation in ConfigurationManager so that it validates the Resource...
Nicolás Donatucci [Mon, 13 Mar 2017 20:59:42 +0000 (21:59 +0100)] 
Change Resource Validation in ConfigurationManager so that it validates the Resource Set instead of deprecated fields

The Resource validation in ConfigurationManager is now done against the Resource set instead of the NumCpus, RamMb and DiskMb fields.

Related Issue: AURORA-1707

Reviewed at https://reviews.apache.org/r/56395/

4 months ago Reduce log output in `VersionedSchedulerDriverService`.
Zameer Manji [Wed, 8 Mar 2017 21:06:02 +0000 (13:06 -0800)] 
Reduce log output in `VersionedSchedulerDriverService`.

The `acceptOffers` log message outputs the entire `Operation` object which for
the `LAUNCH` type includes the entire `TaskInfo` protobuf. This makes the log
output massive. This reduces the logging to just the type of the operation.

Reviewed at https://reviews.apache.org/r/57404/

4 months ago Remove SerializableClock interface.
Zameer Manji [Tue, 7 Mar 2017 04:21:02 +0000 (20:21 -0800)] 
Remove SerializableClock interface.

This removes the `SerializableClock` interface. We are not serializing `Clock`
classes anywhere, so this should be safe to remove.

Reviewed at https://reviews.apache.org/r/57357/

4 months ago Enable Mesos HTTP API.
Zameer Manji [Thu, 2 Mar 2017 23:07:11 +0000 (15:07 -0800)] 
Enable Mesos HTTP API.

This patch completes the design doc[1] and enables operators to choose between
two V1 Mesos API implementations. The first is `V0Mesos` which offers the V1 API
backed by the scheduler driver and the second is `V1Mesos` which offers the V1
API backed by a new HTTP API implementation.

There are three sets of changes in this patch.

First, the V1 Mesos code requires a Scheduler callback with a different API. To
maximize code reuse, event handling logic was extracted into a
`MesosCallbackHandler` class. `VersionedMesosSchedulerImpl` was created to
implement the new callback interface. Both callbacks now use the handler class
for logic.

Second, a new driver implementation using the new API was created. All of the
logic for the new driver is encapsulated in the
`VersionedSchedulerDriverService` class.

Third, some wiring changes were done to allow for Guice to do its work and
allow for operators to select between the different driver implementations.

[1] https://docs.google.com/document/d/1bWK8ldaQSsRXvdKwTh8tyR_0qMxAlnMW70eOKoU3myo

Testing Done:
The e2e test has been run three times, each time with a different driver option.

Bugs closed: AURORA-1887, AURORA-1888

Reviewed at https://reviews.apache.org/r/57061/

4 months ago Fix scheduler_framework_disconnects stat.
Ilya Pronin [Mon, 27 Feb 2017 19:04:54 +0000 (11:04 -0800)] 
Fix scheduler_framework_disconnects stat.

Refactoring in r/31550 has disabled incrementing scheduler_framework_disconnects
stats. This change brings it back.

Testing Done:
Added a check to `MesosSchedulerImplTest.testDisconnected()`. Manually verified
in Vagrant by starting/stopping mesos-master and querying `/vars` endpoint.

Bugs closed: AURORA-1860

Reviewed at https://reviews.apache.org/r/57074/

4 months ago Currently snapshot times are exposed for the entire snapshot save/apply operation...
Mehrdad Nurolahzade [Sat, 25 Feb 2017 04:31:20 +0000 (20:31 -0800)] 
Currently snapshot times are exposed for the entire snapshot save/apply operation. This patch provides the means to collect finer grained metrics on individual fields in a snapshot.

Bugs closed: AURORA-1870

Reviewed at https://reviews.apache.org/r/55105/

5 months ago Move task conversion during reconciliation into the delayed closure.
David McLaughlin [Wed, 22 Feb 2017 16:41:01 +0000 (08:41 -0800)] 
Move task conversion during reconciliation into the delayed closure.

This is a small change to relieve GC pressure while explicit reconciliation runs. It moves the IScheduledTask -> TaskStatus conversion into the batch processing closure so that any object allocation and collection overhead is delayed until the batch is actually processed. It has a noticeable effect on GC for large numbers of RUNNING tasks.
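
A sketch of the same idea in Python: conversion happens inside the deferred closure, so converted objects for a batch exist only while that batch runs. `schedule_reconciliation` and its `convert`, `send` and `schedule` parameters are hypothetical stand-ins, not scheduler APIs.

```python
def batches(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def schedule_reconciliation(tasks, batch_size, convert, send, schedule):
    """Queue one closure per batch; `schedule(fn)` defers fn to a later time."""
    for batch in batches(tasks, batch_size):
        # The default argument binds this batch to the closure. Conversion is
        # deferred until the closure runs, not done up front for all tasks.
        schedule(lambda batch=batch: send([convert(task) for task in batch]))
```

Converting eagerly would instead hold statuses for every task in memory for the whole reconciliation run, even though only one batch is sent at a time.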

Reviewed at https://reviews.apache.org/r/56797/

5 months ago Add best effort pulse timestamp recovery.
Zameer Manji [Thu, 16 Feb 2017 20:08:34 +0000 (12:08 -0800)] 
Add best effort pulse timestamp recovery.

Currently the scheduler forces all coordinated ("pulsed") updates into
ROLL_FORWARD_AWAITING_PULSE, or ROLL_BACK_AWAITING_PULSE on scheduler
startup/recovery. This is because the last pulse timestamp is not durably stored
and the timestamp of the last pulse is set to 0L (aka no pulse yet).

In cases where the pulse timeout is larger and the failover is fast or frequent,
this causes many updates to unnecessarily transition into a pulse related state
until the next pulse.

It is possible to avoid these unnecessary transitions by traversing the job update
events and initializing the last pulse timestamp to the last event if the last
event was not a pulse event.
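
The recovery rule can be sketched as below. It assumes that a "pulse event" is one whose status is an awaiting-pulse state and that events are `(timestamp_ms, status)` pairs in chronological order; both are assumptions, and `recover_last_pulse_ms` is a hypothetical name.

```python
AWAITING_PULSE = {"ROLL_FORWARD_AWAITING_PULSE", "ROLL_BACK_AWAITING_PULSE"}
NO_PULSE = 0  # "no pulse yet", matching the 0L default described above


def recover_last_pulse_ms(events):
    """Best-effort last pulse timestamp from a chronological event list."""
    if not events:
        return NO_PULSE
    timestamp_ms, status = events[-1]
    # If the update was blocked on a pulse when the scheduler went down,
    # keep it blocked; otherwise treat the last event time as the last pulse.
    return NO_PULSE if status in AWAITING_PULSE else timestamp_ms
```

This is best effort: the true pulse time is not stored durably, so the last event time is only an approximation of when the update was last known to be making progress.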

Bugs closed: AURORA-1890

Reviewed at https://reviews.apache.org/r/56723/

5 months ago Add DSL and E2E changes for per task volume mounts.
Zameer Manji [Wed, 15 Feb 2017 01:09:37 +0000 (17:09 -0800)] 
Add DSL and E2E changes for per task volume mounts.

Enables the client DSL to set per task volume mounts. This also adds an E2E test
that tests per task volume mounting.

Testing Done:
sh ./src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh

Bugs closed: AURORA-1107

Reviewed at https://reviews.apache.org/r/53333/

5 months ago Expose task pruning endpoint in aurora_admin. Useful for scale testing in order to...
David McLaughlin [Tue, 14 Feb 2017 21:33:01 +0000 (13:33 -0800)] 
Expose task pruning endpoint in aurora_admin. Useful for scale testing in order to 'clean up' after a test run, but also useful in production if you have a bad actor inflating the size of your task index.

Bugs closed: AURORA-1893

Reviewed at https://reviews.apache.org/r/56629/

5 months agoDisplaying update id after 'Killed for job update' message for the update that
Abhishek Jain [Mon, 13 Feb 2017 20:11:13 +0000 (12:11 -0800)] 
Displaying update id after 'Killed for job update' message for the update that
resulted in the task getting killed.

Testing Done:
Tests:
------
aurora job create devcluster/www-data/devel/hello_world my_jobs/new_hello_world_job.aurora
aurora update start devcluster/www-data/devel/hello_world my_jobs/new_hello_world_job_update.aurora

Completed Task status information:
-----------------------------------
3 minutes ago - KILLED : Instructed to kill task.
02/09 19:52:53 LOCAL • PENDING
02/09 19:52:53 LOCAL • ASSIGNED
02/09 19:52:54 LOCAL • STARTING • Initializing sandbox.
02/09 19:52:55 LOCAL • RUNNING • No health-check defined, task is assumed healthy.
02/09 19:53:08 LOCAL • KILLING • Killed for job update : 900256bb-9cad-41d6-b330-d74a751239bf
02/09 19:53:10 LOCAL • KILLED • Instructed to kill task.

Build tests:
-------------
./build-support/jenkins/build.sh
./src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh

Bugs closed: AURORA-1806

Reviewed at https://reviews.apache.org/r/56523/

5 months agoAdd additional tests for the conversion of TaskStatus.
Zameer Manji [Wed, 8 Feb 2017 17:57:18 +0000 (09:57 -0800)] 
Add additional tests for the conversion of TaskStatus.

This adds additional testing for the `ProtosConversion` class, ensuring there
is the correct conversion between `SlaveID` and `AgentID`.

Reviewed at https://reviews.apache.org/r/56361/

5 months agoUpdate PMD to 5.5.3 with the for us relevant fixes:
Stephan Erb [Tue, 7 Feb 2017 22:12:45 +0000 (23:12 +0100)] 
Update PMD to 5.5.3 with the for us relevant fixes:

* [java] InvalidSlf4jMessageFormat: False positive with placeholder and exception
* [java] InvalidSlf4jMessageFormat: fails with NPE

Full changelog: https://pmd.github.io/pmd-5.5.3/overview/changelog.html

The increase of the heap size is not really related. However, given the hard to
trace out of memory errors we have seen in some Jenkins builds recently, it is
probably worth a shot.

Testing Done:
./gradlew -Pq build

Reviewed at https://reviews.apache.org/r/56404/

5 months agoMove Aurora to v1 Protobufs.
Zameer Manji [Mon, 6 Feb 2017 23:09:55 +0000 (15:09 -0800)] 
Move Aurora to v1 Protobufs.

This is the first step in moving Aurora to the V1 API from Mesos. This patch
moves most of the code to v1 Protobufs. This means all pieces of code that do
not interact with Mesos now handle only v1 Protobufs.

Classes that interact with Mesos directly are:

* `org.apache.aurora.scheduler.mesos.SchedulerDriverService`
* `org.apache.aurora.scheduler.mesos.MesosSchedulerImpl`
* `org.apache.aurora.scheduler.mesos.DriverFactoryImpl`

These classes handle unversioned Protobufs and use the `ProtosConversion` class
to convert them to v1 Protobufs that can be safely passed to the rest of the
code.

Bugs closed: AURORA-1886

Reviewed at https://reviews.apache.org/r/56265/

5 months agoAdd message parameter to killTasks
Cody Gibb [Mon, 6 Feb 2017 18:43:01 +0000 (10:43 -0800)] 
Add message parameter to killTasks

RPCs such as pauseJobUpdate include a parameter for "a user-specified message
to include with the induced job update state change." This diff provides a
similar optional parameter for the killTasks RPC, which allows users to indicate
the reason why a task was killed, and later inspect that reason when consuming
task events.

Example usage from Aurora CLI:
`$ aurora job killall devcluster/www-data/prod/hello --message "Some message"`

In the task event, the supplied message (if provided) is appended to the
existing template "Killed by <user>", separated by a newline. For the above
example, this looks like: "Killed by aurora\nSome message".
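
The message composition described above can be sketched as below. `kill_audit_message` is a hypothetical helper name, not the scheduler's method; treating only `None` as "not provided" is also an assumption.

```python
from typing import Optional


def kill_audit_message(user: str, message: Optional[str] = None) -> str:
    """Build the task event text: the existing template plus the optional message."""
    base = "Killed by {}".format(user)
    if message is None:
        return base
    # The supplied message is appended on its own line.
    return base + "\n" + message
```

Calling `kill_audit_message("aurora", "Some message")` reproduces the string from the example above.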

Testing Done:
Added a unit test in the scheduler, and a test in the client.

Also manually tested using the Vagrant environment.

Bugs closed: AURORA-1846

Reviewed at https://reviews.apache.org/r/54459/

5 months agoIncrementing snapshot version to 0.18.0-SNAPSHOT.
Stephan Erb [Wed, 1 Feb 2017 08:35:14 +0000 (09:35 +0100)] 
Incrementing snapshot version to 0.18.0-SNAPSHOT.

5 months agoUpdating CHANGELOG for 0.17.0 release.
Stephan Erb [Wed, 1 Feb 2017 08:35:14 +0000 (09:35 +0100)] 
Updating CHANGELOG for 0.17.0 release.

5 months agoPrepare release notes for 0.17.0
Stephan Erb [Wed, 1 Feb 2017 08:03:22 +0000 (09:03 +0100)] 
Prepare release notes for 0.17.0

Reviewed at https://reviews.apache.org/r/56138/

5 months agoSuppress role deprecation warning as replacement is not yet ready.
David McLaughlin [Wed, 1 Feb 2017 07:38:54 +0000 (08:38 +0100)] 
Suppress role deprecation warning as replacement is not yet ready.

The role field was prematurely deprecated in the Mesos project.
https://github.com/apache/mesos/blob/master/include/mesos/mesos.proto#L257

Reviewed at https://reviews.apache.org/r/56131/

5 months agoFixed starting cron jobs when using default_docker_parameters
Steve Niemitz [Tue, 31 Jan 2017 17:18:42 +0000 (18:18 +0100)] 
Fixed starting cron jobs when using default_docker_parameters

The code was previously attempting to re-sanitize the configuration read from
storage rather than just using it as is.  This causes issues if after
sanitization the job no longer passes sanitization (which is the case here w/
default_docker_parameters).

We've been running this in our branch forever.

Bugs closed: AURORA-1684

Reviewed at https://reviews.apache.org/r/54754/

5 months agoFix flapping TestRunnerKillProcessGroup test.
Stephan Erb [Mon, 30 Jan 2017 20:40:14 +0000 (21:40 +0100)] 
Fix flapping TestRunnerKillProcessGroup test.

The test was working when run in isolation, but failed when executing the
entire Thermos test suite.

Bugs closed: AURORA-1809

Reviewed at https://reviews.apache.org/r/56062/

5 months agoFix pendingTasks endpoint in case of multiple TaskGroups per job.
Stephan Erb [Mon, 30 Jan 2017 20:37:59 +0000 (21:37 +0100)] 
Fix pendingTasks endpoint in case of multiple TaskGroups per job.

Central idea of this patch is to change the return value of `getPendingReasons`
from a map keyed by JobKey to a map keyed by `TaskGroupKey`. This prevents the
`IllegalArgumentException` during the map construction.
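
To illustrate why the key choice matters, here is a small sketch that assumes the exception came from duplicate keys being rejected while the map was built. `build_unique_map` imitates that strictness in Python, and the `TaskGroupKey` tuple with a `config_id` field is a simplified, hypothetical stand-in for the real class.

```python
from collections import namedtuple

# Hypothetical, simplified key: a job plus something distinguishing its task groups.
TaskGroupKey = namedtuple("TaskGroupKey", ["job_key", "config_id"])


def build_unique_map(pairs):
    """Build a dict but reject duplicate keys, like a strict immutable-map builder."""
    result = {}
    for key, value in pairs:
        if key in result:
            raise ValueError("duplicate key: {!r}".format(key))
        result[key] = value
    return result


groups = [
    (TaskGroupKey("www/prod/hello", "config-a"), "insufficient CPU"),
    (TaskGroupKey("www/prod/hello", "config-b"), "insufficient RAM"),
]

# Keyed by task group: both entries survive.
by_group = build_unique_map(groups)


def by_job():
    # Keyed by job: the second task group collides with the first.
    return build_unique_map((key.job_key, reason) for key, reason in groups)
```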

Bugs closed: AURORA-1879

Reviewed at https://reviews.apache.org/r/56058/

5 months agoMove deprecated resource validations so they happen after the thrift backfill.
Nicolás Donatucci [Mon, 30 Jan 2017 19:18:59 +0000 (11:18 -0800)] 
Move deprecated resource validations so they happen after the thrift backfill.

As the validations for NumCpus, RamMb and DiskMb happened before the thrift
backfill, those values needed to be set, even though they are deprecated. In the
thrift backfill, if the Resources field is set, then NumCpus, RamMb and DiskMb
are set accordingly.

So by moving those validations, it is now possible to only set the Resources
field instead of having to set the deprecated fields. As the validations are
moved and not removed, the check for the resource values being greater than 0
still happens. Furthermore, if the Resources field is set but there is no
Resource for Ram in the set, the thrift backfill will throw an
IllegalArgumentException.

Some tests were slightly modified because of this, mostly by adding an
unsetResources() operation. This is because as the validations now happen after
the thrift backfill, during the thrift backfill the values in the deprecated
fields are replaced by those in the Resources field (if it is set). There are
also some new tests.
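
The new ordering (backfill first, then validate) can be sketched like this. The dict-based task shape and the names `backfill`, `validate`, and `sanitize` are illustrative assumptions. The message only says that a missing Ram resource raises during backfill; treating missing CPU and disk the same way is an assumption of this sketch.

```python
DEPRECATED_FIELDS = ("num_cpus", "ram_mb", "disk_mb")


def backfill(task):
    """Copy values from the resources field into the deprecated fields, if it is set."""
    task = dict(task)
    resources = task.get("resources")
    if resources is not None:
        for name in DEPRECATED_FIELDS:
            if name not in resources:
                # Assumption: every missing resource is an error, not only RAM.
                raise ValueError("resources is missing " + name)
            task[name] = resources[name]
    return task


def validate(task):
    """Check the deprecated fields, which are populated by now."""
    for name in DEPRECATED_FIELDS:
        if task.get(name, 0) <= 0:
            raise ValueError(name + " must be greater than 0")
    return task


def sanitize(task):
    # Validation runs after the backfill, so setting only `resources` is enough.
    return validate(backfill(task))
```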

Related Issue: AURORA-1707

Testing Done:
src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh

Reviewed at https://reviews.apache.org/r/55982/

5 months agoExpose Thrift server request workload stats
Mehrdad Nurolahzade [Mon, 30 Jan 2017 13:00:15 +0000 (14:00 +0100)] 
Expose Thrift server request workload stats

This patch introduces a number of stats that measure the workload generated by
Thrift server requests.

Current Thrift server stats expose the number and timing of requests received
by the server. However, they fail to reflect the size of the requests. This is
limiting us in having an accurate view of the workload handled by the scheduler.
For example, every call to `restartShards()` is recorded as one event despite
the fact that a request might only restart one shard while another request might
seek to restart 1K shards.
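
The difference between counting requests and counting workload can be shown with a small sketch. The `ThriftStats` class and its counter names are hypothetical and do not reflect the stat names the patch actually exports.

```python
from collections import Counter


class ThriftStats:
    """Tracks both how many requests arrived and how much work they asked for."""

    def __init__(self):
        self.events = Counter()    # one increment per request
        self.workload = Counter()  # incremented by the size of each request

    def record(self, method, size):
        self.events[method] += 1
        self.workload[method] += size


stats = ThriftStats()
stats.record("restartShards", 1)     # restarts a single shard
stats.record("restartShards", 1000)  # restarts 1K shards
```

Both calls look identical in the event counter, while the workload counter shows that one of them was a thousand times larger.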

Bugs closed: AURORA-1826

Reviewed at https://reviews.apache.org/r/55089/

5 months agoPreemption performance improvement and new metrics release notes entry
Mehrdad Nurolahzade [Sat, 28 Jan 2017 09:48:58 +0000 (10:48 +0100)] 
Preemption performance improvement and new metrics release notes entry

Reviewed at https://reviews.apache.org/r/56048/

5 months agoCapture health check output.
Dmitriy Shirchenko [Wed, 25 Jan 2017 21:21:37 +0000 (13:21 -0800)] 
Capture health check output.

Users could really benefit from seeing the output of a failed shell health
check, so this plumbs the output through.

Testing Done:
added unit tests
e2e tests
screenshot attached.

Bugs closed: AURORA-1881

Reviewed at https://reviews.apache.org/r/55902/

5 months agoExpose finer grained offer veto stats
Mehrdad Nurolahzade [Wed, 25 Jan 2017 19:26:56 +0000 (13:26 -0600)] 
Expose finer grained offer veto stats

Bugs closed: AURORA-1835

Reviewed at https://reviews.apache.org/r/55020/

6 months agoConsider reserving for multiple tasks per preemption round
Mehrdad Nurolahzade [Tue, 24 Jan 2017 18:20:37 +0000 (19:20 +0100)] 
Consider reserving for multiple tasks per preemption round

To be fair, PendingTaskProcessor interleaves tasks from different groups.
However, this fairness comes at the price of increasing reservation time.
Even if reservations are being made for the same task group, the processor
would still restart iterating through slaves for each task instance. This
results in reevaluating all slaves already rejected in a previous search
before it finds a new viable candidate.

This patch improves `PendingTaskProcessor` performance by reducing slave
search/evaluation time, at the cost of reduced fairness.
`PendingTaskProcessor` now does reservation for a configurable maximum of
_N_ candidates per task group in each iteration over the list of slaves.
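
A rough sketch of the batching idea, under heavy simplification: slaves are `(id, free_cpus)` pairs, a task group is `(cpus_needed, pending_count)`, and `reserve` is a hypothetical function. The real `PendingTaskProcessor` evaluates preemption victims rather than free capacity, so this only illustrates picking up to _N_ candidates in one pass.

```python
def reserve(slaves, groups, max_per_group):
    """Reserve up to `max_per_group` slaves per task group in one pass over `slaves`.

    slaves: list of (slave_id, free_cpus); groups: {name: (cpus_needed, pending_count)}.
    """
    reserved = {}
    taken = set()
    for name, (cpus_needed, pending_count) in groups.items():
        limit = min(max_per_group, pending_count)
        picks = []
        for slave_id, free_cpus in slaves:
            if len(picks) == limit:
                break
            # Continue from the current position instead of restarting per task.
            if slave_id not in taken and free_cpus >= cpus_needed:
                picks.append(slave_id)
                taken.add(slave_id)
        reserved[name] = picks
    return reserved


slaves = [("a", 1), ("b", 4), ("c", 4), ("d", 4)]
result = reserve(slaves, {"web": (2, 5)}, max_per_group=2)
```

Slave `a` is rejected once for the whole group rather than once per task instance, which is where the saving comes from.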

Bugs closed: AURORA-1867

Reviewed at https://reviews.apache.org/r/55357/

6 months agoEvaluate multiple preemption proposals per round
Mehrdad Nurolahzade [Tue, 24 Jan 2017 16:07:09 +0000 (17:07 +0100)] 
Evaluate multiple preemption proposals per round

`TaskScheduler` makes an attempt to preempt already identified candidates
through `Preemptor` when it fails to schedule one or more tasks. However,
`Preemptor` currently evaluates only one proposal per invocation. A proposal
may get vetoed at this point by scheduling filters. If a proposal fails
validation the task group might get penalized by `TaskGroups` to give
`PendingTaskProcessor` some time to find new preemption candidates; despite
the fact that another proposal may already exist in `slotCache`. This penalty
might result in expiration of existing proposals in `slotCache`, hence slowing
down the overall preemption process.

This patch modifies `Preemptor` so that it evaluates all existing preemption
proposals before giving up.
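
In outline, the change amounts to looping over every cached proposal instead of stopping after the first veto. The sketch below uses hypothetical names (`attempt_preemption`, `is_vetoed`) and plain strings for proposals.

```python
def attempt_preemption(proposals, is_vetoed):
    """Return the first proposal that passes validation, or None if all are vetoed."""
    for proposal in proposals:
        if not is_vetoed(proposal):
            return proposal
    # Only now does the caller need to give up and wait for new candidates.
    return None


vetoed = {"slot-1"}
chosen = attempt_preemption(["slot-1", "slot-2", "slot-3"], lambda p: p in vetoed)
```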

Bugs closed: AURORA-1868

Reviewed at https://reviews.apache.org/r/55243/

6 months agoMake leader elections resilient to ZK disconnections.
Zameer Manji [Mon, 23 Jan 2017 22:38:56 +0000 (14:38 -0800)] 
Make leader elections resilient to ZK disconnections.

As documented in AURORA-1840 the Curator `LeaderLatch` recipe abdicates
leadership if the ZK connection is lost or if there is a timeout. This is not
compatible with the commons based implementation which would only abdicate
leadership if the ZK session timeout occurred.

This replaces the `LeaderLatch` recipe with the `LeaderSelector` recipe with a
custom listener that only loses leadership if a connection loss occurs.

Bugs closed: AURORA-1669

Reviewed at https://reviews.apache.org/r/54288/

6 months agoAURORA-1876 Expose stats on scheduler rate limiter
Mehrdad Nurolahzade [Mon, 23 Jan 2017 20:58:19 +0000 (14:58 -0600)] 
AURORA-1876 Expose stats on scheduler rate limiter

This patch exposes stats on `rateLimiter.acquire()` blocking events in `TaskGroups`. This
provides visibility into whether the scheduling rate is above/below `MAX_SCHEDULE_ATTEMPTS_PER_SEC`.

Bugs closed: AURORA-1876

Reviewed at https://reviews.apache.org/r/55471/

6 months agoAURORA-1828 Expose stats on the number of offers evaluated before a task is assigned
Mehrdad Nurolahzade [Mon, 23 Jan 2017 20:56:17 +0000 (14:56 -0600)] 
AURORA-1828 Expose stats on the number of offers evaluated before a task is assigned

Bugs closed: AURORA-1828

Reviewed at https://reviews.apache.org/r/54995/

6 months agoFix command escaping when using the Mesos containerizer.
Stephan Erb [Mon, 23 Jan 2017 07:38:52 +0000 (08:38 +0100)] 
Fix command escaping when using the Mesos containerizer.

The important bit is the change to call the Mesos containerizer with
`shell=False`. Getting rid of manual JSON encoding and eliminating shlex
might have helped as well, but was motivated more by clarity than by
correctness.
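
As a general illustration of why `shell=False` avoids escaping problems (this is not the executor code itself), the sketch below passes a string full of shell metacharacters as a single argument. The child process is simply the current Python interpreter echoing its first argument; `run_without_shell` is a hypothetical helper.

```python
import subprocess
import sys


def run_without_shell(argv):
    """Run argv directly; each list element reaches the child as one argument."""
    return subprocess.check_output(argv, shell=False).decode()


# Contains characters a shell would interpret: $, ;, quotes.
tricky = "echo $HOME; echo 'quoted \"text\"'"
echoed = run_without_shell(
    [sys.executable, "-c", "import sys; sys.stdout.write(sys.argv[1])", tricky]
)
```

Because no shell parses the command line, `echoed` is the original string, with no quoting or manual escaping required.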

Bugs closed: AURORA-1782

Reviewed at https://reviews.apache.org/r/55684/

6 months agoMake announced scheduler endpoint name configurable.
Stephan Erb [Wed, 18 Jan 2017 09:25:54 +0000 (10:25 +0100)] 
Make announced scheduler endpoint name configurable.

We decided to co-deploy an HTTPS enabled reverse proxy in front of each of our
Aurora schedulers. The proxy instances bind to `public_ip:8081` and the
schedulers to `localhost:8081`. By announcing the scheduler endpoint as `https`
we can ensure the default Aurora [client connects via HTTPS](https://github.com/apache/aurora/blob/master/src/main/python/apache/aurora/client/api/scheduler_client.py#L176-L178).

Default:

    [zk: 127.0.0.1:2181(CONNECTED) 5] get /aurora/scheduler/member_0000000011
    {"serviceEndpoint":{"host":"aurora.local","port":8081},"additionalEndpoints":{"http":{"host":"aurora.local","port":8081}},"status":"ALIVE"}

When running with `-serverset_endpoint_name=https`:

    [zk: 127.0.0.1:2181(CONNECTED) 0] get /aurora/scheduler/member_0000000019
    {"serviceEndpoint":{"host":"aurora.local","port":8081},"additionalEndpoints":{"https":{"host":"aurora.local","port":8081}},"status":"ALIVE"}
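
For illustration, a consumer of these serverset entries could pick the scheme roughly as follows. `scheduler_url` is a hypothetical helper and is not the logic in `scheduler_client.py`; it only shows how the announced endpoint name can drive the choice.

```python
import json


def scheduler_url(member_json):
    """Build a base URL from a serverset member, preferring an https endpoint."""
    member = json.loads(member_json)
    extra = member.get("additionalEndpoints", {})
    scheme = "https" if "https" in extra else "http"
    # Fall back to the service endpoint if the named endpoint is absent.
    endpoint = extra.get(scheme, member["serviceEndpoint"])
    return "{}://{}:{}".format(scheme, endpoint["host"], endpoint["port"])


default_member = (
    '{"serviceEndpoint":{"host":"aurora.local","port":8081},'
    '"additionalEndpoints":{"http":{"host":"aurora.local","port":8081}},'
    '"status":"ALIVE"}'
)
https_member = default_member.replace('"http":', '"https":')
```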

Bugs closed: AURORA-343

Reviewed at https://reviews.apache.org/r/55583/

6 months agoEnsure Aurora thrift support js and html.
John Sirois [Tue, 17 Jan 2017 22:34:39 +0000 (15:34 -0700)] 
Ensure Aurora thrift support js and html.

We use these for the Aurora UI and the API docs.

Bugs closed: AURORA-1875

Reviewed at https://reviews.apache.org/r/55646/

6 months agoImprove `thriftw` robustness.
John Sirois [Tue, 17 Jan 2017 21:20:15 +0000 (14:20 -0700)] 
Improve `thriftw` robustness.

Now the selected thrift is checked both for the proper version and
support of the gen langs Aurora requires. In addition, all thrifts on
the `PATH` are checked, and an existing locally built thrift is always verified
to protect against Aurora thrift requirement changes (if we ever add a gen lang
again).

Bugs closed: AURORA-1875

Reviewed at https://reviews.apache.org/r/55536/

6 months agoLog process sampling failures with debug severity
Stephan Erb [Tue, 17 Jan 2017 20:53:18 +0000 (21:53 +0100)] 
Log process sampling failures with debug severity

The observer's logs contain lots of warnings about being unable to find PIDs.
This is expected when running with the PID isolator, or when checkpoints are out
of date (e.g. after processes were killed by the OOM killer).

    W0116 14:42:54.694221 3253 process_collector_psutil.py:75] Error during process sampling: psutil.NoSuchProcess process no longer exists (pid=27727)
    W0116 14:42:54.717905 3253 process_collector_psutil.py:42] Error during process sampling [pid=10960]: psutil.NoSuchProcess process no longer exists (pid=10960)
    W0116 14:42:54.718089 3253 process_collector_psutil.py:75] Error during process sampling: psutil.NoSuchProcess process no longer exists (pid=10960)
    W0116 14:42:54.718245 3253 process_collector_psutil.py:42] Error during process sampling [pid=10026]: psutil.NoSuchProcess process no longer exists (pid=10026)
    W0116 14:42:54.718334 3253 process_collector_psutil.py:75] Error during process sampling: psutil.NoSuchProcess process no longer exists (pid=10026)

This change adopts the proposal of David Robinson to decrease the severity level
to debug.

Bugs closed: AURORA-1541

Reviewed at https://reviews.apache.org/r/55578/

6 months agoExposed stats on number of offers rescinded and number of slaves lost.
Pradyumna Kaushik [Fri, 13 Jan 2017 21:09:17 +0000 (13:09 -0800)] 
Exposed stats on number of offers rescinded and number of slaves lost.

Testing Done:
curl -w '\n' 192.168.33.7:8081/vars | grep offers_rescinded
% Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
offers_rescinded 0

curl -w '\n' 192.168.33.7:8081/vars | grep slaves_lost
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 30970    0 30970    0     0  4323k      0 --:--:-- --:--:-- --:--:-- 5040k
slaves_lost 0

./build-support/jenkins/build.sh
./src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh

Reviewed at https://reviews.apache.org/r/54960/

6 months agoExpose stats on SlotSizeCounter runs.
Mehrdad Nurolahzade [Fri, 13 Jan 2017 19:43:56 +0000 (11:43 -0800)] 
Expose stats on SlotSizeCounter runs.

Bugs closed: AURORA-1874

Reviewed at https://reviews.apache.org/r/55477/

6 months agoExpose stats on statically banned offers
Mehrdad Nurolahzade [Wed, 11 Jan 2017 22:18:43 +0000 (23:18 +0100)] 
Expose stats on statically banned offers

Bugs closed: AURORA-1859

Reviewed at https://reviews.apache.org/r/55058/

6 months agoEliminate sequential scan in MemTaskStore.getJobKeys()
Mehrdad Nurolahzade [Wed, 11 Jan 2017 22:17:34 +0000 (23:17 +0100)] 
Eliminate sequential scan in MemTaskStore.getJobKeys()

If scheduler is configured to run with the `MemTaskStore` every hit on scheduler
landing page (`/scheduler`) causes a call to `MemTaskStore.getJobKeys()` through
`ReadOnlyScheduler.getRoleSummary()`.

The implementation of `MemTaskStore.getJobKeys()` is currently very inefficient
as it requires a sequential scan of the task store and mapping to their
respective job keys. In Twitter clusters this method is currently taking half a
second per call (`mem_storage_get_job_keys`).

This patch eliminates the sequential scan and mapping to job key by simply
returning an immutable copy of the key set of the existing secondary index `job`.
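
The before/after can be sketched with a toy store. This `MemTaskStore` is a simplified Python stand-in (a dict of tasks plus a job-keyed secondary index); the method names are illustrative and the real class is Java.

```python
from collections import defaultdict


class MemTaskStore:
    """Toy task store with a secondary index from job key to task ids."""

    def __init__(self):
        self._job_by_task = {}
        self._tasks_by_job = defaultdict(set)  # the secondary index

    def save(self, task_id, job_key):
        self._job_by_task[task_id] = job_key
        self._tasks_by_job[job_key].add(task_id)

    def delete(self, task_id):
        job_key = self._job_by_task.pop(task_id)
        self._tasks_by_job[job_key].discard(task_id)
        if not self._tasks_by_job[job_key]:
            del self._tasks_by_job[job_key]  # keep the key set accurate

    def get_job_keys_scan(self):
        # Old approach: visit every task and map it to its job key.
        return frozenset(self._job_by_task.values())

    def get_job_keys(self):
        # New approach: immutable copy of the index's key set.
        return frozenset(self._tasks_by_job)


store = MemTaskStore()
store.save("t1", "www/prod/hello")
store.save("t2", "www/prod/hello")
store.save("t3", "www/devel/batch")
```

The second method does work proportional to the number of jobs rather than the number of tasks, provided the index drops keys whose last task is removed.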

Bugs closed: AURORA-1847

Reviewed at https://reviews.apache.org/r/55217/

6 months agoExpose stats on deleted job updates in JobUpdateHistoryPruner
Mehrdad Nurolahzade [Wed, 11 Jan 2017 22:15:03 +0000 (23:15 +0100)] 
Expose stats on deleted job updates in JobUpdateHistoryPruner

Bugs closed: AURORA-1856

Reviewed at https://reviews.apache.org/r/54967/

6 months agoReduce logging by ChainedStatusChecker and StatusManager when they're on the happy...
Joshua Cohen [Wed, 11 Jan 2017 22:19:49 +0000 (16:19 -0600)] 
Reduce logging by ChainedStatusChecker and StatusManager when they're on the happy path.

Bugs closed: AURORA-1878

Reviewed at https://reviews.apache.org/r/55434/

6 months agoClean up instances of loggers using a logger name from another class.
Bing-Qian Luan [Wed, 11 Jan 2017 16:00:32 +0000 (10:00 -0600)] 
Clean up instances of loggers using a logger name from another class.

Bugs closed: AURORA-1873

Reviewed at https://reviews.apache.org/r/55409/

6 months agoExpose stats on ZooKeeper connection state
Jing Chen [Tue, 10 Jan 2017 22:35:21 +0000 (23:35 +0100)] 
Expose stats on ZooKeeper connection state

* zk_connection_state_STATE shows 1 if STATE is current connection state, otherwise 0.
* zk_connection_state_STATE_counter represents the number of occurrences of STATE since scheduler start
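
The two stat families can be sketched as below. The state names are the values of Curator's `ConnectionState` enum; the `ZkStateStats` class, its methods, and the exact casing of the exported names are assumptions for illustration.

```python
# Values of Curator's ConnectionState enum.
STATES = ("CONNECTED", "SUSPENDED", "RECONNECTED", "LOST", "READ_ONLY")


class ZkStateStats:
    """Tracks the current connection state and how often each state was entered."""

    def __init__(self):
        self.current = None
        self.counters = {state: 0 for state in STATES}

    def on_state_change(self, state):
        self.current = state
        self.counters[state] += 1

    def sample(self):
        stats = {}
        for state in STATES:
            # Gauge: 1 for the current state, 0 for every other state.
            stats["zk_connection_state_" + state] = 1 if state == self.current else 0
            # Counter: number of times this state has been seen.
            stats["zk_connection_state_" + state + "_counter"] = self.counters[state]
        return stats


zk_stats = ZkStateStats()
for state in ("CONNECTED", "SUSPENDED", "RECONNECTED", "SUSPENDED"):
    zk_stats.on_state_change(state)
snapshot = zk_stats.sample()
```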

Bugs closed: AURORA-1838

Reviewed at https://reviews.apache.org/r/54624/

6 months agoEnsure destination exists when mounting files into a filesystem image.
Joshua Cohen [Tue, 10 Jan 2017 22:11:54 +0000 (16:11 -0600)] 
Ensure destination exists when mounting files into a filesystem image.

When testing filesystem isolation internally, we ran into an issue where mounting a regular file
into the task filesystem failed with exit code 32 since the mount destination did not exist. To
account for this, we'll touch an empty file in the taskfs.

Reviewed at https://reviews.apache.org/r/55347/

6 months agoReduce storage write lock contention by adopting Double-Checked Locking pattern in
Mehrdad Nurolahzade [Wed, 4 Jan 2017 21:50:46 +0000 (15:50 -0600)] 
Reduce storage write lock contention by adopting Double-Checked Locking pattern in
TimedOutTaskHandler.

`TimedOutTaskHandler` acquires storage write lock for every task every time they transition to a
transient state. It then verifies after a default time-out period of 5 minutes if the task has
transitioned out of the transient state.

The verification step takes place while holding the storage write lock. In over 99% of cases the
logic short-circuits and returns from `StateManagerImpl.updateTaskAndExternalState()` once it learns
task has transitioned out of the transient state.

This patch reduces storage write lock contention by adopting Double-Checked Locking pattern in
`TimedOutTaskHandler.run()`.
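
The pattern can be sketched as follows: check cheaply without the lock, and repeat the check under the lock only when the first check says there may be work to do. Everything here is a simplified stand-in: a dict plays the task store, `CountingLock` is a hypothetical wrapper used to observe lock acquisitions, and moving the task to `LOST` is an assumption about what a timeout does.

```python
import threading


class CountingLock:
    """A lock wrapper that counts how often it is acquired (for illustration)."""

    def __init__(self):
        self._lock = threading.Lock()
        self.acquisitions = 0

    def __enter__(self):
        self._lock.acquire()
        self.acquisitions += 1
        return self

    def __exit__(self, exc_type, exc, tb):
        self._lock.release()
        return False


class TimedOutTaskHandler:
    def __init__(self, store, write_lock):
        self.store = store
        self.write_lock = write_lock

    def run(self, task_id, transient_state):
        # First check, without the write lock: the common case exits here.
        if self.store.get(task_id) != transient_state:
            return False
        with self.write_lock:
            # Second check, under the lock: the state may have changed meanwhile.
            if self.store.get(task_id) != transient_state:
                return False
            self.store[task_id] = "LOST"  # assumption: timed-out tasks become LOST
            return True


lock = CountingLock()
store = {"t1": "RUNNING", "t2": "KILLING"}
handler = TimedOutTaskHandler(store, lock)
```

A task that already left the transient state never touches the write lock, which matches the "over 99% of cases" short-circuit described above.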

Bugs closed: AURORA-1820

Reviewed at https://reviews.apache.org/r/55179/

6 months agoExpose stats on undelivered event bus events
Mehrdad Nurolahzade [Tue, 27 Dec 2016 22:32:26 +0000 (23:32 +0100)] 
Expose stats on undelivered event bus events

Bugs closed: AURORA-1834

Reviewed at https://reviews.apache.org/r/55056/

6 months agoExpose stats on JobUpdateAction transitions
Mehrdad Nurolahzade [Tue, 27 Dec 2016 13:19:40 +0000 (14:19 +0100)] 
Expose stats on JobUpdateAction transitions

Introduced new stats that expose `JobUpdateAction` transitions.

Refactored away from `CachedCounters` for existing metric; it was dynamically
generating new String objects (through concatenation) per stats collection event.

Fixed a mistake in a previous changeset (https://reviews.apache.org/r/55003/);
removed unnecessary checked `Exception` on `CacheLoader.load()`.

Bugs closed: AURORA-1851

Reviewed at https://reviews.apache.org/r/55019/

6 months agoExpose timing stats on PendingTaskProcessor runs
Mehrdad Nurolahzade [Tue, 27 Dec 2016 11:49:58 +0000 (12:49 +0100)] 
Expose timing stats on PendingTaskProcessor runs

Bugs closed: AURORA-1857

Reviewed at https://reviews.apache.org/r/54992/

6 months agoUpdate to Mesos 1.1.0.
Stephan Erb [Tue, 27 Dec 2016 11:36:44 +0000 (12:36 +0100)] 
Update to Mesos 1.1.0.

Included changes:

* Handle new task states introduced in the latest Mesos release.
* Prevent NullPointer exception when inspecting an empty/invalid executor config in a test.
  Probably this is due to a change in the Mesos protobufs.
* Fix bug preventing the teardown of Vagrant boxes started by the egg build.
* Increase resources for the Mesos egg builds. The build for all distributions now takes 2h in total.

Full Mesos changelog: https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=1.1.0

Bugs closed: AURORA-1813

Reviewed at https://reviews.apache.org/r/54255/

7 months agoExpose ResponseCode stats on Thrift server calls
Mehrdad Nurolahzade [Fri, 23 Dec 2016 10:08:29 +0000 (02:08 -0800)] 
Expose ResponseCode stats on Thrift server calls

Bugs closed: AURORA-1848

Reviewed at https://reviews.apache.org/r/55003/

7 months agoExpose stats on deleted tasks in TaskHistoryPruner
Mehrdad Nurolahzade [Fri, 23 Dec 2016 10:07:08 +0000 (02:07 -0800)] 
Expose stats on deleted tasks in TaskHistoryPruner

Bugs closed: AURORA-1855

Reviewed at https://reviews.apache.org/r/54990/

7 months agoAURORA-1842 Expose stats on garbage collected rows in RowGarbageCollector
Mehrdad Nurolahzade [Thu, 22 Dec 2016 10:55:51 +0000 (02:55 -0800)] 
AURORA-1842 Expose stats on garbage collected rows in RowGarbageCollector

Bugs closed: AURORA-1842

Reviewed at https://reviews.apache.org/r/54959/

7 months agoRemove ignored snapshot stats. Add high-level timings on storage start-up lifecycle.
David McLaughlin [Mon, 19 Dec 2016 18:43:10 +0000 (10:43 -0800)] 
Remove ignored snapshot stats. Add high-level timings on storage start-up lifecycle.

Reviewed at https://reviews.apache.org/r/54847/

7 months agoAvoid double writing job updates to the Scheduler Snapshot
David McLaughlin [Thu, 15 Dec 2016 21:16:23 +0000 (13:16 -0800)] 
Avoid double writing job updates to the Scheduler Snapshot

Motivation: Thanks to the mybatis query metrics we added, we found that double writing Snapshot fields for H2 stores adds considerable overhead to our snapshot creation time.

Snapshots are also written as backups, and many operators choose to process backups offline for analytics, rather than query the live scheduler (due to not being able to scale reads horizontally). So this allows operators to enable/disable the hydrated fields as needed.

Bugs closed: AURORA-1861

Reviewed at https://reviews.apache.org/r/54774/

7 months agoAdd finer grained timings to the Snapshot process. I also added some log output,...
David McLaughlin [Thu, 15 Dec 2016 18:08:54 +0000 (10:08 -0800)] 
Add finer grained timings to the Snapshot process. I also added some log output, as I found those existing numbers handy when investigating our long snapshot times.

Related ticket: https://issues.apache.org/jira/browse/AURORA-1861

Reviewed at https://reviews.apache.org/r/54773/

7 months agoFix thrift bootstrap to use python2.7.
Joshua Cohen [Mon, 12 Dec 2016 17:33:39 +0000 (11:33 -0600)] 
Fix thrift bootstrap to use python2.7.

Reviewed at https://reviews.apache.org/r/54669/

7 months agoFixup prepare_binary.sh to work under modern bash.
John Sirois [Fri, 9 Dec 2016 04:17:13 +0000 (22:17 -0600)] 
Fixup prepare_binary.sh to work under modern bash.

Previously pushd/popd were used and these emit data to stdout muddying
pants output and breaking setup of the thrift serve dir structure.

 build-support/thrift/prepare_binary.sh | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

Reviewed at https://reviews.apache.org/r/54567/