aurora.git
2 weeks ago Updating .auroraversion to release version 0.19.0. rel/0.19.0
Bill Farner [Sat, 11 Nov 2017 05:02:24 +0000 (21:02 -0800)] 
Updating .auroraversion to release version 0.19.0.

2 weeks ago Updating .auroraversion to 0.19.0-rc0. rel/0.19.0-rc0
Bill Farner [Wed, 8 Nov 2017 04:47:30 +0000 (20:47 -0800)] 
Updating .auroraversion to 0.19.0-rc0.

2 weeks ago Incrementing snapshot version to 0.20.0-SNAPSHOT.
Bill Farner [Wed, 8 Nov 2017 04:47:30 +0000 (20:47 -0800)] 
Incrementing snapshot version to 0.20.0-SNAPSHOT.

2 weeks ago Updating CHANGELOG for 0.19.0 release.
Bill Farner [Wed, 8 Nov 2017 04:47:30 +0000 (20:47 -0800)] 
Updating CHANGELOG for 0.19.0 release.

2 weeks ago Update release notes in preparation for 0.19.0 release
Bill Farner [Wed, 8 Nov 2017 04:47:04 +0000 (20:47 -0800)] 
Update release notes in preparation for 0.19.0 release

2 weeks ago Use a pair of fields for caching offer resources rather than a Cache
Bill Farner [Wed, 8 Nov 2017 02:45:05 +0000 (18:45 -0800)] 
Use a pair of fields for caching offer resources rather than a Cache

Reviewed at https://reviews.apache.org/r/63454/

2 weeks ago Display pending task reasons in TaskList
David McLaughlin [Wed, 8 Nov 2017 00:22:08 +0000 (16:22 -0800)] 
Display pending task reasons in TaskList

Reviewed at https://reviews.apache.org/r/63650/

2 weeks ago Don't show host data when task is Throttled.
David McLaughlin [Wed, 8 Nov 2017 00:13:02 +0000 (16:13 -0800)] 
Don't show host data when task is Throttled.

PENDING and THROTTLED tasks are considered active, but they don't have hosts. This manifests as "null" host links.

Reviewed at https://reviews.apache.org/r/63648/

2 weeks ago Polling updates page if in progress in UI
Reza Motamedi [Wed, 8 Nov 2017 00:08:13 +0000 (16:08 -0800)] 
Polling updates page if in progress in UI

Reviewed at https://reviews.apache.org/r/63337/

2 weeks ago Migrate from findbugs to spotbugs
Stephan Erb [Tue, 7 Nov 2017 07:26:58 +0000 (08:26 +0100)] 
Migrate from findbugs to spotbugs

Findbugs [1] is no longer developed and has been replaced by spotbugs [2],
which is mostly a drop-in replacement.

[1] https://github.com/findbugsproject/findbugs
[2] https://mailman.cs.umd.edu/pipermail/findbugs-discuss/2017-September/004383.html

Reviewed at https://reviews.apache.org/r/63564/

3 weeks ago Fixed issue where saving attributes are not being persisted to log
Jordan Ly [Thu, 2 Nov 2017 21:49:10 +0000 (14:49 -0700)] 
Fixed issue where saving attributes are not being persisted to log

A bug was introduced when the old `MemAttributeStore` was revived. Previously,
the `saveHostAttributes` method did not return anything. However, after
migrating to the DB stores, the signature of the interface was changed to return
a `boolean` indicating whether the save modified the previous attributes. The new changes
accidentally inverted the order. The `AbstractAttributeStoreTest` did not test
for this scenario so it went unnoticed.
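
As a rough illustration of the contract described above, here is a minimal
Python sketch (the scheduler itself is Java). The class and method names are
hypothetical stand-ins, not Aurora's code, and the sketch assumes the log
should be written only when the save reports a modification:

```python
class InMemoryAttributeStore:
    """Hypothetical stand-in for an attribute store, not Aurora's class."""

    def __init__(self):
        self._attributes = {}

    def save_host_attributes(self, host, attributes):
        """Store attributes and return True only if the stored value changed."""
        previous = self._attributes.get(host)
        changed = previous != attributes
        if changed:
            self._attributes[host] = dict(attributes)
        return changed


class AttributeLog:
    """Hypothetical log wrapper that records only real modifications."""

    def __init__(self, store):
        self._store = store
        self.entries = []

    def save_host_attributes(self, host, attributes):
        # Assumption: persist to the log only when the save modified the
        # previous value. Inverting this result would silently skip the write.
        if self._store.save_host_attributes(host, attributes):
            self.entries.append((host, dict(attributes)))
            return True
        return False
```

A test for this scenario would save the same attributes twice and expect
exactly one log entry.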

Reviewed at https://reviews.apache.org/r/63521/

3 weeks ago Terminate the executor on unhandled errors
Stephan Erb [Thu, 2 Nov 2017 11:01:40 +0000 (12:01 +0100)] 
Terminate the executor on unhandled errors

This commit consists of two independent parts:

a) ensure we interrupt the main thread when there are unhandled exceptions
b) ensure the main thread of the executor can be interrupted
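
The two parts can be sketched roughly as follows. This is not the executor's
code: it uses Python 3's `_thread.interrupt_main` (the executor runs on
Python 2.7), and `InterruptingThread`, `wait_for_interrupt` and `failing_work`
are hypothetical names. It must be run from the main thread.

```python
import _thread
import threading
import time


class InterruptingThread(threading.Thread):
    """Thread that interrupts the main thread if its work raises."""

    def __init__(self, work):
        super().__init__(daemon=True)
        self._work = work

    def run(self):
        try:
            self._work()
        except Exception:
            # Part (a): signal the main thread instead of dying silently.
            _thread.interrupt_main()


def wait_for_interrupt(worker, timeout_secs=5.0):
    """Start worker; return True if the main thread was interrupted."""
    deadline = time.monotonic() + timeout_secs
    try:
        worker.start()
        # Part (b): wait in short slices so a pending interrupt is noticed
        # promptly rather than after one long blocking call.
        while time.monotonic() < deadline:
            time.sleep(0.05)
    except KeyboardInterrupt:
        return True
    return False


def failing_work():
    """Simulates the injected failure used for manual verification."""
    raise RuntimeError("Woops!")
```

Calling `wait_for_interrupt(InterruptingThread(failing_work))` from the main
thread should return once the worker fails, instead of waiting out the timeout.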

Testing Done:
This bug is pretty hard to reproduce and test. I therefore opted for a manual
verification and injected an exception throw shortly before the last statement
of the `AuroraExecutor._shutdown` method. Without this patch, this resulted in
hanging executors on the host. With this patch everything is terminated as
expected.

For details of the successful run, please see the executor logs below. Please
note that the `apport.fileutils` error is due to Ubuntu messing with its Python
installation. This is not critical.

```
twitter.common.app debug: Initializing: apache.thermos.common.excepthook (Exception termination handler.)
I1031 15:59:37.188621 25437 exec.cpp:162] Version: 1.2.0
I1031 15:59:37.192201 25429 exec.cpp:237] Executor registered on agent 93259518-14f4-4956-a39c-aa615bff9a5e-S0
Writing log files to disk in /var/lib/mesos/slaves/93259518-14f4-4956-a39c-aa615bff9a5e-S0/frameworks/7b202c2e-8796-4f27-afeb-8b76ba4b3037-0000/executors/thermos-www-data-prod-hello-0-d8d50c2f-e79b-467d-8c65-cca3cb44cf9c/runs/54a5ed51-aa9b-476f-9f75-0b42bd6dfa8d

ERROR] Unhandled error in <StatusManager(Thread-7 [TID=25450], started daemon 139968452134656)>. Interrupting main thread.
Traceback (most recent call last):
  File "/root/.pex/install/twitter.common.exceptions-0.3.7-py2-none-any.whl.f6376bcca9bfda5eba4396de2676af5dfe36237d/twitter.common.exceptions-0.3.7-py2-none-any.whl/twitter/common/exceptions/__init__.py", line 126, in _excepting_run
    self.__real_run(*args, **kw)
  File "apache/aurora/executor/status_manager.py", line 62, in run
  File "apache/aurora/executor/aurora_executor.py", line 236, in _shutdown
RuntimeError: Woops!
Exception in thread Thread-7 [TID=25450]:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner
    self.run()
  File "/root/.pex/install/twitter.common.decorators-0.3.7-py2-none-any.whl.b23f2874a4392741fca582d9e0528c08e0335c68/twitter.common.decorators-0.3.7-py2-none-any.whl/twitter/common/decorators/threads.py", line 115, in identified
    return instancemethod(self, *args, **kwargs)
  File "/root/.pex/install/twitter.common.exceptions-0.3.7-py2-none-any.whl.f6376bcca9bfda5eba4396de2676af5dfe36237d/twitter.common.exceptions-0.3.7-py2-none-any.whl/twitter/common/exceptions/__init__.py", line 130, in _excepting_run
    sys.excepthook(*sys.exc_info())
  File "apache/thermos/common/excepthook.py", line 41, in teardown_handler
    self._former_hook()(exc_type, value, trace)
  File "/usr/lib/python2.7/dist-packages/apport_python_hook.py", line 63, in apport_excepthook
    from apport.fileutils import likely_packaged, get_recent_crashes
ImportError: No module named apport.fileutils

twitter.common.app debug: main exited with ^C
twitter.common.app debug: Shutting application down.
twitter.common.app debug: Running exit function for apache.thermos.common.excepthook (Exception termination handler.)
twitter.common.app debug: Running exit function for twitter.common.log (Logging subsystem.)
twitter.common.app debug: Finishing up module teardown.
twitter.common.app debug:   Active thread: <_MainThread(MainThread, started 139968622749504)>
twitter.common.app debug:   Active thread (daemon): <TaskResourceMonitor(TaskResourceMonitor[www-data-prod-hello-0-d8d50c2f-e79b-467d-8c65-cca3cb44cf9c] [TID=25449], started daemon 139967951009536)>
twitter.common.app debug:   Active thread (daemon): <_DummyThread(Dummy-13, started daemon 139968485705472)>
twitter.common.app debug:   Active thread (daemon): <WaitThread(Thread-9, started daemon 139967934224128)>
twitter.common.app debug:   Active thread (daemon): <WaitThread(Thread-12, started daemon 139967942616832)>
twitter.common.app debug:   Active thread (daemon): <_DummyThread(Dummy-3, started daemon 139968510883584)>
twitter.common.app debug:   Active thread (daemon): <WaitThread(Thread-11, started daemon 139967925831424)>
twitter.common.app debug: Exiting cleanly.
```

Corresponding agent logs, indicating that Mesos knows about the crash on teardown:
```
I1031 15:59:54.692739  1956 slave.cpp:4769] Executor 'thermos-www-data-prod-hello-0-d8d50c2f-e79b-467d-8c65-cca3cb44cf9c' of framework 7b202c2e-8796-4f27-afeb-8b76ba4b3037-0000 exited with status 130
I1031 15:59:54.692834  1956 slave.cpp:4869] Cleaning up executor 'thermos-www-data-prod-hello-0-d8d50c2f-e79b-467d-8c65-cca3cb44cf9c' of framework 7b202c2e-8796-4f27-afeb-8b76ba4b3037-0000 at executor(1)@192.168.33.7:48931
I1031 15:59:54.692996  1956 slave.cpp:4957] Cleaning up framework 7b202c2e-8796-4f27-afeb-8b76ba4b3037-0000
```

Bugs closed: AURORA-1955

Reviewed at https://reviews.apache.org/r/63443/

3 weeks ago Refactor staticallyBannedOffers into a LRU cache
Jordan Ly [Tue, 31 Oct 2017 17:20:27 +0000 (10:20 -0700)] 
Refactor staticallyBannedOffers into a LRU cache

Using the new `hold_offers_forever` option, it is possible for the
`staticallyBannedOffers` to grow very large in size as we never release
offers.
1. The current behavior of `staticallyBannedOffers` is (kinda) preserved.
   Entries will no longer be removed when the offer is used, but they will be
   removed within `maxOfferHoldTime`. This means cluster operators will not
   have to think about the new `offer_static_ban_cache_max_size` if they aren't
   affected by the memory leak now.
2. Cluster operators that use Aurora as a single framework and hold offers
   indefinitely can cap the size of the cache to avoid the memory leak.
3. Using an LRU cache greatly benefits quickly recurring crons and job updates.
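
The combination of a size cap and time-based removal described above can be
sketched as follows. This is an illustrative Python model, not the scheduler's
Java implementation. `BannedOfferCache` and its parameters are hypothetical,
with `max_size` standing in for `offer_static_ban_cache_max_size` and
`max_hold_secs` for `maxOfferHoldTime`:

```python
import time
from collections import OrderedDict


class BannedOfferCache:
    """Hypothetical size- and time-bounded LRU set of (offer, group) bans."""

    def __init__(self, max_size, max_hold_secs, clock=time.monotonic):
        self._max_size = max_size
        self._max_hold_secs = max_hold_secs
        self._clock = clock
        self._entries = OrderedDict()  # (offer_id, task_group) -> ban time

    def ban(self, offer_id, task_group):
        key = (offer_id, task_group)
        self._entries[key] = self._clock()
        self._entries.move_to_end(key)
        # Evict least recently used entries once the cap is exceeded.
        while len(self._entries) > self._max_size:
            self._entries.popitem(last=False)

    def is_banned(self, offer_id, task_group):
        key = (offer_id, task_group)
        banned_at = self._entries.get(key)
        if banned_at is None:
            return False
        if self._clock() - banned_at >= self._max_hold_secs:
            # The entry outlived the hold time, so drop it.
            del self._entries[key]
            return False
        self._entries.move_to_end(key)  # mark as recently used
        return True

    def __len__(self):
        return len(self._entries)
```

Injecting the clock keeps the expiry behavior testable without sleeping.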

Reviewed at https://reviews.apache.org/r/63199/

3 weeks ago Remove inaccurate "Initializing sandbox" message
Stephan Erb [Tue, 31 Oct 2017 16:24:04 +0000 (17:24 +0100)] 
Remove inaccurate "Initializing sandbox" message

The message is no longer completely accurate, now that we remain in
`STARTING` until health checks have passed.

Reviewed at https://reviews.apache.org/r/63435/

3 weeks ago Remove endpoint.thrift, ServiceInstance is never serialized to thrift
Bill Farner [Tue, 31 Oct 2017 04:58:13 +0000 (21:58 -0700)] 
Remove endpoint.thrift, ServiceInstance is never serialized to thrift

This enables removal of some unnecessary complexity in the build (commons no
longer needs thrift) and the unused Codec abstraction (we always encode in
JSON).

Reviewed at https://reviews.apache.org/r/63418/

3 weeks ago Condense whitespace of navigation and breadcrumb, reduce impact of quota widget
David McLaughlin [Mon, 30 Oct 2017 23:26:13 +0000 (16:26 -0700)] 
Condense whitespace of navigation and breadcrumb, reduce impact of quota widget

Reviewed at https://reviews.apache.org/r/63406/

3 weeks ago Add resource units to config summary
David McLaughlin [Mon, 30 Oct 2017 15:37:19 +0000 (08:37 -0700)] 
Add resource units to config summary

Reviewed at https://reviews.apache.org/r/63375/

3 weeks ago Add support for generating patch RCs from non-master branches
Bill Farner [Mon, 30 Oct 2017 15:03:44 +0000 (08:03 -0700)] 
Add support for generating patch RCs from non-master branches

Reviewed at https://reviews.apache.org/r/63401/

3 weeks ago Add release notes for 0.18.1
Bill Farner [Sun, 29 Oct 2017 17:27:06 +0000 (10:27 -0700)] 
Add release notes for 0.18.1

4 weeks ago Suppress multiline logging from mesos callbacks
Bill Farner [Sat, 28 Oct 2017 00:26:34 +0000 (17:26 -0700)] 
Suppress multiline logging from mesos callbacks

Reviewed at https://reviews.apache.org/r/63383/

4 weeks ago MesosCallbackHandler uses separate eventbus for registered call
Jordan Ly [Fri, 27 Oct 2017 21:34:01 +0000 (14:34 -0700)] 
MesosCallbackHandler uses separate eventbus for registered call

We should have `registered` use its own eventbus so it does not get blocked
by other calls.

Bugs closed: AURORA-1953

Reviewed at https://reviews.apache.org/r/63316/

4 weeks ago Revert to old Job Page tab names and add counts
David McLaughlin [Fri, 27 Oct 2017 20:10:00 +0000 (13:10 -0700)] 
Revert to old Job Page tab names and add counts

Changing the names of tabs causes unnecessary confusion. Revert to "active tasks/completed tasks" and add the instance count back to both.

Reviewed at https://reviews.apache.org/r/63374/

4 weeks ago Reduce white-space on role and env pages
David McLaughlin [Fri, 27 Oct 2017 20:08:29 +0000 (13:08 -0700)] 
Reduce white-space on role and env pages

Reviewed at https://reviews.apache.org/r/63373/

4 weeks ago Revert role searching in UI to old behavior
David McLaughlin [Fri, 27 Oct 2017 20:03:38 +0000 (13:03 -0700)] 
Revert role searching in UI to old behavior

Move from prefix search to full-text matching.
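
The difference between the two behaviors can be illustrated with a small
Python sketch (the UI itself is JavaScript). It assumes "full-text matching"
means a case-sensitive substring match; both function names are hypothetical:

```python
def prefix_match(names, query):
    """Prefix search: keep names that start with the query."""
    return [name for name in names if name.startswith(query)]


def full_text_match(names, query):
    """Assumed full-text behavior: keep names containing the query anywhere."""
    return [name for name in names if query in name]
```

With roles `["www-data", "data-team"]` and the query `data`, the prefix search
finds only `data-team`, while the full-text variant finds both.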

Reviewed at https://reviews.apache.org/r/63364/

4 weeks ago Support updates with no desiredState on Job and Update pages
David McLaughlin [Thu, 26 Oct 2017 22:27:05 +0000 (15:27 -0700)] 
Support updates with no desiredState on Job and Update pages

When updates only delete instances, desiredState is null.

Reviewed at https://reviews.apache.org/r/63344/

4 weeks ago Search entire job name for query string on JobList
David McLaughlin [Thu, 26 Oct 2017 22:19:22 +0000 (15:19 -0700)] 
Search entire job name for query string on JobList

Reviewed at https://reviews.apache.org/r/63339/

4 weeks ago Do not fetch neighbor tasks if no active task
David McLaughlin [Thu, 26 Oct 2017 20:37:12 +0000 (13:37 -0700)] 
Do not fetch neighbor tasks if no active task

Reviewed at https://reviews.apache.org/r/63333/

4 weeks ago Clean up TaskList component layout.
David McLaughlin [Thu, 26 Oct 2017 18:45:56 +0000 (11:45 -0700)] 
Clean up TaskList component layout.

Reviewed at https://reviews.apache.org/r/63281/

4 weeks ago Reload instance page when URL changes.
Reza Motamedi [Wed, 25 Oct 2017 22:30:37 +0000 (15:30 -0700)] 
Reload instance page when URL changes.

Reviewed at https://reviews.apache.org/r/63221/

4 weeks ago Add release notes for new UI
David McLaughlin [Wed, 25 Oct 2017 21:51:46 +0000 (14:51 -0700)] 
Add release notes for new UI

Reviewed at https://reviews.apache.org/r/63306/

4 weeks ago Add a package.json file in the plugin directory to allow custom dependencies
David McLaughlin [Wed, 25 Oct 2017 20:00:04 +0000 (13:00 -0700)] 
Add a package.json file in the plugin directory to allow custom dependencies

Problem: if you're using the plugin mechanism of the new UI, you cannot add your own custom dependencies without a fork of the package.json file. This adds a package.json file into the plugin directory that is used to install dependencies into the main node_modules directory. With this separate package.json file, we remove the burden of upstream merge conflicts.

Reviewed at https://reviews.apache.org/r/63262/

4 weeks ago Refactor veto logic to use direct method calls as opposed to pubsub events.
Jordan Ly [Wed, 25 Oct 2017 17:21:55 +0000 (10:21 -0700)] 
Refactor veto logic to use direct method calls as opposed to pubsub events.

SchedulingFilterNotifier currently publishes veto events to be consumed by various metadata classes (NearestFit and TaskVars). These veto events cause a lot of object allocations and async tasks. We can reduce the number of objects made by directly calling methods and not using pubsub events.
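
A minimal Python sketch of the direct-call shape, for illustration only. The
`*Sketch` classes and their methods are hypothetical stand-ins for the Java
classes named above, not their real interfaces:

```python
class NearestFitSketch:
    """Hypothetical consumer that remembers the latest vetoes per task."""

    def __init__(self):
        self.vetoes_by_task = {}

    def vetoed(self, task_id, vetoes):
        self.vetoes_by_task[task_id] = tuple(vetoes)


class TaskVarsSketch:
    """Hypothetical consumer that counts vetoes by reason."""

    def __init__(self):
        self.veto_counts = {}

    def vetoed(self, task_id, vetoes):
        for veto in vetoes:
            self.veto_counts[veto] = self.veto_counts.get(veto, 0) + 1


class SchedulingFilterNotifierSketch:
    """Calls consumers directly instead of publishing an event object."""

    def __init__(self, consumers):
        self._consumers = tuple(consumers)

    def report_vetoes(self, task_id, vetoes):
        # Direct method calls: no event allocation, no asynchronous dispatch.
        for consumer in self._consumers:
            consumer.vetoed(task_id, vetoes)
```

The trade-off is tighter coupling between the notifier and its consumers in
exchange for fewer allocations on a hot scheduling path.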

Reviewed at https://reviews.apache.org/r/63236/

4 weeks ago Remove the old UI and serve the new UI instead
David McLaughlin [Wed, 25 Oct 2017 17:04:47 +0000 (10:04 -0700)] 
Remove the old UI and serve the new UI instead

Reviewed at https://reviews.apache.org/r/63282/

4 weeks ago Exclusively use Map-based in-memory stores for primary storage
Bill Farner [Wed, 25 Oct 2017 06:34:09 +0000 (23:34 -0700)] 
Exclusively use Map-based in-memory stores for primary storage

This patch introduces map-based volatile stores, most of which were revived
from git history with minimal changes.  The DB storage system is now only
used as temporary storage when replaying a snapshot containing the `dbScript`
field.

Note that this change removes the transactional nature of in-memory storage
operations as well as the `READ COMMITTED` transaction isolation previously
available to some stores (proven in necessary changes to
`StorageTransactionTest`).  This means some stores will permit dirty reads
when they previously did not.  `TaskStore` has always had this non-transactional
behavior by default, as the DB task store was never deemed suitable for
production.  Nonetheless, this non-transactional behavior should be considered
safe as the scheduler fails over on a storage operation failure, and relies on
the persistent log storage for transaction atomicity.

Reviewed at https://reviews.apache.org/r/62869/

4 weeks ago Do not reserve agents for updates when constraints change.
David McLaughlin [Tue, 24 Oct 2017 22:33:41 +0000 (15:33 -0700)] 
Do not reserve agents for updates when constraints change.

Reviewed at https://reviews.apache.org/r/63261/

4 weeks ago Fix alignment of text on JobList
David McLaughlin [Tue, 24 Oct 2017 21:56:56 +0000 (14:56 -0700)] 
Fix alignment of text on JobList

Reviewed at https://reviews.apache.org/r/63260/

4 weeks ago Move job environment validation to the scheduler
Mauricio Garavaglia [Mon, 23 Oct 2017 21:09:47 +0000 (23:09 +0200)] 
Move job environment validation to the scheduler

Removed the Job environment validation from the command line client. Validation was moved to
the scheduler side through the `allowed_job_environments` option. By default, it allows any of
`devel`, `test`, `production`, and any value matching the regular expression `staging[0-9]*`.

This gives consistent behavior when using the CLI and the API.
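
As an illustration, a server-side check of this kind could look like the
following Python sketch (the scheduler is Java). The pattern is assembled from
the values listed in this message and is an assumption, so the scheduler's
actual default may differ; `validate_environment` is a hypothetical name:

```python
import re

# Assumed default, built from the values named in the commit message.
DEFAULT_ALLOWED_JOB_ENVIRONMENTS = r"devel|test|production|staging[0-9]*"


def validate_environment(environment,
                         allowed_pattern=DEFAULT_ALLOWED_JOB_ENVIRONMENTS):
    """Return the environment if the whole string matches, else raise."""
    if re.fullmatch(allowed_pattern, environment) is None:
        raise ValueError("Job environment %r is not allowed" % environment)
    return environment
```

Because the check runs on the scheduler, a request is validated the same way
whether it arrives through the CLI or directly through the API.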

Reviewed at https://reviews.apache.org/r/62692/

4 weeks ago Add sorting and filtering controls for TaskList
David McLaughlin [Mon, 23 Oct 2017 19:48:20 +0000 (12:48 -0700)] 
Add sorting and filtering controls for TaskList

Reviewed at https://reviews.apache.org/r/63188/

4 weeks ago Update to shiro 1.2.5
Bill Farner [Mon, 23 Oct 2017 16:41:50 +0000 (09:41 -0700)] 
Update to shiro 1.2.5

Reviewed at https://reviews.apache.org/r/63217/

4 weeks ago Fix back button issue on Jobs page
David McLaughlin [Mon, 23 Oct 2017 16:41:05 +0000 (09:41 -0700)] 
Fix back button issue on Jobs page

Reviewed at https://reviews.apache.org/r/63197/

4 weeks ago Update to guava 23.2
Bill Farner [Mon, 23 Oct 2017 15:01:50 +0000 (08:01 -0700)] 
Update to guava 23.2

Reviewed at https://reviews.apache.org/r/63204/

4 weeks ago Add test case for regression of AURORA-1952
Bill Farner [Sat, 21 Oct 2017 18:12:26 +0000 (11:12 -0700)] 
Add test case for regression of AURORA-1952

Reviewed at https://reviews.apache.org/r/63202/

5 weeks ago Add an example of using the UI plugin mechanism
David McLaughlin [Fri, 20 Oct 2017 21:52:38 +0000 (14:52 -0700)] 
Add an example of using the UI plugin mechanism

Reviewed at https://reviews.apache.org/r/63169/

5 weeks ago Display cron time as UTC
David McLaughlin [Fri, 20 Oct 2017 21:51:42 +0000 (14:51 -0700)] 
Display cron time as UTC

Reviewed at https://reviews.apache.org/r/63187/

5 weeks ago Prevent diff line from overflowing container
David McLaughlin [Fri, 20 Oct 2017 21:51:10 +0000 (14:51 -0700)] 
Prevent diff line from overflowing container

Reviewed at https://reviews.apache.org/r/63186/

5 weeks ago Expose list of neighbors in the instance page
Reza Motamedi [Fri, 20 Oct 2017 17:44:50 +0000 (10:44 -0700)] 
Expose list of neighbors in the instance page

Reviewed at https://reviews.apache.org/r/63062/

5 weeks ago Add Cache-Control header to static assets, to allow for cache expiration
David McLaughlin [Fri, 20 Oct 2017 17:36:45 +0000 (10:36 -0700)] 
Add Cache-Control header to static assets, to allow for cache expiration

Reviewed at https://reviews.apache.org/r/63176/

5 weeks ago Provide a formal way to disable offer declining
Bill Farner [Fri, 20 Oct 2017 02:39:02 +0000 (19:39 -0700)] 
Provide a formal way to disable offer declining

Increasing the offer hold time to effectively disable offer declines is a trap,
as the queue of asynchronous declines will grow without bound.  This introduces
a command line argument to explicitly disable declining.

Reviewed at https://reviews.apache.org/r/63157/

5 weeks ago Refactor Job Page to make it more pluggable
David McLaughlin [Thu, 19 Oct 2017 21:37:56 +0000 (14:37 -0700)] 
Refactor Job Page to make it more pluggable

Reviewed at https://reviews.apache.org/r/63135/

5 weeks ago Use LockStore only for backwards compatibility
Bill Farner [Thu, 19 Oct 2017 21:25:30 +0000 (14:25 -0700)] 
Use LockStore only for backwards compatibility

Enter backwards compatibility mode for LockStore.  This means we will restore
and acquire locks as before, but will ignore them otherwise.
Following the next release, `LockStore` will be removed.

Please note that `JobUpdateController` already provides the one-at-a-time
update semantic in addition to using the legacy lock system for the same
purpose.

Reviewed at https://reviews.apache.org/r/63130/

5 weeks ago Add banner pointing to new UI
David McLaughlin [Thu, 19 Oct 2017 20:48:28 +0000 (13:48 -0700)] 
Add banner pointing to new UI

Reviewed at https://reviews.apache.org/r/63165/

5 weeks ago Cosmetic changes to Navigation and task metadata
David McLaughlin [Thu, 19 Oct 2017 17:15:23 +0000 (10:15 -0700)] 
Cosmetic changes to Navigation and task metadata

Reviewed at https://reviews.apache.org/r/63132/

5 weeks ago When scheduling, skip offers with no CPU and no mem
Bill Farner [Thu, 19 Oct 2017 01:14:02 +0000 (18:14 -0700)] 
When scheduling, skip offers with no CPU and no mem

There's no reason for us to evaluate offers with no CPUs or memory,
so reject them early in the offer lifecycle.

This is an incremental performance optimization, but it may net significant
improvements based on observations in some very large clusters.
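
A rough Python sketch of such an early filter (the scheduler is Java). The
`Offer` tuple and `usable_offers` are hypothetical, and the sketch assumes an
offer is skipped when it lacks either CPU or memory, since a task needs both:

```python
from collections import namedtuple

# Hypothetical, simplified view of a resource offer.
Offer = namedtuple("Offer", ["offer_id", "cpus", "mem_mb"])


def usable_offers(offers):
    """Drop offers that cannot host any task before further evaluation."""
    # Assumption: an offer missing either CPU or memory is never usable.
    return [offer for offer in offers if offer.cpus > 0 and offer.mem_mb > 0]
```

Filtering up front means the more expensive per-task matching only runs
against offers that could possibly fit.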

Reviewed at https://reviews.apache.org/r/62956/

5 weeks ago Remove legacy commons ZK code
Bill Farner [Thu, 19 Oct 2017 00:33:50 +0000 (17:33 -0700)] 
Remove legacy commons ZK code

Reviewed at https://reviews.apache.org/r/62652/

5 weeks ago Add cron configuration to Job Page
David McLaughlin [Wed, 18 Oct 2017 23:16:21 +0000 (16:16 -0700)] 
Add cron configuration to Job Page

Reviewed at https://reviews.apache.org/r/63125/

5 weeks ago Fix lint build error from Fonts patch.
David McLaughlin [Wed, 18 Oct 2017 22:09:36 +0000 (15:09 -0700)] 
Fix lint build error from Fonts patch.

Reviewed at https://reviews.apache.org/r/63129/

5 weeks ago Add fonts again (with line-endings intact!)
David McLaughlin [Wed, 18 Oct 2017 21:24:52 +0000 (14:24 -0700)] 
Add fonts again (with line-endings intact!)

5 weeks ago Remove corrupted fonts and add font files to .gitattributes to prevent line-ending...
David McLaughlin [Wed, 18 Oct 2017 21:18:36 +0000 (14:18 -0700)] 
Remove corrupted fonts and add font files to .gitattributes to prevent line-ending formatting by git

Reviewed at https://reviews.apache.org/r/63122/

5 weeks ago Role and Environment Page fixes:
David McLaughlin [Wed, 18 Oct 2017 20:15:13 +0000 (13:15 -0700)] 
Role and Environment Page fixes:

* Remove env column on environment pages.
* Tidy up CSS that caused names to not be lined up properly with long environment name.
* Allow you to search across job type, tier and environment.

Reviewed at https://reviews.apache.org/r/63117/

5 weeks ago Clean up Job Page CSS.
David McLaughlin [Wed, 18 Oct 2017 17:26:17 +0000 (10:26 -0700)] 
Clean up Job Page CSS.

* Make update list smaller (was too dominant on the page).
* Show update progress/size of history.
* Tidy up whitespace.
* Move expander to end of task list item.
* Wrap the main job overview loading element in a panel group to prevent jarring page change as content loads.

Reviewed at https://reviews.apache.org/r/63098/

5 weeks ago Detect and parse Thermos config in Diff output
David McLaughlin [Wed, 18 Oct 2017 17:18:19 +0000 (10:18 -0700)] 
Detect and parse Thermos config in Diff output

Reviewed at https://reviews.apache.org/r/63092/

5 weeks ago Add Source Sans Pro font to project
David McLaughlin [Wed, 18 Oct 2017 16:58:35 +0000 (09:58 -0700)] 
Add Source Sans Pro font to project

Reviewed at https://reviews.apache.org/r/63099/

5 weeks ago Add diff viewer to Update Page
David McLaughlin [Tue, 17 Oct 2017 21:57:27 +0000 (14:57 -0700)] 
Add diff viewer to Update Page

Reviewed at https://reviews.apache.org/r/63083/

5 weeks ago Add pointer for pagination links
David McLaughlin [Tue, 17 Oct 2017 21:53:50 +0000 (14:53 -0700)] 
Add pointer for pagination links

Reviewed at https://reviews.apache.org/r/63088/

5 weeks ago Fix instance range display
David McLaughlin [Tue, 17 Oct 2017 21:44:11 +0000 (14:44 -0700)] 
Fix instance range display

Reviewed at https://reviews.apache.org/r/63087/

5 weeks ago Add URL handling for tab switching on Job page
David McLaughlin [Tue, 17 Oct 2017 20:29:43 +0000 (13:29 -0700)] 
Add URL handling for tab switching on Job page

Reviewed at https://reviews.apache.org/r/62958/

5 weeks ago Hide InstanceHistory when there are no old tasks.
David McLaughlin [Tue, 17 Oct 2017 20:15:44 +0000 (13:15 -0700)] 
Hide InstanceHistory when there are no old tasks.

Reviewed at https://reviews.apache.org/r/63082/

5 weeks ago Format constraints on Task Config Summary
David McLaughlin [Tue, 17 Oct 2017 20:15:07 +0000 (13:15 -0700)] 
Format constraints on Task Config Summary

Reviewed at https://reviews.apache.org/r/63081/

5 weeks ago Center pagination links when not a table.
David McLaughlin [Tue, 17 Oct 2017 20:14:25 +0000 (13:14 -0700)] 
Center pagination links when not a table.

Reviewed at https://reviews.apache.org/r/63079/

5 weeks ago Clean up State Machine CSS to handle long messages
David McLaughlin [Tue, 17 Oct 2017 20:13:31 +0000 (13:13 -0700)] 
Clean up State Machine CSS to handle long messages

Reviewed at https://reviews.apache.org/r/63064/

5 weeks ago Hide pagination links on Role and Job lists when only one page
David McLaughlin [Tue, 17 Oct 2017 17:37:47 +0000 (10:37 -0700)] 
Hide pagination links on Role and Job lists when only one page

Reviewed at https://reviews.apache.org/r/63078/

5 weeks ago Fix link on Navigation logo
David McLaughlin [Tue, 17 Oct 2017 17:37:01 +0000 (10:37 -0700)] 
Fix link on Navigation logo

Reviewed at https://reviews.apache.org/r/63065/

5 weeks ago Update list of Companies using Aurora.
Derek Slager [Tue, 17 Oct 2017 01:12:01 +0000 (21:12 -0400)] 
Update list of Companies using Aurora.

Reviewed at https://reviews.apache.org/r/63052/

5 weeks ago Manage Bootstrap with webpack.
David McLaughlin [Mon, 16 Oct 2017 23:36:56 +0000 (16:36 -0700)] 
Manage Bootstrap with webpack.

Reviewed at https://reviews.apache.org/r/62955/

5 weeks ago Add support for controlling API url in UI without modifying code.
David McLaughlin [Mon, 16 Oct 2017 23:31:34 +0000 (16:31 -0700)] 
Add support for controlling API url in UI without modifying code.

Reviewed at https://reviews.apache.org/r/63040/

5 weeks ago Protect against null value in RoleQuota
David McLaughlin [Mon, 16 Oct 2017 23:25:59 +0000 (16:25 -0700)] 
Protect against null value in RoleQuota

Reviewed at https://reviews.apache.org/r/63038/

5 weeks ago Use compatible Curator session and connection timeouts
Stephan Erb [Sun, 15 Oct 2017 16:00:17 +0000 (18:00 +0200)] 
Use compatible Curator session and connection timeouts

Curator will warn if used with a session timeout that is lower than
the connection timeout [1]. As it uses a default connection timeout of 15s
[2], this warning will be emitted using the Aurora default settings.

This patch remedies this issue in two ways:

* Making the Curator connection timeout configurable
* Bumping the session timeout to 15s. The current default of 4s is
  pretty small and could lead to unexpected failovers during long GC
  pauses. This is especially problematic as a failover in Aurora can
  be lengthy.
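
The condition can be modelled with a small Python sketch. With the former 4s
session timeout and Curator's 15s default connection timeout, the session
timeout is the smaller value, which is the case the check flags.
`check_timeouts` is a hypothetical helper and the message wording is only
illustrative:

```python
def check_timeouts(session_timeout_ms, connection_timeout_ms):
    """Return a warning string for an incompatible pair, else None."""
    if session_timeout_ms < connection_timeout_ms:
        return "session timeout [%d] is less than connection timeout [%d]" % (
            session_timeout_ms, connection_timeout_ms)
    return None
```

With the session timeout bumped to 15s and the connection timeout left at its
15s default, the check passes.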

[1] https://github.com/apache/curator/blob/15eb063fa22569e797f850fb8d60a0949f52fbf5/curator-client/src/main/java/org/apache/curator/CuratorZookeeperClient.java#L118-L121
[2] https://github.com/apache/curator/blob/6ba4de36d4e8b2b65d45c005a6a92dd85c3c497f/curator-framework/src/main/java/org/apache/curator/framework/CuratorFrameworkFactory.java#L60-L61

Reviewed at https://reviews.apache.org/r/62835/

6 weeks ago Implement Job page in React
David McLaughlin [Thu, 12 Oct 2017 21:08:39 +0000 (14:08 -0700)] 
Implement Job page in React

Reviewed at https://reviews.apache.org/r/62908/

6 weeks ago Use a simpler command line argument system
Bill Farner [Wed, 11 Oct 2017 00:24:45 +0000 (17:24 -0700)] 
Use a simpler command line argument system

Reviewed at https://reviews.apache.org/r/62623/

6 weeks ago Stream backup file from disk
Bill Farner [Tue, 10 Oct 2017 22:21:08 +0000 (15:21 -0700)] 
Stream backup file from disk

This reduces the memory burden of loading a backup for recovery.
Previously, the backup file would be fully loaded into a `byte[]`, which
may be very large and fail to allocate.
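
The general idea, sketched in Python rather than the scheduler's Java: read
the file in fixed-size chunks so that memory use is bounded by the chunk size
instead of the file size. `stream_chunks`, `count_bytes` and the chunk size
are hypothetical choices for illustration:

```python
def stream_chunks(path, chunk_size=64 * 1024):
    """Yield the file's contents one chunk at a time."""
    with open(path, "rb") as backup:
        while True:
            chunk = backup.read(chunk_size)
            if not chunk:
                return
            yield chunk


def count_bytes(path, chunk_size=64 * 1024):
    """Example consumer: total size, without holding the whole file."""
    return sum(len(chunk) for chunk in stream_chunks(path, chunk_size))
```

Any consumer that can process the data incrementally, such as a deserializer
reading from a stream, benefits in the same way.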

Reviewed at https://reviews.apache.org/r/62873/

6 weeks ago Implement Update and Updates pages in React.
David McLaughlin [Tue, 10 Oct 2017 17:14:50 +0000 (10:14 -0700)] 
Implement Update and Updates pages in React.

Reviewed at https://reviews.apache.org/r/62763/

6 weeks ago Fix broken end-to-end tests
Bill Farner [Tue, 10 Oct 2017 16:59:49 +0000 (09:59 -0700)] 
Fix broken end-to-end tests

TContentAwareServlet constrains the supported Content-Type headers,
resulting in test_kerberos_end_to_end.sh failing with the error
`Unsupported Content-Type: application/x-www-form-urlencoded`, which
is the Content-Type header curl chooses when the --data-binary
argument is passed.

Reviewed at https://reviews.apache.org/r/62857/

6 weeks ago Implement Instance pages in React
David McLaughlin [Tue, 10 Oct 2017 15:53:32 +0000 (08:53 -0700)] 
Implement Instance pages in React

Reviewed at https://reviews.apache.org/r/62720/

6 weeks ago Run Jenkins tests without the Gradle daemon
Stephan Erb [Sun, 8 Oct 2017 17:40:53 +0000 (19:40 +0200)] 
Run Jenkins tests without the Gradle daemon

This follows the recommendation of Gradle to only use their daemon when
running in local environments but not in CI environments.

We are seeing spurious build failures from time to time on the shared
Apache build server. Disabling the daemon might help to prevent those.

https://docs.gradle.org/current/userguide/gradle_daemon.html#when_should_i_not_use_the_gradle_daemon

Reviewed at https://reviews.apache.org/r/62832/

6 weeks ago Fix documentation of pystachio Volume struct
Stephan Erb [Sun, 8 Oct 2017 17:17:22 +0000 (19:17 +0200)] 
Fix documentation of pystachio Volume struct

For details, see
https://github.com/apache/aurora/blob/master/src/main/python/apache/aurora/config/schema/base.py#L145

Reviewed at https://reviews.apache.org/r/62829/

6 weeks ago Switch release checksum to sha512
Stephan Erb [Sun, 8 Oct 2017 16:41:35 +0000 (18:41 +0200)] 
Switch release checksum to sha512

For our releases we will now be using .sha512 files rather than .sha files
containing sha1 checksums. This change is triggered by a recent update of
the Apache Release Distribution Policy.

Please see this mail for details:

```
Hi PMC,

    The Release Distribution Policy[1] changed regarding .sha files.
    See under "Cryptographic Signatures and Checksums Requirements" [2].

   Old policy :

     -- use extension .sha for any SHA checksum (SHA-1, SHA-256, SHA-512)

   New policy :

      -- use .sha1 for a SHA-1 checksum
      -- use .sha256 for a SHA-256 checksum
      -- use .sha512 for a SHA-512 checksum
      -- [*] .sha should contain a SHA-1

   Why this change ?

      -- Verifying a checksum under the old policy is/was not handy.
         You have to inspect the .sha to find out which algorithm
         should be used ; or try them all (SHA-1, SHA256, etc).
         The new scheme avoids this ambiguity.
      -- The last point[*] was only added for clarity. Most of the
         old, stale .sha's contain a SHA-1. The relatively new .sha's
         contain a SHA-512. The expectation is that the last catagory will
         disappear, when active projects adapt to the 'new' convention.

   Impact :

      -- Should be none ; many projects already use the 'new' convention.
      -- Please ask your release managers to use .sha1, .sha256, .sha512
         instead of the .sha extension.
      -- Please fix your build-tools if you have any.

   Piggyback :

      -- The policy requires a .md5 for every package ;
         providing a .sha512 is recommended.
         Since MD5 is essentially broken, it is to be expected that
         in the future a .sha512 will be required.
         Perhaps it is wize to start providing .sha512's
         with your releases if you do not already do so.

      -- Visit http://mirror-vm.apache.org/checker/
         to check the health of your /dist/-area ;
         my stuff ; any feedback is most welcome.

   Thanks ; regards,

   Henk Penning

    [1] http://www.apache.org/dev/release-distribution
    [2] http://www.apache.org/dev/release-distribution#sigs-and-sums
```
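
As a rough illustration of the new naming convention (not the project's actual release tooling), the following Python sketch writes a `.sha512` file next to an artifact and verifies it. The function names and the `<hex digest>  <file name>` line layout are assumptions made for the example.

```python
import hashlib
import os


def sha512_of(path, chunk_size=1 << 20):
    """Return the hex SHA-512 digest of the file at `path`."""
    digest = hashlib.sha512()
    with open(path, "rb") as f:
        # Read in chunks so large release artifacts do not need to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_checksum(path):
    """Write `<path>.sha512` (assumed layout: '<hex digest>  <file name>')."""
    checksum_path = path + ".sha512"
    with open(checksum_path, "w") as f:
        f.write("%s  %s\n" % (sha512_of(path), os.path.basename(path)))
    return checksum_path


def verify_checksum(path):
    """Compare the artifact against the digest recorded in `<path>.sha512`."""
    with open(path + ".sha512") as f:
        expected = f.read().split()[0]
    return expected == sha512_of(path)
```

Because the extension names the algorithm, a verifier like `verify_checksum` never has to guess which hash function produced the file.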

Reviewed at https://reviews.apache.org/r/62830/

7 weeks agoConvert Webhook to AbstractIdleService, use async HTTP client
Jordan Ly [Wed, 4 Oct 2017 20:07:19 +0000 (13:07 -0700)] 
Convert Webhook to AbstractIdleService, use async HTTP client

Hijacking https://reviews.apache.org/r/59703

From the above review: "Current code uses a synchronous HTTP client, which can block the EventBus. Switch to an async HTTP client."

Previously, we had an issue where the HTTP client would have a non-daemon thread which caused the Scheduler to fail to shut down. I converted it into an AbstractIdleService and properly closed the client in the shutdown() method. Additionally, I made a small tweak to the original code where we ABORT the response once the status has been received, since we only use the response code for stats.
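
The shape of that change can be sketched in Python as an analogy. This is not the Java code from the review and does not use Guava or a real HTTP client. The `Webhook` class, the `start_up`/`shut_down` names, and the injected `send` callable (assumed to return an HTTP status code) are all hypothetical.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class Webhook:
    """Idle-service-style wrapper: resources exist only between start_up and shut_down."""

    def __init__(self, send, max_workers=4):
        # `send` is a hypothetical callable that posts an event and returns a status code.
        self._send = send
        self._max_workers = max_workers
        self._pool = None
        self._lock = threading.Lock()
        self.status_counts = {}

    def start_up(self):
        self._pool = ThreadPoolExecutor(max_workers=self._max_workers)

    def shut_down(self):
        # Releasing the worker threads here is what lets the process exit cleanly.
        self._pool.shutdown(wait=True)
        self._pool = None

    def on_event(self, event):
        """Called by the event publisher; hands the work off and returns immediately."""
        future = self._pool.submit(self._send, event)
        future.add_done_callback(self._record)
        return future

    def _record(self, future):
        # Only the status code (or the fact that sending failed) is kept for stats.
        key = "error" if future.exception() is not None else future.result()
        with self._lock:
            self.status_counts[key] = self.status_counts.get(key, 0) + 1
```

The publisher thread only pays for `submit`, so a slow endpoint delays the worker pool rather than event delivery.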

Testing Done:
./gradlew test

Tested proper shutdown occurs in Vagrant.

Scale tested up to 2000 TASK_LOST events with the registered endpoint waiting 5-10 minutes to respond -- does not seem to block scheduling.

Bugs closed: AURORA-1773

Reviewed at https://reviews.apache.org/r/62700/

7 weeks agoImplement Role and Environment pages in Preact.
David McLaughlin [Tue, 3 Oct 2017 19:49:38 +0000 (12:49 -0700)] 
Implement Role and Environment pages in Preact.

Reviewed at https://reviews.apache.org/r/62451/

7 weeks agoReplace auto-generated forwarding code with manual implementations
Bill Farner [Sun, 1 Oct 2017 15:49:06 +0000 (08:49 -0700)] 
Replace auto-generated forwarding code with manual implementations

Reviewed at https://reviews.apache.org/r/62716/

8 weeks agoRemove the rewriteConfigs thrift method
Bill Farner [Fri, 29 Sep 2017 22:30:06 +0000 (15:30 -0700)] 
Remove the rewriteConfigs thrift method

Reviewed at https://reviews.apache.org/r/62601/

8 weeks agoAdded additional stop() to prevent errors in run() to stop shutdown in SchedulerMain
Jordan Ly [Fri, 29 Sep 2017 00:23:17 +0000 (17:23 -0700)] 
Added additional stop() to prevent errors in run() to stop shutdown in SchedulerMain

Ensure that `SchedulerMain.run()` calls stop() when an exception is thrown. This prevents the Scheduler from transitioning to the DEAD state without actually stopping its services.

See the attached ticket for an example of the issue happening.
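
The pattern can be sketched as follows. This is a Python illustration only, not the Java change itself, and the `Scheduler` class and its injected `prepare` callable are hypothetical stand-ins.

```python
class Scheduler:
    """Hypothetical stand-in for a scheduler whose startup step may fail."""

    def __init__(self, prepare):
        self._prepare = prepare
        self.stopped = False

    def stop(self):
        # A real implementation would shut down its services here.
        self.stopped = True

    def run(self):
        try:
            self._prepare()
        except Exception:
            # Stop services before propagating, so a failed startup
            # cannot leave the process half-alive.
            self.stop()
            raise
```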

Testing Done:
Added an additional unit test for prepare() failing in `SchedulerLifecycle.java`.

./gradlew test
./build-support/jenkins/build.sh

Bugs closed: AURORA-1950

Reviewed at https://reviews.apache.org/r/62626/

8 weeks agoAllow transitions from any state to STOPPED in CallOrderEnforcingStorage
Jordan Ly [Thu, 28 Sep 2017 22:18:14 +0000 (00:18 +0200)] 
Allow transitions from any state to STOPPED in CallOrderEnforcingStorage

- Allow transitions from any state to STOPPED in CallOrderEnforcingStorage,
  including adding a STOPPED -> STOPPED transition so stop() can be used idempotently.
- Use the StateMachines.checkState method (I wasn't sure if the current checkInState
  was designed for anything other than throwing a TransientStorageException)
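
A minimal Python sketch of such a transition rule is shown below. The state names and the forward transitions are assumptions for illustration, and the `Lifecycle` class is hypothetical rather than a port of CallOrderEnforcingStorage.

```python
from enum import Enum


class State(Enum):
    CONSTRUCTED = 1
    PREPARED = 2
    READY = 3
    STOPPED = 4


# Forward transitions for the illustrative lifecycle (assumed, not taken from the source).
FORWARD = {
    State.CONSTRUCTED: {State.PREPARED},
    State.PREPARED: {State.READY},
}


class Lifecycle:
    def __init__(self):
        self.state = State.CONSTRUCTED

    def transition(self, target):
        # STOPPED is reachable from every state, including STOPPED itself.
        allowed = target is State.STOPPED or target in FORWARD.get(self.state, set())
        if not allowed:
            raise ValueError(
                "illegal transition %s -> %s" % (self.state.name, target.name))
        self.state = target

    def stop(self):
        self.transition(State.STOPPED)
```

Calling `stop()` twice is harmless under this rule, which is what makes it safe to call from several shutdown paths.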

Bugs closed: AURORA-1950

Reviewed at https://reviews.apache.org/r/62621/

8 weeks agoReplace Preact and custom testing with React + Enzyme
David McLaughlin [Wed, 27 Sep 2017 21:34:02 +0000 (14:34 -0700)] 
Replace Preact and custom testing with React + Enzyme

Reviewed at https://reviews.apache.org/r/62607/

8 weeks agoWorkaround to get pants working in macOS high sierra
Bill Farner [Wed, 27 Sep 2017 18:39:56 +0000 (11:39 -0700)] 
Workaround to get pants working in macOS high sierra

This is a cheat to use pants' thrift binary from 10.12.

Reviewed at https://reviews.apache.org/r/62608/

8 weeks agoFix binding issues preventing ./gradle run from working
Bill Farner [Wed, 27 Sep 2017 18:38:25 +0000 (11:38 -0700)] 
Fix binding issues preventing ./gradle run from working

Reviewed at https://reviews.apache.org/r/62620/

8 weeks agoUse a more efficient query for instance ID collision detection
Bill Farner [Wed, 27 Sep 2017 01:51:52 +0000 (18:51 -0700)] 
Use a more efficient query for instance ID collision detection

Reviewed at https://reviews.apache.org/r/62604/

8 weeks agoRestore scheduler benchmarks to working order
Bill Farner [Tue, 26 Sep 2017 23:42:07 +0000 (16:42 -0700)] 
Restore scheduler benchmarks to working order

Reviewed at https://reviews.apache.org/r/62558/

2 months agoUpdate to gradle 4.2
Bill Farner [Sat, 23 Sep 2017 15:05:08 +0000 (08:05 -0700)] 
Update to gradle 4.2

Reviewed at https://reviews.apache.org/r/62517/

2 months agoImprove in-process test ZooKeeper support
Keisuke Nishimoto [Thu, 21 Sep 2017 21:33:27 +0000 (14:33 -0700)] 
Improve in-process test ZooKeeper support

MesosLogStreamModule tries to connect to ZooKeeper servers specified by
-zk_endpoints even when -zk_in_proc=true.  I updated the module to use
injected server endpoints which will be based on the ephemeral port assigned
to ZooKeeperTestServer if -zk_in_proc=true.  This required making
@ServiceDiscoveryBindings.ZooKeeper public.

I also tweaked the shutdown process of ServiceDiscoveryModule.TestServerService
so that it won't close existing ZooKeeper connections before clients close their
sessions.  While just delaying the execution by 1 second doesn't really
guarantee that behavior, in practice this achieved a clean shutdown of the
scheduler with the in-process ZooKeeper server.
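
The endpoint selection described above amounts to something like the following Python sketch. The function name and parameters are hypothetical, and the real module receives its endpoints through injection rather than a function call.

```python
def log_zk_endpoints(zk_in_proc, zk_endpoints, test_server_port=None):
    """Pick the ZooKeeper endpoints the log module should connect to."""
    if zk_in_proc:
        if test_server_port is None:
            raise ValueError("in-process ZooKeeper requires the test server port")
        # Ignore the configured endpoints and use the ephemeral port
        # assigned to the in-process test server.
        return "localhost:%d" % test_server_port
    return zk_endpoints
```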

Testing Done:
1. Launch Mesos master and slave on my laptop.
2. Launch the Aurora scheduler with the following arguments:
```
-backup_dir=/var/lib/aurora/backups
-cluster_name=local
-mesos_master_address=localhost:5050
-serverset_path=/aurora/scheduler
-ip=127.0.0.1
-hostname=localhost
-http_port=8081
-zk_in_proc=true
-zk_endpoints=localhost:2181
-native_log_zk_group_path=/aurora/replicated-log
-native_log_file_path=/var/db/aurora
```
3. Observe that there are no ZooKeeper error log outputs caused by missing
   endpoint.
4. Create a simple job, observe that it launches normally, and then kill it.
5. Stop the scheduler by sending /quitquitquit.
6. Observe that scheduler process shuts down normally.

Bugs closed: AURORA-1947

Reviewed at https://reviews.apache.org/r/62423/

2 months agoAdd Houghton Mifflin Harcourt to adopters list
Robert Allen [Sun, 17 Sep 2017 22:10:36 +0000 (00:10 +0200)] 
Add Houghton Mifflin Harcourt to adopters list

Reviewed at https://reviews.apache.org/r/62347/