An asynchronous job scheduler for `Adaptive <https://github.com/python-adaptive/adaptive/>`_
==============================================================================================

|PyPI| |Conda| |Downloads| |Build Status| |Documentation Status|

Run many ``adaptive.Learner``\ s on many cores (>10k) using `mpi4py.futures`, `ipyparallel`, or `distributed`.

What is this?
-------------

The Adaptive scheduler solves the following problem: you need to run more learners than you can run with a single runner and/or you need >1k cores.

`ipyparallel` and `distributed` provide very powerful engines for interactive sessions. However, when you want to connect to >1k cores they start to struggle. Besides that, on a shared cluster it is often difficult to start an interactive session with enough resources available.

Our approach is to schedule a different job for each ``adaptive.Learner``. The creation and running of these jobs are managed by ``adaptive-scheduler``. This means that your calculation will definitely run, even if the cluster is fully occupied at the moment. Because of this approach, there is almost no limit to how many cores you can use. You can either use 10 nodes for 1 job (\ ``learner``\ ) or 1 core for 1 job (\ ``learner``\ ) while scheduling hundreds of jobs.

Everything is written such that the computation is maximally local. This means that if one of the jobs crashes, there is no problem: a new one is automatically scheduled and the calculation continues where it left off (thanks to Adaptive's periodic saving functionality). Even if the central "job manager" dies, the jobs will continue to run (although no new jobs will be scheduled).
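
A minimal sketch of this resume-on-restart idea, using Adaptive's documented ``learner.load`` and ``runner.start_periodic_saving`` (the actual ``run_learner.py`` that ``adaptive-scheduler`` generates is more elaborate):

.. code-block:: python

   import os

   import adaptive

   def resume_and_run(learner, fname, goal):
       # If an earlier (crashed) job already saved data, pick up where it left off.
       if os.path.exists(fname):
           learner.load(fname)
       runner = adaptive.Runner(learner, goal=goal)
       # Save every 300 s, so a crash loses at most the last interval of work.
       runner.start_periodic_saving(save_kwargs=dict(fname=fname), interval=300)
       return runner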


Design goals
------------

#. Needs to be able to run efficiently on >30k cores
#. Works seamlessly with the Adaptive package
#. Minimal load on the file system
#. Removes all boilerplate of working with a scheduler

   #. writes job scripts
   #. (re)submits job scripts

#. Handles random crashes (or node evictions) with minimal data loss
#. Preserves Python kernel and variables inside a job (in contrast to submitting jobs for every parameter)
#. Separates the simulation definition code from the code that runs the simulation
#. Maximizes computation locality, jobs continue to run when the main process dies

How does it work?
-----------------

You create a bunch of ``learners`` and corresponding ``fnames`` such that they can be loaded, like:

.. code-block:: python

   import adaptive
   from functools import partial

   def h(x, pow, a):
       return a * x**pow

   combos = adaptive.utils.named_product(
       pow=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
       a=[0.1, 0.5],
   )  # returns list of dicts, cartesian product of all values

   learners = [adaptive.Learner1D(partial(h, **combo),
               bounds=(-1, 1)) for combo in combos]
   fnames = [f"data/{combo}" for combo in combos]


Then you start a process that creates and submits as many job scripts as there are learners, like:

.. code-block:: python

   import adaptive_scheduler

   def goal(learner):
       return learner.npoints > 200

   scheduler = adaptive_scheduler.scheduler.SLURM(cores=10)  # every learner gets this many cores

   run_manager = adaptive_scheduler.server_support.RunManager(
       scheduler,
       learners,
       fnames,
       goal=goal,
       log_interval=30,  # write info such as npoints, cpu_usage, time, etc. to the job log file
       save_interval=300,  # save the data every 300 seconds
   )
   run_manager.start()
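
The ``goal`` can be any callable that takes a learner and returns a ``bool``. For example, a loss-based goal, sketched here with Adaptive's standard ``learner.loss()``:

.. code-block:: python

   def loss_goal(learner):
       # Stop once the learner's loss estimate drops below a threshold.
       return learner.loss() < 0.01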


That's it! You can run ``run_manager.info()``, which will display an interactive ``ipywidget`` showing the number of running, pending, and finished jobs, buttons to cancel your jobs, and other useful information.

.. image:: http://files.nijho.lt/info.gif
   :target: http://files.nijho.lt/info.gif
   :alt: Widget demo



But how does it *really* work?
------------------------------

The `~adaptive_scheduler.server_support.RunManager` basically does the following.
*You* create ``N`` ``learners`` and ``fnames`` (as in the section above).
Then a "job manager" writes and submits ``max(N, max_simultaneous_jobs)`` job scripts, but it *doesn't know* which learner each job is going to run!
That is the responsibility of the "database manager", which keeps a database of ``job_id <--> learner``.
The job script starts a Python file, ``run_learner.py``, in which the learner is run.
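
Conceptually, each database entry ties a learner (identified by its ``fname``) to the job currently running it. The snippet below is purely illustrative; the actual schema of this database is an implementation detail of ``adaptive-scheduler``:

.. code-block:: python

   # Purely illustrative -- NOT the actual schema used by adaptive-scheduler.
   entry = {
       "fname": "data/{'pow': 0, 'a': 0.1}",  # identifies the learner
       "job_id": "4242",                      # scheduler job id, or None while no job runs it
       "is_done": False,                      # whether the learner has reached its goal
   }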


In a Jupyter notebook we can start the "job manager" and the "database manager", and create the ``run_learner.py`` like:

.. code-block:: python

   import adaptive_scheduler
   from adaptive_scheduler import server_support

   # create a scheduler
   scheduler = adaptive_scheduler.scheduler.SLURM(cores=10, run_script="run_learner.py")

   # create a new database that keeps track of job <-> learner
   db_fname = "running.json"
   url = (
      server_support.get_allowed_url()
   )  # get a url where we can run the database_manager
   database_manager = server_support.DatabaseManager(
      url, scheduler, db_fname, learners, fnames
   )
   database_manager.start()

   # create the Python script that runs a learner (run_learner.py)
   server_support._make_default_run_script(
      url=url,
      save_interval=300,
      log_interval=30,
      goal=None,
      executor_type=scheduler.executor_type,
      run_script_fname=scheduler.run_script,
   )

   # create unique names for the jobs
   n_jobs = len(learners)
   job_names = [f"test-job-{i}" for i in range(n_jobs)]

   job_manager = server_support.JobManager(job_names, database_manager, scheduler)
   job_manager.start()


Then, once the jobs have been running for a while, you can check ``server_support.parse_log_files(job_names, database_manager, scheduler)``.

And use ``scheduler.cancel(job_names)`` to cancel the jobs.
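
For example, assuming (as in recent versions) that ``parse_log_files`` returns a pandas ``DataFrame``:

.. code-block:: python

   # Inspect per-job progress (npoints, CPU usage, timings, ...).
   df = server_support.parse_log_files(job_names, database_manager, scheduler)
   print(df)

   # When you are done (or want to stop), cancel all jobs.
   scheduler.cancel(job_names)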

You never actually have to leave the Jupyter notebook; take a look at the `example notebook <https://github.com/basnijholt/adaptive-scheduler/blob/master/example.ipynb>`_.

Jupyter notebook example
------------------------

See `example.ipynb <https://github.com/basnijholt/adaptive-scheduler/blob/master/example.ipynb>`_.

Installation
------------

**WARNING:** This project is still in the pre-alpha development stage.

Install the **latest stable** version from conda with (recommended)

.. code-block:: bash

   conda install -c conda-forge adaptive-scheduler


or from PyPI with

.. code-block:: bash

   pip install adaptive_scheduler


or install **master** with

.. code-block:: bash

   pip install -U https://github.com/basnijholt/adaptive-scheduler/archive/master.zip


or clone the repository and do a dev install (recommended for dev)

.. code-block:: bash

   git clone git@github.com:basnijholt/adaptive-scheduler.git
   cd adaptive-scheduler
   pip install -e .


Development
-----------

To avoid polluting the git history with notebook output, please set up the git filter by executing

.. code-block:: bash

   python ipynb_filter.py


in the repository.

We also use `pre-commit <https://pre-commit.com>`_\ , so ``pip install pre-commit`` and run

.. code-block:: bash

   pre-commit install


in the repository.

Limitations
-----------

Right now ``adaptive_scheduler`` only works with SLURM and PBS; however, to support another type of scheduler, one only has to implement a class like those in `adaptive_scheduler/scheduler.py <https://github.com/basnijholt/adaptive-scheduler/blob/master/adaptive_scheduler/scheduler.py#L471>`_.
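
Very roughly, such a class needs to know how to render a job script and how to submit and cancel jobs. The skeleton below is hypothetical; the method names are illustrative and the real interface should be taken from ``adaptive_scheduler/scheduler.py``:

.. code-block:: python

   # Hypothetical skeleton -- the method names are illustrative and do NOT
   # match the actual base-class interface in adaptive_scheduler/scheduler.py.
   class MyScheduler:
       def __init__(self, cores):
           self.cores = cores

       def job_script(self):
           # Return the text of a job script that launches ``run_learner.py``.
           ...

       def start_job(self, name):
           # Submit the job script under ``name`` (e.g. shell out to the
           # scheduler's submit command, like ``sbatch`` or ``qsub``).
           ...

       def cancel(self, job_names):
           # Cancel the given jobs (e.g. via ``scancel`` or ``qdel``).
           ...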
Also, there are **no tests** at all!

.. references-start
.. |PyPI| image:: https://img.shields.io/pypi/v/adaptive-scheduler.svg
   :target: https://pypi.python.org/pypi/adaptive-scheduler
   :alt: PyPI
.. |Conda| image:: https://anaconda.org/conda-forge/adaptive-scheduler/badges/installer/conda.svg
   :target: https://anaconda.org/conda-forge/adaptive-scheduler
   :alt: Conda
.. |Downloads| image:: https://anaconda.org/conda-forge/adaptive-scheduler/badges/downloads.svg
   :target: https://anaconda.org/conda-forge/adaptive-scheduler
   :alt: Downloads
.. |Build Status| image:: https://dev.azure.com/basnijholt/adaptive-scheduler/_apis/build/status/basnijholt.adaptive-scheduler?branchName=master
   :target: https://dev.azure.com/basnijholt/adaptive-scheduler/_build/latest?definitionId=1&branchName=master
   :alt: Build Status
.. |Documentation Status| image:: https://readthedocs.org/projects/adaptive-scheduler/badge/?version=latest
   :target: https://adaptive-scheduler.readthedocs.io/en/latest/?badge=latest
   :alt: Documentation Status
.. references-end



            
