lambda-multiprocessing


Namelambda-multiprocessing JSON
Version 0.5 PyPI version JSON
download
home_pagehttps://github.com/mdavis-xyz/lambda_multiprocessing
Summarydrop-in replacement for multiprocessing.Pool in AWS Lambda functions (without /dev/shm shared memory)
upload_time2024-03-04 18:25:24
maintainer
docs_urlNone
authorMatthew Davis
requires_python
license
keywords python aws amazon lambda multiprocessing pool concurrency
VCS
bugtrack_url
requirements No requirements were recorded.
Travis-CI No Travis.
coveralls test coverage No coveralls.
            # lambda_multiprocessing - Multiprocessing in AWS Lambda

This library is for doing multiprocessing in AWS Lambda in python.

(This is unrelated to inline lambda functions such as `f = lambda x: x*x`.
 That's a different kind of lambda function.)

## The Problem

If you deploy Python code to an [AWS Lambda function](https://aws.amazon.com/lambda/),
the multiprocessing functions in the standard library such as [`multiprocessing.Pool.map`](https://docs.python.org/3/library/multiprocessing.html?highlight=multiprocessing%20python%20map%20pool#multiprocessing.pool.Pool.map) will not work.

For example:

```
from multiprocessing import Pool
def func(x):
    return x*x
args = [1,2,3]
with Pool() as p:
    result = p.map(func, args)
```

will give you:

```
OSError: [Errno 38] Function not implemented
```

This is because AWS Lambda functions are very bare bones,
and have no shared memory device (`/dev/shm`).



## The Solution

There is a workaround using `Pipe`s and `Process`es.
Amazon documented it [in this blog post](https://aws.amazon.com/blogs/compute/parallel-processing-in-python-with-aws-lambda/).
However that example is very much tied to the work being done,
it doesn't have great error handling,
and is not structured in the way you'd expect when using the normal `multiprocessing` library.

The purpose of this library is to take the solution from that blog post,
and turn it into a drop-in replacement for `multiprocessing.Pool`.
This also includes unit testing, error handling etc, to match what you get from `multiprocessing.Pool`.

## Usage

Install with:

```
pip install lambda_multiprocessing
```

Once you've imported `Pool`, it acts just like `multiprocessing.Pool`.
[Details here](https://docs.python.org/3/library/multiprocessing.html?highlight=multiprocessing%20python%20map%20pool#module-multiprocessing.pool).

```
from lambda_multiprocessing import Pool

def work(x):
    return x*x

with Pool() as p:
    results = p.map(work, range(5))
assert results == [x*x for x in range(5)]
```

Note that Lambda functions usually have only 2 vCPUs.
If you allocate a lot or memory you get a few more.
(e.g. 3 at 5120MB, 6 at 10240MB)
The performance benefit you get from multiprocessing CPU-bound tasks is equal to the number of CPUs, minus overhead.
(e.g. double speed for multiprocessing with 2 vCPUs)
You can get bigger performance benefits for IO-bound tasks.
(e.g. uploading many files to S3, publishing many payloads to SNS etc).

## Limitations

When constructing the pool, initializer, initargs, maxtasksperchild and context have not been implemented.

For `*map*` functions,
callbacks and chunk sizes have not been implemented.

`imap` and `imap_unordered` have not been implemented.

If you need any of these things implemented, raise an issue or a PR in github.

## Concurrency Safety

Boto3 (the AWS SDK) is concurrency safe.
However the `client` and `session` objects should not be shared between processes or threads.
So do not pass those to or from the child processes.

`moto` (a library for unit-testing code that uses boto, by emulating most AWS services in memory)
is **not** concurrency safe.
So if you're unit testing using moto, pass `0` as the first argument to `Pool`,
and then all the work will actually be done in the main thread.
i.e. no multiprocessing at all.
So you need an `if` statement to pass 0 or a positive integer based on whether this is unit testing or the real thing.

## Development

This library has no dependencies.
The unit tests depend on `boto3` and `moto`.

```
pip install -r lambda_multiprocessing/requirements_test.txt
```

Then you can run the unit tests with:

```
python3 -m unittest
```

`CICD` is for the GitHub Actions which run unit tests and integration tests.
You probably don't need to touch those.

## Design

When you `__enter__` the pool, it creates several `Child`s.
These contain the actual child `Process`es,
plus a duplex pipe to send tasks to the child and get results back.

The child process just waits for payloads to appear in the pipe.
It grabs the function and arguments from it, does the work,
catches any exception, then sends the exception or result back through the pipe.
Note that the function that the client gives to this library might return an Exception for some reason,
so we return either `[result, None]` or `[None, Exception]`, to differentiate.

To close everything up when we're done, the parent sends a payload with a different structure (`payload[-1] == True`)
and then the child will gracefully exit.

We keep a counter of how many tasks we've given to the child, minus how many results we've got back.
When assigning work, we give it to a child chosen randomly from the set of children whose counter value is smallest.
(i.e. shortest backlog)

When passing the question and answer to the child and back, we pass around a UUID.
This is because the client may submit two tasks with apply_async, then request the result for the second one,
before the first.
We can't assume that the next result coming back from the child is the one we want,
since each child can have a backlog of a few tasks.

Originally I passed a new pipe to the child for each task to process,
but this results in OSErrors from too many open files (i.e. too many pipes),
and passing pipes through pipes is unusually slow on low-memory Lambda functions for some reason.

Note that `multiprocessing.Queue` doesn't work in Lambda functions.
So we can't use that to distribute work amongst the children.

            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/mdavis-xyz/lambda_multiprocessing",
    "name": "lambda-multiprocessing",
    "maintainer": "",
    "docs_url": null,
    "requires_python": "",
    "maintainer_email": "",
    "keywords": "python,AWS,Amazon,Lambda,multiprocessing,pool,concurrency",
    "author": "Matthew Davis",
    "author_email": "github@mdavis.xyz",
    "download_url": "https://files.pythonhosted.org/packages/ef/48/35f4de4e1ff76f91398bc428a0347280b3c5c8897aa6b571e26ee5c404b9/lambda_multiprocessing-0.5.tar.gz",
    "platform": null,
    "description": "# lambda_multiprocessing - Multiprocessing in AWS Lambda\n\nThis library is for doing multiprocessing in AWS Lambda in python.\n\n(This is unrelated to inline lambda functions such as `f = lambda x: x*x`.\n That's a different kind of lambda function.)\n\n## The Problem\n\nIf you deploy Python code to an [AWS Lambda function](https://aws.amazon.com/lambda/),\nthe multiprocessing functions in the standard library such as [`multiprocessing.Pool.map`](https://docs.python.org/3/library/multiprocessing.html?highlight=multiprocessing%20python%20map%20pool#multiprocessing.pool.Pool.map) will not work.\n\nFor example:\n\n```\nfrom multiprocessing import Pool\ndef func(x):\n    return x*x\nargs = [1,2,3]\nwith Pool() as p:\n    result = p.map(func, args)\n```\n\nwill give you:\n\n```\nOSError: [Errno 38] Function not implemented\n```\n\nThis is because AWS Lambda functions are very bare bones,\nand have no shared memory device (`/dev/shm`).\n\n\n\n## The Solution\n\nThere is a workaround using `Pipe`s and `Process`es.\nAmazon documented it [in this blog post](https://aws.amazon.com/blogs/compute/parallel-processing-in-python-with-aws-lambda/).\nHowever that example is very much tied to the work being done,\nit doesn't have great error handling,\nand is not structured in the way you'd expect when using the normal `multiprocessing` library.\n\nThe purpose of this library is to take the solution from that blog post,\nand turn it into a drop-in replacement for `multiprocessing.Pool`.\nThis also includes unit testing, error handling etc, to match what you get from `multiprocessing.Pool`.\n\n## Usage\n\nInstall with:\n\n```\npip install lambda_multiprocessing\n```\n\nOnce you've imported `Pool`, it acts just like `multiprocessing.Pool`.\n[Details here](https://docs.python.org/3/library/multiprocessing.html?highlight=multiprocessing%20python%20map%20pool#module-multiprocessing.pool).\n\n```\nfrom lambda_multiprocessing import Pool\n\ndef work(x):\n    return x*x\n\nwith Pool() as p:\n    results = p.map(work, range(5))\nassert results == [x*x for x in range(5)]\n```\n\nNote that Lambda functions usually have only 2 vCPUs.\nIf you allocate a lot or memory you get a few more.\n(e.g. 3 at 5120MB, 6 at 10240MB)\nThe performance benefit you get from multiprocessing CPU-bound tasks is equal to the number of CPUs, minus overhead.\n(e.g. double speed for multiprocessing with 2 vCPUs)\nYou can get bigger performance benefits for IO-bound tasks.\n(e.g. uploading many files to S3, publishing many payloads to SNS etc).\n\n## Limitations\n\nWhen constructing the pool, initializer, initargs, maxtasksperchild and context have not been implemented.\n\nFor `*map*` functions,\ncallbacks and chunk sizes have not been implemented.\n\n`imap` and `imap_unordered` have not been implemented.\n\nIf you need any of these things implemented, raise an issue or a PR in github.\n\n## Concurrency Safety\n\nBoto3 (the AWS SDK) is concurrency safe.\nHowever the `client` and `session` objects should not be shared between processes or threads.\nSo do not pass those to or from the child processes.\n\n`moto` (a library for unit-testing code that uses boto, by emulating most AWS services in memory)\nis **not** concurrency safe.\nSo if you're unit testing using moto, pass `0` as the first argument to `Pool`,\nand then all the work will actually be done in the main thread.\ni.e. no multiprocessing at all.\nSo you need an `if` statement to pass 0 or a positive integer based on whether this is unit testing or the real thing.\n\n## Development\n\nThis library has no dependencies.\nThe unit tests depend on `boto3` and `moto`.\n\n```\npip install -r lambda_multiprocessing/requirements_test.txt\n```\n\nThen you can run the unit tests with:\n\n```\npython3 -m unittest\n```\n\n`CICD` is for the GitHub Actions which run unit tests and integration tests.\nYou probably don't need to touch those.\n\n## Design\n\nWhen you `__enter__` the pool, it creates several `Child`s.\nThese contain the actual child `Process`es,\nplus a duplex pipe to send tasks to the child and get results back.\n\nThe child process just waits for payloads to appear in the pipe.\nIt grabs the function and arguments from it, does the work,\ncatches any exception, then sends the exception or result back through the pipe.\nNote that the function that the client gives to this library might return an Exception for some reason,\nso we return either `[result, None]` or `[None, Exception]`, to differentiate.\n\nTo close everything up when we're done, the parent sends a payload with a different structure (`payload[-1] == True`)\nand then the child will gracefully exit.\n\nWe keep a counter of how many tasks we've given to the child, minus how many results we've got back.\nWhen assigning work, we give it to a child chosen randomly from the set of children whose counter value is smallest.\n(i.e. shortest backlog)\n\nWhen passing the question and answer to the child and back, we pass around a UUID.\nThis is because the client may submit two tasks with apply_async, then request the result for the second one,\nbefore the first.\nWe can't assume that the next result coming back from the child is the one we want,\nsince each child can have a backlog of a few tasks.\n\nOriginally I passed a new pipe to the child for each task to process,\nbut this results in OSErrors from too many open files (i.e. too many pipes),\nand passing pipes through pipes is unusually slow on low-memory Lambda functions for some reason.\n\nNote that `multiprocessing.Queue` doesn't work in Lambda functions.\nSo we can't use that to distribute work amongst the children.\n",
    "bugtrack_url": null,
    "license": "",
    "summary": "drop-in replacement for multiprocessing.Pool in AWS Lambda functions (without /dev/shm shared memory)",
    "version": "0.5",
    "project_urls": {
        "Homepage": "https://github.com/mdavis-xyz/lambda_multiprocessing"
    },
    "split_keywords": [
        "python",
        "aws",
        "amazon",
        "lambda",
        "multiprocessing",
        "pool",
        "concurrency"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "38c484a42dfd5c1367481979ea38a1d4e90d617cee5410061e7ca07dcd6f6988",
                "md5": "bdcf4747f5b86a7322f0d707ee80b84b",
                "sha256": "d460572683cc42cb64fa4ffebec2d3f08d5dbf1dabbba0ce38bc28bfd33f32a4"
            },
            "downloads": -1,
            "filename": "lambda_multiprocessing-0.5-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "bdcf4747f5b86a7322f0d707ee80b84b",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": null,
            "size": 11679,
            "upload_time": "2024-03-04T18:25:22",
            "upload_time_iso_8601": "2024-03-04T18:25:22.549747Z",
            "url": "https://files.pythonhosted.org/packages/38/c4/84a42dfd5c1367481979ea38a1d4e90d617cee5410061e7ca07dcd6f6988/lambda_multiprocessing-0.5-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "ef4835f4de4e1ff76f91398bc428a0347280b3c5c8897aa6b571e26ee5c404b9",
                "md5": "1901cc5d741d162c2150002d1226ee68",
                "sha256": "f85d7abdd8234c3e9e5b7c769fcc9845c6d96d3838f8f7aa7ea2f9dd5cb45af3"
            },
            "downloads": -1,
            "filename": "lambda_multiprocessing-0.5.tar.gz",
            "has_sig": false,
            "md5_digest": "1901cc5d741d162c2150002d1226ee68",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": null,
            "size": 12959,
            "upload_time": "2024-03-04T18:25:24",
            "upload_time_iso_8601": "2024-03-04T18:25:24.052555Z",
            "url": "https://files.pythonhosted.org/packages/ef/48/35f4de4e1ff76f91398bc428a0347280b3c5c8897aa6b571e26ee5c404b9/lambda_multiprocessing-0.5.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-03-04 18:25:24",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "mdavis-xyz",
    "github_project": "lambda_multiprocessing",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "lambda-multiprocessing"
}
        
Elapsed time: 0.19817s