bartide

Name	bartide
Version	0.3.2
home_page	https://github.com/parashardhapola/bartide
Summary	A Python package to extract, correct and analyze nucleotide barcodes from sequenced reads.
upload_time	2022-12-05 10:26:47
maintainer
docs_url	None
author	Parashar Dhapola
requires_python
license
keywords	text mining barcode nucleotide sequencing deduplicate

[![PyPI version shields.io](https://img.shields.io/pypi/v/bartide.svg)](https://pypi.python.org/pypi/bartide/)
[![PyPI license](https://img.shields.io/pypi/l/bartide.svg)](https://pypi.python.org/pypi/bartide/)

# Bartide
## Extract, correct and analyze barcodes from sequenced reads

### INSTALLATION

To install Bartide you need Python version 3.9 or later. The suggested way to install Python is to use Miniconda:
https://docs.conda.io/en/latest/miniconda.html

Use the following command to install Bartide:
```
pip install bartide
```

#### Installation of NMSlib:
The easiest way to install NMSlib is to use the precompiled version from the conda-forge repo. This works across Linux, macOS and Windows machines.
```
conda install -c conda-forge nmslib
```

### USAGE

### 1. Extraction

Barcodes are extracted from the sequencing reads using Bartide's `BarcodeExtractor` class. This class takes two input files in FASTQ format, one for each end of paired-end sequencing. It is designed to extract the barcodes automatically, assuming that the barcodes are all of the same length, span the same position in the reads, and are surrounded by a constant flanking sequence. This is achieved by first summarizing the nucleotide composition at each position across all the reads (or a sample of reads). Each position in the flanking sequence will be dominated by a single nucleotide, while the barcode positions should have variable base composition. This pattern is used to identify the position of the barcodes and the flanking primer sequence.

This behaviour can be overridden by providing the barcode length and the flanking primer sequence to `BarcodeExtractor`. Below we show an example of how to call the `BarcodeExtractor` class and perform automatic flank identification.

```
import bartide

extractor = bartide.BarcodeExtractor(
'sample1_read1.fastq.gz',
'sample1_read2.fastq.gz'
)

extractor.identify_flanks()
```
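The idea behind automatic flank identification can be illustrated with a minimal sketch. This is not Bartide's implementation: the function name `variable_positions` and the `max_dominant_fraction` threshold are hypothetical, chosen only to show how per-position base composition separates constant flanks from variable barcode positions.

```python
from collections import Counter


def variable_positions(reads, max_dominant_fraction=0.9):
    """Return 0-based positions where no single base dominates.

    `max_dominant_fraction` is a hypothetical threshold for this sketch,
    not a Bartide parameter.
    """
    length = min(len(read) for read in reads)
    positions = []
    for i in range(length):
        # Base composition at position i across all reads.
        counts = Counter(read[i] for read in reads)
        dominant_fraction = counts.most_common(1)[0][1] / len(reads)
        if dominant_fraction < max_dominant_fraction:
            positions.append(i)
    return positions


# Toy reads: constant flanks "GTA" and "CG" around a 4-base variable region.
reads = ["GTAACGTCG", "GTATTACCG", "GTACGCACG", "GTAGATGCG"]
barcode_positions = variable_positions(reads)
print(barcode_positions)
```

In this toy input the flank positions are identical in every read, so only the four middle positions are reported as variable.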

Alternatively, users can provide the flanking sequences and the barcode length manually, as shown below:

```
extractor = bartide.BarcodeExtractor(
'sample1_read1.fastq.gz',
'sample1_read2.fastq.gz',
left_flank='GTAGCC',
right_flank='AGATCG',
barcode_length=27
)
```

Once the flank sequences and barcode length are determined, they are stored as `extractor.leftFlank`, `extractor.rightFlank` and `extractor.barcodeLength`. Now the barcodes can be extracted and their frequencies counted. The `BarcodeExtractor` class compares the barcode sequence from one read with the reverse complement of the barcode from the other read of the pair. By default, no more than 3 mismatches are allowed between the two sequences; otherwise, the extraction fails for that read. Users can change the maximum allowed number of mismatches with the `max_dist` parameter when calling `BarcodeExtractor`.
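The paired-read check described above can be sketched as follows. This is an illustration under stated assumptions rather than Bartide's own code: the helper names `reverse_complement`, `hamming` and `pair_agrees` are hypothetical, and only the `max_dist` name and its default of 3 come from the description above.

```python
# Translation table for complementing bases (N is kept as N).
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def hamming(a, b):
    """Count mismatching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def pair_agrees(barcode_read1, barcode_read2, max_dist=3):
    """True if read 1 matches the reverse complement of read 2
    with at most `max_dist` mismatches."""
    return hamming(barcode_read1, reverse_complement(barcode_read2)) <= max_dist


print(pair_agrees("GTAGCC", "GGCTAC"))  # read 2 is the exact reverse complement
```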

The actual extraction of barcodes is triggered by the following command:
```
extractor.count_barcodes()
```

Users can access these uncorrected barcodes with the following command:
```
print(extractor.rawCounts)
```

This prints the barcodes and their frequencies as shown below:

```
TTGTAGGGGTGTGTTCTACCGGTAATT 2843
GTGCTGGTAATGTGGGCGACGGTGGGG 913
TTGGTGAAGCATAGTTCCGTGATTGAA 909
TTCCATGACGTTAAATACCTCCTTATA 723
ATCTGGCGTCCAGCAGATATTAGTTTT 717
...
AAGTTACATGCCGCAAAGGGTTCTTTG 1
AAGGATGAATGACAAGGTGCTAGCCAT 1
GGTACAAGGCGGGATTACCATGCATTG 1
GTGCTGGAAATGTGGGCGACGGTGGGG 1
AAGTCACATGCCGCAAAGTGTCCATTG 1
```

### 2. Error correction

The list obtained above may still contain barcodes that harbour sequencing errors. We assume that a barcode contains error(s) if it differs by fewer than three nucleotides from a barcode of higher abundance. Since the pairwise comparison of all barcodes can be computationally prohibitive, we use the approximate nearest-neighbour library 'nmslib' to identify similar barcodes efficiently. If an erroneous barcode is found, its frequency is added to that of its nearest matching barcode. This functionality is implemented in the `SeqCorrect` class. Users can obtain the corrected barcode list by running the following commands:

```
corrector = bartide.SeqCorrect()
corrector.run(extractor.rawCounts)
```
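To make the correction rule concrete, here is a minimal sketch that uses a brute-force pairwise comparison in place of nmslib. It is only practical for small inputs and is not how `SeqCorrect` is implemented; the function name `merge_similar` and the `max_diff` argument are hypothetical. The toy counts are taken from the example output shown earlier.

```python
def hamming(a, b):
    """Count mismatching positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))


def merge_similar(counts, max_diff=2):
    """Merge each barcode into the nearest higher-abundance barcode
    that differs by at most `max_diff` nucleotides.

    Brute force stands in for the approximate nearest-neighbour search.
    """
    corrected = {}
    # Visit barcodes from most to least abundant.
    for barcode, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        best = None
        for kept in corrected:
            d = hamming(barcode, kept)
            # Only a strictly more abundant barcode can absorb this one.
            if d <= max_diff and counts[kept] > n:
                if best is None or d < best[0]:
                    best = (d, kept)
        if best is None:
            corrected[barcode] = n
        else:
            corrected[best[1]] += n
    return corrected


raw = {
    "TTGTAGGGGTGTGTTCTACCGGTAATT": 2843,
    "GTGCTGGTAATGTGGGCGACGGTGGGG": 913,
    "GTGCTGGAAATGTGGGCGACGGTGGGG": 1,  # one mismatch from the barcode above
}
corrected = merge_similar(raw)
print(corrected)
```

Here the singleton barcode differs from the 913-count barcode at a single position, so its count is added to that barcode.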

The corrected list of barcodes is stored under `corrector.correctedCounts`. These barcodes can then be saved as a CSV table as shown below:

```
corrector.save_to_csv('barcodes_freq_sample1.csv')
```

By default, the `SeqCorrect` class removes any barcode with a frequency of less than 20, as suggested previously (Naik et al., 2014). This behaviour can be overridden by changing the value of the `min_counts` parameter when calling `SeqCorrect`.

### 3. Sample comparison, analysis and visualization

Once the corrected barcode frequencies are saved for all the samples, they can be compared using the `BarcodeAnalyzer` class. This class is initialized with the full path of the directory in which all the CSV files were saved. The following command illustrates this:

```
analyzer = bartide.BarcodeAnalyzer('barcodes_dir')
```

This aggregates the barcodes across all the samples into a single table stored under `analyzer.barcodes`. Users can perform any custom downstream analysis using this table as the starting point.

Bartide provides four essential plots that allow users to easily identify how the barcodes are shared across the samples.

The first plot is the 'Upset' plot, which shows all the combinations of samples and the number of barcodes that overlap between them. Compared to Venn diagrams, this plot allows easy visualization of the overlaps and non-overlaps of barcodes. Please note that upset plots only consider the number of unique barcodes found in each of the samples and not their frequencies.

```
analyzer.plot_upset()
```
<img src="./notebooks/images/upset_plot.png" alt="upset_plot" width="400"/>
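The quantity an upset plot displays can be sketched in a few lines. The sketch below assumes the common convention that each barcode is counted once, under the exact combination of samples that contain it; whether `plot_upset` uses exactly this convention is an assumption. The function name `combination_counts` and the toy sample names are hypothetical.

```python
from collections import Counter


def combination_counts(samples):
    """Map each combination of samples to the number of barcodes
    found in exactly that combination.

    `samples` maps a sample name to its set of unique barcodes.
    """
    membership = {}
    for name, barcodes in samples.items():
        for barcode in barcodes:
            membership.setdefault(barcode, set()).add(name)
    return Counter(frozenset(names) for names in membership.values())


samples = {
    "s1": {"AAA", "CCC", "GGG"},
    "s2": {"CCC", "GGG", "TTT"},
    "s3": {"GGG"},
}
combos = combination_counts(samples)
for names, n in combos.items():
    print(sorted(names), n)
```

Frequencies play no role here: only the presence or absence of each barcode in each sample is used.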

Sometimes, due to large differences in the number of barcodes captured, it can be difficult to identify the similarities or differences between the samples. To address this, rather than using the absolute number of barcodes in a sample, the percentage overlap of barcodes from a sample with all other samples is used. This allows the barcodes from a sample to be expressed as proportions and may provide insights into sample similarity that are otherwise difficult to identify with absolute frequencies. The following command shows the proportions in the form of a stacked barplot:

```
analyzer.plot_stacked()
```
<img src="./notebooks/images/stacked_plot.png" alt="stacked_plot" width="300"/>
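One plausible reading of the percentage overlap is sketched below: for each sample, the share of its unique barcodes that is also found in each other sample. This interpretation, the function name `overlap_percentages` and the toy data are assumptions made for illustration; `plot_stacked` may compute its proportions differently.

```python
def overlap_percentages(samples):
    """For each sample, the percentage of its barcodes shared with
    each other sample.

    `samples` maps a sample name to its set of unique barcodes.
    """
    result = {}
    for name, barcodes in samples.items():
        result[name] = {}
        for other, other_barcodes in samples.items():
            if other == name:
                continue
            shared = len(barcodes & other_barcodes)
            # Guard against empty samples to avoid division by zero.
            result[name][other] = 100.0 * shared / len(barcodes) if barcodes else 0.0
    return result


samples = {
    "s1": {"AAA", "CCC", "GGG", "TTT"},
    "s2": {"AAA", "CCC"},
}
percentages = overlap_percentages(samples)
print(percentages)
```

Note that the measure is asymmetric: all of the barcodes in `s2` occur in `s1`, but only half of the barcodes in `s1` occur in `s2`.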

An alternative way to deal with samples whose absolute numbers of unique barcodes are quite different is to normalize the overlap value by dividing it by the sum of the total barcodes from the two samples. The resulting normalized values can be visualized in the form of a heatmap.

```
analyzer.plot_overlap_heatmap()
```
<img src="./notebooks/images/overlap_heatmap.png" alt="overlap_heatmap" width="300"/>
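The normalization described above can be written out directly. In this sketch the overlap is the number of barcodes shared by two samples and the denominator is the sum of their unique barcode counts, following the description; the function name `normalized_overlap` is hypothetical and the exact computation inside `plot_overlap_heatmap` is not confirmed here.

```python
def normalized_overlap(barcodes_a, barcodes_b):
    """Shared barcodes divided by the summed barcode counts of both samples."""
    total = len(barcodes_a) + len(barcodes_b)
    if total == 0:
        return 0.0
    return len(barcodes_a & barcodes_b) / total


a = {"AAA", "CCC", "GGG"}
b = {"CCC", "GGG", "TTT", "ACG", "TGA"}
value = normalized_overlap(a, b)  # 2 shared barcodes / (3 + 5) total
print(value)
```

With this definition, two identical samples score 0.5 rather than 1, since every shared barcode is counted twice in the denominator.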

None of the three plotting functions above uses the frequencies of the barcodes, which indicate how dominant a particular barcode is in the samples. A weighted overlap of barcodes between two samples is calculated as follows:

$$\sum_{b}^{B}\left(S_b^i-S_b^j\right)^2$$

Here, $S$ is a column-sum normalized matrix of barcode frequencies with samples as columns and barcodes as rows, $i$ and $j$ are two samples, and $b$ is a barcode in the set of barcodes $B$ that are present in either or both of the two samples. These overlap values are then plotted in the form of a heatmap using the following function:

```
analyzer.plot_weighted_heatmap()
```
<img src="./notebooks/images/weighted_heatmap.png" alt="weighted_heatmap" width="300"/>
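The formula above can be worked through on a toy example. The sketch below represents each sample as a dictionary of barcode counts instead of a matrix column; the function names `column_sum_normalize` and `weighted_overlap` are hypothetical and the code is not taken from Bartide.

```python
def column_sum_normalize(counts):
    """Scale barcode counts of one sample so that they sum to 1."""
    total = sum(counts.values())
    return {barcode: n / total for barcode, n in counts.items()}


def weighted_overlap(counts_i, counts_j):
    """Sum of squared differences of normalized frequencies over all
    barcodes present in either sample."""
    s_i = column_sum_normalize(counts_i)
    s_j = column_sum_normalize(counts_j)
    barcodes = set(s_i) | set(s_j)
    # A barcode missing from a sample has frequency 0 in that sample.
    return sum((s_i.get(b, 0.0) - s_j.get(b, 0.0)) ** 2 for b in barcodes)


sample_i = {"AAA": 3, "CCC": 1}            # normalized: 0.75, 0.25
sample_j = {"AAA": 1, "CCC": 1, "GGG": 2}  # normalized: 0.25, 0.25, 0.5
score = weighted_overlap(sample_i, sample_j)
print(score)  # 0.5**2 + 0**2 + 0.5**2
```

As written, the formula yields 0 for two samples with identical normalized frequencies, so smaller values correspond to more similar samples.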

To save the images generated by the functions above, users can pass a value (path and name of the file where to save) to the `save_name` parameter.
