# HiSTra
### Installation
Dependencies
```shell
conda create -n HiSTra python=3.8
conda activate HiSTra
conda install numpy scipy pandas=1.3.5 matplotlib seaborn h5py
conda install -c conda-forge -c bioconda cooler=0.8.11
pip install matplotlib-venn
```
Linux OS
```shell
pip install HiSTra
```
### Preparation
Download [juicer_tool](https://github.com/aidenlab/juicer/wiki/Juicer-Tools-Quick-Start) and [deDoc](https://github.com/yinxc/structural-information-minimisation). Because both tools have since been updated, **we recommend that you download them from this repo.** You can find the relevant jar files in HiSTra/juice and HiSTra/deDoc, respectively.
Make sure the chromosome.sizes file is exactly the file used to generate the test (and control) sample (.hic or .mcool); otherwise an error will occur at an early step. Make sure that no chromosome name contains an underscore ('\_').
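The underscore rule above can be checked before running HiST. The following is a minimal sketch, not part of HiSTra: it assumes the sizes file uses the common two-column layout (chromosome name, then size, separated by whitespace), and the helper names `parse_chrom_sizes` and `names_with_underscores` are hypothetical.

```python
def parse_chrom_sizes(text):
    """Parse chrom.sizes content into a list of (name, size) tuples.

    Assumption: each non-empty line is "<name><whitespace><size>".
    """
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        name, size = line.split()[:2]
        entries.append((name, int(size)))
    return entries


def names_with_underscores(entries):
    """Return the chromosome names that contain an underscore."""
    return [name for name, _ in entries if "_" in name]


if __name__ == "__main__":
    # Illustrative content; in practice read your own sizes file, e.g.
    # text = open("sizes/chrom_hg19.sizes").read()
    text = "chr1\t249250621\nchr1_gl000191_random\t106433\n"
    bad = names_with_underscores(parse_chrom_sizes(text))
    if bad:
        print("Names containing underscores:", ", ".join(bad))
```

If the check reports any names, remove or rename those entries consistently in both the sizes file and the input data before running HiST.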
### Directory tree
For bulk HiC data, a recommended work directory looks like:
```shell
mkdir work_dir
cd work_dir
mkdir hic_input
# Then move the corresponding hic files here.
mkdir TL_output
ln -s deDoc_dir_path .
ln -s juice_dir_path .
```
The directory tree is:
```shell
├── deDoc
│ ├── deDoc.jar
├── hic_input
│ ├── Control_GSE63525_IMR90_combined_30.hic
│ ├── Test_GSE63525_K562_combined_30.hic
│ ├── Control_GSE63525_IMR90_combined_30.mcool
│ └── Test_GSE63525_K562_combined_30.mcool
├── juice
│ ├── juicer_tools_2.09.00.jar
└── TL_output
```
For scHiC data, a recommended work directory looks like:
```shell
├── deDoc
│ ├── deDoc.jar
├── hic_input
│ ├── Control_cells_dir
│ │ ├── cell_1
... │ │ ├── raw
│ │ │ ├── 100000
│ │ │ │ └── *.matrix
│ │ │ └── 500000
│ │ │ └── *.matrix
│ │ └── iced
... ... ├── ...
│ ├── Test_cells_dir
... ...
└── TL_output
```
**Note: For scHiC, the subdirectory structure of hic_input MUST be cells_dir/normalization/resolution/*.matrix!**
Here, normalization can be "raw", "iced", or any other string you used in the path (similar to the output format of HiC-Pro); the default is "raw". The resolution depends on the genome size, e.g. for hg19 the resolutions should be 100000 and 500000.
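To confirm that a cells directory follows this layout, the matrix files can be grouped by the two directory levels directly above them. This is a rough sketch under the assumption that the last two parent directories of each `*.matrix` file are the normalization and the resolution; `group_matrix_paths` and `scan_cells_dir` are hypothetical helper names, not HiSTra functions.

```python
from collections import defaultdict
from pathlib import Path, PurePosixPath


def group_matrix_paths(paths):
    """Group *.matrix paths by (normalization, resolution).

    Assumption: a path ends in .../<normalization>/<resolution>/<file>.matrix
    """
    groups = defaultdict(list)
    for p in map(PurePosixPath, paths):
        if p.suffix != ".matrix":
            continue  # ignore anything that is not a matrix file
        groups[(p.parent.parent.name, p.parent.name)].append(p.name)
    return dict(groups)


def scan_cells_dir(cells_dir):
    """Collect all *.matrix files below cells_dir and group them."""
    root = Path(cells_dir)
    found = (f.relative_to(root).as_posix() for f in root.rglob("*.matrix"))
    return group_matrix_paths(found)


if __name__ == "__main__":
    # Illustrative paths mirroring the tree above.
    demo = [
        "cell_1/raw/100000/cell_1.matrix",
        "cell_1/raw/500000/cell_1.matrix",
    ]
    for (norm, res), files in group_matrix_paths(demo).items():
        print(norm, res, len(files))
```

For hg19 data you would expect to see both `100000` and `500000` listed under the normalization you intend to use.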
### Example
#### Samples
You can download a test case from GSE63525. The test sample [hicfile](https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE63525&format=file&file=GSE63525%5FK562%5Fcombined%5F30%2Ehic) is K562 and the control sample [hicfile](https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE63525&format=file&file=GSE63525%5FIMR90%5Fcombined%5F30%2Ehic) is IMR90.
You can also choose your own test and control samples.
#### Resolution
For human samples, the hic file should contain matrix data at 100k and 500k resolution. In general, the appropriate resolution can be calculated as follows:
```math
res_{unit} = 10^{len(max(chromosome_{size}))-4}.
```
For example, in hg.sizes the largest chromosome is **chr1 (249250621)**, whose size has 9 digits, so the suggested resolution unit is 10^5 = 100k. The lower resolution is defined as:
```math
5 \times res_{unit}.
```
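The two formulas above can be evaluated with a few lines of Python. This is only an illustrative sketch: it reads `len(...)` as the number of decimal digits of the largest chromosome size, and the function name `suggested_resolutions` is hypothetical.

```python
def suggested_resolutions(chrom_sizes):
    """Return (res_unit, 5 * res_unit) for a list of chromosome sizes.

    res_unit = 10 ** (number of digits of the largest size - 4)
    """
    largest = max(chrom_sizes)
    n_digits = len(str(largest))
    res_unit = 10 ** (n_digits - 4)
    return res_unit, 5 * res_unit


if __name__ == "__main__":
    # chr1 size taken from the example above; the second value is illustrative.
    print(suggested_resolutions([249250621, 59373566]))  # (100000, 500000)
```

For a genome whose largest chromosome has 8 digits, the same rule gives 10k and 50k.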
#### Command
For bulk HiC data,
```shell
# Assuming you are in work_dir, a standard command for hic format files is:
HiST -t hic_input/Test_GSE63525_K562_combined_30.hic \
-c hic_input/Control_GSE63525_IMR90_combined_30.hic \
-o TL_output/ \
-d deDoc/deDoc.jar \
-j juice/juicer_tools_2.09.00.jar \
-s sizes/chrom_hg19.sizes
# or mcool format file
HiST -t hic_input/Test_GSE63525_K562_combined_30.mcool \
-c hic_input/Control_GSE63525_IMR90_combined_30.mcool \
-o TL_output/ \
-d deDoc/deDoc.jar \
-s sizes/chrom_hg19.sizes
# or mixed format file
HiST -t hic_input/Test_GSE63525_K562_combined_30.mcool \
-c hic_input/Control_GSE63525_IMR90_combined_30.hic \
-o TL_output/ \
-d deDoc/deDoc.jar \
-j juice/juicer_tools_2.09.00.jar \
-s sizes/chrom_hg19.sizes
# Then you can find the result in folder TL_output/SV_result.
```
For scHiC data,
```shell
HiST -t hic_input/Test_cells_dir/ \
-c hic_input/Control_cells_dir \
-s sizes/chrom_hg19.sizes \
-d deDoc/deDoc.jar \
-o TL_output/
```
#### Figure
An example TL result, shown as a heatmap: ![heatmap](./example_pic/0_Combine_chr1_chr7.png)
### FAQ
##### If you encounter "Resource temporarily unavailable", "error: too many open files", "ValueError: cannot convert float NaN to integer", or "EmptyDataError: No columns to parse from file"
If your workstation has more than 128 GB of memory and more than 48 threads, you can try the following:
1. Run the command ```ulimit -u 381152```. Here 381152 can be replaced by any large number.
2. Rerun the HiST command a few more times.
These errors usually occur when the input data is in mcool format. We will fix them in the next version.
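Before changing anything, it can help to see which limits currently apply to your session. The sketch below uses Python's standard `resource` module (Unix only) to print the limit behind `ulimit -u` and, as a possibly related value, the open-file limit shown by `ulimit -n`. The function name `current_limits` is hypothetical, and whether the open-file limit matters for HiSTra is an assumption based only on the "too many open files" message.

```python
import resource


def current_limits():
    """Return {label: (soft, hard)} for two per-user resource limits."""
    wanted = {
        "max user processes (ulimit -u)": resource.RLIMIT_NPROC,
        "max open files (ulimit -n)": resource.RLIMIT_NOFILE,
    }
    return {label: resource.getrlimit(res) for label, res in wanted.items()}


if __name__ == "__main__":
    for label, (soft, hard) in current_limits().items():
        # resource.RLIM_INFINITY means "unlimited"
        print(f"{label}: soft={soft}, hard={hard}")
```

The soft limit is the one that `ulimit` changes for the current shell; it cannot be raised above the hard limit without administrator rights.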
##### If you encounter "No such file or directory"
1. Check the matrix_from_hic directory. If its sub-directories contain no files, check juicer_log for errors, or check the cooler package.
2. If juicer_log reports something like "invalid chromosome chr12", check the sizes file. A common problem is a mismatch between "chr1" and "1" naming.
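The "chr1" versus "1" mismatch can be spotted by comparing the naming style of the sizes file with the chromosome names in your sample. This is a small sketch with hypothetical helper names (`naming_style`, `styles_match`); how you obtain the sample's chromosome names depends on your tools and is not shown here.

```python
def naming_style(names):
    """Classify chromosome names as 'chr-prefixed', 'unprefixed', 'mixed' or 'empty'."""
    if not names:
        return "empty"
    prefixed = [name.startswith("chr") for name in names]
    if all(prefixed):
        return "chr-prefixed"
    if not any(prefixed):
        return "unprefixed"
    return "mixed"


def styles_match(sizes_names, sample_names):
    """True when both name lists use the same, unambiguous naming style."""
    style = naming_style(sizes_names)
    return style in ("chr-prefixed", "unprefixed") and style == naming_style(sample_names)


if __name__ == "__main__":
    sizes_names = ["chr1", "chr2", "chrX"]   # illustrative: names from the sizes file
    sample_names = ["1", "2", "X"]           # illustrative: names found in the sample
    if not styles_match(sizes_names, sample_names):
        print("Naming mismatch:", naming_style(sizes_names), "vs", naming_style(sample_names))
```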