# Indo Scraper 🇮🇩
A Python library for scraping Indonesian websites easily and safely. Designed specifically for Indonesian domains such as `.id`, `.co.id`, `.go.id`, `.sch.id`, and other Indonesian domains.
## 🚀 Key Features

- ✅ **Easy to use** - A single line of code is enough to scrape
- ✅ **Indonesian domains** - Optimized for Indonesian websites
- ✅ **Automatic extraction** - Emails, phone numbers, addresses, and other contact details
- ✅ **Multiple pages** - Scrape several pages in one call
- ✅ **JSON export** - Save results to a JSON file
- ✅ **Command line** - Usable straight from the terminal
- ✅ **Rate limiting** - Respects servers with automatic delays
## 📦 Installation
```bash
pip install indo-scraper
```
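To verify the installation, importing the package should succeed. A quick sanity check (the CLI's documented `--version` flag works too):

```python
# Quick sanity check after installation
from indo_scraper import IndoScraper

scraper = IndoScraper()  # should construct without errors
print("indo-scraper is ready to use")
```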
## 🔧 Usage

### 1. Basic Usage
```python
from indo_scraper import IndoScraper
# Create a scraper instance
scraper = IndoScraper()

# Scrape a website
hasil = scraper.scrape("https://www.smkn5bandung.sch.id/")

# Inspect the results
print(f"Title: {hasil['title']}")
print(f"Emails found: {hasil['contact_info']['emails']}")
print(f"Phones: {hasil['contact_info']['phones']}")
```
### 2. Scraping with Full Options
```python
from indo_scraper import IndoScraper
scraper = IndoScraper(delay=2.0, timeout=60)
hasil = scraper.scrape(
    url="https://www.kemendikbud.go.id/",
    extract_links=True,    # Extract all links
    extract_images=True,   # Extract all images
    extract_contact=True,  # Extract contact information
    max_pages=3            # Scrape at most 3 pages
)

# Save the results to JSON
scraper.save_to_json(hasil, "hasil_scraping.json")
```
### 3. Scraping Multiple Websites
```python
from indo_scraper import IndoScraper
scraper = IndoScraper()
urls = [
    "https://www.smkn5bandung.sch.id/",
    "https://www.ui.ac.id/",
    "https://www.detik.com/"
]
semua_hasil = scraper.scrape_multiple(urls, max_pages=2)
for hasil in semua_hasil:
    print(f"Website: {hasil['domain']}")
    print(f"Status: {hasil['status']}")
    print("-" * 40)
```
### 4. Using the Command Line
```bash
# Basic scraping
indo-scraper https://www.smkn5bandung.sch.id/

# Scraping with options
indo-scraper https://www.kemendikbud.go.id/ --max-pages 3 --output hasil.json --delay 2

# Show help
indo-scraper --help
```
## 📋 Scraping Result Format
```python
{
    "url": "https://www.smkn5bandung.sch.id/",
    "domain": "www.smkn5bandung.sch.id",
    "title": "SMK Negeri 5 Bandung",
    "description": "Website resmi SMK Negeri 5 Bandung",
    "content": "Konten lengkap website...",
    "links": ["https://...", "https://..."],
    "images": ["https://img1.jpg", "https://img2.png"],
    "contact_info": {
        "emails": ["info@smkn5bandung.sch.id"],
        "phones": ["(022) 1234567", "0812-3456-7890"],
        "addresses": ["Jl. Veteran No. 1, Bandung"]
    },
    "metadata": {
        "author": "...",
        "keywords": "...",
        "og:title": "..."
    },
    "status": "success",
    "scraped_pages": 1,
    "timestamp": "2024-12-07 10:30:00"
}
```
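Because `save_to_json` writes a plain JSON file, saved results can be read back with the standard `json` module. A minimal sketch, assuming the file was produced as in the examples above:

```python
import json

# Load a previously saved scraping result
with open("hasil_scraping.json", encoding="utf-8") as f:
    hasil = json.load(f)

# The documented keys are ordinary dict entries
print(hasil["title"])
for email in hasil["contact_info"]["emails"]:
    print("Contact:", email)
```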
## 🛠️ Configuration Options

### Scraper Initialization
```python
scraper = IndoScraper(
    delay=1.0,   # Delay between requests (seconds)
    timeout=30   # Request timeout (seconds)
)
```
### Scraping Parameters
```python
scraper.scrape(
    url="https://website.co.id/",
    extract_links=True,    # Extract links (default: True)
    extract_images=True,   # Extract images (default: True)
    extract_contact=True,  # Extract contact info (default: True)
    max_pages=1            # Max pages (default: 1)
)
```
## 🎯 Supported Domains

This library is optimized for Indonesian domains (an illustrative check follows the list below):
- `.id` - Indonesia
- `.co.id` - Commercial
- `.or.id` - Organizations
- `.ac.id` - Academic
- `.sch.id` - Schools
- `.net.id` - Network providers
- `.web.id` - General web
- `.my.id` - Personal
- `.go.id` - Government
- `.mil.id` - Military
- `.desa.id` - Villages
- `.ponpes.id` - Islamic boarding schools
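For illustration, a check along these lines can be built from the standard library. This is only a hedged sketch of the idea using the suffixes above, not the library's actual `validate_indonesian_domain` implementation:

```python
from urllib.parse import urlparse

# Illustrative list based on the supported domains above
INDONESIAN_SUFFIXES = (
    ".co.id", ".or.id", ".ac.id", ".sch.id", ".net.id", ".web.id",
    ".my.id", ".go.id", ".mil.id", ".desa.id", ".ponpes.id", ".id",
)

def looks_indonesian(url: str) -> bool:
    """Return True if the URL's hostname ends with an Indonesian suffix."""
    host = urlparse(url).hostname or ""
    return host.endswith(INDONESIAN_SUFFIXES)

print(looks_indonesian("https://www.ui.ac.id/"))     # True
print(looks_indonesian("https://www.example.com/"))  # False
```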
## 🔍 Complete Usage Examples

### Scraping a School Website
```python
from indo_scraper import IndoScraper
from indo_scraper.utils import format_scraped_data
# Create a scraper with a 2-second delay
scraper = IndoScraper(delay=2.0)

# Scrape a school website
print("🔄 Starting scrape...")
hasil = scraper.scrape(
    url="https://www.smkn5bandung.sch.id/",
    max_pages=2
)

# Format and display the results
print(format_scraped_data(hasil))

# Save to file
scraper.save_to_json(hasil, "data_sekolah.json")
print("✅ Data saved successfully!")
```
### Scraping a Government Website
```python
from indo_scraper import IndoScraper
scraper = IndoScraper()
# Scrape the Kemendikbud website
hasil = scraper.scrape("https://www.kemendikbud.go.id/")

if hasil['status'] == 'success':
    print(f"📊 Successfully scraped {hasil['domain']}")
    print(f"📧 Emails found: {len(hasil['contact_info']['emails'])}")
    print(f"📞 Phones found: {len(hasil['contact_info']['phones'])}")
    print(f"🔗 Links found: {len(hasil['links'])}")
else:
    print(f"❌ Scraping failed: {hasil.get('error', 'Unknown error')}")
```
### Batch Scraping Multiple Websites
```python
from indo_scraper import IndoScraper
import time
scraper = IndoScraper(delay=3.0)  # 3-second delay to be kind to servers

# List of Indonesian websites
websites = [
    "https://www.ui.ac.id/",
    "https://www.itb.ac.id/",
    "https://www.ugm.ac.id/",
    "https://www.unpad.ac.id/"
]

print("🚀 Starting batch scrape...")
start_time = time.time()

all_results = []
for i, url in enumerate(websites, 1):
    print(f"📥 Scraping {i}/{len(websites)}: {url}")

    hasil = scraper.scrape(url, max_pages=1)
    all_results.append(hasil)

    # Progress info
    if hasil['status'] == 'success':
        print(f"  ✅ Success - {hasil['title'][:50]}...")
    else:
        print(f"  ❌ Failed - {hasil.get('error', 'Unknown')}")

# Summary
elapsed = time.time() - start_time
success_count = sum(1 for r in all_results if r['status'] == 'success')

print(f"\n📊 BATCH SCRAPING SUMMARY")
print(f"Total websites: {len(websites)}")
print(f"Succeeded: {success_count}")
print(f"Failed: {len(websites) - success_count}")
print(f"Total time: {elapsed:.2f} seconds")

# Save all results
scraper.save_to_json(all_results, "batch_scraping_results.json")
```
## 🚨 Usage Best Practices

### 1. Respecting Servers
```python
# Use a reasonable delay (at least 1 second)
scraper = IndoScraper(delay=2.0)

# Don't scrape too many pages at once
hasil = scraper.scrape(url, max_pages=5)  # At most 5 pages
```
### 2. Error Handling
```python
from indo_scraper import IndoScraper
scraper = IndoScraper()
try:
    hasil = scraper.scrape("https://website.co.id/")

    if hasil['status'] == 'success':
        print("✅ Scraping succeeded!")
        # Process the data...
    else:
        print(f"❌ Scraping failed: {hasil.get('error')}")

except Exception as e:
    print(f"💥 Unexpected error: {str(e)}")
```
### 3. Domain Validation
```python
from indo_scraper import IndoScraper
from indo_scraper.utils import validate_indonesian_domain

scraper = IndoScraper()
url = "https://www.example.com/"

if validate_indonesian_domain(url):
    print("✅ Indonesian domain detected")
    hasil = scraper.scrape(url)
else:
    print("⚠️ Not an Indonesian domain")
    # Scraping still works, but is not optimized for Indonesia
    hasil = scraper.scrape(url)
```
## 📝 Command Line Interface

### Basic Commands
```bash
# Simple scraping
indo-scraper https://www.smkn5bandung.sch.id/

# With an output file
indo-scraper https://www.ui.ac.id/ --output universitas.json

# Multiple pages with a delay
indo-scraper https://www.detik.com/ --max-pages 3 --delay 2.5

# Without extracting images and links
indo-scraper https://www.kemendikbud.go.id/ --no-images --no-links
```
### Full Command-Line Options
```bash
indo-scraper [URL] [OPTIONS]

Options:
  --max-pages N    Maximum number of pages to scrape (default: 1)
  --delay N        Delay between requests in seconds (default: 1.0)
  --timeout N      Request timeout in seconds (default: 30)
  --output FILE    JSON output file (optional)
  --no-links       Skip link extraction
  --no-images      Skip image extraction
  --no-contact     Skip contact extraction
  --version        Show the version
  --help           Show help
```
## 🛡️ Safety and Ethics

### Usage Guidelines

1. **Respect robots.txt** - Always check the website's robots.txt file (see the sketch after this list)
2. **Use a reasonable delay** - At least 1-2 seconds between requests
3. **Don't overload servers** - Limit the number of pages you scrape
4. **Follow the Terms of Service** - Read and comply with each website's ToS
5. **Personal data** - Handle scraped personal data with care
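For guideline 1, Python's standard library can fetch and evaluate robots.txt before you scrape. A minimal sketch; the `user_agent` choice is an assumption, not something indo-scraper sets for you:

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

def allowed_by_robots(url: str, user_agent: str = "*") -> bool:
    """Check whether robots.txt permits fetching the given URL."""
    parts = urlparse(url)
    rp = RobotFileParser()
    rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    rp.read()  # fetches and parses robots.txt
    return rp.can_fetch(user_agent, url)

if allowed_by_robots("https://www.kemendikbud.go.id/"):
    print("✅ robots.txt allows this URL")
```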
### Example of Responsible Use
```python
from indo_scraper import IndoScraper
# Server-friendly configuration
scraper = IndoScraper(
    delay=2.0,   # 2-second delay
    timeout=30   # Reasonable timeout
)

# Scrape with limits
hasil = scraper.scrape(
    url="https://website.co.id/",
    max_pages=3  # No more than 3 pages
)

# Check for sensitive information
if hasil['contact_info']['emails']:
    print("⚠️ Emails found - use them responsibly")
```
## 🐛 Troubleshooting

### Common Issues

**1. "Connection timeout" error**
```python
# Increase the timeout
scraper = IndoScraper(timeout=60)
```
**2. "Too many requests" error**
```python
# Increase the delay
scraper = IndoScraper(delay=5.0)
```
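If rate-limit errors persist, a simple retry loop with growing back-off can help. A sketch under the assumption that failures are reported via the documented `status` field:

```python
import time
from indo_scraper import IndoScraper

scraper = IndoScraper(delay=5.0)

def scrape_with_retry(url, attempts=3, backoff=10.0):
    """Retry a failed scrape, sleeping longer after each failure."""
    hasil = None
    for attempt in range(1, attempts + 1):
        hasil = scraper.scrape(url)
        if hasil['status'] == 'success':
            break
        time.sleep(backoff * attempt)  # wait longer each round
    return hasil

hasil = scrape_with_retry("https://website.co.id/")
```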
**3. Website cannot be accessed**
```python
# Check the error status
hasil = scraper.scrape(url)
if hasil['status'] == 'error':
    print(f"Error: {hasil['error']}")
```
**4. Incomplete data**
```python
# Scrape with all options enabled
hasil = scraper.scrape(
    url=url,
    extract_links=True,
    extract_images=True,
    extract_contact=True,
    max_pages=2
)
```
## 📄 License

MIT License - free to use and modify.
## 🤝 Contributing

Contributions are very welcome! Please:

1. Fork this repository
2. Create a branch for your feature
3. Commit your changes
4. Push the branch
5. Open a Pull Request
## 📞 Support

- 🐛 **Bug reports**: [GitHub Issues](https://github.com/adepratama840/indo-scraper/issues)
- 💡 **Feature requests**: [GitHub Issues](https://github.com/adepratama840/indo-scraper/issues)
- 📚 **Documentation**: [GitHub Wiki](https://github.com/adepratama840/indo-scraper/wiki)
- 📧 **Email**: adepratama20071907@gmail.com
## 🔄 Changelog

### v1.0.0
- ✨ First release
- 🚀 Basic scraping for Indonesian websites
- 📧 Automatic extraction of emails, phone numbers, and addresses
- 💾 JSON export
- 🖥️ Command line interface
- 📱 Support for all Indonesian domains
---
**Made with ❤️ for the Indonesian developer community**