Scraper crawling

blad3runn3r · 26.08.2021

Any suggestions on how to build a multi-site scraper and crawler in Python 3 that works on both the surface web and onion sites?
I'm currently following https://www.edureka.co/blog/web-scraping-with-python/

Gagarin61 · 30.08.2021

If you want to work with Tor sites, use a Tor proxy.

Code:

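import requests  # SOCKS support needs the extra: pip install requests[socks]

# Send both HTTP and HTTPS traffic through the local Tor SOCKS proxy (port 9050).
session = requests.Session()
session.proxies = {
    'http': 'socks5://127.0.0.1:9050',
    'https': 'socks5://127.0.0.1:9050',
}

# httpbin echoes the caller's IP, which should be a Tor exit node rather than your own.
r = session.get('http://httpbin.org/ip')
print(r.text)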

escrow · 30.08.2021

If you only need HTTP/1.1, requests is enough; otherwise use HTTPX, which supports HTTP/2.

Requests: HTTP for Humans™ — Requests 2.27.1 documentation

HTTPX: A next-generation HTTP client for Python. (www.python-httpx.org)

FrenchWineIsLife · 02.09.2021

The easiest way for you to do this would be to scrape all the data via Tor.
You can point a classic crawler at a Tor SOCKS5 proxy running in Docker, like this one: https://hub.docker.com/r/peterdavehello/tor-socks-proxy/
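
Before pointing a crawler at the proxy, it can help to confirm that the container is actually listening. A minimal sketch using only the standard library; the function name `socks_port_open` is made up for illustration, and the default port 9150 is an assumption about how the container's port was published (use whatever port you mapped):

```python
import socket


def socks_port_open(host="127.0.0.1", port=9150, timeout=3.0):
    """Return True if something accepts TCP connections on host:port.

    This only checks that the port is open. It does not verify that the
    listener is really a Tor SOCKS proxy.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False


if __name__ == "__main__":
    # Assumption: the proxy was published on 127.0.0.1:9150.
    print("Proxy port reachable:", socks_port_open())
```

If this prints False, check the container's port mapping before debugging the scraper itself.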

Guron_18 · 02.09.2021

Gagarin61 said:
If you want to work with Tor sites, use a Tor proxy.
Code:
import requests

session = requests.session()
session.proxies = {
    'http':'socks5://127.0.0.1:9050',
    'https': 'socks5://127.0.0.1:9050'
    }


r = session.get('http://httpbin.org/ip')
print(r.text)

Advanced Usage — Requests 2.26.0 documentation

Using the scheme socks5 causes the DNS resolution to happen on the client, rather than on the proxy server. This is in line with curl, which uses the scheme to decide whether to do the DNS resolution on the client or proxy. If you want to resolve the domains on the proxy server, use socks5h as the scheme.

Python:

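import requests

# socks5h makes the proxy (Tor) resolve hostnames, which .onion addresses require.
# 9150 is the Tor Browser SOCKS port; a standalone tor daemon listens on 9050 by default.
proxies = {
    'http': 'socks5h://127.0.0.1:9150',
    'https': 'socks5h://127.0.0.1:9150',
}
url = 'http://httpbin.org/ip'  # placeholder URL, same test endpoint as the quoted post
response = requests.get(url, proxies=proxies)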
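
For a scraper that handles both surface and onion URLs, one option is to choose the proxy per URL. The sketch below is only an illustration, not something from the Requests docs: `is_onion`, `proxies_for`, and `TOR_PROXY` are hypothetical names, and the proxy address assumes a local tor daemon on port 9050.

```python
from urllib.parse import urlsplit

# Assumption: a local tor daemon on 9050 (Tor Browser would be 9150).
# socks5h is used so hostnames are resolved by the proxy, not locally.
TOR_PROXY = "socks5h://127.0.0.1:9050"


def is_onion(url):
    """Return True if the URL's hostname ends in .onion."""
    host = urlsplit(url).hostname or ""
    return host.lower().endswith(".onion")


def proxies_for(url, tor_everything=False):
    """Return a requests-style proxies dict for this URL.

    Onion URLs always go through Tor. Clearnet URLs get an empty dict
    (no explicit proxy) unless tor_everything is True.
    """
    if tor_everything or is_onion(url):
        return {"http": TOR_PROXY, "https": TOR_PROXY}
    return {}
```

The result can be passed straight to the call shown above, for example `requests.get(url, proxies=proxies_for(url))`. Setting `tor_everything=True` matches the advice in this thread to route all scraping through Tor.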

GottaHackEmAll · 02.09.2021

As others have already mentioned, simply routing requests through Tor is relatively easy. But if your goal is to access clearnet content this way, be aware that many sites try to block scraping, or serve certain content only through complex JS to make scraping harder.

This is actually fairly easy to work around using Selenium. If that's relevant for you, let me know and I'd be happy to walk you through scraping sites with Selenium + Tor.

pompompurin · 18.09.2021

blad3runn3r said:

Any suggestions on how to build a multi-site scraper and crawler in Python 3 that works on both the surface web and onion sites?
I'm currently following https://www.edureka.co/blog/web-scraping-with-python/

Here is an example on GitHub of a hidden service scraper. I haven't tested it, so I'm not sure it works.

GitHub - dirtyfilthy/freshonions-torscraper: Fresh Onions is an open source TOR spider / hidden service onion crawler hosted at zlal32teyptf4tvi.onion
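
For a sense of what the core of such a crawler involves, here is a minimal link-discovery sketch using only the standard library. It is not taken from the linked project; `LinkCollector`, `extract_links`, and `split_by_network` are hypothetical names, and fetching the pages is left to whichever HTTP client you use.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class LinkCollector(HTMLParser):
    """Collect absolute http(s) links from <a href=...> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page URL.
                absolute = urljoin(self.base_url, value)
                if urlsplit(absolute).scheme in ("http", "https"):
                    # Drop the fragment so one page is not queued twice.
                    self.links.add(absolute.split("#", 1)[0])


def extract_links(html, base_url):
    """Return the set of absolute http(s) links found in the HTML."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    return parser.links


def split_by_network(links):
    """Split links into (onion, clearnet) sets by hostname suffix."""
    onion, clearnet = set(), set()
    for link in links:
        host = urlsplit(link).hostname or ""
        (onion if host.endswith(".onion") else clearnet).add(link)
    return onion, clearnet
```

A crawler would feed each fetched page through `extract_links`, add unseen links to its queue, and use the onion/clearnet split to decide how each one is fetched.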


rag3 · 23.09.2021

brilliant.

blad3runn3r · 23.09.2021

ty so much everyone

th3tr0ll · 24.09.2021

I'll take this opportunity to ask you all: is there a way to bypass JavaScript-based client verification without using Selenium?
The .js files would have to be rendered and executed by the requests module.

Thanks in advance

pompompurin · 25.09.2021

th3tr0ll said:

I'll take this opportunity to ask you all: is there a way to bypass JavaScript-based client verification without using Selenium?
The .js files would have to be rendered and executed by the requests module.

Thanks in advance

In theory, yes. But you would need to completely reverse-engineer what the JS code does, which could take a long time depending on the level of protection. I know people have reverse-engineered Cloudflare's challenges in the past, but I think the challenges have changed since then.

Chevy · 07.10.2021

really cool thread going to look into some of this
