I am brand new to Python and am not very good at it. I am trying to scrape a website called Transfermarkt (I'm a big football fan), but it's giving me HTTP Error 404 when I try to extract data. Here is my code:

from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

my_url = "https://www.transfermarkt.com/chelsea-fc/leihspielerhistorie/verein/631/plus/1?saison_id=2018&leihe=ist"

uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()

page_soup = soup(page_html, "html.parser")

for che in chelsea:
    player = che.tbody.tr.td.table.tbody.tr.td["spielprofil_tooltip tooltipstered"]
    print("player: " + player)

Error says:

Traceback (most recent call last):
  File "C:\Users\x15476582\Desktop\WebScrape.py", line 12, in <module>
    uClient = uReq(my_url)
  File "C:\Python36-32\lib\urllib\request.py", line 223, in urlopen
    return opener.open(url, data, timeout)
  File "C:\Python36-32\lib\urllib\request.py", line 532, in open
    response = meth(req, response)
  File "C:\Python36-32\lib\urllib\request.py", line 642, in http_response
    'http', request, response, code, msg, hdrs)
  File "C:\Python36-32\lib\urllib\request.py", line 570, in error
    return self._call_chain(*args)
  File "C:\Python36-32\lib\urllib\request.py", line 504, in _call_chain
    result = func(*args)
  File "C:\Python36-32\lib\urllib\request.py", line 650, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 404: Not Found

Any help would be greatly appreciated, thanks guys x

Solution

As Rup mentioned above, your user agent may have been rejected by the server.

Try augmenting your code with the following:

import urllib.request  # we are going to need to generate a Request object
from bs4 import BeautifulSoup as soup

my_url = "https://www.transfermarkt.com/chelsea-fc/leihspielerhistorie/verein/631/plus/1?saison_id=2018&leihe=ist"

# here we define the headers for the request
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.13; rv:63.0) Gecko/20100101 Firefox/63.0'}

# this request object will integrate your URL and the headers defined above
req = urllib.request.Request(url=my_url, headers=headers)

# calling urlopen this way will automatically handle closing the request
with urllib.request.urlopen(req) as response:
    page_html = response.read()
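If you want the script to report a rejected request instead of crashing with a traceback, one option is to wrap the same steps in small helper functions and catch `urllib.error.HTTPError`. This is only a sketch: the names `build_request`, `fetch_html` and `DEFAULT_USER_AGENT` and the 10 second timeout are my own choices for illustration, not anything the site or the library requires.

```python
import urllib.error
import urllib.request

# Browser-like User-Agent string, copied from the headers used above.
DEFAULT_USER_AGENT = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.13; rv:63.0) "
    "Gecko/20100101 Firefox/63.0"
)


def build_request(url, user_agent=DEFAULT_USER_AGENT):
    """Return a Request object that carries a custom User-Agent header."""
    return urllib.request.Request(url=url, headers={"User-Agent": user_agent})


def fetch_html(url, timeout=10):
    """Return the response body as bytes, or None if the server sends an HTTP error."""
    req = build_request(url)
    try:
        # The with-block closes the response for us.
        with urllib.request.urlopen(req, timeout=timeout) as response:
            return response.read()
    except urllib.error.HTTPError as err:
        # err.code is the status code (e.g. 404), err.reason the short message.
        print(f"Server returned HTTP {err.code} ({err.reason}) for {url}")
        return None
```

You would then call `page_html = fetch_html(my_url)` and check for `None` before parsing. If the request still fails with a different User-Agent, the printed status code at least tells you whether the server is refusing the request or the page really is missing.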

After the code above you can continue your analysis. The Python docs have some useful pages on this topic, and Mozilla's documentation lists plenty of user-agent strings to try.
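Once `page_html` holds the response body, the loop from the question can be replaced with a CSS selector lookup. The sketch below assumes the player links are `<a>` tags carrying the class `spielprofil_tooltip` (taken from the question's code). The `SAMPLE_HTML` snippet and the function name `extract_player_names` are made up for illustration, so check the real page markup before relying on the selector.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page; the actual structure may differ.
SAMPLE_HTML = """
<table class="items"><tbody>
<tr><td><a class="spielprofil_tooltip tooltipstered" href="#">Player One</a></td></tr>
<tr><td><a class="spielprofil_tooltip tooltipstered" href="#">Player Two</a></td></tr>
</tbody></table>
"""


def extract_player_names(page_html):
    """Return the text of every link with class 'spielprofil_tooltip', without duplicates."""
    page_soup = BeautifulSoup(page_html, "html.parser")
    names = []
    for link in page_soup.select("a.spielprofil_tooltip"):
        name = link.get_text(strip=True)
        # Skip empty links and names already collected.
        if name and name not in names:
            names.append(name)
    return names


for player in extract_player_names(SAMPLE_HTML):
    print("player: " + player)
```

Selecting by class avoids the long `tbody.tr.td.table...` chain, which breaks as soon as one level of nesting differs from what you expect.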
