
Fbook does it but I don't know about other sites.

7 comments

[–] [Deleted] 0 points (+0|-0)

Could you explain the robots.txt thing?

[–] E-werd 0 points (+0|-0)
[–] [Deleted] 0 points (+0|-0)

Hey thanks! I understand about half of it but get the gist. Is that how scrapers work? I have zero coding experience but wanted to try to create a scraper (I looked into learning how to write a Python- or Java-based scraper script), but it seems too complicated for a beginner. My interest is scraping high-quality images of old Hollywood film stars.

[–] E-werd 1 points (+1|-0)

Yeah, at a basic level all a scraper is doing is recording the content of a page (served as HTML, ultimately) and--here's the secret sauce, the method being what separates Google from Bing and DuckDuckGo--indexing it in a way that can be recalled when needed. Scrapers will usually follow hyperlinks and index those pages as well, recursively. robots.txt is a standards-based file that tells the scraper (mostly) what NOT to index. If you don't tell your scraper to respect the robots.txt file (if it exists), then it won't. There's a rough sketch of that loop below.
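To make that concrete, here's a minimal sketch of that fetch/"index"/follow loop in plain standard-library Python. The start URL, the depth limit, and the "indexing" step (just a print here) are placeholder assumptions for illustration, not a production crawler:

```python
from html.parser import HTMLParser
from urllib import robotparser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen

class LinkCollector(HTMLParser):
    """Collects href targets from <a> tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def allowed_by_robots(url, user_agent="*"):
    """Honor robots.txt if the site has one; allow everything otherwise."""
    root = "{0.scheme}://{0.netloc}".format(urlparse(url))
    rp = robotparser.RobotFileParser(root + "/robots.txt")
    try:
        rp.read()
    except OSError:
        return True  # robots.txt unreachable, nothing to obey
    return rp.can_fetch(user_agent, url)

def crawl(url, depth=1, seen=None):
    """Fetch a page, 'index' it (here: just print the URL), then recurse into its links."""
    seen = set() if seen is None else seen
    if depth < 0 or url in seen or not allowed_by_robots(url):
        return
    seen.add(url)
    try:
        html = urlopen(url).read().decode("utf-8", errors="replace")
    except OSError:
        return
    print("indexed:", url)  # a real scraper would store/index the page content here
    collector = LinkCollector()
    collector.feed(html)
    for link in collector.links:
        crawl(urljoin(url, link), depth - 1, seen)

if __name__ == "__main__":
    crawl("https://example.com", depth=1)  # hypothetical start URL
```

The `allowed_by_robots` check is the "respect robots.txt" part: leave it out and the crawler simply ignores the file, exactly as described above.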

Now, scraping images of old Hollywood film stars would take some guidance. You'd probably be looking for <img> tags in the HTML and feeding your scraper links to sites that contain them. How you determine which sites will contain them is the tricky part. Do you manually find the sites in question and turn the scraper loose on those, or do you do it more dynamically from search results on an engine like Google? And how do you determine the search terms in the first place: do you query IMDb/OMDb for actors active during a certain period and then use the results of that query? Or something else? (A rough sketch of the image-grabbing part is below.)
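Here's that image-grabbing piece as a rough standard-library sketch, assuming you already have a page URL worth scraping (the URL and output folder below are made up): it pulls the <img> src attributes out of one page and downloads the files.

```python
import os
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen, urlretrieve

class ImageCollector(HTMLParser):
    """Collects src targets from <img> tags."""
    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value:
                    self.sources.append(value)

def scrape_images(page_url, out_dir="images"):
    """Download every image referenced by <img> tags on one page."""
    os.makedirs(out_dir, exist_ok=True)
    html = urlopen(page_url).read().decode("utf-8", errors="replace")
    collector = ImageCollector()
    collector.feed(html)
    for i, src in enumerate(collector.sources):
        img_url = urljoin(page_url, src)              # resolve relative paths
        ext = os.path.splitext(img_url)[1] or ".jpg"  # crude guess at the file type
        urlretrieve(img_url, os.path.join(out_dir, "img_%d%s" % (i, ext)))

if __name__ == "__main__":
    scrape_images("https://example.com/old-hollywood-gallery")  # hypothetical page
```

The hard part the paragraph above points at (deciding *which* pages to feed in) isn't solved here; this only handles what happens once you have a page.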

Hope that was helpful and served as some food for thought.