I Need To Know : How to download an IMDb message board?

How to download an IMDb message board?

Hi there,
I want to save the subforums of some of my favorite movies, including every single page and thread in them, before they get permanently deleted. I experimented with the depth option in HTTrack, but the results are not very satisfying.

Has anyone tried this who can help with the settings?

Re: How to download an IMDb message board?

Regardless of what you might have read, unless the site admin modifies the robots.txt file, there's nothing we can do other than email them and request an SQL dump, and neither is likely to happen.

Read the file below. You can plainly see that almost all engines and crawlers are banned from accessing certain directories, and unless the IMDb administrator changes that file, no program or script you try to use to download the forum data will work. You'll grab the main page, but that's about it. See the actual file for more.

# robots.txt for IMDb properties
User-agent: *
Disallow: /board
Disallow: /boards

This means that ALL crawlers and scanners are prevented not only from downloading the entire board directory and all sub-directories below it, but from accessing them at all. Unless the admin edits this file to remove those lines, your sole recourse is to attempt to save each page by hand, but even that won't work properly.
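For anyone who wants to check how a crawler that honors robots.txt would read those two Disallow lines, here is a minimal sketch using Python's standard urllib.robotparser. The rules are pasted in from the excerpt above rather than fetched from the site, and the helper name is_allowed is just something made up for this example.

```python
from urllib.robotparser import RobotFileParser

# The excerpt quoted above, pasted in as-is (not fetched from the live site).
RULES = """\
# robots.txt for IMDb properties
User-agent: *
Disallow: /board
Disallow: /boards
"""


def is_allowed(url, user_agent="*"):
    """Return True if a robots.txt-obeying crawler may request this URL.

    is_allowed is a hypothetical helper name used only for this sketch.
    """
    parser = RobotFileParser()
    parser.parse(RULES.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in (
        "http://www.imdb.com/",
        "http://www.imdb.com/board/bd0000001/threads/?p=1",
        "http://www.imdb.com/boards/",
    ):
        print(url, is_allowed(url))
```

Rules are matched by path prefix, so anything whose path starts with /board is covered, thread index pages included.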

More here: http://www.imdb.com/board/bd0000001/nest/265772639?d=265864775#265864775
And here: http://www.imdb.com/board/bd0000001/nest/265784055?d=265829976#265829976

Re: How to download an IMDb message board?

Well, first you want to download every page of the thread index (e.g. /board/bd0000001/threads/?p=1 to /board/bd0000001/threads/?p=31). This is an iteration that can be achieved through a number of means. Then you want to visit every thread listed there and download all the pages of each thread.

It's tricky to do this efficiently. To get the maximum number of threads per page within a board, and likewise posts per page within a thread, you have to be logged in with the board preferences maxed out. That would be difficult to automate in a web browser, since the scripting tools do not provide a way to automate the "Save Page As" feature. What is left is command-line web clients (like wget, curl and lwp) or special-purpose programs, and the way to be logged in there is to copy your cookies over from the browser. There are more tricks involved, too, since the downloaded thread index pages have to be parsed before you can move on to downloading the threads.
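The steps above could be sketched roughly like this in Python, using only the standard library. Treat it as a starting point under a few assumptions, not a tested downloader: the index and thread URL shapes are taken from the links in this thread, the idea that index pages link to threads through ordinary anchor tags is a guess about the page markup, the cookie string is whatever you copy out of your logged-in browser, and all the function names are made up for the example. Paging inside a thread is left out because the URL scheme for that isn't shown here.

```python
import re
import urllib.request
from html.parser import HTMLParser

BASE = "http://www.imdb.com"


def index_urls(board_id, last_page):
    """Yield the thread-index URLs for pages 1..last_page of one board."""
    for page in range(1, last_page + 1):
        yield f"{BASE}/board/{board_id}/threads/?p={page}"


class ThreadLinkParser(HTMLParser):
    """Collect thread ids from links shaped like /board/<board>/nest/<id>.

    Assumes threads are linked with plain <a href="..."> tags.
    """

    def __init__(self, board_id):
        super().__init__()
        self.pattern = re.compile(rf"/board/{re.escape(board_id)}/nest/(\d+)")
        self.thread_ids = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        match = self.pattern.search(href)
        # Keep first-seen order and skip repeated links to the same thread.
        if match and match.group(1) not in self.thread_ids:
            self.thread_ids.append(match.group(1))


def extract_thread_ids(html, board_id):
    """Parse one downloaded index page and return the thread ids it links to."""
    parser = ThreadLinkParser(board_id)
    parser.feed(html)
    return parser.thread_ids


def fetch(url, cookie_header):
    """Download one page, sending cookies copied from a logged-in browser."""
    request = urllib.request.Request(url, headers={"Cookie": cookie_header})
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8", errors="replace")


def collect_thread_urls(board_id, last_page, fetch_page):
    """Walk every index page and return the de-duplicated thread URLs."""
    thread_urls = []
    for url in index_urls(board_id, last_page):
        for thread_id in extract_thread_ids(fetch_page(url), board_id):
            thread_url = f"{BASE}/board/{board_id}/nest/{thread_id}"
            if thread_url not in thread_urls:
                thread_urls.append(thread_url)
    return thread_urls
```

To run it for real you would call something like collect_thread_urls("bd0000001", 31, lambda url: fetch(url, my_cookie_string)), where my_cookie_string is the Cookie header value from your browser, and then fetch and save each returned URL. Passing the fetch function in as an argument also makes it easy to try the parsing on pages you have already saved to disk.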