Using wget to recursively fetch an image directory
This might be irrelevant but since there are lots of geeks here, I will just try my luck.
What I am trying to do is simple: use wget to fetch a directory on a website, along with its subdirectories. For example, headline.nycweb.io is a Joomla website, and under its document root there is an "images" directory containing lots of images. I want to use wget to fetch the whole "images" directory and its contents to my own server.
I have read this SO post: https://stackoverflow.com/questions/273743/using-wget-to-recursively-fetch-a-directory-with-arbitrary-files-in-it/273776#273776
but when I run
wget --recursive --no-parent -e robots=off http://headline.nycweb.io/images/
all I ever get is an index.html file. What did I do wrong? And is what I am trying to do possible at all?
By the way, I have total control over both the source website and the destination server.
1 Reply
I've never tried to do this with wget before, but I thought I'd take a look to try to get the ball rolling.
I did a little surfing, and for a second I thought you might want to try adding --reject "index.html*" to your wget command before the download URL, but on further review it looks like this would just drop index.html from the downloaded files rather than fetch the ones that are missing.
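For reference, this is roughly how that flag would slot into the original command. It's only a sketch: the mirror_images wrapper and its DRY_RUN switch are names made up here for illustration, not something from the question. Keep in mind that wget still has to download each index.html to look for links before it deletes the rejected copy, so this tidies up the result but doesn't give wget anything new to follow.

```shell
# Hypothetical wrapper (not from the original post): the question's wget
# command with the --reject filter added.
# Set DRY_RUN=1 to print the command instead of running it.
mirror_images() {
    # Rebuild the argument list as the full wget command; "$1" (the URL)
    # is expanded before the positional parameters are replaced.
    set -- wget --recursive --no-parent -e robots=off --reject 'index.html*' "$1"
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Usage (left commented out so that sourcing this file touches no network):
#   DRY_RUN=1 mirror_images "http://headline.nycweb.io/images/"
#   mirror_images "http://headline.nycweb.io/images/"
```

Running it with DRY_RUN=1 first is a cheap way to check the quoting of the pattern before anything is downloaded.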
Maybe it's something with Apache that's preventing access to the files? I'm getting 403 Forbidden when trying to access, for instance, headline.nycweb.io/images/Demo but not when I go directly to http://headline.nycweb.io/images/Demo/blog/business9.jpg
$ curl -ILl headline.nycweb.io/images/Demo
HTTP/1.1 301 Moved Permanently
Date: Wed, 03 Jul 2019 19:51:27 GMT
Server: Apache/2.4.18 (Ubuntu)
Location: http://headline.nycweb.io/images/Demo/
Content-Type: text/html; charset=iso-8859-1
HTTP/1.1 403 Forbidden
Date: Wed, 03 Jul 2019 19:51:27 GMT
Server: Apache/2.4.18 (Ubuntu)
Content-Type: text/html; charset=iso-8859-1
$ curl -ILl headline.nycweb.io/images/Demo/blog/business9.jpg
HTTP/1.1 200 OK
Date: Wed, 03 Jul 2019 19:52:15 GMT
Server: Apache/2.4.18 (Ubuntu)
Last-Modified: Thu, 29 Sep 2016 12:45:30 GMT
ETag: "520e-53da4da4d8e80"
Accept-Ranges: bytes
Content-Length: 21006
Content-Type: image/jpeg
So maybe wget is bailing when it tries to read the subdirectories under images? Might be worth looking at the permissions on those dirs.
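If you want to check that across several directories, something like the following might help. It's a rough sketch, and both function names are invented for this example. The idea is that wget --recursive can only discover files through links in the HTML it gets back, so a 403 on a directory URL leaves it with nothing to follow, even when the files inside are readable. On Apache, a 403 for the directory combined with a 200 for a file in it often points at directory listings being disabled rather than at file permissions, but that is a guess about this particular server.

```shell
# Hypothetical helpers, not from the original thread.

# Print only the final HTTP status code for a URL.
# -I sends a HEAD request, -L follows redirects such as the 301 that adds
# the trailing slash, and -w prints the status of the last response.
http_status() {
    curl -s -o /dev/null -I -L -w '%{http_code}' "$1"
}

# Describe what a status code on a directory URL means for wget --recursive.
explain_dir_status() {
    case "$1" in
        200) echo "page served; wget can follow any links it contains" ;;
        403) echo "listing forbidden; wget has no links to follow here" ;;
        404) echo "not found; check the path" ;;
        *)   echo "unexpected status $1; inspect the response manually" ;;
    esac
}

# Usage (left commented out so that sourcing this file touches no network):
#   explain_dir_status "$(http_status http://headline.nycweb.io/images/Demo/)"
```

Since you control the source server, comparing the result for /images/ with the result for /images/Demo/ should show whether the two directories are being treated differently.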