Let's Download An Entire Website Locally For Viewing It Offline

A download symbol. (Photo credit: Wikipedia)

I should have learned how to save a website for offline viewing long ago, but the truth is that I did not know an elegant way of doing it until now!  For the longest time, I have used wget for downloading open source software over the Internet, but I had no idea that I could also use wget to download an entire website for offline viewing.  Now I know, but how do I use wget to download an entire website on a Mac?

Before I continue on with the main point, you might wonder what the point of downloading an entire website is.  The point is simply that some people might experience Internet Interruption Syndrome, and by downloading a website for offline viewing they can basically anticipate this very syndrome.  You know, it can happen to you too!  Say you are on a road trip to somewhere you have fantasized about, but your so-called 21st century car doesn't have 21st century wireless technology, and you don't have any other always-on wireless technology with you (e.g., a portable hotspot, a good enough smartphone data plan which allows your smartphone to behave as a portable hotspot, etc.).  You are in a bind, unable to connect to the Internet from inside a rather modern moving car, and this makes you want to scream, "Oh my God, I want a cure for my Internet Interruption Syndrome!"  Don't scream too loudly, though, because you might make your driver dangerously swerve in and out of that highway lane.  The driver might blame you for a "Sudden Oh My God Syndrome," but that blame will have to come after the fact, once the car and its passengers are confirmed whole.

With the why of using wget to download an entire website out of the way, let us move on with how to acquire wget so we can use it to download an entire website, OK?  Unfortunately, wget doesn't come with the Mac by default, but you can always get it onto a Mac by following Makeuseof.com's How To Get Wget For Your Mac tutorial.  If for some reason you don't want to follow the tutorial I just mentioned, you can always install a virtual machine (e.g., VMware, VirtualBox, Parallels) that runs Linux (e.g., Ubuntu, Fedora, CentOS, Mint, etc.), and this way you acquire wget automatically, as most Linux distributions install wget by default (i.e., so far every Linux I have used has included wget).  Just remember, though, that you need to enable a shared folder between the Linux virtual machine and your host machine (e.g., Mac, Windows) so you can share whatever wget has downloaded between the virtual machine and the host machine.  This way you don't have to copy the content from the virtual machine onto a USB flash drive and then share the content on the USB flash drive with the host machine.

OK, with how to acquire wget out of the way, let us move on with how to use wget to download an entire website, OK?  I follow LinuxJournal.com's Downloading an Entire Web Site with wget tutorial.  In case you don't want to check out that tutorial, you can read on, as I will repeat how to use wget to download an entire website within this blog post of mine.  To use wget to download an entire website, open up a terminal in Linux (or a terminal on the Mac if you have wget installed successfully) and type in the commands below:

  1. cd ~/Documents/
  2. mkdir wget-Downloads
  3. cd ~/Documents/wget-Downloads
  4. wget --recursive --no-clobber --page-requisites --html-extension --convert-links --restrict-file-names=windows --domains example.com example.com/whatever/
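The four numbered steps above can also be collected into a tiny shell script.  Here is a minimal sketch; DOMAIN and START_URL are placeholders you must replace with a real site, and the script only prints the wget command so you can inspect it first (remove the echo to actually start the download):

```shell
#!/bin/sh
# Placeholders -- swap these for the real site you want to mirror.
DOMAIN="example.com"
START_URL="example.com/whatever/"

# Recreate the download directory used in the steps above.
mkdir -p "$HOME/Documents/wget-Downloads"
cd "$HOME/Documents/wget-Downloads" || exit 1

# Build the full mirroring command (same flags as in step 4).
CMD="wget --recursive --no-clobber --page-requisites --html-extension \
--convert-links --restrict-file-names=windows \
--domains $DOMAIN $START_URL"

# Print the command instead of running it; delete the echo to download.
echo "$CMD"
```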

After using the commands above, you should now have a directory wget-Downloads created inside your Documents directory (e.g., Linux: /home/[user-name-here]/Documents/wget-Downloads; Mac: /Users/[user-name-here]/Documents/wget-Downloads), and inside it the website which you downloaded.  Of course, remember to replace example.com with an actual website, OK?  Also, if you compare the tutorial from LinuxJournal against mine, you will notice I did not use the --no-parent parameter for the wget command.  The --no-parent parameter stops wget from ascending into the parent directories of your starting URL, so it will limit you from downloading an entire website, and you might end up with broken links when viewing the website offline.  Still, if you are sure about the usage of the --no-parent parameter, then you should use it.  Also, you should know that using wget to download an entire website might sometimes be the worst thing you can do, because you might have to twiddle your fingers for the longest time, if not forever, when a website you try to download is way too big.  Luckily, you can always use the Ctrl+C key combination on Linux (it works the same way in a Mac terminal) to stop wget from continuing the download of an entire website.
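On that note about websites that are way too big: GNU wget also has flags that keep a mirror from running away, namely --level (caps how deep the recursion goes) and --quota (caps the total bytes fetched during a recursive download).  The limits below are illustrative numbers I picked, not recommendations, and the sketch prints the command rather than running it since example.com is a placeholder:

```shell
#!/bin/sh
# --level=2    : follow links at most two levels deep from the start page.
# --quota=100m : stop fetching new files once roughly 100 MB are downloaded.
CAPPED="wget --recursive --level=2 --quota=100m --no-clobber \
--page-requisites --html-extension --convert-links \
--restrict-file-names=windows --domains example.com example.com/whatever/"

# Print the capped command for inspection; delete the echo to run it.
echo "$CAPPED"
```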

As LinuxJournal.com explained,

  • the --recursive parameter tells wget to download an entire website by following its links recursively
  • the --domains example.com parameter tells wget to stay within a specific website and not to download the contents of other websites, as wget can actually follow links that point to other websites and scrape their contents too
  • the --no-parent parameter tells wget not to follow links outside of a directory within a website, therefore stopping wget from downloading whatever contents are located outside of that specific directory
  • the --page-requisites parameter tells wget to download all the extra contents besides just text (e.g., CSS, images, etc.), so that the offline website will appear pretty much the same as if it were being viewed online
  • the --html-extension parameter tells wget to save the offline website's files with an .html extension, keeping the website structure as if it were being served online (this is useful for a website owner who wants to back up a website locally)
  • the --convert-links parameter tells wget to rewrite links so that when the website is viewed offline, its pages will link to each other properly (locally)
  • the --restrict-file-names=windows parameter tells wget to convert file names in such a way that the downloaded files will work correctly on Windows as well (i.e., Windows will be able to serve the offline website's files correctly in whatever browsers are installed on Windows)
  • the --no-clobber parameter tells wget not to overwrite any existing file, which saves some bandwidth and storage space; sometimes, though, it's best not to use this parameter so you can actually refresh the offline copy (i.e., sometimes a website updates its webpages with newer contents)
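One habit of mine (not from either tutorial): after --convert-links has done its work, you can grep the downloaded copy for leftover absolute links pointing back at the live site; ideally there are few or none.  Here is a self-contained sketch that uses a throwaway directory as a stand-in for the real mirror:

```shell
#!/bin/sh
# Build a throwaway "mirror" containing only a relative link.
MIRROR_DIR=$(mktemp -d)
printf '<a href="about/index.html">About</a>\n' > "$MIRROR_DIR/index.html"

# Search the copy for absolute links back to the live site.
if grep -rq "http://example.com" "$MIRROR_DIR"; then
    RESULT="absolute links remain -- conversion incomplete"
else
    RESULT="no absolute links left"
fi
echo "$RESULT"

# Clean up the throwaway directory.
rm -rf "$MIRROR_DIR"
```

On a real mirror you would point MIRROR_DIR at ~/Documents/wget-Downloads/example.com instead.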

In summary, I have tried many other methods of saving a website offline for later viewing, but none is so elegant and simple as using wget.  How come?  For example, when I used a browser to save a website (i.e., File > Save Page As), I had to do this more than once to actually save all the portions of the website correctly.  Furthermore, I had to reorganize the saved portions of the website locally, or else they would appear unorganized within the local directory.


Raising Ulimit On CentOS Server!

Some people who run CentOS may have difficulty raising ulimit (visit here to learn more about ulimit).  These folks find that whenever they enter the command [ulimit -n] inside a terminal as root after changing the limit, the new ulimit value isn't showing up in the terminal; instead, the original ulimit value is still being shown.  In this very post, I'll list the steps for raising the ulimit value appropriately for CentOS 5.6 and possibly newer versions too.  Just make sure to make copies of whatever files you're going to change in case you need to revert those files to their original state.  Check out the steps below:

  1. Open up the file /etc/sysctl.conf with the vim editor (or vi) using this command [vim /etc/sysctl.conf].
  2. Add the line fs.file-max=122880 to /etc/sysctl.conf.  This line needs to be by itself (i.e., on a single line).  Best, just add it at the very bottom of the /etc/sysctl.conf file.
  3. Save and exit /etc/sysctl.conf by hitting the ESC key on your keyboard, then typing :wq and hitting Enter to finish.
  4. Execute the command [sysctl -p] (or reboot) to apply the kernel parameter above (i.e., fs.file-max=122880); since the line now lives in /etc/sysctl.conf, it will also persist across reboots.
  5. Now, edit the file /etc/security/limits.conf.  How?  Using vim as above.
  6. Inside /etc/security/limits.conf, enter the line * - nofile 122880 (i.e., make sure it stays on a single line by itself; that's a plain hyphen between the asterisk and nofile).
  7. Save and exit the /etc/security/limits.conf file.
  8. This last step is very crucial.  Without this last step, nothing will work in your current session!  Execute the command [ulimit -n 122880] inside your terminal as root.
  9. Finally, to check whether your new ulimit value is showing up correctly, just execute the command [ulimit -n] inside your terminal.  If you have done everything correctly, you should now see the value 122880 returned.
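For reference, the nine steps above can be condensed into a short root-only script.  Treat this as a sketch of the same edits, not something to paste blindly: it appends to system files, so keep the backups this post recommends, and run it only as root on a machine you are prepared to revert.

```shell
#!/bin/sh
# Run as root on CentOS.  Same value (122880) as in the steps above.

# Steps 1-4: raise the system-wide file handle cap and apply it now.
echo 'fs.file-max=122880' >> /etc/sysctl.conf
sysctl -p

# Steps 5-7: raise the per-user open-file limit for all users.
echo '* - nofile 122880' >> /etc/security/limits.conf

# Step 8: apply the new limit to the current shell session.
ulimit -n 122880

# Step 9: verify; this should print 122880.
ulimit -n
```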

Good luck!

Disclaimer:  Follow the instructions here at your own risk!