Mirror files from the web to IPFS with wget-to-ipfs

In an attempt to make several large files available on IPFS and 'update' them weekly, I wrote some scripts that I can run on a server. They are commandline only. A result can be seen at IPFS hash QmRfvcwg51iDzmcnkuB6wErAtyUwbgS8VUyT95aHs2AT39 (a short while after publishing this post, anyway).

Today we'll focus on the first script: it downloads a file, then adds it to IPFS, optionally into a directory. It is available on my git repo.

Quick start

You can clone or download the script, then run it.

git clone https://git.webschuur.com/placebazaar/wget-to-ipfs.git
cd wget-to-ipfs

Or, if you don't have or know git, download the wget-to-ipfs.sh file and make it executable, for example with:

chmod +x wget-to-ipfs.sh

Now run it.

./wget-to-ipfs.sh --url=http://example.com/file.pdf

Options are shown with wget-to-ipfs.sh --help

Why?

When you have files locally, there's no need for this wrapper; IPFS has enough tooling to add and manage entire directories. But when the files are only available on the web, downloading and adding them in one go is useful, especially if they are large files.
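For comparison, adding a local directory tree is a one-liner with stock IPFS (standard ipfs usage, nothing to do with the script; the directory name is just an example):

# Recursively add a local directory; prints a hash for every file and for the directory itself.
ipfs add --recursive ./my-local-directory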

I developed this for placebazaar, where we want to make weekly Open Street Map database files available on the permanent, uncensorable web: IPFS. Maintaining the database files on a server (next to the IPFS repo) requires large disks, which is expensive. So I wrote a script that can run weekly, download the latest files and then publish them on IPFS.

You too may have a list of URLs which you want to make available on IPFS. If so, this script may help.

How does it work?

It needs both ipfs and wget to be installed, initialized and runnable. The script has a lot of boilerplate, taken from bash3boilerplate.sh.

Most Linux/Unix distributions have wget installed by default. And if not, installing it is easy, since wget is one of the most used tools. Installing IPFS is a bit more involved, but should not be too hard.
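For reference, on an Ubuntu machine the setup could look roughly like this (the apt and snap package names are assumptions; use whatever matches your distribution):

# Install wget (usually already present).
sudo apt-get install wget

# Install IPFS; the snap package is one option on Ubuntu.
sudo snap install ipfs

# Initialize the local IPFS repository and start the daemon.
ipfs init
ipfs daemon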

The script downloads the file using wget and writes it to a temporary file. The temporary file path has a default and can be set with the --temp option.
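Under the hood that step boils down to something like the following (the variable names are illustrative, not the script's actual ones):

# Download the URL and write it to the temporary file, overwriting any earlier download.
wget --output-document="${_temp_file}" "${_url}"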

It then reads that file and adds it to IPFS with ipfs add < /tmp/wget-to-ipfs. It uses redirection because when IPFS is installed on Ubuntu with snap, the snap sandbox prohibits reading most files. Redirecting circumvents this: ipfs add /tmp/wget-to-ipfs requires that the ipfs command (inside the snap sandbox) has access to /tmp/wget-to-ipfs, which is not the case. Whereas ipfs add < /tmp/wget-to-ipfs requires that the user running the command (and not the command itself) can read the file, which is the case.
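Side by side, the difference looks like this:

# Fails under the snap sandbox: the ipfs binary itself has to open the file.
ipfs add /tmp/wget-to-ipfs

# Works: the shell opens the file and feeds it to ipfs on stdin.
ipfs add < /tmp/wget-to-ipfs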

After that, the optional --root flag allows you to add the file into a "directory". IPFS is a tree structure, so adding a file to a directory amounts to "linking it to a directory hash", which is what the script does with ipfs object patch [root-hash] add-link [filename] [file-hash].

IPFS allows creating an empty directory and then adding a file to it as follows:

_dir_hash=$(ipfs object new unixfs-dir)  # Hash of an empty unixfs directory
_file_hash=$(ipfs add -Q ./cv.pdf)       # -Q prints only the resulting file hash
ipfs object patch ${_dir_hash} add-link file.pdf ${_file_hash}  # The link name can differ from the local filename

This results in a new hash: IPFS is immutable, so "adding a file to a directory hash" has to result in a new directory hash. So, if you want to add multiple files, the logic would be:

_dir_hash=$(ipfs object new unixfs-dir)
_file_hash=$(ipfs add -Q ./cv_2017.pdf)
_dir_hash=$(ipfs object patch ${_dir_hash} add-link cv_2017.pdf ${_file_hash}) # Overwrite _dir_hash with the new directory hash
_file_hash=$(ipfs add -Q ./cv_2018.pdf)
ipfs object patch ${_dir_hash} add-link cv_2018.pdf ${_file_hash}

The last result will be the hash of the directory with all the files added. Intermediate hashes are those of directories with fewer files.
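To check the result, you can list the contents of the final directory with the standard ipfs ls command, passing the hash printed by the last patch call:

# Replace <final-dir-hash> with the hash printed by the last 'ipfs object patch' command.
ipfs ls <final-dir-hash>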

Downloading a set of files and adding those

Knowing this, we can download a list of URLs from a file urls.txt and add them, incrementally, to a directory as follows:

export _dir_hash=$(ipfs object new unixfs-dir)
for url in $(cat urls.txt)
do
   _out=$(./wget-to-ipfs.sh --url="${url}" --root="${_dir_hash}" --temp="/root/wget-to-ipfs")
   _dir_hash=$(echo "${_out}" | grep root | cut --delimiter=" " --fields=2) # Pick up the new root hash from the script's output
done
echo "new dir: ${_dir_hash}"

Next

Managing the pins, garbage collection, and publishing the results using DNSLink.