Clean a series of links, resolving redirects and finding Wayback results if page is gone. Originally written to aid with importing from ArchiveBox.
Originally, muna was uniredirector for my program agaetr, but it grew to be quite a bit more multipurpose. (The script in agaetr is now enhanced to be identical to muna in all but name.)
I ended up writing this because of ArchiveBox. It’s a great self-hosted archiving system, but when you throw a random list of URLs (or worse, different types of RSS feeds) at it, you get… mixed results. It does not handle redirects too well, and if something is just 404, you’re out of luck. So I wrote feeds-in to preprocess inputs from both persnickety RSS feeds and a plain list of URLs. It’s included here as an example of how to use muna and the bash function unredirector.
muna is an Old Norse word meaning “call to mind, remember”.
This project is licensed under the Apache License. For the full license, see LICENSE.
On many Linux installations these may already be installed; if not, they’re in your package manager. (If you have to build these from source, you don’t need me telling you how to do that!)
Clone or download this repository. Put muna.sh somewhere in your $PATH or call/source it explicitly.
While this script is included here as an example, it is a fully functional DEATH ST… script, appropriate to put in a cron job to preprocess sources of URLs for ArchiveBox. Or use it as the base of a script to meet your needs.
One important and super useful note for someone who already has a big list of URLs from some other program: all you have to do is put that text file, one URL per line, in RAWDIR (which you’ll configure here in a second) and that list will be pulled seamlessly into the workflow.
If you are using feeds-in.sh with ArchiveBox, you will need to edit these lines as appropriate for you:
APPDIR="/home/www-data/apps/ArchiveBox-Docker"
RAWDIR="$APPDIR/rawdata"
DATADIR="$APPDIR/data"
source "$APPDIR/muna.sh"
APPDIR should point to your ArchiveBox installation. RAWDIR is a work directory where you can also put any text file with a plain list of URLs. DATADIR should be the data directory of your ArchiveBox installation.
There are several example feeds (starting around line 50). Each strips that particular RSS feed (or XML sitemap) down to a series of URLs, one per line, written in a text file.
The sed and awk strings are left here as examples for these particular kinds of feeds. Feel free to use them as a starting point, but I won’t guarantee they work for your feeds; they just work for these feeds.
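To give a feel for what "stripping a feed down to URLs" can look like, here is a minimal sketch. It is not one of the sed or awk strings from feeds-in.sh; the function name extract_links and the inlined sample feed are made up for illustration, and it assumes a feed whose links appear as plain `<link>…</link>` elements.

```shell
#!/usr/bin/env bash
# Illustrative sketch only; extract_links is a hypothetical helper,
# not part of muna or feeds-in.sh.

# Print the contents of every <link>...</link> element, one per line.
extract_links() {
    grep -o '<link>[^<]*</link>' | sed -e 's/<link>//' -e 's/<\/link>//'
}

# A tiny sample feed, inlined so the sketch runs without network access.
sample_feed='<rss><channel><link>https://example.com/</link>
<item><title>One</title><link>https://example.com/one</link></item>
<item><title>Two</title><link>https://example.com/two</link></item>
</channel></rss>'

# Note that the channel-level <link> is printed too; filter it out if unwanted.
printf '%s\n' "$sample_feed" | extract_links
```

With a live feed, you would pipe the downloaded XML into the function instead of the sample string. Atom feeds and sitemaps use different elements, so the pattern would need adjusting for those.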
The console output here is a progress bar unless there are errors. The text file is time-date stamped to avoid collisions and overwrites.
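A time-date stamped file name can be built along these lines. This is a sketch, not the exact code in feeds-in.sh: the `parsed_urls_` prefix and the timestamp format are assumptions, and RAWDIR falls back to a temporary directory here only so the snippet runs on its own.

```shell
#!/usr/bin/env bash
# Sketch under assumptions: the "parsed_urls_" prefix and the date format
# are illustrative, not necessarily what feeds-in.sh uses.

RAWDIR="${RAWDIR:-${TMPDIR:-/tmp}}"      # fallback so the sketch is runnable
STAMP=$(date +%Y%m%d_%H%M%S)             # e.g. year, month, day, then time
OUTFILESHORT="parsed_urls_${STAMP}.txt"  # file name only
OUTFILE="$RAWDIR/$OUTFILESHORT"          # full path in the work directory

printf 'Writing URLs to %s\n' "$OUTFILE"
```

A one-second resolution is enough for a cron job that runs every few minutes; two runs started within the same second would still share a name.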
Then feeds-in.sh calls ArchiveBox to import that list of URLs. Uncomment the appropriate line in this section for your style of installation. Note that the docker-compose and standalone docker commands are quite different; don’t confuse them! (I won’t tell you how I know… sigh.)
###############################START PARTS TO EDIT########################
# Uncomment the next line for non-docker installations
#./archive /"$OUTFILESHORT"
# Uncomment the next line for docker-compose installations
docker-compose exec archivebox /bin/archive /"$OUTFILESHORT"
# Uncomment the next line for docker *NOT DOCKER COMPOSE* installations
#cat "$OUTFILE" | docker run -i -v ~/ArchiveBox:/data nikisweeting/archivebox
If there’s a redirect, whether from a shortener or, say, a redirect to HTTPS, muna will follow it and set the variable "$url" to (or print to STDOUT) the appropriate URL. If there is any other error (including if the page is gone or the server has disappeared), it will check whether the page is saved at the Internet Archive and return the latest capture instead. If it cannot find a copy anywhere, it sets the variable "$url" to an empty string and returns nothing, exiting with the exit code 99.
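The flow described above can be sketched roughly as follows. This is not muna's actual code, just one way to get the same behavior with curl and the Internet Archive's availability endpoint; the function names resolve_url and wayback_from_json are hypothetical, the sample JSON is a hand-written imitation of that endpoint's response, and the URL is passed to the endpoint without encoding, which is a simplification.

```shell
#!/usr/bin/env bash
# Illustrative sketch only; NOT muna's implementation.
# resolve_url and wayback_from_json are hypothetical names.

# Pull the first web.archive.org snapshot URL out of an availability response.
wayback_from_json() {
    grep -oE 'https?://web\.archive\.org/web/[^"]*' | head -n 1
}

# Print the final URL after redirects, or a Wayback snapshot, or return 99.
resolve_url() {
    local url="$1" result code final snapshot
    # -L follows redirects; -w prints the status code and the final URL.
    result=$(curl -sL -o /dev/null --max-time 20 \
        -w '%{http_code} %{url_effective}' "$url")
    code=${result%% *}
    final=${result#* }
    if [ "$code" = "200" ]; then
        printf '%s\n' "$final"
        return 0
    fi
    # Anything else: ask the Internet Archive for a snapshot.
    snapshot=$(curl -s --max-time 20 \
        "https://archive.org/wayback/available?url=${url}" | wayback_from_json)
    if [ -n "$snapshot" ]; then
        printf '%s\n' "$snapshot"
        return 0
    fi
    return 99   # nothing found anywhere
}

# Offline demonstration of the parsing step with an imitation response.
sample_json='{"url": "example.com/gone", "archived_snapshots": {"closest": {"status": "200", "available": true, "url": "http://web.archive.org/web/20200101000000/http://example.com/gone", "timestamp": "20200101000000"}}}'
printf '%s\n' "$sample_json" | wayback_from_json
```

Calling `resolve_url "https://example.com/some-page"` needs network access, so it is left out of the demonstration. Treating only a 200 status as success is a deliberate simplification for the sketch.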
muna.sh [-q] URL
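When calling muna.sh as a standalone command, the exit code tells you whether anything was found. The wrapper below is a hedged sketch: check_url and the MUNA variable are hypothetical names introduced here, and it assumes muna.sh is on your $PATH and prints only the resolved URL on STDOUT.

```shell
#!/usr/bin/env bash
# Illustrative wrapper; check_url and MUNA are hypothetical names.
# Assumption: muna.sh is on $PATH and prints the resolved URL on STDOUT.

MUNA="${MUNA:-muna.sh}"

check_url() {
    local resolved status
    resolved=$("$MUNA" "$1")   # declared separately so $? is muna's exit code
    status=$?
    if [ "$status" -eq 0 ] && [ -n "$resolved" ]; then
        printf 'ok %s\n' "$resolved"
    elif [ "$status" -eq 99 ]; then
        printf 'gone %s\n' "$1"          # no live page and no archived copy
    else
        printf 'error(%s) %s\n' "$status" "$1"
    fi
}
```

For example, `check_url "https://example.com/old-page"` would print one line starting with ok, gone, or error, which is easy to grep through in a log.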
Put this line at the top of your script.
source path/to/muna
In your script, the variable $url must be set before calling the function unredirector. Afterward, if a successful match was made, $url will be set appropriately. If no match was made, $url will be set to an empty string, as in this example from feeds-in.sh.
url=$(printf "%s" "$line")
unredirector
if [ -n "$url" ]; then  # yup, that url exists
    echo "$url" >> "$OUTFILE"
fi
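Put together, a loop over a whole file of URLs might look like the sketch below. The function clean_list and the file names are made up for illustration. So that the snippet runs without muna installed, it defines a stand-in unredirector that blanks any URL containing "gone"; in real use you would source muna instead and the stand-in would be skipped.

```shell
#!/usr/bin/env bash
# Illustrative sketch; clean_list is a hypothetical helper.
# In real use, replace the stand-in below with: source path/to/muna

if ! declare -F unredirector >/dev/null; then
    # Stand-in ONLY for demonstration: pretends that any URL containing
    # "gone" cannot be resolved, and leaves every other URL unchanged.
    unredirector() {
        case "$url" in
            *gone*) url="" ;;
        esac
    }
fi

# Read URLs from $1 (one per line) and append the survivors to $2.
clean_list() {
    local infile="$1" outfile="$2" line
    while IFS= read -r line || [ -n "$line" ]; do
        [ -z "$line" ] && continue      # skip blank lines
        url="$line"
        unredirector                    # sets $url, or empties it
        if [ -n "$url" ]; then
            printf '%s\n' "$url" >> "$outfile"
        fi
    done < "$infile"
}

# Small demonstration with temporary files.
workdir=$(mktemp -d)
printf '%s\n' "https://example.com/a" "https://example.com/gone" \
    "https://example.com/b" > "$workdir/in.txt"
clean_list "$workdir/in.txt" "$workdir/out.txt"
cat "$workdir/out.txt"
```

The `|| [ -n "$line" ]` part makes the loop handle a last line that has no trailing newline, which is common in lists exported from other programs.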
bash ./feeds-in.sh
Seriously, that’s it. If you edited things in the script to meet your system, then you should be done.
Steven Saus injects people with radioactivity for his day job, but only to serve the forces of good.
Mostly.