← All posts

What is link rot, and how to archive articles before they die

April 22, 2026

The essay you remember reading is gone. The URL still exists. It either 404s, redirects to a casino, or loads a page so different from what you read that it might as well be gone.

That is link rot, and it is worse than most people realise.

Roughly 50% of the links cited in US Supreme Court opinions no longer point to the material they were supposed to, and about 70% of links cited in academic legal journals have the same problem. Those numbers come from the Harvard Library Innovation Lab's Perma.cc project, which was built specifically because legal scholarship was quietly losing its footnotes.

If the most carefully cited prose in the country rots at that rate, your bookmarks do not stand a chance.

The causes are boring. That is part of why nobody protects against them.

Sites get redesigned and do not bother with redirects. Domains expire or change hands. Companies get acquired and migrate content to a new CMS that drops half the old URLs. A CDN gets swapped out and images 404 even though the article survives. Paywalls get added retroactively. An author deletes a post because they cringe at it now. A whole site shuts down.

That last one is the scariest. Google Reader closed on July 1, 2013, taking a decade of shared RSS items with it. Delicious, which was how a lot of us saved links before "read later" was a category, went read-only on June 15, 2017. Pocket, which a generation of people used as their personal article archive, shut down on July 8, 2025, with permanent data deletion starting on November 12, 2025 for anyone who did not export in time.

Every one of those was a place people trusted to hold their reading. None of them held it.

A dead link is any URL that no longer returns the content it originally did. That includes outright 404s, but also the sneakier cases: a 200 OK response that now serves a login wall, a parked domain, or a reworded article that no longer contains the sentence you remembered.
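To make that definition concrete, here is a minimal sketch of a link checker built on it. The URLs and remembered phrases are placeholders, it uses the requests library, and a real checker would want retries and a browser-like user agent; treat it as an illustration of the idea, not a finished tool.

```python
# Minimal link-rot checker: a URL counts as dead if it errors,
# returns a non-200 status, or no longer contains a phrase you
# remember from the original page. URLs and phrases below are
# placeholders, not real saves.
import requests

SAVED = {
    "https://example.com/essay": "the sentence you remembered",
    "https://example.com/tutorial": "pip install",
}

def is_rotten(url: str, phrase: str) -> bool:
    try:
        resp = requests.get(url, timeout=10, allow_redirects=True)
    except requests.RequestException:
        return True  # DNS failure, timeout, connection refused
    if resp.status_code != 200:
        return True  # outright 404, 410, 500...
    # The sneaky case: 200 OK but the content is gone or replaced.
    return phrase.lower() not in resp.text.lower()

for url, phrase in SAVED.items():
    if is_rotten(url, phrase):
        print(f"rotten: {url}")
```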

The "just Google it again" fallback works for evergreen content by big outlets. It fails for the stuff worth archiving in the first place: small blogs, niche tutorials, essays with a specific voice, and anything tweeted as a thread. X specifically eats a lot of what you save there, which is the reason I wrote a whole guide to how X bookmarks actually work. The tweet gets deleted, the thread breaks, the author goes private, and the bookmark becomes an empty row.

If you have ever gone looking for a passage you knew you saved and found nothing, you know the feeling. Six months of saves, half of them quietly broken.

The Wayback Machine is the baseline, and it is not enough

The Wayback Machine is the cultural memory of the public web. Brewster Kahle's Internet Archive has been snapshotting pages since 1996, and at this point it is the closest thing we have to a shared archive of the web.

Two things you should actually do with it.

First, get in the habit of using Save Page Now. Paste a URL, press the button, and the Wayback Machine captures the page (including images and CSS) and hands you a permanent archive URL. Citation-quality. Shareable. Free. No account required.
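If you save often enough that the web form gets tedious, the same capture can be triggered from code. Both endpoints below are public and documented at the time of writing, though the save endpoint is rate-limited and the exact response shape is not guaranteed, so this is a convenience sketch, not a bulk tool.

```python
# Scripted Save Page Now, plus a lookup against the Wayback
# availability API. Rate-limited: fine for a handful of saves,
# wrong tool for thousands.
import requests

def save_to_wayback(url: str) -> str:
    # Requesting web.archive.org/save/<url> asks the Wayback Machine
    # to capture the page; the final URL after redirects is usually
    # the new snapshot.
    resp = requests.get(f"https://web.archive.org/save/{url}", timeout=60)
    resp.raise_for_status()
    return resp.url

def latest_snapshot(url: str) -> str | None:
    # The availability API returns the closest existing snapshot, if any.
    resp = requests.get(
        "https://archive.org/wayback/available", params={"url": url}, timeout=10
    )
    closest = resp.json().get("archived_snapshots", {}).get("closest")
    return closest["url"] if closest and closest.get("available") else None

print(save_to_wayback("https://example.com/essay"))
print(latest_snapshot("https://example.com/essay"))
```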

Second, know what it cannot do. Save Page Now saves a single page, not a whole site. It will not crawl a paywall. It will not save content behind a login. Some sites ask crawlers not to archive them, and the Wayback respects that. And if you are on wobbly airport wifi at the exact moment you want to save something, too bad.

Treat the Wayback Machine as the public-archive layer. It is the baseline. It is not a personal archive.

Archive.today for the pages the Wayback cannot reach

Archive.today (also served from archive.ph, archive.is, and a rotating cast of mirrors) is the second piece of the standard web-archivist toolkit. It works like Save Page Now with a different crawling approach. In practice, it captures a lot of paywalled articles that the Wayback Machine refuses or cannot render cleanly.

I am not going to make legal claims about whether you should archive a paywalled article from the New York Times into archive.today. That is between you and your conscience. What I will say is that the tool exists, it has existed for over a decade, and journalists, researchers, and anyone who has ever wanted to cite something without handing the reader a paywall all use it.

One caveat: it is run by a small team, hosted on a rotating set of domains, and has gone down for days at a time in the past. Do not treat it as permanent storage. Treat it as a "here is a shareable snapshot" tool.

Save a copy locally with SingleFile

Both of those services archive on someone else's computer. Which is great until someone else's computer is offline.

SingleFile is a browser extension that saves the current page as a single self-contained HTML file, with every image, stylesheet, and font embedded inline. One file. Opens in any browser. No internet required to read it back.

This is the option I wish more people knew about. If you are reading a long essay you actually care about, clicking the SingleFile icon takes about a second and gives you a file you own, on your disk, forever. You can drop it into Dropbox, iCloud, or a folder you back up, and it survives the source site dying, the author deleting it, the CDN breaking, and the internet losing power.

The tradeoff is that nothing is indexed. You end up with a folder full of .html files, named whatever the page title was. Finding a specific passage later means opening each one or grep-ing them with the terminal. Fine for important things. Bad as a general workflow.
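That grep can be a short script. A minimal sketch, assuming your SingleFile saves live in one folder (~/Saves is a placeholder path); it searches the raw HTML, markup and all, which is good enough for finding a remembered passage.

```python
# Full-text search over a folder of SingleFile saves: the "grep"
# from the paragraph above, as a script.
from pathlib import Path

def search_saves(folder: str, phrase: str) -> list[Path]:
    hits = []
    for path in Path(folder).expanduser().glob("*.html"):
        text = path.read_text(encoding="utf-8", errors="ignore")
        if phrase.lower() in text.lower():
            hits.append(path)
    return hits

for hit in search_saves("~/Saves", "link rot"):
    print(hit.name)
```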

ArchiveBox if you want to run your own Wayback Machine

ArchiveBox is the serious archivist's option. It is self-hosted, open-source, and actively maintained. You install it (Docker is easiest), point it at a list of URLs (from your bookmarks, an RSS feed, a browser export), and it archives every page into HTML, PDF, screenshot, WARC, and a few other formats. The output is designed to still be readable decades from now.
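If you feed it from a script rather than the command line, the pipeline is short. A sketch, assuming ArchiveBox is installed, the data directory has already been initialised with archivebox init, and urls.txt is a placeholder file with one URL per line.

```python
# Feeding a list of URLs into ArchiveBox from Python. Paths and
# filenames here are placeholders; adjust to your own setup.
import subprocess
from pathlib import Path

DATA_DIR = Path("~/archivebox-data").expanduser()  # placeholder path
urls = Path("urls.txt").read_text()

# `archivebox add` reads URLs from stdin and archives each one
# into HTML, PDF, screenshot, and WARC inside the data directory.
subprocess.run(
    ["archivebox", "add"], input=urls, text=True, cwd=DATA_DIR, check=True
)
```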

It is powerful and free, but it is real software. You will read docs. You will set up a cron job. You will eventually wonder where all the disk space went. If you are the kind of person who already runs a Plex server or a home NAS, ArchiveBox will feel normal. If not, it will feel like a weekend project.

Worth it if you are archiving hundreds of URLs a week and want full control. Overkill for ten saves.

Read-later apps that also save the content

The lightest option for most people is a read-later app that extracts and stores the readable content at save time. You get reading, search, and archiving in one place, and you do not have to think about any of it.

Readwise Reader is the strongest in this category. It stores readable copies of the articles you save, supports highlights and a daily review queue, and imports from most other read-later tools. About $10 a month at the time of writing. Good option if your reading workflow is highlight-heavy.

Instapaper still exists, has a free tier, and stores a cleaned reading copy of each article. It is the oldest read-later app still standing, and the interface has barely changed since 2010. Stable, predictable, nothing fancy.

Keep is where this blog lives, so it is worth being straight about the fit. Keep extracts the full article content at save time and stores it on R2 storage, so your saved copy survives the source going dead. Full-text search across everything you have saved. Chrome extension, iOS share sheet, email-in, X bookmark sync. Export to Markdown, CSV, or JSON whenever you want. No graph view, no daily review queue, no public archive page (your copy is private to you, not a Wayback-style public snapshot). Keep is closer to a personal library than an archive tool.

Which of those three fits best depends on what you want out of it. If you want the reading interface plus highlights, start with Reader. If you want the lightest possible thing that still preserves content, try Instapaper. If you want a full-text-searchable library of saves across X, articles, and newsletters that you can export as plain markdown and hand to an LLM, Keep is where that lives.

A practical personal archive looks like this

You do not need every layer. You need the ones that match what you save.

For anything you want to cite or share, use the Wayback Machine's Save Page Now. Public, free, permanent enough for most things. If the page is paywalled and the Wayback cannot reach it, archive.today is the fallback.

For pieces you genuinely care about and want on your own disk, SingleFile. Ten seconds per save, one file per page, fully offline.

For everything else you read on the web, a read-later app that stores the content (Reader, Instapaper, or Keep). This is where most of the volume goes, because this is the tier that covers "I read ten articles a day and want the interesting ones searchable in six months."

For serious archivists or anyone building a long-horizon private archive, ArchiveBox.

You can mix them. I run Keep as the main library, Save Page Now for things I want to link publicly, and SingleFile for the handful of pieces a year I treat as important enough to keep locally.

What archiving protects you against, and what it does not

It protects you against the article disappearing. That is the whole point, and it is worth more than people realise, right up until they go looking for something and find it gone.

It does not protect you against the author being wrong. An archived copy of bad advice is still bad advice, just preserved. Archive what is worth rereading, not everything you scroll past.

It does not replace reading. A pile of saved articles is not knowledge. If you never reread, you have a hoard. A commonplace book is the workflow that sits on top of an archive and turns saves into something you actually come back to.

It does not guarantee your archive survives. Your disk fails. Your cloud account gets locked. Your self-hosted box dies in a storm. Any archive worth keeping gets backed up somewhere else. The URL to Markdown tool on Keep will even give you a clean portable copy of any page if you want to paste one straight into your notes.

The point is not paranoia. The point is that the web is a performance, not a record, and anything you expect to still be there later has to be saved somewhere you control.

If you want that somewhere to be a searchable library you can export anytime, Keep saves the full article content at save time and stores it on R2, so the source going dead does not take your copy with it. Save articles that outlive their source.