This repository has been archived by the owner on Aug 15, 2023. It is now read-only.
gwern/archive-text-urls
It occurred to me once that it might be neat to have a CLI tool which would parse a text file for strings like "http://" and find the longest valid URL: e.g. "see http://www.google.com " can easily be turned into the correct URL just by starting at "http" and eating characters until you reach " ", since a literal space is not valid in a URL unless it has been escaped as "%20". So I did a little work on such a tool. It didn't work well. My ultimate solution was to realize that I only cared about the URLs in my Markdown files, and to sit down and write a Pandoc script to parse the Markdown and extract the URLs. See <http://www.gwern.net/haskell/link-extractor.hs>.
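The naive approach described above (start at "http", eat until whitespace) can be sketched in a few lines of Python. This is an illustration of the abandoned heuristic, not the repository's actual code; the trailing-punctuation strip is an assumed extra heuristic, since prose often ends a URL with a period or closing paren:

```python
import re

# Naive extraction: find "http://" or "https://" and consume
# characters up to the next whitespace, as described above.
URL_RE = re.compile(r'https?://\S+')

def extract_urls(text):
    urls = []
    for match in URL_RE.finditer(text):
        # Strip common trailing punctuation from surrounding prose
        # (a rough heuristic; real URLs can legitimately end in these).
        urls.append(match.group().rstrip('.,;:)"\''))
    return urls

print(extract_urls('see http://www.google.com for details.'))
```

As the README notes, this kind of heuristic breaks down quickly on real text, which is why parsing the Markdown structure with Pandoc proved to be the better approach.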
About
Parse freeform text files looking for plausible URLs to archive (abandoned)