If you're like me, you use Twitter's likes or favorites as bookmarks, to help you come back to a tweet or article later. If you're also like me, you've been using Twitter way too much, and after more than ten years you have amassed over 2,000 likes and many more retweets with useful links and articles that you wanted to find later.
But you never found them again, because Twitter doesn't give you a way to search your own retweets or favorites. And even if it did, a text search based on tweet contents would probably not be good enough when you consider tweets that only contain a link, or tweets whose comment is not representative of the article they link to. Ideally, you'd be able to find links to articles based on the text in the tweet, but also based on the text of the actual article.
So I've created a new tool called Twitter Discovery to allow you to do exactly that. It downloads the contents of all webpages that are linked from any of your retweets or favorites and lets you search them.
The web-based GUI allows you to search the linked webpages by text, order them by date or website, and so on. The text search supports quotation marks to group words: a search for "big finance" will return articles that contain the phrase big finance, while searching for big finance without the quotes will return articles that contain both the word big and the word finance. The text search matches text inside the article's body, but also in the article's title and in the text of the tweet that linked to the article.
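To make the quoting rule concrete, here is a minimal sketch of how such a query could be parsed and matched. It is not the code Twitter Discovery actually uses; the function names parse_query and matches are made up for illustration, and it does plain case-insensitive substring matching, so the term big would also match bigger.

```python
import re


def parse_query(query):
    """Split a query into lowercase terms; text inside double quotes stays together."""
    terms = []
    # Each match is either a quoted phrase (group 1) or a single bare word (group 2).
    for phrase, word in re.findall(r'"([^"]*)"|(\S+)', query):
        term = (phrase or word).strip().lower()
        if term:
            terms.append(term)
    return terms


def matches(query, *fields):
    """Return True if every term appears in at least one field.

    The fields stand in for the article body, the article title and the tweet
    text. This is a simple substring check, an assumption of this sketch.
    """
    lowered = [field.lower() for field in fields if field]
    return all(any(term in field for field in lowered) for term in parse_query(query))


if __name__ == "__main__":
    body = "Why big tech is moving into finance"
    print(matches('"big finance"', body))  # phrase does not occur in the text
    print(matches("big finance", body))    # both words occur somewhere
```

Checking each term against every field separately keeps a quoted phrase from accidentally matching across the boundary between, say, the end of a title and the start of a tweet.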
Here's what the GUI looks like:
Running Twitter Discovery involves a bit of work at this point. You need to grab the source from GitHub, which includes a README that tells you how to install the project, how to automatically download tweets and their related articles, and how to run the web app.
Under the hood, we're using Simon Willison's excellent twitter-to-sqlite for downloading tweets, the Newspaper3k library for extracting article title, contents, etc. from the linked articles, and Streamlit and sprinkles of Altair for the web GUI.
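As a rough illustration of the article-extraction step, the sketch below fetches a list of URLs and collects title and text for each. It is an assumption about how the pieces could be glued together, not the project's actual code: collect_articles and extract_with_newspaper are hypothetical names, and only the Newspaper3k calls (Article, download, parse, title, text) come from that library.

```python
def extract_with_newspaper(url):
    """Download one page and pull out its title and main text with Newspaper3k."""
    # Imported lazily so the rest of the sketch runs without newspaper3k installed.
    from newspaper import Article

    article = Article(url)
    article.download()
    article.parse()
    return {"url": url, "title": article.title, "text": article.text}


def collect_articles(urls, extract=extract_with_newspaper):
    """Run the extractor over every URL once, keeping failures separate.

    Returns (articles, failures): a list of extracted dicts and a list of
    (url, error message) pairs for pages that could not be processed.
    """
    articles, failures = [], []
    for url in dict.fromkeys(urls):  # drop duplicate links, keep first-seen order
        try:
            articles.append(extract(url))
        except Exception as exc:  # broad on purpose: dead links, timeouts, odd HTML
            failures.append((url, str(exc)))
    return articles, failures
```

Passing the extractor in as a parameter makes it easy to swap in a different one for sites that Newspaper3k handles poorly.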
There are a couple of shortcomings as it stands. Extracting text from webpages relies on Newspaper3k, which doesn't do a very good job of extracting text from pages like arXiv.org. There's also no way to handle PDF files. Edit: I've now added proper support for arXiv.org links. You can now find favorited arXiv papers by title, abstract, and author.
As for further improvements, I've been playing with NLP to do automatic categorization and summarization of articles. Also, it may be useful to turn this tool into a more accessible service instead of requiring users to install it on their machines.
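One way to special-case arXiv links is to recognize them by URL and pull out the paper ID, which can then be looked up through arXiv's public API to get title, abstract, and authors. The sketch below shows only that recognition step. It is a guess at the approach, not the code the tool uses, and arxiv_id and arxiv_api_url are hypothetical helper names.

```python
import re
from urllib.parse import urlparse

# Matches /abs/<id> and /pdf/<id>, for new-style IDs (1706.03762) and
# old-style IDs (hep-th/9901001), with an optional version suffix or ".pdf".
_ARXIV_PATH = re.compile(
    r"^/(?:abs|pdf)/"
    r"(?P<id>\d{4}\.\d{4,5}|[a-z\-]+(?:\.[A-Z]{2})?/\d{7})"
    r"(?:v\d+)?(?:\.pdf)?/?$"
)


def arxiv_id(url):
    """Return the arXiv paper ID (without version) for an arXiv URL, else None."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if host != "arxiv.org" and not host.endswith(".arxiv.org"):
        return None
    match = _ARXIV_PATH.match(parsed.path)
    return match.group("id") if match else None


def arxiv_api_url(paper_id):
    """Build the arXiv API query URL that returns metadata for one paper."""
    return "http://export.arxiv.org/api/query?id_list=" + paper_id
```

The API responds with an Atom feed, so the title, abstract, and author names would still need to be parsed out of the XML before they can be indexed for search.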