APIs are ubiquitous. You can do just about anything with them today, which is pretty awesome, and solve problems that used to require technology only a few people had. I decided to solve an annoying problem I have every morning when I drive to work: I want someone to summarize the front page of Hacker News for me while I drive.
Listening to computer-generated speech always seemed like a terrible idea to me because it hurt my ears. However, as it turns out, Google’s newly released WaveNet-based Text-to-Speech technology is good enough to listen to for 15 minutes. And if that’s the case, then listening to a summary of the top links can actually be practical and even enjoyable.
To do this, I wrote a Python script that does the following:
- Fetches the URLs of the day’s best Hacker News stories using the open Hacker News API.
- Summarizes them using an article extraction API (I used Aylien, which I did not know about until I googled for an article extraction and summarization API)
- Uses Google’s Text-to-Speech engine on the title and summary
- Stitches all results into one mp3 file
- Uploads it to Google Cloud Storage
- Creates a Podcast RSS feed
So, let’s dig into how it works:
Getting the news
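We start by getting the data we want to listen to: a headline and a summary for each news item. The results are cached in a JSON file named after today’s date, so rerunning the script on the same day skips the API calls.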
today = datetime.date.today().isoformat()
news_file = 'news_data/news_data_%s.json' % today
logging.info('getting news data...')
if not os.path.exists(news_file):
    # No cache for today yet: fetch and summarize, then save to disk.
    news_data = get_news_data(get_best_hn_urls(NUMBER_ARTICLES))
    with open(news_file, 'w') as f:
        json.dump(news_data, f)
else:
    # Reuse today's cached results.
    with open(news_file) as f:
        news_data = json.load(f)
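One thing the snippet above takes for granted is that the news_data directory already exists. Here is a minimal sketch of the same cache-or-fetch idea wrapped in a helper that creates the directory on a cache miss. The name load_or_build and its parameters are hypothetical, not part of the original script, and build stands in for whatever produces the day's data.

```python
import datetime
import json
import os


def load_or_build(cache_dir, build, day=None):
    """Return cached data for `day`, calling build() only on a cache miss.

    Hypothetical helper: `build` is any zero-argument callable that
    returns JSON-serializable data.
    """
    day = day or datetime.date.today().isoformat()
    path = os.path.join(cache_dir, 'news_data_%s.json' % day)
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    data = build()
    # Create the cache directory on first use so open() does not fail.
    os.makedirs(cache_dir, exist_ok=True)
    with open(path, 'w') as f:
        json.dump(data, f)
    return data
```

With something like this, the block above would shrink to a single call, for example load_or_build('news_data', lambda: get_news_data(get_best_hn_urls(NUMBER_ARTICLES))).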
Getting the URLs we want to scrape is done using the Hacker News API, which does not require any authentication:
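def get_best_hn_urls(num=10):
    # The best-stories endpoint returns a list of item IDs.
    top_items = requests.get(BEST_STORIES_API).json()
    links = []
    for item in top_items[:num]:
        # Look up each item to find the page it links to.
        item_data = requests.get(STORY_API % item).json()
        # Text-only posts have no 'url' field, so skip them.
        if 'url' in item_data:
            links.append(item_data['url'])
    return links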
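The function above relies on two constants that are not shown. As a rough sketch, here is what they could look like against the public Hacker News Firebase API, together with a variant of the function that takes its fetcher as a parameter so it can be tried without network access. The constant values are my assumption about what the script uses, fetch_json and best_story_urls are hypothetical names, and the sketch uses urllib from the standard library instead of requests to stay dependency-free.

```python
import json
import urllib.request

# Assumed endpoint values; the script's own constants are not shown.
HN_API_BASE = 'https://hacker-news.firebaseio.com/v0'
BEST_STORIES_API = HN_API_BASE + '/beststories.json'
STORY_API = HN_API_BASE + '/item/%s.json'


def fetch_json(url, timeout=10):
    """Hypothetical helper: GET a URL and decode its JSON body."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def best_story_urls(num=10, fetch=fetch_json):
    """Return the external URLs among the first `num` best stories."""
    story_ids = fetch(BEST_STORIES_API)
    links = []
    for story_id in story_ids[:num]:
        item = fetch(STORY_API % story_id)
        # Text-only posts (e.g. Ask HN) have no 'url' field; skip them.
        if item and item.get('url'):
            links.append(item['url'])
    return links
```

Note that the result can be shorter than num, since stories without a link are skipped rather than replaced.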