
Building a Google News Scraper in 10 Minutes


Introduction

We learned about this project when a trader wanted to build a productivity app that emailed him news articles relevant to his investment strategies. Because keywords like "NVIDIA" and "GDPR" generate a lot of noise (imagine getting every article written about NVIDIA these days), he needed something more precise that could handle a range of topics and concepts, from a specific company's earnings to global events.

Since he wanted to cover thousands of articles each day, using GPT-4 and related APIs was cost-prohibitive. At $0.06/1K tokens, he'd be looking at $60-100/day, or $1,800-3,000/month, which was too much for a productivity tool (source).

Instead, he built this Python app in an hour using Tanuki, and we want to show you how to do the same. If you're impatient, you can find the repo and use case in the following links:

Getting Started

To scope out the project, we'll use the following requirements:

  • The user specifies the topics they care about
  • The LLM analyzes news articles from RSS feeds and identifies those relevant to the topics
  • The app emails each relevant article, its summary, and its impact to the user

This should allow the trader to focus on the highest-impact articles while skimming the summaries of the rest for notable headlines. To start, we'll need to set up some environment variables with your OpenAI API key and the AWS credentials used by SES for sending emails.

OPENAI_API_KEY=sk-XXX
AWS_SECRET_ACCESS_KEY=XXX
AWS_ACCESS_KEY_ID=XXX
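
The send_email helper we'll import later lives in a utils module in the repo; a minimal version of such a helper, assuming boto3 and a sender address already verified in SES, could look like this (the sender address and region default are placeholders, not the repo's actual values):

import os

import boto3


def send_email(subject: str, body: str, recipient: str):
    """Send a plain-text email via AWS SES; the Source address must be verified in SES."""
    ses = boto3.client("ses", region_name=os.environ.get("AWS_REGION", "us-east-1"))
    ses.send_email(
        Source="alerts@example.com",  # placeholder: replace with your verified sender
        Destination={"ToAddresses": [recipient]},
        Message={
            "Subject": {"Data": subject},
            "Body": {"Text": {"Data": body}},
        },
    )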

For a quick implementation, we'll use the Google News RSS feed and extract the text of each article from its URL using Selenium.

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from webdriver_manager.chrome import ChromeDriverManager


def parse_article_with_selenium(url: str) -> str:
    # Configure Chrome options (the user-agent helper is defined elsewhere in the repo)
    options = configure_selenium_user_agent()
    service = Service(ChromeDriverManager().install())
    driver = webdriver.Chrome(service=service, options=options)

    try:
        driver.get(url)
        # Wait up to 10 seconds for the page body to load
        WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.TAG_NAME, "body")))
        soup = BeautifulSoup(driver.page_source, 'html.parser')
        # Return the page text with the tags stripped
        article_text = soup.get_text(separator=' ', strip=True)
        return article_text
    finally:
        driver.quit()
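
The URLs themselves can come from the Google News RSS feed for each topic. Here's a minimal sketch using feedparser; the helper name fetch_article_urls is ours, not necessarily what the repo uses.

from typing import List
from urllib.parse import quote_plus

import feedparser


def fetch_article_urls(topic: str, max_articles: int = 20) -> List[str]:
    """Fetch candidate article URLs from the Google News RSS feed for a given topic."""
    feed_url = f"https://news.google.com/rss/search?q={quote_plus(topic)}"
    feed = feedparser.parse(feed_url)
    # Each entry links to a Google News page that redirects to the publisher's article
    return [entry.link for entry in feed.entries[:max_articles]]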

Building a Scraper using Tanuki

With the text of the relevant articles extracted, we'll need a function to summarize each article and pull out the key information we care about, such as:

  • Impact - how important is this event? A new acquisition? Huge. A new office opening in Pittsburgh? Not so much.
  • Sentiment - is this positive or negative news?
  • Date - when did this happen?
  • Companies involved - what are the companies involved in this news?
  • People - who are the key people involved in this event?
  • Summary - article synthesized into a 1-2 sentence tldr so the trader can easily skim

As mentioned before, you could use GPT-4 with prompts if cost were not a factor, although getting typed outputs would require additional work. Or, if you had a few weeks, you could fine-tune an open-source LLM to handle the task. Instead, we'll use Tanuki.

We'll start with @tanuki.patch to define the function, along with an ArticleSummary class specifying the info we want to extract. This ensures that the outputs from analyze_article are well-typed and ready to be packaged into an email.

import tanuki


@tanuki.patch
def analyze_article(html_content: str, subject: str) -> ArticleSummary:
    """
    Analyzes the article's HTML content and extracts information relevant to the subject.
    """

import datetime
from typing import List

from pydantic import BaseModel, ConfigDict, Field


# Define a Pydantic model of an article summary
class ArticleSummary(BaseModel):
    model_config = ConfigDict(arbitrary_types_allowed=True)

    impact: int = Field(..., ge=0, le=10)
    sentiment: float = Field(..., ge=-1.0, le=1.0)
    date: datetime.date
    companies_involved: List[str]
    people_involved: List[str]
    summary: str
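
With the patched function and the model in place, calling it looks like calling any other typed Python function. The snippet below is illustrative, reusing the scraper from earlier with a placeholder URL:

url = "https://example.com/some-nvidia-article"  # placeholder URL
article_text = parse_article_with_selenium(url)
summary = analyze_article(article_text, "nvidia")  # returns a typed ArticleSummary
print(summary.impact, summary.sentiment, summary.summary)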

To ensure reliable performance, we'll use @tanuki.align to assert the intended behavior. Here's one example using the announcement of NVIDIA's acquisition of Arm:

@tanuki.align
def align_analyze_article():
    html_content = "<head></head><body><p>Nvidia has made the terrible decision to buy ARM for $40b on 8th November. This promises to "\
        "be an extremely important decision for the industry, even though it creates a monopoly.</p></body> "
    assert analyze_article(html_content, "nvidia") == ArticleSummary(
        impact=10,
        sentiment=-0.9,
        date=datetime.date(2023, 11, 8),
        companies_involved=["Nvidia", "ARM"],
        people_involved=[],
        summary="Nvidia is acquiring ARM for $40 billion, which will have a huge impact on the semiconductor industry.",
    )

Asserts like the one above reduce the likelihood of hallucinations and unexpected failures by aligning the LLM with the intended behavior. This is called "test-driven alignment" (an offshoot of test-driven development), and we'll talk more about it later.

Emailing Everything to Ourselves

Now, we can write a quick classical function to email the articles (if there are any) to an address of our choice. Using cron, we can schedule it to run hourly or every 30 minutes so we don't miss anything.

from utils import send_email


def email_if_relevant(relevant_articles: List[ArticleSummary], search_term: str, recipient: str):
    """
    Sends an email if relevant articles were found relating to the search term.
    :param relevant_articles: A list of relevant articles extracted from a website.
    :param search_term: The topic the articles were matched against.
    :param recipient: The email address to notify.
    """
    if relevant_articles:
        subject = f"Summary of Important Articles about {search_term}"
        body = f"The following articles about {search_term} have high impact and negative sentiment:\n\n"
        for summary in relevant_articles:
            body += f"- {summary.summary} (Impact: {summary.impact}, Sentiment: {summary.sentiment})\n"

        send_email(subject, body, recipient)
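
Putting it all together, the job that cron invokes could look something like the following. The fetch_article_urls helper comes from the earlier RSS sketch, and the impact/sentiment thresholds are arbitrary choices for illustration:

def run_once(topics: List[str], recipient: str):
    for topic in topics:
        relevant_articles = []
        for url in fetch_article_urls(topic):
            text = parse_article_with_selenium(url)
            summary = analyze_article(text, topic)
            # Keep only high-impact, negative-sentiment articles (thresholds are illustrative)
            if summary.impact >= 7 and summary.sentiment <= -0.5:
                relevant_articles.append(summary)
        email_if_relevant(relevant_articles, topic, recipient)

A crontab entry along the lines of */30 * * * * python /path/to/main.py (the path and filename are placeholders) then runs it every 30 minutes.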

Results

To get sufficient accuracy (>90%), we had to create 12+ asserts, which took about 15 minutes of trial and error to get right. We then ran the app across 20 topics the trader follows and evaluated the relevance and accuracy of 200 articles.

In this qualitative study, the trader labeled 93% of the outputs for these 200 articles as "helpful" (we chose "helpful" over "accurate", since a statement can be accurate yet unhelpful to the end user). As we develop a more robust test set, we will publish the "helpful" performance across different numbers of assertions.

Limitations and Biases

While using this app, we noticed two limitations: 1) decreased accuracy on very long articles, and 2) a dependency on Selenium (and therefore on Chrome being installed on the machine) to scrape the articles surfaced by Google News search.

With very long articles (>10K tokens), the outputs decreased in accuracy and relevance, likely due to the sheer volume of people, companies, and topics mentioned. The quick fix was simply to skip such articles: for major events, there are plenty of shorter pieces that cover the key details, as opposed to exhaustive analysis and opinion, and a rough length guard like the one below is enough.
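
A sketch of such a guard, assuming roughly four characters per token as a back-of-the-envelope estimate (the threshold and helper name are ours):

MAX_ARTICLE_TOKENS = 10_000


def is_too_long(article_text: str) -> bool:
    # Back-of-the-envelope estimate: ~4 characters per token for English text
    approx_tokens = len(article_text) / 4
    return approx_tokens > MAX_ARTICLE_TOKENS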

At times, we were left fighting the search capabilities of Google News, which either returned too many articles or too few. One way to mitigate this is a high-pass filter: another Tanuki function that only passes through articles that are highly relevant to the topics of interest, sketched below.
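
Here's what such a filter could look like, again as a patched and aligned Tanuki function (the function name, docstring, and example asserts are ours, not the repo's):

@tanuki.patch
def is_relevant(article_text: str, topic: str) -> bool:
    """
    Returns True only if the article is highly relevant to the topic,
    not merely mentioning it in passing.
    """


@tanuki.align
def align_is_relevant():
    assert is_relevant("Nvidia beat earnings expectations on record data-center revenue.", "nvidia earnings") == True
    assert is_relevant("A local gaming cafe upgraded its PCs with Nvidia GPUs.", "nvidia earnings") == False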

What's Next

While this was just a small example, it demonstrates just how easily developers can create LLM-powered functions and apps.

For the next few weeks, we'll be focusing on automatically enabling developers to measure the accuracy of their functions and providing a better sense of how many asserts are required to generate reliable performance.

If this sounds interesting and you'd love to learn more (or better yet, get involved), please join our Discord.

Talk to you guys soon!