Command-line tool for web scraping and embedding generation for NLP/LLM applications

vmatekole/sisfus

"Sisfus" (Code name)

(Under Development)

Sisfus is a command-line tool for web scraping and embedding generation, designed to make web content available for NLP and LLM applications.

It is built as a suite of Python classes that use [Scrapy](https://scrapy.org/) to scrape content from the following sources:

  • bbc.co.uk

Embedding models supported:

  • text-embedding-3-small (OpenAI)
  • text-embedding-3-large (OpenAI)
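As a rough illustration of how text might be sent to one of these models, here is a minimal sketch. The `embed_texts` helper and the `SUPPORTED_MODELS` constant are hypothetical names, not part of Sisfus. The sketch assumes a client object that exposes `embeddings.create(model=..., input=...)`, as the official `openai` Python client does.

```python
from typing import Any, List, Sequence

# The two models listed above; the constant name is an assumption.
SUPPORTED_MODELS = ("text-embedding-3-small", "text-embedding-3-large")


def embed_texts(
    client: Any,
    texts: Sequence[str],
    model: str = "text-embedding-3-small",
) -> List[List[float]]:
    """Return one embedding vector per input text (hypothetical helper)."""
    if model not in SUPPORTED_MODELS:
        raise ValueError(f"Unsupported embedding model: {model!r}")
    if not texts:
        # Nothing to embed, so skip the API call entirely.
        return []
    # One request for the whole batch; results come back in input order.
    response = client.embeddings.create(model=model, input=list(texts))
    return [item.embedding for item in response.data]
```

With the official client this could be called as `embed_texts(OpenAI(), ["some scraped text"])` after `from openai import OpenAI`; that client reads the API key from the `OPENAI_API_KEY` environment variable.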

Scraped content is parsed and validated into a set of Pydantic models and then persisted to BigQuery.
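The validation step could look roughly like the sketch below. The `Article` model, its fields, and the `parse_article` helper are assumptions made for illustration and do not reflect Sisfus's actual schema. Persisting to BigQuery is left out.

```python
from datetime import datetime
from typing import Optional

from pydantic import BaseModel, Field, ValidationError


class Article(BaseModel):
    """Hypothetical shape of one scraped article."""

    url: str = Field(min_length=1)
    title: str = Field(min_length=1)
    body: str = Field(min_length=1)
    # ISO 8601 strings are coerced to datetime by Pydantic.
    published_at: Optional[datetime] = None


def parse_article(raw: dict) -> Optional[Article]:
    """Return a validated Article, or None if the scraped item is malformed."""
    try:
        return Article(**raw)
    except ValidationError:
        # Missing or empty required fields end up here.
        return None
```

Returning `None` for malformed items is one possible design choice: it lets a scraping pipeline drop bad records without stopping the crawl. Raising the error instead would suit stricter pipelines.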
