RAG (Retrieval-Augmented Generation)

RAG is technically a combination of two different architectures: retrieval-based models and generative models. The retrieval-based component is responsible for fetching relevant documents from a large corpus based on the input query, while the generative component uses these documents to generate a coherent and contextually relevant response.
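The two-stage flow can be sketched in plain Python. This is a deliberately simplified toy: the retriever scores documents by word overlap instead of vector similarity, and the "generator" is a stand-in for an LLM; the function names and corpus are illustrative, not part of any library.

```python
corpus = [
    "Paris is the capital of France.",
    "The Eiffel Tower is located in Paris.",
    "Python is a popular programming language.",
]

def retrieve(query: str, docs: list[str]) -> str:
    """Retrieval component: return the document sharing the most words with the query."""
    q_words = set(query.lower().split())
    return max(docs, key=lambda d: len(q_words & set(d.lower().split())))

def generate(query: str, context: str) -> str:
    """Generative component (stand-in for an LLM): answer grounded in the retrieved context."""
    return f"Based on: '{context}' -> answer to '{query}'"

context = retrieve("What is the capital of France?", corpus)
print(generate("What is the capital of France?", context))
```

In a real RAG system the retriever compares embeddings in a vector store and the generator is a language model prompted with the retrieved context; the overall shape, retrieve then generate, is the same.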

Benefits of using RAG

  • Improved Accuracy: By leveraging external knowledge sources, RAG can provide more accurate and contextually relevant answers compared to traditional generative models that rely solely on their training data.
  • Use Up-to-Date Information: Since RAG retrieves information from external sources, it can provide answers based on the most current data available, which is particularly useful for dynamic fields like news or scientific research.
  • Better Privacy: RAG can be designed to retrieve information from specific, controlled datasets, reducing the risk of exposing sensitive information that might be present in a large pre-trained model.
  • No Limit of Document Size: RAG can handle large documents by breaking them into smaller chunks for retrieval, allowing it to work with extensive texts without being constrained by the model’s input size limitations.
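The chunking idea behind the last point can be shown with a minimal character-based splitter. This is a sketch under simple assumptions (fixed chunk size, fixed overlap); LangChain's text splitters are more sophisticated, and the function name and parameters here are illustrative.

```python
def chunk_text(text: str, chunk_size: int = 50, overlap: int = 10) -> list[str]:
    """Split text into fixed-size character chunks with overlap,
    so no single chunk exceeds the model's input limit."""
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
    return chunks

doc = "word " * 40            # a 200-character "document"
pieces = chunk_text(doc)
print(len(pieces))            # several chunks instead of one long text
print(max(len(p) for p in pieces))  # none longer than chunk_size
```

The overlap keeps a little shared context between adjacent chunks, which helps retrieval when a relevant sentence straddles a chunk boundary.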

Components of RAG

  • Document Loader
  • Text Splitter
  • Vector Store
  • Retriever

Document Loader

Document Loaders are responsible for loading documents from various sources such as local files, databases, or web pages into a standard format (usually Document objects), which can then be used for chunking, embedding, retrieval, generation, and so on.

The most used document loaders in LangChain are:

  • TextLoader: Loads plain text files.
  • PyPDFLoader: Loads PDF files.
  • WebBaseLoader: Loads web pages.
  • CSVLoader: Loads CSV files.

Example of Document Object:

Document(
    page_content="This is the content of the document.",
    metadata={"source": "example.txt", "page": 1}
)

Note: Document loaders live in the langchain-community package, so you need to install it first:

pip install langchain-community

Text Loader

TextLoader reads plain text (.txt) files and converts them into Document objects.

Example:

from langchain_community.document_loaders import TextLoader
# Create a TextLoader instance for a plain text file
loader = TextLoader("example.txt", encoding="utf-8")
# Load the document
loaded_text = loader.load()
print(loaded_text[0].page_content)
print(loaded_text[0].metadata)

Note: All loaders return a list of Document objects.

PDF Loader

PyPDFLoader reads PDF (.pdf) files and converts them into Document objects, one per page. Example:

from langchain_community.document_loaders import PyPDFLoader

# Create a PyPDFLoader instance for a PDF file
loader = PyPDFLoader("1.pdf")

# Load the document
loaded_text = loader.load()


for page in loaded_text:
    print(page.page_content)
    print(page.metadata)
print(f"Total pages loaded: {len(loaded_text)}")

Note that PyPDFLoader only extracts text from text-based PDFs; for scanned or image-based PDFs, choose a different loader as required:

  • PyPDFLoader: Loads text-based PDF files.
  • UnstructuredPDFLoader: Loads both text and image-based PDF files using OCR.

Directory Loader

DirectoryLoader loads multiple files from a directory or folder, applying a specified loader class to each matched file.

from langchain_community.document_loaders import DirectoryLoader, PyPDFLoader
loader = DirectoryLoader(
    'path/to/your/folder',
    glob='**/*.pdf',  # Pattern to match files (here: all PDFs, recursively)
    loader_cls=PyPDFLoader,  # Loader class used for each matched file
)
documents = loader.load()
for doc in documents:
    print(doc.page_content)
    print(doc.metadata)
print(f"Total documents loaded: {len(documents)}")

Example:

from langchain_community.document_loaders import DirectoryLoader, PyPDFLoader

# Create a DirectoryLoader that applies PyPDFLoader to every PDF in "docs"
loader = DirectoryLoader(
    "docs",
    glob="*.pdf",
    loader_cls=PyPDFLoader,
)

# Load the documents
loaded_text = loader.load()


for page in loaded_text:
    print(page.page_content)
    print(page.metadata)
print(f"Total pages loaded: {len(loaded_text)}")

load vs lazy_load

  • load(): This method loads all documents into memory at once and returns them as a list of Document objects. It is suitable for smaller datasets where memory consumption is not a concern.
  • lazy_load(): This method loads documents one at a time as they are needed, which is more memory-efficient for large datasets. It returns a generator that yields Document objects on the fly.

Example:

from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("1.pdf")
for page in loader.lazy_load():  # pages are yielded one at a time
    print(page.page_content)
    print(page.metadata)
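The difference between the two methods can be illustrated with plain Python. This is a toy sketch of the semantics, not LangChain's actual implementation; the function names are made up for illustration.

```python
def load_all(paths):
    """Like load(): build the full list in memory before returning."""
    return [f"Document({p})" for p in paths]

def lazy_load_all(paths):
    """Like lazy_load(): yield one document at a time."""
    for p in paths:
        yield f"Document({p})"

paths = ["a.txt", "b.txt", "c.txt"]
eager = load_all(paths)        # a list: every document already in memory
lazy = lazy_load_all(paths)    # a generator: nothing loaded yet
print(type(eager).__name__, type(lazy).__name__)  # list generator
```

Iterating over the generator produces the same documents as the list, but only one is materialized at a time, which is what makes lazy_load() suitable for large collections.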

Web Loader

WebBaseLoader fetches web pages from the internet and converts them into Document objects. Example:

from langchain_community.document_loaders import WebBaseLoader
# Create a WebBaseLoader instance for loading web pages
web_loader = WebBaseLoader("https://example.com")
# Load the document
loaded_web = web_loader.load()
for page in loaded_web:
    print(page.page_content)
    print(page.metadata)
print(f"Total pages loaded: {len(loaded_web)}")

CSV Loader

CSVLoader reads CSV (.csv) files and converts them into Document objects, one per row. Example:

from langchain_community.document_loaders import CSVLoader
# Create a CSVLoader instance for loading CSV files
csv_loader = CSVLoader(file_path="data.csv", encoding="utf-8")
# Load the document
loaded_csv = csv_loader.load()
for page in loaded_csv:
    print(page.page_content)
    print(page.metadata)
print(f"Total rows loaded: {len(loaded_csv)}")

For other document loaders, please refer to the official documentation.

Creating Your Own Document

We can create our own Document object using the following code:

from langchain_core.documents import Document
doc = Document(
    page_content="This is the content of the document.",
    metadata={"source": "example.txt", "page": 1}
)