A new web application makes the research presented at the Neural Information Processing Systems (NeurIPS) 2024 conference easier to explore: users can search for papers of interest and hold interactive conversations about them with Large Language Models (LLMs). By combining Cerebras Inference with PostgreSQL's full-text and vector search capabilities, the app turns information retrieval and synthesis into a fast, near-instant chat experience.
Addressing the Bottleneck: Inference in RAG
Retrieval-Augmented Generation (RAG) is a powerful tool for creating applications that help users navigate large datasets, such as the numerous papers available at NeurIPS. RAG allows users to interact with data in a question-and-answer format, making it easier to digest complex information. However, a significant challenge in RAG is the speed of inference. Traditional implementations of RAG can struggle with latency, especially when dealing with large contexts or complex queries. This delay can hinder the user experience by making real-time exploration cumbersome and interrupting the natural flow of discovery. To ensure RAG is effective at scale, particularly for a substantial dataset like NeurIPS papers, the process of inference must be swift and efficient, eliminating unnecessary delays.
Constructing a Vector Database for NeurIPS Papers
Creating a fast and responsive chat experience for the wealth of papers at NeurIPS necessitated pre-indexing these documents in a vector database. To achieve this, thousands of unstructured academic paper PDFs were processed into a structured and searchable vector database, hosted on Supabase.
Data Collection Process
The initial phase involved gathering all NeurIPS 2024 papers. By examining the JSON data structure of the NeurIPS paper directory through browser developer tools, a comprehensive list of papers was compiled, complete with metadata such as titles, authors, and PDF links. The PDFs were then downloaded from both the NeurIPS directory and arXiv, a major repository for many of these papers.
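As a rough illustration of this step, the sketch below fetches a paper listing and downloads each linked PDF. The listing URL and JSON field names are placeholders, since the real structure was discovered from the NeurIPS site's own JSON via browser developer tools.

```ts
// Sketch: download a NeurIPS 2024 paper listing and fetch each PDF.
// The URL and field names below are assumptions for illustration only.
import { writeFile, mkdir } from "node:fs/promises";

const LISTING_URL = "https://neurips.cc/virtual/2024/papers.json"; // hypothetical endpoint

interface PaperEntry {
  title: string;
  authors: string[];
  pdf_url: string; // assumed field name
}

async function downloadPapers() {
  const listing: PaperEntry[] = await fetch(LISTING_URL).then((r) => r.json());
  await mkdir("pdfs", { recursive: true });

  for (const paper of listing) {
    const pdf = await fetch(paper.pdf_url);
    const bytes = Buffer.from(await pdf.arrayBuffer());
    // Use a filesystem-safe version of the title as the file name.
    const safeName = paper.title.replace(/[^a-z0-9]+/gi, "_").slice(0, 100);
    await writeFile(`pdfs/${safeName}.pdf`, bytes);
  }
}

downloadPapers().catch(console.error);
```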
Preparing Data for Search
To make the papers searchable by an LLM, the unstructured PDFs had to be converted into a format that could be effectively indexed and queried. LlamaIndex was instrumental in managing and streamlining the preprocessing steps, ensuring a smooth transition from text extraction to embedding creation. Here’s a breakdown of the process:
- Text Extraction and Chunking: The raw text was extracted from the PDFs and divided into manageable chunks suitable for semantic search. Each chunk was kept small enough for efficient retrieval while retaining enough context to be meaningful.
- Metadata Enrichment: Each chunk was enriched with metadata, including the paper title, author names, and arXiv ID, to improve the search experience. This allows the system to give more context-aware, informative responses; for instance, users can ask "Who authored this paper?" and receive a precise answer, which would be difficult without metadata.
- Embedding Creation: Each chunk was then transformed into an embedding, a high-dimensional vector that captures the semantic content of the text. BAAI's bge-large-en-v1.5 model, hosted on Hugging Face, was used to generate a 1024-dimensional embedding per chunk. Using a pre-trained model avoided the complexity of training a new one (see the sketch after this list).
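As a minimal sketch of the chunking and embedding steps, the code below uses a naive fixed-size splitter and calls bge-large-en-v1.5 through the Hugging Face Inference API. The project itself used LlamaIndex's preprocessing pipeline, so treat this as an illustration of the idea rather than the exact implementation; the `Chunk` shape and chunk sizes are assumptions.

```ts
// Illustrative chunk-and-embed step (not the project's LlamaIndex pipeline).
import { HfInference } from "@huggingface/inference";

const hf = new HfInference(process.env.HF_TOKEN);

interface Chunk {
  text: string;
  metadata: { title: string; authors: string[]; arxivId?: string };
  embedding?: number[];
}

// Naive fixed-size splitter with overlap; sentence-aware splitters (as in
// LlamaIndex) keep chunks more semantically coherent.
function chunkText(text: string, size = 1000, overlap = 200): string[] {
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += size - overlap) {
    chunks.push(text.slice(start, start + size));
  }
  return chunks;
}

async function embedPaper(fullText: string, metadata: Chunk["metadata"]) {
  const chunks: Chunk[] = chunkText(fullText).map((text) => ({ text, metadata }));
  for (const chunk of chunks) {
    // bge-large-en-v1.5 returns a 1024-dimensional embedding per input.
    const embedding = (await hf.featureExtraction({
      model: "BAAI/bge-large-en-v1.5",
      inputs: chunk.text,
    })) as number[];
    chunk.embedding = embedding;
  }
  return chunks;
}
```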
Storing the Data in Supabase
Supabase was chosen for its robust ecosystem built around PostgreSQL, which provides vector storage and efficient similarity search through the pgvector extension. This made it possible to implement fast, scalable semantic retrieval without building custom infrastructure, and PostgreSQL's SQL interface made querying and inspecting the data straightforward. The database schema included columns for a unique identifier, the text chunk, metadata, a node identifier, and the embedding.
The embeddings were stored in a dedicated vector column and indexed with the Hierarchical Navigable Small World (HNSW) algorithm for rapid approximate nearest-neighbor search. The combination of PostgreSQL and vector indexing supports both traditional keyword search and vector-based semantic search, providing flexibility and speed.
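The sketch below shows what such a schema and HNSW index can look like, created here with node-postgres and the pgvector extension. The table and column names are illustrative, not the project's exact schema.

```ts
// Sketch of a pgvector-backed table and HNSW index (illustrative names).
import { Client } from "pg";

async function createSchema() {
  const client = new Client({ connectionString: process.env.DATABASE_URL });
  await client.connect();

  await client.query(`CREATE EXTENSION IF NOT EXISTS vector`);

  await client.query(`
    CREATE TABLE IF NOT EXISTS paper_chunks (
      id        BIGSERIAL PRIMARY KEY,
      node_id   TEXT,                -- node identifier from preprocessing
      content   TEXT NOT NULL,       -- the chunk of paper text
      metadata  JSONB,               -- title, authors, arXiv ID, ...
      embedding VECTOR(1024)         -- bge-large-en-v1.5 output
    )
  `);

  // HNSW index for fast approximate nearest-neighbor search with cosine distance.
  await client.query(`
    CREATE INDEX IF NOT EXISTS paper_chunks_embedding_idx
    ON paper_chunks USING hnsw (embedding vector_cosine_ops)
  `);

  await client.end();
}

createSchema().catch(console.error);
```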
Retrieval-Augmented Generation (RAG) Workflow
Once the external data is encoded and stored, it is ready to be retrieved at inference time, when the model generates responses to user questions. The Vercel AI SDK was used for its pre-built components covering the main steps of the RAG workflow: embedding user queries, retrieving relevant content, and efficiently streaming LLM responses.
Cerebras proved to be fully compatible with the OpenAI SDK, so getting started required only swapping the API base URL and key. This compatibility made integration straightforward.
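In practice the swap looks something like the snippet below, which configures the Vercel AI SDK's OpenAI-compatible provider to talk to Cerebras. The base URL and model id shown are the commonly documented values at the time of writing and should be verified against the current Cerebras docs.

```ts
// Point an OpenAI-style client at Cerebras via the Vercel AI SDK's OpenAI provider.
import { createOpenAI } from "@ai-sdk/openai";

export const cerebras = createOpenAI({
  // Assumed OpenAI-compatible endpoint; verify against the Cerebras docs.
  baseURL: "https://api.cerebras.ai/v1",
  apiKey: process.env.CEREBRAS_API_KEY,
});

// cerebras("llama3.1-70b") then yields a model handle that the AI SDK's
// generateText / streamText helpers accept.
```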
Fetching Relevant Data
To answer a query like "What is novel about this paper?", relevant data must first be retrieved from the vector database so the LLM has the specific information it needs. The process begins by embedding the user's query with the BAAI/bge-large-en-v1.5 model, the same model used to embed the research papers. Using the same model ensures the query and the stored text chunks live in the same embedding space and can be compared meaningfully.
These embeddings are what make it possible to identify the information most relevant to an accurate, well-grounded response. A cosine similarity search is then performed with pgvector in PostgreSQL, computing the cosine distance between the query embedding and each stored chunk. The chunks with the smallest distances (that is, the highest similarity) are retrieved and assembled into the context for the LLM.
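A sketch of this retrieval step, assuming the illustrative `paper_chunks` table from earlier: the query is embedded with the same model, then the nearest chunks are fetched with pgvector's cosine-distance operator (`<=>`).

```ts
// Embed the question, then pull the closest chunks by cosine distance.
import { HfInference } from "@huggingface/inference";
import { Client } from "pg";

const hf = new HfInference(process.env.HF_TOKEN);

export async function retrieveContext(question: string, k = 5) {
  const queryEmbedding = (await hf.featureExtraction({
    model: "BAAI/bge-large-en-v1.5",
    inputs: question,
  })) as number[];

  const client = new Client({ connectionString: process.env.DATABASE_URL });
  await client.connect();

  // Smallest cosine distance = most similar chunk.
  const { rows } = await client.query(
    `SELECT content, metadata, embedding <=> $1::vector AS distance
     FROM paper_chunks
     ORDER BY embedding <=> $1::vector
     LIMIT $2`,
    [JSON.stringify(queryEmbedding), k]
  );

  await client.end();
  return rows; // [{ content, metadata, distance }, ...]
}
```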
Augmented Generation
Once the relevant snippets are collected, the context and the user's query are sent to the LLM (Llama 3.1 70B running on Cerebras). The LLM generates an informative response based on the provided context, highlighting key aspects of the paper.
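The sketch below shows one way to wire this up as a route handler: the retrieved chunks are folded into the system prompt and the answer is streamed back. It reuses the hypothetical `cerebras` provider and `retrieveContext` helper sketched above, and exact streaming method names vary between AI SDK versions.

```ts
// Route handler sketch: retrieve context, then stream the model's answer.
import { streamText } from "ai";
import { cerebras } from "./cerebras";          // provider sketched earlier (illustrative path)
import { retrieveContext } from "./retrieve";   // retrieval helper sketched earlier (illustrative path)

export async function POST(req: Request) {
  const { messages } = await req.json();
  const question = messages[messages.length - 1].content;

  // Pull the most similar chunks and join them into a single context string.
  const chunks = await retrieveContext(question);
  const context = chunks.map((c: { content: string }) => c.content).join("\n---\n");

  const result = await streamText({
    model: cerebras("llama3.1-70b"),
    system:
      "Answer questions about the paper using only the context below.\n\n" + context,
    messages,
  });

  // Method name depends on the AI SDK version (e.g. toDataStreamResponse / toAIStreamResponse).
  return result.toDataStreamResponse();
}
```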
Switching from a GPU-based provider to Cerebras sped up inference dramatically, roughly 17x, and cut the average response time from 8.5 seconds to under 2 seconds. This reduction in latency directly improved the user experience, allowing smoother interactions and more queries per session.
Looking Ahead: Future Enhancements
The current implementation is only a starting point; several features are being considered for future development:
- Dynamic Question Generation: Tailoring questions for each paper to guide exploration and enhance understanding.
- Integrated PDF Viewer: Offering a seamless reading experience within the app.
- Social Sharing Capabilities: Allowing users to share insights and discoveries with others.
- Multi-Paper Chat Functionality: Enabling exploration of connections across multiple research papers.
- Conference Schedule Recommendations: Personalizing conference schedules based on user interests and queries.
For further details, you can explore the project on GitHub: Cerebras NeurIPS 2024 Chatbot. Additionally, a video demonstration is available: Demo Video, and you can visit the website: Cerebras Chatbot.
By harnessing cutting-edge technology and innovative data processing techniques, this web app is poised to revolutionize how researchers and enthusiasts interact with academic papers, fostering a more engaging and insightful exploration of scientific knowledge.