The integration of retrieval-augmented generation (RAG) with large language models (LLMs) represents a significant stride forward: it enhances the outputs of LLMs without requiring a complete retraining process. However, this method presents challenges, particularly around protecting sensitive information in the underlying data sources. Personally Identifiable Information (PII) is often embedded in these datasets, and its inadvertent disclosure poses serious security risks. In response to these concerns, OWASP identifies sensitive information disclosure as a key risk in its 2025 Top 10 Risks & Mitigations for LLMs and Gen AI Apps, and suggests data sanitization, access control, and encryption as mitigations.
A noteworthy application of these recommendations is the integration of HashiCorp Vault’s transit secrets engine to secure sensitive data before transmitting it to an Amazon Bedrock Knowledge Base established using Terraform. This process exemplifies the proactive measures being taken to address data security concerns in AI applications.
### Encryption and Protection of Sensitive Data
To illustrate this approach, consider a sample dataset of Airbnb vacation rentals. The dataset includes the names of the hosts, which most applications do not need. To protect this information, the application encrypts the host names with HashiCorp Vault before storing the rental listings in the database. As a result, any query against the Amazon Bedrock Knowledge Base returns encrypted host names, preventing PII leakage.
The demonstration uses an HCP Vault cluster with the transit secrets engine enabled. Host names are encrypted with a key named “listings” that has convergent encryption enabled. With convergent encryption, a given plaintext host name (combined with the same encryption context) always produces the same ciphertext. This consistency lets an LLM detect similarities between hosts across rental listings without ever revealing the actual host names.
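The key setup can be sketched with the hvac client. This is a minimal illustration, not the article's exact script: the `VAULT_ADDR`/`VAULT_TOKEN` environment variables and the one-time `main()` bootstrap are assumptions, while the key name “listings” comes from the article.

```python
import base64
import os


def b64(value: str) -> str:
    """Transit expects base64-encoded plaintext and context values."""
    return base64.b64encode(value.encode("utf-8")).decode("utf-8")


def main() -> None:
    import hvac  # HashiCorp Vault API client (pip install hvac)

    vault = hvac.Client(
        url=os.environ["VAULT_ADDR"], token=os.environ["VAULT_TOKEN"]
    )
    # Enable transit once, then create a derived key: convergent encryption
    # requires a derived key, so that the same plaintext with the same
    # context always yields the same ciphertext.
    vault.sys.enable_secrets_engine(backend_type="transit")
    vault.secrets.transit.create_key(
        name="listings",
        convergent_encryption=True,
        derived=True,
    )


# Calling main() requires a reachable Vault cluster and a valid token.
```

Because the key is derived, every encrypt call later must supply the same base64-encoded context; that context is what makes the ciphertexts comparable.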
### Setting Up the Encryption Process
This encryption process is applied to a CSV file containing vacation rentals in New York City from January 2025. The host names are treated as sensitive data and encrypted accordingly. A local script, built on the hvac Python client for HashiCorp Vault, calls Vault's encryption API endpoint. It processes the CSV file, encrypting each host name while preserving the other non-sensitive attributes, such as room type and listing ID, in plaintext.
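A sketch of such a script follows, under stated assumptions: the column name `host_name`, the file names `listings.csv`/`listings_encrypted.csv`, and the fixed context value are all hypothetical; the transit key “listings” is from the article.

```python
import base64
import csv
import os


def to_b64(value: str) -> str:
    return base64.b64encode(value.encode("utf-8")).decode("utf-8")


def encrypt_hosts(rows, encrypt_fn, column="host_name"):
    """Replace the sensitive column in each row with ciphertext from
    encrypt_fn; other attributes (room type, listing ID, ...) stay plaintext."""
    out = []
    for row in rows:
        row = dict(row)
        row[column] = encrypt_fn(row[column])
        out.append(row)
    return out


def main() -> None:
    import hvac

    vault = hvac.Client(
        url=os.environ["VAULT_ADDR"], token=os.environ["VAULT_TOKEN"]
    )
    # Convergent encryption needs the same context on every call.
    context = to_b64("listings")

    def vault_encrypt(plaintext: str) -> str:
        response = vault.secrets.transit.encrypt_data(
            name="listings",
            plaintext=to_b64(plaintext),
            context=context,
        )
        return response["data"]["ciphertext"]  # e.g. "vault:v1:..."

    with open("listings.csv", newline="") as f:
        rows = encrypt_hosts(csv.DictReader(f), vault_encrypt)

    with open("listings_encrypted.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=rows[0].keys())
        writer.writeheader()
        writer.writerows(rows)


# Calling main() requires a reachable Vault cluster with the "listings" key.
```

Keeping the encryption behind a small callable also makes the transformation easy to test without a live Vault cluster.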
### Uploading Encrypted Data
After encryption, each entry in the CSV file is uploaded to an S3 bucket. In this demonstration, LangChain converts each CSV record into a text file and uploads it to S3, so the documents land in the bucket with the sensitive fields already in ciphertext.
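One way this upload could look, assuming LangChain's `CSVLoader` (from `langchain_community`) and a hypothetical `KNOWLEDGE_BASE_BUCKET` environment variable and key scheme:

```python
import os


def object_key(prefix: str, index: int) -> str:
    """Deterministic S3 key per CSV record, e.g. listings/record-0007.txt."""
    return f"{prefix}/record-{index:04d}.txt"


def main() -> None:
    import boto3
    from langchain_community.document_loaders.csv_loader import CSVLoader

    # CSVLoader turns each CSV row into one Document whose page_content is a
    # "column: value" block -- one text object per rental listing.
    documents = CSVLoader(file_path="listings_encrypted.csv").load()

    s3 = boto3.client("s3")
    bucket = os.environ["KNOWLEDGE_BASE_BUCKET"]  # assumed env var
    for i, doc in enumerate(documents):
        s3.put_object(
            Bucket=bucket,
            Key=object_key("listings", i),
            Body=doc.page_content.encode("utf-8"),
        )


# Calling main() requires AWS credentials and an existing bucket.
```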
The script provided is intended for educational and testing purposes, but similar methods can be applied in production using Amazon SageMaker for data processing. The goal is to produce a text file with the host name in ciphertext, keeping sensitive data protected while allowing applications to analyze the remaining non-sensitive information.
### Establishing an Amazon Bedrock Knowledge Base
With the data encrypted and uploaded to S3, the next step is to set up an Amazon Bedrock Knowledge Base that ingests the documents from S3 as a data source. This enables the integration of proprietary information into applications using RAG. The setup requires IAM policies that grant access to S3, plus a supported vector store for the embeddings.
In this demonstration, Amazon OpenSearch Serverless provides the vector store: a collection holds the vector embeddings. Additional IAM policies grant access to the collection and the indexes that store the embeddings, and security policies configure network access to the collection and its dashboard so that Terraform can create an index.
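The article drives this with Terraform; as a hedged illustration of what those security policies contain, here is a boto3 sketch of equivalent encryption and network policies (policy names and the public-access setting are assumptions for a demo, not production guidance):

```python
import json
import os


def encryption_policy(collection: str) -> str:
    """Encryption-at-rest policy for the collection using an AWS-owned key."""
    return json.dumps({
        "Rules": [
            {"ResourceType": "collection",
             "Resource": [f"collection/{collection}"]},
        ],
        "AWSOwnedKey": True,
    })


def network_policy(collection: str) -> str:
    """Network policy allowing access to the collection and its dashboard.
    Public access here is for demonstration only."""
    return json.dumps([
        {
            "Rules": [
                {"ResourceType": "collection",
                 "Resource": [f"collection/{collection}"]},
                {"ResourceType": "dashboard",
                 "Resource": [f"collection/{collection}"]},
            ],
            "AllowFromPublic": True,
        }
    ])


def main() -> None:
    import boto3

    aoss = boto3.client("opensearchserverless")
    name = os.environ.get("COLLECTION_NAME", "listings")  # assumed name
    aoss.create_security_policy(name=f"{name}-encryption", type="encryption",
                                policy=encryption_policy(name))
    aoss.create_security_policy(name=f"{name}-network", type="network",
                                policy=network_policy(name))


# Calling main() requires AWS credentials with OpenSearch Serverless access.
```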
### Configuring the Vector Store
The configuration process includes creating an access policy that permits Amazon Bedrock and the AWS credentials Terraform is running with to read and write the index within the collection of embeddings. This step is crucial for ensuring that the data is accessible and usable by the knowledge base.
After establishing these policies, Terraform constructs the collection. For Amazon Bedrock Knowledge Bases, the type is set to VECTORSEARCH, facilitating the integration of vector embeddings into the knowledge base.
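Again, the article performs these steps in Terraform; the following boto3 sketch shows the shape of the data-access policy and the `VECTORSEARCH` collection type. The principal ARNs, policy name, and environment variables are assumptions.

```python
import json
import os


def data_access_policy(collection: str, principals: list[str]) -> str:
    """Data-access policy letting the given principals (e.g. the Bedrock
    service role and the Terraform caller) read and write the index and
    collection that hold the embeddings."""
    return json.dumps([
        {
            "Rules": [
                {"ResourceType": "collection",
                 "Resource": [f"collection/{collection}"],
                 "Permission": ["aoss:*"]},
                {"ResourceType": "index",
                 "Resource": [f"index/{collection}/*"],
                 "Permission": ["aoss:*"]},
            ],
            "Principal": principals,
        }
    ])


def main() -> None:
    import boto3

    aoss = boto3.client("opensearchserverless")
    name = os.environ.get("COLLECTION_NAME", "listings")  # assumed name
    aoss.create_access_policy(
        name=f"{name}-access",
        type="data",
        policy=data_access_policy(
            name,
            [os.environ["BEDROCK_ROLE_ARN"], os.environ["CALLER_ARN"]],
        ),
    )
    # VECTORSEARCH collections hold the vector embeddings that the
    # Bedrock Knowledge Base reads and writes.
    aoss.create_collection(name=name, type="VECTORSEARCH")


# Calling main() requires AWS credentials with OpenSearch Serverless access.
```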
### Testing and Implementing the Knowledge Base
With the knowledge base configured, testing involves querying the system with specific questions related to rental listings. The responses should provide additional detail based on the listings, demonstrating the enhanced capabilities of the LLM when augmented with real-world data.
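A query like this can be issued through Bedrock's RetrieveAndGenerate API. The knowledge base ID and model ARN environment variables below are assumptions:

```python
import os


def rag_request(question: str, kb_id: str, model_arn: str) -> dict:
    """Request body for Bedrock's RetrieveAndGenerate API: the model answers
    the question after retrieving matching listing documents."""
    return {
        "input": {"text": question},
        "retrieveAndGenerateConfiguration": {
            "type": "KNOWLEDGE_BASE",
            "knowledgeBaseConfiguration": {
                "knowledgeBaseId": kb_id,
                "modelArn": model_arn,
            },
        },
    }


def main() -> None:
    import boto3

    runtime = boto3.client("bedrock-agent-runtime")
    response = runtime.retrieve_and_generate(
        **rag_request(
            "How many vacation rentals are in the Box House Hotel?",
            os.environ["KNOWLEDGE_BASE_ID"],  # assumed env vars
            os.environ["MODEL_ARN"],
        )
    )
    print(response["output"]["text"])


# Calling main() requires AWS credentials and a synced knowledge base.
```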
For example, asking about the number of vacation rentals at the Box House Hotel yields a response indicating that at least ten rentals are listed, each with various room types and shared host details. This enhanced response is possible because the listings are retrieved and supplied to the LLM at query time, not because the model was retrained on them.
### Protecting Sensitive Information
Security measures, such as convergent encryption, ensure that sensitive information like host names remains protected. In scenarios where applications require access to the plaintext names, the Vault transit secrets engine can be used to decrypt the ciphertext before providing the response. This approach ensures that only authorized applications have access to the decryption endpoint, safeguarding sensitive data.
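A decrypt step for such an authorized application might look like the following hvac sketch; the placeholder ciphertext and the fixed context value are assumptions, and the “listings” key matches the encryption step:

```python
import base64
import os


def to_b64(value: str) -> str:
    return base64.b64encode(value.encode("utf-8")).decode("utf-8")


def from_b64(value: str) -> str:
    """Transit returns the recovered plaintext base64-encoded."""
    return base64.b64decode(value).decode("utf-8")


def main() -> None:
    import hvac

    vault = hvac.Client(
        url=os.environ["VAULT_ADDR"], token=os.environ["VAULT_TOKEN"]
    )
    # Only applications whose Vault policy grants this transit decrypt
    # endpoint can recover the plaintext host name.
    response = vault.secrets.transit.decrypt_data(
        name="listings",
        ciphertext="vault:v1:...",   # placeholder: ciphertext from the response
        context=to_b64("listings"),  # same context used during encryption
    )
    print(from_b64(response["data"]["plaintext"]))


# Calling main() requires a Vault token authorized for transit decryption.
```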
### Conclusion and Further Learning
In summary, encrypting sensitive data before augmenting an LLM with RAG prevents leakage of sensitive information while still letting applications analyze the data and provide valuable insights. For applications that require access to the sensitive fields, additional code can decrypt the payload using Vault's transit secrets engine.
For further exploration, see the Amazon Bedrock Knowledge Bases documentation, as well as the OWASP Top 10 Risks & Mitigations for LLMs and Gen AI Apps, which describes potential risks and strategies for mitigating them.