The term semantic search denotes search with meaning, as distinguished from lexical search where the search engine looks for literal matches of the query words or variants of them, without understanding the overall meaning of the query. The idea is to go beyond traditional keyword-based search methods by considering the meaning and intent behind a user’s query.
One way to implement semantic search in the context of large language models (LLMs) is retrieval-augmented generation (RAG). The idea is to take a body of knowledge, in our case in the format of questions and answers, and compute text embeddings for all the questions. Text embeddings are multi-dimensional numerical representations of “meaning” generated by language models. To answer a novel query, we compute its embedding and find the questions in the database whose embeddings are close enough in some norm; we then take those questions and the corresponding answers and prepare a prompt for the LLM together with the original query. Since the query has been augmented with similar questions and answers, the LLM can give a more nuanced and context-aware answer. Given a sufficiently broad set of questions and answers, it is easy to get results that are quite impressive, as we will see in this post.
We use the StackSample dataset from Kaggle, which contains the questions, tags and answers for 10% of the Stack Overflow questions on programming topics. We select the roughly 64,000 questions tagged python and show that this approach gives quite good answers about the Python programming language.
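The retrieval step described above can be illustrated with a minimal sketch. The three-dimensional vectors and the questions below are made up for illustration (a real embedding model produces hundreds of dimensions), and `nearest_questions` is a hypothetical helper, not part of any library.

```python
import numpy as np

# Toy "embeddings"; the values are invented so that the two list-related
# questions sit close together and the file question sits far away.
questions = [
    "How do I sort a list?",
    "How do I reverse a list?",
    "How do I open a file?",
]
question_vectors = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [0.0, 0.1, 0.9],
], dtype="float32")

def nearest_questions(query_vector, k=2):
    """Return the k questions whose vectors are closest in Euclidean norm."""
    distances = np.linalg.norm(question_vectors - query_vector, axis=1)
    order = np.argsort(distances)[:k]
    return [questions[i] for i in order]

# A query vector lying near the first question.
query_vector = np.array([0.9, 0.12, 0.02], dtype="float32")
retrieved = nearest_questions(query_vector)
print(retrieved)
# ['How do I sort a list?', 'How do I reverse a list?']
```

The retrieved questions, together with their answers, are what would be pasted into the prompt ahead of the user's query.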
import numpy as np
import pandas as pd
We assume that the data has been downloaded from Kaggle and saved in the ./data subdirectory. The files are quite large and require a few GB of memory to process.
tags = pd.read_csv('./data/Tags.csv')
questions = pd.read_csv('./data/Questions.csv', encoding='ISO-8859-1')
python_ids = tags[tags.Tag == 'python'].Id.unique()
python_questions = questions[questions.Id.isin(python_ids)]
print(f"Found {len(python_questions)} python questions")
Found 64601 python questions
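The loading code above reads only Tags.csv and Questions.csv. If the answers themselves are wanted in the context, one possible approach is to attach the highest-scoring answer to each question. The sketch below uses toy DataFrames so that it runs on its own; the column names (Id, ParentId, Score, Body) are an assumption about the layout of the StackSample Answers.csv file and should be checked against the actual data.

```python
import pandas as pd

# Toy stand-ins for Questions.csv and Answers.csv (column names assumed).
toy_questions = pd.DataFrame({
    "Id": [1, 2],
    "Title": ["Sort a dict by value", "Reverse a string"],
})
toy_answers = pd.DataFrame({
    "Id": [10, 11, 12],
    "ParentId": [1, 1, 2],
    "Score": [3, 7, 5],
    "Body": ["use a loop", "use sorted() with a key", "use slicing"],
})

# Keep the highest-scoring answer for each question ...
best = (
    toy_answers.sort_values("Score", ascending=False)
    .drop_duplicates("ParentId")
    .rename(columns={"Body": "AnswerBody"})[["ParentId", "AnswerBody"]]
)
# ... and attach it to the question via Id == ParentId.
qa = toy_questions.merge(best, left_on="Id", right_on="ParentId", how="left")
print(qa[["Title", "AnswerBody"]])
```

A left join keeps questions that have no answer at all; their AnswerBody is then missing and would need to be handled before building a prompt.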
We use the sentence_transformers package to compute the embeddings for both the question titles and the question bodies.
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')
title_embeddings = model.encode(python_questions.Title.tolist(), show_progress_bar=False)
body_embeddings = model.encode(python_questions.Body.tolist(), show_progress_bar=False)
We save them to a file, as computing all the embeddings locally takes a while.
np.savez('./data/embeddings.npz', title_embeddings=title_embeddings, body_embeddings=body_embeddings)
loaded = np.load('./data/embeddings.npz')
title_embeddings = loaded['title_embeddings'].astype("float32")
body_embeddings = loaded['body_embeddings'].astype("float32")
del loaded
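Since computing the embeddings is slow, it can be convenient to wrap the save and load steps in a small cache helper. The following is only a sketch: `load_or_compute` and `fake_encode` are hypothetical names, and `fake_encode` stands in for the `model.encode(...)` call so that the example runs without downloading a model.

```python
import os
import tempfile
import numpy as np

def load_or_compute(path, compute):
    """Load embeddings from `path` if present, otherwise compute and save them."""
    if os.path.exists(path):
        with np.load(path) as loaded:
            return loaded["embeddings"].astype("float32")
    embeddings = np.asarray(compute(), dtype="float32")
    np.savez(path, embeddings=embeddings)
    return embeddings

calls = []

def fake_encode():
    # Stand-in for the expensive model.encode(...) call.
    calls.append(1)
    return np.ones((4, 3))

cache_path = os.path.join(tempfile.mkdtemp(), "embeddings.npz")
first = load_or_compute(cache_path, fake_encode)   # computes and saves
second = load_or_compute(cache_path, fake_encode)  # loads from disk
print(len(calls))  # 1
```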
For the LLM we use the OpenAI interface, and gpt-4 in particular.
import openai
def get_completion(prompt, model="gpt-4", temperature=0):
    messages = [{"role": "user", "content": prompt}]
    response = openai.ChatCompletion.create(
        model=model,
        messages=messages,
        temperature=temperature,
    )
    return response.choices[0].message["content"]
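The call above uses the module-level `openai.ChatCompletion` interface, which was removed in version 1.0 of the openai package. With newer versions, a roughly equivalent function could look like the sketch below. It takes the client as an argument (in real use, something like `openai.OpenAI()`), and the stub client is purely hypothetical, there only so the example runs without network access or an API key.

```python
from types import SimpleNamespace

def get_completion_v1(client, prompt, model="gpt-4", temperature=0):
    """Chat completion using the client-object style of openai>=1.0."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
    )
    # In the 1.x client the message is an object, not a dict.
    return response.choices[0].message.content

# Stub that mimics the shape of the real response object.
def _fake_create(model, messages, temperature):
    reply = SimpleNamespace(content="echo: " + messages[0]["content"])
    return SimpleNamespace(choices=[SimpleNamespace(message=reply)])

stub_client = SimpleNamespace(
    chat=SimpleNamespace(completions=SimpleNamespace(create=_fake_create))
)
print(get_completion_v1(stub_client, "hello"))  # echo: hello
```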
The answer_query() function below is the key component. First, it sets up a vector database using faiss. This should be done only once, but to make testing easier it is part of the answer_query() function. There are two modes: one uses the question title for the vector search, the other the question body. Titles are short and focused, so their embeddings tend to be of higher quality; the question body is more expressive but potentially much longer. The prompt explicitly instructs the LLM not to make up answers but rather to admit that it does not have enough knowledge to provide one.
def answer_query(query, simple: bool):
    import faiss
    # Build a flat L2 index over the title or body embeddings.
    index = faiss.IndexFlatL2(title_embeddings.shape[1])
    index.add(title_embeddings if simple else body_embeddings)
    # Embed the query and retrieve the 11 nearest questions.
    embedding = model.encode([query])
    D, I = index.search(embedding, k=11)
    context = ''
    for i in I[0]:
        item = python_questions.iloc[i]
        context += f'Question: {item.Title}\n'
        # item.Body is the question body loaded from Questions.csv.
        context += f'Answer: {item.Body}\n\n'
    print('Using the following answers:')
    for i in I[0]:
        item = python_questions.iloc[i]
        print(f'- {item.Title}')
    prompt = f"""Here is the context: {context}
Using the relevant information from the context,
provide an answer to the query: {query}.
If the context doesn't provide \
any relevant information, \
answer with \
[I couldn't find a good match in the \
document database for your query]
"""
    return get_completion(prompt)
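As noted above, the index only needs to be built once. A possible arrangement is to build one index per mode at module level and look it up on each query. The sketch below illustrates the pattern with a hypothetical `FlatL2Index` class, a brute-force numpy stand-in that returns squared L2 distances as `faiss.IndexFlatL2` does, and random toy data in place of the real embeddings, so that it runs without faiss installed.

```python
import numpy as np

class FlatL2Index:
    """Brute-force stand-in for faiss.IndexFlatL2 (squared L2 distances)."""

    def __init__(self, vectors):
        self.vectors = np.asarray(vectors, dtype="float32")

    def search(self, queries, k):
        queries = np.asarray(queries, dtype="float32")
        # Pairwise squared distances, shape (n_queries, n_vectors).
        diff = queries[:, None, :] - self.vectors[None, :, :]
        dist = (diff ** 2).sum(axis=2)
        idx = np.argsort(dist, axis=1)[:, :k]
        return np.take_along_axis(dist, idx, axis=1), idx

# Toy data standing in for title_embeddings and body_embeddings.
rng = np.random.default_rng(0)
toy_title = rng.random((100, 8)).astype("float32")
toy_body = rng.random((100, 8)).astype("float32")

# Built once and reused by every query; keyed by the `simple` flag.
indexes = {True: FlatL2Index(toy_title), False: FlatL2Index(toy_body)}

def search(query_vector, simple, k=11):
    D, I = indexes[simple].search(query_vector[None, :], k)
    return D[0], I[0]

# Querying with a stored vector returns that vector first, at distance 0.
D, I = search(toy_title[42], simple=True)
print(I[0], D[0])
```

With the real data, the dictionary would hold two faiss indexes instead, and answer_query() would only perform the lookup and the search.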
from IPython.display import display, Markdown
def display_response(response):
    display(Markdown('<div style="background-color: #F3E5F5; padding: 10px; width: 90%">' + response + "</div>"))
We are ready to test the results. For simplicity we use Stack Overflow questions directly; the first one asks how to find the most frequent number in an array. The answer is quite good.
response = answer_query('Python function to find the highest frequency number', simple=True)
display_response(response)
Using the following answers:
- Computing frequencies fast in Python
- How to extract the peak at a specific frequency in python
- python find the 2nd highest element
- Python Function that returns greatest integer
- How to find maximum value of two numbers in python?
- find the most frequent number in a numpy vector
- Define a function to find largest number in a list
- Python: find the highest number of time overlap
- How efficient is Python's max function
- Find n rows with highest value for column periodically
- How to get highest number from each list in a dictionary?
The second question is about creating random matrices where the sum of each column and row is a given number.
response = answer_query('Simple way to create matrix of random numbers of x,y size ' \
'where the sum of each col and rows is equal to z',
simple=True)
display_response(response)
Using the following answers:
- Creating a random n x n matrix with each row and column adding to 1
- Create a Random Blank Matrix (2D array) in Python 3?
- Generating 2d numpy arrays from random columns
- Generate a random 3 element Numpy array of integers summing to 3
- Generate X random integers between Y and Z?
- Random numbers in a grid
- Generating random X,Y coordinates in a grid
- How to make a matrix out of existing xyz data
- (Randomly?) find an amount by summing a 2D array
- Initializing a matrix with random number from given range - python
- Python - Randomly fill 2D array with set number of 1's
You can use the numpy library to create a matrix of random numbers. Here is a simple way to create a matrix of random numbers of size x,y where the sum of each column and row is equal to z:
import numpy as np
def create_matrix(x, y, z):
    # Create an initial matrix with random values
    matrix = np.random.rand(x, y)
    # Normalize the values in the matrix so that they sum up to z
    row_sums = matrix.sum(axis=1)
    matrix = matrix / row_sums[:, np.newaxis] * z
    return matrix
This function first creates a matrix with random values between 0 and 1. Then it normalizes the values in the matrix so that they sum up to z. The normalization is done by dividing each row by its sum and then multiplying by z.
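The generated function rescales rows only, so it is worth checking which sums actually equal z. The following quick check is not part of the original output; it reuses the same logic with a fixed seed and a non-square shape chosen for illustration.

```python
import numpy as np

def create_matrix(x, y, z):
    # Same logic as the generated answer: scale each row to sum to z.
    matrix = np.random.rand(x, y)
    row_sums = matrix.sum(axis=1)
    return matrix / row_sums[:, np.newaxis] * z

np.random.seed(0)
m = create_matrix(3, 4, 10.0)
rows_ok = np.allclose(m.sum(axis=1), 10.0)
cols_ok = np.allclose(m.sum(axis=0), 10.0)
print("rows sum to z:", rows_ok)     # True
print("columns sum to z:", cols_ok)  # False
```

For a 3 x 4 matrix whose rows each sum to 10, all entries add up to 30, so the four columns cannot each sum to 10. The answer therefore satisfies the row constraint but not the column constraint of the original question.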
The third question is on the language itself and the difference between static and class methods in Python.
query = """
What is the difference between a method decorated with @staticmethod and one decorated with @classmethod?
"""
response = answer_query(query, simple=True)
display_response(response)
Using the following answers:
- @StaticMethod or @ClassMethod decoration on magic methods
- What magic does staticmethod() do, so that the static method is always called without the instance parameter?
- Why is __new__ a staticmethod and not a classmethod?
- Can I use the same decorator on a class method and on a static method?
- decorate __call__ with @staticmethod
- Benefits on @staticmethod in python3?
- Is it bad form to call a classmethod as a method from an instance?
- python class with mixed @classmethod and methods
- Why does my __init__ function need to be @classmethod?
- What's the equivalent of a python @classmethod or @staticmethod in Scala?
- @classmethod with Abstract Base Class
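The model's response is not reproduced here. For reference, and independently of that response, the distinction asked about can be shown with a small example; the `Temperature` and `Kelvin` classes are invented for illustration.

```python
class Temperature:
    def __init__(self, degrees):
        self.degrees = degrees

    @classmethod
    def from_string(cls, text):
        # Receives the class itself, so subclasses get instances of their own type.
        return cls(float(text))

    @staticmethod
    def is_valid(degrees):
        # Receives neither instance nor class; a plain function in the class namespace.
        return degrees >= -273.15

class Kelvin(Temperature):
    pass

k = Kelvin.from_string("300")
print(type(k).__name__, k.degrees, Temperature.is_valid(-300))
# Kelvin 300.0 False
```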
The fourth question is about sorting dictionaries by value. The question is not well formulated, as it doesn’t specify what kind of output we want, but with the provided prompt the LLM is capable of giving a reasonable answer.
query = "How do I sort a dictionary by value?"
response = answer_query(query, simple=True)
display_response(response)
Using the following answers:
- How to sort values of a dictionary?
- Dictionary value sorting
- Sorting by value in a python dictionary?
- Need help sorting dictionary by key and value
- How to sort a dictionary in Python?
- Sorting a dictionary in python
- Sorting dictionary by value and key length
- Reverse sort the dictionary by value
- Sort dictionary by number of values under each key
- Printing the dictionary sorted by value
- How do I sort list of numerical value dictionaries?
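To make the ambiguity in the question concrete: sorting by value can mean producing a list of pairs or a new dictionary, in ascending or descending order. The sketch below shows these variants on made-up data; it is a reference example, not the model's response.

```python
scores = {"alice": 3, "bob": 1, "carol": 2}

# A list of (key, value) pairs, ascending by value.
by_value = sorted(scores.items(), key=lambda kv: kv[1])

# A new dict that preserves that order (dicts keep insertion order since Python 3.7).
as_dict = dict(by_value)

# The same, largest value first.
descending = dict(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))

print(by_value)          # [('bob', 1), ('carol', 2), ('alice', 3)]
print(list(descending))  # ['alice', 'carol', 'bob']
```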
The fifth question aims to find a bug in a singleton class. The explanation is clear and easy to read.
query = """
I've found one of most popular approach to create singleton classes, using _instance and overriding __new__ method. Here it is:
<pre>
class Singleton:
    _instance = None
    def __new__(cls, *args, **kwargs):
        if not cls._instance:
            cls._instance = super().__new__(cls)
        return cls._instance
</pre>
But it acting weird when I try to use it this way:
<pre>
class Child(Singleton):
    def __init__(self):
        self.a = random.randint(10, 1000)
x = Child()
y = Child()
print(x.__dict__) # {'a': 74}
print(y.__dict__) # {'a': 74} - not weird
print(Child().__dict__) # {'a': 222} - weird (for me)
</pre>
Can someone explain why it happens?
PS: I'm using python 3.10
"""
response = answer_query(query, simple=False)
display_response(response)
Using the following answers:
- does __init__ get called multiple times with this implementation of Singleton? (Python)
- Implementation of the Singleton pattern in Python
- Passing arguments to singletons in python
- Python Singletons syntax and why its looks like that?
- Instance object getting deleted when tried to print it
- Class retains previous content where new instance is expected
- Can a python singleton class be inherited?
- Make isinstance(obj, cls) work with a decorated class
- Inheritance when __new__() doesn't return instance of class
- strange python destructor behaviour
- Python 3: When is a call to an instance attribute resolved as if it is an instance method?
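The model's response is not reproduced here. Independently of it, the behaviour in the query can be reproduced deterministically by replacing random.randint with a counter, as in the sketch below; the `init_calls` attribute is added for illustration only.

```python
class Singleton:
    _instance = None

    def __new__(cls, *args, **kwargs):
        if not cls._instance:
            cls._instance = super().__new__(cls)
        return cls._instance

class Child(Singleton):
    init_calls = 0

    def __init__(self):
        # Runs on every Child() call, because __new__ returned a Child instance.
        Child.init_calls += 1
        self.a = Child.init_calls

x = Child()
y = Child()
print(x is y, x.a, Child.init_calls)  # True 2 2
```

Both names refer to the same object, but __init__ runs again on every call and overwrites the attribute. That is why x and y show the same value in the query while a third Child() call shows a new one.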
The sixth question is to find another bug.
query = f"""
0
I have a 3*3 table so my expected is to use interp2d interpolating then predict a bigger table maybe 5*5 or 10*10 to get more results then show in plot_surface
This is a simple 3*3 table for test and relationship:
<pre>
x = np.array([1, 2,3]) #---X,Y,Z relationship------
y = np.array([0.05, 0.5,1]) #(1, 0.05, -1.0)(1, 0.5, -0.5)(1, 1.0, 2.0)
z = np.array([-1, -0.5,2,\ #(2, 0.05, -2.0)(2, 0.5, 1.5)(2, 1.0, 3.5)
-2, 1.5,3.5, #(3, 0.05, -1.5)(3, 0.5, 2.5)(3, 1.0, 5.0)
-1.5,2.5,5])
</pre>
To achieve this relationship then i set:
<pre>
X,Y=np.meshgrid(x,y,indexing='ij')
Z=z.reshape(len(x),len(y))
Interploting 5*5 tabble based on the current data
#interp2d Z value
f2 = interp2d(x,y,Z,kind='linear')
x_new=np.linspace(0.01,0.02,5)
y_new=np.linspace(0.002,0.004,5)
X_new,Y_new=np.meshgrid(x_new,y_new,indexing='ij')
z_new=f2(x_new,y_new)
Z_new=z_new.reshape(len(x_new),len(y_new))
print(z_new)
</pre>
Now at this step i get the wrong number of interploted Z value,all the same and not expected
<pre>
# [-1. -1. -1. -1. -1.]
# [-1. -1. -1. -1. -1.]
# [-1. -1. -1. -1. -1.]
# [-1. -1. -1. -1. -1.]]
</pre>
So finlly the 3Dsurface become a flat picture
I am not sure why the script or function interp2d wrong with it.
How can i fix the scipts?
This is my full script:
<pre>
from scipy.interpolate import interp1d,interp2d,griddata
from matplotlib import cm
from mpl_toolkits.mplot3d import Axes3D
import numpy as np
x = np.array([1, 2,3])
y = np.array([0.05, 0.5,1])
z = np.array([-1, -0.5,2,\
-2, 1.5,3.5,
-1.5,2.5,5])
fig = plt.figure()
ax=Axes3D(fig)
ax = fig.add_subplot(projection='3d')
X,Y=np.meshgrid(x,y,indexing='ij')
Z=z.reshape(len(x),len(y))
#interp2d Z value
f2 = interp2d(x,y,Z,kind='linear')
x_new=np.linspace(0.01,0.02,5)
y_new=np.linspace(0.002,0.004,5)
X_new,Y_new=np.meshgrid(x_new,y_new,indexing='ij')
z_new=f2(x_new,y_new)
Z_new=z_new.reshape(len(x_new),len(y_new))
print(z_new) #---->not as expected [[-1. -1. -1. -1. -1.]
# [-1. -1. -1. -1. -1.]
# [-1. -1. -1. -1. -1.]
# [-1. -1. -1. -1. -1.]
# [-1. -1. -1. -1. -1.]]
#This is for check X,Y,Z value
def Check():
    n,j=0,0
    print("----X,Y,Z-----")
    for i in zip(X.flat,Y.flat,Z.flat): #----X, Y, Z - ----
        print(i, end=" ") #(1, 0.05, -1.0)(1, 0.5, -0.5)(1, 1.0, 2.0
        n += 1 #(2, 0.05, -2.0)(2, 0.5, 1.5)(2, 1.0, 3.5)
        if n % int(len(x))==0: #(3, 0.05, -1.5)(3, 0.5, 2.5)(3, 1.0, 5.0)
            print()
    print("----X_new,Y_new,Z_new-----")
    for i in zip(X_new.flat,Y_new.flat,Z_new.flat):
        print(i, end=" ")
        j += 1
        if j % int(len(x_new))==0:
            print()
Check()
ax.plot_surface(X, Y, Z,linewidth=0,antialiased=True,cmap="cividis",rstride=1,cstride=1)
ax.plot_surface(X_new, Y_new, Z_new, linewidth=0, antialiased=True, cmap=cm.winter, rstride=1, cstride=1)
plt.show()```
</pre>
"""
response = answer_query(query, simple=False)
display_response(response)
Using the following answers:
- 3D plot with an 2D array python matplotlib
- numpy magic to clean up function
- Python scatter plot for large data
- segfault using scipy griddata: ceval_gil.h no found
- extract element from list to form a 2-d array in Python
- Interpolation of 3D data in Python
- 3D+Color IDW/Kriging Interpolation with Python
- Fast interpolation over 3D array for 3D origin x
- Interpolating 3D data (irregular vertical mesh) to regular vertical mesh. Optimization loops
- Creating a numpy array of 3D coordinates from three 1D arrays
- Accelerate UDF in xlwings
We finish our tests by asking a question that is not about Python but rather C and C++. Given our body of knowledge, the question cannot be answered, and indeed the LLM tells us so.
query = "What is the '-->' operator in C/C++?"
response = answer_query(query, simple=True)
display_response(response)
Using the following answers:
- What is this operator *= -1
- >>> operator in python
- Python operator ">>"
- Meaning of the <- symbol in Python
- Why are there no ++ and -- operators in Python?
- python condensing functions for +-*
- python regex - what does - (dash) mean
- Explanation of Binary AND Operator
- What does <function at ...> mean
- What is the double inequality (>>) sign in python?
- What does "equals to negative 1" mean in Python?
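In the run above the retriever still returns eleven Python questions, and it is the prompt that makes the LLM decline. An additional safeguard, not used in this post, would be to drop hits whose distance to the query exceeds a threshold, which is closer to the "close enough" criterion described at the beginning. In the sketch below, `filter_by_distance` is a hypothetical helper and the distances, indices and threshold are invented; a sensible threshold would have to be tuned on real queries.

```python
import numpy as np

def filter_by_distance(distances, indices, max_distance):
    """Keep only the hits whose distance is at most max_distance."""
    distances = np.asarray(distances)
    indices = np.asarray(indices)
    keep = distances <= max_distance
    return indices[keep]

# Made-up search output in the (D, I) form returned for a single query.
D = np.array([0.35, 0.52, 0.90, 1.40])
I = np.array([17, 4, 256, 31])

kept = filter_by_distance(D, I, max_distance=0.6)
print(kept.tolist())  # [17, 4]
```

If nothing survives the filter, the fallback message could be returned directly without calling the LLM at all.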
These results are impressive given how little code was needed to produce them.