Commit

[spelunker] Add design.md, and a few logging tweaks (#446)
In design.md you will find a description of the current architecture
plus a lot of open questions about improvements.

---------

Co-authored-by: Guido van Rossum <[email protected]>
gvanrossum and gvanrossum-ms authored Dec 2, 2024
1 parent 5d832ba commit 3e230f2
Showing 3 changed files with 119 additions and 16 deletions.
2 changes: 1 addition & 1 deletion ts/examples/spelunker/README.md
@@ -3,7 +3,7 @@
This sample app is used to explore ideas around "code spelunking", i.e.,
exploring a repo full of unknown source code (in multiple languages) and
gradually understand how it works. The AI is meant as a helper for the human to
help them think.
help them think. See [Design](./design.md) for the current architecture.

## Trademarks

77 changes: 77 additions & 0 deletions ts/examples/spelunker/design.md
@@ -0,0 +1,77 @@
# Spelunker: Architecture and Design

Spelunker is currently best used to explore one project at a time.
The initial prototype can only handle Python files.

## Database structure

The database records several categories of information (a rough sketch of the record shapes follows the list):

- **chunks** of program text (typically a function or class).
  The database holds the source text, some metadata, and a copy of the docs extracted from the chunk.
Chunks have a tree-shaped hierarchy based on containment; the entire file is the root.
- **summaries**, **keywords**, **topics**, **goals**, **dependencies**:
Various types of docs extracted from all chunks, indexed and with "nearest neighbors" search enabled.
- **answers**: conversation history, recording for each user interaction the question, the final AI answer, and a list of references (chunks that contributed to the answer).
Indexed only by timestamp.
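
As a rough sketch, the records might be shaped like the TypeScript interfaces below. The names and fields are illustrative assumptions, not the actual types in the spelunker source:

```typescript
// Illustrative sketch only; field names are assumptions, not the real schema.
interface Chunk {
    chunkId: string;
    fileName: string;
    parentId?: string; // containment hierarchy; the whole file is the root
    childIds: string[];
    sourceText: string; // typically a function or class
    docs?: ChunkDocs; // docs extracted from the chunk by the import pipeline
}

interface ChunkDocs {
    summary: string;
    keywords: string[];
    topics: string[];
    goals: string[];
    dependencies: string[];
}

interface Answer {
    timestamp: string; // the only index over conversation history
    question: string;
    answer: string;
    references: string[]; // chunk IDs that contributed to the answer
}
```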

## Import process

The chunks and related indexes are written by an import pipeline that does the following (sketched in code after the list):

1. Break each file into a hierarchy of chunks using a local script.
2. Split large files into several shorter files, guided by the chunk hierarchy (repeating part of the hierarchy).
3. Feed each file to an LLM to produce, for each chunk, a summary and lists of keywords, topics, goals, and dependencies.
4. Store those in their respective indexes.
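
A minimal sketch of that pipeline, assuming the `Chunk`/`ChunkDocs` shapes from the sketch above; every helper here (`chunkFile`, `splitLargeFile`, `summarizeChunks`, `storeDocs`) is a hypothetical placeholder, not the real import API:

```typescript
// Hypothetical skeleton of the import pipeline; helper names are placeholders.
declare function chunkFile(fileName: string): Promise<Chunk[]>; // step 1: local script, no LLM
declare function splitLargeFile(chunks: Chunk[]): Chunk[][]; // step 2: split guided by the hierarchy
declare function summarizeChunks(chunks: Chunk[]): Promise<ChunkDocs[]>; // step 3: one LLM call per (virtual) file
declare function storeDocs(chunks: Chunk[], docs: ChunkDocs[]): Promise<void>; // step 4: write the indexes

async function importFiles(fileNames: string[]): Promise<void> {
    for (const fileName of fileNames) {
        const chunks = await chunkFile(fileName); // hierarchy of chunks, file as root
        for (const batch of splitLargeFile(chunks)) {
            const docs = await summarizeChunks(batch); // summary, keywords, topics, goals, dependencies
            await storeDocs(batch, docs); // each doc type goes into its own nearest-neighbors index
        }
    }
}
```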

## Query process

A user query is handled using the following steps (sketched in code after the list):

1. Feed the user query (and some recent conversation history from **answers**) as context to an LLM tasked with producing sensible queries for each index.
2. Search each local index (**summaries**, **keywords** etc.), keeping the top hits from each (scored by proximity to the query phrase produced by step 1).
3. Using some information retrieval magic (a variant of TF\*IDF), select the top "best" chunks among those hits.
4. Send the selected chunks (including partial metadata and summary), plus the same recent history from step 1, as context to an LLM tasked with producing the final answer from all the context it is given.
5. Present the answer to the user and add it to the conversation history (**answers**).
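
A minimal sketch of those five steps, again with hypothetical helper names and the `Answer` shape from the database sketch; the real flow lives in `queryInterface.ts`:

```typescript
// Hypothetical skeleton of the query flow; all helper names are placeholders.
interface Hit {
    chunkId: string;
    score: number; // similarity to the proposed query phrase
}

declare function findRecentAnswers(): Promise<Answer[]>; // recent history from **answers**
declare function proposeQueries(query: string, history: Answer[]): Promise<Map<string, string>>; // step 1
declare function searchIndex(indexName: string, query: string): Promise<Hit[]>; // step 2
declare function selectBestChunks(hits: Map<string, Hit[]>): string[]; // step 3: TF*IDF-ish scoring
declare function makeAnswer(query: string, chunkIds: string[], history: Answer[]): Promise<Answer>; // step 4
declare function storeAnswer(answer: Answer): Promise<void>; // step 5: append to **answers**

async function answerQuery(userQuery: string): Promise<Answer> {
    const history = await findRecentAnswers();
    const proposed = await proposeQueries(userQuery, history); // one query phrase per index
    const hits = new Map<string, Hit[]>();
    for (const [indexName, query] of proposed) {
        hits.set(indexName, await searchIndex(indexName, query)); // top hits per index
    }
    const chunkIds = selectBestChunks(hits); // the "best" chunks across all indexes
    const answer = await makeAnswer(userQuery, chunkIds, history);
    await storeAnswer(answer); // becomes context for future questions
    return answer;
}
```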

## Open issues

### High level

- Is it worth pursuing this further?
- How to integrate it as an agent with shell/cli?
Especially since the model needs access to conversation history, and the current assumption is that you focus on spelunking exclusively until you say you are (temporarily) done with it.
Does the cli/shell have UI features for that?

### Testing

- How to measure the quality of the answers? This probably has to be a manual process.
We might be able to have a standard set of queries about a particular code base and for each potential improvement decide which variant gives the better answer.

### Import process open questions

- Should we give the LLM more guidance as to how to generate the best keywords, topics etc.?
- Do we need all five indexes? Or could we do with fewer, e.g. just **summaries** and **topics**?
- Can we get it to produce better summaries and topics (etc.) through different prompting?
- What are the optimal parameters for splitting long files?
- Can we tweak the splitting of large files to make the split files more cohesive?
- Would it help if we tweaked the chunking algorithm?
- Could we get the LLM to produce the chunking? (Chicken and egg for large files though.)

### Query process open questions

- How much conversation history to include in the context for steps 1 and 4, and if not all, how to choose (another proximity search perhaps?).
- Prompt engineering to get the first LLM to come up with better queries. (Sometimes it puts stuff in the queries that feel poorly chosen.)
- How many hits to request from each index (**maxHits**). And possibly how to determine **minScore**.
- Algorithm for scoring chunks among hits. There are many possible ideas.
- How many chunks to pass in the context for step 4. (Can it be dynamic?)
- In which order to present the context for step 4.
- Prompt engineering for step 4.
- Sometimes the model isn't using the conversation history enough. How can we improve this?
(E.g. I once had to battle her about whether she had access to history at all; she claimed she did not, even though I gave her the most recent 20 question/answer pairs.)

## Details of the current processes

E.g. my TF\*IDF variant, etc.

This is a TODO; for now, just see the code.
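
As an illustration (not the algorithm the code actually implements), a TF\*IDF-flavored scorer might combine per-index similarity with cross-index agreement. The `Hit` shape and `scoreChunks` name here are hypothetical:

```typescript
// Illustrative TF*IDF-flavored scoring, NOT the actual spelunker algorithm.
interface Hit {
    chunkId: string;
    score: number; // similarity score from the nearest-neighbors search
}

function scoreChunks(hitsPerIndex: Map<string, Hit[]>): { chunkId: string; score: number }[] {
    const total = new Map<string, number>(); // summed similarity per chunk
    const count = new Map<string, number>(); // number of indexes that hit the chunk
    for (const hits of hitsPerIndex.values()) {
        for (const { chunkId, score } of hits) {
            total.set(chunkId, (total.get(chunkId) ?? 0) + score);
            count.set(chunkId, (count.get(chunkId) ?? 0) + 1);
        }
    }
    // Mean similarity rewards quality; sqrt(count) rewards cross-index
    // agreement sub-linearly, so one strong index is not drowned out by
    // many weak ones, and vice versa.
    return [...total.entries()]
        .map(([chunkId, sum]) => {
            const n = count.get(chunkId)!;
            return { chunkId, score: (sum / n) * Math.sqrt(n) };
        })
        .sort((a, b) => b.score - a.score);
}
```

The square root is one arbitrary damping choice; as the query-process open questions note, there are many possible scoring ideas.
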
56 changes: 41 additions & 15 deletions ts/examples/spelunker/src/queryInterface.ts
@@ -165,7 +165,7 @@ export async function interactiveQueryLoop(
),
);
} else {
writeWarning(io, "SUMMARY: None");
writeNote(io, "SUMMARY: None");
}
} else {
const docItem: string[] | undefined =
@@ -182,7 +182,7 @@
writeNote(io, "CODE:");
writeChunkLines(chunk, io, 100);
} else {
writeWarning(io, `[Chunk ID ${chunkId} not found]`);
writeNote(io, `[Chunk ID ${chunkId} not found]`);
}
}
}
@@ -200,7 +200,7 @@
maxHits: {
description: "Maximum number of hits to return",
type: "integer",
defaultValue: 3,
defaultValue: 10,
},
minScore: {
description: "Minimum score to return",
@@ -359,7 +359,7 @@
);
}
if (!filesPopularity.size) {
writeWarning(io, "[No files]");
writeMain(io, "[No files]");
} else {
const sortedFiles = Array.from(filesPopularity)
.filter(([file, _]) => !filter || file.includes(filter))
@@ -455,7 +455,7 @@
}

if (!hits.length) {
writeWarning(io, `No ${indexName}.`); // E.g., "No keywords."
writeNote(io, `No ${indexName}.`); // E.g., "No keywords."
return;
} else {
writeNote(io, `Found ${hits.length} ${indexName}.`);
@@ -505,7 +505,7 @@
}
}
if (hits.length < 2) {
writeWarning(io, `No hit for ${text}`);
writeNote(io, `No hits for ${text} in ${indexName}`);
} else {
const end = hits.length - 1;
writeMain(
@@ -521,7 +521,11 @@
input: string,
io: iapp.InteractiveIo,
): Promise<void> {
await processQuery(input, chunkyIndex, io, { verbose } as QueryOptions);
await processQuery(input, chunkyIndex, io, {
maxHits: 10,
minScore: 0.7,
verbose,
});
}

await iapp.runConsole({
@@ -680,18 +684,27 @@ async function runIndexQueries(
for (const [indexName, index] of chunkyIndex.allIndexes()) {
const spec: QuerySpec = (proposedQueries as any)[indexName];
if (spec.maxHits === 0) {
writeWarning(io, `[${indexName}: no query]`);
writeNote(io, `[${indexName}: no query]`);
continue;
}

const specMaxHits = spec.maxHits;
const defaultMaxHits = queryOptions.maxHits;
const maxHits = specMaxHits ?? defaultMaxHits;
const maxHitsDisplay =
maxHits === specMaxHits
? maxHits.toString()
: `${specMaxHits} ?? ${defaultMaxHits}`;

const hits = await index.nearestNeighborsPairs(
spec.query,
spec.maxHits ?? queryOptions.maxHits,
maxHits,
queryOptions.minScore,
);
if (!hits.length) {
writeNote(
io,
`[${indexName}: query ${spec.query} (maxHits ${spec.maxHits}) no hits]`,
`[${indexName}: query ${spec.query} (maxHits ${maxHitsDisplay}) no hits]`,
);
continue;
}
@@ -740,7 +753,7 @@ async function runIndexQueries(
const end = hits.length - 1;
writeNote(
io,
`[${indexName}: query '${spec.query}' (maxHits ${spec.maxHits}); ${hits.length} hits; ` +
`[${indexName}: query '${spec.query}' (maxHits ${maxHitsDisplay}); ${hits.length} hits; ` +
`scores ${hits[0].score.toFixed(3)}--${hits[end].score.toFixed(3)}; ` +
`${numChunks} unique chunk ids]`,
);
@@ -773,7 +786,7 @@ async function generateAnswer(

// Step 3b: Get the chunks themselves.
const chunks: Chunk[] = [];
const maxChunks = 20;
const maxChunks = 30;
// Take the top N chunks that actually exist.
for (const chunkId of scoredChunkIds) {
const maybeChunk = await chunkyIndex.chunkFolder.get(chunkId.item);
@@ -789,10 +802,20 @@

writeNote(io, `[Sending ${chunks.length} chunks to answerMaker]`);

const preamble = makeAnswerPrompt(message, recentAnswers, chunks);
if (queryOptions.verbose) {
const formatted = util.inspect(preamble, {
depth: null,
colors: true,
compact: false,
});
writeNote(io, `Preamble: ${formatted}`);
}

// Step 3c: Make the request and check for success.
const answerResult = await chunkyIndex.answerMaker.translate(
input,
makeAnswerPrompt(message, recentAnswers, chunks),
preamble,
);

if (!answerResult.success) {
@@ -821,7 +844,7 @@ async function findRecentAnswers(
}
// Assume the name field (the internal key) is a timestamp.
recentAnswers.sort((a, b) => b.name.localeCompare(a.name));
recentAnswers.splice(5); // TODO: Cut off by total size, not count.
recentAnswers.splice(20); // TODO: Cut off by total size, not count.
recentAnswers.reverse(); // Most recent last.
return recentAnswers;
}
@@ -831,9 +854,12 @@ function reportQuery(answer: AnswerSpecs, io: iapp.InteractiveIo): void {
io,
`\nAnswer (confidence ${answer.confidence.toFixed(3).replace(/0+$/, "")}):`,
);

writeMain(io, wordWrap(answer.answer));
if (answer.message)

if (answer.message) {
writeWarning(io, "\n" + wordWrap(`Message: ${answer.message}`));
}
if (answer.references.length) {
writeNote(
io,
