TN013 Improve Learning

Improve Learning

Objective

  • Learn from all executed cells, not just generated cells

TL;DR

Today, learning only occurs in a small fraction of cases. In particular, both of the following must be true for learning to occur:

  • The code cell was generated by AI
  • The user edited the code cell

This limitation was the result of relying on the GenerateRequest to log the notebook to be used as input.

We now have the LogEvents API and log the full notebook each time the cell focus changes. We also log events corresponding to cell executions. This means we can use any cell execution (regardless of how the cell was generated) to learn. We also log the cell execution status so we can avoid learning from failed executions.

Background

How does learning work today

The Analyzer is a stream processor for the logs that produces BlockLogs. When the Analyzer encounters an Execute event, it enqueues the BlockLog for processing in the Learner.

The Learner then checks whether the BlockLog meets various criteria for learning, e.g.:

  • Was the block generated via AI?
  • Is the generated block different from the text that was actually executed?
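The two checks above can be sketched as a single predicate. This is a minimal illustration, not the Learner's actual code: the `BlockLog` class here is a simplified, hypothetical stand-in, and its field names (`generated_text`, `executed_text`) are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class BlockLog:
    # Hypothetical, simplified stand-in for the real BlockLog.
    generated_text: str  # text produced by the AI; empty if the user wrote the cell
    executed_text: str   # text that was actually executed


def is_learnable_today(log: BlockLog) -> bool:
    """Return True only if the block was AI generated and then edited."""
    was_generated = log.generated_text != ""
    # Ignore surrounding whitespace so trivial differences don't count as edits.
    was_edited = log.generated_text.strip() != log.executed_text.strip()
    return was_generated and was_edited
```

Under these assumptions, a cell the user typed by hand, or an AI suggestion executed unchanged, is never learned from, which is the limitation this note sets out to remove.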

Importantly, in order to do learning, we need to know the block (cell) that was produced and the cells that preceded it, which should be considered the input to the AI.

Currently, the context is obtained via the GenerateRequest. The analyzer builds a GenerateTrace. For each block that gets produced by the trace, the corresponding block log is updated with the ID of the trace.

Analyzer.buildBlockLog fetches the GenerateTrace and uses it to populate BlockLog.Doc with the cells that preceded the block.

When we originally implemented learning, the GenerateRequest was the only mechanism we had for recording the doc. This was one of the reasons we could only do learning if the cell was generated by AI. Learning was originally implemented before we had ghost cells and automatic completion requests. At that time, a generate request had to be explicitly triggered by the user.

We now have the LogEvents RPC, which sends a session start event whenever we activate a cell. This event includes the full context. A session also contains events for cell execution. Cell execution events include:

  • Actual cell executed
  • Cell execution status

All sessions should include a session end event. The session end event, however, doesn’t include any cells.
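To make the shape of a session concrete, here is a rough sketch of the three event kinds described above and a helper that keeps only the cells from successful executions. All class, enum, and field names are hypothetical and chosen for illustration; they are not the actual LogEvents schema.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class EventType(Enum):
    SESSION_START = "session_start"  # carries the full notebook context
    EXECUTE = "execute"              # carries the executed cell and its status
    SESSION_END = "session_end"      # carries no cells


class ExecuteStatus(Enum):
    SUCCEEDED = "succeeded"
    FAILED = "failed"


@dataclass
class Cell:
    id: str
    text: str


@dataclass
class LogEvent:
    type: EventType
    cells: List[Cell] = field(default_factory=list)
    status: Optional[ExecuteStatus] = None


def successful_executions(events: List[LogEvent]) -> List[Cell]:
    """Return the executed cell from every execution event that succeeded."""
    return [
        e.cells[0]
        for e in events
        if e.type is EventType.EXECUTE
        and e.status is ExecuteStatus.SUCCEEDED
        and e.cells
    ]
```

Filtering on the status is what would let the learner skip failed executions.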

I believe the motivation for only learning if the user edited an AI suggestion was to improve RAG. If the model was already accurately predicting a cell, then there was no reason to add another example of it to our database.

Proposal: Learn Based on Sessions

The SessionProto has most of the data we need. A session corresponds to all the edits of a cell. For a code cell:

  • FullContext provides a full copy of the notebook when focus was placed on a cell
  • Execution event will contain the executed cell
  • To test if the cell was AI generated we could fetch the BlockLog for the cell
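As a rough sketch of how a code cell session could be turned into a learning example: take the cells preceding the focused cell as the input and the last successfully executed version of the cell as the answer. The `Session`, `Execution`, and `Example` classes below are hypothetical simplifications, and the `selected` index identifying the focused cell is an assumption about what the full context records.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Cell:
    id: str
    text: str


@dataclass
class Execution:
    cell: Cell       # the cell as it was actually executed
    succeeded: bool  # derived from the logged execution status


@dataclass
class Session:
    full_context: List[Cell]  # notebook snapshot taken when the cell got focus
    selected: int             # assumed: index of the focused cell in full_context
    executions: List[Execution] = field(default_factory=list)


@dataclass
class Example:
    context: List[Cell]  # cells preceding the focused cell (input to the AI)
    answer: Cell         # the cell the user actually executed


def build_example(session: Session) -> Optional[Example]:
    """Build a learning example, or return None if nothing ran successfully."""
    successes = [e for e in session.executions if e.succeeded]
    if not successes:
        return None
    context = session.full_context[: session.selected]
    # Use the last successful execution as the final version of the cell.
    return Example(context=context, answer=successes[-1].cell)
```

Choosing the last successful execution is a design choice made for this sketch; a session with several successful runs could also yield several examples.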

For a markdown cell:

  • FullContext provides a full copy of the notebook when focus was placed on a cell
  • We don’t have a copy of the cell, but we could obtain it in a couple of ways:
    • The session that starts next will have the full context, which includes the final version of the cell
    • The LogEvent reports session end and start in the same event so we could link sessions on the backend
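The first option could look roughly like the following: once sessions are linked, look up the markdown cell by ID in the full context of the session that started next. The `Cell` class and the assumption that cells carry a stable `id` across sessions are hypothetical and only serve the illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Cell:
    id: str  # assumed: a cell ID that is stable across sessions
    text: str


def final_cell_text(cell_id: str, next_context: List[Cell]) -> Optional[str]:
    """Find the final version of a cell in the next session's full context.

    Returns None if the cell is absent, e.g. because it was deleted.
    """
    for cell in next_context:
        if cell.id == cell_id:
            return cell.text
    return None
```

A `None` result would need to be handled by the caller, for instance by skipping learning for that session.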

In the case of AI generated cells, we no longer require the user to have edited the cell in order to trigger learning. Dropping this restriction simplifies the learning logic since we don’t have to check whether a cell was AI generated and whether it was edited.

Future Directions

In the future we’d like to extend learning to work for non-code cells.


Last modified November 21, 2024: Learn from all executed cells (#341) (2d4e8e3)