Implement /llm/generate HTTP Endpoint With Streaming
Introduction
This article walks through implementing an /llm/generate HTTP endpoint with streaming support. The enhancement focuses on the AIRunnerAPIRequestHandler, which must handle request parsing, payload validation, and the streaming of LLMResponse objects in NDJSON format. The feature matters for applications that need real-time responses from Language Model (LLM) services, because streaming reduces perceived latency and improves the user experience. The implementation involves several key steps: updating the server-side API, validating incoming requests, creating LLMRequest objects, and managing the streaming of responses. The sections below cover the changes required, the rationale behind them, and the expected outcomes.
Key Components and Files
To successfully implement the /llm/generate HTTP endpoint with streaming, we need to focus on several key components and files within the airunner project. Understanding the roles of these components will help in grasping the overall architecture and the specific changes required. Primarily, the implementation will touch the following files:
- src/airunner/components/server/api/server.py
- src/airunner/components/llm/api/llm_services.py
The server.py file is responsible for handling HTTP requests and responses, making it the primary point of interaction for the new endpoint. This file will require updates to route the /llm/generate requests to the appropriate handler function. The llm_services.py file, on the other hand, contains the logic for processing LLM requests and generating responses. This file will need modifications to support both streaming and non-streaming response modes, ensuring that the application can handle different types of requests efficiently. By focusing on these files, we can ensure that the new endpoint integrates seamlessly with the existing architecture and provides the desired functionality.
src/airunner/components/server/api/server.py
The src/airunner/components/server/api/server.py file plays a pivotal role in handling incoming HTTP requests and routing them to the appropriate handlers. For the /llm/generate endpoint, this file needs to be updated to include a new route that maps the endpoint to a specific handler function within the AIRunnerAPIRequestHandler class. This involves modifying the request handling logic to recognize the new endpoint and direct the request to the correct method. Furthermore, the file needs to handle the streaming of responses, which requires a different approach compared to traditional request-response models. The handler function must be able to generate a stream of data in NDJSON format when the stream parameter is set to true. This typically involves using asynchronous generators or other streaming mechanisms to efficiently send data chunks to the client. Error handling is also a critical aspect of this file. It should be able to catch exceptions, format error messages in JSON, and return appropriate HTTP error codes to the client. By carefully modifying this file, we can ensure that the /llm/generate endpoint functions correctly and provides a robust and reliable interface for LLM interactions.
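To make the handler changes concrete, here is a minimal sketch of what the streaming path could look like, assuming AIRunnerAPIRequestHandler follows the pattern of Python's standard-library http.server.BaseHTTPRequestHandler; if airunner's server is built on a different framework, the same structure maps onto that framework's streaming response API. The generate and generate_stream service functions and the _handle_llm_generate, _stream_ndjson, _write_chunk, and _send_json helpers are illustrative names, not existing airunner APIs.

```python
import json
from http.server import BaseHTTPRequestHandler

# The module path below is the one named in this article; generate() and
# generate_stream() are hypothetical names standing in for whatever
# llm_services.py actually exposes.
from airunner.components.llm.api import llm_services


class AIRunnerAPIRequestHandler(BaseHTTPRequestHandler):
    """Skeleton only; the real class in server.py already handles other routes."""

    protocol_version = "HTTP/1.1"  # chunked transfer encoding requires HTTP/1.1

    def do_POST(self):
        # Route the new endpoint to its handler method.
        if self.path == "/llm/generate":
            self._handle_llm_generate()
        else:
            self._send_json(404, {"error": f"Unknown endpoint: {self.path}"})

    def _handle_llm_generate(self):
        # Parse the JSON body; a malformed body is a client error (400).
        try:
            length = int(self.headers.get("Content-Length", 0))
            payload = json.loads(self.rfile.read(length) or b"{}")
        except ValueError:
            self._send_json(400, {"error": "Request body must be valid JSON"})
            return
        if not isinstance(payload, dict):
            self._send_json(400, {"error": "Request body must be a JSON object"})
            return

        if payload.get("stream", False):
            self._stream_ndjson(llm_services.generate_stream(payload))  # hypothetical
        else:
            try:
                self._send_json(200, llm_services.generate(payload))  # hypothetical
            except Exception as exc:  # LLM/service failure -> server error
                self._send_json(500, {"error": str(exc)})

    def _stream_ndjson(self, chunks):
        """Write each chunk as one JSON object per line using chunked encoding."""
        self.send_response(200)
        self.send_header("Content-Type", "application/x-ndjson")
        self.send_header("Transfer-Encoding", "chunked")
        self.end_headers()
        try:
            for chunk in chunks:
                self._write_chunk(chunk)
        except Exception as exc:  # headers are already sent, so report in-band
            self._write_chunk({"error": str(exc)})
        self.wfile.write(b"0\r\n\r\n")  # terminates the chunked stream

    def _write_chunk(self, obj):
        # One NDJSON line per chunk, framed as an HTTP/1.1 chunk.
        line = (json.dumps(obj) + "\n").encode("utf-8")
        self.wfile.write(f"{len(line):X}\r\n".encode("ascii") + line + b"\r\n")
        self.wfile.flush()

    def _send_json(self, status, body):
        """Send a complete (non-streaming) JSON response."""
        data = json.dumps(body).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)
```

The key points are the application/x-ndjson content type, one JSON object per line, and the fact that the status code cannot change once streaming has started, which is why mid-stream failures are reported in-band rather than as a separate HTTP error.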
src/airunner/components/llm/api/llm_services.py
The src/airunner/components/llm/api/llm_services.py file is where the core logic for processing Language Model (LLM) requests and generating responses resides. This file is crucial for the implementation of the /llm/generate endpoint, as it contains the functions that interact with the LLM and produce the desired output. The primary responsibility of this file is to handle the incoming LLMRequest objects, interact with the LLM service, and generate LLMResponse objects. For the streaming functionality, this file needs to be updated to support the generation of a stream of responses, rather than a single response. This can be achieved by using asynchronous generators or other streaming techniques. Additionally, the file must be able to handle different types of requests, including those that require a single JSON object response (when stream is set to false). Error handling is also a significant consideration. The file should be able to gracefully handle errors during LLM interactions and provide informative error messages. By focusing on these aspects, we can ensure that the llm_services.py file is well-equipped to handle the demands of the new /llm/generate endpoint.
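A hedged sketch of how the service layer could expose both modes is shown below. LLMRequest and LLMResponse are simplified stand-ins for airunner's real classes, and run_model is a hypothetical placeholder for whatever call actually invokes the model; only the generator structure is the point here.

```python
from dataclasses import dataclass
from typing import Iterator


@dataclass
class LLMRequest:
    """Simplified stand-in for airunner's real LLMRequest."""
    prompt: str
    action: str = "chat"
    stream: bool = True


@dataclass
class LLMResponse:
    """Simplified stand-in for airunner's real LLMResponse."""
    message: str
    is_final: bool = False


def generate_stream(request: LLMRequest) -> Iterator[LLMResponse]:
    """Yield partial LLMResponse objects as the model produces tokens.

    run_model() is a hypothetical placeholder for the call that actually
    invokes the model inside airunner.
    """
    for token in run_model(request.prompt, request.action):  # hypothetical
        yield LLMResponse(message=token)
    yield LLMResponse(message="", is_final=True)


def generate(request: LLMRequest) -> LLMResponse:
    """Non-streaming path: collect the whole stream into a single response."""
    parts = [chunk.message for chunk in generate_stream(request)]
    return LLMResponse(message="".join(parts), is_final=True)
```

The HTTP layer can serialize each yielded LLMResponse with dataclasses.asdict() before writing it out as an NDJSON line, and the non-streaming generate() simply drains the same generator, so both modes share one code path.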
Acceptance Criteria
The acceptance criteria outline the specific conditions that must be met to ensure the successful implementation of the /llm/generate HTTP endpoint with streaming. These criteria serve as a checklist to verify that the new functionality works as expected and integrates seamlessly with the existing system. There are three primary acceptance criteria:
- POSTing valid JSON to /llm/generate returns NDJSON streaming chunks when stream: true. When a client sends a POST request to the /llm/generate endpoint with a valid JSON payload and the stream parameter set to true, the server must respond with a stream of data in NDJSON format. NDJSON (Newline Delimited JSON) is a format in which each line is a complete JSON object, which makes it well suited to streaming. This criterion verifies that the streaming functionality is correctly implemented and that the server can efficiently send data chunks to the client.
- Non-streaming (stream: false) requests return a single JSON object. When the stream parameter is set to false or is omitted, the server must return a single JSON object as the response. This is the traditional request-response model, and this criterion verifies that the server handles both streaming and non-streaming requests appropriately. It also ensures that the existing behavior remains intact and that the new streaming feature does not degrade the non-streaming use case. (A short client-side example of both modes follows this list.)
- Error cases return appropriate HTTP codes and JSON error messages. This criterion is crucial for the robustness of the endpoint: when an error occurs (for example, an invalid JSON payload or an LLM service failure), the server must return an appropriate HTTP error code (e.g., 400 for a bad request, 500 for an internal server error) along with a JSON object containing a descriptive error message, so that clients can understand the nature of the failure and take corrective action.
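As a concrete illustration of the first two criteria, the client snippet below exercises both modes. The host, port, prompt, and the exact fields of each chunk are assumptions made for the example; only the line-by-line NDJSON parsing and the single-object response are the point.

```python
import json

import requests  # third-party HTTP client: pip install requests

BASE_URL = "http://localhost:8000"  # host and port are illustrative

payload = {"prompt": "Write a haiku about streams.", "action": "chat", "stream": True}

# Streaming mode: each NDJSON line is a complete JSON object.
with requests.post(f"{BASE_URL}/llm/generate", json=payload, stream=True) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if line:  # skip keep-alive blank lines
            print(json.loads(line))

# Non-streaming mode: a single JSON object comes back.
payload["stream"] = False
resp = requests.post(f"{BASE_URL}/llm/generate", json=payload)
resp.raise_for_status()
print(resp.json())
```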
By meeting these acceptance criteria, we can confidently say that the /llm/generate endpoint has been successfully implemented and is ready for use.
Implementation Steps
Implementing the /llm/generate HTTP endpoint with streaming involves a series of detailed steps. These steps cover everything from request parsing and payload validation to response streaming and error handling. Each step is crucial to ensure the endpoint functions correctly and provides a robust and reliable interface for LLM interactions. Below is a breakdown of the key steps involved:
- Enhance AIRunnerAPIRequestHandler to implement /llm/generate request parsing:
  - Modify the AIRunnerAPIRequestHandler class in src/airunner/components/server/api/server.py to handle incoming requests to the /llm/generate endpoint. The handler needs to recognize the new route and dispatch the request to the appropriate method.
  - The request parsing logic should extract the necessary data from the request, such as the request body and any query parameters. This data is used to create the LLMRequest object.
- Validate payloads (prompt, action, llm_request):
  - Payload validation ensures that incoming requests are well-formed and contain the necessary data. Implement validation logic that checks the structure and content of the JSON payload.
  - The validation should verify the presence and format of key fields such as prompt, action, and llm_request. Invalid payloads should result in an appropriate HTTP error code (e.g., 400) and a JSON error message. (A hedged sketch of this validation follows the list.)
- Create LLMRequest from JSON:
  - Once the payload is validated, create an LLMRequest object from the JSON data. This object encapsulates the request parameters and provides a structured way to pass them to the LLM service.
  - The creation process should map the JSON fields to the corresponding attributes of the LLMRequest object, as shown in the sketch after this list.
- Stream LLMResponse objects as NDJSON:
  - This is the core of the streaming functionality. When the stream parameter is set to true, the server should stream LLMResponse objects as NDJSON: a sequence of chunks, each a complete JSON object on its own line, sent to the client as they become available.
  - The streaming can be implemented using generators (synchronous or asynchronous) or other streaming techniques to send data chunks efficiently, as illustrated in the server.py section above.
- Handle non-streaming requests:
  - For requests where the stream parameter is set to false or is omitted, the server should return a single JSON object as the response, so the endpoint supports both streaming and non-streaming modes.
  - The response should be formatted as a standard JSON object containing the necessary data.
- Implement error handling:
  - The server should catch exceptions, format error messages as JSON, and return appropriate HTTP error codes to the client. (A small helper illustrating this mapping also follows the list.)
  - Error scenarios include invalid JSON payloads, LLM service failures, and other unexpected issues.
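The following sketch illustrates the validation and mapping steps (steps two and three above). The required fields and defaults are taken from this article rather than from airunner's actual code, and build_llm_request reuses the simplified LLMRequest dataclass from the llm_services sketch earlier; the real constructor may accept different arguments.

```python
from dataclasses import fields

# LLMRequest here refers to the simplified dataclass from the llm_services
# sketch above; airunner's real class may differ.


def validate_payload(payload) -> list[str]:
    """Return a list of validation errors; an empty list means the payload is acceptable."""
    if not isinstance(payload, dict):
        return ["payload must be a JSON object"]
    errors = []
    prompt = payload.get("prompt")
    if not isinstance(prompt, str) or not prompt.strip():
        errors.append("'prompt' must be a non-empty string")
    if "action" in payload and not isinstance(payload["action"], str):
        errors.append("'action' must be a string")
    if "llm_request" in payload and not isinstance(payload["llm_request"], dict):
        errors.append("'llm_request' must be a JSON object")
    if "stream" in payload and not isinstance(payload["stream"], bool):
        errors.append("'stream' must be a boolean")
    return errors


def build_llm_request(payload: dict) -> "LLMRequest":
    """Map validated JSON fields onto the (simplified) LLMRequest dataclass."""
    overrides = payload.get("llm_request") or {}
    allowed = {f.name for f in fields(LLMRequest)}
    kwargs = {k: v for k, v in overrides.items() if k in allowed}
    kwargs.setdefault("prompt", payload["prompt"])
    kwargs.setdefault("action", payload.get("action", "chat"))
    kwargs.setdefault("stream", payload.get("stream", False))
    return LLMRequest(**kwargs)
```

When validate_payload returns a non-empty list, the handler would respond with HTTP 400 and the collected errors; otherwise the resulting LLMRequest is handed to the service layer.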
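For the error-handling step, one workable pattern is a single translation point from exceptions to HTTP status codes and JSON bodies. The exception types below are generic assumptions and should be replaced with whatever the airunner LLM layer actually raises.

```python
def error_response(exc: Exception) -> tuple[int, dict]:
    """Translate an exception into an (HTTP status, JSON body) pair."""
    if isinstance(exc, ValueError):  # includes json.JSONDecodeError, a ValueError subclass
        return 400, {"error": f"Invalid request: {exc}"}
    if isinstance(exc, TimeoutError):
        return 504, {"error": "The LLM service timed out"}
    return 500, {"error": f"Internal error: {exc}"}
```

The handler can then call status, body = error_response(exc) and pass both to a JSON-writing helper such as the _send_json sketch shown earlier.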
By following these steps, the /llm/generate HTTP endpoint with streaming can be successfully implemented, providing a robust and efficient interface for LLM interactions.
Estimated Time and Priority
The estimated time for implementing the /llm/generate HTTP endpoint with streaming is 16 hours. This estimate includes the time required for request parsing, payload validation, LLMRequest creation, response streaming, handling non-streaming requests, and implementing error handling. The estimate also accounts for testing and debugging to ensure the endpoint functions correctly.
The priority for this task is P1, indicating that it is a high-priority item. This is because the streaming functionality is crucial for applications requiring real-time responses from LLM services, enhancing user experience and reducing latency. Addressing this feature promptly will enable the application to better serve its users and meet its performance goals.
Conclusion
Implementing the /llm/generate HTTP endpoint with streaming is a significant enhancement that enables real-time interaction with Language Model (LLM) services. This article has provided an overview of the implementation: the key components and files involved, the acceptance criteria, and the steps covering request parsing, payload validation, LLMRequest creation, response streaming, and error handling. The estimated time for this task is 16 hours, and the priority is P1, reflecting its importance for the application's performance and user experience. The feature will significantly improve the application's ability to handle real-time requests and provide timely responses to users. For comparable streaming-response conventions, the OpenAI API documentation can serve as a useful additional reference.