This repository provides a set of ROS 2 packages to integrate llama.cpp into ROS 2. Using the llama_ros packages, you can easily incorporate the powerful optimization capabilities of llama.cpp into your ROS 2 projects by running GGUF-based LLMs and VLMs. You can also use llama.cpp features such as GBNF grammars and modify LoRAs in real time.
| ROS 2 Distro | Branch | Build status | Docker Image | Documentation |
| ------------ | ------ | ------------ | ------------ | ------------- |
| Humble | main | | | |
Table of Contents
- Related Projects
- Installation
- Docker
- Usage
- llama_cli
- Launch Files
- LoRA Adapters
- ROS 2 Clients
- LangChain
- Demos
Related Projects
- chatbot_ros → This chatbot, integrated into ROS 2, uses whisper_ros to listen to people's speech and llama_ros to generate responses. The chatbot is controlled by a state machine created with YASMIN.
- explainable_ros → A ROS 2 tool to explain the behavior of a robot. Using the LangChain integration, logs are stored in a vector database. Then, RAG is applied to retrieve the logs relevant to a user question, which is answered with llama_ros.
Installation
To run llama_ros with CUDA, you must first install the CUDA Toolkit. Then, you can compile llama_ros with --cmake-args -DGGML_CUDA=ON to enable CUDA support.
cd ~/ros2_ws/src
git clone https://github.com/mgonzs13/llama_ros.git
pip3 install -r llama_ros/requirements.txt
cd ~/ros2_ws
rosdep install --from-paths src --ignore-src -r -y
colcon build --cmake-args -DGGML_CUDA=ON # add this for CUDA
Docker
Build the llama_ros Docker image or download an image from DockerHub. You can choose whether to build llama_ros with CUDA (USE_CUDA) and select the CUDA version (CUDA_VERSION). Remember that you have to use DOCKER_BUILDKIT=0 to compile llama_ros with CUDA when building the image.
DOCKER_BUILDKIT=0 docker build -t llama_ros --build-arg USE_CUDA=1 --build-arg CUDA_VERSION=12-6 .
Run the Docker container. If you want to use CUDA, you have to install the NVIDIA Container Toolkit and add --gpus all.
docker run -it --rm --gpus all llama_ros
Usage
llama_cli
Commands are included in llama_ros to speed up the testing of GGUF-based LLMs within the ROS 2 ecosystem. This way, the following commands are integrated into the ROS 2 command line:
launch
This command launches an LLM from a YAML file. The YAML configuration is used to launch the LLM in the same way as a regular launch file. Here is an example of how to use it:
ros2 llama launch ~/ros2_ws/src/llama_ros/llama_bringup/models/StableLM-Zephyr.yaml
prompt
This command sends a prompt to a launched LLM. The command takes a string (the prompt) and has the following arguments:
- (-r, --reset): Whether to reset the LLM before prompting
- (-t, --temp): The temperature value
- (--image-url): Image URL to send to a VLM
Here is an example of how to use it:
ros2 llama prompt "Do you know ROS 2?" -t 0.0
Launch Files
First of all, you need to create a launch file to use llama_ros or llava_ros. This launch file will contain the main parameters to download the model from HuggingFace and configure it. Take a look at the following examples and the predefined launch files.
llama_ros (Python Launch)
Click to expand
from launch import LaunchDescription
from llama_bringup.utils import create_llama_launch
def generate_launch_description():
return LaunchDescription([
create_llama_launch(
n_ctx=2048,
n_batch=8,
n_gpu_layers=0,
n_threads=1,
n_predict=2048,
model_repo="TheBloke/Marcoroni-7B-v3-GGUF",
model_filename="marcoroni-7b-v3.Q4_K_M.gguf",
system_prompt_type="Alpaca"
)
])
ros2 launch llama_bringup marcoroni.launch.py
llama_ros (YAML Config)
Click to expand
n_ctx: 2048 # context of the LLM in tokens
n_batch: 8 # batch size in tokens
n_gpu_layers: 0 # layers to load in GPU
n_threads: 1 # threads
n_predict: 2048 # max tokens, -1 == inf
model_repo: "cstr/Spaetzle-v60-7b-GGUF" # Hugging Face repo
model_filename: "Spaetzle-v60-7b-q4-k-m.gguf" # model file in repo
system_prompt_type: "Alpaca" # system prompt type
import os
from launch import LaunchDescription
from ament_index_python.packages import get_package_share_directory
from llama_bringup.utils import create_llama_launch_from_yaml
def generate_launch_description():
return LaunchDescription([
create_llama_launch_from_yaml(os.path.join(
get_package_share_directory("llama_bringup"), "models", "Spaetzle.yaml"))
])
ros2 launch llama_bringup spaetzle.launch.py
llama_ros (YAML Config + model shards)
Click to expand
n_ctx: 2048 # context of the LLM in tokens
n_batch: 8 # batch size in tokens
n_gpu_layers: 0 # layers to load in GPU
n_threads: 1 # threads
n_predict: 2048 # max tokens, -1 == inf
model_repo: "Qwen/Qwen2.5-Coder-7B-Instruct-GGUF" # Hugging Face repo
model_filename: "qwen2.5-coder-7b-instruct-q4_k_m-00001-of-00002.gguf" # model shard file in repo
system_prompt_type: "ChatML" # system prompt type
ros2 llama launch Qwen2.yaml
llava_ros (Python Launch)
Click to expand
from launch import LaunchDescription
def generate_launch_description():
return LaunchDescription([
create_llama_launch(
use_llava=True,
n_ctx=8192,
n_batch=512,
n_gpu_layers=33,
n_threads=1,
n_predict=8192,
model_repo="cjpais/llava-1.6-mistral-7b-gguf",
model_filename="llava-v1.6-mistral-7b.Q4_K_M.gguf",
mmproj_repo="cjpais/llava-1.6-mistral-7b-gguf",
mmproj_filename="mmproj-model-f16.gguf",
system_prompt_type="Mistral"
)
])
ros2 launch llama_bringup llava.launch.py
llava_ros (YAML Config)
Click to expand
use_llava: True # enable llava
n_ctx: 8192 # context of the LLM in tokens; use a large context size to load images
n_batch: 512 # batch size in tokens
n_gpu_layers: 33 # layers to load in GPU
n_threads: 1 # threads
n_predict: 8192 # max tokens, -1 == inf
model_repo: "cjpais/llava-1.6-mistral-7b-gguf" # Hugging Face repo
model_filename: "llava-v1.6-mistral-7b.Q4_K_M.gguf" # model file in repo
mmproj_repo: "cjpais/llava-1.6-mistral-7b-gguf" # Hugging Face repo
mmproj_filename: "mmproj-model-f16.gguf" # mmproj file in repo
system_prompt_type: "mistral" # system prompt type
import os
from launch import LaunchDescription
from ament_index_python.packages import get_package_share_directory
from llama_bringup.utils import create_llama_launch_from_yaml
def generate_launch_description():
return LaunchDescription([
create_llama_launch_from_yaml(os.path.join(
get_package_share_directory("llama_bringup"),
"models", "llava-1.6-mistral-7b-gguf.yaml"))
])
ros2 launch llama_bringup llava.launch.py
LoRA Adapters
You can use LoRA adapters when launching LLMs. Using llama.cpp features, you can load multiple adapters and choose the scale to apply to each one. Here is an example of using LoRA adapters with Phi-3. You can list the LoRAs using the /llama/list_loras service and modify their scale values with the /llama/update_loras service. A scale value of 0.0 means that LoRA is not used. A client sketch for these services is shown after the configuration below.
Click to expand
n_ctx: 2048
n_batch: 8
n_gpu_layers: 0
n_threads: 1
n_predict: 2048
model_repo: "bartowski/Phi-3.5-mini-instruct-GGUF"
model_filename: "Phi-3.5-mini-instruct-Q4_K_M.gguf"
lora_adapters:
- repo: "zhhan/adapter-Phi-3-mini-4k-instruct_code_writing"
filename: "Phi-3-mini-4k-instruct-adaptor-f16-code_writer.gguf"
scale: 0.5
- repo: "zhhan/adapter-Phi-3-mini-4k-instruct_summarization"
filename: "Phi-3-mini-4k-instruct-adaptor-f16-summarization.gguf"
scale: 0.5
system_prompt_type: "Phi-3"
ROS 2 Clients
Both llama_ros and llava_ros provide ROS 2 interfaces to access the main functionalities of the models. Here you have some examples of how to use them inside ROS 2 nodes. Moreover, take a look at the llama_demo_node.py and llava_demo_node.py demos.
Tokenize
Click to expand
from rclpy.node import Node
from llama_msgs.srv import Tokenize
class ExampleNode(Node):
def __init__(self) -> None:
super().__init__("example_node")
self.srv_client = self.create_client(Tokenize, "/llama/tokenize")
req = Tokenize.Request()
req.text = "Example text"
self.srv_client.wait_for_service()
tokens = self.srv_client.call(req).tokens
Detokenize
Click to expand
from rclpy.node import Node
from llama_msgs.srv import Detokenize
class ExampleNode(Node):
def __init__(self) -> None:
super().__init__("example_node")
self.srv_client = self.create_client(Detokenize, "/llama/detokenize")
req = Detokenize.Request()
req.tokens = [123, 123]
self.srv_client.wait_for_service()
text = self.srv_client.call(req).text
Embeddings
Click to expand
Remember to launch llama_ros with embedding set to true to be able to generate embeddings with your LLM.
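For instance, a Python launch sketch could look like the following; this assumes create_llama_launch forwards an embedding parameter, and the model repo and filename are hypothetical placeholders you should replace with your own GGUF embedding model.
from launch import LaunchDescription
from llama_bringup.utils import create_llama_launch


def generate_launch_description():
    return LaunchDescription([
        create_llama_launch(
            embedding=True,  # assumed parameter: required to generate embeddings
            n_ctx=512,
            n_batch=8,
            n_gpu_layers=0,
            model_repo="your/embedding-model-GGUF",  # hypothetical placeholder
            model_filename="your-embedding-model.gguf"  # hypothetical placeholder
        )
    ])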
from rclpy.node import Node
from llama_msgs.srv import Embeddings
class ExampleNode(Node):
def __init__(self) -> None:
super().__init__("example_node")
self.srv_client = self.create_client(Embeddings, "/llama/generate_embeddings")
req = Embeddings.Request()
req.prompt = "Example text"
req.normalize = True
self.srv_client.wait_for_service()
embeddings = self.srv_client.call(req).embeddings
Generate Response
Click to expand
import rclpy
from rclpy.node import Node
from rclpy.action import ActionClient
from llama_msgs.action import GenerateResponse
class ExampleNode(Node):
def __init__(self) -> None:
super().__init__("example_node")
self.action_client = ActionClient(
self, GenerateResponse, "/llama/generate_response")
goal = GenerateResponse.Goal()
goal.prompt = self.prompt
goal.sampling_config.temp = 0.2
self.action_client.wait_for_server()
send_goal_future = self.action_client.send_goal_async(
goal)
rclpy.spin_until_future_complete(self, send_goal_future)
get_result_future = send_goal_future.result().get_result_async()
rclpy.spin_until_future_complete(self, get_result_future)
result: GenerateResponse.Result = get_result_future.result().result
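The goal's sampling_config mirrors the llama.cpp sampling parameters, which is also where the GBNF grammars mentioned in the introduction come in. A minimal sketch, assuming SamplingConfig exposes a grammar string field (check llama_msgs/msg/SamplingConfig.msg for the exact name):
# same ActionClient setup as in the example above; only the goal changes
goal = GenerateResponse.Goal()
goal.prompt = "Is Madrid the capital of Spain?"
goal.sampling_config.temp = 0.0
# assumed field: a GBNF grammar restricting the answer to "yes" or "no"
goal.sampling_config.grammar = 'root ::= "yes" | "no"'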
Generate Response (llava)
Click to expand
import cv2
from cv_bridge import CvBridge
import rclpy
from rclpy.node import Node
from rclpy.action import ActionClient
from llama_msgs.action import GenerateResponse
class ExampleNode(Node):
def __init__(self) -> None:
super().__init__("example_node")
self.cv_bridge = CvBridge()
self.action_client = ActionClient(
self, GenerateResponse, "/llama/generate_response")
goal = GenerateResponse.Goal()
goal.prompt = self.prompt
goal.sampling_config.temp = 0.2
image = cv2.imread("/path/to/your/image", cv2.IMREAD_COLOR)
goal.image = self.cv_bridge.cv2_to_imgmsg(image)
self.action_client.wait_for_server()
send_goal_future = self.action_client.send_goal_async(
goal)
rclpy.spin_until_future_complete(self, send_goal_future)
get_result_future = send_goal_future.result().get_result_async()
rclpy.spin_until_future_complete(self, get_result_future)
result: GenerateResponse.Result = get_result_future.result().result
LangChain
There is a llama_ros integration for LangChain. Thus, prompt engineering techniques can be applied. Here are some examples of how to use it.
llama_ros (Chain)
Click to expand
import rclpy
from langchain.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser
from llama_ros.langchain import LlamaROS
rclpy.init()
llm = LlamaROS()
prompt_template = "tell me a joke about {topic}"
prompt = PromptTemplate(
input_variables=["topic"],
template=prompt_template
)
chain = prompt | llm | StrOutputParser()
text = chain.invoke({"topic": "bears"})
print(text)
rclpy.shutdown()
llama_ros (Stream)
Click to expand
import rclpy
from langchain.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser
from llama_ros.langchain import LlamaROS
rclpy.init()
llm = LlamaROS()
prompt_template = "tell me a joke about {topic}"
prompt = PromptTemplate(
input_variables=["topic"],
template=prompt_template
)
chain = prompt | llm | StrOutputParser()
for c in chain.stream({"topic": "bears"}):
print(c, flush=True, end="")
rclpy.shutdown()
llava_ros
Click to expand
import rclpy
from llama_ros.langchain import LlamaROS
rclpy.init()
llm = LlamaROS()
image_url = "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
llm = llm.bind(image_url=image_url).stream("Describe the image")
for c in llm:
print(c, flush=True, end="")
rclpy.shutdown()
llama_ros_embeddings (RAG)
Click to expand
import rclpy
from langchain_chroma import Chroma
from llama_ros.langchain import LlamaROSEmbeddings
rclpy.init()
embeddings = LlamaROSEmbeddings()
db = Chroma(embedding_function=embeddings)
retriever = db.as_retriever(search_kwargs={"k": 5})
db.add_texts(texts=["your_texts"])
documents = retriever.invoke("your_query")
print(documents)
rclpy.shutdown()
llama_ros (Reranker)
Click to expand
import rclpy
from langchain_community.vectorstores import FAISS
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain.retrievers import ContextualCompressionRetriever
from llama_ros.langchain import LlamaROSEmbeddings, LlamaROSReranker
rclpy.init()
documents = TextLoader("../state_of_the_union.txt").load()
text_splitter = RecursiveCharacterTextSplitter(
chunk_size=500, chunk_overlap=100)
texts = text_splitter.split_documents(documents)
embeddings = LlamaROSEmbeddings()
retriever = FAISS.from_documents(
texts, embeddings).as_retriever(search_kwargs={"k": 20})
compressor = LlamaROSReranker()
compression_retriever = ContextualCompressionRetriever(
base_compressor=compressor, base_retriever=retriever
)
compressed_docs = compression_retriever.invoke(
"What did the president say about Ketanji Jackson Brown"
)
for doc in compressed_docs:
print("-" * 50)
print(doc.page_content)
print("\n")
rclpy.shutdown()
llama_ros (LLM + RAG + Reranker)
Click to expand
import bs4
import rclpy
from langchain_chroma import Chroma
from langchain_community.document_loaders import WebBaseLoader
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_core.messages import SystemMessage
from langchain_core.prompts import ChatPromptTemplate, HumanMessagePromptTemplate
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain.retrievers import ContextualCompressionRetriever
from llama_ros.langchain import ChatLlamaROS, LlamaROSEmbeddings, LlamaROSReranker
rclpy.init()
loader = WebBaseLoader(
web_paths=("https://lilianweng.github.io/posts/2023-06-23-agent/",),
bs_kwargs=dict(
parse_only=bs4.SoupStrainer(class_=("post-content", "post-title", "post-header"))
),
)
docs = loader.load()
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
splits = text_splitter.split_documents(docs)
vectorstore = Chroma.from_documents(documents=splits, embedding=LlamaROSEmbeddings())
retriever = vectorstore.as_retriever(search_kwargs={"k": 20})
prompt = ChatPromptTemplate.from_messages(
[
SystemMessage("You are an AI assistant that answer questions briefly."),
HumanMessagePromptTemplate.from_template(
"Taking into account the followin information:{context}\n\n{question}"
),
]
)
compressor = LlamaROSReranker(top_n=3)
compression_retriever = ContextualCompressionRetriever(
base_compressor=compressor, base_retriever=retriever
)
def format_docs(docs):
formated_docs = ""
for d in docs:
formated_docs += f"\n\n\t- {d.page_content}"
return formated_docs
rag_chain = (
{"context": compression_retriever | format_docs, "question": RunnablePassthrough()}
| prompt
| ChatLlamaROS(temp=0.0)
| StrOutputParser()
)
for c in rag_chain.stream("What is Task Decomposition?"):
print(c, flush=True, end="")
rclpy.shutdown()
chat_llama_ros (Chat + VLM)
Click to expand
import rclpy
from langchain_core.messages import SystemMessage
from langchain_core.prompts import ChatPromptTemplate, HumanMessagePromptTemplate
from langchain_core.output_parsers import StrOutputParser
from llama_ros.langchain import ChatLlamaROS
rclpy.init()
chat = ChatLlamaROS(
temp=0.2,
penalty_last_n=8,
)
prompt = ChatPromptTemplate.from_messages([
SystemMessage("You are a IA that just answer with a single word."),
HumanMessagePromptTemplate.from_template(template=[
{"type": "text", "text": "<image>Who is the character in the middle of the image?"},
{"type": "image_url", "image_url": "{image_url}"}
])
])
chain = prompt | chat | StrOutputParser()
for text in chain.stream({"image_url": "https://pics.filmaffinity.com/Dragon_Ball_Bola_de_Dragaon_Serie_de_TV-973171538-large.jpg"}):
print(text, end="", flush=True)
print("", end="\n", flush=True)
rclpy.shutdown()
chat_llama_ros (Tools)
Click to expand
The current implementation of Tools allows executing tools without requiring a model trained for that task.
import time
from random import randint
import rclpy
from langchain.tools import tool
from langchain_core.messages import HumanMessage
from llama_ros.langchain import ChatLlamaROS
rclpy.init()
@tool
def get_inhabitants(city: str) -> int:
"""Get the current temperature of a city"""
return randint(4_000_000, 8_000_000)
@tool
def get_curr_temperature(city: str) -> int:
"""Get the current temperature of a city"""
return randint(20, 30)
chat = ChatLlamaROS(temp=0.6, penalty_last_n=8, use_default_template=True)
messages = [
HumanMessage(
"What is the current temperature in Madrid? And its inhabitants?"
)
]
llm_tools = chat.bind_tools(
[get_inhabitants, get_curr_temperature], tool_choice='any'
)
all_tools_res = llm_tools.invoke(messages)
messages.append(all_tools_res)
for tool in all_tools_res.tool_calls:
selected_tool = {
"get_inhabitants": get_inhabitants, "get_curr_temperature": get_curr_temperature
}[tool['name']]
tool_msg = selected_tool.invoke(tool)
formatted_output = f"{tool['name']}({''.join(tool['args'].values())}) = {tool_msg.content}"
tool_msg.additional_kwargs = {'args': tool['args']}
messages.append(tool_msg)
res = chat.invoke(messages)
print(f"Response: {res.content}")
rclpy.shutdown()
chat_llama_ros (langgraph)
Click to expand
import time
from random import randint
import rclpy
from langchain.tools import tool
from langchain_core.messages import HumanMessage
from langgraph.prebuilt import create_react_agent
from llama_ros.langchain import ChatLlamaROS
rclpy.init()
@tool
def get_inhabitants(city: str) -> int:
"""Get the current temperature of a city"""
return randint(4_000_000, 8_000_000)
@tool
def get_curr_temperature(city: str) -> int:
"""Get the current temperature of a city"""
return randint(20, 30)
chat = ChatLlamaROS(temp=0.0, use_default_template=True)
agent_executor = create_react_agent(
chat, [get_inhabitants, get_curr_temperature]
)
response = agent_executor.invoke(
{
"messages": [
HumanMessage(
content="What is the current temperature in Madrid? And its inhabitants?"
)
]
}
)
print(f"Response: {response['messages'][-1].content}")
rclpy.shutdown()
Demos
LLM Demo
ros2 launch llama_bringup spaetzle.launch.py
ros2 run llama_demos llama_demo_node --ros-args -p prompt:="your prompt"
https://github.com/mgonzs13/llama_ros/assets/25979134/9311761b-d900-4e58-b9f8-11c8efefdac4
Embeddings Generation Demo
ros2 llama launch ~/ros2_ws/src/llama_ros/llama_bringup/models/bge-base-en-v1.5.yaml
ros2 run llama_demos llama_embeddings_demo_node
https://github.com/user-attachments/assets/7d722017-27dc-417c-ace7-bf6b747e4ced
Reranking Demo
ros2 llama launch ~/ros2_ws/src/llama_ros/llama_bringup/models/jina-reranker.yaml
ros2 run llama_demos llama_rerank_demo_node
https://github.com/user-attachments/assets/4b4adb4d-7c70-43ea-a2c1-9be57d211484
VLM Demo
ros2 launch llama_bringup minicpm-2.6.launch.py
ros2 run llama_demos llava_demo_node --ros-args -p prompt:="your prompt" -p image_url:="url of the image" -p use_image:="whether to send the image"
https://github.com/mgonzs13/llama_ros/assets/25979134/4a9ef92f-9099-41b4-8350-765336e3503c
Chat Template Demo
ros2 llama launch MiniCPM-2.6.yaml
Click to expand MiniCPM-2.6.yaml
use_llava: True
n_ctx: 8192
n_batch: 512
n_gpu_layers: 20
n_threads: -1
n_predict: 8192
image_prefix: "<image>"
image_suffix: "</image>"
model_repo: "openbmb/MiniCPM-V-2_6-gguf"
model_filename: "ggml-model-Q4_K_M.gguf"
mmproj_repo: "openbmb/MiniCPM-V-2_6-gguf"
mmproj_filename: "mmproj-model-f16.gguf"
stopping_words: ["<|im_end|>"]
ros2 run llama_demos chatllama_demo_node
ChatLlamaROS demo
Tools Demo
ros2 llama launch MiniCPM-2.6.yaml
ros2 run llama_demos chatllama_tools_demo_node
Tools ChatLlama
Langgraph Demo
ros2 llama launch Qwen2.yaml
Click to expand Qwen2.yaml
n_ctx: 4096
n_batch: 256
n_gpu_layers: 29
n_threads: -1
n_predict: -1
model_repo: "Qwen/Qwen2.5-Coder-7B-Instruct-GGUF"
model_filename: "qwen2.5-coder-7b-instruct-q4_k_m-00001-of-00002.gguf"
stopping_words: ["<|im_end|>"]
ros2 run llama_demos chatllama_langgraph_demo_node
Langgraph ChatLlama
RAG Demo (LLM + chat template + RAG + Reranking + Stream)
ros2 llama launch ~/ros2_ws/src/llama_ros/llama_bringup/models/bge-base-en-v1.5.yaml
ros2 llama launch ~/ros2_ws/src/llama_ros/llama_bringup/models/jina-reranker.yaml
ros2 llama launch Qwen2.yaml
Click to expand Qwen2.yaml
n_ctx: 4096
n_batch: 256
n_gpu_layers: 29
n_threads: -1
n_predict: -1
model_repo: "Qwen/Qwen2.5-Coder-3B-Instruct-GGUF"
model_filename: "qwen2.5-coder-3b-instruct-q4_k_m.gguf"
stopping_words: ["<|im_end|>"]
ros2 run llama_demos llama_rag_demo_node
https://github.com/user-attachments/assets/b4e3957d-1f92-427b-a1a8-cfc76737c0d6