📄 Paper Explainer: SGLang - Automatic Optimization of LLM Prefix Caching with RadixAttention

This article explains SGLang: Efficient Execution of Structured Language Model Programs (arXiv:2405.14532). Paper overview (Abstract): SGLang (Structured Generation Language) is a framework for efficiently executing structured LLM inference programs...

03/05/2026 blog paper

prompt-caching KV-cache LLM-inference +7

📄 Paper Explainer: FrugalGPT - A Framework for 98% Cost Reduction with LLM Cascades

This article explains FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance (arXiv:2305.14895). Paper overview (Abstract): FrugalGPT, by Stanford University's Lingjiao Chen, Matei Zaharia, James Z...

03/05/2026 blog paper

LLM cost-optimization cascade +3

📄 Paper Explainer: Prompt Cache - Modular Attention Reuse for Low-Latency Inference

This article explains Prompt Cache: Modular Attention Reuse for Low-Latency Inference (arXiv:2407.00079). Paper overview (Abstract): In LLM inference, Prompt Cache takes text segments shared across prompts (system prompts, reference documents, and so on) and their attention sta...

03/05/2026 blog paper

prompt-caching KV-cache LLM-inference +6

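📄 Paper Explainer: RouteLLM - An LLM Routing Framework Based on Preference Data

This article explains RouteLLM: Learning to Route LLMs with Preference Data (arXiv:2406.18665). Paper overview (Abstract): RouteLLM is an LLM routing framework proposed by the LMSYS team (the developers of Chatbot Arena). In settings where multiple LLMs are available, for each request, the "strong mod...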

03/05/2026 blog paper

LLM routing model-selection +3

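📄 Paper Explainer: S-LoRA - A Memory Management Technique for Serving Thousands of LoRA Adapters Concurrently

This article explains S-LoRA: Serving Thousands of Concurrent LoRA Adapters (arXiv 2023). Paper overview (Abstract): S-LoRA is a system that serves thousands of LoRA (Low-Rank Adaptation) adapters concurrently on a single GPU server. The authors take both the KV cache and the LoRA adapter weights and pagin...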

02/05/2026 blog paper

S-LoRA LoRA LLM +5

✍️ Google Cloud Explainer: llm-d - An Open-Source Project Extending vLLM to Kubernetes-Native Distributed Inference

This article explains Introducing the next generation of AI inference, powered by llm-d (Google Cloud Blog, May 21, 2025). Blog overview (Summary): Google Cloud's Mark Lohmeyer (VP, AI and Computing Infrastructure) and Gabe M...

02/05/2026 blog tech_blog

GoogleCloud llm-d vLLM +5

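📄 Paper Explainer: DistServe - Optimizing LLM Serving Goodput by Disaggregating Prefill and Decode

This article explains DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving (arXiv 2024). Paper overview (Abstract): DistServe disaggregates the Prefill (prompt processing) and Decode (token generation) phases of LLM inference onto physically separate G...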

02/05/2026 blog paper

DistServe LLM inference +5

✍️ Meta Engineering Explainer: Scaling LLM Inference - Innovations in Tensor/Context/Expert Parallelism

This article explains Scaling LLM Inference: Innovations in Tensor Parallelism, Context Parallelism, and Expert Parallelism (Meta Engineering Blog, October 17, 2025). Blog overview (Summary): Meta's AI Research team (Cen Zhao,...

02/05/2026 blog tech_blog

Meta LLM inference +5

📄 Paper Explainer: Efficient Memory Management for Large Language Model Serving with PagedAttention

This article explains Efficient Memory Management for Large Language Model Serving with PagedAttention (SOSP 2023). Paper overview (Abstract): For the KV cache memory management problem in LLM serving, this paper takes inspiration from operating-system virtual memory and paging in its PagedAttention al...

02/05/2026 blog paper

vLLM PagedAttention LLM +4

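📄 Paper Explainer: AgentSquare - Automatic Search for LLM Agents in a Modular Design Space

This article explains arXiv:2501.12599. Paper overview (Abstract): Chen, Yuan, Song et al. (2025) proposed AgentSquare, a framework that modularizes LLM agent design and automatically searches for the best configuration. It decomposes an agent system into four modules (Planning, Reasoning, Tool Use, and Memory) and (1) mod...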

01/05/2026 blog paper

agentsquare agent-search modular-design +2
