[RFC] 115 - File Upload Revamp #7753

arvinxx · 2025-05-08T17:46:41Z

arvinxx
May 8, 2025
Maintainer

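Background

[RFC] 114 - Knowledge Base 2.0 #7752

Design approach

Approach for reworking file upload:

Add a new document parsing pipeline that converts files, crawled web pages, and (in the future) API responses and other input types into a unified, generic lobe document;
All files a user uploads in a conversation go through the file-to-document pipeline, without the knowledge base's chunking and embedding;
When reading files from a message, read the document content directly, use it as the file content field, and inject it into the file content.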
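
The three steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: the `RawInput` and `SimpleDocument` shapes, the function names `toDocument` and `injectFileContext`, and the `<file>` wrapper format are all hypothetical and are not taken from the LobeChat codebase.

```typescript
// Hypothetical types for illustration; not the actual LobeChat definitions.
type SourceType = 'file' | 'web' | 'api';

interface RawInput {
  sourceType: SourceType;
  source: string; // file path or URL
  text: string; // plain text that a parser or crawler has already extracted
}

interface SimpleDocument {
  source: string;
  sourceType: SourceType;
  content: string;
  totalCharCount: number;
  totalLineCount: number;
}

// Step 1: normalize any input type into one document shape.
function toDocument(input: RawInput): SimpleDocument {
  return {
    source: input.source,
    sourceType: input.sourceType,
    content: input.text,
    totalCharCount: input.text.length,
    totalLineCount: input.text.split('\n').length,
  };
}

// Steps 2 and 3: no chunking or embedding; the full document content is
// placed in front of the user message. The <file> wrapper is an assumption.
function injectFileContext(userMessage: string, docs: SimpleDocument[]): string {
  if (docs.length === 0) return userMessage;
  const files = docs
    .map((doc) => `<file source="${doc.source}">\n${doc.content}\n</file>`)
    .join('\n');
  return `${files}\n\n${userMessage}`;
}
```

Because the whole content is injected, the prompt grows with the document size, which is the trade-off discussed later in this thread.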

Progress

✨ feat: support upload files direct into context #7751

arvinxx · 2025-05-08T18:19:13Z

arvinxx
May 8, 2025
Maintainer Author

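3.7 design proposal:

Document table design proposal

1. Database model design

Based on the project's existing database structure and requirements, I designed the documents table and its related association tables. The design takes the following into account:

Supports storing local files and web search results
Can be associated with a topic as context
Can be associated with a file (optional)
Follows the existing database design style and conventions

documents table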

/* eslint-disable sort-keys-fix/sort-keys-fix  */
import {
  index,
  integer,
  jsonb,
  pgTable,
  text,
  uniqueIndex,
  varchar,
} from 'drizzle-orm/pg-core';
import { createInsertSchema } from 'drizzle-zod';

import { idGenerator } from '@/database/utils/idGenerator';
import { LobeDocumentPage } from '@/types/document';

import { timestamps } from './_helpers';
import { files } from './file';
import { users } from './user';

export const documents = pgTable(
  'documents',
  {
    id: text('id')
      .$defaultFn(() => idGenerator('documents'))
      .primaryKey(),
    
    // Basic information
    title: text('title'),
    content: text('content'),
    source: text('source').notNull(), // file path or web page URL
    fileType: varchar('file_type', { length: 255 }).notNull(),
    filename: text('filename'),

    // Statistics
    totalCharCount: integer('total_char_count').notNull(),
    totalLineCount: integer('total_line_count').notNull(),

    // Metadata
    metadata: jsonb('metadata').$type<Record<string, any>>(),

    // Page/chunk data
    pages: jsonb('pages').$type<LobeDocumentPage[]>(),

    // Source type
    sourceType: text('source_type', { enum: ['file', 'web', 'api'] }).notNull(),

    // Associated file (optional)
    fileId: text('file_id').references(() => files.id, { onDelete: 'set null' }),

    // User association
    userId: text('user_id')
      .references(() => users.id, { onDelete: 'cascade' })
      .notNull(),
    clientId: text('client_id'),

    // Timestamps
    ...timestamps,
  },
  (table) => ({
    sourceIdx: index('documents_source_idx').on(table.source),
    fileTypeIdx: index('documents_file_type_idx').on(table.fileType),
    fileIdIdx: index('documents_file_id_idx').on(table.fileId),
    clientIdUnique: uniqueIndex('documents_client_id_user_id_unique').on(
      table.clientId, 
      table.userId
    ),
  }),
);

export type NewDocument = typeof documents.$inferInsert;
export type DocumentItem = typeof documents.$inferSelect;
export const insertDocumentSchema = createInsertSchema(documents);
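
To make the nullable and required columns concrete, here is a minimal sketch of building a row for a crawled web page. It uses a plain TypeScript mirror of the columns so that it runs on its own; in the project the type would come from `typeof documents.$inferInsert`. The helper `buildWebDocumentRow` and the choice of `'text/html'` for `fileType` are assumptions, since the proposal does not say whether `fileType` holds a MIME type or an extension.

```typescript
// Plain TypeScript mirror of the columns above, for illustration only.
const SOURCE_TYPES = ['file', 'web', 'api'] as const;
type SourceType = (typeof SOURCE_TYPES)[number];

interface NewDocumentRow {
  title?: string | null;
  content?: string | null;
  source: string; // notNull
  fileType: string; // notNull
  filename?: string | null;
  totalCharCount: number; // notNull
  totalLineCount: number; // notNull
  metadata?: Record<string, unknown> | null;
  sourceType: SourceType; // notNull, restricted to the enum values
  fileId?: string | null;
  userId: string; // notNull
}

// Runtime guard for values arriving from outside the type system.
function isSourceType(value: string): value is SourceType {
  return (SOURCE_TYPES as readonly string[]).indexOf(value) !== -1;
}

// Hypothetical helper: build a row for a fetched web page.
function buildWebDocumentRow(
  userId: string,
  url: string,
  title: string,
  content: string,
): NewDocumentRow {
  return {
    title,
    content,
    source: url,
    fileType: 'text/html', // assumption: the RFC does not define the value format
    totalCharCount: content.length,
    totalLineCount: content.split('\n').length,
    sourceType: 'web',
    fileId: null, // web documents have no row in `files`
    userId,
  };
}
```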

document_topics join table

To model the many-to-many relationship between documents and topics, we need a join table:

import { pgTable, primaryKey, text } from 'drizzle-orm/pg-core';

import { createdAt } from './_helpers';
import { documents } from './document';
import { topics } from './topic';
import { users } from './user';

export const documentTopics = pgTable(
  'document_topics',
  {
    documentId: text('document_id')
      .notNull()
      .references(() => documents.id, { onDelete: 'cascade' }),
    
    topicId: text('topic_id')
      .notNull()
      .references(() => topics.id, { onDelete: 'cascade' }),
    
    userId: text('user_id')
      .references(() => users.id, { onDelete: 'cascade' })
      .notNull(),
    
    createdAt: createdAt(),
  },
  (t) => ({
    pk: primaryKey({ columns: [t.documentId, t.topicId] }),
  }),
);

export type NewDocumentTopic = typeof documentTopics.$inferInsert;
export type DocumentTopicItem = typeof documentTopics.$inferSelect;
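
The composite primary key on (documentId, topicId) means a given document can be linked to a given topic only once. The in-memory sketch below mimics that behavior, along with the cascade on document deletion, without touching a database. The class `DocumentTopicStore` and its method names are hypothetical and exist only for illustration.

```typescript
interface DocumentTopicLink {
  documentId: string;
  topicId: string;
  userId: string;
  createdAt: Date;
}

class DocumentTopicStore {
  private links = new Map<string, DocumentTopicLink>();

  // Mirrors the composite primary key (documentId, topicId).
  private key(documentId: string, topicId: string): string {
    return JSON.stringify([documentId, topicId]);
  }

  // Returns false when the pair already exists, like an insert that skips on conflict.
  link(documentId: string, topicId: string, userId: string): boolean {
    const key = this.key(documentId, topicId);
    if (this.links.has(key)) return false;
    this.links.set(key, { documentId, topicId, userId, createdAt: new Date() });
    return true;
  }

  topicsOf(documentId: string): string[] {
    const result: string[] = [];
    this.links.forEach((link) => {
      if (link.documentId === documentId) result.push(link.topicId);
    });
    return result;
  }

  documentsOf(topicId: string): string[] {
    const result: string[] = [];
    this.links.forEach((link) => {
      if (link.topicId === topicId) result.push(link.documentId);
    });
    return result;
  }

  // Mimics onDelete: 'cascade' when a document is removed.
  removeDocument(documentId: string): void {
    const stale: string[] = [];
    this.links.forEach((link, key) => {
      if (link.documentId === documentId) stale.push(key);
    });
    stale.forEach((key) => this.links.delete(key));
  }
}
```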

Relation definitions

Add the related relation definitions in relations.ts:

// Add to the relations.ts file

import { documents, documentTopics } from './document';

// Relation between Document and File
export const documentsRelations = relations(documents, ({ one, many }) => ({
  file: one(files, {
    fields: [documents.fileId],
    references: [files.id],
  }),
  topics: many(documentTopics),
}));

// Association between Document and Topic
export const documentTopicsRelations = relations(documentTopics, ({ one }) => ({
  document: one(documents, {
    fields: [documentTopics.documentId],
    references: [documents.id],
  }),
  topic: one(topics, {
    fields: [documentTopics.topicId],
    references: [topics.id],
  }),
}));

// Add the documents relation to topicRelations
export const topicRelations = relations(topics, ({ one, many }) => ({
  session: one(sessions, {
    fields: [topics.sessionId],
    references: [sessions.id],
  }),
  documents: many(documentTopics),
}));

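2. Business logic design

Document business logic

Multiple document sources:
- Local files: stored after being processed by a file parser
- Web search results: stored after the page is crawled
- API responses: content fetched from external APIs
Association with Topic:
- A Document can be associated with multiple Topics
- A Topic can have multiple Documents associated as context
- The documentTopics table maintains this many-to-many relationship
Association with File:
- When a Document comes from a local file, it can be linked to a record in the files table
- When a Document comes from the Web or an API, fileId can be null
Structured document content:
- The full content of the document is stored in the content field
- The document is split into multiple logical units/pages/chunks, stored in the pages field
- Each page/chunk carries its own content, metadata, and statistics
Document metadata:
- Stores document-level metadata such as author and title
- For Web content, it can store the site name, publication date, and so on
- For files, it can store file attribute information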
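
As an illustration of the "structured document content" point above, the sketch below splits a document into pages that each carry their own content, metadata, and statistics. The `PageSketch` shape and the fixed lines-per-page splitting rule are assumptions; the actual `LobeDocumentPage` type and the real splitting strategy are not shown in this RFC.

```typescript
// Hypothetical page shape; the real LobeDocumentPage is not defined in this RFC.
interface PageSketch {
  pageContent: string;
  charCount: number;
  lineCount: number;
  metadata: { pageNumber: number };
}

// Splits content into pages of at most `linesPerPage` lines each.
function splitIntoPages(content: string, linesPerPage: number): PageSketch[] {
  if (linesPerPage < 1) throw new RangeError('linesPerPage must be at least 1');
  const lines = content.split('\n');
  const pages: PageSketch[] = [];
  for (let start = 0; start < lines.length; start += linesPerPage) {
    const pageLines = lines.slice(start, start + linesPerPage);
    const pageContent = pageLines.join('\n');
    pages.push({
      pageContent,
      charCount: pageContent.length,
      lineCount: pageLines.length,
      metadata: { pageNumber: pages.length + 1 },
    });
  }
  return pages;
}
```

Joining the `pageContent` values with newlines reproduces the original `content`, so the two fields stay consistent under this particular splitting rule.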

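Use cases

As a knowledge source for RAG:
- A Document can be split into chunks and vectorized for retrieval-augmented generation
- It can be linked to the existing chunks and embeddings tables
As context for a Topic:
- Document content can be referenced as context in a conversation
- Questions or summaries can be generated from document content
Storing Web search results:
- Store search results so they can be referenced in conversations
- Keep the source and time information of the search results
Document management:
- Users can view and manage the documents they uploaded or searched for
- Documents can be organized into different Topics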

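3. Type definition refinements

To better support use of the Document table, we can add some helper types to the type definitions:

// Add to src/types/document.ts

/**
 * Document source type
 */
export enum DocumentSourceType {
  FILE = 'file',
  WEB = 'web',
  API = 'api',
}

/**
 * Association between a document and a Topic
 */
export interface DocumentTopicRelation {
  documentId: string;
  topicId: string;
  userId: string;
  createdAt: Date;
}

/**
 * Extended document object that includes association information
 */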
export interface DocumentWithRelations extends LobeDocument {
  topics?: Array<{
    id: string;
    title: string;
  }>;
  file?: {
    id: string;
    name: string;
    url: string;
  };
}
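
One way the optional `topics` array might be populated is by grouping the flat rows that a documents/document_topics/topics join returns. The sketch below shows that grouping step only. The `JoinedRow` and `DocumentWithTopics` shapes and the function `groupByDocument` are hypothetical, and the query that would produce such rows is not shown.

```typescript
// Hypothetical flat row, as a LEFT JOIN across the three tables might return it.
interface JoinedRow {
  documentId: string;
  documentTitle: string;
  topicId: string | null; // null when the document has no topic
  topicTitle: string | null;
}

// Trimmed-down stand-in for DocumentWithRelations.
interface DocumentWithTopics {
  id: string;
  title: string;
  topics: Array<{ id: string; title: string }>;
}

// Groups flat join rows into one object per document, keeping first-seen order.
function groupByDocument(rows: JoinedRow[]): DocumentWithTopics[] {
  const byId = new Map<string, DocumentWithTopics>();
  rows.forEach((row) => {
    let doc = byId.get(row.documentId);
    if (!doc) {
      doc = { id: row.documentId, title: row.documentTitle, topics: [] };
      byId.set(row.documentId, doc);
    }
    if (row.topicId !== null && row.topicTitle !== null) {
      doc.topics.push({ id: row.topicId, title: row.topicTitle });
    }
  });
  const result: DocumentWithTopics[] = [];
  byId.forEach((doc) => result.push(doc));
  return result;
}
```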

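Summary

This design provides a flexible and complete Document table structure that can store document content from different sources, associate documents with Topics as context, and optionally link them to files. It follows the project's existing database design style and conventions, and provides the necessary indexes and relation definitions to support efficient queries.

With this design, LobeChat will be better able to manage and use document content, whether it comes from local files or Web search results, improving contextual understanding and knowledge retrieval in conversations.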

0 replies

shog86 · 2025-05-09T03:52:46Z

shog86
May 9, 2025

Do the old data tables need to be cleaned up to make them smaller? If so, how should that be done?

0 replies

AutoCONFIG · 2025-05-11T15:09:23Z

AutoCONFIG
May 11, 2025

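Has the chunking and embedding of PDFs uploaded in the chat interface been removed in the current version? That's a good change, because chunking sometimes means the original text isn't read in full. But now some documents (certain papers, for example) exceed the model's limit and trigger an error saying the context window has been exceeded. Could a fallback mechanism be added? For example, when a document exceeds the context window limit, fall back to the previous chunking, or automatically ignore the overflow.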

3 replies

arvinxx May 11, 2025
Maintainer Author

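Deciding whether the context window has been exceeded is a hard call and not easy to implement.
The embedding approach is still available: you can still build very long content into a knowledge base, and then select that file when associating the knowledge base.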

AutoCONFIG May 12, 2025

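Thanks for the reply! Could you provide an entry point for the previous embedding approach, such as a new "Upload very long file" option in the file upload list? This is a personal deployment shared with a few friends, and they won't necessarily know how to do this (or the steps are cumbersome enough that they won't figure it out).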

arvinxx May 12, 2025
Maintainer Author

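A better approach might be to provide a server-side configuration parameter: if the text read exceeds this value (for example, 1,000,000 characters), it automatically goes through embedding.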
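
A minimal sketch of what such a threshold check could look like, for discussion only. The names `DEFAULT_MAX_DIRECT_CHARS`, `resolveMaxDirectChars`, and `chooseUploadStrategy` are hypothetical, and so is the idea of passing the value in as a raw string (for example from an environment variable); nothing here is an existing LobeChat setting.

```typescript
// Hypothetical default, taken from the "1,000,000 characters" example above.
const DEFAULT_MAX_DIRECT_CHARS = 1_000_000;

type UploadStrategy = 'direct-context' | 'embedding';

// Parses a raw config value; falls back to the default when missing or invalid.
function resolveMaxDirectChars(raw: string | undefined): number {
  if (raw === undefined || raw.trim() === '') return DEFAULT_MAX_DIRECT_CHARS;
  const parsed = Number(raw);
  return isFinite(parsed) && parsed > 0 ? Math.floor(parsed) : DEFAULT_MAX_DIRECT_CHARS;
}

// Documents above the threshold go through embedding; the rest are injected directly.
function chooseUploadStrategy(totalCharCount: number, maxDirectChars: number): UploadStrategy {
  return totalCharCount > maxDirectChars ? 'embedding' : 'direct-context';
}
```

A character count is only a rough proxy for a model's context window, so a threshold like this would not rule out overflow for every model.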

bbbugg · 2025-05-13T13:17:11Z

bbbugg
May 13, 2025

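How can a file that was already uploaded in a previous conversation be referenced directly in a new conversation?
Directly referencing an unchunked file from Files/Knowledge Base doesn't seem to be recognized.
To use the same file, without chunking and vectorization, in a new conversation, is re-uploading it the only option?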

2 replies

bbbugg May 13, 2025

Could referencing unchunked files in the file library be supported?

arvinxx May 14, 2025
Maintainer Author

That will have to wait for the overall rework in Knowledge Base 2.0.

AutoCONFIG · 2025-05-14T03:58:19Z

AutoCONFIG
May 14, 2025

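Suggestion: for image uploads in file upload, could an auxiliary model be enabled (mainly for OCR)? That way, models that don't support image upload could still answer questions about the text content of screenshots.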

0 replies
