When will other languages be supported? #9

zxhd863943427 · 2023-01-30T14:35:11Z

This is a great project! However, it doesn't seem to be able to analyze articles in languages other than English. Will other languages, such as Chinese, be supported?

novohool · 2023-01-31T05:53:23Z

such as https://www.huxiu.com/article/781201.html

div-wang · 2023-01-31T09:09:20Z

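Following the approach in the code, I tested more than 20 Chinese links (WeChat, Sina, 快科技, Huxiu, 36Kr, BBC Chinese, and others). None of the outputs were correct: every summary was unrelated to the article in question.
As of 2023-01-31 I have not yet pinpointed where the problem is.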

div-wang · 2023-02-01T05:09:06Z

Python's gpt_index can be used to get around the input length limit. The implementation is as follows:

import os
import re
import sys

from gpt_index import Document, GPTListIndex, SimpleDirectoryReader, TrafilaturaWebReader

# pip install gpt_index
# pip install trafilatura

def main():
    # Usage: python <script> <url | path | text> <prompt>
    content = sys.argv[1]
    prompt = sys.argv[2]
    # Check whether the input is an http(s) URL
    is_url = re.match(r'https?://\S+$', content)
    if is_url:
        # Fetch the page and extract its main text
        documents = TrafilaturaWebReader().load_data([content])
    elif os.path.isfile(content):
        # Read a single file (SimpleDirectoryReader expects a directory)
        with open(content, encoding='utf-8') as f:
            documents = [Document(f.read())]
    elif os.path.isdir(content):
        # Load every file in the directory
        documents = SimpleDirectoryReader(content).load_data()
    else:
        # Treat the input as raw text
        documents = [Document(content)]
    # Build the index; the documents are split into chunks
    index = GPTListIndex(documents)
    # Ask GPT to summarize across the chunks
    response = index.query(prompt, response_mode="tree_summarize")
    # Print the result
    print('python response', response)

if __name__ == '__main__':
    main()

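Chinese and English links need to be handled separately: GPT sometimes summarizes Chinese content in English, so for Chinese links the prompt should preferably be in Chinese as well.
Reference: A Primer to using GPT Index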
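One way to act on the point above is to pick the prompt language from the extracted text before querying. The following is only a minimal sketch: the prompt wording, the `PROMPTS` table, the function names and the 0.3 threshold are all assumptions for illustration, and the character-range check is a rough heuristic rather than a real language detector.

```python
import re

# Hypothetical prompt templates; the wording is illustrative, not from the project.
PROMPTS = {
    "zh": "请用中文总结这篇文章的主要内容。",
    "en": "Summarize the main points of this article in English.",
}

# Basic CJK Unified Ideographs block; a rough heuristic only.
_CJK = re.compile(r"[\u4e00-\u9fff]")


def detect_language(text: str, threshold: float = 0.3) -> str:
    """Return "zh" if at least `threshold` of the non-space characters are CJK, else "en"."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return "en"
    cjk = sum(1 for c in chars if _CJK.match(c))
    return "zh" if cjk / len(chars) >= threshold else "en"


def pick_prompt(text: str) -> str:
    """Choose the prompt whose language matches the article text."""
    return PROMPTS[detect_language(text)]
```

In the script above, `pick_prompt` could be applied to the text of the loaded documents and its result passed to `index.query` instead of the command-line prompt.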

songhn233 · 2023-03-24T14:53:44Z

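Following the approach in the code, I tested more than 20 Chinese links (WeChat, Sina, 快科技, Huxiu, 36Kr, BBC Chinese, and others). None of the outputs were correct: every summary was unrelated to the article in question. As of 2023-01-31 I have not yet pinpointed where the problem is.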

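@div-wang The real reason is that this implementation is simply incorrect. The current version (#12) uses a Pipe3 Server, whose details are unknown, as the backend service, and that service currently returns 500 and is unavailable. So, looking at the implementation from before that PR: the core of it is constructing the prompt and then calling OpenAI's completion API (earlier versions called GPT-3 family models rather than the later 3.5 or 4).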

readpilot/pages/api/analyze.ts

Lines 16 to 22 in 822ec49

    
    const prompt = `
    Generate a list of thought provoking discussion questions about the URL, and return the answers of these questions with the evidence.
    Please generate a JSON list object with the following properties: q and a. q and a should be string. q is the question. a is the answer.
    The URL is: ${url}
    `;

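In reality, current GPT models have no such capability (Bing, by contrast, can rely on its search engine to supply extra context). When asking GPT to analyze a web link appears to work, it is most likely because the URL itself exposes part of the context.

In such cases the output actually has little to do with the article content. The model mainly infers from the information in the URL and then produces a passage of plausible but empty filler. (Of course, a user who has no patience to read the article either may easily be fooled by the generated summary, which is a bit of dark humor.)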

Conversely, for the link you mentioned, or any other link whose URL hides the context, the result can only be meaningless content.
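To make the point concrete, here is a small sketch that lists the words a model can actually see when the prompt contains nothing but the URL. The helper name `url_hints` and the example.com URL are hypothetical; the Huxiu link is the one posted earlier in this thread.

```python
import re
from typing import List
from urllib.parse import urlparse

# File-extension fragments that carry no information about the article.
_IGNORED = {"html", "htm", "php"}


def url_hints(url: str) -> List[str]:
    """Return the alphabetic words visible in the URL path."""
    path = urlparse(url).path
    # Split on anything that is not a letter, then drop short fragments.
    words = re.split(r"[^A-Za-z]+", path)
    return [w.lower() for w in words if len(w) > 2 and w.lower() not in _IGNORED]


# A slug-style URL leaks the topic of the article.
print(url_hints("https://example.com/blog/how-to-learn-rust-in-2023"))
# A numeric article ID leaks almost nothing.
print(url_hints("https://www.huxiu.com/article/781201.html"))
```

With the first URL the model has "blog", "how", "learn" and "rust" to guess from; with the second it has only "article", so any summary it returns is invented.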

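To implement this feature, the more feasible approach at present seems to be to crawl the page and extract the main article content, then embed that (possibly multimodal) content into a vector database, retrieve from it, and feed the retrieved passages to GPT as context in the prompt, using tools such as llama_index or langchain.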
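The retrieve-then-prompt flow described above can be sketched with the standard library alone. This is not how llama_index or langchain implement it: a bag-of-words cosine similarity stands in for real embeddings and a vector database, the article text is assumed to have been extracted already, and all function names and the prompt wording are hypothetical.

```python
import math
import re
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    # Latin words and digits as tokens; each CJK character as its own token.
    return re.findall(r"[a-z0-9]+|[\u4e00-\u9fff]", text.lower())


def chunk(text: str, size: int = 200) -> List[str]:
    # Fixed-size character windows; real splitters usually respect sentence boundaries.
    return [text[i:i + size] for i in range(0, len(text), size)]


def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two term-count vectors.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(chunks: List[str], question: str, k: int = 2) -> List[str]:
    # Rank chunks by similarity to the question and keep the top k.
    q = Counter(tokenize(question))
    ranked = sorted(chunks, key=lambda c: cosine(Counter(tokenize(c)), q), reverse=True)
    return ranked[:k]


def build_prompt(article_text: str, question: str, k: int = 2) -> str:
    # Put only the retrieved passages, not the URL, into the prompt.
    context = "\n---\n".join(retrieve(chunk(article_text), question, k))
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"
```

The key difference from the original `analyze.ts` prompt is that the model receives the article text itself, so its answer no longer depends on what the URL happens to reveal.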

songhn233 mentioned this issue Mar 24, 2023

Perhaps it would be appropriate to update the repository status and clarify any misleading implementation details #18

Open
