Neo4j: binary file storage and text search "stack"

Time: 2022-04-25 15:44:14

I have a project I would like to work on which I feel is a beautiful use case for Neo4j. But there are aspects of implementing it that I do not understand well enough to list my questions succinctly. So instead, I'll let the scenario speak for itself:

Scenario: Put simply, I want to build an application for users who receive files of various types, such as docs, Excel, Word, images, audio clips and even videos (although not many videos), and allow them to upload and categorize these files.

With each file they will enter any and all associations it contains. Examples:

  • If Joe authors a PDF, Joe is associated with the PDF.
  • If a DOC says that Sally is Mary's mother, Sally is associated with Mary.
  • If Bill sent an email to Jane, Bill is associated with Jane (and the email).
  • If company X sends an invoice (Excel grid) to company Y, X is associated with Y.

and so on...
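To make the examples above concrete, here is a minimal sketch of how one such association could be written to Neo4j together with the file it is based on. A Neo4j relationship connects exactly two nodes, so one option (an assumption here, not the only possible model) is to turn the association itself into a node that points at both parties and at the file. The labels `Entity`, `Association` and `File`, the relationship types `PARTY_TO` and `BASED_ON`, and the helper `association_params` are all hypothetical names chosen for illustration.

```python
# Cypher statement kept as a plain string; all labels, relationship types
# and property names below are illustrative assumptions.
ASSOCIATION_QUERY = """
MERGE (a:Entity {name: $source})
MERGE (b:Entity {name: $target})
MERGE (f:File {key: $file_key})
CREATE (assoc:Association {kind: $kind})
CREATE (a)-[:PARTY_TO {role: 'source'}]->(assoc)
CREATE (b)-[:PARTY_TO {role: 'target'}]->(assoc)
CREATE (assoc)-[:BASED_ON]->(f)
"""


def association_params(source: str, kind: str, target: str, file_key: str) -> dict[str, str]:
    """Build the parameter map for ASSOCIATION_QUERY, rejecting blank values."""
    params = {
        "source": source,
        "kind": kind,
        "target": target,
        "file_key": file_key,
    }
    for name, value in params.items():
        if not value.strip():
            raise ValueError(f"{name} must not be empty")
    return params


# Example: "a DOC says that Sally is Mary's mother".
params = association_params("Sally", "MOTHER_OF", "Mary", "docs/family.doc")
```

With the official Neo4j Python driver, passing the statement and the parameter map to a session's `run` method (something like `session.run(ASSOCIATION_QUERY, params)`) should execute it. Using parameters rather than string formatting keeps user-entered names out of the query text.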

So the basic goal at this point would be to:

  • Have users load in files as they receive them.
  • Enter the associations that each file contains.
  • Review associations holistically, in order to predict or take some action.
  • Generate a report of the associations of interest, including the files that the associations are based on.

The value of this project is in the associations, which in reality would grow much more complex than the above examples and should produce interesting conclusions. However, if the user is asked "How did you come to that conclusion?", they need to be able to produce a summary of the associations as well as any files that these associations are based on, i.e. the PDF or Excel file or whatever.
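The "How did you come to that conclusion?" requirement can be sketched in plain Python, assuming each association records the key of the file it came from. This is only an in-memory illustration of the shape of the data; in Neo4j the same question would be a path query. The `Association` class, the `explain` and `evidence_files` functions, and the file names are all made up for the example.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Association:
    source: str
    kind: str
    target: str
    file_key: str  # where the supporting file lives (path or object key)


def explain(associations: list[Association], start: str, goal: str) -> list[Association]:
    """Return the shortest chain of associations linking start to goal, or []."""
    # Treat every association as traversable in both directions.
    adjacency: dict[str, list[tuple[str, Association]]] = {}
    for assoc in associations:
        adjacency.setdefault(assoc.source, []).append((assoc.target, assoc))
        adjacency.setdefault(assoc.target, []).append((assoc.source, assoc))

    queue = deque([(start, [])])
    seen = {start}
    while queue:  # breadth-first search, so the first hit is a shortest chain
        node, path = queue.popleft()
        if node == goal:
            return path
        for neighbour, assoc in adjacency.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, path + [assoc]))
    return []


def evidence_files(path: list[Association]) -> list[str]:
    """List the distinct files backing a chain, in the order they are used."""
    files: list[str] = []
    for assoc in path:
        if assoc.file_key not in files:
            files.append(assoc.file_key)
    return files


known = [
    Association("Sally", "MOTHER_OF", "Mary", "family.doc"),
    Association("Mary", "EMAILED", "Jane", "mail-0417.eml"),
    Association("Bill", "EMAILED", "Jane", "mail-0502.eml"),
]
chain = explain(known, "Sally", "Jane")
report_files = evidence_files(chain)
```

Here `chain` holds the two associations that link Sally to Jane and `report_files` names the documents a user would attach to the report.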

Initial thoughts...

I should also add that this application would be hosted internally and probably used by approximately 50 users, so I probably don't need the super-duper, fastest, most scalable, highest-availability solution possible. The data being loaded could get rather large though, maybe up to a terabyte in a year? (Not the associations but the actual files.)

Wouldn't it be great if Neo4j just did all of this! Obviously it should handle the graph aspects of this very nicely, but I figure that file storage and text search are going to need another player added to the mix.

Some combinations of solutions I know of would be:

  • Store EVERYTHING, including the files as binary, in Neo4j.

    I would be wrestling Neo4j into something it's not built for. How would I search text?

  • Store only associations and metadata in Neo4j, and the uploaded files on the file system.

    How would I do text searches on files that are stored on a file server?

  • Store only associations and metadata in Neo4j, and the uploaded files in Postgres.

    I'm not so confident about having all my files inside a DB. I feel more comfortable having all my files accessible in folders.

    Everyone says it's great to put your files in a DB. Everyone says it's not great to put your files in a DB.

Get to the bloody questions..

  1. Can anyone suggest a good "stack" that would suit the above?
  2. Please give a basic outline of how you would implement your suggestion, i.e.:

    • Have the application store the data in Neo4j, then use triggers to update Postgres.
    • Or have the files loaded into Postgres and have triggers update Neo4j.
    • Or have the application load data into Neo4j and then have the application load data into Postgres.
    • etc.

How you would tie these together is probably what I am really trying to grasp.

Thank you very much for any input on this.

Cheers.

p.s. What a ramble! If you feel the need to edit my question or title to simplify, go for it! :)

1 answer

#1


Here are my recommendations:

  • Never store binary files in the database. Store them in the filesystem or in a service like AWS S3 instead, and reference the file in your data model.
  • I would store the file in S3 first, then store a reference to it in your primary database (Neo4j?).
  • If you want to be able to search for any word in a document, I would recommend a full-text search engine like Elasticsearch. Elasticsearch can index multiple document formats such as PDF using Apache Tika.
  • You can probably also use Elasticsearch/Tika to search for relationships in the documents and surface them in order to update your graph.
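The order of operations in the recommendations above (file first, then the reference, then the search index) can be sketched as follows. This is a rough outline under stated assumptions, not a definitive implementation: the three `Protocol` interfaces, the `storage_key` scheme and the `InMemoryBackend` stand-in are all hypothetical, and the extracted `text` is assumed to come from a separate extraction step such as Tika. In a real setup each interface would be backed by S3 (or a filesystem), Neo4j and Elasticsearch respectively.

```python
import hashlib
from typing import Protocol


class BlobStore(Protocol):
    def put(self, key: str, data: bytes) -> None: ...


class Graph(Protocol):
    def add_file(self, key: str, filename: str) -> None: ...


class SearchIndex(Protocol):
    def index(self, key: str, text: str) -> None: ...


def storage_key(data: bytes, filename: str) -> str:
    """Derive a stable key from the file content (hypothetical naming scheme)."""
    digest = hashlib.sha256(data).hexdigest()
    return f"{digest[:2]}/{digest}/{filename}"


def ingest(data: bytes, filename: str, text: str,
           store: BlobStore, graph: Graph, search: SearchIndex) -> str:
    """Store the file, then reference it in the graph, then index its text."""
    key = storage_key(data, filename)
    store.put(key, data)           # 1. the binary goes to file storage
    graph.add_file(key, filename)  # 2. the graph only holds a reference
    search.index(key, text)        # 3. extracted text goes to the search engine
    return key


class InMemoryBackend:
    """Stand-in for all three services that just records what was called."""

    def __init__(self) -> None:
        self.calls: list[tuple[str, str]] = []

    def put(self, key: str, data: bytes) -> None:
        self.calls.append(("store", key))

    def add_file(self, key: str, filename: str) -> None:
        self.calls.append(("graph", key))

    def index(self, key: str, text: str) -> None:
        self.calls.append(("search", key))


backend = InMemoryBackend()
key = ingest(b"invoice body", "invoice.xlsx", "invoice body", backend, backend, backend)
```

Because the same key is written to all three places, a search hit or a graph node can always be traced back to the stored file. Writing the file first means a failure in a later step leaves an unreferenced file rather than a reference to a file that does not exist.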

Suggested stack:

  • Neo4j
  • Elasticsearch
  • AWS S3 or some other redundant filesystem to avoid data loss

Bonus: See this SO question/answer for best practices on indexing files in multiple formats using ES.
