How to structure data for searchability

Time: 2022-09-06 17:06:15

I am writing a search application specifically for music playlists.

The genre and file format differ from playlist to playlist, and sometimes within a playlist too. There is also a concept of "synonymous" tags (e.g. urban covers both hiphop and r&b, but not the other way around).

Below is a list of search terms and my expected results.

- gospel: should return all playlists with at least one gospel song. Playlists with all gospel songs would be shown first.
- urban: should return all r&b and hiphop. Again, playlists with all urban tracks would come first.
- hiphop: should return all hiphop but not r&b.
- flac: should return all playlists that contain flac files, starting with the ones that are pure flac.
- hiphop flac: should return hiphop flacs first, followed by other hiphop audio.
- hiphop AND flac: should return hiphop flacs only.
- hiphop audio: should return hiphop flacs, hiphop mp3s, etc.
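
The single-term cases in the list above (match at least one track, purest playlists first) could be sketched roughly as follows. This is only an illustration in Python: the `Track` class, the `COVERS` map, the `score` and `search` helpers, and the sample playlists are all hypothetical names made up for this sketch, and it assumes genre and format are known per track.

```python
from __future__ import annotations

from dataclasses import dataclass

# Hypothetical one-directional synonym map: a broad tag covers narrower ones.
COVERS = {"urban": {"hiphop", "r&b"}}


@dataclass(frozen=True)
class Track:
    genre: str
    fmt: str  # file format, e.g. "flac" or "mp3"


def expand(term: str) -> set[str]:
    """Return the term plus every tag it covers (never the reverse)."""
    return {term} | COVERS.get(term, set())


def score(tracks: list[Track], term: str) -> float:
    """Fraction of tracks whose genre or format matches the term."""
    wanted = expand(term)
    hits = sum(1 for t in tracks if t.genre in wanted or t.fmt in wanted)
    return hits / len(tracks) if tracks else 0.0


def search(playlists: dict[str, list[Track]], term: str) -> list[str]:
    """Names of playlists with at least one match, purest first."""
    scored = [(score(tracks, term), name) for name, tracks in playlists.items()]
    ranked = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [name for s, name in ranked if s > 0]


# Made-up sample data for the sketch.
playlists = {
    "sunday": [Track("gospel", "mp3"), Track("gospel", "flac")],
    "mixed": [Track("gospel", "mp3"), Track("hiphop", "mp3")],
    "club": [Track("hiphop", "flac"), Track("r&b", "flac")],
}

print(search(playlists, "gospel"))  # ['sunday', 'mixed']
print(search(playlists, "urban"))   # ['club', 'mixed']
```

Multi-term queries such as "hiphop flac" would need a combined score per track, which this sketch does not attempt.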

As I'm just starting this project, I'm thinking about the best way to index all this. Would a full-text search engine like Lucene be of any use here? Note that I don't have any text describing these playlists, but I could generate some.

I'm thinking of organising all these terms as "tags" and storing them in the db with a many-to-many relationship to playlists.

table: playlist ( pk(id), desc )
table: tag ( pk(id), desc )
table: playlist_has_tag ( pk(link_id, tag_id) )

To solve the urban == hiphop || rnb thing, I would maybe add a tag_synonyms table:

table: tag_synonyms ( pk(tag_id, synonym_tag_id) )

Then I'd have two records to indicate that urban encompasses hiphop and rnb:

urban's tag id, hiphop's tag id
urban's tag id, rnb's tag id

I have a feeling, though, that the queries could become quite convoluted with this approach.
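
To get a feel for how convoluted the query actually gets, here is a rough sketch of the tables above with one lookup query. It uses Python's built-in `sqlite3` module purely so that it runs standalone (the project itself is on PostgreSQL, where the SQL would need checking). The sample rows, the `QUERY` string, and the `playlists_for` helper are assumptions made for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# "desc" is quoted because DESC is a reserved word in SQL.
conn.executescript("""
CREATE TABLE playlist (id INTEGER PRIMARY KEY, "desc" TEXT);
CREATE TABLE tag (id INTEGER PRIMARY KEY, "desc" TEXT);
CREATE TABLE playlist_has_tag (
    link_id INTEGER REFERENCES playlist(id),
    tag_id  INTEGER REFERENCES tag(id),
    PRIMARY KEY (link_id, tag_id)
);
CREATE TABLE tag_synonyms (
    tag_id         INTEGER REFERENCES tag(id),
    synonym_tag_id INTEGER REFERENCES tag(id),
    PRIMARY KEY (tag_id, synonym_tag_id)
);
-- Made-up sample rows: urban (1) encompasses hiphop (2) and rnb (3).
INSERT INTO tag VALUES (1, 'urban'), (2, 'hiphop'), (3, 'rnb'), (4, 'gospel');
INSERT INTO tag_synonyms VALUES (1, 2), (1, 3);
INSERT INTO playlist VALUES (1, 'hiphop mix'), (2, 'rnb mix'), (3, 'gospel mix');
INSERT INTO playlist_has_tag VALUES (1, 2), (2, 3), (3, 4);
""")

# Match playlists tagged with the searched tag itself or with any tag it
# encompasses. The lookup is one-directional: hiphop does not pull in urban.
QUERY = """
SELECT DISTINCT p."desc"
FROM tag t
LEFT JOIN tag_synonyms s ON s.tag_id = t.id
JOIN playlist_has_tag pt ON pt.tag_id IN (t.id, s.synonym_tag_id)
JOIN playlist p ON p.id = pt.link_id
WHERE t."desc" = ?
ORDER BY p."desc"
"""


def playlists_for(term: str) -> list[str]:
    """Descriptions of playlists matching the tag or its synonyms."""
    return [row[0] for row in conn.execute(QUERY, (term,))]


print(playlists_for("urban"))   # ['hiphop mix', 'rnb mix']
print(playlists_for("hiphop"))  # ['hiphop mix']
```

Because tags hang off playlists rather than tracks here, this sketch cannot rank "pure" playlists first; that would need per-track tags or counts.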

Could CouchDB be of use here? I'm currently using PostgreSQL. Is there some software out there that will make this kind of thing easy?

I would like to be able to drill down and support complex search terms in the future like:

(hiphop OR house) AND filetype:mp3 AND artwork:no

And also incorporate things like duration, etc.
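
As a rough idea of what evaluating such an expression involves, the sketch below checks an already-parsed query against one playlist's set of tokens. The nested-tuple query form, the `matches` function, and the `key:value` token convention are all assumptions for this illustration; parsing the query string and numeric fields such as duration are left out.

```python
from __future__ import annotations


def matches(expr, tokens: set[str]) -> bool:
    """Evaluate a nested ("and" | "or" | "not", ...) expression.

    A bare string is a leaf and matches if it is one of the tokens.
    """
    if isinstance(expr, str):
        return expr in tokens
    op, *args = expr
    if op == "and":
        return all(matches(arg, tokens) for arg in args)
    if op == "or":
        return any(matches(arg, tokens) for arg in args)
    if op == "not":
        return not matches(args[0], tokens)
    raise ValueError(f"unknown operator: {op!r}")


# (hiphop OR house) AND filetype:mp3 AND artwork:no
query = ("and", ("or", "hiphop", "house"), "filetype:mp3", "artwork:no")

print(matches(query, {"house", "filetype:mp3", "artwork:no"}))    # True
print(matches(query, {"hiphop", "filetype:flac", "artwork:no"}))  # False
```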

4 solutions

#1


2  

If you think too hard about how to structure your data for searching, there is a good chance you will miss an important search that your app could really have used.

Alternatively (and this is from experience) you end up re-inventing all sorts of indexing techniques.

I have some experience with Lucene (there are Java and .NET versions; there was a C port, but I am not sure how alive it is these days), and it can do amazing things with data that is stored in any structure.

I like the look of CouchDB. It just depends on whether you want to experiment with something new and powerful, or go for something which is (currently) fairly battle-hardened: Lucene.

#2


1  

A full-text index will serve you best if your users are going to be the ones defining the queries. Just create a custom text field that describes each attribute you want to be searchable, e.g. "urban filetype:pdf gospel", and search that.
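
A minimal sketch of that idea, using a tiny in-memory inverted index in place of a real full-text engine such as Lucene. The `describe`, `add`, and `search` helpers, the whitespace tokenisation, and the sample playlists are assumptions made for this illustration.

```python
from __future__ import annotations

from collections import defaultdict

# token -> ids of the playlists whose generated text contains that token
index: dict[str, set[int]] = defaultdict(set)


def describe(genres: list[str], filetype: str) -> str:
    """Build the generated text field for one playlist."""
    return " ".join([*genres, f"filetype:{filetype}"])


def add(playlist_id: int, text: str) -> None:
    """Index every whitespace-separated token of the text field."""
    for token in text.lower().split():
        index[token].add(playlist_id)


def search(query: str) -> set[int]:
    """Ids of playlists containing every term of the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    return set.intersection(*(index.get(term, set()) for term in terms))


add(1, describe(["urban", "gospel"], "mp3"))
add(2, describe(["urban"], "flac"))

print(search("urban"))               # {1, 2}
print(search("urban filetype:mp3"))  # {1}
```

A real engine would add ranking, stemming, and query parsing on top of this basic lookup.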

#3


0  

OK, just brainstorming here --

Perhaps using octal or binary to store your "format" types as a bitmask?

http://www.nitrogen.za.org/viewtutorial.asp?id=17

RandB: 1, HipHop: 2, Gospel: 4, Urban: 8

Now, these things are additive. You know that if something is tagged Urban, you're not going to store "8" in the flag field; you'll store 11, i.e. Urban && HipHop && RandB (8 + 2 + 1). This is just a bit of "business intelligence" you'll have to have spelled out somewhere.

You can then use bitwise comparisons (e.g. a bitwise AND) to figure out which flags you're looking for.
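
A small sketch of the flag values above, using Python's standard `enum.IntFlag`. The `Genre` class, the `URBAN_STORED` constant, and the `has_any` and `has_all` helpers are hypothetical names chosen for the illustration.

```python
from enum import IntFlag


class Genre(IntFlag):
    RANDB = 1
    HIPHOP = 2
    GOSPEL = 4
    URBAN = 8


# The "business intelligence": Urban is stored together with what it covers.
URBAN_STORED = Genre.URBAN | Genre.HIPHOP | Genre.RANDB  # 8 + 2 + 1 = 11


def has_any(stored: Genre, wanted: Genre) -> bool:
    """True if at least one wanted bit is set in the stored flags."""
    return bool(stored & wanted)


def has_all(stored: Genre, wanted: Genre) -> bool:
    """True if every wanted bit is set in the stored flags."""
    return (stored & wanted) == wanted


print(int(URBAN_STORED))                    # 11
print(has_any(URBAN_STORED, Genre.HIPHOP))  # True
print(has_any(URBAN_STORED, Genre.GOSPEL))  # False
```

Note that under this encoding a search on the HipHop bit also matches anything stored as Urban.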

#4


-1  

I don't see how database software would play a role in your solution.

If I were to be the one implementing this, I would first ensure all related data is captured in a normalized way. This would include things like category, artwork, lyrics, etc.

The main advantage of this is that your 'complex' searches actually become quite simple.
