Effective way to store multiple Excel files in a database?

Time: 2022-10-17 14:51:12

We're working on a large internal project that lets users upload Excel files and eventually perform searches on all the data gathered from those files. I'm trying to do my homework before we begin the design and come up with the best solution.

The requirements are -

  1. A user can upload an Excel file with as many columns as he wants, so there is no predefined structure for the file.

  2. In contrast to the first point, there are a few columns we expect the user to have, for example first name and last name. These columns don't have to exist.

  3. The search option will work as follows - when the user searches, he can search by the specific, predefined columns we expect his Excel file to have (in our example, first and last name). He can also search all the remaining columns, under a field called "Other".

Another word about the "Other" search field - this field will go through all the columns, in all the Excel files, that don't match the predefined columns. E.g., one file has an age column and another has a birthplace column; the "Other" field will search through all of those columns.
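
For illustration, a minimal sketch (assuming pandas is available; the PREDEFINED set and function name are hypothetical) of how an uploaded file's columns could be split into the predefined fields and everything else:

import pandas as pd

# Hypothetical set of predefined columns we expect to find (requirement 2).
PREDEFINED = {"first name", "last name"}

def split_columns(path):
    # Read the uploaded Excel file; requires an Excel engine such as openpyxl.
    df = pd.read_excel(path)
    normalized = {str(col).strip().lower(): col for col in df.columns}
    predefined = {name: df[col] for name, col in normalized.items() if name in PREDEFINED}
    other = {name: df[col] for name, col in normalized.items() if name not in PREDEFINED}
    return predefined, other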

What's the best way to do this?

  1. Dynamically create a new Django model for each uploaded Excel file, with all the columns that file has?

  2. Dynamically create a new Django model for each file, with all the predefined columns (if they exist!) and an "other" text field that concatenates all the unrelated fields?

  3. Have one big Django model (meaning only one table in my database) that has all my predefined fields (which, again, can be null), plus a field called "others" that concatenates all the unrelated columns?

  4. Have a main table with all the predefined columns, and another table with a foreign key to the main table, where each row represents one "other" field.

Example for the 4th solution -

+----+--------+--------+--------+
| id | field1 | field2 | field3 |
+----+--------+--------+--------+
|  1 | val1   | val1   | val1   |
|  2 | val2   | val2   | val2   |
|  3 | val3   | val3   | val3   |
+----+--------+--------+--------+

And the dimension table -

+----+------+------+
| fk | key  | val  |
+----+------+------+
|  1 | key1 | val1 |
|  1 | key2 | val2 |
|  1 | key3 | val3 |
|  2 | key4 | val4 |
+----+------+------+
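
A minimal Django sketch of what the 4th solution could look like (model and field names are hypothetical; the predefined columns are nullable because an uploaded file may not contain them):

from django.db import models

class Record(models.Model):
    # Predefined columns; nullable because they may be missing from a given file.
    first_name = models.CharField(max_length=255, null=True, blank=True)
    last_name = models.CharField(max_length=255, null=True, blank=True)

class OtherField(models.Model):
    # One row per "other" column value, keyed back to the main record.
    record = models.ForeignKey(Record, on_delete=models.CASCADE, related_name="others")
    key = models.CharField(max_length=255)
    value = models.TextField()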



As for scaling - we expect to eventually have no more than 1,500 Excel files, each containing between 100 and approximately 100,000 rows (we will probably cap each file at 100k rows). Statistics from the Excel files we have examined suggest we won't go over roughly 30 million rows in total.

We will be using Django with either MySQL or PostgreSQL.

I hope my question was clear and not too opaque.

Thanks!

1 Answer

#1


EDIT: After you changed your question, I added a short section on your model 4.

I would strongly recommend against dynamically creating tables. This is messy, and I doubt it will perform well. Your database creates an access path for each table you query, so if you create multiple tables you will need to search all of them.

You probably need a variant of your model 3.

This means you use one table, but instead of having a column for each field, you create two columns: one for the Excel column name and one for its value. You will also need some additional information to identify which column names and values belong to which Excel spreadsheet.

So conceptually, instead of modelling:

field1 field2 field3 field4 other
------------------------------------
x       y     z       a     etc=xyz

You model it like this:

sheet fieldname value
------------------------------------
key   field1    x
key   field2    y
key   field3    z
key   field4    a
key   etc       xyz
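
As a hedged illustration, this single table could be expressed as a Django model roughly like the sketch below (the Sheet/CellValue names are hypothetical; the foreign key identifies which spreadsheet each value came from):

from django.db import models

class Sheet(models.Model):
    # Identifies the uploaded Excel file the values belong to.
    name = models.CharField(max_length=255)

class CellValue(models.Model):
    # The foreign key is indexed automatically by Django.
    sheet = models.ForeignKey(Sheet, on_delete=models.CASCADE, related_name="cells")
    # The original Excel column name; indexed as suggested below.
    fieldname = models.CharField(max_length=255, db_index=True)
    value = models.TextField()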

The advantage of this model is that programming your searches becomes easier. You can express any search simply as select * from data where fieldname='%s' and value='%s'. If you create a database index on fieldname (and probably an index on the key you use to identify the Excel sheets), there should be no performance penalty compared to your original idea for model 3.

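In Django ORM terms, such a search might look roughly like this (a sketch that assumes the hypothetical CellValue model above):

# All rows, across every uploaded sheet, where the column "age" has the value "42".
matches = CellValue.objects.filter(fieldname="age", value="42")

# A search on one of the predefined columns works exactly the same way.
people = CellValue.objects.filter(fieldname="first name", value="John")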

Your model 4 would also work. It has the advantage that, for the predefined fields, your user's query statements would easily map to SQL select statements. It has the disadvantage that you need to handle your "others" columns differently from the rest of the user's search criteria. You also indicated that users sometimes do not enter the columns that you expect to be there. This means you have to make these columns nullable, which increases the storage requirements.

Overall, I think my suggested approach is better than your option 4, as it is conceptually simpler. You indicated that you thought it would create too many rows. It would indeed create more rows, but MySQL and PostgreSQL can easily handle that number of rows. PostgreSQL can store an effectively unlimited number of rows, and MySQL can store roughly 4,000 million rows (and you can compile MySQL with --big-tables if you need more).

In terms of performance, it makes no real difference how big your table is, as long as you have an index on the field.
