Solution for performing lots of calculations on 3 million data points and making charts

Time: 2021-08-29 02:59:06

I have an Excel spreadsheet with about 300,000 rows and about 100 columns.

I need to perform various functions on this spreadsheet, and from it I need to create about 3,000 other spreadsheets that are significantly smaller.

For every spreadsheet created, I will need a separate PowerPoint file containing an automatically generated graph.

I've done lots of VBA programming, but I am a little lost with this project.

  1. If I dump the data into a MySQL database, would it be easier for me to handle my task?

  2. Is it feasible to do all of this in Excel VBA?

  3. Is it possible to easily add graphs from Excel to PowerPoint programmatically? Or should I perhaps use a different solution for graphs?

4 answers

#1


2  

  1. It depends strongly on how you plan to process the data. If you plan to write code in Excel, it makes much more sense to leave it in Excel. Having said that, I would dump the data to CSV (comma-delimited) for further processing with a different tool, like Python.

  2. Everything is always feasible given enough time and money. If you're like most other programmers, you don't have too much of either, so you want the most efficient solution, or close to it. If it were me, I would write code in Python to read the data from a CSV file, perform all required operations, and save the 3000 separate output sets as individual CSV files which can be imported back into Excel.

  3. Charts can be tricky to create and manipulate from VBA. I would use a Python library like Matplotlib to produce all graphical output and save it to disk as PNG images, which can then be inserted into the PowerPoint presentation(s).
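
As a rough illustration of the Matplotlib-to-PNG step, here is a minimal sketch. The function name `save_line_chart`, the axis labels, and the figure size are hypothetical choices made for this example, not anything from the question. It only writes the PNG. Placing the image into a PowerPoint file would be a separate step, for example with a library such as python-pptx, which is not shown here.

```python
from pathlib import Path

import matplotlib

matplotlib.use("Agg")  # off-screen backend, so no display is required
import matplotlib.pyplot as plt


def save_line_chart(values, title, png_path):
    """Plot a sequence of numbers as a line chart and save it as a PNG."""
    fig, ax = plt.subplots(figsize=(8, 4.5))
    ax.plot(range(1, len(values) + 1), values, marker="o")
    ax.set_title(title)
    ax.set_xlabel("Row")    # placeholder label (assumption)
    ax.set_ylabel("Value")  # placeholder label (assumption)
    fig.savefig(png_path, dpi=150, bbox_inches="tight")
    plt.close(fig)  # release the figure; matters when looping over thousands of charts
    return Path(png_path)
```

Calling `save_line_chart([3, 1, 4, 1, 5], "Subset 0001", "subset_0001.png")` once per output data set would produce one image per spreadsheet.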

Python is mentioned here only as an example. You should use a tool that you feel most familiar with; however, the concepts of processing the data programmatically (not via interconnected cell references and formulas with a little VBA thrown in to copy sheets and so on) should still apply, and will be your best way forward here. I have done a ton of the kind of work you describe. Get the data into CSV and process the data with code.
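
To make "get the data into CSV and process the data with code" a bit more concrete, here is a minimal sketch using only Python's standard library. It assumes the master CSV has a header row and that the smaller output sets are defined by the distinct values of one column. The function name `split_csv` and the `key_column` parameter are hypothetical, since the question does not say how the 3,000 subsets are derived.

```python
import csv
from collections import defaultdict
from pathlib import Path


def split_csv(source, out_dir, key_column):
    """Write one CSV per distinct value of key_column; return row counts per value."""
    source, out_dir = Path(source), Path(out_dir)
    groups = defaultdict(list)
    with source.open(newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        fieldnames = reader.fieldnames or []
        for row in reader:
            groups[row[key_column]].append(row)

    out_dir.mkdir(parents=True, exist_ok=True)
    for key, rows in groups.items():
        # Assumption: key values are safe to use as file names.
        with (out_dir / f"{key}.csv").open("w", newline="", encoding="utf-8") as handle:
            writer = csv.DictWriter(handle, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(rows)
    return {key: len(rows) for key, rows in groups.items()}
```

This version holds every row in memory before writing, which keeps the code short. For a file of roughly 300,000 rows by 100 columns that may use a lot of RAM, so a streaming approach or a data-frame library could be a better fit in practice.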

#2


3  

This is certainly feasible in all respects, but VBA may carry too much overhead for this because of its heavy-handed nature in opening and closing Excel and PowerPoint instances for 3,000 spreadsheets and presentations. If it's a one-time solution and you'll only ever need to do this once, though, VBA is certainly fast to develop in, so you could save a lot upfront just by using the object model. Another option is to do this from an Interop app in C# or VB.NET, where you may have more control over your environment, such as garbage collection.

However, if you're working with Excel 2007/2010 (I assume you are because of the 300k rows), I would do something different. I'd do the calc routines on the main XLSX in VBA and then use Open XML to process and create the 3,000 spreadsheets and presentations with charts. (Note: I wouldn't use Open XML on the main XLSX because it doesn't actually evaluate built-in calculations; you would still need to open the XLSX in Excel to "hydrate" the spreadsheet, so VBA would be better in this instance.)

If you're new to Open XML, there's a lot to learn upfront, so the juice may not be worth the squeeze. But articles like this are very helpful if you do want to learn, or already know, Open XML, and make a great starting point (as they deal with charts as well). You could also use a wrapper around the Open XML SDK, like Simple OOXML, which is quite good for starting out.

#3


2  

Take a look at the open-source statistical system called "R". It's quite good at programmatically generating graphs and charts from real-world datasets.

http://www.r-project.org/

#4


1  

I can't answer 2 and 3 for you, but regarding 1: based on your question, I'd definitely recommend against that. Of course, you didn't explain exactly what kind of operations you need to perform on the data, so I may well be wrong here.

Your situation reminds me of the saying about regexes: "Some people, when they encounter a problem, will immediately try to solve it using a regular expression. Now they have two problems". You don't want an additional problem.

If you must use a database to do this (simply because doing it in Excel isn't performant enough), I'd stick with something from Microsoft, like Access or SQL Server, which will probably save you some trouble. (Never thought I'd be saying this.)
