What is the best way to schedule and execute repetitive tasks (such as scraping a page for information) in Rails?

Time: 2020-12-31 23:32:03

I'm looking for a solution which enables:

  1. Repeated execution of a scraping task (nokogiri)
  2. Changing the time interval via http://www.myapp.com/interval (example)

What is the best solution/way to get this done?

Options I know about

  • Custom Rake task
  • Rufus Scheduler

Current situation

In ./config/initializers/task_scheduler.rb I have:

require 'nokogiri'
require 'open-uri'
require 'rufus-scheduler'
require 'rake'

scheduler = Rufus::Scheduler.new

scheduler.every "1h" do
    puts "BEGIN SCHEDULER at #{Time.now}"

    # Stray spaces after "?" removed so the query string is valid.
    url = "http://www.marktplaats.nl/z/computers-en-software/apple-ipad/ipad-mini.html?query=ipad+mini&categoryId=2722&priceFrom=100%2C00&priceTo=&startDateFrom=always"
    # URI.open comes from open-uri; plain open(url) no longer fetches URLs on Ruby 3.0+.
    doc = Nokogiri::HTML(URI.open(url))
    title = doc.at_css("title").text

    # Selects .group-0 and .group-1; the block index replaces the separate counter.
    2.times do |number|
        doc.css(".defaultSnippet.group-#{number}").each do |listing|
            listing_title = listing.at_css(".mp-listing-title").text
            listing_subtitle = listing.at_css(".mp-listing-description").text
            listing_price = listing.at_css(".price").text
            listing_priority = listing.at_css(".mp-listing-priority-product").text

            # Local variables are already strings, so no interpolation is needed.
            Listing.create(title: listing_title, subtitle: listing_subtitle, price: listing_price)
        end
    end

    puts "END SCHEDULER at #{Time.now}"
end

Is it not working?

Yes, the current setup is working. However, I don't know how to make the interval changeable via http://www.myapp.com/interval (example).

Changing scheduler.every "1h" to scheduler.every "#{@interval}" do does not work.

In what file do I have to define @interval for it to work in task_scheduler.rb?

2 answers

#1


1  

First off: your rufus-scheduler code is in an initializer, which is fine, but an initializer runs only once, while the Rails process is booting and before it handles any request. So the initializer has no access to an @interval variable that you set later, for instance in a controller.

Possible options, instead of an instance variable:

  • read it from a config file
  • read it from a database (but you will have to set up your own connection; I believe ActiveRecord is not yet started in the initializer)

And ... if you change the value, you will have to restart your Rails process for the change to take effect.
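The config-file option could be sketched roughly as follows. This is only an illustration: the helper name load_interval, the "interval" key, and the "1h" fallback are assumptions, not something the question or this answer prescribes.

```ruby
require 'yaml'

# Fallback used when the file or the key is missing (assumed value).
DEFAULT_INTERVAL = '1h'

# Hypothetical helper: reads the scheduler interval from a YAML file
# that contains a line such as "interval: 30m".
def load_interval(path)
  return DEFAULT_INTERVAL unless File.exist?(path)

  config = YAML.safe_load(File.read(path))
  # An empty or malformed file should not break application boot.
  config = {} unless config.is_a?(Hash)
  config.fetch('interval', DEFAULT_INTERVAL).to_s
end
```

In the initializer this would be used as scheduler.every load_interval(Rails.root.join('config', 'scheduler.yml')) do ... end, where the file name is again an assumption. The value is still read only once at boot, so the restart caveat above applies.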

So an alternative approach, in which your Rails process handles the interval of the scheduled job, is to use a recurring background job. At the end of each run, the job reschedules itself using the interval that is active at that moment. I would propose fetching the interval from the database. Any background job handler can do this. Check Ruby Toolbox; I vote for resque or delayed_job.
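A minimal sketch of the self-rescheduling idea, written in plain Ruby so it runs without Rails or a job backend. IntervalStore and FakeQueue are hypothetical stand-ins: in a real application the interval would come from a database row and the queue would be the job handler you pick.

```ruby
# Stand-in for the database row that holds the current interval (assumption).
class IntervalStore
  class << self
    attr_writer :seconds

    def seconds
      @seconds ||= 3600
    end
  end
end

# Stand-in for a background job queue; it only records what was enqueued.
class FakeQueue
  Entry = Struct.new(:job, :run_at)

  attr_reader :entries

  def initialize
    @entries = []
  end

  def enqueue(job, run_at:)
    @entries << Entry.new(job, run_at)
  end
end

class ScrapeJob
  def initialize(queue, clock: -> { Time.now })
    @queue = queue
    @clock = clock
  end

  def perform
    scrape
  ensure
    # Reschedule even if scraping raised, with the interval active right now.
    @queue.enqueue(self, run_at: @clock.call + IntervalStore.seconds)
  end

  def scrape
    # The Nokogiri scraping code from the question would go here.
  end
end
```

Because the interval is looked up on every run, a change made through the web interface is picked up at the next reschedule without a restart. With delayed_job, the enqueue call would roughly correspond to Delayed::Job.enqueue(job, run_at: time). Rescheduling in an ensure clause is a design choice made here so that one failed scrape does not end the cycle.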

#2


2  

I'm not very familiar with Rufus Scheduler, but it appears that it will be difficult to achieve both of your goals (regular heartbeat, dynamically rescheduled) with it. In order for it to work, you'll have to capture the job_id that it returns, use that job_id to stop the job if a rescheduling event occurs, and then create the new job. Rufus also points out that it's an in-memory application whose jobs disappear when the process disappears: reboot the server, restart the application, etc., and you've got to reschedule from scratch.
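The capture, stop, and recreate sequence described above might look like the sketch below. ReschedulableTask is a hypothetical wrapper; it only assumes that the scheduler passed in behaves like Rufus::Scheduler, whose every method returns a job id and whose unschedule method accepts one.

```ruby
# Hypothetical wrapper that remembers the current job id so the job can be
# replaced when the interval changes.
class ReschedulableTask
  attr_reader :interval

  def initialize(scheduler, interval, &work)
    @scheduler = scheduler
    @work = work
    @job_id = nil
    schedule(interval)
  end

  # Stops the running job and creates a new one with the new interval.
  def reschedule(new_interval)
    @scheduler.unschedule(@job_id) if @job_id
    schedule(new_interval)
  end

  private

  def schedule(interval)
    @interval = interval
    # The scheduler is expected to return a job id here.
    @job_id = @scheduler.every(interval) { @work.call }
  end
end
```

In the initializer you would build it with something like ReschedulableTask.new(Rufus::Scheduler.new, '1h') { ... } and keep the instance somewhere the controller can reach, then call reschedule with the new interval. This only works when the scheduler and the controller live in the same process, and the new interval is still lost on restart unless it is persisted.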

I'd consider two things. First, I'd consider creating a model that wraps the screen-scraping that you want to do. At a minimum you'd capture the URL and the interval. The model may wrap up the code for processing the HTML response (basically what's inside the 2.times block) as instance methods that you trigger based on the URL. You could also store that code in a text column and use eval on it, assuming that only "good guys" get access to this part of the system. The model has a couple of advantages: you can quickly expand to scraping other sites, and you can sanitize the interval sent back by the user.
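Sanitizing the user-supplied interval could be sketched like this. The accepted format (a number followed by s, m, h, or d), the bounds, and the "1h" fallback are all assumptions chosen for illustration; in a Rails model the same check could live in a validation.

```ruby
# Accepts values such as "30m" or "2h" (assumed format).
INTERVAL_PATTERN = /\A(\d+)([smhd])\z/
UNIT_SECONDS = { 's' => 1, 'm' => 60, 'h' => 3600, 'd' => 86_400 }.freeze
MIN_SECONDS = 60            # assumed lower bound: one minute
MAX_SECONDS = 7 * 86_400    # assumed upper bound: one week

# Hypothetical helper: returns a normalized interval string, or the default
# when the input is malformed or out of bounds.
def sanitize_interval(raw, default: '1h')
  match = INTERVAL_PATTERN.match(raw.to_s.strip.downcase)
  return default unless match

  seconds = match[1].to_i * UNIT_SECONDS.fetch(match[2])
  seconds.between?(MIN_SECONDS, MAX_SECONDS) ? match[0] : default
end
```

Falling back to a default rather than raising keeps the scheduler running when someone submits garbage; rejecting the request with an error message would be an equally reasonable choice.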

Second, something like Delayed::Job may better suit your needs. Delayed::Job allows you to specify a time for the job's execution which you could fill in by reading the model and converting the interval to a time. The key to this approach is that the job must schedule the next iteration of itself before it exits.

This won't be as rock-steady as something like cron but it does seem to better address the rescheduling need.
