Ruby: how to incorporate multithreading into this web-scraping scenario?

Time: 2022-11-19 21:01:02

I have a list of folders which contain lots of text files. Inside those files are links.


Using each one of those links, I will need to fetch a webpage, parse it, and depending on what's there - save a JPG file into a folder corresponding to the folder name that contains the text file that provided the link.


Now the catch is that there's a LOT of text files and even more links inside of them. I was thinking that it may not be such a bad idea to multithread the process of connecting to and parsing webpages.


So I'll have something like this:


directories.each do |directory|
  ...
  all_files_in_directory.each do |file|
    ...
    all_urls_in_file.each do |url|
      # check if there are any threads that aren't busy
      # make a thread go out to the url and parse it
    end
  end
end

I'm a bit unsure how to do that, or whether it's even possible - I can't seem to find a way to have threads just sort of hang out until I tell them to execute some_method(). It's as if what a thread does is assigned to it upon creation and cannot be changed.


So basically I want the script to be able to connect and parse, say, in batches of 5 instead of just 1.

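For illustration, the shape I'm after - a fixed set of worker threads blocking on a shared queue, so at most 5 pages are being fetched/parsed at once. This is only a sketch with core Ruby classes; `fetch_and_parse` is a placeholder for the real open-the-page-and-parse-it logic:

```ruby
url_queue = Queue.new
results   = Queue.new    # thread-safe collector, standing in for "save a JPG"
WORKERS   = 5            # "batches of 5": at most five fetches in flight

# Hypothetical stand-in for the real fetch/parse/save logic.
fetch_and_parse = lambda { |url| "parsed:#{url}" }

workers = WORKERS.times.map do
  Thread.new do
    # Queue#pop blocks, so an idle worker simply waits here until the
    # producer hands it another URL - no busy-checking needed.
    while (url = url_queue.pop)
      results << fetch_and_parse.call(url)
    end
  end
end

# Producer side: this is where the directories/files/links loops would go.
10.times { |i| url_queue << "http://example.com/page#{i}" }

WORKERS.times { url_queue << nil }  # one stop sentinel per worker
workers.each(&:join)
```

The workers never need to be told `some_method()` explicitly - each one's permanent job is just "pop and process", and the queue delivers the work.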

Is this doable, and if so, how would you solve this problem?


2 solutions

#1



You should consider eventmachine and em-http-request for concurrent HTTP requests.


#2



Typically, such activities are performed by queueing 'task' objects to a pool of threads that are waiting on a producer-consumer 'pool queue'. Each thread loops around forever, pulling tasks off the queue and calling a virtual 'run' method of the task. Usually, if they wish, tasks can create more tasks and submit them to the pool queue.


Different 'task' class descendants can have a run() method that does different things, so even though the thread is indeed 'doing what was assigned to it upon creation', that something means hanging about on a queue and then, when tasks are available, calling different overridden methods on different tasks.

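The task-object pattern described above could be sketched in Ruby like so; `FetchTask` and `SaveJpgTask` are invented names for illustration, not a real library API:

```ruby
# Each task knows how to run itself, so identical pool threads
# end up doing different work depending on what they pop.
class FetchTask
  def initialize(url); @url = url; end
  def run; "fetched #{@url}"; end     # would do the HTTP GET and parse
end

class SaveJpgTask
  def initialize(path); @path = path; end
  def run; "saved #{@path}"; end      # would write the JPG into its folder
end

pool_queue = Queue.new
done       = Queue.new

pool = 3.times.map do
  Thread.new do
    # The thread's assigned job is just "loop and run whatever arrives",
    # so what it actually does changes with every task it pops.
    while (task = pool_queue.pop)
      done << task.run
    end
  end
end

pool_queue << FetchTask.new("http://example.com/a")
pool_queue << SaveJpgTask.new("pics/a.jpg")
3.times { pool_queue << nil }   # stop sentinels, one per pool thread
pool.each(&:join)
```

A task's run could itself push follow-up tasks (e.g. a FetchTask enqueueing a SaveJpgTask) onto pool_queue, which is the "tasks can create more tasks" point above.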

Flow control, right. Make a 'batchURL' task class that can hold 'batch size' urls. At start, create, say, 100 of them and push them onto an 'objectQueue' (a producer-consumer queue class, like the pool queue). In your readline loop, pop a batchURL, load it up with urls and submit it to the pool queue. When a pool thread is done with a batchURL, push it back onto the objectQueue for re-use. This puts a cap on the outstanding batchURLs - if the readLine loop tries to queue up too many batchURLs, it will find the objectQueue empty and so will block until some batchURLs are recycled by the pool.


If you use a reasonable batchSize and reasonable numbers of batchURLs and threads, the batchURLs should happily circulate around the objectQueue/workThread/poolQueue loop, carrying the data from your readLine loop to the work threads in an efficient and effective manner.

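Under the same assumptions, the recycling scheme might look like this in Ruby (plain arrays playing the role of batchURL objects, and deliberately small numbers so the flow-control blocking kicks in):

```ruby
BATCH_SIZE = 5
BATCHES    = 4   # cap on batches in flight (the answer suggests ~100)

object_queue = Queue.new   # recycled, empty batch objects
pool_queue   = Queue.new   # filled batches waiting for a worker
processed    = Queue.new   # stands in for the real fetch/parse work

BATCHES.times { object_queue << [] }   # pre-create the batch objects

workers = 2.times.map do
  Thread.new do
    while (batch = pool_queue.pop)
      batch.each { |url| processed << url }   # fetch/parse would go here
      batch.clear
      object_queue << batch   # recycle: this is what unblocks the producer
    end
  end
end

# Producer ("readline loop"): object_queue.pop blocks when every batch is
# already in flight, which is exactly the flow-control cap described above.
urls = (1..20).map { |i| "http://example.com/#{i}" }
urls.each_slice(BATCH_SIZE) do |slice|
  batch = object_queue.pop
  batch.concat(slice)
  pool_queue << batch
end

2.times { pool_queue << nil }   # stop sentinels, one per worker
workers.each(&:join)
```

Ruby's SizedQueue gives a similar back-pressure cap with less machinery, at the cost of losing the object re-use aspect of the batchURL scheme.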
