How can I extract the source behind Google search result "20-pack" entries?

Time: 2022-08-22 11:11:46

The search results page for a local Google search typically looks like this, containing 20 results.

To get the full contact details for any given result on the left-hand side, the result needs to be clicked, which brings up (after a lengthy wait) an overlay (I'm not sure of the technical term) over the Google Maps pane. This is the behaviour in Firefox; other web browsers do something different.

I am extracting the business name, address, phone and website with Python and WebDriver thus:

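# Assumes `driver` is an already-running WebDriver instance with a result's overlay open.
# find_element(By..., ...) replaces the find_element_by_* helpers, which were removed in Selenium 4.3.
from selenium.webdriver.common.by import By

address = driver.find_element(By.XPATH, "//div[@id='akp_uid_0']/div/div/ol/li/div/div/div/ol/table/tbody/tr[2]/td/li/div/div/span[2]").text
name = driver.find_element(By.CSS_SELECTOR, ".kno-ecr-pt").text.encode('raw_unicode_escape')
phone = driver.find_element(By.CSS_SELECTOR, "div._mr:nth-child(2) > span:nth-child(2)").text
website = driver.find_element(By.CSS_SELECTOR, "a.lua-button:nth-child(1)").get_attribute("href")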

This works reliably, but is extremely slow: loading each Maps overlay can take tens of seconds. I've tried PhantomJS via WebDriver, but was quickly blocked by Google's bot detection.

If my reading of Firebug is correct, each of these links on the left hand side is defined like so:

<a data-ved="0CA4QyTMwAGoVChMIj66ruJHGxwIVTKweCh03Sgw0" data-async-trigger="" data-height="0" data-cid="11660382088875336582" data-akp-stick="H4sIAAAAAAAAAGOovnz8BQMDgycHm5SIoaGZmYGxhZGBhYWFuamxsZmphZESVtEoyeSMzKL8gqLE5JL8omLtvNRyhcr8omztvMrkA51e-lt5XiW0n3kw-e7MFfkJwUIAxqbXGGYAAAA" data-akp-oq="Body in Balance Chiropractic New York, NY" jsl="$x 3;" data-rtid="ifLMvGmjeYOk" jsaction="r.UQJvbqFUibg" class="ifLMvGmjeYOk-6WH35iSZ2V0 rllt__link rllt__content" tabindex="0" role="link"><div class="_Ml"><div class="_pl _ki"><div role="heading" aria-level="3" style="margin-right:0px" class="_rl">Body in Balance <wbr></wbr>Chiropractic</div><div class="_lg"><span aria-hidden="true" class="rtng" style="margin-right:5px">5.0</span><g-review-stars><span aria-label="Rated 5.0 out of 5" class="_pxg _Jxg"><span style="width:70px"></span></span></g-review-stars><div style="display:inline;font-size:13px;margin-left:5px"><span>20 reviews</span></div></div><div class="_tf"><span>Chiropractor</span>&nbsp;·&nbsp;W 45th St</div><div class="_CRe"><div><span>Opens at 8:00 am</span></div></div></div></div></a>
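
As a rough illustration of what is already available in that anchor markup without clicking anything, here is a minimal Python sketch that pulls the data-cid, data-akp-oq and heading text out of such an element using only the standard library. The names LocalResultParser, extract_results and SAMPLE_ANCHOR are made up for this example, and the sample is a shortened copy of the anchor above. It does not retrieve the phone number or website, and whether the data-cid value can be used to fetch the overlay contents directly is not something this sketch establishes.

```python
from html.parser import HTMLParser

# Shortened copy of the anchor shown above (hypothetical sample for this sketch).
SAMPLE_ANCHOR = (
    '<a data-cid="11660382088875336582" '
    'data-akp-oq="Body in Balance Chiropractic New York, NY" class="rllt__link">'
    '<div class="_Ml"><div role="heading" class="_rl">'
    'Body in Balance <wbr></wbr>Chiropractic</div>'
    '<div class="_lg"><span>20 reviews</span></div></div></a>'
)


class LocalResultParser(HTMLParser):
    """Collect data-cid, data-akp-oq and heading text from each result anchor."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.results = []         # one dict per anchor that carries data-cid
        self._current = None      # the anchor currently being parsed, if any
        self._heading_depth = 0   # > 0 while inside the role="heading" div

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and "data-cid" in attrs:
            self._current = {
                "cid": attrs["data-cid"],
                "query": attrs.get("data-akp-oq"),
                "name": "",
            }
        elif self._current is not None and tag == "div":
            if self._heading_depth:
                self._heading_depth += 1      # nested div inside the heading
            elif attrs.get("role") == "heading":
                self._heading_depth = 1       # entering the heading div

    def handle_endtag(self, tag):
        if tag == "div" and self._heading_depth:
            self._heading_depth -= 1
        elif tag == "a" and self._current is not None:
            # Collapse whitespace left behind by <wbr> and line breaks.
            self._current["name"] = " ".join(self._current["name"].split())
            self.results.append(self._current)
            self._current = None

    def handle_data(self, data):
        if self._current is not None and self._heading_depth:
            self._current["name"] += data


def extract_results(html):
    """Return a list of {'cid', 'query', 'name'} dicts found in html."""
    parser = LocalResultParser()
    parser.feed(html)
    parser.close()
    return parser.results


if __name__ == "__main__":
    for result in extract_results(SAMPLE_ANCHOR):
        print(result["cid"], result["name"])
```

With WebDriver, the same function could presumably be fed driver.page_source once the results list has loaded, giving all 20 names and cid values in one pass instead of one click per entry.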

My knowledge of CSS and JavaScript is practically nil, so I may not be asking the right question. But is there a way to get at the underlying source of what eventually hovers over the Maps pane (there's probably a more technical term for it), without having to click on the link on the left-hand side to bring it up? My thinking is that if I can get at and parse that HTML without actually having to trigger it, I can save a lot of time.

1 solution

#1


1  

I have tried to check the DOM structure of the page you provided. IE behaves very differently from Firefox on this page (IE redirects to another page once you click one of the left-hand-side items).

Because of limits in my environment, I could only try this in IE. For Firefox, you may want to try the following code. There might be minor issues (apologies, I am unable to test it).

Note: I wrote the demo in Java (it only looks up the phone number) because that is the language I am familiar with. I am also not good with CSS selectors, so I used XPath instead. Hope it helps.

        driver.get("https://www.google.com/search?q=chiropractors%2Bnew%20york%2Bny&rflfq=1&tbm=lcl&tbs=lf:1,lf_ui:2&oll=40.754671143320074,-73.97722375000001&ospn=0.017814865199625274,0.040340423583984375&oz=15&fll=40.75807315356519,-73.99290368792725&fspn=0.01641614335274255,0.040340423583984375&fz=15&ved=0CJIBENAnahUKEwj1jtnmtcbHAhVTCo4KHfkkCYM&bav=on.2,or.r_cp.&biw=1360&bih=608&dpr=1&sei=y4LdVYvcFsa7uATo_LngCQ&ei=4YTdVbWaENOUuAT5yaSYCA&emsg=NCSR&noj=1&rlfi=hd:;si:#emsg=NCSR&rlfi=hd:;si:&sei=y4LdVYvcFsa7uATo_LngCQ");

        // Needs java.util.List, org.openqa.selenium.By and org.openqa.selenium.WebElement.
        // 0. Only needed if your connection to Google is slow.
        Thread.sleep(5000);

        // 1. The class '_gt' matches the div of each of the 20 results on the left-hand side (works in both IE and Firefox).
        List<WebElement> elements = driver.findElements(By.xpath("//div[@class='_gt']"));

        // 2. Walk through the results, using 'data-cid' as the identifier. Firefox only: IE exposes no data-cid attributes.
        for (WebElement e : elements) {
            WebElement aTag = e.findElement(By.tagName("a"));
            String dataCid = aTag.getAttribute("data-cid");

            // 3. In Firefox, the div holding the details can be identified by 'data-cid'.
            //    The question's XPath addresses this div by id, so match @id rather than @class.
            WebElement parentDivOfTable = driver.findElement(By.xpath("//div[@id='akp_uid_0' and @data-cid='" + dataCid + "']"));

            // 4. Get the information table. The leading '.' keeps the search inside parentDivOfTable.
            WebElement table = parentDivOfTable.findElement(By.xpath(".//table[@class='_B5g']"));

            // Get the phone number: the first element following the 'Phone:' label.
            String phoneNum = table.findElement(By.xpath(".//span[text()='Phone:']/following-sibling::*[1]")).getText();
        }
