When I discovered NodeJS was built on the V8 Javascript engine I thought "Great, web scraping will be easier, as the page will be rendered like in the browser, with a 'native' DOM there supporting XPath, and any AJAX calls in the page executed."

  1. Why when it uses the same JS engine as Chrome doesn't it have a native DOM?
  2. Likewise, why doesn't it have a mode to run JS in retrieved pages?
  3. What am I not understanding about Javascript engines vs the engine in a web browser? :)

Many thanks!


The DOM is the DOM, and the JavaScript implementation is simply a separate entity. The DOM represents a set of facilities that a web browser exposes to the JavaScript environment. There's no requirement, however, that any particular JavaScript runtime expose any particular facilities via the global object.


Node.js is a stand-alone JavaScript environment, completely independent of a web browser. There's no intrinsic link between web browsers and JavaScript; the DOM is not part of the JavaScript language or its specification.


I use the old Rhino Java-based JavaScript implementation in my Java-based web server. That environment also has nothing at all to do with any DOM. It's my own application that's responsible for populating the global object with facilities to do what I need it to be able to do, and it's not a DOM.
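
To make that concrete, here is a minimal sketch using Node's built-in `vm` module. It is my own illustration, not the Rhino setup described above, and the `greet` function is a made-up facility. The host decides what the script can see, and the language built-ins are there either way:

```javascript
const vm = require('vm');

// The embedding application decides what the script can see.
// `greet` is a hypothetical facility; nothing DOM-like is exposed.
const sandbox = { greet: (name) => `hello, ${name}` };
vm.createContext(sandbox);

// The host-provided facility is reachable from the script...
const greeting = vm.runInContext('greet("rhino")', sandbox);

// ...the language built-ins exist regardless of the host...
const arrayType = vm.runInContext('typeof Array', sandbox);

// ...but there is no `document`, because nobody put one there.
const documentType = vm.runInContext('typeof document', sandbox);

console.log(greeting, arrayType, documentType);
```

A browser does the same thing on a much larger scale: it fills the global object with `document`, `window` and friends before running page scripts.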


Note that there are projects like jsdom if you want a virtual DOM in your Node project. Because Node is by nature a server-side platform, a DOM is a facility it can do without while still making perfect sense for a wide variety of server applications. That's not to say a DOM isn't useful to some people, but it's just not in the same category of services as things like process control, I/O, networking, database interop, and so on.


There may be some "official" answer to the question "why?" out there, but it's basically just the business of those who maintain Node (the Node Foundation now). If some intrepid developer out there decides that Node should ship by default with a set of modules to support a virtual DOM, and successfully works and works and makes that happen, then Node will have a DOM.




P.S.: When reading this question I was also wondering whether V8 (which node.js is built on top of) had a DOM.


Why when it uses the same JS engine as Chrome doesn't it have a native DOM?


But I searched Google and found Google's V8 page, which states the following:


JavaScript is most commonly used for client-side scripting in a browser, being used to manipulate Document Object Model (DOM) objects for example. The DOM is not, however, typically provided by the JavaScript engine but instead by a browser. The same is true of V8—Google Chrome provides the DOM. V8 does however provide all the data types, operators, objects and functions specified in the ECMA standard.

node.js uses V8 and not Google Chrome.
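
You can check this from a plain Node script. The sketch below is only an illustration (the helper name `describeGlobals` is mine): it reports the `typeof` of a few well-known globals, so you can see that the ECMAScript built-ins are present while the browser-provided ones are not.

```javascript
// Report the `typeof` of each named global in the current runtime.
function describeGlobals(names) {
  const report = {};
  for (const name of names) {
    report[name] = typeof globalThis[name];
  }
  return report;
}

const report = describeGlobals([
  'Array',    // ECMAScript built-in, provided by V8
  'Promise',  // ECMAScript built-in, provided by V8
  'JSON',     // ECMAScript built-in, provided by V8
  'process',  // provided by Node, not by V8 or the ECMA standard
  'document', // provided by browsers
  'window',   // provided by browsers
]);

console.log(report);
```

Run under Node, `document` and `window` come back as `'undefined'`; paste the same function into a browser console and they are objects, while `process` is the one that goes missing.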


Likewise, why doesn't it have a mode to run JS in retrieved pages?


I also think we don't need it all that badly. Ryan Dahl created node.js as a single programmer. Maybe now he (or his team) will develop this, but I was already amazed by the amount of code he produced. He wanted to make an easy, efficient non-blocking library, and I think he did a mighty good job of it.

But then again, another developer created a module that is pretty good and actively developed (as of today): https://github.com/tmpvar/jsdom.
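
For what it's worth, current versions of jsdom can also run the scripts embedded in a page, which is close to the "mode" asked about. A rough sketch, assuming jsdom is installed from npm (it is a third-party package, not part of Node); the helper name `runInlineScripts` and the sample page are made up for illustration:

```javascript
// Assumes `npm install jsdom`; jsdom is not part of Node's standard library.
function runInlineScripts(html, selector) {
  // Required lazily so this file still loads when jsdom is not installed.
  const { JSDOM } = require('jsdom');

  // `runScripts: 'dangerously'` tells jsdom to execute <script> elements.
  // As the name says, only use it on pages you trust.
  const dom = new JSDOM(html, { runScripts: 'dangerously' });
  return dom.window.document.querySelector(selector).textContent;
}

// A hypothetical page whose visible content is filled in by its own script.
const page = `
  <body>
    <p id="out">loading</p>
    <script>
      document.getElementById('out').textContent = 'filled in by script';
    </script>
  </body>`;
```

With jsdom available, `runInlineScripts(page, '#out')` returns `'filled in by script'`; without the `runScripts` option the paragraph would still read `'loading'`.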


What am I not understanding about Javascript engines vs the engine in a web browser? :)


Those are different things, as is hopefully clear from the quote above.




node.js chose not to include it in their standard library. For any functionality, there is an inevitable tradeoff between comprehensiveness, scalability, and maintainability.


That doesn't mean it's not potentially useful. There is at least one JavaScript DOM implementation intended for NodeJS (among other CommonJS implementations).



You seem to have the flawed assumption that V8 and the DOM are inextricably related; that's not the case. The DOM is actually handled by WebKit. V8 doesn't handle the DOM; it handles the JavaScript calls to the DOM. Don't let this discourage you: Node.js has carved out a significant niche in the realtime server market, but don't let anybody tell you it's just for servers. Node makes it possible to build almost anything with JavaScript.


It is possible to do what you're talking about. For example, there is the very good jsdom library if you really need access to the DOM, as well as node-htmlparser. There are also some really good scraping libraries that take advantage of these, like apricot.




This is related: There is a new project (2012) called node-webkit which tries to add DOM and a lot more from Webkit to Node. Support it!



To answer your underlying question, you can use JSDom and jQuery to scrape pages in node.js: http://blog.nodejitsu.com/jsdom-jquery-in-5-lines-on-nodejs


I've used this approach a few times myself, and it works great.
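
As a rough idea of what that looks like with a current jsdom (this is not the code from the linked post, and it skips jQuery in favour of the standard `querySelectorAll`), here is a small sketch. It assumes jsdom is installed from npm; `extractLinks` and the sample markup are made up for illustration:

```javascript
// Assumes `npm install jsdom`; jsdom is not part of Node's standard library.
function extractLinks(html) {
  // Required lazily so this file still loads when jsdom is not installed.
  const { JSDOM } = require('jsdom');
  const { document } = new JSDOM(html).window;

  // Ordinary DOM calls from here on, just as in a browser.
  return Array.from(document.querySelectorAll('a[href]'), (a) => ({
    text: a.textContent.trim(),
    href: a.getAttribute('href'),
  }));
}

// Stand-in for a page you would normally fetch over HTTP.
const sampleHtml = `
  <ul>
    <li><a href="/docs">Docs</a></li>
    <li><a href="/blog"> Blog </a></li>
    <li><a>no href, skipped</a></li>
  </ul>`;
```

With jsdom available, `extractLinks(sampleHtml)` gives back two entries, `{ text: 'Docs', href: '/docs' }` and `{ text: 'Blog', href: '/blog' }`.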




Javascript != browser. Javascript as a language is not tied to browsers; node.js is simply an implementation of Javascript that is intended for servers, not browsers. Hence no DOM.



If you read DOM as 'linked objects immediately accessible from my script', then the answer is 'it does, but it's very different from the set of objects available to a web document's script'. The main reason is that node is 'evented I/O for V8', not 'HTML tree objects for V8'.




It seems people have answered 'why' but not 'how'. A quick answer to 'how' is that in a web browser, a document object is exposed (hence DOM, Document Object Model). On Windows this object is called the document object. You can refer to this page and look at the methods it exposes for handling HTML documents, like createElement. I don't use node.js and haven't done COM programming in a while, but I'd imagine you could use the DOM in node.js by simply calling the COM interface IHTMLDocument3. Of course, for other platforms like Mac OS X or Linux you would probably have to use something from their OS API. This should allow you to easily build a webpage server-side using the DOM, or to scrape incoming web pages.



Node.js is for server-side programming. There is no DOM to be rendered on the server.




2018 answer: mainly for historical reasons, but this may change in future.


Historically, very little DOM manipulation was done on the server. Additionally, as other answers allude to, the JS stdlib and the DOM are separate libraries: if you're using node for, say, Unix scripting, then HTMLElement, NodeList, etc. aren't really relevant to that.

However: server-side DOM manipulation is now a very common part of delivering web apps. Web servers need to understand the structure of pages, and, if asked to render a resource as HTML, deliver HTML content that reflects the initial state of a web application. This means web apps load much faster than if the server simply delivers a stub page and has the browsers then do the work of filling in the real content. Currently this is done with JSDom and similar, but in the same way node has Request and Response objects built in, having DOM functions maintained as part of the stdlib would help with this task.
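
As a sketch of what that looks like today with jsdom (assuming it is installed from npm; `renderTodoPage` and the page template are hypothetical, not taken from any framework), the server builds the initial state through DOM calls and sends the serialized result:

```javascript
// Assumes `npm install jsdom`; jsdom is not part of Node's standard library.
function renderTodoPage(items) {
  // Required lazily so this file still loads when jsdom is not installed.
  const { JSDOM } = require('jsdom');

  // Start from the stub page the browser would otherwise have to fill in.
  const dom = new JSDOM(
    '<!DOCTYPE html><html><head><title>Todos</title></head>' +
    '<body><ul id="todos"></ul></body></html>'
  );
  const { document } = dom.window;

  const list = document.getElementById('todos');
  for (const item of items) {
    const li = document.createElement('li');
    li.textContent = item; // assigning textContent keeps the data as text, not markup
    list.appendChild(li);
  }

  // Turn the populated document back into an HTML string for the response.
  return dom.serialize();
}
```

Calling `renderTodoPage(['buy milk', 'write answer'])` produces a complete page whose list already contains `<li>buy milk</li>` and `<li>write answer</li>`, so the browser has content to show before any client-side script runs.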




1) What does it mean for it to have a Document Object Model? There's no document to represent.


2) Most of the time you're not retrieving pages. You can, but most Node apps probably won't be.


3) Without a document and a browser, Javascript is just another programming language. So you may as well ask why there isn't a DOM in C# or Java.




