Operating system management of address-translation-related data structures and hardware lookasides

Time: 2022-09-05 21:26:57

An approach is provided in a hypervised computer system where a page table request is received at an operating system running in the hypervised computer system. The operating system determines whether the page table request requires the hypervisor to process. If the determination reveals that the page table request requires the hypervisor, then the hypervisor is used to handle the request. However, if the determination reveals that the page table request does not require the hypervisor, then an indicator included in a page table entry corresponding to the request is read to determine if the page table entry is controlled by the operating system or the hypervisor. The operating system is able to update the page table entry if the indicator identifies the page table entry as being operating system controlled.

TECHNICAL FIELD

The present invention allows an operating system, rather than a hypervisor, to manage address-translation-related data structures.

BACKGROUND

Traditionally in hypervised systems, an operating system manages storage (e.g. maintains the page table, etc.) using service calls to the hypervisor. In environments which set up and tear down huge numbers of short-lived applications (e.g. some types of web serving applications, etc.), the overhead of hypervisor intervention is costly in terms of performance. One approach used by some architectures (e.g. IA32, etc.) has been to create a second level of translation so that the operating system can maintain the first level of page translation, while the hypervisor continues to maintain the actual mapping to real address space. This approach can be costly both in terms of hardware lookaside resources and storage footprint for the tables (and resulting cache pressure). Traditionally, the hypervisor prevents the OS from accessing page tables using conventional storage protection mechanisms that are part of most memory management architectures.

SUMMARY

An approach is provided in a hypervised computer system where a page table request is received at an operating system running in the hypervised computer system. The operating system determines whether the page table request requires the hypervisor to process. If the determination reveals that the page table request requires the hypervisor, then the hypervisor is used to handle the request. However, if the determination reveals that the page table request does not require the hypervisor, then an indicator included in a page table entry corresponding to the request is read to determine if the page table entry is controlled by the operating system or the hypervisor. The operating system is able to update the page table entry if the indicator identifies the page table entry as being operating system controlled; otherwise, the update is handled by the hypervisor.

The foregoing is a summary and thus contains, by necessity, simplifications, generalizations, and omissions of detail; consequently, those skilled in the art will appreciate that the summary is illustrative only and is not intended to be in any way limiting. Other aspects, inventive features, and advantages of the present invention, as defined solely by the claims, will become apparent in the non-limiting detailed description set forth below.

BRIEF DESCRIPTION OF THE DRAWINGS

The present invention may be better understood, and its numerous objects, features, and advantages made apparent to those skilled in the art by referencing the accompanying drawings, wherein:

FIG. 1 is a block diagram of a data processing system in which the methods described herein can be implemented;

FIG. 2 is a network diagram of various types of data processing systems connected via a computer network;

FIG. 3 is a block diagram depicting the hypervisor and operating systems interaction with CPU components in order to manipulate memory management storage;

FIG. 4 is a first flowchart depicting steps taken by an operating system to handle an incoming page table request; and

FIG. 5 is a second flowchart depicting actions taken by the operating system to update the page table entry.

DETAILED DESCRIPTION

The following detailed description will generally follow the summary of the invention, as set forth above, further explaining and expanding the definitions of the various aspects and embodiments of the invention as necessary. To this end, this detailed description first sets forth a computing environment in FIG. 1 that is suitable to implement the software and/or hardware techniques associated with the invention.

FIG. 1 illustrates information handling system 100, which is a simplified example of a computer system capable of performing the computing operations described herein. Information handling system 100 includes one or more processors 110 coupled to processor interface bus 112. Processor interface bus 112 connects processors 110 to Northbridge 115, which is also known as the Memory Controller Hub (MCH). Northbridge 115 connects to system memory 120 and provides a means for processor(s) 110 to access the system memory. Graphics controller 125 also connects to Northbridge 115. In one embodiment, PCI Express bus 118 connects Northbridge 115 to graphics controller 125. Graphics controller 125 connects to display device 130, such as a computer monitor.

Northbridge 115 and Southbridge 135 connect to each other using bus 119. In one embodiment, the bus is a Direct Media Interface (DMI) bus that transfers data at high speeds in each direction between Northbridge 115 and Southbridge 135. In another embodiment, a Peripheral Component Interconnect (PCI) bus connects the Northbridge and the Southbridge. Southbridge 135, also known as the I/O Controller Hub (ICH), is a chip that generally implements capabilities that operate at slower speeds than the capabilities provided by the Northbridge. Southbridge 135 typically provides various busses used to connect various components. These busses include, for example, PCI and PCI Express busses, an ISA bus, a System Management Bus (SMBus or SMB), and/or a Low Pin Count (LPC) bus. The LPC bus often connects low-bandwidth devices, such as boot ROM 196 and "legacy" I/O devices (using a "super I/O" chip). The "legacy" I/O devices (198) can include, for example, serial and parallel ports, keyboard, mouse, and/or a floppy disk controller. The LPC bus also connects Southbridge 135 to Trusted Platform Module (TPM) 195. Other components often included in Southbridge 135 include a Direct Memory Access (DMA) controller, a Programmable Interrupt Controller (PIC), and a storage device controller, which connects Southbridge 135 to nonvolatile storage device 185, such as a hard disk drive, using bus 184.

ExpressCard 155 is a slot that connects hot-pluggable devices to the information handling system. ExpressCard 155 supports both PCI Express and USB connectivity as it connects to Southbridge 135 using both the Universal Serial Bus (USB) and the PCI Express bus. Southbridge 135 includes USB Controller 140 that provides USB connectivity to devices that connect to the USB. These devices include webcam (camera) 150, infrared (IR) receiver 148, keyboard and trackpad 144, and Bluetooth device 146, which provides for wireless personal area networks (PANs). USB Controller 140 also provides USB connectivity to other miscellaneous USB connected devices 142, such as a mouse, removable nonvolatile storage device 145, modems, network cards, ISDN connectors, fax, printers, USB hubs, and many other types of USB connected devices. While removable nonvolatile storage device 145 is shown as a USB-connected device, removable nonvolatile storage device 145 could be connected using a different interface, such as a FireWire interface.

Wireless Local Area Network (LAN) device 175 connects to Southbridge 135 via the PCI or PCI Express bus 172. LAN device 175 typically implements one of the IEEE 802.11 standards of over-the-air modulation techniques that all use the same protocol to communicate wirelessly between information handling system 100 and another computer system or device. Optical storage device 190 connects to Southbridge 135 using Serial ATA (SATA) bus 188. Serial ATA adapters and devices communicate over a high-speed serial link. The Serial ATA bus also connects Southbridge 135 to other forms of storage devices, such as hard disk drives. Audio circuitry 160, such as a sound card, connects to Southbridge 135 via bus 158. Audio circuitry 160 also provides functionality such as audio line-in and optical digital audio in port 162, optical digital output and headphone jack 164, internal speakers 166, and internal microphone 168. Ethernet controller 170 connects to Southbridge 135 using a bus, such as the PCI or PCI Express bus. Ethernet controller 170 connects information handling system 100 to a computer network, such as a Local Area Network (LAN), the Internet, and other public and private computer networks.

While FIG. 1 shows one information handling system, an information handling system may take many forms. For example, an information handling system may take the form of a desktop, server, portable, laptop, notebook, or other form factor computer or data processing system. In addition, an information handling system may take other form factors such as a personal digital assistant (PDA), a gaming device, ATM machine, a portable telephone device, a communication device or other devices that include a processor and memory.

FIG. 2 is a network diagram of various types of data processing systems connected via a computer network. FIG. 2 provides an extension of the information handling system environment shown in FIG. 1 to illustrate that the methods described herein can be performed on a wide variety of information handling systems that operate in a networked environment. Types of information handling systems range from small handheld devices, such as handheld computer/mobile telephone 210, to large mainframe systems, such as mainframe computer 270. Examples of handheld computer 210 include personal digital assistants (PDAs), personal entertainment devices, such as MP3 players, portable televisions, and compact disc players. Other examples of information handling systems include pen, or tablet, computer 220, laptop, or notebook, computer 230, workstation 240, personal computer system 250, and server 260. Other types of information handling systems that are not individually shown in FIG. 2 are represented by information handling system 280. As shown, the various information handling systems can be networked together using computer network 200. Types of computer network that can be used to interconnect the various information handling systems include Local Area Networks (LANs), Wireless Local Area Networks (WLANs), the Internet, the Public Switched Telephone Network (PSTN), other wireless networks, and any other network topology that can be used to interconnect the information handling systems. Many of the information handling systems include nonvolatile data stores, such as hard drives and/or nonvolatile memory. Some of the information handling systems shown in FIG. 2 depict separate nonvolatile data stores (server 260 utilizes nonvolatile data store 265, mainframe computer 270 utilizes nonvolatile data store 275, and information handling system 280 utilizes nonvolatile data store 285). The nonvolatile data store can be a component that is external to the various information handling systems or can be internal to one of the information handling systems. In addition, removable nonvolatile storage device 145 can be shared among two or more information handling systems using various techniques, such as connecting the removable nonvolatile storage device 145 to a USB port or other connector of the information handling systems.

FIG. 3 is a block diagram depicting the hypervisor and operating systems interaction with CPU components in order to manipulate memory management storage. System memory 300 includes Memory Management Storage 310 that further includes Page Table Entries (PTEs) 320 as well as eXtension Pointers (XPs). In storage constrained and highly dynamic computing environments, page table manipulations can be frequent enough that the hypervisor call (hcall) overhead causes performance degradation. This approach provides a technique and resources with which an operating system (370) can directly manipulate page table entries 320. In one embodiment, the page table structure consists of a hashed table. In one embodiment, an entry may be either a Page Table Entry (PTE) or an eXtension Pointer (XP), which points to a page containing 256 PTEs. The operating system may use a variety of data structures to track the allocation of storage, including the Page Table itself. Those structures which contain the real address reside in storage with a new storage attribute called Memory Management, which is used to protect the integrity of the Page Frame Descriptor (PFD). Each entry in Memory Management Storage (MMS) 310 is a quadword, with the Page Frame Descriptor, consisting of the real address, page size, and storage control bits 321 (WIMG) attribute information, located in fixed locations consistent with the Page Table Entry format.

In one embodiment, MMS 310 is accessed freely by hypervisor 380 through Load/Store Unit 330 of CPU 325, and is accessed via registers 340 using defined instructions by an operating system running on the system. Values stored in registers 340 are processed by Load/Store Unit 330 to update PTE 320. In one embodiment, the Memory Management attribute is identified by control bits 321 (WIMG)=0b1100. A second special WIMG combination (WIMG=0b1101) is used to identify XPs in the hashed page table. In one embodiment where the second special WIMG combination is used, the kernel storage is mapped by page tables that are under the control of the hypervisor. It is also possible to use range registers or a similar technique to create MMS 310 without the use of page tables. In some embodiments, hypervisor 380 might need to "steal" pages (e.g., for additional memory needs, etc.). In addition, in some embodiments, it might be necessary to differentiate PTEs that are under hypervisor control from those which may be manipulated by the operating system. Control indicator 322, also referred to as "the O bit," indicates whether the PTE can be managed by the OS (e.g., O=1) or may be updated only by the hypervisor (e.g., O=0). To ensure proper use of the page frames, the base address and the page size are provided and manipulated in a way that does not allow the operating system to access storage outside the bounds defined by the page frame real address and page size. In one embodiment, the operating system can alter the address and size to address a smaller page within the page provided by the hypervisor. In one embodiment, the operating system is further restricted in that it may not change the WIMG attributes.
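The role of the O bit and the eXtension Pointer encoding can be illustrated with a minimal sketch. Only the value 0b1101 and the O=1/O=0 convention come from the description above; the `Quadword` class, its field names, and the `classify` function are hypothetical and do not reflect the actual bit layout of an entry.

```python
from dataclasses import dataclass

# WIMG value the text reserves for eXtension Pointers in the hashed page table.
WIMG_EXTENSION_POINTER = 0b1101


@dataclass(frozen=True)
class Quadword:
    """Hypothetical decoded view of one Memory Management Storage entry."""
    wimg: int   # 4-bit storage control field (control bits 321)
    o_bit: int  # control indicator 322: 1 = OS-managed, 0 = hypervisor-only


def classify(entry: Quadword) -> str:
    """Report what kind of entry this is and who may update it."""
    if entry.wimg == WIMG_EXTENSION_POINTER:
        return "extension-pointer"
    return "os" if entry.o_bit == 1 else "hypervisor"
```

In this sketch an entry is treated as updatable by the operating system only when `classify` returns `"os"`; anything else would be routed to the hypervisor.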

Four registers 340 are used to communicate PTE data. The Page Frame Descriptor Register (PFDR) 350 is the repository for the authorized Page Frame Descriptor (i.e. the page frame base address, the page size, the WIMG attributes, and the valid bit). Note that in one embodiment, access to Page Frame Descriptor Register (PFDR) 350 is privileged, and the register can be accessed by the operating system using the LPTE and STPTE instructions discussed below. In one embodiment, manipulation of the PFDR by the operating system is possible as a monolithic set of data using the LPTE and STPTE instructions. In this embodiment, the operating system is not permitted to manipulate individual fields of the PFDR. In addition, when the PTE is stored using STPTE, the PTE will be a consistent integration, or merger, of the PFDR content with PFAR1 and PFAR2, rather than solely a reflection of what is in the PFDR. In a further embodiment, for some system security implementations, it may be desirable to inhibit the operating system from reading the real address portion of the PFDR as this information may be considered secure. Page Frame Attribute Registers 1 and 2 (PFAR1 355, PFAR2 360) hold the other bits for a PTE. The Page Frame Attribute Registers also contain the bits that control page size, which may be programmed by the operating system, and identify a proper subset of the page specified in the PFDR. In this manner, in one embodiment, the operating system is able to make limited modifications to some of the fields that are included in the PFDR and one of either PFAR1 or PFAR2. For example, the operating system can change the L and LP values (the changed values go in PFAR1 and PFAR2, not PFDR) to specify a smaller page within the page it was originally given.
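The rule that the operating system may only describe a page lying inside the frame authorized in the PFDR can be sketched as a simple containment check. This is an illustration under stated assumptions: it works on plain byte addresses and sizes rather than the L/LP encoding, the function name and parameters are hypothetical, and it accepts an unchanged page as well as a strictly smaller one.

```python
def is_within_frame(frame_base: int, frame_size: int,
                    sub_base: int, sub_size: int) -> bool:
    """Return True if [sub_base, sub_base + sub_size) lies inside the frame.

    frame_base/frame_size stand in for the authorized descriptor (PFDR);
    sub_base/sub_size stand in for what the OS programmed into the PFARs.
    """
    if sub_size <= 0 or sub_size > frame_size:
        return False
    frame_end = frame_base + frame_size
    return frame_base <= sub_base and sub_base + sub_size <= frame_end
```

For example, a 4 KB page at 0x104000 fits inside a 64 KB frame at 0x100000, while an 8 KB page starting at 0x10F000 would run past the end of that frame and be rejected.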

The Page Table Entry Address Register (PTEAR) 345 includes the effective address of the PTE to be updated and a valid bit that indicates the quadword is a PTE which the operating system may update. Page Table Entry Address Register (PTEAR) 345 is used to enforce the management state of a PTE when updating the PTE. In one embodiment, the PTEAR may only be loaded by using the LPTEAR instruction discussed below. The PTEAR includes the effective address of the PTE and a valid bit that indicates that the entry is a PTE (as opposed to an eXtension Pointer) and that the PTE is not in a hypervisor management state.

Four instructions are used to manage page table entries. These four instructions are hardware instructions executed by Load/Store Unit 330 of Central Processing Unit (CPU) 325. The instructions are as follows: First, Load Page Table Entry (LPTE) is an instruction that, as part of the load of the PFDR and PFARs, tests for the memory management attribute, tests that the PTE may be updated by the OS, and tests for the absence of the WIMG combination that indicates an eXtension Pointer (resetting the valid bit if the conditions are not met). Second, Store Page Table Entry (STPTE) is an instruction that stores the merge of the PFDR and the PFARs to a location in storage if the target quadword is not an eXtension Pointer and is not being managed by the hypervisor. As part of the merge operation, the STPTE instruction verifies that the L/LP bits supplied in the PFARs are consistent with those in the PFDR. As known by those skilled in the art, in one embodiment, the L/LP bits indicate the page size and, in some cases, the alignment of the page in storage. Third, Load Page Table Entry Address Register (LPTEAR) is an instruction that loads the effective address of the target PTE and checks to ensure that the quadword is not an eXtension Pointer and that the hypervisor is not managing the entry, setting a valid bit accordingly. Fourth, Page Frame Descriptor Invalidate (PFDI) is an instruction that sets the valid bit in the PFDR to zero.
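The valid-bit behavior of LPTEAR, LPTE, and PFDI can be modeled in software as a rough sketch. This is not the hardware implementation: the `Entry` and `Registers` classes, the dictionary standing in for Memory Management Storage, and the treatment of "address is present in the dictionary" as the memory management attribute test are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Optional

WIMG_EXTENSION_POINTER = 0b1101  # value given in the text for XPs


@dataclass
class Entry:
    wimg: int            # storage control bits
    o_bit: int           # 1 = OS-managed, 0 = hypervisor-managed
    descriptor: int = 0  # opaque stand-in for the Page Frame Descriptor


@dataclass
class Registers:
    ptear_addr: Optional[int] = None
    ptear_valid: bool = False
    pfdr: Optional[int] = None
    pfdr_valid: bool = False


def os_updatable(entry: Optional[Entry]) -> bool:
    """Entry exists in MMS, is not an XP, and is not hypervisor-managed."""
    return (entry is not None
            and entry.wimg != WIMG_EXTENSION_POINTER
            and entry.o_bit == 1)


def lptear(regs: Registers, mms: Dict[int, Entry], addr: int) -> None:
    """Model of LPTEAR: load the PTE address and set the valid bit."""
    regs.ptear_addr = addr
    regs.ptear_valid = os_updatable(mms.get(addr))


def lpte(regs: Registers, mms: Dict[int, Entry], addr: int) -> None:
    """Model of LPTE: load the descriptor, resetting valid if checks fail."""
    entry = mms.get(addr)
    regs.pfdr = entry.descriptor if entry is not None else None
    regs.pfdr_valid = os_updatable(entry)


def pfdi(regs: Registers) -> None:
    """Model of PFDI: set the PFDR valid bit to zero."""
    regs.pfdr_valid = False
```

The point of the model is that the checks live inside the load instructions themselves, so software never holds a valid register state for an entry it is not allowed to change.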

To write a new PTE, the operating system first loads the address into the PTEAR using the LPTEAR instruction. The operating system then loads an authorized Page Frame Descriptor in the format of a PTE into the PFDR and PFARs (with L/LP bits replicated into the appropriate bits of the PFARs) using the LPTE instruction. The operating system then modifies the contents of the PFARs using the MFSPR and MTSPR instructions to establish the appropriate page size, protection attributes, and so forth. Last, the operating system uses the STPTE instruction to merge and store the PTE.

In one embodiment, Memory Management Storage 310 includes two types of memory-management related structures: Page Table Entries (PTE) 320 and eXtension Pointers (XPs). In one embodiment, general access to Memory Management Storage using Load and Store instructions is limited to hypervisor state (hypervisor 380). Non-hypervisor privileged software (e.g., operating systems 370) may access Memory Management Storage via the Load Page Table Entry Address Register (LPTEAR), Load Page Table Entry (LPTE), Load Real Address (LRA), Store eXtension Pointer, and Store Page Table Entry (STPTE) instructions. In one embodiment, other attempts to access Memory Management Storage 310 by non-hypervisor privileged software are considered storage protection violations.
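The access rule for Memory Management Storage amounts to an allow-list, sketched below. The instruction names are taken from the paragraph above (the text gives no mnemonic for Store eXtension Pointer, so it is spelled out); the exception class and the `check_mms_access` function are hypothetical.

```python
# Instructions the text lists as available to non-hypervisor privileged software.
OS_MMS_INSTRUCTIONS = frozenset({
    "LPTEAR",
    "LPTE",
    "LRA",
    "Store eXtension Pointer",
    "STPTE",
})


class StorageProtectionViolation(Exception):
    """Raised when non-hypervisor software touches MMS by other means."""


def check_mms_access(hypervisor_state: bool, instruction: str) -> None:
    """Permit any access in hypervisor state; otherwise enforce the allow-list."""
    if hypervisor_state:
        return
    if instruction not in OS_MMS_INSTRUCTIONS:
        raise StorageProtectionViolation(
            f"{instruction!r} may not access Memory Management Storage"
        )
```

Under this sketch, an ordinary store issued by the operating system against MMS raises `StorageProtectionViolation`, while the same store issued in hypervisor state passes.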

As outlined above, this approach provides an operating system operating in a hypervised system with some limited ability to load and store data to the memory management storage. Accessing memory management storage via the registers and the instructions described above allows the operating system to access memory management storage while maintaining the integrity of the page table through the processes performed by the instructions.

FIG. 4 is a first flowchart depicting steps taken by an operating system to handle an incoming page table request. Operating system processing is shown commencing at 400 whereupon, at step 410, a page table request is received. As shown, there can be many different request types including 1. reallocate page to another process, 2. change protection attributes, 3. subdivide page, 4. create another PTE to map to same real page, 5. manipulate reference and change bits, 6. change software control bits, 7. increase page size, 8. change storage control bits, 9. move page, and the like. Some of these requests may need the hypervisor to execute, while others may be performed by the operating system using registers 340 introduced in FIG. 3, and related text, above. Returning to FIG. 4, a decision is made as to whether the page table request received at the operating system requires the hypervisor (decision 420). For example, in one embodiment, the hypervisor is required to increase a page size, to change the storage control bits, and to move a page. If the received request does not require the hypervisor, then decision 420 branches to the "no" branch whereupon, at step 425, a control indicator is checked (e.g., control indicator "O bit" 322 shown in FIG. 3) in order to determine if the page table entry (e.g., PTE 320 shown in FIG. 3) is hypervisor or operating system controlled. For example, in one embodiment, if the control indicator is "1" then the PTE is operating system controlled, and if the control indicator is "0" then the PTE is hypervisor controlled.

A decision is made as to whether the PTE is hypervisor controlled (decision 430). If the PTE is hypervisor controlled, then decision 430 branches to the "yes" branch. If either the request requires the hypervisor (decision 420 branching to the "yes" branch) or the control indicator identifies the hypervisor as controlling the PTE (decision 430 branching to the "yes" branch), then the hypervisor handles the request at step 440. On the other hand, if the request does not require the hypervisor (decision 420 branching to the "no" branch) and the PTE is not hypervisor controlled (decision 430 branching to the "no" branch), then the operating system updates the PTE at predefined process 450 (see FIG. 5 and corresponding text for processing details). A decision is made as to whether the update performed by the operating system was successful (decision 460). If the update was successful, then decision 460 branches to the "yes" branch whereupon, at step 470, a successful return code is returned to the requestor indicating that the request was performed successfully. On the other hand, if the update attempted by the operating system was not successful, then decision 460 branches to the "no" branch whereupon, at step 480, an error is returned to the requestor indicating that the page table request was not successful.
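The decision flow of FIG. 4 can be summarized in a short sketch. The request-type strings, the callback parameters, and the returned status strings are hypothetical; the set of hypervisor-only requests follows the one embodiment named in the text (increase page size, change storage control bits, move page).

```python
from typing import Callable

# Request types that, in the embodiment described, always need the hypervisor.
HYPERVISOR_ONLY = frozenset({
    "increase_page_size",
    "change_storage_control_bits",
    "move_page",
})


def handle_page_table_request(
    request_type: str,
    o_bit: int,
    os_update: Callable[[str], bool],
    hypervisor_call: Callable[[str], None],
) -> str:
    """Route one page table request as in decisions 420, 430, and 460."""
    # Decision 420: does the request itself require the hypervisor?
    if request_type in HYPERVISOR_ONLY:
        hypervisor_call(request_type)  # step 440
        return "hypervisor"
    # Step 425 / decision 430: O bit of 0 means hypervisor controlled.
    if o_bit == 0:
        hypervisor_call(request_type)  # step 440
        return "hypervisor"
    # Predefined process 450, then decision 460 on its outcome.
    return "success" if os_update(request_type) else "error"
```

A caller would pass in whatever routine performs the register-based update as `os_update`, and whatever wraps the hypervisor call as `hypervisor_call`.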

FIG. 5 is a second flowchart depicting actions taken by the operating system to update the page table entry. This routine is called from predefined process 450 shown in FIG. 4. Processing commences at 500 whereupon, at step 510, the operating system loads the address of the PTE into PTEAR register 345 using the LPTEAR instruction. The LPTEAR instruction validates that the PTE can be manipulated by the operating system and also loads the address of the PTE into the PTEAR register. At step 520, the PTE data is loaded from memory into various registers (PFDR 350, PFAR1 355, and PFAR2 360) using the LPTE instruction. The LPTE instruction uses the (now validated) address that was loaded in the PTEAR register during step 510 to load data to the various registers. At step 530, the operating system alters the data stored in PFAR1 and 2 (registers 355 and 360) to describe the new page that the operating system is mapping in system memory. After the various registers are loaded and the contents are altered as needed, at step 540, the operating system executes the Store Page Table Entry (STPTE) hardware instruction to store the new PTE into Memory Management Storage (MMS).

Block 550 shows steps performed by the Load/Store Unit of the CPU to execute the STPTE instruction. First, at step 560, the Load/Store Unit validates data stored in the registers, such as the contents stored in PFAR1 and 2 which could have been altered by the operating system during step 530. A decision is made by the Load/Store Unit, based on the validation of the data, as to whether the update to the PTE is allowed (decision 570). If the update is allowed, then decision 570 branches to the "yes" branch whereupon, at step 580, the Load/Store Unit of the CPU updates PTE 320 using values loaded in registers 340. On the other hand, if the update is not allowed, then decision 570 branches to the "no" branch whereupon, at step 590, the Load/Store Unit causes an error to occur indicating that the PTE was not updated. After block 550 concludes, at 595, processing returns to the calling routine (see FIG. 4) with a result that indicates whether the update was allowed (performed).
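The validate-then-store behavior of block 550 can be sketched as follows. This is a simplified software model, not the hardware logic: the `Regs` fields, the use of byte addresses and sizes in place of L/LP bits, the `PteUpdateError` exception, and the tuple stored into the dictionary are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


class PteUpdateError(Exception):
    """Stands in for the error raised at step 590."""


@dataclass
class Regs:
    ptear_addr: int    # address loaded by LPTEAR (step 510)
    ptear_valid: bool  # valid bit set by LPTEAR
    pfdr_base: int     # authorized frame base loaded by LPTE (step 520)
    pfdr_size: int     # authorized frame size
    pfdr_valid: bool   # valid bit set by LPTE
    pfar_base: int     # page base as altered by the OS (step 530)
    pfar_size: int     # page size as altered by the OS (step 530)


def stpte(regs: Regs, mms: Dict[int, Tuple[int, int]]) -> None:
    """Model of STPTE: validate the registers, then store or fail."""
    # Step 560: the OS-altered page must stay inside the authorized frame.
    in_bounds = (
        regs.pfar_size > 0
        and regs.pfdr_base <= regs.pfar_base
        and regs.pfar_base + regs.pfar_size <= regs.pfdr_base + regs.pfdr_size
    )
    # Decision 570: every check must pass for the update to be allowed.
    if not (regs.ptear_valid and regs.pfdr_valid and in_bounds):
        raise PteUpdateError("PTE was not updated")  # step 590
    mms[regs.ptear_addr] = (regs.pfar_base, regs.pfar_size)  # step 580


def update_pte(regs: Regs, mms: Dict[int, Tuple[int, int]]) -> bool:
    """Return to the caller (595) with whether the update was performed."""
    try:
        stpte(regs, mms)
    except PteUpdateError:
        return False
    return True
```

Keeping the bounds check inside `stpte` mirrors the description: the operating system can put anything it likes into the attribute registers, but only a consistent combination reaches storage.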

SRC=https://www.google.com.hk/patents/US8645667
