1. What is AWS
Since early 2006, Amazon has offered companies of all sizes a technology services platform in the cloud. With AWS, software developers can easily purchase compute, storage, database, and other Internet-based services to power their applications, and they are free to choose whichever development platform or programming environment best suits the problem at hand. Because developers pay only for what they use, with no up-front capital expenditure, AWS is one of the most cost-effective ways to deliver compute resources, stored data, and other applications to end users.
2. Advantages of AWS
AWS provides a secure, reliable, and scalable technology services platform that benefits customers in China and around the world.
2.1 No Up-Front Investment
Building on-premises infrastructure is slow and expensive: it involves ordering, paying for, installing, and configuring costly hardware and software, all well before the infrastructure is actually used. With AWS, developers and businesses no longer need to spend time and money on these activities; instead, they pay only for the resources they consume, when they consume them, with the amount varying by the volume and type of resources used.
2.2 Low Cost
AWS helps lower total IT costs in several ways. Economies of scale and efficiency improvements allow AWS to keep reducing prices, and multiple pricing models let customers optimize costs for both variable and stable workloads. AWS also reduces up-front and ongoing IT labor costs, giving customers access to a widely distributed, full-featured platform for a fraction of the cost of traditional infrastructure.
2.3 Flexible Capacity
It is hard to predict how users will adopt a new application. Developers who must size capacity before deploying an application usually end up in one of two situations: either expensive resources sit idle, or capacity runs short and end users suffer a poor experience until the resource limits are resolved. With AWS this problem disappears. Developers can provision exactly the amount of resources they need, when they need them. If they need more, they can easily scale up; if the resources are no longer needed, they simply turn them off and stop paying.
2.4 Speed and Agility
With traditional technology services, it takes weeks to procure, deliver, and get resources running, and such long lead times stifle innovation. With AWS, developers can deploy hundreds or even thousands of compute nodes in minutes, without any cumbersome processes. This self-service environment changes how quickly developers can build and deploy applications, letting software teams innovate faster and more frequently.
2.5 Applications, Not Operations
AWS frees customers from the resources required to build and operate data centers, so those resources can be redirected toward innovation. Scarce IT and engineering resources can be focused on projects that grow the business, rather than on IT infrastructure that, while important, does not set the business apart.
2.6 Global Reach
Whether a company using AWS is a large global enterprise or a small startup, it may have potential end users around the world. Traditional infrastructure struggles to deliver good performance to widely distributed users, and most companies, to save time and money, focus on a single geographic region. With AWS the situation is very different: developers can deploy applications using the same AWS technology operated in locations around the world, easily reaching end users in multiple geographic regions.
3. Analyzing Big Data with Hadoop
Amazon EMR is a managed service that makes it fast, easy, and cost-effective to run Apache Hadoop and Spark to process vast amounts of data. Amazon EMR also supports powerful, proven Hadoop tools such as Presto, Hive, Pig, and HBase. In this project, you deploy a fully functional Hadoop cluster ready to analyze log data within minutes. You start by launching an Amazon EMR cluster and then use a HiveQL script to process sample log data stored in an Amazon S3 bucket. HiveQL is a SQL-like scripting language for data warehousing and analysis. You can then use a similar setup to analyze your own log files.
Step 1: Set Up Prerequisites for Your Sample Cluster
Before you begin setting up your Amazon EMR cluster, make sure that you complete the prerequisites in this topic.
Sign Up for AWS
If you do not have an AWS account, use the following procedure to create one.
To sign up for AWS
Open https://aws.amazon.com/ and choose Sign Up.
Follow the on-screen instructions.
Create an Amazon S3 Bucket
In this tutorial, you specify an Amazon S3 bucket and folder to store the output data from a Hive query. The tutorial uses the default log location, but you can also specify a custom location if you prefer. Because of Hadoop requirements, bucket and folder names that you use with Amazon EMR have the following limitations:
They must contain only letters, numbers, periods (.), and hyphens (-).
They cannot end in numbers.
If you already have access to a folder that meets these requirements, you can use it for this tutorial. The output folder should be empty. Another requirement to remember is that bucket names must be unique across all AWS accounts.
For more information about creating a bucket, see Create a Bucket in the Amazon Simple Storage Service Getting Started Guide. After you create the bucket, choose it from the list and then choose Create folder, replace New folder with a name that meets the requirements, and then choose Save.
The bucket and folder name used later in the tutorial is s3://mybucket/MyHiveQueryResults.
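If you prefer to script this setup, the following is a minimal sketch using boto3 (the AWS SDK for Python). The bucket name my-emr-tutorial-bucket and the region us-west-2 are placeholders, not values from the tutorial; substitute your own globally unique bucket name that meets the naming rules above.

import boto3

region = "us-west-2"               # placeholder: use your own region
bucket = "my-emr-tutorial-bucket"  # placeholder: must be globally unique and use
                                   # only letters, numbers, periods, and hyphens
s3 = boto3.client("s3", region_name=region)

# Outside us-east-1, a LocationConstraint is required when creating a bucket.
s3.create_bucket(
    Bucket=bucket,
    CreateBucketConfiguration={"LocationConstraint": region},
)

# S3 has no real folders; a zero-byte key ending in "/" acts as one.
s3.put_object(Bucket=bucket, Key="MyHiveQueryResults/")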
Step 2: Launch Your Sample Amazon EMR Cluster
In this step, you launch your sample cluster by using Quick Options in the Amazon EMR console, leaving most options at their default values. To learn more about these options, see Summary of Quick Options after the procedure. You can also select Go to advanced options to explore the additional configuration options available for a cluster. Before you create your cluster for this tutorial, make sure that you meet the requirements in Step 1: Set Up Prerequisites for Your Sample Cluster.
Launch the Sample Cluster
To launch the sample Amazon EMR cluster
Sign in to the AWS Management Console and open the Amazon EMR console at https://console.aws.amazon.com/elasticmapreduce/.
Choose Create cluster.
On the Create Cluster - Quick Options page, accept the default values except for the following fields:
Enter a Cluster name that helps you identify the cluster, for example, My First EMR Cluster.
Under Security and access, choose the EC2 key pair that you created in Create an Amazon EC2 Key Pair.
Choose Create cluster.
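The same launch can be approximated programmatically. Below is a hedged boto3 sketch; the release label, instance type, instance count, and key pair name are assumptions standing in for the console's Quick Options defaults, so match them to what you actually chose.

import boto3

emr = boto3.client("emr", region_name="us-west-2")  # placeholder region

response = emr.run_job_flow(
    Name="My First EMR Cluster",
    ReleaseLabel="emr-5.36.0",  # assumed release label; pick a current one
    Applications=[{"Name": "Hadoop"}, {"Name": "Hive"}],
    Instances={
        "MasterInstanceType": "m5.xlarge",    # Quick Options-style defaults (assumed)
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,                   # 1 master + 2 core nodes
        "Ec2KeyName": "MyKeyPair",            # the EC2 key pair from Step 1
        "KeepJobFlowAliveWhenNoSteps": True,  # keep the cluster running between steps
    },
    JobFlowRole="EMR_EC2_DefaultRole",        # default EMR roles
    ServiceRole="EMR_DefaultRole",
)
print("Cluster ID:", response["JobFlowId"])

The returned JobFlowId (cluster ID) is what later steps and the cleanup section operate on.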
Step 3: Allow SSH Connections to the Cluster From Your Client
Security groups act as virtual firewalls to control inbound and outbound traffic to your cluster. When you create your first cluster, Amazon EMR creates the default Amazon EMR-managed security group associated with the master instance, ElasticMapReduce-master, and the security group associated with core and task nodes, ElasticMapReduce-slave.
Warning
The default EMR-managed security group for the master instance in public subnets, ElasticMapReduce-master, is pre-configured with a rule that allows inbound traffic on Port 22 from all sources (IPv4 0.0.0.0/0). This is to simplify initial SSH client connections to the master node. We strongly recommend that you edit this inbound rule to restrict traffic only from trusted sources or specify a custom security group that restricts access.
Modifying security groups isn’t a requirement to complete the tutorial, but we recommend that you do not allow inbound traffic from all sources. In addition, if another user has already edited the ElasticMapReduce-master security group to remove this rule per the recommendations, you will not be able to access the cluster using SSH in the next steps. For more information about security groups, see Control Network Traffic with Security Groups and Security Groups for Your VPC in the Amazon VPC User Guide.
To remove the inbound rule that allows public access using SSH for the ElasticMapReduce-master security group
The following procedure assumes that the ElasticMapReduce-master security group has not been edited previously. In addition, to edit security groups, you must be logged in to AWS as a root user or as an IAM principal that is allowed to manage security groups for the VPC that the cluster is in. For more information, see Changing Permissions for an IAM User and the Example Policy that allows managing EC2 security groups in the IAM User Guide.
Open the Amazon EMR console at https://console.aws.amazon.com/elasticmapreduce/.
Choose Clusters.
Choose the Name of the cluster.
Under Security and access, choose the Security groups for Master link.
Choose ElasticMapReduce-master from the list.
Choose Inbound, Edit.
Find the rule with the following settings and choose the x icon to delete it:
Type: SSH, Port: 22, Source: Custom 0.0.0.0/0
Scroll to the bottom of the list of rules and choose Add Rule. For Type, choose SSH; for Source, choose My IP or a trusted client address range; then choose Save.
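The same tightening can be done against the EC2 API. A minimal boto3 sketch, assuming you have looked up the security group ID of ElasticMapReduce-master (the sg-... value below is a placeholder, as is the trusted address):

import boto3

ec2 = boto3.client("ec2", region_name="us-west-2")  # placeholder region
group_id = "sg-0123456789abcdef0"  # placeholder: ID of ElasticMapReduce-master
my_ip = "203.0.113.10/32"          # placeholder: your trusted client address

# Remove the open SSH rule (port 22 from 0.0.0.0/0)...
ec2.revoke_security_group_ingress(
    GroupId=group_id, IpProtocol="tcp", FromPort=22, ToPort=22, CidrIp="0.0.0.0/0"
)
# ...and allow SSH only from the trusted source instead.
ec2.authorize_security_group_ingress(
    GroupId=group_id, IpProtocol="tcp", FromPort=22, ToPort=22, CidrIp=my_ip
)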
Step 4: Process Data by Running the Hive Script as a Step
With your cluster up and running, you can now submit a Hive script. In this tutorial, you submit the Hive script as a step using the Amazon EMR console. In Amazon EMR, a step is a unit of work that contains one or more jobs. As you learned in Step 2: Launch Your Sample Amazon EMR Cluster, you can submit steps to a long-running cluster, which is what we do in this step. You can also specify steps when you create a cluster, or you could connect to the master node, create the script in the local file system, and run it using the command line, for example, hive -f Hive_CloudFront.q.
Understanding the Data and Script
The sample data and script that you use in this tutorial are already available in an Amazon S3 location that you can access.
The sample data is a series of Amazon CloudFront access log files. For more information about CloudFront and log file formats, see the Amazon CloudFront Developer Guide. The data is stored in Amazon S3 at s3://region.elasticmapreduce.samples/cloudfront/data, where region is your region, for example, us-west-2. When you enter the location when you submit the step, you omit the cloudfront/data portion because the script adds it.
Each entry in the CloudFront log files provides details about a single user request in the following format:
2014-07-05 20:00:00 LHR3 4260 10.0.0.15 GET eabcd12345678.cloudfront.net /test-image-1.jpeg 200 - Mozilla/5.0%20(MacOS;%20U;%20Windows%20NT%205.1;%20en-US;%20rv:1.9.0.9)%20Gecko/2009040821%20IE/3.0.9
The sample script calculates the total number of requests per operating system over a specified time frame. The script uses HiveQL, which is a SQL-like scripting language for data warehousing and analysis. The script is stored in Amazon S3 at s3://region.elasticmapreduce.samples/cloudfront/code/Hive_CloudFront.q, where region is your region.
The sample Hive script does the following:
Creates a Hive table schema named cloudfront_logs. For more information about Hive tables, see the Hive Tutorial on the Hive wiki.
Uses the built-in regular expression serializer/deserializer (RegEx SerDe) to parse the input data and apply the table schema. For more information, see SerDe on the Hive wiki.
Runs a HiveQL query against the cloudfront_logs table and writes the query results to the Amazon S3 output location that you specify.
The contents of the Hive_CloudFront.q script are shown below. The ${INPUT} and ${OUTPUT} variables are replaced by the Amazon S3 locations that you specify when you submit the script as a step. When you reference data in Amazon S3 as this script does, Amazon EMR uses the EMR File System (EMRFS) to read input data and write output data.
-- Summary: This sample shows you how to analyze CloudFront logs stored in S3 using Hive

-- Create table using sample data in S3. Note: you can replace this S3 path with your own.
CREATE EXTERNAL TABLE IF NOT EXISTS cloudfront_logs (
  DateObject Date,
  Time STRING,
  Location STRING,
  Bytes INT,
  RequestIP STRING,
  Method STRING,
  Host STRING,
  Uri STRING,
  Status INT,
  Referrer STRING,
  OS String,
  Browser String,
  BrowserVersion String
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "^(?!#)([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+[^\(]+[\(]([^\;]+).*\%20([^\/]+)[\/](.*)$"
) LOCATION '${INPUT}/cloudfront/data';

-- Total requests per operating system for a given time frame
INSERT OVERWRITE DIRECTORY '${OUTPUT}/os_requests/' SELECT os, COUNT(*) count FROM cloudfront_logs WHERE dateobject BETWEEN '2014-07-05' AND '2014-08-05' GROUP BY os;
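To see what the RegexSerDe does to each row, here is a small Python sketch that applies an equivalent of the same pattern (with one backslash layer removed, since it is no longer inside a Hive string literal) to the sample log line from above and tallies requests per operating system, mirroring the GROUP BY os query. This is only an illustration of the parsing, not part of the tutorial's workflow.

import re
from collections import Counter

# Equivalent of the SerDe's input.regex outside a Hive string literal.
LOG_REGEX = re.compile(
    r"^(?!#)([^ ]+)\s+([^ ]+)\s+([^ ]+)\s+([^ ]+)\s+([^ ]+)\s+([^ ]+)"
    r"\s+([^ ]+)\s+([^ ]+)\s+([^ ]+)\s+([^ ]+)"
    r"\s+[^\(]+[\(]([^\;]+).*\%20([^\/]+)[\/](.*)$"
)

lines = [
    "2014-07-05 20:00:00 LHR3 4260 10.0.0.15 GET eabcd12345678.cloudfront.net "
    "/test-image-1.jpeg 200 - Mozilla/5.0%20(MacOS;%20U;%20Windows%20NT%205.1;"
    "%20en-US;%20rv:1.9.0.9)%20Gecko/2009040821%20IE/3.0.9",
]

os_counts = Counter()
for line in lines:
    m = LOG_REGEX.match(line)
    if m:
        os_counts[m.group(11)] += 1  # group 11 is the OS column

print(os_counts)  # Counter({'MacOS': 1})

Groups 1 through 10 map to the first ten table columns (date through referrer), and the last three groups pull OS, Browser, and BrowserVersion out of the URL-encoded user-agent string.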
Submit the Hive Script as a Step
Use the Add Step option to submit your Hive script to the cluster using the console. The Hive script and sample data have been uploaded to Amazon S3, and you specify the output location as the folder you created earlier in Create an Amazon S3 Bucket.
To run the Hive script by submitting it as a step
Open the Amazon EMR console at https://console.aws.amazon.com/elasticmapreduce/.
In Cluster List, select the name of your cluster. Make sure the cluster is in a Waiting state.
Choose Steps, and then choose Add step.
Configure the step according to the following guidelines:
For Step type, choose Hive program.
For Name, you can leave the default or type a new name. If you have many steps in a cluster, the name helps you keep track of them.
For Script S3 location, type s3://region.elasticmapreduce.samples/cloudfront/code/Hive_CloudFront.q, replacing region with your region identifier. For example, s3://us-west-2.elasticmapreduce.samples/cloudfront/code/Hive_CloudFront.q if you are working in the Oregon region. For a list of regions and corresponding Region identifiers, see AWS Regions and Endpoints for Amazon EMR in the AWS General Reference.
For Input S3 location, type s3://region.elasticmapreduce.samples, replacing region with your region identifier.
For Output S3 location, type or browse to the output bucket that you created in Create an Amazon S3 Bucket.
For Action on failure, accept the default option Continue. This specifies that if the step fails, the cluster continues to run and processes subsequent steps. The Cancel and wait option specifies that a failed step should be canceled, that subsequent steps should not run, but that the cluster should continue running. The Terminate cluster option specifies that the cluster should terminate if the step fails.
Choose Add. The step appears in the console with a status of Pending.
The status of the step changes from Pending to Running to Completed as the step runs. To update the status, choose the refresh icon to the right of the Filter. The script takes approximately a minute to run.
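The same step can also be submitted programmatically. The sketch below uses boto3; the cluster ID is a placeholder, the bucket name matches the tutorial's example, and the hive-script/command-runner argument pattern is my assumption about how the console invokes Hive steps, so treat it as a sketch rather than a definitive recipe.

import boto3

emr = boto3.client("emr", region_name="us-west-2")
cluster_id = "j-XXXXXXXXXXXXX"  # placeholder: your cluster ID

step = {
    "Name": "My Hive step",
    "ActionOnFailure": "CONTINUE",  # same as the console default
    "HadoopJarStep": {
        "Jar": "command-runner.jar",
        "Args": [
            "hive-script", "--run-hive-script", "--args",
            "-f", "s3://us-west-2.elasticmapreduce.samples/cloudfront/code/Hive_CloudFront.q",
            "-d", "INPUT=s3://us-west-2.elasticmapreduce.samples",
            "-d", "OUTPUT=s3://mybucket/MyHiveQueryResults/",
        ],
    },
}
step_id = emr.add_job_flow_steps(JobFlowId=cluster_id, Steps=[step])["StepIds"][0]

# Block until the step finishes (Pending -> Running -> Completed).
emr.get_waiter("step_complete").wait(ClusterId=cluster_id, StepId=step_id)

The -d flags supply the ${INPUT} and ${OUTPUT} variables that the script expects, just as the console's Input and Output fields do.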
View the Results
After the step completes successfully, the Hive query output is saved as a text file in the Amazon S3 output folder that you specified when you submitted the step.
To view the output of the Hive script
Open the Amazon S3 console at https://console.aws.amazon.com/s3/.
Choose the Bucket name and then the folder that you set up earlier. For example, mybucket and then MyHiveQueryResults.
The query writes results to a folder within your output folder named os_requests. Choose that folder. There should be a single file named 000000_0 in the folder. This is a text file that contains your Hive query results.
Choose the file, and then choose Download to save it locally.
Use the text editor that you prefer to open the file. The output file shows the number of access requests ordered by operating system. The following example shows the output in WordPad:
(Figure: Sample Hive query results in WordPad.)
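You can also fetch the result file without the console. A short boto3 sketch, assuming the bucket and folder names used in this tutorial:

import boto3

s3 = boto3.client("s3")

# The query writes its result into os_requests/ inside the output folder.
key = "MyHiveQueryResults/os_requests/000000_0"
s3.download_file("mybucket", key, "os_requests.txt")

with open("os_requests.txt") as f:
    print(f.read())  # each line: an operating system and its request count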
Step 5: Terminate the Cluster and Delete the Bucket
After you complete the tutorial, you may want to terminate your cluster and delete your Amazon S3 bucket to avoid additional charges.
Terminating your cluster terminates the associated Amazon EC2 instances and stops the accrual of Amazon EMR charges. Amazon EMR preserves metadata about completed clusters for your reference, at no charge, for two months. The console does not provide a way to delete terminated clusters; they remain visible until this metadata expires, after which they are removed from the console automatically.
To terminate the cluster
Open the Amazon EMR console at https://console.aws.amazon.com/elasticmapreduce/.
Choose Clusters, choose your cluster, and then choose Terminate.
Clusters are often created with termination protection on, which helps prevent accidental shutdown. If you followed the tutorial precisely, termination protection should be off. If termination protection is on, you are prompted to change the setting as a precaution before terminating the cluster. Choose Change, Off.
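The cleanup can be scripted as well. This boto3 sketch first disables termination protection (in case it was on) and then terminates the cluster; the cluster ID is a placeholder.

import boto3

emr = boto3.client("emr", region_name="us-west-2")
cluster_id = "j-XXXXXXXXXXXXX"  # placeholder: your cluster ID

# Turn termination protection off first; termination fails while it is on.
emr.set_termination_protection(JobFlowIds=[cluster_id], TerminationProtected=False)
emr.terminate_job_flows(JobFlowIds=[cluster_id])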
To delete the output bucket
Open the Amazon S3 console at https://console.aws.amazon.com/s3/.
Choose the bucket from the list, so that the whole bucket row is selected.
Choose Delete bucket, type the name of the bucket, and then choose Confirm.
For more information about deleting folders and buckets, go to How Do I Delete an S3 Bucket in the Amazon Simple Storage Service Getting Started Guide.
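Deleting the bucket from code requires emptying it first, since S3 refuses to delete a non-empty bucket. A boto3 sketch, assuming the tutorial's example bucket name:

import boto3

bucket = boto3.resource("s3").Bucket("mybucket")  # placeholder name

# Remove all objects, then the bucket itself.
bucket.objects.all().delete()
bucket.delete()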
4. Personal Summary
AWS is, in essence, a managed platform: it has its own management system, and within that system it offers the tools we need in daily work. Most importantly, its high-performance services let us run whatever we want without owning any hardware. That is arguably the key point of the cloud computing idea: providing a platform for developers to build on.
As for how to operate AWS, the official documentation is the best guide: it records every step in detail. It is written entirely in English, but a quick translation is enough, and the process itself is not hard. What we do need is to understand the content and meaning behind each step, and the structure and role of each component; of course, you should also check the pricing yourself.