用 SQL 对关系型数据库进行查询

前面几节中，我们已经掌握了如何向 SQLite 数据库中写入数据。这一节，我们将学习
如何根据需求对数据库进行查询，进而从中获取数据。接下来的例子中会使
用 data/datasets.sqlite（之前创建的）。
首先，需要与数据库建立连接：
con <- dbConnect(SQLite( ), "data/datasets.sqlite")
dbListTables(con)

## [1] "diamonds" "flights"
数据库中有两张表，我们用 select 语句来选取 diamonds 中所有的数据。这里需
要选择所有的列（字段）。所以，我们调用 dbGetQuery( )，将数据库连接 con 和查询
语句作为参数输入：
db_diamonds <- dbGetQuery(con,
"select * from diamonds")
head(db_diamonds, 3)
## carat cut color clarity depth table price x y z
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
注意，* 这个符号代表所有的字段。如果我们只需要字段的一个子集，也可以依次列
出字段名：
db_diamonds <- dbGetQuery(con,
"select carat, cut, color, clarity,
depth, price
from diamonds")
head(db_diamonds, 3)
## carat cut color clarity depth price
## 1 0.23 Ideal E SI2 61.5 326
## 2 0.21 Premium E SI1 59.8 326
## 3 0.23 Good E VS1 56.9 327
如果想要选取数据中所有不重复的值，可以使用 select distinct。例如，下面的
代码会返回 diamonds 表中 cut 字段的所有不重复的取值：
dbGetQuery(con, "select distinct cut from diamonds")
## cut
## 1 Ideal
## 2 Premium
## 3 Good
## 4 Very Good
## 5 Fair
注意，dbGetQuery( ) 总是返回一个数据框，虽然它只有一列。为使单列数据框还
原成原子向量，只需从数据框中取出第 1 列：
11.1 操作关系型数据库 327
dbGetQuery(con, "select distinct clarity from diamonds")[[1]]
## [1] "SI2" "SI1" "VS1" "VS2" "VVS2" "VVS1" "I1" "IF"
当用 select 选择列查询时，原表中的列名可能并不合意。此时，可以用 A as B 的
形式，得到名为 B 的列，但 B 中的数据与原表的 A 列一致：
db_diamonds <- dbGetQuery(con,
"select carat, price, clarity as clarity_level from diamonds")
head(db_diamonds, 3)
## carat price clarity_level
## 1 0.23 326 SI2
## 2 0.21 326 SI1
## 3 0.23 327 VS1
有时候，我们想要的值不是直接存储在数据库中，而是需要经过一些计算才能得到。
这时，也可以使用 A as B 的语句形式，这里的 A 是现有列之间的算术运算式：
db_diamonds <- dbGetQuery(con,
"select carat, price, x * y * z as size
from diamonds")
head(db_diamonds, 3)
## carat price size
## 1 0.23 326 38.20203
## 2 0.21 326 34.50586
## 3 0.23 327 38.07688
假如要用现有列生成一个新列，再用该新列生成另一个列，我们该怎么办呢？
db_diamonds <- dbGetQuery(con,
"select carat, price, x * y * z as size,
price / size as value_density
from diamonds")
## Error in sqliteSendQuery(con, statement, bind.data): error in statement:
no such column: size
上面的做法是行不通的。语句 A as B 中，A 必须由已存在的列构成。然而，如果确
实需要这样做，可以用嵌套查询的办法，即通过一个内嵌的 select 语句产生一个临时表，
再从临时表中选出所需列：
db_diamonds <- dbGetQuery(con,
"select *, price / size as value_density from
(select carat, price, x * y * z as size
from diamonds)")
head(db_diamonds, 3)
## carat price size value_density
## 1 0.23 326 38.20203 8.533578
## 2 0.21 326 34.50586 9.447672
## 3 0.23 327 38.07688 8.587887
这种情况下，在计算 price/size 时，size 已经在临时表中定义了。
数据库查询的另一个重要部分就是条件查询。我们使用 where 指明查询结果应满足的
条件。例如，选择 cut 值为 Good 的钻石数据：
good_diamonds <- dbGetQuery(con,
"select carat, cut, price from diamonds
where cut = 'Good'")
head(good_diamonds, 3)
## carat cut price
## 1 0.23 Good 327
## 2 0.31 Good 335
## 3 0.30 Good 339
注意，cut 取值为 Good 的记录只有很少一部分：
nrow(good_diamonds) / nrow(diamonds)
## [1] 0.09095291
如果查询需要同时满足多个条件，可以用 and 来连结这些条件。例如，选出所
有 cut 为 Good 且 color 值为 E 的记录：
good_e_diamonds <- dbGetQuery(con,
"select carat, cut, color, price from diamonds
where cut = 'Good' and color = 'E'")
head(good_e_diamonds, 3)
## carat cut color price
## 1 0.23 Good E 327
## 2 0.23 Good E 402
## 3 0.26 Good E 554
nrow(good_e_diamonds) / nrow(diamonds)
## [1] 0.017297
同样的逻辑也适用于 or 和 not。
除了这些简单的逻辑运算之外，也可以通过检查字段的值是否包含在给定集合中，可
以使用 in 来筛选记录。例如，筛选出 color 为 E 或 F 的记录：
color_ef_diamonds <- dbGetQuery(con,
"select carat, cut, color, price from diamonds
where color in ('E', 'F')")
nrow(color_ef_diamonds)
## [1] 19339
我们用下表验证该结果：
table(diamonds$color)
##
## D E F G H I J
## 6775 9797 9542 11292 8304 5422 2808
使用 in 语句的时候，我们需要为它指定一个集合。而语句 between … and…则需
要指定一个区间：
some_price_diamonds <- dbGetQuery(con,
"select carat, cut, color, price from diamonds
where price between 5000 and 5500")
nrow(some_price_diamonds) /nrow(diamonds)
## [1] 0.03285132
实际上这个区间不一定是数值型的，只要字段的数据类型是可比的即可。而对于字符
串类型的列，我们可用 between 'string1' to 'string2' 语句，按照字典的排列顺
序来筛选记录。
针对字符串字段，还有一个有用的运算符：like，它可以用来筛选具有某种模式的字
段。例如，我们可以选出表中 cut 变量取值以 Good 结尾的记录。它可以
是 Good 或 VeryGood。我们用 like '%Good'，这里的 % 符号可以匹配任何字符串。
good_cut_diamonds <- dbGetQuery(con,
"select carat, cut, color, price from diamonds
where cut like '%Good' ")
nrow(good_cut_diamonds) / nrow(diamonds)
## [1] 0.3149425
数据库查询还有一个重要功能，即按照指定字段重新排列数据，可以使用 order by 实
现这个功能。例如，检索所有记录的 carat 和 price 字段，并按照 price 字段升序排列：
cheapest_diamonds <- dbGetQuery(con,
"select carat, price from diamonds
order by price")
如此便可得到一个钻石数据的数据框，按照由便宜到昂贵的顺序排列：
head(cheapest_diamonds)
## carat price
## 1 0.23 326
## 2 0.21 326
## 3 0.23 327
## 4 0.29 334
## 5 0.31 335
## 6 0.24 336
在指定排序字段时加一个 desc，就可以进行降序排列，这里我们得到一个顺序完全
相反的数据框：
most_expensive_diamonds <- dbGetQuery(con,
"select carat, price from diamonds
order by price desc")
head(most_expensive_diamonds)
## carat price
## 1 2.29 18823
## 2 2.00 18818
## 3 1.51 18806
## 4 2.07 18804
## 5 2.00 18803
## 6 2.29 18797
也可以根据多个字段（或列）对记录进行排序。例如，首先按照 price 进行升序排列，
如果两条记录的 price 取值相等，再按照 carat 进行降序排列：
cheapest_diamonds <- dbGetQuery(con,
"select carat, price from diamonds
order by price, carat desc")
head(cheapest_diamonds)
## carat price
## 1 0.23 326
## 2 0.21 326
## 3 0.23 327
## 4 0.29 334
## 5 0.31 335
## 6 0.24 336
就像 select 语句中用于排序的列可以是根据已有列计算生成的：
dense_diamonds <- dbGetQuery(con,
"select carat, price, x * y * z as size from diamonds
order by carat /size desc")
head(dense_diamonds)
## carat price size
## 1 1.07 5909 47.24628
## 2 1.41 9752 74.41726
## 3 1.53 8971 85.25925
## 4 1.51 7188 133.10400
## 5 1.22 3156 108.24890
## 6 1.12 6115 100.97448
同时使用 where 和 order by 便可得到一个排序的子集结果：
head(dbGetQuery(con,
"select carat, price from diamonds
where cut = 'Ideal' and clarity = 'IF' and color = 'J'
order by price"))
## carat price
## 1 0.30 489
## 2 0.30 489
## 3 0.32 521
## 4 0.32 533
## 5 0.32 533
## 6 0.35 569
如果只关心前几行结果，我们可以用 limit 来限制取出的记录条数：
dbGetQuery(con,
"select carat, price from diamonds
order by carat desc limit 3")
## carat price
## 1 5.01 18018
## 2 4.50 18531
## 3 4.13 17329
除了字段选择（按列选取）、条件筛选和排序，我们还可以在数据库中对记录进行分组
聚合。例如，计算每种颜色的记录条数：
dbGetQuery(con,
"select color, count(*) as number from diamonds
group by color")
## color number
## 1 D 6775
## 2 E 9797
## 3 F 9542
## 4 G 11292
## 5 H 8304
## 6 I 5422
## 7 J 2808
对原始数据调用 table( )，检验查询结果：
table(diamonds$color)
##
## D E F G H I J
## 6775 9797 9542 11292 8304 5422 2808
除了汇总计数，其他聚合函数还有 avg( )、max( )、min( ) 和 sum( )。例如，
计算不同透明度水平的平均价格：
dbGetQuery(con,
"select clarity, avg(price) as avg_price
from diamonds
group by clarity
order by avg_price desc")
## clarity avg_price
## 1 SI2 5063.029
## 2 SI1 3996.001
## 3 VS2 3924.989
## 4 I1 3924.169
## 5 VS1 3839.455
## 6 VVS2 3283.737
## 7 IF 2864.839
## 8 VVS1 2523.115
也可以检查一下，在最低的 5 个价格水平下，能买到的最大克拉数是多少：
dbGetQuery(con,
"select price, max(carat) as max_carat
from diamonds
group by price
order by price limit 5")
## price max_carat
## 1 326 0.23
## 2 327 0.23
## 3 334 0.29
## 4 335 0.31
## 5 336 0.24
还可以在组内同时进行多个运算。以下代码计算了每个透明度水平下的价格区间和价
格平均值：
dbGetQuery(con,
"select clarity,
min(price) as min_price,
max(price) as max_price,
avg(price) as avg_price
from diamonds
group by clarity
order by avg_price desc")
## clarity min_price max_price avg_price
## 1 SI2 326 18804 5063.029
## 2 SI1 326 18818 3996.001
## 3 VS2 334 18823 3924.989
## 4 I1 345 18531 3924.169
## 5 VS1 327 18795 3839.455
## 6 VVS2 336 18768 3283.737
## 7 IF 369 18806 2864.839
## 8 VVS1 336 18777 2523.115
接下来的例子，用重量进行加权，计算了不同透明度水平下每克拉钻石的平均价格：
dbGetQuery(con,
"select clarity,
sum(price * carat) / sum(carat) as wprice
from diamonds
group by clarity
order by wprice desc")
## clarity wprice
## 1 SI2 7012.257
## 2 VS2 6173.858
## 3 VS1 6059.505
## 4 SI1 5919.187
## 5 VVS2 5470.156
## 6 I1 5233.937
## 7 IF 5124.584
## 8 VVS1 4389.112
就像可以根据多个字段进行排序，我们也可以根据多个字段进行分组。以下代码计算
了不同透明度水平和颜色种类下钻石的平均价格，并展示了最昂贵的 5 种组合：
dbGetQuery(con,
"select clarity, color,
avg(price) as avg_price
from diamonds
group by clarity, color
order by avg_price desc
limit 5")
## clarity color avg_price
## 1 IF D 8307.370
## 2 SI2 I 7002.649
## 3 SI2 J 6520.958
## 4 SI2 H 6099.895
## 5 VS2 I 5690.506
关系型数据中，最能体现“关系”概念的运算是表的连接（join），即将若干表通过某
些字段连接起来。例如，创建一个新的数据框 diamond_selector，包含字段 cut、color
和 clarity 的共有 3 条记录，之后我们将根据这 3 条记录筛选数据：
diamond_selector <- data.frame(
cut = c("Ideal", "Good", "Fair"),
color = c("E", "I", "D"),
clarity = c("VS1", "T1", "IF"),
stringsAsFactors = FALSE
)
diamond_selector
## cut color clarity
## 1 Ideal E VS1
## 2 Good I T1
## 3 Fair D IF
创建好数据框后，我们将它写入数据库，然后连接 diamonds 表和 diamond_
selector 表，再筛选出合意的记录：
dbWriteTable(con, "diamond_selector", diamond_selector,
row.names=FALSE, overwrite=TRUE)
## [1] TRUE
通过连接子句（join-clause）声明相匹配的列：
subset_diamonds <- dbGetQuery(con,
"select cut, color, clarity, carat, price
from diamonds
join diamond_selector using (cut, color, clarity)")
head(subset_diamonds)
## cut color clarity carat price
## 1 Ideal E VS1 0.60 2774
## 2 Ideal E VS1 0.26 556
## 3 Ideal E VS1 0.70 2818
## 4 Ideal E VS1 0.70 2837
## 5 Ideal E VS1 0.26 556
## 6 Ideal E VS1 0.26 556
总的来说，符合 3 个筛选条件其中之一的，只有很少一部分记录：
nrow(subset_diamonds) / nrow(diamonds)
## [1] 0.01104931
最后，不要忘记断开数据库连接，以确保所有资源被正确释放：
dbDisconnect(con)
## [1] TRUE
在前面的例子中，我们只展示了 SQL 用于查询关系型数据库（以 SQLite 为例）的基
本用法。实际上，SQL 远比我们演示的更加丰富和强大。若想了解更多细节，请访
问 http://www.w3schools.com/sql。

秒客网

用 SQL 对关系型数据库进行查询

相关文章