data.tables의 X [Y] 조인이 전체 외부 조인 또는 왼쪽 조인을 허용하지 않는 이유는 무엇입니까?
이것은 data.table 조인 구문에 대한 약간의 철학적 질문입니다. data.tables에 대한 점점 더 많은 용도를 찾고 있지만 여전히 배우고 있습니다 ...
X[Y]
data.tables 의 조인 형식 은 매우 간결하고 편리하며 효율적이지만 내부 조인과 오른쪽 외부 조인 만 지원합니다. 왼쪽 또는 전체 외부 조인을 얻으려면 다음을 사용해야합니다 merge
.
X[Y, nomatch = NA]
-Y의 모든 행-오른쪽 외부 조인 (기본값)X[Y, nomatch = 0]
-X와 Y가 모두 일치하는 행만-내부 조인merge(X, Y, all = TRUE)
-X와 Y의 모든 행-완전 외부 조인merge(X, Y, all.x = TRUE)
-X의 모든 행-왼쪽 외부 조인
X[Y]
조인 형식이 4 가지 유형의 조인을 모두 지원 하면 편리 할 것 같습니다 . 두 가지 유형의 조인 만 지원되는 이유가 있습니까?
나를 위해, nomatch = 0
및 nomatch = NA
매개 변수 값은 수행중인 작업에 대해 매우 직관적이지 않습니다. merge
구문 을 이해하고 기억하기가 더 쉽습니다 : all = TRUE
, all.x = TRUE
및 all.y = TRUE
. X[Y]
연산은 merge
보다 훨씬 유사 하기 때문에 함수의 매개 변수 가 아닌 조인 구문을 match
사용하지 않는 이유는 무엇입니까?merge
match
nomatch
다음은 4 가지 조인 유형의 코드 예입니다.
# sample X and Y data.tables
library(data.table)
X <- data.table(t = 1:4, a = (1:4)^2)
setkey(X, t)
X
# t a
# 1: 1 1
# 2: 2 4
# 3: 3 9
# 4: 4 16
Y <- data.table(t = 3:6, b = (3:6)^2)
setkey(Y, t)
Y
# t b
# 1: 3 9
# 2: 4 16
# 3: 5 25
# 4: 6 36
# all rows from Y - right outer join
X[Y] # default
# t a b
# 1: 3 9 9
# 2: 4 16 16
# 3: 5 NA 25
# 4: 6 NA 36
X[Y, nomatch = NA] # same as above
# t a b
# 1: 3 9 9
# 2: 4 16 16
# 3: 5 NA 25
# 4: 6 NA 36
merge(X, Y, by = "t", all.y = TRUE) # same as above
# t a b
# 1: 3 9 9
# 2: 4 16 16
# 3: 5 NA 25
# 4: 6 NA 36
identical(X[Y], merge(X, Y, by = "t", all.y = TRUE))
# [1] TRUE
# only rows in both X and Y - inner join
X[Y, nomatch = 0]
# t a b
# 1: 3 9 9
# 2: 4 16 16
merge(X, Y, by = "t") # same as above
# t a b
# 1: 3 9 9
# 2: 4 16 16
merge(X, Y, by = "t", all = FALSE) # same as above
# t a b
# 1: 3 9 9
# 2: 4 16 16
identical( X[Y, nomatch = 0], merge(X, Y, by = "t", all = FALSE) )
# [1] TRUE
# all rows from X - left outer join
merge(X, Y, by = "t", all.x = TRUE)
# t a b
# 1: 1 1 NA
# 2: 2 4 NA
# 3: 3 9 9
# 4: 4 16 16
# all rows from both X and Y - full outer join
merge(X, Y, by = "t", all = TRUE)
# t a b
# 1: 1 1 NA
# 2: 2 4 NA
# 3: 3 9 9
# 4: 4 16 16
# 5: 5 NA 25
# 6: 6 NA 36
업데이트 : data.table v1.9.6 on=
은 기본 키 이외의 필드에서 임시 조인을 허용 하는 구문을 도입했습니다 . 질문에 대한 jangorecki의 답변 데이터 프레임 (내부, 외부, 왼쪽, 오른쪽)을 결합 (병합)하는 방법은 무엇입니까? data.table이 처리 할 수있는 추가 조인 유형의 몇 가지 예를 제공합니다.
To quote from the data.table
FAQ 1.11 What is the difference between X[Y]
and merge(X, Y)
?
X[Y]
is a join, looking up X's rows using Y (or Y's key if it has one) as an index.
Y[X]
is a join, looking up Y's rows using X (or X's key if it has one)
merge(X,Y)
does both ways at the same time. The number of rows ofX[Y]
andY[X]
usually differ, whereas the number of rows returned bymerge(X,Y)
andmerge(Y,X)
is the same.BUT that misses the main point. Most tasks require something to be done on the data after a join or merge. Why merge all the columns of data, only to use a small subset of them afterwards? You may suggest
merge(X[,ColsNeeded1],Y[,ColsNeeded2])
, but that requires the programmer to work out which columns are needed.X[Y,j
] in data.table does all that in one step for you. When you writeX[Y,sum(foo*bar)]
, data.table automatically inspects thej
expression to see which columns it uses. It will only subset those columns only; the others are ignored. Memory is only created for the columns thej
uses, andY
columns enjoy standard R recycling rules within the context of each group. Let's sayfoo
is inX
, and bar is inY
(along with 20 other columns inY
). Isn'tX[Y,sum(foo*bar)]
quicker to program and quicker to run than a merge of everything wastefully followed by a subset?
If you want a left outer join of X[Y]
le <- Y[X]
mallx <- merge(X, Y, all.x = T)
# the column order is different so change to be the same as `merge`
setcolorder(le, names(mallx))
identical(le, mallx)
# [1] TRUE
If you want a full outer join
# the unique values for the keys over both data sets
unique_keys <- unique(c(X[,t], Y[,t]))
Y[X[J(unique_keys)]]
## t b a
## 1: 1 NA 1
## 2: 2 NA 4
## 3: 3 9 9
## 4: 4 16 16
## 5: 5 25 NA
## 6: 6 36 NA
# The following will give the same with the column order X,Y
X[Y[J(unique_keys)]]
@mnel's answer is spot on, so do accept that answer. This is just follow up, too long for comments.
As mnel says, left/right outer join is obtained by swapping Y
and X
: Y[X]
-vs- X[Y]
. So 3 of the 4 join types are supported in that syntax, not 2, iiuc.
Adding the 4th seems a good idea. Let's say we add full=TRUE
or both=TRUE
or merge=TRUE
(not sure the best argument name?) then it hadn't occurred to me before that X[Y,j,merge=TRUE]
would be useful for the reasons after the BUT in FAQ 1.12. New feature request now added and linked back here, thanks :
FR#2301 : Add merge=TRUE argument for both X[Y] and Y[X] join like merge() does.
Recent versions have sped up merge.data.table
(by taking a shallow copy internally to set the keys more efficiently, for example). So we are trying to bring merge()
and X[Y]
closer, and provide all options to user for full flexibility. There are pros and cons of both. Another outstanding feature request is :
FR#2033 : Add by.x and by.y to merge.data.table
If there are any others, please keep them coming.
By this part in the question :
why not use the merge syntax for joins rather than the match function's nomatch parameter?
If you prefer merge()
syntax and its 3 arguments all
,all.x
and all.y
then just use that instead of X[Y]
. Think it should cover all the cases. Or did you mean why is the argument a single nomatch
in [.data.table
? If so, it's just the way that seemed natural given FAQ 2.14 : "Can you explain further why data.table is inspired by A[B] syntax in base?". But also, nomatch
only takes two values currently 0
and NA
. That could be extended so that a negative value meant something, or 12 would mean use the 12th row's values to fill in NAs, for example, or nomatch
in future could be a vector or even itself a data.table
.
Hm. How would by-without-by interact with merge=TRUE? Perhaps we should take this over to datatable-help.
This "answer" is a proposal for discussion: As indicated in my comment, I suggest adding a join
parameter to [.data.table() to enable additional types of joins, ie: X[Y,j,join=string]
. In addition to the 4 types of ordinary joins, I also suggest to support 3 types of exclusive joins, and the cross join.
The join
string values (and aliases) for the various join types are proposed to be:
"all.y"
and"right"
-- right join, the present data.table default (nomatch=NA) - all Y rows with NAs where there is no X match;"both"
and"inner"
-- inner join (nomatch=0) - only rows where X and Y match;"all.x"
and"left"
-- left join - all rows from X, NAs where no Y match:"outer"
and"full"
-- full outer join - all rows from X and Y, NAs where no match"only.x"
and"not.y"
-- non-join or anti-join returning X rows where there is no Y match"only.y"
and"not.x"
-- non-join or anti-join returning Y rows where there is no X match"not.both"
-- exclusive join returning X and Y rows where there is no match to the other table, ie an exclusive-or (XOR)"cross"
-- cross join or Cartesian product with each row of X matched to each row of Y
The default value is join="all.y"
which corresponds to the present default.
The "all", "all.x" and "all.y" string values correspond to merge()
parameters. The "right", "left", "inner" and "outer" strings may be more amenable to SQL users.
The "both" and "not.both" strings are my best suggestion at the moment -- but someone may have better string suggestions for the inner join and exclusive join. (I'm not sure if "exclusive" is the right terminology, correct me if there is a proper term for an "XOR" join.)
Use of join="not.y"
is an alternative for X[-Y,j]
or X[!Y,j]
non-join syntax and maybe more clear (to me), although I'm not sure if they are the same (new feature in data.table version 1.8.3).
The cross join can be handy sometimes, but it may not fit in the data.table paradigm.
'IT story' 카테고리의 다른 글
C ++ 표준위원회는 C ++ 11에서 unorder_map이 삽입 한 것을 파괴 하는가? (0) | 2020.07.26 |
---|---|
광 주파수를 RGB로 변환 하시겠습니까? (0) | 2020.07.26 |
`ui-router` $ stateParams vs. $ state.params (0) | 2020.07.26 |
브라우저가 요청을 취소 할 때 ASP.NET 웹 API OperationCanceledException (0) | 2020.07.26 |
Java 응용 프로그램을 배포하는 가장 좋은 방법은 무엇입니까? (0) | 2020.07.26 |