Problem: Distributed computing

Distributed join


To join data, the matching records need to be colocated. This idea is to colocate a subset of the joined data on each node so that the dataset can be split up across machines.


If you've got lots of data, more than can fit in memory, you need a way to split it up between machines. This idea is to prejoin data at insert time and send the records that join together to the same machine.

This way the overall dataset can be split between machines, but the join can still be executed locally on every node and the per-node results aggregated into the total joined data.
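To make the "join on every node, then aggregate" step concrete, here is a minimal sketch. The tables, the cluster size and the modulo placement rule are all assumptions made up for the illustration, not part of the original project.

```python
from collections import defaultdict

NUM_NODES = 3  # hypothetical cluster size

def node_for(key):
    # Placeholder placement rule for this sketch: integer keys modulo cluster size.
    return key % NUM_NODES

people = [(1, "Ann"), (2, "Bob"), (3, "Cy")]              # (id, name)
items = [(10, 1, "pen"), (11, 1, "ink"), (12, 3, "cup")]  # (id, people_id, label)

# Insert time: route every row by its join key so matching rows share a node.
nodes = defaultdict(lambda: {"people": [], "items": []})
for row in people:
    nodes[node_for(row[0])]["people"].append(row)
for row in items:
    nodes[node_for(row[1])]["items"].append(row)

def local_join(shard):
    # Plain hash join over the rows held by one node.
    by_id = {pid: name for pid, name in shard["people"]}
    return [(by_id[pid], label) for _, pid, label in shard["items"] if pid in by_id]

# Query time: every node joins its own shard, then the results are concatenated.
distributed = sorted(r for shard in nodes.values() for r in local_join(shard))

# Reference: the same join over the unsplit data.
names = dict(people)
single = sorted((names[pid], label) for _, pid, label in items if pid in names)
```

Because rows are routed by join key, `distributed` and `single` hold the same pairs even though no node ever saw the whole dataset.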

In more detail, we decide which machine stores a record by consistently hashing its join key, so matching records always get stored on the same machine.
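One way the placement could look is a hash ring. This is only a sketch: `HashRing`, `stable_hash`, the node names and the replica count are assumptions for illustration.

```python
import bisect
import hashlib

def stable_hash(text):
    # Python's built-in hash() is salted per process, so use a digest instead.
    return int.from_bytes(hashlib.sha256(text.encode()).digest()[:8], "big")

class HashRing:
    def __init__(self, node_names, replicas=64):
        # Each node gets several points on the ring to even out the load.
        self._points = sorted(
            (stable_hash(f"{name}#{i}"), name)
            for name in node_names
            for i in range(replicas)
        )
        self._hashes = [h for h, _ in self._points]

    def node_for(self, join_key):
        # Walk clockwise to the first point at or after the key's hash.
        idx = bisect.bisect_left(self._hashes, stable_hash(str(join_key)))
        return self._points[idx % len(self._points)][1]

ring = HashRing(["node-a", "node-b", "node-c"])
# A people row with id 42 and an items row whose people field is 42
# hash the same join key, so they are placed on the same machine.
people_node = ring.node_for(42)
items_node = ring.node_for(42)
```

The reason to use a ring rather than a plain modulo is that adding a node only moves the keys that the new node takes over; everything else stays where it was.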

chronological,



Isn't this what a distributed NFT does? We are aggregating a project from many accounts. I like to call it composition. I do think composition is the problem to solve, not just for storage concerns, but to manage complexity.




skihappy,


I don't know how NFTs store ownership in the blockchain. But essentially this idea is about a mid-level implementation detail of a distributed SQL database storage engine.

I find the idea of a distributed P2P database very useful. I have a project where I implemented a very simple SQL database that supports distributed joins by distributing join keys to every node in the cluster. In this idea I take a different approach: I hash the join key and use it to do the node placement.

So everybody in the cluster gets a subset of the data. You need everybody to be online to do a query.



// prejoin data at insert time //

I think it's a reasonable idea, but how would this be done? There are many possible joins. In fact, suppose that the number of tables in the database is %%n%%; then the number of table pairs that may need to be joined is the number of %%k=2%% subsets:

$${\binom {n}{2}} = {\frac {n!}{2!(n-2)!}} = {\frac {n^2-n}{2}} $$

For example, if the database has 15 tables, this number is 105, and if there are 42 tables, it is 861. Add the possibility that you need to do joins on different fields, and the number of pre-computed joins may be even higher. Still, it seems reasonable to do it at insert time, as the joins would change and need to be recomputed or modified on every insert.
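The two figures are easy to check with a few lines of Python (`table_pairs` is just a name picked for this check):

```python
from math import comb

def table_pairs(n):
    # Number of unordered pairs of distinct tables: n choose 2.
    return comb(n, 2)

pairs_15 = table_pairs(15)  # 105
pairs_42 = table_pairs(42)  # 861
```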



In my SQL database I introduced a statement called create join.

You tell the database ahead of time what fields are joinable fields.

create join inner join people on people.id = items.people inner join products on items.search = products.name

The database then checks every insert and does the associated consistent hashing and node placement.




chronological,


I should point out that some data may have to be moved once the create join statement has run.

Suppose a join exists between products.id and search.product_id. When a product is inserted, a query for matching searches would run: select id from search where product_id = X

search.product_id depends on products.id; they have the same value. The consistent hash for products can use id. The consistent hash for search can ignore its own id and use product_id. This would distribute this data to the same machine because the hashes are identical.

If there are multiple joins, this scheme might need to be more complicated. I think the fields can be concatenated and hashed.
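A rough sketch of that rule follows. The `PLACEMENT_COLUMN` registry, the cluster size and the sample rows are hypothetical, and a plain modulo stands in for a real consistent-hash ring to keep it short.

```python
import hashlib

NUM_NODES = 4  # hypothetical cluster size

# Hypothetical registry filled in when a join is declared:
# which column of each table supplies the placement key.
PLACEMENT_COLUMN = {"products": "id", "search": "product_id"}

def node_for(table, row):
    key = str(row[PLACEMENT_COLUMN[table]])
    digest = hashlib.sha256(key.encode()).digest()
    # Modulo stands in for a real consistent-hash ring to keep the sketch short.
    return int.from_bytes(digest[:8], "big") % NUM_NODES

product = {"id": 7, "name": "kettle"}
search = {"id": 9001, "product_id": 7, "terms": "electric kettle"}

product_node = node_for("products", product)
search_node = node_for("search", search)  # uses product_id, ignores search's own id
```

Both rows hash the value 7, so `product_node` and `search_node` are the same node regardless of the search row's own id.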
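Concatenating and hashing could look something like this. `composite_key_hash` is a made-up helper, and the length prefix is my own addition so that different field combinations don't collapse into the same string.

```python
import hashlib

def composite_key_hash(values):
    # Length-prefix each value so ("ab", "c") and ("a", "bc") hash differently.
    parts = []
    for v in values:
        s = str(v)
        parts.append(f"{len(s)}:{s}")
    joined = "|".join(parts)
    return int.from_bytes(hashlib.sha256(joined.encode()).digest()[:8], "big")

same = composite_key_hash([1, "kettle"]) == composite_key_hash([1, "kettle"])
ambiguous = composite_key_hash(["ab", "c"]) == composite_key_hash(["a", "bc"])
```

Note that rows only land together if both sides of the join can supply the same list of values, so this alone may not cover every multi-join case.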




chronological,


Providing a search engine for the internet is outrageously expensive because the data is so large that it doesn't fit in memory. We could use this approach to split up the storage requirements of doing search.

If everybody hosts a fraction of the search index, then every query goes to everybody before results are returned.

If there are 1000 storage nodes, then every search query produces 1000 subqueries, one to each storage node. Each node returns what it knows about.
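The fan-out can be sketched with a thread pool. The shards here are in-memory dicts standing in for remote storage nodes, and the `SHARDS` contents and the `subquery` / `search` names are invented for the example.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical shards: each storage node holds part of a term -> pages index.
SHARDS = [
    {"kettle": ["a.example/1"], "tea": ["a.example/2"]},
    {"kettle": ["b.example/9"]},
    {"coffee": ["c.example/4"]},
]

def subquery(shard, term):
    # Stand-in for a network call to one storage node.
    return shard.get(term, [])

def search(term):
    # Scatter: one subquery per node. Gather: merge whatever each node returns.
    with ThreadPoolExecutor(max_workers=len(SHARDS)) as pool:
        partials = pool.map(lambda shard: subquery(shard, term), SHARDS)
        return sorted(page for partial in partials for page in partial)
```

With three shards, `search("kettle")` sends three subqueries and merges the two non-empty answers. With 1000 nodes the shape is the same, only the fan-out is larger.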




chronological,