MarcoFalke src/txrequest.h:28 2020-09-22T15:30:39Z
```suggestion
 * - Whether it's from a "preferred" peer or not (outbound and noban peers are preferred).
```
style-nit: Would be good to mention the exact permission flag, because legacy whitelisted is discouraged, has been deprecated, and might be removed some time in the future. (Same below) -------------------------- ajtowns src/txrequest.h:30 2020-09-23T02:23:39Z `exptime` is slightly ambiguous with "expected time" to me, maybe replace with "timeout" or "expiry_time" or just "expiry" ? -------------------------- ajtowns src/txrequest.h:29 2020-09-23T02:30:44Z "(see further for details)" is true of all these points, I think. -------------------------- ajtowns src/txrequest.h:69 2020-09-23T02:33:34Z This was confusing -- I read it as "the first of many markers" rather than "a marker indicating it was first" and wondered what these "markers" were. Adding quotes around "first" consistently might be better? -------------------------- ajtowns src/txrequest.h:81 2020-09-23T02:35:50Z Incomplete sentence? -------------------------- ajtowns src/txrequest.h:95 2020-09-23T02:41:13Z Add a reference to https://allquantor.at/blockchainbib/pdf/miller2015topology.pdf to document/explain the invblock terminology? "What does this have to do with an INV message about a block?" -------------------------- ajtowns src/txrequest.h:92 2020-09-23T02:42:25Z "and **not** being influencable by attackers" -------------------------- ajtowns src/txrequest.h:156 2020-09-23T02:53:21Z "uint64_t" as a parameter name? -------------------------- ajtowns src/txrequest.h:122 2020-09-23T03:00:34Z Isn't this exactly the same performance as having a subclass and making the methods virtual, except with all the dispatching written out explicitly? ie, could instead write something like:
```c++
class TxRequestTracker {
protected:
    TxRequestTracker() { } // pure virtual class, must instantiate via subclass
public:
    virtual ~TxRequestTracker();
    virtual void DeletedPeer(uint64_t peer) = 0;
    ...
};

std::unique_ptr<TxRequestTracker> CreateTxRequestTracker(bool deterministic = false);

static std::unique_ptr<TxRequestTracker> g_txrequest = CreateTxRequestTracker() GUARDED_BY(cs_main);
```
then hide the subclass in txrequest.cpp? Of course, "exactly the same" means no objective reason to prefer changing to this. -------------------------- ajtowns src/txrequest.h:134 2020-09-23T03:05:34Z Are these needed for fuzz testing, or could they be deleted as well? (copy constructors are implicitly deleted because of the unique ptr, so I think currently this is just making the implicit defaults explicit) -------------------------- sipa src/txrequest.h:28 2020-09-23T03:17:11Z I'd like to keep txrequest mostly about the decision logic and data structure, while leaving net_processing responsible for the actual policy choices ("what delays are given to which peers/requests", "what timeouts are used", "when exactly is a peer considered preferred", "when exactly is a peer considered overloaded", "how many announcements can be tracked per peer"). Of course, the explanation here is ideally as comprehensive as possible and not full of "See some random comment in net_processing for the details" references. Still, do you think it would be reasonable to say instead: > Whether it's from a "preferred" peer or not (what that means is up to the caller, but this is designed to correspond mostly to outbound peers, or others that are more trusted) ?
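For context on the src/txrequest.h:122 thread above: the alternative ajtowns is comparing against is the pimpl-style forwarding the PR already uses. A minimal sketch of that shape, assuming member names that appear elsewhere in this thread (not the PR's exact code):
```c++
#include <cstdint>
#include <memory>

// Pimpl-style forwarding: the public class owns an opaque Impl and forwards every call to it,
// so the code to execute is determined at link time rather than through a vtable lookup.
class TxRequestTracker {
    class Impl;                        // defined only in txrequest.cpp
    std::unique_ptr<Impl> m_impl;
public:
    explicit TxRequestTracker(bool deterministic = false);
    ~TxRequestTracker();
    void DeletedPeer(uint64_t peer);   // body in txrequest.cpp just does m_impl->DeletedPeer(peer);
};
```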
-------------------------- ajtowns src/txrequest.h:173 2020-09-23T03:35:42Z Does anything limit the size of the returned vector? I think the main constraints are those in net_processing.cpp:RequestTx which could leave as many as 100k entries in CANDIDATE state for a given peer, so this could be a 4MB vector, which seems like it might be larger than desirable? It's also constrained by how many txs can be INVed by a peer inbetween calls to GetRequestable, so in normal circumstances I'd expect this to be perfectly fine. I think you could get the worst case by quickly sending two max-size INV messages and two max-size NOTFOUND messages once the first request comes in. (I think this has to return a copy of the data items, because you want to iterate over them and request them, which would then modify the data structure, and that would invalidate the iterator you were using if you hadn't made a copy) -------------------------- ajtowns src/txrequest.h:223 2020-09-23T03:44:42Z "Run a check on consistency of request times after a call to GetRequestable (requires the same timestamp as was passed to GetRequestable)" might be a better description? (At first glance I thought it might have been timing the sanity check, or doing a limited sanity check that checked different things depending on how much time it was taking to run) -------------------------- ajtowns src/net_processing.cpp:747 2020-09-23T04:18:11Z Maybe this should be `overloaded == ... || g_txrequest.CountCandidates() >= 5000` or similar? I'm not sure I have a plausible enough scenario where this would be a benefit to justify the added code though. -------------------------- ajtowns src/net_processing.cpp:866 2020-09-23T04:30:45Z I think if you were to make `RequestTx()` a method of `class PeerManager`, you could make `g_txrequest` be a private member `PeerManager::m_txrequest GUARDED_BY(cs_main)` instead of a global. OTOH, might be better to not do that until other globals that `PeerManager` methods rely on (like `mapNodeState`) are also moved. cc @jnewbery -------------------------- ajtowns test/functional/p2p_tx_download.py:129 2020-09-23T05:00:55Z Should replace this test with one that checks we start applying `OVERLOAD_PEER_TX_DELAY` ? -------------------------- sipa test/functional/p2p_tx_download.py:129 2020-09-23T19:05:24Z @ajtowns Feel like writing such a test? -------------------------- sipa src/txrequest.h:28 2020-09-24T00:10:18Z Done, approximately. -------------------------- sipa src/txrequest.h:30 2020-09-24T00:10:29Z Changed to "expiry". -------------------------- sipa src/txrequest.h:29 2020-09-24T00:10:57Z I was trying to refer to the specific section on the "first" marker. Made that more explicit. -------------------------- sipa src/txrequest.h:69 2020-09-24T00:11:05Z Done. -------------------------- sipa src/txrequest.h:81 2020-09-24T00:11:20Z Removed. It was a leftover of what turned into the last comment section. -------------------------- sipa src/txrequest.h:95 2020-09-24T00:11:26Z Done. -------------------------- sipa src/txrequest.h:92 2020-09-24T00:11:41Z Done, approximately. -------------------------- sipa src/txrequest.h:156 2020-09-24T00:12:19Z Apparently that's not actually a 128-bit type ;) Fixed. -------------------------- sipa src/txrequest.h:122 2020-09-24T00:18:55Z It's not **exactly** the same, I think. Calling member functions on a `TxRequestTracker` would incur a lookup in the vtable to find the code to execute. 
The current solution has link-time determined code flow, and only an extra indirection to find the object's storage. I don't think either is remotely relevant for performance here, and the subclass approach you suggest is probably somewhat lighter syntactically. I'm happy to change it if other reviewers agree. -------------------------- sipa src/txrequest.h:134 2020-09-24T00:19:27Z Gone. -------------------------- sipa src/txrequest.h:173 2020-09-24T00:21:04Z I've added a commit that just reduces MAX_PEER_TX_ANNOUNCEMENTS; 100000 was ridiculous. It can be bypassed using the PF_RELAY permission. -------------------------- sipa src/net_processing.cpp:747 2020-09-24T00:22:26Z With MAX_PEER_TX_ANNOUNCEMENTS reduced, is that still needed? I'm a bit hesitant about this, as the number of announcements tracked for a peer is somewhat dependent on other peers' behavior. -------------------------- sipa src/txrequest.h:223 2020-09-24T00:23:02Z Renamed to PostGetRequestableSanityCheck, and updated comment. It's not just a consistency check of request times. -------------------------- sipa src/net_processing.cpp:866 2020-09-24T00:23:31Z I haven't paid too much attention to that, but I'm happy to change this if desirable. -------------------------- ajtowns src/txrequest.h:122 2020-09-24T04:46:23Z Ah, good point! That seems a sufficient reason, so going to mark this as resolved. -------------------------- ajtowns src/net_processing.cpp:747 2020-09-24T05:01:46Z Yeah, sounds good. -------------------------- ajtowns src/txrequest.h:223 2020-09-24T05:09:51Z Not sure I follow, the only asserts are comparisons between `entry.m_time` and `now` ? Or should I have said "entry times" because "request" times might only mean REQUESTED entries? -------------------------- sipa src/txrequest.h:223 2020-09-24T05:16:04Z m_time is the reqtime for CANDIDATE_*, but expiry for REQUESTED entries. -------------------------- ajtowns src/net_processing.cpp:82 2020-09-24T05:49:32Z Not sure if it's worth adding any explicit justification for picking that number. But for the record, my thinking is: if our peers use the same `INVENTORY_BROADCAST_INTERVAL`, `INVENTORY_BROADCAST_MAX` params as us (and assuming they also halve the interval when they consider us an outbound, ie when we consider them an inbound), then 5000 is about 16 minutes worth of INVs for our outbounds, and 5 minutes for our inbounds. (More presuming we actually receive txs and thus clear entries out, or already had them) -------------------------- jnewbery src/net_processing.cpp:866 2020-09-24T11:36:53Z Yes, I think this would be an improvement for encapsulation and for being explicit about when the TxRequestManager is constructed/destructed. It's my understanding that global static objects can be constructed before main() or later. There isn't any requirement to wait for `mapNodeState` to move to `PeerManager`. I have a branch that moves TxRequestTracker and RequestTx() to be members of PeerManager here: https://github.com/jnewbery/bitcoin/tree/pr19988.1. The move is in a separate commit in that branch for ease of review, but should be squashed into _Change transaction request logic to use txrequest_. 
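A rough sketch of the shape of the PeerManager change jnewbery describes above, as a hypothetical simplification (the linked branch is authoritative; the names and annotations here are approximations):
```c++
#include <chrono>
#include <net.h>                      // NodeId
#include <primitives/transaction.h>   // GenTxid
#include <sync.h>                     // GUARDED_BY, EXCLUSIVE_LOCKS_REQUIRED
#include <txrequest.h>

class PeerManager {
public:
    //! Register a tx announcement with the tracker (previously the free function RequestTx()).
    void RequestTx(NodeId nodeid, const GenTxid& gtxid, std::chrono::microseconds current_time)
        EXCLUSIVE_LOCKS_REQUIRED(cs_main);
private:
    //! Formerly the global g_txrequest, now owned by PeerManager.
    TxRequestTracker m_txrequest GUARDED_BY(cs_main);
};
```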
-------------------------- jnewbery src/txrequest.cpp:676 2020-09-24T12:00:49Z Consider using the `default` syntax to indicate that you only included this in the cpp file so the `m_impl` unique ptr can be destructed:
```suggestion
TxRequestTracker::~TxRequestTracker() = default;
```
-------------------------- jnewbery src/txrequest.h:112 2020-09-24T13:38:39Z It looks like this is only in the header so that the unit and fuzz tests can call `GetPriorityComputer()` and get the computer, and the only reason they do that is so they can call `PriorityComputer::operator()()`. Could you move all this to the implementation file, and add a public method `ComputePriority` to be used by the tests that calculates the priority internally and returns it? -------------------------- jnewbery src/txrequest.cpp:673 2020-09-24T14:06:12Z Consider using an initializer list here instead of setting `m_impl` in the ctor function body. If you did that, you could make `m_impl` const, which is fine since you've deleted the move constructor and assignment. `const unique_ptr` communicates that _this_ `m_impl` is always the object that `TxRequestTracker` owns. See Herb Sutter's Leak-Freedom talk at https://youtu.be/JfmTagWcqoE?t=571
```diff
diff --git a/src/txrequest.cpp b/src/txrequest.cpp
index f6c387f8f4..b7d90a1ba5 100644
--- a/src/txrequest.cpp
+++ b/src/txrequest.cpp
@@ -668,10 +668,8 @@ public:
     }
 };
 
-TxRequestTracker::TxRequestTracker(bool deterministic)
-{
-    m_impl = MakeUnique<Impl>(deterministic);
-}
+TxRequestTracker::TxRequestTracker(bool deterministic) :
+    m_impl{MakeUnique<Impl>(deterministic)} {}
 
 TxRequestTracker::~TxRequestTracker() {}
 
diff --git a/src/txrequest.h b/src/txrequest.h
index 6e892eebe1..83f6dbe2fe 100644
--- a/src/txrequest.h
+++ b/src/txrequest.h
@@ -119,7 +119,7 @@ public:
 private:
     // Avoid littering this header file with implementation details.
     class Impl;
-    std::unique_ptr<Impl> m_impl;
+    const std::unique_ptr<Impl> m_impl;
 
 public:
     //! Construct a TxRequestTracker.
```
-------------------------- instagibbs src/txrequest.h:138 2020-09-24T16:55:02Z `Deleted` (past tense) is a bit weird since it's actually deleting the peer. At a risk of making it too verbose via self-documenting: `DeletePeerEntries`? Alternatively we can name it like some of the other functions e.g. `ReceivedInv` et al.: `DisconnectedPeer` describing when it's to be called and leaving the first sentence in the comment. -------------------------- instagibbs src/txrequest.h:143 2020-09-24T16:58:35Z
```suggestion
 * whatever reason we no longer care about it. Only the hash field of gtxid is used.
```
-------------------------- instagibbs src/txrequest.h:143 2020-09-24T17:01:38Z Maybe talk about what it's used for instead of what it's not, or just change the interface to make it clear. -------------------------- instagibbs src/txrequest.h:127 2020-09-24T17:08:07Z `COMPLETED` sounds more like `FAILED`. `COMPLETED` is also hit when the transaction is received properly, no? -------------------------- instagibbs src/txrequest.h:173 2020-09-24T17:20:59Z
```suggestion
    std::vector<GenTxid> ExpireRequestedAndGetRequestable(uint64_t peer, std::chrono::microseconds now);
```
to make it less reliant on comments for intent -------------------------- instagibbs src/txrequest.h:204 2020-09-24T17:26:07Z Wouldn't hurt to make this block not publicly callable somehow.
Just an idea in case it's not crazy: https://stackoverflow.com/a/23267346 -------------------------- instagibbs src/net_processing.cpp:1233 2020-09-24T17:32:00Z tangentially-related nit: Comments above function are getting pretty stale. -------------------------- sipa src/net_processing.cpp:866 2020-09-24T17:33:49Z Done. -------------------------- sipa src/txrequest.cpp:676 2020-09-24T17:34:03Z Of course! Done. -------------------------- sipa src/txrequest.h:112 2020-09-24T17:34:13Z Nice, done. -------------------------- sipa src/txrequest.cpp:673 2020-09-24T17:34:23Z Done. -------------------------- instagibbs src/net_processing.cpp:2882 2020-09-24T17:37:26Z The witness-ness of GenTxid doesn't matter. I find the interface distracting since it needs to be passed in. -------------------------- jnewbery src/txrequest.h:204 2020-09-24T17:53:13Z I'd prefer to move these into a friend `TxRequestTrackerTester` class (I thought I'd left a review comment suggesting that, but I appear to have lost it). -------------------------- sipa src/txrequest.h:138 2020-09-24T18:11:16Z Ah, my thinking behind the name was the caller informing TxRequestTracker "Hey I have deleted peer X, you may want to forget about them". DisconnectedPeer sounds better. -------------------------- sipa src/txrequest.h:127 2020-09-24T18:12:52Z I'm not sure what you're suggesting here. FAILED seems like a bad name as it indeed doesn't imply failure. -------------------------- sipa src/txrequest.h:173 2020-09-24T18:16:36Z I'm not entirely sure, as the caller really doesn't care about the fact that it also expires things. It just wants to know what should be requested. It's not an implementation detail, as the effect of expiration is observable (CountInFlight, CountTracked() and Size() go down), but I think that makes them just specification details rather than the caller's intent. -------------------------- sipa src/txrequest.h:204 2020-09-24T18:17:09Z @jnewbery How do you suggest to do that in combination with the impl pattern? -------------------------- instagibbs src/txrequest.h:127 2020-09-24T18:22:20Z apologies I wrote my comment in two frames of mind. I mean to say the comment doesn't mention "success" as a possibility to transition to this state. Update the comment and I'm happy. -------------------------- sipa src/net_processing.cpp:2882 2020-09-24T18:53:08Z My hope here was that making TxRequestTracker deal solely with GenTxids would simplify the interface. The caller just has these gtxids that it hands to TxRequestTracker, and gets gtxids back. The idea was that the knowledge about when the is_wtxid flag mattered in certain contexts could be restricted to TxRequestTracker. I don't think this idea really works. This optimization (in "Expedite removal") specifically breaks it: it's calling ForgetTx in a situation where we actually don't have the tx already, relying on the fact that if txid and wtxid differ, it must be a witness transaction, and thus we certainly don't need other transactions with the same witness. I think there are two solutions to it:
* Add to ForgetTx logic that if a txid is provided, it deletes both txid and wtxid announcements, but if a wtxid is provided only wtxid announcements are deleted. This would avoid some of the complexity here in net_processing, but technically break the amortized O(1) complexity claim - you could call ForgetTx(wtxid) repeatedly, and it would every time consume time proportional to the number of txid announcements with the same hash.
* Give up on the perfect abstraction, and just make ForgetTx take a txhash (and rename it to ForgetTxHash, I guess).
-------------------------- jnewbery src/txrequest.h:204 2020-09-24T21:15:15Z Hmmm yes, very good question! I wonder if something along the lines of https://stackoverflow.com/a/34054983/933705 might work (but then again, it might not be worth it just to avoid putting these three test helper functions in the public interface). -------------------------- sipa src/txrequest.h:138 2020-09-24T23:43:50Z Renamed to `DisconnectedPeer`. -------------------------- sipa src/txrequest.h:143 2020-09-24T23:44:13Z Changed the interface to take just a uint256. -------------------------- sipa src/txrequest.h:127 2020-09-24T23:44:24Z Done. Better? -------------------------- sipa src/txrequest.h:204 2020-09-24T23:46:29Z I think it's not unreasonable to have sanity checking functions for involved data structures in code that actually deals with that data structure. It could possibly be called at runtime using something similar to `-checkmempool` too if we wanted. -------------------------- sipa src/net_processing.cpp:1233 2020-09-24T23:46:38Z Updated. -------------------------- sipa src/net_processing.cpp:2882 2020-09-24T23:50:16Z Ok, I've looked into this more, and I believe the code looks a lot better with `ForgetTxHash` that just takes a uint256. This means the caller needs to understand the semantics for when calling with a txid/wtxid is fine, but that was the case already - just less obviously so. So I've switched the code to that. I've also removed the "if (txid != wtxid) ForgetTx(wtxid);" optimization. I think it's equivalent to just calling ForgetTxHash now with whatever is added to mempool/orphanpool/recentrejects, and the latter is far more obviously correct anyway (these things will be treated as AlreadyHaveTx() anyway, so deleting them early isn't going to change anything). -------------------------- instagibbs src/txrequest.h:127 2020-09-25T00:56:04Z yes, thank you -------------------------- ajtowns src/txrequest.h:90 2020-09-25T03:23:47Z Maybe "choose the first (non-overloaded) peer to have announced the transaction (managed by the "first" marker, see below)" would be a better description -- ie focus on the aim, rather than the mechanism? Might make sense to follow the "Rationale: in non-attack scenarios..." with "This is implemented via using a "first" marker, with the following rules:" rather than having those rules be in a separate section a bit lower. -------------------------- ajtowns src/txrequest.h:93 2020-09-25T03:26:09Z "avoiding influenceable" doesn't make sense? -------------------------- ajtowns src/txrequest.cpp:32 2020-09-25T06:11:25Z Since `Impl` is now only defined in the `cpp` file, I think you can reasonably move `State` and the like outside of the class, ideally sticking it in an anonymous namespace so the internal class methods get internal linkage. Main benefit would be that it's easier to see what the actual data being put in `Impl` is, I think. -------------------------- jnewbery src/txrequest.h:204 2020-09-25T06:17:04Z I agree. That seems reasonable. -------------------------- ajtowns src/txrequest.cpp:272 2020-09-25T06:22:40Z This isn't templated, so `typename` isn't needed, I think? Making an alias for the iterators in general might be useful: `template <typename Tag> using IndexIter = typename Index::index<Tag>::type::iterator;` and `void PromoteCandidateReady(IndexIter<ByTxHash> it)`?
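To make the iterator-alias suggestion above concrete, here is a self-contained toy illustration using boost::multi_index; the Entry fields and the tag are placeholders standing in for the PR's real types, not its actual code:
```c++
#include <boost/multi_index_container.hpp>
#include <boost/multi_index/ordered_index.hpp>
#include <boost/multi_index/member.hpp>
#include <cstdint>

struct Entry { uint64_t m_peer; uint64_t m_txhash; };  // stand-in for the real Entry type
struct ByTxHash {};                                    // index tag

using Index = boost::multi_index_container<Entry,
    boost::multi_index::indexed_by<
        boost::multi_index::ordered_non_unique<
            boost::multi_index::tag<ByTxHash>,
            boost::multi_index::member<Entry, uint64_t, &Entry::m_txhash>>>>;

// The alias from the comment above: one name for "iterator of the index with tag Tag".
template <typename Tag>
using IndexIter = typename Index::index<Tag>::type::iterator;

// Example signature using the alias instead of spelling out the full iterator type.
void PromoteCandidateReady(IndexIter<ByTxHash> it);
```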
-------------------------- ajtowns src/txrequest.cpp:667 2020-09-25T07:28:07Z The comments in txrequest.h are great. I feel like txrequest.cpp could use more though -- there's no real overview of how the optimised data structures are laid out, and what the deeper invariants are. Maybe it would make sense to use the `SanityCheck` functions as a way of documenting them? Many of the comments are already there and sufficient, but maybe make it the first function in the class, and make it a bit more readable? If you construct the `Counts` and `PeerInfo` maps in separate, static `GenSanityCheckFoo` functions, what's left seems pretty close to documentation:
```c++
// m_peerinfo can be exactly recalculated from the Index.
//
// This verifies the data in it, including the invariant
// that no entries with m_total_announcements==0 exist.
assert(m_peerinfo == GenSanityCheckPeerInfo(m_index));

for (auto& el : GenSanityCheckCounts(m_index, m_computer)) {
    const uint256& hash = el.first;
    Counts& c = el.second;

    // Cannot have only COMPLETED peers (txid should have been deleted)
    assert(c.m_candidate_delayed + c.m_candidate_ready + c.m_candidate_best + c.m_requested > 0);
```
-------------------------- ajtowns src/txrequest.cpp:602 2020-09-25T07:36:52Z `for (const Entry& entry : m_index) { Counts& counts = table[entry.m_txhash];` ? -------------------------- ajtowns src/txrequest.cpp:54 2020-09-25T08:09:20Z `= (1 << 1)` ? Maybe I'm dumb but it took me a bit to realise these are getting or'ed together. Seems slightly weird to have `m_per_txhash : 2` instead of `m_last_preferred : 1` and `m_last_nonpreferred : 1` ("If you ain't first, you're last" - Ricky Bobby), but I guess it simplifies the code that updates these values enough to make sense. -------------------------- ajtowns src/txrequest.cpp:166 2020-09-25T08:38:42Z Might make sense to drop this method and move the logic directly into `EntryTxHashExtractor`. If done that way, then can move the `PriorityComputer` definition after `Entry` and move the `Entry::ComputePriority` logic into a `PriorityComputer::ComputePriority(const Entry&) const` method (changing the `entry.ComputePriority(computer)` calls to `computer.ComputePriority(entry)` calls). That pretty much lets the definition of `Entry` be the first thing in the file which seems a lot more logical. -------------------------- ajtowns src/txrequest.cpp:28 2020-09-25T09:40:09Z `(txid,peer)` -------------------------- sipa src/net_processing.cpp:82 2020-09-25T22:56:42Z I added some rationale. -------------------------- sipa src/txrequest.h:90 2020-09-25T22:57:46Z I've merged the "first" marker section into this bulletpoint, and incorporated your suggested text. -------------------------- sipa src/txrequest.h:93 2020-09-25T22:58:20Z It are English perfect. -------------------------- sipa src/txrequest.cpp:32 2020-09-25T22:58:53Z I like that. Moved all the type definitions out of `::Impl`. -------------------------- sipa src/txrequest.cpp:272 2020-09-25T22:59:40Z I'm pretty sure it was needed at some point. I've added an Iter type alias as suggested (outside of `Impl`, yay). -------------------------- sipa src/txrequest.cpp:667 2020-09-25T22:59:59Z I haven't addressed this yet. -------------------------- sipa src/txrequest.cpp:602 2020-09-25T23:00:22Z I've changed it to `c` for Counts objects and `e` for Entry objects. -------------------------- sipa src/txrequest.cpp:54 2020-09-25T23:00:56Z At some point I had more flags than 2.
I've changed it to two bools now (this makes the SanityCheck code for it a lot more readable in particular). -------------------------- sipa src/txrequest.cpp:166 2020-09-25T23:01:51Z Yep, done. To match the "functor" concept I'm overloading `operator()` instead of having a `ComputePriority` function. -------------------------- sipa src/txrequest.cpp:28 2020-09-25T23:02:08Z `(txhash,peer)` actually, fixed in many places. -------------------------- ajtowns src/txrequest.h:93 2020-09-26T04:55:48Z They is now anyhoo -------------------------- ajtowns src/txrequest.cpp:602 2020-09-26T04:58:34Z Similar change would be good for the `peerinfo` reconstruction; it has `auto& entry =` for a `PeerInfo`. -------------------------- ajtowns src/net_processing.cpp:82 2020-09-26T05:00:26Z Rationale looks fine to me :+1: -------------------------- ajtowns src/txrequest.h:90 2020-09-26T05:17:58Z "The "first" marker is given to announcements for which at the time they are received:" -> "given to announcements at the time they are received, provided:" might be clearer? "(within the class of preferred or non-preferred announcements)" -- everywhere else refers to preferred peers, not announcements. Might make sense to explicitly clarify that a peer may switch its preferred status at any time, but that preferred status is sticky to announcements so if a peer is now non-preferred earlier announcements when it was preferred will still be prioritised, and vice-versa? I think that scenario is exercised by the fuzzer but not by net_processing? -------------------------- sipa src/txrequest.h:90 2020-09-27T07:41:43Z Done, and addressed the "preferredness" being changeable in the first section. You're right that fuzzer tests this, but isn't reachable through the current net_processing layer. -------------------------- sipa test/functional/p2p_tx_download.py:129 2020-09-27T18:24:22Z Done. -------------------------- ajtowns src/net_processing.cpp:751 2020-09-28T00:51:36Z > I think using `.CountInFlight()` isn't actually the right way to determine being overloaded. If a peer dumps 5000 INVs on us at once, they'll all be inserted at a time when `inflight==0`, and thus all be eligible to get the "first" marker, and won't get any delay penalty. I'm not sure the original behaviour didn't make more sense? If you have a peer that drops 5000 INVs at once, then gets requests for all of them 2s later, and replies to all those requests before sending any further INVs, is there any reason to consider that peer to be overloaded? I think we're not worrying about attackers getting the "first" flag -- that's addressed by limiting the impact to one attacker per txid -- but even if we were, I don't think this makes sense as an attack? You're blocking 5000 txs for a single 60 second period, but your other peers are only going to announce ~1000 txs each over that period, and you have to successfully race them to be the first to announce the tx for it to matter. But if you're able to ensure you announce first, then all you need to do is announce first by 2-6 seconds, and you delay a request from honest peers by 54-58 seconds rather than 60 seconds? OTOH, I don't think delaying a huge batch of 4900 txs by (an additional) 2s is much of a problem either. Either way, it may make sense for `noban` or `relay` permission to override considering a peer to be `overloaded`? 
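For orientation in the overload discussion above, a rough sketch of the kind of policy code in net_processing being debated; the constant names and the `ReceivedInv` argument list are taken from elsewhere in this thread, while the function shape and `NONPREF_PEER_TX_DELAY` are assumptions rather than the PR's actual code:
```c++
// Hypothetical simplification: net_processing decides preferredness, overload and delays,
// and the tracker only stores the resulting reqtime (plus, at this stage of the PR, an
// "overloaded" flag that suppresses the "first" marker).
void RequestTx(NodeId nodeid, const GenTxid& gtxid, bool preferred, bool relay_permission,
               std::chrono::microseconds current_time)
{
    std::chrono::microseconds delay{0};
    if (!preferred) delay += NONPREF_PEER_TX_DELAY;                    // penalty for non-preferred peers
    if (!gtxid.m_is_wtxid && g_wtxid_relay_peers > 0) delay += TXID_RELAY_DELAY;
    const bool overloaded = !relay_permission &&
        g_txrequest.CountInFlight(nodeid) >= MAX_PEER_TX_IN_FLIGHT;    // the in-flight check debated above
    if (overloaded) delay += OVERLOADED_PEER_TX_DELAY;
    g_txrequest.ReceivedInv(nodeid, gtxid, preferred, overloaded, current_time + delay);
}
```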
-------------------------- ajtowns src/net_processing.cpp:779 2020-09-28T01:20:33Z I wonder if it wouldn't make more sense to move the `overloaded` judgement entirely into `ReceivedInv` and instead pass the peer's `MAX_PEER_TX_LOAD` in and have `OVERLOADED_PEER_TX_DELAY` passed in when constructing `m_txrequest`. So `ReceivedInv` would then do `if (CountLoad(peer) > max_load) { first = false; reqtime += m_overloaded_delay; }`. The reason I say that is I think the "first" marker logic doesn't really make sense if you don't bump the reqtime for overloaded peers -- you'd end up with the overloaded peer being selected while the first non-overloaded peer was still waiting for its reqtime to arrive. The `!preferred` delay could also be handled the same way. I guess you'd need to pass in `current_time` and `peer_delay = TXID_RELAY_DELAY` (when appropriate) (so to speak) for that to work. Again, there needs to be a positive delay for `!preferred` peers or the logic won't work right, and while it can vary amongst announcements in limited ways without breaking things, I think it would be hard to do that correctly, so being set at the txrequest level wouldn't be a big problem. That might also have the advantage of being able to pass in `delay == 0` rather than `reqtime == microseconds::min()` which seems a little bit more obvious. -------------------------- sipa src/net_processing.cpp:751 2020-09-28T02:44:36Z The reason I don't like using just in-flight is that it has this strong dependency on local ordering of events. If you receive 100 INVs, then do a round of `SendMessages`, and request them, followed by 4900 of them, they get punished (no `first`, penalty `OVERLOADED_PEER_TX_DELAY`). If you receive 100 INVs, then 4900 INVs, and then do a round of `SendMessages`, they don't get punished. > I think we're not worrying about attackers getting the "first" flag Only minimally, but I think we are. There are plenty of spy peers that do weird things, and if they announce abundantly, they can interfere somewhat with ordering of dependent transactions (after the `first` request, they go out randomly, so parents and children may go to different peers, and the parent may arrive first). > OTOH, I don't think delaying a huge batch of 4900 txs by (an additional) 2s is much of a problem either. Judging by actual numbers, having 100 candidates simultaneously for any non-obviously-broken peer basically doesn't happen. > Either way, it may make sense for noban or relay permission to override considering a peer to be overloaded? Yes, that may be a good idea. -------------------------- sipa src/net_processing.cpp:779 2020-09-28T02:47:43Z I did this at first, but I like the current approach a lot more. It leaves the actual policy decisions inside net_processing, without needing configuration flags etc. in the txrequest interface. Not a very strong opinion, but I like the fact that txrequest is mostly the data structure, and its responsibility is consistency and efficiency - not policy parameters. There are currently a lot of future improvements that can be made without changing txrequest - and that, due to the nature of its current interface, are already covered by fuzz testing. -------------------------- ajtowns src/net_processing.cpp:751 2020-09-28T05:14:09Z > Only minimally, but I think we are.
> There are plenty of spy peers that do weird things, and if they announce abundantly, they can interfere somewhat with ordering of dependent transactions (after the first request, they go out randomly, so parents and children may go to different peers, and the parent may arrive first). > Judging by actual numbers, having 100 candidates simultaneously for any non-obviously-broken peer basically doesn't happen. Hmm. With the new change you could kind-of make that happen:
* find a victim who is an inbound connection
* make up 100 transactions
* INV them to your victim, but don't respond to their GETDATA
* send the 100 transactions to everyone else
* everyone else will announce to victim
* all the victim's peers have 100+ CANDIDATE_* announcements, making all the victim's peers appear overloaded for the next real transaction that comes along

(That said, I'm still not really seeing how being falsely marked as "overloaded" is a problem) > If you receive 100 INVs, then do a round of `SendMessages`, and request them, followed by 4900 of them, they get punished ... If you receive 100 INVs, then 4900 INVs, and then do a round of `SendMessages`, they don't get punished. Yeah. I'm looking at that as: "If you request 100 INVs, and then receive 4900 more INVs before receiving a response to your request, then the peer is overloaded" and "If all your requests have been responded to and you receive 4900 more INVs, then you don't have a reason to think the peer is overloaded". ie, "overloaded" == "am I getting responses, or am I just getting more and more announcements?" I think that means the peer is able to prioritise "respond to GETDATA" above "forward on more INVs" and thus ensure they're never overloaded, even if they're announcing txs in massive bursts, no matter what anyone else is doing (except when they do two large INVs in less than the RTT maybe). -------------------------- ajtowns src/net_processing.cpp:779 2020-09-28T08:46:53Z > There are currently a lot of future improvements that can be made without changing txrequest - and that, due to the nature of its current interface, are already covered by fuzz testing. Kind-of? I would have said the fuzz testing only really tests the optimisations work; it doesn't really check that the basic implementation makes "sense"? eg, we could check the "don't request dependent transactions out of order" goal by adding a constraint that all peers announce tx `i` prior to tx `j`, and then checking that tx `i` is requested before tx `j`, but I think that would currently fail in the fuzzer due to random delays (and also fail in net_processing in rare cases, eg where a tx is only announced by one node, `j` is announced less than 2 seconds after `i` and the node becomes non-overloaded in that same period, or where all the non-wtxid-relay peers disappear in between). I'm trying out some logging to see if the "first" marker actually does much good in practice. We're calling `GetRequestable`/`RequestedTx` ten times a second for every peer (provided various queues don't fill up anyway), so I think you'd have to be pretty unlucky for it to even matter.
```diff
--- a/src/txrequest.cpp
+++ b/src/txrequest.cpp
@@ -528,6 +528,29 @@ public:
         // in between, which preserve the state of other GenTxids).
         assert(it != m_index.get<ByPeer>().end());
         assert(it->GetState() == State::CANDIDATE_BEST);
+        {
+            auto bytxhashit = m_index.project<ByTxHash>(it);
+            auto bytxhashit_pre = bytxhashit;
+            int delayed = 0;
+            int candidates = 0;
+            while (bytxhashit_pre != m_index.get<ByTxHash>().begin()) {
+                --bytxhashit_pre;
+                if (bytxhashit_pre->GetState() != State::CANDIDATE_DELAYED || bytxhashit->m_txhash != it->m_txhash) break;
+                ++delayed;
+            }
+            ++bytxhashit;
+            while (bytxhashit != m_index.get<ByTxHash>().end() && bytxhashit->GetState() != State::COMPLETED && bytxhashit->m_txhash == it->m_txhash) {
+                ++candidates;
+                ++bytxhashit;
+            }
+            int completed = 0;
+            while (bytxhashit != m_index.get<ByTxHash>().end() && bytxhashit->GetState() == State::COMPLETED && bytxhashit->m_txhash == it->m_txhash) {
+                ++completed;
+                ++bytxhashit;
+            }
+            assert(bytxhashit == m_index.get<ByTxHash>().end() || bytxhashit->m_txhash != it->m_txhash);
+            LogPrintf("ABCD requested txid=%s preferred=%d first=%d delayed=%d candidates=%d completed=%d peer=%d\n", it->m_txhash.ToString(), it->m_preferred, it->m_first, delayed, candidates, completed, it->m_peer);
+        }
         Modify(it, [expiry](Entry& entry) {
             entry.SetState(State::REQUESTED);
             entry.m_time = expiry;
```
After running it for a few minutes with ~20 peers, only 0.17% of requests actually had other announcements in CANDIDATE_READY, so for 99.82% of txs the first marker didn't make any difference. About 1.6% of requests weren't for a "first" marked announcement (a tor outbound managed to overload itself I think). Will let it go 'til it fills up the disk with logs or whatever, and see what it looks like then. -------------------------- sipa src/net_processing.cpp:751 2020-09-28T22:08:16Z That's a pretty convincing argument. If we assume that getting the "overloaded" treatment is relevant at all, it seems worse that an attacker can cause it to occur in "third party" connections than possibly missing out on getting that effect himself. I also don't see how a large at-once dump can be at all exploited. It means an attacker needs an excessive amount of transactions to begin with. If those are bogus/invalid or self-generated ones, the `first` marker has no effect as only attacker nodes will have it. If they're valid under-attack transactions, it means there is something seriously wrong already that such a batch of transactions is being propagated (remember the max tx rate of INVs...), and the `first` marker isn't going to change much. I'm going to revert this change. -------------------------- sipa src/net_processing.cpp:751 2020-09-29T01:02:24Z Done. -------------------------- sipa src/net_processing.cpp:779 2020-09-29T01:18:17Z > Kind-of? I would have said the fuzz testing only really tests the optimisations work; it doesn't really check that the basic implementation makes "sense"? Yeah, kind of. The fuzz test only verifies consistency of the optimized data structure, and observational equivalence with the naive reimplementation. That naive reimplementation is hopefully close enough to the written specification in txrequest.h that we can say the test actually tests correctness according to the specification. It however does not test whether that specification actually results in desirable high-level behavior. That's what the scenario unit tests are for, but they're not nearly as detailed.
Still, as far as the specification goes, by keeping the policy out of txrequest and it tests, there is evidence that the data structure correctly implements it, and this doesn't depend on coincidences like "non-preferred requests always have a reqtime in the future" that are currently true, but maybe won't always be true. The interface doesn't force such relations, so the fuzz tests should be covering both situations where they do and don't hold. Furthermore, I'm not sure there is much of a simplification in the interface that can be made by moving more into txrequest? * Right now: `ReceivedInv(peer, gtxid, preferred, overloaded, reqtime)`. * Potentially: `ReceivedInv(peer, gtxid, preferred, have_wtxid_peers, current_time)`. * With the addition of bypassing `overloaded` under PF_RELAY you'd need to pass an additional argument `pf_relay_peer` argument even. The fact that this last change could be made by just changing the parameters to `ReceivedInv` and not changing the interface - or the fuzz tests - is argument in favor of it, IMO. Several potential changes like making delays dynamic in function of load/inflight, or adding randomness to it, are equally possible without impacting txrequest. -------------------------- sipa src/net_processing.cpp:751 2020-09-29T01:29:37Z > Either way, it may make sense for noban or relay permission to override considering a peer to be overloaded? Also done. -------------------------- ajtowns src/net_processing.cpp:779 2020-09-29T03:44:42Z Yeah, agree with what you wrote. (Well, I was thinking you'd call `ReceivedInv(peer, gtxid, preferred, max_load, current_time)` and pass `max_load=max()` to bypass overloaded, fwiw) I'm more thinking of the API in terms of what the caller has to do to ensure the txrequest modules behaves "sensibly" -- avoids requesting txs out of order, avoiding double requesting a tx, etc. If there's a way to make txrequest behave "sensibly" no matter how you call the API, that might be better than just documenting the expectations, or the consequences of violating the expectations. But I don't fully understand what sensibly means or what restrictions that implies yet. -------------------------- sipa src/net_processing.cpp:779 2020-09-29T04:06:30Z I see what you mean now. There is no strict reason why we couldn't have both. There could a class around TxRequestTracker that adds the policy and sanity checking of inputs, which just forwards to the inner TxRequestTracker. That would let the inner part still be tested for consistency, while permitting testing of the higher level construct for higher level "goal" properties. As the logic in that outer layer would just be (part of) what is now in TxRequest(), I don't think this is worth it - but maybe at some point it is. -------------------------- ajtowns src/txrequest.h:74 2020-09-29T04:27:58Z Okay, so I've run some stats on this now. Results (for me) after just under 230k calls to `RequestedTx` are: * in 77.9% of cases, it's requesting the tx from a preferred peer, with no alternative (non-delayed) candidates * in 18.2% of cases, it's requesting the tx from a non-preferred peer, with no alternative (n-d) candidates * in 1.3% of cases, it's requesting the tx from a preferred peer, with the only (n-d) alternative candidates being non-preferred peers * in 1.1% of cases, it's requesting the tx from a preferred peer using the first marker, and other preferred candidates are available (!) 
* in 1.0% of cases, it's requesting the tx from a non-preferred peer using the first marker, with other non-preferred candidates available (and no preferred candidates available) (!) * in 0.2% of cases, it's requesting the tx from a non-preferred peer using random selection * in 0.02% of cases, it's requesting the tx from a non-preferred peer using random selection (The remaining ~0.3% I didn't sort through) This implies that the first marker is only relevant in ~2.1% of cases -- the rest of the time it's determined by the first peer's reqtime being reached at a call to `GetRequestable()` prior to any other announcement for that tx. That occurs because (unless I've missed something) we're calling `GetRequestable()` for each peer about ten times per second (100ms delay in `CConnman::ThreadMessageHandler` between calls to `SendMessages(pnode)`), which means the "first" marker only matters for preferred nodes when the first two preferred nodes to announce a tx manage to both set their reqtimes in the same ~100ms period and thus get activated in the same call to `GetRequestable`. I think what the "first" marker is actually doing is slightly different to what I was kind-of expecting. ie, "reqtime" and "preferredness" are almost always what's used to actually choose which announcement to request first, and the "first" marker is a backup. It's only relevant when (a) txrequest is used less frequently than every 100ms via some other implementation, (b) the 100ms delay is extended, eg due to cs_main being held while adding a new block or something, or (c) there's a race and we want a consistent tie-breaker. I haven't instrumented it, but I'm not convinced this does much for dependent transactions. Assuming B spends an output of A, and peers P and Q both announce A prior to B, you could still see announcements as `{Q:A, P:A, P:B}` then call `GetRequestable(P)` and have A assigned to Q and B assigned to P, and requested immediately, with A requested from Q moments later. But if that's not a big problem, what's the advantage in locking A to Q via the "first" marker? As it stands, I want to argue that the "first" marker is an unnecessary complication, and maybe we should remove it. But I'm not really convinced -- for one, that claim relies on calling `GetRequestable` very frequently, and maybe we want the flexibility to not do that? But if we called it for a given peer once a second instead of 10x a second, that adds another 0.5s delay on average to requesting a tx, which is pretty significant compared to the existing delays, so maybe relying on calling `GetRequestable` frequently is perfectly reasonable? -------------------------- ajtowns src/net_processing.cpp:779 2020-09-29T04:33:50Z Yeah. I don't think a wrapper would really be an improvement on documenting the expectations, and agree that sticking fixed params in txrequest is really too restrictive. Going to mark this thread as resolved, but still continue thinking about it. -------------------------- ajtowns src/net_processing.cpp:753 2020-09-29T05:12:04Z Not 100% on topic for this PR I guess, but `if (!gtxid.m_is_wtxid && g_wtxid_relay_peers > 0)` perhaps? That way we delay requesting a parent by txid when handling orphans in case we might have already been requesting the parent by wtxid from some peer (even this one if they were announced/requested/replied out of order for some reason).
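As a reference point for the "`GetRequestable` roughly every 100ms" observation above, a hypothetical sketch of the per-peer request loop in `SendMessages`; function and constant names are approximations based on this thread, not the PR's exact code:
```c++
// Roughly every 100ms per peer: ask the tracker what is requestable now, turn those into
// GETDATA entries, and record each request with an expiry; anything we already have can
// be forgotten outright.
const auto current_time = GetTime<std::chrono::microseconds>();
for (const GenTxid& gtxid : g_txrequest.GetRequestable(pto->GetId(), current_time)) {
    if (AlreadyHaveTx(gtxid, mempool)) {
        g_txrequest.ForgetTxHash(gtxid.GetHash());
        continue;
    }
    vGetData.emplace_back(gtxid.m_is_wtxid ? MSG_WTX : MSG_TX, gtxid.GetHash());
    g_txrequest.RequestedTx(pto->GetId(), gtxid.GetHash(), current_time + GETDATA_TX_INTERVAL);
}
```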
-------------------------- sipa src/txrequest.h:74 2020-09-29T06:44:38Z That's great data, and perhaps not very surprising: we'd hope to pick good peers as outbounds, so most announcements will come in from preferred peers (and even more within the first 2s after the first announcement from inbound connections). As we fetch immediately from preferred connections in general, the "first" marker shouldn't do much. Given that information, perhaps I wouldn't have added this to the first iteration of this logic. Now that it's implemented, I could argue that it may improve robustness in some cases: * During the rollout of BIP339, we can expect to see many more transactions fetched from delayed (but possibly preferred) peers, as just one wtxid peer will cause a delay on all announcements from other peers. It's hard to predict/measure what the impact of this will be. * I can imagine we'd add (possibly random) delays even from fetches from preferred peers, for privacy reasons. * Very well connected nodes (with way more inbound than outbound nodes) may see a larger percentage fetched from non-preferred connections. -------------------------- ajtowns src/txrequest.h:74 2020-09-29T07:54:15Z I think all those cases would just change the distribution between the cases where the "first" marker doesn't matter -- ie they're entirely determined by reqtime and preferredness. I do think it might improve robustness *somewhere* though... Also, if, in the future, we changed the code to either update current_time less, or to target specific reqtimes ("anything between t=0.0 and t=0.499 will be requested at t=1.234"), it might start mattering > During the rollout of BIP339, we can expect to see many more transactions fetched from delayed (but possibly preferred) peers, as just one wtxid peer will cause a delay on all announcements from other peers. That's a good point! I've restarted my node since the above numbers, and don't have any logging as to whether any of them were wtxidrelay. However post-restart, I have two inbound wtxidrelay peers, so should already be suffering from BIP339 rollout. And I guess that means all my preferred nodes should be degraded to the same priority as the non-preferred wtxid nodes, which in turn probably explains why I got such a high percentage of non-preferred announcements as the very first request for a txid. Yeah, collecting `RequestedTx` calls by peerid and whether they were preferred or not gives me: * 1049 preferred=0 peer=2717 * 1500 preferred=0 peer=3993 * 3255 preferred=0 peer=3815 * 3270 preferred=1 peer=13 * 14216 preferred=1 peer=15 * 14651 preferred=1 peer=14 * 16918 preferred=1 peer=11 * 18635 preferred=1 peer=2 * 30675 preferred=0 peer=1163 * 37776 preferred=1 peer=1 * 39962 preferred=1 peer=16 * 52112 preferred=1 peer=0 (ignoring peers with under 1k requests). Which looks to me like the top 9 peers include the 8 outbounds I connected to initially, and a random inbound wtxidrelay peer which connected about 3h after my peer was restarted. That single node accounts for about 2/3rds of the "18.2% of cases" by the looks. So I might desable wtxidrelay, and restart my node again, I think, which ironically seems like the easiest way to simulate an all-wtxid-relay world... -------------------------- sipa src/net_processing.cpp:753 2020-09-29T19:43:27Z Good point; I think was overlooked in #19569. Fixed in a separate commit. -------------------------- sipa src/txrequest.cpp:667 2020-09-29T23:15:03Z Done. Also included some more comments from your branch linked above. 
-------------------------- ajtowns src/txrequest.h:74 2020-09-29T23:53:08Z Ok, updated stats based on 257k `RequestedTx` calls with wtxidrelay disabled (added a `false &&` to the if's in net_processing). I'm seeing 98.04% of requests being trivial cases -- there's no alternatives that have hit reqtime, and this is the first request of a tx. Despite having lots of (up to ~50?) inbounds, it's still choosing outbounds for almost all txs. Preferredness only mattered in 0.67% of cases, and only in 0.59% of cases did it matter for the first request. "First" marker only mattered in 0.77% of cases, and only ever for choosing between two non-preferred peers; though it was at least (almost) always for the first actual request for the txid. FWIW, ignoring the first hour, I got 520 "accepted orphan tx" messages. I'm running this node with mempool expiry set to 24h so that might be a bit high.
```
86.59% 222556 ABCD requested preferred=1 first=1 candidates=[-,-] completed=[-,-]
11.45%  29429 ABCD requested preferred=0 first=1 candidates=[-,-] completed=[-,-]
 0.18%    455 ABCD requested preferred=1 first=0 candidates=[-,-] completed=[Y,-]
 0.09%    222 ABCD requested preferred=0 first=1 candidates=[-,-] completed=[Y,-]
 0.08%    201 ABCD requested preferred=0 first=0 candidates=[-,-] completed=[-,-]
 0.05%    127 ABCD requested preferred=0 first=0 candidates=[-,-] completed=[Y,Y]
 0.04%    111 ABCD requested preferred=1 first=0 candidates=[-,-] completed=[Y,Y]
 0.03%     79 ABCD requested preferred=0 first=0 candidates=[-,-] completed=[Y,-]
 0.01%     17 ABCD requested preferred=1 first=0 candidates=[-,-] completed=[-,Y]
 0.01%     15 ABCD requested preferred=0 first=0 candidates=[-,-] completed=[-,Y]
 0.59%   1524 ABCD requested preferred=1 first=1 candidates=[-,Y] completed=[-,-]
 0.02%     64 ABCD requested preferred=1 first=0 candidates=[Y,Y] completed=[-,Y]
 0.02%     51 ABCD requested preferred=1 first=0 candidates=[Y,Y] completed=[Y,Y]
 0.02%     50 ABCD requested preferred=1 first=0 candidates=[-,Y] completed=[Y,-]
 0.01%     38 ABCD requested preferred=1 first=0 candidates=[-,Y] completed=[Y,Y]
 0.01%     32 ABCD requested preferred=1 first=0 candidates=[Y,-] completed=[Y,-]
 0.01%     24 ABCD requested preferred=1 first=0 candidates=[Y,-] completed=[Y,Y]
 0.00%     10 ABCD requested preferred=1 first=0 candidates=[Y,Y] completed=[Y,-]
 0.00%      1 ABCD requested preferred=1 first=0 candidates=[Y,-] completed=[-,Y]
 0.00%      1 ABCD requested preferred=1 first=0 candidates=[-,Y] completed=[-,Y]
 0.77%   1978 ABCD requested preferred=0 first=1 candidates=[-,Y] completed=[-,-]
 0.01%     20 ABCD requested preferred=0 first=0 candidates=[-,Y] completed=[Y,Y]
 0.00%      8 ABCD requested preferred=0 first=1 candidates=[-,Y] completed=[Y,-]
 0.00%      2 ABCD requested preferred=0 first=0 candidates=[-,Y] completed=[Y,-]
 0.00%      1 ABCD requested preferred=0 first=0 candidates=[-,Y] completed=[-,Y]
```
Just considering the first request for a tx (ie, `candidates=[-,-] completed=[-,-]` cases) the number of announcements for the tx in CANDIDATE_DELAYED was distributed something like:
```
48839 ABCD requested preferred=1 delayed=0
37457 ABCD requested preferred=1 delayed=1
32913 ABCD requested preferred=1 delayed=2
27587 ABCD requested preferred=1 delayed=3
22798 ABCD requested preferred=1 delayed=4
17525 ABCD requested preferred=1 delayed=5
12564 ABCD requested preferred=1 delayed=6
 8071 ABCD requested preferred=1 delayed=7
 5148 ABCD requested preferred=1 delayed=8
 2965 ABCD requested preferred=1 delayed=9
 1822 ABCD requested preferred=1 delayed=10
 4437 ABCD requested preferred=0 delayed=0
 2025 ABCD requested preferred=0 delayed=5
 1981 ABCD requested preferred=0 delayed=6
 1780 ABCD requested preferred=0 delayed=4
```
which looks pretty reasonable, I think. (I didn't break the delayed count into preferred/non-preferred, so the delayed= figure should include both inbounds and any overloaded outbounds) I wasn't quite as lucky with my outbounds staying around the entire time by the looks, but got a much more even looking distribution of requests between the different preferences:
```
 1048 ABCD requested preferred=0 peer=2013
 1294 ABCD requested preferred=0 peer=4076
 1359 ABCD requested preferred=0 peer=4096
 1725 ABCD requested preferred=0 peer=235
 2193 ABCD requested preferred=0 peer=3370
 2630 ABCD requested preferred=0 peer=1950
 2694 ABCD requested preferred=0 peer=2304
 2843 ABCD requested preferred=0 peer=2095
 3618 ABCD requested preferred=0 peer=1853
    9 ABCD requested preferred=1 peer=6529
   21 ABCD requested preferred=1 peer=5853
   25 ABCD requested preferred=1 peer=1737
   39 ABCD requested preferred=1 peer=1624
   44 ABCD requested preferred=1 peer=30
   57 ABCD requested preferred=1 peer=644
  853 ABCD requested preferred=1 peer=22
 2312 ABCD requested preferred=1 peer=11
 5470 ABCD requested preferred=1 peer=202
13974 ABCD requested preferred=1 peer=0
14062 ABCD requested preferred=1 peer=2238
14427 ABCD requested preferred=1 peer=14
16520 ABCD requested preferred=1 peer=16
23714 ABCD requested preferred=1 peer=3385
37178 ABCD requested preferred=1 peer=24
46702 ABCD requested preferred=1 peer=29
49527 ABCD requested preferred=1 peer=44
```
(ignoring non-preferred peers that didn't have >1000 requests) EDIT: I'm guessing the lack of "used the first marker to tie-break between two preferred peers" cases means that none of my preferred peers were overloaded for any length of time, and always had `reqtime=min()`, and thus the `ThreadMessageHandler` loop would see the INV and immediately request it in a single iteration with no chance to consider any other peer, no matter how simultaneous the announcements were. The previous 1.1% figure was due to the wtxidrelay delay causing all my preferred peers to have a 2s delay, so that if the outbounds were within 100ms there'd be a tie-breaker, ie they'd get added to the index, then both their reqtimes would pass while the thread was sleeping, and both would progress to READY/BEST in a single GetRequestable call.
Namely in the case where a non-wtxid-relay peer announces txid X, then a wtxid-relay-node announces wtxid X, ie X has no witness -- it'd bump the reqtime of the original peer's announcement at that point, rather than making it wait for the full TXID_RELAY_DELAY) Even if that's a good idea though, I'm not really convinced it makes sense to do it now, rather than later, with code to actually randomize times and thus make it relevant, and we can see if it covers all the cases that actually make sense. eg, would probably want to bump the reqtime of the "first" preferred announcement upon receipt of a non-preferred announcement with earlier reqtime, but probably not the reverse... -------------------------- jnewbery src/txrequest.h:29 2020-09-30T13:42:59Z Unmatched `)` parse error -------------------------- jnewbery src/txrequest.h:50 2020-09-30T14:32:45Z Maybe personal preference, but I *really* don't like the invblock terminology (for this reason: https://github.com/bitcoin/bitcoin/pull/19988#discussion_r493163586) and would prefer not to introduce it to the Bitcoin Core codebase. -------------------------- jnewbery src/txrequest.h:59 2020-09-30T14:35:43Z I think this depends on your definition of progress? If all of your peers are announcing but not relaying transactions, then it doesn't matter how many times you request the tx, you're never making progress. -------------------------- jnewbery src/txrequest.h:64 2020-09-30T14:36:28Z change 'net_processing' to 'caller' -------------------------- jnewbery src/txrequest.h:71 2020-09-30T14:40:56Z remove "(outbound, whitelisted)". The caller chooses which peers are preferred, so it's best to avoid saying what we expect the caller to do here. -------------------------- jnewbery src/txrequest.h:77 2020-09-30T14:45:49Z Perhaps reword this to be singular i.e. "..marker is given to an announcement at the time it is received ... The peer that announced it was not overloaded." Currently you're mixing plural ("announcements") with singular ("its txhash", "the same txhash") -------------------------- jnewbery src/txrequest.cpp:194 2020-09-30T15:57:23Z "when backwards" -> "went backwards"? -------------------------- jnewbery src/txrequest.cpp:210 2020-09-30T16:36:48Z Perhaps comment that this is only used in sanity testing, or even better remove it entirely and move the logic into `TxRequestTracker::Impl::SanityCheck()` (since it's only called in one place). -------------------------- jnewbery src/txrequest.cpp:237 2020-09-30T16:37:23Z Consider moving this logic into `SanityCheck()`, since it's only called in one place. -------------------------- jnewbery src/txrequest.cpp:253 2020-09-30T16:58:03Z What's the thinking behind having this logic in a separate function, rather than contained in `SanityCheck()`? It's only called in one place. -------------------------- jnewbery src/txrequest.cpp:210 2020-09-30T17:57:50Z Actually, I think just a comment saying that this is just used for sanity checking is sufficient. -------------------------- jnewbery src/txrequest.cpp:237 2020-09-30T17:58:16Z Marking as resolved. AJ asked you to separate this, and I don't have a strong opinion. -------------------------- jnewbery src/txrequest.cpp:253 2020-09-30T17:58:22Z Marking as resolved. AJ asked you to separate this, and I don't have a strong opinion. -------------------------- sipa src/txrequest.h:59 2020-09-30T19:25:11Z "progress towards forgetting a transaction" is unambiguous I think? Every minute one peer is crossed off the list of candidates, so eventually you will run out.
Note that this section isn't about successfully receiving the transaction, but bounding memory usage. -------------------------- sipa src/txrequest.h:59 2020-09-30T19:35:06Z Actually, thinking more about this, it isn't exactly true, because for non-preferred connections we should assume the attacker can disconnect and reconnect, giving them a new opportunity. I think this should be restated as: an attacker can force us to keep a transaction in memory (even in the queues of honest peers) for as long as they can prevent us from receiving the transaction. That's not necessarily bounded, but in cases where it isn't, we have bigger problems. Actual OOM is prevented using per-peer queue size limits. -------------------------- sipa src/txrequest.h:74 2020-09-30T20:31:06Z Done, I've changed this PR to drop the "first" marker logic, based on your branch + adjusting fuzz tester and some extra comments. The commit to add it back is here: https://github.com/sipa/bitcoin/commits/202009_txrequest_rand_wtxid_first -------------------------- sipa src/txrequest.h:29 2020-09-30T20:31:31Z Fixed. -------------------------- sipa src/txrequest.h:50 2020-09-30T20:31:48Z I've replaced it with "transaction censorship attacks". -------------------------- sipa src/txrequest.h:59 2020-09-30T20:32:22Z I've adjusted the comments, merging this paragraph into the previous one (they were already kind of circularly referring to each other). Also changed the formula for delay to take this potential for reconnection behavior into account. -------------------------- sipa src/txrequest.h:64 2020-09-30T20:33:09Z Done. -------------------------- sipa src/txrequest.h:71 2020-09-30T20:33:17Z Done. -------------------------- sipa src/txrequest.h:77 2020-09-30T20:33:29Z Gone (moved to https://github.com/sipa/bitcoin/commits/202009_txrequest_rand_wtxid_first). -------------------------- sipa src/txrequest.cpp:194 2020-09-30T20:33:36Z Fixed. -------------------------- sipa src/txrequest.cpp:210 2020-09-30T20:33:41Z Done. -------------------------- ariard src/net_processing.cpp:75 2020-09-30T23:42:00Z You can add "pushback mechanisms (OVERLOADED_PEER_TX_DELAY)". A future improvement of the pushback mechanism could be to scale it up by the number of times MAX_PEER_TX_IN_FLIGHT is reached, like `m_txrequest.CountInFlightMagnitude(nodeid, MAX_PEER_TX_IN_FLIGHT)`? Thus delaying further and further a likely-malicious peer. -------------------------- ariard src/net_processing.cpp:85 2020-09-30T23:47:31Z I think this delay applies to non-preferred peers, which is strictly a different set from inbound ones, as some of them might be PF_NOBAN==true. The variable name can be updated to reflect this. -------------------------- ariard src/net_processing.cpp:776 2020-09-30T23:55:42Z It's documented in the TxRequestTracker specification, but I think you could recall that a preferred, txid-relay peer will always be favored over a non-preferred, wtxid-relay one. So it doesn't matter that all delay penalties are actually 2 seconds. They order peers inside a class, not across? -------------------------- ariard src/net_processing.cpp:768 2020-10-01T00:32:11Z Why not pass `current_time` only for the false branch of the ternary? The entry should be promoted to {READY/BEST} in `SetTimePoint` anyway. -------------------------- ariard src/net_processing.cpp:757 2020-10-01T00:36:58Z Have you considered moving `m_txrequest` under its own lock as it's a well-contained, new data structure?
Or is it a) useless as we already take `cs_main` independently in all code paths reaching `m_txrequest` or b) too much work and this PR is already complex enough? -------------------------- ariard src/txrequest.cpp:65 2020-10-01T00:41:22Z I think this comment isn't consistent with other comments spread elsewhere, like "Lower priorities are selected first". -------------------------- ariard src/txrequest.cpp:158 2020-10-01T00:45:04Z to convert to CANDIDATE_BEST? -------------------------- ariard src/txrequest.h:39 2020-10-01T00:48:20Z Is there a caveat here if the same transaction is announced concurrently by txid and wtxid? You may have a download collision due to transaction identifier ambiguity. -------------------------- ariard src/txrequest.h:42 2020-10-01T00:50:34Z I find this confusing. Is "Announcements" denoting all Entries under the same txhash, or a given (peer/txhash) Entry? In the former case, a peer going offline shouldn't cause deletion of other different-peer/same-txhash Entries. -------------------------- ariard src/txrequest.h:73 2020-10-01T00:52:36Z At least one _what_? An attacker-controlled peer, a non-buggy honest preferred connection? -------------------------- ariard src/txrequest.h:134 2020-10-01T01:01:37Z How is entry uniqueness enforced w.r.t. wtxid/txid? For a segwit txn, the ByPeer index can't distinguish between a txid and a wtxid announcement, as they're different hashes? -------------------------- sipa src/net_processing.cpp:75 2020-10-01T02:06:31Z > You can add "pushback mechanisms (OVERLOADED_PEER_TX_DELAY)". Done. > A future improvement of the pushback mechanism could be to scale it up by the number of times MAX_PEER_TX_IN_FLIGHT is reached, like m_txrequest.CountInFlightMagnitude(nodeid, MAX_PEER_TX_IN_FLIGHT)? Thus delaying further and further a likely-malicious peer. Yes, there are a number of possibilities there. -------------------------- sipa src/net_processing.cpp:776 2020-10-01T02:11:33Z > It's documented in the TxRequestTracker specification, but I think you could recall that a preferred, txid-relay peer will always be favored over a non-preferred, wtxid-relay one. I'd rather not duplicate the explanations; it'll just risk things becoming inconsistent and confusing. > So it doesn't matter that all delay penalties are actually 2 seconds. They order peers inside a class, not across? reqtimes don't really order, they set the earliest request time. Actual requests are picked randomly from all announcements past their reqtime (preferred first, then non-preferred). -------------------------- sipa src/net_processing.cpp:768 2020-10-01T02:11:52Z Agree, the current code is just unnecessarily complicated. Fixed. -------------------------- sipa src/net_processing.cpp:757 2020-10-01T02:13:10Z I don't think there is a benefit to giving it a separate lock. There may be one at some point, but probably together with the majority of net_processing moving from cs_main to its own lock(s). At this point, cs_main is already held anyway. -------------------------- sipa src/txrequest.cpp:65 2020-10-01T02:13:36Z I've removed it, preferredness is explained much better in the .h file now than when this comment was added. -------------------------- sipa src/txrequest.cpp:158 2020-10-01T02:13:43Z Done. -------------------------- sipa src/txrequest.h:39 2020-10-01T02:17:16Z As far as txrequest is concerned, transactions are identified by their txhash.
If there is both a txid and wtxid announcement for the same transaction (and the wtxid differs from the txid), it'll be treated as two transactions, and both could be fetched. That is exactly the reason why an extra delay for txid announcements was introduced (prior to this PR), to avoid downloading the same thing twice. -------------------------- sipa src/txrequest.h:42 2020-10-01T02:19:04Z This is the specification, there is no concept of Entry here. Earlier in the file it says that a txhash/peer combination is called an announcement, so I don't think this is ambiguous. -------------------------- sipa src/txrequest.h:73 2020-10-01T02:19:24Z I've clarified this, I think. -------------------------- sipa src/txrequest.h:134 2020-10-01T02:20:27Z I think that's explained exactly in this paragraph? "if one already exists for that (txhash, peer) combination". Uniqueness is on that combination. If the txhash is the same, it's the same. If they're not, they're not. -------------------------- ajtowns src/txrequest.h:74 2020-10-01T04:28:17Z (replying to resolved thread without marking it unresolved) With "first" marker disabled (essentially treating every peer as overloaded), I got: ``` 324484 ABCD requested preferred=1 first=0 candidates=[-,-] completed=[-,-] -- 81.7% 61457 ABCD requested preferred=0 first=0 candidates=[-,-] completed=[-,-] -- 15.5% 599 ABCD requested preferred=0 first=0 candidates=[-,-] completed=[Y,Y] 122 ABCD requested preferred=0 first=0 candidates=[-,-] completed=[-,Y] 110 ABCD requested preferred=0 first=0 candidates=[-,-] completed=[Y,-] 87 ABCD requested preferred=1 first=0 candidates=[-,-] completed=[Y,Y] 60 ABCD requested preferred=1 first=0 candidates=[-,-] completed=[Y,-] 13 ABCD requested preferred=1 first=0 candidates=[-,-] completed=[-,Y] 2949 ABCD requested preferred=1 first=0 candidates=[-,Y] completed=[-,-] -- 0.7% 543 ABCD requested preferred=1 first=0 candidates=[-,Y] completed=[Y,Y] 214 ABCD requested preferred=1 first=0 candidates=[Y,Y] completed=[Y,Y] 51 ABCD requested preferred=1 first=0 candidates=[Y,Y] completed=[-,Y] 24 ABCD requested preferred=1 first=0 candidates=[-,Y] completed=[-,Y] 5 ABCD requested preferred=1 first=0 candidates=[Y,Y] completed=[Y,-] 5 ABCD requested preferred=1 first=0 candidates=[-,Y] completed=[Y,-] 4 ABCD requested preferred=1 first=0 candidates=[Y,-] completed=[Y,Y] 3 ABCD requested preferred=1 first=0 candidates=[Y,-] completed=[Y,-] 1 ABCD requested preferred=1 first=0 candidates=[Y,-] completed=[-,Y] 3285 ABCD requested preferred=0 first=0 candidates=[-,Y] completed=[-,-] -- 0.8% 2910 ABCD requested preferred=0 first=0 candidates=[-,Y] completed=[Y,Y] -- 0.7% 107 ABCD requested preferred=0 first=0 candidates=[-,Y] completed=[-,Y] 8 ABCD requested preferred=0 first=0 candidates=[-,Y] completed=[Y,-] ``` I got 561 "accepted orphan tx" entries (ignoring the first hour), compared to 520 last time, which is an 8% increase in orphans, compared to running for about 90% longer and doing 54% more requests; so it seems like any differences in behaviour are lost in the noise.
Looking at the per-peer behaviour, one non-preferred peer got a lot of requests: ``` 1088 ABCD requested preferred=0 peer=8236 1245 ABCD requested preferred=0 peer=7422 1470 ABCD requested preferred=0 peer=7219 1486 ABCD requested preferred=0 peer=7712 1540 ABCD requested preferred=0 peer=7342 13124 ABCD requested preferred=0 peer=53 22367 ABCD requested preferred=0 peer=12516 159 ABCD requested preferred=1 peer=20 166 ABCD requested preferred=1 peer=7 746 ABCD requested preferred=1 peer=6 16624 ABCD requested preferred=1 peer=31 22917 ABCD requested preferred=1 peer=122 25830 ABCD requested preferred=1 peer=24 54666 ABCD requested preferred=1 peer=30 55279 ABCD requested preferred=1 peer=27 72306 ABCD requested preferred=1 peer=29 79637 ABCD requested preferred=1 peer=23 ``` (ignores preferred peers that got less than 100 requests, and non-preferred peers that got less than 1000). For peer=12516, almost every request (21625) was when it was the only candidate and no other peer had already been tried for the tx. It looks like it just happened to be fastest most of the time, with plenty of other inbounds (presumably) having already announced when we actually make the request: ``` 1 delayed=114 ... 7 delayed=60 12 delayed=59 12 delayed=58 ... 744 delayed=12 904 delayed=11 1025 delayed=10 1146 delayed=9 1252 delayed=8 1389 delayed=7 1448 delayed=6 1335 delayed=5 1354 delayed=4 1096 delayed=3 706 delayed=2 420 delayed=1 193 delayed=0 ``` Results for peer=53 look pretty similar to me, though it had a higher proportion of announcements where delayed=0 or delayed=1. So I think that just means it's showing a heavy bias towards fast peers/first announcers? The bias part is fine and desirable, but maybe it would be good to reduce the heaviness in future by mildly varying peers' delays, either randomly or based on load in some way? Something to worry about in a followup. -------------------------- ajtowns src/net_processing.cpp:768 2020-10-01T05:00:14Z With this change, if time goes slightly backwards due to a non-monotonic clock, a later announcement may get requested first, even when no delay is intended: * t=1.000 peer=1 preferred=1 txid=X --> ReceivedInv --> reqtime=1.000 state=CANDIDATE_DELAYED * t=0.997 GetRequestable(peer=1) --> (no change) * t=0.998 peer=2 preferred=1 txid=X --> ReceivedInv --> reqtime=0.999 state=CANDIDATE_DELAYED * t=0.999 GetRequestable(peer=2) --> peer=2 txid=X state --> CANDIDATE_BEST The previous code would have ensured that when no delay is desired, the announcement will always go to READY or BEST the next time GetRequestable is called. (No opinion on whether that edge case is worth the conditional. If it is, might be nice to just pass in (current_time, delay) and have the addition/min done in ReceivedInv though) -------------------------- glozow src/txrequest.cpp:167 2020-10-01T11:30:56Z Question: why wasn't `entry.IsSelectable()` used instead? -------------------------- glozow src/txrequest.h:127 2020-10-01T15:41:32Z I interpret this comment to mean that `ForgetTxHash` i.e. the `TxRequestTracker` is responsible for deleting the transaction by txid and wtxid, but I don't think this is the case... that's `PeerManager`'s job right? -------------------------- sipa src/net_processing.cpp:85 2020-10-01T16:44:13Z Done. 
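--------------------------
A minimal, hypothetical sketch of the "(current_time, delay)" idea ajtowns floats a few comments above for the net_processing.cpp:768 thread: doing the addition/min inside one helper, so a zero delay always maps to an immediately requestable reqtime even if the clock jitters. `ComputeReqTime` and the zero-delay convention are illustrative only, not the PR's actual interface.
```c++
#include <cassert>
#include <chrono>

// Hypothetical helper: compute a reqtime from the current time plus a delay.
std::chrono::microseconds ComputeReqTime(std::chrono::microseconds current_time,
                                         std::chrono::microseconds delay)
{
    // No delay requested: use the minimum representable time so the
    // announcement is immediately requestable, regardless of clock jitter.
    if (delay == std::chrono::microseconds::zero()) return std::chrono::microseconds::min();
    return current_time + delay;
}

int main()
{
    using namespace std::chrono_literals;
    assert(ComputeReqTime(100s, 0s) == std::chrono::microseconds::min());
    assert(ComputeReqTime(100s, 2s) == 102s);
}
```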
-------------------------- sipa src/net_processing.cpp:768 2020-10-01T17:17:43Z Yes, that was the original reasoning, but I'm not sure clocks going backward is worth extra complexity (it should do something sane, and be tested, but it should be sufficiently rare that the actual behavior doesn't matter too much). -------------------------- glozow src/net_processing.cpp:776 2020-10-01T17:31:38Z In https://github.com/bitcoin/bitcoin/pull/19988/commits/987b27176e41f38b997f7b732f087a518fa678ca Please forgive me but I just want to make sure I'm understanding this commit correctly... we want to add a delay for transactions by txid just in case we're able to get it by wtxid from another peer. But when we get an orphan, we only have the txid of the missing parent. - Using the logic of the peer's `m_wtxid_relay` doesn't apply to requests for missing parents because we can only request by txid, even if the peer is wtxid relay? - But using the transaction's `IsWtxid` means we're interested in potentially receiving the missing parent (from anyone) as a wtxid relay before we ask this peer for the missing parent? -------------------------- sipa src/txrequest.cpp:167 2020-10-01T17:57:53Z That would also compute the priority for entries in the CANDIDATE_BEST state. That would be harmless as it wouldn't change behavior, but the computation of priority isn't exactly free either, so better avoid it when not needed. -------------------------- sipa src/txrequest.h:127 2020-10-01T18:07:32Z `TxRequestTracker` is a data structure that manages the set of announced txids/wtxids for all peers, exposes ways to alter that set, and answers queries about it. Its responsibility is maintaining the consistency of that data structure, and doing as it's told; it isn't responsible for deciding what mutations to make and when; that's net_processing's job. This function is the mutator that removes all entries with a given txid or wtxid. It does just that. Net_processing calls it whenever it is no longer interested in a particular txid/wtxid. -------------------------- sipa src/net_processing.cpp:776 2020-10-01T18:24:30Z Imagine we have multiple announcements for the same witness transaction, and all peers have different witnesses for it. So the transaction will be known by one txid, but multiple distinct wtxids. We don't know that these are all for the same transaction, as all the (non-txid) requests have different hashes. No matter who we ask, or what we receive, we'll always be able to delete all txid-based announcements for it (we'll compute the txid of the received transaction, which will be the same for all). However, we can't delete all wtxid announcements, only the wtxid announcement for the witness we actually received. Thus we want to prioritize fetching by wtxid, as a successful response to that will allow us to delete at least one wtxid-based announcement plus all txid-based announcements (while a txid-based request only guarantees deleting all txid-based announcements). Now imagine that instead all our peers are BIP339 (wtxid) peers, and there are transactions A and B, where B spends one of A's outputs. For most peers, we receive A first and then B. But from one peer, we only receive B (e.g. because we just connected and weren't connected when A was sent), and B is fetched first. That makes it an orphan, and its "prevout" entry in its input for A is treated as a txid announcement (we don't have its wtxid), despite being from a wtxid peer.
Thus we end up with a txid announcement for A, and a bunch of wtxid announcements for it (from the other peers). The same reasoning applies here: we want to fetch by one of the wtxid ones first, because it is guaranteed to allow us to forget the txid announcement PLUS at least one wtxid one. -------------------------- ariard src/txrequest.h:39 2020-10-02T00:09:03Z Okay, so the only download duplication we may have is if any preferred, wtxid-relay peer is really slow, so we end up requesting from a txid-relay peer and receiving both. Really an edge case once the network is sufficiently upgraded, I guess. -------------------------- ariard src/txrequest.h:134 2020-10-02T00:12:37Z I think I stumbled on "Note that this means a second INV with the same txhash from the same peer will be ignored, even if one is a txid and the other is wtxid". I interpreted it as "if the first is txid, do nothing even if a second announcement is wtxid". That's my English here, never mind. -------------------------- sipa src/txrequest.h:134 2020-10-02T00:18:25Z Just so we're on the same page. The scenario is: * Peer P announces txid H * Peer P announces wtxid H In this case, the second announcement is ignored, because one already exists for peer/txhash combination (P, H). This is harmless for two reasons: * The txid and wtxid being identical implies it's a non-segwit transaction, so it doesn't matter how we fetch. * BIP339 prescribes that all announcements have to be wtxid when enabled (though I now realize that orphan fetching could mean this can actually still occur; I'll improve the comment). -------------------------- sipa src/txrequest.h:39 2020-10-02T00:27:38Z I don't expect it to be super rare, but the risk of double fetching during rollout is inherent to BIP339 (and exists in the same form in current master, so I don't think it's significantly affected by this PR). -------------------------- sipa src/txrequest.h:42 2020-10-02T00:39:39Z I've elaborated this a bit. Is it clearer now? -------------------------- sipa src/txrequest.h:39 2020-10-02T00:39:53Z Added a note about this. -------------------------- sipa src/txrequest.h:134 2020-10-02T00:40:17Z I've rewritten the comment. -------------------------- MarcoFalke src/net_processing.cpp:85 2020-10-02T06:55:31Z ```suggestion static constexpr std::chrono::seconds NONPREF_PEER_TX_DELAY{2}; ``` This can be written shorter, as the compiler will do the chrono conversion for you (Same below) -------------------------- jnewbery src/txrequest.cpp:210 2020-10-02T08:58:53Z `==()` doesn't need to be a friend, since it's not accessing any private/protected members: ```diff diff --git a/src/txrequest.cpp b/src/txrequest.cpp index bef3460dd6..b7347d8d34 100644 --- a/src/txrequest.cpp +++ b/src/txrequest.cpp @@ -206,14 +206,15 @@ struct PeerInfo { size_t m_completed = 0; //!< Number of COMPLETED entries for this peer. size_t m_requested = 0; //!< Number of REQUESTED entries for this peer. - /** Compare two PeerInfo objects. Only used for sanity checking. */ - friend bool operator==(const PeerInfo& a, const PeerInfo& b) - { - return std::tie(a.m_total, a.m_completed, a.m_requested) == - std::tie(b.m_total, b.m_completed, b.m_requested); - } }; +/** Compare two PeerInfo objects. Only used for sanity checking. */ +bool operator==(const PeerInfo& a, const PeerInfo& b) +{ + return std::tie(a.m_total, a.m_completed, a.m_requested) == + std::tie(b.m_total, b.m_completed, b.m_requested); +} + ``` Feels slightly clearer to me, but obviously doesn't make much difference.
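--------------------------
To make MarcoFalke's chrono suggestion a few comments above concrete: a constant declared in `std::chrono::seconds` converts implicitly and losslessly wherever a finer duration such as microseconds is expected, so the microsecond value never needs to be spelled out. `ApplyDelay` below is a made-up helper used only to show the implicit conversion.
```c++
#include <cassert>
#include <chrono>

// Shorter form from the suggestion above; the compiler widens seconds to
// microseconds automatically wherever a microseconds argument is expected.
static constexpr std::chrono::seconds NONPREF_PEER_TX_DELAY{2};

// Hypothetical helper standing in for code that works in microseconds.
std::chrono::microseconds ApplyDelay(std::chrono::microseconds now,
                                     std::chrono::microseconds delay)
{
    return now + delay;
}

int main()
{
    using namespace std::chrono_literals;
    // The seconds constant is accepted directly as a microseconds argument.
    assert(ApplyDelay(10s, NONPREF_PEER_TX_DELAY) == 12'000'000us);
}
```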
-------------------------- jnewbery src/txrequest.cpp:301 2020-10-02T09:15:18Z This is really just a stylistic thing, but is there a reason to make any members of an impl class private? By definition it doesn't have an exposed interface, so theoretically everything could just be public. The first two examples at https://en.cppreference.com/w/cpp/language/pimpl are actually declared as `struct`s (although the third example does have a private data member) -------------------------- jnewbery src/txrequest.cpp:62 2020-10-02T09:29:12Z Is there a reason not to use `NodeId` throughout instead of `uint64_t`? It seems to me that `NodeId` would be more consistent with net/net_processing, and more readable in function signatures/returns types/typedefs. -------------------------- jnewbery src/txrequest.h:139 2020-10-02T09:38:19Z s/takes it/takes its/ -------------------------- jnewbery src/txrequest.cpp:587 2020-10-02T09:47:07Z s/AlreadyHave/ForgetTxHash/ -------------------------- jnewbery src/txrequest.cpp:599 2020-10-02T09:51:30Z Delete all this. It's no longer needed now that there's no 'first' marker. -------------------------- jnewbery src/txrequest.cpp:278 2020-10-02T09:57:24Z Delete -------------------------- jnewbery src/txrequest.h:148 2020-10-02T10:06:56Z s/ForgetTx/ForgetTxHash/ -------------------------- jnewbery src/txrequest.h:151 2020-10-02T10:12:29Z Possible changes to this interface: 1. Change the `GenTxid` argument to `uint256`. The function doesn't make any distinction between txids and wtxids, so why make it part of the interface? 2. Pass a `const std::vector&`, with all txs that have been requested, rather than repeatedly calling the same function with different tx hashes. 3. Also pass in a `const std::vector& hashes_to_forget`, with all txs to forget. Doing all of those would change the condition "This can ONLY be called immediately after GetRequestable was called (for the same peer), with only ForgetTxHash and other RequestedTx calls (both for other txhashes) in between." to the much stronger condition "This can ONLY be called immediately after a GetRequestable call with no other calls in between". -------------------------- jnewbery src/net_processing.cpp:756 2020-10-02T10:20:00Z _overloaded_ isn't a TxRequestTracker parameter, so I think it's fine just to have the `OVERLOADED_PEER_TX_DELAY` comment in the sub-bullet below -------------------------- jnewbery src/net_processing.cpp:760 2020-10-02T10:20:40Z s/announcements from txid peers/txid announcements/ (since requesting an orphan tx from a wtxid peer counts as a txid announcement) -------------------------- jnewbery src/net_processing.cpp:84 2020-10-02T10:30:01Z s/How many microseconds/How long/ (commenting that a `std::chrono::microseconds` constant is microseconds is redundant) -------------------------- jnewbery src/txrequest.cpp:545 2020-10-02T10:38:46Z Delete all this. No longer needed since you removed the 'first' marker. -------------------------- jnewbery src/txrequest.cpp:533 2020-10-02T10:39:49Z I don't understand this comment about catching a non-CANDIDATE_BEST entry automatically. `m_index.get().count()` will return 0 in that case. Edit: Ah! I've just seen the comment below. Perhaps update this comment to say "will be caught by the uniqueness property when we try to emplace the new Entry object". 
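--------------------------
A small sketch of the "caught by the uniqueness property when we try to emplace" behaviour mentioned just above, using a `std::set` of (peer, txhash) pairs as a stand-in for the boost::multi_index container in txrequest.cpp: on a unique index, `emplace()` reports a duplicate via the bool in its return value instead of inserting a second entry.
```c++
#include <cassert>
#include <set>
#include <string>
#include <utility>

int main()
{
    std::set<std::pair<int, std::string>> by_peer; // illustrative key type only

    auto first = by_peer.emplace(1, "txhash_a");
    assert(first.second);             // inserted

    auto dup = by_peer.emplace(1, "txhash_a");
    assert(!dup.second);              // duplicate (peer, txhash); nothing added
    assert(dup.first == first.first); // iterator points at the pre-existing entry
}
```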
-------------------------- jnewbery src/txrequest.cpp:133 2020-10-02T10:47:35Z Why not just index by (peer, txhash), and then in `GetRequestable()` just add an `if (it_peer->GetState() == State::CANDIDATE_BEST)` before adding the hash to `selected`? It seems like in all other places (`DisconnectedPeer()`, `ReceivedInv()` and `ReceivedResponse()`), we're actually interested in all the outstanding Entry objects for a given peer. -------------------------- jnewbery src/txrequest.cpp:428 2020-10-02T11:09:50Z Maybe assert that `it` is not `end()` at the top of this function? -------------------------- jnewbery src/txrequest.h:158 2020-10-02T11:46:40Z This can be changed to take a hash instead of a GenTxid. The function doesn't do anything with whether it's a txid or wtxid, so it shouldn't be part of the interface. -------------------------- jnewbery src/net_processing.cpp:2901 2020-10-02T11:49:18Z No point in calling this twice. `ReceivedResponse()` only cares about the hash. -------------------------- jnewbery src/net_processing.cpp:4510 2020-10-02T11:55:35Z Testing my understanding here: do we ever expect to hit this? I think that every action that causes a tx to become AlreadyHave will also cause us to ForgetTxHash, and we won't add an AlreadyHave tx back into TxRequestTracker. If I'm right, perhaps just add a comment here saying that we don't expect to hit this and it's here for belt-and-suspenders. -------------------------- jnewbery src/net_processing.cpp:4450 2020-10-02T12:03:50Z This `CInv` only needs to be constructed inside the `!AlreadyHaveTx` block. It could even be emplaced directly into `vGetData`. -------------------------- jnewbery src/txrequest.cpp:560 2020-10-02T12:09:43Z This function is now on the hot path and is called for every peer on every SendMessages() loop. Have you done any profiling to see how much time we'll spend in here on a normally loaded system? I think all of the lookups in here are O(1) in the size of the index, but is the constant factor important? -------------------------- jnewbery src/txrequest.cpp:64 2020-10-02T12:13:56Z If you don't change the `m_peer` to be `NodeId`, consider making a typedef `SeqNo` for the sequence number, so that it's obvious which types are using sequence numbers and which are using peer ids. -------------------------- jnewbery src/txrequest.cpp:566 2020-10-02T12:17:02Z Is there a reason not to just construct the `GenTxid`s directly into `selected`? -------------------------- jnewbery src/txrequest.cpp:56 2020-10-02T12:24:49Z Are "entry" and "announcement" synonymous? If so, would it make sense to rename `Entry` to `Announcement` and drop the "entry" terminology? -------------------------- sr-gi src/txrequest.h:33 2020-10-02T13:35:26Z Is there any specific flag for this? I guess it is implicitly stored in `m_state`, but also alongside any other state the transaction can be in. -------------------------- sr-gi src/txrequest.cpp:597 2020-10-02T13:35:46Z There is no longer a "first" marker for the entries, is there? -------------------------- ajtowns src/txrequest.h:151 2020-10-02T14:08:35Z Constructing vectors instead of repeatedly calling the function seems worse to me, fwiw, but agree that it would be nice to have a simpler description of the constraint. Maybe tying it to the return value of `GetRequestable` might be better: "each value returned by `GetRequestable` may be used with either `ForgetTxHash` or `RequestedTx` for the given peer, but not both, and at most only once.
Any other calls to non-const methods on `TxRequestTracker` invalidate all the results from `GetRequestable`"? -------------------------- ajtowns src/txrequest.cpp:133 2020-10-02T14:11:46Z That would make `GetRequestable` be `O(nr_announcements)` instead of `O(nr_best)` which I think makes the complexity blow out in worst case scenarios. -------------------------- ariard src/txrequest.h:42 2020-10-02T14:11:59Z Yes, better! -------------------------- ariard src/txrequest.h:134 2020-10-02T14:13:57Z Yes clearer. -------------------------- amitiuttarwar src/txrequest.cpp:56 2020-10-02T14:35:01Z +1 was wondering the same -------------------------- ajtowns src/txrequest.cpp:560 2020-10-02T14:41:06Z `m_index.get.begin()` should be `O(log_2(n))`, `++it` should be `O(1)`, `.emplace()` should be `O(log_2(n))` with about 3x constant factor overhead, `.modify()` should be the same except maybe 6x constant factor overhead. `n` is limited to about 600k, so `log_2(n)` is below about 20. With max peers and max announcements, the entire index could be something like 100MB I think, more if you're removing the bounds for some peers via the relay permission. I think worst case complexity is that in SetTimePoint you end up adjusting the state of every announcement, then you find the first BEST for the given peer, which turns out to be every tx for that peer, then iterate through the remaining BEST for that peer adding them to a vector, then sort the vector, and construct the result. I think that's about: 600000 * (20 + 2*20) + 20 + 5000 + 5000*12 + 5000 ~= 36M ops but assuming time doesn't go backwards, then amortized across multiple calls to `GetRequestable`, each announcement only causes 7 transitions over its lifetime (delayed, ready, best (+demotion of other), requested, completed (+promotion of other)), and only 4 of those occur via `GetRequestable`, so amortized cost per tx is something like `4*60+20+1+1*12+1 = 274` ops, so max average cost of calling `GetRequestable` should be about 1.4M ops. On a normally loaded system, you should be getting no more than 35-70 txs from each call to GetRequestable, and I think SetTimePoint should likewise be dealing with (far) fewer than 7000 txs in each call, so load should be about 100 to 1000 times less, I think. -------------------------- ajtowns src/txrequest.cpp:566 2020-10-02T14:46:46Z It needs to be sorted by sequence before being returned to ensure txs are requested in order of announcement, and the sequence isn't needed in the return value, so having a temporary vector's sensible, I think. And sorting a vector of 16-byte seq/pointer pairs should be more cache efficient than for 48-byte seq/is_wtxid/uint256 tuples (or 40 bytes if you combined is_wtxid into seq). -------------------------- jnewbery src/txrequest.cpp:133 2020-10-02T14:48:38Z Yes, makes sense. Thanks! -------------------------- ariard src/test/txrequest_tests.cpp:63 2020-10-02T15:29:26Z If you assume the returned 23-bit integers are statistically uniform across samples (`RandomTime()`), I don't understand why adding them increases randomness? -------------------------- ariard src/test/txrequest_tests.cpp:511 2020-10-02T15:39:50Z You may draw a new `starttime` for every new scenario, thus increasing the space of starting times covered by each `TestInterleavedScenarios`? -------------------------- sipa src/txrequest.cpp:133 2020-10-02T16:12:03Z Yes, that's exactly the reason. -------------------------- sipa src/txrequest.cpp:62 2020-10-02T16:28:19Z The only reason not to is avoiding a dependency on net.h.
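--------------------------
A sketch of the temporary-vector approach ajtowns describes above for src/txrequest.cpp:566: sort small (sequence, pointer) proxies by sequence number, then extract the caller-facing values in announcement order. `Announcement` here is a stand-in type with a string placeholder for the real uint256/GenTxid, not the actual txrequest struct.
```c++
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <string>
#include <utility>
#include <vector>

struct Announcement {
    uint64_t m_sequence;
    std::string m_txhash; // placeholder for the real txhash
};

int main()
{
    std::vector<Announcement> anns{{3, "c"}, {1, "a"}, {2, "b"}};

    // 16-byte (sequence, pointer) proxies are cheaper to sort than full entries.
    std::vector<std::pair<uint64_t, const Announcement*>> selected;
    for (const auto& ann : anns) selected.emplace_back(ann.m_sequence, &ann);
    std::sort(selected.begin(), selected.end());

    // Extraction step: copy out just the data the caller needs, in order.
    std::vector<std::string> ret;
    ret.reserve(selected.size());
    for (const auto& item : selected) ret.push_back(item.second->m_txhash);

    for (const auto& h : ret) std::cout << h << '\n'; // prints a, b, c
}
```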
-------------------------- ariard src/test/txrequest_tests.cpp:485 2020-10-02T16:31:43Z Does this test check that the non-preferred announcement is requested after expiration of the first one, to verify that wtxidness doesn't interfere with promotion of the remaining CANDIDATEs to _BEST? -------------------------- jnewbery src/txrequest.cpp:62 2020-10-02T16:49:40Z You mean in the header file? Could you make the public functions take `int64_t` and use `NodeId` internally in the cpp file (which already includes net.h)? -------------------------- ariard src/txrequest.cpp:282 2020-10-02T16:50:55Z What about `m_cur_sequence` to dissociate it clearly from the per-Entry `m_sequence`? Also, the comment could be clearer that the requests are ordered per-peer, not globally. I had a doubt while reviewing `BuildRequestOrderTest` -------------------------- sipa src/txrequest.cpp:301 2020-10-02T16:51:18Z It may be a matter of personal taste. In general I think it still makes sense to have non-exposed classes with private/public fields/members. It's not there to prevent external code from messing with the internals, obviously, but it does still kind of define a layer between what is part of the class's representation and what is its (even just internally) exposed interface. In this case with the pimpl pattern... maybe that's silly, as the glue layer to forward calls on `TxRequestTracker` to `TxRequestTracker::Impl` is super thin, and it's already obvious what the exposed part is. Happy to change it if there are strong opinions. -------------------------- ariard src/test/txrequest_tests.cpp:373 2020-10-02T16:54:00Z Maybe add a config where they're _all_ preferred/non-preferred? `InsecureRandRange` is really unlikely to return min and max values when you have 7 peers? -------------------------- sipa src/txrequest.cpp:62 2020-10-02T16:54:44Z I think the observable "c++ file dependencies" are only an approximation for what actually matters in terms of code organization: "conceptual dependencies between modules". What you're suggesting is just as much a dependency of txrequest on net in my view - just slightly more hidden. -------------------------- sipa src/txrequest.cpp:62 2020-10-02T17:00:54Z All of this to say: I'd like to avoid a dependency on net, and am therefore using uint64_t as an opaque peer identifier rather than NodeId. But if we feel that's not worth the cost, it should be changed to use NodeId everywhere (in both the .h and the .cpp). -------------------------- sipa src/txrequest.h:151 2020-10-02T17:07:39Z It wouldn't be too hard to just drop the constraint, and instead specify it as: * If no announcement with the specified (peer, txhash) exists, or it isn't in CANDIDATE state, the call has no effect. * The specified announcement is changed from CANDIDATE to REQUESTED. * If another announcement for the same txhash was already in REQUESTED state, it is marked COMPLETED. It'd be a bit more code, but would make the function fully specified. I've avoided it, as there is really no point for that functionality in practice, but it does make things a bit nicer. WDYT? If we don't do that, I like @ajtowns's way of describing the functionality. -------------------------- jnewbery src/txrequest.cpp:301 2020-10-02T17:14:22Z No strong opinion. Feel free to mark this resolved. -------------------------- jnewbery src/txrequest.cpp:62 2020-10-02T17:17:14Z I do feel like we should use the same type to represent node ids in all components.
Perhaps the purest way to do it would be to move the `typedef int64_t NodeId` somewhere outside net.h, but that seems a bit overkill. Even if you don't use `NodeId`, is there a reason that you're using `uint64_t` rather than `int64_t` like everywhere else? -------------------------- ariard src/test/txrequest_tests.cpp:350 2020-10-02T17:17:54Z If I understand `NewTxHash` correctly, it will return a `uint256` which is guaranteed to respect the priority order of peers for both preferred and non-preferred classes? And those `NewTxHash` checks are needed as the peer identifier is part of the siphash message. It's more about finding a `txhash` rather than a `gtxid`, as wtxidness shouldn't interfere with priority. -------------------------- jnewbery src/txrequest.h:151 2020-10-02T17:21:07Z Either is fine. I also like @ajtowns's description. Whichever way we go, I still think the function should be changed to take a txhash - I should have left that as a separate comment. -------------------------- ariard src/test/txrequest_tests.cpp:353 2020-10-02T17:32:04Z "Decide reqtimes in opposite order of the _expected_ request order, which is a function of the announcement order and peer preferredness". Clearer to underscore there are two orders, and you're deliberately tweaking the second one. Maybe, "The lowest priority peer will get the soonest reqtime. It will be the to-be-requested-from peer until the time (Scenario.m_now) is jumped above the reqtime of the next-priority peer." -------------------------- ariard src/test/txrequest_tests.cpp:370 2020-10-02T17:44:32Z "We pin back the current time under the checked peer's reqtime. We observe it's not the current to-be-requested-from peer. We advance the current time beyond the checked peer's reqtime. We observe it's henceforth the new to-be-requested-from peer" -------------------------- sipa src/net_processing.cpp:2901 2020-10-02T17:45:53Z The hash is different in both calls. -------------------------- ariard src/test/txrequest_tests.cpp:379 2020-10-02T17:46:54Z "Observe the to-be-requested peer change to the remaining peer with the highest priority" -------------------------- ariard src/test/txrequest_tests.cpp:265 2020-10-02T17:55:32Z This is just to test that advancing time doesn't change the "requestability" of the transaction without a call to `RequestedTx`? -------------------------- ariard src/test/txrequest_tests.cpp:297 2020-10-02T18:04:30Z I think advancing time should be a config alternative of its own; otherwise, how can you dissociate the state transition REQUESTED -> COMPLETED triggered by a `ReceivedResponse`/`DisconnectPeer` from a `GetRequestable` one? -------------------------- ariard src/test/txrequest_tests.cpp:272 2020-10-02T18:07:18Z I think using an AND here will get you higher coverage of this case? -------------------------- ariard src/test/txrequest_tests.cpp:261 2020-10-02T18:08:28Z Isn't advancing the time blurring the subsequent observation that the transaction state has moved? Compared to only relying on `DisconnectPeer`/`ForgetTxHash`? -------------------------- sipa src/txrequest.cpp:210 2020-10-02T18:27:49Z Done. -------------------------- sipa src/txrequest.cpp:62 2020-10-02T18:29:04Z I've bitten the bullet and converted everything to `NodeId`. I believe it being a uint64_t dates back to a time when I was trying to squeeze out bits by using less than 64 bits for the peer (and signed bitfields are implementation-defined, IIRC). This was a bad idea. -------------------------- sipa src/txrequest.cpp:587 2020-10-02T18:29:16Z Done.
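--------------------------
A hypothetical illustration of jnewbery's "move the typedef somewhere outside net.h" idea above: a tiny shared header (imagine it as a standalone "node_id.h") could define the alias so txrequest.h gets `NodeId` without depending on all of net.h. The file split is only described in comments here; the names are illustrative.
```c++
#include <cstdint>
#include <iostream>

// What the minimal shared header would contain:
using NodeId = int64_t;

// A txrequest-style declaration that only needs the alias, not net.h:
void DisconnectedPeer(NodeId peer)
{
    std::cout << "peer " << peer << " disconnected\n";
}

int main()
{
    DisconnectedPeer(7);
}
```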
-------------------------- sipa src/txrequest.h:139 2020-10-02T18:29:25Z Done. -------------------------- sipa src/txrequest.cpp:599 2020-10-02T18:29:32Z Done. -------------------------- sipa src/txrequest.cpp:278 2020-10-02T18:29:39Z Done. -------------------------- sipa src/txrequest.h:148 2020-10-02T18:29:45Z Done. -------------------------- sipa src/txrequest.h:151 2020-10-02T18:29:59Z Changed to AJ's description. -------------------------- sipa src/net_processing.cpp:756 2020-10-02T18:30:11Z Done. -------------------------- sipa src/net_processing.cpp:760 2020-10-02T18:30:18Z Done. -------------------------- sipa src/net_processing.cpp:84 2020-10-02T18:30:24Z Done. -------------------------- sipa src/txrequest.cpp:545 2020-10-02T18:30:32Z Done. -------------------------- sipa src/txrequest.cpp:533 2020-10-02T18:30:45Z Done. -------------------------- sipa src/txrequest.h:158 2020-10-02T18:31:00Z Done (also for RequestedTx). -------------------------- sipa src/net_processing.cpp:4510 2020-10-02T18:31:19Z Added a comment. -------------------------- sipa src/net_processing.cpp:4450 2020-10-02T18:31:29Z Indeed! Done. -------------------------- sipa src/txrequest.cpp:560 2020-10-02T18:31:55Z I'll do some benchmarks. -------------------------- sipa src/txrequest.cpp:64 2020-10-02T18:32:16Z Added a `SequenceNumber` and `Priority` type alias for uint64_t. -------------------------- sipa src/txrequest.cpp:566 2020-10-02T18:33:46Z Yeah, the reason is that we want to sort by sequence number, but those aren't included in the output itself, so we need some kind of proxy. It could be a list of (sequence, gtxid) pairs that are sorted, but that would still require an extraction step to convert it to just gtxids. This seems simplest. -------------------------- sipa src/txrequest.cpp:56 2020-10-02T18:35:36Z Done, renamed a few things: * Entry -> Announcement * Entry{TxHash,Peer,Time} -> By{TxHash,Peer,Time}View * Entry{TxHash,Peer,Time}Extractor -> By{TxHash,Peer,Time}ViewExtractor No more "entr" anywhere in txrequest (I did not make the same change in the fuzz test, though). -------------------------- sipa src/txrequest.h:33 2020-10-02T18:36:25Z State::COMPLETED. That doesn't just cover failure, but there is no observable difference between failed and otherwise completed announcements. -------------------------- sipa src/txrequest.cpp:597 2020-10-02T18:36:35Z Gone. -------------------------- sipa src/txrequest.cpp:428 2020-10-02T18:44:01Z Done (and in a few more places). -------------------------- sipa src/txrequest.h:151 2020-10-02T19:38:54Z Here is the patch to make `RequestedTx` fully specified: ```patch diff --git a/src/txrequest.cpp b/src/txrequest.cpp index af4b59755b..581b498180 100644 --- a/src/txrequest.cpp +++ b/src/txrequest.cpp @@ -590,11 +590,29 @@ public: void RequestedTx(NodeId peer, const uint256& txhash, std::chrono::microseconds expiry) { auto it = m_index.get().find(ByPeerView{peer, true, txhash}); - // RequestedTx can only be called on CANDIDATE_BEST announcements (this is implied by its condition that it - // can only be called on GenTxids returned by GetRequestable (and only ForgetTxHash and RequestedTx can be - // called in between, which preserve the state of other GenTxids). - assert(it != m_index.get().end()); - assert(it->GetState() == State::CANDIDATE_BEST); + if (it == m_index.get().end()) { + // There is no CANDIDATE_BEST entry, look for _READY or _DELAYED instead. 
+ it = m_index.get().find(ByPeerView{peer, false, txhash}); + if (it == m_index.get().end() || (it->GetState() != State::CANDIDATE_DELAYED && + it->GetState() != State::CANDIDATE_READY)) { + // There is no CANDIDATE_* entry at all, give up. + return; + } + + // Look for an existing CANDIDATE_BEST to demote to _READY, or REQUESTED to make COMPLETED. + auto it_old = m_index.get().lower_bound(ByTxHashView{txhash, State::CANDIDATE_BEST, 0}); + if (it_old != m_index.get().end() && it_old->m_txhash == txhash) { + if (it_old->GetState() == State::CANDIDATE_BEST) { + // It doesn't matter whether we pick CANDIDATE_READY or _DELAYED here, as SetTimePoint() + // will correct it at GetRequestable() time. If time only goes forward, it will always be + // _READY, so pick that to avoid extra work in SetTimePoint(). + Modify(it_old, [](Announcement& ann) { ann.SetState(State::CANDIDATE_READY); }); + } else if (it_old->GetState() == State::REQUESTED) { + Modify(it_old, [](Announcement& ann) { ann.SetState(State::COMPLETED); }); + } + } + } + Modify(it, [expiry](Announcement& ann) { ann.SetState(State::REQUESTED); ann.m_time = expiry; ``` When combined with the corresponding changes in src/test/fuzz/txrequest.cpp it's actually a net reduction in code. WDYT? -------------------------- sipa src/test/txrequest_tests.cpp:272 2020-10-03T01:39:41Z Not sure what you mean here. There are four paths through the cascade of `if`s here and each value of `config >> 3` selects one of them. -------------------------- sipa src/test/txrequest_tests.cpp:297 2020-10-03T01:57:14Z Turned it into a config bit. -------------------------- sipa src/test/txrequest_tests.cpp:265 2020-10-03T01:57:24Z Yes. -------------------------- sipa src/test/txrequest_tests.cpp:350 2020-10-03T01:57:41Z Yeah, I've added some comments. -------------------------- sipa src/test/txrequest_tests.cpp:373 2020-10-03T01:58:05Z Not really, one in 8. Given that every number of peers runs 30 times, I think that's plenty. -------------------------- sipa src/txrequest.cpp:282 2020-10-03T01:58:57Z Done. -------------------------- sipa src/test/txrequest_tests.cpp:63 2020-10-03T02:04:31Z The sum of 3 uniformly random values is not uniformly random. I think the sum matches better with the distribution of timestamps that end up being generated during the scenarios (which are also sums of uniformly random values). -------------------------- sipa src/test/txrequest_tests.cpp:511 2020-10-03T02:07:16Z Some variation in start time was achieved in the Scenario constructor. I've moved it here. -------------------------- sipa src/txrequest.h:151 2020-10-03T02:22:37Z Added as an extra commit on top. -------------------------- sipa src/test/txrequest_tests.cpp:485 2020-10-03T02:23:36Z I assume not, as this test only looks at what is being requested. Can you write out the actual scenario you have in mind (or even better, write a commit that adds it)? -------------------------- sipa src/test/txrequest_tests.cpp:353 2020-10-03T02:23:53Z I've changed the comments here a bit. -------------------------- sipa src/txrequest.cpp:560 2020-10-03T08:15:53Z In a TxRequestTracker with N peers and 5000 (unique) txids each, in a stable state (where `GetRequestable` returns nothing): * N=10, around 50 ns per `GetRequestable`. * N=20, 60 ns * N=50, 80 ns * N=100, 90 ns * N=200, 180 ns * N=500, 230 ns * N=1000, 240 ns * N=2000, 500 ns. * N=20000, 1500 ns. 
(these numbers grow much faster than what would be expected from O(log n) growth, but I suspect memory hierarchy effects play a role) FWIW, calling ReceivedInv 500000 times in a row takes around 1us per call. -------------------------- ajtowns src/txrequest.h:151 2020-10-03T08:54:13Z I think that ~without~ following this change all the remaining asserts are redundant internal checks, rather than "invalid use of API" ones, which I think is a plus. Also makes perfect sense (in retrospect) that that would then make the fuzz tester a bunch simpler. I think that means that you can call TxRequestTracker methods with any params and the behaviour is well-defined, reasonably logical, and as efficient as possible. I think this even maintains the amortized complexity, which surprises me. So yeah, I like this! Maybe change the "There is no CANDIDATE_BEST entry" comment to note that this path only occurs if the API caller requests a txid that wasn't just returned from `GetRequestable` -- ie, they either have a bug due to a race condition of not dealing with the results of GetRequestable for one peer before calling it again later, or they're doing their own selection of which announcement to request beyond relying on TxRequestTracker's algorithm? Maybe the "give up" case would be clearer if there was a bit more of the "why" spelled out too? ```c++ if (it == m_index.get().end()) { // This txid wasn't tracked for this peer; caller should have called ReceivedInv() first. return; } else if (it->GetState() != State::CANDIDATE_DELAYED && it->GetState() != State::CANDIDATE_READY) { // Only progress to REQUESTED from CANDIDATE to ensure the state machine is finite return; } ``` -------------------------- sipa src/txrequest.h:151 2020-10-03T19:03:27Z Done, switched to the fully-defined version (squashed into the appropriate commits). I added some of @ajtowns's comments, and more. -------------------------- sipa src/txrequest.cpp:560 2020-10-04T09:12:42Z Another benchmark: I tweaked the unit test: * Run all tests 12000 times * Sanity checks turned off * Randomly removed 90% of GetRequestable calls (there is a far from representative number of them otherwise). ... and then timed the actual execution of the queued actions, while gathering statistics about the number of calls and the Size() of the data structure: * Runtime: 7.38015 s * Size: 620626 +- 177327 * ReceivedInvs calls: 6494712 * RequestedTx calls: 678552 * ReceivedResponse calls: 2664704 * DisconnectedPeer calls: 3490732 * ForgetTxHash calls: 145404 * GetRequestable calls: 3581653 So the size is relatively close to 625000 (the limit at 5000 tx/peer, 125 peers). The scenarios themselves are not representative for this, as they have far fewer transactions per peer. However, I believe there is reason to assume that the number of tx/peer doesn't matter (except for the peak latency of DisconnectedPeer, but averaged out, it's still amortized). So this means that the average time per announcement (summed over all calls affecting that announcement) in these scenarios (which may be nonrepresentative) is around 1.1 us/announcement.
These are indeed different! Thanks for removing the for loop. Makes it a lot clearer what's going on. It would be better if these lines were above the `TxValidationState state;` declaration, but that's maybe unrelated to this PR, since the code you're replacing should also have been above that declaration. -------------------------- jnewbery src/txrequest.cpp:602 2020-10-05T09:37:27Z I think we'll also enter this branch if there's an announcement for this peer/txhash that's REQUESTED or COMPLETED. If that's true, I think we should update the comment to "The txhash was not tracked for this peer or has already been requested from this peer..." -------------------------- jnewbery src/txrequest.cpp:278 2020-10-05T09:37:57Z It's still here :) -------------------------- jnewbery src/txrequest.cpp:560 2020-10-05T09:56:08Z These seem like good numbers. I was going to suggest moving `SetTimePoint()` out of `GetRequestable()` and only calling it once per message handler thread loop (e.g. in a global tasks function like https://github.com/bitcoin/bitcoin/pull/19364/commits/0ea9abf9b4c3694dede390b759df01c1ce0d3166), since it doesn't need to be called in the context of an individual peer, and it doesn't need to be called frequently. However, it doesn't seem like there's a good reason to do that from a performance standpoint. -------------------------- jnewbery src/txrequest.h:151 2020-10-05T09:57:39Z I also like this. Good change! -------------------------- jnewbery src/txrequest.cpp:339 2020-10-05T10:20:50Z ```suggestion assert(it_last != m_index.get().begin() && it_last->m_txhash == item.first); ``` `it_last` is the return value of a `std::prev()` call, so I think by definition it can't be the `end` iterator. Alternatively you could do something like: ``` auto it_last = m_index.get().upper_bound( ByTxHashView{item.first, State::COMPLETED, })); assert(it_last != m_index.get().end() && it_last->m_txhash == item.first); auto it_next = std::next(it_last); assert(it_next == m_index.get().end() || it_next->m_txhash != item.first); ``` (which would remove the need for a dummy `TOO_LARGE` entry in the State enum). -------------------------- jnewbery src/net_processing.cpp:755 2020-10-05T10:39:20Z I think `AddTxAnnouncement` or similar would be a better name for this function. Calling this function does not imply that we'll ever request the transaction from this peer. -------------------------- jnewbery src/net_processing.cpp:774 2020-10-05T10:43:42Z Move this two lines down (so it's immediately above the one place where `overloaded` is used). Could also mark these two bools as `const`, but meh. -------------------------- jnewbery src/net_processing.cpp:3694 2020-10-05T10:46:45Z This comment is now wrong. -------------------------- jnewbery src/net_processing.cpp:3710 2020-10-05T10:54:44Z A peer can cause us to do up to 5016 calls into `ReceivedResponse()` per `notfound` message, where each `ReceivedResponse()` call results in two `find()` operations into the index that can be up to 625k. Is that likely to be costly? Previously, the limit on `notfound` processing was 116 `find()` calls into a map of maxsize 100. -------------------------- jnewbery src/net_processing.cpp:3686 2020-10-05T10:56:09Z Delete this comment. -------------------------- jnewbery src/net_processing.cpp:3687 2020-10-05T10:56:30Z It doesn't make much difference, but this cs_main scope can now move down into the if block.
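--------------------------
A toy illustration of the last point above, that the lock scope can move down into the if block: the guard (`LOCK(cs_main)` in net_processing, a plain `std::lock_guard` here) is taken only on the branch that actually touches the protected state. Names are illustrative.
```c++
#include <mutex>

std::mutex g_mutex;
bool g_flag = false;

void Process(bool needs_state)
{
    // Work that does not touch shared state stays outside the lock.
    if (needs_state) {
        std::lock_guard<std::mutex> lock{g_mutex}; // lock only where required
        g_flag = !g_flag;
    }
}

int main()
{
    Process(true);
    Process(false);
}
```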
-------------------------- ajtowns src/txrequest.cpp:560 2020-10-05T11:04:04Z Moving `SetTimePoint()` out of `GetRequestable()` might have better worst-case behaviour in the event of arbitrarily non-monotonic clocks In particular, currently you could have, say, two peers, peer=1 with no announcements, and peer=2 with N announcements all with a reqtime = t. If you call `GetRequestable(peer=1, now=t+1)` then you'll get an empty vector as the result, but update all N announcements to CANDIDATE_BEST, but if time goes backwards, and you then call `GetRequestable(peer=2, now=t-1)` you'll update all the announcements back to CANDIDATE_DELAYED, and all get an empty vector, and make no progress. If you did `SetTimePoint()` first, then called `GetRequestable()` for each peer without updating the timepoint, you'd either have `now=t-1` and not do any work, or you'd have `now=t+1` and would make forward progress. I think the constraint would be "call SetTimePoint, then for each peer call GetRequestable, and for each resulting tx, call RequestedTx for it" with the result being something along the lines of "amortized time complexity is `O(log(max_concurrent_announcements))`". FWIW, I think something along those lines is worth considering as a post-merge TODO -- perhaps as part of doing a fuzz tester for how net_processing uses txrequest or something, but don't think there is a need to hold things up for it. -------------------------- jnewbery src/net_processing.cpp:757 2020-10-05T11:14:10Z I also have a slight preference that this is guarded by its own mutex since I don't like adding yet more non-validation state to cs_main, but agree that it should be straightforward to move it later. -------------------------- ajtowns src/txrequest.cpp:157 2020-10-05T11:19:41Z `(txhash, state, priority)` and "Note, `priority == 0` whenever `state != CANDIDATE_READY`" below might be clearer. -------------------------- ajtowns src/txrequest.cpp:406 2020-10-05T11:35:22Z `assert(new_state == COMPLETED || new_state == CANDIDATE_DELAYED);` ? The existing `assert(!it->IsSelected());` at the end would also allow `new_state == CANDIDATE_READY` which I don't think would be handled correctly (in that if it were the first READY, it should be assigned BEST but will not be). -------------------------- ajtowns src/txrequest.cpp:339 2020-10-05T12:33:27Z `ByTxHashView` takes the uint256 as the first param, the last one is just a uint64_t Priority, so replacing with `COMPLETED, std::numeric_limits::max()` should be straightforward (`1` would also be fine, since COMPLETED means priority is `0`). Seems to pass the fuzz tester okay. Adding a `static constexpr auto PRIORITY_MAX` alias and using it for `m_priority_best_candidate_ready` as well may make sense. But... is this test even useful? `it_last` was needed in order to sanity check the first markers, but with them gone now... 
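--------------------------
A sketch of the caller pattern ajtowns outlines above ("call SetTimePoint, then for each peer call GetRequestable, and for each resulting tx, call RequestedTx"). `MiniTracker` is a stub standing in for TxRequestTracker; only the call ordering is the point here, not the real interface or behaviour.
```c++
#include <chrono>
#include <cstdint>
#include <vector>

struct MiniTracker {
    void SetTimePoint(std::chrono::microseconds) {}
    std::vector<uint64_t> GetRequestable(int64_t /*peer*/) { return {1, 2, 3}; }
    void RequestedTx(int64_t /*peer*/, uint64_t /*txhash*/, std::chrono::microseconds /*expiry*/) {}
};

int main()
{
    using namespace std::chrono_literals;
    MiniTracker tracker;
    const std::vector<int64_t> peers{0, 1};
    const auto now = 1000s;

    tracker.SetTimePoint(now); // advance the time point once per loop
    for (const int64_t peer : peers) {
        for (const uint64_t txhash : tracker.GetRequestable(peer)) {
            // ... send the actual request for txhash here ...
            tracker.RequestedTx(peer, txhash, now + 60s); // mark as requested, with an expiry
        }
    }
}
```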
-------------------------- ajtowns src/txrequest.cpp:51 2020-10-05T12:40:18Z I think the constraints on the ordering here are: * COMPLETED comes last, so that looping `it = Erase(it)` starting from the only non-completed entry for a txhash will erase all the entries for that txhash * BEST and REQUESTED are next to each other so that it's easy to enforce the "only one best or requestable per txhash" * BEST/REQUESTED comes immediately before/after READY so that it's easy to compare a new best READY with the existing BEST But I think there's no longer anything preventing changing the ordering to match the expected transitions, ie `DELAYED, READY, BEST, REQUESTED, COMPLETED` (and changing the best priority to be the highest value rather than the lowest). I'm presuming that's due to no longer doing first markers, but maybe I've missed something? -------------------------- ajtowns src/txrequest.cpp:181 2020-10-05T13:02:21Z Maybe ```c++ enum class WaitState { WAITING, COMPLETED, SELECTABLE, }; static inline WaitState GetWaitState(const Announcement& ann) { return (ann.IsWaiting() ? WaitState::WAITING : ann.IsSelectable() ? WaitState::SELECTABLE : WaitState::COMPLETED); } ``` and say `sorted by (wait_state, time)`, and describe that it is easy to find all the WAITING announcements prior to a time point (for when the event they were waiting for has happened), and all the SELECTABLE announcements after a time point (for when time goes backwards)? (Alternatively, FUTURE_EVENT, NO_EVENT, and PAST_EVENT would be more generic wait states) -------------------------- ajtowns src/txrequest.cpp:463 2020-10-05T13:18:27Z At some point in the future it would probably be good to expose expiring requests somehow -- the caller might want to deprioritise those peers or disconnect them, eg. -------------------------- ajtowns src/txrequest.cpp:579 2020-10-05T13:39:07Z `ret.reserve(selected.size())` ? -------------------------- ajtowns src/txrequest.cpp:639 2020-10-05T14:10:51Z In future, might be useful to return `it->GetState() == REQUESTED ? it->m_time : min()` here to allow the caller to estimate relay latency for a peer, in case they wanted to use that info to adjust that peer's expiry delay. -------------------------- ajtowns src/txrequest.h:49 2020-10-05T14:13:34Z Should be "already-failed" not "already-requested" per the terminology above ("Whether or not the transaction request failed already (timed out, or NOTFOUND was received)") -------------------------- ajtowns src/test/txrequest_tests.cpp:48 2020-10-05T14:55:01Z Maybe `RandomTime_15s()` and `RandomTime_1y()` (assuming I've done the math right for bits=23 and bits=44 which are the only ways this is called) ? -------------------------- ajtowns src/test/txrequest_tests.cpp:65 2020-10-05T14:56:44Z I'm not really following the unit testing strategy here; in particular what all the interleaving ends up doing, and what cases are really covered, and what the effect of doing some things exhaustively and other things randomly is. Feels like there's a lot of novel scaffolding hiding what's actually tested. It might be good to create some template magic for the Action/Runner/Scenario pattern, and use that more consistently? At the very least the versionbits tests feel to me like they suffer a bit from the same sort of problem?
I think this test setup only simulates a monotonic clock (vs txrequest.cpp's note that supporting time going backwards "makes it much easier to specify and test TxRequestTracker::Impl's behaviour") -------------------------- ajtowns src/txrequest.cpp:235 2020-10-05T15:10:51Z `NodeId` and `Priority` instead of `uint64_t` and `::min()` instead of `m_priority_candidate_best = 0`? -------------------------- ajtowns src/test/fuzz/txrequest.cpp:102 2020-10-05T15:22:25Z `244466666` (pronounced "one 2, three 4, five 6") or `1123581321` are also amusing. Super helpful review comments 'r us! -------------------------- ajtowns src/test/fuzz/txrequest.cpp:111 2020-10-05T15:27:10Z Maybe: ``` if (ann.m_state != NOTHING) { if (ann.m_state != COMPLETED) return; all_nothing = false; } ``` Had to double take to realise that not being CANDIDATE, REQUESTED or NOTHING here meant it was definitely COMPLETED. No big deal. -------------------------- ajtowns src/net_processing.cpp:760 2020-10-05T16:27:25Z `auto* state =` ? Was wondering if we were returning a reference and making an unnecessary copy. -------------------------- ajtowns src/net_processing.cpp:2901 2020-10-05T16:34:22Z `if (tx.HasWitness()) ReceivedResponse(wtxid);` ? Or is the idea that txrequest is so cheap it's better to let it check if the extra call is redundant? -------------------------- ajtowns src/net_processing.cpp:857 2020-10-05T16:39:58Z Might be worth adding some sort of sanity check to see that there aren't any code paths where we fail to call `DisconnectedPeer` and end up leaving `CANDIDATE_BEST` entries in the tracker that will never get cleared? -------------------------- ajtowns src/net_processing.cpp:3710 2020-10-05T17:08:34Z Could just send 44 NOTFOUND messages with 116 entries instead of one with 5016 for almost exactly the same overhead; and they should be plenty fast anyway, I think, especially compared to looking up the UTXO database when we receive a transaction. Since we've increased the in flight limit, I think it makes sense to increase the notfound limit correspondingly. -------------------------- ajtowns src/net_processing.cpp:776 2020-10-05T17:11:23Z Commit for this change says "orphan fetches" but it's really "fetches of orphan's parents"... -------------------------- ajtowns src/txrequest.h:88 2020-10-05T17:33:43Z Random idea that I don't think we should do right now: I think we could limit this attack further if we added a `bool m_horribly_delayed : 1;` bit to each announcement. Change the ordering for choosing BEST to be `(preferred, horribly_delayed, priority)` and flip the horribly_delayed bit as part of `SetTimePoint()` when `req_time + txrequest.m_horrible_delay_time < now`, where `m_horrible_delay_time` is on the order of 10 to 60 minutes. That might also be roughly reasonable logic to do some debug logging so it's possible to catch weird behaviour in the wild. -------------------------- sipa src/net_processing.cpp:757 2020-10-05T18:20:23Z I agree with the goal of eventually having this not under cs_main, but the hard part will be disentangling the existing code calling into txrequest not covered by cs_main. Just adding another lock in addition to cs_main here isn't helping with that part, so I prefer doing this as a follow-up (together with probably a bigger part of net_processing also losing cs_main). -------------------------- sipa src/txrequest.cpp:560 2020-10-05T18:39:08Z Hmm, that's interesting. 
Before the `RequestedTx` change I'd be hesitant about such a change, as it'd make specifying the conditions under which it can be called even more cumbersome. There isn't much concern about that anymore, though. @ajtowns Right, or more generally: if your clock is arbitrarily jittery, increasing the frequency of SetTimePoint calls will result in up to proportionally more work. There may be some impact on how strong the "first non-delayed announcement" effect is, though (which was the reason for dropping the "first" marker rule): if SetTimePoint is only called once per loop, the request will go to any of the peers that announced a txhash in one loop, rather than the actual first one. This may need some investigation. So indeed, I'd keep this for a follow-up. -------------------------- sipa src/test/txrequest_tests.cpp:370 2020-10-05T18:40:02Z I've changed some comments here, let me know if it's better. -------------------------- sipa src/test/txrequest_tests.cpp:379 2020-10-05T18:40:16Z I've changed some comments. -------------------------- sipa src/net_processing.cpp:3710 2020-10-05T19:10:48Z Right, I don't think this is a DoS vector as it's fairly cheap per byte compared to many other requests. The only concern would be latency, e.g. might it cause a significant interruption of a net processing loop to process 5016 NOTFOUNDs at once? I think not, but if it is, we may use the split processing technique used for getdata and orphan processing - independent of what the limit here is. -------------------------- sipa src/net_processing.cpp:776 2020-10-05T20:09:59Z Fixed. -------------------------- sipa src/txrequest.cpp:278 2020-10-05T20:10:19Z Really done. -------------------------- sipa src/net_processing.cpp:2901 2020-10-05T20:10:47Z Yeah, trying to minimize diff - this shouldn't matter much. -------------------------- sipa src/test/txrequest_tests.cpp:261 2020-10-05T20:11:29Z I've made it conditional with an `if InsecureRandBool()` around it. -------------------------- sipa src/txrequest.cpp:602 2020-10-05T20:11:43Z Agreed. I've rewritten this a bit. -------------------------- sipa src/txrequest.cpp:339 2020-10-05T20:12:29Z I've just deleted this (as well as the TOO_LARGE enum value). It was only needed when the SanityCheck code actually needed it_last for the per-txhash flags which are gone now. -------------------------- sipa src/net_processing.cpp:755 2020-10-05T20:12:40Z Ok, renamed. -------------------------- sipa src/net_processing.cpp:774 2020-10-05T20:12:53Z Done, also the meh. -------------------------- sipa src/net_processing.cpp:3694 2020-10-05T20:13:03Z Fixed. -------------------------- sipa src/net_processing.cpp:3686 2020-10-05T20:13:50Z What a wasted opportunity to say "Remove the STALE comment from the pr" instead. Done. -------------------------- sipa src/net_processing.cpp:3687 2020-10-05T20:14:27Z Done. I haven't moved it into the loop as I expect common iterations of the loop to be faster or at least in the same order of magnitude as grabbing a lock. -------------------------- sipa src/txrequest.cpp:157 2020-10-05T20:15:49Z Done, that's more readable indeed. -------------------------- sipa src/txrequest.cpp:406 2020-10-05T20:17:22Z Done. It's borderline, I think, as you could read the comments for this function do not claim they'll do anything but change the state of the passed iterator to the passed value - only about other announcements. Still, better to be explicit so added an assert. 
-------------------------- sipa src/txrequest.cpp:51 2020-10-05T20:18:01Z That would be a more logical ordering. Let me try what would be needed to make this work later. -------------------------- sipa src/txrequest.cpp:181 2020-10-05T20:18:09Z Nice, done. -------------------------- sipa src/txrequest.cpp:463 2020-10-05T20:18:47Z Agree, but that sounds like really a follow-up (or even a follow-up issue to investigate the use of that first). -------------------------- sipa src/txrequest.cpp:579 2020-10-05T20:18:55Z Done. -------------------------- sipa src/txrequest.cpp:639 2020-10-05T20:19:13Z Yeah, future work... -------------------------- sipa src/txrequest.h:49 2020-10-05T20:19:22Z Fixed. -------------------------- sipa src/test/txrequest_tests.cpp:48 2020-10-05T20:19:49Z Done. Your math was almost right (it's 8s and half a year). -------------------------- sipa src/test/txrequest_tests.cpp:65 2020-10-05T20:27:01Z I've added this comment blob to explain it better: > Each Scenario is a proxy through which actions for the (sequential) execution of various tests are added to a Runner. The actions from multiple scenarios are then run concurrently, resulting in these tests being performed against a TxRequestTracker in parallel. Every test has its own unique txhashes and NodeIds which are not reused in other tests, and thus they should be independent from each other. Running them in parallel however means that we verify the behavior (w.r.t. one test's txhashes and NodeIds) even when the state of the data structure is more complicated due to the presence of other tests. Does that help? Abstracting out this scaffolding seems potentially useful, but for a follow-up. > I think this test setup only simulates a monotonic clock (vs txrequest.cpp's note that supporting time going backwards "makes it much easier to specify and test TxRequestTracker::Impl's behaviour") The tests are indeed restricted to a clock that is increasing monotonically between subsequent calls to GetRequestable. I had some infrastructure here earlier to allow a scenario to have both a "real clock" and a "jittered clock", where the first one is used to order Actions, and the second one determines the timestamps passed to `GetRequestable`, but I feared the mental overhead to that isn't worth it. Whether TxRequestTracker does something reasonable under insane call sequences is tested by the fuzzer already, so I think this one can be restricted to more normal situations, and test that what's done isn't just reasonable, but also matches more high-level expected behavior. -------------------------- sipa src/txrequest.cpp:235 2020-10-05T20:27:34Z Done. -------------------------- sipa src/test/fuzz/txrequest.cpp:102 2020-10-05T20:27:52Z Major improvement, done! -------------------------- sipa src/test/fuzz/txrequest.cpp:111 2020-10-05T20:28:07Z Done. -------------------------- sipa src/net_processing.cpp:760 2020-10-05T20:28:33Z Clearly this shouldn't be an `auto` type if it's confusing. Just made the type explicit. -------------------------- sipa src/net_processing.cpp:2901 2020-10-05T20:29:13Z I don't think it matters much as it's fairly cheap, plus doesn't change worst case behaviour (attacker can use witness-carrying transactions). Still, done. -------------------------- sipa src/net_processing.cpp:857 2020-10-05T20:29:24Z Good idea, done. -------------------------- sipa src/txrequest.h:88 2020-10-05T20:30:17Z Nice idea. I was thinking in a similar direction, but couldn't figure out how to make it efficient. 
Updating the flag after the fact is a neat solution. -------------------------- ariard src/txrequest.h:175 2020-10-05T23:53:33Z I think that describing ForgetTxHash as something to be called when a good transaction is received doesn't match its usage. We call ForgetTxHash (8ef5953), in net_processing, at transaction reception:
* L2933, when a transaction has been accepted to the mempool
* L2996, when an orphan transaction is added to the orphan pool
* L3014, when an orphan transaction is added to the rejection filter because of an invalid parent
* L3034, when a non-witness stripped transaction has failed mempool acceptance and is added to the rejection filter

In the three latter cases, the transaction can't be qualified as mempool-valid. Beyond block connection, it might be better to say, "ForgetTxHash should be called once the transaction has been evaluated at least once by the mempool". What we do with this evaluation result is not a matter for the TxRequestTracker? -------------------------- ariard src/test/fuzz/txrequest.cpp:302 2020-10-06T00:32:34Z Announcement sizes are only compared at fuzzer input exhaustion, not after every `ReceivedInv`. Would it add value to check them after every `ReceivedInv`, to catch Tester/TxRequestTracker divergence at that point? They may converge at the end, and thus a divergence during the run wouldn't be detected? -------------------------- sipa src/test/fuzz/txrequest.cpp:302 2020-10-06T01:23:55Z In that case, a shorter fuzzer input would have caught the issue instead, I think. Perhaps it's worth seeing how much performance impact doing it all the time has; if it's not too much maybe it's worth changing. -------------------------- sipa src/txrequest.h:175 2020-10-06T01:46:19Z You're right that the description doesn't match the usage in net_processing here, but "has been evaluated by the mempool" is perhaps incorrect as well. If it's seen in a block, it can be forgotten as well, for example. Really the right criterion is: has it been made AlreadyHaveTx()==true. If we'd delete something that isn't AlreadyHaveTx(), we risk fetching it again if it's announced by another peer, despite presumably having seen it already. If we don't delete something that is AlreadyHaveTx(), we risk fetching it again from the same peer despite the same. I've adapted the comments a bit and tried to formulate it in a way that doesn't depend on net_processing specifics. -------------------------- MarcoFalke src/txrequest.cpp:71 2020-10-06T06:51:55Z In commit ee4f89a44b: There is no warning for me. If there is a bug or a false positive warning in an ancient gcc version, we often ignore that instead of "crippling" the code. Mind sharing what kind of warning this is, and what gcc version was used? Maybe even add it to that comment. -------------------------- MarcoFalke src/txrequest.cpp:299 2020-10-06T07:33:52Z in commit ee4f89a44b: Any reason to have ONE defined in uint256.h, but ZERO here? -------------------------- jnewbery src/net_processing.cpp:3710 2020-10-06T09:19:21Z Yes, seems reasonable. Perhaps at some point in the future we might also want to have a return value from `ReceivedResponse()` so net_processing can know whether the peer is sending spurious notfound entries. -------------------------- jnewbery src/txrequest.cpp:560 2020-10-06T09:22:21Z Marking as resolved. -------------------------- jnewbery src/net_processing.cpp:2901 2020-10-06T09:25:02Z Agree it doesn't matter at all. It just offends my aesthetic sensibilities. Marking resolved.
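As a rough illustration of the criterion settled on above (forget a txhash exactly when it has become `AlreadyHaveTx() == true`), the net_processing side might look something like the following after a transaction is accepted to the mempool; the member name `m_txrequest` and the exact call sites are assumptions based on this thread, not a quote of the PR:
```c++
// Once the tx is in the mempool it is AlreadyHaveTx()==true under both of its
// identifiers, so stop tracking announcements for both the txid and the wtxid.
m_txrequest.ForgetTxHash(tx.GetHash());          // txid
m_txrequest.ForgetTxHash(tx.GetWitnessHash());   // wtxid
```
The same pair of calls would presumably follow the orphan-pool and reject-filter paths listed above, since those also make `AlreadyHaveTx()` return true.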
-------------------------- MarcoFalke src/txrequest.cpp:452 2020-10-06T10:51:10Z in commit ee4f89a44b: The comment says "fail", the code says "return false", this seems inconsistent. This can't happen in normal operation. If this happens, it would be a logic error, so `assert` seems appropriate? Or maybe even remove the dead code? Edit: According to the coverage report this *is* hit, so I might have to take another look here. -------------------------- jnewbery src/txrequest.cpp:438 2020-10-06T11:58:57Z This assert seems redundant now that you have the `assert(new_state == State::COMPLETED || new_state == State::CANDIDATE_DELAYED);` at the top. It's not doing any harm, but it seems unnecessary. -------------------------- MarcoFalke src/txrequest.h:152 2020-10-06T12:09:03Z in commit ee4f89a44b41a860f2bda98957db2d6871a8e8ea: The same peer can't have duplicate hashes, so this can be clarified to say "different peer" ```suggestion * (reqtime <= now) for which no existing REQUESTED announcement with the same txhash exists from a different peer, and for which ``` -------------------------- MarcoFalke src/test/txrequest_tests.cpp:93 2020-10-06T13:05:03Z in commit a9a0504f12cb3c301bc56cc5f8f59ca57c1dc433: The default seems to be unused ```suggestion void ReceivedInv(NodeId peer, const GenTxid& gtxid, bool pref, std::chrono::microseconds reqtime) ``` -------------------------- MarcoFalke src/test/txrequest_tests.cpp:280 2020-10-06T13:12:42Z in commit a9a0504f12cb3c301bc56cc5f8f59ca57c1dc433: Can be written shorter ```suggestion scenario.AdvanceTime(GetRandMicros(expiry)); ``` -------------------------- MarcoFalke src/test/txrequest_tests.cpp:285 2020-10-06T13:39:32Z same commit: ```suggestion // Two peers. They will announce in order. ``` Am I missing something here? There are only two (and a reference to either of them) -------------------------- MarcoFalke src/test/txrequest_tests.cpp:308 2020-10-06T13:40:11Z same commit: ```suggestion NodeId priopeer = stage2_prio ? peer2 : peer1, otherpeer = stage2_prio ? peer1 : peer2; ``` -------------------------- jnewbery src/txrequest.cpp:452 2020-10-06T14:31:29Z 'fail' here means 'there exists another announcement for this txhash which isn't COMPLETED' -------------------------- jnewbery src/txrequest.cpp:583 2020-10-06T14:51:03Z We can avoid using an awkward pair here by constructing a vector of pointers and then use a custom comparator function to sort: ```diff --- a/src/txrequest.cpp +++ b/src/txrequest.cpp @@ -101,6 +101,9 @@ struct Announcement { m_is_wtxid(gtxid.IsWtxid()), m_state(uint8_t(State::CANDIDATE_DELAYED)) {} }; +bool seq_comparator(const Announcement* ann1, const Announcement* ann2) { return ann1->m_sequence < ann2->m_sequence; } +GenTxid gtxid_extractor(const Announcement* ann) { return GenTxid{ann->m_is_wtxid, ann->m_txhash}; } + //! Type alias for priorities. using Priority = uint64_t; @@ -581,21 +584,19 @@ public: SetTimePoint(now); // Find all CANDIDATE_BEST announcements for this peer. - std::vector> selected; + std::vector selected; auto it_peer = m_index.get().lower_bound(ByPeerView{peer, true, UINT256_ZERO}); while (it_peer != m_index.get().end() && it_peer->m_peer == peer && it_peer->GetState() == State::CANDIDATE_BEST) { - selected.emplace_back(it_peer->m_sequence, &*it_peer); + selected.emplace_back(&*it_peer); ++it_peer; } // Return them, sorted by sequence number. 
- std::sort(selected.begin(), selected.end()); + std::sort(selected.begin(), selected.end(), seq_comparator); std::vector ret; ret.reserve(selected.size()); - for (const auto& item : selected) { - ret.emplace_back(item.second->m_is_wtxid, item.second->m_txhash); - } + std::transform(selected.begin(), selected.end(), std::back_inserter(ret), gtxid_extractor); return ret; } ``` I'm not sure if that's more or less clear code, but thought I'd throw it out there as a suggestion. -------------------------- sipa src/test/txrequest_tests.cpp:485 2020-10-06T17:58:25Z Marking this as resolved, as @ariard wrote a test that was squashed in. -------------------------- sipa src/txrequest.cpp:51 2020-10-06T18:57:32Z This was indeed not a problem at all. I've applied the following diff: ```patch diff --git a/src/test/fuzz/txrequest.cpp b/src/test/fuzz/txrequest.cpp index 0ff00d23e0..ff32de25eb 100644 --- a/src/test/fuzz/txrequest.cpp +++ b/src/test/fuzz/txrequest.cpp @@ -129,7 +129,7 @@ class Tester if (ann.m_state == State::REQUESTED) return -1; // If it's a viable candidate, see if it has lower priority than the best one so far. if (ann.m_state == State::CANDIDATE && ann.m_time <= m_now) { - if (ret == -1 || ann.m_priority < ret_priority) { + if (ret == -1 || ann.m_priority > ret_priority) { std::tie(ret, ret_priority) = std::tie(peer, ann.m_priority); } } @@ -252,7 +252,7 @@ public: } // And delete txids with only COMPLETED announcements left. Cleanup(txhash); - // CANDIDATEs for which this announcement has the lowest priority get returned. + // CANDIDATEs for which this announcement has the highest priority get returned. const Announcement& ann = m_announcements[txhash][peer]; if (ann.m_state == State::CANDIDATE && GetSelected(txhash) == peer) { result.emplace_back(ann.m_sequence, txhash, ann.m_is_wtxid); diff --git a/src/test/txrequest_tests.cpp b/src/test/txrequest_tests.cpp index a84dfe67f9..39a4f86ea8 100644 --- a/src/test/txrequest_tests.cpp +++ b/src/test/txrequest_tests.cpp @@ -162,8 +162,8 @@ public: /** Generate a random txhash, whose priorities for certain peers are constrained. * * For example, NewTxHash({{p1,p2,p3},{p2,p4,p5}}) will generate a txhash T such that both: - * - priority(p1,T) < priority(p2,T) < priority(p3,T) - * - priority(p2,T) < priority(p4,T) < priority(p5,T) + * - priority(p1,T) > priority(p2,T) > priority(p3,T) + * - priority(p2,T) > priority(p4,T) > priority(p5,T) * where priority is the predicted internal TxRequestTracker's priority, assuming all announcements * are within the same preferredness class. */ @@ -178,7 +178,7 @@ public: for (size_t pos = 1; pos < order.size(); ++pos) { uint64_t prio_prev = m_runner.txrequest.ComputePriority(ret, order[pos - 1], true); uint64_t prio_cur = m_runner.txrequest.ComputePriority(ret, order[pos], true); - if (prio_prev >= prio_cur) { + if (prio_prev <= prio_cur) { ok = false; break; } diff --git a/src/txrequest.cpp b/src/txrequest.cpp index d07a5203cd..a5455e2305 100644 --- a/src/txrequest.cpp +++ b/src/txrequest.cpp @@ -38,14 +38,14 @@ namespace { enum class State : uint8_t { /** A CANDIDATE announcement whose reqtime is in the future. */ CANDIDATE_DELAYED, + /** A CANDIDATE announcement that's not CANDIDATE_DELAYED or CANDIDATE_BEST. */ + CANDIDATE_READY, /** The best CANDIDATE for a given txhash; only if there is no REQUESTED announcement already for that txhash. 
- * The CANDIDATE_BEST is the lowest-priority announcement among all CANDIDATE_READY (and _BEST) ones for that + * The CANDIDATE_BEST is the highest-priority announcement among all CANDIDATE_READY (and _BEST) ones for that * txhash. */ CANDIDATE_BEST, /** A REQUESTED announcement. */ REQUESTED, - /** A CANDIDATE announcement that's not CANDIDATE_DELAYED or CANDIDATE_BEST. */ - CANDIDATE_READY, /** A COMPLETED announcement. */ COMPLETED, }; @@ -106,7 +106,7 @@ using Priority = uint64_t; /** A functor with embedded salt that computes priority of an announcement. * - * Lower priorities are selected first. + * Higher priorities are selected first. */ class PriorityComputer { const uint64_t m_k0, m_k1; @@ -118,7 +118,7 @@ public: Priority operator()(const uint256& txhash, NodeId peer, bool preferred) const { uint64_t low_bits = CSipHasher(m_k0, m_k1).Write(txhash.begin(), txhash.size()).Write(peer).Finalize() >> 1; - return low_bits | uint64_t{!preferred} << 63; + return low_bits | uint64_t{preferred} << 63; } Priority operator()(const Announcement& ann) const @@ -244,10 +244,10 @@ struct TxHashInfo size_t m_candidate_best = 0; //! Number of REQUESTED announcements for this txhash. size_t m_requested = 0; - //! The priority of the CANDIDATE_BEST announcement if one exists, or min() otherwise. - Priority m_priority_candidate_best = std::numeric_limits::min(); - //! The lowest priority of all CANDIDATE_READY announcements (or max() if none exist). - Priority m_priority_best_candidate_ready = std::numeric_limits::max(); + //! The priority of the CANDIDATE_BEST announcement if one exists, or max() otherwise. + Priority m_priority_candidate_best = std::numeric_limits::max(); + //! The lowest priority of all CANDIDATE_READY announcements (or min() if none exist). + Priority m_priority_best_candidate_ready = std::numeric_limits::min(); //! All peers we have an announcement for this txhash for. std::vector m_peers; }; @@ -288,7 +288,7 @@ std::map ComputeTxHashInfo(const Index& index, const Priori info.m_priority_candidate_best = computer(ann); } if (ann.GetState() == State::CANDIDATE_READY) { - info.m_priority_best_candidate_ready = std::min(info.m_priority_best_candidate_ready, computer(ann)); + info.m_priority_best_candidate_ready = std::max(info.m_priority_best_candidate_ready, computer(ann)); } // Also keep track of which peers this txhash has an announcement for (so we can detect duplicates). info.m_peers.push_back(ann.m_peer); @@ -341,7 +341,7 @@ public: // If there is both a CANDIDATE_READY and a CANDIDATE_BEST announcement, the CANDIDATE_BEST one must be // at least as good (equal or lower priority) as the best CANDIDATE_READY. if (info.m_candidate_ready && info.m_candidate_best) { - assert(info.m_priority_candidate_best <= info.m_priority_best_candidate_ready); + assert(info.m_priority_candidate_best >= info.m_priority_best_candidate_ready); } // No txhash can have been announced by the same peer twice. @@ -399,21 +399,21 @@ private: // Convert CANDIDATE_DELAYED to CANDIDATE_READY first. Modify(it, [](Announcement& ann){ ann.SetState(State::CANDIDATE_READY); }); // The following code relies on the fact that the ByTxHash is sorted by txhash, and then by state (first - // _DELAYED, then _BEST/REQUESTED, then _READY). Within the _READY announcements, the best one (lowest - // priority) comes first. Thus, if an existing _BEST exists for the same txhash that this announcement may - // be preferred over, it must immediately precede the newly created _READY. 
- if (it == m_index.get().begin() || std::prev(it)->m_txhash != it->m_txhash || - std::prev(it)->GetState() == State::CANDIDATE_DELAYED) { + // _DELAYED, then _READY, then _BEST/REQUESTED). Within the _READY announcements, the best one (highest + // priority) comes last. Thus, if an existing _BEST exists for the same txhash that this announcement may + // be preferred over, it must immediately follow the newly created _READY. + auto it_next = std::next(it); + if (it_next == m_index.get().end() || it_next->m_txhash != it->m_txhash || + it_next->GetState() == State::COMPLETED) { // This is the new best CANDIDATE_READY, and there is no IsSelected() announcement for this txhash // already. Modify(it, [](Announcement& ann){ ann.SetState(State::CANDIDATE_BEST); }); - } else if (std::prev(it)->GetState() == State::CANDIDATE_BEST) { - Priority priority_old = m_computer(*std::prev(it)); + } else if (it_next->GetState() == State::CANDIDATE_BEST) { + Priority priority_old = m_computer(*it_next); Priority priority_new = m_computer(*it); - if (priority_new < priority_old) { + if (priority_new > priority_old) { // There is a CANDIDATE_BEST announcement already, but this one is better. - auto new_ready_it = std::prev(it); - Modify(new_ready_it, [](Announcement& ann){ ann.SetState(State::CANDIDATE_READY); }); + Modify(it_next, [](Announcement& ann){ ann.SetState(State::CANDIDATE_READY); }); Modify(it, [](Announcement& ann){ ann.SetState(State::CANDIDATE_BEST); }); } } @@ -425,14 +425,13 @@ private: { assert(new_state == State::COMPLETED || new_state == State::CANDIDATE_DELAYED); assert(it != m_index.get().end()); - if (it->IsSelected()) { - auto it_next = std::next(it); - // The next best CANDIDATE_READY, if any, immediately follows the REQUESTED or CANDIDATE_BEST + if (it->IsSelected() && it != m_index.get().begin()) { + auto it_prev = std::prev(it); + // The next best CANDIDATE_READY, if any, immediately preceeds the REQUESTED or CANDIDATE_BEST // announcement in the ByTxHash index. - if (it_next != m_index.get().end() && it_next->m_txhash == it->m_txhash && - it_next->GetState() == State::CANDIDATE_READY) { + if (it_prev->m_txhash == it->m_txhash && it_prev->GetState() == State::CANDIDATE_READY) { // If one such CANDIDATE_READY exists (for this txhash), convert it to CANDIDATE_BEST. - Modify(it_next, [](Announcement& ann){ ann.SetState(State::CANDIDATE_BEST); }); + Modify(it_prev, [](Announcement& ann){ ann.SetState(State::CANDIDATE_BEST); }); } } Modify(it, [new_state](Announcement& ann){ ann.SetState(new_state); }); ``` -------------------------- sipa src/net_processing.cpp:85 2020-10-06T20:05:22Z Done. I've changed it to `static constexpr auto VARNAME = std::chrono::seconds{2};`, as I think it's slightly more readable to have the unit and the magnitude together. -------------------------- sipa src/txrequest.cpp:71 2020-10-06T20:06:32Z It was in GCC 9.3, but I believe it was in a much older iteration of this code, where there was a `switch` on `State`, and GCC would always complain about unhandled cases. I see no more warning, so I've dropped this (along with `GetState`/`SetState`). -------------------------- sipa src/txrequest.cpp:299 2020-10-06T20:06:51Z None at all. I've moved it to uint256.h. -------------------------- sipa src/txrequest.cpp:452 2020-10-06T20:09:12Z I've updated the comments a bit here. It's indeed possible: e.g. 
if you have two `CANDIDATE_DELAYED` Announcements with the same txhash (and thus distinct peers), and the first of the two goes offline (`DisconnectedPeer`), then this function would be called to determine if that Announcement was the last non-`COMPLETED` one. `std::next(it)` isn't `end()`, has the same `m_txhash`, and isn't `COMPLETED`. -------------------------- sipa src/txrequest.h:152 2020-10-06T20:09:25Z Done. -------------------------- sipa src/test/txrequest_tests.cpp:93 2020-10-06T20:09:35Z Done. -------------------------- sipa src/test/txrequest_tests.cpp:280 2020-10-06T20:10:12Z That would use the real RNG rather than the faster test-only `g_insecure_rand_ctx`. -------------------------- sipa src/test/txrequest_tests.cpp:285 2020-10-06T20:11:07Z Fixed. Leftover from an old version of the code. The situation with 3 peers (and more) is now covered by `BuildBigPriorityTest`. -------------------------- sipa src/test/txrequest_tests.cpp:308 2020-10-06T20:11:16Z Done. -------------------------- sipa src/txrequest.cpp:438 2020-10-06T20:11:25Z Done. -------------------------- sipa src/txrequest.cpp:583 2020-10-06T20:12:10Z Done, with lambdas instead of separate functions (as they're only used in one place). -------------------------- ajtowns src/txrequest.cpp:583 2020-10-06T20:51:19Z Hmm, in bad cases (where GetRequestable is returning a lot of txhashes, there are a lot of announcements, and the relevant announcements are fragmented) that would be a fair bit less cache friendly than the old code (sort would be re-accessing each entry an additional `O(log(selected.size()))` times). But in reality the 5000 ann limit means at worst it just moves from sorting an 80kB array in L1 to sorting a 40kB array based on 800kB of data in L2 cache, so it probably doesn't matter. (Or rather, it's probably not even worth gathering the performance data that would be needed to make a decision based on one being faster or not) -------------------------- sipa src/txrequest.cpp:583 2020-10-06T20:59:22Z > Or rather, it's probably not even worth gathering the performance data that would be needed to make a decision based on one being faster or not. Yeah. -------------------------- jnewbery src/txrequest.cpp:583 2020-10-06T21:59:38Z > Done, with lambdas instead of separate functions (as they're only used in one place). Ha. I went the other way - did this with lambdas first and then moved them to named functions because the lines were getting a bit long. Either way seems fine. > it's probably not even worth gathering the performance data that would be needed to make a decision based on one being faster or not Yeah, I can see this being incrementally worse performance in the worst case, but I think readability wins. -------------------------- hebasto src/test/fuzz/txrequest.cpp:7 2020-10-07T06:00:28Z 8d0dd46b4fcdf4c664747e638eda046f8d51079c nit: Lexicographic order for headers? -------------------------- hebasto src/test/fuzz/txrequest.cpp:22 2020-10-07T06:01:20Z 8d0dd46b4fcdf4c664747e638eda046f8d51079c ```suggestion //! Randomly generated GenTxids used in this test (length is MAX_TXHASHES). ``` -------------------------- hebasto src/test/fuzz/txrequest.cpp:38 2020-10-07T06:04:21Z 8d0dd46b4fcdf4c664747e638eda046f8d51079c What is the reason for these conditions? Why does any `rand256()` not suit? -------------------------- hebasto src/test/fuzz/txrequest.cpp:46 2020-10-07T06:09:01Z 8d0dd46b4fcdf4c664747e638eda046f8d51079c Could the only `int i{0};` before all of the `for` loops be more readable and maintainable? 
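For reference, the lambda-based variant of the GetRequestable sorting mentioned in the txrequest.cpp:583 thread above would look roughly like this; member names follow the diffs earlier in the thread, but this is an illustrative sketch rather than the merged code:
```c++
// Collect pointers to this peer's CANDIDATE_BEST announcements, order them by
// announcement sequence number, then convert to GenTxids for the caller.
std::vector<const Announcement*> selected;
// ... populated from the ByPeer index as in the diff above ...
std::sort(selected.begin(), selected.end(),
          [](const Announcement* a, const Announcement* b) { return a->m_sequence < b->m_sequence; });
std::vector<GenTxid> ret;
ret.reserve(selected.size());
for (const Announcement* ann : selected) {
    ret.emplace_back(ann->m_is_wtxid, ann->m_txhash);
}
return ret;
```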
-------------------------- hebasto src/test/fuzz/txrequest.cpp:243 2020-10-07T06:43:01Z 8d0dd46b4fcdf4c664747e638eda046f8d51079c ```suggestion std::vector> result; //!< list of (sequence number, txhash, is_wtxid) tuples. ``` -------------------------- hebasto src/test/fuzz/txrequest.cpp:313 2020-10-07T06:44:13Z 8d0dd46b4fcdf4c664747e638eda046f8d51079c `PENDING` ? -------------------------- hebasto src/test/txrequest_tests.cpp:259 2020-10-07T06:47:22Z ea7839c8889864be174b85a1f4de326f5e12e430, typo: ```suggestion if ((config >> 3) == 3) { // A response will arrive for the transaction ``` -------------------------- hebasto src/test/txrequest_tests.cpp:457 2020-10-07T06:48:10Z ea7839c8889864be174b85a1f4de326f5e12e430, typo: ```suggestion /** Add to scenario a test that verifies behavior related to both txid and wtxid with the same ``` -------------------------- hebasto src/txrequest.cpp:308 2020-10-07T06:50:30Z d9dc98b73d3d3b7979e76514e822e74814f8ff6d, typo: ```suggestion // The next best CANDIDATE_READY, if any, immediately precedes the REQUESTED or CANDIDATE_BEST ``` -------------------------- jnewbery test/functional/p2p_tx_download.py:128 2020-10-07T09:38:44Z s/transactions/transaction requests/ ? -------------------------- jnewbery src/test/fuzz/txrequest.cpp:75 2020-10-07T09:39:30Z DELAYED/READY/BEST -------------------------- jnewbery src/test/fuzz/txrequest.cpp:98 2020-10-07T09:58:21Z More out of curiosity than a suggestion to change: is there any reason to use C arrays here and elsewhere rather than std::arrays? std::array would allow you to replace many of the hand-written loops below with stl algorithms (e.g. `Cleanup()` becomes a single `std::any()` call and a single `std::transform()` call). -------------------------- hebasto src/test/fuzz/txrequest.cpp:114 2020-10-07T13:09:13Z I believe that `all_nothing == true` means that all `txhash`es have the `State::NOTHING`. Furthermore, re-assigning `State::NOTHING` to them will be effectively noop, right? So one could comment out this LOC. Or not? UPDATE: ignore it, sorry for noise. -------------------------- MarcoFalke src/test/txrequest_tests.cpp:430 2020-10-07T13:29:08Z in commit ea7839c888: Could mention that config is [0,4) like in the other tests. -------------------------- MarcoFalke src/test/txrequest_tests.cpp:484 2020-10-07T13:36:11Z same commit, config is also [0,4) -------------------------- glozow src/test/fuzz/txrequest.cpp:313 2020-10-07T13:51:33Z `CANDIDATE` -------------------------- MarcoFalke src/test/txrequest_tests.cpp:333 2020-10-07T13:55:41Z in commit ea7839c8889864be174b85a1f4de326f5e12e430: I am wondering why `config` is used here. Generally, it seems that `config` is used to enumerate observable behaviour changes, whereas randomness is used to augment the test with noise that shouldn't change behavior. Is this correct? If yes, it could make sense to replace `config` here with randomness. -------------------------- MarcoFalke src/Makefile.test.include:1221 2020-10-07T14:21:38Z in commit 8d0dd46b4fcdf4c664747e638eda046f8d51079c ```suggestion test_fuzz_txrequest_LDFLAGS = $(FUZZ_SUITE_LDFLAGS_COMMON) ``` -------------------------- ariard src/net_processing.cpp:3012 2020-10-07T15:34:44Z Note that the other call to `AdTxAnnouncement` (L2657) is conditional on `!m_chainman.ActiveChainState().IsInitialBlockDownload()`. 
I think it's not worthwhile to download orphan parents while still in IBD, as:
* a) we may not have the UTXO(s) from which they spend, and thus they will be considered orphans again
* b) we may not yet have seen the block in which they're already spent, and thus they will be useless once we arrive at the current network tip
* c) even if we assume the UTXO stays unspent at the current network tip, we can't be certain about the accuracy of its feerate

That said, it was the behavior before this patchset, so instead it might be better to address this in its own follow-up. -------------------------- ariard src/net_processing.cpp:3710 2020-10-07T16:51:00Z Should we ensure that an honest sender's NOTFOUND stays strictly under `MAX_PEER_TX_ANNOUNCEMENTS` in ProcessGetData? Let's say you have the following topology: Mallory <--> Alice <--> Bob.
1) Mallory sends 5000 txn (MAX_PEER_TX_ANNOUNCEMENTS) to Alice. This set of txn conflicts with the rest of the network minus Alice and Bob. Bob can only learn them from Alice.
2) Alice relays this set of 5000 txn towards Bob. It's under MAX_INV_SZ=50_000, so Bob receives this inv (AddTxAnnouncement, src/net_processing.cpp L2658) and increments Alice's announcement counter to the max (ReceivedInv, src/txrequest.cpp, L549), then requests them shortly afterwards (GetRequestable, src/net_processing.cpp, L4459) at the next message sending
3) At the same time, Mallory replaces-by-fee this set of 5000 txn in Alice's mempool
4) When Bob's GETDATA of MAX_GETDATA_SZ reaches Alice, she will process it (ProcessGetData, src/net_processing.cpp, L2689) and return a NOTFOUND to Bob
5) If this NOTFOUND is bigger than MAX_PEER_TX_ANNOUNCEMENTS, it's not going to be processed by Bob (ReceivedResponse, src/net_processing.cpp, L3694)
6) Tx-relay is severed between Alice and Bob for a GETDATA_TX_INTERVAL cycle, until Alice's REQUESTED are expired

I think step 4) might fail as the original set of transactions can still be in mapRelay. But I wonder if Mallory can also interfere with it by forcing Alice to hit Bob's send buffer (in ProcessGetData, src/net_processing, L1674) by asking her to relay large transactions. I think this attack should also be bounded by INVENTORY_MAX_RECENT_RELAY, but I wonder if you can bypass it by playing with orphans. Mallory can force Alice to relay transactions known to be orphans by Bob's mempool, thus triggering a call to AddTxAnnouncement (src/net_processing, L2991) on the Bob side. This doesn't strictly enforce Alice's inv rate-limiting policy and will inflate Alice's count of in-flight announcements towards Bob if the orphans have a lot of missing parents (my point around https://github.com/bitcoin/bitcoin/pull/19184#discussion_r497948559) Let me know if this scenario might hold; there are a lot of assumptions on message max sizes and timers. -------------------------- ariard src/txrequest.cpp:33 2020-10-07T17:54:23Z I didn't notice this during previous reviews, but this doesn't underscore that a CANDIDATE_READY announcement has no expiration time if it's never promoted to REQUESTED or the expected transaction is never received. Just to prevent a future change from introducing an expiration time for them, and thus a strong dependency on ordering where malicious announcements requested first could interfere with honest CANDIDATE_READY ones. "CANDIDATE_READY announcements may stay in this state until they're promoted to REQUESTED or the announced transaction is no longer needed" -------------------------- ariard src/txrequest.cpp:243 2020-10-07T17:54:54Z I think this is "higher priority" after the latest update.
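Regarding the "higher priority" wording, the reason preferred peers always win under the flipped ordering is visible in the `PriorityComputer` diff further up: the preferred flag occupies the most significant bit of the 64-bit priority, so the SipHash-derived low bits only break ties within a preferredness class. A condensed restatement, where `siphash64` stands in for the real `CSipHasher` calls:
```c++
// Priority layout after the ordering flip: bit 63 = preferred, bits 0..62 =
// salted hash of (txhash, peer). Any preferred announcement therefore compares
// strictly higher than any non-preferred one.
uint64_t low_bits = siphash64(txhash, peer) >> 1;        // keep 63 bits of tie-breaking entropy
Priority prio = low_bits | (uint64_t{preferred} << 63);  // preferred => top bit set
```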
-------------------------- ariard src/txrequest.cpp:334 2020-10-07T17:56:33Z Same "higher priority" ? -------------------------- ariard src/net_processing.cpp:76 2020-10-07T23:42:04Z I think MAX_PEER_TX_IN_FLIGHT and its comments could be clearer by being called `MAX_PEER_TX_REQUEST_IN_FLIGHT`. Everytime I'm re-reading this variable I wonder if its a superset of `MAX_PEER_TX_ANNOUNCEMENTS` (which is not ofc) or a disjunctive limit. Even, those 2 variables could leak their directionality, e.g `MAX_FROM_PEER_TX_ANNOUNCEMENTS`/`MAX_TO_PEER_TX_REQUEST` Please discard if you consider this as naming bikeshedding :) -------------------------- ariard src/net_processing.cpp:875 2020-10-07T23:48:06Z As another consistency checks, at PeerManager::InitializeNode you can verify that `m_txrequest.Count(nodeid)==0` ? -------------------------- ariard src/txrequest.cpp:475 2020-10-08T00:36:03Z On this diff, I've the new unit tests stalling, is this an expected property ? It would fail CI anyway so likely okay. ``` diff --git a/src/txrequest.cpp b/src/txrequest.cpp index fb5f1e325..cf4d55713 100644 --- a/src/txrequest.cpp +++ b/src/txrequest.cpp @@ -483,7 +483,7 @@ private: while (!m_index.empty()) { auto it = m_index.get().begin(); if (it->m_state == State::CANDIDATE_DELAYED && it->m_time <= now) { - PromoteCandidateReady(m_index.project(it)); + //PromoteCandidateReady(m_index.project(it)); } else if (it->m_state == State::REQUESTED && it->m_time <= now) { MakeCompleted(m_index.project(it)); } else { ``` -------------------------- sipa src/net_processing.cpp:3012 2020-10-08T01:09:42Z This is a good point, but perhaps left best for an orthogonal PR? -------------------------- ariard src/txrequest.cpp:374 2020-10-08T01:10:57Z I think this function is right but I was confused for a while on the behavior if we can have concurrently two announcements, one _BEST the other _REQUESTED. We can't but it would be better to document it as a friendly reminder. -------------------------- sipa src/net_processing.cpp:3710 2020-10-08T01:17:18Z I'm not sure there is really a problem here, because: * In step 2, that relay will take several minutes (we relay at most 35 tx invs per message, `INVENTORY_BROADCAST_MAX`, and around 1 inv message every 2s on outbound connections, so 17.5 tx/s), so announcements will come in gradually. * In step 5, Bob will expire requests after 1 min anyway, even without NOTFOUNDs. -------------------------- sipa src/txrequest.cpp:33 2020-10-08T01:21:18Z I can't understand your first paragraph. Your suggested text would be incorrect, as they're expected to first transition to CANDIDATE_BEST, before becoming REQUESTED. Feel free to suggest something that includes that. -------------------------- sipa src/net_processing.cpp:76 2020-10-08T01:23:47Z I think "in flight" rather unambiguously means "requested but no response yet". -------------------------- sipa src/txrequest.cpp:475 2020-10-08T01:29:50Z Yes, that's expected. The loop is expected to make progress in every step, or break. With the change you're making, it will infinitely spin when reaching a CANDIDATE_DELAYED. -------------------------- sipa src/txrequest.cpp:374 2020-10-08T01:30:46Z It says "and no REQUESTED exists", is that enough? -------------------------- sipa src/test/fuzz/txrequest.cpp:114 2020-10-08T01:31:27Z No worries. It's just avoiding the work in case there is nothing to do. -------------------------- sipa src/test/fuzz/txrequest.cpp:7 2020-10-08T01:32:30Z Done. 
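To spell out the progress argument for the loop patched in the txrequest.cpp:475 diff above: every iteration must either change the state of the first element it inspects (so that element stops matching the branch that selected it) or break, and commenting out `PromoteCandidateReady` removes the state change, so the first branch spins forever. An annotated restatement of that loop, simplified from the diff in the comment (index names and helpers follow the thread):
```c++
while (!m_index.empty()) {
    auto it = m_index.get<ByTime>().begin();
    if (it->m_state == State::CANDIDATE_DELAYED && it->m_time <= now) {
        // Progress: the entry leaves CANDIDATE_DELAYED, so it no longer
        // matches this branch and moves later in the ByTime ordering.
        PromoteCandidateReady(m_index.project<ByTxHash>(it));
    } else if (it->m_state == State::REQUESTED && it->m_time <= now) {
        // Progress: the entry becomes COMPLETED.
        MakeCompleted(m_index.project<ByTxHash>(it));
    } else {
        // The first entry is not an actionable DELAYED/REQUESTED at 'now'; stop.
        break;
    }
}
```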
-------------------------- sipa src/test/fuzz/txrequest.cpp:22 2020-10-08T01:32:37Z Done. -------------------------- sipa src/test/fuzz/txrequest.cpp:38 2020-10-08T01:33:09Z It does. This just helped debugging as it meant I could look at a hex txid and know which #txhash it was. Removed it. -------------------------- sipa src/test/fuzz/txrequest.cpp:46 2020-10-08T01:33:18Z Done. -------------------------- sipa src/test/fuzz/txrequest.cpp:243 2020-10-08T01:33:26Z Done. -------------------------- sipa src/test/fuzz/txrequest.cpp:313 2020-10-08T01:33:37Z Done. Indeed, `CANDIDATE`. -------------------------- sipa src/test/txrequest_tests.cpp:259 2020-10-08T01:33:44Z Done. -------------------------- sipa src/test/txrequest_tests.cpp:457 2020-10-08T01:33:51Z Done. -------------------------- sipa src/txrequest.cpp:308 2020-10-08T01:33:59Z Done. -------------------------- sipa test/functional/p2p_tx_download.py:128 2020-10-08T01:34:17Z Done. -------------------------- sipa src/test/fuzz/txrequest.cpp:75 2020-10-08T01:34:27Z Fixed. -------------------------- sipa src/test/fuzz/txrequest.cpp:98 2020-10-08T01:34:48Z All those things work with arrays as well? -------------------------- sipa src/txrequest.cpp:243 2020-10-08T01:35:06Z Fixed. -------------------------- sipa src/txrequest.cpp:334 2020-10-08T01:35:13Z Fixed. -------------------------- sipa src/net_processing.cpp:875 2020-10-08T01:35:31Z Added. -------------------------- jnewbery src/test/fuzz/txrequest.cpp:98 2020-10-08T09:26:59Z Ah ok. I guess it's more of a style question really. Is there ever a reason to prefer C arrays over std::array? Marking this as resolved since it's not important. -------------------------- jnewbery src/test/fuzz/txrequest.cpp:313 2020-10-08T09:32:05Z You replaced the wrong word. It should be CANDIDATE or REQUESTED. -------------------------- ariard src/net_processing.cpp:3710 2020-10-08T13:51:28Z > In step 2, that relay will take several minutes (we relay at most 35 tx invs per message, INVENTORY_BROADCAST_MAX, and around 1 inv message every 2s on outbound connections, so 17.5 tx/s), so announcements will come in gradually. Yes but I'm wondering if you can bypass this by forcing to relay orphans with a lot of missing parents, thus the receiver marking them as announcements from the sender. * Alice announces 17 txn to Bob * Bob requires them all * Alice replies with the 17 txn * Bob receives the txn, all of them are orphans with 295 missing parents. Bob mark all parents as `AddTxAnnouncement` (src/net_processing.cpp, L2991) as coming from Alice * Alice can't announce new transactions to Bob until her MAX_PEER_TX_ANNOUNCEMENTS is dried-up I'll write a test to get certainty on behavior. Feel free to mark as resolved. > In step 5, Bob will expire requests after 1 min anyway, even without NOTFOUNDs. Right, but that's still 1min of potential tx-relay severance... -------------------------- ariard src/net_processing.cpp:76 2020-10-08T13:54:28Z IMO it doesn't mark enough directionality but yeah bikeshedding. -------------------------- ariard src/txrequest.cpp:374 2020-10-08T13:55:19Z Right there is an _and_ marking the conjunction. -------------------------- ariard src/txrequest.cpp:33 2020-10-08T13:59:46Z Just wanted to add a comment saying that _READY expiration time is never, if it doesn't get promoted to _BEST or the announced transaction is no longer needed. Right forget a step, "CANDIDATE_READY announcements don't expire until they're promoted to _BEST or the announced transaction is no longer needed" better? 
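On the C-array vs `std::array` side question: the earlier remark that `Cleanup()` could collapse into a single `std::any_of()` call (written as "std::any()" above) might look roughly like the sketch below. The `MAX_PEERS` constant and the `m_announcements` layout are guesses from this thread, so treat it as an illustration of the style rather than the fuzz tester's real code:
```c++
// Forget every announcement for a txhash once none of them is still
// CANDIDATE or REQUESTED (i.e. they are all COMPLETED or NOTHING).
std::array<Announcement, MAX_PEERS>& anns = m_announcements[txhash];
const bool still_active = std::any_of(anns.begin(), anns.end(), [](const Announcement& a) {
    return a.m_state == State::CANDIDATE || a.m_state == State::REQUESTED;
});
if (!still_active) {
    for (Announcement& a : anns) a.m_state = State::NOTHING;
}
```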
-------------------------- ariard src/txrequest.cpp:442 2020-10-08T14:15:38Z You mean "successor" instead of "predecessor", given `std::next()` usage ? -------------------------- hebasto src/txrequest.cpp:223 2020-10-08T16:33:02Z ```suggestion //! Number of CANDIDATE_BEST announcements for this txhash (at most one). size_t m_candidate_best = 0; //! Number of REQUESTED announcements for this txhash (at most one). size_t m_requested = 0; ``` This is checked in https://github.com/bitcoin/bitcoin/blob/bf3f99291b8f207a809d6fd3309ef99eb3711388/src/txrequest.cpp#L324-L325 -------------------------- hebasto src/txrequest.h:103 2020-10-08T17:29:01Z ```suggestion explicit TxRequestTracker(bool deterministic = false); ``` -------------------------- sipa src/net_processing.cpp:3710 2020-10-08T17:38:39Z > Yes but I'm wondering if you can bypass this by forcing to relay orphans with a lot of missing parents, thus the receiver marking them as announcements from the sender. Mallory can't cause Alice to relay orphans to Bob - they're only relayed when they enter the mempool, which implies the parents are known. > I'll write a test to get certainty on behavior. Feel free to mark as resolved. Please do - though this sounds like something you'd need a functional test for, as it's not purely TxRequestTracker logic. > Right, but that's still 1min of potential tx-relay severance... Yeah, that would be a problem if an attacker can cause it with high probability, on many peers of a victim at once. -------------------------- sipa src/test/fuzz/txrequest.cpp:313 2020-10-08T18:52:24Z Details. Fixed. -------------------------- sipa src/txrequest.cpp:33 2020-10-08T18:52:42Z Ok, added some more comments. -------------------------- sipa src/net_processing.cpp:76 2020-10-08T18:53:08Z If it's unclear to you, better to improve it. Renamed. -------------------------- sipa src/txrequest.cpp:442 2020-10-08T18:53:21Z Indeed, done. -------------------------- sipa src/txrequest.cpp:223 2020-10-08T18:53:29Z Done. -------------------------- sipa src/txrequest.h:103 2020-10-08T18:53:39Z Good idea, done. -------------------------- ariard src/net_processing.cpp:3710 2020-10-09T01:31:30Z > Mallory can't cause Alice to relay orphans to Bob - they're only relayed when they enter the mempool, which implies the parents are known What is an orphan for Bob might not be an orphan for Alice if their mempools are divergent ? Mallory broadcast tx123 to Alice and tx123' to Bob, then send children from tx123 to Alice which will be relayed to Bob ? `fRejectedParents` checks parents, so Mallory may need to another layer of transaction, so it would be tx123 <- tx456 <- tx789. I think tx789 will be considered as a valid orphan by Bob ? -------------------------- sipa src/net_processing.cpp:3710 2020-10-09T03:41:35Z Ok, so Mallory also knows the victim specifically, and is able to connect directly to them. I believe you may be right that if Mallory has connections to both Alice and Bob, he may be able to induce a stream of orphans between them. If that stream is resolved slow enough (due to processing speed, bandwidth, ...), it may be possible to temporarily reach the 5000 MAX_PEER_TX_ANNOUNCEMENTS limit. I think in this situation: * This may be costly as the attacker has produce huge amounts of actually valid transactions to get Alice to relay them to Bob - transactions which may confirm. 
* The limit would only be reached very temporarily, as new announcements get dropped once the limit is reached, and receiving a tx or NOTFOUND will decrease it again. So this attack would need to be maintained in rapid fire to keep the tracked announcements at the bound.
* In order to actually block Bob from receiving honest transactions, Mallory has to do this with a significant fraction of Bob's peers - not just Alice.
* This seems harder to pull off, and possibly less effective, than simply mounting an invblock attack on Bob directly (race the network, announce honest transactions very quickly to Bob, and don't respond to requests).

Does that match your understanding? -------------------------- MarcoFalke src/net_processing.cpp:4593 2020-10-09T10:54:43Z in commit 3227f15f9d: This seems to indicate peer misbehaviour (to some extent), so I'd prefer if the log under the net category was kept. I am aware that this requires code changes more than just adding back the LogPrint, but I think it might be worth the additional code to keep this debug log statement. -------------------------- MarcoFalke test/functional/p2p_tx_download.py:149 2020-10-09T11:54:03Z in commit 3227f15f9d: Seems confusing to mix mocktime, but then wait for wall clock time. Is there any reason the process message loop will do anything different if called for two seconds in a loop but with fixed time? I think running it once is enough. You can do that with `p.sync_with_ping()`. -------------------------- MarcoFalke test/functional/p2p_tx_download.py:144 2020-10-09T11:55:24Z Generally, it seems that the code has strong unit and fuzz coverage for the internal logic, but the net_processing "glue code" is relatively untested when it comes to edge cases. I've written some edge-case tests.
Feel free to steal or ignore: ```diff diff --git a/test/functional/p2p_tx_download.py b/test/functional/p2p_tx_download.py index 01b0f0efa0..9cb92b77ae 100755 --- a/test/functional/p2p_tx_download.py +++ b/test/functional/p2p_tx_download.py @@ -46,6 +46,7 @@ INBOUND_PEER_TX_DELAY = 2 # seconds TXID_RELAY_DELAY = 2 # seconds OVERLOADED_PEER_DELAY = 2 # seconds MAX_GETDATA_IN_FLIGHT = 100 +MAX_PEER_TX_ANNOUNCEMENTS = 5000 # Python test constants NUM_INBOUND = 10 @@ -153,14 +154,80 @@ class TxDownloadTest(BitcoinTestFramework): self.nodes[0].setmocktime(mock_time + INBOUND_PEER_TX_DELAY + OVERLOADED_PEER_DELAY) p.wait_until(lambda: p.tx_getdata_count == len(txids)) + def test_expiry_fallback(self): + self.log.info('Check that expiry will select another peer for download') + WTXID = 0xffaa + peer1 = self.nodes[0].add_p2p_connection(TestP2PConn()) + peer2 = self.nodes[0].add_p2p_connection(TestP2PConn()) + for p in [peer1, peer2]: + p.send_message(msg_inv([CInv(t=MSG_WTX, h=WTXID)])) + # One of the peers is asked for the tx + peer2.wait_until(lambda: sum(p.tx_getdata_count for p in [peer1, peer2]) == 1) + peer_expiry, peer_fallback = (peer1, peer2) if peer1.tx_getdata_count == 1 else (peer2, peer1) + with p2p_lock: + assert_equal(peer_fallback.tx_getdata_count, 0) + self.nodes[0].setmocktime(int(time.time()) + GETDATA_TX_INTERVAL + 1) # Wait for request to peer_expiry to expire + peer_fallback.sync_with_ping() + with p2p_lock: + assert_equal(peer_fallback.tx_getdata_count, 1) + self.restart_node(0) # reset mocktime + + def test_notfound_fallback(self): + self.log.info('Check that notfounds will select another peer for download immediately') + WTXID = 0xffdd + peer1 = self.nodes[0].add_p2p_connection(TestP2PConn()) + peer2 = self.nodes[0].add_p2p_connection(TestP2PConn()) + for p in [peer1, peer2]: + p.send_message(msg_inv([CInv(t=MSG_WTX, h=WTXID)])) + # One of the peers is asked for the tx + peer2.wait_until(lambda: sum(p.tx_getdata_count for p in [peer1, peer2]) == 1) + peer_notfound, peer_fallback = (peer1, peer2) if peer1.tx_getdata_count == 1 else (peer2, peer1) + with p2p_lock: + assert_equal(peer_fallback.tx_getdata_count, 0) + peer_notfound.send_and_ping(msg_notfound(vec=[CInv(MSG_WTX, WTXID)])) # Send notfound, so that fallback peer is selected + peer_fallback.sync_with_ping() + with p2p_lock: + assert_equal(peer_fallback.tx_getdata_count, 1) + + def test_preferred_inv(self): + self.log.info('Check that invs from preferred peers are downloaded immediately') + self.restart_node(0, extra_args=['-whitelist=noban@127.0.0.1']) + peer = self.nodes[0].add_p2p_connection(TestP2PConn()) + peer.send_message(msg_inv([CInv(t=MSG_WTX, h=0xff00ff00)])) + peer.sync_with_ping() + with p2p_lock: + assert_equal(peer.tx_getdata_count, 1) + + def test_large_inv_batch(self): + self.log.info('Test how large inv batches are handled with relay permission') + self.restart_node(0, extra_args=['-whitelist=relay@127.0.0.1']) + peer = self.nodes[0].add_p2p_connection(TestP2PConn()) + peer.send_message(msg_inv([CInv(t=MSG_WTX, h=wtxid) for wtxid in range(MAX_PEER_TX_ANNOUNCEMENTS + 1)])) + peer.wait_until(lambda: peer.tx_getdata_count == MAX_PEER_TX_ANNOUNCEMENTS + 1) + + self.log.info('Test how large inv batches are handled without relay permission') + self.restart_node(0) + peer = self.nodes[0].add_p2p_connection(TestP2PConn()) + peer.send_message(msg_inv([CInv(t=MSG_WTX, h=wtxid) for wtxid in range(MAX_PEER_TX_ANNOUNCEMENTS + 1)])) + peer.wait_until(lambda: peer.tx_getdata_count == MAX_PEER_TX_ANNOUNCEMENTS) + 
peer.sync_with_ping() + with p2p_lock: + assert_equal(peer.tx_getdata_count, MAX_PEER_TX_ANNOUNCEMENTS) + def test_spurious_notfound(self): self.log.info('Check that spurious notfound is ignored') self.nodes[0].p2ps[0].send_message(msg_notfound(vec=[CInv(MSG_TX, 1)])) def run_test(self): + # Run tests without mocktime that only need one peer-connection first, to avoid restarting the nodes + self.test_expiry_fallback() + self.test_notfound_fallback() + self.test_preferred_inv() + self.test_large_inv_batch() + self.test_spurious_notfound() # Run each test against new bitcoind instances, as setting mocktimes has long-term effects on when # the next trickle relay event happens. - for test in [self.test_spurious_notfound, self.test_in_flight_max, self.test_inv_block, self.test_tx_requests]: + for test in [self.test_in_flight_max, self.test_inv_block, self.test_tx_requests]: self.stop_nodes() self.start_nodes() self.connect_nodes(1, 0) -------------------------- sdaftuar src/txrequest.h:87 2020-10-09T12:49:55Z Shouldn't this say: ` k ~ P + NB(p=1-NPh/NP,r=1) ` ? -------------------------- hebasto src/txrequest.cpp:130 2020-10-09T13:42:31Z 1950db7598a426232aeb5b6e1114a1f7e1ab35a1, nit: ```suggestion // See https://www.boost.org/doc/libs/1_58_0/libs/multi_index/doc/reference/key_extraction.html#key_extractors ``` -------------------------- sdaftuar src/txrequest.cpp:382 2020-10-09T14:48:35Z I have long wondered whether we should just prevent time from going backwards in `GetTime()`, by enforcing that it can only go forward. Even if we didn't do that globally in bitcoind, I think we could consider doing that for `SetTimePoint()`...? -------------------------- sdaftuar src/txrequest.cpp:117 2020-10-09T15:19:23Z I was concerned whether converting the bool to a uint64_t was guaranteed to work as expected (ie map true to uint64_t(1) on all platforms/all compilers), so I tried to check. From https://en.cppreference.com/w/cpp/language/implicit_conversion: > If the source type is bool, the value false is converted to zero and the value true is converted to the value one of the destination type (note that if the destination type is int, this is an integer promotion, not an integer conversion). So I think this is correct, but figured I'd mention it here in case anyone else was wondering, or in case I am misinterpreting this reference! -------------------------- laanwj src/txrequest.cpp:117 2020-10-09T20:15:06Z Yes, converting bool to int is (luckily!) fully defined behavior in C++! -------------------------- sipa src/txrequest.cpp:130 2020-10-09T20:23:32Z Done. -------------------------- sipa src/test/txrequest_tests.cpp:430 2020-10-09T20:23:43Z Done. -------------------------- sipa src/test/txrequest_tests.cpp:484 2020-10-09T20:23:54Z Done. -------------------------- sipa src/test/txrequest_tests.cpp:333 2020-10-09T20:24:14Z Changed to a random one. -------------------------- sipa src/Makefile.test.include:1221 2020-10-09T20:24:24Z Done. -------------------------- sipa src/net_processing.cpp:4593 2020-10-09T20:25:21Z Added a commit at the end that re-introduces a way to account for expirations. It's nontrivial, so if other reviewers prefer doing it separately, that's ok by me. -------------------------- sipa test/functional/p2p_tx_download.py:149 2020-10-09T20:25:32Z Ha, of course. Fixed. -------------------------- sipa test/functional/p2p_tx_download.py:144 2020-10-09T20:26:32Z This is great, I've squashed these tests in. 
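A minimal sketch of the forward-only clamp floated above for `SetTimePoint()` (keeping a "max time seen" inside the tracker so the effective clock never moves backwards); the member name `m_max_time` and the exact placement are assumptions, not the PR's code:
```c++
// Clamp the tracker's effective time so it never moves backwards, even if the
// caller's clock does.
void SetTimePoint(std::chrono::microseconds now)
{
    m_max_time = std::max(m_max_time, now);   // hypothetical member holding the latest time seen
    const auto effective_now = m_max_time;
    // ... promote CANDIDATE_DELAYED and expire REQUESTED entries using effective_now ...
}
```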
-------------------------- sipa src/txrequest.h:87 2020-10-09T20:26:50Z Indeed, fixed. -------------------------- sipa src/txrequest.cpp:382 2020-10-09T20:29:23Z Yeah, that works. My earlier argument was that it wasn't much code to deal with time going backwards, and it's elegant that these time-based operations work symmetrically. I've since added a test for time-going-backward behavior, so perhaps it isn't quite true anymore. I've added a new commit at the end that removes the time-going-backward-makes-candidates-unrequestable-again behavior by introducing a "max time seen" clock inside TxRequestTracker. I think it removes surprisingly little, but it's something. -------------------------- sipa src/txrequest.cpp:117 2020-10-09T20:30:04Z That's what I was assuming - booleans are effectively integers that can only take on the values 0 and 1. Good to verify in any case. -------------------------- MarcoFalke src/primitives/transaction.h:403 2020-10-10T09:15:12Z in commit ff290ca931f7131b3c1f7b7ac62b7a58ecc3c794: Any reason for this change? gcc and clang (c++17) compile fine for me without it. If you want to modify this class in this pull, you might add a `ToString() const` method that returns `IsWtxid() ? "wtx " : "tx " + GetHash().ToString()` and use it where appropriate. -------------------------- MarcoFalke src/test/fuzz/txrequest.cpp:146 2020-10-10T09:23:59Z Does that mean Bitcoin Core will crash when the user adjusts the time? -------------------------- sipa src/primitives/transaction.h:403 2020-10-10T15:58:38Z The std::sort (in unit tests) on pairs of NodeId, GenTxid failed without GenTxid having an assignment operator for me. -------------------------- sipa src/test/fuzz/txrequest.cpp:146 2020-10-10T15:59:53Z No, this is in the fuzz test. I wanted to verify that the test wasn't wasting time by trying situations in which time goes backwards (which are now pointless). -------------------------- MarcoFalke src/test/fuzz/txrequest.cpp:146 2020-10-10T17:38:44Z Thanks. Didn't realize this was in the fuzz test. :see_no_evil: -------------------------- MarcoFalke src/primitives/transaction.h:403 2020-10-10T17:39:02Z Thanks. Didn't compile the fuzz tests. -------------------------- hebasto src/primitives/transaction.h:403 2020-10-10T18:26:37Z ~~Both gcc 9.3.0 and clang 10.0.0 compile fine for me without this change.~~ > Thanks. Didn't compile the fuzz tests. Indeed. -------------------------- hebasto src/test/fuzz/txrequest.cpp:285 2020-10-10T18:31:12Z ```suggestion // Compare Size(), Count(), CountInFlight() and CountCandidates() with naive structure. ``` -------------------------- hebasto src/test/fuzz/txrequest.cpp:62 2020-10-10T18:33:32Z ```suggestion * Check() calls the TxRequestTracker's sanity check, plus compares the * output of the constant accessors (Size(), Count(), CountInFlight() and * CountCandidates()) with expected values. ``` -------------------------- ariard src/net_processing.cpp:3710 2020-10-10T23:47:44Z > This may be costly as the attacker has to produce huge amounts of actually valid transactions to get Alice to relay them to Bob - transactions which may confirm. Assuming a mass-connecting malicious node, I think you can mempool-partition Alice and Bob from the rest of the honest network by announcing yet another different spend tx123'', and thus avoid having to pay for the malicious stream of orphans. > The limit would only be reached very temporarily, as new announcements get dropped once the limit is reached, and receiving a tx or NOTFOUND will decrease it again.
So this attack would need to be maintained in rapid fire to keep the tracked announcements at the bound. If you assume Mallory has a connection with Bob, can Mallory invblock on the malicious orphan stream by announcing it at the same time? At least for an expiration period. > In order to actually block Bob from receiving honest transactions, Mallory has to do this with a significant fraction of Bob's peers - not just Alice. I was thinking the reverse: an attacker blocking Alice's txn from propagating, e.g. in a scenario where the attacker expects Alice to broadcast a time-sensitive tx after reception of a given block. But I agree an attacker has to repeat the blocking with every one of Alice's peers. Overall, I agree it sounds hard to pull off; an attacker already has to observe all of the victim's tx-relay peers to effectively censor an expected broadcast. I think future p2p work like outbound tx-relay rotation should harden against this kind of attack scenario. -------------------------- naumenkogs src/txrequest.cpp:430 2020-10-12T07:53:38Z But this will miss the case when a COMPLETED announcement is not the immediate next one? {.... it-1, it, it+1, COMPLETED}. It seems like it actually works properly because it is called for CANDIDATE_BEST (implicit rule). Perhaps add this assert then? -------------------------- sipa src/txrequest.cpp:430 2020-10-12T08:09:26Z > {.... it-1, it, it+1, COMPLETED}. In that case `it` is not the only non-COMPLETED Announcement for this txhash (it-1 and it+1 are also non-COMPLETED), so false is the correct return value. -------------------------- naumenkogs src/txrequest.cpp:430 2020-10-12T08:13:01Z Right, thanks! -------------------------- sipa src/test/fuzz/txrequest.cpp:62 2020-10-12T08:17:55Z Done. -------------------------- sipa src/test/fuzz/txrequest.cpp:285 2020-10-12T08:18:04Z Done. -------------------------- naumenkogs src/txrequest.cpp:480 2020-10-12T09:30:33Z So, let's say we start iterating, and the very first one is `CANDIDATE_READY` or `COMPLETED` or `CANDIDATE_BEST` with `it->m_time <= now`. So, it's possible that we have `CANDIDATE_DELAYED` and `REQUESTED` which should be updated, but we won't reach them because we quit. Is it not possible for some reason? -------------------------- naumenkogs src/txrequest.cpp:413 2020-10-12T09:40:35Z Is this safe? I guess as long as the change of position is moderate (happening within a peer). And perhaps can only go backward? I'm wondering if we may get stuck in an endless loop here somehow... -------------------------- naumenkogs src/txrequest.cpp:285 2020-10-12T09:47:33Z Why is this useful? I guess handling dependencies? Worth documenting I think. -------------------------- naumenkogs src/txrequest.cpp:502 2020-10-12T09:52:43Z Shouldn't this be outside the `if (it == m_index.get().end()) {` condition and happen unconditionally? -------------------------- naumenkogs src/txrequest.cpp:621 2020-10-12T09:56:41Z I guess this ideally should happen atomically with setting `it` to REQUESTED? Otherwise, some call in-between these two actions may, for example, make another request for the same tx? (as they see there is currently no `REQUESTED`) -------------------------- naumenkogs src/txrequest.cpp:634 2020-10-12T10:03:41Z I guess this implementation should assume that either false or true exists? Otherwise I'm not sure what happens. Maybe worth documenting or adding an explicit assert?
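For illustration, a minimal standalone sketch (an assumption-level model, not the actual txrequest.cpp code) of the lookup pattern this last question refers to: the per-peer key contains a bool, both values are tried, and if neither entry exists the call is simply a no-op, matching the "no effect" behavior confirmed later in the thread:
```c++
#include <cassert>
#include <cstdint>
#include <initializer_list>
#include <map>
#include <string>
#include <tuple>

using NodeId = int64_t;
// (peer, "still a candidate?", txhash) -- a stand-in for the real ByPeer key.
using Key = std::tuple<NodeId, bool, std::string>;

void ReceivedResponse(std::map<Key, int>& index, NodeId peer, const std::string& txhash)
{
    // The bool is part of the key, so both possible values are tried.
    for (bool candidate : {false, true}) {
        auto it = index.find(Key{peer, candidate, txhash});
        if (it != index.end()) {
            index.erase(it); // in this toy model, "completing" just drops the entry
            return;
        }
    }
    // Neither entry exists: deliberately a no-op.
}

int main()
{
    std::map<Key, int> index{{Key{1, true, "tx123"}, 0}};
    ReceivedResponse(index, /*peer=*/2, "tx123"); // unknown (peer, txhash): no effect
    assert(index.size() == 1);
    ReceivedResponse(index, /*peer=*/1, "tx123"); // known announcement: removed
    assert(index.empty());
    return 0;
}
```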
-------------------------- jnewbery test/functional/p2p_tx_download.py:165 2020-10-12T12:13:58Z it's a bit gross that you're calling into a getter lambda function for peer1 through a peer2 method, but I guess it works, and now that `wait_until()` can't take a lock parameter I can't think of a cleaner way to do this. -------------------------- jnewbery test/functional/p2p_tx_download.py:166 2020-10-12T12:14:29Z This should be under the `p2p_lock` scope, since you're accessing `peer1.tx_getdata_count` -------------------------- jnewbery test/functional/p2p_tx_download.py:172 2020-10-12T12:36:43Z Could this be racy? The pong from `peer_fallback` will be sent at almost the exact same time as the getdata, and maybe before. If we take the `p2p_lock` as soon as we receive the pong, then the getdata might not have arrived yet. Here's the log from a successful run: ``` node0 2020-10-12T12:31:39.447800Z (mocktime: 2020-10-12T12:32:40Z) [msghand] timeout of inflight wtx 000000000000000000000000000000000000000000000000000000000000ffaa from peer=2 node0 2020-10-12T12:31:39.447862Z (mocktime: 2020-10-12T12:32:40Z) [msghand] received: ping (8 bytes) peer=1 node0 2020-10-12T12:31:39.447891Z (mocktime: 2020-10-12T12:32:40Z) [msghand] sending pong (8 bytes) peer=1 test 2020-10-12T12:31:39.448000Z TestFramework.p2p (DEBUG): Received message from 127.0.0.1:13915: msg_pong(nonce=00000002) test 2020-10-12T12:31:39.448000Z TestFramework.p2p (DEBUG): Received message from 127.0.0.1:13915: msg_getdata(inv=[CInv(type=WTX hash=000000000000000000000000000000000000000000000000000000000000ffaa)]) node0 2020-10-12T12:31:39.448005Z (mocktime: 2020-10-12T12:32:40Z) [msghand] Requesting wtx 000000000000000000000000000000000000000000000000000000000000ffaa peer=1 node0 2020-10-12T12:31:39.448028Z (mocktime: 2020-10-12T12:32:40Z) [msghand] sending getdata (37 bytes) peer=1 ``` If there's any delay between those two 'Received message' lines, then this test could fail this assert. Perhaps replace this with a `wait_until()`. -------------------------- jnewbery src/txrequest.cpp:465 2020-10-12T12:58:29Z This new function signature where we're returning pairs of (NodeId, GenTxid) makes me more convinced that there should be a follow-up PR to pull this function out of `GetRequestable()` and call it outside the context of a single peer (https://github.com/bitcoin/bitcoin/pull/19988/files#r499479782). -------------------------- jnewbery src/txrequest.cpp:181 2020-10-12T13:02:03Z The `WaitState` enum and this function seem superfluous now, since you can just replace the enum with a bool and use `IsWaiting()`. -------------------------- jnewbery src/txrequest.cpp:61 2020-10-12T13:04:40Z I think we can just say "For CANDIDATE_DELAYED the reqtime; for REQUESTED the expiry; no meaning for CANDIDATE_READY, CANDIDATE_BEST and COMPLETED" -------------------------- jnewbery src/txrequest.cpp:343 2020-10-12T13:06:05Z Just drop this now that m_time has no meaning for CANDIDATE_READY and CANDIDATE_BEST. -------------------------- MarcoFalke test/functional/p2p_tx_download.py:165 2020-10-12T14:17:38Z It would be possible to use `self.wait_until` and take the lock inside the lambda. Though, the lambda would no longer fit on one line then, which I didn't like.
-------------------------- MarcoFalke test/functional/p2p_tx_download.py:166 2020-10-12T14:17:54Z Good catch -------------------------- MarcoFalke test/functional/p2p_tx_download.py:184 2020-10-12T14:18:13Z Same here -------------------------- MarcoFalke test/functional/p2p_tx_download.py:172 2020-10-12T14:22:58Z I didn't use a `wait_until` because I wanted to show that the fallback is selected "immediately" after expiry. When this is switched to a `wait_until`, it should probably use a short delay. An alternative would be to sync with two pings? :grimacing: Note to myself: There might be other places in the functional tests where this could be racy, though at least no others in this test script? -------------------------- ariard src/txrequest.cpp:285 2020-10-12T15:17:29Z Yes, see https://github.com/bitcoin/bitcoin/pull/19184#discussion_r459143538. It's not documented at the current tip. I thought it was addressed at a previous tip but can't find where. -------------------------- ariard src/test/txrequest_tests.cpp:614 2020-10-12T15:53:48Z "Weird as peer2 should have the preference on gtxid2 request". -------------------------- ariard src/test/txrequest_tests.cpp:620 2020-10-12T15:56:28Z Shouldn't this say "If gtxid2 is requested from peer2"? -------------------------- jnewbery test/functional/p2p_tx_download.py:165 2020-10-12T15:59:10Z yeah, I think this is ok, just not very aesthetically pleasing! -------------------------- jnewbery test/functional/p2p_tx_download.py:172 2020-10-12T16:00:25Z Yeah, I think a `wait_until` with 1 second is fine. On a p2p network, 1s seems close enough to immediate. -------------------------- ariard test/functional/p2p_tx_download.py:246 2020-10-12T17:07:40Z AFAICT, fallback in case of requestee disconnection isn't covered: ``` diff --git a/test/functional/p2p_tx_download.py b/test/functional/p2p_tx_download.py index 71877a6ee..92b327b36 100755 --- a/test/functional/p2p_tx_download.py +++ b/test/functional/p2p_tx_download.py @@ -172,6 +172,24 @@ class TxDownloadTest(BitcoinTestFramework): assert_equal(peer_fallback.tx_getdata_count, 1) self.restart_node(0) # reset mocktime + def test_disconnect_fallback(self): + self.log.info('Check that disconnect will select another peer for download') + WTXID = 0xffbb + peer1 = self.nodes[0].add_p2p_connection(TestP2PConn()) + peer2 = self.nodes[0].add_p2p_connection(TestP2PConn()) + for p in [peer1, peer2]: + p.send_message(msg_inv([CInv(t=MSG_WTX, h=WTXID)])) + # One of the peers is asked for the tx + peer2.wait_until(lambda: sum(p.tx_getdata_count for p in [peer1, peer2]) == 1) + peer_disconnect, peer_fallback = (peer1, peer2) if peer1.tx_getdata_count == 1 else (peer2, peer1) + with p2p_lock: + assert_equal(peer_fallback.tx_getdata_count, 0) + peer_disconnect.peer_disconnect() + peer_disconnect.wait_for_disconnect() + peer_fallback.sync_with_ping() + with p2p_lock: + assert_equal(peer_fallback.tx_getdata_count, 1) + def test_notfound_fallback(self): self.log.info('Check that notfounds will select another peer for download immediately') WTXID = 0xffdd @@ -221,6 +239,7 @@ class TxDownloadTest(BitcoinTestFramework): def run_test(self): # Run tests without mocktime that only need one peer-connection first, to avoid restarting the nodes self.test_expiry_fallback() + self.test_disconnect_fallback() self.test_notfound_fallback() self.test_preferred_inv() self.test_large_inv_batch() ``` This test should fail with the following commit, which explicitly disables the reselect operation:
https://github.com/bitcoin/bitcoin/commit/a3b7b951070b139599375846fd873c5dd798b762 -------------------------- sipa src/txrequest.cpp:480 2020-10-12T17:13:33Z This is iterating the `ByTime` index, which is sorted by (state is CANDIDATE_DELAYED or REQUESTED, time). So in that index, all CANDIDATE_DELAYED and REQUESTED Announcements come before any others. If another one is reached, it means there are no CANDIDATE_DELAYED or REQUESTED left. -------------------------- ariard src/txrequest.h:163 2020-10-12T17:25:46Z If the clock keeps going back-and-forth but without ever crossing announcements' `m_time`, won't _DELAYED announcements stay stale forever? Maybe worth documenting, to be sure we all have the same understanding of the new behavior. -------------------------- sipa src/txrequest.cpp:465 2020-10-12T17:40:58Z I agree, and considered that, but it's a pain to make that keep the "any order of calls has well-specified behavior" property that was introduced a while ago. The problem is that ReceivedInv and RequestedTx cause the "PostGetRequestable" invariants to be violated until the next SetTimePoint. That's currently not a problem as the only way to observe that is through GetRequestable, which calls SetTimePoint. But if SetTimePoint is made a separate public function, we'd need to either:
* Add a rule that GetRequestable can only be called after SetTimePoint, with no mutators in between - effectively forcing the API to do the same thing it's doing now, but with the burden on the caller.
* Add more state, remembering whether anything may have broken the invariants, and refusing GetRequestable in that condition.
* Introduce even more structure, and precompute the result of all possible GetRequestable (across all peers), and cache them inside TxRequestTracker, to be fetched/deleted whenever GetRequestable is called. This would significantly worsen worst-case memory usage, I think.
* Add the ability to install a callback function for expirations, instead of returning them from GetRequestable.
So all things considered, I think this is a decent approach. The callback idea seems reasonable, and perhaps can be extended to more than just expirations, though if that's what we pick, I'd like to keep it as a follow-up. -------------------------- sipa src/txrequest.cpp:621 2020-10-12T17:59:24Z `TxRequestTracker` is thread-agnostic so that's not allowed. The caller needs external synchronization if multiple threads are going to be accessing it. -------------------------- sipa src/txrequest.h:163 2020-10-12T18:15:39Z That seems absolutely trivial. If your clock makes no long-term progress then obviously no time-based events can be guaranteed to happen. -------------------------- sipa src/test/txrequest_tests.cpp:620 2020-10-12T18:56:16Z What is the difference? -------------------------- sipa src/txrequest.cpp:413 2020-10-12T19:19:00Z Yes, it's safe. The key insight is that no other Announcement objects from the same peer are ever affected by this call (std::next(it) may be affected, but only if it's from a different peer). I've elaborated the comments to hopefully clarify this. -------------------------- sipa src/txrequest.cpp:285 2020-10-12T19:19:36Z I've added a comment about this in txrequest.h (but not here; it's just following the specification defined there).
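To make the `ByTime` iteration argument above (src/txrequest.cpp:480) concrete, here is a minimal standalone sketch under assumed names, not the actual txrequest.cpp code: entries are ordered so that all CANDIDATE_DELAYED and REQUESTED ("waiting") announcements sort before everything else, which is why the loop can stop at the first entry that is neither of those or whose time is still in the future:
```c++
#include <cassert>
#include <cstdint>
#include <set>
#include <tuple>
#include <utility>

enum class State { CANDIDATE_DELAYED, CANDIDATE_READY, CANDIDATE_BEST, REQUESTED, COMPLETED };

constexpr bool IsWaiting(State s) { return s == State::CANDIDATE_DELAYED || s == State::REQUESTED; }

struct Ann {
    State state;
    int64_t time; // reqtime for CANDIDATE_DELAYED, expiry for REQUESTED, meaningless otherwise
    int id;       // stand-in for (txhash, peer)
};

struct ByTimeCmp {
    bool operator()(const Ann& a, const Ann& b) const
    {
        // Waiting entries first, then by time, with id as a tie-breaker.
        return std::make_tuple(!IsWaiting(a.state), a.time, a.id) <
               std::make_tuple(!IsWaiting(b.state), b.time, b.id);
    }
};

void SetTimePoint(std::set<Ann, ByTimeCmp>& index, int64_t now)
{
    while (!index.empty()) {
        auto it = index.begin();
        // Stop at the first non-waiting entry (none can follow it) or the first
        // waiting entry whose time has not been reached yet.
        if (!IsWaiting(it->state) || it->time > now) break;
        auto node = index.extract(it);
        // DELAYED wakes up to READY; REQUESTED expires to COMPLETED.
        node.value().state = node.value().state == State::CANDIDATE_DELAYED ? State::CANDIDATE_READY : State::COMPLETED;
        index.insert(std::move(node)); // re-inserted behind all remaining waiting entries
    }
}

int main()
{
    std::set<Ann, ByTimeCmp> index{{State::CANDIDATE_DELAYED, 5, 0}, {State::REQUESTED, 7, 1}, {State::CANDIDATE_READY, 1, 2}};
    SetTimePoint(index, 6);
    // Only the not-yet-due REQUESTED entry is still waiting, and it is at the front.
    assert(index.begin()->state == State::REQUESTED);
    return 0;
}
```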
-------------------------- sipa src/txrequest.cpp:502 2020-10-12T19:21:07Z That's not necessary as it's only possible in case no CANDIDATE_BEST entry for the specified (peer, txhash) was found (if there was, then no other REQUESTED or CANDIDATE_BEST for the same txhash by another peer can exist, due to the invariant that there is at most one of those per txhash). I've added more comments to hopefully clarify. -------------------------- sipa src/txrequest.cpp:634 2020-10-12T19:22:46Z `ReceivedResponse` is supposed to have no effect in case no CANDIDATE or REQUESTED announcement with the specified peer/txhash combination exists. I've updated the comment in txrequest.h to clarify. -------------------------- sipa test/functional/p2p_tx_download.py:166 2020-10-12T19:22:59Z Fixed, here and in other places. -------------------------- sipa test/functional/p2p_tx_download.py:172 2020-10-12T19:23:34Z I've made some changes to this effect. Please have a look; I'm not super familiar with the p2p functionality in the Python tests. -------------------------- sipa src/txrequest.cpp:181 2020-10-12T19:23:43Z Gone. -------------------------- sipa src/txrequest.cpp:61 2020-10-12T19:23:53Z Done. -------------------------- sipa src/txrequest.cpp:343 2020-10-12T19:27:45Z Done. -------------------------- sipa test/functional/p2p_tx_download.py:184 2020-10-12T19:27:58Z Done. -------------------------- sipa src/test/txrequest_tests.cpp:614 2020-10-12T19:28:11Z Done. -------------------------- sipa test/functional/p2p_tx_download.py:246 2020-10-12T19:28:24Z Added, thanks! --------------------------
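Finally, regarding the "max time seen" clock mentioned earlier in the thread (the src/txrequest.cpp:382 discussion): a minimal sketch of the idea, assuming only that the tracker clamps caller-supplied timestamps to the largest value it has observed so far (the real implementation may differ):
```c++
#include <algorithm>
#include <cassert>
#include <chrono>

// Assumed behavior, for illustration: by never letting its notion of "now" move
// backwards, a tracker cannot make already-requestable candidates unrequestable
// again when the caller's clock jumps back.
class MonotonicTime
{
    std::chrono::microseconds m_max_seen{0};

public:
    std::chrono::microseconds Observe(std::chrono::microseconds now)
    {
        m_max_seen = std::max(m_max_seen, now);
        return m_max_seen; // never earlier than anything previously observed
    }
};

int main()
{
    using namespace std::chrono_literals;
    MonotonicTime clock;
    assert(clock.Observe(10us) == 10us);
    assert(clock.Observe(5us) == 10us); // a backwards step is ignored
    assert(clock.Observe(12us) == 12us);
    return 0;
}
```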