Note: I’m going to try to keep this on-point, so I won’t elaborate too much on points that aren’t directly relevant to benchmarking. Please keep that in mind when reading, and when responding, stick to the relevant points unless it is absolutely vital to do otherwise. Also, sorry for the length; it's a topic I've thought about a lot lately, and there's a lot to discuss.
This thread stems from a discussion over in the Diablo testing thread, though the idea has come up indirectly a number of times in recent testing threads. A number of us have already begun doing this ourselves and requesting it from others, but I’d like to see a more co-ordinated response here.
Of late, we’ve noticed large variation in test results even with exactly the same Beyblades, same opponents, and same conditions. I think most of us will now agree that skill is a sizeable factor influencing test results (how sizeable is debatable, but beside the point). The weight of each user’s parts also influences results, particularly in stamina (and apparently in defense as well: Relic mentioned his Duo was unusually light, and he got pretty poor results with it).
Of course, this introduces an unaccounted-for variable or two, especially in new testing. This is undesirable, and the best way to handle it, in my opinion, is to introduce a control group: benchmarks for each tester, to measure the influence the individual tester has on their tests.
Put simply, each user provides tests of a known matchup, and their future tests with a new wheel can be compared against those results.
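To make the comparison concrete, here is a minimal, purely hypothetical sketch (in Python, just because it’s convenient for the arithmetic) of how a tester’s benchmark result could be read against a pooled community figure. Every number and name in it is made up for illustration; nothing this formula-like is actually being proposed here.

```python
# Purely illustrative sketch: how a tester's benchmark result could serve
# as a reference point when reading their new-part results.
# All names and numbers are hypothetical, not proposed standards.

def win_rate(wins, rounds):
    """Win percentage over a set of rounds."""
    return 100.0 * wins / rounds

# Hypothetical benchmark: this tester's result for the known matchup
# vs. a pooled figure from other testers for the same matchup.
tester_benchmark = win_rate(17, 20)   # e.g. 85% in the standard defense matchup
community_benchmark = 70.0            # e.g. pooled community figure (made up)

# The gap gives a rough idea of how much this tester's skill/parts/setup
# skew results relative to everyone else.
tester_bias = tester_benchmark - community_benchmark

# A new-wheel result from the same tester can then be read with that bias in mind.
new_wheel_result = win_rate(18, 20)
adjusted_estimate = new_wheel_result - tester_bias

print(f"Tester bias vs. community: {tester_bias:+.0f} percentage points")
print(f"Raw new-wheel result: {new_wheel_result:.0f}%, "
      f"read as roughly {adjusted_estimate:.0f}% once the bias is accounted for")
```

The point is simply that a known matchup gives each tester’s numbers a frame of reference; how (or whether) we formalise anything like this is exactly what needs discussing.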
Part I: How should we benchmark?
Arupaeo mentioned a couple of different methods over in the Diablo Discussion thread, and I replied there. It’s probably worth checking out, given it was the motivation for this thread and has some discussion of different ways we could benchmark: http://worldbeyblade.org/Thread-Diablo-D...#pid872550
Personally, I like the idea of a set of standard combinations or matchups: something like MF(-H) Variares R145RF vs MF-H Basalt Aquario BD145CS and MF(-H) Blitz Unicorno (II)/Variares CH120RF vs MF-H Basalt Aquario BD145CS for attack and defense. There aren’t any good BD145 attackers at the moment, so I’ll leave that aside.
When testing a PART, that part should be the only thing that changes (though in the case of a Metal Wheel, I think choosing the best clear wheel for each MW is important, as that keeps things fair for each wheel; again, debatable). When testing a COMBINATION (say, MF-H Fusion Hades 85RB), the whole combination can change, as it is a comparison between combinations, not parts. However, this too is debatable, and direct comparative tests would be very useful to make sure it isn’t just an excellent track/tip combination producing the results, which could work even better with another wheel.
For stamina, as we know, wheel weights are becoming influential, so even if users lack scales, benchmarking gives us an idea of what to expect. Scales would be better, but benchmarking would still be useful.
Obviously, this is very much open to suggestions, especially on what "standard" matchups we should use.
Part II: How do we encourage benchmarking?
I think we can all agree the simplest way to implement this would be to include it in the standard testing procedure. However, not everyone can provide benchmark tests, so we may end up further restricting who can and can’t test, which I still don’t feel is healthy. We have accepted tests without benchmarking before, and we can continue to do so, but only when it is genuinely impossible for a user to benchmark; laziness is not a valid excuse, and never will be. My opinion is that benchmarking should be insisted upon for everyone who can do it, and included in the standard procedure, but with an “unless absolutely not possible” condition attached.
Part III: How do we organise Benchmarking?
This is rather long-winded, and not hugely important until we’ve worked out how we’re going to benchmark, so I’ve spoilered it.
So, look it over, discuss, and suggest. I’d like to see this implemented ASAP, but there’s plenty of meat to chew on here, so I don’t expect it to be done tomorrow.