Jan. 06, 2012 1:58 PM
Note: I’m going to try to keep this on-point, and as such, I will try not to elaborate too much on points that are not directly relevant to benchmarking. I ask that you keep this in mind when reading, and when responding, only discuss the relevant points, unless it is absolutely vital to do otherwise. Also, sorry for the length, it's a topic I've thought about a lot lately, and there's a lot to discuss.
This thread stems from a discussion over in the Diablo testing thread, though the idea has come up indirectly a number of times in recent testing threads. A number of us have already begun doing this ourselves and requesting it from others, but I’d like to see a more co-ordinated response/action here.
Of late, we’ve noticed a large variation in test results with exactly the same beyblades, same opponents, same conditions, everything. I think most of us will now agree that skill is a sizeable factor influencing test results (how sizeable is debatable, but beside the point). The weight of each user’s parts also influences results, particularly in stamina (and apparently in defense too: Relic mentioned his Duo was unusually light, and he got pretty poor results with it).
Of course, this introduces an unaccounted-for variable or two, especially in new testing. This is undesirable, and the best way to handle it, in my opinion, is to introduce a control group of sorts: benchmarks for each tester, to measure the influence the individual tester has on tests.
Simply put, users provide results for a known matchup, and their future tests with a new wheel can be compared against these.
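To make that concrete, here’s a very rough sketch (in Python, with entirely made-up numbers; the “reference win rate” is a hypothetical community-wide figure, and no official formula is being proposed) of how a tester’s benchmark could be read alongside their new results:

```python
# Rough sketch only: every number here is invented for illustration, and
# reference_win_rate is a hypothetical community-wide figure, not real data.

# A tester's benchmark: 20 rounds of a known attack-vs-defense matchup.
benchmark_wins, benchmark_rounds = 13, 20
reference_win_rate = 0.55  # what this matchup "usually" scores (hypothetical)

tester_win_rate = benchmark_wins / benchmark_rounds
skew = tester_win_rate - reference_win_rate  # positive = attack-favourable tester/parts

# The same tester later posts a new attack wheel at 16/20.
new_wins, new_rounds = 16, 20
print(f"Benchmark skew: {skew:+.2f}")
print(f"New wheel, raw win rate: {new_wins / new_rounds:.2f}")
print(f"New wheel, read against the benchmark: {new_wins / new_rounds - skew:.2f}")
```

The arithmetic itself isn’t the point; the point is that without the benchmark line, the 16/20 has nothing to be compared against.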
Part I: How should we benchmark?
Arupaeo mentioned a couple of different methods over in the Diablo Discussion thread, and I replied there; it’s probably worth checking out, given it was the motivation for this thread and has some discussion of different ways we could benchmark: http://worldbeyblade.org/Thread-Diablo-D...#pid872550
Personally, I like the idea of a set of standard combinations or matchups: something like MF(-H) Variares R145RF vs MF-H Basalt Aquario BD145CS, and MF(-H) Blitz Unicorno (II)/Variares CH120RF vs MF-H Basalt Aquario BD145CS, for attack/defense. There aren’t any good BD145 attackers at the moment, so I’ll leave that aside for now.
When testing a part, that part should be the only thing that changes (though in the case of a Metal Wheel, I think choosing the best Clear Wheel for each one is important, as it keeps things fair for each wheel; again, debatable). When testing a combination (say, MF-H Fusion Hades 85RB), the rest of the combination can change, as it is a comparison between combinations, not parts. However, this too is debatable, and direct comparative tests would be very useful to make sure it isn’t just an excellent track/tip combination giving the results, which could work even better with another wheel.
For stamina, as we know, wheel weights are becoming influential; thus, even if users lack scales, benchmarking gives us an idea of what to expect. Scales would be better, but benchmarking would still be useful.
Obviously, this is very much open to suggestions, especially on what "standard" matchups we should use.
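As a starting point for that discussion, here’s one way a single benchmark entry could be laid out so results are easy to compare between testers (a Python sketch; the field names and every value are placeholders of my own, not an agreed format):

```python
# One possible layout for a benchmark entry; every value is a placeholder.
benchmark_entry = {
    "tester": "ExampleUser",
    "matchup": "MF-H Variares R145RF (launched first) vs MF-H Basalt Aquario BD145CS",
    "wins": 13,
    "rounds": 20,
    "notes": "win conditions, stadium, launcher, etc.",
    # Optional, but worth filling in if scales are available, since wheel
    # weight is clearly influencing stamina (and possibly defense) results.
    "part_weights_g": {"Variares": None, "Basalt": None},
}
```

Even just matchup, score and (where possible) weights, kept in a consistent order, would make everyone’s posts far easier to scan.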
Part II: How do we encourage benchmarking?
I think we can all agree the simplest way to implement this would be to just stick it in the standard testing procedure. However, not everyone can provide benchmark tests, so we may end up further restricting who can and can’t test, which I still don’t feel is healthy. We have accepted tests without benchmarking before, and we can continue to do so, but only if it is absolutely impossible for a user to benchmark. Laziness is not a valid excuse for not benchmarking, and never, ever will be. My opinion is that it should be insisted upon for everyone who can, and included in standard procedure, but with an “unless absolutely not possible” condition attached.
Part III: How do we organise Benchmarking?
This part is rather long-winded, and not hugely important until we’ve worked out how we’re going to benchmark, so I’ve spoilered it.
This is the fun part: we’ve discussed whether or not we need to benchmark, how it’s to be done, and how we get people to do it.
But what do we do with these results? I don’t want to make everyone either track down a post in the relevant part thread, or dig the results up from their own computer, and copy them every time they post. That’s a discouraging hassle. People are put out enough by having to do twenty tests. As much as I don’t want to encourage laziness, I do want to make things as easy as possible for those who are willing to contribute their time and effort to testing.
So here is what I propose: a single thread which acts as a “database” of each user’s benchmarks. Each tester gets a single post in which to keep all of their benchmarks. Only one, which can be edited with new results/benchmarks, etc. There should be absolutely no posts in this thread other than benchmarks. If people feel the need to debate/discuss a test result, they can do so in an appropriate part/combo discussion thread (I would encourage testers to copy any new testing into the relevant threads as well as the benchmarking thread, even if it is just a link to the post in the benchmarking thread).
Here’s the trickier part: we’ve got all these results in one thread, but I think we can make finding them even easier. I’m happy to handle this myself by creating the thread, unless someone else would like to do it.
The OP should have an alphabetical list of usernames, linking directly to each user’s post in the thread. As I said, I can happily manage that list in the OP. The only issue would be username changes, but I guess users can PM me if they change their name, and I will update the list to reflect that.
Then, all the effort required of testers is to mention that they have a post in the benchmark thread for comparison; other users can go to the OP, find the name, and read the benchmarks.
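If it helps, the index itself would be trivial to keep in order; here’s a small sketch (the thread URL and post IDs are placeholders, not real links) of turning a username-to-post map into an alphabetical list of BBCode links for the OP:

```python
# Sketch: build the alphabetical OP index from a username -> post URL map.
# The URLs below are placeholders, not real links.
benchmark_posts = {
    "Relic": "http://worldbeyblade.org/Thread-Benchmark-Database?pid=000002",
    "Arupaeo": "http://worldbeyblade.org/Thread-Benchmark-Database?pid=000001",
    "ExampleUser": "http://worldbeyblade.org/Thread-Benchmark-Database?pid=000003",
}

# Case-insensitive alphabetical order, printed as BBCode ready to paste into the OP.
for name in sorted(benchmark_posts, key=str.lower):
    print(f"[url={benchmark_posts[name]}]{name}[/url]")
```

A username change would then just mean editing one entry and re-pasting the list.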
If ever I am no longer able to modify the OP for whatever reason, it could be handed over to someone who can (a committee member etc), or a new thread could be created with links to the old one, and new testing can be posted in that thread. But, for the foreseeable future, I can do it.
So, look it over, discuss, and suggest. I’d like to see this implemented ASAP, but there’s plenty of meat to chew on here, so I don’t expect it to be done tomorrow.