Mark Your Benches! (Benchmarking Discussion)

Note: I’m going to try to keep this on-point, and as such, I will try not to elaborate too much on points that are not directly relevant to benchmarking. I ask that you keep this in mind when reading and, when responding, only discuss the relevant points unless it is absolutely vital to do otherwise. Also, sorry for the length: it's a topic I've thought about a lot lately, and there's a lot to discuss.

This thread stems from a discussion over in the Diablo testing thread, though the topic has come up indirectly a number of times in recent testing threads. A number of us have already begun doing this ourselves and requesting it from others, but I’d like to see a more co-ordinated response/action here.

Of late, we’ve noticed a large variation in test results, with exactly the same beyblades, same opponents, same conditions, everything. I think most of us will now agree that skill is a sizeable factor influencing test results (how sizeable is debatable, but beside the point). The weight of each user’s parts also influences results, particularly in stamina (and apparently also in defense: Relic mentioned his Duo was unusually light, and he got pretty poor results with it).

Of course, this introduces an unaccounted-for variable or two, especially in new testing. This is undesirable, and the best way to handle it, in my opinion, is to introduce a Control group. Basically, benchmarks for each tester, to measure the influence the individual tester has on tests.

Simply put, users provide tests of a known matchup, and future tests with a new wheel can then be compared against these.
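To illustrate how that comparison could work (a minimal sketch; the helper names, numbers, and the idea of a "community average" on a standard matchup are invented for the example, not proposed procedure):

Code:
# Illustrative sketch only: weighing a new test result against the tester's
# own benchmark on a standard matchup. All names and numbers are invented.

def win_rate(wins, rounds):
    """Fraction of rounds won."""
    return wins / rounds

# The tester's benchmark: 16 KOs in 20 rounds on the standard matchup,
# against an assumed community average of 65% on that same matchup.
bench = win_rate(16, 20)   # 0.80
community_average = 0.65   # assumption for the sketch

# The same tester now posts 14/20 with a new wheel. A strong launcher
# inflates results, so scale the raw rate by average vs. benchmark.
new = win_rate(14, 20)
skill_factor = community_average / bench
print(f"raw: {new:.0%}, adjusted: {new * skill_factor:.0%}")  # raw: 70%, adjusted: 57%

Even without any arithmetic, simply having the benchmark percentage next to the new result tells readers whether "beat the defense combo 70% of the time" came from an average launcher or an exceptional one.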

Part I: How should we benchmark?

Arupaeo mentioned a couple of different methods over in the Diablo Discussion thread, and I replied; it’s probably worth checking out, given it was the motivation for this thread and has some discussion of different ways we could benchmark: http://worldbeyblade.org/Thread-Diablo-D...#pid872550

Personally, I like the idea of a set of standard combinations or matchups: something like MF(-H) Variares R145RF vs MF-H Basalt Aquario BD145CS, and MF(-H) Blitz Unicorno (II)/Variares CH120RF vs MF-H Basalt Aquario BD145CS, for attack vs defense. There aren’t any good BD145 attackers at the moment, so I’ll leave that.

When testing a PART, that should be the only thing that changes (though in the case of a Metal Wheel, I think choosing the best Clear Wheel for each MW is important, as it keeps things fair for each wheel. Again, debatable). When testing a Combination (say, MF-H Fusion Hades 85RB), the combination can be changed, as it is a comparison between combinations, not parts. However, this is debatable, and direct comparative tests would be very useful to make sure it isn’t just an excellent track/tip combination giving the results, which could work better with another wheel.

For stamina, as we know, wheel weights are becoming influential; thus, even if users lack scales, benchmarking gives us an idea of what to expect. Scales would be better, but benchmarking would still be useful.

Obviously, this is very much open to suggestions, especially on what "standard" matchups we should use.

Part II: How do we encourage benchmarking?

I think we can all agree the simplest way to implement it would be to just stick it in the standard testing procedure. However, not everyone can provide benchmark tests, so we may end up further restricting who can and can’t test, which I still don’t feel is healthy. We have accepted tests without benchmarking before, and we can continue to do so, but only if it is absolutely impossible for a user to benchmark. Laziness is not a valid excuse for not benchmarking, and never, ever will be. My opinion is that it should be insisted upon for everyone who can, and included in standard procedure, but with an “Unless absolutely not possible” condition attached.

Part III: How do we organise Benchmarking?
This is rather long-winded, and not hugely important until we have worked out how we’re going to benchmark, so I’ve spoilered it.

So, look it over, discuss, and suggest. I’d like to see this implemented ASAP, but there’s plenty of meat to chew here, so I don’t expect it to be done tomorrow.
You already know that I fully support this, hah.

Hm, I thought that it would be better for people to post their "benchmark" results, as you call them, right next to their new tests. We all know that while testers could be too lazy to actually produce the "benchmark" results, almost everybody else could be too lazy to go check that thread of "benchmark" results too. But I suppose that they could at least post the URL to their post, and possibly also quote it ... ?
Yeah, to be honest, I don't really expect many objections; it's pretty sensible, and most of us are already encouraging it and/or doing it ourselves.

Personally, I think anyone too lazy to look up the user's benchmarks probably won't bother reading them even if they are there, so it's unnecessary effort for testers (and I personally think quoting is too much), but including a link to their own post in that thread would be good.

Obviously, it's very much open for discussion; I just figured that if we organise an index properly, it will be easy to find a user's benchmarks - even if they forget to include them, and when looking up older results from before benchmarking was/is introduced.

I'm in favor of this idea, especially given the variations people have with Certain tipS that shall not be named, so I cannot really find much objectionable about it.

However, there's always the issue of part availability... I know this is essentially covered in the "we'll accept it if it's entirely impossible for them to benchmark", but I think it still somewhat deserves mention that some people may not have the parts used in the more standard benchmarking - and I'm assuming the idea is to have a very specific set of combos/matchups for people to go up against, for the most accuracy. For example, since I lack CH120, VariAres, Phantom, etc., I can't really benchmark too well in the current meta, haha! But, since I lack such important parts, I also have no business testing in the current meta, so that kind of extreme situation nullifies itself - but there may be users who are only one specific part shy, and so forth.

I'm assuming that users will also be able to update their benchmarks if their skill improves, if we do resort to the "one thread to rule them all" method.

However, as a suggestion for benchmarking skill in Attack versus Defense, it may be beneficial to use a Defense combo that is not so prone to inflated results and outright rocking Attackers in general... picking and choosing the specifics, and what variables we allow for (i.e. some parts must absolutely remain the same as stated, while others have a list they may be selected from if specific parts are not available, and so forth), would be a bit of a process, wouldn't it?
I think that benchmarks should be more than 20 rounds? I was honestly thinking 50.
50 tests seems like an extreme number, even for something you should only have to do once - especially for Stamina benchmarking, where 50 tests could devour almost four hours depending on the combos. I was thinking maybe 30.
(Jan. 07, 2012  12:41 AM)Hazel Wrote: I'm in favor of this idea, especially given the variations people have with Certain tipS that shall not be named, so I cannot really find much objectionable about it.
Heh, yes, this is partially with that in mind (but it's no substitute for including the behaviour of your CS when posting results).


Quote:However, there's always the issue of part availability... I know this is essentially covered in the "we'll accept it if it's entirely impossible for them to benchmark", but I think it still somewhat deserves mention that some people may not have the parts used in the more standard benchmarking - and I'm assuming the idea is to have a very specific set of combos/matchups for people to go up against, for the most accuracy. For example, since I lack CH120, VariAres, Phantom, etc., I can't really benchmark too well in the current meta, haha! But, since I lack such important parts, I also have no business testing in the current meta, so that kind of extreme situation nullifies itself - but there may be users who are only one specific part shy, and so forth.

Mmm, well, the "benchmark" combos would need to be updated semi-regularly, but most people will be doing tests with newer parts as they become available etc, so it should hopefully take care of itself. As for those lacking a specific part, I guess we could offer alternatives or whatever. Again, suggestions?

Quote:I'm assuming that users will also be able to update their benchmarks if their skill improves, if we do resort to the "one thread to rule them all" method.

Of course, just edit their post with the new results.

Quote:However, as a suggestion for benchmarking skill in Attack versus Defense, it may be beneficial to use a Defense combo that is not so prone to inflated results and outright rocking Attackers in general... picking and choosing the specifics, and what variables we allow for (i.e. some parts must absolutely remain the same as stated, while others have a list they may be selected from if specific parts are not available, and so forth), would be a bit of a process, wouldn't it?

I like Basalt BD145CS as it seems to "separate the men from the boys", so to speak.

If that is how we decide to do it, then this thread will become devoted to making those lists and decisions. It's a bit of work, but I don't think it's beyond us.

@Shaba: 50 would be amazing, but try to be practical.
I'd definitely go with 40 for Attack versus Defense, if not 30 there, too, if only because doing 50 consecutive battles is more likely to create variables in performance than eliminate them, I think.
Didn't Kai-V say 30 was the minimum for an accurate picture at some point? I don't really recall that well, but asking anyone to do even 40 rounds of testing of a matchup they're probably already a little tired of is a bit much, in my opinion (I am generally lazy though). 30 for each would be reasonable, IMO.
50 rounds with an Attack type honestly doesn't take an incredible amount of time.

As an example, three sets of testing with an Attack type doesn't seem like much, right? That's 60 rounds total.
Three sets of testing of anything is basically what I will do in a day before I get tired. Maybe I'm just very lazy, but 50 is a big number and I think most people will respond as apprehensively as I.

Also, 60 rounds with 3 different opponents =/= 50 rounds of the same thing over and over.
The other problem is that, since not everyone is skilled with Attackers, some people benchmarking themselves would basically have to watch their attackers flounder in battle for 50 matches, which would be demoralizing and exceptionally frustrating.
Yeah, for the record: I'm not doing 50 rounds with Variares against anything with decent defense. :\
But that raises an interesting question: I can't do anything with Variares, but I do fine with Blitz. That's something I guess even benchmarking can't really account for, as it just seems to be whether I can work with certain wheels or they just fly backwards away from defenders :l

(Also I edited my previous post to mention why I like Basalt BD145CS as a benchmark).
Variations in CS, and users' differing ability with it, could still be a sticking point for benchmarking, though... RDF would be much more consistent, but it's harder to come by. Moreover, offering people Basalt BD145 as their primary benchmark barrier seems to be in line with propagating the Attack Result-Inflation Disease that is infecting many threads. I'm not sure there's anything to be done about it, but it does seem like people are more prone to inflating results when it concerns Basalt BD145.

I'd assume Benchmarking would display a lot of things like that, th!nk, such as why I'm a bit better with Beat than with Blitz, haha.
(Jan. 07, 2012  1:05 AM)th!nk Wrote: Didn't Kai-V say 30 was the minimum for an accurate picture at some point? I don't really recall that well, but asking anyone to do even 40 rounds of testing of a matchup they're probably already a little tired of is a bit much, in my opinion (I am generally lazy though). 30 for each would be reasonable, IMO.

Yes, thirty is the commonly accepted minimum sample size for statistical significance.
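(As a rough back-of-envelope illustration of why thirty is the usual cut-off - the 65% win rate here is just an assumed example, not a target:)

Code:
# 95% normal-approximation margin of error for a measured win rate,
# at an assumed true rate of 65%, for various round counts.
import math

def margin_of_error(p, n, z=1.96):
    """Half-width of the 95% confidence interval for a proportion."""
    return z * math.sqrt(p * (1 - p) / n)

for n in (10, 20, 30, 40, 50):
    print(f"n={n}: +/- {margin_of_error(0.65, n):.0%}")
# n=10: +/- 30%, n=30: +/- 17%, n=50: +/- 13% -- going from 30 to 50
# rounds buys only a few points of precision for a lot of extra launches.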
This weekend I'm going to do a test run to see how many I personally can do without going brain-dead. I think you guys should do that too? It'll also give us a better idea of how many rounds the tests should be.

I'll get 50 to be the minimum, just you wait, meteor king!
You are mentally ill, SSJ.

I accept. Will we all agree on a particular match-up, or just pick whatever we want?
Given your ridiculous exercise habits, I hardly think you're the right model for "the average tester", Scoobysnacks.

I know already that 60 is my limit for an average day, against three different combos. I have a very, very short attention span.
This is something I have been thinking about lately, and I believe that when an attack combo is involved, this definitely should be enforced more, because it is a very important part of gathering accurate info from a set of tests. Since new MFB releases are at a halt right now, I believe this is a perfect time to solidify a benchmark list. Enough info has been gathered at this point to make a list that we can stick to for a while.

My first choice is obviously going to be Flash GB145RF/R2F vs. Revizer Revizer BD145RDF, but after going through 2 (getting close to 3) RDFs myself, I think we should pick a different bottom. One suggestion is that we utilize both RSF and CS: if you have an attack combo that is a pure KO combo, use RSF; if you notice the defense combo getting OSed frequently, or if the attack combo is more of a stamina/attack hybrid (i.e. Flash W145MF), then go with CS. Also, what other Chrome Wheels are acceptable for replacing Revizer if someone doesn't have 2, or even 1?

We also need to discuss when and how often people should do benchmarks. For example, if I do a set of attack vs. defense tests today and provide a benchmark with those, then 3 or 4 days from now do another set of attack vs. defense tests, I personally don't believe another benchmark is needed that quickly. However, take my defense comparison tests in the Competitive Combos thread: that was a lot of wear on rubber parts in a short amount of time, which is why I did a benchmark at the beginning and one at the end, just to make sure the last combo I tested wasn't performing better/worse because of part wear (or mental/physical wear on the tester lol).
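(Purely as an illustration of how a before/after pair like that could be read - the win counts below are invented, and the 1.96 cut-off is just the textbook 95% threshold, not anything we've agreed on:)

Code:
# Hypothetical check: did the end-of-session benchmark drop by more than
# chance alone would explain? Invented numbers, standard two-proportion test.
import math

def two_prop_z(w1, n1, w2, n2):
    """Two-proportion z statistic for win counts w1/n1 vs w2/n2."""
    p1, p2 = w1 / n1, w2 / n2
    pooled = (w1 + w2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

z = two_prop_z(14, 20, 8, 20)   # 70% at the start, 40% at the end
print(f"z = {z:.2f}")           # ~1.91; |z| > 1.96 would point to real wear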

Also, should we do Stamina benchmarks? I have never seen even one done. I have seen comparison tests, but not a true benchmark. If y'all think we should, then as th!nk said, we need to discuss what matchups.

These were just some quick thoughts, and if some discussion starts up I can go into more detail on some things, but I would like to hear some of y'all's thoughts.
The best thing to use is an opponent the attacker has a fair chance to beat, something around a 60-70% win rate for most people. You would know better than I what fits that criterion, and yes, Flash should be used now.

We do need to discuss how long a set of benchmarks last for. I wish we could trust users to assess whether their benchmarks are likely still accurate and redo them if they are not, but we have some real issues with laziness in testing at the moment...

For stamina, some parts and matchups need benchmarking IMO.
Anything with Phantom vs Duo where stamina is the deciding factor (i.e. anything where Phantom isn't being used as an attacker) should have a benchmarking match of the two wheels on the same setup (parts swapped halfway through), because of the variability in them. I think most of the variability lies in Phantom, but I'm not all that sure, so it's hard to say whether that same testing would be needed for stamina-based tests using either of them or just Phantom; but we couldn't really ask it of people if they don't have one of the two.

CS may be a bit too variable to use for benchmarking, but right now it's a fairly common choice. RDF had such a limited release that asking people to use theirs for benchmarking seems excessive, but I don't know if there's anything else enough people have that would also work...


The other thing that I want a kind of benchmarking done on is ANY test involving B:D. Due to the variability, people NEED to do solo spins of their B:D, so we can get an idea of how good a B:D was used. However, as I mentioned above, I think Phantom itself is prone to variations, and thus a different wheel should be used; I'm honestly not sure which, though...
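To show the sort of thing I mean by "an idea of how good a B:D was used", here's a hypothetical solo-spin report (all times invented; the format is just a suggestion):

Code:
# Hypothetical B:D solo-spin report: a few timed solo spins on the same
# custom, then the average and the spread.
from statistics import mean, stdev

solo_spin_times = [172.4, 168.9, 175.1, 170.6, 169.8]  # seconds, invented

print(f"mean:  {mean(solo_spin_times):.1f}s")   # 171.4s
print(f"stdev: {stdev(solo_spin_times):.1f}s")  # 2.5s; a big spread = an inconsistent B:D

A tester whose B:D solo-spins well above or below what others report would then be easy to spot before their results are taken at face value.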
(Mar. 02, 2013  5:00 AM)th!nk Wrote: The best thing to use is an opponent the attacker has a fair chance to beat, something around a 60-70% win rate for most people. You would know better than I what fits that criterion, and yes, Flash should be used now.

Ok, good, so Flash is decided unless someone else suggests otherwise. I honestly believe Revizer Revizer is the absolute best to judge performance against. If only one Revizer is available, then I believe Gargole should be the next choice. I say that not to suggest Gargole is the next best choice for a competitive defense combo, but that it is the next best for benchmarking: not too hard to KO, but not too easy either. Killerken and Saramanda are next after that, because IMHO they are easier to KO because of having more to grab onto. I also believe the defense combo should be designed so that Revizer is the attacker's main contact point (i.e. if you are benchmarking Flash S130 or lower, then make sure Revizer is on the bottom; if you are using 145 or higher, then use Revizer on top).

(Mar. 02, 2013  5:00 AM)th!nk Wrote: We do need to discuss how long a set of benchmarks last for. I wish we could trust users to assess whether their benchmarks are likely still accurate and redo them if they are not, but we have some real issues with laziness in testing at the moment...

I wish we could too, but I would say at least a week or two, unless obvious wear has happened to certain parts, or someone feels they have gotten a lot better at launching.

(Mar. 02, 2013  5:00 AM)th!nk Wrote: For stamina, some parts and matchups need benchmarking IMO.
Anything with Phantom vs Duo where stamina is the deciding factor (i.e. anything where Phantom isn't being used as an attacker) should have a benchmarking match of the two wheels on the same setup (parts swapped halfway through), because of the variability in them. I think most of the variability lies in Phantom, but I'm not all that sure, so it's hard to say whether that same testing would be needed for stamina-based tests using either of them or just Phantom; but we couldn't really ask it of people if they don't have one of the two.

I agree. What setup? AD145/W145 WD, 230 D?

(Mar. 02, 2013  5:00 AM)th!nk Wrote: CS may be a bit too variable to use for benchmarking, but right now it's a fairly common choice. RDF had such a limited release that asking people to use theirs for benchmarking seems excessive, but I don't know if there's anything else enough people have that would also work...

This is something we need more members' input on. I agree that we should not expect people to use RDF for benchmarking, though. I personally like RSF, but I still think that if we decide on that, people are going to have to watch out for an excessive amount of OS in their testing.

(Mar. 02, 2013  5:00 AM)th!nk Wrote: The other thing that I want a kind of benchmarking done on is ANY test involving B:D. Due to the variability, people NEED to do solo spins of their B:D, so we can get an idea of how good a B:D was used. However, as I mentioned above, I think Phantom itself is prone to variations, and thus a different wheel should be used; I'm honestly not sure which, though...

How about Earth? I really wanted to say Hades/Hell, but it has several variations as well.
This is a quick aside: Once we decide on the customs for a given benchmark, I think a thread should be created where people solely post their benchmark results. That would allow everyone's benchmarks to be timestamped effectively.