Talk:Essay:Quantifying Liberal Style
This is interesting, Mark. Where did you come up with this formula? Is it based on some other formula that has been used to determine an author's bias in another setting? How do you calibrate the constants in the expression? I would love to see the analysis! JacobB 22:19, 6 October 2009 (EDT)
- No, I was unable to find an existing formula to determine bias. Instead I have used as close an analogue of the successful Flesch-Kincaid grade level formula as I could imagine for this purpose. If you poke around on the internet a bit, you'll see that there are a number of variants of FK to measure readability, with only the constants changed. The extra exponent on the liberal phrases term is completely ad hoc: it seems to me that if someone uses a lot of these there should be a high penalty, because it can't be an accident. I'm open to suggestions for improving this term.
- Calibrating the constants will be difficult. If I can get enough training data I will use a logistic regression algorithm to optimize the constants (a standard technique in computational linguistics, and probably elsewhere). The problem is the shortage of good training data for parodists: in the couple of months I've been here I don't think any parodists have gotten in more than a couple of posts, and those are all deleted. The other possibility is to use known examples of liberal style from WP or another wiki-style site with a large population of liberal would-be vandals. Until I have a good source of training data I will have to guess at appropriate values based on a small sample.
- The other major issue is dealing with last-wordism. It's hard to detect this using an unsupervised computer. Right now I plan to look at each section of the Talk pages to which a user contributes, and see if he made the last post in it. A user who doesn't actively engage in last-wordism should have a value for LastWdProp well below 1/2 (because most discussions involve more than two editors), and thus receive no penalty. The form of this term is also rather ad hoc... do you have any suggestions? Thanks for your interest! --MarkGall 22:34, 6 October 2009 (EDT)
- I should probably add that my current code has a couple of quirks in addition to what's here, mostly for testing. For example, I'm hoping to look at the frequencies of certain word bigrams in order to catch common parodist phrases not on the list. This may not work at all. Of course, I don't want to divulge the whole algorithm here, or the vandals will know what to do to avoid detection! --MarkGall 22:42, 6 October 2009 (EDT)
- Logistic regression? Hmmm... I see why you would choose this, but while the distribution of liberals in the overall population probably follows a logistic distribution (i.e., 90% of people would rank at least, say, a 5 on your test, 50% a 10, and 10% a 15), I don't think that would hold on Conservapedia. Here, I think we have legitimate editors, obvious vandals, and subtle vandals, and nobody in between. Now, as I see it, a bot would only be necessary to distinguish between the legitimate editors and the subtle vandals, since the kind of regular vandalbots that Wikipedia uses would suffice to detect the more obvious ones.
- Therefore, it seems to me that we might want to emphasize last-wordism and liberal terms and phrases now. I also think it might be wise to de-emphasize the edit to talk POST ratio, or remove it altogether, and instead emphasize the edit to talk WORD ratio. This gives a more accurate impression of an editor's contributions, which could otherwise be thrown off by numerous tiny edits to mainspace, or large single edits to talk pages (like this one!).
- Also, if we wish to be thorough, and if you think you'll get enough data points, we might want to replace the square and square-root exponents with undetermined constants. We might find that a cube or 1.5th root matches the data better. I doubt this will be a realizable goal, for reasons I'll soon get to, and I suppose it should stick to ^2 and ^(1/2) for now.
- Anyways, if you agree with me that the goal here is to distinguish subtle vandals from legitimate editors, the best way to calibrate it would be to keep the constants unknown, evaluate the formula for editors known to be legitimate*, see what values of C_0 through C_4 they get, and then run it for known subtle parodists.
- There's the difficulty! If we KNEW of a subtle parodist on now, they'd be gone, and if they're already gone, then who knows how much of their vandalism has been reverted and is now inaccessible? The best we can hope for is to adjust the values to keep the scores of known legitimate editors low, and HOPE that the subtle vandals will be exposed this way.
- Footnote: Obviously this includes Andrew Schlafly and the other sysops, but it would also be wise to use non-sysops that you think we can trust, because in their duties as sysops, sysops may give data very different from the average editor. JacobB 23:36, 6 October 2009 (EDT)
- What you describe is roughly how I want to calibrate. We leave the constants unknown, and then use a sample of known good editors and known parodists to find the values that best distinguish the two. I think one of the methods that can be used to do this is called logistic regression, but I don't remember exactly how it works (I'm sure somewhere deep inside it's doing the simple thing we think of when we hear "logistic regression"). Fortunately, I don't need to remember, because there's a Python module that can take care of optimizing the constants for me. I agree with you about the exponents; I just worry that it would make the optimization process much more difficult if it weren't linear in the constants. My statistics knowledge isn't that great... anyone here have a better idea about how to optimize?
- I think the real challenge will be finding enough data points. The number of active editors here just isn't enough to do a good job optimizing all the constants I have (I'd prefer to have thousands, but again, a statistician could help here). If we can't get enough to do a proper optimization, just eyeballing the statistics about a few key admins and prolific liberals should be enough.
- Anyway, stop asking silly questions on the talk page or you're going to get us both flagged as parodists! (Just kidding, of course; I'm glad to have someone to discuss this with.) --MarkGall 23:46, 6 October 2009 (EDT)
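The last-word check discussed in this thread (look at each Talk section a user contributes to and see whether he made the final post) could be sketched roughly as follows. This is only an illustration, not the code referred to above: the `Post` record, the function names, the user names, and the linear penalty above a 1/2 threshold are all assumptions made for the example.

```python
from collections import namedtuple

# Hypothetical record for one talk-page post: the section it belongs to,
# its author, and a sortable timestamp. None of this comes from the thread.
Post = namedtuple("Post", ["section", "author", "timestamp"])


def last_word_proportion(posts, user):
    """Fraction of the sections `user` posted in where `user` made the final post."""
    latest = {}     # section -> (timestamp, author) of the latest post seen
    joined = set()  # sections the user contributed to
    for post in posts:
        if post.author == user:
            joined.add(post.section)
        seen = latest.get(post.section)
        if seen is None or post.timestamp >= seen[0]:
            latest[post.section] = (post.timestamp, post.author)
    if not joined:
        return 0.0
    closed = sum(1 for section in joined if latest[section][1] == user)
    return closed / len(joined)


def last_word_penalty(prop, threshold=0.5):
    """Assumed form: no penalty at or below the threshold, linear growth above it."""
    return max(0.0, prop - threshold)


# Invented example data: three sections, three made-up users.
posts = [
    Post("s1", "alice", 1), Post("s1", "bob", 2), Post("s1", "carol", 3),
    Post("s2", "bob", 1), Post("s2", "alice", 2), Post("s2", "bob", 3),
    Post("s3", "bob", 1), Post("s3", "carol", 2), Post("s3", "bob", 3),
]
for user in ("alice", "bob", "carol"):
    prop = last_word_proportion(posts, user)
    print(user, round(prop, 3), round(last_word_penalty(prop), 3))
```

In this toy data only "bob", who closes two of the three sections he joins, ends up above the threshold; the other two receive no penalty.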
Of course, on items which count something like the use of liberal words, we need to divide by the total number of posts: otherwise even Mr. Schlafly's score would just keep going up! I also adjusted the constants. A liberal extremist should now score about 300 points from each term, for a total of 1500 points. I expect that any user with over 1000 points has at least an 80% chance of being a liberal. Of course, I still hope to get good values for the constants with a regression, and the cutoffs for identifying liberals also need to be tested. --MarkGall 21:05, 7 October 2009 (EDT)
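A minimal sketch of the per-post normalization described above, assuming a simple linear scaling so that a chosen "extremist" rate maps to about 300 points. The function names, the `extremist_rate` parameter, and the sample numbers are hypothetical and are not taken from the actual formula.

```python
def phrase_rate(phrase_count, total_posts):
    """Flagged phrases per post; 0.0 for a user with no posts."""
    if total_posts <= 0:
        return 0.0
    return phrase_count / total_posts


def term_points(rate, extremist_rate, points_at_extreme=300.0):
    """Assumed linear scaling: a user at `extremist_rate` scores `points_at_extreme`."""
    return points_at_extreme * (rate / extremist_rate)


# Invented numbers: a prolific editor and a brand-new account.
veteran = phrase_rate(40, 4000)   # many posts, so the raw count is diluted
newcomer = phrase_rate(12, 20)    # few posts, most of them flagged
print(term_points(veteran, extremist_rate=0.6))
print(term_points(newcomer, extremist_rate=0.6))
```

Dividing by the post count means a long-standing editor's score stays flat as he keeps posting, while a raw count would grow without bound.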
- I've been thinking: to calibrate, this should be run on some of the more obviously liberal editors at Wikipedia, so as to get a liberal sample. An openly liberal Wikipedia editor and a subtle liberal vandal here might be saying different things, but if your theory is right, they'll be saying them the same way. Good idea? JacobB 21:20, 7 October 2009 (EDT)
- Yes, that's the plan. I suspect that even if a liberal attempts to parody real editors here, some aspects of his style will still betray him. If Andy can discern this by reading posts, hopefully we can quantify it and do it automatically. I'll try to find either a few extremely liberal users at WP or else users at another, more avowedly liberal, wiki site; there appear to be a couple such, but I haven't really looked closely at them. --MarkGall 21:25, 7 October 2009 (EDT)
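The calibration idea discussed on this page (leave the constants unknown, then fit them on a sample of trusted editors and a sample of known liberal writers) might look roughly like the sketch below. It is a toy under stated assumptions: a hand-written gradient-descent logistic regression stands in for whatever Python module is meant above, the two features (phrases per post and a last-word proportion) are chosen only for illustration, and every data point is invented. Any squared or square-rooted terms would be computed beforehand and passed in as features, so the model stays linear in the constants.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_logistic(features, labels, lr=0.5, epochs=2000):
    """Fit [c_0, c_1, ..., c_k] by batch gradient descent on the log-loss."""
    n = len(features)
    weights = [0.0] * (len(features[0]) + 1)  # weights[0] is the intercept c_0
    for _ in range(epochs):
        grad = [0.0] * len(weights)
        for x, y in zip(features, labels):
            z = weights[0] + sum(w * xi for w, xi in zip(weights[1:], x))
            err = sigmoid(z) - y
            grad[0] += err
            for j, xi in enumerate(x):
                grad[j + 1] += err * xi
        weights = [w - lr * g / n for w, g in zip(weights, grad)]
    return weights


def predict_proba(weights, x):
    """Estimated probability that feature vector x belongs to the label-1 sample."""
    z = weights[0] + sum(w * xi for w, xi in zip(weights[1:], x))
    return sigmoid(z)


# Invented toy data: [phrases per post, last-word proportion].
trusted = [[0.01, 0.20], [0.02, 0.30], [0.00, 0.25], [0.03, 0.40]]
liberal_sample = [[0.50, 0.70], [0.60, 0.80], [0.40, 0.90], [0.70, 0.60]]
features = trusted + liberal_sample
labels = [0] * len(trusted) + [1] * len(liberal_sample)

weights = fit_logistic(features, labels)
print(predict_proba(weights, [0.02, 0.30]))
print(predict_proba(weights, [0.55, 0.75]))
```

With only eight made-up points this says nothing about real editors; it only shows the mechanics. A real fit would need far more data, as noted above, and a held-out sample to test any cutoff.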