RE: I am a Bot using Artificial Intelligence to help the Steemit Community. Here is how I work and what I learned this week. (2018-10)
This is a great initiative that can bring a lot of value to the ecosystem if you tweak it right.
My current objections (some of which I've already expressed):
- "My working hypothesis is that the Steemit community can be trusted with their judgment". Actually not ... Steemit is a very young and not big enough community with a very skewed distribution: those who were around in mid 2016 have A LOT more power and consequently get A LOT more rewards than people who arrived a few months ago.
But did people arrive on the platform in mid 2016 BECAUSE they were great authors, or merely by chance (because they knew someone who knew someone, or happened to stumble upon Steemit around that time)?
2. My second issue is with the Flesch-Kincaid index, which favours short phrases and short words. Those are maybe easier to read, but I don't think William Faulkner, Ernest Hemingway, or the Harvard Law Review would score very high. Does that mean that their prose fails the "proof-of-brain"?
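For concreteness, here's a quick sketch (using the third-party `textstat` package; the sample sentences are invented, and I'm assuming the reading-ease variant of the score, where higher means easier) of how strongly the index rewards terse prose:

```python
# Quick illustration with the textstat package (pip install textstat).
# Flesch reading ease: higher score = "easier" text. The formula
# penalises long sentences and long words, so terse prose wins.
import textstat

simple = "I like ponies. Ponies are nice. I ride one every day."
ornate = ("Notwithstanding the interminable, meandering deliberations "
          "characteristic of institutional scholarship, the committee "
          "nevertheless promulgated its recommendations expeditiously.")

print(textstat.flesch_reading_ease(simple))  # high score (easy)
print(textstat.flesch_reading_ease(ornate))  # low or negative score (hard)
```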
Hi, yeah, I'm currently in the mode of tweaking it; that's even more intricate than the machine learning stuff :-D
Let me briefly address your points:
Regarding 1: Yes, Steem has its flaws, but currently I do not know of any better or more direct method to measure whether people like something than the votes and rewards paid out on this platform. I'm not so much worried about the judgment of the community as about the abuse of bid bots and bought rewards. However, I am currently trying to mitigate this by filtering out rewards and votes provided by bid bots like @upme and vote services such as @smartsteem.
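The gist of that filtering, as a minimal sketch (the bot list and the vote format here are simplified assumptions for illustration, not my actual code):

```python
# Sketch: strip votes cast by known bid bots / vote services before
# computing the reward signal. The account set and the vote layout
# are simplified assumptions, not the real pipeline.
KNOWN_BOTS = {"upme", "smartsteem"}  # in practice a much longer list

def organic_rshares(votes):
    """Sum vote weight (rshares) excluding known bid bots."""
    return sum(v["rshares"] for v in votes if v["voter"] not in KNOWN_BOTS)

votes = [
    {"voter": "alice", "rshares": 1200},
    {"voter": "upme", "rshares": 500000},   # bought vote, ignored
    {"voter": "bob", "rshares": 800},
]
print(organic_rshares(votes))  # 2000
```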
Regarding 2: The Machine Learning model is not linear, meaning there does not necessarily exist a relationship like `reward = x * flesch_kincaid_index`. The Flesch-Kincaid index is just one of roughly 150 dimensions that describe a post. The Machine Learning model tries to infer by itself how to make use of this index in order to predict the reward.
Let's take your example. Suppose we have a corpus with texts by William Faulkner and Ernest Hemingway, as well as texts by 11-year-old Marc, who likes ponies. The former two get a lot of reward but score low on the Flesch-Kincaid index. On the other hand, little Marc doesn't get much for his texts about his beloved ponies, yet he does achieve high scores on the index due to his rather short and simple sentences.
Accordingly, the Machine Learning model will see this data and, consequently, come up with a rule like `IF Flesch-Kincaid index IS low THEN high reward`.
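To make this concrete, here is a toy sketch (with invented numbers, not my training data) in which a tiny decision tree induces exactly this kind of threshold rule:

```python
# Toy sketch: a depth-1 decision tree induces a threshold rule of the
# form "IF index IS low THEN high reward" from made-up data.
from sklearn.tree import DecisionTreeRegressor, export_text

# One feature: a Flesch-Kincaid-style readability score per post.
X = [[20], [25], [30], [80], [85], [90]]   # low = complex prose
y = [120, 150, 110, 2, 3, 1]               # reward in SBD (invented)

tree = DecisionTreeRegressor(max_depth=1).fit(X, y)
print(export_text(tree, feature_names=["flesch_kincaid"]))
# Prints a split roughly like:
#   flesch_kincaid <= 55.0 -> value ~ 127 (high reward)
#   flesch_kincaid >  55.0 -> value ~ 2   (low reward)
```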
Hence, the value of the index itself is not proportional to the reward. There are more intricate and non-linear ways in which the index determines the payout, and all of these are learned or inferred from actual data (i.e. previous Steemit posts).
By the way, the Flesch-Kincaid index is not the only measure of readability @trufflepig looks at. The others are: the Gunning Fog index, the SMOG index, the Automated Readability Index, the Coleman-Liau index, and the first four moments of the syllable distribution, i.e. the mean, variance, skew, and kurtosis of the number of syllables in a word. Fun fact: looking at the random forest's feature importances, @trufflepig bases his decision much more on the latter raw representation of word complexity than on the carefully crafted readability indices :-D.
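For illustration, a rough sketch of both the syllable moments and the feature importances (everything here is a simplified stand-in: the syllable counter comes from `textstat`, and the data is random toy data, not @trufflepig's actual pipeline):

```python
# Sketch: (a) the four moments of the per-word syllable distribution as
# raw word-complexity features, (b) a random forest's feature_importances_
# on toy data. Simplified stand-ins, not the real bot.
import numpy as np
import textstat
from scipy.stats import kurtosis, skew
from sklearn.ensemble import RandomForestRegressor

def syllable_moments(text):
    """Mean, variance, skew, and kurtosis of syllables per word."""
    counts = np.array([textstat.syllable_count(w) for w in text.split()],
                      dtype=float)
    return [counts.mean(), counts.var(), skew(counts), kurtosis(counts)]

print(syllable_moments("The quick brown fox jumps over the lazy dog"))

# Toy regression: 5 readability indices + 4 syllable moments, with the
# (made-up) reward driven mostly by the mean syllable count (column 5).
rng = np.random.default_rng(0)
X = rng.random((500, 9))
y = 100 * (1 - X[:, 5]) + rng.normal(0, 5, 500)

names = ["flesch_kincaid", "gunning_fog", "smog", "ari", "coleman_liau",
         "syll_mean", "syll_var", "syll_skew", "syll_kurtosis"]
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
for name, imp in sorted(zip(names, forest.feature_importances_),
                        key=lambda pair: -pair[1]):
    print(f"{name:14s} {imp:.3f}")   # syll_mean should come out on top
```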
By the way, I looked at the influence of bid bots, and it's quite large: in the training set, 17% of all articles were promoted with bots. In total, users spent more than 3700 STEEM and 69000 SBD on these bots!
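Such a tally presumably boils down to summing transfers sent to known bot accounts; a minimal sketch with an invented transfer format (the real figures come from the blockchain's transfer operations):

```python
# Sketch: tally STEEM/SBD sent to known bid bots over a set of transfer
# operations. The transfer layout and bot list are illustrative only.
from collections import Counter

KNOWN_BOTS = {"upme", "smartsteem"}

def bot_spending(transfers):
    """Sum amounts per currency for transfers going to known bots."""
    totals = Counter()
    for t in transfers:
        if t["to"] in KNOWN_BOTS:
            amount, currency = t["amount"].split()  # e.g. "10.000 SBD"
            totals[currency] += float(amount)
    return totals

transfers = [
    {"to": "upme", "amount": "10.000 SBD"},
    {"to": "alice", "amount": "1.000 STEEM"},
    {"to": "smartsteem", "amount": "4.500 STEEM"},
]
print(bot_spending(transfers))  # Counter({'SBD': 10.0, 'STEEM': 4.5})
```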