I told them to implement a system to prevent reposts. I even gave them some code. But "they have a better system".
As you can see, they do not. But oh well.
The solution you recommended is a more accurate algorithm, but it's also much slower. The fact that it doesn't support indexing already makes it impractical. The second problem is the nature of our content.
1. Content is often presented differently, e.g. the same text can appear in a tweet, a tumblr post, or on a photo.
2. Details of the content matter. Take for example a tumblr post or a generated meme. The difference is either too large or too small to be considered a repost. That means either lots of false positives or lots of false negatives. Of course this is still more accurate than the implemented algorithm, but not enough to justify the drawbacks.
3. We can no longer filter or index images, so every upload has to be compared against every stored post, which is O(n). Take for example 300 posts a day * 7 days = 2100 posts. Optimistically (very), each comparison takes 0.1 seconds, so that's 210 seconds for each user upload.
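To make the arithmetic in point 3 concrete, here is a minimal sketch of the linear-scan estimate. The function name `linear_scan_seconds` is made up for illustration; the figures are the ones quoted above.

```python
def linear_scan_seconds(posts_per_day: int, days: int, seconds_per_comparison: float) -> float:
    """Estimated time to compare one new upload against every stored post."""
    stored_posts = posts_per_day * days  # 300 * 7 = 2100 posts in the window
    return stored_posts * seconds_per_comparison


estimate = linear_scan_seconds(300, 7, 0.1)
print(f"{estimate:.0f} seconds per upload")  # prints: 210 seconds per upload
```

The cost grows linearly with the size of the window, so doubling the number of days doubles the wait for each upload.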
So yes, we do have a better system. We can't afford to spend months or years developing perfect repost detection.
When you've actually tried it yourself you'll understand why there are so few good solutions. The only one I know of is TinEye. Just try searching this post's image in TinEye or Google Image Search.
Thanks for answering in a bit more detail this time.
I can agree that an O(n) algorithm for a large number of posts is not really optimal.
But you can indeed index it in some way: you can store hashes (and/or any other feature used) in the database, which already makes a lookup better than O(n), and without having to implement the indexing yourself.
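A rough sketch of the stored-hash idea, under some assumptions: images are already downscaled to a small grayscale grid, the hash is a simple average hash (one bit per pixel), and a Python dictionary stands in for an indexed database column. `average_hash` and `HashIndex` are hypothetical names, not anything from the site's code.

```python
from collections import defaultdict


def average_hash(pixels):
    """Return an integer hash with one bit per pixel, set when the pixel is above the mean.

    `pixels` is a small grayscale grid (list of rows of ints), assumed to be
    already downscaled, e.g. to 8x8.
    """
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for value in flat:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


class HashIndex:
    """Hypothetical in-memory stand-in for an indexed hash column in a database."""

    def __init__(self):
        self._posts_by_hash = defaultdict(list)

    def add(self, post_id, pixels):
        self._posts_by_hash[average_hash(pixels)].append(post_id)

    def lookup(self, pixels):
        # Dictionary lookup: average O(1), independent of how many posts are stored.
        return list(self._posts_by_hash.get(average_hash(pixels), []))


index = HashIndex()
index.add("post-1", [[10, 200], [220, 30]])

brighter = [[20, 210], [230, 40]]   # same picture, slightly brighter
inverted = [[200, 10], [30, 220]]   # different picture

print(index.lookup(brighter))  # ['post-1']
print(index.lookup(inverted))  # []
```

This only finds posts whose hash matches exactly. Catching near matches (hashes that differ in a few bits) needs a different lookup structure, so treat this as the simplest possible case.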
I can see the problem with the content: posts with the same memes or from the same platform can cause trouble.
Now I'm curious about this. I'm going to try coding something and see how it works out.
-
And anyway, aside from all of this: what is your current system? Because I have seen the exact same post, from the exact same source, get posted and make it through. Is it the "are you sure this isn't a repost" message?
Again, thanks for taking the time to answer.
Let's consider the pros and cons of using a hash plus feature detection. Use a hash with low precision so the feature detection can decide if it's a repost. With this, reposts which look different are already left out. Now feature detection would be able to pick up details in images such as text, e.g. memes.
So now we'd have slightly better accuracy with a speed penalty, e.g. ~2 seconds. But in the current system we let the user do the feature detection part, just with a higher hash precision. It would also be more work to implement feature detection.
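The two-stage idea could be sketched roughly like this. Everything here is an assumption made for illustration: the function names are hypothetical, images are small square grayscale grids, and the second stage is a plain per-pixel difference standing in for real feature detection.

```python
from collections import defaultdict


def coarse_hash(pixels, blocks=2):
    """Low-precision hash: average the grid into blocks x blocks cells, one bit per cell."""
    step = len(pixels) // blocks
    cells = []
    for block_row in range(blocks):
        for block_col in range(blocks):
            total = 0
            for r in range(block_row * step, (block_row + 1) * step):
                for c in range(block_col * step, (block_col + 1) * step):
                    total += pixels[r][c]
            cells.append(total / (step * step))
    mean = sum(cells) / len(cells)
    return tuple(cell > mean for cell in cells)


def detail_distance(a, b):
    """Stand-in for feature detection: mean absolute difference per pixel."""
    flat_a = [v for row in a for v in row]
    flat_b = [v for row in b for v in row]
    return sum(abs(x - y) for x, y in zip(flat_a, flat_b)) / len(flat_a)


def find_reposts(new_pixels, buckets, max_distance=10.0):
    """Stage 1 picks the bucket; stage 2 compares only against posts in that bucket."""
    candidates = buckets.get(coarse_hash(new_pixels), [])
    return [post_id for post_id, pixels in candidates
            if detail_distance(new_pixels, pixels) <= max_distance]


meme_a = [[200, 200, 10, 10],
          [200, 200, 10, 10],
          [10, 10, 200, 200],
          [10, 10, 200, 200]]
meme_b = [[200, 90, 10, 10],    # same layout as meme_a, different detail
          [90, 200, 10, 10],
          [10, 10, 200, 90],
          [10, 10, 90, 200]]

buckets = defaultdict(list)
for post_id, pixels in [("a", meme_a), ("b", meme_b)]:
    buckets[coarse_hash(pixels)].append((post_id, pixels))

upload = [[198, 202, 12, 10],   # meme_a with a little noise
          [200, 200, 10, 10],
          [10, 10, 200, 200],
          [10, 12, 201, 199]]

print(find_reposts(upload, buckets))  # ['a']
```

Both memes land in the same coarse bucket, and only the detail stage separates them. The `max_distance` threshold is where the false positive / false negative trade-off from earlier in the thread shows up.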
deleted
· 9 years ago
Yes, I do see your point.
And talking about TinEye and Google, their function is not only to match the image exactly but also to show the ones that resemble it. The problem we'd have here with memes is actually beneficial for them.
Thank you again for answering, and I'll probably try to code something this summer and see how it ends up, just because I'm curious about the implementation of a system like this. If it ends up being a rather good method I'll email you guys about it :)
Sassy
I've been getting bitched at for 3 days now for calling out actual reposts.
I've just been downvoting them and reporting them as reposts... Idk if that actually helps any, but I guess if we all follow their system (downvote and report reposts) and stuff still doesn't get better, we at least have a very good case for why there needs to be another system put in place.
-Rapists everywhere
Bad smbadat. Enough of the raping. You sit here and think about what you have done.
*leaves*
O.o
You think about that, Mr.
Zeus be slayin'